In the final installment of this discussion of issues in categorical variables, I will talk about one more possible solution to the issue of sparsely populated classes, created by too many levels of the categorical variable or rare classes. For a more detailed description of the issue and other possible solutions, see Part I and Part II.
These previous articles showed Optimal Binning, a tool in STATISTICA that strategically groups class levels for optimal predictive performance, and Recode Outliers, which identifies and recodes any class with fewer than a user-defined percentage of the data’s cases. My final suggestion is a manual recode of classes, combining groups in a way that makes logical sense. Each of these solutions has merit and its place, and I have likely left out a valid alternative as well. Selecting the best option should be done based on several factors such as:
Manually recoding categorical levels in a systematic fashion is more time-consuming than using the Optimal Binning or Recode Outliers tools that were previously discussed. The results, however, are more interpretable, because the groups that come out of a manual recode were crafted by logic.
Here is a comparison of the three proposed strategies in terms of interpretability and ease of implementation.
While optimal binning and recode outliers are fast and easy, the results may not be as user friendly in a means plot. The manual recode is the best option for interpretability, but the option requiring the most time and effort. The good news is: we have the time for it today.
Looking back at the Census.sta example data set used previously, we have a variable measuring the highest level of education obtained. It is called AHGA and is the 5th variable in the spreadsheet. A histogram of the raw data shows several groups with ample observations such as children, high school graduate, some college but no degree and college graduate (BA, AB, BS). The remaining categories have relatively small frequencies.
Because this variable is somewhat ordinal, finding a logical grouping strategy is pretty easy. For example, combining all education levels that are less than a high school diploma seems logical. Completed Associates and Bachelor degrees can logically be grouped as can advanced and professional degrees.
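For readers who like to see the idea outside of a dialog box, here is a minimal Python sketch of the grouping strategy just described. It is only an illustration: the label spellings in `GROUPS` and the helper name `recode_level` are my assumptions, not the exact text labels stored in Census.sta.

```python
# Illustrative grouping of education levels. The label spellings below are
# assumptions for this sketch, not the exact text labels in Census.sta.
GROUPS = {
    "No high school diploma": ["9th grade", "10th grade", "11th grade"],
    "Associates or Bachelors degree": ["Associates degree", "Bachelors degree"],
    "Advanced or professional degree": [
        "Masters degree",
        "Doctorate degree",
        "Professional degree",
    ],
}

# Invert the table so each original level points to its combined group.
LEVEL_TO_GROUP = {
    level: group for group, levels in GROUPS.items() for level in levels
}


def recode_level(level):
    """Return the combined group, or the level itself if it is left unchanged."""
    return LEVEL_TO_GROUP.get(level, level)


sample = ["10th grade", "High school graduate", "Masters degree"]
recoded = [recode_level(level) for level in sample]
print(recoded)
```

Levels that are not listed in any group, such as High school graduate, pass through unchanged, which matches the plan of leaving well-populated categories alone.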
So I have devised a manual recode strategy. How can it be implemented? I will use the recode tool to do it. The recode tool can be used on the existing data. However, I am a firm believer in data integrity: I never change data; I only add to it.
On the Data tab, select Add from the Variables drop-down menu. On the Add Variables dialog, make the following selections: How many = 1, After = AHGA, and Name = AHGA Recoded. Click OK to add the blank variable to the spreadsheet.
To make the recode easier, next create the text label summary for the AHGA variable. On the Data tab, select Specs in the Variables field. On the Variable 5 dialog, select Text Labels to open the Text Labels Editor [AHGA] dialog. Select Output to spreadsheet to create the summary.
This will make the recode easier. I can use the numeric codes instead of the text labels, which are pretty long in some cases.
Highlight the newly created variable, AHGA Recoded. On the Data tab, select Recode to open the Recode Values of Variable 6: AHGA Recoded dialog box. Here we will enter the manual recode strategy. The first two categories, High school graduate and Some college but no degree (codes 101 and 102, respectively, in the text label summary), will remain unchanged because they have ample observations. For category 3, the criterion is AHGA=103 and New Value 3 is No high school diploma. The code 103 represents the education level 10th grade, so this category is recoded to the new group, No high school diploma.
Following this same convention, fill in the recode dialog to manually recode all categories of AHGA. Some categories will not be changed such as the first 2, High school graduate and Some college but no degree.
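The same recode logic can be sketched in a few lines of Python. This is not what STATISTICA runs internally; it is just a rough equivalent of the dialog entries. Codes 101 through 103 come from the text label summary described above, while the function name `add_recoded` and the toy rows are made up for the illustration. The remaining codes would be filled in from the summary in the same way.

```python
# Codes 101-103 are the ones named in the text label summary; the rest of
# both tables would be filled in from that summary.
CODE_LABELS = {
    101: "High school graduate",
    102: "Some college but no degree",
    103: "10th grade",
}

# Recode rules: criterion code -> new value. Codes without a rule keep
# their original label.
RECODE_RULES = {103: "No high school diploma"}


def add_recoded(rows):
    """Return copies of the rows with an added 'AHGA Recoded' field.

    The original AHGA value is never modified; we only add to the data.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        code = row["AHGA"]
        new_row["AHGA Recoded"] = RECODE_RULES.get(code, CODE_LABELS[code])
        result.append(new_row)
    return result


# Toy rows for illustration only, not real census records.
rows = [{"AHGA": 101}, {"AHGA": 103}, {"AHGA": 102}]
recoded_rows = add_recoded(rows)
for row in recoded_rows:
    print(row)
```

Writing the result to a new field mirrors the choice to add the AHGA Recoded variable rather than overwrite AHGA.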
After all categories of the original AHGA variable have been recoded, create a histogram of the recoded variable.
The resulting categories are logically created and they each have ample data so that they can be useful in analysis.
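If you prefer a numeric check to eyeballing the histogram, a small Python sketch like the one below can flag any class that still falls under a chosen share of the cases. The function name `sparse_classes`, the `min_share` threshold, and the toy values are all assumptions for illustration; the right threshold depends on your data and analysis.

```python
from collections import Counter


def sparse_classes(values, min_share=0.05):
    """Return the sorted levels whose share of all cases is below min_share."""
    counts = Counter(values)
    total = sum(counts.values())
    return sorted(
        level for level, n in counts.items() if n / total < min_share
    )


# Toy data for illustration only, not real census counts.
before = (
    ["High school graduate"] * 10
    + ["Some college but no degree"] * 7
    + ["9th grade"] * 2
    + ["10th grade"] * 1
)

# Combine the two small classes, as in the manual recode.
small = {"9th grade", "10th grade"}
after = ["No high school diploma" if v in small else v for v in before]

print(sparse_classes(before, min_share=0.12))
print(sparse_classes(after, min_share=0.12))
```

In this toy example the two small grade-level classes are flagged before the recode, and nothing is flagged once they are combined.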
Sometimes we all make things harder than they have to be, hence the title of my blog, Round Hole, Square Peg. My dad uses the acronym KISS: Keep it simple, stupid. The simple fact is that once we start down a path, it's hard to turn around and try a new one. But at times, that is exactly what we need to do to make things easier. This is as true in life as it is in data analysis.