Categorical variables can be very useful in finding interesting patterns and relationships in data. They are often great for showing trends in graphs and for building predictive models. These types of variables can also introduce a unique set of problems. Several possible solutions exist within STATISTICA for addressing these issues. I will explore these over the next few weeks.
When using categorical variables, it is good to ask the questions:
These are good questions. I’m glad you asked! As any great philosopher would, I will answer your questions with more questions. Do you expect to find distinctions between all of these levels? Could this information be consolidated into a more manageable set of variables without great loss of information? How many observations (people, widgets, transactions, etc.) are being used to define this class? Does that seem like enough? Are you looking to build models that are simple and easy to interpret?
In the health care industry, a variable with over 1000 distinct procedure codes could boil down to a handful of procedure types. Using the variable with only a few levels would make interpreting the results far easier and likely won’t hurt the analysis much, if at all. Processing times would improve significantly as well. Think of how many patients would be needed to get sufficient records on each of those procedure codes. A common procedure will easily have ample data, while rare ones may have only 1 or 2, maybe 10 observations. Is that enough to draw conclusions from in a database of tens or hundreds of thousands, or even millions, of records?
Even a variable like state in the US may benefit from consolidation. The US has 50 states, the District of Columbia and a few US territories. This group of just over 50 categories may be better represented by a new variable that splits the nation into regional subgroups.
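As a rough illustration of that kind of consolidation, here is a minimal Python sketch that maps state codes to regions. The mapping is deliberately partial, and the region names, the to_region helper, and the "Other" fallback are my own assumptions for the example, not anything built into STATISTICA.

```python
# Hypothetical, partial lookup from state abbreviation to a coarse region.
# The region names and the choice of states are illustrative assumptions.
STATE_TO_REGION = {
    "ME": "Northeast", "NY": "Northeast", "PA": "Northeast",
    "OH": "Midwest", "IL": "Midwest", "NE": "Midwest",
    "TX": "South", "FL": "South", "GA": "South",
    "CA": "West", "WA": "West", "CO": "West",
}


def to_region(state, default="Other"):
    """Return the region for a state code, or `default` if it is unmapped."""
    return STATE_TO_REGION.get(state.strip().upper(), default)


states = ["ny", "TX", "CA", "PR", "OH"]
regions = [to_region(s) for s in states]
print(regions)
```

Anything missing from the lookup, such as a territory, falls into the fallback level instead of getting its own category.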
STATISTICA Data Miner offers a tool called Optimal Binning, which is easy and automatic. The example here combines grouping codes using the Optimal Binning tool. The original data contains Standard Industrial Classification (SIC) codes and has 478 distinct categories. A histogram of the codes shows that many of these have a small frequency. Some categories have only 1 observation. Certainly some categories need to be combined.
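Outside of STATISTICA, the same first check (how many observations does each level have?) takes only a few lines. This is a minimal sketch: the codes are made up, and the min_count threshold and the "OTHER" label are arbitrary assumptions. It is a simple frequency cutoff, not the Optimal Binning tool.

```python
from collections import Counter


def collapse_rare_levels(values, min_count=5, other_label="OTHER"):
    """Replace every level seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Made-up codes: two common levels and two sparse ones.
codes = ["5812"] * 6 + ["7372"] * 5 + ["0111"] + ["1040"] * 2
collapsed = collapse_rare_levels(codes, min_count=5)

print(Counter(codes))      # frequencies before collapsing
print(Counter(collapsed))  # frequencies after collapsing
```

A frequency cutoff ignores the target variable entirely, so it is a blunt instrument, but it quickly shows how much of the variable is made up of sparse levels.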
Imagine that this data is split into training and validation samples. Because many of these code levels are so sparse, the validation sample will likely contain SIC codes that the model never saw in training. Additionally, in most model-building procedures, a categorical predictor variable with m levels translates to m-1 dummy variables. In our SIC codes example, a code with one or only a few observations still adds a variable to the analysis. That variable is populated with 99% or more zeros and very few ones. It contributes nothing to the analysis and only adds unnecessary baggage.
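Both problems are easy to see on a toy example. The sketch below is plain Python with made-up level names; the dummy_code helper is hypothetical and treats the first level as the reference category, which is one common convention rather than the only one.

```python
def dummy_code(values, levels):
    """Encode values as m-1 indicator columns; levels[0] is the reference level."""
    columns = levels[1:]
    return [[1 if v == level else 0 for level in columns] for v in values]


train = ["A", "A", "B", "A", "C", "A", "B", "A"]
valid = ["A", "B", "D"]

levels = sorted(set(train))   # m = 3 levels seen in training
X = dummy_code(train, levels) # m - 1 = 2 dummy columns

# Share of ones in each dummy column: sparse levels give mostly-zero columns.
n = len(train)
share_of_ones = {
    level: sum(row[i] for row in X) / n
    for i, level in enumerate(levels[1:])
}

# Levels in the validation sample that the training sample never saw.
unseen = sorted(set(valid) - set(train))

print(share_of_ones)
print(unseen)
```

With hundreds of levels, most of the dummy columns look like the one for "C" here, only far sparser, and the list of unseen levels grows.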
After the Optimal Binning procedure, a new categorical variable is created with 6 levels. Many groups were combined, based on their relationship with the target variable, to make this variable more efficient. Here is a look at the coded variable:
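To give a feel for what "combined based on their relationship with the target" can mean, here is a rough stand-in in Python. To be clear, this is not STATISTICA's Optimal Binning algorithm, whose internals I am not describing here. It assumes a 0/1 target, sorts the categories by their average target value, and cuts the sorted list into a fixed number of groups. The function name, the data, and n_groups are all made up for the example.

```python
from collections import defaultdict


def group_by_target_rate(categories, targets, n_groups=3):
    """Group categories whose mean target values are close together.

    Returns (mapping, rates): category -> group index, and category -> mean target.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [sum of target, count]
    for c, t in zip(categories, targets):
        totals[c][0] += t
        totals[c][1] += 1
    rates = {c: s / n for c, (s, n) in totals.items()}

    # Order categories by target rate (ties broken by name), then assign
    # consecutive runs of that ordering to groups 0 .. n_groups - 1.
    ordered = sorted(rates, key=lambda c: (rates[c], c))
    mapping = {c: i * n_groups // len(ordered) for i, c in enumerate(ordered)}
    return mapping, rates


cats = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"]
y = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1]

mapping, rates = group_by_target_rate(cats, y, n_groups=3)
print(mapping)
```

A real binning tool would also weigh how many observations sit behind each rate, so treat this only as a picture of the idea.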
In my next few posts, I will talk about some other options for consolidating categories. Stay tuned!
Part II
Part III
Sometimes we all make things harder than they have to be, hence the title of my blog, Round Hole, Square Peg. My dad uses the acronym KISS: Keep it simple, stupid. The simple fact is that once we start down a path, it’s hard to turn around and try a new one. But at times, that is exactly what we need to do to make things easier. This is as true in life as it is in data analysis.