Categorical variables can be very useful in finding interesting patterns and relationships in data. They are often great for showing trends in graphs and for building predictive models. These types of variables can also introduce a unique set of problems. Several possible solutions exist within STATISTICA for addressing these issues. I will explore these over the next few weeks.
When using categorical variables, it is good to ask the questions:
- Does the variable have too many levels?
- Does each level have sufficient data to be useful?
How many levels exactly is too many? And what is sufficient data for each level?
These are good questions. I’m glad you asked! As any great philosopher would, I will answer your questions with more questions. Do you expect to find distinctions between all of these levels? Could this information be consolidated into a more manageable set of levels without great loss of information? How many observations (people, widgets, transactions, etc.) are being used to define each level? Does that seem like enough? Are you looking to build models that are simple and easy to interpret?
In the health care industry, a variable with over 1000 distinct procedure codes could boil down to a handful of procedure types. Using the variable with only a few levels would make interpreting the results far easier and likely won’t hurt the analysis much, if at all. Processing times would improve significantly as well. Think of how many patients would be needed to get sufficient records on each of those procedure codes. A common procedure will easily have ample data, while rare ones may have only 1 or 2, maybe 10 observations. Is that enough to draw conclusions from in a database of tens or hundreds of thousands of records, or even millions?
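One way to put numbers on the "is that enough?" question is to count the observations behind each level. The sketch below is only an illustration in Python, not a STATISTICA feature: the function name `audit_levels`, the `min_count` threshold of 10, and the procedure codes are all made up for this example.

```python
from collections import Counter


def audit_levels(values, min_count=10):
    """Summarize how many levels of a categorical variable are sparse.

    min_count is a hypothetical threshold; pick one that suits your data.
    """
    counts = Counter(values)
    sparse = {level: n for level, n in counts.items() if n < min_count}
    return {
        "n_levels": len(counts),
        "n_sparse_levels": len(sparse),
        "share_of_records_in_sparse_levels": sum(sparse.values()) / len(values),
    }


# Made-up procedure codes: one common code and a few rare ones.
codes = ["P001"] * 95 + ["P002"] * 3 + ["P003"] + ["P004"]
summary = audit_levels(codes, min_count=10)
print(summary)
```

In this toy data, three of the four levels are sparse, yet together they cover only 5% of the records. That pattern is a hint that consolidation will cost very little information.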
Even a variable like state in the US may benefit from consolidation. The US has 50 states, the District of Columbia and a few US territories. This group of just over 50 categories may be better represented by a new variable that splits the nation into regional subgroups.
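When the grouping comes from domain knowledge, as with states and regions, a plain lookup table may be all that is needed. Here is a minimal Python sketch. The lookup is deliberately partial and uses the four US Census Bureau regions as one possible regional split; the names `STATE_TO_REGION` and `to_region`, and the "Unmapped" default, are assumptions made for this example.

```python
# Partial lookup for illustration only; a real one would cover every state,
# DC, and the territories. Regions follow the US Census Bureau's four regions.
STATE_TO_REGION = {
    "NY": "Northeast", "MA": "Northeast",
    "IL": "Midwest", "OH": "Midwest",
    "TX": "South", "FL": "South",
    "CA": "West", "WA": "West",
}


def to_region(state, default="Unmapped"):
    """Return the region for a state code, or a default for unknown codes."""
    return STATE_TO_REGION.get(state.strip().upper(), default)


states = ["ny", "TX", "CA", "OH", "PR"]
regions = [to_region(s) for s in states]
print(regions)
```

Sending unknown codes to an explicit default keeps them visible, so you can decide later whether territories deserve their own group.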
How are we going to create these new groupings? It’s easy, right?
STATISTICA Data Miner offers a tool called Optimal Binning, which is easy and automatic. Click here to follow the example of combining grouping codes using the Optimal Binning tool. The original data contain Standard Industrial Classification (SIC) codes.
The data, plotted here, have 478 distinct categories. From the histogram, we can see that many of these have a low frequency. Some categories have only 1 observation. Clearly, some categories need to be combined.
Why is this a problem?
Imagine that these data are split into training and validation samples. Because many of these code levels are so sparse, the validation sample will likely contain SIC codes that the model never saw in training. Additionally, in most model-building procedures, a categorical predictor variable with m levels translates to m-1 dummy variables. In our SIC codes example, each code with one or only a few observations still adds a variable to the analysis. That variable is populated with 99% or more zeros and very few ones. It contributes nothing to the analysis and only adds unnecessary baggage.
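Both problems are easy to see in a tiny example. The Python sketch below is an illustration under stated assumptions, not how any particular package codes its variables: `dummy_code` is a hypothetical helper that treats the first level as the reference, and the SIC-like codes are made up.

```python
def dummy_code(values, levels):
    """Encode values as m-1 indicator columns; levels[0] is the reference.

    A value not found in `levels` becomes all zeros, so it is silently
    treated like the reference level.
    """
    columns = levels[1:]
    return [[1 if v == level else 0 for level in columns] for v in values]


# Made-up SIC-like codes: "7372" appears only in the validation sample.
train = ["2011", "2011", "5812", "5812", "6021"]
valid = ["2011", "5812", "7372"]

levels = sorted(set(train))  # m = 3 levels seen in training
train_rows = dummy_code(train, levels)
valid_rows = dummy_code(valid, levels)
unseen = sorted(set(valid) - set(levels))

print(len(train_rows[0]))  # number of dummy columns, m - 1
print(unseen)              # codes the model never saw
print(valid_rows[-1])      # how the unseen code ends up encoded
```

Three training levels give two dummy columns, and the unseen code is encoded exactly like the reference level. With 478 levels the same logic yields 477 columns, most of them nearly all zeros.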
Optimal Binning - The Results
After the Optimal Binning procedure, a new categorical variable is created with 6 levels. Many groups were combined, based on their relationship with the target variable, to make this variable more efficient. Here is a look at the coded variable:
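To give a feel for what "combined based on their relationship with the target" can mean, here is a deliberately simple Python sketch. It is a stand-in, not STATISTICA's Optimal Binning algorithm: it ranks levels by their average target value and cuts the ranking into groups holding roughly equal numbers of levels. The function name `bin_levels_by_target_rate`, the `n_bins` parameter, and the data are all assumptions made for this example.

```python
from collections import defaultdict


def bin_levels_by_target_rate(levels, targets, n_bins=3):
    """Group category levels into n_bins groups with similar target rates.

    Levels are ranked by their mean target value and cut into groups of
    roughly equal numbers of levels. This is a simple stand-in, not
    STATISTICA's algorithm.
    """
    totals = defaultdict(lambda: [0, 0])  # level -> [sum of target, count]
    for level, y in zip(levels, targets):
        totals[level][0] += y
        totals[level][1] += 1

    # Sort by rate, breaking ties by level name so the result is deterministic.
    ranked = sorted(totals, key=lambda lv: (totals[lv][0] / totals[lv][1], lv))
    return {lv: i * n_bins // len(ranked) for i, lv in enumerate(ranked)}


# Made-up data: six levels and a binary target.
levels = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"]
targets = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]

mapping = bin_levels_by_target_rate(levels, targets, n_bins=3)
binned = [mapping[lv] for lv in levels]
print(mapping)
```

Here the six levels collapse into three groups. Keep in mind that a rate computed from one or two observations is very noisy, so a sketch like this should be paired with a minimum count per level before the rates are trusted.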
In my next few posts, I will talk about some other options for consolidating categories. Stay tuned!