Optimal Binning for Predictive Data Mining

Overview

Optimal binning is a method of pre-processing categorical predictors with many classes in predictive data mining projects. Many of the analytic methods available in data mining become inefficient when applied to categorical predictors with thousands or tens of thousands of classes each. Such variables, however, are commonly found in many domains where data mining techniques can yield important insights.

[Figure: Histogram of SIC codes]

Typical examples of categorical predictor variables with many classes are postal zip codes or Standard Industrial Classification codes (SIC, or the newer 6-digit NAICS codes), which classify all industrial activity. There are more than 10,000 4-digit SIC codes, and manufacturers of industrial machinery or service providers to industry routinely record the SIC codes of their customers in their data warehouses so that they can use that information in their marketing. The problem, however, is that many useful analytic procedures, such as linear models and logistic regression, cannot "handle" categorical predictor variables with 10,000 classes. The design matrices constructed from such predictors become very large and are generally unusable for building linear models that are valid for prediction.
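To make the size problem concrete, the following is a minimal sketch of dummy coding in plain Python. The function names `dummy_code` and `design_matrix_cells`, the class labels, and the row count of 100,000 are assumptions chosen for illustration; they do not come from any particular software package.

```python
def dummy_code(values):
    """Dummy-code one categorical variable.

    The first level (in sorted order) serves as the reference class,
    so k distinct classes produce k - 1 indicator columns.
    """
    levels = sorted(set(values))
    columns = levels[1:]
    rows = [[1 if v == level else 0 for level in columns] for v in values]
    return columns, rows


def design_matrix_cells(n_rows, n_classes):
    """Cells needed to dummy-code a single predictor with n_classes classes."""
    return n_rows * (n_classes - 1)


# Tiny illustration with three hypothetical class labels.
columns, rows = dummy_code(["c1", "c2", "c3", "c1"])

# Hypothetical data warehouse: 100,000 customers, 10,000 distinct codes.
cells = design_matrix_cells(100_000, 10_000)
```

With these assumed numbers, a single predictor already needs 9,999 columns and close to a billion cells, and most classes would be represented by only a handful of rows.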

 

Combining Groups for Prediction

The solution to this problem is to combine the classes of such predictors (those with thousands of categories) into a much smaller set of "aggregated" groups, each consisting of many individual classes from the original categorical predictor. One way to do this is to apply a CHAID-like algorithm to find a good combination of classes with respect to a particular continuous or categorical outcome variable of interest. The general principle is fairly simple: try various combinations of classes to find the one that maximizes the relationship between the newly recoded variable and the outcome variable. Note that the final recoding (re-grouping) may not represent a global optimum (the single best recoding), but only a "good" recoding (a local optimum) that is sufficient to enable useful subsequent analyses.
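The merging idea can be sketched as follows for a binary outcome. This is a simplified illustration, not the algorithm of any specific product: it repeatedly merges the pair of groups whose outcome distributions are most similar (largest chi-square p-value) and stops once every remaining pair differs at an assumed significance level `alpha`. The function names, the class labels, the counts, and `alpha = 0.05` are all hypothetical, and refinements found in full CHAID implementations (for example Bonferroni adjustment and re-splitting of merged groups) are omitted.

```python
import math
from itertools import combinations


def chi2_2x2(a, b):
    """Pearson chi-square for two groups given as (events, non_events)."""
    e1, n1 = a
    e2, n2 = b
    total = e1 + n1 + e2 + n2
    row1, row2 = e1 + n1, e2 + n2
    col_e, col_n = e1 + e2, n1 + n2
    if 0 in (row1, row2, col_e, col_n):
        return 0.0  # degenerate table: no evidence that the groups differ
    return total * (e1 * n2 - e2 * n1) ** 2 / (row1 * row2 * col_e * col_n)


def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def pair_p_value(groups, i, j):
    return p_value_df1(chi2_2x2(groups[i][1], groups[j][1]))


def merge_classes(counts, alpha=0.05):
    """Greedily merge classes with similar outcome rates.

    counts maps each class label to (events, non_events).
    Returns a list of (frozenset_of_labels, (events, non_events)).
    """
    groups = [(frozenset([label]), tuple(c)) for label, c in counts.items()]
    while len(groups) > 1:
        # Find the most similar pair of groups.
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda ij: pair_p_value(groups, ij[0], ij[1]))
        if pair_p_value(groups, i, j) <= alpha:
            break  # every remaining pair differs significantly
        labels = groups[i][0] | groups[j][0]
        merged = (groups[i][1][0] + groups[j][1][0],
                  groups[i][1][1] + groups[j][1][1])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append((labels, merged))
    return groups


# Hypothetical counts of (responders, non-responders) per class.
counts = {"A": (90, 10), "B": (88, 12), "C": (20, 80),
          "D": (22, 78), "E": (50, 50)}
groups = merge_classes(counts)

# Recoding table: original class label -> index of its aggregated group.
recode = {label: idx for idx, (labels, _) in enumerate(groups)
          for label in labels}
```

Because each step commits to the single most similar pair, the result depends on the merge order, which is one way to see why such a recoding is a local rather than a global optimum. The pairwise search is quadratic in the number of groups, so this sketch is only practical for small examples.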