Decision trees

 
Decision trees are a class of predictive data mining tools that predict either a categorical or a continuous response variable. They get their name from the structure of the models they build: a series of decisions segments the data into increasingly homogeneous subgroups, a process also called recursive partitioning. Drawn out graphically, the model resembles a tree with branches.
 
[Figure: a decision tree]
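For illustration, a single partitioning step can be sketched in a few lines of Python. The toy data, the age threshold, and the use of Gini impurity as the homogeneity measure are assumptions chosen for this example, not prescribed by any particular decision tree tool.

```python
# A minimal sketch of one recursive-partitioning step, using Gini
# impurity as the homogeneity measure. Data and threshold are invented.

def gini(labels):
    """Gini impurity: 0.0 means a perfectly homogeneous group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Toy data: (age, bought_product) pairs.
data = [(22, "no"), (25, "no"), (31, "yes"), (40, "yes"), (48, "yes")]

# One split decision: partition on age < 30.
left = [label for age, label in data if age < 30]
right = [label for age, label in data if age >= 30]

print(gini([label for _, label in data]))  # impurity before the split: 0.48
print(gini(left), gini(right))             # both child groups are pure: 0.0 0.0
```

In practice, the tree-building algorithm searches over many candidate predictor variables and thresholds, keeps the split that makes the child groups most homogeneous, and then repeats the process within each child group.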
 
Several tools fall into the decision tree category, including Classification and Regression Trees (C&RT), Chi-Square Automatic Interaction Detector (CHAID), Random Forests, and Boosted Trees. Each of these tools has unique qualities while sharing the principles of decision trees. C&RT and CHAID each build a single tree, while Random Forests and Boosted Trees build many trees.
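As a rough sketch of the single-tree versus ensemble distinction, the scikit-learn library (an assumption here; the text above names no software, and CHAID has no scikit-learn implementation) offers C&RT-style trees alongside Random Forest and boosted tree ensembles:

```python
# Single tree vs. ensembles, assuming scikit-learn is available.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier().fit(X, y)            # one tree (C&RT-style)
forest = RandomForestClassifier(n_estimators=100).fit(X, y) # many independent trees
boosted = GradientBoostingClassifier(n_estimators=100).fit(X, y)  # trees built in sequence

print(len(forest.estimators_), len(boosted.estimators_))    # 100 trees each
```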
 
A decision tree is composed of nodes and splits of the data. The tree starts with all training data residing in the first (root) node. An initial split is made on a predictor variable, segmenting the data into two or more child nodes. Splits can then be made from the child nodes in turn. A terminal node is one from which no further splits are made. Predictions are based on the make-up of the terminal nodes.
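A minimal sketch of this structure, again assuming scikit-learn: fitting a shallow tree and printing it shows the root node, the splits, and the terminal nodes.

```python
# Inspecting the node/split structure of a fitted tree.
# Limiting depth keeps the printed tree small enough to read.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each split appears as a rule; lines ending in "class: ..." are
# terminal nodes, whose make-up determines the prediction.
print(export_text(tree, feature_names=load_iris().feature_names))
```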
 
To use a decision tree to make a prediction, the split decisions are followed until a terminal node is reached. This can be done manually by reviewing the tree graph. Software can also make these predictions automatically, which is practical when deploying the model on a large data set.
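Under the same scikit-learn assumption, a sketch of automated prediction: predict() follows the split decisions for every row at once, which is what makes large-scale deployment practical.

```python
# Automated prediction: the software follows the splits for each row.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

new_data = X[:5]                # stand-in for a deployment data set
print(tree.predict(new_data))   # predicted class for each row
print(tree.apply(new_data))     # the terminal node each row landed in
```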
 
Decision trees offer many advantages. One important advantage is ease of interpretation. While a tree can be complex, involving a large number of splits and nodes, users can still interpret the model. Additionally, making model predictions does not involve mathematical calculations as in General Linear Models; the predictions follow from decision rules. In classification problems, the user can specify misclassification costs. Decision trees tend to give good predictive accuracy and can allow for missing data in deployment.
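How misclassification costs are specified varies by tool. scikit-learn, for instance, expresses them through class weights rather than an explicit cost matrix; the 5-to-1 weighting below is an invented illustration, not a recommendation.

```python
# Penalizing one kind of misclassification more heavily via class weights.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Treat mistakes on class 0 (malignant) as five times as costly as
# mistakes on class 1 (benign); the tree's splits shift accordingly.
tree = DecisionTreeClassifier(class_weight={0: 5, 1: 1}).fit(X, y)
print(tree.predict(X[:5]))
```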
 
Some challenges also exist. Decision trees can be built for regression problems, but finding a good model for a continuous response is more difficult. Finding the right-sized tree may require experience and evaluation. A tree with too few splits misses out on predictive accuracy, while a tree with too many splits is unnecessarily complicated. Too many splits can mean overfitting the data, leading to poor generalization. Safeguards, such as using separate training and test data sets and cross-validation, exist to combat this issue.
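Both safeguards can be sketched together, again assuming scikit-learn: hold out a test set, then use cross-validation on the training data to choose the tree size. The candidate depths below are illustrative.

```python
# Choosing the tree size with cross-validation, judged on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set so generalization is judged on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validate over candidate tree depths on the training data only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 3, 4, 5, None]},
                      cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # tree depth chosen by cross-validation
print(search.score(X_test, y_test))  # accuracy on the held-out test set
```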