Decision Tree Pruning: Cost Complexity Method
Machine learning is a field of trade-offs. The classic issue is over-fitting versus under-fitting. Over-fitting happens when a model memorizes its training data so well that it learns the noise on top of the signal. Under-fitting is the opposite: the model is too simple to find the patterns in the data.
Over-fitting results in decision trees that are more complex than necessary, and the training error no longer provides a good estimate of how well the tree will perform on previously unseen data.
In machine learning and data mining, pruning is a technique associated with decision trees. One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks over-fitting the training data and generalizing poorly to new samples, while a small tree might not capture important structural information about the sample space. However, it is hard to tell when the tree-growing algorithm should stop, because it is impossible to know whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances, then use pruning to remove nodes that do not provide additional information. Decision trees are among the machine learning algorithms most susceptible to over-fitting, and effective pruning can reduce this likelihood.
Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set. There are many tree-pruning techniques, which differ in the measurement used to optimize performance. The one we will talk about in this blog is Cost Complexity Pruning, also known as Weakest Link Pruning. Pruning can occur in a top-down or bottom-up fashion: top-down pruning traverses nodes and trims sub-trees starting at the root, while bottom-up pruning starts at the leaf nodes. Reduced error pruning and cost complexity pruning are two popular pruning algorithms.
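As a practical aside, scikit-learn exposes cost complexity pruning through the ccp_alpha parameter of its tree estimators. Here is a minimal sketch using made-up dosage-style data (not the data from the figures below); the α value is arbitrary and only meant to show that a larger α yields a smaller tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up dosage/effectiveness data, loosely shaped like the example below.
rng = np.random.RandomState(0)
X = rng.uniform(0, 40, size=(60, 1))                      # drug dosage
y = np.where((X[:, 0] >= 14.5) & (X[:, 0] < 29), 100.0, 10.0)
y = y + rng.normal(0, 5, size=60)                         # add a little noise

# ccp_alpha = 0 (the default) grows the full tree; a larger value prunes it,
# removing the weakest links first.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=100.0).fit(X, y)

print("leaves in full tree:  ", full_tree.get_n_leaves())
print("leaves in pruned tree:", pruned_tree.get_n_leaves())
```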
Let's take an example: drug dosage on the x-axis and drug effectiveness on the y-axis. In the graph below we can see that when the dosage was too low or too high, the drug was not effective. Medium dosages were very effective, and moderately high dosages were moderately effective.
When we fit a regression tree to the training data, we get a tree in which each leaf corresponds to the average drug effectiveness of a different cluster of observations. This tree does a good job on the training data, but what happens if we test it on the testing data (the red circles)? The observations in the first and fourth clusters are pretty close to the predicted values, so their residuals, the differences between the predicted and actual values, are not very large. However, the residuals for the third cluster's observations are larger than before, and the residuals for the second cluster's observations are much larger. These observations (the second cluster) from the training data, with 100% drug effectiveness, now look a bit like outliers. That suggests we may have over-fit the regression tree to the training data.
One way to prevent over-fitting a regression tree to the training data is to remove some of the leaves and replace the split with a leaf that is the average of a larger number of observations. Now all observations with dosages between 14.5 and 29 go to the leaf on the far right. The larger residuals tell us that the new tree doesn't fit the training data as well as before, but the new sub-tree does a much better job with the testing data.
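To make the "replace a split with a single leaf" idea concrete, here is a tiny sketch. The effectiveness values are hypothetical stand-ins for the observations with dosages between 14.5 and 29, not the actual values from the figures:

```python
import numpy as np

# Hypothetical effectiveness values for training observations with dosages
# between 14.5 and 29 (illustration only, not the data behind the figures).
effectiveness = np.array([100.0, 100.0, 100.0, 100.0, 50.0, 45.0, 40.0, 35.0])

# Before pruning, these observations were split across two leaves.
# Pruning replaces that split with a single leaf whose prediction is simply
# the average of all of the observations in the region.
new_leaf_prediction = effectiveness.mean()
print(new_leaf_prediction)
```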
Note: If we want to prune the tree further, we could remove the last two leaves and replace the split with a leaf that is the average of a larger number of observations. And again, we could remove the last two leaves and replace the split with a leaf that is the average of all of the observations.
So how do we decide which tree to use? This is where cost complexity pruning comes in.
The first step in cost complexity pruning is to calculate the sum of squared residuals (SSR) for each tree. We will start with the original, full-sized tree.
The SSR for the observations with dosages < 14.5 is (25 - 14.2)² + (25 - 14.2)² + (25 - 14.2)² + (25 - 14.2)² + (20 - 14.2)² + (5 - 14.2)² = 584.84.
Likewise, the SSR for the observations with dosages >= 29 is 75.
The SSR for the observations with dosages >= 23.5 and < 29 is 148.8.
And the SSR for the observations with dosages >= 14.5 and < 23.5 is 0.
Thus the total sum of squared residuals for the whole tree is 584.84 + 75 + 148.8 + 0 = 808.64.
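These numbers are easy to verify; the snippet below simply reproduces the arithmetic above:

```python
# Reproduce the SSR arithmetic for the full-sized tree, using the numbers above.
predicted = 14.2
observations = [25, 25, 25, 25, 20, 5]                        # leaf for dosages < 14.5
ssr_leaf1 = sum((y - predicted) ** 2 for y in observations)   # 584.84

ssr_leaf2 = 0       # dosages >= 14.5 and < 23.5
ssr_leaf3 = 148.8   # dosages >= 23.5 and < 29
ssr_leaf4 = 75      # dosages >= 29

ssr_full_tree = ssr_leaf1 + ssr_leaf2 + ssr_leaf3 + ssr_leaf4
print(round(ssr_full_tree, 2))                                # 808.64
```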
Now let's calculate the SSR for the sub-tree with one fewer leaf.
The SSR for dosages < 14.5 is the same as before, 584.84, and the SSR for dosages >= 29 is still 75.
But the SSR for dosages between 14.5 and 29 is now 5,099.8. Thus the total SSR for the sub-tree with 3 leaves is 584.84 + 75 + 5,099.8 = 5,759.64.
Similarly, the SSR for the sub-tree with two leaves is 19,243.7. Lastly, the SSR for the sub-tree with only one leaf is 25,597.2.
The sum of squared residuals is relatively small for the original, full-sized tree, but each time we remove a leaf, the sum of squared residuals gets larger.
This is expected, because the whole point of pruning is that the pruned trees do not fit the training data as well as the full tree. So the next question is: how do we compare these trees? Weakest link pruning works by calculating a Tree Score that is based on the sum of squared residuals (SSR) for the tree or sub-tree plus a tree complexity penalty that is a function of the number of leaves in the tree or sub-tree. The tree complexity penalty compensates for the difference in the number of leaves.
Tree Score = SSR + αT, where T is the number of leaves in the tree or sub-tree.
α (alpha) is a tuning parameter that we find using cross-validation.
For now, let's set α = 10,000 and calculate the Tree Score for each tree.
The Tree Score for the original, full-sized tree is 40,808.64 [Tree Score = 808.64 + 10,000 * 4].
Now calculate the Tree Score for the sub-tree with one fewer leaf. The SSR for this sub-tree is 5,759.64 and it has 3 leaves, so the Tree Score = 5,759.64 + 10,000 * 3 = 35,759.64.
The Tree Score for the sub-tree with 2 leaves is 39,243.7. Lastly, the Tree Score for the sub-tree with only 1 leaf is 35,597.2.
Because α = 10,000, the tree complexity penalty for the tree with 1 leaf is 10,000, the penalty for the tree with 2 leaves is 20,000, and so on: the more leaves, the larger the penalty. After calculating the Tree Scores for all of the trees, we pick the tree with the lowest Tree Score.
But if we set α = 22,000 and calculate the Tree Scores, then we would use the sub-tree with only 1 leaf because it has the lowest Tree Score. Thus, the value of α makes a difference in our choice of sub-tree.
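Putting the Tree Score formula together with the SSR values worked out above, the scores for both values of α can be tabulated like this:

```python
# Tree Score = SSR + alpha * T, where T is the number of leaves.
# The SSR values are the ones computed above for each tree in the sequence.
ssr_by_leaves = {4: 808.64, 3: 5759.64, 2: 19243.7, 1: 25597.2}

def tree_score(ssr, n_leaves, alpha):
    return ssr + alpha * n_leaves

for alpha in (10_000, 22_000):
    print(f"alpha = {alpha:,}")
    for n_leaves, ssr in ssr_by_leaves.items():
        score = tree_score(ssr, n_leaves, alpha)
        print(f"  {n_leaves} leaves -> tree score = {score:,.2f}")
    # For a given alpha, we keep the tree with the lowest tree score.
```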
Now we will look at how to find the best value for α. First, using all of the data, build a full-sized regression tree (this full-sized tree is fit to all of the data, not just the training data). One more important thing: this full-sized tree has the lowest Tree Score when α = 0, because when α = 0 the tree complexity penalty becomes 0 and the Tree Score is just the sum of squared residuals.
So we label this tree with α = 0 to remind us that it has the lowest Tree Score when α = 0. Now we increase α until pruning leaves gives us a lower Tree Score. In this case, when α = 10,000 we get a lower Tree Score if we remove the last 2 leaves and replace the split with a single leaf.
Then we increase α again until pruning more leaves gives us a lower Tree Score. In this case, when α = 15,000 we get a lower Tree Score if we remove the next 2 leaves, and when α = 22,000 we get a lower Tree Score if we remove the last 2 leaves.
In the end, different values of α give us a sequence of trees, from the full-sized tree down to a single leaf.
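In scikit-learn, this sequence of sub-trees and the (effective) α values at which they appear can be obtained directly with cost_complexity_pruning_path. A sketch, reusing the X and y from the earlier snippet; note that scikit-learn's α values are on a normalized, per-sample scale, so they will not match the 10,000-scale numbers used in this example:

```python
from sklearn.tree import DecisionTreeRegressor

# Assumes X (dosage) and y (effectiveness) are defined as in the earlier sketch.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Each effective alpha corresponds to one sub-tree in the sequence,
# from the full-sized tree (alpha = 0) down to a single leaf.
for alpha in path.ccp_alphas:
    sub_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha = {alpha:12.4f}   leaves = {sub_tree.get_n_leaves()}")
```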
Now, remember that the full-sized tree above is different from before because it was fit to all of the data, not just the training data. Next, divide the data into training and testing data-sets and, using just the training data and the α values we found before, build a full tree and a sequence of sub-trees that minimize the Tree Score.
Now calculate the sum of squared residuals for each new tree using only the testing data. In this case, the tree built with α = 10,000 had the smallest sum of squared residuals on the testing data.
Then create new training and testing data-sets, build a new sequence of trees (from full-sized down to a single leaf) using the α values we found before, and calculate the sum of squared residuals on the new testing data. This time α = 0 had the lowest sum of squared residuals. We keep repeating this until we have done 10-fold cross-validation, and the value of α that, on average, gives us the lowest sum of squared residuals on the testing data is the final value for α.
In this case, the trees built with α = 10,000 had, on average, the lowest sum of squared residuals, so α = 10,000 is our final value. Lastly, we go back to the original trees and sub-trees made from the full data and pick the tree that corresponds to the value of α we selected (α = 10,000). That sub-tree is the final, pruned tree.
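The entire cross-validation loop can be automated as well. Here is a sketch, again assuming the X and y from the earlier snippet, that scores each candidate α from the pruning path with 10-fold cross-validation and refits the final pruned tree on all of the data:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Candidate alpha values from the pruning path (assumes X and y as before).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# 10-fold cross-validation: the alpha with the lowest average squared error
# on the held-out folds wins (neg_mean_squared_error, so higher is better).
search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": list(path.ccp_alphas)},
    scoring="neg_mean_squared_error",
    cv=10,
)
search.fit(X, y)

best_alpha = search.best_params_["ccp_alpha"]
# The final, pruned tree: rebuilt on all of the data with the chosen alpha.
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print("chosen alpha:", best_alpha, "   leaves:", final_tree.get_n_leaves())
```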