Decision Tree
A structure based on sequential decision process is a decision tree. As tree grows from root to leaf. Decision tree also starts from root, a feature is evaluated and one of the two branches are selected. This procedure is repeated until a final leaf is reached which normally represents the target we are looking for. It influenced a wide area of machine learning, covering both classification and regression.
A decision tree is a flowchart-like structure in which internal node represents a “test”. for ex. A person has heart rate greater than 150 bpm if yes, then he need to see a doctor, else no worry. In general, a decision tree ask a question and then classifies it based on the answer.
NOTE : classification can be categories or numeric.
Moving forward with same example, this time we will start with person’s age
Some important points keep in mind while drawing decision tree
1) Threshold for each node can be different (heart rate > 150 bpm and heart rate > 120 bpm).
2) Order of the node doesn’t have to be same.
3)Final classification (leaf) of the tree can be repeated.
Using decision tree our goal is to reduce impurity in the least number of splits to have a very short decision path between the ‘root’ and ‘leaf’.
In classification , decision tree start splitting each feature in training data. Now our goal is to decide which feature should be first in our decision tree and for that we measure compare impurity. There are number of ways to measure impurity. Calculating impurity using Gini impurity is easy and widely used.
In regression, instead of considering an impurity measure one of the most common choice is to pick the feature that minimizes the mean square error (MSE), considering the average prediction of node.
This is a simple example considered here but generally data set have large set of features, therefore large split results in huge tree. Such trees are complex and can lead to over-fitting. This situation can be avoid by setting a minimum number of training input to use in each leaf or maximum depth refers to the length of the longest path from a root to a leaf. In other words, the growth of the tree proceeds until one of the following conditions is verified :
- All nodes are pure.
- The information gain is null.
- The maximum depth has been reached.