# 10 – Decision Trees

Decision trees are essentially rule-based classifiers, that try to extract “rules” to classify your data. These rules are conjunctions, which are logical AND operations. So you might infer rules such as “if it has wings, and it has a beak, then it’s a bird”. Decision trees are very human-interpretable, because the rules can be understood in terms of the original features.

The most basic decision tree algorithm is ID3 (Iterative Dichotomiser 3). ID3 assumes that all features are categorical–that is, each feature value can take one of certain “categories”. The temperature is either hot, mild, or cold; cars move fast or slow, etc. First, let’s talk a little bit about information theory.

## Information Theory

C. E. Shannon was the first to quantify information and talk about how much information is produced by a process, in his stunning 1948 paper, “A Mathematical Theory of Communication”, that has been cited over 60K times. I strongly urge you to read page 11, where he defines entropy. Entropy is a measure of chaos in a system, or the amount of “surprise”, in some sense. We’d like to quantify this amount of chaos, such that it has some properties:

1. It is a continuous-valued function.
2. If all the events in the system are equally likely, then the entropy is maximum, because we have no means of saying what event will occur.

There are other properties as well, but these two are important, and in page 28, Shannon proves that the only function that satisfies all the properties is $H = -\sum\limits_{i=1}^n p_i \log_2 p_i$

where $p_i$ is the probability of the $i$th event. There is another way of looking at entropy. Suppose you have $n$ events in a system, each with some probability of occurring. Now for any one particular event, let’s think up a function that satisfies the following properties:

1. If the event has probability 0, then it would be very surprising if it did occur, so the “surprise” is infinity.
2. If the event has probability 1 and it occurs, it’s not surprising at all, so the “surprise” is 0.
3. If two events $E_i$ and $E_j$ have probabilities $p_i$ and $p_j$, and if $p_i > p_j$, then it would be quite surprising if $E_j$ occurred instead of $E_i$.

The function satisfying these properties is $-\log p$ for a single event, and for a system with many events, the total chaos is the sum of all of these functions, weighted by the probability of the events.

How is this useful for decision trees? In a decision tree, at each node, you split the data based on some attribute. At the root node of the decision tree, you have the entire dataset. Say you have two classes, and that in the data, 50% of the rows are from each class. Then the probability of each class is 0.5, and we can compute the entropy–this turns out to be 1. Thus, the data is highly impure or chaotic. We then try to find a splitting condition–an attribute, such that if you split the dataset by that attribute, the entropy of the system reduces. Graphically, here’s what we’re doing (credits: this blog post) Clearly, the subsets are much “cleaner” in some sense, than the original dataset. But if you have multiple attributes, how do you decide which attribute splits the data the cleanest? That’s where a metric called the information gain comes into play. Let’s compute the information gain for the above figure.

We begin by computing the entropies of each set. The “parent” node for both the “child” nodes has seven positive (plus) examples and six negative (circle) examples. The entropy of the “parent” node is $H_{parent} = -\left( \frac{6}{13}\log_2 \frac{6}{13} + \frac{7}{13}\log_2 \frac{7}{13} \right) = 0.9957$

Similarly, the entropy of the children 0.9852 and 0.9183. Notice that these are both less than the parent entropy. To compute the information gain, first calculate the weighted sum of the child entropies, where the weights are the fraction of examples in that subset. Then, subtract this from the parent entropy. Mathematically, $\Delta = H_{parent} - \sum_i \frac{N(V_i)}{N}H_i$

where $N$ is the total number of examples, and $N(V_i)$ represents the number of examples in the $i$th partition. In our case, the information gain is $\Delta = 0.9957 - \left( \frac{7}{13} \times 0.9852 + \frac{6}{13} \times 0.9183 \right) = 0.04137$

which isn’t really all that much. So to find the splitting condition, you simply pick the attribute that gives you the highest information gain. You continue splitting the subsets until either you have no attributes left to split on, or the entropy of that subset is 0 (all the examples in that subset belong to the same class).

That’s really all there is to the ID3 algorithm. It’s simple and straightforward, but it does have some limitations. In particular, the basic ID3 algorithm works only on categorical variables. Also, ID3 works best when there is a good balance between each class (that is, there isn’t much skew towards one class). Finally, ID3 does not handle noise in data well at all, and may end up producing extremely complicated trees, which aren’t necessary, and are also difficult to interpret.

There are ways to solve the above problems. Handling continuous-valued attributes can be done by searching for a value that splits the data with high information gain. The C4.5 and CART (Classification and Regression Tree) decision tree algorithms do this. Techniques such as bootstrapping can help to reduce the imbalance between classes in your data. And finally, you can reduce the complexity of the trees that you get by pruning them–that is, beyond some chosen depth, you choose not to split the dataset any further, which sacrifices some accuracy to avoid overfitting and to make the tree more human-interpretable. C4.5 and CART do this as well, and the DecisionTreeClassifier class in sklearn implements the CART algorithm.