Table of Contents

  • 1. Decision Trees
  • 2. Impurities
  • 3. Prediction
  • 4. Interpretation

1. Decision Trees

Consider a dataset $X = \{(x_i, y_i)\}^N_{i=1}$ for binary classification, where $x_i \in \mathbb{R}^n$ and $y_i \in \{0; 1\}$.

A random forest model is expressed as an ensemble of $K$ decision trees:

\begin{gather*} \mathcal{T}_{RF} = \{T_1, T_2, \ldots , T_K \} \end{gather*}

Where each decision tree $T_k$ is defined as follows:

  • Consider the bootstrapped dataset for tree $T_k$:
\begin{gather*} X_k^{(b)} = \{ (x_i, y_i)^{(b)} \}^N_{i=1} \qquad (x_i, y_i)^{(b)} \sim \text{with replacement from } X \end{gather*}

This process is also known as bagging (Bootstrap AGGregatING). The probability that a given sample is never drawn in $N$ draws is $(1 - 1/N)^N \to e^{-1} \approx 0.368$, so approximately 63.2% of the original samples appear in each bootstrapped dataset, while the remaining 36.8% are left out (out-of-bag samples).
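The bootstrap step can be sketched in Python (the function name and dataset size are illustrative):

```python
import random

def bootstrap_sample(dataset, rng=random):
    """Draw N samples uniformly with replacement from a dataset of size N."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

# Expected fraction of distinct original samples in one bootstrap:
# 1 - (1 - 1/N)^N, which approaches 1 - 1/e ≈ 0.632 as N grows.
random.seed(0)
data = list(range(10_000))
sample = bootstrap_sample(data)
unique_fraction = len(set(sample)) / len(data)
```

For a dataset this size, `unique_fraction` concentrates tightly around 0.632, matching the out-of-bag figure above.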

  • If we denote the dataset at an arbitrary node of tree $T_k$ by $X_{\text{Node}}$, then each node recursively splits it into two subsets:
\begin{align} & X_L = \{(x_i, y_i) \mid x_{i, j} \leq t\} \quad : \text{ Left split} \\[5pt] & X_R = \{(x_i, y_i) \mid x_{i, j} > t\} \quad : \text{ Right split} \end{align}

Where:

  • $j$ is a feature index drawn from a random subset $\{j_1, \ldots, j_m\} \subseteq \{1, \ldots, n\}$ of $m = \left\lfloor \log_2 n \right\rfloor$ distinct features, i.e. $j_k \neq j_l$ for all $k \neq l$. Restricting each split to $\left\lfloor \log_2 n \right\rfloor$ candidate features reduces overfitting by shrinking the searched feature space and decorrelating the trees.
  • $t$ is a threshold whose bounds depend on the range of feature $j$.
  • The splits are applied recursively until a stopping criterion is met (e.g. max_depth, min_samples_leaf, etc.).
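The split at a single node, given a feature index $j$ and threshold $t$, can be sketched as follows (the function name is illustrative; samples are stored as `(x, y)` tuples):

```python
def split_node(X_node, j, t):
    """Partition samples (x, y) into X_L (x[j] <= t) and X_R (x[j] > t)."""
    X_L = [(x, y) for x, y in X_node if x[j] <= t]
    X_R = [(x, y) for x, y in X_node if x[j] > t]
    return X_L, X_R
```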

2. Impurities

The goal of each split at a node is to minimize the impurity $I$. Here are some common impurity measures:

  • Gini:
I(X) = 1 - \sum_{c \in \{0; 1\}} \left(\frac{|X_c|}{|X|}\right)^2
  • Entropy:
I(X) = - \sum_{c \in \{0; 1\}} \frac{|X_c|}{|X|} \log \frac{|X_c|}{|X|}

Where $X_c$ denotes the samples of $X$ that belong to class $c$.
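Both measures translate directly from the formulas above (a sketch assuming labels are 0/1 integers and samples are `(x, y)` tuples):

```python
import math

def gini(X):
    """Gini impurity: 1 - sum over c in {0, 1} of (|X_c| / |X|)^2."""
    if not X:
        return 0.0
    p1 = sum(y for _, y in X) / len(X)  # |X_1| / |X|
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def entropy(X):
    """Entropy: -sum of (|X_c| / |X|) * log2(|X_c| / |X|); 0 * log 0 := 0."""
    if not X:
        return 0.0
    p1 = sum(y for _, y in X) / len(X)
    return -sum(p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0.0)
```

A perfectly mixed node has the maximum impurity (0.5 for Gini, 1.0 for entropy in bits); a pure node has impurity 0.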

Thus we have the objective function:

\begin{gather*} (\hat{j}, \hat{t}\,) = \arg\min\limits_{j, t} \left( \frac{|X_L|}{|X_{\text{Node}}|} I(X_L) + \frac{|X_R|}{|X_{\text{Node}}|} I(X_R) \right) \end{gather*}
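A minimal exhaustive search for $(\hat{j}, \hat{t}\,)$, assuming Gini impurity and samples stored as `(x, y)` tuples (names are illustrative):

```python
def gini(X):
    if not X:
        return 0.0
    p1 = sum(y for _, y in X) / len(X)
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def best_split(X_node, feature_indices):
    """Search over (j, t) minimizing the weighted child impurity
    |X_L|/|X_Node| * I(X_L) + |X_R|/|X_Node| * I(X_R)."""
    n = len(X_node)
    best_j, best_t, best_score = None, None, float("inf")
    for j in feature_indices:
        # Candidate thresholds: unique feature values, dropping the maximum
        # so the right split is never empty.
        for t in sorted({x[j] for x, _ in X_node})[:-1]:
            X_L = [(x, y) for x, y in X_node if x[j] <= t]
            X_R = [(x, y) for x, y in X_node if x[j] > t]
            score = len(X_L) / n * gini(X_L) + len(X_R) / n * gini(X_R)
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t, best_score
```

Real implementations use smarter threshold enumeration (e.g. sorting once per feature), but the objective is the same weighted sum shown above.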

3. Prediction

Each tree $T_k$ produces a label $T_k(x_q)$ for a query data point $x_q$, based on the splits and thresholds learned during training. Specifically, the prediction is made by traversing the tree from the root to a leaf node:

  • Starting at the root node, we examine the feature at the decision node (indexed by $j$) and apply the threshold $t$.
  • If the feature value of $x_q$ (i.e., $x_{q,j}$) satisfies $x_{q,j} \leq t$, we move to the left child node; otherwise to the right, repeating the decision process.
  • This recursive process continues until we reach a leaf node, which contains a class label (either 0 or 1).
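The traversal can be sketched with a hypothetical tuple-based tree representation (not how any particular library stores trees):

```python
def predict_tree(node, x_q):
    """Traverse from the root to a leaf. An internal node is a tuple
    (j, t, left, right); a leaf is a bare class label (0 or 1)."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x_q[j] <= t else right
    return node

# A hypothetical learned tree: split on feature 0 at 1.5,
# then (on the left branch) on feature 1 at 0.5.
tree = (0, 1.5, (1, 0.5, 0, 1), 1)
```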

The final prediction for the query point xqx_q is determined by aggregating the outputs of all trees in the forest. Specifically, the majority vote (mode) across the trees’ predictions is used to make the final classification:

\begin{gather*} y_q = \text{mode}(\{ T_1 (x_q), T_2 (x_q), \ldots , T_K (x_q) \}) \end{gather*}
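The vote itself, given the $K$ tree outputs, is a short reduction with `collections.Counter` (using an odd $K$ avoids ties in binary classification):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate the K tree outputs T_k(x_q) by mode (most common label)."""
    return Counter(predictions).most_common(1)[0][0]
```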

4. Interpretation

Random Forest models can be challenging to interpret due to their ensemble nature. However, there are several tools and techniques that can help interpret and explain the model, including:

  • Feature Importance: One of the main advantages of Random Forests is their ability to provide feature importance scores. These scores reflect how important each feature is for the classification decision. The importance is computed based on how much the feature decreases the impurity (Gini or Entropy) when used in decision splits across all trees. Features that lead to significant reductions in impurity are considered more important.

A common way to compute feature importance is by averaging the decrease in impurity across all trees when that feature is used to split a node:

\text{Importance}(j) = \frac{1}{K} \sum_{k=1}^K \Delta I_j^{(k)}

Where $\Delta I_j^{(k)}$ is the reduction in impurity attributed to feature $j$ in tree $T_k$.
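Given a table of per-tree impurity reductions (here a hypothetical nested list indexed as `delta_impurity[k][j]`), the averaging is:

```python
def feature_importance(delta_impurity):
    """Mean decrease in impurity per feature across K trees.
    delta_impurity[k][j] holds the reduction for feature j in tree k."""
    K = len(delta_impurity)
    n_features = len(delta_impurity[0])
    return [sum(tree[j] for tree in delta_impurity) / K
            for j in range(n_features)]
```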

  • Partial Dependence Plots (PDPs): These plots show the relationship between a feature and the predicted outcome, marginalizing over the other features. PDPs can help visualize how the model’s predictions change as the value of a particular feature changes, while holding other features constant.
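A naive PDP computation, assuming a fitted `model` callable and samples stored as tuples (both illustrative): for each grid value, clamp the feature of interest and average the predictions over the dataset.

```python
def partial_dependence(model, X, j, grid):
    """For each grid value v, fix feature j to v in every sample and
    average the model's predictions (marginalizing over other features)."""
    values = []
    for v in grid:
        preds = [model(x[:j] + (v,) + x[j + 1:]) for x in X]
        values.append(sum(preds) / len(preds))
    return values
```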

  • Tree Visualizations: While individual decision trees in a random forest can be complex and deep, visualizing a few trees can offer insights into the decision-making process of the model. These visualizations display the decisions at each node, helping to understand which features and thresholds are being used in splits.

  • Out-of-Bag Error Estimation: Since Random Forest uses bootstrapping to train each tree, about 36.8% of the data is left out of the training set for each tree (out-of-bag or OOB samples). The OOB samples can be used as a test set to estimate the performance of the model, providing an internal validation method without needing a separate test set.

  • Permutation Feature Importance: This method involves shuffling the values of a feature and measuring the decrease in model accuracy. A significant drop in accuracy when a feature is shuffled indicates high importance. This technique can be more reliable than traditional importance scores in some cases, especially when features are correlated.
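A sketch of permutation importance, assuming a `model` callable that maps a feature tuple to a label (the interface and names are illustrative):

```python
import random

def permutation_importance(model, X, y, j, n_repeats=20, rng=random):
    """Shuffle feature column j and measure the average drop in accuracy."""
    def accuracy(rows):
        return sum(model(x) == yi for x, yi in zip(rows, y)) / len(y)
    base = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        col = [x[j] for x in X]
        rng.shuffle(col)  # break the feature-label association for column j
        X_perm = [x[:j] + (v,) + x[j + 1:] for x, v in zip(X, col)]
        drops.append(base - accuracy(X_perm))
    return sum(drops) / n_repeats
```

An irrelevant feature yields a drop of zero, while shuffling a feature the model relies on produces a positive drop.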

By combining these techniques, we can gain a deeper understanding of how Random Forest models make their decisions and how each feature contributes to the overall classification process.