Table of Contents

  • 1. Decision Trees
  • 2. Impurities
  • 3. Prediction
  • 4. Interpretation

1. Decision Trees

Consider a dataset $X = \{(x_i, y_i)\}^N_{i=1}$ for binary classification, where $x_i \in \mathbb{R}^n$ and $y_i \in \{0; 1\}$.

A random forest model is expressed as an ensemble of $K$ decision trees:

\begin{gather*} \mathcal{T}_{RF} = \{T_1, T_2, \ldots , T_K \} \end{gather*}

Where each decision tree $T_k$ is defined as follows:

  • Consider the bootstrapped dataset for tree $T_k$:
\begin{gather*} X_k^{(b)} = \{ (x_i, y_i)^{(b)} \}^N_{i=1} \qquad (x_i, y_i)^{(b)} \sim \text{with replacement from } X \end{gather*}

This process is also known as bagging (Bootstrap AGGregatING). The probability that a given sample is never drawn in $N$ draws is $(1 - 1/N)^N \to e^{-1} \approx 0.368$, so approximately 63.2% of the original samples appear in each bootstrapped dataset, while the remaining 36.8% are left out (out-of-bag samples).
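The bootstrap step can be sketched in Python (the function name and dataset size are illustrative):

```python
import random

def bootstrap_sample(dataset, rng=random):
    """Draw N samples uniformly with replacement from a dataset of size N."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

# Expected fraction of distinct original samples in one bootstrap:
# 1 - (1 - 1/N)^N, which approaches 1 - 1/e ≈ 0.632 as N grows.
random.seed(0)
data = list(range(10_000))
sample = bootstrap_sample(data)
unique_fraction = len(set(sample)) / len(data)
```

For a dataset this size, `unique_fraction` concentrates tightly around 0.632, matching the out-of-bag figure above.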

  • If we denote the dataset at an arbitrary node of tree $T_k$ by $X_{\text{Node}}$, then each node recursively splits it into two subsets:
\begin{align} & X_L = \{(x_i, y_i) \mid x_{i, j} \leq t\} \quad : \text{ Left split} \\[5pt] & X_R = \{(x_i, y_i) \mid x_{i, j} > t\} \quad : \text{ Right split} \end{align}

Where:

  • $j$ is a feature index drawn from a random subset $\{j_1, \ldots, j_m\} \subseteq \{1, \ldots, n\}$ of $m = \left\lfloor \log_2 n \right\rfloor$ distinct features, i.e. $j_k \neq j_l$ for all $k \neq l$. Restricting each split to $\left\lfloor \log_2 n \right\rfloor$ candidate features reduces overfitting by shrinking the searched feature space and decorrelating the trees.
  • $t$ is a threshold whose bounds depend on the range of feature $j$.
  • The splits are applied recursively until a stopping criterion is met (e.g. max_depth, min_samples_leaf, etc.).
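The split at a single node, given a feature index $j$ and threshold $t$, can be sketched as follows (the function name is illustrative; samples are stored as `(x, y)` tuples):

```python
def split_node(X_node, j, t):
    """Partition samples (x, y) into X_L (x[j] <= t) and X_R (x[j] > t)."""
    X_L = [(x, y) for x, y in X_node if x[j] <= t]
    X_R = [(x, y) for x, y in X_node if x[j] > t]
    return X_L, X_R
```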

2. Impurities

The goal of each split at a node is to minimize the impurity $I$. Here are some common impurity measures:

  • Gini:
I(X) = 1 - \sum_{c \in \{0; 1\}} \left(\frac{|X_c|}{|X|}\right)^2
  • Entropy:
I(X) = - \sum_{c \in \{0; 1\}} \frac{|X_c|}{|X|} \log \frac{|X_c|}{|X|}

Where $X_c$ denotes the samples of $X$ that belong to class $c$.
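Both measures translate directly from the formulas above (a sketch assuming labels are 0/1 integers and samples are `(x, y)` tuples):

```python
import math

def gini(X):
    """Gini impurity: 1 - sum over c in {0, 1} of (|X_c| / |X|)^2."""
    if not X:
        return 0.0
    p1 = sum(y for _, y in X) / len(X)  # |X_1| / |X|
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def entropy(X):
    """Entropy: -sum of (|X_c| / |X|) * log2(|X_c| / |X|); 0 * log 0 := 0."""
    if not X:
        return 0.0
    p1 = sum(y for _, y in X) / len(X)
    return -sum(p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0.0)
```

A perfectly mixed node has the maximum impurity (0.5 for Gini, 1.0 for entropy in bits); a pure node has impurity 0.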

Thus we have the objective function:

\begin{gather*} (\hat{j}, \hat{t}\,) = \arg\min\limits_{j, t} \left( \frac{|X_L|}{|X_{\text{Node}}|} I(X_L) + \frac{|X_R|}{|X_{\text{Node}}|} I(X_R) \right) \end{gather*}
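A minimal exhaustive search for $(\hat{j}, \hat{t}\,)$, assuming Gini impurity and samples stored as `(x, y)` tuples (names are illustrative):

```python
def gini(X):
    if not X:
        return 0.0
    p1 = sum(y for _, y in X) / len(X)
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

def best_split(X_node, feature_indices):
    """Search over (j, t) minimizing the weighted child impurity
    |X_L|/|X_Node| * I(X_L) + |X_R|/|X_Node| * I(X_R)."""
    n = len(X_node)
    best_j, best_t, best_score = None, None, float("inf")
    for j in feature_indices:
        # Candidate thresholds: unique feature values, dropping the maximum
        # so the right split is never empty.
        for t in sorted({x[j] for x, _ in X_node})[:-1]:
            X_L = [(x, y) for x, y in X_node if x[j] <= t]
            X_R = [(x, y) for x, y in X_node if x[j] > t]
            score = len(X_L) / n * gini(X_L) + len(X_R) / n * gini(X_R)
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t, best_score
```

Real implementations use smarter threshold enumeration (e.g. sorting once per feature), but the objective is the same weighted sum shown above.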

3. Prediction

Each tree $T_k$ produces a label $T_k(x_q)$ for a query data point $x_q$, based on the splits and thresholds learned during training. Specifically, the prediction is made by traversing the tree from the root to a leaf node:

  • Starting at the root node, we examine the feature at the decision node (indexed by $j$) and apply the threshold $t$.
  • If the feature value of $x_q$ (i.e., $x_{q,j}$) satisfies $x_{q,j} \leq t$, we move to the left child node; otherwise to the right, repeating the decision process.
  • This recursive process continues until we reach a leaf node, which contains a class label (either 0 or 1).
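The traversal can be sketched with a hypothetical tuple-based tree representation (not how any particular library stores trees):

```python
def predict_tree(node, x_q):
    """Traverse from the root to a leaf. An internal node is a tuple
    (j, t, left, right); a leaf is a bare class label (0 or 1)."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x_q[j] <= t else right
    return node

# A hypothetical learned tree: split on feature 0 at 1.5,
# then (on the left branch) on feature 1 at 0.5.
tree = (0, 1.5, (1, 0.5, 0, 1), 1)
```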

The final prediction for the query point xqx_q is determined by aggregating the outputs of all trees in the forest. Specifically, the majority vote (mode) across the trees’ predictions is used to make the final classification:

\begin{gather*} y_q = \text{mode}(\{ T_1 (x_q), T_2 (x_q), \ldots , T_K (x_q) \}) \end{gather*}
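The vote itself, given the $K$ tree outputs, is a short reduction with `collections.Counter` (using an odd $K$ avoids ties in binary classification):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate the K tree outputs T_k(x_q) by mode (most common label)."""
    return Counter(predictions).most_common(1)[0][0]
```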

4. Interpretation

Random Forest models can be challenging to interpret due to their ensemble nature. However, there are several tools and techniques that can help interpret and explain the model, including:

  • Feature Importance: One of the main advantages of Random Forests is their ability to provide feature importance scores. These scores reflect how important each feature is for the classification decision. The importance is computed based on how much the feature decreases the impurity (Gini or Entropy) when used in decision splits across all trees. Features that lead to significant reductions in impurity are considered more important.

A common way to compute feature importance is by averaging the decrease in impurity across all trees when that feature is used to split a node:

\text{Importance}(j) = \frac{1}{K} \sum_{k=1}^K \Delta I_j^{(k)}

Where $\Delta I_j^{(k)}$ is the reduction in impurity attributed to feature $j$ in tree $T_k$.
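Given a table of per-tree impurity reductions (here a hypothetical nested list indexed as `delta_impurity[k][j]`), the averaging is:

```python
def feature_importance(delta_impurity):
    """Mean decrease in impurity per feature across K trees.
    delta_impurity[k][j] holds the reduction for feature j in tree k."""
    K = len(delta_impurity)
    n_features = len(delta_impurity[0])
    return [sum(tree[j] for tree in delta_impurity) / K
            for j in range(n_features)]
```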

  • Partial Dependence Plots (PDPs): These plots show the relationship between a feature and the predicted outcome, marginalizing over the other features. PDPs can help visualize how the model’s predictions change as the value of a particular feature changes, while holding other features constant.
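A naive PDP computation, assuming a fitted `model` callable and samples stored as tuples (both illustrative): for each grid value, clamp the feature of interest and average the predictions over the dataset.

```python
def partial_dependence(model, X, j, grid):
    """For each grid value v, fix feature j to v in every sample and
    average the model's predictions (marginalizing over other features)."""
    values = []
    for v in grid:
        preds = [model(x[:j] + (v,) + x[j + 1:]) for x in X]
        values.append(sum(preds) / len(preds))
    return values
```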

  • Tree Visualizations: While individual decision trees in a random forest can be complex and deep, visualizing a few trees can offer insights into the decision-making process of the model. These visualizations display the decisions at each node, helping to understand which features and thresholds are being used in splits.

  • Out-of-Bag Error Estimation: Since Random Forest uses bootstrapping to train each tree, about 36.8% of the data is left out of the training set for each tree (out-of-bag or OOB samples). The OOB samples can be used as a test set to estimate the performance of the model, providing an internal validation method without needing a separate test set.

  • Permutation Feature Importance: This method involves shuffling the values of a feature and measuring the decrease in model accuracy. A significant drop in accuracy when a feature is shuffled indicates high importance. This technique can be more reliable than traditional importance scores in some cases, especially when features are correlated.
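A sketch of permutation importance, assuming a `model` callable that maps a feature tuple to a label (the interface and names are illustrative):

```python
import random

def permutation_importance(model, X, y, j, n_repeats=20, rng=random):
    """Shuffle feature column j and measure the average drop in accuracy."""
    def accuracy(rows):
        return sum(model(x) == yi for x, yi in zip(rows, y)) / len(y)
    base = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        col = [x[j] for x in X]
        rng.shuffle(col)  # break the feature-label association for column j
        X_perm = [x[:j] + (v,) + x[j + 1:] for x, v in zip(X, col)]
        drops.append(base - accuracy(X_perm))
    return sum(drops) / n_repeats
```

An irrelevant feature yields a drop of zero, while shuffling a feature the model relies on produces a positive drop.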

By combining these techniques, we can gain a deeper understanding of how Random Forest models make their decisions and how each feature contributes to the overall classification process.