Random Forest Notes
Table of Contents
1. Decision Trees
2. Impurities
3. Prediction
4. Interpretation
1. Decision Trees
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$ for binary classification.
A random forest model is expressed as an ensemble of $B$ decision trees:

$$\mathcal{F} = \{T_b\}_{b=1}^{B}$$
Where each decision tree $T_b$ is defined as follows:
- Consider the bootstrapped dataset for tree $T_b$, drawn by sampling $N$ points uniformly with replacement from $\mathcal{D}$:

$$\mathcal{D}_b = \{(x_i^{(b)}, y_i^{(b)})\}_{i=1}^{N}$$
This process is also known as bagging (Bootstrap AGGregatING). Approximately 63.2% of the original samples appear in each bootstrapped dataset, while 36.8% are left out (out-of-bag samples).
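The 63.2% figure can be checked empirically. A minimal sketch (variable names and sizes are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # number of samples in the original dataset

def in_bag_fraction(rng, N):
    """Fraction of distinct original samples appearing in one bootstrap draw."""
    idx = rng.integers(0, N, size=N)  # draw N indices with replacement
    return np.unique(idx).size / N

# Average over a few draws; theory gives 1 - (1 - 1/N)^N -> 1 - 1/e ≈ 0.632.
frac = float(np.mean([in_bag_fraction(rng, N) for _ in range(20)]))
```

The complementary ~36.8% of samples never drawn are exactly the out-of-bag samples used later for OOB error estimation.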
- If we consider the dataset at a given node of tree $T_b$ to be $\mathcal{D}_{\text{node}}$, then each node recursively splits the dataset into 2 subsets:

$$\mathcal{D}_{\text{left}} = \{(x_i, y_i) \in \mathcal{D}_{\text{node}} : x_{i,j} \le t\}, \qquad \mathcal{D}_{\text{right}} = \{(x_i, y_i) \in \mathcal{D}_{\text{node}} : x_{i,j} > t\}$$

Where:
- $j$ is the index of a feature, drawn from a random subset $\mathcal{F}_{\text{node}} \subseteq \{1, \dots, d\}$ with $|\mathcal{F}_{\text{node}}| = m$ and $m \le d$ (commonly $m \approx \sqrt{d}$). The restriction to $m$ features is introduced to reduce overfitting by reducing the feature space considered at each split.
- $t$ is a threshold with bounds depending on the range of feature $j$.
- The splits are done recursively until some stopping criterion is met (e.g. max_depth, min_samples_leaf, etc.)
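The split rule above can be sketched as a small function (names are my own, chosen for illustration):

```python
import numpy as np

def split_node(X, y, j, t):
    """Partition a node's dataset by the rule x_j <= t (left) vs x_j > t (right)."""
    mask = X[:, j] <= t
    return (X[mask], y[mask]), (X[~mask], y[~mask])

# Toy usage: split a 4-sample dataset on feature j=0 at threshold t=0.5.
X = np.array([[0.2, 1.0], [0.4, 0.0], [0.7, 1.0], [0.9, 0.0]])
y = np.array([0, 0, 1, 1])
(Xl, yl), (Xr, yr) = split_node(X, y, j=0, t=0.5)
```

A full tree builder would call this recursively on each child until a stopping criterion (max_depth, min_samples_leaf, purity) is hit.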
2. Impurities
The goal of each split at a node is to minimize the impurity $I$. Here are some common impurity measures:
- Gini:

$$I_G(\mathcal{D}) = 1 - \sum_{c \in \{0,1\}} \left(\frac{|\mathcal{D}_c|}{|\mathcal{D}|}\right)^2$$
- Entropy:

$$I_H(\mathcal{D}) = -\sum_{c \in \{0,1\}} \frac{|\mathcal{D}_c|}{|\mathcal{D}|} \log_2 \frac{|\mathcal{D}_c|}{|\mathcal{D}|}$$
Where $\mathcal{D}_c$ denotes the samples of $\mathcal{D}$ that belong to class $c$.
Thus we have the objective function:

$$(j^*, t^*) = \arg\min_{j \in \mathcal{F}_{\text{node}},\ t} \left[ \frac{|\mathcal{D}_{\text{left}}|}{|\mathcal{D}_{\text{node}}|}\, I(\mathcal{D}_{\text{left}}) + \frac{|\mathcal{D}_{\text{right}}|}{|\mathcal{D}_{\text{node}}|}\, I(\mathcal{D}_{\text{right}}) \right]$$
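The two impurity measures and the weighted objective can be sketched in Python (a minimal illustration; function names are my own):

```python
import numpy as np

def gini(y):
    """Gini impurity: 1 - sum_c p_c^2 over binary labels."""
    if y.size == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / y.size
    return 1.0 - float(np.sum(p ** 2))

def entropy(y):
    """Entropy: -sum_c p_c log2 p_c (with 0 log 0 taken as 0)."""
    if y.size == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / y.size
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def split_objective(y_left, y_right, impurity=gini):
    """Weighted child impurity that the split search minimizes."""
    n = y_left.size + y_right.size
    return (y_left.size * impurity(y_left) + y_right.size * impurity(y_right)) / n

# A pure split has zero weighted impurity under either measure.
pure = split_objective(np.array([0, 0]), np.array([1, 1]))
# A maximally mixed node: Gini = 0.5, entropy = 1.0 bit.
mixed = gini(np.array([0, 1, 0, 1]))
```

Minimizing the weighted child impurity is equivalent to maximizing the impurity decrease relative to the parent node, since the parent's impurity is constant during the search.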
3. Prediction
Each tree $T_b$ produces a label for a query data point $x$, based on the splits and thresholds learned during training. Specifically, the prediction is made by traversing the tree from the root to a leaf node:
- Starting at the root node, we examine the feature at the current decision node (indexed by $j$), applying that node's threshold $t$.
- Depending on whether the feature value of $x$ (i.e., $x_j$) is less than or equal to, or greater than, the threshold $t$, we move either to the left or the right child node, repeating the decision process.
- This recursive process continues until we reach a leaf node, which contains a class label (either 0 or 1).
The final prediction for the query point is determined by aggregating the outputs of all trees in the forest. Specifically, the majority vote (mode) across the trees’ predictions is used to make the final classification:

$$\hat{y} = \operatorname{mode}\{T_b(x)\}_{b=1}^{B}$$
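For binary labels the mode reduces to thresholding the mean vote, which is easy to sketch (array shapes and values below are illustrative):

```python
import numpy as np

def forest_predict(tree_preds):
    """Majority vote (mode) over per-tree binary predictions.

    tree_preds: array of shape (B, n_queries), one row per tree's 0/1 outputs.
    """
    votes = tree_preds.mean(axis=0)   # fraction of trees voting class 1
    return (votes > 0.5).astype(int)  # exact ties (even B) fall to class 0

# Toy example: B = 5 trees voting on 3 query points.
preds = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
])
y_hat = forest_predict(preds)  # -> [1, 0, 1]
```

Using an odd number of trees avoids ties; averaging the votes also yields a rough class-probability estimate for free.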
4. Interpretation
Random Forest models can be challenging to interpret due to their ensemble nature. However, there are several tools and techniques that can help interpret and explain the model, including:
- Feature Importance: One of the main advantages of Random Forests is their ability to provide feature importance scores. These scores reflect how important each feature is for the classification decision. The importance is computed based on how much the feature decreases the impurity (Gini or Entropy) when used in decision splits across all trees. Features that lead to significant reductions in impurity are considered more important.
A common way to compute feature importance is by averaging the decrease in impurity across all trees when that feature is used to split a node:

$$\text{Importance}(j) = \frac{1}{B} \sum_{b=1}^{B} \Delta I_{j,b}$$

Where $\Delta I_{j,b}$ is the total reduction in impurity attributable to feature $j$ in tree $b$.
- Partial Dependence Plots (PDPs): These plots show the relationship between a feature and the predicted outcome, marginalizing over the other features. PDPs can help visualize how the model’s predictions change as the value of a particular feature changes, while holding other features constant.
- Tree Visualizations: While individual decision trees in a random forest can be complex and deep, visualizing a few trees can offer insights into the decision-making process of the model. These visualizations display the decisions at each node, helping to understand which features and thresholds are being used in splits.
- Out-of-Bag Error Estimation: Since Random Forest uses bootstrapping to train each tree, about 36.8% of the data is left out of the training set for each tree (out-of-bag or OOB samples). The OOB samples can be used as a test set to estimate the performance of the model, providing an internal validation method without needing a separate test set.
- Permutation Feature Importance: This method involves shuffling the values of a feature and measuring the decrease in model accuracy. A significant drop in accuracy when a feature is shuffled indicates high importance. This technique can be more reliable than traditional importance scores in some cases, especially when features are correlated.
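Several of these tools are available off the shelf. A sketch using scikit-learn (assumed available; the synthetic dataset and hyperparameters are illustrative) showing impurity-based importances, the OOB score, and permutation importances side by side:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Illustrative synthetic binary-classification data.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

mdi = rf.feature_importances_   # mean decrease in impurity, normalized to sum to 1
oob = rf.oob_score_             # accuracy estimated on out-of-bag samples
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
```

Comparing `mdi` against `perm.importances_mean` is a quick sanity check: large disagreements between the two rankings often point to correlated or high-cardinality features inflating the impurity-based scores.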
By combining these techniques, we can gain a deeper understanding of how Random Forest models make their decisions and how each feature contributes to the overall classification process.