How to calculate the feature importance in random forests

Feature importance is a crucial concept in machine learning, especially when working with ensemble algorithms like random forest. Understanding the importance of different features in our dataset allows us to gain insights into which factors are most influential in making predictions. This information can aid in feature selection, model optimization, and improving overall model interpretability.

Random forest

Random forest is a versatile algorithm that utilizes the power of decision trees. It constructs many decision trees and combines their outputs to obtain final predictions. Each decision tree is built using a random subset of the training data and a random subset of features. By averaging the predictions of these individual trees, random forest can reduce overfitting and improve generalization.
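As a quick sketch of this averaging (a minimal example, assuming scikit-learn; the parameter choices here are illustrative), we can verify that the forest's predicted class probabilities are the mean of its individual trees' predictions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each tree is trained on a bootstrap sample of the rows and considers
# a random subset of features at every split (max_features).
clf = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the ensemble
    max_features="sqrt",  # features considered at each split
    bootstrap=True,       # sample rows with replacement per tree
    random_state=0,
)
clf.fit(X, y)

# The forest's predicted probabilities are the average over its trees.
avg = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
print(np.allclose(avg, clf.predict_proba(X)))  # → True
```

Because each tree sees different rows and features, its errors are partly independent, and averaging cancels much of the individual trees' variance.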

Why calculate feature importance

Calculating feature importance helps us answer the question, “Which features contribute the most to the model’s predictive performance?” By identifying the most influential features, we can focus our attention on these variables, potentially discarding less relevant ones. This process not only enhances the efficiency of the model but also provides valuable insights into the underlying relationships within the data. Moreover, feature importance analysis assists in avoiding overfitting by understanding which features might introduce noise or unnecessary complexity.

Feature importance in random forest

Random forest's built-in feature importance is based on the decrease in impurity (often measured by Gini impurity or entropy) that a feature achieves when used for splitting during tree construction; this is known as mean decrease in impurity (MDI). A complementary, model-agnostic technique is permutation importance, which scores a feature by how much the model's performance drops when that feature's values are shuffled. The general steps to calculate permutation importance are as follows:

1. Build the forest: Train a random forest model on the dataset. The ensemble of decision trees will learn the relationships between features and the target variable.

2. Shuffle feature values: For each feature, randomly shuffle its values while keeping the target variable unchanged. This randomizes the relationship between the feature and the target.

3. Evaluate impact: Use the trained random forest to make predictions on the modified dataset where one feature has been shuffled. Calculate the performance drop caused by the shuffled feature. The larger the drop, the more important the feature is likely to be. This drop in performance can be measured using metrics like accuracy, Gini impurity, or mean squared error.

4. Repeat for all features: Repeat steps 2 and 3 for all features in the dataset. This provides a measure of the impact of shuffling each feature on the model’s performance.

5. Normalize importance scores: Normalize the importance scores across all features so that they sum up to 1 or 100%. This step ensures that the importance scores are comparable.
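The shuffling procedure above is what scikit-learn exposes as `sklearn.inspection.permutation_importance`. A minimal sketch on the iris dataset (the `n_repeats` value here is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

data = load_iris()
X, y = data.data, data.target

# Step 1: build the forest.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Steps 2-4: shuffle each feature n_repeats times and record the
# mean drop in the model's score (accuracy for classifiers).
result = permutation_importance(clf, X, y, n_repeats=10, random_state=42)

# Report features from most to least important.
for i in result.importances_mean.argsort()[::-1]:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.3f}")
```

Unlike the impurity-based scores, permutation importance can also be computed on a held-out set, which makes it less biased toward features with many distinct values.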

Note: Decision trees can be grown with other splitting criteria as well. For classification, alternatives to Gini impurity include entropy/cross-entropy (log loss), information gain, and misclassification error; for regression, error-based criteria such as MSE (mean squared error), MAE (mean absolute error), and variance reduction are common.

On the iris dataset, the petal-related features (petal length and petal width) are usually more important than the sepal-related features (sepal length and sepal width) for distinguishing between iris species, as shown in the figure. The exact order of importance can vary, but petal-related features generally carry more information than sepal-related ones. The features are plotted in descending order of importance. From the plot, we can see which features have the highest impact on the model’s decisions, which helps us understand which aspects of the data matter most for accurate predictions.

Feature importance

Code example

We’ll now compute the importance of features in the iris dataset with a random forest, using the sklearn library and the model’s built-in, impurity-based feature_importances_ attribute.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Load dataset (example using the Iris dataset)
data = load_iris()
X = data.data
y = data.target

# Train a random forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Get feature importances from the trained model
feature_importances = clf.feature_importances_

# Get feature names
feature_names = data.feature_names

# Sort indices in descending order of feature importance
indices = np.argsort(feature_importances)[::-1]

# Plot the feature importances
plt.figure(figsize=(8, 6))
plt.title("Feature Importance - Random Forest")
plt.bar(range(X.shape[1]), feature_importances[indices], align="center", width=0.5)
plt.xticks(range(X.shape[1]), np.array(feature_names)[indices])
plt.ylabel("Normalized Importance")
plt.show()

Code explanation

Lines 1–4: We import the necessary libraries.

Lines 7–9: We load the iris dataset, save the feature data in X, and target labels in y.

Lines 12–13: We create a random forest classifier with 100 decision trees and a fixed random state of 42 for reproducibility. Then, we train it on X and y.

Lines 16–19: We extract the feature importances computed by the trained random forest model (line 16) and the names of the features in the dataset (line 19).

Line 22: We sort the indices of features in descending order based on their importance scores.

Lines 25–30: Finally, we create a bar plot of feature importances, where the x-axis shows the feature names (in descending order of importance) and the y-axis shows the normalized importance scores.

Conclusion

In conclusion, feature importance analysis in random forest is a valuable technique that helps us identify the most influential features driving the model’s predictions. By understanding which features contribute the most to the model’s performance, we can make informed decisions about feature selection, model improvement, and data-driven insights.


Copyright ©2025 Educative, Inc. All rights reserved