Ensemble methods in machine learning combine the strengths of multiple models for enhanced performance. Blending is an ensemble technique that combines the predictions of multiple base models, often trained on the same dataset, using a meta-model (the blender). The blender takes the individual models' predictions as input and produces the final ensemble prediction, aiming to improve overall predictive performance by leveraging the diverse strengths of the base models.
Let’s look at the steps required to implement the blending algorithm in Python.
The first step is to import the required libraries.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
The next step is to load the dataset. We’ll use the diabetes dataset provided by the sklearn library. It contains 10 baseline variables (numerical features) collected from patients. The target variable is a quantitative measure of disease progression one year after baseline (a continuous numerical value), with higher values indicating worse progression. Because blending here uses classifiers, we convert this continuous target into binary classes by thresholding at the median. The train_test_split function then divides the dataset into training and testing data.
# Load diabetes dataset
diabetes = load_diabetes()

# Convert regression target to binary classification (e.g., threshold at median)
y = np.where(diabetes.target > np.median(diabetes.target), 1, 0)  # <-- Create binary classes
X = diabetes.data  # Features remain the same

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The next step is to choose the base models. For this example, we’ll use the random forest classifier and gradient boosting classifier.
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42)
In this step, we will train the two base models, namely random forest (rf_model) and gradient boosting (gb_model), using the training data (X_train and y_train).
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
In this step, we will make predictions on a separate test set (X_test) using the trained random forest and gradient boosting models.
rf_pred = rf_model.predict(X_test)
gb_pred = gb_model.predict(X_test)
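Here we feed the blender hard class labels. A common variation, sketched below, uses each model's predicted probability for the positive class instead, which gives the meta-model more information to work with; the rf_proba and gb_proba names are just illustrative and not part of the walkthrough above.
# Optional variation: use positive-class probabilities instead of hard labels
rf_proba = rf_model.predict_proba(X_test)[:, 1]  # P(class 1) from the random forest
gb_proba = gb_model.predict_proba(X_test)[:, 1]  # P(class 1) from gradient boosting
# These arrays could replace rf_pred and gb_pred as the blender's input columns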
Now, we will organize the predictions into a pandas DataFrame (X_blend), where each column corresponds to the predictions from a specific base model, and store the true labels (y_test) as the blending target (y_blend).
blender_input = {'RandomForest': rf_pred, 'GradientBoosting': gb_pred}
X_blend = pd.DataFrame(blender_input)
y_blend = y_test
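As a quick sanity check on the blender's training data, we can peek at the first few rows; this inspection is purely illustrative.
print(X_blend.head())  # One column of 0/1 predictions per base model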
Finally, we will train a blending model, logistic regression (blender_model), on the DataFrame containing the predictions from the base models (X_blend). This blending model learns to combine the predictions effectively and serves as a meta-model to improve overall predictive performance.
blender_model = LogisticRegression()
blender_model.fit(X_blend, y_blend)
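Because the blender is a plain logistic regression, we can inspect its learned coefficients to see how heavily it weights each base model's vote. This diagnostic step is optional and not part of the original walkthrough.
# Each coefficient reflects how much the blender trusts that base model
for name, coef in zip(X_blend.columns, blender_model.coef_[0]):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {blender_model.intercept_[0]:.3f}")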
Now, we’ll make the predictions on the test set and calculate accuracy.
rf_test_pred = rf_model.predict(X_test)
gb_test_pred = gb_model.predict(X_test)
blender_input_test = {'RandomForest': rf_test_pred, 'GradientBoosting': gb_test_pred}
X_blend_test = pd.DataFrame(blender_input_test)
final_pred = blender_model.predict(X_blend_test)
accuracy = accuracy_score(y_test, final_pred)
print("Blending Accuracy: {:.2f}%".format(accuracy * 100))
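Note that in this simplified example the blender is trained and evaluated on the same test set, which leaks label information and inflates the reported accuracy. In practice, blending reserves a separate holdout (validation) split for training the blender and keeps the test set untouched. A minimal sketch of that setup follows; the variable names (X_hold, y_hold, and so on) are illustrative.
# Carve a holdout set for the blender out of the training data
X_tr, X_hold, y_tr, y_hold = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Base models are fit only on the reduced training split
rf_model.fit(X_tr, y_tr)
gb_model.fit(X_tr, y_tr)

# The blender is trained on holdout predictions, not test predictions
X_blend_hold = pd.DataFrame({'RandomForest': rf_model.predict(X_hold),
                             'GradientBoosting': gb_model.predict(X_hold)})
blender_model.fit(X_blend_hold, y_hold)

# The test set is now used only for the final, unbiased evaluation
X_blend_test = pd.DataFrame({'RandomForest': rf_model.predict(X_test),
                             'GradientBoosting': gb_model.predict(X_test)})
print("Holdout-blending accuracy: {:.2f}%".format(
    accuracy_score(y_test, blender_model.predict(X_blend_test)) * 100))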
The following code shows how we can implement the blending ensemble classifier in Python:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Load diabetes dataset
diabetes = load_diabetes()

# Convert regression target to binary classification (e.g., threshold at median)
y = np.where(diabetes.target > np.median(diabetes.target), 1, 0)  # <-- Create binary classes
X = diabetes.data  # Features remain the same

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base models (same as before)
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42)

# Train base models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)

# Generate predictions from base models
rf_pred = rf_model.predict(X_test)
gb_pred = gb_model.predict(X_test)

# Combine predictions using a blender (logistic regression)
blender_input = {'RandomForest': rf_pred, 'GradientBoosting': gb_pred}
X_blend = pd.DataFrame(blender_input)
y_blend = y_test  # Target variable for the blender

# Train blender
blender_model = LogisticRegression()
blender_model.fit(X_blend, y_blend)

# Make final prediction
rf_test_pred = rf_model.predict(X_test)
gb_test_pred = gb_model.predict(X_test)
blender_input_test = {'RandomForest': rf_test_pred, 'GradientBoosting': gb_test_pred}
X_blend_test = pd.DataFrame(blender_input_test)
final_pred = blender_model.predict(X_blend_test)

# Evaluate final prediction
accuracy = accuracy_score(y_test, final_pred)
print("Blending Accuracy: {:.2f}%".format(accuracy * 100))
Lines 1–7: We import the required libraries.
Line 10: We load the diabetes dataset from sklearn and store it in the diabetes variable.
Line 17: This line splits the dataset into training and testing sets.
Lines 20–25: We define RandomForestClassifier and GradientBoostingClassifier as the base models and fit them on the training data.
Lines 28–29: We use the trained models to generate predictions on the test set.
Lines 33–34: These lines organize the predictions into a pandas DataFrame (X_blend) and set the true labels as the blending target (y_blend).
Lines 37–38: Here, we initialize and train a logistic regression model (blender_model) using the blended predictions and true labels.
Lines 41–45: The base models' test-set predictions are assembled and fed to the blender to produce the final predictions.
Lines 48–49: The code calculates the accuracy of the model’s predictions by comparing them to the true labels in the test set. The accuracy is printed as a percentage.