Ensemble methods in Python: Blending

Ensemble methods in machine learning combine the strengths of multiple models for enhanced performance. Blending is an ensemble technique that combines the predictions of multiple base models, often trained on the same dataset, using a meta-model (blender). The blender takes the individual models' predictions as input and produces the final ensemble prediction. This technique aims to improve overall predictive performance by leveraging diverse model strengths.

Blending Algorithm

How to implement blending using Python

Let’s look at the steps required to implement the blending algorithm in Python.

Import the libraries

The first step is to import the required libraries.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

Load the dataset

The next step is to load the dataset. We'll use the diabetes dataset provided by the sklearn library. It contains 10 baseline variables (numerical features) collected from patients, and the target variable is a quantitative measure of disease progression one year after baseline, with higher values indicating worse progression. Because blending here uses classifiers, we convert this continuous target into a binary label by thresholding at its median. The train_test_split function then divides the dataset into training and testing data.

# Load the diabetes dataset
diabetes = load_diabetes()
# Convert regression target to binary classification (threshold at median)
y = np.where(diabetes.target > np.median(diabetes.target), 1, 0)  # Create binary classes
X = diabetes.data  # Features remain the same
# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
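
As a quick sanity check (optional, not part of the lesson's original code), we can confirm the shapes of the splits and that thresholding at the median produced a roughly balanced binary target:

# Optional check: shapes of the splits and class counts
print(X_train.shape, X_test.shape)                # expected: (353, 10) (89, 10) for the 442-sample dataset
print(np.bincount(y_train), np.bincount(y_test))  # counts of class 0 and class 1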

Define the base models

The next step is to choose the base models. For this example, we’ll use the random forest classifier and gradient boosting classifier.

rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42)
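
Blending isn't limited to these two models; any reasonably diverse set of classifiers can serve as base models. As a hypothetical extension (not used in the rest of this lesson), a third base model could be added in the same way:

from sklearn.svm import SVC

# Hypothetical third base model; probability=True enables predict_proba
svc_model = SVC(probability=True, random_state=42)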

Train the base models

In this step, we will train the two base models, namely random forest (rf_model) and gradient boosting (gb_model), using the training data (X_train and y_train).

rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)

Generate predictions

In this step, we will make predictions on a separate test set (X_test) using the trained random forest and gradient boosting models.

rf_pred = rf_model.predict(X_test)
gb_pred = gb_model.predict(X_test)

Prepare the blending input

Now, we will organize the base models' predictions into a pandas DataFrame (X_blend), where each column holds the predictions from one base model, and keep the true labels (y_test) as the blending target (y_blend).

blender_input = {'RandomForest': rf_pred, 'GradientBoosting': gb_pred}
X_blend = pd.DataFrame(blender_input)
y_blend = y_test
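
Note that the blender's input features here are hard class labels (0 or 1). A common variant, sketched below as an assumption rather than part of this lesson's code, feeds the base models' predicted probabilities to the blender instead, giving the meta-model a richer, continuous signal:

# Sketch: use predicted probabilities as blender features instead of hard labels
rf_proba = rf_model.predict_proba(X_test)[:, 1]  # P(class 1) from random forest
gb_proba = gb_model.predict_proba(X_test)[:, 1]  # P(class 1) from gradient boosting
X_blend_proba = pd.DataFrame({'RandomForest': rf_proba, 'GradientBoosting': gb_proba})

If you adopt this variant, train the blender on X_blend_proba in the next step instead of X_blend.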

Train the blending model

Next, we will train a blending model, logistic regression (blender_model), on the DataFrame containing the predictions from the base models (X_blend). This meta-model learns to combine the base predictions effectively, with the aim of improving overall predictive performance.

blender_model = LogisticRegression()
blender_model.fit(X_blend, y_blend)
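
Optionally, we can inspect the fitted coefficients to see how much weight the blender assigns to each base model's predictions:

# One logistic regression coefficient per base model column in X_blend
print(dict(zip(X_blend.columns, blender_model.coef_[0])))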

Predict and evaluate

Now, we’ll make the predictions on the test set and calculate accuracy.

rf_test_pred = rf_model.predict(X_test)
gb_test_pred = gb_model.predict(X_test)
blender_input_test = {'RandomForest': rf_test_pred, 'GradientBoosting': gb_test_pred}
X_blend_test = pd.DataFrame(blender_input_test)
final_pred = blender_model.predict(X_blend_test)
accuracy = accuracy_score(y_test, final_pred)
print("Blending Accuracy: {:.2f}%".format(accuracy * 100))

Code example

The following code shows how we can implement the blending ensemble classifier in Python:

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Load diabetes dataset
diabetes = load_diabetes()

# Convert regression target to binary classification (threshold at median)
y = np.where(diabetes.target > np.median(diabetes.target), 1, 0)  # Create binary classes
X = diabetes.data  # Features remain the same

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base models (same as before)
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42)

# Train base models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)

# Generate predictions from base models
rf_pred = rf_model.predict(X_test)
gb_pred = gb_model.predict(X_test)

# Combine predictions using a blender (logistic regression)
blender_input = {'RandomForest': rf_pred, 'GradientBoosting': gb_pred}
X_blend = pd.DataFrame(blender_input)
y_blend = y_test  # Target variable for the blender

# Train blender
blender_model = LogisticRegression()
blender_model.fit(X_blend, y_blend)

# Make final prediction
rf_test_pred = rf_model.predict(X_test)
gb_test_pred = gb_model.predict(X_test)
blender_input_test = {'RandomForest': rf_test_pred, 'GradientBoosting': gb_test_pred}
X_blend_test = pd.DataFrame(blender_input_test)
final_pred = blender_model.predict(X_blend_test)

# Evaluate final prediction
accuracy = accuracy_score(y_test, final_pred)
print("Blending Accuracy: {:.2f}%".format(accuracy * 100))

Code explanation

  • Lines 1–7: We import the required libraries.

  • Line 10: We load the diabetes dataset from sklearn and store it in the diabetes variable.

  • Lines 12–14: We binarize the continuous target at its median to create two classes and keep the features unchanged.

  • Line 17: This line splits the dataset into training and testing sets.

  • Lines 20–25: We define RandomForestClassifier and GradientBoostingClassifier as the base models and train them on the training data.

  • Lines 28–29: We use the trained models to generate predictions on the test set.

  • Lines 32–34: These lines organize the predictions into a pandas DataFrame (X_blend) and set the true labels as the blending target (y_blend).

  • Lines 37–38: Here, we initialize and train a logistic regression model (blender_model) using the blended predictions and true labels.

  • Lines 41–45: We regenerate the base models' predictions on the test set, assemble them into X_blend_test, and use the trained blender to make the final prediction.

  • Lines 48–49: The code calculates the accuracy of the model’s predictions by comparing them to the true labels in the test set. The accuracy is printed as a percentage.

