Ensemble methods in machine learning combine the strengths of multiple models for enhanced performance. Blending is an ensemble technique that combines the predictions of multiple base models, often trained on the same dataset, using a meta-model (the blender). The blender takes the individual models' predictions as input and produces the final ensemble prediction, aiming to improve overall predictive performance by leveraging the diverse strengths of the base models.
Let’s look at the steps required to implement the blending algorithm in Python.
The first step is to import the required libraries.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
The next step is to load the dataset. We’ll use the diabetes dataset provided by the sklearn library. It contains 10 baseline variables (numerical features) collected from patients. The target variable is a quantitative measure of disease progression one year after baseline (a continuous numerical value), with higher values indicating worse progression. Because blending here uses classifiers, we convert this continuous target into binary classes by thresholding at the median. The train_test_split function then divides the dataset into training and testing data.
# Load diabetes dataset
diabetes = load_diabetes()

# Convert regression target to binary classification (e.g., threshold at median)
y = np.where(diabetes.target > np.median(diabetes.target), 1, 0)  # <-- Create binary classes
X = diabetes.data  # Features remain the same

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The next step is to choose the base models. For this example, we’ll use the random forest classifier and gradient boosting classifier.
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42)
In this step, we will train the two base models, namely random forest (rf_model) and gradient boosting (gb_model), using the training data (X_train and y_train).
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
In this step, we will make predictions on a separate test set (X_test) using the trained random forest and gradient boosting models.
rf_pred = rf_model.predict(X_test)
gb_pred = gb_model.predict(X_test)
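Here we feed the blender hard class labels. A common variation, sketched below, uses each model's predicted probability for the positive class instead, which gives the meta-model more information to work with; the rf_proba and gb_proba names are just illustrative and not part of the walkthrough above.
# Optional variation: use positive-class probabilities instead of hard labels
rf_proba = rf_model.predict_proba(X_test)[:, 1]  # P(class 1) from the random forest
gb_proba = gb_model.predict_proba(X_test)[:, 1]  # P(class 1) from gradient boosting
# These arrays could replace rf_pred and gb_pred as the blender's input columns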
Now, we will organize the predictions into a pandas DataFrame (X_blend), where each column corresponds to the predictions from a specific base model, and store the true labels (y_test) as the blending target (y_blend).
blender_input = {'RandomForest': rf_pred, 'GradientBoosting': gb_pred}
X_blend = pd.DataFrame(blender_input)
y_blend = y_test
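As a quick sanity check on the blender's training data, we can peek at the first few rows; this inspection is purely illustrative.
print(X_blend.head())  # One column of 0/1 predictions per base model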
Finally, we will train a blending model, logistic regression (blender_model), on the DataFrame containing the predictions from the base models (X_blend). This blending model learns to combine the predictions effectively and serves as a meta-model to improve overall predictive performance.
blender_model = LogisticRegression()
blender_model.fit(X_blend, y_blend)
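Because the blender is a plain logistic regression, we can inspect its learned coefficients to see how heavily it weights each base model's vote. This diagnostic step is optional and not part of the original walkthrough.
# Each coefficient reflects how much the blender trusts that base model
for name, coef in zip(X_blend.columns, blender_model.coef_[0]):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {blender_model.intercept_[0]:.3f}")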
Now, we’ll make the predictions on the test set and calculate accuracy.
rf_test_pred = rf_model.predict(X_test)
gb_test_pred = gb_model.predict(X_test)
blender_input_test = {'RandomForest': rf_test_pred, 'GradientBoosting': gb_test_pred}
X_blend_test = pd.DataFrame(blender_input_test)
final_pred = blender_model.predict(X_blend_test)
accuracy = accuracy_score(y_test, final_pred)
print("Blending Accuracy: {:.2f}%".format(accuracy * 100))
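Note that in this simplified example the blender is trained and evaluated on the same test set, which leaks label information and inflates the reported accuracy. In practice, blending reserves a separate holdout (validation) split for training the blender and keeps the test set untouched. A minimal sketch of that setup follows; the variable names (X_hold, y_hold, and so on) are illustrative.
# Carve a holdout set for the blender out of the training data
X_tr, X_hold, y_tr, y_hold = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# Base models are fit only on the reduced training split
rf_model.fit(X_tr, y_tr)
gb_model.fit(X_tr, y_tr)

# The blender is trained on holdout predictions, not test predictions
X_blend_hold = pd.DataFrame({'RandomForest': rf_model.predict(X_hold),
                             'GradientBoosting': gb_model.predict(X_hold)})
blender_model.fit(X_blend_hold, y_hold)

# The test set is now used only for the final, unbiased evaluation
X_blend_test = pd.DataFrame({'RandomForest': rf_model.predict(X_test),
                             'GradientBoosting': gb_model.predict(X_test)})
print("Holdout-blending accuracy: {:.2f}%".format(
    accuracy_score(y_test, blender_model.predict(X_blend_test)) * 100))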
The following code shows how we can implement the blending ensemble classifier in Python:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Load diabetes dataset
diabetes = load_diabetes()

# Convert regression target to binary classification (e.g., threshold at median)
y = np.where(diabetes.target > np.median(diabetes.target), 1, 0)  # <-- Create binary classes
X = diabetes.data  # Features remain the same

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base models (same as before)
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42)

# Train base models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)

# Generate predictions from base models
rf_pred = rf_model.predict(X_test)
gb_pred = gb_model.predict(X_test)

# Combine predictions using a blender (logistic regression)
blender_input = {'RandomForest': rf_pred, 'GradientBoosting': gb_pred}
X_blend = pd.DataFrame(blender_input)
y_blend = y_test  # Target variable for the blender

# Train blender
blender_model = LogisticRegression()
blender_model.fit(X_blend, y_blend)

# Make final prediction
rf_test_pred = rf_model.predict(X_test)
gb_test_pred = gb_model.predict(X_test)
blender_input_test = {'RandomForest': rf_test_pred, 'GradientBoosting': gb_test_pred}
X_blend_test = pd.DataFrame(blender_input_test)
final_pred = blender_model.predict(X_blend_test)

# Evaluate final prediction
accuracy = accuracy_score(y_test, final_pred)
print("Blending Accuracy: {:.2f}%".format(accuracy * 100))
Lines 1–7: We import the required libraries.
Line 10: We load the diabetes dataset from sklearn and store it in the diabetes variable.
Line 17: This line splits the dataset into training and testing sets.
Lines 20–25: We define RandomForestClassifier and GradientBoostingClassifier as the base models and fit them on the training data.
Lines 28–29: We use the trained models to generate predictions on the test set.
Lines 33–34: These lines organize the predictions into a pandas DataFrame (X_blend) and set the true labels as the blending target (y_blend).
Lines 37–38: Here, we initialize and train a logistic regression model (blender_model) using the blended predictions and true labels.
Lines 41–45: The base models' test-set predictions are assembled and fed to the blender to produce the final predictions.
Lines 48–49: The code calculates the accuracy of the model’s predictions by comparing them to the true labels in the test set. The accuracy is printed as a percentage.