Understanding predict_proba() from MultiOutputClassifier

The predict_proba method is commonly used in machine learning to obtain probability estimates for the possible classes of a classification problem. When working with a multioutput classification problem, such as one handled by MultiOutputClassifier in scikit-learn, predict_proba provides probability estimates for each output variable or dimension.

Here are some key methods and concepts associated with predict_proba to help us understand this method in greater detail, starting with the definition of multioutput classification.


What is multioutput classification?

Multioutput classification is a type of supervised learning in which we predict multiple target variables for each input, each with its own set of possible classes or labels. It is closely related to multilabel classification, in which an input can belong to multiple classes simultaneously; in the multioutput setting, each target variable can itself be a multiclass problem.
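A quick way to picture this is to look at the shape of the target data. The snippet below is a small illustrative sketch (the label values are made up), showing a target array with one column per output variable:

import numpy as np

# Each row is one sample; each column is one output variable
# For example, column 0 could be a color label (0-2) and column 1 a size label (0-2)
y = np.array([
    [0, 2],   # sample 1: class 0 for the first output, class 2 for the second
    [1, 1],   # sample 2
    [2, 0],   # sample 3
])
print(y.shape)  # (3, 2) -> 3 samples, 2 output variables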

The MultiOutputClassifier method

MultiOutputClassifier is a wrapper in scikit-learn that allows us to apply a single base classifier to a multioutput classification problem. It treats each output variable as an independent classification problem and fits one copy of the specified base classifier per output.

The syntax for this wrapper is given below:

multi_output_classifier = MultiOutputClassifier(base_classifier)

Note: The base_classifier can be any supervised classification estimator, such as the Random Forest algorithm.
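For instance, assuming a linear model is sufficient, we could wrap scikit-learn's LogisticRegression in exactly the same way. This is a minimal sketch, separate from the example that follows:

from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Any classifier that implements fit/predict (and predict_proba, if probability
# estimates are needed) can serve as the base classifier
multi_output_classifier = MultiOutputClassifier(LogisticRegression(max_iter=1000))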

Now, let's discuss the steps to implement this method.

Step 1: Create a sample dataset

Firstly, we will create a sample dataset consisting of 100 samples with four features and two output variables, where each output takes one of three possible class labels:

import numpy as np
# Define the number of samples
num_samples = 100
# Generate random features (X); here 4 random features are being generated
X = np.random.rand(num_samples, 4)
# Generate random target values (y) for two output variables, each with three classes
num_classes = 3
y1 = np.random.randint(0, num_classes, size=num_samples)
y2 = np.random.randint(0, num_classes, size=num_samples)

A diagram visualizing the setup of the model is given below for a better understanding of this example:

Multioutput classification with four input features

After this, we split the dataset into training and test sets using the train_test_split function in an 80-20 ratio:

X_train, X_test, y_train, y_test = train_test_split(X, np.column_stack((y1, y2)), test_size=0.2)
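Because train_test_split shuffles the data randomly, the exact split changes from run to run. If reproducible results are needed, an optional tweak (not required for this example) is to pass a fixed random_state:

# Fixed seed so the same rows land in the training and test sets on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, np.column_stack((y1, y2)), test_size=0.2, random_state=42
)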

Step 2: Create a multioutput classifier

Next, we initialize our multioutput classifier with the aid of a base classifier, which is taken to be RandomForestClassifier for this coding example.

from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
# Create a multioutput classifier using a base classifier (e.g., RandomForest)
base_classifier = RandomForestClassifier()
multi_output_classifier = MultiOutputClassifier(base_classifier)
Initializing MultiOutputClassifier

The base classifier is passed to MultiOutputClassifier, so the wrapper is now ready to be fitted on our sample dataset.

Step 3: Fit the sample dataset

Next, we will use predict_proba. After fitting the multi_output_classifier on the training data, we call the predict_proba method on the test data to generate probability estimates for each output variable:

# Fit the multioutput classifier to the training data
multi_output_classifier.fit(X_train, y_train)
# Use the predict_proba method to get probability estimates for each output variable
probabilities = multi_output_classifier.predict_proba(X_test)
print(probabilities)

Note: The predict_proba method only takes one input parameter, which is the test data itself.
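Since predict_proba expects a 2D array of samples, even a single test point has to keep the shape (1, n_features). Here is a small sketch, assuming the variables from the steps above already exist:

# A single sample still needs two dimensions: (1 sample, 4 features)
single_sample = X_test[:1]
single_probs = multi_output_classifier.predict_proba(single_sample)
print(single_probs)  # one probability row per output variable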

Code example

The code using the predict_proba method is given below:

from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
# Define the number of samples
num_samples = 100
# Generate random features (X); here 4 random features are being generated
X = np.random.rand(num_samples, 4)
# Generate random target values (y) for two output variables, each with three classes
num_classes = 3
y1 = np.random.randint(0, num_classes, size=num_samples)
y2 = np.random.randint(0, num_classes, size=num_samples)
# Create a sample training and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, np.column_stack((y1, y2)), test_size=0.2)
# Create a multioutput classifier using a base classifier (e.g., RandomForest)
base_classifier = RandomForestClassifier()
multi_output_classifier = MultiOutputClassifier(base_classifier)
# Fit the multioutput classifier to the training data
multi_output_classifier.fit(X_train, y_train)
# Use the predict_proba method to get probability estimates for each output variable
probabilities = multi_output_classifier.predict_proba(X_test)
print(probabilities)

Code explanation

The line-by-line code explanation is given below:

  • Lines 5–14: We start off by creating a dataset with multiple target variables, since the MultiOutputClassifier wrapper expects a 2D target array. For this example, we generate 100 samples with four features each, along with random labels for two output variables, y1 and y2, each having three classes to choose from. Once the input data and output labels are generated, we split them into training and test sets using an 80-20 split.

  • Lines 15–17: Next, we create a multioutput classifier from a base classifier. Here, we use RandomForestClassifier as the base classifier.

  • Lines 18–22: Finally, we fit the multioutput classifier on the training data and then call the predict_proba method on the test data to get the probability estimates for both output variables. The probabilities are stored in the probabilities variable, which is then printed.

Keep in mind that predict_proba on a MultiOutputClassifier returns a list with one probability array per output variable, and the structure of each array may vary depending on the base classifier we are using. For example, most classifiers return probabilities as an array of shape (n_samples, n_classes), while others may return them differently.
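To confirm the structure for this example, we can inspect the returned object directly. The following sketch assumes the probabilities variable from the code example above:

# probabilities is a list with one array per output variable
print(len(probabilities))        # 2 -> one entry for y1 and one for y2
print(probabilities[0].shape)    # (n_test_samples, n_classes), e.g. (20, 3)
print(probabilities[1].shape)    # (n_test_samples, n_classes), e.g. (20, 3)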

Output

If we look at the probabilities output, we can see that, for each output variable, every row corresponds to a test sample and contains three probabilities, one per class, which makes sense because each output has three possible classes.

[[0.6, 0.4, 0. ],
[0.3, 0.6, 0.1],
[0. , 1. , 0. ],...]
Some output rows showing the probabilities for the three classes
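If we only need the most likely class for each output variable, we can take the argmax of each probability array; this should generally agree with what the wrapper's predict method returns. The sketch below assumes the variables from the code example above:

import numpy as np

# Convert per-class probabilities into hard class predictions for each output
pred_y1 = np.argmax(probabilities[0], axis=1)
pred_y2 = np.argmax(probabilities[1], axis=1)

# Compare with the classes returned by predict()
predictions = multi_output_classifier.predict(X_test)
print(np.array_equal(pred_y1, predictions[:, 0]))
print(np.array_equal(pred_y2, predictions[:, 1]))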

Conclusion

Overall, predict_proba in the MultiOutputClassifier wrapper allows us to obtain probability estimates for each output variable in a multioutput classification problem. This helps to evaluate the model’s confidence in its predictions for each dimension or target variable.
