Understanding predict_proba() from MultiOutputClassifier

The predict_proba method is commonly used in machine learning to obtain probability estimates for the different possible outcomes or classes of a classification problem. When working with a multioutput classification problem, such as using MultiOutputClassifier in scikit-learn, predict_proba can obtain probability estimates for each output variable or dimension.

Here are some key methods and concepts associated with predict_proba to help us understand this method in greater detail, starting with the definition of multioutput classification.

Note: To learn more about the scikit-learn library, check out this Answer.

What is multioutput classification?

Multioutput classification is a type of supervised learning in which we have multiple target variables, each with its own set of possible classes or labels. In multilabel classification, an input can belong to multiple classes simultaneously, or, in other words, a particular input can have multiple labels.

The `MultiOutputClassifier` method

MultiOutputClassifier is a wrapper in scikit-learn that allows us to apply a single classifier to each output variable in a multioutput classification problem. It treats each output variable as an independent binary classification problem and uses the specified base classifier.

The syntax for this wrapper is given below:

from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
# Define the number of samples
num_samples = 100
# Generate random features (X); here 4 random features are being generated
X = np.random.rand(num_samples, 4) 
# Generate random target values (y) for two output variables, each with three classes
num_classes = 3
y1 = np.random.randint(0, num_classes, size=num_samples)
y2 = np.random.randint(0, num_classes, size=num_samples)
# Create a sample training and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, np.column_stack((y1, y2)), test_size=0.2)
# Create a multioutput classifier using a base classifier (e.g., RandomForest)
base_classifier = RandomForestClassifier()
multi_output_classifier = MultiOutputClassifier(base_classifier)
# Fit the multioutput classifier to the training data
multi_output_classifier.fit(X_train, y_train)
# Use the predict_proba method to get probability estimates for each output variable
probabilities = multi_output_classifier.predict_proba(X_test)
print(probabilities)

Code explanation

The line-by-line code explanation is given below:

Lines 6–15: We start off by creating a multidimensional dataset since the MultiOutputClassifier wrapper can only work with multidimensional datasets. For this example, we take four features, having hundred samples overall. Also, we generate random label values for two output variables y1 and y2, each having three classes to choose from. Once the input data and output labels are generated, we split them into training and test datasets, having a 80-20 split.
Lines 17–19: Next, we use a base classifier to create a multioutput classifier. Here, we use the RandomForestClassifier to create the multioutput classifier.
Lines 21–25: Finally, we fit the training data to the multioutput classifier in order to get the probability estimates for two of the output variables using the predict_proba method. The test data is used for this purpose. The probabilites are stored in the probabilities variable, which is then printed.

Keep in mind that the structure of the probability arrays may vary depending on the specific classifier we are using. For example, some classifiers may return probabilities as an array of shape (n_samples, n_classes), while others may return them differently.

Output

If we look at the probabilites output, we can see that each row corresponds to each test sample having three different probability samples, which makes sense because we have three classes for the input dataset.

Understanding predict_proba() from MultiOutputClassifier

What is multioutput classification?

The `MultiOutputClassifier` method

Step 1: Create a sample dataset

Step 2: Create a multioutput classifier

Step 3: Fit the sample dataset

Code example

Code explanation

Output

Conclusion

Understanding predict_proba() from MultiOutputClassifier

What is multioutput classification?

The MultiOutputClassifier method

Step 1: Create a sample dataset

Step 2: Create a multioutput classifier

Step 3: Fit the sample dataset

Code example

Code explanation

Output

Conclusion

The `MultiOutputClassifier` method