The predict_proba method is commonly used in machine learning to obtain probability estimates for each possible class in a classification problem. In a multioutput classification problem, for example when using MultiOutputClassifier in scikit-learn, predict_proba returns probability estimates for each output variable.
Here are some key methods and concepts associated with predict_proba to help us understand this method in greater detail, starting with the definition of multioutput classification.
Multioutput classification is a type of supervised learning in which each sample has multiple target variables, each with its own set of possible classes or labels. Multilabel classification is the special case in which every target is binary, so an input can carry several labels at the same time.
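To make the distinction concrete, here is a minimal sketch of what the two kinds of target arrays can look like. The variable names, the seed, and the array sizes are illustrative assumptions and are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n_samples = 5

# Multioutput targets: one column per target variable, and each column
# may take more than two class values (here 0, 1, or 2).
y_multioutput = np.column_stack((
    rng.integers(0, 3, size=n_samples),  # target variable 1
    rng.integers(0, 3, size=n_samples),  # target variable 2
))

# Multilabel targets: one binary column per label (1 means the label applies).
y_multilabel = rng.integers(0, 2, size=(n_samples, 4))

print(y_multioutput.shape)  # (5, 2)
print(y_multilabel.shape)   # (5, 4)
```

In both cases the target is a two-dimensional array with one row per sample; only the set of values allowed in each column differs.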
MultiOutputClassifier wrapper
MultiOutputClassifier is a wrapper in scikit-learn that extends a single-output classifier to a multioutput classification problem. It treats each output variable as an independent classification problem and fits a separate copy of the specified base classifier for each one.
The syntax for this wrapper is given below:
multi_output_classifier = MultiOutputClassifier(base_classifier)
Note: The base_classifier argument can be any scikit-learn classifier, such as RandomForestClassifier. To call predict_proba on the wrapper, the base classifier itself must implement predict_proba.
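As a hedged illustration of how the wrapper handles each output separately, the sketch below fits it on random data and inspects the fitted estimators_ attribute. The data, the seed, and the n_estimators and random_state settings are assumptions made only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)  # illustrative seed
X = rng.random((100, 4))
y = rng.integers(0, 3, size=(100, 2))  # two outputs, three classes each

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X, y)

# After fitting, estimators_ holds one fitted copy of the base
# classifier per output column of y.
print(len(clf.estimators_))
for i, est in enumerate(clf.estimators_):
    print(i, type(est).__name__, est.classes_)
```

Each fitted copy keeps its own classes_ array, so different outputs are free to have different label sets.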
Now, let's discuss the steps to implement this method.
First, we create a sample dataset of 100 samples with four features and two output variables, where each output takes one of three possible class labels:
# Define the number of samples
num_samples = 100
# Generate random features (X); here 4 random features are being generated
X = np.random.rand(num_samples, 4)
# Generate random target values (y) for two output variables, each with three classes
num_classes = 3
y1 = np.random.randint(0, num_classes, size=num_samples)
y2 = np.random.randint(0, num_classes, size=num_samples)
A diagram visualizing the setup of the model is given below for a better understanding of this example:
After this, we split the dataset into training and test datasets in an 80-20 ratio using the train_test_split function:
X_train, X_test, y_train, y_test = train_test_split(X, np.column_stack((y1, y2)), test_size=0.2)
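The split can be checked by looking at the resulting shapes. The sketch below is a standalone variant of the step above; the random_state argument and the seeded generator are assumptions added here so the split is repeatable, and they are not used in the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # illustrative seed
X = rng.random((100, 4))
y = rng.integers(0, 3, size=(100, 2))  # both target columns stacked together

# test_size=0.2 keeps 20% of the rows for testing; random_state is an
# added assumption that makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
print(y_train.shape, y_test.shape)  # (80, 2) (20, 2)
```

Because the targets are passed as a single two-column array, the rows of both output variables stay aligned with their features after shuffling.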
Next, we initialize our multioutput classifier with a base classifier, which is RandomForestClassifier in this example.
# Create a multioutput classifier using a base classifier (e.g., RandomForest)
base_classifier = RandomForestClassifier()
multi_output_classifier = MultiOutputClassifier(base_classifier)
The base classifier is passed to MultiOutputClassifier, so the wrapper is now ready to be fitted to our sample dataset.
Next, we will use predict_proba. After fitting multi_output_classifier to the training data, we call the predict_proba method on the test data to generate class probabilities for each output variable:
# Fit the multioutput classifier to the training data
multi_output_classifier.fit(X_train, y_train)
# Use the predict_proba method to get probability estimates for each output variable
probabilities = multi_output_classifier.predict_proba(X_test)
print(probabilities)
Note: The predict_proba method takes only one input parameter, which is the feature data to predict on (here, the test data).
The complete code using the predict_proba method is given below:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Define the number of samples
num_samples = 100
# Generate random features (X); here 4 random features are being generated
X = np.random.rand(num_samples, 4)
# Generate random target values (y) for two output variables, each with three classes
num_classes = 3
y1 = np.random.randint(0, num_classes, size=num_samples)
y2 = np.random.randint(0, num_classes, size=num_samples)
# Create a sample training and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, np.column_stack((y1, y2)), test_size=0.2)

# Create a multioutput classifier using a base classifier (e.g., RandomForest)
base_classifier = RandomForestClassifier()
multi_output_classifier = MultiOutputClassifier(base_classifier)

# Fit the multioutput classifier to the training data
multi_output_classifier.fit(X_train, y_train)
# Use the predict_proba method to get probability estimates for each output variable
probabilities = multi_output_classifier.predict_proba(X_test)
print(probabilities)
The line-by-line code explanation is given below:
Lines 6–15: We start by creating a dataset with multiple target variables, since MultiOutputClassifier expects a two-dimensional target array with one column per output. For this example, we generate 100 samples with four features each, along with random labels for two output variables, y1 and y2, each drawn from three classes. Once the features and labels are generated, we split them into training and test datasets with an 80-20 split.
Lines 17–19: Next, we use a base classifier to create a multioutput classifier. Here, we use RandomForestClassifier as the base classifier.
Lines 21–25: Finally, we fit the multioutput classifier to the training data and then call the predict_proba method on the test data to get the probability estimates for both output variables. The probabilities are stored in the probabilities variable, which is then printed.
Keep in mind that the structure of the returned probabilities depends on the estimator. A single-output classifier typically returns one array of shape (n_samples, n_classes), while MultiOutputClassifier returns a list containing one such array per output variable.
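The structure of the result can be inspected directly. The following sketch repeats the setup with a seeded generator and a small forest; the seed and the n_estimators and random_state values are assumptions added only for repeatability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)  # illustrative seed
X = rng.random((100, 4))
y = rng.integers(0, 3, size=(100, 2))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_train, y_train)
probabilities = clf.predict_proba(X_test)

# The result is a list with one array per output variable.
print(type(probabilities).__name__, len(probabilities))
for i, p in enumerate(probabilities):
    # Each array has one row per test sample and one column per class,
    # and every row sums to 1.
    print(i, p.shape, np.allclose(p.sum(axis=1), 1.0))
```

Indexing the list first by output variable and then by sample, as in probabilities[0][5], gives the three class probabilities for the first output of the sixth test sample.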
If we look at the array for one output variable in the probabilities output, we can see that each row corresponds to one test sample and holds three probabilities, one per class. This makes sense because each output variable has three possible classes.
[[0.6, 0.4, 0. ],
 [0.3, 0.6, 0.1],
 [0. , 1. , 0. ],
 ...]
Overall, predict_proba in the MultiOutputClassifier wrapper allows us to obtain probability estimates for each output variable in a multioutput classification problem. This helps to evaluate the model's confidence in its predictions for each dimension or target variable.
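One possible way to turn these probabilities into labels and a simple confidence score is sketched below. Treating the highest class probability as the "confidence" is an assumption made for illustration, and the seed and forest settings are likewise added only for repeatability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)  # illustrative seed
X = rng.random((100, 4))
y = rng.integers(0, 3, size=(100, 2))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # list: one (n_samples, n_classes) array per output

# Most probable class per output: map each argmax column index back to a
# label through the matching estimator's classes_ array.
labels = np.column_stack([
    est.classes_[p.argmax(axis=1)] for est, p in zip(clf.estimators_, proba)
])

# Confidence per output (illustrative definition): the highest class probability.
confidence = np.column_stack([p.max(axis=1) for p in proba])

print(labels[:3])
print(confidence[:3])
```

With a random forest base classifier, these labels match what clf.predict(X_test) returns, since the forest's own predict also picks the class with the highest averaged probability.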