In scikit-learn, cross-validation splits the dataset into multiple folds, trains the model on all but one fold, and tests it on the held-out fold. This process repeats so that each fold serves as the test set once, and the results are averaged to assess model performance.
Key takeaways:
Cross-validation evaluates a machine learning model’s ability to generalize to unseen data.
Scikit-learn offers two main functions for cross-validation: cross_val_score and cross_validate.
cross_val_score uses a single metric to evaluate the model across multiple data splits.
cross_validate allows using multiple metrics for model evaluation.
Both functions help assess the model’s performance and consistency across different splits of the data.
Higher cross-validation scores indicate better model generalization.
Cross-validation is a machine learning technique used to evaluate the quality of a model and its ability to generalize. It helps assess how well the model performs on unseen data. Scikit-learn, also known as sklearn, is an open-source Python library for building and evaluating machine learning models.
In this Answer, we will learn how the sklearn Python library performs cross-validation on machine learning models and the benefits of doing so. We’ll analyze the functions that perform cross-validation on datasets.
The sklearn library offers several approaches to performing cross-validation on machine learning models. The functions we’ll be discussing are cross_val_score and cross_validate.
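The fold-by-fold procedure that these helper functions automate can be sketched by hand with KFold. The snippet below is an illustration under assumptions that are not part of this Answer's own examples: it uses synthetic data from make_regression, five shuffled folds, and a fixed random_state. It trains a fresh model on four folds, scores it on the held-out fold, and then checks that cross_val_score produces the same per-fold scores when given the same splitter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration only)
X, y = make_regression(n_samples=60, n_features=3, noise=5.0, random_state=0)

# Five shuffled folds; the fixed random_state makes the splits repeatable
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])  # train on four folds
    # Score on the held-out fold; for a regressor, score() returns R^2
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
fold_scores = np.array(fold_scores)

# The helper function runs the same loop internally
helper_scores = cross_val_score(LinearRegression(), X, y, cv=kfold)

print(f"manual mean R^2: {fold_scores.mean():.3f}")
print(f"same as cross_val_score: {np.allclose(fold_scores, helper_scores)}")
```

Writing the loop out once makes it clearer what the averaged score represents: each data point is used for testing exactly once and for training in every other iteration.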
The cross_val_score function
The cross_val_score function cross-validates an estimator on a dataset. An estimator is an object that represents the machine learning model being trained. The dataset is the collection of data on which the model is trained and tested.
The cross_val_score function accepts several arguments. The five most commonly used are as follows:
Estimator instance (estimator): An instance of the model being trained.
Dataset features matrix (X): A 2D array with one row per data point and one column per feature.
Dataset labels (y): The labels the model is trying to predict.
Iterator (cv): The cross-validation splitting strategy. If it is an integer, it sets the number of folds, and each iteration holds out a different fold for testing.
scoring: The metric used for evaluation. The score method of the estimator is used by default. To use a different metric, specify it as the scoring parameter.
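As a rough sketch of how the cv and scoring arguments described above behave, the snippet below passes cv first as an integer and then as a KFold splitter object, and compares the default scoring with an explicit 'r2'. The synthetic dataset from make_regression and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration only)
X, y = make_regression(n_samples=50, n_features=3, noise=10.0, random_state=42)
model = LinearRegression()

# cv as an integer: the number of folds
scores_int = cross_val_score(model, X, y, cv=5)

# cv as a splitter object: here, five folds with shuffling
splitter = KFold(n_splits=5, shuffle=True, random_state=42)
scores_splitter = cross_val_score(model, X, y, cv=splitter)

# scoring omitted: the estimator's score method is used,
# which is R^2 for LinearRegression, so 'r2' gives the same values
scores_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")

print(f"integer cv:  {scores_int.mean():.3f}")
print(f"splitter cv: {scores_splitter.mean():.3f}")
print(f"default scoring equals 'r2': {np.allclose(scores_int, scores_r2)}")
```

Passing a splitter object is useful when the rows of the dataset are ordered, since an integer cv for a regressor splits the data into consecutive blocks without shuffling.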
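The example below shows how cross_val_score uses a single metric, r2, in its cross-validation process. The r2 metric (the coefficient of determination) measures how much of the variation in the labels the linear regression model explains on the held-out data. Values closer to 1 indicate a better fit.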
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Sample features and label data
X = np.array([[1000, 1556, 6566], [2066, 2445, 7665], [3450, 3325, 8365], [4130, 4465, 9524], [5023, 5465, 9645]])
y = np.array([2230, 3560, 6405, 7560, 9302])

# Instance of linear regression
lri = LinearRegression()

# Cross-validation of the estimator with 2 folds
scores = cross_val_score(lri, X, y, cv=2, scoring='r2')

print(f"test_r2: {scores.mean():.2f} with standard deviation {scores.std():.2f}")

Lines 1–3: Import numpy, and import cross_val_score and LinearRegression from sklearn.
Lines 6–7: Define the datasets X and y. Linear regression works best when the labels depend roughly linearly on the features.
Line 10: Define lri as the instance of the estimator, LinearRegression.
Line 13: Calculate the scores with cross_val_score using two folds and the r2 metric.
Line 15: Print the mean and standard deviation of the scores.
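As a side note on the r2 metric used in this example, it is the coefficient of determination: one minus the ratio of the squared prediction errors to the squared deviations of the labels from their mean. The sketch below computes it by hand and compares the result with sklearn's r2_score. The y_true and y_pred values are made up for illustration and are unrelated to the dataset above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up labels and predictions (assumptions for illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Sum of squared prediction errors
ss_res = np.sum((y_true - y_pred) ** 2)
# Sum of squared deviations of the labels from their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(f"manual r2:  {r2_manual:.3f}")
print(f"sklearn r2: {r2_score(y_true, y_pred):.3f}")
```

Because the error term is unbounded, r2 can be negative when a model predicts worse than simply returning the mean of the labels, which is worth keeping in mind when cross-validating on very small datasets.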
The cross_validate function
The cross_validate function of the sklearn library lets us specify multiple metrics while evaluating the model. While cross_val_score accepts a single metric name for its scoring parameter, cross_validate also accepts a list of metric names for scoring.
The cross_validate function takes the same main parameters as cross_val_score, but it returns a dictionary with an array of scores for each metric rather than a single array. Here is a demonstration of how to use multiple metrics to test the model.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

# Sample features and label data
X = np.array([[1000, 1556, 6566], [2066, 2445, 7665], [3450, 3325, 8365], [4130, 4465, 9524], [5023, 5465, 9645]])
y = np.array([2230, 3560, 6405, 7560, 9302])

# Instance of linear regression
lri = LinearRegression()

scoring = ['r2', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error']

# Cross-validation of the estimator with 2 folds
scores = cross_validate(lri, X, y, cv=2, scoring=scoring)

print(f"test_r2: {scores['test_r2'].mean():.2f} with standard deviation {scores['test_r2'].std():.2f}")
print(f"neg_mean_absolute_error: {scores['test_neg_mean_absolute_error'].mean():.2f} with standard deviation {scores['test_neg_mean_absolute_error'].std():.2f}")
print(f"neg_mean_absolute_percentage_error: {scores['test_neg_mean_absolute_percentage_error'].mean():.2f} with standard deviation {scores['test_neg_mean_absolute_percentage_error'].std():.2f}")

Lines 1–3: Import numpy, and import cross_validate and LinearRegression from sklearn.
Lines 6–7: Define the datasets X and y. Linear regression works best when the labels depend roughly linearly on the features.
Line 10: Define lri as the instance of the estimator, LinearRegression.
Line 12: Define the scoring list to hold the names of the metrics r2, neg_mean_absolute_error, and neg_mean_absolute_percentage_error.
Line 15: Calculate the scores with cross_validate using two folds and the metrics in the scoring list.
Lines 17–19: Print the mean and standard deviation of each metric in the scores dictionary. The keys under which the values are stored are test_r2, test_neg_mean_absolute_error, and test_neg_mean_absolute_percentage_error. The error metrics are negated so that, as with r2, a higher value is better.
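To make the structure of the returned dictionary more concrete, the sketch below lists its keys and converts the negated error back into an ordinary mean absolute error. It assumes synthetic data from make_regression and five folds, and it sets return_train_score=True so that training scores are included as well. None of these choices come from the example above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data (an assumption for illustration only)
X, y = make_regression(n_samples=50, n_features=3, noise=10.0, random_state=0)

results = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=5,
    scoring=["r2", "neg_mean_absolute_error"],
    return_train_score=True,  # also report scores on the training folds
)

# Timing entries plus one test_* and one train_* entry per metric
print(sorted(results.keys()))

# Flip the sign to recover the ordinary (positive) mean absolute error
mae_per_fold = -results["test_neg_mean_absolute_error"]
print(f"MAE: {mae_per_fold.mean():.2f} with standard deviation {mae_per_fold.std():.2f}")
```

Comparing the train_ and test_ entries for the same metric is one simple way to spot overfitting: a large gap suggests the model fits the training folds much better than the held-out fold.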
To sum up, two functions perform basic cross-validation on a dataset. The cross_val_score function evaluates the model against a single metric. The cross_validate function, on the other hand, accepts multiple metrics in the form of a list. Choosing between them comes down to deciding which metrics the evaluation needs. The results of these functions help us evaluate the generalization ability of the model being trained. The higher the score, the more likely the model is to perform well on unseen data.