XGBoost (eXtreme Gradient Boosting) is a well-known and powerful machine-learning library commonly used for classification and regression tasks, and it provides built-in support for cross-validation.
The xgb.cv() function
Cross-validation helps evaluate machine learning models by testing the model's performance on unseen data while avoiding overfitting. The xgb.cv() function runs k-fold cross-validation on a given dataset to effectively estimate model performance and tune hyperparameters. By averaging the performance across multiple folds, it reduces the impact of data randomness.
Here is the basic syntax of the xgb.cv() function:
xgb.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)
params is a required parameter representing a dictionary of XGBoost hyperparameters.
dtrain is a required parameter representing the DMatrix training data.
num_boost_round is an optional parameter representing the number of boosting rounds (iterations); it defaults to 10.
nfold is an optional parameter representing the number of folds for cross-validation; it defaults to 3.
early_stopping_rounds is an optional parameter; if specified, training stops early when the performance does not improve for this many rounds.
seed is an optional parameter representing a random seed for reproducibility.
metrics is an optional parameter representing a tuple or list of evaluation metrics to use during cross-validation.
stratified is an optional parameter that tells whether to perform stratified sampling for cross-validation.
obj is an optional parameter representing a custom objective function to be optimized during training.
feval is an optional parameter representing a custom evaluation function used to calculate additional evaluation metrics during training.
maximize is an optional parameter that tells whether to maximize the evaluation metric.
fpreproc is an optional parameter representing a function that preprocesses the data before training, which can be used to modify the DMatrix.
as_pandas is an optional parameter that tells whether to return the cross-validation results as a pandas DataFrame.
verbose_eval is an optional parameter that controls the verbosity of the evaluation results.
show_stdv is an optional parameter that tells whether to display the standard deviation of the evaluation results during cross-validation.
callbacks is an optional parameter representing custom callback functions that can be used to customize the training process.
shuffle is an optional parameter that tells whether to shuffle the data before splitting it into folds for cross-validation. An example call that combines several of these optional parameters is sketched below.
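As a rough sketch (not an exhaustive configuration), here is how several of these optional parameters might be combined in a single xgb.cv() call. The dataset and hyperparameter values are arbitrary placeholders chosen only for illustration:

import numpy as np
import xgboost as xgb

# Small synthetic binary-classification dataset (illustrative values only)
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = rng.integers(0, 2, 100)
dtrain = xgb.DMatrix(X, label=y)

params = {'objective': 'binary:logistic', 'max_depth': 3, 'learning_rate': 0.1}

cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=100,        # upper bound on boosting rounds
    nfold=5,                    # 5-fold cross-validation
    stratified=True,            # keep class proportions similar across folds
    metrics='logloss',          # evaluation metric to track
    early_stopping_rounds=10,   # stop if the test metric stalls for 10 rounds
    as_pandas=True,             # return the results as a pandas DataFrame
    verbose_eval=False,         # suppress per-round logging
    seed=0,                     # fix the fold assignment for reproducibility
    shuffle=True,               # shuffle rows before creating the folds
)
print(cv_results.tail())        # last few rounds that were actually run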
Note: Make sure you have the XGBoost library installed (for example, via pip install xgboost) before running the code below.
Let's demonstrate the use of xgb.cv() with the following code sample:
import xgboost as xgb
import numpy as np

#Creating a smaller synthetic dataset
np.random.seed(42)
X = np.random.rand(50, 3)
y = np.random.randint(0, 2, 50)

#Converting the data to DMatrix
data = xgb.DMatrix(X, label=y)

#Hyperparameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1,
}

#Performing cross-validation
cv_results = xgb.cv(params, data, num_boost_round=10, nfold=3,
                    metrics='logloss', seed=42)

#Printing the results
print(cv_results)
Line 1–2: Firstly, we import the necessary xgb and np modules.
Line 5–7: Now, we create a smaller synthetic dataset with 50 samples and 3 features for our convenience using the random.rand() and random.randint() functions. The variable y is binary, having values 0 or 1.
Line 10: Now, we use xgb.DMatrix() to convert the numpy arrays X and y into a DMatrix named data. A quick optional sanity check of these arrays and the resulting DMatrix is sketched after this walkthrough.
Line 13–17: We create a dictionary named params containing the hyperparameters for our XGBoost model. We set the objective to binary:logistic (logistic regression for binary classification), the maximum depth of each tree to 3, and the learning rate to 0.1.
Line 20: Here, we call the xgb.cv() function with the specified hyperparameters and data, along with other parameters: num_boost_round=10, the number of cross-validation folds nfold=3, and the evaluation metric set to logloss.
Line 24: Finally, we print cv_results, a DataFrame containing the cross-validation results, to the console.
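As a small optional check (not part of the original snippet), you can confirm the shapes of the synthetic arrays and the dimensions of the resulting DMatrix before running cross-validation. This assumes X, y, and data were created exactly as in the example above:

# Assumes X, y, and data were created as in the example above
print(X.shape)                           # (50, 3): 50 samples, 3 features
print(np.unique(y))                      # [0 1]: binary labels
print(data.num_row(), data.num_col())    # 50 3: rows and columns in the DMatrix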
Upon execution, the code will display a table containing cross-validation results for each boosting round and evaluate the model’s performance using log loss.
The output looks something like this:
   train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0            0.636732           0.002798           0.672943          0.010068
1            0.602735           0.004768           0.647836          0.015543
2            0.576188           0.005845           0.628812          0.018202
3            0.553682           0.006804           0.613275          0.021482
4            0.535220           0.006787           0.600717          0.023461
5            0.518873           0.006657           0.590129          0.024623
6            0.504942           0.006400           0.580991          0.025339
7            0.492948           0.006396           0.573358          0.025906
8            0.482799           0.006090           0.567531          0.026454
9            0.473649           0.006044           0.563117          0.026462
We can see that the table has four columns that show the log loss values for the training and test sets at each boosting round. The model's performance and variance are estimated using the mean and standard deviation over several cross-validation folds.
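Since xgb.cv() returns a regular pandas DataFrame by default, the results can be queried directly. For instance, a sketch like the following (not part of the original example) picks out the boosting round with the lowest mean test log loss:

# Assumes cv_results is the DataFrame returned by xgb.cv() above
best_round = cv_results['test-logloss-mean'].idxmin()
best_score = cv_results['test-logloss-mean'].min()
print(f"Best round: {best_round}, test logloss: {best_score:.6f}")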
Overall, the xgb.cv() function is a useful tool for evaluating the performance of XGBoost models with cross-validation. It provides significant insight into the model's performance and guides choices such as the number of boosting rounds and folds. By offering numerous evaluation metrics and options, it helps us develop robust and accurate machine-learning models with XGBoost.
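As a natural next step (a hedged sketch, not covered above), once cross-validation suggests a suitable number of boosting rounds, a final model might be trained on the full dataset with xgb.train(), reusing the params and data objects from the example:

# Train a final model for as many rounds as the cross-validation ran
num_rounds = len(cv_results)                 # 10 rounds in this example
final_model = xgb.train(params, data, num_boost_round=num_rounds)

# Predicted probabilities for the positive class on the training data
preds = final_model.predict(data)
print(preds[:5])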