XGBoost (eXtreme Gradient Boosting) is a popular open-source machine learning library known for its remarkable performance across a wide range of machine learning tasks.
xgb.DMatrix() function
The xgboost.DMatrix() function creates a specialized data structure called DMatrix (short for Data Matrix). This data structure is optimized for memory efficiency and faster computation, making it ideal for large-scale datasets.
Once the DMatrix is created, it can be used directly to train XGBoost models for tasks such as classification or regression. It can also be used in cross-validation and hyperparameter tuning of models.
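For instance, here is a minimal sketch (using synthetic NumPy data, assumed purely for illustration) of how a DMatrix plugs into both xgb.train() and xgb.cv():
import numpy as np
import xgboost as xgb

# Synthetic regression data, assumed purely for illustration
X = np.random.rand(100, 5)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)

# Train a small regression model directly on the DMatrix
params = {"objective": "reg:squarederror", "max_depth": 3}
booster = xgb.train(params, dtrain, num_boost_round=10)

# The same DMatrix also works with built-in cross-validation
cv_results = xgb.cv(params, dtrain, num_boost_round=10, nfold=3)
print(cv_results.tail(1))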
The syntax of the xgb.DMatrix() function is given below:
xgb.DMatrix(data, label=None, weight=None, base_margin=None, missing=None,
            silent=False, feature_names=None, feature_types=None, nthread=None,
            group=None, qid=None, label_lower_bound=None, label_upper_bound=None,
            feature_weights=None, enable_categorical=False,
            data_split_mode=DataSplitMode.ROW)
data is a required parameter representing the input data for the DMatrix.
label is an optional parameter that specifies the target labels for the training data.
weight is an optional parameter representing the weight for each instance.
base_margin is an optional parameter that specifies the initial prediction score for the model.
missing is an optional parameter representing the value in the data to be treated as missing. By default, it is set to None.
silent is an optional parameter that controls whether messages are printed during DMatrix creation. It is set to False by default.
feature_names is an optional parameter representing a list of feature names, which will be used to name the columns of the DMatrix.
feature_types is an optional parameter representing a list of strings that specify the types of features. It can be 'int', 'float', 'i', 'q', 'u', or 's'.
nthread is an optional parameter representing the number of threads to use for converting the data. If not specified, the maximum number of available threads is used.
group is an optional parameter representing the group or query ID for ranking tasks.
qid is an optional parameter representing a query ID for ranking tasks, similar to the group parameter.
label_lower_bound is an optional parameter representing the lower bound of the label values.
label_upper_bound is an optional parameter representing the upper bound of the label values.
feature_weights is an optional parameter representing a weight for each feature.
enable_categorical is an optional parameter. If set to True, categorical features are treated as such during training and prediction.
data_split_mode is an optional parameter specifying how data splits are performed when using different data containers. The default is DataSplitMode.ROW.
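To make a few of these parameters concrete, here is a small sketch, with toy data and column names assumed purely for illustration, that passes missing, weight, and feature_names when building a DMatrix:
import numpy as np
import xgboost as xgb

# Toy data where -999.0 stands in for missing values (an assumed convention)
X = np.array([[1.0, -999.0], [2.0, 3.0], [4.0, 5.0]])
y = np.array([0.5, 1.5, 2.5])

dmat = xgb.DMatrix(
    X,
    label=y,
    missing=-999.0,                    # treat -999.0 as a missing value
    weight=np.array([1.0, 2.0, 1.0]),  # per-instance weights
    feature_names=["age", "score"],    # hypothetical column names
)
print(dmat.feature_names)
Similarly, a hedged sketch of enable_categorical, which assumes a pandas DataFrame whose categorical column uses the category dtype:
import pandas as pd
import xgboost as xgb

# A toy DataFrame with one categorical and one numeric column
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "red", "blue"]),
    "size": [1.0, 2.0, 3.0, 4.0],
})
y = [0, 1, 0, 1]

# enable_categorical=True lets the DMatrix keep the category dtype
dmat = xgb.DMatrix(df, label=y, enable_categorical=True)
print(dmat.feature_types)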
Note: Make sure you have the XGBoost library installed on your system before running the examples below.
Let's illustrate the use of xgb.DMatrix() with a basic code example using the diabetes dataset:
import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Loading the diabetes dataset
data = load_diabetes()
X, y = data.data, data.target

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a DMatrix for training and testing data
D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

# Printing basic information about the DMatrix
print("Number of training samples in DMatrix:", D_train.num_row())
print("Number of features in DMatrix:", D_train.num_col())
Line 1–3: First, we import the necessary modules: xgboost as xgb, pandas as pd, and the load_diabetes function from the sklearn.datasets module to load the dataset.
Line 4: Next, we import the train_test_split function from the sklearn.model_selection module to split the dataset into training and test sets.
Line 7: Now, we fetch and store the diabetes dataset in the data variable.
Line 8: In this line, we separate the features X and target labels y from the loaded dataset.
Line 11: Here, we split the data into training and test sets using train_test_split. It takes the features X and target labels y as input and splits them. The test set size is 0.2, which makes up 20% of the whole dataset, and the random state is 42 for reproducible splits.
Line 14–15: In these lines, we create two instances of xgb.DMatrix(), passing the feature data and target labels of the training and test sets (X_train, y_train and X_test, y_test) separately.
Line 18–19: Finally, we print the number of samples and features in the training DMatrix using the num_row() and num_col() methods of the DMatrix object.
Upon execution, the code will show the number of samples and features in the DMatrix created from the diabetes dataset.
The output looks like this:
Number of training samples in DMatrix: 353
Number of features in DMatrix: 10
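From here, the same DMatrix objects can feed straight into training and prediction. A hedged continuation of the example above, with hyperparameters chosen arbitrarily for illustration:
from sklearn.metrics import mean_squared_error

# Train a regression model on the training DMatrix
# (objective and hyperparameters are illustrative, not tuned)
params = {"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1}
model = xgb.train(params, D_train, num_boost_round=50)

# Predict on the test DMatrix and report a simple error metric
preds = model.predict(D_test)
print("Test MSE:", mean_squared_error(y_test, preds))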
Overall, the xgboost.DMatrix() function is a vital part of XGBoost: it stores data in a memory-efficient, computation-friendly format that improves performance and simplifies the training process, especially for large-scale machine learning models. This makes it essential for unlocking XGBoost's potential in real-world machine learning applications.