XGBoost is a gradient boosting algorithm used in supervised machine learning. It belongs to the family of ensemble learning techniques, which combine the predictions of multiple models to produce a more accurate and robust final prediction. By combining the strengths of decision trees and boosting, XGBoost excels at handling structured, tabular data and often outperforms traditional ensemble methods such as Random Forests.
Note: XGBoost is used for both classification and regression tasks.
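As a quick illustration, here is a minimal sketch using the Python xgboost package (assumed to be installed); the exam scores mirror the sample dataset used below, while the pass/fail labels for the classifier are purely hypothetical:

```python
import numpy as np
from xgboost import XGBRegressor, XGBClassifier

X = np.array([[85, 90], [75, 65], [95, 85], [80, 70]])  # Exam1, Exam2 scores
y_reg = np.array([92, 78, 96, 85])                      # Final exam score (regression target)
y_clf = np.array([1, 0, 1, 0])                          # hypothetical pass/fail labels (classification target)

# The same library covers both regression and classification.
reg = XGBRegressor(n_estimators=10, learning_rate=0.3).fit(X, y_reg)
clf = XGBClassifier(n_estimators=10, learning_rate=0.3).fit(X, y_clf)

print(reg.predict(X))  # continuous scores
print(clf.predict(X))  # class labels
```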
Here's a step-by-step explanation of how this process works:
XGBoost first initializes the model with a single decision tree. This initial tree typically consists of a single leaf node and predicts the average target value of the training data. The prediction of the initial tree can be represented as:

$$\hat{y}^{(0)} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

where $y_i$ is the actual target value of the $i$-th training example and $n$ is the number of training examples.

In the case of a classification problem, the mode of the target values is taken as the initial prediction.
Let's consider a sample dataset to understand the algorithm.
| Exam1 | Exam2 | Final exam |
|-------|-------|------------|
| 85    | 90    | 92         |
| 75    | 65    | 78         |
| 95    | 85    | 96         |
| 80    | 70    | 85         |
The initial prediction for this model is calculated as:

$$\hat{y}^{(0)} = \frac{92 + 78 + 96 + 85}{4} = 87.75$$
The initial tree will look like this:
After initializing the model, we predict the output for the complete sample data. Here are the predictions for our example:

| Exam1 | Exam2 | Final exam | Initial prediction |
|-------|-------|------------|--------------------|
| 85    | 90    | 92         | 87.75              |
| 75    | 65    | 78         | 87.75              |
| 95    | 85    | 96         | 87.75              |
| 80    | 70    | 85         | 87.75              |
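To make the arithmetic concrete, here is a small NumPy sketch of this initialization step; the values match the table above:

```python
import numpy as np

final_exam = np.array([92, 78, 96, 85])

# Initial prediction: the average of the target values (a single-leaf tree)
initial_prediction = final_exam.mean()
print(initial_prediction)  # 87.75

# Every row receives the same initial prediction
initial_predictions = np.full(len(final_exam), initial_prediction)
print(initial_predictions)  # [87.75 87.75 87.75 87.75]
```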
Next, we compute the residuals for the sample data. Residuals are the differences between the actual target values and the current predictions. The residual for each data point $i$ is given by:

$$r_i = y_i - \hat{y}_i$$

where $y_i$ is the actual target value and $\hat{y}_i$ is the predicted value for the $i$-th data point.
Here are the residuals for our dataset:
| Exam1 | Exam2 | Final exam | Predicted final exam | Residual |
|-------|-------|------------|----------------------|----------|
| 85    | 90    | 92         | 87.75                | 4.25     |
| 75    | 65    | 78         | 87.75                | -9.75    |
| 95    | 85    | 96         | 87.75                | 8.25     |
| 80    | 70    | 85         | 87.75                | -2.75    |
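The same residual computation in NumPy:

```python
import numpy as np

actual = np.array([92, 78, 96, 85])     # actual final exam scores
predicted = np.full(4, 87.75)           # current (initial) predictions

# Residual: actual target minus current prediction
residuals = actual - predicted
print(residuals)  # [ 4.25 -9.75  8.25 -2.75]
```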
XGBoost then fits a new decision tree to predict the error of the initial tree, using the residuals from the previous step as the target. To fit the new tree, we pass every data row through it and record the residuals on the leaf nodes.
Now, for every leaf node with more than one entry, we take the average of its residuals. In our example, one leaf holds the residuals 4.25 and 8.25 (average 6.25), and the other holds -9.75 and -2.75 (average -6.25).
The tree will be updated as follows:
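To illustrate this step, the sketch below fits a depth-one regression tree from scikit-learn to the residuals as a stand-in for XGBoost's own tree construction (which uses regularized, gradient-based split scoring instead). With these four rows, the best split separates the two lower-scoring students from the two higher-scoring ones, so the leaf values are the averages -6.25 and 6.25:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[85, 90], [75, 65], [95, 85], [80, 70]])  # Exam1, Exam2
residuals = np.array([4.25, -9.75, 8.25, -2.75])

# Fit a shallow tree to the residuals; each leaf stores the average residual
# of the rows that fall into it.
residual_tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
print(residual_tree.predict(X))  # [ 6.25 -6.25  6.25 -6.25]
```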
The new decision tree's predictions are combined with the predictions of the previous model, making the model more accurate. The updated prediction is the sum of the previous prediction and a fraction of the prediction gained from the new decision tree:

$$\hat{y}^{(t)} = \hat{y}^{(t-1)} + \eta \cdot f_t(x)$$

where $\hat{y}^{(t-1)}$ is the previous prediction, $f_t(x)$ is the prediction of the new tree, and $\eta$ is the learning rate.
To control the influence of each new tree, XGBoost introduces a learning rate. The learning rate scales the contribution of each tree, preventing the model from overfitting too quickly.
Let's take 0.3 as the learning rate for our example. The updated model then becomes:

$$\hat{y}^{(1)} = \hat{y}^{(0)} + 0.3 \cdot f_1(x)$$
For our set of sample data, the updated predictions will become:
| Exam1 | Exam2 | Final exam | Updated prediction          |
|-------|-------|------------|------------------------------|
| 85    | 90    | 92         | 87.75 + 0.3(6.25) = 89.625   |
| 75    | 65    | 78         | 87.75 + 0.3(-6.25) = 85.875  |
| 95    | 85    | 96         | 87.75 + 0.3(6.25) = 89.625   |
| 80    | 70    | 85         | 87.75 + 0.3(-6.25) = 85.875  |
Note that the updated predictions are closer to the actual final exam scores than the initial predictions were.
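The same update in NumPy, using the leaf averages from the residual tree and a learning rate of 0.3:

```python
import numpy as np

initial_predictions = np.full(4, 87.75)
tree_predictions = np.array([6.25, -6.25, 6.25, -6.25])  # leaf averages from the residual tree
learning_rate = 0.3

# Updated prediction = previous prediction + learning_rate * new tree's prediction
updated_predictions = initial_predictions + learning_rate * tree_predictions
print(updated_predictions)  # [89.625 85.875 89.625 85.875]
```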
The boosting process continues iteratively, repeating the residual computation, tree fitting, and prediction update steps until certain termination conditions are met. These conditions can be a maximum number of boosting rounds (iterations) or a minimum threshold for performance improvement.
On each iteration, a new tree is added to the model, and each successive tree tries to predict and correct the error of the model built before it. Here is an overview of what the model looks like after termination:
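Putting the steps together, here is a simplified sketch of the whole boosting loop. It uses plain scikit-learn trees and squared-error residuals rather than XGBoost's regularized, second-order formulation, so it illustrates the idea rather than the actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[85, 90], [75, 65], [95, 85], [80, 70]])
y = np.array([92.0, 78.0, 96.0, 85.0])

learning_rate = 0.3
max_rounds = 50
tolerance = 1e-3

predictions = np.full(len(y), y.mean())  # initialize with the average target value
trees = []

for _ in range(max_rounds):
    residuals = y - predictions                                  # errors of the current model
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residuals)  # fit a new tree to the residuals
    predictions += learning_rate * tree.predict(X)               # scaled prediction update
    trees.append(tree)
    if np.abs(residuals).max() < tolerance:                      # simple termination condition
        break

print(predictions)  # approaches the actual final exam scores as rounds accumulate
```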
Several design features contribute to XGBoost's effectiveness:

- Ensemble learning: XGBoost combines the predictions of multiple weak learners (decision trees) to create a strong predictive model.
- Tree pruning: XGBoost applies pruning techniques to limit the depth of trees, which reduces overfitting and improves generalization.
- Handling missing values: XGBoost can handle missing values during tree construction without the need for imputation (see the sketch after this list).
- Feature importance and selection: XGBoost automatically handles feature selection and can capture feature interactions, making it effective on high-dimensional datasets.
- Parallel and distributed computing: XGBoost is optimized for parallel and distributed computing, making it efficient for large-scale datasets.
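As a quick illustration of the missing-value and feature-importance points, here is a hypothetical sketch in which one Exam2 score is replaced by NaN; the xgboost package trains on it directly and exposes relative importances afterward:

```python
import numpy as np
from xgboost import XGBRegressor

# Hypothetical data: one missing Exam2 score (np.nan). XGBoost learns a default
# direction for missing values at each split, so no imputation is required.
X = np.array([[85, 90], [75, np.nan], [95, 85], [80, 70]])
y = np.array([92, 78, 96, 85])

model = XGBRegressor(n_estimators=20, learning_rate=0.3).fit(X, y)
print(model.feature_importances_)  # relative importance of Exam1 and Exam2
```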
XGBoost is a popular machine learning algorithm because of its performance, scalability, and interpretability. It efficiently handles structured data, provides valuable feature-importance insights, and excels at complex predictive modeling tasks. Understanding XGBoost can give us a competitive edge in solving real-world problems and making more informed, data-driven decisions.