Regularization is a general technique used in supervised machine learning to prevent a model from overfitting, that is, from binding too tightly to patterns that exist only in the training set.
The training data provided to the model may contain numerous features that genuinely determine the output or predicted value. For instance, a dataset for training a land-value prediction model might include real-estate features such as the number of nearby schools, hospitals, and malls, access to electricity, and so on.
Similarly, some features may be of minimal importance to the regression model, such as the average number of pet owners in the area. It is essential that the trained model does not depend on features whose apparent effect is only an artifact of the particular training dataset.
In simple terms, the average number of pet owners in an area might correspond to a lower property value in the subset of data used for training, but have no correlation with property value in the broader real-estate market.
This poses a problem: the model becomes overfitted and does not generalize well enough to make reasonable predictions on instances from the test dataset. In fact, accuracy during the training phase will look impressively high, while accuracy during the test phase will be drastically lower.
This calls for regularization.
Regularization counters this overfitting by penalizing the parameter coefficients. The cost function is modified so that each parameter is penalized according to the magnitude of its coefficient. As a result, every coefficient is shrunk and its influence on the prediction is reduced, which keeps the model from overfitting.
Lasso, or L1, regularization alters the cost function as per the equation shown below, where the mean squared error is augmented with the L1 norm of the parameter vector ($y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $w_j$ are the model parameters):

$$J(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m} \lvert w_j \rvert$$
Similarly, Ridge, or L2, regularization has a cost function that penalizes the parameter coefficients with the squared L2 norm:

$$J(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{m} w_j^2$$
The regularization coefficient, $\lambda$, controls the strength of the penalty: the larger its value, the more aggressively the coefficients are shrunk, while $\lambda = 0$ recovers the unregularized cost.
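As a concrete illustration, the sketch below is a minimal NumPy example (the function and variable names are ours, not from any particular library) that computes the mean squared error together with either the L1 or the L2 penalty from the equations above.

```python
import numpy as np

def regularized_cost(X, y, w, lam, penalty="l1"):
    """Mean squared error plus an L1 or L2 penalty on the weights."""
    residuals = X @ w - y                 # prediction errors
    mse = np.mean(residuals ** 2)         # data-fit term
    if penalty == "l1":                   # Lasso: sum of absolute weights
        reg = lam * np.sum(np.abs(w))
    else:                                 # Ridge: sum of squared weights
        reg = lam * np.sum(w ** 2)
    return mse + reg

# Tiny synthetic example: 5 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=5)
w = np.array([1.5, 0.3, -0.8])

print(regularized_cost(X, y, w, lam=0.1, penalty="l1"))
print(regularized_cost(X, y, w, lam=0.1, penalty="l2"))
```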
L1 regularization causes coefficients to converge to 0 rather quickly because its constrained form restricts the weight vector to lie within an L1 ball, whose corners sit on the coordinate axes, so the optimum tends to land where some weights are exactly zero.
The rate of convergence to zero is higher for L1 because the first derivative of its penalty term is simply $\lambda\,\mathrm{sign}(w_j)$, a push of constant magnitude towards zero, whereas the derivative of the L2 penalty is $2\lambda w_j$, which shrinks as the weight itself shrinks.
Adding the derivative of the mean squared error term, the gradient of the full cost function becomes:

$$\frac{\partial J}{\partial w_j} = \frac{\partial \mathrm{MSE}}{\partial w_j} + \lambda\,\mathrm{sign}(w_j) \quad \text{(L1)}, \qquad \frac{\partial J}{\partial w_j} = \frac{\partial \mathrm{MSE}}{\partial w_j} + 2\lambda w_j \quad \text{(L2)}$$

As per the equation of gradient descent with a learning rate of $\alpha$, the parameter updates are:

$$w_j \leftarrow w_j - \alpha\left(\frac{\partial \mathrm{MSE}}{\partial w_j} + \lambda\,\mathrm{sign}(w_j)\right) \quad \text{(L1)}$$

$$w_j \leftarrow w_j - \alpha\left(\frac{\partial \mathrm{MSE}}{\partial w_j} + 2\lambda w_j\right) \quad \text{(L2)}$$
From the equations above, it can be deduced that the L1 update subtracts a fixed amount $\alpha\lambda$ at every step, regardless of how small the weight has already become, so weights are driven to exactly zero quickly; the L2 update shrinks a weight in proportion to its current value and therefore only approaches zero asymptotically.
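The toy sketch below (again our own NumPy code, not a library routine) applies only the penalty part of the two update rules to a single weight, to isolate their effect: the L1 step lands exactly on zero after a finite number of iterations, while the L2 step merely decays towards it.

```python
import numpy as np

def descend(w, lam, alpha, penalty, steps=200):
    """Gradient descent on the penalty term alone, to isolate its effect."""
    history = [w]
    for _ in range(steps):
        if penalty == "l1":
            # Fixed pull of alpha*lam towards zero; if the weight is already
            # within that distance of zero, it lands exactly on 0.
            if abs(w) <= alpha * lam:
                w = 0.0
            else:
                w = w - alpha * lam * np.sign(w)
        else:
            # L2 step: shrink proportionally; never reaches exactly zero.
            w = w - alpha * (2 * lam * w)
        history.append(w)
    return history

l1_path = descend(1.0, lam=0.5, alpha=0.1, penalty="l1")
l2_path = descend(1.0, lam=0.5, alpha=0.1, penalty="l2")
print(l1_path[30])   # exactly 0.0 by this point
print(l2_path[30])   # small, but still nonzero
```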
Because L1 decays the parameter values of uninformative features strictly to zero, it can be used for feature selection: eliminating unnecessary features from the input space speeds up training, decreases the chance of overfitting, and consequently saves time.
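As a practical sketch of this idea, the example below uses scikit-learn's `Lasso` estimator on synthetic data (the dataset and the `alpha` value are arbitrary choices of ours): the coefficients of the irrelevant features are driven to zero, and the surviving ones identify the features worth keeping.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 samples, 5 features, but only features 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # coefficients of the irrelevant features end up at (or near) 0

# Keep only the features whose coefficients survived the L1 penalty.
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print("Selected feature indices:", selected)
```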