Why does L1 regularization yield sparse solutions?

Need for regularization

Regularization is a general mechanism employed in supervised machine learning algorithms to prevent the model from overfitting, that is, from binding too strictly to patterns that are present only in the training set.

Problem diagnosis

The training data provided to the model might contain numerous features that strongly determine the output or predicted value. For instance, to train a model for land value prediction, the dataset might consist of real-estate features such as the number of nearby schools, the number of nearby hospitals, the number of malls, electricity access, and so on.

Similarly, there might be some features that are only marginally important to the regression model, such as the average number of people in the area who own pets. The model should be trained in a way that removes any dependency on features that exhibit a pattern or effect only within the training dataset.

In simple terms, the average number of pet owners in an area might correspond to a lower property value in the subset of data used for training, but may have no correlation with property value in the broader context of real-estate properties.

This poses a problem: the model will eventually be overfitted and will not be general enough to make a reasonable prediction when an instance from the test dataset is introduced. In fact, accuracy will be remarkably high during the training phase but drastically lower during the test phase.

This calls for regularization.

How does regularization help?

Regularization prevents this overfitting by penalizing the parameter coefficients. The cost function is changed so that every parameter is penalized according to the magnitude of its coefficient. Consequently, each parameter value is shrunk and its effect is reduced, which prevents the model from overfitting.

Lasso or L1 regularization alters the cost function as per the equation shown below:

$$C = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{k}\left|w_j\right|$$

where:

  • $C$ is the cost function
  • $n$ is the number of training samples, and $y_i$ and $\hat{y}_i$ are the actual and predicted values for the $i$-th sample
  • $k$ is the number of parameters, $w_j$ is the $j$-th parameter coefficient, and $X$ is the input feature matrix from which the predictions $\hat{y}_i$ are computed
  • $\lambda$ is the regularization parameter

Similarly, Ridge or L2 regularization uses a cost function that penalizes the parameter coefficients with the squared L2 norm:

$$C = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{k}w_j^{2}$$
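
To make the two cost functions concrete, here is a minimal NumPy sketch; the function names `lasso_cost` and `ridge_cost` and the toy data are illustrative rather than taken from the original text.

```python
import numpy as np

def lasso_cost(X, y, w, lam):
    """Mean squared error plus the L1 penalty: lam * sum(|w_j|)."""
    mse = np.mean((y - X @ w) ** 2)
    return mse + lam * np.sum(np.abs(w))

def ridge_cost(X, y, w, lam):
    """Mean squared error plus the L2 penalty: lam * sum(w_j ** 2)."""
    mse = np.mean((y - X @ w) ** 2)
    return mse + lam * np.sum(w ** 2)

# Toy example: 5 samples, 3 features, one weight already at zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = np.array([2.0, -0.5, 0.0])
y = X @ w + rng.normal(scale=0.1, size=5)

print("Lasso cost:", lasso_cost(X, y, w, lam=0.1))
print("Ridge cost:", ridge_cost(X, y, w, lam=0.1))
```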

The regularization coefficient, $\lambda$, is a constant that controls how strongly each coefficient is penalized. Hence, a greater $\lambda$ implies a greater overall penalty.
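
As a rough illustration of this effect (a sketch using scikit-learn, where $\lambda$ is exposed as the `alpha` parameter of `Lasso`; the synthetic dataset and the `alpha` values below are arbitrary choices), larger values of the regularization parameter push more coefficients to exactly zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression problem: only 5 of the 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    nonzero = np.count_nonzero(model.coef_)
    print(f"lambda = {alpha:>5}: {nonzero} non-zero coefficients out of {X.shape[1]}")
```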

Reason for sparsity

L1 regularization causes coefficients to reach exactly 0 rather quickly, since the constraint bounds the weight vector to lie within an L1-norm ball whose corners sit on the coordinate axes, where some coordinates are exactly zero.
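
Equivalently, the L1-penalized problem can be written in the constrained form below (a standard reformulation added here for clarity, not taken verbatim from the original text); the feasible region is a diamond-shaped L1 ball, and the optimum frequently lands on one of its corners, where some $w_j = 0$:

$$\min_{w}\;\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{k}\left|w_j\right| \le t$$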

The rate at which the weights are driven toward zero also differs because the first derivative of the L1 penalty term with respect to a weight is simply $\lambda$ (times the sign of the weight), whereas the first derivative of the L2 penalty term is $2\lambda w$, which shrinks as the weight itself shrinks.

The mean squared error term in the cost function, $C$, contributes the same gradient under both regularization techniques, so it can be ignored when comparing the two weight updates.

As per the equation of gradient descent with a learning rate of $\alpha$, the weight update can be represented as follows:

  • In the case of Lasso or L1 regularization, the weight update is represented as $w_{i+1} = w_{i} - \lambda\alpha\,\mathrm{sign}(w_{i})$, a step of constant size $\lambda\alpha$.
  • In the case of Ridge or L2 regularization, the weight update is represented as $w_{i+1} = w_{i} - 2\lambda\alpha\, w_{i}$, a step proportional to the current weight.

From the equations above, it can be deduced that the L1 update subtracts a fixed amount $\lambda\alpha$ at every step, no matter how small the weight already is, so weights are driven all the way to exactly zero. The L2 update subtracts an amount proportional to the current weight, so weights shrink toward zero but rarely reach it exactly. This is why Lasso converges to sparse solutions.
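
This behavior can be seen in the tiny simulation below, which applies only the penalty part of each update, as in the discussion above (the MSE gradient is ignored); the learning rate, regularization strength, and starting weight are arbitrary choices, and the L1 step is clipped at zero so the weight does not oscillate around it.

```python
# Compare how the L1 and L2 penalty updates shrink a single (positive) weight.
alpha, lam = 0.1, 1.0    # learning rate and regularization strength (arbitrary)
w_l1 = w_l2 = 1.0        # same starting weight for both penalties

for step in range(1, 21):
    # L1: subtract the constant step alpha * lam; clip at zero to avoid oscillation.
    w_l1 = max(0.0, w_l1 - alpha * lam)
    # L2: subtract alpha * 2 * lam * w, a step that shrinks along with the weight.
    w_l2 = w_l2 - alpha * 2 * lam * w_l2
    print(f"step {step:2d}: L1 weight = {w_l1:.4f}   L2 weight = {w_l2:.4f}")
```

The L1 weight hits exactly zero after ten steps, while the L2 weight shrinks geometrically toward zero without ever reaching it.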

Usage of L1 in feature selection

As L1 strictly decays parameter values to zero, it can be a useful practice to use L1 regularization to prune unnecessary features from the input space, which substantially speeds up training, decreases the chances of overfitting, and consequently saves time.
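
As a sketch of this practice (the synthetic dataset, the `alpha` value, and the use of scikit-learn's `SelectFromModel` are illustrative choices rather than something prescribed above), the snippet below fits a Lasso model and keeps only the features whose coefficients remain non-zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data where only a few of the features actually drive the target.
X, y = make_regression(n_samples=300, n_features=30, n_informative=6,
                       noise=5.0, random_state=1)

# Lasso zeroes out weakly useful features; SelectFromModel keeps the rest.
selector = SelectFromModel(Lasso(alpha=1.0, max_iter=10_000)).fit(X, y)
X_reduced = selector.transform(X)

print("original feature count:", X.shape[1])
print("selected feature count:", X_reduced.shape[1])
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```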
