Why does L1 regularization yield sparse solutions?

Need for regularization

Regularization is a general mechanism employed in supervised machine learning algorithms to prevent the model from overfitting, that is, from binding too strictly to patterns that are present only in the training set.

Problem diagnosis

The training data provided to the model might contain numerous features that strongly determine the output or predicted value. For instance, to train a model for land value prediction, the dataset might consist of real-estate features such as the number of nearby schools, the number of nearby hospitals, the number of malls, electricity access, and so on.

Similarly, there might be some features that are only marginally important to the regression model, such as the average number of people in the area who own pets. The model should be trained in a way that removes any dependency on features that exhibit a pattern or effect only within the training dataset.

In simple terms, the average number of pet owners in an area might correspond to a lower property value in the subset of data used for training, but may have no correlation with property value in the broader context of real-estate properties.

This poses a problem: the model will eventually be overfitted and will not be general enough to make a reasonable prediction when an instance from the test dataset is introduced. In fact, accuracy will be remarkably high during the training phase but drastically lower during the test phase.

This calls for regularization.

How does regularization help?

Regularization prevents this overfitting by penalizing the parameter coefficients. The cost function is changed so that every parameter is penalized according to the magnitude of its coefficient. Consequently, each parameter value is shrunk and its effect is reduced, which prevents the model from overfitting.

Lasso or L1 regularization alters the cost function as per the equation shown below:

$$C = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{k}\left|w_j\right|$$

where:

  • $C$ is the cost function
  • $n$ is the number of training samples, and $y_i$ and $\hat{y}_i$ are the actual and predicted values for the $i$-th sample
  • $k$ is the number of parameters, $w_j$ is the $j$-th parameter coefficient, and $X$ is the input feature matrix from which the predictions $\hat{y}_i$ are computed
  • $\lambda$ is the regularization parameter

Similarly, Ridge or L2 regularization uses a cost function that penalizes the parameter coefficients with the squared L2 norm:

$$C = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{k}w_j^{2}$$
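
To make the two cost functions concrete, here is a minimal NumPy sketch; the function names `lasso_cost` and `ridge_cost` and the toy data are illustrative rather than taken from the original text.

```python
import numpy as np

def lasso_cost(X, y, w, lam):
    """Mean squared error plus the L1 penalty: lam * sum(|w_j|)."""
    mse = np.mean((y - X @ w) ** 2)
    return mse + lam * np.sum(np.abs(w))

def ridge_cost(X, y, w, lam):
    """Mean squared error plus the L2 penalty: lam * sum(w_j ** 2)."""
    mse = np.mean((y - X @ w) ** 2)
    return mse + lam * np.sum(w ** 2)

# Toy example: 5 samples, 3 features, one weight already at zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = np.array([2.0, -0.5, 0.0])
y = X @ w + rng.normal(scale=0.1, size=5)

print("Lasso cost:", lasso_cost(X, y, w, lam=0.1))
print("Ridge cost:", ridge_cost(X, y, w, lam=0.1))
```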

The regularization coefficient, $\lambda$, is a constant that controls how strongly each coefficient is penalized. Hence, a greater $\lambda$ implies a greater overall penalty.
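
As a rough illustration of this effect (a sketch using scikit-learn, where $\lambda$ is exposed as the `alpha` parameter of `Lasso`; the synthetic dataset and the `alpha` values below are arbitrary choices), larger values of the regularization parameter push more coefficients to exactly zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression problem: only 5 of the 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    nonzero = np.count_nonzero(model.coef_)
    print(f"lambda = {alpha:>5}: {nonzero} non-zero coefficients out of {X.shape[1]}")
```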

Reason for sparsity

L1 regularization causes coefficients to reach exactly 0 rather quickly, since the constraint bounds the weight vector to lie within an L1-norm ball whose corners sit on the coordinate axes, where some coordinates are exactly zero.
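
Equivalently, the L1-penalized problem can be written in the constrained form below (a standard reformulation added here for clarity, not taken verbatim from the original text); the feasible region is a diamond-shaped L1 ball, and the optimum frequently lands on one of its corners, where some $w_j = 0$:

$$\min_{w}\;\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{k}\left|w_j\right| \le t$$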

The rate at which the weights are driven toward zero also differs because the first derivative of the L1 penalty term with respect to a weight is simply $\lambda$ (times the sign of the weight), whereas the first derivative of the L2 penalty term is $2\lambda w$, which shrinks as the weight itself shrinks.

The mean squared error term in the cost function, $C$, contributes the same gradient under both regularization techniques, so it can be ignored when comparing the two weight updates.

As per the equation of gradient descent with a learning rate of $\alpha$, the weight update can be represented as follows:

  • In the case of Lasso or L1 regularization, the weight update is represented as $w_{i+1} = w_{i} - \lambda\alpha\,\mathrm{sign}(w_{i})$, a step of constant size $\lambda\alpha$.
  • In the case of Ridge or L2 regularization, the weight update is represented as $w_{i+1} = w_{i} - 2\lambda\alpha\, w_{i}$, a step proportional to the current weight.

From the equations above, it can be deduced that the L1 update subtracts a fixed amount $\lambda\alpha$ at every step, no matter how small the weight already is, so weights are driven all the way to exactly zero. The L2 update subtracts an amount proportional to the current weight, so weights shrink toward zero but rarely reach it exactly. This is why Lasso converges to sparse solutions.
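
This behavior can be seen in the tiny simulation below, which applies only the penalty part of each update, as in the discussion above (the MSE gradient is ignored); the learning rate, regularization strength, and starting weight are arbitrary choices, and the L1 step is clipped at zero so the weight does not oscillate around it.

```python
# Compare how the L1 and L2 penalty updates shrink a single (positive) weight.
alpha, lam = 0.1, 1.0    # learning rate and regularization strength (arbitrary)
w_l1 = w_l2 = 1.0        # same starting weight for both penalties

for step in range(1, 21):
    # L1: subtract the constant step alpha * lam; clip at zero to avoid oscillation.
    w_l1 = max(0.0, w_l1 - alpha * lam)
    # L2: subtract alpha * 2 * lam * w, a step that shrinks along with the weight.
    w_l2 = w_l2 - alpha * 2 * lam * w_l2
    print(f"step {step:2d}: L1 weight = {w_l1:.4f}   L2 weight = {w_l2:.4f}")
```

The L1 weight hits exactly zero after ten steps, while the L2 weight shrinks geometrically toward zero without ever reaching it.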

Usage of L1 in feature selection

As L1 strictly decays parameter values to zero, it can be a useful practice to use L1 regularization to prune unnecessary features from the input space, which substantially speeds up training, decreases the chances of overfitting, and consequently saves time.
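
As a sketch of this practice (the synthetic dataset, the `alpha` value, and the use of scikit-learn's `SelectFromModel` are illustrative choices rather than something prescribed above), the snippet below fits a Lasso model and keeps only the features whose coefficients remain non-zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data where only a few of the features actually drive the target.
X, y = make_regression(n_samples=300, n_features=30, n_informative=6,
                       noise=5.0, random_state=1)

# Lasso zeroes out weakly useful features; SelectFromModel keeps the rest.
selector = SelectFromModel(Lasso(alpha=1.0, max_iter=10_000)).fit(X, y)
X_reduced = selector.transform(X)

print("original feature count:", X.shape[1])
print("selected feature count:", X_reduced.shape[1])
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```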
