Linear models in scikit-learn

Key takeaways:

  • Linear models are essential for regression and classification, assuming a linear relationship between the input features and the target variable.

  • Scikit-learn's linear models offer flexibility, efficiency, regularization options, and clear interpretability of model coefficients.

  • Some important algorithms to implement linear models in scikit-learn include linear regression, logistic regression, ridge regression, and lasso regression, each catering to specific tasks and complexities.

Linear models form the cornerstone of many machine learning algorithms, providing simple yet powerful tools for regression and classification tasks. Scikit-learn, a popular machine learning library in Python, offers a comprehensive suite of tools and utilities for linear models. This Answer delves into the fundamentals of linear models in scikit-learn, discussing their applications, key features, and notable algorithms.

[Figure: Simple depiction of linear regression]

Introduction to linear models

Linear models are a class of algorithms that assume a linear relationship between the input features and the target variable. Despite their simplicity, linear models are widely used due to their interpretability, efficiency, and effectiveness in various scenarios.
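
Formally, a linear model predicts the target as a weighted sum of the input features:

ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b

where each wᵢ is a learned coefficient and b is the intercept. Fitting the model means finding the weights that minimize a loss function, such as the sum of squared errors in linear regression.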

In scikit-learn, linear models encompass a range of algorithms suitable for regression, classification, and other tasks.

Key features of linear models in scikit-learn

Here are a few key features of linear models in scikit-learn:

  1. Flexibility: scikit-learn provides a versatile framework for implementing linear models, offering a variety of algorithms tailored to different problem types and data distributions.

  2. Efficiency: Linear models are computationally efficient, making them suitable for large-scale datasets. scikit-learn’s implementation optimizes computation, enabling quick training and prediction times.

  3. Regularization: Many linear models in scikit-learn support regularization techniques such as L1 (Lasso) and L2 (Ridge) regularization, which help prevent overfitting and improve generalization performance. L1 regularization adds a penalty equal to the sum of the absolute values of the model coefficients to the loss function, encouraging some coefficients to become exactly zero and simplifying the model. L2 regularization adds a penalty equal to the sum of the squares of the coefficients, discouraging large coefficients and improving the model's ability to generalize. The sketch after this list illustrates the difference.

  4. Interpretability: Linear models offer straightforward interpretations of model coefficients, allowing users to easily understand each feature’s influence on the target variable.
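
To see regularization in action, here is a minimal sketch (using a synthetic make_regression dataset and alpha values chosen purely for illustration) that compares Ridge and Lasso coefficients on a problem where only a few features matter:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
# a toy problem where only 3 of 10 features actually matter
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
# L2 (Ridge) shrinks all coefficients; L1 (Lasso) drives some exactly to zero
print("Ridge coefficients:", ridge.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))
print("Features zeroed out by Lasso:", int(np.sum(lasso.coef_ == 0)))

In this toy setup, Lasso typically zeros out the uninformative features, while Ridge keeps them small but nonzero. Printing coef_ like this is also what makes linear models interpretable: each coefficient is the modeled effect of one feature on the target.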

Algorithms in scikit-learn

The following are some important algorithms for implementing linear models in scikit-learn.

  1. Linear regression: One of the simplest yet effective regression algorithms, linear regression fits a linear relationship between the input features and the target variable by minimizing the sum of squared errors. scikit-learn’s LinearRegression class provides a robust implementation of ordinary least squares.

  2. Logistic regression: Despite its name, logistic regression is a linear model used for binary classification tasks. It estimates the probability that a given input belongs to a particular class (sketched after this list). scikit-learn’s LogisticRegression class offers efficient optimization algorithms and support for multi-class classification.

  3. Ridge regression: Ridge regression is a linear regression technique that incorporates L2 regularization to penalize large coefficients, thus reducing model complexity and improving generalization. scikit-learn’s Ridge class allows users to tune the regularization strength.

  4. Lasso regression: Like Ridge regression, Lasso regression incorporates L1 regularization, which encourages sparsity in the coefficient vector. This makes it particularly useful for feature selection. scikit-learn’s Lasso class provides an implementation of this algorithm.
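
To make the probability estimate in the logistic regression item concrete, here is a minimal sketch (on a synthetic dataset): for binary classification, the predicted probability is the sigmoid of the linear score, which we can verify against predict_proba:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)
# P(y=1 | x) = 1 / (1 + exp(-(w.x + b))), the sigmoid of the linear score
scores = clf.decision_function(X[:3]) # w.x + b for the first 3 samples
manual = 1 / (1 + np.exp(-scores)) # apply the sigmoid by hand
print(manual)
print(clf.predict_proba(X[:3])[:, 1]) # matches the manual computation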

Code example

Let’s have a look at the implementation of the above-mentioned algorithms in scikit-learn using Python. The example uses the California Housing dataset (the older Boston Housing dataset was removed from scikit-learn in version 1.2).

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, Lasso
from sklearn.linear_model import LogisticRegression
# load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
# select a variable from X
x_variable = X[:, 2] # AveRooms: average number of rooms per household
# create and fit the model
model = LinearRegression()
model.fit(x_variable.reshape(-1, 1), y)
# predict with the model
y_pred = model.predict(x_variable.reshape(-1, 1))
# plot the results
fig, ax = plt.subplots(figsize=(7, 3.5), dpi=300)
plt.scatter(x_variable, y, label='Actual')
plt.plot(x_variable, y_pred, color='red', label='Regression Line')
plt.xlabel('Average number of rooms')
plt.ylabel('Median house value ($100k)')
plt.title('Scatter Plot with Regression Line')
plt.legend()
# display
fig.subplots_adjust(bottom=0.15)
fig.savefig("output/output.png")
# print the actual and predicted values
print("Linear Regression")
print("Actual values: ", y[:5])
print("Prediction: ", y_pred[:5])
# Ridge regression
ridge = Ridge(alpha=0.5)
ridge.fit(X, y)
ridge_pred = ridge.predict(X)
# Lasso regression
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
lasso_pred = lasso.predict(X)
# compare the predictions
print("Ridge Regression")
print("Actual values: ", y[:5])
print("Prediction: ", ridge_pred[:5].round(1))
print("Lasso Regression")
print("Actual values: ", y[:5])
print("Prediction: ", lasso_pred[:5].round(1))
# Logistic regression
# generate a random classification dataset
X, y = make_classification(
    n_samples=1000, n_features=1, n_informative=1,
    n_redundant=0, n_clusters_per_class=1, random_state=0
)
# create the logistic regression model (liblinear supports the L1 penalty)
model = LogisticRegression(penalty='l1', C=10, solver='liblinear')
# fitting
model.fit(X, y)
# predicting
predictions = model.predict(X)
# print the predicted classes
print("Logistic Regression")
print("Actual values: ", y[:5])
print("Prediction: ", predictions[:5])

Explanation

  • Lines 1–6: Import the required libraries.

  • Lines 8–10: Load the California Housing dataset and separate the features and the target.

  • Lines 14–15: Create a LinearRegression instance and fit it to the selected feature.

  • Line 17: Use the trained model to make predictions.

  • Line 34: Create a Ridge model with alpha=0.5. A higher alpha means stronger regularization.

  • Line 38: Create a Lasso model with alpha=0.5. A higher alpha increases regularization strength.

  • Lines 50–53: Generate a random classification dataset with 1000 samples and 1 feature. All features are informative.

  • Line 55: Create a logistic regression model with:

    • penalty='l1': Sets L1 regularization.

    • C=10: Controls regularization strength. Lower values mean stronger regularization (see the sketch after this list).

    • solver='liblinear': A solver that supports the L1 penalty; the default lbfgs solver does not.

  • Line 57: Fit the model to the dataset.

  • Line 59: Make predictions.
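
To make the effect of C concrete, here is a small illustrative sketch (synthetic data, arbitrary C values) showing how coefficient magnitudes grow as C increases, i.e., as regularization weakens:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0)
# smaller C = stronger regularization = smaller coefficients
for C in (0.01, 1, 100):
    clf = LogisticRegression(C=C).fit(X, y)
    print(f"C={C}: sum of |coefficients| = {np.abs(clf.coef_).sum():.2f}")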

Conclusion

Linear models are a core part of machine learning: simple, efficient, and easy to interpret. scikit-learn implements them with flexibility and robustness, making them useful for both beginners and experienced practitioners.

Whether used for regression or classification, they offer powerful tools for solving a wide range of machine learning problems.
