How to perform linear regression in Julia

Key takeaways:

  1. Linear regression predicts a dependent variable based on one or more independent variables using a statistical model, typically represented as a linear equation.

  2. Julia is a robust programming language suited for linear regression due to its efficiency in data processing and numerical computations.

  3. The GLM (generalized linear models) package is widely used in Julia for implementing linear regression, offering flexible modeling and data-fitting capabilities.

  4. GLM provides various link functions for different distributions, such as LogitLink for Bernoulli and Binomial distributions, and IdentityLink for Normal distributions, making it adaptable to multiple regression scenarios.

  5. The GLM package also allows for comparing different linear regression models using statistical functions like ftest() to assess the best-fitting model for a given dataset.

Regression is a statistical process that predicts the value of a variable based on the values of the variable(s) it depends on. The former is called the dependent variable, while the latter are called independent variables.

Linear regression

The relationship between the independent variables and the dependent variable is defined by a statistical model M(θ), where θ is the set of parameters that defines the model. For instance, a simple linear regression model is a linear equation of the first degree, y = θ₁x₁ + θ₀. Here, θ₁ and θ₀ are the model parameters, x₁ is the input (independent) variable, and y is the output (dependent) variable whose value is to be determined.
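For instance, with θ₁ = 2 and θ₀ = 1, an input of x₁ = 3 gives y = 2 · 3 + 1 = 7. The short Julia sketch below is purely illustrative (the parameter values and the predict_y helper are made up) and simply evaluates such a model:

# Illustrative only: hand-picked parameters for the model y = θ₁x₁ + θ₀
θ1 = 2.0    # slope parameter (made up for illustration)
θ0 = 1.0    # intercept parameter (made up for illustration)

# Evaluate the model for a given input x₁
predict_y(x1) = θ1 * x1 + θ0

println(predict_y(3.0))    # prints 7.0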

Implementing linear regression in Julia

Julia is an open-source programming language that finds applications in data processing, numerical computations, and data visualization due to its computational robustness. For this reason, Julia is well suited for implementing linear regression and other machine learning algorithms.

Many open-source packages have been developed for linear regression in Julia. However, in this Answer, we’ll explore the GLM package for linear regression using generalized linear models (GLM). GLM provides flexible functionality by separating the modeling and data-fitting stages of linear regression.
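As a rough illustration of that separation, the sketch below first describes the model with the @formula macro (the modeling stage) and then fits it to a small, made-up dataset with fit() (the data-fitting stage). The column names and values are hypothetical:

using GLM, DataFrames

# A small, made-up dataset for illustration
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0], y = [2.1, 3.9, 6.2, 8.1])

# Modeling stage: describe the relationship as a formula
f = @formula(y ~ x)

# Data-fitting stage: fit the formula to the data as a linear model
model = fit(LinearModel, f, df)

println(coef(model))    # estimated intercept and slope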

A link function in generalized linear models (GLM) transforms the predicted values of the dependent variable to align with the distribution of the data, ensuring the model fits the specific type of response variable. The following table depicts which link function should be used for different distributions:

Distribution        | Link Function
--------------------|----------------------
Bernoulli           | LogitLink
Binomial            | LogitLink
Gamma               | InverseLink
Geometric           | LogLink
InverseGaussian     | InverseSquareLink
NegativeBinomial    | NegativeBinomialLink
Normal              | IdentityLink
Poisson             | LogLink
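For instance, the glm() function accepts a distribution and a matching link function from the table above. The following sketch fits a logistic regression (Bernoulli distribution with LogitLink) on a tiny, made-up binary dataset:

using GLM, DataFrames

# Made-up binary outcome data for illustration
df = DataFrame(x = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5],
               y = [0, 1, 0, 0, 1, 1])

# Fit a generalized linear model with a Bernoulli distribution and LogitLink
logit_model = glm(@formula(y ~ x), df, Bernoulli(), LogitLink())

println(coef(logit_model))    # intercept and slope on the log-odds scale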

GLM also provides additional functions for assessing the performance of a regression model and for comparing the performance of different linear regression models on a given dataset.
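For example, once a model has been fit, goodness-of-fit measures such as R² can be queried from it directly. The sketch below is a minimal illustration on made-up data, using the standard statistics accessors (r2(), adjr2(), and aic()) that GLM exposes for fitted models:

using GLM, DataFrames

# Made-up data for illustration
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 5.0], y = [1.2, 1.9, 3.2, 3.8, 5.1])
model = lm(@formula(y ~ x), df)

println(r2(model))       # coefficient of determination (R²)
println(adjr2(model))    # adjusted R²
println(aic(model))      # Akaike information criterion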

Code example

To better understand how GLM is used for linear regression in Julia, look at the code example given below. This code uses only one independent variable.

using GLM, DataFrames
import Random

# Setting a random seed so that the code example can be replicated
Random.seed!(1234)

# Generate some random data for demonstration purposes
x = rand(100)
y = 2 * x.^3 + 0.5 * randn(100)

# Perform linear regression for the model y = ax + b
model1 = lm(@formula(y ~ x), DataFrame(x=x, y=y))
# Print the summary of the regression
println(model1)
# Access the coefficients
println("Intercept: ", coef(model1)[1])
println("Slope: ", coef(model1)[2])

# Perform linear regression for the model y = ax^2 + bx + c
model2 = lm(@formula(y ~ x^2 + x), DataFrame(x=x, y=y))
# Print the summary of the regression
println(model2)
# Access the coefficients
println("Intercept: ", coef(model2)[1])
println("x^2 coefficient: ", coef(model2)[2], ", x coefficient: ", coef(model2)[3])

# Comparing models
println(ftest(model1.model, model2.model))

Explanation

  • Lines 1–2: The required libraries are loaded.

  • Line 5: The random seed is set so that the code example can be replicated.

  • Lines 8–9: The independent and dependent variables are defined.

  • For Model 1 (y = ax + b) on line 12:

    • a (slope) = coef(model1)[2]

    • b (intercept) = coef(model1)[1]

  • Lines 12–17: The linear regression model y = ax + b is fit to the data, and the details of the model are printed. This model has an intercept of -0.53347 and a slope of 2.06282, and it was fit using the formula y ~ 1 + x.

  • For Model 2 (y = ax^2 + bx + c) on line 20:

    • a (quadratic term) = coef(model2)[2]

    • b (linear term) = coef(model2)[3]

    • c (intercept) = coef(model2)[1]

  • Lines 20–25: The linear regression model y = ax^2 + bx + c is fit to the data, and the details of the model are printed. This model has an intercept of 0.09804, an x^2 coefficient of 3.66237, and an x coefficient of -1.69532. It was fit using the formula y ~ 1 + x^2 + x. Although it contains an x^2 term, the model is still linear in its parameters, so it remains a linear regression model.

  • Line 28: The models are compared using the ftest() function, and the results are printed. (A short prediction sketch using these fitted models appears after the note below.)

Note: This code example has been implemented using Julia 1.8.1.
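As mentioned in the explanation, the fitted models can also be used to make predictions on new inputs. The short sketch below is assumed to run after the code example above, and the input values are made up:

# Predict y for new, made-up x values using the models fitted above
new_data = DataFrame(x = [0.25, 0.5, 0.75])

println(predict(model1, new_data))    # predictions from the y = ax + b model
println(predict(model2, new_data))    # predictions from the y = ax^2 + bx + c model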

Conclusion

In conclusion, linear regression is a statistical method that predicts a dependent variable based on one or more independent variables by finding the mathematical model that best describes their relationship. Julia, a powerful programming language, is well suited for implementing linear regression through packages like GLM. The code example above demonstrates how GLM can be used to fit and compare different linear regression models in Julia.

