What is maximum likelihood estimation?

Overview

Maximum likelihood estimation (MLE) is a framework used to estimate the parameters of a statistical or machine learning model. The parameters it returns are the ones that maximize the likelihood of the observed data under the assumed model.

MLE fits a probability distribution to the provided data, which makes the data easier to summarize and work with. The fitted distribution can then be used to draw inferences that generalize beyond the observed sample.

This method underlies many machine learning models, such as logistic regression, where finding the best parameters is essential for making accurate predictions on future data.

Formulation

Let's now look at the mathematical formulation behind this framework. MLE solves the problem by searching the parameter space for the parameters that best explain the given dataset X.

Here, X is a dataset comprising n data points that are independent and identically distributed (i.i.d.), and it can be represented in the following mathematical notation:

X = \{x_1, x_2, \ldots, x_n\}

Following this, we can state the formulation for MLE in the following way:

\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \; P(X; \theta)

In the notation above, P is the probability of the data given the probability distribution and its parameters, collectively denoted by θ. For example, if the method fits a Gaussian distribution, the parameters for which we maximize the likelihood are the mean (μ) and the standard deviation (σ).

For a single data point x, the probability density function in the case of a Gaussian distribution is:

P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

Here, P(x; μ, σ) is the probability density of observing a data point x given the parameters μ and σ, which are unknown to us.
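
As a quick illustration, here is a minimal Python sketch of this density. The function name gaussian_pdf and the example values are ours, chosen purely for demonstration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Probability density of x under a Gaussian with mean mu and std sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Example: density of the point x = 1.0 under a standard normal (mu=0, sigma=1).
print(gaussian_pdf(1.0, mu=0.0, sigma=1.0))   # ~0.2420
```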

Because the data points in our dataset are independent of each other, we can extend the notation above to the whole dataset using the product rule for independent events and obtain:

P(X; \mu, \sigma) = \prod_{i=1}^{n} P(x_i; \mu, \sigma)

We know that probability values always lie in the range (0, 1), so the product of many small numbers, as shown above, would push the joint probability P(X; μ, σ) very close to zero and risk numerical underflow. To overcome this problem, we take the log of the probabilities. The values obtained after applying the log are numerically more stable and lie in the range (-∞, 0). The product then changes into a summation over the whole dataset using the following log property:

\log(a \cdot b) = \log(a) + \log(b)
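
To see why this matters numerically, the short sketch below (assuming NumPy is available) multiplies a thousand probabilities directly and compares the result with the sum of their logs:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# 1,000 probabilities, each well inside (0, 1).
probs = rng.uniform(0.1, 0.9, size=1_000)

print(np.prod(probs))          # underflows to 0.0 in float64
print(np.sum(np.log(probs)))   # finite, numerically stable log-probability
```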

Now, using the property above, we can translate our problem into the one below:

\log P(X; \mu, \sigma) = \sum_{i=1}^{n} \log P(x_i; \mu, \sigma)

The relation between the joint probability distribution and the likelihood function can be seen from the following equation:

L(\mu, \sigma; X) = P(X; \mu, \sigma)

In the notation above, L(μ, σ; X) is referred to as the likelihood of the parameters given the data, which is numerically equal to P(X; μ, σ), the probability of the data given the parameters. The method therefore looks for the values of μ and σ that maximize the (log-)likelihood, as given by the following equation:

\hat{\mu}, \hat{\sigma} = \underset{\mu,\, \sigma}{\operatorname{argmax}} \; \sum_{i=1}^{n} \log P(x_i; \mu, \sigma)
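
The sketch below ties the derivation together. It is an illustrative example only (it assumes NumPy and SciPy are available; the helper name negative_log_likelihood and the simulated data are ours): it maximizes the Gaussian log-likelihood numerically and compares the result with the well-known closed-form maximizers, the sample mean and the (1/n) sample standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated i.i.d. data drawn from a Gaussian with known parameters.
rng = np.random.default_rng(seed=0)
true_mu, true_sigma = 2.0, 1.5
data = rng.normal(true_mu, true_sigma, size=1_000)

def negative_log_likelihood(params, x):
    """Negative Gaussian log-likelihood; minimizing it maximizes the likelihood."""
    mu, sigma = params
    if sigma <= 0:                       # sigma must stay positive
        return np.inf
    return -np.sum(-np.log(sigma * np.sqrt(2 * np.pi))
                   - (x - mu) ** 2 / (2 * sigma ** 2))

# Numerically maximize the log-likelihood by minimizing its negative.
result = minimize(negative_log_likelihood, x0=[0.0, 1.0], args=(data,),
                  method="Nelder-Mead")
mu_hat, sigma_hat = result.x

# Closed-form MLE for a Gaussian: sample mean and (biased, 1/n) standard deviation.
print("numerical  :", mu_hat, sigma_hat)
print("closed-form:", data.mean(), data.std())   # np.std uses 1/n by default
```

Both approaches should recover estimates close to the true values used to simulate the data, and the agreement improves as the dataset grows.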

Conclusion

To conclude, maximum likelihood estimation is an efficient way to estimate the parameters of a model. It is also worth noting that as the size of the dataset increases, the quality of the maximum likelihood estimate improves significantly.
