Machine learning probability distributions

Probability distributions play a fundamental role in understanding uncertainty and randomness in machine learning models. They provide a mathematical framework for describing the likelihood of different outcomes and are used extensively in various machine learning algorithms. In this Answer, we’ll explore some common probability distributions encountered in machine learning and their applications.

Gaussian distribution (normal distribution)

The Gaussian distribution, also known as the normal distribution, is a bell-shaped distribution used throughout statistics and machine learning. It is fully determined by two parameters: its mean (μ), which sets the center, and its standard deviation (σ), which sets the spread.

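The probability density function (PDF) of the Gaussian distribution is

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

The following code draws samples from a Gaussian distribution and plots their histogram together with the PDF: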

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Gaussian distribution (normal distribution)
mu = 0      # Mean
sigma = 1   # Standard deviation
gaussian_samples = np.random.normal(mu, sigma, 1000)

# Plot histogram
plt.figure(figsize=(5, 5))
plt.hist(gaussian_samples, bins=10, density=True, alpha=0.6, color='g')
plt.title('Gaussian Distribution (Normal Distribution)')
plt.xlabel('Value')
plt.ylabel('Density')

# Plot the PDF (probability density function)
x = np.linspace(-3, 3, 1000)
plt.plot(x, norm.pdf(x, mu, sigma), 'r-', lw=1)
plt.savefig('output/gaussian.png')
plt.show()

Let’s break down the code step by step:

Lines 1–3: We import the necessary libraries: numpy for numerical operations and array manipulation, matplotlib.pyplot for plotting, and norm from scipy.stats, which represents the Gaussian distribution.
Lines 6–8: We set the mean (mu) to 0 and the standard deviation (sigma) to 1, then draw 1000 random samples from this distribution using np.random.normal.
Lines 11–15: We create a new figure with a size of 5x5 inches using plt.figure and plot a histogram of the generated samples using plt.hist. The bins=10 argument sets the number of bins, density=True normalizes the histogram so that its total area is 1, alpha=0.6 sets the transparency to 0.6, and color='g' makes the bars green. We also set the title and axis labels.
Lines 18–21: We generate 1000 evenly spaced x-values between -3 and 3 using np.linspace and evaluate the PDF at those points with norm.pdf(x, mu, sigma). We draw the curve with plt.plot as a solid red line ('r-') of width 1 (lw=1), save the figure with plt.savefig, and display it with plt.show.

Gaussian distributions arise naturally in many real-world phenomena, such as heights of people, test scores, and measurement errors. In machine learning, they appear in linear regression (as the usual noise model), in Gaussian processes, and as prior distributions in Bayesian inference.
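
To make the role of μ and σ concrete, here is a minimal sketch that estimates both parameters from samples and checks what fraction of the samples falls within one standard deviation of the mean (roughly 68% for a Gaussian). The seed, the sample size, and the values of true_mu and true_sigma are illustrative assumptions, not values from the text above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)   # fixed seed (arbitrary) for reproducibility
true_mu, true_sigma = 2.0, 0.5   # illustrative parameters (assumed)
samples = rng.normal(true_mu, true_sigma, 10_000)

# Maximum-likelihood estimates of the mean (loc) and standard deviation (scale)
mu_hat, sigma_hat = norm.fit(samples)

# Empirical fraction of samples within one standard deviation of the mean
within_one_sigma = np.mean(np.abs(samples - mu_hat) < sigma_hat)

# Theoretical value from the standard normal CDF: P(|X - mu| < sigma)
theoretical = norm.cdf(1) - norm.cdf(-1)

print(f"estimated mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
print(f"within one sigma: {within_one_sigma:.3f} (theory: {theoretical:.3f})")
```

With enough samples, the estimates should land close to the parameters used to generate the data, which is the sense in which μ and σ fully describe the distribution.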

Bernoulli distribution

The Bernoulli distribution describes a random variable with only two outcomes: success (typically represented by 1) and failure (typically represented by 0). It is defined by a single parameter, $p$, the probability of success; failure then has probability $1 - p$.

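The probability mass function (PMF) of the Bernoulli distribution is

$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$

The following code draws samples from a Bernoulli distribution and plots their histogram together with the PMF: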

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli

# Bernoulli distribution
p = 0.3  # Probability of success
n = 1    # Number of trials (always 1 for Bernoulli)
bernoulli_samples = bernoulli.rvs(p, size=1000)

# Plot histogram
plt.figure(figsize=(6, 6))
plt.hist(bernoulli_samples, bins=[0, 1, 2], density=True, alpha=0.6, color='b')
plt.title('Bernoulli Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Plot the PMF (probability mass function)
plt.plot([0, 1], bernoulli.pmf([0, 1], p), 'ro-', lw=2)
plt.xticks([0, 1], ['Failure (0)', 'Success (1)'])
plt.savefig('output/bernoulli.png')
plt.show()

Let’s break down the code step by step:

Lines 1–3: We import the necessary libraries: numpy for numerical operations and array manipulation, matplotlib.pyplot for plotting, and bernoulli from scipy.stats, which represents the Bernoulli distribution.
Lines 6–8: We specify the parameter p (probability of success) and n (number of trials, always 1 for Bernoulli; it is defined for reference and is not passed to bernoulli.rvs). We then generate 1000 random samples with probability of success p using bernoulli.rvs.
Lines 11–15: We create a new figure with a size of 6x6 inches using plt.figure and plot a histogram of the generated samples (bernoulli_samples) using plt.hist. The bins are specified as [0, 1, 2] to separate the values into two categories: 0 (failure) and 1 (success). The density=True argument normalizes the histogram so that its total area is 1, alpha=0.6 sets the transparency to 0.6, and color='b' makes the bars blue.
Lines 18–21: We plot the probability mass function (PMF) of the Bernoulli distribution using plt.plot. The x-values [0, 1] represent the two possible outcomes (failure and success), and bernoulli.pmf([0, 1], p) computes the corresponding probabilities. The PMF is drawn as red circles connected by a line ('ro-') with a line width of 2 (lw=2). We then relabel the x-axis ticks as 'Failure (0)' and 'Success (1)', save the figure, and display it.

Bernoulli distributions are used to model binary outcomes, such as coin flips, yes/no decisions, and binary classification tasks in machine learning algorithms like logistic regression and binary decision trees.
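
The link to binary classification can be made concrete: the binary cross-entropy loss commonly used to train logistic regression is the negative log-likelihood of the labels under a Bernoulli model. The sketch below checks this numerically. The labels y and predicted probabilities p_hat are made-up values for illustration, not the output of a real model.

```python
import numpy as np
from scipy.stats import bernoulli

y = np.array([1, 0, 1, 1, 0])                 # hypothetical binary labels
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical predicted P(y = 1)

# Average negative log-likelihood under a Bernoulli model (one p per example)
nll = -bernoulli.logpmf(y, p_hat).mean()

# Binary cross-entropy written out by hand
bce = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(f"Bernoulli NLL:        {nll:.6f}")
print(f"Binary cross-entropy: {bce:.6f}")
```

The two numbers agree because taking the logarithm of $p^y (1 - p)^{1 - y}$ gives exactly the cross-entropy expression.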

Binomial distribution

The binomial distribution extends the Bernoulli distribution to several independent trials. It gives the probability of obtaining exactly $k$ successes in a fixed number of trials, $n$, where each trial has the same probability of success, $p$, and probability of failure $q = 1 - p$.

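The probability mass function (PMF) of the binomial distribution is

$$P(X = k) = \binom{n}{k} p^k q^{n - k}, \quad k = 0, 1, \dots, n$$

The following code draws samples from a binomial distribution and plots their histogram together with the PMF: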

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Binomial distribution
n = 6   # Number of trials
p = 0.5  # Probability of success in each trial
binomial_samples = binom.rvs(n, p, size=1000)

# Plot histogram
plt.figure(figsize=(8, 6))
plt.hist(binomial_samples, bins=np.arange(0, n+2)-0.5, density=True, alpha=0.6, color='m')
plt.title('Binomial Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Plot the PMF (probability mass function)
x = np.arange(0, n+1)
plt.plot(x, binom.pmf(x, n, p), 'ro-', lw=2)
plt.savefig('output/binom.png')
plt.show()

Let's break down the code step by step:

Lines 1–3: We import the necessary libraries: numpy for numerical operations and array manipulation, matplotlib.pyplot for plotting, and binom from scipy.stats, which represents the binomial distribution.
Lines 6–8: We specify the parameters n (number of trials) and p (probability of success in each trial) and generate 1000 random samples from the binomial distribution using binom.rvs.
Lines 11–15: We create a new figure with a size of 8x6 inches using plt.figure and plot a histogram of the generated samples using plt.hist. The bin edges np.arange(0, n+2)-0.5 center one bin on each integer from 0 to n. The histogram is normalized so that its total area is 1 (density=True), with a transparency of 0.6 and a magenta color.
Lines 18–21: We generate the integers from 0 to n using np.arange and compute the probability mass function (PMF) at those values using binom.pmf. The PMF is plotted as red dots connected by a line, and the figure is then saved and displayed.

The binomial distribution gives the probability of a specific number of successes in a series of independent trials, each with two possible results, such as counting heads in repeated coin flips or successes in repeated experiments.
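
The relationship between Bernoulli trials and the binomial distribution can be checked directly. The sketch below simulates many rounds of n independent Bernoulli trials, counts the successes in each round, and compares the observed frequencies with the PMF computed two ways: from the closed-form expression and from binom.pmf. The seed and the number of simulated rounds are arbitrary choices for illustration.

```python
import numpy as np
from math import comb
from scipy.stats import binom

n, p = 6, 0.5
rng = np.random.default_rng(0)   # fixed seed (arbitrary) for reproducibility

# Each row holds n independent Bernoulli(p) trials; the row sum counts successes
trials = rng.random((100_000, n)) < p
successes = trials.sum(axis=1)

# Empirical relative frequency of each count 0..n
empirical = np.bincount(successes, minlength=n + 1) / len(successes)

# PMF from the closed-form expression C(n, k) * p^k * (1 - p)^(n - k)
by_formula = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])

# PMF from SciPy
from_scipy = binom.pmf(np.arange(n + 1), n, p)

for k in range(n + 1):
    print(f"k={k}: simulated={empirical[k]:.4f}  formula={by_formula[k]:.4f}  scipy={from_scipy[k]:.4f}")
```

For example, with n = 6 and p = 0.5, the probability of exactly two successes is 15/64, or about 0.234, and the simulated frequency should be close to that value.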

Poisson distribution

The Poisson distribution gives the probability of a certain number of events occurring within a fixed interval of time or space, given that the events occur at a constant average rate and independently of each other. The average number of events per interval is the parameter λ (lambda).

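The probability mass function (PMF) of the Poisson distribution is

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$

The following code draws samples from a Poisson distribution and plots their histogram together with the PMF: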

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Poisson Distribution
mu = 1  # Average rate (mean number of events in a fixed interval)
poisson_samples = poisson.rvs(mu, size=1000)

# Plot histogram
plt.figure(figsize=(8, 6))
plt.hist(poisson_samples, bins=np.arange(0, 6)-0.5, density=True, alpha=0.6, color='c')  # One bin centered on each value 0..4
plt.title('Poisson Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Plot the PMF (Probability Mass Function)
x = np.arange(0, 5)
plt.plot(x, poisson.pmf(x, mu), 'ro-', lw=2)
plt.savefig('output/poisson.png')
plt.show()

Let’s break down the code step by step:

Lines 1–3: We import the necessary libraries: numpy for numerical operations and array manipulation, matplotlib.pyplot for plotting, and poisson from scipy.stats, which represents the Poisson distribution.
Lines 6–7: We set the average rate (mu, which is SciPy's name for λ) to 1, representing the mean number of events in a fixed interval. Next, we generate 1000 random variates from the Poisson distribution with this rate using poisson.rvs.
Lines 10–14: We create a new figure with a size of 8x6 inches and plot a histogram of the Poisson samples (poisson_samples) with bins covering the values 0 to 4, normalized so that its total area is 1. We also set the title of the plot to 'Poisson Distribution', the x-axis label to 'Value', and the y-axis label to 'Frequency'.
Lines 17–20: We generate the integers from 0 to 4 using np.arange and plot the probability mass function (PMF) of the Poisson distribution at those values for the specified average rate (mu). Lastly, we save the figure and display it.

The Poisson distribution is used to model count data, such as the number of crimes in an area, the number of genetic variations in a stretch of DNA, or the number of events arriving in a fixed time window.
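
Two standard properties help explain why the Poisson distribution suits count data: its mean and variance both equal λ, and it is the limit of a binomial distribution with many trials and a small success probability p = λ/n. The sketch below checks both numerically. The choice of λ, the values of n, and the range of k are illustrative assumptions.

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 1.0                 # average rate (SciPy calls this parameter mu)
k = np.arange(0, 8)       # event counts at which the PMFs are compared

# Mean and variance of the Poisson distribution; both equal lam
mean, var = poisson.stats(lam, moments='mv')
print(f"mean = {float(mean):.3f}, variance = {float(var):.3f}")

# Largest PMF difference between Binomial(n, lam / n) and Poisson(lam)
gaps = {}
for n in (10, 100, 10_000):
    gaps[n] = np.max(np.abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam)))
    print(f"n = {n:>6}: max PMF difference = {gaps[n]:.6f}")
```

The difference shrinks as n grows, which matches the intuition of many independent opportunities for an event, each with a small chance of occurring.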
