Probability distributions play a fundamental role in understanding uncertainty and randomness in machine learning models. They provide a mathematical framework for describing the likelihood of different outcomes and are used extensively in various machine learning algorithms. In this Answer, we’ll explore some common probability distributions encountered in machine learning and their applications.
The Gaussian distribution, also known as the normal distribution, is a continuous distribution with a bell-shaped curve that is used frequently in statistics and machine learning. It is fully determined by two parameters: its mean (μ) and standard deviation (σ).
The formula and graph of the Gaussian distribution are as follows:
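Written out in standard notation, the density is the textbook form of the normal probability density function, with μ the mean and σ the standard deviation. The function name f is a naming choice made for this sketch.

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

With μ = 0 and σ = 1, this reduces to the standard normal density, which peaks at x = 0 with a height of 1/√(2π) ≈ 0.399.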
We can plot this distribution in Python using the following code:
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    # Gaussian distribution (normal distribution)
    mu = 0
    sigma = 1
    gaussian_samples = np.random.normal(mu, sigma, 1000)

    # Plot histogram
    plt.figure(figsize=(5, 5))
    plt.hist(gaussian_samples, bins=10, density=True, alpha=0.6, color='g')
    plt.title('Gaussian Distribution (Normal Distribution)')
    plt.xlabel('Value')
    plt.ylabel('Frequency')

    # Plot the PDF (probability density function)
    x = np.linspace(-3, 3, 1000)
    plt.plot(x, norm.pdf(x, mu, sigma), 'r-', lw=1)
    plt.savefig('output/gaussian.png')
    plt.show()
Let’s break down the code step by step:
Lines 1–3: We import the necessary libraries: numpy for mathematical operations and array manipulation, matplotlib.pyplot for plotting capabilities, and from scipy.stats import norm, which imports the Gaussian distribution from the SciPy library’s stats module.
Lines 6–8: We define the mean (mu) of the Gaussian distribution as 0 and the standard deviation (sigma) as 1. Next, we generate 1000 random samples from the Gaussian distribution with the specified mean and standard deviation using np.random.normal.
Lines 11–15: We create a new figure with a size of 5x5 inches using plt.figure and plot a histogram of the generated samples using plt.hist. The bins=10 argument specifies the number of bins for the histogram, the density=True argument normalizes the histogram to form a probability density, alpha=0.6 sets the opacity of the bars to 0.6, and color='g' sets the color of the bars to green. We also set the plot title and the axis labels.
Lines 18–21: We generate a set of x-values using np.linspace to represent a range of values for the PDF. Next, we compute the PDF of the Gaussian distribution using norm.pdf(x, mu, sigma) from SciPy’s norm object and plot it using plt.plot. The line is drawn as a solid red line ('r-') with a line width of 1 (lw=1). Lastly, we save the figure and display the plot.
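The red curve drawn by norm.pdf can also be reproduced directly from the closed-form density. The following is a minimal sketch that uses only the standard library; gaussian_pdf is a hypothetical helper name introduced here and is not part of SciPy.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal density at x from the closed-form expression."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Peak of the standard normal density, equal to 1 / sqrt(2 * pi).
peak = gaussian_pdf(0.0)

# Left Riemann sum over [-6, 6) as a rough check that the density integrates to about 1.
step = 0.001
area = sum(gaussian_pdf(-6.0 + i * step) * step for i in range(12000))

print(f"peak = {peak:.4f}, area = {area:.4f}")
```

The area check is a quick way to confirm that a density is normalized, which is the same property that density=True imposes on the histogram.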
Gaussian distributions arise naturally in many real-world phenomena, such as heights of people, test scores, and measurement errors. They are commonly used in algorithms like linear regression and Gaussian processes, and as prior distributions in Bayesian inference.
The Bernoulli distribution describes a variable with only two outcomes: success (typically represented by 1) and failure (typically represented by 0). It is defined by a single parameter, p, the probability of success.
The formula and graph of the Bernoulli distribution are as follows:
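In standard notation, the probability mass function can be written as a single expression. Here, p denotes the probability of success and x is the outcome. The random-variable name X is a naming choice made for this sketch.

```latex
P(X = x) = p^{x}\,(1 - p)^{1 - x}, \qquad x \in \{0, 1\}
```

Substituting x = 1 gives p and substituting x = 0 gives 1 − p, so with p = 0.3 the two outcomes have probabilities 0.3 and 0.7.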
We can plot this distribution in Python using the following code:
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import bernoulli

    # Bernoulli distribution
    p = 0.3  # Probability of success
    n = 1  # Number of trials (always 1 for Bernoulli)
    bernoulli_samples = bernoulli.rvs(p, size=1000)

    # Plot histogram
    plt.figure(figsize=(6, 6))
    plt.hist(bernoulli_samples, bins=[0, 1, 2], density=True, alpha=0.6, color='b')
    plt.title('Bernoulli Distribution')
    plt.xlabel('Value')
    plt.ylabel('Frequency')

    # Plot the PMF (probability mass function)
    plt.plot([0, 1], bernoulli.pmf([0, 1], p), 'ro-', lw=2)
    plt.xticks([0, 1], ['Failure (0)', 'Success (1)'])
    plt.savefig('output/bernoulli.png')
    plt.show()
Let’s break down the code step by step:
Lines 1–3: We import the necessary libraries: numpy for mathematical operations and array manipulation, matplotlib.pyplot for plotting capabilities, and from scipy.stats import bernoulli, which imports the Bernoulli distribution from the SciPy library’s stats module.
Lines 6–8: We specify the parameters p (probability of success) and n (number of trials, always 1 for Bernoulli) for the Bernoulli distribution and generate 1000 random samples with the specified probability of success p using bernoulli.rvs. Note that n is defined only for reference and is not passed to bernoulli.rvs.
Lines 11–15: We create a new figure with a size of 6x6 inches using plt.figure and plot a histogram of the generated samples (bernoulli_samples) using plt.hist. The bins are specified as [0, 1, 2] to separate the values into two categories: 0 (failure) and 1 (success). The density=True argument normalizes the histogram to form a probability density, alpha=0.6 sets the opacity of the bars to 0.6, and color='b' sets the color of the bars to blue. We also set the plot title and the axis labels.
Lines 18–21: We plot the probability mass function (PMF) of the Bernoulli distribution using plt.plot. The x-values [0, 1] represent the two possible outcomes (failure and success), and bernoulli.pmf([0, 1], p) computes the corresponding probabilities. The PMF is plotted as red circles connected by a line ('ro-') with a line width of 2 (lw=2). Furthermore, we customize the x-axis ticks to read 'Failure (0)' and 'Success (1)', save the figure, and display the plot on the screen.
Bernoulli distributions are used to model binary outcomes, such as coin flips, yes/no decisions, and binary classification tasks in machine learning algorithms like logistic regression and binary decision trees.
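The link to binary classification can be made concrete: a logistic regression model is typically fit by minimizing the Bernoulli negative log-likelihood of the labels, also known as binary cross-entropy. The following is a minimal sketch using only the standard library. The helper names and the example labels and probabilities are hypothetical and chosen for illustration only.

```python
import math


def bernoulli_pmf(x, p):
    """Probability of outcome x (0 or 1) under success probability p."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p if x == 1 else 1.0 - p


def negative_log_likelihood(labels, probs):
    """Average Bernoulli negative log-likelihood (binary cross-entropy).

    Assumes every predicted probability lies strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        total -= math.log(bernoulli_pmf(y, p))
    return total / len(labels)


# Hypothetical labels and predicted success probabilities, for illustration only.
labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]
uncertain = [0.5, 0.5, 0.5, 0.5]

print(negative_log_likelihood(labels, confident))
print(negative_log_likelihood(labels, uncertain))
```

Predictions that put more probability on the observed labels give a lower loss, which is why the confident (and correct) predictions score better than the uniform 0.5 guesses.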
The binomial distribution extends the concept of the Bernoulli distribution across several independent trials. It gives the probability of achieving a certain number of successful outcomes, x, in n independent trials that each succeed with probability p.
The formula and graph of the binomial distribution are as follows:
Here, n = number of trials, x = number of successes (an integer from 0 to n), p = probability of success in each trial, and q = probability of failure (1 - p).
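With n trials, x successes, success probability p, and failure probability q = 1 − p, the probability mass function in standard notation is shown below. The random-variable name X is a naming choice made for this sketch.

```latex
P(X = x) = \binom{n}{x}\, p^{x}\, q^{\,n - x}, \qquad x = 0, 1, \dots, n
```

For example, with n = 6 and p = q = 0.5, the probability of exactly 3 successes is 20 × 0.5⁶ = 0.3125.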
We can plot this distribution in Python using the following code:
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import binom

    # Binomial distribution
    n = 6  # Number of trials
    p = 0.5  # Probability of success in each trial
    binomial_samples = binom.rvs(n, p, size=1000)

    # Plot histogram
    plt.figure(figsize=(8, 6))
    plt.hist(binomial_samples, bins=np.arange(0, n+2)-0.5, density=True, alpha=0.6, color='m')
    plt.title('Binomial Distribution')
    plt.xlabel('Value')
    plt.ylabel('Frequency')

    # Plot the PMF (probability mass function)
    x = np.arange(0, n+1)
    plt.plot(x, binom.pmf(x, n, p), 'ro-', lw=2)
    plt.savefig('output/binom.png')
    plt.show()
Let's break down the code step by step:
Lines 1–3: We import the necessary libraries: numpy for mathematical operations and array manipulation, matplotlib.pyplot for plotting capabilities, and from scipy.stats import binom, which imports the binomial distribution from the SciPy library’s stats module.
Lines 6–8: We specify the parameters n (number of trials) and p (probability of success in each trial) for the binomial distribution and generate 1000 random samples from the binomial distribution using binom.rvs.
Lines 11–15: We create a new figure with a size of 8x6 inches using plt.figure and plot a histogram of the generated samples with custom bin edges using plt.hist. The edges np.arange(0, n+2)-0.5 center one bin on each integer from 0 to n. The histogram is normalized to form a probability density (density=True), with an opacity of 0.6 and a magenta color. We also set the plot title and the axis labels.
Lines 18–21: We generate values for the x-axis using np.arange and plot the probability mass function (PMF) of the binomial distribution using binom.pmf. The PMF is plotted as red dots connected by a line. Lastly, we save the figure and display the plot.
The binomial distribution gives the probability of getting a specific number of successful outcomes in a series of independent trials, each with two possible results and the same probability of success, such as counting heads in repeated coin flips or counting successful experiments.
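The relationship between the two distributions can be checked numerically: summing n independent Bernoulli trials should produce counts that follow the binomial PMF. The following is a minimal sketch using only the standard library. The helper names binomial_pmf and binomial_sample are hypothetical, and the seed and sample count are arbitrary choices.

```python
import math
import random


def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)


def binomial_sample(n, p, rng):
    """Draw one binomial value by summing n independent Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


n, p = 6, 0.5            # same parameters as the plotting code above
rng = random.Random(0)   # fixed seed so the run is repeatable
samples = [binomial_sample(n, p, rng) for _ in range(10_000)]

# Compare the observed share of each count with the exact PMF.
for x in range(n + 1):
    observed = samples.count(x) / len(samples)
    print(x, round(observed, 3), round(binomial_pmf(x, n, p), 3))
```

The observed shares should sit close to the exact values, and the gap should shrink as the number of samples grows.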
The Poisson distribution gives the probability of a certain number of events happening within a fixed interval of time or space, given that these events occur at a constant average rate and independently of each other. The average number of events per interval is represented by the parameter λ (lambda).
The formula and graph of the Poisson distribution are as follows:
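In standard notation, the probability mass function for an average of λ events per interval is shown below. The symbols X and k (the observed number of events) are naming choices made for this sketch.

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots
```

For example, with λ = 1 the probability of observing no events is e⁻¹ ≈ 0.368.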
We can plot this distribution in Python using the following code:
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import poisson

    # Poisson distribution
    mu = 1  # Average rate (mean number of events in a fixed interval)
    poisson_samples = poisson.rvs(mu, size=1000)

    # Plot histogram
    plt.figure(figsize=(8, 6))
    plt.hist(poisson_samples, bins=range(5), density=True, alpha=0.6, color='c')
    plt.title('Poisson Distribution')
    plt.xlabel('Value')
    plt.ylabel('Frequency')

    # Plot the PMF (probability mass function)
    x = np.arange(0, 5)
    plt.plot(x, poisson.pmf(x, mu), 'ro-', lw=2)
    plt.savefig('output/poisson.png')
    plt.show()
Let’s break down the code step by step:
Lines 1–3: We import the necessary libraries: numpy for mathematical operations and array manipulation, matplotlib.pyplot for plotting capabilities, and from scipy.stats import poisson, which imports the Poisson distribution from the SciPy library’s stats module.
Lines 6–7: We initialize the average rate (mu) for the Poisson distribution as 1, representing the mean number of events in a fixed interval. Next, we generate a sample of 1000 random variates from the Poisson distribution with the specified average rate using the poisson.rvs function.
Lines 10–14: We create a new figure with a size of 8x6 inches for plotting and plot a histogram of the Poisson samples (poisson_samples) with bins from 0 to 4, normalized to form a probability density. We also set the title of the plot as 'Poisson Distribution', along with the x-axis label as 'Value' and the y-axis label as 'Frequency'.
Lines 17–20: We generate an array of values from 0 to 4 using NumPy’s arange function, representing the values shown in the plot, and plot the probability mass function (PMF) of the Poisson distribution for the generated values (x) and the specified average rate (mu). Lastly, we save the figure and display the plot.
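Because the plot only covers the values 0 to 4, it can be useful to check how much probability lies outside that range and to confirm that the mean and variance both equal the rate. The following is a minimal sketch using only the standard library. The helper name poisson_pmf is hypothetical, and the cutoff of 50 terms is an arbitrary choice that is more than enough for a rate of 1.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)


lam = 1.0  # matches mu = 1 in the plotting code above

# Probability mass on the plotted values 0..4 and on everything beyond them.
plotted_mass = sum(poisson_pmf(k, lam) for k in range(5))
tail_mass = 1.0 - plotted_mass

# Mean and variance from a long truncated sum; both should be close to lam.
ks = range(50)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(f"mass on 0..4 = {plotted_mass:.4f}, tail = {tail_mass:.4f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```

For a larger rate, the plotted range would need to be widened so that the tail mass stays small.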
Understanding the Poisson distribution is important for modeling count data, such as the number of crime incidents in an area, genetic variations, and other event counts.
Test your understanding by doing the quiz below!
Choose the correct option.
Which distribution models the number of events occurring in a fixed interval of time or space, given a constant average rate?
Gaussian distribution
Binomial distribution
Bernoulli distribution
Poisson distribution
These are just a few examples of common probability distributions used in machine learning. Understanding these distributions and their properties is crucial for developing and interpreting machine learning models effectively. By leveraging the appropriate probability distributions, machine learning practitioners can make informed decisions, model complex phenomena, and extract meaningful insights from data.