What is the central limit theorem?

The central limit theorem (CLT) states that when many independent random variables with finite variance are added, their suitably normalized sum tends toward a normal distribution, regardless of the distribution of the individual variables.

Formula

If $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables with mean $μ$ and finite standard deviation $σ$, then for large $n$ the distribution of the sample mean $\bar{X}$ is approximately normal with mean $μ$ and standard deviation $\frac{σ}{\sqrt{n}}$, and the approximation improves as $n$ approaches infinity.

In mathematical notation:

$\bar{X} \sim N\left(μ, \frac{σ}{\sqrt{n}}\right)$

Where:

$\bar{X}$ represents the sample mean,

$N$ denotes the normal distribution,

$μ$ represents the population mean,

$σ$ represents the population standard deviation,

$n$ is the sample size.

This formula states that as the sample size $n$ increases, the sampling distribution of the sample mean approaches a normal distribution with the same mean as the population, $μ$, and a standard deviation of $\frac{σ}{\sqrt{n}}$. In other words, the larger the sample size, the more closely the distribution of the sample means resembles a normal distribution, and the more tightly it concentrates around $μ$.
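The formula can be checked numerically. The sketch below is only an illustration: the exponential population, the sample size of 50, and the seed are assumptions chosen for the demonstration, not part of the theorem. It draws many samples from a clearly non-normal population and compares the mean and standard deviation of the sample means with $μ$ and $\frac{σ}{\sqrt{n}}$.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed population: exponential with scale 2.0, so mu = 2.0 and sigma = 2.0.
mu, sigma = 2.0, 2.0
n = 50                # size of each sample
num_samples = 10_000  # number of samples drawn

# Each row is one sample of size n; take the mean of every row.
samples = rng.exponential(scale=2.0, size=(num_samples, n))
sample_means = samples.mean(axis=1)

observed_mean = sample_means.mean()
observed_std = sample_means.std(ddof=1)
theoretical_std = sigma / np.sqrt(n)

print("Mean of sample means:", observed_mean)
print("Theoretical mean (mu):", mu)
print("Std of sample means:", observed_std)
print("Theoretical std (sigma / sqrt(n)):", theoretical_std)
```

The two pairs of values should agree closely, even though the individual observations are strongly skewed.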

Conditions

To apply the central limit theorem successfully, the following conditions should be met:

Random sampling: The samples should be selected randomly from the population.
Independence: Observations within the sample should be independent of one another.
Sample size: The sample size should be sufficiently large. While there is no fixed rule, a sample size of 30 or greater is often considered adequate for the CLT, although strongly skewed or heavy-tailed populations may need more.
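The effect of the sample-size condition can be illustrated with a small simulation. This is a sketch under assumptions: the exponential population, the three sample sizes, and the helper `skewness` are chosen here for illustration only. Skewness is 0 for a symmetric distribution such as the normal, so values closer to 0 suggest that the sample means look more normal.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(seed=1)
num_samples = 20_000

skew_by_n = {}
for n in (2, 30, 200):
    # Draw num_samples independent samples of size n from a skewed population
    # (exponential, an assumption made for this illustration).
    means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)
    skew_by_n[n] = skewness(means)
    print(f"n = {n:>3}: skewness of sample means = {skew_by_n[n]:.3f}")
```

The skewness shrinks as n grows, which is why a skewed population generally needs a larger sample before the normal approximation is trustworthy.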

Importance

The central limit theorem is important for the following reasons:

Reliable estimation: It allows accurate inferences about population parameters based on sample means.
Hypothesis testing: The CLT provides the foundation for many hypothesis tests, enabling researchers to draw valid conclusions.
Approximation: It simplifies analysis by letting the sampling distribution of the mean be approximated with a normal distribution, even when the population distribution is complex.
Predictive modeling: The CLT is the basis for various statistical models, helping in forecasting and prediction tasks.
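As one illustration of the estimation use case, the sketch below builds an approximate 95% confidence interval for a mean using the normal approximation. The function name `clt_confidence_interval`, the exponential population, and the value of `true_mean` are hypothetical choices for this example. The simulation then checks how often such intervals contain the true mean.

```python
import numpy as np

def clt_confidence_interval(sample, z=1.96):
    """Approximate 95% interval for the mean, using the CLT normal approximation."""
    sample = np.asarray(sample, dtype=float)
    mean = sample.mean()
    standard_error = sample.std(ddof=1) / np.sqrt(sample.size)
    return mean - z * standard_error, mean + z * standard_error

rng = np.random.default_rng(seed=2)
true_mean = 5.0  # hypothetical population mean, known only because we simulate

# Repeat the experiment and count how often the interval covers true_mean.
trials = 2_000
covered = 0
for _ in range(trials):
    sample = rng.exponential(scale=true_mean, size=100)
    low, high = clt_confidence_interval(sample)
    covered += int(low <= true_mean <= high)

coverage = covered / trials
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land near 0.95. It is typically slightly lower for a skewed population at this sample size, which reflects that the normal approximation is not exact.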

Code example

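The following example compares the mean of a dataset (the original mean) with the average of the means of several random samples drawn from it (the predicted mean). The CLT implies that the two values should be close, and they will come even closer if the sample size increases.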

import numpy as np

# Generate an array with 1000 random integers in the range [0, 1000)
x = np.random.randint(0, 1000, size=1000)

# Original mean
print("The original mean value:", x.mean())

# Choose 20 random samples, each containing 15 data points
resamples = [np.random.choice(x, size=15, replace=True) for _ in range(20)]

# List of means of random samples
avg_list = []
for sample in resamples:
    avg_list.append(sample.mean())

# Predicted mean: the average of the 20 sample means
predicted_mean = sum(avg_list) / len(avg_list)
print("The predicted mean value:", predicted_mean)

Code explanation

Line 4: An array with 1000 random integers is created.

Line 7: The average of the dataset is computed using the mean() method.

Line 10: 20 random samples are drawn with replacement, each containing 15 data points.

Lines 13–15: The average value of each random sample is computed and stored in a list.

Lines 18–19: The predicted_mean value is calculated by taking an average of the values in the list, and then printed.
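The claim that the two means move closer as the sample size grows can be checked by repeating the experiment for several sample sizes. The sketch below is one possible way to do that. The helper `average_error`, the choice of sample sizes, and the number of repeats are assumptions made for this illustration. A single run is noisy, so the error is averaged over many repeats.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Same kind of population as in the example: 1000 random integers in [0, 1000).
x = rng.integers(0, 1000, size=1000)
original_mean = x.mean()

def average_error(sample_size, num_samples=20, repeats=500):
    """Average |predicted mean - original mean| over many repeats of the experiment."""
    # Shape (repeats, num_samples, sample_size): one full experiment per repeat.
    draws = rng.choice(x, size=(repeats, num_samples, sample_size), replace=True)
    predicted_means = draws.mean(axis=2).mean(axis=1)
    return float(np.abs(predicted_means - original_mean).mean())

errors = {size: average_error(size) for size in (5, 15, 50, 500)}
for size, err in errors.items():
    print(f"sample size {size:>3}: average error = {err:.2f}")
```

The average error decreases as the sample size increases, roughly in proportion to one over the square root of the sample size.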

Conclusion

In conclusion, the central limit theorem is a fundamental concept in statistics that states that the distribution of sample means tends to be approximately normal for sufficiently large samples, regardless of the shape of the population distribution (provided its variance is finite). It is used to estimate population parameters, make inferences, and conduct hypothesis tests based on sample data.
