Data collection can be expensive in data science, and we can often collect only a limited amount of data. From that data, we need to estimate quantities that describe a population, known as population statistics, such as the mean, median, or standard deviation. If large amounts of data are not available and further data collection is not possible, we can rely on a technique known as bootstrap sampling.
Bootstrap sampling estimates a summary statistic by averaging the estimates obtained from smaller samples drawn at random from the original data. The sampling is done with replacement: a value drawn from the original data becomes part of the smaller sample and is also returned to the original data, so it can be drawn again. Thus, a single value can appear in the smaller sample more than once.
A single smaller sample can be made by following the steps below:
The sample size can be as large as the original data. However, this is often computationally expensive, so a size of 50% to 80% of the original data is commonly used.
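As a minimal sketch of forming one smaller sample with replacement, the snippet below uses NumPy's random generator. The toy data, the seed, and the 80% sample size are assumptions chosen only for illustration.

```python
import numpy as np

# Seeded generator so the run is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(seed=0)

# Toy "original data" for illustration: the integers 0 through 9
data = np.arange(10)

# Assumed choice: a smaller sample that is 80% of the original size
sample_size = int(0.8 * len(data))

# replace=True returns each drawn value to the pool,
# so the same value may be drawn more than once
bootstrap_sample = rng.choice(data, size=sample_size, replace=True)

print("Original data:   ", data)
print("Bootstrap sample:", bootstrap_sample)
```

Because values are returned to the pool after each draw, the printed sample may contain duplicates even though the original data has none.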
The illustration below shows how we can bootstrap a single sample:
Since we form smaller samples to estimate a statistic such as the mean or median, we need to calculate that statistic for each smaller sample. We first choose how many bootstrap samples to form, and then calculate the statistic for each one. The estimated statistic is the average of the statistics obtained from all the smaller samples.
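The averaging step can be sketched for any statistic, not only the mean. The helper name `bootstrap_estimate`, its parameters, and the chosen values (200 samples of size 500) are hypothetical and serve only as an illustration under those assumptions.

```python
import numpy as np

def bootstrap_estimate(data, statistic, n_samples, sample_size, seed=None):
    """Average `statistic` over `n_samples` samples drawn with replacement.

    Hypothetical helper for illustration; not part of any library.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    estimates = [
        # Draw one smaller sample with replacement and compute its statistic
        statistic(rng.choice(data, size=sample_size, replace=True))
        for _ in range(n_samples)
    ]
    # The bootstrap estimate is the average of the per-sample statistics
    return float(np.mean(estimates))

# Example data: 1000 normal values centered around 300 (standard deviation 1)
data = np.random.default_rng(1).normal(loc=300.0, scale=1.0, size=1000)

est = bootstrap_estimate(data, np.median, n_samples=200, sample_size=500, seed=2)
print("Median of original data:", np.median(data))
print("Bootstrapped median:    ", est)
```

Passing the statistic as a function keeps the sampling loop unchanged when switching between, say, `np.mean`, `np.median`, and `np.std`.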
We can summarize the process as follows:
Bootstrap is a simple technique to obtain an estimate of the population statistics. It has the following advantages:
The code snippet below shows a simple example of bootstrap sampling in Python:
import numpy as np

# Create a normal random sample of size 1000 centered around 300
x = np.random.normal(loc=300.0, size=1000)
print("Actual Mean:", np.mean(x))  # Mean of original sample for comparison later

sample_mean = []  # To store the mean of each smaller sample
for i in range(50):  # Create 50 bootstrap samples
    # Randomly take 30 values with replacement
    y = np.random.choice(x, size=30, replace=True)
    avg = np.mean(y)  # Find the mean of the smaller sample
    sample_mean.append(avg)  # Add the mean to the list

# Take the average of all statistics collected from the smaller samples
print("Bootstrapped Mean:", np.mean(sample_mean))
Running the code shows that the bootstrapped mean is close to the mean of the original data.