Outliers are data points that differ markedly from the rest of the dataset and can significantly influence the analysis. In this answer, we'll show how to identify outliers in a dataset using SciPy. We'll use two methods for this purpose:
Z-score
IQR (Interquartile range)
First, we import the required libraries to identify outliers in the dataset:
import numpy as np
from scipy import stats
The z-score method quantifies how far a data point deviates from the mean, measured in standard deviations. We set a threshold (usually 2 or 3) and treat data points whose absolute z-score exceeds it as potential outliers.
Here is a code example that identifies outliers in the dataset using the z-score method:
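import numpy as np
from scipy import stats

np.random.seed(42)  # fix the seed for reproducibility
data = np.random.normal(0, 1, 1000)
data[900:] += 5  # shift the last 100 values to create outliers

z_scores = np.abs(stats.zscore(data))  # absolute z-score of each point

threshold = 3
outliers_z = np.where(z_scores > threshold)[0]  # indices above the threshold

print("Outliers detected using z-score:", outliers_z)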
Lines 1–2: We import the numpy and scipy libraries.
Line 4: We call np.random.seed() with the seed value 42 so that the generated random numbers are reproducible.
Line 5: We generate an array of 1000 random numbers drawn from a normal distribution with mean 0 and standard deviation 1.
Line 6: We add 5 to the values of data from index 900 onwards, introducing outliers.
Line 8: We compute the absolute z-score of every data point, that is, how many standard deviations it lies from the mean.
Line 10: We set the threshold value for identifying potential outliers.
Line 11: We find the indices of the values whose z-score exceeds the threshold.
Line 13: We print the indices of the identified outliers.
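To make the z-score calculation itself concrete, here is a minimal sketch that computes it by hand and compares it with stats.zscore. The sample values are hypothetical and chosen only so that the arithmetic is easy to follow; the comparison assumes the default ddof=0 (population standard deviation) that stats.zscore uses.

```python
import numpy as np
from scipy import stats

# Small illustrative sample (hypothetical values, not the dataset used above).
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Manual z-score: subtract the mean, then divide by the
# population standard deviation (ddof=0).
manual_z = (sample - sample.mean()) / sample.std(ddof=0)

# stats.zscore also defaults to ddof=0, so the two should agree.
scipy_z = stats.zscore(sample)

print("Manual z-scores:", manual_z)
print("Match SciPy:", np.allclose(manual_z, scipy_z))
```

Here the mean is 5 and the standard deviation is 2, so the value 9 has a z-score of 2 and would only count as an outlier under a threshold below 2.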
The IQR method computes the interquartile range, which is the distance between the first quartile (Q1) and the third quartile (Q3). It then treats data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as outliers.
Here is a code example that identifies outliers in the dataset using the IQR method:
import numpy as np
from scipy import stats

np.random.seed(42)  # fix the seed for reproducibility
data = np.random.normal(0, 1, 1000)
data[900:] += 5  # shift the last 100 values to create outliers

first_quar = np.percentile(data, 25)
third_quar = np.percentile(data, 75)
IQR = third_quar - first_quar

lower_limit = first_quar - 1.5 * IQR
upper_limit = third_quar + 1.5 * IQR

outliers_iqr = stats.iqr(data, nan_policy='omit', axis=0, rng=(25, 75)) * 1.5  # 1.5 * IQR via SciPy (not used below)

outliers = np.where((data < lower_limit) | (data > upper_limit))[0]  # indices outside the limits
print("Outliers detected using IQR:", outliers)
Line 8: We calculate the first quartile (25th percentile) of the data.
Line 9: We calculate the third quartile (75th percentile) of the data.
Line 10: We compute the range between the first and third quartiles.
Lines 12 and 13: We define upper and lower limits to identify potential outliers using the IQR method.
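Line 15: We compute the IQR with the SciPy library's iqr function and multiply it by 1.5. This gives the same margin used in lines 12 and 13; the value is shown for reference and is not used in the final filter.
Line 17: We determine the outliers using a logical condition that checks whether each data point lies outside the calculated limits.
Line 18: We print the indices of the identified outliers.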
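The same steps can be wrapped in a small helper that relies on stats.iqr for the range itself. This is a minimal sketch: the function name iqr_outlier_indices, the parameter k, and the sample values are hypothetical and not part of SciPy or of the example above.

```python
import numpy as np
from scipy import stats

def iqr_outlier_indices(values, k=1.5):
    """Return indices of points outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the conventional multiplier.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = stats.iqr(values)  # equals q3 - q1 with the default rng=(25, 75)
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.where((values < lower) | (values > upper))[0]

# Small illustrative sample with one obviously extreme value.
sample = np.array([10, 12, 11, 13, 12, 11, 50])
print("Outlier indices:", iqr_outlier_indices(sample))
```

For this sample, Q1 is 11 and Q3 is 12.5, so the limits are 8.75 and 14.75, and only the value 50 (index 6) falls outside them.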
A box plot is commonly used to visualize the presence of outliers in a dataset. Here is a visualization of the outliers identified using the IQR method:
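import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # recreate the same data as in the examples above
data = np.random.normal(0, 1, 1000)
data[900:] += 5

plt.figure(figsize=(10, 6))
plt.boxplot(data)  # points beyond the whiskers are drawn as individual markers
plt.title("Box Plot of Data with Outliers (IQR Method)")
plt.show()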
Note: The z-score and interquartile range (IQR) methods can flag different points in the same dataset. The z-score relies on the mean and standard deviation, which are themselves affected by outliers, whereas the IQR relies on quartiles, which are more robust to extreme values.
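To see how the two methods can disagree, the sketch below applies both to the same synthetic data used in this answer and compares the flagged indices. The variable names are illustrative. On this data, the shifted values likely inflate the mean and standard deviation enough that the z-score rule flags fewer points than the IQR rule.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5  # same shifted tail as in the examples above

# Z-score rule with a threshold of 3.
z_idx = np.where(np.abs(stats.zscore(data)) > 3)[0]

# IQR rule with the conventional 1.5 multiplier.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_idx = np.where((data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr))[0]

# Indices flagged by one method but not the other.
only_iqr = np.setdiff1d(iqr_idx, z_idx)
only_z = np.setdiff1d(z_idx, iqr_idx)

print("Flagged by z-score:", len(z_idx))
print("Flagged by IQR:", len(iqr_idx))
print("Flagged only by IQR:", len(only_iqr))
print("Flagged only by z-score:", len(only_z))
```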