How to identify outliers in a dataset using SciPy in Python

Outliers are data points that differ markedly from the rest of the dataset and can significantly influence an analysis. In this answer, we'll show how to identify outliers in a dataset using SciPy. We'll use two methods for this purpose:

  • Z-score

  • IQR (Interquartile range)

First, we import the required libraries to identify outliers in the dataset:

import numpy as np
from scipy import stats

Using Z-score method

This approach quantifies how far a data point lies from the mean, measured in standard deviations: z = (x - mean) / standard deviation. We set a threshold on the absolute z-score (usually 2 or 3) and treat the data points above it as potential outliers.
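
As a minimal sketch of the formula, the snippet below computes z-scores by hand on a small made-up sample and compares them with stats.zscore(). The sample values and the names manual_z and scipy_z are assumptions for illustration, not data from this answer.

```python
import numpy as np
from scipy import stats

# Small made-up sample (an assumption for illustration only)
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z = (x - mean) / standard deviation, using the population standard deviation (ddof=0)
manual_z = (sample - sample.mean()) / sample.std()

# stats.zscore() also uses ddof=0 by default, so the two results agree
scipy_z = stats.zscore(sample)
print(np.allclose(manual_z, scipy_z))
```

For this sample, the mean is 5 and the standard deviation is 2, so the value 9 has a z-score of 2.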

Coding example

Here is the coding example to identify outliers in the dataset using the z-score method:

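import numpy as np
from scipy import stats
# Build a reproducible sample and shift the last 100 values to create outliers
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5
# Absolute z-score of every data point
z_scores = np.abs(stats.zscore(data))
# Flag the points that lie more than 3 standard deviations from the mean
threshold = 3
outliers_z = np.where(z_scores > threshold)[0]
# np.where returns positions, so this prints indices rather than values
print("Outliers detected using z-score:", outliers_z)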

Explanation

  • Lines 1–2: We import NumPy and the stats module from SciPy.

  • Line 4: We call np.random.seed(42) to fix the seed of the random number generator so that the results are reproducible.

  • Line 5: We generate an array of 1000 random numbers drawn from a normal distribution with mean 0 and standard deviation 1.

  • Line 6: We add 5 to the values of data from index 900 onward, which introduces 100 outliers.

  • Line 8: We use stats.zscore() to compute, for every data point, how many standard deviations it lies from the mean, and take the absolute value with np.abs().

  • Line 10: We set the threshold value for identifying potential outliers.

  • Line 11: We use np.where() to find the indices of the data points whose absolute z-score exceeds the threshold.

  • Line 13: We print the indices of the identified outliers.
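
Because np.where() returns positions, it can help to look up the flagged values as well. The following sketch recreates the data from the example above, prints the flagged values, and counts how many of them fall among the 100 shifted points. The names outlier_values and injected_found are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Recreate the data from the example above
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5

z_scores = np.abs(stats.zscore(data))
outliers_z = np.where(z_scores > 3)[0]

# Index the array with the positions to get the flagged values themselves
outlier_values = data[outliers_z]

# Count how many flagged positions belong to the shifted block (index 900 onward)
injected_found = int(np.sum(outliers_z >= 900))

print("Flagged values:", outlier_values)
print("Flagged points among the shifted values:", injected_found, "of 100")
```

The shifted points also raise the mean and the standard deviation of the whole sample, so the count may be lower than 100.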

Using IQR method

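This method computes the interquartile range, IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles. It places a lower limit at Q1 - 1.5 * IQR and an upper limit at Q3 + 1.5 * IQR and considers the data points outside these limits as outliers.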
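
To make the two limits concrete, here is a small sketch on a made-up sample that contains one obviously extreme value. The sample and the variable names are assumptions for illustration.

```python
import numpy as np

# Small made-up sample with one extreme value (illustrative assumption)
sample = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Keep only the values that fall outside the two limits
flagged = sample[(sample < lower_limit) | (sample > upper_limit)]
print(flagged)
```

Here Q1 is 12 and Q3 is 13.75, so the limits are 9.375 and 16.375, and only the value 102 falls outside them.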

Coding example

Here is the coding example to identify outliers in the dataset using the IQR method:

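import numpy as np
from scipy import stats
# Same sample data as in the z-score example
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5
# Quartiles and the interquartile range
first_quar = np.percentile(data, 25)
third_quar = np.percentile(data, 75)
IQR = third_quar - first_quar
# Limits placed 1.5 * IQR beyond the quartiles
lower_limit = first_quar - 1.5 * IQR
upper_limit = third_quar + 1.5 * IQR
# SciPy computes the same IQR in a single call (shown for comparison, not used below)
iqr_scipy = stats.iqr(data, nan_policy='omit', axis=0, rng=(25, 75))
# Indices of the data points outside the limits
outliers = np.where((data < lower_limit) | (data > upper_limit))[0]
print("Outliers detected using IQR:", outliers)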

Explanation

  • Line 8: We calculate the first quartile (25th percentile) of the data.

  • Line 9: We calculate the third quartile (75th percentile) of the data.

  • Line 10: We compute the range between the first and third quartiles.

  • Lines 12 and 13: We define upper and lower limits to identify potential outliers using the IQR method.

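  • Line 15: We call SciPy's stats.iqr() function, which computes the interquartile range directly. Its result is not used in the later lines.

  • Line 17: We determine the outliers using a logical condition that checks whether each data point lies outside the calculated limits. np.where() returns the indices of those points.

  • Line 18: We print the indices of the identified outliers.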
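
The steps above can also be wrapped in a small function that uses stats.iqr() for the range itself. This is a minimal sketch: iqr_outlier_indices and its parameter k are hypothetical names introduced here and are not part of SciPy.

```python
import numpy as np
from scipy import stats

def iqr_outlier_indices(values, k=1.5):
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; not part of SciPy.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    spread = stats.iqr(values)  # equals q3 - q1 with the default settings
    lower_limit = q1 - k * spread
    upper_limit = q3 + k * spread
    return np.where((values < lower_limit) | (values > upper_limit))[0]

# Same sample data as in the example above
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5
print(iqr_outlier_indices(data))
```

A larger k widens the limits and flags fewer points; 1.5 matches the value used in the example above.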

Visualization of identified outliers

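A box plot is commonly used to visualize the outliers in a dataset. Matplotlib draws the whiskers at most 1.5 * IQR beyond the quartiles by default, so the points plotted beyond the whiskers correspond to the outliers identified by the IQR method. Here is the box plot of the data used in the examples above: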

import matplotlib.pyplot as plt
# `data` is the array created in the examples above
plt.figure(figsize=(10, 6))
plt.boxplot(data)
plt.title("Box Plot of Data with Outliers (IQR Method)")
plt.show()
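
To check that the points drawn beyond the whiskers match the IQR method, we can read them back from the object that plt.boxplot() returns. This is a sketch under the assumption that the default whisker setting is used; the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Same sample data as in the examples above
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5

# boxplot() returns a dict of the drawn artists; "fliers" holds the outlier markers
box = plt.boxplot(data)
flier_values = box["fliers"][0].get_ydata()

# The same limits as in the IQR example
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_values = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(len(flier_values), len(iqr_values))
```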

Note: The Z-score and interquartile range (IQR) methods can flag different points in the same dataset. The z-score relies on the mean and standard deviation, which the outliers themselves shift and inflate, whereas the IQR relies on quartiles, which are less sensitive to extreme values.
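
A quick way to see the difference is to run both methods on the same data and compare the sets of flagged indices. The sketch below does this for the sample data used in this answer; the names z_idx and iqr_idx are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Same sample data as in the examples above
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
data[900:] += 5

# Z-score method with a threshold of 3
z_idx = set(np.where(np.abs(stats.zscore(data)) > 3)[0])

# IQR method with limits at 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_idx = set(np.where((data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr))[0])

print("Flagged by z-score:", len(z_idx))
print("Flagged by IQR:", len(iqr_idx))
print("Flagged by both:", len(z_idx & iqr_idx))
print("Flagged by only one method:", len(z_idx ^ iqr_idx))
```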

Copyright ©2025 Educative, Inc. All rights reserved