What are outliers?

Key takeaways:

  • Outliers are data points that significantly differ from the rest of the dataset and can affect statistical analyses, such as the mean and standard deviation.

  • They can be caused by human error, natural variation, or rare events and require careful inspection before dismissal.

  • Outliers can be identified using visual representations (like scatter plots) and statistical techniques, including the z-score and the interquartile range (IQR) method.

  • The IQR method involves calculating Q1 and Q3, determining the IQR, and identifying outliers based on lower and upper bounds.

Outliers are data points or observations that differ significantly from the rest of the data. Even a single outlier can shift the mean and inflate the standard deviation of a dataset, which in turn can distort how the dataset is interpreted.

An outlier in the data
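
To make the effect on the mean and standard deviation concrete, here is a minimal sketch using only the Python standard library. The scores are hypothetical values chosen for illustration, with 150 as the single extreme value.

```python
from statistics import mean, stdev

# Hypothetical scores; 150 is the lone extreme value added for comparison
scores = [65, 68, 70, 80, 85, 90, 91]
with_outlier = scores + [150]

# Compare the summary statistics with and without the extreme value
print(f"Without outlier: mean={mean(scores):.2f}, stdev={stdev(scores):.2f}")
print(f"With outlier:    mean={mean(with_outlier):.2f}, stdev={stdev(with_outlier):.2f}")
```

In this sample, adding one extreme value raises the mean and more than doubles the sample standard deviation.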

Causes of outliers

Various reasons can result in outliers, including the following:

  • Human error can introduce outliers. Measurement errors and data entry mistakes made while recording the data are typical examples.

  • Natural variations can also be responsible for outliers. These outliers are not necessarily wrong but rather represent the tail-end of the distribution.

  • There can also be some true exceptions. These outliers are real data points that are out of the ordinary and are worth studying because they can reveal rare phenomena.

Outlier detection

Data points with sine curve and anomalous outlier

Statistical techniques can be very valuable in detecting outliers. The z-score measures how many standard deviations a point lies from the mean, while the IQR method uses the distance between the first and third quartiles to set bounds on typical values. Both can be used for outlier detection.
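
As an illustration of the z-score idea, the following minimal sketch computes z-scores by hand with the standard library. The readings are hypothetical, the helper name z_scores is our own, and the threshold of 3 is a common rule of thumb rather than a fixed standard.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of each value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

# Hypothetical sample with one extreme reading
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 40]
threshold = 3  # rule-of-thumb cutoff, an assumption for this sketch

# Keep the values whose absolute z-score exceeds the threshold
flagged = [x for x, z in zip(readings, z_scores(readings)) if abs(z) > threshold]
print("Flagged by z-score:", flagged)
```

Note that in very small samples a single extreme value also inflates the standard deviation, so its own z-score may stay below 3. The IQR method is less sensitive to this effect.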

Treatment

Outliers must be carefully studied before a decision is made to discard them. As noted above, true exceptions can be part of a dataset and can lead to interesting discoveries. Therefore, before dismissing outliers as errors, we examine them to understand their true nature:

  • Data validation: We check the data for errors and typos and trace our data collection methods to determine whether the point of concern results from a mistake.

  • Contextual meaning: We deep-dive into understanding the context of the data points, knowing that the outliers could represent legitimate but rare phenomena. Therefore, we consider whether the value of this data point is a physical possibility and what the statistical chances of its appearance are.

  • Data distribution: We assume or fit a probability distribution, such as the normal distribution, and analyze whether the data point falls within the range of values that distribution can plausibly produce.

To reiterate, not all outliers result from errors, so they must be carefully scrutinized before dismissing them.
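
The data distribution check can be sketched as follows. This is only an illustration: the reference distribution (heights with a mean of 170 cm and a standard deviation of 10 cm) and the helper name upper_tail_probability are assumptions made for the example, not values from any real dataset.

```python
from statistics import NormalDist

# Hypothetical reference distribution: heights in cm, assumed to be normal
heights = NormalDist(mu=170, sigma=10)

def upper_tail_probability(value, dist):
    """Probability of observing a value at least this large under dist."""
    return 1 - dist.cdf(value)

# Compare a moderately large value with an extreme one
for candidate in (185, 210):
    p = upper_tail_probability(candidate, heights)
    print(f"P(height >= {candidate} cm) = {p:.6f}")
```

A small tail probability does not prove that a value is an error. It only indicates that the value is rare under the assumed distribution and deserves a closer look.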

Calculate outliers using the IQR method

The interquartile range (IQR) method is a common statistical technique for identifying outliers. Here are the steps to calculate outliers using the IQR method:

  • Sort the data in ascending order.

  • Calculate Q1 and Q3 quartiles.

    • Q1 is the 25th percentile of the data. It marks the point below which 25% of the data falls.

    • Q3 is the 75th percentile. It marks the point below which 75% of the data falls.

Note: Learn how to calculate quartiles in Python in our Answer: Basic Statistics Using Python.

  • Calculate the IQR, which is the difference between Q3 and Q1: IQR = Q3 - Q1.

  • The lower and upper bounds for outliers are calculated using the IQR:

    • Lower bound = Q1 - 1.5 × IQR

    • Upper bound = Q3 + 1.5 × IQR

  • Any data point below the lower bound or above the upper bound is considered an outlier.

Let’s implement the steps above in Python. The following code uses the quantile() method to calculate Q1 and Q3, allowing us to identify outliers using the IQR method:

import pandas as pd
# Sample data; quantile() does not require the values to be sorted
data = {
    'Scores': [70, 20, 80, 85, 90, 65, 68, 91, 150]  # Outliers: 20 and 150
}
# Create DataFrame
df = pd.DataFrame(data)
# Calculate Q1, Q3, and IQR for the Scores column
Q1 = df['Scores'].quantile(0.25)
Q3 = df['Scores'].quantile(0.75)
IQR = Q3 - Q1
# Identify outliers
outliers_df = df[(df['Scores'] < (Q1 - 1.5 * IQR)) | (df['Scores'] > (Q3 + 1.5 * IQR))]
# Print outliers without index
print("Outliers:")
print(outliers_df.to_string(index=False))
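
As a cross-check on the pandas example, here is a minimal sketch that applies the same rule with only the standard library and also reports the bounds. The helper name iqr_outliers and its parameter k are hypothetical. The method="inclusive" option is chosen because it interpolates linearly between data points, which corresponds to the default behavior of quantile() in pandas.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the IQR rule."""
    # quantiles() sorts the data internally and returns [Q1, median, Q3] for n=4
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

scores = [70, 20, 80, 85, 90, 65, 68, 91, 150]
lower, upper, outliers = iqr_outliers(scores)
print(f"Bounds: [{lower}, {upper}]")
print("Outliers:", outliers)
```

For these scores, Q1 is 68 and Q3 is 90, so the IQR is 22 and the bounds are 35 and 123. The values 20 and 150 fall outside them.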


Conclusion

Outliers are crucial in data analysis, data mining, and machine learning. Understanding and correctly identifying their nature using methods like the IQR is essential for drawing accurate conclusions from data. While outliers can indicate errors or anomalies, they may also reveal valuable insights. Therefore, it is vital to analyze them carefully before deciding to exclude them from datasets.

Frequently asked questions



Are outliers always errors?

No, outliers can represent legitimate data points that provide valuable insights, not just errors.


How can we identify outliers in a dataset using SciPy in Python?

Outliers are data points that differ from the rest of the dataset and can significantly influence the analysis. We can identify outliers in a dataset using SciPy with the following two methods:

  • z-score
  • IQR (Interquartile range)

Here is a code example using the z-score method:

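import numpy as np
from scipy import stats

# Generate 1,000 standard normal samples with a fixed seed for reproducibility
np.random.seed(42)
data = np.random.normal(0, 1, 1000)
# Shift the last 100 samples by 5 so that they stand apart from the rest
data[900:] += 5

# Absolute z-score of every sample
z_scores = np.abs(stats.zscore(data))

# Flag samples more than 3 standard deviations from the mean
threshold = 3
outliers_z = np.where(z_scores > threshold)[0]

print("Indices of outliers detected using z-score:", outliers_z)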

Check out our detailed Answer on How to identify outliers in a dataset using SciPy in Python.


How can we detect an outlier with the local outlier factor?

The local outlier factor (LOF) is a density-based technique that identifies outliers by comparing the local density around each data point with the density around its nearest neighbors. Data points that lie in regions of substantially lower density than their neighbors are considered anomalous.
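
The following is a minimal sketch of this idea, assuming scikit-learn is installed and using its LocalOutlierFactor estimator. The data (a small grid of points plus one distant point) and the choice of n_neighbors=5 are hypothetical and only meant for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical 2-D data: a tight 5x4 grid of points plus one distant point
grid = np.array([[x, y] for x in range(5) for y in range(4)], dtype=float)
points = np.vstack([grid, [[10.0, 10.0]]])

# n_neighbors=5 is an illustrative choice; tune it for real data
lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(points)  # -1 marks outliers, 1 marks inliers

print("Outlier rows:", points[labels == -1])
# More negative scores indicate lower density relative to the neighbors
print("Score of the distant point:", lof.negative_outlier_factor_[-1])
```

The distant point receives the label -1 because the region around it is far sparser than the regions around its nearest neighbors in the grid.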

Check out our detailed Answer on Outlier detection with the local outlier factor.



Copyright ©2025 Educative, Inc. All rights reserved