Outlier detection with the local outlier factor

The local outlier factor (LOF) is a density-based technique that identifies outliers by comparing each data point with its neighbors. Data points that lie in areas of lower density than their neighbors are considered anomalous.

Algorithm

This technique uses the following formula to calculate an anomaly score for each data point and then categorizes the points to find the anomalous ones.

The formula used for the local outlier factor

In this algorithm, we compare the local density of a data point xi with the local densities of its neighbors to calculate the anomaly score.

  • a: We calculate the average local reachability density of the data points in the neighborhood of xi.

  • b: The total number of elements present in the neighborhood of xi.

  • c: We calculate the local reachability density of xi itself.
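
For reference, the standard definition of the local outlier factor can be written as below. The symbol names are our own choice: N_k(x_i) is assumed to denote the set of the k nearest neighbors of x_i, lrd_k the local reachability density, and d the distance between two points. Read against the list above, the numerator is the average from item a (a sum divided by the count from item b), and the denominator is the density from item c.

```latex
\mathrm{LOF}_k(x_i) =
  \frac{\dfrac{1}{|N_k(x_i)|} \sum_{x_j \in N_k(x_i)} \mathrm{lrd}_k(x_j)}
       {\mathrm{lrd}_k(x_i)},
\qquad
\mathrm{lrd}_k(x_i) =
  \left( \frac{1}{|N_k(x_i)|} \sum_{x_j \in N_k(x_i)}
  \max\bigl(k\text{-distance}(x_j),\, d(x_i, x_j)\bigr) \right)^{-1}
```

A score close to 1 means that xi is about as dense as its neighbors, while a score well above 1 means that xi lies in a sparser region than its neighbors.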

How does it work?

We create or import a dataset and then calculate the anomaly score for each data point. In the plot, each point is drawn with a circular marker whose size reflects its score. Points in dense regions are considered normal, while points in regions that are sparse compared with their neighbors, or far away from any cluster, are considered outliers.
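
To make the score calculation concrete, here is a minimal brute-force sketch that uses only numpy. The function name local_outlier_factor, the parameter k, and the toy data are our own illustrative choices, not part of sklearn. The sketch assumes Euclidean distance and no duplicate points.

```python
import numpy as np

def local_outlier_factor(X, k):
    """Brute-force LOF scores for the rows of X using k neighbors (illustrative helper)."""
    # Pairwise Euclidean distances; a point is never its own neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    # Indices and distances of the k nearest neighbors of every point.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    neighbor_dist = np.take_along_axis(dist, neighbors, axis=1)
    k_distance = neighbor_dist[:, -1]

    # Reachability distance from x_i to its neighbor x_j:
    # max(k-distance of x_j, distance between x_i and x_j).
    reach_dist = np.maximum(neighbor_dist, k_distance[neighbors])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach_dist.mean(axis=1)

    # LOF: average density of the neighbors divided by the point's own density.
    return lrd[neighbors].mean(axis=1) / lrd

# Toy data (hypothetical): one tight cluster plus a single far-away point.
rng = np.random.default_rng(0)
cluster = 0.3 * rng.standard_normal((50, 2))
points = np.vstack([cluster, [[4.0, 4.0]]])

scores = local_outlier_factor(points, k=10)
print("score of the far-away point:", scores[-1])
print("median score inside the cluster:", np.median(scores[:-1]))
```

The far-away point receives a score well above 1, while the points inside the cluster stay close to 1. The n × n distance matrix makes this approach practical only for small datasets.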

Representation of data points and scores in plot

How do we implement it?

Let's write the code step by step: generate sample data, fit the model to it, and then create a scatter plot to visualize the results of the algorithm.

When choosing the number of neighbors (the n_neighbors parameter) for the dataset, keep in mind that it should be:

  • Greater than the minimum number of samples a cluster has to contain so that other samples can be local outliers relative to this cluster.

  • Smaller than the maximum number of close-by samples that can potentially be local outliers.
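
One way to get a feel for this trade-off is to fit the model with a few different neighbor counts and compare the number of misclassified points. The sketch below assumes the same kind of synthetic data used later in this answer. The candidate values 5, 24, and 100 are arbitrary picks for illustration, not recommendations.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: two clusters of 140 inliers each and 30 uniform outliers.
rng = np.random.RandomState(42)
cluster = 0.3 * rng.randn(140, 2)
X = np.r_[cluster + 2, cluster - 2, rng.uniform(low=-4, high=4, size=(30, 2))]
truth = np.r_[np.ones(280, dtype=int), -np.ones(30, dtype=int)]

# Count how many labels disagree with the ground truth for each neighbor count.
errors_by_k = {}
for k in (5, 24, 100):
    pred = LocalOutlierFactor(n_neighbors=k, contamination=0.1).fit_predict(X)
    errors_by_k[k] = int((pred != truth).sum())

print(errors_by_k)
```

A labeled ground truth like this is rarely available in practice, so a sweep of this kind is mainly useful on synthetic or benchmark data.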

Before starting the code, let's understand the modules we must import and how they are used.

Required imports

We import the following from the numpy, matplotlib, and sklearn libraries.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerPathCollection
from sklearn.neighbors import LocalOutlierFactor
  • numpy handles data arrays and performs numerical operations.

  • matplotlib.pyplot creates and customizes data visuals, including various types of plots.

  • matplotlib.legend_handler provides the handler classes that control how legend entries are drawn. We import HandlerPathCollection to customize the legend marker of the scatter plot.

  • sklearn.neighbors provides nearest-neighbor-based learning methods. We import LocalOutlierFactor to detect outliers.

Step 1: Generate data

We generate the random sample data using numpy's random module.

  • random.randn is used to create the clusters of inliers.

  • random.uniform is used to create the outliers.

Once the random data is generated, we count the outlying data points and assign them a ground_truth label of -1, while the inlying points keep the label 1. In dataArr, the inlying points come first and the outlying points come last, so the last n_outliers labels are the ones to set to -1.

import numpy as np
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(140, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(30, 2))
dataArr = np.r_[X_inliers, X_outliers]
n_outliers = len(X_outliers)
ground_truth = np.ones(len(dataArr), dtype=int)
ground_truth[-n_outliers:] = -1
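
As a quick sanity check, the sketch below repeats the data generation and prints the resulting shapes and label counts. Note that stacking the shifted copies with np.r_ doubles the 140 generated points, so there are 280 inliers in total.

```python
import numpy as np

np.random.seed(42)
X_inliers = 0.3 * np.random.randn(140, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]  # two clusters of 140 points
X_outliers = np.random.uniform(low=-4, high=4, size=(30, 2))
dataArr = np.r_[X_inliers, X_outliers]

ground_truth = np.ones(len(dataArr), dtype=int)
ground_truth[-len(X_outliers):] = -1

print(X_inliers.shape)  # (280, 2)
print(dataArr.shape)    # (310, 2)

# Count how many points carry each label.
labels, counts = np.unique(ground_truth, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # {-1: 30, 1: 280}

# The last 30 rows of dataArr are exactly the uniform outliers.
print(np.array_equal(dataArr[-30:], X_outliers))  # True
```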

Step 2: Fit the model

We create an instance of the LocalOutlierFactor class that we import from the sklearn.neighbors module. Once the instance is created, we use fit_predict to fit the model on our dataset and predict a label for each point: 1 for inliers and -1 for outliers. Along with the predictions, we also store the number of prediction errors, that is, the points whose predicted label differs from the ground truth, as well as the outlier scores of the data points.

from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(n_neighbors=24, contamination=0.1)
y_pred = clf.fit_predict(dataArr)
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_
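
To see what the fitted model returns, the sketch below inspects the labels and scores. It rebuilds the same data so that it can run on its own. It also reads the fitted offset_ attribute, which is the score threshold that sklearn derives from contamination.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(42)
X_inliers = 0.3 * np.random.randn(140, 2)
dataArr = np.r_[X_inliers + 2, X_inliers - 2,
                np.random.uniform(low=-4, high=4, size=(30, 2))]

clf = LocalOutlierFactor(n_neighbors=24, contamination=0.1)
y_pred = clf.fit_predict(dataArr)
X_scores = clf.negative_outlier_factor_

# Predicted labels are 1 for inliers and -1 for outliers.
print(np.unique(y_pred))

# The share of points flagged as outliers is close to contamination.
print((y_pred == -1).mean())

# The scores are the negated LOF values: the lower, the more anomalous.
print(X_scores.min(), X_scores.max())

# Points whose score falls below offset_ are the ones labeled -1.
print(np.array_equal(y_pred == -1, X_scores < clf.offset_))
```

Because the scores are negated, a typical inlier has a score near -1, and strongly anomalous points have much lower values.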

Step 3: Plot the results

Once the data is generated and the model is fitted, we plot the predictions and scores. We create a scatter plot over a fixed axis range in which every data point is drawn at its coordinates. Around each point, we draw a circular marker whose size grows with the point's outlier score, so the largest circles mark the most anomalous points.
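
The marker sizes come from a min-max normalization of the scores, which the example code stores in radius. The small sketch below applies the same expression to three made-up score values (hypothetical numbers chosen only to keep the arithmetic easy to follow).

```python
import numpy as np

# Hypothetical negated LOF scores: the lower the score, the more anomalous the point.
scores = np.array([-1.0, -1.2, -3.0])

# Map the most normal score to 0 and the most anomalous score to 1.
radius = (scores.max() - scores) / (scores.max() - scores.min())
print(radius)  # approximately [0.0, 0.1, 1.0]

# Marker sizes passed to scatter() through the s parameter.
sizes = 1000 * radius
print(sizes)   # approximately [0, 100, 1000]
```

The most normal point gets a radius of 0, so no visible circle is drawn around it, and the most anomalous point gets the largest circle.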

Example code

In this example, we create a randomly generated dataset and show the results in a scatter plot.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from matplotlib.legend_handler import HandlerPathCollection

np.random.seed(42)

X_inliers = 0.3 * np.random.randn(140, 2)
X_inliers = np.r_[X_inliers + 2, X_inliers - 2]
X_outliers = np.random.uniform(low=-4, high=4, size=(30, 2))
dataArr = np.r_[X_inliers, X_outliers]

n_outliers = len(X_outliers)
ground_truth = np.ones(len(dataArr), dtype=int)
ground_truth[-n_outliers:] = -1

clf = LocalOutlierFactor(n_neighbors=24, contamination=0.1)
y_pred = clf.fit_predict(dataArr)
n_errors = (y_pred != ground_truth).sum()
X_scores = clf.negative_outlier_factor_

def update_legend_marker_size(handle, orig):
    "Customize size of the legend marker"
    handle.update_from(orig)
    handle.set_sizes([20])


plt.scatter(dataArr[:, 0], dataArr[:, 1], color="blue", s=3.0, label="Data points")

radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())

scatter = plt.scatter(
    dataArr[:, 0],
    dataArr[:, 1],
    s=1000 * radius,
    edgecolors="purple",
    facecolors="none",
    label="Outlier scores",
)

plt.axis("tight")
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.xlabel("prediction errors: %d" % (n_errors))

plt.legend(
    handler_map={scatter: HandlerPathCollection(update_func=update_legend_marker_size)}

)

plt.title("Plot Local Outlier Factor Results")

plt.show()
Detecting outliers using the local outlier factor technique.

Code explanation

  • Lines 1–4: Import the required libraries and classes.

  • Line 6: Set a random seed so that the code produces the same result every time it is executed.

  • Lines 8–11: Generate the inlier and outlier data points and store them in dataArr. The inliers form two clusters of 140 points each (280 in total), and there are 30 outlying data points.

  • Lines 13–15: Count the outliers, initialize every ground_truth label to 1, and set the labels of the outlying data points to -1.

  • Line 17: Create a LocalOutlierFactor instance and save it in clf. We specify n_neighbors as 24 and contamination as 0.1.

  • Lines 18–20: Fit the LocalOutlierFactor algorithm to the data in dataArr and obtain the predictions, the number of errors, and the outlier scores.

  • Lines 22–25: Define a helper function that sets the size of the legend marker.

  • Line 28: Create a scatter plot of the data points using scatter() and pass the color, size, and label as parameters.

  • Line 30: Normalize the outlier scores to the range 0 to 1 so that the most anomalous point gets the largest radius.

  • Lines 32–39: Plot a circle around each data point, sized by its normalized outlier score, and specify its properties.

  • Lines 41–44: Set the plot limits for the x-axis and the y-axis and set the x-axis label, which shows the number of prediction errors.

  • Lines 46–49: Create a legend for the plot and set the size of its markers using update_legend_marker_size.

  • Line 51: Set a title for the plot.

  • Line 53: Use show() to display the plot.

Code output

A scatter plot is created that shows the data points in blue. A purple circle is drawn around each data point, and its size represents that point's outlier score.

The plot created through the example code.

Summary

The local outlier factor is an effective unsupervised technique for detecting anomalies in a dataset, and its results can be presented in a scatter plot. We can implement it using the following steps:

  • Generate random data for the inliers and outliers and store it in an array.

  • Fit the model on the data points and obtain the results using the algorithm.

  • Plot the obtained results in a scatter plot that can be used to identify the inliers and outliers.

Test your understanding

Match each purpose with the call that serves it.

Purposes:

  • To ensure the same plot is produced every time the code runs.

  • To generate random data for the outliers.

  • To control the properties of the legend.

Options:

  • HandlerPathCollection()

  • np.random.seed(42)

  • np.random.uniform()


Copyright ©2025 Educative, Inc. All rights reserved