What is the curse of dimensionality and how to overcome it?

Problem diagnosis

Machine learning algorithms such as k-NN (k-nearest neighbors) work on the inductive bias that points that are close to each other in the feature space share the same class label.

This measure of proximity maps well onto real-life examples. However, a fundamental issue with this approach arises when data spreads into higher dimensions. An increasing number of dimensions in the feature space means that seemingly close data instances are in fact quite different and far from one another. Since k-NN uses a fixed rule, taking into account the $k$ nearest data instances, the algorithm must always return some relatively close points. However, when we describe these data points with a greater number of features, the relatively close points turn out to be distant from the test instance in absolute terms, despite being nearer than the rest of the training data. This problem with data spread over many dimensions is called the curse of dimensionality.

The illustration above depicts this problem. The points clustered together when $k = 3$ in 3 dimensions are definitely close on a relative scale compared to every other point in the dataset. However, their nearness is not that great when the points are viewed in isolation.

This erodes the inductive bias of k-NN itself, bringing the suitability of the algorithm into question.
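
To see this effect numerically, the short simulation below (a minimal sketch using NumPy; the sample size, random seed, and choice of dimensions are illustrative assumptions, not from the original text) samples points uniformly in a unit hypercube and compares the distance from a test instance to its $k$-th nearest neighbor with the median distance to all points. As the number of dimensions grows, the ratio approaches 1, meaning the "nearest" neighbors are barely closer than typical points.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (illustrative)
n, k = 1000, 7                  # uniformly sampled points and neighbors considered

for d in (3, 10, 100):
    points = rng.random((n, d))             # training points in the unit hypercube
    query = rng.random(d)                   # a single test instance
    dists = np.sort(np.linalg.norm(points - query, axis=1))
    ratio = dists[k - 1] / np.median(dists) # k-th nearest vs. a "typical" distance
    print(f"d={d:>3}  k-th nearest={dists[k - 1]:.3f}  "
          f"median={np.median(dists):.3f}  ratio={ratio:.2f}")
```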

Comparative analysis

Let our initial hypothesis be that the data points are sampled uniformly within a container in a hyperspace of $d$ dimensions, each of unit length. Moreover, within this hyperspace, let there be a smaller subspace that encapsulates the $k$ nearest points to a given test point. Let the length of each of the subspace's dimensions be $x$, which is the minimum length required to encapsulate the most distant of those nearest neighbors.

Since the training instances are sampled uniformly, the ratio of the volume of the hyperspace to the volume of the subspace is approximately the same as the ratio of the total number of points $n$ to $k$.

The side length $x$ can then be approximated using the following relationship:

$$\frac{1^d}{x^d} \approx \frac{n}{k} \;\;\Longrightarrow\;\; x \approx \left(\frac{k}{n}\right)^{1/d}$$

Assuming $k = 7$ and $n = 1000$, while $d = 3$, we can approximate $x$ to be 0.191.

When we increase the number of dimensions from 3 to 100 for the same values of $k$ and $n$, $x$ grows to approximately 0.95.
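
This back-of-the-envelope calculation can be reproduced in a few lines of Python (a small helper written purely for illustration; the function name is not from the original text):

```python
def neighborhood_side(k: int, n: int, d: int) -> float:
    """Approximate side length x of the subspace holding the k nearest of n uniform points."""
    return (k / n) ** (1 / d)

print(neighborhood_side(k=7, n=1000, d=3))    # ~0.191
print(neighborhood_side(k=7, n=1000, d=100))  # ~0.95
```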

From this, we can deduce the following: as the number of dimensions increases, the neighborhood containing the $k$ nearest points stretches until it spans nearly the entire space, so the distances to the nearest neighbors become too large for k-NN's inductive bias of proximity to remain reasonable.

Overcoming the curse of dimensionality

Real-world datasets have a tremendous number of dimensions. Classical problems such as image classification or email filtering come with many supplementary features, drawing us into the curse of dimensionality almost every time.

We can address this by reducing our dimensions so that only the most useful and informative features are used. This way, we can work in spaces with low intrinsic dimensionality (the actual number of features needed to classify data points efficiently), which are not subject to the curse of dimensionality.
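
As one illustration of this idea, the sketch below (an example using scikit-learn's PCA and KNeighborsClassifier on synthetic data; the dataset shape, component count, and other parameters are assumptions made for the example) projects high-dimensional points onto a few principal components before running k-NN, so that proximity is measured in a space of low intrinsic dimensionality.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic data: 200 samples with 5 informative features and 95 noise features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
informative = labels[:, None] + 0.3 * rng.standard_normal((200, 5))
noise = rng.standard_normal((200, 95))
X = np.hstack([informative, noise])

# Reduce to a handful of principal components before applying k-NN,
# so the classifier measures distances in a low-dimensional space.
model = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=7))
model.fit(X, labels)
print(f"Training accuracy: {model.score(X, labels):.2f}")
```

Feature selection, i.e., keeping only the most informative of the original features, is an alternative to projection when the interpretability of individual features matters.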
