Machine learning algorithms like k-NN work based on the inductive bias that points that are close to each other in the feature space share the same class label.
This measure of proximity maps naturally onto real-life intuition. However, when data spreads into higher dimensions, a fundamental issue with this approach arises. An increasing number of dimensions in the feature space implies that seemingly close data instances are in fact quite different and far from one another. Since k-NN uses a fixed recipe, which takes the distances to the nearest training instances at face value, this directly undermines its predictions.
This problem can be visualized in the illustration above: points that appear clustered together in low dimensions drift far apart as more dimensions are added.
This erodes the inductive bias of k-NN itself, bringing the suitability of the algorithm into question.
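This spreading-out of distances can be sketched numerically. The following is an illustrative experiment of my own (the point count, dimensions, and seed are arbitrary choices, not from the article): it samples points uniformly in a unit hypercube and measures how much pairwise distances vary relative to their mean. In high dimensions the relative spread collapses, so every point looks roughly equally far from every other, and "nearest" loses its meaning.

```python
# Illustrative sketch: distance concentration in high dimensions.
import math
import random

def relative_spread(d, n=200, seed=0):
    """Std/mean of pairwise Euclidean distances between n uniform points in [0,1]^d."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n)]
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            dists.append(math.dist(pts[i], pts[j]))
    mean = sum(dists) / len(dists)
    var = sum((x - mean) ** 2 for x in dists) / len(dists)
    return math.sqrt(var) / mean

low, high = relative_spread(3), relative_spread(100)
print(f"d=3:   relative spread of distances ~ {low:.3f}")
print(f"d=100: relative spread of distances ~ {high:.3f}")  # far smaller: distances look alike
```

With 100 dimensions the spread shrinks by several times compared to 3 dimensions, which is the concentration effect described above.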
Let our initial hypothesis be that the data points are sampled uniformly within a unit hypercube in d-dimensional space, and suppose we want the k nearest neighbors of a query point among N training instances.

Since the training instances are sampled uniformly, the ratio of the volume of the whole hypercube to the volume of a sub-hypercube around the query point would approximately be the same as the ratio of the total number of points, N, to the number of points, k, that the sub-hypercube contains.

Assuming the sub-hypercube has edge length l, its volume is l^d, so l^d ≈ k/N and l ≈ (k/N)^(1/d). For illustration, take k = 10 and N = 1000: in 3 dimensions, l ≈ 0.01^(1/3) ≈ 0.22, a small, genuinely local neighborhood.

When we increase the number of dimensions from 3 to 100 for the same values of k and N, the required edge length becomes l ≈ 0.01^(1/100) ≈ 0.95, nearly the full extent of the space along every axis.
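The calculation above can be checked in a few lines (k = 10 and N = 1000 are illustrative values, not prescribed by the argument):

```python
# Edge length of the sub-hypercube that must hold ~k of N uniform points
# in a d-dimensional unit hypercube: l = (k/N)**(1/d).
def required_edge(k, n, d):
    """Edge length whose volume fraction captures roughly k of n uniform points."""
    return (k / n) ** (1 / d)

print(required_edge(10, 1000, 3))    # ~0.215: a small local neighborhood
print(required_edge(10, 1000, 100))  # ~0.955: nearly the whole cube
```

So in 100 dimensions, the "neighborhood" containing just 1% of the data already spans about 95% of the space along each axis.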
From this, we can deduce the following: an increase in the number of dimensions inflates the distances to the k nearest neighbors so much that they can no longer reasonably be called near, and k-NN's inductive bias of proximity stops holding.
Real-world data sets often have a tremendous number of dimensions. Classical problems such as image classification or email spam filtering involve a large number of features, drawing us into the curse of dimensionality almost every time.
We can address this by reducing our dimensions so that only the most useful and discriminative features are kept, for example through feature selection or dimensionality-reduction techniques such as PCA. This way, we can work in spaces with far fewer dimensions, where proximity is meaningful again.
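As a minimal sketch of one such mitigation, the snippet below keeps only the highest-variance features before computing distances. The variance criterion, the data, and the function names here are illustrative assumptions of mine, not a method prescribed by the article:

```python
# Illustrative feature selection: keep the m highest-variance features,
# so k-NN distances are computed in a lower-dimensional space.
def column_variances(rows):
    """Population variance of each feature (column) across the data set."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [sum((r[j] - means[j]) ** 2 for r in rows) / n
            for j in range(len(means))]

def select_top_features(rows, m):
    """Return each row restricted to the m highest-variance features."""
    variances = column_variances(rows)
    keep = sorted(range(len(variances)), key=lambda j: -variances[j])[:m]
    keep.sort()  # preserve the original feature order
    return [[r[j] for j in keep] for r in rows], keep

data = [
    [1.0, 5.0, 0.01],
    [2.0, 5.0, 0.02],
    [3.0, 5.0, 0.01],
    [4.0, 5.0, 0.02],
]
reduced, kept = select_top_features(data, 1)
print(kept)  # [0]: feature 0 varies the most; the constant feature 1 is dropped
```

In practice, learned projections such as PCA are usually preferred over raw variance ranking, but the goal is the same: fewer, more informative dimensions.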