How k-means clustering and feature selection are NP-hard

NP-hard (nondeterministic polynomial time hard) refers to the class of computational problems that are at least as hard as the hardest problems in the NP class.

In other words, a problem is NP-hard if every problem in the NP class can be reduced to it in polynomial time. This means that if there were an efficient algorithm for solving any NP-hard problem, it could be used to efficiently solve every problem in the NP class. Note that an NP-hard problem does not itself have to belong to NP.

Therefore, no polynomial-time algorithm is known for finding an exact solution to an NP-hard problem: assuming P ≠ NP, the time required grows superpolynomially, and for known exact algorithms typically exponentially, with the problem's size. This implies that solving large instances of NP-hard problems exactly is infeasible.
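To get a feel for this growth, here's a minimal sketch (plain arithmetic, not tied to any particular problem) comparing a polynomial step count with an exponential one:

# Compare how polynomial and exponential step counts grow with input size n
for n in [10, 20, 30, 40, 50]:
    polynomial = n ** 3   # e.g., a cubic-time algorithm
    exponential = 2 ** n  # e.g., brute force over all subsets
    print(f"n={n:2d}  n^3={polynomial:,}  2^n={exponential:,}")

At n = 50, the exponential count already exceeds 10^15 steps, which is why exact algorithms stop being practical long before polynomial ones do.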

Examples of NP-hard problems

  • Traveling salesperson problem

  • 0/1 knapsack problem

  • Graph coloring problem

A problem is NP-complete if it belongs to the NP class and every other problem in the NP class can be transformed (reduced) to it in polynomial time. Therefore, a problem must be a part of both the NP and NP-hard classes to be NP-complete. Boolean satisfiability (SAT) is a classic example.

NP-hard problem example

The hardness of k-means clustering is related to the Euclidean k-center problem, a well-known NP-hard problem. In the Euclidean k-center problem, we are given a set of points in Euclidean space, and we want to choose k centers such that the maximum distance between a point and its closest center is minimized.
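Note that while finding optimal centers is hard, evaluating the k-center objective for a given candidate set of centers is easy. A minimal sketch, using illustrative placeholder data:

import numpy as np
from scipy.spatial.distance import cdist

def k_center_objective(X, centers):
    # For each point, find the distance to its closest center,
    # then take the worst case over all points.
    return cdist(X, centers).min(axis=1).max()

X = np.random.rand(100, 2)  # illustrative data
centers = X[[0, 1, 2]]      # an arbitrary candidate set of k = 3 centers
print(k_center_objective(X, centers))

The hardness lies entirely in searching over which k centers to pick, not in scoring a given choice.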

The k-means clustering problem is a relaxation of the Euclidean k-center problem in which a center can be any point in the space, not just one of the given points. The k-means clustering problem seeks to partition the points into k clusters so as to minimize the sum of squared distances between each point and its cluster center.

While the k-means clustering problem is computationally more manageable in practice than the Euclidean k-center problem, it is still NP-hard in the worst case, as shown by reductions from known NP-hard problems.
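In practice, libraries sidestep the worst case with heuristics. Here's a minimal sketch using scikit-learn's KMeans, which runs Lloyd's algorithm and returns a locally optimal solution rather than a guaranteed global optimum:

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)
X = np.random.rand(100, 2)  # illustrative data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# inertia_ is the k-means objective: the sum of squared distances
# between each point and its assigned cluster center
print("Objective (inertia):", kmeans.inertia_)
print("Centers:\n", kmeans.cluster_centers_)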

[Figure: Euclidean k-center problem]

In the first image, we have 100 data points, and we want to find all possible combinations of k = 3 center points. The number of combinations is given by the binomial coefficient formula: (n choose k) = n! / (k! * (n - k)!):

(100 choose 3) = 100! / (3! * 97!) = 161,700

This means that there are 161,700 possible combinations of three points that we need to consider to solve the Euclidean k-center problem for 100 data points. As we can see, the number of possible combinations grows very quickly with k and can become very large even for a relatively small number of data points.
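We can verify this count with a one-liner from Python's standard library:

import math

# Number of ways to choose k = 3 centers out of n = 100 points
print(math.comb(100, 3))  # 161700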

Let's go through an executable example of the Euclidean k-center problem:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

# generate random data points
np.random.seed(0)
X = np.random.rand(100, 2)

# plot data points
plt.figure(figsize=(7, 7))
plt.scatter(X[:, 0], X[:, 1])
plt.title("Data points")
plt.savefig("output/graph1.png", dpi=300)
plt.show()

# solve the Euclidean k-center problem as a mixed-integer program
k = 3
n = X.shape[0]
d = cdist(X, X, 'euclidean')
model = LpProblem(name='Euclidean_k-center', sense=LpMinimize)
# y[i] = 1 if point i is selected as a center
y = {i: LpVariable(name=f'y{i}', cat='Binary') for i in range(n)}
# x[i, j] = 1 if point j is assigned to center i
x = {(i, j): LpVariable(name=f'x{i}_{j}', cat='Binary')
     for i in range(n) for j in range(n)}
# z is the maximum distance between a point and its assigned center
z = LpVariable(name='z', lowBound=0, upBound=np.sqrt(2), cat='Continuous')
model += z
for j in range(n):
    # each point is assigned to exactly one center
    model += lpSum(x[i, j] for i in range(n)) == 1
    # a point can only be assigned to a selected center
    for i in range(n):
        model += x[i, j] <= y[i]
    # the distance to the assigned center is at most z
    model += lpSum(d[i, j] * x[i, j] for i in range(n)) <= z
# select exactly k centers (solving this MIP may take a while)
model += lpSum(y.values()) == k
model.solve()

# extract solution
centers = np.array([X[i, :] for i in range(n) if value(y[i]) > 0.5])
labels = np.argmin(cdist(X, centers, 'euclidean'), axis=1)

# plot clusters
plt.figure(figsize=(7, 7))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='x', s=200, linewidths=3, color='r')
plt.title("Euclidean k-center clustering")
plt.savefig("output/graph2.png", dpi=300)
plt.show()
  • Lines 7–8: A random seed is fixed, and 100 random data points with two dimensions are generated.

  • Lines 11–15: The data points are then plotted using Matplotlib. The scatter() function plots the points, title() sets the title of the plot, savefig() writes it to a file, and show() displays it.

  • Lines 17–40: The code solves the Euclidean k-center problem as a mixed-integer linear program. The k variable is set to 3, which specifies that we want to cluster the data points into three clusters, and n is set to the number of data points (100). The cdist() function from SciPy computes the Euclidean distance matrix between all pairs of points. LpProblem() creates a new minimization problem. LpVariable() is used to create a binary variable y[i] for each candidate center, a binary assignment variable x[i, j] for each center–point pair, and a continuous variable z representing the maximum distance between a point and its assigned center. The constraints ensure that every point is assigned to exactly one selected center, that its assignment distance is at most z, and that exactly k centers are selected. Finally, the solve() method solves the program.

  • Lines 42–44: In this section, the solution is extracted. The code collects the data points whose y variables are set to 1 as the centers and labels each point by its nearest center.

  • Lines 46–52: These lines generate the scatter plot of the output, coloring each point by its cluster and marking each center with a red cross (x).
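Because the exact mixed-integer program scales poorly, approximation algorithms are often used instead. Below is a minimal sketch of the classic greedy farthest-first traversal, a standard textbook algorithm (not part of the example above) known to give a 2-approximation for the k-center objective:

import numpy as np
from scipy.spatial.distance import cdist

def greedy_k_center(X, k):
    # Farthest-first traversal: start from an arbitrary point, then
    # repeatedly add the point farthest from all chosen centers.
    centers = [0]
    dist = cdist(X, X[[0]]).ravel()  # distance to closest chosen center
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        centers.append(nxt)
        dist = np.minimum(dist, cdist(X, X[[nxt]]).ravel())
    return X[centers]

np.random.seed(0)
X = np.random.rand(100, 2)
centers = greedy_k_center(X, 3)
print("Approximate radius:", cdist(X, centers).min(axis=1).max())

This takes only O(nk) distance computations, and its radius is at most twice the optimum, a useful trade-off when the exact solver is too slow.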

How feature selection is an NP-hard problem

Let's see a coding example of why feature selection is an NP-hard problem:

import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate a random dataset with 20 features
np.random.seed(0)
X = np.random.rand(1000, 20)
y = np.random.rand(1000)

# Evaluate all possible subsets of five features with an exhaustive search
best_score = -np.inf
best_subset = None
for subset in itertools.combinations(range(X.shape[1]), 5):
    X_subset = X[:, subset]
    # fit a linear regression on this subset and score it (R-squared)
    model = LinearRegression().fit(X_subset, y)
    score = model.score(X_subset, y)
    if score > best_score:
        best_score = score
        best_subset = subset

print("Best subset:", best_subset)
print("Best score:", best_score)

The above code tries to solve the feature selection problem for a dataset with 1,000 samples and 20 features by exhaustively searching all subsets of exactly five features using Python's itertools module. For each subset, a linear regression model is trained, and its performance is measured by the R-squared coefficient. Even this restricted search requires fitting (20 choose 5) = 15,504 models, and searching over subsets of every size would require O(2^20), i.e., more than a million model fits. This exponential blowup makes exhaustive search unsuitable for datasets with a large number of features; sometimes it even hits the execution timeout in our coding widget. Alternative heuristic and approximate algorithms can be used to solve the feature selection problem more efficiently.

Note: These algorithms do not guarantee the optimal solution and may produce suboptimal solutions.
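For instance, here's a minimal sketch using scikit-learn's recursive feature elimination (RFE), one such heuristic: it greedily prunes the weakest feature at each step instead of enumerating subsets, so it runs quickly but may return a suboptimal subset:

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.random.rand(1000, 20)
y = np.random.rand(1000)

# Recursively drop the weakest feature until five remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X, y)
print("Selected features:", np.where(selector.support_)[0])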
