NP-hard refers to a class of computational problems that are at least as hard as the hardest problems in NP (nondeterministic polynomial time).
In other words, a problem is NP-hard if every problem in the NP class can be reduced to it in polynomial time; an NP-hard problem need not itself belong to NP. This means that if there were an efficient (polynomial-time) algorithm for solving any NP-hard problem, it would yield efficient algorithms for every problem in the NP class.
No polynomial-time exact algorithm is known for any NP-hard problem, so (assuming P ≠ NP) finding an exact solution requires time that grows exponentially with the problem's size. As the size of the problem increases, the time required to solve it multiplies, which makes solving large instances of NP-hard problems exactly infeasible. Classic examples include the following (a brute-force sketch follows the list):
Traveling salesperson problem
0/1 knapsack problem
Graph coloring problem
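To get a feel for this exponential blow-up, here is a minimal brute-force sketch for the traveling salesperson problem (an illustrative addition, not part of the original text; brute_force_tsp is a hypothetical helper). It enumerates every possible tour, so the work grows factorially with the number of cities.

import itertools
import math
import numpy as np

def brute_force_tsp(points):
    # Try every tour that starts and ends at city 0; keep the shortest.
    n = len(points)
    best_tour, best_length = None, math.inf
    for perm in itertools.permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        length = sum(np.linalg.norm(points[tour[i]] - points[tour[i + 1]])
                     for i in range(n))
        if length < best_length:
            best_tour, best_length = tour, length
    return best_tour, best_length

np.random.seed(0)
tour, length = brute_force_tsp(np.random.rand(7, 2))  # 6! = 720 tours: instant
print("Best tour:", tour, "length:", round(length, 3))

# The number of tours explodes factorially with the number of cities:
for n in (10, 15, 20):
    print(f"{n} cities -> {math.factorial(n - 1):,} tours to check")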
A problem is NP-complete if it belongs to the NP class and every other problem in the NP class can be reduced to it in polynomial time. In other words, a problem must be both in NP and NP-hard to be NP-complete.
The hardness of k-means clustering is related to the Euclidean k-center problem, a well-known NP-hard problem. In the Euclidean k-center problem, we are given a set of points in Euclidean space, and we want to choose k centers such that the maximum distance between a point and its closest center is minimized.
The k-means clustering problem is a relaxation of the Euclidean k-center problem in the sense that any point in the space can serve as a center, not just the given points. The k-means clustering problem seeks to partition the points into k clusters so as to minimize the sum of squared distances between each point and its cluster center.
While the k-means clustering problem is computationally more manageable in practice than the Euclidean k-center problem, it is still NP-hard in the worst case, which can be shown by reduction from other NP-hard problems.
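To make the two objectives concrete, here is a small sketch (an illustrative addition) that evaluates both objectives for the same arbitrary choice of k = 3 centers:

import numpy as np
from scipy.spatial.distance import cdist

np.random.seed(0)
X = np.random.rand(100, 2)
centers = X[[0, 1, 2]]                 # any k = 3 candidate centers
dist = cdist(X, centers, 'euclidean')  # distances from each point to each center
nearest = dist.min(axis=1)             # each point's distance to its closest center

# k-center objective: minimize the worst-case (maximum) distance
print("k-center objective:", nearest.max())

# k-means objective: minimize the total squared distance
print("k-means objective:", (nearest ** 2).sum())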
In the first plot, we have 100 data points, and we want to enumerate all possible ways of choosing k = 3 of them as centers, that is, C(100, 3) = (100 × 99 × 98) / 3! = 161,700 combinations.
This means that there are 161,700 possible sets of three centers that we need to consider to solve the Euclidean k-center problem for 100 data points by brute force. As we can see, the number of possible combinations, C(n, k), grows very quickly with the number of points n and the number of centers k.
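We can verify this count, and watch it explode, with Python's math.comb():

from math import comb

print(comb(100, 3))  # 161700 ways to choose 3 centers from 100 points

# The count explodes as n and k grow:
for n, k in [(100, 3), (1000, 3), (1000, 10)]:
    print(f"C({n}, {k}) = {comb(n, k):,}")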
Let's go through an executable example of the Euclidean k-center problem:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

# generate random data points
np.random.seed(0)
X = np.random.rand(100, 2)

# plot data points
plt.figure(figsize=(7, 7))
plt.scatter(X[:, 0], X[:, 1])
plt.title("Data points")
plt.savefig("output/graph1.png", dpi=300)
plt.show()

# solve Euclidean k-center problem as a mixed-integer program
k = 3
n = X.shape[0]
d = cdist(X, X, 'euclidean')
model = LpProblem(name='Euclidean_k_center', sense=LpMinimize)
y = {i: LpVariable(name=f'y{i}', cat='Binary') for i in range(n)}  # y[i] = 1 if point i is opened as a center
x = {(i, j): LpVariable(name=f'x{i}_{j}', cat='Binary')  # x[i, j] = 1 if point j is assigned to center i
     for i in range(n) for j in range(n)}
z = LpVariable(name='z', lowBound=0, upBound=np.sqrt(2), cat='Continuous')  # maximum point-to-center distance
model += z  # objective: minimize the covering radius z
for j in range(n):
    model += lpSum(x[i, j] for i in range(n)) == 1  # each point is assigned to exactly one center
    model += lpSum(d[i, j] * x[i, j] for i in range(n)) <= z  # that assignment lies within radius z
    for i in range(n):
        model += x[i, j] <= y[i]  # points can only be assigned to opened centers
model += lpSum(y.values()) == k  # open exactly k centers
model.solve()

# extract solution
centers = np.array([X[i, :] for i in range(n) if value(y[i]) > 0.5])
labels = np.argmin(cdist(X, centers, 'euclidean'), axis=1)

# plot clusters
plt.figure(figsize=(7, 7))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='x', s=200, linewidths=3, color='r')
plt.title("Euclidean k-center clustering")
plt.savefig("output/graph2.png", dpi=300)
plt.show()
Lines 7–8: The 100 random two-dimensional data points are generated (the seed is fixed for reproducibility).
Lines 11–15: The data points are then plotted using Matplotlib. The scatter() function plots the points, the title() function sets the plot's title, the savefig() function writes the figure to disk, and the show() function displays the plot.
Lines 18–33: The code solves the Euclidean k-center problem as a mixed-integer linear program. The k variable is set to 3, which specifies that we want to cluster the data points into 3 clusters, and the n variable is set to the number of data points (100). The cdist() function from SciPy computes the Euclidean distance matrix between all pairs of points. LpProblem() creates a new problem, set up as a minimization problem. LpVariable() is used to create a binary variable y[i] marking whether point i is opened as a center, a binary assignment variable x[i, j] for each center-point pair, and a continuous variable z representing the maximum distance between a data point and its assigned center. The constraints assign every point to exactly one opened center within radius z and open exactly k centers. Finally, the solve() method is called to solve the problem. Note that with 100 points the model has roughly 10,000 binary variables, so the solver can take a long time; this is precisely the intractability this section is about. (A fast greedy alternative is sketched after this walkthrough.)
Lines 36–37: In this section, the solution to the optimization problem is extracted. The code collects the data points that were selected as centers and then labels each point with the index of its nearest center.
Lines 40–45: These lines generate a scatter plot of the result, coloring each point by its cluster and marking each center with a red cross (x).
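Because the exact mixed-integer program above scales poorly, practitioners usually rely on approximation algorithms instead. Below is a minimal sketch of the classic farthest-first (Gonzalez) heuristic, an illustrative addition rather than part of the original example; it needs only on the order of n × k distance computations and is guaranteed to find a radius at most twice the optimum.

import numpy as np
from scipy.spatial.distance import cdist

def greedy_k_center(X, k, seed=0):
    # Farthest-first traversal: repeatedly open the point farthest
    # from all centers opened so far (a 2-approximation for k-center).
    rng = np.random.default_rng(seed)
    centers = [rng.integers(len(X))]          # start from a random point
    dist = cdist(X, X[centers[-1:]]).ravel()  # distance to the first center
    for _ in range(k - 1):
        centers.append(int(dist.argmax()))    # farthest point becomes a center
        dist = np.minimum(dist, cdist(X, X[centers[-1:]]).ravel())
    return centers, dist.max()                # center indices and covering radius

np.random.seed(0)
X = np.random.rand(100, 2)
centers, radius = greedy_k_center(X, 3)
print("Greedy centers:", centers, "radius:", round(radius, 3))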
Let's look at a coding example of why feature selection is an NP-hard problem:
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate a random dataset with 20 features
X = np.random.rand(1000, 20)
y = np.random.rand(1000)

# Compute the performance of all possible subsets of features using an exhaustive search algorithm
best_score = -np.inf
best_subset = None
for subset in itertools.combinations(range(X.shape[1]), 5):
    X_subset = X[:, subset]
    model = LinearRegression().fit(X_subset, y)
    score = model.score(X_subset, y)
    if score > best_score:
        best_score = score
        best_subset = subset

print("Best subset:", best_subset)
print("Best score:", best_score)
The above code tries to solve the feature selection problem for a dataset with 1,000 samples and 20 features by exhaustively searching all possible combinations of five features using Python's itertools module. For each subset, a linear regression model is trained, and its performance is measured by the R-squared coefficient. However, the algorithm has a high time complexity: it must fit C(20, 5) = 15,504 models for this fixed subset size, and searching over all possible subsets of n features would require 2^n model fits, which grows exponentially with n.
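A common polynomial-time workaround is greedy forward selection, which adds the single most helpful feature at each step. The sketch below (an illustrative addition, using the same randomly generated dataset as above) fits only on the order of n × k models instead of C(n, k):

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.random.rand(1000, 20)
y = np.random.rand(1000)

# Greedy forward selection: at each step, add the one remaining
# feature that most improves the R-squared score.
selected = []
for _ in range(5):
    best_feat, best_score = None, -np.inf
    for f in range(X.shape[1]):
        if f in selected:
            continue
        cols = selected + [f]
        score = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if score > best_score:
            best_feat, best_score = f, score
    selected.append(best_feat)

print("Greedy subset:", tuple(sorted(selected)))
print("Greedy score:", best_score)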
Note: Heuristic algorithms such as greedy forward selection do not guarantee the optimal solution and may produce suboptimal results, but they run in polynomial time.