K-means
is an unsupervised Machine Learning algorithm. In order to determine the optimal numbers of clusters (k
), the Elbow method
is most commonly used.
The two methods to determine optimal clusters include distortion
and inertia
, the first of which uses Euclidean distance to calculate the average of the squared distances between the cluster centers of the respective clusters, while the latter is the sum of the squared distance of sample cluster points from their closest centroid. Now, there are two ways to implement the Elbow method using either distortions or inertia. We will discuss both.
The simple yet practical demonstration of the Elbow method is given below. First, we will import the required libraries and visualize the data.
import numpy as npfrom sklearn.cluster import KMeansfrom sklearn import metricsimport matplotlib.pyplot as pltfrom scipy.spatial.distance import cdistxaxis = np.array([1, 4, 5, 6, 1, 4, 5, 6, 1, 2, 4, 6, 5, 5, 5, 1, 3, 4, 2, 6, 4, 5, 5])yaxis = np.array([5, 4, 5, 6, 5, 3, 5, 4, 4, 4, 8, 1, 3, 2, 1, 5, 1, 8, 7, 6, 9, 1, 10])X = np.array(list(zip(xaxis, yaxis))).reshape(len(xaxis), 2)plt.plot()plt.title('Data Visualization')plt.xlabel('x')plt.ylabel('y')plt.xlim([0, 10])plt.ylim([0, 15])plt.scatter(xaxis, yaxis)plt.show()
Looking at the scatter plot, it is safe to say that it can be quite difficult to determine the optimal number of clusters just by visualizing the data. Hence, the Elbow method would help us to make a better decision about the number of clusters.
Two implementations for the elbow method are displayed below, in different tabs, for the above data.
# K-means using Euclidean distance and distortion conceptdistortions = []#total number of clustersK = range(1,10)#for every cluster value we calculate distortionfor k in K:kmeanModel = KMeans(n_clusters=k).fit(X)kmeanModel.fit(X)distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])# Now that we have all the distortions we will plot the graphplt.plot(K, distortions, 'bx-')plt.xlabel('k-clusters')plt.ylabel('Distortion')plt.xlim([0, 10])# plt.ylim([0, 3]) can set the y limit accordingly if and when needed otherwise it sets default value depending on graphplt.title('Elbow method with Euclidean Distance')plt.show()
As is evident from the graph, as k
k = 4
is optimal where k
is the number of clusters. We will always choose the value of k
at the elbow.