What is the elbow method in Python?

K-means is an unsupervised machine learning algorithm. To determine the optimal number of clusters (k), the Elbow method is most commonly used.

Two quantities are commonly used to determine the optimal number of clusters: distortion and inertia. Distortion is the average of the Euclidean distances from each point to its nearest cluster center, while inertia is the sum of the squared distances of the samples to their closest centroid. The Elbow method can therefore be implemented using either distortion or inertia. We will discuss both.
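To make the two measures concrete, here is a minimal sketch that computes both for a single fitted model (the four toy points and the choice of k = 2 are illustrative assumptions, not part of the demonstration below):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated pairs of points
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distortion: average Euclidean distance from each point to its nearest centroid
distortion = np.min(cdist(X, model.cluster_centers_, 'euclidean'), axis=1).sum() / X.shape[0]

# Inertia: sum of squared distances to the closest centroid,
# exposed directly by scikit-learn as the inertia_ attribute
inertia = model.inertia_
```

Both values shrink as k grows, which is exactly what the elbow plots below exploit.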

A simple yet practical demonstration of the Elbow method is given below. First, we import the required libraries and visualize the data.

import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
xaxis = np.array([1, 4, 5, 6, 1, 4, 5, 6, 1, 2, 4, 6, 5, 5, 5, 1, 3, 4, 2, 6, 4, 5, 5])
yaxis = np.array([5, 4, 5, 6, 5, 3, 5, 4, 4, 4, 8, 1, 3, 2, 1, 5, 1, 8, 7, 6, 9, 1, 10])
# Combine the coordinates into an (n_samples, 2) feature matrix
X = np.column_stack((xaxis, yaxis))
plt.title('Data Visualization')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([0, 10])
plt.ylim([0, 15])
plt.scatter(xaxis, yaxis)
plt.show()

Looking at the scatter plot, it is quite difficult to determine the optimal number of clusters just by visualizing the data. Hence, the Elbow method helps us make a better decision about the number of clusters.

Two implementations of the elbow method, one using distortion and one using inertia, are displayed below for the above data.

# K-means using Euclidean distance and distortion concept
distortions = []
# Candidate numbers of clusters
K = range(1, 10)
# For every value of k, fit a model and calculate the distortion
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10).fit(X)
    # Distortion: mean Euclidean distance from each point to its nearest centroid
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
# Now that we have all the distortions we will plot the graph
plt.plot(K, distortions, 'bx-')
plt.xlabel('k-clusters')
plt.ylabel('Distortion')
plt.xlim([0, 10])
# plt.ylim([0, 3])  # optionally set the y-limit; by default it is chosen from the data
plt.title('Elbow method with Euclidean Distance')
plt.show()
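The second implementation relies on scikit-learn's `inertia_` attribute, which stores the sum of squared distances of the samples to their closest centroid. A sketch of this variant on the same data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Same toy data as above
xaxis = np.array([1, 4, 5, 6, 1, 4, 5, 6, 1, 2, 4, 6, 5, 5, 5, 1, 3, 4, 2, 6, 4, 5, 5])
yaxis = np.array([5, 4, 5, 6, 5, 3, 5, 4, 4, 4, 8, 1, 3, 2, 1, 5, 1, 8, 7, 6, 9, 1, 10])
X = np.column_stack((xaxis, yaxis))

# K-means using the inertia concept
inertias = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10).fit(X)
    # inertia_ is the sum of squared distances to the closest centroid
    inertias.append(kmeanModel.inertia_)

plt.plot(K, inertias, 'bx-')
plt.xlabel('k-clusters')
plt.ylabel('Inertia')
plt.title('Elbow method with Inertia')
plt.show()
```

Because inertia sums squared distances rather than averaging plain distances, the y-axis values differ from the distortion plot, but the elbow appears at the same k.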

As is evident from the graph, the value on the y-axis decreases as k (the number of clusters) increases, because each point lies closer to its cluster centroid. The improvement declines rapidly at first and then settles into a roughly linear decrease, which forms the elbow shape. Both graphs show that k = 4 is optimal, where k is the number of clusters. We always choose the value of k at the elbow.
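Once the elbow suggests k = 4, a final model can be fit with that value. This sketch inspects the result through the standard scikit-learn attributes `labels_` and `cluster_centers_` (the `random_state` value is an arbitrary choice for reproducibility):

```python
import numpy as np
from sklearn.cluster import KMeans

xaxis = np.array([1, 4, 5, 6, 1, 4, 5, 6, 1, 2, 4, 6, 5, 5, 5, 1, 3, 4, 2, 6, 4, 5, 5])
yaxis = np.array([5, 4, 5, 6, 5, 3, 5, 4, 4, 4, 8, 1, 3, 2, 1, 5, 1, 8, 7, 6, 9, 1, 10])
X = np.column_stack((xaxis, yaxis))

# Fit the final model with the elbow-chosen number of clusters
final_model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(final_model.labels_)           # cluster assignment of each point
print(final_model.cluster_centers_)  # coordinates of the 4 centroids
```

The labels can then be used to color the original scatter plot and visually confirm the grouping.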
