A dendrogram is a tree diagram used to visualize the hierarchical relationships between similar entities. In Python, a dendrogram is typically drawn to illustrate the output of hierarchical clustering. Hierarchical clustering is an unsupervised learning algorithm that groups objects into nested clusters based on their similarity, either bottom-up (agglomerative, where each object starts in its own cluster and the closest clusters are merged step by step) or top-down (divisive, where a single cluster is split repeatedly).
The resulting diagram contains groups, or clusters, that differ from one another, while the endpoints, or leaves, within the same group are similar to each other. A familiar real-world analogy is the organization of files and folders on a computer hard drive, which are stored in a hierarchy.
An example of hierarchical clustering is shown below. In the first image, different data points are represented on a plane while the second image illustrates the relevant clusters.
It is important to note that dendrograms describe the relationship between clusters and their instances. This is why we can read a dendrogram by looking at the height at which objects are joined: the lower the link, the more similar the objects it connects.
In the figure above, the instances that lie closest to each other are joined first, followed by the next-closest pair, and so on. The link joining the closest instances therefore has the smallest height, and the next-smallest height belongs to the link joining the next-closest pair.
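To make the idea of link height concrete, here is a minimal sketch that prints the merge heights stored in a linkage matrix. The four points are made up for illustration (they are not the points in the figure), and average linkage with Euclidean distance is assumed.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical points, chosen so the distances are easy to check by hand:
# points 0 and 1 are 1 unit apart, points 2 and 3 are 3 units apart.
points = np.array([[0, 0], [0, 1], [5, 0], [5, 3]])

# Each row of the linkage matrix describes one merge:
# [cluster_a, cluster_b, height of the link, number of points in the new cluster]
Z = linkage(points, method="average", metric="euclidean")

for a, b, height, size in Z:
    print(f"merge {int(a)} and {int(b)} at height {height:.2f} ({int(size)} points)")
```

Here the closest pair (points 0 and 1) is merged first at height 1, then points 2 and 3 at height 3, and the two resulting clusters are joined last at the largest height. These heights are what the vertical axis of the dendrogram shows.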
The following sample code creates a dendrogram from random data points in Python. For this purpose, it uses the linkage() function from the scipy.cluster.hierarchy module of the scipy library.
import os
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Make sure the output folder exists before saving figures
os.makedirs("output", exist_ok=True)

# Generate random coordinates
x = np.random.randint(1, 50, 10)
y = np.random.randint(1, 50, 10)

# Scatter plot of the randomly generated points
fig, ax = plt.subplots(dpi=800)
ax.scatter(x, y, c='red', marker='o')
ax.set_title('Scatter plot of randomly generated points')
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
fig.savefig("output/scatter_plot.png")
plt.show()  # Display the scatter plot
plt.close(fig)

# Prepare data for clustering
coord_points = list(zip(x, y))
clusters = linkage(coord_points, method='average', metric='euclidean')

# Plot dendrogram
fig, ax = plt.subplots(dpi=800)
dendrogram(clusters, ax=ax)
ax.set_title('Sample Dendrogram')
ax.set_xlabel('Points')
ax.set_ylabel('Euclidean distance')

# Save the dendrogram plot to a file
fig.savefig("output/dendrogram.png")
plt.close(fig)
Let’s understand the code above:
We import the numpy library to create random numbers, which act as the points for hierarchical clustering.
We import the linkage and dendrogram functions from scipy.cluster.hierarchy.
The linkage function performs hierarchical (agglomerative) clustering.
The dendrogram function visualizes the hierarchical clustering encoded by the linkage matrix.
We import matplotlib.pyplot to create the scatter plot and the dendrogram.
We perform the clustering with the average linkage method and the euclidean distance metric.
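The choice of linkage method changes the heights at which clusters are joined. As a rough sketch, the snippet below compares the height of the final merge under the single, complete, and average methods. The four points are hypothetical and only there to keep the arithmetic easy to follow.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical points forming two well-separated pairs
points = np.array([[0, 0], [0, 1], [5, 0], [5, 3]])

final_heights = {}
for method in ("single", "complete", "average"):
    Z = linkage(points, method=method, metric="euclidean")
    # The last row holds the merge that joins everything into one cluster
    final_heights[method] = Z[-1, 2]
    print(f"{method:>8}: final merge at height {Z[-1, 2]:.2f}")
```

Single linkage uses the smallest distance between the two clusters, complete linkage the largest, and average linkage the mean of all pairwise distances, so for this data the average height falls between the other two.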