In Unsupervised Machine Learning, many clustering algorithms are used to group objects for analysis and for finding patterns. One commonly known technique is Agglomerative Clustering, where objects that are close to each other are placed in one group. In the beginning, every object is a single cluster (leaf), and the algorithm keeps merging clusters until a single cluster (root) remains. The clustering process forms a tree-like structure called a dendrogram.
Agglomerative clustering is a common type of Hierarchical Clustering and is also called Agglomerative Nesting (AGNES). It follows a bottom-up approach while clustering objects.
The algorithm includes the following steps:

1. Treat each object as a singleton cluster.
2. Compute the distance between every pair of clusters.
3. Merge the two closest clusters into a single cluster.
4. Repeat steps 2 and 3 until only one cluster (the root) remains.
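The bottom-up merging procedure can be sketched in a few lines of NumPy. The `naive_agglomerative` helper below is a hypothetical illustration written for this article (it uses single linkage, i.e., the minimum distance between members of two clusters), not a function from any library:

```python
import numpy as np

def naive_agglomerative(points):
    """Repeatedly merge the closest pair of clusters until one remains."""
    # every point starts as its own singleton cluster
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # find the pair of clusters whose closest members are nearest (single linkage)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return merges

# four illustrative points: two tight pairs far apart
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]])
for left, right, dist in naive_agglomerative(pts):
    print(left, right, round(dist, 3))
```

With four points, exactly three merges happen: each tight pair is merged first, and the two resulting clusters are merged last.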
A dendrogram is used to represent the hierarchical relationship between objects. The height of the dendrogram shows the order in which objects are clustered together. In the above example, the first cluster is formed by the closest objects, i.e., E and F. So, the blue-colored link merging them together is of the shortest height. Afterward, objects A and B are merged using a pink-colored link that is of second shortest height.
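The figure itself is not reproduced here, but the merge order can be verified numerically with SciPy's `linkage` function. The coordinates below are illustrative stand-ins for points A through F, chosen so that E and F are closest and A and B are second closest:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# hypothetical coordinates for points A, B, C, D, E, F (indices 0..5)
points = np.array([[0.0, 0.0], [0.2, 0.0],   # A, B
                   [3.0, 3.0], [3.5, 3.0],   # C, D
                   [6.0, 0.5], [6.1, 0.4]])  # E, F

Z = linkage(points, method="single")
# each row of Z records one merge: [cluster_i, cluster_j, distance, size],
# sorted by merge distance, so Z[0] is the closest pair (E and F here)
print(Z[:, :3])
```

The first row merges indices 4 and 5 (E and F) at the smallest distance, and the second row merges 0 and 1 (A and B), mirroring the shortest and second-shortest links in the dendrogram.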
The distance between clusters can be measured using various metrics, including Euclidean, Manhattan, and Minkowski distance. The most commonly used is Euclidean Distance. The distance between two points (x1, y1) and (x2, y2) can be found using the Euclidean formula:

d = sqrt((x1 - x2)^2 + (y1 - y2)^2)
The formula is further explained by the example given below:
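As a quick worked example (the points here are made up for illustration), the distance between (1, 2) and (4, 6) forms a 3-4-5 right triangle:

```python
import math

x1, y1 = 1, 2
x2, y2 = 4, 6

# d = sqrt((1 - 4)^2 + (2 - 6)^2) = sqrt(9 + 16) = sqrt(25)
d = math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
print(d)  # 5.0
```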
Firstly, import all the required libraries. Then, generate a 2D array holding the coordinates of all the data points. After you initialize the Agglomerative Clustering model, call its fit method. Lastly, plot the dendrogram to see the clustering results.
The AgglomerativeClustering function takes distance_threshold and n_clusters as parameters.

distance_threshold: The linkage distance threshold above which clusters will not be merged; it sets the level at which the dendrogram tree is cut.

n_clusters: The number of clusters to find. It must be None when distance_threshold is set.

For more parameter details, follow the link.
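As a quick illustration of the n_clusters parameter (the data points below are made up for the example): asking for a fixed number of clusters makes the model return a label per sample instead of a full merge hierarchy.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# two tight pairs of points, far apart from each other
X = np.array([[1.0, 1.0], [1.2, 0.8],
              [8.0, 8.0], [8.1, 7.9]])

# ask for exactly two clusters instead of using a distance cutoff
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
print(labels)  # each nearby pair shares a label
```

Because n_clusters and distance_threshold are mutually exclusive, set only one of them: the main example below uses distance_threshold=0 with n_clusters=None to build the full tree for the dendrogram.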
# import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram

# generate coordinates array for six samples
X = np.array([[1.3, 4.8], [2.3, 5.5], [3.6, 1.3],
              [6.1, 5.1], [6.2, 2.5], [6.7, 3.4]])

# instantiate Agglomerative Clustering instance
clustering_model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

# call fit method with array of sample coordinates passed as a parameter
trained_model = clustering_model.fit(X)

# a method for generating the dendrogram
def plot_dendrogram(model, **kwargs):
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    # create the linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

# plot dendrogram to visualize clusters
plot_dendrogram(trained_model)
plt.show()
Note: To run the above code cell, use sklearn version >= 0.22