Find Clusters From Dendrogram in Hierarchical Clustering Using Python

You might have studied various tutorials on hierarchical clustering that teach how to plot a dendrogram. This article discusses how to find clusters from a dendrogram in Python.

What are Dendrograms?

In hierarchical clustering, a dendrogram is a tree diagram that shows how each cluster is composed by drawing links between clusters based on their similarity. I have already discussed how to create a dendrogram from a given dataset using different linkage methods in this article on how to plot a dendrogram in Python.

Dendrograms are a useful tool for visualizing clustering when the dataset has a hierarchical structure. We use dendrograms in hierarchical clustering, which I have discussed in detail in this article on the agglomerative clustering numerical example.

Find Clusters From Dendrograms using Python in Hierarchical Clustering

You can find clusters from the linkage matrix of a dendrogram in Python. To create a dendrogram, we first compute the pairwise distances between the data points in the dataset. From these distances, we create the linkage matrix, which records which clusters are merged at each step and at what distance.
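As a minimal sketch of these two steps, assuming the same six sample points used in the example later in this article, the following snippet computes the pairwise distances with pdist() and passes them to linkage(). The variable names X, condensed, and Z are only illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Six 2-D sample points (illustrative data)
X = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)

# Condensed distance matrix: one entry per pair of points, n*(n-1)/2 in total
condensed = pdist(X, metric="euclidean")
print(condensed.shape)  # (15,)

# Linkage matrix: one row per merge, n-1 rows in total.
# Columns: first cluster index, second cluster index,
# merge distance, number of points in the new cluster
Z = linkage(condensed, method="ward")
print(Z.shape)  # (5, 4)
print(Z)
```

Note that linkage() expects either this condensed (one-dimensional) distance matrix or the raw observations. If you pass a square distance matrix, SciPy treats each row as an observation vector instead.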

To extract the clusters from the linkage matrix, we can use the fcluster() function defined in the scipy.cluster.hierarchy module. The fcluster() function has the following syntax.

fcluster(Z, t, criterion='inconsistent', depth=2, R=None)

Here,

  • The Z parameter takes the linkage matrix as its input argument. 
  • The parameter t takes the number of clusters or the threshold to apply when forming clusters based on the criterion parameter.
  • The criterion parameter is used to specify how the clusters are formed. 
    • If the criterion is set to “inconsistent”, a cluster node and all its leaf descendants are placed in the same flat cluster when the node and all its descendants have an inconsistency value less than or equal to the value in parameter t. When no non-singleton cluster meets this criterion, every observation is assigned to its own cluster.
    • If the criterion is set to “distance”, the clusters are formed in a manner so that the original observations in each cluster have no greater cophenetic distance than t.
    • If the criterion is set to “maxclust”, the function finds a minimum threshold r so that the cophenetic distance between any two original observations in the same cluster is no more than r and no more than t clusters are formed.
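    • The criterion parameter also accepts the values “monocrit” and “maxclust_monocrit”, which form clusters from a user-supplied statistic passed through the additional monocrit parameter of fcluster(). You can read about them in the SciPy documentation for fcluster().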
  • The depth parameter is used only when the criterion parameter is set to “inconsistent”. It specifies the maximum depth to perform the inconsistency calculation. It has no meaning for the other criteria.
  • The parameter R takes the inconsistency matrix to use for the ‘inconsistent’ criterion. This matrix is computed if not provided.
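To make the difference between the criteria concrete, here is a small sketch that cuts the same tree with “maxclust” and with “distance”. It assumes the six sample points used elsewhere in this article and Ward linkage; the thresholds 3.0 and 5.0 are illustrative values chosen for this data only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)

# Passing raw observations; linkage() computes Euclidean distances itself
Z = linkage(X, method="ward")

# "maxclust": t is the maximum number of flat clusters to form
labels_maxclust = fcluster(Z, t=3, criterion="maxclust")

# "distance": t is a cophenetic distance threshold, i.e. the height
# at which the dendrogram is cut
labels_cut_low = fcluster(Z, t=3.0, criterion="distance")
labels_cut_high = fcluster(Z, t=5.0, criterion="distance")

print("maxclust, t=3:  ", labels_maxclust)
print("distance, t=3.0:", labels_cut_low)
print("distance, t=5.0:", labels_cut_high)

# Number of clusters produced by each cut
print(len(np.unique(labels_cut_low)), len(np.unique(labels_cut_high)))
```

With “distance”, a lower threshold cuts the dendrogram closer to the leaves and gives more clusters, while a higher threshold gives fewer. With “maxclust”, you state the number of clusters directly and let fcluster() find the threshold.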

After execution, the fcluster() function returns a numpy array containing one cluster label for each original data point. The label at index i is the cluster to which the i-th data point belongs. You can find the clusters from the linkage matrix of the dendrogram using the fcluster() function as shown below.

import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points labelled A to F
data = [[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]]
points = ["A", "B", "C", "D", "E", "F"]
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=points)

# linkage() expects a condensed distance matrix (or the raw observations).
# A square distance matrix would be treated as a set of observation vectors,
# so we build the condensed form with pdist().
condensed_dist = pdist(df.values, metric="euclidean")
linkage_matrix = linkage(condensed_dist, "ward")

# Cut the tree into at most 3 flat clusters
cluster_labels = fcluster(linkage_matrix, 3, criterion='maxclust')
print("The data points are:")
print(data)
print("cluster labels are:")
print(cluster_labels)

Output:

The data points are:
[[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]]
cluster labels are:
[1 1 2 2 3 3]

In this example, we have a dataset of 6 points. We have divided the dataset into three clusters using the fcluster() function. For this,

  • We have passed the linkage matrix as the first input argument which is assigned to the parameter Z.
  • We passed the number of clusters as the second input argument to the fcluster() function. It is assigned to the parameter t.
  • The third input argument is the string “maxclust”. It is assigned to the criterion parameter.
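The label array only tells you the cluster of each point by position. As a rough sketch, assuming the same data and the point names A to F from the example above, you can group the point names by their cluster label as follows. The clusters dictionary is a helper introduced here for illustration.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)
points = ["A", "B", "C", "D", "E", "F"]

# Build the linkage matrix from the raw observations and cut it into 3 clusters
linkage_matrix = linkage(data, "ward")
cluster_labels = fcluster(linkage_matrix, 3, criterion="maxclust")

# Collect the names of the points that share each label
clusters = defaultdict(list)
for name, label in zip(points, cluster_labels):
    clusters[int(label)].append(name)

for label in sorted(clusters):
    print(label, clusters[label])
```

For this dataset, the points A and B end up in one cluster, C and D in another, and E and F in the third.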

Instead of using the above approach, you can also find the clusters using the AgglomerativeClustering class defined in the sklearn.cluster module. I have discussed this approach in the article on agglomerative clustering in Python.
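For comparison, here is a minimal sketch of the scikit-learn approach on the same six points. It assumes that scikit-learn is installed and uses Ward linkage with three clusters to mirror the fcluster() example; it is not taken from the linked article.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [2, 3], [3, 5], [4, 5], [6, 6], [7, 5]], dtype=float)

# Ask for 3 clusters directly; Ward linkage works on Euclidean distances
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

print("cluster labels are:")
print(labels)
```

The labels returned by scikit-learn start at 0 and their numbering is arbitrary, so they may not match the numbers returned by fcluster() even when the grouping of the points is the same.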

Conclusion

In this article, we have discussed how to find clusters from a dendrogram using the linkage matrix in Python. I hope you enjoyed reading this article.

To learn more about programming, you can read this article on how to create a chat application in Python. You might also like this article on how to create a task scheduler.

Stay tuned for more informative articles.

Happy Learning!
