How do you check the “goodness” of clusters? Can we use these metrics to identify an ideal number of clusters? We discuss the answers to these questions in this post. First, what makes a set of clusters “good”? Ideally, we want within-cluster variance to be low, and the between-cluster variance to be high. In other words, the within-cluster distance should be low, and the distance between clusters should be high.
We’ll discuss several metrics here. The metrics listed here are not at all exhaustive, but are the most commonly reported and used.
How do we define the distance between or within clusters? Is it the maximum distance between any two points? The minimum? Or perhaps the mean or median? We can use any of these, and what you choose depends on what your goal is. Let’s now look at some clustering metrics.
Generalized Dunn index
Wikipedia has an excellent article on this, so we’ll follow the notation used there. Suppose we use a within-cluster distance metric $\Delta_k$ for cluster $C_k$. This can be the maximum, minimum, or mean/median of all the distances between two (distinct) points in the cluster. You could also compute the average distance between all points and the cluster mean. Now, let $\delta(C_i, C_j)$ denote the distance (with any definition) between clusters $C_i$ and $C_j$. Then, for $m$ clusters, the Generalized Dunn index (GDI) is defined as
$$DI_m = \frac{\min_{1 \le i < j \le m} \delta(C_i, C_j)}{\max_{1 \le k \le m} \Delta_k}$$
How we choose these distances depends on our goal. When we use something like k-means, which aims for globular clusters, the mean or maximum distance between any two points within a cluster seems like a reasonable within-cluster distance, and the minimum distance between any two points in different clusters seems like a reasonable between-cluster distance. For density-based and spectral clustering, the reverse seems like a reasonable choice.
The Generalized Dunn index is expensive to compute, especially as the number of clusters increases. This is its main drawback. Note also that because the denominator uses a max, the Dunn index is a worst-case (lower bound) score: if every cluster except one is compact and clearly separated from the others, but the remaining one is not compact (that is, its within-cluster distance is high), then the Dunn index drops. You should use other metrics along with the Dunn index to get a better idea of the clustering validity.
When we choose the between-cluster distance $\delta(C_i, C_j)$ to be the minimum distance between points in the two clusters, and the within-cluster distance $\Delta_k$ to be the maximum distance between two points in the same cluster, we get the Dunn index.
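As a rough illustration of this classical Dunn index, here is a minimal sketch in plain Python. The function name `dunn_index`, the Euclidean distance, and the toy data are assumptions made for the example; clusters are passed as lists of coordinate tuples.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(clusters):
    """Classical Dunn index: smallest between-cluster gap / largest diameter."""
    # Within-cluster distance: the largest distance between two points
    # of the same cluster (0.0 for a singleton cluster).
    max_diameter = max(
        max((dist(p, q) for p, q in combinations(c, 2)), default=0.0)
        for c in clusters
    )
    # Between-cluster distance: the smallest distance between a point of
    # one cluster and a point of another cluster.
    min_separation = min(
        min(dist(p, q) for p in ci for q in cj)
        for ci, cj in combinations(clusters, 2)
    )
    return min_separation / max_diameter


toy_clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
]
print(dunn_index(toy_clusters))
```

The nested loops visit every pair of points, which is where the cost mentioned above comes from. The sketch also assumes at least two clusters and at least one cluster with two or more points; otherwise the ratio is undefined.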
Baker-Hubert Gamma index
The Baker-Hubert Gamma index is an adaptation of the Gamma index of correlation between two vectors of data.
Suppose we have two vectors $A$ and $B$ of the same length. Then, two given indices $i, j$ are said to be concordant if whenever $a_i < a_j$, we have $b_i < b_j$. Otherwise, the two indices are said to be discordant. We now compute the number of concordant pairs, $s^+$, and the number of discordant pairs, $s^-$. The Gamma index is then defined as
$$\Gamma = \frac{s^+ - s^-}{s^+ + s^-}$$
In the context of clustering, we define the first vector to be the set of distances between every pair of points in the data (regardless of whether or not they’re in the same cluster). The corresponding element of the second vector is binary: it is 0 if the two points are in the same cluster, and 1 otherwise. The Gamma index is then computed on these two vectors.
From Bernard Desgraupes’ document, the number $s^+$ is the number of times that a distance between two points in the same cluster is less than a distance between two points in different clusters. Why? Because for a concordant pair, $b_i < b_j$ in our case means that the $i$th pair was in the same cluster (i.e., $b_i = 0$), and the $j$th pair in the vector had two points in different clusters (i.e., $b_j = 1$, and thus $b_i < b_j$). The number $s^-$ is the number of times the opposite situation occurs.
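The counting described above can be sketched directly: collect the within-cluster distances and the between-cluster distances, then compare every within-cluster distance against every between-cluster distance. The function name, the Euclidean distance, and the choice to skip exact ties are assumptions of this sketch, not something prescribed by the sources cited here.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def baker_hubert_gamma(points, labels):
    """Gamma index from within-cluster vs. between-cluster distance comparisons."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(dist(p, q))
    # Concordant: a within-cluster distance is smaller than a between-cluster one.
    s_plus = sum(1 for w in within for b in between if w < b)
    # Discordant: the opposite situation (ties are counted in neither total).
    s_minus = sum(1 for w in within for b in between if w > b)
    return (s_plus - s_minus) / (s_plus + s_minus)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(baker_hubert_gamma(points, [0, 0, 1, 1]))  # 1.0: every comparison is concordant
print(baker_hubert_gamma(points, [0, 1, 0, 1]))  # -0.5: mostly discordant
```

The number of comparisons grows with the product of the two list lengths, so this brute-force version is only practical for small datasets.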
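Silhouette index
The silhouette index is one of the most commonly reported metrics, especially with k-means clustering. It is the average of the silhouette values of all points, where the silhouette value measures how similar an object is to its own cluster compared to other clusters. To compute the silhouette value for a point $i$ that belongs to cluster $C_i$, we first compute $a(i)$, the mean distance from $i$ to the other points in its cluster:
$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j)$$
We divide by $|C_i| - 1$, since we don’t include the distance of the point to itself. This may be interpreted as how well the point fits its current cluster (smaller is better). Next, for each cluster that $i$ does not belong to, we find the average distance of $i$ to all points in that cluster, and take the smallest of these averages:
$$b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$$
Finally, we compute the silhouette value (by convention, $s(i) = 0$ if $i$ is the only point in its cluster):
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$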
The silhouette index (the average of all silhouette values) lies between -1 and +1, and can be interpreted as follows:
| Silhouette index | Interpretation |
|---|---|
| 0.71-1.0 | A strong structure has been found |
| 0.51-0.70 | A reasonable structure has been found |
| 0.26-0.50 | The structure is weak and could be artificial. Try additional methods of data analysis. |
| ≤0.25 | No substantial structure has been found |
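To make the computation concrete, here is a minimal sketch of the silhouette index in plain Python. It assumes Euclidean distance, at least two clusters, and the usual convention that a point alone in its cluster gets a silhouette value of 0; the function names are made up for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_values(points, labels):
    """Silhouette value of every point (assumes at least two clusters)."""
    values = []
    for i, (p, lp) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lq) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lq, []).append(dist(p, q))
        own = by_cluster.pop(lp, [])
        if not own:  # point is alone in its cluster
            values.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the rest of its own cluster
        # Smallest mean distance to any other cluster.
        b = min(sum(d) / len(d) for d in by_cluster.values())
        values.append((b - a) / max(a, b))
    return values


def silhouette_index(points, labels):
    """Average of the silhouette values of all points."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(silhouette_index(points, [0, 0, 1, 1]))  # close to 0.9: strong structure
```

For real workloads, a vectorized library implementation is a better fit; this version only mirrors the definition step by step.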
Calinski-Harabasz index
The Calinski-Harabasz index, also called the pseudo-F statistic, is the ratio of between-cluster variance to within-cluster variance. This metric is used with hierarchical clustering methods. If, at a given step in the algorithm, there are $k$ clusters and $n$ is the number of data points, the pseudo-F statistic is defined as
$$F = \frac{\mathrm{SS}_B / (k - 1)}{\mathrm{SS}_W / (n - k)}$$
where $\mathrm{SS}_B$ is the between-cluster sum of squares, and $\mathrm{SS}_W$ is the within-cluster sum of squares. Larger values of the pseudo-F statistic are preferred, since they imply compact and well-separated clusters.
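A minimal sketch of the pseudo-F statistic follows. It assumes squared Euclidean distances to centroids for both sums of squares, with the between-cluster term weighted by cluster size; the function and helper names are illustrative only.

```python
def calinski_harabasz(points, labels):
    """Pseudo-F: (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n, dim = len(points), len(points[0])

    def centroid(pts):
        return tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    k = len(groups)

    overall = centroid(points)
    # Between-cluster SS: size-weighted squared distance of each centroid
    # from the overall centroid.
    ss_between = sum(
        len(pts) * sq_dist(centroid(pts), overall) for pts in groups.values()
    )
    # Within-cluster SS: squared distance of each point from its own centroid.
    ss_within = sum(
        sq_dist(p, centroid(pts)) for pts in groups.values() for p in pts
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(calinski_harabasz(points, [0, 0, 1, 1]))  # 200.0
```

The statistic is undefined for a single cluster, for one cluster per point, or when every point coincides with its centroid, so a practical version would guard against those cases.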
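Cophenetic correlation
Because hierarchical clustering does not maintain the distances between points when we form the dendrogram, one might wonder how well the distances between the original data points are preserved in the dendrogram. The cophenetic correlation is a measure of this, and is defined as follows: let $x(i, j)$ represent the distance between points $i$ and $j$. Let $t(i, j)$ be the height of the dendrogram at which $i$ and $j$ first get merged into one cluster. Finally, we let $\bar{x}$ represent the mean of all the $x(i, j)$s, and $\bar{t}$ be the mean of all the $t(i, j)$s. The cophenetic correlation is defined as:
$$c = \frac{\sum_{i < j} \left(x(i, j) - \bar{x}\right)\left(t(i, j) - \bar{t}\right)}{\sqrt{\left[\sum_{i < j} \left(x(i, j) - \bar{x}\right)^2\right]\left[\sum_{i < j} \left(t(i, j) - \bar{t}\right)^2\right]}}$$
A value of $c$ close to 1 is ideal.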
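The sketch below computes the cophenetic correlation for a deliberately naive single-linkage clustering. Single linkage, Euclidean distance, and the function name are all assumptions of the example (the post does not fix a linkage method), and the brute-force merge loop is only suitable for tiny inputs.

```python
from itertools import combinations
from math import dist, sqrt  # dist requires Python 3.8+


def cophenetic_correlation(points):
    """Cophenetic correlation of a naive single-linkage dendrogram."""
    n = len(points)
    pairs = list(combinations(range(n), 2))
    # Original pairwise distances between points.
    x = {(i, j): dist(points[i], points[j]) for i, j in pairs}

    clusters = [[i] for i in range(n)]
    t = {}  # height at which each pair of points is first merged
    while len(clusters) > 1:
        # Pick the two clusters with the smallest single-linkage distance.
        height, a, b = min(
            (min(x[tuple(sorted((i, j)))] for i in ca for j in cb), ia, ib)
            for (ia, ca), (ib, cb) in combinations(enumerate(clusters), 2)
        )
        for i in clusters[a]:
            for j in clusters[b]:
                t[tuple(sorted((i, j)))] = height
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b

    x_bar = sum(x.values()) / len(pairs)
    t_bar = sum(t.values()) / len(pairs)
    numerator = sum((x[p] - x_bar) * (t[p] - t_bar) for p in pairs)
    denominator = sqrt(
        sum((x[p] - x_bar) ** 2 for p in pairs)
        * sum((t[p] - t_bar) ** 2 for p in pairs)
    )
    return numerator / denominator


print(cophenetic_correlation([(0.0,), (1.0,), (10.0,)]))
```

If every pair is merged at the same height, the denominator is zero and the correlation is undefined, so a practical version would need to handle that case.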
Comparing different partitions
We can also compare the partitions created using two algorithms. This can be used to test the effectiveness of the clusters produced. The measures we discuss here fall under external indices or evaluations, since we’re comparing two partitions. Therefore, to test the efficacy of your clustering algorithm, you could hold back some data, say 20%, when you give it to the clustering algorithm. You cluster this held-out data independently, say using human experts. Once the algorithm has identified clusters (using the 80% that you gave it), you give it this previously left-out (20%) data, and see how well it performs compared to the “gold standard” human clustering. Essentially, we’re labeling the held-out data. Let’s look at some metrics. First, we’ll briefly define some notation, from Wikipedia.
Suppose we have a dataset, $S$, with $n$ points. Suppose we want to compare two partitions, $X$ and $Y$. Let’s define the following counts over all $\binom{n}{2}$ pairs of data points:
- $TP$: The number of pairs of data points that are in the same cluster in both the partitions.
- $FP$: The number of pairs of data points that are in the same cluster in partition $X$, but in different clusters in partition $Y$.
- $FN$: The number of pairs of data points that are in different clusters in partition $X$, but in the same cluster in partition $Y$.
- $TN$: The number of pairs of data points that are in different clusters in both the partitions.
Rand index
The Rand index is defined as
$$R = \frac{TP + TN}{TP + FP + FN + TN}$$
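Intuitively, the Rand index acts as an accuracy measure for clusters, and is between 0 and 1. There’s also an adjusted Rand index (ARI), which is what is usually used in practice. The ARI is corrected for chance. To compute the adjusted Rand index, we first write the two partitions as $X = \{X_1, \ldots, X_r\}$ and $Y = \{Y_1, \ldots, Y_s\}$, and compute a contingency table:
$$
\begin{array}{c|cccc|c}
 & Y_1 & Y_2 & \cdots & Y_s & \text{sums} \\
\hline
X_1 & n_{11} & n_{12} & \cdots & n_{1s} & a_1 \\
X_2 & n_{21} & n_{22} & \cdots & n_{2s} & a_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
X_r & n_{r1} & n_{r2} & \cdots & n_{rs} & a_r \\
\hline
\text{sums} & b_1 & b_2 & \cdots & b_s &
\end{array}
$$
Essentially, each entry $n_{ij} = |X_i \cap Y_j|$ is the number of common elements in $X_i$ and $Y_j$, while $a_i$ and $b_j$ are the row and column sums. Now, the adjusted Rand index (ARI), also called the corrected Rand index, is defined as:
$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}$$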
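Both indices can be sketched in a few lines of plain Python. The version below takes two label lists of equal length (one label per data point); the function names and the toy labels are assumptions for illustration. The Rand index counts agreeing pairs directly, and the ARI is computed from the contingency table counts using the standard chance-corrected form.

```python
from collections import Counter
from itertools import combinations
from math import comb  # Python 3.8+


def rand_index(x, y):
    """Fraction of point pairs on which the two partitions agree."""
    agree = sum(
        (xi == xj) == (yi == yj)
        for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)
    )
    return agree / comb(len(x), 2)


def adjusted_rand_index(x, y):
    """Rand index corrected for chance, computed from the contingency table."""
    n = len(x)
    table = Counter(zip(x, y))  # entries of the contingency table
    row_sums, col_sums = Counter(x), Counter(y)
    sum_cells = sum(comb(v, 2) for v in table.values())
    sum_rows = sum(comb(v, 2) for v in row_sums.values())
    sum_cols = sum(comb(v, 2) for v in col_sums.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)


x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 1, 1, 2, 2]
print(rand_index(x, y), adjusted_rand_index(x, y))
```

On this toy example the Rand index is about 0.67 while the ARI is only about 0.24, which shows how much of the raw agreement could be expected by chance. The ARI is undefined when its denominator is zero, for example when both partitions put everything in one cluster.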
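Fowlkes-Mallows index
If the Rand index acts like an accuracy measure, we can similarly define precision and recall measures. The Fowlkes-Mallows index, also called the G-measure, is the geometric mean of the precision and recall. In the equation below, $TP$, $FP$, and $FN$ are the pair counts defined above (same cluster in both partitions, same cluster in $X$ only, and same cluster in $Y$ only); the first term corresponds to the precision, and the second corresponds to the recall.
$$FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}$$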
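Here is a minimal sketch of the pair counting and the resulting Fowlkes-Mallows index. The names `pair_counts` and `fowlkes_mallows` are made up for the example, and it treats the first label list as the clustering being evaluated and the second as the reference, which is an assumption about which side plays the "predicted" role.

```python
from itertools import combinations
from math import sqrt


def pair_counts(x, y):
    """Count point pairs by whether they share a cluster in x and in y."""
    tp = fp = fn = tn = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        same_x, same_y = xi == xj, yi == yj
        if same_x and same_y:
            tp += 1  # together in both partitions
        elif same_x:
            fp += 1  # together in x only
        elif same_y:
            fn += 1  # together in y only
        else:
            tn += 1  # apart in both partitions
    return tp, fp, fn, tn


def fowlkes_mallows(x, y):
    """Geometric mean of pairwise precision and recall."""
    tp, fp, fn, _ = pair_counts(x, y)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)


x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 1, 1, 2, 2]
print(pair_counts(x, y))  # (2, 4, 1, 8)
print(fowlkes_mallows(x, y))
```

The index is undefined if either partition has no pair of points sharing a cluster, since precision or recall would then divide by zero.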
Hubert Gamma index
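For each of the two partitions, we can associate a random variable. Think of a random variable as a function, say $f$, that maps its input to a real number. For each partition, we define a random variable that takes in two indices, $i$ and $j$, such that $i < j$, and returns 1 if points $i$ and $j$ are in the same cluster in that partition, and 0 otherwise. Using the notation above, let’s call these two partitions $X$ and $Y$, with associated random variables $f_X$ and $f_Y$, whose means over all pairs are $\mu_X$ and $\mu_Y$. Then, the Hubert Gamma index is the correlation coefficient of the random variables associated with these two partitions:
$$\hat{\Gamma} = \frac{\sum_{i < j} \left(f_X(i, j) - \mu_X\right)\left(f_Y(i, j) - \mu_Y\right)}{\sqrt{\sum_{i < j} \left(f_X(i, j) - \mu_X\right)^2 \sum_{i < j} \left(f_Y(i, j) - \mu_Y\right)^2}}$$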
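This correlation can be sketched by building the two indicator vectors explicitly, one entry per pair of points, and computing their Pearson correlation. The function name and toy labels are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def hubert_gamma(x, y):
    """Pearson correlation of the 'same cluster' indicators of two partitions."""
    # One entry per pair of points: 1.0 if the pair shares a cluster, else 0.0.
    fx = [float(xi == xj) for xi, xj in combinations(x, 2)]
    fy = [float(yi == yj) for yi, yj in combinations(y, 2)]
    mean_x = sum(fx) / len(fx)
    mean_y = sum(fy) / len(fy)
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(fx, fy))
    var_x = sum((u - mean_x) ** 2 for u in fx)
    var_y = sum((v - mean_y) ** 2 for v in fy)
    return cov / sqrt(var_x * var_y)


x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 1, 1, 2, 2]
print(hubert_gamma(x, y))
print(hubert_gamma(x, x))  # identical partitions give a correlation of 1 (up to rounding)
```

If either indicator vector is constant (for example, a partition with a single cluster), its variance is zero and the correlation is undefined.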
 L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons, 2009