12.6 – Evaluating clusters

How do you check the “goodness” of clusters? Can we use these metrics to identify an ideal number of clusters? We discuss the answers to these questions in this post. First, what makes a set of clusters “good”? Ideally, we want within-cluster variance to be low, and the between-cluster variance to be high. In other words, the within-cluster distance should be low, and the distance between clusters should be high.

We’ll discuss several metrics here. The metrics listed here are not at all exhaustive, but are the most commonly reported and used.

Clustering metrics

How do we define the distance between or within clusters? Is it the maximum distance between any two points? The minimum? Or perhaps the mean or median? We can use any of these, and what you choose depends on what your goal is. Let’s now look at some clustering metrics.

Generalized Dunn index

Wikipedia has an excellent article on this [1], so we’ll follow the notations used there. Suppose we use a within-cluster distance metric \Delta_i for cluster i. This can be the maximum, minimum, or mean/median of all this distances between two (distinct) points in the cluster. You could also compute the average distance between all points and the cluster mean. Now, let \delta(C_i, C_j) denote the distance (with any definition) between clusters i, j. Then, for m clusters, the Generalized Dunn index (GDI) is defined as

DI_M = \frac{\min\limits_{1 \leq i \leq j \leq m} \delta(C_i, C_j)}{\max\limits_k \Delta_k}

How we choose these distances depends on our goal. When we use something like k-means, which aim for globular clusters, the mean or maximum distance between any two points within a cluster seems like a reasonable within-cluster distance, and the minimum distance between any two points seems like a reasonable between-cluster distance. For density-based and spectral clustering, the reverse seems like a reasonable choice.

The Generalized Dunn index is computationally expensive to compute, especially as the number of clusters increases. This is its main drawback. You should note that because the denominator uses max, that the Dunn index is a worst-case (lower bound) score. This is because if every cluster except one is compact and clearly separated from the others, but the remaining one is not compact (that is, the within-cluster distance is high), then the Dunn index reduces. You should use other metrics along with the Dunn index to get a better idea of the clustering validity.

When we choose \delta to be the minimum distance between points in the two clusters, and \Delta to be the maximum distance between two points in the same cluster, we get the Dunn index.

Baker-Hubert Gamma index

The Baker-Hubert gamma index is an adaptation of the Gamma index of correlation between two vectors of data [2].

Suppose we have two vectors A = \{a_1, a_2, \ldots, a_n\}, B = \{b_1, b_2, \ldots, b_n\}. Then, two given indices i, j are said to be concordant if whenever a_i < a_j, we have b_i < b_j. Otherwise, the two indices are said to be discordant. We now compute the number of concordant pairs, s^+, and the number of discordant pairs s^-. The gamma index is then defined as

\Gamma = \frac{s^+-s^-}{s^++s^-}

In the context of clustering, we define the first vector to be the set of distances between two points in the data (regardless of whether or not they’re in the same cluster). The corresponding element of the second vector is binary–it is 0 if the two points are in the same cluster, and 1 otherwise. The Gamma index is then computed.

From Bernard Desgraupes’ document [2], the number s^+ is the number of times that a distance between two points in the same cluster is less than than a distance between two points in different clusters. Why? Because for a concordant pair, in our case, b_i < b_j means that the ith pair was in the same cluster (i.e., b_i = 0), and the jth pair in the vector A had two points in different clusters (i.e., b_j = 1 and thus b_i = 0 < 1 = b_j). The number s^- is the number of times the opposite situation occurs.

Silhouette index

The silhouette index is one of the most commonly reported metric, especially with k-means clustering. The silhouette index is the average of the silhouette values of each point, where the silhouette value is a measure of how similar an object is to its own cluster as compared to other clusters [8]. To compute the silhouette value for a point x^{(i)} that belongs to cluster C_i, we first compute a(i):

a(i) = \frac{1}{|C_i|-1}\sum\limits_{x^{(j)}\in C_i, i \neq j}d(x^{(i)}, x^{(j)})

We divide by |C_i|-1, since we don’t include the distance of the point to itself. This may be interpreted as how good of an assignment the point to its current cluster is. Next, we find for each cluster that x^{(i)} does not belong to, we find the average distance of x^{(i)} to all points in that cluster:

b(i) = \min\limits_{i \neq j} \frac{1}{|C_j|}\sum\limits_{x^{(j)} \in C_j}d(x^{(i)}, x^{(j)})

Finally, we compute the silhouette value:

s(i) = \begin{cases} 1-a(i)/b(i), & \mbox{if } a(i) \leq b(i) \\ b(i)/a(i)-1, & \mbox{if } a(i) > b(i) \\ \end{cases}

The silhouette index (the average of all silhouette values) lies between -1 and +1, and can be interpreted as follows [9]:

Value Interpretation
0.71-1.0 A strong structure has been found
0.51-0.70 A reasonable structure has been found
0.26-0.50 The structure is weak and could be artificial. Try additional methods of data analysis.
<0.25 No substantial structure has been found

Calinski-Harabasz index

The Calinski-Harabasz index, also called the pseudo-F statistic, yields the ratio of between-cluster variance to within cluster variance [5]. This metric is used with hierarchical clustering methods. If at a given step in the algorithm, there are k clusters, m is the number of data points, the pseudo-F statistic is defined as

F = \frac{GSS / (k-1)}{WSS / (m-k)}

where GSS is the between-cluster sum of squares, and WSS is the within-cluster sum of squares. Larger values of the pseudo-F statistic are preferred, since this implies compact and well-separated clusters.

Cophenetic correlation

Because hierarchical clustering does not maintain the distance between points when we form the dendogram, one might wonder how well the distances between the original data points are preserved in the dendogram. The cophenetic correlation is a measure of this, and is defined as follows [6]: let d_{ij} represent the distance of x^{(i)} and x^{(j)}. Let t_{ij} be the height of the dendogram at which x^{(i)} and x^{(j)} first get merged into one cluster. Finally, we let \bar{d} represent the mean of all the d_{ij}s, and \bar{t} be the mean of all the t_{ij}s. The cophenetic correlation is defined as:

c = \frac {\sum\limits_{i<j} (d_{ij} - \bar{x})(t_{ij} - \bar{t})}{\sqrt{[\sum\limits_{i<j}(d_{ij}-\bar{x})^2] [\sum\limits_{i<j}(t_{ij}-\bar{t})^2]}}

A value of c close to 1 is ideal.

Comparing different partitions

We can also compare the partitions created using two algorithms. This can be used to test the effectiveness of the clusters produced. The measures we discuss here fall under external indices or evaluations [3], since we’re comparing two partitions. Therefore, to test the efficacy of your clustering algorithm, you could hold back some data, say 20%, when you give it to the clustering algorithm. You cluster this held-out data independently, say using human experts. Once the algorithm has identified clusters (using the 80% that you gave it), you give it this previously left-out (20%) data, and see how well it performs compared to the “gold standard” human clustering. Essentially, we’re labeling the held-out data. Let’s look at some metrics. First, we’ll briefly define some notation, from Wikipedia [4].

Suppose we have a dataset, X = \{ x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}. Suppose we want to compare two partitions, P_1 = \{ X_1, X_2, \ldots, X_r\} and P_2 = \{ Y_1, Y_2, \ldots, Y_s\}. Let’s define the following:

  • yy: The number of data points that are in the same cluster in both the partitions.
  • yn: The number of data points that are in the same cluster in partition P_1, but in different clusters in partition P_2.
  • ny: The number of data points that are in different clusters in partition P_1, but in the same cluster in partition P_1.
  • nn: The number of data points that are in different clusters in both the partitions.

Rand index

The Rand index is defined as

R = \frac{yy+nn}{yy+yn+ny+nn}

Intuitively, the Rand index acts as an accuracy measure for clusters, and is between 0 and 1. There’s also an adjusted Rand index (ARI), which is actually used in practice. The ARI is corrected for chance. To compute the adjusted Rand index, we first compute a contingency table:

{\displaystyle {\begin{array}{c|cccc|c}{{} \atop X}\!\diagdown \!^{Y}&Y_{1}&Y_{2}&\ldots &Y_{s}&{\text{Sums}}\\\hline X_{1}&n_{11}&n_{12}&\ldots &n_{1s}&a_{1}\\X_{2}&n_{21}&n_{22}&\ldots &n_{2s}&a_{2}\\\vdots &\vdots &\vdots &\ddots &\vdots &\vdots \\X_{r}&n_{r1}&n_{r2}&\ldots &n_{rs}&a_{r}\\\hline {\text{Sums}}&b_{1}&b_{2}&\ldots &b_{s}&\end{array}}}

Essentially, each entry is the number of common elements in X_i and Y_j. Now, the Adjusted Rand index (ARI), also called the corrected Rand index, is defined as [4]:

{\displaystyle \overbrace {ARI} ^{\text{Adjusted Index}}={\frac {\overbrace {\sum _{ij}{\binom {n_{ij}}{2}}} ^{\text{Index}}-\overbrace {\left[\sum _{i}{\binom {a_{i}}{2}}\sum _{j}{\binom {b_{j}}{2}}\right]/{\binom {n}{2}}} ^{\text{Expected Index}}}{\underbrace {{\frac {1}{2}}\left[\sum _{i}{\binom {a_{i}}{2}}+\sum _{j}{\binom {b_{j}}{2}}\right]} _{\text{Max Index}}-\underbrace {\left[\sum _{i}{\binom {a_{i}}{2}}\sum _{j}{\binom {b_{j}}{2}}\right]/{\binom {n}{2}}} _{\text{Expected Index}}}}}

Fowlkes-Mallows index

If the Rand index acts like an accuracy measure, we can similarly define precision and recall measures. The Fowlkes-Mallows index, also called the G-measure, is the geometric mean of the precision and recall. In the equation below, the first term corresponds to the precision, and the second corresponds to the recall.

FM = \sqrt{\frac{yy}{yy+yn}\cdot \frac{yy}{yy+ny}}

Hubert Gamma index

For each of the two partitions, we can associate a random variable. Think of a random variable as a function, say R(x) that maps to real numbers. For each partition, we define a random variable that takes in two indices, i and j, such that i < j, and returns 1 if x^{(i)}, x^{(j)} are in the same cluster in that partition. Using the notation above, let’s call these two partitions X and Y. Then, the Hubert Gamma index is the correlation coefficient of the random variables R_1, R_2 associated with these two partitions [2]:

\begin{aligned} \Gamma &= \frac{\sum\limits_{i<j} (R_1(i, j) - \mu_{R_1})(R_2(i, j)-\mu_{R_2})}{n\sigma_{R_1}\sigma_{R_2}} \\ &= \frac{n\cdot yy -(yy+yn)(yy+ny)}{\sqrt{(yy+yn)(yy+ny)(nn+yn)(nn+ny)}} \end{aligned}

where n = yy+yn+ny+nn.

Sources

[1] Dunn index – Wikipedia

[2] Clustering Indices by Bernard Desgraupes

[3] Cluster analysis – Wikipedia

[4] Rand index – Wikipedia

[5] Lela Wilkinson et al., Cluster analysis

[6] Cophenetic correlation – Wikipedia

[7] Where to cut a dendogram? by chl on Cross Validated

[8] Silhouette (clustering) – Wikipedia

[9] L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons, 2009

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s