12.3 – Hierarchical Clustering

We’ll now discuss a class of clustering algorithms called hierarchical clustering algorithms. In hierarchical clustering, you don’t immediately build a set number of clusters (although you certainly could, as we’ll discuss). Instead, you make a hierarchy of clusters.

[Figure: nested clusters (left) and the corresponding dendrogram (right). Image from Analytics Vidhya.]

As you can see from the image on the left, hierarchical clustering forms nested clusters. This is very different from the previous clustering approaches that we discussed. On the right is the corresponding dendrogram, which shows the point numbers on the x-axis and the distance on the y-axis. Dendrograms are another way of visualizing how the cluster hierarchy was formed. Broadly, hierarchical clustering algorithms are of two types:

  • Agglomerative hierarchical clustering algorithms are bottom-up algorithms that start by treating every point as its own cluster and move up the tree (dendrogram), merging clusters until only one remains.
  • Divisive hierarchical clustering algorithms work top-down, considering all points as one single cluster and splitting it iteratively.

We will only discuss agglomerative hierarchical clustering here. In agglomerative hierarchical clustering, you start by treating every point as its own cluster. You then combine clusters to get bigger clusters, until you end up with a single cluster containing all the points.
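Before we build anything by hand, it’s worth noting that this procedure is available off the shelf. As a quick aside, here is a minimal sketch (assuming SciPy and matplotlib are installed, and using random points purely for illustration) that produces a dendrogram like the one above; we’ll still implement the algorithm ourselves in the rest of this post.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.rand(10, 2)           # ten random 2D points, just for illustration
Z = linkage(X, method='single')     # 'complete', 'average', and 'ward' also work
dendrogram(Z)                       # plot the cluster hierarchy
plt.show()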

How do you combine clusters? In the beginning, when all the clusters have only one point each, it’s trivial: you want the distance between points in the same cluster to be small (this is called cohesion) and the distance between points in different clusters to be larger (this is called separation); therefore, you merge the points that are closest to each other. But what about future steps? There are several heuristics we could use [1]:

  • Single Linkage or MIN: Here, the distance between two clusters is the minimum distance over all pairs of points, one taken from each cluster. This handles non-elliptical shapes, but is sensitive to noise and outliers.
  • Complete Linkage or MAX: This is like the first one, except that we consider the maximum distance instead. This is less sensitive to noise and outliers, but favors globular shapes. The dendrogram shown at the top uses this approach.
  • Group Average: Here, you take the average of the distances between all possible pairs of points, with one point from each cluster.
  • Ward’s method: In Ward’s method, you combine the two clusters whose merge results in the minimum increase in within-cluster variance [2]. Another way of phrasing it is that you combine the clusters that result in the minimum increase in the squared error. How do we compute this? Let’s define the error sum of squares as follows. Let x^{(j)[i]}_k denote the kth feature of the jth training example, which belongs to the ith cluster. This is the same notation we used in the first post, with an added index for the cluster number. We’ll use \bar{x}^{[i]}_k to denote the kth feature of the mean of the ith cluster. Then, the error sum of squares is defined as

ESS = \sum\limits_i \sum\limits_j \sum\limits_k \left\vert x^{(j)[i]}_k - \bar{x}^{[i]}_k \right\vert^2

   That is, we find the squared difference between each feature value of each point in every cluster and the corresponding feature value of the cluster’s mean, and add up all these squared differences. We merge the pair of clusters for which the ESS value after merging is minimum [3]; a short code sketch of this computation follows the list.
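Here’s a minimal sketch of the ESS computation, assuming the clusters are given as a list of numpy arrays (one array of points per cluster); the function name is just for illustration.

import numpy as np

def error_sum_of_squares(clusters):
    # clusters: list of arrays, each of shape (points_in_cluster, n_features)
    ess = 0.0
    for points in clusters:
        centroid = points.mean(axis=0)             # cluster mean, feature by feature
        ess += np.sum((points - centroid) ** 2)    # squared deviations from the mean
    return ess

# Example: two small clusters in 2D
c1 = np.array([[0.0, 0.0], [1.0, 0.0]])
c2 = np.array([[5.0, 5.0], [6.0, 5.0], [5.5, 6.0]])
print(error_sum_of_squares([c1, c2]))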

The Lance-Williams algorithms

In the naive implementation of agglomerative hierarchical clustering, you recompute the distance matrix between every pair of clusters at each step. The Lance-Williams algorithms are a family of agglomerative hierarchical clustering algorithms whose cluster distances at each step can be computed with a recursive formula. Any hierarchical clustering technique that can be expressed using the Lance-Williams formula does not need to keep the original data points [1].

We’ll use Wikipedia’s notation here [3]. Suppose that C_i, C_j are the next clusters to be merged. Let d_{ij} be the distance (using any method discussed above) between C_i and C_j. Further, let d_{(ij)k} be the distance between the merged cluster C_i \cup C_j and cluster C_k. An algorithm belongs to the Lance-Williams family if the distance d_{(ij)k} can be recursively computed as

d_{(ij)k} = \alpha_i d_{ik} + \alpha_j d_{jk} + \beta d_{ij} + \gamma \left\vert d_{ik}-d_{jk} \right\vert

All the methods we discussed above belong to the Lance-Williams family. The coefficient values for each method are summarized below [1], where n_i, n_j, and n_k denote the number of points in clusters C_i, C_j, and C_k:

  • Single Linkage: \alpha_i = 0.5, \alpha_j = 0.5, \beta = 0, \gamma = -0.5
  • Complete Linkage: \alpha_i = 0.5, \alpha_j = 0.5, \beta = 0, \gamma = 0.5
  • Group Average: \alpha_i = \frac{n_i}{n_i+n_j}, \alpha_j = \frac{n_j}{n_i+n_j}, \beta = 0, \gamma = 0
  • Ward’s method: \alpha_i = \frac{n_i+n_k}{n_i+n_j+n_k}, \alpha_j = \frac{n_j+n_k}{n_i+n_j+n_k}, \beta = \frac{-n_k}{n_i+n_j+n_k}, \gamma = 0
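As a quick sanity check, plugging the single linkage coefficients into the recursion recovers the familiar minimum-distance rule:

d_{(ij)k} = 0.5\, d_{ik} + 0.5\, d_{jk} - 0.5 \left\vert d_{ik} - d_{jk} \right\vert = \min(d_{ik}, d_{jk})

since the average of two numbers minus half their absolute difference is the smaller of the two. The complete linkage coefficients (with \gamma = 0.5) similarly yield \max(d_{ik}, d_{jk}).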

Implementing hierarchical clustering

Now let’s implement hierarchical clustering using the Lance-Williams formula. We’ll do this step by step: rather than present the finished clustering function, I’ll first show you how to run the steps manually, and then develop the function. This makes the logic easier to follow. We won’t implement a function to compute the distance matrix; we’ll simply use a built-in function to do that for us. First, let’s start with a dataset (from [1]).

x = [[0.4, 0.53], [0.22, 0.38], [0.35, 0.32], [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]]

We’ll compute the distance matrix now:

import numpy as np
from sklearn.metrics import pairwise_distances

d1 = pairwise_distances(x)   # 6x6 matrix of pairwise Euclidean distances

Let’s write functions to implement the Lance-Williams update formula:

def update(dist, i, j, k, alpha1, alpha2, beta, gamma):
    # Lance-Williams update: distance between the merged cluster (i, j) and cluster k
    return alpha1 * dist[i][k] + alpha2 * dist[j][k] + beta * dist[i][j] + \
        gamma * abs(dist[i][k] - dist[j][k])

We’ll only implement single linkage here, but the others should be easy enough:

def single_link(dist, i, j, k):
    return update(dist, i, j, k, 0.5, 0.5, 0, -0.5)
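For reference, the other methods in the coefficient list translate just as directly. These helpers are only a sketch and aren’t used in the walkthrough below; group average needs the cluster sizes, which we aren’t tracking here, so they’re passed in explicitly.

def complete_link(dist, i, j, k):
    return update(dist, i, j, k, 0.5, 0.5, 0, 0.5)

def group_average(dist, i, j, k, n_i, n_j):
    # n_i, n_j: number of points in clusters i and j
    return update(dist, i, j, k, n_i / (n_i + n_j), n_j / (n_i + n_j), 0, 0)

As a quick check, single_link(d1, 0, 1, 2) should equal min(d1[0][2], d1[1][2]), matching the identity we derived from the coefficient list.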

Next, we need to combine the two clusters that have the minimum distance in the matrix. Basically, we just need to use argmin. However, the diagonal elements of our matrix are 0, and therefore the smallest. So we’ll replace them with a larger value so that argmin doesn’t pick a diagonal entry.

d1[d1 == 0] = np.max(d1) + 1

I arbitrarily added 1; you could multiply by 2 or 5 or 10, anything at all, really, as long as you’re sure the genuine distance values can never reach that high. Now we can get the argmin, that is, the indices of the smallest value. However, since we have a 2D distance matrix, we’ll need to use another function as well to get the correct indices:

indices = np.unravel_index(np.argmin(d1), d1.shape)

The argmin function would’ve given us a single number: it flattens our 2D array into a 1D array and computes the argmin of that. The unravel_index function then gives us the corresponding 2D indices of the minimum element.
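If the relationship between argmin and unravel_index isn’t obvious, here’s the equivalence spelled out for our 2D, row-major case:

flat = np.argmin(d1)                               # index into the flattened matrix
row, col = flat // d1.shape[1], flat % d1.shape[1]
print((row, col), indices)                         # both give the same pair of indices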

Now that we know the indices of the minimum element, these indices tell us which clusters we’re merging. In the above step, you should’ve gotten (2, 5) as the indices. This means we’re merging the 3rd and 6th clusters (since Python uses 0-based indexing for arrays).

What we’ll do now is this: we’ll remove the 3rd and 6th clusters entirely (that is, the 3rd and 6th rows and columns). We’ll only remove these rows and columns in a copy of the distance matrix, because we still need the original one to use the Lance-Williams formula. After deleting them in the copy, we’ll add one row and one column corresponding to the combined cluster. Since the distance matrix is symmetric, we only have to compute the values of the row, and fill the column with the same values. These distance values can be computed using the Lance-Williams formula. Let’s proceed.

# Remove the rows for the two merged clusters. After deleting row indices[0],
# every later row shifts up by one, hence the indices[1] - 1.
d2 = np.delete(d1, indices[0], axis=0)
d2 = np.delete(d2, indices[1] - 1, axis=0)

# Do the same for the columns
d2 = np.delete(d2, indices[0], axis=1)
d2 = np.delete(d2, indices[1] - 1, axis=1)

Above, we’ve deleted the 3rd and 6th rows and columns (axis=0 means rows, axis=1 means columns). Note that np.delete does not delete in-place; it returns a new array, so in the first statement we’re making a new array variable, and in the subsequent lines, we’re modifying this new array. Thus, the original distance matrix is unchanged. If you print out d2 now, you’ll find that it’s a 4×4 matrix, and you should see that it’s obtained by removing the 3rd and 6th rows and columns.

Now, let’s add a row and a column in the beginning, corresponding to the combined cluster.

d2 = np.insert(d2, 0, 0, axis=1)
d2 = np.insert(d2, 0, 0, axis=0)

We’ve filled this row and column with zeros (the third argument). Finally, let’s fill these with the real values using the Lance-Williams formula:

cur_index = 0
for i in range(len(d1)):
    if i not in indices:   # skip the two clusters we just merged
        cur_index += 1
        d2[0][cur_index] = d2[cur_index][0] = single_link(d1, indices[0], indices[1], i)
        print(cur_index, single_link(d1, indices[0], indices[1], i))

We use a cur_index variable to get the correct index for the value that we’re filling in. We skip the indices that we removed (hence the if condition), and we call the function we wrote earlier with indices[0] and indices[1], since those are the i and j values. The value of k is the cluster whose distance to the new cluster we want to compute, so we pass in our loop variable, i (perhaps confusingly named).

Finally, let’s make sure our very first matrix element is not zero (because we don’t want argmin to point to it).

d2[0][0] = d2[1][1]   # copy an existing placeholder (max + 1) onto the new diagonal entry

Let’s now build a mapping of what the indices in the new 5 x 5 matrix mean with respect to the original clusters. We removed the 3rd and 6th indices, and added one in the beginning corresponding to the new cluster. Therefore, the first index is now the (3, 6) cluster, and the rest of the indices are the first, second, fourth, and fifth clusters (or points).
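To track this mapping programmatically rather than in our heads, we can maintain a list of labels alongside the distance matrix. Here’s a plain-Python sketch of that bookkeeping; the full function below does the same thing with a numpy string array.

labels = ['1', '2', '3', '4', '5', '6']
merged = '(' + labels[indices[0]] + ', ' + labels[indices[1]] + ')'
labels = [label for idx, label in enumerate(labels) if idx not in indices]
labels.insert(0, merged)
print(labels)   # ['(3, 6)', '1', '2', '4', '5']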

Now that we’ve gone over the steps, let’s write out the clustering function.

def hierarchical_cluster(points, metric='euclidean'):
    dist = pairwise_distances(points, metric=metric)
    print('Distance matrix:')
    print(dist)
    
    dist[dist == 0] = np.max(dist) + 1
    
    # Maintain a list of clusters
    clusters = [str(i) for i in range(1, len(dist) + 1)]
    
    # Convert to numpy array and set dtype of appropriate size
    dtype = '<U' + str(5 * sum([len(i) for i in clusters]))
    clusters = np.array(clusters, dtype=dtype)
    
    for _ in range(len(dist) - 1):
        # Find the indices of the minimum element
        indices = np.unravel_index(np.argmin(dist), dist.shape)

        # Update the cluster list
        c1 = clusters[indices[0]]
        c2 = clusters[indices[1]]
        new_cluster = '(' + c1 + ', ' + c2 + ')'
        print(new_cluster)
        clusters = np.delete(clusters, indices)
        clusters = np.insert(clusters, 0, new_cluster, axis=0)
        print('Clusters:', clusters)

        # Build a new distance matrix: start by removing the older points
        new_dist = np.delete(dist, indices[0], axis=0)
        new_dist = np.delete(new_dist, indices[1] - 1, axis=0)
        new_dist = np.delete(new_dist, indices[0], axis=1)
        new_dist = np.delete(new_dist, indices[1] - 1, axis=1)

        # Next, add the combined cluster at the beginning. Start by
        # creating spaces for the distances between this new cluster
        # and all other clusters.
        new_dist = np.insert(new_dist, 0, 0, axis=1)
        new_dist = np.insert(new_dist, 0, 0, axis=0)

        # Fill in values of the distances using the Lance-Williams equation
        cur_index = 0
        for i in range(len(dist)):
            if i not in indices:
                cur_index += 1
                new_dist[0][cur_index] = new_dist[cur_index][0] = single_link(dist, 
                                                                              indices[0], 
                                                                              indices[1], 
                                                                              i)
        new_dist[0][0] = np.max(new_dist) + 1
        dist = new_dist
    
    return clusters

There’s a little extra stuff going on, so let’s go over this. We first compute the pairwise distance matrix, using the Euclidean distance metric. Alternatively, we could pass any other metric, like the Manhattan distance, as well.

We maintain a list of the clusters we’ve made. Initially, this is just a list of the individual clusters. We convert it to a numpy array, but there’s a small catch. Numpy only allocates enough space per array element for the current longest string to fit. However, when we merge clusters, the strings get longer, so we need to tell numpy to use a larger fixed width. Each merge wraps two existing names in parentheses and a comma plus a space, adding four characters, so the final label is the total length of the initial labels plus four characters per merge; allocating 5 times the total length of the initial strings comfortably covers that here. When merging clusters, we remove the two individual clusters from our list and add the merged string at the beginning of the cluster list.
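To see why the dtype matters, here’s a minimal illustration of numpy’s fixed-width string behaviour:

a = np.array(['1', '2'])   # dtype is '<U1': one character per element
a[0] = '(1, 2)'            # the assignment is silently truncated to fit
print(a)                   # ['(' '2']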

The rest of the function is what we’ve already seen, except for the last line of the loop, where we set the dist matrix to the new distance matrix. This works because computing the distance matrix for the next step only requires the previous one, not the initial one. In fact, we don’t even need the original points.

Let’s run our function on our data points.

hierarchical_cluster([[0.4, 0.53], [0.22, 0.38], [0.35, 0.32], [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

On my system, I get the output below.

Distance matrix:
[[0.         0.23430749 0.21587033 0.36769553 0.34176015 0.23537205]
 [0.23430749 0.         0.14317821 0.19416488 0.14317821 0.24351591]
 [0.21587033 0.14317821 0.         0.15811388 0.28460499 0.10198039]
 [0.36769553 0.19416488 0.15811388 0.         0.28425341 0.21954498]
 [0.34176015 0.14317821 0.28460499 0.28425341 0.         0.38600518]
 [0.23537205 0.24351591 0.10198039 0.21954498 0.38600518 0.        ]]
(3, 6)
Clusters: ['(3, 6)' '1' '2' '4' '5']
(2, 5)
Clusters: ['(2, 5)' '(3, 6)' '1' '4']
((2, 5), (3, 6))
Clusters: ['((2, 5), (3, 6))' '1' '4']
(((2, 5), (3, 6)), 4)
Clusters: ['(((2, 5), (3, 6)), 4)' '1']
((((2, 5), (3, 6)), 4), 1)
Clusters: ['((((2, 5), (3, 6)), 4), 1)']

We can clearly see what clusters were merged at each step. Finally, we end up with one single cluster.
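As mentioned at the start, you can still extract a fixed number of clusters from a hierarchy by cutting the dendrogram at some level. Our toy function only returns the merge history, but here’s a sketch of what the cut looks like with SciPy’s linkage output (assuming SciPy is installed):

from scipy.cluster.hierarchy import fcluster, linkage

Z = linkage(x, method='single')                        # x is the dataset from above
flat_labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into two flat clusters
print(flat_labels)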

Sources

[1] Tan, P.N., 2018. Introduction to data mining. Pearson Education India.

[2] Ward’s method – Wikipedia

[3] STAT 505 – Penn State University
