12.7 – Finding the number of clusters

Now that we know how to cluster data, let’s look at how to decide how many clusters to use. Clustering is typically used as an exploratory device to see if there’s any structure in the data, but in some cases it is useful to know a priori how many clusters there may be. Several methods exist for estimating the number of clusters, and we will discuss some of them here.

Elbow method

The elbow method is a rather simple way of choosing the number of clusters. The idea is this: for k = 1 to 10 clusters (say), compute some metric, such as the pseudo-F statistic (popular for hierarchical clustering) or the sum of the within-cluster sums of squares. For k-means, the within-cluster sum of squares (WSS) may be taken as the sum of squared distances from each point to its cluster center.

You plot these values against the number of clusters. Typically, you will see a bend, or an “elbow”: the point beyond which adding more clusters does not explain much more of the variation in the data. You can also plot the percentage of variance explained, which is derived from the WSS. Note that if you do not see a noticeable elbow, there probably aren’t any distinct clusters in the data; the plot doubles as a visual test for that. Let’s implement this, building on our k-means clustering code from before. First, let’s write a WSS function:

def WSS():
    """
    Computes the total within-cluster sum of squares, using the globals
    from the earlier k-means code: x (the data), C (the cluster centers),
    clusters (lists of point indices, one list per cluster) and distance().
    """
    sum_ = 0
    for i in range(len(C)):
        center = C[i]
        for j in clusters[i]:  # indices of the points assigned to cluster i
            point = x[j]
            sum_ += distance(point, center)
    return sum_

Where all the other variables and functions (x, C, clusters, and distance) are as defined previously. Now we can write some code that, for 1 up to (say) 5 clusters, generates random initial centers, clusters the data, and computes the error; the loop appears after a short aside below. You might get a slightly wonky curve, but that’s alright. The code sometimes produces nan values in the centers: this happens when a cluster ends up empty, causing a division by zero in the center update. We will not fix that here; the fix is simply to add a check for empty clusters and skip them. It’s not usually an issue, and in our case you can simply re-run the code if it happens.
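To make the fix concrete, here is a minimal sketch of what such a guard might look like. It assumes that step() recomputes each center as the mean of its assigned points; update_centers_safely is a hypothetical replacement for that update, not part of the earlier code.

def update_centers_safely():
    # Hypothetical guarded center update: a cluster with no points keeps
    # its previous center instead of producing nan from a 0/0 mean.
    for i in range(len(C)):
        if len(clusters[i]) == 0:
            continue  # empty cluster: leave C[i] as it is
        C[i] = list(np.mean([x[j] for j in clusters[i]], axis=0))

With that aside out of the way, here is the loop that gathers the errors: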

errors = []

for i in range(1, 6):  # try 1 to 5 clusters
    n_clusters = i

    # Range of each feature, used to draw the random initial centers
    lower_limits = [min(x.T[i]) for i in range(x.shape[1])]
    upper_limits = [max(x.T[i]) for i in range(x.shape[1])]

    # Random initial centers, one coordinate per feature
    C = [[np.random.uniform(lower_limits[i], upper_limits[i]) for i in range(x.shape[1])] for _ in range(n_clusters)]

    for _ in range(5):  # number of k-means steps
        step()

    errors.append(WSS())

We find the lower and upper limits of each dimension (our points were 2-dimensional, if you recall). We then make random centers, and run our clustering algorithm for 5 steps. You can run it longer if you wish. We gather all the error values to plot later. Let’s plot the elbow curve:

plt.plot(range(1, len(errors)+1), errors, 'ro--');

[Figure: elbow plot of WSS against the number of clusters]

Both 2 and 4 look like elbows here, but we choose 4 clusters because there is still a fairly large drop from 3 to 4. Let’s now plot the clusters and their centers for 4 clusters. To do this, run the code above again without the outer loop:

n_clusters = 4
lower_limits = [min(x.T[i]) for i in range(x.shape[1])]
upper_limits = [max(x.T[i]) for i in range(x.shape[1])]

C = [[np.random.uniform(lower_limits[i], upper_limits[i]) for i in range(x.shape[1])] for _ in range(n_clusters)]
    
for _ in range(5):  # number of steps
    step()

And now we can plot the clusters:

plt.scatter(x.T[0], x.T[1], c=assign_clusters(x));
plt.plot(np.array(C).T[0], np.array(C).T[1], 'rx');

[Figure: the data colored by cluster assignment, with the four cluster centers marked with red crosses]

This seems like a reasonable clustering. You might get slightly different clusters as well.
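As an aside, the percentage of variance explained mentioned earlier can be derived from the WSS. One common convention compares the WSS with the total sum of squares about the overall mean; the helper below is a minimal sketch of that (explained_variance_percentage is not part of the earlier code, and it assumes distance returns the same squared distance that WSS uses).

def explained_variance_percentage(wss):
    # Total sum of squares: distance from every point to the overall mean,
    # using the same distance function as WSS (assumed to be squared distance)
    overall_mean = np.mean(x, axis=0)
    tss = np.sum([distance(point, overall_mean) for point in x])
    return 100 * (1 - wss / tss)

Plotting [explained_variance_percentage(e) for e in errors] against the number of clusters gives the same elbow, just on a 0–100 scale that increases with the number of clusters.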

Silhouette method

Here, you could do one of two things. You could use the silhouette index as the metric in the elbow method described above. Alternatively, you could create a silhouette plot. A silhouette plot can be thought of as a specially organized horizontal bar graph: points in the same cluster are grouped together, and within each cluster the silhouette values are arranged in descending order. Here’s what it looks like (image taken from the sklearn documentation).

[Figure: silhouette analysis plot for k-means, from the sklearn documentation]

The red vertical line is optional: it marks the silhouette index (recall that this is the average of all the silhouette values). If every cluster’s bars extend to the right of the red line, the clustering is a good one. The linked documentation also has code to generate this plot.
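If you’d rather not adapt the sklearn example, a bare-bones version of the plot can be put together from silhouette_samples and silhouette_score. The sketch below assumes labels is an array of cluster assignments for the data X (for our code, something like labels = np.array(assign_clusters(x)) would do); it only aims to show the structure of the plot, not to reproduce the figure above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_plot(X, labels):
    # Per-point silhouette values and their average (the silhouette index)
    values = silhouette_samples(X, labels)
    average = silhouette_score(X, labels)

    y = 0
    for k in np.unique(labels):
        # Group the points of cluster k together, in descending order
        cluster_values = np.sort(values[labels == k])[::-1]
        plt.barh(range(y, y + len(cluster_values)), cluster_values, height=1.0)
        y += len(cluster_values)

    # The optional vertical line at the silhouette index
    plt.axvline(average, color='red', linestyle='--')
    plt.xlabel('silhouette value')
    plt.ylabel('points, grouped by cluster')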

Gap statistics

The Gap statistic was proposed by Tibshirani et al., then at Stanford University, in a 2001 paper. It’s based on comparing the clustering error on the data with the error on a null reference distribution: a set of points drawn from a distribution that has no obvious clustering. The paper suggests two ways of getting a null reference distribution:

  • Generate uniform data over the observed range of each feature of the data. This is straightforward to implement.
  • Generate uniform data along the principal components of the data. Quoting the paper (page 4):
In detail, if X is our n \times p data matrix, assume that the columns have mean 0 and compute the singular value decomposition X=UDV^T. We transform via X^\prime=XV and then draw uniform features Z^\prime over the ranges of the columns of X^\prime, as in method a) above. Finally we back-transform via Z = Z^\prime V^T to give reference data Z.

The second is slightly more complex, but as the paper notes, it “takes into account the shape of the data distribution and makes the procedure rotationally invariant, as long as the clustering method itself is invariant.” Suppose we have clustered the data into k clusters C_1, \dots, C_k. First, for every cluster C_r, we compute the sum of all pairwise distances:

D_r = \sum\limits_{x^{(i)}, x^{(j)} \in C_r} d(x^{(i)}, x^{(j)})

We then sum the normalized distances over all the clusters:

W_k = \sum\limits_{r=1}^{k}\frac{1}{2n_r}D_r

where n_r is the number of points in cluster C_r, and we divide by two because each pair is counted twice in the summation. Formally, the Gap statistic is defined as:

\text{Gap}(k) = \mathbb{E}^*_n\left[\log W_k\right] - \log W_k

Essentially, the first term is for the null reference distribution and the second is for the real data. A naive rule would be to choose k = \text{argmax}_k\, \text{Gap}(k). More carefully, we generate B null reference datasets and compute the average of \log W_k over them. These B values of \log W_k also have a standard deviation, \text{sd}(k). Accounting for the simulation error, we compute:

s_k = \sqrt{1+1/B}\text{sd}(k)

We choose the smallest k such that \text{Gap}(k) \geq \text{Gap}(k+1)-s_{k+1}. Let’s now implement this, starting with the per-cluster term of W_k:

import itertools

def Dk(i):
    # Sum of pairwise distances within cluster i over unique pairs only,
    # divided by the cluster size: this is the D_r / (2 n_r) term of W_k
    return np.sum([distance(x[pair[0]], x[pair[1]])
                   for pair in itertools.combinations(clusters[i], 2)
                  ]) / len(clusters[i])

We use the itertools package to get all pairs of points within a cluster. Note that the combinations function treats the pairs (a, b) and (b, a) as the same, so each pair appears only once; for this reason we drop the factor of 2 in the denominator, which means Dk(i) returns the already normalized term D_r/(2n_r) for cluster i rather than D_r itself. We can now implement W_k by summing over the clusters:

def Wk():
    # W_k: the normalized within-cluster dispersion summed over all clusters
    return np.sum([Dk(i) for i in range(len(C))])

We’ll make a dataset using sklearn here:

from sklearn.datasets import make_blobs

X = make_blobs(500, cluster_std=1.3)[0]  # 500 points; 3 blobs by default

And plot it:

plt.scatter(X.T[0], X.T[1])

[Figure: scatter plot of the generated dataset]

Let’s create a variable for the number of features:

dimensions = X.shape[1]

And some variables required to compute the Gap statistic:

Wks = []
Ewks = []
sds = []
B = 10

The first stores all the \log W_k values for each k. The second stores all the \mathbb{E}\left[\log W_k \right] values. The third stores all the \text{sd}(k) values. The rest of the code isn’t too hard:

for i in range(1, 5):  # up to 4 clusters
    n_clusters = i
    
    # First, run the clustering on actual data
    x = X
    lower_limits = [min(x.T[i]) for i in range(x.shape[1])]
    upper_limits = [max(x.T[i]) for i in range(x.shape[1])]
    
    C = [[np.random.uniform(lower_limits[i], upper_limits[i]) for i in range(dimensions)] for _ in range(n_clusters)]
    
    for _ in range(5):  # number of steps
        step()

    Wks.append(np.log(Wk()))
    
    Bwks = []
    # Now, on the null reference data
    for j in range(B):
        # Generate a dataset of 30 samples
        x = np.array([[np.random.uniform(lower_limits[i], upper_limits[i]) for i in range(dimensions)] for _ in range(30)])
        
        # Generate initial centers
        C = [[np.random.uniform(lower_limits[i], upper_limits[i]) for i in range(dimensions)] for _ in range(n_clusters)]
        # Cluster
        for _ in range(5):
            step()
 
        Bwks.append(np.log(Wk()))
    
    Ewks.append(np.average(Bwks))
    sds.append(np.sqrt(np.var(Bwks)))

We know there are supposed to be 3 clusters. We check for up to 4. Since all the functions we used for k-means clustering assume that the x (lowercase) variable stores the data, we set it to our data, X. We compute the lower and upper limits for each dimension. We set up some random centers and cluster by running 5 steps. Finally, we compute \log W_k and add it to the list.

Next, we do the same thing for the null reference data. We create a variable called Bwks that stores \log W_k for each of the B reference datasets. On each of the B iterations, we create a uniform null reference set (method (a)), set the lowercase x variable to it, and create random centers. We then cluster and compute \log W_k for this reference dataset. We append the average of these B values to the list that stores \mathbb{E}\left[ \log W_k \right], and we also record the standard deviation, which is the square root of the variance.
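For completeness, the reference data from method (b), drawn along the principal components, could be generated roughly as follows. This is a sketch based on the recipe quoted from the paper, not code used in the loop above, and generate_reference_b is a made-up name.

def generate_reference_b(X, n_samples):
    # Centre the columns and compute the SVD, as in the paper: X = U D V^T
    X_centered = X - X.mean(axis=0)
    U, D, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Rotate into the principal axes: X' = X V
    X_prime = X_centered @ Vt.T

    # Draw uniform features Z' over the ranges of the columns of X'
    Z_prime = np.column_stack([
        np.random.uniform(X_prime[:, j].min(), X_prime[:, j].max(), n_samples)
        for j in range(X_prime.shape[1])
    ])

    # Back-transform: Z = Z' V^T
    return Z_prime @ Vt

In the loop above, the reference dataset x would then be generate_reference_b(X, 30) instead of the per-feature uniform draw.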

Now, let’s compute s_k:

sks = [np.sqrt(1 + 1./B) * i for i in sds]

Now, we can compute the Gap statistics:

Gaps = [Ewks[i] - Wks[i] for i in range(len(Wks))]

And plot this:

plt.plot(range(1, len(Gaps)+1), Gaps, 'ro--');

[Figure: Gap statistic plotted against the number of clusters]

It seems like 3 might be a good number of clusters. Let’s check if it fulfills our other criterion as well:

for i in range(len(sks)-1):
    print('for k =', i+1, ', Gap(k) >= Gap(k+1) - s(k+1) is', Gaps[i] >= Gaps[i+1] - sks[i+1])

For me, this prints:

for k = 1 , Gap(k) >= Gap(k+1) - s(k+1) is False
for k = 2 , Gap(k) >= Gap(k+1) - s(k+1) is False
for k = 3 , Gap(k) >= Gap(k+1) - s(k+1) is True

Therefore, we pick the number of clusters as 3.

Note: I’m not sure why the Gap statistic values here (on the y-axis) are negative. I’m pretty sure it has something to do with the way we’ve computed it, but it’s possible there’s an error in what we’re doing. If so, please tell me about it. You can find another implementation for the Gap statistic in this blog.

Closing thoughts

We’ve covered a lot of ground in clustering, ranging from several popular clustering algorithms to clustering metrics and even finding an optimal number of clusters. But how do we know if the data is “clustered enough”, or “strongly clustered”? This Stack Exchange answer gives an overview. Essentially, the idea is to use statistical methods to “resample” from your dataset. Alternatively, the answer claims you could add small amounts of noise to your data, and re-run the clustering algorithm. Then, you use a metric called the Jaccard similarity index to check how many points remain in the same cluster despite this.
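To make that last idea concrete, here is a small sketch of the Jaccard comparison (the helper names are made up, and this is not code from the linked answer): for a cluster from the original run and its best match from the perturbed run, the Jaccard index is the size of their intersection divided by the size of their union, and values near 1 mean the cluster is stable.

def jaccard_index(cluster_a, cluster_b):
    # cluster_a, cluster_b: collections of point indices assigned to a cluster
    a, b = set(cluster_a), set(cluster_b)
    return len(a & b) / len(a | b)

def cluster_stability(original_clusters, perturbed_clusters):
    # For each cluster from the original run, report the Jaccard similarity
    # with its best-matching cluster from the perturbed (noisy) run
    return [max(jaccard_index(orig, pert) for pert in perturbed_clusters)
            for orig in original_clusters]

Averaging these values over several noisy re-clusterings gives a rough per-cluster stability score.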

With this, we stop our discussion of cluster analysis, and move on to other topics.
