12.2 – DBSCAN

Let’s look at another heuristic for deciding what points constitute a cluster. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering approach. This means that points that are packed together are put into a cluster. How is this different from k-Means? In k-Means, you have a fixed number of clusters, and every point is assigned to one. This is not necessarily the case with DBSCAN. If DBSCAN finds that a point is not close enough to any neighboring point to satisfy a density requirement, it’s put in an individual cluster.

DBSCAN identifies three types of points:

1. Core points are located in the interior of a high-density region (cluster).
2. Border points are located on the edge of a cluster. They are not core points, but fall in the neighborhood of (at least) one.
3. Noise points are points that are neither core points nor border points.

Core points are identified by two user-specified parameters (as opposed to only one parameter in k-Means): eps and minPts. Within a point’s neighborhood of radius eps, if there are at least minPts number of points, then it is a core point.

Let’s look at this image. All the red points are core points. For each of these points, within a neighborhood of radius eps, there are at least minPts=4 points. We should clarify a confusion here. Our definition does not say at least minPts other points, it only says at least minPts points. So if you look at the right-most red point, it has only 3 other points within an eps-neighborhood, but we also count the point itself. The yellow points are border points, and the blue point is a noise point.

Algorithm

We’ll show two algorithms here–a short version and a longer, more detailed version.

Short version

This is taken from source [1].

```ALGORITHM DBSCAN(eps, minPts):
==============================
1. Label all points as core, border, or noise points.
2. Eliminate all noise points.
3. Put an edge between all core points that are within eps of each other.
4. Assign each border point to one of the clusters of its associated
core points.```

This is pretty easy to follow. Core points that are within eps of each other are put in a cluster, and border points are assigned to any of the clusters they belong to.

Longer version

This is adapted from Wikipedia’s article [2].

```ALGORITHM DBSCAN(eps, minPts):
==============================
1. C = 0
2. for each point P:
2.1 if label(P) is not null then continue
2.2 Neighbors N = RangeQuery(P, eps)
2.3 if |N| < minPts then:
2.3.1 label(P) = Noise
2.3.2 continue
2.4 C = C + 1
2.5 label(P) = C
2.6 Set S = N - {P}
2.7 for each point Q in S:
2.7.1 if label(Q) = Noise then label(Q) = C
2.7.2 if label(Q) is not null then continue
2.7.3 label(Q) = C
2.7.4 Neighbors N = RangeQuery(Q, eps)
2.7.5 if |N| >= minPts then S = S U N```

This algorithm essentially does the same thing, taking a different approach. Rather than label points as core or border, it numbers cluster. The algorithm iterates over every point in the training set. First, it labels points. In step 2.1, we check if the point has been previously labeled. If so, we don’t need to process it again. Otherwise, we find all neighbors in an eps radius, and if the number of points in this neighborhood is less than minPts, this point is labeled a noise point, and we move on to the next point (note that continue moves to the next iteration of the loop). Note that these points labeled as Noise could be border points too. We’ll fix this in a later step (2.7.1). Now that we’ve ascertained that the current point is neither a noise nor a border point, it’s a core point that belongs to a cluster. We increment the current cluster number and assign the current point to it. Next, we get all the points only in the neighborhood of the point, which does not include the point itself. If we’ve previously mislabeled this as a Noise point, we now know it’s a border point, and label it in the current cluster. If a point in this neighborhood has previously been labeled, we move on to the next one. Otherwise, if it’s unlabeled, we know it’s a border point, so we label it now (step 2.7.3). We then add the points from the neighborhood of point Q to this neighborhood set and repeat. Roughly, this corresponds to steps 3 and 4 of the shorter version of the algorithm.

Selecting the parameters

Now, there is the issue of finding eps and minPts. We’ll rely on the idea that for any core point, the kth nearest neighbors are roughly at the same distance. Therefore, we will look at the distances of points to their kth nearest neighbors [1]. For points within a cluster, this distance will be small if $k$ is less than the number of points in the cluster. If the cluster densities do not vary too much, there should not be much variation in these distances. For noise points, however, this distance will be much larger. Therefore, if we sort points according to their distance from their kth nearest neighbor, and plot this, we will see a sharp increase at a suitable value for eps. We then simply take the value of k as minPts. k = 4 is a reasonable choice for most 2D datasets. In the plot below, a value of eps around 7 seems reasonable.

Practical considerations

When does DBSCAN work, and when does it fail? Because it uses a density-based approach, DBSCAN is resistant to noise, unlike k-Means. It can also handle clusters of different shapes and sizes, since it only looks at density. k-Means, on the other hand, tries to get globular clusters. On the other hand, DBSCAN cannot handle clusters of varying densities. Further, if the data is high-dimensional, it becomes difficult to find a good distance metric to cluster the data properly. Lastly, DBSCAN is quite sensitive to the choice of parameters, and small changes in the parameter values can create significantly different results.