Let’s look at another heuristic for deciding which points constitute a cluster. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering approach: points that are packed closely together are put into the same cluster. How is this different from k-Means? In k-Means, you have a fixed number of clusters, and every point is assigned to one of them. This is not necessarily the case with DBSCAN. If DBSCAN finds that a point is not close *enough* to its neighbors to satisfy a density requirement, the point is labeled as noise and excluded from every cluster.

DBSCAN identifies three types of points:

- **Core points** are located in the interior of a high-density region (cluster).
- **Border points** are located on the edge of a cluster. They are not core points, but fall in the neighborhood of (at least) one.
- **Noise points** are points that are neither core points nor border points.

Core points are identified by two user-specified parameters (as opposed to only one parameter in k-Means): *eps* and *minPts*. Within a point’s neighborhood of radius *eps*, if there are at least *minPts* number of points, then it is a core point.

Let’s look at this image. All the red points are core points: for each of them, there are at least *minPts* = 4 points within a neighborhood of radius *eps*. One possible point of confusion: the definition does not say at least *minPts* *other* points, only at least *minPts* points. So while the right-most red point has only 3 other points within its *eps*-neighborhood, we also count the point itself. The yellow points are border points, and the blue point is a noise point.
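To make these definitions concrete, here is a minimal Python sketch (my own illustration, not taken from the sources) that classifies points as core, border, or noise using a brute-force neighborhood search:

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.
    The eps-neighborhood count includes the point itself, as noted above."""
    # Indices of every point within eps of each point (brute force, O(n^2))
    neighborhoods = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if len(nbrs) >= min_pts:
            labels.append("core")
        elif any(len(neighborhoods[j]) >= min_pts for j in nbrs):
            labels.append("border")  # falls in the neighborhood of a core point
        else:
            labels.append("noise")
    return labels
```

For example, a tight square of four points with *minPts* = 4 makes all four core; a point hanging off the edge becomes a border point, and a far-away point becomes noise.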

## Algorithm

We’ll show two algorithms here: a short version and a longer, more detailed version.

### Short version

This is taken from source [1].

```
ALGORITHM DBSCAN(eps, minPts):
==============================
1. Label all points as core, border, or noise points.
2. Eliminate all noise points.
3. Put an edge between all core points that are within eps of each other.
4. Assign each border point to one of the clusters of its associated core points.
```

This is pretty easy to follow. Core points that are within *eps* of each other are put in a cluster, and border points are assigned to any of the clusters they belong to.

### Longer version

This is adapted from Wikipedia’s article [2].

```
ALGORITHM DBSCAN(eps, minPts):
==============================
1.  C = 0
2.  for each point P:
2.1     if label(P) is not null then continue
2.2     Neighbors N = RangeQuery(P, eps)
2.3     if |N| < minPts then:
2.3.1       label(P) = Noise
2.3.2       continue
2.4     C = C + 1
2.5     label(P) = C
2.6     Set S = N - {P}
2.7     for each point Q in S:
2.7.1       if label(Q) = Noise then label(Q) = C
2.7.2       if label(Q) is not null then continue
2.7.3       label(Q) = C
2.7.4       Neighbors N = RangeQuery(Q, eps)
2.7.5       if |N| >= minPts then S = S U N
```

This algorithm does essentially the same thing, but takes a different approach. Rather than labeling points as core or border, it numbers clusters. The algorithm iterates over every point in the training set. In step 2.1, we check whether the point has already been labeled; if so, we don’t process it again. Otherwise, we find all its neighbors within an *eps* radius (step 2.2). If this neighborhood contains fewer than *minPts* points, the point is labeled Noise, and we move on to the next point (note that continue moves to the next iteration of the loop). A point labeled Noise here could actually be a border point; this is fixed later, in step 2.7.1.

If we reach step 2.4, the current point has at least *minPts* points in its neighborhood, so it is a core point that starts a new cluster. We increment the current cluster number and assign the point to it. Next, we take the set S of points in the neighborhood, *excluding the point itself* (step 2.6), and iterate over it. If a point Q in S was previously mislabeled as Noise, we now know it is a border point, and we assign it to the current cluster (step 2.7.1). If Q carries any other label, it already belongs to a cluster, so we move on to the next point. Otherwise, Q is unlabeled, and we assign it to the current cluster (step 2.7.3). Finally, we check whether Q is itself a core point (steps 2.7.4 and 2.7.5): if its own neighborhood contains at least *minPts* points, those points are added to S, so the cluster keeps expanding. Roughly, this corresponds to steps 3 and 4 of the shorter version of the algorithm.
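As a rough translation of the longer pseudocode into Python (a brute-force RangeQuery, and a list standing in for the set S; the names are illustrative, not from the sources):

```python
from math import dist

def range_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return a label per point: a cluster number (1, 2, ...) or 'noise'."""
    labels = [None] * len(points)
    c = 0                                   # current cluster number
    for p in range(len(points)):
        if labels[p] is not None:           # step 2.1: already processed
            continue
        neighbors = range_query(points, p, eps)
        if len(neighbors) < min_pts:        # step 2.3: not a core point
            labels[p] = "noise"             # (may be relabeled as border later)
            continue
        c += 1                              # steps 2.4-2.5: new cluster
        labels[p] = c
        seeds = [q for q in neighbors if q != p]   # step 2.6: S = N - {P}
        i = 0
        while i < len(seeds):               # step 2.7: expand the cluster
            q = seeds[i]
            i += 1
            if labels[q] == "noise":        # step 2.7.1: border point found
                labels[q] = c
            if labels[q] is not None:       # step 2.7.2: already claimed
                continue
            labels[q] = c                   # step 2.7.3
            q_neighbors = range_query(points, q, eps)
            if len(q_neighbors) >= min_pts:         # steps 2.7.4-2.7.5:
                seeds.extend(n for n in q_neighbors # q is core, grow S
                             if n not in seeds)
    return labels
```

On two well-separated blobs plus an isolated point, this produces two numbered clusters and one noise label.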

## Selecting the parameters

Now, there is the issue of finding *eps* and *minPts*. We’ll rely on the idea that for any core point, the *k*th nearest neighbors are roughly at the same distance. Therefore, we will look at the distances of points to their *k*th nearest neighbors [1]. For points within a cluster, this distance will be small if *k* is less than the number of points in the cluster. If the cluster densities do not vary too much, there should not be much variation in these distances. For noise points, however, this distance will be much larger. Therefore, if we sort points according to their distance from their *k*th nearest neighbor, and plot these distances, we will see a sharp increase at a suitable value for *eps*. We then simply take the value of *k* as *minPts*; k = 4 is a reasonable choice for most 2D datasets. In the plot below, a value of *eps* around 7 seems reasonable.
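The values behind such a k-distance plot can be computed with a short sketch like this (a brute-force version; the function name is my own):

```python
from math import dist

def k_distances(points, k=4):
    """Sorted distance from each point to its k-th nearest neighbor.
    A sharp bend in the plot of these values suggests a good eps."""
    kth = []
    for i, p in enumerate(points):
        # Distances to all *other* points, nearest first
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        kth.append(ds[k - 1])       # k-th nearest neighbor distance
    return sorted(kth)
```

Plotting the returned list (e.g. with matplotlib) against the point rank gives the elbow curve described above: cluster points sit on the flat left part, and noise points cause the sharp rise on the right.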

## Practical considerations

When does DBSCAN work, and when does it fail? Because it uses a density-based approach, DBSCAN is resistant to noise, unlike k-Means. It can also handle clusters of different shapes and sizes, since it only looks at density, whereas k-Means tends to produce globular clusters. However, DBSCAN cannot handle clusters of varying densities, since a single (*eps*, *minPts*) pair cannot fit them all. Further, if the data is high-dimensional, it becomes difficult to find a good distance metric to cluster the data properly. Lastly, DBSCAN is quite sensitive to the choice of parameters: small changes in the parameter values can produce significantly different results.
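To see this parameter sensitivity in action, here is a small sketch using scikit-learn’s DBSCAN implementation (assuming scikit-learn is installed; the toy data and the specific parameter values are my own choices for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated blobs plus one outlier
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6],
              [10, 10]])

# With a tight eps: two clusters, and the outlier is noise.
# scikit-learn marks noise with the label -1.
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)

# With a loose eps, the blobs merge and even the outlier
# becomes a border point of the single remaining cluster.
labels_loose = DBSCAN(eps=6.0, min_samples=3).fit_predict(X)
```

Going from *eps* = 1.5 to *eps* = 6.0 collapses two clusters plus noise into one big cluster with no noise at all, which illustrates why the k-distance heuristic above is worth the effort.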

## Additional information

More on DBSCAN can be found in Dr. Saha’s notes on the topic.

## Sources

[1] Tan, P.N., 2018. *Introduction to data mining*. Pearson Education India.
[2] Wikipedia contributors. *DBSCAN*. Wikipedia, The Free Encyclopedia.