Issue #65 - Unsupervised clustering with DBSCAN

Jul 14, 2024

💊 Pill of the Week

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a powerful and versatile unsupervised learning algorithm used for clustering tasks. Unlike centroid-based methods such as K-Means, DBSCAN can identify clusters of varying shapes and sizes and is particularly effective at handling noise and outliers. In this article, we will explore the mechanics of DBSCAN, its applications, and how to implement it in Python.

How Does DBSCAN Work?

DBSCAN groups points that are closely packed together and marks points in low-density regions as outliers. Here’s a step-by-step explanation:

1. Parameters

Epsilon (ε): The maximum distance between two points for them to be considered as part of the same neighborhood.
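Minimum Samples (minPts): The minimum number of points that must lie within a point's ε-neighborhood for that neighborhood to count as a dense region. In scikit-learn (min_samples), this count includes the point itself.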

2. Core Points, Border Points, and Noise

Core Points: Points that have at least minPts points within ε distance (the point itself is usually counted).
Border Points: Points that are within ε distance of a core point but do not have enough neighbors to be a core point themselves.
Noise: Points that are neither core points nor border points.
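
To make the three definitions concrete, here is a minimal NumPy sketch that classifies each point as core, border, or noise. The function name classify_points, the toy coordinates, and the parameter values are assumptions for illustration, and the neighbor count includes the point itself (the scikit-learn convention).

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise' (hypothetical helper)."""
    # Pairwise Euclidean distances; fine for small datasets only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps                      # each point is its own neighbor
    is_core = neighbors.sum(axis=1) >= min_pts
    # A border point is not core but has at least one core point within eps.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Toy data (assumed): four points in an L shape plus one far-away point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
kinds = classify_points(X, eps=1.1, min_pts=3)
print(kinds.tolist())
```

The pairwise distance matrix keeps the sketch short, but it needs memory quadratic in the number of points, so it is only suitable for small examples.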

3. Clustering Process

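Expand Clusters: DBSCAN picks an arbitrary unvisited point and retrieves its ε-neighborhood. If the point is a core point, a new cluster is started and grown by adding every point that is density-reachable from it. If not, the point is provisionally labeled as noise; it may later be absorbed into a cluster as a border point.
Density-Reachable: A point q is density-reachable from a core point p if there is a chain of points from p to q in which each point lies within ε distance of the previous one and every point in the chain, except possibly q itself, is a core point.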
Termination: The process repeats until all points have been visited.
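
The clustering process above can be sketched from scratch in a few lines. This is a simplified illustration, not scikit-learn's implementation: the function name dbscan_sketch, the toy data, and the brute-force distance matrix are all assumptions made to keep the example short.

```python
from collections import deque

import numpy as np

NOISE, UNVISITED = -1, -2

def dbscan_sketch(X, eps, min_pts):
    """Return one cluster label per row of X; -1 marks noise (illustrative only)."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Indices within eps of each point, the point itself included.
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, UNVISITED)
    cluster_id = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighborhoods[i]) < min_pts:
            labels[i] = NOISE              # provisional: may become a border point
            continue
        labels[i] = cluster_id             # i is a core point: start a new cluster
        queue = deque(neighborhoods[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # former noise becomes a border point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            if len(neighborhoods[j]) >= min_pts:
                queue.extend(neighborhoods[j])   # only core points expand the cluster
        cluster_id += 1
    return labels

# Toy data (assumed): two tight groups and one isolated point.
X = np.array([[0, 0], [0, 0.5], [0.5, 0], [5, 5], [5, 5.5], [5.5, 5], [20, 20]])
labels = dbscan_sketch(X, eps=1.0, min_pts=3)
print(labels.tolist())
```

Note how a point first marked as noise can be relabeled once a later core point reaches it, which is why the noise label is only provisional during the scan.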

Key points to remember:
The epsilon value determines the size of the neighborhood.
The minimum samples value (not shown in the plot) determines how many points need to be in the epsilon neighborhood for a point to be considered a core point.
Clusters are formed by connecting core points that are within epsilon distance of each other, along with their border points.

In this example:
The minimum number of points (minPts) within the ε density area to form a cluster is 3.
Core points: Any point with at least 3 points within its ε-defined area.
Border points: Any point that lies within the ε area of a core point but does not itself reach the minPts threshold (3 in this case).
Noise points: Any point that does not reach the minPts threshold (3) and is not within reach of a core point. These points are considered outliers.
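
A similar setup with minPts = 3 can be reproduced in scikit-learn. The coordinates and ε = 1.1 below are assumptions chosen for illustration (they are not the points from the figure); the sketch separates core, border, and noise points using the fitted model's core_sample_indices_ and labels_ attributes.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (assumed): four evenly spaced points on a line plus one distant point.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [6.0, 0.0]])

db = DBSCAN(eps=1.1, min_samples=3).fit(X)
labels = db.labels_

# Boolean mask of core points, built from the indices scikit-learn reports.
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True

core_idx = np.flatnonzero(is_core)
border_idx = np.flatnonzero(~is_core & (labels != -1))  # clustered but not core
noise_idx = np.flatnonzero(labels == -1)

print("core:", core_idx.tolist())
print("border:", border_idx.tolist())
print("noise:", noise_idx.tolist())
```

The two middle points each have three points within ε (counting themselves), so they are core; the two end points of the chain are border points, and the distant point is noise.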

DBSCAN: Key Features

Noise Handling

DBSCAN labels points in low-density regions as noise instead of forcing them into a cluster, which makes it robust to noisy datasets.

Clusters of Arbitrary Shape

Unlike K-Means, DBSCAN can find clusters of varying shapes and sizes, adapting to the inherent structure of the data.
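
One way to see this difference is to run both algorithms on a dataset with non-convex clusters. The sketch below uses scikit-learn's make_moons generator and compares each result with the true grouping via the adjusted Rand index; the dataset and the parameter values (eps=0.2, min_samples=5) are assumptions picked for this toy example, not general recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are not convex blobs.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

K-Means splits the plane with a straight boundary between two centroids, so it tends to cut each moon in half, while DBSCAN follows the dense curve of each moon.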

When to Use DBSCAN?

DBSCAN is a good fit in the following scenarios:

Non-Globular Clusters: Ideal for datasets with clusters of arbitrary shapes.
Noise Presence: Effective in datasets with significant noise and outliers.
Unknown Number of Clusters: No need to specify the number of clusters a priori.

Pros and Cons

Pros:

Robust to Noise: Can effectively handle outliers.
No Need to Specify Clusters: Automatically determines the number of clusters based on the data.
Arbitrary Shape Clusters: Can find clusters of varying shapes and sizes.

Cons:

Parameter Sensitivity: The choice of ε and minPts can significantly affect the results.
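Scalability: May struggle with very large datasets. Without a spatial index, the neighborhood queries require on the order of n² distance computations for n points.
Variable Density Clusters: May have difficulty with clusters of varying density, because a single global ε cannot suit both dense and sparse clusters at the same time.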
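
The parameter sensitivity is easy to observe by sweeping ε while holding minPts fixed. This is a small sketch on an assumed synthetic dataset (make_moons); the ε values and the helper name count_clusters are illustrative choices.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

def count_clusters(labels):
    """Number of distinct cluster labels, ignoring the noise label -1."""
    return len(set(labels)) - (1 if -1 in labels else 0)

cluster_counts = {}
for eps in (0.01, 0.1, 0.2, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    cluster_counts[eps] = count_clusters(labels)
    n_noise = int((labels == -1).sum())
    print(f"eps={eps}: {cluster_counts[eps]} clusters, {n_noise} noise points")
```

A very small ε leaves every neighborhood too sparse, so all points end up as noise, while a very large ε merges everything into a single cluster. Useful values sit somewhere in between and depend on the scale of the data.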

Python Implementation

Here's a basic example of clustering with DBSCAN using the scikit-learn library:

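from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
import numpy as np

# Example data (an assumption for illustration; substitute your own feature matrix X)
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# Create the DBSCAN model
db = DBSCAN(eps=0.3, min_samples=10)

# Fit the model
db.fit(X)

# Extract the cluster labels (noise points are labeled -1)
labels = db.labels_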

# Identifying core samples
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print(f'Estimated number of clusters: {n_clusters_}')

Interpreting the Results

Key outputs from a DBSCAN model include:
