Machine Learning: A Concise Tutorial

Machine Learning - HDBSCAN Clustering

How HDBSCAN Clustering Works

HDBSCAN builds a hierarchy of clusters from a mutual-reachability graph: a graph in which each data point is a node and every pair of points is joined by an edge weighted by their mutual reachability distance. No distance threshold is applied when the graph is defined; density enters only through the edge weights.

The mutual reachability distance between two points a and b is the largest of three quantities: the core distance of a, the core distance of b, and the ordinary distance between a and b. The core distance of a point is its distance to its k-th nearest neighbor, where k is set by the min_samples parameter. Points in dense regions have small core distances, so they stay at their original distance from each other, while points in sparse regions are pushed at least their core distance away from every other point.
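As a rough illustration, the mutual reachability distance can be computed with NumPy using the common definition max(core(a), core(b), d(a, b)). The function name mutual_reachability, the parameter k, and the toy data are assumptions made for this sketch. Implementations also differ on whether a point counts as its own neighbor; here it does not.

```python
import numpy as np

def mutual_reachability(X, k=5):
    """Return the pairwise mutual reachability matrix for the rows of X.

    k plays the role of min_samples; the name and default are illustrative.
    """
    # Pairwise Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Core distance: distance to the k-th nearest neighbor.
    # Column 0 of each sorted row is the point itself (distance 0).
    core = np.sort(dist, axis=1)[:, k]

    # max(core(a), core(b), d(a, b)) for every pair (a, b).
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return mreach

# Three tightly packed points and one isolated point.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
M = mutual_reachability(X_toy, k=2)
print(np.round(M, 3))
```

The first two points are at distance 1 from each other, but their mutual reachability distance is larger because the second point's 2nd nearest neighbor is further away than that.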

The hierarchy of clusters is then extracted from the mutual-reachability graph using a minimum spanning tree (MST). The MST connects all data points using the set of edges with the smallest total weight. Sorting its edges by increasing weight and merging the components they join, one edge at a time, yields a single-linkage dendrogram: its leaves are the individual data points and its internal nodes are clusters of varying sizes and shapes.
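The MST step can be sketched in a few lines. The code below runs Prim's algorithm on a small dense weight matrix and then replays the MST edges from shortest to longest with a union-find structure, recording the sizes of the two groups joined at each merge. The matrix W and both function names are made up for illustration; this is not how the hdbscan library implements the step.

```python
import numpy as np

def minimum_spanning_tree(W):
    """Prim's algorithm on a dense symmetric weight matrix W.

    Returns a list of (weight, i, j) edges sorted by increasing weight.
    """
    n = W.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = W[0].astype(float)         # cheapest known edge from the tree to each node
    parent = np.zeros(n, dtype=int)   # tree endpoint of that cheapest edge
    edges = []
    for _ in range(n - 1):
        # Pick the cheapest node that is not yet in the tree.
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((float(best[j]), int(parent[j]), j))
        in_tree[j] = True
        # Node j may offer cheaper connections to the remaining nodes.
        closer = (W[j] < best) & ~in_tree
        best[closer] = W[j][closer]
        parent[closer] = j
    return sorted(edges)

def single_linkage_merges(edges, n):
    """Replay sorted MST edges; each merge is one internal dendrogram node.

    Returns (weight, size_of_first_group, size_of_second_group) per merge.
    """
    root = list(range(n))
    size = [1] * n

    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]   # path compression
            a = root[a]
        return a

    merges = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        merges.append((w, size[ri], size[rj]))
        root[ri] = rj
        size[rj] += size[ri]
    return merges

# Hypothetical mutual reachability distances for four points.
W = np.array([
    [0.0, 1.0, 2.0, 9.0],
    [1.0, 0.0, 1.5, 8.0],
    [2.0, 1.5, 0.0, 7.0],
    [9.0, 8.0, 7.0, 0.0],
])
edges = minimum_spanning_tree(W)
merges = single_linkage_merges(edges, n=4)
print(edges)
print(merges)
```

The last merge joins a group of three points with a single point at a much larger distance, which is the pattern the later steps interpret as an outlier attaching to a cluster.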

HDBSCAN then condenses this dendrogram into a condensed tree. Walking down the hierarchy, a split counts as a true split only if both sides contain at least min_cluster_size points; otherwise the smaller side is treated as points falling out of the parent cluster, which keeps its identity. The final clusters are selected from the condensed tree not by a single horizontal cut but by a heuristic based on cluster stability, which prefers clusters that persist over a wide range of density levels.
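The stability heuristic can be written out explicitly. The formulation below follows the one commonly given for HDBSCAN, with density level measured as the inverse of distance; the symbols are chosen for this sketch. For a cluster C, the birth level is the density level at which C split off from its parent, and for each point p the level attached to it is the one at which p leaves C, either by falling out or because C itself splits.

```latex
\lambda = \frac{1}{\text{distance}}, \qquad
S(C) = \sum_{p \in C} \bigl( \lambda_p - \lambda_{\text{birth}}(C) \bigr)

% Worked example: a cluster born at \lambda_{\text{birth}} = 0.5 whose three
% points leave at \lambda_p = 1,\ 1,\ 2:
S(C) = (1 - 0.5) + (1 - 0.5) + (2 - 0.5) = 2.5
```

Selection then proceeds from the leaves upward: if the summed stability of a cluster's children exceeds its own stability, the children are kept and the parent inherits that sum; otherwise the parent is selected and its descendants are dropped.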

Implementation in Python

HDBSCAN is available as a Python library, hdbscan, that can be installed using pip. The library provides an implementation of the HDBSCAN algorithm along with utilities for inspecting and visualizing the results.

Installation

To install HDBSCAN, open a terminal window and type the following command −

pip install hdbscan

Usage

To use HDBSCAN, first import the hdbscan library −

import hdbscan

Next, we generate a sample dataset using the make_blobs() function from scikit-learn −

from sklearn.datasets import make_blobs

# generate a random dataset with 1000 samples and 3 clusters
X, y = make_blobs(n_samples=1000, centers=3, random_state=42)

Now, create an instance of the HDBSCAN class and fit it to the data −

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric='euclidean')

# fit the clusterer to the data
clusterer.fit(X)

This applies HDBSCAN to the dataset and assigns each point a cluster label; points that belong to no cluster are labeled -1 (noise). To visualize the clustering results, plot the data and color each point according to its cluster label −

import numpy as np
import matplotlib.pyplot as plt

# get the cluster labels (noise points are labeled -1)
labels = clusterer.labels_

# create a color lookup, repeated so there are enough entries for every cluster
colors = np.array(list('bgrcmyk') * 80)

# plot the data with each point colored according to its cluster label;
# a label of -1 indexes the last entry, so noise points are drawn in black ('k')
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=colors[labels])
plt.show()

This code will produce a scatter plot of the data with each point colored according to its cluster label as follows −

[Figure: scatter plot of the HDBSCAN clustering result]
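Alongside the plot, it is often useful to count how many clusters were found and how many points were marked as noise. The helper below is a small sketch; the name summarize_labels and the example array are hypothetical, and the array stands in for clusterer.labels_ so the snippet runs on its own.

```python
import numpy as np

def summarize_labels(labels):
    """Count clusters and noise points in an array of cluster labels.

    Assumes the convention that noise points carry the label -1.
    """
    labels = np.asarray(labels)
    cluster_ids, counts = np.unique(labels[labels >= 0], return_counts=True)
    return {
        "n_clusters": len(cluster_ids),
        "n_noise": int((labels == -1).sum()),
        "sizes": dict(zip(cluster_ids.tolist(), counts.tolist())),
    }

# Hypothetical labels standing in for clusterer.labels_
example_labels = [0, 0, 1, -1, 1, 1, 2, -1]
summary = summarize_labels(example_labels)
print(summary)
```

With a fitted clusterer, the same function can be called as summarize_labels(clusterer.labels_).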

HDBSCAN also provides several parameters that can be adjusted to fine-tune the clustering results −

  1. min_cluster_size − The smallest group of points that is treated as a cluster. Points that do not end up in any cluster are labeled as noise.

  2. min_samples − The number of samples in a neighborhood for a point to be considered a core point. Larger values make the clustering more conservative, so more points are declared noise.

  3. cluster_selection_epsilon − A distance threshold for cluster selection. Clusters below this threshold are merged rather than split any further.

  4. metric − The distance metric used to measure the distance between points, for example 'euclidean'.
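The four parameters above can be passed together as keyword arguments. The values below are illustrative only, and build_clusterer is a hypothetical helper written for this sketch; suitable settings depend on the dataset.

```python
# Illustrative values only; good settings depend on the dataset.
params = {
    "min_cluster_size": 15,
    "min_samples": 5,
    "cluster_selection_epsilon": 0.5,
    "metric": "euclidean",
}

def build_clusterer(params):
    """Create an HDBSCAN instance from a dict of keyword arguments."""
    # Imported here so the dict above can be inspected without the library.
    import hdbscan
    return hdbscan.HDBSCAN(**params)

# Usage, assuming X is the array generated earlier:
# labels = build_clusterer(params).fit_predict(X)
```

Keeping the settings in one dict makes it easy to vary a single parameter at a time and compare the resulting labels.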

Advantages of HDBSCAN Clustering

HDBSCAN has several advantages over other clustering algorithms −

  1. Better handling of clusters of varying densities − HDBSCAN can identify clusters of different densities in the same dataset, which algorithms with a single global density threshold, such as DBSCAN, struggle with.

  2. Ability to detect clusters of different shapes and sizes − HDBSCAN can identify clusters that are not spherical, which centroid-based algorithms such as k-means tend to handle poorly.

  3. No need to specify the number of clusters − HDBSCAN does not require the user to specify the number of clusters, which can be difficult to determine a priori.

  4. Robust to noise − HDBSCAN is robust to noisy data and can identify outliers as noise points.