Machine Learning 简明教程

Machine Learning - Density-Based Clustering

基于密度的聚类基于这样的思想:聚类是高密度区域,由低密度区域隔开。

Density-based clustering is based on the idea that clusters are regions of high density separated by regions of low density.

  1. The algorithm works by first identifying "core" data points, which are data points that have a minimum number of neighbors within a specified distance. These core data points form the center of a cluster.

  2. Next, the algorithm identifies "border" data points, which are data points that are not core data points but have at least one core data point as a neighbor.

  3. Finally, the algorithm identifies "noise" data points, which are data points that are not core data points or border data points.

以下是最常见的基于密度的聚类算法——

Here are the most common density-based clustering algorithms −

DBSCAN Clustering

DBSCAN (基于密度的噪声应用空间聚类)算法是最常见的基于密度的聚类算法之一。DBSCAN算法需要两个参数:最小邻居数(minPts)和核心数据点之间的最大距离(eps)。

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is one of the most common density-based clustering algorithms. The DBSCAN algorithm requires two parameters: the minimum number of neighbors (minPts) and the maximum distance between core data points (eps).

OPTICS Clustering

OPTICS (对点进行排序以识别聚类结构)是一种基于密度的聚类算法,通过建立数据集的可达性图来进行操作。可达性图是有向图,它将数据点连接到指定距离阈值内的最近邻居。可达性图中的边根据连接数据点之间的距离进行加权。然后该算法基于指定的密度阈值,通过递归分割可达性图将可达性图构造为分层的聚类结构。

OPTICS (Ordering Points to Identify the Clustering Structure) is a density-based clustering algorithm that operates by building a reachability graph of the dataset. The reachability graph is a directed graph that connects each data point to its nearest neighbors within a specified distance threshold. The edges in the reachability graph are weighted according to the distance between the connected data points. The algorithm then constructs a hierarchical clustering structure by recursively splitting the reachability graph into clusters based on a specified density threshold.

HDBSCAN Clustering

HDBSCAN (分层基于密度的噪声应用空间聚类)是基于密度聚类的聚类算法。这是一种较新的算法,它建立在流行的DBSCAN算法之上,并具有许多优势,例如更好地处理不同密度的聚类,以及检测不同形状和大小的聚类的能力。

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that is based on density clustering. It is a newer algorithm that builds upon the popular DBSCAN algorithm and offers several advantages over it, such as better handling of clusters of varying densities and the ability to detect clusters of different shapes and sizes.

在接下来的三章中,我们将详细讨论这三个基于密度的聚类算法,以及它们在Python中的实现。

In the next three chapters, we will discuss all the three density-based clustering algorithms in detail along with their implementation in Python.