Scikit Learn Tutorial

Scikit Learn - Clustering Methods

Here, we will study the clustering methods in Sklearn, which help in identifying similarities among data samples.

Clustering methods are among the most useful unsupervised ML methods, used to find similarity and relationship patterns among data samples. They then cluster those samples into groups that share similar features. Clustering determines the intrinsic grouping among the present unlabeled data, which is why it is important.

The Scikit-learn library has sklearn.cluster to perform clustering of unlabeled data. Under this module, scikit-learn provides the following clustering methods −

KMeans

This algorithm computes the centroids and iterates until it finds the optimal centroids. It requires the number of clusters to be specified, which is why it assumes that this number is already known. The main logic of this algorithm is to cluster the data by separating the samples into n groups of equal variance, minimizing a criterion known as inertia. The number of clusters identified by the algorithm is represented by 'K'.

Scikit-learn provides the sklearn.cluster.KMeans module to perform K-Means clustering. While computing cluster centers and the value of inertia, the parameter named sample_weight allows the sklearn.cluster.KMeans module to assign more weight to some samples.
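
As a minimal sketch of this, the snippet below (with made-up toy data and an arbitrary weight vector) fits KMeans on a small 2-D array and passes sample_weight to fit so that one sample pulls the cluster centers more strongly −

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose groups of points (illustrative values only)
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Weight the last sample more heavily when computing centers and inertia
weights = np.array([1, 1, 1, 1, 1, 5])

kmeans = KMeans(n_clusters = 2, random_state = 0).fit(X, sample_weight = weights)
print(kmeans.labels_)            # cluster label assigned to each sample
print(kmeans.cluster_centers_)   # centers are pulled toward heavily weighted samples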

Affinity Propagation

This algorithm is based on the concept of 'message passing' between different pairs of samples until convergence. It does not require the number of clusters to be specified before running the algorithm. The algorithm has a time complexity of the order O(N²T), which is its biggest disadvantage.

Scikit-learn provides the sklearn.cluster.AffinityPropagation module to perform Affinity Propagation clustering.
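
A minimal sketch, using synthetic blobs from make_blobs and an illustrative damping value, might look like the following; note that the number of clusters is never passed in −

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Small synthetic dataset; the number of clusters is not specified anywhere
X, _ = make_blobs(n_samples = 60, centers = 3, random_state = 0)

af = AffinityPropagation(damping = 0.9, random_state = 0).fit(X)
print(len(af.cluster_centers_indices_))   # number of clusters found by the algorithm
print(af.labels_[:10])                    # cluster labels of the first ten samples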

Mean Shift

This algorithm mainly discovers blobs in a smooth density of samples. It assigns data points to clusters iteratively by shifting points towards the highest density of data points. Instead of requiring the number of clusters to be specified, it sets it automatically, relying on a parameter named bandwidth that dictates the size of the region to search through.

Scikit-learn provides the sklearn.cluster.MeanShift module to perform Mean Shift clustering.
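
A minimal sketch on synthetic data could look as follows; the quantile used for estimate_bandwidth is an arbitrary illustrative choice −

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples = 200, centers = 3, random_state = 0)

# bandwidth dictates the size of the region to search through;
# it can also be estimated from the data as done here
bandwidth = estimate_bandwidth(X, quantile = 0.2)

ms = MeanShift(bandwidth = bandwidth).fit(X)
print(len(ms.cluster_centers_))   # the number of clusters is set automatically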

Spectral Clustering

Before clustering, this algorithm uses the eigenvalues, i.e. the spectrum, of the similarity matrix of the data to perform dimensionality reduction to fewer dimensions. The use of this algorithm is not advisable when there is a large number of clusters.

Scikit-learn provides the sklearn.cluster.SpectralClustering module to perform Spectral clustering.
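
The sketch below, on synthetic blobs, shows one possible usage; the nearest_neighbors affinity is just one illustrative way to build the similarity matrix −

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples = 150, centers = 3, random_state = 0)

# affinity = 'nearest_neighbors' builds the similarity matrix from a k-NN graph
sc = SpectralClustering(n_clusters = 3, affinity = 'nearest_neighbors', random_state = 0)
labels = sc.fit_predict(X)
print(labels[:10])   # cluster labels of the first ten samples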

Hierarchical Clustering

This algorithm builds nested clusters by merging or splitting clusters successively. This cluster hierarchy is represented as a dendrogram, i.e. a tree. It falls into the following two categories −

Agglomerative hierarchical algorithms − In this kind of hierarchical algorithm, every data point is initially treated as a single cluster. Pairs of clusters are then successively agglomerated. This uses the bottom-up approach.

Divisive hierarchical algorithms − In this hierarchical algorithm, all data points start as one big cluster, which is then divided into smaller clusters using a top-down approach.

Scikit-learn provides the sklearn.cluster.AgglomerativeClustering module to perform Agglomerative Hierarchical clustering.
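
A minimal sketch on synthetic data might look like this; ward linkage and three clusters are illustrative choices −

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples = 50, centers = 3, random_state = 0)

# Either n_clusters or distance_threshold can be given, but not both
agg = AgglomerativeClustering(n_clusters = 3, linkage = 'ward')
labels = agg.fit_predict(X)
print(labels)   # cluster label assigned to each sample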

DBSCAN

It stands for "Density-Based Spatial Clustering of Applications with Noise". This algorithm is based on the intuitive notion of 'clusters' and 'noise': clusters are dense regions in the data space, separated by regions of lower data-point density.

Scikit-learn provides the sklearn.cluster.DBSCAN module to perform DBSCAN clustering. This algorithm uses two important parameters, namely min_samples and eps, to define what counts as dense.

A higher value of the parameter min_samples or a lower value of the parameter eps indicates that a higher density of data points is required to form a cluster.
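
The sketch below, on synthetic blobs with made-up eps and min_samples values, illustrates this −

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples = 200, centers = 3, cluster_std = 0.5, random_state = 0)

# eps is the neighborhood radius; min_samples is the number of points
# needed inside that radius to form a dense region
db = DBSCAN(eps = 0.5, min_samples = 5).fit(X)
print(set(db.labels_))   # the label -1 marks samples treated as noise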

OPTICS

It stands for "Ordering Points To Identify the Clustering Structure". This algorithm also finds density-based clusters in spatial data. Its basic working logic is like that of DBSCAN.

It addresses a major weakness of the DBSCAN algorithm, the problem of detecting meaningful clusters in data of varying density, by ordering the points of the database in such a way that spatially closest points become neighbors in the ordering.

Scikit-learn provides the sklearn.cluster.OPTICS module to perform OPTICS clustering.
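
A minimal sketch on synthetic data, with an illustrative min_samples value, might look like this −

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples = 200, centers = 3, random_state = 0)

# min_samples plays the same role as in DBSCAN; no fixed eps is needed up front
opt = OPTICS(min_samples = 10).fit(X)
print(set(opt.labels_))   # -1 again marks points treated as noise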

BIRCH

It stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is used to perform hierarchical clustering over large data sets. It builds a tree named CFT, i.e. Clustering Feature Tree, for the given data.

The advantage of the CFT is that the data nodes, called CF (Clustering Feature) nodes, hold the information necessary for clustering, which avoids the need to hold the entire input data in memory.

Scikit-learn provides the sklearn.cluster.Birch module to perform BIRCH clustering.
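
A minimal sketch on synthetic data, with illustrative threshold and branching_factor values, might look like this −

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples = 500, centers = 3, random_state = 0)

# threshold and branching_factor control how the CF tree is built
brc = Birch(threshold = 0.5, branching_factor = 50, n_clusters = 3).fit(X)
print(brc.predict(X[:5]))   # cluster labels for the first five samples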

Comparing Clustering Algorithms

The following table gives a comparison (based on parameters, scalability and metric) of the clustering algorithms in scikit-learn.

| Sr.No | Algorithm Name | Parameters | Scalability | Metric Used |
| --- | --- | --- | --- | --- |
| 1 | K-Means | No. of clusters | Very large n_samples | The distance between points |
| 2 | Affinity Propagation | Damping | Not scalable with n_samples | Graph distance |
| 3 | Mean-Shift | Bandwidth | Not scalable with n_samples | The distance between points |
| 4 | Spectral Clustering | No. of clusters | Medium scalability with n_samples, small scalability with n_clusters | Graph distance |
| 5 | Hierarchical Clustering | Distance threshold or no. of clusters | Large n_samples, large n_clusters | The distance between points |
| 6 | DBSCAN | Size of neighborhood | Very large n_samples, medium n_clusters | Nearest point distance |
| 7 | OPTICS | Minimum cluster membership | Very large n_samples, large n_clusters | The distance between points |
| 8 | BIRCH | Threshold, branching factor | Large n_samples, large n_clusters | The Euclidean distance between points |

K-Means Clustering on the Scikit-learn Digits Dataset

In this example, we will apply K-Means clustering to the digits dataset. The algorithm will identify similar digits without using the original label information. The implementation is done in a Jupyter notebook.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Load the digits dataset and inspect its shape
digits = load_digits()
digits.data.shape

Output

(1797, 64)

This output shows that the digits dataset has 1797 samples with 64 features each.

Example

Now, perform the K-Means clustering as follows −

kmeans = KMeans(n_clusters = 10, random_state = 0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape

Output

(10, 64)

This output shows that K-Means clustering created 10 cluster centers in the 64-dimensional feature space.

Example

fig, ax = plt.subplots(2, 5, figsize = (8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks = [], yticks = [])
   axi.imshow(center, interpolation = 'nearest', cmap = plt.cm.binary)

Output

The output below contains images showing the cluster centers learned by K-Means clustering.

[Image: cluster centers learned by K-Means]

Next, the Python script below will match the learned cluster labels (from K-Means) with the true labels found in them −

from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
   mask = (clusters == i)
   # Label each cluster with the most common true digit it contains
   labels[mask] = mode(digits.target[mask])[0]

We can also check the accuracy with the help of the command mentioned below.

from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)

Output

0.7935447968836951

Complete Implementation Example

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
kmeans = KMeans(n_clusters = 10, random_state = 0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
fig, ax = plt.subplots(2, 5, figsize = (8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks=[], yticks = [])
   axi.imshow(center, interpolation = 'nearest', cmap = plt.cm.binary)
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
   mask = (clusters == i)
   labels[mask] = mode(digits.target[mask])[0]
from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)