Machine Learning: A Concise Tutorial
Machine Learning - Agglomerative Clustering
Agglomerative clustering is a hierarchical clustering algorithm that starts with each data point as its own cluster and iteratively merges the two closest clusters until a stopping criterion is reached, such as a target number of clusters. It is a bottom-up approach whose merge history can be drawn as a dendrogram, a tree-like diagram that shows the hierarchical relationship between the clusters. In Python, the algorithm is available in the scikit-learn library, and SciPy provides functions for building and plotting the dendrogram.
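To make the merge loop concrete, here is a minimal from-scratch sketch. It is not how scikit-learn or SciPy implement the algorithm: the function name naive_agglomerative and the toy points are hypothetical, and it assumes single linkage (the distance between two clusters is the smallest distance between any two of their members) with a target cluster count as the stopping criterion.

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    points = np.asarray(points, dtype=float)
    # Start with every point in its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest pairwise distance between members.
                diffs = points[clusters[a]][:, None, :] - points[clusters[b]][None, :, :]
                d = np.sqrt((diffs ** 2).sum(axis=-1)).min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a; b > a, so index a stays valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

toy = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
result = naive_agglomerative(toy, n_clusters=3)
print(result)
```

The nested loops recompute every cluster-to-cluster distance on each pass, which is fine for a handful of points but far slower than library implementations.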
Implementation in Python
We will use the iris dataset for demonstration. The first step is to import the necessary libraries and load the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
iris = load_iris()
X = iris.data
y = iris.target
The next step is to build a linkage matrix, which records the merge history: each row lists the two clusters that were merged, the distance between them, and the number of points in the newly formed cluster. We can use the linkage function from the scipy.cluster.hierarchy module to create it.
Z = linkage(X, 'ward')
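As a quick check on what linkage returns, the sketch below inspects the matrix for the iris data. The variable names (X_demo, Z_demo, first_merge, last_merge) are illustrative only.

```python
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage

X_demo = load_iris().data            # 150 samples, 4 features
Z_demo = linkage(X_demo, "ward")

# One row per merge: n samples always need n - 1 merges.
print(Z_demo.shape)                  # (149, 4)

# Columns: index of the first cluster, index of the second cluster,
# merge distance, number of samples in the newly formed cluster.
first_merge = Z_demo[0]
last_merge = Z_demo[-1]
print(first_merge)
print(last_merge)
```

Indices below 150 refer to original samples; larger indices refer to clusters created by earlier rows. The last row joins everything, so its sample count equals the size of the dataset.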
The 'ward' method determines which pair of clusters is merged at each step. It picks the pair whose merge gives the smallest increase in total within-cluster variance.
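The Ward criterion can be illustrated on two small hand-made clusters. This sketch measures within-cluster variance as a sum of squared distances to the centroid and compares the directly computed increase with the standard closed form. The helper within_ss and the toy clusters are hypothetical names chosen for the example.

```python
import numpy as np

def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [5.0, 0.0], [6.0, 0.0]])

# Increase in within-cluster sum of squares caused by merging a and b.
merged = np.vstack([cluster_a, cluster_b])
increase_direct = within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)

# Closed form: |a||b| / (|a| + |b|) times the squared distance between centroids.
n_a, n_b = len(cluster_a), len(cluster_b)
gap = cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
increase_formula = n_a * n_b / (n_a + n_b) * float(gap @ gap)

print(increase_direct, increase_formula)
```

Both values agree up to floating-point error. At each step, Ward linkage merges the pair of clusters for which this increase is smallest.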
We can visualize the dendrogram using the dendrogram function from the same module.
plt.figure(figsize=(7.5, 3.5))
plt.title("Iris Dendrogram")
dendrogram(Z)
plt.show()
The resulting dendrogram (see the following plot) shows the hierarchical relationship between the clusters. The algorithm merges the closest clusters first, so the merge distance increases as we move up the tree.
[Figure: agglomerative clustering (iris dendrogram)]
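A dendrogram can also be cut into flat clusters directly from the linkage matrix. The sketch below uses SciPy's fcluster with the maxclust criterion, which is one of several ways to cut the tree. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

X_cut = load_iris().data
Z_cut = linkage(X_cut, "ward")

# Cut the tree so that at most 3 flat clusters remain.
flat_labels = fcluster(Z_cut, t=3, criterion="maxclust")

# fcluster numbers clusters from 1, unlike scikit-learn's 0-based labels.
cluster_ids, cluster_sizes = np.unique(flat_labels, return_counts=True)
print(cluster_ids, cluster_sizes)
```

Changing t extracts a different number of clusters from the same hierarchy without recomputing the linkage.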
The final step is to apply the clustering algorithm and extract the cluster labels. We can use the AgglomerativeClustering class from the sklearn.cluster module to apply the algorithm.
model = AgglomerativeClustering(n_clusters=3)
model.fit(X)
labels = model.labels_
The n_clusters parameter specifies the number of clusters to extract from the data. Here we set n_clusters=3 because we know that the iris dataset has three classes. AgglomerativeClustering uses Ward linkage by default, which matches the linkage method used to build the dendrogram.
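When the number of clusters is not known in advance, scikit-learn also accepts a distance_threshold in place of n_clusters. The sketch below picks an illustrative threshold from the SciPy linkage heights; in practice the cut height would usually be read off the dendrogram. The variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage

X_thr = load_iris().data
Z_thr = linkage(X_thr, "ward")

# Illustrative threshold: halfway between the merge that leaves three
# clusters and the merge that would reduce them to two.
threshold = (Z_thr[-3, 2] + Z_thr[-2, 2]) / 2

# n_clusters must be None when distance_threshold is given.
model_thr = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold)
model_thr.fit(X_thr)
print(model_thr.n_clusters_)
```

Merges at or above the threshold are not performed, so the number of clusters found is reported in n_clusters_ after fitting.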
We can visualize the resulting clusters using a scatter plot.
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Agglomerative Clustering Results")
plt.show()
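The resulting plot shows the three clusters identified by the algorithm, projected onto the first two features (sepal length and sepal width). Clustering is unsupervised, so the cluster numbers are arbitrary and need not match the class labels in y. How closely the clusters follow the true classes has to be checked against y rather than read off the plot.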
[Figure: agglomerative clustering results]
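One way to quantify agreement between the clusters and the known iris classes is the adjusted Rand index, which ignores how the clusters happen to be numbered. This is a minimal sketch; the choice of metric and the variable names are assumptions, not part of the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

iris_eval = load_iris()
pred = AgglomerativeClustering(n_clusters=3).fit_predict(iris_eval.data)

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(iris_eval.target, pred)
print(f"Adjusted Rand index: {ari:.3f}")
```

A score below 1.0 indicates that some samples are grouped differently from their true classes, even if the scatter plot looks well separated.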
Example
Here is the complete implementation of Agglomerative Clustering in Python:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
Z = linkage(X, 'ward')
# Plot the dendrogram
plt.figure(figsize=(7.5, 3.5))
plt.title("Iris Dendrogram")
dendrogram(Z)
plt.show()
# create an instance of the AgglomerativeClustering class
model = AgglomerativeClustering(n_clusters=3)
# fit the model to the dataset
model.fit(X)
labels = model.labels_
# Plot the results
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Agglomerative Clustering Results")
plt.show()
Advantages of Agglomerative Clustering
Following are the advantages of using Agglomerative Clustering:
- Produces a dendrogram that shows the hierarchical relationship between the clusters.
- Supports different distance metrics and linkage methods.
- Does not fix the number of clusters in advance: the same hierarchy can be cut at different levels to extract different numbers of clusters.
- Can scale to larger datasets when merges are restricted by a connectivity constraint, as supported by scikit-learn's connectivity parameter.
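To see how the linkage method affects the result, the sketch below fits the same model with each linkage option that scikit-learn supports and reports the cluster sizes. The dictionary name sizes_by_linkage is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

X_cmp = load_iris().data
sizes_by_linkage = {}
for method in ("ward", "complete", "average", "single"):
    labels_cmp = AgglomerativeClustering(n_clusters=3, linkage=method).fit_predict(X_cmp)
    # Count samples per cluster and sort so the largest cluster comes first.
    sizes_by_linkage[method] = sorted(np.bincount(labels_cmp).tolist(), reverse=True)
    print(method, sizes_by_linkage[method])
```

Comparing the sizes gives a rough sense of how balanced the clusters are under each linkage method.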
Disadvantages of Agglomerative Clustering
Following are some of the disadvantages of using Agglomerative Clustering:
- Can be computationally expensive for large datasets, since without connectivity constraints the algorithm works from the distances between all pairs of points.
- Can produce imbalanced clusters if the distance metric or linkage method is not appropriate for the data.
- The final result may be sensitive to the choice of distance metric and linkage method.
- The dendrogram may be difficult to interpret for large datasets with many clusters.
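For the last point, one common mitigation is to truncate the dendrogram so that only the top of the tree is shown. The sketch below uses SciPy's truncate_mode="lastp" option; the value p=10 is an arbitrary choice, and no_plot=True is used here only to inspect the result without opening a figure.

```python
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage

Z_big = linkage(load_iris().data, "ward")

# Keep only the last p merged clusters as leaves; lower levels are collapsed.
summary = dendrogram(Z_big, truncate_mode="lastp", p=10, no_plot=True)

# Leaf labels of the truncated tree; collapsed leaves show their sample counts.
print(summary["ivl"])
```

Dropping no_plot=True draws the truncated tree with matplotlib in the same way as the full dendrogram.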