Machine Learning With Python: A Concise Tutorial
Clustering Algorithms - Hierarchical Clustering
Introduction to Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm, used to group unlabeled data points that have similar characteristics. Hierarchical clustering algorithms fall into the following two categories −
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point starts as its own cluster, and pairs of clusters are then successively merged, or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram, or tree structure.
Divisive hierarchical algorithms − In divisive hierarchical algorithms, on the other hand, all the data points start in one big cluster, and the clustering process consists of dividing that big cluster (top-down approach) into smaller clusters.
Steps to Perform Agglomerative Hierarchical Clustering
We are going to explain the most widely used and most important form of hierarchical clustering, i.e. agglomerative clustering. The steps to perform it are as follows −
- Step 1 − Treat each data point as a single cluster. If there are K data points, we therefore start with K clusters.
- Step 2 − Form a bigger cluster by joining the two closest data points. This leaves a total of K-1 clusters.
- Step 3 − Form another bigger cluster by joining the two closest clusters. This leaves a total of K-2 clusters.
- Step 4 − Repeat Step 3 until only one big cluster remains, i.e. there are no more clusters left to join.
- Step 5 − Finally, once the single big cluster has been formed, use the dendrogram to divide it into multiple clusters, depending on the problem.
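The merging loop in the steps above can be sketched in plain Python. This is only a minimal illustration, not the implementation used by SciPy or scikit-learn: the function names `single_link_distance` and `agglomerate` are hypothetical, and the sketch assumes single linkage (the distance between two clusters is the distance between their closest pair of points).

```python
from itertools import combinations
from math import dist


def single_link_distance(a, b):
    """Smallest pairwise Euclidean distance between two clusters (assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points):
    """Merge the two closest clusters until one remains; return a log of the merges."""
    clusters = [[tuple(p)] for p in points]  # Step 1: K singleton clusters
    merges = []
    while len(clusters) > 1:                 # Step 4: stop when one cluster is left
        # Steps 2 and 3: find the indices of the closest pair of clusters
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        distance = single_link_distance(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((distance, len(clusters)))
    return merges


points = [(7, 8), (12, 20), (17, 19), (26, 15), (32, 37)]
merge_log = agglomerate(points)
for distance, remaining in merge_log:
    print(f"merged at distance {distance:.2f}, {remaining} cluster(s) remain")
```

Each pass removes exactly one cluster, so K points always need K-1 merges. The recorded merge distances are the heights at which a dendrogram would draw its horizontal joins.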
Role of Dendrograms in Agglomerative Hierarchical Clustering
As discussed in the last step, the role of the dendrogram starts once the big cluster is formed. The dendrogram is used to split the big cluster into multiple clusters of related data points, depending on our problem. This can be understood with the help of the following example −
Example 1
To understand this, let us start by importing the required libraries as follows −
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Next, we plot the data points chosen for this example −
# Ten sample points, numbered 1 to 10 in the plot
X = np.array([[7,8],[12,20],[17,19],[26,15],[32,37],[87,75],[73,85],[62,80],[73,60],[87,96]])
labels = range(1, 11)
plt.figure(figsize=(10, 7))
plt.subplots_adjust(bottom=0.1)
plt.scatter(X[:,0], X[:,1], label='True Position')
# Annotate each point with its number, offset slightly from the marker
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(label, xy=(x, y), xytext=(-3, 3), textcoords='offset points', ha='right', va='bottom')
plt.show()

From the above diagram, it is easy to see that our data points form two clusters, but real-world data can contain thousands of clusters. Next, we plot the dendrogram of our data points by using the SciPy library −
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
# Single linkage: cluster distance is the distance between the closest pair of points
linked = linkage(X, 'single')
labelList = list(range(1, 11))
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', labels=labelList, distance_sort='descending', show_leaf_counts=True)
plt.show()

Now, once the big cluster is formed, we select the longest vertical distance in the dendrogram that is not crossed by any horizontal merge line. A horizontal line is then drawn through it, as shown in the following diagram. As this horizontal line crosses the blue vertical lines at two points, the number of clusters is two.
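The same cut can be sketched numerically rather than read off the plot. The snippet below is a rough illustration, assuming the ten sample points and single linkage from this example: it looks for the largest gap between successive merge distances and cuts the tree halfway across that gap with SciPy's `fcluster`. The variable names (`heights`, `gaps`, `threshold`, `flat_labels`) are chosen here for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# The ten sample points from this example
X = np.array([[7, 8], [12, 20], [17, 19], [26, 15], [32, 37],
              [87, 75], [73, 85], [62, 80], [73, 60], [87, 96]])

Z = linkage(X, 'single')

# Column 2 of the linkage matrix holds the distance at which each merge happened
heights = Z[:, 2]

# The longest vertical stretch of the dendrogram is the largest gap between
# two consecutive merge distances
gaps = np.diff(heights)
k = int(np.argmax(gaps))

# Place the horizontal cut in the middle of that gap
threshold = (heights[k] + heights[k + 1]) / 2

# Points still joined below the cut share a flat cluster label
flat_labels = fcluster(Z, t=threshold, criterion='distance')
n_clusters = len(set(flat_labels))

print("cut height:", round(float(threshold), 2))
print("number of clusters:", n_clusters)
print("labels:", flat_labels)
```

Picking the largest gap is a heuristic, not a rule; on noisier data the gaps may be less clear-cut and the choice of the cut height depends on the problem.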

Next, we need to import the class for clustering and call its fit_predict method to predict the cluster of each data point. We import the AgglomerativeClustering class from the sklearn.cluster module −
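from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so no distance argument is needed;
# the older affinity='euclidean' argument is rejected by recent scikit-learn releases.
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
cluster.fit_predict(X)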
Next, plot the clusters with the help of the following code −
plt.scatter(X[:,0],X[:,1], c=cluster.labels_, cmap='rainbow')

The above diagram shows the two clusters found in our data points.
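To see the result without a plot, the labels returned by fit_predict can be grouped by cluster. The following is a small sketch, assuming the same ten points; the names `model`, `predicted` and `groups` are illustrative, and the `affinity` argument is left out because Ward linkage uses Euclidean distance by default.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# The ten sample points from Example 1
X = np.array([[7, 8], [12, 20], [17, 19], [26, 15], [32, 37],
              [87, 75], [73, 85], [62, 80], [73, 60], [87, 96]])

model = AgglomerativeClustering(n_clusters=2, linkage='ward')

# fit_predict fits the model and returns one cluster label per data point;
# the same labels are also stored on the fitted model as labels_
predicted = model.fit_predict(X)

# Collect the 1-based point numbers (as used in the plots) under each label
groups = {}
for point_number, label in enumerate(predicted, start=1):
    groups.setdefault(int(label), []).append(point_number)

for label, members in sorted(groups.items()):
    print(f"cluster {label}: points {members}")
```

The label values themselves (0 or 1) are arbitrary identifiers; only the grouping of the points is meaningful.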
Example 2
Now that we understand the concept of dendrograms from the simple example above, let us move on to another example, in which we use hierarchical clustering to create clusters of the data points in the Pima Indian Diabetes dataset −
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from pandas import read_csv
# Load the CSV file, which has no header row, and assign the column names
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
# The first eight columns are the features, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
data.shape
(768, 9)
data.head()
slno. | preg | plas | pres | skin | test | mass | pedi | age | class
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1
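# Use the 'skin' and 'test' columns (positions 3 and 4) as the two clustering features
patient_data = data.iloc[:, 3:5].values
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Patient Dendrograms")
# Build the dendrogram from the same two features that are clustered afterwards
dend = shc.dendrogram(shc.linkage(patient_data, method='ward'))
plt.show()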

from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance; the older affinity argument is omitted
cluster = AgglomerativeClustering(n_clusters=4, linkage='ward')
cluster.fit_predict(patient_data)
plt.figure(figsize=(10, 7))
plt.scatter(patient_data[:,0], patient_data[:,1], c=cluster.labels_, cmap='rainbow')
plt.show()
