Machine Learning With Python Tutorial

Clustering Algorithms - K-means Algorithm

Introduction to K-Means Algorithm

The K-Means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known; for this reason it is also called a flat clustering algorithm. The number of clusters the algorithm identifies in the data is represented by 'K' in K-Means.

In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and the centroids is minimized. Intuitively, less variation within a cluster means the data points in that cluster are more similar to one another.
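Formally, K-Means minimizes the within-cluster sum of squares, the quantity scikit-learn exposes as the inertia_ attribute −

J = Σ (i = 1..K) Σ (x ∈ Ci) ‖x − μi‖²

where Ci is the set of data points assigned to cluster i and μi is the centroid of that cluster.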

Working of K-Means Algorithm

We can understand the working of the K-Means clustering algorithm with the help of the following steps −

  1. Step 1 − First, we need to specify the number of clusters, K, to be generated by this algorithm.

  2. Step 2 − Next, randomly select K data points to serve as the initial centroids and assign every data point to one of the K clusters.

  3. Step 3 − Now the algorithm computes the cluster centroids.

  4. Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e. until the assignment of data points to the clusters no longer changes −

4.1 − First, the sum of the squared distances between the data points and the centroids is computed.

4.2 − Now, each data point is assigned to the cluster whose centroid is closest to it.

4.3 − At last, the centroid of each cluster is recomputed by taking the average of all the data points in that cluster.

K-Means follows the Expectation-Maximization approach to solve the problem. The Expectation step is used for assigning the data points to the closest cluster, and the Maximization step is used for computing the centroid of each cluster.
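To make the Expectation and Maximization steps concrete, here is a minimal from-scratch sketch in NumPy. This is not the scikit-learn implementation; the function name kmeans_sketch, the convergence test, and the lack of empty-cluster handling are illustrative simplifications −

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
   rng = np.random.default_rng(seed)
   # Initialize centroids by picking k distinct data points at random
   centroids = X[rng.choice(len(X), size=k, replace=False)]
   for _ in range(n_iters):
      # Expectation step: assign each point to its nearest centroid
      dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
      labels = dists.argmin(axis=1)
      # Maximization step: move each centroid to the mean of its points
      # (for brevity, this sketch does not handle empty clusters)
      new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
      if np.allclose(new_centroids, centroids):
         break  # centroids stabilized, i.e. assignments no longer change
      centroids = new_centroids
   return centroids, labels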

While working with the K-Means algorithm we need to take care of the following things −

  1. While working with clustering algorithms, including K-Means, it is recommended to standardize the data, because such algorithms use distance-based measures to determine the similarity between data points.

  2. Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to try several different initializations of the centroids, as shown in the sketch after this list.
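Both precautions are straightforward in scikit-learn: StandardScaler standardizes the features, and the n_init parameter of KMeans reruns the algorithm with several different centroid initializations and keeps the best result. A minimal sketch; the synthetic two-feature data here is only for illustration −

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two features on very different scales, so distances would be dominated by one
X = np.random.RandomState(0).rand(200, 2) * [1, 100]
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

# n_init=10 runs K-Means with 10 different centroid initializations and
# keeps the run with the lowest inertia (sum of squared distances)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)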

Implementation in Python

The following two examples of implementing the K-Means clustering algorithm will help us understand it better −

Example 1

This is a simple example to understand how K-Means works. In this example, we will first generate a 2D dataset containing four different blobs and then apply the K-Means algorithm to see the result.

First, we will start by importing the necessary packages −

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans

The following code will generate the 2D dataset containing four blobs −

from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn versions
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

Next, the following code will help us visualize the dataset −

plt.scatter(X[:, 0], X[:, 1], s=20);
plt.show()
(Figure: scatter plot of the 400 generated data points)

Next, make an object of KMeans, provide the number of clusters, train the model, and do the prediction as follows −

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

Now, with the help of the following code, we can plot and visualize the cluster centers picked by the KMeans Python estimator −

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9);
plt.show()
(Figure: data points colored by cluster, with the four cluster centers marked in blue)

Example 2

Let us move to another example in which we will apply K-Means clustering to the simple digits dataset. K-Means will try to identify similar digits without using the original label information.

First, we will start by importing the necessary packages −

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans

Next, load the digits dataset from sklearn and make an object of it. We can also find the number of rows and columns in this dataset as follows −

from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape

Output

(1797, 64)

The above output shows that this dataset has 1797 samples with 64 features.

We can perform the clustering as we did in Example 1 above −

kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape

Output

(10, 64)

The above output shows that K-Means created 10 cluster centers, each with 64 features, i.e. each center is itself a point in the 64-dimensional feature space. We can visualize these centers as 8×8 images −

fig, ax = plt.subplots(2, 5, figsize=(8, 3))
# Each cluster center is a 64-dimensional vector; reshape it into an 8x8 image
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks=[], yticks=[])
   axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
plt.show()

Output

As output, we will get the following image showing the cluster centers learned by K-Means.

(Figure: the ten cluster centers rendered as 8×8 digit images)

Because K-Means knows nothing about the identities of the clusters, the following lines of code match each learned cluster label with the most common true label found within that cluster −

from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
   # For each cluster, find the most common true digit among its members
   mask = (clusters == i)
   labels[mask] = mode(digits.target[mask])[0]

Next, we can check the accuracy as follows −

from sklearn.metrics import accuracy_score
accuracy_score(digits.target, labels)

Output

0.7935447968836951

The above output shows that the accuracy is around 80%.
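Note that matching cluster labels to true labels by mode is only needed because clustering is unsupervised. As an alternative, scikit-learn provides label-free metrics such as the adjusted Rand index, which compares the two assignments directly without any matching step −

from sklearn.metrics import adjusted_rand_score
# Measures agreement between the clustering and the true labels,
# ignoring how the cluster labels happen to be numbered
adjusted_rand_score(digits.target, clusters)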

Advantages and Disadvantages

Advantages

The following are some advantages of the K-Means clustering algorithm −

  1. It is very easy to understand and implement.

  2. If we have a large number of variables, then K-Means is faster than hierarchical clustering.

  3. When the centroids are recomputed, an instance can change its cluster.

  4. Tighter clusters are formed with K-Means as compared to hierarchical clustering.

Disadvantages

The following are some disadvantages of the K-Means clustering algorithm −

  1. It is a bit difficult to predict the number of clusters, i.e. the value of K (the elbow-method sketch after this list is one common heuristic).

  2. The output is strongly impacted by the initial inputs, such as the number of clusters (the value of K).

  3. Order of data will have strong impact on the final output.

  4. It is very sensitive to rescaling. If we rescale our data by means of normalization or standardization, then the final output will completely change.

  5. It is not good in doing clustering job if the clusters have a complicated geometric shape.
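For the first disadvantage, a common heuristic is the elbow method: run K-Means for a range of K values, plot the inertia (within-cluster sum of squared distances), and pick the K where the curve bends and starts to flatten. A minimal sketch, reusing the blob data from Example 1; the range of K values is an arbitrary choice −

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

inertias = []
k_values = range(1, 10)
for k in k_values:
   km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
   inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(k_values, inertias, 'o-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.show()  # look for the "elbow" where the curve flattens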

Applications of K-Means Clustering Algorithm

The main goals of cluster analysis are −

  1. To get a meaningful intuition from the data we are working with.

  2. Cluster-then-predict where different models will be built for different subgroups.

To fulfill the above-mentioned goals, K-Means clustering performs well enough. It can be used in the following applications −

  1. Market segmentation

  2. Document clustering

  3. Image segmentation

  4. Image compression (see the sketch after this list)

  5. Customer segmentation

  6. Analyzing the trend on dynamic data
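As an illustration of the image compression application, K-Means can quantize an image's colors: cluster all the pixels in RGB space and replace each pixel with its cluster centroid. A minimal sketch using one of scikit-learn's bundled sample images; the choice of 16 colors is arbitrary −

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

image = load_sample_image('china.jpg')        # RGB image, shape (427, 640, 3)
pixels = image.reshape(-1, 3).astype(float)   # one row per pixel

# Cluster the pixel colors into 16 groups
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)

# Replace every pixel with its cluster centroid: 16 colors instead of thousands
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
compressed = compressed.astype(np.uint8)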