Big Data Analytics: A Concise Tutorial

Big Data Analytics - K-Means Clustering

k-means clustering partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean; that mean serves as the prototype of the cluster. The result is a partition of the data space into Voronoi cells.
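To make the "nearest mean" rule concrete, here is a minimal sketch of the assignment step on toy data. The four observations and the two cluster means are made-up values chosen for illustration, and the variable names (obs, centers, sq_dist, assignment) are hypothetical.

```r
# Toy data (illustrative values): four observations in 2-D
obs <- matrix(c(0, 0,
                0, 1,
                5, 5,
                6, 5), ncol = 2, byrow = TRUE)

# Two hypothetical cluster means, one per row
centers <- matrix(c(0,   0.5,
                    5.5, 5), ncol = 2, byrow = TRUE)

# Squared Euclidean distance from every observation (rows)
# to every cluster mean (columns)
sq_dist <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(sweep(obs, 2, centers[j, ])^2)
})

# Each observation is assigned to the cluster whose mean is nearest
assignment <- apply(sq_dist, 1, which.min)
print(assignment)  # 1 1 2 2
```

The full algorithm alternates this assignment step with recomputing each cluster mean until the assignments stop changing; the sketch shows only the assignment half.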

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering partitions the n observations into k groups G = {G1, G2, …, Gk} so as to minimize the within-cluster sum of squares (WCSS), defined as follows:

\underset{G}{\operatorname{argmin}} \sum_{i = 1}^{k} \sum_{x \in G_{i}} \lVert x - \mu_{i} \rVert^{2}

The formula above is the objective function that is minimized to find the optimal prototypes in k-means clustering, where μi is the mean of the observations in the i-th group. Intuitively, we want groups that differ from one another, while the members of each group are similar to each other.
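As a quick check of the objective, the sketch below computes the WCSS by hand from a fitted clustering and compares it with the tot.withinss value that R's kmeans() reports. The choice of three clusters and the seed are arbitrary assumptions made only for this illustration; x and wcss_manual are hypothetical names.

```r
# Arbitrary seed so that the random initialization is reproducible
set.seed(42)

# Scaled copy of the built-in mtcars data set
x <- scale(mtcars)

# Fit k-means with an arbitrary K = 3, just to have a partition to inspect
fit <- kmeans(x, centers = 3)

# WCSS by hand: for each cluster, sum the squared distances
# between its members and the cluster mean
wcss_manual <- sum(sapply(seq_len(nrow(fit$centers)), function(i) {
  members <- x[fit$cluster == i, , drop = FALSE]
  mu <- colMeans(members)
  sum(sweep(members, 2, mu)^2)
}))

# Compare with the value reported by kmeans()
print(all.equal(wcss_manual, fit$tot.withinss))
```

The comparison uses all.equal() rather than == because the two values are computed along different floating-point paths.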

The following example demonstrates how to run the k-means clustering algorithm in R.

# Prepare data
data <- mtcars

# Scale each variable to zero mean and unit variance
data <- scale(data)

# Fix the random seed so that kmeans() is reproducible (the value is arbitrary)
set.seed(123)

# Compute the within-groups sum of squares for K = 1, 2, ..., max_k
max_k <- ncol(data)
wss <- (nrow(data) - 1) * sum(apply(data, 2, var))   # K = 1
for (i in 2:max_k) {
   wss[i] <- sum(kmeans(data, centers = i)$withinss)
}

# Plot the within-groups sum of squares against the number of clusters
plot(1:max_k, wss, type = "b", xlab = "Number of Clusters",
   ylab = "Within groups sum of squares")

To find a good value for K, we can plot the within-groups sum of squares for different values of K. This metric normally decreases as more groups are added; we look for the point where the decrease starts to slow down. In the plot, this point is best represented by K = 6.
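Reading the elbow off a plot is subjective, so it can help to look at the numbers as well. The following sketch tabulates how much the within-groups sum of squares drops each time a cluster is added. The range 1 to 10, the seed, and nstart = 10 are assumptions made for this illustration, not settings taken from the example above.

```r
# Arbitrary seed for reproducible random starts
set.seed(123)

x <- scale(mtcars)
ks <- 1:10

# Total within-groups sum of squares for each K;
# nstart = 10 keeps the best of ten random initializations
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Decrease in WSS gained by going from K - 1 to K clusters
elbow <- data.frame(k = ks[-1], wss = wss[-1], decrease = -diff(wss))
print(elbow)
```

A candidate for K is the last row before the decrease column flattens out. The exact numbers depend on the random starts, so they may not match the plot exactly.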

[Figure: within-groups sum of squares versus number of clusters]

Now that the value of K has been chosen, we run the algorithm with that value.

# K-Means Cluster Analysis
fit <- kmeans(data, 6) # 6 cluster solution, the K chosen from the plot

# get cluster means
aggregate(data, by = list(cluster = fit$cluster), FUN = mean)

# append cluster assignment
data <- data.frame(data, cluster = fit$cluster)
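Once the clusters are assigned, a natural next step is to see how many observations fall in each cluster and which cars they are. The sketch below is self-contained and assumes K = 6, the value read from the plot in the text, with an arbitrary seed; sizes and members are hypothetical names.

```r
# Arbitrary seed for a reproducible partition
set.seed(123)

x <- scale(mtcars)
fit <- kmeans(x, centers = 6)

# Number of observations in each cluster
sizes <- table(fit$cluster)
print(sizes)

# Car names grouped by cluster label
members <- split(rownames(mtcars), fit$cluster)
print(members)
```

Cluster labels are arbitrary: a different seed can produce the same groups under different numbers, so it is safer to compare cluster contents than label values.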