Machine Learning: A Concise Tutorial
Machine Learning - K-Medoids Clustering
K-Medoids Clustering - Algorithm
The K-medoids clustering algorithm can be summarized as follows −
1. Initialize k medoids − Select k random data points from the dataset as the initial medoids.
2. Assign data points to medoids − Assign each data point to the nearest medoid.
3. Update medoids − For each cluster, select the data point that minimizes the sum of distances to all the other data points in the cluster, and set it as the new medoid.
4. Repeat steps 2 and 3 until convergence or a maximum number of iterations is reached.
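The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name k_medoids, its parameters, and the choice of Euclidean distance (math.dist) are assumptions made for this example, and the code favors readability over speed.

```python
import math
import random


def k_medoids(points, k, max_iter=100, seed=0):
    """Alternating K-medoids on a list of coordinate tuples (illustrative sketch)."""
    rng = random.Random(seed)

    def assign(medoids):
        # Label each point with the position of its nearest medoid.
        return [
            min(range(k), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points
        ]

    # Step 1: pick k distinct data points (by index) as the initial medoids.
    medoids = rng.sample(range(len(points)), k)

    for _ in range(max_iter):
        # Step 2: assign every point to its nearest medoid.
        labels = assign(medoids)

        # Step 3: in each cluster, the new medoid is the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Possible only with duplicate points; keep the old medoid.
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)

        # Step 4: stop once the medoids no longer change.
        if new_medoids == medoids:
            break
        medoids = new_medoids

    # Return medoid indices and labels that are consistent with each other.
    return medoids, assign(medoids)


# Two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
medoid_indices, labels = k_medoids(data, k=2)
print(medoid_indices, labels)
```

The medoids are stored as indices into the dataset, which makes explicit that every cluster center is one of the original data points.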
Implementation in Python
To implement K-medoids clustering in Python, we can use the scikit-learn-extra library, a companion package to scikit-learn that is installed separately (pip install scikit-learn-extra) and imported as sklearn_extra. It provides the KMedoids class, which can be used to perform K-medoids clustering on a dataset.
First, we need to import the required libraries −
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
Next, we generate a sample dataset using the make_blobs() function from scikit-learn −
X, y = make_blobs(n_samples=500, centers=3, random_state=42)
Here, we generate a dataset with 500 data points and 3 clusters.
Next, we initialize the KMedoids class and fit the data −
kmedoids = KMedoids(n_clusters=3, random_state=42)
kmedoids.fit(X)
Here, we set the number of clusters to 3 and use the random_state parameter to ensure reproducibility.
Finally, we can visualize the clustering results using a scatter plot −
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=kmedoids.labels_, cmap='viridis')
plt.scatter(kmedoids.cluster_centers_[:, 0],
            kmedoids.cluster_centers_[:, 1], marker='x', color='red')
plt.show()
Example
Here is the complete implementation in Python −
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate sample data
X, y = make_blobs(n_samples=500, centers=3, random_state=42)
# Cluster the data using KMedoids
kmedoids = KMedoids(n_clusters=3, random_state=42)
kmedoids.fit(X)
# Plot the results
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=kmedoids.labels_, cmap='viridis')
plt.scatter(kmedoids.cluster_centers_[:, 0],
            kmedoids.cluster_centers_[:, 1], marker='x', color='red')
plt.show()
Here, we plot the data points as a scatter plot and color them based on their cluster labels. We also plot the medoids as red crosses.
K-Medoids Clustering - Advantages
Here are the advantages of using K-medoids clustering −
- Robust to outliers and noise − K-medoids clustering is more robust to outliers and noise than K-means clustering because each cluster is represented by a medoid, an actual data point that minimizes the sum of distances to the other members, rather than by a mean that a single extreme value can pull away from the bulk of the cluster.
- Can handle non-Euclidean distance metrics − K-medoids clustering needs only pairwise dissimilarities between data points, so it can be used with non-Euclidean measures such as Manhattan distance and cosine distance.
- Computational cost − The classic PAM algorithm costs roughly O(k*(n-k)^2) per iteration, which is higher, not lower, than the O(n*k*d) per-iteration cost of K-means (d is the number of features). K-medoids is therefore practical mainly for small and medium-sized datasets.
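To make the first two points concrete, the following small sketch compares the mean and the medoid of a cluster that contains one outlier, using Manhattan distance to show that the distance function is interchangeable. The helper names (manhattan, medoid, mean) and the sample points are made up for this illustration.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two coordinate tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def medoid(points, dist):
    """Return the point with the smallest total distance to all points."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


def mean(points):
    """Coordinate-wise arithmetic mean, as used by K-means."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Four nearby points plus one extreme outlier.
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]

center_mean = mean(cluster)                  # pulled toward the outlier
center_medoid = medoid(cluster, manhattan)   # stays inside the dense group

print("mean:  ", center_mean)    # (21.2, 21.2)
print("medoid:", center_medoid)  # (2, 2)
```

The mean lands far from every point in the dense group, while the medoid remains one of them. Swapping manhattan for another dissimilarity function requires no other change to the medoid computation.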
K-Medoids Clustering - Disadvantages
The disadvantages of using K-medoids clustering are as follows −
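- Sensitive to the choice of k − The performance of K-medoids clustering can be sensitive to the choice of k, the number of clusters, which must be specified in advance.
- Costly on large datasets and less informative in high dimensions − The medoid update compares pairs of data points, so the cost grows roughly quadratically with the number of data points. On high-dimensional data, distances between points also tend to become less discriminative, which can degrade the quality of the clusters.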
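One common way to examine the effect of k is to compare the total distance from each point to its nearest medoid for several values of k and look for the point where the improvement levels off (the "elbow" heuristic). The sketch below does this by exhaustive search over all medoid combinations, which is feasible only for tiny datasets; the function names and the sample points are assumptions made for this illustration.

```python
import math
from itertools import combinations


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def best_cost(points, k):
    """Lowest total cost over every possible choice of k medoids (brute force)."""
    return min(total_cost(points, ms) for ms in combinations(points, k))


# Three small, well-separated groups of three points each.
points = [
    (0, 0), (0, 1), (1, 0),
    (5, 5), (5, 6), (6, 5),
    (10, 0), (10, 1), (11, 0),
]

costs = {k: best_cost(points, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

The cost keeps decreasing as k grows, so it cannot be minimized directly to choose k. On this data the decrease is large up to k = 3 and small afterwards, which matches the three groups in the sample.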