Scikit Learn - Clustering Performance Evaluation
There are various functions with the help of which we can evaluate the performance of clustering algorithms.
Following are some of the important and most widely used functions provided by Scikit-learn for evaluating clustering performance −
Adjusted Rand Index
Rand Index is a function that computes a similarity measure between two clusterings. For this computation, the Rand Index considers all pairs of samples and counts the pairs that are assigned to the same or to different clusters in both the predicted and the true clustering. Afterwards, the raw Rand Index score is ‘adjusted for chance’ into the Adjusted Rand Index score using the following formula −

Adjusted RI = (RI − Expected RI) / (max(RI) − Expected RI)
It takes two parameters, namely labels_true, which is the ground truth class labels, and labels_pred, which are the cluster labels to evaluate.
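For instance, a minimal call to sklearn.metrics.adjusted_rand_score could look like this, reusing the same toy labels as the examples later in this chapter −

Example

from sklearn.metrics import adjusted_rand_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
# 1.0 indicates identical clusterings; values near 0.0 indicate random labelling
adjusted_rand_score(labels_true, labels_pred)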
Mutual Information Based Score
Mutual Information is a function that computes the agreement of two assignments, ignoring permutations. The following versions are available −
Normalized Mutual Information (NMI)
Scikit-learn has the sklearn.metrics.normalized_mutual_info_score module.
Example
from sklearn.metrics.cluster import normalized_mutual_info_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
normalized_mutual_info_score(labels_true, labels_pred)
Adjusted Mutual Information (AMI)
Scikit-learn has the sklearn.metrics.adjusted_mutual_info_score module.
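Mirroring the NMI example above, a minimal call could look like this −

Example

from sklearn.metrics.cluster import adjusted_mutual_info_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
# Like NMI, but adjusted so that random labellings score close to 0.0
adjusted_mutual_info_score(labels_true, labels_pred)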
Fowlkes-Mallows Score
The Fowlkes-Mallows function measures the similarity of two clusterings of a set of points. It may be defined as the geometric mean of the pairwise precision and recall.
Mathematically,

FMS = TP / sqrt((TP + FP) (TP + FN))
Here, TP = True Positive − number of pairs of points belonging to the same cluster in both the true and the predicted labels.
FP = False Positive − number of pairs of points belonging to the same cluster in the true labels but not in the predicted labels.
FN = False Negative − number of pairs of points belonging to the same cluster in the predicted labels but not in the true labels.
Scikit-learn has the sklearn.metrics.fowlkes_mallows_score module.
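A minimal call on the same toy labels could look like this −

Example

from sklearn.metrics import fowlkes_mallows_score
labels_true = [0, 0, 1, 1, 1, 1]
labels_pred = [0, 0, 2, 2, 3, 3]
# Score ranges from 0.0 to 1.0; 1.0 indicates a perfect match between the clusterings
fowlkes_mallows_score(labels_true, labels_pred)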
Silhouette Coefficient
The Silhouette function computes the mean Silhouette Coefficient of all samples, using the mean intra-cluster distance and the mean nearest-cluster distance for each sample.
Mathematically,

S = (b − a) / max(a, b)
Here, a is the mean intra-cluster distance and b is the mean nearest-cluster distance.
Scikit-learn has the sklearn.metrics.silhouette_score module −
Example
from sklearn.metrics import silhouette_score
from sklearn.metrics import pairwise_distances
from sklearn import datasets
import numpy as np
from sklearn.cluster import KMeans
# Load the Iris dataset and cluster it with K-Means
dataset = datasets.load_iris()
X = dataset.data
y = dataset.target
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
# Mean Silhouette Coefficient over all samples, using Euclidean distances
silhouette_score(X, labels, metric='euclidean')
Contingency Matrix
This matrix reports the intersection cardinality for every true/predicted cluster pair. The confusion matrix for classification problems is a square contingency matrix.
Scikit-learn has the sklearn.metrics.cluster.contingency_matrix module.
Example
from sklearn.metrics.cluster import contingency_matrix
x = ["a", "a", "a", "b", "b", "b"]
y = [1, 1, 2, 0, 1, 2]
contingency_matrix(x, y)
Output
array([[0, 2, 1],
       [1, 1, 1]])
The first row of the above output shows that among the three samples whose true cluster is “a”, none of them is in cluster 0, two of them are in cluster 1, and one is in cluster 2. On the other hand, the second row shows that among the three samples whose true cluster is “b”, one is in cluster 0, one is in cluster 1, and one is in cluster 2.