Artificial Intelligence With Python - A Concise Tutorial
AI with Python - Unsupervised Learning: Clustering
Unsupervised machine learning algorithms do not have a supervisor to provide any sort of guidance. That is why they are closely aligned with what some call true artificial intelligence.
In unsupervised learning, there is no correct answer and no teacher to provide guidance. The algorithms need to discover the interesting patterns in the data on their own.
What is Clustering?
Basically, clustering is a type of unsupervised learning method and a common technique for statistical data analysis used in many fields. It is mainly the task of dividing a set of observations into subsets, called clusters, in such a way that observations in the same cluster are similar in some sense and dissimilar to the observations in other clusters. In simple words, we can say that the main goal of clustering is to group the data on the basis of similarity and dissimilarity.
For example, the following diagram shows similar kinds of data grouped into different clusters −
Algorithms for Clustering the Data
Following are a few common algorithms for clustering the data −
K-Means algorithm
The K-means clustering algorithm is one of the well-known algorithms for clustering data. We need to assume that the number of clusters is already known. This is also called flat clustering. It is an iterative clustering algorithm, and the steps given below need to be followed −
Step 1 − We need to specify K, the desired number of subgroups.
Step 2 − Fix the number of clusters and randomly assign each data point to a cluster. In other words, we need to classify our data based on the number of clusters.
In this step, the cluster centroids should be computed.
As this is an iterative algorithm, we need to update the locations of the K centroids with every iteration until we find the global optimum, in other words, until the centroids reach their optimal locations.
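To make these steps concrete, here is a minimal NumPy sketch of the K-means loop described above. It is a simplified illustration with random initialization and Euclidean distance, not the production implementation used in the code that follows −

import numpy as np

def kmeans_sketch(X, k, n_iters = 100, seed = 0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and randomly pick initial centroids from the data
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(n_iters):
        # Assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = distances.argmin(axis = 1)
        # Recompute each centroid as the mean of its assigned points
        # (for brevity, this sketch does not handle empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels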
The following code will help in implementing the K-means clustering algorithm in Python. We are going to use the Scikit-learn module.
Let us import the necessary packages −
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
The following lines of code will help in generating the two-dimensional dataset, containing four blobs, by using make_blobs from the sklearn.datasets package.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples = 500, centers = 4,
cluster_std = 0.40, random_state = 0)
We can visualize the dataset by using the following code −
plt.scatter(X[:, 0], X[:, 1], s = 50);
plt.show()
Here, we initialize kmeans to be the KMeans algorithm, with the required parameter for the number of clusters (n_clusters).
kmeans = KMeans(n_clusters = 4)
We need to train the K-means model with the input data.
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c = y_kmeans, s = 50, cmap = 'viridis')
centers = kmeans.cluster_centers_
The code given below will help us plot and visualize the machine's findings based on our data, and the fit according to the number of clusters that are to be found.
plt.scatter(centers[:, 0], centers[:, 1], c = 'black', s = 200, alpha = 0.5);
plt.show()
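Once the model is fitted, kmeans can also assign new, unseen points to the nearest learned centroid. The coordinates below are arbitrary illustrative values, not part of the original example −

new_points = np.array([[0, 2], [4, 4]])
# Each new point is labeled with the index of its nearest centroid
print(kmeans.predict(new_points))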
Mean Shift Algorithm
Mean shift is another popular and powerful clustering algorithm used in unsupervised learning. It does not make any assumptions about the number or shape of the clusters; hence it is a non-parametric algorithm. It is also called mode-seeking or mean shift cluster analysis. The following are the basic steps of this algorithm, with a small sketch after the list −
- First of all, we need to start with the data points assigned to a cluster of their own.
- Now, it computes the centroids and updates the location of the new centroids.
- By repeating this process, we move closer to the peak of the cluster, i.e. towards the region of higher density.
- This algorithm stops at the stage where the centroids do not move anymore.
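Here is a minimal NumPy sketch of these steps using a flat (uniform) kernel. The bandwidth value and the helper name mean_shift_sketch are assumptions chosen for illustration, not tuned parameters or library functions −

import numpy as np

def mean_shift_sketch(X, bandwidth = 2.0, n_iters = 50, tol = 1e-3):
    # Every data point starts as the centroid of its own cluster
    # (assumes X is a float array)
    centroids = X.astype(float).copy()
    for _ in range(n_iters):
        new_centroids = np.empty_like(centroids)
        for i, c in enumerate(centroids):
            # Shift each centroid to the mean of the points within its window
            window = X[np.linalg.norm(X - c, axis = 1) < bandwidth]
            new_centroids[i] = window.mean(axis = 0)
        # Stop at the stage where the centroids do not move anymore
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    # Centroids that converged to the same density peak are merged
    return np.unique(centroids.round(3), axis = 0)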
With the help of the following code, we are implementing the Mean Shift clustering algorithm in Python. We are going to use the Scikit-learn module.
Let us import the necessary packages −
import numpy as np
from sklearn.cluster import MeanShift
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
The following code will help in generating the two-dimensional dataset, containing three blobs, by using make_blobs from the sklearn.datasets package.
from sklearn.datasets import make_blobs
We can visualize the dataset with the following code −
centers = [[2,2],[4,5],[3,10]]
X, _ = make_blobs(n_samples = 500, centers = centers, cluster_std = 1)
plt.scatter(X[:,0],X[:,1])
plt.show()
Now, we need to train the Mean Shift cluster model with the input data.
ms = MeanShift()
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
The following code will print the cluster centers and the estimated number of clusters as per the input data −
print(cluster_centers)
n_clusters_ = len(np.unique(labels))
print("Estimated clusters:", n_clusters_)
[[ 3.23005036 3.84771893]
[ 3.02057451 9.88928991]]
Estimated clusters: 2
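Note that MeanShift() was run with its default bandwidth here, which is likely why the two lower blobs were merged into a single cluster in the output above. A narrower window usually separates them; one way to choose it is scikit-learn's estimate_bandwidth helper (the quantile value below is an illustrative assumption) −

from sklearn.cluster import estimate_bandwidth

# A smaller quantile gives a narrower window and hence more clusters
bandwidth = estimate_bandwidth(X, quantile = 0.1, n_samples = 500)
ms_bw = MeanShift(bandwidth = bandwidth)
ms_bw.fit(X)
print("Estimated clusters:", len(np.unique(ms_bw.labels_)))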
The code given below will help plot and visualize the machine's findings based on our data, and the fit according to the number of clusters that are to be found.
colors = 10*['r.','g.','b.','c.','k.','y.','m.']
for i in range(len(X)):
   plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)
plt.scatter(cluster_centers[:,0], cluster_centers[:,1],
   marker = "x", color = 'k', s = 150, linewidths = 5, zorder = 10)
plt.show()
Measuring the Clustering Performance
Real world data is not naturally organized into a number of distinctive clusters. For this reason, it is not easy to visualize the data and draw inferences from it. That is why we need to measure the clustering performance as well as its quality. This can be done with the help of silhouette analysis.
Silhouette Analysis
This method can be used to check the quality of clustering by measuring the distance between the clusters. Basically, it provides a way to assess parameters like the number of clusters by computing a silhouette score. This score is a metric that measures how close each point in one cluster is to the points in the neighboring clusters.
Analysis of silhouette score
The score has a range of [-1, 1]. Following is the analysis of this score −
- Score of +1 − A score near +1 indicates that the sample is far away from the neighboring cluster.
- Score of 0 − A score of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters.
- Score of -1 − A negative score indicates that the sample has been assigned to the wrong cluster.
Calculating Silhouette Score
In this section, we will learn how to calculate the silhouette score.
Silhouette score can be calculated by using the following formula −
silhouette score = \frac{p - q}{\max(p, q)}
Here, p is the mean distance to the points in the nearest cluster that the data point is not a part of, and q is the mean intra-cluster distance to all the points in its own cluster.
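To see the formula in action, here is a small sketch that computes the score of a single point by hand and checks it against scikit-learn's silhouette_samples; the four toy points are made up purely for illustration −

import numpy as np
from sklearn.metrics import silhouette_samples

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels_toy = np.array([0, 0, 1, 1])

point = X_toy[0]
# q: mean distance to the other points of its own cluster (only one here)
q = np.linalg.norm(point - X_toy[1])
# p: mean distance to the points of the nearest other cluster
p = np.mean(np.linalg.norm(X_toy[labels_toy == 1] - point, axis = 1))
print("By hand:", (p - q) / max(p, q))
print("sklearn:", silhouette_samples(X_toy, labels_toy)[0])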
For finding the optimal number of clusters, we need to run the clustering algorithm again, this time importing the metrics module from the sklearn package. In the following example, we will run the K-means clustering algorithm to find the optimal number of clusters −
Import the necessary packages as shown −
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
With the help of the following code, we will generate the two-dimensional dataset, containing four blobs, by using make_blobs from the sklearn.datasets package.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples = 500, centers = 4, cluster_std = 0.40, random_state = 0)
Initialize the variables as shown −
scores = []
values = np.arange(2, 10)
We need to iterate the K-means model through all these values and train it with the input data.
for num_clusters in values:
   kmeans = KMeans(init = 'k-means++', n_clusters = num_clusters, n_init = 10)
   kmeans.fit(X)
Now, estimate the silhouette score for the current clustering model using the Euclidean distance metric; these lines run inside the loop above −
   score = metrics.silhouette_score(X, kmeans.labels_,
         metric = 'euclidean', sample_size = len(X))
The following lines of code, also inside the loop, will help in displaying the number of clusters as well as the silhouette score.
print("\nNumber of clusters =", num_clusters)
print("Silhouette score =", score)
scores.append(score)
You will receive the following output −
Number of clusters = 9
Silhouette score = 0.340391138371
num_clusters = np.argmax(scores) + values[0]
print('\nOptimal number of clusters =', num_clusters)
Now, the output for the optimal number of clusters would be as follows −
Optimal number of clusters = 2
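Since every score was appended to the scores list, we can also plot the scores against the candidate cluster counts to inspect the curve ourselves; this is a small optional addition to the original example −

plt.figure()
plt.plot(values, scores, marker = 'o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()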
Finding Nearest Neighbors
If we want to build recommender systems, such as a movie recommender system, then we need to understand the concept of finding the nearest neighbors. This is because recommender systems utilize the concept of nearest neighbors.
The concept of finding nearest neighbors may be defined as the process of finding the point closest to the input point in the given dataset. The main use of this KNN (K-Nearest Neighbors) algorithm is to build classification systems that classify a data point based on its proximity to various classes.
The Python code given below helps in finding the K-nearest neighbors of a given data set −
Import the necessary packages as shown below. Here, we are using the NearestNeighbors module from the sklearn package −
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
Let us now define the input data −
A = np.array([[3.1, 2.3], [2.3, 4.2], [3.9, 3.5], [3.7, 6.4], [4.8, 1.9],
[8.3, 3.1], [5.2, 7.5], [4.8, 4.7], [3.5, 5.1], [4.4, 2.9],])
Now, we need to define the number of nearest neighbors −
k = 3
We also need to give the test data for which the nearest neighbors are to be found −
test_data = [3.3, 2.9]
The following code can visualize and plot the input data defined by us −
plt.figure()
plt.title('Input data')
plt.scatter(A[:,0], A[:,1], marker = 'o', s = 100, color = 'black')
Now, we need to build the K Nearest Neighbors model. The object also needs to be trained −
knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(A)
distances, indices = knn_model.kneighbors([test_data])
Now, we can print the K nearest neighbors as follows −
print("\nK Nearest Neighbors:")
for rank, index in enumerate(indices[0][:k], start = 1):
   print(str(rank) + " is", A[index])
We can visualize the nearest neighbors along with the test data point −
plt.figure()
plt.title('Nearest neighbors')
plt.scatter(A[:, 0], A[:, 1], marker = 'o', s = 100, color = 'k')
plt.scatter(A[indices[0], 0], A[indices[0], 1],
   marker = 'o', s = 250, facecolors = 'none', edgecolors = 'k')
plt.scatter(test_data[0], test_data[1],
marker = 'x', s = 100, color = 'k')
plt.show()
K-Nearest Neighbors Classifier
A K-Nearest Neighbors (KNN) classifier is a classification model that uses the nearest neighbors algorithm to classify a given data point. We implemented the KNN algorithm in the last section; now we are going to build a KNN classifier using that algorithm.
Concept of KNN Classifier
The basic concept of K-nearest neighbor classification is to find a predefined number 'k' of training samples closest in distance to a new sample that has to be classified. The new sample gets its label from those neighbors themselves. KNN classifiers use a fixed, user-defined constant for the number of neighbors to consider. For the distance, the standard Euclidean distance is the most common choice. The KNN classifier works directly on the stored samples rather than creating rules for learning. The KNN algorithm is among the simplest of all machine learning algorithms, and it has been quite successful in a large number of classification and regression problems, for example, character recognition or image analysis. A small sketch of this idea is given below.
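To make the voting idea concrete before the scikit-learn example, here is a minimal sketch of KNN classification by majority vote; the helper name knn_predict is ours, not a library function −

import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, sample, k = 3):
    # train_x and train_y are assumed to be NumPy arrays
    # Find the k training samples closest to the new sample (Euclidean distance)
    nearest = np.argsort(np.linalg.norm(train_x - sample, axis = 1))[:k]
    # The new sample takes the majority label of its neighbors
    return Counter(train_y[nearest]).most_common(1)[0][0]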
Example
We are building a KNN classifier to recognize digits. For this, we will use the digits dataset bundled with scikit-learn (load_digits), a smaller cousin of the MNIST dataset. We will write this code in a Jupyter Notebook.
Import the necessary packages as shown below.
Here we are using the KNeighborsClassifier module from the sklearn.neighbors package −
from sklearn.datasets import load_digits
import pandas as pd
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import numpy as np
The following code will display the image of a digit so that we can verify which image we are going to test −
def Image_display(i):
   plt.imshow(digit['images'][i], cmap = 'Greys_r')
   plt.show()
Now, we need to load the digits dataset. There are 1797 images in total; we are using the first 1600 images as training samples, and the remaining 197 are kept for testing purposes.
digit = load_digits()
digit_d = pd.DataFrame(digit['data'][0:1600])
Now, on displaying the images, we can see the output as follows −
Image_display(0)
digit.keys()
Now, we need to create the training dataset and train the KNN classifier on it; the testing samples will be supplied to the classifier afterwards.
train_x = digit['data'][:1600]
train_y = digit['target'][:1600]
KNN = KNeighborsClassifier(20)
KNN.fit(train_x,train_y)
The fit call above will display the K nearest neighbor classifier constructor as the following output −
KNeighborsClassifier(algorithm = 'auto', leaf_size = 30, metric = 'minkowski',
metric_params = None, n_jobs = 1, n_neighbors = 20, p = 2,
weights = 'uniform')
We need to create the testing sample by providing any arbitrary index greater than 1599, i.e. beyond the training samples.
test = np.array(digit['data'][1725])
test1 = test.reshape(1,-1)
Image_display(1725)
Image_display(6)
Image of 6 is displayed as follows −
Now we will predict the test data as follows −
KNN.predict(test1)
The above code will generate the following output −
array([6])
Now, consider the following −
digit['target_names']
The above code will generate the following output −
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
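As a final sanity check, not part of the original walkthrough, we can also score the classifier on all 197 held-out samples instead of a single image; the exact accuracy you get may vary with the chosen n_neighbors −

test_x = digit['data'][1600:]
test_y = digit['target'][1600:]
# Mean accuracy of the classifier on the held-out samples
print("Held-out accuracy:", KNN.score(test_x, test_y))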