Scikit Learn Concise Tutorial

Scikit Learn - KNN Learning

k-NN (k-Nearest Neighbor), one of the simplest machine learning algorithms, is non-parametric and lazy in nature. Non-parametric means that there is no assumption about the underlying data distribution, i.e. the model structure is determined from the dataset. Lazy or instance-based learning means that no model is built at training time; instead, the whole training data is stored and used in the testing phase.

The k-NN algorithm consists of the following two steps −

Step 1

In this step, it computes and stores the k nearest neighbors for each sample in the training set.

Step 2

In this step, for an unlabeled sample, it retrieves the k nearest neighbors from the dataset. Then, among these k nearest neighbors, it predicts the class through voting (the class with the majority of votes wins).
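To make these two steps concrete, here is a minimal brute-force sketch in plain NumPy. The helper knn_predict and the toy data are our own, for illustration only, and not how scikit-learn implements the search internally −

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
   # Step 1: compute the distance from the query to every training sample
   distances = np.linalg.norm(X_train - x_query, axis=1)
   # Step 2: take the k nearest neighbors and vote on the class
   nearest = np.argsort(distances)[:k]
   votes = Counter(y_train[nearest])
   return votes.most_common(1)[0][0]

X_train = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))   # prints 0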

The module sklearn.neighbors, which implements the k-nearest neighbors algorithm, provides the functionality for unsupervised as well as supervised neighbors-based learning methods.

The unsupervised nearest neighbors implement different algorithms (BallTree, KDTree or Brute Force) to find the nearest neighbor(s) for each sample. This unsupervised version is basically only step 1, discussed above, and the foundation of many algorithms (KNN and K-means being the most famous ones) that require a neighbor search. In simple words, it is an unsupervised learner for implementing neighbor searches.
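These tree structures can also be used on their own. The small sketch below (the toy data is ours) queries the two nearest neighbors of each sample with a KDTree and a BallTree directly −

import numpy as np
from sklearn.neighbors import KDTree, BallTree

X = np.array([[-1, 1], [-2, 2], [1, 2], [2, 3]])

# KD-tree based search
kd_tree = KDTree(X, leaf_size=30)
dist, ind = kd_tree.query(X, k=2)
print(ind)   # the first column is each point itself

# Ball-tree based search; it returns the same neighbors here
ball_tree = BallTree(X, leaf_size=30)
dist, ind = ball_tree.query(X, k=2)
print(ind)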

On the other hand, the supervised neighbors-based learning is used for classification as well as regression.

Unsupervised KNN Learning

As discussed, there exist many algorithms, like KNN and K-Means, that require nearest neighbor searches. That is why Scikit-learn decided to implement the neighbor search part as its own “learner”. The reason for making neighbor search a separate learner is that computing all pairwise distances to find a nearest neighbor is obviously not very efficient. Let’s see the module used by Sklearn to implement unsupervised nearest neighbor learning, along with an example.

Scikit-learn module

sklearn.neighbors.NearestNeighbors is the module used to implement unsupervised nearest neighbor learning. It uses specific nearest neighbor algorithms named BallTree, KDTree or Brute Force. In other words, it acts as a uniform interface to these three algorithms.

Parameters

The following table lists the parameters used by the NearestNeighbors module −

1. n_neighbors − int, optional. The number of neighbors to get. The default value is 5.

2. radius − float, optional. It limits the distance within which neighbors are returned. The default value is 1.0.

3. algorithm − {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional. This parameter takes the algorithm (BallTree, KDTree or Brute Force) you want to use to compute the nearest neighbors. If you provide ‘auto’, it will attempt to decide the most appropriate algorithm based on the values passed to the fit method.

4. leaf_size − int, optional. It can affect the speed of construction and query, as well as the memory required to store the tree. It is passed to BallTree or KDTree. Although the optimal value depends on the nature of the problem, its default value is 30.

5. metric − string or callable. The metric to use for distance computation between points. We can pass it either as a string or as a callable function. In the case of a callable function, the metric is called on each pair of rows and the resulting value is recorded; this is less efficient than passing the metric name as a string. We can choose a metric from scikit-learn or from scipy.spatial.distance. The valid values are as follows −
Scikit-learn − [‘cosine’, ‘manhattan’, ‘euclidean’, ‘l1’, ‘l2’, ‘cityblock’]
Scipy.spatial.distance − [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘dice’, ‘hamming’, ‘jaccard’, ‘correlation’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘sokalmichener’, ‘sokalsneath’, ‘seuclidean’, ‘sqeuclidean’, ‘yule’]
The default metric is ‘minkowski’.

6. p − integer, optional. It is the parameter for the Minkowski metric. The default value is 2, which is equivalent to using the Euclidean distance (l2).

7. metric_params − dict, optional. Additional keyword arguments for the metric function. The default value is None.

8. n_jobs − int or None, optional. It represents the number of parallel jobs to run for the neighbor search. The default value is None.
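A short sketch showing several of these parameters in combination; the toy data and the chosen values (Manhattan metric, brute-force search, radius 1.5) are our own −

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0, 0], [0, 1], [1, 1], [5, 5]])

# Combine several of the parameters listed above
nn = NearestNeighbors(n_neighbors=2, radius=1.5,
                      algorithm='brute', metric='manhattan')
nn.fit(X)

dist, ind = nn.kneighbors([[0, 0]])            # uses n_neighbors
print(ind)                                     # [[0 1]]
dist_r, ind_r = nn.radius_neighbors([[0, 0]])  # uses radius
print(ind_r)                                   # indices of all points within 1.5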

Implementation Example

The example below will find the nearest neighbors within a set of data points by using the sklearn.neighbors.NearestNeighbors module.

First, we need to import the required module and packages −

from sklearn.neighbors import NearestNeighbors
import numpy as np

Now, after importing the packages, define the set of data within which we want to find the nearest neighbors −

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4, 5]])

Next, apply the unsupervised learning algorithm, as follows −

nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')

Next, fit the model with the input data set.

nrst_neigh.fit(Input_data)

Now, find the K-neighbors of the data set. It will return the indices and distances of the neighbors of each point.

distances, indices = nrst_neigh.kneighbors(Input_data)
indices

Output

array(
   [
      [0, 1, 3],
      [1, 2, 0],
      [2, 1, 0],
      [3, 4, 0],
      [4, 5, 3],
      [5, 6, 4],
      [6, 5, 4]
   ], dtype = int64
)
distances

Output

array(
   [
      [0. , 1.41421356, 2.23606798],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 2.82842712],
      [0. , 1.41421356, 2.23606798],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 1.41421356],
      [0. , 1.41421356, 2.82842712]
   ]
)

The above output shows that the nearest neighbor of each point is the point itself, i.e. at zero distance. This is because the query set matches the training set.
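If we instead query a point that is not in the training set, the self-match disappears. A quick sketch, reusing the fitted learner from above (the query point [0, 0] is our own) −

# Query a new point that is not part of the training data
dist, ind = nrst_neigh.kneighbors([[0, 0]])
print(ind)    # indices of the 3 training samples closest to (0, 0)
print(dist)   # none of the distances is zero now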

Example

We can also show a connection between neighboring points by producing a sparse graph as follows −

nrst_neigh.kneighbors_graph(Input_data).toarray()

Output

array(
   [
      [1., 1., 0., 1., 0., 0., 0.],
      [1., 1., 1., 0., 0., 0., 0.],
      [1., 1., 1., 0., 0., 0., 0.],
      [1., 0., 0., 1., 1., 0., 0.],
      [0., 0., 0., 1., 1., 1., 0.],
      [0., 0., 0., 0., 1., 1., 1.],
      [0., 0., 0., 0., 1., 1., 1.]
   ]
)
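Each row of this matrix corresponds to one sample; a 1 in column j means that sample j is among that row’s 3 nearest neighbors, and the diagonal is all ones because, as seen above, every point is its own nearest neighbor.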

Once we fit the unsupervised NearestNeighbors model, the data will be stored in a data structure based on the value set for the argument ‘algorithm’. After that we can use this unsupervised learner’s kneighbors in a model which requires neighbor searches.
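For example, the sparse graph built above can be handed to an estimator that accepts a neighborhood structure. The sketch below feeds it to AgglomerativeClustering as a connectivity constraint; the choice of that estimator and of n_clusters=2 is ours, for illustration −

from sklearn.cluster import AgglomerativeClustering

# Reuse the fitted unsupervised learner to build a connectivity graph
knn_graph = nrst_neigh.kneighbors_graph(Input_data)

# Pass the graph to a model that can exploit neighbor information
model = AgglomerativeClustering(n_clusters=2, connectivity=knn_graph)
print(model.fit_predict(Input_data))   # cluster label of each sample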

Complete working/executable program

from sklearn.neighbors import NearestNeighbors
import numpy as np

# Define the data set and fit the unsupervised learner
Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])
nrst_neigh = NearestNeighbors(n_neighbors=3, algorithm='ball_tree')
nrst_neigh.fit(Input_data)

# Indices of and distances to the 3 nearest neighbors of every sample
distances, indices = nrst_neigh.kneighbors(Input_data)
print(indices)
print(distances)

# Sparse connectivity graph between neighboring points
print(nrst_neigh.kneighbors_graph(Input_data).toarray())

Supervised KNN Learning

The supervised neighbors-based learning is used for the following −

  1. Classification, for the data with discrete labels.

  2. Regression, for the data with continuous labels.

Nearest Neighbor Classifier

We can understand neighbors-based classification with the help of the following two characteristics −

  1. It is computed from a simple majority vote of the nearest neighbors of each point.

  2. It simply stores instances of the training data, that’s why it is a type of non-generalizing learning.

Scikit-learn modules

The following are the two different types of nearest neighbor classifiers used by scikit-learn −

1. KNeighborsClassifier − The K in the name of this classifier represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this classifier implements learning based on the k nearest neighbors. The choice of the value of k is dependent on the data.

2. RadiusNeighborsClassifier − The Radius in the name of this classifier represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence, as the name suggests, this classifier implements learning based on the number of neighbors within a fixed radius r of each training point.
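A hedged sketch of both classifiers on the iris data. The values k=3, radius=1.0 and outlier_label='most_frequent' are our own choices; the radius variant works on scaled data and needs an outlier policy so that test points with no neighbor inside the radius still receive a label −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Majority vote among the 3 nearest neighbors
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
print(knn_clf.score(X_test, y_test))

# Vote among all neighbors within radius 1.0 of each query point
rad_clf = RadiusNeighborsClassifier(radius=1.0, outlier_label='most_frequent')
rad_clf.fit(X_train, y_train)
print(rad_clf.score(X_test, y_test))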

Nearest Neighbor Regressor

It is used in cases where the data labels are continuous in nature. The assigned data label is computed on the basis of the mean of the labels of its nearest neighbors.

The following are the two different types of nearest neighbor regressors used by scikit-learn −

KNeighborsRegressor

The K in the name of this regressor represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this regressor implements learning based on the k nearest neighbors. The choice of the value of k is dependent on data. Let’s understand it more with the help of an implementation example.

Implementation Example

In this example, we will be implementing KNN on the Iris Flower data set by using the scikit-learn KNeighborsRegressor.

First, import the iris dataset as follows −

from sklearn.datasets import load_iris
iris = load_iris()

Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data in the ratio of 80 (training data) to 20 (testing data) −

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

Next, we will be doing data scaling with the help of the Sklearn preprocessing module as follows −

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, import the KNeighborsRegressor class from Sklearn and provide the number of neighbors as follows.

Example

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 8)
knnr.fit(X_train, y_train)

Output

KNeighborsRegressor(
   algorithm = 'auto', leaf_size = 30, metric = 'minkowski',
   metric_params = None, n_jobs = None, n_neighbors = 8, p = 2,
   weights = 'uniform'
)
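Note the weights='uniform' entry in the output above: every neighbor counts equally in the average. A hedged variant (the parameter value is our own choice) weights neighbors by the inverse of their distance instead, so closer neighbors influence the prediction more −

# Inverse-distance weighting: closer neighbors get a larger say
knnr_d = KNeighborsRegressor(n_neighbors=8, weights='distance')
knnr_d.fit(X_train, y_train)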

Example

Now, we can find the MSE (Mean Squared Error) on the training data as follows −

print ("The MSE is:",format(np.power(y-knnr.predict(X),4).mean()))

Output

The MSE is: (a small value close to 0; the exact number varies with the random train/test split)

Example

Now, use it to predict the value as follows −

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 3)
knnr.fit(X, y)
print(knnr.predict([[2.5]]))

Output

[0.66666667]
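This agrees with a hand computation: the three nearest neighbors of 2.5 are the points 1, 2 and 3, whose labels are 0, 1 and 1, and their mean is 2/3 ≈ 0.66666667.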

Complete working/executable program

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Load the iris data and split it 80/20 into train and test sets
iris = load_iris()
X = iris.data[:, :4]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Scale the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit the k-nearest-neighbor regressor and report the training MSE
knnr = KNeighborsRegressor(n_neighbors=8)
knnr.fit(X_train, y_train)
print("The MSE is:", np.power(y_train - knnr.predict(X_train), 2).mean())

# A small one-dimensional example
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
knnr = KNeighborsRegressor(n_neighbors=3)
knnr.fit(X, y)
print(knnr.predict([[2.5]]))

RadiusNeighborsRegressor

The Radius in the name of this regressor represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence, as the name suggests, this regressor implements learning based on the number of neighbors within a fixed radius r of each training point. Let’s understand it more with the help of an implementation example −

Implementation Example

In this example, we will be implementing KNN on the Iris Flower data set by using the scikit-learn RadiusNeighborsRegressor.

First, import the iris dataset as follows −

from sklearn.datasets import load_iris
iris = load_iris()

Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data in the ratio of 80 (training data) to 20 (testing data) −

X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Next, we will be doing data scaling with the help of the Sklearn preprocessing module as follows −

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, import the RadiusNeighborsRegressor class from Sklearn and provide the value of the radius as follows −

import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X_train, y_train)

Example

Now, we can find the MSE (Mean Squared Error) on the training data as follows −

print ("The MSE is:",format(np.power(y-knnr_r.predict(X),4).mean()))

Output

The MSE is: (a small value close to 0; the exact number varies with the random train/test split)

Example

Now, use it to predict the value as follows −

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))

Output

[1.]
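Again, this can be checked by hand: the only training points within radius 1 of 2.5 are 2 and 3, both labeled 1, so the predicted value is their mean, 1.0.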

Complete working/executable program

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import RadiusNeighborsRegressor
import numpy as np

# Load the iris data and split it 80/20 into train and test sets
iris = load_iris()
X = iris.data[:, :4]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Scale the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit the radius-based regressor and report the training MSE
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X_train, y_train)
print("The MSE is:", np.power(y_train - knnr_r.predict(X_train), 2).mean())

# A small one-dimensional example
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
knnr_r = RadiusNeighborsRegressor(radius=1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))