Machine Learning With Python: A Concise Tutorial
KNN Algorithm - Finding Nearest Neighbors
Introduction
The K-nearest neighbors (KNN) algorithm is a supervised ML algorithm that can be used for both classification and regression problems. In industry, however, it is mainly used for classification. The following two properties characterize KNN −
- Lazy learning algorithm − KNN is a lazy learning algorithm because it has no specialized training phase; it stores the training data and uses all of it when classifying a new point.
- Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it assumes nothing about the distribution of the underlying data.
Working of KNN Algorithm
The K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set. The following steps describe how it works −
- Step 1 − Any algorithm needs a dataset, so the first step of KNN is to load the training data as well as the test data.
- Step 2 − Choose the value of K, i.e. the number of nearest data points to consider. K can be any positive integer.
- Step 3 − For each point in the test data, do the following −
  3.1 − Calculate the distance between the test point and each row of the training data using a metric such as Euclidean, Manhattan or Hamming distance. Euclidean distance is the most commonly used.
  3.2 − Sort the training rows by distance in ascending order.
  3.3 − Choose the top K rows from the sorted array.
  3.4 − Assign the test point the most frequent class among these K rows.
- Step 4 − End
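The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (euclidean, manhattan, knn_classify) and the toy data are assumptions made for this example and do not come from any library −

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_classify(train_X, train_y, query, k=3, distance=euclidean):
    """Predict the class of `query` by majority vote among its k nearest rows."""
    # Step 3.1: distance from the query to every training row
    distances = [(distance(row, query), label)
                 for row, label in zip(train_X, train_y)]
    # Step 3.2: sort by distance, ascending
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the labels of the k closest rows
    nearest_labels = [label for _, label in distances[:k]]
    # Step 3.4: most frequent class among them
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two features per row, two classes
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_X, train_y, (2.0, 2.0), k=3))
print(knn_classify(train_X, train_y, (8.0, 9.0), k=3, distance=manhattan))
```

When two classes tie in the vote, this sketch returns whichever tied class appears first among the nearest rows; choosing an odd K for two-class problems is a common way to avoid such ties.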
Example
The following example illustrates the concept of K and how the KNN algorithm works −
Suppose we have a dataset that can be plotted as follows −
Now, we need to classify the new data point, shown as a black dot at (60, 60), into the blue or the red class. We assume K = 3, so the algorithm finds the three nearest data points, as shown in the next diagram −
The diagram above shows the three nearest neighbors of the black dot. Two of the three lie in the red class, so the black dot is also assigned to the red class.
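The same vote can be reproduced in a few lines of Python. The coordinates below are hypothetical, since the exact values in the plot are not given in the text; they are chosen only so that two of the three nearest neighbors of (60, 60) are red −

```python
import math
from collections import Counter

# Hypothetical (x, y) coordinates with their class labels
points = [
    ((55, 62), "red"),
    ((63, 57), "red"),
    ((58, 66), "blue"),
    ((30, 40), "blue"),
    ((35, 30), "blue"),
    ((85, 80), "red"),
]
query = (60, 60)  # the black dot
k = 3

# Sort all points by Euclidean distance to the query and keep the k closest
nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:k]

# Majority vote among the k nearest neighbors
votes = Counter(label for _, label in nearest)
predicted = votes.most_common(1)[0][0]

for coords, label in nearest:
    print(coords, label, round(math.dist(coords, query), 2))
print("Predicted class:", predicted)
```

With these assumed points, the three nearest neighbors are two red points and one blue point, so the query is assigned to the red class.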
Implementation in Python
As noted above, the K-nearest neighbors (KNN) algorithm can be used for both classification and regression. The following Python recipes use KNN first as a classifier and then as a regressor −
KNN as Classifier
First, import the necessary Python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, set the web address from which the iris dataset will be downloaded −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, read the dataset into a pandas DataFrame and display its first five rows −
dataset = pd.read_csv(path, names=headernames)
dataset.head()
slno. | sepal-length | sepal-width | petal-length | petal-width | Class
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa
Data preprocessing is done with the following script lines, which separate the four feature columns (X) from the class label column (y) −
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we split the data into training and test sets. The following code puts 60% of the dataset into the training set and 40% into the test set −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
Next, scale the features. The scaler is fitted on the training data only and then applied to both the training and test sets −
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
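Scaling matters for KNN because distances are dominated by whichever feature has the largest numeric range. As a rough sketch of what standardization does to a single feature column, the following plain-Python example subtracts the training mean and divides by the training standard deviation. The values and the helper name standardize are hypothetical and used only for illustration −

```python
from statistics import mean, pstdev

# Hypothetical values of one feature in the training and test sets
train_col = [4.0, 5.0, 6.0, 7.0, 8.0]
test_col = [5.0, 9.0]

# The statistics are learned from the training data only
mu = mean(train_col)
sigma = pstdev(train_col)  # population standard deviation


def standardize(values):
    """Rescale values using the training mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


train_scaled = standardize(train_col)
test_scaled = standardize(test_col)

print(train_scaled)  # has mean 0 and standard deviation 1
print(test_scaled)   # rescaled with the same training statistics
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.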
Next, train the model with the KNeighborsClassifier class of sklearn, here with K = 8 −
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
Finally, make predictions on the test set with the following script −
y_pred = classifier.predict(X_test)
Next, print the results as follows −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
Output
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60
Accuracy: 0.8833333333333333
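The accuracy and the per-class figures can be checked by hand from the confusion matrix. The short sketch below copies the matrix from the output above, where rows are true classes and columns are predicted classes; the variable names are chosen for this example only −

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
confusion = [
    [21, 0, 0],   # Iris-setosa
    [0, 16, 0],   # Iris-versicolor
    [0, 7, 16],   # Iris-virginica
]

# Accuracy = correctly classified samples (the diagonal) / all samples
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

# Recall of Iris-virginica: correct predictions / all true virginica samples
recall_virginica = confusion[2][2] / sum(confusion[2])

# Precision of Iris-versicolor: correct predictions / all predicted versicolor
precision_versicolor = confusion[1][1] / sum(row[1] for row in confusion)

print(correct, total, accuracy)
print(round(recall_virginica, 2), round(precision_versicolor, 2))
```

Of the 60 test samples, 53 lie on the diagonal, which gives the reported accuracy of about 0.88. The seven Iris-virginica samples predicted as Iris-versicolor account for both the 0.70 recall of one class and the 0.70 precision of the other.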
KNN as Regressor
First, import the necessary Python packages −
import numpy as np
import pandas as pd
Next, set the web address from which the iris dataset will be downloaded −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, read the dataset into a pandas DataFrame, then take the first two columns (sepal length and sepal width) as features and the third column (petal length) as the target −
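data = pd.read_csv(path, names=headernames)  # the URL is stored in `path`
array = data.values
X = array[:, :2].astype(float)  # features: sepal-length, sepal-width
y = array[:, 2].astype(float)   # target: petal-length
data.shape
Output: (150, 5)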
Next, import KNeighborsRegressor from sklearn and fit the model, here with K = 10 −
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=10)
knnr.fit(X, y)
Finally, we can compute the mean squared error (MSE) as follows. Note that it is measured on the same data the model was fitted on −
print("The MSE is:", np.power(y - knnr.predict(X), 2).mean())
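To make the regression case concrete, here is a minimal plain-Python sketch in which the prediction is the unweighted mean of the target values of the K nearest training rows. The function names (knn_regress, mean_squared_error) and the handful of illustrative rows are assumptions made for this example, not part of sklearn −

```python
import math


def knn_regress(train_X, train_y, query, k=3):
    """Predict a numeric target as the mean target of the k nearest rows."""
    # Rank training rows by Euclidean distance to the query
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    nearest_targets = [target for _, target in ranked[:k]]
    return sum(nearest_targets) / len(nearest_targets)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Illustrative rows: (sepal-length, sepal-width) -> petal-length
train_X = [(5.1, 3.5), (4.9, 3.0), (6.3, 3.3), (5.8, 2.7), (7.0, 3.2), (6.4, 3.2)]
train_y = [1.4, 1.4, 6.0, 5.1, 4.7, 4.5]

prediction = knn_regress(train_X, train_y, (5.0, 3.2), k=2)
print("Prediction:", prediction)

# MSE on the training rows themselves
train_predictions = [knn_regress(train_X, train_y, row, k=2) for row in train_X]
training_mse = mean_squared_error(train_y, train_predictions)
print("Training MSE:", training_mse)
```

Because each training row counts among its own nearest neighbors, an MSE computed on the training data tends to be optimistic; evaluating on a held-out test set gives a fairer estimate.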
Pros and Cons of KNN
Pros
- It is a very simple algorithm to understand and interpret.
- It is very useful for nonlinear data because the algorithm makes no assumption about the data.
- It is a versatile algorithm, as we can use it for classification as well as regression.
- It has relatively high accuracy, although there are much better supervised learning models than KNN.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan approval, that is, whether the individual has characteristics similar to those of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by comparing the individual with people who have similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into classes such as “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’” and “Will Vote for Party ‘BJP’”.
Other areas in which the KNN algorithm can be used include speech recognition, handwriting detection, image recognition and video recognition.