Scikit Learn Tutorial
Scikit Learn - K-Nearest Neighbors (KNN)
This chapter will help you understand the nearest neighbor methods in Sklearn.
Neighbor-based learning methods are of two types, namely supervised and unsupervised. Supervised neighbors-based learning can be used for both classification and regression predictive problems, but it is mainly used for classification problems in industry.
Neighbors-based learning methods do not have a specialised training phase; they use all the training data at classification time. They also make no assumptions about the underlying data. That is why they are lazy and non-parametric in nature.
The main principle behind nearest neighbor methods is −
- To find a predefined number of training samples closest in distance to the new data point.

- To predict the label from these training samples.
Here, the number of samples can be a user-defined constant, as in K-nearest neighbor learning, or vary based on the local density of points, as in radius-based neighbor learning.
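The contrast between the two supervised flavours can be sketched with scikit-learn's two classifier classes; the toy data and parameter values below are illustrative, not from the original text.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Toy 2-D data: two well-separated clusters with labels 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
y = np.array([0, 0, 0, 1, 1, 1])

# K-nearest neighbors: a fixed, user-defined number of neighbors (k = 3) vote
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Radius-based neighbors: however many neighbors fall within radius 1.0 vote,
# so the effective number of voters varies with the local density of points
rnn = RadiusNeighborsClassifier(radius=1.0).fit(X, y)

print(knn.predict([[0.1, 0.1]]))  # query near the first cluster
print(rnn.predict([[3.0, 3.0]]))  # query near the second cluster
```

Both estimators expose the same `fit`/`predict` interface; only the rule for selecting the voting neighbors differs.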
sklearn.neighbors Module
Scikit-learn has the sklearn.neighbors module, which provides functionality for both unsupervised and supervised neighbors-based learning methods. As input, the classes in this module can handle either NumPy arrays or scipy.sparse matrices.
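As a minimal sketch of the unsupervised interface with both input types (the data points are made up), the same query can be run on a dense NumPy array and on a scipy.sparse matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])

# Dense NumPy input
nn_dense = NearestNeighbors(n_neighbors=2).fit(X)
dist, idx = nn_dense.kneighbors([[0.1, 0.0]])

# The same data as a scipy.sparse matrix (sparse input falls back to
# brute-force search internally, since the tree methods need dense data)
nn_sparse = NearestNeighbors(n_neighbors=2).fit(csr_matrix(X))
dist_s, idx_s = nn_sparse.kneighbors(csr_matrix([[0.1, 0.0]]))

print(idx, idx_s)  # both agree on which points are nearest
```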
Types of algorithms
The different types of algorithms that can be used in the implementation of neighbor-based methods are as follows −
Brute Force
The brute-force computation of distances between all pairs of points in the dataset provides the most naïve neighbor search implementation. Mathematically, for N samples in D dimensions, the brute-force approach scales as O[DN²].
For small data samples, this algorithm can be very useful, but it becomes infeasible as the number of samples grows. Brute-force neighbor search is enabled by writing the keyword algorithm='brute'.
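A small sketch of what algorithm='brute' computes: the full N×N pairwise distance matrix, built here by hand with NumPy broadcasting, yields exactly the neighbors the estimator reports. The dataset is random toy data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(20, 3)  # N = 20 samples in D = 3 dimensions

nbrs = NearestNeighbors(n_neighbors=3, algorithm='brute').fit(X)
distances, indices = nbrs.kneighbors(X)

# Brute force examines every pair of points: an N x N distance matrix,
# which is where the O[DN^2] cost comes from
pairwise = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

# Each point's nearest neighbor is itself (distance 0)
print(indices[:, 0])
```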
K-D Tree
The KD tree is one of the tree-based data structures invented to address the computational inefficiency of the brute-force approach. Basically, the KD tree is a binary tree structure called a K-dimensional tree. It recursively partitions the parameter space along the data axes, dividing it into nested orthotropic regions into which the data points are filed.
Advantages
Following are some advantages of the K-D tree algorithm −
Construction is fast − As the partitioning is performed only along the data axes, K-D tree construction is very fast.
Fewer distance computations − This algorithm needs very few distance computations to determine the nearest neighbor of a query point: only O[log(N)] distance computations.
Disadvantages
Fast only for low-dimensional neighbor searches − It is very fast for low-dimensional (D < 20) neighbor searches, but it becomes inefficient as D grows, because partitioning only along the data axes no longer prunes the search space effectively in high dimensions.
K-D tree neighbor search is enabled by writing the keyword algorithm='kd_tree'.
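Besides the algorithm='kd_tree' keyword on the estimator, scikit-learn also exposes the tree directly as sklearn.neighbors.KDTree; a quick sketch on random low-dimensional data shows both routes return the same neighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, KDTree

rng = np.random.RandomState(42)
X = rng.rand(100, 3)  # low-dimensional data, where KD trees are fast

# Via the estimator interface
nbrs = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)
dist_est, idx_est = nbrs.kneighbors(X[:1])

# Or by building the tree directly and querying it
tree = KDTree(X, leaf_size=30)
dist_tree, idx_tree = tree.query(X[:1], k=5)

print(idx_est, idx_tree)  # identical neighbor indices
```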
Ball Tree
As we know, the KD tree is inefficient in higher dimensions; hence, to address this inefficiency of the KD tree, the Ball tree data structure was developed. Mathematically, it recursively divides the data into nodes defined by a centroid C and radius r, in such a way that each point in the node lies within the hyper-sphere defined by centroid C and radius r. It uses the triangle inequality, given below, which reduces the number of candidate points for a neighbor search −

|X + Y| ≤ |X| + |Y|
Advantages
Following are some advantages of the Ball Tree algorithm −
Efficient on highly structured data − As the ball tree partitions the data into a series of nesting hyper-spheres, it is efficient on highly structured data.
Out-performs KD-tree − The ball tree out-performs the KD tree in high dimensions because of the spherical geometry of its nodes.
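The ball tree is likewise available directly as sklearn.neighbors.BallTree (the equivalent estimator keyword is algorithm='ball_tree'). A minimal sketch on random higher-dimensional toy data:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.rand(200, 40)  # higher-dimensional data, where ball trees cope better

# Build the tree of nested hyper-spheres and query the first three points
tree = BallTree(X, leaf_size=40)
dist, idx = tree.query(X[:3], k=2)

# Each query point's nearest neighbor is itself, at distance 0
print(idx[:, 0], dist[:, 0])
```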
Choosing Nearest Neighbors Algorithm
The choice of an optimal algorithm for a given dataset depends upon the following factors −
Number of samples (N) and Dimensionality (D)
These are the most important factors to consider while choosing a nearest neighbor algorithm, for the reasons given below −
- The query time of the Brute Force algorithm grows as O[DN].

- The query time of the Ball tree algorithm grows as O[D log(N)].

- The query time of the KD tree algorithm changes with D in a way that is difficult to characterize. When D < 20, the cost is O[D log(N)] and the algorithm is very efficient. When D > 20, however, it becomes inefficient because the cost increases to nearly O[DN].
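Whichever algorithm is chosen, the neighbors returned are identical; only the query cost differs. In practice algorithm='auto' lets scikit-learn pick based on factors like these. A sketch on random toy data cross-checking all options:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(1)
X = rng.rand(50, 4)  # N = 50 samples, D = 4 dimensions

results = {}
for algo in ('brute', 'kd_tree', 'ball_tree', 'auto'):
    nbrs = NearestNeighbors(n_neighbors=4, algorithm=algo).fit(X)
    _, idx = nbrs.kneighbors(X)
    results[algo] = idx

# All four settings agree on the neighbor indices
print(all((results[a] == results['brute']).all() for a in results))
```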
Data Structure
Another factor that affects the performance of these algorithms is the intrinsic dimensionality of the data or its sparsity. This is because the query times of the Ball tree and KD tree algorithms can be greatly influenced by it, whereas the query time of the Brute Force algorithm is unchanged by data structure. Generally, the Ball tree and KD tree algorithms produce faster query times when implemented on sparser data with a smaller intrinsic dimensionality.
Number of Neighbors (k)
The number of neighbors (k) requested for a query point affects the query time of the Ball tree and KD tree algorithms: their query time becomes slower as k increases. The query time of Brute Force, in contrast, remains unaffected by the value of k.
Number of query points
Because they require a construction phase, both the KD tree and Ball tree algorithms are effective when there is a large number of query points over which the construction cost can be amortized. With a smaller number of query points, the Brute Force algorithm performs better than the KD tree and Ball tree algorithms.