Scikit Learn: A Concise Tutorial
Scikit Learn - Conventions
Scikit-learn's objects share a uniform basic API that consists of the following three complementary interfaces:
- Estimator interface: builds and fits models.
- Predictor interface: makes predictions.
- Transformer interface: converts data.
The APIs adopt simple conventions, and the design choices were guided by the goal of avoiding a proliferation of framework code.
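The three interfaces can be sketched on a small example. The choice of StandardScaler, LogisticRegression and the iris dataset is an assumption made here for illustration; any estimator that exposes the same methods would serve.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Estimator interface: fit() learns from the data and returns the estimator.
scaler = StandardScaler().fit(X)

# Transformer interface: transform() converts the data.
X_scaled = scaler.transform(X)

# Predictor interface: predict() returns predictions for the given samples.
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
first_five = clf.predict(X_scaled[:5])

print(X_scaled.shape, first_five.shape)
```

Note that the same object can implement more than one interface: every transformer and predictor is also an estimator, because it has a fit() method.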
Purpose of Conventions
The purpose of the conventions is to make sure that the API sticks to the following broad principles:
Consistency: All objects, whether basic or composite, must share a consistent interface composed of a limited set of methods.
Inspection: Constructor parameters and the parameter values determined by the learning algorithm should be stored and exposed as public attributes.
Non-proliferation of classes: Datasets should be represented as NumPy arrays or SciPy sparse matrices, whereas hyper-parameter names and values should be represented as standard Python strings or numbers, to avoid the proliferation of framework code.
Composition: Algorithms, whether they are expressible as sequences or combinations of transformations of the data or are naturally viewed as meta-algorithms parameterized on other algorithms, should be implemented and composed from existing building blocks.
Sensible defaults: Whenever an operation in scikit-learn requires a user-defined parameter, an appropriate default value is defined. This default should cause the operation to be performed in a sensible way, for example by giving a baseline solution for the task at hand.
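Several of these principles can be seen together in a short sketch. The pipeline below, its step names "scale" and "svc", and the iris dataset are choices made here for illustration and do not come from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Non-proliferation of classes: the dataset is just a pair of NumPy arrays.
X, y = load_iris(return_X_y=True)

# Composition: a pipeline is itself an estimator built from existing blocks.
# The step names "scale" and "svc" are arbitrary labels chosen for this sketch.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Consistency: the composite object is fitted exactly like a basic one.
pipe.fit(X, y)

# Sensible defaults: SVC() was constructed without any arguments.
default_kernel = pipe.named_steps["svc"].kernel

# Inspection: learned values are public attributes ending in an underscore.
feature_means = pipe.named_steps["scale"].mean_
classes = pipe.named_steps["svc"].classes_

print(default_kernel, feature_means.shape, classes)
```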
Various Conventions
The conventions used in Sklearn are explained below:
Type casting
This convention states that, unless otherwise specified, input is cast to float64. The following example illustrates it with the sklearn.random_projection module, which reduces the dimensionality of the data:
Example
import numpy as np
from sklearn import random_projection
rng = np.random.RandomState(0)
X = rng.rand(10, 2000)
# Start from float32 data
X = np.array(X, dtype = 'float32')
X.dtype
transformer = random_projection.GaussianRandomProjection()
# The transformer casts its input while fitting and transforming
X_new = transformer.fit_transform(X)
X_new.dtype
Output
dtype('float32')
dtype('float64')
In the above example, X is float32 and fit_transform(X) casts it to float64. Newer scikit-learn releases preserve float32 for some estimators, so the second dtype may differ on your installation.
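The same casting can be observed with integer input. This is a minimal sketch that assumes StandardScaler as the transformer; the small integer array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Integer input, invented for this sketch.
X_int = np.array([[1, 2], [3, 4], [5, 6]])

# StandardScaler works on floating-point data, so the integers are cast.
X_scaled = StandardScaler().fit_transform(X_int)

print(X_int.dtype.kind, X_scaled.dtype)
```

Checking the dtype of a transformer's output in this way is a quick method for finding out how a given estimator treats a particular input type.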
Refitting & Updating Parameters
The hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() again then refits the estimator and overwrites what was learned by any previous fit. The following example illustrates this:
Example
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y = True)
clf = SVC()
clf.set_params(kernel = 'linear').fit(X, y)
clf.predict(X[:5])
Output
array([0, 0, 0, 0, 0])
After the estimator has been constructed, the above code changes the default rbf kernel to linear via SVC.set_params().
The following code changes the kernel back to rbf, refits the estimator and makes a second prediction.
Example
clf.set_params(kernel = 'rbf', gamma = 'scale').fit(X, y)
clf.predict(X[:5])
Output
array([0, 0, 0, 0, 0])
Complete code
The following is the complete executable program:
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y = True)
clf = SVC()
# Switch from the default rbf kernel to a linear kernel, then fit
clf.set_params(kernel = 'linear').fit(X, y)
print(clf.predict(X[:5]))
# Switch back to the rbf kernel and refit
clf.set_params(kernel = 'rbf', gamma = 'scale').fit(X, y)
print(clf.predict(X[:5]))
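The effect of set_params() can also be checked directly with get_params(). The sketch below is one way to do that; the variable names are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC()
kernel_before = clf.get_params()["kernel"]

# set_params() returns the estimator itself, which is why .fit() can be chained.
same_object = clf.set_params(kernel="linear") is clf
clf.fit(X, y)
kernel_after = clf.get_params()["kernel"]

# coef_ is exposed by SVC only when the kernel is linear.
linear_has_coef = hasattr(clf, "coef_")

# Refitting with the rbf kernel replaces what the earlier fit learned.
clf.set_params(kernel="rbf").fit(X, y)
rbf_has_coef = hasattr(clf, "coef_")

print(kernel_before, kernel_after, same_object, linear_has_coef, rbf_has_coef)
```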
Multiclass & Multilabel fitting
With multiclass classifiers, both the learning task and the prediction task depend on the format of the target data that the classifier is fit upon. The module used is sklearn.multiclass. In the example below, a multiclass classifier is fit on a 1d array.
Example
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [3, 4], [4, 5], [5, 2], [1, 1]]
y = [0, 0, 1, 1, 2]
classif = OneVsRestClassifier(estimator = SVC(gamma = 'scale',random_state = 0))
classif.fit(X, y).predict(X)
Output
array([0, 0, 1, 1, 2])
In the above example, the classifier is fit on a one-dimensional array of multiclass labels, so the predict() method returns the corresponding multiclass predictions. It is also possible to fit on a two-dimensional array of binary label indicators, as follows:
Example
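from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [3, 4], [4, 5], [5, 2], [1, 1]]
y = [0, 0, 1, 1, 2]
# Convert the 1d labels into a 2d array of binary label indicators
y = LabelBinarizer().fit_transform(y)
classif = OneVsRestClassifier(estimator = SVC(gamma = 'scale', random_state = 0))
classif.fit(X, y).predict(X)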
Output
array([[0, 0, 0],
       [0, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 0]])
In the output above, a row of all zeros means that the instance matched none of the labels fit upon. Similarly, in the case of multilabel fitting, an instance can be assigned multiple labels, as follows:
Example
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
Output
array([[1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 1, 0],
       [1, 0, 1, 1, 0],
       [1, 0, 1, 0, 0]])
In the above example, sklearn.preprocessing.MultiLabelBinarizer binarizes the two-dimensional array of multilabels so that the classifier can be fit on it. That is why predict() returns a 2d array with multiple labels for each instance.
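To see what the classifier is actually fit upon, the two binarizers can be run on their own. The sketch below reuses the label lists from the examples above; the variable names are chosen here for illustration.

```python
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Multiclass labels: one class per instance, one indicator column per class.
y_multiclass = [0, 0, 1, 1, 2]
Y_indicator = LabelBinarizer().fit_transform(y_multiclass)

# Multilabel targets: several labels per instance, one column per label.
y_multilabel = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
mlb = MultiLabelBinarizer()
Y_multilabel = mlb.fit_transform(y_multilabel)

print(Y_indicator)
print(Y_multilabel)

# inverse_transform() maps indicator rows back to tuples of labels.
print(mlb.inverse_transform(Y_multilabel))
```

Each row of the multiclass indicator array contains exactly one 1, while a row of the multilabel array may contain several, which is the difference that predict() reflects in its output.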