Scikit Learn Tutorial

Scikit Learn - Extended Linear Modeling

This chapter focuses on the polynomial features and pipelining tools in Sklearn.

Introduction to Polynomial Features

Linear models trained on non-linear functions of the data generally maintain the fast performance of linear methods, while allowing them to fit a much wider range of data. That is why such linear models, trained on non-linear functions, are used in machine learning.

One such example is that a simple linear regression can be extended by constructing polynomial features from the input features.

Mathematically, suppose we have a standard linear regression model; then for 2-D data it would look like this −

\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2

Now, we can combine the features in second-order polynomials and our model will look as follows −

\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2

The above is still a linear model. Here, we see that the resulting polynomial regression is in the same class of linear models and can be solved similarly.
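To make this explicit (a standard derivation spelled out here for clarity, not taken verbatim from the original text), introduce new features z = [x_1, x_2, x_1 x_2, x_1^2, x_2^2]; the second-order model is then an ordinary linear model in z −

\hat{y}(w, z) = w_0 + w_1 z_1 + w_2 z_2 + w_3 z_3 + w_4 z_4 + w_5 z_5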

To do so, scikit-learn provides a module named PolynomialFeatures. This module transforms an input data matrix into a new data matrix of a given degree.

Parameters

The following table lists the parameters used by the PolynomialFeatures module; a small usage sketch illustrating some of them follows the table −

1. degree − integer, default = 2
   It represents the degree of the polynomial features.

2. interaction_only − Boolean, default = False
   By default it is False, but if set to True, only features that are products of at most degree distinct input features are produced. Such features are called interaction features.

3. include_bias − Boolean, default = True
   It includes a bias column, i.e. the feature in which all polynomial powers are zero.

4. order − str in {'C', 'F'}, default = 'C'
   This parameter represents the order of the output array in the dense case. 'F' order is faster to compute, but it may slow down subsequent estimators.
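As a small illustration of these parameters (a sketch added here, not part of the original tutorial), the script below compares the default expansion with one that uses interaction_only=True and include_bias=False on the same input −

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.arange(6).reshape(3, 2)

# Default degree-2 expansion: columns 1, x0, x1, x0^2, x0*x1, x1^2
print(PolynomialFeatures(degree=2).fit_transform(X))

# interaction_only=True keeps only products of distinct features (1, x0, x1, x0*x1);
# include_bias=False additionally drops the constant column, leaving x0, x1, x0*x1
print(PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X))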

Attributes

The following table lists the attributes used by the PolynomialFeatures module; a small sketch showing how to inspect them follows the table −

1. powers_ − array, shape (n_output_features, n_input_features)
   powers_[i, j] is the exponent of the jth input feature in the ith output feature.

2. n_input_features_ − int
   As the name suggests, it gives the total number of input features.

3. n_output_features_ − int
   As the name suggests, it gives the total number of polynomial output features.
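As a brief sketch (assuming a recent scikit-learn, where powers_ and n_output_features_ are available after fitting; note that newer versions expose the input-feature count as n_features_in_ instead of n_input_features_), these attributes can be inspected as follows −

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

poly = PolynomialFeatures(degree=2).fit(np.arange(6).reshape(3, 2))

# One row per output feature, one column per input feature;
# e.g. the row [1, 1] corresponds to the x0*x1 term.
print(poly.powers_)

# Total number of generated polynomial features (6 for degree = 2 with 2 inputs)
print(poly.n_output_features_)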

Implementation Example

The following Python script uses the PolynomialFeatures transformer to transform an array of 8 values, reshaped to (4, 2), into second-degree polynomial features −

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Build a (4, 2) input matrix from the integers 0 to 7
Y = np.arange(8).reshape(4, 2)

# Expand it into all polynomial features up to degree 2
poly = PolynomialFeatures(degree=2)
poly.fit_transform(Y)

Output

array(
   [
      [ 1., 0., 1., 0., 0., 1.],
      [ 1., 2., 3., 4., 6., 9.],
      [ 1., 4., 5., 16., 20., 25.],
      [ 1., 6., 7., 36., 42., 49.]
   ]
)
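Each column of this output corresponds to one polynomial term of the two input columns (bias, x0, x1, x0^2, x0*x1, x1^2). As a quick check (assuming scikit-learn ≥ 1.0, which provides get_feature_names_out), the column names can be printed as follows −

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

Y = np.arange(8).reshape(4, 2)
poly = PolynomialFeatures(degree=2)
poly.fit_transform(Y)

# Prints ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(poly.get_feature_names_out())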

Streamlining using Pipeline tools

The above sort of preprocessing, i.e. transforming an input data matrix into a new data matrix of a given degree, can be streamlined with the Pipeline tools, which are basically used to chain multiple estimators into one.
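As a minimal construction sketch (make_pipeline is a standard scikit-learn convenience not used in the original tutorial), such a chain can also be built without naming the steps explicitly; the step names are then derived from the class names −

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Equivalent to Pipeline([('polynomialfeatures', ...), ('linearregression', ...)])
pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())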

Example

The Python script below uses Scikit-learn's Pipeline tools to streamline the preprocessing and fit an order-3 polynomial to the data.

#First, import the necessary packages.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np

#Next, create an object of Pipeline tool chaining PolynomialFeatures and LinearRegression.
Stream_model = Pipeline([('poly', PolynomialFeatures(degree=3)), ('linear', LinearRegression(fit_intercept=False))])

#Generate sample data from an order-3 polynomial and fit the model.
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
Stream_model = Stream_model.fit(x[:, np.newaxis], y)

#Retrieve the fitted polynomial coefficients from the linear step.
Stream_model.named_steps['linear'].coef_

Output

array([ 3., -2., 1., -1.])

The above output shows that the linear model trained on polynomial features is able to recover the exact input polynomial coefficients.
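As a follow-up usage sketch (the new input points below are illustrative and not part of the original tutorial), the fitted pipeline can also be used directly for prediction; it applies the polynomial expansion and the linear model in a single call −

# Continuing from the fitted Stream_model above
x_new = np.array([5, 6, 7])

# Each prediction equals 3 - 2*x + x**2 - x**3 evaluated at the new points
print(Stream_model.predict(x_new[:, np.newaxis]))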