Scikit Learn Tutorial
Scikit Learn - Modelling Process
This chapter deals with the modelling process involved in Sklearn. Let us understand it in detail, beginning with dataset loading.
Dataset Loading
A collection of data is called a dataset. It has the following two components −
Features − The variables of the data are called its features. They are also known as predictors, inputs or attributes.
Feature matrix − The collection of features, in case there is more than one.
Feature Names − The list of all the names of the features.
Response − The output variable that basically depends upon the feature variables. It is also known as the target, label or output.
Response Vector − Used to represent the response column. Generally, we have just one response column.
Target Names − The possible values taken by a response vector.
Scikit-learn has a few example datasets, such as iris and digits for classification and the Boston house prices for regression (note that the Boston dataset has been removed from recent scikit-learn releases).
Example
Following is an example that loads the iris dataset −
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])
Output
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
First 10 rows of X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
Splitting the Dataset
To check the accuracy of our model, we can split the dataset into two pieces: a training set and a testing set. Use the training set to train the model and the testing set to test the model. After that, we can evaluate how well our model performs.
Example
The following example splits the data in a 70:30 ratio, i.e. 70% of the data will be used as training data and 30% as testing data. The dataset is the iris dataset, as in the example above.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.3, random_state=1
)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output
(105, 4)
(45, 4)
(105,)
(45,)
As seen in the example above, the train_test_split() function of scikit-learn is used to split the dataset. This function takes the following arguments −
X, y − Here, X is the feature matrix and y is the response vector, which need to be split.
test_size − This represents the ratio of test data to the total given data. In the above example, we set test_size = 0.3 for the 150 rows of X, which produces test data of 150*0.3 = 45 rows.
random_state − It guarantees that the split will always be the same. This is useful in situations where you want reproducible results, as the sketch below demonstrates.
Train the Model
Next, we can use our dataset to train a prediction model. As discussed, scikit-learn has a wide range of Machine Learning (ML) algorithms, which share a consistent interface for fitting, predicting, and scoring with metrics such as accuracy and recall.
Example
In the example below, we are going to use the KNN (K nearest neighbors) classifier. Don't go into the details of the KNN algorithm, as there will be a separate chapter for that. This example is only meant to make you understand the implementation part.
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.4, random_state=1
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
# Finding accuracy by comparing actual response values (y_test) with predicted response values (y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Providing sample data; the model will make predictions from that data
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
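Output
Accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']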
Model Persistence
Once you train the model, it is desirable to persist it for future use so that we do not need to retrain it again and again. This can be done with the help of the dump and load features of the joblib package.
Consider the example below, in which we will save the above trained model (classifier_knn) for future use −
import joblib  # sklearn.externals.joblib has been removed; use the standalone joblib package
joblib.dump(classifier_knn, 'iris_classifier_knn.joblib')
The above code will save the model into a file named iris_classifier_knn.joblib. Now, the object can be reloaded from the file with the help of the following code −
classifier_knn = joblib.load('iris_classifier_knn.joblib')
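As a minimal usage sketch (assuming the X_test and y_test arrays from the training example above are still in scope), the reloaded estimator behaves exactly like the original classifier_knn −
import joblib
from sklearn import metrics

loaded_knn = joblib.load('iris_classifier_knn.joblib')
# The persisted model predicts without any retraining
y_pred = loaded_knn.predict(X_test)
print("Accuracy of reloaded model:", metrics.accuracy_score(y_test, y_pred))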
Preprocessing the Data
As we are dealing with lots of data, and that data is in raw form, we need to convert it into meaningful data before feeding it to machine learning algorithms. This process is called preprocessing the data. Scikit-learn has a package named preprocessing for this purpose. The preprocessing package has the following techniques −
Binarisation
This preprocessing technique is used when we need to convert our numerical values into Boolean values.
Example
import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_binarized = preprocessing.Binarizer(threshold=0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)
In the above example, we used a threshold value of 0.5; that is why all values above 0.5 are converted to 1, and all values equal to or below 0.5 are converted to 0.
Mean Removal
This technique is used to eliminate the mean from the feature vector so that every feature is centered on zero.
Example
import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
# Displaying the mean and the standard deviation of the input data
print("Mean =", input_data.mean(axis=0))
print("Stddeviation =", input_data.std(axis=0))
# Removing the mean and scaling each feature to unit standard deviation
data_scaled = preprocessing.scale(input_data)
print("Mean_removed =", data_scaled.mean(axis=0))
print("Stddeviation_removed =", data_scaled.std(axis=0))
Scaling
We use this preprocessing technique to scale the feature vectors. Scaling of feature vectors is important because feature values should not be artificially large or small.
Example
import numpy as np
from sklearn import preprocessing
input_data = np.array(
[
[2.1, -1.9, 5.5],
[-1.5, 2.4, 3.5],
[0.5, -7.9, 5.6],
[5.9, 2.3, -5.8]
]
)
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print ("\nMin max scaled data:\n", data_scaled_minmax)
Normalisation
We use this preprocessing technique to modify the feature vectors. Normalisation of feature vectors is necessary so that they can be measured at a common scale. There are two types of normalisation, as follows −
L1 Normalisation
It is also called Least Absolute Deviations. It modifies the values in such a manner that the sum of the absolute values in each row always remains 1. The following example shows the implementation of L1 normalisation on the input data.
Example
import numpy as np
from sklearn import preprocessing
input_data = np.array(
[
[2.1, -1.9, 5.5],
[-1.5, 2.4, 3.5],
[0.5, -7.9, 5.6],
[5.9, 2.3, -5.8]
]
)
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
print("\nL1 normalized data:\n", data_normalized_l1)
Output
L1 normalized data:
[[ 0.22105263 -0.2         0.57894737]
 [-0.2027027   0.32432432  0.47297297]
 [ 0.03571429 -0.56428571  0.4       ]
 [ 0.42142857  0.16428571 -0.41428571]]
L2 Normalisation
Also called Least Squares, it modifies the values in such a manner that the sum of the squares in each row always remains 1. The following example (mirroring the L1 example above, with norm='l2') shows the implementation of L2 normalisation on the input data.
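Example
import numpy as np
from sklearn import preprocessing
input_data = np.array(
   [
      [2.1, -1.9, 5.5],
      [-1.5, 2.4, 3.5],
      [0.5, -7.9, 5.6],
      [5.9, 2.3, -5.8]
   ]
)
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL2 normalized data:\n", data_normalized_l2)
Output
L2 normalized data:
[[ 0.33946114 -0.30713151  0.88906489]
 [-0.33325106  0.53320169  0.7775858 ]
 [ 0.05156558 -0.81473612  0.57753446]
 [ 0.68706914  0.26784051 -0.6754239 ]]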