Artificial Intelligence With Python Tutorial

AI with Python – Data Preparation

We have already studied both supervised and unsupervised machine learning algorithms. These algorithms require formatted data to start the training process. We must prepare or format the data in a certain way so that it can be supplied as input to ML algorithms.

This chapter focuses on data preparation for machine learning algorithms.

Preprocessing the Data

In our daily life, we deal with lots of data, but that data is in raw form. To provide data as input to machine learning algorithms, we need to convert it into meaningful data. That is where data preprocessing comes into the picture. In other words, before providing data to machine learning algorithms, we need to preprocess it.

Data preprocessing steps

Follow these steps to preprocess the data in Python −

Step 1 − Importing the useful packages − If we are using Python, this is the first step for converting the data into a certain format, i.e., preprocessing. It can be done as follows −

import numpy as np
from sklearn import preprocessing

Here we have used the following two packages −

  1. NumPy − Basically, NumPy is a general-purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays.

  2. sklearn.preprocessing − This package provides many common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for machine learning algorithms.

Step 2 − Defining sample data − After importing the packages, we need to define some sample data so that we can apply preprocessing techniques to that data. We will now define the following sample data −

input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

Step 3 − Applying a preprocessing technique − In this step, we need to apply one of the preprocessing techniques.

The following section describes the data preprocessing techniques.

Techniques for Data Preprocessing

The techniques for data preprocessing are described below −

Binarization

This is the preprocessing technique used when we need to convert numerical values into Boolean values. We can use an inbuilt method to binarize the input data, say using 0.5 as the threshold value, in the following way −

data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)

Now, after running the above code, we will get the following output: all the values above 0.5 (the threshold value) are converted to 1, and all the values below 0.5 are converted to 0.

Binarized data:
[[ 1.  0.  1.]
 [ 0.  1.  1.]
 [ 0.  0.  1.]
 [ 1.  1.  0.]]
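
As a quick sanity check, the same result can be reproduced with plain NumPy, since the binarizer simply compares each value against the threshold −

# A minimal NumPy equivalent of the Binarizer step above
manual_binarized = (input_data > 0.5).astype(float)
print(manual_binarized)   # should match the binarized output shown above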

Mean Removal

It is another very common preprocessing technique used in machine learning. Basically, it is used to eliminate the mean from the feature vector so that every feature is centered on zero. We can also remove bias from the features in the feature vector. To apply the mean removal preprocessing technique to the sample data, we can write the Python code shown below. The code will display the mean and standard deviation of the input data −

print("Mean = ", input_data.mean(axis = 0))
print("Std deviation = ", input_data.std(axis = 0))

We will get the following output after running the above lines of code −

         Mean = [ 1.75       -1.275       2.2]
Std deviation = [ 2.71431391  4.20022321  4.69414529]

Now, the code below will remove the mean of the input data and scale it to unit standard deviation −

data_scaled = preprocessing.scale(input_data)
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis = 0))

We will get the following output after running the above lines of code −

         Mean = [ 1.11022302e-16 0.00000000e+00 0.00000000e+00]
Std deviation = [ 1.             1.             1.]
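
Note that preprocessing.scale is a one-shot convenience function. A minimal sketch of the same operation using the reusable StandardScaler class, which is handy when the same scaling must later be applied to new data such as a test set −

# Equivalent mean removal with the StandardScaler class
scaler = preprocessing.StandardScaler()
data_scaled = scaler.fit_transform(input_data)
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))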

Scaling

It is another data preprocessing technique used to scale the feature vectors. Scaling of feature vectors is needed because the values of every feature can vary between many random values. In other words, scaling is important because we do not want any feature to be artificially large or small. With the help of the following Python code, we can scale our input data, i.e., the feature vector −

# Min max scaling
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)

We will get the following output after running the above lines of code −

Min max scaled data:
[[ 0.48648649  0.58252427  0.99122807]
 [ 0.          1.          0.81578947]
 [ 0.27027027  0.          1.        ]
 [ 1.          0.99029126  0.        ]]
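
For reference, MinMaxScaler with feature_range=(0, 1) computes (x − column minimum) / (column maximum − column minimum) for each column. A minimal sketch of the same computation in plain NumPy −

# Manual min-max scaling, column by column
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)
manual_minmax = (input_data - col_min) / (col_max - col_min)
print(manual_minmax)   # should match the scaler output above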

Normalization

It is another data preprocessing technique used to modify the feature vectors. Such modification is necessary to measure the feature vectors on a common scale. The following are two types of normalization that can be used in machine learning −

L1 Normalization

It is also referred to as Least Absolute Deviations. This kind of normalization modifies the values so that the sum of the absolute values in each row is always 1. It can be implemented on the input data with the help of the following Python code −

# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm = 'l1')
print("\nL1 normalized data:\n", data_normalized_l1)

The above line of code generates the following output −

L1 normalized data:
[[ 0.22105263  -0.2          0.57894737]
[ -0.2027027    0.32432432   0.47297297]
[  0.03571429  -0.56428571   0.4       ]
[  0.42142857   0.16428571  -0.41428571]]
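
As a quick sanity check, the absolute values in every L1-normalized row should sum to 1 −

# Verify the L1 property: row-wise sums of absolute values are 1
print(np.abs(data_normalized_l1).sum(axis=1))   # [1. 1. 1. 1.]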

L2 Normalization

It is also referred to as least squares. This kind of normalization modifies the values so that the sum of the squares in each row is always 1. It can be implemented on the input data with the help of the following Python code −

# Normalize data
data_normalized_l2 = preprocessing.normalize(input_data, norm = 'l2')
print("\nL2 normalized data:\n", data_normalized_l2)

The above line of code will generate the following output −

L2 normalized data:
[[ 0.33946114  -0.30713151   0.88906489]
[ -0.33325106   0.53320169   0.7775858 ]
[  0.05156558  -0.81473612   0.57753446]
[  0.68706914   0.26784051  -0.6754239 ]]
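
Similarly, every L2-normalized row should have unit Euclidean length −

# Verify the L2 property: row-wise Euclidean norms are 1
print(np.sqrt((data_normalized_l2 ** 2).sum(axis=1)))   # [1. 1. 1. 1.]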

Labeling the Data

We already know that data in a certain format is necessary for machine learning algorithms. Another important requirement is that the data must be labelled properly before sending it as input to machine learning algorithms. For example, if we talk about classification, there are lots of labels on the data. Those labels are in the form of words, numbers, etc. Functions related to machine learning in sklearn expect the data to have numeric labels. Hence, if the data is in another form, it must be converted to numbers. This process of transforming word labels into numerical form is called label encoding.

Label encoding steps

Follow these steps for encoding the data labels in Python −

Step 1 − Importing the useful packages

If we are using Python, this is the first step for converting the data into a certain format, i.e., preprocessing. It can be done as follows −

import numpy as np
from sklearn import preprocessing

Step 2 − Defining sample labels

After importing the packages, we need to define some sample labels so that we can create and train the label encoder. We will now define the following sample labels −

# Sample input labels
input_labels = ['red','black','red','green','black','yellow','white']

Step 3 − Creating and training the label encoder object

In this step, we need to create the label encoder and train it. The following Python code will help in doing this −

# Creating the label encoder
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

The following would be the output after running the above Python code −

LabelEncoder()
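
The fitted encoder stores the class names in sorted order; the position of each name in this array is the numeric code that transform() will assign. We can inspect it as follows −

# The learned classes; their indices define the numeric codes
print(encoder.classes_)
# ['black' 'green' 'red' 'white' 'yellow']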

Step 4 − Checking the performance by encoding a randomly ordered list

This step can be used to check the performance by encoding a randomly ordered list. The following Python code can be written to do the same −

# encoding a set of labels
test_labels = ['green','red','black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)

The labels would get printed as follows −

Labels = ['green', 'red', 'black']

Now, we can get the list of encoded values, i.e., the word labels converted to numbers, as follows −

print("Encoded values =", list(encoded_values))

The encoded values would get printed as follows −

Encoded values = [1, 2, 0]

Step 5 − Checking the performance by decoding a random set of numbers

This step can be used to check the performance by decoding a random set of numbers. The following Python code can be written to do the same −

# decoding a set of values
encoded_values = [3,0,4,1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)

Now, the encoded values would get printed as follows −

Encoded values = [3, 0, 4, 1]
print("\nDecoded labels =", list(decoded_list))

Now, the decoded values would get printed as follows −

Decoded labels = ['white', 'black', 'yellow', 'green']

Labeled vs. Unlabeled Data

Unlabeled data mainly consists of samples of natural or human-created objects that can easily be obtained from the world. They include audio, video, photos, news articles, etc.

On the other hand, labeled data takes a set of unlabeled data and augments each piece of that unlabeled data with some meaningful tag, label, or class. For example, if we have a photo, the label can be assigned based on the content of the photo, i.e., whether it is a photo of a boy, a girl, an animal, or anything else. Labeling the data requires human expertise or judgment about a given piece of unlabeled data.

There are many scenarios where unlabeled data is plentiful and easily obtained, but labeled data often requires a human expert to annotate it. Semi-supervised learning attempts to combine labeled and unlabeled data to build better models.
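
As an illustration, the following minimal sketch uses scikit-learn's LabelPropagation, which treats samples labeled -1 as unlabeled and infers labels for them from the labeled samples; the tiny dataset here is made up purely for demonstration −

# Semi-supervised learning sketch: -1 marks unlabeled samples
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0, 2.0], [1.1, 2.1], [8.0, 9.0], [8.1, 9.1]])
y = np.array([0, -1, 1, -1])   # only two of the four samples are labeled

model = LabelPropagation()
model.fit(X, y)
print(model.transduction_)     # labels inferred for all four samples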