Deep Learning With Keras - A Concise Tutorial

Deep Learning with Keras - Preparing Data

Before we feed the data to our network, it must be converted into the format required by the network. This is called preparing data for the network. It generally consists of converting a multi-dimensional input to a single-dimension vector and normalizing the data points.

Reshaping Input Vector

The images in our dataset consist of 28 x 28 pixels. Each image must be converted into a single-dimensional vector of size 28 * 28 = 784 before it can be fed into our network. We do so by calling the reshape method on the array.

X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)

Now, our training array consists of 60000 data points, each a single-dimension vector of size 784. Similarly, our test array consists of 10000 data points, each a single-dimension vector of size 784.
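
The effect of reshape can be checked on a small synthetic array; the shapes below mirror the MNIST layout, but the data itself is a stand-in for illustration:

```python
import numpy as np

# Synthetic stand-in for the MNIST training images: 6 images of 28 x 28 pixels.
X_demo = np.zeros((6, 28, 28), dtype=np.uint8)

# Flatten each 28 x 28 image into a single vector of 784 values.
X_demo = X_demo.reshape(6, 784)

print(X_demo.shape)   # (6, 784)
```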

Normalizing Data

The input vector currently contains discrete values between 0 and 255 - the gray-scale levels. Normalizing these pixel values to the range 0 to 1 helps speed up the training. As we are going to use stochastic gradient descent, normalizing the data also helps reduce the chance of getting stuck in local optima.

To normalize the data, we represent it as float type and divide it by 255 as shown in the following code snippet −

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
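
The same scaling can be checked on a small synthetic array (the pixel values below are stand-ins chosen to cover the extremes of the gray-scale range):

```python
import numpy as np

# Stand-in pixel values: black, a mid gray, and white.
pixels = np.array([0, 51, 255], dtype=np.uint8)

# Convert to float before dividing; integer division would lose precision.
normalized = pixels.astype('float32') / 255

print(normalized)   # [0.  0.2 1. ]
```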

Let us now look at what the normalized data looks like.

Examining Normalized Data

To view the normalized data, we will call the histogram function as shown here −

import matplotlib.pyplot as plot

plot.hist(X_train[0])
plot.title("Digit: {}".format(y_train[0]))

Here, we plot the histogram of the first element of the X_train vector. We also print the digit represented by this data point. The output of running the above code is shown here −

(Screenshot: histogram of the normalized pixel values for the first training image, titled "Digit: 5")

You will notice a thick density of points with values close to zero. These are the black points in the image, which obviously make up the major portion of the image. The rest of the gray-scale points, which are close to white, represent the digit. You may check out the distribution of pixels for another digit. The code below prints the histogram for the digit at index 2 in the training dataset.

plot.hist(X_train[2])
plot.title("Digit: {}".format(y_train[2]))

The output of running the above code is shown below −

(Screenshot: histogram of pixel values for the training image at index 2, titled "Digit: 4")

Comparing the above two figures, you will notice that the distribution of the white pixels in the two images differs, indicating that they represent different digits - "5" and "4" in the two pictures above.

Next, we will examine the distribution of data in our full training dataset.

Examining Data Distribution

Before we train our machine learning model on our dataset, we should know the distribution of unique digits in it. Our images represent 10 distinct digits ranging from 0 to 9. We would like to know how many occurrences of the digits 0, 1, and so on there are in our dataset. We can get this information by using NumPy's unique method.

Use the following command to print the number of unique values and the number of occurrences of each one −

import numpy as np

print(np.unique(y_train, return_counts=True))

When you run the above command, you will see the following output −

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8), array([5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]))

It shows that there are 10 distinct values — 0 through 9. There are 5923 occurrences of digit 0, 6742 occurrences of digit 1, and so on. The screenshot of the output is shown here −

(Screenshot: the tuple of distinct digit values and their counts, as printed above)
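
If the MNIST dataset is not loaded, the behavior of np.unique with return_counts can be seen on a small synthetic label array (the labels below are an assumption for illustration):

```python
import numpy as np

# Hypothetical label array standing in for y_train.
labels = np.array([0, 1, 1, 2, 2, 2, 5], dtype=np.uint8)

# unique returns the sorted distinct values and, with return_counts=True,
# how many times each one occurs.
values, counts = np.unique(labels, return_counts=True)

print(values)   # [0 1 2 5]
print(counts)   # [1 2 3 1]
```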

As a final step in data preparation, we need to encode our data.

Encoding Data

We have ten categories in our dataset. We will thus encode our output into these ten categories using one-hot encoding. We use the to_categorical method of the Keras np_utils module to perform the encoding. After the output data is encoded, each data point is converted into a single-dimensional vector of size 10. For example, digit 5 is now represented as [0,0,0,0,0,1,0,0,0,0].

Encode the data using the following piece of code −

from keras.utils import np_utils

n_classes = 10
Y_train = np_utils.to_categorical(y_train, n_classes)
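
The behavior of to_categorical can also be sketched in plain NumPy. This is a minimal stand-in for illustration, not the Keras implementation:

```python
import numpy as np

def to_one_hot(y, n_classes):
    # Build a (len(y), n_classes) matrix of zeros and set a single 1 per row.
    encoded = np.zeros((len(y), n_classes), dtype=np.float32)
    encoded[np.arange(len(y)), y] = 1.0
    return encoded

print(to_one_hot(np.array([5]), 10))
# [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]
```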

You may check the result of the encoding by printing the first 5 elements of the encoded Y_train vector.

Use the following code to print the first 5 vectors −

for i in range(5):
   print (Y_train[i])

You will see the following output −

[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]

The first element represents digit 5, the second represents digit 0, and so on.
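
The encoding is reversible: taking np.argmax along each row recovers the original digit labels, which is handy later for turning network predictions back into digits. A small sketch, using the first two vectors shown above:

```python
import numpy as np

# One-hot vectors for digits 5 and 0.
one_hot = np.array([
    [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
    [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
])

# The position of the 1 in each row is the digit itself.
digits = np.argmax(one_hot, axis=1)

print(digits)   # [5 0]
```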

Finally, you will have to encode the test data too, which is done using the following statement −

Y_test = np_utils.to_categorical(y_test, n_classes)

At this stage, your data is fully prepared for feeding into the network.
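
Putting the steps together, the whole preparation stage can be sketched end to end on synthetic data. The shapes mirror MNIST, but the random images and labels here are stand-ins, not the real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10

# Stand-ins for the MNIST arrays: 100 grayscale 28 x 28 images with labels 0-9.
X = rng.integers(0, 256, size=(100, 28, 28)).astype('uint8')
y = rng.integers(0, n_classes, size=100)

# 1. Reshape each image into a 784-element vector.
X = X.reshape(len(X), 784)

# 2. Normalize pixel values to the [0, 1] range.
X = X.astype('float32') / 255

# 3. One-hot encode the labels (plain-NumPy equivalent of to_categorical).
Y = np.zeros((len(y), n_classes), dtype='float32')
Y[np.arange(len(y)), y] = 1.0

print(X.shape, Y.shape)   # (100, 784) (100, 10)
```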

Next comes the most important part - training our network model.