Python Deep Learning - Quick Guide
Training a Neural Network
We will now learn how to train a neural network. We will also learn the back propagation algorithm and the backward pass in Python Deep Learning.
We have to find the optimal values of the weights of a neural network to get the desired output. To train a neural network, we use the iterative gradient descent method. We start with a random initialization of the weights. After random initialization, we make predictions on some subset of the data with the forward-propagation process, compute the corresponding cost function C, and update each weight w by an amount proportional to dC/dw, i.e., the derivative of the cost function w.r.t. the weight. The proportionality constant is known as the learning rate.
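As a minimal sketch of one such update (the tiny random X, y, and weight vector below are placeholder data, and the model is a bare linear layer rather than a full network):

import numpy as np

# Placeholder mini-batch and randomly initialized weights
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # a small subset of the data
y = rng.normal(size=(8, 1))
w = rng.normal(size=(3, 1))          # random initialization of the weights
learning_rate = 0.01                 # the proportionality constant

predictions = X @ w                  # forward propagation
error = predictions - y
cost = np.mean(error ** 2)           # cost function C (mean squared error)
grad = 2 * X.T @ error / len(X)      # dC/dw
w = w - learning_rate * grad         # update proportional to the gradient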
The gradients can be calculated efficiently using the back-propagation algorithm. The key observation of backward propagation, or back prop, is that because of the chain rule of differentiation, the gradient at each neuron in the neural network can be calculated using the gradients at the neurons it has outgoing edges to. Hence, we calculate the gradients backwards, i.e., we first calculate the gradients of the output layer, then the top-most hidden layer, followed by the preceding hidden layer, and so on, ending at the input layer.
The back-propagation algorithm is implemented mostly using the idea of a computational graph, where each neuron is expanded into many nodes in the computational graph, each performing a simple mathematical operation such as addition or multiplication. The computational graph does not have any weights on its edges; all weights are assigned to nodes, so the weights become their own nodes. The backward propagation algorithm is then run on the computational graph. Once the calculation is complete, only the gradients of the weight nodes are required for the update. The rest of the gradients can be discarded.
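A rough sketch of a backward pass over such a graph, for a single neuron z = w*x + b feeding a squared-error loss (all values below are made up for illustration):

# Forward pass through a tiny computational graph
x, y = 2.0, 1.0              # input and target
w, b = 0.5, 0.1              # weights, which are their own nodes in the graph

z = w * x + b                # multiply node followed by an add node
loss = (z - y) ** 2          # loss node

# Backward pass: each gradient uses the gradient of the node it feeds into
dloss_dz = 2 * (z - y)       # gradient at the output
dloss_dw = dloss_dz * x      # chain rule: dloss/dw = dloss/dz * dz/dw
dloss_db = dloss_dz * 1.0    # chain rule: dloss/db = dloss/dz * dz/db
# Only dloss_dw and dloss_db (the gradients of the weight nodes) are kept for the update.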
Gradient Descent Optimization Technique
One commonly used optimization function that adjusts weights according to the error they caused is called “gradient descent.”
Gradient is another name for slope, and slope, on an x-y graph, represents how two variables are related to each other: the rise over the run, the change in distance over the change in time, etc. In this case, the slope is the ratio between the network’s error and a single weight, i.e., how the error changes as the weight is varied.
To put it more precisely, we want to find the weights that produce the least error. We want to find the weights that correctly represent the signals contained in the input data and translate them to a correct classification.
As a neural network learns, it slowly adjusts many weights so that they can map signal to meaning correctly. The ratio between the network error and each of those weights is a derivative, dE/dw, which measures the extent to which a slight change in a weight causes a slight change in the error.
Each weight is just one factor in a deep network that involves many transforms; the signal of the weight passes through activations and sums over several layers, so we use the chain rule of calculus to work back through the network activations and outputs. This leads us to the weight in question, and its relationship to the overall error.
The two variables, error and weight, are mediated by a third variable, activation, through which the weight is passed. We can calculate how a change in the weight affects a change in the error by first calculating how a change in activation affects a change in the error, and then how a change in the weight affects a change in activation.
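Written with the dE/dw notation used above, this is just the chain rule: if a denotes the activation through which the weight w influences the error E, then dE/dw = (dE/da) × (da/dw), and the same factorization is applied layer by layer as we move back through the network.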
The basic idea in deep learning is nothing more than that: adjusting a model’s weights in response to the error it produces, until you cannot reduce the error any more.
The deep net trains slowly if the gradient value is small and fast if the value is high. Any inaccuracies in training lead to inaccurate outputs. The process of training the net from the output back to the input is called back propagation or back prop. We know that forward propagation starts with the input and works forward. Back prop does the reverse, calculating the gradients from right to left.
Each time we calculate a gradient, we use all the previous gradients up to that point.
Let us start at a node in the output layer. The edge uses the gradient at that node. As we go back into the hidden layers, it gets more complex. The product of two numbers between 0 and 1 gives you a smaller number. The gradient value keeps getting smaller, and as a result, back prop takes a lot of time to train and accuracy suffers.
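A quick numerical illustration of why these products shrink (the per-layer factors below are made-up values between 0 and 1):

# Multiplying several numbers between 0 and 1 gives an ever smaller result
layer_factors = [0.8, 0.5, 0.3, 0.25, 0.2]   # hypothetical per-layer gradient factors

gradient = 1.0
for factor in layer_factors:
    gradient *= factor
    print(gradient)
# The printed values shrink rapidly (roughly 0.8, 0.4, 0.12, 0.03, 0.006),
# which is why the earliest layers receive tiny gradients and train slowly.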
Challenges in Deep Learning Algorithms
There are certain challenges for both shallow neural networks and deep neural networks, such as overfitting and computation time. DNNs are affected by overfitting because they use added layers of abstraction, which allow them to model rare dependencies in the training data.
Regularization methods such as dropout, early stopping, data augmentation, and transfer learning are applied during training to combat overfitting. Dropout regularization randomly omits units from the hidden layers during training, which helps in avoiding rare dependencies. DNNs take into consideration several training parameters, such as the size (i.e., the number of layers and the number of units per layer), the learning rate, and the initial weights. Finding optimal parameters is not always practical due to the high cost in time and computational resources. Several hacks, such as batching, can speed up computation. The large processing power of GPUs has significantly helped the training process, as the matrix and vector computations required are well-executed on GPUs.
Dropout
Dropout is a popular regularization technique for neural networks. Deep neural networks are particularly prone to overfitting.
Let us now see what dropout is and how it works.
In the words of Geoffrey Hinton, one of the pioneers of Deep Learning, ‘If you have a deep neural net and it’s not overfitting, you should probably be using a bigger one and using dropout’.
Dropout is a technique where during each iteration of gradient descent, we drop a set of randomly selected nodes. This means that we ignore some nodes randomly as if they do not exist.
Each neuron is kept with a probability q and dropped randomly with probability 1 − q. The value of q may be different for each layer in the neural network. A value of 0.5 for the hidden layers, and a value close to 1 (such as 0.8) for the input layer, works well on a wide range of tasks.
During evaluation and prediction, no dropout is used. The output of each neuron is multiplied by q so that the input to the next layer has the same expected value.
The idea behind Dropout is as follows − in a neural network without dropout regularization, neurons develop a co-dependency amongst each other, which leads to overfitting.
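A minimal NumPy sketch of this behaviour, with q as the keep probability (the layer size and activation values below are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
q = 0.5                                   # probability of keeping a hidden unit
activations = rng.normal(size=(4, 10))    # hypothetical hidden-layer outputs

# Training: drop each unit with probability 1 - q
mask = rng.random(activations.shape) < q
train_output = activations * mask

# Evaluation/prediction: no dropout; scale by q so the next layer
# receives the same expected value as during training
test_output = activations * q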
Early Stopping
We train neural networks using an iterative algorithm called gradient descent.
The idea behind early stopping is intuitive; we stop training when the error starts to increase. Here, by error, we mean the error measured on validation data, which is the part of the training data used for tuning hyper-parameters. In this case, the hyper-parameter is the stopping criterion.
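A sketch of this criterion inside a training loop, using a made-up sequence of validation errors in place of a real model:

val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.60]   # hypothetical values
best_error = float("inf")
patience, bad_epochs = 2, 0              # tolerate 2 epochs without improvement

for epoch, error in enumerate(val_errors):
    # ... one epoch of gradient descent on the training data would run here ...
    if error < best_error:
        best_error, bad_epochs = error, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # the validation error has started to increase
            print("Stopping early at epoch", epoch)
            break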
Data Augmentation
Data augmentation is the process of increasing the quantum of data we have, or augmenting it, by using the existing data and applying some transformations to it. The exact transformations used depend on the task we intend to achieve. Moreover, the transformations that help the neural net depend on its architecture.
For instance, in many computer vision tasks such as object classification, an effective data augmentation technique is adding new data points that are cropped or translated versions of original data.
When a computer accepts an image as an input, it takes in an array of pixel values. Let us say that the whole image is shifted left by 15 pixels. We apply many different shifts in different directions, resulting in an augmented dataset many times the size of the original dataset.
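A small sketch of that shift using NumPy (the random array below stands in for real pixel data):

import numpy as np

image = np.random.rand(32, 32)            # stand-in for a 32x32 grayscale image

# Shift the whole image 15 pixels to the left, padding the right edge with zeros
shifted = np.zeros_like(image)
shifted[:, :-15] = image[:, 15:]

# Applying many such shifts in different directions to every image
# produces an augmented dataset several times the size of the original.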
Transfer Learning
The process of taking a pre-trained model and “fine-tuning” the model with our own dataset is called transfer learning. There are several ways to do this. A few ways are described below −
- We train the pre-trained model on a large dataset. Then, we remove the last layer of the network and replace it with a new layer with random weights.
- We then freeze the weights of all the other layers and train the network normally. Here, freezing the layers means not changing the weights during gradient descent or optimization.
The concept behind this is that the pre-trained model will act as a feature extractor, and only the last layer will be trained on the current task.
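As an illustrative sketch, assuming Keras is available (the choice of VGG16 and the 10-class output layer are just examples, not part of the text above):

from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# A model pre-trained on a large dataset (ImageNet), with its last layer removed
base = VGG16(weights="imagenet", include_top=False, pooling="avg")
for layer in base.layers:
    layer.trainable = False               # freeze the pre-trained weights

# Replace the removed last layer with a new layer with random weights
model = models.Sequential([
    base,
    layers.Dense(10, activation="softmax")
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# Training now updates only the new last layer; the frozen base acts as a feature extractor.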