Apache MXNet - Distributed Training

This chapter is about distributed training in Apache MXNet. Let us start by understanding the modes of computation in MXNet.

Modes of Computation

MXNet, a multi-language ML library, offers its users the following two modes of computation −

Imperative mode

This mode of computation exposes an interface like the NumPy API. For example, in MXNet, use the following imperative code to construct a tensor of zeros on both the CPU and the GPU −

import mxnet as mx

# Allocate a zero-filled tensor on the CPU and another on the first GPU
tensor_cpu = mx.nd.zeros((100,), ctx=mx.cpu())
tensor_gpu = mx.nd.zeros((100,), ctx=mx.gpu(0))

As we see in the above code, MXNet specifies the location where the tensor is to be held, either on a CPU or a GPU device. In the above example, the GPU at location 0 is used. MXNet achieves incredible utilisation of the device, because all the computations happen lazily instead of instantaneously.
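
Below is a minimal sketch of this lazy behaviour: the operation call returns immediately, and the result is only materialised when it is explicitly read back. The array size used here is arbitrary −

import mxnet as mx

a = mx.nd.ones((100,), ctx=mx.cpu())
b = a * 2 + 1        # returns immediately; the computation is queued asynchronously
print(b.asnumpy())   # asnumpy() (or wait_to_read()) blocks until the result is ready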

Symbolic mode

Although the imperative mode is quite useful, one of its drawbacks is its rigidity, i.e. all the computations need to be known beforehand, along with pre-defined data structures.

On the other hand, Symbolic mode exposes a computation graph like TensorFlow. It removes the drawback of the imperative API by allowing MXNet to work with symbols or variables instead of fixed/pre-defined data structures. Afterwards, the symbols can be interpreted as a set of operations as follows −

import mxnet as mx

# Declare symbolic placeholders; no actual data is involved yet
x = mx.sym.Variable("X")
y = mx.sym.Variable("Y")
z = (x + y)
m = z / 100
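
To actually run the graph, concrete arrays can be bound to the symbolic variables and the symbol evaluated. A minimal sketch, assuming small example inputs −

import mxnet as mx

x = mx.sym.Variable("X")
y = mx.sym.Variable("Y")
m = (x + y) / 100

# Bind concrete NDArrays to the symbolic variables and execute the graph
out = m.eval(ctx=mx.cpu(),
             X=mx.nd.ones((3,)),
             Y=mx.nd.ones((3,)))
print(out[0].asnumpy())   # eval returns a list of output NDArrays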

Kinds of Parallelism

Apache MXNet supports distributed training. It enables us to leverage multiple machines for faster as well as more effective training.

Following are the two ways in which we can distribute the workload of training a neural network across multiple devices, whether CPU or GPU −

Data Parallelism

In this kind of parallelism, each device stores a complete copy of the model and works with a different part of the dataset. Devices also update a shared model collectively. We can locate all the devices on a single machine or across multiple machines.
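
As an illustration, the sketch below shows the single-machine, multi-device flavour of data parallelism with Gluon. It assumes two GPUs and a toy dense network, so the shapes, learning rate and batch size are purely illustrative −

import mxnet as mx
from mxnet import autograd, gluon

devices = [mx.gpu(0), mx.gpu(1)]     # assumed: two GPUs on one machine
net = gluon.nn.Dense(10)
net.initialize(ctx=devices)          # every device holds a complete copy of the model
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

data = mx.nd.random.uniform(shape=(8, 20))
label = mx.nd.array([0, 1, 2, 3, 4, 5, 6, 7])

# Each device works on a different slice of the batch
data_parts = gluon.utils.split_and_load(data, devices)
label_parts = gluon.utils.split_and_load(label, devices)

with autograd.record():
    losses = [loss_fn(net(X), y) for X, y in zip(data_parts, label_parts)]
for l in losses:
    l.backward()
trainer.step(batch_size=8)           # gradients from all devices update the shared model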

Model Parallelism

It is another kind of parallelism, which comes in handy when models are so large that they do not fit into device memory. In model parallelism, different devices are assigned the task of learning different parts of the model. The important point to note here is that currently Apache MXNet supports model parallelism on a single machine only.

Working of distributed training

The concepts given below are key to understanding the working of distributed training in Apache MXNet −

Types of processes

Processes communicate with each other to accomplish the training of a model. Apache MXNet has the following three processes −

Worker

The job of a worker node is to perform training on a batch of training samples. The worker nodes pull weights from the server before processing every batch, and send gradients to the server once the batch is processed.

Server

MXNet can have multiple servers for storing the model’s parameters and communicating with the worker nodes.

Scheduler

The role of the scheduler is to set up the cluster, which includes waiting for the message that each node has come up and for the port the node is listening on. After setting up the cluster, the scheduler lets all the processes know about every other node in the cluster, so that the processes can communicate with each other. There is only one scheduler.
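
These processes are typically started by a launcher script, and each process discovers its role and the scheduler's address through environment variables. The snippet below is only an illustrative sketch; the addresses and counts are placeholders, and in practice the launcher exports these variables before starting each process −

import os

# Illustrative values only; a launcher normally sets these for every process
os.environ['DMLC_ROLE'] = 'worker'           # 'worker', 'server' or 'scheduler'
os.environ['DMLC_PS_ROOT_URI'] = '10.0.0.1'  # IP address of the scheduler
os.environ['DMLC_PS_ROOT_PORT'] = '9000'     # port the scheduler listens on
os.environ['DMLC_NUM_WORKER'] = '2'          # number of worker processes
os.environ['DMLC_NUM_SERVER'] = '2'          # number of server processes

import mxnet as mx
kv = mx.kv.create('dist_sync')               # this process now joins the cluster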

KV Store

KVStore stands for Key-Value store. It is a critical component used for multi-device training. It is important because the communication of parameters across devices, on a single machine as well as across multiple machines, is transmitted through one or more servers holding a KVStore for the parameters. Let’s understand the working of KVStore with the help of the following points −

  1. Each entry in the KVStore is represented by a key and a value.

  2. Each parameter array in the network is assigned a key, and the weights of that parameter array are referred to by the value.

  3. The worker nodes push gradients after processing a batch, and pull updated weights before processing a new batch, as sketched below.
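
A minimal single-machine sketch of this push/pull pattern follows; a local KVStore is used purely for illustration, and the key and shape are arbitrary −

import mxnet as mx

kv = mx.kv.create('local')          # local KVStore, used here only for illustration
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))       # key 3 holds the weights of one parameter array

kv.push(3, mx.nd.ones(shape) * 8)   # a worker pushes gradients (values) for key 3
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                 # a worker pulls the current value of key 3
print(out.asnumpy())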

The notion of a KVStore server exists only during distributed training, and its distributed mode is enabled by calling the mxnet.kvstore.create function with a string argument containing the word dist −

kv = mxnet.kvstore.create('dist_sync')

Distribution of Keys

It is not necessary that all the servers store all the parameter arrays or keys; rather, they are distributed across different servers. Such distribution of keys across different servers is handled transparently by the KVStore, and the decision of which server stores a specific key is made at random.

KVStore, as discussed above, ensures that whenever a key is pulled, its request is sent to the server which has the corresponding value. What if the value of some key is large? In that case, it may be shared across different servers.

Split training data

As users, we want each machine to work on a different part of the dataset, especially when running distributed training in data parallel mode. We know that, to split a batch of samples provided by the data iterator for data parallel training on a single worker, we can use mxnet.gluon.utils.split_and_load and then load each part of the batch on the device which will process it further.

On the other hand, in the case of distributed training, at the beginning we need to divide the dataset into n different parts so that every worker gets a different part. Once it has its part, each worker can then use split_and_load to again divide that part of the dataset across the different devices on a single machine. All this happens through the data iterator. mxnet.io.MNISTIterator and mxnet.io.ImageRecordIter are two such iterators in MXNet that support this feature.
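
For example, record-file iterators such as mxnet.io.ImageRecordIter accept num_parts and part_index arguments, so each worker reads only its own slice of the dataset. The file path and shapes below are placeholders −

import mxnet as mx

kv = mx.kv.create('dist_sync')      # assumes the process was started by a launcher

# Each worker reads only its own 1/num_workers portion of the record file
train_iter = mx.io.ImageRecordIter(
   path_imgrec='train.rec',         # placeholder path to the dataset
   data_shape=(3, 224, 224),
   batch_size=128,
   num_parts=kv.num_workers,        # total number of workers
   part_index=kv.rank)              # index of this worker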

Weights updating

For updating the weights, KVStore supports the following two modes −

  1. In the first method, the server aggregates the gradients and updates the weights by using those gradients.

  2. In the second method, the server only aggregates the gradients; the weights are then updated on the workers.

If you are using Gluon, there is an option to choose between the above-stated methods by passing the update_on_kvstore variable. Let’s understand it by creating the trainer object as follows −

# update_on_kvstore=True asks the KVStore (server) to perform the weight updates
trainer = gluon.Trainer(net.collect_params(), optimizer='sgd',
   optimizer_params={'learning_rate': opt.lr,
      'wd': opt.wd,
      'momentum': opt.momentum,
      'multi_precision': True},
   kvstore=kv,
   update_on_kvstore=True)

Modes of Distributed Training

If the KVStore creation string contains the word dist, it means that distributed training is enabled. Following are the different modes of distributed training that can be enabled by using different types of KVStore −

dist_sync

As the name implies, it denotes synchronous distributed training. In this mode, all the workers use the same synchronized set of model parameters at the start of every batch.

The drawback of this mode is that, after each batch, the server has to wait to receive gradients from every worker before it updates the model parameters. This means that if a worker crashes, it halts the progress of all workers.

dist_async

As the name implies, it denotes asynchronous distributed training. In this mode, the server receives gradients from one worker and immediately updates its store. The server uses the updated store to respond to any further pulls.

The advantage, in comparison with dist_sync mode, is that a worker which finishes processing a batch can pull the current parameters from the server and start the next batch, even if the other workers have not yet finished processing the earlier batch. It is also faster than dist_sync mode because there is no cost of synchronization, although it may take more epochs to converge.

dist_sync_device

This mode is the same as dist_sync mode. The only difference is that, when multiple GPUs are being used on every node, dist_sync_device aggregates gradients and updates weights on the GPU, whereas dist_sync aggregates gradients and updates weights in CPU memory.

It reduces expensive communication between the GPU and the CPU, which is why it is faster than dist_sync. The drawback is that it increases memory usage on the GPU.

dist_async_device

This mode works the same as dist_sync_device mode, but asynchronously.