.. _sec_multi_gpu:
Training on Multiple GPUs
=========================
So far we discussed how to train models efficiently on CPUs and GPUs. We
even showed how deep learning frameworks allow one to parallelize
computation and communication automatically between them in
:numref:`sec_auto_para`. We also showed in :numref:`sec_use_gpu` how
to list all the available GPUs on a computer using the ``nvidia-smi``
command. What we did *not* discuss is how to actually parallelize deep
learning training. Instead, we implied in passing that one would somehow
split the data across multiple devices and make it work. The present
section fills in the details and shows how to train a network in
parallel when starting from scratch. Details on how to take advantage of
functionality in high-level APIs is relegated to
:numref:`sec_multi_gpu_concise`. We assume that you are familiar with
minibatch stochastic gradient descent algorithms such as the ones
described in :numref:`sec_minibatch_sgd`.
Splitting the Problem
---------------------
Let us start with a simple computer vision problem and a slightly
archaic network, e.g., with multiple layers of convolutions, pooling,
and possibly a few fully-connected layers in the end. That is, let us
start with a network that looks quite similar to LeNet
:cite:`LeCun.Bottou.Bengio.ea.1998` or AlexNet
:cite:`Krizhevsky.Sutskever.Hinton.2012`. Given multiple GPUs (2 if it
is a desktop server, 4 on an AWS g4dn.12xlarge instance, 8 on a
p3.16xlarge, or 16 on a p2.16xlarge), we want to partition training in a
manner as to achieve good speedup while simultaneously benefitting from
simple and reproducible design choices. Multiple GPUs, after all,
increase both *memory* and *computation* ability. In a nutshell, we have
the following choices, given a minibatch of training data that we want
to classify.
First, we could partition the network across multiple GPUs. That is,
each GPU takes as input the data flowing into a particular layer,
processes data across a number of subsequent layers and then sends the
data to the next GPU. This allows us to process data with larger
networks when compared with what a single GPU could handle. Besides,
memory footprint per GPU can be well controlled (it is a fraction of the
total network footprint).
However, the interface between layers (and thus GPUs) requires tight
synchronization. This can be tricky, in particular if the computational
workloads are not properly matched between layers. The problem is
exacerbated for large numbers of GPUs. The interface between layers also
requires large amounts of data transfer, such as activations and
gradients. This may overwhelm the bandwidth of the GPU buses. Moreover,
compute-intensive, yet sequential operations are nontrivial to
partition. See e.g., :cite:`Mirhoseini.Pham.Le.ea.2017` for a best
effort in this regard. It remains a difficult problem and it is unclear
whether it is possible to achieve good (linear) scaling on nontrivial
problems. We do not recommend it unless there is excellent framework or
operating system support for chaining together multiple GPUs.
Second, we could split the work layerwise. For instance, rather than
computing 64 channels on a single GPU we could split up the problem
across 4 GPUs, each of which generates data for 16 channels. Likewise,
for a fully-connected layer we could split the number of output units.
:numref:`fig_alexnet_original` (taken from
:cite:`Krizhevsky.Sutskever.Hinton.2012`) illustrates this design,
where this strategy was used to deal with GPUs that had a very small
memory footprint (2 GB at the time). This allows for good scaling in
terms of computation, provided that the number of channels (or units) is
not too small. Besides, multiple GPUs can process increasingly larger
networks since the available memory scales linearly.
.. _fig_alexnet_original:
.. figure:: ../img/alexnet-original.svg
   Model parallelism in the original AlexNet design due to limited GPU
   memory.
However, we need a *very large* number of synchronization or barrier
operations since each layer depends on the results from all the other
layers. Moreover, the amount of data that needs to be transferred is
potentially even larger than when distributing layers across GPUs. Thus,
we do not recommend this approach due to its bandwidth cost and
complexity.
Last, we could partition data across multiple GPUs. This way all GPUs
perform the same type of work, albeit on different observations.
Gradients are aggregated across GPUs after each minibatch of training
data. This is the simplest approach and it can be applied in any
situation. We only need to synchronize after each minibatch. That said,
it is highly desirable to start exchanging gradients parameters already
while others are still being computed. Moreover, larger numbers of GPUs
lead to larger minibatch sizes, thus increasing training efficiency.
However, adding more GPUs does not allow us to train larger models.
.. _fig_splitting:
.. figure:: ../img/splitting.svg
   Parallelization on multiple GPUs. From left to right: original
   problem, network partitioning, layerwise partitioning, data
   parallelism.
A comparison of different ways of parallelization on multiple GPUs is
depicted in :numref:`fig_splitting`. By and large, data parallelism is
the most convenient way to proceed, provided that we have access to GPUs
with sufficiently large memory. See also
:cite:`Li.Andersen.Park.ea.2014` for a detailed description of
partitioning for distributed training. GPU memory used to be a problem
in the early days of deep learning. By now this issue has been resolved
for all but the most unusual cases. We focus on data parallelism in what
follows.
Data Parallelism
----------------
Assume that there are :math:`k` GPUs on a machine. Given the model to be
trained, each GPU will maintain a complete set of model parameters
independently though parameter values across the GPUs are identical and
synchronized. As an example, :numref:`fig_data_parallel` illustrates
training with data parallelism when :math:`k=2`.
.. _fig_data_parallel:
.. figure:: ../img/data-parallel.svg
   Calculation of minibatch stochastic gradient descent using data
   parallelism on two GPUs.
In general, the training proceeds as follows:
-  In any iteration of training, given a random minibatch, we split the
   examples in the batch into :math:`k` portions and distribute them
   evenly across the GPUs.
-  Each GPU calculates loss and gradient of the model parameters based
   on the minibatch subset it was assigned.
-  The local gradients of each of the :math:`k` GPUs are aggregated to
   obtain the current minibatch stochastic gradient.
-  The aggregate gradient is re-distributed to each GPU.
-  Each GPU uses this minibatch stochastic gradient to update the
   complete set of model parameters that it maintains.
Note that in practice we *increase* the minibatch size :math:`k`-fold
when training on :math:`k` GPUs such that each GPU has the same amount
of work to do as if we were training on a single GPU only. On a 16-GPU
server this can increase the minibatch size considerably and we may have
to increase the learning rate accordingly. Also note that batch
normalization in :numref:`sec_batch_norm` needs to be adjusted, e.g.,
by keeping a separate batch normalization coefficient per GPU. In what
follows we will use a toy network to illustrate multi-GPU training.
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    %matplotlib inline
    from mxnet import autograd, gluon, np, npx
    from d2l import mxnet as d2l
    
    npx.set_np()
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    %matplotlib inline
    import torch
    from torch import nn
    from torch.nn import functional as F
    from d2l import torch as d2l
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    # Initialize model parameters
    scale = 0.01
    W1 = np.random.normal(scale=scale, size=(20, 1, 3, 3))
    b1 = np.zeros(20)
    W2 = np.random.normal(scale=scale, size=(50, 20, 5, 5))
    b2 = np.zeros(50)
    W3 = np.random.normal(scale=scale, size=(800, 128))
    b3 = np.zeros(128)
    W4 = np.random.normal(scale=scale, size=(128, 10))
    b4 = np.zeros(10)
    params = [W1, b1, W2, b2, W3, b3, W4, b4]
    
    # Define the model
    def lenet(X, params):
        h1_conv = npx.convolution(data=X, weight=params[0], bias=params[1],
                                  kernel=(3, 3), num_filter=20)
        h1_activation = npx.relu(h1_conv)
        h1 = npx.pooling(data=h1_activation, pool_type='avg', kernel=(2, 2),
                         stride=(2, 2))
        h2_conv = npx.convolution(data=h1, weight=params[2], bias=params[3],
                                  kernel=(5, 5), num_filter=50)
        h2_activation = npx.relu(h2_conv)
        h2 = npx.pooling(data=h2_activation, pool_type='avg', kernel=(2, 2),
                         stride=(2, 2))
        h2 = h2.reshape(h2.shape[0], -1)
        h3_linear = np.dot(h2, params[4]) + params[5]
        h3 = npx.relu(h3_linear)
        y_hat = np.dot(h3, params[6]) + params[7]
        return y_hat
    
    # Cross-entropy loss function
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    # Initialize model parameters
    scale = 0.01
    W1 = torch.randn(size=(20, 1, 3, 3)) * scale
    b1 = torch.zeros(20)
    W2 = torch.randn(size=(50, 20, 5, 5)) * scale
    b2 = torch.zeros(50)
    W3 = torch.randn(size=(800, 128)) * scale
    b3 = torch.zeros(128)
    W4 = torch.randn(size=(128, 10)) * scale
    b4 = torch.zeros(10)
    params = [W1, b1, W2, b2, W3, b3, W4, b4]
    
    # Define the model
    def lenet(X, params):
        h1_conv = F.conv2d(input=X, weight=params[0], bias=params[1])
        h1_activation = F.relu(h1_conv)
        h1 = F.avg_pool2d(input=h1_activation, kernel_size=(2, 2), stride=(2, 2))
        h2_conv = F.conv2d(input=h1, weight=params[2], bias=params[3])
        h2_activation = F.relu(h2_conv)
        h2 = F.avg_pool2d(input=h2_activation, kernel_size=(2, 2), stride=(2, 2))
        h2 = h2.reshape(h2.shape[0], -1)
        h3_linear = torch.mm(h2, params[4]) + params[5]
        h3 = F.relu(h3_linear)
        y_hat = torch.mm(h3, params[6]) + params[7]
        return y_hat
    
    # Cross-entropy loss function
    loss = nn.CrossEntropyLoss(reduction='none')
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def get_params(params, device):
        new_params = [p.copyto(device) for p in params]
        for p in new_params:
            p.attach_grad()
        return new_params
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def get_params(params, device):
        new_params = [p.to(device) for p in params]
        for p in new_params:
            p.requires_grad_()
        return new_params
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    new_params = get_params(params, d2l.try_gpu(0))
    print('b1 weight:', new_params[1])
    print('b1 grad:', new_params[1].grad)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    b1 weight: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] @gpu(0)
    b1 grad: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] @gpu(0)
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    new_params = get_params(params, d2l.try_gpu(0))
    print('b1 weight:', new_params[1])
    print('b1 grad:', new_params[1].grad)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    b1 weight: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
           device='cuda:0', requires_grad=True)
    b1 grad: None
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def allreduce(data):
        for i in range(1, len(data)):
            data[0][:] += data[i].copyto(data[0].ctx)
        for i in range(1, len(data)):
            data[0].copyto(data[i])
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def allreduce(data):
        for i in range(1, len(data)):
            data[0][:] += data[i].to(data[0].device)
        for i in range(1, len(data)):
            data[i][:] = data[0].to(data[i].device)
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    data = [np.ones((1, 2), ctx=d2l.try_gpu(i)) * (i + 1) for i in range(2)]
    print('before allreduce:\n', data[0], '\n', data[1])
    allreduce(data)
    print('after allreduce:\n', data[0], '\n', data[1])
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    before allreduce:
     [[1. 1.]] @gpu(0) 
     [[2. 2.]] @gpu(1)
    after allreduce:
     [[3. 3.]] @gpu(0) 
     [[3. 3.]] @gpu(1)
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    data = [torch.ones((1, 2), device=d2l.try_gpu(i)) * (i + 1) for i in range(2)]
    print('before allreduce:\n', data[0], '\n', data[1])
    allreduce(data)
    print('after allreduce:\n', data[0], '\n', data[1])
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    before allreduce:
     tensor([[1., 1.]], device='cuda:0') 
     tensor([[2., 2.]], device='cuda:1')
    after allreduce:
     tensor([[3., 3.]], device='cuda:0') 
     tensor([[3., 3.]], device='cuda:1')
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    data = np.arange(20).reshape(4, 5)
    devices = [npx.gpu(0), npx.gpu(1)]
    split = gluon.utils.split_and_load(data, devices)
    print('input :', data)
    print('load into', devices)
    print('output:', split)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    input : [[ 0.  1.  2.  3.  4.]
     [ 5.  6.  7.  8.  9.]
     [10. 11. 12. 13. 14.]
     [15. 16. 17. 18. 19.]]
    load into [gpu(0), gpu(1)]
    output: [array([[0., 1., 2., 3., 4.],
           [5., 6., 7., 8., 9.]], ctx=gpu(0)), array([[10., 11., 12., 13., 14.],
           [15., 16., 17., 18., 19.]], ctx=gpu(1))]
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    data = torch.arange(20).reshape(4, 5)
    devices = [torch.device('cuda:0'), torch.device('cuda:1')]
    split = nn.parallel.scatter(data, devices)
    print('input :', data)
    print('load into', devices)
    print('output:', split)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    input : tensor([[ 0,  1,  2,  3,  4],
            [ 5,  6,  7,  8,  9],
            [10, 11, 12, 13, 14],
            [15, 16, 17, 18, 19]])
    load into [device(type='cuda', index=0), device(type='cuda', index=1)]
    output: (tensor([[0, 1, 2, 3, 4],
            [5, 6, 7, 8, 9]], device='cuda:0'), tensor([[10, 11, 12, 13, 14],
            [15, 16, 17, 18, 19]], device='cuda:1'))
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    #@save
    def split_batch(X, y, devices):
        """Split `X` and `y` into multiple devices."""
        assert X.shape[0] == y.shape[0]
        return (gluon.utils.split_and_load(X, devices),
                gluon.utils.split_and_load(y, devices))
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    #@save
    def split_batch(X, y, devices):
        """Split `X` and `y` into multiple devices."""
        assert X.shape[0] == y.shape[0]
        return (nn.parallel.scatter(X, devices),
                nn.parallel.scatter(y, devices))
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def train_batch(X, y, device_params, devices, lr):
        X_shards, y_shards = split_batch(X, y, devices)
        with autograd.record():  # Loss is calculated separately on each GPU
            ls = [loss(lenet(X_shard, device_W), y_shard)
                  for X_shard, y_shard, device_W in zip(
                      X_shards, y_shards, device_params)]
        for l in ls:  # Backpropagation is performed separately on each GPU
            l.backward()
        # Sum all gradients from each GPU and broadcast them to all GPUs
        for i in range(len(device_params[0])):
            allreduce([device_params[c][i].grad for c in range(len(devices))])
        # The model parameters are updated separately on each GPU
        for param in device_params:
            d2l.sgd(param, lr, X.shape[0])  # Here, we use a full-size batch
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def train_batch(X, y, device_params, devices, lr):
        X_shards, y_shards = split_batch(X, y, devices)
        # Loss is calculated separately on each GPU
        ls = [loss(lenet(X_shard, device_W), y_shard).sum()
              for X_shard, y_shard, device_W in zip(
                  X_shards, y_shards, device_params)]
        for l in ls:  # Backpropagation is performed separately on each GPU
            l.backward()
        # Sum all gradients from each GPU and broadcast them to all GPUs
        with torch.no_grad():
            for i in range(len(device_params[0])):
                allreduce([device_params[c][i].grad for c in range(len(devices))])
        # The model parameters are updated separately on each GPU
        for param in device_params:
            d2l.sgd(param, lr, X.shape[0]) # Here, we use a full-size batch
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def train(num_gpus, batch_size, lr):
        train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
        devices = [d2l.try_gpu(i) for i in range(num_gpus)]
        # Copy model parameters to `num_gpus` GPUs
        device_params = [get_params(params, d) for d in devices]
        num_epochs = 10
        animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
        timer = d2l.Timer()
        for epoch in range(num_epochs):
            timer.start()
            for X, y in train_iter:
                # Perform multi-GPU training for a single minibatch
                train_batch(X, y, device_params, devices, lr)
                npx.waitall()
            timer.stop()
            # Evaluate the model on GPU 0
            animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(
                lambda x: lenet(x, device_params[0]), test_iter, devices[0]),))
        print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
              f'on {str(devices)}')
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def train(num_gpus, batch_size, lr):
        train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
        devices = [d2l.try_gpu(i) for i in range(num_gpus)]
        # Copy model parameters to `num_gpus` GPUs
        device_params = [get_params(params, d) for d in devices]
        num_epochs = 10
        animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
        timer = d2l.Timer()
        for epoch in range(num_epochs):
            timer.start()
            for X, y in train_iter:
                # Perform multi-GPU training for a single minibatch
                train_batch(X, y, device_params, devices, lr)
                torch.cuda.synchronize()
            timer.stop()
            # Evaluate the model on GPU 0
            animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(
                lambda x: lenet(x, device_params[0]), test_iter, devices[0]),))
        print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
              f'on {str(devices)}')
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    train(num_gpus=1, batch_size=256, lr=0.2)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    test acc: 0.85, 2.5 sec/epoch on [gpu(0)]
.. figure:: output_multiple-gpus_f17d18_93_1.svg
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    train(num_gpus=1, batch_size=256, lr=0.2)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    test acc: 0.84, 2.6 sec/epoch on [device(type='cuda', index=0)]
.. figure:: output_multiple-gpus_f17d18_96_1.svg
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    train(num_gpus=2, batch_size=256, lr=0.2)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    test acc: 0.84, 4.9 sec/epoch on [gpu(0), gpu(1)]
.. figure:: output_multiple-gpus_f17d18_102_1.svg
.. raw:: html
     
.. raw:: html
     
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    train(num_gpus=2, batch_size=256, lr=0.2)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    test acc: 0.83, 2.6 sec/epoch on [device(type='cuda', index=0), device(type='cuda', index=1)]
.. figure:: output_multiple-gpus_f17d18_105_1.svg
.. raw:: html
     
.. raw:: html
     
.. raw:: html
     
`Discussions `__
.. raw:: html
     
.. raw:: html
     
`Discussions `__
.. raw:: html
     
.. raw:: html