11.5. Concise Implementation of Multi-GPU Computation

In Gluon, data parallelism makes multi-GPU computation convenient. For example, we do not need to implement ourselves the helper functions from Section 11.4 that synchronize data among multiple GPUs.

First, import the required packages or modules for the experiment in this section. Running the programs in this section requires at least two GPUs.

import d2l
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import nn

11.5.1. Initialize Model Parameters on Multiple GPUs

In this section, we use ResNet-18 as a sample model. Since the input images in this section keep their original size (they are not enlarged), the model construction here differs from the ResNet-18 structure described in Section 7.6. This model uses a smaller convolution kernel, stride, and padding at the beginning and removes the max-pooling layer.

# Save to the d2l package.
def resnet18(num_classes):
    """A slightly modified ResNet-18 model"""
    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.Sequential()
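        for i in range(num_residuals):
            if i == 0 and not first_block:
                # d2l.Residual is the residual block from Section 7.6
                blk.add(d2l.Residual(
                    num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(d2l.Residual(num_channels))
        return blk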

    net = nn.Sequential()
    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the maximum pooling layer
    net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
            nn.BatchNorm(), nn.Activation('relu'))
    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))
    net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    return net

net = resnet18(10)

Previously, we discussed how to use the initialize function's ctx parameter to initialize model parameters on a CPU or a single GPU. In fact, ctx also accepts a list of CPUs and GPUs, in which case the initialized model parameters are copied to every device in the list.

ctx = d2l.try_all_gpus()
net.initialize(init=init.Normal(sigma=0.01), ctx=ctx)

Gluon provides its own version of the split_and_load function that we implemented in the previous section. It divides a mini-batch of data instances and copies the parts to each CPU or GPU in ctx. The model computation for each part then takes place on the device that holds it.

x = nd.random.uniform(shape=(4, 1, 28, 28))
gpu_x = gluon.utils.split_and_load(x, ctx)
net(gpu_x[0]), net(gpu_x[1])
(
 [[ 5.48149410e-06 -8.33710715e-07 -1.63167692e-06 -6.36740651e-07
   -3.82161625e-06 -2.35140487e-06 -2.54695942e-06 -9.47847525e-08
   -6.90336265e-07  2.57562351e-06]
  [ 5.47108630e-06 -9.42464624e-07 -1.04940636e-06  9.80811592e-08
   -3.32518175e-06 -2.48629181e-06 -3.36428002e-06  1.04558694e-07
   -6.10013558e-07  2.03278455e-06]]
 <NDArray 2x10 @gpu(0)>,
 [[ 5.61763409e-06 -1.28375871e-06 -1.46055413e-06  1.83029556e-07
   -3.55116504e-06 -2.43710201e-06 -3.57318004e-06 -3.09748373e-07
   -1.10165661e-06  1.89098932e-06]
  [ 5.14186922e-06 -1.37299264e-06 -1.15200896e-06  1.15074045e-07
   -3.73728130e-06 -2.82897167e-06 -3.64771950e-06  1.57815748e-07
   -6.07329866e-07  1.97120107e-06]]
 <NDArray 2x10 @gpu(1)>)
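
To make the splitting step concrete, the following sketch mimics in plain Python what dividing a mini-batch among devices involves. It is an illustration only, not Gluon's implementation: the helper name `split_evenly` is hypothetical, plain lists stand in for NDArrays, and list positions stand in for devices.

```python
def split_evenly(batch, num_devices):
    """Split a list of examples into num_devices contiguous, equal shards.

    Hypothetical helper for illustration; it is not part of Gluon or d2l.
    """
    if len(batch) % num_devices != 0:
        raise ValueError('batch size %d is not divisible by %d devices'
                         % (len(batch), num_devices))
    shard_size = len(batch) // num_devices
    return [batch[i * shard_size:(i + 1) * shard_size]
            for i in range(num_devices)]


# Four examples split across two devices, as in the call above.
shards = split_evenly(['x0', 'x1', 'x2', 'x3'], 2)
print(shards)
```

Each shard holds two examples, which is consistent with the 2x10 outputs returned on each GPU above. Raising an error on an uneven split is a choice made for this sketch.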

Now we can access the initialized model parameter values through data. Note that weight.data() returns the parameter values on the CPU by default. Since we initialized the model parameters on 2 GPUs rather than on the CPU, we must specify a GPU to access the parameter values. As we can see, the same parameters have the same values on different GPUs.

weight = net[0].params.get('weight')

try:
    # Without a context, data() looks for the parameter on the CPU
    weight.data()
except RuntimeError:
    print('not initialized on cpu')
weight.data(ctx[0])[0], weight.data(ctx[1])[0]
not initialized on cpu
(
 [[[-0.01473444 -0.01073093 -0.01042483]
   [-0.01327885 -0.01474966 -0.00524142]
   [ 0.01266256  0.00895064 -0.00601594]]]
 <NDArray 1x3x3 @gpu(0)>,
 [[[-0.01473444 -0.01073093 -0.01042483]
   [-0.01327885 -0.01474966 -0.00524142]
   [ 0.01266256  0.00895064 -0.00601594]]]
 <NDArray 1x3x3 @gpu(1)>)

Recall that we defined evaluate_accuracy_gpu in Section 6.6 to support evaluation on a single GPU. We now refine this implementation to support multiple devices.

# Save to the d2l package.
def evaluate_accuracy_gpus(net, data_iter):
    # Query the list of devices.
    ctx_list = list(net.collect_params().values())[0].list_ctx()
    metric = d2l.Accumulator(2)  # num_corrected_examples, num_examples
    for features, labels in data_iter:
        Xs, ys = d2l.split_batch(features, labels, ctx_list)
        pys = [net(X) for X in Xs]  # run in parallel
        metric.add(sum(d2l.accuracy(py, y) for py, y in zip(pys, ys)),
                   labels.size)
    return metric[0]/metric[1]
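
The accumulation pattern in this function can be sketched without any GPU. The example below assumes, as the accumulator comment suggests, that each device contributes a count of correct predictions rather than a ratio. The helper `count_correct` and the prediction and label values are hypothetical and serve only as illustration.

```python
def count_correct(predictions, labels):
    """Number of predictions that equal their labels."""
    return sum(int(p == y) for p, y in zip(predictions, labels))


# Hypothetical predicted and true class indices on two device shards.
shard_preds = [[1, 0, 2, 2], [0, 1]]
shard_labels = [[1, 0, 2, 1], [0, 0]]

num_correct, num_examples = 0, 0
for preds, labels in zip(shard_preds, shard_labels):
    num_correct += count_correct(preds, labels)   # summed across shards
    num_examples += len(labels)

accuracy = num_correct / num_examples
print(num_correct, num_examples, accuracy)
```

Summing counts before dividing keeps the result correct even when the shards differ in size. Averaging the per-shard accuracies (0.75 and 0.5 here) would give 0.625 instead of 4/6.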

11.5.2. Multi-GPU Model Training

When we use multiple GPUs to train the model, Gluon handles most of the data parallelism for us. After we divide each mini-batch of data instances and copy the parts to the individual GPUs, the Trainer instance sums the gradients from all GPUs and broadcasts the result to every GPU. In this way, we can easily implement the training function.

def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
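    ctx_list = [d2l.try_gpu(i) for i in range(num_gpus)]
    # Re-initialize the parameters on every device in ctx_list
    net.initialize(init=init.Normal(sigma=0.01),
                   ctx=ctx_list, force_reinit=True)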
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd', {'learning_rate': lr})
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    timer, num_epochs = d2l.Timer(), 5
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
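    for epoch in range(num_epochs):
        timer.start()
        for features, labels in train_iter:
            Xs, ys = d2l.split_batch(features, labels, ctx_list)
            with autograd.record():
                ls = [loss(net(X), y) for X, y in zip(Xs, ys)]
            for l in ls:
                l.backward()
            trainer.step(batch_size)
        nd.waitall()  # wait for all devices to finish before stopping the timer
        timer.stop()
        animator.add(epoch+1, evaluate_accuracy_gpus(net, test_iter))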
    print('test acc: %.2f, %.1f sec/epoch on %s' % (
        animator.Y[0][-1], timer.avg(), ctx_list))
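
The gradient aggregation that the training loop relies on can be illustrated with a minimal sketch in plain Python. It assumes a one-parameter linear model with a squared loss, and list entries stand in for devices. The function `grad_sum` and all values are hypothetical, and this is not how Gluon implements the communication.

```python
def grad_sum(w, xs, ys):
    """Gradient of sum_i 0.5 * (w * x_i - y_i)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))


w, lr = 0.5, 0.1
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
batch_size = len(xs)

# Each simulated device computes the gradient on its own half of the batch.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
per_device = [grad_sum(w, sx, sy) for sx, sy in shards]

# Summing the per-device gradients recovers the full-batch gradient.
total = sum(per_device)
assert abs(total - grad_sum(w, xs, ys)) < 1e-12

# Every device applies the same update, so the replicas stay identical.
replicas = [w - lr * total / batch_size for _ in shards]
print(per_device, total, replicas)
```

Dividing by batch_size mirrors the normalization that trainer.step(batch_size) applies to the summed gradient.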

First, use a single GPU for training.

train(num_gpus=1, batch_size=256, lr=0.1)
test acc: 0.93, 13.2 sec/epoch on [gpu(0)]

Next, we use 2 GPUs for training. Compared with the LeNet used in the previous section, ResNet-18 requires much more computation, so the communication time is short relative to the computation time. Parallel computing therefore improves performance more for ResNet-18.

train(num_gpus=2, batch_size=512, lr=0.2)
test acc: 0.90, 6.8 sec/epoch on [gpu(0), gpu(1)]
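
As a quick check on the two runs, the short calculation below derives the speedup and the parallel efficiency from the reported times. The timings are copied from the outputs above and will differ on other hardware. The variable names are chosen only for this illustration.

```python
# Seconds per epoch reported by the two runs above.
time_1gpu, time_2gpus = 13.2, 6.8
num_gpus = 2

speedup = time_1gpu / time_2gpus   # how many times faster the 2-GPU run is
efficiency = speedup / num_gpus    # fraction of the ideal linear speedup
print('speedup: %.2fx, efficiency: %.0f%%' % (speedup, 100 * efficiency))
# prints: speedup: 1.94x, efficiency: 97%
```

Both runs process the same dataset in each epoch, so the time per epoch is directly comparable even though the batch sizes differ.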

11.5.3. Summary

  • In Gluon, we can conveniently perform multi-GPU computations, such as initializing model parameters and training models on multiple GPUs.

11.5.4. Exercises

  • This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use more GPUs for computation if conditions permit.

  • Sometimes, different devices provide different computing power. Some can use CPUs and GPUs at the same time, or GPUs of different models. How should we divide mini-batches among different CPUs or GPUs?
