3.3. Concise Implementation of Linear Regression

The surge of deep learning has inspired the development of a variety of mature software frameworks that automate much of the repetitive work of implementing deep learning models. In the previous section, we relied only on NDArray for data storage and linear algebra, and on the auto-differentiation capabilities of the autograd package. In practice, because many of the more abstract operations, e.g., data iterators, loss functions, model architectures, and optimizers, are so common, deep learning libraries provide functions for these as well.

We used Gluon to load the MNIST dataset in Section 2.5. In this section, we will show how to implement the linear regression model from Section 3.2 much more concisely with Gluon.

3.3.1. Generating Data Sets

To start, we will generate the same data set as that used in the previous section.

import d2l
from mxnet import autograd, nd, gluon

true_w = nd.array([2, -3.4])
true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)

3.3.2. Reading Data

Rather than rolling our own iterator, we can call upon Gluon’s data module to read data. Since data is often used as a variable name, we will replace it with the pseudonym gdata (adding the first letter of Gluon) to differentiate the imported data module from a variable we might define. The first step will be to instantiate an ArrayDataset, which takes one or more NDArrays as arguments. Here, we pass in features and labels as arguments. Next, we will use the ArrayDataset to instantiate a DataLoader, which also requires that we specify a batch_size and a Boolean value shuffle indicating whether or not we want the DataLoader to shuffle the data on each epoch (pass through the dataset).

# Save to the d2l package.
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a Gluon data loader"""
    dataset = gluon.data.ArrayDataset(*data_arrays)
    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)

batch_size = 10
data_iter = load_array((features, labels), batch_size)

Now we can use data_iter in much the same way as we called the data_iter function in the previous section. To verify that it’s working, we can read and print the first mini-batch of instances.

for X, y in data_iter:
    print(X, y)
    break
[[-0.56865287  0.91230786]
 [-0.5013274  -1.673106  ]
 [ 0.74412745 -0.45482758]
 [ 0.42752823  1.3946159 ]
 [ 1.3671521   0.4300703 ]
 [-0.9840098  -0.1045013 ]
 [ 0.8258905   1.0249989 ]
 [-0.16174842  0.47192234]
 [-0.12898901  0.9547114 ]
 [-0.16185044  0.72164685]]
<NDArray 10x2 @cpu(0)>
[-0.03822104  8.877403    7.2241445   0.31876415  5.4759536   2.5935795
  2.3662217   2.2778223   0.6997888   1.4243342 ]
<NDArray 10 @cpu(0)>

3.3.3. Define the Model

When we implemented linear regression from scratch in the previous section, we had to define the model parameters and explicitly write out the calculation to produce output using basic linear algebra operations. You should know how to do this. But once your models get more complex, even qualitatively simple changes to the model might result in many low-level changes.

For standard operations, we can use Gluon’s predefined layers, which allow us to focus especially on the layers used to construct the model rather than having to focus on the implementation.

To define a linear model, we first import the nn module, which defines a large number of neural network layers (note that “nn” is an abbreviation for neural networks). We will first define a model variable net, which is a Sequential instance. In Gluon, a Sequential instance can be regarded as a container that concatenates the various layers in sequence. When input data is given, each layer in the container will be calculated in order, and the output of one layer will be the input of the next layer. In this example, since our model consists of only one layer, we do not really need Sequential. But since nearly all of our future models will involve multiple layers, let’s get into the habit early.

from mxnet.gluon import nn
net = nn.Sequential()

Recall the architecture of a single-layer network. The layer is fully connected, since it connects all inputs with all outputs by means of a matrix-vector multiplication. In Gluon, the fully-connected layer is defined in the Dense class. Since we only want to generate a single scalar output, we set that number to \(1\).

net.add(nn.Dense(1))


Fig. 3.3.1 Linear regression is a single-layer neural network.
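To make the matrix-vector view concrete, here is a minimal pure-Python sketch of what a fully-connected layer with a single output unit computes (an illustration of the math only, not Gluon’s implementation):

```python
# Sketch: a Dense layer with one output unit computes y_hat = Xw + b,
# i.e., a matrix-vector product plus a scalar bias.

def dense_forward(X, w, b):
    """Apply Xw + b row by row; X is a list of examples, w a weight vector."""
    return [sum(x_j * w_j for x_j, w_j in zip(x, w)) + b for x in X]

X = [[1.0, 2.0], [3.0, 4.0]]   # two examples with two features each
w = [2.0, -3.4]                # one weight per input feature
b = 4.2                        # scalar bias for the single output unit
print(dense_forward(X, w, b))  # -> approximately [-0.6, -3.4]
```

A Dense layer with one unit performs exactly this computation, with the weights and bias stored as trainable parameters.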


It is worth noting that, for convenience, Gluon does not require us to specify the input shape for each layer. So here, we don’t need to tell Gluon how many inputs go into this linear layer. When we first try to pass data through our model, e.g., when we execute net(X) later, Gluon will automatically infer the number of inputs to each layer. We will describe how this works in more detail in the chapter “Deep Learning Computation”.

3.3.4. Initialize Model Parameters

Before using net, we need to initialize the model parameters, such as the weights and biases in the linear regression model. We will import the initializer module from MXNet. This module provides various methods for model parameter initialization. Gluon makes init available as a shortcut (abbreviation) to access the initializer package. By calling init.Normal(sigma=0.01), we specify that each weight parameter should be randomly sampled from a normal distribution with mean 0 and standard deviation 0.01. The bias parameter will be initialized to zero by default. Both weight and bias will be attached with gradients.

from mxnet import init
net.initialize(init.Normal(sigma=0.01))

The code above looks straightforward but in reality something quite strange is happening here. We are initializing parameters for a network even though we haven’t yet told Gluon how many dimensions the input will have. It might be 2 as in our example or it might be 2,000, so we couldn’t just preallocate enough space to make it work.

Gluon lets us get away with this because, behind the scenes, the initialization is deferred until the first time we attempt to pass data through the network. Just be careful to remember that, since the parameters have not been initialized yet, we cannot yet manipulate them in any way.
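The mechanism can be illustrated with a toy pure-Python sketch (this is not Gluon’s actual code): a layer that postpones allocating its weights until the first forward pass, when the input dimension finally becomes known.

```python
import random

class LazyDense:
    """Toy layer that defers parameter allocation until the first call."""
    def __init__(self, units):
        self.units = units
        self.w = None  # input dimension unknown: nothing allocated yet
        self.b = None

    def __call__(self, x):
        if self.w is None:
            # First forward pass: infer the input dimension from the data.
            in_dim = len(x)
            self.w = [[random.gauss(0, 0.01) for _ in range(in_dim)]
                      for _ in range(self.units)]
            self.b = [0.0] * self.units
        return [sum(wi * xi for wi, xi in zip(row, x)) + bj
                for row, bj in zip(self.w, self.b)]

layer = LazyDense(1)
# layer.w is still None here; touching the parameters now would fail,
# just as in Gluon before the first forward pass.
out = layer([1.0, 2.0])  # triggers allocation; in_dim is inferred as 2
```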

3.3.5. Define the Loss Function

In Gluon, the loss module defines various loss functions. We will replace the imported module loss with the pseudonym gloss, and directly use its implementation of squared loss (L2Loss).

from mxnet.gluon import loss as gloss
loss = gloss.L2Loss()  # The squared loss is also known as the L2 norm loss
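Gluon’s L2Loss computes half the squared difference for each example, i.e., l_i = (ŷ_i − y_i)² / 2. A plain-Python sketch of that formula:

```python
# Sketch of the squared (L2) loss: half the squared difference per
# example, matching Gluon's L2Loss convention.

def l2_loss(y_hat, y):
    return [(p - t) ** 2 / 2 for p, t in zip(y_hat, y)]

print(l2_loss([2.5, 0.0], [3.0, -1.0]))  # -> [0.125, 0.5]
```

The factor of 1/2 makes the gradient of the loss with respect to the prediction simply (ŷ − y).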

3.3.6. Define the Optimization Algorithm

Not surprisingly, we aren’t the first people to implement mini-batch stochastic gradient descent, and thus Gluon supports SGD alongside a number of variations on this algorithm through its Trainer class. When we instantiate the Trainer, we’ll specify the parameters to optimize over (obtainable from our net via net.collect_params()), the optimization algorithm we wish to use (sgd), and a dictionary of hyper-parameters required by our optimization algorithm. SGD just requires that we set the value learning_rate (here we set it to 0.03).

from mxnet import gluon
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})
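For intuition, the update that trainer.step(batch_size) applies with the sgd optimizer can be sketched in plain Python (a sketch of the update rule, not Gluon’s implementation): each parameter moves against its gradient, with the gradient normalized by the batch size.

```python
# Sketch of one mini-batch SGD step: w <- w - lr * g / batch_size,
# where g is the gradient accumulated over the mini-batch.

def sgd_step(params, grads, lr, batch_size):
    return [p - lr * g / batch_size for p, g in zip(params, grads)]

params = [1.0, -2.0]
grads = [10.0, -5.0]  # gradients accumulated over a mini-batch of 10
print(sgd_step(params, grads, lr=0.03, batch_size=10))
# -> approximately [0.97, -1.985]
```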

3.3.7. Training

You might have noticed that expressing our model through Gluon requires comparatively few lines of code. We didn’t have to individually allocate parameters, define our loss function, or implement stochastic gradient descent. Once we start working with much more complex models, the benefits of relying on Gluon’s abstractions will grow considerably. But once we have all the basic pieces in place, the training loop itself is strikingly similar to what we did when implementing everything from scratch.

To refresh your memory: for some number of epochs, we’ll make a complete pass over the dataset (train_data), grabbing one mini-batch of inputs and corresponding ground-truth labels at a time. For each batch, we’ll go through the following ritual:

  • Generate predictions by calling net(X) and calculate the loss l (the forward pass).

  • Calculate gradients by calling l.backward() (the backward pass).

  • Update the model parameters by invoking our SGD optimizer (note that trainer already knows which parameters to optimize over, so we just need to pass in the batch size).

For good measure, we compute the loss after each epoch and print it to monitor progress.

num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(features), labels)
    print('epoch %d, loss: %f' % (epoch, l.mean().asnumpy()))
epoch 1, loss: 0.040289
epoch 2, loss: 0.000152
epoch 3, loss: 0.000051

Below, we compare the model parameters learned through training with the actual parameters that generated our data. To access the parameters, we first take the layer we need from net and then access that layer’s weight (weight) and bias (bias). The parameters we have learned and the actual parameters are very close.

w = net[0].weight.data()
print('Error in estimating w', true_w.reshape(w.shape) - w)
b = net[0].bias.data()
print('Error in estimating b', true_b - b)
Error in estimating w
[[ 0.00036335 -0.0002389 ]]
<NDArray 1x2 @cpu(0)>
Error in estimating b
<NDArray 1 @cpu(0)>

3.3.8. Summary

  • Using Gluon, we can implement the model more succinctly.

  • In Gluon, the module data provides tools for data processing, the module nn defines a large number of neural network layers, and the module loss defines various loss functions.

  • MXNet’s module initializer provides various methods for model parameter initialization.

  • Dimensionality and storage are automagically inferred (but be careful not to access parameters before they have been initialized).

3.3.9. Exercises

  1. If we replace l = loss(output, y) with l = loss(output, y).mean(), we need to change trainer.step(batch_size) to trainer.step(1) accordingly. Why?

  2. Review the MXNet documentation to see what loss functions and initialization methods are provided in the modules gluon.loss and init. Replace the loss by Huber’s loss.

  3. How do you access the gradient of dense.weight?

3.3.10. Scan the QR Code to Discuss