3.3. Concise Implementation of Linear Regression
Broad and intense interest in deep learning for the past several years has inspired companies, academics, and hobbyists to develop a variety of mature open source frameworks for automating the repetitive work of implementing gradient-based learning algorithms. In Section 3.2, we relied only on (i) tensors for data storage and linear algebra; and (ii) automatic differentiation for calculating gradients. In practice, because data iterators, loss functions, optimizers, and neural network layers are so common, modern libraries implement these components for us as well.
In this section, we will show you how to implement the linear regression model from Section 3.2 concisely by using high-level APIs of deep learning frameworks.
3.3.1. Generating the Dataset
To start, we will generate the same dataset as in Section 3.2. (Here and throughout this section, the three code blocks give the MXNet, PyTorch, and TensorFlow implementations, respectively.)
from mxnet import autograd, gluon, np, npx
from d2l import mxnet as d2l
npx.set_np()
true_w = np.array([2, -3.4])
true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)
import numpy as np
import torch
from torch.utils import data
from d2l import torch as d2l
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)
import numpy as np
import tensorflow as tf
from d2l import tensorflow as d2l
true_w = tf.constant([2, -3.4])
true_b = 4.2
features, labels = d2l.synthetic_data(true_w, true_b, 1000)
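As a reminder, d2l.synthetic_data from Section 3.2 draws each example \(\mathbf{x}\) at random and generates its label according to the linear model \(y = \mathbf{w}^\top \mathbf{x} + b + \epsilon\), where \(\epsilon\) is a small Gaussian noise term; the goal of this section is to recover true_w and true_b from features and labels using high-level APIs.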
3.3.2. Reading the Dataset
Rather than rolling our own iterator, we can call upon the existing API in a framework to read data. We pass in features and labels as arguments and specify batch_size when instantiating a data iterator object. In addition, the boolean value is_train indicates whether or not we want the data iterator object to shuffle the data on each epoch (pass through the dataset).
def load_array(data_arrays, batch_size, is_train=True):  #@save
    """Construct a Gluon data iterator."""
    dataset = gluon.data.ArrayDataset(*data_arrays)
    return gluon.data.DataLoader(dataset, batch_size, shuffle=is_train)
batch_size = 10
data_iter = load_array((features, labels), batch_size)
def load_array(data_arrays, batch_size, is_train=True):  #@save
    """Construct a PyTorch data iterator."""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
batch_size = 10
data_iter = load_array((features, labels), batch_size)
def load_array(data_arrays, batch_size, is_train=True):  #@save
    """Construct a TensorFlow data iterator."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset
batch_size = 10
data_iter = load_array((features, labels), batch_size)
Now we can use data_iter in much the same way as we called the data_iter function in Section 3.2. To verify that it is working, we can read and print the first minibatch of examples. Compared with Section 3.2, here we use iter to construct a Python iterator and use next to obtain the first item from the iterator.
next(iter(data_iter))
[array([[ 0.17287086, 0.6836102 ],
[-0.11836779, 0.19793853],
[ 0.51737875, 1.6076493 ],
[-0.08403113, -0.8816499 ],
[-0.40720204, 0.5380863 ],
[ 0.59860843, -3.0636313 ],
[-1.9421831 , 0.39020136],
[ 1.5702168 , 1.11278 ],
[-0.44707692, 0.39505652],
[-0.06807706, -0.13130364]]),
array([[ 2.2087321 ],
[ 3.29689 ],
[-0.22868225],
[ 7.0315914 ],
[ 1.5564172 ],
[15.81957 ],
[-1.0176986 ],
[ 3.5531938 ],
[ 1.956641 ],
[ 4.5233345 ]])]
next(iter(data_iter))
[tensor([[ 1.1477, 1.2558],
[ 0.3341, -0.2821],
[ 0.3365, 0.1217],
[ 0.2871, 0.8190],
[-0.8802, 1.2505],
[-0.3033, -0.3111],
[ 0.2943, -0.6745],
[ 0.6015, 1.1968],
[ 0.2001, 0.7602],
[ 0.2266, -0.5029]]),
tensor([[ 2.2239],
[ 5.8224],
[ 4.4543],
[ 1.9789],
[-1.8276],
[ 4.6488],
[ 7.0733],
[ 1.3435],
[ 2.0176],
[ 6.3695]])]
next(iter(data_iter))
(<tf.Tensor: shape=(10, 2), dtype=float32, numpy=
array([[ 0.8189056 , -0.70299053],
[-0.23553985, 3.0204232 ],
[ 0.7063121 , -0.32037047],
[ 1.7371358 , -1.7424992 ],
[ 0.06076531, 0.20629206],
[ 1.9706419 , 0.7340944 ],
[ 0.5541655 , 0.14886267],
[-0.53026026, 0.83554405],
[-0.657761 , 0.15629663],
[ 0.00526189, -0.11952543]], dtype=float32)>,
<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[ 8.223323 ],
[-6.5477023],
[ 6.7100124],
[13.599574 ],
[ 3.6187358],
[ 5.6446624],
[ 4.780884 ],
[ 0.3131682],
[ 2.365241 ],
[ 4.631494 ]], dtype=float32)>)
3.3.3. Defining the Model
When we implemented linear regression from scratch in Section 3.2, we defined our model parameters explicitly and coded up the calculations to produce output using basic linear algebra operations. You should know how to do this. But once your models get more complex, and once you have to do this nearly every day, you will be glad for the assistance. The situation is similar to coding up your own blog from scratch. Doing it once or twice is rewarding and instructive, but you would be a lousy web developer if every time you needed a blog you spent a month reinventing the wheel.
For standard operations, we can use a framework’s predefined layers, which allow us to focus especially on the layers used to construct the model rather than worrying about their implementation. We will first define a model variable net, which will refer to an instance of the Sequential class. The Sequential class defines a container for several layers that will be chained together. Given input data, a Sequential instance passes it through the first layer, in turn passing the output as the second layer’s input and so forth. In the following example, our model consists of only one layer, so we do not really need Sequential. But since nearly all of our future models will involve multiple layers, we will use it anyway just to familiarize you with the most standard workflow.
Recall the architecture of a single-layer network as shown in Fig. 3.1.2. The layer is said to be fully-connected because each of its inputs is connected to each of its outputs by means of a matrix-vector multiplication.
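In the notation of Section 3.2, with two input features this single fully-connected layer computes precisely the affine transformation of linear regression, \(\hat{y} = \mathbf{w}^\top \mathbf{x} + b\) for each example, or \(\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b\) for a whole minibatch \(\mathbf{X}\); the framework's layer simply stores \(\mathbf{w}\) and \(b\) for us.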
In Gluon, the fully-connected layer is defined in the Dense class. Since we only want to generate a single scalar output, we set that number to 1.

It is worth noting that, for convenience, Gluon does not require us to specify the input shape for each layer. So here, we do not need to tell Gluon how many inputs go into this linear layer. When we first try to pass data through our model, e.g., when we execute net(X) later, Gluon will automatically infer the number of inputs to each layer. We will describe how this works in more detail later.
# `nn` is an abbreviation for neural networks
from mxnet.gluon import nn
net = nn.Sequential()
net.add(nn.Dense(1))
In PyTorch, the fully-connected layer is defined in the Linear class. Note that we passed two arguments into nn.Linear. The first one specifies the input feature dimension, which is 2, and the second one is the output feature dimension, which is a single scalar and therefore 1.
# `nn` is an abbreviation for neural networks
from torch import nn
net = nn.Sequential(nn.Linear(2, 1))
In Keras, the fully-connected layer is defined in the Dense class. Since we only want to generate a single scalar output, we set that number to 1.

It is worth noting that, for convenience, Keras does not require us to specify the input shape for each layer. So here, we do not need to tell Keras how many inputs go into this linear layer. When we first try to pass data through our model, e.g., when we execute net(X) later, Keras will automatically infer the number of inputs to each layer. We will describe how this works in more detail later.
# `keras` is the high-level API for TensorFlow
net = tf.keras.Sequential()
net.add(tf.keras.layers.Dense(1))
3.3.4. Initializing Model Parameters
Before using net, we need to initialize the model parameters, such as the weights and bias in the linear regression model. Deep learning frameworks often have a predefined way to initialize the parameters. Here we specify that each weight parameter should be randomly sampled from a normal distribution with mean 0 and standard deviation 0.01. The bias parameter will be initialized to zero.
We will import the initializer module from MXNet. This module provides various methods for model parameter initialization. Gluon makes init available as a shortcut (abbreviation) to access the initializer package. We only specify how to initialize the weight by calling init.Normal(sigma=0.01). Bias parameters are initialized to zero by default.
from mxnet import init
net.initialize(init.Normal(sigma=0.01))
The code above may look straightforward but you should note that something strange is happening here. We are initializing parameters for a network even though Gluon does not yet know how many dimensions the input will have! It might be 2 as in our example or it might be 2000. Gluon lets us get away with this because behind the scene, the initialization is actually deferred. The real initialization will take place only when we for the first time attempt to pass data through the network. Just be careful to remember that since the parameters have not been initialized yet, we cannot access or manipulate them.
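For instance, here is a minimal sketch (not part of the original notebook) of this deferred behavior, reusing the features generated above: before any data has flowed through net, calling net[0].weight.data() would raise a deferred-initialization error, whereas after a single forward pass the parameters exist and can be inspected.

net(features[:2])            # the first forward pass triggers the actual initialization
print(net[0].weight.data())  # the weight parameter is now accessible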
Since we specified the input and output dimensions when constructing nn.Linear, we can now access the parameters directly to specify their initial values. We first locate the layer by net[0], which is the first layer in the network, and then use the weight.data and bias.data methods to access the parameters. Next we use the in-place replacement methods normal_ and fill_ to overwrite the parameter values.
net[0].weight.data.normal_(0, 0.01)
net[0].bias.data.fill_(0)
tensor([0.])
The initializers module in TensorFlow provides various methods for model parameter initialization. The easiest way to specify the initialization method in Keras is when creating the layer by specifying kernel_initializer. Here we recreate net.
initializer = tf.initializers.RandomNormal(stddev=0.01)
net = tf.keras.Sequential()
net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))
The code above may look straightforward but you should note that something strange is happening here. We are initializing parameters for a network even though Keras does not yet know how many dimensions the input will have! It might be 2 as in our example or it might be 2000. Keras lets us get away with this because behind the scenes, the initialization is actually deferred. The real initialization will take place only when we for the first time attempt to pass data through the network. Just be careful to remember that since the parameters have not been initialized yet, we cannot access or manipulate them.
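As a minimal sketch (not part of the original notebook) of this deferred behavior, reusing the features generated above: only after the first forward pass does Keras know the input dimension and actually create the parameters.

net(features[:2])                  # the first forward pass builds the layer
print(net.get_weights()[0].shape)  # kernel shape inferred from the data: (2, 1)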
3.3.5. Defining the Loss Function
In Gluon, the loss module defines various loss functions. In this example, we will use the Gluon implementation of squared loss (L2Loss).
loss = gluon.loss.L2Loss()
The MSELoss class computes the mean squared error (without the \(1/2\) factor in (3.1.5)). By default it returns the average loss over examples.
loss = nn.MSELoss()
The MeanSquaredError class computes the mean squared error (without the \(1/2\) factor in (3.1.5)). By default it returns the average loss over examples.
loss = tf.keras.losses.MeanSquaredError()
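Written out, the defaults described above amount to the following (a sketch, with \(\hat{y}^{(i)}\) denoting the prediction for example \(i\) in a minibatch of size \(n\)): Gluon's L2Loss keeps the \(1/2\) factor of (3.1.5) and returns a per-example loss \(\frac{1}{2}\big(\hat{y}^{(i)} - y^{(i)}\big)^2\), whereas MSELoss and MeanSquaredError drop the \(1/2\) and average over the minibatch, returning \(\frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}^{(i)} - y^{(i)}\big)^2\).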
3.3.6. Defining the Optimization Algorithm
Minibatch stochastic gradient descent is a standard tool for optimizing neural networks and thus Gluon supports it alongside a number of variations on this algorithm through its Trainer class. When we instantiate Trainer, we will specify the parameters to optimize over (obtainable from our model net via net.collect_params()), the optimization algorithm we wish to use (sgd), and a dictionary of hyperparameters required by our optimization algorithm. Minibatch stochastic gradient descent just requires that we set the value learning_rate, which is set to 0.03 here.
from mxnet import gluon
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})
Minibatch stochastic gradient descent is a standard tool for optimizing neural networks and thus PyTorch supports it alongside a number of variations on this algorithm in the optim module. When we instantiate an SGD instance, we will specify the parameters to optimize over (obtainable from our net via net.parameters()), along with the hyperparameters required by our optimization algorithm. Minibatch stochastic gradient descent just requires that we set the value lr, which is set to 0.03 here.
trainer = torch.optim.SGD(net.parameters(), lr=0.03)
Minibatch stochastic gradient descent is a standard tool for optimizing neural networks and thus Keras supports it alongside a number of variations on this algorithm in the optimizers module. Minibatch stochastic gradient descent just requires that we set the value learning_rate, which is set to 0.03 here.
trainer = tf.keras.optimizers.SGD(learning_rate=0.03)
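As a reminder from Section 3.1, for each minibatch \(\mathcal{B}\) these optimizers perform the update \(\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b)\) (and likewise for \(b\)), where \(\eta\) is the learning rate set to 0.03 above; the frameworks differ slightly in bookkeeping details, such as where the averaging over the minibatch takes place (inside the loss or inside the optimizer step).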
3.3.7. Training
You might have noticed that expressing our model through high-level APIs of a deep learning framework requires comparatively few lines of code. We did not have to individually allocate parameters, define our loss function, or implement minibatch stochastic gradient descent. Once we start working with much more complex models, the advantages of high-level APIs will grow considerably. However, once we have all the basic pieces in place, the training loop itself is strikingly similar to what we did when implementing everything from scratch.
To refresh your memory: for some number of epochs, we will make a complete pass over the dataset (train_data), iteratively grabbing one minibatch of inputs and the corresponding ground-truth labels. For each minibatch, we go through the following ritual:

- Generate predictions by calling net(X) and calculate the loss l (the forward propagation).
- Calculate gradients by running the backpropagation.
- Update the model parameters by invoking our optimizer.

For good measure, we compute the loss after each epoch and print it to monitor progress.
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l.mean().asnumpy():f}')
epoch 1, loss 0.024962
epoch 2, loss 0.000092
epoch 3, loss 0.000051
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        trainer.zero_grad()
        l.backward()
        trainer.step()
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')
epoch 1, loss 0.000227
epoch 2, loss 0.000112
epoch 3, loss 0.000112
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        with tf.GradientTape() as tape:
            l = loss(net(X, training=True), y)
        grads = tape.gradient(l, net.trainable_variables)
        trainer.apply_gradients(zip(grads, net.trainable_variables))
    l = loss(net(features), labels)
    print(f'epoch {epoch + 1}, loss {l:f}')
epoch 1, loss 0.000192
epoch 2, loss 0.000095
epoch 3, loss 0.000096
Below, we compare the model parameters learned by training on finite data and the actual parameters that generated our dataset. To access parameters, we first access the layer that we need from net and then access that layer’s weights and bias. As in our from-scratch implementation, note that our estimated parameters are close to their ground-truth counterparts.
w = net[0].weight.data()
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
b = net[0].bias.data()
print(f'error in estimating b: {true_b - b}')
error in estimating w: [0.00085819 0.00035477]
error in estimating b: [0.00040722]
w = net[0].weight.data
print('error in estimating w:', true_w - w.reshape(true_w.shape))
b = net[0].bias.data
print('error in estimating b:', true_b - b)
error in estimating w: tensor([-0.0006, 0.0001])
error in estimating b: tensor([0.0005])
w = net.get_weights()[0]
print('error in estimating w', true_w - tf.reshape(w, true_w.shape))
b = net.get_weights()[1]
print('error in estimating b', true_b - b)
error in estimating w tf.Tensor([ 0.00024807 -0.00045204], shape=(2,), dtype=float32)
error in estimating b [2.0503998e-05]
3.3.8. Summary
- Using Gluon, we can implement models much more concisely.
- In Gluon, the data module provides tools for data processing, the nn module defines a large number of neural network layers, and the loss module defines many common loss functions.
- MXNet’s initializer module provides various methods for model parameter initialization.
- Dimensionality and storage are automatically inferred, but be careful not to attempt to access parameters before they have been initialized.
- Using PyTorch’s high-level APIs, we can implement models much more concisely.
- In PyTorch, the data module provides tools for data processing, and the nn module defines a large number of neural network layers and common loss functions.
- We can initialize the parameters by replacing their values with methods ending with _.
- Using TensorFlow’s high-level APIs, we can implement models much more concisely.
- In TensorFlow, the data module provides tools for data processing, and the keras module defines a large number of neural network layers and common loss functions.
- TensorFlow’s initializers module provides various methods for model parameter initialization.
- Dimensionality and storage are automatically inferred (but be careful not to attempt to access parameters before they have been initialized).
3.3.9. Exercises
1. If we replace l = loss(output, y) with l = loss(output, y).mean(), we need to change trainer.step(batch_size) to trainer.step(1) for the code to behave identically. Why?
2. Review the MXNet documentation to see what loss functions and initialization methods are provided in the modules gluon.loss and init. Replace the loss by Huber’s loss.
3. How do you access the gradient of dense.weight?
1. If we replace nn.MSELoss(reduction='sum') with nn.MSELoss(), how can we change the learning rate for the code to behave identically? Why?
2. Review the PyTorch documentation to see what loss functions and initialization methods are provided. Replace the loss by Huber’s loss.
3. How do you access the gradient of net[0].weight?
1. Review the TensorFlow documentation to see what loss functions and initialization methods are provided. Replace the loss by Huber’s loss.