.. _sec_lenet:
Convolutional Neural Networks (LeNet)
=====================================
We now have all the ingredients required to assemble a fully-functional
CNN. In our earlier encounter with image data, we applied a softmax
regression model (:numref:`sec_softmax_scratch`) and an MLP model
(:numref:`sec_mlp_scratch`) to pictures of clothing in the
Fashion-MNIST dataset. To make such data amenable to softmax regression
and MLPs, we first flattened each image from a :math:`28\times28` matrix
into a fixed-length :math:`784`-dimensional vector, and thereafter
processed them with fully-connected layers. Now that we have a handle on
convolutional layers, we can retain the spatial structure in our images.
As an additional benefit of replacing fully-connected layers with
convolutional layers, we will enjoy more parsimonious models that
require far fewer parameters.
In this section, we will introduce *LeNet*, among the first published
CNNs to capture wide attention for its performance on computer vision
tasks. The model was introduced by (and named for) Yann LeCun, then a
researcher at AT&T Bell Labs, for the purpose of recognizing handwritten
digits in images :cite:`LeCun.Bottou.Bengio.ea.1998`. This work
represented the culmination of a decade of research developing the
technology. In 1989, LeCun published the first study to successfully
train CNNs via backpropagation.
At the time LeNet achieved outstanding results matching the performance
of support vector machines, then a dominant approach in supervised
learning. LeNet was eventually adapted to recognize digits for
processing deposits in ATM machines. To this day, some ATMs still run
the code that Yann and his colleague Leon Bottou wrote in the 1990s!
LeNet
-----
At a high level, LeNet (LeNet-5) consists of two parts: (i) a
convolutional encoder consisting of two convolutional layers; and (ii) a
dense block consisting of three fully-connected layers; The architecture
is summarized in :numref:`img_lenet`.
.. _img_lenet:
.. figure:: ../img/lenet.svg
Data flow in LeNet. The input is a handwritten digit, the output a
probability over 10 possible outcomes.
The basic units in each convolutional block are a convolutional layer, a
sigmoid activation function, and a subsequent average pooling operation.
Note that while ReLUs and max-pooling work better, these discoveries had
not yet been made in the 1990s. Each convolutional layer uses a
:math:`5\times 5` kernel and a sigmoid activation function. These layers
map spatially arranged inputs to a number of two-dimensional feature
maps, typically increasing the number of channels. The first
convolutional layer has 6 output channels, while the second has 16. Each
:math:`2\times2` pooling operation (stride 2) reduces dimensionality by
a factor of :math:`4` via spatial downsampling. The convolutional block
emits an output with shape given by (batch size, number of channel,
height, width).
In order to pass output from the convolutional block to the dense block,
we must flatten each example in the minibatch. In other words, we take
this four-dimensional input and transform it into the two-dimensional
input expected by fully-connected layers: as a reminder, the
two-dimensional representation that we desire uses the first dimension
to index examples in the minibatch and the second to give the flat
vector representation of each example. LeNet's dense block has three
fully-connected layers, with 120, 84, and 10 outputs, respectively.
Because we are still performing classification, the 10-dimensional
output layer corresponds to the number of possible output classes.
While getting to the point where you truly understand what is going on
inside LeNet may have taken a bit of work, hopefully the following code
snippet will convince you that implementing such models with modern deep
learning frameworks is remarkably simple. We need only to instantiate a
``Sequential`` block and chain together the appropriate layers.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
net = nn.Sequential()
net.add(nn.Conv2D(channels=6, kernel_size=5, padding=2, activation='sigmoid'),
nn.AvgPool2D(pool_size=2, strides=2),
nn.Conv2D(channels=16, kernel_size=5, activation='sigmoid'),
nn.AvgPool2D(pool_size=2, strides=2),
# `Dense` will transform an input of the shape (batch size, number of
# channels, height, width) into an input of the shape (batch size,
# number of channels * height * width) automatically by default
nn.Dense(120, activation='sigmoid'),
nn.Dense(84, activation='sigmoid'),
nn.Dense(10))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from torch import nn
from d2l import torch as d2l
net = nn.Sequential(
nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
nn.AvgPool2d(kernel_size=2, stride=2),
nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
nn.AvgPool2d(kernel_size=2, stride=2),
nn.Flatten(),
nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
nn.Linear(120, 84), nn.Sigmoid(),
nn.Linear(84, 10))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import tensorflow as tf
from d2l import tensorflow as d2l
def net():
return tf.keras.models.Sequential([
tf.keras.layers.Conv2D(filters=6, kernel_size=5, activation='sigmoid',
padding='same'),
tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
tf.keras.layers.Conv2D(filters=16, kernel_size=5,
activation='sigmoid'),
tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(120, activation='sigmoid'),
tf.keras.layers.Dense(84, activation='sigmoid'),
tf.keras.layers.Dense(10)])
.. raw:: html
.. raw:: html
We took a small liberty with the original model, removing the Gaussian
activation in the final layer. Other than that, this network matches the
original LeNet-5 architecture.
By passing a single-channel (black and white) :math:`28 \times 28` image
through the network and printing the output shape at each layer, we can
inspect the model to make sure that its operations line up with what we
expect from :numref:`img_lenet_vert`.
.. _img_lenet_vert:
.. figure:: ../img/lenet-vert.svg
Compressed notation for LeNet-5.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = np.random.uniform(size=(1, 1, 28, 28))
net.initialize()
for layer in net:
X = layer(X)
print(layer.name, 'output shape:\t', X.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
conv0 output shape: (1, 6, 28, 28)
pool0 output shape: (1, 6, 14, 14)
conv1 output shape: (1, 16, 10, 10)
pool1 output shape: (1, 16, 5, 5)
dense0 output shape: (1, 120)
dense1 output shape: (1, 84)
dense2 output shape: (1, 10)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
X = layer(X)
print(layer.__class__.__name__,'output shape: \t',X.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Conv2d output shape: torch.Size([1, 6, 28, 28])
Sigmoid output shape: torch.Size([1, 6, 28, 28])
AvgPool2d output shape: torch.Size([1, 6, 14, 14])
Conv2d output shape: torch.Size([1, 16, 10, 10])
Sigmoid output shape: torch.Size([1, 16, 10, 10])
AvgPool2d output shape: torch.Size([1, 16, 5, 5])
Flatten output shape: torch.Size([1, 400])
Linear output shape: torch.Size([1, 120])
Sigmoid output shape: torch.Size([1, 120])
Linear output shape: torch.Size([1, 84])
Sigmoid output shape: torch.Size([1, 84])
Linear output shape: torch.Size([1, 10])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = tf.random.uniform((1, 28, 28, 1))
for layer in net().layers:
X = layer(X)
print(layer.__class__.__name__, 'output shape: \t', X.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Conv2D output shape: (1, 28, 28, 6)
AveragePooling2D output shape: (1, 14, 14, 6)
Conv2D output shape: (1, 10, 10, 16)
AveragePooling2D output shape: (1, 5, 5, 16)
Flatten output shape: (1, 400)
Dense output shape: (1, 120)
Dense output shape: (1, 84)
Dense output shape: (1, 10)
.. raw:: html
.. raw:: html
Note that the height and width of the representation at each layer
throughout the convolutional block is reduced (compared with the
previous layer). The first convolutional layer uses 2 pixels of padding
to compensate for the reduction in height and width that would otherwise
result from using a :math:`5 \times 5` kernel. In contrast, the second
convolutional layer forgoes padding, and thus the height and width are
both reduced by 4 pixels. As we go up the stack of layers, the number of
channels increases layer-over-layer from 1 in the input to 6 after the
first convolutional layer and 16 after the second convolutional layer.
However, each pooling layer halves the height and width. Finally, each
fully-connected layer reduces dimensionality, finally emitting an output
whose dimension matches the number of classes.
Training
--------
Now that we have implemented the model, let us run an experiment to see
how LeNet fares on Fashion-MNIST.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
While CNNs have fewer parameters, they can still be more expensive to
compute than similarly deep MLPs because each parameter participates in
many more multiplications. If you have access to a GPU, this might be a
good time to put it into action to speed up training.
.. raw:: html
.. raw:: html
For evaluation, we need to make a slight modification to the
``evaluate_accuracy`` function that we described in
:numref:`sec_softmax_scratch`. Since the full dataset is in the main
memory, we need to copy it to the GPU memory before the model uses GPU
to compute with the dataset.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
"""Compute the accuracy for a model on a dataset using a GPU."""
if not device: # Query the first device where the first parameter is on
device = list(net.collect_params().values())[0].list_ctx()[0]
# No. of correct predictions, no. of predictions
metric = d2l.Accumulator(2)
for X, y in data_iter:
X, y = X.as_in_ctx(device), y.as_in_ctx(device)
metric.add(d2l.accuracy(net(X), y), d2l.size(y))
return metric[0] / metric[1]
.. raw:: html
.. raw:: html
For evaluation, we need to make a slight modification to the
``evaluate_accuracy`` function that we described in
:numref:`sec_softmax_scratch`. Since the full dataset is in the main
memory, we need to copy it to the GPU memory before the model uses GPU
to compute with the dataset.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
"""Compute the accuracy for a model on a dataset using a GPU."""
if isinstance(net, nn.Module):
net.eval() # Set the model to evaluation mode
if not device:
device = next(iter(net.parameters())).device
# No. of correct predictions, no. of predictions
metric = d2l.Accumulator(2)
with torch.no_grad():
for X, y in data_iter:
if isinstance(X, list):
# Required for BERT Fine-tuning (to be covered later)
X = [x.to(device) for x in X]
else:
X = X.to(device)
y = y.to(device)
metric.add(d2l.accuracy(net(X), y), y.numel())
return metric[0] / metric[1]
.. raw:: html
.. raw:: html
We also need to update our training function to deal with GPUs. Unlike
the ``train_epoch_ch3`` defined in :numref:`sec_softmax_scratch`, we
now need to move each minibatch of data to our designated device
(hopefully, the GPU) prior to making the forward and backward
propagations.
The training function ``train_ch6`` is also similar to ``train_ch3``
defined in :numref:`sec_softmax_scratch`. Since we will be
implementing networks with many layers going forward, we will rely
primarily on high-level APIs. The following training function assumes a
model created from high-level APIs as input and is optimized
accordingly. We initialize the model parameters on the device indicated
by the ``device`` argument, using Xavier initialization as introduced in
:numref:`subsec_xavier`. Just as with MLPs, our loss function is
cross-entropy, and we minimize it via minibatch stochastic gradient
descent. Since each epoch takes tens of seconds to run, we visualize the
training loss more frequently.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
"""Train a model with a GPU (defined in Chapter 6)."""
net.initialize(force_reinit=True, ctx=device, init=init.Xavier())
loss = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(),
'sgd', {'learning_rate': lr})
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
legend=['train loss', 'train acc', 'test acc'])
timer, num_batches = d2l.Timer(), len(train_iter)
for epoch in range(num_epochs):
# Sum of training loss, sum of training accuracy, no. of examples
metric = d2l.Accumulator(3)
for i, (X, y) in enumerate(train_iter):
timer.start()
# Here is the major difference from `d2l.train_epoch_ch3`
X, y = X.as_in_ctx(device), y.as_in_ctx(device)
with autograd.record():
y_hat = net(X)
l = loss(y_hat, y)
l.backward()
trainer.step(X.shape[0])
metric.add(l.sum(), d2l.accuracy(y_hat, y), X.shape[0])
timer.stop()
train_l = metric[0] / metric[2]
train_acc = metric[1] / metric[2]
if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
animator.add(epoch + (i + 1) / num_batches,
(train_l, train_acc, None))
test_acc = evaluate_accuracy_gpu(net, test_iter)
animator.add(epoch + 1, (None, None, test_acc))
print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
f'test acc {test_acc:.3f}')
print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
f'on {str(device)}')
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
"""Train a model with a GPU (defined in Chapter 6)."""
def init_weights(m):
if type(m) == nn.Linear or type(m) == nn.Conv2d:
nn.init.xavier_uniform_(m.weight)
net.apply(init_weights)
print('training on', device)
net.to(device)
optimizer = torch.optim.SGD(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
legend=['train loss', 'train acc', 'test acc'])
timer, num_batches = d2l.Timer(), len(train_iter)
for epoch in range(num_epochs):
# Sum of training loss, sum of training accuracy, no. of examples
metric = d2l.Accumulator(3)
net.train()
for i, (X, y) in enumerate(train_iter):
timer.start()
optimizer.zero_grad()
X, y = X.to(device), y.to(device)
y_hat = net(X)
l = loss(y_hat, y)
l.backward()
optimizer.step()
with torch.no_grad():
metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
timer.stop()
train_l = metric[0] / metric[2]
train_acc = metric[1] / metric[2]
if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
animator.add(epoch + (i + 1) / num_batches,
(train_l, train_acc, None))
test_acc = evaluate_accuracy_gpu(net, test_iter)
animator.add(epoch + 1, (None, None, test_acc))
print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
f'test acc {test_acc:.3f}')
print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
f'on {str(device)}')
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class TrainCallback(tf.keras.callbacks.Callback): #@save
"""A callback to visiualize the training progress."""
def __init__(self, net, train_iter, test_iter, num_epochs, device_name):
self.timer = d2l.Timer()
self.animator = d2l.Animator(
xlabel='epoch', xlim=[1, num_epochs], legend=[
'train loss', 'train acc', 'test acc'])
self.net = net
self.train_iter = train_iter
self.test_iter = test_iter
self.num_epochs = num_epochs
self.device_name = device_name
def on_epoch_begin(self, epoch, logs=None):
self.timer.start()
def on_epoch_end(self, epoch, logs):
self.timer.stop()
test_acc = self.net.evaluate(
self.test_iter, verbose=0, return_dict=True)['accuracy']
metrics = (logs['loss'], logs['accuracy'], test_acc)
self.animator.add(epoch + 1, metrics)
if epoch == self.num_epochs - 1:
batch_size = next(iter(self.train_iter))[0].shape[0]
num_examples = batch_size * tf.data.experimental.cardinality(
self.train_iter).numpy()
print(f'loss {metrics[0]:.3f}, train acc {metrics[1]:.3f}, '
f'test acc {metrics[2]:.3f}')
print(f'{num_examples / self.timer.avg():.1f} examples/sec on '
f'{str(self.device_name)}')
#@save
def train_ch6(net_fn, train_iter, test_iter, num_epochs, lr, device):
"""Train a model with a GPU (defined in Chapter 6)."""
device_name = device._device_name
strategy = tf.distribute.OneDeviceStrategy(device_name)
with strategy.scope():
optimizer = tf.keras.optimizers.SGD(learning_rate=lr)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
net = net_fn()
net.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
callback = TrainCallback(net, train_iter, test_iter, num_epochs,
device_name)
net.fit(train_iter, epochs=num_epochs, verbose=0, callbacks=[callback])
return net
.. raw:: html
.. raw:: html
Now let us train and evaluate the LeNet-5 model.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs = 0.9, 10
train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.474, train acc 0.822, test acc 0.810
38411.7 examples/sec on gpu(0)
.. figure:: output_lenet_4a2e9e_52_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs = 0.9, 10
train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.457, train acc 0.829, test acc 0.809
84565.2 examples/sec on cuda:0
.. figure:: output_lenet_4a2e9e_55_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs = 0.9, 10
train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.460, train acc 0.826, test acc 0.812
59802.1 examples/sec on /GPU:0
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. figure:: output_lenet_4a2e9e_58_2.svg
.. raw:: html
.. raw:: html
Summary
-------
- A CNN is a network that employs convolutional layers.
- In a CNN, we interleave convolutions, nonlinearities, and (often)
pooling operations.
- In a CNN, convolutional layers are typically arranged so that they
gradually decrease the spatial resolution of the representations,
while increasing the number of channels.
- In traditional CNNs, the representations encoded by the convolutional
blocks are processed by one or more fully-connected layers prior to
emitting output.
- LeNet was arguably the first successful deployment of such a network.
Exercises
---------
1. Replace the average pooling with maximum pooling. What happens?
2. Try to construct a more complex network based on LeNet to improve its
accuracy.
1. Adjust the convolution window size.
2. Adjust the number of output channels.
3. Adjust the activation function (e.g., ReLU).
4. Adjust the number of convolution layers.
5. Adjust the number of fully connected layers.
6. Adjust the learning rates and other training details (e.g.,
initialization and number of epochs.)
3. Try out the improved network on the original MNIST dataset.
4. Display the activations of the first and second layer of LeNet for
different inputs (e.g., sweaters and coats).
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html