.. _sec_vgg:
Networks Using Blocks (VGG)
===========================
While AlexNet offered empirical evidence that deep CNNs can achieve good
results, it did not provide a general template to guide subsequent
researchers in designing new networks. In the following sections, we
will introduce several heuristic concepts commonly used to design deep
networks.
Progress in this field mirrors that in chip design where engineers went
from placing transistors to logical elements to logic blocks. Similarly,
the design of neural network architectures had grown progressively more
abstract, with researchers moving from thinking in terms of individual
neurons to whole layers, and now to blocks, repeating patterns of
layers.
The idea of using blocks first emerged from the `Visual Geometry
Group `__ (VGG) at Oxford University,
in their eponymously-named *VGG* network. It is easy to implement these
repeated structures in code with any modern deep learning framework by
using loops and subroutines.
.. _subsec_vgg-blocks:
VGG Blocks
----------
The basic building block of classic CNNs is a sequence of the following:
(i) a convolutional layer with padding to maintain the resolution, (ii)
a nonlinearity such as a ReLU, (iii) a pooling layer such as a maximum
pooling layer. One VGG block consists of a sequence of convolutional
layers, followed by a maximum pooling layer for spatial downsampling. In
the original VGG paper :cite:`Simonyan.Zisserman.2014`, the authors
employed convolutions with :math:`3\times3` kernels with padding of 1
(keeping height and width) and :math:`2 \times 2` maximum pooling with
stride of 2 (halving the resolution after each block). In the code
below, we define a function called ``vgg_block`` to implement one VGG
block.
.. raw:: html
.. raw:: html
The function takes two arguments corresponding to the number of
convolutional layers ``num_convs`` and the number of output channels
``num_channels``.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
def vgg_block(num_convs, num_channels):
blk = nn.Sequential()
for _ in range(num_convs):
blk.add(nn.Conv2D(num_channels, kernel_size=3,
padding=1, activation='relu'))
blk.add(nn.MaxPool2D(pool_size=2, strides=2))
return blk
.. raw:: html
.. raw:: html
The function takes three arguments corresponding to the number of
convolutional layers ``num_convs``, the number of input channels
``in_channels`` and the number of output channels ``out_channels``.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from torch import nn
from d2l import torch as d2l
def vgg_block(num_convs, in_channels, out_channels):
layers = []
for _ in range(num_convs):
layers.append(nn.Conv2d(in_channels, out_channels,
kernel_size=3, padding=1))
layers.append(nn.ReLU())
in_channels = out_channels
layers.append(nn.MaxPool2d(kernel_size=2,stride=2))
return nn.Sequential(*layers)
.. raw:: html
.. raw:: html
The function takes two arguments corresponding to the number of
convolutional layers ``num_convs`` and the number of output channels
``num_channels``.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import tensorflow as tf
from d2l import tensorflow as d2l
def vgg_block(num_convs, num_channels):
blk = tf.keras.models.Sequential()
for _ in range(num_convs):
blk.add(tf.keras.layers.Conv2D(num_channels,kernel_size=3,
padding='same',activation='relu'))
blk.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
return blk
.. raw:: html
.. raw:: html
VGG Network
-----------
Like AlexNet and LeNet, the VGG Network can be partitioned into two
parts: the first consisting mostly of convolutional and pooling layers
and the second consisting of fully-connected layers. This is depicted in
:numref:`fig_vgg`.
.. _fig_vgg:
.. figure:: ../img/vgg.svg
:width: 400px
From AlexNet to VGG that is designed from building blocks.
The convolutional part of the network connects several VGG blocks from
:numref:`fig_vgg` (also defined in the ``vgg_block`` function) in
succession. The following variable ``conv_arch`` consists of a list of
tuples (one per block), where each contains two values: the number of
convolutional layers and the number of output channels, which are
precisely the arguments required to call the ``vgg_block`` function. The
fully-connected part of the VGG network is identical to that covered in
AlexNet.
The original VGG network had 5 convolutional blocks, among which the
first two have one convolutional layer each and the latter three contain
two convolutional layers each. The first block has 64 output channels
and each subsequent block doubles the number of output channels, until
that number reaches 512. Since this network uses 8 convolutional layers
and 3 fully-connected layers, it is often called VGG-11.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
The following code implements VGG-11. This is a simple matter of
executing a for-loop over ``conv_arch``.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def vgg(conv_arch):
net = nn.Sequential()
# The convolutional part
for (num_convs, num_channels) in conv_arch:
net.add(vgg_block(num_convs, num_channels))
# The fully-connected part
net.add(nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
nn.Dense(10))
return net
net = vgg(conv_arch)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def vgg(conv_arch):
conv_blks = []
in_channels = 1
# The convolutional part
for (num_convs, out_channels) in conv_arch:
conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
in_channels = out_channels
return nn.Sequential(
*conv_blks, nn.Flatten(),
# The fully-connected part
nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
nn.Linear(4096, 10))
net = vgg(conv_arch)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def vgg(conv_arch):
net = tf.keras.models.Sequential()
# The convulational part
for (num_convs, num_channels) in conv_arch:
net.add(vgg_block(num_convs, num_channels))
# The fully-connected part
net.add(tf.keras.models.Sequential([
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(4096, activation='relu'),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(4096, activation='relu'),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(10)]))
return net
net = vgg(conv_arch)
.. raw:: html
.. raw:: html
Next, we will construct a single-channel data example with a height and
width of 224 to observe the output shape of each layer.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net.initialize()
X = np.random.uniform(size=(1, 1, 224, 224))
for blk in net:
X = blk(X)
print(blk.name, 'output shape:\t', X.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
sequential1 output shape: (1, 64, 112, 112)
sequential2 output shape: (1, 128, 56, 56)
sequential3 output shape: (1, 256, 28, 28)
sequential4 output shape: (1, 512, 14, 14)
sequential5 output shape: (1, 512, 7, 7)
dense0 output shape: (1, 4096)
dropout0 output shape: (1, 4096)
dense1 output shape: (1, 4096)
dropout1 output shape: (1, 4096)
dense2 output shape: (1, 10)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = torch.randn(size=(1, 1, 224, 224))
for blk in net:
X = blk(X)
print(blk.__class__.__name__,'output shape:\t',X.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Sequential output shape: torch.Size([1, 64, 112, 112])
Sequential output shape: torch.Size([1, 128, 56, 56])
Sequential output shape: torch.Size([1, 256, 28, 28])
Sequential output shape: torch.Size([1, 512, 14, 14])
Sequential output shape: torch.Size([1, 512, 7, 7])
Flatten output shape: torch.Size([1, 25088])
Linear output shape: torch.Size([1, 4096])
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
Linear output shape: torch.Size([1, 4096])
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
Linear output shape: torch.Size([1, 10])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = tf.random.uniform((1, 224, 224, 1))
for blk in net.layers:
X = blk(X)
print(blk.__class__.__name__,'output shape:\t', X.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
Sequential output shape: (1, 112, 112, 64)
Sequential output shape: (1, 56, 56, 128)
Sequential output shape: (1, 28, 28, 256)
Sequential output shape: (1, 14, 14, 512)
Sequential output shape: (1, 7, 7, 512)
Sequential output shape: (1, 10)
.. raw:: html
.. raw:: html
As you can see, we halve height and width at each block, finally
reaching a height and width of 7 before flattening the representations
for processing by the fully-connected part of the network.
Training
--------
Since VGG-11 is more computationally-heavy than AlexNet we construct a
network with a smaller number of channels. This is more than sufficient
for training on Fashion-MNIST.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
# Recall that this has to be a function that will be passed to
# `d2l.train_ch6()` so that model building/compiling need to be within
# `strategy.scope()` in order to utilize the CPU/GPU devices that we have
net = lambda: vgg(small_conv_arch)
.. raw:: html
.. raw:: html
Apart from using a slightly larger learning rate, the model training
process is similar to that of AlexNet in :numref:`sec_alexnet`.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.175, train acc 0.935, test acc 0.926
1783.1 examples/sec on gpu(0)
.. figure:: output_vgg_4a7574_56_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.187, train acc 0.930, test acc 0.916
2563.9 examples/sec on cuda:0
.. figure:: output_vgg_4a7574_59_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.177, train acc 0.934, test acc 0.918
2966.4 examples/sec on /GPU:0
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. figure:: output_vgg_4a7574_62_2.svg
.. raw:: html
.. raw:: html
Summary
-------
- VGG-11 constructs a network using reusable convolutional blocks.
Different VGG models can be defined by the differences in the number
of convolutional layers and output channels in each block.
- The use of blocks leads to very compact representations of the
network definition. It allows for efficient design of complex
networks.
- In their VGG paper, Simonyan and Ziserman experimented with various
architectures. In particular, they found that several layers of deep
and narrow convolutions (i.e., :math:`3 \times 3`) were more
effective than fewer layers of wider convolutions.
Exercises
---------
1. When printing out the dimensions of the layers we only saw 8 results
rather than 11. Where did the remaining 3 layer information go?
2. Compared with AlexNet, VGG is much slower in terms of computation,
and it also needs more GPU memory. Analyze the reasons for this.
3. Try changing the height and width of the images in Fashion-MNIST from
224 to 96. What influence does this have on the experiments?
4. Refer to Table 1 in the VGG paper :cite:`Simonyan.Zisserman.2014`
to construct other common models, such as VGG-16 or VGG-19.
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html