.. _sec_resnet:

Residual Networks (ResNet)
==========================


As we design increasingly deep networks, it becomes imperative to
understand how adding layers can increase the complexity and
expressiveness of the network. Even more important is the ability to
design networks where adding layers makes them strictly more
expressive rather than just different. To make some progress, we need a
bit of mathematics.

Function Classes
----------------

Consider :math:`\mathcal{F}`, the class of functions that a specific
network architecture (together with learning rates and other
hyperparameter settings) can reach. That is, for all
:math:`f \in \mathcal{F}` there exists some set of parameters (e.g.,
weights and biases) that can be obtained through training on a suitable
dataset. Let us assume that :math:`f^*` is the "truth" function that we
really would like to find. If it is in :math:`\mathcal{F}`, we are in
good shape but typically we will not be quite so lucky. Instead, we will
try to find some :math:`f^*_\mathcal{F}` which is our best bet within
:math:`\mathcal{F}`. For instance, given a dataset with features
:math:`\mathbf{X}` and labels :math:`\mathbf{y}`, we might try finding
it by solving the following optimization problem:

.. math:: f^*_\mathcal{F} \stackrel{\mathrm{def}}{=} \mathop{\mathrm{argmin}}_f L(\mathbf{X}, \mathbf{y}, f) \text{ subject to } f \in \mathcal{F}.

It is only reasonable to assume that if we design a different and more
powerful architecture :math:`\mathcal{F}'` we should arrive at a better
outcome. In other words, we would expect that :math:`f^*_{\mathcal{F}'}`
is "better" than :math:`f^*_{\mathcal{F}}`. However, if
:math:`\mathcal{F} \not\subseteq \mathcal{F}'` there is no guarantee
that this should even happen. In fact, :math:`f^*_{\mathcal{F}'}` might
well be worse. As illustrated by :numref:`fig_functionclasses`, for
non-nested function classes, a larger function class does not always
move closer to the "truth" function :math:`f^*`. For instance, on the
left of :numref:`fig_functionclasses`, though :math:`\mathcal{F}_3` is
closer to :math:`f^*` than :math:`\mathcal{F}_1`, :math:`\mathcal{F}_6`
moves away and there is no guarantee that further increasing the
complexity can reduce the distance from :math:`f^*`. With nested
function classes, where
:math:`\mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_6` on the
right of :numref:`fig_functionclasses`, we avoid this issue: every
larger class contains all the smaller ones, so enlarging the class can
never move its best function further away from :math:`f^*`.

.. _fig_functionclasses:

.. figure:: ../img/functionclasses.svg

   For non-nested function classes, a larger (indicated by area)
   function class is not guaranteed to get closer to the "truth"
   function (:math:`f^*`). This does not happen in nested function
   classes.
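
To make the role of nesting concrete, here is a minimal toy sketch in
plain Python. It is not part of the original text: the "truth" function,
the sample points, and the three small function classes (``F_small``,
``F_nested``, and ``F_other``) are assumptions chosen purely for
illustration. Each class is a finite set of lines
:math:`x \mapsto ax + b`, and the best function in a class is the one
with the smallest mean squared error on the samples.

.. code:: python

    def mse(f, xs, ys):
        """Mean squared error of function f on the samples (xs, ys)."""
        return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    def best_loss(function_class, xs, ys):
        """Loss of the best function in a (finite) function class."""
        return min(mse(f, xs, ys) for f in function_class)

    def line(a, b):
        """Return the function x -> a * x + b."""
        return lambda x: a * x + b

    # Hypothetical "truth" function f*(x) = 0.5 * x + 1 at a few points
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0.5 * x + 1 for x in xs]

    # F_small: constant functions only (slope 0)
    F_small = [line(0, b) for b in (0, 1, 2)]
    # F_nested: every function of F_small plus lines with slope 0.5
    F_nested = F_small + [line(0.5, b) for b in (0, 1, 2)]
    # F_other: as large as F_nested, but it does NOT contain F_small
    F_other = [line(a, b) for a in (2, 3) for b in (0, 1, 2)]

    loss_small = best_loss(F_small, xs, ys)
    loss_nested = best_loss(F_nested, xs, ys)
    loss_other = best_loss(F_other, xs, ys)

    # Nesting guarantees that the best loss cannot get worse
    print(loss_nested <= loss_small)  # True
    # Without nesting, a larger class may end up further away from f*
    print(loss_other > loss_small)    # True

In this toy setting the nested class cannot do worse than the class it
contains, whereas the non-nested class of the same size does worse than
the smaller class.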


Thus, only if larger function classes contain the smaller ones are we
guaranteed that increasing them strictly increases the expressive power
of the network. For deep neural networks, if we can train the
newly-added layer into an identity function
:math:`f(\mathbf{x}) = \mathbf{x}`, the new model will be as effective
as the original model. As the new model may get a better solution to fit
the training dataset, the added layer might make it easier to reduce
training errors.

This is the question that He et al. considered when working on very deep
computer vision models :cite:`He.Zhang.Ren.ea.2016`. At the heart of
their proposed *residual network* (*ResNet*) is the idea that every
additional layer should more easily contain the identity function as one
of its elements. These considerations are rather profound but they led
to a surprisingly simple solution, a *residual block*. With it, ResNet
won the ImageNet Large Scale Visual Recognition Challenge in 2015. The
design had a profound influence on how to build deep neural networks.

Residual Blocks
---------------

Let us focus on a local part of a neural network, as depicted in
:numref:`fig_residual_block`. Denote the input by :math:`\mathbf{x}`.
We assume that the desired underlying mapping we want to obtain by
learning is :math:`f(\mathbf{x})`, to be used as the input to the
activation function on the top. On the left of
:numref:`fig_residual_block`, the portion within the dotted-line box
must directly learn the mapping :math:`f(\mathbf{x})`. On the right, the
portion within the dotted-line box needs to learn the *residual mapping*
:math:`f(\mathbf{x}) - \mathbf{x}`, which is how the residual block
derives its name. If the identity mapping
:math:`f(\mathbf{x}) = \mathbf{x}` is the desired underlying mapping,
the residual mapping is easier to learn: we only need to push the
weights and biases of the upper weight layer (e.g., a fully-connected
or convolutional layer) within the dotted-line box to zero. The
right figure in :numref:`fig_residual_block` illustrates the *residual
block* of ResNet, where the solid line carrying the layer input
:math:`\mathbf{x}` to the addition operator is called a *residual
connection* (or *shortcut connection*). With residual blocks, inputs can
forward propagate faster through the residual connections across layers.

.. _fig_residual_block:

.. figure:: ../img/residual-block.svg

   A regular block (left) and a residual block (right).
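
As a quick check of the claim that the identity mapping is easy to
represent, the following sketch implements a tiny residual block on
plain Python lists. It is an illustration only: ``relu``, ``linear``,
``regular_block``, and ``residual_block`` are hypothetical helpers, a
single fully-connected layer stands in for the weight layers inside the
dotted-line box, and the input is assumed to be non-negative (as it
would be after a preceding ReLU).

.. code:: python

    def relu(v):
        """Elementwise ReLU on a list of numbers."""
        return [max(0.0, a) for a in v]

    def linear(W, b, x):
        """Fully-connected layer: returns W x + b."""
        return [sum(w * a for w, a in zip(row, x)) + bias
                for row, bias in zip(W, b)]

    def regular_block(W, b, x):
        """Must learn f(x) directly: output = relu(W x + b)."""
        return relu(linear(W, b, x))

    def residual_block(W, b, x):
        """Learns the residual g(x) = W x + b: output = relu(g(x) + x)."""
        g = linear(W, b, x)
        return relu([gi + xi for gi, xi in zip(g, x)])

    x = [1.0, 2.0, 0.5]  # non-negative input, e.g., the output of a ReLU
    W_zero = [[0.0] * 3 for _ in range(3)]
    b_zero = [0.0] * 3

    # With all weights and biases at zero, the residual block is the
    # identity on this input ...
    print(residual_block(W_zero, b_zero, x))  # [1.0, 2.0, 0.5]
    # ... whereas the regular block maps the input to zero
    print(regular_block(W_zero, b_zero, x))   # [0.0, 0.0, 0.0]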


ResNet follows VGG's full :math:`3\times 3` convolutional layer design.
The residual block has two :math:`3\times 3` convolutional layers with
the same number of output channels. Each convolutional layer is followed
by a batch normalization layer and a ReLU activation function. Then, we
skip these two convolution operations and add the input directly before
the final ReLU activation function. This design requires the output of
the two convolutional layers to have the same shape as the input, so
that the two can be added together. If we want to change the number of
channels or the resolution, we need to introduce an additional
:math:`1\times 1` convolutional layer to transform the input into the
desired shape for the addition operation. Let us have a look at the code
below.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-1-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-1-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-1-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-1-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    from mxnet import np, npx
    from mxnet.gluon import nn
    from d2l import mxnet as d2l
    
    npx.set_np()
    
    class Residual(nn.Block):  #@save
        """The Residual block of ResNet."""
        def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
            super().__init__(**kwargs)
            self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                                   strides=strides)
            self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
            if use_1x1conv:
                self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                       strides=strides)
            else:
                self.conv3 = None
            self.bn1 = nn.BatchNorm()
            self.bn2 = nn.BatchNorm()
    
        def forward(self, X):
            Y = npx.relu(self.bn1(self.conv1(X)))
            Y = self.bn2(self.conv2(Y))
            if self.conv3 is not None:
                X = self.conv3(X)
            return npx.relu(Y + X)


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-1-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    import torch
    from torch import nn
    from torch.nn import functional as F
    from d2l import torch as d2l
    
    
    class Residual(nn.Module):  #@save
        """The Residual block of ResNet."""
        def __init__(self, input_channels, num_channels,
                     use_1x1conv=False, strides=1):
            super().__init__()
            self.conv1 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=3, padding=1, stride=strides)
            self.conv2 = nn.Conv2d(num_channels, num_channels,
                                   kernel_size=3, padding=1)
            if use_1x1conv:
                self.conv3 = nn.Conv2d(input_channels, num_channels,
                                       kernel_size=1, stride=strides)
            else:
                self.conv3 = None
            self.bn1 = nn.BatchNorm2d(num_channels)
            self.bn2 = nn.BatchNorm2d(num_channels)
    
        def forward(self, X):
            Y = F.relu(self.bn1(self.conv1(X)))
            Y = self.bn2(self.conv2(Y))
            if self.conv3 is not None:
                X = self.conv3(X)
            Y += X
            return F.relu(Y)


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-1-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    import tensorflow as tf
    from d2l import tensorflow as d2l
    
    
    class Residual(tf.keras.Model):  #@save
        """The Residual block of ResNet."""
        def __init__(self, num_channels, use_1x1conv=False, strides=1):
            super().__init__()
            self.conv1 = tf.keras.layers.Conv2D(
                num_channels, padding='same', kernel_size=3, strides=strides)
            self.conv2 = tf.keras.layers.Conv2D(
                num_channels, kernel_size=3, padding='same')
            self.conv3 = None
            if use_1x1conv:
                self.conv3 = tf.keras.layers.Conv2D(
                    num_channels, kernel_size=1, strides=strides)
            self.bn1 = tf.keras.layers.BatchNormalization()
            self.bn2 = tf.keras.layers.BatchNormalization()
    
        def call(self, X):
            Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
            Y = self.bn2(self.conv2(Y))
            if self.conv3 is not None:
                X = self.conv3(X)
            Y += X
            return tf.keras.activations.relu(Y)


.. raw:: html

     </div>


.. raw:: html

     </div>

This code generates two types of networks: one where we add the input to
the output before applying the ReLU nonlinearity whenever
``use_1x1conv=False``, and one where we adjust channels and resolution
by means of a :math:`1 \times 1` convolution before adding.
:numref:`fig_resnet_block` illustrates this:

.. _fig_resnet_block:

.. figure:: ../img/resnet-block.svg

   ResNet block with and without :math:`1 \times 1` convolution.


Now let us look at a situation where the input and output are of the
same shape.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-3-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-3-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-3-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-3-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    blk = Residual(3)
    blk.initialize()
    X = np.random.uniform(size=(4, 3, 6, 6))
    blk(X).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (4, 3, 6, 6)


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-3-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    blk = Residual(3, 3)
    X = torch.rand(4, 3, 6, 6)
    Y = blk(X)
    Y.shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    torch.Size([4, 3, 6, 6])


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-3-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    blk = Residual(3)
    X = tf.random.uniform((4, 6, 6, 3))
    Y = blk(X)
    Y.shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    TensorShape([4, 6, 6, 3])


.. raw:: html

     </div>


.. raw:: html

     </div>

We also have the option to halve the output height and width while
increasing the number of output channels.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-5-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-5-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-5-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-5-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    blk = Residual(6, use_1x1conv=True, strides=2)
    blk.initialize()
    blk(X).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (4, 6, 3, 3)


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-5-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    blk = Residual(3, 6, use_1x1conv=True, strides=2)
    blk(X).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    torch.Size([4, 6, 3, 3])


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-5-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    blk = Residual(6, use_1x1conv=True, strides=2)
    blk(X).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    TensorShape([4, 3, 3, 6])


.. raw:: html

     </div>


.. raw:: html

     </div>
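
The output shapes above follow from the usual output-size rule for
convolutions, :math:`\lfloor (n + 2p - k)/s \rfloor + 1`, for input size
:math:`n`, kernel size :math:`k`, padding :math:`p`, and stride
:math:`s`. The helper ``conv_out`` below is a hypothetical name
introduced for this sketch. It assumes explicit symmetric padding as in
the MXNet and PyTorch versions; the TensorFlow version uses
``padding='same'`` instead, which yields the same sizes for the inputs
used here.

.. code:: python

    def conv_out(n, kernel_size, padding=0, stride=1):
        """Output height/width of a convolution on an input of size n."""
        return (n + 2 * padding - kernel_size) // stride + 1

    n = 6  # height and width of the example input X

    # use_1x1conv=False, strides=1: both 3x3 convolutions keep the size
    same = conv_out(conv_out(n, 3, padding=1), 3, padding=1)
    print(same)  # 6

    # use_1x1conv=True, strides=2: the first 3x3 convolution halves the
    # size, the second one keeps it ...
    main_path = conv_out(conv_out(n, 3, padding=1, stride=2), 3, padding=1)
    # ... and the 1x1 convolution on the shortcut must match that size
    shortcut = conv_out(n, 1, stride=2)
    print(main_path, shortcut)  # 3 3

Because the main path and the shortcut produce the same height and
width, and the :math:`1\times 1` convolution also produces the same
number of channels, the two tensors can be added.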

ResNet Model
------------

The first two layers of ResNet are the same as those of the GoogLeNet we
described before: the :math:`7\times 7` convolutional layer with 64
output channels and a stride of 2 is followed by the :math:`3\times 3`
maximum pooling layer with a stride of 2. The difference is the batch
normalization layer added after each convolutional layer in ResNet.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-7-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-7-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-7-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-7-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    net = nn.Sequential()
    net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
            nn.BatchNorm(), nn.Activation('relu'),
            nn.MaxPool2D(pool_size=3, strides=2, padding=1))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-7-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                       nn.BatchNorm2d(64), nn.ReLU(),
                       nn.MaxPool2d(kernel_size=3, stride=2, padding=1))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-7-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    b1 = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(64, kernel_size=7, strides=2, padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding='same')])


.. raw:: html

     </div>


.. raw:: html

     </div>

GoogLeNet uses four modules made up of Inception blocks. ResNet
likewise uses four modules, but each is made up of several residual
blocks with the same number of output channels. The number of channels
in the first module is the same as the number of input channels. Since
a maximum pooling layer with a stride of 2 has already been used, the
first module does not need to reduce the height and width. In the first
residual block of each subsequent module, the number of channels is
doubled compared with that of the previous module, and the height and
width are halved.

Now, we implement this module. Note that the first module is handled as
a special case (``first_block=True``): it keeps both the number of
channels and the resolution of its input.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-9-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-9-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-9-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-9-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.Sequential()
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(Residual(num_channels))
        return blk


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-9-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def resnet_block(input_channels, num_channels, num_residuals,
                     first_block=False):
        blk = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.append(Residual(input_channels, num_channels,
                                    use_1x1conv=True, strides=2))
            else:
                blk.append(Residual(num_channels, num_channels))
        return blk


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-9-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    class ResnetBlock(tf.keras.layers.Layer):
        def __init__(self, num_channels, num_residuals, first_block=False,
                     **kwargs):
            super(ResnetBlock, self).__init__(**kwargs)
            self.residual_layers = []
            for i in range(num_residuals):
                if i == 0 and not first_block:
                    self.residual_layers.append(
                        Residual(num_channels, use_1x1conv=True, strides=2))
                else:
                    self.residual_layers.append(Residual(num_channels))
    
        def call(self, X):
            for layer in self.residual_layers:
                X = layer(X)
            return X


.. raw:: html

     </div>


.. raw:: html

     </div>
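
To see which residual blocks change the shape, the sketch below lists,
for every residual block, its input channels, output channels, stride,
and whether a :math:`1\times 1` convolution is needed on the shortcut.
``block_plan`` is a hypothetical helper written for this illustration.
It mirrors the ``i == 0 and not first_block`` rule of the code above and
assumes the configuration used in this section (64, 128, 256, and 512
channels with two residual blocks per module).

.. code:: python

    def block_plan(input_channels, num_channels, num_residuals,
                   first_block=False):
        """Describe each residual block of one module as a tuple
        (in_channels, out_channels, stride, use_1x1conv)."""
        plan = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                # First block of a later module: halve the height and
                # width and change the number of channels
                plan.append((input_channels, num_channels, 2, True))
            else:
                # All other blocks keep both channels and resolution
                plan.append((num_channels, num_channels, 1, False))
        return plan

    # (input channels, output channels, first_block) for the four modules
    modules = [(64, 64, True), (64, 128, False),
               (128, 256, False), (256, 512, False)]
    plan = []
    for in_c, out_c, first in modules:
        plan += block_plan(in_c, out_c, 2, first_block=first)

    for entry in plan:
        print(entry)

Under these assumptions, only three of the eight residual blocks use a
:math:`1\times 1` convolution, namely the first block of the second,
third, and fourth module.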

Then, we add all the modules to ResNet. Here, two residual blocks are
used for each module.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-11-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-11-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-11-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-11-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-11-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
    b3 = nn.Sequential(*resnet_block(64, 128, 2))
    b4 = nn.Sequential(*resnet_block(128, 256, 2))
    b5 = nn.Sequential(*resnet_block(256, 512, 2))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-11-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    b2 = ResnetBlock(64, 2, first_block=True)
    b3 = ResnetBlock(128, 2)
    b4 = ResnetBlock(256, 2)
    b5 = ResnetBlock(512, 2)


.. raw:: html

     </div>


.. raw:: html

     </div>

Finally, just like GoogLeNet, we add a global average pooling layer,
followed by the fully-connected output layer.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-13-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-13-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-13-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-13-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    net.add(nn.GlobalAvgPool2D(), nn.Dense(10))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-13-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    net = nn.Sequential(b1, b2, b3, b4, b5,
                        nn.AdaptiveAvgPool2d((1, 1)),
                        nn.Flatten(), nn.Linear(512, 10))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-13-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    # Recall that we define this as a function so that we can reuse it later
    # and run it within `tf.distribute.MirroredStrategy`'s scope to utilize
    # various computational resources, e.g., GPUs. Also note that even though
    # we have already created b1, b2, b3, b4, and b5, we recreate them inside
    # this function's scope instead
    def net():
        return tf.keras.Sequential([
            # The following layers are the same as b1 that we created earlier
            tf.keras.layers.Conv2D(64, kernel_size=7, strides=2, padding='same'),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Activation('relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding='same'),
            # The following layers are the same as b2, b3, b4, and b5 that we
            # created earlier
            ResnetBlock(64, 2, first_block=True),
            ResnetBlock(128, 2),
            ResnetBlock(256, 2),
            ResnetBlock(512, 2),
            tf.keras.layers.GlobalAvgPool2D(),
            tf.keras.layers.Dense(units=10)])


.. raw:: html

     </div>


.. raw:: html

     </div>

There are 4 convolutional layers in each module (excluding the
:math:`1\times 1` convolutional layer). Together with the first
:math:`7\times 7` convolutional layer and the final fully-connected
layer, there are 18 layers in total. Therefore, this model is commonly
known as ResNet-18. By configuring different numbers of channels and
residual blocks in the module, we can create different ResNet models,
such as the deeper 152-layer ResNet-152. Although the main architecture
of ResNet is similar to that of GoogLeNet, ResNet's structure is simpler
and easier to modify. All these factors have resulted in the rapid and
widespread use of ResNet. :numref:`fig_resnet18` depicts the full
ResNet-18.

.. _fig_resnet18:

.. figure:: ../img/resnet18.svg

   The ResNet-18 architecture.
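
The layer count can be reproduced with a few lines of arithmetic.
``count_layers`` is a hypothetical helper for this sketch. The block
counts for ResNet-34 and ResNet-152 are not stated in this section; they
are taken from Table 1 of the ResNet paper
:cite:`He.Zhang.Ren.ea.2016`, where ResNet-152 uses "bottleneck" blocks
with three convolutional layers each.

.. code:: python

    def count_layers(blocks_per_module, convs_per_block):
        """Count the first 7x7 convolution, the convolutions inside the
        residual blocks (1x1 shortcut convolutions excluded), and the
        final fully-connected layer."""
        return 1 + sum(blocks_per_module) * convs_per_block + 1

    print(count_layers([2, 2, 2, 2], convs_per_block=2))   # 18
    print(count_layers([3, 4, 6, 3], convs_per_block=2))   # 34
    print(count_layers([3, 8, 36, 3], convs_per_block=3))  # 152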


Before training ResNet, let us observe how the input shape changes
across different modules in ResNet. As in all the previous
architectures, the resolution decreases while the number of channels
increases up until the point where a global average pooling layer
aggregates all features.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-15-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-15-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-15-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-15-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    X = np.random.uniform(size=(1, 1, 224, 224))
    net.initialize()
    for layer in net:
        X = layer(X)
        print(layer.name, 'output shape:\t', X.shape)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    conv5 output shape:	 (1, 64, 112, 112)
    batchnorm4 output shape:	 (1, 64, 112, 112)
    relu0 output shape:	 (1, 64, 112, 112)
    pool0 output shape:	 (1, 64, 56, 56)
    sequential1 output shape:	 (1, 64, 56, 56)
    sequential2 output shape:	 (1, 128, 28, 28)
    sequential3 output shape:	 (1, 256, 14, 14)
    sequential4 output shape:	 (1, 512, 7, 7)
    pool1 output shape:	 (1, 512, 1, 1)
    dense0 output shape:	 (1, 10)


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-15-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    X = torch.rand(size=(1, 1, 224, 224))
    for layer in net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Sequential output shape:	 torch.Size([1, 64, 56, 56])
    Sequential output shape:	 torch.Size([1, 64, 56, 56])
    Sequential output shape:	 torch.Size([1, 128, 28, 28])
    Sequential output shape:	 torch.Size([1, 256, 14, 14])
    Sequential output shape:	 torch.Size([1, 512, 7, 7])
    AdaptiveAvgPool2d output shape:	 torch.Size([1, 512, 1, 1])
    Flatten output shape:	 torch.Size([1, 512])
    Linear output shape:	 torch.Size([1, 10])


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-15-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    X = tf.random.uniform(shape=(1, 224, 224, 1))
    for layer in net().layers:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Conv2D output shape:	 (1, 112, 112, 64)
    BatchNormalization output shape:	 (1, 112, 112, 64)
    Activation output shape:	 (1, 112, 112, 64)
    MaxPooling2D output shape:	 (1, 56, 56, 64)
    ResnetBlock output shape:	 (1, 56, 56, 64)
    ResnetBlock output shape:	 (1, 28, 28, 128)
    ResnetBlock output shape:	 (1, 14, 14, 256)
    ResnetBlock output shape:	 (1, 7, 7, 512)
    GlobalAveragePooling2D output shape:	 (1, 512)
    Dense output shape:	 (1, 10)


.. raw:: html

     </div>


.. raw:: html

     </div>
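
The printed shapes can also be derived by hand. The following sketch
traces the number of channels and the resolution of a
:math:`1\times 224\times 224` input through the network with the
output-size rule :math:`\lfloor (n + 2p - k)/s \rfloor + 1` for input
size :math:`n`, kernel size :math:`k`, padding :math:`p`, and stride
:math:`s`. ``conv_out`` and the variable names are assumptions made for
this illustration, and the explicit padding values follow the MXNet and
PyTorch versions.

.. code:: python

    def conv_out(n, kernel_size, padding=0, stride=1):
        """Output height/width of a convolution or pooling layer."""
        return (n + 2 * padding - kernel_size) // stride + 1

    channels, size = 1, 224
    shapes = []

    # 7x7 convolution (stride 2, padding 3) with 64 output channels
    channels, size = 64, conv_out(size, 7, padding=3, stride=2)
    shapes.append((channels, size, size))
    # 3x3 maximum pooling (stride 2, padding 1)
    size = conv_out(size, 3, padding=1, stride=2)
    shapes.append((channels, size, size))

    # Four modules: only the first one keeps the resolution
    for num_channels, first_block in [(64, True), (128, False),
                                      (256, False), (512, False)]:
        stride = 1 if first_block else 2
        # Only the first 3x3 convolution of a module may use stride 2;
        # all remaining 3x3 convolutions (padding 1) keep the size
        size = conv_out(size, 3, padding=1, stride=stride)
        channels = num_channels
        shapes.append((channels, size, size))

    # Global average pooling reduces every channel to a single value
    shapes.append((channels, 1, 1))

    for shape in shapes:
        print(shape)

The resulting (channels, height, width) triples agree with the shapes
printed by the MXNet and PyTorch versions above; the TensorFlow version
reports the same sizes with the channel dimension last.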

Training
--------

We train ResNet on the Fashion-MNIST dataset, just like before.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-17-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-17-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-17-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-17-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    lr, num_epochs, batch_size = 0.05, 10, 256
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
    d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    loss 0.015, train acc 0.996, test acc 0.899
    4771.5 examples/sec on gpu(0)


.. figure:: output_resnet_46beba_99_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-17-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    lr, num_epochs, batch_size = 0.05, 10, 256
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
    d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    loss 0.011, train acc 0.997, test acc 0.865
    4713.2 examples/sec on cuda:0


.. figure:: output_resnet_46beba_102_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-17-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    lr, num_epochs, batch_size = 0.05, 10, 256
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
    d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    loss 0.006, train acc 0.999, test acc 0.897
    5146.9 examples/sec on /GPU:0


.. figure:: output_resnet_46beba_105_2.svg


.. raw:: html

     </div>


.. raw:: html

     </div>

Summary
-------

-  Nested function classes are desirable. Learning an additional layer
   in deep neural networks as an identity function (though this is an
   extreme case) should be made easy.
-  The residual mapping can learn the identity function more easily,
   for example by pushing the parameters in the weight layer to zero.
-  We can train an effective deep neural network by having residual
   blocks. Inputs can forward propagate faster through the residual
   connections across layers.
-  ResNet had a major influence on the design of subsequent deep neural
   networks, of both convolutional and sequential nature.

Exercises
---------

1. What are the major differences between the Inception block in
   :numref:`fig_inception` and the residual block? After removing some
   paths in the Inception block, how are they related to each other?
2. Refer to Table 1 in the ResNet paper :cite:`He.Zhang.Ren.ea.2016`
   to implement different variants.
3. For deeper networks, ResNet introduces a "bottleneck" architecture to
   reduce model complexity. Try to implement it.
4. In subsequent versions of ResNet, the authors changed the
   "convolution, batch normalization, and activation" structure to the
   "batch normalization, activation, and convolution" structure. Make
   this improvement yourself. See Figure 1 in
   :cite:`He.Zhang.Ren.ea.2016*1` for details.
5. Why can't we just increase the complexity of functions without bound,
   even if the function classes are nested?


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar text"><a href="#mxnet-19-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-19-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-19-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-19-0">

`Discussions <https://discuss.d2l.ai/t/85>`__


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-19-1">

`Discussions <https://discuss.d2l.ai/t/86>`__


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-19-2">

`Discussions <https://discuss.d2l.ai/t/333>`__


.. raw:: html

     </div>


.. raw:: html

     </div>