Densely Connected Networks (DenseNet)
=====================================
ResNet significantly changed the view of how to parametrize the
functions in deep networks. *DenseNet* (dense convolutional network) is
to some extent the logical extension of this
:cite:`Huang.Liu.Van-Der-Maaten.ea.2017`. To understand how to arrive
at it, let us take a small detour to mathematics.
From ResNet to DenseNet
-----------------------
Recall the Taylor expansion of a function. Around the point :math:`x = 0`
it can be written as
.. math:: f(x) = f(0) + f'(0) x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \ldots.
The key point is that it decomposes a function into increasingly higher
order terms. In a similar vein, ResNet decomposes functions into
.. math:: f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}).
That is, ResNet decomposes :math:`f` into a simple linear term and a
more complex nonlinear one. What if we want to capture (not necessarily
add) information beyond two terms? One solution is DenseNet
:cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.
.. _fig_densenet_block:
.. figure:: ../img/densenet-block.svg
The main difference between ResNet (left) and DenseNet (right) in
cross-layer connections: use of addition and use of concatenation.
As shown in :numref:`fig_densenet_block`, the key difference between
ResNet and DenseNet is that in the latter case outputs are
*concatenated* (denoted by :math:`[,]`) rather than added. As a result,
we perform a mapping from :math:`\mathbf{x}` to its values after
applying an increasingly complex sequence of functions:
.. math::
\mathbf{x} \to \left[
\mathbf{x},
f_1(\mathbf{x}),
f_2([\mathbf{x}, f_1(\mathbf{x})]), f_3([\mathbf{x}, f_1(\mathbf{x}), f_2([\mathbf{x}, f_1(\mathbf{x})])]), \ldots\right].
In the end, all these functions are combined in the MLP to reduce the number
of features again. In terms of implementation this is quite simple:
rather than adding terms, we concatenate them. The name DenseNet arises
from the fact that the dependency graph between variables becomes quite
dense. The last layer of such a chain is densely connected to all
previous layers. The dense connections are shown in
:numref:`fig_densenet`.
.. _fig_densenet:
.. figure:: ../img/densenet.svg
Dense connections in DenseNet.
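To make this contrast concrete, here is a minimal sketch (using PyTorch
for illustration; the other frameworks behave analogously) of the two
ways of combining a block's input and output: addition leaves the number
of channels unchanged, while concatenation stacks them.

.. code:: python

   import torch
   from torch import nn

   x = torch.randn(4, 3, 8, 8)  # batch of 4 images with 3 channels
   g = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for g(x)

   resnet_style = x + g(x)  # addition: still 3 channels
   densenet_style = torch.cat((x, g(x)), dim=1)  # concatenation: 3 + 3 = 6
   print(resnet_style.shape, densenet_style.shape)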
The main components that compose a DenseNet are *dense blocks* and
*transition layers*. The former define how the inputs and outputs are
concatenated, while the latter control the number of channels so that it
is not too large.
Dense Blocks
------------
DenseNet uses the modified "batch normalization, activation, and
convolution" structure of ResNet (see the exercise in
:numref:`sec_resnet`). First, we implement this convolution block
structure.
**MXNet**
.. code:: python
from mxnet import np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
def conv_block(num_channels):
blk = nn.Sequential()
blk.add(nn.BatchNorm(),
nn.Activation('relu'),
nn.Conv2D(num_channels, kernel_size=3, padding=1))
return blk
**PyTorch**
.. code:: python
import torch
from torch import nn
from d2l import torch as d2l
def conv_block(input_channels, num_channels):
return nn.Sequential(
nn.BatchNorm2d(input_channels), nn.ReLU(),
nn.Conv2d(input_channels, num_channels, kernel_size=3, padding=1))
**TensorFlow**
.. code:: python
import tensorflow as tf
from d2l import tensorflow as d2l
class ConvBlock(tf.keras.layers.Layer):
def __init__(self, num_channels):
super(ConvBlock, self).__init__()
self.bn = tf.keras.layers.BatchNormalization()
self.relu = tf.keras.layers.ReLU()
self.conv = tf.keras.layers.Conv2D(
filters=num_channels, kernel_size=(3, 3), padding='same')
self.listLayers = [self.bn, self.relu, self.conv]
    def call(self, x):
        y = x
        for layer in self.listLayers:
            y = layer(y)
        # Concatenate the input and output of the block on the channel
        # dimension (the last axis, since TensorFlow is channels-last)
        y = tf.keras.layers.concatenate([x, y], axis=-1)
        return y
A *dense block* consists of multiple convolution blocks, each using the
same number of output channels. In the forward propagation, however, we
concatenate the input and output of each convolution block on the
channel dimension.
**MXNet**
.. code:: python
class DenseBlock(nn.Block):
def __init__(self, num_convs, num_channels, **kwargs):
super().__init__(**kwargs)
self.net = nn.Sequential()
for _ in range(num_convs):
self.net.add(conv_block(num_channels))
def forward(self, X):
for blk in self.net:
Y = blk(X)
# Concatenate the input and output of each block on the channel
# dimension
X = np.concatenate((X, Y), axis=1)
return X
**PyTorch**
.. code:: python
class DenseBlock(nn.Module):
def __init__(self, num_convs, input_channels, num_channels):
super(DenseBlock, self).__init__()
layer = []
for i in range(num_convs):
layer.append(conv_block(
num_channels * i + input_channels, num_channels))
self.net = nn.Sequential(*layer)
def forward(self, X):
for blk in self.net:
Y = blk(X)
# Concatenate the input and output of each block on the channel
# dimension
X = torch.cat((X, Y), dim=1)
return X
**TensorFlow**
.. code:: python
class DenseBlock(tf.keras.layers.Layer):
def __init__(self, num_convs, num_channels):
super(DenseBlock, self).__init__()
self.listLayers = []
for _ in range(num_convs):
self.listLayers.append(ConvBlock(num_channels))
    def call(self, x):
        for layer in self.listLayers:
            # Each `ConvBlock` already concatenates its input and output
            x = layer(x)
        return x
In the following example, we define a ``DenseBlock`` instance with 2
convolution blocks of 10 output channels. When using an input with 3
channels, we will get an output with :math:`3+2\times 10=23` channels.
The number of convolution block channels controls the growth in the
number of output channels relative to the number of input channels. This
is also referred to as the *growth rate*.
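The resulting number of channels is simple bookkeeping. As a sketch, the
hypothetical helper below (introduced here only for illustration, not
part of the d2l library) computes the output channels of a dense block
from the input channels, the number of convolution blocks, and the
growth rate.

.. code:: python

   def dense_block_out_channels(in_channels, num_convs, growth_rate):
       # Each convolution block appends `growth_rate` channels via concatenation
       return in_channels + num_convs * growth_rate

   print(dense_block_out_channels(3, 2, 10))  # 3 + 2 * 10 = 23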
**MXNet**
.. code:: python
blk = DenseBlock(2, 10)
blk.initialize()
X = np.random.uniform(size=(4, 3, 8, 8))
Y = blk(X)
Y.shape
.. parsed-literal::
:class: output
(4, 23, 8, 8)
**PyTorch**
.. code:: python
blk = DenseBlock(2, 3, 10)
X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape
.. parsed-literal::
:class: output
torch.Size([4, 23, 8, 8])
**TensorFlow**
.. code:: python
blk = DenseBlock(2, 10)
X = tf.random.uniform((4, 8, 8, 3))
Y = blk(X)
Y.shape
.. parsed-literal::
:class: output
TensorShape([4, 8, 8, 23])
Transition Layers
-----------------
Since each dense block will increase the number of channels, adding too
many of them will lead to an excessively complex model. A *transition
layer* is used to control the complexity of the model. It reduces the
number of channels by using a :math:`1\times 1` convolutional layer and
halves the height and width via an average pooling layer with a stride
of 2, further reducing the complexity of the model.
**MXNet**
.. code:: python
def transition_block(num_channels):
blk = nn.Sequential()
blk.add(nn.BatchNorm(), nn.Activation('relu'),
nn.Conv2D(num_channels, kernel_size=1),
nn.AvgPool2D(pool_size=2, strides=2))
return blk
**PyTorch**
.. code:: python
def transition_block(input_channels, num_channels):
return nn.Sequential(
nn.BatchNorm2d(input_channels), nn.ReLU(),
nn.Conv2d(input_channels, num_channels, kernel_size=1),
nn.AvgPool2d(kernel_size=2, stride=2))
**TensorFlow**
.. code:: python
class TransitionBlock(tf.keras.layers.Layer):
def __init__(self, num_channels, **kwargs):
super(TransitionBlock, self).__init__(**kwargs)
self.batch_norm = tf.keras.layers.BatchNormalization()
self.relu = tf.keras.layers.ReLU()
self.conv = tf.keras.layers.Conv2D(num_channels, kernel_size=1)
self.avg_pool = tf.keras.layers.AvgPool2D(pool_size=2, strides=2)
def call(self, x):
x = self.batch_norm(x)
x = self.relu(x)
x = self.conv(x)
return self.avg_pool(x)
We now apply a transition layer with 10 channels to the output of the
dense block in the previous example. This reduces the number of output
channels to 10 and halves the height and width.
**MXNet**
.. code:: python
blk = transition_block(10)
blk.initialize()
blk(Y).shape
.. parsed-literal::
:class: output
(4, 10, 4, 4)
**PyTorch**
.. code:: python
blk = transition_block(23, 10)
blk(Y).shape
.. parsed-literal::
:class: output
torch.Size([4, 10, 4, 4])
**TensorFlow**
.. code:: python
blk = TransitionBlock(10)
blk(Y).shape
.. parsed-literal::
:class: output
TensorShape([4, 4, 4, 10])
DenseNet Model
--------------
Next, we will construct a DenseNet model. DenseNet first uses the same
single convolutional layer and max-pooling layer as in ResNet.
**MXNet**
.. code:: python
net = nn.Sequential()
net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
nn.BatchNorm(), nn.Activation('relu'),
nn.MaxPool2D(pool_size=3, strides=2, padding=1))
**PyTorch**
.. code:: python
b1 = nn.Sequential(
nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
nn.BatchNorm2d(64), nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
**TensorFlow**
.. code:: python
def block_1():
return tf.keras.Sequential([
tf.keras.layers.Conv2D(64, kernel_size=7, strides=2, padding='same'),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.ReLU(),
tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding='same')])
Then, similar to the four modules made up of residual blocks that ResNet
uses, DenseNet uses four dense blocks. As with ResNet, we can set the
number of convolutional layers used in each dense block. Here, we set it
to 4, consistent with the ResNet-18 model in :numref:`sec_resnet`.
Furthermore, we set the number of channels (i.e., the growth rate) for
the convolutional layers in the dense block to 32, so 128 channels will
be added in each dense block.
In ResNet, the height and width are reduced between each module by a
residual block with a stride of 2. Here, we use the transition layer to
halve the height and width and halve the number of channels.
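Before assembling the network, it may help to trace this channel
bookkeeping with plain arithmetic (a sketch of the computation the loops
below perform): starting from 64 channels, each dense block adds
:math:`4 \times 32 = 128` channels and each transition layer halves the
count, ending at 248 channels before the global pooling layer.

.. code:: python

   num_channels, growth_rate = 64, 32
   for i, num_convs in enumerate([4, 4, 4, 4]):
       num_channels += num_convs * growth_rate  # each dense block adds 128
       print(f'dense block {i}: {num_channels} channels')
       if i != 3:  # no transition layer after the last dense block
           num_channels //= 2  # each transition layer halves the channels
           print(f'transition {i}: {num_channels} channels')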
**MXNet**
.. code:: python
# `num_channels`: the current number of channels
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
for i, num_convs in enumerate(num_convs_in_dense_blocks):
net.add(DenseBlock(num_convs, growth_rate))
# This is the number of output channels in the previous dense block
num_channels += num_convs * growth_rate
# A transition layer that halves the number of channels is added between
# the dense blocks
if i != len(num_convs_in_dense_blocks) - 1:
num_channels //= 2
net.add(transition_block(num_channels))
**PyTorch**
.. code:: python
# `num_channels`: the current number of channels
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
blks = []
for i, num_convs in enumerate(num_convs_in_dense_blocks):
blks.append(DenseBlock(num_convs, num_channels, growth_rate))
# This is the number of output channels in the previous dense block
num_channels += num_convs * growth_rate
# A transition layer that halves the number of channels is added between
# the dense blocks
if i != len(num_convs_in_dense_blocks) - 1:
blks.append(transition_block(num_channels, num_channels // 2))
num_channels = num_channels // 2
**TensorFlow**
.. code:: python
def block_2():
net = block_1()
# `num_channels`: the current number of channels
num_channels, growth_rate = 64, 32
num_convs_in_dense_blocks = [4, 4, 4, 4]
for i, num_convs in enumerate(num_convs_in_dense_blocks):
net.add(DenseBlock(num_convs, growth_rate))
# This is the number of output channels in the previous dense block
num_channels += num_convs * growth_rate
# A transition layer that halves the number of channels is added
# between the dense blocks
if i != len(num_convs_in_dense_blocks) - 1:
num_channels //= 2
net.add(TransitionBlock(num_channels))
return net
Similar to ResNet, a global pooling layer and a fully-connected layer
are connected at the end to produce the output.
**MXNet**
.. code:: python
net.add(nn.BatchNorm(),
nn.Activation('relu'),
nn.GlobalAvgPool2D(),
nn.Dense(10))
**PyTorch**
.. code:: python
net = nn.Sequential(
b1, *blks,
nn.BatchNorm2d(num_channels), nn.ReLU(),
nn.AdaptiveAvgPool2d((1, 1)),
nn.Flatten(),
nn.Linear(num_channels, 10))
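As a quick sanity check, we can feed a dummy single-channel
:math:`96\times 96` image (the input size used for training below)
through the PyTorch version of the network and verify the output shape.

.. code:: python

   X = torch.randn(1, 1, 96, 96)
   print(net(X).shape)  # expected: torch.Size([1, 10])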
**TensorFlow**
.. code:: python
def net():
net = block_2()
net.add(tf.keras.layers.BatchNormalization())
net.add(tf.keras.layers.ReLU())
net.add(tf.keras.layers.GlobalAvgPool2D())
net.add(tf.keras.layers.Flatten())
net.add(tf.keras.layers.Dense(10))
return net
Training
--------
Since we are using a deeper network here, we reduce the input height and
width from 224 to 96 in this section to simplify the computation.
**MXNet**
.. code:: python
lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. parsed-literal::
:class: output
loss 0.142, train acc 0.947, test acc 0.898
5383.6 examples/sec on gpu(0)
.. figure:: output_densenet_e82156_99_1.svg
**PyTorch**
.. code:: python
lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. parsed-literal::
:class: output
loss 0.142, train acc 0.948, test acc 0.882
5574.2 examples/sec on cuda:0
.. figure:: output_densenet_e82156_102_1.svg
**TensorFlow**
.. code:: python
lr, num_epochs, batch_size = 0.1, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. parsed-literal::
:class: output
loss 0.137, train acc 0.951, test acc 0.895
5709.2 examples/sec on /GPU:0
.. figure:: output_densenet_e82156_105_2.svg
Summary
-------
- In terms of cross-layer connections, unlike ResNet, where inputs and
outputs are added together, DenseNet concatenates inputs and outputs
on the channel dimension.
- The main components that compose DenseNet are dense blocks and
transition layers.
- We need to keep the dimensionality under control when composing the
network by adding transition layers that shrink the number of
channels again.
Exercises
---------
1. Why do we use average pooling rather than maximum pooling in the
transition layer?
2. One of the advantages mentioned in the DenseNet paper is that its
model parameters are smaller than those of ResNet. Why is this the
case?
3. One problem for which DenseNet has been criticized is its high memory
consumption.
1. Is this really the case? Try to change the input shape to
:math:`224\times 224` to see the actual GPU memory consumption.
2. Can you think of an alternative means of reducing the memory
consumption? How would you need to change the framework?
4. Implement the various DenseNet versions presented in Table 1 of the
DenseNet paper :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.
5. Design an MLP-based model by applying the DenseNet idea. Apply it to
the housing price prediction task in :numref:`sec_kaggle_house`.