.. _sec_nin:
Network in Network (NiN)
========================
LeNet, AlexNet, and VGG all share a common design pattern: extract
features exploiting *spatial* structure via a sequence of convolution
and pooling layers and then post-process the representations via
fully-connected layers. The improvements upon LeNet by AlexNet and VGG
mainly lie in how these later networks widen and deepen these two
modules. Alternatively, one could imagine using fully-connected layers
earlier in the process. However, a careless use of dense layers might
give up the spatial structure of the representation entirely. *Network
in network* (*NiN*) blocks offer an alternative. They were proposed
based on a very simple insight: to use an MLP on the channels for each
pixel separately :cite:`Lin.Chen.Yan.2013`.
NiN Blocks
----------
Recall that the inputs and outputs of convolutional layers consist of
four-dimensional tensors with axes corresponding to the example,
channel, height, and width. Also recall that the inputs and outputs of
fully-connected layers are typically two-dimensional tensors
corresponding to the example and feature. The idea behind NiN is to
apply a fully-connected layer at each pixel location (for each height
and width). If we tie the weights across each spatial location, we could
think of this as a :math:`1\times 1` convolutional layer (as described
in :numref:`sec_channels`) or as a fully-connected layer acting
independently on each pixel location. Another way to view this is to
think of each element in the spatial dimension (height and width) as
equivalent to an example and a channel as equivalent to a feature.
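As a quick sanity check of this equivalence, the following sketch (in
PyTorch, with illustrative channel counts and input shapes) ties the
weights of a :math:`1\times 1` convolutional layer and a
fully-connected layer and confirms that they compute the same function
at every pixel:

```python
import torch
from torch import nn

conv = nn.Conv2d(3, 8, kernel_size=1, bias=False)
fc = nn.Linear(3, 8, bias=False)
# Tie the weights: the 1x1 kernel of shape (8, 3, 1, 1) is just an 8x3 matrix
with torch.no_grad():
    fc.weight.copy_(conv.weight.reshape(8, 3))

X = torch.rand(2, 3, 4, 4)  # (example, channel, height, width)
Y_conv = conv(X)
# Apply the dense layer independently at each pixel: channels last, then back
Y_fc = fc(X.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(Y_conv, Y_fc, atol=1e-6))  # True
```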
:numref:`fig_nin` illustrates the main structural differences between
VGG and NiN, and their blocks. The NiN block consists of one
convolutional layer followed by two :math:`1\times 1` convolutional
layers that act as per-pixel fully-connected layers with ReLU
activations. The convolution window shape of the first layer is
typically set by the user. The subsequent window shapes are fixed to
:math:`1 \times 1`.
.. _fig_nin:

.. figure:: ../img/nin.svg
   :width: 600px

   Comparing architectures of VGG and NiN, and their blocks.
In MXNet:

.. code:: python
   from mxnet import np, npx
   from mxnet.gluon import nn
   from d2l import mxnet as d2l

   npx.set_np()


   def nin_block(num_channels, kernel_size, strides, padding):
       blk = nn.Sequential()
       blk.add(nn.Conv2D(num_channels, kernel_size, strides, padding,
                         activation='relu'),
               nn.Conv2D(num_channels, kernel_size=1, activation='relu'),
               nn.Conv2D(num_channels, kernel_size=1, activation='relu'))
       return blk
In PyTorch:

.. code:: python
   import torch
   from torch import nn
   from d2l import torch as d2l


   def nin_block(in_channels, out_channels, kernel_size, strides, padding):
       return nn.Sequential(
           nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
           nn.ReLU(),
           nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU(),
           nn.Conv2d(out_channels, out_channels, kernel_size=1), nn.ReLU())
In TensorFlow:

.. code:: python
   import tensorflow as tf
   from d2l import tensorflow as d2l


   def nin_block(num_channels, kernel_size, strides, padding):
       return tf.keras.models.Sequential([
           tf.keras.layers.Conv2D(num_channels, kernel_size, strides=strides,
                                  padding=padding, activation='relu'),
           tf.keras.layers.Conv2D(num_channels, kernel_size=1,
                                  activation='relu'),
           tf.keras.layers.Conv2D(num_channels, kernel_size=1,
                                  activation='relu')])
NiN Model
---------
The original NiN network was proposed shortly after AlexNet and clearly
draws some inspiration from it. NiN uses convolutional layers with window shapes
of :math:`11\times 11`, :math:`5\times 5`, and :math:`3\times 3`, and
the corresponding numbers of output channels are the same as in AlexNet.
Each NiN block is followed by a maximum pooling layer with a stride of 2
and a window shape of :math:`3\times 3`.
One significant difference between NiN and AlexNet is that NiN avoids
fully-connected layers altogether. Instead, NiN uses an NiN block with a
number of output channels equal to the number of label classes, followed
by a *global* average pooling layer, yielding a vector of logits. One
advantage of NiN's design is that it significantly reduces the number of
required model parameters. However, in practice, this design sometimes
requires increased model training time.
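To put the savings in perspective, a rough back-of-the-envelope count
(weights only, biases omitted, assuming the layer sizes used for AlexNet
on Fashion-MNIST) compares AlexNet's fully-connected head with NiN's
final block:

```python
# AlexNet flattens a 256 x 5 x 5 feature map into dense layers of width 4096,
# followed by a 10-way output; NiN instead maps 384 channels to 10 with one
# 3x3 convolution and two 1x1 convolutions, then averages spatially.
alexnet_head = 256 * 5 * 5 * 4096 + 4096 * 4096 + 4096 * 10
nin_head = 384 * 10 * 3 * 3 + 10 * 10 + 10 * 10
print(f'AlexNet head: {alexnet_head:,} weights')  # 43,032,576
print(f'NiN head:     {nin_head:,} weights')      # 34,760
```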
In MXNet:

.. code:: python
   net = nn.Sequential()
   net.add(nin_block(96, kernel_size=11, strides=4, padding=0),
           nn.MaxPool2D(pool_size=3, strides=2),
           nin_block(256, kernel_size=5, strides=1, padding=2),
           nn.MaxPool2D(pool_size=3, strides=2),
           nin_block(384, kernel_size=3, strides=1, padding=1),
           nn.MaxPool2D(pool_size=3, strides=2),
           nn.Dropout(0.5),
           # There are 10 label classes
           nin_block(10, kernel_size=3, strides=1, padding=1),
           # The global average pooling layer automatically sets the window
           # shape to the height and width of the input
           nn.GlobalAvgPool2D(),
           # Transform the four-dimensional output into two-dimensional output
           # with a shape of (batch size, 10)
           nn.Flatten())
In PyTorch:

.. code:: python
   net = nn.Sequential(
       nin_block(1, 96, kernel_size=11, strides=4, padding=0),
       nn.MaxPool2d(3, stride=2),
       nin_block(96, 256, kernel_size=5, strides=1, padding=2),
       nn.MaxPool2d(3, stride=2),
       nin_block(256, 384, kernel_size=3, strides=1, padding=1),
       nn.MaxPool2d(3, stride=2),
       nn.Dropout(0.5),
       # There are 10 label classes
       nin_block(384, 10, kernel_size=3, strides=1, padding=1),
       nn.AdaptiveAvgPool2d((1, 1)),
       # Transform the four-dimensional output into two-dimensional output
       # with a shape of (batch size, 10)
       nn.Flatten())
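As an aside, ``nn.AdaptiveAvgPool2d((1, 1))`` realizes global average
pooling here: it is the same as taking the mean over the two spatial
dimensions, as this small check (with illustrative shapes) confirms:

```python
import torch
from torch import nn

X = torch.rand(2, 10, 5, 5)  # (example, channel, height, width)
pool = nn.AdaptiveAvgPool2d((1, 1))
# Averaging each 5x5 spatial map equals taking the mean over dims 2 and 3
print(torch.allclose(pool(X).reshape(2, 10), X.mean(dim=(2, 3))))  # True
```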
In TensorFlow:

.. code:: python
   def net():
       return tf.keras.models.Sequential([
           nin_block(96, kernel_size=11, strides=4, padding='valid'),
           tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
           nin_block(256, kernel_size=5, strides=1, padding='same'),
           tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
           nin_block(384, kernel_size=3, strides=1, padding='same'),
           tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
           tf.keras.layers.Dropout(0.5),
           # There are 10 label classes
           nin_block(10, kernel_size=3, strides=1, padding='same'),
           tf.keras.layers.GlobalAveragePooling2D(),
           tf.keras.layers.Reshape((1, 1, 10)),
           # Transform the four-dimensional output into two-dimensional output
           # with a shape of (batch size, 10)
           tf.keras.layers.Flatten(),
       ])
We create a data example to see the output shape of each block.
In MXNet:

.. code:: python
   X = np.random.uniform(size=(1, 1, 224, 224))
   net.initialize()
   for layer in net:
       X = layer(X)
       print(layer.name, 'output shape:\t', X.shape)
.. parsed-literal::
   :class: output

   sequential1 output shape: (1, 96, 54, 54)
   pool0 output shape: (1, 96, 26, 26)
   sequential2 output shape: (1, 256, 26, 26)
   pool1 output shape: (1, 256, 12, 12)
   sequential3 output shape: (1, 384, 12, 12)
   pool2 output shape: (1, 384, 5, 5)
   dropout0 output shape: (1, 384, 5, 5)
   sequential4 output shape: (1, 10, 5, 5)
   pool3 output shape: (1, 10, 1, 1)
   flatten0 output shape: (1, 10)
In PyTorch:

.. code:: python
   X = torch.rand(size=(1, 1, 224, 224))
   for layer in net:
       X = layer(X)
       print(layer.__class__.__name__, 'output shape:\t', X.shape)
.. parsed-literal::
   :class: output

   Sequential output shape: torch.Size([1, 96, 54, 54])
   MaxPool2d output shape: torch.Size([1, 96, 26, 26])
   Sequential output shape: torch.Size([1, 256, 26, 26])
   MaxPool2d output shape: torch.Size([1, 256, 12, 12])
   Sequential output shape: torch.Size([1, 384, 12, 12])
   MaxPool2d output shape: torch.Size([1, 384, 5, 5])
   Dropout output shape: torch.Size([1, 384, 5, 5])
   Sequential output shape: torch.Size([1, 10, 5, 5])
   AdaptiveAvgPool2d output shape: torch.Size([1, 10, 1, 1])
   Flatten output shape: torch.Size([1, 10])
In TensorFlow:

.. code:: python
   X = tf.random.uniform((1, 224, 224, 1))
   for layer in net().layers:
       X = layer(X)
       print(layer.__class__.__name__, 'output shape:\t', X.shape)
.. parsed-literal::
   :class: output

   Sequential output shape: (1, 54, 54, 96)
   MaxPooling2D output shape: (1, 26, 26, 96)
   Sequential output shape: (1, 26, 26, 256)
   MaxPooling2D output shape: (1, 12, 12, 256)
   Sequential output shape: (1, 12, 12, 384)
   MaxPooling2D output shape: (1, 5, 5, 384)
   Dropout output shape: (1, 5, 5, 384)
   Sequential output shape: (1, 5, 5, 10)
   GlobalAveragePooling2D output shape: (1, 10)
   Reshape output shape: (1, 1, 1, 10)
   Flatten output shape: (1, 10)
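The spatial sizes in these printouts follow the usual convolution
arithmetic :math:`\lfloor (n + 2p - k)/s \rfloor + 1`. A short sketch
traces the 224-pixel input through the NiN blocks and pooling layers:

```python
def out_size(n, k, s, p):
    """Output size of a convolution or pooling layer."""
    return (n + 2 * p - k) // s + 1

n = 224
n = out_size(n, k=11, s=4, p=0)  # first NiN block:  54
n = out_size(n, k=3, s=2, p=0)   # max-pooling:      26
n = out_size(n, k=5, s=1, p=2)   # second NiN block: 26
n = out_size(n, k=3, s=2, p=0)   # max-pooling:      12
n = out_size(n, k=3, s=1, p=1)   # third NiN block:  12
n = out_size(n, k=3, s=2, p=0)   # max-pooling:      5
print(n)  # 5
```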
Training
--------
As before, we use Fashion-MNIST to train the model. NiN's training is
similar to that for AlexNet and VGG.
In MXNet:

.. code:: python
   lr, num_epochs, batch_size = 0.1, 10, 128
   train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
   d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. parsed-literal::
   :class: output

   loss 0.416, train acc 0.846, test acc 0.761
   2919.8 examples/sec on gpu(0)
.. figure:: output_nin_8ad4f3_39_1.svg
In PyTorch:

.. code:: python
   lr, num_epochs, batch_size = 0.1, 10, 128
   train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
   d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. parsed-literal::
   :class: output

   loss 0.336, train acc 0.874, test acc 0.867
   3192.8 examples/sec on cuda:0
.. figure:: output_nin_8ad4f3_42_1.svg
In TensorFlow:

.. code:: python
   lr, num_epochs, batch_size = 0.1, 10, 128
   train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
   d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. parsed-literal::
   :class: output

   loss 0.689, train acc 0.745, test acc 0.733
   3651.4 examples/sec on /GPU:0
.. figure:: output_nin_8ad4f3_45_2.svg
Summary
-------
-  NiN uses blocks consisting of a convolutional layer and multiple
   :math:`1\times 1` convolutional layers. This can be used within the
   convolutional stack to allow for more per-pixel nonlinearity.
-  NiN removes the fully-connected layers and replaces them with global
   average pooling (i.e., averaging over all locations) after reducing
   the number of channels to the desired number of outputs (e.g., 10 for
   Fashion-MNIST).
-  Removing the fully-connected layers reduces overfitting. NiN has
   dramatically fewer parameters.
-  The NiN design influenced many subsequent CNN designs.
Exercises
---------
1. Tune the hyperparameters to improve the classification accuracy.
2. Why are there two :math:`1\times 1` convolutional layers in the NiN
   block? Remove one of them, and then observe and analyze the
   experimental phenomena.
3. Calculate the resource usage for NiN.

   1. What is the number of parameters?
   2. What is the amount of computation?
   3. What is the amount of memory needed during training?
   4. What is the amount of memory needed during prediction?

4. What are possible problems with reducing the
   :math:`384 \times 5 \times 5` representation to a
   :math:`10 \times 5 \times 5` representation in one step?