.. _sec_padding:
Padding and Stride
==================
In the previous example of :numref:`fig_correlation`, our input had
both a height and width of 3 and our convolution kernel had both a
height and width of 2, yielding an output representation with dimension
:math:`2\times2`. As we generalized in :numref:`sec_conv_layer`,
assuming that the input shape is :math:`n_h\times n_w` and the
convolution kernel shape is :math:`k_h\times k_w`, then the output shape
will be :math:`(n_h-k_h+1) \times (n_w-k_w+1)`. Therefore, the output
shape of the convolutional layer is determined by the shape of the input
and the shape of the convolution kernel.
In several cases, we incorporate techniques, including padding and
strided convolutions, that affect the size of the output. As motivation,
note that since kernels generally have width and height greater than
:math:`1`, after applying many successive convolutions, we tend to wind
up with outputs that are considerably smaller than our input. If we
start with a :math:`240 \times 240` pixel image, :math:`10` layers of
:math:`5 \times 5` convolutions reduce the image to
:math:`200 \times 200` pixels, slicing off :math:`30 \%` of the image
and with it obliterating any interesting information on the boundaries
of the original image. *Padding* is the most popular tool for handling
this issue.
In other cases, we may want to reduce the dimensionality drastically,
e.g., if we find the original input resolution to be unwieldy. *Strided
convolutions* are a popular technique that can help in these instances.
Padding
-------
As described above, one tricky issue when applying convolutional layers
is that we tend to lose pixels on the perimeter of our image. Since we
typically use small kernels, for any given convolution, we might only
lose a few pixels, but this can add up as we apply many successive
convolutional layers. One straightforward solution to this problem is to
add extra pixels of filler around the boundary of our input image, thus
increasing the effective size of the image. Typically, we set the values
of the extra pixels to zero. In :numref:`img_conv_pad`, we pad a
:math:`3 \times 3` input, increasing its size to :math:`5 \times 5`. The
corresponding output then increases to a :math:`4 \times 4` matrix. The
shaded portions are the first output element as well as the input and
kernel tensor elements used for the output computation:
:math:`0\times0+0\times1+0\times2+0\times3=0`.
.. _img_conv_pad:
.. figure:: ../img/conv-pad.svg
Two-dimensional cross-correlation with padding.
In general, if we add a total of :math:`p_h` rows of padding (roughly
half on top and half on bottom) and a total of :math:`p_w` columns of
padding (roughly half on the left and half on the right), the output
shape will be
.. math:: (n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1).
This means that the height and width of the output will increase by
:math:`p_h` and :math:`p_w`, respectively.
In many cases, we will want to set :math:`p_h=k_h-1` and
:math:`p_w=k_w-1` to give the input and output the same height and
width. This will make it easier to predict the output shape of each
layer when constructing the network. Assuming that :math:`k_h` is odd
here, we will pad :math:`p_h/2` rows on both sides of the height. If
:math:`k_h` is even, one possibility is to pad
:math:`\lceil p_h/2\rceil` rows on the top of the input and
:math:`\lfloor p_h/2\rfloor` rows on the bottom. We will pad both sides
of the width in the same way.
CNNs commonly use convolution kernels with odd height and width values,
such as 1, 3, 5, or 7. Choosing odd kernel sizes has the benefit that we
can preserve the spatial dimensionality while padding with the same
number of rows on top and bottom, and the same number of columns on left
and right.
Moreover, this practice of using odd kernels and padding to precisely
preserve dimensionality offers a clerical benefit. For any
two-dimensional tensor ``X``, when the kernel's size is odd and the
number of padding rows and columns on all sides are the same, producing
an output with the same height and width as the input, we know that the
output ``Y[i, j]`` is calculated by cross-correlation of the input and
convolution kernel with the window centered on ``X[i, j]``.
In the following example, we create a two-dimensional convolutional
layer with a height and width of 3 and apply 1 pixel of padding on all
sides. Given an input with a height and width of 8, we find that the
height and width of the output is also 8.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()
# For convenience, we define a function to calculate the convolutional layer.
# This function initializes the convolutional layer weights and performs
# corresponding dimensionality elevations and reductions on the input and
# output
def comp_conv2d(conv2d, X):
conv2d.initialize()
# Here (1, 1) indicates that the batch size and the number of channels
# are both 1
X = X.reshape((1, 1) + X.shape)
Y = conv2d(X)
# Exclude the first two dimensions that do not interest us: examples and
# channels
return Y.reshape(Y.shape[2:])
# Note that here 1 row or column is padded on either side, so a total of 2
# rows or columns are added
conv2d = nn.Conv2D(1, kernel_size=3, padding=1)
X = np.random.uniform(size=(8, 8))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(8, 8)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from torch import nn
# We define a convenience function to calculate the convolutional layer. This
# function initializes the convolutional layer weights and performs
# corresponding dimensionality elevations and reductions on the input and
# output
def comp_conv2d(conv2d, X):
# Here (1, 1) indicates that the batch size and the number of channels
# are both 1
X = X.reshape((1, 1) + X.shape)
Y = conv2d(X)
# Exclude the first two dimensions that do not interest us: examples and
# channels
return Y.reshape(Y.shape[2:])
# Note that here 1 row or column is padded on either side, so a total of 2
# rows or columns are added
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([8, 8])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import tensorflow as tf
# We define a convenience function to calculate the convolutional layer. This
# function initializes the convolutional layer weights and performs
# corresponding dimensionality elevations and reductions on the input and
# output
def comp_conv2d(conv2d, X):
# Here (1, 1) indicates that the batch size and the number of channels
# are both 1
X = tf.reshape(X, (1, ) + X.shape + (1, ))
Y = conv2d(X)
# Exclude the first two dimensions that do not interest us: examples and
# channels
return tf.reshape(Y, Y.shape[1:3])
# Note that here 1 row or column is padded on either side, so a total of 2
# rows or columns are added
conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same')
X = tf.random.uniform(shape=(8, 8))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
TensorShape([8, 8])
.. raw:: html
.. raw:: html
When the height and width of the convolution kernel are different, we
can make the output and input have the same height and width by setting
different padding numbers for height and width.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on either side of the height and width are 2 and 1,
# respectively
conv2d = nn.Conv2D(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(8, 8)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on either side of the height and width are 2 and 1,
# respectively
conv2d = nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([8, 8])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on either side of the height and width are 2 and 1,
# respectively
conv2d = tf.keras.layers.Conv2D(1, kernel_size=(5, 3), padding='same')
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
TensorShape([8, 8])
.. raw:: html
.. raw:: html
Stride
------
When computing the cross-correlation, we start with the convolution
window at the upper-left corner of the input tensor, and then slide it
over all locations both down and to the right. In previous examples, we
default to sliding one element at a time. However, sometimes, either for
computational efficiency or because we wish to downsample, we move our
window more than one element at a time, skipping the intermediate
locations.
We refer to the number of rows and columns traversed per slide as the
*stride*. So far, we have used strides of 1, both for height and width.
Sometimes, we may want to use a larger stride.
:numref:`img_conv_stride` shows a two-dimensional cross-correlation
operation with a stride of 3 vertically and 2 horizontally. The shaded
portions are the output elements as well as the input and kernel tensor
elements used for the output computation:
:math:`0\times0+0\times1+1\times2+2\times3=8`,
:math:`0\times0+6\times1+0\times2+0\times3=6`. We can see that when the
second element of the first column is outputted, the convolution window
slides down three rows. The convolution window slides two columns to the
right when the second element of the first row is outputted. When the
convolution window continues to slide two columns to the right on the
input, there is no output because the input element cannot fill the
window (unless we add another column of padding).
.. _img_conv_stride:
.. figure:: ../img/conv-stride.svg
Cross-correlation with strides of 3 and 2 for height and width,
respectively.
In general, when the stride for the height is :math:`s_h` and the stride
for the width is :math:`s_w`, the output shape is
.. math:: \lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor.
If we set :math:`p_h=k_h-1` and :math:`p_w=k_w-1`, then the output shape
will be simplified to
:math:`\lfloor(n_h+s_h-1)/s_h\rfloor \times \lfloor(n_w+s_w-1)/s_w\rfloor`.
Going a step further, if the input height and width are divisible by the
strides on the height and width, then the output shape will be
:math:`(n_h/s_h) \times (n_w/s_w)`.
Below, we set the strides on both the height and width to 2, thus
halving the input height and width.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.Conv2D(1, kernel_size=3, padding=1, strides=2)
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(4, 4)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([4, 4])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = tf.keras.layers.Conv2D(1, kernel_size=3, padding='same', strides=2)
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
TensorShape([4, 4])
.. raw:: html
.. raw:: html
Next, we will look at a slightly more complicated example.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.Conv2D(1, kernel_size=(3, 5), padding=(0, 1), strides=(3, 4))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(2, 2)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([2, 2])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
conv2d = tf.keras.layers.Conv2D(1, kernel_size=(3,5), padding='valid',
strides=(3, 4))
comp_conv2d(conv2d, X).shape
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
TensorShape([2, 1])
.. raw:: html
.. raw:: html
For the sake of brevity, when the padding number on both sides of the
input height and width are :math:`p_h` and :math:`p_w` respectively, we
call the padding :math:`(p_h, p_w)`. Specifically, when
:math:`p_h = p_w = p`, the padding is :math:`p`. When the strides on the
height and width are :math:`s_h` and :math:`s_w`, respectively, we call
the stride :math:`(s_h, s_w)`. Specifically, when :math:`s_h = s_w = s`,
the stride is :math:`s`. By default, the padding is 0 and the stride is
1. In practice, we rarely use inhomogeneous strides or padding, i.e., we
usually have :math:`p_h = p_w` and :math:`s_h = s_w`.
Summary
-------
- Padding can increase the height and width of the output. This is
often used to give the output the same height and width as the input.
- The stride can reduce the resolution of the output, for example
reducing the height and width of the output to only :math:`1/n` of
the height and width of the input (:math:`n` is an integer greater
than :math:`1`).
- Padding and stride can be used to adjust the dimensionality of the
data effectively.
Exercises
---------
1. For the last example in this section, use mathematics to calculate
the output shape to see if it is consistent with the experimental
result.
2. Try other padding and stride combinations on the experiments in this
section.
3. For audio signals, what does a stride of 2 correspond to?
4. What are the computational benefits of a stride larger than 1?
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html