.. _sec_batch_norm:
Batch Normalization
===================
Training deep neural networks is difficult. And getting them to converge
in a reasonable amount of time can be tricky. In this section, we
describe *batch normalization*, a popular and effective technique that
consistently accelerates the convergence of deep networks
:cite:`Ioffe.Szegedy.2015`. Together with residual blocks---covered
later in :numref:`sec_resnet`---batch normalization has made it
possible for practitioners to routinely train networks with over 100
layers.
Training Deep Networks
----------------------
To motivate batch normalization, let us review a few practical
challenges that arise when training machine learning models and neural
networks in particular.
First, choices regarding data preprocessing often make an enormous
difference in the final results. Recall our application of MLPs to
predicting house prices (:numref:`sec_kaggle_house`). Our first step
when working with real data was to standardize our input features to
each have a mean of zero and variance of one. Intuitively, this
standardization plays nicely with our optimizers because it puts the
parameters *a priori* at a similar scale.
Second, for a typical MLP or CNN, as we train, the variables (e.g.,
affine transformation outputs in MLP) in intermediate layers may take
values with widely varying magnitudes: both along the layers from the
input to the output, across units in the same layer, and over time due
to our updates to the model parameters. The inventors of batch
normalization postulated informally that this drift in the distribution
of such variables could hamper the convergence of the network.
Intuitively, we might conjecture that if one layer has variable values
that are 100 times that of another layer, this might necessitate
compensatory adjustments in the learning rates.
Third, deeper networks are complex and easily capable of overfitting.
This means that regularization becomes more critical.
Batch normalization is applied to individual layers (optionally, to all
of them) and works as follows: In each training iteration, we first
normalize the inputs (of batch normalization) by subtracting their mean
and dividing by their standard deviation, where both are estimated based
on the statistics of the current minibatch. Next, we apply a scale
coefficient and a scale offset. It is precisely due to this
*normalization* based on *batch* statistics that *batch normalization*
derives its name.
Note that if we tried to apply batch normalization with minibatches of
size 1, we would not be able to learn anything. That is because after
subtracting the means, each hidden unit would take value 0! As you might
guess, since we are devoting a whole section to batch normalization,
with large enough minibatches, the approach proves effective and stable.
One takeaway here is that when applying batch normalization, the choice
of batch size may be even more significant than without batch
normalization.
Formally, denoting by :math:`\mathbf{x} \in \mathcal{B}` an input to
batch normalization (:math:`\mathrm{BN}`) that is from a minibatch
:math:`\mathcal{B}`, batch normalization transforms :math:`\mathbf{x}`
according to the following expression:
.. math:: \mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}.
:label: eq_batchnorm
In :eq:`eq_batchnorm`, :math:`\hat{\boldsymbol{\mu}}_\mathcal{B}`
is the sample mean and :math:`\hat{\boldsymbol{\sigma}}_\mathcal{B}` is
the sample standard deviation of the minibatch :math:`\mathcal{B}`.
After applying standardization, the resulting minibatch has zero mean
and unit variance. Because the choice of unit variance (vs. some other
magic number) is an arbitrary choice, we commonly include elementwise
*scale parameter* :math:`\boldsymbol{\gamma}` and *shift parameter*
:math:`\boldsymbol{\beta}` that have the same shape as
:math:`\mathbf{x}`. Note that :math:`\boldsymbol{\gamma}` and
:math:`\boldsymbol{\beta}` are parameters that need to be learned
jointly with the other model parameters.
Consequently, the variable magnitudes for intermediate layers cannot
diverge during training because batch normalization actively centers and
rescales them back to a given mean and size (via
:math:`\hat{\boldsymbol{\mu}}_\mathcal{B}` and
:math:`{\hat{\boldsymbol{\sigma}}_\mathcal{B}}`). One piece of
practitioner's intuition or wisdom is that batch normalization seems to
allow for more aggressive learning rates.
Formally, we calculate :math:`\hat{\boldsymbol{\mu}}_\mathcal{B}` and
:math:`{\hat{\boldsymbol{\sigma}}_\mathcal{B}}` in
:eq:`eq_batchnorm` as follows:
.. math::
\begin{aligned} \hat{\boldsymbol{\mu}}_\mathcal{B} &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x},\\
\hat{\boldsymbol{\sigma}}_\mathcal{B}^2 &= \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_{\mathcal{B}})^2 + \epsilon.\end{aligned}
Note that we add a small constant :math:`\epsilon > 0` to the variance
estimate to ensure that we never attempt division by zero, even in cases
where the empirical variance estimate might vanish. The estimates
:math:`\hat{\boldsymbol{\mu}}_\mathcal{B}` and
:math:`{\hat{\boldsymbol{\sigma}}_\mathcal{B}}` counteract the scaling
issue by using noisy estimates of mean and variance. You might think
that this noisiness should be a problem. As it turns out, this is
actually beneficial.
This turns out to be a recurring theme in deep learning. For reasons
that are not yet well-characterized theoretically, various sources of
noise in optimization often lead to faster training and less
overfitting: this variation appears to act as a form of regularization.
In some preliminary research, :cite:`Teye.Azizpour.Smith.2018` and
:cite:`Luo.Wang.Shao.ea.2018` relate the properties of batch
normalization to Bayesian priors and penalties respectively. In
particular, this sheds some light on the puzzle of why batch
normalization works best for moderate minibatches sizes in the
:math:`50 \sim 100` range.
Fixing a trained model, you might think that we would prefer using the
entire dataset to estimate the mean and variance. Once training is
complete, why would we want the same image to be classified differently,
depending on the batch in which it happens to reside? During training,
such exact calculation is infeasible because the intermediate variables
for all data examples change every time we update our model. However,
once the model is trained, we can calculate the means and variances of
each layer's variables based on the entire dataset. Indeed this is
standard practice for models employing batch normalization and thus
batch normalization layers function differently in *training mode*
(normalizing by minibatch statistics) and in *prediction mode*
(normalizing by dataset statistics).
We are now ready to take a look at how batch normalization works in
practice.
Batch Normalization Layers
--------------------------
Batch normalization implementations for fully-connected layers and
convolutional layers are slightly different. We discuss both cases
below. Recall that one key differences between batch normalization and
other layers is that because batch normalization operates on a full
minibatch at a time, we cannot just ignore the batch dimension as we did
before when introducing other layers.
Fully-Connected Layers
~~~~~~~~~~~~~~~~~~~~~~
When applying batch normalization to fully-connected layers, the
original paper inserts batch normalization after the affine
transformation and before the nonlinear activation function (later
applications may insert batch normalization right after activation
functions) :cite:`Ioffe.Szegedy.2015`. Denoting the input to the
fully-connected layer by :math:`\mathbf{x}`, the affine transformation
by :math:`\mathbf{W}\mathbf{x} + \mathbf{b}` (with the weight parameter
:math:`\mathbf{W}` and the bias parameter :math:`\mathbf{b}`), and the
activation function by :math:`\phi`, we can express the computation of a
batch-normalization-enabled, fully-connected layer output
:math:`\mathbf{h}` as follows:
.. math:: \mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b}) ).
Recall that mean and variance are computed on the *same* minibatch on
which the transformation is applied.
Convolutional Layers
~~~~~~~~~~~~~~~~~~~~
Similarly, with convolutional layers, we can apply batch normalization
after the convolution and before the nonlinear activation function. When
the convolution has multiple output channels, we need to carry out batch
normalization for *each* of the outputs of these channels, and each
channel has its own scale and shift parameters, both of which are
scalars. Assume that our minibatches contain :math:`m` examples and that
for each channel, the output of the convolution has height :math:`p` and
width :math:`q`. For convolutional layers, we carry out each batch
normalization over the :math:`m \cdot p \cdot q` elements per output
channel simultaneously. Thus, we collect the values over all spatial
locations when computing the mean and variance and consequently apply
the same mean and variance within a given channel to normalize the value
at each spatial location.
Batch Normalization During Prediction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As we mentioned earlier, batch normalization typically behaves
differently in training mode and prediction mode. First, the noise in
the sample mean and the sample variance arising from estimating each on
minibatches are no longer desirable once we have trained the model.
Second, we might not have the luxury of computing per-batch
normalization statistics. For example, we might need to apply our model
to make one prediction at a time.
Typically, after training, we use the entire dataset to compute stable
estimates of the variable statistics and then fix them at prediction
time. Consequently, batch normalization behaves differently during
training and at test time. Recall that dropout also exhibits this
characteristic.
Implementation from Scratch
---------------------------
Below, we implement a batch normalization layer with tensors from
scratch.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import autograd, init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
# Use `autograd` to determine whether the current mode is training mode or
# prediction mode
if not autograd.is_training():
# If it is prediction mode, directly use the mean and variance
# obtained by moving average
X_hat = (X - moving_mean) / np.sqrt(moving_var + eps)
else:
assert len(X.shape) in (2, 4)
if len(X.shape) == 2:
# When using a fully-connected layer, calculate the mean and
# variance on the feature dimension
mean = X.mean(axis=0)
var = ((X - mean) ** 2).mean(axis=0)
else:
# When using a two-dimensional convolutional layer, calculate the
# mean and variance on the channel dimension (axis=1). Here we
# need to maintain the shape of `X`, so that the broadcasting
# operation can be carried out later
mean = X.mean(axis=(0, 2, 3), keepdims=True)
var = ((X - mean) ** 2).mean(axis=(0, 2, 3), keepdims=True)
# In training mode, the current mean and variance are used for the
# standardization
X_hat = (X - mean) / np.sqrt(var + eps)
# Update the mean and variance using moving average
moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
moving_var = momentum * moving_var + (1.0 - momentum) * var
Y = gamma * X_hat + beta # Scale and shift
return Y, moving_mean, moving_var
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from torch import nn
from d2l import torch as d2l
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
# Use `is_grad_enabled` to determine whether the current mode is training
# mode or prediction mode
if not torch.is_grad_enabled():
# If it is prediction mode, directly use the mean and variance
# obtained by moving average
X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
else:
assert len(X.shape) in (2, 4)
if len(X.shape) == 2:
# When using a fully-connected layer, calculate the mean and
# variance on the feature dimension
mean = X.mean(dim=0)
var = ((X - mean) ** 2).mean(dim=0)
else:
# When using a two-dimensional convolutional layer, calculate the
# mean and variance on the channel dimension (axis=1). Here we
# need to maintain the shape of `X`, so that the broadcasting
# operation can be carried out later
mean = X.mean(dim=(0, 2, 3), keepdim=True)
var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
# In training mode, the current mean and variance are used for the
# standardization
X_hat = (X - mean) / torch.sqrt(var + eps)
# Update the mean and variance using moving average
moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
moving_var = momentum * moving_var + (1.0 - momentum) * var
Y = gamma * X_hat + beta # Scale and shift
return Y, moving_mean.data, moving_var.data
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import tensorflow as tf
from d2l import tensorflow as d2l
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps):
# Compute reciprocal of square root of the moving variance elementwise
inv = tf.cast(tf.math.rsqrt(moving_var + eps), X.dtype)
# Scale and shift
inv *= gamma
Y = X * inv + (beta - moving_mean * inv)
return Y
.. raw:: html
.. raw:: html
We can now create a proper ``BatchNorm`` layer. Our layer will maintain
proper parameters for scale ``gamma`` and shift ``beta``, both of which
will be updated in the course of training. Additionally, our layer will
maintain moving averages of the means and variances for subsequent use
during model prediction.
Putting aside the algorithmic details, note the design pattern
underlying our implementation of the layer. Typically, we define the
mathematics in a separate function, say ``batch_norm``. We then
integrate this functionality into a custom layer, whose code mostly
addresses bookkeeping matters, such as moving data to the right device
context, allocating and initializing any required variables, keeping
track of moving averages (here for mean and variance), and so on. This
pattern enables a clean separation of mathematics from boilerplate code.
Also note that for the sake of convenience we did not worry about
automatically inferring the input shape here, thus we need to specify
the number of features throughout. Do not worry, the high-level batch
normalization APIs in the deep learning framework will care of this for
us and we will demonstrate that later.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class BatchNorm(nn.Block):
# `num_features`: the number of outputs for a fully-connected layer
# or the number of output channels for a convolutional layer. `num_dims`:
# 2 for a fully-connected layer and 4 for a convolutional layer
def __init__(self, num_features, num_dims, **kwargs):
super().__init__(**kwargs)
if num_dims == 2:
shape = (1, num_features)
else:
shape = (1, num_features, 1, 1)
# The scale parameter and the shift parameter (model parameters) are
# initialized to 1 and 0, respectively
self.gamma = self.params.get('gamma', shape=shape, init=init.One())
self.beta = self.params.get('beta', shape=shape, init=init.Zero())
# The variables that are not model parameters are initialized to 0 and 1
self.moving_mean = np.zeros(shape)
self.moving_var = np.ones(shape)
def forward(self, X):
# If `X` is not on the main memory, copy `moving_mean` and
# `moving_var` to the device where `X` is located
if self.moving_mean.ctx != X.ctx:
self.moving_mean = self.moving_mean.copyto(X.ctx)
self.moving_var = self.moving_var.copyto(X.ctx)
# Save the updated `moving_mean` and `moving_var`
Y, self.moving_mean, self.moving_var = batch_norm(
X, self.gamma.data(), self.beta.data(), self.moving_mean,
self.moving_var, eps=1e-12, momentum=0.9)
return Y
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class BatchNorm(nn.Module):
# `num_features`: the number of outputs for a fully-connected layer
# or the number of output channels for a convolutional layer. `num_dims`:
# 2 for a fully-connected layer and 4 for a convolutional layer
def __init__(self, num_features, num_dims):
super().__init__()
if num_dims == 2:
shape = (1, num_features)
else:
shape = (1, num_features, 1, 1)
# The scale parameter and the shift parameter (model parameters) are
# initialized to 1 and 0, respectively
self.gamma = nn.Parameter(torch.ones(shape))
self.beta = nn.Parameter(torch.zeros(shape))
# The variables that are not model parameters are initialized to 0 and 1
self.moving_mean = torch.zeros(shape)
self.moving_var = torch.ones(shape)
def forward(self, X):
# If `X` is not on the main memory, copy `moving_mean` and
# `moving_var` to the device where `X` is located
if self.moving_mean.device != X.device:
self.moving_mean = self.moving_mean.to(X.device)
self.moving_var = self.moving_var.to(X.device)
# Save the updated `moving_mean` and `moving_var`
Y, self.moving_mean, self.moving_var = batch_norm(
X, self.gamma, self.beta, self.moving_mean,
self.moving_var, eps=1e-5, momentum=0.9)
return Y
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class BatchNorm(tf.keras.layers.Layer):
def __init__(self, **kwargs):
super(BatchNorm, self).__init__(**kwargs)
def build(self, input_shape):
weight_shape = [input_shape[-1], ]
# The scale parameter and the shift parameter (model parameters) are
# initialized to 1 and 0, respectively
self.gamma = self.add_weight(name='gamma', shape=weight_shape,
initializer=tf.initializers.ones, trainable=True)
self.beta = self.add_weight(name='beta', shape=weight_shape,
initializer=tf.initializers.zeros, trainable=True)
# The variables that are not model parameters are initialized to 0
self.moving_mean = self.add_weight(name='moving_mean',
shape=weight_shape, initializer=tf.initializers.zeros,
trainable=False)
self.moving_variance = self.add_weight(name='moving_variance',
shape=weight_shape, initializer=tf.initializers.ones,
trainable=False)
super(BatchNorm, self).build(input_shape)
def assign_moving_average(self, variable, value):
momentum = 0.9
delta = variable * momentum + value * (1 - momentum)
return variable.assign(delta)
@tf.function
def call(self, inputs, training):
if training:
axes = list(range(len(inputs.shape) - 1))
batch_mean = tf.reduce_mean(inputs, axes, keepdims=True)
batch_variance = tf.reduce_mean(tf.math.squared_difference(
inputs, tf.stop_gradient(batch_mean)), axes, keepdims=True)
batch_mean = tf.squeeze(batch_mean, axes)
batch_variance = tf.squeeze(batch_variance, axes)
mean_update = self.assign_moving_average(
self.moving_mean, batch_mean)
variance_update = self.assign_moving_average(
self.moving_variance, batch_variance)
self.add_update(mean_update)
self.add_update(variance_update)
mean, variance = batch_mean, batch_variance
else:
mean, variance = self.moving_mean, self.moving_variance
output = batch_norm(inputs, moving_mean=mean, moving_var=variance,
beta=self.beta, gamma=self.gamma, eps=1e-5)
return output
.. raw:: html
.. raw:: html
Applying Batch Normalization in LeNet
-------------------------------------
To see how to apply ``BatchNorm`` in context, below we apply it to a
traditional LeNet model (:numref:`sec_lenet`). Recall that batch
normalization is applied after the convolutional layers or
fully-connected layers but before the corresponding activation
functions.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net = nn.Sequential()
net.add(nn.Conv2D(6, kernel_size=5),
BatchNorm(6, num_dims=4),
nn.Activation('sigmoid'),
nn.AvgPool2D(pool_size=2, strides=2),
nn.Conv2D(16, kernel_size=5),
BatchNorm(16, num_dims=4),
nn.Activation('sigmoid'),
nn.AvgPool2D(pool_size=2, strides=2),
nn.Dense(120),
BatchNorm(120, num_dims=2),
nn.Activation('sigmoid'),
nn.Dense(84),
BatchNorm(84, num_dims=2),
nn.Activation('sigmoid'),
nn.Dense(10))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net = nn.Sequential(
nn.Conv2d(1, 6, kernel_size=5), BatchNorm(6, num_dims=4), nn.Sigmoid(),
nn.AvgPool2d(kernel_size=2, stride=2),
nn.Conv2d(6, 16, kernel_size=5), BatchNorm(16, num_dims=4), nn.Sigmoid(),
nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
nn.Linear(16*4*4, 120), BatchNorm(120, num_dims=2), nn.Sigmoid(),
nn.Linear(120, 84), BatchNorm(84, num_dims=2), nn.Sigmoid(),
nn.Linear(84, 10))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
# Recall that this has to be a function that will be passed to `d2l.train_ch6`
# so that model building or compiling need to be within `strategy.scope()` in
# order to utilize the CPU/GPU devices that we have
def net():
return tf.keras.models.Sequential([
tf.keras.layers.Conv2D(filters=6, kernel_size=5,
input_shape=(28, 28, 1)),
BatchNorm(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
tf.keras.layers.Conv2D(filters=16, kernel_size=5),
BatchNorm(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(120),
BatchNorm(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.Dense(84),
BatchNorm(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.Dense(10)]
)
.. raw:: html
.. raw:: html
As before, we will train our network on the Fashion-MNIST dataset. This
code is virtually identical to that when we first trained LeNet
(:numref:`sec_lenet`). The main difference is the larger learning
rate.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.263, train acc 0.904, test acc 0.753
17469.3 examples/sec on gpu(0)
.. figure:: output_batch-norm_cf033c_39_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.266, train acc 0.900, test acc 0.782
33976.9 examples/sec on cuda:0
.. figure:: output_batch-norm_cf033c_42_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
net = d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.266, train acc 0.901, test acc 0.874
34491.1 examples/sec on /GPU:0
.. figure:: output_batch-norm_cf033c_45_1.svg
.. raw:: html
.. raw:: html
Let us have a look at the scale parameter ``gamma`` and the shift
parameter ``beta`` learned from the first batch normalization layer.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net[1].gamma.data().reshape(-1,), net[1].beta.data().reshape(-1,)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(array([4.7079544 , 2.4648035 , 3.1129107 , 3.1225557 , 0.69371796,
1.2715546 ], ctx=gpu(0)),
array([ 2.4241388, 1.6977667, -3.5157082, -2.2939754, -0.4219235,
-0.9039332], ctx=gpu(0)))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net[1].gamma.reshape((-1,)), net[1].beta.reshape((-1,))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(tensor([3.2870, 2.1564, 2.1719, 2.8857, 4.2039, 2.9774], device='cuda:0',
grad_fn=),
tensor([ 2.2585, -2.3030, -2.0394, 2.5391, -2.3453, -0.1399], device='cuda:0',
grad_fn=))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
tf.reshape(net.layers[1].gamma, (-1,)), tf.reshape(net.layers[1].beta, (-1,))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(,
)
.. raw:: html
.. raw:: html
Concise Implementation
----------------------
Compared with the ``BatchNorm`` class, which we just defined ourselves,
we can use the ``BatchNorm`` class defined in high-level APIs from the
deep learning framework directly. The code looks virtually identical to
our implementation above.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net = nn.Sequential()
net.add(nn.Conv2D(6, kernel_size=5),
nn.BatchNorm(),
nn.Activation('sigmoid'),
nn.AvgPool2D(pool_size=2, strides=2),
nn.Conv2D(16, kernel_size=5),
nn.BatchNorm(),
nn.Activation('sigmoid'),
nn.AvgPool2D(pool_size=2, strides=2),
nn.Dense(120),
nn.BatchNorm(),
nn.Activation('sigmoid'),
nn.Dense(84),
nn.BatchNorm(),
nn.Activation('sigmoid'),
nn.Dense(10))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
net = nn.Sequential(
nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6), nn.Sigmoid(),
nn.AvgPool2d(kernel_size=2, stride=2),
nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16), nn.Sigmoid(),
nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
nn.Linear(256, 120), nn.BatchNorm1d(120), nn.Sigmoid(),
nn.Linear(120, 84), nn.BatchNorm1d(84), nn.Sigmoid(),
nn.Linear(84, 10))
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def net():
return tf.keras.models.Sequential([
tf.keras.layers.Conv2D(filters=6, kernel_size=5,
input_shape=(28, 28, 1)),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
tf.keras.layers.Conv2D(filters=16, kernel_size=5),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.AvgPool2D(pool_size=2, strides=2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(120),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.Dense(84),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('sigmoid'),
tf.keras.layers.Dense(10),
])
.. raw:: html
.. raw:: html
Below, we use the same hyperparameters to train our model. Note that as
usual, the high-level API variant runs much faster because its code has
been compiled to C++ or CUDA while our custom implementation must be
interpreted by Python.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.279, train acc 0.896, test acc 0.799
33561.8 examples/sec on gpu(0)
.. figure:: output_batch-norm_cf033c_75_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.277, train acc 0.897, test acc 0.838
57911.7 examples/sec on cuda:0
.. figure:: output_batch-norm_cf033c_78_1.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
loss 0.265, train acc 0.902, test acc 0.876
42919.8 examples/sec on /GPU:0
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. figure:: output_batch-norm_cf033c_81_2.svg
.. raw:: html
.. raw:: html
Controversy
-----------
Intuitively, batch normalization is thought to make the optimization
landscape smoother. However, we must be careful to distinguish between
speculative intuitions and true explanations for the phenomena that we
observe when training deep models. Recall that we do not even know why
simpler deep neural networks (MLPs and conventional CNNs) generalize
well in the first place. Even with dropout and weight decay, they remain
so flexible that their ability to generalize to unseen data cannot be
explained via conventional learning-theoretic generalization guarantees.
In the original paper proposing batch normalization, the authors, in
addition to introducing a powerful and useful tool, offered an
explanation for why it works: by reducing *internal covariate shift*.
Presumably by *internal covariate shift* the authors meant something
like the intuition expressed above---the notion that the distribution of
variable values changes over the course of training. However, there were
two problems with this explanation: i) This drift is very different from
*covariate shift*, rendering the name a misnomer. ii) The explanation
offers an under-specified intuition but leaves the question of *why
precisely this technique works* an open question wanting for a rigorous
explanation. Throughout this book, we aim to convey the intuitions that
practitioners use to guide their development of deep neural networks.
However, we believe that it is important to separate these guiding
intuitions from established scientific fact. Eventually, when you master
this material and start writing your own research papers you will want
to be clear to delineate between technical claims and hunches.
Following the success of batch normalization, its explanation in terms
of *internal covariate shift* has repeatedly surfaced in debates in the
technical literature and broader discourse about how to present machine
learning research. In a memorable speech given while accepting a Test of
Time Award at the 2017 NeurIPS conference, Ali Rahimi used *internal
covariate shift* as a focal point in an argument likening the modern
practice of deep learning to alchemy. Subsequently, the example was
revisited in detail in a position paper outlining troubling trends in
machine learning :cite:`Lipton.Steinhardt.2018`. Other authors have
proposed alternative explanations for the success of batch
normalization, some claiming that batch normalization's success comes
despite exhibiting behavior that is in some ways opposite to those
claimed in the original paper :cite:`Santurkar.Tsipras.Ilyas.ea.2018`.
We note that the *internal covariate shift* is no more worthy of
criticism than any of thousands of similarly vague claims made every
year in the technical machine learning literature. Likely, its resonance
as a focal point of these debates owes to its broad recognizability to
the target audience. Batch normalization has proven an indispensable
method, applied in nearly all deployed image classifiers, earning the
paper that introduced the technique tens of thousands of citations.
Summary
-------
- During model training, batch normalization continuously adjusts the
intermediate output of the neural network by utilizing the mean and
standard deviation of the minibatch, so that the values of the
intermediate output in each layer throughout the neural network are
more stable.
- The batch normalization methods for fully-connected layers and
convolutional layers are slightly different.
- Like a dropout layer, batch normalization layers have different
computation results in training mode and prediction mode.
- Batch normalization has many beneficial side effects, primarily that
of regularization. On the other hand, the original motivation of
reducing internal covariate shift seems not to be a valid
explanation.
Exercises
---------
1. Can we remove the bias parameter from the fully-connected layer or
the convolutional layer before the batch normalization? Why?
2. Compare the learning rates for LeNet with and without batch
normalization.
1. Plot the increase in training and test accuracy.
2. How large can you make the learning rate?
3. Do we need batch normalization in every layer? Experiment with it?
4. Can you replace dropout by batch normalization? How does the behavior
change?
5. Fix the parameters ``beta`` and ``gamma``, and observe and analyze
the results.
6. Review the online documentation for ``BatchNorm`` from the high-level
APIs to see the other applications for batch normalization.
7. Research ideas: think of other normalization transforms that you can
apply? Can you apply the probability integral transform? How about a
full rank covariance estimate?
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html