.. _sec_mlp:
Multilayer Perceptrons
======================
In :numref:`chap_linear`, we introduced softmax regression
(:numref:`sec_softmax`), implementing the algorithm from scratch
(:numref:`sec_softmax_scratch`) and using high-level APIs
(:numref:`sec_softmax_concise`), and training classifiers to recognize
10 categories of clothing from low-resolution images. Along the way, we
learned how to wrangle data, coerce our outputs into a valid probability
distribution, apply an appropriate loss function, and minimize it with
respect to our model's parameters. Now that we have mastered these
mechanics in the context of simple linear models, we can launch our
exploration of deep neural networks, the comparatively rich class of
models with which this book is primarily concerned.
Hidden Layers
-------------
We have described the affine transformation in
:numref:`subsec_linear_model`, which is a linear transformation with an
added bias. To begin, recall the model architecture corresponding to our
softmax regression example, illustrated in :numref:`fig_softmaxreg`.
This model mapped our inputs directly to our outputs via a single affine
transformation, followed by a softmax operation. If our labels truly
were related to our input data by an affine transformation, then this
approach would be sufficient. But linearity in affine transformations is
a *strong* assumption.
Linear Models May Go Wrong
~~~~~~~~~~~~~~~~~~~~~~~~~~
For example, linearity implies the *weaker* assumption of
*monotonicity*: that any increase in our feature must either always
cause an increase in our model's output (if the corresponding weight is
positive), or always cause a decrease in our model's output (if the
corresponding weight is negative). Sometimes that makes sense. For
example, if we were trying to predict whether an individual will repay a
loan, we might reasonably imagine that holding all else equal, an
applicant with a higher income would always be more likely to repay than
one with a lower income. While monotonic, this relationship likely is
not linearly associated with the probability of repayment. An increase
in income from 0 to 50 thousand likely corresponds to a bigger increase
in likelihood of repayment than an increase from 1 million to 1.05
million. One way to handle this might be to preprocess our data such
that linearity becomes more plausible, say, by using the logarithm of
income as our feature.
Note that we can easily come up with examples that violate monotonicity.
Say for example that we want to predict probability of death based on
body temperature. For individuals with a body temperature above 37°C
(98.6°F), higher temperatures indicate greater risk. However, for
individuals with body temperatures below 37°C, higher temperatures
indicate lower risk! In this case too, we might resolve the problem with
some clever preprocessing. Namely, we might use the distance from 37°C
as our feature.
But what about classifying images of cats and dogs? Should increasing
the intensity of the pixel at location (13, 17) always increase (or
always decrease) the likelihood that the image depicts a dog? Reliance
on a linear model corresponds to the implicit assumption that the only
requirement for differentiating cats vs. dogs is to assess the
brightness of individual pixels. This approach is doomed to fail in a
world where inverting an image preserves the category.
And yet despite the apparent absurdity of linearity here, as compared
with our previous examples, it is less obvious that we could address the
problem with a simple preprocessing fix. That is because the
significance of any pixel depends in complex ways on its context (the
values of the surrounding pixels). While there might exist a
representation of our data that would take into account the relevant
interactions among our features, on top of which a linear model would be
suitable, we simply do not know how to calculate it by hand. With deep
neural networks, we use observational data to jointly learn both a
representation via hidden layers and a linear predictor that acts upon
that representation.
Incorporating Hidden Layers
~~~~~~~~~~~~~~~~~~~~~~~~~~~
We can overcome these limitations of linear models and handle a more
general class of functions by incorporating one or more hidden layers.
The easiest way to do this is to stack many fully-connected layers on
top of each other. Each layer feeds into the layer above it, until we
generate outputs. We can think of the first :math:`L-1` layers as our
representation and the final layer as our linear predictor. This
architecture is commonly called a *multilayer perceptron*, often
abbreviated as *MLP*. Below, we depict an MLP diagrammatically
(:numref:`fig_mlp`).
.. _fig_mlp:

.. figure:: ../img/mlp.svg

   An MLP with a hidden layer of 5 hidden units.
This MLP has 4 inputs, 3 outputs, and its hidden layer contains 5 hidden
units. Since the input layer does not involve any calculations,
producing outputs with this network requires implementing the
computations for both the hidden and output layers; thus, the number of
layers in this MLP is 2. Note that these layers are both fully
connected. Every input influences every neuron in the hidden layer, and
each of these in turn influences every neuron in the output layer.
However, as suggested by
:numref:`subsec_parameterization-cost-fc-layers`, the parameterization
cost of MLPs with fully-connected layers can be prohibitively high,
which may motivate a tradeoff between parameter savings and model
effectiveness, even without changing the input or output size
:cite:`Zhang.Tay.Zhang.ea.2021`.
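To get a concrete sense of this cost, consider a quick parameter count.
The layer sizes below are purely illustrative (they do not correspond to
any model in this book): a fully-connected layer mapping :math:`d`
inputs to :math:`h` outputs requires :math:`d \times h` weights plus
:math:`h` biases.

.. code:: python

    # Illustrative (hypothetical) sizes for a one-hidden-layer MLP built
    # from fully-connected layers: d inputs, h hidden units, q outputs
    d, h, q = 1024, 512, 10
    params_hidden = d * h + h        # weights plus biases of the hidden layer
    params_output = h * q + q        # weights plus biases of the output layer
    print(params_hidden + params_output)  # 529930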
From Linear to Nonlinear
~~~~~~~~~~~~~~~~~~~~~~~~
As before, by the matrix :math:`\mathbf{X} \in \mathbb{R}^{n \times d}`,
we denote a minibatch of :math:`n` examples where each example has
:math:`d` inputs (features). For a one-hidden-layer MLP whose hidden
layer has :math:`h` hidden units, denote by
:math:`\mathbf{H} \in \mathbb{R}^{n \times h}` the outputs of the hidden
layer, which are *hidden representations*. In mathematics or code,
:math:`\mathbf{H}` is also known as a *hidden-layer variable* or a
*hidden variable*. Since the hidden and output layers are both fully
connected, we have hidden-layer weights
:math:`\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}` and biases
:math:`\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}` and output-layer
weights :math:`\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}` and biases
:math:`\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}`. Formally, we
calculate the outputs :math:`\mathbf{O} \in \mathbb{R}^{n \times q}` of
the one-hidden-layer MLP as follows:
.. math::
\begin{aligned}
\mathbf{H} & = \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \\
\mathbf{O} & = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}.
\end{aligned}
Note that after adding the hidden layer, our model now requires us to
track and update additional sets of parameters. So what have we gained
in exchange? You might be surprised to find out that---in the model
defined above---\ *we gain nothing for our troubles*! The reason is
plain. The hidden units above are given by an affine function of the
inputs, and the outputs (pre-softmax) are just an affine function of the
hidden units. An affine function of an affine function is itself an
affine function. Moreover, our linear model was already capable of
representing any affine function.
We can view the equivalence formally by proving that for any values of
the weights, we can just collapse out the hidden layer, yielding an
equivalent single-layer model with parameters
:math:`\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}` and
:math:`\mathbf{b} = \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}`:
.. math::
\mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W} + \mathbf{b}.
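As a quick numerical sanity check, the sketch below (PyTorch, with
arbitrarily chosen shapes; it is illustrative only and not part of the
models trained in this book) confirms that the stacked affine layers and
the collapsed single layer produce identical outputs.

.. code:: python

    import torch

    n, d, h, q = 4, 3, 5, 2
    X = torch.randn(n, d)
    W1, b1 = torch.randn(d, h), torch.randn(h)
    W2, b2 = torch.randn(h, q), torch.randn(q)

    O_stacked = (X @ W1 + b1) @ W2 + b2     # two affine transformations
    W, b = W1 @ W2, b1 @ W2 + b2            # collapsed parameters
    O_collapsed = X @ W + b                 # one affine transformation
    print(torch.allclose(O_stacked, O_collapsed, atol=1e-5))  # True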
In order to realize the potential of multilayer architectures, we need
one more key ingredient: a nonlinear *activation function*
:math:`\sigma` to be applied to each hidden unit following the affine
transformation. The outputs of activation functions (e.g.,
:math:`\sigma(\cdot)`) are called *activations*. In general, with
activation functions in place, it is no longer possible to collapse our
MLP into a linear model:
.. math::
\begin{aligned}
\mathbf{H} & = \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \\
\mathbf{O} & = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}.\\
\end{aligned}
Since each row in :math:`\mathbf{X}` corresponds to an example in the
minibatch, with some abuse of notation, we define the nonlinearity
:math:`\sigma` to apply to its inputs in a rowwise fashion, i.e., one
example at a time. Note that we used the notation for softmax in the
same way to denote a rowwise operation in
:numref:`subsec_softmax_vectorization`. Often, as in this section, the
activation functions that we apply to hidden layers are not merely
rowwise, but elementwise. That means that after computing the linear
portion of the layer, we can calculate each activation without looking
at the values taken by the other hidden units. This is true for most
activation functions.
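For concreteness, here is a minimal sketch of this one-hidden-layer
forward computation with a ReLU activation (PyTorch, with illustrative
shapes; the implementations later in this chapter follow the same
pattern):

.. code:: python

    import torch

    def one_hidden_layer_mlp(X, W1, b1, W2, b2):
        H = torch.relu(X @ W1 + b1)  # hidden representations (elementwise ReLU)
        return H @ W2 + b2           # outputs O (pre-softmax)

    # n=2 examples, d=4 features, h=5 hidden units, q=3 outputs
    X = torch.randn(2, 4)
    W1, b1 = torch.randn(4, 5), torch.zeros(5)
    W2, b2 = torch.randn(5, 3), torch.zeros(3)
    print(one_hidden_layer_mlp(X, W1, b1, W2, b2).shape)  # torch.Size([2, 3])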
To build more general MLPs, we can continue stacking such hidden layers,
e.g.,
:math:`\mathbf{H}^{(1)} = \sigma_1(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})`
and
:math:`\mathbf{H}^{(2)} = \sigma_2(\mathbf{H}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)})`,
one atop another, yielding ever more expressive models.
Universal Approximators
~~~~~~~~~~~~~~~~~~~~~~~
MLPs can capture complex interactions among our inputs via their hidden
neurons, which depend on the values of each of the inputs. We can easily
design hidden nodes to perform arbitrary computation, for instance,
basic logic operations on a pair of inputs. Moreover, for certain
choices of the activation function, it is widely known that MLPs are
universal approximators. Even with a single-hidden-layer network, given
enough nodes (possibly absurdly many), and the right set of weights, we
can model any function, though actually learning that function is the
hard part. You might think of your neural network as being a bit like
the C programming language. The language, like any other modern
language, is capable of expressing any computable program. But actually
coming up with a program that meets your specifications is the hard
part.
Moreover, just because a single-hidden-layer network *can* learn any
function does not mean that you should try to solve all of your problems
with single-hidden-layer networks. In fact, we can approximate many
functions much more compactly by using deeper (vs. wider) networks. We
will touch upon more rigorous arguments in subsequent chapters.
.. _subsec_activation-functions:
Activation Functions
--------------------
Activation functions decide whether a neuron should be activated or not
by calculating the weighted sum and adding a bias to it. They are
differentiable operators that transform input signals to outputs, and
most of them add nonlinearity. Because activation functions are
fundamental to deep learning, let us briefly survey some common
activation functions.
.. code:: python
%matplotlib inline
from mxnet import autograd, np, npx
from d2l import mxnet as d2l
npx.set_np()
.. code:: python
%matplotlib inline
import torch
from d2l import torch as d2l
.. code:: python
%matplotlib inline
import tensorflow as tf
from d2l import tensorflow as d2l
ReLU Function
~~~~~~~~~~~~~
The most popular choice, due to both simplicity of implementation and
its good performance on a variety of predictive tasks, is the *rectified
linear unit* (*ReLU*). ReLU provides a very simple nonlinear
transformation. Given an element :math:`x`, the function is defined as
the maximum of that element and :math:`0`:
.. math:: \operatorname{ReLU}(x) = \max(x, 0).
Informally, the ReLU function retains only positive elements and
discards all negative elements by setting the corresponding activations
to 0. To gain some intuition, we can plot the function. As you can see,
the activation function is piecewise linear.
.. code:: python
x = np.arange(-8.0, 8.0, 0.1)
x.attach_grad()
with autograd.record():
    y = npx.relu(x)
d2l.plot(x, y, 'x', 'relu(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_15_0.svg
.. code:: python
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_18_0.svg
.. code:: python
x = tf.Variable(tf.range(-8.0, 8.0, 0.1), dtype=tf.float32)
y = tf.nn.relu(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'relu(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_21_0.svg
When the input is negative, the derivative of the ReLU function is 0,
and when the input is positive, the derivative of the ReLU function is
1. Note that the ReLU function is not differentiable when the input
takes value precisely equal to 0. In these cases, we default to the
left-hand-side derivative and say that the derivative is 0 when the
input is 0. We can get away with this because the input may never
actually be zero. There is an old adage that if subtle boundary
conditions matter, we are probably doing (*real*) mathematics, not
engineering. That conventional wisdom may apply here. The derivative of
the ReLU function is plotted below.
.. code:: python
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_27_1.svg
.. code:: python
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_30_0.svg
.. code:: python
with tf.GradientTape() as t:
    y = tf.nn.relu(x)
d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of relu',
         figsize=(5, 2.5))
.. figure:: output_mlp_76f463_33_0.svg
The reason for using ReLU is that its derivatives are particularly well
behaved: either they vanish or they just let the argument through. This
makes optimization better behaved and mitigates the well-documented
problem of vanishing gradients that plagued earlier generations of
neural networks (more on this later).
Note that there are many variants of the ReLU function, including the
*parameterized ReLU* (*pReLU*) function :cite:`He.Zhang.Ren.ea.2015`.
This variation adds a linear term to ReLU, so some information still
gets through, even when the argument is negative:
.. math:: \operatorname{pReLU}(x) = \max(0, x) + \alpha \min(0, x).
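A minimal sketch of pReLU is shown below (illustrative only: here
:math:`\alpha` is a fixed scalar, whereas in :cite:`He.Zhang.Ren.ea.2015`
it is a parameter learned during training; PyTorch also provides a
learnable built-in, ``torch.nn.PReLU``):

.. code:: python

    import torch

    def prelu(x, alpha=0.25):
        # max(0, x) + alpha * min(0, x)
        return torch.clamp(x, min=0) + alpha * torch.clamp(x, max=0)

    x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
    print(prelu(x))  # tensor([-0.5000, -0.1250,  0.0000,  1.0000])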
Sigmoid Function
~~~~~~~~~~~~~~~~
The *sigmoid function* transforms its inputs, whose values lie in the
domain :math:`\mathbb{R}`, into outputs that lie on the interval (0, 1).
For that reason, the sigmoid is often called a *squashing function*: it
squashes any input in the range :math:`(-\infty, \infty)` to some value
in the range (0, 1):
.. math:: \operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.
In the earliest neural networks, scientists were interested in modeling
biological neurons which either *fire* or *do not fire*. Thus the
pioneers of this field, going all the way back to McCulloch and Pitts,
the inventors of the artificial neuron, focused on thresholding units. A
thresholding activation takes value 0 when its input is below some
threshold and value 1 when the input exceeds the threshold.
When attention shifted to gradient-based learning, the sigmoid function
was a natural choice because it is a smooth, differentiable
approximation to a thresholding unit. Sigmoids are still widely used as
activation functions on the output units, when we want to interpret the
outputs as probabilities for binary classification problems (you can
think of the sigmoid as a special case of the softmax). However, the
sigmoid has mostly been replaced by the simpler and more easily
trainable ReLU for most uses in hidden layers. In later chapters on
recurrent neural networks, we will describe architectures that leverage
sigmoid units to control the flow of information across time.
Below, we plot the sigmoid function. Note that when the input is close
to 0, the sigmoid function approaches a linear transformation.
.. code:: python
with autograd.record():
    y = npx.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_39_0.svg
.. code:: python
y = torch.sigmoid(x)
d2l.plot(x.detach(), y.detach(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_42_0.svg
.. code:: python
y = tf.nn.sigmoid(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_45_0.svg
The derivative of the sigmoid function is given by the following
equation:
.. math:: \frac{d}{dx} \operatorname{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \operatorname{sigmoid}(x)\left(1-\operatorname{sigmoid}(x)\right).
The derivative of the sigmoid function is plotted below. Note that when
the input is 0, the derivative of the sigmoid function reaches a maximum
of 0.25. As the input diverges from 0 in either direction, the
derivative approaches 0.
.. code:: python
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_51_0.svg
.. code:: python
# Clear out previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_54_0.svg
.. code:: python
with tf.GradientTape() as t:
    y = tf.nn.sigmoid(x)
d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of sigmoid',
         figsize=(5, 2.5))
.. figure:: output_mlp_76f463_57_0.svg
Tanh Function
~~~~~~~~~~~~~
Like the sigmoid function, the tanh (hyperbolic tangent) function also
squashes its inputs, transforming them into elements on the interval
between -1 and 1:
.. math:: \operatorname{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}.
We plot the tanh function below. Note that as the input nears 0, the
tanh function approaches a linear transformation. Although the shape of
the function is similar to that of the sigmoid function, the tanh
function exhibits point symmetry about the origin of the coordinate
system.
.. code:: python
with autograd.record():
    y = np.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_63_0.svg
.. code:: python
y = torch.tanh(x)
d2l.plot(x.detach(), y.detach(), 'x', 'tanh(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_66_0.svg
.. code:: python
y = tf.nn.tanh(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'tanh(x)', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_69_0.svg
The derivative of the tanh function is:
.. math:: \frac{d}{dx} \operatorname{tanh}(x) = 1 - \operatorname{tanh}^2(x).
The derivative of the tanh function is plotted below. As the input nears 0,
the derivative of the tanh function approaches a maximum of 1. And as we
saw with the sigmoid function, as the input moves away from 0 in either
direction, the derivative of the tanh function approaches 0.
.. code:: python
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_75_0.svg
.. code:: python
# Clear out previous gradients.
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
.. figure:: output_mlp_76f463_78_0.svg
.. code:: python
with tf.GradientTape() as t:
    y = tf.nn.tanh(x)
d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of tanh',
         figsize=(5, 2.5))
.. figure:: output_mlp_76f463_81_0.svg
In summary, we now know how to incorporate nonlinearities to build
expressive multilayer neural network architectures. As a side note, your
knowledge already puts you in command of a similar toolkit to a
practitioner circa 1990. In some ways, you have an advantage over anyone
working in the 1990s, because you can leverage powerful open-source deep
learning frameworks to build models rapidly, using only a few lines of
code. Previously, training these networks required researchers to code
up thousands of lines of C and Fortran.
Summary
-------
- An MLP adds one or more fully-connected hidden layers between the
  output and input layers and transforms the output of each hidden layer
  via an activation function.
- Commonly-used activation functions include the ReLU function, the
sigmoid function, and the tanh function.
Exercises
---------
1. Compute the derivative of the pReLU activation function.
2. Show that an MLP using only ReLU (or pReLU) constructs a continuous
piecewise linear function.
3. Show that
:math:`\operatorname{tanh}(x) + 1 = 2 \operatorname{sigmoid}(2x)`.
4. Assume that we have a nonlinearity that applies to one minibatch at a
time. What kinds of problems do you expect this to cause?