.. _sec_weight_decay:

Weight Decay
============


Now that we have characterized the problem of overfitting, we can
introduce some standard techniques for regularizing models. Recall that
we can always mitigate overfitting by going out and collecting more
training data. That can be costly, time consuming, or entirely out of
our control, making it impossible in the short run. For now, we can
assume that we already have as much high-quality data as our resources
permit and focus on regularization techniques.

Recall that in our polynomial regression example
(:numref:`sec_model_selection`) we could limit our model's capacity
simply by tweaking the degree of the fitted polynomial. Indeed, limiting
the number of features is a popular technique to mitigate overfitting.
However, simply tossing aside features can be too blunt an instrument
for the job. Sticking with the polynomial regression example, consider
what might happen with high-dimensional inputs. The natural extensions
of polynomials to multivariate data are called *monomials*, which are
simply products of powers of variables. The degree of a monomial is the
sum of the powers. For example, :math:`x_1^2 x_2`, and :math:`x_3 x_5^2`
are both monomials of degree 3.

Note that the number of terms with degree :math:`d` blows up rapidly as
:math:`d` grows larger. Given :math:`k` variables, the number of
monomials of degree :math:`d` (i.e., :math:`k` multichoose :math:`d`) is
:math:`{k - 1 + d} \choose {k - 1}`. Even small changes in degree, say
from :math:`2` to :math:`3`, dramatically increase the complexity of our
model. Thus we often need a more fine-grained tool for adjusting
function complexity.

Norms and Weight Decay
----------------------

We have described both the :math:`L_2` norm and the :math:`L_1` norm,
which are special cases of the more general :math:`L_p` norm in
:numref:`subsec_lin-algebra-norms`. *Weight decay* (commonly called
:math:`L_2` regularization), might be the most widely-used technique for
regularizing parametric machine learning models. The technique is
motivated by the basic intuition that among all functions :math:`f`, the
function :math:`f = 0` (assigning the value :math:`0` to all inputs) is
in some sense the *simplest*, and that we can measure the complexity of
a function by its distance from zero. But how precisely should we
measure the distance between a function and zero? There is no single
right answer. In fact, entire branches of mathematics, including parts
of functional analysis and the theory of Banach spaces, are devoted to
answering this issue.

One simple interpretation might be to measure the complexity of a linear
function :math:`f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}` by some norm
of its weight vector, e.g., :math:`\| \mathbf{w} \|^2`. The most common
method for ensuring a small weight vector is to add its norm as a
penalty term to the problem of minimizing the loss. Thus we replace our
original objective, *minimizing the prediction loss on the training
labels*, with new objective, *minimizing the sum of the prediction loss
and the penalty term*. Now, if our weight vector grows too large, our
learning algorithm might focus on minimizing the weight norm
:math:`\| \mathbf{w} \|^2` vs. minimizing the training error. That is
exactly what we want. To illustrate things in code, let us revive our
previous example from :numref:`sec_linear_regression` for linear
regression. There, our loss was given by

.. math:: L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.

Recall that :math:`\mathbf{x}^{(i)}` are the features, :math:`y^{(i)}`
are labels for all data examples :math:`i`, and :math:`(\mathbf{w}, b)`
are the weight and bias parameters, respectively. To penalize the size
of the weight vector, we must somehow add :math:`\| \mathbf{w} \|^2` to
the loss function, but how should the model trade off the standard loss
for this new additive penalty? In practice, we characterize this
tradeoff via the *regularization constant* :math:`\lambda`, a
non-negative hyperparameter that we fit using validation data:

.. math:: L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2,

For :math:`\lambda = 0`, we recover our original loss function. For
:math:`\lambda > 0`, we restrict the size of :math:`\| \mathbf{w} \|`.
We divide by :math:`2` by convention: when we take the derivative of a
quadratic function, the :math:`2` and :math:`1/2` cancel out, ensuring
that the expression for the update looks nice and simple. The astute
reader might wonder why we work with the squared norm and not the
standard norm (i.e., the Euclidean distance). We do this for
computational convenience. By squaring the :math:`L_2` norm, we remove
the square root, leaving the sum of squares of each component of the
weight vector. This makes the derivative of the penalty easy to compute:
the sum of derivatives equals the derivative of the sum.

Moreover, you might ask why we work with the :math:`L_2` norm in the
first place and not, say, the :math:`L_1` norm. In fact, other choices
are valid and popular throughout statistics. While
:math:`L_2`-regularized linear models constitute the classic *ridge
regression* algorithm, :math:`L_1`-regularized linear regression is a
similarly fundamental model in statistics, which is popularly known as
*lasso regression*.

One reason to work with the :math:`L_2` norm is that it places an
outsize penalty on large components of the weight vector. This biases
our learning algorithm towards models that distribute weight evenly
across a larger number of features. In practice, this might make them
more robust to measurement error in a single variable. By contrast,
:math:`L_1` penalties lead to models that concentrate weights on a small
set of features by clearing the other weights to zero. This is called
*feature selection*, which may be desirable for other reasons.

Using the same notation in :eq:`eq_linreg_batch_update`, the
minibatch stochastic gradient descent updates for
:math:`L_2`-regularized regression follow:

.. math::


   \begin{aligned}
   \mathbf{w} & \leftarrow \left(1- \eta\lambda \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
   \end{aligned}

As before, we update :math:`\mathbf{w}` based on the amount by which our
estimate differs from the observation. However, we also shrink the size
of :math:`\mathbf{w}` towards zero. That is why the method is sometimes
called "weight decay": given the penalty term alone, our optimization
algorithm *decays* the weight at each step of training. In contrast to
feature selection, weight decay offers us a continuous mechanism for
adjusting the complexity of a function. Smaller values of
:math:`\lambda` correspond to less constrained :math:`\mathbf{w}`,
whereas larger values of :math:`\lambda` constrain :math:`\mathbf{w}`
more considerably.

Whether we include a corresponding bias penalty :math:`b^2` can vary
across implementations, and may vary across layers of a neural network.
Often, we do not regularize the bias term of a network's output layer.

High-Dimensional Linear Regression
----------------------------------

We can illustrate the benefits of weight decay through a simple
synthetic example.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-1-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-1-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-1-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-1-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    %matplotlib inline
    from mxnet import autograd, gluon, init, np, npx
    from mxnet.gluon import nn
    from d2l import mxnet as d2l
    
    npx.set_np()


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-1-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    %matplotlib inline
    import torch
    from torch import nn
    from d2l import torch as d2l


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-1-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    %matplotlib inline
    import tensorflow as tf
    from d2l import tensorflow as d2l


.. raw:: html

     </div>


.. raw:: html

     </div>

First, we generate some data as before

.. math::

   y = 0.05 + \sum_{i = 1}^d 0.01 x_i + \epsilon \text{ where }
   \epsilon \sim \mathcal{N}(0, 0.01^2).

We choose our label to be a linear function of our inputs, corrupted by
Gaussian noise with zero mean and standard deviation 0.01. To make the
effects of overfitting pronounced, we can increase the dimensionality of
our problem to :math:`d = 200` and work with a small training set
containing only 20 examples.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-3-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-3-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-3-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-3-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
    true_w, true_b = np.ones((num_inputs, 1)) * 0.01, 0.05
    train_data = d2l.synthetic_data(true_w, true_b, n_train)
    train_iter = d2l.load_array(train_data, batch_size)
    test_data = d2l.synthetic_data(true_w, true_b, n_test)
    test_iter = d2l.load_array(test_data, batch_size, is_train=False)


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-3-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
    true_w, true_b = torch.ones((num_inputs, 1)) * 0.01, 0.05
    train_data = d2l.synthetic_data(true_w, true_b, n_train)
    train_iter = d2l.load_array(train_data, batch_size)
    test_data = d2l.synthetic_data(true_w, true_b, n_test)
    test_iter = d2l.load_array(test_data, batch_size, is_train=False)


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-3-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
    true_w, true_b = tf.ones((num_inputs, 1)) * 0.01, 0.05
    train_data = d2l.synthetic_data(true_w, true_b, n_train)
    train_iter = d2l.load_array(train_data, batch_size)
    test_data = d2l.synthetic_data(true_w, true_b, n_test)
    test_iter = d2l.load_array(test_data, batch_size, is_train=False)


.. raw:: html

     </div>


.. raw:: html

     </div>

Implementation from Scratch
---------------------------

In the following, we will implement weight decay from scratch, simply by
adding the squared :math:`L_2` penalty to the original target function.

Initializing Model Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, we will define a function to randomly initialize our model
parameters.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-5-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-5-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-5-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-5-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def init_params():
        w = np.random.normal(scale=1, size=(num_inputs, 1))
        b = np.zeros(1)
        w.attach_grad()
        b.attach_grad()
        return [w, b]


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-5-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def init_params():
        w = torch.normal(0, 1, size=(num_inputs, 1), requires_grad=True)
        b = torch.zeros(1, requires_grad=True)
        return [w, b]


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-5-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def init_params():
        w = tf.Variable(tf.random.normal(mean=1, shape=(num_inputs, 1)))
        b = tf.Variable(tf.zeros(shape=(1, )))
        return [w, b]


.. raw:: html

     </div>


.. raw:: html

     </div>

Defining :math:`L_2` Norm Penalty
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Perhaps the most convenient way to implement this penalty is to square
all terms in place and sum them up.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-7-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-7-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-7-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-7-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def l2_penalty(w):
        return (w**2).sum() / 2


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-7-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def l2_penalty(w):
        return torch.sum(w.pow(2)) / 2


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-7-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def l2_penalty(w):
        return tf.reduce_sum(tf.pow(w, 2)) / 2


.. raw:: html

     </div>


.. raw:: html

     </div>

Defining the Training Loop
~~~~~~~~~~~~~~~~~~~~~~~~~~

The following code fits a model on the training set and evaluates it on
the test set. The linear network and the squared loss have not changed
since :numref:`chap_linear`, so we will just import them via
``d2l.linreg`` and ``d2l.squared_loss``. The only change here is that
our loss now includes the penalty term.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-9-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-9-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-9-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-9-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def train(lambd):
        w, b = init_params()
        net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
        num_epochs, lr = 100, 0.003
        animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                                xlim=[5, num_epochs], legend=['train', 'test'])
        for epoch in range(num_epochs):
            for X, y in train_iter:
                with autograd.record():
                    # The L2 norm penalty term has been added, and broadcasting
                    # makes `l2_penalty(w)` a vector whose length is `batch_size`
                    l = loss(net(X), y) + lambd * l2_penalty(w)
                l.backward()
                d2l.sgd([w, b], lr, batch_size)
            if (epoch + 1) % 5 == 0:
                animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                         d2l.evaluate_loss(net, test_iter, loss)))
        print('L2 norm of w:', np.linalg.norm(w))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-9-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def train(lambd):
        w, b = init_params()
        net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
        num_epochs, lr = 100, 0.003
        animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                                xlim=[5, num_epochs], legend=['train', 'test'])
        for epoch in range(num_epochs):
            for X, y in train_iter:
                # The L2 norm penalty term has been added, and broadcasting
                # makes `l2_penalty(w)` a vector whose length is `batch_size`
                l = loss(net(X), y) + lambd * l2_penalty(w)
                l.sum().backward()
                d2l.sgd([w, b], lr, batch_size)
            if (epoch + 1) % 5 == 0:
                animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                         d2l.evaluate_loss(net, test_iter, loss)))
        print('L2 norm of w:', torch.norm(w).item())


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-9-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def train(lambd):
        w, b = init_params()
        net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
        num_epochs, lr = 100, 0.003
        animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                                xlim=[5, num_epochs], legend=['train', 'test'])
        for epoch in range(num_epochs):
            for X, y in train_iter:
                with tf.GradientTape() as tape:
                    # The L2 norm penalty term has been added, and broadcasting
                    # makes `l2_penalty(w)` a vector whose length is `batch_size`
                    l = loss(net(X), y) + lambd * l2_penalty(w)
                grads = tape.gradient(l, [w, b])
                d2l.sgd([w, b], grads, lr, batch_size)
            if (epoch + 1) % 5 == 0:
                animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                         d2l.evaluate_loss(net, test_iter, loss)))
        print('L2 norm of w:', tf.norm(w).numpy())


.. raw:: html

     </div>


.. raw:: html

     </div>

Training without Regularization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We now run this code with ``lambd = 0``, disabling weight decay. Note
that we overfit badly, decreasing the training error but not the test
error---a textbook case of overfitting.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-11-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-11-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-11-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-11-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train(lambd=0)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 13.259388


.. figure:: output_weight-decay_ec9cc0_63_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-11-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train(lambd=0)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 14.315082550048828


.. figure:: output_weight-decay_ec9cc0_66_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-11-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train(lambd=0)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 19.748163


.. figure:: output_weight-decay_ec9cc0_69_1.svg


.. raw:: html

     </div>


.. raw:: html

     </div>

Using Weight Decay
~~~~~~~~~~~~~~~~~~

Below, we run with substantial weight decay. Note that the training
error increases but the test error decreases. This is precisely the
effect we expect from regularization.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-13-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-13-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-13-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-13-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train(lambd=3)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 0.38249034


.. figure:: output_weight-decay_ec9cc0_75_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-13-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train(lambd=3)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 0.3709787428379059


.. figure:: output_weight-decay_ec9cc0_78_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-13-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train(lambd=3)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 0.49780998


.. figure:: output_weight-decay_ec9cc0_81_1.svg


.. raw:: html

     </div>


.. raw:: html

     </div>

Concise Implementation
----------------------

Because weight decay is ubiquitous in neural network optimization, the
deep learning framework makes it especially convenient, integrating
weight decay into the optimization algorithm itself for easy use in
combination with any loss function. Moreover, this integration serves a
computational benefit, allowing implementation tricks to add weight
decay to the algorithm, without any additional computational overhead.
Since the weight decay portion of the update depends only on the current
value of each parameter, the optimizer must touch each parameter once
anyway.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar text"><a href="#mxnet-15-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-15-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-15-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-15-0">

In the following code, we specify the weight decay hyperparameter
directly through ``wd`` when instantiating our ``Trainer``. By default,
Gluon decays both weights and biases simultaneously. Note that the
hyperparameter ``wd`` will be multiplied by ``wd_mult`` when updating
model parameters. Thus, if we set ``wd_mult`` to zero, the bias
parameter :math:`b` will not decay.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def train_concise(wd):
        net = nn.Sequential()
        net.add(nn.Dense(1))
        net.initialize(init.Normal(sigma=1))
        loss = gluon.loss.L2Loss()
        num_epochs, lr = 100, 0.003
        trainer = gluon.Trainer(net.collect_params(), 'sgd',
                                {'learning_rate': lr, 'wd': wd})
        # The bias parameter has not decayed. Bias names generally end with "bias"
        net.collect_params('.*bias').setattr('wd_mult', 0)
        animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                                xlim=[5, num_epochs], legend=['train', 'test'])
        for epoch in range(num_epochs):
            for X, y in train_iter:
                with autograd.record():
                    l = loss(net(X), y)
                l.backward()
                trainer.step(batch_size)
            if (epoch + 1) % 5 == 0:
                animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                         d2l.evaluate_loss(net, test_iter, loss)))
        print('L2 norm of w:', np.linalg.norm(net[0].weight.data()))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-15-1">

In the following code, we specify the weight decay hyperparameter
directly through ``weight_decay`` when instantiating our optimizer. By
default, PyTorch decays both weights and biases simultaneously. Here we
only set ``weight_decay`` for the weight, so the bias parameter
:math:`b` will not decay.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def train_concise(wd):
        net = nn.Sequential(nn.Linear(num_inputs, 1))
        for param in net.parameters():
            param.data.normal_()
        loss = nn.MSELoss(reduction='none')
        num_epochs, lr = 100, 0.003
        # The bias parameter has not decayed
        trainer = torch.optim.SGD([
            {"params":net[0].weight,'weight_decay': wd},
            {"params":net[0].bias}], lr=lr)
        animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                                xlim=[5, num_epochs], legend=['train', 'test'])
        for epoch in range(num_epochs):
            for X, y in train_iter:
                trainer.zero_grad()
                l = loss(net(X), y)
                l.mean().backward()
                trainer.step()
            if (epoch + 1) % 5 == 0:
                animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                         d2l.evaluate_loss(net, test_iter, loss)))
        print('L2 norm of w:', net[0].weight.norm().item())


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-15-2">

In the following code, we create an :math:`L_2` regularizer with the
weight decay hyperparameter ``wd`` and apply it to the layer through the
``kernel_regularizer`` argument.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def train_concise(wd):
        net = tf.keras.models.Sequential()
        net.add(tf.keras.layers.Dense(
            1, kernel_regularizer=tf.keras.regularizers.l2(wd)))
        net.build(input_shape=(1, num_inputs))
        w, b = net.trainable_variables
        loss = tf.keras.losses.MeanSquaredError()
        num_epochs, lr = 100, 0.003
        trainer = tf.keras.optimizers.SGD(learning_rate=lr)
        animator = d2l.Animator(xlabel='epochs', ylabel='loss', yscale='log',
                                xlim=[5, num_epochs], legend=['train', 'test'])
        for epoch in range(num_epochs):
            for X, y in train_iter:
                with tf.GradientTape() as tape:
                    # `tf.keras` requires retrieving and adding the losses from
                    # layers manually for custom training loop.
                    l = loss(net(X), y) + net.losses
                grads = tape.gradient(l, net.trainable_variables)
                trainer.apply_gradients(zip(grads, net.trainable_variables))
            if (epoch + 1) % 5 == 0:
                animator.add(epoch + 1, (d2l.evaluate_loss(net, train_iter, loss),
                                         d2l.evaluate_loss(net, test_iter, loss)))
        print('L2 norm of w:', tf.norm(net.get_weights()[0]).numpy())


.. raw:: html

     </div>


.. raw:: html

     </div>

The plots look identical to those when we implemented weight decay from
scratch. However, they run appreciably faster and are easier to
implement, a benefit that will become more pronounced for larger
problems.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-17-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-17-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-17-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-17-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train_concise(0)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 15.01407


.. figure:: output_weight-decay_ec9cc0_102_1.svg


.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train_concise(3)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 0.33991873


.. figure:: output_weight-decay_ec9cc0_103_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-17-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train_concise(0)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 13.135625839233398


.. figure:: output_weight-decay_ec9cc0_106_1.svg


.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train_concise(3)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 0.40683040022850037


.. figure:: output_weight-decay_ec9cc0_107_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-17-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train_concise(0)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 1.3400996


.. figure:: output_weight-decay_ec9cc0_110_1.svg


.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    train_concise(3)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    L2 norm of w: 0.033948638


.. figure:: output_weight-decay_ec9cc0_111_1.svg


.. raw:: html

     </div>


.. raw:: html

     </div>

So far, we only touched upon one notion of what constitutes a simple
linear function. Moreover, what constitutes a simple nonlinear function
can be an even more complex question. For instance, `reproducing kernel
Hilbert space
(RKHS) <https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space>`__
allows one to apply tools introduced for linear functions in a nonlinear
context. Unfortunately, RKHS-based algorithms tend to scale poorly to
large, high-dimensional data. In this book we will default to the simple
heuristic of applying weight decay on all layers of a deep network.

Summary
-------

-  Regularization is a common method for dealing with overfitting. It
   adds a penalty term to the loss function on the training set to
   reduce the complexity of the learned model.
-  One particular choice for keeping the model simple is weight decay
   using an :math:`L_2` penalty. This leads to weight decay in the
   update steps of the learning algorithm.
-  The weight decay functionality is provided in optimizers from deep
   learning frameworks.
-  Different sets of parameters can have different update behaviors
   within the same training loop.

Exercises
---------

1. Experiment with the value of :math:`\lambda` in the estimation
   problem in this section. Plot training and test accuracy as a
   function of :math:`\lambda`. What do you observe?
2. Use a validation set to find the optimal value of :math:`\lambda`. Is
   it really the optimal value? Does this matter?
3. What would the update equations look like if instead of
   :math:`\|\mathbf{w}\|^2` we used :math:`\sum_i |w_i|` as our penalty
   of choice (:math:`L_1` regularization)?
4. We know that :math:`\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}`.
   Can you find a similar equation for matrices (see the Frobenius norm
   in :numref:`subsec_lin-algebra-norms`)?
5. Review the relationship between training error and generalization
   error. In addition to weight decay, increased training, and the use
   of a model of suitable complexity, what other ways can you think of
   to deal with overfitting?
6. In Bayesian statistics we use the product of prior and likelihood to
   arrive at a posterior via
   :math:`P(w \mid x) \propto P(x \mid w) P(w)`. How can you identify
   :math:`P(w)` with regularization?


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar text"><a href="#mxnet-19-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-19-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-19-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-19-0">

`Discussions <https://discuss.d2l.ai/t/98>`__


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-19-1">

`Discussions <https://discuss.d2l.ai/t/99>`__


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-19-2">

`Discussions <https://discuss.d2l.ai/t/236>`__


.. raw:: html

     </div>


.. raw:: html

     </div>