11.10. Adam

In the discussions leading up to this section we encountered a number of techniques for efficient optimization. Let us recap them in detail here:

We saw that stochastic gradient descent (Section 11.4) is more effective than gradient descent when solving optimization problems, e.g., due to its inherent resilience to redundant data.
We saw that minibatch stochastic gradient descent (Section 11.5) affords significant additional efficiency arising from vectorization, using larger sets of observations in one minibatch. This is the key to efficient multi-machine, multi-GPU and overall parallel processing.
The momentum method (Section 11.6) added a mechanism for aggregating a history of past gradients to accelerate convergence.
Adagrad (Section 11.7) used per-coordinate scaling to allow for a computationally efficient preconditioner.
RMSProp (Section 11.8) decoupled per-coordinate scaling from a learning rate adjustment.

Adam combines all these techniques into one efficient learning algorithm. As expected, it has become rather popular as one of the more robust and effective optimization algorithms to use in deep learning. It is not without issues, though. In particular, there are situations where Adam can diverge due to poor variance control. A follow-up work proposed a hotfix to Adam, called Yogi, which addresses these issues. More on this later. For now let us review the Adam algorithm.

11.10.1. The Algorithm

One of the key components of Adam is that it uses exponentially weighted moving averages (also known as leaky averaging) to estimate both the momentum and the second moment of the gradient. That is, it uses the state variables

\[\begin{aligned} \mathbf{v}_t & \leftarrow \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t, \\ \mathbf{s}_t & \leftarrow \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2. \end{aligned} \tag{11.10.1}\]

Here \(\beta_1\) and \(\beta_2\) are nonnegative weighting parameters. Common choices for them are \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\). That is, the variance estimate moves much more slowly than the momentum term. Note that if we initialize \(\mathbf{v}_0 = \mathbf{s}_0 = 0\) we have a significant amount of bias initially towards smaller values: after \(t\) steps the weights on past gradients sum to \((1 - \beta) \sum_{i=0}^{t-1} \beta^i\) rather than to 1. This can be addressed by using the fact that \(\sum_{i=0}^{t-1} \beta^i = \frac{1 - \beta^t}{1 - \beta}\) to re-normalize terms. Correspondingly the normalized state variables are given by

\[\hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_1^t} \text{ and } \hat{\mathbf{s}}_t = \frac{\mathbf{s}_t}{1 - \beta_2^t}. \tag{11.10.2}\]
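To make the effect of the correction concrete, here is a minimal sketch in plain Python. The helper name `ewma_with_correction` and the constant gradient sequence are illustrative assumptions, not part of the book's code. With a constant gradient \(g\), the uncorrected average after \(t\) steps equals \((1 - \beta^t) g\), so dividing by \(1 - \beta^t\) should recover \(g\) at every step.

```python
def ewma_with_correction(grads, beta):
    """Return (raw, corrected) leaky averages after each step, starting from 0."""
    v = 0.0
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        v = beta * v + (1 - beta) * g          # state update as in (11.10.1)
        raw.append(v)
        corrected.append(v / (1 - beta ** t))  # bias correction as in (11.10.2)
    return raw, corrected


# Hypothetical input: the same gradient value 2.0 observed five times.
raw, corrected = ewma_with_correction([2.0] * 5, beta=0.9)
for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"t={t}: raw={r:.4f}, corrected={c:.4f}")
```

The raw estimates start near 0.2 and creep upwards, while the corrected ones stay at the true value throughout.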

Armed with the proper estimates we can now write out the update equations. First, we rescale the gradient in a manner very much akin to that of RMSProp to obtain

\[\mathbf{g}_t' = \frac{\eta \hat{\mathbf{v}}_t}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}. \tag{11.10.3}\]

Unlike RMSProp, our update uses the momentum \(\hat{\mathbf{v}}_t\) rather than the gradient itself. Moreover, there is a slight cosmetic difference, as the rescaling uses \(\frac{1}{\sqrt{\hat{\mathbf{s}}_t} + \epsilon}\) instead of \(\frac{1}{\sqrt{\hat{\mathbf{s}}_t + \epsilon}}\). The former arguably works slightly better in practice, hence the deviation from RMSProp. Typically we pick \(\epsilon = 10^{-6}\) for a good trade-off between numerical stability and fidelity.
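The two placements of \(\epsilon\) differ mainly when the second-moment estimate is tiny. The following sketch compares them numerically; the function names `scale_outside` and `scale_inside` are made up for this illustration. It only shows how the two scaling factors differ and makes no claim about which trains better.

```python
import math


def scale_outside(s, eps=1e-6):
    """Adam-style factor: epsilon is added after taking the square root."""
    return 1.0 / (math.sqrt(s) + eps)


def scale_inside(s, eps=1e-6):
    """RMSProp-style factor: epsilon is added under the square root."""
    return 1.0 / math.sqrt(s + eps)


# Hypothetical second-moment values, from exactly zero up to order one.
for s in (0.0, 1e-8, 1.0):
    print(f"s={s:g}: outside={scale_outside(s):.4g}, inside={scale_inside(s):.4g}")
```

At \(s = 0\) the first form is bounded by \(1/\epsilon\) and the second by \(1/\sqrt{\epsilon}\), so for the same \(\epsilon\) the Adam-style form permits much larger scaling. For \(s\) of order one the two are nearly identical.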

Now we have all the pieces in place to compute updates. This is slightly anticlimactic and we have a simple update of the form

\[\mathbf{x}_t \leftarrow \mathbf{x}_{t-1} - \mathbf{g}_t'. \tag{11.10.4}\]
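Putting (11.10.1) through (11.10.4) together, the whole algorithm fits in a few lines for a scalar parameter. The sketch below is an illustration only: the function `adam_minimize`, the objective \(f(x) = x^2\), and the choice \(\eta = 0.1\) are assumptions made here, not the book's implementation.

```python
import math


def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-6,
                  steps=200):
    """Run scalar Adam following (11.10.1)-(11.10.4); return all iterates."""
    x, v, s = x0, 0.0, 0.0
    xs = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        v = beta1 * v + (1 - beta1) * g          # momentum estimate
        s = beta2 * s + (1 - beta2) * g * g      # second-moment estimate
        v_hat = v / (1 - beta1 ** t)             # bias-corrected momentum
        s_hat = s / (1 - beta2 ** t)             # bias-corrected second moment
        x -= lr * v_hat / (math.sqrt(s_hat) + eps)
        xs.append(x)
    return xs


# Hypothetical objective f(x) = x^2 with gradient 2x, started at x = 1.
xs = adam_minimize(lambda x: 2 * x, x0=1.0)
print(f"after 1 step: {xs[1]:.6f}, after {len(xs) - 1} steps: {xs[-1]:.6f}")
```

Note that at \(t = 1\) the corrected estimates are \(\hat{v}_1 = g_1\) and \(\hat{s}_1 = g_1^2\), so the first step has magnitude close to \(\eta\) regardless of the scale of the gradient.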

Reviewing the design of Adam, its inspiration is clear. First, momentum and scale are clearly visible in the state variables. Their rather peculiar definition forces us to debias terms (this could be fixed by a slightly different initialization and update condition). Second, the combination of both terms is pretty straightforward, given RMSProp. Last, the explicit learning rate \(\eta\) allows us to control the step length to address issues of convergence.

11.10.2. Implementation

Implementing Adam from scratch is not very daunting. For convenience we store the time step counter \(t\) in the hyperparams dictionary. Beyond that all is straightforward.

mxnet pytorch tensorflow

%matplotlib inline
from mxnet import np, npx
from d2l import mxnet as d2l

npx.set_np()

def init_adam_states(feature_dim):
    v_w, v_b = np.zeros((feature_dim, 1)), np.zeros(1)
    s_w, s_b = np.zeros((feature_dim, 1)), np.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        v[:] = beta1 * v + (1 - beta1) * p.grad
        s[:] = beta2 * s + (1 - beta2) * np.square(p.grad)
        v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
        s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
        p[:] -= hyperparams['lr'] * v_bias_corr / (np.sqrt(s_bias_corr) + eps)
    hyperparams['t'] += 1

%matplotlib inline
import torch
from d2l import torch as d2l


def init_adam_states(feature_dim):
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr)
                                                       + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1

%matplotlib inline
import tensorflow as tf
from d2l import tensorflow as d2l


def init_adam_states(feature_dim):
    v_w = tf.Variable(tf.zeros((feature_dim, 1)))
    v_b = tf.Variable(tf.zeros(1))
    s_w = tf.Variable(tf.zeros((feature_dim, 1)))
    s_b = tf.Variable(tf.zeros(1))
    return ((v_w, s_w), (v_b, s_b))

def adam(params, grads, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s), grad in zip(params, states, grads):
        v[:].assign(beta1 * v  + (1 - beta1) * grad)
        s[:].assign(beta2 * s + (1 - beta2) * tf.math.square(grad))
        v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
        s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
        # eps belongs inside the denominator, matching (11.10.3)
        p[:].assign(p - hyperparams['lr'] * v_bias_corr
                    / (tf.math.sqrt(s_bias_corr) + eps))
    hyperparams['t'] += 1

We are ready to use Adam to train the model. We use a learning rate of \(\eta = 0.01\).

mxnet pytorch tensorflow

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);

loss: 0.243, 0.418 sec/epoch

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);

loss: 0.244, 0.017 sec/epoch

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);

loss: 0.243, 0.153 sec/epoch

A more concise implementation is straightforward since Adam is one of the algorithms provided by the optimizer library of each framework (the Gluon trainer, torch.optim, and tf.keras.optimizers). Hence we only need to pass configuration parameters.

mxnet pytorch tensorflow

d2l.train_concise_ch11('adam', {'learning_rate': 0.01}, data_iter)

loss: 0.249, 0.113 sec/epoch

trainer = torch.optim.Adam
d2l.train_concise_ch11(trainer, {'lr': 0.01}, data_iter)

loss: 0.244, 0.015 sec/epoch

trainer = tf.keras.optimizers.Adam
d2l.train_concise_ch11(trainer, {'learning_rate': 0.01}, data_iter)

loss: 0.245, 0.120 sec/epoch

11.10.3. Yogi

One of the problems of Adam is that it can fail to converge even in convex settings when the second moment estimate in \(\mathbf{s}_t\) blows up. As a fix, the authors of Yogi proposed a refined update (and initialization) for \(\mathbf{s}_t\). To understand what is going on, let us rewrite the Adam update as follows:

\[\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2) \left(\mathbf{g}_t^2 - \mathbf{s}_{t-1}\right). \tag{11.10.5}\]

Whenever \(\mathbf{g}_t^2\) has high variance or updates are sparse, \(\mathbf{s}_t\) might forget past values too quickly. A possible fix for this is to replace \(\mathbf{g}_t^2 - \mathbf{s}_{t-1}\) by \(\mathbf{g}_t^2 \odot \mathop{\mathrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1})\). Now the magnitude of the update no longer depends on the amount of deviation. This yields the Yogi updates

\[\mathbf{s}_t \leftarrow \mathbf{s}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 \odot \mathop{\mathrm{sgn}}(\mathbf{g}_t^2 - \mathbf{s}_{t-1}). \tag{11.10.6}\]
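The difference between (11.10.5) and (11.10.6) is easiest to see on a scalar. The sketch below is an illustration under assumed values: the function names and the scenario (an estimate of 1.0 followed by 1000 gradients of size 0.01) are hypothetical. It tracks how quickly each rule lets \(s\) drop once gradients become small.

```python
def adam_second_moment(s, g, beta2=0.999):
    """Adam rule (11.10.5): close a fraction (1 - beta2) of the gap to g^2."""
    return s + (1 - beta2) * (g * g - s)


def yogi_second_moment(s, g, beta2=0.999):
    """Yogi rule (11.10.6): move by (1 - beta2) * g^2 in the direction of the gap."""
    diff = g * g - s
    sign = (diff > 0) - (diff < 0)  # sgn(diff) as -1, 0 or 1
    return s + (1 - beta2) * g * g * sign


# Hypothetical scenario: a large accumulated estimate, then tiny gradients.
s_adam = s_yogi = 1.0
for _ in range(1000):
    s_adam = adam_second_moment(s_adam, 0.01)
    s_yogi = yogi_second_moment(s_yogi, 0.01)
print(f"Adam: {s_adam:.4f}, Yogi: {s_yogi:.4f}")
```

Under Adam the estimate decays geometrically towards the new \(g^2\), whereas under Yogi each step changes it by at most \((1 - \beta_2) g^2\), so the past value is retained for much longer.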

The authors furthermore advise initializing the momentum on a larger initial batch rather than on just an initial pointwise estimate. We omit the details since they are not material to the discussion and since convergence remains pretty good even without this.

mxnet pytorch tensorflow

def yogi(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-3
    for p, (v, s) in zip(params, states):
        v[:] = beta1 * v + (1 - beta1) * p.grad
        s[:] = s + (1 - beta2) * np.sign(
            np.square(p.grad) - s) * np.square(p.grad)
        v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
        s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
        p[:] -= hyperparams['lr'] * v_bias_corr / (np.sqrt(s_bias_corr) + eps)
    hyperparams['t'] += 1

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(yogi, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);

loss: 0.246, 0.239 sec/epoch

def yogi(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-3
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = s + (1 - beta2) * torch.sign(
                torch.square(p.grad) - s) * torch.square(p.grad)
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr)
                                                       + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(yogi, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);

loss: 0.245, 0.017 sec/epoch

def yogi(params, grads, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s), grad in zip(params, states, grads):
        v[:].assign(beta1 * v  + (1 - beta1) * grad)
        s[:].assign(s + (1 - beta2) * tf.math.sign(
                   tf.math.square(grad) - s) * tf.math.square(grad))
        v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
        s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
        # eps belongs inside the denominator, matching (11.10.3)
        p[:].assign(p - hyperparams['lr'] * v_bias_corr
                    / (tf.math.sqrt(s_bias_corr) + eps))
    hyperparams['t'] += 1

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(yogi, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);

loss: 0.242, 0.156 sec/epoch

11.10.4. Summary

Adam combines features of many optimization algorithms into a fairly robust update rule.
Created on the basis of RMSProp, Adam also uses EWMA on the minibatch stochastic gradient.
Adam uses bias correction to adjust for a slow startup when estimating momentum and a second moment.
For gradients with significant variance we may encounter issues with convergence. They can be mitigated by using larger minibatches or by switching to an improved estimate for \(\mathbf{s}_t\). Yogi offers such an alternative.

11.10.5. Exercises

Adjust the learning rate and observe and analyze the experimental results.
Can you rewrite the momentum and second moment updates such that they do not require bias correction?
Why do you need to reduce the learning rate \(\eta\) as we converge?
Try to construct a case for which Adam diverges and Yogi converges.
