.. _chapter_gd:
Gradient Descent
================
In this section we introduce the basic concepts underlying gradient
descent. The treatment is brief by necessity; see e.g.
:cite:`Boyd.Vandenberghe.2004` for an in-depth introduction to convex
optimization. Although gradient descent is rarely used directly in deep
learning, understanding it is key to understanding stochastic gradient
descent algorithms. For instance, the optimization might diverge due to
an overly large learning rate. This phenomenon can already be seen in
gradient descent. Likewise, preconditioning is a common technique in
gradient descent and carries over to more advanced algorithms. Let’s
start with a simple special case.
Gradient Descent in One Dimension
---------------------------------
Gradient descent in one dimension is an excellent example to explain why
the gradient descent algorithm may reduce the value of the objective
function. Consider some continuously differentiable real-valued function
:math:`f: \mathbb{R} \rightarrow \mathbb{R}`. Using a Taylor expansion
(:numref:`chapter_math`) we obtain that
.. math:: f(x + \epsilon) = f(x) + \epsilon f'(x) + O(\epsilon^2).
:label: gd-taylor
That is, in first approximation :math:`f(x+\epsilon)` is given by the
function value :math:`f(x)` and the first derivative :math:`f'(x)` at
:math:`x`. It is not unreasonable to assume that for small
:math:`\epsilon` moving in the direction of the negative gradient will
decrease :math:`f`. To keep things simple we pick a fixed step size
:math:`\eta > 0` and choose :math:`\epsilon = -\eta f'(x)`. Plugging
this into the Taylor expansion above we get
.. math:: f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + O(\eta^2 f'^2(x)).
If the derivative does not vanish, i.e. :math:`f'(x) \neq 0`, we make
progress since :math:`\eta f'^2(x)>0`. Moreover, we can always choose
:math:`\eta` small enough for the higher order terms to become
irrelevant. Hence we arrive at
.. math:: f(x - \eta f'(x)) \lessapprox f(x).
This means that, if we use
.. math:: x \leftarrow x - \eta f'(x)
to iterate :math:`x`, the value of the function :math:`f(x)` might decline.
Therefore, in gradient descent we first choose an initial value
:math:`x` and a constant :math:`\eta > 0` and then use them to
repeatedly update :math:`x` until a stopping condition is reached, for
example, when the magnitude of the gradient :math:`|f'(x)|` is small
enough or the number of iterations has reached a certain value.
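
The two stopping conditions just described can be sketched as follows. This is a minimal illustration rather than the implementation used in this section: the function name ``gradient_descent`` and the parameters ``tol`` and ``max_iter`` are our own choices, and the example derivative :math:`2x` belongs to the quadratic :math:`f(x) = x^2`.

```python
def gradient_descent(gradf, x0, eta, tol=1e-6, max_iter=1000):
    """Iterate x <- x - eta * f'(x) until |f'(x)| <= tol or max_iter updates.

    The names tol and max_iter are illustrative assumptions.
    Returns the final x and the number of updates performed.
    """
    x = x0
    for k in range(max_iter):
        g = gradf(x)
        if abs(g) <= tol:      # stop: gradient magnitude is small enough
            return x, k
        x -= eta * g           # gradient descent update
    return x, max_iter         # stop: iteration budget exhausted

# Example on f(x) = x**2, whose derivative is 2 * x
x_final, steps = gradient_descent(lambda x: 2 * x, x0=10.0, eta=0.2)
print('stopped after %d updates at x = %.2e' % (steps, x_final))
```

Returning the number of updates alongside :math:`x` makes it easy to tell which of the two conditions ended the loop.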
For simplicity we choose the objective function :math:`f(x)=x^2` to
illustrate how to implement gradient descent. Although we know that
:math:`x=0` is the solution to minimize :math:`f(x)`, we still use this
simple function to observe how :math:`x` changes. As always, we begin by
importing all required modules.
.. code:: python
%matplotlib inline
import d2l
import numpy as np
import math
def f(x): return x**2 # objective function
def gradf(x): return 2 * x # its derivative
Next, we use :math:`x=10` as the initial value and assume
:math:`\eta=0.2`. Iterating :math:`x` 10 times with gradient descent,
we can see that the value of :math:`x` eventually approaches the
optimal solution.
.. code:: python

    def gd(eta):
        x = 10
        results = [x]
        for i in range(10):
            x -= eta * gradf(x)
            results.append(x)
        print('epoch 10, x:', x)
        return results

    res = gd(0.2)
.. parsed-literal::
:class: output
epoch 10, x: 0.06046617599999997
The progress of optimizing over :math:`x` can be plotted as follows.
.. code:: python

    def show_trace(res):
        n = max(abs(min(res)), abs(max(res)))
        f_line = np.arange(-n, n, 0.01)
        d2l.set_figsize((3.5, 2.5))
        d2l.plot([f_line, res], [[f(x) for x in f_line], [f(x) for x in res]],
                 'x', 'f(x)', fmts=['-', '-o'])

    show_trace(res)
.. figure:: output_gd_9cd2d2_5_0.svg
.. _section_gd-learningrate:
Learning Rate
~~~~~~~~~~~~~
The learning rate :math:`\eta` can be set by the algorithm designer. If
we use a learning rate that is too small, it will cause :math:`x` to
update very slowly, requiring more iterations to get a better solution.
To show what happens in such a case, consider the progress in the same
optimization problem for :math:`\eta = 0.05`. As we can see, even after
10 steps we are still very far from the optimal solution.
.. code:: python
show_trace(gd(0.05))
.. parsed-literal::
:class: output
epoch 10, x: 3.4867844009999995
.. figure:: output_gd_9cd2d2_7_1.svg
Conversely, if we use an excessively high learning rate,
:math:`\left|\eta f'(x)\right|` might be too large for the first-order
Taylor expansion to be accurate. That is, the remainder term
:math:`O(\eta^2 f'^2(x))`, obtained from :eq:`gd-taylor` with
:math:`\epsilon = -\eta f'(x)`, might become significant. In this case,
we cannot guarantee that the iteration of :math:`x` will lower the
value of :math:`f(x)`. For example, when we set the learning rate to
:math:`\eta=1.1`, :math:`x` overshoots the optimal solution :math:`x=0`
and gradually diverges.
.. code:: python
show_trace(gd(1.1))
.. parsed-literal::
:class: output
epoch 10, x: 61.917364224000096
.. figure:: output_gd_9cd2d2_9_1.svg
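
For :math:`f(x) = x^2` the three runs can be explained in closed form: the update :math:`x \leftarrow x - 2 \eta x` multiplies :math:`x` by the factor :math:`1 - 2\eta` at every step, so the iterates shrink exactly when :math:`|1 - 2\eta| < 1`, that is for :math:`0 < \eta < 1`. The short sketch below (the helper name ``closed_form`` is ours) evaluates :math:`(1 - 2\eta)^{10} \cdot 10` for the three learning rates used above and should agree with the printed results up to rounding.

```python
def closed_form(eta, x0=10, steps=10):
    """x after `steps` updates on f(x) = x**2, using x <- (1 - 2 * eta) * x."""
    return (1 - 2 * eta) ** steps * x0

for eta in [0.2, 0.05, 1.1]:
    factor = 1 - 2 * eta          # per-step multiplier of x
    print('eta %.2f: factor %+.1f, x after 10 steps %.6f, converges: %s'
          % (eta, factor, closed_form(eta), abs(factor) < 1))
```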
Local Minima
~~~~~~~~~~~~
To illustrate what happens for nonconvex functions consider the case of
:math:`f(x) = x \cdot \cos(c x)` for some constant :math:`c`. This
function has infinitely many local minima. Depending on our choice of
learning rate and depending on how well conditioned the problem is, we
may end up with one of many solutions. The example below illustrates how
an (unrealistically) high learning rate will lead to a poor local minimum.
.. code:: python
c = 0.15 * math.pi
def f(x): return x*math.cos(c * x)
def gradf(x): return math.cos(c * x) - c * x * math.sin(c * x)
show_trace(gd(2))
.. parsed-literal::
:class: output
epoch 10, x: -1.528165927635083
.. figure:: output_gd_9cd2d2_11_1.svg
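
To see that the outcome really depends on the learning rate, one can rerun the same problem with a smaller step size. The sketch below is self-contained and does not use the plotting helpers; the smaller rate :math:`\eta = 0.1` and the 200 iterations are our own choices for illustration. From the same starting point :math:`x = 10`, the small learning rate settles in a nearby local minimum with a much lower objective value than the run with :math:`\eta = 2`.

```python
import math

c = 0.15 * math.pi

def f(x):
    return x * math.cos(c * x)                        # nonconvex objective

def gradf(x):
    return math.cos(c * x) - c * x * math.sin(c * x)  # its derivative

def run_gd(eta, steps, x=10.0):
    """Plain gradient descent from x, returning only the final iterate."""
    for _ in range(steps):
        x -= eta * gradf(x)
    return x

x_large = run_gd(eta=2, steps=10)     # the setting used in the text
x_small = run_gd(eta=0.1, steps=200)  # smaller step size, more steps (our choice)
print('eta 2.0: x = %.4f, f(x) = %.4f' % (x_large, f(x_large)))
print('eta 0.1: x = %.4f, f(x) = %.4f' % (x_small, f(x_small)))
```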
Multivariate Gradient Descent
-----------------------------
Now that we have a better intuition of the univariate case, let us consider
the situation where :math:`\mathbf{x} \in \mathbb{R}^d`. That is, the
objective function :math:`f: \mathbb{R}^d \to \mathbb{R}` maps vectors
into scalars. Correspondingly its gradient is multivariate, too. It is a
vector consisting of :math:`d` partial derivatives:
.. math:: \nabla f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\bigg]^\top.
Each partial derivative element
:math:`\partial f(\mathbf{x})/\partial x_i` in the gradient indicates
the rate of change of :math:`f` at :math:`\mathbf{x}` with respect to
the input :math:`x_i`. As before in the univariate case we can use the
corresponding Taylor approximation for multivariate functions to get
some idea of what we should do. In particular, we have that
.. math:: f(\mathbf{x} + \mathbf{\epsilon}) = f(\mathbf{x}) + \mathbf{\epsilon}^\top \nabla f(\mathbf{x}) + O(\|\mathbf{\epsilon}\|^2).
:label: gd-multi-taylor
In other words, up to second order terms in :math:`\mathbf{\epsilon}` the
direction of steepest descent is given by the negative gradient
:math:`-\nabla f(\mathbf{x})`. Choosing a suitable learning rate
:math:`\eta > 0` yields the prototypical gradient descent algorithm:
:math:`\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}).`
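
The claim about the steepest descent direction can be checked by brute force. The sketch below assumes a hypothetical gradient vector :math:`[3, -4]^\top` (not taken from this section), evaluates the first-order change :math:`\mathbf{d}^\top \nabla f(\mathbf{x})` for unit vectors :math:`\mathbf{d}` at one-degree spacing, and compares the best sampled direction with the normalized negative gradient.

```python
import math

g = (3.0, -4.0)               # hypothetical gradient vector at some point x
norm = math.hypot(g[0], g[1])

def directional_derivative(angle_deg):
    """First-order change d^T g for the unit vector d at the given angle."""
    a = math.radians(angle_deg)
    return math.cos(a) * g[0] + math.sin(a) * g[1]

# Try unit directions at 1-degree spacing and keep the one with the
# most negative first-order change
best_angle = min(range(360), key=directional_derivative)
best = (math.cos(math.radians(best_angle)), math.sin(math.radians(best_angle)))
print('best sampled direction: (%.3f, %.3f)' % best)
print('negative unit gradient: (%.3f, %.3f)' % (-g[0] / norm, -g[1] / norm))
```

Up to the one-degree sampling resolution both directions coincide, and the first-order decrease along them equals the gradient norm.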
To see how the algorithm behaves in practice let’s construct an
objective function :math:`f(\mathbf{x})=x_1^2+2x_2^2` with a
two-dimensional vector :math:`\mathbf{x} = [x_1, x_2]^\top` as input and
a scalar as output. The gradient is given by
:math:`\nabla f(\mathbf{x}) = [2x_1, 4x_2]^\top`. We will observe the
trajectory of :math:`\mathbf{x}` by gradient descent from the initial
position :math:`[-5,-2]`. We need two more helper functions. The first
uses an update function and applies it :math:`20` times to the initial
value. The second helper visualizes the trajectory of
:math:`\mathbf{x}`.
.. code:: python

    # Save to the d2l package.
    def train_2d(trainer, steps=20):
        """Optimize a 2-dim objective function with a customized trainer."""
        # s1 and s2 are internal state variables and will
        # be used later in the chapter
        x1, x2, s1, s2 = -5, -2, 0, 0
        results = [(x1, x2)]
        for i in range(steps):
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
            results.append((x1, x2))
        print('epoch %d, x1 %f, x2 %f' % (i + 1, x1, x2))
        return results

    # Save to the d2l package.
    def show_trace_2d(f, results):
        """Show the trace of 2D variables during optimization."""
        d2l.set_figsize((3.5, 2.5))
        d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
        x1, x2 = np.meshgrid(np.arange(-5.5, 1.0, 0.1), np.arange(-3.0, 1.0, 0.1))
        d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
        d2l.plt.xlabel('x1')
        d2l.plt.ylabel('x2')
Next, we observe the trajectory of the optimization variable
:math:`\mathbf{x}` for learning rate :math:`\eta = 0.1`. We can see that
after 20 steps the value of :math:`\mathbf{x}` approaches its minimum at
:math:`[0, 0]`. Progress is fairly well-behaved albeit rather slow.
.. code:: python

    def f(x1, x2): return x1 ** 2 + 2 * x2 ** 2   # objective
    def gradf(x1, x2): return (2 * x1, 4 * x2)    # gradient

    def gd(x1, x2, s1, s2):
        (g1, g2) = gradf(x1, x2)                       # compute gradient
        return (x1 - eta * g1, x2 - eta * g2, 0, 0)    # update variables

    eta = 0.1
    show_trace_2d(f, train_2d(gd))
.. parsed-literal::
:class: output
epoch 20, x1 -0.057646, x2 -0.000073
.. figure:: output_gd_9cd2d2_15_1.svg
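
Because this objective is a sum of two independent quadratics, the trajectory can also be written in closed form: each step multiplies :math:`x_1` by :math:`1 - 2\eta` and :math:`x_2` by :math:`1 - 4\eta`. The sketch below (the helper name ``closed_form_2d`` is ours) reproduces the numbers printed above without plotting. It also shows why progress is slow: with :math:`\eta = 0.1` the first coordinate only shrinks by a factor of 0.8 per step, while stability of the second coordinate requires :math:`\eta < 0.5`.

```python
def closed_form_2d(eta, steps=20, x1=-5.0, x2=-2.0):
    """Iterate of gradient descent on f = x1**2 + 2 * x2**2 in closed form."""
    return (1 - 2 * eta) ** steps * x1, (1 - 4 * eta) ** steps * x2

x1, x2 = closed_form_2d(0.1)
print('epoch 20, x1 %f, x2 %f' % (x1, x2))
```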
Adaptive Methods
----------------
As we saw in :numref:`section_gd-learningrate`, getting the
learning rate :math:`\eta` ‘just right’ is tricky. If we pick it too
small, we make little progress. If we pick it too large, the solution
oscillates and in the worst case it might even diverge. What if we could
determine :math:`\eta` automatically or get rid of having to select a
step size at all? Second order methods that look not only at the value
and gradient of the objective but also at its *curvature* can help in
this case. While these methods cannot be applied to deep learning
directly due to the computational cost, they provide useful intuition
into how to design advanced optimization algorithms that mimic many of
the desirable properties of the algorithms outlined below.
Newton’s Method
~~~~~~~~~~~~~~~
Reviewing the Taylor expansion of :math:`f`, there’s no need to stop
after the first-order term. In fact, we can write it as
.. math:: f(\mathbf{x} + \mathbf{\epsilon}) = f(\mathbf{x}) + \mathbf{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \mathbf{\epsilon}^\top \nabla \nabla^\top f(\mathbf{x}) \mathbf{\epsilon} + O(\|\mathbf{\epsilon}\|^3)
:label: gd-hot-taylor
To avoid cumbersome notation we define
:math:`H_f := \nabla \nabla^\top f(\mathbf{x})` to be the *Hessian* of
:math:`f`. This is a :math:`d \times d` matrix. For small :math:`d` and
simple problems :math:`H_f` is easy to compute. For deep networks, on
the other hand, :math:`H_f` may be prohibitively large, due to the cost
of storing :math:`O(d^2)` entries. Furthermore it may be too expensive
to compute via backprop as we would need to apply backprop to the
backpropagation call graph. For now let us ignore such considerations
and look at what algorithm we’d get.
After all, the minimum of :math:`f` satisfies
:math:`\nabla f(\mathbf{x}) = 0`. Taking derivatives of
:eq:`gd-hot-taylor` with regard to :math:`\mathbf{\epsilon}` and
ignoring higher order terms we arrive at
.. math::
\nabla f(\mathbf{x}) + H_f \mathbf{\epsilon} = 0 \text{ and hence }
\mathbf{\epsilon} = -H_f^{-1} \nabla f(\mathbf{x}).
That is, we need to invert the Hessian :math:`H_f` as part of the
optimization problem.
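
As a small multivariate illustration of the step :math:`\mathbf{\epsilon} = -H_f^{-1} \nabla f(\mathbf{x})`, consider the hypothetical quadratic :math:`f(\mathbf{x}) = x_1^2 + x_1 x_2 + 2 x_2^2` (our choice, not an example from this section). Its Hessian is constant, so the second-order Taylor expansion is exact and a single Newton step lands on the minimum at the origin. The sketch inverts the :math:`2 \times 2` Hessian explicitly; for larger :math:`d` one would solve the linear system instead.

```python
def gradf(x1, x2):
    return (2 * x1 + x2, x1 + 4 * x2)   # gradient of x1**2 + x1*x2 + 2*x2**2

HESSIAN = ((2.0, 1.0), (1.0, 4.0))      # constant Hessian of this quadratic

def newton_step(x1, x2):
    """Return x + eps with eps = -H^{-1} grad f(x), for a 2x2 Hessian."""
    (a, b), (c, d) = HESSIAN
    g1, g2 = gradf(x1, x2)
    det = a * d - b * c
    # Solve H eps = -g using the explicit inverse of a 2x2 matrix
    eps1 = -(d * g1 - b * g2) / det
    eps2 = -(-c * g1 + a * g2) / det
    return x1 + eps1, x2 + eps2

print(newton_step(-5.0, -2.0))
```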
For :math:`f(x) = \frac{1}{2} x^2` we have :math:`\nabla f(x) = x` and
:math:`H_f = 1`. Hence for any :math:`x` we obtain
:math:`\epsilon = -x`. In other words, a single step is sufficient to
converge perfectly without the need for any adjustment! Alas, we got a
bit lucky here since the Taylor expansion was exact. Let’s see what
happens in other problems.
.. code:: python

    c = 0.5
    def f(x): return math.cosh(c * x)             # objective
    def gradf(x): return c * math.sinh(c * x)     # derivative
    def hessf(x): return c**2 * math.cosh(c * x)  # hessian

    # hide learning rate for now
    def newton(eta=1):
        x = 10
        results = [x]
        for i in range(10):
            x -= eta * gradf(x) / hessf(x)
            results.append(x)
        print('epoch 10, x:', x)
        return results

    show_trace(newton())
.. parsed-literal::
:class: output
epoch 10, x: 0.0
.. figure:: output_gd_9cd2d2_17_1.svg
Now let’s see what happens when we have a *nonconvex* function, such as
:math:`f(x) = x \cos(c x)`. After all, note that in Newton’s method we
end up dividing by the Hessian. This means that if the second derivative
is *negative* we would walk into the direction of *increasing*
:math:`f`. That is a fatal flaw of the algorithm. Let’s see what happens
in practice.
.. code:: python
c = 0.15 * math.pi
def f(x): return x*math.cos(c * x)
def gradf(x): return math.cos(c * x) - c * x * math.sin(c * x)
def hessf(x): return - 2 * c * math.sin(c * x) - x * c**2 * math.cos(c * x)
show_trace(newton())
.. parsed-literal::
:class: output
epoch 10, x: 26.83413291324767
.. figure:: output_gd_9cd2d2_19_1.svg
This went spectacularly wrong. How can we fix it? One way would be to
‘fix’ the Hessian by taking its absolute value instead. Another strategy
is to bring back the learning rate. This seems to defeat the purpose,
but not quite. Having second order information allows us to be cautious
whenever the curvature is large and to take longer steps whenever the
objective is flat. Let’s see how this works with a slightly smaller
learning rate, say :math:`\eta = 0.5`. As we can see, we have quite an
efficient algorithm.
.. code:: python
show_trace(newton(0.5))
.. parsed-literal::
:class: output
epoch 10, x: 7.269860168684531
.. figure:: output_gd_9cd2d2_21_1.svg
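
The other fix mentioned above, dividing by the absolute value of the second derivative, can be sketched in a few lines. The block is self-contained and the starting point :math:`x = 2`, where :math:`f''(x) < 0`, is our own choice. At such a point the unmodified Newton step increases :math:`f`, whereas the step with :math:`|f''(x)|` moves in a descent direction. Note that neither variant is defined where :math:`f''(x) = 0`.

```python
import math

c = 0.15 * math.pi

def f(x):
    return x * math.cos(c * x)

def gradf(x):
    return math.cos(c * x) - c * x * math.sin(c * x)

def hessf(x):
    return -2 * c * math.sin(c * x) - x * c**2 * math.cos(c * x)

def newton_abs_step(x, eta=1.0):
    """One Newton step that divides by |f''(x)| instead of f''(x)."""
    return x - eta * gradf(x) / abs(hessf(x))

x0 = 2.0                              # a point where f''(x0) < 0 (our choice)
plain = x0 - gradf(x0) / hessf(x0)    # unmodified Newton step
fixed = newton_abs_step(x0)           # step using the absolute value
print('f(x0) = %.4f, after plain step: %.4f, after |H| step: %.4f'
      % (f(x0), f(plain), f(fixed)))
```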
Convergence Analysis
~~~~~~~~~~~~~~~~~~~~
We only analyze the convergence rate for convex and three times
differentiable :math:`f`, where at its minimum :math:`x^*` the second
derivative is nonzero, i.e. where :math:`f''(x^*) > 0`. The multivariate
proof is a straightforward extension of the argument below and omitted
since it doesn’t help us much in terms of intuition.
Denote by :math:`x_k` the value of :math:`x` at the :math:`k`-th
iteration and let :math:`e_k := x_k - x^*` be the distance from
optimality. By Taylor series expansion we have that the condition
:math:`f'(x^*) = 0` can be written as
.. math:: 0 = f'(x_k - e_k) = f'(x_k) - e_k f''(x_k) + \frac{1}{2} e_k^2 f'''(\xi_k).
This holds for some :math:`\xi_k` between :math:`x^* = x_k - e_k` and
:math:`x_k`. Recall that we have the update
:math:`x_{k+1} = x_k - f'(x_k) / f''(x_k)`. Dividing the above expansion
by :math:`f''(x_k)` yields

.. math:: e_k - f'(x_k) / f''(x_k) = \frac{1}{2} e_k^2 f'''(\xi_k) / f''(x_k).

Plugging in the update equation, the left-hand side equals
:math:`x_{k+1} - x^* = e_{k+1}`, so that
:math:`e_{k+1} = \frac{1}{2} e_k^2 f'''(\xi_k) / f''(x_k)`. Consequently,
whenever we are in a region of bounded
:math:`|f'''(\xi_k)| / (2 f''(x_k)) \leq c`, we have a quadratically
decreasing error :math:`|e_{k+1}| \leq c e_k^2`.
As an aside, optimization researchers call this *quadratic* convergence,
whereas a condition such as :math:`|e_{k+1}| \leq \alpha |e_k|` with
:math:`\alpha < 1` would be called *linear* convergence. Note that this
analysis comes with a number of caveats. First, we don’t really have much
of a guarantee when we will reach the region of rapid convergence.
Instead, we only know that once we reach it, convergence will be very
quick. Second, this
requires that :math:`f` is well-behaved up to higher order derivatives.
It comes down to ensuring that :math:`f` doesn’t have any ‘surprising’
properties in terms of how it might change its values.
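
The quadratic decrease of the error can be observed numerically. The sketch below uses the hypothetical convex function :math:`f(x) = e^x - x` (our choice), whose minimum is at :math:`x^* = 0` with :math:`f''(0) = f'''(0) = 1`, so the ratio :math:`e_{k+1} / e_k^2` should approach :math:`f'''(x^*) / (2 f''(x^*)) = 1/2`.

```python
import math

def newton_errors(x0=1.0, steps=4):
    """Errors |x_k - x*| of Newton's method on f(x) = exp(x) - x, x* = 0."""
    x, errors = x0, [abs(x0)]
    for _ in range(steps):
        x -= (math.exp(x) - 1) / math.exp(x)   # subtract f'(x) / f''(x)
        errors.append(abs(x))
    return errors

errors = newton_errors()
for k in range(len(errors) - 1):
    print('k = %d, e_k = %.3e, e_(k+1) / e_k^2 = %.3f'
          % (k, errors[k], errors[k + 1] / errors[k] ** 2))
```

Only a few steps are shown because the error soon reaches the limits of floating point precision, after which the ratio is no longer meaningful.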
Preconditioning
~~~~~~~~~~~~~~~
Quite unsurprisingly, computing and storing the full Hessian is very
expensive. It is thus desirable to find alternatives. One way to improve
matters is to avoid computing the Hessian in its entirety and to compute
only its *diagonal* entries. While this isn’t quite as good as the
full Newton method, it is still much better than not using curvature
information at all. Moreover, estimates for the main diagonal elements
are what drives some of the innovation in stochastic gradient descent
optimization algorithms. This leads to update algorithms of the form

.. math:: \mathbf{x} \leftarrow \mathbf{x} - \eta \mathrm{diag}(H_f)^{-1} \nabla f(\mathbf{x}).
To see why this might be a good idea consider a situation where one
variable denotes height in millimeters and the other one denotes height
in kilometers. Assuming that the natural scale for both is meters, we
have a terrible mismatch in parametrizations. Using preconditioning
removes this. Effectively, preconditioning with gradient descent amounts
to selecting a different learning rate for each coordinate.
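
A minimal sketch of this effect, assuming a hypothetical badly scaled objective :math:`f(\mathbf{x}) = \frac{1}{2}(1000 x_1^2 + 0.001 x_2^2)` whose Hessian is diagonal (the constants and function names are ours): plain gradient descent must keep :math:`\eta` below :math:`2/1000` to avoid divergence in :math:`x_1` and therefore barely moves :math:`x_2`, while the diagonally preconditioned update reaches the minimum in one step.

```python
# Hypothetical badly scaled objective: f(x) = 0.5 * (A1 * x1**2 + A2 * x2**2)
A1, A2 = 1000.0, 0.001

def gradf(x1, x2):
    return A1 * x1, A2 * x2

def gd_step(x1, x2, eta):
    g1, g2 = gradf(x1, x2)
    return x1 - eta * g1, x2 - eta * g2            # one rate for all coordinates

def preconditioned_step(x1, x2, eta):
    g1, g2 = gradf(x1, x2)
    return x1 - eta * g1 / A1, x2 - eta * g2 / A2  # divide by diag(H_f) = (A1, A2)

x = (1.0, 1.0)
for _ in range(100):
    x = gd_step(*x, eta=0.0015)   # eta must stay below 2 / A1 = 0.002
print('plain gradient descent, 100 steps:', x)
print('preconditioned, 1 step:', preconditioned_step(1.0, 1.0, eta=1.0))
```

Here the diagonal happens to be the entire Hessian, so the preconditioned update coincides with Newton’s method; for a general objective it is only an approximation.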
Gradient Descent with Line Search
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
One of the key problems in gradient descent is that we might overshoot
the goal or make insufficient progress. A simple fix for the problem is
to use line search in conjunction with gradient descent. That is, we use
the direction given by :math:`\nabla f(\mathbf{x})` and then perform
binary search as to which step length :math:`\eta` minimizes
:math:`f(\mathbf{x} - \eta \nabla f(\mathbf{x}))`.
This algorithm converges rapidly (for an analysis and proof see e.g.
:cite:`Boyd.Vandenberghe.2004`). However, for the purpose of deep
learning this isn’t quite so feasible, since each step of the line
search would require us to evaluate the objective function on the entire
dataset. This is way too costly to accomplish.
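
For a small problem where evaluating gradients is cheap, the idea can be sketched as follows. This is one possible variant, not a definitive implementation: it bisects on the sign of the derivative of :math:`\phi(\eta) = f(\mathbf{x} - \eta \nabla f(\mathbf{x}))`, and it assumes that :math:`f` is convex and that the minimizer of :math:`\phi` lies in the bracket :math:`[0, \eta_{\max}]`. The function name ``line_search`` and the parameters ``eta_max`` and ``iters`` are our own.

```python
def f(x1, x2):
    return x1 ** 2 + 2 * x2 ** 2        # objective from the multivariate example

def gradf(x1, x2):
    return (2 * x1, 4 * x2)

def line_search(x1, x2, eta_max=1.0, iters=50):
    """Bisection on phi'(eta), where phi(eta) = f(x - eta * grad f(x)).

    Assumes f is convex and phi'(eta_max) >= 0, so that the minimizer
    of phi lies in [0, eta_max].
    """
    g1, g2 = gradf(x1, x2)
    lo, hi = 0.0, eta_max
    for _ in range(iters):
        mid = (lo + hi) / 2
        h1, h2 = gradf(x1 - mid * g1, x2 - mid * g2)
        dphi = -(h1 * g1 + h2 * g2)     # derivative of phi at mid
        if dphi < 0:
            lo = mid                    # still descending: minimizer is to the right
        else:
            hi = mid                    # ascending: minimizer is to the left
    return (lo + hi) / 2

x1, x2 = -5.0, -2.0
for i in range(20):
    eta = line_search(x1, x2)
    g1, g2 = gradf(x1, x2)
    x1, x2 = x1 - eta * g1, x2 - eta * g2
print('epoch %d, x1 %e, x2 %e' % (i + 1, x1, x2))
```

Each call to ``line_search`` evaluates the gradient ``iters`` times, which illustrates why the approach becomes costly when every evaluation requires a pass over the entire dataset.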
Summary
-------
- Learning rates matter. Too large and we diverge, too small and we
don’t make progress.
- Gradient descent can get stuck in local minima.
- In high dimensions adjusting the learning rate is
complicated.
- Preconditioning can help with scale adjustment.
- Newton’s method is a lot faster *once* it has started working
properly in convex problems.
- Beware of using Newton’s method without any adjustments for nonconvex
problems.
Exercises
---------
1. Experiment with different learning rates and objective functions for
gradient descent.
2. Implement line search to minimize a convex function in the interval
:math:`[a, b]`.
- Do you need derivatives for binary search, i.e. to decide whether
to pick :math:`[a, (a+b)/2]` or :math:`[(a+b)/2, b]`?
- How rapid is the rate of convergence for the algorithm?
- Implement the algorithm and apply it to minimizing
:math:`\log (\exp(x) + \exp(-2x - 3))`.
3. Design an objective function defined on :math:`\mathbb{R}^2` where
gradient descent is exceedingly slow. Hint - scale different
coordinates differently.
4. Implement the lightweight version of Newton’s method using
preconditioning:
- Use diagonal Hessian as preconditioner.
- Use the absolute values of that rather than the actual (possibly
signed) values.
- Apply this to the problem above.
5. Apply the algorithm above to a number of objective functions (convex
or not). What happens if you rotate coordinates by :math:`45`
degrees?