Optimization and Deep Learning
==============================
In this section, we will discuss the relationship between optimization
and deep learning as well as the challenges of using optimization in
deep learning. For a deep learning problem, we will usually define a
loss function first. Once we have the loss function, we can use an
optimization algorithm in attempt to minimize the loss. In optimization,
a loss function is often referred to as the objective function of the
optimization problem. By tradition and convention most optimization
algorithms are concerned with *minimization*. If we ever need to
maximize an objective there’s a simple solution - just flip the sign on
the objective.
Optimization and Estimation
---------------------------
Although optimization provides a way to minimize the loss function for
deep learning, in essence, the goals of optimization and deep learning
are fundamentally different. The former is primarily concerned with
minimizing an objective whereas the latter is concerned with finding a
suitable model, given a finite amount of data. In
:numref:`chapter_model_selection`, we discussed the difference between
these two goals in detail. For instance, training error and
generalization error generally differ: since the objective function of
the optimization algorithm is usually a loss function based on the
training data set, the goal of optimization is to reduce the training
error. However, the goal of statistical inference (and thus of deep
learning) is to reduce the generalization error. To accomplish the
latter we need to pay attention to overfitting in addition to using the
optimization algorithm to reduce the training error. We begin by
importing a few libraries with a function to annotate in a figure.
.. code:: python
%matplotlib inline
import d2l
from mpl_toolkits import mplot3d
import numpy as np
# Save to the d2l package.
def annotate(text, xy, xytext):
d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
arrowprops=dict(arrowstyle='->'))
The graph below illustrates the issue in some more detail. Since we have
only a finite amount of data the minimum of the training error may be at
a different location than the minimum of the expected error (or of the
test error).
.. code:: python
def f(x): return x * np.cos(np.pi * x)
def g(x): return f(x) + 0.2 * np.cos(5 * np.pi * x)
d2l.set_figsize((4.5, 2.5))
x = np.arange(0.5, 1.5, 0.01)
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('empirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('expected risk', (1.1, -1.05), (0.95, -0.5))
.. figure:: output_optimization-intro_111d97_3_0.svg
Optimization Challenges in Deep Learning
----------------------------------------
In this chapter, we are going to focus specifically on the performance
of the optimization algorithm in minimizing the objective function,
rather than a model’s generalization error. In
:numref:`chapter_linear_regression` we distinguished between
analytical solutions and numerical solutions in optimization problems.
In deep learning, most objective functions are complicated and do not
have analytical solutions. Instead, we must use numerical optimization
algorithms. The optimization algorithms below all fall into this
category.
There are many challenges in deep learning optimization. Some of the
most vexing ones are local minima, saddle points and vanishing
gradients. Let’s have a look at a few of them.
Local Minima
~~~~~~~~~~~~
For the objective function :math:`f(x)`, if the value of :math:`f(x)` at
:math:`x` is smaller than the values of :math:`f(x)` at any other points
in the vicinity of :math:`x`, then :math:`f(x)` could be a local
minimum. If the value of :math:`f(x)` at :math:`x` is the minimum of the
objective function over the entire domain, then :math:`f(x)` is the
global minimum.
For example, given the function
.. math:: f(x) = x \cdot \text{cos}(\pi x) \text{ for } -1.0 \leq x \leq 2.0,
we can approximate the local minimum and global minimum of this
function.
.. code:: python
x = np.arange(-1.0, 2.0, 0.01)
d2l.plot(x, [f(x), ], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))
.. figure:: output_optimization-intro_111d97_5_0.svg
The objective function of deep learning models usually has many local
optima. When the numerical solution of an optimization problem is near
the local optimum, the numerical solution obtained by the final
iteration may only minimize the objective function locally, rather than
globally, as the gradient of the objective function’s solutions
approaches or becomes zero. Only some degree of noise might knock the
parameter out of the local minimum. In fact, this is one of the
beneficial properties of stochastic gradient descent where the natural
variation of gradients over minibatches is able to dislodge the
parameters from local minima.
Saddle Points
~~~~~~~~~~~~~
Besides local minima, saddle points are another reason for gradients to
vanish. A `saddle point `__
is any location where all gradients of a function vanish but which is
neither a global nor a local minimum. Consider the function
:math:`f(x) = x^3`. Its first and second derivative vanish for
:math:`x=0`. Optimization might stall at the point, even though it is
not a minimum.
.. code:: python
x = np.arange(-2.0, 2.0, 0.01)
d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))
.. figure:: output_optimization-intro_111d97_7_0.svg
Saddle points in higher dimensions are even more insidious, as the
example below shows. Consider the function :math:`f(x, y) = x^2 - y^2`.
It has its saddle point at :math:`(0,0)`. This is a maximum with respect
to :math:`y` and a minimum with respect to :math:`x`. Moreover, it
*looks* like a saddle, which is where this mathematical property got its
name.
.. code:: python
x, y = np.mgrid[-1: 1: 101j, -1: 1: 101j]
z = x**2 - y**2
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');
.. figure:: output_optimization-intro_111d97_9_0.svg
We assume that the input of a function is a :math:`k`-dimensional vector
and its output is a scalar, so its Hessian matrix will have :math:`k`
eigenvalues (refer to :numref:`chapter_math`). The solution of the
function could be a local minimum, a local maximum, or a saddle point at
a position where the function gradient is zero:
- When the eigenvalues of the function’s Hessian matrix at the
zero-gradient position are all positive, we have a local minimum for
the function.
- When the eigenvalues of the function’s Hessian matrix at the
zero-gradient position are all negative, we have a local maximum for
the function.
- When the eigenvalues of the function’s Hessian matrix at the
zero-gradient position are negative and positive, we have a saddle
point for the function.
For high-dimensional problems the likelihood that at least some of the
eigenvalues are negative is quite high. This makes saddle points more
likely than local minima. We will discuss some exceptions to this
situation in the next section when introducing convexity. In short,
convex functions are those where the eigenvalues of the Hessian are
never negative. Sadly, though, most deep learning problems do not fall
into this category. Nonetheless it’s a great tool to study optimization
algorithms.
Vanishing Gradients
~~~~~~~~~~~~~~~~~~~
Probably the most insidious problem to encounter are vanishing
gradients. For instance, assume that we want to minimize the function
:math:`f(x) = \tanh(x)` and we happen to get started at :math:`x = 4`.
As we can see, the gradient of :math:`f` is close to nil. More
specifically :math:`f'(x) = 1 - \tanh^2(x)` and thus
:math:`f'(4) = 0.0013`. Consequently optimization will get stuck for a
long time before we make progress. This turns out to be one of the
reasons that training deep learning models was quite tricky prior to the
introduction of the ReLu activation function.
.. code:: python
x = np.arange(-2.0, 5.0, 0.01)
d2l.plot(x, [np.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))
.. figure:: output_optimization-intro_111d97_11_0.svg
As we saw, optimization for deep learning is full of challenges.
Fortunately there exists a robust range of algorithms that perform well
and that are easy to use even for beginners. Furthermore, it isn’t
really necessary to find *the* best solution. Local optima or even
approximate solutions thereof are still very useful.
Summary
-------
- Minimizing the training error does *not* guarantee that we find the
best set of parameters to minimize the expected error.
- The optimization problems may have many local minima.
- The problem may have even more saddle points, as generally the
problems are not convex.
- Vanishing gradients can cause optimization to stall. Often a
reparametrization of the problem helps. Good initialization of the
parameters can be beneficial, too.
Exercises
---------
1. Consider a simple multilayer perceptron with a single hidden layer
of, say, :math:`d` dimensions in the hidden layer and a single
output. Show that for any local minimum there are at least :math:`d!`
equivalent solutions that behave identically.
2. Assume that we have a symmetric random matrix :math:`\mathbf{M}`
where the entries :math:`M_{ij} = M_{ji}` are each drawn from some
probability distribution :math:`p_{ij}`. Furthermore assume that
:math:`p_{ij}(x) = p_{ij}(-x)`, i.e. that the distribution is
symmetric (see e.g. :cite:`Wigner.1958` for details).
- Prove that the distribution over eigenvalues is also symmetric.
That is, for any eigenvector :math:`\mathbf{v}` the probability
that the associated eigenvalue :math:`\lambda` satisfies
:math:`\Pr(\lambda > 0) = \Pr(\lambda < 0)`.
- Why does the above *not* imply :math:`\Pr(\lambda > 0) = 0.5`?
3. What other challenges involved in deep learning optimization can you
think of?
4. Assume that you want to balance a (real) ball on a (real) saddle.
- Why is this hard?
- Can you exploit this effect also for optimization algorithms?
Scan the QR Code to `Discuss `__
-----------------------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_optimization-intro.svg