2.4. Calculus

Finding the area of a polygon remained mysterious until at least 2,500 years ago, when the ancient Greeks divided a polygon into triangles and summed their areas. To find the area of curved shapes, such as a circle, the ancient Greeks inscribed polygons in such shapes. As shown in Fig. 2.4.1, an inscribed polygon with more sides of equal length better approximates the circle. This process is also known as the method of exhaustion.

../_images/polygon-circle.svg

Fig. 2.4.1 Find the area of a circle with the method of exhaustion.

In fact, the method of exhaustion is where integral calculus (which will be described in Section 18.5) originates. More than 2,000 years later, the other branch of calculus, differential calculus, was invented. Among the most critical applications of differential calculus are optimization problems, which ask how to do something as well as possible. As discussed in Section 2.3.10.1, such problems are ubiquitous in deep learning.

In deep learning, we train models, updating them successively so that they get better and better as they see more and more data. Usually, getting better means minimizing a loss function, a score that answers the question “how bad is our model?” This question is more subtle than it appears. Ultimately, what we really care about is producing a model that performs well on data that we have never seen before. But we can only fit the model to data that we can actually see. Thus we can decompose the task of fitting models into two key concerns: (i) optimization: the process of fitting our models to observed data; (ii) generalization: the mathematical principles and practitioners’ wisdom that guide us in producing models whose validity extends beyond the exact set of data examples used to train them.

To help you understand optimization problems and methods in later chapters, here we give a very brief primer on differential calculus that is commonly used in deep learning.

2.4.1. Derivatives and Differentiation

We begin by addressing the calculation of derivatives, a crucial step in nearly all deep learning optimization algorithms. In deep learning, we typically choose loss functions that are differentiable with respect to our model’s parameters. Put simply, this means that for each parameter, we can determine how rapidly the loss would increase or decrease, were we to increase or decrease that parameter by an infinitesimally small amount.

Suppose that we have a function \(f: \mathbb{R} \rightarrow \mathbb{R}\), whose input and output are both scalars. The derivative of \(f\) is defined as

(2.4.1)\[f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h},\]

if this limit exists. If \(f'(a)\) exists, \(f\) is said to be differentiable at \(a\). If \(f\) is differentiable at every point of an interval, then \(f\) is differentiable on that interval. We can interpret the derivative \(f'(x)\) in (2.4.1) as the instantaneous rate of change of \(f(x)\) with respect to \(x\). This so-called instantaneous rate of change is based on the variation \(h\) in \(x\), where \(h\) approaches \(0\).

To illustrate derivatives, let us experiment with an example. Define \(u = f(x) = 3x^2-4x\).

%matplotlib inline
import numpy as np
from matplotlib_inline import backend_inline
from d2l import torch as d2l


def f(x):
    return 3 * x ** 2 - 4 * x

By setting \(x=1\) and letting \(h\) approach \(0\), the numerical result of \(\frac{f(x+h) - f(x)}{h}\) in (2.4.1) approaches \(2\). Though this experiment is not a mathematical proof, we will see later that the derivative \(u'\) is \(2\) when \(x=1\).

def numerical_lim(f, x, h):
    return (f(x + h) - f(x)) / h

h = 0.1
for i in range(5):
    print(f'h={h:.5f}, numerical limit={numerical_lim(f, 1, h):.5f}')
    h *= 0.1
h=0.10000, numerical limit=2.30000
h=0.01000, numerical limit=2.03000
h=0.00100, numerical limit=2.00300
h=0.00010, numerical limit=2.00030
h=0.00001, numerical limit=2.00003

Let us familiarize ourselves with a few equivalent notations for derivatives. Given \(y = f(x)\), where \(x\) and \(y\) are the independent and dependent variables of the function \(f\), respectively, the following expressions are equivalent:

(2.4.2)\[f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x),\]

where the symbols \(\frac{d}{dx}\) and \(D\) are differentiation operators that indicate the operation of differentiation. We can use the following rules to differentiate common functions; a quick numerical check of these rules follows the list:

  • \(DC = 0\) (\(C\) is a constant),

  • \(Dx^n = nx^{n-1}\) (the power rule, \(n\) is any real number),

  • \(De^x = e^x\),

  • \(D\ln(x) = 1/x.\)
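
As a quick, illustrative sanity check, we can compare these rules against a central-difference approximation of the derivative; the central_diff helper below is a minimal sketch introduced only for this purpose.

import numpy as np

def central_diff(f, x, h=1e-5):
    """Approximate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 2.0
# Power rule: D x^3 = 3 x^2
print(central_diff(lambda x: x ** 3, x0), 3 * x0 ** 2)
# Exponential: D e^x = e^x
print(central_diff(np.exp, x0), np.exp(x0))
# Natural logarithm: D ln(x) = 1/x
print(central_diff(np.log, x0), 1 / x0)

Each line prints two values that should agree closely.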

To differentiate a function that is formed from a few simpler functions, such as the common functions above, the following rules can be handy. Suppose that the functions \(f\) and \(g\) are both differentiable and that \(C\) is a constant. Then we have the constant multiple rule

(2.4.3)\[\frac{d}{dx} [Cf(x)] = C \frac{d}{dx} f(x),\]

the sum rule

(2.4.4)\[\frac{d}{dx} [f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x),\]

the product rule

(2.4.5)\[\frac{d}{dx} [f(x)g(x)] = f(x) \frac{d}{dx} [g(x)] + g(x) \frac{d}{dx} [f(x)],\]

and the quotient rule

(2.4.6)\[\frac{d}{dx} \left[\frac{f(x)}{g(x)}\right] = \frac{g(x) \frac{d}{dx} [f(x)] - f(x) \frac{d}{dx} [g(x)]}{[g(x)]^2}.\]
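
The product and quotient rules can also be verified symbolically. The following sketch assumes that the SymPy library is available (it is not required by the rest of this section).

import sympy

x = sympy.symbols('x')
f, g = x ** 2, sympy.sin(x)

# Product rule: d/dx [f g] should equal f g' + g f'
lhs = sympy.diff(f * g, x)
rhs = f * sympy.diff(g, x) + g * sympy.diff(f, x)
print(sympy.simplify(lhs - rhs))  # 0

# Quotient rule: d/dx [f / g] should equal (g f' - f g') / g^2
lhs = sympy.diff(f / g, x)
rhs = (g * sympy.diff(f, x) - f * sympy.diff(g, x)) / g ** 2
print(sympy.simplify(lhs - rhs))  # 0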

Now we can apply a few of the above rules to find \(u' = f'(x) = 3 \frac{d}{dx} x^2-4\frac{d}{dx}x = 6x-4\). Thus, by setting \(x = 1\), we have \(u' = 2\): this is supported by our earlier experiment in this section where the numerical result approaches \(2\). This derivative is also the slope of the tangent line to the curve \(u = f(x)\) when \(x = 1\).

To visualize such an interpretation of derivatives, we will use matplotlib, a popular plotting library in Python. To configure properties of the figures produced by matplotlib, we need to define a few functions. In the following, the use_svg_display function specifies that matplotlib should output figures in the SVG format, which yields sharper images. Note that the comment #@save is a special marker: the function, class, or statements that follow it are saved in the d2l package so that they can later be invoked directly (e.g., d2l.use_svg_display()) without being redefined.

def use_svg_display():  #@save
    """Use the svg format to display a plot in Jupyter."""
    backend_inline.set_matplotlib_formats('svg')

We define the set_figsize function to specify the figure sizes. Note that here we directly use d2l.plt since the import statement from matplotlib import pyplot as plt was marked to be saved in the d2l package in the preface.

def set_figsize(figsize=(3.5, 2.5)):  #@save
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize

The following set_axes function sets properties of axes of figures produced by matplotlib.

#@save
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()

With these three functions for figure configurations, we define the plot function to plot multiple curves succinctly since we will need to visualize many curves throughout the book.

#@save
def plot(X, Y=None, xlabel=None, ylabel=None, legend=None, xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""
    if legend is None:
        legend = []

    set_figsize(figsize)
    axes = axes if axes else d2l.plt.gca()

    # Return True if `X` (tensor or list) has 1 axis
    def has_one_axis(X):
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X):
        X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        if len(x):
            axes.plot(x, y, fmt)
        else:
            axes.plot(y, fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)

Now we can plot the function \(u = f(x)\) and its tangent line \(y = 2x - 3\) at \(x=1\), where the coefficient \(2\) is the slope of the tangent line.

x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])
../_images/output_calculus_7e7694_35_0.svg

2.4.2. Partial Derivatives

So far we have dealt with the differentiation of functions of just one variable. In deep learning, functions often depend on many variables. Thus, we need to extend the ideas of differentiation to these multivariate functions.

Let \(y = f(x_1, x_2, \ldots, x_n)\) be a function with \(n\) variables. The partial derivative of \(y\) with respect to its \(i^\mathrm{th}\) parameter \(x_i\) is

(2.4.7)\[\frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}.\]

To calculate \(\frac{\partial y}{\partial x_i}\), we can simply treat \(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\) as constants and calculate the derivative of \(y\) with respect to \(x_i\). For notation of partial derivatives, the following are equivalent:

(2.4.8)\[\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_{x_i} = f_i = D_i f = D_{x_i} f.\]
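
To make (2.4.7) concrete, here is a minimal sketch that approximates a partial derivative by perturbing only the coordinate of interest while treating the other one as a constant; the function and values are purely illustrative.

def f(x1, x2):
    return x1 ** 2 + 3 * x2 ** 3

def partial_x1(f, x1, x2, h=1e-5):
    # Perturb x1 only; x2 is held fixed and treated as a constant
    return (f(x1 + h, x2) - f(x1, x2)) / h

# Analytically, the partial derivative with respect to x1 is 2 * x1 = 2.0 here
print(partial_x1(f, 1.0, 2.0))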

2.4.3. Gradients

We can concatenate partial derivatives of a multivariate function with respect to all its variables to obtain the gradient vector of the function. Suppose that the input of function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is an \(n\)-dimensional vector \(\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top\) and the output is a scalar. The gradient of the function \(f(\mathbf{x})\) with respect to \(\mathbf{x}\) is a vector of \(n\) partial derivatives:

(2.4.9)\[\nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top,\]

where \(\nabla_{\mathbf{x}} f(\mathbf{x})\) is often replaced by \(\nabla f(\mathbf{x})\) when there is no ambiguity.
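
As an illustration of (2.4.9), a gradient can be approximated by applying a central difference to each coordinate in turn; the numerical_gradient helper below is a minimal sketch built on NumPy.

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar-valued f at the vector x,
    one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = np.array([1.0, -2.0, 3.0])
# For f(x) = x^T x, the gradient is 2x = [2, -4, 6]
print(numerical_gradient(lambda v: np.sum(v ** 2), x))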

Let \(\mathbf{x}\) be an \(n\)-dimensional vector. The following rules are often used when differentiating multivariate functions:

  • For all \(\mathbf{A} \in \mathbb{R}^{m \times n}\), \(\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top\),

  • For all \(\mathbf{A} \in \mathbb{R}^{n \times m}\), \(\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}\),

  • For all \(\mathbf{A} \in \mathbb{R}^{n \times n}\), \(\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}\),

  • \(\nabla_{\mathbf{x}} \|\mathbf{x} \|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}\).

Similarly, for any matrix \(\mathbf{X}\), we have \(\nabla_{\mathbf{X}} \|\mathbf{X} \|_F^2 = 2\mathbf{X}\). As we will see later, gradients are useful for designing optimization algorithms in deep learning.
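
As a quick sanity check, the identity \(\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}\) from the list above can be verified numerically with central differences; the following sketch uses a randomly generated \(\mathbf{A}\).

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

def quad(v):
    return v @ A @ v  # computes v^T A v

# Central-difference gradient of the quadratic form, one coordinate at a time
h = 1e-5
grad = np.array([(quad(x + h * np.eye(4)[i]) - quad(x - h * np.eye(4)[i])) / (2 * h)
                 for i in range(4)])
print(np.allclose(grad, (A + A.T) @ x, atol=1e-4))  # True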

2.4.4. Chain Rule

However, such gradients can be hard to find. This is because multivariate functions in deep learning are often composite, so we cannot directly apply any of the aforementioned rules to differentiate these functions. Fortunately, the chain rule enables us to differentiate composite functions.

Let us first consider functions of a single variable. Suppose that the functions \(y=f(u)\) and \(u=g(x)\) are both differentiable. Then the chain rule states that

(2.4.10)\[\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.\]

Now let us turn our attention to a more general scenario where functions have an arbitrary number of variables. Suppose that the differentiable function \(y\) has variables \(u_1, u_2, \ldots, u_m\), where each differentiable function \(u_i\) has variables \(x_1, x_2, \ldots, x_n\). Note that \(y\) is a function of \(x_1, x_2, \ldots, x_n\). Then the chain rule gives

(2.4.11)\[\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \cdots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i}\]

for any \(i = 1, 2, \ldots, n\).
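
To connect (2.4.11) with code, the following illustrative sketch computes \(\partial y/\partial x_1\) for a simple composite function both via the chain rule and via a central difference.

def y_of_x(x1, x2):
    u1 = x1 + x2     # u1 depends on x1 and x2
    u2 = x1 * x2     # u2 depends on x1 and x2
    return u1 ** 2 + 2 * u2   # y depends on x1, x2 only through u1 and u2

x1, x2 = 1.0, 2.0
u1 = x1 + x2

# Chain rule: dy/dx1 = (dy/du1)(du1/dx1) + (dy/du2)(du2/dx1) = (2 * u1) * 1 + 2 * x2
chain = 2 * u1 * 1 + 2 * x2

h = 1e-5
numerical = (y_of_x(x1 + h, x2) - y_of_x(x1 - h, x2)) / (2 * h)
print(chain, numerical)  # both approximately 10.0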

2.4.5. Summary

  • Differential calculus and integral calculus are two branches of calculus, where the former can be applied to the ubiquitous optimization problems in deep learning.

  • A derivative can be interpreted as the instantaneous rate of change of a function with respect to its variable. It is also the slope of the tangent line to the curve of the function.

  • A gradient is a vector whose components are the partial derivatives of a multivariate function with respect to all its variables.

  • The chain rule enables us to differentiate composite functions.

2.4.6. Exercises

  1. Plot the function \(y = f(x) = x^3 - \frac{1}{x}\) and its tangent line when \(x = 1\).

  2. Find the gradient of the function \(f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}\).

  3. What is the gradient of the function \(f(\mathbf{x}) = \|\mathbf{x}\|_2\)?

  4. Can you write out the chain rule for the case where \(u = f(x, y, z)\) and \(x = x(a, b)\), \(y = y(a, b)\), and \(z = z(a, b)\)?