.. _sec_distributions:
Distributions
=============
Now that we have learned how to work with probability in both the
discrete and the continuous setting, let us get to know some of the
commonly encountered distributions. Depending on the area of machine
learning, we may need to be familiar with vastly more of these, or for
some areas of deep learning potentially none at all. This is, however, a
good basic list to be familiar with. Let us first import some common
libraries.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
from math import erf, factorial
import numpy as np
from IPython import display
from d2l import mxnet as d2l
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
from math import erf, factorial
import torch
from IPython import display
from d2l import torch as d2l
torch.pi = torch.acos(torch.zeros(1)) * 2 # Define pi in torch
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
%matplotlib inline
from math import erf, factorial
import tensorflow as tf
import tensorflow_probability as tfp
from IPython import display
from d2l import tensorflow as d2l
tf.pi = tf.acos(tf.zeros(1)) * 2 # Define pi in TensorFlow
Bernoulli
---------
This is the simplest random variable usually encountered. It encodes a
coin flip which comes up :math:`1` with probability :math:`p` and
:math:`0` with probability :math:`1-p`. If we have a
random variable :math:`X` with this distribution, we will write
.. math::
X \sim \mathrm{Bernoulli}(p).
The cumulative distribution function is
.. math:: F(x) = \begin{cases} 0 & x < 0, \\ 1-p & 0 \le x < 1, \\ 1 & x \ge 1 . \end{cases}
:label: eq_bernoulli_cdf
The probability mass function is plotted below.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
p = 0.3
d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_15_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
p = 0.3
d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_18_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
p = 0.3
d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_21_0.svg
Now, let us plot the cumulative distribution function
:eq:`eq_bernoulli_cdf`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = np.arange(-1, 2, 0.01)
def F(x):
    return 0 if x < 0 else 1 if x >= 1 else 1 - p
d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_27_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = torch.arange(-1, 2, 0.01)
def F(x):
    return 0 if x < 0 else 1 if x >= 1 else 1 - p
d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_30_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = tf.range(-1, 2, 0.01)
def F(x):
    return 0 if x < 0 else 1 if x >= 1 else 1 - p
d2l.plot(x, tf.constant([F(y) for y in x]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_33_0.svg
If :math:`X \sim \mathrm{Bernoulli}(p)`, then:
- :math:`\mu_X = p`,
- :math:`\sigma_X^2 = p(1-p)`.
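We can sanity-check both formulas by simulation. The following is a
hedged sketch in plain NumPy, separate from the framework cells of the
original notebook:

```python
import numpy as np

# Estimate the mean and variance of Bernoulli(p) samples and compare
# them with the closed forms p and p * (1 - p).
rng = np.random.default_rng(0)
p = 0.3
samples = (rng.random(100_000) < p).astype(float)

empirical_mean = samples.mean()
empirical_var = samples.var()
print(empirical_mean, empirical_var)  # close to 0.3 and 0.21
```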
We can sample an array of arbitrary shape from a Bernoulli random
variable as follows.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
1*(np.random.rand(10, 10) < p)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 1, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
[0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 1]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
1*(torch.rand(10, 10) < p)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 1, 0],
[0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
[1, 1, 0, 0, 1, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 0, 0, 0, 0, 0, 0]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
tf.cast(tf.random.uniform((10, 10)) < p, dtype=tf.float32)
Discrete Uniform
----------------
The next commonly encountered random variable is a discrete uniform. For
our discussion here, we will assume that it is supported on the integers
:math:`\{1, 2, \ldots, n\}`; however, any other set of values can be
freely chosen. The meaning of the word *uniform* in this context is that
every possible value is equally likely. The probability for each value
:math:`i \in \{1, 2, 3, \ldots, n\}` is :math:`p_i = \frac{1}{n}`. We
will denote a random variable :math:`X` with this distribution as
.. math::
X \sim U(n).
The cumulative distribution function is
.. math:: F(x) = \begin{cases} 0 & x < 1, \\ \frac{k}{n} & k \le x < k+1 \text{ with } 1 \le k < n, \\ 1 & x \ge n . \end{cases}
:label: eq_discrete_uniform_cdf
Let us first plot the probability mass function.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n = 5
d2l.plt.stem([i+1 for i in range(n)], n*[1 / n], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_51_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n = 5
d2l.plt.stem([i+1 for i in range(n)], n*[1 / n], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_54_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n = 5
d2l.plt.stem([i+1 for i in range(n)], n*[1 / n], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_57_0.svg
Now, let us plot the cumulative distribution function
:eq:`eq_discrete_uniform_cdf`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = np.arange(-1, 6, 0.01)
def F(x):
return 0 if x < 1 else 1 if x > n else np.floor(x) / n
d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_63_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = torch.arange(-1, 6, 0.01)
def F(x):
return 0 if x < 1 else 1 if x > n else torch.floor(x) / n
d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_66_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = tf.range(-1, 6, 0.01)
def F(x):
return 0 if x < 1 else 1 if x > n else tf.floor(x) / n
d2l.plot(x, [F(y) for y in x], 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_69_0.svg
If :math:`X \sim U(n)`, then:
- :math:`\mu_X = \frac{1+n}{2}`,
- :math:`\sigma_X^2 = \frac{n^2-1}{12}`.
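These closed forms can be checked directly from the definition; the
following small sketch (not part of the original notebook) computes the
mean and variance from the mass function :math:`p_i = \frac{1}{n}`:

```python
# Compute the mean and variance of U(n) from the p.m.f. and compare
# them with (1 + n) / 2 and (n**2 - 1) / 12.
n = 5
mean = sum(i / n for i in range(1, n + 1))
var = sum((i - mean) ** 2 / n for i in range(1, n + 1))
print(mean, var)  # (1 + n) / 2 = 3.0 and (n**2 - 1) / 12 = 2.0, up to rounding
```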
We can sample an array of arbitrary shape from a discrete uniform random
variable as follows.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
np.random.randint(1, n, size=(10, 10))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[2, 4, 3, 2, 3, 4, 4, 3, 3, 2],
[2, 1, 3, 3, 1, 1, 4, 2, 4, 1],
[3, 2, 2, 1, 3, 4, 2, 2, 1, 1],
[2, 1, 3, 1, 2, 4, 4, 4, 2, 1],
[4, 4, 4, 2, 4, 3, 4, 1, 3, 1],
[2, 3, 4, 4, 1, 3, 1, 3, 4, 4],
[4, 1, 1, 4, 4, 3, 3, 1, 3, 2],
[1, 4, 1, 3, 2, 4, 4, 4, 4, 2],
[4, 3, 1, 1, 1, 1, 3, 2, 2, 1],
[2, 4, 3, 3, 2, 2, 4, 4, 3, 2]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
torch.randint(1, n, size=(10, 10))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[2, 2, 4, 4, 4, 4, 1, 1, 3, 1],
[2, 2, 4, 4, 1, 3, 2, 1, 3, 2],
[2, 1, 1, 1, 3, 4, 3, 2, 2, 2],
[1, 3, 4, 2, 4, 4, 2, 1, 4, 2],
[4, 4, 3, 3, 1, 2, 4, 3, 3, 3],
[2, 3, 1, 2, 1, 3, 1, 4, 2, 3],
[1, 4, 3, 3, 3, 1, 1, 3, 4, 4],
[1, 4, 4, 1, 3, 2, 4, 2, 3, 1],
[2, 1, 4, 4, 2, 3, 4, 1, 4, 4],
[3, 4, 1, 3, 3, 4, 1, 4, 2, 2]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
tf.random.uniform((10, 10), 1, n, dtype=tf.int32)
Continuous Uniform
------------------
Next, let us discuss the continuous uniform distribution. The idea
behind this random variable is that if we increase the :math:`n` in the
discrete uniform distribution, and then scale it to fit within the
interval :math:`[a, b]`, we will approach a continuous random variable
that just picks an arbitrary value in :math:`[a, b]`, all with equal
probability. We will denote this distribution as
.. math::
X \sim U(a, b).
The probability density function is
.. math:: p(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b], \\ 0 & x \not\in [a, b].\end{cases}
:label: eq_cont_uniform_pdf
The cumulative distribution function is
.. math:: F(x) = \begin{cases} 0 & x < a, \\ \frac{x-a}{b-a} & x \in [a, b], \\ 1 & x \ge b . \end{cases}
:label: eq_cont_uniform_cdf
Let us first plot the probability density function
:eq:`eq_cont_uniform_pdf`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
a, b = 1, 3
x = np.arange(0, 4, 0.01)
p = (x > a)*(x < b)/(b - a)
d2l.plot(x, p, 'x', 'p.d.f.')
.. figure:: output_distributions_c7d568_87_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
a, b = 1, 3
x = torch.arange(0, 4, 0.01)
p = (x > a).type(torch.float32)*(x < b).type(torch.float32)/(b-a)
d2l.plot(x, p, 'x', 'p.d.f.')
.. figure:: output_distributions_c7d568_90_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
a, b = 1, 3
x = tf.range(0, 4, 0.01)
p = tf.cast(x > a, tf.float32) * tf.cast(x < b, tf.float32) / (b - a)
d2l.plot(x, p, 'x', 'p.d.f.')
.. figure:: output_distributions_c7d568_93_0.svg
Now, let us plot the cumulative distribution function
:eq:`eq_cont_uniform_cdf`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def F(x):
return 0 if x < a else 1 if x > b else (x - a) / (b - a)
d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_99_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def F(x):
return 0 if x < a else 1 if x > b else (x - a) / (b - a)
d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_102_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def F(x):
return 0 if x < a else 1 if x > b else (x - a) / (b - a)
d2l.plot(x, [F(y) for y in x], 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_105_0.svg
If :math:`X \sim U(a, b)`, then:
- :math:`\mu_X = \frac{a+b}{2}`,
- :math:`\sigma_X^2 = \frac{(b-a)^2}{12}`.
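Again, a short simulation check (our own hedged sketch in plain NumPy)
confirms the two closed forms:

```python
import numpy as np

# Samples from U(a, b) should have mean close to (a + b) / 2 and
# variance close to (b - a)**2 / 12.
rng = np.random.default_rng(0)
a, b = 1, 3
samples = (b - a) * rng.random(100_000) + a
print(samples.mean(), samples.var())  # close to 2.0 and 0.333...
```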
We can sample an array of arbitrary shape from a uniform random variable
as follows. Note that by default it samples from :math:`U(0,1)`, so if
we want a different range we need to scale it.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
(b - a) * np.random.rand(10, 10) + a
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[1.1009639 , 1.99548045, 1.32078504, 2.94598323, 1.73268081,
1.13185198, 2.79059098, 2.48794303, 2.27678871, 2.59565863],
[2.19636211, 1.46368979, 2.63971369, 2.14576546, 2.45143403,
2.64608562, 2.04813795, 2.9321872 , 1.97396729, 1.52448965],
[2.60595318, 1.25977147, 1.81192341, 1.9654566 , 2.3878148 ,
1.68547482, 1.57600922, 2.98863203, 2.51559542, 1.62334523],
[2.63421924, 1.62141884, 2.19524953, 2.52135243, 1.98069251,
1.5434051 , 2.9010042 , 1.208448 , 2.47746836, 2.68472932],
[1.47839634, 2.69821046, 2.16335824, 2.52594196, 1.58609662,
2.68784221, 1.09076404, 2.25668103, 1.69730546, 1.49264671],
[2.95453587, 2.42697746, 1.30839447, 2.53141625, 1.10685981,
1.3342463 , 2.45128493, 1.12066042, 1.14130562, 1.60633291],
[1.84099523, 1.76741889, 2.07226557, 2.21084373, 1.72345977,
2.09112217, 2.71524015, 2.29997131, 2.41705823, 2.21850148],
[1.33496881, 1.0187176 , 1.79539019, 2.89608549, 2.60337656,
1.08795363, 2.10555098, 2.84983534, 2.61660088, 1.880103 ],
[2.76502754, 1.11892964, 1.79090477, 1.33770605, 1.6460579 ,
1.08028373, 1.30296252, 2.54351441, 2.08450001, 1.57908726],
[2.44670661, 2.14965936, 1.43633011, 2.95629441, 1.2254383 ,
2.70698471, 2.98637975, 1.85081822, 1.369475 , 2.81089289]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
(b - a) * torch.rand(10, 10) + a
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[2.0567, 1.7475, 2.5049, 1.1552, 1.0480, 2.2907, 1.9688, 1.5544, 1.4274,
1.1327],
[1.5359, 2.5436, 2.9093, 1.5000, 1.1420, 1.3260, 2.8982, 1.7559, 1.1971,
1.3764],
[1.9570, 1.2427, 1.4252, 1.9522, 1.7189, 1.6867, 2.6322, 1.5363, 2.4625,
2.3565],
[1.3062, 2.8515, 2.9872, 2.3049, 2.3508, 2.8846, 2.5525, 1.5796, 2.7026,
2.1324],
[1.4119, 2.8055, 2.6930, 2.9233, 1.1061, 1.3097, 1.0711, 2.4873, 1.6328,
2.6761],
[1.8753, 2.7295, 1.0687, 2.3927, 1.2787, 1.1983, 1.6894, 2.5297, 2.2103,
1.9512],
[2.6160, 1.6484, 1.0216, 1.3214, 2.1103, 1.0174, 1.5000, 2.1930, 2.6461,
2.8359],
[2.8049, 1.8693, 1.4888, 1.2923, 2.6022, 1.1023, 2.5431, 1.1829, 2.2978,
2.6984],
[2.6626, 1.3045, 2.6301, 2.6097, 1.1075, 2.2840, 1.0980, 2.4869, 1.3132,
1.6690],
[1.2408, 2.8058, 2.2616, 2.4093, 2.8544, 1.2549, 1.5693, 2.9624, 2.3658,
2.8538]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
(b - a) * tf.random.uniform((10, 10)) + a
Binomial
--------
Let us make things a little more complex and examine the *binomial*
random variable. This random variable originates from performing a
sequence of :math:`n` independent experiments, each of which has
probability :math:`p` of succeeding, and asking how many successes we
expect to see.
Let us express this mathematically. Each experiment is an independent
random variable :math:`X_i` where we will use :math:`1` to encode
success, and :math:`0` to encode failure. Since each is an independent
coin flip which is successful with probability :math:`p`, we can say
that :math:`X_i \sim \mathrm{Bernoulli}(p)`. Then, the binomial random
variable is
.. math::
X = \sum_{i=1}^n X_i.
In this case, we will write
.. math::
X \sim \mathrm{Binomial}(n, p).
To get the cumulative distribution function, we need to notice that
getting exactly :math:`k` successes can occur in
:math:`\binom{n}{k} = \frac{n!}{k!(n-k)!}` ways, each of which has a
probability of :math:`p^k(1-p)^{n-k}` of occurring. Thus the cumulative
distribution function is
.. math:: F(x) = \begin{cases} 0 & x < 0, \\ \sum_{m \le k} \binom{n}{m} p^m(1-p)^{n-m} & k \le x < k+1 \text{ with } 0 \le k < n, \\ 1 & x \ge n . \end{cases}
:label: eq_binomial_cdf
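As a quick consistency check (a sketch using only Python's standard
library, not part of the original notebook), the probabilities
:math:`\binom{n}{k} p^k(1-p)^{n-k}` must sum to one, since some number
of successes always occurs:

```python
from math import comb

# Build the Binomial(n, p) mass function and check normalization.
n, p = 10, 0.2
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
total = sum(pmf)
print(total)  # 1.0 up to floating-point error
```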
Let us first plot the probability mass function.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n, p = 10, 0.2
# Compute binomial coefficient
def binom(n, k):
comb = 1
for i in range(min(k, n - k)):
comb = comb * (n - i) // (i + 1)
return comb
pmf = np.array([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])
d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_123_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n, p = 10, 0.2
# Compute binomial coefficient
def binom(n, k):
comb = 1
for i in range(min(k, n - k)):
comb = comb * (n - i) // (i + 1)
return comb
pmf = torch.tensor([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])
d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_126_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
n, p = 10, 0.2
# Compute binomial coefficient
def binom(n, k):
comb = 1
for i in range(min(k, n - k)):
comb = comb * (n - i) // (i + 1)
return comb
pmf = tf.constant([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])
d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_129_0.svg
Now, let us plot the cumulative distribution function
:eq:`eq_binomial_cdf`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = np.arange(-1, 11, 0.01)
cmf = np.cumsum(pmf)
def F(x):
return 0 if x < 0 else 1 if x > n else cmf[int(x)]
d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_135_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = torch.arange(-1, 11, 0.01)
cmf = torch.cumsum(pmf, dim=0)
def F(x):
return 0 if x < 0 else 1 if x > n else cmf[int(x)]
d2l.plot(x, torch.tensor([F(y) for y in x.tolist()]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_138_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = tf.range(-1, 11, 0.01)
cmf = tf.cumsum(pmf)
def F(x):
return 0 if x < 0 else 1 if x > n else cmf[int(x)]
d2l.plot(x, [F(y) for y in x.numpy().tolist()], 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_141_0.svg
If :math:`X \sim \mathrm{Binomial}(n, p)`, then:
- :math:`\mu_X = np`,
- :math:`\sigma_X^2 = np(1-p)`.
This follows from the linearity of expected value over the sum of
:math:`n` Bernoulli random variables, and the fact that the variance of
the sum of independent random variables is the sum of the variances.
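These closed forms can also be checked directly from the mass function;
the following hedged sketch (our own, using the standard library) does
exactly that:

```python
from math import comb

# Compute the mean and variance of Binomial(n, p) from the p.m.f. and
# compare them with n*p and n*p*(1 - p).
n, p = 10, 0.2
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
print(mean, var)  # close to n*p = 2.0 and n*p*(1-p) = 1.6
```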
This can be sampled as follows.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
np.random.binomial(n, p, size=(10, 10))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[2, 1, 3, 2, 2, 3, 1, 3, 2, 3],
[1, 1, 1, 3, 1, 2, 2, 2, 3, 3],
[2, 1, 4, 3, 2, 2, 4, 1, 0, 2],
[3, 1, 5, 3, 3, 1, 2, 3, 1, 3],
[2, 1, 1, 0, 1, 3, 2, 0, 2, 1],
[2, 0, 4, 2, 2, 1, 1, 1, 2, 3],
[0, 2, 4, 3, 3, 4, 0, 3, 1, 2],
[1, 1, 4, 2, 4, 4, 3, 1, 2, 4],
[2, 2, 1, 1, 1, 5, 1, 1, 3, 2],
[4, 1, 0, 2, 3, 0, 4, 1, 3, 4]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
m = torch.distributions.binomial.Binomial(n, p)
m.sample(sample_shape=(10, 10))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[1., 1., 1., 1., 1., 3., 0., 2., 2., 2.],
[1., 3., 1., 4., 2., 3., 3., 3., 3., 1.],
[3., 2., 2., 2., 2., 1., 0., 2., 2., 0.],
[2., 2., 5., 3., 3., 1., 2., 2., 5., 1.],
[1., 3., 2., 1., 1., 0., 0., 3., 3., 1.],
[4., 0., 3., 1., 5., 1., 3., 1., 0., 4.],
[2., 2., 4., 2., 1., 1., 3., 1., 3., 1.],
[3., 1., 3., 1., 1., 3., 3., 2., 2., 0.],
[2., 1., 4., 2., 3., 2., 1., 3., 3., 1.],
[2., 1., 3., 1., 1., 2., 0., 6., 2., 0.]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
m = tfp.distributions.Binomial(n, p)
m.sample(sample_shape=(10, 10))
Poisson
-------
Let us now perform a thought experiment. We are standing at a bus stop
and we want to know how many buses will arrive in the next minute. Let
us start by considering :math:`X^{(1)} \sim \mathrm{Bernoulli}(p)`,
which simply indicates whether a bus arrives in the one-minute window.
For bus stops far from an urban center, this might be a pretty good
approximation, since we may never see more than one bus in a minute.
However, if we are in a busy area, it is possible or even likely that
two buses will arrive. We can model this by splitting our random
variable into two parts: one for the first 30 seconds, and one for the
second 30 seconds. In this case we can write
.. math::
X^{(2)} \sim X^{(2)}_1 + X^{(2)}_2,
where :math:`X^{(2)}` is the total sum, and
:math:`X^{(2)}_i \sim \mathrm{Bernoulli}(p/2)`. The total distribution
is then :math:`X^{(2)} \sim \mathrm{Binomial}(2, p/2)`.
Why stop here? Let us continue to split that minute into :math:`n`
parts. By the same reasoning as above, we see that
.. math:: X^{(n)} \sim \mathrm{Binomial}(n, p/n).
:label: eq_eq_poisson_approx
Consider these random variables. By the previous section, we know that
:eq:`eq_eq_poisson_approx` has mean
:math:`\mu_{X^{(n)}} = n(p/n) = p`, and variance
:math:`\sigma_{X^{(n)}}^2 = n(p/n)(1-(p/n)) = p(1-p/n)`. If we take
:math:`n \rightarrow \infty`, we can see that these numbers stabilize to
:math:`\mu_{X^{(\infty)}} = p`, and variance
:math:`\sigma_{X^{(\infty)}}^2 = p`. This indicates that there *could
be* some random variable we can define in this infinite subdivision
limit.
This should not come as too much of a surprise, since in the real world
we can just count the number of bus arrivals; however, it is nice to see
that our mathematical model is well defined. This discussion can be made
formal as the *law of rare events*.
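The convergence can be illustrated numerically; a hedged sketch (our
own, using only the standard library) compares the
:math:`\mathrm{Binomial}(n, \lambda/n)` probability of a fixed count
against the corresponding Poisson probability as :math:`n` grows:

```python
from math import comb, exp, factorial

# Law of rare events: Binomial(n, lam / n) probabilities approach
# Poisson(lam) probabilities as n grows.
lam, k = 5.0, 3
poisson_pk = exp(-lam) * lam**k / factorial(k)

gaps = []
for n in [10, 100, 1000]:
    p = lam / n
    binom_pk = comb(n, k) * p**k * (1 - p)**(n - k)
    gaps.append(abs(binom_pk - poisson_pk))
print(gaps)  # the gap shrinks as n grows
```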
Following through this reasoning carefully, we can arrive at the
following model. We will say that
:math:`X \sim \mathrm{Poisson}(\lambda)` if it is a random variable
which takes the values :math:`\{0,1,2, \ldots\}` with probability
.. math:: p_k = \frac{\lambda^ke^{-\lambda}}{k!}.
:label: eq_poisson_mass
The value :math:`\lambda > 0` is known as the *rate* (or the *shape*
parameter), and denotes the average number of arrivals we expect in one
unit of time.
We may sum this probability mass function to get the cumulative
distribution function.
.. math:: F(x) = \begin{cases} 0 & x < 0, \\ e^{-\lambda}\sum_{m = 0}^k \frac{\lambda^m}{m!} & k \le x < k+1 \text{ with } 0 \le k. \end{cases}
:label: eq_poisson_cdf
Let us first plot the probability mass function
:eq:`eq_poisson_mass`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lam = 5.0
xs = [i for i in range(20)]
pmf = np.array([np.exp(-lam) * lam**k / factorial(k) for k in xs])
d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_159_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lam = 5.0
xs = [i for i in range(20)]
pmf = torch.tensor([torch.exp(torch.tensor(-lam)) * lam**k
/ factorial(k) for k in xs])
d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_162_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
lam = 5.0
xs = [i for i in range(20)]
pmf = tf.constant([tf.exp(tf.constant(-lam)).numpy() * lam**k
/ factorial(k) for k in xs])
d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()
.. figure:: output_distributions_c7d568_165_0.svg
Now, let us plot the cumulative distribution function
:eq:`eq_poisson_cdf`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = np.arange(-1, 21, 0.01)
cmf = np.cumsum(pmf)
def F(x):
    # Note: the support here runs to len(cmf) - 1, not the binomial n
    return 0 if x < 0 else 1 if x >= len(cmf) else cmf[int(x)]
d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_171_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = torch.arange(-1, 21, 0.01)
cmf = torch.cumsum(pmf, dim=0)
def F(x):
    # Note: the support here runs to len(cmf) - 1, not the binomial n
    return 0 if x < 0 else 1 if x >= len(cmf) else cmf[int(x)]
d2l.plot(x, torch.tensor([F(y) for y in x.tolist()]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_174_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
x = tf.range(-1, 21, 0.01)
cmf = tf.cumsum(pmf)
def F(x):
    # Note: the support here runs to len(cmf) - 1, not the binomial n
    return 0 if x < 0 else 1 if x >= len(cmf) else cmf[int(x)]
d2l.plot(x, [F(y) for y in x.numpy().tolist()], 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_177_0.svg
As we saw above, the means and variances are particularly concise. If
:math:`X \sim \mathrm{Poisson}(\lambda)`, then:
- :math:`\mu_X = \lambda`,
- :math:`\sigma_X^2 = \lambda`.
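Both facts can be checked numerically; the following hedged sketch
truncates the mass function far into the tail, where the remaining mass
is negligible:

```python
from math import exp, factorial

# Compute the mean and variance of Poisson(lam) from a long truncation
# of the p.m.f.; both should come out close to lam.
lam = 5.0
pmf = [exp(-lam) * lam**k / factorial(k) for k in range(100)]
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
print(mean, var)  # both close to 5.0
```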
This can be sampled as follows.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
np.random.poisson(lam, size=(10, 10))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[ 3, 3, 1, 5, 7, 4, 4, 6, 6, 3],
[ 5, 4, 4, 8, 9, 5, 5, 2, 3, 7],
[ 2, 4, 5, 6, 1, 6, 4, 10, 4, 7],
[ 2, 8, 6, 3, 6, 8, 6, 9, 5, 6],
[ 5, 2, 7, 6, 7, 7, 3, 2, 9, 4],
[ 8, 4, 9, 7, 4, 5, 5, 6, 8, 2],
[ 4, 4, 6, 3, 9, 9, 10, 6, 7, 1],
[ 5, 5, 9, 2, 3, 4, 5, 7, 2, 5],
[ 2, 7, 6, 4, 6, 5, 6, 4, 5, 8],
[ 4, 6, 5, 4, 3, 4, 4, 2, 5, 3]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
m = torch.distributions.poisson.Poisson(lam)
m.sample((10, 10))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[ 6., 6., 6., 5., 4., 5., 4., 6., 2., 4.],
[ 6., 9., 6., 11., 6., 4., 4., 5., 5., 5.],
[ 6., 4., 6., 5., 4., 5., 3., 4., 5., 2.],
[ 4., 8., 6., 8., 3., 4., 3., 4., 6., 3.],
[ 3., 6., 1., 3., 3., 6., 2., 4., 3., 3.],
[ 2., 6., 4., 8., 1., 1., 4., 3., 6., 2.],
[ 7., 3., 3., 4., 8., 6., 6., 4., 7., 10.],
[ 5., 7., 4., 4., 2., 2., 6., 8., 5., 8.],
[ 3., 2., 11., 6., 3., 8., 3., 1., 5., 11.],
[ 4., 3., 5., 2., 8., 2., 11., 5., 10., 8.]])
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
m = tfp.distributions.Poisson(lam)
m.sample((10, 10))
Gaussian
--------
Now let us try a different, but related, experiment. Let us say we again
are performing :math:`n` independent :math:`\mathrm{Bernoulli}(p)`
measurements :math:`X_i`. The distribution of the sum of these is
:math:`X^{(n)} \sim \mathrm{Binomial}(n, p)`. Rather than taking a limit
as :math:`n` increases and :math:`p` decreases, let us fix :math:`p`,
and then send :math:`n \rightarrow \infty`. In this case
:math:`\mu_{X^{(n)}} = np \rightarrow \infty` and
:math:`\sigma_{X^{(n)}}^2 = np(1-p) \rightarrow \infty`, so there is no
reason to think this limit should be well defined.
However, not all hope is lost! Let us just make the mean and variance be
well behaved by defining
.. math::
Y^{(n)} = \frac{X^{(n)} - \mu_{X^{(n)}}}{\sigma_{X^{(n)}}}.
This can be seen to have mean zero and variance one, and so it is
plausible to believe that it will converge to some limiting
distribution. If we plot what these distributions look like, we will
become even more convinced that it will work.
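Before plotting, we can verify the standardization numerically; this
sketch (our own, using exact binomial probabilities) checks the
mean-zero, variance-one claim:

```python
from math import comb, sqrt

# If X ~ Binomial(n, p), then Y = (X - n*p) / sqrt(n*p*(1 - p)) has
# mean zero and variance one.
n, p = 100, 0.2
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
ys = [(k - n * p) / sqrt(n * p * (1 - p)) for k in range(n + 1)]
mean = sum(y * q for y, q in zip(ys, pmf))
var = sum((y - mean) ** 2 * q for y, q in zip(ys, pmf))
print(mean, var)  # approximately 0.0 and 1.0
```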
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
n = ns[i]
pmf = np.array([p**i * (1-p)**(n-i) * binom(n, i) for i in range(n + 1)])
d2l.plt.subplot(1, 4, i + 1)
d2l.plt.stem([(i - n*p)/np.sqrt(n*p*(1 - p)) for i in range(n + 1)], pmf,
use_line_collection=True)
d2l.plt.xlim([-4, 4])
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.title("n = {}".format(n))
d2l.plt.show()
.. figure:: output_distributions_c7d568_195_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
n = ns[i]
pmf = torch.tensor([p**i * (1-p)**(n-i) * binom(n, i)
for i in range(n + 1)])
d2l.plt.subplot(1, 4, i + 1)
d2l.plt.stem([(i - n*p)/torch.sqrt(torch.tensor(n*p*(1 - p)))
for i in range(n + 1)], pmf,
use_line_collection=True)
d2l.plt.xlim([-4, 4])
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.title("n = {}".format(n))
d2l.plt.show()
.. figure:: output_distributions_c7d568_198_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
n = ns[i]
pmf = tf.constant([p**i * (1-p)**(n-i) * binom(n, i)
for i in range(n + 1)])
d2l.plt.subplot(1, 4, i + 1)
d2l.plt.stem([(i - n*p)/tf.sqrt(tf.constant(n*p*(1 - p)))
for i in range(n + 1)], pmf,
use_line_collection=True)
d2l.plt.xlim([-4, 4])
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.title("n = {}".format(n))
d2l.plt.show()
.. figure:: output_distributions_c7d568_201_0.svg
One thing to note: compared to the Poisson case, we are now dividing by
the standard deviation which means that we are squeezing the possible
outcomes into smaller and smaller areas. This is an indication that our
limit will no longer be discrete, but rather continuous.
A derivation of what occurs is beyond the scope of this document, but
the *central limit theorem* states that as :math:`n \rightarrow \infty`,
this will yield the Gaussian distribution (sometimes called the normal
distribution). More explicitly, for any :math:`a, b`:
.. math::
\lim_{n \rightarrow \infty} P(Y^{(n)} \in [a, b]) = P(\mathcal{N}(0,1) \in [a, b]),
where we say a random variable is normally distributed with given mean
:math:`\mu` and variance :math:`\sigma^2`, written
:math:`X \sim \mathcal{N}(\mu, \sigma^2)` if :math:`X` has density
.. math:: p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.
:label: eq_gaussian_pdf
Let us first plot the probability density function
:eq:`eq_gaussian_pdf`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
mu, sigma = 0, 1
x = np.arange(-3, 3, 0.01)
p = 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))
d2l.plot(x, p, 'x', 'p.d.f.')
.. figure:: output_distributions_c7d568_207_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
mu, sigma = 0, 1
x = torch.arange(-3, 3, 0.01)
p = 1 / torch.sqrt(2 * torch.pi * sigma**2) * torch.exp(
-(x - mu)**2 / (2 * sigma**2))
d2l.plot(x, p, 'x', 'p.d.f.')
.. figure:: output_distributions_c7d568_210_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
mu, sigma = 0, 1
x = tf.range(-3, 3, 0.01)
p = 1 / tf.sqrt(2 * tf.pi * sigma**2) * tf.exp(
-(x - mu)**2 / (2 * sigma**2))
d2l.plot(x, p, 'x', 'p.d.f.')
.. figure:: output_distributions_c7d568_213_0.svg
Now, let us plot the cumulative distribution function. The Gaussian
c.d.f. does not have a closed-form expression in terms of more
elementary functions (a fact beyond the scope of this appendix), so we
will use ``erf``, which provides a way to compute this integral
numerically.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def phi(x):
return (1.0 + erf((x - mu) / (sigma * np.sqrt(2)))) / 2.0
d2l.plot(x, np.array([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_219_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def phi(x):
return (1.0 + erf((x - mu) / (sigma * torch.sqrt(torch.tensor(2.))))) / 2.0
d2l.plot(x, torch.tensor([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_222_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
def phi(x):
return (1.0 + erf((x - mu) / (sigma * tf.sqrt(tf.constant(2.))))) / 2.0
d2l.plot(x, [phi(y) for y in x.numpy().tolist()], 'x', 'c.d.f.')
.. figure:: output_distributions_c7d568_225_0.svg
.. raw:: html
.. raw:: html
Keen-eyed readers will recognize some of these terms. Indeed, we
encountered this integral in :numref:`sec_integral_calculus`, and we
need exactly that computation to see that this :math:`p_X(x)` has
total area one and is thus a valid density.
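As a quick sanity check, we can also approximate this integral numerically. The sketch below (plain NumPy, independent of the framework tabs above; the grid spacing is an arbitrary choice) confirms that the standard normal density has total area very close to one.

```python
import numpy as np

# Standard normal density evaluated on a fine grid.
mu, sigma = 0.0, 1.0
step = 0.001
x = np.arange(-8, 8, step)
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Riemann sum approximation: the mass outside [-8, 8] is negligible,
# so the total area should be extremely close to one.
area = p.sum() * step
print(area)
```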
Our choice of working with coin flips made computations shorter, but
nothing about that choice was fundamental. Indeed, if we take any
collection of independent identically distributed random variables
:math:`X_i` and form
.. math::
X^{(N)} = \sum_{i=1}^N X_i,
then
.. math::
\frac{X^{(N)} - \mu_{X^{(N)}}}{\sigma_{X^{(N)}}}
will be approximately Gaussian. There are additional requirements needed
to make it work, most commonly :math:`E[X^4] < \infty`, but the
philosophy is clear.
The central limit theorem is the reason that the Gaussian is fundamental
to probability, statistics, and machine learning. Whenever we can say
that something we measured is a sum of many small independent
contributions, we can assume that the thing being measured will be close
to Gaussian.
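We can see the central limit theorem in action with a short simulation. The sketch below (plain NumPy; the sample sizes are arbitrary choices for illustration) sums i.i.d. uniform random variables, standardizes the sums by their exact mean and standard deviation, and compares the fraction landing in :math:`[-1, 1]` to the roughly :math:`0.6827` mass a standard normal places there.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 10_000  # arbitrary sizes for this sketch

# Each row is n i.i.d. Uniform(0, 1) draws: mean 1/2, variance 1/12.
draws = rng.random((trials, n))

# Standardize the row sums by their exact mean n/2 and
# standard deviation sqrt(n/12).
z = (draws.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)

# A standard normal puts about 0.6827 of its mass in [-1, 1].
frac = np.mean(np.abs(z) <= 1)
print(frac)
```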
There are many more fascinating properties of Gaussians, and we would
like to discuss one more here. The Gaussian is what is known as a
*maximum entropy distribution*. We will get into entropy more deeply in
:numref:`sec_information_theory`; however, all we need to know at this
point is that it is a measure of randomness. In a rigorous mathematical
sense, we can think of the Gaussian as the *most* random choice of
random variable with fixed mean and variance. Thus, if we know that our
random variable has some mean and variance, the Gaussian is in a sense
the most conservative choice of distribution we can make.
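As a concrete illustration (a numerical sketch using closed-form differential entropies, not a proof), we can compare a unit-variance Gaussian with a uniform distribution of the same variance: the Gaussian's entropy comes out larger.

```python
import numpy as np

sigma2 = 1.0  # the shared, fixed variance

# Differential entropy of a Gaussian with variance sigma2 is
# (1/2) * log(2 * pi * e * sigma2).
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma2)

# A uniform distribution on [-a, a] has variance a**2 / 3, so matching
# the variance gives a = sqrt(3 * sigma2); its entropy is log(2 * a).
a = np.sqrt(3 * sigma2)
h_unif = np.log(2 * a)

print(h_gauss, h_unif)  # the Gaussian's entropy is larger
```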
To close the section, let us recall that if
:math:`X \sim \mathcal{N}(\mu, \sigma^2)`, then:
- :math:`\mu_X = \mu`,
- :math:`\sigma_X^2 = \sigma^2`.
We can sample from the Gaussian (or standard normal) distribution as
shown below.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
np.random.normal(mu, sigma, size=(10, 10))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
array([[ 0.37466166, -0.76533087, 0.20277123, -0.91834533, -0.23511579,
0.32962245, 0.47530781, 1.04093472, -0.28154386, 0.22441668],
[-0.12950418, -0.77205929, -0.4880254 , 0.10875764, -0.04052255,
-0.08743801, 1.22678556, -0.15588872, 1.3784915 , 0.02290818],
[ 0.01036102, -0.25575154, 0.66098918, 0.67726447, -0.43873404,
0.07055075, 2.05284435, 0.54967315, -0.52724163, -0.18386038],
[-0.18044691, 1.24174857, -1.18784052, -0.02105469, 1.56388046,
-0.10142488, -0.22924803, 1.22975985, -1.46625941, 1.49161461],
[ 0.73418028, -0.46680701, 0.00707471, 0.38572951, 0.06672456,
-1.14038687, 1.91556069, 1.40354565, -1.07657526, -0.23493843],
[ 0.28348367, -1.40021493, 1.45461971, -0.23070793, 0.17006662,
-0.74706953, 0.01791944, 0.57150978, -0.75592796, 0.74006447],
[ 1.21916596, 0.50411463, -0.37770721, -0.06635446, 0.51047576,
0.01586193, 0.45380373, -0.5869719 , 0.12794837, 2.31662748],
[-0.42322633, -2.03769949, 1.10566508, 0.05117499, -2.13285167,
-0.16364563, -0.3739578 , -1.18988118, 0.42607967, -0.07594401],
[ 0.55629045, 1.07826704, 0.5272042 , -0.02162458, -1.25685522,
-0.74149614, -0.39005934, 0.3897774 , -2.31508359, 1.20514051],
[ 0.76944418, -1.44406731, -0.29943536, -0.2534239 , -0.18990178,
-1.65955667, -0.56773455, -1.00413216, -1.16556254, 1.35313613]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
torch.normal(mu, sigma, size=(10, 10))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
tensor([[ 0.4230, -1.1328, -1.1626, 0.1155, 2.0323, 1.6196, -1.5788, 1.2823,
-0.0373, -1.1836],
[-1.1139, 1.2677, 1.3402, 0.2904, -0.0678, 1.1192, -0.2749, -0.8243,
-0.0026, 0.9376],
[-1.0510, -0.7684, -2.3318, 0.2474, -1.5833, -0.9553, -0.8174, 1.8287,
0.2391, -0.9564],
[ 1.9425, -1.2161, 2.2652, 0.1593, 0.3251, 0.1161, -0.9569, -0.0853,
-0.3341, -1.9039],
[-0.1867, -0.9346, -1.0432, -1.4792, 0.2808, 1.4643, 1.4855, -0.9092,
-0.0952, 2.3808],
[-1.1229, -0.5188, -2.0822, -1.0427, 0.9299, 0.8141, 1.2516, -0.6690,
1.1480, 0.5005],
[ 0.5415, -1.4820, -0.1676, -2.2398, -0.6275, -1.7631, 1.1156, 0.2063,
0.1434, 1.1345],
[ 0.3431, -1.0499, -0.5353, -0.8966, 0.1571, 0.3468, 1.1238, -0.0806,
-0.0720, 1.1355],
[-1.1069, -1.9573, -0.9190, 0.3881, -0.8016, 0.2344, -0.3833, 0.3867,
-1.2964, 1.1364],
[-0.6758, 0.1725, -0.2003, -0.6548, 0.4995, -0.2108, -0.4820, -0.5391,
-0.8692, 2.3844]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
tf.random.normal((10, 10), mu, sigma)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
.. raw:: html
.. raw:: html
.. _subsec_exponential_family:
Exponential Family
------------------
One property shared by all the distributions listed above is that they
all belong to what is known as the *exponential family*. The
exponential family is a set of distributions whose density can be
expressed in the following form:
.. math:: p(\mathbf{x} | \boldsymbol{\eta}) = h(\mathbf{x}) \cdot \mathrm{exp} \left( \boldsymbol{\eta}^{\top} \cdot T(\mathbf{x}) - A(\boldsymbol{\eta}) \right)
:label: eq_exp_pdf
As this definition can be a little subtle, let us examine it closely.
First, :math:`h(\mathbf{x})` is known as the *underlying measure* or the
*base measure*. This can be viewed as an original choice of measure we
are modifying with our exponential weight.
Second, we have the vector
:math:`\boldsymbol{\eta} = (\eta_1, \eta_2, ..., \eta_l) \in \mathbb{R}^l`
called the *natural parameters* or *canonical parameters*. These define
how the base measure will be modified: the natural parameters enter the
new measure through their dot product with some function
:math:`T(\cdot)` of
:math:`\mathbf{x}= (x_1, x_2, ..., x_n) \in \mathbb{R}^n`, which is then
exponentiated. The vector
:math:`T(\mathbf{x})= (T_1(\mathbf{x}), T_2(\mathbf{x}), ..., T_l(\mathbf{x}))`
is called the *sufficient statistics* for :math:`\boldsymbol{\eta}`.
This name is used since the information represented by
:math:`T(\mathbf{x})` is sufficient to calculate the probability
density, and no other information from the sample :math:`\mathbf{x}` is
required.
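To make "sufficient" concrete, here is a small sketch (plain NumPy; the sample and parameter values are arbitrary) showing that the Gaussian log-likelihood of a sample depends on the data only through :math:`\sum_i x_i` and :math:`\sum_i x_i^2` — once those two numbers are stored, the raw sample is never needed again.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.4, 1.2, size=50)  # a hypothetical sample
mu, sigma = 0.1, 0.9                  # arbitrary parameters to evaluate at

# Direct Gaussian log-likelihood of the full sample.
ll_direct = np.sum(-(data - mu)**2 / (2 * sigma**2)
                   - 0.5 * np.log(2 * np.pi * sigma**2))

# The same value computed from the sufficient statistics
# (sum of x, sum of x**2) alone, without revisiting the raw sample.
n, s1, s2 = len(data), data.sum(), (data**2).sum()
ll_suff = (mu * s1 / sigma**2 - s2 / (2 * sigma**2)
           - n * (mu**2 / (2 * sigma**2)
                  + 0.5 * np.log(2 * np.pi * sigma**2)))

print(np.isclose(ll_direct, ll_suff))  # True
```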
Third, we have :math:`A(\boldsymbol{\eta})`, referred to as the
*cumulant function*, which ensures that the above distribution
:eq:`eq_exp_pdf` integrates to one, i.e.,
.. math::
A(\boldsymbol{\eta}) = \log \left[\int h(\mathbf{x}) \cdot \mathrm{exp}
\left(\boldsymbol{\eta}^{\top} \cdot T(\mathbf{x}) \right) d\mathbf{x} \right].
To be concrete, let us consider the Gaussian. Assuming that
:math:`x` is a univariate variable, we saw that it had a
density of
.. math::
\begin{aligned}
p(x | \mu, \sigma) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \cdot \mathrm{exp}
\left\{ \frac{-(x-\mu)^2}{2 \sigma^2} \right\} \\
&= \frac{1}{\sqrt{2 \pi}} \cdot \mathrm{exp} \left\{ \frac{\mu}{\sigma^2}x
-\frac{1}{2 \sigma^2} x^2 - \left( \frac{1}{2 \sigma^2} \mu^2
+\log(\sigma) \right) \right\}.
\end{aligned}
This matches the definition of the exponential family with:
- *underlying measure*: :math:`h(x) = \frac{1}{\sqrt{2 \pi}}`,
- *natural parameters*:
:math:`\boldsymbol{\eta} = \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ \frac{1}{2 \sigma^2} \end{bmatrix}`,
- *sufficient statistics*:
:math:`T(x) = \begin{bmatrix}x\\-x^2\end{bmatrix}`, and
- *cumulant function*:
:math:`A({\boldsymbol\eta}) = \frac{1}{2 \sigma^2} \mu^2 + \log(\sigma) = \frac{\eta_1^2}{4 \eta_2} - \frac{1}{2}\log(2 \eta_2)`.
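We can verify this decomposition numerically. The sketch below (with arbitrary :math:`\mu` and :math:`\sigma` test values) evaluates both the standard form and the exponential family form of the Gaussian density and checks that they agree pointwise.

```python
import numpy as np

mu, sigma = 0.7, 1.3  # arbitrary test values
x = np.array([-2.0, -0.5, 0.0, 1.0, 2.5])

# Standard form of the Gaussian density.
p_std = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Exponential family form: h(x) * exp(eta . T(x) - A(eta)).
h = 1 / np.sqrt(2 * np.pi)                       # underlying measure
eta = np.array([mu / sigma**2, 1 / (2 * sigma**2)])  # natural parameters
T = np.stack([x, -x**2])                         # sufficient statistics
A = eta[0]**2 / (4 * eta[1]) - 0.5 * np.log(2 * eta[1])  # cumulant function
p_exp = h * np.exp(eta @ T - A)

print(np.allclose(p_std, p_exp))  # True
```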
It is worth noting that the exact choice of each of the above terms is
somewhat arbitrary. Indeed, the important feature is that the
distribution can be expressed in this form, not the exact form itself.
As we allude to in :numref:`subsec_softmax_and_derivatives`, a widely
used technique is to assume that the final output :math:`\mathbf{y}`
follows an exponential family distribution. The exponential family is a
common and powerful family of distributions encountered frequently in
machine learning.
Summary
-------
- Bernoulli random variables can be used to model events with a yes/no
outcome.
- Discrete uniform distributions model selection from a finite set of
  possibilities.
- Continuous uniform distributions select from an interval.
- Binomial distributions model a series of Bernoulli random variables,
and count the number of successes.
- Poisson random variables model the arrival of rare events.
- Gaussian random variables model the result of adding a large number
of independent random variables together.
- All the above distributions belong to the exponential family.
Exercises
---------
1. What is the standard deviation of a random variable that is the
difference :math:`X-Y` of two independent binomial random variables
:math:`X, Y \sim \mathrm{Binomial}(16, 1/2)`?
2. If we take a Poisson random variable
:math:`X \sim \mathrm{Poisson}(\lambda)` and consider
:math:`(X - \lambda)/\sqrt{\lambda}` as
:math:`\lambda \rightarrow \infty`, we can show that this becomes
approximately Gaussian. Why does this make sense?
3. What is the probability mass function for a sum of two discrete
uniform random variables on :math:`n` elements?