.. _chapter_prob:
Probability and Statistics
==========================
In some form or another, machine learning is all about making
predictions. We might want to predict the *probability* of a patient
suffering a heart attack in the next year, given their clinical history.
In anomaly detection, we might want to assess how *likely* a set of
readings from an airplane’s jet engine would be, were it operating
normally. In reinforcement learning, we want an agent to act
intelligently in an environment. This means we need to think about the
probability of getting a high reward under each of the available actions.
And when we build recommender systems we also need to think about
probability. For example, say *hypothetically* that we worked for a large
online bookseller. We might want to estimate the probability that a
particular user would buy a particular book. For this we need to use the
language of probability and statistics. Entire courses, majors, theses,
careers, and even departments, are devoted to probability. So naturally,
our goal in this section isn’t to teach the whole subject. Instead we
hope to get you off the ground, to teach you just enough that you can
start building your first machine learning models, and to give you
enough of a flavor for the subject that you can begin to explore it on
your own if you wish.
We’ve already invoked probabilities in previous sections without
articulating what precisely they are or giving a concrete example. Let’s
get more serious now by considering the problem of distinguishing cats
and dogs based on photographs. This might sound simple but it’s actually
a formidable challenge. To start with, the difficulty of the problem may
depend on the resolution of the image.
+----------+----------+----------+----------+----------+
| 10px | 20px | 40px | 80px | 160px |
+==========+==========+==========+==========+==========+
| |image0| | |image1| | |image2| | |image3| | |image4| |
+----------+----------+----------+----------+----------+
| |image5| | |image6| | |image7| | |image8| | |image9| |
+----------+----------+----------+----------+----------+
While it’s easy for humans to recognize cats and dogs at 160 pixel
resolution, it becomes challenging at 40 pixels and next to impossible
at 10 pixels. In other words, our ability to tell cats and dogs apart at
a large distance (and thus low resolution) might approach uninformed
guessing. Probability gives us a formal way of reasoning about our level
of certainty. If we are completely sure that the image depicts a cat, we
say that the *probability* that the corresponding label :math:`l` is
:math:`\mathrm{cat}`, denoted :math:`P(l=\mathrm{cat})` equals 1.0. If
we had no evidence to suggest that :math:`l =\mathrm{cat}` or that
:math:`l = \mathrm{dog}`, then we might say that the two possibilities
were equally *likely*, expressing this as
:math:`P(l=\mathrm{cat}) = 0.5`. If we were reasonably confident, but
not sure that the image depicted a cat, we might assign a probability
:math:`0.5 < P(l=\mathrm{cat}) < 1.0`.
Now consider a second case: given some weather monitoring data, we want
to predict the probability that it will rain in Taipei tomorrow. If it’s
summertime, the rain might come with probability :math:`0.5`. In both
cases, we have some value of interest. And in both cases we are
uncertain about the outcome. But there’s a key difference between the
two cases. In this first case, the image is in fact either a dog or a
cat, we just don’t know which. In the second case, the outcome may
actually be a random event, if you believe in such things (and most
physicists do). So probability is a flexible language for reasoning
about our level of certainty, and it can be applied effectively in a
broad set of contexts.
Basic probability theory
------------------------
Say that we cast a die and want to know what the chance is of seeing a
:math:`1` rather than another digit. If the die is fair, all six
outcomes :math:`\mathcal{X} = \{1, \ldots, 6\}` are equally likely to
occur, and thus we would see a :math:`1` in :math:`1` out of :math:`6`
cases. Formally we state that :math:`1` occurs with probability
:math:`\frac{1}{6}`.
For a real die that we receive from a factory, we might not know those
proportions and we would need to check whether it is tainted. The only
way to investigate the die is by casting it many times and recording the
outcomes. For each cast of the die, we’ll observe a value
:math:`\{1, 2, \ldots, 6\}`. Given these outcomes, we want to
investigate the probability of observing each outcome.
One natural approach for each value is to take the individual count for
that value and to divide it by the total number of tosses. This gives us
an *estimate* of the probability of a given event. The law of large
numbers tells us that as the number of tosses grows this estimate will
draw closer and closer to the true underlying probability. Before going
into the details of what’s going on here, let’s try it out.
To start, let’s import the necessary packages:
.. |image0| image:: ../img/whitecat10.jpg
.. |image1| image:: ../img/whitecat20.jpg
.. |image2| image:: ../img/whitecat40.jpg
.. |image3| image:: ../img/whitecat80.jpg
.. |image4| image:: ../img/whitecat160.jpg
.. |image5| image:: ../img/whitedog10.jpg
.. |image6| image:: ../img/whitedog20.jpg
.. |image7| image:: ../img/whitedog40.jpg
.. |image8| image:: ../img/whitedog80.jpg
.. |image9| image:: ../img/whitedog160.jpg
.. code:: python
%matplotlib inline
from IPython import display
import numpy as np
from mxnet import nd
import math
from matplotlib import pyplot as plt
import random
Next, we’ll want to be able to cast the die. In statistics we call this
process of drawing examples from probability distributions *sampling*.
The distribution that assigns probabilities to a number of discrete
choices is called the *multinomial* distribution (a single draw from it
is also known as the *categorical* distribution). We’ll give a more
formal definition of *distribution* later, but at a high level, think of
it as just an assignment of probabilities to events. In MXNet, we can
sample from the multinomial distribution via the aptly named
``nd.random.multinomial`` function. The function can be called in many
ways, but we’ll focus on the simplest. To draw a single sample, we
simply pass in a vector of probabilities.
.. code:: python
probabilities = nd.ones(6) / 6
nd.random.multinomial(probabilities)
.. parsed-literal::
:class: output
[3]
If you run the sampler a bunch of times, you’ll find that you get out
random values each time. As with estimating the fairness of a die, we
often want to generate many samples from the same distribution. It would
be unbearably slow to do this with a Python ``for`` loop, so
``nd.random.multinomial`` supports drawing multiple samples at once,
returning an array of independent samples in any shape we might desire.
.. code:: python
print(nd.random.multinomial(probabilities, shape=(10)))
print(nd.random.multinomial(probabilities, shape=(5,10)))
.. parsed-literal::
:class: output
[3 4 5 3 5 3 5 2 3 3]
[[2 2 1 5 0 5 1 2 2 4]
[4 3 2 3 2 5 5 0 2 0]
[3 0 2 4 5 4 0 5 5 5]
[2 4 4 2 3 4 4 0 4 3]
[3 0 3 5 4 3 0 2 2 1]]
Now that we know how to sample rolls of a die, we can simulate 1000
rolls. We can then go through and count, after each of the 1000 rolls,
how many times each number was rolled.
.. code:: python
rolls = nd.random.multinomial(probabilities, shape=(1000))
counts = nd.zeros((6,1000))
totals = nd.zeros(6)
for i, roll in enumerate(rolls):
totals[int(roll.asscalar())] += 1
counts[:, i] = totals
To start, we can inspect the final tally at the end of :math:`1000`
rolls.
.. code:: python
totals / 1000
.. parsed-literal::
:class: output
[0.167 0.168 0.175 0.159 0.158 0.173]
As you can see, the lowest estimated probability for any of the numbers
is about :math:`0.158` and the highest estimated probability is
:math:`0.175`. Because we generated the data from a fair die, we know
that each number actually has probability of :math:`1/6`, roughly
:math:`0.167`, so these estimates are pretty good. We can also visualize
how these probabilities converge over time towards reasonable estimates.
To start, let’s take a look at the ``counts`` array, which has shape
``(6, 1000)``. For each time step (out of 1000), ``counts`` says how
many times each of the numbers has shown up. So we can normalize the
:math:`j`-th column of the ``counts`` array by the number of tosses to
give the current estimated probabilities at that time. The ``counts``
array looks like this:
.. code:: python
counts
.. parsed-literal::
:class: output
[[ 0. 0. 0. ... 165. 166. 167.]
[ 1. 1. 1. ... 168. 168. 168.]
[ 0. 0. 0. ... 175. 175. 175.]
[ 0. 0. 0. ... 159. 159. 159.]
[ 0. 1. 2. ... 158. 158. 158.]
[ 0. 0. 0. ... 173. 173. 173.]]
Normalizing by the number of tosses, we get:
.. code:: python
x = nd.arange(1000).reshape((1,1000)) + 1
estimates = counts / x
print(estimates[:,0])
print(estimates[:,1])
print(estimates[:,100])
.. parsed-literal::
:class: output
[0. 1. 0. 0. 0. 0.]
[0. 0.5 0. 0. 0.5 0. ]
[0.1980198 0.15841584 0.17821783 0.18811882 0.12871288 0.14851485]
As you can see, after the first toss of the die, we get the extreme
estimate that one of the numbers will be rolled with probability
:math:`1.0` and that the others have probability :math:`0`. After
:math:`100` rolls, things already look a bit more reasonable. We can
visualize this convergence.
First we define a function that tells ``matplotlib`` to output SVG
figures for sharper images, and another one that sets the figure
size.
.. code:: python
# Save to the d2l package.
def use_svg_display():
"""Use the svg format to display plot in jupyter."""
display.set_matplotlib_formats('svg')
# Save to the d2l package.
def set_figsize(figsize=(3.5, 2.5)):
"""Change the default figure size"""
use_svg_display()
plt.rcParams['figure.figsize'] = figsize
Now visualize the data.
.. code:: python
set_figsize((6, 4))
for i in range(6):
plt.plot(estimates[i, :].asnumpy(), label=("P(die=" + str(i) +")"))
plt.axhline(y=0.16666, color='black', linestyle='dashed')
plt.legend();
.. figure:: output_probability_342e6e_17_0.svg
Each solid curve corresponds to one of the six values of the die and
gives our estimated probability that the die turns up that value as
assessed after each of the 1000 turns. The dashed black line gives the
true underlying probability. As we get more data, the solid curves
converge towards the true answer.
In our example of casting a die, we introduced the notion of a **random
variable**. A random variable, which we denote here as :math:`X`, can
be pretty much any quantity and is not deterministic. A random variable
takes one value from a set of possibilities. We denote sets with
brackets, e.g., :math:`\{\mathrm{cat}, \mathrm{dog}, \mathrm{rabbit}\}`.
The items contained in the set are called *elements*, and we can say
that an element :math:`x` is *in* the set :math:`S` by writing
:math:`x \in S`. The symbol :math:`\in` is read as “in” and denotes
membership. For
instance, we could truthfully say
:math:`\mathrm{dog} \in \{\mathrm{cat}, \mathrm{dog}, \mathrm{rabbit}\}`.
When dealing with the rolls of a die, we are concerned with a variable
:math:`X \in \{1, 2, 3, 4, 5, 6\}`.
Note that there is a subtle difference between discrete random
variables, like the sides of a die, and continuous ones, like the
weight and the height of a person. There’s little point in asking
whether two people have exactly the same height. If we take precise
enough measurements you’ll find that no two people on the planet have
the exact same height. In fact, if we take a fine enough measurement,
you will not have the same height when you wake up and when you go to
sleep. So there’s no purpose in asking about the probability that
someone is :math:`2.00139278291028719210196740527486202` meters tall.
Given the world population of humans the probability is virtually 0. It
makes more sense in this case to ask whether someone’s height falls into
a given interval, say between 1.99 and 2.01 meters. In these cases we
quantify the likelihood that we see a value as a *density*. The height
of exactly 2.0 meters has no probability, but nonzero density. In the
interval between any two different heights we have nonzero probability.
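To make the density idea concrete, here is a minimal sketch integrating a density over an interval to recover a probability. The height distribution below (normal, mean 1.75 m, standard deviation 0.07 m) is purely hypothetical, chosen for illustration:

```python
import numpy as np

# Hypothetical height distribution: normal, mean 1.75 m, std 0.07 m
# (illustrative numbers, not real population data)
mu, sigma = 1.75, 0.07

def density(h):
    return np.exp(-0.5 * ((h - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# The density at exactly 2.0 m is positive, but it is not a probability
d = density(2.0)

# Probabilities live in intervals: integrate the density over [1.99, 2.01]
h = np.linspace(1.99, 2.01, 1001)
f = density(h)
p_interval = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(h))  # trapezoidal rule
print(d, p_interval)
```

The printed probability is tiny (the interval sits far in the tail), but strictly positive, while the probability of *exactly* 2.0 meters is zero.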
There are a few important axioms of probability that you’ll want to
remember:
- For any event :math:`Z=z`, the probability is never negative, i.e.
:math:`\Pr(Z=z) \geq 0`.
- For any two events :math:`Z=z` and :math:`X=x` the union is no more
likely than the sum of the individual events, i.e.
:math:`\Pr(Z=z \cup X=x) \leq \Pr(Z=z) + \Pr(X=x)`.
- For any random variable, the probabilities of all the values it can
take must sum to 1, i.e. :math:`\sum_{i=1}^n \Pr(Z=z_i) = 1`.
- For any two *mutually exclusive* events :math:`Z=z` and :math:`X=x`,
the probability that either happens is equal to the sum of their
individual probabilities, that is
:math:`\Pr(Z=z \cup X=x) = \Pr(Z=z) + \Pr(X=x)`.
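These axioms can be sanity-checked on empirical frequencies. A minimal sketch using NumPy (a seeded generator simulating a fair die; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # simulated fair-die rolls

# Estimated probability of each face
p = np.array([(rolls == k).mean() for k in range(1, 7)])

assert np.all(p >= 0)              # probabilities are never negative
assert abs(p.sum() - 1.0) < 1e-12  # probabilities over all values sum to one

# For mutually exclusive events, the union adds up exactly
p_union = ((rolls == 1) | (rolls == 2)).mean()
assert abs(p_union - (p[0] + p[1])) < 1e-12
```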
Dealing with multiple random variables
--------------------------------------
Very often, we’ll want to consider more than one random variable at a
time. For instance, we may want to model the relationship between
diseases and symptoms. Given a disease and symptom, say ‘flu’ and
‘cough’, either may or may not occur in a patient with some probability.
While we hope that the probability of both would be close to zero, we
may want to estimate these probabilities and their relationships to each
other so that we may apply our inferences to effect better medical care.
As a more complicated example, images contain millions of pixels, thus
millions of random variables. And in many cases images will come with a
label, identifying objects in the image. We can also think of the label
as a random variable. We can even think of all the metadata as random
variables such as location, time, aperture, focal length, ISO, focus
distance, camera type, etc. All of these are random variables that occur
jointly. When we deal with multiple random variables, there are several
quantities of interest. The first is called the joint distribution
:math:`\Pr(A, B)`. Given any elements :math:`a` and :math:`b`, the joint
distribution lets us answer, what is the probability that :math:`A=a`
and :math:`B=b` simultaneously? Note that for any values :math:`a` and
:math:`b`, :math:`\Pr(A=a,B=b) \leq \Pr(A=a)`.
This has to be the case, since for :math:`A` and :math:`B` to happen,
:math:`A` has to happen *and* :math:`B` also has to happen (and vice
versa). Thus :math:`A,B` cannot be more likely than :math:`A` or
:math:`B` individually. This brings us to an interesting ratio:
:math:`0 \leq \frac{\Pr(A,B)}{\Pr(A)} \leq 1`. We call this a
**conditional probability** and denote it by :math:`\Pr(B | A)`, the
probability that :math:`B` happens, provided that :math:`A` has
happened.
Using the definition of conditional probabilities, we can derive one of
the most useful and celebrated equations in statistics—Bayes’ theorem.
It goes as follows: By construction, we have that
:math:`\Pr(A, B) = \Pr(B | A) \Pr(A)`. By symmetry, this also holds for
:math:`\Pr(A,B) = \Pr(A | B) \Pr(B)`. Solving for one of the conditional
variables we get:
.. math:: \Pr(A | B) = \frac{\Pr(B | A) \Pr(A)}{\Pr(B)}
This is very useful if we want to infer one thing from another, say
cause and effect but we only know the properties in the reverse
direction. One important operation that we need, to make this work, is
**marginalization**, i.e., the operation of determining :math:`\Pr(A)`
and :math:`\Pr(B)` from :math:`\Pr(A,B)`. We can see that the
probability of seeing :math:`A` amounts to accounting for all possible
choices of :math:`B` and aggregating the joint probabilities over all of
them, i.e.
.. math::
\Pr(A) = \sum_{B'} \Pr(A,B') \text{ and
} \Pr(B) = \sum_{A'} \Pr(A',B)
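To see marginalization and Bayes’ theorem in action on a tiny example, consider a hypothetical :math:`2 \times 2` joint distribution (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical joint distribution Pr(A, B): rows index a, columns index b
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_a = joint.sum(axis=1)  # marginalize out B: Pr(A)
p_b = joint.sum(axis=0)  # marginalize out A: Pr(B)

# Conditional probability Pr(B | A) = Pr(A, B) / Pr(A)
p_b_given_a = joint / p_a[:, None]

# Bayes' theorem: Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)
p_a_given_b = p_b_given_a * p_a[:, None] / p_b[None, :]

# It agrees with computing Pr(A | B) directly from the joint
assert np.allclose(p_a_given_b, joint / p_b[None, :])
```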
Another useful property to check for is **dependence** vs.
**independence**. Independence is when the occurrence of one event does
not reveal any information about the occurrence of the other. In this
case :math:`\Pr(B | A) = \Pr(B)`. Statisticians typically express this
as :math:`A \perp\!\!\!\perp B`. From Bayes’ Theorem, it follows
immediately that also :math:`\Pr(A | B) = \Pr(A)`. In all other cases we
call :math:`A` and :math:`B` dependent. For instance, two successive
rolls of a die are independent. On the other hand, the position of a
light switch and the brightness in the room are not (they are not
perfectly deterministic, though, since we could always have a broken
lightbulb, power failure, or a broken switch).
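Independence of two die rolls can be checked directly on the joint distribution, which for independent rolls is just the outer product of the marginals:

```python
import numpy as np

p = np.full(6, 1 / 6)   # fair die
joint = np.outer(p, p)  # joint distribution of two independent rolls

# Conditioning on the first roll leaves the second roll's distribution unchanged
p_second_given_first = joint / joint.sum(axis=1, keepdims=True)
assert np.allclose(p_second_given_first, p)  # Pr(B | A) = Pr(B) for every A
```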
Let’s put our skills to the test. Assume that a doctor administers an
AIDS test to a patient. This test is fairly accurate: if the patient is
healthy, it fails with only 1% probability, reporting him as diseased.
Moreover, it never fails to detect HIV if the patient actually
has it. We use :math:`D` to indicate the diagnosis and :math:`H` to
denote the HIV status. Written as a table the outcome :math:`\Pr(D | H)`
looks as follows:
+---------------+--------------+--------------+
| outcome | HIV positive | HIV negative |
+===============+==============+==============+
| Test positive | 1 | 0.01 |
+---------------+--------------+--------------+
| Test negative | 0 | 0.99 |
+---------------+--------------+--------------+
Note that the column sums are all one (but the row sums aren’t), since
the conditional probability needs to sum up to :math:`1`, just like the
probability. Let us work out the probability of the patient having AIDS
if the test comes back positive. Obviously this is going to depend on
how common the disease is, since it affects the number of false alarms.
Assume that the population is quite healthy, e.g.
:math:`\Pr(\text{HIV positive}) = 0.0015`. To apply Bayes’ Theorem, we
need to determine
.. math::
\begin{aligned}
\Pr(\text{Test positive}) =& \Pr(D=1 | H=0) \Pr(H=0) + \Pr(D=1
| H=1) \Pr(H=1) \\
=& 0.01 \cdot 0.9985 + 1 \cdot 0.0015 \\
=& 0.011485
\end{aligned}
Thus, we get
.. math:: \begin{aligned} \Pr(H = 1 | D = 1) =& \frac{\Pr(D=1 | H=1) \Pr(H=1)}{\Pr(D=1)} \\ =& \frac{1 \cdot 0.0015}{0.011485} \\ =& 0.131 \end{aligned}
In other words, there’s only a 13.1% chance that the patient actually
has AIDS, despite using a test that is 99% accurate. As we can see,
statistics can be quite counterintuitive.
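The arithmetic above fits in a few lines of Python, using the numbers from the table and the assumed prevalence:

```python
p_h1 = 0.0015        # Pr(H = 1): assumed prevalence
p_d1_given_h1 = 1.0  # the test always detects HIV
p_d1_given_h0 = 0.01 # 1% false-positive rate on healthy patients

# Marginalize to get Pr(D = 1)
p_d1 = p_d1_given_h0 * (1 - p_h1) + p_d1_given_h1 * p_h1

# Bayes' theorem: Pr(H = 1 | D = 1)
p_h1_given_d1 = p_d1_given_h1 * p_h1 / p_d1
print(round(p_d1, 6), round(p_h1_given_d1, 3))  # 0.011485 0.131
```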
Conditional independence
------------------------
What should a patient do upon receiving such terrifying news? Likely,
he/she would ask the physician to administer another test to get
clarity. The second test has different characteristics (it isn’t as good
as the first one).
+---------------+--------------+--------------+
| outcome | HIV positive | HIV negative |
+===============+==============+==============+
| Test positive | 0.98 | 0.03 |
+---------------+--------------+--------------+
| Test negative | 0.02 | 0.97 |
+---------------+--------------+--------------+
Unfortunately, the second test comes back positive, too. Let us work out
the requisite probabilities to invoke Bayes’ Theorem.
- :math:`\Pr(D_1 = 1 \text{ and } D_2 = 1 | H = 0) = 0.01 \cdot 0.03 = 0.0003`
- :math:`\Pr(D_1 = 1 \text{ and } D_2 = 1 | H = 1) = 1 \cdot 0.98 = 0.98`
- :math:`\Pr(D_1 = 1 \text{ and } D_2 = 1) = 0.0003 \cdot 0.9985 + 0.98 \cdot 0.0015 = 0.00176955`
- :math:`\Pr(H = 1 | D_1 = 1 \text{ and } D_2 = 1) = \frac{0.98 \cdot 0.0015}{0.00176955} = 0.831`
That is, the second test allowed us to gain much higher confidence that
not all is well. Despite the second test being considerably less
accurate than the first one, it still improved our estimate quite a bit.
You might ask, *why couldn’t we just run the first test a second time?*
After all, the first test was more accurate. The reason is that we
needed a second test whose result is *independent* of the first test
(given the true diagnosis). In other words, we made the tacit assumption
that :math:`\Pr(D_1, D_2 | H) = \Pr(D_1 | H) \Pr(D_2 | H)`.
Statisticians call such random variables **conditionally independent**.
This is expressed as :math:`D_1 \perp\!\!\!\perp D_2 | H`.
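Under this conditional-independence assumption the per-test likelihoods simply multiply given :math:`H`, so the whole update also fits in a few lines:

```python
p_h1 = 0.0015               # Pr(H = 1): assumed prevalence
# Conditionally independent tests: likelihoods multiply given H
p_dd_given_h1 = 1.0 * 0.98  # Pr(D1 = 1, D2 = 1 | H = 1)
p_dd_given_h0 = 0.01 * 0.03 # Pr(D1 = 1, D2 = 1 | H = 0)

p_dd = p_dd_given_h0 * (1 - p_h1) + p_dd_given_h1 * p_h1
p_h1_given_dd = p_dd_given_h1 * p_h1 / p_dd
print(round(p_h1_given_dd, 3))  # 0.831
```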
Sampling
--------
Often, when working with probabilistic models, we’ll want not just to
estimate distributions from data, but also to generate data by sampling
from distributions. One of the simplest ways to sample random numbers is
to invoke the ``random`` method from Python’s ``random`` package.
.. code:: python
for i in range(10):
print(random.random())
.. parsed-literal::
:class: output
0.5778223019680089
0.07226149467938947
0.36850735021156733
0.8847471310211052
0.9281154582580333
0.7410761914711824
0.6249478061105176
0.8975050243984971
0.2606026427252054
0.5530550591379313
Uniform Distribution
~~~~~~~~~~~~~~~~~~~~
These numbers likely *appear* random. Note that their range is between 0
and 1 and they are evenly distributed. Because these numbers are
generated by default from the uniform distribution, there should be no
two sub-intervals of :math:`[0,1]` of equal size where numbers are more
likely to lie in one interval than the other. In other words, the
chances of any of these numbers to fall into the interval
:math:`[0.2,0.3)` are the same as in the interval
:math:`[.593264, .693264)`. In fact, these numbers are pseudo-random,
and the computer generates them by first producing a random integer and
then dividing it by its maximum range. To sample random integers
directly, we can run the following snippet, which generates integers in
the range between 1 and 100.
.. code:: python
for i in range(10):
print(random.randint(1, 100))
.. parsed-literal::
:class: output
14
40
46
52
67
75
2
10
98
98
How might we check that ``randint`` is really uniform? Intuitively, the
best strategy would be to run the sampler many times, say 1 million, and
then count the number of times it generates each value to ensure that
the results are approximately uniform.
.. code:: python
counts = np.zeros(100)
fig, axes = plt.subplots(2, 2, sharex=True)
axes = axes.flatten()
# Mangle subplots such that we can index them in a linear fashion rather than
# a 2D grid
for i in range(1, 100001):
counts[random.randint(0, 99)] += 1
if i in [100, 1000, 10000, 100000]:
axes[int(math.log10(i))-2].bar(np.arange(1, 101), counts)
.. figure:: output_probability_342e6e_23_0.svg
We can see from these figures that the initial number of counts looks
*strikingly* uneven. If we sample fewer than 100 draws from a
distribution over 100 outcomes this should be expected. But even for
1000 samples there is a significant variability between the draws. What
we are really aiming for is a situation where the probability of drawing
a number :math:`x` is given by :math:`p(x)`.
The categorical distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Drawing from a uniform distribution over a set of 100 outcomes is
simple. But what if we have nonuniform probabilities? Let’s start with a
simple case, a biased coin which comes up heads with probability 0.35
and tails with probability 0.65. A simple way to sample from that is to
generate a uniform random variable over :math:`[0,1]` and if the number
is less than :math:`0.35`, we output heads and otherwise we generate
tails. Let’s try this out.
.. code:: python
# Number of samples
n = 1000000
y = np.random.uniform(0, 1, n)
x = np.arange(1, n+1)
# Count number of occurrences and divide by the number of total draws
p0 = np.cumsum(y < 0.35) / x
p1 = np.cumsum(y >= 0.35) / x
plt.semilogx(x, p0)
plt.semilogx(x, p1)
plt.axhline(y=0.35, color='black', linestyle='dashed')
plt.axhline(y=0.65, color='black', linestyle='dashed');
.. figure:: output_probability_342e6e_25_0.svg
As we can see, on average, this sampler will generate 35% zeros and 65%
ones. Now what if we have more than two possible outcomes? We can simply
generalize this idea as follows. Given any probability distribution,
e.g. :math:`p = [0.1, 0.2, 0.05, 0.3, 0.25, 0.1]` we can compute its
cumulative distribution (python’s ``cumsum`` will do this for you)
:math:`F = [0.1, 0.3, 0.35, 0.65, 0.9, 1]`. Once we have this we draw a
random variable :math:`x` from the uniform distribution :math:`U[0,1]`
and then find the interval where :math:`F[i-1] \leq x < F[i]`. We then
return :math:`i` as the sample. By construction, the chance of hitting
the interval :math:`[F[i-1], F[i])` is exactly :math:`p(i)`.
Note that there are many more efficient algorithms for sampling than the
one above. For instance, binary search over :math:`F` will run in
:math:`O(\log n)` time for :math:`n` random variables. There are even
more clever algorithms, such as the alias method, which samples in
constant time after :math:`O(n)` preprocessing.
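A minimal NumPy sketch of this inverse-CDF sampler, using binary search (``np.searchsorted``) to locate the interval for each uniform draw:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.05, 0.3, 0.25, 0.1])
F = np.cumsum(p)  # cumulative distribution function

rng = np.random.default_rng(0)
u = rng.uniform(size=1_000_000)
# Binary search for the interval with F[i-1] <= u < F[i]: O(log n) per draw
samples = np.searchsorted(F, u, side='right')

est = np.bincount(samples, minlength=len(p)) / len(u)
print(est)  # close to p
```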
The Normal distribution
~~~~~~~~~~~~~~~~~~~~~~~
The standard Normal distribution (aka the standard Gaussian
distribution) is given by
:math:`p(x) = \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{1}{2} x^2\right)`.
Let’s plot it to get a feel for it.
.. code:: python
x = np.arange(-10, 10, 0.01)
p = (1/math.sqrt(2 * math.pi)) * np.exp(-0.5 * x**2)
plt.plot(x, p);
.. figure:: output_probability_342e6e_27_0.svg
Sampling from this distribution is less trivial. First off, the support
is infinite, that is, for any :math:`x` the density :math:`p(x)` is
positive. Secondly, the density is nonuniform. There are many tricks for
sampling from it - the key idea in all algorithms is to stratify
:math:`p(x)` in such a way as to map it to the uniform distribution
:math:`U[0,1]`. One way to do this is with the probability integral
transform.
Denote by :math:`F(x) = \int_{-\infty}^x p(z) dz` the cumulative
distribution function (CDF) of :math:`p`. This is in a way the
continuous version of the cumulative sum that we used previously. In the
same way we can now define the inverse map :math:`F^{-1}(\xi)`, where
:math:`\xi` is drawn uniformly. Unlike previously where we needed to
find the correct interval for the vector :math:`F` (i.e. for the
piecewise constant function), we now invert the function :math:`F(x)`.
In practice, this is slightly more tricky since inverting the CDF is
hard in the case of a Gaussian. It turns out that the *two-dimensional*
integral is much easier to deal with, yielding two normal random
variables rather than one, albeit at the price of consuming two
uniformly distributed ones. For now, suffice it to say that there are
built-in algorithms to address this.
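One concrete instance of the two-dimensional idea mentioned above is the Box–Muller transform: two independent uniforms in, two independent standard normals out. A sketch (seeded for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
u1 = 1.0 - rng.uniform(size=500_000)  # shift to (0, 1] so log(u1) is finite
u2 = rng.uniform(size=500_000)

# Box-Muller transform: radius from u1, angle from u2
r = np.sqrt(-2.0 * np.log(u1))
z1 = r * np.cos(2.0 * np.pi * u2)
z2 = r * np.sin(2.0 * np.pi * u2)
print(z1.mean(), z1.std(), z2.mean(), z2.std())  # roughly 0, 1, 0, 1
```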
The normal distribution has yet another desirable property. In a way all
distributions converge to it, if we only average over a sufficiently
large number of draws from any other distribution. To understand this in
a bit more detail, we need to introduce three important things: expected
values, means and variances.
- The expected value :math:`\mathbf{E}_{x \sim p(x)}[f(x)]` of a
function :math:`f` under a distribution :math:`p` is given by the
integral :math:`\int_x p(x) f(x) dx`. That is, we average over all
possible outcomes, as given by :math:`p`.
- A particularly important expected value is that for the function
:math:`f(x) = x`, i.e. :math:`\mu := \mathbf{E}_{x \sim p(x)}[x]`. It
provides us with some idea about the typical values of :math:`x`.
- Another important quantity is the variance, i.e. the typical
deviation from the mean
:math:`\sigma^2 := \mathbf{E}_{x \sim p(x)}[(x-\mu)^2]`. Simple math
shows (check it as an exercise) that
:math:`\sigma^2 = \mathbf{E}_{x \sim p(x)}[x^2] - \mathbf{E}^2_{x \sim p(x)}[x]`.
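The variance identity is easy to verify numerically. Any distribution works; the exponential with scale 2 below (true variance :math:`\sigma^2 = 4`) is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # true variance is 4

mu = x.mean()
var_direct = ((x - mu) ** 2).mean()       # E[(x - mu)^2]
var_identity = (x ** 2).mean() - mu ** 2  # E[x^2] - (E[x])^2
print(var_direct, var_identity)           # both close to 4
```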
The above allows us to change both mean and variance of random
variables. Quite obviously for some random variable :math:`x` with mean
:math:`\mu`, the random variable :math:`x + c` has mean :math:`\mu + c`.
Moreover, :math:`\gamma x` has the variance :math:`\gamma^2 \sigma^2`.
Applying this to the normal distribution we see that one with mean
:math:`\mu` and variance :math:`\sigma^2` has the form
:math:`p(x) = \frac{1}{\sqrt{2 \sigma^2 \pi}} \exp\left(-\frac{1}{2 \sigma^2} (x-\mu)^2\right)`.
Note the scaling factor :math:`\frac{1}{\sigma}`—it arises from the fact
that if we stretch the distribution by :math:`\sigma`, we need to lower
it by :math:`\frac{1}{\sigma}` to retain the same probability mass
(i.e. the weight under the distribution always needs to integrate out to
1).
Now we are ready to state one of the most fundamental theorems in
statistics, the Central Limit Theorem. It
states that for sufficiently well-behaved random variables, in
particular random variables with well-defined mean and variance, the sum
tends toward a normal distribution. To get some idea, let’s repeat the
experiment described in the beginning, but now using random variables
with integer values of :math:`\{0, 1, 2\}`.
.. code:: python
# Generate 10 random sequences of 10,000 uniformly distributed random variables
tmp = np.random.uniform(size=(10000,10))
x = 1.0 * (tmp > 0.3) + 1.0 * (tmp > 0.8)
mean = 1 * 0.5 + 2 * 0.2
variance = 1 * 0.5 + 4 * 0.2 - mean**2
print('mean {}, variance {}'.format(mean, variance))
# Cumulative sum and normalization
y = np.arange(1,10001).reshape(10000,1)
z = np.cumsum(x,axis=0) / y
for i in range(10):
plt.semilogx(y,z[:,i])
plt.semilogx(y,(variance**0.5) * np.power(y,-0.5) + mean,'r')
plt.semilogx(y,-(variance**0.5) * np.power(y,-0.5) + mean,'r');
.. parsed-literal::
:class: output
mean 0.9, variance 0.49
.. figure:: output_probability_342e6e_29_1.svg
This looks very similar to the initial example, at least in the limit of
averages of large numbers of variables. This is confirmed by theory.
Denote the mean and variance of a random variable drawn from :math:`p` by
.. math:: \mu[p] := \mathbf{E}_{x \sim p(x)}[x] \text{ and } \sigma^2[p] := \mathbf{E}_{x \sim p(x)}[(x - \mu[p])^2]
Then we have that
:math:`\frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{x_i - \mu[p]}{\sigma[p]} \to \mathcal{N}(0, 1)` as :math:`n \to \infty`.
In other words, regardless of what we started out with, we will always
converge to a Gaussian. This is one of the reasons why Gaussians are so
popular in statistics.
More distributions
~~~~~~~~~~~~~~~~~~
Many more useful distributions exist. If you’re interested in going
deeper, we recommend consulting a dedicated book on statistics or
looking up some common distributions on Wikipedia for further detail.
Some important distributions to be aware of include:
- **Binomial Distribution** It is used to describe the distribution
over multiple draws from the same distribution, e.g. the number of
heads when tossing a biased coin (i.e. a coin with probability
:math:`\pi \in [0, 1]` of returning heads) 10 times. The binomial
probability is given by
:math:`p(x) = {n \choose x} \pi^x (1-\pi)^{n-x}`.
- **Multinomial Distribution** Often, we are concerned with more than
  two outcomes, e.g. when rolling a die multiple times. In this case,
  the distribution is given by
  :math:`p(x) = \frac{n!}{\prod_{i=1}^k x_i!} \prod_{i=1}^k \pi_i^{x_i}`.
- **Poisson Distribution** This distribution models the occurrence of
point events that happen with a given rate, e.g. the number of
raindrops arriving within a given amount of time in an area (weird
fact - the number of Prussian soldiers being killed by horses kicking
them followed that distribution). Given a rate :math:`\lambda`, the
number of occurrences is given by
:math:`p(x) = \frac{1}{x!} \lambda^x e^{-\lambda}`.
- **Beta, Dirichlet, Gamma, and Wishart Distributions** They are what
statisticians call *conjugate* to the Binomial, Multinomial, Poisson
and Gaussian respectively. Without going into detail, these
distributions are often used as priors for coefficients of the latter
set of distributions, e.g. a Beta distribution as a prior for
modeling the probability for binomial outcomes.
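As a quick sanity check on the binomial formula above, a sketch computing the full pmf for :math:`n = 10` tosses with :math:`\pi = 0.35` and verifying that the probabilities sum to one:

```python
import math

def binomial_pmf(x, n, pi):
    """Pr(x heads in n tosses of a coin with Pr(heads) = pi)."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)

pmf = [binomial_pmf(x, 10, 0.35) for x in range(11)]
print(sum(pmf))  # 1.0 up to floating-point rounding
```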
Summary
-------
So far, we covered probabilities, independence, conditional
independence, and how to use this to draw some basic conclusions. We
also introduced some fundamental probability distributions and
demonstrated how to sample from them using Apache MXNet. This is already
a powerful bit of knowledge, and by itself a sufficient set of tools for
developing some classic machine learning models. In the next section, we
will see how to operationalize this knowledge to build your first machine
learning model: the Naive Bayes classifier.
Exercises
---------
1. Given two events with probability :math:`\Pr(A)` and :math:`\Pr(B)`,
   compute upper and lower bounds on :math:`\Pr(A \cup B)` and
   :math:`\Pr(A \cap B)`. Hint - display the situation using a Venn
   diagram.
2. Assume that we have a sequence of events, say :math:`A`, :math:`B`
   and :math:`C`, where :math:`B` only depends on :math:`A` and
   :math:`C` only on :math:`B`. Can you simplify the joint probability?
   Hint - this is a Markov chain.
Scan the QR Code to Discuss
---------------------------
.. figure:: ../img/qr_probability.svg
qr