.. _sec_naive_bayes:
Naive Bayes
===========
Throughout the previous sections, we learned about the theory of
probability and random variables. To put this theory to work, let us
introduce the *naive Bayes* classifier, which uses nothing but
probabilistic fundamentals to perform classification of handwritten
digits.

Learning is all about making assumptions. If we want to classify a new
data example that we have never seen before, we have to make some
assumptions about which data examples are similar to each other. The
naive Bayes classifier, a popular and remarkably clear algorithm,
assumes that all features are independent of each other in order to
simplify the computation. In this section, we will apply this model to
recognize characters in images.
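
To make this assumption precise, recall Bayes' theorem. For a feature
vector :math:`\mathbf{x} = (x_1, x_2, \ldots, x_d)` and a label
:math:`y`,

.. math:: p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})} \propto p(\mathbf{x} \mid y)\, p(y).

The naive Bayes assumption is that the features are conditionally
independent given the label, so that

.. math:: p(\mathbf{x} \mid y) = \prod_{i=1}^{d} p(x_i \mid y),

and the classifier predicts the label with the largest value of
:math:`p(y) \prod_{i=1}^{d} p(x_i \mid y)`. The rest of this section
estimates these quantities from the MNIST training set and evaluates
the resulting classifier on the test set.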
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    %matplotlib inline
    import math
    from mxnet import gluon, np, npx
    from d2l import mxnet as d2l
    
    npx.set_np()
    d2l.use_svg_display()
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    %matplotlib inline
    import math
    import torch
    import torchvision
    from d2l import torch as d2l
    
    d2l.use_svg_display()
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    %matplotlib inline
    import math
    import tensorflow as tf
    from d2l import tensorflow as d2l
    
    d2l.use_svg_display()
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def transform(data, label):
        # Binarize: pixels of 128 or above become 1, the rest 0; drop channel axis
        return np.floor(data.astype('float32') / 128).squeeze(axis=-1), label
    
    mnist_train = gluon.data.vision.MNIST(train=True, transform=transform)
    mnist_test = gluon.data.vision.MNIST(train=False, transform=transform)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    data_transform = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        # ToTensor scales pixels to [0, 1]; rescaling to [0, 255] and
        # floor-dividing by 128 binarizes them, then drop the channel axis
        lambda x: torch.floor(x * 255 / 128).squeeze(dim=0)
    ])
    
    mnist_train = torchvision.datasets.MNIST(
        root='./temp', train=True, transform=data_transform, download=True)
    mnist_test = torchvision.datasets.MNIST(
        root='./temp', train=False, transform=data_transform, download=True)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
    Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./temp/MNIST/raw/train-images-idx3-ubyte.gz
    100.0%
    Extracting ./temp/MNIST/raw/train-images-idx3-ubyte.gz to ./temp/MNIST/raw
    100.0%
    Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
    Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./temp/MNIST/raw/train-labels-idx1-ubyte.gz
    Extracting ./temp/MNIST/raw/train-labels-idx1-ubyte.gz to ./temp/MNIST/raw
    
    Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
    
    Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./temp/MNIST/raw/t10k-images-idx3-ubyte.gz
    100.0%
    100.0%
    Extracting ./temp/MNIST/raw/t10k-images-idx3-ubyte.gz to ./temp/MNIST/raw
    
    Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
    Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./temp/MNIST/raw/t10k-labels-idx1-ubyte.gz
    Extracting ./temp/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./temp/MNIST/raw
    
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    ((train_images, train_labels), (
        test_images, test_labels)) = tf.keras.datasets.mnist.load_data()
    
    # Original pixel values of MNIST range from 0-255 (as the digits are stored
    # as uint8). For this section, pixel values of 128 or above (in the original
    # image) are converted to 1 and values below 128 are converted to 0. See
    # sections 18.9.2 and 18.9.3 for why
    train_images = tf.floor(tf.constant(train_images / 128, dtype=tf.float32))
    test_images = tf.floor(tf.constant(test_images / 128, dtype=tf.float32))

    train_labels = tf.constant(train_labels, dtype=tf.int32)
    test_labels = tf.constant(test_labels, dtype=tf.int32)
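
In all three implementations, each :math:`28 \times 28` grayscale image
is binarized by thresholding its pixels, so every feature is either 0
or 1:

.. math:: x_i = \begin{cases} 1 & \text{if pixel } i \geq 128, \\ 0 & \text{otherwise,} \end{cases}

which keeps each conditional distribution :math:`p(x_i \mid y)` a
simple Bernoulli distribution that we can estimate by counting.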
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    image, label = mnist_train[2]
    image.shape, label
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    ((28, 28), array(4, dtype=int32))
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    image, label = mnist_train[2]
    image.shape, label
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (torch.Size([28, 28]), 4)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    image, label = train_images[2], train_labels[2]
    image.shape, label.numpy()
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (TensorShape([28, 28]), 4)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    image.shape, image.dtype
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    ((28, 28), dtype('float32'))
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    image.shape, image.dtype
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (torch.Size([28, 28]), torch.float32)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    image.shape, image.dtype
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (TensorShape([28, 28]), tf.float32)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    label, type(label), label.dtype
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (array(4, dtype=int32), mxnet.numpy.ndarray, dtype('int32'))
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    label, type(label)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (4, int)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    label.numpy(), label.dtype
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (4, tf.int32)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    images, labels = mnist_train[10:38]
    images.shape, labels.shape
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    ((28, 28, 28), (28,))
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    images = torch.stack([mnist_train[i][0] for i in range(10, 38)], dim=0)
    labels = torch.tensor([mnist_train[i][1] for i in range(10, 38)])
    images.shape, labels.shape
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (torch.Size([28, 28, 28]), torch.Size([28]))
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    images = tf.stack([train_images[i] for i in range(10, 38)], axis=0)
    labels = tf.constant([train_labels[i].numpy() for i in range(10, 38)])
    images.shape, labels.shape
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    (TensorShape([28, 28, 28]), TensorShape([28]))
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    d2l.show_images(images, 2, 9);
.. figure:: output_naive-bayes_6e475d_75_0.svg
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    d2l.show_images(images, 2, 9);
.. figure:: output_naive-bayes_6e475d_78_0.svg
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    d2l.show_images(images, 2, 9);
.. figure:: output_naive-bayes_6e475d_81_0.svg
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    X, Y = mnist_train[:]  # All training examples
    
    n_y = np.zeros((10))
    for y in range(10):
        n_y[y] = (Y == y).sum()
    P_y = n_y / n_y.sum()
    P_y
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    array([0.09871667, 0.11236667, 0.0993    , 0.10218333, 0.09736667,
           0.09035   , 0.09863333, 0.10441667, 0.09751666, 0.09915   ])
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    X = torch.stack([mnist_train[i][0] for i in range(len(mnist_train))], dim=0)
    Y = torch.tensor([mnist_train[i][1] for i in range(len(mnist_train))])
    
    n_y = torch.zeros(10)
    for y in range(10):
        n_y[y] = (Y == y).sum()
    P_y = n_y / n_y.sum()
    P_y
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    tensor([0.0987, 0.1124, 0.0993, 0.1022, 0.0974, 0.0904, 0.0986, 0.1044, 0.0975,
            0.0992])
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    X = train_images
    Y = train_labels
    
    n_y = tf.Variable(tf.zeros(10))
    for y in range(10):
        n_y[y].assign(tf.reduce_sum(tf.cast(Y == y, tf.float32)))
    P_y = n_y / tf.reduce_sum(n_y)
    P_y
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    n_x = np.zeros((10, 28, 28))
    for y in range(10):
        n_x[y] = np.array(X.asnumpy()[Y.asnumpy() == y].sum(axis=0))
    P_xy = (n_x + 1) / (n_y + 1).reshape(10, 1, 1)
    
    d2l.show_images(P_xy, 2, 5);
.. figure:: output_naive-bayes_6e475d_99_0.svg
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    n_x = torch.zeros((10, 28, 28))
    for y in range(10):
        n_x[y] = torch.tensor(X.numpy()[Y.numpy() == y].sum(axis=0))
    P_xy = (n_x + 1) / (n_y + 1).reshape(10, 1, 1)
    
    d2l.show_images(P_xy, 2, 5);
.. figure:: output_naive-bayes_6e475d_102_0.svg
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    n_x = tf.Variable(tf.zeros((10, 28, 28)))
    for y in range(10):
        n_x[y].assign(tf.cast(tf.reduce_sum(
            X.numpy()[Y.numpy() == y], axis=0), tf.float32))
    P_xy = (n_x + 1) / tf.reshape((n_y + 1), (10, 1, 1))
    
    d2l.show_images(P_xy, 2, 5);
.. figure:: output_naive-bayes_6e475d_105_0.svg
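
The two arrays computed above are exactly the estimates the classifier
needs. Writing :math:`n` for the number of training images,
:math:`n_y` for the number of images with label :math:`y`, and
:math:`n_{iy}` for the number of those images whose pixel :math:`i`
equals 1, the code uses

.. math:: \hat{p}(y) = \frac{n_y}{n}, \qquad \hat{p}(x_i = 1 \mid y) = \frac{n_{iy} + 1}{n_y + 1},

where adding 1 to the counts (Laplace smoothing) keeps
:math:`\hat{p}(x_i = 1 \mid y)` away from zero even for pixels that are
never on for some class. Given these estimates, the prediction in the
next code block evaluates, for a binarized image :math:`\mathbf{x}`,

.. math:: \hat{p}(y \mid \mathbf{x}) \propto \hat{p}(y) \prod_{i} \hat{p}(x_i = 1 \mid y)^{x_i} \bigl(1 - \hat{p}(x_i = 1 \mid y)\bigr)^{1 - x_i},

which is what the expression ``P_xy * x + (1 - P_xy) * (1 - x)``
followed by a product over pixels computes.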
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def bayes_pred(x):
        x = np.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
        p_xy = P_xy * x + (1 - P_xy)*(1 - x)
        p_xy = p_xy.reshape(10, -1).prod(axis=1)  # p(x|y)
        return np.array(p_xy) * P_y
    
    image, label = mnist_test[0]
    bayes_pred(image)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def bayes_pred(x):
        x = x.unsqueeze(0)  # (28, 28) -> (1, 28, 28)
        p_xy = P_xy * x + (1 - P_xy)*(1 - x)
        p_xy = p_xy.reshape(10, -1).prod(dim=1)  # p(x|y)
        return p_xy * P_y
    
    image, label = mnist_test[0]
    bayes_pred(image)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def bayes_pred(x):
        x = tf.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
        p_xy = P_xy * x + (1 - P_xy)*(1 - x)
        p_xy = tf.math.reduce_prod(tf.reshape(p_xy, (10, -1)), axis=1)  # p(x|y)
        return p_xy * P_y
    
    image, label = train_images[0], train_labels[0]
    bayes_pred(image)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    a = 0.1
    print('underflow:', a**784)
    print('logarithm is normal:', 784*math.log(a))
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    underflow: 0.0
    logarithm is normal: -1805.2267129073316
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    a = 0.1
    print('underflow:', a**784)
    print('logarithm is normal:', 784*math.log(a))
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    underflow: 0.0
    logarithm is normal: -1805.2267129073316
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    a = 0.1
    print('underflow:', a**784)
    print('logarithm is normal:', 784*tf.math.log(a).numpy())
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    underflow: 0.0
    logarithm is normal: -1805.2267379760742
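
A product of 784 probabilities, each well below 1, underflows to zero
in floating point, as the experiment above shows. The standard fix is
to work in log space: maximizing the posterior is the same as
maximizing its logarithm,

.. math:: \log \hat{p}(y) + \sum_{i} \Bigl( x_i \log \hat{p}(x_i = 1 \mid y) + (1 - x_i) \log \bigl(1 - \hat{p}(x_i = 1 \mid y)\bigr) \Bigr),

which replaces the fragile product with a numerically well-behaved sum.
The ``bayes_pred_stable`` implementations below compute exactly this
quantity.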
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    log_P_xy = np.log(P_xy)
    log_P_xy_neg = np.log(1 - P_xy)
    log_P_y = np.log(P_y)
    
    def bayes_pred_stable(x):
        x = np.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
        p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
        p_xy = p_xy.reshape(10, -1).sum(axis=1)  # p(x|y)
        return p_xy + log_P_y
    
    py = bayes_pred_stable(image)
    py
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    array([-269.0042 , -301.73447, -245.21458, -218.8941 , -193.46907,
           -206.10315, -292.54315, -114.62834, -220.35619, -163.18881])
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    log_P_xy = torch.log(P_xy)
    log_P_xy_neg = torch.log(1 - P_xy)
    log_P_y = torch.log(P_y)
    
    def bayes_pred_stable(x):
        x = x.unsqueeze(0)  # (28, 28) -> (1, 28, 28)
        p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
        p_xy = p_xy.reshape(10, -1).sum(axis=1)  # p(x|y)
        return p_xy + log_P_y
    
    py = bayes_pred_stable(image)
    py
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    tensor([-269.0042, -301.7345, -245.2146, -218.8941, -193.4691, -206.1031,
            -292.5432, -114.6283, -220.3562, -163.1888])
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    log_P_xy = tf.math.log(P_xy)
    log_P_xy_neg = tf.math.log(1 - P_xy)
    log_P_y = tf.math.log(P_y)
    
    def bayes_pred_stable(x):
        x = tf.expand_dims(x, axis=0)  # (28, 28) -> (1, 28, 28)
        p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
        p_xy = tf.math.reduce_sum(tf.reshape(p_xy, (10, -1)), axis=1)  # p(x|y)
        return p_xy + log_P_y
    
    py = bayes_pred_stable(image)
    py
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    # Convert `label`, which is a scalar tensor of int32 dtype, to a Python
    # scalar integer for the comparison
    py.argmax(axis=0) == int(label)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    array(True)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    py.argmax(dim=0) == label
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    tensor(True)
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    tf.argmax(py, axis=0, output_type=tf.int32) == label
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def predict(X):
        return [bayes_pred_stable(x).argmax(axis=0).astype(np.int32) for x in X]
    
    X, y = mnist_test[:18]
    preds = predict(X)
    d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);
.. figure:: output_naive-bayes_6e475d_159_0.svg
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def predict(X):
        return [bayes_pred_stable(x).argmax(dim=0).type(torch.int32).item()
                for x in X]
    
    X = torch.stack([mnist_test[i][0] for i in range(18)], dim=0)
    y = torch.tensor([mnist_test[i][1] for i in range(18)])
    preds = predict(X)
    d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);
.. figure:: output_naive-bayes_6e475d_162_0.svg
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    def predict(X):
        return [tf.argmax(
            bayes_pred_stable(x), axis=0, output_type=tf.int32).numpy()
                for x in X]
    
    X = tf.stack([train_images[i] for i in range(10, 38)], axis=0)
    y = tf.constant([train_labels[i].numpy() for i in range(10, 38)])
    preds = predict(X)
    d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);
.. figure:: output_naive-bayes_6e475d_165_0.svg
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    X, y = mnist_test[:]
    preds = np.array(predict(X), dtype=np.int32)
    float((preds == y).sum()) / len(y)  # Validation accuracy
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    0.8426
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    X = torch.stack([mnist_test[i][0] for i in range(len(mnist_test))], dim=0)
    y = torch.tensor([mnist_test[i][1] for i in range(len(mnist_test))])
    preds = torch.tensor(predict(X), dtype=torch.int32)
    float((preds == y).sum()) / len(y)  # Validation accuracy
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    0.8426
.. raw:: latex
   \diilbookstyleinputcell
.. code:: python
    X = test_images
    y = test_labels
    preds = tf.constant(predict(X), dtype=tf.int32)
    # Validation accuracy
    tf.reduce_sum(tf.cast(preds == y, tf.float32)).numpy() / len(y)
.. raw:: latex
   \diilbookstyleoutputcell
.. parsed-literal::
    :class: output
    0.8426
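
The classifier labels roughly 84% of the test digits correctly using
nothing more than counting and the independence assumption. To tie the
pieces together, here is a compact NumPy-only sketch of the same
procedure for arbitrary binarized data; the class name
``BernoulliNaiveBayes``, the flattened ``(n, d)`` input layout, and the
small clipping guard are choices made for this sketch rather than part
of the code above.

.. code:: python

    import numpy as np

    class BernoulliNaiveBayes:
        """Naive Bayes for binary features, estimated by counting (sketch)."""
        def fit(self, X, y, n_classes=10):
            # X: (n, d) array of 0/1 features, y: (n,) integer labels
            n, d = X.shape
            self.log_prior = np.zeros(n_classes)
            self.log_p = np.zeros((n_classes, d))      # log p(x_i = 1 | y)
            self.log_not_p = np.zeros((n_classes, d))  # log p(x_i = 0 | y)
            for c in range(n_classes):
                Xc = X[y == c]
                # Same Laplace smoothing as above: (count + 1) / (n_c + 1);
                # clip to avoid log(0) if a pixel is always on for a class
                p = np.clip((Xc.sum(axis=0) + 1) / (len(Xc) + 1),
                            1e-12, 1 - 1e-12)
                self.log_prior[c] = np.log(len(Xc) / n)
                self.log_p[c] = np.log(p)
                self.log_not_p[c] = np.log(1 - p)
            return self

        def predict(self, X):
            # Log-posterior up to a constant:
            # log p(y) + sum_i [x_i log p_i + (1 - x_i) log(1 - p_i)]
            scores = (X @ self.log_p.T + (1 - X) @ self.log_not_p.T
                      + self.log_prior)
            return scores.argmax(axis=1)

With the binarized training images reshaped to shape ``(60000, 784)``
and the labels as an integer vector, ``BernoulliNaiveBayes().fit(...)``
followed by ``predict`` on the reshaped test images should reproduce
essentially the same accuracy as above.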