.. _sec_alexnet:

Deep Convolutional Neural Networks (AlexNet)
============================================


Although CNNs were well known in the computer vision and machine
learning communities following the introduction of LeNet, they did not
immediately dominate the field. Although LeNet achieved good results on
early small datasets, the performance and feasibility of training CNNs
on larger, more realistic datasets had yet to be established. In fact,
for much of the intervening time between the early 1990s and the
watershed results of 2012, neural networks were often surpassed by other
machine learning methods, such as support vector machines.

For computer vision, this comparison is perhaps not fair. That is
although the inputs to convolutional networks consist of raw or
lightly-processed (e.g., by centering) pixel values, practitioners would
never feed raw pixels into traditional models. Instead, typical computer
vision pipelines consisted of manually engineering feature extraction
pipelines. Rather than *learn the features*, the features were
*crafted*. Most of the progress came from having more clever ideas for
features, and the learning algorithm was often relegated to an
afterthought.

Although some neural network accelerators were available in the 1990s,
they were not yet sufficiently powerful to make deep multichannel,
multilayer CNNs with a large number of parameters. Moreover, datasets
were still relatively small. Added to these obstacles, key tricks for
training neural networks including parameter initialization heuristics,
clever variants of stochastic gradient descent, non-squashing activation
functions, and effective regularization techniques were still missing.

Thus, rather than training *end-to-end* (pixel to classification)
systems, classical pipelines looked more like this:

1. Obtain an interesting dataset. In early days, these datasets required
   expensive sensors (at the time, 1 megapixel images were
   state-of-the-art).
2. Preprocess the dataset with hand-crafted features based on some
   knowledge of optics, geometry, other analytic tools, and occasionally
   on the serendipitous discoveries of lucky graduate students.
3. Feed the data through a standard set of feature extractors such as
   the SIFT (scale-invariant feature transform) :cite:`Lowe.2004`, the
   SURF (speeded up robust features)
   :cite:`Bay.Tuytelaars.Van-Gool.2006`, or any number of other
   hand-tuned pipelines.
4. Dump the resulting representations into your favorite classifier,
   likely a linear model or kernel method, to train a classifier.

If you spoke to machine learning researchers, they believed that machine
learning was both important and beautiful. Elegant theories proved the
properties of various classifiers. The field of machine learning was
thriving, rigorous, and eminently useful. However, if you spoke to a
computer vision researcher, you would hear a very different story. The
dirty truth of image recognition, they would tell you, is that features,
not learning algorithms, drove progress. Computer vision researchers
justifiably believed that a slightly bigger or cleaner dataset or a
slightly improved feature-extraction pipeline mattered far more to the
final accuracy than any learning algorithm.

Learning Representations
------------------------

Another way to cast the state of affairs is that the most important part
of the pipeline was the representation. And up until 2012 the
representation was calculated mechanically. In fact, engineering a new
set of feature functions, improving results, and writing up the method
was a prominent genre of paper. SIFT :cite:`Lowe.2004`, SURF
:cite:`Bay.Tuytelaars.Van-Gool.2006`, HOG (histograms of oriented
gradient) :cite:`Dalal.Triggs.2005`, `bags of visual
words <https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer_vision>`__
and similar feature extractors ruled the roost.

Another group of researchers, including Yann LeCun, Geoff Hinton, Yoshua
Bengio, Andrew Ng, Shun-ichi Amari, and Juergen Schmidhuber, had
different plans. They believed that features themselves ought to be
learned. Moreover, they believed that to be reasonably complex, the
features ought to be hierarchically composed with multiple jointly
learned layers, each with learnable parameters. In the case of an image,
the lowest layers might come to detect edges, colors, and textures.
Indeed, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton proposed a new
variant of a CNN, *AlexNet*, that achieved excellent performance in the
2012 ImageNet challenge. AlexNet was named after Alex Krizhevsky, the
first author of the breakthrough ImageNet classification paper
:cite:`Krizhevsky.Sutskever.Hinton.2012`.

Interestingly in the lowest layers of the network, the model learned
feature extractors that resembled some traditional filters.
:numref:`fig_filters` is reproduced from the AlexNet paper
:cite:`Krizhevsky.Sutskever.Hinton.2012` and describes lower-level
image descriptors.

.. _fig_filters:

.. figure:: ../img/filters.png
   :width: 400px

   Image filters learned by the first layer of AlexNet.


Higher layers in the network might build upon these representations to
represent larger structures, like eyes, noses, blades of grass, and so
on. Even higher layers might represent whole objects like people,
airplanes, dogs, or frisbees. Ultimately, the final hidden state learns
a compact representation of the image that summarizes its contents such
that data belonging to different categories can be easily separated.

While the ultimate breakthrough for many-layered CNNs came in 2012, a
core group of researchers had dedicated themselves to this idea,
attempting to learn hierarchical representations of visual data for many
years. The ultimate breakthrough in 2012 can be attributed to two key
factors.

Missing Ingredient: Data
~~~~~~~~~~~~~~~~~~~~~~~~

Deep models with many layers require large amounts of data in order to
enter the regime where they significantly outperform traditional methods
based on convex optimizations (e.g., linear and kernel methods).
However, given the limited storage capacity of computers, the relative
expense of sensors, and the comparatively tighter research budgets in
the 1990s, most research relied on tiny datasets. Numerous papers
addressed the UCI collection of datasets, many of which contained only
hundreds or (a few) thousands of images captured in unnatural settings
with low resolution.

In 2009, the ImageNet dataset was released, challenging researchers to
learn models from 1 million examples, 1000 each from 1000 distinct
categories of objects. The researchers, led by Fei-Fei Li, who
introduced this dataset leveraged Google Image Search to prefilter large
candidate sets for each category and employed the Amazon Mechanical Turk
crowdsourcing pipeline to confirm for each image whether it belonged to
the associated category. This scale was unprecedented. The associated
competition, dubbed the ImageNet Challenge pushed computer vision and
machine learning research forward, challenging researchers to identify
which models performed best at a greater scale than academics had
previously considered.

Missing Ingredient: Hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deep learning models are voracious consumers of compute cycles. Training
can take hundreds of epochs, and each iteration requires passing data
through many layers of computationally-expensive linear algebra
operations. This is one of the main reasons why in the 1990s and early
2000s, simple algorithms based on the more-efficiently optimized convex
objectives were preferred.

*Graphical processing units* (GPUs) proved to be a game changer in
making deep learning feasible. These chips had long been developed for
accelerating graphics processing to benefit computer games. In
particular, they were optimized for high throughput :math:`4 \times 4`
matrix-vector products, which are needed for many computer graphics
tasks. Fortunately, this math is strikingly similar to that required to
calculate convolutional layers. Around that time, NVIDIA and ATI had
begun optimizing GPUs for general computing operations, going as far as
to market them as *general-purpose GPUs* (GPGPU).

To provide some intuition, consider the cores of a modern microprocessor
(CPU). Each of the cores is fairly powerful running at a high clock
frequency and sporting large caches (up to several megabytes of L3).
Each core is well-suited to executing a wide range of instructions, with
branch predictors, a deep pipeline, and other bells and whistles that
enable it to run a large variety of programs. This apparent strength,
however, is also its Achilles heel: general-purpose cores are very
expensive to build. They require lots of chip area, a sophisticated
support structure (memory interfaces, caching logic between cores,
high-speed interconnects, and so on), and they are comparatively bad at
any single task. Modern laptops have up to 4 cores, and even high-end
servers rarely exceed 64 cores, simply because it is not cost effective.

By comparison, GPUs consist of :math:`100 \sim 1000` small processing
elements (the details differ somewhat between NVIDIA, ATI, ARM and other
chip vendors), often grouped into larger groups (NVIDIA calls them
warps). While each core is relatively weak, sometimes even running at
sub-1GHz clock frequency, it is the total number of such cores that
makes GPUs orders of magnitude faster than CPUs. For instance, NVIDIA's
recent Volta generation offers up to 120 TFlops per chip for specialized
instructions (and up to 24 TFlops for more general-purpose ones), while
floating point performance of CPUs has not exceeded 1 TFlop to date. The
reason for why this is possible is actually quite simple: first, power
consumption tends to grow *quadratically* with clock frequency. Hence,
for the power budget of a CPU core that runs 4 times faster (a typical
number), you can use 16 GPU cores at :math:`1/4` the speed, which yields
:math:`16 \times 1/4 = 4` times the performance. Furthermore, GPU cores
are much simpler (in fact, for a long time they were not even *able* to
execute general-purpose code), which makes them more energy efficient.
Last, many operations in deep learning require high memory bandwidth.
Again, GPUs shine here with buses that are at least 10 times as wide as
many CPUs.

Back to 2012. A major breakthrough came when Alex Krizhevsky and Ilya
Sutskever implemented a deep CNN that could run on GPU hardware. They
realized that the computational bottlenecks in CNNs, convolutions and
matrix multiplications, are all operations that could be parallelized in
hardware. Using two NVIDIA GTX 580s with 3GB of memory, they implemented
fast convolutions. The code
`cuda-convnet <https://code.google.com/archive/p/cuda-convnet/>`__ was
good enough that for several years it was the industry standard and
powered the first couple years of the deep learning boom.

AlexNet
-------

AlexNet, which employed an 8-layer CNN, won the ImageNet Large Scale
Visual Recognition Challenge 2012 by a phenomenally large margin. This
network showed, for the first time, that the features obtained by
learning can transcend manually-designed features, breaking the previous
paradigm in computer vision.

The architectures of AlexNet and LeNet are very similar, as
:numref:`fig_alexnet` illustrates. Note that we provide a slightly
streamlined version of AlexNet removing some of the design quirks that
were needed in 2012 to make the model fit on two small GPUs.

.. _fig_alexnet:

.. figure:: ../img/alexnet.svg

   From LeNet (left) to AlexNet (right).


The design philosophies of AlexNet and LeNet are very similar, but there
are also significant differences. First, AlexNet is much deeper than the
comparatively small LeNet5. AlexNet consists of eight layers: five
convolutional layers, two fully-connected hidden layers, and one
fully-connected output layer. Second, AlexNet used the ReLU instead of
the sigmoid as its activation function. Let us delve into the details
below.

Architecture
~~~~~~~~~~~~

In AlexNet's first layer, the convolution window shape is
:math:`11\times11`. Since most images in ImageNet are more than ten
times higher and wider than the MNIST images, objects in ImageNet data
tend to occupy more pixels. Consequently, a larger convolution window is
needed to capture the object. The convolution window shape in the second
layer is reduced to :math:`5\times5`, followed by :math:`3\times3`. In
addition, after the first, second, and fifth convolutional layers, the
network adds maximum pooling layers with a window shape of
:math:`3\times3` and a stride of 2. Moreover, AlexNet has ten times more
convolution channels than LeNet.

After the last convolutional layer there are two fully-connected layers
with 4096 outputs. These two huge fully-connected layers produce model
parameters of nearly 1 GB. Due to the limited memory in early GPUs, the
original AlexNet used a dual data stream design, so that each of their
two GPUs could be responsible for storing and computing only its half of
the model. Fortunately, GPU memory is comparatively abundant now, so we
rarely need to break up models across GPUs these days (our version of
the AlexNet model deviates from the original paper in this aspect).

Activation Functions
~~~~~~~~~~~~~~~~~~~~

Besides, AlexNet changed the sigmoid activation function to a simpler
ReLU activation function. On one hand, the computation of the ReLU
activation function is simpler. For example, it does not have the
exponentiation operation found in the sigmoid activation function. On
the other hand, the ReLU activation function makes model training easier
when using different parameter initialization methods. This is because,
when the output of the sigmoid activation function is very close to 0 or
1, the gradient of these regions is almost 0, so that backpropagation
cannot continue to update some of the model parameters. In contrast, the
gradient of the ReLU activation function in the positive interval is
always 1. Therefore, if the model parameters are not properly
initialized, the sigmoid function may obtain a gradient of almost 0 in
the positive interval, so that the model cannot be effectively trained.

Capacity Control and Preprocessing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

AlexNet controls the model complexity of the fully-connected layer by
dropout (:numref:`sec_dropout`), while LeNet only uses weight decay.
To augment the data even further, the training loop of AlexNet added a
great deal of image augmentation, such as flipping, clipping, and color
changes. This makes the model more robust and the larger sample size
effectively reduces overfitting. We will discuss data augmentation in
greater detail in :numref:`sec_image_augmentation`.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-1-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-1-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-1-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-1-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    from mxnet import np, npx
    from mxnet.gluon import nn
    from d2l import mxnet as d2l
    
    npx.set_np()
    
    net = nn.Sequential()
    # Here, we use a larger 11 x 11 window to capture objects. At the same time,
    # we use a stride of 4 to greatly reduce the height and width of the output.
    # Here, the number of output channels is much larger than that in LeNet
    net.add(nn.Conv2D(96, kernel_size=11, strides=4, activation='relu'),
            nn.MaxPool2D(pool_size=3, strides=2),
            # Make the convolution window smaller, set padding to 2 for consistent
            # height and width across the input and output, and increase the
            # number of output channels
            nn.Conv2D(256, kernel_size=5, padding=2, activation='relu'),
            nn.MaxPool2D(pool_size=3, strides=2),
            # Use three successive convolutional layers and a smaller convolution
            # window. Except for the final convolutional layer, the number of
            # output channels is further increased. Pooling layers are not used to
            # reduce the height and width of input after the first two
            # convolutional layers
            nn.Conv2D(384, kernel_size=3, padding=1, activation='relu'),
            nn.Conv2D(384, kernel_size=3, padding=1, activation='relu'),
            nn.Conv2D(256, kernel_size=3, padding=1, activation='relu'),
            nn.MaxPool2D(pool_size=3, strides=2),
            # Here, the number of outputs of the fully-connected layer is several
            # times larger than that in LeNet. Use the dropout layer to mitigate
            # overfitting
            nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
            nn.Dense(4096, activation='relu'), nn.Dropout(0.5),
            # Output layer. Since we are using Fashion-MNIST, the number of
            # classes is 10, instead of 1000 as in the paper
            nn.Dense(10))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-1-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    import torch
    from torch import nn
    from d2l import torch as d2l
    
    net = nn.Sequential(
        # Here, we use a larger 11 x 11 window to capture objects. At the same
        # time, we use a stride of 4 to greatly reduce the height and width of the
        # output. Here, the number of output channels is much larger than that in
        # LeNet
        nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        # Make the convolution window smaller, set padding to 2 for consistent
        # height and width across the input and output, and increase the number of
        # output channels
        nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        # Use three successive convolutional layers and a smaller convolution
        # window. Except for the final convolutional layer, the number of output
        # channels is further increased. Pooling layers are not used to reduce the
        # height and width of input after the first two convolutional layers
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        # Here, the number of outputs of the fully-connected layer is several
        # times larger than that in LeNet. Use the dropout layer to mitigate
        # overfitting
        nn.Linear(6400, 4096), nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Dropout(p=0.5),
        # Output layer. Since we are using Fashion-MNIST, the number of classes is
        # 10, instead of 1000 as in the paper
        nn.Linear(4096, 10))


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-1-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    import tensorflow as tf
    from d2l import tensorflow as d2l
    
    
    def net():
        return tf.keras.models.Sequential([
            # Here, we use a larger 11 x 11 window to capture objects. At the same
            # time, we use a stride of 4 to greatly reduce the height and width of
            # the output. Here, the number of output channels is much larger than
            # that in LeNet
            tf.keras.layers.Conv2D(filters=96, kernel_size=11, strides=4,
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            # Make the convolution window smaller, set padding to 2 for consistent
            # height and width across the input and output, and increase the
            # number of output channels
            tf.keras.layers.Conv2D(filters=256, kernel_size=5, padding='same',
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            # Use three successive convolutional layers and a smaller convolution
            # window. Except for the final convolutional layer, the number of
            # output channels is further increased. Pooling layers are not used to
            # reduce the height and width of input after the first two
            # convolutional layers
            tf.keras.layers.Conv2D(filters=384, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.Conv2D(filters=384, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.Conv2D(filters=256, kernel_size=3, padding='same',
                                   activation='relu'),
            tf.keras.layers.MaxPool2D(pool_size=3, strides=2),
            tf.keras.layers.Flatten(),
            # Here, the number of outputs of the fully-connected layer is several
            # times larger than that in LeNet. Use the dropout layer to mitigate
            # overfitting
            tf.keras.layers.Dense(4096, activation='relu'),
            tf.keras.layers.Dropout(0.5),
            tf.keras.layers.Dense(4096, activation='relu'),
            tf.keras.layers.Dropout(0.5),
            # Output layer. Since we are using Fashion-MNIST, the number of
            # classes is 10, instead of 1000 as in the paper
            tf.keras.layers.Dense(10)
        ])


.. raw:: html

     </div>


.. raw:: html

     </div>

We construct a single-channel data example with both height and width of
224 to observe the output shape of each layer. It matches the AlexNet
architecture in :numref:`fig_alexnet`.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-3-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-3-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-3-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-3-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    X = np.random.uniform(size=(1, 1, 224, 224))
    net.initialize()
    for layer in net:
        X = layer(X)
        print(layer.name, 'output shape:\t', X.shape)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    conv0 output shape:	 (1, 96, 54, 54)
    pool0 output shape:	 (1, 96, 26, 26)
    conv1 output shape:	 (1, 256, 26, 26)
    pool1 output shape:	 (1, 256, 12, 12)
    conv2 output shape:	 (1, 384, 12, 12)
    conv3 output shape:	 (1, 384, 12, 12)
    conv4 output shape:	 (1, 256, 12, 12)
    pool2 output shape:	 (1, 256, 5, 5)
    dense0 output shape:	 (1, 4096)
    dropout0 output shape:	 (1, 4096)
    dense1 output shape:	 (1, 4096)
    dropout1 output shape:	 (1, 4096)
    dense2 output shape:	 (1, 10)


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-3-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    X = torch.randn(1, 1, 224, 224)
    for layer in net:
        X=layer(X)
        print(layer.__class__.__name__,'output shape:\t',X.shape)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Conv2d output shape:	 torch.Size([1, 96, 54, 54])
    ReLU output shape:	 torch.Size([1, 96, 54, 54])
    MaxPool2d output shape:	 torch.Size([1, 96, 26, 26])
    Conv2d output shape:	 torch.Size([1, 256, 26, 26])
    ReLU output shape:	 torch.Size([1, 256, 26, 26])
    MaxPool2d output shape:	 torch.Size([1, 256, 12, 12])
    Conv2d output shape:	 torch.Size([1, 384, 12, 12])
    ReLU output shape:	 torch.Size([1, 384, 12, 12])
    Conv2d output shape:	 torch.Size([1, 384, 12, 12])
    ReLU output shape:	 torch.Size([1, 384, 12, 12])
    Conv2d output shape:	 torch.Size([1, 256, 12, 12])
    ReLU output shape:	 torch.Size([1, 256, 12, 12])
    MaxPool2d output shape:	 torch.Size([1, 256, 5, 5])
    Flatten output shape:	 torch.Size([1, 6400])
    Linear output shape:	 torch.Size([1, 4096])
    ReLU output shape:	 torch.Size([1, 4096])
    Dropout output shape:	 torch.Size([1, 4096])
    Linear output shape:	 torch.Size([1, 4096])
    ReLU output shape:	 torch.Size([1, 4096])
    Dropout output shape:	 torch.Size([1, 4096])
    Linear output shape:	 torch.Size([1, 10])


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-3-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    X = tf.random.uniform((1, 224, 224, 1))
    for layer in net().layers:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Conv2D output shape:	 (1, 54, 54, 96)
    MaxPooling2D output shape:	 (1, 26, 26, 96)
    Conv2D output shape:	 (1, 26, 26, 256)
    MaxPooling2D output shape:	 (1, 12, 12, 256)
    Conv2D output shape:	 (1, 12, 12, 384)
    Conv2D output shape:	 (1, 12, 12, 384)
    Conv2D output shape:	 (1, 12, 12, 256)
    MaxPooling2D output shape:	 (1, 5, 5, 256)
    Flatten output shape:	 (1, 6400)
    Dense output shape:	 (1, 4096)
    Dropout output shape:	 (1, 4096)
    Dense output shape:	 (1, 4096)
    Dropout output shape:	 (1, 4096)
    Dense output shape:	 (1, 10)


.. raw:: html

     </div>


.. raw:: html

     </div>

Reading the Dataset
-------------------

Although AlexNet is trained on ImageNet in the paper, we use
Fashion-MNIST here since training an ImageNet model to convergence could
take hours or days even on a modern GPU. One of the problems with
applying AlexNet directly on Fashion-MNIST is that its images have lower
resolution (:math:`28 \times 28` pixels) than ImageNet images. To make
things work, we upsample them to :math:`224 \times 224` (generally not a
smart practice, but we do it here to be faithful to the AlexNet
architecture). We perform this resizing with the ``resize`` argument in
the ``d2l.load_data_fashion_mnist`` function.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    batch_size = 128
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)

Training
--------

Now, we can start training AlexNet. Compared with LeNet in
:numref:`sec_lenet`, the main change here is the use of a smaller
learning rate and much slower training due to the deeper and wider
network, the higher image resolution, and the more costly convolutions.


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#mxnet-7-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-7-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-7-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-7-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    lr, num_epochs = 0.01, 10
    d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    loss 0.334, train acc 0.877, test acc 0.887
    4072.9 examples/sec on gpu(0)


.. figure:: output_alexnet_180871_29_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-7-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    lr, num_epochs = 0.01, 10
    d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    loss 0.330, train acc 0.879, test acc 0.884
    4186.2 examples/sec on cuda:0


.. figure:: output_alexnet_180871_32_1.svg


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-7-2">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    lr, num_epochs = 0.01, 10
    d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    loss 0.327, train acc 0.881, test acc 0.880
    5027.6 examples/sec on /GPU:0


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    <keras.engine.sequential.Sequential at 0x7f76c00f3a00>


.. figure:: output_alexnet_180871_35_2.svg


.. raw:: html

     </div>


.. raw:: html

     </div>

Summary
-------

-  AlexNet has a similar structure to that of LeNet, but uses more
   convolutional layers and a larger parameter space to fit the
   large-scale ImageNet dataset.
-  Today AlexNet has been surpassed by much more effective architectures
   but it is a key step from shallow to deep networks that are used
   nowadays.
-  Although it seems that there are only a few more lines in AlexNet's
   implementation than in LeNet, it took the academic community many
   years to embrace this conceptual change and take advantage of its
   excellent experimental results. This was also due to the lack of
   efficient computational tools.
-  Dropout, ReLU, and preprocessing were the other key steps in
   achieving excellent performance in computer vision tasks.

Exercises
---------

1. Try increasing the number of epochs. Compared with LeNet, how are the
   results different? Why?
2. AlexNet may be too complex for the Fashion-MNIST dataset.

   1. Try simplifying the model to make the training faster, while
      ensuring that the accuracy does not drop significantly.
   2. Design a better model that works directly on :math:`28 \times 28`
      images.

3. Modify the batch size, and observe the changes in accuracy and GPU
   memory.
4. Analyze computational performance of AlexNet.

   1. What is the dominant part for the memory footprint of AlexNet?
   2. What is the dominant part for computation in AlexNet?
   3. How about memory bandwidth when computing the results?

5. Apply dropout and ReLU to LeNet-5. Does it improve? How about
   preprocessing?


.. raw:: html

     <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar text"><a href="#mxnet-9-0" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab is-active">mxnet</a><a href="#pytorch-9-1" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab ">pytorch</a><a href="#tensorflow-9-2" onclick="tagClick('tensorflow'); return false;" class="mdl-tabs__tab ">tensorflow</a></div>


.. raw:: html

     <div class="mdl-tabs__panel is-active" id="mxnet-9-0">

`Discussions <https://discuss.d2l.ai/t/75>`__


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="pytorch-9-1">

`Discussions <https://discuss.d2l.ai/t/76>`__


.. raw:: html

     </div>


.. raw:: html

     <div class="mdl-tabs__panel " id="tensorflow-9-2">

`Discussions <https://discuss.d2l.ai/t/276>`__


.. raw:: html

     </div>


.. raw:: html

     </div>