.. _sec_mlp_concise:
Concise Implementation of Multilayer Perceptrons
================================================
As you might expect, by relying on the high-level APIs, we can implement
MLPs even more concisely.
.. code:: python
    from mxnet import gluon, init, npx
    from mxnet.gluon import nn
    from d2l import mxnet as d2l
    
    npx.set_np()
.. code:: python
    import torch
    from torch import nn
    from d2l import torch as d2l
.. code:: python
    import tensorflow as tf
    from d2l import tensorflow as d2l
Model
-----
Compared with our concise implementation of softmax regression
(:numref:`sec_softmax_concise`), the only difference is that we add
*two* fully-connected layers (previously, we added *one*). The first is
our hidden layer, which contains 256 hidden units and applies the ReLU
activation function. The second is our output layer.
.. code:: python
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'),
            nn.Dense(10))
    net.initialize(init.Normal(sigma=0.01))
.. code:: python
    net = nn.Sequential(nn.Flatten(),
                        nn.Linear(784, 256),
                        nn.ReLU(),
                        nn.Linear(256, 10))
    
    def init_weights(m):
        if type(m) == nn.Linear:
            nn.init.normal_(m.weight, std=0.01)
    
    net.apply(init_weights);
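
As a quick sanity check (a minimal sketch, not part of the book's training
pipeline), we can pass a random minibatch of Fashion-MNIST-shaped inputs
through the PyTorch model defined above and confirm that it emits one logit
per class for each example.

.. code:: python

    # Hypothetical sanity check: a batch of 2 grayscale 28x28 images
    X = torch.rand(2, 1, 28, 28)
    # nn.Flatten collapses each image into a 784-dimensional vector,
    # so the network should produce a (2, 10) tensor of logits
    print(net(X).shape)  # torch.Size([2, 10])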
.. code:: python
    net = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(10)])
The training loop is exactly the same as the one we used for softmax
regression. This modularity enables us to separate matters concerning
the model architecture from orthogonal considerations such as the choice
of loss function and optimizer.
.. code:: python
    batch_size, lr, num_epochs = 256, 0.1, 10
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
    
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
.. figure:: output_mlp-concise_f87756_27_0.svg
.. code:: python
    batch_size, lr, num_epochs = 256, 0.1, 10
    loss = nn.CrossEntropyLoss(reduction='none')
    trainer = torch.optim.SGD(net.parameters(), lr=lr)
    
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
.. figure:: output_mlp-concise_f87756_30_0.svg
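
The ``d2l.train_ch3`` helper wraps a standard minibatch training loop.
Roughly, one epoch of that loop in PyTorch amounts to the sketch below
(assuming the ``net``, ``loss``, and ``trainer`` defined above; the real
helper additionally accumulates training metrics and evaluates test
accuracy after each epoch).

.. code:: python

    # A rough sketch of one training epoch, assuming the PyTorch `net`,
    # `loss`, and `trainer` defined above; d2l.train_ch3 also tracks
    # metrics and evaluates on the test iterator
    def train_epoch_sketch(net, train_iter, loss, trainer):
        net.train()
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)   # per-example losses (reduction='none')
            l.mean().backward()   # average before backpropagating
            trainer.step()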
.. code:: python
    batch_size, lr, num_epochs = 256, 0.1, 10
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    trainer = tf.keras.optimizers.SGD(learning_rate=lr)
    
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
.. figure:: output_mlp-concise_f87756_33_0.svg
Summary
-------
-  Using high-level APIs, we can implement MLPs much more concisely.
-  For the same classification problem, the implementation of an MLP is
   the same as that of softmax regression except for additional hidden
   layers with activation functions.
Exercises
---------
1. Try adding different numbers of hidden layers (you may also modify
   the learning rate). What setting works best?
2. Try out different activation functions. Which one works best?
3. Try different schemes for initializing the weights. What method works
   best?