Parameter Management
====================
The ultimate goal of training deep networks is to find good parameter
values for a given architecture. When everything is standard, the
``nn.Sequential`` class is a perfectly good tool for it. However, very
few models are entirely standard and most scientists want to build
things that are novel. This section shows how to manipulate parameters.
In particular we will cover the following aspects:
- Accessing parameters for debugging, diagnostics, visualization, or
  saving is the first step to understanding how to work with custom
  models.
- Secondly, we want to set them in specific ways, e.g. for
  initialization purposes. We discuss the structure of parameter
  initializers.
- Lastly, we show how this knowledge can be put to good use by building
  networks that share some parameters.
As always, we start from our trusty Multilayer Perceptron with a hidden
layer. This will serve as our choice for demonstrating the various
features.
.. code:: python
from mxnet import init, nd
from mxnet.gluon import nn
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize() # Use the default initialization method
x = nd.random.uniform(shape=(2, 20))
net(x) # Forward computation
.. parsed-literal::
:class: output
[[ 0.09543004 0.04614332 -0.00286654 -0.07790349 -0.05130243 0.02942037
0.08696642 -0.0190793 -0.04122177 0.05088576]
[ 0.0769287 0.03099705 0.00856576 -0.04467199 -0.06926839 0.09132434
0.06786595 -0.06187842 -0.03436673 0.04234694]]
Parameter Access
----------------
In the case of a ``Sequential`` class we can access the parameters with
ease, simply by indexing each of the layers in the network. The
``params`` attribute of a layer then contains the required data. Let’s
try this out in practice by inspecting the parameters of both layers.
.. code:: python
print(net[0].params)
print(net[1].params)
.. parsed-literal::
:class: output
dense0_ (
Parameter dense0_weight (shape=(256, 20), dtype=float32)
Parameter dense0_bias (shape=(256,), dtype=float32)
)
dense1_ (
Parameter dense1_weight (shape=(10, 256), dtype=float32)
Parameter dense1_bias (shape=(10,), dtype=float32)
)
The output tells us a number of things. Firstly, the first layer
consists of two sets of parameters: ``dense0_weight`` and
``dense0_bias``, as we would expect. They are both single precision and
they have the shapes that we would expect from the first layer, given
that the input dimension is 20 and the output dimension is 256. In
particular, the names of the parameters are very useful since they allow
us to identify parameters *uniquely* even in a network of hundreds of
layers and with nontrivial structure. The second layer is structured
accordingly.
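As a quick check on these shapes, the short sketch below recomputes them
by hand. It assumes only what the output above shows, namely that a
dense layer stores its weight as ``(units, in_units)`` and its bias as
``(units,)``. The helpers ``dense_shapes`` and ``num_params`` are
hypothetical names used for illustration and are not part of Gluon.

```python
def dense_shapes(in_units, units):
    """Return (weight_shape, bias_shape) for a fully connected layer."""
    return (units, in_units), (units,)


def num_params(in_units, units):
    """Count the scalar parameters of one dense layer."""
    (rows, cols), (bias_len,) = dense_shapes(in_units, units)
    return rows * cols + bias_len


# The MLP above: 20 inputs -> 256 hidden units -> 10 outputs
layer_sizes = [(20, 256), (256, 10)]
for in_units, units in layer_sizes:
    print(dense_shapes(in_units, units), num_params(in_units, units))

total = sum(num_params(i, u) for i, u in layer_sizes)
print('total parameters:', total)
```

The shapes agree with the ones printed by ``net[0].params`` and
``net[1].params``.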
Targeted Parameters
~~~~~~~~~~~~~~~~~~~
To do something useful with the parameters, we first need to access
their values. There are several ways to do this, ranging from simple to
general. Let’s look at some of them.
.. code:: python
print(net[1].bias)
print(net[1].bias.data())
.. parsed-literal::
:class: output
Parameter dense1_bias (shape=(10,), dtype=float32)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
The first line prints the bias ``Parameter`` of the second layer. Since
this is an object containing data, gradients, and additional
information, we need to request the data explicitly with ``data()``. The
bias is all 0 because the default initializer sets biases to zero. We
can also access the parameters by name, such as ``dense0_weight``. This
is possible since each layer comes with its own parameter dictionary
that can be accessed directly. Both methods are entirely equivalent, but
the first leads to much more readable code.
.. code:: python
print(net[0].params['dense0_weight'])
print(net[0].params['dense0_weight'].data())
.. parsed-literal::
:class: output
Parameter dense0_weight (shape=(256, 20), dtype=float32)
[[ 0.06700657 -0.00369488 0.0418822 ... -0.05517294 -0.01194733
-0.00369594]
[-0.03296221 -0.04391347 0.03839272 ... 0.05636378 0.02545484
-0.007007 ]
[-0.0196689 0.01582889 -0.00881553 ... 0.01509629 -0.01908049
-0.02449339]
...
[ 0.00010955 0.0439323 -0.04911506 ... 0.06975312 0.0449558
-0.03283203]
[ 0.04106557 0.05671307 -0.00066976 ... 0.06387014 -0.01292654
0.00974177]
[ 0.00297424 -0.0281784 -0.06881659 ... -0.04047417 0.00457048
0.05696651]]
Note that the weights are nonzero. This is by design since they were
randomly initialized by the call to ``net.initialize()``. ``data`` is
not the only method that we can invoke. For instance, ``grad`` returns
the gradient with respect to the parameter. It has the same shape as the
weight. However, since we have not invoked backpropagation yet, the
values are all 0.
.. code:: python
net[0].weight.grad()
.. parsed-literal::
:class: output
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
All Parameters at Once
~~~~~~~~~~~~~~~~~~~~~~
Accessing parameters as described above can be a bit tedious, in
particular if we have more complex blocks, or blocks of blocks (or even
blocks of blocks of blocks), since we need to walk through the entire
tree in reverse order to how the blocks were constructed. To avoid this,
blocks come with a method ``collect_params`` which grabs all parameters
of a network in one dictionary such that we can traverse it with ease.
It does so by iterating over all constituents of a block and calling
``collect_params`` on subblocks as needed. To see the difference,
consider the following:
.. code:: python
# parameters only for the first layer
print(net[0].collect_params())
# parameters of the entire network
print(net.collect_params())
.. parsed-literal::
:class: output
dense0_ (
Parameter dense0_weight (shape=(256, 20), dtype=float32)
Parameter dense0_bias (shape=(256,), dtype=float32)
)
sequential0_ (
Parameter dense0_weight (shape=(256, 20), dtype=float32)
Parameter dense0_bias (shape=(256,), dtype=float32)
Parameter dense1_weight (shape=(10, 256), dtype=float32)
Parameter dense1_bias (shape=(10,), dtype=float32)
)
This provides us with a third way of accessing the parameters of the
network. If we wanted to get the value of the bias term of the second
layer we could simply use this:
.. code:: python
net.collect_params()['dense1_bias'].data()
.. parsed-literal::
:class: output
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
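The recursive traversal behind ``collect_params`` can be illustrated
with a small stand-in written in plain Python. This is only a sketch of
the idea described above, not Gluon's implementation: ``ToyBlock`` is a
hypothetical class, and the strings ``'W0'``, ``'b0'`` and so on are
placeholders for the actual arrays.

```python
class ToyBlock:
    """A stand-in for a block: it owns parameters and child blocks."""

    def __init__(self, name, params=None, children=None):
        self.name = name
        self.params = dict(params or {})      # parameter name -> value
        self.children = list(children or [])  # nested ToyBlock objects

    def collect_params(self):
        """Gather this block's parameters, then recurse into children."""
        collected = dict(self.params)
        for child in self.children:
            collected.update(child.collect_params())
        return collected


# Mirror the two-layer MLP: a container holding two dense layers
dense0 = ToyBlock('dense0_', {'dense0_weight': 'W0', 'dense0_bias': 'b0'})
dense1 = ToyBlock('dense1_', {'dense1_weight': 'W1', 'dense1_bias': 'b1'})
toy_net = ToyBlock('sequential0_', children=[dense0, dense1])

print(list(dense0.collect_params()))   # parameters of one layer only
print(list(toy_net.collect_params()))  # parameters of the whole network
```

Because the result is a single flat dictionary, a lookup such as
``toy_net.collect_params()['dense1_bias']`` works no matter how deeply
the layer is nested.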
Throughout the book we’ll see how various blocks name their subblocks
(Sequential simply numbers them). This makes it very convenient to use
regular expressions to select just the parameters we need.
.. code:: python
print(net.collect_params('.*weight'))
print(net.collect_params('dense0.*'))
.. parsed-literal::
:class: output
sequential0_ (
Parameter dense0_weight (shape=(256, 20), dtype=float32)
Parameter dense1_weight (shape=(10, 256), dtype=float32)
)
sequential0_ (
Parameter dense0_weight (shape=(256, 20), dtype=float32)
Parameter dense0_bias (shape=(256,), dtype=float32)
)
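To see why the patterns above select what they do, the following sketch
applies them to a plain list of parameter names with Python's ``re``
module. It assumes that a pattern is matched from the first character of
the name, as ``re.match`` does. That is consistent with the output above
but should be treated as an assumption about ``collect_params``. The
helper ``select_params`` is hypothetical.

```python
import re

param_names = ['dense0_weight', 'dense0_bias',
               'dense1_weight', 'dense1_bias']


def select_params(names, pattern):
    """Keep the names that the pattern matches from their first character."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


print(select_params(param_names, '.*weight'))  # all weights
print(select_params(param_names, 'dense0.*'))  # everything in the first layer
# Without the leading '.*' nothing is selected here, because no name
# starts with 'weight'
print(select_params(param_names, 'weight'))
```

Under this assumption the leading ``.*`` in ``'.*weight'`` is what lets
the pattern skip over the layer prefix.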
Rube Goldberg strikes again
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let’s see how the parameter naming conventions work if we nest multiple
blocks inside each other. For that we first define a function that
produces blocks (a block factory, so to speak) and then we combine these
inside yet larger blocks.
.. code:: python
def block1():
    net = nn.Sequential()
    net.add(nn.Dense(32, activation='relu'))
    net.add(nn.Dense(16, activation='relu'))
    return net

def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add(block1())
    return net

rgnet = nn.Sequential()
rgnet.add(block2())
rgnet.add(nn.Dense(10))
rgnet.initialize()
rgnet(x)
.. parsed-literal::
:class: output
[[ 1.0116727e-08 -9.4839003e-10 -1.1526797e-08 1.4917443e-08
-1.5690811e-09 -3.9257650e-09 -4.1441655e-09 9.3013472e-09
3.2393586e-09 -4.8612452e-09]
[ 9.0111598e-09 -1.9115812e-10 -8.9595842e-09 1.0745880e-08
1.4963460e-10 -2.2272872e-09 -3.9153973e-09 7.0595711e-09
3.4854222e-09 -4.5807327e-09]]
Now that we are done designing the network, let’s see how it is
organized. Printing the bound method ``rgnet.collect_params`` without
calling it shows the logical structure of the nested blocks, while
calling ``collect_params()`` lists the parameters by name.
.. code:: python
print(rgnet.collect_params)
print(rgnet.collect_params())
.. parsed-literal::
:class: output
<bound method Block.collect_params of Sequential(
  (0): Sequential(
    (0): Sequential(
      (0): Dense(20 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (1): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (2): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (3): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
  )
  (1): Dense(16 -> 10, linear)
)>
sequential1_ (
Parameter dense2_weight (shape=(32, 20), dtype=float32)
Parameter dense2_bias (shape=(32,), dtype=float32)
Parameter dense3_weight (shape=(16, 32), dtype=float32)
Parameter dense3_bias (shape=(16,), dtype=float32)
Parameter dense4_weight (shape=(32, 16), dtype=float32)
Parameter dense4_bias (shape=(32,), dtype=float32)
Parameter dense5_weight (shape=(16, 32), dtype=float32)
Parameter dense5_bias (shape=(16,), dtype=float32)
Parameter dense6_weight (shape=(32, 16), dtype=float32)
Parameter dense6_bias (shape=(32,), dtype=float32)
Parameter dense7_weight (shape=(16, 32), dtype=float32)
Parameter dense7_bias (shape=(16,), dtype=float32)
Parameter dense8_weight (shape=(32, 16), dtype=float32)
Parameter dense8_bias (shape=(32,), dtype=float32)
Parameter dense9_weight (shape=(16, 32), dtype=float32)
Parameter dense9_bias (shape=(16,), dtype=float32)
Parameter dense10_weight (shape=(10, 16), dtype=float32)
Parameter dense10_bias (shape=(10,), dtype=float32)
)
Since the layers are generated hierarchically, we can also access them
accordingly. For instance, to access the first major block, then its
second subblock, and finally the bias of that subblock’s first layer, we
perform the following.
.. code:: python
rgnet[0][1][0].bias.data()
.. parsed-literal::
:class: output
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0.]
Parameter Initialization
------------------------
Now that we know how to access the parameters, let’s look at how to
initialize them properly. We discussed the need for initialization in
:numref:`chapter_numerical_stability`. By default, MXNet initializes
the weight matrices uniformly by drawing from :math:`U[-0.07, 0.07]` and
the bias parameters are all set to :math:`0`. However, we often need to
use other methods to initialize the weights. MXNet’s ``init`` module
provides a variety of preset initialization methods, but if we want
something out of the ordinary, we need a bit of extra work.
Built-in Initialization
~~~~~~~~~~~~~~~~~~~~~~~
Let’s begin with the built-in initializers. The code below initializes
all parameters with Gaussian random variables.
.. code:: python
# force_reinit ensures that the variables are initialized again, regardless of
# whether they were already initialized previously
net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
net[0].weight.data()[0]
.. parsed-literal::
:class: output
[-0.008166 -0.00159167 -0.00273115 0.00684697 0.01204039 0.01359703
0.00776908 -0.00640936 0.00256858 0.00545601 0.0018105 -0.00914027
0.00133803 0.01070259 -0.00368285 0.01432678 0.00558631 -0.01479764
0.00879013 0.00460165]
If we wanted to initialize all parameters to 1, we could do this simply
by changing the initializer to ``Constant(1)``.
.. code:: python
net.initialize(init=init.Constant(1), force_reinit=True)
net[0].weight.data()[0]
.. parsed-literal::
:class: output
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
If we want to initialize only a specific parameter in a different
manner, we can simply set the initializer for the appropriate subblock
(or parameter). For instance, below we initialize the second layer to a
constant value of 42 and we use the ``Xavier`` initializer for the
weights of the first layer.
.. code:: python
net[1].initialize(init=init.Constant(42), force_reinit=True)
net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
print(net[1].weight.data()[0,0])
print(net[0].weight.data()[0])
.. parsed-literal::
:class: output
[42.]
[-0.14511706 -0.01173057 -0.03754489 -0.14020921 0.00900492 0.01712246
0.12447387 -0.04094418 -0.12105145 0.00079902 -0.0277361 -0.10213967
-0.14027238 -0.02196661 -0.04641148 0.11977354 0.03604397 -0.14493202
-0.06514931 0.13826048]
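To make the range of these values less mysterious, the sketch below
recomputes the bound of a Xavier-style uniform initializer with NumPy.
It assumes the common form :math:`U[-a, a]` with
:math:`a = \sqrt{6 / (n_\mathrm{in} + n_\mathrm{out})}`. To our
understanding this matches the default settings of ``init.Xavier``, but
treat that correspondence as an assumption and consult the MXNet
documentation. ``xavier_uniform`` is a hypothetical helper, not an MXNet
function.

```python
import numpy as np


def xavier_uniform(fan_out, fan_in, rng):
    """Sample a (fan_out, fan_in) weight matrix from U[-a, a]."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_out, fan_in)), bound


rng = np.random.default_rng(0)
# First layer of the MLP above: 20 inputs, 256 outputs
weights, bound = xavier_uniform(256, 20, rng)
print('bound:', round(float(bound), 4))
print('largest sampled magnitude:', float(np.abs(weights).max()))

# A few of the largest entries from the row printed above
printed_row = np.array([-0.14511706, -0.14020921, 0.12447387, 0.13826048])
print(bool(np.all(np.abs(printed_row) <= bound)))
```

The entries printed above all fall inside this bound, which is at least
consistent with the assumed form.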
Custom Initialization
~~~~~~~~~~~~~~~~~~~~~
Sometimes, the initialization methods we need are not provided in the
``init`` module. In that case, we can implement a subclass of the
``Initializer`` class so that we can use it like any other
initialization method. Usually, we only need to implement the
``_init_weight`` function, which overwrites the incoming NDArray with
the desired initial values. In the example below, we pick a decidedly
bizarre and nontrivial distribution, just to prove the point. We draw
the coefficients from the following distribution:
.. math::
\begin{aligned}
w \sim \begin{cases}
U[5, 10] & \text{ with probability } \frac{1}{4} \\
0 & \text{ with probability } \frac{1}{2} \\
U[-10, -5] & \text{ with probability } \frac{1}{4}
\end{cases}
\end{aligned}
.. code:: python
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        # Draw from U[-10, 10], then zero out entries with magnitude below 5
        data[:] = nd.random.uniform(low=-10, high=10, shape=data.shape)
        data *= data.abs() >= 5

net.initialize(MyInit(), force_reinit=True)
net[0].weight.data()[0]
.. parsed-literal::
:class: output
Init dense0_weight (256, 20)
Init dense1_weight (10, 256)
.. parsed-literal::
:class: output
[-5.44481 6.536484 -0. 0. 0. 7.7452965
7.739216 7.6021366 0. -0. -7.3307705 -0.
9.611603 0. 7.4357147 0. 0. -0.
8.446959 0. ]
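It may not be obvious that sampling from :math:`U[-10, 10]` and zeroing
the small entries yields the three-case distribution stated above. The
NumPy sketch below repeats the same two steps on a large sample and
measures the resulting fractions. It is an independent check that uses
NumPy rather than MXNet, and the sample size and seed are arbitrary
choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same two steps as MyInit: sample from U[-10, 10], then zero small entries
samples = rng.uniform(-10, 10, size=1_000_000)
samples = samples * (np.abs(samples) >= 5)

frac_zero = float(np.mean(samples == 0))  # expected to be close to 1/2
frac_pos = float(np.mean(samples >= 5))   # expected to be close to 1/4
frac_neg = float(np.mean(samples <= -5))  # expected to be close to 1/4
print(frac_zero, frac_pos, frac_neg)
```

Half of the interval :math:`[-10, 10]` has magnitude below 5, which is
why about half of the entries end up at zero, while the surviving values
remain uniform on :math:`[5, 10]` and :math:`[-10, -5]`.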
If even this functionality is insufficient, we can set parameters
directly. Since ``data()`` returns an NDArray, we can access it just
like any other matrix. A note for advanced users: if you want to adjust
parameters within an ``autograd`` scope, you need to use ``set_data`` to
avoid confusing the automatic differentiation mechanics.
.. code:: python
net[0].weight.data()[:] += 1
net[0].weight.data()[0,0] = 42
net[0].weight.data()[0]
.. parsed-literal::
:class: output
[42. 7.536484 1. 1. 1. 8.7452965
8.739216 8.602137 1. 1. -6.3307705 1.
10.611603 1. 8.435715 1. 1. 1.
9.446959 1. ]
Tied Parameters
---------------
In some cases, we want to share model parameters across multiple layers.
For instance when we want to find good word embeddings we may decide to
use the same parameters both for encoding and decoding of words. We
discussed one such case when we introduced
:numref:`chapter_model_construction`. Let’s see how to do this a bit
more elegantly. In the following we allocate a dense layer and then use
its parameters specifically to set those of another layer.
.. code:: python
net = nn.Sequential()
# We need to give the shared layer a name such that we can reference its
# parameters
shared = nn.Dense(8, activation='relu')
net.add(nn.Dense(8, activation='relu'),
shared,
nn.Dense(8, activation='relu', params=shared.params),
nn.Dense(10))
net.initialize()
x = nd.random.uniform(shape=(2, 20))
net(x)
# Check whether the parameters are the same
print(net[1].weight.data()[0] == net[2].weight.data()[0])
net[1].weight.data()[0,0] = 100
# Make sure that they're actually the same object rather than just having the
# same value
print(net[1].weight.data()[0] == net[2].weight.data()[0])
.. parsed-literal::
:class: output
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
The above example shows that the parameters of the second and third
layer are tied. They are the same object rather than just having equal
values. That is, by changing one of the parameters the other one
changes, too. What happens to the gradients is quite ingenious. Since
the model parameters contain gradients, the gradients of the second
hidden layer and the third hidden layer are accumulated in the gradient
of the shared parameters (e.g. ``shared.weight.grad()``) during
backpropagation.
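The accumulation of gradients for a tied parameter can be verified on a
toy problem. The NumPy sketch below applies one weight matrix twice,
omitting biases and activations for simplicity, derives the gradient
contribution of each use by hand, and compares their sum with a finite
difference estimate. All names in it are illustrative, and it does not
use MXNet's ``autograd``.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))  # one weight matrix, used by two layers
x = rng.normal(size=n)


def loss(weight):
    """Apply the same linear layer twice and sum the outputs."""
    h = weight @ x
    y = weight @ h
    return y.sum()


# Backpropagation by hand: each use of W contributes its own gradient
h = W @ x
grad_y = np.ones(n)                    # d loss / d y
grad_second_use = np.outer(grad_y, h)  # from y = W h
grad_h = W.T @ grad_y                  # d loss / d h
grad_first_use = np.outer(grad_h, x)   # from h = W x
grad_shared = grad_first_use + grad_second_use

# Central finite differences as an independent check
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(n):
    for j in range(n):
        step = np.zeros_like(W)
        step[i, j] = eps
        grad_numeric[i, j] = (loss(W + step) - loss(W - step)) / (2 * eps)

print(np.allclose(grad_shared, grad_numeric, atol=1e-5))
```

Neither contribution alone matches the numerical gradient. Only their
sum does, which is the sense in which the gradients of tied layers are
accumulated.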
Summary
-------
- We have several ways to access, initialize, and tie model parameters.
- We can use custom initialization.
- Gluon has a sophisticated mechanism for accessing parameters in a
  unique and hierarchical manner.
Exercises
---------
1. Use the FancyMLP defined in :numref:`chapter_model_construction`
   and access the parameters of the various layers.
2. Look at the MXNet documentation and explore different initializers.
3. Try accessing the model parameters after ``net.initialize()`` and
   before ``net(x)`` to observe the shape of the model parameters. What
   changes? Why?
4. Construct a multilayer perceptron containing a shared parameter layer
   and train it. During the training process, observe the model
   parameters and gradients of each layer.
5. Why is sharing parameters a good idea?