.. _sec_linear_scratch:
Linear Regression Implementation from Scratch
=============================================
Now that you understand the key ideas behind linear regression, we can
begin to work through a hands-on implementation in code. In this
section, we will implement the entire method from scratch, including the
data pipeline, the model, the loss function, and the minibatch
stochastic gradient descent optimizer. While modern deep learning
frameworks can automate nearly all of this work, implementing things
from scratch is the only way to make sure that you really know what you
are doing. Moreover, when it comes time to customize models, defining
our own layers or loss functions, understanding how things work under
the hood will prove handy. In this section, we will rely only on tensors
and automatic differentiation. Afterwards, we will introduce a more
concise implementation, taking advantage of the bells and whistles of
deep learning frameworks.
.. code:: python
%matplotlib inline
import random
from mxnet import autograd, np, npx
from d2l import mxnet as d2l
npx.set_np()
.. code:: python
%matplotlib inline
import random
import torch
from d2l import torch as d2l
.. code:: python
%matplotlib inline
import random
import tensorflow as tf
from d2l import tensorflow as d2l
Generating the Dataset
----------------------
To keep things simple, we will construct an artificial dataset according
to a linear model with additive noise. Our task will be to recover this
model's parameters using the finite set of examples contained in our
dataset. We will keep the data low-dimensional so we can visualize it
easily. In the following code snippet, we generate a dataset containing
1000 examples, each consisting of 2 features sampled from a standard
normal distribution. Thus our synthetic dataset will be a matrix
:math:`\mathbf{X}\in \mathbb{R}^{1000 \times 2}`.
The true parameters generating our dataset will be
:math:`\mathbf{w} = [2, -3.4]^\top` and :math:`b = 4.2`, and our
synthetic labels will be assigned according to the following linear
model with the noise term :math:`\epsilon`:
.. math:: \mathbf{y} = \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}.
You could think of :math:`\epsilon` as capturing potential measurement
errors on the features and labels. We will assume that the standard
assumptions hold and thus that :math:`\epsilon` obeys a normal
distribution with a mean of 0. To make our problem easy, we will set its
standard deviation to 0.01. The following code generates our synthetic
dataset.
.. code:: python
def synthetic_data(w, b, num_examples): #@save
"""Generate y = Xw + b + noise."""
X = np.random.normal(0, 1, (num_examples, len(w)))
y = np.dot(X, w) + b
y += np.random.normal(0, 0.01, y.shape)
return X, y.reshape((-1, 1))
true_w = np.array([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
.. code:: python
def synthetic_data(w, b, num_examples): #@save
"""Generate y = Xw + b + noise."""
X = torch.normal(0, 1, (num_examples, len(w)))
y = torch.matmul(X, w) + b
y += torch.normal(0, 0.01, y.shape)
return X, y.reshape((-1, 1))
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
.. code:: python
def synthetic_data(w, b, num_examples): #@save
"""Generate y = Xw + b + noise."""
X = tf.zeros((num_examples, w.shape[0]))
X += tf.random.normal(shape=X.shape)
y = tf.matmul(X, tf.reshape(w, (-1, 1))) + b
y += tf.random.normal(shape=y.shape, stddev=0.01)
y = tf.reshape(y, (-1, 1))
return X, y
true_w = tf.constant([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
Note that each row in ``features`` consists of a 2-dimensional data
example and that each row in ``labels`` consists of a 1-dimensional
label value (a scalar).
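A quick shape check confirms this (a minimal sketch; the ``shape``
attribute is available in all three frameworks):

.. code:: python

    print(features.shape)  # (1000, 2)
    print(labels.shape)    # (1000, 1)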
.. code:: python
print('features:', features[0],'\nlabel:', labels[0])
.. parsed-literal::
:class: output
features: [2.2122064 1.1630787]
label: [4.662078]
.. code:: python
print('features:', features[0],'\nlabel:', labels[0])
.. parsed-literal::
:class: output
features: tensor([ 1.1484, -1.2663])
label: tensor([10.7989])
.. code:: python
print('features:', features[0],'\nlabel:', labels[0])
.. parsed-literal::
:class: output
features: tf.Tensor([1.4194795 0.30994684], shape=(2,), dtype=float32)
label: tf.Tensor([5.987108], shape=(1,), dtype=float32)
By generating a scatter plot using the second feature ``features[:, 1]``
and ``labels``, we can clearly observe the linear correlation between
the two.
.. code:: python
d2l.set_figsize()
# The semicolon is for displaying the plot only
d2l.plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);
.. figure:: output_linear-regression-scratch_58de05_39_0.svg
.. code:: python
d2l.set_figsize()
# The semicolon is for displaying the plot only
d2l.plt.scatter(features[:, 1].detach().numpy(), labels.detach().numpy(), 1);
.. figure:: output_linear-regression-scratch_58de05_42_0.svg
.. code:: python
d2l.set_figsize()
# The semicolon is for displaying the plot only
d2l.plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);
.. figure:: output_linear-regression-scratch_58de05_45_0.svg
Reading the Dataset
-------------------
Recall that training models consists of making multiple passes over the
dataset, grabbing one minibatch of examples at a time, and using them to
update our model. Since this process is so fundamental to training
machine learning algorithms, it is worth defining a utility function to
shuffle the dataset and access it in minibatches.
In the following code, we define the ``data_iter`` function to
demonstrate one possible implementation of this functionality. The
function takes a batch size, a matrix of features, and a vector of
labels, yielding minibatches of the size ``batch_size``. Each minibatch
consists of a tuple of features and labels.
.. code:: python
def data_iter(batch_size, features, labels):
num_examples = len(features)
indices = list(range(num_examples))
# The examples are read at random, in no particular order
random.shuffle(indices)
for i in range(0, num_examples, batch_size):
batch_indices = np.array(
indices[i: min(i + batch_size, num_examples)])
yield features[batch_indices], labels[batch_indices]
.. code:: python
def data_iter(batch_size, features, labels):
num_examples = len(features)
indices = list(range(num_examples))
# The examples are read at random, in no particular order
random.shuffle(indices)
for i in range(0, num_examples, batch_size):
batch_indices = torch.tensor(
indices[i: min(i + batch_size, num_examples)])
yield features[batch_indices], labels[batch_indices]
.. code:: python
def data_iter(batch_size, features, labels):
num_examples = len(features)
indices = list(range(num_examples))
# The examples are read at random, in no particular order
random.shuffle(indices)
for i in range(0, num_examples, batch_size):
j = tf.constant(indices[i: min(i + batch_size, num_examples)])
yield tf.gather(features, j), tf.gather(labels, j)
In general, note that we want to use reasonably sized minibatches to
take advantage of the GPU hardware, which excels at parallelizing
operations. Because each example can be fed through our models in
parallel and the gradient of the loss function for each example can also
be taken in parallel, GPUs allow us to process hundreds of examples in
scarcely more time than it might take to process just a single example.
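To see why batching helps, here is a minimal sketch (PyTorch for
illustration) contrasting a Python-level loop over examples with the
single batched operation that exposes all of the work to the hardware at
once:

.. code:: python

    import torch

    X = torch.randn(256, 2)
    w = torch.randn(2, 1)
    slow = torch.stack([x @ w for x in X])  # one example at a time
    fast = X @ w                            # the whole minibatch at once
    print(torch.allclose(slow, fast))       # True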
To build some intuition, let us read and print the first small batch of
data examples. The shape of the features in each minibatch tells us both
the minibatch size and the number of input features. Likewise, our
minibatch of labels will have the shape (``batch_size``, 1).
.. code:: python
batch_size = 10
for X, y in data_iter(batch_size, features, labels):
print(X, '\n', y)
break
.. parsed-literal::
:class: output
[[-0.37305322 -0.1837693 ]
[-0.9276405 0.8174909 ]
[-0.55512375 0.16916618]
[ 0.7705697 -1.1428461 ]
[-0.02261727 0.41016445]
[-0.05929231 0.61196524]
[ 0.29303032 -0.06429045]
[ 0.43054336 0.8985137 ]
[ 0.49453422 -0.01003418]
[ 0.15275687 -2.1776834 ]]
[[ 4.0894604]
[-0.4445296]
[ 2.5186574]
[ 9.613484 ]
[ 2.7589145]
[ 2.0064368]
[ 5.009016 ]
[ 1.9865699]
[ 5.225176 ]
[11.9052725]]
.. code:: python
batch_size = 10
for X, y in data_iter(batch_size, features, labels):
print(X, '\n', y)
break
.. parsed-literal::
:class: output
tensor([[-1.2557, 0.6496],
[-0.1099, -1.0704],
[-0.1359, 0.0344],
[ 1.6839, -0.2851],
[-0.3902, -1.0307],
[ 0.5866, -0.8784],
[ 0.0651, -0.3029],
[ 0.9686, 0.2121],
[ 1.0732, 0.4910],
[ 0.8645, 0.8160]])
tensor([[-0.5243],
[ 7.6175],
[ 3.8289],
[ 8.5273],
[ 6.9201],
[ 8.3418],
[ 5.3555],
[ 5.4123],
[ 4.6747],
[ 3.1529]])
.. code:: python
batch_size = 10
for X, y in data_iter(batch_size, features, labels):
print(X, '\n', y)
break
.. parsed-literal::
:class: output
tf.Tensor(
[[-1.3960032 0.0360745 ]
[-0.5773766 -1.4101526 ]
[-1.9878585 -1.4106326 ]
[-0.17404082 -0.8305929 ]
[ 1.611311 1.4760311 ]
[-0.13529034 -1.6472582 ]
[ 0.22689229 -0.03227513]
[-0.8304844 0.79024494]
[ 0.6619868 -0.736094 ]
[ 1.4794912 -0.17747918]], shape=(10, 2), dtype=float32)
tf.Tensor(
[[ 1.2760046 ]
[ 7.8383718 ]
[ 5.028774 ]
[ 6.676344 ]
[ 2.403925 ]
[ 9.539659 ]
[ 4.7502227 ]
[-0.14387447]
[ 8.018497 ]
[ 7.7552547 ]], shape=(10, 1), dtype=float32)
As we run the iteration, we obtain distinct minibatches successively
until the entire dataset has been exhausted (try this). While the
iteration implemented above is good for didactic purposes, it is
inefficient in ways that might get us in trouble on real problems. For
example, it requires that we load all the data into memory and that we
perform lots of random memory accesses.
in a deep learning framework are considerably more efficient and they
can deal with both data stored in files and data fed via data streams.
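For comparison, here is a minimal sketch of the built-in alternative
(PyTorch assumed for illustration, reusing the ``features`` and
``labels`` tensors from the PyTorch tab above): ``TensorDataset`` and
``DataLoader`` yield shuffled minibatches without our hand-rolled
iterator.

.. code:: python

    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(features, labels)
    data_loader = DataLoader(dataset, batch_size=10, shuffle=True)
    X, y = next(iter(data_loader))
    print(X.shape, y.shape)  # torch.Size([10, 2]) torch.Size([10, 1])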
Initializing Model Parameters
-----------------------------
Before we can begin optimizing our model's parameters by minibatch
stochastic gradient descent, we need to have some parameters in the
first place. In the following code, we initialize weights by sampling
random numbers from a normal distribution with mean 0 and a standard
deviation of 0.01, and setting the bias to 0.
.. code:: python
w = np.random.normal(0, 0.01, (2, 1))
b = np.zeros(1)
w.attach_grad()
b.attach_grad()
.. code:: python
w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
.. code:: python
w = tf.Variable(tf.random.normal(shape=(2, 1), mean=0, stddev=0.01),
trainable=True)
b = tf.Variable(tf.zeros(1), trainable=True)
After initializing our parameters, our next task is to update them until
they fit our data sufficiently well. Each update requires taking the
gradient of our loss function with respect to the parameters. Given this
gradient, we can update each parameter in the direction that may reduce
the loss.
Since nobody wants to compute gradients explicitly (this is tedious and
error prone), we use automatic differentiation, as introduced in
:numref:`sec_autograd`, to compute the gradient.
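To see the mechanics in isolation, here is a minimal sketch (PyTorch for
concreteness): calling ``backward()`` on a scalar loss populates the
``grad`` attribute of every parameter that requires gradients.

.. code:: python

    import torch

    w = torch.zeros(2, requires_grad=True)
    x = torch.tensor([1.0, 2.0])
    loss = (torch.dot(x, w) - 1.0) ** 2 / 2  # squared loss on one example
    loss.backward()
    # d loss / d w = (x . w - 1) * x = -1 * [1, 2]
    print(w.grad)  # tensor([-1., -2.])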
Defining the Model
------------------
Next, we must define our model, relating its inputs and parameters to
its outputs. Recall that to calculate the output of the linear model, we
simply take the matrix-vector product of the input features
:math:`\mathbf{X}` and the model weights :math:`\mathbf{w}`, and add the
offset :math:`b` to each example. Note that below :math:`\mathbf{Xw}` is
a vector and :math:`b` is a scalar. Recall the broadcasting mechanism as
described in :numref:`subsec_broadcasting`. When we add a vector and a
scalar, the scalar is added to each component of the vector.
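A minimal broadcasting sketch (PyTorch for concreteness):

.. code:: python

    import torch

    Xw = torch.tensor([1.0, 2.0, 3.0])  # stands in for the vector Xw
    b = 0.5                             # a scalar bias
    print(Xw + b)  # tensor([1.5000, 2.5000, 3.5000])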
.. code:: python
def linreg(X, w, b): #@save
"""The linear regression model."""
return np.dot(X, w) + b
.. code:: python
def linreg(X, w, b): #@save
"""The linear regression model."""
return torch.matmul(X, w) + b
.. code:: python
def linreg(X, w, b): #@save
"""The linear regression model."""
return tf.matmul(X, w) + b
Defining the Loss Function
--------------------------
Since updating our model requires taking the gradient of our loss
function, we ought to define the loss function first. Here we will use
the squared loss function as described in
:numref:`sec_linear_regression`. In the implementation, we need to
reshape the true values ``y`` to match the shape of the predictions
``y_hat``. The result returned by the following function will then have
the same shape as ``y_hat``.
.. code:: python
def squared_loss(y_hat, y): #@save
"""Squared loss."""
return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
.. code:: python
def squared_loss(y_hat, y): #@save
"""Squared loss."""
return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
.. code:: python
def squared_loss(y_hat, y): #@save
"""Squared loss."""
return (y_hat - tf.reshape(y, y_hat.shape)) ** 2 / 2
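The reshape is not cosmetic. If ``y`` had shape (``batch_size``,) while
``y_hat`` had shape (``batch_size``, 1), subtracting them without
reshaping would broadcast to a (``batch_size``, ``batch_size``) matrix
and silently produce a wrong loss, as the following sketch (PyTorch for
concreteness) shows.

.. code:: python

    import torch

    y_hat = torch.ones(3, 1)
    y = torch.zeros(3)
    print((y_hat - y).shape)                       # torch.Size([3, 3]): silent bug
    print((y_hat - y.reshape(y_hat.shape)).shape)  # torch.Size([3, 1]): intended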
Defining the Optimization Algorithm
-----------------------------------
As we discussed in :numref:`sec_linear_regression`, linear regression
has a closed-form solution. However, this is not a book about linear
regression: it is a book about deep learning. Since none of the other
models that this book introduces can be solved analytically, we will
take this opportunity to introduce your first working example of
minibatch stochastic gradient descent.
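That said, the closed-form solution gives us a convenient sanity check
on whatever stochastic gradient descent finds. Here is a sketch via
least squares (PyTorch assumed, reusing ``features`` and ``labels`` from
the PyTorch tab; the bias is folded in as a column of ones).

.. code:: python

    import torch

    # Append a column of ones so the last coefficient plays the role of b
    X1 = torch.cat([features, torch.ones(features.shape[0], 1)], dim=1)
    theta = torch.linalg.lstsq(X1, labels).solution
    print(theta)  # approximately [2.0, -3.4, 4.2]^T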
At each step, using one minibatch randomly drawn from our dataset, we
will estimate the gradient of the loss with respect to our parameters.
Next, we will update our parameters in the direction that may reduce the
loss. The following code applies the minibatch stochastic gradient
descent update, given a set of parameters, a learning rate, and a batch
size. The size of the update step is determined by the learning rate
``lr``. Because our loss is calculated as a sum over the minibatch of
examples, we normalize our step size by the batch size (``batch_size``),
so that the magnitude of a typical step size does not depend heavily on
our choice of the batch size.
.. code:: python
def sgd(params, lr, batch_size): #@save
"""Minibatch stochastic gradient descent."""
for param in params:
param[:] = param - lr * param.grad / batch_size
.. code:: python
def sgd(params, lr, batch_size): #@save
"""Minibatch stochastic gradient descent."""
with torch.no_grad():
for param in params:
param -= lr * param.grad / batch_size
param.grad.zero_()
.. code:: python
def sgd(params, grads, lr, batch_size): #@save
"""Minibatch stochastic gradient descent."""
for param, grad in zip(params, grads):
param.assign_sub(lr*grad/batch_size)
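As a quick sanity check on the update rule (using the PyTorch variant of
``sgd`` defined above), one step should change each parameter by
``-lr * grad / batch_size`` and then clear the gradient.

.. code:: python

    import torch

    p = torch.tensor([1.0, 2.0], requires_grad=True)
    (3 * p.sum()).backward()         # p.grad is now [3., 3.]
    sgd([p], lr=0.1, batch_size=10)  # each entry moves by -0.1 * 3 / 10 = -0.03
    print(p)                         # tensor([0.9700, 1.9700], requires_grad=True)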
Training
--------
Now that we have all of the parts in place, we are ready to implement
the main training loop. It is crucial that you understand this code
because you will see nearly identical training loops over and over again
throughout your career in deep learning.
In each iteration, we will grab a minibatch of training examples, and
pass them through our model to obtain a set of predictions. After
calculating the loss, we initiate the backward pass through the
network, storing the gradients with respect to each parameter. Finally,
we will call the optimization algorithm ``sgd`` to update the model
parameters.
In summary, we will execute the following loop:
- Initialize parameters :math:`(\mathbf{w}, b)`
- Repeat until done
- Compute gradient
:math:`\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)`
- Update parameters
:math:`(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}`
In each *epoch*, we will iterate through the entire dataset (using the
``data_iter`` function) once, passing through every example in the
training dataset (assuming that the number of examples is divisible by
the batch size). The number of epochs ``num_epochs`` and the learning
rate ``lr`` are both hyperparameters, which we set here to 3 and 0.03,
respectively. Unfortunately, setting hyperparameters is tricky and
requires some adjustment by trial and error. We elide these details for
now but revisit them later in :numref:`chap_optimization`.
.. code:: python
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss
for epoch in range(num_epochs):
for X, y in data_iter(batch_size, features, labels):
with autograd.record():
l = loss(net(X, w, b), y) # Minibatch loss in `X` and `y`
# Because `l` has a shape (`batch_size`, 1) and is not a scalar
# variable, the elements in `l` are added together to obtain a new
# variable, on which gradients with respect to [`w`, `b`] are computed
l.backward()
sgd([w, b], lr, batch_size) # Update parameters using their gradient
train_l = loss(net(features, w, b), labels)
print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
.. parsed-literal::
:class: output
epoch 1, loss 0.025098
epoch 2, loss 0.000086
epoch 3, loss 0.000051
.. code:: python
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss
for epoch in range(num_epochs):
for X, y in data_iter(batch_size, features, labels):
l = loss(net(X, w, b), y) # Minibatch loss in `X` and `y`
# Compute gradient on `l` with respect to [`w`, `b`]
l.sum().backward()
sgd([w, b], lr, batch_size) # Update parameters using their gradient
with torch.no_grad():
train_l = loss(net(features, w, b), labels)
print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
.. parsed-literal::
:class: output
epoch 1, loss 0.033497
epoch 2, loss 0.000131
epoch 3, loss 0.000050
.. code:: python
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss
for epoch in range(num_epochs):
for X, y in data_iter(batch_size, features, labels):
with tf.GradientTape() as g:
l = loss(net(X, w, b), y) # Minibatch loss in `X` and `y`
# Compute gradient on l with respect to [`w`, `b`]
dw, db = g.gradient(l, [w, b])
# Update parameters using their gradient
sgd([w, b], [dw, db], lr, batch_size)
train_l = loss(net(features, w, b), labels)
print(f'epoch {epoch + 1}, loss {float(tf.reduce_mean(train_l)):f}')
.. parsed-literal::
:class: output
epoch 1, loss 0.029238
epoch 2, loss 0.000099
epoch 3, loss 0.000050
In this case, because we synthesized the dataset ourselves, we know
precisely what the true parameters are. Thus, we can evaluate our
success in training by comparing the true parameters with those that we
learned through our training loop. Indeed they turn out to be very close
to each other.
.. code:: python
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
.. parsed-literal::
:class: output
error in estimating w: [ 0.00016284 -0.00018811]
error in estimating b: [0.00048304]
.. code:: python
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
.. parsed-literal::
:class: output
error in estimating w: tensor([ 0.0006, -0.0003], grad_fn=<SubBackward0>)
error in estimating b: tensor([0.0012], grad_fn=<RsubBackward1>)
.. code:: python
print(f'error in estimating w: {true_w - tf.reshape(w, true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
.. parsed-literal::
:class: output
error in estimating w: [3.3664703e-04 9.5605850e-05]
error in estimating b: [0.00059509]
Note that we should not take it for granted that we are able to recover
the parameters perfectly. However, in machine learning, we are typically
less concerned with recovering true underlying parameters, and more
concerned with parameters that lead to highly accurate prediction.
Fortunately, even on difficult optimization problems, stochastic
gradient descent can often find remarkably good solutions, owing partly
to the fact that, for deep networks, there exist many configurations of
the parameters that lead to highly accurate prediction.
Summary
-------
- We saw how a deep network can be implemented and optimized from
  scratch, using just tensors and automatic differentiation, without any
  need for defining layers or fancy optimizers.
- This section only scratches the surface of what is possible. In the
following sections, we will describe additional models based on the
concepts that we have just introduced and learn how to implement them
more concisely.
Exercises
---------
1. What would happen if we were to initialize the weights to zero? Would
   the algorithm still work?
2. Assume that you are `Georg Simon
   Ohm <https://en.wikipedia.org/wiki/Georg_Ohm>`__ trying to come up
   with a model relating voltage and current. Can you use automatic
   differentiation to learn the parameters of your model?
3. Can you use `Planck's
   Law <https://en.wikipedia.org/wiki/Planck%27s_law>`__ to determine
   the temperature of an object using spectral energy density?
4. What are the problems you might encounter if you wanted to compute
the second derivatives? How would you fix them?
5. Why is the ``reshape`` function needed in the ``squared_loss``
function?
6. Experiment using different learning rates to find out how fast the
loss function value drops.
7. If the number of examples cannot be divided by the batch size, what
happens to the ``data_iter`` function's behavior?