# 8.4. Recurrent Neural Networks¶

In Section 8.3 we introduced \(n\)-gram models, where the conditional probability of word \(x_t\) at position \(t\) only depends on the \(n-1\) previous words. If we want to check the possible effect of words earlier than \(t-(n-1)\) on \(x_t\), we need to increase \(n\). However, the number of model parameters would also increase exponentially with it, as we need to store \(|V|^n\) numbers for a vocabulary \(V\). Hence, rather than modeling \(p(x_t|x_{t-1}, \ldots x_{t-n+1})\) it is preferable to use a latent variable model in which we have

Here \(h_t\) is a *latent variable* that stores the sequence
information. A latent variable is also called as *hidden variable*,
*hidden state* or *hidden state variable*. The hidden state at time
\(t\) could be computed based on both input \(x_{t-1}\) and
hidden state \(h_{t-1}\) at time \(t-1\), that is

For a sufficiently powerful function \(f\), the latent variable model is not an approximation. After all, \(h_t\) could simply store all the data it observed so far. We discussed this in Section 8.1. But it could potentially makes both computation and storage expensive.

Note that we also use \(h\) to denote by the number of hidden units
of a hidden layer. Hidden layers and hidden states refer to two very
different concepts. Hidden layers are, as explained, layers that are
hidden from view on the path from input to output. Hidden states are
technically speaking *inputs* to whatever we do at a given step.
Instead, they can only be computed by looking at data at previous
iterations. In this sense they have much in common with latent variable
models in statistics, such as clustering or topic models where the
clusters affect the output but cannot be directly observed.

Recurrent neural networks are neural networks with hidden states. Before introducing this model, let’s first revisit the multi-layer perceptron introduced in Section 4.1.

## 8.4.3. Steps in a Language Model¶

Now we illustrate how RNNs can be used to build a language model. For simplicity of illustration we use words rather than characters, since the former are easier to comprehend. Let the number of mini-batch examples be 1, and the sequence of the text be the beginning of our dataset, i.e. “the time machine by h. g. wells”. The figure below illustrates how to estimate the next word based on the present and previous words. During the training process, we run a softmax operation on the output from the output layer for each time step, and then use the cross-entropy loss function to compute the error between the result and the label. Due to the recurrent computation of the hidden state in the hidden layer, the output of time step 3, \(\mathbf{O}_3\), is determined by the text sequence “the”, “time”, “machine”. Since the next word of the sequence in the training data is “by”, the loss of time step 3 will depend on the probability distribution of the next word generated based on the sequence “the”, “time”, “machine” and the label “by” of this time step.

In practice, each word is presented by a \(d\) dimensional vector, and we use a batch size \(n>1\), therefore, the input \(\mathbf X_t\) at time step \(t\) will be a \(n\times d\) matrix, which is identical to what we discussed before.

## 8.4.4. Perplexity¶

Last, let’s discuss about how to measure the sequence model quality. One
way is to check how surprising the text is. A good language model is
able to predict with high accuracy what we will see next. Consider the
following continuations of the phrase `It is raining`

, as proposed by
different language models:

`It is raining outside`

`It is raining banana tree`

`It is raining piouw;kcj pwepoiut`

In terms of quality, example 1 is clearly the best. The words are
sensible and logically coherent. While it might not quite so accurately
reflect which word follows (`in San Francisco`

and `in winter`

would
have been perfectly reasonable extensions), the model is able to capture
which kind of word follows. Example 2 is considerably worse by producing
a nonsensical and borderline dysgrammatical extension. Nonetheless, at
least the model has learned how to spell words and some degree of
correlation between words. Lastly, example 3 indicates a poorly trained
model that doesn’t fit data.

We might measure the quality of the model by computing \(p(w)\),
i.e. the likelihood of the sequence. Unfortunately this is a number that
is hard to understand and difficult to compare. After all, shorter
sequences are *much* more likely than long ones, hence evaluating the
model on Tolstoy’s magnum opus ‘War and
Peace’ will
inevitably produce a much smaller likelihood than, say, on
Saint-Exupery’s novella ‘The Little
Prince’. What is
missing is the equivalent of an average.

Information Theory comes handy here. If we want to compress text we can ask about estimating the next symbol given the current set of symbols. A lower bound on the number of bits is given by \(-\log_2 p(x_t|x_{t-1}, \ldots x_1)\). A good language model should allow us to predict the next word quite accurately and thus it should allow us to spend very few bits on compressing the sequence. So we can measure it by the average number of bits that we need to spend.

This makes the performance on documents of different lengths comparable. For historical reasons scientists in natural language processing prefer to use a quantity called perplexity rather than bitrate. In a nutshell it is the exponential of the above:

It can be best understood as the harmonic mean of the number of real choices that we have when deciding which word to pick next. Note that perplexity naturally generalizes the notion of the cross-entropy loss defined when we introduced the softmax regression (Section 3.4). That is, for a single symbol both definitions are identical bar the fact that one is the exponential of the other. Let’s look at a number of cases:

In the best case scenario, the model always estimates the probability of the next symbol as \(1\). In this case the perplexity of the model is \(1\).

In the worst case scenario, the model always predicts the probability of the label category as 0. In this situation, the perplexity is infinite.

At the baseline, the model predicts a uniform distribution over all tokens. In this case the perplexity equals the size of the dictionary

`len(vocab)`

. In fact, if we were to store the sequence without any compression this would be the best we could do to encode it. Hence this provides a nontrivial upper bound that any model must satisfy.

## 8.4.5. Summary¶

A network that uses recurrent computation is called a recurrent neural network (RNN).

The hidden state of the RNN can capture historical information of the sequence up to the current time step.

The number of RNN model parameters does not grow as the number of time steps increases.

We can create language models using a character-level RNN.

## 8.4.6. Exercises¶

If we use an RNN to predict the next character in a text sequence, how many output dimensions do we need?

Can you design a mapping for which an RNN with hidden states is exact? Hint - what about a finite number of words?

What happens to the gradient if you backpropagate through a long sequence?

What are some of the problems associated with the simple sequence model described above?