9.8. Beam Search¶ Open the notebook in SageMaker Studio Lab
In Section 9.7, we predicted the output sequence token by token until the special end-of-sequence “<eos>” token is predicted. In this section, we will begin with formalizing this greedy search strategy and exploring issues with it, then compare this strategy with other alternatives: exhaustive search and beam search.
Before a formal introduction to greedy search, let us formalize the search problem using the same mathematical notation from Section 9.7. At any time step \(t'\), the probability of the decoder output \(y_{t'}\) is conditional on the output subsequence \(y_1, \ldots, y_{t'-1}\) before \(t'\) and the context variable \(\mathbf{c}\) that encodes the information of the input sequence. To quantify computational cost, denote by \(\mathcal{Y}\) (it contains “<eos>”) the output vocabulary. So the cardinality \(\left|\mathcal{Y}\right|\) of this vocabulary set is the vocabulary size. Let us also specify the maximum number of tokens of an output sequence as \(T'\). As a result, our goal is to search for an ideal output from all the \(\mathcal{O}(\left|\mathcal{Y}\right|^{T'})\) possible output sequences. Of course, for all these output sequences, portions including and after “<eos>” will be discarded in the actual output.
9.8.1. Greedy Search¶
First, let us take a look at a simple strategy: greedy search. This strategy has been used to predict sequences in Section 9.7. In greedy search, at any time step \(t'\) of the output sequence, we search for the token with the highest conditional probability from \(\mathcal{Y}\), i.e.,
as the output. Once “<eos>” is outputted or the output sequence has reached its maximum length \(T'\), the output sequence is completed.
So what can go wrong with greedy search? In fact, the optimal sequence should be the output sequence with the maximum \(\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})\), which is the conditional probability of generating an output sequence based on the input sequence. Unfortunately, there is no guarantee that the optimal sequence will be obtained by greedy search.
Next, let us look at another example in Fig. 9.8.2. Unlike in Fig. 9.8.1, at time step 2 we select the token “C” in Fig. 9.8.2, which has the second highest conditional probability. Since the output subsequences at time steps 1 and 2, on which time step 3 is based, have changed from “A” and “B” in Fig. 9.8.1 to “A” and “C” in Fig. 9.8.2, the conditional probability of each token at time step 3 has also changed in Fig. 9.8.2. Suppose that we choose the token “B” at time step 3. Now time step 4 is conditional on the output subsequence at the first three time steps “A”, “C”, and “B”, which is different from “A”, “B”, and “C” in Fig. 9.8.1. Therefore, the conditional probability of generating each token at time step 4 in Fig. 9.8.2 is also different from that in Fig. 9.8.1. As a result, the conditional probability of the output sequence “A”, “C”, “B”, and “<eos>” in Fig. 9.8.2 is \(0.5\times0.3 \times0.6\times0.6=0.054\), which is greater than that of greedy search in Fig. 9.8.1. In this example, the output sequence “A”, “B”, “C”, and “<eos>” obtained by the greedy search is not an optimal sequence.
9.8.2. Exhaustive Search¶
If the goal is to obtain the optimal sequence, we may consider using exhaustive search: exhaustively enumerate all the possible output sequences with their conditional probabilities, then output the one with the highest conditional probability.
Although we can use exhaustive search to obtain the optimal sequence, its computational cost \(\mathcal{O}(\left|\mathcal{Y}\right|^{T'})\) is likely to be excessively high. For example, when \(|\mathcal{Y}|=10000\) and \(T'=10\), we will need to evaluate \(10000^{10} = 10^{40}\) sequences. This is next to impossible! On the other hand, the computational cost of greedy search is \(\mathcal{O}(\left|\mathcal{Y}\right|T')\): it is usually significantly smaller than that of exhaustive search. For example, when \(|\mathcal{Y}|=10000\) and \(T'=10\), we only need to evaluate \(10000\times10=10^5\) sequences.
9.8.3. Beam Search¶
Decisions about sequence searching strategies lie on a spectrum, with easy questions at either extreme. What if only accuracy matters? Obviously, exhaustive search. What if only computational cost matters? Clearly, greedy search. A real-world application usually asks a complicated question, somewhere in between those two extremes.
Beam search is an improved version of greedy search. It has a hyperparameter named beam size, \(k\). At time step 1, we select \(k\) tokens with the highest conditional probabilities. Each of them will be the first token of \(k\) candidate output sequences, respectively. At each subsequent time step, based on the \(k\) candidate output sequences at the previous time step, we continue to select \(k\) candidate output sequences with the highest conditional probabilities from \(k\left|\mathcal{Y}\right|\) possible choices.
Fig. 9.8.3 demonstrates the process of beam search with an example. Suppose that the output vocabulary contains only five elements: \(\mathcal{Y} = \{A, B, C, D, E\}\), where one of them is “<eos>”. Let the beam size be 2 and the maximum length of an output sequence be 3. At time step 1, suppose that the tokens with the highest conditional probabilities \(P(y_1 \mid \mathbf{c})\) are \(A\) and \(C\). At time step 2, for all \(y_2 \in \mathcal{Y},\) we compute
and pick the largest two among these ten values, say \(P(A, B \mid \mathbf{c})\) and \(P(C, E \mid \mathbf{c})\). Then at time step 3, for all \(y_3 \in \mathcal{Y}\), we compute
and pick the largest two among these ten values, say \(P(A, B, D \mid \mathbf{c})\) and \(P(C, E, D \mid \mathbf{c}).\) As a result, we get six candidates output sequences: (i) \(A\); (ii) \(C\); (iii) \(A\), \(B\); (iv) \(C\), \(E\); (v) \(A\), \(B\), \(D\); and (vi) \(C\), \(E\), \(D\).
In the end, we obtain the set of final candidate output sequences based on these six sequences (e.g., discard portions including and after “<eos>”). Then we choose the sequence with the highest of the following score as the output sequence:
where \(L\) is the length of the final candidate sequence and \(\alpha\) is usually set to 0.75. Since a longer sequence has more logarithmic terms in the summation of (9.8.4), the term \(L^\alpha\) in the denominator penalizes long sequences.
The computational cost of beam search is \(\mathcal{O}(k\left|\mathcal{Y}\right|T')\). This result is in between that of greedy search and that of exhaustive search. In fact, greedy search can be treated as a special type of beam search with a beam size of 1. With a flexible choice of the beam size, beam search provides a tradeoff between accuracy versus computational cost.
9.8.4. Summary¶
Sequence searching strategies include greedy search, exhaustive search, and beam search.
Beam search provides a tradeoff between accuracy versus computational cost via its flexible choice of the beam size.
9.8.5. Exercises¶
Can we treat exhaustive search as a special type of beam search? Why or why not?
Apply beam search in the machine translation problem in Section 9.7. How does the beam size affect the translation results and the prediction speed?
We used language modeling for generating text following user-provided prefixes in Section 8.5. Which kind of search strategy does it use? Can you improve it?