9.8. Beam Search

In Section 9.7, we predicted the output sequence token by token until the special end-of-sequence “<eos>” token was predicted. In this section, we will begin by formalizing this greedy search strategy and exploring issues with it, then compare it with alternative strategies: exhaustive search and beam search.

Before a formal introduction to greedy search, let us formalize the search problem using the same mathematical notation as in Section 9.7. At any time step \(t'\), the probability of the decoder output \(y_{t'}\) is conditional on the output subsequence \(y_1, \ldots, y_{t'-1}\) before \(t'\) and the context variable \(\mathbf{c}\) that encodes the information of the input sequence. To quantify computational cost, denote by \(\mathcal{Y}\) (which contains “<eos>”) the output vocabulary, so the cardinality \(\left|\mathcal{Y}\right|\) of this set is the vocabulary size. Let us also specify the maximum number of tokens of an output sequence as \(T'\). Our goal is then to search for an ideal output among all \(\mathcal{O}(\left|\mathcal{Y}\right|^{T'})\) possible output sequences. Of course, for all these output sequences, portions including and after “<eos>” will be discarded in the actual output.

9.8.1. Greedy Search

First, let us take a look at a simple strategy: greedy search. This strategy has been used to predict sequences in Section 9.7. In greedy search, at any time step \(t'\) of the output sequence, we search for the token with the highest conditional probability from \(\mathcal{Y}\), i.e.,

(9.8.1) \[y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),\]

as the output. Once “<eos>” is output or the output sequence has reached its maximum length \(T'\), the output sequence is complete.
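
To make this concrete, here is a minimal sketch of greedy decoding in plain Python. It is not the implementation of Section 9.7: the function next_token_probs, assumed to return the conditional probabilities \(P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})\) over \(\mathcal{Y}\) for the current prefix, and the token index eos_id are hypothetical stand-ins for the trained decoder and its vocabulary.

def greedy_search(next_token_probs, eos_id, max_len):
    """A sketch of greedy decoding; `next_token_probs(prefix)` is assumed to
    return a list of conditional probabilities P(y | prefix, c) over Y."""
    output = []
    for _ in range(max_len):
        probs = next_token_probs(output)
        # Pick the token with the highest conditional probability, as in (9.8.1).
        y = max(range(len(probs)), key=lambda i: probs[i])
        output.append(y)
        if y == eos_id:  # stop once "<eos>" has been output
            break
    return output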

So what can go wrong with greedy search? In fact, the optimal sequence should be the output sequence with the maximum \(\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})\), which is the conditional probability of generating an output sequence based on the input sequence. Unfortunately, there is no guarantee that the optimal sequence will be obtained by greedy search.

Fig. 9.8.1 At each time step, greedy search selects the token with the highest conditional probability.

Let us illustrate it with an example. Suppose that there are four tokens “A”, “B”, “C”, and “<eos>” in the output vocabulary. In Fig. 9.8.1, the four numbers under each time step represent the conditional probabilities of generating “A”, “B”, “C”, and “<eos>” at that time step, respectively.
At each time step, greedy search selects the token with the highest conditional probability. Therefore, the output sequence “A”, “B”, “C”, and “<eos>” will be predicted in Fig. 9.8.1. The conditional probability of this output sequence is \(0.5\times0.4\times0.4\times0.6 = 0.048\).

Fig. 9.8.2 The four numbers under each time step represent the conditional probabilities of generating “A”, “B”, “C”, and “<eos>” at that time step. At time step 2, the token “C”, which has the second highest conditional probability, is selected.

Next, let us look at another example in Fig. 9.8.2. Unlike in Fig. 9.8.1, at time step 2 we select the token “C” in Fig. 9.8.2, which has the second highest conditional probability. Since the output subsequences at time steps 1 and 2, on which time step 3 is based, have changed from “A” and “B” in Fig. 9.8.1 to “A” and “C” in Fig. 9.8.2, the conditional probability of each token at time step 3 has also changed in Fig. 9.8.2. Suppose that we choose the token “B” at time step 3. Now time step 4 is conditional on the output subsequence at the first three time steps “A”, “C”, and “B”, which is different from “A”, “B”, and “C” in Fig. 9.8.1. Therefore, the conditional probability of generating each token at time step 4 in Fig. 9.8.2 is also different from that in Fig. 9.8.1. As a result, the conditional probability of the output sequence “A”, “C”, “B”, and “<eos>” in Fig. 9.8.2 is \(0.5\times0.3\times0.6\times0.6=0.054\), which is greater than that of greedy search in Fig. 9.8.1. In this example, the output sequence “A”, “B”, “C”, and “<eos>” obtained by greedy search is not an optimal sequence.
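
A quick computation confirms the two sequence probabilities; the per-step values are read off Fig. 9.8.1 and Fig. 9.8.2.

# The conditional probability of a sequence is the product of its
# per-step conditional probabilities.
p_greedy = 0.5 * 0.4 * 0.4 * 0.6       # "A", "B", "C", "<eos>" (Fig. 9.8.1)
p_alternative = 0.5 * 0.3 * 0.6 * 0.6  # "A", "C", "B", "<eos>" (Fig. 9.8.2)
print(p_greedy, p_alternative)         # approximately 0.048 and 0.054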

9.8.2. Exhaustive Search

If the goal is to obtain the optimal sequence, we may consider using exhaustive search: exhaustively enumerate all the possible output sequences with their conditional probabilities, then output the one with the highest conditional probability.

Although we can use exhaustive search to obtain the optimal sequence, its computational cost \(\mathcal{O}(\left|\mathcal{Y}\right|^{T'})\) is likely to be excessively high. For example, when \(|\mathcal{Y}|=10000\) and \(T'=10\), we will need to evaluate \(10000^{10} = 10^{40}\) sequences. This is next to impossible! On the other hand, the computational cost of greedy search is \(\mathcal{O}(\left|\mathcal{Y}\right|T')\): it is usually significantly smaller than that of exhaustive search. For example, when \(|\mathcal{Y}|=10000\) and \(T'=10\), we only need to evaluate \(10000\times10=10^5\) sequences.
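
The sketch below shows what exhaustive search would look like; the function sequence_prob, assumed to return the conditional probability of an entire output sequence given the context variable, is a hypothetical stand-in for the decoder. Even this simple loop makes the cost obvious: it evaluates every one of the \(\mathcal{O}(\left|\mathcal{Y}\right|^{T'})\) sequences.

import itertools

# With |Y| = 10000 and T' = 10 the search space alone is astronomically large:
assert 10000 ** 10 == 10 ** 40

def exhaustive_search(sequence_prob, vocab, max_len):
    """A sketch of exhaustive search; feasible only for tiny vocabularies.
    `sequence_prob(seq)` is assumed to return P(y_1, ..., y_L | c)."""
    best_seq, best_prob = None, -1.0
    for length in range(1, max_len + 1):
        # Enumerate every sequence of this length over the vocabulary.
        for seq in itertools.product(vocab, repeat=length):
            p = sequence_prob(seq)
            if p > best_prob:
                best_seq, best_prob = seq, p
    return best_seq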

9.8.3. Beam Search

Decisions about sequence searching strategies lie on a spectrum, with easy questions at either extreme. What if only accuracy matters? Obviously, exhaustive search. What if only computational cost matters? Clearly, greedy search. A real-world application usually asks a complicated question, somewhere in between those two extremes.

Beam search is an improved version of greedy search. It has a hyperparameter named beam size, \(k\). At time step 1, we select the \(k\) tokens with the highest conditional probabilities. Each of them will be the first token of one of \(k\) candidate output sequences. At each subsequent time step, based on the \(k\) candidate output sequences from the previous time step, we continue to select the \(k\) candidate output sequences with the highest conditional probabilities from \(k\left|\mathcal{Y}\right|\) possible choices.

Fig. 9.8.3 The process of beam search (beam size: 2, maximum length of an output sequence: 3). The candidate output sequences are \(A\), \(C\), \(AB\), \(CE\), \(ABD\), and \(CED\).

Fig. 9.8.3 demonstrates the process of beam search with an example. Suppose that the output vocabulary contains only five elements: \(\mathcal{Y} = \{A, B, C, D, E\}\), where one of them is “<eos>”. Let the beam size be 2 and the maximum length of an output sequence be 3. At time step 1, suppose that the tokens with the highest conditional probabilities \(P(y_1 \mid \mathbf{c})\) are \(A\) and \(C\). At time step 2, for all \(y_2 \in \mathcal{Y},\) we compute

(9.8.2) \[\begin{aligned}P(A, y_2 \mid \mathbf{c}) &= P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) &= P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned}\]

and pick the largest two among these ten values, say \(P(A, B \mid \mathbf{c})\) and \(P(C, E \mid \mathbf{c})\). Then at time step 3, for all \(y_3 \in \mathcal{Y}\), we compute

(9.8.3) \[\begin{aligned}P(A, B, y_3 \mid \mathbf{c}) &= P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\ P(C, E, y_3 \mid \mathbf{c}) &= P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned}\]

and pick the largest two among these ten values, say \(P(A, B, D \mid \mathbf{c})\) and \(P(C, E, D \mid \mathbf{c}).\) As a result, we get six candidate output sequences: (i) \(A\); (ii) \(C\); (iii) \(A\), \(B\); (iv) \(C\), \(E\); (v) \(A\), \(B\), \(D\); and (vi) \(C\), \(E\), \(D\).
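
The selection at a single time step takes only a few lines. The numbers below are made up purely for illustration (Fig. 9.8.3 gives no numerical probabilities); what matters is the pattern: extend each of the \(k\) candidates by every token in \(\mathcal{Y}\), then keep the \(k\) largest of the resulting \(k\left|\mathcal{Y}\right|\) products.

import heapq

# Hypothetical values: suppose P(A | c) = 0.5 and P(C | c) = 0.3 after step 1.
beam = {('A',): 0.5, ('C',): 0.3}
vocab = ['A', 'B', 'C', 'D', 'E']
# Hypothetical conditionals P(y_2 | prefix, c) for each candidate prefix.
cond = {('A',): {'A': 0.1, 'B': 0.4, 'C': 0.2, 'D': 0.2, 'E': 0.1},
        ('C',): {'A': 0.1, 'B': 0.1, 'C': 0.2, 'D': 0.2, 'E': 0.4}}

# Extend each of the k = 2 candidates by all |Y| = 5 tokens: 10 products.
candidates = {prefix + (y,): p * cond[prefix][y]
              for prefix, p in beam.items() for y in vocab}
# Keep the k candidates with the highest joint probability.
beam = dict(heapq.nlargest(2, candidates.items(), key=lambda kv: kv[1]))
print(beam)  # keeps ('A', 'B') with probability 0.2 and ('C', 'E') with about 0.12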

In the end, we obtain the set of final candidate output sequences based on these six sequences (e.g., discard portions including and after “<eos>”). Then, as the output sequence, we choose the candidate that maximizes the following score:

(9.8.4) \[\frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}\mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),\]

where \(L\) is the length of the final candidate sequence and \(\alpha\) is usually set to 0.75. Since a longer sequence has more logarithmic terms in the summation of (9.8.4), the term \(L^\alpha\) in the denominator penalizes long sequences.
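
Putting the pieces together, the sketch below runs the whole procedure: extend the beam, set aside candidates that emit “<eos>”, and finally pick the candidate with the highest length-normalized score from (9.8.4). As before, next_token_probs is an assumed interface to the decoder rather than the model of Section 9.7; log probabilities are summed to avoid numerical underflow from multiplying many small numbers.

import heapq
import math

def beam_search(next_token_probs, vocab_size, eos_id, beam_size=2,
                max_len=10, alpha=0.75):
    """A sketch of beam search; `next_token_probs(prefix)` is assumed to
    return conditional probabilities P(y | prefix, c) over the vocabulary."""
    beams = [([], 0.0)]  # each entry: (sequence, sum of log-probabilities)
    finished = []        # final candidate output sequences
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            probs = next_token_probs(seq)
            for y in range(vocab_size):
                # Multiplying probabilities = adding log-probabilities.
                candidates.append((seq + [y],
                                   logp + math.log(max(probs[y], 1e-12))))
        # Keep the beam_size candidates with the highest joint probability.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[1])
        # Candidates that emit "<eos>" are moved to the final set.
        finished.extend(c for c in beams if c[0][-1] == eos_id)
        beams = [c for c in beams if c[0][-1] != eos_id]
        if not beams:
            break
    finished.extend(beams)  # sequences that hit max_len are candidates too
    # Choose the candidate with the highest length-normalized score (9.8.4).
    best_seq, _ = max(finished, key=lambda c: c[1] / len(c[0]) ** alpha)
    return best_seq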

The computational cost of beam search is \(\mathcal{O}(k\left|\mathcal{Y}\right|T')\). This result is in between that of greedy search and that of exhaustive search. In fact, greedy search can be treated as a special type of beam search with a beam size of 1. With a flexible choice of the beam size, beam search provides a tradeoff between accuracy and computational cost.

9.8.4. Summary

  • Sequence searching strategies include greedy search, exhaustive search, and beam search.

  • Beam search provides a tradeoff between accuracy and computational cost via its flexible choice of the beam size.

9.8.5. Exercises

  1. Can we treat exhaustive search as a special type of beam search? Why or why not?

  2. Apply beam search to the machine translation problem in Section 9.7. How does the beam size affect the translation results and the prediction speed?

  3. In Section 8.5, we used language modeling to generate text following user-provided prefixes. Which kind of search strategy does it use? Can you improve it?
