.. _sec_word2vec:
Word Embedding (word2vec)
=========================
Natural language is a complex system used to express meanings. In this
system, words are the basic unit of the meaning. As the name implies,
*word vectors* are vectors used to represent words, and can also be
considered as feature vectors or representations of words. The technique
of mapping words to real vectors is called *word embedding*. In recent
years, word embedding has gradually become the basic knowledge of
natural language processing.
One-Hot Vectors Are a Bad Choice
--------------------------------
We used one-hot vectors to represent words (characters are words) in
:numref:`sec_rnn_scratch`. Suppose that the number of different words
in the dictionary (the dictionary size) is :math:`N`, and each word
corresponds to a different integer (index) from :math:`0` to
:math:`N−1`. To obtain the one-hot vector representation for any word
with index :math:`i`, we create a length-\ :math:`N` vector with all 0s
and set the element at position :math:`i` to 1. In this way, each word
is represented as a vector of length :math:`N`, and it can be used
directly by neural networks.
Although one-hot word vectors are easy to construct, they are usually
not a good choice. A main reason is that one-hot word vectors cannot
accurately express the similarity between different words, such as the
*cosine similarity* that we often use. For vectors
:math:`\mathbf{x}, \mathbf{y} \in \mathbb{R}^d`, their cosine similarity
is the cosine of the angle between them:
.. math:: \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \in [-1, 1].
Since the cosine similarity between one-hot vectors of any two different
words is 0, one-hot vectors cannot encode similarities among words.
Self-Supervised word2vec
------------------------
The `word2vec `__ tool was
proposed to address the above issue. It maps each word to a fixed-length
vector, and these vectors can better express the similarity and analogy
relationship among different words. The word2vec tool contains two
models, namely *skip-gram* :cite:`Mikolov.Sutskever.Chen.ea.2013` and
*continuous bag of words* (CBOW) :cite:`Mikolov.Chen.Corrado.ea.2013`.
For semantically meaningful representations, their training relies on
conditional probabilities that can be viewed as predicting some words
using some of their surrounding words in corpora. Since supervision
comes from the data without labels, both skip-gram and continuous bag of
words are self-supervised models.
In the following, we will introduce these two models and their training
methods.
.. _subsec_skip-gram:
The Skip-Gram Model
-------------------
The *skip-gram* model assumes that a word can be used to generate its
surrounding words in a text sequence. Take the text sequence "the",
"man", "loves", "his", "son" as an example. Let us choose "loves" as the
*center word* and set the context window size to 2. As shown in
:numref:`fig_skip_gram`, given the center word "loves", the skip-gram
model considers the conditional probability for generating the *context
words*: "the", "man", "his", and "son", which are no more than 2 words
away from the center word:
.. math:: P(\textrm{"the"},\textrm{"man"},\textrm{"his"},\textrm{"son"}\mid\textrm{"loves"}).
Assume that the context words are independently generated given the
center word (i.e., conditional independence). In this case, the above
conditional probability can be rewritten as
.. math:: P(\textrm{"the"}\mid\textrm{"loves"})\cdot P(\textrm{"man"}\mid\textrm{"loves"})\cdot P(\textrm{"his"}\mid\textrm{"loves"})\cdot P(\textrm{"son"}\mid\textrm{"loves"}).
.. _fig_skip_gram:
.. figure:: ../img/skip-gram.svg
The skip-gram model considers the conditional probability of
generating the surrounding context words given a center word.
In the skip-gram model, each word has two :math:`d`-dimensional-vector
representations for calculating conditional probabilities. More
concretely, for any word with index :math:`i` in the dictionary, denote
by :math:`\mathbf{v}_i\in\mathbb{R}^d` and
:math:`\mathbf{u}_i\in\mathbb{R}^d` its two vectors when used as a
*center* word and a *context* word, respectively. The conditional
probability of generating any context word :math:`w_o` (with index
:math:`o` in the dictionary) given the center word :math:`w_c` (with
index :math:`c` in the dictionary) can be modeled by a softmax operation
on vector dot products:
.. math:: P(w_o \mid w_c) = \frac{\text{exp}(\mathbf{u}_o^\top \mathbf{v}_c)}{ \sum_{i \in \mathcal{V}} \text{exp}(\mathbf{u}_i^\top \mathbf{v}_c)},
:label: eq_skip-gram-softmax
where the vocabulary index set
:math:`\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}`. Given a text
sequence of length :math:`T`, where the word at time step :math:`t` is
denoted as :math:`w^{(t)}`. Assume that context words are independently
generated given any center word. For context window size :math:`m`, the
likelihood function of the skip-gram model is the probability of
generating all context words given any center word:
.. math:: \prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}),
where any time step that is less than :math:`1` or greater than
:math:`T` can be omitted.
Training
~~~~~~~~
The skip-gram model parameters are the center word vector and context
word vector for each word in the vocabulary. In training, we learn the
model parameters by maximizing the likelihood function (i.e., maximum
likelihood estimation). This is equivalent to minimizing the following
loss function:
.. math:: - \sum_{t=1}^{T} \sum_{-m \leq j \leq m,\ j \neq 0} \text{log}\, P(w^{(t+j)} \mid w^{(t)}).
When using stochastic gradient descent to minimize the loss, in each
iteration we can randomly sample a shorter subsequence to calculate the
(stochastic) gradient for this subsequence to update the model
parameters. To calculate this (stochastic) gradient, we need to obtain
the gradients of the log conditional probability with respect to the
center word vector and the context word vector. In general, according to
:eq:`eq_skip-gram-softmax` the log conditional probability
involving any pair of the center word :math:`w_c` and the context word
:math:`w_o` is
.. math:: \log P(w_o \mid w_c) =\mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathcal{V}} \text{exp}(\mathbf{u}_i^\top \mathbf{v}_c)\right).
:label: eq_skip-gram-log
Through differentiation, we can obtain its gradient with respect to the
center word vector :math:`\mathbf{v}_c` as
.. math:: \begin{aligned}\frac{\partial \text{log}\, P(w_o \mid w_c)}{\partial \mathbf{v}_c}&= \mathbf{u}_o - \frac{\sum_{j \in \mathcal{V}} \exp(\mathbf{u}_j^\top \mathbf{v}_c)\mathbf{u}_j}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\\&= \mathbf{u}_o - \sum_{j \in \mathcal{V}} \left(\frac{\text{exp}(\mathbf{u}_j^\top \mathbf{v}_c)}{ \sum_{i \in \mathcal{V}} \text{exp}(\mathbf{u}_i^\top \mathbf{v}_c)}\right) \mathbf{u}_j\\&= \mathbf{u}_o - \sum_{j \in \mathcal{V}} P(w_j \mid w_c) \mathbf{u}_j.\end{aligned}
:label: eq_skip-gram-grad
Note that the calculation in :eq:`eq_skip-gram-grad` requires the
conditional probabilities of all words in the dictionary with
:math:`w_c` as the center word. The gradients for the other word vectors
can be obtained in the same way.
After training, for any word with index :math:`i` in the dictionary, we
obtain both word vectors :math:`\mathbf{v}_i` (as the center word) and
:math:`\mathbf{u}_i` (as the context word). In natural language
processing applications, the center word vectors of the skip-gram model
are typically used as the word representations.
The Continuous Bag of Words (CBOW) Model
----------------------------------------
The *continuous bag of words* (CBOW) model is similar to the skip-gram
model. The major difference from the skip-gram model is that the
continuous bag of words model assumes that a center word is generated
based on its surrounding context words in the text sequence. For
example, in the same text sequence "the", "man", "loves", "his", and
"son", with "loves" as the center word and the context window size being
2, the continuous bag of words model considers the conditional
probability of generating the center word "loves" based on the context
words "the", "man", "his" and "son" (as shown in :numref:`fig_cbow`),
which is
.. math:: P(\textrm{"loves"}\mid\textrm{"the"},\textrm{"man"},\textrm{"his"},\textrm{"son"}).
.. figure:: ../img/cbow.svg
The continuous bag of words model considers the conditional
probability of generating the center word given its surrounding
context words.
:label: fig_cbow
Since there are multiple context words in the continuous bag of words
model, these context word vectors are averaged in the calculation of the
conditional probability. Specifically, for any word with index :math:`i`
in the dictionary, denote by :math:`\mathbf{v}_i\in\mathbb{R}^d` and
:math:`\mathbf{u}_i\in\mathbb{R}^d` its two vectors when used as a
*context* word and a *center* word (meanings are switched in the
skip-gram model), respectively. The conditional probability of
generating any center word :math:`w_c` (with index :math:`c` in the
dictionary) given its surrounding context words
:math:`w_{o_1}, \ldots, w_{o_{2m}}` (with index
:math:`o_1, \ldots, o_{2m}` in the dictionary) can be modeled by
.. math:: P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\text{exp}\left(\frac{1}{2m}\mathbf{u}_c^\top (\mathbf{v}_{o_1} + \ldots, + \mathbf{v}_{o_{2m}}) \right)}{ \sum_{i \in \mathcal{V}} \text{exp}\left(\frac{1}{2m}\mathbf{u}_i^\top (\mathbf{v}_{o_1} + \ldots, + \mathbf{v}_{o_{2m}}) \right)}.
:label: fig_cbow-full
For brevity, let :math:`\mathcal{W}_o= \{w_{o_1}, \ldots, w_{o_{2m}}\}`
and
:math:`\bar{\mathbf{v}}_o = \left(\mathbf{v}_{o_1} + \ldots, + \mathbf{v}_{o_{2m}} \right)/(2m)`.
Then :eq:`fig_cbow-full` can be simplified as
.. math:: P(w_c \mid \mathcal{W}_o) = \frac{\exp\left(\mathbf{u}_c^\top \bar{\mathbf{v}}_o\right)}{\sum_{i \in \mathcal{V}} \exp\left(\mathbf{u}_i^\top \bar{\mathbf{v}}_o\right)}.
Given a text sequence of length :math:`T`, where the word at time step
:math:`t` is denoted as :math:`w^{(t)}`. For context window size
:math:`m`, the likelihood function of the continuous bag of words model
is the probability of generating all center words given their context
words:
.. math:: \prod_{t=1}^{T} P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}).
Training
~~~~~~~~
Training continuous bag of words models is almost the same as training
skip-gram models. The maximum likelihood estimation of the continuous
bag of words model is equivalent to minimizing the following loss
function:
.. math:: -\sum_{t=1}^T \text{log}\, P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}).
Notice that
.. math:: \log\,P(w_c \mid \mathcal{W}_o) = \mathbf{u}_c^\top \bar{\mathbf{v}}_o - \log\,\left(\sum_{i \in \mathcal{V}} \exp\left(\mathbf{u}_i^\top \bar{\mathbf{v}}_o\right)\right).
Through differentiation, we can obtain its gradient with respect to any
context word vector
:math:`\mathbf{v}_{o_i}`\ (:math:`i = 1, \ldots, 2m`) as
.. math:: \frac{\partial \log\, P(w_c \mid \mathcal{W}_o)}{\partial \mathbf{v}_{o_i}} = \frac{1}{2m} \left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} \frac{\exp(\mathbf{u}_j^\top \bar{\mathbf{v}}_o)\mathbf{u}_j}{ \sum_{i \in \mathcal{V}} \text{exp}(\mathbf{u}_i^\top \bar{\mathbf{v}}_o)} \right) = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} P(w_j \mid \mathcal{W}_o) \mathbf{u}_j \right).
:label: eq_cbow-gradient
The gradients for the other word vectors can be obtained in the same
way. Unlike the skip-gram model, the continuous bag of words model
typically uses context word vectors as the word representations.
Summary
-------
- Word vectors are vectors used to represent words, and can also be
considered as feature vectors or representations of words. The
technique of mapping words to real vectors is called word embedding.
- The word2vec tool contains both the skip-gram and continuous bag of
words models.
- The skip-gram model assumes that a word can be used to generate its
surrounding words in a text sequence; while the continuous bag of
words model assumes that a center word is generated based on its
surrounding context words.
Exercises
---------
1. What is the computational complexity for calculating each gradient?
What could be the issue if the dictionary size is huge?
2. Some fixed phrases in English consist of multiple words, such as "new
york". How to train their word vectors? Hint: see Section 4 in the
word2vec paper :cite:`Mikolov.Sutskever.Chen.ea.2013`.
3. Let us reflect on the word2vec design by taking the skip-gram model
as an example. What is the relationship between the dot product of
two word vectors in the skip-gram model and the cosine similarity?
For a pair of words with similar semantics, why may the cosine
similarity of their word vectors (trained by the skip-gram model) be
high?
`Discussions `__