13.4. Implementation of Word2vec

In this section, we will train the skip-gram model defined in Section 13.1.

First, import the packages and modules required for the experiment, and load the PTB data set.

import d2l
from mxnet import autograd, gluon, nd
from mxnet.gluon import nn

batch_size, max_window_size, num_noise_words = 512, 5, 5
data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                     num_noise_words)

13.4.1. The Skip-Gram Model

We will implement the skip-gram model by using embedding layers and mini-batch multiplication. These methods are also often used to implement other natural language processing applications.

Embedding Layer

The layer that maps a word index to its word vector is called the embedding layer. In Gluon, it can be obtained by creating an nn.Embedding instance. The weight of the embedding layer is a matrix whose number of rows is the dictionary size (input_dim) and whose number of columns is the dimension of each word vector (output_dim). We set the dictionary size to 20 and the word vector dimension to 4.

embed = nn.Embedding(input_dim=20, output_dim=4)
embed.initialize()
embed.weight
Parameter embedding0_weight (shape=(20, 4), dtype=float32)

The input of the embedding layer is the index of the word. When we enter the index \(i\) of a word, the embedding layer returns the \(i\)th row of the weight matrix as its word vector. Below we enter an index of shape (2,3) into the embedding layer. Because the dimension of the word vector is 4, we obtain a word vector of shape (2,3,4).
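As a rough illustration of this lookup, the sketch below reproduces it with NumPy row indexing instead of Gluon. The weight values are random placeholders, and the names weight and idx are assumptions made for this example; only the shapes mirror the text.

```python
import numpy as np

# Hypothetical weight matrix: 20 words (rows), 4-dimensional vectors (columns).
rng = np.random.default_rng(0)
weight = rng.normal(scale=0.01, size=(20, 4))

# Word indices of shape (2, 3), matching the shape used in the text.
idx = np.array([[1, 2, 3], [4, 5, 6]])

# Integer-array indexing returns one row of `weight` per index,
# so a trailing axis of length 4 is appended to the index shape.
vectors = weight[idx]

print(vectors.shape)  # (2, 3, 4)
```

The Gluon embedding layer applies the same kind of lookup to its own weight matrix: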

x = nd.array([[1, 2, 3], [4, 5, 6]])
embed(x)
[[[ 0.01438687  0.05011239  0.00628365  0.04861524]
  [-0.01068833  0.01729892  0.02042518 -0.01618656]
  [-0.00873779 -0.02834515  0.05484822 -0.06206018]]

 [[ 0.06491279 -0.03182812 -0.01631819 -0.00312688]
  [ 0.0408415   0.04370362  0.00404529 -0.0028032 ]
  [ 0.00952624 -0.01501013  0.05958354  0.04705103]]]
<NDArray 2x3x4 @cpu(0)>

Mini-batch Multiplication

We can multiply the matrices in two mini-batches one by one using the mini-batch multiplication operation batch_dot. Suppose the first batch contains \(n\) matrices \(\boldsymbol{X}_1, \ldots, \boldsymbol{X}_n\) with a shape of \(a\times b\), and the second batch contains \(n\) matrices \(\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n\) with a shape of \(b\times c\). The output of matrix multiplication on these two batches is \(n\) matrices \(\boldsymbol{X}_1\boldsymbol{Y}_1, \ldots, \boldsymbol{X}_n\boldsymbol{Y}_n\) with a shape of \(a\times c\). Therefore, given two NDArrays of shape (\(n\), \(a\), \(b\)) and (\(n\), \(b\), \(c\)), the shape of the mini-batch multiplication output is (\(n\), \(a\), \(c\)).
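The shape rule can be cross-checked outside MXNet. The sketch below is a NumPy stand-in rather than batch_dot itself: it compares np.matmul, which treats the leading axis as the batch axis, against an explicit loop over the \(n\) matrix pairs. The sizes and random inputs are arbitrary choices for illustration.

```python
import numpy as np

n, a, b, c = 2, 1, 4, 6
rng = np.random.default_rng(0)
X = rng.normal(size=(n, a, b))
Y = rng.normal(size=(n, b, c))

# Batched product: one (a, b) x (b, c) multiplication per batch entry.
batched = np.matmul(X, Y)

# The same products computed pair by pair and stacked back together.
looped = np.stack([X[i] @ Y[i] for i in range(n)])

print(batched.shape)                 # (2, 1, 6)
print(np.allclose(batched, looped))  # True
```

In MXNet, batch_dot yields the same output shape: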

X = nd.ones((2, 1, 4))
Y = nd.ones((2, 4, 6))
nd.batch_dot(X, Y).shape
(2, 1, 6)

Skip-gram Model Forward Calculation

In forward calculation, the input of the skip-gram model consists of the central target word index center and the concatenated context and noise word indices contexts_and_negatives. The center variable has the shape (batch size, 1), while the contexts_and_negatives variable has the shape (batch size, max_len). Both variables are first transformed from word indices to word vectors by the embedding layers, and mini-batch multiplication then produces an output of shape (batch size, 1, max_len). Each element in the output is the inner product of the central target word vector and a context word vector or noise word vector.

def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = nd.batch_dot(v, u.swapaxes(1, 2))
    return pred

Verify that the output shape is (batch size, 1, max_len).

skip_gram(nd.ones((2,1)), nd.ones((2,4)), embed, embed).shape
(2, 1, 4)
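To make the claim about inner products concrete, here is a minimal NumPy sketch of the same forward pass. It is not the Gluon implementation: skip_gram_np, weight_v, and weight_u are hypothetical names, the two weight matrices are random placeholders, and, unlike the shape check above, separate matrices are used for central and context words.

```python
import numpy as np

def skip_gram_np(center, contexts_and_negatives, weight_v, weight_u):
    """NumPy stand-in for the forward pass (illustrative only)."""
    v = weight_v[center]                  # (batch_size, 1, embed_size)
    u = weight_u[contexts_and_negatives]  # (batch_size, max_len, embed_size)
    # Swap the last two axes of u so each example computes v @ u^T.
    return np.matmul(v, u.transpose(0, 2, 1))  # (batch_size, 1, max_len)

rng = np.random.default_rng(0)
vocab_size, embed_size = 20, 4
weight_v = rng.normal(size=(vocab_size, embed_size))  # central word vectors
weight_u = rng.normal(size=(vocab_size, embed_size))  # context/noise vectors

center = np.array([[3], [7]])
contexts_and_negatives = np.array([[1, 2, 5, 0], [4, 6, 8, 9]])

pred = skip_gram_np(center, contexts_and_negatives, weight_v, weight_u)
print(pred.shape)  # (2, 1, 4)

# Entry [1, 0, 2] pairs central word 7 with context/noise word 8.
expected = weight_v[7] @ weight_u[8]
print(np.isclose(pred[1, 0, 2], expected))  # True
```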

13.4.2. Training

Before training the word embedding model, we need to define the loss function of the model.

Binary Cross Entropy Loss Function

According to the definition of the loss function in negative sampling, we can directly use Gluon’s binary cross entropy loss function SigmoidBinaryCrossEntropyLoss.

loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()

We can use the mask variable to specify which predicted values and labels in the mini-batch participate in the loss calculation: when the mask is 1, the predicted value and label at the corresponding position participate in the calculation of the loss function; when the mask is 0, they do not. As we mentioned earlier, mask variables can be used to avoid the effect of padding on the loss calculation.

Given two identical examples, different masks lead to different loss values.

pred = nd.array([[.5]*4]*2)
label = nd.array([[1,0,1,0]]*2)
mask = nd.array([[1, 1, 1, 1], [1, 1, 0, 0]])
loss(pred, label, mask)
[0.724077  0.3620385]
<NDArray 2 @cpu(0)>
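These two values can be reproduced by hand. The NumPy sketch below assumes that the loss treats the predictions as logits, multiplies each per-element loss by its mask value, and then averages over all positions in a row, masked ones included. That assumption is consistent with the values printed above, but masked_sigmoid_bce is a hypothetical helper, not Gluon's code.

```python
import numpy as np

def masked_sigmoid_bce(pred, label, mask):
    """Masked sigmoid binary cross entropy, averaged over every position."""
    # Numerically stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))]:
    # max(x, 0) - x*y + log(1 + exp(-|x|)), where s is the sigmoid.
    per_elem = (np.maximum(pred, 0) - pred * label
                + np.log1p(np.exp(-np.abs(pred))))
    # Masked positions contribute 0 but still count in the mean.
    return (per_elem * mask).mean(axis=1)

pred = np.full((2, 4), 0.5)
label = np.array([[1, 0, 1, 0]] * 2, dtype=float)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]], dtype=float)

out = masked_sigmoid_bce(pred, label, mask)
print(out)  # approximately 0.724077 and 0.3620385
```

The second example has half of its positions masked out, so its average over four positions is half that of the first.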

Because examples differ in their number of unmasked positions, we can normalize the loss of each example by its effective length.

loss(pred, label, mask) / mask.sum(axis=1) * mask.shape[1]
[0.724077 0.724077]
<NDArray 2 @cpu(0)>

Initialize Model Parameters

We construct the embedding layers of the central and context words, respectively, and set the hyper-parameter word vector dimension embed_size to 100.

embed_size = 100
net = nn.Sequential()
net.add(nn.Embedding(input_dim=len(vocab), output_dim=embed_size),
        nn.Embedding(input_dim=len(vocab), output_dim=embed_size))

Training

The training function is defined below. Because of the existence of padding, the calculation of the loss function is slightly different compared to the previous training functions.

def train(net, data_iter, lr, num_epochs, ctx=d2l.try_gpu()):
    net.initialize(ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # loss_sum, num_tokens
        for i, batch in enumerate(data_iter):
            center, context_negative, mask, label = [
                data.as_in_context(ctx) for data in batch]
            with autograd.record():
                pred = skip_gram(center, context_negative, net[0], net[1])
                l = (loss(pred.reshape(label.shape), label, mask)
                     / mask.sum(axis=1) * mask.shape[1])
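            l.backward()  # Compute gradients of the normalized loss
            trainer.step(batch_size)  # Update both embedding layers
            metric.add(l.sum().asscalar(), l.size)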
            if (i+1) % 50 == 0:
                animator.add(epoch+(i+1)/len(data_iter), metric[0]/metric[1])
    print('loss %.3f, %d tokens/sec on %s ' % (
        metric[0]/metric[1], metric[1]/timer.stop(), ctx))

Now, we can train a skip-gram model using negative sampling.

lr, num_epochs = 0.01, 5
train(net, data_iter, lr, num_epochs)
loss 0.331, 16174 tokens/sec on gpu(0)

13.4.3. Applying the Word Embedding Model

After training the word embedding model, we can represent similarity in meaning between words based on the cosine similarity of two word vectors. As we can see, when using the trained word embedding model, the words closest in meaning to the word “chip” are mostly related to chips.

def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data()
    x = W[vocab[query_token]]
    # Compute the cosine similarity. Add 1e-9 for numerical stability.
    cos = nd.dot(W, x) / (nd.sum(W * W, axis=1) * nd.sum(x * x) + 1e-9).sqrt()
    topk = nd.topk(cos, k=k+1, ret_typ='indices').asnumpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print('cosine sim=%.3f: %s' % (cos[i].asscalar(), (vocab.idx_to_token[i])))

get_similar_tokens('chip', 3, net[0])
cosine sim=0.553: intel
cosine sim=0.549: desktop
cosine sim=0.472: semiconductor
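The ranking step can also be sketched without MXNet. The NumPy version below is a simplified stand-in that uses a tiny hand-made embedding matrix in place of trained weights. The function most_similar is a hypothetical helper, and it drops the query word by index rather than by discarding the top-ranked entry.

```python
import numpy as np

def most_similar(W, query_idx, k):
    """Return indices and cosine similarities of the k rows nearest to the query."""
    x = W[query_idx]
    # Cosine similarity of every row with x; 1e-9 guards against zero norms.
    cos = W @ x / np.sqrt((W * W).sum(axis=1) * (x * x).sum() + 1e-9)
    order = np.argsort(-cos)             # most similar first
    order = order[order != query_idx]    # drop the query word itself
    return order[:k], cos[order[:k]]

# Toy 2-dimensional "embeddings" for four words (illustrative values only).
W = np.array([[1.0, 0.0],
              [2.0, 0.1],
              [0.0, 1.0],
              [-1.0, 0.0]])

indices, sims = most_similar(W, query_idx=0, k=2)
print(indices.tolist())  # [1, 2]
```

Row 1 points in almost the same direction as row 0, so it ranks first even though its norm is twice as large; cosine similarity ignores vector length.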

13.4.4. Summary

  • We can use Gluon to train a skip-gram model through negative sampling.

13.4.5. Exercises

  • Set sparse_grad=True when creating an instance of nn.Embedding. Does it accelerate training? Look up MXNet documentation to learn the meaning of this argument.

  • Try to find synonyms for other words.

  • Tune the hyper-parameters and observe and analyze the experimental results.

  • When the data set is large, we usually sample the context words and the noise words for the central target word in the current mini-batch only when updating the model parameters. In other words, the same central target word may have different context words or noise words in different epochs. What are the benefits of this sort of training? Try to implement this training method.
