13.7. Finding Synonyms and Analogies

In Section 13.4 we trained a word2vec word embedding model on a small-scale data set and searched for synonyms using the cosine similarity of word vectors. In practice, word vectors pre-trained on a large-scale corpus can often be applied to downstream natural language processing tasks. This section will demonstrate how to use these pre-trained word vectors to find synonyms and analogies. We will continue to apply pre-trained word vectors in subsequent sections.

13.7.1. Using Pre-trained Word Vectors

MXNet’s contrib.text package provides functions and classes related to natural language processing (see the GluonNLP tool package for more details). First, let us check the names of the pre-trained word embeddings it provides.

from mxnet import nd
from mxnet.contrib import text

text.embedding.get_pretrained_file_names().keys()
dict_keys(['glove', 'fasttext'])

Given the name of a word embedding, we can see which pre-trained models it provides. The models may differ in word vector dimension or in the data set they were pre-trained on.

print(text.embedding.get_pretrained_file_names('glove'))
['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', 'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.840B.300d.txt', 'glove.twitter.27B.25d.txt', 'glove.twitter.27B.50d.txt', 'glove.twitter.27B.100d.txt', 'glove.twitter.27B.200d.txt']

The general naming convention for pre-trained GloVe models is “model.(data set.)number of words in data set.word vector dimension.txt”. For more information, please refer to the GloVe and fastText project sites [2,3]. Below, we use 50-dimensional GloVe word vectors pre-trained on a Wikipedia subset. The corresponding word vectors are downloaded automatically the first time we create a pre-trained word vector instance.

glove_6b50d = text.embedding.create(
    'glove', pretrained_file_name='glove.6B.50d.txt')
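
To make the naming convention described above concrete, the following sketch splits a GloVe file name into its fields. The helper `parse_glove_file_name` and the keys of the dictionary it returns are hypothetical names chosen for illustration; they are not part of MXNet. The sketch assumes that the name has either four or five dot-separated fields and ends in `txt`.

```python
def parse_glove_file_name(file_name):
    """Split a GloVe file name into the fields of the naming convention."""
    parts = file_name.split('.')
    # Expect 4 fields (no data set name) or 5 fields (with a data set name),
    # always ending in the 'txt' extension. This is an assumption of the
    # sketch, based on the file names listed above.
    if len(parts) not in (4, 5) or parts[-1] != 'txt':
        raise ValueError('unexpected file name: %s' % file_name)
    return {
        'model': parts[0],
        'data_set': parts[1] if len(parts) == 5 else None,
        'corpus_size': parts[-3],
        # '100d' -> 100
        'dimension': int(parts[-2].rstrip('d')),
    }


info = parse_glove_file_name('glove.twitter.27B.100d.txt')
print(info['data_set'], info['corpus_size'], info['dimension'])
```

For 'glove.6B.50d.txt', the model used in this section, the optional data set field is absent and the dimension field evaluates to 50.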

Print the dictionary size. The dictionary contains 400,000 words and a special unknown token.

len(glove_6b50d)
400001

We can use a word to get its index in the dictionary, or we can get the word from its index.

glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]
(3367, 'beautiful')
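
The two attributes used above are inverse mappings of each other. The following toy sketch in plain Python illustrates the idea with a four-entry vocabulary. The vocabulary contents, the `lookup` helper, and the choice of index 0 for the unknown token are assumptions made for this illustration; they do not describe how MXNet implements its dictionary.

```python
# Hypothetical toy vocabulary. Reserving index 0 for the unknown token is an
# assumption of this sketch.
idx_to_token = ['<unk>', 'the', 'chip', 'beautiful']
# Build the inverse mapping from tokens to indices.
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}


def lookup(token):
    """Return the index of a token, or 0 for tokens outside the vocabulary."""
    return token_to_idx.get(token, 0)


print(lookup('beautiful'), idx_to_token[lookup('beautiful')])
print(lookup('zzz'), idx_to_token[lookup('zzz')])
```

Looking up a token and then mapping the index back returns the same token, except for words outside the vocabulary, which all map to the unknown token.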

13.7.2. Applying Pre-trained Word Vectors

Below, we demonstrate the application of pre-trained word vectors, using GloVe as an example.

13.7.2.1. Finding Synonyms

Here, we re-implement the algorithm introduced in Section 13.4, which searches for synonyms by cosine similarity.

In order to reuse the logic for seeking the \(k\) nearest neighbors when seeking analogies, we encapsulate this part of the logic separately in the knn (\(k\)-nearest neighbors) function.

def knn(W, x, k):
    # The added 1e-9 is for numerical stability
    cos = nd.dot(W, x.reshape((-1,))) / (
        (nd.sum(W * W, axis=1) + 1e-9).sqrt() * nd.sum(x * x).sqrt())
    topk = nd.topk(cos, k=k, ret_typ='indices').asnumpy().astype('int32')
    return topk, [cos[i].asscalar() for i in topk]
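
To see what knn computes without downloading any vectors, here is a minimal sketch of the same idea in plain Python on a handful of made-up two-dimensional rows. The function name `cosine_knn` and the toy data are hypothetical. As in knn, a small constant is added under the square root of each row norm, so that an all-zero row yields a similarity of 0 rather than a division by zero.

```python
import math


def cosine_knn(rows, query, k, eps=1e-9):
    """Return indices and cosine similarities of the k rows closest to query."""
    query_norm = math.sqrt(sum(q * q for q in query))
    scores = []
    for row in rows:
        dot = sum(r * q for r, q in zip(row, query))
        # eps keeps the denominator non-zero for an all-zero row
        row_norm = math.sqrt(sum(r * r for r in row) + eps)
        scores.append(dot / (row_norm * query_norm))
    # Sort row indices by decreasing similarity and keep the first k
    order = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)[:k]
    return order, [scores[i] for i in order]


rows = [
    [0.0, 0.0],   # all-zero row
    [1.0, 0.0],   # same direction as the query
    [1.0, 1.0],   # 45 degrees away from the query
    [0.0, 1.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
]
indices, sims = cosine_knn(rows, [1.0, 0.0], k=2)
print(indices)  # [1, 2]
```

The row pointing in the same direction as the query comes first with a similarity close to 1, followed by the row at 45 degrees with a similarity close to 0.707.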

Then, we search for synonyms using the pre-trained word vector instance embed.

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec,
                    embed.get_vecs_by_tokens([query_token]), k + 1)
    # Skip the first result, which is the query token itself
    for i, c in zip(topk[1:], cos[1:]):
        print('cosine sim=%.3f: %s' % (c, embed.idx_to_token[i]))

The dictionary of the pre-trained word vector instance glove_6b50d created above contains 400,000 words and a special unknown token. Excluding the input word and the unknown token, we search for the three words most similar in meaning to “chip”.

get_similar_tokens('chip', 3, glove_6b50d)
cosine sim=0.856: chips
cosine sim=0.749: intel
cosine sim=0.749: electronics

Next, we search for the synonyms of “baby” and “beautiful”.

get_similar_tokens('baby', 3, glove_6b50d)
cosine sim=0.839: babies
cosine sim=0.800: boy
cosine sim=0.792: girl
get_similar_tokens('beautiful', 3, glove_6b50d)
cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful

13.7.2.2. Finding Analogies

In addition to seeking synonyms, we can also use pre-trained word vectors to seek analogies between words. For example, “man”:“woman”::“son”:“daughter” is an analogy: “man” is to “woman” as “son” is to “daughter”. The problem of seeking analogies can be defined as follows: for four words in the analogical relationship \(a : b :: c : d\), given the first three words, \(a\), \(b\) and \(c\), we want to find \(d\). Assume the word vector for the word \(w\) is \(\text{vec}(w)\). To solve the analogy problem, we need to find the word whose vector is most similar to the result vector \(\text{vec}(c)+\text{vec}(b)-\text{vec}(a)\).

def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed.get_vecs_by_tokens([token_a, token_b, token_c])
    # Query vector vec(b) - vec(a) + vec(c) for the missing word d
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    # Return the single token whose vector is closest to the query vector
    return embed.idx_to_token[topk[0]]
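
The vector arithmetic behind get_analogy can be checked by hand on a toy example. The sketch below uses four made-up two-dimensional vectors, in which the first coordinate loosely encodes gender and the second encodes generation. These vectors and the helper names `cosine` and `toy_analogy` are hypothetical and are not taken from GloVe. Like get_analogy, the sketch does not exclude the three input words from the candidates.

```python
import math


def cosine(u, v, eps=1e-9):
    """Cosine similarity with a small constant for numerical stability."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u) + eps)
    norm_v = math.sqrt(sum(b * b for b in v) + eps)
    return dot / (norm_u * norm_v)


# Made-up vectors: (gender, generation)
toy_vectors = {
    'man': [-1.0, 1.0],
    'woman': [1.0, 1.0],
    'son': [-1.0, -1.0],
    'daughter': [1.0, -1.0],
}


def toy_analogy(token_a, token_b, token_c, vectors):
    """Return the word whose vector is closest to vec(c) + vec(b) - vec(a)."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[token_a], vectors[token_b], vectors[token_c])]
    return max(vectors, key=lambda word: cosine(vectors[word], target))


answer = toy_analogy('man', 'woman', 'son', toy_vectors)
print(answer)  # daughter
```

Here the target vector is [1.0, -1.0], which coincides with the vector of “daughter”, while the other three words have cosine similarity 0 or -1 with it.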

Verify the “male-female” analogy.

get_analogy('man', 'woman', 'son', glove_6b50d)

“Capital-country” analogy: “beijing” is to “china” as “tokyo” is to what? The answer should be “japan”.

get_analogy('beijing', 'china', 'tokyo', glove_6b50d)

“Adjective-superlative adjective” analogy: “bad” is to “worst” as “big” is to what? The answer should be “biggest”.

get_analogy('bad', 'worst', 'big', glove_6b50d)

“Present tense verb-past tense verb” analogy: “do” is to “did” as “go” is to what? The answer should be “went”.

get_analogy('do', 'did', 'go', glove_6b50d)

13.7.3. Summary

  • Word vectors pre-trained on a large-scale corpus can often be applied to downstream natural language processing tasks.

  • We can use pre-trained word vectors to seek synonyms and analogies.

13.7.4. Exercises

  • Test the fastText results.

  • If the dictionary is extremely large, how can we accelerate finding synonyms and analogies?
