.. _sec_sentiment:

Sentiment Analysis and the Dataset
==================================

With the proliferation of online social media and review platforms, a
plethora of opinionated data has been logged, bearing great potential
for supporting decision making processes. *Sentiment analysis* studies
people's sentiments in their produced text, such as product reviews,
blog comments, and forum discussions. It enjoys wide applications in
fields as diverse as politics (e.g., analysis of public sentiment
towards policies), finance (e.g., analysis of market sentiment), and
marketing (e.g., product research and brand management).

Since sentiments can be categorized as discrete polarities or scales
(e.g., positive and negative), we can consider sentiment analysis as a
text classification task, which transforms a varying-length text
sequence into a fixed-length text category. In this chapter, we will
use Stanford's `large movie review dataset
<http://ai.stanford.edu/~amaas/data/sentiment/>`__ for sentiment
analysis. It consists of a training set and a test set, each containing
25000 movie reviews downloaded from IMDb. In both datasets, there are
equal numbers of "positive" and "negative" labels, indicating different
sentiment polarities.
For MXNet:

.. code:: python

    import os
    from mxnet import np, npx
    from d2l import mxnet as d2l

    npx.set_np()
For PyTorch:

.. code:: python

    import os
    import torch
    from torch import nn
    from d2l import torch as d2l
Reading the Dataset
-------------------

First, download and extract this IMDb review dataset in the path
``../data/aclImdb``.
.. code:: python

    #@save
    d2l.DATA_HUB['aclImdb'] = (
        'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
        '01ada507287d82875905620988597833ad4e0903')

    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
Next, read the training and test datasets. Each example is a review and
its label: 1 for "positive" and 0 for "negative".
.. code:: python

    #@save
    def read_imdb(data_dir, is_train):
        """Read the IMDb review dataset text sequences and labels."""
        data, labels = [], []
        for label in ('pos', 'neg'):
            folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                       label)
            for file in os.listdir(folder_name):
                with open(os.path.join(folder_name, file), 'rb') as f:
                    review = f.read().decode('utf-8').replace('\n', '')
                    data.append(review)
                    labels.append(1 if label == 'pos' else 0)
        return data, labels

    train_data = read_imdb(data_dir, is_train=True)
    print('# trainings:', len(train_data[0]))
    for x, y in zip(train_data[0][:3], train_data[1][:3]):
        print('label:', y, 'review:', x[0:60])

.. parsed-literal::
    :class: output

    # trainings: 25000
    label: 1 review: Henry Hathaway was daring, as well as enthusiastic, for his
    label: 1 review: An unassuming, subtle and lean film, "The Man in the White S
    label: 1 review: Eddie Murphy really made me laugh my ass off on this HBO sta
Preprocessing the Dataset
-------------------------

Treating each word as a token and filtering out words that appear fewer
than 5 times, we create a vocabulary out of the training dataset.
.. code:: python

    train_tokens = d2l.tokenize(train_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])
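As a quick sanity check, we can inspect the resulting vocabulary. The
snippet below assumes the ``d2l.Vocab`` API, where ``len(vocab)``
returns the vocabulary size and indexing maps tokens (or lists of
tokens) to integer indices, with unknown tokens mapped to index 0.

.. code:: python

    # Vocabulary size, including the reserved '<unk>' and '<pad>' tokens
    print(len(vocab))
    # Map a few tokens to their indices; out-of-vocabulary tokens map to
    # the index of '<unk>'
    print(vocab[['the', 'movie', '<pad>']])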
After tokenization, let us plot the histogram of review lengths in
tokens.
.. code:: python

    d2l.set_figsize()
    d2l.plt.xlabel('# tokens per review')
    d2l.plt.ylabel('count')
    d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));

.. figure:: output_sentiment-analysis-and-dataset_70179c_39_0.svg
As we expected, the reviews have varying lengths. To process a
minibatch of such reviews at a time, we set the length of each review
to 500 with truncation and padding, which is similar to the
preprocessing step for the machine translation dataset in
:numref:`sec_machine_translation`.
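The helper ``d2l.truncate_pad`` used below behaves, in essence, like
the following minimal sketch (the name ``truncate_pad_sketch`` is ours,
and the actual library helper may differ in details): it cuts a
sequence of token indices down to ``num_steps`` entries, or pads it
with a padding token until it reaches that length.

.. code:: python

    def truncate_pad_sketch(line, num_steps, padding_token):
        """Truncate or pad a sequence to exactly num_steps entries."""
        if len(line) > num_steps:
            return line[:num_steps]  # Truncate the sequence
        # Pad the sequence with the padding token
        return line + [padding_token] * (num_steps - len(line))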
For MXNet:

.. code:: python

    num_steps = 500  # sequence length
    train_features = np.array([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    print(train_features.shape)

.. parsed-literal::
    :class: output

    (25000, 500)
For PyTorch:

.. code:: python

    num_steps = 500  # sequence length
    train_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    print(train_features.shape)

.. parsed-literal::
    :class: output

    torch.Size([25000, 500])
Creating Data Iterators
-----------------------

Now we can create data iterators. At each iteration, a minibatch of
examples is returned.
For MXNet:

.. code:: python

    train_iter = d2l.load_array((train_features, train_data[1]), 64)

    for X, y in train_iter:
        print('X:', X.shape, ', y:', y.shape)
        break
    print('# batches:', len(train_iter))

.. parsed-literal::
    :class: output

    X: (64, 500) , y: (64,)
    # batches: 391
For PyTorch:

.. code:: python

    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)

    for X, y in train_iter:
        print('X:', X.shape, ', y:', y.shape)
        break
    print('# batches:', len(train_iter))

.. parsed-literal::
    :class: output

    X: torch.Size([64, 500]) , y: torch.Size([64])
    # batches: 391
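Under the hood, ``d2l.load_array`` is essentially a thin wrapper around
the framework's in-memory dataset and data loader utilities. A minimal
PyTorch sketch, assuming the same ``(data_arrays, batch_size,
is_train)`` interface (the name ``load_array_sketch`` is ours), could
look like this:

.. code:: python

    from torch.utils import data

    def load_array_sketch(data_arrays, batch_size, is_train=True):
        """Construct a PyTorch data iterator over in-memory tensors."""
        dataset = data.TensorDataset(*data_arrays)
        # Shuffle the examples only during training
        return data.DataLoader(dataset, batch_size, shuffle=is_train)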
Putting All Things Together
---------------------------

Last, we wrap up the above steps into the ``load_data_imdb`` function.
It returns training and test data iterators and the vocabulary of the
IMDb review dataset.
For MXNet:

.. code:: python

    #@save
    def load_data_imdb(batch_size, num_steps=500):
        """Return data iterators and the vocabulary of the IMDb review dataset."""
        data_dir = d2l.download_extract('aclImdb', 'aclImdb')
        train_data = read_imdb(data_dir, True)
        test_data = read_imdb(data_dir, False)
        train_tokens = d2l.tokenize(train_data[0], token='word')
        test_tokens = d2l.tokenize(test_data[0], token='word')
        vocab = d2l.Vocab(train_tokens, min_freq=5)
        train_features = np.array([d2l.truncate_pad(
            vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
        test_features = np.array([d2l.truncate_pad(
            vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
        train_iter = d2l.load_array((train_features, train_data[1]), batch_size)
        test_iter = d2l.load_array((test_features, test_data[1]), batch_size,
                                   is_train=False)
        return train_iter, test_iter, vocab
For PyTorch:

.. code:: python

    #@save
    def load_data_imdb(batch_size, num_steps=500):
        """Return data iterators and the vocabulary of the IMDb review dataset."""
        data_dir = d2l.download_extract('aclImdb', 'aclImdb')
        train_data = read_imdb(data_dir, True)
        test_data = read_imdb(data_dir, False)
        train_tokens = d2l.tokenize(train_data[0], token='word')
        test_tokens = d2l.tokenize(test_data[0], token='word')
        vocab = d2l.Vocab(train_tokens, min_freq=5)
        train_features = torch.tensor([d2l.truncate_pad(
            vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
        test_features = torch.tensor([d2l.truncate_pad(
            vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
        train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),
                                    batch_size)
        test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),
                                   batch_size, is_train=False)
        return train_iter, test_iter, vocab
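For example, a downstream section can obtain the training and test
iterators together with the vocabulary in one call (the batch size of
64 here is just an illustrative choice):

.. code:: python

    train_iter, test_iter, vocab = load_data_imdb(64)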
Summary
-------

- Sentiment analysis studies people's sentiments in their produced
  text, and can be considered as a text classification problem that
  transforms a varying-length text sequence into a fixed-length text
  category.
- After preprocessing, we can load Stanford's large movie review
  dataset (IMDb review dataset) into data iterators with a vocabulary.

Exercises
---------

1. What hyperparameters in this section can we modify to accelerate
   training sentiment analysis models?
2. Can you implement a function to load the dataset of `Amazon
   reviews `__ into data iterators and labels for sentiment analysis?
`Discussions `__