.. _sec_sentiment:
Sentiment Analysis and the Dataset
==================================
With the proliferation of online social media and review platforms, a
plethora of opinionated data have been logged, bearing great potential
for supporting decision making processes. *Sentiment analysis* studies
people's sentiments in their produced text, such as product reviews,
blog comments, and forum discussions. It enjoys wide applications to
fields as diverse as politics (e.g., analysis of public sentiments
towards policies), finance (e.g., analysis of sentiments of the market),
and marketing (e.g., product research and brand management).
Since sentiments can be categorized as discrete polarities or scales
(e.g., positive and negative), we can consider sentiment analysis as a
text classification task, which transforms a varying-length text
sequence into a fixed-length text category. In this chapter, we will use
Stanford's `large movie review
dataset `__ for
sentiment analysis. It consists of a training set and a testing set,
either containing 25000 movie reviews downloaded from IMDb. In both
datasets, there are equal number of "positive" and "negative" labels,
indicating different sentiment polarities.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import os
from mxnet import np, npx
from d2l import mxnet as d2l
npx.set_np()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import os
import torch
from torch import nn
from d2l import torch as d2l
.. raw:: html
.. raw:: html
Reading the Dataset
-------------------
First, download and extract this IMDb review dataset in the path
``../data/aclImdb``.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
d2l.DATA_HUB['aclImdb'] = (
'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
'01ada507287d82875905620988597833ad4e0903')
data_dir = d2l.download_extract('aclImdb', 'aclImdb')
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
d2l.DATA_HUB['aclImdb'] = (
'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
'01ada507287d82875905620988597833ad4e0903')
data_dir = d2l.download_extract('aclImdb', 'aclImdb')
.. raw:: html
.. raw:: html
Next, read the training and test datasets. Each example is a review and
its label: 1 for "positive" and 0 for "negative".
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def read_imdb(data_dir, is_train):
"""Read the IMDb review dataset text sequences and labels."""
data, labels = [], []
for label in ('pos', 'neg'):
folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
label)
for file in os.listdir(folder_name):
with open(os.path.join(folder_name, file), 'rb') as f:
review = f.read().decode('utf-8').replace('\n', '')
data.append(review)
labels.append(1 if label == 'pos' else 0)
return data, labels
train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
print('label:', y, 'review:', x[0:60])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
# trainings: 25000
label: 1 review: Henry Hathaway was daring, as well as enthusiastic, for his
label: 1 review: An unassuming, subtle and lean film, "The Man in the White S
label: 1 review: Eddie Murphy really made me laugh my ass off on this HBO sta
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def read_imdb(data_dir, is_train):
"""Read the IMDb review dataset text sequences and labels."""
data, labels = [], []
for label in ('pos', 'neg'):
folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
label)
for file in os.listdir(folder_name):
with open(os.path.join(folder_name, file), 'rb') as f:
review = f.read().decode('utf-8').replace('\n', '')
data.append(review)
labels.append(1 if label == 'pos' else 0)
return data, labels
train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
print('label:', y, 'review:', x[0:60])
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
# trainings: 25000
label: 1 review: Henry Hathaway was daring, as well as enthusiastic, for his
label: 1 review: An unassuming, subtle and lean film, "The Man in the White S
label: 1 review: Eddie Murphy really made me laugh my ass off on this HBO sta
.. raw:: html
.. raw:: html
Preprocessing the Dataset
-------------------------
Treating each word as a token and filtering out words that appear less
than 5 times, we create a vocabulary out of the training dataset.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
train_tokens = d2l.tokenize(train_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['
'])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
train_tokens = d2l.tokenize(train_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['
'])
.. raw:: html
.. raw:: html
After tokenization, let us plot the histogram of review lengths in
tokens.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.set_figsize()
d2l.plt.xlabel('# tokens per review')
d2l.plt.ylabel('count')
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));
.. figure:: output_sentiment-analysis-and-dataset_70179c_39_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
d2l.set_figsize()
d2l.plt.xlabel('# tokens per review')
d2l.plt.ylabel('count')
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));
.. figure:: output_sentiment-analysis-and-dataset_70179c_42_0.svg
.. raw:: html
.. raw:: html
As we expected, the reviews have varying lengths. To process a minibatch
of such reviews at each time, we set the length of each review to 500
with truncation and padding, which is similar to the preprocessing step
for the machine translation dataset in
:numref:`sec_machine_translation`.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
num_steps = 500 # sequence length
train_features = np.array([d2l.truncate_pad(
vocab[line], num_steps, vocab['
']) for line in train_tokens])
print(train_features.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
(25000, 500)
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
num_steps = 500 # sequence length
train_features = torch.tensor([d2l.truncate_pad(
vocab[line], num_steps, vocab['
']) for line in train_tokens])
print(train_features.shape)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
torch.Size([25000, 500])
.. raw:: html
.. raw:: html
Creating Data Iterators
-----------------------
Now we can create data iterators. At each iteration, a minibatch of
examples are returned.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
train_iter = d2l.load_array((train_features, train_data[1]), 64)
for X, y in train_iter:
print('X:', X.shape, ', y:', y.shape)
break
print('# batches:', len(train_iter))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X: (64, 500) , y: (64,)
# batches: 391
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)
for X, y in train_iter:
print('X:', X.shape, ', y:', y.shape)
break
print('# batches:', len(train_iter))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
X: torch.Size([64, 500]) , y: torch.Size([64])
# batches: 391
.. raw:: html
.. raw:: html
Putting All Things Together
---------------------------
Last, we wrap up the above steps into the ``load_data_imdb`` function.
It returns training and test data iterators and the vocabulary of the
IMDb review dataset.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def load_data_imdb(batch_size, num_steps=500):
"""Return data iterators and the vocabulary of the IMDb review dataset."""
data_dir = d2l.download_extract('aclImdb', 'aclImdb')
train_data = read_imdb(data_dir, True)
test_data = read_imdb(data_dir, False)
train_tokens = d2l.tokenize(train_data[0], token='word')
test_tokens = d2l.tokenize(test_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5)
train_features = np.array([d2l.truncate_pad(
vocab[line], num_steps, vocab['
']) for line in train_tokens])
test_features = np.array([d2l.truncate_pad(
vocab[line], num_steps, vocab['']) for line in test_tokens])
train_iter = d2l.load_array((train_features, train_data[1]), batch_size)
test_iter = d2l.load_array((test_features, test_data[1]), batch_size,
is_train=False)
return train_iter, test_iter, vocab
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
#@save
def load_data_imdb(batch_size, num_steps=500):
"""Return data iterators and the vocabulary of the IMDb review dataset."""
data_dir = d2l.download_extract('aclImdb', 'aclImdb')
train_data = read_imdb(data_dir, True)
test_data = read_imdb(data_dir, False)
train_tokens = d2l.tokenize(train_data[0], token='word')
test_tokens = d2l.tokenize(test_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5)
train_features = torch.tensor([d2l.truncate_pad(
vocab[line], num_steps, vocab['
']) for line in train_tokens])
test_features = torch.tensor([d2l.truncate_pad(
vocab[line], num_steps, vocab['']) for line in test_tokens])
train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),
batch_size)
test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),
batch_size,
is_train=False)
return train_iter, test_iter, vocab
.. raw:: html
.. raw:: html
Summary
-------
- Sentiment analysis studies people's sentiments in their produced
text, which is considered as a text classification problem that
transforms a varying-length text sequence into a fixed-length text
category.
- After preprocessing, we can load Stanford's large movie review
dataset (IMDb review dataset) into data iterators with a vocabulary.
Exercises
---------
1. What hyperparameters in this section can we modify to accelerate
training sentiment analysis models?
2. Can you implement a function to load the dataset of `Amazon
reviews `__ into data
iterators and labels for sentiment analysis?
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html