
Convolutional Neural Networks (CNN) were originally invented for computer vision and are now the building blocks of state-of-the-art CV models. One of the earliest applications of CNN in Natural Language Processing was introduced in the paper Convolutional Neural Networks for Sentence Classification (Kim, 2014). Following the same idea as in computer vision, the CNN model is used as a feature extractor that encodes semantic features of sentences before these features are fed to a classifier.

With only a simple one-layer CNN trained on top of pretrained word vectors and little hyperparameter tuning, the model achieves excellent results on multiple sentence-level classification tasks. CNN models are now widely used in other NLP tasks such as translation and question answering as part of more complex architectures.

When implementing the original paper (Kim, 2014) in PyTorch, I needed to put many pieces together to complete the project. This article serves as a complete guide to CNNs for sentence classification tasks, accompanied by advice for practitioners. It will cover:

1. Setup

1.1. Import Libraries

import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import nltk
nltk.download("all")
import matplotlib.pyplot as plt
import torch

%matplotlib inline

1.2. Download Datasets

The dataset we will use is Movie Review (MR), a sentence polarity dataset from Pang and Lee (2005). It contains 5331 positive and 5331 negative processed sentences/snippets.

URL = 'https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
# Download Datasets
!wget -P 'Data/' $URL
# Unzip
!tar xvzf 'Data/rt-polaritydata.tar.gz' -C 'Data/'
def load_text(path):
    """Load text data, lowercase text and save to a list."""

    with open(path, 'rb') as f:
        texts = []
        for line in f:
            texts.append(line.decode(errors='ignore').lower().strip())

    return texts

# Load files
neg_text = load_text('Data/rt-polaritydata/rt-polarity.neg')
pos_text = load_text('Data/rt-polaritydata/rt-polarity.pos')

# Concatenate and label data
texts = np.array(neg_text + pos_text)
labels = np.array([0]*len(neg_text) + [1]*len(pos_text))

1.3. Download fastText Word Vectors

The pretrained word vectors used in the original paper were trained with word2vec (Mikolov et al., 2013) on 100 billion tokens of Google News. In this tutorial, we will use fastText pretrained word vectors (Mikolov et al., 2017), trained on 600 billion tokens of Common Crawl. fastText extends word2vec with subword information, and its pretrained vectors compare favorably with other widely used word vectors.

The code below will download fastText pretrained vectors. Using Google Colab, the running time is approximately 3min 30s.

%%time
URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"
FILE = "fastText"

if os.path.isdir(FILE):
    print("fastText exists.")
else:
    !wget -P $FILE $URL
    !unzip $FILE/crawl-300d-2M.vec.zip -d $FILE
crawl-300d-2M.vec.z 100%[===================>]   1.42G  23.8MB/s    in 62s     

2020-02-01 00:40:43 (23.3 MB/s) - ‘fastText/crawl-300d-2M.vec.zip’ saved [1523785255/1523785255]

Archive:  fastText/crawl-300d-2M.vec.zip
  inflating: fastText/crawl-300d-2M.vec  

1.4. Use GPU for Training

Google Colab offers free GPUs and TPUs. Since training our network will be much faster on a GPU, it's best to utilize this feature.

A GPU can be added by going to the menu and selecting:

Runtime -> Change runtime type -> Hardware accelerator: GPU

Then we need to run the following cell to specify the GPU as the device.

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
There are 1 GPU(s) available.
Device name: Tesla T4

2. Data Preparation

To prepare our text data for training, first we need to tokenize our sentences and build a vocabulary dictionary word2idx, which will later be used to convert our tokens into indexes and build an embedding layer.

So, what is an embedding layer?

An embedding layer serves as a look-up table which takes words’ indexes in the vocabulary as input and outputs word vectors. Hence, the embedding layer has shape \((N, d)\) where \(N\) is the size of the vocabulary and \(d\) is the embedding dimension. In order to fine-tune pretrained word vectors, we need to create an embedding layer in our nn.Module class. Our input to the model will then be input_ids, which are the tokens’ indexes in the vocabulary.
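As a minimal illustration (the vocabulary size and dimensions below are toy values, not our final configuration), an embedding layer in PyTorch is simply nn.Embedding:

import torch
import torch.nn as nn

# Toy embedding layer: vocabulary of 4 tokens, embedding dimension 5
embedding = nn.Embedding(num_embeddings=4, embedding_dim=5, padding_idx=0)

# One "sentence" of token indexes: three real tokens followed by <pad> (index 0)
input_ids = torch.tensor([[2, 3, 1, 0]])   # shape: (batch_size=1, seq_len=4)
word_vectors = embedding(input_ids)        # shape: (1, 4, 5)
print(word_vectors.shape)                  # torch.Size([1, 4, 5])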

2.1. Tokenize

The function tokenize will tokenize our sentences, build a vocabulary and find the maximum sentence length. The function encode will take outputs of tokenize as inputs, perform sentence padding and return input_ids as a numpy array.

from nltk.tokenize import word_tokenize
from collections import defaultdict

def tokenize(texts):
    """Tokenize texts, build vocabulary and find maximum sentence length.
    
    Args:
        texts (List[str]): List of text data
    
    Returns:
        tokenized_texts (List[List[str]]): List of list of tokens
        word2idx (Dict): Vocabulary built from the corpus
        max_len (int): Maximum sentence length
    """

    max_len = 0
    tokenized_texts = []
    word2idx = {}

    # Add <pad> and <unk> tokens to the vocabulary
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    # Building our vocab from the corpus starting from index 2
    idx = 2
    for sent in texts:
        tokenized_sent = word_tokenize(sent)

        # Add `tokenized_sent` to `tokenized_texts`
        tokenized_texts.append(tokenized_sent)

        # Add new token to `word2idx`
        for token in tokenized_sent:
            if token not in word2idx:
                word2idx[token] = idx
                idx += 1

        # Update `max_len`
        max_len = max(max_len, len(tokenized_sent))

    return tokenized_texts, word2idx, max_len

def encode(tokenized_texts, word2idx, max_len):
    """Pad each sentence to the maximum sentence length and encode tokens to
    their index in the vocabulary.

    Returns:
        input_ids (np.array): Array of token indexes in the vocabulary with
            shape (N, max_len). It will be the input of our CNN model.
    """

    input_ids = []
    for tokenized_sent in tokenized_texts:
        # Pad sentences to max_len
        tokenized_sent += ['<pad>'] * (max_len - len(tokenized_sent))

        # Encode tokens to input_ids
        input_id = [word2idx.get(token) for token in tokenized_sent]
        input_ids.append(input_id)
    
    return np.array(input_ids)

2.2. Load Pretrained Vectors

We will load the pretrained vectors for each token in our vocabulary. For tokens with no pretrained vectors, we will initialize random word vectors with the same dimension and variance.

from tqdm import tqdm_notebook

def load_pretrained_vectors(word2idx, fname):
    """Load pretrained vectors and create embedding layers.
    
    Args:
        word2idx (Dict): Vocabulary built from the corpus
        fname (str): Path to pretrained vector file

    Returns:
        embeddings (np.array): Embedding matrix with shape (N, d) where N is
            the size of word2idx and d is embedding dimension
    """

    print("Loading pretrained vectors...")
    fin = open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())

    # Initialize random embeddings
    embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), d))
    embeddings[word2idx['<pad>']] = np.zeros((d,))

    # Load pretrained vectors
    count = 0
    for line in tqdm_notebook(fin):
        tokens = line.rstrip().split(' ')
        word = tokens[0]
        if word in word2idx:
            count += 1
            embeddings[word2idx[word]] = np.array(tokens[1:], dtype=np.float32)

    print(f"There are {count} / {len(word2idx)} pretrained vectors found.")

    return embeddings

Now let’s put the above steps together.

# Tokenize, build vocabulary, encode tokens
print("Tokenizing...\n")
tokenized_texts, word2idx, max_len = tokenize(texts)
input_ids = encode(tokenized_texts, word2idx, max_len)

# Load pretrained vectors
embeddings = load_pretrained_vectors(word2idx, "fastText/crawl-300d-2M.vec")
embeddings = torch.tensor(embeddings)
Tokenizing...

Loading pretrained vectors...

There are 18526 / 20286 pretrained vectors found.

2.3. Create PyTorch DataLoader

We will create an iterator for our dataset using the torch DataLoader class. This will help save on memory during training and boost the training speed. The batch size used in the paper is 50.

from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,
                              SequentialSampler)

def data_loader(train_inputs, val_inputs, train_labels, val_labels,
                batch_size=50):
    """Convert train and validation sets to torch.Tensors and load them to
    DataLoader.
    """

    # Convert data type to torch.Tensor
    train_inputs, val_inputs, train_labels, val_labels =\
    tuple(torch.tensor(data) for data in
          [train_inputs, val_inputs, train_labels, val_labels])

    # Create DataLoader for training data
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Create DataLoader for validation data
    val_data = TensorDataset(val_inputs, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

    return train_dataloader, val_dataloader

We will use 90% of the dataset for training and 10% for validation.

from sklearn.model_selection import train_test_split

# Train Test Split
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    input_ids, labels, test_size=0.1, random_state=42)

# Load data to PyTorch DataLoader
train_dataloader, val_dataloader = \
data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50)

3. Model

CNN Architecture

The picture below illustrates the CNN architecture that we are going to build, with three filter sizes (2, 3, and 4) and 2 filters per size.

CNN Architecture (Source: Zhang, 2015)

# Sample configuration:
filter_sizes = [2, 3, 4]
num_filters = [2, 2, 2]

Suppose that we are classifying the sentence “I like this movie very much!” (\(N = 7\) tokens) and the dimensionality of word vectors is \(d=5\). After applying the embedding layer to the input token ids, the sample sentence is represented as a 2D tensor with shape (7, 5), much like an image.

\[\mathrm{x_{emb}} \quad \in \mathbb{R}^{7 \times 5}\]

We then use 1-dimensional convolutions to extract features from the sentence. In this example, we have 6 filters in total, and each filter has shape \((f_i, d)\) where \(f_i\) is the filter size for \(i \in \{1,...,6\}\). Each filter then scans over \(\mathrm{x_{emb}}\) and returns a feature map:

\[\mathrm{x_{conv_i} = Conv1D(x_{emb})} \quad \in \mathbb{R}^{N-f_i+1}\]

Next, we apply the ReLU activation to \(\mathrm{x_{conv_{i}}}\) and use max-over-time-pooling to reduce each feature map to a single scalar. Then we concatenate these scalars into a vector which will be fed to a fully connected layer to compute the final scores for our classes (logits).

\[\mathrm{x_{pool_i} = MaxPool(ReLU(x_{conv_i}))} \quad \in \mathbb{R}\] \[\mathrm{x_{fc} = \texttt{concat}(x_{pool_i})} \quad \in \mathbb{R}^6\]

The idea here is that each filter will capture different semantic signals in the sentence (e.g., happiness, humor, politics, anger…) and max-pooling will record only the strongest signal over the sentence. This logic makes sense because humans also perceive the sentiment of a sentence based on its strongest semantic signal.

Finally, we use a fully connected layer with the weight matrix \(\mathbf{W_{fc}} \in \mathbb{R}^{2 \times 6}\) and dropout to compute \(\mathrm{logits}\), which is a vector of length 2 that keeps the scores for our two classes.

\[\mathrm{logits = Dropout(\mathbf{W_{fc}}x_{fc})} \in \mathbb{R}^2\]
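To make the shape bookkeeping concrete, below is a toy sketch of this forward pass; the random embeddings and untrained filters are for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

d, N = 5, 7                                      # embedding dimension, sentence length
filter_sizes, num_filters = [2, 3, 4], [2, 2, 2]

x_emb = torch.randn(1, N, d)                     # as if produced by an embedding layer
x_reshaped = x_emb.permute(0, 2, 1)              # (1, d, N), the layout nn.Conv1d expects

convs = [nn.Conv1d(in_channels=d, out_channels=nf, kernel_size=fs)
         for fs, nf in zip(filter_sizes, num_filters)]

x_conv_list = [F.relu(conv(x_reshaped)) for conv in convs]                    # each (1, 2, N - f_i + 1)
x_pool_list = [F.max_pool1d(x, kernel_size=x.shape[2]) for x in x_conv_list]  # each (1, 2, 1)

x_fc = torch.cat([x.squeeze(dim=2) for x in x_pool_list], dim=1)              # (1, 6)
logits = nn.Linear(in_features=6, out_features=2)(x_fc)                       # (1, 2)
print([tuple(x.shape) for x in x_conv_list], tuple(x_fc.shape), tuple(logits.shape))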

An in-depth explanation of CNN can be found in this article and this video.

3.1. Create CNN Model

For simplicity, the toy model above uses a very small configuration. The final model will have the same architecture but be much bigger:

 Description        | Values
--------------------|-----------------
 input word vectors | fastText
 embedding size     | 300
 filter sizes       | (3, 4, 5)
 num filters        | (100, 100, 100)
 activation         | ReLU
 pooling            | 1-max pooling
 dropout rate       | 0.5

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN_NLP(nn.Module):
    """An 1D Convulational Neural Network for Sentence Classification."""
    def __init__(self,
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=300,
                 filter_sizes=[3, 4, 5],
                 num_filters=[100, 100, 100],
                 num_classes=2,
                 dropout=0.5):
        """
        The constructor for CNN_NLP class.

        Args:
            pretrained_embedding (torch.Tensor): Pretrained embeddings with
                shape (vocab_size, embed_dim)
            freeze_embedding (bool): Set to False to fine-tune pretrained
                vectors. Default: False
            vocab_size (int): Needs to be specified when pretrained word
                embeddings are not used.
            embed_dim (int): Dimension of word vectors. Need to be specified
                when pretrained word embeddings are not used. Default: 300
            filter_sizes (List[int]): List of filter sizes. Default: [3, 4, 5]
            num_filters (List[int]): List of number of filters, has the same
                length as `filter_sizes`. Default: [100, 100, 100]
            num_classes (int): Number of classes. Default: 2
            dropout (float): Dropout rate. Default: 0.5
        """

        super(CNN_NLP, self).__init__()
        # Embedding layer
        if pretrained_embedding is not None:
            self.vocab_size, self.embed_dim = pretrained_embedding.shape
            self.embedding = nn.Embedding.from_pretrained(pretrained_embedding,
                                                          freeze=freeze_embedding)
        else:
            self.embed_dim = embed_dim
            self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                          embedding_dim=self.embed_dim,
                                          padding_idx=0,
                                          max_norm=5.0)
        # Conv Network
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])
        # Fully-connected layer and Dropout
        self.fc = nn.Linear(np.sum(num_filters), num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input_ids):
        """Perform a forward pass through the network.

        Args:
            input_ids (torch.Tensor): A tensor of token ids with shape
                (batch_size, max_sent_length)

        Returns:
            logits (torch.Tensor): Output logits with shape (batch_size,
                n_classes)
        """

        # Get embeddings from `input_ids`. Output shape: (b, max_len, embed_dim)
        x_embed = self.embedding(input_ids).float()

        # Permute `x_embed` to match input shape requirement of `nn.Conv1d`.
        # Output shape: (b, embed_dim, max_len)
        x_reshaped = x_embed.permute(0, 2, 1)

        # Apply CNN and ReLU. Output shape: (b, num_filters[i], L_out)
        x_conv_list = [F.relu(conv1d(x_reshaped)) for conv1d in self.conv1d_list]

        # Max pooling. Output shape: (b, num_filters[i], 1)
        x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2])
            for x_conv in x_conv_list]
        
        # Concatenate x_pool_list to feed the fully connected layer.
        # Output shape: (b, sum(num_filters))
        x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list],
                         dim=1)
        
        # Compute logits. Output shape: (b, n_classes)
        logits = self.fc(self.dropout(x_fc))

        return logits

3.2. Optimizer

To train deep learning models, we need to define a loss function and minimize this loss. We’ll use back-propagation to compute gradients and an optimization algorithm (e.g., gradient descent) to minimize the loss. The original paper used the Adadelta optimizer.

import torch.optim as optim

def initilize_model(pretrained_embedding=None,
                    freeze_embedding=False,
                    vocab_size=None,
                    embed_dim=300,
                    filter_sizes=[3, 4, 5],
                    num_filters=[100, 100, 100],
                    num_classes=2,
                    dropout=0.5,
                    learning_rate=0.01):
    """Instantiate a CNN model and an optimizer."""

    assert (len(filter_sizes) == len(num_filters)), "filter_sizes and \
    num_filters need to be of the same length."

    # Instantiate CNN model
    cnn_model = CNN_NLP(pretrained_embedding=pretrained_embedding,
                        freeze_embedding=freeze_embedding,
                        vocab_size=vocab_size,
                        embed_dim=embed_dim,
                        filter_sizes=filter_sizes,
                        num_filters=num_filters,
                        num_classes=num_classes,
                        dropout=dropout)
    
    # Send model to `device` (GPU/CPU)
    cnn_model.to(device)

    # Instantiate Adadelta optimizer
    optimizer = optim.Adadelta(cnn_model.parameters(),
                               lr=learning_rate,
                               rho=0.95)

    return cnn_model, optimizer

3.3. Training Loop

For each epoch, the code below will perform a forward step to compute the Cross Entropy loss, a backward step to compute gradients and use the optimizer to update weights/parameters. At the end of each epoch, the loss on training data and the accuracy over the validation data will be printed to help us keep track of the model’s performance. The code is heavily annotated with detailed explanations.

import random
import time

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, optimizer, train_dataloader, val_dataloader=None, epochs=10):
    """Train the CNN model."""
    
    # Tracking best validation accuracy
    best_accuracy = 0

    # Start training loop
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {\
    'Val Acc':^9} | {'Elapsed':^9}")
    print("-"*60)

    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================

        # Tracking time and loss
        t0_epoch = time.time()
        total_loss = 0

        # Put the model into the training mode
        model.train()

        for step, batch in enumerate(train_dataloader):
            # Load batch to GPU
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Update parameters
            optimizer.step()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        # =======================================
        #               Evaluation
        # =======================================
        if val_dataloader is not None:
            # After the completion of each training epoch, measure the model's
            # performance on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Track the best accuracy
            if val_accuracy > best_accuracy:
                best_accuracy = val_accuracy

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | {\
            val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            
    print("\n")
    print(f"Training complete! Best accuracy: {best_accuracy:.2f}%.")

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's
    performance on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled
    # during the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

4. Evaluation

In the original paper, the author tried three variations of the model: CNN-rand (word vectors initialized randomly), CNN-static (pretrained vectors, kept frozen) and CNN-non-static (pretrained vectors, fine-tuned).

We will experiment with all 3 variations and compare their performance. Below is a report of our results alongside the original paper’s results.

 Model          | Kim’s results | Our results
----------------|---------------|------------
 CNN-rand       | 76.1          | 74.2
 CNN-static     | 81.0          | 82.7
 CNN-non-static | 81.5          | 84.4

Randomness could account for part of the difference in the results. I think the improvement in our results comes from using fastText pretrained vectors, which are of higher quality than the word2vec vectors that the author used.

# CNN-rand: Word vectors are randomly initialized.
set_seed(42)
cnn_rand, optimizer = initilize_model(vocab_size=len(word2idx),
                                      embed_dim=300,
                                      learning_rate=0.25,
                                      dropout=0.5)
train(cnn_rand, optimizer, train_dataloader, val_dataloader, epochs=20)
Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   0.682544   |  0.653227  |   62.22   |   1.50   
   2    |   0.622080   |  0.616504  |   65.22   |   1.41   
   3    |   0.546976   |  0.574917  |   69.30   |   1.43   
   4    |   0.473106   |  0.559976  |   69.21   |   1.43   
   5    |   0.397637   |  0.541240  |   72.47   |   1.44   
   6    |   0.322112   |  0.530545  |   71.93   |   1.43   
   7    |   0.258854   |  0.513072  |   72.92   |   1.43   
   8    |   0.204417   |  0.534012  |   73.74   |   1.43   
   9    |   0.157654   |  0.533650  |   74.01   |   1.44   
  10    |   0.129191   |  0.542072  |   74.19   |   1.44   
  11    |   0.104160   |  0.561548  |   73.56   |   1.45   
  12    |   0.083750   |  0.560357  |   73.10   |   1.47   
  13    |   0.067199   |  0.565875  |   73.10   |   1.45   
  14    |   0.061943   |  0.591892  |   73.83   |   1.44   
  15    |   0.047678   |  0.615021  |   73.38   |   1.44   
  16    |   0.043667   |  0.609918  |   73.47   |   1.45   
  17    |   0.038222   |  0.624876  |   73.74   |   1.43   
  18    |   0.037270   |  0.636214  |   73.83   |   1.44   
  19    |   0.032148   |  0.635478  |   73.19   |   1.46   
  20    |   0.027427   |  0.636196  |   73.56   |   1.42   


Training complete! Best accuracy: 74.19%.
# CNN-static: fastText pretrained word vectors are used and kept frozen during training.
set_seed(42)
cnn_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                        freeze_embedding=True,
                                        learning_rate=0.25,
                                        dropout=0.5)
train(cnn_static, optimizer, train_dataloader, val_dataloader, epochs=20)
Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   0.587050   |  0.473927  |   76.93   |   0.82   
   2    |   0.453002   |  0.432967  |   79.39   |   0.71   
   3    |   0.389261   |  0.417466  |   80.11   |   0.74   
   4    |   0.345526   |  0.417371  |   80.93   |   0.81   
   5    |   0.284621   |  0.403670  |   81.47   |   0.83   
   6    |   0.242149   |  0.406981  |   81.93   |   0.81   
   7    |   0.190178   |  0.460115  |   79.93   |   0.76   
   8    |   0.155375   |  0.421258  |   82.20   |   0.84   
   9    |   0.118369   |  0.436616  |   82.02   |   0.80   
  10    |   0.095217   |  0.443634  |   81.83   |   0.79   
  11    |   0.078958   |  0.447452  |   82.11   |   0.76   
  12    |   0.063665   |  0.504030  |   81.20   |   0.83   
  13    |   0.047461   |  0.457974  |   82.02   |   0.77   
  14    |   0.043035   |  0.485016  |   82.11   |   0.70   
  15    |   0.035299   |  0.479483  |   82.11   |   0.82   
  16    |   0.028384   |  0.498936  |   82.19   |   0.79   
  17    |   0.024328   |  0.521321  |   82.37   |   0.76   
  18    |   0.024897   |  0.511377  |   82.74   |   0.74   
  19    |   0.019988   |  0.530753  |   81.93   |   0.79   
  20    |   0.017251   |  0.546499  |   82.20   |   0.85   


Training complete! Best accuracy: 82.74%.
# CNN-non-static: fastText pretrained word vectors are fine-tuned during training.
set_seed(42)
cnn_non_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_non_static, optimizer, train_dataloader, val_dataloader, epochs=20)
Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
   1    |   0.586136   |  0.471964  |   77.21   |   2.08   
   2    |   0.448910   |  0.428012  |   80.03   |   2.11   
   3    |   0.381136   |  0.409408  |   81.29   |   2.09   
   4    |   0.332936   |  0.411652  |   80.75   |   2.10   
   5    |   0.267999   |  0.397631  |   82.02   |   2.10   
   6    |   0.223944   |  0.399833  |   81.29   |   2.11   
   7    |   0.168644   |  0.452024  |   81.29   |   2.10   
   8    |   0.132921   |  0.442039  |   81.65   |   2.09   
   9    |   0.097992   |  0.457295  |   81.84   |   2.09   
  10    |   0.079037   |  0.458124  |   82.38   |   2.09   
  11    |   0.061001   |  0.459572  |   83.74   |   2.09   
  12    |   0.047450   |  0.535106  |   81.29   |   2.08   
  13    |   0.037088   |  0.491504  |   84.37   |   2.10   
  14    |   0.031085   |  0.503522  |   83.11   |   2.08   
  15    |   0.025401   |  0.512804  |   84.01   |   2.10   
  16    |   0.020165   |  0.532516  |   84.19   |   2.11   
  17    |   0.017053   |  0.545771  |   83.83   |   2.08   
  18    |   0.017567   |  0.540735  |   84.20   |   2.09   
  19    |   0.013829   |  0.567102  |   82.47   |   2.09   
  20    |   0.013072   |  0.594407  |   82.20   |   2.08   


Training complete! Best accuracy: 84.37%.

5. Test Model

Let’s test our CNN-non-static model on some examples.

def predict(text, model=cnn_non_static.to("cpu"), max_len=62):
    """Predict probability that a review is positive."""

    # Tokenize, pad and encode text
    tokens = word_tokenize(text.lower())
    padded_tokens = tokens + ['<pad>'] * (max_len - len(tokens))
    input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_tokens]

    # Convert to PyTorch tensors
    input_id = torch.tensor(input_id).unsqueeze(dim=0)

    # Compute logits
    logits = model.forward(input_id)

    #  Compute probability
    probs = F.softmax(logits, dim=1).squeeze(dim=0)

    print(f"This review is {probs[1] * 100:.2f}% positive.")

Our model can easily recognize reviews with strong negative signals. On samples that have mixed feelings but a positive sentiment overall, our model also gets excellent results.

predict("All of friends slept while watching this movie. But I really enjoyed it.")
predict("I have waited so long for this movie. I am now so satisfied and happy.")
predict("This movie is long and boring.")
predict("I don't like the ending.")
This review is 61.22% positive.
This review is 94.68% positive.
This review is 0.01% positive.
This review is 4.03% positive.

6. Advice for Practitioners

In A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification (Zhang, 2015), the authors conducted a sensitivity analysis of the above CNN architecture by running it with many different sets of hyperparameters. Based on the main empirical findings of that research, below is some advice for practitioners on choosing hyperparameters when applying this architecture to sentence classification tasks:

Bonus: Skorch: A Scikit-like Library for PyTorch Modules

You might find the training loop in PyTorch intimidating, with so many steps, and wonder why they aren’t wrapped up in functions like model.fit() and model.predict() in scikit-learn. This explicitness is actually something I like about PyTorch: it lets me add extra customizations during training, such as clipping gradients or updating learning rates, and because I build my model and training loop block by block, I can track down bugs faster when something goes wrong. However, when I need to get a baseline model running quickly, writing an entire training loop is a burden. That’s when I turn to skorch.
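As a concrete illustration of that flexibility before handing things over to skorch, here is a minimal sketch of a training step with gradient clipping and a learning-rate scheduler added by hand; the clipping threshold and the scheduler are illustrative choices, not part of the loop in Section 3.3 or the original paper.

import torch.nn as nn

def train_step(model, optimizer, scheduler, loss_fn, b_input_ids, b_labels,
               max_grad_norm=1.0):
    """One training step with two hand-added customizations."""
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(b_input_ids), b_labels)
    loss.backward()
    # Customization 1: clip gradients before the optimizer step
    nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    # Customization 2: update the learning rate after each step
    scheduler.step()
    return loss.item()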

skorch is “a scikit-learn compatible neural network library that wraps PyTorch.” There is no need to create a DataLoader or write a training/evaluation loop. All you need to do is define the model and optimizer as in the code below; then a simple net.fit(X, y) is enough.

Not only does skorch make training your deep learning models neat and fast, it also provides powerful support. You can specify the callbacks parameter to define early stopping and checkpoint saving. You can also combine a skorch model with scikit-learn methods to do cross-validation and hyperparameter tuning with grid search (a sketch follows the net definition below). Please check out the documentation to explore more of this library’s powerful functions.

!pip install skorch
from skorch import NeuralNetClassifier
from skorch.helper import predefined_split
from skorch.callbacks import EarlyStopping, Checkpoint
from skorch.dataset import Dataset

# Specify validation set
val_dataset = Dataset(val_inputs, val_labels)

# Specify callbacks and checkpoints
cp = Checkpoint(monitor='valid_acc_best', dirname='exp1')
callbacks = [
    ('early_stop', EarlyStopping(monitor='valid_acc', patience=5, lower_is_better=False)),
    cp
]

net = NeuralNetClassifier(
    # Module
    module=CNN_NLP,
    module__pretrained_embedding=embeddings,
    module__freeze_embedding=False,
    module__dropout=0.5,
    # Optimizer
    criterion=nn.CrossEntropyLoss,
    optimizer=optim.Adadelta,
    optimizer__lr=0.25,
    optimizer__rho=0.95,
    # Others
    max_epochs=20,
    batch_size=50,
    train_split=predefined_split(val_dataset),
    iterator_train__shuffle=True,
    warm_start=False,
    callbacks=callbacks,
    device=device
)
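And here is a hedged sketch of the grid-search integration mentioned above. A fresh net is created with train_split=None so that GridSearchCV controls the validation splits; the parameter grid, cv=3 and max_epochs=10 are illustrative choices, not tuned settings.

from sklearn.model_selection import GridSearchCV

search_net = NeuralNetClassifier(
    module=CNN_NLP,
    module__pretrained_embedding=embeddings,
    module__freeze_embedding=False,
    criterion=nn.CrossEntropyLoss,
    optimizer=optim.Adadelta,
    max_epochs=10,
    batch_size=50,
    train_split=None,        # let GridSearchCV create the validation folds
    verbose=0,
    device=device,
)

# Illustrative parameter grid
params = {
    'optimizer__lr': [0.1, 0.25, 0.5],
    'module__dropout': [0.3, 0.5],
}

gs = GridSearchCV(search_net, params, scoring='accuracy', cv=3)
gs.fit(np.array(train_inputs), train_labels)
print(gs.best_score_, gs.best_params_)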

skorch also prints training results in a very nice table; my training loop in Section 3 is inspired by this format. When a model checkpoint is saved, you can see a + sign in the cp column.

set_seed(42)
_ = net.fit(np.array(train_inputs), train_labels)

valid_acc_best = np.max(net.history[:, 'valid_acc'])
print(f"Training complete! Best accuracy: {valid_acc_best * 100:.2f}%")
  epoch    train_loss    valid_acc    valid_loss    cp     dur
-------  ------------  -----------  ------------  ----  ------
      1        0.5862       0.7741        0.4727     +  2.2838
      2        0.4481       0.7901        0.4385     +  2.2232
      3        0.3849       0.7938        0.4369     +  2.2337
      4        0.3242       0.8285        0.3940     +  2.2340
      5        0.2787       0.8257        0.3951        2.2225
      6        0.2156       0.8285        0.3958        2.2006
      7        0.1714       0.8144        0.4410        2.2059
      8        0.1336       0.8332        0.4100     +  2.2174
      9        0.0950       0.8266        0.4295        2.2214
     10        0.0738       0.8238        0.4489        2.1938
     11        0.0596       0.8304        0.4705        2.1988
     12        0.0476       0.8266        0.4769        2.2083
Stopping since valid_acc has not improved in the last 5 epochs.
Training complete! Best accuracy: 83.32%

As deep learning models can overfit the training data quickly, it’s important to save our model when it fits the validation data best. After training, we can load our model from the saved checkpoint to make predictions.

# Load parameters from checkpoint
net.load_params(checkpoint=cp)

predict("All of friends slept while watching this movie. But I really enjoyed it.", model=net)
predict("I have waited so long for this movie. I am now so satisfied and happy.", model=net)
predict("This movie is long and boring.", model=net)
predict("I don't like the ending.", model=net)
This review is 67.25% positive.
This review is 61.38% positive.
This review is 0.12% positive.
This review is 19.14% positive.

Conclusion

Before the rise of huge and complicated models based on the Transformer architecture, a simple CNN architecture with one layer of convolution could already yield excellent performance on sentence classification tasks. The model takes advantage of unsupervised pre-training of word vectors to improve overall performance. This architecture can be improved further by increasing the number of CNN layers or by using a sub-word model (e.g., a BPE tokenizer with fastText pretrained sub-word vectors). Because of its speed, the CNN model is a strong baseline to try before more complicated models such as BERT.

Thank you for staying with me to this point. If interested, you can check out other articles in my NLP tutorial series: