Chris Tran

Pre-train ELECTRA for Spanish from Scratch

2020-06-11T00:00:00-04:00

1. Introduction

At ICLR 2020, ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, a new method for self-supervised language representation learning, was introduced. ELECTRA is another member of the Transformer pre-training method family, whose previous members such as BERT, GPT-2, RoBERTa have achieved many state-of-the-art results in Natural Language Processing benchmarks.

Different from other masked language modeling methods, ELECTRA is a more sample-efficient pre-training task called replaced token detection. At a small scale, ELECTRA-small can be trained on a single GPU for 4 days to outperform GPT (Radford et al., 2018) (trained using 30x more compute) on the GLUE benchmark. At a large scale, ELECTRA-large outperforms ALBERT (Lan et al., 2019) on GLUE and sets a new state-of-the-art for SQuAD 2.0.

ELECTRA consistently outperforms masked language model pre-training approaches.

2. Method

Masked language modeling pre-training methods such as BERT (Devlin et al., 2019) corrupt the input by replacing some tokens (typically 15% of the input) with [MASK] and then train a model to re-construct the original tokens.

Instead of masking, ELECTRA corrupts the input by replacing some tokens with samples from the outputs of a smalled masked language model. Then, a discriminative model is trained to predict whether each token was an original or a replacement. After pre-training, the generator is thrown out and the discriminator is fine-tuned on downstream tasks.

An overview of ELECTRA.

Although having a generator and a discriminator like GAN, ELECTRA is not adversarial in that the generator producing corrupted tokens is trained with maximum likelihood rather than being trained to fool the discriminator.

Why is ELECTRA so efficient?

With a new training objective, ELECTRA can achieve comparable performance to strong models such as RoBERTa (Liu et al., (2019) which has more parameters and needs 4x more compute for training. In the paper, an analysis was conducted to understand what really contribute to ELECTRA’s efficiency. The key findings are:

ELECTRA is greatly benefiting from having a loss defined over all input tokens rather than just a subset. More specifically, in ELECTRA, the discriminator predicts on every token in the input, while in BERT, the generator only predicts 15% masked tokens of the input.
BERT’s performance is slightly harmed because in the pre-training phase, the model sees [MASK] tokens, while it is not the case in the fine-tuning phase.

ELECTRA vs. BERT

3. Pre-train ELECTRA

In this section, we will train ELECTRA from scratch with TensorFlow using scripts provided by ELECTRA’s authors in google-research/electra. Then we will convert the model to PyTorch’s checkpoint, which can be easily fine-tuned on downstream tasks using Hugging Face’s transformers library.

Setup

!pip install tensorflow==1.15
!pip install transformers==2.8.0
!git clone https://github.com/google-research/electra.git

import os
import json
from transformers import AutoTokenizer

Data

We will pre-train ELECTRA on a Spanish movie subtitle dataset retrieved from OpenSubtitles. This dataset is 5.4 GB in size and we will train on a small subset of ~30 MB for presentation.

DATA_DIR = "./data" #@param {type: "string"}
TRAIN_SIZE = 1000000 #@param {type:"integer"}
MODEL_NAME = "electra-spanish" #@param {type: "string"}

# Download and unzip the Spanish movie substitle dataset
if not os.path.exists(DATA_DIR):
  !mkdir -p $DATA_DIR
  !wget "https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz" -O $DATA_DIR/OpenSubtitles.txt.gz
  !gzip -d $DATA_DIR/OpenSubtitles.txt.gz
  !head -n $TRAIN_SIZE $DATA_DIR/OpenSubtitles.txt > $DATA_DIR/train_data.txt 
  !rm $DATA_DIR/OpenSubtitles.txt

Before building the pre-training dataset, we should make sure the corpus has the following format:

each line is a sentence
a blank line separates two documents

Build Pretraining Dataset

We will use the tokenizer of bert-base-multilingual-cased to process Spanish texts.

# Save the pretrained WordPiece tokenizer to get `vocab.txt`
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
tokenizer.save_pretrained(DATA_DIR)

We use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text.

!python3 electra/build_pretraining_dataset.py \
  --corpus-dir $DATA_DIR \
  --vocab-file $DATA_DIR/vocab.txt \
  --output-dir $DATA_DIR/pretrain_tfrecords \
  --max-seq-length 128 \
  --blanks-separate-docs False \
  --no-lower-case \
  --num-processes 5

Start Training

We use run_pretraining.py to pre-train an ELECTRA model.

To train a small ELECTRA model for 1 million steps, run:

python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small

This takes slightly over 4 days on a Tesla V100 GPU. However, the model should achieve decent results after 200k steps (10 hours of training on the v100 GPU).

To customize the training, create a .json file containing the hyperparameters. Please refer configure_pretraining.py for default values of all hyperparameters.

Below, we set the hyperparameters to train the model for only 100 steps.

hparams = {
    "do_train": "true",
    "do_eval": "false",
    "model_size": "small",
    "do_lower_case": "false",
    "vocab_size": 119547,
    "num_train_steps": 100,
    "save_checkpoints_steps": 100,
    "train_batch_size": 32,
}
           
with open("hparams.json", "w") as f:
    json.dump(hparams, f)

Let’s start training:

!python3 electra/run_pretraining.py \
  --data-dir $DATA_DIR \
  --model-name $MODEL_NAME \
  --hparams "hparams.json"

If you are training on a virtual machine, run the following lines on the terminal to moniter the training process with TensorBoard.

pip install -U tensorboard
tensorboard dev upload --logdir data/models/electra-spanish

This is the TensorBoard of training ELECTRA-small for 1 million steps in 4 days on a V100 GPU.

4. Convert Tensorflow checkpoints to PyTorch format

Hugging Face has a tool to convert Tensorflow checkpoints to PyTorch. However, this tool has yet been updated for ELECTRA. Fortunately, I found a GitHub repo by @lonePatient that can help us with this task.

!git clone https://github.com/lonePatient/electra_pytorch.git

MODEL_DIR = "data/models/electra-spanish/"

config = {
  "vocab_size": 119547,
  "embedding_size": 128,
  "hidden_size": 256,
  "num_hidden_layers": 12,
  "num_attention_heads": 4,
  "intermediate_size": 1024,
  "generator_size":"0.25",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1,
  "max_position_embeddings": 512,
  "type_vocab_size": 2,
  "initializer_range": 0.02
}

with open(MODEL_DIR + "config.json", "w") as f:
    json.dump(config, f)

!python electra_pytorch/convert_electra_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path=$MODEL_DIR \
    --electra_config_file=$MODEL_DIR/config.json \
    --pytorch_dump_path=$MODEL_DIR/pytorch_model.bin

Use ELECTRA with transformers

After converting the model checkpoint to PyTorch format, we can start to use our pre-trained ELECTRA model on downstream tasks with the transformers library.

import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained(MODEL_DIR)
tokenizer = ElectraTokenizerFast.from_pretrained(DATA_DIR, do_lower_case=False)

sentence = "Los pájaros están cantando" # The birds are singing
fake_sentence = "Los pájaros están hablando" # The birds are speaking 

fake_tokens = tokenizer.tokenize(fake_sentence, add_special_tokens=True)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
predictions = discriminator_outputs[0] > 0

[print("%7s" % token, end="") for token in fake_tokens]
print("\n")
[print("%7s" % int(prediction), end="") for prediction in predictions.tolist()];

  [CLS]    Los    paj ##aros  estan  habla  ##ndo  [SEP]

      1      0      0      0      0      0      0      0

Our model was trained for only 100 steps so the predictions are not accurate. The fully-trained ELECTRA-small for Spanish can be loaded as below:

discriminator = ElectraForPreTraining.from_pretrained("skimai/electra-small-spanish")
tokenizer = ElectraTokenizerFast.from_pretrained("skimai/electra-small-spanish", do_lower_case=False)

5. Conclusion

In this article, we have walked through the ELECTRA paper to understand why ELECTRA is the most efficient transformer pre-training approach at the moment. At a small scale, ELECTRA-small can be trained on one GPU for 4 days to outperform GPT on the GLUE benchmark. At a large scale, ELECTRA-large sets a new state-of-the-art for SQuAD 2.0.

We then actually train an ELECTRA model on Spanish texts and convert Tensorflow checkpoint to PyTorch and use the model with the transformers library.

References

[1] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
[2] google-research/electra - the official GitHub repository of the original paper
[3] electra_pytorch - a PyTorch implementation of ELECTRA

Extractive Summarization with BERT

2020-05-31T00:00:00-04:00

1. Introduction

Summarization has long been a challenge in Natural Language Processing. To generate a short version of a document while retaining its most important information, we need a model capable of accurately extracting the key points while avoiding repetitive information. Fortunately, recent works in NLP such as Transformer models and language model pretraining have advanced the state-of-the-art in summarization.

In this article, we will explore BERTSUM, a simple variant of BERT, for extractive summarization from Text Summarization with Pretrained Encoders (Liu et al., 2019). Then, in an effort to make extractive summarization even faster and smaller for low-resource devices, we will fine-tune DistilBERT (Sanh et al., 2019) and MobileBERT (Sun et al., 2019), two recent lite versions of BERT, and discuss our findings.

2. Extractive Summarization

There are two types of summarization: abstractive and extractive summarization. Abstractive summarization basically means rewriting key points while extractive summarization generates summary by copying directly the most important spans/sentences from a document.

Abstractive summarization is more challenging for humans, and also more computationally expensive for machines. However, which summaration is better depends on the purpose of the end user. If you were writing an essay, abstractive summaration might be a better choice. On the other hand, if you were doing some research and needed to get a quick summary of what you were reading, extractive summarization would be more helpful for the task.

In this section we will explore the architecture of our extractive summarization model. The BERT summarizer has 2 parts: a BERT encoder and a summarization classifier.

BERT Encoder

The overview architecture of BERTSUM

Our BERT encoder is the pretrained BERT-base encoder from the masked language modeling task (Devlin et at., 2018). The task of extractive summarization is a binary classification problem at the sentence level. We want to assign each sentence a label \(y_i \in \{0, 1\}\) indicating whether the sentence should be included in the final summary. Therefore, we need to add a token [CLS] before each sentence. After we run a forward pass through the encoder, the last hidden layer of these [CLS] tokens will be used as the representions for our sentences.

Summarization Classifier

After getting the vector representation of each sentence, we can use a simple feed forward layer as our classifier to return a score for each sentence. In the paper, the author experimented with a simple linear classifier, a Recurrent Neural Network and a small Transformer model with 3 layers. The Transformer classifier yields the best results, showing that inter-sentence interactions through self-attention mechanism is important in selecting the most important sentences.

So in the encoder, we learn the interactions among tokens in our document while in the summarization classifier, we learn the interactions among sentences.

3. Make Summarization Even Faster

Transformer models achieve state-of-the-art performance in most NLP bechmarks; however, training and making predictions from them are computationally expensive. In an effort to make summarization lighter and faster to be deployed on low-resource devices, I have modified the source codes provided by the authors of BERTSUM to replace the BERT encoder with DistilBERT and MobileBERT. The summary layers are kept unchaged.

Here are training losses of these 3 variants: TensorBoard

Despite being 40% smaller than BERT-base, DistilBERT has the same training losses as BERT-base while MobileBERT performs slightly worse. The table below shows their performance on CNN/DailyMail dataset, size and running time of a forward pass:

Models	ROUGE-1	ROUGE-2	ROUGE-L	Inference Time*	Size	Params
bert-base	43.23	20.24	39.63	1.65 s	475 MB	120.5 M
distilbert	42.84	20.04	39.31	925 ms	310 MB	77.4 M
mobilebert	40.59	17.98	36.99	609 ms	128 MB	30.8 M

*Average running time of a forward pass on a single GPU on a standard Google Colab notebook

Being 45% faster, DistilBERT have almost the same performance as BERT-base. MobileBERT retains 94% performance of BERT-base, while being 4x smaller than BERT-base and 2.5x smaller than DistilBERT. In the MobileBERT paper, it’s shown that MobileBERT significantly outperforms DistilBERT on SQuAD v1.1. However, it’s not the case for extractive summarization. But this is still an impressive result for MobileBERT with a disk size of only 128 MB.

4. Let’s Summarize

All pretrained checkpoints, training details and setup instruction can be found in this GitHub repository. In addition, I have deployed a demo of BERTSUM with the MobileBERT encoder.

Web app: https://extractive-summarization.herokuapp.com/

Code:

import torch
from models.model_builder import ExtSummarizer
from ext_sum import summarize

# Load model
model_type = 'mobilebert' #@param ['bertbase', 'distilbert', 'mobilebert']
checkpoint = torch.load(f'checkpoints/{model_type}_ext.pt', map_location='cpu')
model = ExtSummarizer(checkpoint=checkpoint, bert_type=model_type, device='cpu')

# Run summarization
input_fp = 'raw_data/input.txt'
result_fp = 'results/summary.txt'
summary = summarize(input_fp, result_fp, model, max_length=3)
print(summary)

Summary sample

Original: https://www.cnn.com/2020/05/22/business/hertz-bankruptcy/index.html

By declaring bankruptcy, Hertz says it intends to stay in business while restructuring its debts and emerging a
financially healthier company. The company has been renting cars since 1918, when it set up shop with a dozen
Ford Model Ts, and has survived the Great Depression, the virtual halt of US auto production during World War II
and numerous oil price shocks. "The impact of Covid-19 on travel demand was sudden and dramatic, causing an
abrupt decline in the company's revenue and future bookings," said the company's statement.

5. Conclusion

In this article, we have explored BERTSUM, a simple variant of BERT, for extractive summarization from the paper Text Summarization with Pretrained Encoders (Liu et al., 2019). Then, in an effort to make extractive summarization even faster and smaller for low-resource devices, we fine-tuned DistilBERT (Sanh et al., 2019) and MobileBERT (Sun et al., 2019) on CNN/DailyMail datasets.

DistilBERT retains BERT-base’s performance in extractive summarization while being 45% smaller. MobileBERT is 4x smaller and 2.7x faster than BERT-base yet retains 94% of its performance.

Finally, we deployed a web app demo of MobileBERT for extractive summarization at https://extractive-summarization.herokuapp.com/.

References

Visual Recognition for Vietnamese Foods

2020-05-26T00:00:00-04:00

Imagine that you are a world traveller and are traveling to a country famous for its street foods. Walking in a night market street full of food trucks with many delicious-looking options, you have no idea what these foods are and whether they contain any ingredient you are allergic to. You want to ask the local but you don’t know the language. You wish that you have an app on your phone that allows you to take a picture of that food you want to have and will return all the information you need to know about it.

It’s one simple application of computer vision that can make our life better, besides many other applications being implemented in autonomous driving or cancer detection.

Today we are going to build a world-class image classifier using the fastai library to classify 11 popular Vietnamese dishes. The fastai library is built on top of PyTorch and allows us to quickly and easily build the latest neural networks and train our models to receive state-of-the-art results.

Before we begin, let’s load packages that we are going to use in this project.

%reload_ext autoreload
%autoreload 2
%matplotlib inline

import os
import numpy as np
import urllib.request
from fastai.vision import *
from fastai.metrics import error_rate
from PIL import Image, ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

Looking at the data

I have built an image dataset of popular Vietnamese dishes using the Bing Image Search API by following PyImageSearch’s tutorial. We can directly download this dataset from Google Drive.

%%time
# Download and unzip
if not os.path.exists("data"):
    !wget -O "dataset.zip" "https://www.googleapis.com/drive/v3/files/13GD8pcwHJPiAPbPtm6KeC20Qw1zm9xdy?alt=media&key=AIzaSyCmo6sAQ37OK8DK4wnT94PoLx5lx-7VTDE"
    !unzip dataset.zip
    !rm dataset.zip

Each class is stored in a seperate folder in data and we can use ImageDataBunch.from_folder to quickly load our dataset. In addition, we resize our images to 224 x 224 pixel and use get_transforms to flip, rotate, zoom, warp, adjust lighting our original images (which is called data augmentation, a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data). We also normalize the images using statistics from ImageNet dataset.

path = Path("./data/")
np.random.seed(42)
data = ImageDataBunch.from_folder(path, train='.', ds_tfms=get_transforms(),
                                  valid_pct=0.2, bs=64, size=224).normalize(imagenet_stats)

The dataset has more than 6k images and we use 20% of them as a validation set.

len(data.train_ds), len(data.valid_ds)

(4959, 1239)

Let’s look at some samples:

data.show_batch(rows=3, figsize=(7, 6))

The dataset has a total of 11 classes. They are:

print("Number of classes: ", len(data.classes))
print("Classes: ", data.classes)

Number of classes:  11
Classes:  ['banh-mi', 'banh-xeo', 'bubble-tea', 'bun-bo-hue', 'bun-bo-nam-bo', 'bun-cha', 'bun-dau-mam-tom', 'che', 'hu-tieu', 'pho', 'spring-rolls']

Training: ResNet-50

Now we will finetune a ResNet-50 model on our customized dataset. ResNet is from the paper Deep Residual Learning for Image Recognition and is the best default model for computer vision. This ResNet-34 model is trained on ImageNet with 1000 classes, so first we need to initialize a new head for the model to be adapted to the number of classes in our dataset. The cnn_learner method will do this for us (Read documentation).

Then we will train the model in two stages:

first we freeze the body weights and only train the head,
then we unfreeze the layers of the backbone and fine-tune the whole model.

Stage 1: Finetune the top layers only

learn = cnn_learner(data, models.resnet50, metrics=error_rate, pretrained=True)

We will train the model for 20 epoches following the 1cycle policy. The number of epoches we need to train depends on how different the dataset is from ImageNet. Basically, we train the model until the validation loss stops decrease. The default learning rate here is 3e-3.

learn.fit_one_cycle(20)

epoch	train_loss	valid_loss	error_rate	time
0	1.356109	0.657256	0.206618	01:31
1	0.870719	0.560793	0.167070	01:27
2	0.687549	0.526645	0.167877	01:32
3	0.621711	0.444908	0.143664	01:27
4	0.493342	0.397156	0.126715	01:28
5	0.435479	0.381771	0.123487	01:27
6	0.373893	0.389900	0.121065	01:28
7	0.325564	0.371386	0.112994	01:28
8	0.295842	0.349679	0.106538	01:29
9	0.271255	0.348150	0.098467	01:28
10	0.233799	0.317944	0.091203	01:28
11	0.205257	0.306772	0.086360	01:27
12	0.171645	0.310830	0.084746	01:28
13	0.159442	0.293081	0.078289	01:27
14	0.137362	0.295972	0.083939	01:29
15	0.112108	0.289382	0.084746	01:26
16	0.093431	0.279121	0.083939	01:30
17	0.094814	0.281516	0.079903	01:29
18	0.083980	0.277226	0.073446	01:33
19	0.090189	0.274643	0.081517	01:28

learn.save('resnes50-stage-1')

Stage 2: Unfreeze and finetune the entire networks

At this stage we will unfreeze the whole model and finetune it with smaller learning rates. The code below will help us find the learning rate for this stage.

learn.load('resnes50-stage-1')
learn.lr_find()
learn.recorder.plot()

When finetuning the model at state 2, we will use different learning rates for different layers. The top layers will be update at greater rates than the bottom layers. As a rule of thumb, we use learning rates between (a, b), in which:

a is taken from the LR Finder above where the loss starts to decrease for a while,
b is 5 to 10 times smaller than the default rate we used in stage 1.

In this case, the learning rate we will use is from 3e-6 to 3e-4 We continnue to train the model until the validation loss stops decreasing.

learn.unfreeze()
learn.fit_one_cycle(8, max_lr=slice(3e-6, 3e-4))

epoch	train_loss	valid_loss	error_rate	time
0	0.077156	0.277789	0.079903	01:30
1	0.089096	0.269736	0.075061	01:30
2	0.092252	0.282279	0.079096	01:29
3	0.074603	0.265167	0.067797	01:31
4	0.071408	0.278237	0.074253	01:33
5	0.047503	0.250248	0.065375	01:29
6	0.037788	0.249852	0.070218	01:30
7	0.035603	0.251429	0.066990	01:29

After two stages, the error rate on the validation set is about 6% and the accuracy for our dataset with 11 classes is 94%. That’s a pretty accurate model. It’s a good practice to save the current stage of the model in case we want to train it for more epoches.

learn.save('resnes50-stage-2')

Evaluation

Now let’s look on the model’s predictions. Specifically, we will look at examples where the model makes wrong predictions. Below are 9 examples with the top losses, meaning on these examples the model gives the actual class a very low score.

learn.load('resnes50-stage-2')
interp = ClassificationInterpretation.from_learner(learn)
losses, idxs = interp.top_losses()
interp.plot_top_losses(9, figsize=(15,11))

The confusion matrix below will us a big picture of which pair of dishes make our model confused. The number of wrong predictions are quite low.

interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

interp.most_confused(min_val=3)

[('pho', 'hu-tieu', 11),
 ('bun-bo-nam-bo', 'bun-cha', 5),
 ('hu-tieu', 'bun-bo-nam-bo', 5),
 ('pho', 'bun-bo-hue', 5),
 ('bun-bo-hue', 'pho', 4),
 ('hu-tieu', 'bun-cha', 4),
 ('bun-bo-hue', 'hu-tieu', 3),
 ('bun-cha', 'bun-bo-nam-bo', 3),
 ('hu-tieu', 'bun-bo-hue', 3),
 ('hu-tieu', 'pho', 3)]

Production

After being satisfied with our model, we can save it to disk and use it to classify new data. We can also deploy it into a magical app that I described at the beginning of the notebook.

# Export model
learn.export()

For inference, we also want to run the model on CPU instead of GPU.

defaults.device = torch.device('cpu')

The functions below will download an image from an URL and use the model to make prediction for that image.

def open_image_url(url):
    """Download image from URL, return fastai image and PIL image."""
    urllib.request.urlretrieve(url, "./img/test.jpg")
    return open_image("./img/test.jpg"), Image.open("./img/test.jpg")

def predict(url):
    """Make prediction for image from url, show image and predicted probability."""
    img, pil_img = open_image_url(url)
    pred_class, pred_idx, outputs = learn.predict(img)
    print("Predicted Class: ", pred_class)
    print(f"Probability: {outputs[pred_idx].numpy() * 100:.2f}%")
    
    plt.figure(figsize=(15, 4))
    # Show image
    plt.subplot(1, 2, 1)
    plt.imshow(pil_img)
    plt.axis('off')
    plt.title('Image')
    
    # Plot Probabilities
    plt.subplot(1, 2, 2)
    pd.Series(outputs, data.classes).sort_values().plot(kind='barh')
    plt.xlabel("Class")
    plt.title("Probability");

# Load model
learn = load_learner(path)

# Make prediction from URL
url = str(input("URL: "))
predict(url)

URL:  https://i.pinimg.com/originals/9b/63/d7/9b63d76a44be020e03eeaec0a1134e95.jpg


Predicted Class:  bun-bo-hue
Probability: 99.99%

I have also built a web app for this model. Check it out below:

Conclusion

We have walked through the whole pipeline to finetune a ResNet-50 model on our customized dataset, evaluate its strengths and weaknesses and deploy the model for production. With the fastai library, we can quickly achieve state-of-the-art results with very neat and clean codes.

In this project, we built a classifier for only 11 popular Vietnamese dishes, but we can easily scale up the model by collecting more data for hundreds or thounsands of classes following this PyImageSearch’s tutorial.

I have seen many interesting ideas using computer vision, such as Plantix, a plant doctor app which can tell what diseases your plants might have from a single photo and how you should take care of them. And yes, with some software engineering skills, some data and an idea, you can build your own impactful computer vision application. I look forwards to seeing these applications make the world a better place in all directions.

Reference

Named Entity Recognition with Transformers

2020-05-07T00:00:00-04:00

Introduction

Part I: How We Trained RoBERTa Language Model for Spanish from Scratch

In my previous blog post, we have discussed how my team pretrained SpanBERTa, a transformer language model for Spanish, on a big corpus from scratch. The model has shown to be able to predict correctly masked words in a sequence based on its context. In this blog post, to really leverage the power of transformer models, we will fine-tune SpanBERTa for a named-entity recognition task.

According to its definition on Wikipedia, Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

We will use the script run_ner.py by Hugging Face and CoNLL-2002 dataset to fine-tune SpanBERTa.

Setup

Download transformers and install required packages.

%%capture
!git clone https://github.com/huggingface/transformers
%cd transformers
!pip install .
!pip install -r ./examples/requirements.txt
%cd ..

Data

1. Download Datasets

The below command will download and unzip the dataset. The files contain the train and test data for three parts of the CoNLL-2002 shared task:

esp.testa: Spanish test data for the development stage
esp.testb: Spanish test data
esp.train: Spanish train data

%%capture
!wget -O 'conll2002.zip' 'https://drive.google.com/uc?export=download&id=1Wrl1b39ZXgKqCeAFNM9EoXtA1kzwNhCe'
!unzip 'conll2002.zip'

The size of each dataset:

!wc -l conll2002/esp.train
!wc -l conll2002/esp.testa
!wc -l conll2002/esp.testb

273038 conll2002/esp.train
54838 conll2002/esp.testa
53050 conll2002/esp.testb

All data files has three columns: words, associated part-of-speech tags and named entity tags in the IOB2 format. Sentence breaks are encoded by empty lines.

!head -n20 conll2002/esp.train

Melbourne NP B-LOC
( Fpa O
Australia NP B-LOC
) Fpt O
, Fc O
25 Z O
may NC O
( Fpa O
EFE NC B-ORG
) Fpt O
. Fp O

- Fg O

El DA O
Abogado NC B-PER
General AQ I-PER
del SP I-PER
Estado NC I-PER
, Fc O

We will only keep the word column and the named entity tag column for our train, dev and test datasets.

!cat conll2002/esp.train | cut -d " " -f 1,3 > train_temp.txt
!cat conll2002/esp.testa | cut -d " " -f 1,3 > dev_temp.txt
!cat conll2002/esp.testb | cut -d " " -f 1,3 > test_temp.txt

2. Preprocessing

Let’s define some variables that we need for further pre-processing steps and training the model:

MAX_LENGTH = 120 #@param {type: "integer"}
MODEL = "chriskhanhtran/spanberta" #@param ["chriskhanhtran/spanberta", "bert-base-multilingual-cased"]

The script below will split sentences longer than MAX_LENGTH (in terms of tokens) into small ones. Otherwise, long sentences will be truncated when tokenized, causing the loss of training data and some tokens in the test set not being predicted.

%%capture
!wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"

!python3 preprocess.py train_temp.txt $MODEL $MAX_LENGTH > train.txt
!python3 preprocess.py dev_temp.txt $MODEL $MAX_LENGTH > dev.txt
!python3 preprocess.py test_temp.txt $MODEL $MAX_LENGTH > test.txt

2020-04-22 23:02:05.747294: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 1.03k/1.03k [00:00<00:00, 704kB/s]
Downloading: 100% 954k/954k [00:00<00:00, 1.89MB/s]
Downloading: 100% 512k/512k [00:00<00:00, 1.19MB/s]
Downloading: 100% 16.0/16.0 [00:00<00:00, 12.6kB/s]
2020-04-22 23:02:23.409488: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-22 23:02:31.168967: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1

3. Labels

In CoNLL-2002/2003 datasets, there are have 9 classes of NER tags:

O, Outside of a named entity
B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity
I-MIS, Miscellaneous entity
B-PER, Beginning of a person’s name right after another person’s name
I-PER, Person’s name
B-ORG, Beginning of an organisation right after another organisation
I-ORG, Organisation
B-LOC, Beginning of a location right after another location
I-LOC, Location

If your dataset has different labels or more labels than CoNLL-2002/2003 datasets, run the line below to get unique labels from your data and save them into labels.txt. This file will be used when we start fine-tuning our model.

!cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt

Fine-tuning Model

These are the example scripts from transformers’s repo that we will use to fine-tune our model for NER. After 04/21/2020, Hugging Face has updated their example scripts to use a new Trainer class. To avoid any future conflict, let’s use the version before they made these updates.

%%capture
!wget "https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/ner/run_ner.py"
!wget "https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/ner/utils_ner.py"

Now it’s time for transfer learning. In my previous blog post, I have pretrained a RoBERTa language model on a very large Spanish corpus to predict masked words based on the context they are in. By doing that, the model has learned inherent properties of the language. I have uploaded the pretrained model to Hugging Face’s server. Now we will load the model and start fine-tuning it for the NER task.

Below are our training hyperparameters.

MAX_LENGTH = 128 #@param {type: "integer"}
MODEL = "chriskhanhtran/spanberta" #@param ["chriskhanhtran/spanberta", "bert-base-multilingual-cased"]
OUTPUT_DIR = "spanberta-ner" #@param ["spanberta-ner", "bert-base-ml-ner"]
BATCH_SIZE = 32 #@param {type: "integer"}
NUM_EPOCHS = 3 #@param {type: "integer"}
SAVE_STEPS = 100 #@param {type: "integer"}
LOGGING_STEPS = 100 #@param {type: "integer"}
SEED = 42 #@param {type: "integer"}

Let’s start training.

!python3 run_ner.py \
  --data_dir ./ \
  --model_type bert \
  --labels ./labels.txt \
  --model_name_or_path $MODEL \
  --output_dir $OUTPUT_DIR \
  --max_seq_length  $MAX_LENGTH \
  --num_train_epochs $NUM_EPOCHS \
  --per_gpu_train_batch_size $BATCH_SIZE \
  --save_steps $SAVE_STEPS \
  --logging_steps $LOGGING_STEPS \
  --seed $SEED \
  --do_train \
  --do_eval \
  --do_predict \
  --overwrite_output_dir

Performance on the dev set:

04/21/2020 02:24:31 - INFO - __main__ -   ***** Eval results  *****
04/21/2020 02:24:31 - INFO - __main__ -     f1 = 0.831027443864822
04/21/2020 02:24:31 - INFO - __main__ -     loss = 0.1004064822183894
04/21/2020 02:24:31 - INFO - __main__ -     precision = 0.8207885304659498
04/21/2020 02:24:31 - INFO - __main__ -     recall = 0.8415250344510795

Performance on the test set:

04/21/2020 02:24:48 - INFO - __main__ -   ***** Eval results  *****
04/21/2020 02:24:48 - INFO - __main__ -     f1 = 0.8559533721898419
04/21/2020 02:24:48 - INFO - __main__ -     loss = 0.06848683688204177
04/21/2020 02:24:48 - INFO - __main__ -     precision = 0.845858475041141
04/21/2020 02:24:48 - INFO - __main__ -     recall = 0.8662921348314607

Here are the tensorboards of fine-tuning spanberta and bert-base-multilingual-cased for 5 epoches. We can see that the models overfit the training data after 3 epoches.

Classification Report

To understand how well our model actually performs, let’s load its predictions and examine the classification report.

def read_examples_from_file(file_path):
    """Read words and labels from a CoNLL-2002/2003 data file.
    
    Args:
      file_path (str): path to NER data file.

    Returns:
      examples (dict): a dictionary with two keys: `words` (list of lists)
        holding words in each sequence, and `labels` (list of lists) holding
        corresponding labels.
    """
    with open(file_path, encoding="utf-8") as f:
        examples = {"words": [], "labels": []}
        words = []
        labels = []
        for line in f:
            if line.startswith("-DOCSTART-") or line == "" or line == "\n":
                if words:
                    examples["words"].append(words)
                    examples["labels"].append(labels)
                    words = []
                    labels = []
            else:
                splits = line.split(" ")
                words.append(splits[0])
                if len(splits) > 1:
                    labels.append(splits[-1].replace("\n", ""))
                else:
                    # Examples could have no label for mode = "test"
                    labels.append("O")
    return examples

Read data and labels from the raw text files:

y_true = read_examples_from_file("test.txt")["labels"]
y_pred = read_examples_from_file("spanberta-ner/test_predictions.txt")["labels"]

Print the classification report:

from seqeval.metrics import classification_report as classification_report_seqeval

print(classification_report_seqeval(y_true, y_pred))

           precision    recall  f1-score   support

      LOC       0.87      0.84      0.85      1084
      ORG       0.82      0.87      0.85      1401
     MISC       0.63      0.66      0.65       340
      PER       0.94      0.96      0.95       735

micro avg       0.84      0.86      0.85      3560
macro avg       0.84      0.86      0.85      3560

The metrics we are seeing in this report are designed specifically for NLP tasks such as NER and POS tagging, in which all words of an entity need to be predicted correctly to be counted as one correct prediction. Therefore, the metrics in this classification report are much lower than in scikit-learn’s classification report.

import numpy as np
from sklearn.metrics import classification_report

print(classification_report(np.concatenate(y_true), np.concatenate(y_pred)))

              precision    recall  f1-score   support

       B-LOC       0.88      0.85      0.86      1084
      B-MISC       0.73      0.73      0.73       339
       B-ORG       0.87      0.91      0.89      1400
       B-PER       0.95      0.96      0.95       735
       I-LOC       0.82      0.81      0.81       325
      I-MISC       0.85      0.76      0.80       557
       I-ORG       0.89      0.87      0.88      1104
       I-PER       0.98      0.98      0.98       634
           O       1.00      1.00      1.00     45355

    accuracy                           0.98     51533
   macro avg       0.89      0.87      0.88     51533
weighted avg       0.98      0.98      0.98     51533

From above reports, our model has a good performance in predicting person, location and organization. We will need more data for MISC entities to improve our model’s performance on these entities.

Pipeline

After fine-tuning our models, we can share them with the community by following the tutorial in this page. Now we can start loading the fine-tuned model from Hugging Face’s server and use it to predict named entities in Spanish documents.

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("skimai/spanberta-base-cased-ner-conll02")
tokenizer = AutoTokenizer.from_pretrained("skimai/spanberta-base-cased-ner-conll02")

ner_model = pipeline('ner', model=model, tokenizer=tokenizer)

The example below is obtained from La Opinión and means “The economic recovery of the United States after the coronavirus pandemic will be a matter of months, said Treasury Secretary Steven Mnuchin.”

sequence = "La recuperación económica de los Estados Unidos después de la " \
           "pandemia del coronavirus será cuestión de meses, afirmó el " \
           "Secretario del Tesoro, Steven Mnuchin."
ner_model(sequence)

[{'entity': 'B-ORG', 'score': 0.9155661463737488, 'word': 'ĠEstados'},
 {'entity': 'I-ORG', 'score': 0.800682544708252, 'word': 'ĠUnidos'},
 {'entity': 'I-MISC', 'score': 0.5006815791130066, 'word': 'Ġcorona'},
 {'entity': 'I-MISC', 'score': 0.510674774646759, 'word': 'virus'},
 {'entity': 'B-PER', 'score': 0.5558510422706604, 'word': 'ĠSecretario'},
 {'entity': 'I-PER', 'score': 0.7758238315582275, 'word': 'Ġdel'},
 {'entity': 'I-PER', 'score': 0.7096233367919922, 'word': 'ĠTesoro'},
 {'entity': 'B-PER', 'score': 0.9940345883369446, 'word': 'ĠSteven'},
 {'entity': 'I-PER', 'score': 0.9962581992149353, 'word': 'ĠM'},
 {'entity': 'I-PER', 'score': 0.9918380379676819, 'word': 'n'},
 {'entity': 'I-PER', 'score': 0.9848328828811646, 'word': 'uch'},
 {'entity': 'I-PER', 'score': 0.8513168096542358, 'word': 'in'}]

Looks great! The fine-tuned model successfully recognizes all entities in our example, and even recognizes “corona virus.”

Conclusion

Named-entity recognition can help us quickly extract important information from texts. Therefore, its application in business can have a direct impact on improving human’s productivity in reading contracts and documents. However, it is a challenging NLP task because NER requires accurate classification at the word level, making simple approaches such as bag-of-word impossible to deal with this task.

We have walked through how we can leverage a pretrained BERT model to quickly gain an excellent performance on the NER task for Spanish. The pretrained SpanBERTa model can also be fine-tuned for other tasks such as document classification. I have written a detailed tutorial to finetune BERT for sequence classification and sentiment analysis.

Fine-tuning BERT for Sentiment Analysis

Next in this series, we will discuss ELECTRA, a more efficient pre-training approach for transformer models which can quickly achieve state-of-the-art performance. Stay tuned!

SpanBERTa: Pre-train RoBERTa Language Model for Spanish from Scratch

2020-04-07T00:00:00-04:00

Introduction

Part II: Fine-tuning SpanBERTa for Named Entity Recognition

Self-training methods with transformer models have achieved state-of-the-art performance on most NLP tasks. However, because training them is computationally expensive, most currently available pretrained transformer models are only for English. Therefore, to improve performance on NLP tasks in our projects on Spanish, my team at Skim AI decided to train a RoBERTa language model for Spanish from scratch and call it SpanBERTa.

SpanBERTa has the same size as RoBERTa-base. We followed RoBERTa’s training schema to train the model on 18 GB of OSCAR’s Spanish corpus in 8 days using 4 Tesla P100 GPUs.

In this blog post, we will walk through an end-to-end process to train a BERT-like language model from scratch using transformers and tokenizers libraries by Hugging Face. There is also a Google Colab notebook to run the codes in this article directly. You can also modify the notebook accordingly to train a BERT-like model for other languages or fine-tune it on your customized dataset.

Before moving on, I want to express a huge thank to the Hugging Face team for making state-of-the-art NLP models accessible for everyone.

Setup

1. Install Dependencies

%%capture
!pip uninstall -y tensorflow
!pip install transformers

2. Data

We pretrained SpanBERTa on OSCAR’s Spanish corpus. The full size of the dataset is 150 GB and we used a portion of 18 GB to train.

In this example, for simplicity, we will use a dataset of Spanish movie subtitles from OpenSubtitles. This dataset has a size of 5.4 GB and we will train on a subset of ~300 MB.

import os

# Download and unzip movie substitle dataset
if not os.path.exists('data/dataset.txt'):
  !wget "https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz" -O dataset.txt.gz
  !gzip -d dataset.txt.gz
  !mkdir data
  !mv dataset.txt data

--2020-04-06 15:53:04--  https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1859673728 (1.7G) [application/gzip]
Saving to: ‘dataset.txt.gz’

dataset.txt.gz      100%[===================>]   1.73G  17.0MB/s    in 1m 46s  

2020-04-06 15:54:51 (16.8 MB/s) - ‘dataset.txt.gz’ saved [1859673728/1859673728]

# Total number of lines and some random lines
!wc -l data/dataset.txt
!shuf -n 5 data/dataset.txt

179287150 data/dataset.txt
Sabes, pensé que tenías más pelotas que para enfrentarme a través de mi hermano.
Supe todos los encantamientos en todas las lenguas de los Elfos hombres y Orcos.
Anteriormente en Blue Bloods:
Y quiero que prometas que no habrá ningún trato con Daniel Stafford.
Fue comiquísimo.

# Get a subset of first 10,000,000 lines for training
TRAIN_SIZE = 10000000 #@param {type:"integer"}
!(head -n $TRAIN_SIZE data/dataset.txt) > data/train.txt

# Get a subset of next 10,000 lines for validation
VAL_SIZE = 10000 #@param {type:"integer"}
!(sed -n {TRAIN_SIZE + 1},{TRAIN_SIZE + VAL_SIZE}p data/dataset.txt) > data/dev.txt

3. Train a Tokenizer

The original BERT implementation uses a WordPiece tokenizer with a vocabulary of 32K subword units. This method, however, can introduce “unknown” tokens when processing rare words.

In this implementation, we use a byte-level BPE tokenizer with a vocabulary of 50,265 subword units (same as RoBERTa-base). Using byte-level BPE makes it possible to learn a subword vocabulary of modest size that can encode any input without getting “unknown” tokens.

Because ByteLevelBPETokenizer produces 2 files ["vocab.json", "merges.txt"] while BertWordPieceTokenizer produces only 1 file vocab.txt, it will cause an error if we use BertWordPieceTokenizer to load outputs of a BPE tokenizer.

%%time
from tokenizers import ByteLevelBPETokenizer

path = "data/train.txt"

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=path,
                vocab_size=50265,
                min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Save files to disk
!mkdir -p "models/roberta"
tokenizer.save("models/roberta")

CPU times: user 1min 37s, sys: 1.02 s, total: 1min 38s
Wall time: 1min 38s

Super fast! It takes only 2 minutes to train on 10 million lines.

Traing Language Model from Scratch

1. Model Architecture

RoBERTa has exactly the same architecture as BERT. The only differences are:

RoBERTa uses a Byte-Level BPE tokenizer with a larger subword vocabulary (50k vs 32k).
RoBERTa implements dynamic word masking and drops next sentence prediction task.
RoBERTa’s training hyperparameters.

Other architecture configurations can be found in the documentation (RoBERTa, BERT).

import json
config = {
	"architectures": [
		"RobertaForMaskedLM"
	],
	"attention_probs_dropout_prob": 0.1,
	"hidden_act": "gelu",
	"hidden_dropout_prob": 0.1,
	"hidden_size": 768,
	"initializer_range": 0.02,
	"intermediate_size": 3072,
	"layer_norm_eps": 1e-05,
	"max_position_embeddings": 514,
	"model_type": "roberta",
	"num_attention_heads": 12,
	"num_hidden_layers": 12,
	"type_vocab_size": 1,
	"vocab_size": 50265
}

with open("models/roberta/config.json", 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {"max_len": 512}

with open("models/roberta/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)

2. Training Hyperparameters

Hyperparam	BERT-base	RoBERTa-base
Sequence Length	128, 512	512
Batch Size	256	8K
Peak Learning Rate	1e-4	6e-4
Max Steps	1M	500K
Warmup Steps	10K	24K
Weight Decay	0.01	0.01
Adam \(\epsilon\)	1e-6	1e-6
Adam \(\beta_1\)	0.9	0.9
Adam \(\beta_2\)	0.999	0.98
Gradient Clipping	0.0	0.0

Note the batch size when training RoBERTa is 8000. Therefore, although RoBERTa-base was trained for 500K steps, its training computational cost is 16 times that of BERT-base. In the RoBERTa paper, it is shown that training with large batches improves perplexity for the masked language modeling objective, as well as end-task accuracy. Larger batch size can be obtained by tweaking gradient_accumulation_steps.

Due to computational constraint, we followed BERT-base’s training schema and trained our SpanBERTa model using 4 Tesla P100 GPUs for 200K steps in 8 days.

3. Start Training

We will train our model from scratch using run_language_modeling.py, a script provided by Hugging Face, which will preprocess, tokenize the corpus and train the model on Masked Language Modeling task. The script is optimized to train on a single big corpus. Therefore, if your dataset is large and you want to split it to train sequentially, you will need to modify the script, or be ready to get a monster machine with high memory.

!wget -c https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/run_language_modeling.py

Important Arguments

--line_by_line Whether distinct lines of text in the dataset are to be handled as distinct sequences. If each line in your dataset is long and has ~512 tokens or more, you should use this setting. If each line is short, the default text preprocessing will concatenate all lines, tokenize them and slit tokenized outputs into blocks of 512 tokens. You can also split your datasets into small chunks and preprocess them separately. 3GB of text will take ~50 minutes to process with the default TextDataset class.
--should_continue Whether to continue from latest checkpoint in output_dir.
--model_name_or_path The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.
--mlm Train with masked-language modeling loss instead of language modeling.
--config_name, --tokenizer_name Optional pretrained config and tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new config.
--per_gpu_train_batch_size Batch size per GPU/CPU for training. Choose the largest number you can fit on your GPUs. You will see an error if your batch size is too large.
--gradient_accumulation_steps Number of updates steps to accumulate before performing a backward/update pass. You can use this trick to increase batch size. For example, if per_gpu_train_batch_size = 16 and gradient_accumulation_steps = 4, your total train batch size will be 64.
--overwrite_output_dir Overwrite the content of the output directory.
--no_cuda, --fp16, --fp16_opt_level Arguments for training on GPU/CPU.
Other arguments are model paths and training hyperparameters.

It’s highly recommended to include model type (eg. “roberta”, “bert”, “gpt2” etc.) in the model path because the script uses the AutoModels class to guess the model’s configuration using pattern matching on the provided path.

# Model paths
MODEL_TYPE = "roberta" #@param ["roberta", "bert"]
MODEL_DIR = "models/roberta" #@param {type: "string"}
OUTPUT_DIR = "models/roberta/output" #@param {type: "string"}
TRAIN_PATH = "data/train.txt" #@param {type: "string"}
EVAL_PATH = "data/dev.txt" #@param {type: "string"}

For this example, we will train for only 25 steps on a Tesla P4 GPU provided by Colab.

!nvidia-smi

Mon Apr  6 15:59:35 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# Command line
cmd = """python run_language_modeling.py \
    --output_dir {output_dir} \
    --model_type {model_type} \
    --mlm \
    --config_name {config_name} \
    --tokenizer_name {tokenizer_name} \
    {line_by_line} \
    {should_continue} \
    {model_name_or_path} \
    --train_data_file {train_path} \
    --eval_data_file {eval_path} \
    --do_train \
    {do_eval} \
    {evaluate_during_training} \
    --overwrite_output_dir \
    --block_size 512 \
    --max_step 25 \
    --warmup_steps 10 \
    --learning_rate 5e-5 \
    --per_gpu_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 100.0 \
    --save_total_limit 10 \
    --save_steps 10 \
    --logging_steps 2 \
    --seed 42
"""

# Arguments for training from scratch. I turn off evaluate_during_training,
#   line_by_line, should_continue, and model_name_or_path.
train_params = {
    "output_dir": OUTPUT_DIR,
    "model_type": MODEL_TYPE,
    "config_name": MODEL_DIR,
    "tokenizer_name": MODEL_DIR,
    "train_path": TRAIN_PATH,
    "eval_path": EVAL_PATH,
    "do_eval": "--do_eval",
    "evaluate_during_training": "",
    "line_by_line": "",
    "should_continue": "",
    "model_name_or_path": "",
}

If you are training on a virtual machine, you can install tensorboard to monitor the training process. Here is our Tensorboard for training SpanBERTa.

pip install tensorboard==2.1.0
tensorboard dev upload --logdir runs

After 200k steps, the loss reached 1.8 and the perplexity reached 5.2.

Now let’s start training!

!{cmd.format(**train_params)}

04/06/2020 15:59:41 - WARNING - __main__ -   Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
04/06/2020 15:59:41 - INFO - transformers.configuration_utils -   loading configuration file models/roberta/config.json
04/06/2020 15:59:41 - INFO - transformers.configuration_utils -   Model config RobertaConfig {
04/06/2020 15:59:41 - INFO - transformers.configuration_utils -   loading configuration file models/roberta/config.json
04/06/2020 15:59:41 - INFO - transformers.configuration_utils -   Model config RobertaConfig {
04/06/2020 15:59:41 - INFO - transformers.tokenization_utils -   Model name 'models/roberta' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'models/roberta' is a path, a model identifier, or url to a directory containing tokenizer files.
04/06/2020 15:59:41 - INFO - transformers.tokenization_utils -   loading file models/roberta/vocab.json
04/06/2020 15:59:41 - INFO - transformers.tokenization_utils -   loading file models/roberta/merges.txt
04/06/2020 15:59:41 - INFO - transformers.tokenization_utils -   loading file models/roberta/tokenizer_config.json
04/06/2020 15:59:41 - INFO - __main__ -   Training new model from scratch
04/06/2020 15:59:55 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-06, block_size=512, cache_dir=None, config_name='models/roberta', device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='data/dev.txt', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=4, learning_rate=5e-05, line_by_line=False, local_rank=-1, logging_steps=2, max_grad_norm=100.0, max_steps=25, mlm=True, mlm_probability=0.15, model_name_or_path=None, model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='models/roberta/output', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, save_steps=10, save_total_limit=10, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='models/roberta', train_data_file='data/train.txt', warmup_steps=10, weight_decay=0.01)

04/06/2020 15:59:55 - INFO - __main__ -   Creating features from dataset file at data
04/06/2020 16:04:43 - INFO - __main__ -   Saving features into cached file data/roberta_cached_lm_510_train.txt
04/06/2020 16:04:46 - INFO - __main__ -   ***** Running training *****
04/06/2020 16:04:46 - INFO - __main__ -     Num examples = 165994
04/06/2020 16:04:46 - INFO - __main__ -     Num Epochs = 1
04/06/2020 16:04:46 - INFO - __main__ -     Instantaneous batch size per GPU = 4
04/06/2020 16:04:46 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 16
04/06/2020 16:04:46 - INFO - __main__ -     Gradient Accumulation steps = 4
04/06/2020 16:04:46 - INFO - __main__ -     Total optimization steps = 25

Epoch:   0% 0/1 [00:00<?, ?it/s]
Iteration:   0% 0/41499 [00:00<?, ?it/s][A
Iteration:   0% 1/41499 [00:01<13:18:02,  1.15s/it][A
Iteration:   0% 2/41499 [00:01<11:26:47,  1.01it/s][A
Iteration:   0% 3/41499 [00:02<10:10:30,  1.13it/s][A
Iteration:   0% 4/41499 [00:03<9:38:10,  1.20it/s] [A
Iteration:   0% 5/41499 [00:03<8:52:44,  1.30it/s][A
Iteration:   0% 6/41499 [00:04<8:22:47,  1.38it/s][A
Iteration:   0% 7/41499 [00:04<8:00:55,  1.44it/s][A
Iteration:   0% 8/41499 [00:05<8:03:40,  1.43it/s][A
Iteration:   0% 9/41499 [00:06<7:46:57,  1.48it/s][A
Iteration:   0% 10/41499 [00:06<7:35:35,  1.52it/s][A
Epoch:   0% 0/1 [01:25<?, ?it/s]
04/06/2020 16:06:11 - INFO - __main__ -    global_step = 26, average loss = 9.355212138249325
04/06/2020 16:06:11 - INFO - __main__ -   Saving model checkpoint to models/roberta/output

04/06/2020 16:06:18 - INFO - transformers.modeling_utils -   loading weights file models/roberta/output/pytorch_model.bin
04/06/2020 16:06:23 - INFO - __main__ -   Creating features from dataset file at data
04/06/2020 16:06:23 - INFO - __main__ -   Saving features into cached file data/roberta_cached_lm_510_dev.txt
04/06/2020 16:06:23 - INFO - __main__ -   ***** Running evaluation  *****
04/06/2020 16:06:23 - INFO - __main__ -     Num examples = 156
04/06/2020 16:06:23 - INFO - __main__ -     Batch size = 4
Evaluating: 100% 39/39 [00:08<00:00,  4.41it/s]
04/06/2020 16:06:32 - INFO - __main__ -   ***** Eval results  *****
04/06/2020 16:06:32 - INFO - __main__ -     perplexity = tensor(6077.6812)

4. Predict Masked Words

After training your language model, you can upload and share your model with the community. We have uploaded our SpanBERTa model to Hugging Face’s server. Before evaluating the model on downstream tasks, let’s see how it has learned to fill masked words given a context.

%%capture
%%time
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="chriskhanhtran/spanberta",
    tokenizer="chriskhanhtran/spanberta"
)

I pick a sentence from Wikipedia’s article about COVID-19.

The original sentence is “Lavarse frecuentemente las manos con agua y jabón,” meaning “Frequently wash your hands with soap and water.”

The masked word is “jabón” (soap) and the top 5 predictions are soap, salt, steam, lemon and vinegar. It is interesting that the model somehow learns that we should wash our hands with things that can kill bacteria or contain acid.

fill_mask("Lavarse frecuentemente las manos con agua y <mask>.")

[{'score': 0.6469631195068359,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y jabón.</s>',
  'token': 18493},
 {'score': 0.06074320897459984,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y sal.</s>',
  'token': 619},
 {'score': 0.029787985607981682,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vapor.</s>',
  'token': 11079},
 {'score': 0.026410052552819252,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y limón.</s>',
  'token': 12788},
 {'score': 0.017029203474521637,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vinagre.</s>',
  'token': 18424}]

Conclusion

We have walked through how to train a BERT language model for Spanish from scratch and seen that the model has learned properties of the language by trying to predict masked words given a context. You can also follow this article to fine-tune a pretrained BERT-like model on your customized dataset.

Next, we will implement the pretrained models on downstream tasks including Sequence Classification, NER, POS tagging, and NLI, as well as compare the model’s performance with some non-BERT models.

Stay tuned for our next posts!

A Complete Guide to CNN for Sentence Classification with PyTorch

2020-02-01T00:00:00-05:00

Convolutional Neural Networks (CNN) were originally invented for computer vision and now are the building blocks of state-of-the-art CV models. One of the earliest applications of CNN in Natural Language Processing was introduced in the paper Convolutional Neural Networks for Sentence Classification (Kim, 2014). With the same idea as in computer vision, CNN model is used as an feature extractor that encodes semantic features of sentences before these features are fed to a classifier.

With only a simple one-layer CNN trained on top of pretrained word vectors and little hyperparameter tuning, the model achieves excellent results on multiple sentence-level classification tasks. CNN models are now used widely in other NLP tasks such as translation and question answering as a part of a more complex architecture.

When implementing the original paper (Kim, 2014) in PyTorch, I needed to put many pieces together to complete the project. This article serves as a complete guide to CNN for sentence classification tasks accompanied with advice for practioners. It will cover:

Tokenizing and building vocabuilary from text data
Loading pretrained fastText word vectors and creating embedding layer for fine-tuning
Building and training CNN model with PyTorch
Advice for practitioners
Bonus: Using Skorch as a scikit-like wrapper for PyTorch’s Deep Learning models

Reference:

Convolutional Neural Networks for Sentence Classification (Kim, 2014).
A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification (Zhang, 2015).
Advances in Pre-Training Distributed Word Representations (Mikolov, 2018).

1. Setup

1.1. Import Libraries

import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import nltk
nltk.download("all")
import matplotlib.pyplot as plt
import torch

%matplotlib inline

1.2. Download Datasets

The dataset we will use is Movie Review (MR), a sentence polarity dataset from (Pang and Lee, 2005). The dataset has 5331 positive and 5331 negative processed sentences/snippets.

URL = 'https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
# Download Datasets
!wget -P 'Data/' $URL
# Unzip
!tar xvzf 'Data/rt-polaritydata.tar.gz' -C 'Data/'

def load_text(path):
    """Load text data, lowercase text and save to a list."""

    with open(path, 'rb') as f:
        texts = []
        for line in f:
            texts.append(line.decode(errors='ignore').lower().strip())

    return texts

# Load files
neg_text = load_text('Data/rt-polaritydata/rt-polarity.neg')
pos_text = load_text('Data/rt-polaritydata/rt-polarity.pos')

# Concatenate and label data
texts = np.array(neg_text + pos_text)
labels = np.array([0]*len(neg_text) + [1]*len(pos_text))

1.3. Download fastText Word Vectors

The pretrained word vectors used in the original paper were trained by word2vec (Mikolov et al., 2013) on 100 billion tokens of Google News. In this tutorial, we will use fastText pretrained word vectors (Mikolov et al., 2017), trained on 600 billion tokens on Common Crawl. fastText is an upgraded version of word2vec and outperforms other state-of-the-art methods by a large margin.

The code below will download fastText pretrained vectors. Using Google Colab, the running time is approximately 3min 30s.

%%time
URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"
FILE = "fastText"

if os.path.isdir(FILE):
    print("fastText exists.")
else:
    !wget -P $FILE $URL
    !unzip $FILE/crawl-300d-2M.vec.zip -d $FILE

crawl-300d-2M.vec.z 100%[===================>]   1.42G  23.8MB/s    in 62s     

2020-02-01 00:40:43 (23.3 MB/s) - ‘fastText/crawl-300d-2M.vec.zip’ saved [1523785255/1523785255]

Archive:  fastText/crawl-300d-2M.vec.zip
  inflating: fastText/crawl-300d-2M.vec  

1.4. Use GPU for Training

Google Colab offers free GPUs and TPUs. Since we’ll be training a large neural network it’s best to utilize these features.

A GPU can be added by going to the menu and selecting:

Runtime -> Change runtime type -> Hardware accelerator: GPU

Then we need to run the following cell to specify the GPU as the device.

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla T4

2. Data Preparation

To prepare our text data for training, first we need to tokenize our sentences and build a vocabulary dictionary word2idx, which will later be used to convert our tokens into indexes and build an embedding layer.

So, what is an embedding layer?

An embedding layer serves as a look-up table which takes words’ indexes in the vocabulary as input and output word vectors. Hence, the embedding layer has shape \((N, d)\) where \(N\) is the size of the vocabulary and \(d\) is the embedding dimension. In order to fine-tune pretrained word vectors, we need to create an embedding layer in our nn.Module class. Our input to the model will then be input_ids, which is tokens’ indexes in the vocabulary.

2.1. Tokenize

The function tokenize will tokenize our sentences, build a vocabulary and find the maximum sentence length. The function encode will take outputs of tokenize as inputs, perform sentence padding and return input_ids as a numpy array.

from nltk.tokenize import word_tokenize
from collections import defaultdict

def tokenize(texts):
    """Tokenize texts, build vocabulary and find maximum sentence length.
    
    Args:
        texts (List[str]): List of text data
    
    Returns:
        tokenized_texts (List[List[str]]): List of list of tokens
        word2idx (Dict): Vocabulary built from the corpus
        max_len (int): Maximum sentence length
    """

    max_len = 0
    tokenized_texts = []
    word2idx = {}

    # Add <pad> and <unk> tokens to the vocabulary
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1

    # Building our vocab from the corpus starting from index 2
    idx = 2
    for sent in texts:
        tokenized_sent = word_tokenize(sent)

        # Add `tokenized_sent` to `tokenized_texts`
        tokenized_texts.append(tokenized_sent)

        # Add new token to `word2idx`
        for token in tokenized_sent:
            if token not in word2idx:
                word2idx[token] = idx
                idx += 1

        # Update `max_len`
        max_len = max(max_len, len(tokenized_sent))

    return tokenized_texts, word2idx, max_len

def encode(tokenized_texts, word2idx, max_len):
    """Pad each sentence to the maximum sentence length and encode tokens to
    their index in the vocabulary.

    Returns:
        input_ids (np.array): Array of token indexes in the vocabulary with
            shape (N, max_len). It will the input of our CNN model.
    """

    input_ids = []
    for tokenized_sent in tokenized_texts:
        # Pad sentences to max_len
        tokenized_sent += ['<pad>'] * (max_len - len(tokenized_sent))

        # Encode tokens to input_ids
        input_id = [word2idx.get(token) for token in tokenized_sent]
        input_ids.append(input_id)
    
    return np.array(input_ids)

2.2. Load Pretrained Vectors

We will load the pretrained vectors for each token in our vocabulary. For tokens with no pretraiend vectors, we will initialize random word vectors with the same dimension and variance.

from tqdm import tqdm_notebook

def load_pretrained_vectors(word2idx, fname):
    """Load pretrained vectors and create embedding layers.
    
    Args:
        word2idx (Dict): Vocabulary built from the corpus
        fname (str): Path to pretrained vector file

    Returns:
        embeddings (np.array): Embedding matrix with shape (N, d) where N is
            the size of word2idx and d is embedding dimension
    """

    print("Loading pretrained vectors...")
    fin = open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())

    # Initilize random embeddings
    embeddings = np.random.uniform(-0.25, 0.25, (len(word2idx), d))
    embeddings[word2idx['<pad>']] = np.zeros((d,))

    # Load pretrained vectors
    count = 0
    for line in tqdm_notebook(fin):
        tokens = line.rstrip().split(' ')
        word = tokens[0]
        if word in word2idx:
            count += 1
            embeddings[word2idx[word]] = np.array(tokens[1:], dtype=np.float32)

    print(f"There are {count} / {len(word2idx)} pretrained vectors found.")

    return embeddings

Now let’s put above steps together.

# Tokenize, build vocabulary, encode tokens
print("Tokenizing...\n")
tokenized_texts, word2idx, max_len = tokenize(texts)
input_ids = encode(tokenized_texts, word2idx, max_len)

# Load pretrained vectors
embeddings = load_pretrained_vectors(word2idx, "fastText/crawl-300d-2M.vec")
embeddings = torch.tensor(embeddings)

Tokenizing...

Loading pretrained vectors...

There are 18526 / 20286 pretrained vectors found.

2.3. Create PyTorch DataLoader

We will create an iterator for our dataset using the torch DataLoader class. This will help save on memory during training and boost the training speed. The batch size used in the paper is 50.

from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,
                              SequentialSampler)

def data_loader(train_inputs, val_inputs, train_labels, val_labels,
                batch_size=50):
    """Convert train and validation sets to torch.Tensors and load them to
    DataLoader.
    """

    # Convert data type to torch.Tensor
    train_inputs, val_inputs, train_labels, val_labels =\
    tuple(torch.tensor(data) for data in
          [train_inputs, val_inputs, train_labels, val_labels])

    # Specify batch_size
    batch_size = 50

    # Create DataLoader for training data
    train_data = TensorDataset(train_inputs, train_labels)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Create DataLoader for validation data
    val_data = TensorDataset(val_inputs, val_labels)
    val_sampler = SequentialSampler(val_data)
    val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

    return train_dataloader, val_dataloader

We will use 90% of the dataset for training and 10% for validation.

from sklearn.model_selection import train_test_split

# Train Test Split
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    input_ids, labels, test_size=0.1, random_state=42)

# Load data to PyTorch DataLoader
train_dataloader, val_dataloader = \
data_loader(train_inputs, val_inputs, train_labels, val_labels, batch_size=50)

3. Model

CNN Architecture

The picture below is the illustration of the CNN architecture that we are going to build with three filter sizes: 2, 3, and 4, each of which has 2 filters.

CNN Architecture (Source: Zhang, 2015)

# Sample configuration:
filter_sizes = [2, 3, 4]
num_filters = [2, 2, 2]

Suppose that we are classifying the sentence “I like this movie very much!” (\(N = 7\) tokens) and the dimensionality of word vectors is \(d=5\). After applying the embedding layer on the input token ids, the sample sentence is presented as a 2D tensor with shape (7, 5) like an image.

\[\mathrm{x_{emb}} \quad \in \mathbb{R}^{7 \times 5}\]

We then use 1-dimesional convolution to extract features from the sentence. In this example, we have 6 filters in total, and each filter has shape \((f_i, d)\) where \(f_i\) is the filter size for \(i \in \{1,...,6\}\). Each filter will then scan over \(\mathrm{x_{emb}}\) and return a feature map:

\[\mathrm{x_{conv_ i} = Conv1D(x_{emb})} \quad \in \mathbb{R}^{N-f_i+1}\]

Next, we apply the ReLU activation to \(\mathrm{x_{conv_{i}}}\) and use max-over-time-pooling to reduce each feature map to a single scalar. Then we concatenate these scalars into a vector which will be fed to a fully connected layer to compute the final scores for our classes (logits).

\[\mathrm{x_{pool_i} = MaxPool(ReLU(x_{conv_i}))} \quad \in \mathbb{R}\] \[\mathrm{x_{fc} = \texttt{concat}(x_{pool_i})} \quad \in \mathbb{R}^6\]

The idea here is that each filter will capture different semantic signals in the sentence (e.g., happiness, humor, politics, anger…) and max-pooling will record only the strongest signal over the sentence. This logic makes sense because humans also perceive the sentiment of a sentence based on its strongest semantic signal.

Finally, we use a fully connected layer with the weight matrix \(\mathbf{W_{fc}} \in \mathbb{R}^{2 \times 6}\) and dropout to compute \(\mathrm{logits}\), which is a vector of length 2 that keeps the scores for our two classes.

\[\mathrm{logits = Dropout(\mathbf{W_{fc}}x_{fc})} \in \mathbb{R}^2\]

An in-depth explanation of CNN can be found in this article and this video.

3.1. Create CNN Model

For simplicity, the model above has very small configurations. The final model will have the same architecture but be much bigger:

Description	Values
input word vectors	fastText
embedding size	300
filter sizes	(3, 4, 5)
num filters	(100, 100, 100)
activation	ReLU
pooling	1-max pooling
dropout rate	0.5

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN_NLP(nn.Module):
    """An 1D Convulational Neural Network for Sentence Classification."""
    def __init__(self,
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=300,
                 filter_sizes=[3, 4, 5],
                 num_filters=[100, 100, 100],
                 num_classes=2,
                 dropout=0.5):
        """
        The constructor for CNN_NLP class.

        Args:
            pretrained_embedding (torch.Tensor): Pretrained embeddings with
                shape (vocab_size, embed_dim)
            freeze_embedding (bool): Set to False to fine-tune pretraiend
                vectors. Default: False
            vocab_size (int): Need to be specified when not pretrained word
                embeddings are not used.
            embed_dim (int): Dimension of word vectors. Need to be specified
                when pretrained word embeddings are not used. Default: 300
            filter_sizes (List[int]): List of filter sizes. Default: [3, 4, 5]
            num_filters (List[int]): List of number of filters, has the same
                length as `filter_sizes`. Default: [100, 100, 100]
            n_classes (int): Number of classes. Default: 2
            dropout (float): Dropout rate. Default: 0.5
        """

        super(CNN_NLP, self).__init__()
        # Embedding layer
        if pretrained_embedding is not None:
            self.vocab_size, self.embed_dim = pretrained_embedding.shape
            self.embedding = nn.Embedding.from_pretrained(pretrained_embedding,
                                                          freeze=freeze_embedding)
        else:
            self.embed_dim = embed_dim
            self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                          embedding_dim=self.embed_dim,
                                          padding_idx=0,
                                          max_norm=5.0)
        # Conv Network
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])
        # Fully-connected layer and Dropout
        self.fc = nn.Linear(np.sum(num_filters), num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input_ids):
        """Perform a forward pass through the network.

        Args:
            input_ids (torch.Tensor): A tensor of token ids with shape
                (batch_size, max_sent_length)

        Returns:
            logits (torch.Tensor): Output logits with shape (batch_size,
                n_classes)
        """

        # Get embeddings from `input_ids`. Output shape: (b, max_len, embed_dim)
        x_embed = self.embedding(input_ids).float()

        # Permute `x_embed` to match input shape requirement of `nn.Conv1d`.
        # Output shape: (b, embed_dim, max_len)
        x_reshaped = x_embed.permute(0, 2, 1)

        # Apply CNN and ReLU. Output shape: (b, num_filters[i], L_out)
        x_conv_list = [F.relu(conv1d(x_reshaped)) for conv1d in self.conv1d_list]

        # Max pooling. Output shape: (b, num_filters[i], 1)
        x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2])
            for x_conv in x_conv_list]
        
        # Concatenate x_pool_list to feed the fully connected layer.
        # Output shape: (b, sum(num_filters))
        x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list],
                         dim=1)
        
        # Compute logits. Output shape: (b, n_classes)
        logits = self.fc(self.dropout(x_fc))

        return logits

3.2. Optimizer

To train Deep Learning models, we need to define a loss function and minimize this loss. We’ll use back-propagation to compute gradients and use an optimization algorithm (ie. Gradient Descent) to minimize the loss. The original paper used the Adadelta optimizer.

import torch.optim as optim

def initilize_model(pretrained_embedding=None,
                    freeze_embedding=False,
                    vocab_size=None,
                    embed_dim=300,
                    filter_sizes=[3, 4, 5],
                    num_filters=[100, 100, 100],
                    num_classes=2,
                    dropout=0.5,
                    learning_rate=0.01):
    """Instantiate a CNN model and an optimizer."""

    assert (len(filter_sizes) == len(num_filters)), "filter_sizes and \
    num_filters need to be of the same length."

    # Instantiate CNN model
    cnn_model = CNN_NLP(pretrained_embedding=pretrained_embedding,
                        freeze_embedding=freeze_embedding,
                        vocab_size=vocab_size,
                        embed_dim=embed_dim,
                        filter_sizes=filter_sizes,
                        num_filters=num_filters,
                        num_classes=2,
                        dropout=0.5)
    
    # Send model to `device` (GPU/CPU)
    cnn_model.to(device)

    # Instantiate Adadelta optimizer
    optimizer = optim.Adadelta(cnn_model.parameters(),
                               lr=learning_rate,
                               rho=0.95)

    return cnn_model, optimizer

3.3. Training Loop

For each epoch, the code below will perform a forward step to compute the Cross Entropy loss, a backward step to compute gradients and use the optimizer to update weights/parameters. At the end of each epoch, the loss on training data and the accuracy over the validation data will be printed to help us keep track of the model’s performance. The code is heavily annotated with detailed explanations.

import random
import time

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility."""

    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, optimizer, train_dataloader, val_dataloader=None, epochs=10):
    """Train the CNN model."""
    
    # Tracking best validation accuracy
    best_accuracy = 0

    # Start training loop
    print("Start training...\n")
    print(f"{'Epoch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {\
    'Val Acc':^9} | {'Elapsed':^9}")
    print("-"*60)

    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================

        # Tracking time and loss
        t0_epoch = time.time()
        total_loss = 0

        # Put the model into the training mode
        model.train()

        for step, batch in enumerate(train_dataloader):
            # Load batch to GPU
            b_input_ids, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Update parameters
            optimizer.step()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        # =======================================
        #               Evaluation
        # =======================================
        if val_dataloader is not None:
            # After the completion of each training epoch, measure the model's
            # performance on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Track the best accuracy
            if val_accuracy > best_accuracy:
                best_accuracy = val_accuracy

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {avg_train_loss:^12.6f} | {\
            val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            
    print("\n")
    print(f"Training complete! Best accuracy: {best_accuracy:.2f}%.")

def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's
    performance on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled
    # during the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

4. Evaluation

In the original paper, the author tried different variations of the model.

CNN-rand: The baseline model where the embedding layer is randomly initialized and then updated during training.
CNN-static: A model with pretrained vectors. However, the embedding layer is freezed during training.
CNN-non-static: Same as above but the embedding layers is fine-tuned during training.

We will experiment with all 3 variations and compare their performance. Below is the report of our results and the original paper’s results.

Model	Kim’s results	Our results
CNN-rand	76.1	74.2
CNN-static	81.0	82.7
CNN-non-static	81.5	84.4

Randomness could cause the difference in the results. I think the reason for the improvement in our results is that we used fastText pretrained vectors, which are of higher quality than word2vec vectors that the author used.

# CNN-rand: Word vectors are randomly initialized.
set_seed(42)
cnn_rand, optimizer = initilize_model(vocab_size=len(word2idx),
                                      embed_dim=300,
                                      learning_rate=0.25,
                                      dropout=0.5)
train(cnn_rand, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
  |   0.682544   |  0.653227  |   62.22   |   1.50   
  |   0.622080   |  0.616504  |   65.22   |   1.41   
  |   0.546976   |  0.574917  |   69.30   |   1.43   
  |   0.473106   |  0.559976  |   69.21   |   1.43   
  |   0.397637   |  0.541240  |   72.47   |   1.44   
  |   0.322112   |  0.530545  |   71.93   |   1.43   
  |   0.258854   |  0.513072  |   72.92   |   1.43   
  |   0.204417   |  0.534012  |   73.74   |   1.43   
  |   0.157654   |  0.533650  |   74.01   |   1.44   
  |   0.129191   |  0.542072  |   74.19   |   1.44   
  |   0.104160   |  0.561548  |   73.56   |   1.45   
  |   0.083750   |  0.560357  |   73.10   |   1.47   
  |   0.067199   |  0.565875  |   73.10   |   1.45   
  |   0.061943   |  0.591892  |   73.83   |   1.44   
  |   0.047678   |  0.615021  |   73.38   |   1.44   
  |   0.043667   |  0.609918  |   73.47   |   1.45   
  |   0.038222   |  0.624876  |   73.74   |   1.43   
  |   0.037270   |  0.636214  |   73.83   |   1.44   
  |   0.032148   |  0.635478  |   73.19   |   1.46   
  |   0.027427   |  0.636196  |   73.56   |   1.42   


Training complete! Best accuracy: 74.19%.

# CNN-static: fastText pretrained word vectors are used and freezed during training.
set_seed(42)
cnn_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                        freeze_embedding=True,
                                        learning_rate=0.25,
                                        dropout=0.5)
train(cnn_static, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
  |   0.587050   |  0.473927  |   76.93   |   0.82   
  |   0.453002   |  0.432967  |   79.39   |   0.71   
  |   0.389261   |  0.417466  |   80.11   |   0.74   
  |   0.345526   |  0.417371  |   80.93   |   0.81   
  |   0.284621   |  0.403670  |   81.47   |   0.83   
  |   0.242149   |  0.406981  |   81.93   |   0.81   
  |   0.190178   |  0.460115  |   79.93   |   0.76   
  |   0.155375   |  0.421258  |   82.20   |   0.84   
  |   0.118369   |  0.436616  |   82.02   |   0.80   
  |   0.095217   |  0.443634  |   81.83   |   0.79   
  |   0.078958   |  0.447452  |   82.11   |   0.76   
  |   0.063665   |  0.504030  |   81.20   |   0.83   
  |   0.047461   |  0.457974  |   82.02   |   0.77   
  |   0.043035   |  0.485016  |   82.11   |   0.70   
  |   0.035299   |  0.479483  |   82.11   |   0.82   
  |   0.028384   |  0.498936  |   82.19   |   0.79   
  |   0.024328   |  0.521321  |   82.37   |   0.76   
  |   0.024897   |  0.511377  |   82.74   |   0.74   
  |   0.019988   |  0.530753  |   81.93   |   0.79   
  |   0.017251   |  0.546499  |   82.20   |   0.85   


Training complete! Best accuracy: 82.74%.

# CNN-non-static: fastText pretrained word vectors are fine-tuned during training.
set_seed(42)
cnn_non_static, optimizer = initilize_model(pretrained_embedding=embeddings,
                                            freeze_embedding=False,
                                            learning_rate=0.25,
                                            dropout=0.5)
train(cnn_non_static, optimizer, train_dataloader, val_dataloader, epochs=20)

Start training...

 Epoch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
------------------------------------------------------------
  |   0.586136   |  0.471964  |   77.21   |   2.08   
  |   0.448910   |  0.428012  |   80.03   |   2.11   
  |   0.381136   |  0.409408  |   81.29   |   2.09   
  |   0.332936   |  0.411652  |   80.75   |   2.10   
  |   0.267999   |  0.397631  |   82.02   |   2.10   
  |   0.223944   |  0.399833  |   81.29   |   2.11   
  |   0.168644   |  0.452024  |   81.29   |   2.10   
  |   0.132921   |  0.442039  |   81.65   |   2.09   
  |   0.097992   |  0.457295  |   81.84   |   2.09   
  |   0.079037   |  0.458124  |   82.38   |   2.09   
  |   0.061001   |  0.459572  |   83.74   |   2.09   
  |   0.047450   |  0.535106  |   81.29   |   2.08   
  |   0.037088   |  0.491504  |   84.37   |   2.10   
  |   0.031085   |  0.503522  |   83.11   |   2.08   
  |   0.025401   |  0.512804  |   84.01   |   2.10   
  |   0.020165   |  0.532516  |   84.19   |   2.11   
  |   0.017053   |  0.545771  |   83.83   |   2.08   
  |   0.017567   |  0.540735  |   84.20   |   2.09   
  |   0.013829   |  0.567102  |   82.47   |   2.09   
  |   0.013072   |  0.594407  |   82.20   |   2.08   


Training complete! Best accuracy: 84.37%.

5. Test Model

Let’s test our CNN-non-static model on some examples.

def predict(text, model=cnn_non_static.to("cpu"), max_len=62):
    """Predict probability that a review is positive."""

    # Tokenize, pad and encode text
    tokens = word_tokenize(text.lower())
    padded_tokens = tokens + ['<pad>'] * (max_len - len(tokens))
    input_id = [word2idx.get(token, word2idx['<unk>']) for token in padded_tokens]

    # Convert to PyTorch tensors
    input_id = torch.tensor(input_id).unsqueeze(dim=0)

    # Compute logits
    logits = model.forward(input_id)

    #  Compute probability
    probs = F.softmax(logits, dim=1).squeeze(dim=0)

    print(f"This review is {probs[1] * 100:.2f}% positive.")

Our model can easily regconize reviews with strong negative signals. On samples that have mixed feelings but positive sentiment overvall, our model also gets excellent results.

predict("All of friends slept while watching this movie. But I really enjoyed it.")
predict("I have waited so long for this movie. I am now so satisfied and happy.")
predict("This movie is long and boring.")
predict("I don't like the ending.")

This review is 61.22% positive.
This review is 94.68% positive.
This review is 0.01% positive.
This review is 4.03% positive.

6. Advice for Practitioners

In A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification (Zhang, 2015), the authors conducted a sensitivity analysis of the above CNN architecture by running it many different sets of hyperparameters. Based on main empirical findings of the research, below are some advice for practioners to choose hyperparameters when applying this architecture for sentence classification tasks:

Input word vectors: Using pretrained word vectors such as word2vec, Glove (or fastText in our implementation) yields much better results than using one-hot vectors or randomly initialized vectors.
Filter region size can have a large effect on performance, and should be tuned. A reasonable range might be 1~10. For example, using filter_size=[7] and num_filters=[400] yields the best result in the MR dataset.
Number of feature maps: try values from 100 to 600 for each filter region size.
Activation funtions: ReLu and tanh are the best candidates.
Pooling: Use 1-max pooling.
Regularization: When increasing number of feature maps, try imposing stronger regularization, e.g. a dropout rate larger than 0.5.

Bonus: Skorch: A Scikit-like Library for PyTorch Modules

If you find the training loop in PyTorch intimidating with a lot of steps and wonder why those steps aren’t wrapped in a function like model.fit() and model.predict() in scikit-learn library. Actually it is something I like in PyTorch. It allows me to manipulate my codes to add extra customizations during training such as clipping gradients and updating learning rates. In addition, because I build my model and training loop block by block, when my model runs into errors, I can navigate the bugs faster. However, when I need to deploy a baseline model quickly, writing an entire training loop is really a burden. It’s when I come to skorch.

skorch is “a scikit-learn compatible neural network library that wraps PyTorch.” There is no need to create DataLoader or write a training/evaluation loop. All you need to do is defining the model and optimizer as in the code below, then a simple net.fit(X, y) is enough.

skorch does not only make it neat and fast to train your Deep Learning models, it also provides powerful support. You can specify callbacks parameters to define early stopping and checkpoint saving. You can also combine skorch model with scikit-learn methods to do cross-validation and hyperparameter tuning with grid-search. Please check out the documentation to explore more powerful functions in this library.

!pip install skorch
from skorch import NeuralNetClassifier
from skorch.helper import predefined_split
from skorch.callbacks import EarlyStopping, Checkpoint, LoadInitState
from skorch.dataset import CVSplit, Dataset

# Specify validation set
val_dataset = Dataset(val_inputs, val_labels)

# Specify callbacks and checkpoints
cp = Checkpoint(monitor='valid_acc_best', dirname='exp1')
callbacks = [
    ('early_stop', EarlyStopping(monitor='valid_acc', patience=5, lower_is_better=False)),
    cp
]

net = NeuralNetClassifier(
    # Module
    module=CNN_NLP,
    module__pretrained_embedding=embeddings,
    module__freeze_embedding=False,
    module__dropout=0.5,
    # Optimizer
    criterion=nn.CrossEntropyLoss,
    optimizer=optim.Adadelta,
    optimizer__lr=0.25,
    optimizer__rho=0.95,
    # Others
    max_epochs=20,
    batch_size=50,
    train_split=predefined_split(val_dataset),
    iterator_train__shuffle=True,
    warm_start=False,
    callbacks=callbacks,
    device=device
)

skorch also prints training results in a very nice table. My training loop in section 3 is inspired by this format. When model (checkpoints) are saved, you can see the + sign in column cp.

set_seed(42)
_ = net.fit(np.array(train_inputs), train_labels)

valid_acc_best = np.max(net.history[:, 'valid_acc'])
print(f"Training complete! Best accuracy: {valid_acc_best * 100:.2f}%")

  epoch    train_loss    valid_acc    valid_loss    cp     dur
-------  ------------  -----------  ------------  ----  ------
      0.5862       0.7741        0.4727     +  2.2838
      0.4481       0.7901        0.4385     +  2.2232
      0.3849       0.7938        0.4369     +  2.2337
      0.3242       0.8285        0.3940     +  2.2340
      0.2787       0.8257        0.3951        2.2225
      0.2156       0.8285        0.3958        2.2006
      0.1714       0.8144        0.4410        2.2059
      0.1336       0.8332        0.4100     +  2.2174
      0.0950       0.8266        0.4295        2.2214
      0.0738       0.8238        0.4489        2.1938
      0.0596       0.8304        0.4705        2.1988
      0.0476       0.8266        0.4769        2.2083
Stopping since valid_acc has not improved in the last 5 epochs.
Training complete! Best accuracy: 83.32%

As Deep Learning model can overfit training data quickly, it’s important to save our model when it fits validation data just right. After training, we can load our model from the last checkpoint to make predictions.

# Load parameters from checkpoint
net.load_params(checkpoint=cp)

predict("All of friends slept while watching this movie. But I really enjoyed it.", model=net)
predict("I have waited so long for this movie. I am now so satisfied and happy.", model=net)
predict("This movie is long and boring.", model=net)
predict("I don't like the ending.", model=net)

This review is 67.25% positive.
This review is 61.38% positive.
This review is 0.12% positive.
This review is 19.14% positive.

Conclusion

Before the rise of huge and complicated models using Transformer architecture, a simple CNN architecture with one layer of convolution can yield excellent performance on sentence classification tasks. The model can take advantages of unsupervised pre-training of word vectors to improve overall performance. Improvements can be made in this architecture by increasing the number of CNN layers or using sub-word model (using BPE tokenizer and fastText pretrained sub-word vectors). Because of its speed, we can use the CNN model as a strong baseline before trying more complicated models such as BERT.

Thank you for staying with me to this point. If interested, you can check out other articles in my NLP tutorial series:

Tutorial: Fine-tuning BERT for Sentiment Analysis

Create a Minimalism GitHub Page for Your Data Science Portfolio in 30 Minutes

2020-01-13T00:00:00-05:00

In the early days of my journey in data science a year ago, I spent most of my time reading articles on Towards Data Science to create my own Data Science roadmap. The opinions are different in the knowledge one needs to acquire to become a Data Scientist and how to get there, but there is one thing in common: at a point in that journey, one should have a portfolio where she can host her Data Science projects.

I created my first portfolio about 6 months after I wrote my first line of Python, when I completed the Python for Data Science and Machine Learning Bootcamp on Udemy, to host simple projects I had done during the course. Since then, building and maintaining my portforlio is one of my favorite things to do. I enjoy organizing my ideas, writing them down, explaining things and make them neat.

Having a portforlio encourages me to seriously document any projects I work on. For job search, I usually bring an iPad with my portfolio opened to career events and interviews so when I share about my projects, I can guide interviewers through my codes and visualizations. It is a very efficient way to build impression and maintain the conversations.

In this tutorial, we will step-by-step learn how to build a simple but powerful GitHub page to host your Data Science projects. The whole process will take about 30 minutes. Let’s get started!

Step 1: Create a GitHub Account

First, we need to sign up a GitHub account at https://github.com/.

After signing up, we will log in and move to Step 2.

Step 2: Create a Repository Named `user-name.github.io`

After all steps in this tutorial are completed, our GitHub page can be accessed at https://user-name.github.io. In this step, we will create a repository named user-name.github.io where user-name is the user name we use to log into GitHub. My user name is ktran3-simon so I will create a repository name ktran3-simon.github.io.

To create a new repository, we click on the + sign at the top right of the screen, next to our profile picture, and select New repository.

We fill the repository name with user-name.github.io, select Public and then click Create repository.

Next, we will upload the theme to this repository. The theme we will use is the Jekyll Minimal theme. This GitHub repository has a more concise version of the theme.

To download the theme, we go to https://github.com/evanca/quick-portfolio, click Clone or download and select Download ZIP.

Now, let’s open our newly created repository, which is still empty. We will click uploading an existing file. After downloading the theme, we unzip the file and upload these files into our repository. After the uploading is complete, we click Commit changes.

Now, by going to user-name.github.io, we can already see our website! In the next step, we will go through some instructions to customize our portfolio.

(Optional) A faster way to complete this step is to simply click the Fork button to copy the entire repository to our GitHub account and then change the repository’s name to user-name.github.io. However I think the above explanation is friendlier for first-time GitHub users.

Step 3: Customize Our Portfolio

Our GitHub page has a two-column layout. On the left is our profile picture and some description, and on the right is the main page where we present our projects. I really like this design because of its simplicity yet efficiency.

To customize the sidebar (the left part), we will edit the file _config.yml in our repository. We can also add Google Analytics ID to track and analyze traffic to our page.

To customize the main page (the right part), where we display our projects, we will need to edit index.md. This file is written in Markdown. If you frequently work with Jupyter Notebook, you must be very familiar with this language. Markdown is very easy to use. Here is a helpful Markdown Cheatsheet that I often refer to.

(Optional) More Customizations

As we customize the sidebar, we will see that we cannot edit the last two lines by editting _config.yml:

To remove them, we need go to the original repository of the Jekyll Minimal theme, and copy the content of default.html in _layouts. Then we create _layouts/default.html in our repository by clicking Creating new file and typing _layouts/default.html, paste the copied content here and commit.

Now we can remove lines 29-31 in _layouts/default.html to remove View My GitHub Profile,

<p class="view"><a href="https://github.com/chriskhanhtran">View My GitHub Profile</a></p>

and line 50 to remove Hosted on GitHub pages - Theme by orderedlist.

<p><small>Hosted on GitHub Pages &mdash; Theme by <a href="https://github.com/orderedlist">orderedlist</a></small></p>

Step 4: Upload Our Projects

After customizing the design of our GitHub page, we can start adding projects to the page by editing index.md. There are several ways to do that, including:

Link to our GitHub repositories,
Link to our Jupyter Notebooks,
Write blog posts in Markdown.

My favorite way to add projects is creating a folder in the repository to save the html files of Jupyter Notebooks, and add the link https://user-name.github.io/folder-name/file-name.html to my main page. Alternatively, we can insert Google Colab links so that viewers can run our projects directly. Google Colab is basically the same as Jupyter Notebook with GPU supported by Google, which helps increase the speed of running our cells, especially in Deep Learning projects. Here is a nice Google Colab tutorial. If you are an R user, you can use R markdown in R Studio to render your projects and export them in the html format.

We can also write blog posts in Markdown and upload them to our repository. You can always refer to the file sample_page.md as an example.

Tips and Tricks

Badges with Shields.io

In official repositories on GitHub, we usually see authors use badges to show the status of their project. For example:

I really like to use these badges to embed links with call for actions, such as:

You can go to https://shields.io/ to create your own badges. Basically, we just need to create links in a specific format and use them as image links.

Format:

https://img.shields.io/badge/label-message-color?logo=logo_name

where:

color could be brightgreen, green, yellowgreen, yellow, orange, red, blue, lightgrey or any HEX, RBG color codes
logo: the list of logo of popular brands and their brand color codes can be found at simple-icons
Visit shields.io to learn more

Examples:

https://img.shields.io/badge/Spotify-My_Musics-1ED760?logo=Spotify will give us:

https://img.shields.io/badge/PyTorch-Run_in_Colab-EE4C2C?logo=PyTorch will give us:

To insert links with badges we created, we only need to type [![](link-to-our-badge)](link-to-our-project).

More Themes

There are several other themes that we can utilize to be more creative with our portfolio. To use them, we can simply Fork the repository to our account and change its name to user-name.github.io.

Cayman: repo - preview
Minimal Mistake: repo - preview. I really like this portfolio where the author uses this theme.

A preview of the Minimal Mistake Theme

Content of Your Portfolio

Last but also the most important thing I want to say in this tutorial, the reason I like a minimalism theme is that it took me minimal time on designing works; thus I can spend more time on the content of my projects. Ultimately, the purpose of building a Data Science portfolio is to present our Data Science project, rather than to show our web-design skills. Below are some articles that I found super helpful when I started building my first portfolio.

How to Build a Data Science Portfolio in Towards Data Science
Data Science Portfolios That Will Get You the Job in Dataquest
Building a Data Science Portfolio That Stands Out in Springboard Blog
5 Data Science Projects That Will Get You Hired in 2018 in Medium

Feel free to visit my portfolio to see how I write my Data Science projects. For example, this is a detailed notebook I wrote after completing a Kaggle competition, in which I went through all important steps of a Data Science project, including Exploratory Data Analysis, Data Cleaning, Feature Engineering, Modeling and Evaluation. Now I still often revisit this notebook to copy the cross-validation codes to reuse. I found I learn the most by reading notebooks on Kaggle and writing my own projects.

I also made some changes in my portfolio compared to the original version, such as making the sidebar narrower and the main page wider. You can fork my repo (https://github.com/chriskhanhtran/minimal-portfolio) and change the codes in the diretory _sass/jekyll-theme-minimal.scss as you like, including changing width, font size or image size. However, be careful when you do so because it might mess your page up. If so, just recover the settings by copying from the theme’s original repo.

change max-width in .wrapper to change the width of the entire page
change max-width in section to change the width of the main page
change width in header to change the width of the side bar

My GitHub Page

Last Words

Having completed your minimalism portfolio, you now can remove or modify these files in your repository:

LICENSE
README.md: you can modify it to the description of your page.
sample_page.md: you can remove or change it to a blog post.
pdf/sample_presentation.pdf

You can also visit the original tutorial with more tips such as:

How to create thumbnails for your project
How to create a round profile picture

For me, I use Photoshop and Powerpoint to create pictures used in my GitHub pages.

Thank you so much for staying with me to this point of my first tutorial. Don’t hesitate to reach out to me if you’ve got any questions. Please connect with me on LinkedIn and share with me your Data Science portfolio.

Fine-tuning BERT for Sentiment Analysis

2019-12-25T00:00:00-05:00

A - Introduction

In recent years the NLP community has seen many breakthoughs in Natural Language Processing, especially the shift to transfer learning. Models like ELMo, fast.ai’s ULMFiT, Transformer and OpenAI’s GPT have allowed researchers to achieves state-of-the-art results on multiple benchmarks and provided the community with large pre-trained models with high performance. This shift in NLP is seen as NLP’s ImageNet moment, a shift in computer vision a few year ago when lower layers of deep learning networks with million of parameters trained on a specific task can be reused and fine-tuned for other tasks, rather than training new networks from scratch.

One of the most biggest milestones in the evolution of NLP recently is the release of Google’s BERT, which is described as the beginning of a new era in NLP. In this notebook I’ll use the HuggingFace’s transformers library to fine-tune pretrained BERT model for a classification task. Then I will compare the BERT’s performance with a baseline model, in which I use a TF-IDF vectorizer and a Naive Bayes classifier. The transformers library help us quickly and efficiently fine-tune the state-of-the-art BERT model and yield an accuracy rate 10% higher than the baseline model.

Reference:

To understand Transformer (the architecture which BERT is built on) and learn how to implement BERT, I highly recommend reading the following sources:

The Illustrated BERT, ELMo, and co.: A very clear and well-written guide to understand BERT.
The documentation of the transformers library
BERT Fine-Tuning Tutorial with PyTorch by Chris McCormick: A very detailed tutorial showing how to use BERT with the HuggingFace PyTorch library.

B - Setup

1. Load Essential Libraries

import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

2. Dataset

2.1. Download Dataset

# Download data
import requests
request = requests.get("https://drive.google.com/uc?export=download&id=1wHt8PsMLsfX5yNSqrt2fSTcb8LEiclcf")
with open("data.zip", "wb") as file:
    file.write(request.content)

# Unzip data
import zipfile
with zipfile.ZipFile('data.zip') as zip:
    zip.extractall('data')

2.2. Load Train Data

The train data has 2 files, each containing 1700 complaining/non-complaining tweets. Every tweets in the data contains at least a hashtag of an airline.

We will load the train data and label it. Because we use only the text data to classify, we will drop unimportant columns and only keep id, tweet and label columns.

 # Load data and set labels
data_complaint = pd.read_csv('data/complaint1700.csv')
data_complaint['label'] = 0
data_non_complaint = pd.read_csv('data/noncomplaint1700.csv')
data_non_complaint['label'] = 1

# Concatenate complaining and non-complaining data
data = pd.concat([data_complaint, data_non_complaint], axis=0).reset_index(drop=True)

# Drop 'airline' column
data.drop(['airline'], inplace=True, axis=1)

# Display 5 random samples
data.sample(5)

	id	tweet	label
1722	2478	Thank you @MichaelRRoy . I am so glad that yo...	1
1653	52356	Seriously .@united GET YOUR SHIT TOGETHER	0
930	128102	@SouthwestAir - Yet another delayed flight. Wh...	0
1975	24242	@AmericanAir yea already did that. They were m...	1
3053	133225	@DeltaAssist i lost my tickets information an...	1

We will randomly split the entire training data into two sets: a train set with 90% of the data and a validation set with 10% of the data. We will perform hyperparameter tuning using cross-validation on the train set and use the validation set to compare models.

from sklearn.model_selection import train_test_split

X = data.tweet.values
y = data.label.values

X_train, X_val, y_train, y_val =\
    train_test_split(X, y, test_size=0.1, random_state=2020)

2.3. Load Test Data

The test data contains 4555 examples with no label. About 300 examples are non-complaining tweets. Our task is to identify their id and examine manually whether our results are correct.

# Load test data
test_data = pd.read_csv('data/test_data.csv')

# Keep important columns
test_data = test_data[['id', 'tweet']]

# Display 5 samples from the test data
test_data.sample(5)

	id	tweet
3353	126992	Lol! Worst reason for a flight delay ever! @us...
70	2461	@JetBlue you suck. You never emailed tickets ...
3551	134150	@AmericanAir We're stuck at KDFW headed for KI...
3200	120820	@AmericanAir hey guys, flight 202 to Boston he...
3546	134021	Been waiting on my lost bags for quite some ti...

3. Set up GPU for training

Google Colab offers free GPUs and TPUs. Since we’ll be training a large neural network it’s best to utilize these features.

A GPU can be added by going to the menu and selecting:

Runtime -> Change runtime type -> Hardware accelerator: GPU

Then we need to run the following cell to specify the GPU as the device.

import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: Tesla P100-PCIE-16GB

C - Baseline: TF-IDF + Naive Bayes Classifier

In this baseline approach, first we will use TF-IDF to vectorize our text data. Then we will use the Naive Bayes model as our classifier.

Why Naive Bayse? I have experiemented different machine learning algorithms including Random Forest, Support Vectors Machine, XGBoost and observed that Naive Bayes yields the best performance. In Scikit-learn’s guide to choose the right estimator, it is also suggested that Naive Bayes should be used for text data. I also tried using SVD to reduce dimensionality; however, it did not yield a better performance.

1. Data Preparation

1.1. Preprocessing

In the bag-of-words model, a text is represented as the bag of its words, disregarding grammar and word order. Therefore, we will want to remove stop words, punctuations and characters that don’t contribute much to the sentence’s meaning.

import nltk
# Uncomment to download "stopwords"
# nltk.download("stopwords")
from nltk.corpus import stopwords

def text_preprocessing(s):
    """
    - Lowercase the sentence
    - Change "'t" to "not"
    - Remove "@name"
    - Isolate and remove punctuations except "?"
    - Remove other special characters
    - Remove stop words except "not" and "can"
    - Remove trailing whitespace
    """
    s = s.lower()
    # Change 't to 'not'
    s = re.sub(r"\'t", " not", s)
    # Remove @name
    s = re.sub(r'(@.*?)[\s]', ' ', s)
    # Isolate and remove punctuations except '?'
    s = re.sub(r'([\'\"\.\(\)\!\?\\\/\,])', r' \1 ', s)
    s = re.sub(r'[^\w\s\?]', ' ', s)
    # Remove some special characters
    s = re.sub(r'([\;\:\|•«\n])', ' ', s)
    # Remove stopwords except 'not' and 'can'
    s = " ".join([word for word in s.split()
                  if word not in stopwords.words('english')
                  or word in ['not', 'can']])
    # Remove trailing whitespace
    s = re.sub(r'\s+', ' ', s).strip()
    
    return s

1.2. TF-IDF Vectorizer

In information retrieval, TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. We will use TF-IDF to vectorize our text data before feeding them to machine learning algorithms.

%%time
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocess text
X_train_preprocessed = np.array([text_preprocessing(text) for text in X_train])
X_val_preprocessed = np.array([text_preprocessing(text) for text in X_val])

# Calculate TF-IDF
tf_idf = TfidfVectorizer(smooth_idf=False)
X_train_tfidf = tf_idf.fit_transform(X_train_preprocessed)
X_val_tfidf = tf_idf.transform(X_val_preprocessed)

CPU times: user 5.31 s, sys: 490 ms, total: 5.8 s
Wall time: 5.83 s

2. Train Naive Bayes Classifier

2.1. Hyperparameter Tuning

We will use cross-validation and AUC score to tune hyperparameters of our model. The function get_auc_CV will return the average AUC score from cross-validation.

from sklearn.model_selection import StratifiedKFold, cross_val_score

def get_auc_CV(model):
    """
    Return the average AUC score from cross-validation.
    """
    # Set KFold to shuffle data before the split
    kf = StratifiedKFold(5, shuffle=True, random_state=1)

    # Get AUC scores
    auc = cross_val_score(
        model, X_train_tfidf, y_train, scoring="roc_auc", cv=kf)

    return auc.mean()

The MultinominalNB class only have one hypterparameter - alpha. The code below will help us find the alpha value that gives us the highest CV AUC score.

from sklearn.naive_bayes import MultinomialNB

res = pd.Series([get_auc_CV(MultinomialNB(i))
                 for i in np.arange(1, 10, 0.1)],
                index=np.arange(1, 10, 0.1))

best_alpha = np.round(res.idxmax(), 2)
print('Best alpha: ', best_alpha)

plt.plot(res)
plt.title('AUC vs. Alpha')
plt.xlabel('Alpha')
plt.ylabel('AUC')
plt.show()

Best alpha:  1.8

2.2. Evaluation on Validation Set

To evaluate the performance of our model, we will calculate the accuracy rate and the AUC score of our model on the validation set.

from sklearn.metrics import accuracy_score, roc_curve, auc

def evaluate_roc(probs, y_true):
    """
    - Print AUC and accuracy on the test set
    - Plot ROC
    @params    probs (np.array): an array of predicted probabilities with shape (len(y_true), 2)
    @params    y_true (np.array): an array of the true values with shape (len(y_true),)
    """
    preds = probs[:, 1]
    fpr, tpr, threshold = roc_curve(y_true, preds)
    roc_auc = auc(fpr, tpr)
    print(f'AUC: {roc_auc:.4f}')
       
    # Get accuracy over the test set
    y_pred = np.where(preds >= 0.5, 1, 0)
    accuracy = accuracy_score(y_true, y_pred)
    print(f'Accuracy: {accuracy*100:.2f}%')
    
    # Plot ROC AUC
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

By combining TF-IDF and the Naive Bayes algorithm, we achieve the accuracy rate of 72.65% on the validation set. This value is the baseline performance and will be used to evaluate the performance of our fine-tune BERT model.

# Compute predicted probabilities
nb_model = MultinomialNB(alpha=1.8)
nb_model.fit(X_train_tfidf, y_train)
probs = nb_model.predict_proba(X_val_tfidf)

# Evaluate the classifier
evaluate_roc(probs, y_val)

AUC: 0.8269
Accuracy: 72.65%

D - Fine-tuning BERT

1. Install the Hugging Face Library

The transformer library of Hugging Face contains PyTorch implementation of state-of-the-art NLP models including BERT (from Google), GPT (from OpenAI) … and pre-trained model weights.

# Uncomment the line below to install `transformers`
# !pip install transformers

2. Tokenization and Input Formatting

Before tokenizing our text, we will perform some slight processing on our text including removing entity mentions (eg. @united) and some special character. The level of processing here is much less than in previous approachs because BERT was trained with the entire sentences.

def text_preprocessing(text):
    """
    - Remove entity mentions (eg. '@united')
    - Correct errors (eg. '&amp;' to '&')
    @param    text (str): a string to be processed.
    @return   text (Str): the processed string.
    """
    # Remove '@name'
    text = re.sub(r'(@.*?)[\s]', ' ', text)

    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)

    # Remove trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Print sentence 0
print('Original: ', X[0])
print('Processed: ', text_preprocessing(X[0]))

Original:  @united I'm having issues. Yesterday I rebooked for 24 hours after I was supposed to fly, now I can't log on &amp; check in. Can you help?
Processed:  I'm having issues. Yesterday I rebooked for 24 hours after I was supposed to fly, now I can't log on & check in. Can you help?

2.1. BERT Tokenizer

In order to apply the pre-trained BERT, we must use the tokenizer provided by the library. This is because (1) the model has a specific, fixed vocabulary and (2) the BERT tokenizer has a particular way of handling out-of-vocabulary words.

In addition, we are required to add special tokens to the start and end of each sentence, pad & truncate all sentences to a single constant length, and explicitly specify what are padding tokens with the “attention mask”.

The encode_plus method of BERT tokenizer will:

(1) split our text into tokens,

(2) add the special [CLS] and [SEP] tokens, and

(3) convert these tokens into indexes of the tokenizer vocabulary,

(4) pad or truncate sentences to max length, and

(5) create attention mask.

from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Create a function to tokenize a set of texts
def preprocessing_for_bert(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in data:
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode_plus(
            text=text_preprocessing(sent),  # Preprocess sentence
            add_special_tokens=True,        # Add `[CLS]` and `[SEP]`
            max_length=MAX_LEN,                  # Max length to truncate/pad
            pad_to_max_length=True,         # Pad sentence to max length
            #return_tensors='pt',           # Return PyTorch tensor
            return_attention_mask=True      # Return attention mask
            )
        
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks

The default version of TensorFlow in Colab will soon switch to TensorFlow 2.x.
We recommend you upgrade now or ensure your notebook will continue to use TensorFlow 1.x via the %tensorflow_version 1.x magic: more info.

Before tokenizing, we need to specify the maximum length of our sentences.

# Concatenate train data and test data
all_tweets = np.concatenate([data.tweet.values, test_data.tweet.values])

# Encode our concatenated data
encoded_tweets = [tokenizer.encode(sent, add_special_tokens=True) for sent in all_tweets]

# Find the maximum length
max_len = max([len(sent) for sent in encoded_tweets])
print('Max length: ', max_len)

Max length:  68

Now let’s tokenize our data.

# Specify `MAX_LEN`
MAX_LEN = 64

# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X[0]])[0].squeeze().numpy())
print('Original: ', X[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)

Original:  @united I'm having issues. Yesterday I rebooked for 24 hours after I was supposed to fly, now I can't log on &amp; check in. Can you help?
Token IDs:  [101, 1045, 1005, 1049, 2383, 3314, 1012, 7483, 1045, 2128, 8654, 2098, 2005, 2484, 2847, 2044, 1045, 2001, 4011, 2000, 4875, 1010, 2085, 1045, 2064, 1005, 1056, 8833, 2006, 1004, 4638, 1999, 1012, 2064, 2017, 2393, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Tokenizing data...

2.2. Create PyTorch DataLoader

We will create an iterator for our dataset using the torch DataLoader class. This will help save on memory during training and boost the training speed.

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train)
val_labels = torch.tensor(y_val)

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 32

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)

3. Train Our Model

3.1. Create BertClassifier

BERT-base consists of 12 transformer layers, each transformer layer takes in a list of token embeddings, and produces the same number of embeddings with the same hidden size (or dimensions) on the output. The output of the final transformer layer of the [CLS] token is used as the features of the sequence to feed a classifier.

The transformers library has the BertForSequenceClassification class which is designed for classification tasks. However, we will create a new class so we can specify our own choice of classifiers.

Below we will create a BertClassifier class with a BERT model to extract the last hidden layer of the [CLS] token and a single-hidden-layer feed-forward neural network as our classifier.

%%time
import torch
import torch.nn as nn
from transformers import BertModel

# Create the BertClassfier class
class BertClassifier(nn.Module):
    """Bert Model for Classification Tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param    bert: a BertModel object
        @param    classifier: a torch.nn.Module classifier
        @param    freeze_bert (bool): Set `False` to fine-tune the BERT model
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 50, 2

        # Instantiate BERT model
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Instantiate an one-layer feed-forward classifier
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            #nn.Dropout(0.5),
            nn.Linear(H, D_out)
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits

CPU times: user 0 ns, sys: 46 µs, total: 46 µs
Wall time: 49.4 µs

3.2. Optimizer & Learning Rate Scheduler

To fine-tune our Bert Classifier, we need to create an optimizer. The authors recommend following hyper-parameters:

Batch size: 16 or 32
Learning rate (Adam): 5e-5, 3e-5 or 2e-5
Number of epochs: 2, 3, 4

Huggingface provided the run_glue.py script, an examples of implementing the transformers library. In the script, the AdamW optimizer is used.

from transformers import AdamW, get_linear_schedule_with_warmup

def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)

    # Tell PyTorch to run the model on GPU
    bert_classifier.to(device)

    # Create the optimizer
    optimizer = AdamW(bert_classifier.parameters(),
                      lr=5e-5,    # Default learning rate
                      eps=1e-8    # Default epsilon value
                      )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # Default value
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler

3.3. Training Loop

We will train our Bert Classifier for 4 epochs. In each epoch, we will train our model and evaluate its performance on the validation set. In more details, we will:

Training:

Unpack our data from the dataloader and load the data onto the GPU
Zero out gradients calculated in the previous pass
Perform a forward pass to compute logits and loss
Perform a backward pass to compute gradients (loss.backward())
Clip the norm of the gradients to 1.0 to prevent “exploding gradients”
Update the model’s parameters (optimizer.step())
Update the learning rate (scheduler.step())

Evaluation:

Unpack our data and load onto the GPU
Forward pass
Compute loss and accuracy rate over the validation set

The script below is commented with the details of our training and evaluation loop.

import random
import time

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
    """Train the BertClassifier model.
    """
    # Start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed for every 20 batches
            if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed for 20 batches
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation == True:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")
    
    print("Training complete!")


def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

Now, let’s start training our BertClassifier!

set_seed(42)    # Set seed for reproducibility
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |   20    |   0.653637   |     -      |     -     |   4.99   
   1    |   40    |   0.517290   |     -      |     -     |   4.73   
   1    |   60    |   0.502695   |     -      |     -     |   4.68   
   1    |   80    |   0.495539   |     -      |     -     |   4.68   
   1    |   95    |   0.490748   |     -      |     -     |   3.44   
----------------------------------------------------------------------
   1    |    -    |   0.535397   |  0.466385  |   79.09   |   23.22  
----------------------------------------------------------------------


 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   2    |   20    |   0.336384   |     -      |     -     |   4.94   
   2    |   40    |   0.277895   |     -      |     -     |   4.76   
   2    |   60    |   0.314162   |     -      |     -     |   4.74   
   2    |   80    |   0.307749   |     -      |     -     |   4.71   
   2    |   95    |   0.307835   |     -      |     -     |   3.44   
----------------------------------------------------------------------
   2    |    -    |   0.309143   |  0.440339  |   82.56   |   23.30  
----------------------------------------------------------------------


Training complete!

3.4. Evaluation on Validation Set

# Compute predicted probabilities on the test set
# Please initialize function `bert_predict` by running the first cell in Section 4.2.
probs = bert_predict(bert_classifier, val_dataloader)

# Evaluate the Bert classifier
evaluate_roc(probs, y_val)

AUC: 0.9006
Accuracy: 82.65%

The Bert Classifer achieves 0.90 AUC score and 82.65% accuracy rate on the validation set. This result is 10 points better than the baseline method.

3.5. Train Our Model on the Entire Training Data

# Concatenate the train set and the validation set
full_train_data = torch.utils.data.ConcatDataset([train_data, val_data])
full_train_sampler = RandomSampler(full_train_data)
full_train_dataloader = DataLoader(full_train_data, sampler=full_train_sampler, batch_size=32)

# Train the Bert Classifier on the entire training data
set_seed(42)
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, full_train_dataloader, epochs=2)

Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |   20    |   0.638744   |     -      |     -     |   4.90   
   1    |   40    |   0.545325   |     -      |     -     |   4.68   
   1    |   60    |   0.487211   |     -      |     -     |   4.66   
   1    |   80    |   0.516911   |     -      |     -     |   4.66   
   1    |   100   |   0.413083   |     -      |     -     |   4.65   
   1    |   106   |   0.359597   |     -      |     -     |   1.27   
----------------------------------------------------------------------


 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   2    |   20    |   0.286960   |     -      |     -     |   4.89   
   2    |   40    |   0.269116   |     -      |     -     |   4.65   
   2    |   60    |   0.235394   |     -      |     -     |   4.67   
   2    |   80    |   0.280183   |     -      |     -     |   4.65   
   2    |   100   |   0.299446   |     -      |     -     |   4.64   
   2    |   106   |   0.292475   |     -      |     -     |   1.25   
----------------------------------------------------------------------


Training complete!

4. Predictions on Test Set

4.1. Data Preparation

Let’s revisit out test set shortly.

test_data.sample(5)

	id	tweet
1921	74037	No Wi-Fi on a plane nowadays is just unaccepta...
133	5103	@AmericanAir how is it that 2 passengers miss ...
1296	50793	Arbitration board issues decision on joint con...
1771	68130	@AngieTheo14 @ToniVeltri @AmericanAir @JetBlue...
21	620	.@richardbranson .@rmchrQB .@VirginAmerica Air...

Before making predictions on the test set, we need to redo processing and encoding steps done on the training data. Fortunately, we have written the preprocessing_for_bert function to do that for us.

# Run `preprocessing_for_bert` on the test set
print('Tokenizing data...')
test_inputs, test_masks = preprocessing_for_bert(test_data.tweet)

# Create the DataLoader for our test set
test_dataset = TensorDataset(test_inputs, test_masks)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=32)

Tokenizing data...

4.2. Predictions

The prediction step is similar to the evaluation step that we did in the training loop, but simpler. We will perform a forward pass to compute logits and apply softmax to calculate probabilities.

import torch.nn.functional as F

def bert_predict(model, test_dataloader):
    """Perform a forward pass on the trained BERT model to predict probabilities
    on the test set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    all_logits = []

    # For each batch in our test set...
    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
        all_logits.append(logits)
    
    # Concatenate logits from each batch
    all_logits = torch.cat(all_logits, dim=0)

    # Apply softmax to calculate probabilities
    probs = F.softmax(all_logits, dim=1).cpu().numpy()

    return probs

There are about 300 non-negative tweets in our test set. Therefore, we will keep adjusting the decision threshold until we have about 300 non-negative tweets.

The threshold we will use is 0.992, meaning that tweets with a predicted probability greater than 99.2% will be predicted positive. This value is very high compared to the default 0.5 threshold.

After manually examining the test set, I realize that the sentiment classification task here is even difficult for human. Therefore, a high threshold will give us safe predictions.

# Compute predicted probabilities on the test set
probs = bert_predict(bert_classifier, test_dataloader)

# Get predictions from the probabilities
threshold = 0.992
preds = np.where(probs[:, 1] > threshold, 1, 0)

# Number of tweets predicted non-negative
print("Number of tweets predicted non-negative: ", preds.sum())

Number of tweets predicted non-negative:  298

Now we will examine 20 random tweets from our predictions. 17 of them are correct, the results showing that the BERT Classifier acquires about 0.85 precision rate.

output = test_data[preds==1]
list(output.sample(20).tweet)

["@iChrisLehman @SouthwestAir Aw poor little one. I'm sure parents will make it better. Kisses and hugs and lots of encouragement!",
 'Props to @JetBlue for doing this for the two police officers killed in New York City. #PoliceLivesMatter http://t.co/U3Ff6pXzdC',
 "@SilverJames_ I say do whatever gets you there :-) Can't wait to see you! @SouthwestAir",
 '@united not angry disappointed two trips in a row. May cancel my next one and try delta',
 '@drey38 @united AND you are missing the holiday bowl! #unitedhatesamericans',
 "@JetBlue now just another airline. :/. Can't wait for @SouthwestAir to provide direct/nonstop from BOS to SJU.",
 "Ahhhh @JetBlue I've missed you!!! Hopefully making it back to NY tonight ",
 "On the bright side, I don't have to pay to check my bag to Portland tomorrow because @united lost it!",
 '@JetBlue Thank you so much for making my experience as a person w/health issues so pleasant!',
 "@JetBlue has NBCSports on this flight which means I won't miss the @NYRangers game!!! #YesYesYes",
 'Wow @JetBlue flt 1199 to #Orlando, entire TV system down. Plane full of kids. Thanks - real value add service. #fail',
 "Can't wait to see what happens next.@NYNYVegas is taking over @SouthwestAir today with #spreadtheluck flights. #Vegas #PR",
 'check out @americanair  in the " in case you get lost " section of our #website #travel #airlines http://t.co/X2WMcoQdgt',
 'Everyone is gonna hate @AmericanAir if Jerome gets arrested  #AmericanAirlinesCHILLOUT',
 'Waiting to board a @SouthwestAir plane to #Pittsburgh. I feel kinda sa being #55 in my group. Feels like being picked last in HS gym class.',
 "I can't get over how much different it is flying @VirginAmerica than any other airline! I Love it! I can't wait to be home (for a week) _",
 'Wait up @AmericanAir',
 '@richeisen @united Rich, fly @AmericanAir zero problems &amp; I fly weekly! #firstworldproblems',
 '@AlaskaAir what happens if I miss my connection in Seattle #alaska687',
 '@adamrides After @VirginAmerica they are my top choice. Welcome to Va. Thanks for bringing the bad weather, again.']

E - Conclusion

By adding a simple one-hidden-layer neural network classifier on top of BERT and fine-tuning BERT, we can achieve near state-of-the-art performance, which is 10 points better than the baseline method although we only have 3,400 data points.

In addition, although BERT is very large, complicated, and have millions of parameters, we only need to fine-tune it in only 2-4 epochs. That result can be achieved because BERT was trained on the huge amount and already encode a lot of information about our language. An impresive performance achieved in a short amount of time, with a small amount of data has shown why BERT is one of the most powerful NLP models available at the moment.

Chris Tran

Pre-train ELECTRA for Spanish from Scratch

1. Introduction

2. Method

3. Pre-train ELECTRA

Setup

Data

Build Pretraining Dataset

Start Training

4. Convert Tensorflow checkpoints to PyTorch format

5. Conclusion

References

Extractive Summarization with BERT

1. Introduction

2. Extractive Summarization

BERT Encoder

Summarization Classifier

3. Make Summarization Even Faster

4. Let’s Summarize

5. Conclusion

References

Visual Recognition for Vietnamese Foods

Looking at the data

Training: ResNet-50

Stage 1: Finetune the top layers only

Stage 2: Unfreeze and finetune the entire networks

Evaluation

Production

Conclusion

Reference

Named Entity Recognition with Transformers

Introduction

Setup

Data

1. Download Datasets

2. Preprocessing

3. Labels

Fine-tuning Model

Pipeline

Conclusion

SpanBERTa: Pre-train RoBERTa Language Model for Spanish from Scratch

Introduction

Setup

1. Install Dependencies

2. Data

3. Train a Tokenizer

Traing Language Model from Scratch

1. Model Architecture

2. Training Hyperparameters

3. Start Training

4. Predict Masked Words

Conclusion

A Complete Guide to CNN for Sentence Classification with PyTorch

1. Setup

1.1. Import Libraries

1.2. Download Datasets

1.3. Download fastText Word Vectors

1.4. Use GPU for Training

2. Data Preparation

2.1. Tokenize

2.2. Load Pretrained Vectors

2.3. Create PyTorch DataLoader

3. Model

3.1. Create CNN Model

3.2. Optimizer

3.3. Training Loop

4. Evaluation

5. Test Model

6. Advice for Practitioners

Bonus: Skorch: A Scikit-like Library for PyTorch Modules

Conclusion

Create a Minimalism GitHub Page for Your Data Science Portfolio in 30 Minutes

Step 1: Create a GitHub Account

Step 2: Create a Repository Named user-name.github.io

Step 3: Customize Our Portfolio

Step 4: Upload Our Projects

Tips and Tricks

Badges with Shields.io

More Themes

Content of Your Portfolio

Step 2: Create a Repository Named `user-name.github.io`