<h1 id="pre-train-electra-for-spanish-from-scratch">Pre-train ELECTRA for Spanish from Scratch</h1>
<p><a href="https://colab.research.google.com/drive/1DiOwhRjQbtYRgFWG7e3dybcXJsZcu86l#scrollTo=YIHC6Pg66zHg"><img src="https://img.shields.io/badge/Colab-Run_in_Google_Colab-blue?logo=Google&logoColor=FDBA18" alt="Run in Google Colab" /></a></p>
<h2 id="1-introduction">1. Introduction</h2>
<p>At ICLR 2020, <a href="https://openreview.net/pdf?id=r1xMH1BtvB">ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators</a>, a new method for self-supervised language representation learning, was introduced. ELECTRA is another member of the Transformer pre-training family, whose earlier members, such as BERT, GPT-2, and RoBERTa, have achieved many state-of-the-art results on Natural Language Processing benchmarks.</p>
<p>Different from other masked language modeling methods, ELECTRA uses a more sample-efficient pre-training task called replaced token detection. At a small scale, ELECTRA-small can be trained on a single GPU for 4 days to outperform <a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT (Radford et al., 2018)</a> (trained using 30x more compute) on the GLUE benchmark. At a large scale, ELECTRA-large outperforms <a href="https://arxiv.org/abs/1909.11942">ALBERT (Lan et al., 2019)</a> on GLUE and sets a new state-of-the-art for SQuAD 2.0.</p>
<p class="text-center"><img src="https://github.com/chriskhanhtran/spanish-bert/blob/master/img/electra-performance.JPG?raw=true" alt="" />
<em>ELECTRA consistently outperforms masked language model pre-training approaches.</em></p>
<h2 id="2-method">2. Method</h2>
<p>Masked language modeling pre-training methods such as <a href="https://arxiv.org/abs/1810.04805">BERT (Devlin et al., 2019)</a> corrupt the input by replacing some tokens (typically 15% of the input) with <code class="language-plaintext highlighter-rouge">[MASK]</code> and then train a model to reconstruct the original tokens.</p>
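<p>As a toy illustration of this corruption step (a sketch of ours, not BERT’s actual implementation, which masks at the subword level and sometimes keeps or randomly replaces tokens instead):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

# Toy illustration: mask roughly 15% of the tokens, BERT-style.
random.seed(0)
tokens = "los pájaros están cantando en el árbol".split()
corrupted = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
print(corrupted)
</code></pre></div></div>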
<p>Instead of masking, ELECTRA corrupts the input by replacing some tokens with samples from the outputs of a small masked language model. Then, a discriminative model is trained to predict whether each token was an original or a replacement. After pre-training, the generator is thrown out and the discriminator is fine-tuned on downstream tasks.</p>
<p class="text-center"><img src="https://github.com/chriskhanhtran/spanish-bert/blob/master/img/electra-overview.JPG?raw=true" alt="" />
<em>An overview of ELECTRA.</em></p>
<p>Although it has a generator and a discriminator like a GAN, ELECTRA is not adversarial: the generator producing corrupted tokens is trained with maximum likelihood rather than being trained to fool the discriminator.</p>
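<p>The combined training objective can be sketched as follows (a minimal sketch assuming PyTorch-style tensors, not the authors’ implementation; in the paper the discriminator loss is weighted with a coefficient of 50):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

def electra_loss(gen_logits, masked_targets, disc_logits, is_replaced, disc_weight=50.0):
    """Sketch of ELECTRA's joint objective.

    gen_logits:     (num_masked, vocab_size) generator predictions at masked positions
    masked_targets: (num_masked,) original token ids at those positions
    disc_logits:    (seq_len,) discriminator real/replaced scores for EVERY token
    is_replaced:    (seq_len,) 1 if the generator replaced the token, else 0
    """
    # The generator is trained with maximum likelihood, not adversarially.
    gen_loss = F.cross_entropy(gen_logits, masked_targets)
    # The discriminator is trained on all input tokens, not just the 15% masked ones.
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced.float())
    return gen_loss + disc_weight * disc_loss
</code></pre></div></div>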
<p><strong>Why is ELECTRA so efficient?</strong></p>
<p>With a new training objective, ELECTRA can achieve performance comparable to strong models such as <a href="https://arxiv.org/abs/1907.11692">RoBERTa (Liu et al., 2019)</a>, which has more parameters and needs 4x more compute for training. In the paper, an analysis was conducted to understand what really contributes to ELECTRA’s efficiency. The key findings are:</p>
<ul>
<li>ELECTRA benefits greatly from having a loss defined over all input tokens rather than just a subset. More specifically, ELECTRA’s discriminator makes a prediction for every token in the input, while BERT predicts only the 15% of tokens that are masked.</li>
<li>BERT’s performance is slightly harmed by a pre-train/fine-tune mismatch: the model sees <code class="language-plaintext highlighter-rouge">[MASK]</code> tokens during pre-training but never during fine-tuning.</li>
</ul>
<p class="text-center"><img src="https://github.com/chriskhanhtran/spanish-bert/blob/master/img/electra-vs-bert.JPG?raw=true" alt="" />
<em>ELECTRA vs. BERT</em></p>
<h2 id="3-pre-train-electra">3. Pre-train ELECTRA</h2>
<p>In this section, we will train ELECTRA from scratch with TensorFlow using scripts provided by ELECTRA’s authors in <a href="https://github.com/google-research/electra">google-research/electra</a>. Then we will convert the model to a PyTorch checkpoint, which can be easily fine-tuned on downstream tasks using Hugging Face’s <code class="language-plaintext highlighter-rouge">transformers</code> library.</p>
<h3 id="setup">Setup</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">tensorflow</span><span class="o">==</span><span class="mf">1.15</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">transformers</span><span class="o">==</span><span class="mf">2.8</span><span class="p">.</span><span class="mi">0</span>
<span class="err">!</span><span class="n">git</span> <span class="n">clone</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">google</span><span class="o">-</span><span class="n">research</span><span class="o">/</span><span class="n">electra</span><span class="p">.</span><span class="n">git</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>
</code></pre></div></div>
<h3 id="data">Data</h3>
<p>We will pre-train ELECTRA on a Spanish movie subtitle dataset retrieved from OpenSubtitles. This dataset is 5.4 GB in size, and for demonstration purposes we will train on a small subset of ~30 MB.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">DATA_DIR</span> <span class="o">=</span> <span class="s">"./data"</span> <span class="c1">#@param {type: "string"}
</span><span class="n">TRAIN_SIZE</span> <span class="o">=</span> <span class="mi">1000000</span> <span class="c1">#@param {type:"integer"}
</span><span class="n">MODEL_NAME</span> <span class="o">=</span> <span class="s">"electra-spanish"</span> <span class="c1">#@param {type: "string"}
</span></code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Download and unzip the Spanish movie substitle dataset
</span><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">DATA_DIR</span><span class="p">):</span>
<span class="err">!</span><span class="n">mkdir</span> <span class="o">-</span><span class="n">p</span> <span class="err">$</span><span class="n">DATA_DIR</span>
<span class="err">!</span><span class="n">wget</span> <span class="s">"https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz"</span> <span class="o">-</span><span class="n">O</span> <span class="err">$</span><span class="n">DATA_DIR</span><span class="o">/</span><span class="n">OpenSubtitles</span><span class="p">.</span><span class="n">txt</span><span class="p">.</span><span class="n">gz</span>
<span class="err">!</span><span class="n">gzip</span> <span class="o">-</span><span class="n">d</span> <span class="err">$</span><span class="n">DATA_DIR</span><span class="o">/</span><span class="n">OpenSubtitles</span><span class="p">.</span><span class="n">txt</span><span class="p">.</span><span class="n">gz</span>
<span class="err">!</span><span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="err">$</span><span class="n">TRAIN_SIZE</span> <span class="err">$</span><span class="n">DATA_DIR</span><span class="o">/</span><span class="n">OpenSubtitles</span><span class="p">.</span><span class="n">txt</span> <span class="o">></span> <span class="err">$</span><span class="n">DATA_DIR</span><span class="o">/</span><span class="n">train_data</span><span class="p">.</span><span class="n">txt</span>
<span class="err">!</span><span class="n">rm</span> <span class="err">$</span><span class="n">DATA_DIR</span><span class="o">/</span><span class="n">OpenSubtitles</span><span class="p">.</span><span class="n">txt</span>
</code></pre></div></div>
<p>Before building the pre-training dataset, we should make sure the corpus has the following format (the short sketch after this list illustrates it):</p>
<ul>
<li>each line is a sentence</li>
<li>a blank line separates two documents</li>
</ul>
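<p>Below is a minimal, illustrative helper for converting raw documents into this format (the naive sentence splitter is our assumption; use a proper sentence tokenizer in practice):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative only: one sentence per line, a blank line between documents.
def to_pretraining_format(documents, output_path):
    with open(output_path, "w") as f:
        for doc in documents:
            for sent in doc.replace("? ", "?\n").replace(". ", ".\n").split("\n"):
                if sent.strip():
                    f.write(sent.strip() + "\n")
            f.write("\n")  # a blank line separates two documents

to_pretraining_format(["Hola. ¿Cómo estás? Bien.", "Adiós."], "example.txt")
</code></pre></div></div>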
<h3 id="build-pretraining-dataset">Build Pretraining Dataset</h3>
<p>We will use the tokenizer of <code class="language-plaintext highlighter-rouge">bert-base-multilingual-cased</code> to process Spanish texts.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Save the pretrained WordPiece tokenizer to get `vocab.txt`
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"bert-base-multilingual-cased"</span><span class="p">)</span>
<span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">DATA_DIR</span><span class="p">)</span>
</code></pre></div></div>
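<p>As a quick sanity check (our addition), the multilingual WordPiece vocabulary has 119,547 entries, which is exactly the <code class="language-plaintext highlighter-rouge">vocab_size</code> we pass to the pre-training and conversion configs below:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The number printed here must match `vocab_size` in hparams.json and config.json.
print(tokenizer.vocab_size)  # 119547
</code></pre></div></div>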
<p>We use <code class="language-plaintext highlighter-rouge">build_pretraining_dataset.py</code> to create a pre-training dataset from a dump of raw text.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">python3</span> <span class="n">electra</span><span class="o">/</span><span class="n">build_pretraining_dataset</span><span class="p">.</span><span class="n">py</span> \
<span class="o">--</span><span class="n">corpus</span><span class="o">-</span><span class="nb">dir</span> <span class="err">$</span><span class="n">DATA_DIR</span> \
<span class="o">--</span><span class="n">vocab</span><span class="o">-</span><span class="nb">file</span> <span class="err">$</span><span class="n">DATA_DIR</span><span class="o">/</span><span class="n">vocab</span><span class="p">.</span><span class="n">txt</span> \
<span class="o">--</span><span class="n">output</span><span class="o">-</span><span class="nb">dir</span> <span class="err">$</span><span class="n">DATA_DIR</span><span class="o">/</span><span class="n">pretrain_tfrecords</span> \
<span class="o">--</span><span class="nb">max</span><span class="o">-</span><span class="n">seq</span><span class="o">-</span><span class="n">length</span> <span class="mi">128</span> \
<span class="o">--</span><span class="n">blanks</span><span class="o">-</span><span class="n">separate</span><span class="o">-</span><span class="n">docs</span> <span class="bp">False</span> \
<span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">lower</span><span class="o">-</span><span class="n">case</span> \
<span class="o">--</span><span class="n">num</span><span class="o">-</span><span class="n">processes</span> <span class="mi">5</span>
</code></pre></div></div>
<h3 id="start-training">Start Training</h3>
<p>We use <code class="language-plaintext highlighter-rouge">run_pretraining.py</code> to pre-train an ELECTRA model.</p>
<p>To train a small ELECTRA model for 1 million steps, run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small
</code></pre></div></div>
<p>This takes slightly over 4 days on a Tesla V100 GPU. However, the model should achieve decent results after 200k steps (10 hours of training on a V100 GPU).</p>
<p>To customize the training, create a <code class="language-plaintext highlighter-rouge">.json</code> file containing the hyperparameters. Please refer to <a href="https://github.com/google-research/electra/blob/master/configure_pretraining.py"><code class="language-plaintext highlighter-rouge">configure_pretraining.py</code></a> for the default values of all hyperparameters.</p>
<p>Below, we set the hyperparameters to train the model for only 100 steps.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hparams</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"do_train"</span><span class="p">:</span> <span class="s">"true"</span><span class="p">,</span>
<span class="s">"do_eval"</span><span class="p">:</span> <span class="s">"false"</span><span class="p">,</span>
<span class="s">"model_size"</span><span class="p">:</span> <span class="s">"small"</span><span class="p">,</span>
<span class="s">"do_lower_case"</span><span class="p">:</span> <span class="s">"false"</span><span class="p">,</span>
<span class="s">"vocab_size"</span><span class="p">:</span> <span class="mi">119547</span><span class="p">,</span>
<span class="s">"num_train_steps"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>
<span class="s">"save_checkpoints_steps"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span>
<span class="s">"train_batch_size"</span><span class="p">:</span> <span class="mi">32</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"hparams.json"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">hparams</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s start training:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">python3</span> <span class="n">electra</span><span class="o">/</span><span class="n">run_pretraining</span><span class="p">.</span><span class="n">py</span> \
<span class="o">--</span><span class="n">data</span><span class="o">-</span><span class="nb">dir</span> <span class="err">$</span><span class="n">DATA_DIR</span> \
<span class="o">--</span><span class="n">model</span><span class="o">-</span><span class="n">name</span> <span class="err">$</span><span class="n">MODEL_NAME</span> \
<span class="o">--</span><span class="n">hparams</span> <span class="s">"hparams.json"</span>
</code></pre></div></div>
<p>If you are training on a virtual machine, run the following lines in the terminal to monitor the training process with TensorBoard.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install -U tensorboard
tensorboard dev upload --logdir data/models/electra-spanish
</code></pre></div></div>
<p>This is the <a href="https://tensorboard.dev/experiment/AmaGBV3RTGOB1leXGGsJmw/#scalars">TensorBoard</a> of training ELECTRA-small for 1 million steps in 4 days on a V100 GPU.</p>
<p><img src="https://github.com/chriskhanhtran/spanish-bert/blob/master/img/electra-tensorboard.JPG?raw=true" width="400" class="align-center" /></p>
<h2 id="4-convert-tensorflow-checkpoints-to-pytorch-format">4. Convert Tensorflow checkpoints to PyTorch format</h2>
<p>Hugging Face has <a href="https://huggingface.co/transformers/converting_tensorflow_models.html">a tool</a> to convert TensorFlow checkpoints to PyTorch. However, this tool has not yet been updated for ELECTRA. Fortunately, I found a GitHub repo by @lonePatient that can help us with this task.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">git</span> <span class="n">clone</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">lonePatient</span><span class="o">/</span><span class="n">electra_pytorch</span><span class="p">.</span><span class="n">git</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MODEL_DIR</span> <span class="o">=</span> <span class="s">"data/models/electra-spanish/"</span>
<span class="n">config</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"vocab_size"</span><span class="p">:</span> <span class="mi">119547</span><span class="p">,</span>
<span class="s">"embedding_size"</span><span class="p">:</span> <span class="mi">128</span><span class="p">,</span>
<span class="s">"hidden_size"</span><span class="p">:</span> <span class="mi">256</span><span class="p">,</span>
<span class="s">"num_hidden_layers"</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span>
<span class="s">"num_attention_heads"</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span>
<span class="s">"intermediate_size"</span><span class="p">:</span> <span class="mi">1024</span><span class="p">,</span>
<span class="s">"generator_size"</span><span class="p">:</span><span class="s">"0.25"</span><span class="p">,</span>
<span class="s">"hidden_act"</span><span class="p">:</span> <span class="s">"gelu"</span><span class="p">,</span>
<span class="s">"hidden_dropout_prob"</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
<span class="s">"attention_probs_dropout_prob"</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
<span class="s">"max_position_embeddings"</span><span class="p">:</span> <span class="mi">512</span><span class="p">,</span>
<span class="s">"type_vocab_size"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s">"initializer_range"</span><span class="p">:</span> <span class="mf">0.02</span>
<span class="p">}</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">MODEL_DIR</span> <span class="o">+</span> <span class="s">"config.json"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">python</span> <span class="n">electra_pytorch</span><span class="o">/</span><span class="n">convert_electra_tf_checkpoint_to_pytorch</span><span class="p">.</span><span class="n">py</span> \
<span class="o">--</span><span class="n">tf_checkpoint_path</span><span class="o">=</span><span class="err">$</span><span class="n">MODEL_DIR</span> \
<span class="o">--</span><span class="n">electra_config_file</span><span class="o">=</span><span class="err">$</span><span class="n">MODEL_DIR</span><span class="o">/</span><span class="n">config</span><span class="p">.</span><span class="n">json</span> \
<span class="o">--</span><span class="n">pytorch_dump_path</span><span class="o">=</span><span class="err">$</span><span class="n">MODEL_DIR</span><span class="o">/</span><span class="n">pytorch_model</span><span class="p">.</span><span class="nb">bin</span>
</code></pre></div></div>
<p><strong>Use ELECTRA with <code class="language-plaintext highlighter-rouge">transformers</code></strong></p>
<p>After converting the model checkpoint to PyTorch format, we can start to use our pre-trained ELECTRA model on downstream tasks with the <code class="language-plaintext highlighter-rouge">transformers</code> library.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">ElectraForPreTraining</span><span class="p">,</span> <span class="n">ElectraTokenizerFast</span>
<span class="n">discriminator</span> <span class="o">=</span> <span class="n">ElectraForPreTraining</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">MODEL_DIR</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">ElectraTokenizerFast</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">DATA_DIR</span><span class="p">,</span> <span class="n">do_lower_case</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sentence</span> <span class="o">=</span> <span class="s">"Los pájaros están cantando"</span> <span class="c1"># The birds are singing
</span><span class="n">fake_sentence</span> <span class="o">=</span> <span class="s">"Los pájaros están hablando"</span> <span class="c1"># The birds are speaking
</span>
<span class="n">fake_tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">fake_sentence</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">fake_inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">fake_sentence</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">)</span>
<span class="n">discriminator_outputs</span> <span class="o">=</span> <span class="n">discriminator</span><span class="p">(</span><span class="n">fake_inputs</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">discriminator_outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span>
<span class="p">[</span><span class="k">print</span><span class="p">(</span><span class="s">"%7s"</span> <span class="o">%</span> <span class="n">token</span><span class="p">,</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">fake_tokens</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="p">[</span><span class="k">print</span><span class="p">(</span><span class="s">"%7s"</span> <span class="o">%</span> <span class="nb">int</span><span class="p">(</span><span class="n">prediction</span><span class="p">),</span> <span class="n">end</span><span class="o">=</span><span class="s">""</span><span class="p">)</span> <span class="k">for</span> <span class="n">prediction</span> <span class="ow">in</span> <span class="n">predictions</span><span class="p">.</span><span class="n">tolist</span><span class="p">()];</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [CLS] Los paj ##aros estan habla ##ndo [SEP]
1 0 0 0 0 0 0 0
</code></pre></div></div>
<p>Our model was trained for only 100 steps so the predictions are not accurate. The fully-trained ELECTRA-small for Spanish can be loaded as below:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">discriminator</span> <span class="o">=</span> <span class="n">ElectraForPreTraining</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"skimai/electra-small-spanish"</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">ElectraTokenizerFast</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"skimai/electra-small-spanish"</span><span class="p">,</span> <span class="n">do_lower_case</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
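<p>For convenience, the replaced-token check above can be wrapped in a small helper (our addition, not part of the original notebook):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def detect_replacements(sentence):
    """Return (token, is_replaced) pairs according to the discriminator."""
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    scores = discriminator(inputs)[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs[0].tolist())
    return list(zip(tokens, (scores > 0).squeeze().tolist()))

detect_replacements("Los pájaros están hablando")
</code></pre></div></div>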
<h2 id="5-conclusion">5. Conclusion</h2>
<p>In this article, we have walked through the ELECTRA paper to understand why ELECTRA is the most efficient Transformer pre-training approach at the moment. At a small scale, ELECTRA-small can be trained on one GPU for 4 days to outperform GPT on the GLUE benchmark. At a large scale, ELECTRA-large sets a new state-of-the-art for SQuAD 2.0.</p>
<p>We then pre-trained an ELECTRA model on Spanish text, converted the TensorFlow checkpoint to PyTorch format, and used the model with the <code class="language-plaintext highlighter-rouge">transformers</code> library.</p>
<h2 id="references">References</h2>
<ul>
<li>[1] <a href="https://openreview.net/pdf?id=r1xMH1BtvB">ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators</a></li>
<li>[2] <a href="https://github.com/google-research/electra">google-research/electra</a> - the official GitHub repository of the original paper</li>
<li>[3] <a href="https://github.com/lonePatient/electra_pytorch">electra_pytorch</a> - a PyTorch implementation of ELECTRA</li>
</ul>
<h1 id="extractive-summarization-with-bert">Extractive Summarization with BERT</h1>
<p><a href="https://github.com/chriskhanhtran/bert-extractive-summarization"><img src="https://img.shields.io/badge/GitHub-View_on_GitHub-blue?logo=GitHub" alt="" /></a></p>
<h2 id="1-introduction">1. Introduction</h2>
<p>Summarization has long been a challenge in Natural Language Processing. To generate a short version of a document while retaining its most important information, we need a model capable of accurately extracting the key points while avoiding repetitive information. Fortunately, recent works in NLP such as Transformer models and language model pretraining have advanced the state-of-the-art in summarization.</p>
<p>In this article, we will explore BERTSUM, a simple variant of BERT, for extractive summarization from <a href="https://arxiv.org/abs/1908.08345">Text Summarization with Pretrained Encoders</a> (Liu et al., 2019). Then, in an effort to make extractive summarization even faster and smaller for low-resource devices, we will fine-tune DistilBERT (<a href="https://arxiv.org/abs/1910.01108">Sanh et al., 2019</a>) and MobileBERT (<a href="https://arxiv.org/abs/2004.02984">Sun et al., 2020</a>), two recent lite versions of BERT, and discuss our findings.</p>
<h2 id="2-extractive-summarization">2. Extractive Summarization</h2>
<p>There are two types of summarization: <em>abstractive</em> and <em>extractive summarization</em>. Abstractive summarization basically means rewriting key points, while extractive summarization generates a summary by directly copying the most important spans/sentences from a document.</p>
<p>Abstractive summarization is more challenging for humans and more computationally expensive for machines. However, which type of summarization is better depends on the purpose of the end user. If you were writing an essay, abstractive summarization might be a better choice. On the other hand, if you were doing research and needed a quick summary of what you were reading, extractive summarization would be more helpful for the task.</p>
<p>In this section we will explore the architecture of our extractive summarization model. The BERT summarizer has 2 parts: a BERT encoder and a summarization classifier.</p>
<h3 id="bert-encoder">BERT Encoder</h3>
<p><img src="https://github.com/chriskhanhtran/minimal-portfolio/blob/master/images/bertsum.jpeg?raw=true" alt="" /></p>
<p><em>The overview architecture of BERTSUM</em></p>
<p>Our BERT encoder is the pretrained BERT-base encoder from the masked language modeling task (<a href="https://github.com/google-research/bert">Devlin et al., 2018</a>). The task of extractive summarization is a binary classification problem at the sentence level. We want to assign each sentence a label \(y_i \in \{0, 1\}\) indicating whether the sentence should be included in the final summary. Therefore, we need to add a <code class="language-plaintext highlighter-rouge">[CLS]</code> token before each sentence. After a forward pass through the encoder, the last hidden layer at these <code class="language-plaintext highlighter-rouge">[CLS]</code> tokens will be used as the representations of our sentences.</p>
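<p>This construction can be sketched with a generic BERT encoder (a simplification of ours; BERTSUM also adds interval segment embeddings, which are omitted here):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

sents = ["The cat sat on the mat.", "It was happy."]
text = " ".join(f"[CLS] {s} [SEP]" for s in sents)
inputs = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

hidden = encoder(inputs)[0]                  # (1, seq_len, hidden_size)
cls_mask = inputs == tokenizer.cls_token_id  # positions of the [CLS] tokens
sent_vecs = hidden[cls_mask]                 # (num_sentences, hidden_size)
</code></pre></div></div>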
<h3 id="summarization-classifier">Summarization Classifier</h3>
<p>After getting the vector representation of each sentence, we can use a simple feed-forward layer as our classifier to return a score for each sentence. In the paper, the authors experimented with a simple linear classifier, a Recurrent Neural Network, and a small Transformer model with 3 layers. The Transformer classifier yields the best results, showing that inter-sentence interactions through the self-attention mechanism are important in selecting the most important sentences.</p>
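<p>The simplest variant can be sketched as follows (a hypothetical minimal version reusing <code class="language-plaintext highlighter-rouge">sent_vecs</code> from the snippet above, not the authors’ code):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch.nn as nn

# One inclusion score per sentence. The paper's best variant first runs a
# small Transformer over the sentence vectors, which is omitted here.
classifier = nn.Sequential(nn.Linear(768, 1), nn.Sigmoid())
scores = classifier(sent_vecs)  # shape: (num_sentences, 1)
</code></pre></div></div>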
<p>So in the encoder, we learn the interactions among tokens in our document while in the summarization classifier, we learn the interactions among sentences.</p>
<h2 id="3-make-summarization-even-faster">3. Make Summarization Even Faster</h2>
<p>Transformer models achieve state-of-the-art performance on most NLP benchmarks; however, training them and making predictions with them are computationally expensive. In an effort to make summarization lighter and faster to deploy on low-resource devices, I have modified the <a href="https://github.com/nlpyang/PreSumm">source code</a> provided by the authors of BERTSUM to replace the BERT encoder with DistilBERT and MobileBERT. The summary layers are kept unchanged.</p>
<p>Here are the training losses of these 3 variants: <a href="https://tensorboard.dev/experiment/Ly7CRURRSOuPBlZADaqBlQ/#scalars">TensorBoard</a></p>
<p><img src="https://github.com/chriskhanhtran/bert-extractive-summarization/raw/master/tensorboard.JPG" alt="" /></p>
<p>Despite being 40% smaller than BERT-base, DistilBERT reaches the same training losses as BERT-base, while MobileBERT performs slightly worse. The table below shows their performance on the CNN/DailyMail dataset, their size, and the running time of a forward pass:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Models</th>
<th style="text-align: center">ROUGE-1</th>
<th style="text-align: center">ROUGE-2</th>
<th style="text-align: center">ROUGE-L</th>
<th style="text-align: center">Inference Time*</th>
<th style="text-align: center">Size</th>
<th style="text-align: center">Params</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">bert-base</td>
<td style="text-align: center">43.23</td>
<td style="text-align: center">20.24</td>
<td style="text-align: center">39.63</td>
<td style="text-align: center">1.65 s</td>
<td style="text-align: center">475 MB</td>
<td style="text-align: center">120.5 M</td>
</tr>
<tr>
<td style="text-align: left">distilbert</td>
<td style="text-align: center">42.84</td>
<td style="text-align: center">20.04</td>
<td style="text-align: center">39.31</td>
<td style="text-align: center">925 ms</td>
<td style="text-align: center">310 MB</td>
<td style="text-align: center">77.4 M</td>
</tr>
<tr>
<td style="text-align: left">mobilebert</td>
<td style="text-align: center">40.59</td>
<td style="text-align: center">17.98</td>
<td style="text-align: center">36.99</td>
<td style="text-align: center">609 ms</td>
<td style="text-align: center">128 MB</td>
<td style="text-align: center">30.8 M</td>
</tr>
</tbody>
</table>
<p>*<em>Average running time of a forward pass on a single GPU on a standard Google Colab notebook</em></p>
<p>While being 45% faster, DistilBERT has almost the same performance as BERT-base. MobileBERT retains 94% of BERT-base’s performance while being 4x smaller than BERT-base and 2.5x smaller than DistilBERT. The MobileBERT paper shows that MobileBERT significantly outperforms DistilBERT on SQuAD v1.1; however, that is not the case for extractive summarization. Still, this is an impressive result for MobileBERT with a disk size of only 128 MB.</p>
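<p>The inference times in the table can be reproduced approximately with a timing loop along these lines (an illustrative sketch of ours, using a plain DistilBERT encoder as a stand-in, since the encoder dominates the summarizer’s cost):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()
inputs = tokenizer.encode("A document to summarize. " * 20, return_tensors="pt")

# Average several forward passes to smooth out run-to-run noise.
with torch.no_grad():
    start = time.time()
    for _ in range(10):
        model(inputs)
print(f"{(time.time() - start) / 10:.3f} s per forward pass")
</code></pre></div></div>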
<h2 id="4-lets-summarize">4. Let’s Summarize</h2>
<p>All pretrained checkpoints, training details, and setup instructions can be found in <a href="https://github.com/chriskhanhtran/bert-extractive-summarization/">this GitHub repository</a>. In addition, I have deployed a demo of BERTSUM with the MobileBERT encoder.</p>
<p><strong>Web app:</strong> https://extractive-summarization.herokuapp.com/</p>
<p><a href="https://extractive-summarization.herokuapp.com/"><img src="https://img.shields.io/badge/Heroku-Open_Web_App-blue?logo=Heroku" alt="" /></a></p>
<p><img src="https://github.com/chriskhanhtran/minimal-portfolio/blob/master/images/bertsum.gif?raw=true" alt="" /></p>
<p><strong>Code:</strong></p>
<p><a href="https://colab.research.google.com/drive/1hwpYC-AU6C_nwuM_N5ynOShXIRGv-U51#scrollTo=KizhzOxVOjaN"><img src="https://img.shields.io/badge/Colab-Run_in_Google_Colab-blue?logo=Google&logoColor=FDBA18" alt="" /></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">models.model_builder</span> <span class="kn">import</span> <span class="n">ExtSummarizer</span>
<span class="kn">from</span> <span class="nn">ext_sum</span> <span class="kn">import</span> <span class="n">summarize</span>
<span class="c1"># Load model
</span><span class="n">model_type</span> <span class="o">=</span> <span class="s">'mobilebert'</span> <span class="c1">#@param ['bertbase', 'distilbert', 'mobilebert']
</span><span class="n">checkpoint</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="sa">f</span><span class="s">'checkpoints/</span><span class="si">{</span><span class="n">model_type</span><span class="si">}</span><span class="s">_ext.pt'</span><span class="p">,</span> <span class="n">map_location</span><span class="o">=</span><span class="s">'cpu'</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">ExtSummarizer</span><span class="p">(</span><span class="n">checkpoint</span><span class="o">=</span><span class="n">checkpoint</span><span class="p">,</span> <span class="n">bert_type</span><span class="o">=</span><span class="n">model_type</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="s">'cpu'</span><span class="p">)</span>
<span class="c1"># Run summarization
</span><span class="n">input_fp</span> <span class="o">=</span> <span class="s">'raw_data/input.txt'</span>
<span class="n">result_fp</span> <span class="o">=</span> <span class="s">'results/summary.txt'</span>
<span class="n">summary</span> <span class="o">=</span> <span class="n">summarize</span><span class="p">(</span><span class="n">input_fp</span><span class="p">,</span> <span class="n">result_fp</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">summary</span><span class="p">)</span>
</code></pre></div></div>
<p><strong>Summary sample</strong></p>
<p>Original: https://www.cnn.com/2020/05/22/business/hertz-bankruptcy/index.html</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>By declaring bankruptcy, Hertz says it intends to stay in business while restructuring its debts and emerging a
financially healthier company. The company has been renting cars since 1918, when it set up shop with a dozen
Ford Model Ts, and has survived the Great Depression, the virtual halt of US auto production during World War II
and numerous oil price shocks. "The impact of Covid-19 on travel demand was sudden and dramatic, causing an
abrupt decline in the company's revenue and future bookings," said the company's statement.
</code></pre></div></div>
<h2 id="5-conclusion">5. Conclusion</h2>
<p>In this article, we have explored BERTSUM, a simple variant of BERT, for extractive summarization from the paper <strong>Text Summarization with Pretrained Encoders</strong> (Liu et al., 2019). Then, in an effort to make extractive summarization even faster and smaller for low-resource devices, we fine-tuned DistilBERT (Sanh et al., 2019) and MobileBERT (Sun et al., 2020) on the CNN/DailyMail dataset.</p>
<p>DistilBERT retains BERT-base’s performance in extractive summarization while being 40% smaller and 45% faster. MobileBERT is 4x smaller and 2.7x faster than BERT-base yet retains 94% of its performance.</p>
<p>Finally, we deployed a web app demo of MobileBERT for extractive summarization at https://extractive-summarization.herokuapp.com/.</p>
<h2 id="references">References</h2>
<ul>
<li>[1] <a href="https://github.com/nlpyang/PreSumm">PreSumm: Text Summarization with Pretrained Encoders</a></li>
<li>[2] <a href="https://huggingface.co/transformers/model_doc/distilbert.html">DistilBERT: Smaller, faster, cheaper, lighter version of BERT</a></li>
<li>[3] <a href="https://github.com/google-research/google-research/tree/master/mobilebert">MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices</a></li>
<li>[4] <a href="https://github.com/lonePatient/MobileBert_PyTorch">MobileBert_PyTorch</a></li>
</ul>
<h1 id="visual-recognition-for-vietnamese-foods">Visual Recognition for Vietnamese Foods</h1>
<p><a href="https://github.com/chriskhanhtran/vn-food-app"><img src="https://img.shields.io/badge/GitHub-View_Repository-blue?logo=GitHub" alt="" /></a></p>
<p>Imagine you are a world traveller visiting a country famous for its street food. Walking down a night-market street full of food trucks and delicious-looking options, you have no idea what these dishes are or whether they contain any ingredient you are allergic to. You want to ask the locals, but you don’t know the language. You wish you had an app on your phone that let you take a picture of the food you want and returned all the information you need about it.</p>
<p>This is one simple application of computer vision that can make our lives better, alongside many other applications being implemented in autonomous driving or cancer detection.</p>
<p>Today we are going to build a world-class image classifier using the <code class="language-plaintext highlighter-rouge">fastai</code> library to classify 11 popular Vietnamese dishes. The <code class="language-plaintext highlighter-rouge">fastai</code> library is built on top of <a href="https://pytorch.org/">PyTorch</a> and allows us to quickly and easily build the latest neural networks and train our models to achieve state-of-the-art results.</p>
<p>Before we begin, let’s load packages that we are going to use in this project.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">reload_ext</span> <span class="n">autoreload</span>
<span class="o">%</span><span class="n">autoreload</span> <span class="mi">2</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">urllib.request</span>
<span class="kn">from</span> <span class="nn">fastai.vision</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">from</span> <span class="nn">fastai.metrics</span> <span class="kn">import</span> <span class="n">error_rate</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span><span class="p">,</span> <span class="n">ImageFile</span>
<span class="n">ImageFile</span><span class="p">.</span><span class="n">LOAD_TRUNCATED_IMAGES</span> <span class="o">=</span> <span class="bp">True</span>
</code></pre></div></div>
<h2 id="looking-at-the-data">Looking at the data</h2>
<p>I have built an image dataset of popular Vietnamese dishes using the Bing Image Search API by following <a href="https://www.pyimagesearch.com/2018/04/09/how-to-quickly-build-a-deep-learning-image-dataset/">PyImageSearch’s tutorial</a>. We can directly download this dataset from Google Drive.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span>
<span class="c1"># Download and unzip
</span><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="s">"data"</span><span class="p">):</span>
<span class="err">!</span><span class="n">wget</span> <span class="o">-</span><span class="n">O</span> <span class="s">"dataset.zip"</span> <span class="s">"https://www.googleapis.com/drive/v3/files/13GD8pcwHJPiAPbPtm6KeC20Qw1zm9xdy?alt=media&key=AIzaSyCmo6sAQ37OK8DK4wnT94PoLx5lx-7VTDE"</span>
<span class="err">!</span><span class="n">unzip</span> <span class="n">dataset</span><span class="p">.</span><span class="nb">zip</span>
<span class="err">!</span><span class="n">rm</span> <span class="n">dataset</span><span class="p">.</span><span class="nb">zip</span>
</code></pre></div></div>
<p>Each class is stored in a separate folder in <code class="language-plaintext highlighter-rouge">data</code>, so we can use <code class="language-plaintext highlighter-rouge">ImageDataBunch.from_folder</code> to quickly load our dataset. In addition, we resize our images to 224 x 224 pixels and use <code class="language-plaintext highlighter-rouge">get_transforms</code> to flip, rotate, zoom, warp, and adjust the lighting of our original images (this is data augmentation, a strategy that significantly increases the diversity of data available for training without actually collecting new data). We also <code class="language-plaintext highlighter-rouge">normalize</code> the images using statistics from the ImageNet dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">"./data/"</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">ImageDataBunch</span><span class="p">.</span><span class="n">from_folder</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="s">'.'</span><span class="p">,</span> <span class="n">ds_tfms</span><span class="o">=</span><span class="n">get_transforms</span><span class="p">(),</span>
<span class="n">valid_pct</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">bs</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">224</span><span class="p">).</span><span class="n">normalize</span><span class="p">(</span><span class="n">imagenet_stats</span><span class="p">)</span>
</code></pre></div></div>
<p>The dataset has more than 6k images and we use 20% of them as a validation set.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">train_ds</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">valid_ds</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(4959, 1239)
</code></pre></div></div>
<p>Let’s look at some samples:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="p">.</span><span class="n">show_batch</span><span class="p">(</span><span class="n">rows</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="/assets/images/vn-food-classifier/output_12_0.png" alt="png" /></p>
<p>The dataset has a total of 11 classes. They are:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"Number of classes: "</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">classes</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Classes: "</span><span class="p">,</span> <span class="n">data</span><span class="p">.</span><span class="n">classes</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Number of classes: 11
Classes: ['banh-mi', 'banh-xeo', 'bubble-tea', 'bun-bo-hue', 'bun-bo-nam-bo', 'bun-cha', 'bun-dau-mam-tom', 'che', 'hu-tieu', 'pho', 'spring-rolls']
</code></pre></div></div>
<h2 id="training-resnet-50">Training: ResNet-50</h2>
<p>Now we will finetune a ResNet-50 model on our customized dataset. ResNet is from the paper <a href="https://arxiv.org/pdf/1512.03385.pdf">Deep Residual Learning for Image Recognition</a> and is the best default model for computer vision. This ResNet-50 model was trained on ImageNet with 1000 classes, so we first need to initialize a new head to adapt the model to the number of classes in our dataset. The <code class="language-plaintext highlighter-rouge">cnn_learner</code> method will do this for us (read the <a href="https://docs.fast.ai/vision.learner.html#cnn_learner">documentation</a>).</p>
<p>Then we will train the model in two stages:</p>
<ul>
<li>first we freeze the body weights and only train the head,</li>
<li>then we unfreeze the layers of the backbone and fine-tune the whole model.</li>
</ul>
<h3 id="stage-1-finetune-the-top-layers-only">Stage 1: Finetune the top layers only</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span> <span class="o">=</span> <span class="n">cnn_learner</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">models</span><span class="p">.</span><span class="n">resnet50</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="n">error_rate</span><span class="p">,</span> <span class="n">pretrained</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<p>We will train the model for 20 epochs following the 1cycle policy. The number of epochs we need to train for depends on how different the dataset is from ImageNet. Basically, we train the model until the validation loss stops decreasing. The default learning rate here is 3e-3.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span><span class="p">.</span><span class="n">fit_one_cycle</span><span class="p">(</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: left;">
<th>epoch</th>
<th>train_loss</th>
<th>valid_loss</th>
<th>error_rate</th>
<th>time</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1.356109</td>
<td>0.657256</td>
<td>0.206618</td>
<td>01:31</td>
</tr>
<tr>
<td>1</td>
<td>0.870719</td>
<td>0.560793</td>
<td>0.167070</td>
<td>01:27</td>
</tr>
<tr>
<td>2</td>
<td>0.687549</td>
<td>0.526645</td>
<td>0.167877</td>
<td>01:32</td>
</tr>
<tr>
<td>3</td>
<td>0.621711</td>
<td>0.444908</td>
<td>0.143664</td>
<td>01:27</td>
</tr>
<tr>
<td>4</td>
<td>0.493342</td>
<td>0.397156</td>
<td>0.126715</td>
<td>01:28</td>
</tr>
<tr>
<td>5</td>
<td>0.435479</td>
<td>0.381771</td>
<td>0.123487</td>
<td>01:27</td>
</tr>
<tr>
<td>6</td>
<td>0.373893</td>
<td>0.389900</td>
<td>0.121065</td>
<td>01:28</td>
</tr>
<tr>
<td>7</td>
<td>0.325564</td>
<td>0.371386</td>
<td>0.112994</td>
<td>01:28</td>
</tr>
<tr>
<td>8</td>
<td>0.295842</td>
<td>0.349679</td>
<td>0.106538</td>
<td>01:29</td>
</tr>
<tr>
<td>9</td>
<td>0.271255</td>
<td>0.348150</td>
<td>0.098467</td>
<td>01:28</td>
</tr>
<tr>
<td>10</td>
<td>0.233799</td>
<td>0.317944</td>
<td>0.091203</td>
<td>01:28</td>
</tr>
<tr>
<td>11</td>
<td>0.205257</td>
<td>0.306772</td>
<td>0.086360</td>
<td>01:27</td>
</tr>
<tr>
<td>12</td>
<td>0.171645</td>
<td>0.310830</td>
<td>0.084746</td>
<td>01:28</td>
</tr>
<tr>
<td>13</td>
<td>0.159442</td>
<td>0.293081</td>
<td>0.078289</td>
<td>01:27</td>
</tr>
<tr>
<td>14</td>
<td>0.137362</td>
<td>0.295972</td>
<td>0.083939</td>
<td>01:29</td>
</tr>
<tr>
<td>15</td>
<td>0.112108</td>
<td>0.289382</td>
<td>0.084746</td>
<td>01:26</td>
</tr>
<tr>
<td>16</td>
<td>0.093431</td>
<td>0.279121</td>
<td>0.083939</td>
<td>01:30</td>
</tr>
<tr>
<td>17</td>
<td>0.094814</td>
<td>0.281516</td>
<td>0.079903</td>
<td>01:29</td>
</tr>
<tr>
<td>18</td>
<td>0.083980</td>
<td>0.277226</td>
<td>0.073446</td>
<td>01:33</td>
</tr>
<tr>
<td>19</td>
<td>0.090189</td>
<td>0.274643</td>
<td>0.081517</td>
<td>01:28</td>
</tr>
</tbody>
</table>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">'resnes50-stage-1'</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="stage-2-unfreeze-and-finetune-the-entire-networks">Stage 2: Unfreeze and finetune the entire networks</h3>
<p>At this stage we will unfreeze the whole model and finetune it with smaller learning rates. The code below will help us find the learning rate for this stage.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">'resnes50-stage-1'</span><span class="p">)</span>
<span class="n">learn</span><span class="p">.</span><span class="n">lr_find</span><span class="p">()</span>
<span class="n">learn</span><span class="p">.</span><span class="n">recorder</span><span class="p">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/images/vn-food-classifier/output_24_2.png" alt="png" /></p>
<p>When finetuning the model at stage 2, we will use different learning rates for different layers. The top layers will be updated at greater rates than the bottom layers. As a rule of thumb, we use learning rates between (a, b), in which:</p>
<ul>
<li>a is taken from the LR Finder above where the loss starts to decrease for a while,</li>
<li>b is 5 to 10 times smaller than the default rate we used in stage 1.</li>
</ul>
<p>In this case, we will use learning rates from 3e-6 to 3e-4. We continue to train the model until the validation loss stops decreasing.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span><span class="p">.</span><span class="n">unfreeze</span><span class="p">()</span>
<span class="n">learn</span><span class="p">.</span><span class="n">fit_one_cycle</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="n">max_lr</span><span class="o">=</span><span class="nb">slice</span><span class="p">(</span><span class="mf">3e-6</span><span class="p">,</span> <span class="mf">3e-4</span><span class="p">))</span>
</code></pre></div></div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: left;">
<th>epoch</th>
<th>train_loss</th>
<th>valid_loss</th>
<th>error_rate</th>
<th>time</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0.077156</td>
<td>0.277789</td>
<td>0.079903</td>
<td>01:30</td>
</tr>
<tr>
<td>1</td>
<td>0.089096</td>
<td>0.269736</td>
<td>0.075061</td>
<td>01:30</td>
</tr>
<tr>
<td>2</td>
<td>0.092252</td>
<td>0.282279</td>
<td>0.079096</td>
<td>01:29</td>
</tr>
<tr>
<td>3</td>
<td>0.074603</td>
<td>0.265167</td>
<td>0.067797</td>
<td>01:31</td>
</tr>
<tr>
<td>4</td>
<td>0.071408</td>
<td>0.278237</td>
<td>0.074253</td>
<td>01:33</td>
</tr>
<tr>
<td>5</td>
<td>0.047503</td>
<td>0.250248</td>
<td>0.065375</td>
<td>01:29</td>
</tr>
<tr>
<td>6</td>
<td>0.037788</td>
<td>0.249852</td>
<td>0.070218</td>
<td>01:30</td>
</tr>
<tr>
<td>7</td>
<td>0.035603</td>
<td>0.251429</td>
<td>0.066990</td>
<td>01:29</td>
</tr>
</tbody>
</table>
<p>After two stages, the error rate on the validation set is about 6%, so the accuracy on our dataset with 11 classes is 94%. That’s a pretty accurate model. It’s good practice to save the current stage of the model in case we want to train it for more epochs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">'resnes50-stage-2'</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="evaluation">Evaluation</h2>
<p>Now let’s look at the model’s predictions. Specifically, we will look at examples where the model makes wrong predictions. Below are 9 examples with the top losses, meaning that on these examples the model assigned the actual class a very low score.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">learn</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">'resnes50-stage-2'</span><span class="p">)</span>
<span class="n">interp</span> <span class="o">=</span> <span class="n">ClassificationInterpretation</span><span class="p">.</span><span class="n">from_learner</span><span class="p">(</span><span class="n">learn</span><span class="p">)</span>
<span class="n">losses</span><span class="p">,</span> <span class="n">idxs</span> <span class="o">=</span> <span class="n">interp</span><span class="p">.</span><span class="n">top_losses</span><span class="p">()</span>
<span class="n">interp</span><span class="p">.</span><span class="n">plot_top_losses</span><span class="p">(</span><span class="mi">9</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span><span class="mi">11</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="/assets/images/vn-food-classifier/output_31_1.png" alt="png" /></p>
<p>The confusion matrix below gives us a big picture of which pairs of dishes confuse our model. The number of wrong predictions is quite low.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">interp</span><span class="p">.</span><span class="n">plot_confusion_matrix</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/images/vn-food-classifier/output_33_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">interp</span><span class="p">.</span><span class="n">most_confused</span><span class="p">(</span><span class="n">min_val</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[('pho', 'hu-tieu', 11),
('bun-bo-nam-bo', 'bun-cha', 5),
('hu-tieu', 'bun-bo-nam-bo', 5),
('pho', 'bun-bo-hue', 5),
('bun-bo-hue', 'pho', 4),
('hu-tieu', 'bun-cha', 4),
('bun-bo-hue', 'hu-tieu', 3),
('bun-cha', 'bun-bo-nam-bo', 3),
('hu-tieu', 'bun-bo-hue', 3),
('hu-tieu', 'pho', 3)]
</code></pre></div></div>
<h2 id="production">Production</h2>
<p>Once we are satisfied with our model, we can save it to disk and use it to classify new data. We can also deploy it into the magical app that I described at the beginning of the notebook.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Export model
</span><span class="n">learn</span><span class="p">.</span><span class="n">export</span><span class="p">()</span>
</code></pre></div></div>
<p>For inference, we also want to run the model on CPU instead of GPU.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">defaults</span><span class="p">.</span><span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">'cpu'</span><span class="p">)</span>
</code></pre></div></div>
<p>The functions below download an image from a URL and use the model to make a prediction for that image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">open_image_url</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="s">"""Download image from URL, return fastai image and PIL image."""</span>
<span class="n">urllib</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">urlretrieve</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="s">"./img/test.jpg"</span><span class="p">)</span>
<span class="k">return</span> <span class="n">open_image</span><span class="p">(</span><span class="s">"./img/test.jpg"</span><span class="p">),</span> <span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"./img/test.jpg"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="s">"""Make prediction for image from url, show image and predicted probability."""</span>
<span class="n">img</span><span class="p">,</span> <span class="n">pil_img</span> <span class="o">=</span> <span class="n">open_image_url</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">pred_class</span><span class="p">,</span> <span class="n">pred_idx</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">learn</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Predicted Class: "</span><span class="p">,</span> <span class="n">pred_class</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Probability: </span><span class="si">{</span><span class="n">outputs</span><span class="p">[</span><span class="n">pred_idx</span><span class="p">].</span><span class="n">numpy</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">%"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="c1"># Show image
</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">pil_img</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Image'</span><span class="p">)</span>
<span class="c1"># Plot Probabilities
</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">data</span><span class="p">.</span><span class="n">classes</span><span class="p">).</span><span class="n">sort_values</span><span class="p">().</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'barh'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">"Class"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Probability"</span><span class="p">);</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Load model
</span><span class="n">learn</span> <span class="o">=</span> <span class="n">load_learner</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Make prediction from URL
</span><span class="n">url</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="nb">input</span><span class="p">(</span><span class="s">"URL: "</span><span class="p">))</span>
<span class="n">predict</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>URL: https://i.pinimg.com/originals/9b/63/d7/9b63d76a44be020e03eeaec0a1134e95.jpg
Predicted Class: bun-bo-hue
Probability: 99.99%
</code></pre></div></div>
<p><img src="/assets/images/vn-food-classifier/output_43_2.png" alt="png" /></p>
<p>I have also built a web app for this model. Check it out below:</p>
<p><a href="https://vietnamese-food.herokuapp.com/"><img src="https://img.shields.io/badge/Heroku-Open_Web_App-blue?logo=Heroku" alt="" /></a></p>
<p><img src="https://github.com/chriskhanhtran/vn-food-app/blob/master/img/vn-food-app.gif?raw=true" alt="" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>We have walked through the whole pipeline to fine-tune a ResNet-50 model on our customized dataset, evaluate its strengths and weaknesses, and deploy the model to production. With the <code class="language-plaintext highlighter-rouge">fastai</code> library, we can quickly achieve state-of-the-art results with very neat and clean code.</p>
<p>In this project, we built a classifier for only 11 popular Vietnamese dishes, but we can easily scale the model up to hundreds or thousands of classes by collecting more data, following this <a href="https://www.pyimagesearch.com/2018/04/09/how-to-quickly-build-a-deep-learning-image-dataset/">PyImageSearch tutorial</a>.</p>
<p>I have seen many interesting ideas using computer vision, such as <a href="https://play.google.com/store/apps/details?id=com.peat.GartenBank&hl=en_US">Plantix</a>, a plant doctor app which can tell what diseases your plants might have from a single photo and how you should take care of them. And yes, with some software engineering skills, some data and an idea, you can build your own impactful computer vision application. I look forward to seeing these applications make the world a better place in many different ways.</p>
<h2 id="reference">Reference</h2>
<ul>
<li><a href="https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson1-pets.ipynb">Fast AI: Lesson 1 - What’s your pet</a></li>
<li><a href="https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson2-download.ipynb">Fast AI: Lesson 2 - Creating your own dataset from Google Images</a></li>
<li><a href="https://www.pyimagesearch.com/2018/04/09/how-to-quickly-build-a-deep-learning-image-dataset/">PyImageSearch: How to (quickly) build a deep learning image dataset</a></li>
</ul>Chris TranWe are going to build a world-class image classifier using the fastai library to classify 11 popular Vietnamese dishes.Named Entity Recognition with Transformers2020-05-07T00:00:00-04:002020-05-07T00:00:00-04:00https://chriskhanhtran.github.io/posts/named-entity-recognition-with-transformers<p><img src="https://github.com/chriskhanhtran/spanish-bert/blob/master/img/part2.PNG?raw=true" alt="" /></p>
<p><a href="https://colab.research.google.com/drive/1ezuE7wC7Fa21Wu3fvzRffx2m14CAySS1#scrollTo=LhKZ3vItVBzi"><img src="https://img.shields.io/badge/Colab-Run_in_Google_Colab-blue?logo=Google&logoColor=FDBA18" alt="Run in Google Colab" /></a></p>
<h1 id="introduction">Introduction</h1>
<ul>
<li><a href="https://chriskhanhtran.github.io/posts/spanberta-bert-for-spanish-from-scratch/">Part I: How We Trained RoBERTa Language Model for Spanish from Scratch</a></li>
</ul>
<p>In my previous blog post, we discussed how my team pretrained SpanBERTa, a transformer language model for Spanish, on a big corpus from scratch. The model was shown to correctly predict masked words in a sequence based on their context. In this blog post, to really leverage the power of transformer models, we will fine-tune SpanBERTa for a named-entity recognition task.</p>
<p>According to its definition on <a href="https://en.wikipedia.org/wiki/Named-entity_recognition">Wikipedia</a>, named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.</p>
<p>We will use the script <a href="https://github.com/huggingface/transformers/blob/master/examples/ner/run_ner.py"><code class="language-plaintext highlighter-rouge">run_ner.py</code></a> by Hugging Face and <a href="https://www.kaggle.com/nltkdata/conll-corpora">CoNLL-2002 dataset</a> to fine-tune SpanBERTa.</p>
<h1 id="setup">Setup</h1>
<p>Download <code class="language-plaintext highlighter-rouge">transformers</code> and install required packages.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">capture</span>
<span class="err">!</span><span class="n">git</span> <span class="n">clone</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">huggingface</span><span class="o">/</span><span class="n">transformers</span>
<span class="o">%</span><span class="n">cd</span> <span class="n">transformers</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="p">.</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">r</span> <span class="p">.</span><span class="o">/</span><span class="n">examples</span><span class="o">/</span><span class="n">requirements</span><span class="p">.</span><span class="n">txt</span>
<span class="o">%</span><span class="n">cd</span> <span class="p">..</span>
</code></pre></div></div>
<h1 id="data">Data</h1>
<h2 id="1-download-datasets">1. Download Datasets</h2>
<p>The command below will download and unzip the dataset. The files contain the train and test data for three parts of the <a href="https://www.clips.uantwerpen.be/conll2002/ner/">CoNLL-2002</a> shared task:</p>
<ul>
<li>esp.testa: Spanish test data for the development stage</li>
<li>esp.testb: Spanish test data</li>
<li>esp.train: Spanish train data</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">capture</span>
<span class="err">!</span><span class="n">wget</span> <span class="o">-</span><span class="n">O</span> <span class="s">'conll2002.zip'</span> <span class="s">'https://drive.google.com/uc?export=download&id=1Wrl1b39ZXgKqCeAFNM9EoXtA1kzwNhCe'</span>
<span class="err">!</span><span class="n">unzip</span> <span class="s">'conll2002.zip'</span>
</code></pre></div></div>
<p>The size of each dataset:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">wc</span> <span class="o">-</span><span class="n">l</span> <span class="n">conll2002</span><span class="o">/</span><span class="n">esp</span><span class="p">.</span><span class="n">train</span>
<span class="err">!</span><span class="n">wc</span> <span class="o">-</span><span class="n">l</span> <span class="n">conll2002</span><span class="o">/</span><span class="n">esp</span><span class="p">.</span><span class="n">testa</span>
<span class="err">!</span><span class="n">wc</span> <span class="o">-</span><span class="n">l</span> <span class="n">conll2002</span><span class="o">/</span><span class="n">esp</span><span class="p">.</span><span class="n">testb</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>273038 conll2002/esp.train
54838 conll2002/esp.testa
53050 conll2002/esp.testb
</code></pre></div></div>
<p>All data files have three columns: words, associated part-of-speech tags and named entity tags in the IOB2 format. Sentence breaks are encoded by empty lines.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">head</span> <span class="o">-</span><span class="n">n20</span> <span class="n">conll2002</span><span class="o">/</span><span class="n">esp</span><span class="p">.</span><span class="n">train</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Melbourne NP B-LOC
( Fpa O
Australia NP B-LOC
) Fpt O
, Fc O
25 Z O
may NC O
( Fpa O
EFE NC B-ORG
) Fpt O
. Fp O
- Fg O
El DA O
Abogado NC B-PER
General AQ I-PER
del SP I-PER
Estado NC I-PER
, Fc O
</code></pre></div></div>
<p>We will only keep the word column and the named entity tag column for our train, dev and test datasets.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">cat</span> <span class="n">conll2002</span><span class="o">/</span><span class="n">esp</span><span class="p">.</span><span class="n">train</span> <span class="o">|</span> <span class="n">cut</span> <span class="o">-</span><span class="n">d</span> <span class="s">" "</span> <span class="o">-</span><span class="n">f</span> <span class="mi">1</span><span class="p">,</span><span class="mi">3</span> <span class="o">></span> <span class="n">train_temp</span><span class="p">.</span><span class="n">txt</span>
<span class="err">!</span><span class="n">cat</span> <span class="n">conll2002</span><span class="o">/</span><span class="n">esp</span><span class="p">.</span><span class="n">testa</span> <span class="o">|</span> <span class="n">cut</span> <span class="o">-</span><span class="n">d</span> <span class="s">" "</span> <span class="o">-</span><span class="n">f</span> <span class="mi">1</span><span class="p">,</span><span class="mi">3</span> <span class="o">></span> <span class="n">dev_temp</span><span class="p">.</span><span class="n">txt</span>
<span class="err">!</span><span class="n">cat</span> <span class="n">conll2002</span><span class="o">/</span><span class="n">esp</span><span class="p">.</span><span class="n">testb</span> <span class="o">|</span> <span class="n">cut</span> <span class="o">-</span><span class="n">d</span> <span class="s">" "</span> <span class="o">-</span><span class="n">f</span> <span class="mi">1</span><span class="p">,</span><span class="mi">3</span> <span class="o">></span> <span class="n">test_temp</span><span class="p">.</span><span class="n">txt</span>
</code></pre></div></div>
<h2 id="2-preprocessing">2. Preprocessing</h2>
<p>Let’s define some variables that we need for further pre-processing steps and training the model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MAX_LENGTH</span> <span class="o">=</span> <span class="mi">120</span> <span class="c1">#@param {type: "integer"}
</span><span class="n">MODEL</span> <span class="o">=</span> <span class="s">"chriskhanhtran/spanberta"</span> <span class="c1">#@param ["chriskhanhtran/spanberta", "bert-base-multilingual-cased"]
</span></code></pre></div></div>
<p>The script below will split sentences longer than <code class="language-plaintext highlighter-rouge">MAX_LENGTH</code> (in terms of tokens) into smaller ones. Otherwise, long sentences will be truncated when tokenized, causing the loss of training data and some tokens in the test set not being predicted.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">capture</span>
<span class="err">!</span><span class="n">wget</span> <span class="s">"https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"</span>
</code></pre></div></div>
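<p>For intuition, here is a simplified sketch (my own, not the actual script) of what <code class="language-plaintext highlighter-rouge">preprocess.py</code> does, assuming the input has one “WORD TAG” pair per line with blank lines between sentences: it counts subword tokens per sentence and inserts a sentence break before the count would exceed <code class="language-plaintext highlighter-rouge">MAX_LENGTH</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simplified sketch of the splitting logic in preprocess.py (illustrative only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("chriskhanhtran/spanberta")
max_len = 120 - tokenizer.num_special_tokens_to_add()  # reserve room for special tokens

subword_count = 0
with open("train_temp.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip()
        if not line:                       # blank line = sentence boundary
            print(line)
            subword_count = 0
            continue
        n_subwords = len(tokenizer.tokenize(line.split()[0]))
        if subword_count + n_subwords > max_len:
            print()                        # force a sentence break before overflowing
            subword_count = 0
        subword_count += n_subwords
        print(line)                        # pass the "WORD TAG" line through unchanged
</code></pre></div></div>
<p>With that intuition, we run the actual script on our three files:</p>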
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">python3</span> <span class="n">preprocess</span><span class="p">.</span><span class="n">py</span> <span class="n">train_temp</span><span class="p">.</span><span class="n">txt</span> <span class="err">$</span><span class="n">MODEL</span> <span class="err">$</span><span class="n">MAX_LENGTH</span> <span class="o">></span> <span class="n">train</span><span class="p">.</span><span class="n">txt</span>
<span class="err">!</span><span class="n">python3</span> <span class="n">preprocess</span><span class="p">.</span><span class="n">py</span> <span class="n">dev_temp</span><span class="p">.</span><span class="n">txt</span> <span class="err">$</span><span class="n">MODEL</span> <span class="err">$</span><span class="n">MAX_LENGTH</span> <span class="o">></span> <span class="n">dev</span><span class="p">.</span><span class="n">txt</span>
<span class="err">!</span><span class="n">python3</span> <span class="n">preprocess</span><span class="p">.</span><span class="n">py</span> <span class="n">test_temp</span><span class="p">.</span><span class="n">txt</span> <span class="err">$</span><span class="n">MODEL</span> <span class="err">$</span><span class="n">MAX_LENGTH</span> <span class="o">></span> <span class="n">test</span><span class="p">.</span><span class="n">txt</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2020-04-22 23:02:05.747294: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 1.03k/1.03k [00:00<00:00, 704kB/s]
Downloading: 100% 954k/954k [00:00<00:00, 1.89MB/s]
Downloading: 100% 512k/512k [00:00<00:00, 1.19MB/s]
Downloading: 100% 16.0/16.0 [00:00<00:00, 12.6kB/s]
2020-04-22 23:02:23.409488: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-22 23:02:31.168967: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
</code></pre></div></div>
<h2 id="3-labels">3. Labels</h2>
<p>In the CoNLL-2002/2003 datasets, there are 9 classes of NER tags (in IOB2, B- marks the first token of an entity and I- marks the following tokens):</p>
<ul>
<li>O, Outside of a named entity</li>
<li>B-MISC, Beginning of a miscellaneous entity</li>
<li>I-MISC, Inside a miscellaneous entity</li>
<li>B-PER, Beginning of a person’s name</li>
<li>I-PER, Inside a person’s name</li>
<li>B-ORG, Beginning of an organisation</li>
<li>I-ORG, Inside an organisation</li>
<li>B-LOC, Beginning of a location</li>
<li>I-LOC, Inside a location</li>
</ul>
<p>If your dataset has different labels or more labels than CoNLL-2002/2003 datasets, run the line below to get unique labels from your data and save them into <code class="language-plaintext highlighter-rouge">labels.txt</code>. This file will be used when we start fine-tuning our model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">cat</span> <span class="n">train</span><span class="p">.</span><span class="n">txt</span> <span class="n">dev</span><span class="p">.</span><span class="n">txt</span> <span class="n">test</span><span class="p">.</span><span class="n">txt</span> <span class="o">|</span> <span class="n">cut</span> <span class="o">-</span><span class="n">d</span> <span class="s">" "</span> <span class="o">-</span><span class="n">f</span> <span class="mi">2</span> <span class="o">|</span> <span class="n">grep</span> <span class="o">-</span><span class="n">v</span> <span class="s">"^$"</span><span class="o">|</span> <span class="n">sort</span> <span class="o">|</span> <span class="n">uniq</span> <span class="o">></span> <span class="n">labels</span><span class="p">.</span><span class="n">txt</span>
</code></pre></div></div>
<h1 id="fine-tuning-model">Fine-tuning Model</h1>
<p>These are the example scripts from the <code class="language-plaintext highlighter-rouge">transformers</code> repo that we will use to fine-tune our model for NER. On 04/21/2020, Hugging Face updated their example scripts to use a new <code class="language-plaintext highlighter-rouge">Trainer</code> class. To avoid any future conflicts, let’s use the version from before these updates.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">capture</span>
<span class="err">!</span><span class="n">wget</span> <span class="s">"https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/ner/run_ner.py"</span>
<span class="err">!</span><span class="n">wget</span> <span class="s">"https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/ner/utils_ner.py"</span>
</code></pre></div></div>
<p>Now it’s time for transfer learning. In my <a href="https://chriskhanhtran.github.io/posts/spanberta-bert-for-spanish-from-scratch/">previous blog post</a>, I have pretrained a RoBERTa language model on a very large Spanish corpus to predict masked words based on the context they are in. By doing that, the model has learned inherent properties of the language. I have uploaded the pretrained model to Hugging Face’s server. Now we will load the model and start fine-tuning it for the NER task.</p>
<p>Below are our training hyperparameters.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MAX_LENGTH</span> <span class="o">=</span> <span class="mi">128</span> <span class="c1">#@param {type: "integer"}
</span><span class="n">MODEL</span> <span class="o">=</span> <span class="s">"chriskhanhtran/spanberta"</span> <span class="c1">#@param ["chriskhanhtran/spanberta", "bert-base-multilingual-cased"]
</span><span class="n">OUTPUT_DIR</span> <span class="o">=</span> <span class="s">"spanberta-ner"</span> <span class="c1">#@param ["spanberta-ner", "bert-base-ml-ner"]
</span><span class="n">BATCH_SIZE</span> <span class="o">=</span> <span class="mi">32</span> <span class="c1">#@param {type: "integer"}
</span><span class="n">NUM_EPOCHS</span> <span class="o">=</span> <span class="mi">3</span> <span class="c1">#@param {type: "integer"}
</span><span class="n">SAVE_STEPS</span> <span class="o">=</span> <span class="mi">100</span> <span class="c1">#@param {type: "integer"}
</span><span class="n">LOGGING_STEPS</span> <span class="o">=</span> <span class="mi">100</span> <span class="c1">#@param {type: "integer"}
</span><span class="n">SEED</span> <span class="o">=</span> <span class="mi">42</span> <span class="c1">#@param {type: "integer"}
</span></code></pre></div></div>
<p>Let’s start training.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">python3</span> <span class="n">run_ner</span><span class="p">.</span><span class="n">py</span> \
<span class="o">--</span><span class="n">data_dir</span> <span class="p">.</span><span class="o">/</span> \
<span class="o">--</span><span class="n">model_type</span> <span class="n">bert</span> \
<span class="o">--</span><span class="n">labels</span> <span class="p">.</span><span class="o">/</span><span class="n">labels</span><span class="p">.</span><span class="n">txt</span> \
<span class="o">--</span><span class="n">model_name_or_path</span> <span class="err">$</span><span class="n">MODEL</span> \
<span class="o">--</span><span class="n">output_dir</span> <span class="err">$</span><span class="n">OUTPUT_DIR</span> \
<span class="o">--</span><span class="n">max_seq_length</span> <span class="err">$</span><span class="n">MAX_LENGTH</span> \
<span class="o">--</span><span class="n">num_train_epochs</span> <span class="err">$</span><span class="n">NUM_EPOCHS</span> \
<span class="o">--</span><span class="n">per_gpu_train_batch_size</span> <span class="err">$</span><span class="n">BATCH_SIZE</span> \
<span class="o">--</span><span class="n">save_steps</span> <span class="err">$</span><span class="n">SAVE_STEPS</span> \
<span class="o">--</span><span class="n">logging_steps</span> <span class="err">$</span><span class="n">LOGGING_STEPS</span> \
<span class="o">--</span><span class="n">seed</span> <span class="err">$</span><span class="n">SEED</span> \
<span class="o">--</span><span class="n">do_train</span> \
<span class="o">--</span><span class="n">do_eval</span> \
<span class="o">--</span><span class="n">do_predict</span> \
<span class="o">--</span><span class="n">overwrite_output_dir</span>
</code></pre></div></div>
<p>Performance on the dev set:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>04/21/2020 02:24:31 - INFO - __main__ - ***** Eval results *****
04/21/2020 02:24:31 - INFO - __main__ - f1 = 0.831027443864822
04/21/2020 02:24:31 - INFO - __main__ - loss = 0.1004064822183894
04/21/2020 02:24:31 - INFO - __main__ - precision = 0.8207885304659498
04/21/2020 02:24:31 - INFO - __main__ - recall = 0.8415250344510795
</code></pre></div></div>
<p>Performance on the test set:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>04/21/2020 02:24:48 - INFO - __main__ - ***** Eval results *****
04/21/2020 02:24:48 - INFO - __main__ - f1 = 0.8559533721898419
04/21/2020 02:24:48 - INFO - __main__ - loss = 0.06848683688204177
04/21/2020 02:24:48 - INFO - __main__ - precision = 0.845858475041141
04/21/2020 02:24:48 - INFO - __main__ - recall = 0.8662921348314607
</code></pre></div></div>
<p>Here are the TensorBoards of fine-tuning <a href="https://tensorboard.dev/experiment/Ggs7aCjWQ0exU2Nbp3pPlQ/#scalars&_smoothingWeight=0.265">spanberta</a> and <a href="https://tensorboard.dev/experiment/M9AXw2lORjeRzFZzEJOxkA/#scalars">bert-base-multilingual-cased</a> for 5 epochs. We can see that the models overfit the training data after 3 epochs.</p>
<p><img src="https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/img/spanberta-ner-tb-5.JPG" alt="" /></p>
<p><strong>Classification Report</strong></p>
<p>To understand how well our model actually performs, let’s load its predictions and examine the classification report.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">read_examples_from_file</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
<span class="s">"""Read words and labels from a CoNLL-2002/2003 data file.
Args:
file_path (str): path to NER data file.
Returns:
examples (dict): a dictionary with two keys: `words` (list of lists)
holding words in each sequence, and `labels` (list of lists) holding
corresponding labels.
"""</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">examples</span> <span class="o">=</span> <span class="p">{</span><span class="s">"words"</span><span class="p">:</span> <span class="p">[],</span> <span class="s">"labels"</span><span class="p">:</span> <span class="p">[]}</span>
<span class="n">words</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>
<span class="k">if</span> <span class="n">line</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"-DOCSTART-"</span><span class="p">)</span> <span class="ow">or</span> <span class="n">line</span> <span class="o">==</span> <span class="s">""</span> <span class="ow">or</span> <span class="n">line</span> <span class="o">==</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">:</span>
<span class="k">if</span> <span class="n">words</span><span class="p">:</span>
<span class="n">examples</span><span class="p">[</span><span class="s">"words"</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>
<span class="n">examples</span><span class="p">[</span><span class="s">"labels"</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
<span class="n">words</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">splits</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)</span>
<span class="n">words</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">splits</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">splits</span><span class="p">)</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">labels</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">splits</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="s">""</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># Examples could have no label for mode = "test"
</span> <span class="n">labels</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"O"</span><span class="p">)</span>
<span class="k">return</span> <span class="n">examples</span>
</code></pre></div></div>
<p>Read data and labels from the raw text files:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y_true</span> <span class="o">=</span> <span class="n">read_examples_from_file</span><span class="p">(</span><span class="s">"test.txt"</span><span class="p">)[</span><span class="s">"labels"</span><span class="p">]</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">read_examples_from_file</span><span class="p">(</span><span class="s">"spanberta-ner/test_predictions.txt"</span><span class="p">)[</span><span class="s">"labels"</span><span class="p">]</span>
</code></pre></div></div>
<p>Print the classification report:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">seqeval.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span> <span class="k">as</span> <span class="n">classification_report_seqeval</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report_seqeval</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> precision recall f1-score support
LOC 0.87 0.84 0.85 1084
ORG 0.82 0.87 0.85 1401
MISC 0.63 0.66 0.65 340
PER 0.94 0.96 0.95 735
micro avg 0.84 0.86 0.85 3560
macro avg 0.84 0.86 0.85 3560
</code></pre></div></div>
<p>The metrics we are seeing in this report are designed specifically for NLP tasks such as NER and POS tagging, in which all words of an entity need to be predicted correctly to be counted as one correct prediction. Therefore, the metrics in this classification report are much lower than in <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html">scikit-learn’s classification report</a>.</p>
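<p>A toy example (mine, not from the post) makes this concrete: if the model misses the second token of a two-token entity, seqeval counts the whole entity as wrong, while token-level scoring still credits the correct tokens.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Entity-level (seqeval) vs. token-level (scikit-learn) scoring on one sentence.
from seqeval.metrics import f1_score as entity_f1
from sklearn.metrics import f1_score as token_f1

y_true = [["B-PER", "I-PER", "O"]]
y_pred = [["B-PER", "O", "O"]]      # second token of the entity is missed

print(entity_f1(y_true, y_pred))                        # 0.0 - the whole entity counts as wrong
print(token_f1(y_true[0], y_pred[0], average="micro"))  # ~0.67 - 2 of 3 tokens are correct
</code></pre></div></div>
<p>Returning to our actual predictions, here is the token-level report:</p>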
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">y_true</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">y_pred</span><span class="p">)))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> precision recall f1-score support
B-LOC 0.88 0.85 0.86 1084
B-MISC 0.73 0.73 0.73 339
B-ORG 0.87 0.91 0.89 1400
B-PER 0.95 0.96 0.95 735
I-LOC 0.82 0.81 0.81 325
I-MISC 0.85 0.76 0.80 557
I-ORG 0.89 0.87 0.88 1104
I-PER 0.98 0.98 0.98 634
O 1.00 1.00 1.00 45355
accuracy 0.98 51533
macro avg 0.89 0.87 0.88 51533
weighted avg 0.98 0.98 0.98 51533
</code></pre></div></div>
<p>From the reports above, our model performs well at predicting person, location and organization entities. We will need more data for <code class="language-plaintext highlighter-rouge">MISC</code> entities to improve the model’s performance on them.</p>
<h1 id="pipeline">Pipeline</h1>
<p>After fine-tuning our models, we can share them with the community by following the tutorial on this <a href="https://huggingface.co/transformers/model_sharing.html">page</a>. Now we can start loading the fine-tuned model from Hugging Face’s server and use it to predict named entities in Spanish documents.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">pipeline</span><span class="p">,</span> <span class="n">AutoModelForTokenClassification</span><span class="p">,</span> <span class="n">AutoTokenizer</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForTokenClassification</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"skimai/spanberta-base-cased-ner-conll02"</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"skimai/spanberta-base-cased-ner-conll02"</span><span class="p">)</span>
<span class="n">ner_model</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">(</span><span class="s">'ner'</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">)</span>
</code></pre></div></div>
<p>The example below is obtained from <a href="https://laopinion.com/2020/04/19/secretario-del-tesoro-advierte-que-la-economia-de-estados-unidos-tardara-meses-en-recuperarse-tras-coronavirus/">La Opinión</a> and means “<em>The economic recovery of the United States after the coronavirus pandemic will be a matter of months, said Treasury Secretary Steven Mnuchin.</em>”</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sequence</span> <span class="o">=</span> <span class="s">"La recuperación económica de los Estados Unidos después de la "</span> \
<span class="s">"pandemia del coronavirus será cuestión de meses, afirmó el "</span> \
<span class="s">"Secretario del Tesoro, Steven Mnuchin."</span>
<span class="n">ner_model</span><span class="p">(</span><span class="n">sequence</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[{'entity': 'B-ORG', 'score': 0.9155661463737488, 'word': 'ĠEstados'},
{'entity': 'I-ORG', 'score': 0.800682544708252, 'word': 'ĠUnidos'},
{'entity': 'I-MISC', 'score': 0.5006815791130066, 'word': 'Ġcorona'},
{'entity': 'I-MISC', 'score': 0.510674774646759, 'word': 'virus'},
{'entity': 'B-PER', 'score': 0.5558510422706604, 'word': 'ĠSecretario'},
{'entity': 'I-PER', 'score': 0.7758238315582275, 'word': 'Ġdel'},
{'entity': 'I-PER', 'score': 0.7096233367919922, 'word': 'ĠTesoro'},
{'entity': 'B-PER', 'score': 0.9940345883369446, 'word': 'ĠSteven'},
{'entity': 'I-PER', 'score': 0.9962581992149353, 'word': 'ĠM'},
{'entity': 'I-PER', 'score': 0.9918380379676819, 'word': 'n'},
{'entity': 'I-PER', 'score': 0.9848328828811646, 'word': 'uch'},
{'entity': 'I-PER', 'score': 0.8513168096542358, 'word': 'in'}]
</code></pre></div></div>
<p>Looks great! The fine-tuned model successfully recognizes all entities in our example, and even recognizes “corona virus.”</p>
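<p>Note that the output is at the subword level: the “Ġ” prefix marks the start of a new word in byte-level BPE. If you want whole-word entities, a rough post-processing sketch like the one below (my own addition; newer versions of the pipeline can also group entities for you) merges consecutive pieces back into words.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Merge subword predictions into whole-word entities. Assumes byte-level BPE,
# where a leading "Ġ" marks a new word; the first piece's tag is kept.
def group_entities(predictions):
    entities = []
    for pred in predictions:
        word, tag = pred["word"], pred["entity"]
        if word.startswith("Ġ") or not entities:
            entities.append({"word": word.lstrip("Ġ"), "entity": tag})
        else:
            entities[-1]["word"] += word   # continuation piece of the previous word
    return entities

group_entities(ner_model(sequence))
# e.g. [..., {'word': 'Steven', 'entity': 'B-PER'}, {'word': 'Mnuchin', 'entity': 'I-PER'}]
</code></pre></div></div>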
<h1 id="conclusion">Conclusion</h1>
<p>Named-entity recognition can help us quickly extract important information from texts. Therefore, its application in business can have a direct impact on improving people’s productivity in reading contracts and documents. However, it is a challenging NLP task because NER requires accurate classification at the word level, making simple approaches such as bag-of-words unable to deal with this task.</p>
<p>We have walked through how we can leverage a pretrained BERT model to quickly gain excellent performance on the NER task for Spanish. The pretrained SpanBERTa model can also be fine-tuned for other tasks such as document classification. I have written a detailed tutorial to fine-tune BERT for sequence classification and sentiment analysis.</p>
<ul>
<li><a href="https://chriskhanhtran.github.io/posts/bert-for-sentiment-analysis/">Fine-tuning BERT for Sentiment Analysis</a></li>
</ul>
<p>Next in this series, we will discuss ELECTRA, a more efficient pre-training approach for transformer models which can quickly achieve state-of-the-art performance. Stay tuned!</p>Chris TranIn this blog post, to really leverage the power of transformer models, we will fine-tune SpanBERTa for a named-entity recognition task.SpanBERTa: Pre-train RoBERTa Language Model for Spanish from Scratch2020-04-07T00:00:00-04:002020-04-07T00:00:00-04:00https://chriskhanhtran.github.io/posts/spanberta-bert-for-spanish-from-scratch<p><img src="https://github.com/chriskhanhtran/spanish-bert/blob/master/img/part1.PNG?raw=true" alt="" /></p>
<p><a href="https://colab.research.google.com/drive/1mXWYYkB9UjRdklPVSDvAcUDralmv3Pgv"><img src="https://img.shields.io/badge/Colab-Run_in_Google_Colab-blue?logo=Google&logoColor=FDBA18" alt="Run in Google Colab" /></a></p>
<h1 id="introduction">Introduction</h1>
<ul>
<li><a href="https://chriskhanhtran.github.io/posts/named-entity-recognition-with-transformers/">Part II: Fine-tuning SpanBERTa for Named Entity Recognition</a></li>
</ul>
<p>Self-supervised training methods with transformer models have achieved state-of-the-art performance on most NLP tasks. However, because training them is computationally expensive, most currently available pretrained transformer models are only for English. Therefore, to improve performance on NLP tasks in our projects on Spanish, my team at <a href="https://skimai.com/"><strong>Skim AI</strong></a> decided to train a <strong>RoBERTa</strong> language model for Spanish from scratch and call it SpanBERTa.</p>
<p>SpanBERTa has the same size as RoBERTa-base. We followed RoBERTa’s training schema to train the model on 18 GB of <a href="https://traces1.inria.fr/oscar/">OSCAR</a>’s Spanish corpus in 8 days using 4 Tesla P100 GPUs.</p>
<p>In this blog post, we will walk through an end-to-end process to train a BERT-like language model from scratch using the <code class="language-plaintext highlighter-rouge">transformers</code> and <code class="language-plaintext highlighter-rouge">tokenizers</code> libraries by Hugging Face. There is also a Google Colab notebook to run the code in this article directly. You can also modify the notebook to train a BERT-like model for other languages or fine-tune it on your customized dataset.</p>
<p>Before moving on, I want to express a huge thank-you to the Hugging Face team for making state-of-the-art NLP models accessible to everyone.</p>
<h1 id="setup">Setup</h1>
<h2 id="1-install-dependencies">1. Install Dependencies</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">capture</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">uninstall</span> <span class="o">-</span><span class="n">y</span> <span class="n">tensorflow</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">transformers</span>
</code></pre></div></div>
<h2 id="2-data">2. Data</h2>
<p>We pretrained SpanBERTa on <a href="https://traces1.inria.fr/oscar/">OSCAR</a>’s Spanish corpus. The full size of the dataset is 150 GB and we used a portion of 18 GB to train.</p>
<p>In this example, for simplicity, we will use a dataset of Spanish movie subtitles from <a href="https://www.opensubtitles.org/en/search">OpenSubtitles</a>. This dataset has a size of 5.4 GB and we will train on a subset of ~300 MB.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="c1"># Download and unzip movie substitle dataset
</span><span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="s">'data/dataset.txt'</span><span class="p">):</span>
<span class="err">!</span><span class="n">wget</span> <span class="s">"https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz"</span> <span class="o">-</span><span class="n">O</span> <span class="n">dataset</span><span class="p">.</span><span class="n">txt</span><span class="p">.</span><span class="n">gz</span>
<span class="err">!</span><span class="n">gzip</span> <span class="o">-</span><span class="n">d</span> <span class="n">dataset</span><span class="p">.</span><span class="n">txt</span><span class="p">.</span><span class="n">gz</span>
<span class="err">!</span><span class="n">mkdir</span> <span class="n">data</span>
<span class="err">!</span><span class="n">mv</span> <span class="n">dataset</span><span class="p">.</span><span class="n">txt</span> <span class="n">data</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--2020-04-06 15:53:04-- https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/mono/es.txt.gz
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1859673728 (1.7G) [application/gzip]
Saving to: ‘dataset.txt.gz’
dataset.txt.gz 100%[===================>] 1.73G 17.0MB/s in 1m 46s
2020-04-06 15:54:51 (16.8 MB/s) - ‘dataset.txt.gz’ saved [1859673728/1859673728]
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Total number of lines and some random lines
</span><span class="err">!</span><span class="n">wc</span> <span class="o">-</span><span class="n">l</span> <span class="n">data</span><span class="o">/</span><span class="n">dataset</span><span class="p">.</span><span class="n">txt</span>
<span class="err">!</span><span class="n">shuf</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span> <span class="n">data</span><span class="o">/</span><span class="n">dataset</span><span class="p">.</span><span class="n">txt</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>179287150 data/dataset.txt
Sabes, pensé que tenías más pelotas que para enfrentarme a través de mi hermano.
Supe todos los encantamientos en todas las lenguas de los Elfos hombres y Orcos.
Anteriormente en Blue Bloods:
Y quiero que prometas que no habrá ningún trato con Daniel Stafford.
Fue comiquísimo.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Get a subset of first 10,000,000 lines for training
</span><span class="n">TRAIN_SIZE</span> <span class="o">=</span> <span class="mi">10000000</span> <span class="c1">#@param {type:"integer"}
</span><span class="err">!</span><span class="p">(</span><span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="err">$</span><span class="n">TRAIN_SIZE</span> <span class="n">data</span><span class="o">/</span><span class="n">dataset</span><span class="p">.</span><span class="n">txt</span><span class="p">)</span> <span class="o">></span> <span class="n">data</span><span class="o">/</span><span class="n">train</span><span class="p">.</span><span class="n">txt</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Get a subset of next 10,000 lines for validation
</span><span class="n">VAL_SIZE</span> <span class="o">=</span> <span class="mi">10000</span> <span class="c1">#@param {type:"integer"}
</span><span class="err">!</span><span class="p">(</span><span class="n">sed</span> <span class="o">-</span><span class="n">n</span> <span class="p">{</span><span class="n">TRAIN_SIZE</span> <span class="o">+</span> <span class="mi">1</span><span class="p">},{</span><span class="n">TRAIN_SIZE</span> <span class="o">+</span> <span class="n">VAL_SIZE</span><span class="p">}</span><span class="n">p</span> <span class="n">data</span><span class="o">/</span><span class="n">dataset</span><span class="p">.</span><span class="n">txt</span><span class="p">)</span> <span class="o">></span> <span class="n">data</span><span class="o">/</span><span class="n">dev</span><span class="p">.</span><span class="n">txt</span>
</code></pre></div></div>
<h2 id="3-train-a-tokenizer">3. Train a Tokenizer</h2>
<p>The original BERT implementation uses a WordPiece tokenizer with a vocabulary of 32K subword units. This method, however, can introduce “unknown” tokens when processing rare words.</p>
<p>In this implementation, we use a byte-level BPE tokenizer with a vocabulary of 50,265 subword units (same as RoBERTa-base). Using byte-level BPE makes it possible to learn a subword vocabulary of modest size that can encode any input without getting “unknown” tokens.</p>
<p>Because <code class="language-plaintext highlighter-rouge">ByteLevelBPETokenizer</code> produces 2 files <code class="language-plaintext highlighter-rouge">["vocab.json", "merges.txt"]</code> while <code class="language-plaintext highlighter-rouge">BertWordPieceTokenizer</code> produces only 1 file <code class="language-plaintext highlighter-rouge">vocab.txt</code>, it will cause an error if we use <code class="language-plaintext highlighter-rouge">BertWordPieceTokenizer</code> to load outputs of a BPE tokenizer.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">tokenizers</span> <span class="kn">import</span> <span class="n">ByteLevelBPETokenizer</span>
<span class="n">path</span> <span class="o">=</span> <span class="s">"data/train.txt"</span>
<span class="c1"># Initialize a tokenizer
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">ByteLevelBPETokenizer</span><span class="p">()</span>
<span class="c1"># Customize training
</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">files</span><span class="o">=</span><span class="n">path</span><span class="p">,</span>
<span class="n">vocab_size</span><span class="o">=</span><span class="mi">50265</span><span class="p">,</span>
<span class="n">min_frequency</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">special_tokens</span><span class="o">=</span><span class="p">[</span><span class="s">"<s>"</span><span class="p">,</span> <span class="s">"<pad>"</span><span class="p">,</span> <span class="s">"</s>"</span><span class="p">,</span> <span class="s">"<unk>"</span><span class="p">,</span> <span class="s">"<mask>"</span><span class="p">])</span>
<span class="c1"># Save files to disk
</span><span class="err">!</span><span class="n">mkdir</span> <span class="o">-</span><span class="n">p</span> <span class="s">"models/roberta"</span>
<span class="n">tokenizer</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">"models/roberta"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 1min 37s, sys: 1.02 s, total: 1min 38s
Wall time: 1min 38s
</code></pre></div></div>
<p>Super fast! It takes only 2 minutes to train on 10 million lines.</p>
<p><img src="https://github.com/chriskhanhtran/spanish-bert/blob/master/img/train_tokenizers.gif?raw=true" width="700" /></p>
<h1 id="traing-language-model-from-scratch">Traing Language Model from Scratch</h1>
<h2 id="1-model-architecture">1. Model Architecture</h2>
<p>RoBERTa has exactly the same architecture as BERT. The only differences are:</p>
<ul>
<li>RoBERTa uses a Byte-Level BPE tokenizer with a larger subword vocabulary (50k vs 32k).</li>
<li>RoBERTa implements dynamic word masking and drops the next sentence prediction task.</li>
<li>RoBERTa uses different training hyperparameters.</li>
</ul>
<p>Other architecture configurations can be found in the documentation (<a href="https://huggingface.co/transformers/_modules/transformers/configuration_roberta.html#RobertaConfig">RoBERTa</a>, <a href="https://huggingface.co/transformers/_modules/transformers/configuration_bert.html#BertConfig">BERT</a>).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="n">config</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"architectures"</span><span class="p">:</span> <span class="p">[</span>
<span class="s">"RobertaForMaskedLM"</span>
<span class="p">],</span>
<span class="s">"attention_probs_dropout_prob"</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
<span class="s">"hidden_act"</span><span class="p">:</span> <span class="s">"gelu"</span><span class="p">,</span>
<span class="s">"hidden_dropout_prob"</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
<span class="s">"hidden_size"</span><span class="p">:</span> <span class="mi">768</span><span class="p">,</span>
<span class="s">"initializer_range"</span><span class="p">:</span> <span class="mf">0.02</span><span class="p">,</span>
<span class="s">"intermediate_size"</span><span class="p">:</span> <span class="mi">3072</span><span class="p">,</span>
<span class="s">"layer_norm_eps"</span><span class="p">:</span> <span class="mf">1e-05</span><span class="p">,</span>
<span class="s">"max_position_embeddings"</span><span class="p">:</span> <span class="mi">514</span><span class="p">,</span>
<span class="s">"model_type"</span><span class="p">:</span> <span class="s">"roberta"</span><span class="p">,</span>
<span class="s">"num_attention_heads"</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span>
<span class="s">"num_hidden_layers"</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span>
<span class="s">"type_vocab_size"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="s">"vocab_size"</span><span class="p">:</span> <span class="mi">50265</span>
<span class="p">}</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"models/roberta/config.json"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
<span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span class="n">fp</span><span class="p">)</span>
<span class="n">tokenizer_config</span> <span class="o">=</span> <span class="p">{</span><span class="s">"max_len"</span><span class="p">:</span> <span class="mi">512</span><span class="p">}</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"models/roberta/tokenizer_config.json"</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
<span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">tokenizer_config</span><span class="p">,</span> <span class="n">fp</span><span class="p">)</span>
</code></pre></div></div>
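<p>As a quick sanity check (a minimal sketch, not part of the original notebook), we can load the saved files back with <code class="language-plaintext highlighter-rouge">transformers</code> to confirm they are well-formed:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: verify the config and tokenizer files we just saved are readable.
from transformers import RobertaConfig, RobertaTokenizer

config = RobertaConfig.from_pretrained("models/roberta")
print(config.num_hidden_layers, config.vocab_size)  # 12 50265

tokenizer = RobertaTokenizer.from_pretrained("models/roberta")
print(tokenizer.tokenize("Lavarse las manos."))
</code></pre></div></div>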
<h2 id="2-training-hyperparameters">2. Training Hyperparameters</h2>
<table>
<thead>
<tr>
<th>Hyperparam</th>
<th style="text-align: center">BERT-base</th>
<th style="text-align: center">RoBERTa-base</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sequence Length</td>
<td style="text-align: center">128, 512</td>
<td style="text-align: center">512</td>
</tr>
<tr>
<td>Batch Size</td>
<td style="text-align: center">256</td>
<td style="text-align: center">8K</td>
</tr>
<tr>
<td>Peak Learning Rate</td>
<td style="text-align: center">1e-4</td>
<td style="text-align: center">6e-4</td>
</tr>
<tr>
<td>Max Steps</td>
<td style="text-align: center">1M</td>
<td style="text-align: center">500K</td>
</tr>
<tr>
<td>Warmup Steps</td>
<td style="text-align: center">10K</td>
<td style="text-align: center">24K</td>
</tr>
<tr>
<td>Weight Decay</td>
<td style="text-align: center">0.01</td>
<td style="text-align: center">0.01</td>
</tr>
<tr>
<td>Adam \(\epsilon\)</td>
<td style="text-align: center">1e-6</td>
<td style="text-align: center">1e-6</td>
</tr>
<tr>
<td>Adam \(\beta_1\)</td>
<td style="text-align: center">0.9</td>
<td style="text-align: center">0.9</td>
</tr>
<tr>
<td>Adam \(\beta_2\)</td>
<td style="text-align: center">0.999</td>
<td style="text-align: center">0.98</td>
</tr>
<tr>
<td>Gradient Clipping</td>
<td style="text-align: center">0.0</td>
<td style="text-align: center">0.0</td>
</tr>
</tbody>
</table>
<p>Note that the batch size used to train RoBERTa is 8,000 sequences. Therefore, although RoBERTa-base was trained for only 500K steps, its training computational cost is roughly 16 times that of BERT-base. In the <a href="https://arxiv.org/pdf/1907.11692.pdf">RoBERTa paper</a>, it is shown that training with large batches improves perplexity for the masked language modeling objective, as well as end-task accuracy. A larger effective batch size can be obtained by tweaking <code class="language-plaintext highlighter-rouge">gradient_accumulation_steps</code>, as the sketch below shows.</p>
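<p>To make the “16 times” claim concrete, here is the back-of-the-envelope arithmetic, together with how gradient accumulation builds a large effective batch size (an illustrative sketch, not from the original notebook):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough compute comparison: batch size x steps ~ sequences processed.
bert_sequences = 256 * 1_000_000       # BERT-base: 256 x 1M steps
roberta_sequences = 8_000 * 500_000    # RoBERTa-base: 8K x 500K steps
print(roberta_sequences / bert_sequences)  # 15.625, i.e., ~16x

# Effective batch size with gradient accumulation on a single GPU:
per_gpu_train_batch_size = 16
gradient_accumulation_steps = 4
print(per_gpu_train_batch_size * gradient_accumulation_steps)  # 64
</code></pre></div></div>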
<p>Due to computational constraints, we followed BERT-base’s training schedule and trained our SpanBERTa model on 4 Tesla P100 GPUs for 200K steps over 8 days.</p>
<h2 id="3-start-training">3. Start Training</h2>
<p>We will train our model from scratch using <a href="https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py"><code class="language-plaintext highlighter-rouge">run_language_modeling.py</code></a>, a script provided by Hugging Face, which will preprocess and tokenize the corpus and then train the model on the <em>Masked Language Modeling</em> task. The script is optimized to train on a single big corpus. Therefore, if your dataset is large and you want to split it up and train on the parts sequentially, you will need to modify the script, or be ready to use a machine with very large memory.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">wget</span> <span class="o">-</span><span class="n">c</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">raw</span><span class="p">.</span><span class="n">githubusercontent</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">chriskhanhtran</span><span class="o">/</span><span class="n">spanish</span><span class="o">-</span><span class="n">bert</span><span class="o">/</span><span class="n">master</span><span class="o">/</span><span class="n">run_language_modeling</span><span class="p">.</span><span class="n">py</span>
</code></pre></div></div>
<p><strong>Important Arguments</strong></p>
<ul>
<li><code class="language-plaintext highlighter-rouge">--line_by_line</code> Whether distinct lines of text in the dataset are to be handled as distinct sequences. If each line in your dataset is long and has ~512 tokens or more, you should use this setting. If each line is short, the default text preprocessing will concatenate all lines, tokenize them and slit tokenized outputs into blocks of 512 tokens. You can also split your datasets into small chunks and preprocess them separately. 3GB of text will take ~50 minutes to process with the default <code class="language-plaintext highlighter-rouge">TextDataset</code> class.</li>
<li><code class="language-plaintext highlighter-rouge">--should_continue</code> Whether to continue from latest checkpoint in output_dir.</li>
<li><code class="language-plaintext highlighter-rouge">--model_name_or_path</code> The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.</li>
<li><code class="language-plaintext highlighter-rouge">--mlm</code> Train with masked-language modeling loss instead of language modeling.</li>
<li><code class="language-plaintext highlighter-rouge">--config_name, --tokenizer_name</code> Optional pretrained config and tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new config.</li>
<li><code class="language-plaintext highlighter-rouge">--per_gpu_train_batch_size</code> Batch size per GPU/CPU for training. Choose the largest number you can fit on your GPUs. You will see an error if your batch size is too large.</li>
<li><code class="language-plaintext highlighter-rouge">--gradient_accumulation_steps</code> Number of updates steps to accumulate before performing a backward/update pass. You can use this trick to increase batch size. For example, if <code class="language-plaintext highlighter-rouge">per_gpu_train_batch_size = 16</code> and <code class="language-plaintext highlighter-rouge">gradient_accumulation_steps = 4</code>, your total train batch size will be 64.</li>
<li><code class="language-plaintext highlighter-rouge">--overwrite_output_dir</code> Overwrite the content of the output directory.</li>
<li><code class="language-plaintext highlighter-rouge">--no_cuda, --fp16, --fp16_opt_level</code> Arguments for training on GPU/CPU.</li>
<li>Other arguments are model paths and training hyperparameters.</li>
</ul>
<p>It’s highly recommended to include the model type (e.g., “roberta”, “bert”, “gpt2”, etc.) in the model path because the script uses the <a href="https://huggingface.co/transformers/model_doc/auto.html?highlight=automodels"><code class="language-plaintext highlighter-rouge">AutoModels</code></a> classes to guess the model’s configuration using pattern matching on the provided path.</p>
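<p>For example, here is a minimal sketch of how the <code class="language-plaintext highlighter-rouge">Auto*</code> classes resolve our model directory. <code class="language-plaintext highlighter-rouge">AutoConfig</code> can also read the <code class="language-plaintext highlighter-rouge">model_type</code> field from <code class="language-plaintext highlighter-rouge">config.json</code>, but keeping “roberta” in the path is a safe habit:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: the Auto* classes pick the RoBERTa implementations for our directory.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("models/roberta")
print(type(config).__name__)     # RobertaConfig

tokenizer = AutoTokenizer.from_pretrained("models/roberta")
print(type(tokenizer).__name__)  # RobertaTokenizer
</code></pre></div></div>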
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Model paths
</span><span class="n">MODEL_TYPE</span> <span class="o">=</span> <span class="s">"roberta"</span> <span class="c1">#@param ["roberta", "bert"]
</span><span class="n">MODEL_DIR</span> <span class="o">=</span> <span class="s">"models/roberta"</span> <span class="c1">#@param {type: "string"}
</span><span class="n">OUTPUT_DIR</span> <span class="o">=</span> <span class="s">"models/roberta/output"</span> <span class="c1">#@param {type: "string"}
</span><span class="n">TRAIN_PATH</span> <span class="o">=</span> <span class="s">"data/train.txt"</span> <span class="c1">#@param {type: "string"}
</span><span class="n">EVAL_PATH</span> <span class="o">=</span> <span class="s">"data/dev.txt"</span> <span class="c1">#@param {type: "string"}
</span>
</code></pre></div></div>
<p>For this example, we will train for only 25 steps on a Tesla P4 GPU provided by Colab.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">nvidia</span><span class="o">-</span><span class="n">smi</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Mon Apr 6 15:59:35 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P4 Off | 00000000:00:04.0 Off | 0 |
| N/A 31C P8 7W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Command line
</span><span class="n">cmd</span> <span class="o">=</span> <span class="s">"""python run_language_modeling.py </span><span class="se">\
</span><span class="s"> --output_dir {output_dir} </span><span class="se">\
</span><span class="s"> --model_type {model_type} </span><span class="se">\
</span><span class="s"> --mlm </span><span class="se">\
</span><span class="s"> --config_name {config_name} </span><span class="se">\
</span><span class="s"> --tokenizer_name {tokenizer_name} </span><span class="se">\
</span><span class="s"> {line_by_line} </span><span class="se">\
</span><span class="s"> {should_continue} </span><span class="se">\
</span><span class="s"> {model_name_or_path} </span><span class="se">\
</span><span class="s"> --train_data_file {train_path} </span><span class="se">\
</span><span class="s"> --eval_data_file {eval_path} </span><span class="se">\
</span><span class="s"> --do_train </span><span class="se">\
</span><span class="s"> {do_eval} </span><span class="se">\
</span><span class="s"> {evaluate_during_training} </span><span class="se">\
</span><span class="s"> --overwrite_output_dir </span><span class="se">\
</span><span class="s"> --block_size 512 </span><span class="se">\
</span><span class="s"> --max_step 25 </span><span class="se">\
</span><span class="s"> --warmup_steps 10 </span><span class="se">\
</span><span class="s"> --learning_rate 5e-5 </span><span class="se">\
</span><span class="s"> --per_gpu_train_batch_size 4 </span><span class="se">\
</span><span class="s"> --gradient_accumulation_steps 4 </span><span class="se">\
</span><span class="s"> --weight_decay 0.01 </span><span class="se">\
</span><span class="s"> --adam_epsilon 1e-6 </span><span class="se">\
</span><span class="s"> --max_grad_norm 100.0 </span><span class="se">\
</span><span class="s"> --save_total_limit 10 </span><span class="se">\
</span><span class="s"> --save_steps 10 </span><span class="se">\
</span><span class="s"> --logging_steps 2 </span><span class="se">\
</span><span class="s"> --seed 42
"""</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Arguments for training from scratch. I turn off evaluate_during_training,
# line_by_line, should_continue, and model_name_or_path.
</span><span class="n">train_params</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"output_dir"</span><span class="p">:</span> <span class="n">OUTPUT_DIR</span><span class="p">,</span>
<span class="s">"model_type"</span><span class="p">:</span> <span class="n">MODEL_TYPE</span><span class="p">,</span>
<span class="s">"config_name"</span><span class="p">:</span> <span class="n">MODEL_DIR</span><span class="p">,</span>
<span class="s">"tokenizer_name"</span><span class="p">:</span> <span class="n">MODEL_DIR</span><span class="p">,</span>
<span class="s">"train_path"</span><span class="p">:</span> <span class="n">TRAIN_PATH</span><span class="p">,</span>
<span class="s">"eval_path"</span><span class="p">:</span> <span class="n">EVAL_PATH</span><span class="p">,</span>
<span class="s">"do_eval"</span><span class="p">:</span> <span class="s">"--do_eval"</span><span class="p">,</span>
<span class="s">"evaluate_during_training"</span><span class="p">:</span> <span class="s">""</span><span class="p">,</span>
<span class="s">"line_by_line"</span><span class="p">:</span> <span class="s">""</span><span class="p">,</span>
<span class="s">"should_continue"</span><span class="p">:</span> <span class="s">""</span><span class="p">,</span>
<span class="s">"model_name_or_path"</span><span class="p">:</span> <span class="s">""</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div></div>
<p>If you are training on a virtual machine, you can install TensorBoard to monitor the training process. Here is our <a href="https://tensorboard.dev/experiment/4wOFJBwPRBK9wjKE6F32qQ/#scalars">TensorBoard</a> for training SpanBERTa.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span><span class="nv">tensorboard</span><span class="o">==</span>2.1.0
tensorboard dev upload <span class="nt">--logdir</span> runs
</code></pre></div></div>
<p><img src="https://github.com/chriskhanhtran/spanish-bert/blob/master/img/tensorboard-spanberta.JPG?raw=true" width="400" /></p>
<p><em>After 200k steps, the loss reached 1.8 and the perplexity reached 5.2.</em></p>
<p>Now let’s start training!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="p">{</span><span class="n">cmd</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="o">**</span><span class="n">train_params</span><span class="p">)}</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>04/06/2020 15:59:41 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
04/06/2020 15:59:41 - INFO - transformers.configuration_utils - loading configuration file models/roberta/config.json
04/06/2020 15:59:41 - INFO - transformers.configuration_utils - Model config RobertaConfig {
04/06/2020 15:59:41 - INFO - transformers.configuration_utils - loading configuration file models/roberta/config.json
04/06/2020 15:59:41 - INFO - transformers.configuration_utils - Model config RobertaConfig {
04/06/2020 15:59:41 - INFO - transformers.tokenization_utils - Model name 'models/roberta' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'models/roberta' is a path, a model identifier, or url to a directory containing tokenizer files.
04/06/2020 15:59:41 - INFO - transformers.tokenization_utils - loading file models/roberta/vocab.json
04/06/2020 15:59:41 - INFO - transformers.tokenization_utils - loading file models/roberta/merges.txt
04/06/2020 15:59:41 - INFO - transformers.tokenization_utils - loading file models/roberta/tokenizer_config.json
04/06/2020 15:59:41 - INFO - __main__ - Training new model from scratch
04/06/2020 15:59:55 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-06, block_size=512, cache_dir=None, config_name='models/roberta', device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='data/dev.txt', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=4, learning_rate=5e-05, line_by_line=False, local_rank=-1, logging_steps=2, max_grad_norm=100.0, max_steps=25, mlm=True, mlm_probability=0.15, model_name_or_path=None, model_type='roberta', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='models/roberta/output', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, save_steps=10, save_total_limit=10, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='models/roberta', train_data_file='data/train.txt', warmup_steps=10, weight_decay=0.01)
04/06/2020 15:59:55 - INFO - __main__ - Creating features from dataset file at data
04/06/2020 16:04:43 - INFO - __main__ - Saving features into cached file data/roberta_cached_lm_510_train.txt
04/06/2020 16:04:46 - INFO - __main__ - ***** Running training *****
04/06/2020 16:04:46 - INFO - __main__ - Num examples = 165994
04/06/2020 16:04:46 - INFO - __main__ - Num Epochs = 1
04/06/2020 16:04:46 - INFO - __main__ - Instantaneous batch size per GPU = 4
04/06/2020 16:04:46 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
04/06/2020 16:04:46 - INFO - __main__ - Gradient Accumulation steps = 4
04/06/2020 16:04:46 - INFO - __main__ - Total optimization steps = 25
Epoch: 0% 0/1 [00:00<?, ?it/s]
Iteration: 0% 0/41499 [00:00<?, ?it/s][A
Iteration: 0% 1/41499 [00:01<13:18:02, 1.15s/it][A
Iteration: 0% 2/41499 [00:01<11:26:47, 1.01it/s][A
Iteration: 0% 3/41499 [00:02<10:10:30, 1.13it/s][A
Iteration: 0% 4/41499 [00:03<9:38:10, 1.20it/s] [A
Iteration: 0% 5/41499 [00:03<8:52:44, 1.30it/s][A
Iteration: 0% 6/41499 [00:04<8:22:47, 1.38it/s][A
Iteration: 0% 7/41499 [00:04<8:00:55, 1.44it/s][A
Iteration: 0% 8/41499 [00:05<8:03:40, 1.43it/s][A
Iteration: 0% 9/41499 [00:06<7:46:57, 1.48it/s][A
Iteration: 0% 10/41499 [00:06<7:35:35, 1.52it/s][A
Epoch: 0% 0/1 [01:25<?, ?it/s]
04/06/2020 16:06:11 - INFO - __main__ - global_step = 26, average loss = 9.355212138249325
04/06/2020 16:06:11 - INFO - __main__ - Saving model checkpoint to models/roberta/output
04/06/2020 16:06:18 - INFO - transformers.modeling_utils - loading weights file models/roberta/output/pytorch_model.bin
04/06/2020 16:06:23 - INFO - __main__ - Creating features from dataset file at data
04/06/2020 16:06:23 - INFO - __main__ - Saving features into cached file data/roberta_cached_lm_510_dev.txt
04/06/2020 16:06:23 - INFO - __main__ - ***** Running evaluation *****
04/06/2020 16:06:23 - INFO - __main__ - Num examples = 156
04/06/2020 16:06:23 - INFO - __main__ - Batch size = 4
Evaluating: 100% 39/39 [00:08<00:00, 4.41it/s]
04/06/2020 16:06:32 - INFO - __main__ - ***** Eval results *****
04/06/2020 16:06:32 - INFO - __main__ - perplexity = tensor(6077.6812)
</code></pre></div></div>
<h2 id="4-predict-masked-words">4. Predict Masked Words</h2>
<p>After training your language model, you can upload and share it with the community. We have uploaded our SpanBERTa model to Hugging Face’s server. Before evaluating the model on downstream tasks, let’s see how it has learned to fill in masked words given a context.</p>
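<p>At the time of writing, uploading was done with the <code class="language-plaintext highlighter-rouge">transformers-cli</code>; the exact commands may have changed since, so treat this as a sketch and check the current Hugging Face documentation:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Log in with your Hugging Face credentials, then upload the output directory.
transformers-cli login
transformers-cli upload models/roberta/output
</code></pre></div></div>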
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">capture</span>
<span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">pipeline</span>
<span class="n">fill_mask</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">(</span>
<span class="s">"fill-mask"</span><span class="p">,</span>
<span class="n">model</span><span class="o">=</span><span class="s">"chriskhanhtran/spanberta"</span><span class="p">,</span>
<span class="n">tokenizer</span><span class="o">=</span><span class="s">"chriskhanhtran/spanberta"</span>
<span class="p">)</span>
</code></pre></div></div>
<p>I pick a sentence from Wikipedia’s article about COVID-19.</p>
<p>The original sentence is “<em>Lavarse frecuentemente las manos con agua y jabón,</em>” meaning “<em>Frequently wash your hands with soap and water.</em>”</p>
<p>The masked word is <strong>“jabón” (soap)</strong> and the top 5 predictions are <strong>soap, salt, steam, lemon</strong> and <strong>vinegar</strong>. It is interesting that the model has somehow learned that we should wash our hands with substances that can kill bacteria or contain acid.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fill_mask</span><span class="p">(</span><span class="s">"Lavarse frecuentemente las manos con agua y <mask>."</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[{'score': 0.6469631195068359,
'sequence': '<s> Lavarse frecuentemente las manos con agua y jabón.</s>',
'token': 18493},
{'score': 0.06074320897459984,
'sequence': '<s> Lavarse frecuentemente las manos con agua y sal.</s>',
'token': 619},
{'score': 0.029787985607981682,
'sequence': '<s> Lavarse frecuentemente las manos con agua y vapor.</s>',
'token': 11079},
{'score': 0.026410052552819252,
'sequence': '<s> Lavarse frecuentemente las manos con agua y limón.</s>',
'token': 12788},
{'score': 0.017029203474521637,
'sequence': '<s> Lavarse frecuentemente las manos con agua y vinagre.</s>',
'token': 18424}]
</code></pre></div></div>
<h1 id="conclusion">Conclusion</h1>
<p>We have walked through how to train a BERT language model for Spanish from scratch and seen that the model has learned properties of the language by trying to predict masked words given a context. You can also follow this article to fine-tune a pretrained BERT-like model on your customized dataset.</p>
<p>Next, we will implement the pretrained models on downstream tasks including Sequence Classification, NER, POS tagging, and NLI, as well as compare the model’s performance with some non-BERT models.</p>
<p>Stay tuned for our next posts!</p>
Chris Tran
Self-training methods with transformer models have achieved state-of-the-art performance on most NLP tasks. However, because training them is computationally expensive, most currently available pretrained transformer models are only for English.
A Complete Guide to CNN for Sentence Classification with PyTorch2020-02-01T00:00:00-05:002020-02-01T00:00:00-05:00https://chriskhanhtran.github.io/posts/cnn-sentence-classification
<p><a href="https://colab.research.google.com/drive/1b7aZamr065WPuLpq9C4RU6irB59gbX_K"><img src="https://img.shields.io/badge/Colab-Run_in_Google_Colab-blue?logo=Google&logoColor=FDBA18" alt="Run in Google Colab" /></a></p>
<p><strong>Convolutional Neural Networks (CNN)</strong> were originally invented for computer vision and are now the building blocks of state-of-the-art CV models. One of the earliest applications of CNN in Natural Language Processing was introduced in the paper <strong><em>Convolutional Neural Networks for Sentence Classification</em></strong> (Kim, 2014). With the same idea as in computer vision, the CNN model is used as a feature extractor that encodes semantic features of sentences before these features are fed to a classifier.</p>
<p>With only a simple one-layer CNN trained on top of pretrained word vectors and little hyperparameter tuning, the model achieves excellent results on multiple sentence-level classification tasks. CNN models are now widely used in other NLP tasks such as translation and question answering as part of more complex architectures.</p>
<p>When implementing the original paper (Kim, 2014) in PyTorch, I needed to put many pieces together to complete the project. This article serves as a complete guide to CNN for sentence classification tasks, accompanied by advice for practitioners. It will cover:</p>
<ul>
<li>Tokenizing and building a vocabulary from text data</li>
<li>Loading pretrained fastText word vectors and creating an embedding layer for fine-tuning</li>
<li>Building and training CNN model with PyTorch</li>
<li>Advice for practitioners</li>
<li>Bonus: Using Skorch as a scikit-learn-style wrapper for PyTorch’s deep learning models</li>
</ul>
<p><strong>Reference:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/1408.5882">Convolutional Neural Networks for Sentence Classification</a> (Kim, 2014).</li>
<li><a href="https://arxiv.org/abs/1510.03820">A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification</a> (Zhang, 2015).</li>
<li><a href="https://arxiv.org/abs/1712.09405">Advances in Pre-Training Distributed Word Representations</a> (Mikolov, 2018).</li>
</ul>
<h2 id="1-setup">1. Setup</h2>
<h3 id="11-import-libraries">1.1. Import Libraries</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">nltk</span>
<span class="n">nltk</span><span class="p">.</span><span class="n">download</span><span class="p">(</span><span class="s">"all"</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div></div>
<h3 id="12-download-datasets">1.2. Download Datasets</h3>
<p>The dataset we will use is Movie Review (MR), a sentence polarity dataset from (Pang and Lee, 2005). The dataset has 5331 positive and 5331 negative processed sentences/snippets.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">URL</span> <span class="o">=</span> <span class="s">'https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'</span>
<span class="c1"># Download Datasets
</span><span class="err">!</span><span class="n">wget</span> <span class="o">-</span><span class="n">P</span> <span class="s">'Data/'</span> <span class="err">$</span><span class="n">URL</span>
<span class="c1"># Unzip
</span><span class="err">!</span><span class="n">tar</span> <span class="n">xvzf</span> <span class="s">'Data/rt-polaritydata.tar.gz'</span> <span class="o">-</span><span class="n">C</span> <span class="s">'Data/'</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">load_text</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
<span class="s">"""Load text data, lowercase text and save to a list."""</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">texts</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">f</span><span class="p">:</span>
<span class="n">texts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">errors</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">).</span><span class="n">lower</span><span class="p">().</span><span class="n">strip</span><span class="p">())</span>
<span class="k">return</span> <span class="n">texts</span>
<span class="c1"># Load files
</span><span class="n">neg_text</span> <span class="o">=</span> <span class="n">load_text</span><span class="p">(</span><span class="s">'Data/rt-polaritydata/rt-polarity.neg'</span><span class="p">)</span>
<span class="n">pos_text</span> <span class="o">=</span> <span class="n">load_text</span><span class="p">(</span><span class="s">'Data/rt-polaritydata/rt-polarity.pos'</span><span class="p">)</span>
<span class="c1"># Concatenate and label data
</span><span class="n">texts</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">neg_text</span> <span class="o">+</span> <span class="n">pos_text</span><span class="p">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">neg_text</span><span class="p">)</span> <span class="o">+</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">pos_text</span><span class="p">))</span>
</code></pre></div></div>
<h3 id="13-download-fasttext-word-vectors">1.3. Download fastText Word Vectors</h3>
<p>The pretrained word vectors used in the original paper were trained by <em>word2vec</em> (Mikolov et al., 2013) on 100 billion tokens of Google News. In this tutorial, we will use <a href="https://fasttext.cc/docs/en/english-vectors.html"><em>fastText</em> pretrained word vectors</a> (Mikolov et al., 2017), trained on 600 billion tokens of Common Crawl. <em>fastText</em> is an upgraded version of <em>word2vec</em> that also leverages subword information, and its pretrained vectors outperform earlier publicly available word vectors on many benchmarks.</p>
<p>The code below will download fastText pretrained vectors. Using Google Colab, the running time is approximately 3min 30s.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span>
<span class="n">URL</span> <span class="o">=</span> <span class="s">"https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip"</span>
<span class="n">FILE</span> <span class="o">=</span> <span class="s">"fastText"</span>
<span class="k">if</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">isdir</span><span class="p">(</span><span class="n">FILE</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="s">"fastText exists."</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="err">!</span><span class="n">wget</span> <span class="o">-</span><span class="n">P</span> <span class="err">$</span><span class="n">FILE</span> <span class="err">$</span><span class="n">URL</span>
<span class="err">!</span><span class="n">unzip</span> <span class="err">$</span><span class="n">FILE</span><span class="o">/</span><span class="n">crawl</span><span class="o">-</span><span class="mi">300</span><span class="n">d</span><span class="o">-</span><span class="mi">2</span><span class="n">M</span><span class="p">.</span><span class="n">vec</span><span class="p">.</span><span class="nb">zip</span> <span class="o">-</span><span class="n">d</span> <span class="err">$</span><span class="n">FILE</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>crawl-300d-2M.vec.z 100%[===================>] 1.42G 23.8MB/s in 62s
2020-02-01 00:40:43 (23.3 MB/s) - ‘fastText/crawl-300d-2M.vec.zip’ saved [1523785255/1523785255]
Archive: fastText/crawl-300d-2M.vec.zip
inflating: fastText/crawl-300d-2M.vec
</code></pre></div></div>
<h3 id="14-use-gpu-for-training">1.4. Use GPU for Training</h3>
<p>Google Colab offers free GPUs and TPUs. Since we’ll be training a large neural network, it’s best to take advantage of these features.</p>
<p>A GPU can be added by going to the menu and selecting:</p>
<blockquote>
<p>Runtime -> Change runtime type -> Hardware accelerator: GPU</p>
</blockquote>
<p>Then we need to run the following cell to specify the GPU as the device.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cuda"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'There are </span><span class="si">{</span><span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">device_count</span><span class="p">()</span><span class="si">}</span><span class="s"> GPU(s) available.'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Device name:'</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">get_device_name</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">'No GPU available, using the CPU instead.'</span><span class="p">)</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cpu"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>There are 1 GPU(s) available.
Device name: Tesla T4
</code></pre></div></div>
<h2 id="2-data-preparation">2. Data Preparation</h2>
<p>To prepare our text data for training, first we need to tokenize our sentences and build a vocabulary dictionary <code class="language-plaintext highlighter-rouge">word2idx</code>, which will later be used to convert our tokens into indexes and build an embedding layer.</p>
<p><strong><em>So, what is an embedding layer?</em></strong></p>
<p>An embedding layer serves as a look-up table which takes words’ indexes in the vocabulary as inputs and outputs word vectors. Hence, the embedding layer has shape \((N, d)\) where \(N\) is the size of the vocabulary and \(d\) is the embedding dimension. In order to fine-tune pretrained word vectors, we need to create an embedding layer in our <code class="language-plaintext highlighter-rouge">nn.Module</code> class, as sketched below. Our input to the model will then be <code class="language-plaintext highlighter-rouge">input_ids</code>, which are the tokens’ indexes in the vocabulary.</p>
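<p>Here is a minimal sketch of such a layer, using <code class="language-plaintext highlighter-rouge">nn.Embedding.from_pretrained</code> with a toy embedding matrix (the real matrix is built in Section 2.2):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

# Toy pretrained matrix: vocabulary of 4 tokens, embedding dimension 3.
embeddings = torch.randn(4, 3)

# freeze=False allows the word vectors to be fine-tuned during training.
embedding_layer = nn.Embedding.from_pretrained(embeddings, freeze=False)

input_ids = torch.tensor([[1, 2, 3, 0]])  # (batch_size, max_len)
x_embed = embedding_layer(input_ids)
print(x_embed.shape)                      # torch.Size([1, 4, 3])
</code></pre></div></div>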
<h3 id="21-tokenize">2.1. Tokenize</h3>
<p>The function <code class="language-plaintext highlighter-rouge">tokenize</code> will tokenize our sentences, build a vocabulary and find the maximum sentence length. The function <code class="language-plaintext highlighter-rouge">encode</code> will take outputs of <code class="language-plaintext highlighter-rouge">tokenize</code> as inputs, perform sentence padding and return <code class="language-plaintext highlighter-rouge">input_ids</code> as a numpy array.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">nltk.tokenize</span> <span class="kn">import</span> <span class="n">word_tokenize</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="k">def</span> <span class="nf">tokenize</span><span class="p">(</span><span class="n">texts</span><span class="p">):</span>
<span class="s">"""Tokenize texts, build vocabulary and find maximum sentence length.
Args:
texts (List[str]): List of text data
Returns:
tokenized_texts (List[List[str]]): List of list of tokens
word2idx (Dict): Vocabulary built from the corpus
max_len (int): Maximum sentence length
"""</span>
<span class="n">max_len</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">tokenized_texts</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">word2idx</span> <span class="o">=</span> <span class="p">{}</span>
<span class="c1"># Add <pad> and <unk> tokens to the vocabulary
</span> <span class="n">word2idx</span><span class="p">[</span><span class="s">'<pad>'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">word2idx</span><span class="p">[</span><span class="s">'<unk>'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="c1"># Building our vocab from the corpus starting from index 2
</span> <span class="n">idx</span> <span class="o">=</span> <span class="mi">2</span>
<span class="k">for</span> <span class="n">sent</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">:</span>
<span class="n">tokenized_sent</span> <span class="o">=</span> <span class="n">word_tokenize</span><span class="p">(</span><span class="n">sent</span><span class="p">)</span>
<span class="c1"># Add `tokenized_sent` to `tokenized_texts`
</span> <span class="n">tokenized_texts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">tokenized_sent</span><span class="p">)</span>
<span class="c1"># Add new token to `word2idx`
</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">tokenized_sent</span><span class="p">:</span>
<span class="k">if</span> <span class="n">token</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">word2idx</span><span class="p">:</span>
<span class="n">word2idx</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="o">=</span> <span class="n">idx</span>
<span class="n">idx</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="c1"># Update `max_len`
</span> <span class="n">max_len</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_len</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokenized_sent</span><span class="p">))</span>
<span class="k">return</span> <span class="n">tokenized_texts</span><span class="p">,</span> <span class="n">word2idx</span><span class="p">,</span> <span class="n">max_len</span>
<span class="k">def</span> <span class="nf">encode</span><span class="p">(</span><span class="n">tokenized_texts</span><span class="p">,</span> <span class="n">word2idx</span><span class="p">,</span> <span class="n">max_len</span><span class="p">):</span>
<span class="s">"""Pad each sentence to the maximum sentence length and encode tokens to
their index in the vocabulary.
Returns:
input_ids (np.array): Array of token indexes in the vocabulary with
shape (N, max_len). It will be the input of our CNN model.
"""</span>
<span class="n">input_ids</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">tokenized_sent</span> <span class="ow">in</span> <span class="n">tokenized_texts</span><span class="p">:</span>
<span class="c1"># Pad sentences to max_len
</span> <span class="n">tokenized_sent</span> <span class="o">+=</span> <span class="p">[</span><span class="s">'<pad>'</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">max_len</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokenized_sent</span><span class="p">))</span>
<span class="c1"># Encode tokens to input_ids
</span> <span class="n">input_id</span> <span class="o">=</span> <span class="p">[</span><span class="n">word2idx</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">token</span><span class="p">)</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">tokenized_sent</span><span class="p">]</span>
<span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">input_id</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="22-load-pretrained-vectors">2.2. Load Pretrained Vectors</h3>
<p>We will load the pretrained vectors for each token in our vocabulary. For tokens with no pretrained vectors, we will initialize random word vectors with the same dimension and variance.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm_notebook</span>
<span class="k">def</span> <span class="nf">load_pretrained_vectors</span><span class="p">(</span><span class="n">word2idx</span><span class="p">,</span> <span class="n">fname</span><span class="p">):</span>
<span class="s">"""Load pretrained vectors and create embedding layers.
Args:
word2idx (Dict): Vocabulary built from the corpus
fname (str): Path to pretrained vector file
Returns:
embeddings (np.array): Embedding matrix with shape (N, d) where N is
the size of word2idx and d is embedding dimension
"""</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Loading pretrained vectors..."</span><span class="p">)</span>
<span class="n">fin</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">fname</span><span class="p">,</span> <span class="s">'r'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">'ignore'</span><span class="p">)</span>
<span class="n">n</span><span class="p">,</span> <span class="n">d</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">().</span><span class="n">split</span><span class="p">())</span>
<span class="c1"># Initilize random embeddings
</span> <span class="n">embeddings</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">word2idx</span><span class="p">),</span> <span class="n">d</span><span class="p">))</span>
<span class="n">embeddings</span><span class="p">[</span><span class="n">word2idx</span><span class="p">[</span><span class="s">'<pad>'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">d</span><span class="p">,))</span>
<span class="c1"># Load pretrained vectors
</span> <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">tqdm_notebook</span><span class="p">(</span><span class="n">fin</span><span class="p">):</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">().</span><span class="n">split</span><span class="p">(</span><span class="s">' '</span><span class="p">)</span>
<span class="n">word</span> <span class="o">=</span> <span class="n">tokens</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">if</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">word2idx</span><span class="p">:</span>
<span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">embeddings</span><span class="p">[</span><span class="n">word2idx</span><span class="p">[</span><span class="n">word</span><span class="p">]]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">tokens</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"There are </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s"> / </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">word2idx</span><span class="p">)</span><span class="si">}</span><span class="s"> pretrained vectors found."</span><span class="p">)</span>
<span class="k">return</span> <span class="n">embeddings</span>
</code></pre></div></div>
<p>Now let’s put the above steps together.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Tokenize, build vocabulary, encode tokens
</span><span class="k">print</span><span class="p">(</span><span class="s">"Tokenizing...</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="n">tokenized_texts</span><span class="p">,</span> <span class="n">word2idx</span><span class="p">,</span> <span class="n">max_len</span> <span class="o">=</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">texts</span><span class="p">)</span>
<span class="n">input_ids</span> <span class="o">=</span> <span class="n">encode</span><span class="p">(</span><span class="n">tokenized_texts</span><span class="p">,</span> <span class="n">word2idx</span><span class="p">,</span> <span class="n">max_len</span><span class="p">)</span>
<span class="c1"># Load pretrained vectors
</span><span class="n">embeddings</span> <span class="o">=</span> <span class="n">load_pretrained_vectors</span><span class="p">(</span><span class="n">word2idx</span><span class="p">,</span> <span class="s">"fastText/crawl-300d-2M.vec"</span><span class="p">)</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">embeddings</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tokenizing...
Loading pretrained vectors...
There are 18526 / 20286 pretrained vectors found.
</code></pre></div></div>
<h3 id="23-create-pytorch-dataloader">2.3. Create PyTorch DataLoader</h3>
<p>We will create an iterator for our dataset using the torch DataLoader class. This will help save on memory during training and boost the training speed. The batch size used in the paper is 50.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="p">(</span><span class="n">TensorDataset</span><span class="p">,</span> <span class="n">DataLoader</span><span class="p">,</span> <span class="n">RandomSampler</span><span class="p">,</span>
<span class="n">SequentialSampler</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">data_loader</span><span class="p">(</span><span class="n">train_inputs</span><span class="p">,</span> <span class="n">val_inputs</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">,</span> <span class="n">val_labels</span><span class="p">,</span>
<span class="n">batch_size</span><span class="o">=</span><span class="mi">50</span><span class="p">):</span>
<span class="s">"""Convert train and validation sets to torch.Tensors and load them to
DataLoader.
"""</span>
<span class="c1"># Convert data type to torch.Tensor
</span> <span class="n">train_inputs</span><span class="p">,</span> <span class="n">val_inputs</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">,</span> <span class="n">val_labels</span> <span class="o">=</span>\
<span class="nb">tuple</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="k">for</span> <span class="n">data</span> <span class="ow">in</span>
<span class="p">[</span><span class="n">train_inputs</span><span class="p">,</span> <span class="n">val_inputs</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">,</span> <span class="n">val_labels</span><span class="p">])</span>
<span class="c1"># Specify batch_size
</span> <span class="n">batch_size</span> <span class="o">=</span> <span class="mi">50</span>
<span class="c1"># Create DataLoader for training data
</span> <span class="n">train_data</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">train_inputs</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">)</span>
<span class="n">train_sampler</span> <span class="o">=</span> <span class="n">RandomSampler</span><span class="p">(</span><span class="n">train_data</span><span class="p">)</span>
<span class="n">train_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_data</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">train_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>
<span class="c1"># Create DataLoader for validation data
</span> <span class="n">val_data</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">val_inputs</span><span class="p">,</span> <span class="n">val_labels</span><span class="p">)</span>
<span class="n">val_sampler</span> <span class="o">=</span> <span class="n">SequentialSampler</span><span class="p">(</span><span class="n">val_data</span><span class="p">)</span>
<span class="n">val_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">val_data</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">val_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>
<span class="k">return</span> <span class="n">train_dataloader</span><span class="p">,</span> <span class="n">val_dataloader</span>
</code></pre></div></div>
<p>We will use 90% of the dataset for training and 10% for validation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># Train Test Split
</span><span class="n">train_inputs</span><span class="p">,</span> <span class="n">val_inputs</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">,</span> <span class="n">val_labels</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span>
<span class="n">input_ids</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="c1"># Load data to PyTorch DataLoader
</span><span class="n">train_dataloader</span><span class="p">,</span> <span class="n">val_dataloader</span> <span class="o">=</span> \
<span class="n">data_loader</span><span class="p">(</span><span class="n">train_inputs</span><span class="p">,</span> <span class="n">val_inputs</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">,</span> <span class="n">val_labels</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="3-model">3. Model</h2>
<p><strong>CNN Architecture</strong></p>
<p>The picture below illustrates the CNN architecture that we are going to build, with three filter sizes: 2, 3, and 4, each of which has 2 filters.</p>
<p class="text-center small"><img src="https://github.com/chriskhanhtran/CNN-Sentence-Classification-PyTorch/blob/master/cnn-architecture.JPG?raw=true" width="650" class="align-center" />
<em>CNN Architecture (Source: Zhang, 2015)</em></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Sample configuration:
</span><span class="n">filter_sizes</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span>
<span class="n">num_filters</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
</code></pre></div></div>
<p>Suppose that we are classifying the sentence “<strong><em>I like this movie very much!</em></strong>” (\(N = 7\) tokens) and the dimensionality of word vectors is \(d=5\). After applying the embedding layer on the input token ids, the sample sentence is represented as a 2D tensor with shape (7, 5), like an image.</p>
\[\mathrm{x_{emb}} \quad \in \mathbb{R}^{7 \times 5}\]
<p>We then use 1-dimensional convolution to extract features from the sentence. In this example, we have 6 filters in total, and each filter has shape \((f_i, d)\) where \(f_i\) is the filter size for \(i \in \{1,...,6\}\). Each filter will then scan over \(\mathrm{x_{emb}}\) and return a feature map:</p>
\[\mathrm{x_{conv_ i} = Conv1D(x_{emb})} \quad \in \mathbb{R}^{N-f_i+1}\]
<p>Next, we apply the ReLU activation to \(\mathrm{x_{conv_{i}}}\) and use max-over-time-pooling to reduce each feature map to a single scalar. Then we concatenate these scalars into a vector which will be fed to a fully connected layer to compute the final scores for our classes (logits).</p>
\[\mathrm{x_{pool_i} = MaxPool(ReLU(x_{conv_i}))} \quad \in \mathbb{R}\]
\[\mathrm{x_{fc} = \texttt{concat}(x_{pool_i})} \quad \in \mathbb{R}^6\]
<p>The idea here is that each filter will capture different semantic signals in the sentence (e.g., happiness, humor, politics, anger…) and max-pooling will record only the strongest signal over the sentence. This logic makes sense because humans also perceive the sentiment of a sentence based on its strongest semantic signal.</p>
<p>Finally, we use a fully connected layer with the weight matrix \(\mathbf{W_{fc}} \in \mathbb{R}^{2 \times 6}\) and dropout to compute \(\mathrm{logits}\), which is a vector of length 2 that keeps the scores for our two classes.</p>
\[\mathrm{logits = Dropout(\mathbf{W_{fc}}x_{fc})} \in \mathbb{R}^2\]
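<p>The whole toy forward pass can be sketched as follows, continuing the example above (a random tensor stands in for the embedded sentence; the shapes match the formulas):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn
import torch.nn.functional as F

x_emb = torch.randn(7, 5)  # stand-in for the embedded sentence (N=7, d=5)

# 2 filters for each filter size in 2, 3, 4; nn.Conv1d expects (batch, d, N)
convs = [nn.Conv1d(in_channels=5, out_channels=2, kernel_size=f) for f in [2, 3, 4]]
x = x_emb.T.unsqueeze(0)  # shape (1, 5, 7)

# Conv + ReLU: one feature map of shape (1, 2, N - f_i + 1) per filter
x_conv_list = [F.relu(conv(x)) for conv in convs]

# Max-over-time pooling reduces each feature map to a single scalar
x_pool_list = [F.max_pool1d(xc, kernel_size=xc.shape[2]) for xc in x_conv_list]

# Concatenate the 6 pooled values and compute 2-class logits
# (dropout applied before the linear layer, as in the model implementation below)
x_fc = torch.cat([xp.squeeze(dim=2) for xp in x_pool_list], dim=1)  # (1, 6)
fc = nn.Linear(6, 2)
logits = fc(nn.Dropout(p=0.5)(x_fc))  # (1, 2)
print(logits.shape)
</code></pre></div></div>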
<p>An in-depth explanation of CNN can be found in this <a href="https://cs231n.github.io/convolutional-networks/">article</a> and this <a href="https://www.youtube.com/watch?v=YRhxdVk_sIs">video</a>.</p>
<h3 id="31-create-cnn-model">3.1. Create CNN Model</h3>
<p>For simplicity, the example above uses a very small configuration. The final model will have the same architecture but be much bigger:</p>
<table>
<thead>
<tr>
<th style="text-align: center">Description</th>
<th style="text-align: center">Values</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">input word vectors</td>
<td style="text-align: center">fastText</td>
</tr>
<tr>
<td style="text-align: center">embedding size</td>
<td style="text-align: center">300</td>
</tr>
<tr>
<td style="text-align: center">filter sizes</td>
<td style="text-align: center">(3, 4, 5)</td>
</tr>
<tr>
<td style="text-align: center">num filters</td>
<td style="text-align: center">(100, 100, 100)</td>
</tr>
<tr>
<td style="text-align: center">activation</td>
<td style="text-align: center">ReLU</td>
</tr>
<tr>
<td style="text-align: center">pooling</td>
<td style="text-align: center">1-max pooling</td>
</tr>
<tr>
<td style="text-align: center">dropout rate</td>
<td style="text-align: center">0.5</td>
</tr>
</tbody>
</table>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="k">class</span> <span class="nc">CNN_NLP</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
<span class="s">"""An 1D Convulational Neural Network for Sentence Classification."""</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span>
<span class="n">pretrained_embedding</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
<span class="n">freeze_embedding</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">vocab_size</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
<span class="n">embed_dim</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span>
<span class="n">filter_sizes</span><span class="o">=</span><span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span>
<span class="n">num_filters</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">100</span><span class="p">],</span>
<span class="n">num_classes</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">):</span>
<span class="s">"""
The constructor for CNN_NLP class.
Args:
pretrained_embedding (torch.Tensor): Pretrained embeddings with
shape (vocab_size, embed_dim)
freeze_embedding (bool): Set to False to fine-tune pretrained
vectors. Default: False
vocab_size (int): Needs to be specified when pretrained word
embeddings are not used.
embed_dim (int): Dimension of word vectors. Need to be specified
when pretrained word embeddings are not used. Default: 300
filter_sizes (List[int]): List of filter sizes. Default: [3, 4, 5]
num_filters (List[int]): List of number of filters, has the same
length as `filter_sizes`. Default: [100, 100, 100]
num_classes (int): Number of classes. Default: 2
dropout (float): Dropout rate. Default: 0.5
"""</span>
<span class="nb">super</span><span class="p">(</span><span class="n">CNN_NLP</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
<span class="c1"># Embedding layer
</span> <span class="k">if</span> <span class="n">pretrained_embedding</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="bp">self</span><span class="p">.</span><span class="n">vocab_size</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span> <span class="o">=</span> <span class="n">pretrained_embedding</span><span class="p">.</span><span class="n">shape</span>
<span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_embedding</span><span class="p">,</span>
<span class="n">freeze</span><span class="o">=</span><span class="n">freeze_embedding</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span> <span class="o">=</span> <span class="n">embed_dim</span>
<span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="n">num_embeddings</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">,</span>
<span class="n">embedding_dim</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span><span class="p">,</span>
<span class="n">padding_idx</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">max_norm</span><span class="o">=</span><span class="mf">5.0</span><span class="p">)</span>
<span class="c1"># Conv Network
</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv1d_list</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">ModuleList</span><span class="p">([</span>
<span class="n">nn</span><span class="p">.</span><span class="n">Conv1d</span><span class="p">(</span><span class="n">in_channels</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">embed_dim</span><span class="p">,</span>
<span class="n">out_channels</span><span class="o">=</span><span class="n">num_filters</span><span class="p">[</span><span class="n">i</span><span class="p">],</span>
<span class="n">kernel_size</span><span class="o">=</span><span class="n">filter_sizes</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">filter_sizes</span><span class="p">))</span>
<span class="p">])</span>
<span class="c1"># Fully-connected layer and Dropout
</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">num_filters</span><span class="p">),</span> <span class="n">num_classes</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">dropout</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">p</span><span class="o">=</span><span class="n">dropout</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">):</span>
<span class="s">"""Perform a forward pass through the network.
Args:
input_ids (torch.Tensor): A tensor of token ids with shape
(batch_size, max_sent_length)
Returns:
logits (torch.Tensor): Output logits with shape (batch_size,
n_classes)
"""</span>
<span class="c1"># Get embeddings from `input_ids`. Output shape: (b, max_len, embed_dim)
</span> <span class="n">x_embed</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">input_ids</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span>
<span class="c1"># Permute `x_embed` to match input shape requirement of `nn.Conv1d`.
</span> <span class="c1"># Output shape: (b, embed_dim, max_len)
</span> <span class="n">x_reshaped</span> <span class="o">=</span> <span class="n">x_embed</span><span class="p">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1"># Apply CNN and ReLU. Output shape: (b, num_filters[i], L_out)
</span> <span class="n">x_conv_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">conv1d</span><span class="p">(</span><span class="n">x_reshaped</span><span class="p">))</span> <span class="k">for</span> <span class="n">conv1d</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv1d_list</span><span class="p">]</span>
<span class="c1"># Max pooling. Output shape: (b, num_filters[i], 1)
</span> <span class="n">x_pool_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">F</span><span class="p">.</span><span class="n">max_pool1d</span><span class="p">(</span><span class="n">x_conv</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="n">x_conv</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="k">for</span> <span class="n">x_conv</span> <span class="ow">in</span> <span class="n">x_conv_list</span><span class="p">]</span>
<span class="c1"># Concatenate x_pool_list to feed the fully connected layer.
</span> <span class="c1"># Output shape: (b, sum(num_filters))
</span> <span class="n">x_fc</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">([</span><span class="n">x_pool</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span> <span class="k">for</span> <span class="n">x_pool</span> <span class="ow">in</span> <span class="n">x_pool_list</span><span class="p">],</span>
<span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Compute logits. Output shape: (b, n_classes)
</span> <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">x_fc</span><span class="p">))</span>
<span class="k">return</span> <span class="n">logits</span>
</code></pre></div></div>
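<p>Before moving on, we can sanity-check the class with random weights. This is a minimal sketch; the vocabulary size, batch size, and sentence length are made up for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Quick shape check with a randomly initialized embedding layer
model = CNN_NLP(vocab_size=1000, embed_dim=300)
dummy_input = torch.randint(0, 1000, (8, 62))  # (batch_size, max_sent_length)
print(model(dummy_input).shape)  # expected: torch.Size([8, 2])
</code></pre></div></div>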
<h3 id="32-optimizer">3.2. Optimizer</h3>
<p>To train deep learning models, we need to define a loss function and minimize it. We’ll use back-propagation to compute gradients and an optimization algorithm (e.g., gradient descent) to update the weights. The original paper used the Adadelta optimizer.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span>
<span class="k">def</span> <span class="nf">initilize_model</span><span class="p">(</span><span class="n">pretrained_embedding</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
<span class="n">freeze_embedding</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">vocab_size</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
<span class="n">embed_dim</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span>
<span class="n">filter_sizes</span><span class="o">=</span><span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span>
<span class="n">num_filters</span><span class="o">=</span><span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">100</span><span class="p">],</span>
<span class="n">num_classes</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.01</span><span class="p">):</span>
<span class="s">"""Instantiate a CNN model and an optimizer."""</span>
<span class="k">assert</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">filter_sizes</span><span class="p">)</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">num_filters</span><span class="p">)),</span> <span class="s">"filter_sizes and </span><span class="se">\
</span><span class="s"> num_filters need to be of the same length."</span>
<span class="c1"># Instantiate CNN model
</span> <span class="n">cnn_model</span> <span class="o">=</span> <span class="n">CNN_NLP</span><span class="p">(</span><span class="n">pretrained_embedding</span><span class="o">=</span><span class="n">pretrained_embedding</span><span class="p">,</span>
<span class="n">freeze_embedding</span><span class="o">=</span><span class="n">freeze_embedding</span><span class="p">,</span>
<span class="n">vocab_size</span><span class="o">=</span><span class="n">vocab_size</span><span class="p">,</span>
<span class="n">embed_dim</span><span class="o">=</span><span class="n">embed_dim</span><span class="p">,</span>
<span class="n">filter_sizes</span><span class="o">=</span><span class="n">filter_sizes</span><span class="p">,</span>
<span class="n">num_filters</span><span class="o">=</span><span class="n">num_filters</span><span class="p">,</span>
<span class="n">num_classes</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="c1"># Send model to `device` (GPU/CPU)
</span> <span class="n">cnn_model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="c1"># Instantiate Adadelta optimizer
</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adadelta</span><span class="p">(</span><span class="n">cnn_model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span>
<span class="n">lr</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">,</span>
<span class="n">rho</span><span class="o">=</span><span class="mf">0.95</span><span class="p">)</span>
<span class="k">return</span> <span class="n">cnn_model</span><span class="p">,</span> <span class="n">optimizer</span>
</code></pre></div></div>
<h3 id="33-training-loop">3.3. Training Loop</h3>
<p>For each epoch, the code below performs a forward step to compute the <em>cross-entropy</em> loss, a backward step to compute gradients, and an optimizer step to update the weights/parameters. At the end of each epoch, the loss on the training data and the accuracy on the validation data are printed to help us keep track of the model’s performance. The code is heavily annotated with detailed explanations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="c1"># Specify loss function
</span><span class="n">loss_fn</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">set_seed</span><span class="p">(</span><span class="n">seed_value</span><span class="o">=</span><span class="mi">42</span><span class="p">):</span>
<span class="s">"""Set seed for reproducibility."""</span>
<span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">manual_seed_all</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">train_dataloader</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
<span class="s">"""Train the CNN model."""</span>
<span class="c1"># Tracking best validation accuracy
</span> <span class="n">best_accuracy</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c1"># Start training loop
</span> <span class="k">print</span><span class="p">(</span><span class="s">"Start training...</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'Epoch'</span><span class="p">:</span><span class="o">^</span><span class="mi">7</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'Train Loss'</span><span class="p">:</span><span class="o">^</span><span class="mi">12</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'Val Loss'</span><span class="p">:</span><span class="o">^</span><span class="mi">10</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span>\
<span class="s">'Val Acc'</span><span class="p">:</span><span class="o">^</span><span class="mi">9</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'Elapsed'</span><span class="p">:</span><span class="o">^</span><span class="mi">9</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"-"</span><span class="o">*</span><span class="mi">60</span><span class="p">)</span>
<span class="k">for</span> <span class="n">epoch_i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
<span class="c1"># =======================================
</span> <span class="c1"># Training
</span> <span class="c1"># =======================================
</span>
<span class="c1"># Tracking time and loss
</span> <span class="n">t0_epoch</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">total_loss</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c1"># Put the model into the training mode
</span> <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
<span class="k">for</span> <span class="n">step</span><span class="p">,</span> <span class="n">batch</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">):</span>
<span class="c1"># Load batch to GPU
</span> <span class="n">b_input_ids</span><span class="p">,</span> <span class="n">b_labels</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">)</span>
<span class="c1"># Zero out any previously calculated gradients
</span> <span class="n">model</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="c1"># Perform a forward pass. This will return logits.
</span> <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">b_input_ids</span><span class="p">)</span>
<span class="c1"># Compute loss and accumulate the loss values
</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">b_labels</span><span class="p">)</span>
<span class="n">total_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
<span class="c1"># Perform a backward pass to calculate gradients
</span> <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
<span class="c1"># Update parameters
</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
<span class="c1"># Calculate the average loss over the entire training data
</span> <span class="n">avg_train_loss</span> <span class="o">=</span> <span class="n">total_loss</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">)</span>
<span class="c1"># =======================================
</span> <span class="c1"># Evaluation
</span> <span class="c1"># =======================================
</span> <span class="k">if</span> <span class="n">val_dataloader</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="c1"># After the completion of each training epoch, measure the model's
</span> <span class="c1"># performance on our validation set.
</span> <span class="n">val_loss</span><span class="p">,</span> <span class="n">val_accuracy</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">)</span>
<span class="c1"># Track the best accuracy
</span> <span class="k">if</span> <span class="n">val_accuracy</span> <span class="o">></span> <span class="n">best_accuracy</span><span class="p">:</span>
<span class="n">best_accuracy</span> <span class="o">=</span> <span class="n">val_accuracy</span>
<span class="c1"># Print performance over the entire training data
</span> <span class="n">time_elapsed</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0_epoch</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">epoch_i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">:</span><span class="o">^</span><span class="mi">7</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">avg_train_loss</span><span class="p">:</span><span class="o">^</span><span class="mf">12.6</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span>\
<span class="n">val_loss</span><span class="p">:</span><span class="o">^</span><span class="mf">10.6</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">val_accuracy</span><span class="p">:</span><span class="o">^</span><span class="mf">9.2</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">time_elapsed</span><span class="p">:</span><span class="o">^</span><span class="mf">9.2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Training complete! Best accuracy: </span><span class="si">{</span><span class="n">best_accuracy</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">%."</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">):</span>
<span class="s">"""After the completion of each training epoch, measure the model's
performance on our validation set.
"""</span>
<span class="c1"># Put the model into the evaluation mode. The dropout layers are disabled
</span> <span class="c1"># during the test time.
</span> <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="c1"># Tracking variables
</span> <span class="n">val_accuracy</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">val_loss</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># For each batch in our validation set...
</span> <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">val_dataloader</span><span class="p">:</span>
<span class="c1"># Load batch to GPU
</span> <span class="n">b_input_ids</span><span class="p">,</span> <span class="n">b_labels</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">)</span>
<span class="c1"># Compute logits
</span> <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">b_input_ids</span><span class="p">)</span>
<span class="c1"># Compute loss
</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">b_labels</span><span class="p">)</span>
<span class="n">val_loss</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">())</span>
<span class="c1"># Get the predictions
</span> <span class="n">preds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">flatten</span><span class="p">()</span>
<span class="c1"># Calculate the accuracy rate
</span> <span class="n">accuracy</span> <span class="o">=</span> <span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">b_labels</span><span class="p">).</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span>
<span class="n">val_accuracy</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">accuracy</span><span class="p">)</span>
<span class="c1"># Compute the average accuracy and loss over the validation set.
</span> <span class="n">val_loss</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">val_loss</span><span class="p">)</span>
<span class="n">val_accuracy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">val_accuracy</span><span class="p">)</span>
<span class="k">return</span> <span class="n">val_loss</span><span class="p">,</span> <span class="n">val_accuracy</span>
</code></pre></div></div>
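<p>Note that <code class="language-plaintext highlighter-rouge">train</code> and <code class="language-plaintext highlighter-rouge">evaluate</code> reference a global <code class="language-plaintext highlighter-rouge">device</code> variable. If it wasn’t already defined earlier in the notebook, a one-line setup suffices:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Use the GPU if available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
</code></pre></div></div>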
<h2 id="4-evaluation">4. Evaluation</h2>
<p>In the original paper, the author tried different variations of the model.</p>
<ul>
<li><strong>CNN-rand</strong>: The baseline model where the embedding layer is randomly initialized and then updated during training.</li>
<li><strong>CNN-static</strong>: A model with pretrained vectors. However, the embedding layer is frozen during training.</li>
<li><strong>CNN-non-static</strong>: Same as above, but the embedding layer is fine-tuned during training.</li>
</ul>
<p>We will experiment with all 3 variations and compare their performance. Below are our results alongside the original paper’s results.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Model</th>
<th style="text-align: center">Kim’s results</th>
<th style="text-align: center">Our results</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">CNN-rand</td>
<td style="text-align: center">76.1</td>
<td style="text-align: center">74.2</td>
</tr>
<tr>
<td style="text-align: left">CNN-static</td>
<td style="text-align: center">81.0</td>
<td style="text-align: center">82.7</td>
</tr>
<tr>
<td style="text-align: left">CNN-non-static</td>
<td style="text-align: center">81.5</td>
<td style="text-align: center">84.4</td>
</tr>
</tbody>
</table>
<p>Randomness could account for part of the difference in the results. I think the improvement in our results comes from using fastText pretrained vectors, which are of higher quality than the word2vec vectors the author used.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># CNN-rand: Word vectors are randomly initialized.
</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">cnn_rand</span><span class="p">,</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">initilize_model</span><span class="p">(</span><span class="n">vocab_size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">word2idx</span><span class="p">),</span>
<span class="n">embed_dim</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span>
<span class="n">dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">train</span><span class="p">(</span><span class="n">cnn_rand</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">train_dataloader</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Start training...
Epoch | Train Loss | Val Loss | Val Acc | Elapsed
------------------------------------------------------------
1 | 0.682544 | 0.653227 | 62.22 | 1.50
2 | 0.622080 | 0.616504 | 65.22 | 1.41
3 | 0.546976 | 0.574917 | 69.30 | 1.43
4 | 0.473106 | 0.559976 | 69.21 | 1.43
5 | 0.397637 | 0.541240 | 72.47 | 1.44
6 | 0.322112 | 0.530545 | 71.93 | 1.43
7 | 0.258854 | 0.513072 | 72.92 | 1.43
8 | 0.204417 | 0.534012 | 73.74 | 1.43
9 | 0.157654 | 0.533650 | 74.01 | 1.44
10 | 0.129191 | 0.542072 | 74.19 | 1.44
11 | 0.104160 | 0.561548 | 73.56 | 1.45
12 | 0.083750 | 0.560357 | 73.10 | 1.47
13 | 0.067199 | 0.565875 | 73.10 | 1.45
14 | 0.061943 | 0.591892 | 73.83 | 1.44
15 | 0.047678 | 0.615021 | 73.38 | 1.44
16 | 0.043667 | 0.609918 | 73.47 | 1.45
17 | 0.038222 | 0.624876 | 73.74 | 1.43
18 | 0.037270 | 0.636214 | 73.83 | 1.44
19 | 0.032148 | 0.635478 | 73.19 | 1.46
20 | 0.027427 | 0.636196 | 73.56 | 1.42
Training complete! Best accuracy: 74.19%.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># CNN-static: fastText pretrained word vectors are used and freezed during training.
</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">cnn_static</span><span class="p">,</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">initilize_model</span><span class="p">(</span><span class="n">pretrained_embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
<span class="n">freeze_embedding</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span>
<span class="n">dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">train</span><span class="p">(</span><span class="n">cnn_static</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">train_dataloader</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Start training...
Epoch | Train Loss | Val Loss | Val Acc | Elapsed
------------------------------------------------------------
1 | 0.587050 | 0.473927 | 76.93 | 0.82
2 | 0.453002 | 0.432967 | 79.39 | 0.71
3 | 0.389261 | 0.417466 | 80.11 | 0.74
4 | 0.345526 | 0.417371 | 80.93 | 0.81
5 | 0.284621 | 0.403670 | 81.47 | 0.83
6 | 0.242149 | 0.406981 | 81.93 | 0.81
7 | 0.190178 | 0.460115 | 79.93 | 0.76
8 | 0.155375 | 0.421258 | 82.20 | 0.84
9 | 0.118369 | 0.436616 | 82.02 | 0.80
10 | 0.095217 | 0.443634 | 81.83 | 0.79
11 | 0.078958 | 0.447452 | 82.11 | 0.76
12 | 0.063665 | 0.504030 | 81.20 | 0.83
13 | 0.047461 | 0.457974 | 82.02 | 0.77
14 | 0.043035 | 0.485016 | 82.11 | 0.70
15 | 0.035299 | 0.479483 | 82.11 | 0.82
16 | 0.028384 | 0.498936 | 82.19 | 0.79
17 | 0.024328 | 0.521321 | 82.37 | 0.76
18 | 0.024897 | 0.511377 | 82.74 | 0.74
19 | 0.019988 | 0.530753 | 81.93 | 0.79
20 | 0.017251 | 0.546499 | 82.20 | 0.85
Training complete! Best accuracy: 82.74%.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># CNN-non-static: fastText pretrained word vectors are fine-tuned during training.
</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">cnn_non_static</span><span class="p">,</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">initilize_model</span><span class="p">(</span><span class="n">pretrained_embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
<span class="n">freeze_embedding</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span>
<span class="n">dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">train</span><span class="p">(</span><span class="n">cnn_non_static</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">train_dataloader</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Start training...
Epoch | Train Loss | Val Loss | Val Acc | Elapsed
------------------------------------------------------------
1 | 0.586136 | 0.471964 | 77.21 | 2.08
2 | 0.448910 | 0.428012 | 80.03 | 2.11
3 | 0.381136 | 0.409408 | 81.29 | 2.09
4 | 0.332936 | 0.411652 | 80.75 | 2.10
5 | 0.267999 | 0.397631 | 82.02 | 2.10
6 | 0.223944 | 0.399833 | 81.29 | 2.11
7 | 0.168644 | 0.452024 | 81.29 | 2.10
8 | 0.132921 | 0.442039 | 81.65 | 2.09
9 | 0.097992 | 0.457295 | 81.84 | 2.09
10 | 0.079037 | 0.458124 | 82.38 | 2.09
11 | 0.061001 | 0.459572 | 83.74 | 2.09
12 | 0.047450 | 0.535106 | 81.29 | 2.08
13 | 0.037088 | 0.491504 | 84.37 | 2.10
14 | 0.031085 | 0.503522 | 83.11 | 2.08
15 | 0.025401 | 0.512804 | 84.01 | 2.10
16 | 0.020165 | 0.532516 | 84.19 | 2.11
17 | 0.017053 | 0.545771 | 83.83 | 2.08
18 | 0.017567 | 0.540735 | 84.20 | 2.09
19 | 0.013829 | 0.567102 | 82.47 | 2.09
20 | 0.013072 | 0.594407 | 82.20 | 2.08
Training complete! Best accuracy: 84.37%.
</code></pre></div></div>
<h2 id="5-test-model">5. Test Model</h2>
<p>Let’s test our CNN-non-static model on some examples.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">cnn_non_static</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="s">"cpu"</span><span class="p">),</span> <span class="n">max_len</span><span class="o">=</span><span class="mi">62</span><span class="p">):</span>
<span class="s">"""Predict probability that a review is positive."""</span>
<span class="c1"># Tokenize, pad and encode text
</span> <span class="n">tokens</span> <span class="o">=</span> <span class="n">word_tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">padded_tokens</span> <span class="o">=</span> <span class="n">tokens</span> <span class="o">+</span> <span class="p">[</span><span class="s">'<pad>'</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">max_len</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">))</span>
<span class="n">input_id</span> <span class="o">=</span> <span class="p">[</span><span class="n">word2idx</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">token</span><span class="p">,</span> <span class="n">word2idx</span><span class="p">[</span><span class="s">'<unk>'</span><span class="p">])</span> <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">padded_tokens</span><span class="p">]</span>
<span class="c1"># Convert to PyTorch tensors
</span> <span class="n">input_id</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">input_id</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># Compute logits
</span> <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">input_id</span><span class="p">)</span>
<span class="c1"># Compute probability
</span> <span class="n">probs</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">squeeze</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"This review is </span><span class="si">{</span><span class="n">probs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">100</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">% positive."</span><span class="p">)</span>
</code></pre></div></div>
<p>Our model can easily recognize reviews with strong negative signals. On samples that have mixed feelings but an overall positive sentiment, our model also gets excellent results.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predict</span><span class="p">(</span><span class="s">"All of friends slept while watching this movie. But I really enjoyed it."</span><span class="p">)</span>
<span class="n">predict</span><span class="p">(</span><span class="s">"I have waited so long for this movie. I am now so satisfied and happy."</span><span class="p">)</span>
<span class="n">predict</span><span class="p">(</span><span class="s">"This movie is long and boring."</span><span class="p">)</span>
<span class="n">predict</span><span class="p">(</span><span class="s">"I don't like the ending."</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This review is 61.22% positive.
This review is 94.68% positive.
This review is 0.01% positive.
This review is 4.03% positive.
</code></pre></div></div>
<h2 id="6-advice-for-practitioners">6. Advice for Practitioners</h2>
<p>In <a href="https://arxiv.org/abs/1510.03820">A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification</a> (Zhang, 2015), the authors conducted a sensitivity analysis of the above CNN architecture by running it many different sets of hyperparameters. Based on main empirical findings of the research, below are some advice for practioners to choose hyperparameters when applying this architecture for sentence classification tasks:</p>
<ul>
<li><strong>Input word vectors:</strong> Using pretrained word vectors such as word2vec, GloVe (or fastText in our implementation) yields much better results than using one-hot vectors or randomly initialized vectors.</li>
<li><strong>Filter region size</strong> can have a large effect on performance and should be tuned. A reasonable range might be 1~10. For example, using <code class="language-plaintext highlighter-rouge">filter_sizes=[7]</code> and <code class="language-plaintext highlighter-rouge">num_filters=[400]</code> yields the best result on the MR dataset (see the sketch after this list).</li>
<li><strong>Number of feature maps:</strong> try values from 100 to 600 for each filter region size.</li>
<li><strong>Activation functions:</strong> ReLU and tanh are the best candidates.</li>
<li><strong>Pooling:</strong> Use 1-max pooling.</li>
<li><strong>Regularization:</strong> When increasing number of feature maps, try imposing stronger regularization, e.g. a dropout rate larger than 0.5.</li>
</ul>
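<p>For instance, here is a minimal sketch of trying the single-region-size setting mentioned above with our <code class="language-plaintext highlighter-rouge">initialize_model</code> helper; the hyperparameter values come from the sensitivity analysis, and the resulting accuracy will of course vary.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical run with a single filter region size, as suggested above
set_seed(42)
cnn_tuned, optimizer = initialize_model(pretrained_embedding=embeddings,
                                        freeze_embedding=False,
                                        filter_sizes=[7],
                                        num_filters=[400],
                                        learning_rate=0.25,
                                        dropout=0.5)
train(cnn_tuned, optimizer, train_dataloader, val_dataloader, epochs=20)
</code></pre></div></div>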
<h2 id="bonus-skorch-a-scikit-like-library-for-pytorch-modules">Bonus: Skorch: A Scikit-like Library for PyTorch Modules</h2>
<p>You may find the training loop in PyTorch intimidating, with so many steps, and wonder why those steps aren’t wrapped in functions like <code class="language-plaintext highlighter-rouge">model.fit()</code> and <code class="language-plaintext highlighter-rouge">model.predict()</code> in the <code class="language-plaintext highlighter-rouge">scikit-learn</code> library. Actually, this explicitness is something I like about PyTorch. It allows me to manipulate my code to add extra customizations during training, such as clipping gradients and updating learning rates. In addition, because I build my model and training loop block by block, when my model runs into errors, I can locate the bugs faster. However, when I need to deploy a baseline model quickly, writing an entire training loop is a real burden. That’s when I turn to <code class="language-plaintext highlighter-rouge">skorch</code>.</p>
<p><code class="language-plaintext highlighter-rouge">skorch</code> is “a scikit-learn compatible neural network library that wraps PyTorch.” There is no need to create <code class="language-plaintext highlighter-rouge">DataLoader</code> or write a training/evaluation loop. All you need to do is defining the model and optimizer as in the code below, then a simple <code class="language-plaintext highlighter-rouge">net.fit(X, y)</code> is enough.</p>
<p><code class="language-plaintext highlighter-rouge">skorch</code> does not only make it neat and fast to train your Deep Learning models, it also provides powerful support. You can specify <code class="language-plaintext highlighter-rouge">callbacks</code> parameters to define early stopping and checkpoint saving. You can also combine <code class="language-plaintext highlighter-rouge">skorch</code> model with <code class="language-plaintext highlighter-rouge">scikit-learn</code> methods to do cross-validation and hyperparameter tuning with grid-search. Please check out the <a href="https://skorch.readthedocs.io/en/stable/index.html#">documentation</a> to explore more powerful functions in this library.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">skorch</span>
<span class="kn">from</span> <span class="nn">skorch</span> <span class="kn">import</span> <span class="n">NeuralNetClassifier</span>
<span class="kn">from</span> <span class="nn">skorch.helper</span> <span class="kn">import</span> <span class="n">predefined_split</span>
<span class="kn">from</span> <span class="nn">skorch.callbacks</span> <span class="kn">import</span> <span class="n">EarlyStopping</span><span class="p">,</span> <span class="n">Checkpoint</span><span class="p">,</span> <span class="n">LoadInitState</span>
<span class="kn">from</span> <span class="nn">skorch.dataset</span> <span class="kn">import</span> <span class="n">CVSplit</span><span class="p">,</span> <span class="n">Dataset</span>
<span class="c1"># Specify validation set
</span><span class="n">val_dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">(</span><span class="n">val_inputs</span><span class="p">,</span> <span class="n">val_labels</span><span class="p">)</span>
<span class="c1"># Specify callbacks and checkpoints
</span><span class="n">cp</span> <span class="o">=</span> <span class="n">Checkpoint</span><span class="p">(</span><span class="n">monitor</span><span class="o">=</span><span class="s">'valid_acc_best'</span><span class="p">,</span> <span class="n">dirname</span><span class="o">=</span><span class="s">'exp1'</span><span class="p">)</span>
<span class="n">callbacks</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="s">'early_stop'</span><span class="p">,</span> <span class="n">EarlyStopping</span><span class="p">(</span><span class="n">monitor</span><span class="o">=</span><span class="s">'valid_acc'</span><span class="p">,</span> <span class="n">patience</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">lower_is_better</span><span class="o">=</span><span class="bp">False</span><span class="p">)),</span>
<span class="n">cp</span>
<span class="p">]</span>
<span class="n">net</span> <span class="o">=</span> <span class="n">NeuralNetClassifier</span><span class="p">(</span>
<span class="c1"># Module
</span> <span class="n">module</span><span class="o">=</span><span class="n">CNN_NLP</span><span class="p">,</span>
<span class="n">module__pretrained_embedding</span><span class="o">=</span><span class="n">embeddings</span><span class="p">,</span>
<span class="n">module__freeze_embedding</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">module__dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="c1"># Optimizer
</span> <span class="n">criterion</span><span class="o">=</span><span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="n">optim</span><span class="p">.</span><span class="n">Adadelta</span><span class="p">,</span>
<span class="n">optimizer__lr</span><span class="o">=</span><span class="mf">0.25</span><span class="p">,</span>
<span class="n">optimizer__rho</span><span class="o">=</span><span class="mf">0.95</span><span class="p">,</span>
<span class="c1"># Others
</span> <span class="n">max_epochs</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="n">batch_size</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
<span class="n">train_split</span><span class="o">=</span><span class="n">predefined_split</span><span class="p">(</span><span class="n">val_dataset</span><span class="p">),</span>
<span class="n">iterator_train__shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">warm_start</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">callbacks</span><span class="o">=</span><span class="n">callbacks</span><span class="p">,</span>
<span class="n">device</span><span class="o">=</span><span class="n">device</span>
<span class="p">)</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">skorch</code> also prints training results in a very nice table. My training loop in section 3 is inspired by this format. When model (checkpoints) are saved, you can see the <code class="language-plaintext highlighter-rouge">+</code> sign in column <code class="language-plaintext highlighter-rouge">cp</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">net</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">train_inputs</span><span class="p">),</span> <span class="n">train_labels</span><span class="p">)</span>
<span class="n">valid_acc_best</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">net</span><span class="p">.</span><span class="n">history</span><span class="p">[:,</span> <span class="s">'valid_acc'</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Training complete! Best accuracy: </span><span class="si">{</span><span class="n">valid_acc_best</span> <span class="o">*</span> <span class="mi">100</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">%"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> epoch train_loss valid_acc valid_loss cp dur
------- ------------ ----------- ------------ ---- ------
1 0.5862 0.7741 0.4727 + 2.2838
2 0.4481 0.7901 0.4385 + 2.2232
3 0.3849 0.7938 0.4369 + 2.2337
4 0.3242 0.8285 0.3940 + 2.2340
5 0.2787 0.8257 0.3951 2.2225
6 0.2156 0.8285 0.3958 2.2006
7 0.1714 0.8144 0.4410 2.2059
8 0.1336 0.8332 0.4100 + 2.2174
9 0.0950 0.8266 0.4295 2.2214
10 0.0738 0.8238 0.4489 2.1938
11 0.0596 0.8304 0.4705 2.1988
12 0.0476 0.8266 0.4769 2.2083
Stopping since valid_acc has not improved in the last 5 epochs.
Training complete! Best accuracy: 83.32%
</code></pre></div></div>
<p>As deep learning models can overfit the training data quickly, it’s important to save our model when it fits the validation data just right. After training, we can load our model from the best checkpoint to make predictions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Load parameters from checkpoint
</span><span class="n">net</span><span class="p">.</span><span class="n">load_params</span><span class="p">(</span><span class="n">checkpoint</span><span class="o">=</span><span class="n">cp</span><span class="p">)</span>
<span class="n">predict</span><span class="p">(</span><span class="s">"All of friends slept while watching this movie. But I really enjoyed it."</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">net</span><span class="p">)</span>
<span class="n">predict</span><span class="p">(</span><span class="s">"I have waited so long for this movie. I am now so satisfied and happy."</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">net</span><span class="p">)</span>
<span class="n">predict</span><span class="p">(</span><span class="s">"This movie is long and boring."</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">net</span><span class="p">)</span>
<span class="n">predict</span><span class="p">(</span><span class="s">"I don't like the ending."</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="n">net</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This review is 67.25% positive.
This review is 61.38% positive.
This review is 0.12% positive.
This review is 19.14% positive.
</code></pre></div></div>
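<p>Finally, as mentioned above, <code class="language-plaintext highlighter-rouge">skorch</code>’s scikit-learn compatibility means hyperparameter tuning works out of the box. Below is a minimal grid-search sketch; the parameter grid values are illustrative, and we let <code class="language-plaintext highlighter-rouge">GridSearchCV</code> handle the splits instead of the predefined validation set.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from sklearn.model_selection import GridSearchCV

# Let GridSearchCV control the splits; drop the predefined split and callbacks
net.set_params(train_split=None, callbacks=None)

# Illustrative parameter grid
params = {
    'optimizer__lr': [0.1, 0.25],
    'module__dropout': [0.3, 0.5],
}
gs = GridSearchCV(net, params, scoring='accuracy', cv=3, refit=False)
gs.fit(np.array(train_inputs), train_labels)
print(gs.best_score_, gs.best_params_)
</code></pre></div></div>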
<h2 id="conclusion">Conclusion</h2>
<p>Before the rise of huge and complicated models using the Transformer architecture, a simple CNN architecture with one layer of convolution could already yield excellent performance on sentence classification tasks. The model can take advantage of unsupervised pre-training of word vectors to improve overall performance. This architecture can be improved further by increasing the number of CNN layers or by using a sub-word model (e.g., a BPE tokenizer and fastText pretrained sub-word vectors). Because of its speed, we can use the CNN model as a strong baseline before trying more complicated models such as BERT.</p>
<p>Thank you for staying with me to this point. If interested, you can check out other articles in my NLP tutorial series:</p>
<ul>
<li><a href="https://chriskhanhtran.github.io/posts/bert_for_sentiment_analysis/">Tutorial: Fine-tuning BERT for Sentiment Analysis</a></li>
</ul>Chris TranCreate a Minimalism GitHub Page for Your Data Science Portfolio in 30 Minutes2020-01-13T00:00:00-05:002020-01-13T00:00:00-05:00https://chriskhanhtran.github.io/posts/portfolio-tutorial<p>In the early days of my journey in data science a year ago, I spent most of my time reading articles on Towards Data Science to create my own Data Science roadmap. Opinions differ on the knowledge one needs to acquire to become a Data Scientist and how to get there, but they have one thing in common: at some point in that journey, one should have a portfolio where she can host her Data Science projects.</p>
<p>I created my first portfolio about 6 months after I wrote my first line of Python, when I completed the <a href="https://www.udemy.com/course/python-for-data-science-and-machine-learning-bootcamp/">Python for Data Science and Machine Learning Bootcamp</a> on Udemy, to host simple projects I had done during the course. Since then, building and maintaining my portfolio has been one of my favorite things to do. I enjoy organizing my ideas, writing them down, explaining things, and making them neat.</p>
<p>Having a portfolio encourages me to seriously document any project I work on. For my job search, I usually bring an iPad with my portfolio open to career events and interviews, so when I talk about my projects, I can guide interviewers through my code and visualizations. It is a very efficient way to make an impression and keep the conversation going.</p>
<p>In this tutorial, we will learn, step by step, how to build a simple but powerful GitHub page to host your Data Science projects. The whole process will take about 30 minutes. Let’s get started!</p>
<h2 id="step-1-create-a-github-account">Step 1: Create a GitHub Account</h2>
<p>First, we need to sign up for a GitHub account at <a href="https://github.com/">https://github.com/</a>.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/1.PNG?raw=true" /></center>
<p>After signing up, we will log in and move to Step 2.</p>
<h2 id="step-2-create-a-repository-named-user-namegithubio">Step 2: Create a Repository Named <code class="language-plaintext highlighter-rouge">user-name.github.io</code></h2>
<p>After all steps in this tutorial are completed, our GitHub page can be accessed at <code class="language-plaintext highlighter-rouge">https://user-name.github.io</code>. In this step, we will create a repository named <code class="language-plaintext highlighter-rouge">user-name.github.io</code>, where <code class="language-plaintext highlighter-rouge">user-name</code> is the user name we use to log into GitHub. My user name is <code class="language-plaintext highlighter-rouge">ktran3-simon</code>, so I will create a repository named <code class="language-plaintext highlighter-rouge">ktran3-simon.github.io</code>.</p>
<p>To create a new repository, we click on the <code class="language-plaintext highlighter-rouge">+</code> sign at the top right of the screen, next to our profile picture, and select <code class="language-plaintext highlighter-rouge">New repository</code>.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/2.PNG?raw=true" /></center>
<p>We fill the repository name with <code class="language-plaintext highlighter-rouge">user-name.github.io</code>, select <strong>Public</strong> and then click <code class="language-plaintext highlighter-rouge">Create repository</code>.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/3.PNG?raw=true" /></center>
<p>Next, we will upload the theme to this repository. The theme we will use is the <a href="https://github.com/pages-themes/minimal"><strong>Jekyll Minimal theme</strong></a>. A more concise version of the theme is packaged in <a href="https://github.com/evanca/quick-portfolio">this GitHub repository</a>.</p>
<center><img src="https://github.com/ktran3-simon/quick-portfolio/raw/master/images/demo.gif?raw=true" /></center>
<p>To download the theme, we go to <a href="https://github.com/evanca/quick-portfolio">https://github.com/evanca/quick-portfolio</a>, click <code class="language-plaintext highlighter-rouge">Clone or download</code> and select <code class="language-plaintext highlighter-rouge">Download ZIP</code>.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/4.PNG?raw=true" /></center>
<p>Now, let’s open our newly created repository, which is still empty, and click <code class="language-plaintext highlighter-rouge">uploading an existing file</code>. We unzip the theme we downloaded, upload its files into our repository, and click <code class="language-plaintext highlighter-rouge">Commit changes</code> when the upload is complete.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/5.PNG?raw=true" /></center>
<p>Now, by going to <a href="https://ktran3-simon.github.io/"><code class="language-plaintext highlighter-rouge">user-name.github.io</code></a>, we can already see our website! In the next step, we will go through some instructions to customize our portfolio.</p>
<p><strong>(Optional)</strong> A faster way to complete this step is to simply click the <strong>Fork</strong> button to copy the entire repository to our GitHub account and then change the repository’s name to <code class="language-plaintext highlighter-rouge">user-name.github.io</code>. However, I think the explanation above is friendlier for first-time GitHub users.</p>
<h2 id="step-3-customize-our-portfolio">Step 3: Customize Our Portfolio</h2>
<p>Our GitHub page has a two-column layout. On the left is our profile picture and some description, and on the right is the main page where we present our projects. I really like this design because it is simple yet effective.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/6.PNG?raw=true" /></center>
<p>To customize the sidebar (the left part), we will edit the file <code class="language-plaintext highlighter-rouge">_config.yml</code> in our repository, as illustrated in the example below. We can also add a Google Analytics ID to track and analyze traffic to our page.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/7.PNG?raw=true" /></center>
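<p>For reference, a minimal sidebar configuration might look roughly like the excerpt below. The field names follow the Minimal theme’s conventions, but the values here are placeholders, so yours will differ.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># _config.yml (placeholder values for illustration)
theme: jekyll-theme-minimal
title: Your Name
description: Data Science Portfolio
logo: images/your-photo.png
google_analytics: UA-XXXXXXXXX-X
</code></pre></div></div>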
<p>To customize the main page (the right part), where we display our projects, we will edit <code class="language-plaintext highlighter-rouge">index.md</code>. This file is written in <code class="language-plaintext highlighter-rouge">Markdown</code>. If you frequently work with Jupyter Notebook, you are probably already familiar with it. <code class="language-plaintext highlighter-rouge">Markdown</code> is very easy to use, and here is a helpful <a href="https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet">Markdown Cheatsheet</a> that I often refer to.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/8.PNG?raw=true" /></center>
<p><strong>(Optional) More Customizations</strong></p>
<p>As we customize the sidebar, we will see that we cannot edit the last two lines by editing <code class="language-plaintext highlighter-rouge">_config.yml</code>:</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/9.PNG?raw=true" /></center>
<p>To remove them, we need to go to the original repository of the <a href="https://github.com/pages-themes/minimal"><strong>Jekyll Minimal theme</strong></a> and copy the content of <code class="language-plaintext highlighter-rouge">default.html</code> in <code class="language-plaintext highlighter-rouge">_layouts</code>. Then we create <code class="language-plaintext highlighter-rouge">_layouts/default.html</code> in our repository by clicking <code class="language-plaintext highlighter-rouge">Creating new file</code> and typing <code class="language-plaintext highlighter-rouge">_layouts/default.html</code> as the file name, paste the copied content there, and commit.</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/10.PNG?raw=true" /></center>
<p>Now we can remove lines 29-31 in <code class="language-plaintext highlighter-rouge">_layouts/default.html</code> to remove <strong>View My GitHub Profile</strong>,</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><p class="view"><a href="https://github.com/chriskhanhtran">View My GitHub Profile</a></p>
</code></pre></div></div>
<p>and line 50 to remove <strong>Hosted on GitHub pages - Theme by orderedlist</strong>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><p><small>Hosted on GitHub Pages &mdash; Theme by <a href="https://github.com/orderedlist">orderedlist</a></small></p>
</code></pre></div></div>
<h2 id="step-4-upload-our-projects">Step 4: Upload Our Projects</h2>
<p>After customizing the design of our GitHub page, we can start adding projects to the page by editing <code class="language-plaintext highlighter-rouge">index.md</code>. There are several ways to do that, including:</p>
<ul>
<li>Link to our GitHub repositories,</li>
<li>Link to our Jupyter Notebooks,</li>
<li>Write blog posts in <code class="language-plaintext highlighter-rouge">Markdown</code>.</li>
</ul>
<p>My favorite way to add projects is to create a folder in the repository to store the <code class="language-plaintext highlighter-rouge">html</code> files of my Jupyter Notebooks and to add the link <code class="language-plaintext highlighter-rouge">https://user-name.github.io/folder-name/file-name.html</code> to my main page. Alternatively, we can insert <strong>Google Colab</strong> links so that viewers can run our projects directly. <strong>Google Colab</strong> is essentially <strong>Jupyter Notebook</strong> with GPU support from Google, which speeds up running our cells, especially in Deep Learning projects. Here is a nice <a href="https://www.youtube.com/watch?v=KCCzo31Oo8U">Google Colab tutorial</a>. If you are an R user, you can use <strong>R Markdown</strong> in RStudio to render your projects and export them in <code class="language-plaintext highlighter-rouge">html</code> format.</p>
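<p>If you work in Jupyter, one common way to produce those <code class="language-plaintext highlighter-rouge">html</code> files is the standard <code class="language-plaintext highlighter-rouge">nbconvert</code> command below; the notebook name here is just an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter nbconvert --to html my-notebook.ipynb
</code></pre></div></div>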
<p>We can also write blog posts in <code class="language-plaintext highlighter-rouge">Markdown</code> and upload them to our repository. You can always refer to the file <code class="language-plaintext highlighter-rouge">sample_page.md</code> as an example.</p>
<h2 id="tips-and-tricks">Tips and Tricks</h2>
<h3 id="badges-with-shieldsio">Badges with Shields.io</h3>
<p>In official repositories on GitHub, we usually see authors use badges to show the status of their project. For example:</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/11.PNG?raw=true" /></center>
<p>I really like to use these badges to embed links with calls to action, such as:</p>
<center><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/12.PNG?raw=true" /></center>
<p>You can go to <a href="https://shields.io/">https://shields.io/</a> to create your own badges. Basically, we just need to create links in a specific format and use them as image links.</p>
<p><strong>Format:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://img.shields.io/badge/label-message-color?logo=logo_name
</code></pre></div></div>
<p>where:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">color</code> can be brightgreen, green, yellowgreen, yellow, orange, red, blue, lightgrey, or any HEX or RGB color code</li>
<li><code class="language-plaintext highlighter-rouge">logo</code>: the list of logos for popular brands and their brand color codes can be found at <a href="https://simpleicons.org/">simple-icons</a></li>
<li>Visit <a href="https://shields.io/">shields.io</a> to learn more</li>
</ul>
<p><strong>Examples:</strong></p>
<p><code class="language-plaintext highlighter-rouge">https://img.shields.io/badge/Spotify-My_Musics-1ED760?logo=Spotify</code> will give us: <img src="https://img.shields.io/badge/Spotify-My_Musics-1ED760?logo=Spotify" alt="" /></p>
<p><code class="language-plaintext highlighter-rouge">https://img.shields.io/badge/PyTorch-Run_in_Colab-EE4C2C?logo=PyTorch</code> will give us: <img src="https://img.shields.io/badge/PyTorch-Run_in_Colab-EE4C2C?logo=PyTorch" alt="" /></p>
<p>To insert links with the badges we created, we only need to type <code class="language-plaintext highlighter-rouge">[![](link-to-our-badge)](link-to-our-project)</code>, as in the example below.</p>
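<p>Putting the two pieces together, a complete badge link looks like the line below, which reuses the PyTorch badge from above with a placeholder project URL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[![](https://img.shields.io/badge/PyTorch-Run_in_Colab-EE4C2C?logo=PyTorch)](https://colab.research.google.com/your-notebook-link)
</code></pre></div></div>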
<h3 id="more-themes">More Themes</h3>
<p>There are several other themes that we can utilize to be more creative with our portfolio. To use them, we can simply <strong>Fork</strong> the repository to our account and change its name to <code class="language-plaintext highlighter-rouge">user-name.github.io</code>.</p>
<ul>
<li>Cayman: <a href="https://github.com/pages-themes/cayman">repo</a> - <a href="https://pages-themes.github.io/cayman/">preview</a></li>
<li>Minimal Mistakes: <a href="https://github.com/mmistakes/minimal-mistakes">repo</a> - <a href="https://mmistakes.github.io/minimal-mistakes/collection-archive/">preview</a>. I really like <a href="https://leimao.github.io/">this portfolio</a> where the author uses this theme.</li>
</ul>
<p class="text-center"><img src="https://github.com/chriskhanhtran/portfolio-tutorial/blob/master/images/13.PNG?raw=true" alt="" />
<em>A preview of the Minimal Mistakes theme</em></p>
<h3 id="content-of-your-portfolio">Content of Your Portfolio</h3>
<p>Last, and most important: the reason I like a minimalist theme is that it takes minimal time on design work, so I can spend more time on the content of my projects. Ultimately, the purpose of building a Data Science portfolio is to present our Data Science projects, rather than to show off our web-design skills. Below are some articles that I found super helpful when I started building my first portfolio.</p>
<ul>
<li><a href="https://towardsdatascience.com/how-to-build-a-data-science-portfolio-5f566517c79c">How to Build a Data Science Portfolio</a> in Towards Data Science</li>
<li><a href="https://www.dataquest.io/blog/build-a-data-science-portfolio/">Data Science Portfolios That Will Get You the Job</a> in Dataquest</li>
<li><a href="https://www.springboard.com/blog/data-science-portfolio/">Building a Data Science Portfolio That Stands Out</a> in Springboard Blog</li>
<li><a href="https://medium.com/@dataoptimal9/5-data-science-projects-that-will-get-you-hired-in-2018-9e51525084e">5 Data Science Projects That Will Get You Hired in 2018</a> in Medium</li>
</ul>
<p>Feel free to visit my portfolio to see how I write up my Data Science projects. For example, this is a detailed <a href="https://chriskhanhtran.github.io/minimal-portfolio/projects/ames-house-price.html">notebook</a> I wrote after completing a Kaggle competition, in which I went through all the important steps of a Data Science project, including <strong>Exploratory Data Analysis, Data Cleaning, Feature Engineering, Modeling and Evaluation</strong>. I still often revisit this notebook to reuse its cross-validation code. I find I learn the most by reading notebooks on Kaggle and writing up my own projects.</p>
<p>I also made some changes to my portfolio compared to the original version, such as making the sidebar narrower and the main page wider. You can <strong>fork</strong> my repo (<a href="https://github.com/chriskhanhtran/minimal-portfolio">https://github.com/chriskhanhtran/minimal-portfolio</a>) and change the code in the file <code class="language-plaintext highlighter-rouge">_sass/jekyll-theme-minimal.scss</code> as you like, including the width, font size or image size; a sketch of these edits follows the list below. However, be careful when you do so, because it might mess your page up. If so, just restore the settings by copying them from the theme’s original repo.</p>
<ul>
<li>change <code class="language-plaintext highlighter-rouge">max-width</code> in <code class="language-plaintext highlighter-rouge">.wrapper</code> to change the width of the entire page</li>
<li>change <code class="language-plaintext highlighter-rouge">max-width</code> in <code class="language-plaintext highlighter-rouge">section</code> to change the width of the main page</li>
<li>change <code class="language-plaintext highlighter-rouge">width</code> in <code class="language-plaintext highlighter-rouge">header</code> to change the width of the side bar</li>
</ul>
<p class="text-center"><img src="https://raw.githubusercontent.com/chriskhanhtran/portfolio-tutorial/master/images/portfolio.gif" alt="" />
<em>My GitHub Page</em></p>
<h3 id="last-words">Last Words</h3>
<p>Having completed your minimalist portfolio, you can now remove or modify these files in your repository:</p>
<ul>
<li><strong>LICENSE</strong></li>
<li><strong>README.md</strong>: you can modify it into a description of your page.</li>
<li><strong>sample_page.md</strong>: you can remove or change it to a blog post.</li>
<li><strong>pdf/sample_presentation.pdf</strong></li>
</ul>
<p>You can also visit the original <a href="https://medium.com/@evanca/set-up-your-portfolio-website-in-less-than-10-minutes-with-github-pages-d0efa8ff56fd">tutorial</a> with more tips such as:</p>
<ul>
<li>How to create thumbnails for your project</li>
<li>How to create a round profile picture</li>
</ul>
<p>Personally, I use Photoshop and PowerPoint to create the pictures used on my GitHub pages.</p>
<p>Thank you so much for staying with me to this point of my first tutorial. Don’t hesitate to reach out to me if you have any questions. Please connect with me on LinkedIn and share your Data Science portfolio with me.
<a href="https://www.linkedin.com/in/chriskhanhtran/"><img src="https://img.shields.io/badge/LinkedIn-Connect%20with%20Me-blue?logo=LinkedIn&style=social" alt="" /></a></p>Chris TranFine-tuning BERT for Sentiment Analysis2019-12-25T00:00:00-05:002019-12-25T00:00:00-05:00https://chriskhanhtran.github.io/posts/bert-for-sentiment-analysis<p><a href="https://colab.research.google.com/drive/1f32gj5IYIyFipoINiC8P3DvKat-WWLUK"><img src="https://img.shields.io/badge/Colab-Run_in_Google_Colab-blue?logo=Google&logoColor=FDBA18" alt="Run in Google Colab" /></a></p>
<h1 id="a---introduction">A - Introduction</h1>
<p>In recent years the NLP community has seen many breakthroughs in Natural Language Processing, especially the shift to transfer learning. Models like ELMo, fast.ai’s ULMFiT, the Transformer and OpenAI’s GPT have allowed researchers to achieve state-of-the-art results on multiple benchmarks and provided the community with large, high-performing pre-trained models. This shift in NLP is seen as NLP’s ImageNet moment, echoing the shift in computer vision a few years ago when the lower layers of deep networks with millions of parameters, trained on a specific task, could be reused and fine-tuned for other tasks rather than training new networks from scratch.</p>
<p>One of the biggest recent milestones in the evolution of NLP is the release of Google’s BERT, which has been described as the beginning of a new era in NLP. In this notebook I’ll use HuggingFace’s <code class="language-plaintext highlighter-rouge">transformers</code> library to fine-tune a pretrained BERT model for a classification task. Then I will compare BERT’s performance with a baseline model, in which I use a TF-IDF vectorizer and a Naive Bayes classifier. The <code class="language-plaintext highlighter-rouge">transformers</code> library helps us quickly and efficiently fine-tune the state-of-the-art BERT model and yield an accuracy rate <strong>10%</strong> higher than the baseline model.</p>
<p><strong>Reference</strong>:</p>
<p>To understand the <strong>Transformer</strong> (the architecture on which BERT is built) and learn how to implement BERT, I highly recommend reading the following sources:</p>
<ul>
<li><a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT, ELMo, and co.</a>: A very clear and well-written guide to understand BERT.</li>
<li><a href="https://huggingface.co/transformers/v2.2.0/index.html">The documentation of the <code class="language-plaintext highlighter-rouge">transformers</code> library</a></li>
<li><a href="http://mccormickml.com/2019/07/22/BERT-fine-tuning/">BERT Fine-Tuning Tutorial with PyTorch</a> by <a href="http://mccormickml.com/">Chris McCormick</a>: A very detailed tutorial showing how to use BERT with the HuggingFace PyTorch library.</li>
</ul>
<h1 id="b---setup">B - Setup</h1>
<h2 id="1-load-essential-libraries">1. Load Essential Libraries</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div></div>
<h2 id="2-dataset">2. Dataset</h2>
<h3 id="21-download-dataset">2.1. Download Dataset</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Download data
</span><span class="kn">import</span> <span class="nn">requests</span>
<span class="n">request</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://drive.google.com/uc?export=download&id=1wHt8PsMLsfX5yNSqrt2fSTcb8LEiclcf"</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"data.zip"</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="nb">file</span><span class="p">:</span>
<span class="nb">file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
<span class="c1"># Unzip data
</span><span class="kn">import</span> <span class="nn">zipfile</span>
<span class="k">with</span> <span class="n">zipfile</span><span class="p">.</span><span class="n">ZipFile</span><span class="p">(</span><span class="s">'data.zip'</span><span class="p">)</span> <span class="k">as</span> <span class="nb">zip</span><span class="p">:</span>
<span class="nb">zip</span><span class="p">.</span><span class="n">extractall</span><span class="p">(</span><span class="s">'data'</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="22-load-train-data">2.2. Load Train Data</h3>
<p>The train data has 2 files, each containing 1700 complaining/non-complaining tweets. Every tweet in the data contains at least one airline hashtag.</p>
<p>We will load the train data and label it. Because we use only the text data for classification, we will drop the unimportant columns and keep only the <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">tweet</code> and <code class="language-plaintext highlighter-rouge">label</code> columns.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># Load data and set labels
</span><span class="n">data_complaint</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/complaint1700.csv'</span><span class="p">)</span>
<span class="n">data_complaint</span><span class="p">[</span><span class="s">'label'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">data_non_complaint</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/noncomplaint1700.csv'</span><span class="p">)</span>
<span class="n">data_non_complaint</span><span class="p">[</span><span class="s">'label'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="c1"># Concatenate complaining and non-complaining data
</span><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">data_complaint</span><span class="p">,</span> <span class="n">data_non_complaint</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Drop 'airline' column
</span><span class="n">data</span><span class="p">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'airline'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Display 5 random samples
</span><span class="n">data</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>tweet</th>
<th>label</th>
</tr>
</thead>
<tbody>
<tr>
<th>1722</th>
<td>2478</td>
<td>Thank you @MichaelRRoy . I am so glad that yo...</td>
<td>1</td>
</tr>
<tr>
<th>1653</th>
<td>52356</td>
<td>Seriously .@united GET YOUR SHIT TOGETHER</td>
<td>0</td>
</tr>
<tr>
<th>930</th>
<td>128102</td>
<td>@SouthwestAir - Yet another delayed flight. Wh...</td>
<td>0</td>
</tr>
<tr>
<th>1975</th>
<td>24242</td>
<td>@AmericanAir yea already did that. They were m...</td>
<td>1</td>
</tr>
<tr>
<th>3053</th>
<td>133225</td>
<td>@DeltaAssist i lost my tickets information an...</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
<p>We will randomly split the entire training data into two sets: a train set with 90% of the data and a validation set with 10% of the data. We will perform hyperparameter tuning using cross-validation on the train set and use the validation set to compare models.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">tweet</span><span class="p">.</span><span class="n">values</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">label</span><span class="p">.</span><span class="n">values</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_val</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_val</span> <span class="o">=</span>\
<span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">2020</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="23-load-test-data">2.3. Load Test Data</h3>
<p>The test data contains 4555 examples with no labels. About 300 examples are non-complaining tweets. Our task is to identify their <code class="language-plaintext highlighter-rouge">id</code> values and manually examine whether our results are correct.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Load test data
</span><span class="n">test_data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/test_data.csv'</span><span class="p">)</span>
<span class="c1"># Keep important columns
</span><span class="n">test_data</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[[</span><span class="s">'id'</span><span class="p">,</span> <span class="s">'tweet'</span><span class="p">]]</span>
<span class="c1"># Display 5 samples from the test data
</span><span class="n">test_data</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>tweet</th>
</tr>
</thead>
<tbody>
<tr>
<th>3353</th>
<td>126992</td>
<td>Lol! Worst reason for a flight delay ever! @us...</td>
</tr>
<tr>
<th>70</th>
<td>2461</td>
<td>@JetBlue you suck. You never emailed tickets ...</td>
</tr>
<tr>
<th>3551</th>
<td>134150</td>
<td>@AmericanAir We're stuck at KDFW headed for KI...</td>
</tr>
<tr>
<th>3200</th>
<td>120820</td>
<td>@AmericanAir hey guys, flight 202 to Boston he...</td>
</tr>
<tr>
<th>3546</th>
<td>134021</td>
<td>Been waiting on my lost bags for quite some ti...</td>
</tr>
</tbody>
</table>
</div>
<h2 id="3-set-up-gpu-for-training">3. Set up GPU for training</h2>
<p>Google Colab offers free GPUs and TPUs. Since we’ll be training a large neural network, it’s best to take advantage of them.</p>
<p>A GPU can be added by going to the menu and selecting:</p>
<p><code class="language-plaintext highlighter-rouge">Runtime -> Change runtime type -> Hardware accelerator: GPU</code></p>
<p>Then we need to run the following cell to specify the GPU as the device.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cuda"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'There are </span><span class="si">{</span><span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">device_count</span><span class="p">()</span><span class="si">}</span><span class="s"> GPU(s) available.'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Device name:'</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">get_device_name</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">'No GPU available, using the CPU instead.'</span><span class="p">)</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cpu"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>There are 1 GPU(s) available.
Device name: Tesla P100-PCIE-16GB
</code></pre></div></div>
<h1 id="c---baseline-tf-idf--naive-bayes-classifier">C - Baseline: TF-IDF + Naive Bayes Classifier</h1>
<p>In this baseline approach, first we will use TF-IDF to vectorize our text data. Then we will use the Naive Bayes model as our classifier.</p>
<p>Why Naive Bayes? I experimented with different machine learning algorithms, including Random Forest, Support Vector Machines and XGBoost, and observed that Naive Bayes yields the best performance. <a href="https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html">Scikit-learn’s guide</a> to choosing the right estimator also suggests Naive Bayes for text data. I also tried using SVD to reduce dimensionality; however, it did not yield better performance.</p>
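<p>As a rough sketch of how such a comparison can be run (not the exact experiment behind the observation above), the snippet below scores a few candidate classifiers with the same cross-validation setup, using the <code class="language-plaintext highlighter-rouge">X_train_tfidf</code> and <code class="language-plaintext highlighter-rouge">y_train</code> variables defined in this notebook:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch of the model comparison described above (hedged):
# the exact models and settings in my experiments may have differed.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "Linear SVM": LinearSVC(random_state=1),
}
kf = StratifiedKFold(5, shuffle=True, random_state=1)
for name, model in candidates.items():
    # roc_auc falls back to decision_function for models without predict_proba
    auc = cross_val_score(model, X_train_tfidf, y_train, scoring="roc_auc", cv=kf)
    print(f"{name}: mean AUC = {auc.mean():.4f}")
</code></pre></div></div>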
<h2 id="1-data-preparation">1. Data Preparation</h2>
<h3 id="11-preprocessing">1.1. Preprocessing</h3>
<p>In the bag-of-words model, a text is represented as the bag of its words, disregarding grammar and word order. Therefore, we will want to remove stop words, punctuation and characters that don’t contribute much to the sentence’s meaning.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="c1"># Uncomment to download "stopwords"
# nltk.download("stopwords")
</span><span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="k">def</span> <span class="nf">text_preprocessing</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="s">"""
- Lowercase the sentence
- Change "'t" to "not"
- Remove "@name"
- Isolate and remove punctuations except "?"
- Remove other special characters
- Remove stop words except "not" and "can"
- Remove trailing whitespace
"""</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span>
<span class="c1"># Change 't to 'not'
</span> <span class="n">s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">"\'t"</span><span class="p">,</span> <span class="s">" not"</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span>
<span class="c1"># Remove @name
</span> <span class="n">s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'(@.*?)[\s]'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span>
<span class="c1"># Isolate and remove punctuations except '?'
</span> <span class="n">s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'([\'\"\.\(\)\!\?\\\/\,])'</span><span class="p">,</span> <span class="sa">r</span><span class="s">' \1 '</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'[^\w\s\?]'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span>
<span class="c1"># Remove some special characters
</span> <span class="n">s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'([\;\:\|•«\n])'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">s</span><span class="p">)</span>
<span class="c1"># Remove stopwords except 'not' and 'can'
</span> <span class="n">s</span> <span class="o">=</span> <span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="n">word</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">s</span><span class="p">.</span><span class="n">split</span><span class="p">()</span>
<span class="k">if</span> <span class="n">word</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="p">.</span><span class="n">words</span><span class="p">(</span><span class="s">'english'</span><span class="p">)</span>
<span class="ow">or</span> <span class="n">word</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'not'</span><span class="p">,</span> <span class="s">'can'</span><span class="p">]])</span>
<span class="c1"># Remove trailing whitespace
</span> <span class="n">s</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'\s+'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">s</span><span class="p">).</span><span class="n">strip</span><span class="p">()</span>
<span class="k">return</span> <span class="n">s</span>
</code></pre></div></div>
<h3 id="12-tf-idf-vectorizer">1.2. TF-IDF Vectorizer</h3>
<p>In information retrieval, <strong>TF-IDF</strong>, short for <strong>term frequency–inverse document frequency</strong>, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. We will use TF-IDF to vectorize our text data before feeding them to machine learning algorithms.</p>
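<p>To make the statistic concrete, here is a toy illustration of how <code class="language-plaintext highlighter-rouge">TfidfVectorizer</code> weights terms. With <code class="language-plaintext highlighter-rouge">smooth_idf=False</code>, scikit-learn computes idf(t) = ln(n/df(t)) + 1, so a term appearing in every document gets the lowest weight. The corpus below is made up for this demo:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy corpus to inspect idf weights (illustration only)
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
vec = TfidfVectorizer(smooth_idf=False)
vec.fit(toy_corpus)

# "the" appears in all 3 documents: idf = ln(3/3) + 1 = 1.00 (least informative)
# "dog" appears in 1 document:      idf = ln(3/1) + 1 = 2.10 (most informative)
for term, idx in sorted(vec.vocabulary_.items()):
    print(f"{term}: idf = {vec.idf_[idx]:.2f}")
</code></pre></div></div>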
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfVectorizer</span>
<span class="c1"># Preprocess text
</span><span class="n">X_train_preprocessed</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">text_preprocessing</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">X_train</span><span class="p">])</span>
<span class="n">X_val_preprocessed</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">text_preprocessing</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">X_val</span><span class="p">])</span>
<span class="c1"># Calculate TF-IDF
</span><span class="n">tf_idf</span> <span class="o">=</span> <span class="n">TfidfVectorizer</span><span class="p">(</span><span class="n">smooth_idf</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">X_train_tfidf</span> <span class="o">=</span> <span class="n">tf_idf</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train_preprocessed</span><span class="p">)</span>
<span class="n">X_val_tfidf</span> <span class="o">=</span> <span class="n">tf_idf</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_val_preprocessed</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 5.31 s, sys: 490 ms, total: 5.8 s
Wall time: 5.83 s
</code></pre></div></div>
<h2 id="2-train-naive-bayes-classifier">2. Train Naive Bayes Classifier</h2>
<h3 id="21-hyperparameter-tuning">2.1. Hyperparameter Tuning</h3>
<p>We will use cross-validation and AUC score to tune hyperparameters of our model. The function <code class="language-plaintext highlighter-rouge">get_auc_CV</code> will return the average AUC score from cross-validation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">StratifiedKFold</span><span class="p">,</span> <span class="n">cross_val_score</span>
<span class="k">def</span> <span class="nf">get_auc_CV</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="s">"""
Return the average AUC score from cross-validation.
"""</span>
<span class="c1"># Set KFold to shuffle data before the split
</span> <span class="n">kf</span> <span class="o">=</span> <span class="n">StratifiedKFold</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Get AUC scores
</span> <span class="n">auc</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span>
<span class="n">model</span><span class="p">,</span> <span class="n">X_train_tfidf</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">"roc_auc"</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">kf</span><span class="p">)</span>
<span class="k">return</span> <span class="n">auc</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">MultinomialNB</code> class has only one hyperparameter, <strong>alpha</strong>. The code below will help us find the alpha value that gives us the highest CV AUC score.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.naive_bayes</span> <span class="kn">import</span> <span class="n">MultinomialNB</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">([</span><span class="n">get_auc_CV</span><span class="p">(</span><span class="n">MultinomialNB</span><span class="p">(</span><span class="n">i</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)],</span>
<span class="n">index</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">))</span>
<span class="n">best_alpha</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">res</span><span class="p">.</span><span class="n">idxmax</span><span class="p">(),</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Best alpha: '</span><span class="p">,</span> <span class="n">best_alpha</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">res</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'AUC vs. Alpha'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Alpha'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'AUC'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Best alpha: 1.8
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/chriskhanhtran/bert-for-sentiment-analysis/master/output_32_1.png" alt="png" /></p>
<h3 id="22-evaluation-on-validation-set">2.2. Evaluation on Validation Set</h3>
<p>To evaluate the performance of our model, we will calculate the accuracy rate and the AUC score of our model on the validation set.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">roc_curve</span><span class="p">,</span> <span class="n">auc</span>
<span class="k">def</span> <span class="nf">evaluate_roc</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">y_true</span><span class="p">):</span>
<span class="s">"""
- Print AUC and accuracy on the test set
- Plot ROC
@params probs (np.array): an array of predicted probabilities with shape (len(y_true), 2)
@params y_true (np.array): an array of the true values with shape (len(y_true),)
"""</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">probs</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">fpr</span><span class="p">,</span> <span class="n">tpr</span><span class="p">,</span> <span class="n">threshold</span> <span class="o">=</span> <span class="n">roc_curve</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span> <span class="n">preds</span><span class="p">)</span>
<span class="n">roc_auc</span> <span class="o">=</span> <span class="n">auc</span><span class="p">(</span><span class="n">fpr</span><span class="p">,</span> <span class="n">tpr</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'AUC: </span><span class="si">{</span><span class="n">roc_auc</span><span class="p">:.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="c1"># Get accuracy over the test set
</span> <span class="n">y_pred</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">preds</span> <span class="o">>=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_true</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Accuracy: </span><span class="si">{</span><span class="n">accuracy</span><span class="o">*</span><span class="mi">100</span><span class="p">:.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">%'</span><span class="p">)</span>
<span class="c1"># Plot ROC AUC
</span> <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Receiver Operating Characteristic'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">fpr</span><span class="p">,</span> <span class="n">tpr</span><span class="p">,</span> <span class="s">'b'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'AUC = %0.2f'</span> <span class="o">%</span> <span class="n">roc_auc</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span> <span class="o">=</span> <span class="s">'lower right'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span><span class="s">'r--'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylim</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'True Positive Rate'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'False Positive Rate'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p>By combining TF-IDF and the Naive Bayes algorithm, we achieve an accuracy rate of <strong>72.65%</strong> on the validation set. This value is the baseline performance and will be used to evaluate the performance of our fine-tuned BERT model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute predicted probabilities
</span><span class="n">nb_model</span> <span class="o">=</span> <span class="n">MultinomialNB</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">1.8</span><span class="p">)</span>
<span class="n">nb_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_tfidf</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">probs</span> <span class="o">=</span> <span class="n">nb_model</span><span class="p">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">X_val_tfidf</span><span class="p">)</span>
<span class="c1"># Evaluate the classifier
</span><span class="n">evaluate_roc</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">y_val</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AUC: 0.8269
Accuracy: 72.65%
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/chriskhanhtran/bert-for-sentiment-analysis/master/output_37_1.png" alt="png" /></p>
<h1 id="d---fine-tuning-bert">D - Fine-tuning BERT</h1>
<h2 id="1-install-the-hugging-face-library">1. Install the Hugging Face Library</h2>
<p>The <code class="language-plaintext highlighter-rouge">transformers</code> library from Hugging Face contains PyTorch implementations of state-of-the-art NLP models, including BERT (from Google) and GPT (from OpenAI), along with pre-trained model weights.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Uncomment the line below to install `transformers`
# !pip install transformers
</span></code></pre></div></div>
<h2 id="2-tokenization-and-input-formatting">2. Tokenization and Input Formatting</h2>
<p>Before tokenizing our text, we will perform some slight processing, including removing entity mentions (eg. @united) and some special characters. The level of processing here is much lighter than in the previous approaches because BERT was trained on entire sentences.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">text_preprocessing</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
<span class="s">"""
- Remove entity mentions (eg. '@united')
- Correct errors (eg. '&amp;' to '&')
@param text (str): a string to be processed.
@return text (Str): the processed string.
"""</span>
<span class="c1"># Remove '@name'
</span> <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'(@.*?)[\s]'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
<span class="c1"># Replace '&amp;' with '&'
</span> <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'&amp;'</span><span class="p">,</span> <span class="s">'&'</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
<span class="c1"># Remove trailing whitespace
</span> <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s">'\s+'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">text</span><span class="p">).</span><span class="n">strip</span><span class="p">()</span>
<span class="k">return</span> <span class="n">text</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Print sentence 0
</span><span class="k">print</span><span class="p">(</span><span class="s">'Original: '</span><span class="p">,</span> <span class="n">X</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Processed: '</span><span class="p">,</span> <span class="n">text_preprocessing</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Original: @united I'm having issues. Yesterday I rebooked for 24 hours after I was supposed to fly, now I can't log on &amp; check in. Can you help?
Processed: I'm having issues. Yesterday I rebooked for 24 hours after I was supposed to fly, now I can't log on & check in. Can you help?
</code></pre></div></div>
<h3 id="21-bert-tokenizer">2.1. BERT Tokenizer</h3>
<p>In order to apply the pre-trained BERT, we must use the tokenizer provided by the library. This is because (1) the model has a specific, fixed vocabulary and (2) the BERT tokenizer has a particular way of handling out-of-vocabulary words.</p>
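<p>To see the subword handling in action, below is a minimal sketch; the exact split is determined by the <code class="language-plaintext highlighter-rouge">bert-base-uncased</code> vocabulary, so the pieces shown in the comment are illustrative:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# Words outside the fixed vocabulary are split into known '##'-prefixed
# WordPiece subwords, e.g. "rebooked" may become ['re', '##book', '##ed']
print(tokenizer.tokenize("I rebooked my flight"))
</code></pre></div></div>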
<p>In addition, we are required to add special tokens to the start and end of each sentence, pad and truncate all sentences to a single constant length, and explicitly indicate which tokens are padding with the “attention mask”.</p>
<p>The <code class="language-plaintext highlighter-rouge">encode_plus</code> method of BERT tokenizer will:</p>
<p>(1) split our text into tokens,</p>
<p>(2) add the special <code class="language-plaintext highlighter-rouge">[CLS]</code> and <code class="language-plaintext highlighter-rouge">[SEP]</code> tokens,</p>
<p>(3) convert these tokens into indexes of the tokenizer vocabulary,</p>
<p>(4) pad or truncate sentences to the max length, and</p>
<p>(5) create attention masks.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">BertTokenizer</span>
<span class="c1"># Load the BERT tokenizer
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">BertTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'bert-base-uncased'</span><span class="p">,</span> <span class="n">do_lower_case</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Create a function to tokenize a set of texts
</span><span class="k">def</span> <span class="nf">preprocessing_for_bert</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
<span class="s">"""Perform required preprocessing steps for pretrained BERT.
@param data (np.array): Array of texts to be processed.
@return input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
@return attention_masks (torch.Tensor): Tensor of indices specifying which
tokens should be attended to by the model.
"""</span>
<span class="c1"># Create empty lists to store outputs
</span> <span class="n">input_ids</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">attention_masks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># For every sentence...
</span> <span class="k">for</span> <span class="n">sent</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
<span class="c1"># `encode_plus` will:
</span> <span class="c1"># (1) Tokenize the sentence
</span> <span class="c1"># (2) Add the `[CLS]` and `[SEP]` token to the start and end
</span> <span class="c1"># (3) Truncate/Pad sentence to max length
</span> <span class="c1"># (4) Map tokens to their IDs
</span> <span class="c1"># (5) Create attention mask
</span> <span class="c1"># (6) Return a dictionary of outputs
</span> <span class="n">encoded_sent</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">encode_plus</span><span class="p">(</span>
<span class="n">text</span><span class="o">=</span><span class="n">text_preprocessing</span><span class="p">(</span><span class="n">sent</span><span class="p">),</span> <span class="c1"># Preprocess sentence
</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="c1"># Add `[CLS]` and `[SEP]`
</span> <span class="n">max_length</span><span class="o">=</span><span class="n">MAX_LEN</span><span class="p">,</span> <span class="c1"># Max length to truncate/pad
</span> <span class="n">pad_to_max_length</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="c1"># Pad sentence to max length
</span> <span class="c1">#return_tensors='pt', # Return PyTorch tensor
</span> <span class="n">return_attention_mask</span><span class="o">=</span><span class="bp">True</span> <span class="c1"># Return attention mask
</span> <span class="p">)</span>
<span class="c1"># Add the outputs to the lists
</span> <span class="n">input_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">encoded_sent</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'input_ids'</span><span class="p">))</span>
<span class="n">attention_masks</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">encoded_sent</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'attention_mask'</span><span class="p">))</span>
<span class="c1"># Convert lists to tensors
</span> <span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
<span class="n">attention_masks</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">attention_masks</span><span class="p">)</span>
<span class="k">return</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_masks</span>
</code></pre></div></div>
<p style="color: red;">
The default version of TensorFlow in Colab will soon switch to TensorFlow 2.x.<br />
We recommend you <a href="https://www.tensorflow.org/guide/migrate" target="_blank">upgrade</a> now
or ensure your notebook will continue to use TensorFlow 1.x via the <code>%tensorflow_version 1.x</code> magic:
<a href="https://colab.research.google.com/notebooks/tensorflow_version.ipynb" target="_blank">more info</a>.</p>
<p>Before tokenizing, we need to specify the maximum length of our sentences.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Concatenate train data and test data
</span><span class="n">all_tweets</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">data</span><span class="p">.</span><span class="n">tweet</span><span class="p">.</span><span class="n">values</span><span class="p">,</span> <span class="n">test_data</span><span class="p">.</span><span class="n">tweet</span><span class="p">.</span><span class="n">values</span><span class="p">])</span>
<span class="c1"># Encode our concatenated data
</span><span class="n">encoded_tweets</span> <span class="o">=</span> <span class="p">[</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">sent</span><span class="p">,</span> <span class="n">add_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">for</span> <span class="n">sent</span> <span class="ow">in</span> <span class="n">all_tweets</span><span class="p">]</span>
<span class="c1"># Find the maximum length
</span><span class="n">max_len</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">sent</span><span class="p">)</span> <span class="k">for</span> <span class="n">sent</span> <span class="ow">in</span> <span class="n">encoded_tweets</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Max length: '</span><span class="p">,</span> <span class="n">max_len</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Max length: 68
</code></pre></div></div>
<p>Although the longest encoded tweet has 68 tokens, we will set <code class="language-plaintext highlighter-rouge">MAX_LEN</code> to 64, truncating any tweets longer than that. Now let’s tokenize our data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Specify `MAX_LEN`
</span><span class="n">MAX_LEN</span> <span class="o">=</span> <span class="mi">64</span>
<span class="c1"># Print sentence 0 and its encoded token ids
</span><span class="n">token_ids</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">preprocessing_for_bert</span><span class="p">([</span><span class="n">X</span><span class="p">[</span><span class="mi">0</span><span class="p">]])[</span><span class="mi">0</span><span class="p">].</span><span class="n">squeeze</span><span class="p">().</span><span class="n">numpy</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Original: '</span><span class="p">,</span> <span class="n">X</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Token IDs: '</span><span class="p">,</span> <span class="n">token_ids</span><span class="p">)</span>
<span class="c1"># Run function `preprocessing_for_bert` on the train set and the validation set
</span><span class="k">print</span><span class="p">(</span><span class="s">'Tokenizing data...'</span><span class="p">)</span>
<span class="n">train_inputs</span><span class="p">,</span> <span class="n">train_masks</span> <span class="o">=</span> <span class="n">preprocessing_for_bert</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">val_inputs</span><span class="p">,</span> <span class="n">val_masks</span> <span class="o">=</span> <span class="n">preprocessing_for_bert</span><span class="p">(</span><span class="n">X_val</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Original: @united I'm having issues. Yesterday I rebooked for 24 hours after I was supposed to fly, now I can't log on &amp; check in. Can you help?
Token IDs: [101, 1045, 1005, 1049, 2383, 3314, 1012, 7483, 1045, 2128, 8654, 2098, 2005, 2484, 2847, 2044, 1045, 2001, 4011, 2000, 4875, 1010, 2085, 1045, 2064, 1005, 1056, 8833, 2006, 1004, 4638, 1999, 1012, 2064, 2017, 2393, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Tokenizing data...
</code></pre></div></div>
<h3 id="22-create-pytorch-dataloader">2.2. Create PyTorch DataLoader</h3>
<p>We will create an iterator for our dataset using the torch DataLoader class. This helps save memory during training and boosts training speed.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">TensorDataset</span><span class="p">,</span> <span class="n">DataLoader</span><span class="p">,</span> <span class="n">RandomSampler</span><span class="p">,</span> <span class="n">SequentialSampler</span>
<span class="c1"># Convert other data types to torch.Tensor
</span><span class="n">train_labels</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y_train</span><span class="p">)</span>
<span class="n">val_labels</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y_val</span><span class="p">)</span>
<span class="c1"># For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
</span><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">32</span>
<span class="c1"># Create the DataLoader for our training set
</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">train_inputs</span><span class="p">,</span> <span class="n">train_masks</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">)</span>
<span class="n">train_sampler</span> <span class="o">=</span> <span class="n">RandomSampler</span><span class="p">(</span><span class="n">train_data</span><span class="p">)</span>
<span class="n">train_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">train_data</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">train_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>
<span class="c1"># Create the DataLoader for our validation set
</span><span class="n">val_data</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">val_inputs</span><span class="p">,</span> <span class="n">val_masks</span><span class="p">,</span> <span class="n">val_labels</span><span class="p">)</span>
<span class="n">val_sampler</span> <span class="o">=</span> <span class="n">SequentialSampler</span><span class="p">(</span><span class="n">val_data</span><span class="p">)</span>
<span class="n">val_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">val_data</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">val_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">)</span>
</code></pre></div></div>
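<p>As a quick sanity check, we can pull one batch from the loader and inspect its shapes (a minimal sketch; the exact values depend on your own split):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Fetch a single batch and verify the tensor shapes
b_input_ids, b_attn_mask, b_labels = next(iter(train_dataloader))
print(b_input_ids.shape)   # (batch_size, MAX_LEN), i.e. torch.Size([32, 64])
print(b_attn_mask.shape)   # same shape as the input IDs
print(b_labels.shape)      # (batch_size,)
</code></pre></div></div>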
<h2 id="3-train-our-model">3. Train Our Model</h2>
<h3 id="31-create-bertclassifier">3.1. Create BertClassifier</h3>
<p>BERT-base consists of 12 transformer layers. Each layer takes in a list of token embeddings and produces the same number of embeddings with the same hidden size (768 dimensions) at the output. The output of the final transformer layer at the <code class="language-plaintext highlighter-rouge">[CLS]</code> token is used as the features of the sequence to feed a classifier.</p>
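<p>To make the shapes concrete, here is a minimal sketch of extracting the <code class="language-plaintext highlighter-rouge">[CLS]</code> embedding, assuming <code class="language-plaintext highlighter-rouge">bert-base-uncased</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

# "Hello world" encodes to 4 token IDs: [CLS] hello world [SEP]
input_ids = torch.tensor([tokenizer.encode("Hello world", add_special_tokens=True)])
with torch.no_grad():
    outputs = bert(input_ids=input_ids)
print(outputs[0].shape)           # (batch_size, seq_len, 768): one embedding per token
print(outputs[0][:, 0, :].shape)  # (batch_size, 768): the [CLS] embedding
</code></pre></div></div>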
<p>The <code class="language-plaintext highlighter-rouge">transformers</code> library has the <a href="https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#bertforsequenceclassification"><code class="language-plaintext highlighter-rouge">BertForSequenceClassification</code></a> class which is designed for classification tasks. However, we will create a new class so we can specify our own choice of classifiers.</p>
<p>Below we will create a BertClassifier class with a BERT model to extract the last hidden layer of the <code class="language-plaintext highlighter-rouge">[CLS]</code> token and a single-hidden-layer feed-forward neural network as our classifier.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">BertModel</span>
<span class="c1"># Create the BertClassfier class
</span><span class="k">class</span> <span class="nc">BertClassifier</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
<span class="s">"""Bert Model for Classification Tasks.
"""</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">freeze_bert</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
<span class="s">"""
@param bert: a BertModel object
@param classifier: a torch.nn.Module classifier
@param freeze_bert (bool): Set `False` to fine-tune the BERT model
"""</span>
<span class="nb">super</span><span class="p">(</span><span class="n">BertClassifier</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
<span class="c1"># Specify hidden size of BERT, hidden size of our classifier, and number of labels
</span> <span class="n">D_in</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">D_out</span> <span class="o">=</span> <span class="mi">768</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">2</span>
<span class="c1"># Instantiate BERT model
</span> <span class="bp">self</span><span class="p">.</span><span class="n">bert</span> <span class="o">=</span> <span class="n">BertModel</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'bert-base-uncased'</span><span class="p">)</span>
<span class="c1"># Instantiate an one-layer feed-forward classifier
</span> <span class="bp">self</span><span class="p">.</span><span class="n">classifier</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
<span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">D_in</span><span class="p">,</span> <span class="n">H</span><span class="p">),</span>
<span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">(),</span>
<span class="c1">#nn.Dropout(0.5),
</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">H</span><span class="p">,</span> <span class="n">D_out</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># Freeze the BERT model
</span> <span class="k">if</span> <span class="n">freeze_bert</span><span class="p">:</span>
<span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">bert</span><span class="p">.</span><span class="n">parameters</span><span class="p">():</span>
<span class="n">param</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">):</span>
<span class="s">"""
Feed input to BERT and the classifier to compute logits.
@param input_ids (torch.Tensor): an input tensor with shape (batch_size,
max_length)
@param attention_mask (torch.Tensor): a tensor that hold attention mask
information with shape (batch_size, max_length)
@return logits (torch.Tensor): an output tensor with shape (batch_size,
num_labels)
"""</span>
<span class="c1"># Feed input to BERT
</span> <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">bert</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span>
<span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">)</span>
<span class="c1"># Extract the last hidden state of the token `[CLS]` for classification task
</span> <span class="n">last_hidden_state_cls</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">][:,</span> <span class="mi">0</span><span class="p">,</span> <span class="p">:]</span>
<span class="c1"># Feed input to classifier to compute logits
</span> <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">classifier</span><span class="p">(</span><span class="n">last_hidden_state_cls</span><span class="p">)</span>
<span class="k">return</span> <span class="n">logits</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CPU times: user 0 ns, sys: 46 µs, total: 46 µs
Wall time: 49.4 µs
</code></pre></div></div>
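<p>Before training, we can sanity-check the classifier with a dummy forward pass (a minimal sketch; the random IDs below are placeholders, not real tokens):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Smoke test: a dummy batch should produce logits of shape (batch_size, 2)
model = BertClassifier(freeze_bert=True)
dummy_ids = torch.randint(0, 30522, (2, 64))    # 30522 is the bert-base-uncased vocab size
dummy_mask = torch.ones(2, 64, dtype=torch.long)
with torch.no_grad():
    print(model(dummy_ids, dummy_mask).shape)   # torch.Size([2, 2])
</code></pre></div></div>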
<h3 id="32-optimizer--learning-rate-scheduler">3.2. Optimizer & Learning Rate Scheduler</h3>
<p>To fine-tune our Bert Classifier, we need to create an optimizer. The authors recommend the following hyperparameters:</p>
<ul>
<li>Batch size: 16 or 32</li>
<li>Learning rate (Adam): 5e-5, 3e-5 or 2e-5</li>
<li>Number of epochs: 2, 3, 4</li>
</ul>
<p>Hugging Face provides the <a href="https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109">run_glue.py</a> script, an example of using the <code class="language-plaintext highlighter-rouge">transformers</code> library. In this script, the AdamW optimizer is used.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AdamW</span><span class="p">,</span> <span class="n">get_linear_schedule_with_warmup</span>
<span class="k">def</span> <span class="nf">initialize_model</span><span class="p">(</span><span class="n">epochs</span><span class="o">=</span><span class="mi">4</span><span class="p">):</span>
<span class="s">"""Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
"""</span>
<span class="c1"># Instantiate Bert Classifier
</span> <span class="n">bert_classifier</span> <span class="o">=</span> <span class="n">BertClassifier</span><span class="p">(</span><span class="n">freeze_bert</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># Tell PyTorch to run the model on GPU
</span> <span class="n">bert_classifier</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="c1"># Create the optimizer
</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">AdamW</span><span class="p">(</span><span class="n">bert_classifier</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span>
<span class="n">lr</span><span class="o">=</span><span class="mf">5e-5</span><span class="p">,</span> <span class="c1"># Default learning rate
</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span> <span class="c1"># Default epsilon value
</span> <span class="p">)</span>
<span class="c1"># Total number of training steps
</span> <span class="n">total_steps</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">)</span> <span class="o">*</span> <span class="n">epochs</span>
<span class="c1"># Set up the learning rate scheduler
</span> <span class="n">scheduler</span> <span class="o">=</span> <span class="n">get_linear_schedule_with_warmup</span><span class="p">(</span><span class="n">optimizer</span><span class="p">,</span>
<span class="n">num_warmup_steps</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="c1"># Default value
</span> <span class="n">num_training_steps</span><span class="o">=</span><span class="n">total_steps</span><span class="p">)</span>
<span class="k">return</span> <span class="n">bert_classifier</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">scheduler</span>
</code></pre></div></div>
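<p>With <code class="language-plaintext highlighter-rouge">num_warmup_steps=0</code>, <code class="language-plaintext highlighter-rouge">get_linear_schedule_with_warmup</code> simply decays the learning rate linearly from its initial value to 0 over the training steps. The helper and step counts below are a hypothetical illustration of that schedule, not part of the training code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: learning rate after `step` optimizer steps under linear decay
def linear_lr(step, total_steps, base_lr=5e-5):
    return base_lr * max(0.0, (total_steps - step) / total_steps)

# With, say, 95 batches per epoch and 4 epochs, total_steps = 380
for step in [0, 95, 190, 285, 380]:
    print(f"step {step:3d}: lr = {linear_lr(step, 380):.2e}")
</code></pre></div></div>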
<h3 id="33-training-loop">3.3. Training Loop</h3>
<p>We will train our Bert Classifier for 4 epochs. In each epoch, we will train the model and evaluate its performance on the validation set. In more detail, we will:</p>
<p>Training:</p>
<ul>
<li>Unpack our data from the dataloader and load the data onto the GPU</li>
<li>Zero out gradients calculated in the previous pass</li>
<li>Perform a forward pass to compute logits and loss</li>
<li>Perform a backward pass to compute gradients (<code class="language-plaintext highlighter-rouge">loss.backward()</code>)</li>
<li>Clip the norm of the gradients to 1.0 to prevent “exploding gradients”</li>
<li>Update the model’s parameters (<code class="language-plaintext highlighter-rouge">optimizer.step()</code>)</li>
<li>Update the learning rate (<code class="language-plaintext highlighter-rouge">scheduler.step()</code>)</li>
</ul>
<p>Evaluation:</p>
<ul>
<li>Unpack our data and load onto the GPU</li>
<li>Forward pass</li>
<li>Compute loss and accuracy rate over the validation set</li>
</ul>
<p>The script below is commented with the details of our training and evaluation loop.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="c1"># Specify loss function
</span><span class="n">loss_fn</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">set_seed</span><span class="p">(</span><span class="n">seed_value</span><span class="o">=</span><span class="mi">42</span><span class="p">):</span>
<span class="s">"""Set seed for reproducibility.
"""</span>
<span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">manual_seed_all</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">train_dataloader</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">evaluation</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
<span class="s">"""Train the BertClassifier model.
"""</span>
<span class="c1"># Start training loop
</span> <span class="k">print</span><span class="p">(</span><span class="s">"Start training...</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">epoch_i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
<span class="c1"># =======================================
</span> <span class="c1"># Training
</span> <span class="c1"># =======================================
</span> <span class="c1"># Print the header of the result table
</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'Epoch'</span><span class="p">:</span><span class="o">^</span><span class="mi">7</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'Batch'</span><span class="p">:</span><span class="o">^</span><span class="mi">7</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'Train Loss'</span><span class="p">:</span><span class="o">^</span><span class="mi">12</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'Val Loss'</span><span class="p">:</span><span class="o">^</span><span class="mi">10</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'Val Acc'</span><span class="p">:</span><span class="o">^</span><span class="mi">9</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'Elapsed'</span><span class="p">:</span><span class="o">^</span><span class="mi">9</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"-"</span><span class="o">*</span><span class="mi">70</span><span class="p">)</span>
<span class="c1"># Measure the elapsed time of each epoch
</span> <span class="n">t0_epoch</span><span class="p">,</span> <span class="n">t0_batch</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">(),</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="c1"># Reset tracking variables at the beginning of each epoch
</span> <span class="n">total_loss</span><span class="p">,</span> <span class="n">batch_loss</span><span class="p">,</span> <span class="n">batch_counts</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span>
<span class="c1"># Put the model into the training mode
</span> <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
<span class="c1"># For each batch of training data...
</span> <span class="k">for</span> <span class="n">step</span><span class="p">,</span> <span class="n">batch</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">):</span>
<span class="n">batch_counts</span> <span class="o">+=</span><span class="mi">1</span>
<span class="c1"># Load batch to GPU
</span> <span class="n">b_input_ids</span><span class="p">,</span> <span class="n">b_attn_mask</span><span class="p">,</span> <span class="n">b_labels</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">)</span>
<span class="c1"># Zero out any previously calculated gradients
</span> <span class="n">model</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="c1"># Perform a forward pass. This will return logits.
</span> <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">b_input_ids</span><span class="p">,</span> <span class="n">b_attn_mask</span><span class="p">)</span>
<span class="c1"># Compute loss and accumulate the loss values
</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">b_labels</span><span class="p">)</span>
<span class="n">batch_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
<span class="n">total_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
<span class="c1"># Perform a backward pass to calculate gradients
</span> <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
<span class="c1"># Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="c1"># Update parameters and the learning rate
</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
<span class="n">scheduler</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
<span class="c1"># Print the loss values and time elapsed for every 20 batches
</span> <span class="k">if</span> <span class="p">(</span><span class="n">step</span> <span class="o">%</span> <span class="mi">20</span> <span class="o">==</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">step</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="ow">or</span> <span class="p">(</span><span class="n">step</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
<span class="c1"># Calculate time elapsed for 20 batches
</span> <span class="n">time_elapsed</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0_batch</span>
<span class="c1"># Print training results
</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">epoch_i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">:</span><span class="o">^</span><span class="mi">7</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">step</span><span class="p">:</span><span class="o">^</span><span class="mi">7</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">batch_loss</span> <span class="o">/</span> <span class="n">batch_counts</span><span class="p">:</span><span class="o">^</span><span class="mf">12.6</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'-'</span><span class="p">:</span><span class="o">^</span><span class="mi">10</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'-'</span><span class="p">:</span><span class="o">^</span><span class="mi">9</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">time_elapsed</span><span class="p">:</span><span class="o">^</span><span class="mf">9.2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="c1"># Reset batch tracking variables
</span> <span class="n">batch_loss</span><span class="p">,</span> <span class="n">batch_counts</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span>
<span class="n">t0_batch</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="c1"># Calculate the average loss over the entire training data
</span> <span class="n">avg_train_loss</span> <span class="o">=</span> <span class="n">total_loss</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"-"</span><span class="o">*</span><span class="mi">70</span><span class="p">)</span>
<span class="c1"># =======================================
</span> <span class="c1"># Evaluation
</span> <span class="c1"># =======================================
</span> <span class="k">if</span> <span class="n">evaluation</span> <span class="o">==</span> <span class="bp">True</span><span class="p">:</span>
<span class="c1"># After the completion of each training epoch, measure the model's performance
</span> <span class="c1"># on our validation set.
</span> <span class="n">val_loss</span><span class="p">,</span> <span class="n">val_accuracy</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">)</span>
<span class="c1"># Print performance over the entire training data
</span> <span class="n">time_elapsed</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0_epoch</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">epoch_i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">:</span><span class="o">^</span><span class="mi">7</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="s">'-'</span><span class="p">:</span><span class="o">^</span><span class="mi">7</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">avg_train_loss</span><span class="p">:</span><span class="o">^</span><span class="mf">12.6</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">val_loss</span><span class="p">:</span><span class="o">^</span><span class="mf">10.6</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">val_accuracy</span><span class="p">:</span><span class="o">^</span><span class="mf">9.2</span><span class="n">f</span><span class="si">}</span><span class="s"> | </span><span class="si">{</span><span class="n">time_elapsed</span><span class="p">:</span><span class="o">^</span><span class="mf">9.2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"-"</span><span class="o">*</span><span class="mi">70</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Training complete!"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">):</span>
<span class="s">"""After the completion of each training epoch, measure the model's performance
on our validation set.
"""</span>
<span class="c1"># Put the model into the evaluation mode. The dropout layers are disabled during
</span> <span class="c1"># the test time.
</span> <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="c1"># Tracking variables
</span> <span class="n">val_accuracy</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">val_loss</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># For each batch in our validation set...
</span> <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">val_dataloader</span><span class="p">:</span>
<span class="c1"># Load batch to GPU
</span> <span class="n">b_input_ids</span><span class="p">,</span> <span class="n">b_attn_mask</span><span class="p">,</span> <span class="n">b_labels</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">)</span>
<span class="c1"># Compute logits
</span> <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">b_input_ids</span><span class="p">,</span> <span class="n">b_attn_mask</span><span class="p">)</span>
<span class="c1"># Compute loss
</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_fn</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">b_labels</span><span class="p">)</span>
<span class="n">val_loss</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">())</span>
<span class="c1"># Get the predictions
</span> <span class="n">preds</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">flatten</span><span class="p">()</span>
<span class="c1"># Calculate the accuracy rate
</span> <span class="n">accuracy</span> <span class="o">=</span> <span class="p">(</span><span class="n">preds</span> <span class="o">==</span> <span class="n">b_labels</span><span class="p">).</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span>
<span class="n">val_accuracy</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">accuracy</span><span class="p">)</span>
<span class="c1"># Compute the average accuracy and loss over the validation set.
</span> <span class="n">val_loss</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">val_loss</span><span class="p">)</span>
<span class="n">val_accuracy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">val_accuracy</span><span class="p">)</span>
<span class="k">return</span> <span class="n">val_loss</span><span class="p">,</span> <span class="n">val_accuracy</span>
</code></pre></div></div>
<p>Now, let’s start training our BertClassifier!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span> <span class="c1"># Set seed for reproducibility
</span><span class="n">bert_classifier</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">scheduler</span> <span class="o">=</span> <span class="n">initialize_model</span><span class="p">(</span><span class="n">epochs</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">train</span><span class="p">(</span><span class="n">bert_classifier</span><span class="p">,</span> <span class="n">train_dataloader</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">evaluation</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Start training...
Epoch | Batch | Train Loss | Val Loss | Val Acc | Elapsed
----------------------------------------------------------------------
1 | 20 | 0.653637 | - | - | 4.99
1 | 40 | 0.517290 | - | - | 4.73
1 | 60 | 0.502695 | - | - | 4.68
1 | 80 | 0.495539 | - | - | 4.68
1 | 95 | 0.490748 | - | - | 3.44
----------------------------------------------------------------------
1 | - | 0.535397 | 0.466385 | 79.09 | 23.22
----------------------------------------------------------------------
Epoch | Batch | Train Loss | Val Loss | Val Acc | Elapsed
----------------------------------------------------------------------
2 | 20 | 0.336384 | - | - | 4.94
2 | 40 | 0.277895 | - | - | 4.76
2 | 60 | 0.314162 | - | - | 4.74
2 | 80 | 0.307749 | - | - | 4.71
2 | 95 | 0.307835 | - | - | 3.44
----------------------------------------------------------------------
2 | - | 0.309143 | 0.440339 | 82.56 | 23.30
----------------------------------------------------------------------
Training complete!
</code></pre></div></div>
<h3 id="34-evaluation-on-validation-set">3.4. Evaluation on Validation Set</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute predicted probabilities on the test set
# Please initialize function `bert_predict` by running the first cell in Section 4.2.
</span><span class="n">probs</span> <span class="o">=</span> <span class="n">bert_predict</span><span class="p">(</span><span class="n">bert_classifier</span><span class="p">,</span> <span class="n">val_dataloader</span><span class="p">)</span>
<span class="c1"># Evaluate the Bert classifier
</span><span class="n">evaluate_roc</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">y_val</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AUC: 0.9006
Accuracy: 82.65%
</code></pre></div></div>
<p><img src="https://raw.githubusercontent.com/chriskhanhtran/bert-for-sentiment-analysis/master/output_69_1.png" alt="png" /></p>
<p>The BERT classifier achieves an AUC of 0.90 and an accuracy of 82.65% on the validation set. This result is 10 points better than the baseline method.</p>
<h3 id="35-train-our-model-on-the-entire-training-data">3.5. Train Our Model on the Entire Training Data</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Concatenate the train set and the validation set
</span><span class="n">full_train_data</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">ConcatDataset</span><span class="p">([</span><span class="n">train_data</span><span class="p">,</span> <span class="n">val_data</span><span class="p">])</span>
<span class="n">full_train_sampler</span> <span class="o">=</span> <span class="n">RandomSampler</span><span class="p">(</span><span class="n">full_train_data</span><span class="p">)</span>
<span class="n">full_train_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">full_train_data</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">full_train_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
<span class="c1"># Train the Bert Classifier on the entire training data
</span><span class="n">set_seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">bert_classifier</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">scheduler</span> <span class="o">=</span> <span class="n">initialize_model</span><span class="p">(</span><span class="n">epochs</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">train</span><span class="p">(</span><span class="n">bert_classifier</span><span class="p">,</span> <span class="n">full_train_dataloader</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Start training...
Epoch | Batch | Train Loss | Val Loss | Val Acc | Elapsed
----------------------------------------------------------------------
1 | 20 | 0.638744 | - | - | 4.90
1 | 40 | 0.545325 | - | - | 4.68
1 | 60 | 0.487211 | - | - | 4.66
1 | 80 | 0.516911 | - | - | 4.66
1 | 100 | 0.413083 | - | - | 4.65
1 | 106 | 0.359597 | - | - | 1.27
----------------------------------------------------------------------
Epoch | Batch | Train Loss | Val Loss | Val Acc | Elapsed
----------------------------------------------------------------------
2 | 20 | 0.286960 | - | - | 4.89
2 | 40 | 0.269116 | - | - | 4.65
2 | 60 | 0.235394 | - | - | 4.67
2 | 80 | 0.280183 | - | - | 4.65
2 | 100 | 0.299446 | - | - | 4.64
2 | 106 | 0.292475 | - | - | 1.25
----------------------------------------------------------------------
Training complete!
</code></pre></div></div>
<h2 id="4-predictions-on-test-set">4. Predictions on Test Set</h2>
<h3 id="41-data-preparation">4.1. Data Preparation</h3>
<p>Let’s take a quick look at our test set.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test_data</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>tweet</th>
</tr>
</thead>
<tbody>
<tr>
<th>1921</th>
<td>74037</td>
<td>No Wi-Fi on a plane nowadays is just unaccepta...</td>
</tr>
<tr>
<th>133</th>
<td>5103</td>
<td>@AmericanAir how is it that 2 passengers miss ...</td>
</tr>
<tr>
<th>1296</th>
<td>50793</td>
<td>Arbitration board issues decision on joint con...</td>
</tr>
<tr>
<th>1771</th>
<td>68130</td>
<td>@AngieTheo14 @ToniVeltri @AmericanAir @JetBlue...</td>
</tr>
<tr>
<th>21</th>
<td>620</td>
<td>.@richardbranson .@rmchrQB .@VirginAmerica Air...</td>
</tr>
</tbody>
</table>
</div>
<p>Before making predictions on the test set, we need to apply the same processing and encoding steps used on the training data. Fortunately, we have already written the <code class="language-plaintext highlighter-rouge">preprocessing_for_bert</code> function to do that for us.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Run `preprocessing_for_bert` on the test set
</span><span class="k">print</span><span class="p">(</span><span class="s">'Tokenizing data...'</span><span class="p">)</span>
<span class="n">test_inputs</span><span class="p">,</span> <span class="n">test_masks</span> <span class="o">=</span> <span class="n">preprocessing_for_bert</span><span class="p">(</span><span class="n">test_data</span><span class="p">.</span><span class="n">tweet</span><span class="p">)</span>
<span class="c1"># Create the DataLoader for our test set
</span><span class="n">test_dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">test_inputs</span><span class="p">,</span> <span class="n">test_masks</span><span class="p">)</span>
<span class="n">test_sampler</span> <span class="o">=</span> <span class="n">SequentialSampler</span><span class="p">(</span><span class="n">test_dataset</span><span class="p">)</span>
<span class="n">test_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">test_dataset</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">test_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tokenizing data...
</code></pre></div></div>
<h3 id="42-predictions">4.2. Predictions</h3>
<p>The prediction step is similar to the evaluation step that we did in the training loop, but simpler. We will perform a forward pass to compute logits and apply softmax to calculate probabilities.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="k">def</span> <span class="nf">bert_predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">test_dataloader</span><span class="p">):</span>
<span class="s">"""Perform a forward pass on the trained BERT model to predict probabilities
on the test set.
"""</span>
<span class="c1"># Put the model into the evaluation mode. The dropout layers are disabled during
</span> <span class="c1"># the test time.
</span> <span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">all_logits</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># For each batch in our test set...
</span> <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">test_dataloader</span><span class="p">:</span>
<span class="c1"># Load batch to GPU
</span> <span class="n">b_input_ids</span><span class="p">,</span> <span class="n">b_attn_mask</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">batch</span><span class="p">)[:</span><span class="mi">2</span><span class="p">]</span>
<span class="c1"># Compute logits
</span> <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">b_input_ids</span><span class="p">,</span> <span class="n">b_attn_mask</span><span class="p">)</span>
<span class="n">all_logits</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span>
<span class="c1"># Concatenate logits from each batch
</span> <span class="n">all_logits</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">all_logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># Apply softmax to calculate probabilities
</span> <span class="n">probs</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">all_logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">cpu</span><span class="p">().</span><span class="n">numpy</span><span class="p">()</span>
<span class="k">return</span> <span class="n">probs</span>
</code></pre></div></div>
<p>There are about 300 non-negative tweets in our test set, so we will adjust the decision threshold until the model predicts roughly that many tweets as non-negative.</p>
<p>The threshold we will use is 0.992, meaning that tweets with a predicted probability greater than 99.2% will be predicted non-negative. This value is very high compared to the default threshold of 0.5.</p>
<p>After manually examining the test set, I realized that the sentiment classification task here is difficult even for humans. Therefore, a high threshold gives us safe, high-precision predictions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute predicted probabilities on the test set
</span><span class="n">probs</span> <span class="o">=</span> <span class="n">bert_predict</span><span class="p">(</span><span class="n">bert_classifier</span><span class="p">,</span> <span class="n">test_dataloader</span><span class="p">)</span>
<span class="c1"># Get predictions from the probabilities
</span><span class="n">threshold</span> <span class="o">=</span> <span class="mf">0.992</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">probs</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="n">threshold</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="c1"># Number of tweets predicted non-negative
</span><span class="k">print</span><span class="p">(</span><span class="s">"Number of tweets predicted non-negative: "</span><span class="p">,</span> <span class="n">preds</span><span class="p">.</span><span class="nb">sum</span><span class="p">())</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Number of tweets predicted non-negative: 298
</code></pre></div></div>
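<p>For reference, below is a minimal sketch of how such a threshold could be searched for automatically rather than by hand. This helper is not part of the original notebook; it assumes the <code class="language-plaintext highlighter-rouge">probs</code> array returned by <code class="language-plaintext highlighter-rouge">bert_predict</code> and a target count of about 300.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def find_threshold(probs, target_count=300):
    """Hypothetical helper: return the lowest threshold at which the number
    of tweets predicted non-negative does not exceed target_count."""
    for t in np.arange(0.5, 1.0, 0.001):
        # Count tweets whose non-negative probability clears this threshold
        if (probs[:, 1] > t).sum() &lt;= target_count:
            return t
    return 0.999

# threshold = find_threshold(probs)  # should land near the 0.992 used above
</code></pre></div></div>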
<p>Now we will examine 20 random tweets from our predictions. Seventeen of them are correct, suggesting that the BERT classifier achieves a precision of about 0.85 at this threshold (a quick calculation follows the sample below).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">output</span> <span class="o">=</span> <span class="n">test_data</span><span class="p">[</span><span class="n">preds</span><span class="o">==</span><span class="mi">1</span><span class="p">]</span>
<span class="nb">list</span><span class="p">(</span><span class="n">output</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">).</span><span class="n">tweet</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>["@iChrisLehman @SouthwestAir Aw poor little one. I'm sure parents will make it better. Kisses and hugs and lots of encouragement!",
'Props to @JetBlue for doing this for the two police officers killed in New York City. #PoliceLivesMatter http://t.co/U3Ff6pXzdC',
"@SilverJames_ I say do whatever gets you there :-) Can't wait to see you! @SouthwestAir",
'@united not angry disappointed two trips in a row. May cancel my next one and try delta',
'@drey38 @united AND you are missing the holiday bowl! #unitedhatesamericans',
"@JetBlue now just another airline. :/. Can't wait for @SouthwestAir to provide direct/nonstop from BOS to SJU.",
"Ahhhh @JetBlue I've missed you!!! Hopefully making it back to NY tonight ",
"On the bright side, I don't have to pay to check my bag to Portland tomorrow because @united lost it!",
'@JetBlue Thank you so much for making my experience as a person w/health issues so pleasant!',
"@JetBlue has NBCSports on this flight which means I won't miss the @NYRangers game!!! #YesYesYes",
'Wow @JetBlue flt 1199 to #Orlando, entire TV system down. Plane full of kids. Thanks - real value add service. #fail',
"Can't wait to see what happens next.@NYNYVegas is taking over @SouthwestAir today with #spreadtheluck flights. #Vegas #PR",
'check out @americanair in the " in case you get lost " section of our #website #travel #airlines http://t.co/X2WMcoQdgt',
'Everyone is gonna hate @AmericanAir if Jerome gets arrested #AmericanAirlinesCHILLOUT',
'Waiting to board a @SouthwestAir plane to #Pittsburgh. I feel kinda sa being #55 in my group. Feels like being picked last in HS gym class.',
"I can't get over how much different it is flying @VirginAmerica than any other airline! I Love it! I can't wait to be home (for a week) _",
'Wait up @AmericanAir',
'@richeisen @united Rich, fly @AmericanAir zero problems &amp; I fly weekly! #firstworldproblems',
'@AlaskaAir what happens if I miss my connection in Seattle #alaska687',
'@adamrides After @VirginAmerica they are my top choice. Welcome to Va. Thanks for bringing the bad weather, again.']
</code></pre></div></div>
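<p>As a quick sanity check, the precision estimate from this manual audit is simply the fraction of correctly classified tweets among the sampled ones. The sketch below is illustrative only; the counts are filled in from the manual inspection above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Manual audit: 17 of the 20 sampled tweets were judged correct.
n_sampled = 20
n_correct = 17  # from manually reading the sampled tweets above
print(f"Estimated precision: {n_correct / n_sampled:.2f}")  # 0.85
</code></pre></div></div>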
<h1 id="e---conclusion">E - Conclusion</h1>
<p>By adding a simple one-hidden-layer neural network classifier on top of BERT and fine-tuning BERT, we can achieve near state-of-the-art performance, about 10 points better than the baseline method, even though we only have 3,400 data points.</p>
<p>In addition, although BERT is very large and complicated and has millions of parameters, we only need to fine-tune it for 2-4 epochs. This is possible because BERT was pre-trained on a huge amount of text and already encodes a lot of information about our language. Such impressive performance, achieved in a short amount of time and with a small amount of data, shows why BERT is one of the most powerful NLP models available at the moment.</p>