Paper resources

Objective

Why do we need augmentations?

  • To generalize better, we need larger and more comprehensive datasets, but collecting and labelling them is laborious, so augmentation becomes an attractive way to introduce more examples for the model to consume.

What are the different kinds of augmentations used in NLP?

  • To improve machine translation, researchers have tried substituting common words with rare words, thereby providing more context for the rare words.
  • Some researchers have tried replacing words with their synonyms for tweet classification.
  • Randomly swapping two words in a sentence.
  • Randomly deleting a word from a sentence, and many more.

What is the novel idea presented in the paper?

The AEDA method proposes randomly inserting punctuation marks into the sentence as an augmentation that introduces noise. The authors report improved performance on text classification tasks.

Can you share an example of how this augmentation would work?

Original Text:

Appropriate given recent events


Augmented Text:

Appropriate given ; recent events

Appropriate ? given : recent events

Appropriate given , recent events

How many punctuation marks are inserted?

  • Between 1 and n/3, where n is the length of the sentence.

Why one-third of the sentence length?

The authors mention that they want to increase the complexity of the sentence but don't want to add so many punctuation marks that they interfere with its semantic meaning.

At which positions should we insert these punctuation marks?

  • The authors inserted them at random positions in the sentence.

What are the different punctuation marks used?

. ; ? : ! ,
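
Putting the last three answers together, here is a minimal string-level sketch of the insertion step (the function name is ours; the logic mirrors what we will later wrap into a fast.ai transform):

import random

PUNCTUATIONS = ['.', ',', '!', '?', ';', ':']

def insert_punctuation_marks(sentence, punc_ratio=0.3):
    words = sentence.split(' ')
    # how many marks to insert: between 1 and punc_ratio * sentence length
    q  = random.randint(1, int(punc_ratio * len(words) + 1))
    # which positions get a mark inserted in front of them
    qs = random.sample(range(len(words)), q)
    new_words = []
    for j, word in enumerate(words):
        if j in qs:
            new_words.append(random.choice(PUNCTUATIONS))
        new_words.append(word)
    return ' '.join(new_words)

insert_punctuation_marks('Appropriate given recent events')
# e.g. 'Appropriate ; given recent events'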

Why does AEDA work better compared to EDA?

  • EDA proposes synonym replacement, random insertion, random swap, and random deletion. These modifications can change the semantic meaning of the text.
  • AEDA, on the other hand, only introduces punctuation marks, which add noise without disturbing the semantic meaning or the word order.

Implementation

  • We will be using fastai to implement this augmentation.
  • The authors have released their code as well, which we will be using.

Dataset

  • We will be using the dataset from this challenge, where the goal is to predict the subreddit of a Reddit post based on its title and description. This is an example of a text categorization / text classification task.

Load libraries

import pandas as pd
import numpy  as np
import random

from pathlib import Path
from tqdm    import tqdm

from fastai.text.all import *

SEED = 41

Define paths and constants

BASE_DIR      = Path('~/data/dl_nlp')
RAW_DATA_PATH = BASE_DIR / 'data'
OUTPUT_DIR    = Path('~/data/dl_nlp/outputs')

PUNCTUATIONS = ['.', ',', '!', '?', ';', ':']
PUNC_RATIO   = 0.3

Load dataset

train = pd.read_csv(RAW_DATA_PATH / 'train.csv')
train.head()

  • The text column contains the post title as well as its description.
  • The subreddit column is our label.
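
A quick sanity check on these two columns (just inspection, nothing model-specific):

train.shape                     # number of rows and columns
train['subreddit'].nunique()    # number of classes the model has to predict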

Class Distribution

train.subreddit.value_counts(normalize=True)

  • We have multiple categories that our model needs to get right.
  • Most categories account for a similar percentage of the dataset, with only the SubredditSimulator category having fewer training examples.

Splitter

splits = RandomSplitter(seed=41)(train)
  • Create a splitting strategy.
  • Here we split our training dataframe randomly into training (80%) and validation (20%) sets; a quick check of the split sizes follows below.
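
RandomSplitter defaults to valid_pct=0.2, so the call above already gives the 80/20 split; the two index lists can be inspected directly:

len(splits[0]), len(splits[1])   # index lists for the training and validation rows, roughly 80% / 20%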

Tokenize the training dataset

df_tok, cnt = tokenize_df(train.iloc[splits[0]], text_cols='text')
  • Fast.ai provides a method to tokenize our dataset.
  • Here we pass only our training examples as the corpus for the tokenizer to build its vocabulary.
  • We could pass in different types of tokenizers here, but by default it uses WordTokenizer.
df_tok

  • Here we can see that it has split our text strings into tokens and created an additional column called text_length describing the length of each.
  • It has also added some library-specific tokens like xxbos, xxmaj, etc. xxbos represents the beginning-of-sentence token. For more details please refer to the fast.ai documentation.
cnt

  • Here is a snapshot of the vocabulary constructed by the tokenize_df method.
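
cnt behaves like a collections.Counter over the training tokens, so we can also inspect it directly; for example:

cnt.most_common(10)   # the ten most frequent tokens with their counts
len(cnt)              # size of the raw vocabulary before any frequency cut-off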

Using fast.ai Pipeline to construct Dataset

text_pipe   = Pipeline([attrgetter('text'), Tokenizer.from_df(0), Numericalize(vocab=list(cnt.keys()))])

lbl_pipe    = Pipeline([attrgetter('subreddit'), Categorize()])
lbl_pipe.setup(train.subreddit)

dsets       = Datasets(train, [text_pipe, lbl_pipe], splits=splits, dl_type=SortedDL)
  • Here we use Pipeline provided by fast.ai to put together different transforms we want to run on our dataframe.
  • text_pipe represents the Pipeline that we would like to run on our text column in the dataframe.
  • lbl_pipe represents the Pipeline that we would like to run on our subreddit column in the dataframe.
  • The Numericalize transform takes in our vocabulary and converts the tokens to ids.
  • The Categorize transform converts our labels to categories.
  • The Tokenizer.from_df transform tokenizes the text stored in our dataframe.
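
To check what the two pipelines produce, we can index into the Datasets object and decode the result; a rough sketch (exact output depends on the data):

x, y = dsets.train[0]
x[:10], y             # numericalised token ids (TensorText) and the encoded label (TensorCategory)
dsets.decode((x, y))  # maps the ids back to tokens and the label back to the subreddit name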

AEDA data augmentation as fast.ai transform

# seed both RNGs: the transform below uses Python's random module, not numpy
random.seed(0)
np.random.seed(0)

PUNCTUATIONS = ['.', ',', '!', '?', ';', ':']
PUNC_RATIO   = 0.3

class InsertPunctuation(Transform):
    split_idx = 0    # apply this transform only to the training split, never to validation items
    def __init__(self, o2i, punc_ratio=PUNC_RATIO):
        self.o2i        = o2i
        self.punc_ratio = punc_ratio

    def encodes(self, words:TensorText):
        new_line = []
        # number of punctuation marks to insert: between 1 and punc_ratio * sentence length
        q  = random.randint(1, int(self.punc_ratio * len(words) + 1))
        # random token positions in front of which a mark will be inserted
        qs = random.sample(range(0, len(words)), q)

        for j, word in enumerate(words):
            if j in qs:
                # insert the id of a randomly chosen punctuation mark before the current token
                new_line.append(self.o2i[PUNCTUATIONS[random.randint(0, len(PUNCTUATIONS)-1)]])
                new_line.append(int(word))
            else:
                new_line.append(int(word))

        return TensorText(new_line)
  • We have taken the implementation from the GitHub repository shared by the authors and created a fast.ai Transform that takes PUNC_RATIO and o2i as parameters and inserts punctuation marks at random positions in the sentence.
  • PUNC_RATIO defaults to 0.3, which corresponds to the one-third of the sentence length mentioned in the paper.
  • o2i is the mapping from token to token id.
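
As a quick sanity check we can apply the transform to a single training item by hand; because split_idx = 0 restricts it to the training split, we pass split_idx explicitly (a sketch):

aeda  = InsertPunctuation(dsets.o2i)
x, y  = dsets.train[0]
x_aug = aeda(x, split_idx=0)   # split_idx=0 mimics being called from the training dataloader
len(x), len(x_aug)             # the augmented item should be a few tokens longer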

Construct dataloaders

seq_len    = 72
dls_kwargs = {
              'after_item'  : InsertPunctuation(dsets.o2i),
              'before_batch': Pad_Chunk(seq_len=seq_len)
             }

dls        = dsets.dataloaders(bs=32, seq_len=seq_len, **dls_kwargs)
  • When creating fast.ai dataloaders we can hook into some of the events they emit.
  • Here we make use of two such events: the after_item callback runs our augmentation and adds the punctuation marks.
  • The before_batch callback pads the tokens so that they are the same size before we collate them into a batch.
dls.show_batch(max_n=3)

  • dls.show_batch gives us a glimpse of a batch.
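
To confirm that the augmentation actually fires during training, we can also decode a raw training batch and look for the inserted marks; a sketch:

xb, yb = dls.one_batch()
dls.decode_batch((xb, yb), max_n=2)   # decoded training texts should contain extra . , ! ? ; : tokens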

Using the classic TextCNN model introduced by Yoon Kim (paper)

class TextCNN(Module):
    def __init__(self, n_embed, embed_dim, num_filters, filter_sizes, num_classes, dropout=0.5, pad_idx=1):
        store_attr('n_embed,embed_dim')
        
        self.embed = nn.Embedding(num_embeddings=n_embed,
                                  embedding_dim=embed_dim,
                                  padding_idx=pad_idx
                                 )
        self.convs = nn.ModuleList([
                            nn.Conv2d(in_channels=1, 
                                      out_channels=num_filters, 
                                      kernel_size=(k, embed_dim)
                                     ) 
                            for k in filter_sizes
                                ])
        
        self.dropout  = nn.Dropout(dropout)
        self.relu     = nn.ReLU()
        self.fc       = nn.Linear(num_filters * len(filter_sizes), num_classes)
    
    def _conv_and_pool(self, x, conv):
        x = self.relu(conv(x)).squeeze(3)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x
    
    def forward(self, x):
        out = self.embed(x)
        out = out.unsqueeze(1)
        out = torch.cat([self._conv_and_pool(out, conv) for conv in self.convs], 1)
        out = self.dropout(out)
        out = self.fc(out)
        return out
vocab       = dls.train_ds.vocab
num_classes = get_c(dls)
model       = TextCNN(len(vocab[0]), 
                      embed_dim=300, 
                      num_filters=100, 
                      filter_sizes=[1, 2, 3],
                      num_classes=num_classes,
                     )
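
Before training, a quick forward pass on one batch confirms the expected output shape (a sketch, purely for sanity checking):

xb, yb = dls.one_batch()
with torch.no_grad():
    out = model.to(xb.device)(xb)
out.shape   # (batch_size, num_classes)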

Define learner

learn = Learner(dls, model, metrics=[accuracy, F1Score(average='weighted')])
  • We use the weighted F1 score as the metric for this multi-class classification task.
learn.fit_one_cycle(n_epoch=25, lr_max=3e-4, cbs=EarlyStoppingCallback(patience=3))

  • We get a weighted F1 score of 0.869 without using any pre-trained embeddings.

References