AEDA (An Easier Data Augmentation Technique for Text Classification)
A new data augmentation method proposed for text classification.
- Paper resources
- Objective
- Why do we need augmentations?
- What are the different kinds of augmentations used in NLP?
- What is the novel idea presented in the paper?
- Can you share an example of how this augmentation would work?
- How many punctuation marks are inserted?
- Why one-third of the sentence length?
- At which positions should we insert these punctuation marks?
- What are the different punctuation marks used?
- Why does AEDA work better compared to EDA?
- Implementation
- Dataset
- Load libraries
- Define paths and constants
- Load dataset
- Class Distribution
- Splitter
- Tokenize the training dataset
- Using fast.ai Pipeline to construct Dataset
- AEDA data augmentation as fast.ai transform
- Construct dataloaders
- Using the classic TextCNN model introduced by Yoon Kim, paper
- Define learner
- References
- This paper proposes a new data augmentation technique for text classification task.
- It also compares the performance of this technique with EDA (Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks) and concludes that their method is simpler and produces better results.
- In this experiment we will try to implement this data augmentation using `fastai` on a text classification task.
Why do we need augmentations?
- To generalize better, we need larger and more comprehensive datasets, but collecting and labelling such datasets is laborious, so augmentation becomes an attractive way to introduce more examples for the model to consume.
What are the different kinds of augmentations used in NLP?
- To improve machine translation, researchers have tried substituting common words with rare words, thus providing more context for rare words.
- Some researchers have tried replacing words with their synonyms for tweet classification.
- Randomly swap two words in a sentence.
- Randomly delete a word from the sentence, and many more; a minimal sketch of the last two operations follows.
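A minimal, hypothetical sketch of the random-swap and random-deletion operations (the function names and the deletion probability are illustrative choices, not the EDA reference code):
import random

def random_swap(words, n=1):
    # Randomly swap two words in the sentence, n times
    words = words.copy()
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    # Randomly delete each word with probability p, keeping at least one word
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

print(random_swap("the movie was surprisingly good".split()))
print(random_deletion("the movie was surprisingly good".split()))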
How many punctuation marks are inserted?
- Between 1 and `n/3`, where `n` represents the length of the sentence.
Why one-third of the sentence length?
- The authors mention that they want to increase the complexity of the sentence but do not want to add too many punctuation marks, which would interfere with its semantic meaning.
At which positions should we insert these punctuation marks?
- The authors inserted them at random positions in the sentence.
What are the different punctuation marks used?
- `.` `;` `?` `:` `!` `,`
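Putting these pieces together, here is a minimal sentence-level sketch of the augmentation, written from the description above (the sampling formula mirrors the transform implemented later in this post; treat it as an illustration rather than the authors' reference implementation):
import random

PUNCTUATIONS = ['.', ';', '?', ':', '!', ',']
PUNC_RATIO = 0.3

def insert_punctuation_marks(sentence, punc_ratio=PUNC_RATIO):
    # Split into words and decide how many marks to insert: between 1 and ~1/3rd of the length
    words = sentence.split(' ')
    q = random.randint(1, int(punc_ratio * len(words) + 1))
    positions = random.sample(range(len(words)), q)
    new_words = []
    for i, word in enumerate(words):
        if i in positions:
            # place a randomly chosen punctuation mark before this word
            new_words.append(random.choice(PUNCTUATIONS))
        new_words.append(word)
    return ' '.join(new_words)

print(insert_punctuation_marks("i love this movie because the plot keeps you guessing"))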
Why does AEDA work better compared to EDA?
- EDA proposes synonym replacement, random insertion, random swap and random deletion. These modifications could change the semantic meaning of the text.
- AEDA, on the other hand, only inserts punctuation marks, which introduces noise without disturbing the semantic meaning or the word order.
Dataset
- We will be using this dataset, used in this challenge, where the goal is to predict the subreddit of a post based on its title and description. This is an example of a text categorization / text classification task.
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm
from fastai.text.all import *
SEED = 41
BASE_DIR = Path('~/data/dl_nlp')
RAW_DATA_PATH = BASE_DIR / 'data'
OUTPUT_DIR = Path('~/data/dl_nlp/outputs')
PUNCTUATIONS = ['.', ',', '!', '?', ';', ':']
PUNC_RATIO = 0.3
train = pd.read_csv(RAW_DATA_PATH / 'train.csv')
train.head()
- The `text` column represents the `title` as well as the `description`.
- The `subreddit` column represents our `label`.
train.subreddit.value_counts(normalize=True)
- We have multiple categories that our model needs to get right.
- Most of the categories have a similar percentage of data points in the dataset, with only the `SubredditSimulator` category having fewer training examples.
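For a quicker visual read of the class balance, one could also plot it (an optional step, assuming matplotlib is available in the environment):
# Optional: visualize the class distribution as a horizontal bar chart
train.subreddit.value_counts(normalize=True).plot.barh(figsize=(6, 4), title='Class distribution');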
splits = RandomSplitter(seed=41)(train)
- Create a splitting strategy.
- Here we plan to split our training dataframe randomly into training (80%) and validation (20%) datasets.
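A quick sanity check (added here for illustration) of the split sizes:
# splits is a tuple of two index lists: training indices and validation indices
len(splits[0]), len(splits[1])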
df_tok, cnt = tokenize_df(train.iloc[splits[0]], text_cols='text')
- fast.ai provides a method to tokenize our dataset.
- Here we are only passing our training examples as the corpus for the tokenizer to build its vocabulary.
- We could pass in different types of tokenizers here, but by default it uses `WordTokenizer`.
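If we wanted a different tokenizer, `tokenize_df` accepts a `tok` argument. For example, a subword tokenizer could be swapped in (a hypothetical variant, assuming the `sentencepiece` package is installed; it is not used in the rest of this post):
# Hypothetical alternative: subword tokenization instead of the default WordTokenizer
df_tok_sub, cnt_sub = tokenize_df(train.iloc[splits[0]],
                                  text_cols='text',
                                  tok=SubwordTokenizer(vocab_sz=10000))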
df_tok
- Here we can see that it has split our `text` string into tokens and created an additional column called `text_length` describing the length.
- It has also added some library-specific tokens like `xxbos`, `xxmaj`, etc. `xxbos` represents the beginning-of-sentence token. For more details please refer to the fast.ai documentation.
cnt
- Here is a snapshot of the token counts returned by the `tokenize_df` method, from which we build our vocabulary.
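As a side note, fastai also provides `make_vocab` to build a size-capped vocabulary from such a counter (an optional refinement; the rest of this post sticks with the simpler `list(cnt.keys())`):
# Optional: cap the vocabulary size and drop rare tokens (not used below)
vocab_capped = make_vocab(cnt, min_freq=2, max_vocab=60000)
len(vocab_capped)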
text_pipe = Pipeline([attrgetter('text'), Tokenizer.from_df(0), Numericalize(vocab=list(cnt.keys()))])
lbl_pipe = Pipeline([attrgetter('subreddit'), Categorize()])
lbl_pipe.setup(train.subreddit)
dsets = Datasets(train, [text_pipe, lbl_pipe], splits=splits, dl_type=SortedDL)
- Here we use the `Pipeline` provided by `fast.ai` to put together the different transforms we want to run on our dataframe.
- `text_pipe` represents the Pipeline that we would like to run on the `text` column of the dataframe.
- `lbl_pipe` represents the Pipeline that we would like to run on the `subreddit` column of the dataframe.
- The `Numericalize` transform takes in our vocabulary and converts the tokens to ids.
- The `Categorize` transform converts our labels to categories.
- The `Tokenizer.from_df` transform tokenizes the text stored in our dataframe.
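A quick inspection (added for illustration) of one processed item and its decoded form:
# One processed training item: a tensor of token ids and a category id
x, y = dsets.train[0]
x[:20], y
# Decode it back to readable tokens and the label name
dsets.decode((x, y))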
import random

np.random.seed(0)
random.seed(0)  # the transform below uses Python's `random` module, so seed it as well
PUNCTUATIONS = ['.', ',', '!', '?', ';', ':']
PUNC_RATIO = 0.3

class InsertPunctuation(Transform):
    # split_idx = 0 -> apply this transform only to the training set
    split_idx = 0

    def __init__(self, o2i, punc_ratio=PUNC_RATIO):
        self.o2i = o2i                # mapping from token to token id
        self.punc_ratio = punc_ratio  # insert at most ~1/3rd of the sentence length

    def encodes(self, words:TensorText):
        new_line = []
        # number of punctuation marks to insert: between 1 and punc_ratio * len(words)
        q = random.randint(1, int(self.punc_ratio * len(words) + 1))
        # positions at which a punctuation mark will be inserted
        qs = random.sample(range(0, len(words)), q)
        for j, word in enumerate(words):
            if j in qs:
                # insert the id of a randomly chosen punctuation mark before the current token
                new_line.append(self.o2i[PUNCTUATIONS[random.randint(0, len(PUNCTUATIONS) - 1)]])
            new_line.append(int(word))
        return TensorText(new_line)
- We have taken the implementation from the GitHub repository shared by the authors and created a `fast.ai` transform that takes `punc_ratio` and `o2i` as parameters and inserts punctuation marks at random positions in the sentence.
- `PUNC_RATIO` defaults to `0.3`, which corresponds to the one-third of the sentence length mentioned in the paper.
- `o2i` is the mapping from token to token id.
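To see the transform in action (an added illustration; because `split_idx = 0` restricts it to the training set, we pass `split_idx=0` explicitly when calling it directly):
tfm = InsertPunctuation(dsets.o2i)
x, y = dsets.train[0]
aug = tfm(x, split_idx=0)
len(x), len(aug)   # the augmented sequence is longer because punctuation ids were inserted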
seq_len = 72
dls_kwargs = {
'after_item' : InsertPunctuation(dsets.o2i),
'before_batch': Pad_Chunk(seq_len=seq_len)
}
dls = dsets.dataloaders(bs=32, seq_len=seq_len, **dls_kwargs)
- When creating `fast.ai` dataloaders we can hook into some of the events that are emitted. Here we make use of two such events.
- The `after_item` callback is used to run our augmentation and insert the punctuation marks.
- The `before_batch` callback is used to pad the tokens so that they are all of the same size before we collate them into a batch.
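A quick peek at one batch (added for illustration) to confirm the shapes after padding and chunking:
# Expected shapes: (batch_size, seq_len) for the texts and (batch_size,) for the labels
xb, yb = dls.one_batch()
xb.shape, yb.shape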
dls.show_batch(max_n=3)
- `dls.show_batch` gives a glimpse of the batch.
Using the classic `TextCNN` model introduced by Yoon Kim, paper
class TextCNN(Module):
    def __init__(self, n_embed, embed_dim, num_filters, filter_sizes, num_classes, dropout=0.5, pad_idx=1):
        store_attr('n_embed,embed_dim')
        self.embed = nn.Embedding(num_embeddings=n_embed,
                                  embedding_dim=embed_dim,
                                  padding_idx=pad_idx)
        # one Conv2d per filter size, each spanning the full embedding dimension
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1,
                      out_channels=num_filters,
                      kernel_size=(k, embed_dim))
            for k in filter_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def _conv_and_pool(self, x, conv):
        # convolve, apply ReLU, then max-pool over the remaining time dimension
        x = self.relu(conv(x)).squeeze(3)
        x = F.max_pool1d(x, x.size(2)).squeeze(2)
        return x

    def forward(self, x):
        out = self.embed(x)      # (bs, seq_len, embed_dim)
        out = out.unsqueeze(1)   # add a channel dimension for Conv2d
        out = torch.cat([self._conv_and_pool(out, conv) for conv in self.convs], 1)
        out = self.dropout(out)
        out = self.fc(out)
        return out
vocab = dls.train_ds.vocab
num_classes = get_c(dls)
model = TextCNN(len(vocab[0]),
embed_dim=300,
num_filters=100,
filter_sizes=[1, 2, 3],
num_classes=num_classes,
)
learn = Learner(dls, model, metrics=[accuracy, F1Score(average='weighted')])
- Using the weighted `F1Score` metric for multi-class classification.
learn.fit_one_cycle(n_epoch=25, lr_max=3e-4, cbs=EarlyStoppingCallback(patience=3))
- We are getting a weighted F1 score of `0.869` without using any pre-trained embeddings.
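For context, one could train the same model without the augmentation and compare the scores. This is a hypothetical baseline (not run in the original write-up); it simply drops the `after_item` transform:
# Baseline dataloaders without the InsertPunctuation augmentation
dls_plain = dsets.dataloaders(bs=32, seq_len=seq_len,
                              before_batch=Pad_Chunk(seq_len=seq_len))
model_plain = TextCNN(len(vocab[0]), embed_dim=300, num_filters=100,
                      filter_sizes=[1, 2, 3], num_classes=num_classes)
learn_plain = Learner(dls_plain, model_plain,
                      metrics=[accuracy, F1Score(average='weighted')])
learn_plain.fit_one_cycle(n_epoch=25, lr_max=3e-4, cbs=EarlyStoppingCallback(patience=3))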