Training BERT for Cyberbullying Detection - Towards SOTA
Training BERT for Cyberbullying Detection, Part 2
Published on October 27, 2020
This is a follow-up to the previous notebook, in which we trained a binary classification model to detect cyberbullying in Polish tweets. The dataset comes from a Polish NLP competition, PolEval 2019 (http://2019.poleval.pl/index.php/tasks/task6), and is also included in the Polish NLP benchmark KLEJ (https://klejbenchmark.com/). Our goal is to reach state-of-the-art results, with the following points of reference:
- Best result in last year’s competition: 58.58 F1 (n-waves ULMFiT)
- Best result for a base BERT model on KLEJ: 66.7 (Polish RoBERTa base)
- Best result for a large BERT model on KLEJ: 72.4 (XLM-RoBERTa large + NKJP)
To achieve that, we will work with the HuggingFace transformers library and PyTorch.
Setup
Let’s start by installing transformers and importing the relevant libraries. From here on we will work mostly with PyTorch.
!pip install transformers -q
We will now switch from a single train-validation split to 5-fold cross-validation. We will also be more careful with the split, applying a stratified k-fold split so that each fold has a similar proportion of positive labels. With cross-validation, our goal is to benefit from all of the training data. At the same time, by ensembling the models trained on the individual folds, we should reduce random errors and make the ensemble’s predictions more stable.
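A minimal sketch of how the fold assignment could look, assuming the training data is already loaded into a pandas DataFrame df with a label column (the kfold column is what the training runner below expects):

from sklearn.model_selection import StratifiedKFold

# Assign a stratified fold index to each tweet so that every fold
# has a similar proportion of positive (cyberbullying) labels.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
df['kfold'] = -1
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, valid_idx) in enumerate(skf.split(X=df, y=df['label'])):
    df.loc[valid_idx, 'kfold'] = fold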
We will also apply some pre-processing to the tweets. First, we will replace ‘@anonymized_account’ with ‘@ użytkownik’. Second, we will replace emoji characters with their plain-text counterparts. Both modifications are based on the Polish RoBERTa training scripts (https://github.com/sdadas/polish-roberta). These changes should allow the Polish BERT model to better represent the text.
# Replace each emoji character with its plain-text counterpart (emoji is a dict mapping
# emoji to text, following the polish-roberta scripts); other characters pass through unchanged
df['text'] = df['text'].apply(lambda r: "".join((emoji.get(c, c) for c in r)))
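The handle replacement mentioned above is a one-liner; a sketch, assuming the same DataFrame:

# Replace the anonymized Twitter handle placeholder with a plain-text user mention,
# following the polish-roberta pre-processing described above
df['text'] = df['text'].str.replace('@anonymized_account', '@ użytkownik', regex=False)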
Helper functions
class AverageMeter:
    """Computes and stores the average and current value"""
    def __init__(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
Configuration
Let’s define some key hyperparameters that influence our training (a configuration sketch follows this list):
- max length: how many tokens should be used per tweet? Based on the training data, the longest tweet is 91 tokens with the Polbert tokenizer, so we will set the max length to 92 tokens and pad every tweet to that length with the [PAD] token.
- batch size: we will use a batch size of 64; a larger one might not fit on some GPUs.
- number of epochs: our dataset is fairly small, so training for many epochs might lead to overfitting. Let’s settle on 2 epochs here.
- learning rate: we will use a discriminative learning rate, applying a higher learning rate to the classifier layer (which starts from random weights) and a lower learning rate to the encoder (which has been pretrained, so it should already have ‘good’ weights).
- warm up: we will use a linear schedule with warm-up, so the learning rate is increased for the number of steps defined here and then linearly decreased to zero.
- pretrained model and tokenizer: we will again work with the Polbert uncased model.
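A minimal sketch of the corresponding configuration constants and imports used in the code below. The learning-rate, warm-up and validation batch-size values are assumptions for illustration, and the model identifier is assumed to be the Polbert uncased checkpoint on the HuggingFace hub:

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler
from transformers import (
    AdamW,
    BertConfig,
    BertModel,
    BertPreTrainedModel,
    BertTokenizer,
    get_linear_schedule_with_warmup,
)

MAX_LEN = 92                 # max sequence length per tweet (see discussion above)
TRAIN_BATCH_SIZE = 64
VALID_BATCH_SIZE = 64        # assumed; only the training batch size is discussed above
EPOCHS = 2
LR = 2e-5                    # encoder learning rate (assumed value)
HEAD_LR = 1e-3               # higher learning rate for the classifier head (assumed value)
WARMUP_STEPS = 50            # linear warm-up steps (assumed value)
BERT_PATH = "dkleczek/bert-base-polish-uncased-v1"   # Polbert uncased (assumed identifier)
TOKENIZER = BertTokenizer.from_pretrained(BERT_PATH)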
Let’s start by defining the PyTorch Dataset. It needs to implement the __len__ and __getitem__ methods. We will again use the HuggingFace tokenizer to convert text into the input_ids, mask and token_type_ids expected by our BERT layer.
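A minimal sketch of that Dataset, assuming the TOKENIZER and MAX_LEN constants above and the batch keys consumed by the model and training code below (the original notebook’s implementation may differ in details):

class CBDDataset(torch.utils.data.Dataset):
    """Wraps tweets and labels, tokenizing each tweet to a fixed length."""

    def __init__(self, text, label):
        self.text = text
        self.label = label

    def __len__(self):
        return len(self.text)

    def __getitem__(self, item):
        enc = TOKENIZER.encode_plus(
            str(self.text[item]),
            add_special_tokens=True,
            max_length=MAX_LEN,
            padding="max_length",
            truncation=True,
        )
        return {
            "ids": torch.tensor(enc["input_ids"], dtype=torch.long),
            "mask": torch.tensor(enc["attention_mask"], dtype=torch.long),
            "token_type_ids": torch.tensor(enc["token_type_ids"], dtype=torch.long),
            "label": torch.tensor(self.label[item], dtype=torch.long),
        }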
Now it’s time to define our model! First, let’s look at the elements that are normally expected:
- BERT layer: the entire pretrained BERT model is a single layer in our model. We again use pretrained weights from the HuggingFace hub.
- dropout: another hyperparameter that can be tuned; here we set it directly in the model.
- linear classification layer: this is a binary classification problem with 2 classes (True and False), and we define a linear layer for it. It comes with random weights that we initialize here.
We also make some modifications here that should help us improve the results:
- using the full hidden states rather than the [CLS] token output: there is some research showing that the last layers of a pretrained model are very specific to the pretraining task and don’t help in fine-tuning. We will output all hidden states from the model and use the penultimate layer (-2) for our task.
- max pooling: we will take the output of all tokens (768 features × 92 tokens) and take the max value of each feature across all tokens. The intuition here is that the model may encode ‘cyberbullying’ in a token representation, and if it’s contained anywhere in a tweet, we should use that information.
class CBDModel(BertPreTrainedModel):
    def __init__(self, conf):
        super(CBDModel, self).__init__(conf)
        # The entire pretrained Polbert encoder is a single layer of our model
        self.bert = BertModel.from_pretrained(BERT_PATH, config=conf)
        # Max-pool each of the 768 features across all MAX_LEN token positions
        self.mx = nn.MaxPool1d(MAX_LEN)
        self.drop_out = nn.Dropout(0.5)
        # Binary classification head, initialized with random weights
        self.l0 = nn.Linear(768, 2)
        torch.nn.init.normal_(self.l0.weight, std=0.02)

    def forward(self, ids, mask, token_type_ids):
        # With output_hidden_states=True, the model also returns all hidden states
        _, _, out = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids)
        # Use the penultimate (-2) hidden state rather than the last layer
        out = out[-2]
        out = out.permute(0, 2, 1)
        out = torch.squeeze(self.mx(out))
        out = self.drop_out(out)
        out = self.l0(out)
        return out
Training and Evaluation Loop with Weighted Random Sampling
In this section, we define our training and evaluation functions and the runner that executes the training. The key modification here is using a WeightedRandomSampler to address the class imbalance issue.
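The train_fn and eval_fn implementations are not shown here; a minimal sketch of what they could look like, assuming cross-entropy loss, the batch keys from the Dataset sketch above, and sklearn’s f1_score for evaluation:

from sklearn.metrics import f1_score

def train_fn(data_loader, model, optimizer, device, scheduler=None):
    """One training epoch: forward pass, cross-entropy loss, backward pass, optimizer step."""
    model.train()
    losses = AverageMeter()
    loss_fn = nn.CrossEntropyLoss()
    for batch in data_loader:
        ids = batch["ids"].to(device)
        mask = batch["mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)
        labels = batch["label"].to(device)
        optimizer.zero_grad()
        outputs = model(ids, mask=mask, token_type_ids=token_type_ids)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
        losses.update(loss.item(), ids.size(0))
    return losses.avg

def eval_fn(data_loader, model, device):
    """Collect predictions on the validation fold and return the F1 score."""
    model.eval()
    preds, targets = [], []
    with torch.no_grad():
        for batch in data_loader:
            ids = batch["ids"].to(device)
            mask = batch["mask"].to(device)
            token_type_ids = batch["token_type_ids"].to(device)
            outputs = model(ids, mask=mask, token_type_ids=token_type_ids)
            preds.extend(torch.argmax(outputs, dim=1).cpu().numpy().tolist())
            targets.extend(batch["label"].numpy().tolist())
    return f1_score(targets, preds)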
def run(fold):
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)

    # Weight each training sample inversely to its class frequency to address class imbalance
    target = df_train.label.values
    class_sample_count = np.array([len(np.where(target == t)[0]) for t in np.unique(target)])
    weight = 1. / class_sample_count
    samples_weight = np.array([weight[t] for t in target])
    samples_weight = torch.from_numpy(samples_weight).double()
    sampler = WeightedRandomSampler(samples_weight, len(samples_weight))

    train_dataset = CBDDataset(
        text=df_train.text.values,
        label=df_train.label.values,
    )
    train_data_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=TRAIN_BATCH_SIZE,
        num_workers=1,
        sampler=sampler
    )

    valid_dataset = CBDDataset(
        text=df_valid.text.values,
        label=df_valid.label.values,
    )
    valid_data_loader = torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=VALID_BATCH_SIZE,
        shuffle=False,
        num_workers=2
    )

    device = torch.device("cuda")
    model_config = BertConfig.from_pretrained(BERT_PATH)
    model_config.output_hidden_states = True
    model = CBDModel(conf=model_config)
    model.to(device)

    num_train_steps = int(len(df_train) / TRAIN_BATCH_SIZE * EPOCHS)

    # Discriminative learning rates: encoder parameters use the base LR,
    # the classifier head (last two parameters) uses the higher HEAD_LR
    param_optimizer = list(model.named_parameters())[:-2]
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    optimizer_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
        {'params': model.l0.weight, "lr": HEAD_LR, 'weight_decay': 0.01},
        {'params': model.l0.bias, "lr": HEAD_LR, 'weight_decay': 0.0},
    ]
    optimizer = AdamW(optimizer_parameters, lr=LR)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=WARMUP_STEPS,
        num_training_steps=num_train_steps
    )

    print(f"Training is starting for fold: {fold}")
    for epoch in range(EPOCHS):
        train_fn(train_data_loader, model, optimizer, device, scheduler=scheduler)
        f1 = eval_fn(valid_data_loader, model, device)
        print(f"Epoch: {epoch}, F1 score = {f1}")

    model_path = f"model_{fold}.bin"
    torch.save(model.state_dict(), model_path)
Let’s train!
run(0)
Training is starting for fold: 0
Epoch: 0, F1 score = 0.4273255813953489
Epoch: 1, F1 score = 0.5588235294117647
run(1)
Training is starting for fold: 1
Epoch: 0, F1 score = 0.4831460674157303
Epoch: 1, F1 score = 0.5245901639344263
run(2)
Training is starting for fold: 2
Epoch: 0, F1 score = 0.4551083591331269
Epoch: 1, F1 score = 0.5592233009708738
run(3)
Training is starting for fold: 3
Epoch: 0, F1 score = 0.4318181818181818
Epoch: 1, F1 score = 0.5283018867924528
run(4)
Training is starting for fold: 4
Epoch: 0, F1 score = 0.47330960854092524
Epoch: 1, F1 score = 0.5536480686695279
Evaluation and results
We have now trained 5 models on different folds. Let’s apply these models to our test set, pre-processed in the same way as the training set. We will average the raw logits (outputs) from each model and then apply argmax to choose the predicted class.
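A sketch of that ensembling step; the predict_ensemble name and test_data_loader are illustrative, assuming a loader built over the pre-processed test set in the same way as the validation loader:

def predict_ensemble(test_data_loader, device, n_folds=5):
    """Average raw logits from the fold models, then argmax for the predicted class."""
    fold_logits = []
    for fold in range(n_folds):
        model_config = BertConfig.from_pretrained(BERT_PATH)
        model_config.output_hidden_states = True
        model = CBDModel(conf=model_config)
        model.load_state_dict(torch.load(f"model_{fold}.bin"))
        model.to(device)
        model.eval()
        logits = []
        with torch.no_grad():
            for batch in test_data_loader:
                out = model(
                    batch["ids"].to(device),
                    mask=batch["mask"].to(device),
                    token_type_ids=batch["token_type_ids"].to(device),
                )
                logits.append(out.cpu().numpy())
        fold_logits.append(np.concatenate(logits))
    avg_logits = np.mean(fold_logits, axis=0)
    return np.argmax(avg_logits, axis=1)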
This looks good! Our F1 score is around 0.66 - 0.68, which is in the range of the best base BERT model on KLEJ Benchmark (Polish Roberta base reported results in the range 0.63-0.69).
Improvements
What can be done to further improve the results? Here are some ideas:
- Data augmentation. Can we add more variety/examples via text augmentation?
- More hyperparameter tuning. The key watch-out is to keep a sound cross-validation approach, so that we don’t tune on the test set.
- Multi-sample dropout. This technique was used by winning teams in recent Kaggle NLP competitions (see the sketch below).
- Multi-lingual transfer. We have large toxicity datasets in English; can we use them with a multi-lingual model like XLM-RoBERTa to classify Polish tweets?
- Multi-task learning. We could train a single model on several tasks, e.g. from the KLEJ benchmark, to see if that helps.
- Ensembling/stacking. Ensembling results across models with different encoders and fine-tuning protocols is very likely to improve the score even further.
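As an illustration of the multi-sample dropout idea above, a minimal sketch (not the original implementation): several dropout masks are applied to the same pooled features and the resulting logits are averaged, which tends to stabilize training of the classifier head.

class MultiSampleClassifierHead(nn.Module):
    """Averages logits over several dropout masks of the same pooled features."""

    def __init__(self, in_features=768, n_classes=2, n_samples=5, p=0.5):
        super().__init__()
        self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(n_samples)])
        self.linear = nn.Linear(in_features, n_classes)

    def forward(self, pooled):
        # One forward pass of the linear layer per dropout mask, averaged
        logits = [self.linear(drop(pooled)) for drop in self.dropouts]
        return torch.stack(logits, dim=0).mean(dim=0)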