Training BERT for Cyberbullying Detection - HF Trainer Baseline
Training BERT for Cyberbullying Detection - Part 1
Published on October 27, 2020
Our goal is to train a binary classification model that detects cyberbullying in Polish tweets. The dataset comes from a Polish NLP competition, PolEval 2019 (http://2019.poleval.pl/index.php/tasks/task6), and is also included in the Polish NLP benchmark KLEJ (https://klejbenchmark.com/).
Setup
Let’s start by installing two libraries from HuggingFace that will make our job easier: transformers and datasets. We will also import the relevant libraries.
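Here's a minimal sketch of that setup; the install command and the exact import list are assumptions based on what we use later in the post.

# Install the HuggingFace libraries (a sketch; pin versions if you want reproducibility)
# !pip install transformers datasets
import pandas as pd
import torch
from datasets import load_dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)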
In this demo, we will use a pretrained Polish BERT model - Polbert (https://github.com/kldarek/polbert). It can be downloaded from the HuggingFace model hub, and we will use the BertForSequenceClassification class to load it. We will also need the Polbert tokenizer.
model = BertForSequenceClassification.from_pretrained('dkleczek/bert-base-polish-uncased-v1')
tokenizer = BertTokenizerFast.from_pretrained('dkleczek/bert-base-polish-uncased-v1')
Some weights of the model checkpoint at dkleczek/bert-base-polish-uncased-v1 were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dkleczek/bert-base-polish-uncased-v1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Now we will load the training data from the KLEJ benchmark website, clean it up, and convert it to a csv file that we can use to create a dataset.
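Here's a rough sketch of that step, assuming the KLEJ CBD training file has already been downloaded and unzipped locally; the file name and the original column names (sentence, target) are assumptions.

# Load the raw training file (file name and column names are assumptions)
df = pd.read_csv('klej_cbd/train.tsv', sep='\t')
df = df.rename(columns={'sentence': 'text', 'target': 'label'})
# Basic cleanup: drop empty rows and strip surrounding whitespace
df = df.dropna(subset=['text'])
df['text'] = df['text'].str.strip()
# Save to csv so the datasets library can load it in the next step
df.to_csv('train.csv', index=False)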
Our training set consists of 10 thousand tweets, but only 851 of those tweets are tagged as cyberbullying. We can take a look at a sample of data in the dataframe.
df.head()
      text                                                label
5809  LUDZIE Z BYDGOSZCZY: NAJLEPSZA RESTAURACJA? Rt...    0
5938  @anonymized_account Stałam na zewnątrz, ale ma...    0
2260  RT @anonymized_account Halicki: proszę nie mów...    0
8833  @anonymized_account @anonymized_account Czyli ...    1
4513  @anonymized_account Już nic nie będzie takie s...    0
Dataset
We need to convert our data into a format that can be fed to our model. Thanks to the HuggingFace datasets library magic, we can do this with just a few lines of code. We will load the dataset from the csv file and split it into a train set (80%) and a validation set (20%). We will then map the tokenizer over the dataset to convert the text strings into the format the BERT model expects (input_ids and attention_mask). Finally, we'll convert the result into torch tensors.
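A sketch of that pipeline could look like this; the 80/20 split comes from the description above, while the max_length and padding strategy are assumptions.

# Load the csv produced earlier and carve out a 20% validation split
dataset = load_dataset('csv', data_files='train.csv')['train']
dataset = dataset.train_test_split(test_size=0.2)

# Tokenize the tweets into input_ids / attention_mask (max_length is an assumption)
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)
train_dataset = dataset['train']
eval_dataset = dataset['test']

# Expose the columns the model needs as torch tensors
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
eval_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])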
Downloading and preparing dataset csv/default-013faa159f500b12 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-013faa159f500b12/0.0.0/49187751790fa4d820300fd4d0707896e5b941f1a9c644652645b866716a4ac4...
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-013faa159f500b12/0.0.0/49187751790fa4d820300fd4d0707896e5b941f1a9c644652645b866716a4ac4. Subsequent calls will reuse this data.
We are almost ready to start the training. Let's define a function that will help us monitor the training progress and evaluate results on the validation dataset. We will primarily focus on the F1, recall, and precision metrics, especially since F1 is the official evaluation metric for this dataset.
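Such a function can follow the standard compute_metrics pattern used with the Trainer; the exact implementation in the original notebook may differ slightly.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    # pred.label_ids holds the true labels, pred.predictions the raw logits
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}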
HuggingFace wraps the default transformer fine-tuning approach in the Trainer object, and we can customize it by passing training arguments such as the learning rate, number of epochs, and batch size. We will set logging_steps to 20 so that we can frequently evaluate how the model performs on the validation set throughout training.
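Here's a sketch of that setup; logging_steps=20 comes from the text, the number of epochs is inferred from the training log below, and the remaining hyperparameters are assumptions. The evaluate_during_training flag matches the deprecation warning that follows (newer versions of transformers use evaluation_strategy instead).

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,               # inferred from the training log below
    per_device_train_batch_size=64,   # assumption
    per_device_eval_batch_size=64,    # assumption
    learning_rate=2e-5,               # assumption
    logging_steps=20,                 # log and evaluate every 20 steps
    evaluate_during_training=True,    # deprecated in favor of evaluation_strategy
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)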
/usr/local/lib/python3.6/dist-packages/transformers/training_args.py:339: FutureWarning: The `evaluate_during_training` argument is deprecated in favor of `evaluation_strategy` (which has more options)
FutureWarning,
trainer.train()
/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py:847: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
return torch.tensor(x, **format_kwargs)
[378/378 05:31, Epoch 3/3]
Step    Training Loss    Validation Loss    Accuracy    F1          Precision    Recall
20      0.581389         0.320318           0.907869    0.000000    0.000000     0.000000
40      0.268629         0.245489           0.908367    0.010753    1.000000     0.005405
60      0.216368         0.229870           0.914841    0.149254    0.937500     0.081081
80      0.207710         0.255416           0.913845    0.121827    1.000000     0.064865
100     0.187619         0.178497           0.926793    0.374468    0.880000     0.237838
120     0.207315         0.169160           0.931275    0.589286    0.655629     0.535135
140     0.155866         0.170657           0.929283    0.564417    0.652482     0.497297
160     0.173842         0.173169           0.930279    0.554140    0.674419     0.470270
180     0.128828         0.174443           0.927291    0.522876    0.661157     0.432432
200     0.144621         0.169735           0.930777    0.535117    0.701754     0.432432
220     0.122884         0.167735           0.929781    0.568807    0.654930     0.502703
240     0.134178         0.168437           0.927291    0.535032    0.651163     0.454054
260     0.094126         0.169739           0.932769    0.571429    0.692308     0.486486
280     0.079876         0.183835           0.931275    0.574074    0.669065     0.502703
300     0.091462         0.203578           0.930279    0.545455    0.682927     0.454054
320     0.074335         0.195306           0.930777    0.582583    0.655405     0.524324
340     0.079284         0.202131           0.932769    0.563107    0.701613     0.470270
360     0.082395         0.193977           0.931275    0.584337    0.659864     0.524324
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
We reach an F1 score of around 0.58-0.62 on the validation set (the actual result will vary since we're not fixing the seeds here). We can also follow the training progress in the TensorBoard charts generated from our training logs.
Given that this is a completed competition, we have access to the test set. We shouldn't use it for validation, to avoid presenting overfitted results, but we can use it to see how our solution ranks against the benchmarks. Let's download that data and evaluate the model on it. We will need to repeat some of the steps we applied to the training set.
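A sketch of that evaluation, assuming the labelled test data has been cleaned into a test.csv with the same text/label columns as the training file.

# Prepare the test set the same way as the training data (file name is an assumption)
test_dataset = load_dataset('csv', data_files='test.csv')['train']
test_dataset = test_dataset.map(tokenize, batched=True)
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

# Evaluate the fine-tuned model; this reports the same metrics as during training
metrics = trainer.evaluate(eval_dataset=test_dataset)
print(metrics)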
The F1 score on the test set is around 0.56-0.59, which is in the range of the state-of-the-art results from last year's PolEval 2019 competition. It is also pretty competitive on the KLEJ benchmark, although models based on the RoBERTa-large architecture perform better, and the Polish RoBERTa base model is also significantly better. In a separate post, we will try to reach those scores.