Our goal is to train a binary classification model that detects cyberbullying in Polish tweets. The dataset comes from a Polish NLP competition - PolEval 2019 (http://2019.poleval.pl/index.php/tasks/task6). It is also included in the Polish NLP benchmark KLEJ (https://klejbenchmark.com/).

Setup

Let's start by installing two libraries from HuggingFace that will make our job easier: transformers and datasets. We will then import everything we need for this demo.

!pip install transformers -qq
!pip install datasets -qq
import numpy as np
import pandas as pd
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In this demo, we will use the Polish pretrained BERT model - Polbert (https://github.com/kldarek/polbert). It can be downloaded from the HuggingFace model hub, and we will use the BertForSequenceClassification class to load it. We will also need the Polbert tokenizer.

model = BertForSequenceClassification.from_pretrained('dkleczek/bert-base-polish-uncased-v1')
tokenizer = BertTokenizerFast.from_pretrained('dkleczek/bert-base-polish-uncased-v1')

Some weights of the model checkpoint at dkleczek/bert-base-polish-uncased-v1 were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dkleczek/bert-base-polish-uncased-v1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
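
These warnings are expected: the pretrained checkpoint contains the language-modelling heads, which we discard, while the classification head on top of BERT is newly initialized and will be trained in the next steps. As a quick sanity check (not part of the original run), we can also confirm that the model is configured with two labels, which is exactly what we need for binary classification:

print(model.config.num_labels)  # 2 by default, i.e. a binary classification head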


Data

Now we will load the training data from the KLEJ benchmark website, clean it up, and convert it to a CSV file that can be used to create a dataset.

!wget -q https://klejbenchmark.com/static/data/klej_cbd.zip
!unzip -q klej_cbd.zip
--2020-10-26 08:29:07--  https://klejbenchmark.com/static/data/klej_cbd.zip
Resolving klejbenchmark.com (klejbenchmark.com)... 35.234.99.58
Connecting to klejbenchmark.com (klejbenchmark.com)|35.234.99.58|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 375476 (367K) [application/zip]
Saving to: ‘klej_cbd.zip’

klej_cbd.zip        100%[===================>] 366.68K   866KB/s    in 0.4s    

2020-10-26 08:29:08 (866 KB/s) - ‘klej_cbd.zip’ saved [375476/375476]

Archive:  klej_cbd.zip
  inflating: test_features.tsv       
  inflating: train.tsv               
df = pd.read_csv('train.tsv', delimiter='\t')
df = df.dropna().reset_index(drop=True)    # drop rows with missing values
df.columns = ['text', 'label']
df.label = df.label.astype(int)
df = df.sample(frac=1, random_state=42)    # shuffle before the train/validation split
df.to_csv('train.csv', index=False)
len(df), len(df[df.label == 1])
(10041, 851)

Our training set consists of 10,041 tweets, but only 851 of them are tagged as cyberbullying, so the classes are heavily imbalanced. We can take a look at a sample of the data in the dataframe.

df.head()
text label
5809 LUDZIE Z BYDGOSZCZY: NAJLEPSZA RESTAURACJA? Rt... 0
5938 @anonymized_account Stałam na zewnątrz, ale ma... 0
2260 RT @anonymized_account Halicki: proszę nie mów... 0
8833 @anonymized_account @anonymized_account Czyli ... 1
4513 @anonymized_account Już nic nie będzie takie s... 0

Dataset

We need to convert our data into a format that can be fed to our model. Thanks to the HuggingFace datasets library, we can do this with just a few lines of code. We will load the dataset from the CSV file and split it into a train (80%) and a validation (20%) set. We will then map the tokenizer over both splits to convert the text strings into the format the BERT model expects (input_ids and attention_mask). Finally, we'll convert everything into torch tensors.

train_dataset, test_dataset = load_dataset('csv', data_files='train.csv', split=['train[:80%]', 'train[80%:]'])
Using custom data configuration default
Downloading and preparing dataset csv/default-013faa159f500b12 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-013faa159f500b12/0.0.0/49187751790fa4d820300fd4d0707896e5b941f1a9c644652645b866716a4ac4...
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-013faa159f500b12/0.0.0/49187751790fa4d820300fd4d0707896e5b941f1a9c644652645b866716a4ac4. Subsequent calls will reuse this data.
# train_dataset[0]
def tokenize(batch):
    # Pad each batch to its longest tweet; tweets are short, so no truncation is needed.
    return tokenizer(batch['text'], padding=True, truncation=False)

train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))

# train_dataset[0]
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
# train_dataset[0]
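
If you are curious what a processed example looks like, you can inspect one directly; this quick check (not required for training) shows the tensors that will be fed to the model:

example = train_dataset[0]
print(example['input_ids'].shape)       # token ids, padded to the longest tweet in the split
print(example['attention_mask'].shape)  # 1 for real tokens, 0 for padding
print(example['label'])                 # 0 = harmless, 1 = cyberbullying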

Training and Evaluation

We are almost ready to start the training. Let's define a function that will help us monitor the training progress and evaluate results on the validation dataset. We will primarily focus on the F1, precision, and recall metrics, especially since F1 is the official evaluation metric for this dataset.

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
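
To get a feel for what this function returns, we can call it on a tiny hand-made prediction object that mimics the one the Trainer will pass in (the numbers below are made up, purely for illustration):

from collections import namedtuple

ToyPred = namedtuple('ToyPred', ['label_ids', 'predictions'])
toy = ToyPred(label_ids=np.array([0, 1, 1, 0]),
              predictions=np.array([[2.0, -1.0], [0.5, 1.5], [1.0, 0.2], [3.0, -2.0]]))
print(compute_metrics(toy))
# {'accuracy': 0.75, 'f1': 0.67, 'precision': 1.0, 'recall': 0.5}  (f1 rounded)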

HuggingFace wraps the default transformer fine-tuning approach in the Trainer object, which we can customize by passing training arguments such as the learning rate, number of epochs, and batch size. We will set logging_steps to 20 so that we can frequently evaluate how the model performs on the validation set throughout training.

training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    fp16=True,
    warmup_steps=30,
    logging_steps=20,
    weight_decay=0.01,
    evaluate_during_training=True,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
/usr/local/lib/python3.6/dist-packages/transformers/training_args.py:339: FutureWarning: The `evaluate_during_training` argument is deprecated in favor of `evaluation_strategy` (which has more options)
  FutureWarning,
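
As the FutureWarning above indicates, evaluate_during_training is deprecated; on newer versions of transformers the same behaviour is enabled through evaluation_strategy. A sketch of the equivalent arguments (not what was actually run in this notebook):

training_args = TrainingArguments(
    output_dir='./results',
    logging_steps=20,
    evaluation_strategy='steps',  # replaces evaluate_during_training=True
    logging_dir='./logs',
    # remaining arguments (learning rate, batch sizes, fp16, etc.) stay the same as above
)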
trainer.train()
/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py:847: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return torch.tensor(x, **format_kwargs)
[378/378 05:31, Epoch 3/3]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
20 0.581389 0.320318 0.907869 0.000000 0.000000 0.000000
40 0.268629 0.245489 0.908367 0.010753 1.000000 0.005405
60 0.216368 0.229870 0.914841 0.149254 0.937500 0.081081
80 0.207710 0.255416 0.913845 0.121827 1.000000 0.064865
100 0.187619 0.178497 0.926793 0.374468 0.880000 0.237838
120 0.207315 0.169160 0.931275 0.589286 0.655629 0.535135
140 0.155866 0.170657 0.929283 0.564417 0.652482 0.497297
160 0.173842 0.173169 0.930279 0.554140 0.674419 0.470270
180 0.128828 0.174443 0.927291 0.522876 0.661157 0.432432
200 0.144621 0.169735 0.930777 0.535117 0.701754 0.432432
220 0.122884 0.167735 0.929781 0.568807 0.654930 0.502703
240 0.134178 0.168437 0.927291 0.535032 0.651163 0.454054
260 0.094126 0.169739 0.932769 0.571429 0.692308 0.486486
280 0.079876 0.183835 0.931275 0.574074 0.669065 0.502703
300 0.091462 0.203578 0.930279 0.545455 0.682927 0.454054
320 0.074335 0.195306 0.930777 0.582583 0.655405 0.524324
340 0.079284 0.202131 0.932769 0.563107 0.701613 0.470270
360 0.082395 0.193977 0.931275 0.584337 0.659864 0.524324


/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
TrainOutput(global_step=378, training_loss=0.16361538316837695)

We reach an F1 score of around 0.58 - 0.62 on the validation set (the exact result will vary, since we are not fixing random seeds here). We can also follow the training progress in the TensorBoard charts that read our training logs.

trainer.evaluate()
[32/32 00:04]
{'epoch': 3.0,
 'eval_accuracy': 0.9327689243027888,
 'eval_f1': 0.5970149253731343,
 'eval_loss': 0.19227837026119232,
 'eval_precision': 0.6666666666666666,
 'eval_recall': 0.5405405405405406,
 'total_flos': 1738480015991628}
%load_ext tensorboard
%tensorboard --logdir logs

Result Evaluation

Given that this is a completed competition, we have access to the test set labels. We shouldn't use the test set for validation, to avoid presenting overfitted results, but we can use it to see how our solution ranks against the benchmarks. Let's download the data and evaluate the model on it. We will need to repeat some of the steps we applied to the training set.

test_df = pd.read_csv('test_features.tsv', delimiter='\t')
test_df.columns = ['text']
final_test_dataset = Dataset.from_pandas(test_df)
final_test_dataset = final_test_dataset.map(tokenize, batched=True, batch_size=len(final_test_dataset))
final_test_dataset.set_format('torch', columns=['input_ids', 'attention_mask'])

!wget https://raw.githubusercontent.com/ptaszynski/cyberbullying-Polish/master/task%2001/test_set_clean_only_tags.txt
df_lbls = pd.read_csv('test_set_clean_only_tags.txt',names=['label'])
labels = df_lbls.label.values
--2020-10-26 08:35:05--  https://raw.githubusercontent.com/ptaszynski/cyberbullying-Polish/master/task%2001/test_set_clean_only_tags.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3000 (2.9K) [text/plain]
Saving to: ‘test_set_clean_only_tags.txt’

test_set_clean_only 100%[===================>]   2.93K  --.-KB/s    in 0s      

2020-10-26 08:35:05 (87.7 MB/s) - ‘test_set_clean_only_tags.txt’ saved [3000/3000]

preds = trainer.predict(final_test_dataset)
outputs = preds.predictions.argmax(axis=1)
[32/32 00:10]
precision, recall, f1, _ = precision_recall_fscore_support(labels, outputs, average='binary')
acc = accuracy_score(labels, outputs)
print( {
    'accuracy': acc,
    'f1': f1,
    'precision': precision,
    'recall': recall
})
{'accuracy': 0.913, 'f1': 0.5671641791044776, 'precision': 0.8507462686567164, 'recall': 0.4253731343283582}
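
As a final sanity check, here is a minimal sketch of how the fine-tuned model could be used to classify a single new tweet (the example text below is made up; any real input should be preprocessed the same way as the training data, e.g. with user mentions replaced by @anonymized_account):

model.eval()
text = "@anonymized_account to tylko przykładowy tweet"  # hypothetical input
inputs = {k: v.to(model.device) for k, v in tokenizer(text, return_tensors='pt').items()}
with torch.no_grad():
    logits = model(**inputs, return_dict=True).logits
print(logits.argmax(dim=-1).item())  # 1 = cyberbullying, 0 = harmless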

The F1 score on the test set is around 0.56 - 0.59, which is in the range of the state-of-the-art results from last year's PolEval 2019 competition. It is also quite competitive on the KLEJ benchmark, although models based on the RoBERTa-large architecture perform better, and the Polish RoBERTa base model is also significantly better. In a separate post, we will try to reach those scores.
