Training BERT for Cyberbullying Detection - HF Trainer Baseline
Our goal is to train a binary classification model to detect cyberbullying in Polish tweets. The dataset comes from a Polish NLP competition, PolEval 2019 (http://2019.poleval.pl/index.php/tasks/task6), and is also included in the Polish NLP benchmark KLEJ (https://klejbenchmark.com/).
Let's start by installing two HuggingFace libraries that will make our job easier, transformers and datasets, and importing the relevant modules.
!pip install transformers -qq
!pip install datasets -qq
import numpy as np
import pandas as pd
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
In this demo, we will use Polbert (https://github.com/kldarek/polbert), a Polish pretrained BERT model. It can be downloaded from the HuggingFace model hub, and we will use the BertForSequenceClassification class to load it. We will also need the Polbert tokenizer.
model = BertForSequenceClassification.from_pretrained('dkleczek/bert-base-polish-uncased-v1')
tokenizer = BertTokenizerFast.from_pretrained('dkleczek/bert-base-polish-uncased-v1')
Now we will load the training data from the KLEJ benchmark website, clean it up, and convert it to a csv file that can be used to create a dataset.
!wget -q https://klejbenchmark.com/static/data/klej_cbd.zip
!unzip -q klej_cbd.zip
df = pd.read_csv('train.tsv', delimiter='\t')
df = df.dropna().reset_index(drop=True)   # drop rows with missing values
df.columns = ['text', 'label']
df.label = df.label.astype(int)
df = df.sample(frac=1, random_state=42)   # shuffle with a fixed seed
df.to_csv('train.csv', index=False)
len(df), len(df[df.label == 1])
Our training set consists of about 10,000 tweets, but only 851 of them are tagged as cyberbullying. We can take a look at a sample of the data in the dataframe.
df.head()
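With fewer than 1 in 10 tweets labeled as cyberbullying, the classes are heavily imbalanced. A quick way to quantify this (a sketch with toy labels standing in for our df.label column):

```python
from collections import Counter

# Toy stand-in for df.label, roughly matching our counts:
# 0 = non-harmful, 1 = cyberbullying
labels = [1] * 851 + [0] * (10000 - 851)

counts = Counter(labels)
positive_ratio = counts[1] / len(labels)
print(f"positive class ratio: {positive_ratio:.4f}")  # 0.0851
```

With this imbalance, accuracy alone is misleading: a model that predicts "not cyberbullying" for every tweet would score about 91.5% accuracy while catching nothing, which is exactly why we will track F1, precision, and recall below.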
We need to convert our data into a format that can be fed to our model. Thanks to the HuggingFace datasets library, we can do this with just a few lines of code. We will load the dataset from the csv file and split it into a train (80%) and a validation (20%) set. We will then map the tokenizer over the dataset to convert the text strings into the format the BERT model expects (input_ids and an attention mask). Finally, we'll convert the result into torch tensors.
train_dataset, test_dataset = load_dataset('csv', data_files='train.csv', split=['train[:80%]', 'train[80%:]'])
# train_dataset[0]
def tokenize(batch):
    # pad to the longest example in the batch; truncate anything over BERT's 512-token limit
    return tokenizer(batch['text'], padding=True, truncation=True)
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))
# train_dataset[0]
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
# train_dataset[0]
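To make the padding and attention-mask idea concrete, here is a minimal pure-Python sketch (not the HuggingFace implementation) of what the tokenizer produces for a batch of variable-length sequences:

```python
def pad_batch(token_id_lists, pad_id=0):
    """Pad variable-length token id lists to a common length and build
    attention masks: 1 for real tokens, 0 for padding."""
    max_len = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        pad_len = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad_len)
        attention_mask.append([1] * len(ids) + [0] * pad_len)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Hypothetical token ids for two tokenized tweets of different lengths
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch['input_ids'])       # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(batch['attention_mask'])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask tells the model which positions carry real content, so the padding tokens do not influence the predictions.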
We are almost ready to start the training. Let's define a function that will help us monitor the training progress and evaluate results on the validation dataset. We will primarily focus on the F1, recall, and precision metrics, especially since F1 is the official evaluation metric for this dataset.
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
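As a sanity check on what these metrics mean, the binary precision/recall/F1 definitions can be worked through by hand on a toy prediction set (plain Python, mirroring what precision_recall_fscore_support computes for the positive class):

```python
# Toy ground-truth labels and model predictions
labels = [0, 0, 0, 1, 1, 1, 1, 0]
preds  = [0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)  # true positives: 3
fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)  # false positives: 1
fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)  # false negatives: 1

precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
```

Precision answers "of the tweets we flagged, how many were truly cyberbullying?", recall answers "of the actual cyberbullying tweets, how many did we catch?", and F1 is their harmonic mean, which is why it is a sensible headline metric for an imbalanced task like this one.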
HuggingFace wraps up the default transformer fine-tuning approach in the Trainer object, and we can customize it by passing training arguments such as learning rate, number of epochs, batch size etc. We will set logging_steps to 20, so that we can frequently evaluate how the model performs on the validation set throughout the training.
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    fp16=True,                    # mixed-precision training (requires a compatible GPU)
    warmup_steps=30,
    logging_steps=20,
    weight_decay=0.01,
    evaluation_strategy='steps',  # replaces the deprecated evaluate_during_training=True
    logging_dir='./logs',
)
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
trainer.train()