If you're just getting started with BERT, this article is for you. I will explain the most popular use cases, the inputs and outputs of the model, and how it was trained. I will also provide some intuition into how it works, and will refer you to several excellent guides if you'd like to go deeper.

I've spent the last couple of months working on different NLP tasks, including text classification, question answering, and named entity recognition. BERT has been my starting point for each of these use cases - even though there are a bunch of new transformer-based architectures, it still performs surprisingly well, as evidenced by the recent Kaggle NLP competitions. Eventually, I also ended up training my own BERT model for the Polish language and was the first to make it broadly available via the HuggingFace library.

Fortunately, you probably won't need to train your own BERT - pre-trained models are available for many languages, and several Polish language models have been published by now.

HuggingFace and PyTorch

HuggingFace Transformers is an excellent library that makes it easy to apply cutting-edge NLP models. I will use its code, such as pipelines, to demonstrate the most popular use cases for BERT. We will need pre-trained model weights, which are also hosted by HuggingFace. I will use PyTorch in some examples.

!pip install transformers -q

from transformers import pipeline, BertTokenizer, BertModel, BertForNextSentencePrediction, BertConfig
import torch

What can I use BERT for?

Text classification

Probably the most popular use case for BERT is text classification. This means that we are dealing with sequences of text and want to classify them into discrete categories.

Here are some examples of text sequences and categories:

Movie Review - Sentiment: positive, negative
Product Review - Rating: one to five stars
Email - Intent: product question, pricing question, complaint, other

Below is a code example of the sentiment classification use case.

# Text classification - sentiment analysis
nlp = pipeline("sentiment-analysis")

print(nlp("This movie was great!"))
print(nlp("I have just wasted 2 hours of my time."))

[{'label': 'POSITIVE', 'score': 0.6986343860626221}]
[{'label': 'NEGATIVE', 'score': 0.9613907337188721}]

Named Entity Recognition

Sometimes, we're not interested in the overall text, but in specific words in it. Maybe we want to extract the company name from a report. Or the start and end date of a hotel reservation from an email.

That means that we need to apply classification at the word level - well, actually BERT doesn't work with words, but tokens (more on that later on), so let's call it token classification.

There are existing pre-trained models for common types of named entities, like names of people, organizations, or locations. Let's see how this performs on an example text. Note that we will only print out the named entities; the tokens classified in the 'Other' category will be omitted.

# NER / token classification
nlp = pipeline("ner")

sequence = "My name is Darek and I live in Warsaw."

for token in nlp(sequence): print(token)

{'word': 'Dare', 'score': 0.9987152218818665, 'entity': 'I-PER'}
{'word': '##k', 'score': 0.9988871812820435, 'entity': 'I-PER'}
{'word': 'Warsaw', 'score': 0.9978176355361938, 'entity': 'I-LOC'}
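
The output above splits 'Darek' into two sub-word tokens, which is rarely what we want to show to a user. Below is a minimal sketch of gluing the fragments back together by hand. The helper `merge_subwords` is a hypothetical name of my own, and keeping the lowest fragment score is just one possible choice. Recent versions of the library can also do this for you through the pipeline's `aggregation_strategy` option.

```python
# Token dicts copied from the pipeline output printed above.
tokens = [
    {'word': 'Dare', 'score': 0.9987152218818665, 'entity': 'I-PER'},
    {'word': '##k', 'score': 0.9988871812820435, 'entity': 'I-PER'},
    {'word': 'Warsaw', 'score': 0.9978176355361938, 'entity': 'I-LOC'},
]


def merge_subwords(tokens):
    """Attach every '##' fragment to the token before it (hypothetical helper)."""
    merged = []
    for tok in tokens:
        if tok['word'].startswith('##') and merged:
            prev = merged[-1]
            prev['word'] += tok['word'][2:]  # drop the '##' marker
            # Assumption: keep the lowest fragment score as the word-level score.
            prev['score'] = min(prev['score'], tok['score'])
        else:
            merged.append(dict(tok))  # copy, so the input list stays untouched
    return merged


entities = merge_subwords(tokens)
for entity in entities:
    print(entity['word'], entity['entity'])
```

This simple rule assumes that a fragment always directly follows the beginning of its word in the list, which holds for the example but may not hold if only part of a word is tagged as an entity.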

Question Answering

Wouldn't it be great if we simply asked a question and got an answer? That is certainly a direction where some NLP research is heading (for example T5). BERT, however, can only handle extractive question answering. This means that we provide it with a context, such as a Wikipedia article, and a question related to that context. BERT will then find the most likely place in the article that contains an answer to our question, or inform us that an answer is not likely to be found.

# Question Answering
nlp = pipeline("question-answering")

context = "My name is Darek. I'm Polish. I like to practice kungfu. My home is in Warsaw but I often travel to Berlin. My friend, Paul, lives in Canada."

print(nlp(question="Where does Darek live?", context=context))
print(nlp(question="Where does Paul live?", context=context))

{'score': 0.8502292525232313, 'start': 71, 'end': 77, 'answer': 'Warsaw'}
{'score': 0.9584999083856722, 'start': 134, 'end': 140, 'answer': 'Canada.'}
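
How does the model "find the most likely place"? A question answering head produces two scores for every token: one for being the start of the answer, and one for being the end. The answer is the span with the best combined score. Here is a rough sketch of that selection step. The logits are made-up numbers, and `best_span` and `max_answer_len` are hypothetical names, not the pipeline's actual implementation.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) of the best valid span; `end` is inclusive."""
    best = (0, 0, float('-inf'))
    for i, start_score in enumerate(start_logits):
        # The end may not come before the start, and the span length is capped.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Made-up logits for a 6-token context (an assumption for illustration).
context_tokens = ['my', 'home', 'is', 'in', 'warsaw', '.']
start_logits = [0.1, 0.3, -1.0, 0.2, 4.0, -2.0]
end_logits = [-1.0, 0.0, -0.5, 0.1, 3.5, 1.0]

start, end, score = best_span(start_logits, end_logits)
print(context_tokens[start:end + 1], score)
```

The token positions are then mapped back to character offsets in the original text, which is where the 'start' and 'end' values in the pipeline output come from.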

Other use cases and fine-tuning

There are some other interesting use cases for transformer-based models, such as text summarization, text generation, or translation. BERT is not designed to do these tasks specifically, so I will not cover them here.

The examples above are based on pre-trained pipelines, which means that they may be useful for us if our data is similar to what they were trained on. Very often, we will need to fine-tune a pre-trained model to fit our data or task. This is much more efficient than training a whole model from scratch, and even with relatively few examples we can often achieve very good performance.

To be able to do fine-tuning, we need to understand a bit more about BERT.

What are the inputs to BERT, and what comes out of it?

Let's start by treating BERT as a black box. The minimum that we need to understand to use the black box is what data to feed into it, and what type of outputs to expect. You can build on top of these outputs, for example by adding one or more linear layers. You can then fine-tune your custom architecture on your data.

Tokenization

Before you feed your text into BERT, you need to turn it into numbers. That's the role of a tokenizer. Some tokenizers split text on spaces, so that each token corresponds to a word. That would, however, result in a huge vocabulary, which makes training a model more difficult, so BERT relies on sub-word tokenization instead. Let's see how it works in code.

Each pre-trained model comes with a pre-trained tokenizer (we can't separate them), so we need to download it as well. Let's use it then to tokenize a line of text and see the output.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = 'I like to practice kungfu.'
tokens = tokenizer.encode(text)
print(tokens)

[101, 1045, 2066, 2000, 3218, 18577, 11263, 1012, 102]

Each token is a number that corresponds to a word (or subword) in the vocabulary. The most frequent words are represented as whole words, while less frequent words are divided into sub-words. That ensures that we can map the entire corpus to a fixed-size vocabulary without unknown tokens (in reality, they may still come up in rare cases). Let's see the size of our model's vocabulary, and how the tokens correspond to words.

print(f'Length of BERT base vocabulary: {len(tokenizer.vocab)}')
print(f'Text: {text}')
for t in tokens:
  print(f'Token: {t}, subword: {tokenizer.decode([t])}')

Length of BERT base vocabulary: 30522
Text: I like to practice kungfu.
Token: 101, subword: [CLS]
Token: 1045, subword: i
Token: 2066, subword: like
Token: 2000, subword: to
Token: 3218, subword: practice
Token: 18577, subword: kung
Token: 11263, subword: ##fu
Token: 1012, subword: .
Token: 102, subword: [SEP]

In the example, you can see how the tokenizer split the less common word 'kungfu' into 2 subwords: 'kung' and '##fu'. The '##' characters inform us that this subword continues a word rather than starting one. The BERT tokenizer also added 2 special tokens that are expected by the model: [CLS], which comes at the beginning of every sequence, and [SEP], which comes at the end. [SEP] is also used to separate two sequences, for example the question and the context in a question answering scenario. Another example of a special token is [PAD]. We need it to pad shorter sequences in a batch, because BERT expects each example in a batch to have the same number of tokens.
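
To build some intuition for what the tokenizer does, here is a toy sketch of greedy longest-match sub-word splitting, plus adding the special tokens and padding a batch. The tiny vocabulary and its ids, and the helper names `wordpiece`, `encode` and `pad_batch`, are my own assumptions. The real tokenizer has 30522 entries and handles punctuation, casing and accents far more carefully.

```python
# Toy vocabulary (an assumption; ids do not match the real BERT vocabulary).
VOCAB = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, 'i': 4, 'like': 5,
         'to': 6, 'practice': 7, 'kung': 8, '##fu': 9, '.': 10}


def wordpiece(word, vocab):
    """Split one word into sub-words, always taking the longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ['[UNK]']  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on spaces and full stops, add [CLS] and [SEP]."""
    words = text.lower().replace('.', ' .').split()
    pieces = ['[CLS]'] + [p for w in words for p in wordpiece(w, vocab)] + ['[SEP]']
    return [vocab[p] for p in pieces]


def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one and build the attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask


batch = [encode('I like to practice kungfu.', VOCAB), encode('I like kungfu.', VOCAB)]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The attention mask tells the model which positions hold real tokens (1) and which are padding (0), so that padding does not influence the result.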

Outputs

Let's download a pretrained model now, run our text through it, and see what comes out. We will first need to convert the tokens into tensors, and add the batch size dimension (here, we will work with batch size 1).

model = BertModel.from_pretrained('bert-base-uncased')

inputs = torch.tensor(tokens).unsqueeze(0) # Batch size 1
outputs = model(inputs)
print(f'output type: {type(outputs)}, output length: {len(outputs)}')
print(f'first item shape: {outputs[0].shape}')
print(f'second item shape: {outputs[1].shape}')

output type: <class 'tuple'>, output length: 2
first item shape: torch.Size([1, 9, 768])
second item shape: torch.Size([1, 768])

In the examples above, we used BERT to handle some useful tasks, such as text classification, named entity recognition, or question answering. For each of those tasks, a task-specific model head was added on top of raw model outputs. Here, we are dealing with the raw model outputs - we need to understand them to be able to add custom heads to solve our own, specific tasks.

The model outputs a tuple (recent versions of the library return an output object by default, but it can still be indexed in the same way). The first item has the following shape: 1 (batch size) x 9 (sequence length) x 768 (the number of hidden units). This is called the sequence output, and it provides the representation of each token in the context of the other tokens in the sequence. If we'd like to fine-tune our model for named entity recognition, we will use this output and expect the 768 numbers representing each token in a sequence to inform us if the token corresponds to a named entity.

The second item in the tuple has the shape: 1 (batch size) x 768 (the number of hidden units). It is called the pooled output, and in theory it should represent the entire sequence. It is computed from the hidden state of the first token in a sequence (the [CLS] token), passed through one more linear layer with a tanh activation. We can use it in a text classification task - for example when we fine-tune the model for sentiment classification, we'd expect the 768 hidden units of the pooled output to capture the sentiment of the text.

In practice, we may want to use some other way to capture the meaning of the sequence, for example by averaging the sequence output, or even concatenating the hidden states from lower layers.
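
Here is a small sketch of these alternatives in plain Python, for a single sequence without the batch dimension. The nested lists stand in for real model outputs, with a hidden size of 2 instead of 768, and the function names are hypothetical. With the real model you would do the same with tensor operations, and to my knowledge you can request the per-layer hidden states by passing `output_hidden_states=True`.

```python
def cls_vector(sequence_output):
    """Use the hidden state of the first token ([CLS]) as the sequence vector."""
    return sequence_output[0]


def mean_pool(sequence_output, attention_mask):
    """Average the token vectors, skipping padded positions (mask value 0)."""
    hidden_size = len(sequence_output[0])
    kept = [vec for vec, keep in zip(sequence_output, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]


def concat_layers(hidden_states, position=0, last_n=4):
    """Concatenate one token's vectors from the last `last_n` layers."""
    combined = []
    for layer in hidden_states[-last_n:]:
        combined.extend(layer[position])
    return combined


# Stand-in sequence output: 3 real tokens plus 1 padding slot, hidden size 2.
sequence_output = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]
print(cls_vector(sequence_output))
print(mean_pool(sequence_output, attention_mask))

# Stand-in hidden states: 6 layers, each holding one token vector [l, l].
hidden_states = [[[float(layer), float(layer)]] for layer in range(6)]
print(concat_layers(hidden_states))
```

Note that the mean ignores the padding slot thanks to the attention mask. Without it, padded positions would distort the average.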

How was BERT trained?

The models we have been using so far have already been pre-trained, and in some cases fine-tuned as well. What does this actually mean?

Pre-training

In order for a model to solve an NLP task, like sentiment classification, it needs to understand a lot about language. Most of the labelled datasets that we have available are too small to teach our model enough about language. Ideally, we'd like to use all the text we have available, for example all books and the internet. Because it's hard to label so much text, we create 'fake tasks' that will help us achieve our goal without manual labelling.

BERT is trained on a very large corpus using two 'fake tasks': masked language modeling (MLM) and next sentence prediction (NSP). In MLM, we randomly hide some tokens in a sequence, and ask the model to predict which tokens are missing. In NSP, we provide our model with two sentences, and ask it to predict if the second sentence follows the first one in our corpus. The intent of these tasks is for our model to be able to represent the meaning of both individual words and entire sentences.

nlp = pipeline("fill-mask")
preds = nlp(f"I am exhausted, it's been a very {nlp.tokenizer.mask_token} day.")
print('I am exhausted, it\'s been a very ***** day.')
for p in preds: print(nlp.tokenizer.decode([p['token']]))
preds = nlp(f"I am excited, it's been a very {nlp.tokenizer.mask_token} day.")
print('I am excited, it\'s been a very ***** day.')
for p in preds: print(nlp.tokenizer.decode([p['token']]))

I am exhausted, it's been a very ***** day.
 busy
 exhausting
 stressful
 taxing
 rough
I am excited, it's been a very ***** day.
 busy
 exciting
 productive
 good
 nice
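
The "hiding" step on the training side can be sketched as follows. As far as I recall from the BERT paper, about 15% of tokens are selected for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The function name `mask_tokens` is hypothetical, 103 is assumed to be the id of [MASK] in this vocabulary, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_id, special_ids, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling (hypothetical helper)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position is ignored by the loss
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select the rest with probability mask_prob.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Token ids of 'I like to practice kungfu.' from the tokenization example.
token_ids = [101, 1045, 2066, 2000, 3218, 18577, 11263, 1012, 102]
# mask_prob is raised to 0.5 here only to make the effect visible on a short text.
inputs, labels = mask_tokens(token_ids, vocab_size=30522, mask_id=103,
                             special_ids={101, 102}, mask_prob=0.5, seed=0)
print(inputs)
print(labels)
```

The random and unchanged cases are there so that the model cannot rely on [MASK] always being present, since that token never appears during fine-tuning.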

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

first_sentence = "I cut my finger."
second_sentence_right = "The blood started flowing."
second_sentence_wrong = "This website uses cookies."

right = tokenizer.encode_plus(first_sentence, text_pair=second_sentence_right)
wrong = tokenizer.encode_plus(first_sentence, text_pair=second_sentence_wrong)

r1, r2, r3 = torch.tensor(right['input_ids']).unsqueeze(0), torch.tensor(right['token_type_ids']).unsqueeze(0), torch.tensor(right['attention_mask']).unsqueeze(0)
w1, w2, w3 = torch.tensor(wrong['input_ids']).unsqueeze(0), torch.tensor(wrong['token_type_ids']).unsqueeze(0), torch.tensor(wrong['attention_mask']).unsqueeze(0)

right_outputs = model(input_ids=r1, token_type_ids=r2, attention_mask=r3)
right_seq_relationship_scores = right_outputs[0]
wrong_outputs = model(input_ids=w1, token_type_ids=w2, attention_mask=w3)
wrong_seq_relationship_scores = wrong_outputs[0]

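# Logit 0 scores "the second sentence follows the first", logit 1 scores "it is a random sentence"
right_logits = right_seq_relationship_scores.detach()[0]
wrong_logits = wrong_seq_relationship_scores.detach()[0]
print(first_sentence + ' ' + second_sentence_right)
print(f'Next sentence prediction: {(right_logits[0] > right_logits[1]).item()}')
print(first_sentence + ' ' + second_sentence_wrong)
print(f'Next sentence prediction: {(wrong_logits[0] > wrong_logits[1]).item()}')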

I cut my finger. The blood started flowing.
Next sentence prediction: True
I cut my finger. This website uses cookies.
Next sentence prediction: False
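
For completeness, here is a rough sketch of how NSP training examples can be built and packed into a single input. Half of the time the second sentence is the true successor, and half of the time it is a random one. Both helper names are hypothetical, whitespace splitting stands in for the real tokenizer, and drawing the random sentence from the same short list is a simplification of mine.

```python
import random


def make_nsp_example(sentences, index, rng):
    """Pair sentence `index` with its true successor or with a random sentence."""
    first = sentences[index]
    if rng.random() < 0.5:
        second, label = sentences[index + 1], 0  # 0 = B really follows A
    else:
        candidates = [s for k, s in enumerate(sentences) if k not in (index, index + 1)]
        second, label = rng.choice(candidates), 1  # 1 = B is a random sentence
    return first, second, label


def pack_pair(tokens_a, tokens_b):
    """Build [CLS] A [SEP] B [SEP] and the matching token type ids."""
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Type 0 covers [CLS], sentence A and the first [SEP]; type 1 covers the rest.
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


sentences = [
    "I cut my finger.",
    "The blood started flowing.",
    "This website uses cookies.",
    "Accept all to continue.",
]
rng = random.Random(0)
first, second, label = make_nsp_example(sentences, 0, rng)
tokens, token_type_ids = pack_pair(first.lower().split(), second.lower().split())
print(label)
print(tokens)
print(token_type_ids)
```

The token type ids are what `encode_plus` returned as 'token_type_ids' in the example above: they tell the model which tokens belong to the first sentence and which to the second.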

Fine-tuning

As we can see from the examples above, BERT has learned quite a lot about language during pre-training. That knowledge is represented in its outputs - the hidden units corresponding to tokens in a sequence. We can use that knowledge by adding our own custom layers on top of BERT outputs, and further training (fine-tuning) the whole model on our own data.
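
To make "custom layers on top" a bit more concrete, here is a minimal sketch of the simplest possible head: one linear layer over the pooled output, followed by softmax and a cross-entropy loss. It is written in plain Python so that it runs without downloading a model. The random vector is only a stand-in for a real pooled output, and the function names are my own. In a real project you would more likely use `torch.nn.Linear` or a ready-made class such as `BertForSequenceClassification`.

```python
import math
import random


def linear(x, weights, bias):
    """y = W x + b for one input vector; W has one row per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Loss is small when the probability of the true class is high."""
    return -math.log(probs[label])


rng = random.Random(0)
hidden_size, num_classes = 768, 2  # e.g. negative / positive sentiment

# Stand-in for the pooled output of one sequence (random numbers, not real data).
pooled_output = [rng.uniform(-1, 1) for _ in range(hidden_size)]
# Head parameters, initialised with a small standard deviation.
weights = [[rng.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(num_classes)]
bias = [0.0] * num_classes

logits = linear(pooled_output, weights, bias)
probs = softmax(logits)
loss = cross_entropy(probs, label=1)
print(probs, loss)
```

During fine-tuning, the gradient of this loss is used to update both the new head and the BERT weights underneath it.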

How does BERT really work?

If training a model is like training a dog, then understanding the internals of BERT is like understanding the anatomy of a dog. It's not required to effectively train a model, but it can be helpful if you want to do some really advanced stuff, or if you want to understand the limits of what is possible.

I will only scratch the surface here by showing the key ingredients of the BERT architecture, and at the end I will point to some additional resources I have found very helpful.

Let's start by loading up the basic BERT configuration and looking at what's inside.

config = BertConfig()
config

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

This configuration file lists the key dimensions that determine the size of the model:

768 hidden size is the number of floats in a vector representing each token in the vocabulary
30522 is the vocabulary size
We can deal with max 512 tokens in a sequence
The initial embeddings will go through 12 layers of computation, including the application of 12 attention heads and dense layers with 3072 hidden units, to produce our final output, which will again be a vector with 768 units per token

Let's briefly look at each major building block of the model architecture. We start with the embedding layer, which maps each vocabulary token to a 768-dimensional embedding. We can also see position embeddings, which are trained to represent the ordering of tokens in a sequence, and token type embeddings, which are used if we want to distinguish between two sequences (for example question and context).

model = BertModel(config)
print(model.embeddings)

BertEmbeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
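
The three embedding tables are combined by simple addition: for each token we look up its word, position and token type vectors and sum them, and the result then goes through the LayerNorm and dropout shown above. Here is a toy sketch with a hidden size of 4 instead of 768. The table sizes and the `embed` helper are assumptions for illustration, and normalization and dropout are left out.

```python
import random

rng = random.Random(0)
hidden_size = 4  # 768 in BERT base, shrunk here to keep the example readable


def make_table(rows):
    """Create a randomly initialised embedding table (stand-in for trained weights)."""
    return [[rng.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(rows)]


word_embeddings = make_table(12)       # vocabulary of 12 tokens (30522 in BERT)
position_embeddings = make_table(8)    # up to 8 positions (512 in BERT)
token_type_embeddings = make_table(2)  # first or second sequence


def embed(input_ids, token_type_ids):
    """Sum word, position and token type embeddings for every token."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        vectors.append([
            w + p + t
            for w, p, t in zip(word_embeddings[tok],
                               position_embeddings[position],
                               token_type_embeddings[seg])
        ])
    return vectors


# The same token id (5) at two different positions gets two different vectors.
vectors = embed([5, 5], [0, 0])
print(vectors[0])
print(vectors[1])
```

This is how the model can tell apart the same word appearing at different places in a sentence, even before any attention is applied.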

Then, we pass the embeddings through 12 layers of computation. Each layer starts with self-attention, followed by an intermediate dense layer with hidden size 3072 and an output dense layer that projects back to 768 units. The output of the last layer is the sequence output that we have already seen above. Usually, we will deal with the last hidden state, i.e. the 12th layer. However, to achieve better results, we may sometimes use the layers below as well to represent our sequences, for example by concatenating the last 4 hidden states.

print(f'There are {len(model.encoder.layer)} layers like this in the model architecture:')
print('---')
print(model.encoder.layer[0])

There are 12 layers like this in the model architecture:
---
BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=768, out_features=768, bias=True)
      (key): Linear(in_features=768, out_features=768, bias=True)
      (value): Linear(in_features=768, out_features=768, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (intermediate): BertIntermediate(
    (dense): Linear(in_features=768, out_features=3072, bias=True)
  )
  (output): BertOutput(
    (dense): Linear(in_features=3072, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

Finally, we have the pooled output, which is used in pre-training for the NSP task, and corresponds to the [CLS] token hidden state that goes through another linear layer and a tanh activation.

print(model.pooler)

BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)
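
With all the building blocks printed, we can do a quick sanity check and count the trainable parameters directly from the configuration. This is a back-of-the-envelope sketch that assumes exactly the layer layout shown above: each Linear has a weight matrix and a bias, and each LayerNorm has a scale and a shift per unit. The helper names are my own.

```python
config = {
    'vocab_size': 30522,
    'hidden_size': 768,
    'max_position_embeddings': 512,
    'type_vocab_size': 2,
    'intermediate_size': 3072,
    'num_hidden_layers': 12,
    'num_attention_heads': 12,
}


def linear_params(n_in, n_out):
    """Weight matrix plus bias vector of a Linear layer."""
    return n_in * n_out + n_out


def layer_norm_params(size):
    """One scale and one shift parameter per hidden unit."""
    return 2 * size


h = config['hidden_size']
inter = config['intermediate_size']

embeddings = ((config['vocab_size'] + config['max_position_embeddings']
               + config['type_vocab_size']) * h + layer_norm_params(h))

one_layer = (
    4 * linear_params(h, h)       # query, key, value and attention output
    + layer_norm_params(h)        # LayerNorm after attention
    + linear_params(h, inter)     # intermediate dense
    + linear_params(inter, h)     # output dense
    + layer_norm_params(h)        # LayerNorm after the output dense
)

pooler = linear_params(h, h)
total = embeddings + config['num_hidden_layers'] * one_layer + pooler

print(f'embeddings: {embeddings:,}')
print(f'one encoder layer: {one_layer:,}')
print(f'pooler: {pooler:,}')
print(f'total: {total:,}')
print(f'units per attention head: {h // config["num_attention_heads"]}')
```

The total comes out at roughly 110 million parameters, with the 12 encoder layers accounting for most of them and the word embeddings for most of the rest.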

In this overview, I haven't explained the self-attention mechanism at all, nor the detailed inner workings of BERT. If you'd like to learn more, here are some materials that I have found very useful.

Chris McCormick BERT Research Series on YouTube
Jay Alammar A Visual Guide to Using BERT for the First Time
Jay Alammar The Illustrated Transformer
Peter Bloem Transformers from Scratch