You developed a mobile app and want to figure out what your users are talking about in the app reviews. You have thousands of tweets mentioning your product and not enough time to read and digest all of them. Maybe you want to look at your emails from the last 5 years and figure out what you have spent your time on while reading and answering them.

If any of these use cases sounds familiar, you should learn about topic modeling! In this article, I will explore various topic modeling algorithms and approaches. You can also open it in Google Colab and easily apply it to your own dataset!

Install the libraries

To start with, let's install three libraries:

  • datasets will allow us to easily grab a bunch of texts to work with
  • sentence-transformers will help us create text embeddings (more on that later)
  • bokeh will help us with visualization

We will install these libraries and import the functions and classes we will need later on.

!pip install -qq datasets
!pip install -Uqq sentence-transformers
!pip install -qq bokeh
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import sklearn.manifold
import numpy as np
import pandas as pd
import random
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, HoverTool, LinearColorMapper
from bokeh.palettes import plasma, d3, Turbo256
from bokeh.plotting import figure
from bokeh.transform import transform

import bokeh.plotting as bpl
import bokeh.models as bmo

Grab the data

Topic modeling requires a bunch of texts. We don't need any labels! Let's grab an English subset of the public Amazon reviews dataset and test if we can get practical insights on the topics and themes represented in those reviews.

dataset = load_dataset('amazon_reviews_multi', 'en')
Reusing dataset amazon_reviews_multi (/root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)

First Look at the Data

Let's take a quick look at the data we'll be working with. Our dataset is a dictionary consisting of three parts: train, validation and test. Let's peek into the train set and put it into a pandas DataFrame to see how it's structured.

dataset.keys()
dict_keys(['train', 'validation', 'test'])
df = pd.DataFrame(dataset['train'])
df.head()
   language  product_category  product_id          review_body                                        review_id   review_title                                       reviewer_id          stars
0  en        furniture         product_en_0740675  Arrived broken. Manufacturer defect. Two of th...  en_0964290  I'll spend twice the amount of time boxing up ...  reviewer_en_0342986  1
1  en        home_improvement  product_en_0440378  the cabinet dot were all detached from backing...  en_0690095  Not use able                                       reviewer_en_0133349  1
2  en        home              product_en_0399702  I received my first order of this product and ...  en_0311558  The product is junk.                               reviewer_en_0152034  1
3  en        wireless          product_en_0444063  This product is a piece of shit. Do not buy. D...  en_0044972  Fucking waste of money                             reviewer_en_0656967  1
4  en        pc                product_en_0139353  went through 3 in one day doesn't fit correct ...  en_0784379  bubble                                             reviewer_en_0757638  1

This is useful - we can see that the dataset consists of a number of attributes. We'll focus on review_body and try to discover topics in those reviews, but the other attributes can help us validate whether we're heading in a good direction. For example, we can compare how our topics correlate with the product_category attribute. Let's peek into the categories just to see what we have in the dataset.

df.product_category.value_counts().plot(kind='bar', figsize=(15,5));

How can we extract meaning from the review_body though? There are many ways of course. Rather than going bottom up from simple techniques such as key words, n-grams, tf-idf etc., let's jump straight into the concept of embedding.


A key idea in machine learning is that of representations. Most algorithms can only work with numbers, so whatever we're dealing with - words, texts, images - has to be represented with numbers. We are focusing on texts here. Texts can express many different things, so we need many numbers for each text - let's say 768. We'll put these 768 numbers into a vector and use it to represent the text. These vectors are called embeddings.
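To make the idea of "texts as vectors" a bit more tangible, here is a minimal sketch. The three 4-dimensional vectors are made up for illustration (a real model would produce something like 768 numbers per text), and cosine similarity is just one common way to compare embeddings, not something the rest of this article depends on.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings"; the numbers are invented for illustration only.
zipper_1 = [0.9, 0.1, 0.0, 0.2]   # "The zipper broke after a week"
zipper_2 = [0.8, 0.2, 0.1, 0.3]   # "Zipper stopped working quickly"
book = [0.0, 0.9, 0.8, 0.1]       # "Great book, loved the story"

sim_related = cosine_similarity(zipper_1, zipper_2)
sim_unrelated = cosine_similarity(zipper_1, book)
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

Texts with similar meaning should get vectors that point in similar directions, which is what makes clustering and plotting them worthwhile.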

For the purposes of this article, we will not worry about where these embeddings come from, other than the fact that we can produce them with the sentence-transformers library. We will load a pretrained model (DistilBERT) and use it to encode our texts.

Dimensionality Reduction

768 numbers for each text are actually less meaningful to a person than the text itself, so how does this help? We can use some magic to reduce these 768 numbers to 2. This magic is called t-SNE, and it's one of several dimensionality reduction techniques (others include PCA and UMAP). It tries to preserve the relative positions of points in a multidimensional space while mapping them to fewer dimensions. With 2 dimensions, we can actually plot these points (texts) on a chart! Let's do it!

Oh, we have 200,000 texts in the train set, so our chart could get really cluttered... Let's take a sample of 1,000 texts and use it instead.

model = SentenceTransformer('stsb-distilbert-base')
sample = df.sample(n=1000, random_state=42)
texts = sample.review_body.values.tolist()
categories = sample.product_category.values.tolist()
embeddings = model.encode(texts)
out = sklearn.manifold.TSNE(n_components=2).fit_transform(embeddings)
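To build some intuition for what a dimensionality reduction step does, here is a rough sketch using PCA, the simplest of the techniques mentioned above, written with plain NumPy. It is not what t-SNE does internally (t-SNE is nonlinear, PCA is a linear projection), and the random `fake_embeddings` array is only a stand-in for real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real embeddings: 200 made-up "texts" with 768 numbers each.
fake_embeddings = rng.normal(size=(200, 768))

def pca_2d(x):
    """Project the rows of x onto their two directions of largest variance."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

points = pca_2d(fake_embeddings)
print(points.shape)  # (200, 2)
```

Either way, the result has the same shape as `out` above: one row per text and two columns that can be used as x and y coordinates.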

Visualization with bokeh

Bokeh is a nice tool for creating interactive charts. We'll use it to create a scatter plot where each text is placed at the two coordinates that t-SNE produced from its embedding, so texts with similar meaning should end up close to each other. Additionally, we'll color each dot to indicate which category it comes from. We can hover over the chart and see the text and category associated with each dot.

clrs = random.sample(Turbo256, len(set(categories)), )
color_map = bmo.CategoricalColorMapper(factors=list(set(categories)), palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts

source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, cat=categories))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("(x,y)", "(@x, @y)"),
    ('desc', '@desc'),
    ('cat', '@cat')
])

# newer Bokeh releases use width/height instead of plot_width/plot_height
p = figure(plot_width=1200, plot_height=600, tools=[hover], title="First Look at the Data")
p.scatter('x', 'y', size=10, source=source, fill_color=transform('cat', color_map))
bpl.show(p)
[Interactive scatter plot: First Look at the Data]

Looks interesting! If you hover over the distinct clusters on the chart, you should be able to recognize common topics. Some of these topics are related to a single category, some of them are shared across categories. What topics can you find in the chart?

Discovering Topics with BERTopic

Looking at the chart above, we can get a sense for some of the topics in our corpus, but it doesn't solve our problem yet. It would require lots of time to review the chart in detail, find clusters, and label them. How can we automate this process?

BERTopic is one of the methods to achieve that. It relies on sentence embeddings, dimensionality reduction, and clustering algorithms to produce clusters of documents (topics). Let's see if we can get some good insights with this approach.

!pip install bertopic -qq
from bertopic import BERTopic
model = BERTopic(language="english")
topics, probs = model.fit_transform(texts)
len(topics), len(set(topics))
(1000, 16)

We've run the algorithm on our 1000 texts sample, and it identified 16 topics in this corpus. Let's see if we can learn something more about those topics!

    Topic  Count
0      -1    451
1      11    104
2       9     97
3      14     61
4      10     43
5       1     42
6       7     28
7       8     28
8       0     26
9      13     22
10      4     21
11      5     19
12      2     16
13      3     15
14      6     15

Wow, there are quite a lot of outliers here, represented by topic -1: almost half of the dataset! Let's take a look at one of the topics from this dataset.

[('size', 0.05050109336233438),
 ('fit', 0.026139678912211962),
 ('could', 0.025590393661103304),
 ('top', 0.025448458979747752),
 ('ordered', 0.02355394098054413),
 ('dress', 0.022519132135764744),
 ('larger', 0.020384763234235534),
 ('zipper', 0.019640993684217505),
 ('too', 0.01934745458460365),
 ('all', 0.019074915582195817)]

What we typically get with topic modeling is a list of key words associated with each topic, each with a weight. In the case above, we can see key words associated with sizes: size, fit, larger. Let's take a look at some texts associated with this topic to confirm our intuition.

ex_ind = [i for i, x in enumerate(topics) if x == 1]  # positions of texts assigned to topic 1
ex_txt = [texts[i] for i in ex_ind]
for t in ex_txt[:10]:
  print(t)
Really cute mug. I would have given 5 stars if it were a bit bigger.
Not the size I hoped for but that could be partly my fault. It did come in a very nice gift bag with the brand name on it but I just wish that it was a bead or two larger. Otherwise this is a great gift for someone with a petite wrist.
Its o.k. but not as thick as another brand I previously used. I think the other brand lasted longer in my hair for the day.
I wish I could give 5 stars. As far as the glasses go, I absolutely love them. But three glasses arrived completely shattered
The size was off, I usually wear a lrg. or x-lrg. But this was snug I wanted to order larger but was sold out.
The top was a bit tight and I'm a 36 B. I got a medium. I prob would still wear top but underboob is inevitable since the straps are not adjustable. Otherwise the top was cute. Bottoms fit weird and where the strappy parts are on each side the inner lining (tan/white material) showed no matter what and looked super odd. Not cute at all. Maybe I am just too wide for them. I have a 26" waist. Def for SHORT PETITE people.
I really want to give this suit a 5 star but I can’t. The appearance is beautiful and I love the color. But sadly the top is to big. I followed the sizing chart for around the bust size. It all fits there but the cup size in a xxl looks as if it is a triple d or a double d. I am a larger girl being 249 but my chest is smaller. Would love to exchange sizes but cant find anywhere to message sender.
I ordered a size up because my butt is larger than the rest of me, and like every other pair of jeans/shorts I buy, the waist is too big. You can see my underwear in these if I don’t have something underneath. They are good quality though.
Love this dress, I probably should order a smaller size since it is a bit loose in the top and very long on me.
The waist is too high and the bottom too long. I could get away with it but I like my leggings to be be fitted. I might have them altered or I send them back. Not sure yet. Fabric is on the thin size but not see through. Expected for the price. I am 5.2 so I would recommend for taller people! It adjusts well to my size which I am small/medium legging size. Perhaps they could create a petite size!

Indeed, most of these texts talk about sizes! Looks like the model is onto something!

What if we overlay the topics discovered here on our initial scatter plot? Let's try it! Now, instead of categories, we will color the dots according to the topic assigned by the BERTopic algorithm.

topic_words = ['-1: outlier']
for i in range(len(set(topics))-1):
  tpc = model.get_topic(i)[:7]
  words = [x[0] for x in tpc]
  tw = ' '.join([str(i) + ':'] + words)
  topic_words.append(tw)  # keep the label so topic i ends up at position i+1
exp_topics = [topic_words[x+1] for x in topics]
clrs = random.sample(Turbo256, len(set(topics)))
color_map = bmo.CategoricalColorMapper(factors=topic_words, palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts

source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, topic=exp_topics))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ('desc', '@desc'),
    ('topic', '@topic')
])

# newer Bokeh releases use width/height instead of plot_width/plot_height
p = figure(plot_width=1200, plot_height=600, tools=[hover], title="Test")
p.scatter('x', 'y', size=10, source=source,
          fill_color=transform('topic', color_map),
          # legend='topic'
          )
# p.legend.location = "top_left"
# p.legend.click_policy="hide"
bpl.show(p)
[Interactive scatter plot: texts colored by BERTopic topic]

In this visual, the topics are clustered together, which makes sense, because the chart and the topics were created with consistent methods. Interestingly, when looking at clusters of outliers that are located near each other in the chart, we can see common themes. I wonder why these were tagged as outliers?

LDA with Mallet

Let's now turn to a classic approach - LDA, Latent Dirichlet Allocation. We will not review the theory or the inner workings of this algorithm here. The key difference vs. BERTopic is that each text (document) is considered to be a composition of topics. We don't cluster documents into topics, but instead discover abstract topics that are represented in a document corpus. For each document, we get the probability distribution over these topics.

Let's imagine we have discovered three topics: sports, data science, competition.

A document that is about a data science competition might have the following distribution: sports: 0.05, data science: 0.5, competition: 0.45.

A document that talks about a world championship in cricket might have the following distribution instead: sports: 0.54, data science: 0.01, competition: 0.45.
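The two examples above can be written down as plain dictionaries. This is only a sketch with illustrative values chosen so that each distribution sums to 1; `dominant_topic` is a hypothetical helper, not part of any LDA library.

```python
# Illustrative document-topic distributions (values are made up).
doc_topics = {
    "data science competition": {"sports": 0.05, "data science": 0.50, "competition": 0.45},
    "cricket world championship": {"sports": 0.54, "data science": 0.01, "competition": 0.45},
}

def dominant_topic(distribution):
    """Return the topic with the highest probability in one document."""
    return max(distribution, key=distribution.get)

for name, dist in doc_topics.items():
    # Each document's topic probabilities form a distribution, so they sum to 1.
    assert abs(sum(dist.values()) - 1.0) < 1e-9
    print(name, "->", dominant_topic(dist))
```

Note that both documents carry a sizeable share of the competition topic. Picking only the dominant topic, as we do later for coloring the chart, throws that information away.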

There are many implementations of the LDA algorithm, and some of them seem to produce significantly worse results than others. The Mallet implementation appears to be regarded as one of the best, so we will use it here.

To speed things up, I will use the first 10,000 reviews for topic modeling. I will only display 1,000 reviews in the t-SNE chart.

Imports and installation

!pip install -Uqq gensim==3.8.3
import os  # needed to set the JAVA_HOME environment variable

def install_java():
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null  # install OpenJDK
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # point JAVA_HOME at the JDK
  !java -version  # check the Java version

install_java()
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)
!wget -q
!unzip -qq
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models.wrappers import LdaMallet
from gensim.models.coherencemodel import CoherenceModel
from gensim import similarities

import os.path
import re
import glob

import nltk
nltk.download('stopwords')

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/
os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'
mallet_path = '/content/mallet-2.0.8/bin/mallet' # you should NOT need to change this 
def preprocess_data(doc_set, extra_stopwords=None):
    # adapted from
    # replace all newlines or multiple sequences of spaces with a standard space
    doc_set = [re.sub(r'\s+', ' ', doc) for doc in doc_set]
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    # add any extra stopwords
    if extra_stopwords:
        en_stop = en_stop.union(extra_stopwords)
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in en_stop]
        # add tokens to list
        texts.append(stopped_tokens)
    return texts

def prepare_corpus(doc_clean):
    # adapted from
    # Create the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
    # Drop terms that appear in fewer than 5 documents or in more than 50% of the documents.
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    # Convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    return dictionary, doc_term_matrix
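To see what these two helpers produce, here is a dependency-free sketch of the same pipeline on two tiny documents: tokenize, drop stop words, then turn each document into (token id, count) pairs. The stop word set and the function names are made up for illustration, the ids are assigned in a simpler way than gensim's Dictionary does, and the filter_extremes step is left out.

```python
import re
from collections import Counter

# Tiny hypothetical stop word list; the article uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "is", "a", "it", "but", "and", "too"}

def tokenize(doc):
    """Lowercase, split on non-word characters, drop stop words."""
    return [t for t in re.findall(r"\w+", doc.lower()) if t not in TOY_STOPWORDS]

def build_vocab(tokenized_docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Represent a document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = ["The size is too small",
        "Small size, but the color is nice and the fit is nice"]
tokenized = [tokenize(d) for d in docs]
vocab = build_vocab(tokenized)
bows = [to_bow(t, vocab) for t in tokenized]
print(vocab)
print(bows)
```

Word order is gone at this point: LDA only sees which words occur in a document and how often.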

Topic modelling with LDA

LDA requires some careful parameter choices to work properly. These seem to be especially relevant:

  • number of topics
  • stop words list
  • alpha parameter, which roughly determines how many topics correspond to a single document
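The effect of alpha can be illustrated with a small simulation. LDA places a Dirichlet prior on each document's topic mixture, and the sketch below samples mixtures from a symmetric Dirichlet using only the standard library. It is a toy illustration under that assumption: it does not run LDA, it says nothing about how Mallet or gensim scale the alpha value you pass in, and the function names are hypothetical.

```python
import random

def sample_topic_mixture(alpha, num_topics, rng):
    """Draw one document-topic distribution from a symmetric Dirichlet(alpha)."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def average_top_share(alpha, num_topics=30, num_docs=500, seed=0):
    """Average weight of the single largest topic across simulated documents."""
    rng = random.Random(seed)
    shares = [max(sample_topic_mixture(alpha, num_topics, rng)) for _ in range(num_docs)]
    return sum(shares) / num_docs

sparse = average_top_share(alpha=0.1)   # small alpha: few topics per document
spread = average_top_share(alpha=5.0)   # large alpha: weight spread over many topics
print(f"alpha=0.1: largest topic holds {sparse:.2f} of a document on average")
print(f"alpha=5.0: largest topic holds {spread:.2f} of a document on average")
```

With a small alpha, most of a simulated document's weight sits on a handful of topics. With a large alpha, the weight is spread almost evenly.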
# texts_lda = [dataset['train'][i]['review_body'] for i in range(10000)]
doc_clean = preprocess_data(texts,{})
dictionary, doc_term_matrix = prepare_corpus(doc_clean)
number_of_topics=30 # adjust this to alter the number of topics
words=10 #adjust this to alter the number of words output for the topic below
ldamallet = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=number_of_topics, id2word=dictionary, alpha=10)
topic_words = []
for i in range(number_of_topics):
  tpc = ldamallet.show_topic(i, topn=7, num_words=None)
  words = [x[0] for x in tpc]
  tw = ' '.join([str(i) + ':'] + words)
  topic_words.append(tw)  # label for topic i: its id plus its top 7 words
topic_words
['0: case small love feels design bit camera',
 '1: perfect started fall heavy weight quickly feet',
 '2: nice day box gift looked purchased shoe',
 '3: 2 3 5 stars 1 4 weeks',
 '4: work bought make fine cut pump job',
 '5: broke side soft beautiful ring long bottom',
 '6: bag product picture package show guess happy',
 '7: hard money working lot worth worked things',
 '8: color light colors white loves lights daughter',
 '9: water plastic open air hold inside difficult',
 '10: size fit wear ordered order comfortable big',
 '11: put easy bought left times piece face',
 '12: arrived nice pieces broken returned completely thin',
 '13: quality made work easily poor fits low',
 '14: book love great pages missing family star',
 '15: great purchase cover screen purchased recommended replace',
 '16: great works recommend lots price smells awesome',
 '17: product bad month disappointed reason sound needed',
 '18: buy review year frame support difficult idea',
 '19: good fit bit brand fine watch screws',
 '20: top set problem short expected people story',
 '21: item return back shipping received disappointed send',
 '22: easy recommend install works clean thick 10',
 '23: fast battery charge wrong 4 year cord',
 '24: back extra front chair makes pull returning',
 '25: phone thing years home stay friend find',
 '26: received hair order ordered amazon seller problems',
 '27: time loved cute super long huge toy',
 '28: good price quality pretty decent expect end',
 '29: cheap material perfect loose buy big 5']
topics_docs = list()
for m in ldamallet[doc_term_matrix[:1000]]:
  topics_docs.append(m)  # m is a list of (topic_id, probability) pairs for one document
x = np.array(topics_docs[:1000])
y = np.delete(x,0,axis=2)
y = y.squeeze()
best_topics = np.argmax(y, axis=1)
topics = list(best_topics)
topics = [topic_words[x] for x in topics]
# up to 20 colors:
# palette = d3['Category20'][number_of_topics]
clrs = random.sample(Turbo256, number_of_topics)
color_map = bmo.CategoricalColorMapper(factors=topic_words, palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts

source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, topic=topics))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ('desc', '@desc'),
    ('topic', '@topic')
])

# newer Bokeh releases use width/height instead of plot_width/plot_height
p = figure(plot_width=1200, plot_height=600, tools=[hover], title="Test")
p.scatter('x', 'y', size=10, source=source,
          fill_color=transform('topic', color_map),
          # legend='topic'
          )
# p.legend.location = "top_left"
# p.legend.click_policy="hide"
bpl.show(p)
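The array manipulation used above to pick one topic per document is a bit terse, so here is the same sequence of NumPy calls on a tiny hand-made input. The probabilities are hypothetical (2 documents, 3 topics), laid out as (topic_id, probability) pairs like the ones collected in the loop above.

```python
import numpy as np

# Hypothetical output for 2 documents and 3 topics.
topics_docs = [
    [(0, 0.2), (1, 0.7), (2, 0.1)],
    [(0, 0.6), (1, 0.1), (2, 0.3)],
]

x = np.array(topics_docs)           # shape (2, 3, 2): docs x topics x (id, probability)
y = np.delete(x, 0, axis=2)         # drop the topic_id column -> shape (2, 3, 1)
y = y.squeeze()                     # shape (2, 3): one probability per topic
best_topics = np.argmax(y, axis=1)  # index of the most probable topic per document
print(x.shape, y.shape, best_topics)  # (2, 3, 2) (2, 3) [1 0]
```

The delete-and-squeeze step is equivalent to slicing with `x[:, :, 1]`. Both rely on every document listing all topics in id order.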