Topic Modeling in Python
You developed a mobile app and want to figure out what your users are talking about in the app reviews. You have thousands of tweets mentioning your product and not enough time to read and digest all of them. Maybe you want to look at your emails from the last 5 years and figure out what you have spent your time on while reading and answering them.
If any of these use cases sounds familiar, you should learn about topic modeling! In this article, I will explore various topic modeling algorithms and approaches. You can also open it in Google Colab and apply it to your own dataset easily!
Install the libraries
To start with, let's install three libraries:
- datasets will allow us to easily grab a bunch of texts to work with
- sentence-transformers will help us create text embeddings (more on that later)
- bokeh will help us with visualization
We will install these libraries and import the functions and classes we will need later on.
!pip install -qq datasets
!pip install -Uqq sentence-transformers
!pip install -qq bokeh
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import sklearn.manifold
import numpy as np
import pandas as pd
import random
random.seed(42)
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, HoverTool, LinearColorMapper
from bokeh.palettes import plasma, d3, Turbo256
from bokeh.plotting import figure
from bokeh.transform import transform
import bokeh.io
bokeh.io.output_notebook()
import bokeh.plotting as bpl
import bokeh.models as bmo
bpl.output_notebook()
dataset = load_dataset('amazon_reviews_multi', 'en')
Let's take a quick look at the data we'll be working with. Our dataset is a dictionary consisting of three parts: train, validation and test. Let's peek into the train set and put it into a pandas DataFrame to see how it's constructed.
dataset.keys()
df = pd.DataFrame(dataset['train'])
df.head()
This is useful - we can see that the dataset consists of a number of attributes. We'll focus on review_body and try to discover topics in those reviews, but the other attributes can help us validate whether we're heading in a good direction. For example, we can compare how our topics correlate with the product_category attribute. Let's peek into the categories just to see what we have in the dataset.
df.product_category.value_counts().plot(kind='bar', figsize=(15,5));
How can we extract meaning from review_body, though? There are many ways, of course. Rather than going bottom-up from simple techniques such as keywords, n-grams, tf-idf etc., let's jump straight into the concept of embeddings.
Embeddings
A key idea for machine learning is that of representations. Most algorithms can only work with numbers, so whatever we're dealing with - words, texts, images - we should represent with numbers. We are focusing on texts here. Texts can represent many different things, so we also need many numbers - let's say 768 - for each text. We'll put these 768 numbers into a vector and use it to represent the text. These vectors are called embeddings.
For the purpose of this article, we will not worry about where these embeddings come from, other than the fact that we can produce them with the sentence-transformers library. We will load a pretrained model (DistilBERT) and use it to encode our texts.
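To make the idea a bit more tangible, here is a minimal sketch of how vectors can capture "closeness" between texts. The 4-number vectors below are made up by hand purely for illustration (real embeddings come from a model and have hundreds of values), and cosine similarity is just one common way to compare them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written "embeddings" (not produced by any model).
toy_embeddings = {
    "shoes run small": [0.9, 0.1, 0.0, 0.2],
    "size is too tight": [0.8, 0.2, 0.1, 0.1],
    "battery died fast": [0.1, 0.9, 0.3, 0.0],
}

sim_close = cosine_similarity(
    toy_embeddings["shoes run small"], toy_embeddings["size is too tight"]
)
sim_far = cosine_similarity(
    toy_embeddings["shoes run small"], toy_embeddings["battery died fast"]
)

print(f"similar meaning:   {sim_close:.2f}")
print(f"different meaning: {sim_far:.2f}")
```

The two size-related texts score much closer to 1 than the size/battery pair, which is the property we rely on later when we look for clusters.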
Dimensionality Reduction
768 numbers for each text is actually less meaningful to a normal person than the text itself, so how does this help? We can use some magic to reduce these 768 numbers to 2. This magic is called t-SNE, and it's one of several dimensionality reduction techniques (others include PCA and UMAP). It tries to preserve the relative positions of points in a multidimensional space while mapping them to fewer dimensions. With 2 dimensions, we can actually plot these points (texts) on a chart! Let's do it!
Oh, we have 20,000 texts, so our chart can get really cluttered... Let's take a sample of 1,000 texts and use it instead.
model = SentenceTransformer('stsb-distilbert-base')
sample = df.sample(n=1000, random_state=42)
texts = sample.review_body.values.tolist()
categories = sample.product_category.values.tolist()
embeddings = model.encode(texts)
out = sklearn.manifold.TSNE(n_components=2).fit_transform(embeddings)
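For intuition about what a reduction from 768 numbers to 2 looks like, here is a small sketch using PCA instead of t-SNE, simply because PCA fits in a few lines of NumPy. The random vectors are a synthetic stand-in for real embeddings, and `pca_2d` is a hypothetical helper name, not part of any library used in this article.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their two directions of largest variance."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(42)
# Synthetic stand-in for sentence embeddings: 100 "texts", 768 numbers each.
fake_embeddings = rng.normal(size=(100, 768))

points = pca_2d(fake_embeddings)
print(points.shape)  # one (x, y) pair per text
```

PCA is a linear projection, while t-SNE is non-linear and focuses on keeping neighbors close, so the two will generally produce different-looking charts.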
Visualization with bokeh
Bokeh is a nice tool that allows us to create interactive charts. We'll use it to create a scatter plot where each text is placed according to its two t-SNE coordinates. Additionally, we'll color each dot to indicate which category it comes from. We can hover over the chart and see the text/category associated with each dot.
clrs = random.sample(Turbo256, len(set(categories)), )
color_map = bmo.CategoricalColorMapper(factors=list(set(categories)), palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts
source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, cat=categories))
hover = HoverTool(tooltips=[
("index", "$index"),
("(x,y)", "(@x, @y)"),
('desc', '@desc'),
('cat', '@cat')
])
p = figure(width=1200, height=600, tools=[hover], title="First Look at the Data")
p.circle('x', 'y', size=10, source=source, fill_color=transform('cat', color_map),)
bpl.show(p)
Looks interesting! If you hover over the distinct clusters on the chart, you should be able to recognize common topics. Some of these topics are related to a single category, some of them are shared across categories. What topics can you find in the chart?
Looking at the chart above, we can get a sense for some of the topics in our corpus, but it doesn't solve our problem yet. It would require lots of time to review the chart in detail, find clusters, and label them. How can we automate this process?
BERTopic is one of the methods to achieve that. It relies on sentence embeddings, dimensionality reduction, and clustering algorithms to produce clusters of documents (topics). Let's see if we can get some good insights with this approach.
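Before running the library, it may help to see the general shape of such a pipeline: embed, reduce, cluster. The sketch below is only an illustration of that idea on synthetic data. PCA and a bare-bones k-means are swapped in to keep it dependency-light; they are not what BERTopic uses internally, and the `kmeans` helper is a hypothetical name of my own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for embeddings: two groups of 20 "documents", 50 numbers each.
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 50))
group_b = rng.normal(loc=3.0, scale=0.5, size=(20, 50))
embeddings = np.vstack([group_a, group_b])

# Step 1: dimensionality reduction (PCA via SVD in this sketch).
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:2].T

# Step 2: clustering (plain k-means in this sketch).
def kmeans(points, k, n_iter=20):
    # Deterministic start: the first point, then repeatedly the point
    # farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        dists = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (keep it if it has none).
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels

labels = kmeans(reduced, k=2)
print(labels)
```

Each cluster label plays the role of a "topic" here. A real topic model then has to describe each cluster with words, which is the part we look at next.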
!pip install bertopic -qq
from bertopic import BERTopic
model = BERTopic(language="english")
topics, probs = model.fit_transform(texts)
len(topics), len(set(topics))
We've run the algorithm on our sample of 1,000 texts, and it identified 16 topics in this corpus. Let's see if we can learn something more about those topics!
model.get_topic_freq().head(15)
Wow, there are quite a lot of outliers here, represented by topic -1 - almost half of the dataset! Let's take a look at one of the topics from this dataset.
model.get_topic(1)
What we typically get with topic modeling is a list of keywords associated with each topic. In the case above, we can see keywords associated with sizes: size, fit, larger. Let's take a look at some texts associated with this topic to confirm our intuition.
ex_ind = [i for i, x in enumerate(topics) if x == 1]
ex_txt = [x for i, x in enumerate(texts) if i in ex_ind]
for t in ex_txt[:10]: print(t)
Indeed, most of these texts talk about sizes! Looks like the model is onto something!
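The outlier share mentioned earlier is easy to check directly from the list of topic assignments. Here is a minimal sketch; `example_topics` is a hypothetical, hand-written list standing in for the real `topics` output, so the numbers are not the ones from our run.

```python
from collections import Counter

# Hypothetical topic assignments; -1 marks an outlier.
example_topics = [-1, 0, 1, -1, 2, 0, -1, 1, -1, 0]

counts = Counter(example_topics)
outlier_share = counts[-1] / len(example_topics)
# Topic -1 is not a real topic, so leave it out of the count.
n_real_topics = len(counts) - (1 if -1 in counts else 0)

print(f"outliers: {outlier_share:.0%}, real topics: {n_real_topics}")
```

Swapping `example_topics` for the actual `topics` list should give the share for the sample used in this article.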
What if we overlay the topics discovered here with our initial scatter plot? Let's try it! Now, instead of categories, we will color the dots according to the topic assigned by the BERTopic algorithm.
topic_words = ['-1: outlier']
for i in range(len(set(topics))-1):
    tpc = model.get_topic(i)[:7]
    words = [x[0] for x in tpc]
    tw = ' '.join([str(i) + ':'] + words)
    topic_words.append(tw)
exp_topics = [topic_words[x+1] for x in topics]
clrs = random.sample(Turbo256, len(set(topics)))
color_map = bmo.CategoricalColorMapper(factors=topic_words, palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts
source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, topic=exp_topics))
hover = HoverTool(tooltips=[
("index", "$index"),
('desc', '@desc'),
('topic', '@topic')
])
p = figure(width=1200, height=600, tools=[hover], title="Test")
p.circle('x', 'y', size=10, source=source,
fill_color=transform('topic', color_map),
# legend='topic'
)
# p.legend.location = "top_left"
# p.legend.click_policy="hide"
bpl.show(p)
In this visual, the topics are clustered together, which makes sense because the methods for creating the visual and the topics are consistent. Interestingly, when looking at clusters of outliers that are located near each other in the chart, we can see common themes - I wonder why these were tagged as outliers?
Let's now turn to a classic approach - LDA, Latent Dirichlet Allocation. We will not review the theory or the inner workings of this algorithm here. The key difference vs. BERTopic is that each text (document) is considered to be a composition of topics. We don't cluster documents into topics, but instead discover abstract topics that are represented in a document corpus. For each document, we get the probability distribution over these topics.
Let's imagine we have discovered three topics: sports, data science, competition.
A document that is about a data science competition might have the following distribution: sports: 0.05, data science: 0.5, competition: 0.45.
A document that talks about the world championship in cricket might have the following distribution instead: sports: 0.54, data science: 0.01, competition: 0.45.
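The same idea in code: each document carries a full probability distribution over topics rather than a single label. This is just a sketch with illustrative numbers, chosen so that each distribution sums to 1; the document names are made up.

```python
# Illustrative document-topic distributions (not produced by a model).
doc_topic = {
    "data science competition": {
        "sports": 0.05, "data science": 0.50, "competition": 0.45,
    },
    "cricket world championship": {
        "sports": 0.54, "data science": 0.01, "competition": 0.45,
    },
}

dominant = {}
for name, dist in doc_topic.items():
    # A valid distribution has to add up to 1.
    assert abs(sum(dist.values()) - 1.0) < 1e-9
    # If we need a single label, we can still pick the most probable topic.
    dominant[name] = max(dist, key=dist.get)

print(dominant)
```

Note that both documents give "competition" the same weight, something a hard one-topic-per-document clustering could not express.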
There seem to be many implementations of the LDA algorithm, and some of them produce significantly worse results than others. The Mallet implementation seems to be considered one of the best, so we will use it here.
To speed things up, I will use the first 10,000 reviews for topic modeling. I will only display 1,000 reviews in the t-SNE chart.
!pip install -Uqq gensim==3.8.3
import os #importing os to set environment variable
def install_java():
    !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null #install openjdk
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" #set environment variable
    !java -version #check java version
install_java()
!wget -q http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip -qq mallet-2.0.8.zip
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models.wrappers import LdaMallet
from gensim.models.coherencemodel import CoherenceModel
from gensim import similarities
import os.path
import re
import glob
import nltk
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'
mallet_path = '/content/mallet-2.0.8/bin/mallet' # you should NOT need to change this
def preprocess_data(doc_set, extra_stopwords=None):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    # replace all newlines or multiple sequences of spaces with a standard space
    doc_set = [re.sub(r'\s+', ' ', doc) for doc in doc_set]
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    # add any extra stopwords
    if extra_stopwords:
        en_stop = en_stop.union(extra_stopwords)
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in en_stop]
        # add tokens to list
        texts.append(stopped_tokens)
    return texts
def prepare_corpus(doc_clean):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    # Create the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
    # Drop terms that appear in fewer than 5 documents or in more than 50% of documents.
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    # Convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    return dictionary, doc_term_matrix
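If the dictionary and document-term matrix feel abstract, the sketch below mimics the idea in plain Python on three tiny, made-up documents. It is a stand-in, not gensim's implementation: gensim may assign ids in a different order, and the thresholds here are lowered (2 documents, 90%) only because the toy corpus is so small.

```python
from collections import Counter

# Hypothetical pre-tokenized documents.
docs = [
    ["great", "fit", "size"],
    ["size", "small", "size"],
    ["battery", "great"],
]

# Assign an integer id to every unique token, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))
# Keep tokens that are neither too rare nor too common.
kept = {
    token for token, df in doc_freq.items()
    if df >= 2 and df / len(docs) <= 0.9
}

print(bow_corpus)
print(sorted(kept))
```

Each document becomes a short list of (token id, count) pairs, which is the bag-of-words form that LDA works with. Word order is discarded at this point.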
LDA requires some careful parameter choices to work properly. These seem to be especially relevant:
- number of topics
- stop words list
- alpha parameter, which roughly determines how many topics correspond to a single document
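To get a feel for the alpha parameter, we can sample document-topic mixtures from a symmetric Dirichlet distribution with NumPy and see how concentrated they are. This is a sketch under assumptions: the alpha values below are per-topic concentrations picked for illustration, `mean_top_share` is a hypothetical helper, and how Mallet scales its own alpha argument is not covered here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_topics = 30

def mean_top_share(alpha, n_docs=500):
    """Average weight of the single strongest topic across sampled documents."""
    samples = rng.dirichlet([alpha] * n_topics, size=n_docs)
    return samples.max(axis=1).mean()

low_alpha_share = mean_top_share(0.1)   # small alpha: few topics per document
high_alpha_share = mean_top_share(5.0)  # large alpha: weight spread over many topics

print(f"alpha=0.1 -> strongest topic holds {low_alpha_share:.2f} on average")
print(f"alpha=5.0 -> strongest topic holds {high_alpha_share:.2f} on average")
```

With a small alpha, one topic tends to dominate each document. With a large alpha, documents look like an even blend of many topics.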
# texts_lda = [dataset['train'][i]['review_body'] for i in range(10000)]
doc_clean = preprocess_data(texts,{})
dictionary, doc_term_matrix = prepare_corpus(doc_clean)
number_of_topics=30 # adjust this to alter the number of topics
words=10 #adjust this to alter the number of words output for the topic below
ldamallet = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=number_of_topics, id2word=dictionary, alpha=10)
topic_words = ldamallet.show_topics(num_topics=number_of_topics,num_words=5)
topic_words = [x[1] for x in topic_words]
topic_words = []
for i in range(number_of_topics):
    tpc = ldamallet.show_topic(i, topn=7, num_words=None)
    words = [x[0] for x in tpc]
    tw = ' '.join([str(i) + ':'] + words)
    topic_words.append(tw)
topic_words
topics_docs = list()
for m in ldamallet[doc_term_matrix[:1000]]:
    topics_docs.append(m)
x = np.array(topics_docs[:1000])
y = np.delete(x,0,axis=2)
y = y.squeeze()
best_topics = np.argmax(y, axis=1)
topics = list(best_topics)
topics = [topic_words[x] for x in topics]
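The array juggling above is easier to follow on a tiny example. The sketch below assumes, as the code above does, that every document comes back as a list of (topic_id, probability) pairs covering all topics in id order; the numbers are made up.

```python
import numpy as np

# Hypothetical model output for 2 documents and 3 topics.
toy_topics_docs = [
    [(0, 0.2), (1, 0.7), (2, 0.1)],
    [(0, 0.6), (1, 0.3), (2, 0.1)],
]

x = np.array(toy_topics_docs)     # shape (2, 3, 2): docs x topics x (id, prob)
probs = np.delete(x, 0, axis=2)   # drop the topic-id column -> shape (2, 3, 1)
probs = probs.squeeze(axis=2)     # shape (2, 3): one row of probabilities per doc
toy_best_topics = probs.argmax(axis=1)

print(x.shape, probs.shape)
print(toy_best_topics)
```

Taking the argmax turns each distribution into a single label, which is convenient for coloring dots but throws away the "mixture of topics" information that makes LDA different from clustering.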
# up to 20 colors:
# palette = d3['Category20'][number_of_topics]
clrs = random.sample(Turbo256, number_of_topics)
color_map = bmo.CategoricalColorMapper(factors=topic_words, palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts
source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, topic=topics))
hover = HoverTool(tooltips=[
("index", "$index"),
('desc', '@desc'),
('topic', '@topic')
])
p = figure(width=1200, height=600, tools=[hover], title="Test")
p.circle('x', 'y', size=10, source=source,
fill_color=transform('topic', color_map),
# legend='topic'
)
# p.legend.location = "top_left"
# p.legend.click_policy="hide"
bpl.show(p)