!pip install -qq datasets
!pip install -Uqq sentence-transformers
!pip install -qq bokeh
Topic Models Introduction
Published on May 27, 2021
You developed a mobile app and want to figure out what your users are talking about in the app reviews. You have thousands of tweets mentioning your product and not enough time to read and digest all of them. Maybe you want to look at your emails from the last 5 years and figure out what you have spent your time on while reading and answering them.
If any of these use cases sound familiar, you should learn about topic modeling! In this article, I will explore various topic modeling algorithms and approaches. You can also open it in Google Colab and easily apply it to your own dataset!
Install the libraries
To start with, let’s install three libraries:
- datasets will allow us to easily grab a bunch of texts to work with
- sentence-transformers will help us create text embeddings (more on that later)
- bokeh will help us with visualization
We will install these libraries and import the functions and classes we will need later on.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import sklearn.manifold
import numpy as np
import pandas as pd
import random
random.seed(42)
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, HoverTool, LinearColorMapper
from bokeh.palettes import plasma, d3, Turbo256
from bokeh.plotting import figure
from bokeh.transform import transform
import bokeh.io
bokeh.io.output_notebook()
import bokeh.plotting as bpl
import bokeh.models as bmo
bpl.output_notebook()
Grab the data
Topic modeling requires a bunch of texts. We don’t need any labels! Let’s grab an English subset of the public Amazon reviews dataset and test if we can get practical insights on the topics and themes represented in those reviews.
dataset = load_dataset('amazon_reviews_multi', 'en')
Reusing dataset amazon_reviews_multi (/root/.cache/huggingface/datasets/amazon_reviews_multi/en/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609)
First Look at the Data
Let’s take a quick look at the data we’ll be working with. Our dataset is a dictionary consisting of three parts: train, validation and test. Let’s peek into the train set and put it into a pandas DataFrame to see how it’s constructed.
dataset.keys()
dict_keys(['train', 'validation', 'test'])
df = pd.DataFrame(dataset['train'])
df.head()
| | language | product_category | product_id | review_body | review_id | review_title | reviewer_id | stars |
|---|---|---|---|---|---|---|---|---|
0 | en | furniture | product_en_0740675 | Arrived broken. Manufacturer defect. Two of th... | en_0964290 | I'll spend twice the amount of time boxing up ... | reviewer_en_0342986 | 1 |
1 | en | home_improvement | product_en_0440378 | the cabinet dot were all detached from backing... | en_0690095 | Not use able | reviewer_en_0133349 | 1 |
2 | en | home | product_en_0399702 | I received my first order of this product and ... | en_0311558 | The product is junk. | reviewer_en_0152034 | 1 |
3 | en | wireless | product_en_0444063 | This product is a piece of shit. Do not buy. D... | en_0044972 | Fucking waste of money | reviewer_en_0656967 | 1 |
4 | en | pc | product_en_0139353 | went through 3 in one day doesn't fit correct ... | en_0784379 | bubble | reviewer_en_0757638 | 1 |
This is useful - we can see that the dataset consists of a number of attributes. We’ll focus on the review_body attribute and try to discover topics in those reviews, but the other attributes can help us validate whether we’re heading in a good direction. For example, we can compare how our topics correlate with the product_category attribute. Let’s peek into the categories just to see what we have in the dataset.
df.product_category.value_counts().plot(kind='bar', figsize=(15,5));
How can we extract meaning from the review_body, though? There are many ways, of course. Rather than going bottom-up from simple techniques such as keywords, n-grams, or tf-idf, let’s jump straight into the concept of embeddings.
Embeddings
A key idea in machine learning is that of representations. Most algorithms can only work with numbers, so whatever we’re dealing with - words, texts, images - has to be represented with numbers. We are focusing on texts here. Texts can express many different things, so we need many numbers - let’s say 768 - for each text. We’ll put these 768 numbers into a vector and use it to represent the text. These vectors are called embeddings.
For the purposes of this article, we will not worry about where these embeddings come from, other than the fact that we can produce them with the sentence-transformers library. We will load a pretrained model (DistilBERT) and use it to encode our texts.
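To make the idea of "meaning as numbers" a bit more tangible, here is a minimal sketch with made-up 4-dimensional vectors standing in for real 768-dimensional embeddings. The vectors and the `cosine_similarity` helper are illustrative assumptions, not output of the model used in this article. The point is only that texts with similar meaning should get vectors that point in similar directions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones would have 768 numbers each.
toy_embeddings = {
    "arrived broken": [0.9, 0.1, 0.0, 0.2],
    "came damaged": [0.8, 0.2, 0.1, 0.3],
    "fits perfectly": [0.1, 0.9, 0.3, 0.0],
}

sim_close = cosine_similarity(toy_embeddings["arrived broken"], toy_embeddings["came damaged"])
sim_far = cosine_similarity(toy_embeddings["arrived broken"], toy_embeddings["fits perfectly"])

print(f"broken vs damaged: {sim_close:.2f}")
print(f"broken vs fits:    {sim_far:.2f}")
```

The two complaints about damage end up much closer to each other than either is to the text about fit, which is the property the later steps rely on.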
Dimensionality Reduction
768 numbers per text are actually less meaningful to a person than the text itself, so how does this help? We can use some magic to reduce these 768 numbers to 2. This magic is called t-SNE, and it’s one of several dimensionality reduction techniques (others include PCA and UMAP). It tries to keep points that are close together in the high-dimensional space close together after mapping them to fewer dimensions. With 2 dimensions, we can actually plot these points (texts) on a chart! Let’s do it!
Oh, we have 200,000 texts, so our chart can get really cluttered… Let’s take a sample of 1,000 texts and use it instead.
model = SentenceTransformer('stsb-distilbert-base')
sample = df.sample(n=1000, random_state=42)
texts = sample.review_body.values.tolist()
categories = sample.product_category.values.tolist()
embeddings = model.encode(texts)
out = sklearn.manifold.TSNE(n_components=2).fit_transform(embeddings)
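If dimensionality reduction still feels like magic, the following toy sketch may help. Note that it uses PCA (one of the alternatives mentioned earlier), not t-SNE, simply because PCA fits into a few lines of plain Python. The data points and the helper names `top_components` and `project` are made up for illustration. It maps six 5-dimensional points, forming two groups, down to 2 coordinates each.

```python
import random

def top_components(rows, k=2, iters=200, seed=0):
    """Return the mean and the k leading principal directions (power iteration)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            # Remove directions that were already found (deflation).
            for c in components:
                proj = sum(a * b for a, b in zip(v, c))
                v = [a - proj * b for a, b in zip(v, c)]
            # Multiply by the (unnormalized) covariance matrix: X^T (X v).
            xv = [sum(a * b for a, b in zip(row, v)) for row in centered]
            v = [sum(centered[i][j] * xv[i] for i in range(n)) for j in range(d)]
            norm = sum(a * a for a in v) ** 0.5
            v = [a / norm for a in v]
        components.append(v)
    return mean, components

def project(rows, mean, components):
    """Express each row by its coordinates along the given directions."""
    return [[sum((r[j] - mean[j]) * c[j] for j in range(len(mean))) for c in components]
            for r in rows]

# Hypothetical 5-dimensional points: the first three and the last three form two groups.
toy_points = [
    [1.0, 0.9, 0.1, 0.0, 0.2], [0.9, 1.0, 0.0, 0.1, 0.1], [1.1, 1.0, 0.1, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9, 0.1], [0.1, 0.0, 0.9, 1.0, 0.2], [0.0, 0.1, 1.1, 1.0, 0.0],
]

mean, components = top_components(toy_points, k=2)
coords = project(toy_points, mean, components)
for point in coords:
    print([round(x, 2) for x in point])
```

The two groups land on opposite sides of the first axis, so the structure that existed in 5 dimensions is still visible in 2. t-SNE pursues the same goal with a different, non-linear method.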
Visualization with bokeh
Bokeh is a nice tool that allows us to create interactive charts. We’ll use it to create a scatter plot where each text is placed according to its two t-SNE coordinates, so texts with similar meaning should end up close together. Additionally, we’ll color each dot to indicate which category it comes from. We can hover over the chart and see the text/category associated with each dot.
clrs = random.sample(Turbo256, len(set(categories)), )
color_map = bmo.CategoricalColorMapper(factors=list(set(categories)), palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts

source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, cat=categories))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("(x,y)", "(@x, @y)"),
    ('desc', '@desc'),
    ('cat', '@cat')
])
# note: Bokeh 3.x uses width/height instead of plot_width/plot_height
p = figure(plot_width=1200, plot_height=600, tools=[hover], title="First Look at the Data")
p.circle('x', 'y', size=10, source=source, fill_color=transform('cat', color_map),)
bpl.show(p)
#hide_input
# this cell will fail if you run it in colab, it's a workaround to display the chart in the blog
from bokeh.resources import CDN
from bokeh.embed import file_html
from IPython.display import display, HTML
html = file_html(p, CDN, "First Look at the Data")
HTML(html)
Looks interesting! If you hover over the distinct clusters on the chart, you should be able to recognize common topics. Some of these topics are related to a single category, some of them are shared across categories. What topics can you find in the chart?
Discovering Topics with BERTopic
Looking at the chart above, we can get a sense for some of the topics in our corpus, but it doesn’t solve our problem yet. It would require lots of time to review the chart in detail, find clusters, and label them. How can we automate this process?
BERTopic is one of the methods to achieve that. It relies on sentence embeddings, dimensionality reduction, and clustering algorithms to produce clusters of documents (topics). Let’s see if we can get some good insights with this approach.
!pip install bertopic -qq
from bertopic import BERTopic
model = BERTopic(language="english")
topics, probs = model.fit_transform(texts)
len(topics), len(set(topics))
(1000, 16)
We’ve run the algorithm on our sample of 1,000 texts, and it assigned 16 distinct labels: 15 topics plus the outlier label -1. Let’s see if we can learn something more about those topics!
model.get_topic_freq().head(15)
| | Topic | Count |
|---|---|---|
0 | -1 | 451 |
1 | 11 | 104 |
2 | 9 | 97 |
3 | 14 | 61 |
4 | 10 | 43 |
5 | 1 | 42 |
6 | 7 | 28 |
7 | 8 | 28 |
8 | 0 | 26 |
9 | 13 | 22 |
10 | 4 | 21 |
11 | 5 | 19 |
12 | 2 | 16 |
13 | 3 | 15 |
14 | 6 | 15 |
Wow, there are quite a lot of outliers here, represented by topic -1: almost half of the dataset! Let’s take a look at one of the actual topics.
model.get_topic(1)
[('size', 0.05050109336233438),
('fit', 0.026139678912211962),
('could', 0.025590393661103304),
('top', 0.025448458979747752),
('ordered', 0.02355394098054413),
('dress', 0.022519132135764744),
('larger', 0.020384763234235534),
('zipper', 0.019640993684217505),
('too', 0.01934745458460365),
('all', 0.019074915582195817)]
What we typically get with topic modeling is a list of keywords associated with each topic. In the case above, we can see keywords associated with sizes: size, fit, larger. Let’s take a look at some texts associated with this topic to confirm our intuition.
ex_ind = [i for i, x in enumerate(topics) if x == 1]
ex_txt = [x for i, x in enumerate(texts) if i in ex_ind]
for t in ex_txt[:10]:
    print(t)
Really cute mug. I would have given 5 stars if it were a bit bigger.
Not the size I hoped for but that could be partly my fault. It did come in a very nice gift bag with the brand name on it but I just wish that it was a bead or two larger. Otherwise this is a great gift for someone with a petite wrist.
Its o.k. but not as thick as another brand I previously used. I think the other brand lasted longer in my hair for the day.
I wish I could give 5 stars. As far as the glasses go, I absolutely love them. But three glasses arrived completely shattered
The size was off, I usually wear a lrg. or x-lrg. But this was snug I wanted to order larger but was sold out.
The top was a bit tight and I'm a 36 B. I got a medium. I prob would still wear top but underboob is inevitable since the straps are not adjustable. Otherwise the top was cute. Bottoms fit weird and where the strappy parts are on each side the inner lining (tan/white material) showed no matter what and looked super odd. Not cute at all. Maybe I am just too wide for them. I have a 26" waist. Def for SHORT PETITE people.
I really want to give this suit a 5 star but I can’t. The appearance is beautiful and I love the color. But sadly the top is to big. I followed the sizing chart for around the bust size. It all fits there but the cup size in a xxl looks as if it is a triple d or a double d. I am a larger girl being 249 but my chest is smaller. Would love to exchange sizes but cant find anywhere to message sender.
I ordered a size up because my butt is larger than the rest of me, and like every other pair of jeans/shorts I buy, the waist is too big. You can see my underwear in these if I don’t have something underneath. They are good quality though.
Love this dress, I probably should order a smaller size since it is a bit loose in the top and very long on me.
The waist is too high and the bottom too long. I could get away with it but I like my leggings to be be fitted. I might have them altered or I send them back. Not sure yet. Fabric is on the thin size but not see through. Expected for the price. I am 5.2 so I would recommend for taller people! It adjusts well to my size which I am small/medium legging size. Perhaps they could create a petite size!
Indeed, most of these texts talk about sizes! Looks like the model is onto something!
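Reading examples for a single topic is handy, but it is often convenient to group all texts by their assigned topic in one go. Here is a minimal sketch of that idea. The `toy_texts` and `toy_topics` lists are hypothetical stand-ins for the `texts` and `topics` lists above, and -1 again marks an outlier.

```python
from collections import defaultdict

# Hypothetical stand-ins for the `texts` and `topics` lists used in this article.
toy_texts = ["too small", "arrived broken", "runs large", "great book", "box was crushed", "ok"]
toy_topics = [1, 0, 1, 2, 0, -1]

# Collect the texts that belong to each topic label.
texts_by_topic = defaultdict(list)
for text, topic in zip(toy_texts, toy_topics):
    texts_by_topic[topic].append(text)

# Share of documents that were not assigned to any topic.
outlier_share = len(texts_by_topic[-1]) / len(toy_texts)

for topic in sorted(texts_by_topic):
    print(topic, texts_by_topic[topic])
print(f"outlier share: {outlier_share:.0%}")
```

Applied to the real run above, the same calculation gives 451 outliers out of 1,000 texts, or roughly 45%.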
What if we overlay the topics discovered here with our initial scatter plot? Let’s try it! Now, instead of categories, we will color the dots according to the topic assigned by the BERTopic algorithm.
topic_words = ['-1: outlier']
for i in range(len(set(topics))-1):
    tpc = model.get_topic(i)[:7]
    words = [x[0] for x in tpc]
    tw = ' '.join([str(i) + ':'] + words)
    topic_words.append(tw)
exp_topics = [topic_words[x+1] for x in topics]
clrs = random.sample(Turbo256, len(set(topics)))
color_map = bmo.CategoricalColorMapper(factors=topic_words, palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts

source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, topic=exp_topics))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ('desc', '@desc'),
    ('topic', '@topic')
])
p = figure(plot_width=1200, plot_height=600, tools=[hover], title="Test")
p.circle('x', 'y', size=10, source=source,
         fill_color=transform('topic', color_map),
         # legend='topic'
         )
# p.legend.location = "top_left"
# p.legend.click_policy="hide"
bpl.show(p)
#hide_input
# this cell will fail if you run it in colab, it's a workaround to display the chart in the blog
html = file_html(p, CDN, "First Look at the Data")
HTML(html)
In this visual, texts with the same topic are clustered together, which makes sense, because the chart and the topics are produced in a similar way (sentence embeddings followed by dimensionality reduction). Interestingly, when looking at clusters of outliers that are located near each other in the chart, we can see a common theme - I wonder why these were tagged as outliers?
LDA with Mallet
Let’s now turn to a classic approach - LDA, Latent Dirichlet Allocation. We will not review the theory or the inner workings of this algorithm here. The key difference vs. BERTopic is that each text (document) is considered to be a composition of topics. We don’t cluster documents into topics, but instead discover abstract topics that are represented in a document corpus. For each document, we get the probability distribution over these topics.
Let’s imagine we have discovered three topics: sports, data science, competition.
A document that is about a data science competition might have the following distribution: sports: 0.05, data science: 0.5, competition: 0.45.
A document that talks about the world championship in cricket might have the following distribution instead: sports: 0.54, data science: 0.01, competition: 0.45.
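The same idea in code: a minimal sketch that stores each document as a mapping from topic to probability, checks that the probabilities add up to 1, and picks the dominant topic. The document names and the `dominant_topic` helper are made up for illustration, with the numbers chosen so that each distribution sums to 1.

```python
# Two hypothetical documents, each described by a distribution over three topics.
doc_topics = {
    "data science competition": {"sports": 0.05, "data science": 0.50, "competition": 0.45},
    "cricket world championship": {"sports": 0.54, "data science": 0.01, "competition": 0.45},
}

def dominant_topic(distribution):
    """Return the topic with the highest probability."""
    return max(distribution, key=distribution.get)

for name, dist in doc_topics.items():
    # A probability distribution over topics must sum to 1.
    assert abs(sum(dist.values()) - 1.0) < 1e-9
    print(name, "->", dominant_topic(dist))
```

Note that both documents carry a substantial share of the competition topic even though it is not the dominant one for either. That is exactly the information a hard cluster assignment would throw away.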
There are many implementations of the LDA algorithm, and some of them seem to produce significantly worse results than others. The Mallet implementation appears to be considered one of the best, so we will use it here.
To speed things up, I will use the first 10,000 reviews for topic modeling. I will only display 1,000 reviews in the t-SNE chart.
Imports and installation
!pip install -Uqq gensim==3.8.3
import os #importing os to set environment variable
def install_java():
    !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null #install openjdk
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" #set environment variable
    !java -version #check java version
install_java()
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)
!wget -q http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip -qq mallet-2.0.8.zip
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models.wrappers import LdaMallet
from gensim.models.coherencemodel import CoherenceModel
from gensim import similarities
import os.path
import re
import glob
import nltk
nltk.download('stopwords')

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'
mallet_path = '/content/mallet-2.0.8/bin/mallet' # you should NOT need to change this
def preprocess_data(doc_set,extra_stopwords = {}):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    # replace all newlines or multiple sequences of spaces with a standard space
    doc_set = [re.sub(r'\s+', ' ', doc) for doc in doc_set]
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    # add any extra stopwords
    if (len(extra_stopwords) > 0):
        en_stop = en_stop.union(extra_stopwords)

    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [t for t in tokens if t not in en_stop]
        # add tokens to list
        texts.append(stopped_tokens)
    return texts
def prepare_corpus(doc_clean):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    # Creating the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
    # Drop terms that appear in fewer than 5 documents or in more than 50% of documents.
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    return dictionary,doc_term_matrix
Topic modelling with LDA
LDA requires some careful parameter choices to work properly. These seem to be especially relevant:
- number of topics
- stop words list
- alpha parameter, which roughly determines how many topics correspond to a single document
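To get a feel for what alpha does, here is a small sketch that samples document-topic distributions from a symmetric Dirichlet prior, which is the kind of prior LDA places on them. This is only an illustration in plain Python: the helper names are made up, and the way a particular implementation parameterizes or scales alpha may differ from the per-topic value used here.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic distribution from a symmetric Dirichlet(alpha)."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_top_share(alpha, num_topics=5, n_docs=500, seed=42):
    """Average probability of the largest topic across sampled documents."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(n_docs)]
    return sum(shares) / n_docs

low_alpha_share = mean_top_share(alpha=0.1)
high_alpha_share = mean_top_share(alpha=10.0)

print(f"alpha=0.1  -> largest topic takes {low_alpha_share:.2f} of a document on average")
print(f"alpha=10.0 -> largest topic takes {high_alpha_share:.2f} of a document on average")
```

With a small alpha, a single topic tends to dominate each document. With a large alpha, the probability mass is spread more evenly across topics.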
# texts_lda = [dataset['train'][i]['review_body'] for i in range(10000)]
doc_clean = preprocess_data(texts,{})
dictionary, doc_term_matrix = prepare_corpus(doc_clean)
number_of_topics=30 # adjust this to alter the number of topics
words=10 #adjust this to alter the number of words output for the topic below
ldamallet = LdaMallet(mallet_path, corpus=doc_term_matrix, num_topics=number_of_topics, id2word=dictionary, alpha=10)
topic_words = ldamallet.show_topics(num_topics=number_of_topics,num_words=5)
topic_words = [x[1] for x in topic_words]
topic_words = []
for i in range(number_of_topics):
    tpc = ldamallet.show_topic(i, topn=7, num_words=None)
    words = [x[0] for x in tpc]
    tw = ' '.join([str(i) + ':'] + words)
    topic_words.append(tw)
topic_words
['0: case small love feels design bit camera',
'1: perfect started fall heavy weight quickly feet',
'2: nice day box gift looked purchased shoe',
'3: 2 3 5 stars 1 4 weeks',
'4: work bought make fine cut pump job',
'5: broke side soft beautiful ring long bottom',
'6: bag product picture package show guess happy',
'7: hard money working lot worth worked things',
'8: color light colors white loves lights daughter',
'9: water plastic open air hold inside difficult',
'10: size fit wear ordered order comfortable big',
'11: put easy bought left times piece face',
'12: arrived nice pieces broken returned completely thin',
'13: quality made work easily poor fits low',
'14: book love great pages missing family star',
'15: great purchase cover screen purchased recommended replace',
'16: great works recommend lots price smells awesome',
'17: product bad month disappointed reason sound needed',
'18: buy review year frame support difficult idea',
'19: good fit bit brand fine watch screws',
'20: top set problem short expected people story',
'21: item return back shipping received disappointed send',
'22: easy recommend install works clean thick 10',
'23: fast battery charge wrong 4 year cord',
'24: back extra front chair makes pull returning',
'25: phone thing years home stay friend find',
'26: received hair order ordered amazon seller problems',
'27: time loved cute super long huge toy',
'28: good price quality pretty decent expect end',
'29: cheap material perfect loose buy big 5']
topics_docs = list()
for m in ldamallet[doc_term_matrix[:1000]]:
    topics_docs.append(m)
x = np.array(topics_docs[:1000])
y = np.delete(x,0,axis=2)
y = y.squeeze()
best_topics = np.argmax(y, axis=1)
topics = list(best_topics)
topics = [topic_words[x] for x in topics]
# up to 20 colors:
# palette = d3['Category20'][number_of_topics]
clrs = random.sample(Turbo256, number_of_topics)
color_map = bmo.CategoricalColorMapper(factors=topic_words, palette=clrs)
list_x = out[:,0]
list_y = out[:,1]
desc = texts

source = ColumnDataSource(data=dict(x=list_x, y=list_y, desc=desc, topic=topics))
hover = HoverTool(tooltips=[
    ("index", "$index"),
    ('desc', '@desc'),
    ('topic', '@topic')
])
p = figure(plot_width=1200, plot_height=600, tools=[hover], title="Test")
p.circle('x', 'y', size=10, source=source,
         fill_color=transform('topic', color_map),
         # legend='topic'
         )
# p.legend.location = "top_left"
# p.legend.click_policy="hide"
bpl.show(p)