Finding number of topics using perplexity

Brenda Moon

unread,
Nov 13, 2014, 10:54:26 PM
to gen...@googlegroups.com
(posting again because images were missing)

I'm trying to find the natural number of topics for my corpus of January 2011 tweets containing the keyword 'science'.  I thought that if I plotted the perplexity against the number of topics for the same model and corpus I would see a dip in perplexity at the best number of topics.

As a first step I split the corpus into a training and test component:

import random

# split into train and test - random sample, but preserving order
train_size = int(round(len(bow_corpus) * 0.8))
train_index = sorted(random.sample(xrange(len(bow_corpus)), train_size))
test_index = sorted(set(xrange(len(bow_corpus))) - set(train_index))
train_corpus = [bow_corpus[i] for i in train_index]
test_corpus = [bow_corpus[j] for j in test_index]

I then used this code to iterate through numbers of topics from 5 to 150 in steps of 5, calculating the perplexity on the held-out test corpus at each step.

from collections import defaultdict
import numpy as np
from gensim import models

grid = defaultdict(list)   # num_topics -> [bound, per-word perplexity]

number_of_words = sum(cnt for document in test_corpus for _, cnt in document)
parameter_list = range(5, 151, 5)
for parameter_value in parameter_list:
    print "starting pass for parameter_value = %.3f" % parameter_value
    model = models.LdaMulticore(corpus=bow_corpus, workers=None, id2word=dictionary,
                                num_topics=parameter_value, iterations=10)

    perplex = model.bound(test_corpus)  # this is model perplexity not the per word perplexity
    print "Total Perplexity: %s" % perplex
    grid[parameter_value].append(perplex)

    per_word_perplex = np.exp2(-perplex / number_of_words)
    print "Per-word Perplexity: %s" % per_word_perplex
    grid[parameter_value].append(per_word_perplex)

    model.save(data_path + 'ldaMulticore_i10_T' + str(parameter_value) + '_training_corpus.lda')
    print

import pandas
import matplotlib.pyplot as plt

for numtopics in parameter_list:
    print numtopics, '\t', grid[numtopics]

df = pandas.DataFrame(grid)
ax = plt.figure(figsize=(7, 4), dpi=300).add_subplot(111)
df.iloc[1].transpose().plot(ax=ax, color="#254F09")   # row 1 holds the per-word perplexity
plt.xlim(parameter_list[0], parameter_list[-1])
plt.ylabel('Perplexity')
plt.xlabel('topics')
plt.title('')
plt.savefig('gensim_multicore_i10_topic_perplexity.pdf', format='pdf', bbox_inches='tight', pad_inches=0.1)
plt.show()
df.to_pickle(data_path + 'gensim_multicore_i10_topic_perplexity.df')

This is the graph of the perplexity:

There is a dip at around 130 topics, but it isn't very large - it seems like it could be noise. Does the change of gradient at around 35-40 topics suggest that is the best number of topics? Does anyone have examples of this type of graph showing what to expect when it works?

Is there a better approach?

I started with this model because it had the shortest run time for a reasonable perplexity in my previous testing (https://groups.google.com/forum/#!msg/gensim/yJan7QlKr4I/0XmdtR_78MoJ). When I didn't see the dip I expected, I tried extending the number of topics higher, checking in steps of 10 topics between 155 and 300.
The results are strange: the perplexity jumps enormously between 225 and 235 topics and continues to go up:

topics    total perplexity (model.bound)    per-word perplexity
215       -10569430.048500545               584.90060291802604
225       -10684866.564780824               627.05178242805653
235       -31870705.178804845               220681813.39332914
245       -37573025.065190136               6864939909.9477577

Is this expected? I repeated this from just before it jumped up and this was the result:

So it seems like it is quite unstable at these large perplexity values.

I tried using alpha='asymmetric' to see if that gave a different result but it was very similar.

model = models.LdaMulticore(corpus=bow_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10, alpha='asymmetric')

regards,

Brenda

Ian Wood

unread,
Nov 14, 2014, 5:05:04 AM
to gen...@googlegroups.com
As far as I'm aware, you'd expect "in-sample" perplexity to improve with more topics, but the improvement would level off as the model captures all but the most trivial structures in the data. Some 'noise' may be expected as well - you could try doing several runs at each number of topics with random initialisations to even out the noise. You might choose a point where it starts to level off. With alpha/beta optimisation in Mallet, I found that LL/token would drop off a little after each beta optimisation, and that overall it drops a little after many iterations.
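Something along these lines, for instance - a rough, untested sketch reusing your train_corpus / test_corpus / dictionary variables:

from gensim import models
import numpy as np

n_repeats = 5
mean_bound = {}
for num_topics in range(5, 151, 5):
    bounds = []
    for _ in range(n_repeats):
        # each fit starts from a different random initialisation
        lda = models.LdaMulticore(corpus=train_corpus, id2word=dictionary,
                                  num_topics=num_topics, iterations=10)
        bounds.append(lda.bound(test_corpus))
    mean_bound[num_topics] = np.mean(bounds)   # average out the initialisation noise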

Ideally, you should use a held-out sample of data and estimate the perplexity on that. I know that Mallet has a means of estimating topic assignments for new data - that can then be used to calculate held-out perplexity.

That said, it's been shown that perplexity doesn't correlate well with human judgements of topic coherence anyway... There are a few coherence measures out there that have reportedly done better, and I believe Mallet has one in its diagnostics class. Otherwise, most studies have done extensive hand-verification of topic coherence.

A recent innovation in this direction is to perform a "posterior predictive check" (see Mimno et al.'s paper "Bayesian checking for topic models"). The idea is to choose a "discriminant function" - a function of a data set (with topic assignments in this case) that captures something you care about. Mimno et al. chose the mutual information (MI) between word-topic assignments and word-document assignments - the intuition is that these should be independent, and thus have low mutual information. In reality, random fluctuations result in small values of this function - the posterior predictive check then calculates this function on synthetic data generated by your model (perhaps 100 or so synthetic data sets) to estimate the distribution of MI values given the model. If the MI value for your data (with topic assignments) doesn't sit nicely in that distribution, there's something wrong - either words are concentrated in some documents more than others (high MI) or are surprisingly uniformly spread between documents (low MI). Increasing the number of topics should help in the high MI case. You can also look at the 'IMI' (instantaneous mutual information) scores of the individual words that make up the total model MI score to find words that behave badly.
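For what it's worth, the discriminant function itself is just the empirical mutual information over a word's (document, topic) token assignments. Roughly (a generic MI sketch, not my actual PPC code; `assignments` is a hypothetical list of (doc_id, topic_id) pairs, one per token of the word):

import numpy as np
from collections import Counter

def doc_topic_mi(assignments):
    # empirical MI between document index and topic assignment for one word's tokens
    n = float(len(assignments))
    joint = Counter(assignments)                   # counts of (doc_id, topic_id) pairs
    docs = Counter(d for d, _ in assignments)      # marginal document counts
    topics = Counter(k for _, k in assignments)    # marginal topic counts
    mi = 0.0
    for (d, k), count in joint.items():
        p_joint = count / n
        mi += p_joint * np.log2(p_joint / ((docs[d] / n) * (topics[k] / n)))
    return mi

The full check then compares this value against the same quantity computed on data sets simulated from the fitted model.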

...

Anyway..
Radim: I've coded up Mimno's PPC method in Python for my research, but using home-baked data structures. I've been thinking of porting it to gensim... is there an obvious place to put it? I guess it would sit well next to other diagnostic measures that are there or planned. I've not looked much at gensim yet, but I'd like to start using it. I've been working with large-ish models (one crazy model had 250k documents and a 300k vocabulary with 200 topics) and had to work hard on making it memory- and time-efficient, and it seems a good time to give back to all the open source stuff I'm always using :)

Brenda Moon

unread,
Nov 15, 2014, 7:36:30 AM
to gen...@googlegroups.com
Hello Ian,

Thanks for your suggestions.

I've spotted an error in my code that may explain the strange graphs - I'm training on the whole corpus (bow_corpus), not the train_corpus. So the test_corpus isn't actually held out.


model = models.LdaMulticore(corpus=bow_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10)

should be

model = models.LdaMulticore(corpus=train_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10)

I'll rerun using the train_corpus as I intended in the first place and see if that gives better results.

I'll have a look at Mimno et al.'s paper "Bayesian checking for topic models".

Radim Řehůřek

unread,
Nov 15, 2014, 9:04:19 AM
to gen...@googlegroups.com
Ian: that would be awesome, thanks! And let me know if you need any help/assistance there.

Brenda: people have been reporting a strange relationship between perplexity and the number of topics for some time. I haven't had time to investigate, still on my to-do list :(

I also think the relationship should be inverse (more topics => lower perpl, at least up to a certain inflection point).

I've invited Matt Hoffman to comment, since the code is ported from his original onlineldavb Python package.

But like Ian says, perplexity is not a good measure of topic quality anyway. Not to mention that Mallet (Gibbs sampling) and gensim (variational Bayes) compute it in completely different ways.

Best,
Radim

Brenda Moon

unread,
Nov 16, 2014, 12:35:59 AM
to gen...@googlegroups.com
Running the corrected code using the training corpus for training and test corpus for testing hasn't changed the graphs very much.


model = models.LdaMulticore(corpus=train_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10)


gives:

I checked the change in gradient at 35 topics by running 5 tests at 35 topics. The statistics for the results were:
mean     462.3
std        6.5
min      451.5
max      467.7

compared to 454.8 for the run shown on the graph. So the dip is probably just noise introduced by the random initialisation of each run.

The asymmetric alpha gives very similar results:
 
model = models.LdaMulticore(corpus=train_corpus, workers=None, id2word=dictionary, num_topics=parameter_value, iterations=10, alpha='asymmetric')


Happy to run any tests that might help work out why the perplexity increases as the number of topics increases.

regards,

Brenda

Ian Wood

unread,
Nov 25, 2014, 1:06:52 AM
to gen...@googlegroups.com
I got similar plots using Mallet a while back - I had log likelihood (a negative number) increasing with more topics, but levelling off. I took it to mean that the probability of the data given the model was increasing, which made sense.

Rajeswari Arikrishnan

unread,
Nov 25, 2014, 6:45:50 AM
to gen...@googlegroups.com
How do I download gensim?


Yuval Shachaf

unread,
Nov 2, 2016, 9:21:11 AM
to gensim
Hi Radim,
Following the posts here, I am still confused about how to evaluate the number of topics and passes. I keep seeing the perplexity increase with the number of topics.
Currently I use pyLDAvis to evaluate the model, which seems to do the job, but I never know whether I could do better without some kind of metric.
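For reference, this is roughly how I produce the pyLDAvis view (a sketch, assuming a fitted lda_model, the bag-of-words corpus and the dictionary):

import pyLDAvis
import pyLDAvis.gensim

vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)   # inside a notebook; pyLDAvis.show(vis) outside one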
Btw, I am doing topic modeling on tweets.
Yuval

Lev Konstantinovskiy

unread,
Nov 24, 2016, 3:29:36 PM
to gensim
Hi Yuval,

You might find this blog post about a metric called c_v coherence useful. It comes in handy when you just need one number to go by.
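A minimal sketch of computing it in gensim (assuming a trained lda model, the tokenised texts and the dictionary used to build the corpus):

from gensim.models import CoherenceModel

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())   # compare this score across models with different num_topics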

Regards
Lev

MSH

unread,
Mar 6, 2017, 9:11:01 AM
to gensim
I am running another model and my perplexity also increases with number of topics... I am perplex... :-)

Supriya Kinariwala

unread,
Dec 14, 2017, 9:14:08 AM
to gensim
Hi Yuval,
I am also working on the same problem. Can you please help me with Python code for calculating perplexity?

Hiba Aleqabie

unread,
Dec 16, 2017, 6:48:33 AM
to gensim
Me too, please.

Ivan Menshikh

unread,
Dec 18, 2017, 1:05:27 AM
to gensim
Hi,
A small example is available here.
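In short, it is something along these lines (a rough sketch, not necessarily identical to the linked example; assumes a trained lda model and a held-out test_corpus):

import numpy as np

per_word_bound = lda.log_perplexity(test_corpus)   # average per-word likelihood bound
perplexity = np.exp2(-per_word_bound)              # convert the bound into a perplexity
print(perplexity)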

Hiba Aleqabie

unread,
Dec 18, 2017, 12:01:16 PM
to gen...@googlegroups.com
Thank you so much


chit...@umn.edu

unread,
Apr 20, 2018, 11:42:21 AM
to gensim
Hi Brenda,

Looking into your code, I have a question regarding the way perplexity-per-word is computed. If `perplex` in your code is the model perplexity, shouldn't `per_word_perplex` be (perplex / number_of_words) instead of np.exp2(-perplex / number_of_words)?

The reason I ask is that I was under the assumption that the perplexity of the model is already computed as the exponential of the negative likelihood. So perplexity-per-word would just be (perplexity of the model / number of words).
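Or is the convention here that model.bound() returns a log-likelihood rather than a perplexity? In that case the exponentiation would make sense. Roughly (made-up numbers, just to spell out the arithmetic as I understand it):

import numpy as np

bound = -1.0e7      # hypothetical total log2-likelihood of the held-out set, e.g. model.bound(test_corpus)
n_tokens = 1.5e6    # hypothetical number of tokens in the held-out set

per_word_bound = bound / n_tokens        # average log2-probability per token
perplexity = np.exp2(-per_word_bound)    # per-word perplexity = 2 ** (-average per-word log-likelihood)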

Other folks are free to take a shot at this question. Thanks in advance!

Sandeep

Hiba Aleqabie

unread,
Apr 20, 2018, 11:53:51 AM
to gen...@googlegroups.com
Thanks...
Very grateful.
