Perplexity in gensim


Brian Feeny

Dec 10, 2013, 12:47:57 AM
to gen...@googlegroups.com
Is this showing perplexity improving or getting worse?

10 Perplexity:  -4240066.51184
Per-word Perplexity: 556.775892128

25 Perplexity:  -4412724.42007
Per-word Perplexity: 720.254670788

50 Perplexity:  -4602324.53917
Per-word Perplexity: 955.570588477

75 Perplexity:  -4743153.28502
Per-word Perplexity: 1178.84653298

100 Perplexity:  -4875013.20852
Per-word Perplexity:  1434.97373636

150 Perplexity:  -5065182.32312
Per-word Perplexity:  1905.41289365

It looks like the number is getting smaller, so from that perspective it's improving, but I realize gensim is just reporting the lower bound, correct? So is this still an improvement? The above shows the bound for num_topics = 10, 25, 50, 75, 100, and 150.
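
For context, the per-word figure is computed directly from the bound (the script in the next message does exactly this), so a more negative bound over the same test set always maps to a larger per-word perplexity. A minimal sketch of that relationship, using the two extreme bounds above and a made-up held-out token count:

    import numpy as np

    # per-word perplexity = 2 ** (-bound / number_of_held_out_tokens)
    num_test_tokens = 465000                             # made-up count, for illustration only
    print np.exp2(-(-4240066.51184) / num_test_tokens)   # ~556,  bound reported for 10 topics
    print np.exp2(-(-5065182.32312) / num_test_tokens)   # ~1900, bound reported for 150 topics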

Brian Feeny

Dec 10, 2013, 8:22:46 AM
to gen...@googlegroups.com

I am sampling 200k reviews from a corpus of 890k reviews and grid-searching over 15 topic sizes, from 10 to 150:

    from collections import defaultdict
    import random
    import time

    import numpy as np
    from gensim import models

    # `corpus` (bag-of-words) and `dictionary` are assumed to be built already

    grid = defaultdict(list)

    # Choose a parameter to search over, for example num_topics or alpha / eta,
    # and make sure "parameter_value" is substituted into the model below
    # instead of a static value.
    #
    # num_topics
    parameter_list = [10, 20, 30, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

    # alpha / eta
    # parameter_list = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 1.5]

    # we can sample if we like
    cp = random.sample(corpus, 200000)

    # or shuffle the full corpus instead
    # cp = list(corpus)
    # random.shuffle(cp)

    # split into training and test sets (50/50 here)
    p = int(len(cp) * .5)
    cp_train = cp[0:p]
    cp_test = cp[p:]

    # for num_topics_value in num_topics_list:
    for parameter_value in parameter_list:

        # print "starting pass for num_topic = %d" % num_topics_value
        print "starting pass for parameter_value = %.3f" % parameter_value
        start_time = time.time()

        # train the model
        model = models.ldamodel.LdaModel(corpus=cp_train, id2word=dictionary, num_topics=parameter_value,
                                         chunksize=3125, passes=25, update_every=0, alpha=None, eta=None,
                                         decay=0.5, distributed=True)

        # show elapsed time for the model
        elapsed = time.time() - start_time
        print "Elapsed time: %s" % elapsed

        # bound() returns the variational lower bound on the log likelihood of the
        # held-out documents, not a perplexity itself
        perplex = model.bound(cp_test)
        print "Perplexity: %s" % perplex
        grid[parameter_value].append(perplex)

        # convert the bound to a per-word perplexity (base 2 here)
        per_word_perplex = np.exp2(-perplex / sum(cnt for document in cp_test for _, cnt in document))
        print "Per-word Perplexity: %s" % per_word_perplex
        grid[parameter_value].append(per_word_perplex)
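
A possible way to eyeball the grid afterwards (a sketch, assuming matplotlib is available; each grid[k] holds [bound, per_word_perplexity] as appended above):

    import matplotlib.pyplot as plt

    # plot the per-word perplexity (second value stored per parameter) against num_topics
    xs = sorted(grid.keys())
    ys = [grid[k][1] for k in xs]
    plt.plot(xs, ys, marker='o')
    plt.xlabel('num_topics')
    plt.ylabel('per-word perplexity')
    plt.show()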



What's weird is that a colleague took the same words, put them into the Stanford NLP Toolkit, and got the following results instead:

num topics    perplexity
5             959.4420577
10            786.9904847
15            663.3430717
20            567.9809147
30            441.16059
40            348.3312332
50            295.2714263
60            252.1265202
70            225.7246909
80            200.7043352
90            185.2709879
100           163.3825024
110           154.6160706
120           144.9268282
130           130.7736963
140           125.327153
150           115.7756802

With a graph that looked more reasonable:


Radim Řehůřek

Dec 12, 2013, 8:02:51 AM
to gen...@googlegroups.com

On Tuesday, December 10, 2013 6:47:57 AM UTC+1, Brian Feeny wrote:
> Is this showing perplexity improving or getting worse?

Neither. The values coming out of `bound()` depend on the number of topics (as well as the number of words), so they're not comparable across different num_topics (or different test corpora).

> It looks like the number is getting smaller, so from that perspective it's improving

No, the opposite: a smaller bound value implies deterioration. For example, a bound of -6000 is "better" than -7000 (bigger is better).

If your bound *decreases* during training (while holding the test corpus and topics constant), it means the LDA model is not improving w.r.t. your test corpus. Something went wrong (a disconnect between train and test data?), because this bound is exactly what (variational) LDA maximizes.
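
One way to check this in practice (a rough sketch, reusing cp_train, cp_test and dictionary from the script above; not a drop-in diagnostic, just an illustration):

    from gensim import models

    # train one pass at a time and watch the bound on a fixed held-out set;
    # a value that keeps falling suggests a train/test mismatch
    model = models.ldamodel.LdaModel(corpus=cp_train, id2word=dictionary,
                                     num_topics=50, passes=1, update_every=0)
    for i in range(5):
        print "after pass %d, held-out bound: %.2f" % (i + 1, model.bound(cp_test))
        model.update(cp_train)   # one more pass over the training set
    print "final held-out bound: %.2f" % model.bound(cp_test)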

This question comes up frequently; I'll add a patch that normalizes bound() scores better across num_topics, and logs the (estimated, per-chunk) bound during LDA training. That should make inspecting what's going on during training more "human-friendly" :)

As for comparing absolute perplexity values across toolkits, make sure they're using the same formula (some exponentiate base 2, some base e, or they compute the test-corpus likelihood/bound in a different way, etc.).
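
To see how much the choice of base alone matters, here is a small illustration with a made-up per-word negative log likelihood (not a value taken from either toolkit):

    import numpy as np

    # the same average negative log likelihood per word, exponentiated in two bases
    avg_neg_log_likelihood_per_word = 9.12           # made-up value, for illustration only
    print np.exp2(avg_neg_log_likelihood_per_word)   # base 2 -> roughly 556
    print np.exp(avg_neg_log_likelihood_per_word)    # base e -> roughly 9136
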
Also, better perplexity doesn't necessarily mean better topics (see "Reading tea leaves: How humans interpret topic models" by Chang et al.). It's still useful for within-model comparisons and optimizations, though.

Best,
Radim

Radim Řehůřek

Dec 23, 2013, 6:09:57 PM
to gen...@googlegroups.com, brian...@gmail.com
Hello Brian,

I merged the asymmetric patch by Ben and improved the perplexity logging in gensim; it may be of interest to you:

Best,
Radim

Hongyuan Mei

Apr 22, 2016, 1:54:13 PM
to gensim
Hi Brian,

Did you ever figure out the trend, and the difference from the Stanford NLP toolkit?
I got the same observations and am quite confused...

Thanks!
Best,