Perplexity in gensim


Brian Feeny

Dec 10, 2013, 12:47:57 AM
to gen...@googlegroups.com
Is this showing perplexity improving or getting worse?

10 Perplexity:  -4240066.51184
Per-word Perplexity: 556.775892128

25 Perplexity:  -4412724.42007
Per-word Perplexity: 720.254670788

50 Perplexity:  -4602324.53917
Per-word Perplexity: 955.570588477

75 Perplexity:  -4743153.28502
Per-word Perplexity: 1178.84653298

100 Perplexity:  -4875013.20852
Per-word Perplexity:  1434.97373636

150 Perplexity:  -5065182.32312
Per-word Perplexity:  1905.41289365

It looks like the number is getting smaller, so from that perspective it's improving, but I realize gensim is just reporting the lower bound, correct? So is this still an improvement? The above shows the bound for num_topics = 10, 25, 50, 75, 100, and 150.
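
For context, the per-word figure is computed directly from the bound (the script in the next message does exactly this), so a more negative bound over the same test set always maps to a larger per-word perplexity. A minimal sketch of that relationship, using the two extreme bounds above and a made-up held-out token count:

    import numpy as np

    # per-word perplexity = 2 ** (-bound / number_of_held_out_tokens)
    num_test_tokens = 465000                             # made-up count, for illustration only
    print np.exp2(-(-4240066.51184) / num_test_tokens)   # ~556,  bound reported for 10 topics
    print np.exp2(-(-5065182.32312) / num_test_tokens)   # ~1900, bound reported for 150 topics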

Brian Feeny

Dec 10, 2013, 8:22:46 AM
to gen...@googlegroups.com

I am sampling 200k reviews from a corpus of 890k reviews and grid-searching over 15 topic sizes, from 10 to 150:

    from collections import defaultdict
    import random
    import time

    import numpy as np
    from gensim import models

    # `corpus` (bag-of-words) and `dictionary` are assumed to be built already

    grid = defaultdict(list)

    # Choose a parameter to search over, for example num_topics or alpha / eta,
    # and make sure "parameter_value" is substituted into the model below
    # instead of a static value.
    #
    # num_topics
    parameter_list = [10, 20, 30, 45, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

    # alpha / eta
    # parameter_list = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 1.5]

    # we can sample if we like
    cp = random.sample(corpus, 200000)

    # or shuffle the full corpus instead
    # cp = list(corpus)
    # random.shuffle(cp)

    # split into training and test sets (50/50 here)
    p = int(len(cp) * .5)
    cp_train = cp[0:p]
    cp_test = cp[p:]

    # for num_topics_value in num_topics_list:
    for parameter_value in parameter_list:

        # print "starting pass for num_topic = %d" % num_topics_value
        print "starting pass for parameter_value = %.3f" % parameter_value
        start_time = time.time()

        # train the model
        model = models.ldamodel.LdaModel(corpus=cp_train, id2word=dictionary, num_topics=parameter_value,
                                         chunksize=3125, passes=25, update_every=0, alpha=None, eta=None,
                                         decay=0.5, distributed=True)

        # show elapsed time for the model
        elapsed = time.time() - start_time
        print "Elapsed time: %s" % elapsed

        # bound() returns the variational lower bound on the log likelihood of the
        # held-out documents, not a perplexity itself
        perplex = model.bound(cp_test)
        print "Perplexity: %s" % perplex
        grid[parameter_value].append(perplex)

        # convert the bound to a per-word perplexity (base 2 here)
        per_word_perplex = np.exp2(-perplex / sum(cnt for document in cp_test for _, cnt in document))
        print "Per-word Perplexity: %s" % per_word_perplex
        grid[parameter_value].append(per_word_perplex)
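
A possible way to eyeball the grid afterwards (a sketch, assuming matplotlib is available; each grid[k] holds [bound, per_word_perplexity] as appended above):

    import matplotlib.pyplot as plt

    # plot the per-word perplexity (second value stored per parameter) against num_topics
    xs = sorted(grid.keys())
    ys = [grid[k][1] for k in xs]
    plt.plot(xs, ys, marker='o')
    plt.xlabel('num_topics')
    plt.ylabel('per-word perplexity')
    plt.show()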



What's weird is that a colleague took the same words, put them into the Stanford NLP Toolkit, and got the following results instead:

num topics    perplexity
5             959.4420577
10            786.9904847
15            663.3430717
20            567.9809147
30            441.16059
40            348.3312332
50            295.2714263
60            252.1265202
70            225.7246909
80            200.7043352
90            185.2709879
100           163.3825024
110           154.6160706
120           144.9268282
130           130.7736963
140           125.327153
150           115.7756802

With a graph that looked more reasonable:


Radim Řehůřek

Dec 12, 2013, 8:02:51 AM
to gen...@googlegroups.com

On Tuesday, December 10, 2013 6:47:57 AM UTC+1, Brian Feeny wrote:
> Is this showing perplexity improving or getting worse?

Neither. The values coming out of `bound()` depend on the number of topics (as well as the number of words), so they're not comparable across different num_topics (or different test corpora).

> It looks like the number is getting smaller, so from that perspective it's improving

No, the opposite: a smaller bound value implies deterioration. For example, a bound of -6000 is "better" than -7000 (bigger is better).

If your bound *decreases* during training (while holding the test corpus and topics constant), it means the LDA model is not improving w.r.t. your test corpus. Something went wrong (a disconnect between train and test data?), because this bound is exactly what (variational) LDA maximizes.
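
One way to check this in practice (a rough sketch, reusing cp_train, cp_test and dictionary from the script above; not a drop-in diagnostic, just an illustration):

    from gensim import models

    # train one pass at a time and watch the bound on a fixed held-out set;
    # a value that keeps falling suggests a train/test mismatch
    model = models.ldamodel.LdaModel(corpus=cp_train, id2word=dictionary,
                                     num_topics=50, passes=1, update_every=0)
    for i in range(5):
        print "after pass %d, held-out bound: %.2f" % (i + 1, model.bound(cp_test))
        model.update(cp_train)   # one more pass over the training set
    print "final held-out bound: %.2f" % model.bound(cp_test)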

This question comes up frequently; I'll add a patch that normalizes bound() scores better across num_topics, and logs the (estimated, per-chunk) bound during LDA training. That should make inspecting what's going on during training more "human-friendly" :)

As for comparing absolute perplexity values across toolkits, make sure they're using the same formula (some exponentiate base 2, some base e, or they compute the test-corpus likelihood/bound in a different way, etc.).
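
To see how much the choice of base alone matters, here is a small illustration with a made-up per-word negative log likelihood (not a value taken from either toolkit):

    import numpy as np

    # the same average negative log likelihood per word, exponentiated in two bases
    avg_neg_log_likelihood_per_word = 9.12           # made-up value, for illustration only
    print np.exp2(avg_neg_log_likelihood_per_word)   # base 2 -> roughly 556
    print np.exp(avg_neg_log_likelihood_per_word)    # base e -> roughly 9136
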
Also, better perplexity doesn't necessarily mean better topics (see "Reading tea leaves: How humans interpret topic models" by Chang et al.). It's still useful for within-model comparisons and optimizations, though.

Best,
Radim

Radim Řehůřek

Dec 23, 2013, 6:09:57 PM
to gen...@googlegroups.com, brian...@gmail.com
Hello Brian,

I merged the asymmetric patch by Ben and improved the perplexity logging in gensim; it may be of interest to you:

Best,
Radim

Hongyuan Mei

Apr 22, 2016, 1:54:13 PM
to gensim
Hi Brian,

Did you ever figure out the trend, and the difference from the Stanford NLP toolkit?
I got the same observations and am quite confused...

Thanks!
Best,