Calculating perplexity in LDA model

Benjamin Soltoff

Nov 25, 2013, 12:55:34 PM
to gen...@googlegroups.com
I am attempting to estimate an LDA topic model for a corpus of ~59,000 documents and ~500,000 unique tokens. I would prefer to estimate the final model in R so I can use its visualization tools for interpreting my results; however, first I need to select the number of topics for my model. Since I have no intuition as to how many topics are in the latent structure, I was going to estimate a series of models with the number of topics k = 20, 25, 30... and compute the perplexity of each model to determine the optimal number of topics, as recommended in Blei (2003).

The only packages for estimating LDA in R that I am aware of (LDA and topicmodels) use batch LDA, and whenever I estimate a model with more than 70 topics I run out of memory (and this is on a supercomputing cluster with up to 96 GB of RAM per processor). I thought I could use gensim to estimate the series of models using online LDA, which is much less memory-intensive, calculate the perplexity on a held-out sample of documents, select the number of topics based on those results, and then estimate the final model using batch LDA in R.

The steps I followed are:
  1. Generate the corpus from a series of text files in R, exporting the document-term matrix and dictionary in MM format.
  2. Import the corpus and dictionary in Python.
  3. Split the corpus into training/test datasets.
  4. Estimate the LDA model using the training data.
  5. Calculate the bound and per-word perplexity using the test data (a sketch of the sweep over k follows this list).
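A minimal sketch of the k sweep I have in mind, assuming gensim's LdaModel and log_perplexity and reusing cp_train/cp_test and id2word from steps 2-3:

import numpy
import gensim

for k in [20, 25, 30, 35, 40]:
    lda = gensim.models.ldamodel.LdaModel(corpus=cp_train, id2word=id2word,
                                          num_topics=k, chunksize=1000, passes=2)
    # log_perplexity returns the per-word variational bound on the held-out set;
    # gensim itself reports perplexity as 2^(-bound)
    per_word_bound = lda.log_perplexity(cp_test)
    print 'k=%d  per-word bound %.4f  perplexity %.2f' % (k, per_word_bound, numpy.exp2(-per_word_bound))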
My understanding is that perplexity always decreases as the number of topics increases, so the optimal number of topics should be where the marginal change in perplexity is small. However, whenever I estimate the series of models, perplexity in fact increases with the number of topics. The perplexity values for k = 20, 25, 30, 35, 40 are:

Perplexity (20 topics):  -44138604.0036
Per-word Perplexity:  542.513884961
Perplexity (25 topics):  -44834368.1148
Per-word Perplexity:  599.120014719
Perplexity (30 topics):  -45627143.4341
Per-word Perplexity:  670.851965367
Perplexity (35 topics):  -46457210.907
Per-word Perplexity:  755.178877447
Perplexity (40 topics):  -47294658.5467
Per-word Perplexity:  851.001209258
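(Sanity-checking my own conversion rather than the model: the per-word numbers do follow from the bound and the held-out token count, and back-solving gives roughly the same ~4.86M held-out tokens for every k, so the arithmetic at least is internally consistent.)

# per-word perplexity = 2^(-bound / n_test_words)
n_test_words = sum(cnt for doc in cp_test for _, cnt in doc)
print numpy.exp2(-(-44138604.0036) / n_test_words)   # ~542.5 for the 20-topic model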

Potential problems I've already thought of:
  • Is the model not running long enough to converge properly? I set the chunksize to 1000, so there should be 40-50 chunks per pass, and by the last chunk I am seeing 980+/1000 documents converging within 50 iterations.
  • Am I not understanding what the lda.bound function is estimating?
  • Do I need to trim the dictionary more? I've already removed all tokens below the median TF-IDF score, so I cut the original dictionary in half.
  • Is my problem that I am using R to build the dictionary and corpus? I compared, in a text editor, the dictionary and MM corpus files generated from R against a smaller test dictionary/corpus built with gensim, and I do not see any differences in how the information is coded. I want to use R to build the corpus so that I am sure I am using the exact same corpus for the online LDA as for the final model in R, and I do not know how to convert a gensim corpus into an R document-term matrix object (one possible export route is sketched below).
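For completeness, a sketch of going the other direction (writing a gensim corpus out to an MM file that R's Matrix::readMM() can load), assuming gensim.matutils.corpus2csc and scipy.io.mmwrite behave as documented:

import scipy.io
import gensim.matutils

# corpus2csc returns a terms x documents sparse matrix; transpose it so the
# exported file is documents x terms like the DTM produced in R
csc = gensim.matutils.corpus2csc(corpus, num_terms=len(id2word))
scipy.io.mmwrite('dtm_from_gensim.mtx', csc.T)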


The script I use is:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import numpy
import scipy
import gensim

import random
random.seed(11091987)           #set random seed


# load id->word mapping (the dictionary)
id2word = gensim.corpora.Dictionary.load_from_text('../dict.dict')

# load corpus
## add top line to MM file since R does not automatically add this
## and save new version
with open('../dtm.mtx') as f:
    dtm = f.read()
    dtm = "%%MatrixMarket matrix coordinate real general\n" + dtm

with open('dtm.mtx', 'w+') as f:
    f.write(dtm)


corpus = gensim.corpora.MmCorpus('dtm.mtx')

print id2word
print corpus

# shuffle corpus
cp = list(corpus)
random.shuffle(cp)

# split into 80% training and 20% test sets
p = int(len(cp) * .8)
cp_train = cp[0:p]
cp_test = cp[p:]

import time
start_time = time.time()

lda = gensim.models.ldamodel.LdaModel(corpus=cp_train, id2word=id2word, num_topics=25,
                                      update_every=1, chunksize=1000, passes=2)

elapsed = time.time() - start_time
print('Elapsed time: '),
print elapsed


print lda.show_topics(topics=-1, topn=10, formatted=True)

print('Perplexity: '),
perplex = lda.bound(cp_test)
print perplex

print('Per-word Perplexity: '),
print numpy.exp2(-perplex / sum(cnt for document in cp_test for _, cnt in document))

elapsed = time.time() - start_time
print('Elapsed time: '),
print elapsed
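
As a cross-check on the exp2 conversion above, gensim's own helper should give the same per-word figure (assuming the installed version has LdaModel.log_perplexity):

# log_perplexity returns the per-word variational bound on the held-out chunk;
# gensim reports perplexity as 2^(-bound), matching the manual calculation above
per_word_bound = lda.log_perplexity(cp_test)
print numpy.exp2(-per_word_bound)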

Brian Feeny

Nov 27, 2013, 8:30:18 PM
to gen...@googlegroups.com
I tried your approach, iterating over model creation. I find that every now and then the distributed LDA seems to lose its mind and I have to kill all the workers and Pyro and start again. Is there some housekeeping/cleanup that should be done between multiple iterations of a model, perhaps lda.clear()?

I am doing it like so, based on your code:

if feature['perplexity_search']:
    random.seed(42)
    
    grid = defaultdict(list)

    # alpha/eta values to sweep (num_topics is fixed at 40 below)
    parameter_list = [0.01, 0.1, 0.25, 0.5, 0.75, 1]
    
    # for num_topics_value in num_topics_list:
    for parameter_value in parameter_list:

        # print "starting pass for num_topic = %d" % num_topics_value
        print "starting pass for parameter_value = %.2f" % parameter_value
        start_time = time.time()

        # shuffle corpus
        cp = list(corpus)
        random.shuffle(cp)

        # split into 80% training and 20% test sets
        p = int(len(cp) * .8)
        cp_train = cp[0:p]
        cp_test = cp[p:]
    
        # run model
        model = models.ldamodel.LdaModel(corpus=cp_train, id2word=dictionary, num_topics=40, chunksize=500, 
                                        passes=25, update_every=0, alpha=parameter_value, eta=parameter_value, decay=0.5,
                                        distributed=True)
    
        # show elapsed time for model
        elapsed = time.time() - start_time
        print('Elapsed time: '),
        print elapsed
    
        print model.print_topics(40)   # num_topics_value is not defined in this sweep, so use the fixed topic count

        print('Perplexity: '),
        perplex = model.bound(cp_test)
        print perplex
        grid[parameter_value].append(perplex)
    
        print('Per-word Perplexity: '),
        per_word_perplex = np.exp2(-perplex / sum(cnt for document in cp_test for _, cnt in document))
        print per_word_perplex
        grid[parameter_value].append(per_word_perplex)
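
At the end of the sweep I just dump the grid to see the trend (a minimal sketch; each grid entry holds [bound, per-word perplexity]):

for parameter_value in sorted(grid.keys()):
    bound, per_word = grid[parameter_value]
    print "alpha/eta = %.2f  bound = %.1f  per-word perplexity = %.1f" % (parameter_value, bound, per_word)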

Ben Trahan

Nov 28, 2013, 1:06:38 AM
to gen...@googlegroups.com

This doesn't answer your perplexity question, but there is apparently a MALLET package for R.  MALLET is incredibly memory efficient -- I've done hundreds of topics and hundreds of thousands of documents on an 8GB desktop.  (Depending on what you want to visualize in R, you may also be able to simply dump the model that gensim produces to some standard table format and import it into R.)
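For the dump-and-import route, something along these lines should work (just a sketch: get_topics() assumes a reasonably recent gensim, and the file names are placeholders):

import numpy as np
import gensim

# topic-word matrix: num_topics x vocabulary_size array of word probabilities
np.savetxt('topic_word.csv', lda.get_topics(), delimiter=',')

# document-topic proportions: densify the sparse per-document output, then
# transpose to documents x topics for easy reading in R with read.csv()
doc_topic = gensim.matutils.corpus2dense(lda[corpus], num_terms=lda.num_topics).T
np.savetxt('doc_topic.csv', doc_topic, delimiter=',')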

By the way -- unless you have some reason you need to automate this process, like you'll be doing it for a whole lot of different corpora, you may get better results if you simply find the correct number of topics by inspection rather than with perplexity (so just do the same process, but evaluate the models by eye).  Perplexity is pretty loosely correlated with what a human would think of as topic quality.  You may also find that a 20 topic model and a 150 topic model both produce good results for different purposes.

Also, in my experience running Gensim on a corpus will tend to give a very different set of topics than running a Gibbs sampler (like those R packages).  I've tried this with both MALLET and my own crappy homebrew Python script.  The variational Bayes algorithm in Gensim seems to produce topics with fatter tails, but wants the documents to be purer than the Gibbs sampler.  It's hard to explain, but the difference becomes clear if you run the experiment enough times.  I suspect that whatever number of topics you find in Gensim won't transfer perfectly to R -- I'd probably settle on one LDA stack and just figure out how to get the results into whatever package you prefer for visualization.



Halil

Dec 4, 2013, 10:12:02 AM
to gen...@googlegroups.com
I also experimented with the perplexity code on another data set, and I observed that perplexity goes down as I increase the number of topics. I tried T in [11, 200].

Do you think we would see a U-shaped curve if we used a very large number of topics?

Thank you,
Halil

Yuanhan Mo

Mar 13, 2014, 4:02:52 PM
to gen...@googlegroups.com
Hey dude, I have the same problem. Did you figure out what happened? If you did, can you share your findings with us?

Benjamin Soltoff

Mar 13, 2014, 4:15:34 PM
to gen...@googlegroups.com
I never did figure it out. Eventually I changed my approach and did the estimation in R. It used 80+ gigs of memory to estimate the larger models, but I have access to a supercomputing cluster, so that worked. I just needed to submit each model as a separate job so that each model had its own dedicated memory and the script wouldn't crash. The LDA results didn't end up suiting my needs, so I switched to a supervised topic model with a hand-coded training set.


Benjamin