(LDA) calculating perplexity on unseen documents


Hans Ekbrand

Feb 3, 2021, 9:04:04 AM
to Gensim

When `eval_every` is a positive number, `LdaMulticore()` (and, I'm sure, the other LDA implementations in gensim too) holds out a set of documents, occasionally infers topic distributions for those held-out documents, and then calculates their perplexity given the current state of the model. I see messages to this effect on standard output (or possibly standard error), looking something like this:

2021-02-03 14:39:38,348 : INFO : -8.303 per-word bound, 315.8 perplexity estimate based on a held-out corpus of 7996 documents with 2656240 word
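As a sanity check on how I read this log line, the two figures appear to be related by perplexity = 2^(-bound); that's my assumption, based on the reported values:

```python
# the per-word bound from the log line above, interpreted in base 2
per_word_bound = -8.303

perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))  # prints 315.8, matching the log line
```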

These are great; I'd like to use them for choosing an optimal number of topics. I know that I can use the `log_perplexity()` method of the LDA object to calculate them manually, and if I apply this method to the training corpus I get values very similar to the ones in the log of `LdaMulticore()`. However, I also manually create a test corpus: I sample documents from the same master data that the training data was sampled from, make sure none of the documents in the testing data is present in the training data, and use the same dictionary as when creating the training corpus. When I apply the `log_perplexity()` method to this testing corpus, I seem to get monotonically increasing perplexity estimates as the number of topics increases, which should not be the case. Am I missing something here?

Here is the code I use to create the two corpora, to fit the model, and to evaluate the model. I can share the data if that helps.

import multiprocessing
from gensim import corpora
from gensim.models import LdaMulticore

dictionary = corpora.Dictionary(line.lower().split() for line in open(training_file, encoding="utf-8"))
training_corpus = [dictionary.doc2bow(line.lower().split()) for line in open(training_file, encoding="utf-8")]
testing_corpus = [dictionary.doc2bow(line.lower().split()) for line in open(testing_file, encoding="utf-8")]
num_topics = [35, 37, 39, 41, 43, 45, 47, 49]
num_passes = 10
num_workers = multiprocessing.cpu_count() - 1
LDA_models = {}
training_perplexity = {}
testing_perplexity = {}
for i in range(len(num_topics)):
    LDA_models[i] = LdaMulticore(corpus=training_corpus,
                                 workers=num_workers,
                                 id2word=dictionary,
                                 num_topics=num_topics[i],
                                 chunksize=int(len(training_corpus) / num_workers + 1),
                                 alpha='asymmetric',
                                 eta='auto',
                                 eval_every=num_passes,
                                 passes=num_passes,
                                 random_state=42)
    training_perplexity[i] = LDA_models[i].log_perplexity(training_corpus)
    print("perplexity on the whole training set for {} topics: {}".format(num_topics[i], training_perplexity[i]))
    testing_perplexity[i] = LDA_models[i].log_perplexity(testing_corpus)
    print("perplexity on the test set for {} topics: {}".format(num_topics[i], testing_perplexity[i]))
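For completeness, the split itself is nothing fancy; a minimal sketch of how I sample disjoint training and testing documents from the master data (with illustrative stand-in data, no gensim needed):

```python
import random

random.seed(42)
master = ["document {}".format(i) for i in range(100)]  # stand-in for the master data

# hold out 20% for testing, making sure no document appears in both sets
test_idx = set(random.sample(range(len(master)), len(master) // 5))
training_docs = [doc for i, doc in enumerate(master) if i not in test_idx]
testing_docs = [doc for i, doc in enumerate(master) if i in test_idx]

assert not set(training_docs) & set(testing_docs)  # no overlap
print(len(training_docs), len(testing_docs))  # prints: 80 20
```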

Hans Ekbrand

Feb 3, 2021, 9:47:30 AM
to Gensim
On the same topic: in "Evaluation Methods for Topic Models" by Wallach et al. (2009), they compare a set of methods for estimating the probability of unseen documents:

1. Importance sampling methods
2. Harmonic mean method
3. Annealed importance sampling
4. "Left-to-right" evaluation
5. Chib-style estimator

AFAIK perplexity is minimized when likelihood is maximized, so these methods should be relevant for the same purpose for which one would calculate perplexity. When gensim calculates perplexity on unseen data, does it use any of the above methods? If not, what algorithm is used? The conclusion of Wallach et al. is that annealed importance sampling and their own "left-to-right" evaluation perform better than the old harmonic mean method, which "wildly overestimates" the likelihood of the testing data.
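To spell out the relationship I'm assuming between the two quantities: perplexity = b^(-(1/N) · log_b p(held-out)) for any base b, so maximizing held-out likelihood is the same as minimizing perplexity. A toy check:

```python
import math

def perplexity(log_likelihood, num_words):
    # natural-log likelihood of a held-out set -> per-word perplexity
    return math.exp(-log_likelihood / num_words)

# a higher (less negative) log likelihood always gives a lower perplexity
assert perplexity(-4000.0, 1000) < perplexity(-5000.0, 1000)
```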

