When `eval_every` is a positive number, `LdaMulticore()` (and presumably the other LDA implementations in gensim too) will periodically hold out a set of documents, infer topic distributions for them, and then calculate their perplexity given the current state of the model. I see messages like this on standard output (or possibly standard error):
2021-02-03 14:39:38,348 : INFO : -8.303 per-word bound, 315.8 perplexity estimate based on a held-out corpus of 7996 documents with 2656240 word
These are great, and I'd like to use them to choose an optimal number of topics. I know I can calculate them manually with the `log_perplexity()` method of the LDA object, and when I apply this method to the training corpus I get values very similar to the ones in the `LdaMulticore()` log. However, when I manually create a test corpus - by sampling documents from the same master data the training data was sampled from, making sure none of the documents in the test set appears in the training set, and using the same dictionary as when creating the training corpus - and then apply the `log_perplexity()` method to it, the perplexity estimates seem to increase monotonically with the number of topics, which should not be the case. Am I missing something here?
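For reference, the two numbers in that log line are related: gensim reports a per-word bound in base 2, and the "perplexity estimate" is two raised to the negated bound. A quick sanity check against the log line above:

```python
# gensim's per-word bound is in base 2; the logged perplexity is 2 ** (-bound).
bound = -8.303  # per-word bound from the log line above
perplexity = 2 ** (-bound)
print(round(perplexity, 1))  # ~315.8, matching the logged perplexity estimate
```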
Here is the code I use to create the two corpora, fit the models, and evaluate them. I can share the data if that helps.
import multiprocessing

from gensim import corpora
from gensim.models import LdaMulticore

# Build the dictionary from the training file and convert both files to bag-of-words corpora
dictionary = corpora.Dictionary(line.lower().split() for line in open(training_file, encoding="utf-8"))
training_corpus = [dictionary.doc2bow(line.lower().split()) for line in open(training_file, encoding="utf-8")]
testing_corpus = [dictionary.doc2bow(line.lower().split()) for line in open(testing_file, encoding="utf-8")]

num_topics = [35, 37, 39, 41, 43, 45, 47, 49]
num_passes = 10
num_workers = multiprocessing.cpu_count() - 1

LDA_models = {}
training_perplexity = {}
testing_perplexity = {}

for i in range(len(num_topics)):
    LDA_models[i] = LdaMulticore(corpus=training_corpus,
                                 workers=num_workers,
                                 id2word=dictionary,
                                 num_topics=num_topics[i],
                                 chunksize=int(len(training_corpus) / num_workers + 1),
                                 alpha='asymmetric',
                                 eta='auto',
                                 eval_every=num_passes,
                                 passes=num_passes,
                                 random_state=42)
    training_perplexity[i] = LDA_models[i].log_perplexity(training_corpus)
    print("perplexity on the whole training set for {} topics: {}".format(num_topics[i], training_perplexity[i]))
    testing_perplexity[i] = LDA_models[i].log_perplexity(testing_corpus)
    print("perplexity on the test set for {} topics: {}".format(num_topics[i], testing_perplexity[i]))