Does LDA Require A Lot of Training Passes to Model Small Corpuses

john.hard...@gmail.com

May 18, 2017, 3:48:16 PM

to gensim

Hello,

I've got a quick question about determining whether I've trained my LDA models long enough. I am trying to test whether my models are reproducible by comparing the topics of duplicate models trained on the same data and parameters.

https://github.com/RaRe-Technologies/gensim/issues/1328

I'm noticing that even when training my LDA models with 150-200 passes (using 25 topics), I'm not getting great results. There are about 19-21 good, consistent topics, but the rest are not consistent from one model to another. I think a problem is that I'm using a relatively tiny corpus to create my LDA model. My corpus only has 27090 documents, and 11195 features. I'm using a very aggressive pre-processing step, which identifies the key concept words from a document rather than using raw text.

Would such a small number of documents and features require a large number of training passes to create consistent models?
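For readers who want to try this kind of comparison, here is a minimal sketch of one way to measure consistency between two runs. It is not the code from the fork mentioned in this thread. It assumes each topic is represented by its list of top words (in gensim these could be read from `LdaModel.show_topic(topic_id, topn=...)`), and it pairs topics across runs greedily by Jaccard overlap. The function names `jaccard` and `match_topics` and the word lists are hypothetical, for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_topics(topics_a, topics_b):
    """Greedily pair each topic of run A with its most similar unused topic of run B.

    Returns a list of (index_a, index_b, similarity) tuples, best matches first.
    """
    # Score every cross-run pair of topics, highest similarity first.
    scored = sorted(
        ((jaccard(ta, tb), i, j)
         for i, ta in enumerate(topics_a)
         for j, tb in enumerate(topics_b)),
        key=lambda t: -t[0],
    )
    used_a, used_b, pairs = set(), set(), []
    for sim, i, j in scored:
        if i not in used_a and j not in used_b:
            pairs.append((i, j, sim))
            used_a.add(i)
            used_b.add(j)
    return pairs


# Hypothetical top words from two runs on the same data (illustrative only).
run_1 = [["cat", "dog", "pet", "vet"], ["stock", "bond", "fund", "market"]]
run_2 = [["bond", "fund", "market", "rate"], ["cat", "dog", "pet", "food"]]

pairs = match_topics(run_1, run_2)
for i, j, sim in pairs:
    print(f"run_1 topic {i} <-> run_2 topic {j}: Jaccard = {sim:.2f}")
```

Topics whose best match has a low similarity are the inconsistent ones. The mean similarity over all pairs gives a single consistency score per pair of runs, which can then be tracked as the number of passes grows.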

Ivan Menshikh

May 19, 2017, 8:33:16 AM

to gensim

Hello John,

As for preprocessing, I think you are doing everything right, because LdaModel is very sensitive to the input corpus.

The fact is that not all topics turn out to be "good". There are almost always some topics that fluctuate, and this is normal. It happens not only with LDA but with topic models in general.

As far as I know, BigARTM, for example, uses regularization to push all the "garbage" words into specific topics (trash topics).

Regarding your problem, it seems to me that you need to ignore these "trash topics".

john.hard...@gmail.com

May 22, 2017, 11:12:58 AM

to gensim

I wouldn't call these topics trash topics, though. I'm only training LDA with 25 topics, and all of them are relatively coherent. That is, within each topic the words tend to agree with each other, at least in my opinion. I am not getting topics where the words have no relationship whatsoever. The only issue I've seen is word chaining, where there appear to be two distinct groupings of words in a topic that are "attached" by a single word they share in common.

This is a problem, because models I train on the same data and the same parameters appear to chain topics in different ways. I'm using my LDA topics to categorize documents, so I'm trying to minimize this chaining issue as much as possible.

Running my LDA models with more training passes seems to be helping me out. Models trained with 500 passes are more consistent than those trained with 150 passes (using the code in my fork), though I need to study this further to be confident. However, even with 500 passes I'm noticing chaining, so I'm wondering whether models with a few hundred more passes would work better. This is the next thing I'm going to study. Is there any downside to training models this long? I assume that it depends on the corpus, but is there any danger of "overtraining" my models that I should know of?

Thanks so much for the help.

Ivan Menshikh

May 22, 2017, 12:24:08 PM

to gensim

> Is there any downside to training models this long? I assume that it depends on the corpus, but is there any danger of "overtraining" my models that I should know of?

It seems to me that the model cannot "overfit", because this is an unsupervised task. It all depends on how much time you are ready to spend on training.

To see whether you need to train more, you can look at topic coherence (blog) and topic coherence (doc). Also, don't forget about perplexity.
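To make the coherence suggestion concrete, here is a minimal sketch of the UMass coherence measure in plain Python. It is an illustration on a toy corpus, not gensim's implementation (gensim provides `CoherenceModel`, which supports `coherence='u_mass'`). The score sums log((D(w_m, w_l) + 1) / D(w_l)) over pairs of a topic's ranked top words, where D counts the documents containing the given words. The function name and the toy documents are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic's ranked top words over tokenized documents."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # Assumes every top word occurs in at least one document,
            # so the denominator is never zero.
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score


# Toy tokenized corpus (illustrative only).
docs = [["cat", "dog", "pet"], ["cat", "dog"], ["stock", "bond"], ["cat", "stock"]]

coherent_topic = ["cat", "dog"]   # words that co-occur in the toy corpus
mixed_topic = ["cat", "bond"]     # words that never co-occur in the toy corpus

print(umass_coherence(coherent_topic, docs))
print(umass_coherence(mixed_topic, docs))
```

Scores closer to zero mean the top words co-occur more often. A topic that chains two unrelated word groups would be expected to lose score on the cross-group pairs, although that is an assumption worth checking on real data.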

john.hard...@gmail.com

May 22, 2017, 12:42:11 PM

to gensim

I'd read that perplexity doesn't correspond as well to human judgement as coherence.

One of my next steps is to see if coherence correlates with consistency.
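A minimal sketch of how such a check could look, assuming one coherence score and one consistency score per topic are already available. It computes the Spearman rank correlation in plain Python. The numbers below are made-up placeholders, not results from any model, and the helper names are hypothetical.

```python
import math


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder per-topic scores (made up for illustration, not measured).
coherence = [-1.2, -0.8, -2.5, -0.5]
consistency = [0.55, 0.70, 0.20, 0.90]

print(spearman(coherence, consistency))
```

A value near 1 would suggest that the more coherent topics are also the ones that reappear across runs. A value near 0 would suggest that coherence alone does not predict consistency.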
