Convergence of LDA model

Abhinav Prakash

Dec 1, 2016, 11:49:38 PM

to gensim

First off, gensim is beautiful. And I am a newbie. I do know statistics, but I am new to the world of unsupervised learning, particularly topic modelling.

I am running my own tweaked version of ldamodel, which plots topic diff against the number of passes, so that I can use trial and error on the number of passes to get some kind of convergence for my model.

Things are working all right, but I have a few questions:

1. In some related posts on model convergence, topic_diff was highlighted as one indicator of convergence: it measures how much the topic distribution changes after a new chunk is processed, compared with the distribution before that chunk. If topic diff becomes constant or near-constant, the model has converged. That is fine, but I am sure there must be other ways to check convergence than, say, setting the number of passes to 200 and then reading the optimum number of passes for a specific corpus off a matplotlib graph. In one of Radim's replies in another post I came across VARMAXTER or something like it (not sure of the name), but I couldn't find it anywhere in gensim. Any ideas there would be appreciated.

Note: I am running ldamodel on a corpus of around 4k documents. My idea is to keep adding documents and updating the corpus. However, whenever the number of documents and the number of passes are small, gensim warns me to increase either the number of passes or the iterations. This is fine and is clear from the code as well. Hence my choice of 200 passes, followed by checking my plot for convergence.
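The "run many passes, then read the plot" check described above can also be done programmatically. Below is a minimal sketch, not anything gensim provides: `first_plateau`, `tol` and `patience` are hypothetical names, and the `diffs` values are made-up numbers standing in for the per-pass topic diff that the tweaked ldamodel records.

```python
def first_plateau(diffs, tol=1e-3, patience=3):
    """Return the 1-based pass at which the curve flattens, or None.

    The curve counts as flat once `patience` consecutive pass-to-pass
    changes are all smaller than `tol` in absolute value.
    """
    flat = 0
    for i in range(1, len(diffs)):
        if abs(diffs[i] - diffs[i - 1]) < tol:
            flat += 1
            if flat >= patience:
                # The flat run started with the change after this pass.
                return i - patience + 1
        else:
            flat = 0
    return None


# Made-up topic diff values, one per pass (pass 1 first).
diffs = [0.9, 0.5, 0.3, 0.2, 0.2001, 0.2002, 0.2001, 0.2]
plateau_pass = first_plateau(diffs)
print("curve is flat from pass:", plateau_pass)
```

The `patience` argument is there so that a single small change between two passes is not mistaken for convergence; the right `tol` depends on the scale of whatever diff measure is being plotted.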

2. In some replies on related topics, I saw mentions of a log message like "model converging on xx/xx documents". I have never seen this kind of log line when running the model. Does it mean my model is not converging at all? Or has it been removed from the gensim package? I ask because I cannot find any statement like this in any script in the gensim package. Any confirmation here would be appreciated.

Abhinav Prakash

Dec 2, 2016, 12:07:56 AM

to gensim

And yes, we can replace the graph with a derived criterion, for example: if the slope of that graph gets close to 0, the model is converging. That is not what I am asking about. I am looking for other measures for checking model convergence (other than topic diff).

Abhinav Prakash

Dec 2, 2016, 12:13:15 AM

to gensim

In my post, I referred to this earlier post:

https://groups.google.com/d/msg/gensim/Hy3otVYqJNc/cNCvazc-iOYJ

Lev Konstantinovskiy

Dec 2, 2016, 7:16:19 AM

to gensim

Hi Abhinav,

1. May I ask for clarification, to make sure that I understand your question? There is a well-known question of "how to choose LDA parameters for a fixed corpus". For example, see this blog post for one strategy.

When the change in gamma is small, the training automatically stops. It should be visible on your graph, right?
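To illustrate the stopping rule mentioned here, below is a toy sketch of "iterate until gamma barely changes". It is not gensim's code: the parameter names `iterations` and `gamma_threshold` mirror the `LdaModel` arguments, but `infer_until_stable`, `toy_update` and `target` are hypothetical stand-ins, and the update is a simple contraction rather than the real variational update.

```python
def infer_until_stable(gamma, update, iterations=50, gamma_threshold=0.001):
    """Apply `update` repeatedly; stop early once gamma barely moves.

    Returns (gamma, iterations_used, converged).
    """
    for n in range(1, iterations + 1):
        new_gamma = update(gamma)
        # Mean absolute change between successive gamma vectors.
        mean_change = sum(abs(a - b) for a, b in zip(new_gamma, gamma)) / len(gamma)
        gamma = new_gamma
        if mean_change < gamma_threshold:
            return gamma, n, True
    # Iteration budget exhausted before the change dropped below the threshold.
    return gamma, iterations, False


# Toy stand-in for the real update: move halfway toward a fixed target.
target = [2.0, 0.5, 0.1]


def toy_update(gamma):
    return [g + 0.5 * (t - g) for g, t in zip(gamma, target)]


gamma, used, converged = infer_until_stable([1.0, 1.0, 1.0], toy_update)
print(used, converged)
```

If the iteration budget runs out first, the document simply counts as not converged, which is the situation the "increase passes or iterations" warning is about.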

It seems though that you keep adding new documents to the training set. That will make a new model with new topics. So one cannot say that new documents are not needed just because one new chunk didn't change the topics. Next two chunks could have changed the topics a lot.

By the way, MAX_ITER is now just called `iterations`.

2. Enabling DEBUG logging will show the "n/m documents converged" message.

Let me know if this answered your questions,
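For reference, a minimal way to turn on DEBUG logging with the standard `logging` module before training. The format string is only a suggestion; the logger name "gensim" relies on gensim modules creating their loggers with `logging.getLogger(__name__)`, so they all sit under that namespace.

```python
import logging

# Send all log records, including DEBUG, to the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(name)s : %(message)s",
    level=logging.DEBUG,
)

# Explicitly lower the threshold for everything under the "gensim" namespace,
# in case the root logger was already configured at a higher level elsewhere.
logging.getLogger("gensim").setLevel(logging.DEBUG)
```

This has to run before the model is trained, otherwise the per-chunk messages are emitted while the level is still too high and are lost.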

Lev

Abhinav Prakash

Dec 4, 2016, 10:26:49 AM

to gen...@googlegroups.com

Hi Lev,

Thanks for your mail. It solved a few things, and your post was a good help.

1. It seems though that you keep adding new documents to the training set. That will make a new model with new topics. So one cannot say that new documents are not needed just because one new chunk didn't change the topics. Next two chunks could have changed the topics a lot.

That's correct, it changes the model. So topic_diff wouldn't be reliable when chunksize is less than the number of documents. For now, I am using the pyLDAvis library to get a more visual understanding of my model. Somehow, CoherenceModel throws a "model unsupported" error, but I will come back to that. pyLDAvis in combination with iterations is serving the purpose for now.

2. Debug logging is enabled in your package anyway. But the point of the len(chunk) > 1 check is going over my head for now. I will come back to it after reading through the code more deeply.


Abhinav Prakash

Dec 4, 2016, 10:28:54 AM

to gen...@googlegroups.com

What I am actually trying to do is use LDA over time on a dynamic corpus, one that changes as new documents are fed into it. New documents would be fed into it every day. Hence the pain for now.

Szymon Talaga

Apr 12, 2019, 9:20:49 AM

to Gensim

Hi Abhinav Prakash,

Sorry to refer to a conversation from two years ago, but I am wondering whether the number of converging documents is reported by LdaMulticore. At the moment I believe it is not reported (I have logging set to DEBUG level), but it would be great if someone could confirm it.

Best regards,

Szymon Talaga.
