Having HDP-LDP determine number of topics


David Rudel

unread,
Oct 29, 2017, 3:26:13 PM10/29/17
to gensim
Everything I have read about HDP-LDA says that it can infer the number of topics from a given data set.

How do you do this with the gensim version?

I've noticed that this thread indicates that "T" is not equivalent to the number of topics found by the process. I thought "T" might be a maximum for the number of topics eventually selected, but this appears untrue. I did a test run with a small corpus (1,500 documents) on a tiny vocabulary (2,500 features) and set T=4000.

When I then apply get_topics() to the resulting model, I still get 4000 topics present. I would have expected far fewer to be selected.

If "T" is not equivalent to topics found, how do they differ?

Since the documentation says that the number of topics selected for the approximate LDA via suggested_lda_model() is set equal to m_T, and the code has self.m_T = T, it seems that this call is not intended to purge unrefined topics.

This thread indicates others are having the same results. 

Based on output shown here, the C++ implementation does dynamically find the number of topics for the data.


Ivan Menshikh

unread,
Oct 30, 2017, 5:57:46 AM10/30/17
to gensim
Hi David,

Please ask the author of the HDP model on GitHub.

David Rudel

unread,
Oct 31, 2017, 8:02:31 PM10/31/17
to gensim
He isn't easy to get hold of. I eventually sent him a tweet on Twitter, since I didn't see a mechanism on GitHub for sending messages.

Ivan Menshikh

unread,
Nov 1, 2017, 2:14:18 AM11/1/17
to gensim
You can also try pinging him in the original PR.

Christoph Winkler

unread,
Nov 1, 2017, 8:38:20 AM11/1/17
to gensim
I have the same problem. I don't understand how to find the optimal number of topics using gensim's HDP.

Christoph Winkler

unread,
Nov 3, 2017, 6:18:03 AM11/3/17
to gensim
If there is a solution, can you post it here, please? I can't contact the author and I am not on Twitter.



David Rudel

unread,
Nov 4, 2017, 4:48:18 PM11/4/17
to gensim
If I'm interpreting Wang's implementation correctly, the number of potential topics starts out as K and is gradually reduced by HDPState::compact_hdp_state.

In Wang's write-up, he gives an indication of this: the algorithm found only 110 topics when given an initial K=150.

It appears that Wang drops topics when the mechanisms of the variational inference cause the topics to have no words.

At each step, num_topics is essentially decremented whenever word_counts_by_topic_[k] == 0.

(The code does not explicitly decrement; it loops through topic indexes and essentially skips over index k unless word_counts_by_topic_[k] > 0. If it then finds it has skipped an index, it removes the associated topic from the model and, at the end, resets num_topics to the number of topics that were not skipped.)
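For illustration, a rough Python analogue of that compaction step. This is my own sketch with hypothetical names mirroring Wang's word_counts_by_topic_, not actual code from either implementation:

```python
def compact_topics(word_counts_by_topic, topic_word_counts):
    """Drop topics whose total word count is zero and renumber the rest."""
    kept = [k for k, count in enumerate(word_counts_by_topic) if count > 0]
    new_totals = [word_counts_by_topic[k] for k in kept]
    new_rows = [topic_word_counts[k] for k in kept]
    num_topics = len(kept)  # reset to the number of topics not skipped
    return new_totals, new_rows, num_topics

# Two of the five candidate topics have lost all their words:
totals = [10, 0, 7, 0, 3]
rows = [[5, 5], [0, 0], [4, 3], [0, 0], [1, 2]]
new_totals, new_rows, num_topics = compact_topics(totals, rows)
# num_topics -> 3
```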

I cannot find anywhere in the gensim code where num_topics (m_T) is changed.

One thing you might do is search through the topics found and see if any are empty (i.e., they have no words).

I don't understand HDP well enough to know whether that makes any sense; I'm just taking a quick look at the code.

jonathan....@gmail.com

unread,
Nov 10, 2017, 3:23:00 AM11/10/17
to gensim
K and T are both "truncation" parameters, so they do limit the total number of topics that will be found at the corpus level (K) and the document level (T). How many of these end up being used is influenced by the concentration parameters gamma and alpha. Unfortunately, the code doesn't prune empty topics out of the model, so you will end up with K and T topics, but some of them are "empty" topics, and often it just isn't clear where to draw the line between sparsely represented topics and empty ones. I can't offer any advice better than David's.

You could also try to contact the author of the original paper (Chong Wang). The code here was adapted from his original work.
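As a rough sketch of the pruning idea discussed above (assuming topic-word weights come back as one row per topic, e.g. in the shape get_topics() returns; the threshold is arbitrary, and choosing it is exactly the hard part just mentioned):

```python
def nonempty_topics(topic_word, threshold=1e-8):
    """Return indices of topics whose total word weight exceeds threshold.

    topic_word: a list of rows, one per topic. The threshold separating
    "sparsely represented" from "empty" topics is an arbitrary choice.
    """
    return [k for k, row in enumerate(topic_word) if sum(row) > threshold]

topics = [
    [0.4, 0.3, 0.3],      # a real topic
    [0.0, 0.0, 0.0],      # an "empty" truncation slot
    [1e-12, 0.0, 1e-12],  # effectively empty
]
# nonempty_topics(topics) -> [0]
```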


Christoph Winkler

unread,
Nov 11, 2017, 6:04:41 AM11/11/17
to gensim
Thank you very much for your help, David and Jonathan. 
I've tested Wang's implementation in C++. The number of topics is both increased and decreased, but overall the number of topics keeps growing. When I use 2,000 documents, the number of topics is still growing after 1,000 iterations. It looks like HDP is not very stable, at least for 2,000 documents.

David Rudel

unread,
Nov 11, 2017, 3:15:05 PM11/11/17
to gensim
Does Wang's implementation print out the number of topics at each iteration?

How do these "number of topic" values compare to the initial and current values for K?

I'm pretty sure that the value of K (first-level truncation) is never increased. I don't see how it could be, and I believe his paper explicitly says that it does not allow for it. In one of his examples, K was set to 150 but the model ended up finding only 110 topics after training.

Christoph Winkler

unread,
Nov 12, 2017, 2:00:14 PM11/12/17
to gensim
Yes, it does. There is a log file in which you can see the number of topics at each iteration. K is fixed; I think that is correct. I used the default configuration. 1,000 iterations take quite a long time, and the number of topics increases very slowly. Maybe it ends up with 120 or 130 topics after 5,000 iterations, but you can never be sure the end has been reached.

David Rudel

unread,
Nov 13, 2017, 2:17:05 AM11/13/17
to gen...@googlegroups.com
Christoph,
That is very interesting! The compact_hdp_state code I referenced earlier can only decrease the number of topics; it cannot increase it.
I wonder if there is somewhere else in the code where num_topics can be increased. I should take a look.

Which version of the code are you running? I was looking at one that I assumed to be identical to Wang's, but perhaps it was not.

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/Eqkx942kBRU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

"A false conclusion, once arrived at and widely accepted is not dislodged easily, and the less it is understood, the more tenaciously it is held." - Cantor's Law of Preservation of Ignorance.

Christoph Winkler

unread,
Nov 13, 2017, 3:56:24 AM11/13/17
to gensim
I've tested these two implementations: https://github.com/blei-lab/hdp
And I've tested this one: https://github.com/renaud/hdp-faster

I couldn't see any difference in the results, except that hdp-faster really is faster.

Christoph Winkler

unread,
Nov 13, 2017, 4:03:06 AM11/13/17
to gensim
I sent you the log file by mail.

Zeyd Boukhers

unread,
Feb 6, 2019, 10:11:22 AM2/6/19
to Gensim
Hi everyone, 
Is there any paper one can rely on for this purpose? 
Thank you! 
 

P. Werner

unread,
Feb 6, 2019, 11:52:04 AM2/6/19
to gen...@googlegroups.com
Hi, 

what exactly are you trying to do?
What do you mean by LDP?

Do you mean LDA (latent Dirichlet allocation), for which you need to know the number of topics?

And I guess by HDP you mean the hierarchical Dirichlet process?
The hierarchical Dirichlet process doesn't need to know the number of topics. It was built to infer them.

Kind regards 



Zeyd

unread,
Feb 7, 2019, 3:48:23 AM2/7/19
to Gensim
@Werner, 

That is exactly the question: how to infer the number of topics using the hierarchical Dirichlet process. In the paper (Online Variational Inference for the Hierarchical Dirichlet Process), it is not explicitly mentioned how to infer them. In §3.1, it is mentioned that variational inference is used for doing so, but it is still unclear.

Do you have an idea?
Thank you!

Raf Td

unread,
Jun 26, 2019, 4:48:13 PM6/26/19
to Gensim
To the best of my knowledge, HDP models that use variational Bayesian inference methods relying on truncated stick-breaking representations of the DP require specifying two levels of truncation for the dynamically inferred number of topics k: a corpus-level (top-level) truncation K and a second, document-level truncation T. The variational method then infers a smaller number of mixture components within the specified allowed range according to the data. The relationship between K and T is T << K, "because in practice each document Gj requires far fewer topics than those needed for the entire corpus (i.e., the atoms of G0)" [from Chong Wang, John Paisley and David M. Blei, 'Online Variational Inference for the Hierarchical Dirichlet Process', AISTATS 2011].
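The truncated stick-breaking construction behind those truncation levels can be illustrated with a small sketch. This is my own illustration under a Beta(1, gamma) stick-breaking assumption, not code from the paper: with K sticks, most of the mass typically concentrates on the first few, which is why far fewer than K topics end up meaningfully used.

```python
import random

def truncated_stick_breaking(gamma, K, rng):
    """Draw K weights from a stick-breaking process truncated at K sticks."""
    weights, remaining = [], 1.0
    for _ in range(K - 1):
        b = rng.betavariate(1.0, gamma)  # break off a fraction of the stick
        weights.append(b * remaining)
        remaining *= 1.0 - b
    weights.append(remaining)  # the last stick absorbs the leftover mass
    return weights

w = truncated_stick_breaking(gamma=1.0, K=150, rng=random.Random(0))
# len(w) == 150 and the weights sum to 1
```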

Consequently, the result of the HDP algorithm is a probability distribution over the predefined range of topics. I obtained the probability distribution over topics (the alpha vector) using the hdp_to_lda method; not the best way, I know.

Hope it helps in any way,
Cheers,
Rafi.