Radim Řehůřek <me@...> writes:
>
>
> Hello Kuang,
> `bound()` has been refactored since then, so you should be able to compare
> its scores across different numbers of topics.
>
> Comparing on "same corpus" still stands (though if your test corpora is
large & representative enough, the difference in scores shouldn't be large
anyway, so that's no problem either).
>
> HTH,
> Radim
>
> On Wednesday, October 15, 2014 4:25:06 PM UTC+2, Kuang wrote:
> Hello,
> After reading others' discussions, I found the answer to an inquirer's
> question about finding a reasonable number of topics using the return
> values of bound():
> "The values coming out of `bound()` depend on the number of topics (as
well as
> number of words), so they're not comparable across different num_topics
(or
> different test corpora)."
> Based on my interpretation, this means that the outputs of bound() are
> comparable only when (1) num_topics is fixed and (2) the same corpus is used.
> My question is (and I believe many others have the same question): how do I
> generate perplexity values from the same corpus that are comparable across
> different num_topics (and produce figures similar to figure 9 in Blei et al.
> 2003)? I'd like to identify an appropriate number of topics for my topic
> model. Are there other ways of doing that besides calculating perplexities?
> Thank you very much!
>
Hello Radim,
Thank you for your kind reply. I am asking this question because I, like
some other posters, also observed an increase in (per-word) perplexity
values as num_topics increases, which is counter-intuitive.
I tried Hoffman's online LDA code to analyze wiki data and found that the
perplexity values (computed from the return values of bound() in his code)
also increase as the number of topics increases. This made me think that the
perplexity values are probably not comparable across num_topics. Any comments?
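For reference, this is roughly how I'm computing the per-word perplexity from
`bound()` (a minimal sketch, not necessarily the recommended recipe;
`train_corpus`, `test_corpus` and `dictionary` stand in for my actual data):

```python
# Minimal sketch: per-word perplexity from gensim's LdaModel.bound(),
# holding the held-out corpus fixed while varying num_topics.
# `train_corpus`, `test_corpus`, and `dictionary` are placeholders.
import numpy as np
from gensim.models import LdaModel

def per_word_perplexity(model, held_out):
    # bound() returns a variational lower bound on log p(held_out) in nats;
    # dividing by the token count and exponentiating gives per-word perplexity.
    num_tokens = sum(count for doc in held_out for _, count in doc)
    return np.exp(-model.bound(held_out) / num_tokens)

for k in (10, 25, 50, 100):
    lda = LdaModel(corpus=train_corpus, id2word=dictionary,
                   num_topics=k, passes=5)
    print(k, per_word_perplexity(lda, test_corpus))
```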
Thanks a lot.
(btw, I ended up using Christopher Grainger's approach to calculate the
values of symmetric KL divergence)
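In case it helps anyone, this is roughly the kind of symmetric-KL computation
I mean, a sketch in the spirit of Arun et al. (2010); I can't promise it
matches Christopher's script exactly, and `corpus` is again a placeholder:

```python
# Hedged sketch of a symmetric-KL criterion for choosing num_topics,
# in the spirit of Arun et al. (2010). Lower values suggest a better fit.
# Not necessarily identical to Christopher Grainger's script.
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    # Symmetric KL divergence between two (unnormalized) distributions.
    p = p / p.sum()
    q = q / q.sum()
    return (np.sum(p * np.log((p + eps) / (q + eps))) +
            np.sum(q * np.log((q + eps) / (p + eps))))

def arun_score(lda, corpus):
    # Singular values of the (num_topics x vocab) topic-word matrix ...
    cm1 = np.linalg.svd(lda.get_topics(), compute_uv=False)
    # ... versus topic proportions induced by document lengths.
    gamma, _ = lda.inference(corpus)
    theta = gamma / gamma.sum(axis=1, keepdims=True)
    doc_lengths = np.array([sum(cnt for _, cnt in doc) for doc in corpus])
    cm2 = doc_lengths @ theta
    return symmetric_kl(np.sort(cm1)[::-1], np.sort(cm2)[::-1])
```

The idea is to evaluate this over a range of num_topics and look for a
minimum, rather than relying on perplexity alone.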
sincerely,
Kuang