Hi All, I'd sincerely appreciate some guidance on perplexity calculations for LDA models I'm building. My problem is as follows.
Background:
* Dictionary has 100k terms (i.e., relatively large)
* The corpus has a few million documents, divided into three roughly equal-sized classes: A, B, and C.
* Based on our human understanding, documents in class A are more similar to those in class B than to those in class C.
* I built an LDA model on the class-A documents, and the results look intuitively reasonable when I inspect the document-topic and topic-word distributions.
My problem:
Given the class-A model, I'd expect a document drawn from class C to have a higher perplexity than a document from class A or class B. However, this is not what I see: they all appear to score about the same. That is, the `bounds()` method of the LDA model gives me approximately the same large, negative number for documents drawn from any of the three classes (see the sketch below for roughly how I'm computing this).
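For concreteness, here is roughly how I'm scoring held-out documents. This is a minimal sketch assuming a gensim-style `LdaModel` with a `bound()` method; the helper name, the variable names, and the natural-log interpretation of the bound are my own assumptions, so please correct me if any of this is off:

```python
import math

from gensim.corpora import Dictionary
from gensim.models import LdaModel


def score_documents(lda: LdaModel, dictionary: Dictionary, tokenized_docs):
    """Score held-out documents under a trained LDA model.

    Returns (per-word bound, perplexity-style value). bound() gives the
    total variational lower bound for the whole chunk, so I divide by the
    token count to make documents of different lengths comparable.
    """
    bows = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]
    n_tokens = sum(count for bow in bows for _, count in bow)
    per_word = lda.bound(bows) / n_tokens
    # Assuming the bound is in nats; exponentiate to get a perplexity scale.
    return per_word, math.exp(-per_word)


# What I expected to see (illustrative names; lda_a is my class-A model):
# for label, docs in [("A", docs_a), ("B", docs_b), ("C", docs_c)]:
#     bound, ppl = score_documents(lda_a, dictionary, docs)
#     print(f"class {label}: per-word bound {bound:.3f}, perplexity {ppl:.1f}")
```

If normalizing the bound by token count like this is the wrong way to turn it into a perplexity, that may be part of my confusion.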
So, I'm embarrassed to ask: am I correct that the `bounds()` method is giving me the perplexity? And is there an alternative way to estimate the likelihood that a model would produce a given document?
Thanks!
Kyle