>> Short version: I have an HDP model. Calling myHDPModel.hdp_to_lda() returns
>> a 2-tuple of an array of floats and an array of arrays of floats, not a
>> gensim.ldamodel.LdaModel. I want a gensim.ldamodel.LdaModel so that I can
>> query the transformation interface to get the LDA topics of a particular
>> document. How can I get the HDP-mostly-equivalent LDA model so that I can
>> query for the topics of a specific document?
>> While that runs, I am trying to use HDP to generate a topic model as well.
>> HDP doesn't support the [] transformation interface, so I figured I would
>> use myHDPModel.hdp_to_lda() to get an LDA model, and then query that
>> instead. As mentioned above, the return value is not an LdaModel object,
>> although I suspect there is some way to make one from it, or use it in one.
>
> that code was written by Jonathan (CC), so I had to have a look :)
>
> You are completely right, `hdp_to_lda` only returns alpha and beta,
> not a proper LdaModel.
>
> The way to create an LdaModel from these arrays would be:
>
> lda = gensim.models.LdaModel(id2word=hdp.id2word,
>                              num_topics=len(alpha),
>                              alpha=alpha, eta=hdp.m_eta)
> lda.expElogbeta = numpy.array(beta, dtype=numpy.float32)
>
>
> Let me know if that works for you. Or better yet, if it does work,
> update the `hdp_to_lda` method and send me a pull request on github :)
Thanks, that looks like it works; expect a pull request shortly. However...
> Re. poor HDP results -- with only 870 documents, minor differences in
> preprocessing, tokenization etc. will have a big effect. Make sure you
> give the model enough training -- see the `max_time` or `max_chunks`
> parameters. In any case, with so few documents, I wouldn't expect the
> statistical methods to give any particularly amazing results...
This looks like it is the case, at least with my current
preprocessing. At the moment I'm not doing stemming/lemmatization, but
I am removing stopwords (a common list, plus the top few percent of
the most common words extracted from my corpus; for some reason
"most" was not in the common list, go figure). I then lowercase
what's left, strip punctuation, and run TF-IDF on the corpus
before training the models.
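For concreteness, that pipeline (minus the TF-IDF step) can be sketched with the standard library alone; the `STOPWORDS` set, the `documents` list, and the 3% cutoff below are illustrative stand-ins, not my actual values:

```python
import re
from collections import Counter

# Illustrative stand-ins for a real stopword list and real documents.
STOPWORDS = {"the", "a", "of", "and", "most"}
documents = ["Most of the malt was agile.", "The titles and the malt."]

# Lowercase everything and strip punctuation while tokenizing.
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]

# Drop stopwords plus the top few percent most frequent corpus words.
counts = Counter(tok for doc in tokenized for tok in doc)
cutoff = max(1, int(len(counts) * 0.03))  # top ~3% by frequency
too_common = {w for w, _ in counts.most_common(cutoff)}
cleaned = [[t for t in doc if t not in STOPWORDS | too_common]
           for doc in tokenized]
```

From `cleaned` one would then build a gensim dictionary/corpus and apply a TF-IDF transform before training.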
I end up with a lot of topics like
0.001*interlocks + 0.001*forehead + 0.000*kelly + 0.000*stricture +
0.000*agile + 0.000*titles + 0.000*malt...
with very low weights for most of the contributing words. I think these
are what's giving me a few topics with ~1/4 of the corpus in them.
At the moment, I'm considering trying to find correspondences between
my LDA and HDP topics and letting them both vote, or just trying a
different method.
-Abe