HDP topics and hdp_to_lda()

Noam DePlume

unread,
Dec 5, 2012, 5:36:11 PM12/5/12
to gen...@googlegroups.com
Hi all,


Short version: I have an HDP model. Calling myHDPModel.hdp_to_lda() returns a 2-tuple of an array of floats and an array of arrays of floats, not a gensim.ldamodel.LdaModel. I want a gensim.ldamodel.LdaModel so that I can query the transformation interface to get the LDA topics of a particular document. How can I get the HDP-mostly-equivalent LDA model so that I can query for the topics of a specific document?

Long version:
I have a collection of 870 documents that I have written on various blogs and wikis and whatnot over the years. I want to infer the topics, and tag each entry with a set of topics. Initially, the topic names don't matter, but I want to be able to look at the results and see that the documents included in each topic are actually about the same thing.

The way I have been trying to do this is training an LDA model on my corpus, and then using myLDAModel[document] to get the topics that the model thinks the document is about, and tagging the document accordingly.
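In code, the tagging step looks roughly like this. The model call itself is gensim's transformation interface (myLDAModel[bow] returns (topic_id, probability) pairs); the 0.2 threshold and the helper name are just my own illustration, not anything from gensim:

```python
def tag_document(topic_dist, threshold=0.2):
    """Keep only topics the model assigns with probability >= threshold."""
    return sorted(topic_id for topic_id, prob in topic_dist if prob >= threshold)

# e.g. a distribution shaped like what myLDAModel[doc_bow] returns:
dist = [(3, 0.55), (17, 0.30), (8, 0.05)]
print(tag_document(dist))  # [3, 17]
```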

So far so good, but I tried with 100 topics and got results that didn't seem related. I suspect that was too many topics, so some of them captured only very loose associations. I'm now trying a reduced number (35, to be exact). On the other hand, 100 may not have been enough, leaving insufficient separation; I don't know the model and process well enough to have a good intuition for this. Also potentially of interest: the HDP model ends up with 150 topics, or at least prints 150 topics when I call myHDPmodel.print_topics(topics=-1). Would that be a good estimate of the number of topics to use when training the LDA model?

While that runs, I am trying to use HDP to generate a topic model as well. HDP doesn't support the [] transformation interface, so I figured I would use myHDPModel.hdp_to_lda() to get an LDA model, and then query that instead. As mentioned above, the return value is not an LdaModel object, although I suspect there is some way to make one from it, or use it in one.

Thanks,
Abe

Radim Řehůřek

unread,
Dec 8, 2012, 2:10:32 PM12/8/12
to gensim, jonathan....@gmail.com
Hello Abe,
that code was written by Jonathan (CC), so I had to have a look :)

You are completely right, `hdp_to_lda` only returns alpha and beta,
not a proper LdaModel.

The way to create an LdaModel from these arrays would be:

import numpy
import gensim

alpha, beta = hdp.hdp_to_lda()
lda = gensim.models.LdaModel(id2word=hdp.id2word,
                             num_topics=len(alpha),
                             alpha=alpha, eta=hdp.m_eta)
lda.expElogbeta = numpy.array(beta, dtype=numpy.float32)


Let me know if that works for you. Or better yet, if it does work,
update the `hdp_to_lda` method and send me a pull request on github :)

Re. poor HDP results -- with only 870 documents, minor differences in
preprocessing, tokenization etc. will have a big effect. Make sure you
give the model enough training -- see the `max_time` or `max_chunks`
parameters. In any case, with so few documents, I wouldn't expect the
statistical methods to give any particularly amazing results...

Cheers,
Radim



Abe S.

unread,
Dec 9, 2012, 7:37:18 PM12/9/12
to gen...@googlegroups.com
>> Short version: I have an HDP model. Calling myHDPModel.hdp_to_lda() returns
>> a 2-tuple of an array of floats and an array of arrays of floats, not a
>> gensim.ldamodel.LdaModel. I want a gensim.ldamodel.LdaModel so that I can
>> query the transformation interface to get the LDA topics of a particular
>> document. How can I get the HDP-mostly-equivalent LDA model so that I can
>> query for the topics of a specific document?

>> While that runs, I am trying to use HDP to generate a topic model as well.
>> HDP doesn't support the [] transformation interface, so I figured I would
>> use myHDPModel.hdp_to_lda() to get an LDA model, and then query that
>> instead. As mentioned above, the return value is not an LdaModel object,
>> although I suspect there is some way to make one from it, or use it in one.
>
> that code was written by Jonathan (CC), so I had to have a look :)
>
> You are completely right, `hdp_to_lda` only returns alpha and beta,
> not a proper LdaModel.
>
> The way to create an LdaModel from these arrays would be:
>
> lda = gensim.models.LdaModel(id2word=hdp.id2word,
> num_topics=len(alpha), alpha=alpha, eta=hdp.m_eta)
> lda.expElogbeta = numpy.array(beta, dtype=numpy.float32)
>
>
> Let me know if that works for you. Or better yet, if it does work,
> update the `hdp_to_lda` method and send me a pull request on github :)

Thanks, that looks like it works, expect a pull request shortly. However...

> Re. poor HDP results -- with only 870 documents, minor differences in
> preprocessing, tokenization etc. will have a big effect. Make sure you
> give the model enough training -- see the `max_time` or `max_chunks`
> parameters. In any case, with so few documents, I wouldn't expect the
> statistical methods to give any particularly amazing results...

This looks like it is the case, at least with my current
preprocessing. At the moment I'm not doing stemming/lemmatizing, but
I am removing stopwords (a common list, plus the top few percent of
the most frequent words extracted from my corpus; for some reason
"most" was not in the common list, go figure). I lowercase everything
that's left, strip punctuation, and apply TF-IDF to the corpus
before training the models.
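A rough sketch of that pipeline: lowercase, strip punctuation, drop a stopword list plus the most frequent corpus words. The tiny stopword list and the 2% cutoff here are illustrative stand-ins for what I actually use:

```python
import string
from collections import Counter

# Toy stopword list; note "most" is added by hand, as mentioned above.
STOPWORDS = {"the", "a", "and", "of", "most"}

def tokenize(text):
    """Lowercase and strip punctuation, then split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def preprocess(docs, top_fraction=0.02):
    """Drop stopwords plus the top_fraction most frequent corpus words."""
    tokenized = [tokenize(d) for d in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    n_top = max(1, int(len(counts) * top_fraction))
    too_common = {w for w, _ in counts.most_common(n_top)}
    drop = STOPWORDS | too_common
    return [[w for w in doc if w not in drop] for doc in tokenized]

docs = ["The cat sat on the mat.", "The dog ate the cat food."]
print(preprocess(docs))  # [['cat', 'sat', 'on', 'mat'], ['dog', 'ate', 'cat', 'food']]
```

(The TF-IDF step happens afterwards, on the bag-of-words corpus, before training.)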

I end up with a lot of topics like
0.001*interlocks + 0.001*forehead + 0.000*kelly + 0.000*stricture +
0.000*agile + 0.000*titles + 0.000*malt...
with very low weights for most of the contributing words. I think
these are what's giving me a few topics containing ~1/4 of the corpus.

At the moment, I'm considering trying to find correspondences between
my LDA and HDP topics and let them both vote, or just trying a
different method.
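One way I might line up the LDA and HDP topics before letting them "vote": match each LDA topic to its most similar HDP topic by cosine similarity over the topic-word distributions. The toy 4-word vocabularies and the greedy matching rule below are my own sketch, not a gensim feature:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_topics(lda_topics, hdp_topics):
    """For each LDA topic, the index of the most similar HDP topic."""
    return [max(range(len(hdp_topics)), key=lambda j: cosine(t, hdp_topics[j]))
            for t in lda_topics]

# Toy topic-word distributions over a 4-word vocabulary:
lda_topics = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
hdp_topics = [[0.1, 0.1, 0.6, 0.2], [0.6, 0.2, 0.1, 0.1]]
print(match_topics(lda_topics, hdp_topics))  # [1, 0]
```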

-Abe