Merging dictionaries


Hefei

Apr 28, 2014, 10:37:44 AM
to gen...@googlegroups.com
Hi everyone,

I just started using gensim and Python, and the tutorials have been really helpful in setting up my analysis. I've been able to run LDA on a standard set of documents and an initial dictionary, but I'd like to add noun phrases to my dictionary as well. I've extracted these using NLTK, so I now have two dictionaries:
      1. Dictionary with single word tokens from the corpus
      2. Dictionary with phrases from the corpus

I'd like to run LDA on a combined dictionary of the two above, but the merge_with() function returns a "VocabTransform" object. I can use this to generate a corpus with updated counts for each word ID in the merged dictionary, but where I'm stuck is the "id2word" input for LDA (i.e., I can't use either of the original dictionaries, nor can I use the transformation object that's returned by merge_with()).

Can anyone recommend an approach to combine the two dictionaries into a third dictionary, or to extend a single dictionary with more items?

Thanks,
Hefei

Christopher Corley

Apr 28, 2014, 11:11:31 AM
to gensim
Excerpts from Hefei's message of 2014-04-28 09:37:44 -0500:
Yep! That's how you should use the VocabTransform. Now you can just
pass the dictionary you called merge_with() on as id2word to LDA; it is
the combined dictionary.

That is:
t = a.merge_with(b)
model = LdaModel(combined_corpus, id2word=a)

Now `a` is the actual combined dictionary, because merge_with() updates
it in place. You can check this by comparing len(a) before and after
merging.

Chris.

Hefei

Apr 28, 2014, 11:35:21 AM
to gen...@googlegroups.com
Oh I see - I didn't realize 'a' would be the combined dictionary.  Thanks so much for your help!

Radim Řehůřek

Apr 28, 2014, 1:26:48 PM
to gen...@googlegroups.com
I didn't realize either, thanks Chris!

It's really awesome to see there are people who know more about gensim than I do :)

Radim

Christopher Corley

Apr 28, 2014, 2:11:02 PM
to gensim
It's been the only way I've gotten a Dictionary object from a basic
Python dict built from a corpus.

i.e.,
id2word = gensim.corpora.Dictionary()
_ = id2word.merge_with(some_corpus.id2word)

Makes me wonder: why do all of the corpus classes build a plain Python
dict and not a gensim Dictionary? A Dictionary would be useful for
calling doc2bow!

Chris.
