LDA's id2word with custom bag of words


N T

Apr 14, 2018, 7:16:34 PM
to gensim
Hi there, 

I have built a document-term-matrix to train an LDA model without using the dictionary.doc2bow function. An example of the topic formula I get is as follows:
(0, '0.027*"260" + 0.023*"200" + 0.022*"560"'), based on num_words=3.

I would like to know which feature (name) each index represents. I know the id2word parameter would do the trick, but I have no dictionary to assign to it. The problem is that my features contain word combinations, e.g. 'pink rose', and I don't want corpora.Dictionary to treat it as separate words.

So I have a list of feature names corresponding to the indices in my document-term-matrix. I would like LDA to automatically translate the indices in the formula to the feature names, so that if I change num_words to 17, it immediately gives me a formula with names instead of indices.

Is there a way to do this, such as a custom dictionary in the format of corpora?
If it is not possible within gensim, how can I translate the indices 'manually'? I'm not very skilled in Python, so I hope someone can help me with this.

Regards,

Ivan Menshikh

Apr 15, 2018, 9:58:00 PM
to gensim
Hello

> and I don't want corpora.Dictionary to treat it as separate words.

That depends entirely on you: Dictionary expects a sequence of tokens as input (what those tokens look like is your responsibility), and the class doesn't split anything.

> I have a list of feature names corresponding to the indices in my document-term-matrix

Great, create a dict like {idx_1: "word_1", idx_2: "word_2", ...} and pass it as the id2word parameter to LdaModel; this will work.

example for demonstration:

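from gensim.models import LdaModel

# Map each column index of the document-term matrix to its feature name.
id2word = {0: "a", 1: "b", 2: "hello world"}
# Corpus in bag-of-words form: one list of (index, count) pairs per document.
corpus = [[(0, 2), (2, 1)], [(1, 1), (2, 1)]]
m = LdaModel(corpus, id2word=id2word, num_topics=2)

# Topics are now shown with feature names instead of indices.
print(m.show_topics(num_words=3))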
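
If your feature names are in a plain Python list ordered like the columns of your matrix, something like the following rough sketch should cover both building the dict and the 'manual' translation you asked about. The names feature_names, topic_terms and format_topic are made up for illustration, and the weights are invented; the (index, weight) pairs only mimic the shape that LdaModel.get_topic_terms returns.

```python
# Hypothetical feature names, in the same order as the matrix columns.
feature_names = ["tulip", "pink rose", "garden"]

# Build the index -> name mapping that can be passed as id2word.
id2word = dict(enumerate(feature_names))

# Made-up (index, weight) pairs, in the shape returned by
# LdaModel.get_topic_terms(topicid, topn).
topic_terms = [(1, 0.5), (0, 0.3), (2, 0.2)]


def format_topic(terms, id2word):
    """Render (index, weight) pairs as a weight*"name" formula string."""
    return " + ".join(
        '%.3f*"%s"' % (weight, id2word[idx]) for idx, weight in terms
    )


formula = format_topic(topic_terms, id2word)
print(formula)
```

Passing the same id2word dict to LdaModel should make the manual step unnecessary; the helper is only useful if the model has already been trained without it.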
