Hi,

I am getting funny results when I try to get a word's distribution over topics. I trained an LDA model over several MB of newswire (11k documents), and then asked for the distribution over topics of a document containing a single word. The output is something like:

2015-04-22 12:01:15,398 : INFO : LDA vector for rachel is [(35, 0.50499999999999989)]
2015-04-22 12:01:15,399 : INFO : LDA vector for badenhorst is [(70, 0.50499999596117906)]
2015-04-22 12:01:15,399 : INFO : LDA vector for cracking is [(13, 0.50499999999999989)]

However, the topics I get look reasonable. The only thing that is out of the ordinary is the topic weights, which are all the same.

2015-04-22 12:14:49,789 : INFO : topic #56 (0.010): 0.013*i + 0.010*race + 0.009*mansell + 0.007*car + 0.006*stage + 0.006*km + 0.006*indurain + 0.005*indy + 0.005*tour + 0.004*people
2015-04-22 12:14:49,818 : INFO : topic #55 (0.010): 0.010*north + 0.006*carter + 0.005*pyongyang + 0.005*prix + 0.005*we + 0.005*korea + 0.005*ford + 0.005*grand + 0.005*team + 0.005*france
2015-04-22 12:14:49,839 : INFO : topic #12 (0.010): 0.006*against + 0.006*soviet + 0.006*all + 0.005*former + 0.005*south + 0.005*world + 0.004*trial + 0.004*cup + 0.004*korea + 0.003*coup
2015-04-22 12:14:49,858 : INFO : topic #28 (0.010): 0.008*second + 0.008*south + 0.007*minutes + 0.007*th + 0.007*against + 0.007*half + 0.007*world + 0.007*only + 0.006*minute + 0.006*off
2015-04-22 12:14:49,867 : INFO : topic #4 (0.010): 0.025*north + 0.022*nuclear + 0.021*korea + 0.008*korean + 0.008*international + 0.008*states + 0.008*united + 0.007*agency + 0.007*iaea + 0.007*pyongyang
--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/5WrdTuA3IL8/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
I've got a follow-up question regarding API consistency.

The docstring for LsiModel.__getitem__ says "Return latent representation, as a list of (topic_id, topic_value) 2-tuples."

The docstring for LdaModel.__getitem__ says "Return topic distribution for the given document `bow`, as a list of (topic_id, topic_probability) 2-tuples."

To me that sounds like both are doing the same thing. The tutorial (https://radimrehurek.com/gensim/tut2.html) also says __getitem__ is how you get the distribution of a document over "topics" for an LsiModel:

lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model

I tried my example code with LsiModel instead of LdaModel and got a well-formed distribution over all 100 "topics", as I expected. I assumed the same would work for LdaModel, which it didn't.
Is this a bug in LdaModel, or is the API inconsistent? Why would one need to call lda.state.getLambda() instead of __getitem__?
On Wednesday, April 22, 2015 at 5:17:04 PM UTC+2, Miroslav Batchkarov wrote:

> I tried my example code with LsiModel instead of LdaModel and got a well-formed distribution over all 100 "topics", as I expected. I assumed the same would work for LdaModel, which it didn't.

Both return a list of `(topic_id, topic_weight_in_input_doc)` 2-tuples. Looks consistent to me :) Or what is the inconsistency?
> Is this a bug in LdaModel or the API inconsistent? Why would one need to call lda.state.getLambda() instead of __getitem__?

But these do different things! Lambda is a "vocab x topics" matrix, which tells you the prominence of each word in each topic. LDA has something similar -- the matrix of left singular values, U. These matrices are derived from your training corpus, and they don't depend on any particular "query" document.
__getitem__ gives you the topic distribution for an input document = query.
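[A toy sketch of one plausible reading of the original observation, assuming (as I believe is the case) that gensim's LdaModel.__getitem__ drops topics whose probability falls below a small cutoff, while LsiModel.__getitem__ returns the full dense projection. The function name and the threshold value here are illustrative, not gensim internals:]

```python
# Toy illustration (not gensim code): why LDA's __getitem__ output can
# look like a single (topic_id, weight) pair while LSI's looks dense.
# LSI returns the full projection; LDA omits near-zero topics.

def lda_style_output(theta, eps=0.01):
    """Return (topic_id, prob) pairs, omitting entries below eps,
    mimicking the sparse list an LDA transformation yields."""
    return [(i, p) for i, p in enumerate(theta) if p > eps]

# A 100-topic posterior for a one-word document: almost all mass on
# one topic, the remainder spread thinly below the threshold.
theta = [0.505 if i == 35 else 0.495 / 99 for i in range(100)]

print(lda_style_output(theta))  # -> [(35, 0.505)]
```

With the cutoff disabled (eps=0.0) the same vector would come back with all 100 entries, which matches what the LsiModel run showed.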
On 22 Apr 2015, at 19:51, Radim Řehůřek <m...@radimrehurek.com> wrote:
On Wednesday, April 22, 2015 at 5:17:04 PM UTC+2, Miroslav Batchkarov wrote:

> Both return a list of `(topic_id, topic_weight_in_input_doc)` 2-tuples. Looks consistent to me :) Or what is the inconsistency?

So we are back to my original question :) Why is there just one very prominent topic in all single-word queries in my example, given the topics look reasonable?

> But these do different things! Lambda is a "vocab x topics" matrix, which tells you the prominence of each word in each topic. LDA has something similar -- the matrix of left singular values, U. These matrices are derived from your training corpus, and they don't depend on any particular "query" document.

Did you mean LSI?
> __getitem__ gives you the topic distribution for an input document = query.

If __getitem__ is the right way to get the topic distribution for an input document, why did you point me towards getLambda? Isn't the topic distribution for an input document the same as the prominence of the word in each topic when the query document consists of a single word (or at least proportional to it)?
On 22 Apr 2015, at 23:48, Radim Řehůřek <m...@radimrehurek.com> wrote:
On Wednesday, April 22, 2015 at 11:09:31 PM UTC+2, Miroslav Batchkarov wrote:

> On 22 Apr 2015, at 19:51, Radim Řehůřek <m...@radimrehurek.com> wrote:
> Did you mean LSI?

Yes, LSI, sorry.

> If __getitem__ is the right way to get the topic distribution for an input document, why did you point me towards getLambda?
> Isn't the topic distribution for an input document the same as the prominence of a word in each topic when the query document consists of a single word (or at least proportional)?

No, it isn't. The LDA algorithm assigns one topic to one document word (usually called `z` in the LDA math). This is not the same as knowing the word has different propensity toward different topics (lambda).
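[The distinction may be easier to see numerically. A minimal sketch with made-up numbers (plain NumPy, not gensim; `lam` is a hypothetical 2-topic, 3-word lambda matrix): normalizing a word's lambda column gives the word's propensity toward each topic, which is a property of the trained model, whereas a document's topic distribution is inferred per query and smoothed by the Dirichlet prior alpha, so the two need not coincide.]

```python
import numpy as np

# Toy numbers (not from a trained model): contrast lambda with a
# document's inferred topic distribution.

# lambda: one row per topic, one column per vocabulary word.
lam = np.array([[10.0, 1.0, 1.0],   # topic 0
                [ 1.0, 5.0, 1.0]])  # topic 1

# Normalizing a word's column gives p(topic | word) under a uniform
# topic prior -- the word's "propensity" toward each topic.
word_id = 0
propensity = lam[:, word_id] / lam[:, word_id].sum()
print(propensity)  # roughly [0.909, 0.091]

# A one-word document's topic distribution is a different object:
# it is inferred per document, each token gets a (soft) topic
# assignment z, and the prior alpha smooths the result -- so it
# need not equal the normalized lambda column above.
```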