Excerpts from April's message of 2014-03-04 14:48:06 -0600:
> Hi all,
>
> Apologies in advance for what is probably a naive question. I just began
> using python/gensim and am pretty excited about it so far.
>
> My question is similar to one in a recent post -- I'm trying to understand
> how to generate non-gensim-specific output files from an LDA model.
> Specifically the following:
>
> (A) - a file containing #documents x #topics topic probabilities for each
> document
You're looking to do this:
1. Build the model
2. For each document in the corpus, infer its topic probabilities.
def get_doc_topic(corpus, model):
    doc_topic = list()
    for doc in corpus:
        # eps=0 so every topic is returned, not just the probable ones
        doc_topic.append(model.__getitem__(doc, eps=0))
    return doc_topic
This will return a list of d rows, one per document, where each row is a
list of (topic_id, probability) pairs. With eps=0 every one of the k
topics appears in each row, in topic-id order, so it is effectively a
d*k matrix.
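Since you mention later that you want a universal format readable in R: here is a minimal sketch of dumping that matrix to CSV with just the standard library (the doc_topic list below is a hand-made stand-in for the function's output; numpy.savetxt would also work once you strip the topic ids out):

```python
import csv

# Hand-made stand-in for get_doc_topic() output: one row per document,
# each row a list of (topic_id, probability) pairs as gensim returns them.
doc_topic = [
    [(0, 0.8), (1, 0.1), (2, 0.1)],
    [(0, 0.2), (1, 0.5), (2, 0.3)],
]

with open("doc_topic.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # header row built from the topic ids of the first document
    writer.writerow(["topic_%d" % tid for tid, _ in doc_topic[0]])
    for row in doc_topic:
        writer.writerow([prob for _, prob in row])
```

The resulting file reads straight into R with read.csv().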
> (B) - a file containing #topics x #words probabilities for each word, in
> each topic
def get_topic_to_wordids(model):
    p = list()
    for topicid in range(model.num_topics):
        # get_lambda() holds the raw (unnormalized) topic-word weights
        topic = model.state.get_lambda()[topicid]
        topic = topic / topic.sum()  # normalize to a probability distribution
        p.append(topic)
    return p
This will return a k*M matrix (a list of arrays), where the rows are the
topic ids, the columns are word ids for *every* word in your corpus, and
each value is the probability of that word within that topic.
That is, if l = get_topic_to_wordids(model), then l[2][400] is the
probability of word id 400 within topic 2. You can look the id up in the
corpus dictionary to figure out which word it is before writing to a file.
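For example, pairing words with their probabilities before writing them out (the id2word dict here is a hand-rolled stand-in for your gensim Dictionary; in real code you would use the model's own dictionary):

```python
# Stand-in for a gensim Dictionary: maps word ids to token strings.
id2word = {0: "apple", 1: "banana", 2: "cherry"}

# One row of the k*M matrix: probabilities indexed by word id.
topic_row = [0.5, 0.3, 0.2]

# Pair each word with its probability before writing to a file.
word_probs = [(id2word[word_id], prob)
              for word_id, prob in enumerate(topic_row)]
# word_probs is now [("apple", 0.5), ("banana", 0.3), ("cherry", 0.2)]
```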
> (C) - a file containing #topics x #words probabilities for each of some
> subset of target words
You can filter something like the above by passing in a set of word ids
you want (note the numpy.array wrapper -- a plain list comprehension has
no .sum() and can't be divided elementwise):

import numpy

def get_topic_to_subset(model, subset):
    p = list()
    for topicid in range(model.num_topics):
        topic = model.state.get_lambda()[topicid]
        # keep only the weights whose word id is in the subset
        topic = numpy.array([weight for word_id, weight in
                             enumerate(topic) if word_id in subset])
        topic = topic / topic.sum()  # normalize to a probability dist
        p.append(topic)
    return p
Note that this normalizes the weights over the *subset*, not the entire
topic. You can store the full sum first if you'd like the probabilities
to be relative to the entire topic:
def get_topic_to_subset(model, subset):
    p = list()
    for topicid in range(model.num_topics):
        topic = model.state.get_lambda()[topicid]
        s = topic.sum()  # sum over the *entire* topic, before filtering
        topic = numpy.array([weight for word_id, weight in
                             enumerate(topic) if word_id in subset])
        topic = topic / s  # probabilities relative to the whole topic
        p.append(topic)
    return p
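To make the difference between the two normalizations concrete, here is a toy row of lambda weights (made-up numbers) filtered to a subset and normalized both ways:

```python
# Toy unnormalized topic-word weights, indexed by word id (made-up numbers).
lam = [4.0, 1.0, 3.0, 2.0]
subset = {0, 2}

full_sum = sum(lam)  # 10.0, the sum over the entire topic
sub = [w for word_id, w in enumerate(lam) if word_id in subset]  # [4.0, 3.0]

# First version: sums to 1 over the subset (up to float rounding).
within_subset = [w / sum(sub) for w in sub]
# Second version: sums to the subset's total share of the topic's mass.
within_topic = [w / full_sum for w in sub]
```

Here within_topic sums to 0.7, i.e. the fraction of the topic's probability mass that falls on the subset.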
>
> The reason for the target words is because for research purposes I'm
> interested only in specific words that appear in the dictionary file and
> which may or may not be related to similar topics. So after training a
> model, I don't necessarily need to keep the vectors of all 50,000 or
> whatever tokens in the entire dictionary (though in some cases I might).
> For each model trained using various parameters, I would like to achieve
> persistency by saving the output in a more universal format than binary.
> Then later take these -- especially (C) -- and read them into, for
> example, R.
>
>
> Sorry if there is an obvious answer for this. I have a suspicion that most
> (all?) of these needs can be met using operations like numpy.savetxt, so if
> the answer is "go learn to use numpy", that is understandable :)
>
>
> Thanks for any help!
>
> April
>
I will leave the output format up to you, but these functions should get
you the probabilities you want. Hope this helps get you started!