Saving and subsetting model output


April

Mar 4, 2014, 3:48:06 PM
to gen...@googlegroups.com
Hi all,

Apologies in advance for what is probably a naive question. I just began using python/gensim and am pretty excited about it so far.

My question is similar to one in a recent post -- I'm trying to understand how to generate non-gensim-specific output files from an LDA model. Specifically the following:

(A) - a file containing #documents x #topics topic probabilities for each document 
(B) - a file containing #topics x #words probabilities for each word, in each topic 
(C) - a file containing #topics x #words probabilities for each of some subset of target words

The reason for the target words is that for research purposes I'm interested only in specific words that appear in the dictionary file and which may or may not be related to similar topics. So after training a model, I don't necessarily need to keep the vectors of all 50,000 or however many tokens are in the entire dictionary (though in some cases I might). For each model trained using various parameters, I would like to achieve persistence by saving the output in a more universal format than binary. Then later take these -- especially (C) -- and read them into, for example, R.


Sorry if there is an obvious answer for this. I have a suspicion that most (all?) of these needs can be met using operations like numpy.savetxt, so if the answer is "go learn to use numpy", that is understandable  :) 


Thanks for any help!
April

Christopher Corley

Mar 4, 2014, 4:49:00 PM
to gensim
Excerpts from April's message of 2014-03-04 14:48:06 -0600:
> Hi all,
>
> Apologies in advance for what is probably a naive question. I just began
> using python/gensim and am pretty excited about it so far.
>
> My question is similar to one in a recent post -- I'm trying to understand
> how to generate non-gensim-specific output files from an LDA model.
> Specifically the following:
>
> (A) - a file containing #documents x #topics topic probabilities for each
> document

You're looking to do this:

1. Build the model
2. For each document in the corpus, infer its topic probabilities.

def get_doc_topic(corpus, model):
    doc_topic = list()
    for doc in corpus:
        # eps=0 keeps every topic, even those with near-zero probability
        doc_topic.append(model.__getitem__(doc, eps=0))
    return doc_topic

This will return a d*k structure (a list of lists) where the nth entry
corresponds to the nth document in the corpus, and each entry is the full
list of (topic id, probability) pairs for that document -- eps=0 makes
sure no topics are dropped.
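To flatten those (topic id, probability) pairs into a dense matrix and write it out with numpy.savetxt, something like this sketch should work. Note that `doc_topic_to_matrix` and the toy `doc_topic` data are illustrative, not part of gensim:

```python
import numpy as np

def doc_topic_to_matrix(doc_topic, num_topics):
    """Turn per-document (topic_id, prob) pairs into a #docs x #topics array."""
    mat = np.zeros((len(doc_topic), num_topics))
    for i, pairs in enumerate(doc_topic):
        for topic_id, prob in pairs:
            mat[i, topic_id] = prob
    return mat

# toy stand-in for get_doc_topic() output: 2 documents, 3 topics
doc_topic = [[(0, 0.7), (1, 0.2), (2, 0.1)],
             [(0, 0.1), (1, 0.1), (2, 0.8)]]
mat = doc_topic_to_matrix(doc_topic, 3)
# plain CSV, readable from R with read.csv(..., header=FALSE)
np.savetxt("doc_topics.csv", mat, delimiter=",")
```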

> (B) - a file containing #topics x #words probabilities for each word, in
> each topic

def get_topic_to_wordids(model):
    p = list()
    for topicid in range(model.num_topics):
        topic = model.state.get_lambda()[topicid]
        topic = topic / topic.sum()  # normalize to probability dist
        p.append(topic)
    return p

This will return a k*M matrix (a list of NumPy arrays), where the rows
are the topic ids, the columns are the word ids for *every* word in your
corpus, and each value is the probability of that word within that topic.

That is, if l = get_topic_to_wordids(m), then l[2][400] is the
probability that word id 400 is in topic 2. You can index the corpus
dictionary to figure out which word it is before writing to a file.
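For writing that matrix out with the words as column labels, a sketch along these lines should do it. The toy `lam` array and `id2word` dict below stand in for `model.state.get_lambda()` and the corpus dictionary:

```python
import numpy as np

# toy stand-in for model.state.get_lambda(): 2 topics x 4 words
lam = np.array([[4.0, 1.0, 3.0, 2.0],
                [1.0, 1.0, 1.0, 7.0]])
topic_word = lam / lam.sum(axis=1, keepdims=True)  # row-normalize to probabilities

# toy id -> word mapping, standing in for the corpus Dictionary
id2word = {0: "cat", 1: "dog", 2: "fish", 3: "bird"}
header = ",".join(id2word[i] for i in range(lam.shape[1]))
np.savetxt("topic_word.csv", topic_word, delimiter=",",
           header=header, comments="")
```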

> (C) - a file containing #topics x #words probabilities for each of some
> subset of target words

You can filter something like the above by passing in a set of word ids
you want:

def get_topic_to_subset(model, subset):
    p = list()
    word_ids = sorted(subset)
    for topicid in range(model.num_topics):
        topic = model.state.get_lambda()[topicid]
        # NumPy fancy indexing keeps this an array, so the division works
        topic = topic[word_ids]
        topic = topic / topic.sum()  # normalize to probability dist
        p.append(topic)
    return p

Note that this normalizes the weights by the *subset*, not the entire
topic. You can store the sum if you'd like the probabilities to be for
the entire topic:

def get_topic_to_subset(model, subset):
    p = list()
    word_ids = sorted(subset)
    for topicid in range(model.num_topics):
        topic = model.state.get_lambda()[topicid]
        s = topic.sum()  # sum over the *entire* topic
        topic = topic[word_ids] / s  # normalize by the full topic
        p.append(topic)
    return p
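The difference between the two normalizations, shown on a toy lambda row:

```python
import numpy as np

lam = np.array([4.0, 1.0, 3.0, 2.0])  # toy lambda row for one topic
subset = [0, 2]                       # target word ids

# normalized over the subset: values sum to 1 within the subset
by_subset = lam[subset] / lam[subset].sum()  # [4/7, 3/7]

# normalized over the entire topic: keeps the topic-wide probabilities
by_topic = lam[subset] / lam.sum()           # [0.4, 0.3]
```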

I will leave the output format up to you, but these functions should
get the probabilities you want. Hope this helps get you started!

April

Mar 4, 2014, 5:41:15 PM
to gen...@googlegroups.com
Thank you for the quick reply! 

I will digest this and try it out. This makes sense; once the matrix formats are correct, presumably the output should be straightforward enough.

Much appreciated!
April

Christopher Corley

Mar 4, 2014, 5:45:20 PM
to gensim
Excerpts from April's message of 2014-03-04 16:41:15 -0600:
No problem. I'm like you: it was the *first* thing I wanted to do when
I switched to a gensim setup. So, I had all this ready to go. Was
going to write a blog post about it (eventually...).

Chris.

Christopher Corley

Mar 4, 2014, 5:51:56 PM
to gensim
Excerpts from April's message of 2014-03-04 16:41:15 -0600:
Also, a caveat on the functions for (C): the topic row must stay a NumPy
array for the division by the sum to work, so index it with a list of word
ids (e.g. topic[sorted(subset)]) rather than building a plain Python list
with a comprehension. I kind of wrote those on the fly, but that's the
general idea.

Chris.

Radim Řehůřek

Mar 9, 2014, 6:02:53 PM
to gen...@googlegroups.com
Thanks for the code snippets Chris!

The "recent post" April was referring to is probably this one: https://groups.google.com/forum/#!topic/gensim/7q3JOPX4Kbk

You can find similar recipes there, using NumPy 2D arrays instead of lists-of-lists.
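For instance, stacking the list-of-lists output from the snippets above into a single 2D NumPy array is a one-liner (the toy rows below stand in for the per-topic arrays from get_topic_to_wordids):

```python
import numpy as np

# rows of equal length, e.g. the per-topic arrays from get_topic_to_wordids()
rows = [np.array([0.4, 0.1, 0.3, 0.2]),
        np.array([0.1, 0.1, 0.1, 0.7])]
matrix = np.vstack(rows)  # 2-D array, shape (2, 4): topics x words
```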

Best,
Radim

--
Radim Řehůřek, Ph.D.
consulting @ machine learning, natural language processing, big data
 