Document-Topic Matrix as numpy array


Michael Haus

Apr 25, 2014, 6:32:04 AM4/25/14
to gen...@googlegroups.com
Hi,
 
I want to convert the LDA topic model output into a numpy array for further clustering. With gensim.matutils.corpus2dense, a corpus can be converted into a numpy array with terms as rows and documents as columns [terms x documents].
 
I want a numpy array of [documents x topic terms]. My implementation:
 
lda = models.LdaModel.load(path_lda_model)
corpus_tfidf = corpora.MmCorpus(path_corpus)
corpus_lda = lda[corpus_tfidf]
corpus_lda_dense = matutils.corpus2dense(corpus_lda, corpus_tfidf.num_terms, corpus_tfidf.num_docs)

 

corpus_lda_dense now contains all documents as columns and all terms as rows, with values in [0, 1]. Is it right that this array contains the topic words for every document? I mean, I converted the whole corpus to the LDA space, so all the terms are arranged into topics.

 

Radim Řehůřek

Apr 25, 2014, 7:20:38 AM4/25/14
to gen...@googlegroups.com
Hello Michael,


On Friday, April 25, 2014 12:32:04 PM UTC+2, Michael Haus wrote:

corpus_lda_dense = matutils.corpus2dense(corpus_lda, corpus_tfidf.num_terms, corpus_tfidf.num_docs)


Just pass `lda.num_topics` as the second parameter to corpus2dense().
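The effect of that second parameter can be seen in a minimal sketch that mimics corpus2dense's behavior with plain numpy (toy data and a hypothetical helper; real code would call gensim.matutils.corpus2dense directly):

```python
import numpy as np

def corpus2dense_sketch(corpus, num_rows, num_docs):
    """Mimic gensim.matutils.corpus2dense: each document is a sparse list of
    (row_id, weight) pairs; the result is a (num_rows, num_docs) array."""
    dense = np.zeros((num_rows, num_docs))
    for doc_no, doc in enumerate(corpus):
        for row_id, weight in doc:
            dense[row_id, doc_no] = weight
    return dense

# Toy LDA output: 3 documents, each a sparse list of (topic_id, probability).
num_topics = 5
corpus_lda = [
    [(0, 0.7), (3, 0.3)],
    [(1, 0.9), (4, 0.1)],
    [(2, 1.0)],
]

# Passing num_topics (not num_terms) as the row count gives a
# [topics x documents] matrix; transpose for [documents x topics].
doc_topic = corpus2dense_sketch(corpus_lda, num_topics, len(corpus_lda))
print(doc_topic.shape)  # (5, 3)
```

Passing num_terms instead would allocate rows for term ids, which is why the row count must match the id space actually present in the transformed corpus.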

 

 

corpus_lda_dense now contains all documents as columns and all terms as rows, with values in [0, 1]. Is it right that this array contains the topic words for every document? I mean, I converted the whole corpus to the LDA space, so all the terms are arranged into topics.


It contains topic weights for each document, yes (not "topic words").

Note that serializing the corpus into a numpy/scipy.sparse array is not scalable -- once your corpus gets large, you'll run out of memory. It's best to work directly with `lda[corpus]` if you can, bypassing numpy, to avoid these memory problems.
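The streaming approach Radim describes can be sketched with plain Python (toy sparse topic vectors standing in for `lda[corpus]`; the clustering step here is just a hypothetical "dominant topic" label):

```python
def top_topic(doc_topics):
    """Return the dominant topic id for one document's sparse
    (topic_id, weight) vector, or None for an empty document."""
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])[0]

# Stand-in for iterating over lda[corpus]: one document at a time,
# so memory stays flat no matter how large the corpus is.
corpus_lda = [[(0, 0.7), (3, 0.3)], [(1, 0.9)], []]
labels = [top_topic(doc) for doc in corpus_lda]
print(labels)  # [0, 1, None]
```

The point is that each document is consumed and discarded in turn; nothing forces the whole [documents x topics] matrix into memory at once.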

HTH,
Radim

 

 

Christopher Grainger

May 23, 2014, 10:25:58 AM5/23/14
to gen...@googlegroups.com
I tried passing lda.num_topics as the second parameter to corpus2dense() and it raised an IndexError:

IndexError: index 386 is out of bounds for size 50

Any idea why this might be the case?

Connie

Aug 19, 2014, 7:20:08 PM8/19/14
to gen...@googlegroups.com
Hi,

Have you found a solution for this problem? I am running into the same issue using gensim.matutils.corpus2dense(lda[corpus], num_terms=lda.num_topics, num_docs=corpus.num_docs).

By calling the following in a loop, doc_topic.append(lda.__getitem__(corpus[i], eps=0)), I was able to identify the items in the corpus that trigger the error: index 75641 is out of bounds for axis 1 with size 75620.

When I print corpora[item] for an item that draws the error message, I see that one element's id is bigger than 75620: (75641, 1.0).

I do not know if this is related, but it always seems to be the case that I get an error when an item's id is bigger than 75620. Of course, I am only seeing the items that have errors here.

Some characteristics of my corpus: MmCorpus(10269 documents, 76508 features, 3064091 non-zero entries).
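The mismatch described here (corpus feature ids exceeding the model's vocabulary size) can be detected up front with a short scan. This is a sketch with hypothetical toy ids; real code would compare against the trained model's dictionary size:

```python
# Size of the dictionary the model was trained with (from the error message).
vocab_size = 75620

# Toy corpus: each document is a sparse list of (term_id, weight) pairs.
corpus = [
    [(10, 2.0), (75619, 1.0)],   # all ids < vocab_size: fine
    [(75641, 1.0)],              # id >= vocab_size: triggers the IndexError
]

# Collect the documents whose ids fall outside the model's vocabulary.
bad_docs = [
    i for i, doc in enumerate(corpus)
    if any(term_id >= vocab_size for term_id, _ in doc)
]
print(bad_docs)  # [1]
```

Documents flagged this way contain terms the model has never seen, so applying the model to them indexes past the end of its term array.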

I would appreciate any hints towards what is going on here. Thank you! Best,
Connie

Connie

Aug 19, 2014, 9:25:11 PM8/19/14
to gen...@googlegroups.com
OK, so now I found that 75620 was the original length of my dictionary, the one that I used to train the model. I think that I need to (1) build the dictionary from all documents, (2) select a random sample of documents to train the model, and (3) then apply LDA. What I did instead was build the dictionary only from the documents I used to train the model.

Is this intuition correct?
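That intuition can be sketched with a toy word-to-id mapping standing in for gensim's Dictionary (the documents and doc2bow helper here are hypothetical): build the vocabulary from the full collection first, then train on a sample, and every document's ids stay inside the vocabulary.

```python
# Full collection of tokenized documents.
all_docs = [["apple", "banana"], ["banana", "cherry"], ["cherry", "date"]]
train_docs = all_docs[:2]  # sample used for training only

# Build the vocabulary from ALL documents, not just train_docs.
vocab = {}
for doc in all_docs:
    for word in doc:
        vocab.setdefault(word, len(vocab))

def doc2bow(doc):
    """Toy stand-in for Dictionary.doc2bow: sorted (id, count) pairs."""
    counts = {}
    for word in doc:
        counts[vocab[word]] = counts.get(vocab[word], 0) + 1
    return sorted(counts.items())

# Every id in every document is now below the vocabulary size, so a
# corpus2dense-style conversion cannot index out of bounds.
max_id = max(i for doc in all_docs for i, _ in doc2bow(doc))
print(max_id, len(vocab))  # 3 4
```

Had vocab been built from train_docs alone, "date" would have no id, and a document containing it would produce exactly the out-of-bounds situation from the previous message.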