How to do dimension reduction


ChamingaD

Mar 13, 2012, 12:54:32 PM
to gen...@googlegroups.com
How do I change the number of dimensions (k) in Gensim? And how do I exclude the first dimension?

Brian Murphy

Mar 14, 2012, 4:19:29 PM
to gensim
Hi,

assuming you're talking about LSI/LSA or LDA, then you can control the
output dimensionality with the num_topics parameter:
http://radimrehurek.com/gensim/models/lsimodel.html
http://radimrehurek.com/gensim/models/ldamodel.html

... no idea how to exclude the first dimension upfront, but I guess
the most straightforward way is to generate it, and then throw it
away ...

Brian

ChamingaD

Mar 15, 2012, 2:05:17 AM
to gen...@googlegroups.com
Yes, it's about LSA :) Thanks for your reply.

Puffin mentions excluding the first dimension for a small collection of documents:

For large collections of documents, the number of dimensions used is in the 100 to 500 range. In our little example, since we want to graph it, we’ll use 3 dimensions, throw out the first dimension, and graph the second and third dimensions.

The reason we throw out the first dimension is interesting. For documents, the first dimension correlates with the length of the document. For words, it correlates with the number of times that word has been used in all documents. If we had centered our matrix, by subtracting the average column value from each column, then we would use the first dimension. As an analogy, consider golf scores. We don’t want to know the actual score, we want to know the score after subtracting it from par. That tells us whether the player made a birdie, bogey, etc.
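
To illustrate the centering idea in the quoted passage, here is a rough numpy sketch. The count matrix is invented for the example; it only shows the mechanics, not the tutorial's actual data.

```python
import numpy as np

# Hypothetical word-by-document count matrix (rows = words, columns = documents).
counts = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 1, 1],
    [0, 0, 2, 4],
    [1, 1, 1, 1],
], dtype=float)

# Uncentered SVD: for a non-negative matrix like this one, the first
# document dimension has the same sign for every document, so it mostly
# reflects overall size rather than differences between documents.
u, s, vt = np.linalg.svd(counts, full_matrices=False)
first_doc_dim = vt[0]
print(first_doc_dim)

# Centering as described in the quote: subtract each column's mean
# from that column, then decompose again.
centered = counts - counts.mean(axis=0)
uc, sc, vtc = np.linalg.svd(centered, full_matrices=False)
print(centered.mean(axis=0))
```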

 
Can someone give me the steps to do this?

Radim Řehůřek

Mar 15, 2012, 5:33:29 AM
to gensim
Hello Chaminga,

how to obtain the U, S, V matrices of LSA is described in the FAQ:

https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ

with numpy matrices you can use slicing (like with Python lists), for
example `s = s[1:]` etc.
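
A minimal sketch of that slicing, using plain numpy on a made-up matrix. It assumes U is terms x dimensions and V is documents x dimensions, so each dimension is a column; check the orientation of your own matrices before slicing.

```python
import numpy as np

# Hypothetical terms-by-documents matrix, for illustration only.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 1, 1],
    [0, 0, 2, 4],
    [1, 1, 1, 1],
], dtype=float)

u, s, vt = np.linalg.svd(A, full_matrices=False)

# Keep k = 3 dimensions, as in the quoted tutorial.
k = 3
u_k = u[:, :k]      # terms x k
s_k = s[:k]         # k singular values
v_k = vt[:k].T      # documents x k

# Throw away the first dimension: drop the first singular value
# and the first column of U and V.
s_rest = s_k[1:]
u_rest = u_k[:, 1:]
v_rest = v_k[:, 1:]

print(u_rest.shape, s_rest.shape, v_rest.shape)
```

The remaining two columns are the second and third dimensions, which is what the tutorial plots.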

HTH,
Radim

ChamingaD

Mar 17, 2012, 6:29:47 AM
to gen...@googlegroups.com
Thanks Radim. I'm clear on how to get the U, S, V matrices, but I couldn't understand how to exclude the first dimension.

Can you give me a recipe for that?

Thiago Marzagão

Apr 5, 2014, 11:16:58 AM
to gen...@googlegroups.com
Ok, I can retrieve U, S, and V after LSI using the FAQ recipe. And I can get the documentsXtopics matrix after LSI by multiplying V and S (which, with the proper TF-IDF parameters, yields the exact same result as sklearn's TruncatedSVD() followed by fit_transform() - which is reassuring). But how do I get the documentsXtopics matrix after LDA?
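
For reference, the "multiply V and S" step can be sketched with numpy alone. The matrix is made up, and V is assumed to be documents x topics here:

```python
import numpy as np

# Hypothetical terms-by-documents matrix, for illustration only.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 1, 1],
    [0, 0, 2, 4],
    [1, 1, 1, 1],
], dtype=float)

u, s, vt = np.linalg.svd(A, full_matrices=False)

k = 2
u_k = u[:, :k]      # terms x topics
s_k = s[:k]         # singular values
v_k = vt[:k].T      # documents x topics

# Scale each column of V by its singular value (numpy broadcasting).
docs_by_topics = v_k * s_k

# Same thing obtained by projecting the documents onto U.
projected = (u_k.T @ A).T

print(np.allclose(docs_by_topics, projected))
```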

Radim Řehůřek

Apr 5, 2014, 3:51:58 PM
to gen...@googlegroups.com

On Saturday, April 5, 2014 5:16:58 PM UTC+2, Thiago Marzagão wrote:
> Ok, I can retrieve U, S, and V after LSI using the FAQ recipe. And I can get the documentsXtopics matrix after LSI by multiplying V and S (which, with the proper TF-IDF parameters, yields the exact same result as sklearn's TruncatedSVD() followed by fit_transform() - which is reassuring).


It's a better idea to check against `numpy.linalg.svd`. AFAIK sklearn started using the same SVD algorithm as gensim (bar the scalability), so your tests don't tell you much.

 
> But how do I get the documentsXtopics matrix after LDA?

`lda[corpus]` will give you the matrix as a stream of sparse vectors.

HTH,
Radim