Dropping topics from an LSA space


zfu...@gmail.com

Mar 22, 2012, 2:43:54 PM
to gensim
Hi,
I am new to the group, and new to this kind of work so my apologies if
this is a neophyte question.
I was wondering if there is a simple way to drop dimensions from the
LSA space prior to similarity queries.
Specifically, I am thinking about this in the context of using a
Wikipedia corpus as the training corpus. The first three or so
dimensions seem to consist purely of Wikipedia-related words rather than
content (at least in the example from the tutorial; my space is still
processing):
topic #0(200.540): 0.475*"delete" + 0.383*"deletion" + 0.275*"debate"
+ 0.223*"comments" + 0.221*"edits" + 0.213*"modify" +
0.208*"appropriate" + 0.195*"subsequent" + 0.155*"wp" +
0.116*"notability"

topic #1(142.463): -0.292*"diff" + -0.277*"link" + -0.210*"image" +
-0.160*"www" + 0.151*"delete" + -0.149*"user" + -0.134*"contribs" +
-0.133*"undo" + -0.128*"album" + -0.115*"copyright"

topic #2(134.758): -0.458*"diff" + -0.415*"link" + -0.210*"undo" +
-0.201*"user" + -0.195*"www" + -0.186*"contribs" + 0.154*"image" +
-0.115*"added" + 0.098*"album" + -0.096*"accounts"

I intend to index and compare non-Wikipedia documents within this
space, using Wikipedia as a reasonable corpus to represent the scope
of things people may write about. As such, these dimensions are not
relevant. Can they be dropped from the space? If they can't, will
keeping them negatively impact the reliability of the comparisons I
make within that space?
Thanks so much,
Zander Furnas

Radim Řehůřek

Mar 22, 2012, 5:11:49 PM
to gensim
Hi Zander,

you can do that: the LSI matrices are stored as plain numpy arrays, so
you can do `model.projection.s = model.projection.s[start:end]` and
`model.projection.u = model.projection.u[:, start:end]`, with "start"
and "end" to your liking.
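
A minimal sketch of that slicing, for dropping the first three topics.
It uses a small stand-in object with random data in place of a trained
model's `model.projection`, so it runs with numpy alone; the helper name
`drop_topics` and the array sizes are made up for illustration. It
assumes `u` has shape (num_terms, num_topics) and `s` has shape
(num_topics,).

```python
from types import SimpleNamespace

import numpy as np

# Stand-in for `model.projection` of a trained LSI model (hypothetical data):
# u holds one column per topic, s holds the matching singular values.
rng = np.random.default_rng(0)
projection = SimpleNamespace(
    u=rng.standard_normal((6, 5)),                   # 6 terms, 5 topics
    s=np.array([200.5, 142.5, 134.8, 90.0, 75.0]),  # one value per topic
)


def drop_topics(projection, start, end=None):
    """Keep only topics start..end-1, slicing s and u the same way."""
    projection.s = projection.s[start:end]
    projection.u = projection.u[:, start:end]
    return projection


# Project a dense bag-of-words vector before and after dropping topics.
doc = rng.random(6)
full_topics = doc @ projection.u      # coordinates in all 5 topics

drop_topics(projection, start=3)      # discard topics #0, #1 and #2
kept_topics = doc @ projection.u      # coordinates in the remaining 2 topics

# The surviving coordinates are unchanged; only the dropped ones are gone.
assert np.allclose(kept_topics, full_topics[3:])
```

With a real model, the same two assignments would be applied to
`model.projection` directly. Anything built from the full space before
the slicing (for example a similarity index) would presumably have to be
rebuilt afterwards, since its vectors still have the old dimensionality.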

There was a related question recently on this mailing list:
http://groups.google.com/group/gensim/msg/d6bfdc7a56910411 , see the
link there for extra info.

Best,
Radim