Why does the LSI model projection use scaled=False by default?


Alejandro

Jul 28, 2011, 7:09:08 PM
to gen...@googlegroups.com
Hi:

Looking at the LsiModel class, I noticed that the __getitem__ method has `scaled=False` as its default. That means, if I understand the code correctly, that the projection into the latent space is computed as

q = U^T * x

rather than

q = S^-1 * U^T * x

I think the Deerwester et al. paper uses the scaled version. Is there a reason for using one projection over the other?

Alejandro.

Radim

Jul 30, 2011, 9:59:18 AM
to gensim
Hello Alejandro,

this question has come up many times in the past, so I will just
copy&paste my previous email response:

>> Amber writes:
>> I am attempting to use gensim for part of my thesis work, and I'm
>> having a problem I hope you can help with. To test that I am using it
>> correctly, I have copied an example from a tutorial:
>> www.engr.uvic.ca/~seng474/svd.pdf

> Radim writes:
> About document scaling: LSA builds the latent representation of any document x_q (a "pseudo-document" in the original Deerwester et al. terminology) by the formula d_q = s^-1 * u^T * x_q (from x = u * s * d). To compare the similarity of two documents, d_q1 and d_q2, Deerwester et al. suggest the formula d_q1 * s^2 * d_q2, that is, the dot product between the `d` vectors, each scaled by `s`. When the two steps are combined, the `s^-1` in each projection cancels against the `s` in the similarity. That's why in lsa[query] I actually compute only d_q = u^T * x_q, and then only d_q1 * d_q2 for doc-doc similarity.
>
> So the difference is, calling lsa[corpus] already produces what your tutorial calls d_1, d_2 etc. (not just U_2). The values are already scaled by `s`.

HTH,
Radim
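
The cancellation Radim describes can be checked numerically. The following is only a minimal sketch in plain Python, not gensim code: the matrix `U`, the singular values `s`, the two term vectors and all helper names are made up for illustration, with `U` chosen so that its columns are orthonormal. It compares the Deerwester-style similarity d_q1 * s^2 * d_q2 on scaled projections with a plain dot product on unscaled projections.

```python
# Toy check that the singular values cancel in doc-doc similarity.
# U, s, x1 and x2 are hand-picked illustrative values, not gensim output.

def matvec_t(u, x):
    """Return u^T * x, where u is a list of rows (terms x topics)."""
    topics = len(u[0])
    return [sum(u[i][k] * x[i] for i in range(len(x))) for k in range(topics)]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


# 3 terms, 2 topics; the two columns of U are orthonormal.
U = [[1.0, 0.0],
     [0.0, 0.6],
     [0.0, 0.8]]
s = [3.0, 2.0]  # singular values (assumed for this example)


def project_unscaled(x):
    # d = u^T * x, the projection the thread attributes to lsa[query]
    return matvec_t(U, x)


def project_scaled(x):
    # d = s^-1 * u^T * x, the Deerwester et al. pseudo-document
    return [v / sv for v, sv in zip(matvec_t(U, x), s)]


def deerwester_similarity(d1, d2):
    # d1 * s^2 * d2: each scaled vector is multiplied back by s
    return sum(a * sv * sv * b for a, b, sv in zip(d1, d2, s))


x1 = [2.0, 1.0, 0.0]  # term vector of a first hypothetical document
x2 = [1.0, 0.0, 3.0]  # term vector of a second hypothetical document

sim_scaled = deerwester_similarity(project_scaled(x1), project_scaled(x2))
sim_unscaled = dot(project_unscaled(x1), project_unscaled(x2))

# The two values agree up to floating-point rounding.
print(sim_scaled, sim_unscaled)
```

Both routes give the same similarity, so dividing by `s` in the projection only to multiply by it again in the comparison is redundant work.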