LSI Model dimensionality

Stephan Gabler

Apr 14, 2011, 6:59:34 AM
to gen...@googlegroups.com

Hello List,


We want to assess the accuracy of LSI models of different dimensionality.
Of course we can do this by training models with different values for `numTopics`,
but this takes a lot of time for high values of `numTopics`. Shouldn't it be possible to
compute a model with very high dimensionality just once (let's say N = 5000) and then use only
the first M < N dimensions to assess the quality of a lower-dimensional model?

The matrices should be somewhere in the LSI model, but I could not find how to access them.

best, stephan

Radim

Apr 14, 2011, 8:11:15 AM
to gensim
Hi Stephan,

exactly like you say, dimensions in LSI are "nested" (unlike e.g.
LDA). So once you compute the LSI model for `numTopics=N`, you have
available all models for numTopics <= N, via the matrices
`model.projection.u` (the left singular vectors) and
`model.projection.s` (the singular values).

Both are plain numpy arrays so slicing is super easy, but watch out
that `u` needs to be in fortran order (column major). The code checks
this so you'll get an assert error if `u` is not fortran-contiguous.
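
A minimal NumPy-only sketch of the "nested" property described above. It does not call gensim; the matrix `A`, the sizes, and the names `U_n` / `s_n` are made-up stand-ins for the term-document matrix and for `model.projection.u` / `model.projection.s`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy term-document matrix: 30 terms x 20 documents (random, illustrative data).
A = rng.random((30, 20))

# Decompose once and keep the N strongest factors.
N = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_n, s_n = U[:, :N], s[:N]

# Obtain the M-dimensional model by slicing instead of refactorizing.
M = 4
U_m = np.asfortranarray(U_n[:, :M])  # copy into column-major (Fortran) order
s_m = s_n[:M]

# Project the documents into both spaces.
docs_n = U_n.T @ A   # shape (N, 20)
docs_m = U_m.T @ A   # shape (M, 20)

# The M-dimensional projection is just the first M rows of the N-dimensional one.
assert np.allclose(docs_m, docs_n[:M])
```

The explicit `np.asfortranarray` call matters here because slicing columns out of a row-major array generally yields an array that is not Fortran-contiguous.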

HTH,
Radim

Stephan Gabler

Apr 28, 2011, 6:38:10 AM
to gen...@googlegroups.com

Hello Radim,

thanks for your reply. I found the matrices; this was really easy.
I think this functionality would also be useful for other gensim users who
don't want to mess with the LsiModel internals.

I would like to add this to the class and am now thinking about which interface to use.
Usually a dimensionality parameter like this would be passed to the function,
but the transformation is done in `__getitem__`, which should not receive another parameter.

So should it be set in another function like:
model.set_dim(50)
c = model[corpus]

or would this be confusing, because people might later forget that the dimensionality
was reduced permanently?
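
One possible way around the "permanently reduced" concern is to scope the change with a context manager. This is only a sketch of that design option: `temporary_attr`, `FakeModel`, and the attribute name `dim` are hypothetical and are not part of gensim.

```python
from contextlib import contextmanager


@contextmanager
def temporary_attr(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore the old value."""
    old = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        setattr(obj, name, old)


class FakeModel:
    """Hypothetical stand-in for a trained model; not a gensim class."""

    def __init__(self, dim):
        self.dim = dim


model = FakeModel(dim=300)
with temporary_attr(model, "dim", 50):
    inside = model.dim   # reduced only within this block
after = model.dim        # the original value is restored afterwards
```

With this pattern the reduced setting cannot leak past the `with` block, even if the code inside raises an exception.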


stephan

Xolve

Jun 21, 2011, 9:15:19 AM
to gensim
Yes, I feel the same. Being able to use reduced dimensions in LSI through the
regular interface would be great.

Stephan Gabler

Jun 21, 2011, 9:39:50 AM
to gen...@googlegroups.com

Hi,

this is already implemented in the latest version as far as I know.
You can just set the num_topics variable.

So you first create your model with e.g. 300 dimensions, and then

model.num_topics = 200
c = model[corpus]

will give you a transform of the corpus using only the 200 dimensions with the largest singular values.
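
To illustrate the idea of an attribute that controls how many dimensions take part in the transformation, here is a toy sketch. `TinyLsi` is a hypothetical class written for this example only; it is not gensim's `LsiModel` and makes no claim about how gensim implements this internally.

```python
import numpy as np


class TinyLsi:
    """Toy illustration only; not gensim's LsiModel."""

    def __init__(self, term_doc, num_topics):
        u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
        self.u = np.asfortranarray(u[:, :num_topics])
        self.s = s[:num_topics]
        self.num_topics = num_topics

    def __getitem__(self, term_doc):
        # Only the first num_topics columns of u take part in the projection.
        k = self.num_topics
        return self.u[:, :k].T @ term_doc


rng = np.random.default_rng(1)
X = rng.random((12, 8))      # 12 terms x 8 documents, random toy data

toy = TinyLsi(X, num_topics=6)
full = toy[X]                # shape (6, 8)

toy.num_topics = 3           # lower the dimensionality without retraining
reduced = toy[X]             # shape (3, 8)
```

Because the factors are ordered by singular value, `reduced` equals the first three rows of `full`.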

stephan

Radim

Jun 21, 2011, 1:22:00 PM
to gensim
Yep, it's already there, many thanks to Stephan who contributed the
code :-)

Radim

Xolve

Jun 21, 2011, 1:31:45 PM
to gensim
Thanks :)
I am really very grateful to the gensim developers. They have provided a
very neat and easy-to-use Python implementation of highly optimized
algorithms. My run time has dropped from hours to minutes.

shiva

Jul 4, 2011, 12:31:02 PM
to gensim
Hello Dear Gensim Users,

I'm new to gensim and found this discussion related to my question.
I want to know how I can control the number of topics (dimensions) that
the LSI model chooses.

When the LSI model is created for my corpus, the number of topics
comes out lower than the number I requested. For example, when I run
the following code:

lsi = models.LsiModel(mycorpus, numTopics=50,
                      id2word=mycorpus.dictionary)

on my corpus of 100 documents, the matrix in 'lsi.projection.u' has
fewer than 50 columns. However, sometimes I need all the documents of
my corpus to contribute to this matrix.

Any help would be appreciated.

Besides, I'd like to have a reference that explains more about how the
LSI model implemented in gensim chooses the topics from the corpus documents.

Best,
Shiva

Radim

Jul 4, 2011, 6:00:34 PM
to gensim
Hello Shiva,


> Besides, I 'd like to have a reference that explain more about how the
> LSI model
> implemented in Gensim choose the Topics from the corpus documents.

gensim uses the num_topics dimensions with the greatest variance
(=with the greatest singular values). If it finds fewer dimensions for
you, it means your input matrix was rank-deficient: the real rank of
the matrix was smaller than the rank you requested.

If you want to make sure gensim got it right, you can always (at least
for inputs that fit in RAM) compare `lsi.projection.s` to the singular
values returned by `numpy.linalg.svd(gensim.matutils.corpus2dense(mycorpus, num_terms))`.
All singular values at the tail of the spectrum should be zero (or
almost zero) for rank-deficient input.
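
A small sketch of such a check on synthetic data. The matrix `dense` is randomly generated here so the example is self-contained; in the situation discussed above it would come from `corpus2dense`. The tolerance formula is one common convention, not something taken from gensim.

```python
import numpy as np

rng = np.random.default_rng(2)
# 40 terms x 100 documents, built from only 5 independent factors,
# so the true rank is 5 regardless of how many topics one asks for.
dense = rng.random((40, 5)) @ rng.random((5, 100))

# Singular values only, sorted in descending order.
s = np.linalg.svd(dense, compute_uv=False)

# Treat values below a floating-point tolerance as zero.
tol = s[0] * max(dense.shape) * np.finfo(dense.dtype).eps
effective_rank = int(np.sum(s > tol))

head, tail = s[:effective_rank], s[effective_rank:]
```

Here `tail` holds the singular values that are numerically zero; requesting more than `effective_rank` factors from such a matrix cannot yield additional meaningful dimensions.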

HTH,
Radim

Shiva Taslimi

Jul 5, 2011, 7:20:57 AM
to gen...@googlegroups.com
Thank you so much for your really helpful explanations.
 

>If you want to make sure gensim got it right, you can always (at least
>for inputs that fit in RAM) compare the lsi.projection.s matrix to the
>result of `numpy.linalg.svd(gensim.matutils.corpus2dense(mycorpus))`.
>All singular values at the tail of the spectrum should be zero (or
>almost zero) for rank-deficient input.


Following your suggestion, I executed:

csvd = numpy.linalg.svd(gensim.matutils.corpus2dense(mycorpus, num_terms))

All singular values at the tail of the vector 'csvd[1]' should be zero, but that is not the case
in my output.
The singular values at the head of 'csvd[1]' are almost the same as the
ones in 'lsi.projection.s', but the remaining ones are not close to zero.
What does this mean?

regards,
Shiva

Radim

Jul 5, 2011, 11:51:32 AM
to gensim
Hi, can you upload your training corpus somewhere, plus the parameters
you train with? I'll have a look.

Cheers,
Radim

Shiva Taslimi

Jul 6, 2011, 10:01:14 AM
to gen...@googlegroups.com
I really appreciate your help in advance.
Is it adequate to upload the corpus matrix obtained from

'gensim.matutils.corpus2dense(mycorpus,num_terms)'

and the LSI parameters e.g. 'lsi = models.LsiModel(mycorpus, numTopics=100, id2word=mycorpus.dictionary)'?

I can upload the result of
'numpy.linalg.svd(gensim.matutils.corpus2dense(mycorpus,num_terms))[1]'
and
'lsi.projection.s'

Regards,
Shiva

Radim

Jul 6, 2011, 3:29:36 PM
to gensim
On Jul 6, 4:01 pm, Shiva Taslimi <sh.tasl...@googlemail.com> wrote:
> I really appreciate your help in advance.
> Is it adequate to upload the corpus matrix obtained from
>
> 'gensim.matutils.corpus2dense(mycorpus,num_terms)'

Yep, the matrix is enough :)

Best,
Radim

Radim

Jul 7, 2011, 8:00:31 AM
to gensim
Hello,

I converted the data you sent me to standard Matrix Market format:
http://nlp.fi.muni.cz/projekty/gensim/shiva.mm

At a glance, you should be running LSI on normalized tf-idf vectors,
not plain integer bag-of-words counts.

On closer inspection, this turns out also to be the reason why gensim
clips your spectrum to <100 factors, even if you request 100. The
singular values scale with the magnitude of input entries, and your
matrix contains values like `38958.0` (some term appears 39k times in
one document).

Since the spectrum is computed in eigenvalues (squares of the singular values),
this leads to extremely large contributions from the first few
eigenvalues. The tail of the spectrum is then clipped, as it
contributes almost nothing in comparison.
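
The effect can be sketched with a simplified clipping rule. `keep_count` and its `discard` threshold are hypothetical and written only for this illustration; this is not the code of gensim's `clip_spectrum`, and the numbers are made up.

```python
import numpy as np


def keep_count(singular_values, discard=1e-3):
    """Hypothetical rule: keep factors while the share of the squared
    spectrum that is still unaccounted for exceeds `discard`."""
    eig = np.asarray(singular_values, dtype=float) ** 2
    remaining = 1.0 - np.cumsum(eig) / eig.sum()
    return min(len(eig), 1 + int(np.sum(remaining > discard)))


flat = [10.0, 9.0, 8.0, 7.0, 6.0]           # comparable magnitudes
skewed = [39000.0, 900.0, 8.0, 7.0, 6.0]    # one huge leading value

kept_flat = keep_count(flat)      # all 5 factors survive
kept_skewed = keep_count(skewed)  # only the first factor survives
```

Squaring amplifies the gap: in `skewed`, the first eigenvalue alone already accounts for more than 99.9% of the total, so everything after it falls under the threshold.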

You can comment out the line `samples = clip_spectrum(numpy.diag(r),
samples, discard=eps)` in models/lsimodel.py to remove this behaviour.
Or you can use tf-idf to normalize your input in the first place.
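
As a rough illustration of what tf-idf normalization does to such counts, here is a NumPy sketch of one common tf-idf variant with unit-length document vectors. In gensim itself the usual route would be `models.TfidfModel`; the function `tfidf_unit_columns` and the toy matrix below are assumptions made for this example and do not reproduce gensim's code.

```python
import numpy as np


def tfidf_unit_columns(counts):
    """counts: terms x documents matrix of raw term frequencies.
    Returns tf-idf weights with every document column scaled to unit length."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)        # document frequency per term
    idf = np.log2(n_docs / np.maximum(df, 1))    # guard against unused terms
    weighted = counts * idf[:, None]
    norms = np.linalg.norm(weighted, axis=0)
    norms[norms == 0] = 1.0                      # leave empty documents as zeros
    return weighted / norms


# Toy counts: one term appears tens of thousands of times in one document.
raw = np.array([[38958.0, 1.0, 0.0],
                [2.0,     0.0, 3.0],
                [0.0,     4.0, 1.0]])
normalized = tfidf_unit_columns(raw)
```

After normalization no entry exceeds 1 in magnitude, so a single very frequent term can no longer dominate the spectrum the way the raw count does.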

I will also add some checks to gensim so that it handles these
situations more intelligently in the future :)

Best,
Radim

Shiva Taslimi

Jul 7, 2011, 5:47:51 PM
to gen...@googlegroups.com
Thank you so much Radim :)

I think my problem will be solved.
Besides, I have learned some other things about Gensim from you.

Best Regards,
Shiva
