Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion accuracy of gensim svd
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Radim  
View profile  
 More options Sep 26 2011, 10:27 am
From: Radim <radimrehu...@seznam.cz>
Date: Mon, 26 Sep 2011 07:27:00 -0700 (PDT)
Subject: Re: accuracy of gensim svd
Hi all,

I'm posting some more experiment re. accuracy of incremental SVD.

This time I used a bigger corpus (1 million documents from the English
Wikipedia), so that I can run finer incremental updates -- with 1k
chunksize, the final decomposition comes from 1,000 successive
incremental updates. As before, I computed the exact
eigendecomposition of a 10k x 10k matrix A*A^T, and then compared the
decomposition from gensim against this exact result (spectrum here:
http://radimrehurek.com/gensim/wiki_1m_xlog.pdf ).

Accuracy for num_topics=500, fixed oversampling of 100:

                         no. of incremental updates (chunksize)
power iters   40 (25k)    200 (5k)   1000 (1k)  multipass=batch Halko
----------------+----------------------------------------------------
0               | 1.67808    1.51908   1.1482       2.43968
2               | 1.06174    1.07794   1.07018     1.04137
4               | 1.04878    1.06825   1.06935     1.00727
8               | 1.07138    1.07317   1.06917     1.00818

Some observations:

* Without power iterations, the randomized algo of Halko et al. is
very inaccurate, in both batch and incremental version. The default
number of power iterations in gensim is set to 2, for this reason. But
interestingly, the incremental algo is more accurate than batch here!
This is because the partial inaccuracies from base decompositions
cancel each other out, over many updates.
* With power iterations, accuracy of the multipass algorithm (called
A1 in my previous email) improves quickly. So use `onepass=False` if
your data is static (not streamed or incremental), and you want extra
accuracy... and don't mind the wait.
* I suspect the reason why 8 steps of power iteration in multipass are
less accurate than 4 is connected to numerical overflows. Power
iteration is a method for "boosting" the spectrum by repeatedly
multiplying a matrix by A*A^T. This happens in double precision of
course, but even so, for large matrices, precision will be ultimately
lost as the numbers involved get too big.
* Relative error of the incremental algo is always around 7% on this
dataset, *regardless of the number of updates*. This is connected to
the spectral properties of term-document matrices I mentioned earlier.

I'm looking to repeat these experiments on other datasets, from other
computer science fields (image processing, or the word-word matrices
from Marco). Cooperation is welcome :)

Radim


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.