Hi all,
I'm posting some more experiment re. accuracy of incremental SVD.
This time I used a bigger corpus (1 million documents from the English
Wikipedia), so that I can run finer incremental updates -- with 1k
chunksize, the final decomposition comes from 1,000 successive
incremental updates. As before, I computed the exact
eigendecomposition of a 10k x 10k matrix A*A^T, and then compared the
decomposition from gensim against this exact result (spectrum here:
http://radimrehurek.com/gensim/wiki_1m_xlog.pdf ).
Accuracy for num_topics=500, fixed oversampling of 100:
no. of incremental updates (chunksize)
power iters 40 (25k) 200 (5k) 1000 (1k) multipass=batch Halko
----------------+----------------------------------------------------
0 | 1.67808 1.51908 1.1482 2.43968
2 | 1.06174 1.07794 1.07018 1.04137
4 | 1.04878 1.06825 1.06935 1.00727
8 | 1.07138 1.07317 1.06917 1.00818
Some observations:
* Without power iterations, the randomized algo of Halko et al. is
very inaccurate, in both batch and incremental version. The default
number of power iterations in gensim is set to 2, for this reason. But
interestingly, the incremental algo is more accurate than batch here!
This is because the partial inaccuracies from base decompositions
cancel each other out, over many updates.
* With power iterations, accuracy of the multipass algorithm (called
A1 in my previous email) improves quickly. So use `onepass=False` if
your data is static (not streamed or incremental), and you want extra
accuracy... and don't mind the wait.
* I suspect the reason why 8 steps of power iteration in multipass are
less accurate than 4 is connected to numerical overflows. Power
iteration is a method for "boosting" the spectrum by repeatedly
multiplying a matrix by A*A^T. This happens in double precision of
course, but even so, for large matrices, precision will be ultimately
lost as the numbers involved get too big.
* Relative error of the incremental algo is always around 7% on this
dataset, *regardless of the number of updates*. This is connected to
the spectral properties of term-document matrices I mentioned earlier.
I'm looking to repeat these experiments on other datasets, from other
computer science fields (image processing, or the word-word matrices
from Marco). Cooperation is welcome :)
Radim