My name is Marco Baroni, I work for the University of Trento (Italy),
and I am an enthusiastic Gensim newbie.
I plan to use Gensim for various purposes, but right now the most
pressing use I have for it is to employ its very efficient SVD
implementation as a surrogate for standard SVD with matrices that
SVDLIBC cannot handle.
For these purposes, I am curious to know about the accuracy of Gensim's
SVD in approximating the exact decomposition.
A preliminary experiment I ran on my own is very encouraging.
I started with a word-by-word matrix (i.e., the rows are not documents,
but words, and the columns measure their
(local-Mutual-Information-weighted) co-occurrences with other words
within a window of two items to the left and to the right). The
co-occurrence matrix was extracted from a large corpus (Wikipedia+ukWaC,
for a total of more than 2.5 billion words), and it has the following
properties:
- about 30K rows (target words)
- 5K columns (context words)
- about 28 million non-zero cells (about 18% density)
I performed SVD with Gensim, using the default form of the
models.lsimodel() routine (and asking for 300 factors).
I also performed "exact" SVD with SVDLIBC, also with the default options.
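For concreteness, the Gensim side boils down to something like the sketch below
(filenames are placeholders for my own files, and I'm going by the current
lsimodel documentation for the parameter names):

```python
# Rough sketch of the Gensim call; 'word_by_word.mm' is a placeholder for my
# 30K x 5K weighted co-occurrence matrix, serialized in MatrixMarket format.
from gensim import corpora, models

corpus = corpora.MmCorpus('word_by_word.mm')
lsi = models.LsiModel(corpus, num_topics=300)  # default settings, 300 factors
lsi.save('lsi_300_factors')
```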
The results indicate that the approximation is very good.
Evidence # 1:
I computed the cosines of the 332 word pairs in the Wordsim353 dataset
that are also attested in my target word list, using both reduced matrices.
I used this data-set because it contains a range of pairs with different
degrees of similarity (just sampling random pairs would have resulted in
getting mostly highly dissimilar pairs, I suppose).
The correlation between the cosines obtained with the Gensim vs SVDLIBC
reduced matrices is virtually 1 (0.9991788 to be precise).
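In case anybody wants to replicate this, the comparison amounts to something
like the following (a sketch; the .npy filenames and the pair list are
placeholders for my two reduced matrices and the 332 Wordsim353 row-index pairs):

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Placeholders: the two (n_words x 300) reduced matrices and the word-pair indices.
gensim_vecs = np.load('gensim_reduced.npy')
svdlibc_vecs = np.load('svdlibc_reduced.npy')
pairs = [(12, 345), (67, 890)]  # ... 332 (row_i, row_j) pairs in the real run

cos_gensim = [cosine(gensim_vecs[i], gensim_vecs[j]) for i, j in pairs]
cos_svdlibc = [cosine(svdlibc_vecs[i], svdlibc_vecs[j]) for i, j in pairs]
print(pearsonr(cos_gensim, cos_svdlibc)[0])
```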
Evidence # 2:
I randomly sampled the same 3000 cells from the Gensim and SVDLIBC
matrices, and measured the resulting correlation (thus, correlation of
random dimensions of random rows).
If we ignore the sign of the values (since, I suppose, it does not
matter whether a singular vector was picked to point in the positive or
negative direction, as long as it is parallel), the correlation is a
high 0.84 (if we do consider the sign, correlation is <0.1).
Just by visual inspection, correlation could be made much higher by
considering fewer than 300 factors.
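The cell-sampling check was along these lines (again a sketch with the same
placeholder filenames; "ignoring the sign" here just means correlating
absolute values):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholders for the two (n_words x 300) reduced matrices, as above.
gensim_vecs = np.load('gensim_reduced.npy')
svdlibc_vecs = np.load('svdlibc_reduced.npy')

rng = np.random.default_rng(42)
rows = rng.integers(0, gensim_vecs.shape[0], size=3000)
cols = rng.integers(0, gensim_vecs.shape[1], size=3000)

a, b = gensim_vecs[rows, cols], svdlibc_vecs[rows, cols]
print('with sign:    ', pearsonr(a, b)[0])
print('ignoring sign:', pearsonr(np.abs(a), np.abs(b))[0])
```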
So, that's it for my own experiments.
My questions for y'all are the following:
- Has anybody carried out similar experiments? What are your results?
- Regarding Radim's experiments in the NIPS paper, which model described
there corresponds to the one implemented in the default Gensim routine?
Is it the hybrid of the one- and two-pass models?
- Any recommendations for strategies (read: non-default arguments to the
routine) to improve accuracy that do not eat up too much memory? (See the
sketch after this list for the kind of thing I have in mind.)
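By "non-default arguments" I mean, I suppose, the oversampling / power-iteration
knobs I can see in lsimodel.py; whether these are the right ones to touch (and
what values are sensible) is exactly my question. Something like:

```python
from gensim import corpora, models

corpus = corpora.MmCorpus('word_by_word.mm')  # placeholder filename, as above
# My guess at the accuracy-relevant arguments; I may well be misreading lsimodel.py.
lsi = models.LsiModel(corpus, num_topics=300,
                      power_iters=4,       # extra power iterations for the randomized algo?
                      extra_samples=300)   # extra oversampled dimensions?
```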
Thanks in advance.
Ciao,
Marco
--
Marco Baroni
Center for Mind/Brain Sciences (CIMeC)
University of Trento
http://clic.cimec.unitn.it/marco
Guess I'll bite the bullet and try to read Halko et al...
Just one curiosity... what makes you think that word-by-word matrices
should (for the same mxn size) be of a higher rank than document-by-word
matrices? Just at an intuitive level, I'd expect documents to be more
independent than words, if anything...
This is a tough question! What's the rank of a semantic space? I guess
we can only attempt philosophical speculation at this point. I just hope
it won't seem too ridiculous or arrogant to you.
Marco, you say that you expect document models to have a higher rank
than word models. The nature of a text is indeed more intricate than the
nature of a word. It's like atoms and molecules... linguistic
fundamentals tell us that a text's meaning is the infinite combination
of the finite meanings of the words... so that proves your point: texts
can mean a lot more things than words simply because they are "woven"
out of many words, and the combinations are never the same.
But in reality we must take into account that we're talking about the
rank of a given model, not of meaning itself... Our models are finite
approximations targeted for a purpose, whereas this "deep meaning" is a
thing that occurs at various levels of granularity (abstract classes
like "object", "notion", "activity" are parts of meaning, so are
anthropic classes like "human", "animal", "food" or topic classes like
"from the hospital", "oil-related", "finance related"). Meaning is a
fractal self-organizing mess serving a variety of purposes.
Our models are basically zillions of observations + a few perspective
choices that allow the algorithms to make generalizations. These
perspectives select the granularity of the behaviours we're trying to
estimate. A functional view of meaning should encourage us to look at
*relevancy* in a given context. In our models, more or less conscious
choices are made to target some aspects of meaning over others because
we serve purposes with their own representation of what's relevant.
For instance a typical LSI model pertaining to documents would target
some kind of topic modelling. How many underlying topics are there? For
a given field or textual genre there is a clear phenomenon of topic
convergence. People just talk about the same topics all the time,
seriously... So if we have n documents there is a locally relevant
partition of k1 groups, where k1 is of the order of perhaps 10 or 100.
On the other hand, models of word meaning will target lexicographic
tasks (dictionary and ontology building) or disambiguation, priming and
other cognitive tasks... At that level, your model may need to describe
some subtle internal nuances that only have local generalizations:
there is no wide partition of "word topics" but a score of word niches
with their own local partitions. So our relevant partitioning would
probably have k2 groups, where k2 is of the order of something like log(n).
Then in most cases you have k1 << k2... I assume that an optimal
partitioning of observations is a strong indicator of its
dimensionality. The relevance purpose is more complex for words so the
resulting optimal k is higher.
So in a nutshell if you are looking at the nature of signs, texts or
utterances that you're modelling, then the more combinations the system
allows, the more complex the resulting space will be, so texts have
higher dimensionality. But if you are talking about the function of the
utterances, then a multi-functional unit like a word needs more
dimensions to be modelled than a mono-functional unit like a text.
Hope that makes sense...
Romain Loth
Research engineer (Ingénieur d'études)
MoDyCO, UMR 7114 - CNRS
Université Paris 10 Nanterre
tel: 01 40 97 74 31
On 13/09/2011 at 22:41, Marco Baroni wrote:
I'm also back from travel and it will take me some time to process all
the info you kindly provided.
> In addition to these parameters, you can also set the base
> decomposition to use SVDLIBC (and not the randomized algo). The change
> can be done by modifying `use_svdlibc=False` to `use_svdlibc=True` in
> lsimodel.py.
Would this scale up, or is it subject to the same memory limitations as
SVDLIBC?
> As an example, I computed the exact spectrum for a 10k x 100k matrix
> of 100k randomly selected documents from the English Wikipedia:
> http://radimrehurek.com/gensim/wiki_small_evals_xlog.pdf (note that
> the x axis is in logarithmic scale -- the flat tail is very very long,
> summing up to almost 70% of entire spectrum mass). The spectrum was
> computed by clipping the vocabulary to the 10k most common words, and
> computing the exact spectrum of corpus*corpus^T (a 10k x 10k symmetric
> matrix, in RAM).
Interesting: do you have sharable code to compute this? I'd like to
obtain the same plot for a word-by-word matrix...
> decompositions from image processing
I am also working with word-by-image-feature matrices -- not
super-large, but in the 10Kx24K range, or something like that... would
that be of interest?
Thanks again.
> here is a simple chart, illustrating the different algos present in
> gensim:
...
Thanks, this is very very useful!
>> Interesting: do you have sharable code to compute this? I'd like to
>> obtain the same plot for a word-by-word matrix...
>
> Sure! I'd like to see that plot, too.
> The spectrum is computed by the script I linked in the previous post.
> It accepts any input (in MatrixMarket format), so you can run it over
> your own data easily. The script saves eigenvalues to a file, which I
> then plotted to pdf with matplotlib.
OK, actually, then I don't even need to run your script. I just plotted
the "exact" eigenvalues from my matrix (30Kx5K word by word matrix). The
result looks linear in log(x)-by-untransformed(y) space, see:
clic.cimec.unitn.it/marco/temp/eigenvalues-of-word-by-word.pdf
Besides the difference in scale, this looks quite different from your
plot, even if I'm plotting a smaller number of eigenvalues...
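In case it's useful to anyone, the computation behind that plot is roughly the
following (a sketch; the filename is a placeholder, and the Gram-matrix trick
is the one from your earlier post, adapted to my transposed 30Kx5K shape):

```python
# Sketch: exact spectrum of a sparse 30K x 5K word-by-word matrix, log x axis.
# X^T X is only 5K x 5K, so it easily fits in RAM as a dense symmetric matrix.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import mmread

X = mmread('word_by_word.mtx').tocsr()            # placeholder filename
gram = (X.T @ X).toarray()                        # 5K x 5K Gram matrix
eigvals = np.linalg.eigvalsh(gram)[::-1]          # eigenvalues of X^T X, descending
sigma = np.sqrt(np.clip(eigvals, 0.0, None))      # singular values of X

plt.semilogx(np.arange(1, len(sigma) + 1), sigma)
plt.xlabel('index (log scale)')
plt.ylabel('singular value')
plt.savefig('eigenvalues-of-word-by-word.pdf')
```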
> Yes, definitely :) Input of that size still fits in RAM, even if it's
> fully dense, so comparing speed and accuracy should be
> straightforward.
>
Do you want me to just run your script on this matrix (in which case
I'll be in touch offline with some clarification questions re the
script), or should I send you the matrix?
M
So, it looks like the overall patterns are quite consistent across
data-sets.
A more empirical measure of quality would be the correlation of cosines
(or distances) between random vector pairs, across an "exact" and an
approximate SVD. If correlations are very high, we'd expect exact and
approximate SVD to be interchangeable for any practical task. Two issues
with this approach would be 1) what if the matrix is just too large for
exact SVD? 2) probably most random pairs would have very low similarity
(but this could be taken care of by picking pairs in less random ways...).
Ciao,
Marco