My name is Marco Baroni, I work for the University of Trento (Italy),
and I am an enthusiastic Gensim newbie.
I plan to use Gensim for various purposes, but right now the most
pressing use I have for it is to employ its very efficient SVD
implementation as a surrogate for standard SVD with matrices that
SVDLIBC cannot handle.
For these purposes, I am curious to know about the accuracy of Gensim's
SVD in approximating the exact decomposition.
A preliminary experiment I ran on my own is very encouraging.
I started with a word-by-word matrix (i.e., the rows are not documents,
but words, and the columns measure their
(local-Mutual-Information-weighted) co-occurrences with other words
within a window of two items to the left and to the right). The
co-occurrence matrix was extracted from a large corpus (Wikipedia+ukWaC,
for a total of more than 2.5 billion words), and it has the following
properties:
- about 30K rows (target words)
- 5K columns (context words)
- about 28 million non-zero cells (about 18% density)
I performed SVD with Gensim, using the default form of the
models.lsimodel() routine (and asking for 300 factors).
I also performed "exact" SVD with SVDLIBC, also with the default options.
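For concreteness, the Gensim side boils down to something like the sketch below
(filenames are placeholders for my own files, and I'm going by the current
lsimodel documentation for the parameter names):

```python
# Rough sketch of the Gensim call; 'word_by_word.mm' is a placeholder for my
# 30K x 5K weighted co-occurrence matrix, serialized in MatrixMarket format.
from gensim import corpora, models

corpus = corpora.MmCorpus('word_by_word.mm')
lsi = models.LsiModel(corpus, num_topics=300)  # default settings, 300 factors
lsi.save('lsi_300_factors')
```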
The results indicate that the approximation is very good.
Evidence # 1:
I computed the cosines of the 332 word pairs in the Wordsim353 dataset
that are also attested in my target word list, using both reduced matrices.
I used this data-set because it contains a range of pairs with different
degrees of similarity (just sampling random pairs would have resulted in
getting mostly highly dissimilar pairs, I suppose).
The correlation between the cosines obtained with the Gensim vs SVDLIBC
reduced matrices is virtually 1 (0.9991788 to be precise).
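In case anybody wants to replicate this, the comparison amounts to something
like the following (a sketch; the .npy filenames and the pair list are
placeholders for my two reduced matrices and the 332 Wordsim353 row-index pairs):

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Placeholders: the two (n_words x 300) reduced matrices and the word-pair indices.
gensim_vecs = np.load('gensim_reduced.npy')
svdlibc_vecs = np.load('svdlibc_reduced.npy')
pairs = [(12, 345), (67, 890)]  # ... 332 (row_i, row_j) pairs in the real run

cos_gensim = [cosine(gensim_vecs[i], gensim_vecs[j]) for i, j in pairs]
cos_svdlibc = [cosine(svdlibc_vecs[i], svdlibc_vecs[j]) for i, j in pairs]
print(pearsonr(cos_gensim, cos_svdlibc)[0])
```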
Evidence # 2:
I randomly sampled the same 3000 cells from the Gensim and SVDLIBC
matrices, and measured the resulting correlation (thus, correlation of
random dimensions of random rows).
If we ignore the sign of the values (since, I suppose, it does not
matter whether a singular vector was picked to point in the positive or
negative direction, as long as it is parallel), the correlation is a
high 0.84 (if we do consider the sign, correlation is <0.1).
Just by visual inspection, correlation could be made much higher by
considering fewer than 300 factors.
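The cell-sampling check was along these lines (again a sketch with the same
placeholder filenames; "ignoring the sign" here just means correlating
absolute values):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholders for the two (n_words x 300) reduced matrices, as above.
gensim_vecs = np.load('gensim_reduced.npy')
svdlibc_vecs = np.load('svdlibc_reduced.npy')

rng = np.random.default_rng(42)
rows = rng.integers(0, gensim_vecs.shape[0], size=3000)
cols = rng.integers(0, gensim_vecs.shape[1], size=3000)

a, b = gensim_vecs[rows, cols], svdlibc_vecs[rows, cols]
print('with sign:    ', pearsonr(a, b)[0])
print('ignoring sign:', pearsonr(np.abs(a), np.abs(b))[0])
```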
So, that's it for my own experiments.
My questions for y'all are the following:
- Has anybody carried out similar experiments? What are your results?
- Regarding Radim's experiments in the NIPS paper, which model described
there corresponds to the one implemented in the default Gensim routine?
Is it the hybrid of the one- and two-pass models?
- Any recommendations for strategies (read: non-default arguments to the
routine) to improve accuracy that do not eat up too much memory? (See the
sketch after this list for the kind of thing I have in mind.)
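By "non-default arguments" I mean, I suppose, the oversampling / power-iteration
knobs I can see in lsimodel.py; whether these are the right ones to touch (and
what values are sensible) is exactly my question. Something like:

```python
from gensim import corpora, models

corpus = corpora.MmCorpus('word_by_word.mm')  # placeholder filename, as above
# My guess at the accuracy-relevant arguments; I may well be misreading lsimodel.py.
lsi = models.LsiModel(corpus, num_topics=300,
                      power_iters=4,       # extra power iterations for the randomized algo?
                      extra_samples=300)   # extra oversampled dimensions?
```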
Thanks in advance.
Ciao,
Marco
--
Marco Baroni
Center for Mind/Brain Sciences (CIMeC)
University of Trento
http://clic.cimec.unitn.it/marco
Guess I'll bite the bullet and try to read Halko et al...
Just one curiosity... what makes you think that word-by-word matrices
should (for the same mxn size) be of a higher rank than document-by-word
matrices? Just at an intuitive level, I'd expect documents to be more
independent than words, if anything...
This is a tough question! What's the rank of a semantic space? I guess
we can only attempt philosophical speculation at this point. I just hope
it won't seem too ridiculous or arrogant to you.
Marco, you say that you expect document models to have a higher rank
than word models. The nature of a text is indeed more intricate than the
nature of a word. It's like atoms and molecules... linguistic
fundamentals tell us that a text's meaning is the infinite combination
of the finite meanings of the words... so that proves your point: texts
can mean a lot more things than words simply because they are "woven"
out of many words, and the combinations are never the same.
But in reality we must take into account that we're talking about the
rank of a given model, not of meaning itself... Our models are finite
approximations targeted for a purpose, whereas this "deep meaning" is a
thing that occurs at various levels of granularity (abstract classes
like "object", "notion", "activity" are parts of meaning, so are
anthropic classes like "human", "animal", "food" or topic classes like
"from the hospital", "oil-related", "finance related"). Meaning is a
fractal self-organizing mess serving a variety of purposes.
Our models are basically zillions of observations + a few perspective
choices that allow the algorithms to make generalizations. These
perspectives select the granularity of the behaviours we're trying to
estimate. A functional view of meaning should encourage us to look at
*relevancy* in a given context. In our models, more or less conscious
choices are made to target some aspects of meaning over others because
we serve purposes with their own representation of what's relevant.
For instance a typical LSI model pertaining to documents would target
some kind of topic modelling. How many underlying topics are there? For
a given field or textual genre there is a clear phenomenon of topic
convergence. People just talk about the same topics all the time,
seriously... So if we have n documents there is a locally relevant
partition of k1 groups, where k1 is of the order of perhaps 10 or 100.
On the other hand, models of word meaning will target lexicographic
tasks (dictionary and ontology building) or disambiguation, priming and
other cognitive tasks... At that level, your model may need to describe
some subtle internal nuances that only have local generalizations:
there is no wide partition of "word topics" but a score of word niches
with their own local partitions. So our relevant partitioning would
probably have k2 groups, where k2 is of the order of something like log(n).
Then in most cases you have k1 << k2... I assume that an optimal
partitioning of observations is a strong indicator of its
dimensionality. The relevance purpose is more complex for words so the
resulting optimal k is higher.
So in a nutshell if you are looking at the nature of signs, texts or
utterances that you're modelling, then the more combinations the system
allows, the more complex the resulting space will be, so texts have
higher dimensionality. But if you are talking about the function of the
utterances, then a multi-functional unit like a word needs more
dimensions to be modelled than a mono-functional unit like a text.
Hope that makes sense...
Romain Loth
Research engineer (Ingénieur d'études)
MoDyCO, UMR 7114 - CNRS
Université Paris 10 Nanterre
tel: 01 40 97 74 31
On 13/09/2011 at 22:41, Marco Baroni wrote:
I'm also back from travel and it will take me some time to process all
the info you kindly provided.
> In addition to these parameters, you can also set the base
> decomposition to use SVDLIBC (and not the randomized algo). The change
> can be done by modifying `use_svdlibc=False` to `use_svdlibc=True` in
> lsimodel.py.
Would this scale up, or is it subject to the same memory limitations as
SVDLIBC?
> As an example, I computed the exact spectrum for a 10k x 100k matrix
> of 100k randomly selected documents from the English Wikipedia:
> http://radimrehurek.com/gensim/wiki_small_evals_xlog.pdf (note that
> the x axis is in logarithmic scale -- the flat tail is very very long,
> summing up to almost 70% of entire spectrum mass). The spectrum was
> computed by clipping the vocabulary to the 10k most common words, and
> computing the exact spectrum of corpus*corpus^T (a 10k x 10k symmetric
> matrix, in RAM).
Interesting: do you have sharable code to compute this? I'd like to
obtain the same plot for a word-by-word matrix...
> decompositions from image processing
I am also working with word-by-image-feature matrices -- not
super-large, but in the 10Kx24K range, or something like that... would
that be of interest?
Thanks again.
> here is a simple chart, illustrating the different algos present in
> gensim:
...
Thanks, this is very very useful!
>> Interesting: do you have sharable code to compute this? I'd like to
>> obtain the same plot for a word-by-word matrix...
>
> Sure! I'd like to see that plot, too.
> The spectrum is computed by the script I linked in the previous post.
> It accepts any input (in MatrixMarket format), so you can run it over
> your own data easily. The script saves eigenvalues to a file, which I
> then plotted to pdf with matplotlib.
OK, actually, then I don't even need to run your script. I just plotted
the "exact" eigenvalues from my matrix (30Kx5K word by word matrix). The
result looks linear in log(x)-by-untransformed(y) space, see:
clic.cimec.unitn.it/marco/temp/eigenvalues-of-word-by-word.pdf
Besides the difference in scale, this looks quite different from your
plot, even if I'm plotting a smaller number of eigenvalues...
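In case it's useful to anyone, the computation behind that plot is roughly the
following (a sketch; the filename is a placeholder, and the Gram-matrix trick
is the one from your earlier post, adapted to my transposed 30Kx5K shape):

```python
# Sketch: exact spectrum of a sparse 30K x 5K word-by-word matrix, log x axis.
# X^T X is only 5K x 5K, so it easily fits in RAM as a dense symmetric matrix.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import mmread

X = mmread('word_by_word.mtx').tocsr()            # placeholder filename
gram = (X.T @ X).toarray()                        # 5K x 5K Gram matrix
eigvals = np.linalg.eigvalsh(gram)[::-1]          # eigenvalues of X^T X, descending
sigma = np.sqrt(np.clip(eigvals, 0.0, None))      # singular values of X

plt.semilogx(np.arange(1, len(sigma) + 1), sigma)
plt.xlabel('index (log scale)')
plt.ylabel('singular value')
plt.savefig('eigenvalues-of-word-by-word.pdf')
```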
> Yes, definitely :) Input of that size still fits in RAM, even if it's
> fully dense, so comparing speed and accuracy should be
> straightforward.
>
Do you want me to just run your script on this matrix (in which case
I'll be in touch offline with some clarification questions re the
script), or should I send you the matrix?
M
So, it looks like the overall patterns are quite consistent across
data-sets.
A more empirical measure of quality would be the correlation of cosines
(or distances) between random vector pairs, across an "exact" and an
approximate SVD. If correlations are very high, we'd expect exact and
approximate SVD to be interchangeable for any practical task. Two issues
with this approach would be 1) what if the matrix is just too large for
exact SVD? 2) probably most random pairs would have very low similarity
(but this could be taken care of by picking pairs in less random ways...).
Ciao,
Marco