speeding up similarity calculations (and gensim in general)


Dieter Plaetinck

Feb 24, 2011, 5:06:47 AM
to gensim
Hi,
my laptop:
Intel Core i3 M 370 @ 2.40GHz (4 cores)
4GB ram.
On my corpus (170k documents, 670k tokens in the dictionary) most of the
steps run at a reasonable speed: dictionary building plus converting to Blei
format, tf-idf model building and sparse matrix generation all run at
roughly 50k documents/minute.
However, I'm building a data structure containing, for each document, the
10 most similar documents (using SparseMatrixSimilarity). This goes
at 30 documents per minute (i.e. 2 seconds per document => 94 hours
for the entire corpus). I want to make this drastically shorter.
I have some ideas:
* I can probably nearly halve the time by looking up the BxA similarity
when I have already computed AxB (I will probably need to patch
SparseMatrixSimilarity or something to do that).
* I noticed in SimilarityABC.__getitem__() that the largest part of
the time is spent sorting the similarities to collect the 10 most
similar ones; this takes considerably longer than calculating all the
similarities (against all corpus documents) themselves?! Maybe I could
write an algorithm that, instead of sorting the long list and then
cutting it off, builds the short sorted list right away by iterating
over the long list, dropping irrelevant scores as I go (see the sketch
after this list).
* I notice that it uses only 1 core; if I managed to use all 4, I could
cut the processing time roughly by a factor of 4.
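
Roughly, what I have in mind for that sorting idea (just a sketch; `sims` here stands for any iterable of (doc_id, score) pairs, the names are only illustrative):

import heapq

def top_n(sims, n=10):
    # scan the scores once and keep only the n best, instead of sorting
    # the full list and slicing it afterwards
    return heapq.nlargest(n, sims, key=lambda item: item[1])
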
I use:
Python 2.7.1
scipy 0.8.0
numpy 1.5.1

I tried building atlas-lapack on my system, but that always gives the
dreaded `res/dgemvT_102_75 : VARIATION EXCEEDS TOLERENCE, RERUN WITH
HIGHER REPS` errors.
I also tried the Intel BLAS, but got problems with that too (although
I forget which, specifically).
Someone I know managed to build ATLAS/scipy/numpy optimized for his
AMD Athlon 64; he sent me the packages, but I don't see a noticeable
performance gain from them.

does anyone have further input or tips for me? thanks in advance.

~ python2 -c 'import scipy; scipy.show_config()'
umfpack_info:
    NOT AVAILABLE
atlas_threads_info:
    NOT AVAILABLE
blas_opt_info:
    libraries = ['f77blas', 'cblas', 'atlas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.8.3\\""')]
    language = c
    include_dirs = ['/usr/include']
atlas_blas_threads_info:
    NOT AVAILABLE
lapack_opt_info:
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.8.3\\""')]
    language = f77
    include_dirs = ['/usr/include']
atlas_info:
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas']
    library_dirs = ['/usr/lib']
    language = f77
    include_dirs = ['/usr/include']
lapack_mkl_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
atlas_blas_info:
    libraries = ['f77blas', 'cblas', 'atlas']
    library_dirs = ['/usr/lib']
    language = c
    include_dirs = ['/usr/include']
mkl_info:
    NOT AVAILABLE

Dieter

Dieter Plaetinck

Feb 24, 2011, 9:52:23 AM
to gensim


On Thu, Feb 24, 2011 at 11:06 AM, Dieter Plaetinck <dieterp...@gmail.com> wrote:
> * I noticed in SimilarityABC.__getitem__() that the largest part of
> the time is spent sorting the similarities to collect the 10 most
> similar ones; this takes considerably longer than calculating all the
> similarities (against all corpus documents) themselves?! Maybe I could
> write an algorithm that, instead of sorting the long list and then
> cutting it off, builds the short sorted list right away by iterating
> over the long list, dropping irrelevant scores as I go.

I just tried this.
Interestingly, my approach is about as slow as the original code :(
Maybe someone who's better at Python can optimize it further.
My attempt can be seen at https://gist.github.com/842243

Dieter

Radim

Feb 24, 2011, 10:39:56 PM
to gensim
Hello Dieter,

> However, I'm building a datastructure containing for each document the
> 10 most similar documents.  (using SparseMatrixSimilarity). This goes
> at 30 documents per minute (i.e. 2 seconds per document => 94 hours
> for the entire corpus).  I want to make this drastically shorter.

Yes, similarity computation is currently the slowest part of gensim; I
have spent most of my time on model generation so far. Contributions are
very welcome!

> I have some ideas:
> * I can probably nearly halve the time by looking up BxA similarity
> when I have done already AxB (I will probably need to patch
> SparseMatrixSimilarity or something to do that)

Not sure what you mean here.

> Maybe I could
> write an algorithm to, instead of sorting the long list and then to
> cut it off, build the short sorted list right away by iterating over
> the long list, dropping irrelevant scores as I go.

No, rewriting Python's `sort` with explicit loops won't give you
anything.
I have several ideas for making the code faster, I just never got round
to implementing them; perhaps you can help :)
In increasing order of difficulty, depending on how daring you feel as a
coder:

1) rewrite `SimilarityABC.__getitem__` with `numpy.argsort`, instead
of forming and sorting explicit 2-tuples. Should be much faster (see
the sketch after this list).
2) process queries in bigger chunks (10, 100 documents at once), not
document-by-document. Would save a little time during matrix
multiplications.
3) distributed index. I like this one because it increases
scalability, but it's a lot of coding.
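
For 1), roughly something like this (an untested sketch, not the actual gensim code; `sims` is assumed to be a 1-d numpy array of similarities against the whole index):

import numpy

def top_n(sims, n=10):
    # argsort runs in optimized C; take the last n indices and reverse them,
    # so the highest score comes first
    best = numpy.argsort(sims)[-n:][::-1]
    return [(int(idx), float(sims[idx])) for idx in best]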

> * I notice that it uses only 1 core, if I would manage to use all 4, I
> could further reduce processing time by 4.

Python has something called GIL (Global Interpreter Lock), which makes
threading useless for CPU-bound tasks. It only helps with I/O.
Concurrency could be done at process level, or, better yet, across a
cluster of computers.
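
To illustrate the process-level idea, a toy sketch (the random `index` below only stands in for a real similarity index, this is not gensim code):

import numpy
from multiprocessing import Pool

index = numpy.random.rand(1000, 50)  # stand-in: 1000 fake documents, 50 features

def top10(doc_id):
    sims = numpy.dot(index, index[doc_id])  # similarities of one document vs. all
    best = numpy.argsort(sims)[-10:][::-1]
    return [(int(idx), float(sims[idx])) for idx in best]

if __name__ == '__main__':
    pool = Pool(processes=4)  # one worker process per core, sidestepping the GIL
    all_top10 = pool.map(top10, range(index.shape[0]))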

Please keep us posted about your progress Dieter, speeding up
similarities would be cool.

Radim

Dieter Plaetinck

Feb 25, 2011, 5:26:27 AM
to gen...@googlegroups.com
> Yes, similarity computation is currently the slowest part of gensim; I
> have spent most of my time on model generation so far. Contributions are
> very welcome!

Like I said, in my measurements, when using numBest, the take-the-top-numBest-similarities step is actually much slower (about 5 times) than computing the similarities themselves.

> I have some ideas:
> * I can probably nearly halve the time by looking up BxA similarity
> when I have done already AxB (I will probably need to patch
> SparseMatrixSimilarity or something to do that)

> Not sure what you mean here.

So, my app precomputes the list of the 10 most similar documents for each document, so that at runtime, whenever I want to know the top 10 similar documents for a given document in the corpus, I can just look it up instead of calculating it.
In my case, a query is always an existing document in the corpus, so I need to do N^2 vector multiplications (N = number of documents).
But I don't need to do them all. Suppose my documents are a sequence labeled d1, d2, d3, ... and I do all the multiplications in order (d1xd2, d1xd3, ..., d1xdn, d2xd1, d2xd3, ...); then any time I need to calculate dixdj I could just look up djxdi (if i > j).
However, such an approach would also introduce some overhead (tracking ids along with documents, storing results in a data structure, doing lookups, etc.). In my case my documents are actually very short (on average, say, 30 tokens), so I'm not sure to what extent this would be an improvement.
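
The bookkeeping would look something like this (just a sketch; `sim` stands for whatever function computes one pair):

cache = {}

def cached_sim(i, j, sim):
    key = (min(i, j), max(i, j))  # dixdj and djxdi share one cache entry
    if key not in cache:
        cache[key] = sim(i, j)
    return cache[key]

(Of course, keeping all those pair scores around is itself part of the overhead I mentioned.)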
 

> Maybe I could
> write an algorithm to, instead of sorting the long list and then to
> cut it off, build the short sorted list right away by iterating over
> the long list, dropping irrelevant scores as I go.

> No, rewriting Python's `sort` with explicit loops won't give you
> anything.

Of course, that's not what I meant. My point was: it is not necessary to sort the entire list; you just need to iterate over the list once to collect the top results, and then sort only those.
I.e.: O(n) + O(numBest*log(numBest)) instead of O(n*log(n)) if you sort the entire list.
This should be much faster (assuming numBest << n), or well... that's what I thought.
I implemented this approach (in 2 slightly different ways), see https://gist.github.com/842243/0b54a49d6c65a47e6321cacd52b7e38c00b68767 , but interestingly (and sadly) it brings no noticeable improvement.
If you understand what I'm missing, please enlighten me :)
 
> I have several ideas for making the code faster, I just never got round
> to implementing them; perhaps you can help :)
> In increasing order of difficulty, depending on how daring you feel as a
> coder:
>
> 1) rewrite `SimilarityABC.__getitem__` with `numpy.argsort`, instead
> of forming and sorting explicit 2-tuples. Should be much faster.
> 2) process queries in bigger chunks (10, 100 documents at once), not
> document-by-document. Would save a little time during matrix
> multiplications.
> 3) distributed index. I like this one because it increases
> scalability, but it's a lot of coding.

I would like to help; unfortunately, I have to limit my contributions. The policy of my university dictates that I cannot open-source substantial improvements. (But I don't even fully understand your points 2 and 3, so coding them is out of the question anyway...)
 

> * I notice that it uses only 1 core, if I would manage to use all 4, I
> could further reduce processing time by 4.

> Python has something called GIL (Global Interpreter Lock), which makes
> threading useless for CPU-bound tasks. It only helps with I/O.
> Concurrency could be done at process level, or, better yet, across a
> cluster of computers.

Process-level is fine for me; I'm fine with running 4 processes on the same system. The question is: what's the best way to implement this? Add multiprocess support to the BLAS, to gensim, or to my application? Or, since gensim already supports distributed computing, maybe I should just run 4 workers on the same system.

Also, are BLAS libraries affected by this? I read that ATLAS supports threading, for example.

> Please keep us posted about your progress Dieter, speeding up
> similarities would be cool.
 
Well, my main concern is actually the BLAS thing. On the gensim website I read that I can gain up to a ~15x speed improvement, which sounds very nice, especially if it can utilize all 4 cores without me needing to do anything. But I'm having a very hard time getting the ATLAS BLAS to run.
(see https://sourceforge.net/tracker/?func=detail&aid=3191421&group_id=23725&atid=379483)
I also need to be wary that any optimisation I do could become moot (or have an adverse effect) once I have a decent BLAS setup.

Radim

Feb 25, 2011, 7:29:00 AM
to gensim
> of course.  that's not what I meant.  my point was: it is not necessary to
> sort the entire list, you just need to iterate the list  once to collect the
> top results and then just sort the top results.
> I.e.: O(n) + O(numBest*log(numBest)) instead of O (nlogn) if you sort the
> entire list.
> This should be much faster (assuming numBest << n), or well.. that's what I
> thought.

Yes, I understand what you mean. But unless `n` is very large, a
tight, well-written C code (such as `sort`) is always faster than
anything you can do with explicit Python loops, big-O notation
notwithstanding. Python is interpreted, so any explicit looping is
best avoided where possible.

One option would be to code your approach in C (through scipy.weave,
or a compiled module, for example). But that's not very gensim-like,
only as a last resort; I think option 1) from my earlier mail should be
enough, and it is already a part of numpy.

As for running several workers on one machine: yes, that's what I meant
by "3) distributed index". Sharding the index across `p` processes
(possibly on different machines), then merging the `p` partial query
results for the final answer. Your university policy is interesting; is
there a limit on lines of code? With Python, I'm sure we could squeeze
in under it, no matter how small :)
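
Merging the partial results is the easy part; roughly (a sketch, assuming each shard already returns its own (doc_id, score) top-n list for a query):

import heapq
import itertools

def merge_shards(partial_results, n=10):
    # partial_results: one top-n list of (doc_id, score) pairs per shard;
    # keep the n globally best scores across all shards
    merged = itertools.chain.from_iterable(partial_results)
    return heapq.nlargest(n, merged, key=lambda item: item[1])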

> Also, are blas libraries affected by this? I read atlas supports threading,
> for example.

Yes, most BLAS libraries support threading, independently of gensim. I
know ATLAS supports this (and numpy/scipy should automatically detect
and use the threaded version during installation). There can indeed be a
double-digit speed-up in matrix multiplication performance, but the gain
in overall application performance is much smaller (matrix multiplication
is only a part of the overall processing cost).

I don't know enough about ATLAS installation issues to help you there,
sorry.

Radim



Radim

Feb 27, 2011, 9:37:31 AM
to gensim
Hello,

> > > 1) rewrite `SimilarityABC.__getitem__` with `numpy.argsort`, instead
> > > of forming and sorting explicit 2-tuples. Should be much faster.

I just tried; it's about 40x faster: https://github.com/piskvorky/gensim/issues#issue/5.
I pushed the version with `numpy.argsort` into the develop branch on
github. Dieter, can you please check how it affects your app
benchmark?

Btw, how do I link a commit to a specific issue on github? (something
like "re #5: i did this and this" in trac/SVN)

Cheers,
Radim

Dieter Plaetinck

Feb 28, 2011, 4:26:42 AM
to gen...@googlegroups.com
On Sun, Feb 27, 2011 at 3:37 PM, Radim <radimr...@seznam.cz> wrote:
> Hello,
>
> > > > 1) rewrite `SimilarityABC.__getitem__` with `numpy.argsort`, instead
> > > > of forming and sorting explicit 2-tuples. Should be much faster.
>
> I just tried; it's about 40x faster: https://github.com/piskvorky/gensim/issues#issue/5.
> I pushed the version with `numpy.argsort` into the develop branch on
> github. Dieter, can you please check how it affects your app
> benchmark?
My sorts now take between 2.9% and 4% of the time they used to, so that makes the sorting step 25-34 times faster.
The similarity calculations are now the bottleneck again (and with those included, I went from 25-30 documents/minute to 100).
I also verified that the actual returned data (i.e. the top similarities) are the same as with the previous code.
Great work! I hope more optimisations follow :)
 

> Btw, how do I link a commit to a specific issue on github? (something
> like "re #5: i did this and this" in trac/SVN)
I don't know, sorry.

Dieter Plaetinck

Mar 8, 2011, 5:13:04 AM
to gen...@googlegroups.com
Hmm, so thanks to the numpy.argsort similarity sorting I went from 25-30 documents/minute to 100.

But installing atlas-lapack didn't bring a big gain. Averages are fluctuating between 100 and 190 (so roughly a 50% improvement, not the 4-15x I was hoping for).
Interestingly, atlas-lapack also only uses one core. Maybe because it won't do multithreading on a calculation as fine-grained as a single corpus*vector multiplication?
I'm thinking about writing a multiprocess layer on top of gensim so that I can distribute jobs of similarity calculations over my cores. So I would create jobs that do getSimilarities(corpus, vec) for a range of 100 documents or so.
You mentioned processing documents in chunks in your point 2, but AFAICT not in the context of multiprocessing? So what's the idea there exactly? For a chunk size of Y you want to do `N x T * T x Y = N x Y`? Why would that be faster?

Do you have ideas to make similarity calculations leverage all my cores?

Dieter

Dieter Plaetinck

Mar 8, 2011, 8:33:09 AM
to gen...@googlegroups.com
Even worse, I just found out that after removing my custom atlas-lapack, my numbers are the same (i.e. 100-190 documents per minute), so it seems I get no speedup at all from it :( (I needed some more test runs to get a better picture.)

Interestingly, the test:
`time python2 -c "import numpy as N; a=N.random.randn(1000, 1000); N.dot(a, a)"`
finishes in less than 2 seconds with atlas-lapack, whereas it needs 15-30 seconds to complete without it, so it's not that I installed it incorrectly.
I'm currently trying to profile my code, trying to understand why I'm not seeing any benefit from atlas-lapack when doing similarity calculations with gensim.

Dieter

Radim

Mar 8, 2011, 10:29:00 AM
to gensim
On Mar 8, 5:13 pm, Dieter Plaetinck <die...@plaetinck.be> wrote:
> But installing atlas-lapack didn't bring a big gain.  Averages are

Gain in what exactly, how do you measure it? LAPACK is a library for
dense matrix operations, so the benefits can only be seen in
`MatrixSimilarity` (and dense operations in general). Sparse matrix
operations are hard-wired in `scipy.sparse`; they do not benefit from
these specialized libraries. The sparse code is in C and is reasonably
sane (esp. in newer versions of scipy), but nowhere near as optimized
as these monster *PACKs.

> I'm thinking about writing a multiprocess layer on top of gensim so that i
> can distribute jobs of similarity calculations over my cores.  So I would
> create jobs to do the getSimilarities(corpus, vec) for a range of 100
> documents or so.
> You mentioned processing documents in chunks in your point 2, but AFAICT not
> in the context of multiprocessing?  So what's the idea there exactly? For a
> chunksize of Y you want to do ` N x T * T x Y = N x Y` ?  why would that be
> faster?

Point 2) was about not multiplying corpus * query (one document at a
time), but rather corpus * chunk (more query documents). It could be
useful when you a) need to process many similarity queries and b)
you're not required to wait for one query to finish before submitting
the next. Then, instead of processing one query after another, you'd
submit a batch of queries (perform only one multiplication) and
possibly save some time. I'm not sure how much though.

>
> Do you have ideas to make similarity calculations leverage all my cores?

Not beyond "sit down and think it through" :-) The idea is of course
sharding the index, but coming up with a clean design is probably more
challenging than implementing it. If you go down this road Dieter,
please consider treating the process location as a black box (Pyro makes
this really easy), so that the shards can also reside on different
machines, not just different cores. The difference between a multi-core
and a multi-machine computation would then just be a matter of where the
processes physically run; the inner logic would stay the same.

In any case I appreciate your effort here: like I said, querying is
currently the weakest part of gensim. A solid scalable implementation
will bring you undying fame ;)

Radim

Dieter Plaetinck

Mar 8, 2011, 11:13:44 AM
to gen...@googlegroups.com
Radim,
thanks for getting back to me.


On Tue, Mar 8, 2011 at 4:29 PM, Radim <radimr...@seznam.cz> wrote:
On Mar 8, 5:13 pm, Dieter Plaetinck <die...@plaetinck.be> wrote:
> But installing atlas-lapack didn't bring a big gain.  Averages are

> Gain in what exactly, how do you measure it?

Well: my app goes over all documents and, for each document, calculates the top 10 similar documents (N^2 similarity calculations in total). I keep track of how many of the N outer iterations I can do per minute (i.e. "for how many documents per minute can I calculate this top 10?"). I use Python's timeit module (http://docs.python.org/library/timeit.html), and I measure it on chunks of the corpus, so that I can calculate averages as I go, based on the subsets of the corpus I've just finished. Although the results fluctuate, if I let it run for enough chunks I get a pretty good idea. (The measuring loop is sketched below.)
So currently I measure about 100-190 documents per minute, i.e. 100-190 times per minute "what are the top 10 most similar documents in the corpus for this document". But this number is the same whether I use atlas-lapack or not.
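
In code, the measuring loop is essentially just this (a sketch; `top10_for` stands for whatever computes the top 10 for one document):

from timeit import default_timer

def measure(docs, top10_for, report_every=500):
    start = default_timer()
    for i, doc in enumerate(docs, 1):
        top10_for(doc)
        if i % report_every == 0:
            rate = 60.0 * i / (default_timer() - start)
            print "%.0f documents/minute (averaged over %d documents)" % (rate, i)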
 
> LAPACK is a library for
> dense matrix operations, so the benefits can only be seen in
> `MatrixSimilarity` (and dense operations in general). Sparse matrix
> operations are hard-wired in `scipy.sparse`; they do not benefit from
> these specialized libraries. The sparse code is in C and is reasonably
> sane (esp. in newer versions of scipy), but nowhere near as optimized
> as these monster *PACKs.

Ooh... I'm using SparseMatrixSimilarity; I'll try MatrixSimilarity then and see what it gives. Hopefully it will be faster!
If I understand correctly, the similarity classes can be ordered like this:
MatrixSimilarity, SparseMatrixSimilarity, Similarity
(ordered from highest to lowest memory requirements and, hopefully, from fastest to slowest similarity calculations).

In case anyone is interested, I've been profiling with profilehooks and profilestats (the latter generates files which can be loaded in kcachegrind). From what I can interpret, the most expensive part happens in /usr/lib/python2.7/site-packages/scipy/sparse/compressed.py, function _mul_sparse_matrix on line 285, more specifically in these 2 routines:
{scipy.sparse.sparsetools._csr.csr_matmat_pass2}
{scipy.sparse.sparsetools._csr.csr_matmat_pass1}
SparseMatrixSimilarity::getSimilarities() always takes the `else` branch for me (causing a call to scipy.sparse.dok_matrix() followed by an iteration over all fields in the doc), but it seems like that's not a big deal.
 

> I'm thinking about writing a multiprocess layer on top of gensim so that i
> can distribute jobs of similarity calculations over my cores.  So I would
> create jobs to do the getSimilarities(corpus, vec) for a range of 100
> documents or so.
> You mentioned processing documents in chunks in your point 2, but AFAICT not
> in the context of multiprocessing?  So what's the idea there exactly? For a
> chunksize of Y you want to do ` N x T * T x Y = N x Y` ?  why would that be
> faster?

> Point 2) was about not multiplying corpus * query (one document at a
> time), but rather corpus * chunk (more query documents). It could be
> useful when you a) need to process many similarity queries and b)
> you're not required to wait for one query to finish before submitting
> the next. Then, instead of processing one query after another, you'd
> submit a batch of queries (perform only one multiplication) and
> possibly save some time. I'm not sure how much though.

Yes, I think I got that.  This seems very appropriate for my use case, but I want to understand why you think it could be faster.
Let's take my use case as an example; basically it means that for every document (i.e. N times), you do a calculation:
N x T * T x 1 = N x 1 (multiply the entire corpus matrix with a one-column matrix (the document vector))
so, for each document:
N*T multiplications
N*(T-1) additions.
Since you need to do this N times, this means you ultimately have:
N^2*T multiplications
N^2*(T-1) additions.

With the chunk approach, if you set chunksize = Y (do Y documents at a time).
you will do N/Y times the following matrix multiplication:

N x T * T x Y = N x Y
so, for each chunk:
N*T*Y multiplications
N*(T-1)*Y additions.
Since you need to do this N/Y times, this means you ultimately have:
N*T*Y * N/Y = N^2*T multiplications
N*(T-1)*Y * N/Y = N^2*(T-1) additions.
Which is the same as the first approach.  So, why do you think this would be any faster?
I don't see how it would be faster, at least not for MatrixSimilarity or SparseMatrixSimilarity.
If the user uses Similarity, and iterating the corpus is slow, then I think this would give a speedup of nearly a factor of Y
(but luckily I'm not in that situation :-)

Dieter

Radim

Mar 8, 2011, 12:31:30 PM
to gensim
> With the chunk approach, if you set chunksize = Y (do Y documents at a
> time).
> you will do N/Y times the following matrix multiplication:
> N x T * T x Y = N x Y
> so, for each chunk:
> N*T*Y multiplications
> N*(T-1)*Y additions.
> Since you need to do this N/Y times, this means you ultimately have:
> N*T*Y * N/Y = N^2*T multiplications
> N*(T-1)*Y * N/Y = N^2*(T-1) additions.
> Which is the same as the first approach.  So, why do you think this would be
> any faster?

Because matrix multiplication is memory bound, memory access patterns
are also very important, not just the number of * or +. In fact,
that's what an optimizing BLAS is all about: writing routines that
process the matrix in blocks, making good use of the caches and the TLB...

Anyway I am curious myself so I ran a test:

In [1]: import numpy

In [2]: index = numpy.random.rand(100000, 500)  # fake a corpus of 100k docs; each doc is represented by 500 numbers (topics)

In [3]: q = numpy.random.rand(1000, 500)  # 1k total queries

In [5]: time sims = numpy.column_stack(numpy.dot(index, query) for query in q)  # process queries one-by-one
CPU times: user 81.34 s, sys: 8.23 s, total: 89.57 s
Wall time: 112.32 s

In [6]: time sims2 = numpy.dot(index, q.T)  # all queries at once
CPU times: user 14.23 s, sys: 0.65 s, total: 14.87 s
Wall time: 9.19 s

The difference is massive! Also I noticed that for `sims2` (the matrix-
matrix multiplication), the computation was threaded, which is what
you wanted in the first place. The vector-matrix mult in `sims` only
ran on a single core.

This was for dense operations only. I am surprised to see there is a
solid speed-up even in the sparse case:

[1] import numpy, scipy.sparse

[2] index = scipy.sparse.csc_matrix(0.3 * (numpy.random.rand(5000, 20000) / 0.9).astype(int))  # fake 5k documents, 20k vocab, documents are 90% sparse

[3] queries = scipy.sparse.csc_matrix(0.3 * (numpy.random.rand(1000, 20000) / 0.9).astype(int))  # 1k queries

[8] time sims = numpy.column_stack((index * query.T).toarray().flatten() for query in queries)
CPU times: user 19.97 s, sys: 0.93 s, total: 20.90 s
Wall time: 21.07 s

[9] time sims2 = (index * queries.T).todense()
CPU times: user 9.04 s, sys: 0.27 s, total: 9.31 s
Wall time: 9.50 s


Radim

Dieter Plaetinck

Mar 9, 2011, 9:48:24 AM
to gen...@googlegroups.com
Interesting, I'll try to reproduce some of those results later. (For now I've been asked to calculate the similarities for the entire corpus asap, so I cannot work much on optimisations right now.)
From your simple test I would conclude:
- threading happens as soon as you chunk (at least for dense matrices; sparse matrices need the regular BLAS, so we don't know whether that one uses threading)
- chunking gives about a 100% speed increase with sparse matrices, and about 10x with dense matrices.
- we cannot really compare dense vs sparse because you used very different corpus/vocabulary sizes, but that would definitely be interesting as well (if dense is not faster than sparse, there's no reason to have it in gensim at all).

My thoughts:
- A function to support chunked multiplications seems fairly easy to implement; maybe this is the next step?
- I don't get why dense would be so fast, given there are so many zeroes to iterate over/multiply with each other. I also don't get why we couldn't make sparse matrix multiplications just as fast ("highly optimize them with a LAPACK-like library").
- I will personally avoid dense matrices in the long run because of the memory restrictions (although I would like to experiment more with them to get an idea) [*]
- I've been asked to strongly consider distributing my application with Hadoop (a map/reduce framework), so I'll be looking into that.

[*] I actually tried working with the dense similarity matrix, but I get a Python MemoryError; oddly enough the Python memory consumption is at only 13% when that happens :/ (VSZ=740MB RSS=506MB of the available 4GiB)

Dieter

Radim

Mar 10, 2011, 2:06:40 AM
to gensim
Hello Dieter,

On Mar 9, 9:48 pm, Dieter Plaetinck <die...@plaetinck.be> wrote:
> Interesting, I'll try to reproduce some of those results later.  (for now
> I'm asked to calculate the similarities for the entire corpus asap, so I
> cannot work much on optimisations right now)

Have a look at the `gensim.dmlcz.gensim_xml` module, where I did a similar
thing (pre-computing the 10 most similar articles for each article in a
digital library).

> from your simple test I would conclude:
> - threading happens as soon as you chunk (at least for dense matrices,
> sparse matrices needs the regular blas so we don't know if that one uses
> threading)

Sparse matrix operations are part of `scipy.sparse` and currently do
not use threading or any external BLAS library.

> - chunking gives about a 100% speed increase with sparse matrices, and 10x
> with dense matrices.
> - we cannot really compare dense vs sparse because you used highly different
> corpus/vocabulary sizes, but that would definitely be interesting aswell.
> (if dense is not faster then sparse, there's no reason to have it at all in
> gensim)

According to the test, computing similarities over 500 dense topics
(as opposed to 20k sparse words as in bag-of-words) is about 20x
faster per document pair: the dense batch run handled 100k x 1k pairs in
9.19s of wall time, versus 5k x 1k pairs in 9.50s for the sparse run.


> My thoughts:
> - A function to support chunked multiplications seems fairly easy to
> implement, maybe this is the next step?

I thought that is what this thread was about? With a line of code for
every post here, the work would be almost done...

> - I'm asked to strongly consider distributing my application with hadoop
> (map/reduce framework), so I'll be looking into that.

Other gensim users suggested Dumbo, https://github.com/klbostee/dumbo/wiki/
, a Python interface to Hadoop. It looks good! Shields you from the
Java ugliness to a degree, and not that much more work compared to
Pyro, with a lot of benefits too. Again, if you manage to code up
something useful Dieter, please do share. However, note that the map/
reduce framework targets batch processing, so the range of tasks it
can be applied to is limited.

Best,
Radim

Dieter Plaetinck

Apr 5, 2011, 1:03:32 PM
to gen...@googlegroups.com
Okay, what do you think is best to focus on?
Here are my thoughts:
* Building multiprocess code into gensim is pointless; that's what frameworks like Hadoop and others are good at, and we get multi-core for free with multi-node.
* Adding mmap support to `[Sparse]MatrixSimilarity` decreases the memory pressure of these classes: it would allow using the in-memory similarity classes for datasets that don't fit into memory. But this solution does not scale; disk reads remain a bottleneck. So I think it's a nice idea, but maybe not wise to spend time on right now.
* `Similarity` is slow: it needs a lot of read() calls, which are slow even when they come from the Linux block cache. I wonder how the performance of `[Sparse]MatrixSimilarity` with mmap would compare to `Similarity`; maybe there isn't much need for `Similarity` at all if we implement mmap support in the MatrixSimilarity classes.
* An instance of Similarity persisted to disk is interestingly big: 52MB. Where does this usage come from? My bleicorpus is 115MB (with a 6.4MB vocab file), but a Similarity object only has some functions and a normalize, numBest and corpus attribute, where the latter should also be a lightweight object because it doesn't "contain" documents. I would expect the size of a Similarity object on disk to be in the order of a few kB.

* How exactly do you see a distributed/"sharded" index?
With "index" you mean an instance of a corpus class, right?
A corpus can be looked at as a list of N document vectors, where each can have an arbitrary length, depending on how many different tokens ("features") it contains.
So we could indeed split the corpus into smaller chunks. That works when random single queries are distributed over the chunks and the results merged, but it's not ideal for the all-to-all use case: "for all N documents in the corpus (the subjects), give me the top numBest similar documents in the corpus (the targets)". If we store a set of X documents in a shard, then for each target in X we need to fetch N-X subjects from other nodes (or: for each subject in X, you need to fetch N-X targets from other nodes). An interesting thought would be to divide the "all2all" square into subsquares of subjects vs. targets, and store those subsquares on individual nodes, so that all the targets and subjects they need are local (see the sketch after this list).
The merging process would be a bit more involved, but not much, I think. The big trade-off here is that you get much more data locality (i.e. less network traffic) at the cost of storage space. Strict application of the map/reduce model doesn't even allow nodes to talk to each other, btw.

* We already noted that chunking (calculating similarities for x to y documents at the same time, where x,y > 1) is good for performance. So this can be implemented later, on top of the sharded-index codebase.

* We could write an "if you already calculated the docA,docB similarity, don't calculate docB,docA" system. This will probably work fine if we have many little subsquares, but otherwise the bookkeeping itself might introduce too noticeable an overhead.
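
To make the subsquare idea concrete, a toy sketch of the block layout (nothing gensim-specific; each (rows, cols) pair would live on one node, which then only needs those two ranges of documents locally):

def subsquares(n_docs, block):
    # partition the N x N "all2all" square into block x block tiles
    for i in range(0, n_docs, block):
        for j in range(0, n_docs, block):
            yield (range(i, min(i + block, n_docs)),
                   range(j, min(j + block, n_docs)))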


I think that's pretty much it, I might have forgotten something but I gotta go. :)

Dieter


Radim

May 14, 2011, 7:39:20 AM
to gensim
I cleaned up the code of similarity queries yesterday. The result is
simpler, faster code, plus I also added chunked querying (asking for
the similarity of more documents in a single query: `sims =
index[corpus]` instead of just `sim = index[document]`).
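
Usage-wise that's all there is to it (a sketch; building the index itself is unchanged):

sims_one = index[some_document]  # one query document => one vector of similarities
sims_all = index[corpus]         # the whole corpus as queries => one result row per query document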

Some numbers: to get all vs. all similarities, on a sub-corpus of wiki
articles, the 0.7.8 code takes:

* 63s (5k articles, 200 dense LSA features)
* 127s (same 5k articles, 64k sparse TF-IDF features, 0.2% density)

The new code, same dataset:
* 0.8s dense
* 2.3s sparse

So the speed-up is solid. The code is in the `simspeed_wip` branch on
github. Different sizes of chunks give different results; I created a
script for running the speed test automatically, it's in `src/gensim/
test/simspeed.py`:

2011-05-14 13:08:52,874 : INFO : running ./simspeed.py
wikismall.dense.mm wikismall.sparse.mm 5000
2011-05-14 13:08:52,874 : INFO : initializing corpus reader from
wikismall.dense.mm
2011-05-14 13:08:52,875 : INFO : accepted corpus with 10000 documents,
200 features, 2000000 non-zero entries
2011-05-14 13:08:57,867 : INFO : initializing corpus reader from
wikismall.sparse.mm
2011-05-14 13:08:57,867 : INFO : accepted corpus with 10000 documents,
64538 features, 1305021 non-zero entries
2011-05-14 13:09:01,785 : INFO : scanning corpus to determine the
number of features
2011-05-14 13:09:02,031 : INFO : creating matrix for 5000 documents
and 200 features
2011-05-14 13:09:02,032 : INFO : PROGRESS: at document #0/5000
2011-05-14 13:09:02,158 : INFO : PROGRESS: at document #1000/5000
2011-05-14 13:09:02,319 : INFO : PROGRESS: at document #2000/5000
2011-05-14 13:09:02,458 : INFO : PROGRESS: at document #3000/5000
2011-05-14 13:09:02,608 : INFO : PROGRESS: at document #4000/5000
2011-05-14 13:09:02,762 : INFO : creating sparse matrix for 5000
documents
2011-05-14 13:09:02,770 : INFO : PROGRESS: at document #0/5000
2011-05-14 13:09:06,406 : INFO : created <5000x64538 sparse matrix of
type '<type 'numpy.float32'>'
with 645303 stored elements in Compressed Sparse Row format>
2011-05-14 13:09:06,406 : INFO : test 1: similarity of all vs. all,
5000 documents, 200 features
2011-05-14 13:09:19,168 : INFO : chunks=0, time=7.9255s (630.88 docs/
s)
2011-05-14 13:09:30,441 : INFO : chunks=1, time=6.3004s (793.60 docs/
s, 793.60 queries/s), meandiff=0.000e+00
2011-05-14 13:09:39,464 : INFO : chunks=5, time=2.9537s (1692.77 docs/
s, 338.55 queries/s), meandiff=1.426e-08
2011-05-14 13:09:46,833 : INFO : chunks=10, time=1.6014s (3122.27 docs/
s, 312.23 queries/s), meandiff=1.426e-08
2011-05-14 13:09:53,771 : INFO : chunks=100, time=0.8181s (6111.98
docs/s, 61.12 queries/s), meandiff=1.373e-08
2011-05-14 13:10:00,722 : INFO : chunks=200, time=0.7031s (7111.68
docs/s, 35.56 queries/s), meandiff=1.346e-08
2011-05-14 13:10:07,520 : INFO : chunks=500, time=0.6312s (7920.99
docs/s, 15.84 queries/s), meandiff=1.341e-08
2011-05-14 13:10:14,363 : INFO : chunks=1000, time=0.6073s (8233.38
docs/s, 8.23 queries/s), meandiff=1.341e-08
2011-05-14 13:10:14,377 : INFO : test 2: as above, but only ask for
top-10 most similar for each document
2011-05-14 13:10:23,455 : INFO : chunks=0, time=9.0775s (550.82 docs/
s, 550.82 queries/s)
2011-05-14 13:10:32,836 : INFO : chunks=1, time=9.3814s (532.97 docs/
s, 532.97 queries/s)
2011-05-14 13:10:38,861 : INFO : chunks=5, time=6.0246s (829.93 docs/
s, 165.99 queries/s)
2011-05-14 13:10:44,037 : INFO : chunks=10, time=5.1761s (965.98 docs/
s, 96.60 queries/s)
2011-05-14 13:10:48,455 : INFO : chunks=100, time=4.4171s (1131.98
docs/s, 11.32 queries/s)
2011-05-14 13:10:52,512 : INFO : chunks=200, time=4.0562s (1232.69
docs/s, 6.16 queries/s)
2011-05-14 13:10:56,781 : INFO : chunks=500, time=4.2689s (1171.26
docs/s, 2.34 queries/s)
2011-05-14 13:11:00,876 : INFO : chunks=1000, time=4.0949s (1221.05
docs/s, 1.22 queries/s)
2011-05-14 13:11:00,876 : INFO : test 3: sparse all vs. all, 5000
documents, 64538 features, 0.20% density
2011-05-14 13:12:14,086 : INFO : chunks=0, time=67.9235s (73.61 docs/
s)
2011-05-14 13:12:36,204 : INFO : chunks=5, time=16.4034s (304.81 docs/
s, 60.96 queries/s), meadiff=0.000e+00
2011-05-14 13:12:52,788 : INFO : chunks=10, time=10.5228s (475.16 docs/
s, 47.52 queries/s), meadiff=0.000e+00
2011-05-14 13:13:02,387 : INFO : chunks=100, time=3.5809s (1396.29
docs/s, 13.96 queries/s), meadiff=0.000e+00
2011-05-14 13:13:10,812 : INFO : chunks=500, time=2.3692s (2110.39
docs/s, 4.22 queries/s), meadiff=0.000e+00
2011-05-14 13:13:19,248 : INFO : chunks=1000, time=2.3263s (2149.31
docs/s, 2.15 queries/s), meadiff=0.000e+00
2011-05-14 13:13:27,765 : INFO : chunks=5000, time=2.3704s (2109.35
docs/s, 0.42 queries/s), meadiff=0.000e+00
2011-05-14 13:13:27,779 : INFO : finished running simspeed.py

(I uploaded the corpus I used to http://nlp.fi.muni.cz/projekty/gensim/wikismall.tgz
)

The `simspeed_wip` branch is WIP = work in progress; I expect there will
be some rebasing before merging into develop. I would appreciate it if
you could run the speed script or try the new code in your projects,
to test and debug it (Dieter?). This feature will be part of the 0.8
release :)

Cheers,
Radim

Alan James Salmoni

May 15, 2011, 9:38:15 AM
to gensim
That is quite an improvement! I cannot promise to help with testing and
debugging this because my family and I are moving to a new address very
soon, but I'll report back with what I can.

All the best,

Alan

Dieter Plaetinck

May 17, 2011, 3:39:35 AM
to gen...@googlegroups.com
> I would appreciate it if
> you could run the speed script or try the new code in your projects,
> to test and debug it (Dieter?). This feature will be part of the 0.8
> release :)
>
> Cheers,
> Radim

Great news.
I'm very much looking forward to testing this, but this week I'll be occupied with a deadline, so hopefully next week.

Dieter

Radim

May 25, 2011, 3:39:38 PM
to gensim
Merged into develop now. I also added a couple more speed tests.

I'll copy & paste my results here; if you get results 2 orders of
magnitude slower, you know something is wrong with your numpy/scipy
setup :)
(dataset here: http://nlp.fi.muni.cz/projekty/gensim/wikismall.tgz)

$ ./simspeed.py wikismall.dense.mm wikismall.sparse.mm 5000
INFO : running ./simspeed.py wikismall.dense.mm wikismall.sparse.mm
5000
INFO : initializing corpus reader from wikismall.dense.mm
INFO : accepted corpus with 10000 documents, 200 features, 2000000 non-
zero entries
INFO : initializing corpus reader from wikismall.sparse.mm
INFO : accepted corpus with 10000 documents, 64538 features, 1305021
non-zero entries
INFO : scanning corpus to determine the number of features
INFO : creating matrix for 5000 documents and 200 features
INFO : PROGRESS: at document #0/5000
INFO : PROGRESS: at document #1000/5000
INFO : PROGRESS: at document #2000/5000
INFO : PROGRESS: at document #3000/5000
INFO : PROGRESS: at document #4000/5000
INFO : creating sparse index
INFO : creating sparse matrix from corpus
INFO : PROGRESS: at document #0
INFO : created <5000x64538 sparse matrix of type '<type
'numpy.float32'>'
with 645303 stored elements in Compressed Sparse Row format>
INFO : test 1 (dense): similarity of all vs. all (5000 documents, 200
dense features)
INFO : chunks=0, time=6.4224s (778.53 docs/s)
INFO : chunks=1, time=6.6099s (756.44 docs/s, 756.44 queries/s),
meandiff=0.000e+00
INFO : chunks=4, time=3.6077s (1385.93 docs/s, 346.48 queries/s),
meandiff=1.426e-08
INFO : chunks=8, time=1.8560s (2693.97 docs/s, 336.75 queries/s),
meandiff=1.426e-08
INFO : chunks=16, time=1.2657s (3950.50 docs/s, 247.30 queries/s),
meandiff=1.426e-08
INFO : chunks=64, time=0.7981s (6265.21 docs/s, 98.99 queries/s),
meandiff=1.343e-08
INFO : chunks=128, time=0.6812s (7339.75 docs/s, 58.72 queries/s),
meandiff=1.343e-08
INFO : chunks=256, time=0.6614s (7560.13 docs/s, 30.24 queries/s),
meandiff=1.343e-08
INFO : chunks=512, time=0.6379s (7838.33 docs/s, 15.68 queries/s),
meandiff=1.343e-08
INFO : chunks=1024, time=0.6122s (8166.92 docs/s, 8.17 queries/s),
meandiff=1.337e-08
INFO : test 2 (dense): as above, but only ask for the top-10 most
similar for each document
INFO : chunks=0, time=9.5552s (523.28 docs/s, 523.28 queries/s)
INFO : chunks=1, time=9.7479s (512.93 docs/s, 512.93 queries/s)
INFO : chunks=4, time=6.8546s (729.43 docs/s, 182.36 queries/s)
INFO : chunks=8, time=5.5134s (906.88 docs/s, 113.36 queries/s)
INFO : chunks=16, time=4.9587s (1008.33 docs/s, 63.12 queries/s)
INFO : chunks=64, time=4.2042s (1189.30 docs/s, 18.79 queries/s)
INFO : chunks=128, time=4.6684s (1071.02 docs/s, 8.57 queries/s)
INFO : chunks=256, time=4.4773s (1116.74 docs/s, 4.47 queries/s)
INFO : chunks=512, time=4.8769s (1025.24 docs/s, 2.05 queries/s)
INFO : chunks=1024, time=4.5585s (1096.85 docs/s, 1.10 queries/s)
INFO : test 3 (sparse): similarity of all vs. all (5000 documents,
64538 features, 0.20% density)
INFO : chunks=0, time=70.6080s (70.81 docs/s)
INFO : chunks=5, time=16.6838s (299.69 docs/s, 59.94 queries/s),
meandiff=0.000e+00
INFO : chunks=10, time=10.5134s (475.58 docs/s, 47.56 queries/s),
meandiff=0.000e+00
INFO : chunks=100, time=3.6064s (1386.43 docs/s, 13.86 queries/s),
meandiff=0.000e+00
INFO : chunks=500, time=2.3292s (2146.69 docs/s, 4.29 queries/s),
meandiff=0.000e+00
INFO : chunks=1000, time=2.2676s (2204.97 docs/s, 2.20 queries/s),
meandiff=0.000e+00
INFO : chunks=5000, time=2.3709s (2108.87 docs/s, 0.42 queries/s),
meandiff=0.000e+00
INFO : test 4 (sparse): as above, but only ask for the top-10 most
similar for each document
INFO : chunks=0, time=68.9583s (72.51 docs/s, 72.51 queries/s)
INFO : chunks=5, time=19.5296s (256.02 docs/s, 51.20 queries/s)
INFO : chunks=10, time=13.5711s (368.43 docs/s, 36.84 queries/s)
INFO : chunks=100, time=6.8343s (731.61 docs/s, 7.32 queries/s)
INFO : chunks=500, time=5.4134s (923.63 docs/s, 1.85 queries/s)
INFO : chunks=1000, time=5.4764s (913.01 docs/s, 0.91 queries/s)
INFO : chunks=5000, time=5.5655s (898.39 docs/s, 0.18 queries/s)
INFO : test 5 (dense): dense corpus of 1000 docs vs. index (5000
documents, 200 dense features)
INFO : chunks=1, time=1.5808s (632.60 docs/s, 632.60 queries/s)
INFO : chunks=4, time=0.9604s (1041.23 docs/s, 260.31 queries/s)
INFO : chunks=8, time=0.7054s (1417.63 docs/s, 177.20 queries/s)
INFO : chunks=16, time=0.8139s (1228.69 docs/s, 77.41 queries/s)
INFO : chunks=64, time=0.8691s (1150.63 docs/s, 18.41 queries/s)
INFO : chunks=128, time=0.6678s (1497.39 docs/s, 11.98 queries/s)
INFO : chunks=256, time=1.0467s (955.40 docs/s, 3.82 queries/s)
INFO : chunks=512, time=0.8719s (1146.89 docs/s, 2.29 queries/s)
INFO : chunks=1024, time=0.8716s (1147.34 docs/s, 1.15 queries/s)
INFO : test 6 (sparse): sparse corpus of 1000 docs vs. sparse index
(5000 documents, 64538 features, 0.20% density)
INFO : chunks=1, time=11.1445s (89.73 docs/s, 89.73 queries/s)
INFO : chunks=5, time=3.1074s (321.81 docs/s, 64.36 queries/s)
INFO : chunks=10, time=2.0758s (481.74 docs/s, 48.17 queries/s)
INFO : chunks=100, time=1.0792s (926.59 docs/s, 9.27 queries/s)
INFO : chunks=500, time=0.8421s (1187.51 docs/s, 2.38 queries/s)
INFO : chunks=1000, time=0.8604s (1162.22 docs/s, 1.16 queries/s)
INFO : finished running simspeed.py

Chris Wj

May 25, 2011, 3:41:43 PM
to gen...@googlegroups.com
There is also numexpr that you can look into for speeding up some things and using multiple cores. I have used it successfully recently.
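
For example (just a minimal illustration of numexpr; it compiles the string expression and can evaluate it with multiple threads):

import numexpr
import numpy

a = numpy.random.rand(10000000)
b = numpy.random.rand(10000000)
c = numexpr.evaluate("2*a + 3*b")  # elementwise expression evaluated by numexpr's own VM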

Dieter Plaetinck

Jun 7, 2011, 10:18:54 AM
to gen...@googlegroups.com
Radim,
- I see very similar results to yours. I'm looking forward to testing this on my own corpus, but I need to refactor and bugfix some other stuff first.
- Chris: numexpr looks nifty.

Radim Řehůřek

Jul 22, 2012, 7:31:41 AM
to Radim, gensim
I upgraded my MacBook Pro (late 2009 => mid-2012 model); for reference,
here are the new results of the performance tests on similarity querying.

Things are about 3x faster now, though there have been optimizations
to gensim in the meantime, so not all of this is due to the HW
upgrade.

./test/simspeed.py wikismall.dense.mm wikismall.sparse.mm 5000
2012-07-22 13:13:01,988 : INFO : accepted corpus with 10000 documents,
200 features, 2000000 non-zero entries
2012-07-22 13:13:02,003 : INFO : accepted corpus with 10000 documents,
64538 features, 1305021 non-zero entries
2012-07-22 13:13:07,294 : INFO : scanning corpus to determine the
number of features
2012-07-22 13:13:07,416 : INFO : creating matrix for 5000 documents
and 200 features
2012-07-22 13:13:07,915 : INFO : creating sparse index
2012-07-22 13:13:07,915 : INFO : creating sparse matrix from corpus
2012-07-22 13:13:07,916 : INFO : PROGRESS: at document #0
2012-07-22 13:13:08,536 : INFO : created <5000x64538 sparse matrix of
type '<type 'numpy.float32'>'
with 645303 stored elements in Compressed Sparse Row format>
2012-07-22 13:13:08,536 : INFO : test 1 (dense): dense corpus of 1000
docs vs. index (5000 documents, 200 dense features)
2012-07-22 13:13:09,215 : INFO : chunksize=1, time=0.6780s (1474.88
docs/s, 1474.88 queries/s)
2012-07-22 13:13:09,848 : INFO : chunksize=4, time=0.6334s (1578.77
docs/s, 394.69 queries/s)
2012-07-22 13:13:10,270 : INFO : chunksize=8, time=0.4212s (2373.92
docs/s, 296.74 queries/s)
2012-07-22 13:13:10,575 : INFO : chunksize=16, time=0.3051s (3278.08
docs/s, 206.52 queries/s)
2012-07-22 13:13:10,842 : INFO : chunksize=64, time=0.2675s (3738.12
docs/s, 59.81 queries/s)
2012-07-22 13:13:11,105 : INFO : chunksize=128, time=0.2628s (3805.83
docs/s, 30.45 queries/s)
2012-07-22 13:13:11,363 : INFO : chunksize=256, time=0.2571s (3888.90
docs/s, 15.56 queries/s)
2012-07-22 13:13:11,618 : INFO : chunksize=512, time=0.2553s (3916.64
docs/s, 7.83 queries/s)
2012-07-22 13:13:11,914 : INFO : chunksize=1024, time=0.2959s (3379.22
docs/s, 3.38 queries/s)
2012-07-22 13:13:11,914 : INFO : test 2 (sparse): sparse corpus of
1000 docs vs. sparse index (5000 documents, 64538 features, 0.20%
density)
2012-07-22 13:13:15,548 : INFO : chunksize=1, time=3.6339s (275.19
docs/s, 275.19 queries/s)
2012-07-22 13:13:16,739 : INFO : chunksize=5, time=1.1901s (840.27
docs/s, 168.05 queries/s)
2012-07-22 13:13:17,614 : INFO : chunksize=10, time=0.8749s (1142.94
docs/s, 114.29 queries/s)
2012-07-22 13:13:18,059 : INFO : chunksize=100, time=0.4449s (2247.87
docs/s, 22.48 queries/s)
2012-07-22 13:13:18,389 : INFO : chunksize=500, time=0.3301s (3029.03
docs/s, 6.06 queries/s)
2012-07-22 13:13:18,713 : INFO : chunksize=1000, time=0.3234s (3092.29
docs/s, 3.09 queries/s)
2012-07-22 13:13:18,713 : INFO : test 3 (dense): similarity of all vs.
all (5000 documents, 200 dense features)
2012-07-22 13:13:23,781 : INFO : chunksize=0, time=2.4236s (2063.09
docs/s)
2012-07-22 13:13:28,992 : INFO : chunksize=1, time=2.3609s (2117.83
docs/s, 2117.83 queries/s), meandiff=0.000e+00
2012-07-22 13:13:33,080 : INFO : chunksize=4, time=1.3450s (3717.59
docs/s, 929.40 queries/s), meandiff=1.748e-08
2012-07-22 13:13:36,541 : INFO : chunksize=8, time=0.6613s (7561.30
docs/s, 945.16 queries/s), meandiff=1.883e-08
2012-07-22 13:13:39,778 : INFO : chunksize=16, time=0.4335s (11535.12
docs/s, 722.10 queries/s), meandiff=1.883e-08
2012-07-22 13:13:42,906 : INFO : chunksize=64, time=0.2920s (17124.88
docs/s, 270.57 queries/s), meandiff=1.883e-08
2012-07-22 13:13:46,129 : INFO : chunksize=128, time=0.2672s (18709.99
docs/s, 149.68 queries/s), meandiff=1.883e-08
2012-07-22 13:13:49,770 : INFO : chunksize=256, time=0.2499s (20009.04
docs/s, 80.04 queries/s), meandiff=1.883e-08
2012-07-22 13:13:53,839 : INFO : chunksize=512, time=0.2082s (24012.05
docs/s, 48.02 queries/s), meandiff=1.883e-08
2012-07-22 13:13:57,956 : INFO : chunksize=1024, time=0.2033s
(24596.24 docs/s, 24.60 queries/s), meandiff=1.883e-08
2012-07-22 13:13:57,971 : INFO : test 4 (dense): as above, but only
ask for the top-10 most similar for each document
2012-07-22 13:14:02,521 : INFO : chunksize=0, time=4.5502s (1098.84
docs/s, 1098.84 queries/s)
2012-07-22 13:14:07,008 : INFO : chunksize=1, time=4.4863s (1114.51
docs/s, 1114.51 queries/s)
2012-07-22 13:14:10,542 : INFO : chunksize=4, time=3.5342s (1414.76
docs/s, 353.69 queries/s)
2012-07-22 13:14:13,214 : INFO : chunksize=8, time=2.6719s (1871.32
docs/s, 233.92 queries/s)
2012-07-22 13:14:15,712 : INFO : chunksize=16, time=2.4977s (2001.85
docs/s, 125.32 queries/s)
2012-07-22 13:14:18,065 : INFO : chunksize=64, time=2.3533s (2124.66
docs/s, 33.57 queries/s)
2012-07-22 13:14:20,344 : INFO : chunksize=128, time=2.2783s (2194.59
docs/s, 17.56 queries/s)
2012-07-22 13:14:22,713 : INFO : chunksize=256, time=2.3688s (2110.82
docs/s, 8.44 queries/s)
2012-07-22 13:14:25,059 : INFO : chunksize=512, time=2.3463s (2131.03
docs/s, 4.26 queries/s)
2012-07-22 13:14:27,508 : INFO : chunksize=1024, time=2.4487s (2041.90
docs/s, 2.04 queries/s)
2012-07-22 13:14:27,508 : INFO : test 5 (sparse): similarity of all
vs. all (5000 documents, 64538 features, 0.20% density)
2012-07-22 13:15:12,522 : INFO : chunksize=0, time=42.3504s (118.06
docs/s)
2012-07-22 13:15:24,321 : INFO : chunksize=5, time=8.9180s (560.66
docs/s, 112.13 queries/s), meandiff=0.000e+00
2012-07-22 13:15:32,827 : INFO : chunksize=10, time=5.6751s (881.04
docs/s, 88.10 queries/s), meandiff=0.000e+00
2012-07-22 13:15:37,634 : INFO : chunksize=100, time=1.8678s (2676.95
docs/s, 26.77 queries/s), meandiff=0.000e+00
2012-07-22 13:15:41,709 : INFO : chunksize=500, time=1.0960s (4562.06
docs/s, 9.12 queries/s), meandiff=0.000e+00
2012-07-22 13:15:45,663 : INFO : chunksize=1000, time=0.9768s (5118.52
docs/s, 5.12 queries/s), meandiff=0.000e+00
2012-07-22 13:15:49,489 : INFO : chunksize=5000, time=0.9335s (5356.46
docs/s, 1.07 queries/s), meandiff=0.000e+00
2012-07-22 13:15:49,504 : INFO : test 6 (sparse): as above, but only
ask for the top-10 most similar for each document
2012-07-22 13:16:33,699 : INFO : chunksize=0, time=44.1950s (113.13
docs/s, 113.13 queries/s)
2012-07-22 13:16:44,353 : INFO : chunksize=5, time=10.6540s (469.31
docs/s, 93.86 queries/s)
2012-07-22 13:16:51,806 : INFO : chunksize=10, time=7.4524s (670.92
docs/s, 67.09 queries/s)
2012-07-22 13:16:55,400 : INFO : chunksize=100, time=3.5940s (1391.19
docs/s, 13.91 queries/s)
2012-07-22 13:16:58,343 : INFO : chunksize=500, time=2.9426s (1699.19
docs/s, 3.40 queries/s)
2012-07-22 13:17:01,246 : INFO : chunksize=1000, time=2.9036s (1721.99
docs/s, 1.72 queries/s)
2012-07-22 13:17:04,238 : INFO : chunksize=5000, time=2.9911s (1671.65
docs/s, 0.33 queries/s)
2012-07-22 13:17:04,238 : INFO : finished running simspeed.py

-rr