Re: [gensim:9976] LSI bottlenecks (the "action matrix")


Radim Řehůřek

Dec 27, 2017, 5:32:33 PM
to gensim
Hi Brian,

welcome back :) You're trying the new CSC-corpus-all-in-memory code path, right?

First, you can compress the enwiki_tfidf.mm file (.gz, .bz2), to save space.

To your question: it has nothing to do with QR. Some parts of the single-pass SVD use scipy.sparse for the sparse matrix operations (that's the csr_matvecs_thunk you're seeing in perf, I suspect).

Unfortunately scipy.sparse is not parallelized -- this is not BLAS (BLAS is for dense vectors and matrices). It's just C routines, compiled inside scipy. Whenever sparse matrices or vectors are multiplied, the processing runs on a single core only.
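To make the bottleneck concrete, here is a toy scipy sketch (shapes are made up, purely illustrative) of the kind of call that dispatches to the compiled csr_matvecs routine:

```python
import numpy as np
import scipy.sparse as sp

# Small stand-in for a TF-IDF corpus chunk (hypothetical shapes).
corpus = sp.random(1000, 200, density=0.01, format="csr", random_state=42)

# Multiplying a sparse matrix by a block of dense vectors dispatches to
# scipy's compiled csr_matvecs routine: plain C, no BLAS, single-threaded,
# so it occupies one core no matter how many the machine has.
rng = np.random.default_rng(42)
dense_block = rng.standard_normal((200, 16))
result = corpus @ dense_block
```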

One option would be to plug in a better-optimized sparse BLAS, but I'm not aware of a good sparse BLAS library. Any tips?

HTH,
Radim




On Wednesday, December 27, 2017 at 10:50:15 PM UTC+1, Brian Mingus wrote:
Hey Radim - 

Looking at lsimodel.py and matutils.py (qr_destroy), there is a lot of memory optimization.

I wonder though, is this preventing LAPACK from using multiple cores for the QR decomposition?

On Wed, Dec 27, 2017 at 2:26 PM, Brian Mingus <refle...@gmail.com> wrote:

I have already run experiments with 5,000 topics, where LSI used 270 GB of RAM and took six hours, so now I am trying 10,000.

The output looks like this:

2017-12-27 18:38:30,541 : INFO : loaded corpus index from enwiki_tfidf.mm.index
2017-12-27 18:38:30,542 : INFO : initializing corpus reader from enwiki_tfidf.mm
2017-12-27 18:38:30,542 : INFO : accepted corpus with 4385561 documents, 100000 features, 689946173 non-zero entries
2017-12-27 18:39:11,674 : INFO : using serial LSI version on this node
2017-12-27 18:39:11,674 : INFO : updating model with new documents
2017-12-27 18:39:11,675 : INFO : using 100 extra samples and 2 power iterations
2017-12-27 18:39:11,675 : INFO : 1st phase: constructing (100000, 10100) action matrix
2017-12-27 20:45:58,537 : INFO : orthonormalizing (100000, 10100) action matrix


So I check in a couple of hours later and we are only using 80 GB of RAM and 1 CPU. `perf top` looks like this:

  98.55%  _sparsetools.cpython-36m-x86_64-linux-gnu.so  [.] csr_matvecs_thunk

So constructing and orthonormalizing the action matrix takes a huge amount of time, using relatively little resources, and costing a lot of money. 

I am actually not sure what the "action matrix" is - is optimizing these steps something we can look into? 

--
You received this message because you are subscribed to a topic in the Google Groups "gensim" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/gensim/BVEtU6m-QVI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to gensim+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Radim Řehůřek

Dec 27, 2017, 5:44:07 PM
to gensim
Actually, the operation we need is simply `dense = CSC * normal(0.0, 1.0)`.

It sounds simple enough that we don't need any fancy BLAS; implementing this one specific operation in low-level C/Cython, with multiple threads, may be the simplest and fastest option.
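For reference, a pure-Python/NumPy sketch of what that kernel computes (toy shapes, illustrative only). The column loop is the part a C/Cython version would split across threads, taking care that different columns can write to overlapping output rows:

```python
import numpy as np
import scipy.sparse as sp

def csc_times_random(X, k, seed=0):
    """Reference for `dense = CSC * normal(0.0, 1.0)`.

    Walks the CSC structure the way a low-level kernel would. A threaded
    version would split the column loop across threads, but note that two
    columns can touch the same output rows, so it would need per-thread
    accumulators (or atomics) when merging.
    """
    rng = np.random.default_rng(seed)
    O = rng.standard_normal((X.shape[1], k))  # the random test matrix
    Y = np.zeros((X.shape[0], k))
    for j in range(X.shape[1]):               # the parallelizable loop
        start, end = X.indptr[j], X.indptr[j + 1]
        rows = X.indices[start:end]           # non-zero rows of column j
        vals = X.data[start:end]
        Y[rows] += vals[:, None] * O[j]       # column j's contribution
    return Y, O

X = sp.random(500, 300, density=0.01, format="csc", random_state=1)
Y, O = csc_times_random(X, k=8)
```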

Do you want to try your hand at this?

-rr

Radim Řehůřek

Dec 27, 2017, 5:57:54 PM
to gensim
The typical application of Gensim is highly sparse data (<1% of entries non-zero -- your corpus seems to be at 0.16%). So representing a chunk of documents as a dense array would be inefficient, in both RAM and computation speed.

In particular, that bottleneck operation is `dense = CSC * normal`, with shapes `(100000, 10100) = (100000, chunksize) * (chunksize, 10100)`.

Rewriting that operation for thread parallelism and to avoid the extra intermediate large array allocations should be both much faster and more memory-efficient.
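In NumPy/scipy terms, a chunked sketch (toy shapes, illustrative only) that accumulates the product so only one chunk-sized slice of the random matrix is ever alive:

```python
import numpy as np
import scipy.sparse as sp

def build_action_matrix(corpus_csc, k, chunksize=1000, seed=0):
    """Accumulate Y = corpus * O chunk by chunk; only a (chunksize, k)
    slice of the random test matrix O exists at any one time."""
    num_terms, num_docs = corpus_csc.shape
    rng = np.random.default_rng(seed)
    Y = np.zeros((num_terms, k))
    for start in range(0, num_docs, chunksize):
        chunk = corpus_csc[:, start:start + chunksize]      # (num_terms, c)
        o_chunk = rng.standard_normal((chunk.shape[1], k))  # (c, k), transient
        Y += chunk @ o_chunk                                # no large intermediates
    return Y

# Toy corpus standing in for the real (100000, 4385561) term-document matrix.
corpus = sp.random(50, 40, density=0.1, format="csc", random_state=3)
Y = build_action_matrix(corpus, k=4, chunksize=7, seed=5)
```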

Best,
Radim


On Wednesday, December 27, 2017 at 11:43:32 PM UTC+1, Brian Mingus wrote:
Hello again, Radim :)

Since these are one-off models that are useful for a variety of purposes (perhaps you want them in your model repo?), I just spun up a very large AWS instance with lots of RAM and very high provisioned IOPS. So no need to compress anything. We want it to finish as fast as possible and disk is not a bottleneck.

There is also gobs of RAM (up to 4TB) and lots of CPU (up to 128). 

Yesterday I noticed that a lot of time was being spent using just one core and not much RAM. And then it finally got to work and used hundreds of GB of RAM.

I assumed that this was because the matrix was dense in memory.

So why not do everything dense?


Radim Řehůřek

Dec 27, 2017, 6:00:28 PM
to gensim
(And we could then also get rid of scipy.sparse.sparsetools, which would be awesome! That module is not terribly efficient, and IIRC it's also been deprecated, so newer scipy versions will be even less efficient.)

-rr

Radim Řehůřek

Jan 2, 2018, 7:32:04 AM
to gensim
Yeah, that's often the trouble with academic packages :( Project ends / funding runs out, and then it's dead—you're on your own.

I still think optimizing that `corpus_sparse * random_dense` operation natively is our best bet. And probably also the easiest, since bringing in external low-level dependencies is often highly non-trivial. Even plain old BLAS can be a PITA.

All the best in 2018 :)

Radim


On Sunday, December 31, 2017 at 3:02:29 AM UTC+1, Brian Mingus wrote:
There is also the Python package KDT - Knowledge Discovery Toolbox. http://kdt.sourceforge.net

This package is of particular interest because it builds on Combinatorial BLAS, which is being used to guide the GraphBLAS standard.

It hasn't been updated since 2013 :/


On Sat, Dec 30, 2017 at 6:25 PM, Brian Mingus <refle...@gmail.com> wrote:
Radim - 

I did some more work on this and dug up this comparison. Most of the libraries that support sparse matrices do not support Python, or have no Python wrapper. http://www.netlib.org/utk/people/JackDongarra/la-sw.html

I could not get sppy working: https://github.com/charanpald/sppy

I got pyrsb working, but it was crashing due to stack smashing. The author of librsb wrote this wrapper, so it's somewhat interesting, if immature. https://github.com/michelemartone/pyrsb

'sparse' is related to Dask, but has no plans to support parallelism https://github.com/mrocklin/sparse/

SLEPc looks quite interesting. http://slepc.upv.es/ The main downside is the MPI dependency, and it may be hard to make it easy to actually get running. You can pip install it, it is actively developed, and you could probably replace the entire LSI implementation with it, which in some regards seems like overkill, but is also kind of cool. I also can't immediately tell how it works... there is a MatMult_MPIAIJCUSPARSE function, so it may depend on every node having a GPU with CUDA installed. Sounds like a huge pain.

I also took another look at bhSparse, in particular https://github.com/bhSPARSE/Benchmark_SpGEMM_using_CSR

Unfortunately I could not get it to compile..

Obviously by now I could have written a simple sparse matrix * random matrix routine myself, but sparse matrix computations are important to me for other reasons as well. I am actually surprised there is nothing great here for Python.

Brian




Radim Řehůřek

Jan 5, 2018, 12:50:51 PM
to gensim
That's some awesome work!

Tensorflow is too crude a tool for this; I wouldn't expect any help there (unless they implemented these low-level parallel routines directly, the communication overhead would kill any sort of fine-grained optimization).

Regarding the other libraries -- what are the options? Can we bundle them in Gensim (in which case the question is their license, maturity in terms of multi-platform support, supported dtypes etc), or would we rely on them as a 3rd party dependency (which brings additional questions of stability, support, maintenance)?

But 30x speed-up is certainly worth it. That's in line with what I'd expect: a fully optimized rewrite of the whole SVD thing should bring that much, starting from faster input reading (reading the sparse CSC matrix in blocks, directly filling the C structures) to parallelized BLAS QR, avoiding any temporary copies or memory allocations along the way, etc.

Very exciting -- please let me know how you progress!

Radim



On Friday, January 5, 2018 at 12:04:51 AM UTC+1, Brian Mingus wrote:
Hey Radim - 

I implemented sparse*dense with tensorflow's built-in routine (and tensorflow eager - 10x less code), but it turns out that tensorflow's sparse*dense routine considers the entire computation to be one "OP" and so it's not parallelized. I also tried on GPUs, but you hit the memory limit very fast.

I started implementing in Cython as the algorithm is simple.  While Googling for issues I was having I stumbled upon this ancient gist which claims to implement Parallel Sparse Matrix Dense Matrix Product in C/Cython/Python.


It relies on a C library written by Professor Tim Davis called CSparse. It is LGPL, like gensim. The Cython wrapper just implements pmultiply, which uses Cython's built-in parallelism and OpenMP. pmultiply has no license and so is technically copyrighted, all rights reserved. I cc'd the author in case he is willing to relicense it under the LGPL (cc rmcgibbo).

In any case, I forked it, and with modest changes I got it working. I ran my tests on a c5.18xlarge EC2 instance using the simple english wikipedia. With 71 cores we get a 30x speedup (that's 8 seconds versus 4 minutes). The products pass np.testing.assert_array_equal. I had to type cast to float64.

$ python psparse.py
Loading simplewiki
Generating random numbers
X: (32440, 86479)
W: (86479, 20000)
Parallel multiply
Standard multiply
This Code : 8.85654s
Scipy     : 245.26226s
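For the record, a toy stand-in for that correctness check. The Cython pmultiply isn't reproduced here, so a dense NumPy product (which sums in a different order, hence allclose rather than exact equality) plays its role against scipy's sparse routine:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical small-scale version of the psparse.py check: any parallel
# product must match scipy's single-threaded one. Cast to float64, as in
# the real test.
X = sp.random(2000, 500, density=0.01, format="csc",
              random_state=0, dtype=np.float64)
W = np.random.default_rng(0).standard_normal((500, 64))

reference = np.asarray(X @ W)   # scipy's sparse routine
candidate = X.toarray() @ W     # stand-in for the parallel pmultiply result

np.testing.assert_allclose(candidate, reference, rtol=1e-10)
```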


Tim Davis now has some software called SuiteSparse. There are a bunch of projects in there, some with proprietary code owned by e.g. AMD, but he also has a C GraphBLAS implementation that is Apache-licensed, with a suite of methods such as GB_Matrix_multiply.

Overall, I suggest we go this path for sparse routines. If we can get permission to use pmultiply, I can get you a pull request.


To run it, just `python setup.py install && python psparse.py`.

Best,

Brian Mingus



Radim Řehůřek

Jan 5, 2018, 12:57:36 PM
to gensim
Awesome. That means we can fork/include it in Gensim directly.

Avoiding materializing the dense matrix altogether (it's just a bunch of random numbers, really) should bring another order-of-magnitude speed-up, both from not having to allocate anything and from better cache utilization (no need to go to RAM; generate the rands on the fly).
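Something like this sketch, in NumPy terms (illustrative only; a real version would do this inside the C kernel):

```python
import numpy as np
import scipy.sparse as sp

def multiply_with_implicit_random(X, k, seed=0):
    """Never materialize the full (n, k) random matrix: generate one
    random column at a time, use it, and throw it away."""
    rng = np.random.default_rng(seed)
    Y = np.empty((X.shape[0], k))
    for c in range(k):
        o = rng.standard_normal(X.shape[1])  # one column, O(n) extra memory
        Y[:, c] = X @ o                      # only the result Y is kept
    return Y

X = sp.random(100, 80, density=0.05, format="csc", random_state=2)
Y = multiply_with_implicit_random(X, k=5, seed=7)
```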

-rr


On Friday, January 5, 2018 at 5:55:04 PM UTC+1, Brian Mingus wrote:
Thanks for the info Tim.

fyi, Robert has notified me that the gist containing the code for pmultiply is now licensed under the MIT license.

On Thu, Jan 4, 2018 at 4:39 PM, Tim Davis <da...@tamu.edu> wrote:
Thanks for the info.    

GraphBLAS includes a sparse matrix-matrix multiply, GrB_mxm (which calls the non-user-callable GB_Matrix_Multiply). GraphBLAS is under the Apache 2 license. It was just recently added to Debian in their latest update to their packaging of SuiteSparse. GraphBLAS is supported code, so it will continue to be developed in the future; as a result, it would be a useful package for Python to rely on. In particular, an OpenMP-parallel version of GrB_mxm is in progress. Let me know if you have any questions about GraphBLAS.

None of SuiteSparse is owned by AMD, by the way. AMD is my acronym for the Approximate Minimum Degree method and has no connection to AMD the company. My A.M.D. ordering method is licensed BSD 3-clause.

Thanks,
Tim



Radim Řehůřek

Jan 7, 2018, 5:37:49 AM
to gensim
Interesting.

I see no way that "allocating + generating + storing + loading" could be faster than "generating"—a physical impossibility :-)

But I don't really see that as critical. What you have now is already amazing, a welcome addition and a great speed-up. The primary questions are around stability, packaging, compatibility, testing etc. (maintenance stuff). CC Ivan.

@Ivan: related to our long-standing issues of #557. (I completely forgot about the `reduceat` trick suggested there, never tried it)

SVD / PCA are at the core of most scientific computing, they come up in all sorts of statistical scenarios, neural networks etc. So this change will affect more than just LsiModel, I'm pretty excited about it!

Radim


On Saturday, January 6, 2018 at 9:36:37 PM UTC+1, Brian Mingus wrote:
I have checked in a cs_gaxpy_rnd function that generates random numbers on the fly. This was fairly difficult because random number generators have static variables and are generally not thread safe, resulting in spinlocks. I ended up using Knuth's algorithm with #pragma omp threadprivate on the static variables and using the omp thread number as the seed which recovered the speed. I also tried several other packages for random numbers (SFMT is currently checked in but commented out), but as a rule they were orders of magnitude slower.

I suspect it is faster to allocate the numbers ahead of time because 1) random number generation algorithms are often not thread safe (in C at least), resulting in ping-ponging cache lines between threads, and 2) generating them ahead of time results in quite fast matrix multiplication, because everything is laid out contiguously in memory.

Perhaps there are more optimizations we can make, such as switching to C++ and using boost, but generating on the fly with cs_gaxpy_rnd is 20 times slower than generating them ahead of time and holding them in memory. Of course, it uses basically zero extra memory.
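The per-thread-stream idea can be sketched in NumPy terms, with SeedSequence.spawn standing in for the threadprivate Knuth generator seeded by the OpenMP thread number (illustrative only; the real code is C):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# SeedSequence.spawn yields statistically independent child streams, one
# per worker, so there is no shared generator state to lock or spin on --
# analogous to a threadprivate generator seeded with the thread number.
n_workers, n_draws = 4, 1000
streams = [np.random.default_rng(s)
           for s in np.random.SeedSequence(42).spawn(n_workers)]

def draw(rng):
    # Each worker draws from its own private stream.
    return rng.standard_normal(n_draws)

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    blocks = list(pool.map(draw, streams))
```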



On Fri, Jan 5, 2018 at 11:04 AM, Brian Mingus <refle...@gmail.com> wrote:
Yes agreed. My implementation uses that method. I will look at merging them.

Radim Řehůřek

Feb 1, 2018, 3:48:07 AM
to gensim
Hi Brian, I didn't realize I was the blocking element here -- still psyched about this optimization!

I find it fascinating how useful and practical SVD/LSI still are, 30 years down the line. Deep learning is all the rage now, but when it comes to practical results, their impact on IR quality is not terribly impressive in my experience (while coming at a terrible cost).

What will you need from us for this "embedded CSparse" PR? I'll be happy to review and assist with testing/benchmarking/maintenance.

Radim


On Wednesday, January 31, 2018 at 8:26:14 PM UTC+1, Brian Mingus wrote:
Hey Radim - how has this solution aged? Ready for a PR?

Just to clarify the general approach: long term we want to go with GraphBLAS, but not all of the implementations use OpenMP and are parallelized yet. Short term, we're going to use CSparse and manually write parallel OpenMP routines using Cython where appropriate, starting with the parallel sparse * dense multiplication in LSI. CSparse itself will be checked in to Gensim to facilitate this work.

On Thu, Jan 11, 2018 at 9:35 AM, Brian Mingus <refle...@gmail.com> wrote:
Hey Radim - 

My algorithm for generating them on the fly was - potentially - too naive. I am generating a new normal random number for every single multiplication. 

On Sun, Jan 7, 2018 at 8:27 AM, Brian Mingus <refle...@gmail.com> wrote:
Definitely exciting :) 

fyi the numbers are generated and held in memory as a Fortran array, and from then on all access is via pointers.

There is essentially no speedup due to threading when the normally distributed numbers are generated on the fly - I only managed to ameliorate an even worse slowdown due to poor use of the cache. So it seems some sort of thread contention or further cache misses may be the problem. I got rid of almost all of the spinlocking, so I think it's the cache.

Radim Řehůřek

Feb 6, 2018, 3:58:10 AM
to gensim
Hi Brian,

is there anything we could help with regarding this PR? Do you have any constraints/timeline, can other people join in this project?

I'm eager to see this new functionality in Gensim.

Many thanks,
Radim

Radim Řehůřek

Feb 13, 2018, 1:59:19 AM
to gensim
For anyone following along: here is Brian's PR on Github: https://github.com/RaRe-Technologies/gensim/pull/1896

Help welcome!

-rr


On Tuesday, February 6, 2018 at 5:19:06 PM UTC+1, Brian Mingus wrote:
No problem I will prioritize this.