Hey Radim -

Looking at lsimodel.py and matutils.py (qr_destroy), there is a lot of memory optimization. I wonder though: is this preventing LAPACK from using multiple cores for the QR decomposition?

On Wed, Dec 27, 2017 at 2:26 PM, Brian Mingus <refle...@gmail.com> wrote:

I am using gensim like so: https://gist.github.com/brianmingus/ee35160b52e7b04e8ea4ca2da3ce5772

I have already run previous experiments with 5,000 topics, where LSI used 270 GB of RAM and took six hours, so now I am trying 10,000. The output looks like this:

2017-12-27 18:38:30,541 : INFO : loaded corpus index from enwiki_tfidf.mm.index
2017-12-27 18:38:30,542 : INFO : initializing corpus reader from enwiki_tfidf.mm
2017-12-27 18:38:30,542 : INFO : accepted corpus with 4385561 documents, 100000 features, 689946173 non-zero entries
2017-12-27 18:39:11,674 : INFO : using serial LSI version on this node
2017-12-27 18:39:11,674 : INFO : updating model with new documents
2017-12-27 18:39:11,675 : INFO : using 100 extra samples and 2 power iterations
2017-12-27 18:39:11,675 : INFO : 1st phase: constructing (100000, 10100) action matrix
2017-12-27 20:45:58,537 : INFO : orthonormalizing (100000, 10100) action matrix

So I check in a couple of hours later and we are only using 80 GB of RAM and 1 CPU. `perf top` looks like this:

98.55% _sparsetools.cpython-36m-x86_64-linux-gnu.so [.] csr_matvecs_thunk

So constructing and orthonormalizing the action matrix takes a huge amount of time while using relatively few resources, and costing a lot of money. I am actually not sure what the "action matrix" is - is optimizing these steps something we can look into?
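For context on what the log is doing: the "action matrix" is the first phase of the randomized SVD that gensim's LSI uses — multiply the corpus by a random Gaussian test matrix (this is the csr_matvecs hotspot), run a couple of power iterations, then orthonormalize with QR. A minimal NumPy/SciPy sketch of that phase, with toy sizes and my own function/variable names (not gensim's internals):

```python
import numpy as np
import scipy.sparse as sp

def action_matrix(corpus, rank, oversample=100, power_iters=2, seed=0):
    """Sketch of the randomized-SVD first phase: orthonormal basis of (A A^T)^q A O."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = corpus.shape
    k = rank + oversample
    # Random Gaussian test matrix; this sparse*dense product is the hotspot.
    omega = rng.standard_normal((n_docs, k))
    y = corpus @ omega                      # (n_terms, k) "action matrix"
    for _ in range(power_iters):            # power iterations sharpen the spectrum
        y = corpus @ (corpus.T @ y)
    q, _ = np.linalg.qr(y)                  # orthonormalize (the 2nd log step)
    return q

corpus = sp.random(1000, 500, density=0.01, format='csr', random_state=0)
q = action_matrix(corpus, rank=20, oversample=10, power_iters=1)
print(q.shape)  # (1000, 30)
```

In the log above, "100 extra samples and 2 power iterations" correspond to the oversample and power_iters parameters, and (100000, 10100) is terms × (topics + oversampling).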
Hello again, Radim :)

Since these are one-off models that are useful for a variety of purposes (perhaps you want them in your model repo?), I just spun up a very large AWS instance with lots of RAM and very high provisioned IOPS. So there is no need to compress anything - we want it to finish as fast as possible, and disk is not a bottleneck. There is also gobs of RAM (up to 4 TB) and lots of CPUs (up to 128).

Yesterday I noticed that a lot of time was being spent using just one core and not much RAM, and then it finally got to work and used hundreds of GB of RAM. I assumed that this was because the matrix was dense in memory. So why not do everything dense?
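A quick back-of-envelope on "everything dense", using the corpus sizes from the earlier log (4,385,561 documents × 100,000 features, ~690M non-zeros) and assuming float64 storage:

```python
docs, features = 4_385_561, 100_000
bytes_dense = docs * features * 8               # float64 = 8 bytes per entry
print(bytes_dense / 1e12)                       # ~3.5 TB for the dense corpus matrix

# The sparse CSR corpus only stores the non-zeros (float64 data + int32 indices):
nnz = 689_946_173
bytes_sparse = nnz * (8 + 4) + (docs + 1) * 4   # data + indices + indptr
print(bytes_sparse / 1e9)                       # ~8.3 GB
```

So the fully dense corpus is about 3.5 TB — it would just barely fit in the 4 TB instance, with essentially no headroom for the dense action matrix and intermediates, versus roughly 8 GB sparse.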
There is also the Python package KDT - the Knowledge Discovery Toolbox: http://kdt.sourceforge.net This package is of particular interest because it builds on Combinatorial BLAS, which is being used to guide the GraphBLAS standard. Unfortunately it hasn't been updated since 2013 :/

On Sat, Dec 30, 2017 at 6:25 PM, Brian Mingus <refle...@gmail.com> wrote:

Radim -

I did some more work on this and dug up this comparison of linear algebra software. Most packages that support sparse matrices do not support Python, or have no Python wrapper: http://www.netlib.org/utk/people/JackDongarra/la-sw.html

I could not get sppy working: https://github.com/charanpald/sppy

I got pyrsb working, but it was crashing due to stack smashing. The author of librsb wrote this wrapper, so it's somewhat interesting, if immature: https://github.com/michelemartone/pyrsb

'sparse' is related to Dask, but has no plans to support parallelism: https://github.com/mrocklin/sparse/

SLEPc looks quite interesting: http://slepc.upv.es/ The main downside is MPI, and it may be difficult to make it easy to actually get running. You can pip install it, it is actively developed, and you could probably replace the entire LSI implementation with it, which in some regards seems like overkill, but is also kind of cool. I also can't immediately tell how it works - there is a MatMult_MPIAIJCUSPARSE function, so it may depend on a GPU being on every node and CUDA being installed. That sounds like a huge pain.

I also took another look at bhSPARSE, in particular https://github.com/bhSPARSE/Benchmark_SpGEMM_using_CSR Unfortunately I could not get it to compile.

Obviously by now I could have written a simple sparse matrix * random matrix routine, but sparse matrix computations are important to me for other reasons as well. I am actually surprised there is nothing great here for Python.

Brian
Hey Radim -

I implemented sparse*dense with tensorflow's built-in routine (and tensorflow eager - 10x less code), but it turns out that tensorflow's sparse*dense routine considers the entire computation to be one "op", and so it's not parallelized. I also tried on GPUs, but you hit the memory limit very fast.

I started implementing in Cython, as the algorithm is simple. While Googling for issues I was having, I stumbled upon this ancient gist which claims to implement a parallel sparse matrix * dense matrix product in C/Cython/Python. It relies on a C library written by professor Tim Davis called CSparse, which is LGPL, like gensim. The Cython wrapper just implements pmultiply, which uses Cython's built-in parallelism and OpenMP. pmultiply has no license and so is technically copyrighted; I cc'd the author in case he is willing to relicense it under the LGPL (cc rmcgibbo).

In any case, I forked it, and with modest changes I got it working. I ran my tests on a c5.18xlarge EC2 instance using the Simple English Wikipedia. With 71 cores we get a 30x speedup (that's 8 seconds versus 4 minutes). The products pass np.testing.assert_array_equal. I had to typecast to float64.

$ python psparse.py
Loading simplewiki
Generating random numbers
X: (32440, 86479)
W: (86479, 20000)
Parallel multiply
Standard multiply
This Code : 8.85654s
Scipy     : 245.26226s

Tim Davis now has some software called SuiteSparse. There are a bunch of projects in there, some with proprietary code owned by e.g. AMD, but he also has a C GraphBLAS implementation that is Apache licensed, with a suite of methods such as GB_Matrix_multiply.

Overall, I suggest we go this path for sparse routines. If we can get permission to use pmultiply, I can get you a pull request.

Here is my fork, btw: https://gist.github.com/brianmingus/ff6c3e9816c2cb7de0efc17d9bed06e6

Just python setup.py install && python psparse.py

Best,
Brian Mingus
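The reason the product parallelizes so cleanly is that each output row of X @ W depends only on the corresponding row of X, so the rows can be partitioned across workers with no coordination. A pure-Python illustration of that row-chunking idea (this is just a correctness sketch of the partitioning — not the Cython/OpenMP pmultiply code from the gist, and any real speedup requires the GIL-free C loop):

```python
import numpy as np
import scipy.sparse as sp
from concurrent.futures import ThreadPoolExecutor

def chunked_multiply(X, W, n_chunks=4):
    """Row-partitioned sparse*dense product: every chunk is independent work."""
    bounds = np.linspace(0, X.shape[0], n_chunks + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Each worker multiplies its own row slice; results stack back in order.
        parts = pool.map(lambda b: X[b[0]:b[1]] @ W,
                         zip(bounds[:-1], bounds[1:]))
        return np.vstack(list(parts))

X = sp.random(500, 300, density=0.05, format='csr', random_state=1)
W = np.random.default_rng(1).standard_normal((300, 40))
assert np.allclose(chunked_multiply(X, W), X @ W)
```

CSR is the natural format here because slicing rows is cheap, which is presumably why the gist (and gensim's corpus) use it.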
Thanks for the info, Tim.

fyi, Robert has notified me that the gist containing the code for pmultiply is now licensed under the MIT license.

On Thu, Jan 4, 2018 at 4:39 PM, Tim Davis <da...@tamu.edu> wrote:

Thanks for the info.

GraphBLAS includes a sparse matrix-matrix multiply, GrB_mxm (which calls the non-user-callable GB_Matrix_Multiply). GraphBLAS is under the Apache 2 license. It just recently got added to Debian in their latest update to their packaging of SuiteSparse. GraphBLAS is a supported code, so it will continue to be developed in the future, which would make it a useful package for Python to rely on. In particular, an OpenMP parallel version of GrB_mxm is in progress. Let me know if you have any questions about GraphBLAS.

None of SuiteSparse is owned by AMD, by the way: AMD is my acronym for the Approximate Minimum Degree method and has no connection to AMD the company. My A.M.D. ordering method is licensed BSD 3-clause.

Thanks,
Tim
I have checked in a cs_gaxpy_rnd function that generates random numbers on the fly. This was fairly difficult because random number generators have static variables and are generally not thread safe, resulting in spinlocks. I ended up using Knuth's algorithm with #pragma omp threadprivate on the static variables, using the omp thread number as the seed, which recovered the speed. I also tried several other packages for random numbers (SFMT is currently checked in but commented out), but as a rule they were orders of magnitude slower.

I suspect it is faster to allocate the numbers ahead of time because 1) random number generation algorithms are often not thread safe (in C at least), resulting in ping-ponging cache lines between threads, and 2) generating them ahead of time results in quite fast matrix multiplication, because everything is laid out contiguously in memory.

Perhaps there are more optimizations we can make, such as switching to C++ and using Boost, but generating on the fly with cs_gaxpy_rnd is 20 times slower than generating the numbers ahead of time and holding them in memory. Of course, on-the-fly generation uses essentially zero extra memory.

On Fri, Jan 5, 2018 at 11:04 AM, Brian Mingus <refle...@gmail.com> wrote:

Yes, agreed. My implementation uses that method. I will look at merging them.
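The core trick described above — give every thread its own generator state, seeded per thread, so there is no shared state and no locking — has a direct NumPy analogue: SeedSequence.spawn produces statistically independent child streams. A hedged Python sketch of the same pattern (illustrative only; the actual cs_gaxpy_rnd work is in C with omp threadprivate):

```python
import numpy as np
from numpy.random import SeedSequence, default_rng

# One child SeedSequence per worker: independent streams, no shared state,
# so no locking needed -- the analogue of `#pragma omp threadprivate`.
n_threads = 4
children = SeedSequence(12345).spawn(n_threads)
rngs = [default_rng(s) for s in children]

# Each "thread" fills its own contiguous slice of the Gaussian test matrix.
rows_per_thread, cols = 8, 5
blocks = [r.standard_normal((rows_per_thread, cols)) for r in rngs]
omega = np.vstack(blocks)
print(omega.shape)  # (32, 5)
```

Note that, unlike seeding each stream with the raw omp thread number, spawned SeedSequences are designed to avoid correlations between the per-thread streams.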
Hey Radim - how has this solution aged? Ready for a PR?

Just to clarify the general approach: long term we want to go with GraphBLAS, but not all of its implementations are parallelized with OpenMP yet. Short term, we're going to use CSparse and manually write parallel OpenMP routines using Cython where appropriate, starting with the parallel sparse * dense multiplication in LSI. CSparse itself will be checked in to gensim to facilitate this work.

On Thu, Jan 11, 2018 at 9:35 AM, Brian Mingus <refle...@gmail.com> wrote:

Hey Radim -

My algorithm for generating them on the fly was - potentially - too naive: I am generating a new normal random number for every single multiplication.

On Sun, Jan 7, 2018 at 8:27 AM, Brian Mingus <refle...@gmail.com> wrote:

Definitely exciting :)

fyi, the numbers are generated and held in memory as a Fortran array, and from then on all access is via pointers.

There is essentially no speedup due to threading when the normally distributed numbers are generated on the fly - I only managed to ameliorate an even worse slowdown due to poor use of the cache. So it seems some sort of thread contention or further cache misses may be the problem. I got rid of almost all of the spinlocking, so I think it's the cache.
No problem, I will prioritize this.