On 04.02.2013 Sturla Molden wrote:
> On 04.02.2013 01:02, Matěj Laitl wrote:
> > have you ever wanted to compute a matrix product of typed memoryviews
> > but couldn't afford the overhead of numpy.dot?
>
> Never. The overhead has never been significant compared to DGEMM et al.
You're right that an O(n^3) operation is not the best example when talking
about overhead. But let the numbers speak (numpy is linked against a carefully
compiled atlas-3.10.1 here):
size: 2x2, iterations: 1000000
    numpy:  5.18252134323e-07s per call, 0.518252134323s total
    ceygen: 4.24573898315e-07s per call, 0.424573898315s total
size: 4x4, iterations: 1000000
    numpy:  5.83323001862e-07s per call, 0.583323001862s total
    ceygen: 5.01266002655e-07s per call, 0.501266002655s total
size: 8x8, iterations: 1000000
    numpy:  9.83945846558e-07s per call, 0.983945846558s total
    ceygen: 8.80959987640e-07s per call, 0.88095998764s total
size: 24x24, iterations: 217013
    numpy:  6.45563130066e-06s per call, 1.40095591545s total
    ceygen: 5.41761569641e-06s per call, 1.17569303513s total
size: 48x48, iterations: 27126
    numpy:  4.99972707884e-05s per call, 1.35622596741s total
    ceygen: 2.96417067011e-05s per call, 0.804060935974s total
size: 100x100, iterations: 3000
    numpy:  0.000250008662542s per call, 0.750025987625s total
    ceygen: 0.000206195990245s per call, 0.618587970734s total
Not bad, is it? The code is at [1]; you can try to reproduce the results
yourself. OTOH, very recent ATLAS versions seem to take better advantage of my
CPU as the matrices grow larger, but that is nothing that couldn't be fixed in
future Eigen versions.
[1] https://github.com/strohel/Ceygen/blob/master/ceygen/tests/bench.pyx
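
For the curious, the core of the measurement looks roughly like this. This is
a minimal sketch rather than the verbatim bench.pyx; the cimport path
ceygen.core and the dot_mm(x, y, out) signature are what Ceygen's
documentation describes, so double-check against [1]:

import time

import numpy as np

cimport ceygen.core as c

def bench_mm(int n, int iterations):
    # n x n operands; out is preallocated so neither path measures allocation
    x = np.random.rand(n, n)
    y = np.random.rand(n, n)
    out = np.empty((n, n))
    cdef double[:, :] xv = x
    cdef double[:, :] yv = y
    cdef double[:, :] ov = out
    cdef int i

    start = time.time()
    for i in range(iterations):
        np.dot(x, y, out)  # numpy path: hardware BLAS underneath
    print("numpy:  %gs per call" % ((time.time() - start) / iterations))

    start = time.time()
    for i in range(iterations):
        c.dot_mm(xv, yv, ov)  # ceygen path: Eigen underneath, no GIL needed
    print("ceygen: %gs per call" % ((time.time() - start) / iterations))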
> > without holding the GIL so that the code could scale to multiple threads?
>
> You have misunderstood the GIL.
No. You're talking about multi-threading *inside* numpy.dot; I'm talking about
parallelising the Cython code that *uses* dot(), perhaps inside a prange
block. Whether that pays off depends on the matrix size: in our problem, for
example, we need to call dot() on thousands of small independent matrices. I
hope you'll agree that parallelising inside dot() isn't very useful there; see
the sketch below.
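
To make the point concrete, here is a minimal sketch of that pattern, again
assuming Ceygen's ceygen.core.dot_mm, which per its docs is declared nogil and
works on fused-type memoryviews (the names follow Ceygen's documentation, not
our actual project code):

# cython: boundscheck=False, wraparound=False
from cython.parallel import prange

cimport ceygen.core as c

def batch_dot(double[:, :, :] x, double[:, :, :] y, double[:, :, :] out):
    # x, y, out are stacks of small matrices; each product out[i] = x[i]*y[i]
    # is independent, so the loop can release the GIL and spread iterations
    # across OpenMP threads.
    cdef Py_ssize_t i
    for i in prange(x.shape[0], nogil=True):
        c.dot_mm(x[i], y[i], out[i])
    return out

Compile with OpenMP enabled (e.g. -fopenmp with gcc) so that prange actually
distributes the iterations; with numpy.dot, this loop would have to hold the
GIL and would serialise.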
> > Search no more and try Ceygen [1], my latest effort to make memoryviews
> > more usable for numerics. Ceygen is fast (i.e. virtually no overhead),
> > well documented, data-type agnostic (uses fused types), tested,
> multithreading-friendly and uses Eigen C++ template library for actual
> > computation (no need to install, just unpack; no dependency on
> > BLAS/LAPACK).
>
> So instead of using a hardware-optimized BLAS and LAPACK (Intel MKL,
> AMD ACML, Cray libsci, OpenBLAS, Apple Accelerate Framework, Nvidia
> cuBLAS) we will be better off with some non-optimized C++ templates? How
> did you get that idea?
You call something that is very comparable to MKL (and 2x faster for some
matrix sizes, e.g. 12x12 dgemm, not to speak of fixed-size cases,
transpositions...) "non-optimized C++ templates"?
http://eigen.tuxfamily.org/index.php?title=Benchmark
I'm not saying BLAS/LAPACK is bad, but sometimes the
effort/money/complications aren't worth it; see above.
Regards,
Matěj