Introducing Ceygen: a helper for fast linear algebra with Cython typed memoryviews


Matěj Laitl

Feb 3, 2013, 7:02:09 PM
to cython...@googlegroups.com, Václav Šmídl, Radek Hofman
Hola,
have you ever wanted to compute a matrix product of typed memoryviews (e.g.
cdef double[:, :] x), but couldn't afford the overhead of numpy.dot? Ever
wanted to compute a matrix inverse or perhaps a determinant without holding
the GIL, so that the code could scale to multiple threads?

Search no more and try Ceygen [1], my latest effort to make memoryviews more
usable for numerics. Ceygen is fast (i.e. virtually no overhead), well
documented, data-type agnostic (it uses fused types), tested, multithreading-
friendly and uses the Eigen C++ template library for the actual computation
(no need to install it, just unpack; no dependency on BLAS/LAPACK).
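
To give a flavour, here is a minimal sketch of the intended usage (the
dot_mm name and the out= argument are an assumption of this sketch, taken
from the documentation at [2]):

    # demo.pyx - a minimal sketch, assuming ceygen.core.dot_mm(x, y, out)
    import numpy as np
    cimport ceygen.core as core

    def demo():
        cdef double[:, :] x = np.array([[1., 2.], [3., 4.]])
        cdef double[:, :] y = np.array([[0., 1.], [1., 0.]])
        cdef double[:, :] out = np.empty((2, 2))
        with nogil:
            core.dot_mm(x, y, out)  # matrix product, no GIL held
        return np.asarray(out)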

Documentation is at [2]; released versions are available from PyPI [3]. If
a function you need is missing there, it should be really easy to add.

Note: I'm aware of Mark's array expressions work; Ceygen's aim is to
complement it (with more specialised functions) rather than substitute for
it. However, until array expressions are hopefully merged into Cython,
Ceygen's elemwise module [4] serves as a stop-gap solution; see the sketch
below.
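
For instance, a hedged sketch of element-wise addition (assuming the module
exports add_vv(x, y, out) as listed at [4]):

    # elemwise_demo.pyx - sketch only; add_vv is assumed from the docs [4]
    import numpy as np
    cimport ceygen.elemwise as elemwise

    def add_demo():
        cdef double[:] x = np.array([1., 2., 3.])
        cdef double[:] y = np.array([4., 5., 6.])
        cdef double[:] out = np.empty(3)
        elemwise.add_vv(x, y, out)  # out[i] = x[i] + y[i], GIL-free inside
        return np.asarray(out)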

[1] https://github.com/strohel/Ceygen
[2] http://strohel.github.com/Ceygen-doc/
[3] http://pypi.python.org/pypi/Ceygen
[4] http://strohel.github.com/Ceygen-doc/elemwise.html

Regards,
Matěj Laitl

Sturla Molden

Feb 4, 2013, 5:51:57 AM
to cython...@googlegroups.com
On 04.02.2013 01:02, Matěj Laitl wrote:
> Hola,
> have you ever wanted to compute a matrix product of typed memoryviews (e.g.
> cdef double[:, :] x), but couldn't afford the overhead of numpy.dot?

Never. The overhead has never been significant compared to DGEMM et al.

> without holding the
> GIL so that the code could scale to multiple threads?

You have misunderstood the GIL.

The GIL only serializes access to the Python interpreter. The Python
interpreter is not used by BLAS, so BLAS threads can run freely even if
the GIL is not released. (This also applies to matrix inversion with
LAPACK.) Currently I have np.dot linked to Intel MKL. MKL will use
multiple threads if it can; MKL doesn't know what the GIL is and spawns
its own threads.

However, np.dot does release the GIL before calling into BLAS.

https://github.com/numpy/numpy/blob/master/numpy/core/blasdot/_dotblas.c

As you can see, there are macros that release the GIL on lines 482, 618,
664, 716, 791, 967, and 1185. This is to prevent np.dot from locking up
access to the Python interpreter. It is not to allow the matrix
multiplication to use multiple cores. BLAS threads are managed by the
BLAS library.


> Search no more and try Ceygen [1], my latest effort to make memoryviews more
> usable for numerics. Ceygen is fast (i.e. virtually no overhead), well
> documented, data-type agnostic (uses fused types), tested, multithreading-
> friendly and uses Eigen C++ template library for actual computation (no need
> to install, just unpack; no dependency on BLAS/LAPACK).

So instead of using a hardware-optimized BLAS and LAPACK (Intel MKL,
AMD ACML, Cray libsci, OpenBLAS, Apple Accelerate Framework, Nvidia
cuBLAS), we will be better off with some non-optimized C++ templates? How
did you get that idea?


Sturla




Kornél JAHN

Feb 4, 2013, 7:53:58 AM
to cython...@googlegroups.com, Václav Šmídl, Radek Hofman
While Matěj's showy style of advertising doesn't quite fit my taste, calling Eigen "some non-optimized C++ templates" seems a hasty judgement as well. If the benchmark posted on the Eigen website is to be believed, Eigen is quickly becoming a serious contender to Intel MKL etc. while being open source.

Kornel


--
Kornél JAHN
engineering physicist, PhD student
Budapest University of Technology and Economics, Hungary

Kornél JAHN

Feb 4, 2013, 8:16:10 AM
to cython...@googlegroups.com, Václav Šmídl, Radek Hofman
For the sake of completeness, many (including myself) are sceptical of Eigen
outperforming MKL and ATLAS; see, for instance,
http://stackoverflow.com/questions/10366054/c-performance-in-eigen-library
http://software.intel.com/en-us/forums/topic/326413
Nevertheless, Eigen is a nice piece of software; I have used it in the past,
too.

Sturla Molden

Feb 4, 2013, 8:38:07 AM
to cython...@googlegroups.com
On 04.02.2013 14:16, Kornél JAHN wrote:
> For the sake of completeness, many (including myself) are sceptical of
> Eigen outperforming MKL and ATLAS, see, for instance,
> http://stackoverflow.com/questions/10366054/c-performance-in-eigen-library
> http://software.intel.com/en-us/forums/topic/326413
> Nevertheless, Eigen is a nice piece of software, I used it in the past, too.

If you compute A*A' there is a lot of redundancy that the C++ compiler can
throw away at compile time. And if you compare with a Fortran library that
actually computes A*B', you can declare yourself victorious. But for a
general matrix multiplier like np.dot, the speed of A*A' is not important.

Sturla





Matěj Laitl

Feb 4, 2013, 10:03:14 AM
to cython...@googlegroups.com, Sturla Molden, Václav Šmídl, Radek Hofman
On 4. 2. 2013 Sturla Molden wrote:
> On 04.02.2013 01:02, Matěj Laitl wrote:
> > have you ever wanted to compute a matrix product of typed memoryviews
> > but couldn't afford the overhead of numpy.dot?
>
> Never. The overhead has never been significant compared to DGEMM et al.

You're right that an O(n^3) operation is not a good example for discussing
overhead. But let the numbers speak (numpy is linked against a carefully
compiled atlas-3.10.1 here):

size: 2x2, iterations: 1000000
numpy: 5.18252134323e-07s per call, 0.518252134323s total
ceygen: 4.24573898315e-07s per call, 0.424573898315s total
size: 4x4, iterations: 1000000
numpy: 5.83323001862e-07s per call, 0.583323001862s total
ceygen: 5.01266002655e-07s per call, 0.501266002655s total
size: 8x8, iterations: 1000000
numpy: 9.83945846558e-07s per call, 0.983945846558s total
ceygen: 8.80959987640e-07s per call, 0.88095998764s total
size: 24x24, iterations: 217013
numpy: 6.45563130066e-06s per call, 1.40095591545s total
ceygen: 5.41761569641e-06s per call, 1.17569303513s total
size: 48x48, iterations: 27126
numpy: 4.99972707884e-05s per call, 1.35622596741s total
ceygen: 2.96417067011e-05s per call, 0.804060935974s total
size: 100x100, iterations: 3000
numpy: 0.000250008662542s per call, 0.750025987625s total
ceygen: 0.000206195990245s per call, 0.618587970734s total

Not bad, is it? The code is at [1]; you can try to reproduce it yourself.
OTOH a very recent atlas seems to take better advantage of my CPU with ever
larger matrices, but that's nothing that couldn't be fixed in future Eigen
versions.

[1] https://github.com/strohel/Ceygen/blob/master/ceygen/tests/bench.pyx

> > without holding the GIL so that the code could scale to multiple threads?
>
> You have misunderstood the GIL.

No. You speak about multi-threading *inside* numpy.dot; I speak about
parallelising the Cython code that *uses* dot(), perhaps inside a prange
block. It depends on the matrix size; for example, in our problem we need to
call dot() on thousands of small independent matrices. I hope you'll agree
that parallelising inside dot() isn't very useful there.
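
To make that concrete, a sketch of the pattern I mean (again assuming
ceygen.core.dot_mm(x, y, out) with a nogil signature, as in [2]):

    # prange_demo.pyx - sketch only
    from cython.parallel cimport prange
    cimport ceygen.core as core

    cdef void dot_many(double[:, :, :] xs, double[:, :, :] ys,
                       double[:, :, :] outs):
        cdef Py_ssize_t i
        # thousands of small, independent products spread across threads;
        # the loop body never needs the GIL, so the threads really run
        # in parallel
        for i in prange(xs.shape[0], nogil=True):
            core.dot_mm(xs[i], ys[i], outs[i])

With numpy.dot the loop body would be a Python call, so it could not appear
inside the nogil block at all.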

> > Search no more and try Ceygen [1], my latest effort to make memoryviews
> > more usable for numerics. Ceygen is fast (i.e. virtually no overhead),
> > well documented, data-type agnostic (uses fused types), tested,
> > multithreading- friendly and uses Eigen C++ template library for actual
> > computation (no need to install, just unpack; no dependency on
> > BLAS/LAPACK).
>
> So instead of using an hardware optimized BLAS and LAPACK (Intel MKL,
> AMD ACML, Cray libsci, OpenBLAS, Apple Accelerate Framework, Nvidia
> cuBLAS) we will be better off with some non-optimized C++ templates? How
> did you get that idea?

You call something very comparable to MKL (and 2x faster for some matrix
sizes, e.g. 12x12 dgemm, not to speak of fixed-size cases,
transpositions...) "non-optimized C++ templates"?
http://eigen.tuxfamily.org/index.php?title=Benchmark

I'm not saying BLAS/LAPACK is bad, but sometimes the
effort/money/complications are not worth it, see above.

Regards,
Matěj

Sturla Molden

Feb 4, 2013, 10:24:53 AM
to cython...@googlegroups.com
On 04.02.2013 16:03, Matěj Laitl wrote:

> You're right that O(n^3) operation is not a good example to mention overhead.
> But let the numbers speak (numpy is linked against carefully compiled
> atlas-3.10.1 there):

> Not bad, isn't it?

So you're benchmarking against ATLAS?

Use a vendor-optimized BLAS for comparison. ATLAS is always slow.


> No. You speak about multi-threading *inside* the numpy.dot, I about
> parallelising the Cython code that *uses* the dot()
> perhaps inside a prange block.

No, you didn't. But anyway:

NumPy's dotblas module (np.dot) releases the GIL.



> You call very comparable to MKL (and 2x faster for some matrix sizes - i.e.
> 12x12 dgemm, not to speak about fixed-size cases, transpositions...) "non-
> optimized C++ templates"? http://eigen.tuxfamily.org/index.php?title=Benchmark

12x12 dgemm is a tiny computation where Python (or Cython) overhead is
important.

If you are doing a lot of these small matrix multiplications, you will be
better off moving all of them to C++ (Eigen or Blitz++) or Fortran 90, or
perhaps just calling cblas_dgemm from Cython.
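
For the last option, a sketch of what calling cblas_dgemm from Cython looks
like (standard CBLAS signature; C-contiguous double arrays assumed, and you
link against the BLAS of your choice):

    # dgemm.pyx - sketch; build by linking against your BLAS (e.g. -lcblas)
    cdef extern from "cblas.h":
        ctypedef enum CBLAS_ORDER:
            CblasRowMajor
        ctypedef enum CBLAS_TRANSPOSE:
            CblasNoTrans
        void cblas_dgemm(CBLAS_ORDER Order, CBLAS_TRANSPOSE TransA,
                         CBLAS_TRANSPOSE TransB, int M, int N, int K,
                         double alpha, double *A, int lda, double *B,
                         int ldb, double beta, double *C, int ldc) nogil

    cdef void dot_mm(double[:, ::1] a, double[:, ::1] b,
                     double[:, ::1] c) nogil:
        # C = 1.0*A*B + 0.0*C for C-contiguous (row-major) inputs
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    <int>a.shape[0], <int>b.shape[1], <int>a.shape[1],
                    1.0, &a[0, 0], <int>a.shape[1],
                    &b[0, 0], <int>b.shape[1],
                    0.0, &c[0, 0], <int>c.shape[1])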

(Using Eigen directly in Cython would be nice. But currently Cython does
not get along too well with stack-allocated C++ objects.)


Sturla



Thomas Wiecki

Feb 4, 2013, 11:20:06 AM
to cython...@googlegroups.com
I'm not sure I follow the reservations. This seems extremely useful as a Cython library. Specifically, if I'm already inside Cython code, I want to avoid the overhead of calling into numpy, for which I would also have to acquire the GIL. That's also the motivation for CythonGSL.

Thomas



Matěj Laitl

Feb 4, 2013, 11:44:49 AM
to cython...@googlegroups.com
On 4. 2. 2013 Sturla Molden wrote:
> On 04.02.2013 16:03, Matěj Laitl wrote:
> > You're right that O(n^3) operation is not a good example to mention
> > overhead. But let the numbers speak (numpy is linked against carefully
> > compiled atlas-3.10.1 there):
> >
> > Not bad, isn't it?
>
> So you're benchmarking against ATLAS?

Yes, I haven't bought any proprietary implementations. I thought *you* could
back up your statements by running the code yourself; it's really easy!

> Use a vendor-optimized BLAS for comparison. ATLAS is always slow.

Not true: atlas 3.10.1 can get >90% (19-20 GFLOPS) of the peak single-core
performance (20.5 GFLOPS in doubles) out of my Core i5 2520M starting with
500x500 matrices, and it falls off gently for smaller ones. I use [1] to run
the benchmark.

Perhaps you could actually provide evidence for your claims? All the benchmark
code I use is public and extremely easy to run.

[1] https://github.com/strohel/NUSO

> > No. You speak about multi-threading *inside* the numpy.dot, I about
> > parallelising the Cython code that *uses* the dot() perhaps inside a
> > prange block.
>
> No you didn't. But anyway:
> NumPy's dotblas module (np.dot) releases the GIL.

Releasing the GIL is not bad, but not needing to hold it at all at call
time is IMO crucial for tight prange loops.

> > You call very comparable to MKL (and 2x faster for some matrix sizes -
> > i.e. 12x12 dgemm, not to speak about fixed-size cases, transpositions...)
> > "non-optimized C++ templates"?
> > http://eigen.tuxfamily.org/index.php?title=Benchmark
>
> 12 x 12 dgemm is a tiny computation where Python (or Cython) overhead is
> important.

Python overhead: yes. Cython overhead: no! There is virtually no Cython
overhead if you use Cython + Ceygen with typed memoryviews. (But the point
of the argument was that Eigen performs well for *all* matrix sizes, not
just 12x12.)

> If you are doing a lot of these small matrix multiplications you will be
> better off moving all of them to C++ (Eigen or Blitz++) or Fortran 90,
> or perhaps just call cblas_dgemm from Cython.

With Ceygen you don't need to resort to C++ or cblas_dgemm, and you don't
sacrifice speed. Try it. Really. :-)

> (Using Eigen directly in Cython would be nice.

Agreed. But I claim Ceygen comes very near to it when it comes to speed
(though not when it comes to elegance or code readability).

> But currently Cython does not get along too well with stack-allocated C++
> objects.

Not true; look at the code Cython emits for Ceygen, it's pretty fine.
There's just one work-around (the need for a default constructor) that costs
some readability in Ceygen's internal code, but it incurs no overhead.

Cheers,
Matěj

Matěj Laitl

Feb 4, 2013, 11:56:40 AM
to cython...@googlegroups.com
On 4. 2. 2013 Kornél JAHN wrote:
> While Matěj's showy style of advertising doesn't quite fit my taste (...)

Sorry about that. :-|

The intent was to attract as many users (who may have the same problems to
solve) as possible with something catchy, in order not to duplicate efforts
(and perhaps to improve the GPL-licensed code together).

I also try to be very frank on the homepage (and in the README) about what
Ceygen is *not*, in order not to fool anybody.

Regards,
Matěj