import numpy as np
from matmult import dotAB, multAB
A = np.array([[ 1., 3., 4.],
              [ 5., 6., 3.]])
B = A.T.copy()
timeit dotAB(A,B)
# 1 loops, best of 3: 826 ms per loop
timeit multAB(A,B)
# 1 loops, best of 3: 1.16 s per loop
As you can see, my multAB results in a negative speedup: it runs at only about 0.7x the speed of dot.
I compile the Cython code with

cython -a matmult.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
    -I/usr/include/python2.6 \
    -I/usr/local/lib/python2.6/dist-packages/numpy/core/include/ \
    -o matmult.so matmult.c
Cython code is attached and inlined below.
The profile is here (I don't understand why some of the lines show up as
bottlenecks): http://eagle1.american.edu/~js2796a/matmult/matmult.html
-----------------------------------------------------------
from numpy cimport float64_t, ndarray, NPY_DOUBLE, npy_intp
cimport cython
from numpy import dot

ctypedef float64_t DOUBLE

cdef extern from "numpy/arrayobject.h":
    cdef void import_array()
    cdef object PyArray_SimpleNew(int nd, npy_intp *dims, int typenum)

import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline object matmult(ndarray[DOUBLE, ndim=2, mode='c'] A,
                           ndarray[DOUBLE, ndim=2, mode='c'] B):
    cdef int lda = A.shape[0]
    cdef int n = B.shape[1]
    cdef npy_intp *dims = [lda, n]
    cdef ndarray[DOUBLE, ndim=2] out = PyArray_SimpleNew(2, dims, NPY_DOUBLE)
    cdef int i, j, k
    cdef double s
    for i in xrange(lda):
        for j in xrange(n):
            s = 0
            for k in xrange(A.shape[1]):
                s += A[i,k] * B[k,j]
            out[i,j] = s
    return out

def multAB(ndarray[DOUBLE, ndim=2] A, ndarray[DOUBLE, ndim=2] B):
    for i in xrange(1000000):
        C = matmult(A,B)
    return C

def dotAB(ndarray[DOUBLE, ndim=2] A, ndarray[DOUBLE, ndim=2] B):
    for i in xrange(1000000):
        C = dot(A,B)
    return C
Skipper
You'll be hard pressed to do better than Numpy's dot. In the raw data
handling, BLAS is very likely faster than most things you can code
manually. Moreover, the Cython routine you write must have as much
overhead as dot(): dealing with refcounting, allocating/deallocating
PyArrayObjects (which is expensive), etc.
If you are willing to give up wrapping each small matrix in a separate
Numpy ndarray, then you can expect to get additional speed gains.
(Although even in that case it could make more sense to call BLAS
routines to do the multiplication instead, unless your matrices are small
and of fixed size in which case the C compiler may be able to produce
some tightly optimized code.)
However, in many cases the small matrices can be just stuffed into a
single Numpy array. At the moment there is no "vectorized" matrix
multiplication routine, however, so that could be written e.g. in Cython.
--
Pauli Virtanen
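A minimal sketch of what such a batched routine could look like, assuming
the small matrices are stacked along the first axis of C-contiguous
(m, r, k) and (m, k, c) arrays (the name batched_matmult and the layout
are illustrative, not an existing NumPy API):

cimport cython
import numpy as np
from numpy cimport float64_t, ndarray

@cython.boundscheck(False)
@cython.wraparound(False)
def batched_matmult(ndarray[float64_t, ndim=3, mode='c'] A,
                    ndarray[float64_t, ndim=3, mode='c'] B):
    # out[i] = A[i] times B[i] for each small matrix in the stack
    cdef int m = A.shape[0], r = A.shape[1]
    cdef int k = A.shape[2], c = B.shape[2]
    cdef ndarray[float64_t, ndim=3] out = np.empty((m, r, c))
    cdef int i, p, q, t
    cdef double s
    for i in xrange(m):
        for p in xrange(r):
            for q in xrange(c):
                s = 0
                for t in xrange(k):
                    s += A[i, p, t] * B[i, t, q]
                out[i, p, q] = s
    return out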
The main thing for his use case is reducing the overhead when called
from Cython. This started in a cython-users thread where he was
directly calling the Python numpy.dot() from Cython. I suggested that,
given the small dimensions (only up to 10x10), he might do better by
writing the matmult directly in Cython. Unfortunately, the buffer
syntax adds a bunch of overhead. Not the *same* overhead, mind, and I
was hoping it would be less, but it turns out to be more.
Getting access to the C BLAS implementations would be best. I guess
you could get descr.f.dotfunc and use that.
> If you are willing to give up wrapping each small matrix in a separate
> Numpy ndarray, then you can expect to get additional speed gains.
> (Although even in that case it could make more sense to call BLAS
> routines to do the multiplication instead, unless your matrices are small
> and of fixed size in which case the C compiler may be able to produce
> some tightly optimized code.)
>
> However, in many cases the small matrices can be just stuffed into a
> single Numpy array.
His use case (Kalman filters) prevents this.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
Ah, I see. I didn't think about the overhead of PyArrayObject.
Skipper
> @cython.boundscheck(False)
> @cython.wraparound(False)
> cdef inline object matmult(ndarray[DOUBLE, ndim=2, mode='c'] A,
>                            ndarray[DOUBLE, ndim=2, mode='c'] B):
>     cdef int lda = A.shape[0]
>     cdef int n = B.shape[1]
>     cdef npy_intp *dims = [lda, n]
>     cdef ndarray[DOUBLE, ndim=2] out = PyArray_SimpleNew(2, dims, NPY_DOUBLE)
>     cdef int i, j, k
>     cdef double s
Do the cdef's above take a sizeable fraction of the time given that
your input arrays are small? If so, then you could do those before you
enter the inner loop where the dot product is needed. You wouldn't end
up with a reusable matmult function, but you'd get rid of some
overhead.
So in your inner loop, you'd only have:
>     for i in xrange(lda):
>         for j in xrange(n):
>             s = 0
>             for k in xrange(A.shape[1]):
>                 s += A[i,k] * B[k,j]
>             out[i,j] = s
>     return out
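One way to act on that suggestion while keeping a reusable function is to
let the caller own the output array, so repeated calls pay no allocation
cost; a sketch, reusing the module's DOUBLE ctypedef (the _into signature
is my own variation on the code above):

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void matmult_into(ndarray[DOUBLE, ndim=2, mode='c'] A,
                       ndarray[DOUBLE, ndim=2, mode='c'] B,
                       ndarray[DOUBLE, ndim=2, mode='c'] out):
    # Same triple loop as matmult, but writing into a preallocated out,
    # so no PyArrayObject is created per call (buffer unpacking still
    # happens on entry, which is part of the overhead discussed here).
    cdef int i, j, k
    cdef double s
    for i in xrange(A.shape[0]):
        for j in xrange(B.shape[1]):
            s = 0
            for k in xrange(A.shape[1]):
                s += A[i, k] * B[k, j]
            out[i, j] = s

The timing loop would then allocate C once and call matmult_into(A, B, C)
a million times.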
Sorry for the cross-post. I figured this was better hashed out over here.
> Getting access to the C BLAS implementations would be best. I guess
> you could get descr.f.dotfunc and use that.
>
Thanks, I will see what I can come up with. I know it can be sped up,
since other software written in C++ solves the whole optimization almost
instantaneously where mine takes ~5 seconds for the same case, and my
profiling says that most of the time is spent in the loglikelihood
loop.
>> If you are willing to give up wrapping each small matrix in a separate
>> Numpy ndarray, then you can expect to get additional speed gains.
>> (Although even in that case it could make more sense to call BLAS
>> routines to do the multiplication instead, unless your matrices are small
>> and of fixed size in which case the C compiler may be able to produce
>> some tightly optimized code.)
>>
>> However, in many cases the small matrices can be just stuffed into a
>> single Numpy array.
>
> His use case (Kalman filters) prevents this.
>
For posterity's sake. More akin to my actual problem.
http://groups.google.com/group/cython-users/browse_thread/thread/a605a70626a455d
Does this generate C code, since it's not a cdef? (I haven't updated
Cython in a while.)
I guess you would want to have the entire loop in C.
Josef
>
> def dotAB(ndarray[DOUBLE, ndim=2] A, ndarray[DOUBLE, ndim=2] B):
>     for i in xrange(1000000):
>         C = dot(A,B)
>     return C
>
> Skipper
>
On Mon, Dec 6, 2010 at 2:34 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> I'm wondering if anyone might have a look at my cython code that does
> matrix multiplication and see where I can speed it up or offer some
> pointers/reading. I'm new to Cython and my knowledge of C is pretty
> basic based on trial and (mostly) error, so I am sure the code is
> still very naive.
A few years ago I had a similar problem, and I ended up getting a very
significant speedup by hand-coding a very unsafe, but very fast pure C
extension just to compute these inner products. This was basically a
replacement for dot() that would only work with double precision
inputs of compatible dimensions and would happily segfault with
anything else, but it ran very fast. The inner loop is implemented
completely naively, but it still beats calls to BLAS (even linked with
ATLAS) for small matrix dimensions (my case was also up to ~ 15x15).
I'm attaching the code in case you find it useful; please keep in mind
that I haven't compiled it in years, so it may have bit-rotted a little.
Cheers,
f
Another idea: If the matrices are more in the intermediate range, here's
a Cython library for calling BLAS more directly:
http://www.vetta.org/2009/09/tokyo-a-cython-blas-wrapper-for-fast-matrix-math/
For intermediate-size matrices the use of SSE instructions should be
able to offset any call overhead. Try to steer clear of using NumPy for
slicing, though; instead one should do pointer arithmetic...
Dag Sverre
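As an illustration of the pointer-arithmetic point (my own sketch, not
code from tokyo): take the data pointer once and index rows by offset,
rather than slicing at the Python level, assuming C-contiguous doubles:

cimport cython
from numpy cimport float64_t, ndarray

@cython.boundscheck(False)
cdef double row_dot(ndarray[float64_t, ndim=2, mode='c'] A,
                    ndarray[float64_t, ndim=1, mode='c'] x, int i):
    # Dot product of row i of A with x, without creating the slice A[i, :]
    cdef double *row = <double *>A.data + i * A.shape[1]
    cdef double *xp = <double *>x.data
    cdef double s = 0
    cdef int k
    for k in xrange(A.shape[1]):
        s += row[k] * xp[k]
    return s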
I'd like to reduce the overhead in creating the empty array. Using
PyArray_SimpleNew in Cython is faster than using np.empty, but both are
slower than using np.empty without Cython. Have I done something
wrong? I suspect it has something to do with this line in the code
below: "cdef npy_intp *dims = [r, c]"
PyArray_SimpleNew:
>> timeit matmult(2,2)
1000000 loops, best of 3: 773 ns per loop
np.empty in cython:
>> timeit matmult2(2,2)
1000000 loops, best of 3: 1.62 us per loop
np.empty in python:
>> timeit np.empty((2,2))
1000000 loops, best of 3: 465 ns per loop
Code:
import numpy as np
from numpy cimport float64_t, ndarray, NPY_DOUBLE, npy_intp

ctypedef float64_t DOUBLE

cdef extern from "numpy/arrayobject.h":
    cdef void import_array()
    cdef object PyArray_SimpleNew(int nd, npy_intp *dims, int typenum)

# initialize numpy
import_array()

def matmult(int r, int c):
    cdef npy_intp *dims = [r, c]  # Is there a faster way to do this?
    cdef ndarray[DOUBLE, ndim=2] out = PyArray_SimpleNew(2, dims, NPY_DOUBLE)
    return out

def matmult2(int r, int c):
    cdef ndarray[DOUBLE, ndim=2] out = np.empty((r, c), dtype=np.float64)
    return out
On Tue, Dec 7, 2010 at 08:54, Charles R Harris
<charles...@gmail.com> wrote:
> Blas adds quite a bit of overhead for multiplying small matrices, but so
> does calling from python. For implementing Kalman filters it might be better
> to write a whole Kalman class so that operations can be combined at the c
> level.

As I said, he's writing the Kalman filter in Cython.

> Skipper, what kind of Kalman filter are you trying to implement?

Does this help?
http://groups.google.com/group/cython-users/browse_thread/thread/a605a70626a455d?pli=1

--
Robert Kern
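To make the "whole Kalman class" suggestion concrete, here is a rough
sketch, with all names invented for illustration: a cdef class that
allocates its work buffers once, so each filter step runs as plain C
loops with no per-call temporaries:

cimport cython
import numpy as np
from numpy cimport float64_t, ndarray

cdef class KalmanFilter:
    cdef ndarray T        # (m, m) transition matrix
    cdef ndarray a, anew  # (m,) state vector and scratch buffer

    def __init__(self, T, a0):
        self.T = np.ascontiguousarray(T, dtype=np.float64)
        self.a = np.array(a0, dtype=np.float64)
        self.anew = np.empty_like(self.a)

    @cython.boundscheck(False)
    cdef void predict(self):
        # anew = T * a written as explicit loops: no temporary arrays
        cdef ndarray[float64_t, ndim=2, mode='c'] T = self.T
        cdef ndarray[float64_t, ndim=1] a = self.a
        cdef ndarray[float64_t, ndim=1] anew = self.anew
        cdef int i, j
        cdef double s
        for i in xrange(T.shape[0]):
            s = 0
            for j in xrange(T.shape[1]):
                s += T[i, j] * a[j]
            anew[i] = s
        self.a, self.anew = self.anew, self.a  # swap buffers

The update step would follow the same pattern with the gain and
innovation; the point is just that the buffers and loops stay at the C
level across all observations.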
Thanks. This was my next step and would've taken me some time.
Skipper
I actually tried to use tokyo, but I couldn't get it to build out of
the box against the ATLAS I compiled a few days ago. A few changes to
setup.py didn't fix it, so I gave up.
> For intermediate-size matrices the use of SSE instructions should be
> able to offset any call overhead. Try to steer clear of using NumPy for
> slicing, though; instead one should do pointer arithmetic...
Right. Thanks.
Skipper
It's just a linear Gaussian filter. I use it to get the loglikelihood
of a univariate ARMA series with exact initial conditions. As it
stands it is fairly inflexible, but if I can make it fast I would like
to generalize it.
There is a fair amount of scratch work in here, and some attempts at
generalized state space models, but all the action for my purposes is
in KalmanFilter.loglike
It's not terribly slow, but I have to maximize the likelihood using
numerical derivatives, so it's getting called quite a few times. A
1000 observation ARMA(2,2) series takes about 5-6 seconds on my
machine with fmin_l_bfgs_b.
Nope, unless something very strange is going on, that line would be
ridiculously fast compared to the rest. Basically just copying two
integers on the stack.
Try PyArray_EMPTY?
Dag Sverre
On Tue, Dec 7, 2010 at 9:54 AM, Charles R Harris
<charles...@gmail.com> wrote:
> On Mon, Dec 6, 2010 at 11:56 PM, Fernando Perez <fperez.net@gmail.com>
<snip>
It's still a linear filter, non-linear optimization comes in because
the exact loglikelihood function for ARMA is non-linear in the
coefficients.
(There might be a way to calculate the derivative in the same loop,
but that's a different issue.)
Josef
>
> Chuck
PyArray_EMPTY is a little faster (but np.empty is still much faster):
PyArray_SimpleNew
>> timeit matmult(2,2)
1000000 loops, best of 3: 778 ns per loop
PyArray_EMPTY
>> timeit matmult3(2,2)
1000000 loops, best of 3: 763 ns per loop
np.empty in python
>> timeit np.empty((2,2))
1000000 loops, best of 3: 470 ns per loop
def matmult3(int r, int c):
    cdef npy_intp *dims = [r, c]
    cdef ndarray[DOUBLE, ndim=2] out = PyArray_EMPTY(2, dims, NPY_FLOAT64, 0)
    return out
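For completeness, PyArray_EMPTY needs a declaration before Cython will
accept that call. Something along these lines, mirroring the
PyArray_SimpleNew extern used earlier, should work (the last argument
selects Fortran order; 0 gives a C-ordered array, and NPY_FLOAT64 is
the same typenum as NPY_DOUBLE):

from numpy cimport float64_t, ndarray, npy_intp, NPY_FLOAT64

cdef extern from "numpy/arrayobject.h":
    cdef void import_array()
    cdef object PyArray_EMPTY(int nd, npy_intp *dims, int typenum, int fortran)

# initialize numpy
import_array()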
Right. The derivative is of the whole likelihood function with
respect to the parameters that make up the system matrices in the
state equation. You can calculate the derivatives as part of the
recursions but the literature I've seen suggests that numerical
derivatives of the state equation matrices are the way to go.
Skipper
On Tue, Dec 7, 2010 at 11:53 AM, Charles R Harris
<charles...@gmail.com> wrote:
>
>
> On Tue, Dec 7, 2010 at 9:47 AM, Skipper Seabold <jsse...@gmail.com> wrote:
>>
>> On Tue, Dec 7, 2010 at 9:54 AM, Charles R Harris
>> <charles...@gmail.com> wrote:
>> >
>> >
>> > On Mon, Dec 6, 2010 at 11:56 PM, Fernando Perez <fperez.net@gmail.com>
<snip>
>> It's still a linear filter, non-linear optimization comes in because
>> the exact loglikelihood function for ARMA is non-linear in the
>> coefficients.
>> (There might be a way to calculate the derivative in the same loop,
>> but that's a different issue.)
>>
>
> The unscented Kalman filter is a better way to estimate the covariance of a
> non-linear process, think of it as a better integrator. If the propagation
> is easy to compute, which seems to be the case here, it will probably save
> you some time. You might even be able to use the basic idea and skip the
> Kalman part altogether.
>
> My general aim here is to optimize the algorithm first before getting caught
> up in the details of matrix multiplication in c. Premature optimization and
> all that.
>
Hmm, I haven't seen this mentioned much in what I've been reading or in
the documentation on existing software for ARMA processes, so I never
thought much about it. I will have a closer look. Well, google turns
up this thread...
There is another optimization that I could employ by switching to fast
recursions when the state variance converges to its steady state, but
this makes it less general for future enhancements (ie., time varying
coefficients). Maybe I will go ahead and try it.
Skipper
> On Tue, Dec 7, 2010 at 12:17 PM, Charles R Harris
> <charles...@gmail.com> wrote:
<snip>
> Hmm, I haven't seen this mentioned much in what I've been reading or in
> the documentation on existing software for ARMA processes, so I never
> thought much about it. I will have a closer look. Well, google turns
> up this thread...
There are faster ways to get the likelihood for a simple ARMA process
than using a Kalman Filter. I think the main advantage, and the reason
for the popularity of the Kalman Filter here, is that it is easier to
extend. So using too many tricks that are specific to the simple ARMA
case might take away much of the advantage of having a fast Kalman
Filter. I didn't read much of the details of the Kalman Filter for
this, but that was my conclusion from the non-Kalman-Filter literature.
Josef
>
> Chuck
That's the idea. The advantages of the KF are that it's inherently
structural and it's *very* general. The ARMA case was just a jumping
off point, but has also proved to be a sticking point. I'd like to
have a fast and general linear Gaussian KF available for larger state
space models, as it's the baseline workhorse for estimating linearized
large macroeconomic models at the moment.
>
> Well, there are five forms of the standard Kalman filter that I am somewhat
> familiar with and some are better suited to some applications than others.
> But at this point I don't see that there is any reason not to use the common
> form for the ARMA case. It would be interesting to see some profiling since
> the matrix inversions are likely to dominate as the number of variables go
> up.
>
If interested, I am using Durbin and Koopman's "Time Series Analysis
by State Space Methods" and Andrew Harvey's "Forecasting, Structural
Time Series Models, and the Kalman Filter" as my main references for
this. The former is nice and concise but has a lot of details,
suggestions, and use cases.
I have looked some more, and it does seem that the filter converges to
its steady state after maybe 2% of the iterations, depending on the
properties of the series. So for the ARMA case I can switch to the fast
recursions, only updating the state (not quite sure on the time savings
yet), but I am moving away from my goal of a fast and general KF
implementation...
About the matrix inversions, the ARMA model right now is only
univariate, so there is no real inverting of matrices. The suggestion
of Durbin and Koopman for larger, multivariate cases is to split it
into a series of univariate problems in order to avoid inversion.
They provide in the book some benchmarks on computational efficiency
in terms of multiplications needed based on their experience writing
http://www.ssfpack.com/index.html.
Skipper
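To spell out the fast-recursion idea for the univariate case: once P_t
has converged, the gain K and prediction variance F are constant, so
each observation only updates the state. A rough sketch (notation
loosely follows Durbin and Koopman; the function itself is mine, not
Skipper's code):

import numpy as np

def loglike_steady(y, Z, T, a, K, F):
    # After convergence: v_t = y_t - Z a_t, a_{t+1} = T a_t + K v_t,
    # with K and F frozen at their steady-state values.
    ll = 0.0
    for t in xrange(len(y)):
        v = y[t] - np.dot(Z, a)   # univariate prediction error
        a = np.dot(T, a) + K * v  # state update only; P, K, F fixed
        ll += -0.5 * (np.log(2 * np.pi * F) + v ** 2 / F)
    return ll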
It looks like I don't save too much time with just Python/scipy
optimizations. Apparently ~75% of the time is spent in l-bfgs-b,
judging by its user time output and the profiler's CPU time output(?).
Non-cython versions:
Brief and rough profiling on my laptop for ARMA(2,2) with 1000
observations. Optimization uses fmin_l_bfgs_b with m = 12 and iprint
= 0.
Full Kalman Filter, starting parameters found via iterations in Python
-----------------------------------------------------------------------------------------------------
1696041 function calls (1695957 primitive calls) in 7.622 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
114 3.226 0.028 5.123 0.045 kalmanf.py:504(loglike)
1368384 1.872 0.000 1.872 0.000 {numpy.core._dotblas.dot}
102 1.736 0.017 2.395 0.023 arima.py:196(loglike_css)
203694 0.622 0.000 0.622 0.000 {sum}
90 0.023 0.000 0.023 0.000 {numpy.linalg.lapack_lite.dgesdd}
218 0.020 0.000 0.024 0.000 arima.py:117(_transparams)
46 0.015 0.000 0.016 0.000 function_base.py:494(asarray_chkfinite)
1208 0.013 0.000 0.013 0.000 {numpy.core.multiarray.array}
102163 0.012 0.000 0.012 0.000 {method 'append' of 'list' objects}
46 0.010 0.000 0.028 0.001 decomp_svd.py:12(svd)
<snip>
Full Kalman Filter, starting parameters found with scipy.signal.lfilter
--------------------------------------------------------------------------------------------------
1249493 function calls (1249409 primitive calls) in 4.596 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
102 2.862 0.028 4.437 0.043 kalmanf.py:504(loglike)
1224360 1.556 0.000 1.556 0.000 {numpy.core._dotblas.dot}
270 0.029 0.000 0.029 0.000 {sum}
90 0.025 0.000 0.025 0.000 {numpy.linalg.lapack_lite.dgesdd}
194 0.018 0.000 0.021 0.000 arima.py:117(_transparams)
46 0.016 0.000 0.017 0.000 function_base.py:494(asarray_chkfinite)
46 0.011 0.000 0.029 0.001 decomp_svd.py:12(svd)
<snip>
Kalman Filter with fast recursions, starting parameters with lfilter
---------------------------------------------------------------------------------------------
1097454 function calls (1097370 primitive calls) in 4.465 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
90 2.860 0.032 4.305 0.048 kalmanf.py:504(loglike)
1073757 1.431 0.000 1.431 0.000 {numpy.core._dotblas.dot}
270 0.029 0.000 0.029 0.000 {sum}
90 0.025 0.000 0.025 0.000 {numpy.linalg.lapack_lite.dgesdd}
182 0.016 0.000 0.019 0.000 arima.py:117(_transparams)
46 0.016 0.000 0.018 0.000 function_base.py:494(asarray_chkfinite)
46 0.011 0.000 0.030 0.001 decomp_svd.py:12(svd)
<snip>
A completely different idea: how costly are the numerical derivatives
in l-bfgs-b? With l-bfgs-b, you should be able to replace them with
complex-step derivatives, which give the loglike function value and the
derivatives in one evaluation.
Josef
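The idea, for the record: for analytic f and tiny h, f(x + ih) = f(x) +
ih f'(x) + O(h^2), so Im f(x + ih)/h gives the derivative with no
subtractive cancellation, while Re f(x + ih) is the function value. A
minimal sketch of the gradient version (this particular
approx_fprime_cs is mine, not necessarily what ended up in any library):

import numpy as np

def approx_fprime_cs(x, func, args=(), h=1.0e-20):
    # Complex-step gradient: perturb each coordinate along the
    # imaginary axis; func must be complex-safe (no abs(), no
    # branching on intermediate values).
    grad = np.zeros(len(x))
    for k in xrange(len(x)):
        xc = np.asarray(x, dtype=complex).copy()
        xc[k] += 1j * h
        grad[k] = func(xc, *args).imag / h
    return grad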
I couldn't figure out how to use it without some hacks. fmin_l_bfgs_b
will call both f and fprime as (x, *args), but approx_fprime or
approx_fprime_cs actually need to be called as approx_fprime(x, func,
args=args), and they then call func(x, *args) themselves. I changed
fmin_l_bfgs_b to make the call like this for the gradient, and I get
(different computer):
Using approx_fprime_cs
-----------------------------------
861609 function calls (861525 primitive calls) in 3.337 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
70 1.942 0.028 3.213 0.046 kalmanf.py:504(loglike)
840296 1.229 0.000 1.229 0.000 {numpy.core._dotblas.dot}
56 0.038 0.001 0.038 0.001 {numpy.linalg.lapack_lite.zgesv}
270 0.025 0.000 0.025 0.000 {sum}
90 0.019 0.000 0.019 0.000 {numpy.linalg.lapack_lite.dgesdd}
46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
162 0.012 0.000 0.014 0.000 arima.py:117(_transparams)
Using approx_grad = True
---------------------------------------
1097454 function calls (1097370 primitive calls) in 3.615 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
90 2.316 0.026 3.489 0.039 kalmanf.py:504(loglike)
1073757 1.164 0.000 1.164 0.000 {numpy.core._dotblas.dot}
270 0.025 0.000 0.025 0.000 {sum}
90 0.020 0.000 0.020 0.000 {numpy.linalg.lapack_lite.dgesdd}
182 0.014 0.000 0.016 0.000 arima.py:117(_transparams)
46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
46 0.008 0.000 0.023 0.000 decomp_svd.py:12(svd)
23 0.004 0.000 0.004 0.000 {method 'var' of 'numpy.ndarray' objects}
Definitely fewer function calls and a little faster, but I had to write
some hacks to get it to work.
Skipper
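For what it's worth, the same effect could be had without patching
fmin_l_bfgs_b, since it only requires that fprime accept (x, *args); a
small adapter closure would bridge the two calling conventions (a
sketch, assuming the approx_fprime(x, func, args=args) convention
described above; all names in the usage comment are hypothetical):

def make_fprime(func, approx, **kwargs):
    # fmin_l_bfgs_b calls fprime(x, *args), while approx_fprime(_cs)
    # wants (x, func, args=args) and then calls func(x, *args) itself.
    def fprime(x, *args):
        return approx(x, func, args=args, **kwargs)
    return fprime

# e.g.:
# fmin_l_bfgs_b(neg_loglike, x0,
#               fprime=make_fprime(neg_loglike, approx_fprime_cs),
#               args=(endog,), m=12, iprint=0)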
This is more like it! With fast recursions in Cython:
15186 function calls (15102 primitive calls) in 0.750 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
18 0.622 0.035 0.625 0.035 kalman_loglike.pyx:15(kalman_loglike)
270 0.024 0.000 0.024 0.000 {sum}
90 0.019 0.000 0.019 0.000 {numpy.linalg.lapack_lite.dgesdd}
156 0.013 0.000 0.013 0.000 {numpy.core._dotblas.dot}
46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
110 0.008 0.000 0.010 0.000 arima.py:118(_transparams)
46 0.008 0.000 0.023 0.000 decomp_svd.py:12(svd)
23 0.004 0.000 0.004 0.000 {method 'var' of 'numpy.ndarray' objects}
26 0.004 0.000 0.004 0.000 tsatools.py:109(lagmat)
90 0.004 0.000 0.042 0.000 arima.py:197(loglike_css)
81 0.004 0.000 0.004 0.000 {numpy.core.multiarray._fastCopyAndTranspose}
I can live with this for now.
I should add that it worked on both OS X and Ubuntu; I have not tested
on Windows (yet).
Revisiting this topic from a few months ago. I was able to get Tokyo to
build; see https://github.com/wesm/tokyo/blob/master/setup.py
$ python double_speed.py
<snip>
SPEED TEST BLAS 2
SPEED TEST BLAS 3
Is there any chance you could explain the fast recursion algorithm, or
supply the Cython code you used to implement it?
Thanks,
Rob
Missed this. For posterity, this was discussed here.
https://groups.google.com/group/pystatsmodels/browse_thread/thread/b27668230a65d1b9