import numpy as np
from matmult import dotAB, multAB
A = np.array([[ 1., 3., 4.],
              [ 5., 6., 3.]])
B = A.T.copy()
timeit dotAB(A,B)
# 1 loops, best of 3: 826 ms per loop
timeit multAB(A,B)
# 1 loops, best of 3: 1.16 s per loop
As you can see, my multAB results in a negative speedup: it runs at only about 0.7x the speed of dot.
I compile the Cython code with

cython -a matmult.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
    -I/usr/include/python2.6 \
    -I/usr/local/lib/python2.6/dist-packages/numpy/core/include/ \
    -o matmult.so matmult.c
Cython code is attached and inlined below.
The profile is here (I don't understand why some of the lines show up as
bottlenecks): http://eagle1.american.edu/~js2796a/matmult/matmult.html
-----------------------------------------------------------
from numpy cimport float64_t, ndarray, NPY_DOUBLE, npy_intp
cimport cython
from numpy import dot

ctypedef float64_t DOUBLE

cdef extern from "numpy/arrayobject.h":
    cdef void import_array()
    cdef object PyArray_SimpleNew(int nd, npy_intp *dims, int typenum)

import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline object matmult(ndarray[DOUBLE, ndim=2, mode='c'] A,
                           ndarray[DOUBLE, ndim=2, mode='c'] B):
    cdef int lda = A.shape[0]
    cdef int n = B.shape[1]
    cdef npy_intp *dims = [lda, n]
    cdef ndarray[DOUBLE, ndim=2] out = PyArray_SimpleNew(2, dims, NPY_DOUBLE)
    cdef int i, j, k
    cdef double s
    for i in xrange(lda):
        for j in xrange(n):
            s = 0
            for k in xrange(A.shape[1]):
                s += A[i,k] * B[k,j]
            out[i,j] = s
    return out

def multAB(ndarray[DOUBLE, ndim=2] A, ndarray[DOUBLE, ndim=2] B):
    for i in xrange(1000000):
        C = matmult(A,B)
    return C

def dotAB(ndarray[DOUBLE, ndim=2] A, ndarray[DOUBLE, ndim=2] B):
    for i in xrange(1000000):
        C = dot(A,B)
    return C
Skipper
You'll be hard pressed to do better than Numpy's dot. In the raw data
handling, BLAS is very likely faster than most things you can code
manually. Moreover, the Cython routine you write must have as much
overhead as dot(): dealing with refcounting, allocating/deallocating
PyArrayObjects (which is expensive), etc.
If you are willing to give up wrapping each small matrix in a separate
Numpy ndarray, then you can expect to get additional speed gains.
(Although even in that case it could make more sense to call BLAS
routines to do the multiplication instead, unless your matrices are small
and of fixed size in which case the C compiler may be able to produce
some tightly optimized code.)
However, in many cases the small matrices can be just stuffed into a
single Numpy array. At the moment there is no "vectorized" matrix
multiplication routine, however, so that could be written e.g. in Cython.
--
Pauli Virtanen
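A minimal sketch of what such a batched routine could look like, assuming
the small matrices are stacked along the first axis of C-contiguous
(m, r, k) and (m, k, c) arrays (the name batched_matmult and the layout
are illustrative, not an existing NumPy API):

cimport cython
import numpy as np
from numpy cimport float64_t, ndarray

@cython.boundscheck(False)
@cython.wraparound(False)
def batched_matmult(ndarray[float64_t, ndim=3, mode='c'] A,
                    ndarray[float64_t, ndim=3, mode='c'] B):
    # out[i] = A[i] times B[i] for each small matrix in the stack
    cdef int m = A.shape[0], r = A.shape[1]
    cdef int k = A.shape[2], c = B.shape[2]
    cdef ndarray[float64_t, ndim=3] out = np.empty((m, r, c))
    cdef int i, p, q, t
    cdef double s
    for i in xrange(m):
        for p in xrange(r):
            for q in xrange(c):
                s = 0
                for t in xrange(k):
                    s += A[i, p, t] * B[i, t, q]
                out[i, p, q] = s
    return out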
The main thing for his use case is reducing the overhead when called
from Cython. This started in a cython-users thread where he was
directly calling the Python numpy.dot() from Cython. I suggested that,
given the small dimensions (only up to 10x10), he might do better by
writing the matmult directly in Cython. Unfortunately, the buffer
syntax adds a bunch of overhead. Not the *same* overhead, mind, and I
was hoping it would be less, but it turns out to be more.
Getting access to the C BLAS implementations would be best. I guess
you could get descr.f.dotfunc and use that.
> If you are willing to give up wrapping each small matrix in a separate
> Numpy ndarray, then you can expect to get additional speed gains.
> (Although even in that case it could make more sense to call BLAS
> routines to do the multiplication instead, unless your matrices are small
> and of fixed size in which case the C compiler may be able to produce
> some tightly optimized code.)
>
> However, in many cases the small matrices can be just stuffed into a
> single Numpy array.
His use case (Kalman filters) prevents this.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
Ah, I see. I didn't think about the overhead of PyArrayObject.
Skipper
> @cython.boundscheck(False)
> @cython.wraparound(False)
> cdef inline object matmult(ndarray[DOUBLE, ndim=2, mode='c'] A,
>                            ndarray[DOUBLE, ndim=2, mode='c'] B):
>     cdef int lda = A.shape[0]
>     cdef int n = B.shape[1]
>     cdef npy_intp *dims = [lda, n]
>     cdef ndarray[DOUBLE, ndim=2] out = PyArray_SimpleNew(2, dims, NPY_DOUBLE)
>     cdef int i, j, k
>     cdef double s
Do the cdef's above take a sizeable fraction of the time given that
your input arrays are small? If so, then you could do those before you
enter the inner loop where the dot product is needed. You wouldn't end
up with a reusable matmult function, but you'd get rid of some
overhead.
So in your inner loop, you'd only have:
>     for i in xrange(lda):
>         for j in xrange(n):
>             s = 0
>             for k in xrange(A.shape[1]):
>                 s += A[i,k] * B[k,j]
>             out[i,j] = s
>     return out
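One way to act on that suggestion while keeping a reusable function is to
let the caller own the output array, so repeated calls pay no allocation
cost; a sketch, reusing the module's DOUBLE ctypedef (the _into signature
is my own variation on the code above):

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void matmult_into(ndarray[DOUBLE, ndim=2, mode='c'] A,
                       ndarray[DOUBLE, ndim=2, mode='c'] B,
                       ndarray[DOUBLE, ndim=2, mode='c'] out):
    # Same triple loop as matmult, but writing into a preallocated out,
    # so no PyArrayObject is created per call (buffer unpacking still
    # happens on entry, which is part of the overhead discussed here).
    cdef int i, j, k
    cdef double s
    for i in xrange(A.shape[0]):
        for j in xrange(B.shape[1]):
            s = 0
            for k in xrange(A.shape[1]):
                s += A[i, k] * B[k, j]
            out[i, j] = s

The timing loop would then allocate C once and call matmult_into(A, B, C)
a million times.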
Sorry for the cross-post. I figured this was better hashed out over here.
> Getting access to the C BLAS implementations would be best. I guess
> you could get descr.f.dotfunc and use that.
>
Thanks, I will see what I can come up with. I know it can be sped up,
since other software written in C++ solves the whole optimization almost
instantaneously where mine takes ~5 seconds for the same case, and my
profiling says that most of the time is spent in the loglikelihood
loop.
>> If you are willing to give up wrapping each small matrix in a separate
>> Numpy ndarray, then you can expect to get additional speed gains.
>> (Although even in that case it could make more sense to call BLAS
>> routines to do the multiplication instead, unless your matrices are small
>> and of fixed size in which case the C compiler may be able to produce
>> some tightly optimized code.)
>>
>> However, in many cases the small matrices can be just stuffed into a
>> single Numpy array.
>
> His use case (Kalman filters) prevents this.
>
For posterity's sake. More akin to my actual problem.
http://groups.google.com/group/cython-users/browse_thread/thread/a605a70626a455d
Does this generate C code, since it's not a cdef? (I haven't updated
Cython in a while.)
I guess you would want to have the entire loop in C.
Josef
>
> def dotAB(ndarray[DOUBLE, ndim=2] A, ndarray[DOUBLE, ndim=2] B):
>     for i in xrange(1000000):
>         C = dot(A,B)
>     return C
>
> Skipper
>
On Mon, Dec 6, 2010 at 2:34 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> I'm wondering if anyone might have a look at my cython code that does
> matrix multiplication and see where I can speed it up or offer some
> pointers/reading. I'm new to Cython and my knowledge of C is pretty
> basic based on trial and (mostly) error, so I am sure the code is
> still very naive.
A few years ago I had a similar problem, and I ended up getting a very
significant speedup by hand-coding a very unsafe, but very fast pure C
extension just to compute these inner products. This was basically a
replacement for dot() that would only work with double precision
inputs of compatible dimensions and would happily segfault with
anything else, but it ran very fast. The inner loop is implemented
completely naively, but it still beats calls to BLAS (even linked with
ATLAS) for small matrix dimensions (my case was also up to ~ 15x15).
I'm attaching the code in case you find it useful; please keep in mind
that I haven't compiled it in years, so it may have bit-rotted a little.
Cheers,
f
Another idea: If the matrices are more in the intermediate range, here's
a Cython library for calling BLAS more directly:
http://www.vetta.org/2009/09/tokyo-a-cython-blas-wrapper-for-fast-matrix-math/
For intermediate-size matrices the use of SSE instructions should be
able to offset any call overhead. Try to steer clear of using NumPy for
slicing, though; instead one should do pointer arithmetic...
Dag Sverre
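As an illustration of the pointer-arithmetic point (my own sketch, not
code from tokyo): take the data pointer once and index rows by offset,
rather than slicing at the Python level, assuming C-contiguous doubles:

cimport cython
from numpy cimport float64_t, ndarray

@cython.boundscheck(False)
cdef double row_dot(ndarray[float64_t, ndim=2, mode='c'] A,
                    ndarray[float64_t, ndim=1, mode='c'] x, int i):
    # Dot product of row i of A with x, without creating the slice A[i, :]
    cdef double *row = <double *>A.data + i * A.shape[1]
    cdef double *xp = <double *>x.data
    cdef double s = 0
    cdef int k
    for k in xrange(A.shape[1]):
        s += row[k] * xp[k]
    return s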
I'd like to reduce the overhead in creating the empty array. Using
PyArray_SimpleNew in Cython is faster than using np.empty, but both are
slower than using np.empty without Cython. Have I done something
wrong? I suspect it has something to do with this line in the code
below: "cdef npy_intp *dims = [r, c]"
PyArray_SimpleNew:
>> timeit matmult(2,2)
1000000 loops, best of 3: 773 ns per loop
np.empty in cython:
>> timeit matmult2(2,2)
1000000 loops, best of 3: 1.62 us per loop
np.empty in python:
>> timeit np.empty((2,2))
1000000 loops, best of 3: 465 ns per loop
Code:
import numpy as np
from numpy cimport float64_t, ndarray, NPY_DOUBLE, npy_intp

ctypedef float64_t DOUBLE

cdef extern from "numpy/arrayobject.h":
    cdef void import_array()
    cdef object PyArray_SimpleNew(int nd, npy_intp *dims, int typenum)

# initialize numpy
import_array()

def matmult(int r, int c):
    cdef npy_intp *dims = [r, c]  # Is there a faster way to do this?
    cdef ndarray[DOUBLE, ndim=2] out = PyArray_SimpleNew(2, dims, NPY_DOUBLE)
    return out

def matmult2(int r, int c):
    cdef ndarray[DOUBLE, ndim=2] out = np.empty((r, c), dtype=np.float64)
    return out
On Tue, Dec 7, 2010 at 08:54, Charles R Harris
<charles...@gmail.com> wrote:
> Blas adds quite a bit of overhead for multiplying small matrices, but so
> does calling from python. For implementing Kalman filters it might be better
> to write a whole Kalman class so that operations can be combined at the c
> level.

As I said, he's writing the Kalman filter in Cython.

> Skipper, what kind of Kalman filter are you trying to implement?

Does this help?
http://groups.google.com/group/cython-users/browse_thread/thread/a605a70626a455d?pli=1

--
Robert Kern
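To make the "whole Kalman class" suggestion concrete, here is a rough
sketch, with all names invented for illustration: a cdef class that
allocates its work buffers once, so each filter step runs as plain C
loops with no per-call temporaries:

cimport cython
import numpy as np
from numpy cimport float64_t, ndarray

cdef class KalmanFilter:
    cdef ndarray T        # (m, m) transition matrix
    cdef ndarray a, anew  # (m,) state vector and scratch buffer

    def __init__(self, T, a0):
        self.T = np.ascontiguousarray(T, dtype=np.float64)
        self.a = np.array(a0, dtype=np.float64)
        self.anew = np.empty_like(self.a)

    @cython.boundscheck(False)
    cdef void predict(self):
        # anew = T * a written as explicit loops: no temporary arrays
        cdef ndarray[float64_t, ndim=2, mode='c'] T = self.T
        cdef ndarray[float64_t, ndim=1] a = self.a
        cdef ndarray[float64_t, ndim=1] anew = self.anew
        cdef int i, j
        cdef double s
        for i in xrange(T.shape[0]):
            s = 0
            for j in xrange(T.shape[1]):
                s += T[i, j] * a[j]
            anew[i] = s
        self.a, self.anew = self.anew, self.a  # swap buffers

The update step would follow the same pattern with the gain and
innovation; the point is just that the buffers and loops stay at the C
level across all observations.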
Thanks. This was my next step and would've taken me some time.
Skipper
I actually tried to use tokyo, but I couldn't get it to build out of
the box against the ATLAS I compiled a few days ago. A few changes to
setup.py didn't fix it, so I gave up.
> For intermediate-size matrices the use of SSE instructions should be
> able to offset any call overhead. Try to steer clear of using NumPy for
> slicing, though; instead one should do pointer arithmetic...
Right. Thanks.
Skipper
It's just a linear Gaussian filter. I use it to get the loglikelihood
of a univariate ARMA series with exact initial conditions. As it
stands it is fairly inflexible, but if I can make it fast I would like
to generalize it.
There is a fair amount of scratch work in here, and some attempts at
generalized state space models, but all the action for my purposes is
in KalmanFilter.loglike
It's not terribly slow, but I have to maximize the likelihood using
numerical derivatives, so it's getting called quite a few times. A
1000 observation ARMA(2,2) series takes about 5-6 seconds on my
machine with fmin_l_bfgs_b.
Nope, unless something very strange is going on, that line would be
ridiculously fast compared to the rest. Basically just copying two
integers on the stack.
Try PyArray_EMPTY?
Dag Sverre
On Tue, Dec 7, 2010 at 9:54 AM, Charles R Harris
<charles...@gmail.com> wrote:
> On Mon, Dec 6, 2010 at 11:56 PM, Fernando Perez <fperez.net@gmail.com>
<snip>
It's still a linear filter, non-linear optimization comes in because
the exact loglikelihood function for ARMA is non-linear in the
coefficients.
(There might be a way to calculate the derivative in the same loop,
but that's a different issue.)
Josef
>
> Chuck
PyArray_EMPTY is a little faster (but np.empty is still much faster):
PyArray_SimpleNew
>> timeit matmult(2,2)
1000000 loops, best of 3: 778 ns per loop
PyArray_EMPTY
>> timeit matmult3(2,2)
1000000 loops, best of 3: 763 ns per loop
np.empty in python
>> timeit np.empty((2,2))
1000000 loops, best of 3: 470 ns per loop
def matmult3(int r, int c):
    cdef npy_intp *dims = [r, c]
    cdef ndarray[DOUBLE, ndim=2] out = PyArray_EMPTY(2, dims, NPY_FLOAT64, 0)
    return out
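For completeness, PyArray_EMPTY needs a declaration before Cython will
accept that call. Something along these lines, mirroring the
PyArray_SimpleNew extern used earlier, should work (the last argument
selects Fortran order; 0 gives a C-ordered array, and NPY_FLOAT64 is
the same typenum as NPY_DOUBLE):

from numpy cimport float64_t, ndarray, npy_intp, NPY_FLOAT64

cdef extern from "numpy/arrayobject.h":
    cdef void import_array()
    cdef object PyArray_EMPTY(int nd, npy_intp *dims, int typenum, int fortran)

# initialize numpy
import_array()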
Right. The derivative is of the whole likelihood function with
respect to the parameters that make up the system matrices in the
state equation. You can calculate the derivatives as part of the
recursions but the literature I've seen suggests that numerical
derivatives of the state equation matrices are the way to go.
Skipper
On Tue, Dec 7, 2010 at 11:53 AM, Charles R Harris
<charles...@gmail.com> wrote:
>
>
> On Tue, Dec 7, 2010 at 9:47 AM, Skipper Seabold <jsse...@gmail.com> wrote:
>>
>> On Tue, Dec 7, 2010 at 9:54 AM, Charles R Harris
>> <charles...@gmail.com> wrote:
>> >
>> >
>> > On Mon, Dec 6, 2010 at 11:56 PM, Fernando Perez <fperez.net@gmail.com>
<snip>
>> It's still a linear filter, non-linear optimization comes in because
>> the exact loglikelihood function for ARMA is non-linear in the
>> coefficients.
>> (There might be a way to calculate the derivative in the same loop,
>> but that's a different issue.)
>>
>
> The unscented Kalman filter is a better way to estimate the covariance of a
> non-linear process, think of it as a better integrator. If the propagation
> is easy to compute, which seems to be the case here, it will probably save
> you some time. You might even be able to use the basic idea and skip the
> Kalman part altogether.
>
> My general aim here is to optimize the algorithm first before getting caught
> up in the details of matrix multiplication in c. Premature optimization and
> all that.
>
Hmm, I haven't seen this mentioned much in what I've been reading or in
the documentation on existing software for ARMA processes, so I never
thought much about it. I will have a closer look. Well, google turns
up this thread...
There is another optimization that I could employ by switching to fast
recursions when the state variance converges to its steady state, but
this makes it less general for future enhancements (ie., time varying
coefficients). Maybe I will go ahead and try it.
Skipper
> On Tue, Dec 7, 2010 at 12:17 PM, Charles R Harris
> <charles...@gmail.com> wrote:
<snip>
> Hmm, I haven't seen this mentioned much in what I've been reading or in
> the documentation on existing software for ARMA processes, so I never
> thought much about it. I will have a closer look. Well, google turns
> up this thread...
There are faster ways to get the likelihood for a simple ARMA process
than using a Kalman Filter. I think the main advantage, and the reason
for the popularity of the Kalman Filter here, is that it is easier to
extend. So using too many tricks that are specific to the simple ARMA
case might take away much of the advantage of having a fast Kalman
Filter. I didn't read much of the details of the Kalman Filter for
this, but that was my conclusion from the non-Kalman-Filter literature.
Josef
>
> Chuck
That's the idea. The advantages of the KF are that it's inherently
structural and it's *very* general. The ARMA case was just a jumping
off point, but has also proved to be a sticking point. I'd like to
have a fast and general linear Gaussian KF available for larger state
space models, as it's the baseline workhorse for estimating linearized
large macroeconomic models at the moment.
>
> Well, there are five forms of the standard Kalman filter that I am somewhat
> familiar with and some are better suited to some applications than others.
> But at this point I don't see that there is any reason not to use the common
> form for the ARMA case. It would be interesting to see some profiling since
> the matrix inversions are likely to dominate as the number of variables go
> up.
>
If interested, I am using Durbin and Koopman's "Time Series Analysis
by State Space Methods" and Andrew Harvey's "Forecasting, Structural
Time Series Models, and the Kalman Filter" as my main references for
this. The former is nice and concise but has a lot of details,
suggestions, and use cases.
I have looked some more, and it does seem that the filter converges to
its steady state after maybe 2% of the iterations, depending on the
properties of the series. So for the ARMA case I can switch to the fast
recursions, only updating the state (not quite sure on the time savings
yet), but I am moving away from my goal of a fast and general KF
implementation...
About the matrix inversions, the ARMA model right now is only
univariate, so there is no real inverting of matrices. The suggestion
of Durbin and Koopman for larger, multivariate cases is to split it
into a series of univariate problems in order to avoid inversion.
They provide in the book some benchmarks on computational efficiency
in terms of multiplications needed based on their experience writing
http://www.ssfpack.com/index.html.
Skipper
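To spell out the fast-recursion idea for the univariate case: once P_t
has converged, the gain K and prediction variance F are constant, so
each observation only updates the state. A rough sketch (notation
loosely follows Durbin and Koopman; the function itself is mine, not
Skipper's code):

import numpy as np

def loglike_steady(y, Z, T, a, K, F):
    # After convergence: v_t = y_t - Z a_t, a_{t+1} = T a_t + K v_t,
    # with K and F frozen at their steady-state values.
    ll = 0.0
    for t in xrange(len(y)):
        v = y[t] - np.dot(Z, a)   # univariate prediction error
        a = np.dot(T, a) + K * v  # state update only; P, K, F fixed
        ll += -0.5 * (np.log(2 * np.pi * F) + v ** 2 / F)
    return ll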
It looks like I don't save too much time with just Python/scipy
optimizations. Apparently ~75% of the time is spent in l-bfgs-b,
judging by its user time output and the profiler's CPU time output(?).
Non-cython versions:
Brief and rough profiling on my laptop for ARMA(2,2) with 1000
observations. Optimization uses fmin_l_bfgs_b with m = 12 and iprint
= 0.
Full Kalman Filter, starting parameters found via iterations in Python
-----------------------------------------------------------------------------------------------------
1696041 function calls (1695957 primitive calls) in 7.622 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
114 3.226 0.028 5.123 0.045 kalmanf.py:504(loglike)
1368384 1.872 0.000 1.872 0.000 {numpy.core._dotblas.dot}
102 1.736 0.017 2.395 0.023 arima.py:196(loglike_css)
203694 0.622 0.000 0.622 0.000 {sum}
90 0.023 0.000 0.023 0.000 {numpy.linalg.lapack_lite.dgesdd}
218 0.020 0.000 0.024 0.000 arima.py:117(_transparams)
46 0.015 0.000 0.016 0.000 function_base.py:494(asarray_chkfinite)
1208 0.013 0.000 0.013 0.000 {numpy.core.multiarray.array}
102163 0.012 0.000 0.012 0.000 {method 'append' of 'list' objects}
46 0.010 0.000 0.028 0.001 decomp_svd.py:12(svd)
<snip>
Full Kalman Filter, starting parameters found with scipy.signal.lfilter
--------------------------------------------------------------------------------------------------
1249493 function calls (1249409 primitive calls) in 4.596 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
102 2.862 0.028 4.437 0.043 kalmanf.py:504(loglike)
1224360 1.556 0.000 1.556 0.000 {numpy.core._dotblas.dot}
270 0.029 0.000 0.029 0.000 {sum}
90 0.025 0.000 0.025 0.000 {numpy.linalg.lapack_lite.dgesdd}
194 0.018 0.000 0.021 0.000 arima.py:117(_transparams)
46 0.016 0.000 0.017 0.000 function_base.py:494(asarray_chkfinite)
46 0.011 0.000 0.029 0.001 decomp_svd.py:12(svd)
<snip>
Kalman Filter with fast recursions, starting parameters with lfilter
---------------------------------------------------------------------------------------------
1097454 function calls (1097370 primitive calls) in 4.465 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
90 2.860 0.032 4.305 0.048 kalmanf.py:504(loglike)
1073757 1.431 0.000 1.431 0.000 {numpy.core._dotblas.dot}
270 0.029 0.000 0.029 0.000 {sum}
90 0.025 0.000 0.025 0.000 {numpy.linalg.lapack_lite.dgesdd}
182 0.016 0.000 0.019 0.000 arima.py:117(_transparams)
46 0.016 0.000 0.018 0.000 function_base.py:494(asarray_chkfinite)
46 0.011 0.000 0.030 0.001 decomp_svd.py:12(svd)
<snip>
A completely different idea: how costly are the numerical derivatives
in l-bfgs-b? With l-bfgs-b, you should be able to replace them with
complex-step derivatives, which give the loglike function value and the
derivatives in one evaluation.
Josef
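The idea, for the record: for analytic f and tiny h, f(x + ih) = f(x) +
ih f'(x) + O(h^2), so Im f(x + ih)/h gives the derivative with no
subtractive cancellation, while Re f(x + ih) is the function value. A
minimal sketch of the gradient version (this particular
approx_fprime_cs is mine, not necessarily what ended up in any library):

import numpy as np

def approx_fprime_cs(x, func, args=(), h=1.0e-20):
    # Complex-step gradient: perturb each coordinate along the
    # imaginary axis; func must be complex-safe (no abs(), no
    # branching on intermediate values).
    grad = np.zeros(len(x))
    for k in xrange(len(x)):
        xc = np.asarray(x, dtype=complex).copy()
        xc[k] += 1j * h
        grad[k] = func(xc, *args).imag / h
    return grad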
I couldn't figure out how to use it without some hacks. fmin_l_bfgs_b
will call both f and fprime as (x, *args), but approx_fprime or
approx_fprime_cs actually need to be called as approx_fprime(x, func,
args=args), and they then call func(x, *args) themselves. I changed
fmin_l_bfgs_b to make the call like this for the gradient, and I get
(different computer):
Using approx_fprime_cs
-----------------------------------
861609 function calls (861525 primitive calls) in 3.337 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
70 1.942 0.028 3.213 0.046 kalmanf.py:504(loglike)
840296 1.229 0.000 1.229 0.000 {numpy.core._dotblas.dot}
56 0.038 0.001 0.038 0.001 {numpy.linalg.lapack_lite.zgesv}
270 0.025 0.000 0.025 0.000 {sum}
90 0.019 0.000 0.019 0.000 {numpy.linalg.lapack_lite.dgesdd}
46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
162 0.012 0.000 0.014 0.000 arima.py:117(_transparams)
Using approx_grad = True
---------------------------------------
1097454 function calls (1097370 primitive calls) in 3.615 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
90 2.316 0.026 3.489 0.039 kalmanf.py:504(loglike)
1073757 1.164 0.000 1.164 0.000 {numpy.core._dotblas.dot}
270 0.025 0.000 0.025 0.000 {sum}
90 0.020 0.000 0.020 0.000 {numpy.linalg.lapack_lite.dgesdd}
182 0.014 0.000 0.016 0.000 arima.py:117(_transparams)
46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
46 0.008 0.000 0.023 0.000 decomp_svd.py:12(svd)
23 0.004 0.000 0.004 0.000 {method 'var' of 'numpy.ndarray' objects}
Definitely fewer function calls and a little faster, but I had to write
some hacks to get it to work.
Skipper
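For what it's worth, the same effect could be had without patching
fmin_l_bfgs_b, since it only requires that fprime accept (x, *args); a
small adapter closure would bridge the two calling conventions (a
sketch, assuming the approx_fprime(x, func, args=args) convention
described above; all names in the usage comment are hypothetical):

def make_fprime(func, approx, **kwargs):
    # fmin_l_bfgs_b calls fprime(x, *args), while approx_fprime(_cs)
    # wants (x, func, args=args) and then calls func(x, *args) itself.
    def fprime(x, *args):
        return approx(x, func, args=args, **kwargs)
    return fprime

# e.g.:
# fmin_l_bfgs_b(neg_loglike, x0,
#               fprime=make_fprime(neg_loglike, approx_fprime_cs),
#               args=(endog,), m=12, iprint=0)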
This is more like it! With fast recursions in Cython:
15186 function calls (15102 primitive calls) in 0.750 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
18 0.622 0.035 0.625 0.035 kalman_loglike.pyx:15(kalman_loglike)
270 0.024 0.000 0.024 0.000 {sum}
90 0.019 0.000 0.019 0.000 {numpy.linalg.lapack_lite.dgesdd}
156 0.013 0.000 0.013 0.000 {numpy.core._dotblas.dot}
46 0.013 0.000 0.014 0.000 function_base.py:494(asarray_chkfinite)
110 0.008 0.000 0.010 0.000 arima.py:118(_transparams)
46 0.008 0.000 0.023 0.000 decomp_svd.py:12(svd)
23 0.004 0.000 0.004 0.000 {method 'var' of 'numpy.ndarray' objects}
26 0.004 0.000 0.004 0.000 tsatools.py:109(lagmat)
90 0.004 0.000 0.042 0.000 arima.py:197(loglike_css)
81 0.004 0.000 0.004 0.000 {numpy.core.multiarray._fastCopyAndTranspose}
I can live with this for now.
I should add that it worked on both OS X and Ubuntu; I have not tested
on Windows (yet).
Revisiting this topic from a few months ago. I was able to get Tokyo to
build; see https://github.com/wesm/tokyo/blob/master/setup.py
$ python double_speed.py
<snip>
SPEED TEST BLAS 2
SPEED TEST BLAS 3
Is there any chance you could explain the fast recursion algorithm, or
supply the Cython code you used to implement it?
Thanks,
Rob
Missed this. For posterity, this was discussed here.
https://groups.google.com/group/pystatsmodels/browse_thread/thread/b27668230a65d1b9