Re: [cython-users] old buffer syntax vs memoryviews: speed

Sturla Molden

unread,

Nov 5, 2012, 9:36:08 AM11/5/12

to cython...@googlegroups.com

On 05.11.2012 13:46, stefandw wrote:
> Hello everybody, I have a question about memoryviews (which will
> undoubtedly show my lack of understanding of the low level details of it
> all).
>
> Basically, my problem is that replacing the old buffer syntax with
> memoryview syntax makes my function twice as slow.
> Here is the code. It takes an numpy array as input and returns a new
> numpy array where the sum of column elements has been normalized to 1.

The main problem is that your program loops over discontinuous memory.
Swap the order of i and j. Also declare the arrays you know to be
contiguous to be contiguous (e.g. TM).

Apart from that, I would just write:

TM = TMu / TMu.sum(axis=0)[np.newaxis,:]

Sturla

Dag Sverre Seljebotn

unread,

Nov 5, 2012, 9:46:14 AM11/5/12

to cython...@googlegroups.com

On 11/05/2012 03:36 PM, Sturla Molden wrote:
> On 05.11.2012 13:46, stefandw wrote:
>> Hello everybody, I have a question about memoryviews (which will
>> undoubtedly show my lack of understanding of the low level details of it
>> all).
>>
>> Basically, my problem is that replacing the old buffer syntax with
>> memoryview syntax makes my function twice as slow.
>> Here is the code. It takes an numpy array as input and returns a new
>> numpy array where the sum of column elements has been normalized to 1.
>
> The main problem is that your program loops over discontinuous memory.
> Swap the order of i and j. Also declare the arrays you know to be
> contiguous to be contiguous (e.g. TM).

Note that the matrices were 8x8 matrices, so you hit L1 cache anyway;
this is not it.

I think the problem is that the overhead is in the "np.empty" and the
acquisition of the buffer. memoryviews may well be slower during
acquisition, I don't think that aspect was optimized, performance focus
was on a) performance for larger arrays, b) quickly making slices of
already acquired memoryviews.

Dag Sverre

Chris Barker

unread,

Nov 5, 2012, 1:16:18 PM11/5/12

to cython...@googlegroups.com

On Mon, Nov 5, 2012 at 7:36 AM, stefandw <stefa...@gmail.com> wrote:
> Thanks for your replies!

> @dag sverre: yes, I suspected something like that might be the case. The
> problem I'm having is that I have to construct / multiply these small
> matrices millions of times, and I'm trying to pull out pieces of pure python
> and rewrite into cython, but I guess the object construction overhead is
> always going to kill me? Do you have any suggestions here?

1) Do you need anew array? or could you alter the input one in place?

2) the fact that you are working with 8X8 arrays, and care about speed
suggests that you have a LOT of these to process -- pehaps you could
put them all in a (N, 8, 8) array, and then process them all in a loop
in Cython. That would allow you to create a single numpy array for the
whole pile -- and object creating performance no longer is an issue.

-CHB

> As another example, I moved the matrix "population" to cython as well: a
> single object for the "unnormalized" matrix is created only once in the
> python code, and it then gets populated in the cython code as below. This is
> the biggest drag in the whole program, so I tried various things, but so far
> the old buffer syntax is the fastest (and about 20% faster than the
> memoryview syntax): (TM stands for transition matrix, by the way, it's a
> piece of code to do with markov chains)
>
> def BTM_populate(np.ndarray[DTYPE_t, ndim=2] TMu, int iReo, DTYPE_t qindex,
> DTYPE_t refi, DTYPE_t sell, parameters):
> '''CAREFUL: operates on its first argument'''
> cdef int i, j, k
> cdef DTYPE_t myval
> #------------------------------current to a1 and back
> TMu[1, 0] = exp(parameters['c_a1_const'] + parameters['c_a1_beta'] *
> qindex)
> TMu[0, 1] = exp(parameters['a1_c_const'] + parameters['a1_c_beta'] *
> qindex)
> #---------------------------------increase in arrears
> myval = exp(parameters['a_a+_const'] + parameters['a_a+_beta'] * qindex)
> for i in range(2, iReo): TMu[i, i - 1] = myval
> #------------------------repossession and sale from repossession
> TMu[iReo, iReo - 1] = exp(parameters['a_reo_const'] +
> parameters['a_reo_beta'] * qindex)
> TMu[iReo + 1, iReo] = exp(parameters['reo_sold_const'])
> #---------------------------------prepayment
> TMu[iReo + 2, 0] = exp(parameters['c_pp_const'] +
> parameters['c_pp_beta'] * qindex +
> parameters['pp_refi'] * refi)
> myval = exp(parameters['a_pp_const'] + parameters['a_pp_beta'] * qindex
> +
> parameters['pp_refi'] * refi +
> parameters['pp_sell_from_arr']*sell)
> for i in range(1, iReo - 1): TMu[iReo + 2, i] = myval
> TMu[iReo + 2, iReo-1] = exp(parameters['aMax_pp_const'] +
> parameters['aMax_pp_beta'] * qindex +
> parameters['pp_refi'] * refi +
> parameters['pp_sell_from_arr']*sell)
> #------------------------------decrease in arrears
> myval = exp(parameters['aMax_a-_const'] + parameters['aMax_a-_beta'] *
> qindex)
> for i in range(iReo-1): TMu[i, iReo - 1] = myval
> myval = exp(parameters['a_a-_const'] + parameters['a_a-_beta'] * qindex)
> for i in range(2, iReo - 1):
> for j in range(i):
> TMu[j, i] = myval / i
> return None

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Dag Sverre Seljebotn

unread,

Nov 5, 2012, 2:04:38 PM11/5/12

to cython...@googlegroups.com

On 11/05/2012 04:36 PM, stefandw wrote:
> Thanks for your replies!
>

> @sturlamolden: the pure numpy version you suggest is much slower than
> the cython version (about factor 3, so really pleased I tried cython).
> I'll have a look at the loop order, but I really need to sum columns,
> not rows; not sure if transposing it would take even more time.

>
> @dag sverre: yes, I suspected something like that might be the case. The
> problem I'm having is that I have to construct / multiply these small
> matrices millions of times, and I'm trying to pull out pieces of pure
> python and rewrite into cython, but I guess the object construction
> overhead is always going to kill me? Do you have any suggestions here?
>

Woah, all of these dict lookups will be a big deal. You must pull out
all the parameters to local variables at the top of the function. That
should be a big speedup.

After that, the next step if you need it is to use something like Intel
VML or ACML for faster exp-s than what libc provides.

DS

Sturla Molden

unread,

Nov 6, 2012, 8:07:22 AM11/6/12

to cython...@googlegroups.com

On 05.11.2012 16:36, stefandw wrote:

> faster than the memoryview syntax): (TM stands for transition matrix, by
> the way, it's a piece of code to do with markov chains)

My experience with certain Markov chains (Metropolis-Hastings, Gibbs
sampler) is that it is more efficient to run multiple chains in parallel
when using languages like Python/Cython or MATLAB. That is, run multiple
chains per thread, not one chain per thread or process. I am not taking
about optimization with multi-threading now. This reduces the Python
overhead for each thread by a factor of O(number of chains).

The reason: If you have a transition matrix e.g. of rank 8 x 8, you can
stack multiple transition matrices in an array of shape (n,8,8) e.g.
with n = 1024. (Ditto for other data you use.) Then acquiring the buffer
will happen only once for every n chain you simulate.

I.e. something like this (not debugged):

import numpy as np
from libc.string cimport memset
cimport cython

@cython.wraparound(False)
@cython.boundscheck(False)
cdef inline void _normalize(double[:,::1] TM, double[:,::1] out):
""" normalize a single transition matrix """
double tmp[32]
cdef int m,n,i,j
m,n = <int> TM.shape[0], <int> TM.shape[1]
memset(&tmp[0],0,n*sizeof(double))
for i in range(m):
for j in range(n):
tmp[j] += TM[i,j]
for i in range(m):
for j in range(n):
out[i,j] = TM[i,j] / tmp[j]

def normalize(double[:,:,::1] TMstack): #slow
""" normalize a stack of transition matrices """
cdef double[:,:,::1] out
cdef int chain, number_of_chains, m, n
num_chains = <int> TMstack.shape[0]
m, n = <int> TMstack.shape[1], <int> TMstack.shape[2]
out = np.empty((num_chains,m,n)) #slow
if n > 32:
raise ValueError, "Number of rows too large (max. 32)"
for chain in range(number_of_chains):
_normalize(TMstack[chain,:,:], out[chain,:,:], tmp) #fast
return out

The slicing and function call is not needed in the code. It is just
there to demonstrate why the new buffer syntax is better than the old
ndarray syntax. The cdef function _normalize is void of any Python
overhead, and almost as fast as plain C would be (i.e. performance
at least 80% of plain C).

Sturla

Reply all

Reply to author

Forward