刘振海, 11.04.2012 12:59:
> On 11 April 2012, at 5:35 PM, becker.nils <becke...@gmx.net> wrote:
>> i am trying to modify ndarrays by cdef methods in a cdef class but can't
>> seem to get it to work without python reference counting and slowness. i am
>> sticking with the buffer interface since memoryviews are not yet released
>> and well-tested.
>
> I tried to do the same thing a long time ago, but I could not get it
> working until I used memoryview slices.
> So I suggest you try memoryview slices. You can get the newest Cython
> from Mark Florisson's github:
> https://github.com/markflorisson88/cython/tree/release
Or use the release branch in the main repository; it should be the same code.
It's close to being released anyway, and the more testing it receives now,
the better. Unless *you* (our users) find bugs in it, it's unlikely to
change before the release.
Stefan
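
For readers new to either syntax, here is a minimal sketch of the two
declarations being compared; the function names are made up for illustration
and are not from this thread:

cimport numpy as np    # needed only for the old buffer syntax

# Old-style typed ndarray buffer argument:
def scale_buffer(np.ndarray[double, ndim=1] x, double f):
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        x[i] *= f

# The typed memoryview equivalent:
def scale_view(double[:] x, double f):
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        x[i] *= f

Both accept a NumPy array when called from Python; the difference lies in how
Cython acquires and indexes the underlying buffer.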
> 2. memoryviews are more general than the ndarray buffer access but they are
> not a drop-in replacement, because one cannot return a memoryview object as
> a full ndarray without explicit conversion with np.asarray. so to have both
> fast access from the C side and a handle on the array on the python side,
> requires two local variables: one an ndarray and one a memoryview into it.
> (previously the ndarray with buffer access did both of these things)
Yes, that's correct. That's because you can now slice the memoryviews,
which does not invoke anything on the original buffer object, so when
converting to an object it may be out of sync with the original, which
means you'd have to convert it explicitly.
We could allow the user to register a conversion function to do this
automatically - only invoked if the slice was re-sliced - (and cache
the results), but it would mean that conversion back from the object
to a memoryview slice would have to validate the buffer again, which
would be more expensive. Maybe that could be mitigated by
special-casing numpy arrays and some other tricks.
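
A sketch of the two-variable pattern described above, keeping an ndarray for
the Python side and a memoryview into it for fast C-side indexing (class and
attribute names are illustrative):

import numpy as np

cdef class Worker:
    cdef object out_arr      # the ndarray, handed back to Python code
    cdef double[:] out       # a memoryview into it, used in cdef methods

    def __init__(self, Py_ssize_t n):
        self.out_arr = np.zeros(n, dtype=np.float64)
        self.out = self.out_arr

    def result(self):
        # If self.out was re-sliced, convert explicitly instead:
        # return np.asarray(self.out)
        return self.out_arr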
> 3. one slow-down that i was not able to avoid is this:
>
> 143: for i in range(x.shape[0]):
>
> 144: self.out[i] *= dt * self.M[i]
>
>
> where all of x, self.out and self.M are memoryviews. in the for-loop,
> cython checks for un-initialized memoryviews like so (output from cython
> -a)
>
> if (unlikely(!__pyx_v_self->M.memview)) {
>     PyErr_SetString(PyExc_AttributeError, "Memoryview is not initialized");
>     {__pyx_filename = __pyx_f[0]; __pyx_lineno = 144; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
> }
> __pyx_t_4 = __pyx_v_i;
>
>
> is there a way to tell cython that these views are in fact initialized
> (that's done in __init__ of the class) ?
You can avoid this by annotating your function with
@cython.initializedcheck(False), or by using a module-global directive
at the top of your file '#cython: initializedcheck=False' (the same
goes for boundscheck and wraparound).
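
For concreteness, a sketch of both forms applied to a loop like the one above
(a free-standing function stands in for the original method):

cimport cython

@cython.initializedcheck(False)
@cython.boundscheck(False)
@cython.wraparound(False)
def scale(double[:] out, double[:] M, double dt):
    cdef Py_ssize_t i
    for i in range(out.shape[0]):
        out[i] *= dt * M[i]

# Module-wide alternative: put a directive comment at the top of the .pyx file:
#   #cython: initializedcheck=False, boundscheck=False, wraparound=False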
These things should be pulled out of loops whenever possible, just
like bounds checking and wrap-around code (if possible). Currently
that is very hard, as even a decref of an object could mean invocation
of a destructor which rebinds or deletes the memoryview (highly
unlikely but possible). To enable optimizations for 99.9% of the use
cases, I think we should be restrictive and allow only explicit
rebinding of memoryview slices in loops themselves, but not in called
functions or destructors. In other words, if a memoryview slice is not
rebound directly in the loop, the compiler is free to create a
temporary outside of the loop and use that everywhere in the loop.
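
You can already do by hand what that last paragraph describes: bind the
attribute to a local slice once, outside the loop, so the loop body indexes a
local rather than self.<attr>. A sketch (the class and attribute names are
illustrative; whether this removes the per-iteration check in a given Cython
version is best verified with cython -a):

cdef class Integrator:
    cdef double[:] out
    cdef double[:] M

    cdef void step(self, double[:] x, double dt):
        # hoist the attribute lookups out of the loop by hand
        cdef double[:] out = self.out
        cdef double[:] M = self.M
        cdef Py_ssize_t i
        for i in range(x.shape[0]):
            out[i] *= dt * M[i]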
> cheers, Nils
> 1. memoryview assignments inside inner loops are not a good idea. although
> no data is being copied, making a new slice involves quite some error
> checking overhead

Do you mean assignment to the data or to the slice itself?
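
To make that distinction concrete, a sketch (variable names are illustrative):

import numpy as np

def demo():
    cdef double[:] view = np.zeros(10)
    cdef double[:] other = np.ones(10)

    view[:] = other   # assignment to the data: copies elements into view's buffer
    view = other      # assignment to the slice itself: rebinds 'view' to the other
                      # buffer, which involves refcounting and validation, no data copy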
Definitely, that is pretty slow :)
Yes, it is best to minimize conversion to and from numpy (which is
quite expensive either way).
>> > 3. one slow-down that i was not able to avoid is this:
>> >
>> > 143: for i in range(x.shape[0]):
>> >
>> > 144: self.out[i] *= dt * self.M[i]
>> >
>> >
>> > where all of x, self.out and self.M are memoryviews. in the for-loop,
>> > cython checks for un-initialized memoryviews like so (output from
>> > cython
>> > -a)
>> >
>> > if (unlikely(!__pyx_v_self->M.memview)) {
>> >     PyErr_SetString(PyExc_AttributeError, "Memoryview is not initialized");
>> >     {__pyx_filename = __pyx_f[0]; __pyx_lineno = 144; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
>> > }
>> > __pyx_t_4 = __pyx_v_i;
>> >
>> >
>> > is there a way to tell cython that these views are in fact initialized
>> > (that's done in __init__ of the class) ?
>>
>> You can avoid this by annotating your function with
>> @cython.initializedcheck(False), or by using a module-global directive
>> at the top of your file '#cython: initializedcheck=False' (the same
>> goes for boundscheck and wraparound).
>
> ah! helpful! i did not see this on the annotation wiki page.
> (there is no official documentation on annotations it seems)
>
Indeed, this should be documented. Documentation for the other directives
can be found here:
http://docs.cython.org/src/reference/compilation.html?highlight=boundscheck#compiler-directives
I'll add documentation for this directive too. Currently it only works for
memoryviews, but maybe it should also work for objects?
>>
>> These things should be pulled out of loops whenever possible, just
>> like bounds checking and wrap-around code (if possible). Currently
>> that is very hard, as even a decref of an object could mean invocation
>> of a destructor which rebinds or deletes the memoryview (highly
>> unlikely but possible). To enable optimizations for 99.9% of the use
>> cases, I think we should be restrictive and allow only explicit
>> rebinding of memoryview slices in loops themselves, but not in called
>> functions or destructors. In other words, if a memoryview slice is not
>> rebound directly in the loop, the compiler is free to create a
>> temporary outside of the loop and use that everywhere in the loop.
>
>
> or a memoryview context manager which makes the memoryview non-rebindable?
> "with old_array as cdef float_t[:] new_view:
> loop ...
> "
> (just fantasizing, probably nonsense). anyway, thanks!
> nils
We could support final fields and variables, but it would be kind of a
pain to declare that everywhere.
Maybe final would not be too bad, as you'd only need it for globals
(who uses them anyway) and attributes, but not for local variables.
>>> > 2. memoryviews are more general than the ndarray buffer access but they
>>> > are
>>> > not a drop-in replacement, because one cannot return a memoryview object
>>> > as
>>> > a full ndarray without explicit conversion with np.asarray. so to have
>>> > both
>>> > fast access from the C side and a handle on the array on the python
>>> > side,
>>> > requires two local variables: one an ndarray and one a memoryview into
>>> > it.
>>> > (previously the ndarray with buffer access did both of these things)
>>>
>>> Yes, that's correct. That's because you can now slice the memoryviews,
>>> which does not invoke anything on the original buffer object, so when
>>> converting to an object it may be out of sync with the original, which
>>> means you'd have to convert it explicitly.
>>> We could allow the user to register a conversion function to do this
>>> automatically - only invoked if the slice was re-sliced - (and cache
>>> the results), but it would mean that conversion back from the object
>>> to a memoryview slice would have to validate the buffer again, which
>>> would be more expensive. Maybe that could be mitigated by
>>> special-casing numpy arrays and some other tricks.
>>
>> so for the time being, it seems that the most efficient way of handling this
>> is
>> that cdef functions or any fast C-side manipulation uses only memoryviews,
>> and allocation and communication with python then uses the underlying
>> ndarrays.
>>
>
> Yes, it is best to minimize conversion to and from numpy (which is
> quite expensive either way).
I'm not sure what you mean here; when you say C, do you mean Cython?
Using typed np.ndarray buffers would be just as slow from Python
space, and slower from Cython space than memoryviews. The only thing
that's more expensive (speaking roughly) is conversion from the
memoryview to an object.
> it seems that keeping ndarrays and memoryviews separate has the following
> disadvantage:
>
> cdef class Cls:
> cpdef method(self, np.ndarray[float, mode='c'] x):
> cdef int i
>
> for i in range(x.shape[0]):
> do_something_with(x[i])
>
> this used to work nicely on both C and python sides and could be overridden
> from python or C subclasses.
> with memoryviews, to get the same speed on the C side it seems separate
> method signatures are needed: one
> taking a memoryview (for the cdef method) and one taking an ndarray (for the
> def method). so the above
> pattern does not seem to have an equally short version using memoryviews.
>
> correct?
>
> N.
>
Well, you could just type the return value from the cpdef method, i.e.
'cpdef float[:] method(...):'. That should be about as fast as doing
the same thing for buffers.
Faster, actually, since it doesn't need to reacquire the buffer for
every call. Somehow, for larger arrays, if I unpack the memoryview's
stride in a local variable in the generated C code, it runs about 3
times as fast as without; that should be investigated...
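
A sketch of the cpdef suggestion above, so a single method stays fast from
Cython and remains callable (and overridable) from Python; class and method
names are illustrative:

import numpy as np

cdef class Model:
    cpdef double[:] scale(self, double[:] x, double dt):
        cdef Py_ssize_t i
        for i in range(x.shape[0]):
            x[i] *= dt
        return x

# From Python, pass an ndarray in and wrap the returned view if needed:
#   m = Model()
#   out = np.asarray(m.scale(np.ones(5), 0.1))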
Are all the pointers/structures leading to the unpacked stride __restrict?
You'd expect this in non-C99/non-restrict code, since every write or every
function call could potentially change the stride...
Dag
No, nothing is declared restrict, and I don't think it can be. When I
unpack the data pointer it doesn't make a difference, although
technically the pointer could be overwritten through an access to
itself without strict aliasing. But the stride makes a huge
difference; without unpacking it, this runs slower than the buffers, which
unpack the buffer 2 times per outer iteration (the inner loop in the
method does only 1000 iterations). I'll look at the assembly.
> Faster, actually, since it doesn't need to reacquire the buffer for
> every call. Somehow, for larger arrays, if I unpack the memoryview's
> stride in a local variable in the generated C code, it runs about 3
> times as fast as without, that should be investigated...
A local variable can be stored in a register, which is often more
difficult with a structure member.
Sturla
I think the question that's being asked is why the compiler doesn't do
it automatically. (And the answer is that access through other pointers
could be possible.)
Mark: To me the question is why things don't go slower with the data
pointer as well; however, I think an answer might be in how the CPU
reorders the loads. I imagine it doing something like:
load strides
load dataptr
compute offset given strides and indices
do lookup
so that the dataptr load is masked by computation, but you get a stall
in "do lookup" for offset computation to complete, which stalls on
loading indices.
If a cast to __restrict doesn't do it I think we should just unpack to
locals. We *know* something the compiler can't know here, which is that
strides and shape are constant.
Dag
Sorry; Mark, you *do* unpack currently, right, and your comment on "that
should be investigated" is just that you were curious why you had to?
Dag
It should be irrelevant that they both look it up in a struct; that's
just syntax candy after all. The question is whether, in both cases, it
a) is on the stack, and b) has never had its address taken.
Dag
On 04/20/2012 11:17 AM, mark florisson wrote:
Could you just pass it by value?
>> I think to avoid these situations it's not unreasonable to just not rely
>> on C compilers doing the right thing, and just use temporary pointers which
>> we add to.
>
> That is, add with an unpacked stride. Funnily enough unpacking the
> shape is hardly relevant at all for the bounds checking.
Probably because the branch is unlikely(), so the CPU just continues on
with speculative computation even while it is waiting for the shape to
load into registers and be compared; you never get hit by a stall, and
the overhead is just the extra load-to-register instruction.
Something like:
Dispatch shape load
Do computation and store value in register
Do bounds check (latency of loading shape hidden by computation)
Store value to memory
But you can't really start to do the FLOP until you know the offset, so
in that case you stall.
Dag
I believe (strongly) this should be fixed by not taking the address, and
leaving the optimization to the C compiler.
The Cython philosophy, at least so far, has been to use C compilers for
what C compilers are good at. Once you start going down that rabbit hole
and do C-level optimizations "just in case" it'll never stop. Let people
upgrade their gcc if they care.
Keep in mind that not everything is sequential access, either; strides must
be in registers for random access as well.
Dag
Yeah, I think you're right: if we don't take the address anywhere, any
worthy C compiler will optimize it. I used pointers in some cases
because at the time the memoryviews had 32 dimensions and the copy
overhead was pretty large. This can be reworked at some point; it
wouldn't be too hard.