刘振海, 11.04.2012 12:59:
> On 11 April 2012, at 5:35 PM, becker.nils <becke...@gmx.net> wrote:
>> i am trying to modify ndarrays by cdef methods in a cdef class but can't
>> seem to get it to work without python reference counting and slowness. i am
>> sticking with the buffer interface since memoryviews are not yet released
>> and well-tested.
>
> I tried to do the same thing a long time ago, but I could not get it
> working until I used memoryview slices.
> So I suggest you try memoryview slices. You can get the newest Cython
> from Mark Florisson's github:
> https://github.com/markflorisson88/cython/tree/release
Or use the release branch in the main repository; it should be the same code.
It's close to being released anyway, and the more testing it receives now,
the better. Unless *you* (our users) find bugs in it, it's unlikely to
change before the release.
Stefan
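
For readers new to either syntax, here is a minimal sketch of the two
declarations being compared; the function names are made up for illustration
and are not from this thread:

cimport numpy as np    # needed only for the old buffer syntax

# Old-style typed ndarray buffer argument:
def scale_buffer(np.ndarray[double, ndim=1] x, double f):
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        x[i] *= f

# The typed memoryview equivalent:
def scale_view(double[:] x, double f):
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        x[i] *= f

Both accept a NumPy array when called from Python; the difference lies in how
Cython acquires and indexes the underlying buffer.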
> 2. memoryviews are more general than the ndarray buffer access but they are
> not a drop-in replacement, because one cannot return a memoryview object as
> a full ndarray without explicit conversion with np.asarray. so to have both
> fast access from the C side and a handle on the array on the python side,
> requires two local variables: one an ndarray and one a memoryview into it.
> (previously the ndarray with buffer access did both of these things)
Yes, that's correct. That's because you can now slice the memoryviews,
which does not invoke anything on the original buffer object, so when
converting to an object it may be out of sync with the original, which
means you'd have to convert it explicitly.
We could allow the user to register a conversion function to do this
automatically - only invoked if the slice was re-sliced - (and cache
the results), but it would mean that conversion back from the object
to a memoryview slice would have to validate the buffer again, which
would be more expensive. Maybe that could be mitigated by
special-casing numpy arrays and some other tricks.
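
A sketch of the two-variable pattern described above, keeping an ndarray for
the Python side and a memoryview into it for fast C-side indexing (class and
attribute names are illustrative):

import numpy as np

cdef class Worker:
    cdef object out_arr      # the ndarray, handed back to Python code
    cdef double[:] out       # a memoryview into it, used in cdef methods

    def __init__(self, Py_ssize_t n):
        self.out_arr = np.zeros(n, dtype=np.float64)
        self.out = self.out_arr

    def result(self):
        # If self.out was re-sliced, convert explicitly instead:
        # return np.asarray(self.out)
        return self.out_arr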
> 3. one slow-down that i was not able to avoid is this:
>
> 143: for i in range(x.shape[0]):
>
> 144: self.out[i] *= dt * self.M[i]
>
>
> where all of x, self.out and self.M are memoryviews. in the for-loop,
> cython checks for un-initialized memoryviews like so (output from cython
> -a)
>
> if (unlikely(!__pyx_v_self->M.memview)) {
>     PyErr_SetString(PyExc_AttributeError, "Memoryview is not initialized");
>     {__pyx_filename = __pyx_f[0]; __pyx_lineno = 144; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
> }
> __pyx_t_4 = __pyx_v_i;
>
>
> is there a way to tell cython that these views are in fact initialized
> (that's done in __init__ of the class) ?
You can avoid this by annotating your function with
@cython.initializedcheck(False), or by using a module-global directive
at the top of your file '#cython: initializedcheck=False' (the same
goes for boundscheck and wraparound).
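
For concreteness, a sketch of both forms applied to a loop like the one above
(a free-standing function stands in for the original method):

cimport cython

@cython.initializedcheck(False)
@cython.boundscheck(False)
@cython.wraparound(False)
def scale(double[:] out, double[:] M, double dt):
    cdef Py_ssize_t i
    for i in range(out.shape[0]):
        out[i] *= dt * M[i]

# Module-wide alternative: put a directive comment at the top of the .pyx file:
#   #cython: initializedcheck=False, boundscheck=False, wraparound=False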
These things should be pulled out of loops whenever possible, just
like bounds checking and wrap-around code (if possible). Currently
that is very hard, as even a decref of an object could mean invocation
of a destructor which rebinds or deletes the memoryview (highly
unlikely but possible). To enable optimizations for 99.9% of the use
cases, I think we should be restrictive and allow only explicit
rebinding of memoryview slices in loops themselves, but not in called
functions or destructors. In other words, if a memoryview slice is not
rebound directly in the loop, the compiler is free to create a
temporary outside of the loop and use that everywhere in the loop.
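
You can already do by hand what that last paragraph describes: bind the
attribute to a local slice once, outside the loop, so the loop body indexes a
local rather than self.<attr>. A sketch (the class and attribute names are
illustrative; whether this removes the per-iteration check in a given Cython
version is best verified with cython -a):

cdef class Integrator:
    cdef double[:] out
    cdef double[:] M

    cdef void step(self, double[:] x, double dt):
        # hoist the attribute lookups out of the loop by hand
        cdef double[:] out = self.out
        cdef double[:] M = self.M
        cdef Py_ssize_t i
        for i in range(x.shape[0]):
            out[i] *= dt * M[i]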
> cheers, Nils
> 1. memoryview assignments inside inner loops are not a good idea. although
> no data is being copied, making a new slice involves quite some error
> checking overhead

Do you mean assignment to the data or to the slice itself?
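
To make that distinction concrete, a sketch (variable names are illustrative):

import numpy as np

def demo():
    cdef double[:] view = np.zeros(10)
    cdef double[:] other = np.ones(10)

    view[:] = other   # assignment to the data: copies elements into view's buffer
    view = other      # assignment to the slice itself: rebinds 'view' to the other
                      # buffer, which involves refcounting and validation, no data copy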
Definitely, that is pretty slow :)
Yes, it is best to minimize conversion to and from numpy (which is
quite expensive either way).
>> > 3. one slow-down that i was not able to avoid is this:
>> >
>> > 143: for i in range(x.shape[0]):
>> >
>> > 144: self.out[i] *= dt * self.M[i]
>> >
>> >
>> > where all of x, self.out and self.M are memoryviews. in the for-loop,
>> > cython checks for un-initialized memoryviews like so (output from
>> > cython
>> > -a)
>> >
>> > if (unlikely(!__pyx_v_self->M.memview)) {
>> >     PyErr_SetString(PyExc_AttributeError, "Memoryview is not initialized");
>> >     {__pyx_filename = __pyx_f[0]; __pyx_lineno = 144; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
>> > }
>> > __pyx_t_4 = __pyx_v_i;
>> >
>> >
>> > is there a way to tell cython that these views are in fact initialized
>> > (that's done in __init__ of the class) ?
>>
>> You can avoid this by annotating your function with
>> @cython.initializedcheck(False), or by using a module-global directive
>> at the top of your file '#cython: initializedcheck=False' (the same
>> goes for boundscheck and wraparound).
>
> ah! helpful! i did not see this on the annotation wiki page.
> (there is no official documentation on annotations it seems)
>
Indeed, this should be documented. Documentation for the other directives
can be found here:
http://docs.cython.org/src/reference/compilation.html?highlight=boundscheck#compiler-directives
I'll add documentation for this directive too. Currently it only works for
memoryviews, but maybe it should also work for objects?
>>
>> These things should be pulled out of loops whenever possible, just
>> like bounds checking and wrap-around code (if possible). Currently
>> that is very hard, as even a decref of an object could mean invocation
>> of a destructor which rebinds or deletes the memoryview (highly
>> unlikely but possible). To enable optimizations for 99.9% of the use
>> cases, I think we should be restrictive and allow only explicit
>> rebinding of memoryview slices in loops themselves, but not in called
>> functions or destructors. In other words, if a memoryview slice is not
>> rebound directly in the loop, the compiler is free to create a
>> temporary outside of the loop and use that everywhere in the loop.
>
>
> or a memoryview context manager which makes the memoryview non-rebindable?
> "with old_array as cdef float_t[:] new_view:
> loop ...
> "
> (just fantasizing, probably nonsense). anyway, thanks!
> nils
We could support final fields and variables, but it would be kind of a
pain to declare that everywhere.
Maybe final would not be too bad, as you'd only need it for globals
(who uses them anyway) and attributes, but not for local variables.
>>> > 2. memoryviews are more general than the ndarray buffer access but they
>>> > are
>>> > not a drop-in replacement, because one cannot return a memoryview object
>>> > as
>>> > a full ndarray without explicit conversion with np.asarray. so to have
>>> > both
>>> > fast access from the C side and a handle on the array on the python
>>> > side,
>>> > requires two local variables: one an ndarray and one a memoryview into
>>> > it.
>>> > (previously the ndarray with buffer access did both of these things)
>>>
>>> Yes, that's correct. That's because you can now slice the memoryviews,
>>> which does not invoke anything on the original buffer object, so when
>>> converting to an object it may be out of sync with the original, which
>>> means you'd have to convert it explicitly.
>>> We could allow the user to register a conversion function to do this
>>> automatically - only invoked if the slice was re-sliced - (and cache
>>> the results), but it would mean that conversion back from the object
>>> to a memoryview slice would have to validate the buffer again, which
>>> would be more expensive. Maybe that could be mitigated by
>>> special-casing numpy arrays and some other tricks.
>>
>> so for the time being, it seems that the most efficient way of handling this
>> is
>> that cdef functions or any fast C-side manipulation uses only memoryviews,
>> and allocation and communication with python then uses the underlying
>> ndarrays.
>>
>
> Yes, it is best to minimize conversion to and from numpy (which is
> quite expensive either way).
I'm not sure what you mean here; when you say C, do you mean Cython?
Using typed np.ndarray buffers would be just as slow from Python
space, and slower from Cython space than memoryviews. The only thing
that's more expensive (speaking roughly) is conversion from the
memoryview to an object.
> it seems that keeping ndarrays and memoryviews separate has the following
> disadvantage:
>
> cdef class Cls:
> cpdef method(self, np.ndarray[float, mode='c'] x):
> cdef int i
>
> for i in range(x.shape[0]):
> do_something_with(x[i])
>
> this used to work nicely on both C and python sides and could be overridden
> from python or C subclasses.
> with memoryviews, to get the same speed on the C side it seems separate
> method signatures are needed: one
> taking a memoryview (for the cdef method) and one taking an ndarray (for the
> def method). so the above
> pattern does not seem to have an equally short version using memoryviews.
>
> correct?
>
> N.
>
Well, you could just type the return value from the cpdef method, i.e.
'cpdef float[:] method(...):'. That should be about as fast as doing
the same thing for buffers.
Faster, actually, since it doesn't need to reacquire the buffer for
every call. Somehow, for larger arrays, if I unpack the memoryview's
stride in a local variable in the generated C code, it runs about 3
times as fast as without; that should be investigated...
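
A sketch of the cpdef suggestion above, so a single method stays fast from
Cython and remains callable (and overridable) from Python; class and method
names are illustrative:

import numpy as np

cdef class Model:
    cpdef double[:] scale(self, double[:] x, double dt):
        cdef Py_ssize_t i
        for i in range(x.shape[0]):
            x[i] *= dt
        return x

# From Python, pass an ndarray in and wrap the returned view if needed:
#   m = Model()
#   out = np.asarray(m.scale(np.ones(5), 0.1))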
Are all the pointers/structures leading to the unpacked stride __restrict?
You'd expect this in non-C99/non-restrict code, since every write or every
function call could potentially change the stride...
Dag
No, nothing is declared restrict, and I don't think it can be. When I
unpack the data pointer it doesn't make a difference, although
technically the pointer could be overwritten through an access to
itself without strict aliasing. But the stride makes a huge
difference; without unpacking it, this runs slower than the buffers, which
unpack the buffer 2 times per outer iteration (the inner loop in the
method does only 1000 iterations). I'll look at the assembly.
> Faster, actually, since it doesn't need to reacquire the buffer for
> every call. Somehow, for larger arrays, if I unpack the memoryview's
> stride in a local variable in the generated C code, it runs about 3
> times as fast as without, that should be investigated...
A local variable can be stored in a register, which is often more
difficult with a structure member.
Sturla
I think the question that's being asked is why the compiler doesn't do
it automatically. (And the answer is that access through other pointers
could be possible.)
Mark: To me the question is why things don't go slower with the data
pointer as well; however, I think an answer might be in how the CPU
reorders the loads. I imagine it doing something like:
load strides
load dataptr
compute offset given strides and indices
do lookup
so that the dataptr load is masked by computation, but you get a stall
in "do lookup" for offset computation to complete, which stalls on
loading indices.
If a cast to __restrict doesn't do it I think we should just unpack to
locals. We *know* something the compiler can't know here, which is that
strides and shape are constant.
Dag
Sorry; Mark, you *do* unpack currently, right, and your comment on "that
should be investigated" is just that you were curious why you had to?
Dag
It should be irrelevant that they both look it up in a struct; that's
just syntax candy after all. The question is whether, in both cases, it
a) is on the stack, and b) has never had its address taken.
Dag
On 04/20/2012 11:17 AM, mark florisson wrote:
Could you just pass it by value?
>> I think to avoid these situations it's not unreasonable to just not rely
>> on C compilers doing the right thing, and just use temporary pointers which
>> we add to.
>
> That is, add with an unpacked stride. Funnily enough unpacking the
> shape is hardly relevant at all for the bounds checking.
Probably because the branch is unlikely(), so the CPU just continues on
with speculative computation even while it is waiting for the shape to
load into registers and be compared; you never get hit by a stall, and
the overhead is just the extra load-to-register instruction.
Something like:
Dispatch shape load
Do computation and store value in register
Do bounds check (latency of loading shape hidden by computation)
Store value to memory
But you can't really start to do the FLOP until you know the offset, so
in that case you stall.
Dag
I believe (strongly) this should be fixed by not taking the address, and
leaving the optimization to the C compiler.
The Cython philosophy, at least so far, has been to use C compilers for
what C compilers are good at. Once you start going down that rabbit hole
and do C-level optimizations "just in case" it'll never stop. Let people
upgrade their gcc if they care.
Keep in mind that not everything is sequential access, either; strides must
be in registers for random access as well.
Dag
Yeah, I think you're right: if we don't take the address anywhere, any
worthy C compiler will optimize it. I used pointers in some cases
because at the time the memoryviews had 32 dimensions and the copy
overhead was pretty large. This can be reworked at some point; it
wouldn't be too hard.