Python array.array and Cython best practices

Yury V. Zaytsev

Jun 3, 2013, 10:44:58 AM
to cython...@googlegroups.com
Hi,

This thread has to do with another problem with my Python bindings for a
C++ project: the data transfer in the C++ -> Python direction, where on
the C++ level things are stored in vectors of int and double.

Again, I'd like to avoid a compile-time dependency on NumPy and be
compatible with Python 2.6+. A few questions:

1) Do I understand it correctly, that returning array.arrays is the best
idea, because memoryview objects are only available in Python 2.7+ ?

2) Can I on the level of Python hide the fact that I'm using array.array
if users have NumPy installed? Will numpy.asarray() produce a view on
the array without copying the data and big performance losses?

3) On the level of Cython, what's the best way to allocate and populate
large array.arrays?

In the Python docs, I see no way of specifying the size of the array
[*]. So I had a look at array.pxd and there I see something called
newarrayobject(), but it's unclear how to call it.

Or shall I use clone() instead? I don't need to zero the memory though,
because I will be writing there the values from the vector anyways.
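
On the pure-Python side (outside Cython), one common idiom for getting an array of a given length is to multiply a one-element array. A minimal sketch, assuming zero-filling is acceptable; the helper name `presized_array` is made up for illustration:

```python
from array import array

def presized_array(typecode, n):
    """Return an array of n zero-initialised items (hypothetical helper)."""
    # Multiplying a one-element array repeats its contents n times,
    # so the storage is allocated in one step instead of n appends.
    return array(typecode, [0]) * n

arr = presized_array('l', 5)
for i in range(len(arr)):
    arr[i] = i * i  # fill in place by index, no append() needed
```

This does zero the memory, so it is not a replacement for an uninitialised allocation at the C level, only a way to avoid repeated appends.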

Now, what's the best (fastest) method to copy the data?

arr = clone(...)
for i in range(N): arr[i] = vec[i]

or

arr = array(...)
for i in range(N): arr.append(vec[i])

Many thanks for your advice!

[*]: http://docs.python.org/2/library/array.html

--
Sincerely yours,
Yury V. Zaytsev


Nikita Nemkin

Jun 4, 2013, 3:37:56 AM
to cython...@googlegroups.com
On Mon, 03 Jun 2013 20:44:58 +0600, Yury V. Zaytsev <yu...@shurup.com>
wrote:

> Hi,
>
> This thread has to do with another problem with my Python bindings for a
> C++ project: the data transfer in the C++ -> Python direction, where on
> the C++ level things are stored in vectors of int and double.

> 1) Do I understand it correctly, that returning array.arrays is the best
> idea, because memoryview objects are only available in Python 2.7+ ?

memoryview is the successor to an older builtin named "buffer", which is
available in all Python 2 versions.
Neither buffer nor memoryview own the data they represent. You always
need an underlying object supporting buffer protocol. Such objects
include numpy.ndarray, array.array and ctypes arrays.

array.array is a good (but limited) alternative if you don't want
to depend on numpy.
Bear in mind that numpy can wrap any external chunk of memory without
any copying, while array.array always requires an initial copy
(unless you modify your algorithms to take preallocated array as input).
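
To illustrate the point that a view does not own its data, here is a small sketch using only the standard library on Python 3, with a ctypes array standing in for externally owned memory. The detour through the 'B' format is there because, as far as I know, memoryview cannot index the '<i' format that ctypes reports directly.

```python
import ctypes

# A ctypes array standing in for memory owned by someone else.
external = (ctypes.c_int * 4)(10, 20, 30, 40)

# Wrap it without copying; go through bytes to get a native 'i' format.
view = memoryview(external).cast('B').cast('i')

view[0] = 99            # write through the view ...
first = external[0]     # ... and the underlying object sees the change
```

No element is copied here: the view and the ctypes array refer to the same storage.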

Also, Python's array.array doesn't in fact support buffer interface.
But Cython (and apparently numpy) hack around it, so it's not a problem
in practice.

> 2) Can I on the level of Python hide the fact that I'm using array.array
> if users have NumPy installed? Will numpy.asarray() produce a view on
> the array without copying the data and big performance losses?

numpy.frombuffer will give you a view, numpy.asarray will copy.

> 3) On the level of Cython, what's the best way to allocate and populate
> large array.arrays?

Typical (optimized) array operations:

from cpython cimport array
import array

# Declare global template arrays, you only need one of each per project.
# (see array module docs for available type specifiers)
cdef INT_ARRAY = array.array('i')
cdef BYTE_ARRAY = array.array('B')
...

# Allocate an empty byte array
cdef array.array data = array.copy(BYTE_ARRAY)

# Allocate a new int array with n elements:
cdef array.array arr = array.clone(INT_ARRAY, n, False)
# ... and access its data
for i in range(n):
    arr.data.as_ints[i] = i

# Append some external data to the int array
# (names chosen so they don't clash with "data" above or the builtin len)
cdef int[3] values = [1, 2, 3]
cdef Py_ssize_t count = 3
array.extend_buffer(arr, <char*>values, count)

# Iterate over array values (works with any pointer, really):
cdef int value, sum = 0
for value in arr.data.as_ints[:len(arr)]:
    sum += value

# Accept an iterable from a user and turn it into a vector
def func(points not None): # points is any iterable
    cdef array.array pointsArray = array.array('f', points)
    cdef float* pointsData = pointsArray.data.as_floats
    # now I can pass pointsData to any C function...

Arrays in Cython support buffer access too, like

cdef array.array[int] arr = ...
arr[i] = 1 # optimized indexing

But using arr.data.as_ints[i] (and other .data.as_xxx) is
faster still, because it avoids buffer setup and teardown.

> Now, what's the best (fastest) method to copy the data?
>
> arr = clone(...)
> for i in range(N): arr[i] = vec[i]
>
> or
>
> arr = array(...)
> for i in range(N): arr.append(vec[i])

Fastest way is:

from libc.string cimport memcpy
from libcpp.vector cimport vector

cdef vector[int] vec = ...
cdef array.array arr = array.clone(INT_ARRAY, vec.size(), False)
memcpy(arr.data.as_ints, &vec[0], vec.size() * sizeof(int))
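
The same allocate-then-bulk-copy idea can be tried from plain Python with ctypes, which may help when experimenting without a compiler. This is only an analogy to the Cython code above: the ctypes array `source` is a stand-in for the C++ vector, and the destination is pre-sized by multiplication rather than with clone().

```python
import ctypes
from array import array

# Stand-in for the C++ vector<int>: a block of C ints.
source = (ctypes.c_int * 5)(1, 2, 3, 4, 5)
n = len(source)

# Allocate the destination with n items of the matching C type.
dest = array('i', [0]) * n
assert dest.itemsize == ctypes.sizeof(ctypes.c_int)

# buffer_info() returns (address, item count) of the array's storage.
address, count = dest.buffer_info()

# One bulk copy instead of n element-wise assignments.
ctypes.memmove(address, source, count * dest.itemsize)
```

The address from buffer_info() is only valid while the array exists and is not resized, so the copy should happen right after the allocation.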


Best regards,
Nikita Nemkin

Chris Barker - NOAA Federal

Jun 4, 2013, 11:28:16 AM
to cython...@googlegroups.com
On Tue, Jun 4, 2013 at 12:37 AM, Nikita Nemkin <nik...@nemkin.ru> wrote:
> numpy.frombuffer will give you a view, numpy.asarray will copy.

asarray() will not copy if the input is already a numpy array (that
fits the requested specification). That's the whole point of it. Not
sure what it does with other buffers but it's easy to test:

In [26]: array_dot_array = array.array('b', "abcdefg")

In [27]: numpy_array = np.asarray(array_dot_array, np.uint8)

In [28]: numpy_array[0] = 0

In [29]: array_dot_array
Out[29]: array('b', [97, 98, 99, 100, 101, 102, 103])

OK, so it does copy.

You might want:

In [32]: numpy_array = np.frombuffer(array_dot_array, np.uint8)

In [33]: numpy_array[0] = 0

In [34]: array_dot_array
Out[34]: array('b', [0, 98, 99, 100, 101, 102, 103])

but then you need to know that the input is a valid buffer.

It would be nice to have a Cython "asmemoryview" or something
similar to np.asarray, but without the numpy dependency. But that
would take a lot of re-writing code that's already in numpy...
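
On Python 3, where array.array exports the buffer protocol, the builtin memoryview already gives a numpy-free view in the simple one-dimensional case. A minimal sketch mirroring the frombuffer experiment above (the names `view`, `shared`, `fmt` and `itemsize` are mine):

```python
import array

array_dot_array = array.array('b', b"abcdefg")

# memoryview shares the array's storage instead of copying it.
view = memoryview(array_dot_array)
view[0] = 0

# The change made through the view is visible in the original array.
shared = array_dot_array[0] == 0

# The view also carries the element type and size as metadata.
fmt, itemsize = view.format, view.itemsize
```

This covers none of the dtype conversion or shape handling that np.asarray does, so it is a view and nothing more.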

-Chris

Yury V. Zaytsev

Jun 7, 2013, 3:23:21 AM
to cython...@googlegroups.com
Hi Nikita,

Thank you very much for your advice! I finally got it to work, yay!

Maybe you also have an opinion on how best to check for the buffer
interface, so that I can transfer data in the opposite direction too?

On Tue, 2013-06-04 at 13:37 +0600, Nikita Nemkin wrote:

> array.array is a good (but limited) alternative if you don't want to
> depend on numpy.

Yes, that's exactly the way I see it.

> Bear in mind that numpy can wrap any external chunk of memory without
> any copying, while array.array always requires an initial copy (unless
> you modify your algorithms to take preallocated array as input).

Very good to know!

> Also, Python's array.array doesn't in fact support buffer interface.
> But Cython (and apparently numpy) hack around it, so it's not a
> problem in practice.

Yes, so it seems.

I gathered that it only supports the new style buffer interface from
Python 3+, but Cython and NumPy use the old one for Python 2
automatically; the Python documentation isn't very clear on that.

> > 2) Can I on the level of Python hide the fact that I'm using array.array
> > if users have NumPy installed? Will numpy.asarray() produce a view on
> > the array without copying the data and big performance losses?
>
> numpy.frombuffer will give you a view, numpy.asarray will copy.

Exactly what I need!

> # Declare global template arrays, you only need one of each per project.
> # (see array module docs for available type specifiers)
> cdef INT_ARRAY = array.array('i')
> cdef BYTE_ARRAY = array.array('B')
> ...

I discovered that I should declare the template arrays in the same
function, or the clone call will crash with an arithmetic exception.
Originally, I tried to put them in the *.pxd file and it took quite some
time to figure out that this is the source of the problem...

> Fastest way is:
>
> cdef vector[int] vec = ...
> cdef array.array arr = array.clone(INT_ARRAY, vec.size(), False)
> memcpy(arr.data.as_ints, &vec[0], vec.size() * sizeof(int))

Brilliant!!! Here is what I ended up with:

cdef array.array arr
cdef ARRAY_LONG = array.array('l')
cdef vector[long]* vector_long_ptr = NULL

vector_long_ptr = deref_ivector(<IntVectorDatum*> dat)
arr = array.clone(ARRAY_LONG, vector_long_ptr.size(), False)
memcpy(arr.data.as_longs, &vector_long_ptr.front(), vector_long_ptr.size() * sizeof(long))

Somehow, &vector_long_ptr[0] was giving me a wrong address, but
&vector_long_ptr.front() worked nicely.

Stefan Behnel

Jun 7, 2013, 9:28:05 AM
to cython...@googlegroups.com
Yury V. Zaytsev, 07.06.2013 09:23:
> On Tue, 2013-06-04 at 13:37 +0600, Nikita Nemkin wrote:
>> Also, Python's array.array doesn't in fact support buffer interface.

It does, at least in Python 3.3.

>>> import array
>>> array.array("i", [1,2,3])
array('i', [1, 2, 3])
>>> memoryview(array.array("i", [1,2,3]))
<memory at 0x7f9176d6aef0>
>>> memoryview(array.array("i", [1,2,3]))[2]
3
>>> memoryview(array.array("i", [1,2,3]))[1]
2

The above doesn't really work in Py3.2, but I'm pretty sure that's due to
the memoryview type being incomplete, not array.array itself. There are
several known bugs and inconsistencies in the Python level implementation
of buffers (i.e. the builtin Python "memoryview" type) in Py3.[012]. That
doesn't have an impact on Cython code and its native memory views, though.


>> But Cython (and apparently numpy) hack around it, so it's not a
>> problem in practice.
>
> I gathered that it only supports the new style buffer interface from
> Python 3+, but Cython and NumPy use the old one for Python 2
> automatically; the Python documentation isn't very clear on that.

Cython never uses the old Python 2.x buffer interface (which was
essentially just a 1D byte buffer, without any metadata). Instead, Cython
emulates the new one in Python versions that do not support it, for types
that it knows at compile time (specifically, NumPy arrays and array.array).

Stefan

Yury V. Zaytsev

Jun 7, 2013, 11:42:01 AM
to cython...@googlegroups.com
On Fri, 2013-06-07 at 15:28 +0200, Stefan Behnel wrote:
> > I gathered that it only supports the new style buffer interface from
> > Python 3+, but Cython and NumPy use the old one for Python 2
> > automatically; the Python documentation isn't very clear on that.
>
> Cython never uses the old Python 2.x buffer interface (which was
> essentially just a 1D byte buffer, without any metadata). Instead,
> Cython emulates the new one in Python versions that do not support it,
> for types that it knows at compile time (specifically, NumPy arrays
> and array.array).

Yes, that's what I meant to say... Sorry for the confusion!

Yury V. Zaytsev

Jul 16, 2013, 3:26:05 PM
to cython...@googlegroups.com
On Tue, 2013-06-04 at 13:37 +0600, Nikita Nemkin wrote:

> numpy.frombuffer will give you a view, numpy.asarray will copy.

Hi Nikita,

One month later, I have got a question :-)

When I allocate the array and then return what numpy.frombuffer makes,
does it increase the reference count of the array?

I just want to be sure that the data behind the resulting object doesn't
suddenly disappear from under my feet:

def func_returning_a_buffer_like_thing():

    vector_long_ptr = deref_ivector(<IntVectorDatum*> dat)
    arr = array.clone(ARRAY_LONG, vector_long_ptr.size(), False)
    memcpy(arr.data.as_longs, ...)

    if HAVE_NUMPY:
        return numpy.frombuffer(arr, dtype=numpy.int_)
    else:
        return arr

Am I on the safe side here?
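
One way to probe this kind of lifetime question without NumPy is to look at how the builtin memoryview behaves, since it is a consumer of the same buffer protocol: the view holds a reference to the exporting array, so the array stays alive as long as the view does. A small sketch, assuming Python 3.3+ (for the `obj` attribute); whether numpy.frombuffer behaves identically is worth confirming separately, for instance by inspecting the result's `base` attribute.

```python
import gc
from array import array

def make_view():
    # The array is local; only the view is returned to the caller.
    arr = array('l', [1, 2, 3])
    return memoryview(arr)

view = make_view()
gc.collect()  # the local name 'arr' is gone; force a collection anyway

# The data is still readable because the view keeps its exporter alive.
values = view.tolist()
owner = view.obj  # the original array object
```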