[Numpy-discussion] size_t or npy_intp?


Francesc Alted

Jul 27, 2010, 9:08:41 AM
to Discussion of Numerical Python
Hi,

I'm a bit confused about which datatype I should use when referring to NumPy
ndarray lengths. On one hand, I'd use `size_t`, which is the canonical way to
refer to lengths of memory blocks. On the other hand, `npy_intp` seems to be
the standard data type used in NumPy for this.

Which one would you recommend to use in NumPy extensions?

--
Francesc Alted
_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Charles R Harris

Jul 27, 2010, 9:20:47 AM
to Discussion of Numerical Python
On Tue, Jul 27, 2010 at 7:08 AM, Francesc Alted <fal...@pytables.org> wrote:
Hi,

I'm a bit confused about which datatype I should use when referring to NumPy
ndarray lengths.  On one hand, I'd use `size_t`, which is the canonical way to
refer to lengths of memory blocks.  On the other hand, `npy_intp` seems to be
the standard data type used in NumPy for this.


They have different ranges: npy_intp is signed and, in later versions of Python, is the same as Py_ssize_t, while size_t is unsigned. It would be a bad idea to mix the two.

Chuck

Francesc Alted

Jul 27, 2010, 10:45:32 AM
to Discussion of Numerical Python
On Tuesday 27 July 2010 at 15:20:47, Charles R Harris wrote:

Agreed that mixing the two is a bad idea. So I suppose you are suggesting
using `npy_intp`. But then, I'd say that `size_t`, being unsigned, is a
better fit for describing a memory length.

Mmh, I'll stick with `size_t` for the time being (unless anyone else can
convince me that this is really a big mistake ;-)

Dag Sverre Seljebotn

Jul 27, 2010, 11:10:05 AM
to Discussion of Numerical Python
Francesc Alted wrote:
> On Tuesday 27 July 2010 at 15:20:47, Charles R Harris wrote:
>
>> On Tue, Jul 27, 2010 at 7:08 AM, Francesc Alted <fal...@pytables.org> wrote:
>>
>>> Hi,
>>>
>>> I'm a bit confused about which datatype I should use when referring to
>>> NumPy ndarray lengths. On one hand, I'd use `size_t`, which is the
>>> canonical way to refer to lengths of memory blocks. On the other hand,
>>> `npy_intp` seems to be the standard data type used in NumPy for this.
>>>
>> They have different ranges, npy_intp is signed and in later versions of
>> Python is the same as Py_ssize_t, while size_t is unsigned. It would be a
>> bad idea to mix the two.
>>
>
> Agreed that mixing the two is a bad idea. So I suppose you are suggesting
> using `npy_intp`. But then, I'd say that `size_t`, being unsigned, is a
> better fit for describing a memory length.
>
> Mmh, I'll stick with `size_t` for the time being (unless anyone else can
> convince me that this is really a big mistake ;-)
>
Well, Python has reasons for using Py_ssize_t (= ssize_t where
available) internally for everything that has to do with indexing. (E.g.
it wants to use the same type for the strides, which can be negative.)

You just can't pass indices that don't fit in ssize_t to any Python API.
You're free to use size_t in your own code, but if you actually use the
extra bit, the moment it hits Python you'll overflow and get garbage... so
you need to check every time you hit any Python layer, rather than only in
the input to your code.

Your choice though.

Dag Sverre

Kurt Smith

Jul 27, 2010, 11:11:46 AM
to Discussion of Numerical Python
On Tue, Jul 27, 2010 at 9:45 AM, Francesc Alted <fal...@pytables.org> wrote:
> On Tuesday 27 July 2010 at 15:20:47, Charles R Harris wrote:
>> On Tue, Jul 27, 2010 at 7:08 AM, Francesc Alted <fal...@pytables.org> wrote:
>> > Hi,
>> >
>> > I'm a bit confused about which datatype I should use when referring to
>> > NumPy ndarray lengths.  On one hand, I'd use `size_t`, which is the
>> > canonical way to refer to lengths of memory blocks.  On the other hand,
>> > `npy_intp` seems to be the standard data type used in NumPy for this.
>>
>> They have different ranges, npy_intp is signed and in later versions of
>> Python is the same as Py_ssize_t, while size_t is unsigned. It would be a
>> bad idea to mix the two.
>
> Agreed that mixing the two is a bad idea.  So I suppose you are suggesting
> using `npy_intp`.  But then, I'd say that `size_t`, being unsigned, is a
> better fit for describing a memory length.
>
> Mmh, I'll stick with `size_t` for the time being (unless anyone else can
> convince me that this is really a big mistake ;-)

This would be good to clear up; I've been confused about the issue myself
for my project. The PyArrayObject struct is defined using `npy_intp`s:

typedef struct PyArrayObject {
    PyObject_HEAD
    char *data;             /* pointer to raw data buffer */
    int nd;                 /* number of dimensions, also called ndim */
    npy_intp *dimensions;   /* size in each dimension */
    npy_intp *strides;      /* bytes to jump to get to the next
                               element in each dimension */
    PyObject *base;         /* This object should be decref'd upon
                               deletion of array.
                               For views it points to the original array.
                               For creation from buffer object it points
                               to an object that should be decref'd on
                               deletion.
                               For UPDATEIFCOPY flag this is an array
                               to-be-updated upon deletion of this one */
    PyArray_Descr *descr;   /* Pointer to type structure */
    int flags;              /* Flags describing array -- see below */
    PyObject *weakreflist;  /* For weakreferences */
} PyArrayObject;

(numpy 1.4.1, numpy/core/include/numpy/ndarrayobject.h)

And because of that, Cython's numpy functionality uses `npy_intp`
everywhere. Perhaps this is required for backwards compat. in numpy,
but in an ideal world, should those be `npy_uintp`s?

Looking at the bufferinfo struct for the buffer protocol, it uses `Py_ssize_t`:

typedef struct bufferinfo {
    void *buf;
    Py_ssize_t len;
    int readonly;
    const char *format;
    int ndim;
    Py_ssize_t *shape;
    Py_ssize_t *strides;
    Py_ssize_t *suboffsets;
    Py_ssize_t itemsize;
    void *internal;
} Py_buffer;

So everyone is using signed values where it would make more sense (to
me at least) to use unsigned. Any reason for this?

I'm using `npy_intp` since Cython does it that way :-)

Kurt

Dag Sverre Seljebotn

Jul 27, 2010, 11:17:55 AM
to Discussion of Numerical Python
Kurt Smith wrote:
>
> Looking at the bufferinfo struct for the buffer protocol, it uses `Py_ssize_t`:
>
> typedef struct bufferinfo {
>     void *buf;
>     Py_ssize_t len;
>     int readonly;
>     const char *format;
>     int ndim;
>     Py_ssize_t *shape;
>     Py_ssize_t *strides;
>     Py_ssize_t *suboffsets;
>     Py_ssize_t itemsize;
>     void *internal;
> } Py_buffer;
>
> So everyone is using signed values where it would make more sense (to
> me at least) to use unsigned. Any reason for this?
>
> I'm using `npy_intp` since Cython does it that way :-)
>
And Cython (and NumPy, I expect) does it that way because Python does it
that way. And that really can't be changed.

The reasons are mostly historical/for convenience. And once 64-bit is
more widespread, do we really care about the one bit?

From PEP 353:


Why not size_t? <http://www.python.org/dev/peps/pep-0353/#id9>

An initial attempt to implement this feature tried to use size_t. It
quickly turned out that this cannot work: Python uses negative indices
in many places (to indicate counting from the end). Even in places where
size_t would be usable, too many reformulations of code were necessary,
e.g. in loops like:

for(index = length-1; index >= 0; index--)

This loop will never terminate if index is changed from int to size_t.


Dag Sverre

Francesc Alted

Jul 27, 2010, 11:24:48 AM
to Discussion of Numerical Python
On Tuesday 27 July 2010 at 17:17:55, Dag Sverre Seljebotn wrote:

> Kurt Smith wrote:
> > Looking at the bufferinfo struct for the buffer protocol, it uses
> > `Py_ssize_t`:
> >
> > typedef struct bufferinfo {
> >     void *buf;
> >     Py_ssize_t len;
> >     int readonly;
> >     const char *format;
> >     int ndim;
> >     Py_ssize_t *shape;
> >     Py_ssize_t *strides;
> >     Py_ssize_t *suboffsets;
> >     Py_ssize_t itemsize;
> >     void *internal;
> > } Py_buffer;
> >
> > So everyone is using signed values where it would make more sense (to
> > me at least) to use unsigned. Any reason for this?

My reason was just to be consistent with the `malloc(size_t size)` signature
(and the C world seems to use `size_t` widely for sizes).

> >
> > I'm using `npy_intp` since Cython does it that way :-)
>
> And Cython (and NumPy, I expect) does it that way because Python does it
> that way. And that really can't be changed.
>
> The reasons are mostly historical/for convenience. And once 64-bit is
> more widespread, do we really care about the one bit?
>
> From PEP 353:
>
>
> Why not size_t? <http://www.python.org/dev/peps/pep-0353/#id9>
>
> An initial attempt to implement this feature tried to use size_t. It
> quickly turned out that this cannot work: Python uses negative indices
> in many places (to indicate counting from the end). Even in places where
> size_t would be usable, too many reformulations of code were necessary,
> e.g. in loops like:
>
> for(index = length-1; index >= 0; index--)
>
> This loop will never terminate if index is changed from int to size_t.

OK, I'm not going to break Python/NumPy conventions, so you've convinced me:
I'll use `npy_intp` then.

Thanks!

--
Francesc Alted

Kurt Smith

Jul 27, 2010, 11:27:21 AM
to Discussion of Numerical Python
On Tue, Jul 27, 2010 at 10:17 AM, Dag Sverre Seljebotn
<da...@student.matnat.uio.no> wrote:
>  From PEP 353:
>
>
>    Why not size_t? <http://www.python.org/dev/peps/pep-0353/#id9>
>
> An initial attempt to implement this feature tried to use size_t. It
> quickly turned out that this cannot work: Python uses negative indices
> in many places (to indicate counting from the end). Even in places where
> size_t would be usable, too many reformulations of code were necessary,
> e.g. in loops like:
>
> for(index = length-1; index >= 0; index--)
>
> This loop will never terminate if index is changed from int to size_t.

Of course. Makes sense; thanks for the clarification.

Kurt

David Cournapeau

Jul 27, 2010, 11:30:34 AM
to Discussion of Numerical Python
On Tue, Jul 27, 2010 at 10:08 PM, Francesc Alted <fal...@pytables.org> wrote:
> Hi,
>
> I'm a bit confused about which datatype I should use when referring to
> NumPy ndarray lengths.  On one hand, I'd use `size_t`, which is the
> canonical way to refer to lengths of memory blocks.  On the other hand,
> `npy_intp` seems to be the standard data type used in NumPy for this.

npy_intp is the one to use ATM. I agree it is confusing (because
intp_t and ssize_t are for different use cases); adding an npy_ssize_t
and fixing the API accordingly is on my TODO list, but that's pretty
low-priority :)

David
