Passing structured numpy array with strings to a cython function


Joshua

Jan 28, 2014, 3:50:13 PM
to cython...@googlegroups.com

I am attempting to create a function in cython that accepts a numpy structured array or record array by defining a cython struct type. Suppose I have the data:

a = np.recarray(3, dtype=[('a', np.float32),  ('b', np.int32), ('c', '|S5'), ('d', '|S3')])
a[0] = (1.1, 1, 'this\0', 'to\0')
a[1] = (2.1, 2, 'that\0', 'ta\0')
a[2] = (3.1, 3, 'dogs\0', 'ot\0')

and then the cython code:

import numpy as np
cimport numpy as np

cdef packed struct tstruct:
    np.float32_t a
    np.int32_t b
    char[5] c
    char[3] d

def test_struct(tstruct[:] x):
    cdef:
        int k
        tstruct y

    for k in xrange(3):
        y = x[k]
        print y.a, y.b, y.c, y.d

When I try to run test_struct(a), I get the error:


ValueError: Expected a dimension of size 5, got 8

If in the array and corresponding struct I reorder the fields such that the string fields are not adjacent to each other, then the function works as expected. Null-terminating the strings as I show above does not seem to make a difference.

I was wondering if there was some simple string handling issue that I’m missing? In the actual code, I don’t typically even need to manipulate the string fields. I just need them read in properly.

Any suggestions would be appreciated.

Thanks,
Josh

Joshua

Jan 30, 2014, 12:22:21 PM
to cython...@googlegroups.com
Does anyone have any insight into this issue? I posted a similar question on Stack Overflow (http://stackoverflow.com/questions/21435378/passing-a-structured-numpy-array-with-strings-to-a-cython-function) and there was general interest in the question, but no solutions were suggested. I'm curious whether this is a bug that should be reported, a case of unsupported types, or whether I'm handling the types incorrectly.

Josh
  

Robert Bradshaw

Feb 5, 2014, 2:09:56 PM
to cython...@googlegroups.com
Does it work without the "packed"?

Joshua Adelman

Feb 5, 2014, 2:22:17 PM
to cython...@googlegroups.com
Hi Robert,

It does not work if I remove the "packed" keyword. Additionally, without the packed keyword, if I change the int32_t to an int16_t, for example, and separate the two strings by making the field order (a,c,b,d), I get a buffer dtype error. With the packed keyword, everything works as expected when re-ordering the fields. I expected this, given my understanding that numpy uses non-aligned offsets for structured/recarrays.
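
As a rough illustration of the packed versus aligned distinction, here is a minimal sketch using Python's standard `struct` module as a stand-in. It is not Cython or NumPy code, and the format strings are assumptions chosen to mirror the (a,c,b,d) ordering with an int16 field described above. The `=` prefix lays fields out back to back with no padding, while `@` applies the platform's native alignment.

```python
import struct

# Field order (a, c, b, d) with b narrowed to a 16-bit integer.
PACKED = "=f5sh3s"   # '=': standard sizes, no padding between fields
ALIGNED = "@f5sh3s"  # '@': native sizes and native alignment (platform dependent)

packed_size = struct.calcsize(PACKED)    # 4 + 5 + 2 + 3 bytes
aligned_size = struct.calcsize(ALIGNED)  # may include padding before the short

print("packed size: ", packed_size)
print("aligned size:", aligned_size)
```

If the two sizes differ on a given platform, the field offsets differ as well, which is consistent with a non-packed struct failing to match an unaligned record layout.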

Do you think this is a bug, or is this mapping of numpy fixed length strings to char arrays not supported?

Josh

Robert Bradshaw

Feb 5, 2014, 2:49:47 PM
to cython...@googlegroups.com
I would chalk it up to "not supported," which is of course just
another kind of bug. We'd welcome a contribution to fix this.

Stefan Behnel

Feb 5, 2014, 2:55:28 PM
to cython...@googlegroups.com
Joshua, 28.01.2014 21:50:
> I am attempting to create a function in cython that accepts a numpy
> structured array or record array by defining a cython struct type. Suppose
> I have the data:
>
> a = np.recarray(3, dtype=[('a', np.float32), ('b', np.int32), ('c', '|S5'), ('d', '|S3')])
> a[0] = (1.1, 1, 'this\0', 'to\0')
> a[1] = (2.1, 2, 'that\0', 'ta\0')
> a[2] = (3.1, 3, 'dogs\0', 'ot\0')
>
> and then the cython code:
>
> import numpy as np
> cimport numpy as np
>
> cdef packed struct tstruct:
>     np.float32_t a
>     np.int32_t b
>     char[5] c
>     char[3] d
>
> def test_struct(tstruct[:] x):
>     cdef:
>         int k
>         tstruct y
>
>     for k in xrange(3):
>         y = x[k]
>         print y.a, y.b, y.c, y.d
>
> When I try to run test_struct(a), I get the error:
>
> ValueError: Expected a dimension of size 5, got 8

I can reproduce this, and it seems to originate from the memory view
unpacking when entering test_struct(), specifically the buffer type
validation in __Pyx_BufFmt_ProcessTypeChunk(). The actual code in
test_struct() isn't relevant.


> If in the array and corresponding struct I reorder the fields such that the
> string fields are not adjacent to each other, then the function works as
> expected. Null-terminating the strings as I show above does not seem to
> make a difference.
>
> I was wondering if there was some simple string handling issue that I’m
> missing? In the actual code, I don’t typically even need to manipulate the
> string fields. I just need them read in properly.

I couldn't look into this very deeply, but __Pyx_BufFmt_ProcessTypeChunk()
seems to behave a bit funny here, so it might be a problem in Cython.

Stefan

Joshua Adelman

Feb 5, 2014, 11:51:31 PM
to cython...@googlegroups.com

Hi Stefan,

Thanks for taking a look at this. I tried to dig through the code in Cython/Utility/Buffer.c to see if I could pick out what was going wrong, but I'm afraid I'm not familiar enough with all of the type encodings to make definite sense of it on a first read. My naive guess is that there is something going on in:
https://github.com/cython/cython/blob/master/Cython/Utility/Buffer.c#L738

since that appears to be the only place where `enc_count` is being incremented. For my own reference, the relevant line of code that catches the incorrect value of `enc_count` is:
https://github.com/cython/cython/blob/master/Cython/Utility/Buffer.c#L468

```
if (ctx->enc_count != ctx->head->field->type->arraysize[0]) {
    PyErr_Format(PyExc_ValueError,
                 "Expected a dimension of size %zu, got %zu",
                 ctx->head->field->type->arraysize[0], ctx->enc_count);
    return -1;
}
```

I'm happy to submit an issue to trac and write a test case for the problem to submit as a pull request.

Thanks again for your help.

Josh


Stefan Behnel

Feb 6, 2014, 2:39:41 PM
to cython...@googlegroups.com
Joshua Adelman, 06.02.2014 05:51:
> I'm happy to submit an issue to trac and write a test case for the problem to submit as a pull request.

You already provided test code. Turning it into a regression test is
straightforward.

This code in __Pyx_BufFmt_CheckString() looks suspicious to me:

"""
case 'O': case 's': case 'p':
if (ctx->enc_type == *ts && got_Z == ctx->is_complex &&
ctx->enc_packmode == ctx->new_packmode) {
/* Continue pooling same type */
ctx->enc_count += ctx->new_count;
"""

The type format string it's processing is this:

"""
T{f:a:i:b:5s:c:3s:d:}
"""

When it runs into the code above on the second "s", it considers the "3s"
related to the first "5s" and adds up their numbers, not noticing that it's
already skipped over the rest of the first string type (case ":") and
hasn't processed it yet.

My guess is that the test for "ctx->enc_type == *ts" is a tiny bit too
naive and fails in cases like this when it encounters the same base type
twice in a row.
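
The suspected pooling behavior can be modelled with a small Python sketch. This is a toy model, not Cython's actual C implementation: the function name `pooled_counts` and the regular expression are assumptions made for illustration, and the parser handles only flat `T{...}` strings of the form shown above.

```python
import re

def pooled_counts(fmt):
    """Naively pool adjacent fields that share a type code (toy model)."""
    body = fmt[2:-1]  # drop the leading 'T{' and the trailing '}'
    # Each field is '<optional count><type code>:<name>:'
    fields = re.findall(r"(\d*)([a-zA-Z]):(\w+):", body)
    runs = []
    for count, code, _name in fields:
        n = int(count) if count else 1
        if runs and runs[-1][0] == code:
            runs[-1][1] += n  # "continue pooling same type"
        else:
            runs.append([code, n])
    return [tuple(run) for run in runs]

FMT = "T{f:a:i:b:5s:c:3s:d:}"
print(pooled_counts(FMT))  # [('f', 1), ('i', 1), ('s', 8)]
```

In this model the adjacent "5s" and "3s" collapse into a single run of 8, which matches the "Expected a dimension of size 5, got 8" message. With the strings separated, as in `T{f:a:5s:c:i:b:3s:d:}`, no pooling happens and the counts stay at 5 and 3.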

That being said, I'm not into this deep enough to be sure about my analysis
above and to know how best to fix it without breaking other corner cases.

Could someone please take a look who has an idea about how these NumPy type
formats work?

Stefan

Stefan Behnel

Feb 22, 2014, 4:15:03 AM
to joshua....@gmail.com, cython...@googlegroups.com
Stefan Behnel, 06.02.2014 20:39:
> Joshua Adelman, 06.02.2014 05:51:
>> I'm happy to submit an issue to trac and write a test case for the problem to submit as a pull request.
>
> You already provided test code. Turning it into a regression test is
> straightforward.
>
> This code in __Pyx_BufFmt_CheckString() looks suspicious to me:
>
> """
> case 'O': case 's': case 'p':
> if (ctx->enc_type == *ts && got_Z == ctx->is_complex &&
> ctx->enc_packmode == ctx->new_packmode) {
> /* Continue pooling same type */
> ctx->enc_count += ctx->new_count;
> """
>
> The type format string it's processing is this:
>
> """
> T{f:a:i:b:5s:c:3s:d:}
> """
>
> When it runs into the code above on the second "s", it considers the "3s"
> related to the first "5s" and adds up their numbers, not noticing that it's
> already skipped over the rest of the first string type (case ":") and
> hasn't processed it yet.

I think I figured it out. Essentially, the pooling cannot be done for "s"
(strings) but only for simple types. Here is a fix:

https://github.com/cython/cython/commit/58d9361e0a6d4cb3d4e87775f78e0550c2fea836
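
The idea behind the fix, pooling only simple types and never strings, can be sketched in Python as follows. This is a simplified model and not the code from the commit: `SIMPLE_CODES`, `grouped_counts`, and `matches_struct` are hypothetical names, and the set of poolable type codes is an assumption.

```python
import re

# Assumed set of scalar type codes that are safe to pool (illustrative only).
SIMPLE_CODES = set("bBhHiIlLqQfd")

def grouped_counts(fmt):
    """Pool adjacent simple types, but keep each string field separate."""
    body = fmt[2:-1]  # drop the leading 'T{' and the trailing '}'
    fields = re.findall(r"(\d*)([a-zA-Z]):(\w+):", body)
    runs = []
    for count, code, _name in fields:
        n = int(count) if count else 1
        if runs and runs[-1][0] == code and code in SIMPLE_CODES:
            runs[-1][1] += n  # pooling is allowed for simple types only
        else:
            runs.append([code, n])
    return [tuple(run) for run in runs]

def matches_struct(fmt, char_array_sizes):
    """Compare the string field lengths against the declared char[N] sizes."""
    got = [n for code, n in grouped_counts(fmt) if code == "s"]
    return got == list(char_array_sizes)

print(matches_struct("T{f:a:i:b:5s:c:3s:d:}", [5, 3]))  # True
```

With strings excluded from pooling, the "5s" and "3s" fields are checked individually against `char[5]` and `char[3]`, so adjacent string fields no longer trigger the size mismatch in this model.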

Joshua, could you please test it on your side?

Stefan

Joshua Adelman

Feb 24, 2014, 12:47:35 PM
to Stefan Behnel, cython...@googlegroups.com
Hi Stefan,

I tested out my simple examples that were previously failing with the code that is in the master branch of the github repository and everything seems to be working properly now. Thanks for looking into this and making the necessary changes to fix the bug. I haven't done any extensive testing to ensure that there aren't secondary effects of this change, but hopefully the cython unit tests would catch any issues.

Josh