Advice copying memory strings while maintaining py2/3 compatibility

45 views

Skip to first unread message

Giovanni Torres

unread,

Aug 6, 2016, 1:23:14 AM8/6/16

to cython-users

Hi,

I'm wrapping some C code functions that each return an array of records. Each record is a pointer to a struct and there could be 100s or 1000s of records in an array. The struct has several char* members.

I have been doing the following to copy all the C strings to unicode so that it works in both Python 2 and 3, and then convert each struct into a Python object using properties:

#
# foo.pxi: convert char* to unicode
#
cimport cpython

cdef extern from "string.h":
    size_t strlen(char *s)

cdef unicode tounicode(char* s):
    if s == NULL:
        return None
    else:
        # is s.decode("UTF-8", "replace") sufficient?
        return cpython.PyUnicode_DecodeUTF8(s, strlen(s), "replace")

#
# foo.pxd
#
cdef extern from "foo.h":
    ctypedef struct foo_struct_t:
        char* bar
        char* baz

    int get_foo(foo_struct_t *msg)

#
# foo.pyx
#
from __future__ import unicode_literals
include "foo.pxi"

cdef class Foo:
    cdef:
        unicode bar
        unicode baz

    property bar:
        def __get__(self):
            return self.bar

    property bar:
        def __get__(self):
            return self.bar

cpdef get_me_some_foo():
    cdef:
        int rc, i
        foo_struct_t *msg = NULL
        list foo_list = []

    rc = get_foo(&msg)
    for i in range(msg.num_records):
        this_foo = Foo()
        this_foo.bar = tounicode(&msg.array[i].bar)
        this_foo.baz = tounicode(&msg.array[i].baz)
        foo_list.append(this_foo)
    return foo_list

1. Is the use of tounicode() in cpdef get_me_some_foo() a good way to copy c strings to unicode?
2. Is calling tounicode() many times not very efficient operation?
3. If class Foo has 40 to 50 properties, will this add a lot of memory overhead when handling 100s or 1000s of objects?
4. Is predefining the string members of the Foo class as unicode a good practice?
5. Is there a better mechanism for converting members of a struct to a Python object, so that I can get the following:

>>> from foo import get_me_some_foo
>>> a = get_me_some_foo()
>>> a[0].bar
>>> u"a beautiful unicode string"

I appreciate any suggestions for optimization.

Thanks!
Giovanni

Stefan Behnel

unread,

Aug 6, 2016, 2:21:47 AM8/6/16

to cython...@googlegroups.com

Giovanni Torres schrieb am 05.08.2016 um 20:46:
> I'm wrapping some C code functions that each return an array of records.
> Each record is a pointer to a struct and there could be 100s or 1000s of
> records in an array. The struct has several char* members.
>
> I have been doing the following to copy all the C strings to unicode so
> that it works in both Python 2 and 3, and then convert each struct into a
> Python object using properties:
>
> #
> # foo.pxi: convert char* to unicode
> #
> cimport cpython
>
> cdef extern from "string.h":
> size_t strlen(char *s)
>
> cdef unicode tounicode(char* s):
> if s == NULL:
> return None
> else:
> # is s.decode("UTF-8", "replace") sufficient?

Yes, just use .decode().

> return cpython.PyUnicode_DecodeUTF8(s, strlen(s), "replace")
>
>
>
> #
> # foo.pxd
> #
> cdef extern from "foo.h":
> ctypedef struct foo_struct_t:
> char* bar
> char* baz
>
> int get_foo(foo_struct_t *msg)

These seem to be purely external declarations, so I would rename this file
to "_foo.pxd" (or "cfoo.pxd", or something like that) to separate its
namespace from that of your "foo.pyx" file, and then cimport it in your
.pyx file.

> #
> # foo.pyx
> #
> from __future__ import unicode_literals
> include "foo.pxi"
>
> cdef class Foo:
> cdef:
> unicode bar
> unicode baz
>
> property bar:
> def __get__(self):
> return self.bar
>
> property bar:
> def __get__(self):
> return self.bar

Just declare these as "cdef readonly unicode bar" instead of spelling out
your own "property" block.

Also, using the special "property name:" syntax is discouraged now. If you
need a property, use the "@property" decorator.

> cpdef get_me_some_foo():
> cdef:
> int rc, i
> foo_struct_t *msg = NULL
> list foo_list = []

No need for typing "foo_list". It's obvious from the assignment.

> rc = get_foo(&msg)
> for i in range(msg.num_records):

I would spell this

for thing in msg.array[:msg.num_records]:

but I have no idea where the ".array" and ".num_records" come from, given
the declarations that you provided above.

> this_foo = Foo()
> this_foo.bar = tounicode(&msg.array[i].bar)
> this_foo.baz = tounicode(&msg.array[i].baz)
> foo_list.append(this_foo)
> return foo_list
>
>
>
>
> 1. Is the use of tounicode() in cpdef get_me_some_foo() a good way to copy
> c strings to unicode?

Looks good.

> 2. Is calling tounicode() many times not very efficient operation?

It's likely going to be inlined, but in any case, decoding is a relatively
expensive operation, so you won't be able to measure the call overhead here.

> 3. If class Foo has 40 to 50 properties, will this add a lot of memory
> overhead when handling 100s or 1000s of objects?

No, each property only exists once per class. Only the values of the
property are specific to an instance.

But Unicode strings can generally use a lot of memory, especially in Python 2.

You didn't say anything about the C memory management of your char*
strings, so I can't say if on-the-fly decoding at need would be an option
(which would be slow but save memory), or at least lazy initialisation of
the unicode string properties (which would be slow on first access but save
memory and time for unused fields).

> 4. Is predefining the string members of the Foo class as unicode a good
> practice?

Depends. If they are always exactly unicode strings (no subclasses), and
especially if they are immutable and the only assignment happens at
initialisation time, typing them explicitly is a good way of documenting them.

If the above doesn't apply to them, then it might not be that a good idea.

> 5. Is there a better mechanism for converting members of a struct to a
> Python object, so that I can get the following:
>
> >>> from foo import get_me_some_foo
> >>> a = get_me_some_foo()
> >>> a[0].bar
> >>> u"a beautiful unicode string"

Seems ok to me.

You could also enable auto-decoding for UTF-8 strings as described here:

http://docs.cython.org/en/latest/src/tutorial/strings.html#auto-encoding-and-decoding

Then you'd still have to handle the case of pointers being NULL, but you
could at least simplify the above to

this_foo.bar = thing.bar if thing.bar is not NULL else None

without having to repeat the .decode() all over the place.

There's also an automatic conversion from C structs to Python dicts
(mapping field names to their Python-converted values), but that also won't
handle the NULL pointer case for you.

I wonder if we should add a new compiler directive "c_string_null" that you
could set to None or an empty string to make that case automatic as well.

Stefan

Giovanni Torres

unread,

Aug 8, 2016, 11:36:43 AM8/8/16

to cython-users, stef...@behnel.de

If the name of the .pxd file is the same as the .pyx, there is an automatic import, correct? I used generic pseudo code, but here is a snippet of the code i am wrapping. The pxd file is pretty much all external declarations, but there is no namespace clashing, which is why I chose to keep the names the same.

#
# node.pxd
#
from libc.stdint cimport uint16_t, uint32_t
include "utils.pxi" # to get access to tounicode()

cdef extern from "time.h" nogil:
    ctypedef long time_t

cdef extern from "slurm/slurm.h" nogil:
    # struct for a single record
    ctypedef struct node_info_t:
        char *arch
        char *name
        char *node_addr
        char *node_hostname

    # Double pointer to array of records
    ctypedef struct node_info_msg_t:
        time_t last_update
        uint32_t node_scaling
        uint32_t record_count
        node_info_t *node_array

    # load pointer with all node records
    int slurm_load_node(time_t update_time,
                          node_info_msg_t **resp,
                          uint16_t show_flags)

    # Free msg pointer
    void slurm_free_node_info_msg(node_info_msg_t *node_buffer_ptr)

#
# node.pyx
#
cdef class Node:
    cdef:
        readonly unicode arch
        readonly unicode name
        readonly unicode node_addr
        readonly unicode node_hostname

cpdef get_node_info_msg():
    cdef:
        node_info_msg_t *node_info_msg_ptr = NULL
        uint16_t show_flags = 0
        uint32_t i
        int rc

    rc = slurm_load_node(<time_t>NULL, &node_info_msg_ptr, show_flags)
    node_list = []

    for i in range(&node_info_msg_ptr, b_node, show_flags):
        record = &node_info_msg_ptr.node_array[i]

        this_node = Node()
        this_node.arch = tounicode(record.arch)
        this_node.name = tounicode(record.name)
        this_node.node_addr = tounicode(record.node_addr)
        this_node.node_hostname = tounicode(record.node_hostname)
        node_list.append(this_node)

    slurm_free_node_info_msg(node_info_msg_ptr)
    node_info_msg_ptr = NULL
    return node_list


Then, in python:
>>> from pyslurm import node
>>> all = node.get_node_info_msg()
>>> node1 = all[0]
>>> node1.arch
u"x86_64"
>>>

> #
> # foo.pyx
> #
> from __future__ import unicode_literals
> include "foo.pxi"
>
> cdef class Foo:
> cdef:
> unicode bar
> unicode baz
>
> property bar:
> def __get__(self):
> return self.bar
>
> property bar:
> def __get__(self):
> return self.bar

Just declare these as "cdef readonly unicode bar" instead of spelling out
your own "property" block.

Done.

Also, using the special "property name:" syntax is discouraged now. If you
need a property, use the "@property" decorator.

I reread the docs and saw the deprecation notice. I will change the properties to use the decorator instead.

> cpdef get_me_some_foo():
> cdef:
> int rc, i
> foo_struct_t *msg = NULL
> list foo_list = []

No need for typing "foo_list". It's obvious from the assignment.

I should have checked, but doing "cdef list foo_list = []" or just "foo_list = []" outside the cdef block appears to result in the same C code.

> rc = get_foo(&msg)
> for i in range(msg.num_records):

I would spell this

for thing in msg.array[:msg.num_records]:

but I have no idea where the ".array" and ".num_records" come from, given
the declarations that you provided above.

I provided a better example above. Is it more efficient and or better practice to use your method or is iterating using range() here ok?

> this_foo = Foo()
> this_foo.bar = tounicode(&msg.array[i].bar)
> this_foo.baz = tounicode(&msg.array[i].baz)
> foo_list.append(this_foo)
> return foo_list
>
>
>
>
> 1. Is the use of tounicode() in cpdef get_me_some_foo() a good way to copy
> c strings to unicode?

Looks good.

> 2. Is calling tounicode() many times not very efficient operation?

It's likely going to be inlined, but in any case, decoding is a relatively
expensive operation, so you won't be able to measure the call overhead here.

> 3. If class Foo has 40 to 50 properties, will this add a lot of memory
> overhead when handling 100s or 1000s of objects?

No, each property only exists once per class. Only the values of the
property are specific to an instance.

But Unicode strings can generally use a lot of memory, especially in Python 2.

You didn't say anything about the C memory management of your char*
strings, so I can't say if on-the-fly decoding at need would be an option
(which would be slow but save memory), or at least lazy initialisation of
the unicode string properties (which would be slow on first access but save
memory and time for unused fields).

I'm not sure what you mean. Could you suggest something based on the code snippet above?

> 4. Is predefining the string members of the Foo class as unicode a good
> practice?

Depends. If they are always exactly unicode strings (no subclasses), and
especially if they are immutable and the only assignment happens at
initialisation time, typing them explicitly is a good way of documenting them.

Yes, they are all basically immutable. Is there a difference typing strings as "unicode" versus typing as an "object"? I've seen this in some of the examples.

If the above doesn't apply to them, then it might not be that a good idea.

> 5. Is there a better mechanism for converting members of a struct to a
> Python object, so that I can get the following:
>
> >>> from foo import get_me_some_foo
> >>> a = get_me_some_foo()
> >>> a[0].bar
> >>> u"a beautiful unicode string"

Seems ok to me.

You could also enable auto-decoding for UTF-8 strings as described here:

http://docs.cython.org/en/latest/src/tutorial/strings.html#auto-encoding-and-decoding

Then you'd still have to handle the case of pointers being NULL, but you
could at least simplify the above to

this_foo.bar = thing.bar if thing.bar is not NULL else None

without having to repeat the .decode() all over the place.

I wanted to be able to support Cython 0.15.1 and above, but this is a really convenient feature, so I'll have to think about it. Does this conflict with unicode_literals from __future__?

There's also an automatic conversion from C structs to Python dicts
(mapping field names to their Python-converted values), but that also won't
handle the NULL pointer case for you.

Where is the documentation for this?

I wonder if we should add a new compiler directive "c_string_null" that you
could set to None or an empty string to make that case automatic as well.

This would be a great, convenient feature. For the code I am wrapping, I have to check for NULL pointers, otherwise I get seg faults. If this feature were available, do you imagine the automatic C struct -> Python dict conversion also automatically handle the NULL pointer as well?

Thanks!
Giovanni

Stefan Behnel

unread,

Aug 8, 2016, 12:30:29 PM8/8/16

to cython...@googlegroups.com

Giovanni Torres schrieb am 08.08.2016 um 17:33:
> On Saturday, August 6, 2016 at 2:21:47 AM UTC-4, Stefan Behnel wrote:
>> Giovanni Torres schrieb am 05.08.2016 um 20:46:
>>> I'm wrapping some C code functions that each return an array of records.
>>
>>> Each record is a pointer to a struct and there could be 100s or 1000s of
>>> records in an array. The struct has several char* members.
>>>
>>> I have been doing the following to copy all the C strings to unicode so
>>> that it works in both Python 2 and 3, and then convert each struct into
>>> a Python object using properties:
>>>
>>> #
>>> # foo.pxi: convert char* to unicode
>>> #
>>> cimport cpython
>>>

>>> cdef unicode tounicode(char* s):
>>> if s == NULL:
>>> return None
>>> else:

>>> return s.decode("utf8", "replace")

>>>
>>> #
>>> # foo.pxd
>>> #
>>> cdef extern from "foo.h":
>>> ctypedef struct foo_struct_t:
>>> char* bar
>>> char* baz
>>>
>>> int get_foo(foo_struct_t *msg)
>>
>> These seem to be purely external declarations, so I would rename this file
>> to "_foo.pxd" (or "cfoo.pxd", or something like that) to separate its
>> namespace from that of your "foo.pyx" file, and then cimport it in your
>> .pyx file.
>
> If the name of the .pxd file is the same as the .pyx, there is an
> automatic import, correct?

They share the same namespace, that's not quite the same thing as an import.

Also, if you cimport anything from foo in another module in order to reuse
the C declarations, Cython will import the foo module at runtime, not just
the declarations in foo.pxd. Believe me, it's a source of confusion that is
easy to avoid by giving different things different names.

>>> rc = get_foo(&msg)
>>> for i in range(msg.num_records):
>>
>> I would spell this
>>
>> for thing in msg.array[:msg.num_records]:
>>
>> but I have no idea where the ".array" and ".num_records" come from, given
>> the declarations that you provided above.
>
> I provided a better example above. Is it more efficient and or better
> practice to use your method or is iterating using range() here ok?

Same as in Python: why use integer iteration and indexing if you can
iterate directly over something else?

>>> 3. If class Foo has 40 to 50 properties, will this add a lot of memory
>>> overhead when handling 100s or 1000s of objects?
>>
>> No, each property only exists once per class. Only the values of the
>> property are specific to an instance.
>>
>> But Unicode strings can generally use a lot of memory, especially in
>> Python 2.
>>
>> You didn't say anything about the C memory management of your char*
>> strings, so I can't say if on-the-fly decoding at need would be an option
>> (which would be slow but save memory), or at least lazy initialisation of
>> the unicode string properties (which would be slow on first access but
>> save memory and time for unused fields).
>
> I'm not sure what you mean. Could you suggest something based on the code
> snippet above?

It's the usual tradeoff: time versus memory.

You can instantiate all Unicode strings upfront (as you do now), which eats
memory (and some initialisation time) but gives fast access.

Or you can use explicit property getters and instantiate the Unicode
strings on first access. That provides faster instantiation and uses only
the necessary memory for the objects it instantiates, but implies some
overhead on first access and requires you to keep the original C data
alive, which may or may not work nicely with the source that provided it.

Something like this:

cdef class MyAttrs:
cdef unicode _s
cdef char* _orig_s_value_in_c

@property
def s(self):
if self._s is None and self._orig_s_value_in_c is not NULL:
self._s = self._orig_s_value_in_c.decode('utf8')
free(self._orig_s_value_in_c) # depends on where it came from
self._orig_s_value_in_c = NULL
return self._s

>>> 4. Is predefining the string members of the Foo class as unicode a good
>>> practice?
>>
>> Depends. If they are always exactly unicode strings (no subclasses), and
>> especially if they are immutable and the only assignment happens at
>> initialisation time, typing them explicitly is a good way of documenting
>> them.
>
> Yes, they are all basically immutable. Is there a difference typing
> strings as "unicode" versus typing as an "object"? I've seen this in some
> of the examples.

I've written some comments on this subject here:

http://docs.cython.org/en/latest/src/tutorial/strings.html#accepting-strings-from-python-code

>> You could also enable auto-decoding for UTF-8 strings as described here:
>>
>> http://docs.cython.org/en/latest/src/tutorial/strings.html#auto-encoding-and-decoding
>>
>> Then you'd still have to handle the case of pointers being NULL, but you
>> could at least simplify the above to
>>
>> this_foo.bar = thing.bar if thing.bar is not NULL else None
>>
>> without having to repeat the .decode() all over the place.
>
> I wanted to be able to support Cython 0.15.1 and above, but this is a
> really convenient feature, so I'll have to think about it.

You shouldn't care too much about older Cython versions. Instead, release
your packages with the generated C files included, so that normal users
don't need Cython at all.

> Does this conflict with unicode_literals from __future__?

No. You would want to explicitly prefix your byte strings with a "b" in
that case, though.

>> There's also an automatic conversion from C structs to Python dicts
>> (mapping field names to their Python-converted values), but that also
>> won't handle the NULL pointer case for you.
>
> Where is the documentation for this?

It's really just as I wrote above.

http://docs.cython.org/en/latest/src/userguide/language_basics.html#automatic-type-conversions

>> I wonder if we should add a new compiler directive "c_string_null" that
>> you
>> could set to None or an empty string to make that case automatic as well.
>
> This would be a great, convenient feature. For the code I am wrapping, I
> have to check for NULL pointers, otherwise I get seg faults. If this
> feature were available, do you imagine the automatic C struct -> Python
> dict conversion also automatically handle the NULL pointer as well?

These things usually work recursively for data structures in Cython. It's
particularly nice for C++ STL data structures, but also works well for the
tiny bit of structure that C offers.

Stefan

Giovanni Torres

unread,

Aug 10, 2016, 2:13:40 AM8/10/16

to cython-users, stef...@behnel.de

Got it.

>>> rc = get_foo(&msg)
>>> for i in range(msg.num_records):
>>
>> I would spell this
>>
>> for thing in msg.array[:msg.num_records]:
>>
>> but I have no idea where the ".array" and ".num_records" come from, given
>> the declarations that you provided above.
>
> I provided a better example above. Is it more efficient and or better
> practice to use your method or is iterating using range() here ok?

Same as in Python: why use integer iteration and indexing if you can
iterate directly over something else?

Of course. I didn't make the connection on the Cython side.

I see. Thanks for the example, but I think you're right in that this may not work nicely with this project.

>>> 4. Is predefining the string members of the Foo class as unicode a good
>>> practice?
>>
>> Depends. If they are always exactly unicode strings (no subclasses), and
>> especially if they are immutable and the only assignment happens at
>> initialisation time, typing them explicitly is a good way of documenting
>> them.
>
> Yes, they are all basically immutable. Is there a difference typing
> strings as "unicode" versus typing as an "object"? I've seen this in some
> of the examples.

I've written some comments on this subject here:

http://docs.cython.org/en/latest/src/tutorial/strings.html#accepting-strings-from-python-code

>> You could also enable auto-decoding for UTF-8 strings as described here:
>>
>> http://docs.cython.org/en/latest/src/tutorial/strings.html#auto-encoding-and-decoding
>>
>> Then you'd still have to handle the case of pointers being NULL, but you
>> could at least simplify the above to
>>
>> this_foo.bar = thing.bar if thing.bar is not NULL else None
>>
>> without having to repeat the .decode() all over the place.
>
> I wanted to be able to support Cython 0.15.1 and above, but this is a
> really convenient feature, so I'll have to think about it.

You shouldn't care too much about older Cython versions. Instead, release
your packages with the generated C files included, so that normal users
don't need Cython at all.

I considered this. I'll give it more thought.

> Does this conflict with unicode_literals from __future__?

No. You would want to explicitly prefix your byte strings with a "b" in
that case, though.

>> There's also an automatic conversion from C structs to Python dicts
>> (mapping field names to their Python-converted values), but that also
>> won't handle the NULL pointer case for you.
>
> Where is the documentation for this?

It's really just as I wrote above.

http://docs.cython.org/en/latest/src/userguide/language_basics.html#automatic-type-conversions

>> I wonder if we should add a new compiler directive "c_string_null" that
>> you
>> could set to None or an empty string to make that case automatic as well.
>
> This would be a great, convenient feature. For the code I am wrapping, I
> have to check for NULL pointers, otherwise I get seg faults. If this
> feature were available, do you imagine the automatic C struct -> Python
> dict conversion also automatically handle the NULL pointer as well?

These things usually work recursively for data structures in Cython. It's
particularly nice for C++ STL data structures, but also works well for the
tiny bit of structure that C offers.