Giovanni Torres wrote on 08.08.2016 at 17:33:
> On Saturday, August 6, 2016 at 2:21:47 AM UTC-4, Stefan Behnel wrote:
>> Giovanni Torres wrote on 05.08.2016 at 20:46:
>>> I'm wrapping some C code functions that each return an array of records.
>>
>>> Each record is a pointer to a struct and there could be 100s or 1000s of
>>> records in an array. The struct has several char* members.
>>>
>>> I have been doing the following to copy all the C strings to unicode so
>>> that it works in both Python 2 and 3, and then convert each struct into
>>> a Python object using properties:
>>>
>>> #
>>> # foo.pxi: convert char* to unicode
>>> #
>>> cimport cpython
>>>
>>> cdef unicode tounicode(char* s):
>>>     if s == NULL:
>>>         return None
>>>     else:
>>>         return s.decode("utf8", "replace")
>>>
>>> #
>>> # foo.pxd
>>> #
>>> cdef extern from "foo.h":
>>>     ctypedef struct foo_struct_t:
>>>         char* bar
>>>         char* baz
>>>
>>>     int get_foo(foo_struct_t *msg)
>>
>> These seem to be purely external declarations, so I would rename this file
>> to "_foo.pxd" (or "cfoo.pxd", or something like that) to separate its
>> namespace from that of your "foo.pyx" file, and then cimport it in your
>> .pyx file.
>
> If the name of the .pxd file is the same as the .pyx, there is an
> automatic import, correct?

They share the same namespace, which is not quite the same thing as an
import. Also, if you cimport anything from foo in another module in order
to reuse the C declarations, Cython will import the foo module at runtime,
not just the declarations in foo.pxd. Believe me, it's a source of
confusion that is easy to avoid by giving different things different names.

>>> rc = get_foo(&msg)
>>> for i in range(msg.num_records):
>>
>> I would spell this
>>
>>     for thing in msg.array[:msg.num_records]:
>>
>> but I have no idea where the ".array" and ".num_records" come from, given
>> the declarations that you provided above.
>
> I provided a better example above. Is it more efficient and/or better
> practice to use your method, or is iterating using range() here ok?
Same as in Python: why use integer iteration and indexing if you can
iterate directly over something else?
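
To illustrate the point in plain Python (a minimal sketch: the record
values are made up, and a list stands in for the C array from your code):

```python
# Hypothetical stand-in for a C array with more slots than valid records.
array = ["rec0", "rec1", "rec2", "unused", "unused"]
num_records = 3

# Index-based iteration: works, but the index is only a means to an end.
by_index = []
for i in range(num_records):
    by_index.append(array[i].upper())

# Direct iteration over the valid part of the array.
direct = [thing.upper() for thing in array[:num_records]]
```

Both give the same result. In Python the slice copies the list, whereas in
Cython a loop over a sliced C pointer can be compiled down to a plain C
loop, so the direct form should not cost you anything there.
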
>>> 3. If class Foo has 40 to 50 properties, will this add a lot of memory
>>> overhead when handling 100s or 1000s of objects?
>>
>> No, each property only exists once per class. Only the values of the
>> property are specific to an instance.
>>
>> But Unicode strings can generally use a lot of memory, especially in
>> Python 2.
>>
>> You didn't say anything about the C memory management of your char*
>> strings, so I can't say if on-the-fly decoding at need would be an option
>> (which would be slow but save memory), or at least lazy initialisation of
>> the unicode string properties (which would be slow on first access but
>> save memory and time for unused fields).
>
> I'm not sure what you mean. Could you suggest something based on the code
> snippet above?

It's the usual tradeoff: time versus memory.

You can instantiate all Unicode strings upfront (as you do now), which eats
memory (and some initialisation time) but gives fast access.

Or you can use explicit property getters and instantiate the Unicode
strings on first access. That provides faster instantiation and uses only
the necessary memory for the objects it instantiates, but implies some
overhead on first access and requires you to keep the original C data
alive, which may or may not work nicely with the source that provided it.

Something like this:

    from libc.stdlib cimport free

    cdef class MyAttrs:
        cdef unicode _s
        cdef char* _orig_s_value_in_c

        @property
        def s(self):
            # decode lazily on first access, then cache the result
            if self._s is None and self._orig_s_value_in_c is not NULL:
                self._s = self._orig_s_value_in_c.decode('utf8')
                free(self._orig_s_value_in_c)  # depends on where it came from
                self._orig_s_value_in_c = NULL
            return self._s

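
If you want to play with the caching behaviour without compiling anything,
here is a rough plain-Python analogue. It is only a sketch: the class name
and the decode_count counter are made up for illustration, and a bytes
object stands in for the C char* buffer (with None playing the role of
NULL).

```python
class LazyAttrs:
    """Plain-Python analogue: bytes stand in for the C char* buffer."""

    def __init__(self, raw):
        self._raw = raw        # original encoded data, or None (like NULL)
        self._s = None         # decoded text, filled in on first access
        self.decode_count = 0  # only here to make the caching visible

    @property
    def s(self):
        if self._s is None and self._raw is not None:
            self._s = self._raw.decode("utf8")
            self.decode_count += 1
            self._raw = None   # drop the original once it has been decoded
        return self._s


attrs = LazyAttrs(b"caf\xc3\xa9")
first = attrs.s    # decodes here
second = attrs.s   # served from the cached value
```

The second access returns the very same string object, and an instance
created from None just keeps returning None.
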
>>> 4. Is predefining the string members of the Foo class as unicode a good
>>> practice?
>>
>> Depends. If they are always exactly unicode strings (no subclasses), and
>> especially if they are immutable and the only assignment happens at
>> initialisation time, typing them explicitly is a good way of documenting
>> them.
>
> Yes, they are all basically immutable. Is there a difference typing
> strings as "unicode" versus typing as an "object"? I've seen this in some
> of the examples.
I've written some comments on this subject here:
http://docs.cython.org/en/latest/src/tutorial/strings.html#accepting-strings-from-python-code
>> You could also enable auto-decoding for UTF-8 strings as described here:
>>
>> http://docs.cython.org/en/latest/src/tutorial/strings.html#auto-encoding-and-decoding
>>
>> Then you'd still have to handle the case of pointers being NULL, but you
>> could at least simplify the above to
>>
>>     this_foo.bar = thing.bar if thing.bar is not NULL else None
>>
>> without having to repeat the .decode() all over the place.
>
> I wanted to be able to support Cython 0.15.1 and above, but this is a
> really convenient feature, so I'll have to think about it.
You shouldn't care too much about older Cython versions. Instead, release
your packages with the generated C files included, so that normal users
don't need Cython at all.
> Does this conflict with unicode_literals from __future__?
No. You would want to explicitly prefix your byte strings with a "b" in
that case, though.
>> There's also an automatic conversion from C structs to Python dicts
>> (mapping field names to their Python-converted values), but that also
>> won't handle the NULL pointer case for you.
>
> Where is the documentation for this?
It's really just as I wrote above.
http://docs.cython.org/en/latest/src/userguide/language_basics.html#automatic-type-conversions
>> I wonder if we should add a new compiler directive "c_string_null" that
>> you could set to None or an empty string to make that case automatic as
>> well.
>
> This would be a great, convenient feature. For the code I am wrapping, I
> have to check for NULL pointers, otherwise I get segfaults. If this
> feature were available, do you imagine the automatic C struct -> Python
> dict conversion would also handle the NULL pointers?
These things usually work recursively for data structures in Cython. It's
particularly nice for C++ STL data structures, but also works well for the
tiny bit of structure that C offers.
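
To make "recursively" concrete, here is a plain-Python sketch of what such
a conversion could do. It is purely illustrative: the "c_string_null"
directive does not exist, this is not how Cython implements its
conversions, and to_python() and null_default are names I made up. None
stands in for a NULL char*.

```python
def to_python(value, null_default=None):
    """Recursively convert a nested structure, decoding byte strings.

    None stands in for a NULL char*; null_default is what it maps to.
    """
    if value is None:
        return null_default
    if isinstance(value, bytes):
        return value.decode("utf8", "replace")
    if isinstance(value, dict):
        return {key: to_python(item, null_default)
                for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_python(item, null_default) for item in value]
    return value


# Hypothetical record, shaped like a struct converted to a dict.
record = {"bar": b"hello", "baz": None,
          "children": [{"bar": None, "baz": b"x"}]}
converted = to_python(record, null_default="")
```

The NULL replacement is applied at every nesting level, which is the
behaviour you would expect from a directive like that.
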
Stefan