I'm handling Unicode and can choose whatever internal Unicode representation I want: UTF-8/UTF-16/UTF-32, wchar_t*, or code points. Which internal representation and C type minimizes memory allocation during the C<->Python translation (i.e. by simply passing a buffer)? The documentation could be clearer: it specifies neither the internal representation of the data within the C types nor the internal Unicode representation used by Python itself. (I presume that Python stores Unicode strings internally as Py_UCS4, which would make a uint32_t array of code points the most efficient choice.)
wchar_t: If this type holds code points, what is the endianness on
platforms where wchar_t is not 32 bits? If it instead holds an encoding,
is it the stateful ISO 2022(?) encoding produced by the mbrtowc family
of functions, or UTF-16?
char32_t: Is this UTF-32, 32-bit code points, or dependent on
__STDC_UTF_32__?
char16_t: Is this UTF-16, or does it have anything to do with
__STDC_UTF_16__?
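At least the sizes involved are easy to inspect per platform; a quick
sketch (assuming cffi is installed, 1.11+ for the charN_t types):
>>> import cffi
>>> ffi = cffi.FFI()
>>> ffi.sizeof("wchar_t")   # typically 4 on Unix, 2 on Windows; 4 shown here
4
>>> ffi.sizeof("char32_t"), ffi.sizeof("char16_t")
(4, 2)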
The relevant sections in the documentation:
# This is an error (a str cannot initialize a uint32_t array):
>>> o = ffi.new("uint32_t[]", "\U00010437")
# Initializing from a list of code points works:
>>> o = ffi.new("uint32_t[]", [0x10437])
# But this is an error too (ffi.string() rejects uint32_t):
>>> ffi.string(o)
# So converting back to a str must be done manually:
>>> o = ffi.new("uint32_t[]", [97, 93, 0x10437])
>>> u = "".join(chr(i) for i in o)
# Using char32_t is faster, because the conversion is then done
# automatically by cffi, in C.
None of them do: every combination requires allocating memory for a copy.
The reason is that all the internal representations can *and do* change,
based on parameters that are not easily controlled: notably, since
PEP 393, CPython itself stores a str in 1, 2, or 4 bytes per character,
depending on the widest code point the string contains.
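For instance, three strings of equal length occupy different amounts of
memory depending on their widest character (a sketch; the exact byte
counts vary across CPython versions, but the ordering holds):
>>> import sys
>>> sys.getsizeof("spam") < sys.getsizeof("spa\u0100") < sys.getsizeof("spa\U00010437")
True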
>>> o = ffi.new("char[]", b"word")
>>> s = ffi.string(o)
>>> s
b'word'
>>> o[0] = b"t"
# New copy, no changes:
>>> s
b'word'
>>> del o
# New copy, still around:
>>> s
b'word'
# New copy, just as if doing:
>>> s = bytes(ffi.buffer(o))
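Note that ffi.buffer() itself is only a zero-copy view; it is the
bytes() call around it that copies. A sketch (using the optional size
argument of ffi.buffer() to skip the terminating NUL that
ffi.new("char[]", ...) appends):
>>> o = ffi.new("char[]", b"word")   # actually char[5], with a trailing NUL
>>> buf = ffi.buffer(o, 4)           # zero-copy view of the first four bytes
>>> o[0] = b"t"
>>> bytes(buf)                       # the view sees the change; bytes() copies
b'tord'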
>>> s = b"word"
>>> o = ffi.from_buffer(s)
# Notice, no crashing:
>>> o[0] = b"t"
>>> s
b'tord'
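If the goal is zero-copy in the other direction, a safer variant is to
start from a buffer that is legitimately mutable on the Python side,
such as a bytearray (a sketch, assuming a reasonably recent cffi;
ffi.from_buffer() accepts any object supporting the buffer interface):
>>> s = bytearray(b"word")
>>> o = ffi.from_buffer(s)   # still zero-copy, but mutation is now legal
>>> o[0] = b"t"
>>> s
bytearray(b'tord')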