Unicode Data: C Type Conversions and Internal Representation


orbis...@gmail.com

Aug 30, 2017, 6:33:54 PM
to python-cffi
(I've been asked to move this from the issue tracker #326 to here.)


I'm handling Unicode and have the opportunity to work with whatever type of internal Unicode representation I want: UTF8/UTF16/UTF32 or wchar_t* or codepoints. Which internal representation and C type reduces memory allocation during the C <-> Python translation (i.e. by simply passing a buffer)? The documentation could be clearer here: the internal representation of the data within the C types is not specified. The documentation also omits the internal Unicode representation used by Python (I presume that Python internally stores Unicode strings as UCS4, i.e. Py_UCS4, which would make a uint32_t array of codepoints the most efficient choice).

  • wchar_t: If this type holds codepoints, what is the endianness on platforms where wchar_t is not 32 bits? If this type holds an encoding, is it the stateful ISO 2022(?) encoding given by the mbrtowc family of functions or UTF16?

  • char32_t: Is this UTF32 or 32-bit codepoints, or dependent on __STDC_UTF_32__?

  • char16_t: Is this UTF16, or does it have anything to do with __STDC_UTF_16__?

The relevant sections in the documentation:

  1. Working with pointers, structures and arrays
  2. ffi.string(), ffi.unpack()
  3. Conversions

Armin Rigo

Aug 31, 2017, 2:36:33 AM
to pytho...@googlegroups.com
Hi,

On 31 August 2017 at 00:33, <orbis...@gmail.com> wrote:
> I'm handling Unicode and have the opportunity to work with whatever type of
> internal Unicode representation I want: UTF8/UTF16/UTF32 or wchar_t* or
> codepoints. Which internal representation and C type reduces memory
> allocation during the C <-> Python translation (i.e. by simply passing a
> buffer)?

None of them do: all combinations require memory allocation of a copy.
The reason is that all internal representations can *and do* change,
based on various not-easily-controlled parameters. So the design
choice here was to never share the memory, instead of sharing it only
when the stars happen to align. If we shared it, you might get obscure
bugs that show up only on Python X and platform Y, because the C
library mutates the buffer and the call you're making shares the
buffer only on that combination.

> wchar_t: If this type holds codepoints, what is the endianness on platforms
> where wchar_t is not 32 bits? If this type holds an encoding, is it the
> stateful ISO 2022(?) encoding given by the mbrtowc family of functions or
> UTF16?
>
> char32_t: Is this UTF32 or 32-bit codepoints, or dependent on
> __STDC_UTF_32__?
>
> char16_t: Is this UTF16, or does it have anything to do with
> __STDC_UTF_16__?

CFFI assumes exactly that char32_t is UCS4 (=UTF32, =32-bit
codepoints) with the native endianness; char16_t is UTF16 with the
native endianness; and wchar_t is one of the previous two.
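
For instance, a minimal sketch (this assumes cffi 1.11 or later, where
the char16_t/char32_t types were added):

    >>> p = ffi.new("char32_t[]", "\U00010437")   # one copy, UCS4, native endianness
    >>> ffi.string(p)
    '\U00010437'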

Here are some more details.

Let's start with the internal representation of the data within the C
types: this is a topic for the C language to define, not CFFI. And
the C language is remarkably underspecified in this area and leaves a
lot to the actual implementation. The actual meaning of 'wchar_t',
'char32_t' and 'char16_t' is not specified. Nowadays, *in practice*,
we can generally assume that 'char32_t' is just UCS4 and 'char16_t' is
just UTF16, and 'wchar_t' is equivalent to one of the previous two,
depending on its size. This should be what most current C libraries
assume, and this is what CFFI assumes (I could add this point to the
documentation). I guess that it could be wrong with very old
applications using 'wchar_t', or the application handling only UCS2
and getting confused by surrogates, etc. So if you want to know the
final word, you have to look up the details of the C library you're
interfacing with.

Then, the internal representation used by Python: this is another
messy topic. In CPython < 3.3 (including 2.x), Python uses
either UCS4 or UCS2/UTF16, which you can see in the value of
'sys.maxunicode'. On Windows it is always UCS2/UTF16. On Mac it is
normally UCS2/UTF16 too. On Linux it is normally UCS4 if you get your
Python from the distribution.
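
For instance, on those versions you can check your build like this
(nothing cffi-specific; on CPython >= 3.3 this always prints '0x10ffff'):

    >>> import sys
    >>> hex(sys.maxunicode)   # '0xffff' = narrow (UCS2/UTF16), '0x10ffff' = wide (UCS4)
    '0x10ffff'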

In Python >= 3.3 the situation changed, and on all platforms CPython
uses either 1, 2 or 4 bytes per character depending on what is
actually stored in the string.

On PyPy (2 or 3), it currently uses UCS2/UTF16 on Windows and Mac, and
UCS4 on Linux; in-progress work might change that to UTF8.


A bientôt,

Armin.

orbis...@gmail.com

Aug 31, 2017, 10:36:59 AM
to python-cffi
Per the documentation, uint32_t and uint16_t are the same as char32_t and char16_t, but without the automatic conversion. I assume this means cffi doesn't handle strings of these types. I know this isn't supported (I tested it), but perhaps I missed some other methods:
    # These are all errors:
    >>> o = ffi.new("uint32_t[]", "\U00010437")
    # Or:
    >>> o = ffi.new("uint32_t[]", [0x10437])
    >>> ffi.string(o)

And instead I have to handle the conversion myself in Python:

    >>> o = ffi.new("uint32_t[]", [97, 93, 0x10437])
    # Using char32_t is faster because this conversion is done
    # automatically by cffi in C:
    >>> u = "".join(chr(i) for i in o)
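
(And in the other direction, str -> uint32_t[], building the integer
list by hand with ord(); a sketch:)

    >>> o = ffi.new("uint32_t[]", [ord(c) for c in u])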


> None of them do: all combinations require memory allocation of a copy.
> The reason is that all internal representations can *and do* change,
> based on various not-easily-controlled parameters.

I assume that even when working with bytes/char[], the methods `ffi.string` and `ffi.unpack` still allocate new memory for a copy, just as when working with unicode, even though the internal representation of byte strings does not change? Judging by this, anyway:
    >>> o = ffi.new("char[]", b"word")
    >>> s = ffi.string(o)
    >>> s
    b'word'
    >>> o[0] = b"t"
    # New copy, no changes:
    >>> s
    b'word'
    >>> del o
    # New copy, still around:
    >>> s
    b'word'
    # New copy, just as if doing:
    >>> s = bytes(ffi.buffer(o))


The methods `ffi.buffer` and `ffi.from_buffer`, on the other hand, do share the underlying memory. How bad is it to modify the underlying buffer of an immutable Python byte string?
    >>> s = b"word"
    >>> o = ffi.from_buffer(s)
    # Notice, no crashing:
    >>> o[0] = b"t"
    >>> s
    b'tord'
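
(ffi.buffer on cffi-owned memory shares it the same way, until you
explicitly copy out of it; a quick sketch:)

    >>> o = ffi.new("char[]", b"word")
    >>> buf = ffi.buffer(o)
    >>> o[0] = b"t"
    >>> bytes(buf)
    b'tord'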


Just for completeness: according to the documentation (#Conversions, Citation [1]), when passing a Python byte string as a char* argument, cffi shares the memory (as with `ffi.from_buffer`) rather than allocating a new copy (as with `ffi.new("char[]", the_python_byte_string_var)`).
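
For example, a sketch (assuming a POSIX system, where ffi.dlopen(None)
returns the C standard library):

    >>> ffi.cdef("size_t strlen(const char *);")
    >>> C = ffi.dlopen(None)
    >>> C.strlen(b"word")   # the bytes object's memory is shared for the call
    4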


- thanks

Armin Rigo

Sep 2, 2017, 5:16:47 AM
to pytho...@googlegroups.com
Hi,

On 31 August 2017 at 16:36, <orbis...@gmail.com> wrote:
> Per the documentation, uint32_t, uint16_t are the same as char16_t and
> char32_t, but without the automatic conversion. I assume this means cffi
> doesn't handle strings of these types.

Yes, types like "uint32_t[]" can only be converted to or from lists of
integers. That's part of the motivation for adding "char32_t" in cffi
1.11.

> I assume even when working with bytes/char[] the methods `ffi.string`,
> `ffi.unpack` still allocate new memory for a copy, just as if working with
> unicode?

Yes, in this direction, it's needed: a Python byte string always
"owns" its own copy of memory, so we can't return a Python byte string
that would be a mere pointer to existing memory.
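
As a side note (a small sketch): ffi.unpack also copies, but it takes
an explicit length, so unlike ffi.string it handles embedded null bytes:

    >>> p = ffi.new("char[]", b"wo\x00rd")
    >>> ffi.string(p)     # stops at the first null byte
    b'wo'
    >>> ffi.unpack(p, 5)  # copies exactly 5 bytes
    b'wo\x00rd'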

> While the methods `ffi.buffer` and `ffi.from_buffer` do share the underlying
> memory. How bad is it to modify the underlying buffer of an immutable Python
> byte string?
> >>> s = b"word"
> >>> o = ffi.from_buffer(s)
> # Notice, no crashing:
> >>> o[0] = b"t"
> >>> s
> b'tord'

That's almost always a bad idea. For example, try this:

    def f():
        s = b"word"
        o = ffi.from_buffer(s)
        o[0] = b"t"
        print(b"word")    # prints b'tord'!

    f()

Python assumes byte strings don't change, and so it shares the two
occurrences of b"word" in the source code.
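
If you need writable memory, a safer pattern is to let cffi own a copy
instead (a sketch):

    >>> o = ffi.new("char[]", b"word")   # writable copy owned by cffi
    >>> o[0] = b"t"
    >>> ffi.string(o)
    b'tord'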


A bientôt,

Armin.