Convert a non-zero terminated wchar_t*

Tin Tvrtković

Feb 16, 2016, 7:51:51 PM

to python-cffi

Hi,

I'm trying to get Python unicode objects from a wchar_t* and its length. ffi.string worked great until I needed to allow null bytes in the string. The docs suggest using ffi.buffer, but I'm not sure how to get a Python unicode string from a buffer once I have it. What am I missing?

Thanks in advance!

Armin Rigo

Feb 17, 2016, 6:22:04 AM

to pytho...@googlegroups.com

Hi Tin,

Uh, it seems that cffi is the one missing something. There is only
ffi.string() that can return a Python unicode string, and this would
always stop at the first null character. I'll see how to add this...

Armin

Tin Tvrtković

Feb 17, 2016, 4:26:52 PM

to python-cffi, ar...@tunes.org

Thanks for being so responsive and helpful, Armin!

Armin Rigo

Feb 20, 2016, 10:35:16 AM

to pytho...@googlegroups.com

Hi,

In reply to https://bitbucket.org/cffi/cffi/issues/249/convert-wchar_t-with-null-bytes-to-python
:

To repeat, we need a way to do the same as ``ffi.buffer(p, size)[:]''
where ``p'' is a ``wchar_t *'' instead of a ``char *''. Given that
using ffi.buffer() for that purpose may be slightly unexpected, I'm
thinking about adding a different way to do exactly that, which could
work with ``char *'' or ``wchar_t *''. Two options I can think of:

1. ffi.rawstring(p, size)
2. ffi.string(p, total=size) (similar to the current ffi.string(p,
maxlen=size))

Or maybe a more general solution to the problem of turning a pointer
to C data back to a Python str/unicode/list, something that would also
accept ``int *'' and turn it into a Python list of integers. That
would give us this third option:

3. ffi.unpack(p, size) (suggestion for better name? returns a
str/unicode/list depending on the ctype of p)

Unclear if it's worth replacing ``[p[i] for i in range(n)]'', which is
not too bad if you really want to build a list. It could be
special-cased inside PyPy, though. For example, ffi.unpack() on a
``long *'' could make directly a list of Python ints with just a
memcpy, thanks to PyPy's list-of-integers optimization.

A bientôt,

Armin.

Armin Rigo

Mar 15, 2016, 1:46:44 PM

to pytho...@googlegroups.com

Hi Tin,

On 17 February 2016 at 01:51, Tin Tvrtković <tinch...@gmail.com> wrote:

Here is a different answer that works now, though not very
efficiently:

wchar_size = ffi.sizeof("wchar_t")
wchar_encoding = "utf-16" if wchar_size == 2 else "utf-32"

buf = ffi.buffer(ffi.cast("char *", p), length * wchar_size)
u = buf[:].decode(wchar_encoding)
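
One caveat with this workaround, as far as Python's standard codecs go: the plain "utf-16" and "utf-32" codecs treat a leading U+FEFF as a byte order mark and drop it when decoding. A minimal sketch that pins the byte order to the machine's native one instead. The helper name decode_wchar_bytes is hypothetical (not part of cffi), and it assumes wchar_t is 2 or 4 bytes wide:

```python
import sys


def decode_wchar_bytes(raw, wchar_size):
    """Decode raw native-endian wchar_t bytes into a unicode string.

    Hypothetical helper for illustration; not a cffi API.
    """
    if wchar_size not in (2, 4):
        raise ValueError("unexpected wchar_t size: %r" % (wchar_size,))
    # Name the byte order explicitly so that a leading U+FEFF is kept
    # as data instead of being consumed as a byte order mark.
    suffix = "le" if sys.byteorder == "little" else "be"
    codec = ("utf-16-" if wchar_size == 2 else "utf-32-") + suffix
    return raw.decode(codec)
```

With the names from the snippet above, this would be called as decode_wchar_bytes(buf[:], wchar_size).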

A bientôt,

Armin.

Tin Tvrtković

Mar 20, 2016, 5:32:26 PM

to python-cffi, ar...@tunes.org

Thank you Armin, that does work. :)

Armin Rigo

Apr 15, 2016, 11:40:05 AM

to pytho...@googlegroups.com

Hi,

On 20 February 2016 at 16:34, Armin Rigo <ar...@tunes.org> wrote:
> 1. ffi.rawstring(p, size)
> 2. ffi.string(p, total=size)

> 3. ffi.unpack(p, size)

After more thought: for a list of ints, for example, ffi.unpack() is
not necessary, because PyPy already does the optimal memcpy for this
kind of code (the slicing only creates a <cdata 'int[size]'>, without
making a copy by itself):

# p is some <cdata 'int *'>
lst = list(p[0:size])

From a <cdata 'char[size]'> to a byte string, we can use

ffi.buffer(p, size)[:]

which is equivalent to

ffi.buffer(p[0:size])[:]

but we have no equivalent from <cdata 'wchar_t[size]'> to a unicode
string. Calling unicode(p[0:size]) can't really work, because that's
just str() on Python 3. The missing piece seems thus to be either one
of the solutions 1 or 2.
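
The conversions discussed above can be tried out with a small self-contained sketch. It assumes cffi is installed and uses ffi.new() only to create sample data; the variable names are made up for illustration:

```python
from cffi import FFI

ffi = FFI()
size = 4

# int data: slicing yields a cdata array view, and list() copies the items.
ints = ffi.new("int[]", [10, 20, 30, 40])
lst = list(ints[0:size])

# char data: both spellings copy exactly `size` bytes, embedded zero included.
chars = ffi.new("char[]", b"ab\x00c")
raw_a = ffi.buffer(chars, size)[:]
raw_b = ffi.buffer(chars[0:size])[:]

# wchar_t data: ffi.string() stops at the first zero, so the tail is lost.
wide = ffi.new("wchar_t[]", u"ab\x00c")
truncated = ffi.string(wide[0:size])
```

The last line shows the gap: the slice carries the full length, but ffi.string() only treats it as a maximum.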

Moreover, solution 2 behaves differently from the "list(p[0:size])":
right now, you can call "ffi.string(p[0:size])", but it will stop at
the first zero (it still uses 'size', but only as the maximum length).
That leaves solution 1 as the minimal missing piece:

ffi.rawstring(p[0:size]) => byte- or unicode string

with or without the equivalent "ffi.rawstring(p, size)". I'm tempted
to say "without", because it makes it clearer that ffi.rawstring() is
the string equivalent of list() in the usages above.

Note that the reverse direction already works, using for example
"p[0:size] = some_unicode" if p is a <cdata 'wchar_t *'>.
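
A minimal sketch of that reverse direction, assuming cffi is installed. The buffer is allocated with ffi.new() purely for illustration, and the result is read back item by item:

```python
from cffi import FFI

ffi = FFI()

text = u"ab\x00cd"
size = len(text)

# Allocate `size` zero-initialized wchar_t items, with no extra terminator.
p = ffi.new("wchar_t[]", size)

# Slice assignment copies every character, including the embedded zero.
p[0:size] = text

# Each item of a wchar_t array reads back as a one-character unicode string.
chars = [p[i] for i in range(size)]
```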

So it looks like I'll add ffi.rawstring(), and also add the content of
this mail to the documentation :-)

A bientôt,

Armin.

Armin Rigo

Apr 22, 2016, 8:29:49 AM

to pytho...@googlegroups.com

Hi,

On 15 April 2016 at 17:39, Armin Rigo <ar...@tunes.org> wrote:
> So it looks like I'll add ffi.rawstring(), and also add the content of
> this mail to the documentation :-)

It turned out that I eventually implemented ``ffi.unpack(p, length)`` in
cffi 1.6. It is the most natural way to give what we want, and for
non-wchar_t cases it replaces some slightly convoluted ways to express
the same thing. See the docs at
http://cffi.readthedocs.org/en/latest/ref.html#ffi-string-ffi-unpack .

A bientôt,

Armin.
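
For readers landing on this thread, a short sketch of how ffi.unpack() answers the original question. It assumes cffi 1.6 or later is installed; the sample data is created with ffi.new() only for illustration:

```python
from cffi import FFI

ffi = FFI()

# A wchar_t array holding 5 characters (one of them zero) plus a terminator.
wide = ffi.new("wchar_t[]", u"ab\x00cd")

stops_early = ffi.string(wide)      # stops at the embedded zero
full_text = ffi.unpack(wide, 5)     # returns exactly 5 characters

# The same call works for char data (bytes) and other item types (list).
raw = ffi.unpack(ffi.new("char[]", b"x\x00y"), 3)
numbers = ffi.unpack(ffi.new("int[]", [1, 2, 3]), 3)
```

Unlike ffi.string(), the length passed to ffi.unpack() is exact rather than a maximum, so the caller has to know how many items are valid.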
