I have a Cython module that is currently usable only in Python 2 because it
calls the function PyString_FromStringAndSize, which does not exist in
Python 3. I'd now like to use the module in Python 3 as well, where the
appropriate function to call, I suppose, would be PyUnicode_FromStringAndSize.
The current code looks like this (I've tried to come up with a minimal
example. In the real module, "hello" is a non-zero-terminated string):

    from cpython cimport PyString_FromStringAndSize

    def f():
        return PyString_FromStringAndSize("hello", 5)
For Python 3 compatibility, I could change this to:
    cdef extern from "Python.h":
        object PyUnicode_FromStringAndSize(char *u, Py_ssize_t size)

    def f():
        return PyUnicode_FromStringAndSize("hello", 5)
But this would change the semantics of the function f in Python 2.x code
(returning unicode instead of str).
Is there a simple solution that allows me to maintain Python 2 and 3
compatibility from the same generated C code? Perhaps a Cython equivalent to
the C preprocessor code "#if PY_MAJOR_VERSION < 3"?
Marcel
Ignoring the issues of encoding that you're already glossing over, you
could do str(PyUnicode_FromStringAndSize(c_string, size)) or even
str(c_string[:size]). Alternatively, you could make a macro in a .h
file and "cdef extern from" it, but that would probably be overkill.
- Robert
Depends. Do you want to return a Unicode string (i.e., are you dealing with
text), or is a byte string (or bytearray) the right thing to use (because
you are dealing with binary data)?
> The current code looks like this (I've tried to come up with a minimal
> example. In the real module, "hello" is a non-zero-terminated string):
>
> from cpython cimport PyString_FromStringAndSize
>
> def f():
>     return PyString_FromStringAndSize("hello", 5)
If it really was a literal, you'd just write
    def f():
        return "hello"
If you want to return a byte string from a char*, you'd write
    def f():
        return some_char_ptr[:length]
If you want to return a Unicode string, you'd write
    def f():
        return some_char_ptr[:length].decode(the_encoding)
Note that the latter two may need appropriate error handling if the
string object allocation or the decoding fails.
Also see here:
http://docs.cython.org/src/tutorial/strings.html
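The two conversions above can be sketched in pure Python; the `raw` buffer
and `length` below are illustrative stand-ins for a C `char*` and its size:

```python
# Pure-Python sketch of the byte-string vs. Unicode conversions above.
raw = b"hello world"   # plays the role of a non-NUL-terminated char*
length = 5

as_bytes = raw[:length]                 # byte string: some_char_ptr[:length]
as_text = raw[:length].decode("ascii")  # Unicode: ...[:length].decode(...)

assert as_bytes == b"hello"
assert as_text == "hello"

# Decoding is where error handling may be needed: invalid input raises.
try:
    b"\xff\xfe".decode("ascii")
except UnicodeDecodeError:
    pass  # decoding failures surface as exceptions
```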
> For Python 3 compatibility, I could change this to:
>
> cdef extern from "Python.h":
> object PyUnicode_FromStringAndSize(char *u, Py_ssize_t size)
>
> def f():
>     return PyUnicode_FromStringAndSize("hello", 5)
>
> But this would change the semantics of the function f in Python 2.x code
> (returning unicode instead of str).
Right. Then the question is: is that a problem or the right thing to do?
> Is there a simple solution that allows me to maintain Python 2 and 3
> compatibility from the same generated C code?
It appears to be rather common for users to confuse Unicode handling with
Python 3 compatibility. You do not have to do different things in Py2 and
Py3. In fact, it's often best to let code in both behave the same,
especially when Unicode text is involved. Py2 is capable of handling
Unicode, even if it has several quirks that required a major version update
to fix. If you use Unicode strings right away in Python 2, you won't
normally notice them.
> Perhaps a Cython equivalent to
> the C preprocessor code "#if PY_MAJOR_VERSION < 3"?
Well, if you really want to return different things in Python 2 and Python
3, then you necessarily have to check for the version in order to alter the
behaviour, yes.
You can do this, for example:
    from cpython.version cimport PY_MAJOR_VERSION

    cdef object to_platform_specific_ascii_str(char* s, size_t length):
        if PY_MAJOR_VERSION < 3:
            return s[:length]
        else:
            return s[:length].decode("ascii")

    def f():
        return to_platform_specific_ascii_str(some_char_ptr, length)
Note that anything but ASCII characters are generally not safe to return in
a byte string in Python 2 because they may trigger platform specific
behaviour in user code. If your strings contain characters that the ASCII
encoding cannot represent, you should switch to Unicode strings also in Py2.
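The helper above has a direct pure-Python analogue, with `sys.version_info`
playing the role of PY_MAJOR_VERSION and a bytes object standing in for the
`char*` (a sketch only; the real code works on C data):

```python
import sys

def to_platform_specific_ascii_str(buf, length):
    # Analogue of the cdef helper: return the native str type of the
    # running interpreter (byte string on Py2, Unicode string on Py3).
    if sys.version_info[0] < 3:
        return buf[:length]                  # Py2: native str is bytes
    return buf[:length].decode("ascii")      # Py3: native str is Unicode

result = to_platform_specific_ascii_str(b"hello world", 5)
assert type(result) is str  # the native str type on either major version
```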
Stefan
Actually, now that I think about this, you could also type the return value
of the helper function as 'str' if you are sure to return a byte string in
Py2 and a unicode string in Py3:
    from cpython.version cimport PY_MAJOR_VERSION

    cdef str to_platform_specific_ascii_str(char* s, size_t length):
        if PY_MAJOR_VERSION < 3:
            return s[:length]
        else:
            return s[:length].decode("ascii")

    def f():
        return to_platform_specific_ascii_str(some_char_ptr, length)
Stefan
Thanks to both of you for your replies!
On Wednesday, November 30, 2011 10:53:13 AM Stefan Behnel wrote:
> > But this would change the semantics of the function f in Python 2.x code
> > (returning unicode instead of str).
>
> Right. Then the question is: is that a problem or the right thing to do?
The cython code is actually not mine, I'm just trying to port it. In order to
increase chances that my modifications get accepted, I'd like to change the
behavior under Python 2 as little as possible for now.
[whether to use PyUnicode_FromStringAndSize at all]
> Depends. Do you want to return a Unicode string (i.e., are you dealing with
> text), or is a byte string (or bytearray) the right thing to use (because
> you are dealing with binary data)?
I'm aware that I should distinguish both. I haven't checked all instances in
the source where the function is used, but it seems that it's used with both
binary and text data.
> Actually, now that I think about this, you could also type the return value
> of the helper function as 'str' if you are sure to return a byte string in
> Py2 and a unicode string in Py3:
>
> from cpython.version cimport PY_MAJOR_VERSION
>
> cdef str to_platform_specific_ascii_str(char* s, size_t length):
>     if PY_MAJOR_VERSION < 3:
>         return s[:length]
>     else:
>         return s[:length].decode("ascii")
>
> def f():
>     return to_platform_specific_ascii_str(some_char_ptr, length)
This looks like it's very close to what I want to achieve! I wasn't aware that
slice notation could be used like that on char pointers (I see that it's
mentioned in the tutorial - sorry for not noticing earlier). That will also
simplify the code in some places.
Ok, I just tried it and it fails on this line:
    return s[:length]
with the message:
"Cannot convert 'bytes' object to str implicitly. This is not portable to
Py3."
If I leave out the "str" return type declaration, it does work, however!
I'll try to use the version without 'str' for now, but I think I'll also have
to go over the code and actually understand what's being done in order to
decide whether I should use bytes or str in each case.
Thanks again for the help!
Marcel
I always favoured a clean separation here, but maybe we can just change
this into a warning and add a runtime check. The above code would then
basically become this (internally):
    if PY_MAJOR_VERSION < 3:
        _bytes_temp = s[:length]
        if PY_MAJOR_VERSION >= 3:
            raise TypeError("cannot convert bytes to str")
        return _bytes_temp
    else:
        _unicode_temp = s[:length].decode("ascii")
        if PY_MAJOR_VERSION < 3:
            raise TypeError("cannot convert unicode to str")
        return _unicode_temp
and the C compiler would drop the checks depending on the compile time
environment. This may look stupid at first, but it makes sense when you
take away the user code version checks.
> I'll try to use the version without 'str' for now, but I think I'll also have
> to go over the code and actually understand what's being done in order to
> decide whether I should use bytes or str in each case.
Yes, that's the trouble with Py2. Fixing it up in the source by using
explicit conversion functions, even if you don't change the behaviour in
Py2, will make this a lot safer and clearer.
Stefan
I think this makes a lot of sense. Compile-time errors are nice, but in my
case I had to rely on runtime errors anyway to find those lines that needed
fixing for Python 3 compatibility.
> > I'll try to use the version without 'str' for now, but I think I'll also
> > have to go over the code and actually understand what's being done in
> > order to decide whether I should use bytes or str in each case.
>
> Yes, that's the trouble with Py2. Fixing it up in the source by using
> explicit conversion functions, even if you don't change the behaviour in
> Py2, will make this a lot safer and clearer.
I've mostly finished my port of the library. If anyone is interested, you can
find the code here: http://code.google.com/r/marcelmartin-pysam3/
For completeness, this is the version of the above function that I use. To get
closer to the original semantics of PyString_FromStringAndSize, which allows
for the first argument to be a null pointer, I added "or s == NULL". If you
don't do this, you'll get a segmentation fault when s[:length].decode() is
called.
    cdef _from_string_and_size(char* s, size_t length):
        if PY_MAJOR_VERSION < 3 or s == NULL:
            return s[:length]
        else:
            if s == NULL:
                return s[:length]
            else:
                return s[:length].decode("ascii")
Thanks again for the help,
Marcel
--
Dipl.-Inform. Marcel Martin, https://ls11-www.cs.tu-dortmund.de/staff/martin/
This has a clear code smell (at least). It relies on implicit assumptions
about how the conversion works. I assume that your intention is to consider
NULL an error? Sadly, I cannot be sure from your code.
I prefer handling errors explicitly by raising an exception, either
directly in the outside code, or, if the error detection (s==NULL) and the
error handling (a specific exception is raised) are exactly the same in all
cases that call the above function, you can raise it explicitly in the
conversion function. However, I doubt that this case exists. A NULL pointer
can occur for different reasons, be it simply an empty value or a failed
memory allocation, and just these two examples already tend to require very
different handling.
Stefan
Ok, I realize now that the code doesn't do what I want at all (it works only
by accident, and the unit tests seem too incomplete to catch it). The original
intention was to get the behavior of PyString_FromStringAndSize, whose docs
say: "If v is NULL, the contents of the string are uninitialized." That
behavior was used in the original code, where the string was first allocated
using that function and then filled in. This avoids one copy of a potentially
large string.
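The allocate-then-fill pattern can be sketched with a Python bytearray (the
helper name and fill values below are made up for illustration).
PyString_FromStringAndSize(NULL, n) hands back n uninitialized bytes, while
bytearray(n) gives n zero bytes, but both let the caller fill a single buffer
in place instead of building the data elsewhere and copying it in:

```python
def make_filled(n):
    # Sketch of allocate-then-fill: one buffer of length n, filled in
    # place. The fill values (ASCII letters) are illustrative only.
    buf = bytearray(n)            # one allocation, zero-initialized
    for i in range(n):
        buf[i] = 65 + (i % 26)    # fill in place, as the C code would
    return bytes(buf)             # bytes() makes one final copy; the
                                  # C API variant avoids even that
```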
Does it make sense to define this function for that purpose?
    cdef inline bytes _uninitialized_bytes(size_t length):
        '''Return a bytes object of the given length with uninitialized content.'''
        return (<char*>NULL)[:length]
In that case, I'd just go with "PyBytes_FromStringAndSize(NULL, length)".
Stefan