I have a Cython module that is currently usable only in Python 2 because it
calls the function PyString_FromStringAndSize, which does not exist in
Python 3. I'd now like to use the module in Python 3 as well, where the
appropriate function to call, I suppose, would be PyUnicode_FromStringAndSize.
The current code looks like this (I've tried to come up with a minimal
example. In the real module, "hello" is a non-zero-terminated string):

    from cpython cimport PyString_FromStringAndSize

    def f():
        return PyString_FromStringAndSize("hello", 5)
For Python 3 compatibility, I could change this to:
    cdef extern from "Python.h":
        object PyUnicode_FromStringAndSize(char *u, Py_ssize_t size)

    def f():
        return PyUnicode_FromStringAndSize("hello", 5)
But this would change the semantics of the function f in Python 2.x code
(returning unicode instead of str).
Is there a simple solution that allows me to maintain Python 2 and 3
compatibility from the same generated C code? Perhaps a Cython equivalent to
the C preprocessor code "#if PY_MAJOR_VERSION < 3"?
Marcel
Ignoring the issues of encoding that you're already glossing over, you
could do str(PyUnicode_FromStringAndSize(c_string, size)) or even
str(c_string[:size]). Alternatively, you could make a macro in a .h
file and "cdef extern from" it, but that would probably be overkill.
- Robert
Depends. Do you want to return a Unicode string (i.e., are you dealing with
text), or is a byte string (or bytearray) the right thing to use (because
you are dealing with binary data)?
> The current code looks like this (I've tried to come up with a minimal
> example. In the real module, "hello" is a non-zero-terminated string):
>
> from cpython cimport PyString_FromStringAndSize
>
> def f():
>     return PyString_FromStringAndSize("hello", 5)
If it really was a literal, you'd just write
    def f():
        return "hello"
If you want to return a byte string from a char*, you'd write
    def f():
        return some_char_ptr[:length]
If you want to return a Unicode string, you'd write
    def f():
        return some_char_ptr[:length].decode(the_encoding)
Note that the latter two may need appropriate error handling if the
string object allocation or the decoding fails.
Also see here:
http://docs.cython.org/src/tutorial/strings.html
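The two conversions above can be sketched in pure Python; the `raw` buffer
and `length` below are illustrative stand-ins for a C `char*` and its size:

```python
# Pure-Python sketch of the byte-string vs. Unicode conversions above.
raw = b"hello world"   # plays the role of a non-NUL-terminated char*
length = 5

as_bytes = raw[:length]                 # byte string: some_char_ptr[:length]
as_text = raw[:length].decode("ascii")  # Unicode: ...[:length].decode(...)

assert as_bytes == b"hello"
assert as_text == "hello"

# Decoding is where error handling may be needed: invalid input raises.
try:
    b"\xff\xfe".decode("ascii")
except UnicodeDecodeError:
    pass  # decoding failures surface as exceptions
```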
> For Python 3 compatibility, I could change this to:
>
> cdef extern from "Python.h":
> object PyUnicode_FromStringAndSize(char *u, Py_ssize_t size)
>
> def f():
>     return PyUnicode_FromStringAndSize("hello", 5)
>
> But this would change the semantics of the function f in Python 2.x code
> (returning unicode instead of str).
Right. Then the question is: is that a problem or the right thing to do?
> Is there a simple solution that allows me to maintain Python 2 and 3
> compatibility from the same generated C code?
It appears to be rather common for users to confuse Unicode handling with
Python 3 compatibility. You do not have to do different things in Py2 and
Py3. In fact, it's often best to let code in both behave the same,
especially when Unicode text is involved. Py2 is capable of handling
Unicode, even if it has several quirks that required a major version update
to fix. If you use Unicode strings right away in Python 2, you won't
normally notice them.
> Perhaps a Cython equivalent to
> the C preprocessor code "#if PY_MAJOR_VERSION < 3"?
Well, if you really want to return different things in Python 2 and Python
3, then you necessarily have to check for the version in order to alter the
behaviour, yes.
You can do this, for example:
    from cpython.version cimport PY_MAJOR_VERSION

    cdef object to_platform_specific_ascii_str(char* s, size_t length):
        if PY_MAJOR_VERSION < 3:
            return s[:length]
        else:
            return s[:length].decode("ascii")

    def f():
        return to_platform_specific_ascii_str(some_char_ptr, length)
Note that anything but ASCII characters are generally not safe to return in
a byte string in Python 2 because they may trigger platform specific
behaviour in user code. If your strings contain characters that the ASCII
encoding cannot represent, you should switch to Unicode strings also in Py2.
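The helper above has a direct pure-Python analogue, with `sys.version_info`
playing the role of PY_MAJOR_VERSION and a bytes object standing in for the
`char*` (a sketch only; the real code works on C data):

```python
import sys

def to_platform_specific_ascii_str(buf, length):
    # Analogue of the cdef helper: return the native str type of the
    # running interpreter (byte string on Py2, Unicode string on Py3).
    if sys.version_info[0] < 3:
        return buf[:length]                  # Py2: native str is bytes
    return buf[:length].decode("ascii")      # Py3: native str is Unicode

result = to_platform_specific_ascii_str(b"hello world", 5)
assert type(result) is str  # the native str type on either major version
```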
Stefan
Actually, now that I think about this, you could also type the return value
of the helper function as 'str' if you are sure to return a byte string in
Py2 and a unicode string in Py3:
    from cpython.version cimport PY_MAJOR_VERSION

    cdef str to_platform_specific_ascii_str(char* s, size_t length):
        if PY_MAJOR_VERSION < 3:
            return s[:length]
        else:
            return s[:length].decode("ascii")

    def f():
        return to_platform_specific_ascii_str(some_char_ptr, length)
Stefan
Thanks to both of you for your replies!
On Wednesday, November 30, 2011 10:53:13 AM Stefan Behnel wrote:
> > But this would change the semantics of the function f in Python 2.x code
> > (returning unicode instead of str).
>
> Right. Then the question is: is that a problem or the right thing to do?
The cython code is actually not mine, I'm just trying to port it. In order to
increase chances that my modifications get accepted, I'd like to change the
behavior under Python 2 as little as possible for now.
[whether to use PyUnicode_FromStringAndSize at all]
> Depends. Do you want to return a Unicode string (i.e., are you dealing with
> text), or is a byte string (or bytearray) the right thing to use (because
> you are dealing with binary data)?
I'm aware that I should distinguish both. I haven't checked all instances in
the source where the function is used, but it seems that it's used with both
binary and text data.
> Actually, now that I think about this, you could also type the return value
> of the helper function as 'str' if you are sure to return a byte string in
> Py2 and a unicode string in Py3:
>
> from cpython.version cimport PY_MAJOR_VERSION
>
> cdef str to_platform_specific_ascii_str(char* s, size_t length):
>     if PY_MAJOR_VERSION < 3:
>         return s[:length]
>     else:
>         return s[:length].decode("ascii")
>
> def f():
>     return to_platform_specific_ascii_str(some_char_ptr, length)
This looks like it's very close to what I want to achieve! I wasn't aware that
slice notation could be used like that on char pointers (I see that it's
mentioned in the tutorial - sorry for not noticing earlier). That will also
simplify the code in some places.
Ok, I just tried it and it fails on this line:
    return s[:length]
with the message:
"Cannot convert 'bytes' object to str implicitly. This is not portable to
Py3."
If I leave out the "str" return type declaration, it does work, however!
I'll try to use the version without 'str' for now, but I think I'll also have
to go over the code and actually understand what's being done in order to
decide whether I should use bytes or str in each case.
Thanks again for the help!
Marcel
I always favoured a clean separation here, but maybe we can just change
this into a warning and add a runtime check. The above code would then
basically become this (internally):
    if PY_MAJOR_VERSION < 3:
        _bytes_temp = s[:length]
        if PY_MAJOR_VERSION >= 3:
            raise TypeError("cannot convert bytes to str")
        return _bytes_temp
    else:
        _unicode_temp = s[:length].decode("ascii")
        if PY_MAJOR_VERSION < 3:
            raise TypeError("cannot convert unicode to str")
        return _unicode_temp
and the C compiler would drop the checks depending on the compile time
environment. This may look stupid at first, but it makes sense when you
take away the user code version checks.
> I'll try to use the version without 'str' for now, but I think I'll also have
> to go over the code and actually understand what's being done in order to
> decide whether I should use bytes or str in each case.
Yes, that's the trouble with Py2. Fixing it up in the source by using
explicit conversion functions, even if you don't change the behaviour in
Py2, will make this a lot safer and clearer.
Stefan
I think this makes a lot of sense. Compile-time errors are nice, but in my
case I had to rely on runtime errors anyway to find those lines that needed
fixing for Python 3 compatibility.
> > I'll try to use the version without 'str' for now, but I think I'll also
> > have to go over the code and actually understand what's being done in
> > order to decide whether I should use bytes or str in each case.
>
> Yes, that's the trouble with Py2. Fixing it up in the source by using
> explicit conversion functions, even if you don't change the behaviour in
> Py2, will make this a lot safer and clearer.
I've mostly finished my port of the library. If anyone is interested, you can
find the code here: http://code.google.com/r/marcelmartin-pysam3/
For completeness, this is the version of the above function that I use. To get
closer to the original semantics of PyString_FromStringAndSize, which allows
for the first argument to be a null pointer, I added "or s == NULL". If you
don't do this, you'll get a segmentation fault when s[:length].decode() is
called.
    cdef _from_string_and_size(char* s, size_t length):
        if PY_MAJOR_VERSION < 3 or s == NULL:
            return s[:length]
        else:
            if s == NULL:
                return s[:length]
            else:
                return s[:length].decode("ascii")
Thanks again for the help,
Marcel
--
Dipl.-Inform. Marcel Martin, https://ls11-www.cs.tu-dortmund.de/staff/martin/
This has a clear code smell (at least). It relies on implicit assumptions
about how the conversion works. I assume that your intention is to consider
NULL an error? Sadly, I cannot be sure from your code.
I prefer handling errors explicitly by raising an exception, either
directly in the outside code, or, if the error detection (s==NULL) and the
error handling (a specific exception is raised) are exactly the same in all
cases that call the above function, you can raise it explicitly in the
conversion function. However, I doubt that this case exists. A NULL pointer
can occur for different reasons, be it simply an empty value or a failed
memory allocation, and just these two examples already tend to require very
different handling.
Stefan
Ok, I realize now that the code doesn't do what I want at all (it works only
by accident, and the unit tests seem too incomplete to catch it). The original
intention was to get the behavior of PyString_FromStringAndSize, whose docs
say: "If v is NULL, the contents of the string are uninitialized." That
behavior was used in the original code, where the string was first allocated
using that function and then filled in. This avoids one copy of a potentially
large string.
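The allocate-then-fill pattern can be sketched with a Python bytearray (the
helper name and fill values below are made up for illustration).
PyString_FromStringAndSize(NULL, n) hands back n uninitialized bytes, while
bytearray(n) gives n zero bytes, but both let the caller fill a single buffer
in place instead of building the data elsewhere and copying it in:

```python
def make_filled(n):
    # Sketch of allocate-then-fill: one buffer of length n, filled in
    # place. The fill values (ASCII letters) are illustrative only.
    buf = bytearray(n)            # one allocation, zero-initialized
    for i in range(n):
        buf[i] = 65 + (i % 26)    # fill in place, as the C code would
    return bytes(buf)             # bytes() makes one final copy; the
                                  # C API variant avoids even that
```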
Does it make sense to define this function for that purpose?
    cdef inline bytes _uninitialized_bytes(size_t length):
        '''Return a bytes object of the given length with uninitialized content.'''
        return (<char*>NULL)[:length]
In that case, I'd just go with "PyBytes_FromStringAndSize(NULL, length)".
Stefan