Erm, so, you never "have Unicode data in your strings" except when you do?
What's your use case?
> When I declare an object as bytes, and a Python string is being
> converted to Cython bytes, I would have liked the conversion to happen
> automatically. Currently I have to wrap all of these with str()
Don't do that, unless you want unpredictable, platform specific behaviour.
The right way to encode Unicode strings is to, well, encode them. Explicitly.
Just use a simple wrapper function that does the right thing for your
specific case, such as checking the input type and encoding it to your
required encoding if it's a unicode object. And then add another one that
properly decodes the values for handing them back to Python space.
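A minimal sketch of such a pair of helpers (the names, and the choice of UTF-8, are assumptions for illustration; adapt them to your required encoding):

```python
def to_bytes(s, encoding="utf-8"):
    """Encode text input to bytes for C-level use; pass bytes through as-is."""
    if isinstance(s, bytes):
        return s
    if isinstance(s, str):          # 'unicode' on Python 2
        return s.encode(encoding)
    raise TypeError("expected a string, got %s" % type(s).__name__)

def from_bytes(b, encoding="utf-8"):
    """Decode bytes back to text when handing values to Python space."""
    return b.decode(encoding) if isinstance(b, bytes) else b
```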
> I would have thought that the most
> common use case "in cython code" would have been to deal with just
> bytes and simple ASCII encoding.
Well, there is no evidence that it a) is and b) helps more than it hurts to
make this the default. Quite the contrary, it hurt Python 2 code quite a
bit to do automatic conversions. This has been fixed in Python 3.
Stefan
ASCII only has 7 bits. If you have 8-bit data, then you have unicode
strings.
>> What's your use case?
>
>
> I'm just taking an existing python application to Cython. The translation
> has gone rather smoothly when I use object for strings. I'm not changing
> objects to bytes. I thought of making everything bytes so that the cast to
> char * would be easier if I needed to.
What do you mean by "object"? I think the proper distinction in this
context should be between bytes and unicode (Python 2) or bytes and str
(Python 3).
> I think I understand the memory implications. cdef char, bytes will use 1
> byte per char and str will use 2.
Python uses UTF-8 internally, so an ASCII string will take 1 byte per
character. The representation of ASCII characters in UTF-8 is unchanged.
> What about the performance implications?
> Is the cast between bytes and str pretty fast?
A cast does not take time, it's a compile time operation in C. Are you
talking about the decoding and encoding process? This depends on the
codec, I believe that if it's already UTF-8 then the buffer just needs
to be copied.
Best,
-Nikolaus
--
»Time flies like an arrow, fruit flies like a Banana.«
PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C
What is "them"? If you declare a variable as char*, but then assign a
Python object to it, Cython will automatically call PyBytes_AsString()
for you. If the Python object is bytes, it will be returned as-is. If
the Python object is unicode, it will be encoded in the default encoding
(mostly this is UTF-8, if you want a different encoding, you have to
call the conversion function manually). If it's anything else, you get
an exception: http://docs.python.org/c-api/string.html#PyString_AsString
>> A cast does not take time, it's a compile time operation in C. Are you
>> talking between the decoding and encoding process? This depends on the
>> codec, I believe that if it's already UTF-8 then the buffer just needs
>> to be copied.
>>
>>
> I thought the recommendation was to use the encoding and decoding function.
Yes, it is. But on most systems the encoding and decoding will just be a
copy operation.
> But how does Cython know that it is UTF-8? It would still have to call the
> cpython function and then it would just be a buffer copy.
I don't understand. If you encode or decode something, you have to
specify the codec. That's how Python/Cython knows if it's UTF-8.
Note, of course, the resulting char* only has the lifetime of the
string/unicode object.
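The "you have to specify the codec" point is easy to see in plain Python, where the same calls Cython generates are available directly; the codec is named explicitly at the call site, so there is no guessing:

```python
u = "h\u00e9llo"                 # a unicode string with one non-ASCII character
b = u.encode("utf-8")            # explicit codec: the programmer says it's UTF-8
assert b == b"h\xc3\xa9llo"      # U+00E9 becomes the two bytes C3 A9
assert b.decode("utf-8") == u    # round-trips losslessly
```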
>>> A cast does not take time, it's a compile time operation in C. Are you
>>> talking between the decoding and encoding process? This depends on the
>>> codec, I believe that if it's already UTF-8 then the buffer just needs
>>> to be copied.
>>>
>>>
>> I thought the recommendation was to use the encoding and decoding function.
>
> Yes, it is. But on most systems the encoding and decoding will just be a
> copy operation.
For bytes/Py2 str, only the pointer is captured; no copying is done. For
unicode, it is stored internally as 16 or 32 bits per character
(depending on how Python was compiled, this is more memory hungry but
makes slicing/indexing much faster), so encoding/decoding is required.
>> But how does Cython know that it is UTF-8? It would still have to call the
>> cpython function and then it would just be a buffer copy.
>
> I don't understand. If you encode or decode something, you have to
> specify the codec. That's how Python/Cython knows if it's UTF-8.
Unfortunately, you can't do
cdef char* s = my_unicode_object.encode("utf-8")
as that would allocate the encoded string in a temporary object which
would be immediately garbage collected. Using
cdef char* s = my_unicode_or_str_object
gets around this as if it's a unicode object the encoded bytes are
cached on that same object. (The drawback is that it's not as explicit
about the encoding). The difficulty in dealing with strings stems from
the fact that (1) unlike scalars it's not always easy to manage who
"owns" the resulting pointer (i.e. who and when should clean it up)
and (2) there is a variety of strongly held opinions on how explicit
the encoding needs to be vs. ease of use (mostly for scientific/numeric
users' convenience) in the pure-ASCII case.
As a question to the original poster, is there a reason you need to
convert/store these results as char* rather than just holding on to
the original Python unicode objects themselves (and skirting the whole
issue)? I have little context to go on, but it sounds like it could be
a case of excessive/unnecessary typing.
- Robert
That depends a lot on your use case.
> Couple of more questions:
> 1. So are you suggesting that I use cdef object or cdef str for my python
> strings.
Yes. Unless you're using some optimized feature, there's no need to
type them at all.
> 2. If I use cdef bytes will string formatting be done in c or using the
> python api? cdef bytes a, b; "%s:%s" % (a, b).
Cython does not emulate string formatting--it always happens via the
Python API. Of course, it's internally implemented in C, so there's no
clear low-hanging fruit here, and a lot of potential for subtle
incompatibilities and corner cases (not to mention the questions of
memory allocation). There's no need to type a and b above as bytes.
Also note that bytes in Python 3 is very different from str in Python 2.
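As a side note on that difference: `%`-formatting works on Python 2 `str`, but it was only added to Python 3's `bytes` type in Python 3.5 (PEP 461). A quick illustration (assuming Python 3.5 or later):

```python
a = b"foo"
b = b"bar"
# bytes %-formatting: available on Python 2 str, and again on Python 3.5+
assert b"%s:%s" % (a, b) == b"foo:bar"
# text %-formatting always works, but mixes badly with bytes on Python 3
assert "%s:%s" % ("foo", "bar") == "foo:bar"
```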
- Robert
Python does not use UTF-8 internally. It uses something more like UTF-16 or
UCS4 for unicode strings. So the memory usage will be between 2x and 4x the
size of a byte encoded bytes object (ASCII, ISO8859-x, ...), depending on
platform and content.
Stefan
Well, unicode strings tend to have considerable memory overhead and are
slow to encode/decode if you need that, but they are much easier to
optimise in Cython. So they can be faster than bytes in some cases, simply
because they have well defined, portable semantics across Python versions
that Cython can build on. The bytes type doesn't really provide that, but
it is faster to map to (and also from) a C char* because it doesn't require
encoding/decoding.
As you can guess from the above paragraph, the performance implications
depend *heavily* on your exact use case.
Since you stated that you are porting an existing Python program, which
indicates that you are not going to have a need for char* conversion, I
would suggest you stick with unicode objects everywhere for simplicity.
That greatly simplifies porting to Python 3, because Python 3 code will
almost never work with bytes objects. Be aware that Python 2 code will
often pass in 'str' objects, i.e. Python 2 bytes, although Python 2 code
will usually handle unicode strings just fine if you pass it back.
So, in any case, regardless of the internal format you pick, you will need
some kind of conversion on the way in (and/or out), either bytes->unicode
or unicode->bytes. The good thing is that you can adapt your conversion
function at C compile time depending on the CPython version you are
compiling against.
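A runtime version of such a conversion helper might look like the sketch below. In Cython itself one would branch on the CPython version at C compile time instead, and the helper name here is made up for illustration:

```python
import sys

if sys.version_info[0] >= 3:
    def ensure_text(s, encoding="utf-8"):
        # Python 3: decode incoming bytes, pass str through unchanged
        return s.decode(encoding) if isinstance(s, bytes) else s
else:
    def ensure_text(s, encoding="utf-8"):
        # Python 2: 'str' is the bytes type here
        return s.decode(encoding) if isinstance(s, str) else s
```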
I admit that these things are not exactly trivial, but most of the problems
are due to the inner workings of Python 2 and have been fixed in Python 3.
Stefan
Absolutely. And there certainly is some documentation on this topic:
http://wiki.cython.org/enhancements/stringliterals
http://docs.cython.org/src/tutorial/strings.html
Stefan
Well, what kind of things are you expecting it to optimize? The only
reason to declare it unicode is if the compiler can do something
special because it's a unicode. "cdef object" will have all the
benefits of a cdef variable.
- Robert
Since it doesn't know the type, it won't optimise anything special about
it. To get optimisation, declare it as "cdef unicode s".
Stefan