cython bytes and python strings/unicode


pytrade

Jan 11, 2011, 10:18:21 AM
to cython-users
There is a lot of information on strings, and I am still confused after
going through most of the documentation. In my particular case I know
that I will never have unicode data in my python strings.

There are several external systems that return unicode strings that I
have no control over (ex: mongodb).

When I declare an object as bytes and a python string is being
converted to cython bytes, I would have liked the conversion to happen
automatically. Currently I have to wrap all of these with str() [If I
don't, I get multiple errors]. I would have thought that the most
common use case "in cython code" would have been to deal with just
bytes and simple ASCII encoding.

Stefan Behnel

Jan 11, 2011, 10:53:27 AM
to cython...@googlegroups.com
pytrade, 11.01.2011 16:18:

> There is a lot of information on strings, and I am still confused after
> going through most of the documentation. In my particular case I know
> that I will never have unicode data in my python strings.
>
> There are several external systems that return unicode strings that I
> have no control over (ex: mongodb).

Erm, so, you never "have Unicode data in your strings" except when you do?

What's your use case?


> When I declare an object as bytes and a python string is being
> converted to cython bytes, I would have liked the conversion to happen
> automatically. Currently I have to wrap all of these with str()

Don't do that, unless you want unpredictable, platform specific behaviour.
The right way to encode Unicode strings is to, well, encode them. Explicitly.

Just use a simple wrapper function that does the right thing for your
specific case, such as checking the input type and encoding it to your
required encoding if it's a unicode object. And then add another one that
properly decodes the values for handing them back to Python space.
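
A minimal sketch of such a pair of helpers, written as plain Python 3 (where text is `str`; on Python 2 the text type would be `unicode`). The names `to_bytes` and `to_text` and the ASCII default are assumptions made for this illustration, not anything Cython provides:

```python
def to_bytes(value, encoding="ascii"):
    """Return `value` as bytes, encoding it explicitly if it is text."""
    if isinstance(value, bytes):
        return value                      # already bytes: pass through untouched
    if isinstance(value, str):            # `unicode` on Python 2
        return value.encode(encoding)     # raises UnicodeEncodeError on non-ASCII text
    raise TypeError("expected bytes or text, got %s" % type(value).__name__)


def to_text(value, encoding="ascii"):
    """Return `value` as text, decoding it explicitly if it is bytes."""
    if isinstance(value, str):            # `unicode` on Python 2
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)     # raises UnicodeDecodeError on bytes >= 0x80
    raise TypeError("expected bytes or text, got %s" % type(value).__name__)
```

With the strict ASCII default, data that is not pure ASCII fails loudly at the boundary instead of being converted silently, which is the point of encoding explicitly.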


> I would have thought that the most
> common use case "in cython code" would have been to deal with just
> bytes and simple ASCII encoding.

Well, there is no evidence that it a) is and b) helps more than it hurts to
make this the default. Quite the contrary, it hurt Python 2 code quite a
bit to do automatic conversions. This has been fixed in Python 3.

Stefan

Vineet Jain

Jan 11, 2011, 1:03:10 PM
to cython...@googlegroups.com
> Erm, so, you never "have Unicode data in your strings" except when you do?

To be clear: I have only ascii-8 data in my Python strings/unicode data.

> What's your use case?

I'm just taking an existing Python application to Cython. The translation has gone rather smoothly when I use object for strings. I'm not changing objects to bytes. I thought of making everything bytes so that the cast to char * would be easier if I needed to.

I think I understand the memory implications. cdef char, bytes will use 1 byte per char and str will use 2. What about the performance implications? Is the cast between bytes and str pretty fast?

> Just use a simple wrapper function that does the right thing for your specific case, such as checking the input type and encoding it to your required encoding if it's a unicode object. And then add another one that properly decodes the values for handing them back to Python space.

Can this function be inlined? Will this be a C call or will it call a CPython function? What function should I use?
 

> Well, there is no evidence that it a) is and b) helps more than it hurts to make this the default. Quite the contrary, it hurt Python 2 code quite a bit to do automatic conversions. This has been fixed in Python 3.

I agree as far as Python 2 and Python 3 are concerned. Since Python is extensively used for web applications and data processing, handling unicode data is probably a common requirement.

Although, I'm not sure about Cython. For the numpy/scipy users, handling unicode is probably a low priority. I'm not sure what it is for people interfacing with external systems or for people using Cython to write extension modules to support both Python 2 and Python 3.

It would be nice if there was a flag that you could pass to cython to make this conversion automatic for people who know that they will only have ascii-8 data in their python strings/unicode objects. 

So far Cython has been quite intuitive except for 

- numpy record arrays and indexing differences. In Python I have to write numpy_array['columnname'] and in Cython numpy_array.column_name

- Strings. Hence my message, to see if there is an easier way to move forward.

Nikolaus Rath

Jan 11, 2011, 1:55:57 PM
to cython...@googlegroups.com
Vineet Jain <vinj...@gmail.com> writes:
>>
>> Erm, so, you never "have Unicode data in your strings" except when you do?
>>
>>
> To be clear. I have only ascii-8 data in my python strings/unicode data.

ASCII only has 7bit. If you have 8bit data, then you have unicode
strings.
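
One rough way to check the "ASCII only" claim against real data, sketched in Python 3. The helper name `is_ascii` is made up for this example:

```python
def is_ascii(data):
    """Return True if every byte (or code point) in `data` fits in 7 bits."""
    if isinstance(data, bytes):
        # Iterating over bytes yields integers on Python 3.
        return all(byte < 0x80 for byte in data)
    # For text, compare each character's code point instead.
    return all(ord(ch) < 0x80 for ch in data)
```

For instance, `is_ascii(b"ticker:IBM")` is true, while `is_ascii("caf\u00e9")` is false because U+00E9 lies outside the 7-bit range.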

>> What's your use case?
>
>
> I'm just taking an existing python application to Cython. The translation
> has gone rather smoothly when I use object for strings. I'm not changing
> objects to bytes.I thought of making everything bytes so that the cast to
> char * would be easier if I needed to.

What do you mean by "object"? I think the proper distinction in this
context should be between bytes and unicode (Python 2) or bytes and str
(Python 3).

> I think I understand the memory implications. cdef char, bytes will use 1
> byte per char and str will use 2.

Python uses UTF-8 internally, so an ASCII string will take 1 byte per
character. The representation of ASCII characters in UTF-8 is unchanged.

> What about the performance implications?
> Is the cast between bytes and str pretty fast?

A cast does not take time, it's a compile time operation in C. Are you
talking about the decoding and encoding process? This depends on the
codec; I believe that if it's already UTF-8 then the buffer just needs
to be copied.

Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C

Vineet Jain

Jan 11, 2011, 2:17:23 PM
to cython...@googlegroups.com

> ASCII only has 7bit. If you have 8bit data, then you have unicode
> strings.

I only have ASCII data.

> What do you mean by "object"? I think the proper distinction in this
> context should be between bytes and unicode (Python 2) or bytes and str
> (Python 3).

I mean if I declare them as cdef public object in my cdef class.

> A cast does not take time, it's a compile time operation in C. Are you
> talking about the decoding and encoding process? This depends on the
> codec, I believe that if it's already UTF-8 then the buffer just needs
> to be copied.

I thought the recommendation was to use the encoding and decoding function. But how does Cython know that it is UTF-8? It would still have to call the CPython function and then it would just be a buffer copy.

Thanks,

Vineet

Nikolaus Rath

Jan 11, 2011, 4:06:50 PM
to cython...@googlegroups.com
Vineet Jain <vinj...@gmail.com> writes:
>> What do you mean by "object"? I think the proper distinction in this
>> context should be between bytes and unicode (Python 2) or bytes and str
>> (Python 3).
>>
> I mean if I declare them as cdef public object in my cdef class

What is "them"? If you declare a variable as char*, but then assign a
Python object to it, Cython will automatically call PyBytes_AsString()
for you. If the Python object is bytes, it will be returned as-is. If
the Python object is unicode, it will be encoded in the default encoding
(mostly this is UTF-8, if you want a different encoding, you have to
call the conversion function manually). If it's anything else, you get
an exception: http://docs.python.org/c-api/string.html#PyString_AsString

>> A cast does not take time, it's a compile time operation in C. Are you
>> talking about the decoding and encoding process? This depends on the
>> codec, I believe that if it's already UTF-8 then the buffer just needs
>> to be copied.
>>
>>
> I thought the recommendation was to use the encoding and decoding function.

Yes, it is. But on most systems the encoding and decoding will just be a
copy operation.

> But how does Cython know that it is UTF-8? It would still have to call the
> cpython function and then it would just be a buffer copy.

I don't understand. If you encode or decode something, you have to
specify the codec. That's how Python/Cython knows if it's UTF-8.

Robert Bradshaw

Jan 11, 2011, 4:28:58 PM
to cython...@googlegroups.com
On Tue, Jan 11, 2011 at 1:06 PM, Nikolaus Rath <Niko...@rath.org> wrote:
> Vineet Jain <vinj...@gmail.com> writes:
>>> What do you mean by "object"? I think the proper distinction in this
>>> context should be between bytes and unicode (Python 2) or bytes and str
>>> (Python 3).
>>>
>> I mean if I declare them as cdef public object in my cdef class
>
> What is "them"? If you declare a variable as char*, but then assign a
> Python object to it, Cython will automatically call PyBytes_AsString()
> for you. If the Python object is bytes, it will be returned as-is. If
> the Python object is unicode, it will be encoded in the default encoding
> (mostly this is UTF-8, if you want a different encoding, you have to
> call the conversion function manually). If it's anything else, you get
> an exception: http://docs.python.org/c-api/string.html#PyString_AsString

Note, of course, the resulting char* only has the lifetime of the
string/unicode object.

>>> A cast does not take time, it's a compile time operation in C. Are you
>>> talking about the decoding and encoding process? This depends on the
>>> codec, I believe that if it's already UTF-8 then the buffer just needs
>>> to be copied.
>>>
>>>
>> I thought the recommendation was to use the encoding and decoding function.
>
> Yes, it is. But on most systems the encoding and decoding will just be a
> copy operation.

For bytes/Py2 str, only the pointer is captured; no copying is done.
For unicode, the data is stored internally as 16 or 32 bits per
character, depending on how Python was compiled (this is more memory
hungry but makes slicing/indexing much faster), so encoding/decoding
is required.

>> But how does Cython know that it is UTF-8? It would still have to call the
>> cpython function and then it would just be a buffer copy.
>
> I don't understand. If you encode or decode something, you have to
> specify the codec. That's how Python/Cython knows if it's UTF-8.

Unfortunately, you can't do

cdef char* s = my_unicode_object.encode("utf-8")

as that would allocate the encoded string in a temporary object which
would be immediately garbage collected. Using

cdef char* s = my_unicode_or_str_object

gets around this because, if it's a unicode object, the encoded bytes
are cached on that same object. (The drawback is that it's not as
explicit about the encoding.) The difficulty in dealing with strings
stems from the fact that (1) unlike scalars it's not always easy to
manage who "owns" the resulting pointer (i.e. who should clean it up,
and when) and (2) there is a variety of strongly held opinions on how
explicit the encoding needs to be vs. ease of use and convenience in
the (mostly scientific/numeric) pure ASCII case.

As a question to the original poster, is there a reason you need to
convert/store these results as char* rather than just holding on to
the original Python unicode objects themselves (and skirting the whole
issue)? I have little context to go on, but it sounds like it could be
a case of excessive/unnecessary typing.

- Robert

Vineet Jain

Jan 11, 2011, 5:30:13 PM
to cython...@googlegroups.com
> As a question to the original poster, is there a reason you need to
> convert/store these results as char* rather than just holding on to
> the original Python unicode objects themselves (and skirting the whole
> issue). I have little context to go on, but it sounds like it could be
> a case of excessive/unnecessary typing.

That's what I would like to do. The whole str/char/bytes/unicode topic is rather confusing, especially its performance implications. It would be great to have a simple FAQ outlining what to do if all your data is pure ASCII, assuming some of that pure ASCII data might live in Unicode strings (data returned from database queries, etc.).

A couple more questions:

1. So are you suggesting that I use cdef object or cdef str for my Python strings?
2. If I use cdef bytes, will string formatting be done in C or using the Python API? cdef bytes a, b; "%s:%s" % (a, b).

Thanks,

VJ

Robert Bradshaw

Jan 11, 2011, 6:25:26 PM
to cython...@googlegroups.com
On Tue, Jan 11, 2011 at 2:30 PM, Vineet Jain <vinj...@gmail.com> wrote:
>> As a question to the original poster, is there a reason you need to
>> convert/store these results as char* rather than just holding on to
>> the original Python unicode objects themselves (and skirting the whole
>> issue). I have little context to go on, but it sounds like it could be
>> a case of excessive/unnecessary typing.
>>
>
> That's what I would like to do. The whole str/char/bytes/unicode is rather
> confusing.  Especially the performance implications of it. It would be great
> for a simple FAQ outlining what to do if all your data is in pure ASCII.
> Assuming some of the pure ASCII data might live in some Unicode strings
> (data returned back from database queries, etc).

That depends a lot on your use case.

> Couple of more questions:
> 1. So are you suggesting that I use cdef object or cdef str for my python
> strings.

Yes. Unless you're using some optimized feature, there's no need to
type them at all.

> 2. If I use cdef bytes will string formatting be done in c or using the
> python api? cdef bytes a, b; "%s:%s" % (a, b).

Cython does not emulate string formatting--it always happens via the
Python API. Of course, it's internally implemented in C, so there's no
clear low-hanging fruit here, and a lot of potential for subtle
incompatibilities and corner cases (not to mention the questions of
memory allocation). There's no need to type a and b above as bytes.

Also note that bytes in Python 3 is very different from str in Python 2.
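
To illustrate that difference, here is a small Python 3 sketch (the variable names are placeholders). Formatting bytes into a text template renders each argument through its repr, whereas formatting into a bytes template keeps the result as bytes and needs Python 3.5 or later:

```python
a, b = b"key", b"value"

# Text template: each bytes argument is rendered via its repr, b'' prefix included.
as_text = "%s:%s" % (a, b)

# Bytes template (Python 3.5+): the result stays bytes, no repr involved.
as_bytes = b"%s:%s" % (a, b)

# Decoding first gives the output that Python 2 code would have expected.
decoded = "%s:%s" % (a.decode("ascii"), b.decode("ascii"))
```

Here `as_text` ends up as `"b'key':b'value'"`, which is rarely what ported Python 2 code intends, while `as_bytes` is `b"key:value"` and `decoded` is `"key:value"`.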

- Robert

Stefan Behnel

Jan 12, 2011, 3:32:05 AM
to cython...@googlegroups.com
Nikolaus Rath, 11.01.2011 19:55:

> Python uses UTF-8 internally, so an ASCII string will take 1 byte per
> character. The representation of ASCII characters in UTF-8 is unchanged.

Python does not use UTF-8 internally. It uses something more like UTF-16 or
UCS4 for unicode strings. So the memory usage will be between 2x and 4x the
size of a byte encoded bytes object (ASCII, ISO8859-x, ...), depending on
platform and content.
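
The 2x and 4x factors can be seen by encoding the same ASCII text with fixed-width codecs. This is only a Python 3 sketch: it measures encoded sizes, not the actual in-memory layout of a unicode object, which depends on the Python version and build.

```python
text = "hello"  # five ASCII characters

sizes = {
    "ascii": len(text.encode("ascii")),          # 1 byte per character
    "utf-16-le": len(text.encode("utf-16-le")),  # 2 bytes per character, no BOM
    "utf-32-le": len(text.encode("utf-32-le")),  # 4 bytes per character, no BOM
}
```

For this input the three sizes come out as 5, 10 and 20 bytes.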

Stefan

Stefan Behnel

Jan 12, 2011, 3:57:57 AM
to cython...@googlegroups.com
Vineet Jain, 11.01.2011 23:30:

> The whole str/char/bytes/unicode is rather
> confusing. Especially the performance implications of it.

Well, unicode strings tend to have considerable memory overhead and are
slow to encode/decode if you need that, but they are much easier to
optimise in Cython. So they can be faster than bytes in some cases, simply
because they have well defined, portable semantics across Python versions
that Cython can build on. The bytes type doesn't really provide that, but
it is faster to map to (and also from) a C char* because it doesn't require
encoding/decoding.

As you can guess from the above paragraph, the performance implications
depend *heavily* on your exact use case.

Since you stated that you are porting an existing Python program, which
indicates that you are not going to have a need for char* conversion, I
would suggest you stick with unicode objects everywhere for simplicity.
That greatly simplifies porting to Python 3, because Python 3 code will
almost never work with bytes objects. Be aware that Python 2 code will
often pass in 'str' objects, i.e. Python 2 bytes, although Python 2 code
will usually handle unicode strings just fine if you pass it back.

So, in any case, regardless of the internal format you pick, you will need
some kind of conversion on the way in (and/or out), either bytes->unicode
or unicode->bytes. The good thing is that you can adapt your conversion
function at C compile time depending on the CPython version you are
compiling against.

I admit that these things are not exactly trivial, but most of the problems
are due to the inner workings of Python 2 and have been fixed in Python 3.

Stefan

Stefan Behnel

Jan 12, 2011, 3:38:18 AM
to cython...@googlegroups.com
Robert Bradshaw, 12.01.2011 00:25:

> On Tue, Jan 11, 2011 at 2:30 PM, Vineet Jain wrote:
>>> As a question to the original poster, is there a reason you need to
>>> convert/store these results as char* rather than just holding on to
>>> the original Python unicode objects themselves (and skirting the whole
>>> issue). I have little context to go on, but it sounds like it could be
>>> a case of excessive/unnecessary typing.
>>
>> That's what I would like to do. The whole str/char/bytes/unicode is rather
>> confusing. Especially the performance implications of it. It would be great
>> for a simple FAQ outlining what to do if all your data is in pure ASCII.
>> Assuming some of the pure ASCII data might live in some Unicode strings
>> (data returned back from database queries, etc).
>
> That depends a lot on your use case.

Absolutely. And there certainly is some documentation on this topic:

http://wiki.cython.org/enhancements/stringliterals

http://docs.cython.org/src/tutorial/strings.html

Stefan

Vineet Jain

Jan 12, 2011, 8:15:08 AM
to cython...@googlegroups.com

> Well, unicode strings tend to have considerable memory overhead and are slow to encode/decode if you need that, but they are much easier to optimise in Cython. So they can be faster than bytes in some cases, simply because they have well defined, portable semantics across Python versions that Cython can build on. The bytes type doesn't really provide that, but it is


If I declare unicode objects as cdef object s, how will cython optimize access to s since it does not know that it contains a python unicode object?

Vineet

Robert Bradshaw

Jan 12, 2011, 11:30:38 AM
to cython...@googlegroups.com

Well, what kind of things are you expecting it to optimize? The only
reason to declare it unicode is if the compiler can do something
special because it's a unicode. "cdef object" will have all the
benefits of a cdef variable.

- Robert

Stefan Behnel

Jan 12, 2011, 3:34:15 PM
to cython...@googlegroups.com
Vineet Jain, 12.01.2011 14:15:

> If I declare unicode objects as cdef object s, how will cython optimize
> access to s since it does not know that it contains a python unicode object?

Since it doesn't know the type, it won't optimise anything special about
it. To get optimisation, declare it as "cdef unicode s".

Stefan
