How to set compiler directives in setup.py?


Czarek Tomczak

May 22, 2013, 1:00:35 PM
to cython...@googlegroups.com
Hi,

I'm trying to set the c_string_type and c_string_encoding compiler directives
in the setup.py file, but it does not work. Here is the code:

import sys
from Cython.Compiler import Options
if sys.version_info.major < 3:
    Options.c_string_type = "str"
    Options.c_string_encoding = "utf8"
else:
    Options.c_string_type = "unicode"
    Options.c_string_encoding = "utf8"

On the other hand, setting the "fast_fail" option works fine:

Options.fast_fail = True

But it seems like this is a different type of option, not listed under "compiler
directives" in the documentation here:
http://docs.cython.org/src/reference/compilation.html#compiler-directives

Setting c_string_type and c_string_encoding by adding this comment at
the top of the pyx file works fine:

# cython: c_string_type=str, c_string_encoding=utf8

But I need to set these options depending on the Python version, so this is not
going to work.

Regards,
Czarek 

Nikita Nemkin

May 22, 2013, 1:13:37 PM
to cython...@googlegroups.com
On Wed, 22 May 2013 23:00:35 +0600, Czarek Tomczak
<czarek....@gmail.com> wrote:

> Hi,
>
> I'm trying to set c_string_type and c_string_encoding compiler directives
> in setup.py file, but it does not work, here is the code:
>
> from Cython.Compiler import Options
> if sys.version_info.major < 3:
>     Options.c_string_type = "str"
>     Options.c_string_encoding = "utf8"
> else:
>     Options.c_string_type = "unicode"
>     Options.c_string_encoding = "utf8"
>
>
> On the other hand, setting the "fast_fail" option works fine:
>
> Options.fast_fail = True
>
>
> But seems like this is a different type of option, not listed in the
> "compiler directives" in the documentation here:
> http://docs.cython.org/src/reference/compilation.html#compiler-directives

Hmm, c_string_* are compiler directives and they are listed in the table
you link to.

cythonize has a "compiler_directives" parameter:

ext_modules = cythonize(
    ...
    compiler_directives={
        'c_string_type': 'str',
        'c_string_encoding': 'utf8',
    },
    ...)

If you are using build_ext instead of cythonize, set the "cython_directives"
attribute (a dict) on the Extension object.

As I understand it:
* Directives have a scope (module, class, etc.), can sometimes be overridden
locally, and often change language semantics.
* Options pertain to the compilation process as a whole and (usually) do not
change language semantics.

Options can be passed to cythonize as keyword arguments.
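
For example, a complete setup.py along these lines might look roughly like
this (untested sketch; the project/module names are placeholders, and the
version check mirrors what you posted):

# setup.py sketch: pick directive values based on the running Python version
import sys
from distutils.core import setup
from Cython.Build import cythonize

directives = {
    "c_string_type": "str" if sys.version_info[0] < 3 else "unicode",
    "c_string_encoding": "utf8",
}

setup(
    name="example",                    # placeholder
    ext_modules=cythonize(
        ["example.pyx"],               # placeholder source file
        compiler_directives=directives,
    ),
)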

Best regards,
Nikita Nemkin

Czarek Tomczak

May 22, 2013, 1:29:53 PM
to cython...@googlegroups.com
Hi Nikita,

In setup.py I'm using the Extension() syntax. I've tried "cython_directives",
"compiler_directives" and "pyrex_directives", but none of them works. Here
is the code:

ext_modules = [Extension(
    "cefpython_py%s" % PYTHON_VERSION,
    ["cefpython.pyx"],
    cython_directives={"c_string_type": "str", "c_string_encoding": "utf8"},
    language='c++',
    ...
)]

Best regards,
Czarek

Nikita Nemkin

May 22, 2013, 1:33:32 PM
to cython...@googlegroups.com
On Wed, 22 May 2013 23:29:53 +0600, Czarek Tomczak
<czarek....@gmail.com> wrote:

> Hi Nikita,
>
> In setup.py I'm using Extension() syntax, I've tried "cython_directives",
> "compiler_directives" and "pyrex_directives", but none of it works, here
> is the code:
>
> ext_modules = [Extension(
>> "cefpython_py%s" % PYTHON_VERSION,
>> ["cefpython.pyx"],
>> cython_directives={"c_string_type": "str", "c_string_encoding":
>> "utf8"},
>> language='c++',
>> ...
>> )]

You should be using Cython's version of the Extension class
for the extra parameters to have effect:

from Cython.Distutils import build_ext, Extension
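
Roughly (untested sketch; module name and sources are placeholders):

# setup.py sketch using Cython's Extension, which accepts cython_directives
import sys
from distutils.core import setup
from Cython.Distutils import build_ext, Extension

ext_modules = [Extension(
    "cefpython",                       # placeholder module name
    ["cefpython.pyx"],
    cython_directives={
        "c_string_type": "str" if sys.version_info[0] < 3 else "unicode",
        "c_string_encoding": "utf8",
    },
    language="c++",
)]

setup(name="cefpython", cmdclass={"build_ext": build_ext}, ext_modules=ext_modules)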


Best regards,
Nikita Nemkin

Czarek Tomczak

May 22, 2013, 1:56:32 PM
to cython...@googlegroups.com
Importing Cython.Distutils.Extension instead of distutils.Extension fixed
the problem; it works now. Thank you for your help.

Best regards,
Czarek

Stefan Behnel

Jun 2, 2013, 4:18:25 AM
to cython...@googlegroups.com
Czarek Tomczak, 22.05.2013 19:00:
> I'm trying to set c_string_type and c_string_encoding compiler directives
> in setup.py file, but it does not work, here is the code:
>
> from Cython.Compiler import Options
> if sys.version_info.major < 3:
>     Options.c_string_type = "str"
>     Options.c_string_encoding = "utf8"
> else:
>     Options.c_string_type = "unicode"
>     Options.c_string_encoding = "utf8"
> [...]
> Setting c_string_type and c_string_encoding by adding this comment at
> the top of the pyx file works fine:
>
> # cython: c_string_type=str, c_string_encoding=utf8
>
> But I need to set these options depending on python version, so this is not
> going to work.

Could you explain why you want to set these based on the Python version?

Stefan

Czarek Tomczak

Jun 2, 2013, 5:23:25 AM
to cython...@googlegroups.com, stef...@behnel.de
Hi Stefan,

I get compile errors if I do not set the c_string_type and c_string_encoding options:

string_utils.pyx:26:33: default encoding required for conversion from 'string' to 'str object'

The code:
 
cdef str CefToPyString(ConstCefString& cefString):
    return cefString.ToString() # std::string

Regards,
Czarek

Stefan Behnel

Jun 2, 2013, 5:34:07 AM
to cython...@googlegroups.com
Hi,

please don't top-post. I fixed up the citation order below.

Czarek Tomczak, 02.06.2013 11:23:
> On Sunday, June 2, 2013 10:18:25 AM UTC+2, Stefan Behnel wrote:
>>
>> Czarek Tomczak, 22.05.2013 19:00:
>>> I'm trying to set c_string_type and c_string_encoding compiler directives
>>> in setup.py file, but it does not work, here is the code:
>>>
>>> from Cython.Compiler import Options
>>> if sys.version_info.major < 3:
>>>     Options.c_string_type = "str"
>>>     Options.c_string_encoding = "utf8"
>>> else:
>>>     Options.c_string_type = "unicode"
>>>     Options.c_string_encoding = "utf8"
>>> [...]
>>> Setting c_string_type and c_string_encoding by adding this comment at
>>> the top of the pyx file works fine:
>>>
>>> # cython: c_string_type=str, c_string_encoding=utf8
>>>
>>> But I need to set these options depending on python version, so this is not
>>> going to work.
>>
>> Could you explain why you want to set these based on the Python version?
>
> I get compile errors if I do not set the c_string_type and
> c_string_encoding options:
>
> string_utils.pyx:26:33: default encoding required for conversion from
> 'string' to 'str object'
>
> The code:
>
> cdef str CefToPyString(ConstCefString& cefString):
>     return cefString.ToString() # std::string

This seems wrong in Py2, unless you are only dealing with plain ASCII
strings (in which case it seems error prone to set the encoding to "utf8").
If your C++ strings contain non-ASCII content, and you really want to use
"bytes" in Python 2 instead of "unicode", you'd better do the conversion to
Python objects yourself, explicitly deciding between "bytes" and "unicode"
in Python 2 based on a runtime evaluation of the content at hand. Cython
will not magically do that for you.

If you are only dealing with ASCII content, set the encoding to "us-ascii"
and the string type to "str", regardless of the Python version. That's
basically what these options are there for.

Give this a read:

http://docs.cython.org/src/tutorial/strings.html#auto-encoding-and-decoding

Stefan

Czarek Tomczak

Jun 2, 2013, 6:48:45 AM
to cython...@googlegroups.com, stef...@behnel.de
On Sunday, June 2, 2013 11:34:07 AM UTC+2, Stefan Behnel wrote:
>> cdef str CefToPyString(ConstCefString& cefString):
>>     return cefString.ToString() # std::string
>
> This seems wrong in Py2, unless you are only dealing with plain ASCII
> strings (in which case it seems error prone to set the encoding to "utf8").
> If your C++ strings contain non-ASCII content, and you really want to use
> "bytes" in Python 2 instead of "unicode", you'd better do the conversion to
> Python objects yourself, explicitly deciding between "bytes" and "unicode"
> in Python 2 based on a runtime evaluation of the content at hand. Cython
> will not magically do that for you.

The strings are UTF-8 and the conversion works fine, so I'm not sure what is
wrong with it. I have already decided to return bytes by declaring the return
type as "str", right? What would converting to Python objects on my own look
like? I have no clue.

-Czarek

Stefan Behnel

Jun 2, 2013, 7:43:13 AM
to cython...@googlegroups.com
Czarek Tomczak, 02.06.2013 12:48:
> On Sunday, June 2, 2013 11:34:07 AM UTC+2, Stefan Behnel wrote:
>
>>> cdef str CefToPyString(ConstCefString& cefString):
>>>     return cefString.ToString() # std::string
>>
>> This seems wrong in Py2, unless you are only dealing with plain ASCII
>> strings (in which case it seems error prone to set the encoding to
>> "utf8").
>> If your C++ strings contain non-ASCII content, and you really want to use
>> "bytes" in Python 2 instead of "unicode", you'd better do the conversion
>> to
>> Python objects yourself, explicitly deciding between "bytes" and "unicode"
>> in Python 2 based on a runtime evaluation of the content at hand. Cython
>> will not magically do that for you.
>
> The strings are UTF-8 and the conversion works fine, so I'm not sure what is
> wrong with it.

Do you really want to return a bytes object in Py2 that contains UTF-8
encoded content, and a decoded Unicode string in Py3? That seems like a
rather weird interface to me. Why not return a decoded Unicode string in
both cases? That makes it much more straightforward to use.


> I have already decided to return bytes by defining the return
> type "str", right?

In Py2, yes. In Py3, it must return a Unicode string if you declare the
return type as "str". "str" *is* "unicode" in Py3. Thus, your attempt to
change the auto-string type in your setup.py seems redundant.

Did you read the string tutorial I referenced?

http://docs.cython.org/src/tutorial/strings.html


> How would converting to python objects on my own look
> like? I have no clue.

Depends on what you want. Your comments above leave me puzzled how you want
your conversion to behave, and how you want your Python/Cython level APIs
to be used.
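
That said, if the goal is simply to always hand back text, an explicit
conversion might look roughly like this (just a sketch, assuming ToString()
really does return UTF-8 encoded data and that the C++ declarations come
from your existing .pxd files):

# decode explicitly instead of relying on c_string_type/c_string_encoding
cdef unicode CefToPyString(ConstCefString& cefString):
    cdef bytes raw = cefString.ToString()   # std::string coerces to bytes
    return raw.decode('utf-8')              # same text type on Py2 and Py3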

Stefan

Czarek Tomczak

Jun 2, 2013, 8:15:40 AM
to cython...@googlegroups.com, stef...@behnel.de
On Sunday, June 2, 2013 1:43:13 PM UTC+2, Stefan Behnel wrote:
> Do you really want to return a bytes object in Py2 that contains UTF-8
> encoded content, and a decoded Unicode string in Py3? That seems like a
> rather weird interface to me. Why not return a decoded Unicode string in
> both cases? That makes it much more straightforward to use.

Yes, I want to return str (bytes) in Py2 and unicode (str) in Py3. What is
wrong with using utf-8 byte strings in Python, is it slower or what? Sorry
for my ignorance, but my knowledge of Python unicode is lacking, though
I tried Google.
 
I have had a bad experience with unicode in Py2 in the past. I don't remember
the case in detail, but there was some conversion problem that just couldn't
be fixed; the only solution was to upgrade to Py3, which had it fixed, but that
wasn't an option as I was using a library that was Py2 only.

-Czarek

Chris Barker - NOAA Federal

Jun 3, 2013, 4:54:01 PM
to cython...@googlegroups.com
On Sun, Jun 2, 2013 at 5:15 AM, Czarek Tomczak <czarek....@gmail.com> wrote:

> Yes, I want to return str (bytes) in Py2 and unicode (str) in Py3. What is
> wrong with using utf-8 byte strings in python, is it slower or what?

Not slower, but if a utf-8 byte string is in a py2 string, then the
encoding info is lost, and the user can't really do the right thing
with it at all -- unless they explicitly decode it into a unicode
object.

There really are only two kinds of string objects in py2 and py3:

unicode objects
bytes objects

bytes objects are called "str" in py2, whereas unicode objects are
called "str" in py3, so this all gets confusing. Also confusing is
that, due to legacy issues, a bytes object (py2 str) is often used for
8-bit ANSI-encoded data, and it works fine for that (particularly for
the ascii subset).
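
For example (a Python 2 sketch -- the byte literal is arbitrary):

raw = '\xc5\x82\xc4\x85'        # py2 str (bytes) holding UTF-8 data
text = raw.decode('utf-8')      # explicit decode -> unicode object
print(type(raw))                # <type 'str'>
print(type(text))               # <type 'unicode'>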

> I have had bad experience with unicode in Py2 in the past, I don't remember
> the case in details, but there was some conversion problem that just
> couldn't
> be fixed, the only solution was to upgrade to Py3 that had it fixed, which
> wasn't an option as I was using some library that was Py2 only.

I'd confirm that this is an issue in this case with py2 unicode objects
-- as those really are what you want. If they really are a problem
(which would be a big surprise in python 2.7.5 -- it's seen a LOT of testing!)
then maybe go with bytes (py2 strings), but I'd encode to an ansi
encoding and raise an error if that can't be done.

Unless this is only used to store and pass strings around, and they
won't be used from python as strings, in which case, go with bytes
everywhere.

-Chris


--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris....@noaa.gov

Czarek Tomczak

Jun 4, 2013, 4:44:03 AM
to cython...@googlegroups.com
Hi Chris,

Thanks for the reply.


On Monday, June 3, 2013 10:54:01 PM UTC+2, Chris Barker - NOAA Federal wrote:
> Not slower, but if a utf-8 byte string is in a py2 string, then the
> encoding info is lost, and the user can't really do the right thing
> with it at all -- unless they explicitly decode it into a unicode
> object.

So they can decode it anytime they want, I don't see a problem.
 
> Unless this is only used to store and pass strings around, and they
> won't be used from python as strings, in which case, go with bytes
> everywhere.

Some of them are used internally, some are passed to python, but
if they want unicode they can just decode it.

I haven't had any complaints from users so far about passing byte strings
to Python, so I'm still not convinced to use unicode strings. The last time I
used a library that returned all its data as unicode strings, it only caused
me problems, as I needed to encode it to bytes and there were errors, ahhh.
Isn't it messy when half of the libraries expect and return byte strings and
the other half unicode strings, and you need to encode and decode them in
all these places?

Regards,
Czarek

Stefan Behnel

Jun 4, 2013, 5:01:24 AM
to cython...@googlegroups.com
Czarek Tomczak, 04.06.2013 10:44:
> On Monday, June 3, 2013 10:54:01 PM UTC+2, Chris Barker - NOAA Federal
> wrote:
>>
>> Not slower, but if a utf-8 byte sting is in a py2string, then the
>> encoding info is lost, and the user can't really do the right thing
>> with it at all -- unless they explicitly decode it into a unicode
>> object.
>
> So they can decode it anytime they want, I don't see a problem.

If they *know* that they need to decode it. Your users will only have to do
it in Python 2, and Py2 won't automatically do it correctly for them,
although it may seem so in some cases (specifically, for pure ASCII data).
So they may end up writing broken code without knowing it, because their
data just failed to show an error for them.

If you do it right, they won't have to care.


>> Unless this in only used to store an pass strings around, and they
>> won't be used from python as strings, in which case, go with bytes
>> everywhere.
>
> Some of them are used internally, some are passed to python, but
> if they want unicode they can just decode it.

Why bother them with it at all? Why not do it directly in your library?


> I haven't had any complaints from users so far for passing string bytes
> to python, I'm still not convinced to use unicode strings. Last time I've
> used some library it returned all its data as unicode strings and that
> made me only problems as I needed to encode it to bytes and there
> were errors ahhh.

You wouldn't be the first programmer to get Unicode wrong. ISTM that you're
just lacking a bit of experience with it. That's ok, you'll learn it. But
don't reject it just because you don't understand it.


> Isn't that messy when half of the libraries expect and
> return byte strings and the other half unicode strings and you need
> to encode and decode them in all these places?

Depends on how you design it. If you are dealing with text, Unicode is the
right thing to use, and it makes things safe, correct and convenient. If
you are dealing with binary data, then bytes is the right type. And if
that's how your interface works, then that's just right, no surprises.

Stefan

Czarek Tomczak

Jun 4, 2013, 5:46:57 AM
to cython...@googlegroups.com, stef...@behnel.de
On Tuesday, June 4, 2013 11:01:24 AM UTC+2, Stefan Behnel wrote:
> If they *know* that they need to decode it. Your users will only have to do
> it in Python 2, and Py2 won't automatically do it correctly for them,
> although it may seem so in some cases (specifically, for pure ASCII data).
> So they may end up writing broken code without knowing it, because their
> data just failed to show an error for them.
>
> If you do it right, they won't have to care.

Isn't there some "standard" assumption that if a byte string is passed that
is meant to be text, then its encoding should be assumed to be utf-8, as this
has become the dominant encoding for the world wide web?

> You wouldn't be the first programmer to get Unicode wrong. ISTM that you're
> just lacking a bit of experience with it. That's ok, you'll learn it. But
> don't reject it just because you don't understand it.

Thanks, I will keep that in mind. 

It seems that even the python core developers got it wrong in Py2 and fixed 
it in Py3, so I probably shouldn't be that much ashamed.

> Depends on how you design it. If you are dealing with text, Unicode is the
> right thing to use, and it makes things safe, correct and convenient. If
> you are dealing with binary data, then bytes is the right type. And if
> that's how your interface works, then that's just right, no surprises.

I'm still not sure when I should use bytes or unicode strings. When returning
a path to a file, should it be bytes or unicode? When JavaScript calls Python,
should all JavaScript strings be assumed to be text, and should unicode
strings be used?

-Czarek

Wichert Akkerman

Jun 4, 2013, 6:50:40 AM
to cython...@googlegroups.com, stef...@behnel.de
On Jun 4, 2013, at 11:46 , Czarek Tomczak <czarek....@gmail.com> wrote:

> On Tuesday, June 4, 2013 11:01:24 AM UTC+2, Stefan Behnel wrote:
>> If they *know* that they need to decode it. Your users will only have to do
>> it in Python 2, and Py2 won't automatically do it correctly for them,
>> although it may seem so in some cases (specifically, for pure ASCII data).
>> So they may end up writing broken code without knowing it, because their
>> data just failed to show an error for them.
>>
>> If you do it right, they won't have to care.
>
> Isn't there some "standard" assumption that if a byte string is passed that
> is meant to be text, then its encoding should be assumed to be utf-8, as this
> has become the dominant encoding for the world wide web?

That is not a safe assumption to make unfortunately. UTF-8 is the most common, but other encodings are still frequently used.

>> You wouldn't be the first programmer to get Unicode wrong. ISTM that you're
>> just lacking a bit of experience with it. That's ok, you'll learn it. But
>> don't reject it just because you don't understand it.
>
> Thanks, I will keep that in mind.
>
> It seems that even the python core developers got it wrong in Py2 and fixed
> it in Py3, so I probably shouldn't be that much ashamed.

Py3 certainly cleaned that up a bit, but if you are careful you can get this working correctly in python 2 as well. The main thing to make sure is that you never ever try to mix unicode and str instances: python will then do a magic conversion using the ascii-codec which will give a nasty UnicodeDecodeError if your str contains any non-ASCII characters. It is a frequent problem for native English speaking developers to overlook that fact since their language only uses ascii characters.
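
A two-line demonstration of that implicit coercion (Python 2 only; Python 3 simply refuses to mix the two types):

utf8_bytes = u'caf\xe9'.encode('utf-8')   # str (bytes) with non-ASCII content
try:
    mixed = utf8_bytes + u' au lait'      # implicit ascii decode happens here
except UnicodeDecodeError as exc:
    print('implicit ascii decode failed: %s' % exc)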

>> Depends on how you design it. If you are dealing with text, Unicode is the
>> right thing to use, and it makes things safe, correct and convenient. If
>> you are dealing with binary data, then bytes is the right type. And if
>> that's how your interface works, then that's just right, no surprises.
>
> I'm still not sure when I should use bytes or unicode strings. When returning
> a path to a file, should it be bytes or unicode? When JavaScript calls Python,
> should all JavaScript strings be assumed to be text, and should unicode
> strings be used?

Bytes, and this is something python 3 actually got wrong in my opinion. The problem is that even though you think a path is text, it is impossible to know what encoding a filesystem uses. And to make it worse: each file can use another encoding, which results in even more fun. That means that trying to convert a path into a unicode instance is almost impossible to do reliably. Python 3 tries to work around these problems with lots of magic, but there is a real risk of that breaking.

Wichert.

Nikita Nemkin

Jun 4, 2013, 6:46:11 AM
to cython...@googlegroups.com
On Tue, 04 Jun 2013 15:46:57 +0600, Czarek Tomczak
<czarek....@gmail.com> wrote:

> On Tuesday, June 4, 2013 11:01:24 AM UTC+2, Stefan Behnel wrote:
>
>> If they *know* that they need to decode it. Your users will only have to
>> do
>> it in Python 2, and Py2 won't automatically do it correctly for them,
>> although it may seem so in some cases (specifically, for pure ASCII
>> data).
>> So they may end up writing broken code without knowing it, because their
>> data just failed to show an error for them.
>>
>> If you do it right, they won't have to care.
>>
>
> Isn't there some "standard" to assume that if there is a bytes string
> passed
> that is meant to be a text, then it's encoding should be assumed to be
> utf-8,
> as this has become the dominant encoding for the world wide web?

Not really. The convention is that a byte string requires the encoding
to be specified _somehow_. For example, you can make your library
accept UTF-8 encoded byte strings and _document it_. Then it becomes
"standard" within the scope of your library.

> It seems that even the python core developers got it wrong in Py2 and
> fixed it in Py3, so I probably shouldn't be that much ashamed.

Nope, they got it (almost) right in Py2. Py3 just pushes unicode
everywhere, even in places where it does not belong, just to make
ignorant people use it "by default" and killing useful functionality
in the process (b'%d' % i, b'abc'.encode('hex'), b'abc'[i] etc...)

> I'm still not sure of when I should use bytes or unicode strings. When
> returning
> a path to a file should it be bytes or unicode? When a javascript calls
> python,
> should all javascript strings be assumed to be text and unicode string
> should
> be used?

Generally, paths should be unicode. But on Py2, byte strings in
filesystem encoding (as returned by sys.getfilesystemencoding())
are usually accepted too.
So if your library on Py2 deals with paths, you should expect both
unicode and byte input and encode when necessary. Output paths can
always be unicode.
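
Something along these lines (sketch; the helper name is made up):

import sys

def as_unicode_path(path):
    # Byte paths are assumed to use the filesystem encoding (the usual
    # Py2 convention); unicode paths pass through unchanged.
    if isinstance(path, bytes):
        return path.decode(sys.getfilesystemencoding() or 'utf-8')
    return path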

Javascript strings are unicode by definition.

Best regards,
Nikita Nemkin

Czarek Tomczak

Jun 4, 2013, 7:46:12 AM
to cython...@googlegroups.com, stef...@behnel.de
On Tuesday, June 4, 2013 12:50:40 PM UTC+2, Wichert Akkerman wrote:
> Bytes, and this is something python 3 actually got wrong in my opinion. The problem is that even though you think a path is text, it is impossible to know what encoding a filesystem uses. And to make it worse: each file can use another encoding, which results in even more fun. That means that trying to convert a path into a unicode instance is almost impossible to do reliably. Python 3 tries to work around these problems with lots of magic, but there is a real risk of that breaking.

I can confirm that. I remember using the libtorrent library and it returned file paths
as unicode; the torrent files / directories may have had really strange characters
in them. With the path as unicode you couldn't access these files, and converting it
back to a byte string didn't return the original path. That was a really bad unicode experience.

Regards,
Czarek

Stefan Behnel

Jun 4, 2013, 8:02:56 AM
to cython...@googlegroups.com
Wichert Akkerman, 04.06.2013 12:50:
> On Jun 4, 2013, at 11:46 , Czarek Tomczak wrote:
>> I'm still not sure of when I should use bytes or unicode strings. When returning
>> a path to a file should it be bytes or unicode? When a javascript calls python,
>> should all javascript strings be assumed to be text and unicode string should
>> be used?
>
> Bytes, and this is something python 3 actually got wrong in my opinion.
> The problem is that even though you think a path is text the problem is
> that it is impossible to know what encoding a filesystem uses. And to make
> it worse: each file can use another coding which results in even more fun.
> That means that trying to convert a path into a unicode instance is almost
> impossible to do reliably. Python 3 tries to work around these problems
> with lots of magic but there is a real risk of that breaking.

AFAIR, this situation should have improved a lot in recent Python 3.x versions.

Stefan

Czarek Tomczak

Jun 4, 2013, 8:04:18 AM
to cython...@googlegroups.com
On Tuesday, June 4, 2013 12:46:11 PM UTC+2, Nikita Nemkin wrote:
> Javascript strings are unicode by definition.

Hmm, but I don't know the encoding that the webpage had, so decoding
the byte string to unicode using the 'utf-8' encoding would probably be wrong.

Best regards,
Czarek

Stefan Behnel

Jun 4, 2013, 8:53:52 AM
to cython...@googlegroups.com
Sorry for not ignoring your rant.

Nikita Nemkin, 04.06.2013 12:46:
> On Tue, 04 Jun 2013 15:46:57 +0600, Czarek Tomczak wrote:
>> It seems that even the python core developers got it wrong in Py2 and
>> fixed it in Py3, so I probably shouldn't be that much ashamed.
>
> Nope, they got it (almost) right in Py2

Except that the support in Py2 is incomplete and makes it dangerously easy
to write buggy code without noticing. So it's really more like they got it
wrong.


> Py3 just pushes unicode everywhere

Well, everywhere people are dealing with text, that is, which makes perfect
sense "these days".


> even in places where it does not belong

You might be referring to bugs in early Py3.x versions here. IMHO, it was a
good choice to start strict, then see where that brings us, and then clean
up problematic areas where practicality really beats purity.


> just to make ignorant people use it "by default"

Still better than making ignorant people write broken code by default. I'm
sure it's easier to get Unicode right when you're forced to use it than
when the runtime provides you with all sorts of nifty little special cases
that make your code hard to debug and to reason about.


> and killing useful functionality
> in the process (b'%d' % i, b'abc'.encode('hex'), b'abc'[i] etc...)

The first is text processing, so it's best done on Unicode strings. Why add
the complexity?

The second was a pretty serious design quirk in Py2 that makes no sense
(you can't "encode" byte strings), and there are both work-arounds and PEPs
that try to improve on the current situation. You might want to participate
in their design, although the apparent lack of serious interest in them may
indicate that they are not all that interesting after all. Special cases
just aren't special enough to break the rules.

The third was a design choice. Not sure if it was a good one, but it makes
sense if you consider the old indexing behaviour redundant with
"b'abc[i:i+1]". I.e., it's easy to work around and simplifies another use
case. My guess is that most people swearing at this change do so because
they are trying to port Py2 code and try hard to keep doing the same thing
they always did.

In fact, I tend to suspect most people who swear at Python 3 to be of this
kind.

Stefan

Czarek Tomczak

Jun 4, 2013, 10:37:20 AM
to cython...@googlegroups.com
Thank you guys for all the input. I think I understand the need to return
unicode strings in both Py2 and Py3. I see that this might be a problem in
the future, because if you write code in Py2 that decodes the byte strings,
then this will break in Py3 with decode errors: calling decode on a unicode
string doesn't make sense, and what Python does is encode it to a byte string
using the 'ascii' codec and then decode it back using the 'utf-8' codec, and
this will cause errors.

I understand the need for it, but I still don't like unicode strings ;-) From
what I learned here, the only thing unicode strings do is keep the encoding
type and intercept any encoding errors, but I don't think it makes any
difference in real life. You use unicode strings for text, but the text is
created by humans and humans make mistakes; it's like requiring that the web
pages and content we create not contain even the slightest mistake. We all
know that from time to time you will forget to close that tag, but browsers
will forgive you that; they won't display an "HtmlDecodeError" to you, but
Python will. I think that in practice most will just use the errors="ignore"
option to get rid of the unicode encode or decode errors, and that defeats
the whole point(?) of the unicode strings invention in Python. Or is the main
point to have a utility to easily merge strings with different encodings? But
shouldn't that be just a utility and not a main (forced) feature of the
language, built in everywhere?

Best regards,
Czarek

Nikita Nemkin

Jun 4, 2013, 11:19:37 AM
to cython...@googlegroups.com
On Tue, 04 Jun 2013 18:53:52 +0600, Stefan Behnel <stef...@behnel.de>
wrote:

> Sorry for not ignoring your rant.

Rant time!

> Nikita Nemkin, 04.06.2013 12:46:
>> On Tue, 04 Jun 2013 15:46:57 +0600, Czarek Tomczak wrote:
>>> It seems that even the python core developers got it wrong in Py2 and
>>> fixed it in Py3, so I probably shouldn't be that much ashamed.
>>
>> Nope, they got it (almost) right in Py2
>
> Except that the support in Py2 is incomplete and makes it dangerously
> easy
> to write buggy code without noticing. So it's really more like they got
> it wrong.

You can make the same argument about Python being a dynamic language:
"It's dangerously easy to pass an object of the wrong type."
And the rebuttal is the same.

Automatic ascii->unicode conversion was probably a bad idea, but it's
just one aspect.

>> Py3 just pushes unicode everywhere
>
> Well, everywhere people are dealing with text, that is, which makes
> perfect
> sense "these days".

I have three words for you: protocols, file formats, performance.
Nice example: Py3 WSGI fiasco.
"These days" UTF-8 is prevalent and burning cycles encoding and decoding
will only get you so far. Another nice example:
http://youtu.be/oK3EQH5Wdqo?t=24m26s.

>> even in places where it does not belong
>
> You might be referring to bugs in early Py3.x versions here. IMHO, it
> was a
> good choice to start strict, then see where that brings us, and then
> clean
> up problematic areas where practicality really beats purity.

I happily ignored early Py3 versions. What I mean is that some kinds of
textual data have fixed encoding (like ASCII or UTF-8) and manipulating
it as bytes should be as natural as manipulating unicode.
Some data is inherently mixed (hello PostScript, I mean, PDF),
many libraries have UTF-8 interfaces, etc...

>> just to make ignorant people use it "by default"
>
> Still better than making ignorant people write broken code by default.
> I'm
> sure it's easier to get Unicode right when you're forced to use it than
> when the runtime provides you with all sorts of nifty little special
> cases
> that make your code hard to debug and to reason about.

unicode by default is fine, but its introduction was accompanied by
degrading functionality of the bytes type.

No idea what special cases you are talking about. unicode vs str
is pretty clear cut in Py2 and standard library (2.7) accepts/returns
unicode where appropriate. (Maybe there are some dusty corners, idk.)

>> and killing useful functionality
>> in the process (b'%d' % i, b'abc'.encode('hex'), b'abc'[i] etc...)
>
> The first is text processing, so it's best done on Unicode strings. Why
> add
> the complexity?
>
> The second was a pretty serious design quirk in Py2 that makes no sense
> (you can't "encode" byte strings), and there are both work-arounds and
> PEPs
> that try to improve on the current situation. You might want to
> participate
> in their design, although the apparent lack of serious interest in them
> may
> indicate that they are not all that interesting after all. Special cases
> just aren't special enough to break the rules.
>
> The third was a design choice. Not sure if it was a good one, but it
> makes
> sense if you consider the old indexing behaviour redundant with
> "b'abc[i:i+1]". I.e., it's easy to work around and simplifies another use
> case.

This is exactly the sort of thinking that made Py3 worse than it could
have been.

There are cases where binary string formatting is useful. It was in the
language in Py2. The reason for it removal is somewhere between laziness
and dogmatism.

Codecs in Python were awesomely generic to begin with. Then someone with
too much time on his hands "fixed" them to conform to his narrow definition
of the word "encode".

bytes indexing in Py3 is inconsistent with unicode (and with Py2).
s[i:i+1] workaround breaks half of the "Zen of Python" principles.
At the same time, memoryview perfectly covers the case of
"numeric bytestring" indexing.

Making all identifiers unicode, now THAT is added complexity.
Python program in memory is essentially a graph of dicts and strings.
Both received a nice (implementation) complexity boost, esp. in 3.3.

(Alright, 3.3 string implementation is quite ingenious, considering
its design constraints: minimal memory consumption, O(1) indexing and
API compatibility. But it does not make it any simpler or prettier.
I'd rather reevaluate the constraints...)

> My guess is that most people swearing at this change do so because
> they are trying to port Py2 code and try hard to keep doing the same
> thing they always did.
>
> In fact, I tend to suspect most people who swear at Python 3 to be
> of this kind.

I haven't ported anything significant to Py3. But I have to be aware
of the differences to write portable code and some of these differences
can't be reasonably justified.


Best regards,
Nikita Nemkin

Chris Barker - NOAA Federal

Jun 4, 2013, 11:55:51 AM
to cython...@googlegroups.com
On Tue, Jun 4, 2013 at 2:46 AM, Czarek Tomczak <czarek....@gmail.com> wrote:

> Isn't there some "standard" to assume that if there is a bytes string passed
> that is meant to be a text, then it's encoding should be assumed to be
> utf-8,
> as this has become the dominant encoding for the world wide web?

If only that were so -- in fact, on the web, the encoding is
*supposed* to be specified in the html header. In reality, it's often
not, and browsers have an enormous amount of hacky code that tries to
auto-detect the encoding. It's quite remarkable that it ever works at
all!

This is the rule everywhere -- text data has to have the encoding
specified along with it. Period, end of sentence -- anything else is
prone to bugs (which doesn't mean you won't get away with it often...)

> It seems that even the python core developers got it wrong in Py2 and fixed
> it in Py3, so I probably shouldn't be that much ashamed.

well, py2 wasn't wrong, it simply wasn't supported from the beginning.
However, I would prefer it if the py2 bytes object was, in fact,
different from the string object, with the latter being explicitly for
8-bit text only. But what can you do?

> I'm still not sure of when I should use bytes or unicode strings. When
> returning
> a path to a file should it be bytes or unicode?

Ah - the big ol' pain in the *($^&%*

Here's the deal, as I understand it:

Just like everywhere else, you can't do anything without the encoding
specified. Here there are platform differences: Linux and OS-X use
utf-8 as the encoding for file systems. So in C/C++/Cython, you can
read them into a char* and when you go to/from Python, you want to use
a unicode object, encoding/decoding as you go back and forth.

Windows is weird, and I probably only have it half right. Newer
versions use UTF-16 for unicode files, so you want to use a wide char
(or std::w_string, or whatever it's called), then encode/decode as you
pass to/from Python unicode objects. But, the older Windows APIs use
char* and std::string, and those APIs will give you ansi strings, with
the encoding specified by the locale settings. (I have no idea what
happens with non-legal characters...)

So you have no choice but to have platform dependent code...

In python, locale.getpreferredencoding() gives you something -- I
_think_ it's the file system encoding.

The core problem in all of this is that there is no standard type for
unicode in C or C++. Microsoft tried to do it by using 2-byte
w_strings and UCS-2, but by the time they did that, unicode grew and
could no longer fit into two bytes -- so we're kind of left with
the worst of both worlds with UTF-16 in the MS world -- oh well. As
far as I have found, there is still no simple unicode type for C++,
similar to Python's unicode type. There is the IBM unicode library,
but it does all sorts of things beyond the basics, though _maybe_ you
could use only the core bits...

The boost filesystem library does abstract a bunch of stuff like this;
maybe it's useful for dealing with filenames, etc. across platforms.

>When a javascript calls python,

How do you call Python from javascript??? The only way I've passed
data between them is via JSON -- which I think is defined as encoded
in utf-8.

But IIUC, you are working on CEF, and may be passing binary bufferes
between pyton and the javascript engine -- in which cae, you'll need
to find out what the javascript engine uses as an internal encoding
(or it has an API for encoding strings in an encoding of choice...)

Chris Barker - NOAA Federal

Jun 4, 2013, 12:54:08 PM
to cython...@googlegroups.com
An update to what I wrote:

On Tue, Jun 4, 2013 at 8:55 AM, Chris Barker - NOAA Federal
<chris....@noaa.gov> wrote:

> Windows is weird, and I probably only have it half right. Newer
> versions use UTF-16 for unicode files, so you want to use a wide char
> (or std::w_string, or whatever it's called), then encode/decode as you
> pass to/from Python unicode objects. But, the older Windows APIs use
> char* and std::string, and those APIs will give you ansi strings, with
> the encoding specified by the locale settings. (I have no idea what
> happens with non-legal characters...)
>
> So you have no choice but to have platform dependent code...
>
> In python, locale.getpreferredencoding() give you something -- I
> _think_ it's the file system encoding.

not really, for that you want:
sys.getfilesystemencoding()

However, on Windows, that seems to return 'mbcs', which means
"multi-byte character set", which you may note is not actually an
encoding at all. But I'm pretty sure it is actually utf-16. But you
can only use utf-16 if you are using a "wide" type to hold the
filenames (and the corresponding APIs). If you are using the old-style
APIs, then locale.getpreferredencoding() seems to be right, though we
have NOT tested on many other systems (I think we took a quick look
at a Cyrillic system once...)

Yes, it's a mess.
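
If you want to check your own system, comparing the two calls is trivial
(output will obviously vary per platform):

import locale
import sys

print('locale.getpreferredencoding():', locale.getpreferredencoding())
print('sys.getfilesystemencoding():', sys.getfilesystemencoding())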

-Chris

Stefan Behnel

Jun 4, 2013, 2:52:59 PM
to cython...@googlegroups.com
Nikita Nemkin, 04.06.2013 17:19:
> On Tue, 04 Jun 2013 18:53:52 +0600, Stefan Behnel wrote:
>> Nikita Nemkin, 04.06.2013 12:46:
> I have three words for you: protocols, file formats, performance.
> Nice example: Py3 WSGI fiasco.

Yep, that took a while to get properly designed and fixed. The original
implementation was a bit too underdefined and partly broken. Also in Py2,
BTW, it just didn't show all that often because it accidentally didn't fail
in a lot of common cases.


> "These days" UTF-8 is prevalent and burning cycles encoding and decoding
> will only get you so far.

Just like using encoded byte strings will only get you so far. I guess
there's a use case for both, but unless you can prove that Unicode doesn't
fit your use case, you'd better use it. It's the same reason as for using
Python itself, why bother with all those low level details when you can
just decode it once and be done?


> Another nice example:
> http://youtu.be/oK3EQH5Wdqo?t=24m26s.

That's sad. According to the speaker's profiling anecdote, he seems to be
talking about Python 2, though. The problem there is that auto-decoding can
happen at any time, even just for comparing strings. It's so hard to
control that it's totally not surprising that it hurt his code's
performance. Most of this has been fixed in Python 3, especially in 3.3.


> No idea what special cases you are talking about. unicode vs str
> is pretty clear cut in Py2

I've seen so much code raise a UnicodeDecodeError in the most unexpected
corners that I can only hope you're joking.


> There are cases where binary string formatting is useful. It was in the
> language in Py2. The reason for it removal is somewhere between laziness
> and dogmatism.

Rather the maintenance overhead, I guess. Maintaining somewhere between
three and five different C implementations of string formatting, depending
on the level at which you count, is just way too much.


> s[i:i+1] workaround breaks half of the "Zen of Python" principles.

It fits quite well with "special cases aren't special enough to break the
rules".


> Making all identifiers unicode, now THAT is added complexity.

It's actually removing complexity. Ever tried to pass a dict with unicode
strings as **keyword arguments into a function in Python 2?

Stefan

Chris Barker - NOAA Federal

Jun 4, 2013, 5:12:31 PM
to cython...@googlegroups.com
On Tue, Jun 4, 2013 at 11:52 AM, Stefan Behnel <stef...@behnel.de> wrote:
>> Another nice example:
>> http://youtu.be/oK3EQH5Wdqo?t=24m26s.

I'm confused by that -- he refers to "getting a name from the
filesystem and it turns into unicode" -- OK -- if you want to get
names from a filesystem that supports unicode, you don't have a choice
there. So what's the problem?

I suspect the problem is that internally in his code, he was using
non-unicode strings -- then comparing them to unicode filenames --
this is simply broken, and not only from a performance perspective.
You either support full unicode, in which case, you should use unicode
everywhere, or you don't, in which case you need to encode/decode on
I/O, and raise an exception when a non-ansi compatible character is
passed in.

Or was there really a lot of decoding going on with:

a_unicode_object == another_unicode_object

If so then, yes there's an implementation issue!

I guess option 3 is that you keep stuff encoded in byte strings, and
as long as all you need to do is pass stuff along and compare for
equality, you're OK. But then, in a sense, you are hand-writing a
really crappy unicode implementation...

Wichert Akkerman

Jun 5, 2013, 3:15:26 AM
to cython...@googlegroups.com

On Jun 4, 2013, at 17:55 , Chris Barker - NOAA Federal <chris....@noaa.gov> wrote:
> Just like everywhere else, you can't do anything without the encoding
> specified. Here there are platform differences: Linux and OS-X use
> utf-8 as the encoding for file systems.

That isn't completely accurate. OSX requires UTF-8 if I remember correctly, but Linux does not enforce anything and is likely to use whatever encoding the currently logged in user happens to be using. It becomes extra fun if you share a folder with multiple users, or even multiple users from multiple computers (think network storage), and you have files with differently encoded filenames in the same directory.

> The core problem in all of this is that there is no standard type for
> unicode in C or C++. MIcrosoft tried to do it by using 2-byte
> w_strings and UCS-2, but by the time they did that, unicode grew and
> could no longer be fit into two bytes -- so we're kind of left with
> the worst of both worlds with UTF-16 in the MS world -- oh well. As
> far as I have found, there is still no simple unicode type for C++,
> similar to Python's unicode type.

Nothing in official standards, but Glib::ustring is very useful for UTF-8 in C++.

Wichert.

