On Tue, Jun 4, 2013 at 2:46 AM, Czarek Tomczak <
czarek....@gmail.com> wrote:
> Isn't there some "standard" to assume that if there is a bytes string passed
> that is meant to be a text, then it's encoding should be assumed to be
> utf-8,
> as this has become the dominant encoding for the world wide web?
If only that were so -- in fact, on the web, the encoding is
*supposed* to be specified in the HTML header. In reality, it often
isn't, and browsers carry an enormous amount of hacky code that tries
to auto-detect the encoding. It's quite remarkable that it ever works
at all!
This is the rule everywhere -- text data has to have its encoding
specified along with it. Period, end of sentence -- anything else is
prone to bugs (which doesn't mean you won't get away with it often...)
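In Python terms the rule looks like this (a minimal sketch; the utf-8 here stands in for whatever encoding actually came with the data):

```python
# Sketch of the rule: decode bytes to text (and encode text back to
# bytes) with an explicit encoding, never an implicit default.
raw = b'caf\xc3\xa9'                 # bytes from a file, socket, etc.
text = raw.decode('utf-8')           # explicit: the encoding came with the data
assert text == 'caf\xe9'             # a proper unicode string now
assert text.encode('utf-8') == raw   # explicit on the way back out, too
```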
> It seems that even the python core developers got it wrong in Py2 and fixed
> it in Py3, so I probably shouldn't be that much ashamed.
Well, py2 wasn't wrong, it simply wasn't supported from the beginning.
However, I would prefer it if the py2 bytes object were, in fact,
different from the string object, with the latter being explicitly for
8-bit text only. But what can you do?
> I'm still not sure of when I should use bytes or unicode strings. When
> returning
> a path to a file should it be bytes or unicode?
Ah -- the big ol' pain in the *($^&%*
Here's the deal, as I understand it:
Just like everywhere else, you can't do anything without the encoding
specified. There are platform differences here: Linux and OS-X use
utf-8 as the file system encoding. So in C/C++/Cython, you can read
file names into a char*, and when you go to/from Python, you want to
use a unicode object, encoding/decoding as you go back and forth.
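A sketch of that round trip (the path name here is invented; the utf-8 assumption is the Linux/OS-X file system convention just described):

```python
# Sketch: a byte path as a C char* would hold it, decoded to a Python
# unicode object on the way in, encoded back to bytes on the way out.
raw_path = b'/tmp/r\xc3\xa9sum\xc3\xa9.txt'  # what the C/C++ side sees
path = raw_path.decode('utf-8')              # unicode object for Python
assert path == '/tmp/r\xe9sum\xe9.txt'
assert path.encode('utf-8') == raw_path      # bytes again for the C API
```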
Windows is weird, and I probably only have it half right. Newer
versions use UTF-16 for unicode file names, so you want to use a wide
char (wchar_t*, or std::wstring), then encode/decode as you pass
to/from Python unicode objects. But the older Windows APIs use char*
and std::string, and those APIs will give you ANSI strings, with the
encoding determined by the locale settings. (I have no idea what
happens with non-legal characters...)
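(For what it's worth, Python 3's answer to the non-legal-characters question is the surrogateescape error handler from PEP 383 -- a sketch:)

```python
# Sketch: surrogateescape smuggles undecodable bytes through a str, so
# file names with junk bytes still round-trip losslessly.
raw = b'bad\xff.txt'                                  # not valid utf-8
name = raw.decode('utf-8', errors='surrogateescape')  # bytes -> str
assert name.encode('utf-8', errors='surrogateescape') == raw  # lossless
```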
So you have no choice but to have platform-dependent code...
In Python, locale.getpreferredencoding() gives you the locale's
preferred encoding; for file names specifically, I _think_
sys.getfilesystemencoding() is the one to ask for.
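A quick way to see what your platform reports (the printed values depend entirely on your locale settings -- nothing here beyond two stdlib calls):

```python
import locale
import sys

# Sketch: the locale's preferred encoding vs. the encoding Python
# uses for file names.  Values vary by platform and locale.
print(locale.getpreferredencoding())   # e.g. 'UTF-8'
print(sys.getfilesystemencoding())     # e.g. 'utf-8' on Linux/OS-X
```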
The core problem in all of this is that there is no standard type for
unicode in C or C++. Microsoft tried to do it by using 2-byte
wchar_t strings and UCS-2, but by the time they did that, unicode had
grown and could no longer fit into two bytes -- so we're kind of left
with the worst of both worlds with UTF-16 in the MS world -- oh well.
As far as I have found, there is still no simple unicode type for C++,
similar to Python's unicode type. There is IBM's ICU unicode library,
but it does all sorts of things beyond the basics, though _maybe_ you
could use only the core bits...
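You can see the two-byte assumption breaking down directly (the character below is just an arbitrary code point above U+FFFF):

```python
# Sketch: a character outside the Basic Multilingual Plane takes *two*
# 16-bit code units in UTF-16 (a surrogate pair) -- so UTF-16 is
# variable-width after all, same as utf-8, without utf-8's other perks.
ch = '\U0001F600'                         # code point above U+FFFF
assert len(ch.encode('utf-16-le')) == 4   # surrogate pair: 2 x 16 bits
assert len(ch.encode('utf-8')) == 4       # utf-8 is 4 bytes here too
```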
The boost filesystem library does abstract a bunch of stuff like this;
maybe it's useful for dealing with filenames, etc. across platforms.
>When a javascript calls python,
How do you call Python from javascript??? The only way I've passed
data between them is via JSON -- which I think is defined to be
encoded in utf-8.
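A sketch of that JSON path (the payload is invented; the point is encoding explicitly at the boundary):

```python
import json

# Sketch: JSON as the interchange layer -- dump to a str, encode to
# utf-8 bytes for the wire, and reverse on the way back in.
payload = {'path': '/tmp/some_file.txt'}
wire = json.dumps(payload).encode('utf-8')   # bytes handed to the JS side
assert json.loads(wire.decode('utf-8')) == payload
```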
But IIUC, you are working on CEF, and may be passing binary buffers
between Python and the javascript engine -- in which case, you'll need
to find out what the javascript engine uses as an internal encoding
(or whether it has an API for encoding strings in an encoding of your
choice...)