
unicode by default


harrismh777

May 11, 2011, 5:37:49 PM
to
hi folks,
I am puzzled by unicode generally, and within the context of python
specifically. For one thing, what do we mean that unicode is used in
python 3.x by default. (I know what default means, I mean, what changed?)

I think part of my problem is that I'm spoiled (American, ascii
heritage) and have been either stuck in ascii knowingly, or UTF-8
without knowing (just because the code points lined up). I am confused
by the implications for using 3.x, because I am reading that there are
significant things to be aware of... what?

On my installation 2.6 sys.maxunicode comes up with 1114111, and my
2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6
was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that
the default compile option for 2.7 & 3.2 (I didn't change anything) is
set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much
correctly?
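For reference, this is easy to check from the interpreter. Note that on modern CPython (3.3 and later, after PEP 393) the narrow/wide build distinction is gone and sys.maxunicode is always 1114111; the 65535 value described here came from narrow builds of that era:

```python
import sys

# 0x10FFFF (1114111) on a wide/UCS-4 build; 0xFFFF (65535) on a
# narrow/UCS-2 build of that era. CPython 3.3+ always reports 1114111.
print(sys.maxunicode)
print(hex(sys.maxunicode))
```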

The books say that the .py sources are UTF-8 by default... and that
3.x is either UCS-2 or UCS-4. If I use the file handling capabilities
of Python in 3.x (by default) what encoding will be used, and how will
that affect the output?

If I do not specify any code points above ascii 0xFF does any of
this matter anyway?

Thanks.

kind regards,
m harris

Ian Kelly

May 11, 2011, 6:09:29 PM
to Python
On Wed, May 11, 2011 at 3:37 PM, harrismh777 <harri...@charter.net> wrote:
> hi folks,
>   I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)

The `unicode' class was renamed to `str', and a stripped-down version
of the 2.X `str' class was renamed to `bytes'.
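A minimal sketch of what that renaming looks like from a Python 3 session:

```python
# Python 3: 'str' is the old 2.x 'unicode' (a sequence of characters),
# and 'bytes' is a stripped-down version of the old 2.x 'str'.
text = "pound sign \u00a3"     # str: a sequence of characters
data = text.encode("utf-8")    # bytes: a sequence of raw byte values

print(type(text).__name__)     # str
print(type(data).__name__)     # bytes
```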

>   I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?

Mainly that Python 3 no longer does implicit conversion between bytes
and unicode, requiring the programmer to be explicit about such
conversions. If you have Python 2 code that is sloppy about this, you
may get some Unicode encode/decode errors when trying to run the same
code in Python 3. The 2to3 tool can help somewhat with this, but it
can't prevent all problems.
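A small illustration of the stricter behaviour, where Python 2 would have silently coerced:

```python
text = "abc"
data = b"def"

try:
    text + data                       # Python 2 would coerce; Python 3 raises
except TypeError:
    print("cannot mix str and bytes")

# The explicit version works:
print(text + data.decode("ascii"))    # abcdef
```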

>   On my installation 2.6  sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?).   Do I understand this much correctly?

I think that UCS-2 has always been the default unicode width for
CPython, although the exact representation used internally is an
implementation detail.

>   The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4.  If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?

If you open a file in binary mode, the result is a non-decoded byte stream.

If you open a file in text mode and do not specify an encoding, then
the result of locale.getpreferredencoding() is used for decoding, and
the result is a unicode stream.
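A short sketch of the two modes side by side (the scratch file name is just for illustration):

```python
import tempfile, os

path = os.path.join(tempfile.gettempdir(), "unicode_demo.txt")

with open(path, "w", encoding="utf-8") as f:   # text mode: write str
    f.write("pound sign \u00a3")

with open(path, "rb") as f:                    # binary mode: raw, undecoded bytes
    print(f.read())                            # b'pound sign \xc2\xa3'

with open(path, "r", encoding="utf-8") as f:   # text mode: bytes decoded to str
    print(f.read())                            # pound sign £

os.remove(path)
```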

>   If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

You mean 0x7F, and probably, due to the need to explicitly encode and decode.

Benjamin Kaplan

May 11, 2011, 6:34:02 PM
to pytho...@python.org
On Wed, May 11, 2011 at 2:37 PM, harrismh777 <harri...@charter.net> wrote:
> hi folks,
>   I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
>
>   I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
>   On my installation 2.6  sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?).   Do I understand this much correctly?
>

Not really sure about that, but it doesn't matter anyway. Because even
though internally the string is stored as either a UCS-2 or a UCS-4
string, you never see that. You just see this string as a sequence of
characters. If you want to turn it into a sequence of bytes, you have
to use an encoding.

>   The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4.  If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
>
>   If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then
there is a difference for anything over that range. A byte string is a
sequence of bytes. A unicode string is a sequence of these mythical
abstractions called characters. So a unicode string u'\u00a0' will
have a length of 1. Encode that to UTF-8 and you'll find it has a
length of 2 (because UTF-8 uses two to four bytes to encode every code
point above 0x7F; the top bit of each byte signals that you need the
next byte for this character).
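The same one-character/two-byte distinction, using the pound sign that comes up later in this thread:

```python
s = "\u00a3"              # POUND SIGN: one character
b = s.encode("utf-8")     # serialized form: two bytes, 0xC2 0xA3

print(len(s))             # 1
print(len(b))             # 2
print(b)                  # b'\xc2\xa3'
```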

If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html

And the biggest reason to use Unicode is so that you don't have to
worry about your program messing up because someone hands you input in
a different encoding than you used.

harrismh777

May 11, 2011, 6:51:07 PM
to
Ian Kelly wrote:

Ian, Benjamin, thanks much.

> The `unicode' class was renamed to `str', and a stripped-down version
> of the 2.X `str' class was renamed to `bytes'.

... thank you, this is very helpful.

>> > If I do not specify any code points above ascii 0xFF does any of this
>> > matter anyway?

> You mean 0x7F, and probably, due to the need to explicitly encode and decode.

Yes, actually, I did... and from Benjamin's reply it seems that
this matters only if I am working with bytes. Is it true that if I am
working without using bytes sequences that I will not need to care about
the encoding anyway, unless of course I need to specify a unicode code
point?

Thanks again.

kind regards,
m harris

John Machin

May 11, 2011, 7:32:06 PM
to pytho...@python.org
On Thu, May 12, 2011 8:51 am, harrismh777 wrote:
> Is it true that if I am
> working without using bytes sequences that I will not need to care about
> the encoding anyway, unless of course I need to specify a unicode code
> point?

Quite the contrary.

(1) You cannot work without using bytes sequences. Files are byte
sequences. Web communication is in bytes. You need to (know / assume / be
able to extract / guess) the input encoding. You need to encode your
output using an encoding that is expected by the consumer (or use an
output method that will do it for you).

(2) You don't need to use bytes to specify a Unicode code point. Just use
an escape sequence e.g. "\u0404" is a Cyrillic character.
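A quick check of that escape sequence in the interpreter:

```python
import unicodedata

ch = "\u0404"
print(ch)                      # Є
print(ord(ch))                 # 1028
print(unicodedata.name(ch))    # CYRILLIC CAPITAL LETTER UKRAINIAN IE
```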

harrismh777

May 11, 2011, 9:22:50 PM
to
John Machin wrote:
> (1) You cannot work without using bytes sequences. Files are byte
> sequences. Web communication is in bytes. You need to (know / assume / be
> able to extract / guess) the input encoding. You need to encode your
> output using an encoding that is expected by the consumer (or use an
> output method that will do it for you).
>
> (2) You don't need to use bytes to specify a Unicode code point. Just use
> an escape sequence e.g. "\u0404" is a Cyrillic character.
>

Thanks John. In reverse order, I understand point (2). I'm less clear
on point (1).

If I generate a string of characters that I presume to be ascii/utf-8
(no \u0404 type characters) and write them to a file (stdout), how does
the default encoding affect that file? I'm not seeing that there is
anything unusual going on... If I open the file with vi? If I open
the file with gedit? emacs?

....

Another question... in mail I'm receiving many small blocks that look
like sprites with four small hex codes, scattered about the mail...
mostly punctuation, maybe? ... guessing, are these unicode code
points, and if so what is the best way to 'guess' the encoding? ... is
it coded in the stream somewhere...protocol?

thanks

MRAB

May 11, 2011, 10:31:18 PM
to pytho...@python.org
You need to understand the difference between characters and bytes.

A string contains characters, a file contains bytes.

The encoding specifies how a character is represented as bytes.

For example:

In the Latin-1 encoding, the character "£" is represented by the
byte 0xA3.

In the UTF-8 encoding, the character "£" is represented by the byte
sequence 0xC2 0xA3.

In the ASCII encoding, the character "£" can't be represented at all.

The advantage of UTF-8 is that it can represent _all_ Unicode
characters (codepoints, actually) as byte sequences, and all those in
the ASCII range are represented by the same single bytes which the
original ASCII system used. Use the UTF-8 encoding unless you have to
use a different one.
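The three cases above can be reproduced directly:

```python
ch = "\u00a3"                      # POUND SIGN

print(ch.encode("latin-1"))       # b'\xa3'      -- one byte
print(ch.encode("utf-8"))         # b'\xc2\xa3'  -- two bytes
try:
    ch.encode("ascii")
except UnicodeEncodeError:
    print("no ASCII representation")
```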

A file contains only bytes, a socket handles only bytes. Which encoding
you should use for characters is down to protocol. A system such as
email, which can handle different encodings, should have a way of
specifying the encoding, and perhaps also a default encoding.

Steven D'Aprano

May 11, 2011, 11:16:09 PM
to
On Thu, 12 May 2011 03:31:18 +0100, MRAB wrote:

>> Another question... in mail I'm receiving many small blocks that look
>> like sprites with four small hex codes, scattered about the mail...
>> mostly punctuation, maybe? ... guessing, are these unicode code points,
>> and if so what is the best way to 'guess' the encoding? ... is it coded
>> in the stream somewhere...protocol?
>>
> You need to understand the difference between characters and bytes.


http://www.joelonsoftware.com/articles/Unicode.html

is also a good resource.


--
Steven

harrismh777

May 11, 2011, 11:44:23 PM
to
Steven D'Aprano wrote:
>> You need to understand the difference between characters and bytes.
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> is also a good resource.

Thanks for being patient guys, here's what I've done:

>>> astr = "pound sign"
>>> asym = " \u00A3"
>>> afile = open("myfile", mode='w')
>>> afile.write(astr + asym)
12
>>> afile.close()


When I edit "myfile" with vi I see the 'characters' :

pound sign £

... same with emacs, same with gedit ...


When I hexdump myfile I see this:

0000000 6f70 6375 2064 6973 6e67 c220 00a3


This is *not* what I expected... well it is (little-endian) right up to
the 'c2' and that is what is confusing me....

I did not open the file with an encoding of UTF-8... so I'm assuming
UTF-16 by default (python3) so I was expecting a '00A3' little-endian as
'A300' but what I got instead was UTF-8 little-endian 'c2a3' ....

See my problem?... when I open the file with emacs I see the character
pound sign... same with gedit... they're all using UTF-8 by default. By
default it looks like Python3 is writing output with UTF-8 as default...
and I thought that by default Python3 was using either UTF-16 or UTF-32.
So, I'm confused here... also, I used the character sequence \u00A3
which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
which is the normal UTF-8...

Thanks again for your patience... I really do hate to be dense about
this... but this is another area where I'm just beginning to dabble and
I'd like to know soon what I'm doing...

Thanks for the link Steve... I'm headed there now...


kind regards,
m harris

John Machin

May 11, 2011, 11:54:20 PM
to pytho...@python.org
On Thu, May 12, 2011 11:22 am, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume /
>> be
>> able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just
>> use
>> an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John. In reverse order, I understand point (2). I'm less clear
> on point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters)
> and write them to a file (stdout) how does
> default encoding affect that file.by default..? I'm not seeing that
> there is anything unusual going on...

About """characters that I presume to be ascii/utf-8 (no \u0404 type
characters)""": All Unicode characters (including U+0404) are encodable in
bytes using UTF-8.

The result of sys.stdout.write(unicode_characters) to a TERMINAL depends
mostly on sys.stdout.encoding. This is likely to be UTF-8 on a
Linux/OSX platform. On a typical American / Western European / [former]
colonies Windows box, this is likely to be cp850 in a Command Prompt
window, and cp1252 in IDLE.

UTF-8: All Unicode characters are encodable in UTF-8. Only problem arises
if the terminal can't render the character -- you'll get spaces or blobs
or boxes with hex digits in them or nothing.

Windows (Command Prompt window): only a small subset of characters can be
encoded in e.g. cp850; anything else causes an exception.

Windows (IDLE): ignores sys.stdout.encoding and renders the characters
itself. Same outcome as *x/UTF-8 above.

If you write directly (or sys.stdout is redirected) to a FILE, the default
encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless
the machine's site.py has been fiddled with to make it UTF-8 or something
else.
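For what it's worth, this is one of the things that later changed: on Python 3, sys.getdefaultencoding() always returns 'utf-8', while the 'ascii' default described here is Python 2 behaviour. A quick check:

```python
import sys

# Always 'utf-8' on Python 3; 'ascii' was the Python 2 default
# being described above.
print(sys.getdefaultencoding())
```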

> If I open the file with vi? If
> I open the file with gedit? emacs?

Any editor will have a default encoding; if that doesn't match the file
encoding, you have a (hopefully obvious) problem if the editor doesn't
detect the mismatch. Consult your editor's docs or HTFF1K.

> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code
> points,

yes

> and if so what is the best way to 'guess' the encoding?

google("chardet") or rummage through the mail headers (but 4 hex digits in
a box are a symptom of inability to render, not necessarily caused by an
incorrect decoding)

> ... is it coded in the stream somewhere...protocol?

Should be.

Ben Finney

May 12, 2011, 12:07:08 AM
to
MRAB <pyt...@mrabarnett.plus.com> writes:

> You need to understand the difference between characters and bytes.

Yep. Those who don't yet understand it need to join us in the third
millennium, and the resources pointed out in this thread are good help
with that.

> A string contains characters, a file contains bytes.

That's not true for Python 2.

I'd phrase that as:

* Text is a sequence of characters. Most inputs to the program,
including files, sockets, etc., contain a sequence of bytes.

* Always know whether you're dealing with text or with bytes. No object
can be both.

* In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
the type for text.

* In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
sequence of bytes.

--
\ “I went to a garage sale. ‘How much for the garage?’ ‘It's not |
`\ for sale.’” —Steven Wright |
_o__) |
Ben Finney

Terry Reedy

May 12, 2011, 12:12:02 AM
to pytho...@python.org

If you open a file as binary (bytes), you must write bytes, and they are
stored without transformation. If you open in text mode, you must write
text (string as unicode in 3.2) and Python will encode to bytes using
either some default or the encoding you specified in the open statement.
It does not matter how Python stored the unicode internally. Does this
help? Your intent is signalled by how you open the file.
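A sketch of that signalling: the bytes that land on disk follow the encoding named in the open() call, not the interpreter's internal representation (the scratch file name is just for illustration):

```python
import tempfile, os

path = os.path.join(tempfile.gettempdir(), "encoding_demo.bin")

# The same one-character string, three different on-disk forms:
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    with open(path, "w", encoding=enc) as f:
        f.write("\u00a3")
    with open(path, "rb") as f:
        print(enc, f.read())

os.remove(path)
```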

--
Terry Jan Reedy


John Machin

May 12, 2011, 12:14:35 AM
to pytho...@python.org
On Thu, May 12, 2011 1:44 pm, harrismh777 wrote:
> By
> default it looks like Python3 is writing output with UTF-8 as default...
> and I thought that by default Python3 was using either UTF-16 or UTF-32.
> So, I'm confused here... also, I used the character sequence \u00A3
> which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
> which is the normal UTF-8...

Python uses either a 16-bit or a 32-bit INTERNAL representation of Unicode
code points. Those NN bits have nothing to do with the UTF-NN encodings,
which can be used to encode the codepoints as byte sequences for EXTERNAL
purposes. In your case, UTF-8 has been used as it is the default encoding
on your platform.

Benjamin Kaplan

May 12, 2011, 12:14:49 AM
to pytho...@python.org
On Wed, May 11, 2011 at 8:44 PM, harrismh777 <harri...@charter.net> wrote:
> Steven D'Aprano wrote:
>>>
>>> You need to understand the difference between characters and bytes.
>>
>> http://www.joelonsoftware.com/articles/Unicode.html
>>
>> is also a good resource.
>
> Thanks for being patient guys, here's what I've done:
>
>>>>> astr="pound sign"
>>>>> asym=" \u00A3"
>>>>> afile=open("myfile", mode='w')
>>>>> afile.write(astr + asym)
>>
>> 12
>>>>>
>>>>> afile.close()
>
>
> When I edit "myfile" with vi I see the 'characters' :
>
> pound sign £
>
>   ... same with emacs, same with gedit  ...
>
>
> When I hexdump myfile I see this:
>
> 0000000 6f70 6375 2064 6973 6e67 c220 00a3
>
>
> This is *not* what I expected... well it is (little-endian) right up to the
> 'c2' and that is what is confusing me....
>
> I did not open the file with an encoding of UTF-8... so I'm assuming UTF-16
> by default (python3) so I was expecting a '00A3' little-endian as 'A300' but
> what I got instead was UTF-8 little-endian  'c2a3' ....
>
quick note here: UTF-8 doesn't have an endian-ness. It's always read
from left to right, with the high bits of each byte telling you whether
you need to continue or not, so byte order simply doesn't apply.
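The contrast is easy to see by encoding the same character both ways:

```python
ch = "\u00a3"

print(ch.encode("utf-16-le"))   # b'\xa3\x00' -- byte order matters for UTF-16
print(ch.encode("utf-16-be"))   # b'\x00\xa3'
print(ch.encode("utf-8"))       # b'\xc2\xa3' -- identical everywhere; no endianness
```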

> See my problem?... when I open the file with emacs I see the character pound
> sign... same with gedit... they're all using UTF-8 by default. By default it
> looks like Python3 is writing output with UTF-8 as default... and I thought
> that by default Python3 was using either UTF-16 or UTF-32. So, I'm confused
> here...  also, I used the character sequence \u00A3 which I thought was
> UTF-16... but Python3 changed my intent to  'c2a3' which is the normal
> UTF-8...
>

The fact that CPython uses UCS-2 or UCS-4 internally is an
implementation detail and isn't actually part of the Python
specification. As far as a Python program is concerned, a Unicode
string is a list of character objects, not bytes. Much like any other
object, a unicode character needs to be serialized before it can be
written to a file. An encoding is a serialization function for
characters.

If the file you're writing to doesn't specify an encoding, Python will
default to locale.getdefaultencoding(), which tries to get your
system's preferred encoding from environment variables (in other
words, the same source that emacs and gedit will use to get the
default encoding).

John Machin

May 12, 2011, 12:41:47 AM
to pytho...@python.org
On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:
>
> If the file you're writing to doesn't specify an encoding, Python will
> default to locale.getdefaultencoding(),

No such attribute. Perhaps you mean locale.getpreferredencoding()

harrismh777

May 12, 2011, 2:14:37 AM
to


>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>>

Yessssssss!


:)

harrismh777

May 12, 2011, 2:31:16 AM
to
Ben Finney wrote:
> I'd phrase that as:

> * Text is a sequence of characters. Most inputs to the program,
> including files, sockets, etc., contain a sequence of bytes.

> * Always know whether you're dealing with text or with bytes. No object
> can be both.

> * In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
> the type for text.

> * In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
> sequence of bytes.


That is very helpful... thanks


MRAB, Steve, John, Terry, Ben F, Ben K, Ian...
...thank you guys so much, I think I've got a better picture now of
what is going on... this is also one place where I don't think the books
are as clear as they need to be at least for me...(Lutz, Summerfield).

So, the UTF-16 UTF-32 is INTERNAL only, for Python... and text in/out is
based on locale... in my case UTF-8 ...that is enormously helpful for
me... understanding locale on this system is as mystifying as unicode is
in the first place.
Well, after reading about unicode tonight (about four hours) I realize
that its not really that hard... there's just a lot of details that have
to come together. Straightening out that whole tower-of-babel thing is
sure a pain in the butt.
I also was not aware that UTF-8 chars could be up to six (6) bytes long
from left to right. I see now that the little-endianness I was
ascribing to python is just a function of hexdump... and I was a little
disappointed to find that hexdump does not support UTF-8, just ascii... doh.
Anyway, thanks again... I've got enough now to play around a bit...

PS thanks Steve for that link, informative and entertaining too... Joe
says, "If you are a programmer . . . and you don't know the basics of
characters, character sets, encodings, and Unicode, and I catch you, I'm
going to punish you by making you peel onions for 6 months in a
submarine. I swear I will". :)


kind regards,
m harris

harrismh777

May 12, 2011, 2:43:23 AM
to
Terry Reedy wrote:
> It does not matter how Python stored the unicode internally. Does this
> help? Your intent is signalled by how you open the file.

Very much, actually, thanks. I was missing the 'internal' piece, and
did not realize that if I didn't specify the encoding on the open that
python would pull the default encoding from locale...


kind regards,
m harris

John Machin

May 12, 2011, 3:58:24 AM
to harrismh777, pytho...@python.org
On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:

>
> So, the UTF-16 UTF-32 is INTERNAL only, for Python

NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
encodings for the EXTERNAL representation of Unicode characters in byte
streams.

> I also was not aware that UTF-8 chars could be up to six(6) byes long
> from left to right.

It could be, once upon a time in ISO faerieland, when it was thought that
Unicode could grow to 2**31 codepoints. However ISO and the Unicode
consortium have agreed that 17 planes is the utter max, and accordingly a
valid UTF-8 byte sequence can be no longer than 4 bytes ... see below

>>> chr(17 * 65536)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)
>>> chr(17 * 65536 - 1)
'\U0010ffff'
>>> _.encode('utf8')
b'\xf4\x8f\xbf\xbf'
>>> b'\xf5\x8f\xbf\xbf'.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0:
invalid start byte


TheSaint

May 12, 2011, 8:40:08 AM
to
John Machin wrote:

what about sys.getfilesystemencoding()?
If you distribute a program, how can you guess which encoding the user
will have?


--
goto /dev/null

Ian Kelly

May 12, 2011, 12:17:33 PM
to pytho...@python.org
On Thu, May 12, 2011 at 1:58 AM, John Machin <sjma...@lexicon.net> wrote:
> On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>
>>
>> So, the UTF-16 UTF-32 is INTERNAL only, for Python
>
> NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
> encodings for the EXTERNAL representation of Unicode characters in byte
> streams.

Right. *Under the hood* Python uses UCS-2 (which is not exactly the
same thing as UTF-16, by the way) to represent Unicode strings.
However, this is entirely transparent. To the Python programmer, a
unicode string is just an abstraction of a sequence of code-points.
You don't need to think about UCS-2 at all. The only times you need
to worry about encodings are when you're encoding unicode characters
to byte strings, or decoding bytes to unicode characters, or opening a
stream in text mode; and in those cases the only encoding that matters
is the external one.

Terry Reedy

May 12, 2011, 4:42:45 PM
to pytho...@python.org
On 5/12/2011 12:17 PM, Ian Kelly wrote:
> On Thu, May 12, 2011 at 1:58 AM, John Machin<sjma...@lexicon.net> wrote:
>> On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>>
>>>
>>> So, the UTF-16 UTF-32 is INTERNAL only, for Python
>>
>> NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
>> encodings for the EXTERNAL representation of Unicode characters in byte
>> streams.
>
> Right. *Under the hood* Python uses UCS-2 (which is not exactly the
> same thing as UTF-16, by the way) to represent Unicode strings.

I know some people say that, but according to the definitions of the
unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent
chars in the Supplementary Planes. The later (1996) UTF-16, which Python
uses, can. The standard considers 'UCS-2' obsolete long ago. See

https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
or http://www.unicode.org/faq/basic_q.html#14

The latter says: "Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided."

It goes on: "Sometimes in the past an implementation has been labeled
"UCS-2" to indicate that it does not support supplementary characters
and doesn't interpret pairs of surrogate code points as characters. Such
an implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters."

I know that 16-bit Python *does* use surrogate pairs for supplementary
chars and at least some properties work for them. I am not sure exactly
what the rest means.

> However, this is entirely transparent. To the Python programmer, a
> unicode string is just an abstraction of a sequence of code-points.
> You don't need to think about UCS-2 at all. The only times you need
> to worry about encodings are when you're encoding unicode characters
> to byte strings, or decoding bytes to unicode characters, or opening a
> stream in text mode; and in those cases the only encoding that matters
> is the external one.

If one uses unicode chars in the Supplementary Planes above the BMP (the
first 2**16), which require surrogate pairs for 16 bit unicode (UTF-16),
then the abstraction leaks.
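A sketch of where the seam shows: the surrogate pair is visible in the UTF-16 encoding, and on a narrow build len() of the string itself was 2. (CPython 3.3+ with PEP 393 later removed the narrow/wide distinction, so modern interpreters report 1.)

```python
ch = "\U0001F600"                 # a supplementary-plane code point

print(len(ch))                    # 2 on a narrow build; 1 on CPython 3.3+
print(ch.encode("utf-16-le"))     # b'=\xd8\x00\xde' -- the surrogate pair D83D DE00
```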

--
Terry Jan Reedy

Ian Kelly

May 12, 2011, 6:25:24 PM
to Python
On Thu, May 12, 2011 at 2:42 PM, Terry Reedy <tjr...@udel.edu> wrote:
> On 5/12/2011 12:17 PM, Ian Kelly wrote:
>> Right.  *Under the hood* Python uses UCS-2 (which is not exactly the
>> same thing as UTF-16, by the way) to represent Unicode strings.
>
> I know some people say that, but according to the definitions of the unicode
> consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
> Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
> standard considers 'UCS-2' obsolete long ago. See
>
> https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
> or http://www.unicode.org/faq/basic_q.html#14

At the first link, in the section _Use in major operating systems and
environments_ it states, "The Python language environment officially
only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
"Unicode" produces correct UTF-16. Python can be compiled to use UCS-4
(UTF-32) but this is commonly only done on Unix systems."

PEP 100 says:

The internal format for Unicode objects should use a Python
specific fixed format <PythonUnicode> implemented as 'unsigned
short' (or another unsigned numeric type having 16 bits). Byte
order is platform dependent.

This format will hold UTF-16 encodings of the corresponding
Unicode ordinals. The Python Unicode implementation will address
these values as if they were UCS-2 values. UCS-2 and UTF-16 are
the same for all currently defined Unicode character points.
UTF-16 without surrogates provides access to about 64k characters
and covers all characters in the Basic Multilingual Plane (BMP) of
Unicode.

It is the Codec's responsibility to ensure that the data they pass
to the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use
of surrogates.

I'm getting out of my depth here, but that implies to me that while
Python stores UTF-16 and can correctly encode/decode it to UTF-8,
other codecs might only work correctly with UCS-2, and the unicode
class itself ignores surrogate pairs.

Although I'm not sure how much this might have changed since the
original implementation, especially for Python 3.

jmfauth

May 13, 2011, 2:28:09 AM
to
On 12 mai, 18:17, Ian Kelly <ian.g.ke...@gmail.com> wrote:

> ...


> to worry about encodings are when you're encoding unicode characters
> to byte strings, or decoding bytes to unicode characters


A small but important correction/clarification:

In Unicode, "unicode" does not encode a *character*. It
encodes a *code point*, a number, the integer associated
to the character.
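In those terms, ord() and chr() are the bridge between a character and its code point:

```python
ch = "\u00a3"

print(ord(ch))         # 163 -- the code point, an integer
print(hex(ord(ch)))    # 0xa3
print(chr(163))        # £   -- back to the character
```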

jmf

harrismh777

May 13, 2011, 3:53:50 PM
to
jmfauth wrote:
>> to worry about encodings are when you're encoding unicode characters
>> > to byte strings, or decoding bytes to unicode characters
>
> A small but important correction/clarification:
>
> In Unicode, "unicode" does not encode a *character*. It
> encodes a *code point*, a number, the integer associated
> to the character.
>

That is a huge code-point... pun intended.

... and there is another point that I continue to be somewhat puzzled
about, and that is the issue of fonts.

On of my hobbies at the moment is ancient Greek (biblical studies,
Septuaginta LXX, and Greek New Testament). I have these texts on my
computer in a folder in several formats... pdf, unicode 'plaintext',
osis.xml, and XML.

These texts may be found at http://sblgnt.com

I am interested for the moment only in the 'plaintext' stream,
because it is unicode. ( first, in unicode, according to all the doc
there is no such thing as 'plaintext,' so keep that in mind).

When I open the text stream in one of my unicode editors I can see
'most' of the characters in a rudimentary Greek font with accents;
however, I also see many tiny square blocks indicating (I think) that
the code points do *not* have a corresponding character in my unicode
font for that Greek symbol (whatever it is supposed to be).

The point, or question, is: how does one go about making sure that
there is a corresponding font glyph to match a specific unicode code
point for display in a particular terminal (editor, browser, whatever)?

The unicode consortium is very careful to make sure that thousands
of symbols have a unique code point (that's great !) but how do these
thousands of symbols actually get displayed if there is no font
consortium? Are there collections of 'standard' fonts for unicode that
I am not aware of? Is there a unix/linux package that can be installed
that drops at least 'one' default standard font that will be able to
render all or 'most' (whatever I mean by that) code points in unicode?
Is this a Python issue at all?


kind regards,
m harris


Robert Kern

May 13, 2011, 4:18:33 PM
to pytho...@python.org
On 5/13/11 2:53 PM, harrismh777 wrote:

> The unicode consortium is very careful to make sure that thousands of symbols
> have a unique code point (that's great !) but how do these thousands of symbols
> actually get displayed if there is no font consortium? Are there collections of
> 'standard' fonts for unicode that I am not aware of?

There are some well-known fonts that try to cover a large section of the Unicode
standard.

http://en.wikipedia.org/wiki/Unicode_typeface

> Is there a unix linux package
> that can be installed that drops at least 'one' default standard font that will
> be able to render all or 'most' (whatever I mean by that) code points in
> unicode? Is this a Python issue at all?

Not really.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Terry Reedy

May 13, 2011, 9:41:30 PM
to pytho...@python.org
On 5/13/2011 3:53 PM, harrismh777 wrote:

> The unicode consortium is very careful to make sure that thousands of
> symbols have a unique code point (that's great !) but how do these
> thousands of symbols actually get displayed if there is no font
> consortium? Are there collections of 'standard' fonts for unicode that I
> am not aware? Is there a unix linux package that can be installed that
> drops at least 'one' default standard font that will be able to render
> all or 'most' (whatever I mean by that) code points in unicode? Is this
> a Python issue at all?

Easy, practical use of unicode is still a work in progress.

--
Terry Jan Reedy

harrismh777

May 14, 2011, 3:41:10 AM
to
Terry Reedy wrote:
>> Is there a unix/linux package that can be installed that
>> drops at least 'one' default standard font that will be able to render
>> all or 'most' (whatever I mean by that) code points in unicode? Is this
>> a Python issue at all?
>
> Easy, practical use of unicode is still a work in progress.

Apparently... the good news for me is that SBL provides their unicode
font here:

http://www.sbl-site.org/educational/biblicalfonts.aspx

I'm getting much closer here, but now the problem is typing. The pain
with unicode fonts is that the glyph is tied to the code point for the
represented character, and not tied to any code point that matches any
keyboard scan code for typing. :-}

So, I can now see the ancient text with accents and apparatus in all of
my editors, but I still cannot type any ancient Greek with my
keyboard... because I have to make up a keymap first. <sigh>

I don't find that SBL (nor Logos Software) has provided keymaps as
yet... rats.

I can read the text with Python though... yessss.
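Until a keymap exists, there is a keyboard-free workaround worth sketching:
Python 3 string literals can spell Greek by Unicode character name, so no
keyboard layout is needed to produce the text programmatically.

```python
# \N{...} escapes look up characters by their official Unicode names
# (note: Unicode spells the letter name "LAMDA", not "LAMBDA").
logos = ("\N{GREEK SMALL LETTER LAMDA}"
         "\N{GREEK SMALL LETTER OMICRON WITH TONOS}"
         "\N{GREEK SMALL LETTER GAMMA}"
         "\N{GREEK SMALL LETTER OMICRON}"
         "\N{GREEK SMALL LETTER FINAL SIGMA}")
print(logos)  # λόγος
```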


m harris

Nobody

May 14, 2011, 4:34:54 AM
to
On Fri, 13 May 2011 14:53:50 -0500, harrismh777 wrote:

> The unicode consortium is very careful to make sure that thousands
> of symbols have a unique code point (that's great !) but how do these
> thousands of symbols actually get displayed if there is no font
> consortium? Are there collections of 'standard' fonts for unicode that I
> am not aware of? Is there a unix/linux package that can be installed that
> drops at least 'one' default standard font that will be able to render all
> or 'most' (whatever I mean by that) code points in unicode?

Using the original meaning of "font" (US) or "fount" (commonwealth), you
can't have a single font cover the whole of Unicode. A font isn't a random
set of glyphs, but a set of glyphs in a common style, which can only
practically be achieved for a specific alphabet.

You can bundle multiple fonts covering multiple repertoires into a single
TTF (etc) file, but there's not much point.

In software, the term "font" is commonly used to refer to some ad-hoc
mapping between codepoints and glyphs. This typically works by either
associating each specific font with a specific repertoire (set of
codepoints), or by simply trying each font in order until one is found
with the correct glyph.

This is a sufficiently common problem that the FontConfig library exists
to simplify a large part of it.

> Is this a Python issue at all?

No.

jmfauth

May 14, 2011, 6:26:26 AM
to
On 14 mai, 09:41, harrismh777 <harrismh...@charter.net> wrote:

> ...


> I'm getting much closer here,

> ...

You should really understand that Unicode is a domain per
se. It is independent of any OS, programming language,
or application. It is up to these tools to be "unicode"
compliant.

Working in a full unicode mode (at least for texts) is
today practically a solved problem. But you have to ensure
the whole toolchain is unicode compliant (editors,
fonts (OpenType technology), rendering devices, ...).

Tip. This list is certainly not the best place to gather
information. I suggest you start by reading about XeTeX,
the "new" TeX engine that works only in unicode mode.
From that starting point, you will find plenty of web
sites covering the "unicode world", tools, fonts, ...

A variant is to visit sites speaking about *typography*.

jmf

Terry Reedy

May 14, 2011, 4:26:53 PM
to pytho...@python.org
On 5/14/2011 3:41 AM, harrismh777 wrote:
> Terry Reedy wrote:

>> Easy, practical use of unicode is still a work in progress.
>
> Apparently... the good news for me is that SBL provides their unicode
> font here:
>
> http://www.sbl-site.org/educational/biblicalfonts.aspx
>
> I'm getting much closer here, but now the problem is typing. The pain
> with unicode fonts is that the glyph is tied to the code point for the
> represented character, and not tied to any code point that matches any
> keyboard scan code for typing. :-}
>
> So, I can now see the ancient text with accents and aparatus in all of
> my editors, but I still cannot type any ancient Greek with my
> keyboard... because I have to make up a keymap first. <sigh>
>
> I don't find that SBL (nor Logos Software) has provided keymaps as
> yet... rats.

You need what is called, at least with Windows, an IME -- Input Method
Editor. These are part of (or associated with) the OS, so they can be
used with *any* application that will accept unicode chars (in whatever
encoding) rather than just ascii chars. Windows has about a hundred or
so, including Greek. I do not know if that includes classical Greek with
the extra marks.

> I can read the text with Python though... yessss.

--
Terry Jan Reedy

Ben Finney

May 14, 2011, 7:47:05 PM
to
Terry Reedy <tjr...@udel.edu> writes:

> You need what is called, at least with Windows, an IME -- Input Method
> Editor.

For a GNOME or KDE environment you want an input method framework; I
recommend IBus <URL:http://code.google.com/p/ibus/> which comes with the
major GNU+Linux operating systems <URL:http://oswatershed.org/pkg/ibus>
<URL:http://packages.debian.org/squeeze/ibus> .

Then you have a wide range of input methods available. Many of them are
specific to local writing systems. For writing special characters in
English text, I use either ‘rfc1345’ or ‘latex’ within IBus.

That allows special characters to be typed into any program which
communicates with the desktop environment's input routines. Yay, unified
input of special characters!

Except Emacs :-( which fortunately has ‘ibus-el’ available to work with
IBus <URL:http://www.emacswiki.org/emacs/IBusMode> :-).

--
\ 己所不欲、勿施于人。|
`\ (What is undesirable to you, do not do to others.) |
_o__) —孔夫子 Confucius, 551 BCE – 479 BCE |
Ben Finney
