Unicode questions

Tobiah

Oct 19, 2010, 3:02:36 PM
to
I've been reading about Unicode today.
I only vaguely understand what it is
and how it works.

Please correct my understanding where it is lacking.
Unicode is really just a database of character information
such as the name, Unicode block, possible
numeric value, etc. These points of information
are indexed by standard, never-changing numeric
indexes, so that 0x2CF might point to some
character information set that all the world
can agree on. The actual image that gets
displayed in response to the integer is generally
assigned and agreed upon, but it is up to the
software responding to the Unicode value to define
and generate the actual image that will represent that
character.

Now for the mysterious encodings. There are UTF-{8,16,32},
which only seem to indicate what the binary representation
of the Unicode code points is going to be. Then there
are 100 or so other encodings, many of which are language
specific. The ASCII encoding happens to be a 1-1 mapping up
to 127, but then there are others for various languages, etc.
I was thinking maybe this special case and the others were
lookup mappings, where a particular language user could work
with characters, perhaps in the range of 0-255 as we do for
ASCII, but then when decoding, to share with others, the plain
Unicode representation would be shared? Why can't we just say
"Unicode is Unicode" and share files the way ASCII users do?
Just have a huge ASCII-style table that everyone sticks to.
Please enlighten my vague and probably ill-formed conception
of this whole thing.

Thanks,

Tobiah

Petite Abeille

Oct 19, 2010, 3:43:53 PM
to pytho...@python.org

On Oct 19, 2010, at 9:02 PM, Tobiah wrote:

> Please enlighten my vague and probably ill-formed conception of this whole thing.

Hmmm... is there a question hidden somewhere in there or is it more open ended in nature? :)

In the meantime...

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Characters vs. Bytes
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

Hrvoje Niksic

Oct 19, 2010, 3:39:58 PM
to
Tobiah <to...@rcsreg.com> writes:

> would be shared? Why can't we just say "Unicode is Unicode"
> and share files the way ASCII users do? Just have a huge
> ASCII-style table that everyone sticks to.

I'm not sure that I understand you correctly, but the UCS-2 and UCS-4
encodings are that kind of thing. Many people prefer UTF-8 because of
its convenient backward compatibility with ASCII (and its space economy
when dealing with mostly-ASCII text).
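
A minimal sketch of what that backward compatibility means, assuming
Python 2.x (the example strings are illustrative):

    # Pure-ASCII text is byte-for-byte identical in ASCII and UTF-8.
    assert u'hello'.encode('utf-8') == 'hello'
    # Outside ASCII they diverge: one codepoint becomes two bytes.
    assert u'\u00e9'.encode('utf-8') == '\xc3\xa9'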

Chris Rebert

Oct 19, 2010, 4:14:25 PM
to Tobiah, pytho...@python.org
On Tue, Oct 19, 2010 at 12:02 PM, Tobiah <to...@rcsreg.com> wrote:
> I've been reading about Unicode today.
> I only vaguely understand what it is
> and how it works.

Petite Abeille already pointed to Joel's excellent primer on the
subject; I can only second their endorsement of his article.

> Please correct my understanding where it is lacking.

<snip>


> Now for the mysterious encodings.  There are UTF-{8,16,32},
> which only seem to indicate what the binary representation
> of the Unicode code points is going to be.  Then there
> are 100 or so other encodings, many of which are language
> specific.  The ASCII encoding happens to be a 1-1 mapping up
> to 127, but then there are others for various languages, etc.
> I was thinking maybe this special case and the others were
> lookup mappings, where a particular language user could work
> with characters, perhaps in the range of 0-255 as we do for
> ASCII, but then when decoding, to share with others, the plain
> Unicode representation would be shared?

There is no such thing as "plain Unicode representation". The closest
thing would be an abstract sequence of Unicode codepoints (à la
Python's `unicode` type), but this is way too abstract to be used for
sharing/interchange, because storing anything in a file or sending it
over a network ultimately involves serialization to binary, which is
not directly defined for such an abstract representation. (Indeed, this
is exactly what encodings are: mappings between abstract codepoints
and concrete binary; the problem is, there's more than one of them.)

Python's `unicode` type (and analogous types in other languages) is a
nice abstraction, but at the C level it's actually using some
(implementation-defined, IIRC) encoding to represent itself in memory;
and so when you leave Python, you also leave this implicit, hidden
choice of encoding behind and must instead be quite explicit.
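
A minimal sketch of that explicitness, assuming Python 2.x (the
sample text is arbitrary):

    # One abstract sequence of codepoints, several concrete serializations.
    u = u'\u20ac100'                     # EURO SIGN followed by "100"
    print repr(u.encode('utf-8'))        # '\xe2\x82\xac100'
    print repr(u.encode('utf-16-be'))    # ' \xac\x001\x000\x000'
    print repr(u.encode('iso-8859-15'))  # '\xa4100'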

>  Why can't we just say "unicode is unicode"
> and just share files the way ASCII users do.

Because just "Unicode" itself is not a scheme for encoding characters
as a stream of binary. Unicode /does/ define many encodings, and these
encodings are such schemes; /but/ none of them is *THE* One True
Unambiguous Canonical "Unicode" encoding scheme. Hence, one must be
specific and specify "UTF-8", or "UTF-32", or whatever.
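
To see why naming the encoding matters, here is a Python 2.x sketch
(an illustrative example, not from the original post): the same bytes
decode to different text under different encodings.

    data = u'caf\u00e9'.encode('utf-8')  # 'caf\xc3\xa9'
    print repr(data.decode('utf-8'))     # u'caf\xe9'     (correct)
    print repr(data.decode('latin-1'))   # u'caf\xc3\xa9' (mojibake)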

Cheers,
Chris
--
http://blog.rebertia.com

Tobiah

Oct 19, 2010, 4:31:06 PM
to
> There is no such thing as "plain Unicode representation". The closest
> thing would be an abstract sequence of Unicode codepoints (à la Python's
> `unicode` type), but this is way too abstract to be used for
> sharing/interchange, because storing anything in a file or sending it
> over a network ultimately involves serialization to binary, which is not
> directly defined for such an abstract representation. (Indeed, this is
> exactly what encodings are: mappings between abstract codepoints and
> concrete binary; the problem is, there's more than one of them.)

Ok, so an encoding is just the binary representation scheme for
a conceptual list of Unicode code points. So why so many? I get that
someone might want big-endian, and I see the various virtues of
the UTF strains, but why isn't a handful of these representations
enough? Languages may vary widely, but as far as I know, computers
really don't vary that much. Big/little endian is the only problem I
can think of. A byte is a byte. So why so many encoding schemes?
Do some provide advantages for certain human languages?

Thanks,

Toby

Petite Abeille

Oct 19, 2010, 4:57:43 PM
to pytho...@python.org

On Oct 19, 2010, at 10:31 PM, Tobiah wrote:

> So why so many encoding schemes?

http://en.wikipedia.org/wiki/Space-time_tradeoff

Chris Rebert

Oct 19, 2010, 5:09:32 PM
to Tobiah, pytho...@python.org

UTF-8 has the virtue of being backward-compatible with ASCII.

UTF-16 has all codepoints in the Basic Multilingual Plane take up
exactly 2 bytes; all others take up 4 bytes. The Unicode people
originally thought they would only include modern scripts, so 2 bytes
would be enough to encode all characters. However, they later
broadened their scope, thus the complication of "surrogate pairs" was
introduced.

UTF-32 has *all* Unicode codepoints take up exactly 4 bytes. This
slightly simplifies processing, but wastes a lot of space for e.g.
English texts.
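
A rough illustration of those size trade-offs, assuming Python 2.x on
a wide (UCS-4) build (on a narrow build len(s) differs, but the
encoded sizes below come out the same):

    # ASCII, Latin-1, CJK, and a non-BMP codepoint (MUSICAL SYMBOL G CLEF).
    s = u'A\u00e9\u4e2d\U0001d11e'
    for codec in ('utf-8', 'utf-16-le', 'utf-32-le'):
        print codec, len(s.encode(codec))
    # utf-8     10  (1 + 2 + 3 + 4 bytes)
    # utf-16-le 10  (2 + 2 + 2 + 4: the G clef needs a surrogate pair)
    # utf-32-le 16  (4 bytes per codepoint, always)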

And then there are a whole bunch of national encodings defined for
backward compatibility, but they typically only encode a portion of
all the Unicode codepoints.
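
For example (a Python 2.x sketch; cp1251, a Cyrillic code page, is
just one such national encoding):

    print repr(u'\u0414'.encode('cp1251'))  # '\xc4': CYRILLIC CAPITAL DE fits
    u'\u4e2d'.encode('cp1251')              # a CJK char does not fit:
                                            # raises UnicodeEncodeError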

More info: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Cheers,
Chris
--
Essentially, blame backward compatibility and finite storage space.
http://blog.rebertia.com

Terry Reedy

Oct 19, 2010, 6:17:10 PM
to pytho...@python.org

The hundred or so language-specific encodings all pre-date Unicode and
are *not* Unicode encodings. They are still used because of inertia and
local optimization.

There are currently about 100,000 Unicode codepoints, with space for
about 1,000,000. The Unicode standard specifies exactly two internal
representations of codepoints, using either 16- or 32-bit words. The
latter uses one word per codepoint; the former usually uses one word, but
has to use two for codepoints above 2**16-1. The standard also specifies
about 7 byte-oriented transfer formats, UTF-8, -16, and -32, with big- and
little-endian variations. As far as I know, these (and a few other
variations) are the only encodings that encode all Unicode chars
(codepoints).
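
For reference, a quick sketch enumerating those transfer formats,
assuming Python 2.6+ (where all of them are available as codecs):

    u = u'A\u4e2d'   # 'A' plus a CJK character
    for codec in ('utf-8', 'utf-16', 'utf-16-be', 'utf-16-le',
                  'utf-32', 'utf-32-be', 'utf-32-le'):
        print codec, repr(u.encode(codec))
    # The unmarked utf-16/utf-32 variants prepend a byte order mark;
    # the -be/-le variants fix the endianness instead.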

--
Terry Jan Reedy

M.-A. Lemburg

Oct 20, 2010, 8:41:01 AM
to Tobiah, pytho...@python.org
Tobiah wrote:
> I've been reading about Unicode today.
> I only vaguely understand what it is
> and how it works.
>
> Please correct my understanding where it is lacking.
> Unicode is really just a database of character information
> such as the name, Unicode block, possible
> numeric value, etc. These points of information
> are indexed by standard, never-changing numeric
> indexes, so that 0x2CF might point to some
> character information set that all the world
> can agree on. The actual image that gets
> displayed in response to the integer is generally
> assigned and agreed upon, but it is up to the
> software responding to the Unicode value to define
> and generate the actual image that will represent that
> character.

Correct. The "actual images" are called glyphs in Unicode-speak.

> Now for the mysterious encodings. There are UTF-{8,16,32},
> which only seem to indicate what the binary representation
> of the Unicode code points is going to be. Then there
> are 100 or so other encodings, many of which are language
> specific. The ASCII encoding happens to be a 1-1 mapping up
> to 127, but then there are others for various languages, etc.
> I was thinking maybe this special case and the others were
> lookup mappings, where a particular language user could work
> with characters, perhaps in the range of 0-255 as we do for
> ASCII, but then when decoding, to share with others, the plain
> Unicode representation would be shared? Why can't we just say
> "Unicode is Unicode" and share files the way ASCII users do?
> Just have a huge ASCII-style table that everyone sticks to.
> Please enlighten my vague and probably ill-formed conception
> of this whole thing.

UTF-n are transfer encodings of the Unicode table (the one you
are probably referring to). They represent the same code points
but make different trade-offs.

If you're looking for a short intro to Unicode in Python,
have a look at these talks I've given on the subject:

http://www.egenix.com/library/presentations/#PythonAndUnicode
http://www.egenix.com/library/presentations/#DesigningUnicodeAwareApplicationsInPython

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 20 2010)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

OdarR

Oct 21, 2010, 8:51:50 AM
to
On Oct 19, 9:02 pm, Tobiah <t...@rcsreg.com> wrote:
> I've been reading about Unicode today.
> I only vaguely understand what it is
> and how it works.
>
...
> Thanks,
>
> Tobiah

Hi,

Some good advice:
read this presentation,
http://farmdev.com/talks/unicode/
It offers explanations and advice for coding.

Olivier

Lawrence D'Oliveiro

Oct 25, 2010, 1:42:49 AM
to
In message <mailman.31.12875174...@python.org>, Petite
Abeille wrote:

> Characters vs. Bytes

And why do certain people insist on referring to bytes as “octets”?

Lawrence D'Oliveiro

Oct 25, 2010, 1:43:39 AM
to
In message <mailman.33.12875192...@python.org>, Chris Rebert
wrote:

> There is no such thing as "plain Unicode representation".

UCS-4 or UTF-16 probably come the closest.

Chris Rebert

Oct 25, 2010, 1:57:11 AM
to pytho...@python.org

How do you figure that?

Cheers,
Chris

Steve Holden

Oct 25, 2010, 2:33:46 AM
to pytho...@python.org

Because back in the old days bytes were of varying sizes on different
architectures - indeed the DECSystem-10 and -20 had instructions that
could be parameterized as to byte size. So octet was an unambiguous term
for the (now standard) 8-bit byte.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon 2011 Atlanta March 9-17 http://us.pycon.org/
See Python Video! http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/

Seebs

Oct 25, 2010, 1:59:08 PM
to

> And why do certain people insist on referring to bytes as “octets”?

One common reason is that there have been machines on which "bytes" were
not 8 bits. In particular, the usage of "byte" as "the smallest addressable
unit of storage" has been rather firmly ensconced in the C spec, so people
used to that are likely aware that, on a machine where the smallest directly
addressable chunk of space is 16 bits, it's quite likely that char is 16
bits, and thus by definition a "byte" is 16 bits, and if you want an octet,
you have to extract it from a byte.

-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / usenet...@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
I am not speaking for my employer, although they do rent some of my opinions.

Terry Reedy

Oct 25, 2010, 2:36:09 PM
to pytho...@python.org
On 10/25/2010 2:33 AM, Steve Holden wrote:
> On 10/25/2010 1:42 AM, Lawrence D'Oliveiro wrote:
> Because back in the old days bytes were of varying sizes on different
> architectures - indeed the DECSystem-10 and -20 had instructions that
> could be parameterized as to byte size. So octet was an unambiguous term
> for the (now standard) 8-bit byte.

As I remember, there were machines (CDC? Burroughs?) with 6-bit
char/bytes: 26 upper-case letters, 10 digits, 24 symbols and control chars.
--
Terry Jan Reedy


Steve Holden

Oct 25, 2010, 3:54:27 PM
to pytho...@python.org
On 10/25/2010 2:36 PM, Terry Reedy wrote:
> On 10/25/2010 2:33 AM, Steve Holden wrote:
>> On 10/25/2010 1:42 AM, Lawrence D'Oliveiro wrote:
>> Because back in the old days bytes were of varying sizes on different
>> architectures - indeed the DECSystem-10 and -20 had instructions that
>> could be parameterized as to byte size. So octet was an unambiguous term
>> for the (now standard) 8-bit byte.
>
> As I remember, there were machines (CDC? Burroughs?) with 6-bit
> char/bytes: 26 upper-case letters, 10 digits, 24 symbols and control chars.

Yes, and DEC used the same (?) code, calling it SIXBIT. Since their
systems had 36-bit words, it packed in very nicely.

John Nagle

Oct 26, 2010, 12:32:03 PM
to
On 10/19/2010 12:02 PM, Tobiah wrote:
> I've been reading about the Unicode today.
> I'm only vaguely understanding what it is
> and how it works.
>
> Please correct my understanding where it is lacking.

http://justfuckinggoogleit.com/

Steve Holden

Oct 26, 2010, 1:07:50 PM
to pytho...@python.org

Neither friendly nor helpful, John. Silence might have been more
productive: feeling crabby today?

0 new messages