
Stupid Unicode/UTF-16 Question


Eric Pepke

Aug 17, 2002, 8:26:09 PM
I'm writing an application that uses XML and Unicode. I'd like it to have as
little chance of breaking in the future as possible. Unfortunately, it's a
cross-platform application, so I'm going to have to deal with various APIs.
I'm using Xerces for parsing the XML. It says it uses UTF-16, and the strings
are, as one might expect, arrays of unsigned shorts. However, I also have to
accommodate Cocoa and possibly Java. These systems also use unsigned shorts
as an internal representation of Unicode strings.

Problem: the supplemental Unicode characters use more than sixteen bits. It
seems then that the natural, although wasteful, internal representation would
be arrays of structures with more than 16 bits. But that's obviously not
how Java and Cocoa do it.

Now, I had assumed that the old, traditional Unicode, with its sixteen bits,
was essentially the same as the UTF-16 encoding. However, after programming
until 4 in the morning, last night I had a series of terrible nightmares that
involved a ghostly voice warning me not to confuse UTF-16 and Unicode.

I'd like to know the exact relationship between UTF-16 and Unicode per se.
I've looked around on the official Unicode site, and I can't find any clear
answers. (It doesn't help that doing a search within it gives me a 404 error!)

Incidentally, this is a free software project, and one of the main design
goals is extreme longevity of the core system and the ability to map onto
whatever new APIs and user interfaces and such may become available over
the next several decades, so some crystal-ball gazing is required.

Does anybody have clear answers, pointers to references, or guesses on this
matter? I am trying to figure out, given a UTF-16 representation and
an implementation of Unicode based on a 16-bit internal representation,
how little I can do, or how much I need to do, to go from one to the other.

Richard Tobin

Aug 17, 2002, 9:05:44 PM
In article <ef37f531.02081...@posting.google.com>,
Eric Pepke <epe...@acm.org> wrote:

>I'd like to know the exact relationship between UTF-16 and Unicode per se.

Unicode specifies a mapping of code points to characters. This is the
same for all encodings of Unicode. UTF-16 is an encoding of Unicode
that represents the first "plane" (code points less than 2^16)
directly, and the next 16 planes (up to 10FFFF) using pairs of
"surrogates". These surrogates are the code points D800 - DBFF and
DC00 - DFFF. Unicode does not assign any characters to these code
points, and the pair (D800+x, DC00+y) maps to 10000 + (x << 10) + y.
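
In C that arithmetic might look something like this (just a sketch, and
the function names are made up):

    /* Combine a surrogate pair into a code point in 10000 - 10FFFF.
       Sketch only; the names here are invented for illustration.    */
    unsigned long combine_surrogates(unsigned short hi, unsigned short lo)
    {
        return 0x10000UL + ((unsigned long)(hi - 0xD800) << 10)
                         + (unsigned long)(lo - 0xDC00);
    }

    /* Split a code point above FFFF into its surrogate pair. */
    void split_to_surrogates(unsigned long cp,
                             unsigned short *hi, unsigned short *lo)
    {
        cp -= 0x10000UL;
        *hi = (unsigned short)(0xD800 + (cp >> 10));
        *lo = (unsigned short)(0xDC00 + (cp & 0x3FF));
    }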

So for many purposes, UTF-16 strings can be treated as if they were
strings of 16-bit characters. If there are no characters above FFFF,
this will work perfectly. If there are other characters (which is
rare), then most things will work if you do not interpret them (so
for example an XML parser doesn't need to know much about them, though
the application might).

-- Richard
--
Spam filter: to mail me from a .com/.net site, put my surname in the headers.

FreeBSD rules!

Lars Marius Garshol

Aug 17, 2002, 9:23:02 PM

* Eric Pepke

|
| Now, I had assumed that the old, traditional Unicode, with its
| sixteen bits, was essentially the same as the UTF-16 encoding.
| However, after programming until 4 in the morning, last night I had
| a series of terrible nightmares that involved a ghostly voice
| warning me not to confuse UTF-16 and Unicode.

That voice is entirely right, but your confusion is a very common one,
and one that even the Unicode Consortium itself shows signs of
suffering from at times.

In the beginning, Unicode was synonymous with what is now called
UCS-2, a straight 16-bit encoding. This arrangement worked well for a
long time, and nobody considered the character set to be different
from the character encoding in any way, which is how the present
confusion arose.

Later on it was realized that 65,536 was not going to suffice for all
the characters that were going to go into Unicode. I think the policy
changes brought on by the merger with ISO 10646 were in part
responsible for this.

In any case, the Unicode codespace was extended to 21 bits, which
raised the problem of how to represent Unicode characters in the 16
bits that implementations were currently using. The answer was UTF-16.
Two blocks of 16-bit values were set aside for use as special values
for encoding astral plane characters. To encode a character above
U+FFFF you use two of these special values, which are known as
surrogates.

This solution is backwards compatible with the older encoding (now
known as UCS-2), and not too bad to work with, since the value of any
code unit will tell you whether it is a surrogate, and if so whether
it is the first or the second of a pair.
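
For instance, a test along these lines is all it takes (a sketch; the
macro names are mine):

    /* Sketch: classify a 16-bit UTF-16 code unit.  Macro names invented. */
    #define IS_HIGH_SURROGATE(u) ((u) >= 0xD800 && (u) <= 0xDBFF) /* first of a pair  */
    #define IS_LOW_SURROGATE(u)  ((u) >= 0xDC00 && (u) <= 0xDFFF) /* second of a pair */
    #define IS_SURROGATE(u)      (IS_HIGH_SURROGATE(u) || IS_LOW_SURROGATE(u))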

This means that for running text you can perfectly well use 16 bits,
as long as you take care never to split surrogate pairs and to deal
with them not as two characters, but one. For example, make sure
people can't do ridiculous things like selecting half an astral
character. One problem, however, is that methods that return a single
character won't work if they return a single 16-bit value. (Yes,
the meaning of the charAt call in Java has changed, and I guess you
could claim it is broken.)

And that's basically it. The introduction of surrogates is the whole
difference between UCS-2 and the new UTF-16.

| Problem: the supplemental Unicode characters use more than sixteen
| bits. It seems then that the natural, although wasteful, internal
| representation would be arrays of structures with more than 16 bits.

That is correct. gcc, for example, will represent wchar_t using 32
bits, and you can use the UTF-32 and UCS-4 encodings to exchange
Unicode text if you want to.

(This was written late at night, so please pardon any mistakes.)

--
Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC <URL: http://www.garshol.priv.no >

Lars Marius Garshol

Aug 18, 2002, 5:35:45 AM

* Richard Tobin

|
| Unicode specifies a mapping of code points to characters. This is
| the same for all encodings of Unicode.

I would say that the mapping is the character set, upon which all the
character encodings are used.

| If there are other characters (which is rare), then most things will
| work if you do not interpret them (so for example an XML parser
| doesn't need to know much about them, though the application might).

Actually, there's one place an XML parser does need to know about them
if it is UTF-16-based: when presented with &#x10400; it must produce a
surrogate pair as output. But it's true that generally parsers, and
many other pieces of software, do not need to consider them.
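
(Working through that example: 10400 - 10000 = 0400, so the high half is
D800 + (0400 >> 10) = D801 and the low half is DC00 + (0400 & 3FF) = DC00,
and the parser has to hand the application the pair D801 DC00.)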

UTF-32/UCS-4 remains the cleanest solution, however.

Eric Pepke

Aug 18, 2002, 6:36:54 PM
Lars Marius Garshol <lar...@garshol.priv.no> wrote in message news:<m3ofc12...@pc36.avidiaasen.online.no>...

> In the beginning, Unicode was synonymous with what is now called
> UCS-2, a straight 16-bit encoding. This arrangement worked well for a
> long time, and nobody considered the character set to be different
> from the character encoding in any way, which is how the present
> confusion arose.

Thanks; this is very helpful. Unfortunately, I'm returning to Unicode
after an absence of several years, and I'm finding it a bit hard to
reconstruct the history. It's pretty easy to see how things are, but
not so easy to find out how things got this way.

> In any case, the Unicode codespace was extended to 21 bits, which
> raised the problem of how to represent Unicode characters in the 16
> bits that implementations were currently using. The answer was UTF-16.
> Two blocks of 16-bit values were set aside for use as special values
> for encoding astral plane characters. To encode a character above
> U+FFFF you use two of these special values, which are known as
> surrogates.

OK, so what I think I'm hearing is this. I'd appreciate it if you could
tell me if I'm correct. These surrogate blocks are different from valid
blocks in the old (now UCS-2) standard. I.e., the surrogate blocks were
not previously part of the standard. Therefore, an older string or API
that was written before UTF-16 will not contain any of those blocks, and
should probably not be expected to know what to do with any of those
blocks.

The surrogate characters can be detected simply by looking at them, so I
don't have to go back to the beginning of the text and run it through a
state machine or provide some limited lookback to detect them.

Such an older API should be able to work with a string that meets the
UTF-16 standard, provided that it contains no surrogates, and is therefore
indistinguishable from a string that meets the UCS-2 standard.

APIs that present 16-bit characters or, even worse, arrays of 16-bit
characters are arguably broken (unless they are documented to be using
UTF-16, which for the most part they aren't). However, these probably
either 1) don't handle the supplemental characters properly, or 2) quietly
internally deal with the surrogate characters. Since I'm dealing with Xerces,
which is documented to present strings as UTF-16, it probably presents each
surrogate character separately.

Since I'm not doing anything fancy internally with the strings other than
parsing them into a native API string such as Cocoa or Java, it doesn't
matter too much, as long as I can correctly identify the length of the
string. However, it would probably be more elegant to have internal
computations work with large characters (32 bit now but expandable in the
future without breaking anything), make my best guess as to what
encoding the APIs use when they just say "Unicode," and have all storage
representations be explicitly labeled encodings, to be converted to and
from the computational encoding as necessary, usually on the fly.
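
Something along these lines is what I have in mind (just a sketch; all of
the names are placeholders):

    /* Placeholder sketch; every name here is invented. */
    typedef unsigned long uchar32;        /* internal 32-bit character */

    enum storage_encoding { ENC_UTF8, ENC_UTF16, ENC_UTF32 };

    struct stored_text {
        enum storage_encoding encoding;   /* always labeled explicitly */
        unsigned char *bytes;
        unsigned long byte_count;
    };

    /* Conversions happen only at the boundary between storage or API
       strings and the internal computational representation.          */
    uchar32 *decode_to_internal(const struct stored_text *text,
                                unsigned long *out_length);
    struct stored_text encode_from_internal(const uchar32 *chars,
                                            unsigned long length,
                                            enum storage_encoding target);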

This internal computational encoding will be indistinguishable from UTF-32,
that is, until we join the galactic allegiance or something and realize that
21 or even 32 bits aren't enough.

Eric Pepke

Aug 18, 2002, 6:44:50 PM
ric...@cogsci.ed.ac.uk (Richard Tobin) wrote in message news:<ajmrt8$1hu2$1...@pc-news.cogsci.ed.ac.uk>...

> Unicode specifies a mapping of code points to characters. This is the
> same for all encodings of Unicode. UTF-16 is an encoding of Unicode
> that represents the first "plane" (code points less than 2^16)
> directly, and the next 16 planes (up to 10FFFF) using pairs of
> "surrogates". These surrogates are the code points D800 - DBFF and
> DC00 - DFFF. Unicode does not assign any characters to these code
> points, and the pair (D800+x, DC00+y) maps to 10000 + (x << 10) + y.

Thanks. Trouble is, I also have to second-guess the intentions of API
designers who haven't made a clear distinction between a code point and
a computer word.

Now, I'm hoping that what you've said isn't quite right. That is, it doesn't
represent the entire first plane directly. I.e. that it doesn't represent
character D800 directly with D800 but either disallows it from the set or
uses a pair of surrogates to mean D800, because otherwise it doesn't make
sense.

I'm also hoping that there aren't any older 16-bit encodings that use D800,
say, directly to represent a character. That would be not much fun at all.

Lars Marius Garshol

Aug 18, 2002, 6:53:21 PM

* Eric Pepke

|
| Thanks; this is very helpful. Unfortunately, I'm returning to
| Unicode after an absence of several years, and I'm finding it a bit
| hard to reconstruct the history. It's pretty easy to see how things
| are, but not so easy to find out how things got this way.

Agreed. Though how important it is *how* things ended up this way is
debatable, of course. :)



| OK, so what I think I'm hearing is this. I'd appreciate it if you
| could tell me if I'm correct. These surrogate blocks are different
| from valid blocks in the old (now UCS-2) standard. I.e., the
| surrogate blocks were not previously part of the standard.

There is no real difference between old Unicode and new Unicode. It's
just that more encodings have been added, and people have realized
that the character set and the encoding are not the same thing.

The surrogate blocks are basically two blocks of code points that were
reserved in the existing code space (at D800 - DBFF and DC00 - DFFF,
as Richard wrote) and set aside for use in UTF-16 to indicate
characters above U+FFFF. The blocks were there all the time, but they
had no defined function before, which they now do.

| Therefore, an older string or API that was written before UTF-16
| will not contain any of those blocks, and should probably not be
| expected to know what to do with any of those blocks.

That is correct.



| The surrogate characters can be detected simply by looking at them,
| so I don't have to go back to the beginning of the text and run it
| through a state machine or provide some limited lookback to detect
| them.

Yes.

Note that the surrogates are *not* characters. They are just 16-bit
values used in UTF-16 for a specific purpose, in the same way that
8-bit values in the 80 - FF range have a special meaning in UTF-8.
These blocks are reserved as surrogates in the code space (that is,
the character set) because, the way UTF-16 is designed, it has a blind
spot here: it can't encode characters in this range, since any values
in this range are interpreted differently.



| Such an older API should be able to work with a string that meets
| the UTF-16 standard, provided that it contains no surrogates, and is
| therefore indistinguishable from a string that meets the UCS-2
| standard.

It would. It would also be able to work with surrogates provided it
didn't screw them up by breaking up pairs or treating them as two
characters where it should consider them as one. So mostly older code
will work with surrogates just fine.



| APIs that present 16-bit characters or, even worse, arrays of 16-bit
| characters are arguably broken (unless they are documented to be
| using UTF-16, which for the most part they aren't).

Correct.

| However, these probably either 1) don't handle the supplemental
| characters properly, or 2) quietly internally deal with the
| surrogate characters.

Yes.

| Since I'm dealing with Xerces, which is documented to present
| strings as UTF-16, it probably presents each surrogate character
| separately.

Not sure what you mean by 'presents'.



| Since I'm not doing anything fancy internally with the strings other
| than parsing them into a native API string such as Cocoa or Java, it
| doesn't matter too much, as long as I can correctly identify the
| length of the string.

That sounds right. If your code never interprets the 16-bit values as
characters you should be fine.
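
Getting the length right just means counting a surrogate pair as one
character; roughly (an untested sketch, with an invented name):

    /* Count characters (code points) in a UTF-16 buffer, treating a
       surrogate pair as a single character.                          */
    unsigned long utf16_char_count(const unsigned short *s, unsigned long units)
    {
        unsigned long i, count;

        count = 0;
        for (i = 0; i < units; i++) {
            count++;
            /* A high surrogate followed by a low one counts once. */
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                i + 1 < units &&
                s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                i++;
        }
        return count;
    }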

| However, it would probably be more elegant to have internal
| computations work with large characters (32 bit now but expandable
| in the future without breaking anything),

Agreed.

| make my best guess as to what encoding the APIs use when they just
| say "Unicode,"

Agreed. Usually this will be UTF-16, whether the implementors designed
it that way or not. Older implementations will generally just pass the
surrogates through untouched, which means that they will effectively
be using UTF-16, too.

| and have all storage representations be explicitly labeled
| encodings, to be converted to and from the computational encoding as
| necessary, usually on the fly.

Yes. This is really the key point when using Unicode. Always always
always always always do this. Everywhere. Most people have a hard time
getting used to the fact that characters and bytes are different
things, however, and happily leave this step out.



| This internal computational encoding will be indistinguishable from
| UTF-32, that is, until we join the galactic allegiance or something
| and realize that 21 or even 32 bits aren't enough.

By that time Unicode will probably have been replaced by something
else that ditches all the ugly backwards compatibility in any case. :)

Lars Marius Garshol

Aug 18, 2002, 6:59:29 PM

* Eric Pepke

|
| Now, I'm hoping that what you've said isn't quite right. That is,
| it doesn't represent the entire first plane directly. I.e. that it
| doesn't represent character D800 directly with D800 but either
| disallows it from the set or uses a pair of surrogates to mean D800,
| because otherwise it doesn't make sense.

D800 is not a Unicode character and never will be, so UTF-16 can
represent all of Unicode without difficulties.



| I'm also hoping that there aren't any older 16-bit encodings that
| use D800, say, directly to represent a character. That would be not
| much fun at all.

Definitely not, but thankfully the Unicode folks knew better than to
do something as braindead as that. :)

Steinar Bang

Aug 19, 2002, 5:02:05 AM
>>>>> Lars Marius Garshol <lar...@garshol.priv.no>:

> * Eric Pepke

>> Problem: the supplemental Unicode characters use more than sixteen
>> bits. It seems then that the natural, although wasteful, internal
>> representation would be arrays of structures with more than 16
>> bits.

> That is correct. gcc, for example, will represent wchar_t using 32
> bits, and you can use the UTF-32 and UCS-4 encodings to exchange
> Unicode text if you want to.

I believe the wchar_t size used by gcc is defined by the underlying
platform. E.g. on Linux and Solaris it will be 32 bits, but on Win32
(mingw, and maybe Cygwin) it will be 16 bits.

In any case, MSVC has a 16 bit wchar_t so that code that is supposed
to compile and run on both gcc/linux and MSVC/Win32 must handle this
difference somehow (what I did was to use UTF-16 on both platforms).
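
What we ended up with was roughly this (simplified, and the typedef name
is made up):

    /* Simplified sketch; the typedef name is invented.  On MSVC/Win32
       wchar_t is already 16 bits; elsewhere use a plain 16-bit integer. */
    #include <stddef.h>   /* for wchar_t */

    #ifdef _WIN32
    typedef wchar_t        utf16_unit;
    #else
    typedef unsigned short utf16_unit;
    #endif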

Lars Marius Garshol

Aug 19, 2002, 1:26:55 PM

* Lars Marius Garshol

|
| That is correct. gcc, for example, will represent wchar_t using 32
| bits, and you can use the UTF-32 and UCS-4 encodings to exchange
| Unicode text if you want to.

* Steinar Bang


|
| I believe the wchar_t size used by gcc is defined by the underlying
| platform. E.g. on Linux and Solaris it will be 32 bits, but on Win32
| (mingw, and maybe Cygwin) it will be 16 bits.

You're probably right. I only tried this on Linux.



| In any case, MSVC has a 16 bit wchar_t so that code that is supposed
| to compile and run on both gcc/linux and MSVC/Win32 must handle this
| difference somehow (what I did was to use UTF-16 on both platforms).

As did we. We defined our own uni_char and used that instead. It works
well now on more platforms than I can enumerate offhand.

Using UTF-32 would have simplified text selection, line breaking, and
suchlike, but at the cost of conversion at every point of interface
with the OS, and also at the expense of memory consumption. On Linux
Qt is used for display, so UTF-16 was not a bad fit there, either.

Eric Pepke

Aug 19, 2002, 3:16:23 PM
Lars Marius Garshol <lar...@garshol.priv.no> wrote in message news:<m3sn1br...@pc36.avidiaasen.online.no>...

> Agreed. Though how important it is *how* things ended up this way is
> debatable, of course. :)

Well, the history helps me second-guess the psychology of designers then
and designers now and maybe helps me extrapolate a little into the future.

> The surrogate blocks are basically two blocks of code points that were
> reserved in the existing code space (at D800 - DBFF and DC00 - DFFF,
> as Richard wrote) and set aside for use in UTF-16 to indicate
> characters above U+FFFF. The blocks were there all the time, but they
> had no defined function before, which they now do.

Right, and knowing that helps me understand how the Unicode folks think.
Just for pure academic amusement, let me give you an example of a system
that did not go this way. The old CDC CYBER system used 6 bits to represent
each character. This history can be seen in the original Pascal language,
with its "alfa" type to represent ten characters, which were the number of
characters that fit into a 60-bit CYBER word. When they realized they needed
upper- and lower-case characters, they used the caret as the first byte of
a two-byte pair to represent extra characters. Even worse, they called this
ASCII. Trouble is that the caret had been used in the old 6-bit encoding,
so this caused massive problems. I suffered through those years; I'd like
to avoid, inasmuch as it is possible, suffering through more of them by
adopting defensive coding practices.

> Note that the surrogates are *not* characters.

Understood; I'm not confused; I was just typing quickly.

> Not sure what you mean by 'presents'.

Makes available, defines, returns, passes as a parameter to a callback
function, or uses any other means to present to the code using the API.

> By that time Unicode will probably have been replaced by something
> else that ditches all the ugly backwards compatibility in any case. :)

Hah! That's what they said about 2-digit years. And now that this has
been fixed, there's still going to be a lot of code that breaks in 2049,
2050, or 2051. Also that year--I can't remember when it will be--when all
those people who stored UNIX dates into signed integers will wish they had
used unsigned integers.
