What's magic about 0000D800-0000DFFF in UTF-8?

Hannu Aronsson

Dec 20, 1994, 10:05:54 AM

In an earlier article I criticized the vague statement in the UTF-8
documentation:

"The range 0000 D800 to 0000 DFFF is to be excluded from
treatment by the third row of this table which governs the
UCS-4 range 0000 0800 to 0000 FFFF."

After further "study", it seems this is supposed to say that these
codes are reserved for UTF-16 support and they should never be used as
individual code points. What a very unclear way of saying this simple
thing :-(

UTF-16 seems to be a way to encode an additional 1M code points using
two consecutive 16-bit code points to represent them. The range
D800-DFFF has been reserved for this purpose.
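
A minimal sketch of how two 16-bit values can carry one of the extra
1M code points. It assumes the surrogate layout used by UTF-16 (first
half from D800-DBFF, second half from DC00-DFFF); the function names
are hypothetical and not taken from any standard text.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above the BMP into two 16-bit values in D800-DFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF need a pair")
    offset = code_point - 0x10000        # 20-bit value, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> DC00..DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    pair = to_surrogate_pair(0x10000)
    print([hex(v) for v in pair])
```

Each half contributes 10 bits, which is where the 2^20 extra positions
come from: 1024 high values times 1024 low values.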

The reservation of 3% (D800-DFFF) of the core part of the standards,
the BMP, for supporting UTF-16, which can only represent 0.05% of the
full ISO10646 code space seems quite wasteful :-(

I would have thought that the next step after the 64k code points in
Unicode would be full ISO10646 code space support, but some people
seem to think we need something in the middle to confuse things up :-(

Note 1. D800-DFFF = 2k positions, out of 64k, which is ~3%
Note 2. UTF-16 allows for 2^20 extra positions in addition to
Unicode's 2^16, out of the "full" 2^31, i.e. ~0.05%
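
The two notes can be checked with a few lines of arithmetic. This is
only a sketch of the calculation; the variable names are illustrative.

```python
# Note 1: share of the BMP taken by the reserved range D800-DFFF.
reserved = 0xDFFF - 0xD800 + 1           # 2048 positions (2k)
bmp = 2**16                              # 65536 positions (64k)
share_of_bmp = reserved / bmp            # roughly 3%

# Note 2: share of the 31-bit ISO 10646 space reachable with UTF-16.
utf16_extra = 2**20                      # positions reachable via pairs
full_space = 2**31                       # "full" UCS-4 code space
share_of_full = (utf16_extra + bmp) / full_space   # roughly 0.05%

if __name__ == "__main__":
    print(f"reserved share of BMP:     {share_of_bmp:.3%}")
    print(f"UTF-16 share of 2^31 space: {share_of_full:.4%}")
```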

Anyway, it would be nice if some enlightened person would shed some
light on UTF-16 background too :-)

Yours,
Hannu
--
The best thing about standards is that you can have
so many :-( :-( :-( :-( :-( while reading them :-(

Hannu Aronsson

Dec 20, 1994, 9:31:35 AM

I was looking at the UTF-8 proposal (http://www.stonehand.com/unicode/
standard/wg2n1036.html) one day, and noticed that it stated:

"The range 0000 D800 to 0000 DFFF is to be excluded from
treatment by the third row of this table which governs the
UCS-4 range 0000 0800 to 0000 FFFF."

The third row is the one which would be encoded in 3 bytes as 1110xxxx
10xxxxxx 10xxxxxx, i.e. values needing 12-16 bits.
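
A sketch of the third-row bit layout in code. It assumes that the
exclusion means "do not encode D800-DFFF at all", which is only one
possible reading of the quoted sentence; the function name is
hypothetical.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Encode a value from 0x0800..0xFFFF as 1110xxxx 10xxxxxx 10xxxxxx."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("third row covers 0x0800..0xFFFF only")
    # Assumption: the excluded range is rejected rather than re-routed.
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("0xD800..0xDFFF is excluded from the third row")
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])


if __name__ == "__main__":
    print(encode_three_byte(0x0800).hex())
    print(encode_three_byte(0xFFFF).hex())
```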

Pardon my limited English, but the wording seems very unclear
and vague to me. I'm wondering what they are trying to say:

- we should not encode this range with the 3rd row rule, but using
for example the 4th row rules (4 bytes, 11110xxx and 3 10xxxxxx's)
- we should not encode this range at all, and consider them as
erroneous codes
- we should not encode this range at all, and silently ignore codes
in it
- when decoding UTF-8 and seeing a value in this range, you should
ignore them, or perhaps generate an error if it was represented
as a 3-byte sequence?
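
To make the last reading concrete, here is a sketch of a decoder for a
single 3-byte sequence that raises an error when the decoded value
falls in D800-DFFF. This is an assumption about the intended behavior,
not something the proposal spells out; the function name is
hypothetical.

```python
def decode_three_byte(data: bytes) -> int:
    """Decode one 1110xxxx 10xxxxxx 10xxxxxx sequence into a code value."""
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes")
    b0, b1, b2 = data
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-byte sequence")
    value = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    if value < 0x0800:
        # Such values belong to the shorter rows of the table.
        raise ValueError("value below 0x0800 in 3-byte form")
    if 0xD800 <= value <= 0xDFFF:
        # Assumed reading: treat the excluded range as an error.
        raise ValueError("0xD800..0xDFFF must not appear as a 3-byte sequence")
    return value


if __name__ == "__main__":
    print(hex(decode_three_byte(b"\xe2\x82\xac")))
```

The "silently ignore" reading would replace the final raise with code
that skips the sequence instead.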

Furthermore, I don't see off-hand any reason to make a special case
of some arbitrary range of code points, even if it happened to be
e.g. a non-standard private-use zone or something like that.

The only thing they might achieve with this scheme would be to avoid
11101101 (0xED) bytes in the UTF-8 data stream, which doesn't seem so
useful either (This code point range would otherwise be
represented with 1110(1101) 10(1xxxxx) 10(xxxxxx) bytes).
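
The byte pattern can be checked by applying the third-row layout
directly to the ends of the range (a sketch; the helper name is
hypothetical). It also shows that 0xED still occurs as a lead byte for
D000-D7FF, so the exclusion rules out only 0xED followed by a byte in
A0-BF, not the 0xED byte itself.

```python
def third_row_bytes(value: int) -> tuple[int, int, int]:
    """Apply the 1110xxxx 10xxxxxx 10xxxxxx layout with no range checks."""
    return (
        0xE0 | (value >> 12),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    )


first_excluded = third_row_bytes(0xD800)   # (0xED, 0xA0, 0x80)
last_excluded = third_row_bytes(0xDFFF)    # (0xED, 0xBF, 0xBF)
just_below = third_row_bytes(0xD7FF)       # (0xED, 0x9F, 0xBF)

if __name__ == "__main__":
    for name, triple in [("D800", first_excluded),
                         ("DFFF", last_excluded),
                         ("D7FF", just_below)]:
        print(name, " ".join(f"{b:02X}" for b in triple))
```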

Would some enlightened person shed some light on this?

Yours,
Hannu
--
Hannu.A...@cs.hut.fi h...@ftp.funet.fi h...@unda.fi

The best thing about standards is that there are so many
to choose fromXXXXXXXXXXXX ways to read them...

Glenn A. Adams

Dec 20, 1994, 12:45:09 PM

In article <HAA.94De...@kaarnavene.cs.hut.fi>,
Hannu Aronsson <h...@cs.hut.fi> wrote:
>The reservation of 3% (D800-DFFF) of the core part of the standards,
>the BMP, for supporting UTF-16, which can only represent 0.05% of the
>full ISO10646 code space seems quite wasteful :-(
>
>I would have thought that the next step after the 64k code points in
>Unicode would be full ISO10646 code space support, but some people
>seem to think we need something in the middle to confuse things up :-(

It is the belief of the developers of UTF-16 (i.e., ANSI X3L2 and
the Unicode Technical Committee) that, while the BMP by itself is
not adequate (not large enough) for encoding *all* useful characters,
the additional 14 standardizable planes accessible from UTF-16 are,
in fact, adequate for *all* standardized character encoding requirements.
Consequently, as far as the UTC is concerned, there is no UCS space
outside of these UTF-16 accessible planes.

Perhaps you could be more explicit about your ideas for why it
might be desirable to encode more than 2^20 + 2^16 standardized
characters?

Regards,
Glenn Adams

David Goldsmith

Dec 20, 1994, 11:00:00 PM

In article <HAA.94De...@kaarnavene.cs.hut.fi>, h...@cs.hut.fi (Hannu
Aronsson) wrote:
> UTF-16 seems to be a way to encode an additional 1M code points using
> two consecutive 16-bit code points to represent them. The range
> D800-DFFF has been reserved for this purpose.
>
> The reservation of 3% (D800-DFFF) of the core part of the standards,
> the BMP, for supporting UTF-16, which can only represent 0.05% of the
> full ISO10646 code space seems quite wasteful :-(
>
> I would have thought that the next step after the 64k code points in
> Unicode would be full ISO10646 code space support, but some people
> seem to think we need something in the middle to confuse things up :-(
>
> Note 1. D800-DFFF = 2k positions, out of 64k, which is ~3%
> Note 2. UTF-16 allows for 2^20 extra positions in addition to
> Unicode's 2^16, out of the "full" 2^31, i.e. ~0.05%
>
> Anyway, it would be nice if some enlightened person would shed some
> light on UTF-16 background too :-)
>

The idea is that most characters in everyday use can be encoded in the
BMP. Rare and esoteric characters (or compatibility forms) can be encoded
in the planes accessible via UTF-16. Current thinking seems to be that no
characters will need to be allocated beyond those reachable from UTF-16.
Therefore, for typical use, the UCS-2 form suffices, and there's no need
to use UCS-4 as would be necessary if all 2^31 codes were used. Almost all
characters (from a usage standpoint, not code allocation) fit in the BMP,
and occasionally you need UTF-16 for something out of the ordinary.

That is the current philosophy for future code allocations as I understand it.

--
David Goldsmith
Taligent, Inc.
10201 N. De Anza Blvd.
Cupertino, CA 95014-2233
David_G...@taligent.com