utf-8 and well-formed but illegal chars

Rich Felker

Jan 18, 2006, 7:15:23 PM

hope this isn't too off-topic -- i'm working on a utf-8 implementation
and trying to decide what to do with byte sequences that are
well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
illegal sequences (EILSEQ) or decoded as ordinary characters? is there
a good reference on the precedents? my main reference is the
linux/unix unicode faq (http://www.cl.cam.ac.uk/~mgk25/unicode.html)
which is somewhat ambiguous on the matter.

rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/

Bruno Haible

Jan 19, 2006, 6:50:09 AM

Rich Felker wrote:
> hope this isn't too off-topic -- i'm working on a utf-8 implementation
> and trying to decide what to do with byte sequences that are
> well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> a good reference on the precedents?

The three cases are probably best treated separately:

- The range 0xd800-0xdfff. You should catch and reject them as invalid when
you are programming a conversion to UCS-2 or UTF-16, for example
UTF-8 -> UTF-16
or
UCS-4 -> UTF-16
Otherwise it becomes possible for malicious users to create non-BMP
characters at a level of processing where earlier stages of processing
did not see them.

In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.

- For the other two ranges, the advice is dictated merely by consistency.

Most software layers treat 0xfffe-0xffff like unassigned Unicode characters,
therefore there is no need to catch them.

The range >= 0x110000, I would catch and reject as invalid. Some time ago
I had a crash in an application because the first level of processing
rejected only values >= 0x80000000, with a reasonable error message, and
later processing relied on valid Unicode and called abort() when a
character code >= 0x110000 was seen. Making the first level as strict
as the later one fixed this.
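
A minimal sketch of the UCS-4 -> UTF-16 case described above, rejecting both the surrogate range and values >= 0x110000 while letting 0xfffe-0xffff through. The function name `ucs4_to_utf16` and its return convention are assumptions made for illustration, not taken from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one UCS-4 value as UTF-16.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if the value is rejected as invalid. */
static size_t ucs4_to_utf16(uint32_t c, uint16_t out[2])
{
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                /* surrogate code point: reject */
    if (c >= 0x110000)
        return 0;                /* beyond the Unicode range: reject */
    if (c < 0x10000) {
        out[0] = (uint16_t)c;    /* BMP value, including 0xfffe-0xffff */
        return 1;
    }
    c -= 0x10000;                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (c >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (c & 0x3FF));  /* low surrogate */
    return 2;
}
```

Without the first check, an input of 0xd800 followed by 0xdc00 would come out as a well-formed surrogate pair, which is exactly the non-BMP character that earlier stages never saw.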

Bruno

Rich Felker

Jan 20, 2006, 9:31:15 AM

On Thu, Jan 19, 2006 at 12:50:09PM +0100, Bruno Haible wrote:
> Rich Felker wrote:
> > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > and trying to decide what to do with byte sequences that are
> > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > a good reference on the precedents?
>
> The three cases are probably best treated separately:
>
> - The range 0xd800-0xdfff. You should catch and reject them as invalid when
> you are programming a conversion to UCS-2 or UTF-16, for example
> UTF-8 -> UTF-16
> or
> UCS-4 -> UTF-16
> Otherwise it becomes possible for malicious users to create non-BMP
> characters at a level of processing where earlier stages of processing
> did not see them.
>
> In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.

Thanks for the comments. Actually, though, you've convinced me that the
UTF-16 surrogates always need to be treated as errors, since the
user may later submit the decoded UCS numbers ("UCS-4") to a
buggy UTF-16 implementation (or a pre-UTF-16 UCS-2 writer).

Moreover, there's nothing valid these characters can possibly mean.

> - For the other two ranges, the advice is dictated merely by consistency.
>
> Most software layers treat 0xfffe-0xffff like unassigned Unicode characters,
> therefore there is no need to catch them.

I was thinking it would be good to reject them in order to detect
non-UTF-8 data more reliably, but the sequences [ef bf be] and
[ef bf bf] are extremely unlikely in any other encoding as far as I
know, so checking for them is probably just a useless performance hit.
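
For reference, the two byte sequences above follow directly from the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx. A small sketch, assuming a hypothetical helper `utf8_encode3` that handles only the three-byte case and does no validity checking:

```c
#include <stdint.h>

/* Hypothetical helper: encode a value in 0x800..0xFFFF as three
 * UTF-8 bytes. Illustration only; no range or validity checks. */
static void utf8_encode3(uint32_t c, unsigned char out[3])
{
    out[0] = (unsigned char)(0xE0 | (c >> 12));          /* top 4 bits */
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));  /* middle 6 bits */
    out[2] = (unsigned char)(0x80 | (c & 0x3F));         /* low 6 bits */
}
```

With this, 0xfffe gives ef bf be and 0xffff gives ef bf bf.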

Good to know the precedent here.

> The range >= 0x110000, I would catch and reject as invalid. Some time ago
> I had a crash in an application because the first level of processing
> rejected only values >= 0x80000000, with a reasonable error message, and
> later processing relied on valid Unicode and called abort() when a
> character code >= 0x110000 was seen. Making the first level as strict
> as the later one fixed this.

I agree. What's worse, someone may try to use UCS character numbers as
an index into a lookup table (a large table, but still a possibility)
without checking that they're in range (assuming the decoder will only
output valid numbers), with disastrous results.
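
A sketch of the lookup-table scenario with the range check in place. The table `prop_table`, the function `lookup_prop`, and the -1 error convention are all hypothetical; the table contents are placeholders.

```c
#include <stdint.h>

#define UNICODE_LIMIT 0x110000u  /* one past the last code point, 0x10ffff */

/* Hypothetical per-code-point property table (zero-filled placeholder). */
static unsigned char prop_table[UNICODE_LIMIT];

/* Returns the table entry for c, or -1 if c is outside the table. */
static int lookup_prop(uint32_t c)
{
    if (c >= UNICODE_LIMIT)
        return -1;               /* would otherwise read out of bounds */
    return prop_table[c];
}
```

If the decoder passed through a value such as 0x1fffff and the caller skipped this check, the index would land roughly a megabyte past the end of the table.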

Thanks again for your suggestions.

Rich

Rich Felker

Jan 20, 2006, 10:40:49 AM

On Thu, Jan 19, 2006 at 12:50:09PM +0100, Bruno Haible wrote:

> Rich Felker wrote:
> > hope this isn't too off-topic -- i'm working on a utf-8 implementation
> > and trying to decide what to do with byte sequences that are
> > well-formed but represent illegal code positions, i.e. 0xd800-0xdfff,
> > 0xfffe-0xffff, and 0x110000-0x1fffff. should these be treated as
> > illegal sequences (EILSEQ) or decoded as ordinary characters? is there
> > a good reference on the precedents?
>
> The three cases are probably best treated separately:
>
> - The range 0xd800-0xdfff. You should catch and reject them as invalid when
> you are programming a conversion to UCS-2 or UTF-16, for example
> UTF-8 -> UTF-16
> or
> UCS-4 -> UTF-16
> Otherwise it becomes possible for malicious users to create non-BMP
> characters at a level of processing where earlier stages of processing
> did not see them.
>
> In a conversion from UTF-8 to UCS-4 you don't need to catch 0xd800-0xdfff.
>
> - For the other two ranges, the advice is dictated merely by consistency.
>
> Most software layers treat 0xfffe-0xffff like unassigned Unicode characters,
> therefore there is no need to catch them.
>
> The range >= 0x110000, I would catch and reject as invalid. Some time ago

> [...]

To follow up in case anyone cares: the Unicode standard agrees with
what you've said, except that 0xd800-0xdfff should always be rejected:

----------------------------------------------------------------------
D28 Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.
* As a result of this definition, the set of Unicode scalar
values consists of the ranges 0 to D7FF and E000 to 10FFFF,
inclusive.

D29 A Unicode encoding form assigns each Unicode scalar value to a
unique code unit sequence.
----------------------------------------------------------------------

The standard goes on to clarify that an encoding form maps ALL Unicode
scalar values to code unit sequences, including noncharacter code
points and unassigned code points, and that this mapping does not
include the UTF-16 surrogate range.
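
Put together, a decoder following D28 might look roughly like the sketch below. This is not the implementation under discussion; the names `is_scalar_value` and `utf8_decode` and the return convention are assumptions. As a simplification, truncated input is also reported as EILSEQ.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* D28: scalar values are 0..0xD7FF and 0xE000..0x10FFFF. */
static int is_scalar_value(uint32_t c)
{
    return c < 0xD800 || (c >= 0xE000 && c <= 0x10FFFF);
}

/* Hypothetical decoder: reads one UTF-8 sequence from s (n bytes
 * available), stores the value in *out, and returns the number of
 * bytes consumed. Returns 0 with errno set to EILSEQ on malformed,
 * overlong, truncated, or non-scalar input. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest value that legitimately needs a sequence of each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t len, i;

    if (n == 0)
        goto bad;
    if (s[0] < 0x80)                { c = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else
        goto bad;                   /* stray continuation or invalid lead byte */
    if (len > n)
        goto bad;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            goto bad;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_for_len[len])
        goto bad;                   /* overlong encoding */
    if (!is_scalar_value(c))
        goto bad;                   /* surrogate, or >= 0x110000 */
    *out = c;
    return len;
bad:
    errno = EILSEQ;
    return 0;
}
```

Note that noncharacters such as 0xfffe and 0xffff pass `is_scalar_value`, consistent with the standard's statement that encoding forms map them like any other scalar value.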

Rich