When I wrote the UTF-8 validator routine
http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c
after having given the issue some thought, I decided to reject all of
the above without any further discrimination. It just makes things much
simpler and cleaner should there ever be any UTF-16 conversion
afterwards, if such problem sequences are caught as early as possible.
In particular:
- 0xfffe could be misinterpreted by a later process as an anti-BOM, and
- 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.
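
For illustration, the rule expressed on decoded values (a sketch of the
policy only; utf8_check.c folds the equivalent tests into the byte-level
validation rather than decoding first):

  #include <stdint.h>

  /* Reject surrogates, 0xfffe/0xffff, and anything beyond 0x10ffff. */
  static int is_rejected(uint32_t c)
  {
      return (c >= 0xD800 && c <= 0xDFFF)   /* UTF-16 surrogates     */
          || c == 0xFFFE || c == 0xFFFF     /* anti-BOM / WEOF traps */
          || c >= 0x110000;                 /* beyond UTF-16's range */
  }
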
Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
Since asking, I've been reading lots of standards material and talking
to several other people...
> > The three cases are probably best treated separately:
> >
> > - The range 0xd800-0xdfff. [...]
> > - 0xfffe-0xffff [...]
> > - The range >= 0x110000
>
> When I wrote the UTF-8 validator routine
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c
Wow, you should really try to cut down on the number of conditionals
there. It will kill performance due to mispredicted branches (with
that many, I think you'll even overflow the branch predictor).
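
For instance (a sketch of the direction I mean, illustrated on a decoded
value for brevity rather than on your byte-level checks), the range
tests can be folded into flag arithmetic so that only the final test is
a conditional branch:

  #include <stdint.h>

  /* Each comparison becomes a flag-setting instruction; the results
   * are OR-ed together, so only the final test can mispredict. */
  static int is_rejected_fewer_branches(uint32_t c)
  {
      uint32_t bad = ((c - 0xD800) < 0x0800)  /* surrogate           */
                   | ((c & ~1u) == 0xFFFE)    /* 0xfffe or 0xffff    */
                   | (c >= 0x110000);         /* beyond UTF-16 range */
      return bad != 0;
  }
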
> after having given the issue some thought, I decided to reject all of
> the above without any further discrimination. It just makes things much
> simpler and cleaner should there ever be any UTF-16 conversion
> afterwards, if such problem sequences are caught as early as possible.
I tend to disagree. Code points such as U+FFFE and U+FFFF are valid
'Unicode scalar values' in the wording of the Unicode standard, and if
you wanted to forbid noncharacter codepoints there are several others
as well.
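
For reference, the standard's noncharacters are U+FDD0..U+FDEF plus the
last two code points of every plane, 66 in all; a sketch of the complete
test:

  #include <stdint.h>

  /* All 66 noncharacters, not just U+FFFE/U+FFFF. */
  static int is_noncharacter(uint32_t c)
  {
      return (c >= 0xFDD0 && c <= 0xFDEF)
          || ((c & 0xFFFE) == 0xFFFE && c <= 0x10FFFF);
  }
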
> In particular:
>
> - 0xfffe could be misinterpreted by a later process as an anti-BOM, and
This could be an issue on Windows/Java, but neither BOMs nor UTF-16
are used on unix systems, for very good reason.
> - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.
Any implementation with sizeof(wchar_t)==2 is not ISO C compliant, unless
its adopted subset of UCS/Unicode is the 16-bit portion only (i.e. it is
not using UTF-16). The standard says that wchar_t represents a whole
character and that there are no such things as multi-wchar_t
characters or shift states for wchar_t.
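
To illustrate the whole-character requirement (a sketch; it assumes the
host actually has a UTF-8 locale installed under this name):

  #include <stdio.h>
  #include <locale.h>
  #include <wchar.h>

  /* U+1D11E MUSICAL SYMBOL G CLEF: one character, four UTF-8 bytes.
   * mbrtowc() must deliver it as a single wchar_t, which cannot fit
   * in 16 bits; hence the compliance problem with wchar_t = uint16. */
  int main(void)
  {
      if (!setlocale(LC_CTYPE, "en_US.UTF-8"))
          return 1;
      const char *s = "\xF0\x9D\x84\x9E";
      wchar_t wc;
      mbstate_t st = {0};
      size_t n = mbrtowc(&wc, s, 4, &st);
      printf("consumed %zu bytes -> U+%05lX\n", n, (unsigned long)wc);
      return 0;
  }
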
I agree that noncharacter code points should not be written to text
files for interchange with other applications; however, they are valid
for internal use, and must remain valid for internal use invariant
under conversions between different encoding forms. Forbidding them
belongs at the level of the file writer in an application generating
files for interchange, not at the mbrtowc/wcrtomb level. The latter
could interfere with legitimate internal processing, and just slows
UTF-8 processing down even more.
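
Concretely, the layering I have in mind (a sketch; is_noncharacter() as
above, the rest illustrative):

  #include <stdio.h>
  #include <stdint.h>

  /* Interchange-time filter: internal code sees every scalar value,
   * and only the file writer screens out noncharacters. */
  static int put_for_interchange(FILE *f, uint32_t c)
  {
      unsigned char b[4];
      int n;
      if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)
          || is_noncharacter(c))
          return -1;            /* caller decides how to recover */
      if      (c < 0x80)    { b[0] = c; n = 1; }
      else if (c < 0x800)   { b[0] = 0xC0 | (c >> 6);
                              b[1] = 0x80 | (c & 0x3F); n = 2; }
      else if (c < 0x10000) { b[0] = 0xE0 | (c >> 12);
                              b[1] = 0x80 | ((c >> 6) & 0x3F);
                              b[2] = 0x80 | (c & 0x3F); n = 3; }
      else                  { b[0] = 0xF0 | (c >> 18);
                              b[1] = 0x80 | ((c >> 12) & 0x3F);
                              b[2] = 0x80 | ((c >> 6) & 0x3F);
                              b[3] = 0x80 | (c & 0x3F); n = 4; }
      return fwrite(b, 1, n, f) == (size_t)n ? 0 : -1;
  }
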
Rich
P.S. Did your old thread about converting invalid bytes to high
surrogate codepoints (for binary-clean in-band error reporting) when
decoding UTF-8 ever reach a conclusion? I found part of the thread
while doing a search, but didn't see where it ended up.
But many "unix systems" sit in heterogeneous environments with UTF-16
protocols (NTFS, CIFS, etc.), talk to Windows/Java platforms, may even
run or be ported onto Windows, or host ported Windows applications. The
world is not always as simple as we would like it to be. :(
> > - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.
>
> Any implementation with sizeof(wchar_t)==2 is not ISO C compliant,
Your interpretation of the holy book of ISO C! Systems with wchar_t =
uint16 exist and are widely deployed. People fight wars about holy books.
If you worry about UTF-16 at all, then I think you should also worry
about these two. Otherwise, there is no point in worrying about
surrogates either.
> The standard says that wchar_t represents a whole
> character and that there are no such things as multi-wchar_t
> characters or shift states for wchar_t.
The wchar_t parts of ISO C that you refer to were written before 1995
(Amendment 1) by people primarily interested in EUC and other ISO
2022-like schemes, and have not been substantially revised since. UTF-16
(published 1996) was not around when the text you now interpret was
written. You may be stretching the standard beyond its interpretive
capacity if you unify terms like "character" from different committees
and epochs in character-set history.
(Don't misunderstand me, I am not a fan of sizeof(wchar_t)==2; I merely
want to warn of the limits of reading things into standards that
the authors could not have been aware of, such as Unicode's current
character and encoding model.)
> I agree that noncharacter code points should not be written to text
> files for interchange with other applications; however, they are valid
> for internal use, and must remain valid for internal use invariant
> under conversions between different encoding forms. Forbidding them
> belongs at the level of the file writer in an application generating
> files for interchange, not at the mbrtowc/wcrtomb level. The latter
> could interfere with legitimate internal processing, and just slows
> UTF-8 processing down even more.
Pah, new-fangled, misguided Unicode view of the world! :)
Holy books are best when they are old ...
ISO 10646-1:1993/Am.2 (1996), section R.4, forbids both U+FFFF and
U+FFFE in UTF-8:
"NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are reserved
for the UTF-16 form and do not occur in UCS-4. The values 0000 FFFE and
0000 FFFF also do not occur (see clause 8). The mappings of these code
positions in UTF-8 are undefined."
http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
> P.S. Did your old thread about converting invalid bytes to high
> surrogate codepoints (for binary-clean in-band error reporting) when
> decoding UTF-8 ever reach a conclusion? I found part of the thread
> while doing a search, but didn't see where it ended up.
I spent some time investigating schemes that create an isomorphic
mapping between malformed UTF-8 and malformed UTF-16. They all
got horribly complicated and unpleasant. I don't think that there is a
neat and efficient isomorphic mapping. The simpler approach is to define
two separate surjective encodings, one to represent malformed UTF-8 in
UTF-16, and the other to represent malformed UTF-16 in UTF-8, without
asking for one to be the inverse of the other.
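
For example, the simplest scheme of this kind in the UTF-8 to UTF-16
direction (a sketch of the idea): represent each byte of a malformed
UTF-8 sequence as a lone low surrogate, which a cooperating encoder can
turn back into the raw byte:

  #include <stdint.h>

  /* 0x80..0xFF -> U+DC80..U+DCFF; lone low surrogates cannot appear
   * in well-formed UTF-16, so the escape values are unambiguous. */
  static uint16_t escape_bad_byte(unsigned char b)
  {
      return 0xDC00 | b;
  }

  static int unescape_unit(uint16_t u, unsigned char *out)
  {
      if (u >= 0xDC80 && u <= 0xDCFF) {
          *out = (unsigned char)(u & 0xFF);
          return 1;             /* was an escaped raw byte */
      }
      return 0;                 /* ordinary code unit */
  }
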
Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
Yes and no. This UTF-16 processing should be isolated to programs that
actually deal with the windows data, such as samba. The only other
program I can think of that should ever have to handle UTF-16 on a
unix system is a web browser, for decoding UTF-16 documents served by
severely misconfigured Windows webservers (a huge waste of bandwidth).
(Even if all the content is CJK, HTML is full of bloated ascii tags
which will all double in size with UTF-16, negating any marginal size
savings.)
> > > - 0xffff equals WEOF in most sizeof(wchar_t)=2 implementations.
> >
> > Any implementation with sizeof(wchar_t)==2 is not ISO C compliant,
>
> Your interpretation of the holy book of ISO C! Systems with wchar_t =
> uint16 exist and are widely deployed. People fight wars about holy books.
>
> If you worry about UTF-16 at all, then I think you should also worry
> about these two. Otherwise, there is no point in worrying about
> surrogates either.
There is a big difference. If surrogates decoded from UTF-8 are
converted to UTF-16 wrongly, it creates two ways of representing the
same character (major security issue). Otherwise you just end up with
invalid characters in the output (which you should be checking for on
write, anyway). (There's also the issue with the BOM, which could have
security implications... I still can't believe people actually do
something so stupid and blatantly incorrect as processing a BOM.)
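
Back to the surrogate point, to make the two-representations hazard
concrete (my worked example): if a converter maps decoded surrogates
straight into UTF-16 code units, U+10000 gains a second UTF-8 spelling,
and a filter that only scans for the 4-byte form can be bypassed:

  /* U+10000 written correctly, as one 4-byte sequence: */
  static const unsigned char direct[] =
      { 0xF0, 0x90, 0x80, 0x80 };
  /* The same character smuggled as two 3-byte-encoded surrogates
   * (U+D800, U+DC00); a naive converter turns both buffers into the
   * identical UTF-16 pair D800 DC00: */
  static const unsigned char via_surrogates[] =
      { 0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80 };
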
> > The standard says that wchar_t represents a whole
> > character and that there are no such things as multi-wchar_t
> > characters or shift states for wchar_t.
>
> The wchar_t parts of ISO C that you refer to were written before 1995
> (Amendment 1) by people primarily interested in EUC and other ISO
> 2022-like schemes, and have not been substantially revised since. UTF-16
> (published 1996) was not around when the text you now interpret was
> written. You may be stretching the standard beyond its interpretive
> capacity if you unify terms like "character" from different committees
> and epochs in character-set history.
I agree it's dangerous to equate different definitions of character,
especially since ISO C uses "character" (without the word multibyte)
to mean "byte". However, the part about lack of shift/decoding state
is fairly clear that multi-wchar_t character encoding is not supposed
to exist.
> (Don't misunderstand me, I am not a fan of sizeof(wchar_t)==2; I merely
> want to warn of the limits of reading things into standards that
> the authors could not have been aware of, such as Unicode's current
> character and encoding model.)
IIRC the same language appears in C99, although I've only read the
draft myself, not the final standard, and not in detail.
> > I agree that noncharacter code points should not be written to text
> > files for interchange with other applications; however, they are valid
> > for internal use, and must remain valid for internal use invariant
> > under conversions between different encoding forms. Forbidding them
> > belongs at the level of the file writer in an application generating
> > files for interchange, not at the mbrtowc/wcrtomb level. The latter
> > could interfere with legitimate internal processing, and just slows
> > UTF-8 processing down even more.
>
> Pah, new-fangled, misguided Unicode view of the world! :)
> Holy books are best when they are old ...
Indeed! However holy books are more accessible when they're published
publicly on the web.
> ISO 10646-1:1993/Am.2 (1996), section R.4, forbids both U+FFFF and
> U+FFFE in UTF-8:
>
> "NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are reserved
> for the UTF-16 form and do not occur in UCS-4. The values 0000 FFFE and
> 0000 FFFF also do not occur (see clause 8). The mappings of these code
> positions in UTF-8 are undefined."
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
OK, in my view ISO-10646 trumps Unicode, so I will accept your
conclusion on the matter. (Unicode is full of wordprocessor-oriented,
Windows-oriented, 16bit-oriented, etc. crap. Sometimes it's useful for
deriving things like character classes, etc. so I don't want to
dismiss it entirely, but ISO-10646 is much more vendor-neutral and
lacks various stupid semantic requirements that conflict with C and
POSIX.)
BTW your link seems to be to an old version, since my understanding is
that ISO-10646 has since forbidden overlong character encodings (and
also code points above 10ffff..?).
> > P.S. Did your old thread about converting invalid bytes to high
> > surrogate codepoints (for binary-clean in-band error reporting) when
> > decoding UTF-8 ever reach a conclusion? I found part of the thread
> > while doing a search, but didn't see where it ended up.
>
> I spent some time investigating schemes that create an isomorphic
> mapping between malformed UTF-8 and malformed UTF-16. They all
> got horribly complicated and unpleasant. I don't think that there is a
> neat and efficient isomorphic mapping. The simpler approach is to define
> two separate surjective encodings, one to represent malformed UTF-8 in
> UTF-16, and the other to represent malformed UTF-16 in UTF-8, without
> asking for one to be the inverse of the other.
IMO there's no good solution. Any such conversion is subject to the
flaw that string concatenation and conversion between encodings do not
commute, which is a Very Bad Thing and could have security
implications. Unless you have a better solution, my view is that
applications wishing to be binary-clean should either keep the data as
bytes internally (processing it as UTF-8 in a 'JIT' manner for
display, searching, etc.) or use their own internal representation.
Regardless, the C library implementation should do nothing but signal
(OOB) error on invalid sequences and leave additional handling to the
application. If you have a different view I'd be happy to hear it.
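
To spell out the non-commutativity (a worked example): split the UTF-8
encoding of U+20AC across a read boundary, and a stateless per-chunk
conversion with in-band error mapping sees malformed input twice where
the unsplit stream has one valid character:

  /* Converting chunk_a then chunk_b with error escapes yields three
   * escape code points; converting whole yields the single character
   * U+20AC. Concatenate-then-convert != convert-then-concatenate. */
  static const char chunk_a[] = "\xE2";         /* truncated alone   */
  static const char chunk_b[] = "\x82\xAC";     /* stray trail bytes */
  static const char whole[]   = "\xE2\x82\xAC"; /* valid together    */
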
Rich
Actually this got me thinking about whether it's necessary or
appropriate to bother with signalling errors for surrogate codepoints
and noncharacters at all when decoding UTF-8 in mb[r]towc or other
similar interfaces. The error conditions basically are:
- Overly long representations: these are inherently a security problem
when using UTF-8 because they make the round-trip map between UTF-8
and UCS a non-identity map in some cases (see the sketch after this
list).
- Surrogates: these have no security implications as long as the
encodings in use are only UTF-8 and UCS character numbers (wchar_t).
They only become a problem if someone converts to UTF-16 by applying
the identity map to all code points below 0x10000 without checking
for illegal surrogates, in which case their presence will make the
round trip between UTF-8 and UTF-16 non-identity.
- FFFE: no implications for a UTF-8- and wchar_t-only system. When
converted to UTF-16 or UTF-32, may cause systems which honor a BOM
to misinterpret the text entirely, which may have security
implications (e.g. 2F00 gets interpreted as 002F).
- FFFF: may be interpreted as WEOF by broken systems with 16bit
wchar_t. Otherwise a non-issue.
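
The overly-long case deserves the emphasis; a sketch of the classic
instance and the shortest-form check that stops it:

  /* "\xC0\xAF" decodes to 0x2F ('/') if the decoder does not insist
   * on the shortest form, so "../" filters can be slipped past.
   * For 2-byte sequences, requiring a C2..DF lead byte is enough: */
  static int valid_2byte(unsigned char b0, unsigned char b1)
  {
      return b0 >= 0xC2 && b0 <= 0xDF && (b1 & 0xC0) == 0x80;
  }
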
If UTF-8 is going to be the universal character encoding on *nix
systems (and hopefully Internet protocols, embedded systems, and all
other non-MS systems) for the foreseeable future, it's in the utmost
interest of users for performance to be maximized and code size to be
minimized. Otherwise there is a strong urge to stick with legacy 8bit
encodings.
Of the above error conditions, only overly long sequences affect a
system that only uses UTF-8 and wchar_t, which is the vast majority of
applications. I strongly wonder whether checking for surrogates and
illegal noncharacter codepoints should be moved to the UTF-16 encoder
(in iconv, or other implementations) and omitted from the UTF-8
decoder. The benefits:
- In the naive C implementation with conditional branches for all the
error condition checks, this eliminates two subtractions and two
conditional branches per 3-byte sequence (basically all Asian
scripts). In very naive implementations, these operations would have
been performed for ALL non-ASCII characters.
- In the optimized C implementation with bit twiddling for error
conditions, this eliminates 4 subtractions, 2 bitwise ors, and 1
bitshift per 3-byte sequence. Cache impact of reduced code should be
significant.
- In my heavily optimized x86 implementation, this eliminates 19 bytes
of code (~10% of the total function, and closer to 20% if you only
count the code that gets executed for BMP characters), comprising 7
instructions with heavy data dependencies between them, per 3-byte
sequence. I would estimate about 20 cycles on a modern cpu, plus
time saved due to lowered cache impact.
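
For concreteness, a sketch of the kind of check being weighed (not the
exact code I measured) for a 3-byte sequence with lead byte E0..EF:

  #include <stdint.h>

  /* "err" accumulates every failure and is tested once; dropping the
   * last two terms is the kind of saving tallied above. */
  static int check3(unsigned char b0, unsigned char b1, unsigned char b2)
  {
      uint32_t c = ((uint32_t)(b0 & 0x0F) << 12)
                 | ((uint32_t)(b1 & 0x3F) << 6)
                 |  (uint32_t)(b2 & 0x3F);
      uint32_t err = ((b1 & 0xC0) != 0x80)     /* trail byte 1 */
                   | ((b2 & 0xC0) != 0x80)     /* trail byte 2 */
                   | (c < 0x800)               /* overlong     */
                   | ((c - 0xD800) < 0x800)    /* surrogate    */
                   | ((c & 0xFFFE) == 0xFFFE); /* fffe/ffff    */
      return err == 0;
  }
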
Naturally the worth of these gains is very questionable. NOT because
computers are "getting faster" -- the idea that you can write slow
code because Western Europe and America have fast computers should not
be tolerated among people interested in i18n and m17n for a second!!
-- but because the gains are _fairly_ small. On the other hand, the
practical benefits of signalling surrogates and fffe/ffff as errors in
an application which does not deal with UTF-16 are nonexistent.
Markus, Bruno, and others: I'd like to hear your opinions on this
matter. FYI: isomorphism between malformed UTF-8 and invalid wchar_t
values is totally possible without excluding surrogates. Only the
ideas for isomorphism to malformed UTF-16 suffer.
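
One possible construction (purely illustrative, not a worked-out
proposal): with a 32-bit wchar_t there is spare room above 0x10ffff, so
undecodable bytes can be parked where no scalar value, surrogates
included, can collide:

  #include <stdint.h>

  /* Invalid byte <-> invalid wchar_t, trivially a bijection; UTF-16
   * has no comparable spare range, which is why that case is hard. */
  static uint32_t bad_byte_to_wc(unsigned char b)
  {
      return 0x110000u + b;
  }

  static int wc_to_bad_byte(uint32_t wc, unsigned char *b)
  {
      if (wc >= 0x110000u && wc <= 0x1100FFu) {
          *b = (unsigned char)(wc - 0x110000u);
          return 1;
      }
      return 0;
  }
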
Rich