
UTF-8 overlong encodings


Rainer Weikusat
Dec 17, 2021, 5:55:56 PM

As usual with technical terms "everyone understands", this one gets thrown
around everywhere but is never defined. The definition I derived is
below.

The non-ASCII part of UTF-8 is composed of 5 ranges each of which
starts with a number which has only one bit set. The starting numbers
are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
'overlong' when this start bit isn't set.

Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
and 26.

Each range is composed of a number of six bit blocks plus a
remainder which gets put into the byte starting the encoded
sequence. Again expressed as (left) shift arguments, the highest bits of
the left-most six bit blocks are 5, 11, 17, 23, 29.

Subtracting the shift value corresponding with the highest bit in the
first six bit block from the shift value of the starting bit yields the
position of this starting bit relative to the highest bit in the first
six bit block. The corresponding values are 2, 0, -1, -2 and -3.

The first case is special because the starting bit sits in the first
byte, at position 1. All other start bits are in the second byte, at
positions 5, 4, 3 and 2.

An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
ignoring the initial special case, the shift value relative to the start
of the first six bit block for each encoded sequence is 8 -
its length:

3 -> 5
4 -> 4
5 -> 3
6 -> 2
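
These numbers can be checked mechanically. Below is a small throwaway
program; the closed forms 5n - 4 and 6(n - 1) - 1 are just restatements
of the lists above, nothing normative:

#include <stdio.h>

int main(void)
{
    for (int n = 2; n <= 6; n++) {
        /* shift of the start bit: 7, 11, 16, 21, 26 */
        int start = n == 2 ? 7 : 5 * n - 4;
        /* shift of the highest bit in the first six bit block: 5, 11, 17, 23, 29 */
        int high = 6 * (n - 1) - 1;

        printf("len %d: start %2d, high %2d, start - high %2d, 8 - len %d\n",
               n, start, high, start - high, 8 - n);
    }

    return 0;
}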

Any corrections or other comments very much welcome.

Ben Bacarisse
Dec 17, 2021, 7:04:10 PM

Rainer Weikusat <rwei...@talktalk.net> writes:

> As usual with technical terms "everyone understands", this one gets thrown
> around everywhere but is never defined. The definition I derived is
> below.
>
> The non-ASCII part of UTF-8 is composed of 5 ranges each of which
> starts with a number which has only one bit set. The starting numbers
> are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
> 'overlong' when this start bit isn't set.

I'd express it in terms of magnitude. An overlong 2-byte sequence will
decode to a value less than 0x80. An overlong encoded 3-byte value will be
less than 0x800 and so on. Or going the other way, you need at least
two bytes if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
so on.
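
In code, that criterion is a table lookup; a minimal sketch
(min_for_len and overlong_by_value are names invented here):

#include <stdint.h>

/* smallest value that genuinely needs an n-byte sequence, n = 2..6 */
static const uint32_t min_for_len[] = {
    0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
};

/* v is the decoded value, n the length of the sequence it arrived in */
static int overlong_by_value(uint32_t v, int n)
{
    return n >= 2 && n <= 6 && v < min_for_len[n];
}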

When looking at the encoding itself, an overlong sequence is one that
starts with one of the bytes C0, C1, E0, F0, F8 or FC.

Unicode has said it won't use more than the 21 bits available in the
four-byte encoding, so all sequences of length 5 or 6 are "overlong",
although in an entirely different sense.

> Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
> and 26.
>
> Each range is composed of a number of six bit blocks plus a
> remainder which gets put into the byte starting the encoded
> sequence. Again expressed as (left) shift arguments, the highest bits of
> the left-most six bit blocks are 5, 11, 17, 23, 29.
>
> Subtracting the shift value corresponding with the highest bit in the
> first six bit block from the shift value of the starting bit yields the
> position of this starting bit relative to the highest bit in the first
> six bit block. The corresponding values are 2, 0, -1, -2 and -3.
>
> The first case is special because the starting bit sits in the first
> byte, at position 1. All other start bits are in the second byte, at
> positions 5, 4, 3 and 2.
>
> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
> ignoring the initial special case, the shift value relative to the start
> of the first six bit block for each encoded sequence is 8 -
> its length:
>
> 3 -> 5
> 4 -> 4
> 5 -> 3
> 6 -> 2
>
> Any corrections or other comments very much welcome.

I was not sure what this part of the description was supposed to add to
the initial definition.

--
Ben.

Keith Thompson
Dec 17, 2021, 7:37:07 PM

Unicode only defines character values up to 0x10fffd, so there are no
valid encodings longer than 4 octets.

Here's a table I came up with a while ago:

00-7F          (7 bits)  0xxxxxxx
0080-07FF      (11 bits) 110xxxxx 10xxxxxx
0800-FFFF      (16 bits) 1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF  (21 bits) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The character code is determined by concatenating the 'x's together.
A 1-octet encoding has 7 value bits.
A 2-octet encoding has 11 value bits.
And so on.

An octet starting with 0 is always a single-octet character (ASCII compatible).
An octet starting with 11 is always the first octet of a multi-octet encoding.
An octet starting with 10 is always a continuation octet.
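
Those three rules are plain mask tests on the top two bits; as a
sketch (the names are invented):

/* classify an octet by its top bits */
enum kind { SINGLE, LEADER, CONTINUATION };

static enum kind classify(unsigned char c)
{
    if ((c & 0x80) == 0)
        return SINGLE;          /* 0xxxxxxx */
    if ((c & 0xc0) == 0xc0)
        return LEADER;          /* 11xxxxxx */
    return CONTINUATION;        /* 10xxxxxx */
}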

Overlong encodings that use more octets than necessary are invalid. For
example, the letter 'k' is 0x6b or 1101011 and is encoded in a single
octet:
01101011
 -------
A two-octet encoding of the same character code is invalid:
11000001 10101011
   -----   ------

You could extrapolate UTF-8 to up to 8-octet encodings, representing up to
42 bits, but that's also not valid UTF-8 (though I can imagine it being
useful for some purposes).

110000-3FFFFFF         (26 bits) 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
4000000-7FFFFFFF       (31 bits) 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
80000000-FFFFFFFFF     (36 bits) 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1000000000-3FFFFFFFFFF (42 bits) 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

https://en.wikipedia.org/wiki/UTF-8

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Rainer Weikusat
Dec 18, 2021, 12:15:16 PM

Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> Rainer Weikusat <rwei...@talktalk.net> writes:
>
>> As usual with technical terms "everyone understands", this one gets thrown
>> around everywhere but is never defined. The definition I derived is
>> below.
>>
>> The non-ASCII part of UTF-8 is composed of 5 ranges each of which
>> starts with a number which has only one bit set. The starting numbers
>> are 0x80, 0x800, 0x10000, 0x200000 and 0x4000000. An encoding is
>> 'overlong' when this start bit isn't set.
>
> I'd express it in terms of magnitude. An overlong 2-byte sequence will
> decode to a value less than 0x80. An overlong encoded 3-byte value will be
> less than 0x800 and so on. Or going the other way, you need at least
> two bytes if the value to encode is >= 0x80, 3 bytes if it's >= 0x800 and
> so on.

Yes. That's an error I made: An overlong sequence is one where none of
the bits between the end of the prefix and the start bit (inclusive) are
set.

[...]

>> Expressed as (left) shift arguments, the starting bits are 7, 11, 16, 21
>> and 26.
>>
>> Each range is composed of a number of six bit blocks plus a
>> remainder which gets put into the byte starting the encoded
>> sequence. Again expressed as (left) shift arguments, the highest bits of
>> the left-most six bit blocks are 5, 11, 17, 23, 29.
>>
>> Subtracting the shift value corresponding with the highest bit in the
>> first six bit block from the shift value of the starting bit yields the
>> position of this starting bit relative to the highest bit in the first
>> six bit block. The corresponding values are 2, 0, -1, -2 and -3.
>>
>> The first case is special because the starting bit sits in the first
>> byte, at position 1. All other start bits are in the second byte, at
>> positions 5, 4, 3 and 2.
>>
>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>> ignoring the initial special case, the shift value relative to the start
>> of the first six bit block for each encoded sequence is 8 -
>> its length:
>>
>> 3 -> 5
>> 4 -> 4
>> 5 -> 3
>> 6 -> 2
>>
>> Any corrections or other comments very much welcome.
>
> I was not sure what this part of the description was supposed to add to
> the initial definition.

I want to calculate that with a general algorithm.

Rainer Weikusat
Dec 18, 2021, 12:19:22 PM

Keith Thompson <Keith.S.T...@gmail.com> writes:
> Rainer Weikusat <rwei...@talktalk.net> writes:
>> As usual with technical terms "everyone understands", this one gets thrown
>> around everywhere but is never defined. The definition I derived is
>> below.

[...]

> Unicode only defines character values up to 0x10fffd, so there are no
> valid encodings longer than 4 octets.
>
> Here's a table I came up with a while ago:
>
> 00-7F          (7 bits)  0xxxxxxx
> 0080-07FF      (11 bits) 110xxxxx 10xxxxxx
> 0800-FFFF      (16 bits) 1110xxxx 10xxxxxx 10xxxxxx
> 010000-10FFFF  (21 bits) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The Linux UTF-8 man page also has 5 and 6 byte sequences.

Ben Bacarisse
Dec 18, 2021, 6:14:18 PM

I don't know what "that" refers to. Do you want to calculate the UTF-8
sequence length from the code point? It seems not. Do you want to
determine if a sequence is overlong by looking at the sequence? It
seems not. What is the algorithm given, and what is its result?

--
Ben.

John McCue
Dec 18, 2021, 7:50:02 PM

Rainer Weikusat <rwei...@talktalk.net> wrote:
> Keith Thompson <Keith.S.T...@gmail.com> writes:
>> Rainer Weikusat <rwei...@talktalk.net> writes:
<snip>

>> Unicode only defines character values up to 0x10fffd, so there are no
>> valid encodings longer than 4 octets.

This is my understanding also.

<snip>
>
> The Linux UTF-8 man page also has 5 and 6 byte sequences.

I saw somewhere that 5 and 6 byte sequences were originally
defined, or thought likely to be needed, but encodings are now
limited to 4 bytes.

--
csh(1) - "An elegant shell, for a more... civilized age."
- Paraphrasing Star Wars

Rainer Weikusat
Dec 19, 2021, 4:39:07 PM

Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> Rainer Weikusat <rwei...@talktalk.net> writes:

[...]

>>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>>>> ignoring the initial special case, the shift value relative to the start
>>>> of the first six bit block for each encoded sequence is 8 -
>>>> its length:
>>>>
>>>> 3 -> 5
>>>> 4 -> 4
>>>> 5 -> 3
>>>> 6 -> 2
>>>>
>>>> Any corrections or other comments very much welcome.
>>>
>>> I was not sure what this part of the description was supposed to add to
>>> the initial definition.
>>
>> I want to calculate that with a general algorithm.
>
> I don't know what "that" refers to. Do you want to calculate the UTF-8
> sequence length from the code point? It seems not. Do you want to
> determine if a sequence is overlong by looking at the sequence? It
> seems not. What is the algorithm given, and what is its result?

I want to determine if a sequence is overlong using a generalized
algorithm for that, i.e., not by special-casing start byte values. So far,
the untested (and very likely buggy) code for this looks as follows:

u_len is the length of the sequence in bytes, p a pointer to the first
byte. Some unrelated consistency checks removed.

mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */
x = *p & mask;
if (u_len == 2) if (x < 2) return U_BIN; /* 2 byte sequence overlong if only the lowest bit set */

y = *++p;

if (!x) { /* x == 0 implies u_len > 2 */
    mask = ~((1 << (8 - u_len)) - 1); /* all bits down to start bit in 2nd byte set */
    if ((y & mask) == 0x80) return U_BIN; /* overlong if continuation pattern only */
}

Rainer Weikusat
Dec 19, 2021, 4:53:30 PM

At the moment, I'm convinced that this algorithm is complete
nonsense. :-)

Ben Bacarisse
Dec 19, 2021, 10:57:28 PM

Rainer Weikusat <rwei...@talktalk.net> writes:

> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>> Rainer Weikusat <rwei...@talktalk.net> writes:
>
> [...]
>
>>>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>>>>> ignoring the initial special case, the shift value relative to the start
>>>>> of the first six bit block for each encoded sequence is 8 -
>>>>> its length:
>>>>>
>>>>> 3 -> 5
>>>>> 4 -> 4
>>>>> 5 -> 3
>>>>> 6 -> 2
>>>>>
>>>>> Any corrections or other comments very much welcome.
>>>>
>>>> I was not sure what this part of the description was supposed to add to
>>>> the initial definition.
>>>
>>> I want to calculate that with a general algorithm.
>>
>> I don't know what "that" refers to. Do you want to calculate the UTF-8
>> sequence length from the code point? It seems not. Do you want to
>> determine if a sequence is overlong by looking at the sequence? It
>> seems not. What is the algorithm given, and what is its result?
>
> I want to determine if a sequence is overlong using a generalized
> algorithm for that, ie, not by special-casing start byte values.

I don't think I follow what you mean. Overlong sequences are a special
case, so you have to special-case something. Why not the first byte? It
seems to be such a simple method.

> So far,
> the untested (and very likely buggy) code for this looks like follows:
>
> u_len is the length of the sequence in bytes,

How have you calculated u_len? You can detect an overlong sequence
without knowing it, so there is some risk in using it when it's not
needed.

> p a pointer to the first
> byte. Some unrelated consistency checks removed.
>
> mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */

That includes one more bit than you want. In a proper UTF-8 sequence,
that bit will be zero, so it's harmless, but have you already checked
that the sequence is valid (other than possibly being overlong)?

By the way, I'd use 0xff >> u_len to get the mask. It seems more
natural.

> x = *p & mask;
> if (u_len == 2) if (x < 2) return U_BIN; /* 2 byte sequence overlong if only the lowest bit set */

(or if no bits are set, but you include that in your test)

> y = *++p;

I don't see why you need to look at the next byte.

> if (!x) { /* x == 0 implies u_len > 2 */

x == 0 implies an overlong sequence now that you have dealt with the
length 2 case which can have one bit on x set and still be overlong.

> mask = ~((1 << (8 - u_len)) - 1); /* all bits down to start bit in 2nd byte set */
> if ((y & mask) == 0x80) return U_BIN; /* overlong if continuation pattern only */
> }

--
Ben.

Nicolas George
Dec 20, 2021, 4:42:10 AM

John McCue, in message <splvjm$msm$1...@dont-email.me>, wrote:
> I saw somewhere 5 and 6 byte sequences were originally
> defined or thought it would be needed, but now limited
> to 4 bytes.

Unicode was limited to 20-21 bits because Microsoft and Sun decided to use
UTF-16 to go beyond 16 bits instead of making their ABI evolve with regard
to sizeof(wchar_t) or equivalent.

Nicolas George
Dec 20, 2021, 8:38:27 AM

Rainer Weikusat, in message
<87tuf44...@doppelsaurus.mobileactivedefense.com>, wrote:
> I want to determine if a sequence is overlong using a generalized
> algorithm for that

Just decode and re-encode and see if the length is the same.
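
A rough rendering of that idea (helper names invented here; it assumes
the sequence already passed the usual well-formedness checks, and it
compares the required length instead of literally re-encoding):

#include <stdint.h>

/* bytes needed for value v, counting the extrapolated 5 and 6 byte forms */
static int utf8_len(uint32_t v)
{
    if (v < 0x80)      return 1;
    if (v < 0x800)     return 2;
    if (v < 0x10000)   return 3;
    if (v < 0x200000)  return 4;
    if (v < 0x4000000) return 5;
    return 6;
}

/* 1 if the n-byte sequence at p (n >= 2) is overlong */
static int overlong(const unsigned char *p, int n)
{
    uint32_t v = *p & (0xff >> (n + 1));    /* value bits of the start byte */
    int i;

    for (i = 1; i < n; i++)
        v = v << 6 | (p[i] & 0x3f);         /* append each six bit block */

    return utf8_len(v) != n;
}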

Rainer Weikusat
Dec 20, 2021, 1:28:52 PM

Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> Rainer Weikusat <rwei...@talktalk.net> writes:
>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:

[...]


>> u_len is the length of the sequence in bytes,
>
> How have you calculated u_len? You can detect an overlong sequence
> without knowing it, so there is some risk in using it when it's not
> needed.

Assuming x is the start byte of a UTF-8 sequence stored as an unsigned
32-bit integer, the length of the sequence is (using a gcc extension)

__builtin_clz(x ^ 0xff) - 24

All bits left of 0x80 will already be clear. x ^ 0xff will clear all
bits up to the trailing 0 bit of the prefix and set that to 1.
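
A quick spot check of the formula against the five possible start byte
patterns (illustrative only):

#include <stdio.h>

int main(void)
{
    unsigned starts[] = { 0xc2, 0xe0, 0xf0, 0xf8, 0xfc };
    unsigned i;

    for (i = 0; i < sizeof starts / sizeof *starts; i++)
        printf("%#x -> length %d\n", starts[i],
               __builtin_clz(starts[i] ^ 0xff) - 24);

    return 0;
}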

>
>> p a pointer to the first
>> byte. Some unrelated consistency checks removed.
>>
>> mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */
>
> That includes one more bit than you want.

Indeed. 8 - u_len is the shift index of the last non-zero prefix bit. It
should have been 7 - u_len or (8 - (u_len + 1)).


[...]

> I don't see why you need to look at the next byte.
>
>> if (!x) { /* x == 0 implies u_len > 2 */
>
> x == 0 implies an overlong sequence now that you have dealt with the
> length 2 case which can have one bit on x set and still be overlong.

According to the Linux man page, a number in the range 0x800 - 0xffff is
encoded as three bytes:

1110xxxx 10xxxxxx 10xxxxxx

Program encoding 0x800 in this way:

--------
#include <stdio.h>

int main(void)
{
    unsigned u = 0x800;
    unsigned char utf[3], *p;

    p = utf;
    *p++ = 0xe0 | (u >> 12);
    *p++ = 0x80 | ((u >> 6) & 63);
    *p = 0x80 | (u & 63);

    printf("%x %x %x\n", *utf, utf[1], utf[2]);

    return 0;
}
--------

And the output is e0 a0 80. The situation is similar for all other
ranges, including the 4-byte sequence which is actually supposed to be
used: The first number of the range corresponds with a set bit in the second
byte.

I may again have gotten something wrong here. But I've done all the
calculations twice and got the same result.

Keith Thompson
Dec 20, 2021, 1:56:23 PM

I don't see how that follows. UTF-16, at least in its current form, can
represent all 1,112,064 valid Unicode code points.

Sun used UTF-16 for Java. UTF-16, as far as I know, was rare on Solaris.

Rainer Weikusat
Dec 20, 2021, 2:30:37 PM

Rainer Weikusat <rwei...@talktalk.net> writes:
> Ben Bacarisse <ben.u...@bsb.me.uk> writes:

[...]


>> I don't see why you need to look at the next byte.
>>
>>> if (!x) { /* x == 0 implies u_len > 2 */
>>
>> x == 0 implies an overlong sequence now that you have dealt with the
>> length 2 case which can have one bit on x set and still be overlong.
>
> According to the Linux man page, a number in the range 0x800 - 0xffff is
> encoded as three bytes:
>
> 1110xxxx 10xxxxxx 10xxxxxx
>
> Program encoding 0x800 in this way:

[...]

> And the output is e0 a0 80.

Addition: Not technically a proof of correctness, but the ActionCable
(rot in hell) UTF-8 checker I have to placate accepts 0xe0 0xa0 0x80 as a
valid UTF-8 sequence.

Ben Bacarisse
Dec 20, 2021, 3:46:26 PM

Rainer Weikusat <rwei...@talktalk.net> writes:

> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
<cut>
>> I don't see why you need to look at the next byte.
>>
>>> if (!x) { /* x == 0 implies u_len > 2 */
>>
>> x == 0 implies an overlong sequence now that you have dealt with the
>> length 2 case which can have one bit on x set and still be overlong.
>
> According to the Linux man page, a number in the range 0x800 - 0xffff is
> encoded as three bytes:

Yup. I was not thinking. You need to look at the first two bytes when
the sequence length is > 2.

If b1 is 0xE0 then b2 must be >= 0xA0.
If b1 is 0xF0 then b2 must be >= 0x90.
If b1 is 0xF8 then b2 must be >= 0x88.
If b1 is 0xFC then b2 must be >= 0x84.

In this diagram, the ^ marks the least significant bit that must be set
for the encoding not to be overlong:

110x xxxx 10xx xxxx
       ^
1110 xxxx 10xx xxxx 10xx xxxx
            ^
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
             ^
1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
               ^
1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
                ^
In terms of bit masks, for the start bytes listed above,

(b2 & 0x3f) >> (8 - ulen)

must be non-zero for ulen > 2.

Sorry for the noise.

--
Ben.

Rainer Weikusat
Dec 20, 2021, 4:24:24 PM

Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> Rainer Weikusat <rwei...@talktalk.net> writes:
>> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> <cut>
>>> I don't see why you need to look at the next byte.
>>>
>>>> if (!x) { /* x == 0 implies u_len > 2 */
>>>
>>> x == 0 implies an overlong sequence now that you have dealt with the
>>> length 2 case which can have one bit on x set and still be overlong.
>>
>> According to the Linux man page, a number in the range 0x800 - 0xffff is
>> encoded as three bytes:

[...]

> In terms of bit masks,
>
> (b2 & 0x3f) >> (8 - ulen)

An idea I meanwhile had myself, too: the expressions become simpler when
shifting out the unwanted bits instead of selecting the wanted ones via
&-masking. The one above is for the second byte; the one for the first
would be

b1 << u_len

with the special case that a result of 4 means it's overlong when the
length of the sequence is 2.
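
Put together, a test along these lines might look as follows. This is an
untested sketch in the same spirit as the draft above (is_overlong and
its argument convention are made up here); the length-2 branch folds in
the 0 result as well:

/* b1: start byte, b2: second byte, u_len: sequence length (2..6);
   returns 1 if the sequence is overlong, assuming it is otherwise
   well-formed */
static int is_overlong(unsigned char b1, unsigned char b2, int u_len)
{
    unsigned x = (unsigned char)(b1 << u_len);    /* prefix shifted out */

    if (u_len == 2)
        return x < 8;    /* payload 0 or 1, i.e. a shifted result of 0 or 4 */

    return x == 0 && ((b2 & 0x3f) >> (8 - u_len)) == 0;
}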