Rainer Weikusat <rwei...@talktalk.net> writes:
> Ben Bacarisse <ben.u...@bsb.me.uk> writes:
>> Rainer Weikusat <rwei...@talktalk.net> writes:
>
> [...]
>
>>>>> An encoded sequence has a length of 2, 3, 4, 5 or 6 bytes. When
>>>>> ignoring the initial special case, the shift value relative to the start
>>>>> of the first six bit block for each encoded sequence is 8 -
>>>>> its length:
>>>>>
>>>>> 3 -> 5
>>>>> 4 -> 4
>>>>> 5 -> 3
>>>>> 6 -> 2
>>>>>
>>>>> Any corrections or other comments very much welcome.
>>>>
>>>> I was not sure what this part of the description was supposed to add to
>>>> the initial definition.
>>>
>>> I want to calculate that with a general algorithm.
>>
>> I don't know what "that" refers to. Do you want to calculate the UTF-8
>> sequence length from the code point? It seems not. Do you want to
>> determine if a sequence is overlong by looking at the sequence? It
>> seems not. What is the algorithm given, and what is its result?
>
> I want to determine if a sequence is overlong using a generalized
> algorithm for that, i.e., not by special-casing start byte values.
I don't think I follow what you mean. Overlong sequences are a special
case, so you have to special-case something. Why not the first byte? It
seems to be such a simple method.
> So far,
> the untested (and very likely buggy) code for this looks as follows:
>
> u_len is the length of the sequence in bytes,
How have you calculated u_len? You can detect an overlong sequence
without knowing it, so there is some risk in using it when it's not
needed.
> p a pointer to the first
> byte. Some unrelated consistency checks removed.
>
> mask = (1 << (8 - u_len)) - 1; /* all value bits in the first byte set */
That includes one more bit than you want. In a proper UTF-8 sequence,
that bit will be zero, so it's harmless, but have you already checked
that the sequence is valid (other than possibly being overlong)?

By the way, I'd use 0xff >> u_len to get the mask. It seems more
natural.
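
A quick way to convince yourself that the two mask forms agree for
lengths 2 to 6 is a minimal sketch like the one below. The function
names are made up for illustration; mask_exact() is my assumption of
what a mask without the extra (always zero) marker bit would look like.

```c
/* Mask as written in the quoted code. */
unsigned mask_shift(unsigned u_len)
{
    return (1u << (8 - u_len)) - 1;
}

/* The same mask written as a right shift of 0xff. */
unsigned mask_ff(unsigned u_len)
{
    return 0xffu >> u_len;
}

/* Value bits only, without the 0 bit that ends the length prefix. */
unsigned mask_exact(unsigned u_len)
{
    return 0xffu >> (u_len + 1);
}

/* Returns 1 if both forms give the same mask for every length 2..6. */
int masks_agree(void)
{
    unsigned u_len;

    for (u_len = 2; u_len <= 6; u_len++)
        if (mask_shift(u_len) != mask_ff(u_len))
            return 0;
    return 1;
}
```

For u_len == 2 both forms give 0x3f, while the lead byte 110xxxxx has
only the five value bits 0x1f.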
> x = *p & mask;
> if (u_len == 2) if (x < 2) return U_BIN; /* 2 byte sequence overlong if only the lowest bit set */
(or if no bits are set, but you include that in your test)
> y = *++p;
I don't see why you need to look at the next byte.
> if (!x) { /* x == 0 implies u_len > 2 */
x == 0 implies an overlong sequence now that you have dealt with the
length 2 case, which can have one bit of x set and still be overlong.
> mask = ~((1 << (8 - u_len)) - 1); /* all bits down to start bit in 2nd byte set */
> if ((y & mask) == 0x80) return U_BIN; /* overlong if continuation pattern only */
> }
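
For anyone who wants to try the quoted fragment, here is a minimal
sketch of it as I read it, wrapped in a hypothetical function
is_overlong() so that it compiles on its own. It assumes that u_len
matches the lead byte and that the continuation bytes have already been
checked to be of the form 10xxxxxx. It also uses the 0xff >> u_len form
of the mask and returns 1 or 0 instead of U_BIN. Treat it as an
illustration, not as a tested validator.

```c
/* Hypothetical wrapper around the quoted fragment.
 * p points at the lead byte, u_len is the sequence length (2..6).
 * Returns 1 if the sequence is an overlong encoding, 0 otherwise. */
int is_overlong(const unsigned char *p, unsigned u_len)
{
    unsigned mask, x, y;

    /* Value bits of the lead byte, plus the always-zero marker bit. */
    mask = 0xffu >> u_len;
    x = p[0] & mask;

    /* Lead bytes C0 and C1 can only encode values below 0x80. */
    if (u_len == 2)
        return x < 2;

    /* Some value bit is set in the lead byte: not overlong. */
    if (x != 0)
        return 0;

    /* Lead byte carries no value bits: look at the top bits of the
     * second byte, from the continuation marker down to the first bit
     * that a shorter encoding could not hold. */
    y = p[1];
    mask = ~(0xffu >> u_len) & 0xffu;
    return (y & mask) == 0x80;
}
```

Worked through by hand on a few inputs: C0 80, E0 80 80 and F0 8F BF BF
are reported as overlong, while C2 80, E0 A0 80 (U+0800) and F0 90 80 80
are not.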
--
Ben.