This program looks to be equivalent:
utf8.len'\xD9\xBD\x01\xFF\x8F\x8F\x8F\x8F\x8F\x8F\x8F\x00\x00'
And reduced to remove the valid parts of the string:
utf8.len'\xFF\x8F\x8F\x8F\x8F\x8F\x8F\x8F'
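The issue appears to start at this loop, which reads one continuation byte for each consecutive 1 bit that follows the top bit of the first byte: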
for (; c & 0x40; c <<= 1) {
110XXXXX: read 1 continuation byte
1110XXXX: read 2 continuation bytes
11110XXX: read 3 continuation bytes (max possible in valid Unicode)
111110XX: read 4 continuation bytes
1111110X: read 5 continuation bytes (max possible with lax)
11111110: read 6 continuation bytes (never valid, even with lax)
11111111: read 7 continuation bytes (never valid, even with lax)
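The mapping above can be reproduced with a small standalone sketch that mirrors only the loop condition. The function name `continuation_count` is hypothetical and this is not Lua's code. It assumes, as in the loop shown, that the first byte is held in an `unsigned int`, so bits shifted past bit 7 are not truncated.

```c
/* Count how many continuation bytes the loop condition would request
   for a given first byte. Hypothetical helper, for illustration only. */
static int continuation_count(unsigned char first) {
  unsigned int c = first;    /* widened, so shifted-out bits are kept */
  int count = 0;
  for (; c & 0x40; c <<= 1)  /* same condition as the loop in question */
    count++;
  return count;
}
```

For a first byte of `0xFF` this returns 7, so `count * 5` is 35.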
With 6 or 7 continuation bytes the line:
res = (res << 6) | (cc & 0x3F);
Might even shift bits off the top of res (well defined for an unsigned type, which simply wraps), though that does not appear to matter. The lines after the loop:
res |= ((l_uint32)(c & 0x7F) << (count * 5)); /* add first byte */
if (count > 5 || res > MAXUTF || res < limits[count])
return NULL; /* invalid byte sequence */
Appear to be the actual issue. When count is 7, count * 5 is 35, which is greater than or equal to the width of a 32-bit l_uint32, making the shift undefined behavior. A simple fix is to reject a too-large count before adding the first byte. Only the count check can move ahead of that line, because the two range checks need the complete value:
if (count > 5)
return NULL; /* invalid byte sequence */
res |= ((l_uint32)(c & 0x7F) << (count * 5)); /* add first byte */
if (res > MAXUTF || res < limits[count])
return NULL; /* invalid byte sequence */
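To check the idea in isolation, here is a minimal standalone sketch of a decoder that rejects `count > 5` before the shift and runs the range checks only after the first byte has been added, since they need the complete value. It is a simplified stand-in, not Lua's actual source: the name `decode_sketch`, the use of `uint32_t` in place of `l_uint32`, the value of `MAXUTF`, and the contents of `limits` are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MAXUTF 0x7FFFFFFFu  /* assumed upper bound for lax decoding */

/* Decode one sequence starting at s. Returns a pointer just past it,
   or NULL if the sequence is invalid. Expects a NUL-terminated string. */
static const char *decode_sketch(const char *s, uint32_t *val) {
  /* assumed minimum value per continuation count (rejects overlongs) */
  static const uint32_t limits[] =
      {~(uint32_t)0, 0x80, 0x800, 0x10000u, 0x200000u, 0x4000000u};
  unsigned int c = (unsigned char)s[0];
  uint32_t res = 0;
  if (c < 0x80) {
    res = c;  /* ASCII */
  } else {
    int count = 0;  /* number of continuation bytes read */
    for (; c & 0x40; c <<= 1) {
      unsigned int cc = (unsigned char)s[++count];
      if ((cc & 0xC0) != 0x80)
        return NULL;  /* not a continuation byte */
      res = (res << 6) | (cc & 0x3F);
    }
    if (count > 5)
      return NULL;  /* reject first, so count * 5 is at most 25 below */
    res |= ((uint32_t)(c & 0x7F) << (count * 5));  /* add first byte */
    if (res > MAXUTF || res < limits[count])
      return NULL;  /* out of range or overlong */
    s += count;  /* skip continuation bytes read */
  }
  if (val) *val = res;
  return s + 1;
}
```

With this ordering the reduced input `"\xFF\x8F\x8F\x8F\x8F\x8F\x8F\x8F"` is rejected without ever evaluating a shift by 35, while a valid sequence such as `"\xD9\xBD"` still decodes to U+067D.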