Undefined behavior in utf8

Sergey Bronnikov

Feb 16, 2026, 10:05:03 AM

to lua-l

Hello,

there is an issue with undefined behavior in lutf8lib.c because shift exponent may be too large for 32-bit type 'l_uint32':

lutf8lib.c:67:34: runtime error: shift exponent 35 is too large for 32-bit type 'l_uint32' (aka 'unsigned int')
#0 0x56308fd6face in utf8_decode /src/testdir/build/lua-master/source/lutf8lib.c:67:34
#1 0x56308fd6f4fd in utflen /src/testdir/build/lua-master/source/lutf8lib.c:99:22
#2 0x56308fd05fe9 in precallC /src/testdir/build/lua-master/source/ldo.c:663:7
#3 0x56308fd0687c in luaD_precall /src/testdir/build/lua-master/source/ldo.c
#4 0x56308fd40413 in luaV_execute /src/testdir/build/lua-master/source/lvm.c:1729:22
#5 0x56308fd06ce3 in ccall /src/testdir/build/lua-master/source/ldo.c:774:5
#6 0x56308fd03b69 in luaD_rawrunprotected /src/testdir/build/lua-master/source/ldo.c:166:3
#7 0x56308fd07cd8 in luaD_pcall /src/testdir/build/lua-master/source/ldo.c:1096:12
#8 0x56308fcfaaa0 in lua_pcallk /src/testdir/build/lua-master/source/lapi.c:1097:14
#9 0x56308fcf1435 in LLVMFuzzerTestOneInput /src/testdir/tests/capi/luaL_loadbuffer_test.c:34:3

Steps to reproduce:

make MYCFLAGS=-fsanitize=undefined MYLDFLAGS=-fsanitize=undefined

echo "dXRmOC5sZW4n2b0B/4+Pj4+Pj48AACc=" | base64 --decode | ./lua -

Lua version: c6b48482

Sergey

Scott Morgan

Feb 16, 2026, 11:53:45 AM

to lu...@googlegroups.com

On 16/02/2026 15:05, Sergey Bronnikov wrote:
>
> make MYCFLAGS=-fsanitize=undefined MYLDFLAGS=-fsanitize=undefined
> echo "dXRmOC5sZW4n2b0B/4+Pj4+Pj48AACc=" | base64 --decode | ./lua -
>

d9 bd 01 ff 8f 8f 8f 8f 8f 8f 8f 00 00

That doesn't appear to be valid UTF-8 (0xff shouldn't appear in UTF-8
bytes [1]).

There may be an argument about whether the utf8 lib should detect invalid
strings, or whether it's the user's job to sanitise before use.

Scott

[1] https://www.rfc-editor.org/rfc/rfc3629#page-3
> The octet values C0, C1, F5 to FF never appear

Sergey Bronnikov

Feb 16, 2026, 12:12:35 PM

to lua-l

> May be an argument about whether the utf8 lib should detect invalid strings,
> or if it's the users job to sanitise before use.

I suppose the first one. The Lua 5.5 Reference Manual says of utf8.len:

> If it finds any invalid byte sequence, returns fail plus the position of the first invalid byte.

Sergey

Halalaluyafail3

Feb 16, 2026, 1:13:12 PM

to lua-l

This program looks to be equivalent:

utf8.len'\xD9\xBD\x01\xFF\x8F\x8F\x8F\x8F\x8F\x8F\x8F\x00\x00'

And reduced to remove the valid parts of the string:

utf8.len'\xFF\x8F\x8F\x8F\x8F\x8F\x8F\x8F'

The issue appears to be:

for (; c & 0x40; c <<= 1) {

110XXXXX: read 1 continuation byte
1110XXXX: read 2 continuation bytes
11110XXX: read 3 continuation bytes (max possible in valid unicode)
111110XX: read 4 continuation bytes
1111110X: read 5 continuation bytes (max possible with lax)
11111110: read 6 continuation bytes (impossible normally)
11111111: read 7 continuation bytes (impossible normally)

With 6 or 7 continuation bytes the line:

res = (res << 6) | (cc & 0x3F);

might even wrap around, though that does not appear to matter. The lines after the loop:

res |= ((l_uint32)(c & 0x7F) << (count * 5)); /* add first byte */
if (count > 5 || res > MAXUTF || res < limits[count])
  return NULL; /* invalid byte sequence */

appear to be the actual issue. When count is 7, count * 5 is 35, which is not less than the 32-bit width of l_uint32, so the shift is undefined behavior. A simple fix here is to move the line that adds the first byte after the if:

if (count > 5 || res > MAXUTF || res < limits[count])
  return NULL; /* invalid byte sequence */
res |= ((l_uint32)(c & 0x7F) << (count * 5)); /* add first byte */

Halalaluyafail3

Feb 16, 2026, 1:26:00 PM

to lua-l

On Monday, February 16, 2026 at 1:13:12 PM UTC-5 Halalaluyafail3 wrote:

> A simple fix here is to just move the line to add the first byte after the if:
>
> if (count > 5 || res > MAXUTF || res < limits[count])
>   return NULL; /* invalid byte sequence */
> res |= ((l_uint32)(c & 0x7F) << (count * 5)); /* add first byte */

Hmm, it appears that the if does use res, which I did not notice before. It should be:

if (count > 5)
  return NULL; /* invalid byte sequence */
res |= ((l_uint32)(c & 0x7F) << (count * 5)); /* add first byte */
if (res > MAXUTF || res < limits[count])
  return NULL; /* invalid byte sequence */

Roberto Ierusalimschy

Feb 17, 2026, 1:07:28 PM

to lu...@googlegroups.com

> there is an issue with undefined behavior in lutf8lib.c because shift
> exponent may be too large for 32-bit type 'l_uint32':
>
> lutf8lib.c:67:34: runtime error: shift exponent 35 is too large for 32-bit
> type 'l_uint32' (aka 'unsigned int')
> #0 0x56308fd6face in utf8_decode
> [...]
>
> Steps to reproduce:
>
> make MYCFLAGS=-fsanitize=undefined MYLDFLAGS=-fsanitize=undefined
> echo "dXRmOC5sZW4n2b0B/4+Pj4+Pj48AACc=" | base64 --decode | ./lua -

Many thanks for the feedback.

-- Roberto
