utf8.offset acknowledges null terminator

Sainan

Aug 24, 2025, 3:27:30 AM
to lu...@googlegroups.com
Hi, I don't typically use the utf8 library, but I was toying around with utf8.offset given the recent change to it and I noticed some interesting behaviour:

print(#"プ") --> 3
print(utf8.offset("プ", 1)) --> 1 3
print(utf8.offset("プ", 2)) --> 4 4
print(utf8.offset("プ", 3)) --> nil

This function seemingly acknowledges the existence of the null terminator at position 4, beyond what Lua would say the string's length is.

I checked Lua 5.4 and this doesn't seem to be a regression, so maybe I'm a bit late to mention this.

-- Sainan

Roberto Ierusalimschy

Aug 24, 2025, 10:31:17 AM
to 'Sainan' via lua-l
> Hi, I don't typically use the utf8 library, but I was toying around with utf8.offset given the recent change to it and I noticed some interesting behaviour:
>
> print(#"プ") --> 3
> print(utf8.offset("プ", 1)) --> 1 3
> print(utf8.offset("プ", 2)) --> 4 4
> print(utf8.offset("プ", 3)) --> nil
>
> This function seemingly acknowledges the existence of the null terminator at position 4, beyond what Lua would say the string's length is.

That is what the manual says it does:

If the specified character is right after the end of s,
the function behaves as if there was a '\0' there.
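
As a minimal sketch of the quoted sentence (not part of the manual), the position right after the last character can be asked for explicitly. The snippet assumes Lua 5.3 or later and keeps only the first value that utf8.offset returns, so it does not depend on the second value shown in the output above. The variable names are illustrative.

```lua
-- "\u{30D7}" is the 3-byte character from the example, written as an
-- escape so the snippet does not depend on the source file's encoding.
local s = "\u{30D7}"
local n = utf8.len(s)                     -- number of characters: 1
local past_end = (utf8.offset(s, n + 1))  -- parentheses keep the first result only
print(#s, n, past_end)                    --> 3 1 4
print(past_end == #s + 1)                 --> true
print((utf8.offset(s, n + 2)))            --> nil
```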

-- Roberto

bil til

Aug 24, 2025, 11:56:40 PM
to lu...@googlegroups.com
On Sun, Aug 24, 2025 at 09:27, 'Sainan' via lua-l
<lu...@googlegroups.com> wrote:
>
> print(#"プ") --> 3

Generally it is not wise to mix functions that interpret a string as
bytes ("ASCII chars") with functions that interpret it as UTF-8
(unless you want to use this for some special tricks and you know
exactly how UTF-8 works internally, see the wiki). In practice there
is also the pitfall of possibly corrupted strings, so you should
always check carefully how the utf8 lib functions handle illegal
"utf8 chars" in such cases.

#"..." gives the length in bytes, not in characters.

If you want to see the "utf8 char" length, you have to use utf8.len,
which returns 2 in your example case, I hope :).

Sainan

Aug 25, 2025, 12:00:53 AM
to lu...@googlegroups.com
> If you want to see the "utf8 char" length, you have to use utf8.len, which return 2 in your example case I hope :).

print(utf8.len("プ")) --> 1

-- Sainan

bil til

Aug 25, 2025, 1:22:53 AM
to lu...@googlegroups.com
On Mon, Aug 25, 2025 at 06:00, 'Sainan' via lua-l
<lu...@googlegroups.com> wrote:
>
> print(utf8.len("プ")) --> 1

... so this strange char in your "..." is only one utf8 char
(sorry, I cannot identify it clearly, I can read neither Chinese
nor Japanese :) ...). Can you specify the Unicode number of this char?

But the utf8.offset results in your example do match the ref manual:
the first (and only) utf8 char in your string has 3 bytes, from
byte 1 to 3. The 2nd utf8 char is the one right after the end, with
zero bytes (so from byte 4 to 4). And if you invoke offset for a
higher "utf8 char" number, the function fails and returns nil.

And #".." = 3 is also correct, as is utf8.len.

Sainan

Aug 25, 2025, 1:42:42 AM
to lu...@googlegroups.com
It is a Katakana "pu". U+30D7. Katakana is used in Japanese.

I do understand and agree with # and utf8.len here. I just find utf8.offset's behaviour at the end of the string weird. The input string doesn't really matter for that; it's just what I happened to be testing.

-- Sainan

bil til

Aug 25, 2025, 2:06:34 AM
to lu...@googlegroups.com
:)

But what is really "weird" are the problems that come up if you try
to include "multibyte chars" in byte sequences.

utf8 is a very elegant and nice method to avoid at least the
problems with "single missing byte" corruptions, which can easily
appear when data is transferred byte by byte. In my judgement, the
utf8 lib in Lua is a very smart and powerful way to address this.

I would agree that utf8.offset with index 2 giving 4,4 might be a
somewhat strange return value pair.

But it is documented like this, and I could imagine applications
where it is an advantage that utf8.offset gives a different result
for index 2 than for higher indices.

The more challenging case would be if you add a corrupted last byte
to your string, e.g. "(your katakana bytes)\xC5".

Not sure what utf8.offset and utf8.len would return in this case. I
would assume utf8.len then returns 2, and offset for this corrupted
utf8 char with index 2 would return 4, 5, and for index 3 it
would return 5,5.
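
This guess can be checked with a short script. A minimal sketch, assuming Lua 5.3 or later and the reference implementation; only the first result of utf8.offset is kept, so the snippet does not depend on the newer second return value. Per the reference manual, utf8.len returns fail plus the position of the first invalid byte when it finds an invalid sequence, whereas utf8.offset assumes a valid UTF-8 string and does not validate it.

```lua
-- U+30D7 (3 bytes) followed by a lone lead byte 0xC5 with no continuation byte.
local s = "\u{30D7}\xC5"
print(#s)                   --> 4
print(utf8.len(s))          --> nil 4   (invalid sequence starts at byte 4)
print((utf8.offset(s, 2)))  --> 4
print((utf8.offset(s, 3)))  --> 5
print((utf8.offset(s, 4)))  --> nil
```

So utf8.len reports the stray byte instead of counting it as a second character, while utf8.offset simply treats it as the start of one.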

Philippe Verdy

Aug 27, 2025, 12:52:03 PM
to lu...@googlegroups.com
However, the returned 4,4 means that this is an empty sequence terminating the string, so it is still not a true '\0' encoded in the string.

In my opinion, print(utf8.offset("プ", 2)) should behave like print(utf8.offset("プ", 3)) and return nil as well, since the index is already out of bounds (even if it refers to the end-of-string position): "プ" contains a single code point encoded in 3 bytes, and there is no other code point after it at UTF-8 index 2.

Returning "4, 4" may add to the confusion. But maybe there is some use case that expects this final empty segment to represent the end of the string, to be handled *silently* by Lua code that treats it as an end-of-string condition. Such code would then also work with utf8.offset("", 1), which should just return 1,1 and not nil for the first, final, and only empty segment. Returning nil would then indicate an implementation bug in the Lua code (to be handled by some error-tracking code).
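
One use case of that kind, as a hedged sketch: extracting the n-th character needs the start of character n + 1 as its end marker, and for the last character that marker is the position right after the end of the string. The helper name char_at is hypothetical, Lua 5.3 or later is assumed, and only the first result of each utf8.offset call is used.

```lua
-- Hypothetical helper: returns the n-th UTF-8 character of s, or nil
-- when s has fewer than n characters. Assumes s is valid UTF-8 and n >= 1.
local function char_at(s, n)
  local start = utf8.offset(s, n)           -- first byte of character n
  if not start then return nil end
  local next_start = utf8.offset(s, n + 1)  -- for the last character: #s + 1
  if not next_start then return nil end
  return s:sub(start, next_start - 1)
end

local s = "a\u{30D7}b"
print(char_at(s, 2) == "\u{30D7}")  --> true
print(char_at(s, 3))                --> b
print(char_at(s, 4))                --> nil
print(char_at("", 1))               --> nil
```

If the end-of-string position returned nil instead, the last character would need a special case in such a helper.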
