Scanning of utf-8 encoding in strings


Robert Virding

Jul 16, 2025, 11:04:08 AM
to lua-l
Hi,
I have been reading through the code of the Lua 5.4.7 implementation to see how it handles strings. I think I get most of it, but what I don't see is when and where it handles UTF-8 encoding of strings. Is it done in its own lexical scanner, or does it assume that it has already been done when the input is scanned?

Robert

bil til

Jul 16, 2025, 11:35:07 AM
to lu...@googlegroups.com
On Wed, Jul 16, 2025 at 17:04, Robert Virding <rvir...@gmail.com> wrote:
>
> Hi,
> I have been reading through the code of the Lua 5.4.7 implementation to see how it handles strings. I think I get most of it, but what I don't see is when and where it handles UTF-8 encoding of strings. Is it done in its own lexical scanner, or does it assume that it has already been done when the input is scanned?

I am not sure whether I understand your question / application
correctly, but as I understand it you are somehow mixing up "string"
(a sequence of bytes, null-terminated on the C side) and encoding
(which can be e.g. basic ASCII, UTF-8, or many other forms of encoding).

UTF-8 is a nice way to extend ASCII text so that it can also hold
Unicode characters. It is also very nice for any Unicode text sent over
a byte-wise data transmission, as it will lose only one UTF-8 character
if a single byte is corrupted; this is one reason it is the main
encoding used for Internet HTML pages.

You recognize a multi-byte UTF-8 character (typically 2-4 bytes) by its
start byte. Null bytes are NOT used in multi-byte UTF-8 characters, so
strings stay strings even when they hold UTF-8 encoded text.

To assist you in analyzing UTF-8 encoded strings, Lua provides the utf8
library (see e.g. Roberto's book "Programming in Lua").
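
The start-byte rule described above can be sketched in Lua roughly like this. The function name utf8_lead_length is made up for illustration, and the sketch only looks at the high bits of the byte, so it does not reject the few lead bytes (0xC0, 0xC1, 0xF5 to 0xF7) that valid UTF-8 never uses:

```lua
-- Hypothetical helper: given one byte value, return how many bytes the
-- UTF-8 sequence starting with it should have, or nil if the byte cannot
-- start a sequence.
local function utf8_lead_length(byte)
  if byte < 0x80 then
    return 1            -- 0xxxxxxx: plain ASCII
  elseif byte < 0xC0 then
    return nil          -- 10xxxxxx: continuation byte, never a start byte
  elseif byte < 0xE0 then
    return 2            -- 110xxxxx
  elseif byte < 0xF0 then
    return 3            -- 1110xxxx
  elseif byte < 0xF8 then
    return 4            -- 11110xxx
  end
  return nil            -- 0xF8..0xFF do not occur in UTF-8
end

print(utf8_lead_length(97))   --> 1
print(utf8_lead_length(203))  --> 2
print(utf8_lead_length(134))  --> nil
```

None of these byte patterns is zero, which is why multi-byte characters never introduce a null byte.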

Luiz Henrique de Figueiredo

Jul 16, 2025, 11:40:02 AM
to lu...@googlegroups.com
> I don't see is when and where it handles utf-8 encoding of strings

Lua treats strings as byte arrays; it's agnostic about encodings.
That said, it supports UTF-8 encodings of Unicode characters inside a
literal string with the escape sequence \u{XXX}.
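
A small sketch of that escape in action, assuming Lua 5.3 or later (where \u{XXX} is available). The two code points are picked so that they produce the bytes 203,134 and 194,182:

```lua
-- "\u{XXX}" is turned into UTF-8 bytes when the literal is read;
-- afterwards the string is an ordinary byte array.
local escaped = "ab\u{2C6}\u{B6}cd"
local raw = string.char(97, 98, 203, 134, 194, 182, 99, 100)

print(escaped == raw)  --> true
print(#escaped)        --> 8 (bytes, not characters)
```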

Scott Morgan

Jul 16, 2025, 11:51:17 AM
to lu...@googlegroups.com
Simple answer is Lua doesn't care. Strings are just arrays of bytes and
all the standard functions work on that.

You have the utf8 library in more recent versions of Lua, to help with
UTF-8 text parsing and that's it.

Technically speaking, the strings don't even have to be free of
embedded nulls (Lua strings are 8-bit clean), but you have to be very
careful handling such data, on both the Lua and the C side. So you
probably don't want to use text encodings that can produce them
(e.g. UTF-16).
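
To illustrate the embedded-null point, here is a minimal sketch in Lua (the string is just "ab" laid out as UTF-16LE):

```lua
-- Lua strings carry an explicit length, so zero bytes are ordinary content.
local s = "a\0b\0"            -- bytes 97, 0, 98, 0
print(#s)                     --> 4
print(s:byte(1, -1))          --> 97, 0, 98 and 0 (print separates them with tabs)
```

On the C side the usual trap is treating such data as a C string: lua_tolstring reports the real length, whereas strlen would stop at the first zero byte.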

Scott

Robert Virding

Jul 17, 2025, 7:57:37 AM
to lua-l
OK, to try and be more specific. If it gets a string input with the following bytes <97,98,203,134,194,182,99,100>, does it then accept the string with these bytes as they are, or does it try to UTF-8 encode 203, 194 and 182? They can be UTF-8 encoded.

That is what I meant when I asked where the UTF-8 encoding is done.

Robert

Luau Project

Jul 17, 2025, 9:32:38 AM
to lu...@googlegroups.com
Not sure if the following script is a possible answer to your question, but run it (I wrote it on my phone, so it may contain mistakes):

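-- The same eight bytes, written as decimal escapes
local s = "\97\98\203\134\194\182\99\100"

-- Walk the string one byte at a time; nothing is encoded or decoded here
for i = 1, #s do
    local char_i = s:sub(i, i)  -- a one-byte string
    print(char_i:byte())        -- its numeric value
    print(char_i)               -- the raw byte (a lone byte >= 128 is not valid UTF-8)
end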


Pierre Chapuis

Jul 17, 2025, 10:38:58 AM
to lu...@googlegroups.com
I am not sure what you mean by "they can be utf-8 encoded".

If you consider the 8 octets with the values you give as a binary sequence representing UTF-8, it can be *decoded* to 6 unicode characters: abXYcd, where X (the octets 203 and 134 i.e. cb86) is MODIFIER LETTER CIRCUMFLEX ACCENT and Y (the octets 194 and 182 i.e. c2b6) is PILCROW SIGN.

In Python 3 for instance you would decode it like this:


In Lua there is no difference between "bytes" and "str", everything is sequences of bytes and (almost) only the utf8 library is unicode-aware.
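
As a rough Lua counterpart of that decoding step, here is a sketch using the standard utf8 library (assuming Lua 5.3 or later). It only reads the bytes as UTF-8 and lists the code points; the string itself is never converted:

```lua
local s = string.char(97, 98, 203, 134, 194, 182, 99, 100)

-- utf8.codes iterates over (byte position, code point) pairs and raises
-- an error if it meets an invalid byte sequence.
local codepoints = {}
for _, cp in utf8.codes(s) do
  codepoints[#codepoints + 1] = string.format("U+%04X", cp)
end

print(table.concat(codepoints, " "))
--> U+0061 U+0062 U+02C6 U+00B6 U+0063 U+0064
```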





-- 
Pierre Chapuis

Scott Morgan

Jul 17, 2025, 1:45:57 PM
to lu...@googlegroups.com
On 17/07/2025 12:57, Robert Virding wrote:
> OK, to try and be more specific. If it gets a string input with the
> following bytes <97,98,203,134,194,182,99,100> does it then accept the
> string with these bytes as they are or does it try to utf-8 encode 203,
> 194 and 182? They can be utf-8 encoded.
>
> That is what I meant when I asked where is the utf-8 encoding done?

It's just a string of bytes. It's down to the script to decide what to
do with them, Lua doesn't care. Is it UTF-8? Is it ISO-8859-1? Is it raw
binary code for the 6502 processor? Lua has no idea.

There's no magic 'unicode' data type in a computer, it's just handling
bytes. A programming language may assume UTF-8 form and process the
bytes as such, but many (like Lua) don't, and the ones that do are still
just handling bytes.

If you know the data is UTF-8, then use Lua's utf8 module. But you may
find you'll need to supplement that with some extra functions/modules to
comfortably handle UTF-8 text.

Lua's string module just works at the byte level. So things like gsub
won't work reliably over UTF-8 strings if you want to do any complex
text processing.


> x = string.char(97,98,203,134,194,182,99,100)
> =x
abˆ¶cd
> string.len(x)
8
> utf8.len(x)
6
> string.gsub(x, ".", "(%1)")
(a)(b)(�)(�)(�)(�)(c)(d) 8

-- UTF-16LE
> y = string.char(97,0,98,0)
> =y
ab
> string.gsub(y, ".", "(%1)")
(a)()(b)() 4


Scott

Matěj Cepl

Jul 17, 2025, 3:43:17 PM
to lu...@googlegroups.com
On Thu Jul 17, 2025 at 7:45 PM CEST, 'Scott Morgan' via lua-l wrote:
> If you know the data is UTF-8, then use Lua's utf8 module. But you may
> find you'll need to supplement that with some extra functions/modules to
> comfortably handle UTF-8 text.

To be more precise, there is no support for Unicode in Lua
whatsoever. That is not a criticism of Lua as a language; it was
intended as a small embedded language, and proper Unicode-aware
string operations are way more complicated than that.

In terms of Unicode support in Python (which is the language I
know best), Lua is somewhere below version 2.*, i.e., “strings
are a bunch of bytes and we don’t do anything with them”. I am a
supporter and user of the vis editor [1], which has almost all
of its functionality in Lua scripts, and when we got to even
primitive string operations like “upper-case” or “lower-case” a
string [2], we found that the only OS-independent solution
(which wouldn’t require writing it in C) was to pipe the string
through awk, and even that doesn’t work well everywhere (i.e.,
across Linux, Mac OS X and *BSD).
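
A minimal sketch of the upper-casing problem, assuming the "C" locale (set explicitly here so the result does not depend on the environment). The sample word is written with \u{...} escapes so the snippet does not depend on the encoding of the source file:

```lua
os.setlocale("C")  -- assumption: plain "C" locale, where only ASCII letters have case

local word = "D\u{11B}de\u{10D}ek"   -- "Dědeček" as UTF-8 bytes
local upper = string.upper(word)

-- string.upper works byte by byte, so only the ASCII letters change;
-- the two-byte sequences for ě and č pass through untouched.
print(upper == "D\u{11B}DE\u{10D}EK")  --> true
print(upper == "D\u{11A}DE\u{10C}EK")  --> false (this would be the correct "DĚDEČEK")
```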

As I said, it is not a problem that the language doesn’t support
this, but it is sad that I don’t know of any good Lua library
that does, unfortunately not even my preferred PenLight.

Best,

Matěj

[1] https://sr.ht/~martanne/vis/
[2] https://lists.sr.ht/~martanne/devel/patches/49212 ; just
before you try to argue otherwise, string.upper("Да Нет
Dědeček") doesn’t give the correct answer.
--
http://matej.ceplovi.cz/blog/, @mc...@en.osm.town
GPG Finger: 3C76 A027 CA45 AD70 98B5 BC1D 7920 5802 880B C9D8

Get up, stand up, don't give up the fight!
-- Bob Marley


Shmuel Zeigerman

Jul 17, 2025, 4:10:22 PM
to lu...@googlegroups.com
On 17/07/2025 22:43, Matěj Cepl wrote:
>
> As I said, it is not a problem, that the language doesn’t support
> this, but it is sad, that I don’t know about any good Lua
> library, which would be doing this, unfortunately, not even my
> preferred PenLight.

We use the luautf8 library: https://github.com/starwing/luautf8 and it's
quite capable.

--
Shmuel

Matěj Cepl

Jul 17, 2025, 4:19:41 PM
to lu...@googlegroups.com
On Thu Jul 17, 2025 at 10:10 PM CEST, Shmuel Zeigerman wrote:
> We use the luautf8 library: https://github.com/starwing/luautf8 and it's
> quite capable.

Thank you, that seems interesting. Silly me, we actually have it
packaged in openSUSE [1].

Matěj

[1] https://src.opensuse.org/lua/lua-luautf8
--
http://matej.ceplovi.cz/blog/, @mc...@en.osm.town
GPG Finger: 3C76 A027 CA45 AD70 98B5 BC1D 7920 5802 880B C9D8

A man once asked Mozart how to write a symphony. Mozart told him
to study at the conservatory for six or eight years, then
apprentice with a composer for four or five more years, then
begin writing a few sonatas, pieces for string quartets, piano
concertos, etc. and in another four or five years he would be
ready to try a full symphony. The man said, “But Mozart, didn’t
you write a symphony at age eight?” Mozart replied, “Yes, but
I didn’t have to ask how.”
-- ripped from another sig


Pierre Chapuis

Jul 17, 2025, 4:21:23 PM
to lu...@googlegroups.com
> things like gsub won't work reliably over UTF-8 strings if you want to do
> any complex text processing
>
> > x = string.char(97,98,203,134,194,182,99,100)
> > string.gsub(x, ".", "(%1)")
> (a)(b)(�)(�)(�)(�)(c)(d) 8

Note that you can do this:

> > string.gsub(x, utf8.charpattern, "(%1)")
> (a)(b)(ˆ)(¶)(c)(d) 6

Scott Morgan

Jul 17, 2025, 4:50:49 PM
to lu...@googlegroups.com
True, but that doesn't change the fact that gsub is byte-oriented, not
Unicode-character-oriented. You'd run into problems with any
non-trivial patterns. There's no easy equivalent of '%u' or '%l', for
example.
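
A quick sketch of the '%u' limitation, assuming the "C" locale. The sample text "Čech Dub" is made up for illustration and written with an escape for the non-ASCII capital:

```lua
os.setlocale("C")  -- assumption: character classes follow the plain "C" locale

local text = "\u{10C}ech Dub"              -- starts with the capital Č (bytes 196, 140)
local replaced, count = text:gsub("%u", "_")

-- '%u' is tested against single bytes, so the two bytes of Č are never
-- classified as an upper-case letter; only the ASCII D is matched.
print(count)                               --> 1
print(replaced == "\u{10C}ech _ub")        --> true
```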

Scott
