Thought bubble: Unicode serialization options

sur-behoffski

Jul 20, 2025, 6:01:12 AM
to lua-l
G'day,

During the recent Unicode-in-Lua discussions, I started thinking
about the string syntax "\u{h...h}": PiL, 4th Ed, Section 4.1
(p. 31):

Since Lua 5.3, we can also specify UTF-8 characters with
the escape sequence \u{h...h}; we can write any number of
hexadecimal digits inside the brackets.

The library "utf8" was also introduced in 5.3.
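
For illustration, a minimal sketch (assuming Lua 5.3 or later) of the escape sequence and the "utf8" library working together; the sample string is just an example:

```lua
-- Three code points written with \u{h...h}: U+0048, U+00E9, U+20AC.
local s = "\u{48}\u{E9}\u{20AC}"

print(#s)            -- length in bytes (1 + 2 + 3)
print(utf8.len(s))   -- length in code points

-- utf8.codes yields the byte position and code point of each character.
for pos, cp in utf8.codes(s) do
  print(pos, string.format("U+%04X", cp))
end
```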

I noted that string.format's "%q" specifier is designed to allow
serialization of internal values in such a way that all characters
that are special to the general Lua parser are escaped, e.g.:

a = 'a "problematic" \\string'
print(string.format("%q", a)) --> "a \"problematic\" \\string"

I noticed that quoting was only applied to the ASCII characters
that have special meaning in Lua programs. Every byte of a
multi-byte UTF-8 sequence has a value higher than 0x7f (127), so,
not being part of the ASCII character set, these bytes were simply
passed through verbatim.
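
A small check of this behaviour, as a sketch assuming Lua 5.3 or later running in the default "C" locale (the sample string is made up):

```lua
-- U+00E9 is encoded in UTF-8 as the two bytes 0xC3 0xA9 (195, 169).
local s = 'caf\u{E9} "quoted"'
local q = string.format("%q", s)

-- The double quotes are escaped, but the two high bytes are copied as-is.
print(q)
print(q:find("\195\169", 1, true) ~= nil)  -- raw UTF-8 bytes still present?
```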

----

My thought is that, since the raw Lua parser accepts the "\u{...}"
syntax, some potential users might want a "%q On Steroids",
with UTF-8 sequences emitted using the "\u{...}" syntax. This
would have the nice side-effect of making the serialized text 7-bit
clean (although invalid UTF-8 sequences would need to be
presented as a series of one or more "\ddd" specifiers). The
resulting string would also be easier to read, as the
code point(s) would be shown directly, instead of through the
bytes of their UTF-8 encoding.

I toyed with the idea of a "%Q" format specifier, but I suspect
that this would be a breaking/incompatible change that's more
trouble than it's worth.

So, perhaps adding a function to utf8 to perform this transformation
could be worthwhile?
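
To make the idea concrete, here is a rough pure-Lua sketch of such a transformation, assuming Lua 5.3 or later. The name quote_u is hypothetical and is not part of the utf8 library. What counts as a valid sequence is whatever utf8.codepoint accepts in the running Lua version, and control characters are written as "\ddd" rather than in the exact forms "%q" uses:

```lua
-- Hypothetical helper (NOT part of Lua's utf8 library): quote a string so
-- that valid UTF-8 sequences become \u{XXX} and every other byte outside
-- printable ASCII becomes \ddd, leaving the result 7-bit clean.
local function quote_u(s)
  local out = { '"' }
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    if b < 0x80 then
      local c = string.char(b)
      if c == '"' or c == "\\" then
        out[#out + 1] = "\\" .. c                   -- escape quote/backslash
      elseif b < 0x20 or b == 0x7F then
        out[#out + 1] = string.format("\\%03d", b)  -- ASCII control character
      else
        out[#out + 1] = c                           -- printable ASCII
      end
      i = i + 1
    else
      -- utf8.codepoint raises an error on an invalid sequence, hence pcall.
      local ok, cp = pcall(utf8.codepoint, s, i)
      if ok then
        out[#out + 1] = string.format("\\u{%X}", cp)
        i = i + #utf8.char(cp)                      -- skip the whole sequence
      else
        out[#out + 1] = string.format("\\%03d", b)  -- stray byte as \ddd
        i = i + 1
      end
    end
  end
  out[#out + 1] = '"'
  return table.concat(out)
end

-- A valid character (U+00E9), quotes, and a stray 0xFF byte.
local sample = 'caf\u{E9} "x"\xff'
local quoted = quote_u(sample)
print(quoted)
-- Reading the result back through the parser should give the original bytes.
print(load("return " .. quoted)() == sample)
```

Three-digit "\ddd" escapes are used throughout so that a following digit cannot be misread as part of the escape.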

----

Just a thought bubble; comments for/against welcome.

cheers, s-b etc
