How string.format implements %q

Robert Virding

May 30, 2025, 6:32:19 PM

to lua-l

Can anyone point me to where I can find out how `%q` works? How does it decide which bytes to process, and how does it process them? Sometimes, while it does a "correct" thing, it is not obvious why it does it.

One example is when it finds a byte >128. Sometimes it passes the byte as is and sometimes it passes it as `\ddd`.

The reason I am asking is that I have implemented Luerl, an implementation of Lua in Erlang, and I would like to have my `%q` be handled the same as in Lua.

Thanks for any help,

Robert

Sainan

May 30, 2025, 6:40:16 PM

to lu...@googlegroups.com

You can find the implementation in the function addquoted in lstrlib.c.

As for the distinction between \d and \ddd, it checks whether the next character is a digit, to ensure the result is unambiguous.
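
As a rough illustration of the logic described above, here is a sketch in Lua of what addquoted appears to do, written from a reading of lstrlib.c rather than copied from it. The names quote and is_cntrl are hypothetical, and is_cntrl assumes the plain "C" locale (bytes 0 to 31 and 127); the real code calls C's iscntrl, whose answer can depend on the locale.

```lua
-- Hypothetical stand-in for C's iscntrl(), assuming the plain "C" locale.
local function is_cntrl(b)
  return b < 32 or b == 127
end

-- Sketch of the "%q" quoting rules for strings.
local function quote(s)
  local out = { '"' }
  for i = 1, #s do
    local b = s:byte(i)
    local c = s:sub(i, i)
    if c == '"' or c == "\\" or c == "\n" then
      -- Quote, backslash and newline are prefixed with a backslash.
      out[#out + 1] = "\\" .. c
    elseif is_cntrl(b) then
      local nxt = s:byte(i + 1) -- nil at the end of the string
      if nxt and nxt >= 48 and nxt <= 57 then
        -- A digit follows: pad to three digits so the escape stays unambiguous.
        out[#out + 1] = string.format("\\%03d", b)
      else
        out[#out + 1] = string.format("\\%d", b)
      end
    else
      -- Everything else, including bytes above 127 here, is copied as is.
      out[#out + 1] = c
    end
  end
  out[#out + 1] = '"'
  return table.concat(out)
end

print(quote("a\0001")) -- NUL followed by the digit 1
print(quote("\1x"))    -- byte 1 followed by a non-digit
```

Comparing quote(s) with string.format("%q", s) over a range of inputs is one way to check a reimplementation against a reference Lua build.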

-- Sainan

Robert Virding

May 30, 2025, 7:12:33 PM

to lu...@googlegroups.com

Wow, ok thanks.

I see I will have to brush up on my 'C'.

Robert

--
You received this message because you are subscribed to a topic in the Google Groups "lua-l" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/lua-l/Ntf0jVaZe5I/unsubscribe.
To unsubscribe from this group and all its topics, send an email to lua-l+un...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/lua-l/gTZBKMh7yCma5V7l8KARpeVSj3FCV5MJ3zbGGHlqdR599AxcVDG2bkhAsaqryfEZznyLQuqTazO1kf6RMRd0HtkSDSUXrl_MvryNqeWT8lM%3D%40calamity.inc.

Robert Virding

May 30, 2025, 7:40:40 PM

to lua-l

OK, I have checked a bit more in the addquoted procedure, but I can only see that it explicitly handles control characters and everything else is passed straight through. Do characters between 128 and 255 also count as control characters here? And all the control characters are transformed to an escape sequence, either `\d` or `\ddd`, which is not quite what is happening in the transformations of the large bytes that I see. Or have I misunderstood the C?

Robert

Gé Weijers

May 30, 2025, 8:38:04 PM

to lu...@googlegroups.com

On Fri, May 30, 2025 at 4:40 PM Robert Virding <rvir...@gmail.com> wrote:

> OK, I have checked a bit more in the addquoted procedure, but I can only see that it explicitly handles control characters and everything else is passed straight through. Do characters between 128 and 255 also count as control characters here? And all the control characters are transformed to an escape sequence, either `\d` or `\ddd`, which is not quite what is happening in the transformations of the large bytes that I see. Or have I misunderstood the C?
>
> Robert

Hello,

The format "%d" in C will produce 1 or more digits, the minimum necessary.

The following characters are considered control characters according to the C90 standard:

005 ENQ 006 ACK 007 BEL 010 BS 011 HT
012 NL 013 VT 014 NP 015 CR 016 SO
017 SI 020 DLE 021 DC1 022 DC2 023 DC3
024 DC4 025 NAK 026 SYN 027 ETB 030 CAN
031 EM 032 SUB 033 ESC 034 FS 035 GS
036 RS 037 US 177 DEL

This code snippet prints every byte value from 0 to 255, encoded with "%q":

local digit0, space = string.byte("0 ", 1, 2)
for i = 0, 255 do
  print(("%d: %q"):format(i, string.char(i, space, i, digit0)))
end

It prints the encoding of each character followed by a space, and then of the same character followed by the digit 0.

Hope this helps.

--

Gé

Denis Dos Santos Silva

May 30, 2025, 11:30:55 PM

to lua-l

https://github.com/lua/lua/blob/6e22fedb74cf0c9b6656e9fce8b7331db847c605/lstrlib.c#L1273

string.format()

Like any other formatting function, it expands its output based on the type and precision of the variable.

Sainan

May 31, 2025, 3:42:37 AM

to lu...@googlegroups.com

> Do characters between 128 and 255 also count as control characters as well here?

It's funny because I just recently looked at exactly this because it was not escaping invalid UTF-8 sequences: <https://github.com/PlutoLang/Pluto/pull/1176>

-- Sainan

Robert Virding

May 31, 2025, 6:02:29 AM

to lu...@googlegroups.com

Ge, I have modified your loop a bit so it gives me the actual characters that are output by the "%q" including the leading and trailing ". This I can probably use when implementing my string.format("%q", ..). And yes I admit that the code is not very beautiful and can definitely be improved, but I am not an experienced Lua programmer.

for i = 0, 255 do
  print(i, string.char(i), string.byte(string.format("%q", string.char(i)), 1, 40))
end

Robert


Denis Dos Santos Silva

Jun 1, 2025, 2:35:26 AM

to lua-l

https://www.lua.org/manual/5.4/manual.html#6.4

The string module doesn't support UTF-8.

6.4 – String Manipulation
This library provides generic functions for string manipulation, such as finding and extracting substrings, and pattern matching. When indexing a string in Lua, the first character is at position 1 (not at 0, as in C). Indices are allowed to be negative and are interpreted as indexing backwards, from the end of the string. Thus, the last character is at position -1, and so on.

...

string.format (formatstring, ···)
Returns a formatted version of its variable number of arguments following the description given in its first argument, which must be a string. The format string follows the same rules as the ISO C function sprintf. The only differences are that the conversion specifiers and modifiers F, n, *, h, L, and l are not supported and that there is an extra specifier, q. Both width and precision, when present, are limited to two digits.
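
To make the purpose of the extra q specifier concrete: its output for a string is meant to be readable back by the Lua interpreter as the same string. The following is a small sketch that checks this for every byte value, assuming Lua 5.3 or later (where load accepts a string); roundtrips and all_bytes are names made up for this example.

```lua
-- Hypothetical helper: true if s survives a "%q" round trip through load.
local function roundtrips(s)
  local chunk = assert(load("return " .. string.format("%q", s)))
  return chunk() == s
end

-- Build a string that contains every byte value exactly once.
local bytes = {}
for b = 0, 255 do
  bytes[#bytes + 1] = string.char(b)
end
local all_bytes = table.concat(bytes)

print(roundtrips(all_bytes))
print(roundtrips('"\237lo"\n\\'))
```

The check works on raw bytes only, so it does not depend on how any particular editor or terminal would display them.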

Robert Virding

Jun 1, 2025, 12:28:35 PM

to lu...@googlegroups.com

I have done more checking and the issue is not in the '%q' formatting but in the scanning/parsing itself. I am using a Lua 5.4.7 downloaded from https://lua.org for comparison. I have this example for testing which causes the problem:

x = '"ílo"\n\\'
assert(load(string.format('return %q', x))() == x)

When I run it with Lua it works fine, and if I do
string.byte(x, 1, 40) ===> 34 195 173 108 111 34 10 92

When I try it with Luerl with utf-8 encoding the assertion fails. However, if I turn off utf-8 encoding then it works and

string.byte(x, 1, 40) ===> 34 237 108 111 34 10 92

If I turn on utf-8 encoding I get the same x as with Lua but the load(string.format('return %q', x)) now becomes
34 195 131 173 108 111 34 10 92
so it seems as if doing the Luerl 'load' parses the string again and the '195' becomes utf-8 encoded to '195 131' and the equality fails.

So doesn't 'load' parse its string in the same way?
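
The "195 becomes 195 131" effect described above is what one would expect if raw bytes are treated as Latin-1 code points and then written out as UTF-8. Here is a minimal sketch of that re-encoding, assuming Lua 5.3 or later for the utf8 library; latin1_to_utf8 and byte_list are hypothetical helpers, not part of Lua or Luerl.

```lua
-- Treat every byte of s as a Latin-1 code point and encode it as UTF-8.
local function latin1_to_utf8(s)
  return (s:gsub(".", function(c)
    return utf8.char(c:byte())
  end))
end

-- Render the bytes of s as a space-separated list of decimal values.
local function byte_list(s)
  return table.concat({ s:byte(1, -1) }, " ")
end

print(byte_list(latin1_to_utf8("\237")))     --> 195 173
print(byte_list(latin1_to_utf8("\195")))     --> 195 131
print(byte_list(latin1_to_utf8("\195\173"))) --> 195 131 194 173
```

A byte that is produced by a numeric escape such as \173 never passes through such a conversion, which would be consistent with only the raw 195 being expanded in the sequence above.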

Btw the string.format of the 'return %q' in both cases (when I utf-8 encode the scan) becomes
114 101 116 117 114 110 32 34 92 34 195 92 49 55 51 108 111 92 34 92 10 92 92 34
The '92 49 55 51' is the expanded 173.
Which shows my '%q' handling works as it should. At least in this case.

I don't know. Sorry about the length of this, it is just very puzzling. And I do want my Luerl to behave as it should, which it does for the rest of what it provides.

Robert


Sainan

Jun 1, 2025, 12:40:06 PM

to lu...@googlegroups.com

I remember this, this is from testes/strings.lua, right? That file is using ISO Latin encoding as opposed to UTF-8. Roughly speaking, strings in Lua represent exactly the bytes that are in between the quotes, so if the file is encoded differently, the string will contain different data.

-- Sainan

Robert Virding

Jun 1, 2025, 12:51:42 PM

to lu...@googlegroups.com

The example was in an issue in Luerl git. In the issue it says "I grabbed the 5.3.4 tests from https://www.lua.org/tests/ and ran small pieces of the string.lua test suite." If this helps,

Robert


Sainan

Jun 1, 2025, 1:02:23 PM

to lu...@googlegroups.com

I would double-check that the encoding of the string was not changed in the process of extracting the snippet, as e.g. I noticed the same where I would copy it from one file into another and suddenly it was UTF-8 encoded.

-- Sainan

Roberto Ierusalimschy

Jun 2, 2025, 12:15:30 PM

to lu...@googlegroups.com

> The following characters are considered control characters according to the
> C90 standard:
>
> 005 ENQ 006 ACK 007 BEL 010 BS 011 HT
> 012 NL 013 VT 014 NP 015 CR 016 SO
> 017 SI 020 DLE 021 DC1 022 DC2 023 DC3
> 024 DC4 025 NAK 026 SYN 027 ETB 030 CAN
> 031 EM 032 SUB 033 ESC 034 FS 035 GS
> 036 RS 037 US 177 DEL

All I could find in the C99 standard as a definition for control
character was this: "the term *control character* refers to a
member of a locale-specific set of characters that are not printing
characters. All letters and digits are printing characters." (7.4.3) The
key point here is "locale-specific": Different locales can result in
different behaviors for "%q".

The rationale for this behavior is that "printing characters" are printed
as themselves; the others are encoded.
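
One way to observe the locale dependence is to ask "%q" itself which bytes it escapes numerically. The sketch below assumes a standard Lua 5.3 or 5.4 build; escaped_bytes is a hypothetical helper, and the result depends on the platform's C library and on the active locale, so no particular output is claimed here.

```lua
-- Collect the byte values that "%q" writes as a numeric escape (\ddd)
-- under whatever C locale is currently active.
local function escaped_bytes()
  local found = {}
  for b = 0, 255 do
    local q = string.format("%q", string.char(b))
    -- A numeric escape is a backslash followed only by digits, in quotes.
    if q:match('^"\\%d+"$') then
      found[#found + 1] = b
    end
  end
  return found
end

-- os.setlocale(nil, "ctype") only queries the current setting.
print("ctype locale:", os.setlocale(nil, "ctype"))
print("escaped:", table.concat(escaped_bytes(), " "))
```

Calling os.setlocale with another locale name before escaped_bytes() shows whether that locale changes the set; os.setlocale returns nil when the requested locale is not available.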

-- Roberto

Gé Weijers

Jun 2, 2025, 3:42:27 PM

to lu...@googlegroups.com

On Mon, Jun 2, 2025 at 9:15 AM Roberto Ierusalimschy <rob...@inf.puc-rio.br> wrote:

> > The following characters are considered control characters according to the
> > C90 standard:
> >
> > 005 ENQ 006 ACK 007 BEL 010 BS 011 HT
> > 012 NL 013 VT 014 NP 015 CR 016 SO
> > 017 SI 020 DLE 021 DC1 022 DC2 023 DC3
> > 024 DC4 025 NAK 026 SYN 027 ETB 030 CAN
> > 031 EM 032 SUB 033 ESC 034 FS 035 GS
> > 036 RS 037 US 177 DEL
>
> All I could find in the C99 standard as a definition for control
> character was this: "the term *control character* refers to a
> member of a locale-specific set of characters that are not printing
> characters. All letters and digits are printing characters." (7.4.3) The
> key point here is "locale-specific": Different locales can result in
> different behaviors for "%q".

You're right. The list above conforms to the POSIX and Single Unix specs. The C standard does not specify the character set.

The POSIX "C" locale specifies the set of "iscntrl" characters as

cntrl    <alert>;<backspace>;<tab>;<newline>;<vertical-tab>;\
         <form-feed>;<carriage-return>;\
         <NUL>;<SOH>;<STX>;<ETX>;<EOT>;<ENQ>;<ACK>;<SO>;\
         <SI>;<DLE>;<DC1>;<DC2>;<DC3>;<DC4>;<NAK>;<SYN>;\
         <ETB>;<CAN>;<EM>;<SUB>;<ESC>;<IS4>;<IS3>;<IS2>;\
         <IS1>;<DEL>

Source:

https://pubs.opengroup.org/onlinepubs/7908799/xbd/locale.html

> The rationale for this behavior is that "printing characters" are printed
> as themselves; the others are encoded.
>
> -- Roberto


--

Gé

Robert Virding

Jun 3, 2025, 9:40:20 AM

to lu...@googlegroups.com

OK, I have now managed to get some clarity on the problem. The UTF-8 encoding is not done in the parsing but in the reading of the file. When I read the file with io.input("test5.lua"):read("a") I see that the 237 has been UTF-8 encoded to 195 173. So if I don't do the UTF-8 encoding in the parsing in Luerl then it works, but the value is wrong, or rather different from Lua. Sigh! I haven't found where Lua does the UTF-8 encoding.

Anyway, this has had one good result and that is that I have cleaned up a lot of my code in Luerl.

Robert


Roberto Ierusalimschy

Jun 3, 2025, 10:04:25 AM

to lu...@googlegroups.com

> [...] Sigh! I haven't found where Lua does the utf8 encoding.

It doesn't. The language itself uses only ascii characters. Strings,
including literal strings in source code, are agnostic regarding
encodings. Whichever bytes are read from the source stream are kept
in a literal string, up to explicit escapes. (An exception is
line-breaks in long literals; the sequences '\n\r' and '\r\n' are
normalized to a single '\n'.)
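
A small sketch of the points above, assuming Lua 5.3 or later: the chunk source is built byte by byte with string.char, so the result does not depend on how this example itself is saved. The variable names are made up for illustration.

```lua
-- The same visible letter, supplied as one Latin-1 byte or as two UTF-8 bytes.
local latin1_src = 'return "' .. string.char(237) .. '"'
local utf8_src = 'return "' .. string.char(195, 173) .. '"'

local a = load(latin1_src)()
local b = load(utf8_src)()
print(#a, a:byte(1, -1)) -- one byte: 237
print(#b, b:byte(1, -1)) -- two bytes: 195 and 173

-- An explicit escape in pure-ASCII source yields the single byte 237.
local c = load('return "\\237"')()
print(c == a)

-- Line breaks inside a long literal are normalized to a single "\n".
local long = load("return [[x\r\ny]]")()
print(long == "x\ny")
```

Nothing in this sketch decodes or encodes text; the literals simply keep whatever bytes the source contained.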

-- Roberto

Sainan

Jun 3, 2025, 10:57:36 AM

to lu...@googlegroups.com

print(string.byte("í", 1, -1)) --> 237 when saved using ISO 8859-1 encoding (as testes/strings.lua is)
print(string.byte("í", 1, -1)) --> 195 173 when saved with UTF-8 encoding

-- Sainan

Francisco Olarte

Jun 3, 2025, 12:10:41 PM

to lu...@googlegroups.com

On Tue, 3 Jun 2025 at 15:40, Robert Virding <rvir...@gmail.com> wrote:
> OK, I have now managed to get some clarity on the problem. The UTF-8 encoding is not done in the parsing but in the reading of the file. When I read the file with io.input("test5.lua"):read("a") I see that the 237 has been UTF-8 encoded to 195 173. So if I don't do the UTF-8 encoding in the parsing in Luerl then it works, but the value is wrong, or rather different from Lua. Sigh! I haven't found where Lua does the UTF-8 encoding.

As others have already said, Lua handles it as just a byte stream;
only ASCII characters are allowed outside of strings.

Expanding on Sainan's example:
$ for a in t*.lua; do echo $a; od -tu1 -ta $a; lua $a; done
tlatin1.lua
0000000 112 114 105 110 116  40 115 116 114 105 110 103  46  98 121 116
          p   r   i   n   t   (   s   t   r   i   n   g   .   b   y   t
0000020 101  40  34 237  34  44  32  49  44  32  45  49  41  41  10
          e   (   "   m   "   ,  sp   1   ,  sp   -   1   )   )  nl
0000037
237
tutf8.lua
0000000 112 114 105 110 116  40 115 116 114 105 110 103  46  98 121 116
          p   r   i   n   t   (   s   t   r   i   n   g   .   b   y   t
0000020 101  40  34 195 173  34  44  32  49  44  32  45  49  41  41  10
          e   (   "   C   -   "   ,  sp   1   ,  sp   -   1   )   )  nl
0000040
195 173
(You may need to use a fixed font to align them, I did not want to try
html or anything else)
Different files, different bytes in the string, different output, lua
does not interpret them as utf8 or latin1 ( my terminal does, hence
the od usage ). It treats them more as bytes than characters or
codepoints.

Francisco Olarte.

Robert Virding

Jun 3, 2025, 12:16:17 PM

to lu...@googlegroups.com

Sainan, I am reading the same file in both cases.

Robert


Sainan

Jun 3, 2025, 12:23:59 PM

to lu...@googlegroups.com

> Sainan, I am reading the same file in both cases.

Then I am confused about where you are seeing UTF-8 encoding being silently introduced. I don't see any of that in PUC Lua. Also not when using io.input:read like you did:

--í
print(string.byte(io.input("test.lua"):read("a"), 3)) --> 237

-- Sainan

Scott Morgan

Jun 3, 2025, 12:47:39 PM

to lu...@googlegroups.com

Erlang file functions can change encodings of the data they read and write:

https://www.erlang.org/docs/21/man/file#open-2

For differences in behaviour between plain Lua (C) and Luerl (Erlang)
I'd check that out first.

Scott

Robert Virding

Jun 3, 2025, 3:47:12 PM

to lu...@googlegroups.com

Problem solved! 😀 My bad 😧 I was reading the file in unicode mode which is why it was doing the processing! When I run in latin1 mode it behaves in exactly the same way as it does in Lua.

Sorry for all the trouble and work I have caused. At least for me it resulted in improving the code and doing it in the same way as it is done in Lua.

Sorry,

Robert


Sainan

Jun 3, 2025, 3:58:30 PM

to lu...@googlegroups.com

> in latin1 mode

When you save the file as UTF-8 now, does #"í" == 2 hold?

-- Sainan

Robert Virding

Jun 3, 2025, 4:13:11 PM

to lu...@googlegroups.com

Currently I don't save any files, so the issue has not yet arisen. I will cross that bridge when I get to it.

Robert


Sainan

Jun 3, 2025, 4:17:29 PM

to lu...@googlegroups.com

Must be very hard writing software if you never save a file. :D

-- Sainan

Robert Virding

Jun 3, 2025, 4:31:57 PM

to lu...@googlegroups.com

Yes, I save files in the Erlang system in which I implement Luerl, but not as a part of the Luerl system. So far the Lua io/file library is basically not implemented and no one has asked for it yet. I am guessing that when I do want to save files they will be created in latin1 mode as well, as this will make their handling very straightforward.

Robert


Sainan

Jun 3, 2025, 4:39:25 PM

to lu...@googlegroups.com

As we've been saying over and over again, Lua has no concept of character encoding (during compilation, at least), but you now seem to think it uses Latin-1 encoding just because one of its test files uses Latin-1 encoding. This is incorrect. I'd say most Lua source code, and text files in general, use UTF-8. But you can't make any assumptions about this.

In other words...

function add(a, b)
  return 4
end
assert(add(2, 2) == 4)

Tests are passing, ship it!

-- Sainan

Robert Virding

Jun 3, 2025, 5:00:50 PM

to lu...@googlegroups.com

The latin1 encoding is about how you control the Erlang file interface. Latin1 is the default encoding, which basically just reads/writes the data as is. I was using the unicode encoding, which was doing UTF-8 decoding when reading, which is why I was getting different data. Your example just works.

Tomorrow I will save a new version of the Luerl develop branch with my new fixes.

Robert


Sainan

Jun 3, 2025, 5:03:31 PM

to lu...@googlegroups.com

Hmm, I'd call that "binary".

-- Sainan

Robert Virding

Jun 3, 2025, 5:08:56 PM

to lu...@googlegroups.com

That's the erlang way. It's in the docs.

Robert

