How to pass Unicode (UTF-8) command-line arguments to a Lua script?


Badr Elmers

Jun 13, 2025, 4:25:34 AM
to lua-l

I'm building a command-line tool using Lua; users may call my script with UTF-8 arguments.

Programming in Lua, 4th edition, says:

Several things in Lua “just work” for UTF-8 strings. Because Lua is 8-bit clean, it can read, write, and store UTF-8 strings just like other strings.

but this seems not to be true for CLI parameters; here is a small test:

test.lua contains:

io.write(arg[1])

I run it like this:

lua test.lua かسжГ > test.txt

I get:

????

and I get the same result with:

io.open("test.txt", "wb"):write(arg[1])

Test done with lua-5.4.8_Win32 on Windows 7 x64.

How can I solve this? Is there a workaround?


Scott Morgan

Jun 13, 2025, 5:03:43 AM
to lu...@googlegroups.com
On 13/06/2025 04:12, Badr Elmers wrote:
> I'm building a command-line tool using Lua; users may call my script
> with UTF-8 arguments.
>

The Windows command line is terrible for this, but it's slowly getting
better. Ideally, use Windows Terminal instead of the old console.

Before running your command, try using the chcp command to switch the
console codepage to UTF-8 (in Windows, codepage 65001 is UTF-8):

chcp 65001

If that works, you can make it permanent, but that may cause issues with
other command-line apps that expect the locale codepage, so do a bit of
googling before making that change. It may be better just to use the
chcp command when needed.

It's also worth noting that chcp can be called from Lua scripts, which
won't help with the command line arguments, but is useful if you know
you're going to be outputting UTF8 to stdout.

os.execute("chcp 65001")
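
For example, a minimal sketch that switches to UTF-8 only for the
duration of a script and restores the previous codepage afterwards
(it assumes chcp reports in the English "Active code page: NNN" form,
which is what gets parsed):

-- query the current console codepage by parsing chcp's report
local function current_codepage()
  local p = io.popen("chcp")            -- e.g. "Active code page: 437"
  local out = p:read("*a")
  p:close()
  return out:match("%d+")
end

local previous = current_codepage()
os.execute("chcp 65001 > nul")          -- console output is now UTF-8
io.write("かسжГ\n")                     -- renders correctly under 65001
os.execute("chcp " .. previous .. " > nul")  -- put things back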

Scott

Sainan

Jun 13, 2025, 7:17:24 AM
to lu...@googlegroups.com
Yeah, unfortunately Lua being standard C means it's very locale dependent. In Pluto (our fork of Lua), we've made many changes to convert 8-bit strings to UTF-16 on Windows (or vice-versa) so most things do actually "just work".

-- Sainan

Denis Dos Santos Silva

Jun 13, 2025, 2:06:38 PM
to lua-l
you may need a package with UTF-8 support*

$ lua utf8arg.lua "Olá, 世界!"  -- hello world

utf8arg.lua:

local utf8 = require("utf8")  -- built into Lua 5.4; the luautf8* rock on older Luas

local input = assert(arg[1], "usage: lua utf8arg.lua <string>")
-- utf8.len returns nil plus an error position for invalid UTF-8
print("len: " .. assert(utf8.len(input), "argument is not valid UTF-8"))

for p, c in utf8.codes(input) do
    local ch = utf8.char(c)
    print(string.format("  index%d: '%s' (U+%04X)", p, ch, c))
end

* https://github.com/starwing/luautf8

Thijs Schreijer

Jun 13, 2025, 3:07:28 PM
to lu...@googlegroups.com
Windows is hard to work with in this respect. Microsoft is slowly moving toward better compatibility, but it's really hard if you don't know where to look.

Check out LuaSystem [1]; it has several primitives to configure terminals, and it supports Windows. For example, it allows you to change the codepage to UTF-8 from Lua.
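
For example (a minimal sketch; the codepage helpers shown are the ones
in recent LuaSystem releases, so check the documentation of the version
you actually install):

local sys = require("system")            -- LuaSystem

local old_cp = sys.getconsoleoutputcp()  -- remember the current codepage
sys.setconsoleoutputcp(65001)            -- switch console output to UTF-8
io.write("かسжГ\n")
sys.setconsoleoutputcp(old_cp)           -- restore before exiting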

We have a terminal project in the works, which explicitly supports Windows (it's a GSoC project). [2]

hth
Thijs

Badr Elmers

Jun 14, 2025, 10:56:51 PM
to lua-l
> Before running your command, try using the chcp command to switch the
> console codepage to UTF-8 (in Windows, codepage 65001 is UTF-8):
>
> chcp 65001

chcp changes the console's code page; it doesn't automatically force all applications launched from that CMD session to operate in full UTF-8 mode internally, the way LC_ALL does on Linux. Many older Windows applications, and even parts of the Windows API (the so-called "ANSI" APIs), still rely on the system's default ANSI code page. If these applications don't explicitly use the Unicode (UTF-16) APIs, they may still misinterpret or mangle UTF-8 data, even if the console is set to 65001.

Badr Elmers

Jun 14, 2025, 10:59:59 PM
to lua-l
>  Yeah, unfortunately Lua being standard C means it's very locale dependent. In Pluto (our fork of Lua), we've made many changes to convert 8-bit strings to UTF-16 on Windows (or vice-versa) so most things do actually "just work".

Pluto works perfectly. Unfortunately I could not compile it with mingw-w64 x86, because I need a 32-bit build for XP and later; I also tried compiling 32-bit Pluto with VS 2015 using the v140_xp platform toolset, but cl.exe crashes.

Badr Elmers

Jun 14, 2025, 11:01:19 PM
to lua-l
> you may need a package with utf-8 support*

> $ lua utf8arg.lua "Olá, 世界!"  -- hello world

It did not work, because the problem is in arg[1]: Lua receives ANSI input using the default ANSI code page configured in Windows, so luautf8 can do nothing here. This is what I got:
len: 8
  index1: 'O' (U+004F)
  index2: 'l' (U+006C)
  index3: '?' (U+003F)
  index4: ',' (U+002C)
  index5: ' ' (U+0020)
  index6: '?' (U+003F)
  index7: '?' (U+003F)
  index8: '!' (U+0021)

Badr Elmers

Jun 14, 2025, 11:13:17 PM
to lua-l
>  We have a terminal project in the works, which explicitly supports Windows (it's a GSoC project). [2]

LuaRocks could not install terminal.lua; it fails while building LuaSystem with mingw-w64 x86 on Windows 7 x64, using the latest Lua and LuaRocks versions.
[attached: screenshot of the LuaRocks build error]

Andrew Trevorrow

Jun 15, 2025, 6:09:33 AM
to lu...@googlegroups.com
All my Lua projects use Peter Wu's patch that allows Lua functions
like dofile and io.open to handle UTF-8 paths on Windows:

https://github.com/Lekensteyn/lua-unicode

Trivial to add. Would make an excellent addition to the official Lua release.

Andrew

Sainan

Jun 15, 2025, 6:25:38 AM
to lu...@googlegroups.com
> Pluto works perfectly. Unfortunately I could not compile it with mingw-w64 x86, because I need a 32-bit build for XP and later; I also tried compiling 32-bit Pluto with VS 2015 using the v140_xp platform toolset, but cl.exe crashes.

Strange; I don't think cl.exe should just be crashing. I personally haven't tried compiling it for Windows XP, but I also can't imagine we're doing anything too unportable, as long as you have a C++17 compiler.

-- Sainan



Thijs Schreijer

Jul 11, 2025, 4:31:29 PM
to lu...@googlegroups.com

On Sun, 15 Jun 2025, at 05:13, Badr Elmers wrote:
> > We have a terminal project in the works, which explicitly supports Windows (it's a GSoC project). [2]
>
> LuaRocks could not install terminal.lua; it fails while building LuaSystem with mingw-w64 x86 on Windows 7 x64, using the latest Lua and LuaRocks versions.

I was working on LuaSystem today and revisited this old thread to see if I could reproduce it. Only now am I seeing that the environment is Windows 7, which indeed will not work, since it lacks the required APIs for terminal configuration (they were introduced in 2019).

Thijs

Viacheslav Usov

Jul 11, 2025, 7:03:50 PM
to lu...@googlegroups.com, lua-l
On 13 Jun 2025, at 10:25, Badr Elmers <badre...@gmail.com> wrote:

> I'm building a command-line tool using Lua; users may call my script with UTF-8 arguments.
>
> Programming in Lua, 4th edition, says:
>
> Several things in Lua “just work” for UTF-8 strings. Because Lua is 8-bit clean, it can read, write, and store UTF-8 strings just like other strings.
>
> but this seems not to be true for CLI parameters; here is a small test:
>
> test.lua contains:
>
> io.write(arg[1])
>
> I run it like this:
>
> lua test.lua かسжГ > test.txt

You cited a passage from the book that spoke of UTF-8 strings.

However, your claim that it is incorrect is not properly substantiated. First, you do not give Lua a UTF-8 string, of which you are acutely aware in your subsequent messages. Second, the citation does not mention any I/O, which is what you do, though I will grant that there might be doubt as to whether that was meant. Lua's built-in I/O is documented to be implemented in standard C, so it is of course locale-dependent; UTF-8 cleanliness cannot be implied there.

I will say even more: Lua is not lua.exe. Lua is an embeddable language and if the host gives it UTF-8 strings, it will handle them as stated, and I can personally attest to that.

Unless you rephrase your problem, it has nothing to do with Lua. Your problem seems to be obtaining UTF-8 strings from a Windows console environment. You already know that you cannot type or paste something containing non-ASCII characters at a Windows command prompt and expect it to arrive as UTF-8. The only certain way of passing non-ASCII in this case is to wrap it in pure ASCII, such as hex or base-64. Without more details on your application, I cannot say anything less general.
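
A minimal sketch of that wrapping idea (the hex scheme and the file
name are only illustrative):

-- hexargs.lua: the caller passes arguments hex-encoded, i.e. pure ASCII.
-- E.g. the UTF-8 bytes of "かسжГ" are E3 81 8B D8 B3 D0 B6 D0 93, so:
--   lua hexargs.lua E3818BD8B3D0B6D093
local function from_hex(s)
  return (s:gsub("%x%x", function(byte)
    return string.char(tonumber(byte, 16))
  end))
end

local utf8_arg = from_hex(arg[1])
io.open("test.txt", "wb"):write(utf8_arg):close()  -- bytes arrive intact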

Cheers,
V.

Sainan

Jul 12, 2025, 12:12:32 AM
to lu...@googlegroups.com
> You already know that you cannot type or paste something containing non-ASCII characters at a Windows command prompt and expect it to arrive as UTF-8.

Well, that's a bit pessimistic. Windows itself operates in UTF-16, so you can get that, and ask it to convert it to UTF-8.

-- Sainan

bil til

Jul 12, 2025, 7:52:45 AM
to lu...@googlegroups.com
On Sat, 12 Jul 2025 at 06:12, 'Sainan' via lua-l
<lu...@googlegroups.com> wrote:
> Well, that's a bit pessimistic. Windows itself operates in UTF-16, so you can get that, and ask it to convert it to UTF-8.

Are you sure about "UTF-16"?

As far as I know, Windows in the "good old Windows days" only
supported "Unicode 16" (which is VERY different from UTF-16). This
Unicode 16 proved to be a nightmare for byte-wise data transfer, as
typically used for e.g. Internet/HTML transfer: if one byte is corrupt,
the complete remainder of the transferred data block will typically be
unreadable. The same applies to any Unicode-16-encoded text file.

UTF-8 is very safe against "byte-corruption errors": if a byte is lost
or corrupted, typically only ONE UTF-8 character is lost, and the
stream becomes readable again after that.
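
To make that concrete, a small sketch using Lua 5.4's utf8 library (the
string and the corrupted byte position are arbitrary examples):

local s = "かسжГ"                               -- 8 bytes of UTF-8
local bad = s:sub(1, 3) .. "\xFF" .. s:sub(5)   -- clobber byte 4

local n, pos = utf8.len(bad)
print(n, pos)                     -- nil  4: position of first invalid byte

-- continuation bytes are 0x80..0xBF, so skip ahead to the next lead byte
local resume = pos + 1
while bad:byte(resume) and bad:byte(resume) >= 0x80
      and bad:byte(resume) < 0xC0 do
  resume = resume + 1
end
for _, c in utf8.codes(bad:sub(resume)) do
  io.write(utf8.char(c))          -- prints "жГ"; together with the intact
end                               -- prefix "か", only one char was lost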

Concerning Windows GUI software, MS reacted only from Win10 onwards
(in these newer versions you can specify UTF-8 as the "base encoding"
for your GUI commands).

For compatibility with older Windows GUI software it is better to use
and support Unicode 16 for international encoding. I assume the same
should also apply to Windows command-line programs (but I have not
tested this so far).

Sainan

unread,
Jul 12, 2025, 8:33:59 AMJul 12
to lu...@googlegroups.com
> If one byte is corrupt, the complete remainder of the transferred data block typically will be unreadable

Luckily a problem solved by Reed & Solomon in 1960 (among several other solutions).

> in Win10 [...] you can specify to use UTF8

Yes, that's possible, but the "wide" variants of the WinAPI functions are available and are a more portable solution. These use standard UTF-16, so you may get surrogate pairs for certain emoji, but you will generally be fine with one word (2 bytes) for any Latin character as well as for Kanji/Hanzi.

-- Sainan

Scott Morgan

Jul 12, 2025, 9:05:58 AM
to lu...@googlegroups.com
On 12/07/2025 12:52, bil til wrote:
> On Sat, 12 Jul 2025 at 06:12, 'Sainan' via lua-l
> <lua-l...@googlegroups.com> wrote:
>> Well, that's a bit pessimistic. Windows itself operates in
>> UTF-16, so you can get that, and ask it to convert it to UTF-8.
>
> Are you sure concerning "UTF16"?
>
> As far as I know, Windows in the "good old Windows time" only
> supported "UNICODE 16" (which is VERY different to UTF-16).

Windows NT started with the UCS-2 Unicode encoding, the forerunner of
UTF-16. At the time (the early 90s) this was considered best practice
(Java used it as well, for the same reason).


> This Unicode16 proved to be a nightmare for byte-wise data transfer
> as typically used for e. g. Internet / HTML transfer... : If one
> byte is corrupt, the complete remainder of the transferred data
> block typically will be unreadable - the same applies to any
> Unicode 16 encoded text files.

This is false. A single byte corruption would just result in a single
char failing, just like any other text encoding. The main problem with
UCS-2 (and UTF-16) data transmission is the little-endian vs big-endian
encoding question. Hence the use of BOM (byte order mark) chars at the
start of text.

A missing byte would result in problems for both UTF-16 and UTF-8. But
that would be a problem for any data transmission without heavy data
redundancy encoding. Usually an issue for the underlying transmission
protocol, not plain text encoding.


> Concerning Windows GUI software, MS reacted only in Win10 onwards
> (for these newer versions you can specify to use UTF8 as "base
> encoding" for your GUI commands).

No idea what you mean by 'GUI commands'. Are you mixing up GUI with
command line? Or are you talking about the Win32 API?

The Windows command line defaults to the system locale's 8-bit encoding
(single- or multi-byte, depending on locale), not UTF-8. You can change
this, but there are still issues handling UTF-8 cleanly.


> For compatibility with older Win GUI software it is better to use /
> support Unicode 16 for international encoding... . I assume the same
> should also apply to Windows command line programs (but I did not
> test this so far...).

Since at least XP (I think), Windows has been using UTF-16 internally,
as opposed to UCS-2. Applications built against the Windows API can use
this directly, or they can use the locale-dependent 8-bit code page.
This is why Windows API calls come in *A and *W forms, e.g.
CreateFileA/CreateFileW. Generally, the *A form just converts the text
from the locale's 8-bit encoding to UTF-16 and uses the *W call
internally.

C standard library calls, like fopen, basically go through to Windows
API calls in a similar fashion.

Scott

bil til

Jul 12, 2025, 9:51:20 AM
to lu...@googlegroups.com
On Sat, 12 Jul 2025 at 15:05, 'Scott Morgan' via lua-l
<lu...@googlegroups.com> wrote:
>

> ..., e.g. CreateFileA/CreateFileW. Generally, the *A form just converts
> the text from locale 8-bit to UTF-16 and uses the *W calls internally.

Unicode16 has a fixed 2 bytes per char (this is nice for a strlenw
command: just use the relatively fast strlen and divide by 2).

Getting the char length of a UTF string (either 8 or 16) is an
"elaborate task": you have to analyze every UTF char in the string
successively, each of which typically has 1-4 bytes (1-4 bytes is the
same max length for UTF-8 and UTF-16 as I see it, although I am only
familiar with UTF-8). So the typically very important and heavily used
strlen function of C needs much more time for UTF strings.

The advantage of UTF formatting is that with byte data transfer (at
least with UTF-8, for sure) any byte corruption will create only local
"1 char damage". (I am not familiar with UTF-16, but I think it is not
used very often...)

The Unicode16 encoding used in Windows makes sense for the internal
RAM of a program, which I think should typically be a fairly safe
assumption. But already for HDD/SSD files this condition becomes
questionable; I would classify Unicode-encoded files with a fixed
number of bytes per encoded char as generally dangerous.

I just read the wiki article "Unicode in Windows", and I am really
surprised that, as you say, only WinNT used Unicode (which is
restricted to 32 bits and not used any more, as 2 bytes meanwhile do
NOT cover the complete Unicode char tables...). So since Win2000 they
internally use UTF-16, which sounds somehow strange and stubborn, but
that is how MS sometimes is :) (sometimes it is not the most stupid
approach for backward compatibility to be a bit stubborn :) ).

Since 2003, more than 98% of all Internet pages have been encoded in
UTF-8 (I would assume hardly any in UTF-16 and for sure none in the
meanwhile-superseded "Unicode16", i.e. UCS-2), but real support for
UTF-8 in Windows came only in 2019 with Win10.

Scott Morgan

Jul 12, 2025, 10:53:29 AM
to lu...@googlegroups.com
On 12/07/2025 14:51, bil til wrote:
> On Sat, 12 Jul 2025 at 15:05, 'Scott Morgan' via lua-l
> <lu...@googlegroups.com> wrote:
>>
>
> ..., e.g. CreateFileA/CreateFileW. Generally, the *A form just converts
> the text from the locale's 8-bit encoding to UTF-16 and uses the *W
> call internally.

Not sure why you're quoting this bit.


> Unicode16 has a fixed 2 bytes per char (this is nice for a strlenw
> command: just use the relatively fast strlen and divide by 2).

False. For starters, what is 'Unicode16'? Can you point to its
specification? Do you mean UTF-16, the common 16-bit encoding for
Unicode? That uses one *or two* 16-bit words to encode a single char;
Unicode covers more characters than the 65536 possible with a single
16-bit number.

UCS-2 started out, in the late 80s/early 90s, as an encoding with a
single 16-bit word per char. The idea was that 16 bits would be enough,
but it was quickly realised that they could not cover all the
codepoints needed.

If you want single word per char encoding, UTF-32 is required. But you
still have to handle things like combining chars, even after normalisation.


> Getting the char length of a UTF string (either 8 or 16) is an
> "elaborate task": you have to analyze every UTF char in the string
> successively, each of which typically has 1-4 bytes (1-4 bytes is the
> same max length for UTF-8 and UTF-16 as I see it, although I am only
> familiar with UTF-8). So the typically very important and heavily used
> strlen function of C needs much more time for UTF strings.

Not really true. UTF-8 and UTF-16 announce, in a char's first
byte/word, how many bytes/words the char occupies, so you can skim the
intervening bytes easily. A bit more complex than plain strlen, but not
by much.
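
A quick Lua sketch of that skimming, assuming well-formed UTF-8 (the
lead byte announces the sequence length; continuation bytes are never
inspected):

local function utf8_length(s)
  local count, i = 0, 1
  while i <= #s do
    local b = s:byte(i)
    if     b < 0x80 then i = i + 1   -- 0xxxxxxx: ASCII
    elseif b < 0xE0 then i = i + 2   -- 110xxxxx: 2-byte sequence
    elseif b < 0xF0 then i = i + 3   -- 1110xxxx: 3-byte sequence
    else                 i = i + 4   -- 11110xxx: 4-byte sequence
    end
    count = count + 1
  end
  return count
end

print(utf8_length("Olá, 世界!"))   -- 8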

strlen never truly worked as you think anyway, as you'd still have to
support non-Unicode multi-byte charsets like JIS or GB 18030 if you
wanted your code to be properly multi-language. It should only be
considered as a method for counting bytes, not characters/glyphs, in an
8-bit clean char encoding.


> The Unicode16 encoding used in Windows makes sense for the internal
> RAM of a program, which I think should typically be a fairly safe
> assumption. But already for HDD/SSD files this condition becomes
> questionable; I would classify Unicode-encoded files with a fixed
> number of bytes per encoded char as generally dangerous.

This is all nonsense!


> I just read the wiki article "Unicode in Windows", and I am really
> surprised that as you say only WinNT used Unicode

False. Re-read my previous email, I never said that.

WinNT started with UCS-2, then moved to UTF-16, where it still stands
today. Modern-day Windows is a version of NT; it never left, it just
got rebranded.


I don't want to be rude, but it's pretty clear you don't know what
you're talking about with regard to Windows or the Unicode standard.
You're using made up terms, and evidently aren't aware of the history.
Please stop, you're not helping anyone.

Scott

Viacheslav Usov

Jul 12, 2025, 3:25:17 PM
to lu...@googlegroups.com
It is true that every Windows process has a variable of type
'UNICODE_STRING' named 'CommandLine', which is said to contain 'The
command-line string passed to the process'. [1] It is said elsewhere
[2], indirectly and rather confusingly, that this may be essentially a
random string passed on by a parent process. If the new process is an
application written in standard C, like lua.exe, 'CommandLine' will be
read and processed by the standard C library's initialization code, in
order to translate it into the argv[] array of the main() function.
Note that before main() is called, the application code has not run
yet, so it can influence neither the parent nor the translation in any
way. And I claim that such an application, started at a random Windows
command prompt, is unlikely ever to receive UTF-8-encoded argv[], let
alone UTF-8-encoded argv[] that correctly reproduces, in general,
whatever the application was supposed to receive if the original
arguments contained non-ASCII Unicode codepoints. This same thread
illustrates that amply.

Cheers,
V.

[1] https://learn.microsoft.com/en-us/windows/win32/api/winternl/ns-winternl-rtl_user_process_parameters

[2] https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-createprocessw

Sainan

Jul 13, 2025, 12:10:51 AM
to lu...@googlegroups.com
Well, there is a 'wmain' under Windows which receives wchar_t instead of char in argv[].

Even if the applications starting yours don't use UTF-8, they still need to use whatever the ANSI code page is currently set to, or UTF-16. If they don't encode that properly, that's a bug in them.

-- Sainan