Some thoughts about string

重归混沌

Oct 16, 2025, 2:24:30 AM
to lua-l
I’m really glad to see that external strings are coming in the upcoming Lua 5.5.

I wonder if, in some future version, we might go even further.

In many real-world scenarios, we often need to split a large string
into many small substrings.

This can easily lead to a lot of small, duplicated string objects.

Perhaps it would make sense to think of a Lua string not as a
standalone memory block, but as a view into an internal string buffer.

In this model, operations like string.sub would simply create new
views referencing the same underlying data, rather than copying it.

Conceptually, this is similar to Go’s slices, but since Lua strings
are immutable, it would avoid the pitfalls of accidental mutations.

Of course, such a design might introduce some additional complexity,
and it’s unclear whether the benefit justifies the change.
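
A minimal sketch of the view idea in plain C, not a proposal for Lua's actual internals. The names `str_view`, `view_of` and `view_sub` are made up for illustration, and the index handling is simplified (1-based, inclusive, no negative indices):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view type: a pointer into a shared buffer plus a length. */
typedef struct {
    const char *data;  /* not necessarily NUL-terminated at data[len] */
    size_t len;
} str_view;

/* Wrap a whole C string in a view without copying it. */
static str_view view_of(const char *s) {
    assert(s != NULL);
    str_view v = { s, strlen(s) };
    return v;
}

/* Like string.sub with 1-based inclusive indices, but it only adjusts
   the pointer and the length; no bytes are copied. */
static str_view view_sub(str_view v, size_t i, size_t j) {
    str_view out = { v.data, 0 };
    if (i < 1) i = 1;
    if (j > v.len) j = v.len;
    if (i > j) return out;  /* empty view */
    out.data = v.data + (i - 1);
    out.len = j - i + 1;
    return out;
}
```

Every view returned by `view_sub` points into the same bytes as the original, so the cost of a substring is independent of its length.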

Sainan

Oct 16, 2025, 3:52:12 AM
to lu...@googlegroups.com
Your string.sub proposal makes some sense to me for external strings, since from the C API side, you have to expect that as long as a lua_State is alive, all external strings need to also stay alive, so I don't think it would be an 'unsafe optimisation'.

-- Sainan

Xmilia Hermit

Oct 16, 2025, 5:39:48 AM
to lu...@googlegroups.com
But from the C API side, Lua strings need to be null-terminated, which a
view into a string normally is not.
And only creating a null-terminated copy when the C API needs it is also
not possible, since `lua_tolstring` is only allowed to raise a memory
error when converting a number to a string.

So, to implement this, the C API would require a change. Either
lua_tolstring and lua_tostring need to be allowed to throw memory errors
even if the input is a string, or the requirement for Lua strings to be
null-terminated needs to be removed.
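
To illustrate the termination problem, a small sketch in plain C (no Lua headers). `slice_is_c_string` is a made-up helper, and it assumes the slice lies inside a larger NUL-terminated buffer, so that reading `s[len]` is valid:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the len bytes starting at s are directly followed by a
   NUL byte, i.e. if the slice could be handed to C code that expects
   a C string. The caller must guarantee that s[len] is readable. */
static int slice_is_c_string(const char *s, size_t len) {
    assert(s != NULL);
    return s[len] == '\0';
}
```

For the buffer "alpha,beta", the slice "alpha" is followed by a comma and fails the check, while the tail "beta" passes it. Calling strlen on the pointer of the first slice would report 10, not 5.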

Regards,
Xmilia

Berwyn Hoyt

Oct 16, 2025, 5:59:14 AM
to lu...@googlegroups.com
Good point, Xmilia. But would it be possible instead to define that external strings, since they are new, need not be NUL-terminated? And since the NUL-terminator requirement is only visible to the C API, not to Lua, and since the C API can check whether a string is an external string, could this make everything work nicely in a backward-compatible way? If so, I like findstrx's idea of making external strings act as views into a string.


--
You received this message because you are subscribed to the Google Groups "lua-l" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lua-l+un...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/lua-l/2e2ef02e-6b9b-4ace-b80f-a921b6d4f389%40gmail.com.

Sainan

Oct 16, 2025, 6:03:57 AM
to lu...@googlegroups.com
Ah yeah, I forgot about null-termination. There's probably no better way to handle that than creating a new string for each view, unless you wanna optimise for the minuscule chance that the view just happens to end such that the next byte is 0.

-- Sainan

重归混沌

Oct 16, 2025, 6:11:16 AM
to lu...@googlegroups.com
Actually, what I want to discuss goes beyond just external strings.

I’m thinking about introducing a more general abstraction for all Lua
strings, where multiple strings can reference the same underlying
string_chunk.

During garbage collection, when a string is marked, it would also mark
the underlying string_chunk. This guarantees that the string_chunk
will never be prematurely collected during its lifetime.

Unfortunately, it would indeed violate the C API guarantee that
strings are null-terminated.
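
The marking rule described above could look roughly like this. It is a toy model in plain C, not Lua's real collector; `string_chunk`, `view_string`, `mark_string` and `chunk_survives` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a chunk owns the bytes, a view_string refers to a range of it. */
typedef struct {
    int marked;
} string_chunk;

typedef struct {
    int marked;
    string_chunk *chunk;  /* shared backing storage */
    size_t off, len;      /* range inside the chunk */
} view_string;

/* Marking a string also marks the chunk it points into. */
static void mark_string(view_string *s) {
    assert(s != NULL && s->chunk != NULL);
    if (s->marked) return;
    s->marked = 1;
    s->chunk->marked = 1;
}

/* A sweep would free only the chunks that no marked string reached. */
static int chunk_survives(const string_chunk *c) {
    return c->marked;
}
```

Note that a single reachable view, however small, is enough to keep its whole chunk alive in this model.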

Xmilia Hermit <xmilia...@gmail.com> wrote on Thu, Oct 16, 2025 at 17:39:

重归混沌

Oct 16, 2025, 6:36:18 AM
to lu...@googlegroups.com
By the way, external strings don't seem to fully guarantee this
convention either.

重归混沌 <find...@gmail.com> wrote on Thu, Oct 16, 2025 at 18:11:

Xmilia Hermit

Oct 16, 2025, 6:44:03 AM
to lu...@googlegroups.com

> By the way, external strings don't seem to fully guarantee this
> convention either.
Do you mean the null-termination? The documentation states for
`lua_pushexternalstring` that the external string needs to be
null-terminated and there is an api_check for it
https://github.com/lua/lua/blob/9ea06e61f20ae34974226074fc6123dbb54a07c2/lapi.c#L560.

Regards,
Xmilia

Francisco Olarte

Oct 16, 2025, 6:50:41 AM
to lu...@googlegroups.com
On Thu, 16 Oct 2025 at 12:11, 重归混沌 <find...@gmail.com> wrote:
> I’m thinking about introducing a more general abstraction for all Lua
> strings, where multiple strings can reference the same underlying
> string_chunk.
>
> During garbage collection, when a string is marked, it would also mark
> the underlying string_chunk. This guarantees that the string_chunk
> will never be prematurely collected during its lifetime.

Others already do this; I think Java does it with backing arrays, and it has a problem.
Leaving string interning aside, if you load a big file, split it, and somehow keep just one small substring,
you still reference the big chunk. Rinse and repeat a few times and you may run out of memory.

It can be solved, but it is not that simple.

Francisco Olarte.

Francisco Olarte

Oct 16, 2025, 6:52:20 AM
to lu...@googlegroups.com
If you use an external string as the backing chunk for string.sub, you can only guarantee null termination for tails, that is, for substrings that run to the end of the chunk.

Francisco Olarte.

bil til

Oct 16, 2025, 9:58:26 AM
to lu...@googlegroups.com
... if you want something like this, e.g. "multi-string lists", it
should be quite straightforward to create a small "multi-string lib"
that supports this construct.

But an "intrinsic implementation" in Lua will not be easy in ANY way.
Keep in mind that every string has a roughly 40-byte "header", and so
do these external strings. If you "dream" of a multi-string list where
every substring is a "Lua string", you would have to add those 40 bytes
for every substring, and I'm afraid that would not be very efficient.

云风 Cloud Wu

Oct 16, 2025, 11:02:11 PM
to lu...@googlegroups.com
重归混沌 <find...@gmail.com> wrote on Thu, Oct 16, 2025 at 14:24:
>
> I’m really glad to see that external strings are coming in the upcoming Lua 5.5.
>
> I wonder if, in some future version, we might go even further.
>
> In many real-world scenarios, we often need to split a large string
> into many small substrings.
>

For this case, you can first split the large string into an array of C string view structs:

    struct string_slice {
        const char *str;
        size_t sz;
    };

    struct string_slice string_view_array[];

Then store the string_slice * as a light userdata instead of a Lua string.
When you use these strings, you can simply convert a string_slice into an
external string with

    struct string_slice *slice = (struct string_slice *)lua_touserdata(L, index);
    lua_pushexternalstring(L, slice->str, slice->sz, NULL, NULL);
    // NOTICE: the slice has to be a null-terminated string.
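
The splitting step itself might be sketched as follows, in plain C without any Lua calls. `split_slices` is a hypothetical helper, and the choice of a single-byte separator is an assumption made for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct string_slice {
    const char *str;
    size_t sz;
};

/* Split the len bytes at s on sep, storing at most cap slices in out.
   Returns the number of fields found, which may exceed cap; fields
   beyond cap are counted but not stored. No bytes are copied. */
static size_t split_slices(const char *s, size_t len, char sep,
                           struct string_slice *out, size_t cap) {
    size_t n = 0, start = 0;
    assert(s != NULL && (out != NULL || cap == 0));
    for (size_t i = 0; i <= len; i++) {
        if (i == len || s[i] == sep) {
            if (n < cap) {
                out[n].str = s + start;
                out[n].sz = i - start;
            }
            n++;
            start = i + 1;
        }
    }
    return n;
}
```

Only the last slice ends at the buffer's terminating NUL; the others are followed by the separator, which matters for the null-termination requirement noted above.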

--
http://blog.codingnow.com

重归混沌

Oct 16, 2025, 11:43:42 PM
to lu...@googlegroups.com
> For this case, you can first split the large string into an array of C string view structs:
>
>     struct string_slice {
>         const char *str;
>         size_t sz;
>     };
>
>     struct string_slice string_view_array[];
>
> Then store the string_slice * as a light userdata instead of a Lua string.
> When you use these strings, you can simply convert a string_slice into an
> external string with
>
>     struct string_slice *slice = (struct string_slice *)lua_touserdata(L, index);
>     lua_pushexternalstring(L, slice->str, slice->sz, NULL, NULL);
>     // NOTICE: the slice has to be a null-terminated string.

That approach would gradually make the lifetime management more complex.
If we could have a zero-cost string:sub, then many functions like
string.unpack and string.find wouldn’t need extra init or pos
parameters at all.

Also, I want to reiterate my original idea — I’m talking about
introducing a language-level string view abstraction, similar to Go’s
slice, not just another specialization of external string.

Of course, this is just a bit of brainstorming, not necessarily a good
or practical idea. But based on my experience with Go’s slices over
the past few years, such an abstraction makes me less worried about GC
pressure during data processing, and it also reduces the mental
overhead when designing APIs.

云风 Cloud Wu

Oct 16, 2025, 11:57:50 PM
to lu...@googlegroups.com
重归混沌 <find...@gmail.com> wrote on Fri, Oct 17, 2025 at 11:43:
> That approach would gradually make the lifetime management more complex.

If you want to simplify the lifetime management, you can define a
customized version of the external string and put the slice meta info
after the external string data.

> If we could have a zero-cost string:sub, then many functions like
> string.unpack and string.find wouldn’t need extra init or pos
> parameters at all.

When you say "zero-cost", I guess you mean O(1)? Creating a string
object can't be zero-cost,
but lua_pushexternalstring() is already O(1).

You can implement it with lua_pushexternalstring.

For example, if you want to split a large string into many small ones,
all the substrings can share the same memory (this needs a reference
count in the meta info of the C external string object).
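
A rough sketch of such a reference-counted buffer in plain C. All names here (`shared_buf`, `shared_buf_new`, `shared_buf_slice`, `shared_buf_release`, `live_buffers`) are hypothetical. The release function is given the same shape as lua_Alloc on the assumption that it would be passed, together with the buffer as `ud`, as the deallocation callback of lua_pushexternalstring:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared buffer: the meta info followed by the bytes. */
typedef struct {
    size_t refs;   /* number of live slices */
    size_t size;   /* number of bytes, excluding the final NUL */
    char bytes[];  /* flexible array member */
} shared_buf;

static int live_buffers = 0;  /* bookkeeping for the example only */

static shared_buf *shared_buf_new(const char *src, size_t n) {
    shared_buf *b = malloc(sizeof *b + n + 1);
    if (b == NULL) return NULL;
    b->refs = 0;
    b->size = n;
    memcpy(b->bytes, src, n);
    b->bytes[n] = '\0';
    live_buffers++;
    return b;
}

/* Take one reference and return a pointer to the slice at offset off. */
static const char *shared_buf_slice(shared_buf *b, size_t off) {
    assert(b != NULL && off <= b->size);
    b->refs++;
    return b->bytes + off;
}

/* Same shape as lua_Alloc. A call with nsize == 0 drops one reference;
   the buffer is freed when the last slice is released. */
static void *shared_buf_release(void *ud, void *ptr, size_t osize, size_t nsize) {
    shared_buf *b = ud;
    (void)ptr;
    (void)osize;
    if (nsize == 0 && --b->refs == 0) {
        free(b);
        live_buffers--;
    }
    return NULL;
}
```

In this sketch a buffer that never hands out a slice is never freed, so real code would need to handle that case. It also does nothing about null termination: only slices that end where a NUL byte follows would satisfy the requirement mentioned earlier in the thread.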

--
http://blog.codingnow.com

Sewbacca

Oct 17, 2025, 4:58:24 PM
to lu...@googlegroups.com
Any interaction where a C function accesses a pointer to a view would cause an allocation. I'm not sure of the statistics, but I'd imagine it happens a lot (unless the Lua standard library optimizes its most-used functions), killing the benefit of having string views.

If you relax the terminator condition, this gets better, but it will break lots of modules that use functions defined in string.h.

Say we rewrite all libraries that access strings to use lua_tolstring instead. We now get a memory problem if a huge string is created and even a single string view into it exists. When the huge string itself is discarded, its string views must either keep the whole chunk alive, eating memory, or be reallocated into smaller blocks.

I suppose it would be possible to implement, but it would hurt the ecosystem. Maybe lua_tolstring could relax the \0 byte condition while lua_tostring copies a string view into a \0-terminated chunk, but that is bound to break lots of modules that assume a \0 even when they get the size via lua_tolstring.

~ Sewbacca

重归混沌

Oct 21, 2025, 2:43:26 PM
to lu...@googlegroups.com
Maybe we need a page-fault–like copy-on-write (COW) mechanism, which
can lazily convert to the traditional string representation whenever
we need to satisfy certain conventions (mainly C APIs). Of course,
this may trigger an OOM exception.
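
One way to picture this lazy conversion, as a sketch in plain C rather than a design for Lua itself. `lazy_view` and `lazy_view_cstr` are made-up names, and the code assumes every view lies inside a NUL-terminated chunk, so that reading `data[len]` is valid:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical view that can lazily grow a NUL-terminated copy. */
typedef struct {
    const char *data;  /* points into a shared, NUL-terminated chunk */
    size_t len;
    char *cstr;        /* private copy, created on first demand; else NULL */
} lazy_view;

/* Returns a NUL-terminated pointer for C callers. Copies only when the
   view does not already end at a NUL byte. Returns NULL if the copy
   cannot be allocated (the out-of-memory case). */
static const char *lazy_view_cstr(lazy_view *v) {
    assert(v != NULL && v->data != NULL);
    if (v->cstr != NULL) return v->cstr;           /* already converted */
    if (v->data[v->len] == '\0') return v->data;   /* tail: no copy needed */
    char *copy = malloc(v->len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, v->data, v->len);
    copy[v->len] = '\0';
    v->cstr = copy;
    return copy;
}
```

The allocation happens at most once per view, and views that are tails of the chunk never pay for it. The failure path is exactly the case where a function that today cannot fail on a string would have to report a memory error.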

'Sewbacca' via lua-l <lu...@googlegroups.com> wrote on Sat, Oct 18, 2025 at 04:58:

bil til

Oct 22, 2025, 2:44:22 AM
to lu...@googlegroups.com
"Simple const strings" might be useful especially as function parameters.

In a new lib, I very often use strings for configuration parameters. I
think a fair amount of processing time is wasted if such small
"function parameter strings" are always turned into "full Lua strings"
that must later be garbage collected again.

If a string literal is passed to such a function (not a string
VARIABLE, but a "..." construct, which of course IS constant when
passed to a function), then it would be nice for the garbage collector
and for efficient heap handling if Lua could somehow be instructed to
place such a "simple const string" into some "quasi-static" predefined
heap. Such a string would then NOT survive any yielding.

On small systems it would also be nice if just the ROM address of the
constant string could be passed. (I assume this will not be possible
on larger systems, which typically run in protected mode, and protected
mode might raise an alarm if program code is addressed as data, at
least if that code is not marked "const". On the other hand, it would
of course be no problem in C to mark such function string parameters
const.)

Martin Eden

Oct 22, 2025, 2:40:27 PM
to lu...@googlegroups.com
On 2025-10-22 08:44, bil til wrote:
> In a new lib, I use strings very often for configuration parameters.

"The string is a stark data structure and everywhere it is passed there
is much duplication of process. It is a perfect vehicle for hiding
information."

I hope you know who wrote this.

-- Martin

bil til

Oct 22, 2025, 2:46:59 PM
to lu...@googlegroups.com
:)

This is the first time I've heard that, but I'm afraid it only sounds
funny to me as a programmer.

"Occasional programmers" / "non-experts" will LOVE strings, as they are
so close to good old textbooks :).

The main alternatives to configuration by string parameters are often
bitmasks or tables.

But from my point of view, both concepts are MUCH more challenging than
strings for "simple-minded programmers".

On Wed, Oct 22, 2025 at 20:40, 'Martin Eden' via lua-l
<lu...@googlegroups.com> wrote:

Sewbacca

Oct 23, 2025, 7:42:34 AM
to lu...@googlegroups.com

>> "The string is a stark data structure and everywhere it is passed there
>> is much duplication of process. It is a perfect vehicle for hiding
>> information."

I'm not sure how applicable this is to Lua.

> ... the main alternative for configuration by string parameters often
> are bitmask or tables.... .

They are used as enums (e.g. collectgarbage('stop'|'restart')), flags
(io.open(path, 'w'|'wb'|'r'|'rb')) and plain symbols (_G['print'] == print).
Numeric bitmasks are annoyingly hard to keep backwards compatible. Tables
require allocation. And if you have a good memoized constructor, strings
can also be used to describe certain values: setBackgroundColor
'#5EE985'. This is thanks to the fact that all strings are constant.

Also, I'm not sure how strings duplicate processes; for that we have
functions, e.g. function setBackgroundColor(col) col = colorfromhex(col) end
