Some thoughts about string

重归混沌

Oct 16, 2025, 2:24:30 AM
to lua-l
I’m really glad to see that external strings are coming in the upcoming Lua 5.5.

I wonder if, in some future version, we might go even further.

In many real-world scenarios, we often need to split a large string
into many small substrings.

This can easily lead to a lot of small, duplicated string objects.

Perhaps it would make sense to think of a Lua string not as a
standalone memory block, but as a view into an internal string buffer.

In this model, operations like string.sub would simply create new
views referencing the same underlying data, rather than copying it.

Conceptually, this is similar to Go’s slices, but since Lua strings
are immutable, it would avoid the pitfalls of accidental mutations.

Of course, such a design might introduce some additional complexity,
and it’s unclear whether the benefit justifies the change.
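
A minimal sketch of the view idea in plain C, with no Lua dependency. The names `str_view`, `view_from`, `view_sub`, and `view_eq` are hypothetical and only illustrate the proposal; this is not how Lua represents strings today.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view type: a pointer into a shared, immutable buffer. */
typedef struct {
  const char *data;  /* first byte of the view, not null-terminated */
  size_t len;        /* number of bytes in the view */
} str_view;

/* Wrap a whole C string as a view. */
static str_view view_from(const char *s) {
  assert(s != NULL);
  str_view v = { s, strlen(s) };
  return v;
}

/* O(1) substring: 0-based start, exclusive end, both clamped to the view.
   No bytes are copied; the result points into the same buffer. */
static str_view view_sub(str_view v, size_t start, size_t end) {
  if (end > v.len) end = v.len;
  if (start > end) start = end;
  str_view r = { v.data + start, end - start };
  return r;
}

/* Views compare by content, the way Lua strings do. */
static int view_eq(str_view a, str_view b) {
  return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

Because the buffer is never written through a view, two views can overlap freely, which is the property the comparison with Go's slices relies on.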

Sainan

Oct 16, 2025, 3:52:12 AM
to lu...@googlegroups.com
Your string.sub proposal makes some sense to me for external strings, since from the C API side, you have to expect that as long as a lua_State is alive, all external strings need to also stay alive, so I don't think it would be an 'unsafe optimisation'.

-- Sainan

Xmilia Hermit

Oct 16, 2025, 5:39:48 AM
to lu...@googlegroups.com
But from the C API side, Lua strings need to be null-terminated, which a
view into a string normally is not.
And only creating a null-terminated copy when the C API needs it is also
not possible, since `lua_tolstring` is only allowed to raise a memory
error when converting a number to a string.

So, to implement this, the C API would require a change. Either
lua_tolstring and lua_tostring need to be allowed to throw memory errors
even if the input is a string, or the requirement for Lua strings to be
null-terminated needs to be removed.
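
A minimal sketch of why the conversion needs an allocation, in plain C with no Lua dependency. `str_view` and `view_to_cstr` are hypothetical names; the point is only that turning a view into a C string has a failure path that a plain string lookup does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical view: `len` bytes starting at `data`, with no terminator. */
typedef struct {
  const char *data;
  size_t len;
} str_view;

/* Produce a null-terminated copy of a view.
   Returns NULL if the allocation fails; this is the extra failure path
   that a view-based string would add to a string-to-C-string conversion.
   The caller owns the result and must free() it. */
static char *view_to_cstr(str_view v) {
  assert(v.data != NULL);
  char *copy = malloc(v.len + 1);
  if (copy == NULL) return NULL;
  memcpy(copy, v.data, v.len);
  copy[v.len] = '\0';
  return copy;
}
```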

Regards,
Xmilia

Berwyn Hoyt

Oct 16, 2025, 5:59:14 AM
to lu...@googlegroups.com
Good point, Xmilia. But would it be possible instead to define that external strings, since they are new, need not be NUL-terminated? And since the NUL-terminator requirement is only visible to the C API, not to Lua, and since the C API can check whether a string is an external string, could this make everything work nicely in a backward-compatible way? If so, I like findstrx's idea of making external strings act as views into a string.


--
You received this message because you are subscribed to the Google Groups "lua-l" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lua-l+un...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/lua-l/2e2ef02e-6b9b-4ace-b80f-a921b6d4f389%40gmail.com.

Sainan

Oct 16, 2025, 6:03:57 AM
to lu...@googlegroups.com
Ah yeah, I forgot about null-termination. There's probably no better way to handle that than creating a new string for each view, unless you wanna optimise for the minuscule chance that the view just happens to end such that 0 is the next byte.

-- Sainan

重归混沌

Oct 16, 2025, 6:11:16 AM
to lu...@googlegroups.com
Actually, what I want to discuss goes beyond just external strings.

I’m thinking about introducing a more general abstraction for all Lua
strings, where multiple strings can reference the same underlying
string_chunk.

During garbage collection, when a string is marked, it would also mark
the underlying string_chunk. This guarantees that the string_chunk
will never be prematurely collected during its lifetime.

Unfortunately, it would indeed violate the C API guarantee that
strings are null-terminated.
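
A minimal sketch of the marking rule described above, in plain C. The types `string_chunk` and `view_string` and both functions are hypothetical; Lua's real collector is incremental and uses colors rather than a single flag, so this only shows the dependency between a string and its chunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backing buffer shared by several strings. */
typedef struct {
  const char *bytes;
  size_t size;
  bool marked;        /* set during the mark phase */
} string_chunk;

/* Hypothetical string object: a view plus a reference to its chunk. */
typedef struct {
  string_chunk *chunk;
  size_t offset;
  size_t len;
  bool marked;
} view_string;

/* Mark phase: marking a string also marks the chunk it points into. */
static void mark_string(view_string *s) {
  assert(s != NULL && s->chunk != NULL);
  s->marked = true;
  s->chunk->marked = true;
}

/* Sweep phase: a chunk is collectable only if no live string marked it. */
static bool chunk_is_garbage(const string_chunk *c) {
  return !c->marked;
}
```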

Xmilia Hermit <xmilia...@gmail.com> wrote on Thu, Oct 16, 2025 at 17:39:

重归混沌

Oct 16, 2025, 6:36:18 AM
to lu...@googlegroups.com
By the way, it seems that external strings cannot fully guarantee this
convention either.

重归混沌 <find...@gmail.com> wrote on Thu, Oct 16, 2025 at 18:11:

Xmilia Hermit

Oct 16, 2025, 6:44:03 AM
to lu...@googlegroups.com

> By the way, it seems that external strings cannot fully guarantee this
> convention either.
Do you mean the null-termination? The documentation states for
`lua_pushexternalstring` that the external string needs to be
null-terminated and there is an api_check for it
https://github.com/lua/lua/blob/9ea06e61f20ae34974226074fc6123dbb54a07c2/lapi.c#L560.

Regards,
Xmilia

Francisco Olarte

Oct 16, 2025, 6:50:41 AM
to lu...@googlegroups.com
On Thu, 16 Oct 2025 at 12:11, 重归混沌 <find...@gmail.com> wrote:
> I’m thinking about introducing a more general abstraction for all Lua
> strings, where multiple strings can reference the same underlying
> string_chunk.
>
> During garbage collection, when a string is marked, it would also mark
> the underlying string_chunk. This guarantees that the string_chunk
> will never be prematurely collected during its lifetime.

Others have done this. I think Java does it with backing arrays, and it has a problem.
Besides string interning, if you load a big file, split it, and somehow keep just one small substring,
you still reference the whole big chunk. Rinse and repeat a few times and you may run out of memory.

It can be solved, but it is not that simple.
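
A minimal sketch of the retention problem and of one possible mitigation, in plain C. Both functions and the `ratio` parameter are hypothetical; copying small views out of large chunks is only one heuristic, and it helps only once every view of the chunk has been copied out or collected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes kept alive by a view but not used by it. */
static size_t wasted_bytes(size_t chunk_size, size_t view_len) {
  assert(view_len <= chunk_size);
  return chunk_size - view_len;
}

/* Hypothetical policy: copy the view into its own small block when it
   uses less than 1/ratio of the chunk, so the large chunk can be freed
   once no other view refers to it. */
static bool should_copy_out(size_t chunk_size, size_t view_len, size_t ratio) {
  assert(ratio > 0);
  return view_len < chunk_size / ratio;
}
```

For example, a 16-byte view into a 1 MiB chunk keeps 1048560 unused bytes alive, so a policy with `ratio` 4 would copy it out.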

Francisco Olarte.

Francisco Olarte

Oct 16, 2025, 6:52:20 AM
to lu...@googlegroups.com
If you use external strings as the backing chunk for a "string.sub", you can only guarantee null termination for tails.
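
A minimal sketch of the tail condition, in plain C. `chunk_view` and `view_is_terminated` are hypothetical names, and the sketch assumes the backing buffer itself is null-terminated, as external strings are required to be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view into a null-terminated backing buffer of `chunk_len`
   bytes, so chunk[chunk_len] == '\0'. */
typedef struct {
  const char *chunk;
  size_t chunk_len;
  size_t offset;
  size_t len;
} chunk_view;

/* True when the byte right after the view is '\0', so chunk + offset can
   be handed to C code without a copy. This always holds for tails; for
   any other view it holds only if the buffer happens to contain a zero
   byte at that position. */
static bool view_is_terminated(chunk_view v) {
  assert(v.offset + v.len <= v.chunk_len);
  return v.chunk[v.offset + v.len] == '\0';
}
```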

Francisco Olarte.

bil til

Oct 16, 2025, 9:58:26 AM
to lu...@googlegroups.com
... if you want something like this, e.g. "multi-string lists", it
should be quite straightforward to create a small "multi-string lib"
that supports this construct.

But an "intrinsic implementation" in Lua will not be easy in ANY way. Keep
in mind that every string has a header of roughly 40 bytes, and so do these
external strings. If you "dream" of a multi-string list where
every substring is a "Lua string", then you would have to add those
40 bytes for every substring, which would not be very efficient, I am
afraid.

云风 Cloud Wu

Oct 16, 2025, 11:02:11 PM
to lu...@googlegroups.com
重归混沌 <find...@gmail.com> wrote on Thu, Oct 16, 2025 at 14:24:
>
> I’m really glad to see that external strings are coming in the upcoming Lua 5.5.
>
> I wonder if, in some future version, we might go even further.
>
> In many real-world scenarios, we often need to split a large string
> into many small substrings.
>

For this case, you can split the large string into a C string view struct first:

struct string_slice {
  const char *str;  /* start of the slice inside the large string */
  size_t sz;        /* length of the slice in bytes */
};

struct string_slice string_view_array [];

Then store each string_slice * as a light userdata instead of a Lua string.
When you need one of these strings, you can simply convert the string_slice
into an external string with

struct string_slice *slice = (struct string_slice *)lua_touserdata(L, index);
lua_pushexternalstring(L, slice->str, slice->sz, NULL, NULL);
/* NOTICE: the slice has to be null-terminated. */

--
http://blog.codingnow.com

重归混沌

Oct 16, 2025, 11:43:42 PM
to lu...@googlegroups.com
> For this case, you can split the large string into a C string view struct first:
>
> struct string_slice {
> const char *str;
> size_t sz;
> };
>
> struct string_slice string_view_array [];
>
> And store the string_slice * as a lightuserdata instead of a lua string .
> When you use these strings, you can simply convert string_slice into
> external string by
>
> struct string_slice *slice = (struct string_slice *)lua_touserdata(L, index);
> lua_pushexternalstring(L, slice->str, slice->sz, NULL, NULL); //
> NOTICE: it would be a null-terminated string.

That approach would gradually make the lifetime management more complex.
If we could have a zero-cost string:sub, then many functions like
string.unpack and string.find wouldn’t need extra init or pos
parameters at all.

Also, I want to reiterate my original idea — I’m talking about
introducing a language-level string view abstraction, similar to Go’s
slice, not just another specialization of external string.

Of course, this is just a bit of brainstorming, not necessarily a good
or practical idea. But based on my experience with Go’s slices over
the past few years, such an abstraction makes me less worried about GC
pressure during data processing, and it also reduces the mental
overhead when designing APIs.

云风 Cloud Wu

Oct 16, 2025, 11:57:50 PM
to lu...@googlegroups.com
重归混沌 <find...@gmail.com> wrote on Fri, Oct 17, 2025 at 11:43:
>
> > For this case, you can split the large string into a C string view struct first:
> >
> > struct string_slice {
> > const char *str;
> > size_t sz;
> > };
> >
> > struct string_slice string_view_array [];
> >
> > And store the string_slice * as a lightuserdata instead of a lua string .
> > When you use these strings, you can simply convert string_slice into
> > external string by
> >
> > struct string_slice *slice = (struct string_slice *)lua_touserdata(L, index);
> > lua_pushexternalstring(L, slice->str, slice->sz, NULL, NULL); //
> > NOTICE: it would be a null-terminated string.
>
> That approach would gradually make the lifetime management more complex.

If you want to simplify the lifetime management, you can define a customized
version of the external string and put the slice meta info after the
external string data.

> If we could have a zero-cost string:sub, then many functions like
> string.unpack and string.find wouldn’t need extra init or pos
> parameters at all.

When you say "zero-cost", I guess you mean O(1)? Creating a string
object can't be zero cost,
but lua_pushexternalstring() is already O(1) now.

You can implement it with lua_pushexternalstring.

For example, if you want to split a large string into many small ones,
all the substrings can share the same memory (this needs a ref count in
the meta info of the C external string object).
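
A minimal sketch of the ref-count idea, in plain C with no Lua dependency. `shared_chunk`, `chunk_new`, `chunk_slice`, `slice_release`, and the `chunks_freed` counter are hypothetical. `slice_release` has the same shape as a `lua_Alloc` function, on the assumption that it would be passed as the deallocation callback of `lua_pushexternalstring` with the chunk as `ud` and called once per collected string. The sketch does not address null termination: only slices that end at the end of the chunk would satisfy that requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared buffer with a reference count in its header. */
typedef struct {
  size_t refs;   /* number of live slices pointing into `data` */
  size_t size;
  char data[];   /* flexible array member; data[size] is '\0' */
} shared_chunk;

static int chunks_freed = 0;  /* instrumentation for this sketch only */

/* Copy `size` bytes of `src` into a new chunk with zero references. */
static shared_chunk *chunk_new(const char *src, size_t size) {
  shared_chunk *c = malloc(sizeof *c + size + 1);
  if (c == NULL) return NULL;
  c->refs = 0;
  c->size = size;
  memcpy(c->data, src, size);
  c->data[size] = '\0';
  return c;
}

/* Hand out the start pointer of a slice and count the reference. */
static const char *chunk_slice(shared_chunk *c, size_t offset) {
  assert(offset <= c->size);
  c->refs++;
  return c->data + offset;
}

/* Same shape as lua_Alloc. Assumed to be called with nsize == 0 when one
   slice is collected; the chunk is freed with the last slice. */
static void *slice_release(void *ud, void *ptr, size_t osize, size_t nsize) {
  shared_chunk *c = ud;
  (void)ptr; (void)osize; (void)nsize;
  assert(nsize == 0 && c->refs > 0);
  if (--c->refs == 0) {
    free(c);
    chunks_freed++;
  }
  return NULL;
}
```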

--
http://blog.codingnow.com

Sewbacca

Oct 17, 2025, 4:58:24 PM
to lu...@googlegroups.com
Any interaction where a C function accesses a pointer to a view would cause an allocation. I'm not sure of the statistics, but I'd imagine it happens a lot (unless the Lua standard library is optimized for the most-used functions), killing the benefit of having string views.

If you relax the terminator requirement, this would get better, but it would break lots of modules that use functions declared in string.h.

Let's say, okay, we rewrite all libraries that access strings to use lua_tolstring instead. We now get a memory problem if a huge string is created and even a single string view into it exists. When the huge string is discarded, each remaining view must either keep the whole chunk alive, eating memory, or be reallocated into a smaller block.

I suppose it would be possible to implement, but it would hurt the ecosystem. Maybe lua_tolstring could relax the \0 byte requirement while lua_tostring copies a string view into a \0-terminated chunk, but that is bound to break lots of modules that assume a trailing \0 even when they get the size via lua_tolstring.

~ Sewbacca