Crashes In Various Places -- Possible Heap Corruption?


Chris Merck

Jul 10, 2024, 3:41:35 PM
to lua-l
At Bond, we use Lua 5.3.5 on an ESP32 for our RF device drivers. In recent firmware versions, we've experienced crashes at various places in the Lua stack.

Taking one example, the `propagatemark` crash occurs because of the assert in the default case of the switch statement. Inspecting memory near the `o` object with JTAG does not show obvious heap corruption, but the value of `o->tt` is zero.

Taking another example, the `propagateall` crash apparently occurs because `g` is null, so evaluating `g->gray` dereferences a null pointer.

We recently did a major FW update on our product where we moved from ESP-IDF v3.3 to v5.1, and we switched from a homegrown coredump reporter to Memfault, but no changes were made to the Lua code or our C bindings. Our homegrown system did not show crashes in the old FW, but Memfault is showing these Lua crashes on the new code.

Does anyone here have some experience with crashes/assert fails of this nature, or have a suggestion how we may progress towards a root cause determination?

A few example crash backtraces follow.

Much Obliged,
  Chris Merck
  CTO - Bond Home (Olibra)

0 propagatemark in …/BondScript/lua/lgc.c at line 564
1 propagateall in …/BondScript/lua/lgc.c at line 604
2 atomic in …/BondScript/lua/lgc.c at line 1000
3 singlestep in …/BondScript/lua/lgc.c at line 1065
4 luaC_step in …/BondScript/lua/lgc.c at line 1137
5 lua_newuserdata in …/BondScript/lua/lapi.c at line 1190
6 newprefile in …/BondScript/lua/liolib.c at line 189
7 createstdfile in …/BondScript/lua/liolib.c at line 756
8 luaopen_io in …/BondScript/lua/liolib.c at line 771
9 luaD_precall in …/BondScript/lua/ldo.c at line 434
10 luaD_call in …/BondScript/lua/ldo.c at line 498
11 luaD_callnoyield in …/BondScript/lua/ldo.c at line 509
12 lua_callk in …/BondScript/lua/lapi.c at line 925
13 luaL_requiref in …/BondScript/lua/lauxlib.c at line 979
14 luaL_openlibs in …/BondScript/lua/linit.c at line 64

0 propagateall in …/BondScript/lua/lgc.c at line 604
1 atomic in …/BondScript/lua/lgc.c at line 1000
2 singlestep in …/BondScript/lua/lgc.c at line 1065
3 luaC_step in …/BondScript/lua/lgc.c at line 1137
4 lua_newuserdata in …/BondScript/lua/lapi.c at line 1190
5 newprefile in …/BondScript/lua/liolib.c at line 189
6 createstdfile in …/BondScript/lua/liolib.c at line 756
7 luaopen_io in …/BondScript/lua/liolib.c at line 771
8 luaD_precall in …/BondScript/lua/ldo.c at line 434
9 luaD_call in …/BondScript/lua/ldo.c at line 498
10 luaD_callnoyield in …/BondScript/lua/ldo.c at line 509
11 lua_callk in …/BondScript/lua/lapi.c at line 925
12 luaL_requiref in …/BondScript/lua/lauxlib.c at line 979
13 luaL_openlibs in …/BondScript/lua/linit.c at line 64

0 singlestep in …/BondScript/lua/lgc.c at line 1047
1 luaC_step in …/BondScript/lua/lgc.c at line 1137
2 lua_createtable in …/BondScript/lua/lapi.c at line 692
3 luaL_getsubtable in …/BondScript/lua/lauxlib.c at line 957
4 luaL_requiref in …/BondScript/lua/lauxlib.c at line 973
5 luaL_openlibs in …/BondScript/lua/linit.c at line 64 

0 luaH_getshortstr in …/BondScript/lua/ltable.c at line 544
1 luaH_getstr in …/BondScript/lua/ltable.c at line 577
2 auxsetstr in …/BondScript/lua/lapi.c at line 747
3 lua_setfield in …/BondScript/lua/lapi.c at line 779
4 luaL_getsubtable in …/BondScript/lua/lauxlib.c at line 959
5 luaL_requiref in …/BondScript/lua/lauxlib.c at line 973
6 luaL_openlibs in …/BondScript/lua/linit.c at line 64

0 luaC_step in …/BondScript/lua/lgc.c at line 1139
1 lua_createtable in …/BondScript/lua/lapi.c at line 692
2 luaL_getsubtable in …/BondScript/lua/lauxlib.c at line 957
3 luaL_requiref in …/BondScript/lua/lauxlib.c at line 973
4 luaL_openlibs in …/BondScript/lua/linit.c at line 64

Roberto Ierusalimschy

Jul 10, 2024, 5:42:05 PM
to lu...@googlegroups.com
> Does anyone here have some experience with crashes/assert fails of this
> nature, or have a suggestion how we may progress towards a root cause
> determination?

Have you already used the LUA_USE_APICHECK compilation flag?
(See luaconf.h for more details.)

-- Roberto

Bogdan Marinescu

Jul 12, 2024, 5:36:09 AM
to lu...@googlegroups.com
Hi,

Have you tried to increase the size of the OS thread that runs Lua? And I mean really increase it. Add something like 4KB or even 8KB and try again.

Thanks,
Bogdan

--
You received this message because you are subscribed to the Google Groups "lua-l" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lua-l+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lua-l/25db5456-fcbb-480d-8279-1b6be64712fbn%40googlegroups.com.

Bogdan Marinescu

Jul 13, 2024, 5:14:36 AM
to lu...@googlegroups.com
Hi,

On Wed, Jul 10, 2024 at 10:47 PM Bogdan Marinescu <bogdan.m...@gmail.com> wrote:
> Hi,
>
> Have you tried to increase the size of the OS thread that runs Lua? And I mean really increase it. Add something like 4KB or even 8KB and try again.

Replying to myself, because apparently writing is hard: what I meant was "have you tried to increase the size of THE STACK of the OS thread that runs Lua". This happened to me while running (e)Lua on various MCUs: errors that didn't seem to make any sense went away after increasing the stack size. I don't remember if the ESP32 can raise an exception if a stack overflow condition occurs. If it does, I'd try to enable that feature in the firmware.
Sorry for the noise.

Chris Merck

Jul 15, 2024, 10:44:59 AM
to lua-l

On Wednesday, July 10, 2024 at 5:42:05 PM UTC-4 Roberto Ierusalimschy wrote:
> Have you already used the LUA_USE_APICHECK compilation flag?

Thank you for the suggestion, Roberto. We had not tried this. No change to the pattern of backtraces with the API check enabled.

On Saturday, July 13, 2024 at 5:14:36 AM UTC-4 Bogdan Marinescu wrote:
> Have you tried to increase the size of the OS thread that runs Lua? And I mean really increase it. Add something like 4KB or even 8KB and try again.

We do not suspect stack overflow because (1) we are using Espressif's stack canaries, which generate a "stack overflow" fault when an overflow is detected, and (2) increasing the stack size significantly does not resolve the crashes. But indeed the Lua interpreter has the deepest call stack on the product, so it is the limiting factor for stack size.

Bogdan Marinescu

Jul 15, 2024, 1:07:17 PM
to lu...@googlegroups.com
Something else that I did and that proved useful was to intercept all the memory allocation calls (malloc/free/realloc/calloc) and log them over a serial link to a PC. A script on the PC can then look for obvious problems in this data (alloc/free outside the boundaries, multiple frees, frees of an address that was not allocated, unaligned addresses and so on). This is even better with Lua's "alloc function" because it gives you a single entry point to all the memory operations. That said, this isn't trivial and might even end up masking the original issue in some cases, but it is maybe something to consider.
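
A minimal sketch of the single-entry-point idea, assuming a standard C library is available. The function has the same signature as Lua's `lua_Alloc`; the names `AllocLog` and `logging_alloc` and the one-letter log format are made up for illustration, and on a real target the stream would presumably be backed by a UART rather than `stdout`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical logger state, passed to Lua as the allocator's user data. */
typedef struct {
    FILE *out;               /* destination for log lines */
    unsigned long allocs;    /* new blocks handed out */
    unsigned long reallocs;  /* existing blocks resized */
    unsigned long frees;     /* blocks released */
    unsigned long failures;  /* requests the C library could not satisfy */
} AllocLog;

/* Same signature as lua_Alloc: (ud, ptr, osize, nsize). */
static void *logging_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
{
    AllocLog *log = (AllocLog *)ud;
    /* Capture the old address first: realloc may invalidate ptr. */
    uintptr_t old = (uintptr_t)ptr;

    assert(log != NULL && log->out != NULL);

    if (nsize == 0) {                      /* Lua asks to free the block */
        if (old != 0) {
            log->frees++;
            fprintf(log->out, "F %" PRIxPTR " %lu\n", old, (unsigned long)osize);
        }
        free(ptr);
        return NULL;
    }

    void *newptr = realloc(ptr, nsize);
    if (newptr == NULL) {                  /* allocation failed */
        log->failures++;
        fprintf(log->out, "X %" PRIxPTR " %lu\n", old, (unsigned long)nsize);
    } else if (old == 0) {                 /* brand new block */
        log->allocs++;
        fprintf(log->out, "A %" PRIxPTR " %lu\n",
                (uintptr_t)newptr, (unsigned long)nsize);
    } else {                               /* resize: old addr/size, new addr/size */
        log->reallocs++;
        fprintf(log->out, "R %" PRIxPTR " %lu %" PRIxPTR " %lu\n",
                old, (unsigned long)osize,
                (uintptr_t)newptr, (unsigned long)nsize);
    }
    return newptr;
}
```

It would be installed with `lua_newstate(logging_alloc, &log)` in place of `luaL_newstate()`. Writing a log line on every allocation changes timing, which is one way this kind of instrumentation can mask the original issue.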

Chris Merck

Jul 16, 2024, 6:22:43 PM
to lua-l

On Monday, July 15, 2024 at 12:07:17 PM UTC-5 Bogdan Marinescu wrote:
> intercept all the memory allocation calls (malloc/free/realloc/calloc) and log them serially on the PC [...] look for obvious problems in this data (alloc/free outside the boundaries, multiple frees, frees at an address that was not allocated, unaligned addresses and so on)

One of Noah Pendleton of Memfault's suggestions was to see if there is a way to add heap integrity checking to Lua, but this seems rather challenging because we'd need to instrument every modification of a heap object as well.

I'll update here when we resolve. Hopefully we will have some generalizable insight to share.
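
One cheaper alternative to instrumenting every modification might be to keep a list of live blocks and sweep it on demand. The sketch below is only an illustration under that assumption: `HeapRegistry`, `tracked_alloc`, `registry_sweep` and the tail byte value are hypothetical names and choices, and the allocator follows the `lua_Alloc` signature on top of the C library's `realloc`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TAIL_MAGIC 0x5AU   /* arbitrary marker byte placed after each block */

/* Header stored in front of every block; the union keeps the payload
   that follows it aligned for doubles, pointers and long long. */
typedef union Header {
    struct { union Header *prev, *next; size_t size; } link;
    union { double d; void *p; long long ll; } align_;
} Header;

/* Hypothetical registry of live blocks, passed as allocator user data. */
typedef struct {
    Header *live;   /* head of the doubly linked list of live blocks */
} HeapRegistry;

static unsigned char *tail_of(Header *h)
{
    return (unsigned char *)(h + 1) + h->link.size;
}

static void link_block(HeapRegistry *r, Header *h)
{
    h->link.prev = NULL;
    h->link.next = r->live;
    if (r->live != NULL) r->live->link.prev = h;
    r->live = h;
}

static void unlink_block(HeapRegistry *r, Header *h)
{
    if (h->link.prev != NULL) h->link.prev->link.next = h->link.next;
    else r->live = h->link.next;
    if (h->link.next != NULL) h->link.next->link.prev = h->link.prev;
}

/* Walk every live block and count those whose tail marker was overwritten. */
static unsigned long registry_sweep(const HeapRegistry *r)
{
    unsigned long bad = 0;
    for (Header *h = r->live; h != NULL; h = h->link.next)
        if (*tail_of(h) != TAIL_MAGIC) bad++;
    return bad;
}

/* Same signature as lua_Alloc; osize is not needed because the header
   already records the block size. */
static void *tracked_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
{
    HeapRegistry *r = (HeapRegistry *)ud;
    Header *old = (ptr != NULL) ? (Header *)ptr - 1 : NULL;
    (void)osize;
    assert(r != NULL);

    if (old != NULL) unlink_block(r, old);   /* realloc may move the block */
    if (nsize == 0) { free(old); return NULL; }

    Header *h = NULL;
    if (nsize <= (size_t)-1 - sizeof(Header) - 1)
        h = realloc(old, sizeof(Header) + nsize + 1);
    if (h == NULL) {                         /* failure: old block stays live */
        if (old != NULL) link_block(r, old);
        return NULL;
    }
    h->link.size = nsize;
    *tail_of(h) = TAIL_MAGIC;
    link_block(r, h);
    return h + 1;
}
```

Calling `registry_sweep` periodically, for example before and after each call into a C binding, would not pinpoint the faulty write, but it could narrow the window in which it happened. The sketch is not thread-safe, and if the list links themselves are corrupted the sweep may crash, which is still a data point.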

Sean Conner

Jul 16, 2024, 7:47:50 PM
to lu...@googlegroups.com
It was thus said that the Great Chris Merck once stated:
>
> On Monday, July 15, 2024 at 12:07:17 PM UTC-5 Bogdan Marinescu wrote:
>
> > intercept all the memory allocation calls (malloc/free/realloc/calloc)
> > and log them serially on the PC [...] look for obvious problems in this
> > data (alloc/free outside the boundaries, multiple frees, frees at an
> > address that was not allocated, unaligned addresses and so on)
>
> One of Noah Pendleton of Memfault's suggestions was to see if there is a
> way to add heap integrity checking to Lua, but this seems rather
> challenging because we'd need to instrument every modification of a heap
> object as well.

One possible way to do this is to write a custom allocation routine to
allocate more memory than is asked for, set the extra space to a special
value, and when freeing memory, check the extra space for this special
value. It may help catch a few memory overwrites without too much trouble.
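
A minimal sketch of such a routine, written against the `lua_Alloc` signature and assuming the C library's `realloc` underneath. The names `GuardState` and `guarded_alloc`, the guard size and the guard value are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 8      /* extra bytes appended to every block (arbitrary) */
#define GUARD_BYTE 0xA5   /* the "special value" (arbitrary) */

/* Hypothetical state passed to Lua as the allocator's user data. */
typedef struct {
    unsigned long overruns;   /* blocks whose guard bytes were found damaged */
} GuardState;

/* Return 1 if the guard bytes following a block of `size` bytes are intact. */
static int guard_intact(const void *ptr, size_t size)
{
    const unsigned char *guard = (const unsigned char *)ptr + size;
    for (size_t i = 0; i < GUARD_SIZE; i++)
        if (guard[i] != GUARD_BYTE) return 0;
    return 1;
}

/* Same signature as lua_Alloc: (ud, ptr, osize, nsize). */
static void *guarded_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
{
    GuardState *st = (GuardState *)ud;
    assert(st != NULL);

    /* When ptr is not NULL, Lua passes the block's current size in osize,
       so the guard can be located without a per-block header. When ptr is
       NULL, osize has a different meaning and is ignored. */
    if (ptr != NULL && !guard_intact(ptr, osize))
        st->overruns++;   /* on a target: log, break into the debugger, or abort */

    if (nsize == 0) {     /* free request */
        free(ptr);
        return NULL;
    }
    if (nsize > (size_t)-1 - GUARD_SIZE)
        return NULL;      /* request too large to add guard bytes */

    unsigned char *newptr = realloc(ptr, nsize + GUARD_SIZE);
    if (newptr == NULL)
        return NULL;      /* the old block, guard included, is left untouched */
    memset(newptr + nsize, GUARD_BYTE, GUARD_SIZE);
    return newptr;
}
```

This only catches writes that run off the end of a block, and only at the moment that block is next freed or resized, so it says where the damage is but not when it happened.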

-spc

Chris Merck

Aug 29, 2024, 2:00:19 PM
to lua-l
We recently solved this issue, so I am posting here for closure. As some commenters suggested, this was not a Lua issue. It showed up nearly always in the Lua stack because the heap in which Lua was allocating its objects was being corrupted.

We are fairly certain that what happened was this: when we updated from ESP-IDF 3.3.1 to 5.1, we had to rework all our build scripts from GNU Make to CMake, since we use ESP-IDF as a library rather than as our base build system. In this process, we missed the compiler and linker option `-mfix-esp32-psram-cache-strategy=memw`, which is needed on older ESP32 chips to work around a problem with the external SPI PSRAM. Two models of our products use the older ESP32-WROVER-B module, while a more recently introduced model uses the ESP32-WROVER-E. The -B requires the workaround while the -E does not. The ultimate result was sporadic corruption of the external heap where Lua objects were stored, but only on certain products.