LWIP socket based TCP server instabilities

Eusebius Flöthenbeyn

Jan 6, 2016, 9:12:36 AM

to esp-open-rtos mailing list

Hi all,

I've been trying to run a simple echo TCP server using the lwip socket API (i.e. almost Linux-compatible).

The code actually works, but I get spurious crashes when I generate traffic. Unfortunately, the behaviour is completely random.

I collected some crash statistics and found that it mostly crashes in one of the following locations:

memcpy+128 (exccause 28)

_malloc_r+132 (exccause 29)

_mallinfo_r+60 (exccause 28)

From the JTAG debugger backtrace I can tell that it mostly happens inside lwip, in the pbuf handling.

Suspicion #1: Stack trashing? No, the stack appears sane (I monitored it in the exception handler).

#2: Heap corruption? Not sure, though I'm fairly certain there is always enough memory available.

#3: Non-thread-safe malloc? I made sure it's using the built-in libc routines.

I'm running "recent stuff", i.e. 3e7edd4.

Any ideas, or does someone have a functional and stable telnet server?
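
For reference, here is a minimal sketch of the kind of echo server discussed in this thread, assuming the BSD-style calls that the lwip socket API mirrors. It is written against the POSIX headers so it can be tried on Linux first; on the target, `lwip/sockets.h` would presumably take their place. The names `echo_once`, `echo_client` and `echo_listen` and the 128-byte buffer are illustrative choices, not code from the original poster's project.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>   /* on the target: presumably "lwip/sockets.h" */
#include <netinet/in.h>
#include <unistd.h>

/* Send all n bytes, retrying on short writes. Returns 0 on success, -1 on error. */
static int send_all(int fd, const char *buf, size_t n)
{
    while (n > 0) {
        ssize_t w = send(fd, buf, n, 0);
        if (w <= 0)
            return -1;
        buf += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Echo one chunk. Returns bytes echoed, 0 if the peer closed, -1 on error. */
int echo_once(int fd)
{
    char buf[128];                      /* fixed stack buffer, no malloc() */
    ssize_t r = recv(fd, buf, sizeof buf, 0);
    if (r <= 0)
        return (int)r;
    return send_all(fd, buf, (size_t)r) == 0 ? (int)r : -1;
}

/* Serve one connected client until it disconnects or an error occurs. */
void echo_client(int fd)
{
    while (echo_once(fd) > 0)
        ;
    close(fd);
}

/* Create a listening TCP socket on the given port. Returns the fd or -1. */
int echo_listen(unsigned short port)
{
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0 || listen(s, 1) < 0) {
        close(s);
        return -1;
    }
    return s;
}
```

A server task would call `echo_listen()` once and then loop over `accept()`, handing each returned descriptor to `echo_client()`. Keeping the buffer on the stack means any heap activity during a test run comes from the network stack rather than from the server itself.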

Angus Gratton

Jan 6, 2016, 6:13:33 PM

to Eusebius Flöthenbeyn, esp-open-rtos mailing list

Hi Eusebius,

Welcome to esp-open-rtos! Sorry to hear your first experience is a crash. :(

Apart from Issue #76 regarding out of heap, there aren't problems I'm aware of with TCP interaction like this.

As you probably know, exccause 28 & 29 are LoadProhibitedCause and StoreProhibitedCause, so it certainly seems like a pointer has gone outside the addressable area for some reason.
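
As a small aid for reading crash logs, the mapping mentioned above can be sketched as a lookup function. The function names are hypothetical, and only a handful of common Xtensa EXCCAUSE values are listed; anything else is reported as unknown rather than guessed.

```c
/* Map an Xtensa EXCCAUSE value to its name (partial table, common causes only). */
const char *exccause_name(unsigned cause)
{
    switch (cause) {
    case 0:  return "IllegalInstructionCause";
    case 3:  return "LoadStoreErrorCause";
    case 6:  return "IntegerDivideByZeroCause";
    case 9:  return "LoadStoreAlignmentCause";
    case 20: return "InstFetchProhibitedCause";
    case 28: return "LoadProhibitedCause";
    case 29: return "StoreProhibitedCause";
    default: return "unknown";
    }
}

/* Causes 28 and 29 both mean a data access through a bad address. */
int exccause_is_bad_pointer(unsigned cause)
{
    return cause == 28 || cause == 29;
}
```

Logging the name together with the faulting address makes it easier to see at a glance whether a series of crashes all point at bad data pointers.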

Are you able to post some code that exhibits the crash, and some rough steps for generating enough traffic to eventually cause crashes?

Angus

> --
> You received this message because you are subscribed to the Google Groups "esp-open-rtos mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to esp-open-rto...@googlegroups.com.
> To post to this group, send email to esp-op...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/esp-open-rtos/55befbb6-85df-46ad-bfac-5d45dae4d44a%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Eusebius Flöthenbeyn

Jan 7, 2016, 4:12:38 AM

to esp-open-rtos mailing list

Hi Angus,

thanks for the quick reply.

> As you probably know, exccause 28 & 29 are LoadProhibitedCause and StoreProhibitedCause, so it certainly seems like a pointer has gone outside the addressable area for some reason.

As far as I can tell from the trace statistics (I log every crash), the faulting pointers are rather random, but often zero. It could still be an uninitialized variable, but gcc hasn't warned about one so far.

> Are you able to post some code that exhibits the crash, and some rough steps for generating enough traffic to eventually cause crashes?

The setup is a little complex as it depends on an external library. I was looking for a simpler way to reproduce it, but didn't find an example server based on the lwip socket API. I'll have to get rid of the external dependencies and rebuild it as a standalone example (inside the open-rtos SDK).

Other than that I'm quite impressed by the reverse engineering work that has been done on the APIs.

On a side note: While trying to trace in GDB (via JTAG) I often run into the "wdev 1166" issue when halting, then resuming the target. The timer watchdog is being fed from inside the JTAG proxy, so it must be some NMI from the wdev device. Does anyone have some insight on how to recover from this NMI? It seems to fire only when the network is actually up. Inside the boot process, I can halt/resume the target without issues.

Angus Gratton

Jan 7, 2016, 6:40:04 AM

to Eusebius Flöthenbeyn, esp-open-rtos mailing list

On Thu, Jan 07, 2016 at 01:12:38AM -0800, Eusebius Flöthenbeyn wrote:
> The setup is a little complex as it depends on an external library, I was
> looking at a simpler way of reconstruction, but didn't find an example
> server based on the lwip SOCKET API. I'll have to get rid of the external
> dependencies and rebuild it as a standalone-example (inside the open-rtos
> sdk).
> Other than that I'm quite impressed by the reverse engineering work that
> has been done on the APIs.

Any chance it's an open source external library we can add to 'extras/' in esp-open-rtos itself? Or is it something more specific?

> On a side note: While trying to trace in GDB (via JTAG) I often run into
> the "wdev 1166" issue when halting, then resuming the target. The timer
> watchdog is being fed from inside the JTAG proxy, so it must be some NMI
> from the wdev device. Does anyone have some insight on how to recover from
> this NMI? It seems to fire only when the network is actually up. Inside the
> boot process, I can halt/resume the target without issues.

What JTAG interface software/setup are you using?

My recollection with 'wdev 1166' is that it's one of the "internal buffer overflow" failures after the lower MAC layer (I think) handles WiFi frames in/out of the PHY. I think what happens is the NMI continues to receive frames while interrupts are off (including when halted in the debugger) so none get processed, and eventually some internal buffer overflows.

(The error is an assertion failing at wdev.c line 1166)

If we get to the point where wdev.c and the NMI routines are reverse engineered then we should be able to find a way not to panic when this happens - it should be possible to just drop some frames and keep going, even if it means possible data loss.

If you're not debugging any interrupt routines, you may be able to set the debugger interrupt level so interrupts continue to run in the background while the debugger is halted. I never got this to work properly (possibly because the interrupt handler drops the interrupt level before running the handler and confuses the debug interface), but I haven't tried for a while (and I was using my buggy openocd port which had other issues).

Angus

Eusebius Flöthenbeyn

Jan 7, 2016, 11:56:45 AM

to esp-open-rtos mailing list

> Any chance it's an open source external library we can add to 'extras/' in esp-open-rtos itself? Or is it something more specific?

Yes, it is in fact open source (http://www.section5.ch/netpp). For the test case I just have to strip it down to a bare TCP server.

I may have to revisit some malloc() issues: it turns out that when malloc() usage is enabled, the crashes are more frequent.

> What JTAG interface software/setup are you using?

I'm currently using the branch from https://github.com/sysprogs/esp8266-openocd with some patches to make it compile. Tried your version first, then I figured the sysprogs guys added some stuff. If you're interested, I can post a git patch somewhere (tweaked a bit on the watchdog side).

> If we get to the point where wdev.c and the NMI routines are reverse engineered then we should be able to find a way not to panic when this happens - it should be possible to just drop some frames and keep going, even if it means possible data loss.

That's what I would have looked at next: reconstructing the wdev stuff to see if I can just skip the packet loss condition. Obviously there are/were plenty of reverse engineering attempts; I've only done a bit of RE on the peripheral side to find out about the registers. It would make a lot of sense to coordinate the efforts, but I still don't have a clear idea which group to join :-) On the other hand, there are so many orphaned git repos that I wouldn't want to fork yet another one.

> If you're not debugging any interrupt routines, you may be able to set the debugger interrupt level so interrupts continue to run in the background while the debugger is halted. I never got this to work properly (possibly because the interrupt handler drops the interrupt level before running the handler and confuses the debug interface), but I haven't tried for a while (and I was using my buggy openocd port which had other issues).

I might have to revisit that, as I'm not sure if the current openocd disables interrupts while stepping or when halting, but the NMI would obviously be raised once too many packets queue up in the HW (FIFO or whatever), if I understand all this right. Or does it raise any time when a packet comes in? Anyhow, this needs more RE on wdev.c, I guess.

Also, I have no clue about the system interrupt handling of this SoC in general, and most forum messages about this appear rather speculative.

Since there's no quick solution, I'll just see that I can put up some grab'n'play code reference for you to test next week. Thanks so far!

Jonas S Karlsson (☯大鱼)

Jan 8, 2016, 5:11:58 AM

to Eusebius Flöthenbeyn, esp-open-rtos mailing list

I tend not to use a debugger, especially on embedded systems. In my experience with the esp8266, I share exactly the same TCP/IP code on Linux, so I tend to debug there, which is where the convenience of esp-open-rtos comes in: mostly the same code everywhere.

Since memory is limited and since malloc doesn't return NULL when out of memory, the results may be quite surprising. That you're noticing crashes in memory management functions is very suggestive... A small leak quickly has consequences on the esp, where elsewhere it might go unobserved. One can print out the remaining heap memory. It seems accurate and reliable.
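
Since a NULL return cannot be relied on here, one option is to check the reported free heap before every allocation. The following is a minimal sketch under that assumption. `can_allocate` and `guarded_malloc` are hypothetical names, and the reserve margin is an arbitrary safety value. On the target, the `reported_free` argument would presumably come from `xPortGetFreeHeapSize()`, the call named elsewhere in this thread. It is taken as a signed value so that a nonsensical negative report is rejected instead of being mistaken for a huge heap.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return 1 if a request of n bytes fits while keeping `reserve` bytes spare.
 * A negative report is treated as "unknown" and therefore refused. */
int can_allocate(size_t n, long reported_free, long reserve)
{
    if (reported_free < 0 || reserve < 0)
        return 0;
    if (reported_free < reserve)
        return 0;
    return n <= (size_t)(reported_free - reserve);
}

/* malloc() wrapper that returns NULL when the report says the request cannot fit. */
void *guarded_malloc(size_t n, long reported_free, long reserve)
{
    return can_allocate(n, reported_free, reserve) ? malloc(n) : NULL;
}
```

This only restores a NULL result for the callers that go through the wrapper; allocations made inside binary libraries are not covered by it.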

Eusebius Flöthenbeyn

Jan 9, 2016, 1:29:20 PM

to esp-open-rtos mailing list, j...@yesco.org

Hi all,

I think I've got one reliable trace.

On Friday, January 8, 2016 at 11:11:58 AM UTC+1, Jonas S Karlsson (☯大鱼) wrote:

> I tend not to use a debugger, especially on embedded systems. In my experience with the esp8266, I share exactly the same TCP/IP code on Linux, so I tend to debug there, which is where the convenience of esp-open-rtos comes in: mostly the same code everywhere.

That has been exactly my strategy so far; the debugger only comes in for the 'hard' issues when things just don't work as expected.

I have verified my code with a number of regression tests on Linux, so I can exclude a leak. Apart from that, there's a fully "static" compile option, i.e. no malloc() is called. When I enable malloc, the crashes occur more often.

When I fully avoid reconnects, there's only a little mallocing going on inside lwip. In this case I didn't manage to provoke a crash with the latest code. I suspected that malloc might return a block that is not 8-byte aligned under certain conditions, but that was not the case.
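
The alignment suspicion is easy to test mechanically. This is a minimal sketch of such a check, not the code actually used in the test described above; the function names are hypothetical. It allocates one block of every size up to a limit and counts the pointers that are not aligned as requested.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Return 1 if p is aligned to `align` bytes; `align` must be a power of two. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (uintptr_t)(align - 1)) == 0;
}

/* Allocate one block of each size 1..max_size and count misaligned results.
 * Failed allocations are skipped; every block is freed again immediately. */
size_t count_misaligned(size_t max_size, size_t align)
{
    size_t n, bad = 0;

    for (n = 1; n <= max_size; n++) {
        void *p = malloc(n);
        if (p == NULL)
            continue;
        if (!is_aligned(p, align))
            bad++;
        free(p);
    }
    return bad;
}
```

Because each block is freed right away, the sweep mostly sees the same few heap locations; running it while other tasks allocate concurrently would presumably cover more of the allocator's states.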

> Since memory is limited and since malloc doesn't return NULL when out of memory, the results may be quite surprising. That you're noticing crashes in memory management functions is very suggestive... A small leak quickly has consequences on the esp, where elsewhere it might go unobserved. One can print out the remaining heap memory. It seems accurate and reliable.

I just ran into the situation where xPortGetFreeHeapSize() returns a bogus (negative) value, but the task is still running. So I'm not so sure I can second your statement :-)

I still have to figure out the differences in memory management between the free and the native ESP SDK, and do more monitoring on the tasks that malloc often. Eventually, all dynamic allocation at runtime has to be eliminated anyhow to avoid fragmentation issues.
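
One common way to get rid of runtime malloc() and the fragmentation that comes with it is a statically allocated pool of fixed-size blocks. The sketch below illustrates that general technique; it is not the "static" option of the library discussed here. The block size, block count and function names are hypothetical, and the code is not thread-safe as written, so on the target each call would need to run inside a critical section or under a mutex.

```c
#include <stddef.h>

#define POOL_BLOCK_SIZE  128  /* hypothetical block size in bytes */
#define POOL_BLOCK_COUNT 8    /* hypothetical number of blocks */

typedef union pool_block {
    union pool_block *next;               /* link while the block is free */
    unsigned char data[POOL_BLOCK_SIZE];  /* payload while it is in use */
} pool_block;

static pool_block pool_storage[POOL_BLOCK_COUNT];
static pool_block *pool_free_list;

/* Put every block on the free list; call once before first use. */
void pool_init(void)
{
    size_t i;

    pool_free_list = NULL;
    for (i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Take one block, or NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    pool_block *b = pool_free_list;

    if (b != NULL)
        pool_free_list = b->next;
    return b;
}

/* Return a block previously obtained from pool_alloc(). */
void pool_free(void *p)
{
    pool_block *b = p;

    if (b == NULL)
        return;
    b->next = pool_free_list;
    pool_free_list = b;
}
```

Since all blocks have the same size, the pool cannot fragment, and exhaustion shows up as a plain NULL from `pool_alloc()`.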

Jonas S Karlsson (☯大鱼)

Jan 9, 2016, 11:08:12 PM

to Eusebius Flöthenbeyn, esp-open-rtos mailing list

On Jan 10, 2016 1:29 AM, "Eusebius Flöthenbeyn" <floeth...@gmail.com> wrote:
>
> Hi all,
>
> I think I've got one reliable trace.
>
>
> On Friday, January 8, 2016 at 11:11:58 AM UTC+1, Jonas S Karlsson (☯大鱼) wrote:
>>
>> I tend not to use debugger, especially on embedded systems. In my experience w esp8266 I share exactly the same tcpip code on Linux so. I tend to debug there, which is where the convenience of esp-open-rtos comes in - mostly same code everywhere.
>
>
> That was exactly my strategy so far, the debugger only comes in on the 'hard' issues when things just don't work as expected.
> I have verified my code with a number of regress tests on linux, so I can exclude the leak situation, apart from that, there's a fully "static" compile option, i.e. no malloc() called. When I enable malloc, the crashes occur more often.
> When I fully avoid reconnects, there's only little mallocing going on inside lwip. In this case I didn't manage a crash with the latest code. I had the suspicion that malloc might return a non-8byte-aligned block, under certain conditions, but that was not the case.
>
>
>>
>> Since memory is limited and since Malloc doesn't return NULL when out of memory the results may be quite surprising. You noticing crashes in mem mgt functions is very suggestive... A small leak fast has consquences on esp where otherwise it may not be observed. One can print out memory heap remaining. It seems accurate an reliable.
>
>
> I just ran into the situation where xPortGetFreeHeapSize() returns bogus (negative value), but the task is still running. So I'm not so sure I can second your statement :-)

OK. Then I'm convinced you're running out of memory... I've played with this a lot...

It's not bogus. The task is still running, allocating memory that doesn't exist... You're still getting pointers, not NULL... But it's not real memory. The crash often comes when you free it. I could malloc in 16 KB chunks and allocate 1 MB in total without a problem, and values written to those memory positions are also "remembered". Not getting NULL gives a false sense of security.

As I said, when all memory is exhausted this is not detected by the RTOS, so "still running" doesn't imply "correctness".
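
The chunked-allocation experiment described above can be sketched as a small probe. This is an illustrative reconstruction with a hypothetical function name, not the code that was actually run. It allocates fixed-size chunks until malloc() returns NULL or a byte limit is reached, frees them again, and reports the total it was given.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate `chunk`-byte blocks until malloc() returns NULL or `limit` bytes
 * are reached, then free everything. Returns the total bytes obtained. */
size_t probe_heap(size_t chunk, size_t limit)
{
    void *head = NULL;
    size_t total = 0;

    if (chunk < sizeof(void *))
        return 0;                   /* too small to hold the chain link */

    while (total + chunk <= limit) {
        void **blk = malloc(chunk);
        if (blk == NULL)
            break;
        *blk = head;                /* chain blocks through their first word */
        head = blk;
        total += chunk;
    }
    while (head != NULL) {
        void *next = *(void **)head;
        free(head);
        head = next;
    }
    return total;
}
```

If the returned total exceeds the RAM the chip physically has, the allocator is handing out addresses that are not backed by memory. On an affected build the freeing loop may itself crash, which matches the observation that the crash often comes on free.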

> I still have to figure out the differences in memory management with free vs. native ESP SDK and do more monitoring on the tasks that malloc often. Eventually, all dynamic malloc during runtime has to be eliminated anyhow to avoid fragmentation issues.
>

Jonas S Karlsson (☯大鱼)

Jan 9, 2016, 11:15:40 PM

to Eusebius Flöthenbeyn, esp-open-rtos mailing list

See https://github.com/SuperHouse/esp-open-rtos/issues/76

Eusebius Flöthenbeyn

Jan 10, 2016, 11:50:46 AM

to esp-open-rtos mailing list, j...@yesco.org

> I just ran into the situation where xPortGetFreeHeapSize() returns bogus (negative value), but the task is still running. So I'm not so sure I can second your statement :-)

> OK. Then I'm convinced you're running out of memory... I've played with this a lot...
>
> It's not bogus. The task is still running, allocating memory that doesn't exist... You're still getting pointers, not NULL... But it's not real memory. The crash often comes when you free it. I could malloc in 16 KB chunks and allocate 1 MB in total without a problem, and values written to those memory positions are also "remembered". Not getting NULL gives a false sense of security.
>
> As I said, when all memory is exhausted this is not detected by the RTOS, so "still running" doesn't imply "correctness".

Yes, I'm aware of that issue, but the only task doing malloc()/free() right now, apart from the ESP stuff, is the lwip pbuf handling. malloc()/free() in my TCP server is all turned off.

Also, memory usage is not increasing: over a few tens of thousands of test cycles, the free heap value doesn't drift (it just fluctuates randomly a bit). Then suddenly the bogus value comes up. It happens far more often if I turn on another task doing a malloc()/free() cycle.
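
The monitoring described here can be sketched as a small sample tracker. The type and function names are hypothetical, and `heap_total` is an assumed upper bound that the caller has to supply. On the target, each sample would presumably be `(long)xPortGetFreeHeapSize()`, so that a wrapped value shows up as negative on a 32-bit system.

```c
typedef struct {
    long heap_total;        /* largest value a sane report can have */
    long min_free;          /* lowest plausible sample so far */
    long max_free;          /* highest plausible sample so far */
    unsigned long samples;  /* all samples seen */
    unsigned long bogus;    /* samples outside 0..heap_total */
} heap_monitor;

/* Reset the tracker; heap_total is the assumed size of the whole heap. */
void heap_monitor_init(heap_monitor *m, long heap_total)
{
    m->heap_total = heap_total;
    m->min_free = heap_total;
    m->max_free = 0;
    m->samples = 0;
    m->bogus = 0;
}

/* Record one sample; returns 1 if it was plausible, 0 if it was bogus. */
int heap_monitor_sample(heap_monitor *m, long free_bytes)
{
    m->samples++;
    if (free_bytes < 0 || free_bytes > m->heap_total) {
        m->bogus++;
        return 0;
    }
    if (free_bytes < m->min_free)
        m->min_free = free_bytes;
    if (free_bytes > m->max_free)
        m->max_free = free_bytes;
    return 1;
}
```

A steady gap between `min_free` and `max_free` with a sudden bogus sample would fit the behaviour reported here, whereas a steadily falling `min_free` would point at a leak instead.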

Can we exclude the possibility that there are concurrent mallocs (like one inside the esp blobs) getting in the way of each other? I don't see this particular problem on the native FreeRTOS (0.9x) setup.
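
One way to look for allocators stepping on each other is to have every task run a fill-and-verify cycle with its own byte pattern. The following is a minimal sketch of one such cycle, with a hypothetical function name. The optional `pause` callback is an assumption: on the target it would presumably be a small wrapper that yields or delays so that other tasks get to run between fill and verify.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One malloc/fill/verify/free cycle.
 * Returns 0 if the block stayed intact, 1 if it was modified by someone
 * else, and -1 if the allocation failed. `pause` may be NULL. */
int alloc_cycle_check(size_t n, unsigned char pattern, void (*pause)(void))
{
    unsigned char *p = malloc(n);
    size_t i;
    int bad = 0;

    if (p == NULL)
        return -1;
    memset(p, pattern, n);
    if (pause != NULL)
        pause();                    /* let other tasks allocate and free */
    for (i = 0; i < n; i++) {
        if (p[i] != pattern) {
            bad = 1;
            break;
        }
    }
    free(p);
    return bad;
}
```

A result of 1 means that another owner wrote into the block, which would suggest overlapping allocations. A clean run does not prove that the allocator is safe against concurrent use; it only fails to show the problem.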

I'm gonna revisit the libgloss stuff (reentrant syscall wrappers) anyhow for my setup and do some more monitoring, once I find out how to get around the wdev watchdog WRT JTAG issue safely.
