otCoapResponseHandler: Who frees the otMessage?

299 views
Skip to first unread message

Stuart Longland

unread,
Nov 5, 2017, 6:15:25 PM11/5/17
to openthread-users
Hi all,

I've written some code to regularly sample a value (presently a pulse
input counter) and POST it to a CoAP server I have written in NodeJS.

This works well, for a while. After a period I start getting
otCoapNewMessage returning NULL which suggests I'm overrunning a buffer
somewhere. The only thing I can think of is my response handler not
freeing the message received.

The documentation does not state anywhere whose responsibility it is to
free data structures. I'm guessing OpenThread does after
otCoapResponseHandler returns, but this is neither confirmed nor denied.

Looking at the code, the action seems to be in src/core/coap/coap.cpp,
it appears that DequeueMessage does in fact free the message instance,
and this appears to be called in FinalizeCoapTransaction.

Is there any special handling I need to perform after receiving a reply
message to a CoAP request to ensure the memory is freed?
--
_ ___ Stuart Longland - Systems Engineer
\ /|_) | T: +61 7 3535 9619
\/ | \ | 38b Douglas Street F: +61 7 3535 9699
SYSTEMS Milton QLD 4064 http://www.vrt.com.au

Stuart Longland

unread,
Nov 5, 2017, 7:06:01 PM11/5/17
to openthre...@googlegroups.com
On 06/11/17 09:15, Stuart Longland wrote:
> I've written some code to regularly sample a value (presently a pulse
> input counter) and POST it to a CoAP server I have written in NodeJS.
>
> This works well, for a while. After a period I start getting
> otCoapNewMessage returning NULL which suggests I'm overrunning a buffer
> somewhere.

For the benefit of discussion, this is the log I see from the device.
The protocol used over CoAP is based on CoRE RD and CoMI. The device
does "simple registration" to my CoRE RD server which then queries the
device's /.well-known/core endpoint.

After doing this, the server endpoint launches some commissioning code
that checks the "commission date" on the device, sees that it's 0
(meaning, not commissioned), then begins inserting some configuration
from a database and a CoAP endpoint for the device to report its samples to.

Also seen here is the NTP broadcast client in action.

Once commissioned, the node begins polling every 5 seconds, reading the
water meter pulse counters and reporting those to the CoAP endpoint. It
waits for the reply before sending the next message.

After a period, it seems there's a drop-off in communications. I'm not
sure what causes this at this time, but immediately afterwards, calls to
otCoapNewMessage return NULL, which is indicated by the log message
"main.c:631 FAILED POST HIST: Failed to create message". That code is here:

> static void schedule_write_response_handler(void *aContext, otCoapHeader *aHeader,
> otMessage *aMessage, const otMessageInfo *aMessageInfo, otError aResult);
> static void schedule_write(otInstance* instance) {
> otCoapHeader request_header;
> otMessage* request = NULL;
> union {
> char c[128];
> uint8_t u[128];
> } buffer;
>
> if (!(schedule_todo & (1 << schedule_point))) {
> /* Move to next point */
> iprintf("%s:%d POST HISTORY: Skip point %d\r\n",
> __FILE__, __LINE__, schedule_point);
> schedule_point++;
> return;
> }
>
> printf("%s:%d POST HISTORY: Prepare point %d time %lu value %f\r\n",
> __FILE__, __LINE__, schedule_point, schedule_current,
> schedule_value[schedule_point]);
>
> /* Prepare the request */
> memset(&schedule_buffer, 0, sizeof(schedule_buffer));
> if (otPlatRandomGetTrue(schedule_buffer.token, sizeof(schedule_buffer.token))
> != OT_ERROR_NONE) {
> iprintf("%s:%d FAILED POST HISTORY: Failed to generate token\r\n",
> __FILE__, __LINE__);
> schedule_todo = 0;
> return;
> }
>
> /* Initialise message header */
> otCoapHeaderInit(&request_header,
> OT_COAP_TYPE_CONFIRMABLE, OT_COAP_CODE_POST);
> otCoapHeaderSetToken(&request_header,
> schedule_buffer.token, sizeof(registration_token));
> otCoapHeaderSetMessageId(&request_header, registration_msgid++);
>
> /* Figure out endpoint */
> const char* p = &schedule_path[1]; /* Skip past slash */
> while (p) {
> const char* slash = index(p, '/'); /* Find next slash */
> if (slash) {
> /* Next segment */
> strncpy(buffer.c, p, slash - p);
> p = &slash[1];
> } else {
> /* Last segment */
> strncpy(buffer.c, p, sizeof(buffer.c));
> p = NULL;
> }
> iprintf("%s:%d POST HISTORY: Path segment %s\r\n",
> __FILE__, __LINE__, buffer.c);
> otCoapHeaderAppendUriPathOptions(&request_header, buffer.c);
> }
>
> /* Final part of path, point ID */
> otCoapHeaderAppendUriPathOptions(&request_header,
> comi_point_config[schedule_point].id);
> otCoapHeaderSetPayloadMarker(&request_header);
>
> /* Prepare request message */
> request = otCoapNewMessage(instance, &request_header);
> if (!request) {
> iprintf("%s:%d FAILED POST HIST: Failed to create message\r\n",
> __FILE__, __LINE__);
> schedule_todo = 0;
> return;
> }
>
> /* History endpoint payload */
> CborEncoder encoder, map_encoder;
> cbor_encoder_init(&encoder, buffer.u, sizeof(buffer.u), 0);
> cbor_encoder_create_map(&encoder, &map_encoder, 2);
>
> cbor_encode_text_stringz(&map_encoder, "vrt-wsh:sample/timestamp");
> cbor_encode_uint(&map_encoder, (uint64_t)schedule_current);
> cbor_encode_text_stringz(&map_encoder, "vrt-wsh:sample/value");
> cbor_encode_float(&map_encoder, schedule_value[schedule_point]);
>
> cbor_encoder_close_container(&encoder, &map_encoder);
> if (otMessageAppend(request, buffer.u,
> cbor_encoder_get_buffer_size(&encoder, buffer.u))
> != OT_ERROR_NONE) {
> iprintf("%s:%d FAILED POST HIST: Failed to create message\r\n",
> __FILE__, __LINE__);
> schedule_todo = 0;
> return;
> }
>
> /* Send request */
> if (otCoapSendRequest(instance, request, &schedule_msginfo,
> schedule_write_response_handler, NULL)
> != OT_ERROR_NONE) {
> iprintf("%s:%d FAILED POST HIST: Failed to send message\r\n",
> __FILE__, __LINE__);
> return;
> }
>
> iprintf("%s:%d History sent for point %d\r\n", __FILE__, __LINE__,
> schedule_point);
> /* Wait for response */
> schedule_state = SCHEDULE_STATE_WRWAIT;
> }
>
> static void schedule_write_response_handler(void *aContext, otCoapHeader *aHeader,
> otMessage *aMessage, const otMessageInfo *aMessageInfo, otError aResult) {
> (void)aContext;
>
> if (aMessage && aMessageInfo && aHeader) {
> otCoapCode code = otCoapHeaderGetCode(aHeader);
> if ((aResult == OT_ERROR_NONE) &&
> (code == OT_COAP_CODE_CREATED)) {
> iprintf("%s:%d History stored for point %d\r\n",
> __FILE__, __LINE__, schedule_point);
>
> /* Move to next point */
> schedule_state = SCHEDULE_STATE_WRITE;
> schedule_todo &= ~(1 << schedule_point);
> schedule_point++;
> iprintf("%s:%d Next point %d, todo %lx\r\n",
> __FILE__, __LINE__,
> schedule_point, schedule_todo);
> } else {
> iprintf("%s:%d FAILED POST HIST: "
> "Bad response (code=0x%02x err=%d).\r\n",
> __FILE__, __LINE__,
> code, aResult);
> schedule_state = SCHEDULE_STATE_INIT;
> schedule_todo = 0;
> schedule_point = 0;
> }
> } else {
> iprintf("%s:%d FAILED POST HIST: No response.\r\n",
> __FILE__, __LINE__);
> schedule_state = SCHEDULE_STATE_INIT;
> schedule_todo = 0;
> schedule_point = 0;
> }
> }
I'm pretty confident the problem is either in the CoAP client, or in my
calling of it that's causing the memory leak as increasing the polling
rate has a substantial impact on the length of time before this starts
occurring.
out.log

Jonathan Hui

unread,
Nov 8, 2017, 12:21:22 PM11/8/17
to Stuart Longland, openthread-users
It's not yet obvious to me why messages are not getting freed.

Can you utilize `otMessageGetBufferInfo()` to provide more information on where the buffers being used are within OpenThread?

--
Jonathan Hui

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-users+unsubscribe@googlegroups.com.
To post to this group, send email to openthread-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/b419af7c-67f3-6559-c422-6559ab63a40e%40vrt.com.au.
For more options, visit https://groups.google.com/d/optout.

Stuart Longland

unread,
Nov 8, 2017, 4:55:08 PM11/8/17
to jon...@nestlabs.com, openthread-users
On 09/11/17 03:21, Jonathan Hui wrote:
> It's not yet obvious to me why messages are not getting freed.
>
> Can you utilize `otMessageGetBufferInfo()`
> <https://github.com/openthread/openthread/blob/master/include/openthread/message.h#L317> to
> provide more information on where the buffers being used are within
> OpenThread?

No worries, I will have a look. One thing I observe, even if allocation
succeeds, if I send messages too quickly, the OpenThread code chokes.

Specifically, I wind up in a tight loop at
https://github.com/openthread/openthread/blob/master/examples/platforms/cc2538/radio.c#L361

I'm still investigating this further, but will have a look at that
function to see what it can tell me.

Regards,

Stuart Longland

unread,
Nov 8, 2017, 7:40:14 PM11/8/17
to openthre...@googlegroups.com
On 09/11/17 07:55, Stuart Longland wrote:
> One thing I observe, even if allocation
> succeeds, if I send messages too quickly, the OpenThread code chokes.
>
> Specifically, I wind up in a tight loop at
> https://github.com/openthread/openthread/blob/master/examples/platforms/cc2538/radio.c#L361

Just confirmed it here. If I send about a dozen CoAP requests (waiting
for the ACK from each one) in quick succession, the OpenThread device
seizes up.

Breaking in with the debugger, I find myself here:

> $ arm-none-eabi-gdb demo.out
> GNU gdb (Gentoo 8.0.1 vanilla) 8.0.1
> Copyright (C) 2017 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "--host=x86_64-pc-linux-gnu --target=arm-none-eabi".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <https://bugs.gentoo.org/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from demo.out...done.
> (gdb) target remote localhost:3333
> Remote debugging using localhost:3333
> 0x00000000 in ?? ()
> (gdb) monitor reset init
> JTAG tap: cc2538.jrc tap/device found: 0x8b96402f (mfg: 0x017 (Texas Instruments), part: 0xb964, ver: 0x8)
> JTAG tap: cc2538.dap enabled
> Only resetting the Cortex-M core, use a reset-init event handler to reset any peripherals or configure hardware srst support.
> target halted due to debug-request, current mode: Thread
> xPSR: 0x01000000 pc: 0x0000136c msp: 0x20008000
> (gdb) c
> Continuing.

Letting it rip… we wait until we see LEDs on the device stop blinking.

> ^C
> Program received signal SIGINT, Interrupt.
> 0x00206a86 in otPlatRadioTransmit (aInstance=0x20000b68 <sInstanceRaw>, aFrame=aFrame@entry=0x20000ad4 <sTransmitFrame>) at /home/stuartl/vrt/projects/widesky/hub/hal/src/radio.c:358
> 358 while ((HWREG(RFCORE_XREG_FSMSTAT1) & 1) == 0);
> (gdb) bt
> #0 0x00206a86 in otPlatRadioTransmit (aInstance=0x20000b68 <sInstanceRaw>, aFrame=aFrame@entry=0x20000ad4 <sTransmitFrame>) at /home/stuartl/vrt/projects/widesky/hub/hal/src/radio.c:358
> #1 0x0020a13e in ot::Mac::Mac::RadioTransmit (this=this@entry=0x200028e4 <sInstanceRaw+7548>, aSendFrame=aSendFrame@entry=0x20000ad4 <sTransmitFrame>)
> at /home/stuartl/vrt/projects/widesky/hub/hal/third_party/openthread/src/core/mac/mac.cpp:1095
> #2 0x0020a6ea in ot::Mac::Mac::HandleBeginTransmit (this=0x200028e4 <sInstanceRaw+7548>) at /home/stuartl/vrt/projects/widesky/hub/hal/third_party/openthread/src/core/mac/mac.cpp:937
> #3 0x00202cf0 in main (argc=<optimized out>, argv=<optimized out>) at /home/stuartl/vrt/projects/widesky/hub/hal/demo/main.c:238

This 'radio.c' has the following changes applied to calibrate the output
levels to work with the CC2592:

> stuartl@vk4msl-ws ~/vrt/projects/widesky/hub/hal $ diff -u /home/stuartl/vrt/projects/widesky/hub/hal/third_party/openthread/examples/platforms/cc2538/radio.c /home/stuartl/vrt/projects/widesky/hub/hal/src/radio.c
> --- /home/stuartl/vrt/projects/widesky/hub/hal/third_party/openthread/examples/platforms/cc2538/radio.c 2017-08-01 12:48:31.466864595 +1000
> +++ /home/stuartl/vrt/projects/widesky/hub/hal/src/radio.c 2017-08-31 16:08:22.826698008 +1000
> @@ -73,23 +73,20 @@
> uint8_t mTxPowerReg;
> } TxPowerTable;
>
> -// The transmit power table, the values are from SmartRF Studio 2.4.0
> +// The transmit power table, the values are from application note 130
> static const TxPowerTable sTxPowerTable[] =
> {
> - { 7, 0xFF },
> - { 5, 0xED },
> - { 3, 0xD5 },
> - { 1, 0xC5 },
> - { 0, 0xB6 },
> - { -1, 0xB0 },
> - { -3, 0xA1 },
> - { -5, 0x91 },
> - { -7, 0x88 },
> - { -9, 0x72 },
> - { -11, 0x62 },
> - { -13, 0x58 },
> - { -15, 0x42 },
> - { -24, 0x00 },
> + { 22, 0xFF }, /* 22.0dBm =~ 158.5mW */
> + { 21, 0xD5 }, /* 20.9dBm =~ 123.0mW */
> + { 20, 0xC5 }, /* 20.1dBm =~ 102.3mW */
> + { 19, 0xB0 }, /* 19.0dBm =~ 79.4mW */
> + { 18, 0xA1 }, /* 17.8dBm =~ 60.3mW */
> + { 16, 0x91 }, /* 16.4dBm =~ 43.7mW */
> + { 15, 0x88 }, /* 14.9dBm =~ 30.9mW */
> + { 13, 0x72 }, /* 13.0dBm =~ 20.0mW */
> + { 11, 0x62 }, /* 11.0dBm =~ 12.6mW */
> + { 10, 0x58 }, /* 9.5dBm =~ 8.9mW */
> + { 8, 0x42 }, /* 7.5dBm =~ 5.6mW */
> };
>
> static otRadioFrame sTransmitFrame;

The full patched file is here:
https://gist.github.com/sjlongland/0606f07c69851ec55367c09d2b730a1d

Stuart Longland

unread,
Nov 8, 2017, 7:40:43 PM11/8/17
to openthre...@googlegroups.com
On 09/11/17 10:40, Stuart Longland wrote:
> Just confirmed it here. If I send about a dozen CoAP requests (waiting
> for the ACK from each one) in quick succession, the OpenThread device
> seizes up.

"quick succession" being 100msec or less between requests.

Stuart Longland

unread,
Nov 8, 2017, 8:14:58 PM11/8/17
to openthre...@googlegroups.com
On 09/11/17 03:21, Jonathan Hui wrote:
> It's not yet obvious to me why messages are not getting freed.
>
> Can you utilize `otMessageGetBufferInfo()`
> <https://github.com/openthread/openthread/blob/master/include/openthread/message.h#L317> to
> provide more information on where the buffers being used are within
> OpenThread?

Okay, went out to an early lunch and left the device polling… eventually
it coughed up this:

> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:214 POLL ASYNC FINISH point 8 request 0x2000779c state 0xf0
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:664 POST HISTORY: Look up destination
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:726 POST HISTORY: Destination IP fd85:4a3c:b40e:0:8c2b:f9a0:311c:46f7
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:779 POST HISTORY: Destination port 5683
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:789 POST HISTORY: Prepare write of points, todo=21f
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:796 POST HISTORY: To send: 0x21f
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:411 POST HISTORY: Prepare point 0 time 1510189950 value 2777230.000000
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:449 POST HISTORY: Path segment h
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/scheduler.c:464 FAILED POST HIST: Failed to create message
> … m6loReassemblyBuffers 0
> … m6loReassemblyMessages 0
> … m6loSendBuffers 40
> … m6loSendMessages 20
> … mApplicationCoapBuffers 0
> … mApplicationCoapMessages 0
> … mArpBuffers 0
> … mArpMessages 0
> … mCoapBuffers 0
> … mCoapMessages 0
> … mCoapSecureBuffers 0
> … mCoapSecureMessages 0
> … mFreeBuffers 0
> … mIp6Buffers 0
> … mIp6Messages 0
> … mMleBuffers 0
> … mMleMessages 0
> … mMplBuffers 0
> … mMplMessages 0
> … mTotalBuffers 40

The relevant code:
>
> /* Final part of path, point ID */
> otCoapHeaderAppendUriPathOptions(&request_header,
> comi_point_config[schedule_point].id);
> otCoapHeaderSetPayloadMarker(&request_header);
>
> /* Prepare request message */
> request = otCoapNewMessage(instance, &request_header);
> if (!request) {
> #ifdef SCHEDULE_TRACE
> iprintf("%s:%d FAILED POST HIST: Failed to create message\r\n",
> __FILE__, __LINE__);
> otBufferInfo buffer_info;
> otMessageGetBufferInfo(instance, &buffer_info);
> iprintf("… m6loReassemblyBuffers\t%d\r\n",
> buffer_info.m6loReassemblyBuffers);
> iprintf("… m6loReassemblyMessages\t%d\r\n",
> buffer_info.m6loReassemblyMessages);
> iprintf("… m6loSendBuffers\t%d\r\n",
> buffer_info.m6loSendBuffers);
> iprintf("… m6loSendMessages\t%d\r\n",
> buffer_info.m6loSendMessages);
> iprintf("… mApplicationCoapBuffers\t%d\r\n",
> buffer_info.mApplicationCoapBuffers);
> iprintf("… mApplicationCoapMessages\t%d\r\n",
> buffer_info.mApplicationCoapMessages);
> iprintf("… mArpBuffers\t%d\r\n",
> buffer_info.mArpBuffers);
> iprintf("… mArpMessages\t%d\r\n",
> buffer_info.mArpMessages);
> iprintf("… mCoapBuffers\t%d\r\n",
> buffer_info.mCoapBuffers);
> iprintf("… mCoapMessages\t%d\r\n",
> buffer_info.mCoapMessages);
> iprintf("… mCoapSecureBuffers\t%d\r\n",
> buffer_info.mCoapSecureBuffers);
> iprintf("… mCoapSecureMessages\t%d\r\n",
> buffer_info.mCoapSecureMessages);
> iprintf("… mFreeBuffers\t%d\r\n",
> buffer_info.mFreeBuffers);
> iprintf("… mIp6Buffers\t%d\r\n",
> buffer_info.mIp6Buffers);
> iprintf("… mIp6Messages\t%d\r\n",
> buffer_info.mIp6Messages);
> iprintf("… mMleBuffers\t%d\r\n",
> buffer_info.mMleBuffers);
> iprintf("… mMleMessages\t%d\r\n",
> buffer_info.mMleMessages);
> iprintf("… mMplBuffers\t%d\r\n",
> buffer_info.mMplBuffers);
> iprintf("… mMplMessages\t%d\r\n",
> buffer_info.mMplMessages);
> iprintf("… mTotalBuffers\t%d\r\n",
> buffer_info.mTotalBuffers);
> #endif
> schedule_todo = 0;
> return;
> }
>
> /* History endpoint payload */
> CborEncoder encoder, map_encoder;
> cbor_encoder_init(&encoder, buffer.u, sizeof(buffer.u), 0);
> cbor_encoder_create_map(&encoder, &map_encoder, 2);
>
> cbor_encode_text_stringz(&map_encoder, "vrt-wsh:sample/timestamp");
> cbor_encode_uint(&map_encoder, (uint64_t)schedule_current);
> cbor_encode_text_stringz(&map_encoder, "vrt-wsh:sample/value");
> cbor_encode_float(&map_encoder, schedule_value[schedule_point]);
>
> cbor_encoder_close_container(&encoder, &map_encoder);
> if (otMessageAppend(request, buffer.u,
> cbor_encoder_get_buffer_size(&encoder, buffer.u))
> != OT_ERROR_NONE) {
> #ifdef SCHEDULE_TRACE
> iprintf("%s:%d FAILED POST HIST: Failed to create message\r\n",
> __FILE__, __LINE__);
> #endif
> schedule_todo = 0;
> return;

Jonathan Hui

unread,
Nov 9, 2017, 12:18:41 PM11/9/17
to Stuart Longland, openthread-users
Having all the allocated messages queued in `m6loSendMessages` typically indicates that the MAC layer is no longer transmitting frames for some reason.

The CC2538 radio driver getting stuck in `otPlatRadioTransmit()` on reading the FSMSTAT1 register indicates that the radio is not entering the receive state when the driver expects it to be.

One thing you might try is reading out the FSMSTAT0 register to see what state the radio is actually in.

--
Jonathan Hui

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-users+unsubscribe@googlegroups.com.
To post to this group, send email to openthread-users@googlegroups.com.

Stuart Longland

unread,
Nov 9, 2017, 11:30:03 PM11/9/17
to jon...@nestlabs.com, openthread-users
On 10/11/17 03:18, Jonathan Hui wrote:
> Having all the allocated messages queued in `m6loSendMessages` typically
> indicates that the MAC layer is no longer transmitting frames for some
> reason.
>
> The CC2538 radio driver getting stuck in `otPlatRadioTransmit()` on
> reading the FSMSTAT1 register indicates that the radio is not entering
> the receive state when the driver expects it to be.
>
> One thing you might try is reading out the FSMSTAT0 register to see what
> state the radio is actually in.

No worries, I've made the following edits to radio.c, I'll run again and
see what happens:
> diff --git a/src/radio.c b/src/radio.c
> index bd94a0d..9036be6 100644
> --- a/src/radio.c
> +++ b/src/radio.c
> @@ -32,6 +32,8 @@
> *
> */
>
> +#include <stdio.h>
> +
> #include <openthread/config.h>
> #include <openthread/openthread.h>
> #include <openthread/platform/alarm-milli.h>
> @@ -355,7 +357,15 @@ otError otPlatRadioTransmit(otInstance *aInstance, otRadioFrame *aFrame)
> setChannel(aFrame->mChannel);
> setTxPower(aFrame->mPower);
>
> - while ((HWREG(RFCORE_XREG_FSMSTAT1) & 1) == 0);
> + i = 0;
> + while ((HWREG(RFCORE_XREG_FSMSTAT1) & 1) == 0) {
> + i++;
> + if (i > 1000) {
> + iprintf("%s:%d Radio stuck after %ld cycles!!! FSMSTAT0=0x%lx\r\n",
> + __FILE__, __LINE__, i, HWREG(RFCORE_XREG_FSMSTAT0));
> + i = 0;
> + }
> + }
>
> // wait for valid rssi
> while ((HWREG(RFCORE_XREG_RSSISTAT) & RFCORE_XREG_RSSISTAT_RSSI_VALID) == 0);

Stuart Longland

unread,
Nov 24, 2017, 3:07:52 AM11/24/17
to openthre...@googlegroups.com
On 10/11/17 14:29, Stuart Longland wrote:
> On 10/11/17 03:18, Jonathan Hui wrote:
>> Having all the allocated messages queued in `m6loSendMessages` typically
>> indicates that the MAC layer is no longer transmitting frames for some
>> reason.
>>
>> The CC2538 radio driver getting stuck in `otPlatRadioTransmit()` on
>> reading the FSMSTAT1 register indicates that the radio is not entering
>> the receive state when the driver expects it to be.
>>
>> One thing you might try is reading out the FSMSTAT0 register to see what
>> state the radio is actually in.
>
> No worries, I've made the following edits to radio.c, I'll run again and
> see what happens:

Okay, so it's been a while because I've been flat out with other
projects, and other parts of *this* project.

Our application is one that does data logging, and while in production,
it'll likely be sending maybe 8 samples (as individual CoAP POSTs) every
900 seconds, I'm stress-testing it with 15-second sampling and 16
samples at a time.

A lot of time was spent implementing a historian on the device that
would buffer the samples in an EEPROM then upload those later, which
allows us to backfill when communications returns. Having implemented
that, I have a solution that, at least if deployed, will quietly log in
the background and push when it can, so to the outside world, it'll
appear all okay.

The uploader module, given that it sends messages in quick succession
(with a randomised delay for congestion control), seems to trigger this
problem quite a lot.

I started turning on some logging options and implementing an otPlatLog
stub via viprintf, and recorded what was seen.

https://gist.github.com/sjlongland/9f183822f2ad886ca9303984a1713b37

Reading up the datasheet for the CC2538 (TI document SWRU319C); the
detail for the FSMSTAT0 register indicates the lower 6 bits give the
transceiver state machine state (page 724). That state is described
earlier on page 684, indicating there a RX overflow condition.

It would appear the required action, according to the state machine
diagram on that same page, is to issue a SFLUSHRX command (described
page 698).

Stuart Longland

unread,
Nov 24, 2017, 3:55:04 AM11/24/17
to openthre...@googlegroups.com
On 24/11/17 18:07, Stuart Longland wrote:
> https://gist.github.com/sjlongland/9f183822f2ad886ca9303984a1713b37
>
> Reading up the datasheet for the CC2538 (TI document SWRU319C); the
> detail for the FSMSTAT0 register indicates the lower 6 bits give the
> transceiver state machine state (page 724). That state is described
> earlier on page 684, indicating there a RX overflow condition.
>
> It would appear the required action, according to the state machine
> diagram on that same page, is to issue a SFLUSHRX command (described
> page 698).

I had a crack at doing this… but it's pretty clear to me I didn't do it
right:

https://gist.github.com/sjlongland/8e8b2cf132f0449ede6e81d62ea47cb8

Seems Mac::RadioTransmit will pass through the error back to
Mac::HandleBeginTransmit (both in src/core/mac.cpp)… and in there, on
line 935 we have:

> error = RadioTransmit(&sendFrame);
> assert(error == OT_ERROR_NONE);

Seems transmitting a frame is not supposed to fail, ever. So returning
a failure isn't the answer.

Stuart Longland

unread,
Nov 24, 2017, 4:04:05 AM11/24/17
to openthre...@googlegroups.com
On 24/11/17 18:54, Stuart Longland wrote:
> I had a crack at doing this… but it's pretty clear to me I didn't do it
> right:
>
> https://gist.github.com/sjlongland/8e8b2cf132f0449ede6e81d62ea47cb8
>
> Seems Mac::RadioTransmit will pass through the error back to
> Mac::HandleBeginTransmit (both in src/core/mac.cpp)… and in there, on
> line 935 we have:
>
>> error = RadioTransmit(&sendFrame);
>> assert(error == OT_ERROR_NONE);
> Seems transmitting a frame is not supposed to fail, ever. So returning
> a failure isn't the answer.

So, doing a check at the start of otPlatRadioTransmit, i.e:
https://gist.github.com/sjlongland/192f74feb04cbce7c5929c91021f2cfa#file-gistfile1-txt-L53-L60

seems to help this problem a lot. Before, I was waiting maybe a minute
or two before the node seized up, but now, it seems to be running fine.

I'll flash my other two nodes with this and see how they go.

Stuart Longland

unread,
Nov 24, 2017, 8:08:29 AM11/24/17
to openthre...@googlegroups.com
On 24/11/17 19:03, Stuart Longland wrote:
> On 24/11/17 18:54, Stuart Longland wrote:
>> I had a crack at doing this… but it's pretty clear to me I didn't do it
>> right:
>>
>> https://gist.github.com/sjlongland/8e8b2cf132f0449ede6e81d62ea47cb8
>>
>> Seems Mac::RadioTransmit will pass through the error back to
>> Mac::HandleBeginTransmit (both in src/core/mac.cpp)… and in there, on
>> line 935 we have:
>>
>>> error = RadioTransmit(&sendFrame);
>>> assert(error == OT_ERROR_NONE);
>> Seems transmitting a frame is not supposed to fail, ever. So returning
>> a failure isn't the answer.
>
> So, doing a check at the start of otPlatRadioTransmit, i.e:
> https://gist.github.com/sjlongland/192f74feb04cbce7c5929c91021f2cfa#file-gistfile1-txt-L53-L60
>
> seems to help this problem a lot. Before, I was waiting maybe a minute
> or two before the node seized up, but now, it seems to be running fine.
>
> I'll flash my other two nodes with this and see how they go.
>

Seems that isn't the whole story either… after flushing the RX buffer,
you can still get stuck:
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/uploader.c:484 History stored for point 4
> [DEBG]-PLAT----: State=OT_RADIO_STATE_DISABLED
> [INFO]-PLAT----: Initialized
> [DEBG]-PLAT----: State=OT_RADIO_STATE_SLEEP
> [DEBG]-PLAT----: State=OT_RADIO_STATE_RECEIVE
> /home/stuartl/vrt/projects/widesky/hub/hal/demo/uploader.c:501 Next point 5, todo 0x7fe0, delay 8
> [DEBG]-MAC-----: Ack timer fired
> [INFO]-MAC-----: Frame tx failed, error:NoAck, attempt:2/4, len:0, seqnum:219, type:Data, src:0x6c1c, dst:0x6c00, sec:yes, ackreq:yes
> ============================[TX ERR len=016]=============================
> | 69 98 DB A9 D6 00 6C 1C | 6C 0D BA 84 09 00 01 0E | i.[)V.l.l.:.....
> ------------------------------------------------------------------------
> [DEBG]-PLAT----: State=OT_RADIO_STATE_RECEIVE
> [DEBG]-PLAT----: State=OT_RADIO_STATE_RECEIVE
> [CRIT]-PLAT----: Transmitter is stuck!!! FSMSTAT0=0x00000000 FSMSTAT1=0x00000080
> [CRIT]-PLAT----: Transmitter is stuck!!! FSMSTAT0=0x00000000 FSMSTAT1=0x00000080
> [CRIT]-PLAT----: Transmitter is stuck!!! FSMSTAT0=0x00000000 FSMSTAT1=0x00000080
> [CRIT]-PLAT----: Transmitter is stuck!!! FSMSTAT0=0x00000000 FSMSTAT1=0x00000080--

Stuart Longland

unread,
Nov 27, 2017, 3:36:32 AM11/27/17
to openthre...@googlegroups.com
On 24/11/17 23:08, Stuart Longland wrote:
> On 24/11/17 19:03, Stuart Longland wrote:
>> On 24/11/17 18:54, Stuart Longland wrote:
>>> I had a crack at doing this… but it's pretty clear to me I didn't do it
>>> right:
>>>
>>> https://gist.github.com/sjlongland/8e8b2cf132f0449ede6e81d62ea47cb8
>>>
>>> Seems Mac::RadioTransmit will pass through the error back to
>>> Mac::HandleBeginTransmit (both in src/core/mac.cpp)… and in there, on
>>> line 935 we have:
>>>
>>>> error = RadioTransmit(&sendFrame);
>>>> assert(error == OT_ERROR_NONE);
>>> Seems transmitting a frame is not supposed to fail, ever. So returning
>>> a failure isn't the answer.
>>
>> So, doing a check at the start of otPlatRadioTransmit, i.e:
>> https://gist.github.com/sjlongland/192f74feb04cbce7c5929c91021f2cfa#file-gistfile1-txt-L53-L60
>>
>> seems to help this problem a lot. Before, I was waiting maybe a minute
>> or two before the node seized up, but now, it seems to be running fine.
>>
>> I'll flash my other two nodes with this and see how they go.
>>
>
> Seems that isn't the whole story either… after flushing the RX buffer,
> you can still get stuck:

So, having read up more closely, some observations:

1. The TI docs categorically state that you should not use FSMSTAT0 to
control program flow. So my earlier solution, though it did help, is
not the answer. Thus today I sought to detect the buffer overflow
condition by other means.

2. The buffering error conditions are indicated by the
`RFCORE_SFR_RFERRF` register (0x4008882c; page 749 of SWRU319C). No
code in radio.c checks this… an ISR however blindly clears it. The
stock CC2538 radio module driver in OpenThread presently works in
ignorant bliss of this register.

3. Having added some checks to detect various overflow conditions, I am
now back at the original issue where the buffers do not get freed. I no
longer have the CC2538 hard-locking within `otPlatRadioTransmit` as I
did before, which I believe is a step forward. It appears though, when
there is a failure to send (e.g. due to no ACK), a buffer does not get
freed.

https://gist.github.com/sjlongland/7f392a74970d1018ace0b6e2143db902
shows the patched radio.c and the results from this.

Regards,

Jonathan Hui

unread,
Nov 27, 2017, 5:31:29 PM11/27/17
to Stuart Longland, openthread-users
The RXFIFO overflow condition is also signaled by the FIFOP and FIFO signals in FSMSTAT1.  The example driver includes a check in radio.c#L467.  Of course, this occurs towards the end of readFrame(), so maybe there are conditions where the driver does not reach this RXFIFO overflow check/flush.

Does it help to move the check/flush after the `exit:` label?  Doing so will run the risk of dropping frames in the RXFIFO that were already acknowledged, but it may be something worth trying.

--
Jonathan Hui

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-users+unsubscribe@googlegroups.com.
To post to this group, send email to openthread-users@googlegroups.com.

Stuart Longland

unread,
Nov 27, 2017, 6:52:44 PM11/27/17
to openthre...@googlegroups.com
On 28/11/17 08:31, Jonathan Hui wrote:
> The RXFIFO overflow condition is also signaled by the FIFOP and FIFO
> signals in FSMSTAT1.  The example driver includes a check in
> radio.c#L467
> <https://github.com/openthread/openthread/blob/master/examples/platforms/cc2538/radio.c#L467>. 
> Of course, this occurs towards the end of readFrame(), so maybe there
> are conditions where the driver does not reach this RXFIFO overflow
> check/flush.

Yeah, I'm just reading the docs on those two flags; FIFOP indicates when
there's at least one complete frame, FIFO indicates there's data, unless
the FIFO is overflowed.

It's a little ambiguous. There's also no indication of underflow
conditions… RFERRF gives us all of this, and seems to be a better
candidate for detecting such situations.

> Does it help to move the check/flush after the `exit:` label?  Doing so
> will run the risk of dropping frames in the RXFIFO that were already
> acknowledged, but it may be something worth trying.

That sounds like a much better place to put the error handling code.
I'll experiment with putting the handling code there and see what happens.

Stuart Longland

unread,
Nov 27, 2017, 7:47:54 PM11/27/17
to openthre...@googlegroups.com
On 28/11/17 09:52, Stuart Longland wrote:
>> Does it help to move the check/flush after the `exit:` label?  Doing so
>> will run the risk of dropping frames in the RXFIFO that were already
>> acknowledged, but it may be something worth trying.
> That sounds like a much better place to put the error handling code.
> I'll experiment with putting the handling code there and see what happens.

Seems not…

https://gist.github.com/sjlongland/632fc898e4c53117e793270d4a60bf5d

It would appear that whilst waiting for the transmitter to warm up is
the optimal place to detect this condition.

Stuart Longland

unread,
Nov 27, 2017, 9:29:51 PM11/27/17
to openthre...@googlegroups.com
On 28/11/17 10:47, Stuart Longland wrote:
> Seems not…
>
> https://gist.github.com/sjlongland/632fc898e4c53117e793270d4a60bf5d
>
> It would appear that whilst waiting for the transmitter to warm up is
> the optimal place to detect this condition.

One thing that seems to affect this is if the main loop is blocked for
any reason. In our case, two things are culprits for blocking the main
loop:

- RS-485 serial I/O, particularly when we're using one port for the
console logging output. Switching between RX and TX requires a small
delay loop to ensure we don't cut off the start or end of a
transmission. Then there's the delay in the transmission itself.

- SPI EEPROM communication, using synchronous libraries like SPIFFS
means we can be blocked waiting for a read or write to a file stored on
the EEPROM. Most of the time, this is pretty quick, but it can be as
much as a second if garbage collection is involved.

There are two ISRs for the radio, but they don't appear to do much, and
I have my doubts as to whether they're even enabled in NVIC.

I tried using the stock OpenThread firmware and two CC2538DK dev boards,
repeatedly attempting to POST a biggish payload via CoAP, and didn't
manage to trigger the problem. The problem is reduced if I turn off
debugging to the console.

Thus the appearance of this bug is going to be entirely dependent on the
nature of the application.

Jonathan Hui

unread,
Nov 27, 2017, 10:28:23 PM11/27/17
to Stuart Longland, openthread-users
Yes, the example driver is not tolerant to long delays in the main processing loop.

Depending on whether the application-specific delays are occurring in interrupt context or not, the RXFIFO overflow conditions could be mitigated by pulling out bytes from the RXFIFO in interrupt context (even better if using DMA).

One small change that might help is to repeatedly call `readFrame()` in the RXFIFO overflow state and break out when the RXFIFO is finally flushed.  This will help pull the radio out of the RXFIFO overflow state and process receive frames that have already been acknowledged.

Thoughts?

--
Jonathan Hui

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-users+unsubscribe@googlegroups.com.
To post to this group, send email to openthread-users@googlegroups.com.

Stuart Longland

unread,
Nov 27, 2017, 11:02:51 PM11/27/17
to jon...@nestlabs.com, openthread-users
On 28/11/17 13:28, Jonathan Hui wrote:
> Yes, the example driver is not tolerant to long delays in the main
> processing loop.
>
> Depending on whether the application-specific delays are occurring in
> interrupt context or not, the RXFIFO overflow conditions could be
> mitigated by pulling out bytes from the RXFIFO in interrupt context
> (even better if using DMA).
>
> One small change that might help is to repeatedly call `readFrame()` in
> the RXFIFO overflow state and break out when the RXFIFO is finally
> flushed.  This will help pull the radio out of the RXFIFO overflow state
> and process receive frames that have already been acknowledged.
>
> Thoughts?
>

That could work… maybe it's something that can be done while it waits
for the transceiver to enter a receive state.

At the moment, I'm just flushing the receive buffer, which may or may
not be the best practice, but it at least gets things un-jammed to the
point that it can progress.

Basically, I've seen two things hard lock in this loop:
1. RX buffer overflow (red path in attached diagram)
2. TX buffer underflow (blue path in attached diagram)

The TI docs suggest the only way out of a receive overflow is to issue a
SFLUSHRX. ISFLUSHRX also works, and is what OpenThread does elsewhere.
Likely, `readFrame` may do this for me. I'll have to try that.

After flushing the buffer, I often find the transceiver is in an idle
state, and I need to then place it in receive state: `setChannel` can do
this, but won't by default because it thinks it already *is* in the
receive state, so I have to force things. (I'm considering whether some
of the global variables in radio.c could be derived from register states
to eliminate that problem.)

Not sure what causes the TX underflow, it could be debug console
related. In any case, we wind up back in the receive state and can
simply try again with our transmit.

I've pushed some changes to address the above. Basically just moving
the transmit set-up code to a separate function so I can call it when
needed, then adding the additional handling code.

https://github.com/openthread/openthread/pull/2371

Regards,
cc2538-radio-fsm.png

Jonathan Hui

unread,
Nov 28, 2017, 12:43:13 AM11/28/17
to Stuart Longland, openthread-users
TX underflow should never occur with the example driver. The complete 802.15.4 frame is fully loaded into the TXFIFO before issuing the TXON strobe.  TX underflow is only a concern when entering the TX state before the full 802.15.4 frame is fully loaded into the TXFIFO.

Interesting that the radio is entering the IDLE state after flushing the RXFIFO.  The radio state machine indicates that the radio should transition back to the RX calibration state.  Maybe rxenable is 0?

--
Jonathan Hui

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-users+unsubscribe@googlegroups.com.
To post to this group, send email to openthread-users@googlegroups.com.

Stuart Longland

unread,
Dec 1, 2017, 6:42:57 AM12/1/17
to jon...@nestlabs.com, openthread-users
On 28/11/17 15:43, Jonathan Hui wrote:
> TX underflow should never occur with the example driver. The complete
> 802.15.4 frame is fully loaded into the TXFIFO before issuing the TXON
> strobe.  TX underflow is only a concern when entering the TX state
> before the full 802.15.4 frame is fully loaded into the TXFIFO.
>
> Interesting that the radio is entering the IDLE state after flushing the
> RXFIFO.  The radio state machine indicates that the radio should
> transition back to the RX calibration state.  Maybe rxenable is 0?

Interestingly, we're finding the NCP has a big influence on what happens
at the nodes.

We use the same hardware on the NCP with the two RS-485 ports set up as
RS-422 full-duplex, running at 115200 baud. I'll have to repeat this
experiment with a plain CC2538 dev board, but this evening, I decided to
switch the NCP with a nRF52840 dev board as an experiment.

This is the graph of the "send buffers" (otMessageGetBufferInfo
m6loSendBuffers) with time… sampled on the minute every minute…

… for CC2538-based NCP (custom board):
http://static.vk4msl.id.au/openthread/2017/12/01-ncp-stats/Grafana_-_Testing_-_2017-12-01_21.20.17.png

… for nRF52840-based NCP (NordicSemi nRF52840-PK dev board):
http://static.vk4msl.id.au/openthread/2017/12/01-ncp-stats/Grafana_-_Testing_-_2017-12-01_21.20.45.png

So it appears the NCP misbehaving can cause a big back-log in the
devices on the mesh. The nodes all have 20 buffers allocated (down from
the default of 40). The NCP has the standard 40… although this
afternoon I flashed a version that had 20 buffers set, and the results
were not pretty!

The nRF52840 seems to have a default setting of 128 buffers, and so it's
virtually guaranteed to be able to cope with the traffic.

This is possibly the other half of the reason why the nodes keep running
out of buffers though: the NCP isn't ACKing them fast enough, and so
they pile up.

The nodes are all running the latest example radio driver, with the
changes included from pull request 2371 and calibration of power levels
to match the CC2592 range extender.

I think Monday, I'll be flashing a CC2538 dev board with the current
master branch FTD NCP firmware, leaving it overnight to see how it goes,
then Tuesday I'll try a build based on PR 2371 and see if that breaks
anything.

Jonathan Hui

unread,
Dec 4, 2017, 12:21:30 PM12/4/17
to Stuart Longland, openthread-users
Thanks for the update.  Couple general comments:

- If the receiver is not reliably sending IEEE 802.15.4 ACKs to the sender, the sender will message queue will backup due to excessive retries.  Inspecting the MAC counters can help indicate whether this is actually a problem for you.

- The CoAP layer also caches messages for retransmissions and duplicate detection.  Requests are cached so that they can be retransmitted (for confirmable requests) or so that responses can be matched up (when a response handler is used).  Responses are cached so that they can be retransmitted without having the upper layer process the request again when receiving duplicates due to retransmissions.  Note that RFC 7252 provides flexibility in whether or not these features are enabled, depending on your application needs.

--
Jonathan Hui
Reply all
Reply to author
Forward
0 new messages