Multiple abnormal behaviors of OpenThread

Yicong Liu

Jul 28, 2021, 2:04:58 PM

to openthread-users

Hi Jonathan,

My colleague and I are working on the nRF52833 SoC with Nordic NCS 1.5.1 to develop low-power OpenThread MTDs and OTBRs.

We are planning what we expect to be the largest commercial deployment of OpenThread, but product delivery is blocked by the following serious issues:

1. MTDs that have been commissioned correctly with the OTBR become disconnected without any external intervention and do not recover automatically. Power cycling the device "resolves" the issue. I think it is close to the case described in this link.

2. Sometimes an MTD is attached to an OTBR and its RLOC16 appears in the OTBR's Child Table, yet the MTD times out when sending packets to the cloud service. I captured packets on the WPAN0 interface of the OTBR and found that no packets were sent from this MTD.

3. Sometimes the MTD reports "No packets available" errors during the commissioning process, which leaves the MTD in role 1 (the detached state). I think this is close to the cases described in this link and this link.

4. Multiple OTBRs with the same Thread Dataset remain separate leaders, without any external intervention, and do not merge into a single network for a long time (more than 12 hours). I think this is close to the case described in this link.

5. We are load testing an OTBR based on an nRF52833 RCP, with the OTBR childmax set to 250. When a single OTBR has more than 183 MTDs attached, the OTBR enters the detached state, and ot-ctl commands then frequently fail with "Connection reset by peer" and "Connection refused" errors. Please refer to the attached sniffer capture. Master Key: 57506dcaab3455a04454d35d2beecf94

I have also raised these questions with the Nordic Dev support team. Meanwhile, I would like your opinions and suggestions on these issues, and perhaps some ideas for a short-term patch until the problems are fixed.

Thanks in advance.

Best,

Yicong Liu

OTBR detached.zip

Abtin Keshavarzian

Jul 28, 2021, 3:45:34 PM

to openthread-users

Hi Yicong,

I will try to add some hints based on the info for you to investigate.

My guess/theory is that many of the issues you raise may not be directly related to OpenThread and may instead be related to the platform layer (e.g. OS, drivers, etc.) and/or the application layer (how the OT stack is integrated and used). We have integrated OpenThread in different internal products/projects (which are shipped and operating) and we do not see similar behaviors or issues.

> MTDs that have been commissioned correctly with the OTBR become disconnected without any external intervention and do not recover automatically. Power cycling the device "resolves" the issue. I think it is close to the case described in this link.

My guess is that the device itself (or the radio) may be hanging or getting stuck somehow (which would be a platform issue), particularly since a device reset clears the issue. I suggest investigating this possibility.

> Sometimes an MTD is attached to an OTBR and its RLOC16 appears in the OTBR's Child Table, yet the MTD times out when sending packets to the cloud service. I captured packets on the WPAN0 interface of the OTBR and found that no packets were sent from this MTD.

This may be the same issue as the previous one (the device somehow gets stuck, so no more data comes from it).

> Sometimes the MTD reports "No packets available" errors during the commissioning process, which leaves the MTD in role 1 (the detached state). I think this is close to the cases described in this link and this link.

I would suggest investigating whether and how the application layer allocates message buffers (i.e., whether the app layer calls the OT API to allocate message buffers and does not free them). Before the device joins, the OT stack itself will not have consumed any of its message buffers. So if there are no buffers available during the joining process, I would guess the culprit is the application layer (which may be allocating buffers and holding on to them).
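To make the leak scenario concrete, here is a toy Python model of a fixed-size message pool. It is not OpenThread code: the class, the function names, and the pool size of 8 are all hypothetical. As far as I know, a message obtained from otUdpNewMessage() is only taken over by the stack when otUdpSend() succeeds; on failure the caller still owns it and has to call otMessageFree(). The sketch only illustrates how skipping that free on the error path drains the pool until allocation fails.

```python
class MessagePool:
    """Toy stand-in for a fixed pool of message buffers (hypothetical)."""

    def __init__(self, size):
        self.free_count = size

    def new_message(self):
        # Returns None when the pool is exhausted, like a failed allocation.
        if self.free_count == 0:
            return None
        self.free_count -= 1
        return object()

    def free(self, message):
        self.free_count += 1


def send_while_detached(pool, message):
    # Hypothetical send that always fails because the device has not joined.
    # On failure the caller still owns the message.
    return False


def leaky_app(pool, attempts):
    """Application that forgets to free messages when the send fails."""
    for _ in range(attempts):
        msg = pool.new_message()
        if msg is None:
            return "NoBufs"
        send_while_detached(pool, msg)  # failure ignored, buffer never freed
    return "ok"


def careful_app(pool, attempts):
    """Application that frees the message on the error path."""
    for _ in range(attempts):
        msg = pool.new_message()
        if msg is None:
            return "NoBufs"
        if not send_while_detached(pool, msg):
            pool.free(msg)
    return "ok"


leaky_pool = MessagePool(size=8)
careful_pool = MessagePool(size=8)
print(leaky_app(leaky_pool, 100), leaky_pool.free_count)
print(careful_app(careful_pool, 100), careful_pool.free_count)
```

The leaky variant runs out of buffers before the device ever joins, which is the pattern worth looking for in the application code.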

> 4. Multiple OTBRs with the same Thread Dataset remain separate leaders, without any external intervention, and do not merge into a single network for a long time (more than 12 hours). I think this is close to the case described in this link.

We have several test cases that cover partition merge behavior in OT. We have also seen partition merge working in the field (i.e., partitions happen in deployments and then merge). If you can reproduce this and share more detail, that would be useful. My quick guess is that the datasets may not be exactly the same (something may differ).
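One way to check the "not exactly the same" theory is to diff the datasets TLV by TLV rather than by eye. The sketch below assumes the hex strings come from something like `dataset active -x` on each device, and that they use the usual MeshCoP encoding of a one-byte type, a one-byte length, and the value; it does not handle the extended-length form. The type-to-name table is partial and written from memory, and the two sample strings are made up for illustration.

```python
# Partial map of MeshCoP TLV types to names (assumed; unknown types print as numbers).
TLV_NAMES = {
    0: "Channel",
    1: "PAN ID",
    2: "Extended PAN ID",
    3: "Network Name",
    4: "PSKc",
    5: "Network Master Key",
    7: "Mesh-Local Prefix",
    12: "Security Policy",
    14: "Active Timestamp",
    53: "Channel Mask",
}


def parse_tlvs(hex_string):
    """Split a dataset hex string into {type: value_bytes}."""
    data = bytes.fromhex(hex_string)
    tlvs = {}
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type, length = data[pos], data[pos + 1]
        if length == 0xFF:
            raise ValueError("extended-length TLVs are not handled in this sketch")
        value = data[pos + 2 : pos + 2 + length]
        if len(value) != length:
            raise ValueError("truncated TLV value")
        tlvs[tlv_type] = value
        pos += 2 + length
    return tlvs


def diff_datasets(hex_a, hex_b):
    """Return (name, value_a_hex, value_b_hex) for every TLV that differs."""
    a, b = parse_tlvs(hex_a), parse_tlvs(hex_b)
    diffs = []
    for tlv_type in sorted(set(a) | set(b)):
        if a.get(tlv_type) != b.get(tlv_type):
            name = TLV_NAMES.get(tlv_type, f"type {tlv_type}")
            va = a[tlv_type].hex() if tlv_type in a else None
            vb = b[tlv_type].hex() if tlv_type in b else None
            diffs.append((name, va, vb))
    return diffs


# Made-up example datasets: same Channel and PAN ID, different Active Timestamp.
dataset_a = "0e080000000000010000" "000300000f" "0102abcd"
dataset_b = "0e080000000000020000" "000300000f" "0102abcd"

for name, va, vb in diff_datasets(dataset_a, dataset_b):
    print(f"{name}: {va} != {vb}")
```

An empty result for every pair of devices would rule out a dataset mismatch and point the investigation elsewhere.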

Hope this is helpful.

Thanks,

Abtin.

Jonathan Hui

Jul 28, 2021, 4:38:43 PM

to Yicong Liu, openthread-users

If possible, providing OpenThread logs would give greater visibility into the issues that you are observing.

--

Jonathan Hui

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/736e8d94-d7e2-43ee-977c-47764c64b945n%40googlegroups.com.

Yicong Liu

Jul 29, 2021, 2:55:23 PM

to openthread-users

Hi Jonathan and Abtin,

Thank you for your kind replies regarding behaviours 1-4.

We did verify that every OTBR and MTD has the same Thread Dataset hex and Active Timestamp.

Regarding behaviour 5:

The OTBR is using a commit from last year (07e9717) and the RCP is using this commit. As you can see in logs 1-4, there are many "Error decoding hdlc frame: NoBufs" errors.

So we tried the method from issue 5678, changing OPENTHREAD_CONFIG_NCP_TX_BUFFER_SIZE in ncp_config.h to 5k and kMaxFrameSize in spinel_interface.hpp to 10k. As you can see in log 5, there is no significant improvement, and the log shows the warning "Handle transmit done failed: NoAck".

Then we tried a Thread 1.2 commit on both the OTBR and the RCP, with the 4 options from issue 4508 changed. Unfortunately, log 6 shows no significant improvement either.

We really want to get the number of connected sub-devices over 200. Do you have any successful cases? Any guidance would be greatly appreciated.

Thanks in advance.

Best,

Yicong

OTBR log detached.zip

Nikhil Komalan

Jul 29, 2021, 9:26:24 PM

to Yicong Liu, openthread-users

Hello,

Regarding Behaviour 1:

I was also using NCS 1.5.1 when I ran into an issue similar to behaviour 1. I think this issue is related to that particular SDK version; try using version 1.6.0 or the master branch.

Using 1.6.0, my devices reconnected after 2-3 hours. Then they worked for 20-30 minutes, and then got disconnected again for around 5-6 hours.

Jonathan Hui

Jul 30, 2021, 4:19:34 PM

to Yicong Liu, openthread-users

On Thu, Jul 29, 2021 at 11:55 AM Yicong Liu <liuyic...@gmail.com> wrote:

> The OTBR is using a commit from last year (07e9717) and the RCP is using this commit. As you can see in logs 1-4, there are many "Error decoding hdlc frame: NoBufs" errors.
> So we tried the method from issue 5678, changing OPENTHREAD_CONFIG_NCP_TX_BUFFER_SIZE in ncp_config.h to 5k and kMaxFrameSize in spinel_interface.hpp to 10k. As you can see in log 5, there is no significant improvement, and the log shows the warning "Handle transmit done failed: NoAck".

Is your RCP connected via UART, SPI, or USB? What baud rate are you using?
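The link type and baud rate matter because every frame the RCP receives over the air has to cross the host link. As rough arithmetic, assuming 8N1 UART framing (10 bits on the wire per byte) and ignoring HDLC and Spinel overhead, the raw byte rates compare as follows. The baud rates listed are common examples, not values from this thread.

```python
RADIO_BITS_PER_SECOND = 250_000  # IEEE 802.15.4 2.4 GHz O-QPSK PHY data rate


def uart_bytes_per_second(baud, bits_per_byte_on_wire=10):
    """Raw UART byte rate, assuming 8N1 framing (1 start + 8 data + 1 stop)."""
    return baud / bits_per_byte_on_wire


radio_bytes_per_second = RADIO_BITS_PER_SECOND / 8

for baud in (115_200, 460_800, 1_000_000):
    ratio = uart_bytes_per_second(baud) / radio_bytes_per_second
    print(f"{baud} baud: {uart_bytes_per_second(baud):.0f} B/s, "
          f"{ratio:.2f}x the radio's peak rate")
```

Under these assumptions a 115200 baud UART carries less than the radio's peak rate, so bursts of traffic from many children could back up on the RCP side of the link.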

> We really want to get the number of connected sub-devices over 200. Do you have any successful cases? Any guidance would be greatly appreciated.

OpenThread has been used in environments with hundreds of SEDs attached to a single parent - see https://www.threadgroup.org/BLUEGRioT