Mesh Network Formation issue with more than 30 nodes

Jason H

unread,

Oct 25, 2018, 5:11:16 PM10/25/18

to openthread-users

Hello,

I'm working on a Sub-GHz Mesh Radio that uses OpenThread Protocol Stack (NCP). We have a sub-contractor who engineered the radio hardware/software, and now I am taking over developer responsibilities. I started with 10-20 Mesh Radio nodes and a Raspberry Pi 3B (Border Router) with Mesh Radio. All the nodes were configured as REEDs (Router Eligible End Devices) with max transmit power (~28dB). When I would power them all on, the nodes would successfully determine their Roles and Links and converge to a single partition with a single Leader (several Routers and a couple Children) within a couple minutes.

However, as I got up to 30-40 Mesh Radio nodes, I would get multiple Leaders and the network would not converge and stabilize, even as the nodes would continue to change roles. I assume this means there's multiple partitions that are not successfully merging. I observe that there is a lot of traffic flowing in the network by seeing faster Tx/Rx LED activity and packet capture. I'm suspicious that there might be an issue with Multicast messages (ex MLE Parent Request) because I have observed (1) CoAP Multicast (ff02::2) PUT message transmitted by Border Router "loopback" and be received by Border Router and (2) Multicast MLE Parent Request messages being received by the Border Router from the same source address only 30-40 ms apart. I'm wondering if the network might have "forwarding loops" or duplicate routes that are causing Multicast messages to "bounce around the network" and that this is causing the high amount of traffic.

I found a workaround that is mostly effective. For a 50 node Mesh Radio network, I configured only 5 nodes and the Border Router to be REEDs while the other 45 were FEDs (Full End Devices). I also adjusted the transmit power so that the REEDs were at max tx power (~28dB) while the FEDs were at reduced tx power (~8dB). This has been effective in getting the network to converge to a single leader/partition. My thinking behind this workaround was that maybe all the radios being at max power and router eligible was giving them too many Link and Role options, and the Link costs and Roles each radio was selecting were changing. (Either change of router eligible or tx power on its own was not effective). However, this workaround is not effective if I try to go much above 50 nodes.

What do y'all think of this issue I'm experiencing? Do you think I may be on the right track with the Multicasts or if that is a red herring? Is there any advice or guidance on troubleshooting this issue that you could suggest?

I have ability to run a radio in CLI mode as well as packet capture on the Border Router (BR).

I attached a packet capture (from the BR) of the 50 node network (5 REEDs / 45 FEDs) reforming after a power cycle. An example of the Multicast MLE Parent Requests 30-40ms apart is found starting with Frame "104 251.013701 fe80::623b:71cb:8b08:4cf6 ff02::2 MLE 83 Parent Request, Bad FCS". To decode the capture in Wireshark, I followed the instructions https://community.nxp.com/docs/DOC-334901 for decryption key (ie master key) BEEFCAFEBBBBAAAA7766554433221100. (In this capture, there are CoAP Multicast Non-conf messages from the same source that also arrive the 30-40 ms apart. These messages are the start of the application-level communication between the nodes and the Border Router.)

Thanks,

Jason

20181023_PowerCycleStandalones1.cap

Jonathan Hui

unread,

Oct 25, 2018, 5:47:39 PM10/25/18

to jas...@zeromasswater.com, openthre...@googlegroups.com

Great to hear about your project!

Thanks for providing the packet capture. At first glance, the MLE Parent Request behavior does not seem correct. You should see transmission spacing on the order of several hundred milliseconds, not several tens of milliseconds.

I don't believe these are forwarding loops. The MLE Parent Request payload (i.e. Challenge and San Mask TLVs) are changing in value with each transmission. Furthermore, link-local scope multicasts should not be forwarded by any node other than the originator.

I'm wondering if the radio is stalling on transmitting messages for a lengthy period of time, causing the MLE Parent Request messages to queue up, and then are all serviced back-to-back when the radio starts transmitting again.

Is it possible for you to enable logging with timestamps to provide more visibility into what's going on?

--

Jonathan Hui

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-use...@googlegroups.com.
To post to this group, send email to openthre...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/b003ee04-42f0-43da-bf79-ad86217ab010%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jason H

unread,

Oct 25, 2018, 6:55:55 PM10/25/18

to openthre...@googlegroups.com

Thanks for your quick reply, Jonathan!

I did both a packet capture and logged the radio output at the Border Router. Both are attached, as well as info from 'wpanctl get' on the Border Router. This setup has 2 PANs on Channel 1 (both 50-nodes with 5 REEDs and 45 FEDs). PAN 0xAAAA is the network with the Border Router that I'm capturing. PAN 0xBBBB is a 2nd network on the same channel with its own border router.

I think you could be right that "the radio is stalling on transmitting messages for a lengthy period of time, causing the MLE Parent Request messages to queue up, and then are all serviced back-to-back when the radio starts transmitting again". In the radio log output, I can see instances of multiple Parent Requests received during OT_RADIO_STATE_RECEIVE before switching to TRANSMIT_STATE_READY_TO_SEND.

For example:

Line 109695: 25720590 [INFO]-PLAT----: processTxState transmit complete - TRANSMIT_STATE_SENT -> OT_RADIO_STATE_RECEIVE

Line 109706: 25856318 [INFO]-MLE-----: Receive Parent Request (fe80:0:0:0:4400:2300:551:3532)

Line 109712: 25873582 [INFO]-MLE-----: Receive Parent Request (fe80:0:0:0:3c00:5d00:1551:3632)

Line 109718: 26131329 [INFO]-MLE-----: Receive Parent Request (fe80:0:0:0:1700:3a00:451:3532)

Line 109724: 26274783 [INFO]-MLE-----: Receive Parent Request (fe80:0:0:0:b0e7:703f:d39a:4614)

Line 109730: 26313237 [INFO]-MLE-----: Receive Parent Request (fe80:0:0:0:3400:5c00:1551:3632)

Line 109737: 26388792 [INFO]-MLE-----: Receive Parent Request (fe80:0:0:0:1200:5400:451:3532)

Line 109743: 26605202 [INFO]-MLE-----: Receive Parent Request (fe80:0:0:0:3c00:5d00:1551:3632)

Line 109749: 26611756 [INFO]-MLE-----: Receive Parent Request (fe80:0:0:0:4400:2300:551:3532)

Line 109754: 26786791 [INFO]-PLAT----: otPlatRadioTransmit -> TRANSMIT_STATE_READY_TO_SEND

What do you think from these logs?

-Jason

20181025_BR1_LogAndCapture1.zip

Jonathan Hui

unread,

Oct 26, 2018, 4:00:21 PM10/26/18

to jas...@zeromasswater.com, openthre...@googlegroups.com

I assume the timestamps are in microseconds. The log in itself doesn't show anything odd.

The time between different transmits really depends on whether there's another message queued up for transmission. If there is another packet in the transmit queue, you should see the radio attempt to transmit the packet within a few milliseconds.

--

Jonathan Hui

On Thu, Oct 25, 2018 at 3:56 PM Jason H <jas...@zeromasswater.com> wrote:

Thanks for your quick reply, Jonathan!

I did both a packet capture and logged the radio output at the Border Router. Both are attached, as well as info from 'wpanctl get' on the Border Router. This setup has 2 PANs on Channel 1. PAN 0xAAAA is the network with the Border Router that I'm capturing. PAN 0xBBBB is a 2nd network on the same channel with its own border router.

To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/e8a7ad41-daaa-4a4b-a29f-cfe4990faf4a%40googlegroups.com.

Jason H

unread,

Oct 26, 2018, 5:17:42 PM10/26/18

to openthread-users

Hi Jonathan,

Yes, timestamp is in microseconds.

Since the log doesn't show anything obviously wrong, it sounds like I should focus on trying to identify why MLE Parent Request transmission spacing is several tens of milliseconds (instead of several hundred milliseconds). Would you agree? Any other suggestions for troubleshooting or information I could capture?

Thanks,

Jason

Jonathan Hui

unread,

Oct 29, 2018, 12:14:36 PM10/29/18

to jas...@zeromasswater.com, openthre...@googlegroups.com

I agree that the initial focus should be understanding the MLE Parent Request transmission behavior.

Two things from logging that could help provide insight:

Enable MLE logging to determine when MLE generates MLE Parent Request messages.
Enable MAC logging to determine when the MLE messages are actually sent by the lower layers.

If we see significant delays between the generation of MLE Parent Request messages and when they are sent by the MAC/PHY layers, that would give a hint at where to look next.

--

Jonathan Hui

To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/51d50e1e-8154-46d9-a69d-9ee9c4963d2c%40googlegroups.com.

Jason H

unread,

Oct 29, 2018, 8:23:40 PM10/29/18

to openthread-users

I captured logs on both a FED mesh radio node and a REED mesh radio node. This setup has 2 PANs on Channel 1 (both 50-nodes with 5 REEDs and 45 FEDs).

FED node has 2 debug output files: TC1_FED and TC2_FED. I don't see any time deltas between "Send Parent Request" and "transmitMessage txData.dataLength=6X" (seems dataLength for Parent Request is 61-63 bytes) longer than ~20ms.

REED node has 1 debug out file: TC3_REED. As noted in the TC3_REED_Observations text file, I did find several instances of longer time deltas in between "Send Parent Request" and "transmitMessage txData.dataLength=6X". For example, I see a time delta of 503805us (503ms) between "100067368 [INFO]-MLE-----: Send Parent Request to routers (ff02:0:0:0:0:0:0:2)" and "100571173 [INFO]-PLAT----: transmitMessage txData.dataLength=62 channel=1 power=0 ". It appears to me that the Parent Request transmission is being held-off by other DATA transmissions and ACK reception. This is probably the application-level CoAP Confirmable messaging that occurs once a mesh radio node has successfully joined the mesh network.

What do you think about these logs? Does my interpretation of the REED Parent Request delay make sense?

The next thing that I'm going to try is to disable the application-level CoAP Confirmable and Non-confirmable messaging that occurs after a meh radio node joins the network. I'll see if this has a positive effect on the MLE process and the Parent Request timing.

Thanks,

Jason

20181029_MeshRadioDebugOutput.zip

Jason H

unread,

Oct 31, 2018, 12:19:32 PM10/31/18

to openthre...@googlegroups.com

I ran a few tests where the application software would command the mesh radio to join the network and then hold (instead of then sending applicaiton CoAP messaging). So essentially this eliminates CoAP messages from the mix, so there's only MLE messaging to establish roles (and other OT msgs).

I think that I'm seeing MLE message transmission being held-off while other DATA is waiting to be ACK'd. For example, in TC2_FED, it looks like the Child Update Request is not being ACK'd and fails transmission, then a bunch of Parent Request (dataLength=62 dest_addr=0xffff) are sent in quick succession. In TC2_FED_FactoryReset, I did a factory reset on the mesh radio when initializing, so it wouldn't send the Child Update Request. In this log, the Parent Request message spacing is greater than 100's of ms.

In TC1_REED_Leader, I think that Delayed messages are being held-off while other DATA is waiting to be ACK'd. In the log, I can see significant time gaps between "Send delayed message" and DATA for the same dest_addr being sent. For example:

26702513 [INFO]-MLE-----: Send delayed message (fe80:0:0:0:2900:5500:1551:3632)

.........

32183502 [INFO]-PLAT----: transmitMessage txData.dataLength=112 channel=1 power=0

32184230 frame type: DATA

security-enabled=0

frame-pending=0

ack-request=1

panid-compression=1

frame-version=1

data-sequence-number: 90

dest_panid=0xaaaa

dest_addr=2b00:5500:1551:3632

src_addr=1300:4f00:1551:3632

Am I interpreting the log correctly? Is this behavior wrong? I'm wondering if there's a problem with how the DATA-ACK and re-transmission is working. Perhaps while getting the mesh radio to work, our sub-contractor made a change to this aspect that is now causing problems.

***Update***

Here are the tx attempt define values for the project:

===================================

(Default settings) openthread-core-default-config.h

===================================

#define OPENTHREAD_CONFIG_MAX_TX_ATTEMPTS_DIRECT 4

#define OPENTHREAD_CONFIG_MAX_TX_ATTEMPTS_INDIRECT_PER_POLL 1

===================================

(Overridden setting) openthread-core-efr32-config.h

===================================

#define OPENTHREAD_CONFIG_MAX_TX_ATTEMPTS_DIRECT 8

Thanks,

Jason

20181030_DebugOutput_NoCoapMsgs.zip

Jason Horwitz

unread,

Nov 1, 2018, 7:45:16 PM11/1/18

to openthre...@googlegroups.com

One other note...this Mesh Radio implementation is based off OpenThread commit e4d5cbf0c0048152bd19c36a29934edc1167be9c Dated: October 18, 2017. I see that there's been a lot of commits/changes since then, and I wonder if I might be chasing an issue that was fixed since then.

Thanks,

Jason Horwitz

Controls Engineer | Zero Mass Water

jas...@zeromasswater.com

The information contained in this email may be confidential and/or proprietary. Not for use or distribution without prior written approval from Zero Mass Water, Inc.

On Wed, Oct 31, 2018 at 9:19 AM Jason H <jas...@zeromasswater.com> wrote:

I ran a few tests where the application software would command the mesh radio to join the network and then hold (instead of then sending applicaiton CoAP messaging). So essentially this eliminates CoAP messages from the mix, so there's only MLE messaging to establish roles (and other OT msgs).

I think that I'm seeing MLE message transmission being held-off while other DATA is waiting to be ACK'd. For example, in TC2_FED, it looks like the Child Update Request is not being ACK'd and fails transmission, then a bunch of Parent Request (dataLength=62 dest_addr=0xffff) are sent in quick succession. In TC2_FED_FactoryReset, I did a factory reset on the mesh radio when initializing, so it wouldn't send the Child Update Request. In this log, the Parent Request message spacing is greater than 100's of ms.

In TC1_REED_Leader, I think that Delayed messages are being held-off while other DATA is waiting to be ACK'd. In the log, I can see significant time gaps between "Send delayed message" and DATA for the same dest_addr being sent. For example:

26702513 [INFO]-MLE-----: Send delayed message (fe80:0:0:0:2900:5500:1551:3632)
.........
32183502 [INFO]-PLAT----: transmitMessage txData.dataLength=112 channel=1 power=0
32184230 frame type: DATA
security-enabled=0
frame-pending=0
ack-request=1
panid-compression=1
frame-version=1
data-sequence-number: 90
dest_panid=0xaaaa
dest_addr=2b00:5500:1551:3632
src_addr=1300:4f00:1551:3632

Am I interpreting the log correctly? Is this behavior wrong? I'm wondering if there's a problem with how the DATA-ACK and re-transmission is working. Perhaps while getting the mesh radio to work, our sub-contractor made a change to this aspect that is now causing problems.

To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/2ed3b326-345d-401b-80ec-25b14c13accd%40googlegroups.com.

Jonathan Hui

unread,

Nov 2, 2018, 12:56:21 AM11/2/18

to jas...@zeromasswater.com, openthre...@googlegroups.com

That's over a year ago! :)

Yes, there have been many updates and bug fixes to the MLE component since then. For example, mid-Jan 2018 had a number of commits addressing merging/partitioning issues.

Any chance you can upgrade to a more recent commit?

--

Jonathan Hui

To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/CAPcdQ_Y_cmPa41uEj8jUxij3fujf0LJc-cDRoPoS_XLqn8Mqzg%40mail.gmail.com.

Jason Horwitz

unread,

Nov 2, 2018, 1:36:26 PM11/2/18

to jon...@google.com, openthre...@googlegroups.com

Yes, I am going to work on upgrading to the latest 'master' commit, while merging the necessary changes that the sub-contractor made to the OT code base. I will see if this is able to resolve the MLE issues I'm seeing. I'm sure this will take several days, so I'll get back to you when I'm at that point.

Thanks,

Jason Horwitz

Controls Engineer | Zero Mass Water

jas...@zeromasswater.com

The information contained in this email may be confidential and/or proprietary. Not for use or distribution without prior written approval from Zero Mass Water, Inc.

Jason H

unread,

Nov 9, 2018, 5:28:28 PM11/9/18

to openthread-users

Hi Jonathan,

I merged Openthread Commit: cbd05117759f7e2f3c3c74dbd7728e3e8ab86ab7 [cbd0511] Date: Friday, June 8, 2018 12:17:33 AM to get the majority of the MLE improvements/fixes. I am still having a problem with the network having Multiple Leaders/Partitions. My setup has 2 PANs on Channel 1 (both 50-nodes with 5 REEDs and 45 FEDs). I was hoping that these two PANs would establish the nodes' roles and stabilize, but this is not happening.

I captured radio debug output log at one of the REED nodes (20181109_Node_OutputLog_180608OT.txt) as well as a packet capture at one of the Border Routers (20181109_BR_Capture_180608OT.cap). The Parent Request spacing looks better now (100's of ms instead of 10's of ms), so I'm not sure where to look next. In the meantime, I can try to merge a more recent Openthread commit to get all MLE changes.

Thanks,

Jason

On Friday, November 2, 2018 at 10:36:26 AM UTC-7, Jason H wrote:

Yes, I am going to work on upgrading to the latest 'master' commit, while merging the necessary changes that the sub-contractor made to the OT code base. I will see if this is able to resolve the MLE issues I'm seeing. I'm sure this will take several days, so I'll get back to you when I'm at that point.

Thanks,
Jason Horwitz
Controls Engineer | Zero Mass Water

The information contained in this email may be confidential and/or proprietary. Not for use or distribution without prior written approval from Zero Mass Water, Inc.

On Thu, Nov 1, 2018 at 9:56 PM Jonathan Hui wrote:

That's over a year ago! :)

Yes, there have been many updates and bug fixes to the MLE component since then. For example, mid-Jan 2018 had a number of commits addressing merging/partitioning issues.

Any chance you can upgrade to a more recent commit?

--
Jonathan Hui

On Thu, Nov 1, 2018 at 4:45 PM Jason Horwitz wrote:

One other note...this Mesh Radio implementation is based off OpenThread commit e4d5cbf0c0048152bd19c36a29934edc1167be9c Dated: October 18, 2017. I see that there's been a lot of commits/changes since then, and I wonder if I might be chasing an issue that was fixed since then.

Thanks,
Jason Horwitz
Controls Engineer | Zero Mass Water

To unsubscribe from this group and stop receiving emails from it, send an email to openthread-users+unsubscribe@googlegroups.com.
To post to this group, send email to openthread-users@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/2ed3b326-345d-401b-80ec-25b14c13accd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.

To unsubscribe from this group and stop receiving emails from it, send an email to openthread-users+unsubscribe@googlegroups.com.
To post to this group, send email to openthread-users@googlegroups.com.

20181109_Captures_180608Openthread.zip

Jonathan Hui

unread,

Nov 13, 2018, 3:06:00 AM11/13/18

to jas...@zeromasswater.com, openthre...@googlegroups.com

Thanks for the update.

In looking at the logs, I noticed one thing that could be improved upon is suppressing responses to MLE Parent Request messages - in particular, when in the detached state (i.e. just after boot) and when trying to attach to another partition (i.e. during the merging process). I've submitted a small change to address this in PR3295.

I'm not sure if the above is the root cause of your issue, but it should help clean up some of the behavior. At the very least, it should make the logs a bit easier to analyze.

--

Jonathan Hui

To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/2ed3b326-345d-401b-80ec-25b14c13accd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.

To unsubscribe from this group and stop receiving emails from it, send an email to openthread-use...@googlegroups.com.
To post to this group, send email to openthre...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/CAPcdQ_Y_cmPa41uEj8jUxij3fujf0LJc-cDRoPoS_XLqn8Mqzg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--

You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-use...@googlegroups.com.
To post to this group, send email to openthre...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/14c2252f-745a-421c-91d3-ee6fb484bc6f%40googlegroups.com.

Reply all

Reply to author

Forward