Hello,
I'm working on a Sub-GHz Mesh Radio that uses the OpenThread protocol stack (NCP). A subcontractor engineered the radio hardware/software, and I am now taking over developer responsibilities. I started with 10-20 Mesh Radio nodes and a Raspberry Pi 3B (Border Router) with a Mesh Radio attached. All the nodes were configured as REEDs (Router Eligible End Devices) with max transmit power (~28dB). When I powered them all on, the nodes successfully determined their Roles and Links and converged to a single partition with a single Leader (several Routers and a couple of Children) within a couple of minutes.
However, once I got up to 30-40 Mesh Radio nodes, I would get multiple Leaders, and the network would not converge and stabilize; the nodes just kept changing roles. I assume this means there are multiple partitions that are not successfully merging. There is also a lot of traffic flowing in the network, which I can see from faster Tx/Rx LED activity and in the packet capture. I'm suspicious that there might be an issue with Multicast messages (e.g., MLE Parent Request) because I have observed (1) a CoAP Multicast (ff02::2) PUT message transmitted by the Border Router "looping back" and being received by the Border Router, and (2) Multicast MLE Parent Request messages from the same source address being received by the Border Router only 30-40 ms apart. I'm wondering if the network might have "forwarding loops" or duplicate routes that cause Multicast messages to "bounce around the network", and whether this is what causes the high amount of traffic.
I found a workaround that is mostly effective. For a 50 node Mesh Radio network, I configured only 5 nodes and the Border Router as REEDs, while the other 45 were FEDs (Full End Devices). I also adjusted the transmit power so that the REEDs were at max tx power (~28dB) while the FEDs were at reduced tx power (~8dB). This has been effective in getting the network to converge to a single leader/partition. My thinking behind this workaround was that having all the radios at max power and router eligible may have given them too many Link and Role options, so that the Link costs and Roles each radio selected kept changing. (Changing only router eligibility or only tx power was not effective on its own.) However, this workaround stops being effective if I try to go much above 50 nodes.
What do y'all think of this issue I'm experiencing? Do you think I may be on the right track with the Multicasts, or is that a red herring? Is there any advice or guidance on troubleshooting this issue that you could suggest?
I have the ability to run a radio in CLI mode as well as to capture packets on the Border Router (BR).
I attached a packet capture (from the BR) of the 50 node network (5 REEDs / 45 FEDs) reforming after a power cycle. An example of the Multicast MLE Parent Requests 30-40 ms apart starts at Frame 104 (Time 251.013701, Source fe80::623b:71cb:8b08:4cf6, Destination ff02::2, Protocol MLE, Length 83, Info "Parent Request, Bad FCS"). To decode the capture in Wireshark, I followed the instructions at
https://community.nxp.com/docs/DOC-334901 for the decryption key (i.e., master key) BEEFCAFEBBBBAAAA7766554433221100. (In this capture, there are also CoAP Multicast Non-conf messages from the same source that arrive 30-40 ms apart. These messages are the start of the application-level communication between the nodes and the Border Router.)
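In case it helps anyone reproduce what I'm seeing, here is a rough Python sketch of the check I have in mind for finding these closely spaced repeats in the packet list. It is only an illustration: the Row type, its field names, the find_close_repeats function, and the 50 ms window are all my own assumptions, and it expects the rows to have already been pulled out of the capture (frame number, time, source, info). It only flags same-source, same-info frames that arrive close together; it does not tell a retransmission from a forwarded copy.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Row(NamedTuple):
    """One packet-list row. Field names are hypothetical, chosen for this sketch."""
    frame: int      # frame number in the capture
    time_s: float   # capture timestamp in seconds
    source: str     # source address as text
    info: str       # info column, e.g. "Parent Request"


def find_close_repeats(rows: Iterable[Row],
                       window_s: float = 0.050) -> List[Tuple[Row, Row]]:
    """Return (earlier, later) pairs that share source and info and are
    at most window_s seconds apart. The 50 ms default is an assumption."""
    last_seen = {}  # (source, info) -> most recent Row with that key
    pairs = []
    for row in sorted(rows, key=lambda r: r.time_s):
        key = (row.source, row.info)
        prev = last_seen.get(key)
        if prev is not None and row.time_s - prev.time_s <= window_s:
            pairs.append((prev, row))
        last_seen[key] = row
    return pairs
```

Running find_close_repeats over the rows for the MLE and CoAP multicast frames should list every pair like the one starting at Frame 104, which would show whether the repeats come from a few sources or from the whole network.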
Thanks,
Jason