Hello,
In one of our projects we plan to use an OpenThread network with an unusual topology: all devices are arranged in a straight line, and we want that line to be as long as possible. I understand that Thread is primarily focused on home networks and that a line topology isn't the main use case. Still, it should work.
We ran some tests, drew some graphs, and found a few surprising behaviours, which I suspect are the result of errors in the OpenThread implementation. I'm writing this post to cross-check that; maybe our understanding is wrong somewhere.
So, let's start with some basics. What are the theoretical limits for a Thread network when devices are arranged in a line? I'll list all the parameters that are relevant (in my opinion) and their values, as I will refer to them later on:
- Maximum routers: 32
- Max route cost: 16 (awfully low!)
- Router Upgrade Threshold: 16
- Router Downgrade Threshold: 23
In our test environment we used a standard Linux PC with the POSIX implementation, simply running a bunch of `ot-cli-ftd` instances simultaneously (sadly, the limit for such simulations seems to be around 34 devices). We checked the state (child/router/leader) of each device every second and plotted the resulting matrix on a graph:
https://imgur.com/a/5YMXt - I will describe the graph in a moment. Each device was configured to see only 1 neighbour on each side. To simplify things on our end, we actually joined the two ends of the line, effectively forming a circle topology (i.e. the first node can communicate only with the second and the last).
Each node ran the following scenario:
===
>channel 11
>panid 0x1234
>ifconfig up
>thread start
wait 3 seconds
>macfilter addr whitelist
>macfilter addr clear
>macfilter rss clear
wait 1 second
>routerdowngradethreshold 200
>routerupgradethreshold 16
>macfilter addr add <ext addr of next node in line>
>macfilter addr add <ext addr of prev node in line>
>macfilter rss add <ext addr of next node in line> -20
>macfilter rss add <ext addr of prev node in line> -20
wait few hours
===
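For anyone who wants to reproduce this, here is a minimal sketch of how the per-node step list could be generated for the closed line. It only assembles the commands from the scenario above; it does not drive `ot-cli-ftd` itself. The function names, the step representation (strings for CLI commands, `("wait", seconds)` tuples for pauses) and the example extended addresses are assumptions made for illustration, not part of our actual test harness.

```python
def ring_neighbours(index, count):
    """Return (prev, next) indices of a node in a line whose ends are joined."""
    return (index - 1) % count, (index + 1) % count


def node_scenario(index, ext_addrs, rss=-20, downgrade=200, upgrade=16):
    """Build the step list for one node, mirroring the scenario above.

    ext_addrs is the ordered list of extended addresses of all nodes
    (hypothetical values in the example below).
    """
    prev_i, next_i = ring_neighbours(index, len(ext_addrs))
    neighbours = [ext_addrs[next_i], ext_addrs[prev_i]]

    steps = [
        "channel 11",
        "panid 0x1234",
        "ifconfig up",
        "thread start",
        ("wait", 3),
        "macfilter addr whitelist",
        "macfilter addr clear",
        "macfilter rss clear",
        ("wait", 1),
        f"routerdowngradethreshold {downgrade}",
        f"routerupgradethreshold {upgrade}",
    ]
    # Whitelist only the two direct neighbours, then pin their RSS.
    steps += [f"macfilter addr add {addr}" for addr in neighbours]
    steps += [f"macfilter rss add {addr} {rss}" for addr in neighbours]
    return steps


if __name__ == "__main__":
    # Made-up extended addresses, purely for illustration.
    addrs = [f"{i:016x}" for i in range(1, 5)]
    for step in node_scenario(0, addrs):
        print(step)
```

Node 0 ends up whitelisting node 1 and the last node, which is what closes the line into a circle.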
Since each node sees only 2 neighbours, and we need the partition to stay continuous, each node should become a router; we should have 1 leader and maybe 1 child. (If it were a proper line, not a circle, we could have 1 child at each end.)
In this post I will describe a few problems. We have seen identical problems with the default settings, just on a smaller scale.
The colours on the graph encode the device state:
0 - black - disabled
1 - violet - detached
2 - red - child
3 - orange - router
4 - yellow - leader
The X axis is time in seconds, the Y axis is the device number. Devices are in order, so the device at Y=N sees only devices N-1 and N+1 (modulo the total number of devices) as its neighbours.
If you look at the graph, you can see that:
- there is always more than 1 leader, i.e. we have several partitions, which shouldn't be the case
- routers spontaneously downgrade to children (this shouldn't happen, as we set the downgrade threshold to 200, and even if we hadn't, there are often fewer than 23 routers in a partition when this problem happens)
- you can see partitions being merged as a red diagonal line (e.g. the one starting at X=~32000 Y=11 and going down/right); such a wave usually resets other nodes to the proper state, e.g. the device at Y=2 was a long-standing leader that got swept by such a wave
- sometimes such propagation stops for no apparent reason and gets stuck there. For example, the up/right wave starting from X=~32000 Y=11 stops at Y=19, and that node stays in the child state for >2000 seconds, not even trying to become a router, let alone allowing the Y=20 device to join its partition
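The observations above were read off the graph by eye, but the same checks could be run on the raw state matrix. Below is a minimal sketch, assuming the matrix is available as a list of rows with one row per second and one state code per device (using the 0-4 codes from the legend). The function names and the 2000-second threshold are illustrative assumptions, not something taken from OpenThread.

```python
# State codes, matching the graph legend.
DISABLED, DETACHED, CHILD, ROUTER, LEADER = range(5)


def leaders_per_sample(states):
    """states[t][n] is the state of device n at second t.

    Returns the number of leaders (i.e. partitions) in each sample.
    """
    return [row.count(LEADER) for row in states]


def stuck_children(states, min_seconds=2000):
    """Find devices that stayed in the child state for a long time.

    Returns a list of (device, start_second, duration) tuples for every
    uninterrupted run of CHILD samples lasting at least min_seconds.
    """
    runs = []
    if not states:
        return runs
    for dev in range(len(states[0])):
        start = None
        for t, row in enumerate(states):
            if row[dev] == CHILD:
                if start is None:
                    start = t
            else:
                if start is not None and t - start >= min_seconds:
                    runs.append((dev, start, t - start))
                start = None
        # A run may still be open when the recording ends.
        if start is not None and len(states) - start >= min_seconds:
            runs.append((dev, start, len(states) - start))
    return runs
```

Any sample where `leaders_per_sample` reports more than 1 means the network is split into several partitions at that moment, and each entry from `stuck_children` is a candidate for the "stuck" behaviour described above.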
A few conclusions:
- it seems that between every 2 leaders there is at least 1 'stuck' child, which is blocking the partition merge
- it seems that the further a router is from the leader, the more likely it is to turn into a child for no apparent reason (typically the 2-3 devices around the leader never have that problem). I suspect that it is actually the leader timing out that particular router and ordering it to reconfigure.
- the top routers on the chart (Y=31, 32) apparently belong to the leader at Y=2, as our line wraps around.
As our POSIX simulator uses UDP to simulate radio transmission, some of this behaviour is probably related to UDP, and real devices with radios will behave differently (better or worse...).
So, questions:
- why are some nodes stuck in the child state?
- why do some children stop partition merging?
- why do routers spontaneously become children, even though the downgrade threshold is not reached?
Any further insights?