While testing various aspects of unintended channel changes causing orphan scenarios, we found that after fixing the channel mask issue it works consistently. There are, however, some issues we ran into.
To test this scenario more robustly, I created a command that instantly forces a channel change on the leader and increments the ActiveTimestamp (it's not something you'd recommend, I suspect, but it's a very good way to test this).
The leader is an NXP KW41z for now. The child was either a custom NXP-based device or another NXP KW41z. The complete test walks through every channel with this method.
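For anyone trying to reproduce this, here is a rough model of what the test command does. It is only a sketch, not the firmware code: `ActiveDataset`, `force_channel_change` and `walk_all_channels` are hypothetical names, the dataset is reduced to the two fields that matter here, and the channel list assumes the 2.4 GHz 802.15.4 band (channels 11 through 26).

```python
from dataclasses import dataclass
from typing import Iterator

# 2.4 GHz IEEE 802.15.4 channels run from 11 through 26.
CHANNELS = list(range(11, 27))


@dataclass(frozen=True)
class ActiveDataset:
    """Toy stand-in for the leader's active dataset (hypothetical, two fields only)."""
    channel: int
    active_timestamp: int


def force_channel_change(dataset: ActiveDataset, new_channel: int) -> ActiveDataset:
    """Model of the test command: switch channel at once and bump the timestamp."""
    if new_channel not in CHANNELS:
        raise ValueError(f"channel {new_channel} is outside 11..26")
    return ActiveDataset(channel=new_channel,
                         active_timestamp=dataset.active_timestamp + 1)


def walk_all_channels(dataset: ActiveDataset) -> Iterator[ActiveDataset]:
    """Step through the channel list, skipping a no-op change to the current channel."""
    for channel in CHANNELS:
        if channel == dataset.channel:
            continue
        dataset = force_channel_change(dataset, channel)
        yield dataset
```

In the real test each step is followed by waiting for the child to time out, orphan, and reattach before moving on to the next channel.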
During initial testing with one leader and one child, convergence after orphaning usually took a very long time (an hour or even more). In other cases (less frequently) it would converge in a few minutes.
Our concern was the lengthy convergence, which was repeatable. Tests were then performed with two children, and this caused convergence to succeed within minutes almost all of the time. Unfortunately, there will likely be only one child in range in the field.
In the single-child case, what we noticed was that after the proper timeout the child went into orphan mode and was sending ParentRequests per the spec and MLE Announces as expected.
Oddly, in the lengthy cases the leader was responding with 4 ParentResponses in quick succession (0.007 seconds spacing, I believe), all with the same sequence number. When the convergence took too long, no ACK was ever seen from the child for the ParentResponse, and we would be stuck in this loop until eventually an ACK snuck in.
This led us to believe we were having CCA/collisions, hence the OPENTHREAD_CONFIG_MAX_TX_ATTEMPTS_DIRECT quantity of retries on the ParentResponse. The custom NXP child has AUTOACK enabled, so we presumed its ACK was being stepped on.
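To make the "4 ParentResponses with the same sequence number" pattern concrete, here is a minimal model of a MAC-level retry loop. It is a sketch under assumptions: the value 4 is taken from the count seen in our captures rather than read from the build config, and `transmit_with_retries` is a hypothetical helper, not an OpenThread function.

```python
from typing import Callable, Tuple

# Assumption: set to 4 because that is how many ParentResponses appear in the pcaps.
MAX_TX_ATTEMPTS_DIRECT = 4


def transmit_with_retries(ack_received: Callable[[int], bool],
                          max_attempts: int = MAX_TX_ATTEMPTS_DIRECT) -> Tuple[int, bool]:
    """Send one frame, retrying until it is acked or the attempts run out.

    ack_received(attempt) reports whether an ACK came back for that attempt
    (attempts are numbered from 1). Every attempt is the same frame, so a
    sniffer sees the same sequence number each time.
    Returns (attempts_used, acked).
    """
    for attempt in range(1, max_attempts + 1):
        if ack_received(attempt):
            return attempt, True
    return max_attempts, False


# Child never acks: the sniffer shows all 4 copies and then silence.
print(transmit_with_retries(lambda attempt: False))  # (4, False)
# Child acks the first copy: only one ParentResponse on the air.
print(transmit_with_retries(lambda attempt: True))   # (1, True)
```

This matches both patterns in the captures: four identical frames when no ACK comes back, and a single frame when the ACK arrives right away.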
A comprehensive RF site study was performed, and there is overlap with Wi-Fi and many other 802.15.4 networks at the site, as would be typical in many environments.
However, while the 4 ParentResponse messages are likely due to this, it did not explain the lack of an ACK.
Further investigation showed that in the long convergence case the child was never receiving the ParentResponses until the variable, lengthy time period was over and an ACK eventually snuck in (in the successful case you would see only one ParentRequest and one ParentResponse, by the way, and the ACK would show up quickly).
This led us to realize the child was changing channels to send Announces before it received the ParentResponse.
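The race can be pictured with a toy timing model. This is only an illustration: the dwell time (how long the child stays on the channel after its ParentRequest) and the leader's response delay are made-up placeholder numbers, not values read from OpenThread, and `attempt_succeeds` and `success_rate` are hypothetical names.

```python
import random


def attempt_succeeds(dwell_ms: float, response_delay_ms: float) -> bool:
    """The child hears the ParentResponse only if it arrives before the child hops away."""
    return response_delay_ms <= dwell_ms


def success_rate(dwell_ms: float, max_response_delay_ms: float,
                 trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of ParentRequests answered in time.

    Assumes the leader replies after a delay drawn uniformly from
    0..max_response_delay_ms (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    hits = sum(
        attempt_succeeds(dwell_ms, rng.uniform(0, max_response_delay_ms))
        for _ in range(trials)
    )
    return hits / trials


# Placeholder numbers: a short dwell loses almost every response,
# while a dwell that covers the whole delay window never loses one.
print(success_rate(dwell_ms=10, max_response_delay_ms=500))
print(success_rate(dwell_ms=500, max_response_delay_ms=500))
```

If the child's dwell is short compared with the response delay, each attempt has only a small chance of landing, which would fit the "eventually an ACK snuck in" behavior.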
There are several #defines related to this (mostly in mle_constants.hpp). We attempted to adjust these, to no avail. It seems as if there needs to be a channel hold time variable.
As a side note, I was able to put a physical delay in the ParentRequest code (yes, a big hack) and it approximated the quick convergences.
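The channel hold time idea could look something like the guard below. This is a sketch of the concept only: `ChannelHoldGuard` and its methods are hypothetical and do not exist in OpenThread, and the hold value would have to be chosen to cover the leader's response delay plus its retries.

```python
class ChannelHoldGuard:
    """Defers channel changes until a hold time has passed since the last ParentRequest."""

    def __init__(self, hold_ms: int) -> None:
        self.hold_ms = hold_ms
        self._hold_until_ms = 0  # no hold is active until a ParentRequest is sent

    def on_parent_request_sent(self, now_ms: int) -> None:
        """Start (or restart) the hold window."""
        self._hold_until_ms = now_ms + self.hold_ms

    def may_change_channel(self, now_ms: int) -> bool:
        """True once the hold window has expired."""
        return now_ms >= self._hold_until_ms

    def ms_until_change_allowed(self, now_ms: int) -> int:
        """How long the Announce hop has to be postponed, in milliseconds."""
        return max(0, self._hold_until_ms - now_ms)


# Example with a placeholder hold of 750 ms.
guard = ChannelHoldGuard(hold_ms=750)
guard.on_parent_request_sent(now_ms=1_000)
print(guard.may_change_channel(now_ms=1_200))       # False
print(guard.ms_until_change_allowed(now_ms=1_200))  # 550
print(guard.may_change_channel(now_ms=1_750))       # True
```

The Announce sender would check `may_change_channel` before hopping and reschedule itself by `ms_until_change_allowed` otherwise, which is roughly what the hacked-in delay achieves by brute force.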
We would like it to behave like the two-child case all the time. Any ideas or suggestions are appreciated.
I have many pcap files, but will have to capture log files if needed, and the custom NXP device does not have logging capability.