ParentRequest/ParentResponse and MLE Announce Timing issue

171 views
Skip to first unread message

Tom Walsh

unread,
May 8, 2018, 3:35:10 PM5/8/18
to openthread-users

While testing various aspects of unintended channel changes causing orphan scenarios. We found after fixing the channel mask issue that it consistently works. There are however some issues we ran into.
In order to test this scenario more robustly I created a command that would instantly force a channel change on the leader and increment the ActiveTimestamp ( its not something you recommend I suspect, buts a very good way to test this )

The Leader: an NXP KW41z for now. The child was either a custom NXP based device or another NXP KW41z. The complete testing has us walk through every channel with this method.

During initial testing with one leader and one child, most times the convergence of the orphaning could take a very long time ( an hour or even more ). In other cases ( less frequently ) it would converge in a few minutes.
Our concern was the lengthy convergence. This was a repeatable scenario. Tests were then performed with two children, this actually caused the convergence to succeed in minutes almost all of the time, unfortunately there will likely only be one child in range in the field.

In the singular child case What we noticed was after the proper timeout, the child went into orphan mode and was sending ParentRequest per the spec and MLE Announces as expected. 
Oddly the leader was responding in the lengthy changes with 4 ParentResponses in quick succession ( .007 seconds spacing I believe ) with the same sequence number, when the convergence took to long no ACK ever was seen being sent by the child for the ParentResponse, and we would be stuck in this loop until eventually an ACK snuck in.

This led us to believe we were having CCA/collisions hence the OPENTHREAD_CONFIG_MAX_TX_ATTEMPTS_DIRECT quantity of retries on the ParentResponse. The custom NXP child has AUTOACK enabled, we presumed it was being stepped on.
A comprehensive RF site study was performed and there is overlap in Wifi and many other 802.15.4 networks at the site as would be typical in many environments.
However, the 4 ParentResponse messages are likely due to this, it did not explain the lack of an ACK. 

Further investigations showed the child was never receiving the ParentResponsesin the long convergence case, until the variable lengthy time period was over and an ACK eventually snuck in ( you would only see one ParentRequest and one ParentResponse in this success btw and the ACK would show up quickly )
 This led us to realize the child was changing channels to send Announces before it received the ParentResponse. 
There are several #define's related to this ( mostly in mle_constants.hpp ) we attempted to adjust these to no avail. It seems as if there needs to be a channel hold time variable. 
As a side note I was able to put a physical delay in ParentRequest code ( yes a big hack ) and it approximated the quick converges.

We would like it to behave like the two child case all the time. Any ideas or suggestions are appreciated.

I have many pcap files, but will have to capture log files if needed, and the custom NXP device does not have logging capability

Jonathan Hui

unread,
May 9, 2018, 12:44:45 AM5/9/18
to Tom Walsh, openthread-users
Thanks for providing this detailed info.

The orphan should not be sending an MLE Announce message while it is waiting for a Parent Response.  In `mle.cpp`, you should see that `Mle::SendOrphanAnnounce()` is called before `BecomeDetached()`, which should result in sending the MLE Announce message before attempting to attach again.

Can you provide logs from the kw41z for both the leader a child so that we can get more insight?

--
Jonathan Hui

--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-use...@googlegroups.com.
To post to this group, send email to openthre...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/391b21c0-5583-4f50-b44c-2daff9f14452%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tom Walsh

unread,
May 9, 2018, 11:53:50 AM5/9/18
to openthread-users
Thanks Jonathan,
I will start gathering the data, but to elaborate on the MLE Announce ( command 15 or 0xf  ) as Wireshark calls it.... 

As a simpler example: if I shut off the leader, and say the child and leader were paired/commissioned on channel 12 before the power off, the child will timeout and start sending Parent requests, Also what we see is the child sends out an MLE Announce ( command 15 ) roughly every 30 seconds on channel 12, it also is sending it out every other channel, IE: its walking all the channels sending this out. 
In our more complicated scenario of changing of channels, this same sending of the Announce on every channel is what causes us to miss the parent responses on the "target channel" most of the time in the case of a un-intended channel change.
I will verify the stock GCC based NXP child does this same as our custom NXP device does before I send out logs. I'm sure it does, but will run that one more time as that is where I will need to get the logs from anyways.

Thanks, Tom Walsh

Jonathan Hui

unread,
May 10, 2018, 7:06:41 PM5/10/18
to Tom Walsh, openthread-users
The MLE Announce transmission behavior you described is expected.  An orphan will cycle through this general process:
1) Send Parent Request to Routers (wait 0.75 seconds for response)
2) Send Parent Request to Routers and REEDs (wait 1.25 seconds for response)
3) Send Announce (after transmission, immediately return to normal channel)

The Announce message is sent on a different channel and cycles through the 16 available channels.  As a result, you should see an Announce message about every 32 seconds on a given channel.

What should happen is when the Leader receives an MLE Announce message, it responds to the orphan by sending its own MLE Announce on the orphan's channel.  The orphan receiving an MLE Announce will then attempt to attach on the Leader's channel.  If the attach fails, the orphan returns to the cycle described above.

--
Jonathan Hui

Tom Walsh

unread,
May 15, 2018, 10:52:22 AM5/15/18
to openthread-users
Yes, the timings are what we see and expected, however many times the response from the leader takes longer, and then this causes a delay and back to the cycle again as you noted, sometimes taking many, many cycles to succeed.
We attempted to adjust these values in mle_constants.hpp, but when we did, it caused the router/leader to take even longer. If we double the time on the child ( from .75 to say 1.5 ), the response took twice as long from the router/leader. 
I did not see anywhere these values are dynamically negotiated between the two devices in any TLV.
Am I missing something here as we attempted the tests several times and are pretty sure it was taking longer for the router to respond, even though it had not been adjusted or re-compiled, and just the child had been modified.

thomas...@sandc.com

unread,
May 15, 2018, 2:15:35 PM5/15/18
to openthread-users
Hi Jonathan,

I'm working with Tom W. on this as well. We typically see the following sequence of messages repeat (format below--  time_in_seconds: (source) -> message type):
t0: (Orphaned Child) -> Parent request
t0.750: (Orphaned Child) -> Parent request
t2.250: (Parent) -> Parent response
t2.255: (Parent) -> Parent response
t2.260: (Parent) -> Parent response
t2.265: (Parent) -> Parent response



The exact time delay between parent requests and parent responses varies, but any Parent Response in excess of 750ms from the second Parent Request falls on deaf ears as the child has moved onto another channel.

Seemingly randomly, however, the parent will respond quickly ( < 750ms, usually around 300-400ms); when this happens, the child always rejoins successfully.



We think there is some sort of sleepy behavior in the parent, leading to a random response time. As Tom W. explained we tried extending the timeout on the child side, but this only exacerbated the delay in the parent's response. We'd like to know how we should tackle this issue to ensure that the parent responds quickly. Is there anywhere in the meshcop code that we can modify the sleepy behavior?

Thanks,

Tom F.
FailedToRejoin.pcapng
RejoinedSuccessfully.pcapng
FailedToRejoin.PNG
RejoinedSuccessfully.PNG
Auto Generated Inline Image 1
Auto Generated Inline Image 2

Jonathan Hui

unread,
May 17, 2018, 12:38:56 AM5/17/18
to thomas...@sandc.com, openthread-users
Thanks for sending the pcap files.

From my analysis of the packet capture, it appears that the kw41z driver may have an issue in switching channels.

The expected behavior is:
1) Router receives the Announce from the orphan.
2) Router sends an Announce on the channel contained in the orphan's Announce.
3) Router immediately returns to its active operational channel.

What I see is that the router is not immediately returning to it active operational channel.  In particular:
1) Following the IEEE 802.15.4 Data Sequence Number (DSN), the router is always sending more than 1 message between receiving the Announce and sending the Parent Response.
2) The Parent Response contains a Response TLV that does not match any of the Parent Request Challenge TLVs in the packet capture.

Upon further inspection of the kw41z driver, it appears that the `otPlatRadioReceive()` implementation does not actually allow changing channels when already in receive mode (see code).  I would try removing that check.  Better yet, I would try replacing that check so that the radio registers are only updated if the requested channel is different.

Hope that helps. 

--
Jonathan Hui

Tom Walsh

unread,
May 21, 2018, 6:38:09 PM5/21/18
to openthread-users
Jonathan, That is great!
I believe Thomas Frazel was able to get this to work. I am sure he will post here soon enough! Thanks for all your help!

thomas...@sandc.com

unread,
May 23, 2018, 3:57:56 PM5/23/18
to openthre...@googlegroups.com
Hi Jonathan,

Thank you for looking into this for us. I removed the channel change in receive mode check and further added a new check to only update the registers when the requested channel differs from the current channel like you recommended, and this has completely solved our issue. Our NXP leader now replies quickly to the first parent request and the child reliably rejoins within about 30s of the parent abruptly changing channels.

I would like to do a pull request with this change to the NXP KW41Z driver so this change is not lost.

Tom

Jonathan Hui

unread,
May 25, 2018, 12:43:31 PM5/25/18
to thomas...@sandc.com, openthread-users
Great to hear!  Please submit a PR with your fix.

--
Jonathan Hui

Jonathan, That is great!
Reply all
Reply to author
Forward
0 new messages