My OTBR logs shows Main Process exited

589 views
Skip to first unread message

Michael Simpson

unread,
Apr 22, 2021, 8:51:21 PM4/22/21
to openthread-users
Hi
I have been having ongoing problems with my Thread network stalling periodically. I have opened this as a new case as it seems to be related to my otbr / RCP end.

While monitoring the wpan0 network with TCPdump, I noticed it reported "tcpdump: pcap_loop: The interface went down"

I examined the OTBR-AGENT logs from monitoring tail -f /var/log/syslog | grep otbr > otbr.txt. See extract attached from line 13. Continues further in attached log.

Can you please examine and advise what might be causing this and how I might rectify it.
Thanks

Apr 23 11:48:34 localhost otbr-agent[124449]: [CRIT]-PLAT----: exit(1): ProcessRadioStateMachine line 999, radio tx timeout, Failure
Apr 23 11:48:34 localhost systemd[1]: otbr-agent.service: Main process exited, code=exited, status=1/FAILURE
Apr 23 11:48:34 localhost systemd[1]: otbr-agent.service: Failed with result 'exit-code'.
Apr 23 11:48:39 localhost systemd[1]: otbr-agent.service: Scheduled restart job, restart counter is at 188.
Apr 23 11:48:39 localhost otbr-agent[126299]: Running 0.2.0-ea8fe65-dirty
Apr 23 11:48:39 localhost otbr-agent[126299]: Thread interface wpan0
Apr 23 11:48:39 localhost otbr-agent[126299]: Backbone interface 
Apr 23 11:48:39 localhost otbr-agent[126299]: [INFO]-PLAT----: RCP reset: RESET_SOFTWARE
Apr 23 11:48:39 localhost otbr-agent[126299]: [INFO]-CORE----: Non-volatile: Read NetworkInfo {rloc:0x6c00, extaddr:d20ffbe46b2d1774, role:Leader, mode:0x0f, version:2, keyseq:0x0, ...
Apr 23 11:48:39 localhost otbr-agent[126299]: [INFO]-CORE----: Non-volatile: ... pid:0x43fead1a, mlecntr:0xa4efc, maccntr:0x6b56ae, mliid:cc71d1a5df069614}
Apr 23 11:48:39 localhost otbr-agent[126299]: Set state callback: OK
Apr 23 11:48:39 localhost otbr-agent[126299]: Thread is down
Apr 23 11:48:39 localhost otbr-agent[126299]: Check if Thread is up: OK
Apr 23 11:48:39 localhost otbr-agent[126299]: Start Thread Border Agent: Resource temporarily unavailable
Apr 23 11:48:39 localhost otbr-agent[126299]: PSKc is initialized
Apr 23 11:48:39 localhost otbr-agent[126299]: Check if PSKc is initialized: OK
Apr 23 11:48:39 localhost otbr-agent[126299]: Initialize OpenThread Border Router Agent: OK
Apr 23 11:48:39 localhost otbr-agent[126299]: Border router agent started.






Jonathan Hui

unread,
Apr 22, 2021, 8:56:48 PM4/22/21
to Michael Simpson, openthread-users, Sébastien Parent-Charette
It seems that a request to transmit a frame was sent to the RCP but no "transmit done" event was returned from the RCP within the expected amount of time.

Sébastian - can someone from SiLabs help take a closer look?

--
Jonathan Hui


--
You received this message because you are subscribed to the Google Groups "openthread-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openthread-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openthread-users/979708d0-efc8-4c2a-89f4-2ccdec498ca8n%40googlegroups.com.

Michael Simpson

unread,
Apr 22, 2021, 9:38:14 PM4/22/21
to openthread-users
Thanks Jonathan
I just rebuilt the RCP for use on my Silabs dev kit board BRD4001A03 using the EFR32MG12P432F1024GL125 rather than my MGM12P32F1024GA custom board.
I am using sdk 3.1.2 on and OpenThread stack configuration set to Thread 1.1
Can you advise how tight these timeouts are and if I can relax them a little to see if it helps.
It seemed to go from working fine to a sudden stop without logging any timeout/retries etc... This seems a little vulnerable look at it from the outside.
Michael

Jonathan Hui

unread,
Apr 22, 2021, 9:51:40 PM4/22/21
to Michael Simpson, openthread-users
The first line in your log was the "timeout" indication - it corresponds to src/lib/spinel/radio_spinel_impl.hpp#L1081.

The timeout is specified by TX_WAIT_US and currently defaults to 5 seconds, which is quite long.

--
Jonathan Hui



Michael Simpson

unread,
Apr 22, 2021, 11:47:35 PM4/22/21
to openthread-users
Hi Jonathan
I agree 5 seconds is long but the lead up to the lines:
Apr 23 11:48:34 localhost otbr-agent[124449]: [CRIT]-PLAT----: exit(1): ProcessRadioStateMachine line 999, radio tx timeout, Failure
Apr 23 11:48:34 localhost systemd[1]: otbr-agent.service: Main process exited, code=exited, status=1/FAILURE
Apr 23 11:48:34 localhost systemd[1]: otbr-agent.service: Failed with result 'exit-code'.

shows messages being received and responded to continually right up to 1 second before (at 1 sec resolution)
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processReceive: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processTransmit: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----: Received IPv6 UDP msg, len:141, chksum:c942, from:0x9000, sec:yes, prio:normal, rss:0.0
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----:     src:[fdde:ad00:0:0:c008:7f34:f8ed:62df]:5683
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----:     dst:[fdde:ad00:0:0:cc71:d1a5:df06:9614]:5683Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processReceive: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processTransmit: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----: Received IPv6 UDP msg, len:206, chksum:71ef, from:0x9800, sec:yes, prio:normal, rss:-11.0
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----:     src:[fdde:ad00:0:0:6293:3689:4ef3:2b53]:5683
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----:     dst:[fdde:ad00:0:0:cc71:d1a5:df06:9614]:5683
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processReceive: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processTransmit: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processTransmit: OK

I also see these being sent back on TCPDump.

So I'm not sure I understand, are you saying the timeout at 11:48:34 is from a message sent back at 11:43:xx, and that the system kept sending and receiving successfully for the 5 seconds after this message failed?

The code reference you gave me comments:
// Frame has been successfully passed to radio, but no `TransmitDone` event received within TX_WAIT_US.
So is this timeout from the otbr-agent sending a message to the RCP and not getting back a response?
Does this one instance of timeout trigger the process which results in the wpan0 shutting down and restarting?
Does the protocol employ retries to try to recover or is a single timeout instance enough to warrant this action in shutting the wpan0?

Is the OTBR sending a message to reset the RCP at: 
Apr 23 11:48:39 localhost otbr-agent[126299]: [INFO]-PLAT----: RCP reset: RESET_SOFTWARE
or is it just recording the the RCP restarted. I am trying to understand cause and affect.

Is there any configuration around this process which might help?

I have tried making my devices idle so they only send a fraction of the messages they would when they are busy to see if that affects the problem, but it still occurs about 5 times  in a one hour period.

I will need to here back from Silbas on the RCP if it is responsible for the missing response, but it seems to me the otbr-agent should be more robust than to take such drastic action from one message response timeout when the messages after this are still being processed successfully? 

Maybe I misunderstand / misinterpret what is happening here.  But I have serious problems if I can't solve this quickly.

Can you comment on how often the network should normally run continuously without resetting itself.

The recovery also seems to take a long time:
Apr 23 11:48:34 localhost otbr-agent[124449]: [CRIT]-PLAT----: exit(1): ProcessRadioStateMachine line 999, radio tx timeout, Failure
...to just get Thread back up
Apr 23 11:48:39 localhost otbr-agent[126299]: Thread is up
... takes over 5 seconds.

Whereas if I restart the otbr-agent - sudo service otbr-agent restart, the entire process from 
Apr 23 15:36:17 localhost otbr-agent[152047]: select() failed
Apr 23 15:36:17 localhost otbr-agent[152047]: [INFO]-PLAT----: Sent request#7 to remove fdde:ad00::ff:fe00:fc00/64
Apr 23 15:36:17 localhost otbr-agent[152047]: [NOTE]-MLE-----: Role Leader -> Detached
Apr 23 15:36:17 localhost otbr-agent[152047]: [INFO]-PLAT----: UpdateMulticast: OK
...to 
Apr 23 15:36:18 localhost otbr-agent[155686]: Thread is up
...seems to only take 1 second - Router table updated with all my Reeds within the next second

It would be good to prevent this from occurring but it seems the recovery when it does could be a lot faster if otbr-agent just restarted?

Jonathan Hui

unread,
Apr 23, 2021, 12:21:56 AM4/23/21
to Michael Simpson, openthread-users
Responses below:

On Thu, Apr 22, 2021 at 8:47 PM Michael Simpson <michae...@gmail.com> wrote:
Hi Jonathan
I agree 5 seconds is long but the lead up to the lines:
Apr 23 11:48:34 localhost otbr-agent[124449]: [CRIT]-PLAT----: exit(1): ProcessRadioStateMachine line 999, radio tx timeout, Failure
Apr 23 11:48:34 localhost systemd[1]: otbr-agent.service: Main process exited, code=exited, status=1/FAILURE
Apr 23 11:48:34 localhost systemd[1]: otbr-agent.service: Failed with result 'exit-code'.

shows messages being received and responded to continually right up to 1 second before (at 1 sec resolution)
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processReceive: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processTransmit: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----: Received IPv6 UDP msg, len:141, chksum:c942, from:0x9000, sec:yes, prio:normal, rss:0.0
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----:     src:[fdde:ad00:0:0:c008:7f34:f8ed:62df]:5683
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----:     dst:[fdde:ad00:0:0:cc71:d1a5:df06:9614]:5683Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processReceive: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processTransmit: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----: Received IPv6 UDP msg, len:206, chksum:71ef, from:0x9800, sec:yes, prio:normal, rss:-11.0
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----:     src:[fdde:ad00:0:0:6293:3689:4ef3:2b53]:5683
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-MAC-----:     dst:[fdde:ad00:0:0:cc71:d1a5:df06:9614]:5683
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processReceive: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processTransmit: OK
Apr 23 11:48:33 localhost otbr-agent[124449]: [INFO]-PLAT----: processTransmit: OK

The processTransmit and processReceive logs come from src/posix/platform/netif.cpp, which is responsible for managing the wpan0 interface and not responsible for managing the RCP.

So I'm not sure I understand, are you saying the timeout at 11:48:34 is from a message sent back at 11:43:xx, and that the system kept sending and receiving successfully for the 5 seconds after this message failed?

The system was not sending successfully. The processTransmit logs are handling the wpan0 interface, not the RCP.

The code reference you gave me comments:
// Frame has been successfully passed to radio, but no `TransmitDone` event received within TX_WAIT_US.
So is this timeout from the otbr-agent sending a message to the RCP and not getting back a response?

This timeout is when the RCP was requested to transmit an 802.15.4 frame, but the RCP did not signal completion of the event.
 
Does this one instance of timeout trigger the process which results in the wpan0 shutting down and restarting?

Yes.
 
Does the protocol employ retries to try to recover or is a single timeout instance enough to warrant this action in shutting the wpan0?

No, because the radio is in an unknown state.

Is the OTBR sending a message to reset the RCP at: 
Apr 23 11:48:39 localhost otbr-agent[126299]: [INFO]-PLAT----: RCP reset: RESET_SOFTWARE
or is it just recording the the RCP restarted. I am trying to understand cause and affect.

Sending a message to reset the RCP.

Is there any configuration around this process which might help?

You may consider taking a look at this message from SiLabs: openthread/ot-br-posix#559.

I will need to here back from Silbas on the RCP if it is responsible for the missing response, but it seems to me the otbr-agent should be more robust than to take such drastic action from one message response timeout when the messages after this are still being processed successfully? 

If the radio is in an unknown state, there is not much otbr-agent can do to recover except reset.

Can you comment on how often the network should normally run continuously without resetting itself.

A robust RCP should never trigger these kinds of resets.

The recovery also seems to take a long time:
Apr 23 11:48:34 localhost otbr-agent[124449]: [CRIT]-PLAT----: exit(1): ProcessRadioStateMachine line 999, radio tx timeout, Failure
...to just get Thread back up
Apr 23 11:48:39 localhost otbr-agent[126299]: Thread is up
... takes over 5 seconds.

systemctl needs to notice that the process has exited.

--
Jonathan Hui

Sébastien Parent-Charette

unread,
Apr 23, 2021, 1:02:43 PM4/23/21
to openthread-users
Hey Jonathan, Micheal,

I see Micheal has already opened a support case with Silabs and is now in good hands! I'll post here later for reference once I have more details available.

Sincerely,

Michael Simpson

unread,
Apr 26, 2021, 11:53:18 PM4/26/21
to openthread-users
Hi
I have not made much progress so far with this problem with the case I have raised with Silabs. Please forgive me for just checking a few issues related to the OTBR back with the Openthread team.

From the document you sent me Jonathan (thanks), I have done the following:

1) Edited my third_party/openthread/repo/src/lib/spinel/spinel_interface.hpp and found:

kMaxFrameSize = OPENTHREAD_CONFIG_PLATFORM_RADIO_SPINEL_RX_FRAME_BUFFER_SIZE

and in platform.h:

#define OPENTHREAD_CONFIG_PLATFORM_RADIO_SPINEL_RX_FRAME_BUFFER_SIZE 8192

2) run serial vcom config handshake rtscts

3) Edited my ncp_config.h file and changed the buffer sizes as follows:

#define OPENTHREAD_CONFIG_NCP_TX_BUFFER_SIZE 2048

#define OPENTHREAD_CONFIG_NCP_UART_TX_CHUNK_SIZE 2048

4) I have tried unsuccessfully to change the otbr and RCP from 115200 baud to 460800 but they don't communicate.

I have done some testing and I think the problem seems to be with the OTBR.  If I change the OTBR /etc/default/otbr-agent as follows:

OTBR_AGENT_OPTS="-I wpan0 spinel+hdlc+uart:///dev/ttyACM0?uart-baudrate=460800", and leave the RCP still set at 115200Baud, the system still works.

So my otbr-agent doesn't seem to know about ?uart-baudrate=460800 and is ignoring it?  Does this sound possible?

Silabs have sent me a document advising I need to update my otbr to commit a37e299ff 
The instructions are:
git checkout a37e299ff
git submodule update --init --recursive 
/ot-br-posix/script/update  

Should I specifically update to commit a37e299ff as above or just make sure I have the latest?

If the latest, can you please confirm the git command to update my local copy. I am not confident with git and stackoverflow has a differing arguments on this.

I am a little skeptical that the baud rate change from 115200 to 460800 is going to make that much difference if the otbr timeout on a missing response from the RCP is 5 seconds.

Can you please confirm that these changes are for a problem which actually matches the symptoms of my problem (reported in the top paragraph). If this is a known problem, I don't understand why Silabs are not distributing these changes in the 2 SDK updates which follow this article but also why it is not a problem experienced by many developers.

Thanks for everyone ongoing help with this.

Michael Simpson

unread,
Apr 27, 2021, 3:39:42 AM4/27/21
to openthread-users
It seems the more I learn the less I know.

My host app running on the border-router records RLOC ip addresses in a table which must come from the coms. These are also what I see in TCPDUMP.

My eidcache shows the following rloc addresses:

> eidcache table
fdde:ad00:0:0:c008:7f34:f8ed:62df f000
fdde:ad00:0:0:2af:10c9:7745:5677 b000
fdde:ad00:0:0:de46:16b5:3d9e:99f5 8c00
fdde:ad00:0:0:6293:3689:4ef3:2b53 3c00
fdde:ad00:0:0:7695:70cf:a63a:895a 9400

My otbr 

> ipaddr rloc
fdde:ad00:0:0:0:ff:fe00:5c00
Done

But the otbr logs record mleid addresses eg

Received IPv6 UDP msg, len:109, chksum:8461, from:0xb000, sec:yes, prio:normal, rss:-48.0
Apr 27 19:15:46 localhost otbr-agent[31652]: [INFO]-MAC-----:     src:[fdde:ad00:0:0:2af:10c9:7745:5677]:5683
Apr 27 19:15:46 localhost otbr-agent[31652]: [INFO]-MAC-----:     dst:[fdde:ad00:0:0:9348:3d41:609:c278]:5683
Apr 27 19:15:46 localhost otbr-agent[31652]: [INFO]-PLAT----: processReceive: OK
Apr 27 19:15:46 localhost otbr-agent[31652]: [INFO]-PLAT----: processTransmit: OK
Apr 27 19:15:46 localhost otbr-agent[31652]: [INFO]-MAC-----: Sent IPv6 UDP msg, len:54, chksum:5787, to:0xb000, sec:yes, prio:normal
Apr 27 19:15:46 localhost otbr-agent[31652]: [INFO]-MAC-----:     src:[fdde:ad00:0:0:9348:3d41:609:c278]:5683
Apr 27 19:15:46 localhost otbr-agent[31652]: [INFO]-MAC-----:     dst:[fdde:ad00:0:0:2af:10c9:7745:5677]:5683


IP Address fdde:ad00:0:0:9348:3d41:609:c278 is my border routers mleid address
> ipaddr mleid
fdde:ad00:0:0:9348:3d41:609:c278
Done

And I am also seeing link local addresses in the log eg
Frame tx attempt 16/16 failed, error:NoAck, len:79, seqnum:73, type:Data, src:7a3e62085c206ac5, dst:e2b1733f7f858991, sec:no, ackreq
Apr 27 19:16:09 localhost otbr-agent[31652]: [NOTE]-MAC-----: Failed to send IPv6 UDP msg, len:95, chksum:f095, to:e2b1733f7f858991, sec:no, error:NoAck, prio:net
Apr 27 19:16:09 localhost otbr-agent[31652]: [NOTE]-MAC-----:     src:[fe80:0:0:0:783e:6208:5c20:6ac5]:19788
Apr 27 19:16:09 localhost otbr-agent[31652]: [NOTE]-MAC-----:     dst:[fe80:0:0:0:e0b1:733f:7f85:8991]:19788
Apr 27 19:16:11 localhost otbr-agent[31652]: [INFO]-MAC-----: Received IPv6 UDP msg, len:96, chksum:d1be, from:6a8eb3ae74d508c9, sec:no, prio:net, rss:-83.0
Apr 27 19:16:11 localhost otbr-agent[31652]: [INFO]-MAC-----:     src:[fe80:0:0:0:688e:b3ae:74d5:8c9]:19788
Apr 27 19:16:11 localhost otbr-agent[31652]: [INFO]-MAC-----:     dst:[ff02:0:0:0:0:0:0:1]:19788
Apr 27 19:16:11 localhost otbr-agent[31652]: [INFO]-MLE-----: Receive Advertisement (fe80:0:0:0:688e:b3ae:74d5:8c9,0x8c00)

eg my Border-Router 
> ipaddr linklocal
fe80:0:0:0:783e:6208:5c20:6ac5
Done

Is this expected? I don't understand.

otbr.txt
Reply all
Reply to author
Forward
0 new messages