remoteproc write to PRU over rpmsg device blocks even when set non-blocking

Andrew P. Lentvorski

Jun 6, 2020, 6:02:14 AM

to BeagleBoard

I was getting some strange bugs from some remoteproc work I was doing on a BBB, and eventually I tracked them down to overrunning the rpmsg system, which can block for several seconds on a write.

Okay, fine. No big deal. This is what poll() was made for--flip "/dev/rpmsg_pru30" to O_NONBLOCK, set up POLLOUT, wait for a write event, write the data, and check the error.
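For reference, the pattern described above can be sketched in user space roughly as follows. This is a minimal sketch, not code from the original post: the helper names set_nonblocking and write_when_ready are hypothetical, and it works on any already-open descriptor (for example one obtained by opening "/dev/rpmsg_pru30"). It shows what the caller expects to happen: the write is attempted only after poll() reports POLLOUT, and it returns at once either way.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: switch an already-open descriptor to non-blocking mode.
 * Returns 0 on success, -1 on failure (errno set by fcntl). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: wait up to timeout_ms for fd to become writable,
 * then attempt a single write.
 * Returns bytes written, 0 if poll() timed out, or -1 with errno set. */
static ssize_t write_when_ready(int fd, const void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int rc;

    assert(buf != NULL || len == 0);

    /* Retry poll() if a signal interrupts it. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return rc;              /* 0 = timed out, -1 = poll error */

    if (!(pfd.revents & POLLOUT)) {
        errno = EIO;            /* POLLERR/POLLHUP without POLLOUT */
        return -1;
    }

    /* On a well-behaved non-blocking descriptor this returns immediately,
     * failing with EAGAIN if the queue filled up in the meantime. */
    return write(fd, buf, len);
}
```

The caller is expected to treat a return of 0 as "not writable yet" and to check errno when the result is -1.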

Except that my overrun writes to "/dev/rpmsg_pru30" *still* block for several seconds (very bad) and then terminate with an Error 512 (huh?).

I can handle the error, but the big problem is the blocking. That absolutely should not be allowed to happen.

What's going on? And where do I file a bug about this?

Thanks.

# uname -a
Linux beaglebone 4.19.94-ti-r42 #1buster SMP PREEMPT Tue Mar 31 19:38:29 UTC 2020 armv7l GNU/Linux

Andrew P. Lentvorski

Jun 6, 2020, 9:34:26 PM

to BeagleBoard

It appears that the problem is in rpmsg_pru.c.

rpmsg_pru_read has the following code:

        if (kfifo_is_empty(&prudev->msg_fifo) &&
            (filp->f_flags & O_NONBLOCK))
                return -EAGAIN;

rpmsg_pru_write presumably needs a similar piece of code with kfifo_is_full() or it needs to look for O_NONBLOCK and then use rpmsg_trysend instead of rpmsg_send.

Unfortunately, I've got nowhere near the Linux kernel programming chops to debate the implications of that.

Presumably, I need to file a bug somewhere?

Thanks.

Andrew P. Lentvorski

Jun 22, 2020, 2:01:34 AM

to BeagleBoard

Nobody knows where I should file this bug?

Jason Kridner

Jun 22, 2020, 9:12:11 AM

to beagl...@googlegroups.com

Which repo has the code that is causing problems?

I took a quick look at https://git.ti.com/cgit/pru-software-support-package/pru-software-support-package/tree/lib/src/rpmsg_lib/pru_rpmsg.c and it seems to be structured a fair bit differently. If the same issue had been there, I'd recommend posting to e2e.ti.com.

Switching over to the kernel, I see the function you mention:

https://github.com/beagleboard/linux/blob/4.14/drivers/rpmsg/rpmsg_pru.c#L106-L129

The driver isn't upstream yet: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rpmsg

The post to a public list seems to be here:

* https://patchwork.kernel.org/patch/10795751/

The development tree seems to be here:

* https://git.ti.com/cgit/rpmsg/rpmsg/

The code seems the same in the latest development branch:

* https://git.ti.com/cgit/rpmsg/rpmsg/tree/drivers/rpmsg/rpmsg_pru.c#n108

Er, I guess that is an example of doing it right and the issue is here?

* https://git.ti.com/cgit/rpmsg/rpmsg/tree/drivers/rpmsg/rpmsg_pru.c#n142

Since it isn't upstream, I'd think an e2e post might be OK, but it might be more productive to reply to the latest post on linux-omap:

* https://lore.kernel.org/linux-omap/e97f7bfc-a3c2-92a9...@ti.com/

Copy Jason Reeder, Anthony F. Davis and Suman Anna. Not sure why it has been so long between revision posts.

Personally, I don't see any harm in modifying the _write code with a fifo check on O_NONBLOCK.

--
For more options, visit http://beagleboard.org/discuss
---
You received this message because you are subscribed to the Google Groups "BeagleBoard" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beagleboard...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beagleboard/2c824e98-015d-4471-b787-a8c27ceaae5fo%40googlegroups.com.

--

https://beagleboard.org/about - a 501c3 non-profit educating around open hardware computing

Suman Anna

Jun 22, 2020, 11:17:05 AM

to Jason Kridner, beagl...@googlegroups.com

If it is for support from a TI SDK, please post a query to E2E.

Can someone clarify meanwhile exactly what the issue is? The kfifo is
used only on the receive path because of the asynchronous callbacks. The
Tx-path is synchronous, the copy is attempted directly on the vring
buffers, and you have a number of vring buffers (dictated by firmware),
and if all of them are busy (implies PRU has either stopped processing
or is overwhelmed), then you get a failure.

regards
Suman


Mark Lazarewicz

Jun 22, 2020, 5:04:31 PM

to beagl...@googlegroups.com

Hi Suman

Here is original thread so you have background info and time to respond if Andrew has more to add.

https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!msg/beagleboard/6Ch7Do4Hm7k/CAcSRi1pBQAJ

Regards

Mark


Andrew P. Lentvorski

Jun 22, 2020, 10:32:30 PM

to BeagleBoard

Hi, folks,

The issue is that requests cause the rpmsg channels to the PRU to fill. That is actually fine: the PRU in this case is servicing slow requests, and a full rpmsg channel should exert backpressure.

The problem is that the rpmsg system *HANGS* for several seconds before timing out and then throws a fairly bizarre error. Quoting my original message:

> Except that my overrun writes to "/dev/rpmsg_pru30" *still* block for several seconds (very bad) and then terminate with an Error 512 (huh?).

This is not good behavior from all manner of perspectives:

1) Why does the write time out *at all* when not O_NONBLOCK? That's certainly not expected behavior. There is no reason why the PRU might not take a couple of seconds to service a request. If that's a problem, you either set a timeout manually (usually only valid for file descriptors of sockets) or you put the file descriptor into non-blocking mode. (It appears that this is the fault of the rpmsg driver, which will time out after 15 seconds and then return ERESTARTSYS.)

2) Why does the write hang *at all* when in O_NONBLOCK? That's also not expected behavior. If the queue is full, an attempt to write to it should return *IMMEDIATELY* with something like ENOMEM/EAGAIN. (This appears to be the fault of the rpmsg_pru driver).

The file I was looking at is here:

https://github.com/beagleboard/linux/blob/4.19/drivers/rpmsg/rpmsg_pru.c

Three options seem to present themselves:

1) Use rpmsg_trysend when O_NONBLOCK is set (see rpmsg_eptdev_write_iter in rpmsg_char.c, line 243, for an example).

2) Check the queue for space and return immediately with ENOMEM. (This saves the call to rpmsg_trysend and all its indirections.)

3) Do both. (It's possible that trysend covers other cases than just kfifo full--but the kfifo check may be a useful optimization and catch 99%+ of the cases, if not all of them, quickly.)

Thanks.

Andrew P. Lentvorski

Jun 23, 2020, 3:04:17 AM

to BeagleBoard

Urk, sorry, I didn't quite get the implications of this statement:

> The kfifo is used only on the receive path because of the asynchronous callbacks. The
> Tx-path is synchronous, the copy is attempted directly on the vring buffers

That means that kfifo doesn't exist on send so the only available solution appears to be calling rpmsg_trysend when in O_NONBLOCK mode.

That will hit the full vring buffers and should bounce back immediately with ENOMEM.
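To make the proposed behavior concrete, here is a small user-space model of the dispatch. It is only a sketch under stated assumptions: struct model_vring, model_trysend, model_send, and model_write are hypothetical stand-ins that imitate a fixed pool of Tx buffers, not the kernel's rpmsg API, and the blocking path is simulated by a counter rather than an actual sleep. The 512 value is the ERESTARTSYS code reported earlier in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

#define MODEL_ERESTARTSYS 512   /* the "Error 512" seen from user space */

/* Hypothetical stand-in for the vring: a fixed pool of Tx buffers. */
struct model_vring {
    unsigned int free_bufs;      /* buffers currently available for sending */
    unsigned int blocked_calls;  /* how many calls would have slept */
};

/* Models rpmsg_trysend(): fails at once with -ENOMEM when no buffer is free. */
static int model_trysend(struct model_vring *vr, const void *data, size_t len)
{
    (void)data;
    (void)len;
    if (vr->free_bufs == 0)
        return -ENOMEM;
    vr->free_bufs--;
    return 0;
}

/* Models rpmsg_send(): when the pool is empty the real call sleeps until a
 * buffer frees up or the timeout expires; here we only count the event. */
static int model_send(struct model_vring *vr, const void *data, size_t len)
{
    if (vr->free_bufs == 0) {
        vr->blocked_calls++;     /* the real driver would block here */
        return -MODEL_ERESTARTSYS;
    }
    return model_trysend(vr, data, len);
}

/* The proposed write path: honor O_NONBLOCK by never taking the sleeping call. */
static long model_write(struct model_vring *vr, int f_flags,
                        const void *data, size_t len)
{
    int ret;

    assert(vr != NULL);
    if (f_flags & O_NONBLOCK)
        ret = model_trysend(vr, data, len);
    else
        ret = model_send(vr, data, len);

    return ret ? ret : (long)len;
}
```

With two free buffers, a third non-blocking model_write returns -ENOMEM without ever touching the blocking path, which is the behavior a poll()-based caller relies on.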

Thanks.

Mark Lazarewicz

Jun 23, 2020, 3:49:54 AM

to beagl...@googlegroups.com

You could increase the vring buffers or check for full and retry depending on how critical the timing is.


Andrew P. Lentvorski

Jun 23, 2020, 7:53:21 PM

to BeagleBoard

Sure. Right now, I just keep track of how many messages are in flight and I don't allow it to queue too many.

That's useful once you know what the bug is. Fortunately, I hit this bug before I had two threads (one receiving USB and one receiving ethernet) which would have made hunting it down quite painful. So, at least now I know that I *must* have a single thread acting as a gatekeeper on top of the rpmsg system.

If, however, you try to use a library on top of this bug that actually expects the O_NONBLOCK behavior to work, you will have a long debugging chain.

What *originally* tripped all of this was that I tried to use Rust and Tokio, which failed mysteriously. After far too much fruitless debugging, I switched down to Rust and mio, which also failed weirdly.

So, I switched down to C, poll, and O_NONBLOCK, which then gave the incorrect blocking behavior and the ERESTARTSYS. After *that*, I could actually pinpoint the incorrect behavior as belonging to rpmsg_pru and as being due to a full queue with incorrect blocking semantics.

Getting to that point, however, was neither pleasant nor straightforward.

Andrew P. Lentvorski

Jun 30, 2020, 6:43:01 AM

to BeagleBoard

So, we're still back at the original question of "Where do I file this bug so that it gets tracked?"

I see some recent work on rpmsg bugs at https://github.com/beagleboard/linux/issues, so I'll file a bug there. But, is there somewhere else I should file it?

Thanks.

Andrew P. Lentvorski

Dec 8, 2020, 11:16:17 PM

to BeagleBoard

Bumping this. Again.

I'd like to *NOT* have to keep supporting the fix for this on the user side in the 5.X series when this really needs to get fixed on the kernel side. I've filed the bug reports. They're just sitting.

In reality, the rpmsg system doesn't really have the hooks to even support the fix from the user side as I can't query the size and depths of the buffers. This needs to get fixed in the PRU rpmsg kernel subsystem.

Thanks.
