How to reliably push data from ARM host to PRU (shared) memory with predictable (low) latency?


ags

Mar 22, 2017, 1:08:17 AM3/22/17
to BeagleBoard
I have an application running on the ARM host that writes to the PRU shared memory. The PRU core then manipulates that data and sends it out the EGPIO pins with exact timing.

For this to work, I need to have a steady supply of data bursts available to the PRU core - about 32 KiB each burst. This won't fit into the PRU shared memory (12 KiB), PRU memory (2x 8 KiB), or even spread out over all PRU memory (28 KiB total). So it's not possible to load the data into PRU memory and then kick off the PRU core to send it out the pins. There will be a necessary transfer of additional data from ARM host to PRU memory after the PRU has started sending data out through the EGPIO pins.

I've instrumented the PRU PASM code using the CYCLE register, and see that there is variable latency when the PRU is waiting for a memory block to be received from the ARM host. This can be upwards of 5 ms, which won't work for this application. I've tried using ionice to set the class to realtime and the priority to 0, but this had no appreciable effect.

Is there some way to reduce the latency of writing from ARM/Linux to the PRU memory? I've heard that some projects use DMA to transfer data from the PRU to host (ARM, system) DDR (e.g. BeagleLogic project) but nothing about the reverse direction. Does this even make sense? Will the kernel already be invoking DMA during a memcpy from user virtual address space to the mmap'd physical PRU memory address?

I need to provide about 32 KiB to the PRU within 5 ms, repeating every 20 ms. This seems like it should be easily accomplished, if a USB driver can sustain 480 Mbps data rates. I must be approaching this the wrong way. Any suggestions on how this should be architected will be greatly appreciated.
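For scale, the raw bandwidth being asked for here is modest. A quick back-of-the-envelope check (plain C, nothing BeagleBone-specific; the helper name is just for illustration) puts the burst rate around 52 Mbit/s and the sustained average around 13 Mbit/s:

```c
#include <stddef.h>

/* Data rate in Mbit/s for `bytes` delivered every `window_ms` milliseconds. */
double rate_mbps(size_t bytes, double window_ms)
{
    return (bytes * 8.0) / (window_ms / 1000.0) / 1e6;
}
```

For this workload, rate_mbps(32 * 1024, 5.0) comes out to about 52.4 (the 32 KiB must land inside the 5 ms window) and rate_mbps(32 * 1024, 20.0) to about 13.1, matching the ~13 Mbit/s average quoted later in the thread.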

Dennis Lee Bieber

Mar 22, 2017, 10:14:19 AM3/22/17
to beagl...@googlegroups.com
On Tue, 21 Mar 2017 22:08:17 -0700 (PDT), ags
<alfred.g...@gmail.com> declaimed the
following:

>I need to provide about 32 KiB to the PRU within 5 ms, repeating
>every 20 ms. This seems like it should be easily accomplished, if a
>USB driver can sustain 480 Mbps data rates. I must be approaching this the
>wrong way. Any suggestions on how this should be architected will be
>greatly appreciated.

If you've got a USB system that actually manages 480Mbps for more than
a few bytes, you've got something miraculous.

USB is a polling-intensive protocol, with lots of turn-arounds (host: Are you ready to receive? client: Ready. host: sends data packet).

High-speed USB bulk transfers carry a data payload of at most 512 bytes (1024 for isochronous), plus a few bytes for sync/PID/CRC16... roughly 4k bits per transaction. Sure, those bits go out at 480 Mbps... And then they get followed by polls of the other connected devices to find out which is the next device to be serviced.

The effective rate for high-speed USB is only around 280 Mbps (USB 3 SuperSpeed is rated at 5 Gbps, but the spec is considered met with an effective rate of 3.2 Gbps; USB 3 signalling is full-duplex, the others are half-duplex).

I've not encountered any protocol that requires something like 32 kB as a continuous stream with no subdivisions for handshaking/error checking. Ethernet breaks data up into (with overhead) ~1.5 kB chunks; TCP may be able to send multiple chunks before getting an ACK back on the first one, but it is still chunked...

--
Wulfraed Dennis Lee Bieber AF6VN
wlf...@ix.netcom.com HTTP://wlfraed.home.netcom.com/

William Hermans

Mar 22, 2017, 11:05:45 AM3/22/17
to beagl...@googlegroups.com
I'd say you most likely have a flaw in your code, because what you describe is only around 1.6 MiB/s.

I'd also like to point out that you will rarely, if ever, see any USB interface achieve the full 480 Mbit/s. For example, the g_ether network gadget driver at best usually achieves only 105-115 Mbit/s, though that's partly due to how the code is written.



Charles Steinkuehler

Mar 22, 2017, 11:45:13 AM3/22/17
to beagl...@googlegroups.com
On 3/22/2017 12:08 AM, ags wrote:
>
> I need to provide about 32 KiB to the PRU within 5 milliSec, repeating
> every 20 milliSec.

That's not much data. I recommend you just make a circular buffer in
the PRU data memory, and run a periodic task on the ARM side to keep the
buffer filled. Using the 12 KiB shared data RAM you can store almost 2 ms
worth of data, which ought to be plenty. By way of example, the default
Machinekit ARM-side thread period is 1 ms, and it could easily be faster for
something simple like this.

Note you might need an -rt or Xenomai kernel to achieve reliable
operation; I've seen the non-rt kernels occasionally "wander off into
the weeds" for several hundred ms at a time.
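Charles's circular-buffer idea can be sketched in plain C. This is a minimal single-producer/single-consumer ring with the classic one-slot-empty convention; the layout (head/tail words followed by the data area) is a hypothetical one, not an established PRU convention. On real hardware the struct would be overlaid on the 12 KiB shared RAM, the ARM side would call ring_push and the PRU firmware would implement the ring_pop half.

```c
#include <stdint.h>
#include <string.h>

#define RING_SIZE 12288  /* 12 KiB shared RAM; a real layout would reserve header space */

struct ring {
    volatile uint32_t head;  /* written only by the ARM producer */
    volatile uint32_t tail;  /* written only by the PRU consumer */
    uint8_t data[RING_SIZE];
};

uint32_t ring_used(const struct ring *r)
{
    return (r->head - r->tail + RING_SIZE) % RING_SIZE;
}

uint32_t ring_free(const struct ring *r)
{
    return RING_SIZE - 1 - ring_used(r);  /* one slot kept empty to tell full from empty */
}

/* Producer: copy up to len bytes in; returns bytes actually written. */
uint32_t ring_push(struct ring *r, const uint8_t *src, uint32_t len)
{
    uint32_t n = ring_free(r);
    if (len < n) n = len;
    for (uint32_t i = 0; i < n; i++)
        r->data[(r->head + i) % RING_SIZE] = src[i];
    r->head = (r->head + n) % RING_SIZE;  /* publish after the data is in place */
    return n;
}

/* Consumer: copy up to len bytes out; returns bytes actually read. */
uint32_t ring_pop(struct ring *r, uint8_t *dst, uint32_t len)
{
    uint32_t n = ring_used(r);
    if (len < n) n = len;
    for (uint32_t i = 0; i < n; i++)
        dst[i] = r->data[(r->tail + i) % RING_SIZE];
    r->tail = (r->tail + n) % RING_SIZE;
    return n;
}
```

Because each index is written by exactly one side, no lock is needed as long as the 32-bit index updates are atomic, which they are on both the ARM and the PRU.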

--
Charles Steinkuehler
cha...@steinkuehler.net

William Hermans

Mar 22, 2017, 12:43:24 PM3/22/17
to beagl...@googlegroups.com

On Wed, Mar 22, 2017 at 8:45 AM, Charles Steinkuehler <cha...@steinkuehler.net> wrote:

Note you might need an -rt or Xenomai kernel to achieve reliable
operation; I've seen the non-rt kernels occasionally "wander off into
the weeds" for several hundred ms at a time.

--
Charles Steinkuehler
cha...@steinkuehler.net


"Wander off into the weeds . . ." I get a kick out of that expression every time I see it in this context.

I do agree with Charles, and would like to add that you need to pay close attention to which C and Linux API calls you use in your application. Functions such as printf(), handy for quick-and-dirty text debugging, can slow your code down considerably; however, if you pipe the output of such an application into a file, you'll notice a huge performance improvement from that single trick alone. Anything related to threading or file locking (poll(), etc.) through Linux API calls is also going to slow you down. Certainly there is more, but these are the things I've personally tested and can think of off the top of my head.

Also, under certain conditions, using usleep() where you have a busy-wait loop can help some, but at other times it can backfire, depending on how busy your system is. Either way, a busy-wait loop that never yields CPU time back to the system will wind up using ~95% processor time until preempted. Just remember that there is only one core / thread to work with.
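The printf() observation above is really about stream buffering: glibc line-buffers stdout when it is a terminal, so every newline costs a write() syscall, whereas piping to a file switches it to block buffering. A hedged sketch of forcing that behaviour explicitly (the function names and the 64 KiB buffer size are arbitrary choices for illustration):

```c
#include <stdio.h>

/* Give stdout a large fixed buffer so debug output goes out in big
   block-sized chunks instead of one write() per line. */
static char logbuf[1 << 16];

void buffer_stdout(void)
{
    setvbuf(stdout, logbuf, _IOFBF, sizeof logbuf);
}

/* Write n debug lines to `out`; returns total bytes queued.
   Only the final fflush() forces data out to the kernel. */
long log_burst(FILE *out, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += fprintf(out, "sample %d\n", i);
    fflush(out);
    return total;
}
```

Calling buffer_stdout() once at startup gets roughly the same effect as piping the program's output into a file.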

You may also need to slim down the unneeded processes, services, and kernel modules that are loaded or running by default on a stock BeagleBone Linux image. All of these compete for CPU time, which you may not be able to afford if your application is to perform as well as you'd like. Basically, you need to profile your system and see what you can get away with.

So from personal experience, I can say with reasonable confidence that the maximum latency with an RT kernel is going to be around 50 ms, and that's when your system is constantly "busy"; if your system is extremely busy, it can be more. I've had an application that did a lot of processing in code but used only up to 5% processor time, because I was giving processor time back to the system with usleep(). Anyway, if you need "real-time", an RT kernel could work fine, depending on your definition of the term. If you need deterministic behavior, you may need to use Xenomai, move into the kernel, or potentially both.

I would probably start by profiling your system to see what is running in the background, and whether everything you do have running is necessary. After that, try installing an RT kernel.

ags

Mar 23, 2017, 1:13:32 AM3/23/17
to BeagleBoard
You've hit the nail on the head. The issue (IMO) is Linux "wandering off into the weeds". It comes back, eventually... but while gone, bad things happen.

1) I am using a handshake approach between the PRU and ARM, using interrupts. When the PRU wants more data, it generates an ARM interrupt. The userspace application listens for the interrupt (using select()) and, when it's received, sends more data. The PRU is made aware that the data is ready via an interrupt sent to the PRU.
2) I am using a ring (though with only two compartments, it seems more like a "line") to send the data. I think of it as a tick/tock, or ping/pong, approach: when one side (half) of the data space has been read by the PRU, it signals the ARM host to send another half-buffer full of data. So the PRU is always reading from one buffer while the ARM is loading the other.
3) While the average data rate I need to sustain is about 13 Mbit/s (not a problem), the challenge is ensuring, under all conditions, that I can send 262 kbits of data from ARM to PRU, in chunks small enough to fit into the 12 KiB PRU shared RAM, in a timely manner. With my current design, this means sending 4 KiB from ARM to PRU shared RAM and completing the transaction within 960 µs of the request for more data. The limiting factors are the timing (I can't starve the PRU of data, otherwise the output bitstream will have gaps that corrupt the content for the external client) and the size of the PRU memory (if I could load a full "frame buffer" of data at once I could avoid starving the PRU, but the PRU shared RAM only holds 1/8 of the data required for each burst).

I thought using select() to wait for notification of an event (by "listening" on the uio device files) would free the ARM CPU to do other things while waiting, yet provide the most immediate path for the userspace application to send more data. Is there a better way?
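For reference, the uio_pruss pattern described here looks roughly like this in C: each /dev/uioN file yields a 4-byte event count when an interrupt fires, and select() just waits for that read to become possible. The device path and the interrupt re-arm step vary by driver, so treat those details as assumptions; the helper name is invented.

```c
#include <stdint.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to timeout_ms for the UIO fd to signal an interrupt.
   Returns 1 on interrupt (event count stored in *count),
   0 on timeout, -1 on error. */
int wait_for_irq(int fd, int timeout_ms, uint32_t *count)
{
    fd_set rfds;
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    int r = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (r <= 0)
        return r;                       /* 0 = timeout, -1 = error */
    if (read(fd, count, sizeof *count) != (ssize_t)sizeof *count)
        return -1;                      /* UIO delivers a 4-byte event count */
    return 1;                           /* caller refills the buffer, then signals the PRU */
}
```

On a BeagleBone the fd would come from open("/dev/uio0", O_RDWR); some UIO drivers additionally require writing a 1 back to the fd to re-enable the interrupt before the next wait.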

William Hermans

Mar 23, 2017, 6:03:47 AM3/23/17
to beagl...@googlegroups.com
On Wed, Mar 22, 2017 at 10:13 PM, ags <alfred.g...@gmail.com> wrote:
I thought using select() to wait for notification of an event (by "listening" on the uio device files) would free the ARM CPU to do other things while waiting, yet provide the most immediate path for the userspace application to send more data. Is there a better way?

That select() is probably your whole problem, unless you're using other system calls as well. I've already discussed with you the best and fastest way to achieve your goal, several times in fact: use a bit in memory, *somewhere*.

PRU side (pseudocode):
while (somewhere & 0x1)
    delay();    /* the PRU has no usleep(); spin or count cycles */
/* Do our work after the while() falls through */

Userspace side:
while (!(somewhere & 0x1))
    usleep(1000);
/* Do our work after the while() falls through */

No need for select(), no need for fancy threading calls or other magical hand-waving; just two simple busy-wait loops waiting for their respective turns. But don't forget to toggle the bit back when you're done. Anyway, it's not really Linux that's off in the weeds. Well, perhaps it is, but your application is pushing it into the weeds.

William Hermans

Mar 23, 2017, 6:21:11 AM3/23/17
to beagl...@googlegroups.com
You're also not going to be able to use alarm() or any function that gets time from the system, not if you want to stay reasonably deterministic. Write your userspace app to do one thing, fast, and handle the rest elsewhere. For instance, if you're sending your data to some remote location, have that remote location timestamp the data.

ags

Mar 23, 2017, 8:48:46 AM3/23/17
to BeagleBoard
OK, I will use the busy-wait loop with usleep() and test. The reason I used select() was that I thought it would allow me to do other things (I need another process, thread, or loop in this same application serving out audio data to another client, synchronized with this data). My understanding was that a process blocking on select() would free the CPU for other things, but allow a quick wake-up to refresh the buffer as needed.

BTW, I have only mentioned the problems, but it does almost work. In my tests, I ran 12,500 4 KiB buffers from ARM to PRU and measured (on the PRU side, using the precise CYCLE counter) whether the PRU ever had to wait for the next buffer fill. It turns out the PRU had to wait about 180 times, or on about 1.5% of the buffer-fill events. The worst-case wait (stall) time was ~5 ms.

William Hermans

Mar 23, 2017, 3:32:28 PM3/23/17
to beagl...@googlegroups.com
On Thu, Mar 23, 2017 at 5:48 AM, ags <alfred.g...@gmail.com> wrote:
OK, I will use the busy-wait loop with usleep() and test. The reason I used select() was that I thought it would allow me to do other things (I need another process, thread, or loop in this same application serving out audio data to another client, synchronized with this data). My understanding was that a process blocking on select() would free the CPU for other things, but allow a quick wake-up to refresh the buffer as needed.

I thought that select() and all that should work too, initially. But you have to remember, we're talking about an OS here that has an "expected" latency of 100 ms or more, depending. One could easily experiment and find out for oneself: one of the easiest tests would be to run a loop for 10,000 iterations, comparing select() to a busy-wait loop, then run the command-line `time` utility on each to see the difference. This is of course not a super-accurate test, but it should be good enough to show a huge difference in completion time. *If* you're more the scientific type, get the system time in your test app before and after the test code, then output the difference between those two times.

Anyway, using an RT kernel or a Xenomai kernel may improve this latency *some*, but it is said that this comes at the expense of *some* other performance aspects of the OS. I've not actually tested that myself, only read about it.

BTW, I have only mentioned the problems, but it does almost work. In my tests, I ran 12,500 4 KiB buffers from ARM to PRU and measured (on the PRU side, using the precise CYCLE counter) whether the PRU ever had to wait for the next buffer fill. It turns out the PRU had to wait about 180 times, or on about 1.5% of the buffer-fill events. The worst-case wait (stall) time was ~5 ms.

One has to be very careful what one uses in code when writing an executable that requires some degree of determinism from userspace. I can't point to the specific articles I've read that led me to understand all this, but they're out there. Pretty much anything that is a system call will incur a latency penalty, because you end up switching processor context from userspace to kernelspace and back to userspace. This in and of itself may not be too bad, but any variables that are needed end up being copied back and forth as well. In these cases you can incur huge latency spikes that you may not have anticipated.

Personally, I've run into this problem a couple of times across two different projects. So my style of coding is to just get something working first, then refactor the code to perform to my expectations. Basically, start with really "simple" stuff like printf(), select(), etc., then refactor those out when / if needed. Many times it's not needed, but when it is, one should understand the consequences of using such function calls, so one has at least a rough idea where to start "trimming the fat". Everyone falls into this trap at least once or twice when entering the embedded arena.

My understanding of calls like select() is that when they're used, you're yielding the processor back to the system with the "promise" that eventually the system will notify you when something related to that call has changed. With a busy-wait loop, you're defining how long the processor is yielded back to the system: in the case of my example, approximately 1 ms. Just be aware that with any non-real-time OS, polling much faster than 1 ms intervals will yield varying results, e.g. the system may not be able to keep up with your code. If your code is super efficient, you can potentially get hundreds of thousands of iterations. This is of course not guaranteed, but I've done it personally with the ADC, so I know it can be possible. At that performance level you're almost certainly using mmap(), and almost certainly using a lot of processor time as well: 80%+.

Also, my code above was pseudocode that I picked apart myself after I posted. On the PRU side of things, you're probably going to want to do things a bit differently. For starters, you're probably going to want to time your data transfers from the PRU; that is, every 20 ms you kick off a new data set. However, this has to be done smartly, as you do not want to override the userspace side's lock, so perhaps a double buffer will be needed; that will depend on your situation. Another technique that could be used is data packing, as plain-text data can be a lot larger in memory than a packed data structure. It would require a lot of thought to do this smartly, as well as a strong understanding of struct / union "data objects" plus data alignment, for the best results.
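The struct packing / alignment point can be made concrete. On typical 32-bit ARM and x86-64 ABIs the compiler pads the first struct below to keep the uint32_t 4-byte aligned; the packed variant trades that padding (and, on some CPUs, cheap aligned access) for size. The field names are invented for illustration:

```c
#include <stdint.h>

/* Naturally aligned: 3 padding bytes appear after `channel` and 2
   after `value` on common ABIs, so sizeof is usually 12. */
struct sample_loose {
    uint8_t  channel;
    uint32_t timestamp;
    uint16_t value;
};

/* Packed: no padding, sizeof is exactly 7, but unaligned loads of
   `timestamp` may cost extra cycles on ARM. */
struct __attribute__((packed)) sample_packed {
    uint8_t  channel;
    uint32_t timestamp;
    uint16_t value;
};
```

For a memory budget as tight as 12 KiB of shared RAM, reclaiming five bytes out of every twelve is a meaningful difference.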

There could potentially be a lot more to consider down the road. Just pick away at it one thing at a time. Eventually you'll be done with it.

William Hermans

Mar 23, 2017, 3:43:53 PM3/23/17
to beagl...@googlegroups.com
One other thing I did not think to mention: I was recently watching a YouTube video from Jason Turner, a person known for talking about performance-related C++ coding. Now, I'm not exactly a huge fan of C++, but I do like to keep up with the language, and one of the things he mentions in this video I completely agree with: "Simple code is 99% more likely to perform better than complex code", or something to that effect. Which may seem obvious initially, but compare the simple two lines of a busy-wait loop to the select() system call. Is select() only two lines of easy-to-read, easy-to-understand code? I cannot say with 100% certainty that it is not, but I seriously doubt it.

ags

Mar 24, 2017, 12:34:31 PM3/24/17
to BeagleBoard
@William Hermans I thought I'd share the result of my efforts to reliably stream data from ARM host (Linux userspace) to PRU.

I instrumented the PRU ASM code to use the CYCLE register for very precise measurements. I ran tests that kept track of how many times, for how long, and with what "worst offender" the PRU was stalled waiting for data from the ARM host. I used this to test my current implementation using select(), then replaced select() with usleep() (and nanosleep()), and then again with a loop that had no sleep at all, just a brute-force busy wait that never released the CPU. As it turns out, the results were surprising. Using usleep() (and similar methods), the number of stalls, the overall stall time, and the worst-case stall time were all significantly worse than with the implementation using select(). Even the busy-wait loop without sleep() was worse. I did a bit of research: sleep() and related methods are implemented using a syscall (sleep() used to be built on alarm() in the olden days, so I read), so getting through the call gate and the context swap happens with sleep() just as it does with select(). My theory is that select() is more efficient precisely because of this: one call to select() incurs one system call / context swap per interrupt. The process is put on the not-running list and the OS continues on; when a trigger event happens, the OS returns the process to the running list and control goes back to user space. With the sleep() method there are many calls per "interrupt", polling some memory location looking for the signal from the PRU. So what is handled by one userspace -> kernelspace -> userspace transition with select() could require dozens of these transitions using sleep().

I don't claim to be an expert, and if there is a flaw in this theory I'm open to hearing what it is. But this is my theory at the current moment.

So what I ended up doing is compressing the data so that one "frame" fits in PRU memory at once. The PRU needs to send a full frame out with precise (microsecond) timing for all data in that frame; between frames there is slack. By compressing the data, I can load a full frame into the PRU0/1 DRAMs and shared RAM, then kick off writing out the frame. Now everything is (or appears to be) deterministic in the timing of all transfers between registers, scratch, and PRU DRAM, so I've sidestepped the problem of unpredictable latency waiting for data from the ARM host.
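The post doesn't say which compression scheme was used, so the sketch below is just one plausible stand-in: a byte-wise run-length encoder, which suits waveform-style output with long constant runs. Anything that shrinks a frame enough to fit in the combined PRU memories would do.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise run-length encoder emitting (count, value) pairs,
   with runs capped at 255. Returns the encoded length, or 0 if
   dst can't hold the output. */
size_t rle_encode(const uint8_t *src, size_t n, uint8_t *dst, size_t cap)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && src[i + run] == src[i] && run < 255)
            run++;
        if (out + 2 > cap)
            return 0;
        dst[out++] = (uint8_t)run;
        dst[out++] = src[i];
        i += run;
    }
    return out;
}
```

The decoder is the cheap half, and it's the half the PRU would run: read a count, repeat a byte, which fits in a few PASM instructions per pair.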

I hope this might help someone else with similar requirements.

William Hermans

Mar 24, 2017, 1:03:37 PM3/24/17
to beagl...@googlegroups.com
On Fri, Mar 24, 2017 at 9:34 AM, ags <alfred.g...@gmail.com> wrote:
@William Hermans I thought I'd share the result of my efforts to reliably stream data from ARM host (Linux userspace) to PRU.

I instrumented the PRU ASM code to use the CYCLE register for very precise measurements. I ran tests that kept track of how many times, for how long, and with what "worst offender" the PRU was stalled waiting for data from the ARM host. I used this to test my current implementation using select(), then replaced select() with usleep() (and nanosleep()), and then again with a loop that had no sleep at all, just a brute-force busy wait that never released the CPU. As it turns out, the results were surprising. Using usleep() (and similar methods), the number of stalls, the overall stall time, and the worst-case stall time were all significantly worse than with the implementation using select(). Even the busy-wait loop without sleep() was worse. I did a bit of research: sleep() and related methods are implemented using a syscall (sleep() used to be built on alarm() in the olden days, so I read), so getting through the call gate and the context swap happens with sleep() just as it does with select(). My theory is that select() is more efficient precisely because of this: one call to select() incurs one system call / context swap per interrupt. The process is put on the not-running list and the OS continues on; when a trigger event happens, the OS returns the process to the running list and control goes back to user space. With the sleep() method there are many calls per "interrupt", polling some memory location looking for the signal from the PRU. So what is handled by one userspace -> kernelspace -> userspace transition with select() could require dozens of these transitions using sleep().

I don't claim to be an expert, and if there is a flaw in this theory I'm open to hearing what it is. But this is my theory at the current moment.

I've honestly no idea how you're implementing what I suggested, so I can't really comment on what's going on. sleep() won't work, though, and I'm not sure how usleep() is implemented on your particular OS (Debian Linux), but usleep() on bare-metal microcontrollers is usually fewer than ten lines of code. I want to say fewer than six, but it's been a while since I've looked through an implementation.

Your findings are also surprising to me, but I cannot help feeling that you did not implement the busy-wait loop as I expected, or perhaps there is something else going on that we haven't discussed. If you used "interrupts" with the busy-wait loop, that's not how I intended it to be used; the busy-wait loop was to be in place of your interrupt code, and that would explain why it could have been slower. To be sure, there are potentially many other things that could be the culprit / co-culprit in your situation. It's not always easy to talk about these things at a high level without making sure everything is understood on both sides of the conversation; without seeing your code, I can't really say more with surety.

So what I ended up doing is compressing the data so that one "frame" fits in PRU memory at once. The PRU needs to send a full frame out with precise (microsecond) timing for all data in that frame; between frames there is slack. By compressing the data, I can load a full frame into the PRU0/1 DRAMs and shared RAM, then kick off writing out the frame. Now everything is (or appears to be) deterministic in the timing of all transfers between registers, scratch, and PRU DRAM, so I've sidestepped the problem of unpredictable latency waiting for data from the ARM host.

I hope this might help someone else with similar requirements.

Yeah, there is usually more than one way to do the same thing. That's why I mentioned data packing, as I had a feeling it could at least be useful for you.

Przemek Klosowski

Mar 24, 2017, 2:10:41 PM3/24/17
to beagl...@googlegroups.com
On Fri, Mar 24, 2017 at 1:03 PM, William Hermans <yyr...@gmail.com> wrote:
> I've honestly no idea how you're implementing what I suggested, so I can't
> really comment on what's going on. sleep() won't work though, and I'm not
> sure how usleep() is implemented for your particular OS( Debian Linux ), but
> usleep() on micro-controllers( bare metal ) is usually less than 10 lines of
> code. I want to say less than 6 lines of code, but it's been a while since
> I've looked through an implementation.

On a bare-metal microcontroller, sleep() is a busy loop but in Linux
sleep/usleep/nanosleep() results in a system call, which explains the
latency differences. BTW, a busy loop on Linux could still be
interrupted and result in latency.
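On the "busy loop can still be interrupted" point: one mitigation short of an RT kernel is asking the stock scheduler for a SCHED_FIFO real-time priority, which keeps ordinary time-shared tasks from preempting the transfer loop (kernel threads and hardware interrupts still can). This needs root or CAP_SYS_NICE, and the function name here is just for illustration:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>

/* Request SCHED_FIFO at `prio` (1..99 on Linux) for the calling
   process. Returns 0 on success, -1 on failure (e.g. not root,
   or an out-of-range priority). */
int go_realtime(int prio)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = prio;
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

One caution: a SCHED_FIFO task that spins without ever sleeping can starve the rest of the system, which is exactly why the usleep() in the polling loop discussed above still matters.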

William Hermans

Mar 24, 2017, 3:17:00 PM3/24/17
to beagl...@googlegroups.com
On Fri, Mar 24, 2017 at 11:10 AM, Przemek Klosowski <przemek....@gmail.com> wrote:

On a bare-metal microcontroller, sleep() is a busy loop but in Linux
sleep/usleep/nanosleep() results in a system call, which explains the
latency differences. BTW, a busy loop on Linux could still be
interrupted and result in latency.

The only problem I have with that train of thought is that I've written code that literally handled all 200 ksps of the ADC, using usleep(). Prior to that, I implemented nearly exactly the same thing I was trying to explain here, but with both sides of my project in userspace: one side reading from the CAN bus and decoding PGNs in real time, the other half taking that data and putting it out to a web page via WebSockets. When I tested this with redundant data, I was getting 2000+ WebSocket messages a second to the web client, where various other methods like select() and poll() achieved fewer than 20 messages a second.

So, I'm not arguing, but rather confused as to why this would work for me, and not for someone else.