Load and execute PRU code from bare-metal application

Satz Klauer

Nov 8, 2013, 2:09:25 AM

to beagl...@googlegroups.com

Hi,

Now that some experimental bare-metal code is running smoothly on my BBB, I plan to use the PRU for some real-time tasks (mainly bit-banging some GPIO outputs).

Unfortunately documentation and examples seem to be very rare, and the TRM is very detailed, at times far too detailed to give an overview of the whole story. There is some PRU code available at https://github.com/beagleboard/am335x_pru_package but the host-side code seems to expect a running Linux system.

So my question: are there any examples/documentation available out there that show/describe how to

- enable the PRU

- load code into PRU

- execute that code on PRU

- exchange data between CPU and PRU (seems to be via some shared memory?)

from within bare-metal code? Starterware itself seems to ignore the PRU part completely; there is nothing helpful there apart from an unused header file...

Thanks for all ideas, tips and suggestions!

BryanB

Mar 26, 2015, 3:38:50 AM

to beagl...@googlegroups.com

Hi Satz,
Did you ever make any progress with your questions?

I have successfully used Starterware on the BBB and a TI PRU-Cape with its supporting software to load and run PRU program examples.
However, I haven't found any examples on communicating between the ARM and the PRUs via shared memory. Have you?
I did get a simple transfer to work but am not sure about ensuring mutual exclusion or the best way to set up the shared memory.
Bryan

Charles Steinkuehler

Mar 26, 2015, 9:15:39 AM

to beagl...@googlegroups.com

On 3/26/2015 2:38 AM, BryanB wrote:
>
> Hi Satz,
> Did you ever make any progress with your questions?
>
> I have successfully used Starterware on the BBB and a TI PRU-Cape with its
> supporting software to load and run PRU program examples.
> However, I haven't found any examples on communicating between the ARM and
> the PRUs via shared memory. Have you?
> I did get a simple transfer to work but am not sure about ensuring mutual
> exclusion or the best way to set up the shared memory.

There is no single correct or "best" way to implement communications
between the ARM core and the two PRUs. This is a standard problem in
all multiple core machines, and you will find a lot of material with a
quick Google search. Depending on your application you may want to use
things like mailbox registers, lockless queues, interrupt signaling,
req/ack handshakes, etc.

--
Charles Steinkuehler
cha...@steinkuehler.net

BryanB

Mar 26, 2015, 6:35:31 PM

to beagl...@googlegroups.com

Thanks for your reply Charles. My question was not well formulated.
I have an understanding of the theory of mutual exclusion. However, applying the theory using Starterware on the BeagleBone Black is where I am making slow progress.
Examples of ANY method would be useful. One simple method I have used in the past is spinlocks and shared memory, but I haven't been able to get that working yet on the BBB with Starterware.
I posted a more detailed question on the TI E2E Starterware forum but haven't received any replies as yet (perhaps it was also a badly formulated question!).
https://e2e.ti.com/support/embedded/starterware/f/790/t/410442
Thanks again.

Charles Steinkuehler

Mar 26, 2015, 6:51:02 PM

to beagl...@googlegroups.com

Pretty much all of the memory is shared, in that both the ARM core and
the PRUs can see it.

Using the DDR system memory is problematic, however. It is both more
cumbersome to use on the PRU side (accessed via the interconnect bus it
stalls the PRU while reading), and on the ARM side you won't ever see
any changes unless you're careful about your cache management (typically
using kernel-mode code and the same sort of memory semantics required
when doing DMA transactions).

I'd recommend just using the PRU data memory space for anything like
semaphores or mailboxes. The memory is already mapped with the proper
access flags to avoid ARM side caching issues, and the PRUs can access
the memory without stalling. The only real reason to use anything but
the PRU shared memory is if your data set is larger than the 8K/12K
RAMs will support.

I looked at your TI post, but it doesn't make much sense to me. I'm not
very familiar with starterware, or with the spinlock register you refer
to. I can say that the ARM atomic bus transactions are unlikely to work
properly between the PRU and the ARM if you're using the PRU shared
memory, but there are many other synchronizing constructs you can use.
The mechanisms I've used in my code rely on unidirectional atomic
access, which works well. The ARM writes values into the PRU shared
memory which the PRU reads, and the PRU writes to *DIFFERENT* locations
which the ARM side reads. With only a single writer for each memory
address, there is no need for an atomic "read-modify-write" as needed
for a traditional spinlock. It's possible to directly build lockless
work queues and req/ack handshakes out of this sort of primitive, and if
you really want a spinlock, you could build it on top of req/ack.
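
The single-writer scheme described above can be sketched as a small req/ack mailbox. This is only an illustration under stated assumptions, not code from any real project: the `mailbox_t` layout, the function names, and the "double the argument" placeholder work are all hypothetical, and the structure is assumed to live somewhere in PRU data RAM that both sides can see. The PRU side is shown in C for readability.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mailbox layout. Every field has exactly one writer,
 * so no atomic read-modify-write is ever needed. */
typedef struct {
    volatile uint32_t req;    /* ARM only: sequence number of the latest request */
    volatile uint32_t arg;    /* ARM only: payload, valid once req has changed   */
    volatile uint32_t ack;    /* PRU only: sequence number it has completed      */
    volatile uint32_t reply;  /* PRU only: result, valid once ack == req         */
} mailbox_t;

/* ARM side: post a request unless the previous one is still pending. */
static bool arm_post(mailbox_t *mb, uint32_t arg)
{
    if (mb->ack != mb->req)
        return false;            /* PRU has not acknowledged the last request */
    mb->arg = arg;               /* write the payload first ...               */
    /* On real hardware a memory barrier belongs here. */
    mb->req = mb->req + 1u;      /* ... then publish it by bumping req        */
    return true;
}

/* ARM side: true when nothing is pending; *reply then holds the last result
 * (0 if no request was ever posted). */
static bool arm_reply_ready(const mailbox_t *mb, uint32_t *reply)
{
    if (mb->ack != mb->req)
        return false;
    *reply = mb->reply;
    return true;
}

/* PRU side: handle one pending request, if any. */
static bool pru_service(mailbox_t *mb)
{
    uint32_t seq = mb->req;
    if (seq == mb->ack)
        return false;            /* nothing new */
    mb->reply = mb->arg * 2u;    /* placeholder for the real work */
    mb->ack = seq;               /* acknowledge last, after the reply is written */
    return true;
}
```

The ARM never writes `ack` or `reply` and the PRU never writes `req` or `arg`, which is the property that makes the handshake work without a lock.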

--
Charles Steinkuehler
cha...@steinkuehler.net

Karl Karpfen

Mar 30, 2015, 8:31:45 AM

to beagl...@googlegroups.com

On Thursday, 26 March 2015 08:38:50 UTC+1, BryanB wrote:

> Did you ever make any progress with your questions?

I'd suggest having a look at this thread: https://groups.google.com/forum/#!category-topic/beagleboard/pru/rCO-2nKynVE

Bill M

Jun 10, 2015, 10:55:13 AM

to beagl...@googlegroups.com

Greetings Charles,

Does the PRU stall when writing to memory outside of the PRU address space? I am working on interfacing a cheap camera to the PRU and want to have it write to a 640 x 480 buffer. The PRU will only ever write to the buffer and the ARM core will only ever read it, so I don't see contention being an issue. However, the space I need is more than all the PRU memory combined offers, so I definitely need to use DDR.

My concern is that the PRU won't be able to write the camera data out fast enough, as there will be 8 parallel bits coming in every cycle at 12 MHz. I can shift 4 bytes in at a time and write them out one DWORD at a time (which I guess would make the best use of the bus), but that is still a 3 MHz pace. Should the OCP bus be able to handle this? Any info appreciated.

Thanks,

Bill Merryman

William Hermans

Jun 10, 2015, 12:24:26 PM

to beagl...@googlegroups.com

Here is something for you to look at, Bill: http://comments.gmane.org/gmane.comp.hardware.beagleboard.user/59975

Charles and a couple of other people talk there about how many cycles reading from and writing to various addresses takes. I'm not sure whether this will answer your question thoroughly. One user suggests using PRU0 to write to the PRU shared RAM while PRU1 takes this data and writes it to DDR, instead of using DMA.
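
The PRU0-to-PRU1 hand-off through shared RAM mentioned above could look roughly like the following single-producer, single-consumer ring. It is a sketch under stated assumptions rather than anything taken from the linked thread: `ring_t`, `RING_WORDS` and the function names are hypothetical, the ring is assumed to sit in the PRU shared RAM, and the code is plain C instead of PRU assembly.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_WORDS 256u   /* hypothetical size; must be a power of two */

typedef struct {
    volatile uint32_t head;              /* written only by the producer (PRU0) */
    volatile uint32_t tail;              /* written only by the consumer (PRU1) */
    volatile uint32_t data[RING_WORDS];  /* 1 KB of sample words */
} ring_t;

/* Producer: queue one 32-bit word; returns false if the ring is full. */
static bool ring_put(ring_t *r, uint32_t word)
{
    uint32_t head = r->head;
    if (head - r->tail == RING_WORDS)
        return false;                          /* consumer has fallen behind */
    r->data[head & (RING_WORDS - 1u)] = word;  /* store the data first ...   */
    r->head = head + 1u;                       /* ... then publish it        */
    return true;
}

/* Consumer: copy up to max_words queued words into dst (for example a
 * staging buffer that is then written to DDR as one burst). */
static uint32_t ring_get_burst(ring_t *r, uint32_t *dst, uint32_t max_words)
{
    uint32_t tail = r->tail;
    uint32_t avail = r->head - tail;           /* unsigned wrap-around is fine */
    uint32_t n = avail < max_words ? avail : max_words;
    for (uint32_t i = 0; i < n; i++)
        dst[i] = r->data[(tail + i) & (RING_WORDS - 1u)];
    r->tail = tail + n;                        /* release the slots */
    return n;
}
```

Because `head` and `tail` each have a single writer, neither side needs a lock; the free-running counters rely on unsigned wrap-around, which is why the size has to be a power of two.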


Charles Steinkuehler

Jun 10, 2015, 2:03:22 PM

to beagl...@googlegroups.com

In addition to the other thread, I'd suggest looking at the
BeagleLogic code. It's possible to move _large_ amounts of data
through the PRU to the DRAM, but it requires some finesse.

A few additional comments in-line, below.

On 6/10/2015 11:24 AM, William Hermans wrote:
> Here is something for you to look at Bill.
> http://comments.gmane.org/gmane.comp.hardware.beagleboard.user/59975
>
> Charles, and a couple other people talk some about cycles and how
> many cycles reading / writing takes to various addresses. Not sure
> this will answer your question thoroughly or not. One user suggests
> using PRU0 to write to the PRU shared RAM, while PRU1 takes this
> data, and writes it to DDR. Instead of using DMA.
>
> On Wed, Jun 10, 2015 at 7:55 AM, Bill M <billme...@gmail.com>
> wrote:
>
>> Greetings Charles,
>>
>> Does the PRU stall when writing to memory outside of the PRU
>> address space?

It depends. Writes are posted, so they won't stall for long if you
aren't saturating the internal SoC bus.

>> I am working on interfacing a cheap camera to the PRU and want
>> to have it write to a 640 x 480 buffer. So the PRU will only ever
>> write to the buffer, and the ARM core will only ever read the
>> buffer, so I don't see contention being an issue, but the amount
>> of space I will need is bigger than what all the PRU memory
>> combined offers so I definitely need to use DDR. My concern is
>> that the PRU won't be able to write the data from the camera out
>> fast enough, as there will be 8 parallel bits coming in every
>> cycle at 12Mhz. I can shift 4 bytes in at a time and write it out
>> DWORD at a time (which I guess would make the best use of the
>> bus), but that is still a 3Mhz pace. Should the OCP bus be able
>> to handle this?

It's possible to move a *LOT* more data than that (again, see the
BeagleLogic code). Note that you will generally get better results
with burst transfers (ie: moving many 32-bit words at a time) than by
writing individual DWORDS. Since there are two PRUs, for maximum
throughput it makes sense to have one PRU doing the data acquisition
and the other PRU writing the data to system memory. You can
communicate up to the entire PRU register set "broadside" between the
two PRU cores in one clock using the exchange instructions.

--
Charles Steinkuehler
cha...@steinkuehler.net

Karl Karpfen

Jun 11, 2015, 3:19:38 AM

to beagl...@googlegroups.com

2015-06-10 18:24 GMT+02:00 William Hermans <yyr...@gmail.com>:

> Charles, and a couple other people talk some about cycles and how many cycles reading / writing takes to various addresses. Not sure this will answer your question thoroughly or not. One user suggests using PRU0 to write to the PRU shared RAM, while PRU1 takes this data, and writes it to DDR. Instead of using DMA.

I don't know if this is a good idea. I think this would lock both PRUs when accessing shared RAM, because only one can access it at a time. As far as I remember from the TRM, a PRU writing to DRAM would halt the main core, not vice versa. If this is correct, the additional write/read operation to shared RAM not only wastes a full PRU core but also adds some delay without gaining anything.

On the other hand: how much data do you really retrieve from your camera? And how long would the data transfer to DDR really take compared to the remaining time between two pictures?

Bill M

Jun 11, 2015, 3:53:11 PM

to beagl...@googlegroups.com

William, Charles, and Karl,

I can't thank you enough for all of your input. My intention is to use a cheap OV7670 camera to capture a video stream for a robotics project (I've seen other projects that suggest at least image capturing from the camera is possible by direct output, as opposed to using I2C).

I would like to keep the other PRU free to run a half-duplex UART out to some Robotis Dynamixel servos. I originally tried to read the camera from a program running on the main core. I had TIMER7 putting out a 12 MHz clock to the camera, and the VSYNC, HREF, PCLK, and the 8-bit parallel video lines coming into one of the GPIO banks. The VSYNC line appeared to be signaling 15 times a second, which was expected. An oscilloscope reading suggested the other lines were signaling at about the right intervals. It just seemed like something in the process of reading the GPIO pins was not keeping up. I thought that since the main core runs at 1 GHz and this is bare metal, I would have plenty of cycles between PCLK signals to read and handle the data, but I was only getting the expected data every so often, with a lot of garbage coming in between. So I decided to go the PRU route, hoping the more direct GPIO access and determinism would make for a reliable process.

Since the camera is running at 15 frames a second at 640 * 480 (YUV, so 2 bytes per pixel), I would have to pump about 9 MB a second to wherever this is getting stored, with at least 614 KB to store one frame (and I would kind of like to back-buffer it for computer vision processing, so double that). If this is just crazy, please let me know.
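
A quick check of these numbers, as a small C sketch. The resolution, pixel size and frame rate come from the paragraph above; the function names are hypothetical, and the figures are averages over active pixels only, so the peak rate while a line is being clocked out will be higher.

```c
#include <stdint.h>

enum {
    WIDTH = 640,
    HEIGHT = 480,
    BYTES_PER_PIXEL = 2,   /* YUV, 2 bytes per pixel */
    FPS = 15
};

/* Size of one frame in bytes: 640 * 480 * 2 = 614,400. */
static uint32_t frame_bytes(void)
{
    return (uint32_t)(WIDTH * HEIGHT * BYTES_PER_PIXEL);
}

/* Average data rate in bytes per second: 614,400 * 15 = 9,216,000. */
static uint32_t bytes_per_second(void)
{
    return frame_bytes() * FPS;
}

/* Number of 32-bit writes per second if every write carries one DWORD. */
static uint32_t dword_writes_per_second(void)
{
    return bytes_per_second() / 4u;
}

/* Clock cycles available per DWORD write for a core running at clock_hz
 * (rounded down). */
static uint32_t cycles_per_write(uint32_t clock_hz)
{
    return clock_hz / dword_writes_per_second();
}
```

With these inputs, `cycles_per_write(1000000000u)` gives 434 cycles on a 1 GHz core, and `cycles_per_write(200000000u)` gives 86 cycles for a PRU at the 200 MHz clock mentioned elsewhere in this thread.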

BTW, I haven't actually written the code to read the PRU GPIO pins yet. Do I have to set up the pinmux in the regular pad control registers, or is their muxing controlled completely by the PRU registers?

Thanks again for all of your help!

Rick Mann

Jun 11, 2015, 4:16:19 PM

to beagl...@googlegroups.com

> On Jun 11, 2015, at 12:53 , Bill M <billme...@gmail.com> wrote:
>
> I would like to keep the other PRU free to run a half duplex UART out to some Robotis Dynamixel servos

Can you not use one of the many USART peripherals on the SoC for this?

--
Rick Mann
rm...@latencyzero.com

Matthijs van Duin

Jun 14, 2015, 7:36:46 AM

to beagl...@googlegroups.com

On Wednesday, 10 June 2015 20:03:22 UTC+2, Charles Steinkuehler wrote:

> In addition to the other thread, I'd suggest looking at the
> BeagleLogic code. It's possible to move _large_ amounts of data
> through the PRU to the DRAM, but it requires some finesse.

My first intuition would be using EDMA.

On Thursday, 26 March 2015 23:51:02 UTC+1, Charles Steinkuehler wrote:

> on the ARM side you won't ever see any changes unless you're careful about your cache management (typically using kernel-mode code and the same sort of memory semantics required when doing DMA transactions).

Mapping the memory as uncacheable and using appropriate barriers suffices (privilege is not needed, though typically baremetal code tends to run privileged anyway).

> I'd recommend just using the PRU data memory space for anything like
> semaphores or mailboxes. The memory is already mapped with the proper
> access flags to avoid ARM side caching issues, and the PRUs can access
> the memory without stalling.

Well, in a bare-metal application nothing is "already mapped". In fact an issue here is that the PRU memory space resides within the same 1MB section as peripherals, so if both are accessed from the A8 you'd need to set up a page table for that section to make PRU memory space normal uncacheable while making the peripheral space device-type. (Well, you don't *have* to, but if you care about performance...)

I do agree with having the PRUs stick to PRU memory as much as possible.

> I looked at your TI post, but it doesn't make much sense to me. I'm not
> very familiar with starterware, or with the spinlock register you refer
> to. I can say that the ARM atomic bus transactions are unlikely to work
> properly between the PRU and the ARM if you're using the PRU shared
> memory, but there are many other synchronizing constructs you can use.

I'm pretty sure that ARM atomic bus transactions (whether the old locked SWP or the newer load/store-exclusive) will not accomplish anything useful, if they get anywhere beyond the CPU boundary at all.

TI is however referring to the hwspinlock peripheral which (along with the mailbox peripheral) is specifically intended for inter-core synchronization. I personally would not be eager to use hwspinlock though.

I'd also try to go for unidirectional messaging like you're saying. The mailbox peripheral looks quite reasonable for inter-core notification, especially if this is infrequent and you want to avoid polling, though I think you also may be able to do that purely with the rather elaborate PRUSS interrupt controller.

When writing from the A8 to uncacheable memory, do remember to finish off with a memory barrier instruction since the A8 is allowed to (and *does*) buffer writes indefinitely otherwise.
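
As a rough illustration of that last point, a write to a PRU mailbox from the A8 might be wrapped like this. It is a sketch under stated assumptions: `write_barrier` and `publish` are hypothetical names, the `dsb` path assumes an ARMv7-A target built with GCC or Clang, and the C11 fence is only a stand-in so the same code compiles on other machines.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Push buffered writes out of the core before continuing. */
static inline void write_barrier(void)
{
#if defined(__arm__)
    __asm__ volatile ("dsb" ::: "memory");       /* ARMv7-A data synchronization barrier */
#else
    atomic_thread_fence(memory_order_seq_cst);   /* stand-in for non-ARM builds */
#endif
}

/* Write a payload word, then raise a flag that the PRU polls.
 * Both pointers are assumed to refer to uncacheable PRU memory. */
static void publish(volatile uint32_t *payload, volatile uint32_t *flag,
                    uint32_t value)
{
    *payload = value;
    write_barrier();   /* payload must be visible before the flag is */
    *flag = 1u;
    write_barrier();   /* do not let the flag linger in the write buffer */
}
```

The second barrier is the one the paragraph above is about: without it the flag write may sit in the A8's write buffer while the PRU keeps polling the old value.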

Bill M

Jun 15, 2015, 10:33:41 AM

to beagl...@googlegroups.com

Greetings All,

Wow, I'm glad this has generated so much conversation. Thanks to everyone who has chimed in.

To Rick M, one of the things that attracted me to the BBB was that it has several available UARTs, but I also need things to run in a deterministic fashion, since I need to control an array of servos and updating needs to happen 128 times a second, which means a packet of several dozen bytes going out that frequently. After reading a bit more in the TRM about the PRU UART, I don't think a PRU UART will be feasible, since it looks like they top out at around 300 kbps and I need a megabit. I'm hoping things will be sufficiently deterministic since I'm running bare metal; I will drive the update loop with a timer interrupt and have the UART just feed things out as fast as the line will consume them. I know things will run more slowly if I don't use caching, but if I disable caching, does that eliminate any pipelining? I'm a noob when it comes to pipelining and caching, since I've only ever hacked on AVR microcontrollers and a Cortex M3, where those weren't considerations. I'm a line of business programmer in my day job :(.

Matthijs, does EDMA offer that big a performance boost? Most of my background up to this point has been just coding things for handling hardware and timer interrupts and UART communication. I'm an extreme noob when it comes to the more involved hardware stuff like DMA. Does going from the PRU to DDR pass over the L3 interconnect whether it's DMA or regular DWORD-by-DWORD assignment? I'm figuring this will have to pump 9 MB a second to DDR, but with each write being a DWORD, this should only be one write roughly every 430 clock cycles for the main core (assuming my math is correct). I have to admit, my head is swimming with some of what you wrote, so I definitely need to crack the books harder. If you know of any good references on MMUs, caching, and pipelining for beginners, let me know (I also need to educate myself more on kernel programming). I just imagine there has to be some good way to get good throughput from the PRUs to the rest of the system, otherwise the PRUs wouldn't be very useful to the rest of the system, but again, I may just be naïve.

In the meantime, I'm just finishing getting my development environment set up, since I'm using GCC with Eclipse as my IDE (up until now, while learning my way around Starterware and the PRU tools, I've just been using Notepad and the command line >_<). I've got it set up to build the PRU code, convert it to C header files, and include them in my main code, which is compiled into the final bin and uses memcpy to load the PRU code and start the PRU at run time (primitive, but it works for my purposes right now). Next I'm going to start writing code to read the camera, and I'll report back my results. Maybe I'll eventually take the dive into Linux (I've only waded in so far).
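
For readers wondering what the load-and-start step might look like, here is a minimal sketch. It is not the code described above: it swaps memcpy for a word-by-word copy through volatile pointers, the function names are hypothetical, and the addresses and CONTROL register bits are assumptions about the AM335x PRU-ICSS that should be checked against the TRM. It also assumes the PRU-ICSS clock has already been enabled and the subsystem taken out of reset, which is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed AM335x PRU-ICSS addresses (not from this thread); verify in the TRM. */
#define PRUSS_BASE        0x4A300000u
#define PRU0_CTRL_OFFSET  0x00022000u   /* PRU0 control registers */
#define PRU0_IRAM_OFFSET  0x00034000u   /* PRU0 instruction RAM   */

/* Assumed bit layout of the PRU CONTROL register. */
#define PRU_CTRL_SOFT_RST_N  (1u << 0)  /* writing 0 resets the core   */
#define PRU_CTRL_ENABLE      (1u << 1)  /* 1 = core runs, 0 = halted   */

/* Halt the PRU, copy the firmware image into instruction RAM one word at a
 * time, then reset and enable the core so it starts from address 0. */
static void pru_load_and_start(volatile uint32_t *iram,
                               volatile uint32_t *control,
                               const uint32_t *code, size_t words)
{
    *control = PRU_CTRL_SOFT_RST_N;   /* ENABLE = 0: halt without resetting */
    for (size_t i = 0; i < words; i++)
        iram[i] = code[i];            /* 32-bit accesses only */
    *control = PRU_CTRL_ENABLE;       /* SOFT_RST_N = 0 pulses reset, ENABLE = 1 runs */
}

/* Target-only wrapper using the assumed physical addresses above. */
static inline void pru0_load_and_start(const uint32_t *code, size_t words)
{
    pru_load_and_start(
        (volatile uint32_t *)(uintptr_t)(PRUSS_BASE + PRU0_IRAM_OFFSET),
        (volatile uint32_t *)(uintptr_t)(PRUSS_BASE + PRU0_CTRL_OFFSET),
        code, words);
}
```

Passing the register and RAM pointers in as parameters keeps the copy logic testable on a host; only the thin wrapper depends on the physical memory map.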

Thanks, again to everyone!

William Hermans

Jun 15, 2015, 11:22:00 AM

to beagl...@googlegroups.com

Hey Bill,

If you need determinism, and *if* you decide to run Linux (or maybe just experiment), you can always look into Xenomai.

Now keep in mind that I have no hands-on experience personally. But it could be as easy as writing a current image to an SD card and apt-get installing one of the latest Xenomai kernels. Followed by learning about Xenomai of course... something I've been wanting to do myself, but have not gotten around to.

Charles Steinkuehler

Jun 15, 2015, 11:25:40 AM

to beagl...@googlegroups.com

On 6/15/2015 10:21 AM, William Hermans wrote:
> Hey Bill,
>
> If you're needing deterministic, and *if* you decide to run Linux( or maybe
> just experiment ), You can always look into Xenomai.
>
> Now keep in mind that I have no hands on personally. But it could be as
> easy as writing a current image to sdcard, and apt-get install-ing one of
> the latest xenomai kernels. Followed by learning about Xenomai of course .
> . . something I've been wanting to do myself, but have no gotten around to.

Even easier, the Machinekit images run Xenomai "out of the box", so no
messing with installing kernels and configuring the run-time
environment, just boot and start playing:

http://elinux.org/Beagleboard:BeagleBoneBlack_Debian#BBW.2FBBB_.28All_Revs.29_Machinekit

--
Charles Steinkuehler
cha...@steinkuehler.net

William Hermans

Jun 15, 2015, 11:42:30 AM

to beagl...@googlegroups.com

Hey Charles,

Is there any in-depth documentation on Machinekit? Preferably all in one source... The documentation writers for Machinekit do not seem to get that developers do *not* enjoy a seemingly endless round-robin of pointless links...

Charles Steinkuehler

Jun 15, 2015, 3:57:31 PM

to beagl...@googlegroups.com

On 6/15/2015 10:42 AM, William Hermans wrote:
> Hey Charles,
>
> Is there any in depth documentation on machinekit ? Preferably all in one
> source . . . As the documentation implementers for machinekit do not seem
> to get that developers do *not* enjoy a seemingly endless round-robin of
> pointless links . . .

<heh>

The documentation is very much a work in progress, but what's
available in the form of official docs are in the github repo
(separate from the code):

https://github.com/machinekit/machinekit-docs

Much of this is still from LinuxCNC and while it's getting updated,
given the speed of code changes at the moment the docs are a bit behind.

If you're looking for details on the BBB/Xenomai install, that's not
really within the realm of the Machinekit docs repo. The best place
to look for the details and "secret sauce" of building a working image
is to actually grab the build scripts from github. Robert Nelson is
now building the Machinekit images as part of his "universal SoC build
farm", so the Machinekit build scripts are right next to (and
virtually identical to) the scripts used to craft the other BeagleBone
images:

https://github.com/RobertCNelson/omap-image-builder

To make a Machinekit image, just:

./RootStock-NG.sh -c machinekit-debian-wheezy

...like it says at the bottom of the readme.md file.

--
Charles Steinkuehler
cha...@steinkuehler.net

William Hermans

Jun 15, 2015, 5:20:53 PM

to beagl...@googlegroups.com

> If you're looking for details on the BBB/Xenomai install, that's not
> really within the realm of the Machinekit docs repo. The best place
> to look for the details and "secret sauce" of building a working image
> is to actually grab the build scripts from github. Robert Nelson is
> now building the Machinekit images as part of his "universal SoC build
> farm", so the Machinekit build scripts are right next to (and
> virtually identical to) the scripts used to craft the other BeagleBone
> images:
>
> https://github.com/RobertCNelson/omap-image-builder
>
> To make a Machinekit image, just:
>
> ./RootStock-NG.sh -c machinekit-debian-wheezy
>
> ...like it says at the bottom of the readme.md file.

Thanks for your answer, Charles. However, what I would like to find out is how Machinekit is different from, say, Debian. Not so much the difference between distros (because I'm thinking it's "just" a kernel with *some* tools), or determinism, but how one uses it to full advantage.

For all I know, one would use it like you'd use Linux in general. My guess would be that this is not the case, however. Also, knowing some guidelines for developing deterministic code would be very handy too.

So basically, stuff that an experienced developer should know when using Machinekit, but doesn't from lack of experience *with* Machinekit. Which libc is expected... etc.

Does that make any sense? Maybe I'm looking in the wrong place so far?

Rick Mann

Jun 15, 2015, 5:48:46 PM

to beagl...@googlegroups.com

> On Jun 15, 2015, at 07:33 , Bill M <billme...@gmail.com> wrote:
>
> To Rick M, one of the things that attracted me to the BBB was that it has several available UARTS, but I also need things to run in a deterministic fashion since I need to control an array of servos and updating needs to happen 128 times a second, which means a several dozen byte packet going out that frequently. After reading through a bit more in the TRM about the PRU UART, I don't think a PRU UART will be feasible since it looks like they top out at around 300Kbs, and I need a megabit. I'm hoping things will be sufficiently deterministic since I'm running bare metal, and will drive the update loop with a timer interrupt and have the UART just feed things out as fast as the line will consume it. I know things will run more slowly if I don't use caching, but if I disable caching, does that eliminate any pipelining? I'm a noob when it comes to pipelining and caching, since I've only ever hacked on AVR microcontrollers and a Cortex M3, where those weren't considerations. I'm a line of business programmer in my day job :(.

I'm not sure exactly what you're using the UART for. Are your servos controlled via serial packets of some kind? Or are they typical hobby PWM servos? If the former, then I would have thought using a UART on the ARM core (not the PRU) would be the best way to go. I'm assuming they can do a megabit, although that probably requires DMA.

It sounds like you're using the UART to communicate with the servos, and at a high rate. I can see why you'd want the timing to be right in that case. I don't really have any idea what the caching effects are.

Good luck!

--
Rick Mann
rm...@latencyzero.com

WZ9V

Jun 15, 2015, 9:24:46 PM

to beagl...@googlegroups.com

There are serial hobby servos nowadays. Futaba SBus is one example.

Charles Steinkuehler

Jun 16, 2015, 12:19:24 PM

to beagl...@googlegroups.com

On 6/15/2015 4:20 PM, William Hermans wrote:
>> If you're looking for details on the BBB/Xenomai install, that's not
>> really within the realm of the Machinekit docs repo. The best place
>> to look for the details and "secret sauce" of building a working image
>> is to actually grab the build scripts from github. Robert Nelson is
>> now building the Machinekit images as part of his "universal SoC build
>> farm", so the Machinekit build scripts are right next to (and
>> virtually identical to) the scripts used to craft the other BeagleBone
>> images:
>>
>> https://github.com/RobertCNelson/omap-image-builder
>>
>> To make a Machinekit image, just:
>>
>> ./RootStock-NG.sh -c machinekit-debian-wheezy
>>
>> ...like it says at the bottom of the readme.md file.
>
> Thanks for your answer Charles. However what I would like to find out is
> how is machinekit different from say Debian. Not so much in difference
> between distro's( because I'm thinking it's "just" a kernel with *some*
> tools ), or determinism, but how does one use it to their full advantage.

The Machinekit BBB image *IS* Debian, just with a Xenomai capable
kernel and some packages to make use of it pre-installed.

> So for all I know, one would use it like you'd use Linux in general. My
> guess would be this is not the case however. Also, knowing some guidelines
> while developing deterministic code would be very handy too.
>
> So basically, stuff that an experienced developer should know when using
> machinekit, but doesn't from lack of experience *with* machinekit. Which
> libc is expected . . . etc.
>
> Does that make any sense ? Maybe I'm looking in the wrong place so far ?

That makes sense, but is _way_ beyond the scope of a simple email,
particularly since I don't know how much you do or don't know about
coding for real-time.

If you're wanting to easily write deterministic code, you might want
to use PREEMPT_RT, which works really well on the x86 architecture and
is coming along on the ARM architecture. This allows you to write
"normal" C code, including making kernel syscalls (directly or via
libraries like libc) without losing real-time performance.

Xenomai runs in its own domain, and while you can call routines in
the Linux kernel, doing so breaks any guarantee of hard real-time
performance. So you have to write Xenomai drivers or directly talk to
any hardware you're expecting to have real-time performance.

Note that Machinekit is a project to control motors and other physical
things (ie: machines) that runs under several possible real-time
environments (currently Xenomai, PREEMPT_RT, RTAI, and even plain
Posix w/o real-time guarantees). The Machinekit images for the BBB
are simply a ready-to-run version of RCN's BBB Debian builds with
the Xenomai kernel and Machinekit packages pre-installed for ease-of-use.

--
Charles Steinkuehler
cha...@steinkuehler.net

William Hermans

Jun 16, 2015, 1:00:51 PM

to beagl...@googlegroups.com

> That makes sense, but is _way_ beyond the scope of a simple email,
> particularly since I don't know how much you do or don't know about
> coding for real-time.
>
> ...
>
> Note that Machinekit is a project to control motors and other physical
> things (ie: machines) that runs under several possible real-time
> environments (currently Xenomai, PREEMPT_RT, RTAI, and even plain
> Posix w/o real-time guarantees). The Machinekit images for the BBB
> are simply a ready-to-run version of the RCN's BBB Debian builds with
> the Xenomai kernel and Machinekit packages pre-installed for ease-of-use.

Thanks Charles. Your answer pretty much covered all my questions. I guess I could have been more succinct and said that I just wished to know if looking into Xenomai or Machinekit was a waste of time for my own purposes. It now seems that it is, at least for now.

Pretty much all I wanted was some form of Linux, that ran on a "tighter schedule". PREEMPT_RT sounds like where I may want to be.

I do know a bit about real-time coding, but would definitely not consider myself an expert. In the context of Linux . . . all I know is by reading. No hands on.

Bill M

Jun 17, 2015, 11:34:07 AM

to beagl...@googlegroups.com

Hi William,

Thanks for the suggestion. Actually, I had looked a little bit at Xenomai, and also Chibi-OS and a couple of other RTOSes. I was worried about adding complexity for myself, not really having experience in compiling kernels and not having much experience programming within Linux or using other tools within Linux. Also, I'm kind of a control freak. But I do like to try to learn something new with each project I work on, so I'll probably fight with the bare metal a little while longer, and then seriously consider moving to an OS-based environment. Since Linux controls a lot more of the world than most people give it credit for, I really should try to get fluent with it (I could add it to my resume then). It would also be nice to abstract away other things that an OS already has code for. For example, I would like to add a wireless USB NIC to the BBB, and if there is one that already has a driver and I can just write code to use it, that would be awesome.

Bill M

Jun 17, 2015, 11:35:15 AM

to beagl...@googlegroups.com

Thanks Charles, I hadn't seen that yet. I'm going to fight with the bare metal a little while longer, and then probably start exploring this.

Bill M

Jun 17, 2015, 11:58:38 AM

to beagl...@googlegroups.com

Hi Rick,

The servos I am using are Robotis Dynamixel servos. The servos themselves have (I believe) ATmega8 controllers in them to handle the actual PWM details. They use a 3-wire interface: 1 power, 1 ground, and 1 half-duplex 1 Mbps serial line. I don't know how tolerant of timing variation my setup would be. The way I've done it on my other setups is to use a timer-based interrupt that fires about every 8 milliseconds and updates the target values in memory, then pushes them to a circular buffer that feeds the UART, with a transmit-register-empty interrupt to pull the next byte. With the 1 Mbps line used by the servos being an order of magnitude slower than the CPU updating things, I was thinking maybe that would have the effect of smoothing out any hiccups, so if there was a millisecond or two of variation, it might not have a big impact. The other controllers I am using probably interface more immediately with their memory, but they are also running much more slowly (72 MHz being the fastest) and I haven't had to resort to anything like DMA for them, so I'm hoping that since the AM3359 runs so much faster than that, I won't have to here either (although I would like to get familiar with the workings of that at some point, and as the complexity of the code grows and the demands on the system increase, I may have to).
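
To put rough numbers on the slack in such a scheme, here is a small C sketch. The 1 Mbps line rate and 128 updates per second come from this thread; the 48-byte packet used in the note below is a hypothetical stand-in for "several dozen bytes", 8N1 framing (10 bits on the wire per byte) is an assumption, and the function names are made up.

```c
#include <stdint.h>

/* Assumed 8N1 framing: 1 start bit + 8 data bits + 1 stop bit per byte. */
#define BITS_PER_BYTE_8N1 10u

/* Time on the wire for one packet, in microseconds (rounded down). */
static uint32_t packet_time_us(uint32_t bytes, uint32_t baud)
{
    return (uint32_t)(((uint64_t)bytes * BITS_PER_BYTE_8N1 * 1000000u) / baud);
}

/* Length of one update period, in microseconds (rounded down). */
static uint32_t period_us(uint32_t updates_per_second)
{
    return 1000000u / updates_per_second;
}
```

Under those assumptions, `packet_time_us(48, 1000000)` is 480 microseconds out of a `period_us(128)` of 7812 microseconds, so the line is busy for only a small fraction of each update period.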

Thanks!

Matthijs van Duin

Jun 18, 2015, 10:03:23 AM

to beagl...@googlegroups.com

On 15 June 2015 at 16:33, Bill M <billme...@gmail.com> wrote:

> After reading through a bit more in the TRM about the PRU UART, I don't think a PRU UART will be feasible since it looks like they top out at around 300Kbs

Hmm, where'd you get that number? The PRU UART looks like the highest performance UART: it receives a 192 MHz functional clock and the datasheet specs 12 Mbps max (that would be using a /1 divider and 16x oversampling). The other UARTs receive a 48 MHz functional clock and spec max 3.6864 Mbps (/1 divider and 13x oversampling, so that would get you 3.6923 Mbps to be precise).
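
The arithmetic behind those numbers is just functional clock / (divider x oversampling). A small sketch, assuming a 16550-style divisor scheme; the function names `uart_baud` and `uart_divisor` are invented for this example and are not Starterware APIs:

```c
#include <assert.h>
#include <stdint.h>

/* Baud rate for a given functional clock, integer divider and
 * oversampling factor (16 or 13 in the cases discussed above). */
uint32_t uart_baud(uint32_t fclk_hz, uint32_t divisor, uint32_t oversample)
{
    assert(divisor != 0 && oversample != 0);
    return fclk_hz / (divisor * oversample);
}

/* Nearest integer divider for a requested baud rate. */
uint32_t uart_divisor(uint32_t fclk_hz, uint32_t baud, uint32_t oversample)
{
    assert(baud != 0 && oversample != 0);
    uint32_t den = baud * oversample;
    return (fclk_hz + den / 2u) / den;   /* round to nearest */
}
```

With these, `uart_baud(192000000, 1, 16)` gives 12000000 and `uart_baud(48000000, 1, 13)` gives 3692307, matching the figures above. By the same arithmetic, a 48 MHz UART with 16x oversampling would hit the 1 Mbps servo line exactly with a divider of 3; whether a particular UART mode accepts that setting should still be checked against the TRM.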

I've also noticed that UART0 cannot cope with too many consecutive writes, even if there's enough FIFO space: the FIFO pointers seem to get corrupted or something (I'm guessing a bug in the synchronization logic between the interface and functional clock domains). This only appears as an issue when trying to rapidly fill the UART FIFO from the Cortex-A8 in a tight loop (using posted writes). Inserting some dummy register write between consecutive data bytes fixes the issue, as does slowing down the loop in some other way. Using EDMA would probably also solve the problem.

I haven't tested the other UARTs, but I'd guess the other UARTs will have the same behaviour except for the PRUSS UART (due to ick/fck ratio).

> I know things will run more slowly if I don't use caching, but if I disable caching, does that eliminate any pipelining? I'm a noob when it comes to pipelining and caching, since I've only ever hacked on AVR microcontrollers and a Cortex M3, where those weren't considerations.

Heh, yeah I personally went from ARM7TDMI-based microcontrollers to the DM814x, a Cortex-A8 based TI SoC closely related to the AM335x... quite a bit of culture-shock there. "Wait, is this still an ARM processor?" o.O

I'm not sure what you mean by "eliminate any pipelining": pipelining is an intrinsic part of the design of almost any modern CPU, even the AVR (2-stage) and Cortex-M3 (3-stage), although they pale in comparison to the Cortex-A8 (14-stage, plus 10 more for NEON instructions). The PRU is a notable exception for being non-pipelined, which is deeply impressive considering it runs at 200 MHz and has 32-bit compare-and-branch instructions. In general pipelining becomes most visible in unpredictable branches, which take 1 cycle on PRU, 2 on AVR, 3 on the M3, and 14 on the A8.

However, especially since the A8 executes strictly in-order, memory accesses can stall the pipeline for quite a while, and I suspect this is what you mean. This is highly dependent on memory region attributes (including cacheability), which also means setting up the MMU and caches is absolutely essential on the Cortex-A8. This isn't very hard: for a baremetal application it typically suffices to set up the section translation table with the desired attributes (see http://community.arm.com/docs/DOC-10098 for an example), set the L2 cache enable bit in the Auxiliary Control Register (if not already set), and set the M, C, Z, and I bits of the Control Register (Z and I are already set by bootrom iirc).
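
To make the "section translation table with the desired attributes" step a bit more concrete, here is a rough sketch of building a flat (virtual == physical) table of 1 MiB sections using the ARMv7-A short-descriptor format. The helper names (`section_entry`, `build_flat_table`) and the policy of mapping everything outside DDR as device, execute-never are assumptions for illustration only. Code running from on-chip RAM would need its own normal-memory entry, and loading TTBR0, setting the domain access register for domain 0 and enabling the MMU are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Short-descriptor section entry fields (1 MiB per entry, domain 0). */
#define SECT_TYPE     (2u << 0)             /* bits[1:0] = 0b10: section */
#define SECT_B        (1u << 2)             /* bufferable */
#define SECT_C        (1u << 3)             /* cacheable */
#define SECT_XN       (1u << 4)             /* execute-never */
#define SECT_AP_RW    (3u << 10)            /* AP[1:0] = 0b11: full access */
#define SECT_TEX(x)   ((uint32_t)(x) << 12)

/* Memory types, encoded with TEX remap disabled. */
#define MEM_STRONGLY_ORDERED (SECT_TEX(0))
#define MEM_DEVICE           (SECT_TEX(0) | SECT_B)           /* shareable device */
#define MEM_NORMAL_UNCACHED  (SECT_TEX(1))
#define MEM_NORMAL_WRITEBACK (SECT_TEX(1) | SECT_C | SECT_B)  /* write-back, write-allocate */

/* Build one section descriptor; phys_addr must be 1 MiB aligned. */
uint32_t section_entry(uint32_t phys_addr, uint32_t attrs)
{
    assert((phys_addr & 0xFFFFFu) == 0);
    return phys_addr | attrs | SECT_AP_RW | SECT_TYPE;
}

/* Fill all 4096 entries (4 GiB): RAM as cacheable normal memory,
 * everything else as device memory that cannot be executed from. */
void build_flat_table(uint32_t table[4096], uint32_t ram_base, uint32_t ram_mib)
{
    for (uint32_t i = 0; i < 4096u; i++) {
        uint32_t addr = i << 20;
        int in_ram = addr >= ram_base && (addr - ram_base) < (ram_mib << 20);
        table[i] = section_entry(addr, in_ram ? MEM_NORMAL_WRITEBACK
                                              : (MEM_DEVICE | SECT_XN));
    }
}
```

On a BeagleBone Black one would presumably call this with `ram_base = 0x80000000` and `ram_mib = 512`. The table itself has to be 16 KiB aligned in memory before its address goes into TTBR0.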

One of the easiest ways to murder write-performance is by marking memory as "strongly ordered", which is the default for data access if the MMU is disabled. This makes the cpu wait synchronously on writes, so then you're looking at about 150-200 ns (= cycles @ 1 GHz) for each write, depending on the "ping time" from the cpu to the target. In contrast, writes to device or normal memory are buffered and therefore take 1 execution cycle as long as the buffer isn't full. The limiting factor in draining the buffer is that the cpu can only have one device write and one normal uncacheable (or write-through) write outstanding on the AXI bus, but almost immediately (afaik as soon as the write is accepted by the async bridge to the L3) the write is "posted" (i.e. becomes fire-and-forget) and acked to the cpu.

In case of normal memory, small writes to sequential addresses are automatically coalesced to larger writes when possible. This isn't done for device and strongly-ordered memory, so using aligned dword (strd) and quadword (neon) writes when possible will get you significant performance gain there.

In case of non-Neon reads, the cpu has to wait for the data to become available, so caches obviously have a huge impact: L1 cache hit = 1 cycle, L1 miss L2 hit = 9 cycles, L2 miss (or uncacheable) = ping time to target. If they miss the caches, reads from normal memory still have the benefit of overtaking buffered writes, while device reads aren't allowed to overtake device writes. The situation with Neon is more complicated and I never fully figured out what goes on there. For example, some timings for a simplistic memory copy using Neon (vld1, subs, vst1, bne) on a DM814x (A8 @ 600 MHz) targeting DDR3-800:

from strongly ordered to strongly ordered: 17.76 cycles/byte

from device to device: 12.77 cycles/byte

from device to uncacheable: 9.02 cycles/byte

from uncacheable to uncacheable: 1.31 cycles/byte

from uncacheable to device: 1.10 cycles/byte

from L2 miss to uncacheable: 1.06 cycles/byte

from L2 miss to device: 0.99 cycles/byte

from L2 hit to device or uncacheable: 0.50 cycles/byte

"L2 miss" refers to the first access of each cacheline (i.e. one out of four loads).

Of course for most peripheral targets caching is not an option. You could probably often get away with marking them normal uncacheable instead of device, though this may require introducing memory barriers and I don't know how expensive they are. It would also be highly Cortex-A8 specific: architecturally an ARM CPU is allowed to perform arbitrary reads from normal memory, and many perform speculative reads for example.
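
To put the cycles/byte figures from the copy timings above into more familiar units, they can be converted to throughput at the stated 600 MHz core clock. A tiny sketch (the function name `copy_mb_per_s` is just for this example, and MB here means 10^6 bytes):

```c
#include <assert.h>

/* Throughput in MB/s (10^6 bytes per second) of a copy loop that costs
 * cycles_per_byte CPU cycles per byte at a core clock of cpu_hz. */
double copy_mb_per_s(double cpu_hz, double cycles_per_byte)
{
    assert(cycles_per_byte > 0.0);
    return cpu_hz / cycles_per_byte / 1e6;
}
```

By this conversion, the strongly-ordered case (17.76 cycles/byte) works out to roughly 34 MB/s, while the L2-hit case (0.50 cycles/byte) is 1200 MB/s, a difference of about 35x from memory attributes alone.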

> Matthijs, does EDMA offer that big a performance boost?

After giving it more thought I'm actually not sure whether EDMA would achieve higher throughput than writes by a PRU core, since PRU is a direct initiator on the L3F while EDMA has to go through the L4HS to reach PRUSS. Having EDMA perform the transfer would however free up PRU's precious time. After setting things up, PRU could request EDMA transfers with a single write to EDMA, or using the PRUSS interrupt controller.

Another point of some importance is that since EDMA uses non-posted writes you would actually be sure the data has reached its destination when it signals completion. If PRU writes data to RAM, then signals the A8 using an interrupt, which subsequently proceeds to read from the same location, it is not guaranteed to actually read the data written by PRU: this data may still be in some queue on route from PRUSS to EMIF, while the A8 has a private hotline to EMIF that bypasses it.

For other situations the benefits are more clear: for example it can read data from a peripheral in response to its dma request and directly deliver it into PRUSS, and send notification to PRU when a certain amount of data has been transferred. This can save PRU from having to perform reads over the L3 interconnect.

EDMA also has a staggering amount of bandwidth. While its reads are limited by latency just like other initiators, the max size of a single access by EDMA is 64 bytes, so for example it can slurp the whole content of an ADC FIFO with a single read access. It is synchronous to the L3, avoiding the latency of an async bridge. Although it uses non-posted writes, it can have four writes + a read outstanding simultaneously. And all this describes a single Transfer Controller (TC), EDMA has three of these. Total theoretical bandwidth is just under 8 gigabytes per second, though I don't know how much is achievable in practice.

I think I had more stuff I wanted to say, but this email is already long enough and been sitting in Drafts for too long, so I'll just press "Send" now ;-)

Matthijs

g...@novadsp.com

Jun 18, 2015, 11:10:48 AM

to beagl...@googlegroups.com

> I think I had more stuff I wanted to say, but this email is already long enough and been sitting in Drafts for too long, so I'll just press "Send" now ;-)
>
> Matthijs

Impressive analysis. Thanks.

BTW there is some interesting B3/Xenomai/PRU stuff here as well: https://code.soundsoftware.ac.uk/projects/beaglert/repository

o1bigtenor

Jun 18, 2015, 11:40:53 AM

to beagl...@googlegroups.com

Interesting - - - servos - - - suggestions for non-hobby servos?

I am looking at a robotics project where I need commercial (full industrial) reliability and and and.

Just getting into this stuff so I have spent most of my time so far just reading (hopefully learning and not asking superfluous questions! grin!!).

TIA

Dee

Karl Karpfen

Jun 19, 2015, 4:38:32 AM

to beagl...@googlegroups.com

2015-06-18 17:40 GMT+02:00 o1bigtenor <o1big...@gmail.com>:

> On Mon, Jun 15, 2015 at 4:48 PM, Rick Mann <rm...@latencyzero.com> wrote:
> I am looking at a robotics project where I need commercial (full industrial) reliability and and and.

This should not be a problem, these guys here are offering something very similar: http://halaser.eu/e1701m.php

o1bigtenor

Jun 19, 2015, 6:09:36 AM

to beagl...@googlegroups.com

Very interesting.

I have just started working through the documentation.

Do you know of any other boards like this?

Dee

Bill M

Jun 19, 2015, 12:34:15 PM

to beagl...@googlegroups.com

Wow. At this point I feel like I should be paying you tuition ^_^.

Apparently while I was falling asleep reading the TRM in bed late at night, I totally misread and misinterpreted the UART divisor tables on pg. 238. Thanks for pointing that out, and for the heads up about the pointer corruption issue. I'll probably still try to use one of the non-PRU UARTs first (in case I want to dedicate the other PRU to other sensors or processing), and fall back to the PRU one if I'm having too much trouble getting smooth real-time operation.

Before getting into microcontroller programming for robots about 2 years ago, I hadn't done any hardware level programming since I was a kid 30 years ago on 6502 processors. Didn't really have to think much about pipelining, caching, or memory management back then ^_^. I do line of business desktop and web programming for my day job.

I'm probably using the term pipelining too casually/incorrectly. I know the hardware will simultaneously execute one instruction while decoding the next one and fetching the one after that. I was kind of including dealing with what is loaded in cache, how things slow down with cache misses, etc. My first couple of 'hello world' type programs I've written for this didn't even use caching, and even now I'm only using instruction caching (since the SDK code for that is super easy and enabling it sped things up considerably). I tried to set up the MMU, but it was hanging my program, and I didn't want to get bogged down in trying to debug that yet, at least not until I learn a LOT more.

The way I am trying to set things up now, just so I can see if the camera is working or will work, the PRU will only ever write to the picture memory, and the main core will only ever read it. So if the main core stalls while reading it, that is no big deal. What will be critical is that the PRU can write the data coming from the camera (at about 9MB a second) to memory dependably.

I have a lot more to say/ask too, and I can't thank you enough for all the help and info you've given me so far, but I'm writing this from work and I think if I want to keep this job for a while longer I better get back to it. Talk more soon...

Bill M

Jun 27, 2015, 2:06:36 PM

to beagl...@googlegroups.com

SUCCESS!! I was able to get the OV7670 camera connected, get the PRU reading it, and get the results over to my PC for display (although because I'm pushing to the PC via serial port, I can only see stills and not video, and the stills take about 20 seconds to transfer (640 * 480 greyscale (I'll work on the color later) image going byte at a time across a 115200 serial connection)). I discovered some of my initial (and persistent) problems were with poor terminations in my wiring (makes me inclined to want to go back and try it again as main core code with the GPIOs). It appears the L3 can consume writes from the PRU fast enough to move to memory (although since this is the ONLY thing running on the device right now, I don't know how it would degrade the operation of other code). This is really giving me the itch now to try to port this to run under Debian, so I can use the other facilities of the OS.
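
As a sanity check on the transfer time mentioned above, the time for one frame over the serial link can be estimated from the byte count and the baud rate. This sketch assumes 8N1 framing (10 bits on the wire per byte), which the post does not actually state; the function name `serial_transfer_seconds` is invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds needed to send n_bytes over an asynchronous serial link.
 * bits_per_byte is 10 for 8N1 framing (start + 8 data + stop). */
double serial_transfer_seconds(uint32_t n_bytes, uint32_t baud,
                               uint32_t bits_per_byte)
{
    assert(baud != 0);
    return (double)n_bytes * bits_per_byte / baud;
}
```

For a 640 x 480 greyscale frame at one byte per pixel (307200 bytes) and 115200 baud, this comes out at a little under 27 seconds, which is in the same ballpark as the "about 20 seconds" observed.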

If anyone wants to look at (or laugh at) my code, you can see it here: http://sourceforge.net/p/bioloidfirmware/code/ci/master/tree/

in the 'Beaglebone Firmware' folder. Also, there is a Windows program in the Utilities folder called 'UARTImageReceiver' that I am using on the PC side to fetch the image. It transmits a character to the BBB, and when the BBB receives it, it dumps back the contents of the array, which the PC program builds into an image and displays.

The setup is really sloppy right now (as is the code), I'll try to clean it up soon. Also, I'm using GCC and Eclipse. The way it is set up, you have to run the makefile with 'pru_bin' as the argument to build the PRU part, then run it without the argument to build the main program. I'm running the TI 'bintoc' program to convert the PRU program to arrays that I include in the main program, which I then load into PRU memory before starting the PRU. I tweaked 'bintoc' to take an extra argument to use as the name for the generated array. Because of this, and since I am directly writing to the address of my array in main memory from the PRU, there is some more compile-time craziness that is necessary. I have to compile everything, check the map file to see where the array gets placed, put that address into the PRU code, then compile it again. That also only makes either the debug or release version usable, but not both (since I'm not using a debugger anyway, I might just ditch the debug build).

Finally, I included a huge chunk of the Starterware code directly in the project so I wouldn't pollute my Starterware install (because I want to keep working through the examples), and so I could move the project around without breaking stuff. TI, please don't sue me. If I need to remove something, let me know. I stole, er, borrowed liberally from a bunch of people, and will try to attribute properly as soon as possible. In case you don't notice, I'm a slob and miss a lot.

William Hermans

Jun 27, 2015, 4:38:17 PM

to beagl...@googlegroups.com

> SUCCESS!! I was able to get the OV7670 camera connected, get the PRU reading it, and get the results over to my PC for display (although because I'm pushing to the PC via serial port, I can only see stills and not video, and the stills take about 20 seconds to transfer (640 * 480 greyscale (I'll work on the color later) image going byte at a time across a 115200 serial connection)). I discovered some of my initial (and persistent) problems were with poor terminations in my wiring (makes me inclined to want to go back and try it again as main core code with the GPIOs). It appears the L3 can consume writes from the PRU fast enough to move to memory (although since this is the ONLY thing running on the device right now, I don't know how it would degrade the operation of other code). This is really giving me the itch now to try to port this to run under Debian, so I can use the other facilities of the OS.

Awesome, Bill, that sounds great.

One thing I was thinking, and have been playing around with myself lately is . . . You could use websockets to push images / video out over the ethernet port. How one would implement that "bare metal" I am not sure. From within Linux it is pretty easy, and a few good libraries / API's to play with. The one I've been experimenting with lately is libmongoose https://github.com/cesanta/mongoose, and they have another library called "Fossa" https://github.com/cesanta/fossa which is supposed to be cleaner.

Anyway, websockets are a protocol that can be used with, or without a browser on the client side. So, if you wanted to write your own client for manipulating the video - You could. Fairly easily.

dd

Aug 30, 2016, 4:26:44 AM

to BeagleBoard

Hi Bill. I want to do this too. I checked your SourceForge repo and could not find your OV7670 interface.
Maybe we can collaborate. Contact me through www.baremetal.tech

later...................dd
