I have a project down the road that will require fast writes from PRU to ARM/system DRAM. But I'm not there yet.
For this project, my focus is on reading data (from SD card, eMMC, USB stick, network, etc.) into DDR, pushing it to the PRUs, and then having the PRUs bit-bang it out with precise timing (using EGP). I am trying to avoid external circuit support and thus need deterministic timing. That's what got me very interested in the BBB. Perhaps others as well: what a great, low-cost, small-footprint combination of the scope/breadth/content/flexibility of Linux with these embedded real-time units.
Eventually it dawned on me that there will be some latency/non-deterministic timing unless I use the PRUs completely fenced-off from the system (ARM, DDR, etc). So I'm trying to identify when/where that non-determinism can occur (and conversely, where it cannot).
When I referenced "shared DRAM" I was sloppy, thinking it was clear from the context. I mean the 12k shared DRAM that is part of the PRU-ICSS. I see that, or the two individual 8k DRAMs, as the "portal" to the ARM core (along with interrupts). I haven't coded it yet, but I think I'm pretty clear on pushing the data from userland to the PRUs (mmap() & /dev/mem, as was offered above). I already have a use planned for the three scratchpad areas, using the broadside interface for single-instruction transfers. The scratchpads appear not to be subject to any contention other than from the other PRU.
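For reference, here is roughly what I have in mind for the userland side. It is an untested sketch, not working code: the base address and offsets are the AM335x values as I read them in the TRM memory map, and the names pru_phys() and pru_map() are just placeholders I made up.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* AM335x values as I read them in the TRM; verify against your own part. */
#define PRU_ICSS_BASE    0x4A300000u  /* PRU-ICSS as seen from the ARM */
#define PRU_DRAM0_OFF    0x00000u     /* 8k data RAM 0 */
#define PRU_DRAM1_OFF    0x02000u     /* 8k data RAM 1 */
#define PRU_SHARED_OFF   0x10000u     /* 12k shared data RAM */
#define PRU_SHARED_SIZE  0x3000u      /* 12 * 1024 bytes */

/* Physical (ARM-side) address of an offset inside the PRU-ICSS. */
uint32_t pru_phys(uint32_t off)
{
    return PRU_ICSS_BASE + off;
}

/* Map `len` bytes of PRU-ICSS memory starting at `off` into this process.
 * Needs root. Returns NULL on any failure. */
volatile uint32_t *pru_map(uint32_t off, size_t len)
{
    assert((off & 0xFFFu) == 0);  /* mmap offsets must be page aligned */

    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)pru_phys(off));
    close(fd);  /* the mapping stays valid after the descriptor is closed */

    return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
}
```

Usage would be something like pru_map(PRU_SHARED_OFF, PRU_SHARED_SIZE) and then plain word stores through the returned pointer. Every one of those stores goes through the OCP slave port, which is exactly the traffic I'm worried about below.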
The point I'm trying to make is that from the TRM, it appears there is the possibility of some non-deterministic latency whenever using anything connected to the 32-bit PRU-ICSS bus. That is because the system (ARM) can access that bus through the OCP slave - and it will have to do that if it's going to be pushing data to the 12k or 8k PRU-ICSS DRAM. I think I can manage that (using interrupts to trigger the ARM to write the data and not start any timing critical steps until I can determine that write is complete). But when thinking this through, the question it has raised is this:
If I have both PRUs executing, won't they be (potentially) competing for access to the single 32-bit PRU-ICSS bus each time they access their "own" 8k DRAM or the "shared" 12k DRAM? Both PRUs can access all three of these memory locations, and the diagram seems to indicate there is only one path to them. And if this is true, then other than 12k being bigger than 8k, I don't see any advantage (or difference at all, other than having the same address in memory for either of the PRUs) between using the 12k or 8k DRAM from either PRU.
That's what I'm trying to verify, or be disabused of whatever mistake I've made.
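To make the "don't start anything timing critical until the ARM write is complete" idea concrete, this is the kind of handshake I'm picturing. It is only a sketch under my own assumptions: the word layout (flag, length, payload) and the names hs_publish() and hs_consume() are invented for illustration, and the consumer is written in C only to show the logic the PRU would follow. I also haven't confirmed that a C11 fence is sufficient ordering for a /dev/mem mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the buffer, in 32-bit words. */
#define HS_FLAG 0u  /* 0 = empty, 1 = payload ready */
#define HS_LEN  1u  /* payload length in words */
#define HS_DATA 2u  /* first payload word */

/* ARM side: copy the payload, then raise the flag as the very last store.
 * Returns 0 on success, -1 if the buffer is still full or too small. */
int hs_publish(volatile uint32_t *mem, uint32_t cap_words,
               const uint32_t *src, uint32_t n)
{
    assert(mem != NULL && src != NULL);
    if (cap_words < HS_DATA || n > cap_words - HS_DATA)
        return -1;
    if (mem[HS_FLAG] != 0)
        return -1;  /* consumer has not drained the previous payload */

    for (uint32_t i = 0; i < n; i++)
        mem[HS_DATA + i] = src[i];
    mem[HS_LEN] = n;

    atomic_thread_fence(memory_order_release);  /* payload before flag */
    mem[HS_FLAG] = 1;
    return 0;
}

/* Consumer side (what the PRU would do before its timing-critical loop).
 * Returns the number of words copied, or -1 if nothing is ready. */
int hs_consume(volatile uint32_t *mem, uint32_t *dst, uint32_t dst_cap)
{
    assert(mem != NULL && dst != NULL);
    if (mem[HS_FLAG] == 0)
        return -1;
    atomic_thread_fence(memory_order_acquire);  /* flag before payload */

    uint32_t n = mem[HS_LEN];
    if (n > dst_cap)
        return -1;
    for (uint32_t i = 0; i < n; i++)
        dst[i] = mem[HS_DATA + i];

    mem[HS_FLAG] = 0;  /* hand the buffer back to the producer */
    return (int)n;
}
```

The point of the flag is that the PRU only polls (or waits on an interrupt) while it is idle, so any bus contention from the ARM's writes lands outside the timing-critical window.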
To be specific, this is what I think will (can) happen:
ARM writing to the 12k PRU-ICSS shared DRAM can affect the timing of a PRU's reads/writes to its own 8k DRAM, the other PRU's 8k DRAM, and the 12k shared DRAM
PRU0 reading/writing to either 8k DRAM or the 12k DRAM can affect the timing of PRU1 reading/writing to either 8k DRAM or the 12k DRAM, even if the source/target of PRU0 is not the same as the source/target of PRU1
Any reads from system resources (through OCP master) are subject to stalls (e.g. peripherals, GPIO, ARM DDR)
Any writes to system resources (through OCP master) are also subject to stalls (but less likely) if the interconnect fabric has been saturated. (I was hoping I could get some rough idea of how much it takes to "saturate the interconnect fabric", and whether only writes contribute or reads as well.)
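One way I could test these four guesses empirically, rather than only reasoning from the block diagram, is to snapshot each PRU's CYCLE and STALL counters around a test loop and compare runs with and without the other bus masters active. The register offsets below are the ones I read in the TRM's PRU control register description (CYCLE at 0x0C, STALL at 0x10, COUNTER_ENABLE as bit 3 of CONTROL); the struct and function names are my own, and this is an untested sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets inside the PRU-ICSS, as I read them in the TRM. */
#define PRU0_CTRL_OFF        0x22000u
#define PRU1_CTRL_OFF        0x24000u
#define CTRL_CONTROL         0x00u
#define CTRL_CYCLE           0x0Cu
#define CTRL_STALL           0x10u
#define CTRL_COUNTER_ENABLE  (1u << 3)  /* must be set for the counters to run */

typedef struct {
    uint32_t cycle;  /* cycles counted while the PRU is enabled */
    uint32_t stall;  /* cycles in which no new instruction could be fetched */
} pru_counts;

/* Read both counters from a mapped PRU control register block. */
pru_counts pru_read_counts(const volatile uint32_t *ctrl)
{
    assert(ctrl != 0);
    pru_counts c;
    c.cycle = ctrl[CTRL_CYCLE / 4];
    c.stall = ctrl[CTRL_STALL / 4];
    return c;
}

/* Fraction of cycles lost to stalls between two snapshots.
 * Unsigned subtraction tolerates a single wrap of either counter. */
double pru_stall_fraction(pru_counts before, pru_counts after)
{
    uint32_t cycles = after.cycle - before.cycle;
    uint32_t stalls = after.stall - before.stall;
    return cycles ? (double)stalls / (double)cycles : 0.0;
}
```

If the stall fraction for PRU0's loop changes when PRU1 (or the ARM) starts hammering one of the data RAMs, that would confirm the contention; if it stays flat, my reading of the diagram is wrong.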
I will look at that BeagleLogic code and see if I can see how that was done. I'd still like to understand the underlying operation in more detail. Thanks.