unexpected "low speed" of PRU 1

Kasimir

May 12, 2021, 9:35:22 AM
to BeagleBoard
Hi,
I'm working on a sine-triangle modulator running on PRU 1 of a BeagleBone Black.
On the Linux/ARM side I calculate the pattern for one period as a data structure:
the pattern to output and the time to the next event.
The output is PRU 1 __R30 bits 0, 1, 2 and 3 (bit 4 is only for debugging, as an oscilloscope trigger).
It works ... but I'm not impressed by the speed.
The output loop of the PRU is written in a few lines of ASM.
Frequencies: the triangle should be 400 kHz, better 800 kHz;
the sine wave is between 20 kHz and 100 kHz.
The BeagleBone has to drive a high-speed GaN H-bridge.

The data transport and handshake between Linux and the PRU work fine.
A C program on the PRU watches for new data. The new data (the pattern-time structure)
is then copied into local RAM to get the best speed (lowest latency).
Once the data is stored in local RAM, the assembler routine is called to output the given pattern. First the arguments are saved in registers,
then the output runs in a loop:
pick up the pattern from local RAM and output it,
load the delay count from local RAM,
run the delay loop,
update the index register,
check for possible new data,
if there is none, go back to the top and output the next period.

As I said, it works. But with a cycle time of 5 ns (1/200 MHz) and 1 cycle for most (ASM) instructions, I don't see that speed.
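To put the numbers above into perspective, here is a minimal sketch of the cycle budget. It only assumes the 200 MHz clock (5 ns per cycle) mentioned in this post; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PRU_CLOCK_HZ 200000000u   /* 200 MHz, i.e. 5 ns per cycle */

/* PRU cycles available in one period of a signal running at freq_hz. */
static uint32_t cycles_per_period(uint32_t freq_hz)
{
    assert(freq_hz != 0u);
    return PRU_CLOCK_HZ / freq_hz;
}

/* Duration of a number of PRU cycles in nanoseconds (5 ns each). */
static uint32_t cycles_to_ns(uint32_t cycles)
{
    return cycles * (1000000000u / PRU_CLOCK_HZ);
}
```

With these assumptions a 400 kHz triangle leaves 500 cycles per period and an 800 kHz triangle leaves 250 cycles, shared among all events (loads, delay loops, bookkeeping) that fall into that period.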

So there is something wrong in my setup or code.
If somebody would like to help debugging, let me know.
Sources with Makefile etc. are available.

Everything is based on the latest Debian image, all updates are installed, HDMI is off.

So let me know; I think it only makes sense to upload that stuff if there is really
somebody able to help with it.

Thanks in advance
Kasimir

Mark Lazarewicz

May 12, 2021, 11:55:08 AM
to beagl...@googlegroups.com
The memory accesses will add some cycles. Post your assembler code with comments. You're correct, it doesn't make sense; maybe someone will see the issue. The PRU labs discuss measuring cycle times in CCS if you have JTAG, but toggling a GPIO and measuring with a scope is probably easier.
--
For more options, visit http://beagleboard.org/discuss
---
You received this message because you are subscribed to the Google Groups "BeagleBoard" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beagleboard...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beagleboard/e9fe59e9-e00d-476e-99e2-6b85a90695d2n%40googlegroups.com.

Peter Lange

May 12, 2021, 3:08:14 PM
to beagl...@googlegroups.com
Hi Mark,
thank you very much for the quick response.
I'm going to post the ASM. Looking forward.
Kasimir

You received this message because you are subscribed to a topic in the Google Groups "BeagleBoard" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/beagleboard/EvWTZ1wM8zQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to beagleboard...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beagleboard/474549500.1653123.1620834888085%40mail.yahoo.com.

Kasimir

May 12, 2021, 3:49:33 PM
to BeagleBoard
This is my code to output pattern on __R30
; ********************************************
   .global ausgabe
ausgabe:
   ldi     r18, 0              ; initial value
   ldi     r30, 0x10           ; debug
   ldi     r17, 0x00           ; debug
   mov     r13, r15            ; R15 contains start address, save in R13
   mov     r12, r14            ; R14 contains number of data points
naechster:
   lbbo    &r30, r15, 4, 1     ; (r15) = pattern
   lbbo    &r17, r15, 0, 2     ; (r17) = time to wait to output next pattern 
warte:
   sub     r17, r17, 1         ; delay loop 
   qbne    warte, r17, 0       ;
   add     r15, r15, 5         ; next element, update pointer
   sub     r14, r14, 1         ; number of remaining elements - 1
   qbne    naechster, r14, 0   ; was it the last one?
   mov     r15, r13            ; yes, reload address pointer with saved value
   mov     r14, r12            ; and reload loop counter with saved number of elements
   lbbo    &r18, r16, 0, 1     ; load handshake variable: if 0 run again, if != 0 exit
   or      r30, r30, (1<<4)    ; debug, trigger signal for oscilloscope
   qbeq    naechster, r18, 0   ; repeat as long as handshake[0] == 0
   jmp     r3.w2               ; r3 contains return address
;*****************************************************************

The datastructure:
typedef struct Event Event_t;                                                              
struct  Event                                                                              
{                                                                                                              
   unsigned int  time;     // number of loops to the next event                                                                      
   unsigned char pattern;  // Bit 7 | 6 | 5 | 4 |  3 | 2 |  1 | 0 |                       
                           // ------+---+---+---+----+---+----+---+                       
                           //       |   |   | d |~z34|z34|~z12|z12|                                                        
                           // ------+---+---+---+----+---+----+---+
};

int main( int argc, char *argv[])
{
int i;
int j;
Event_t event_knoten[500];

...
....
ausgabe(pattern_liste.anzahl, &event_knoten[0].time, &handshake[0]) ; // asm routine writes the patterns
                                                                      // as long as handshake[0] == 0
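One thing worth checking here: the assembly steps through the table with a stride of 5 bytes and reads the pattern at offset 4, so it relies on the struct having no padding. Whether the PRU C compiler lays out Event_t that way depends on its ABI; on many host compilers the same struct is 8 bytes. A minimal host-side sketch of the layout the assembly assumes (EventPacked_t and the helper names are hypothetical, and __attribute__((packed)) is a GCC/Clang extension):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed mirror of Event_t, matching the assembly's view:
 * time at offset 0, pattern at offset 4, 5 bytes per element. */
typedef struct __attribute__((packed)) {
    uint32_t time;     /* number of delay loops to the next event */
    uint8_t  pattern;  /* output bits for __R30 */
} EventPacked_t;

/* Byte offset of element `index`, as the `add r15, r15, 5` stepping assumes. */
static size_t event_offset(size_t index)
{
    return index * sizeof(EventPacked_t);
}

/* Aborts if the layout does not match the constants used in the assembly. */
static void check_layout(void)
{
    assert(sizeof(EventPacked_t) == 5);
    assert(offsetof(EventPacked_t, time) == 0);
    assert(offsetof(EventPacked_t, pattern) == 4);
}
```

If the same checks fail for the real Event_t on the PRU toolchain, the stride and offset in the assembly would have to be adjusted accordingly.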

It works fine, only the delay loop needs better resolution; at the moment the time for a single loop iteration is too long.
I have no idea how to optimize it.
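For reference, a rough sketch of the resolution one would expect from the warte loop, assuming sub and qbne each take one 5 ns cycle. The names and the rounding policy are illustrative only. Note that a count of 0 must be avoided: the sub would wrap around and the loop would run for a very long time.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_CYCLE        5u       /* 200 MHz PRU clock */
#define CYCLES_PER_DELAY_IT 2u       /* assumption: sub + qbne, 1 cycle each */
#define MAX_DELAY_COUNT     0xFFFFu  /* lbbo loads only 2 bytes of `time` */

/* Delay added by `count` iterations of the warte loop, in ns. */
static uint32_t delay_ns(uint32_t count)
{
    assert(count >= 1u && count <= MAX_DELAY_COUNT);
    return count * CYCLES_PER_DELAY_IT * NS_PER_CYCLE;
}

/* Loop count whose delay is closest to target_ns, clamped to the valid range. */
static uint32_t count_for_ns(uint32_t target_ns)
{
    const uint32_t step = CYCLES_PER_DELAY_IT * NS_PER_CYCLE;  /* 10 ns */
    uint32_t count = (target_ns + step / 2u) / step;           /* round to nearest */
    if (count < 1u) count = 1u;
    if (count > MAX_DELAY_COUNT) count = MAX_DELAY_COUNT;
    return count;
}
```

Under these assumptions the delay can only be set in 10 ns steps, on top of the fixed overhead of the loads and pointer updates in each pass.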
Also from
or      r30, r30, (1<<4)    ; debug, trigger signal for oscilloscope
to
naechster:
   lbbo    &r30, r15, 4, 1     ; (r15) = pattern
I measure 250 ns ... I was expecting 25 ns.
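The 25 ns expectation can be reproduced with a small cycle count, sketched below. The per-instruction costs are assumptions (1 cycle for or and qbeq, 3 cycles for an lbbo from PRU data RAM), not measured values, and the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_CYCLE 5u   /* 200 MHz PRU clock */

/* Assumed cycle costs; the 3-cycle load is the figure expected for PRU data RAM. */
enum { CYC_ALU = 1, CYC_BRANCH = 1, CYC_LBBO_LOCAL = 3 };

/* Expected time from `or r30, ...` until `lbbo &r30, ...` has updated r30, in ns. */
static uint32_t expected_trigger_to_pattern_ns(void)
{
    uint32_t cycles = CYC_ALU + CYC_BRANCH + CYC_LBBO_LOCAL;  /* or + qbeq + lbbo */
    return cycles * NS_PER_CYCLE;
}

/* How many times slower the measurement is than the expectation (integer ratio). */
static uint32_t slowdown_factor(uint32_t measured_ns)
{
    uint32_t expected = expected_trigger_to_pattern_ns();
    assert(expected != 0u);
    return measured_ns / expected;
}
```

With the measured 250 ns this gives a factor of 10. Since the two register instructions can account for only 10 ns, almost all of the extra time has to come from the load.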

I can see some jitter on my oscilloscope (Tektronix THS730A). It has nothing to do with
the GND connection, long wires etc.; all of that is fine, and the oscilloscope works properly.

Is it possible that something from the Linux/ARM side is disturbing my timing?

Thanks again for any helpful input.
Kasimir

Mark Lazarewicz

May 12, 2021, 7:26:10 PM
to beagl...@googlegroups.com
Hello Kasimir,

I will take a look, and hopefully others who are using the PRU can also help. I began programming in asm many, many years ago but haven't used the PRU assembler. Can you reply whether you have an oscilloscope or a high-speed logic analyzer? This is what we used for debugging many years ago.

You could remove all memory accesses by hard-coding the data (modify your code): just do a tight loop toggling a GPIO and measure the frequency.

This will tell you the maximum frequency of your GPIO.

Perhaps write some test code doing just that and share the results. Staring at source code isn't always the fastest way to find an error, especially since we don't have your exact setup.

In the meantime, hopefully someone sees something obvious. I'm sure the maximum frequency of what you are attempting has been discussed.

Maybe someone will comment on what they have achieved and share their solution.

Break the problem into pieces. Resisting the temptation to be drawn into detours can be challenging when getting input.

By running experiments you can stay busy while waiting for input from group members.

I hope that's helpful.

Mark

Kasimir

May 13, 2021, 2:36:18 PM
to BeagleBoard
Hi Mark,
I was trying to use the loop instruction:
   .global ausgabe
ausgabe:
   ldi     r18, 0              ; initialisation

   ldi     r30, 0x10           ; debug
   ldi     r17, 0x00           ; debug
   mov     r20, r15            ; save start address
   mov     r21, r14            ; save number of patterns
naechster:
   loop    next_pattern, r14   ; for each pattern
   lbbo    &r30, r15, 4, 1     ; output (r15) = pattern
   lbbo    &r17, r15, 0, 2     ; load number of delay loops 
   loop    weiter, R17         ; delay loop
weiter:
   add     r15, r15, 5         ; increment address pointer by 5 ( next data structure element )
next_pattern:
   mov     r15, r20            ; load saved start address in address pointer
   mov     r14, r21            ; load saved number of pattern in pattern counter
   lbbo    &r18, r16, 0, 1     ; check if stop request

   or      r30, r30, (1<<4)    ; debug
   qbeq    naechster, r18, 0   ; if handshake[0] == 0 continue
   jmp     r3.w2               ; otherwise return r3 contains return address

**********************************************************************************
I used prudebug to test the behavior.
The loop instruction is not recognized (UNKNOWN in the disassembler listing),
so it is not a solution for the BeagleBone Black.
The assembler did not warn or complain.

Bottom line: independent of the above code, I'm missing the 200 MHz performance.
I'm far off, by about 20:1, if I think in 5 ns instruction cycles for register operations.

There is something else ... so far I have no idea what it can be.

Thanks for help and thinking
Kasimir


TJF

May 13, 2021, 2:45:49 PM
to BeagleBoard
Kasimir wrote on Wednesday, May 12, 2021 at 21:49:33 UTC+2:
> It works fine, only the delay loop needs better resolution; at the moment the time for a single loop iteration is too long.
> I have no idea how to optimize it.

Twice as fast:

    LOOP EndWait, R17.w0 // note: max 16 bit counter
    EndWait:
 
> Also from
> or      r30, r30, (1<<4)    ; debug, trigger signal for oscilloscope
> to
> naechster:
>    lbbo    &r30, r15, 4, 1     ; (r15) = pattern
> I measure 250 ns ... I was expecting 25 ns.
>
> I can see some jitter on my oscilloscope (Tektronix THS730A). It has nothing to do with
> the GND connection, long wires etc.; all of that is fine, and the oscilloscope works properly.
>
> Is it possible that something from the Linux/ARM side is disturbing my timing?

The LBBO &r30, r15, 4, 1 instruction needs at least 3+1 cycles (as long as the address in R15 is not in the PRU local memory map). And it may take additional cycles in case of heavy traffic on the L3 bus.

Note: for cycle counting you don't need an oscilloscope. Instead you can use the CYCLE register (offset = Ch) in the PRUSS_PRU_CTRL register space.
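A minimal sketch of how such a cycle measurement could be evaluated. The struct below is a hypothetical mirror of the first control registers, laid out only so that the cycle counter sits at offset 0x0C as stated above; the field and function names are not taken from any TI header. On real hardware the counter also has to be enabled in the control register first, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the first PRU control registers (names are illustrative). */
typedef struct {
    volatile uint32_t control;    /* 0x00 */
    volatile uint32_t status;     /* 0x04 */
    volatile uint32_t wakeup_en;  /* 0x08 */
    volatile uint32_t cycle;      /* 0x0C: cycle counter */
} PruCtrlRegs;

#define NS_PER_CYCLE 5u           /* 200 MHz PRU clock */

/* Take a snapshot of the cycle counter. */
static uint32_t read_cycles(const PruCtrlRegs *regs)
{
    assert(regs != NULL);
    return regs->cycle;
}

/* Elapsed time between two snapshots in ns (unsigned subtraction). */
static uint32_t elapsed_ns(uint32_t start_cycles, uint32_t end_cycles)
{
    return (end_cycles - start_cycles) * NS_PER_CYCLE;
}
```

Taking one snapshot before and one after the code under test and feeding both into elapsed_ns() gives the duration directly, without relying on a debug pin and a scope.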

din...@gmail.com

May 13, 2021, 2:46:06 PM
to BeagleBoard
Which assembler are you using? It should have warned you that "loop weiter" body must be at least two instructions, whereas you have zero. 

Also, you cannot nest HW-assisted loops.

Regards,
Dimitar

TJF

May 13, 2021, 2:50:32 PM
to BeagleBoard
Hi Kasimir, sorry my post overlapped.

Kasimir wrote on Thursday, May 13, 2021 at 20:36:18 UTC+2:
> So the loop instruction is not known ( UNKNOWN in disassembler list )
> Is not a solution for Beaglebone black.
> Assembler did not warn or complain.

The LOOP instruction works in PASM assembler.

Note: nested LOOP instructions are not allowed.

Kasimir

May 13, 2021, 3:46:05 PM
to BeagleBoard
Hi, thanks to all :-)
Here is a picture (for the first posted asm). The delay (r17) is always 1, so there is always exactly one loop iteration.
The patterns are 0-1-0-1-0-1-...

Channel 1 is __R30 bit 0 (pattern).
Channel 2 is __R30 bit 4, used as the trigger.
The high time of bit 4 is > 200 ns ... I can't understand why.
The high/low time of the pattern is 450 ns ... why?
If the cycle count for a register-register operation is 1 and for a DRAM access is 3, it should be 45 ns.
I do not understand why I can't see the 200 MHz speed of the PRU :-(

What do you think?
Kasimir
modulator.jpg


Dennis Lee Bieber

May 13, 2021, 3:47:02 PM
to Beagleboard
On Thu, 13 May 2021 11:46:06 -0700 (PDT), in
gmane.comp.hardware.beagleboard.user
"din...-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org"
<dinuxbg-Re5JQEe...@public.gmane.org> wrote:

>Which assembler are you using? It should have warned you that "loop weiter"
>body must be at least two instructions, whereas you have zero.
>

Sliding into the thread...

From the manual:
"""
Hardware Loop Assist (LOOP) Defines a hardware-assisted loop operation. The
loop is noninterruptible (LOOP). The loop operation works by detecting when
the instruction pointer would normal hit the instruction at the designated
target label, and instead decrementing a loop counter and jumping back to
the instruction immediately following the loop instruction.
"""

So, yes... the loop encounters the target label... and jumps back to...
the target label as there is no intervening opcode to use as a target for
the jump. Might be optimized out completely unless one puts at least a NOP
instruction inside -- though the next comment probably voids all
consideration.

Not sure of the "at least two instructions" -- seems one, with label on
the next (outside of loop) instruction, would be viable. PC would hit
label, so jump back to the (one) instruction following LOOP statement.


>Also, you cannot nest HW-assisted loops.
>

A critical item to consider...


--
Dennis L Bieber

Kasimir

May 13, 2021, 4:03:16 PM
to BeagleBoard
Hi Dennis,
thanks for the information. I'm currently using the first version of the asm, without loop,
because something else is wrong here: the timing is off by a factor of 10 to 15.
I think I can use only one loop, for the timing. If I have to insert a nop, then there is no advantage.
I have been stuck at this point for a week now and have made no progress.
I'm thinking about a hardware solution with 2 DDS devices from Analog Devices: one for the triangle and one for sine and -sine, plus a comparator ... done.
But then the BeagleBone / Sitara CPU no longer makes sense.
I like the BeagleBone; I have made a lot of nice things with it and it works fine. But now I need the power of the PRU and I do not see the performance.
Maybe my code is not placed in internal memory ... there are many possibilities to do things wrong.
Thanks again
Kasimir



Kasimir

May 13, 2021, 4:34:19 PM
to BeagleBoard
A moment ago I was standing at the cliff's edge; now I have made a big step forward ...
I'm able to generate a 10 ns trigger pulse on __R30 bit 4 :-))
I added an and instruction to clear bit 4. Now it's clear: both indirect loads (lbbo &r...) are
responsible for the unexpected delay. I was expecting both to operate from the PRU data RAM (DRAM) with
a latency of 3 cycles. What is wrong? The data structure is expected to be in local RAM, to get the best latency.
In C it's declared this way:
typedef struct Event Event_t;                                                              
struct  Event                                                                              
{                                                                                                              
   unsigned int  time;     // number of loops                                                                      
   unsigned char pattern;  // Bit 7 | 6 | 5 | 4 |  3 | 2 |  1 | 0 |                       
                           // ------+---+---+---+----+---+----+---+                       
                           //       |   |   |   |~z34|z34|~z12|z12|                                                        
                           // ------+---+---+---+----+---+----+---+
};
int main( int argc, char *argv[])
{
int i;
int j;
unsigned char u;
Event_t event_knoten[100];      // later on, r15 is pointing to that address
...
...
...
ausgabe(pattern_liste.anzahl, &event_knoten[0].time, &handshake[0]) ;


***************** change to debug delay in assembler *******************
naechster:
   and     r30, r30, 0xEF           ; debug
   lbbo    &r30, r15, 4, 1     ; (r15) = pattern       <= slow
   lbbo    &r17, r15, 0, 2     ; load number of loops  <= slow
 
Any hint on how to make the lbbo &r... loads faster?
Looking forward to your replies.

Kasimir


Mark Lazarewicz

May 13, 2021, 4:48:59 PM
to beagl...@googlegroups.com
Have you seen the PRU Support Package examples?
I saw examples of linker placement in shared RAM.

In the example below, the C variable is in local RAM by default.

What is the smallest pulse period you require for your application?


#include <stdint.h>
#include <pru_cfg.h>   /* CT_CFG, from the PRU Software Support Package */

volatile register uint32_t __R30;   /* PRU output register */

void main(void)
{
    volatile uint32_t gpio;

    /* Clear SYSCFG[STANDBY_INIT] to enable OCP master port */
    CT_CFG.SYSCFG_bit.STANDBY_INIT = 0;

    /* Toggle GPO pins */
    /* Note: 0xFFFF_FFFF toggles all GPO pins */
    gpio = 0xFFFFFFFF;

    /* TODO: Create stop condition, else it will toggle indefinitely */
    while (1) {
        __R30 ^= gpio;
        __delay_cycles(100000000);
    }
}



Kasimir


--
For more options, visit http://beagleboard.org/discuss
---
You received this message because you are subscribed to the Google Groups "BeagleBoard" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beagleboard...@googlegroups.com.
To view this discussion on the web visit

Kasimir

May 13, 2021, 6:24:21 PM
to BeagleBoard
Hi all,
it's SOLVED    :-)
Thanks for all your input.
The problem was in the memory allocation.
I was not using the PRU DRAM. The external RAM is very slow, and I also saw some jitter.
Now it's running at the expected speed and I'm happy.
I was expecting local variables to be in local memory by default. That's not the case.
Thanks again
Kasimir
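A small sanity check along these lines can catch such a placement problem early. The sketch below classifies an address as seen from the PRU itself, assuming the AM335x PRU-local memory map (own 8 KB data RAM at 0x0000, the other PRU's data RAM at 0x2000, 12 KB shared RAM at 0x10000). The type and function names are hypothetical, and the ranges should be verified against the reference manual for the device actually used.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    MEM_OWN_DRAM,     /* 8 KB data RAM of this PRU */
    MEM_OTHER_DRAM,   /* 8 KB data RAM of the other PRU */
    MEM_SHARED_RAM,   /* 12 KB shared RAM */
    MEM_OTHER         /* anything else: peripherals, L3, external DDR */
} PruMemRegion;

/* Classify an address as seen from the PRU itself (assumed AM335x local map). */
static PruMemRegion classify_pru_address(uint32_t addr)
{
    if (addr < 0x00002000u) return MEM_OWN_DRAM;
    if (addr < 0x00004000u) return MEM_OTHER_DRAM;
    if (addr >= 0x00010000u && addr < 0x00013000u) return MEM_SHARED_RAM;
    return MEM_OTHER;
}

/* Check that a table of `count` elements of `elem_size` bytes stays in own DRAM. */
static int table_in_own_dram(uint32_t base, uint32_t count, uint32_t elem_size)
{
    assert(count > 0u && elem_size > 0u);
    uint32_t last = base + count * elem_size - 1u;
    return last >= base
        && classify_pru_address(base) == MEM_OWN_DRAM
        && classify_pru_address(last) == MEM_OWN_DRAM;
}
```

For example, a table of 100 five-byte events starting at offset 0x200 fits into the PRU's own data RAM, while a pointer into external memory would be reported as MEM_OTHER.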

Mark Lazarewicz

May 13, 2021, 6:44:28 PM
to beagl...@googlegroups.com
Great news!

Can you share how it ended up in external RAM?
An incorrect linker cmd file?

Mark



Kasimir

May 13, 2021, 6:56:12 PM
to BeagleBoard
Hi Mark,
simpler than that ... it was in the C source.
My data structure was not in internal RAM.
volatile Event_t *event_knoten = (Event_t *) (PRU0_DRAM + 0x200);
and in main
event_knoten = (Event_t *)malloc(100*sizeof(Event_t));

solved it.

Kasimir

Mark Lazarewicz

May 13, 2021, 8:11:35 PM
to beagl...@googlegroups.com
Hi Kasimir,

What's wrong with the line below?

> My datastructure was not in internal ram.
> volatile Event_t *event_knoten = (Event_t *) (PRU0_DRAM + 0x200);

IMO, placing anything in a guaranteed memory area is best done with sections in the linker command file.

There are examples of placing data in PRU shared RAM in the examples I mentioned.

Yes, external DDR RAM, yikes 🤟 the ARM is caching it.

Glad you're rolling.


Mark Lazarewicz

May 13, 2021, 8:40:46 PM
to beagl...@googlegroups.com
Never mind, I think I understand now. PRU0_DRAM needs to be an address from the linker command file; then that statement might work.

Anyway, linker command files have always been a murky science that I might play around with. I use JTAG, so an incorrect address is something I can catch quickly.
Unfortunately, from my recent research, using RPMsg isn't a good match with JTAG debugging.

I'm interested in what memory RPMsg carves out for its use.

It seems that between the PRU shared RAM and the unused PRU0_DRAM (if using only one PRU) one can squeeze out additional RAM if resources are tight. That's why I'm interested in researching the linker command files further.

These PRUs are limited in resources; it's like using a small 8-bit processor from 20 years ago and squeezing every possible byte out.

Back in the day some guys got job security by using so many tricks to steal memory that their code was unmaintainable. They liked that the boss couldn't get rid of them, because changing the software would break the entire application.

Ahh, I digress.

Mark

Peter Lange

May 14, 2021, 4:42:40 AM
to beagl...@googlegroups.com
Hi Mark,
prudebug helped a lot. I'm missing a good debug environment for PRU development.
Up to now it has been time-consuming trial and error.
It's easier to use an FPGA on top of a Raspberry Pi, or to use the second core of an ESP32 for dedicated high-speed functions. In the end I want to use the CPU in my own hardware; the BeagleBone is my "emulator" and debug environment.
The big value of the Sitara CPU is its PRUs. I think prudebug should be enhanced.
Have a great day
Kasimir

Mark Lazarewicz

May 14, 2021, 5:18:46 PM
to beagl...@googlegroups.com
CCS/JTAG works for me. I have used FPGA ARM cores and the ESP32.
My position and opinion are unique in this group:
I see no value in a PRU UNLESS every peripheral on the DSP/ARM is used and you need more peripherals.
I have seen that done with an RTOS on an ARM/DSP/PRU omapL178, in a very complex motor controller from an industry leader.
They also used a TI Piccolo for QEP and several other small processors.
Beyond having assembler instructions run in 1 cycle on the PRU, a dedicated SoC with an RTOS or bare metal is much more powerful and deterministic. FDA, DO-178B and CENELEC systems use dedicated processors to achieve this.

Nick from TI has a very simple two-PRU PID motor controller reference design, but it's not true control theory and not fault tolerant like the product I worked on. This PRU is a toy compared to the ARM core or a DSP; it's resource limited and its instruction set is limited, especially for running MATLAB.

I did download prudebug. Sorry for being negative, but it reminds me of desperate Linux programmers using GDB or, even worse, the initial ESP32, which had no JTAG, just serial debug.

Maybe I'm a stubborn old man, but printf killed that ESP32 project: it was very slow getting good debug info out.

For custom board bring-up JTAG is a must.
I worked on a military 8-core PowerPC application, debugging the multi-core SoC boards in a VME rack (say 4 boards); we used print logs to debug.

Guess what: all the consultants went home very rich $$$$$. It took years to validate this, and it wasn't life critical, it was mission critical, so testing was extensive.

My new research project: CCS debugging of ARM code over Ethernet. Yikes, it uses GDB server, LOL.

Thanks for informing me about prudebug. I will try it, to remind myself it is better than using LEDs to debug the PRU.
CCS is also used to automate V&V test cases using GEL files.

In all fairness, there are many ways to skin a cat, whatever one feels comfortable with. I get that, but when your custom board won't boot, GDB is useless.

Mark


Peter Lange

May 14, 2021, 5:31:56 PM
to beagl...@googlegroups.com
👍
Kasimir
PS: try out ddd on top of gdb.
It's easy to use, simple and efficient.


Mark Lazarewicz

May 14, 2021, 5:52:43 PM
to beagl...@googlegroups.com
Thanks, noted for future research.

Unfortunately GDB won't work when your board doesn't run.

For cost-sensitive products using an OEM board isn't always an option.

In high volume, pennies make a difference. That's why the ESP32 is popular: it's low cost. I'm sure it now has JTAG, and yes, GDB is an option on it, although 3 years ago it was serial port only.

When you get handed a board that's radically different from the reference design and it doesn't work, or you're designing a bootloader on a billion-dollar project, what's $100 for a JTAG?

Certainly you would not give a hardware engineer a new design and not buy him an oscilloscope or logic analyzer.

Just a different perspective; it may not apply for some.

