Cortex-A8 scheduling


Ben Avison

Oct 2, 2010, 10:25:58 AM

I've been taking a look at Cortex-A8 scheduling again, and I'm beginning to
think there are errors as well as omissions in the TRM.

First I'll describe a simple example which isn't in doubt, in case I've made a
fundamental misunderstanding:

MUL r0,r1,r2
MOV r3,r0

MUL is 2 cycles, MOV is 1 cycle, but there is a dependency between r0 output by
the MUL and input to the MOV. r0 is output in the cycle in which E5 is
processing the second cycle of the MUL. If the instruction following the MUL
didn't need r0 until E5, then it could be scheduled for the very next cycle;
but MOV needs it in E1. So instead, in the next cycle, the MOV can't be any
further on than E1, and the second cycle of the MUL is now at E6, meaning that
E2, E3, E4 and E5 are all idle. I've confirmed by timing real silicon that we
do indeed get a 4-cycle stall, so I'm reasonably confident that this is a
correct interpretation.
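
The rule described above can be written down as a tiny Python sketch. This is only a model of the reasoning in this post, not anything taken from the TRM: the function name is made up for illustration, and it assumes the consumer would otherwise issue in the cycle immediately after the producer's final cycle.

```python
def stall_cycles(produce_stage: int, need_stage: int) -> int:
    """Stall cycles between a producer and the instruction right after it.

    produce_stage: E-stage in which the producer's final cycle outputs the register.
    need_stage:    E-stage in which the consumer needs that register.
    Assumption: without the dependency, the consumer would issue in the
    very next cycle, i.e. one stage behind the producer's final cycle.
    """
    return max(0, produce_stage - need_stage)


# MUL r0,r1,r2 outputs r0 in E5; MOV r3,r0 needs r0 in E1.
print(stall_cycles(5, 1))  # 4
# A consumer that doesn't need r0 until E5 can follow immediately.
print(stall_cycles(5, 5))  # 0
```

With the MUL/MOV pair this gives the 4-cycle stall measured on silicon.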

Now we get to a contradiction - the section on multiplications says that

MUL r3,r4,r5
MLA r0,r1,r2,r3

will execute back-to-back (no stalls), but it also says that r3 is needed by
the MLA in E4 (and is produced in E5 of the MUL, as before). By the logic
above, this ought to imply a 1-cycle stall. Experimentation shows that the
back-to-back assertion is the correct one.

I thought I'd also check the case where accumulator forwarding is not used:

LDR r3,[r4]
MLA r0,r1,r2,r3

I chose LDR because it outputs quite late (in E3, confirmed by measuring a
2-cycle stall between LDR r3,[r4] and MOV r0,r3). This also shows no stall,
so the TRM is wrong here too: the accumulator is needed in E3 (not E2) when
forwarding is not in use, otherwise this would have stalled.
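
The inference used here, working back from a measured stall to the stage in which a register is needed, can be sketched in Python as follows. The function names are hypothetical and the model is the same simple one as in the MUL/MOV example (stall = produce stage minus need stage, never negative). Note that a zero stall only bounds the need stage from below.

```python
def predicted_stall(produce_stage: int, need_stage: int) -> int:
    """Stall predicted by the simple model used in this post."""
    return max(0, produce_stage - need_stage)


def infer_need_stage(produce_stage: int, measured_stall: int):
    """Return (stage, exact) for the consumer's input register.

    A non-zero measured stall pins the need stage exactly; a zero stall
    only tells us the register is needed in produce_stage or later.
    """
    if measured_stall > 0:
        return produce_stage - measured_stall, True
    return produce_stage, False


# LDR outputs in E3; a measured 2-cycle stall before MOV pins MOV's input to E1.
print(infer_need_stage(3, 2))  # (1, True)
# LDR followed by MLA using the loaded value as accumulator: no stall measured,
# so the accumulator is needed in E3 or later.
print(infer_need_stage(3, 0))  # (3, False)
# The TRM's figure of E2 would have predicted one stall cycle here.
print(predicted_stall(3, 2))   # 1
```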

I was getting worried now, so I tested some more. I found that all the 64-bit
multiply instructions seem to output RdLo in E4, not E5 as the TRM says. The
flag-setting versions of the multiplies behave quite differently - in fact,
MULS timings are just like those that MUL followed by TEQ would have had
(probably revealing how the silicon works, I guess). A hint here for
optimisers - the latter combination performs a similar function, but gives
you the opportunity to feed some other instructions into the otherwise
unavailable stall slots.

At this point, I was willing to accept that someone had had a bad day when
writing the MUL section of the TRM, so I thought I'd do some testing on
another class of instructions. I arbitrarily chose the sign-extension
instructions for some more experimental tests. But more inconsistencies
became clear: Rn is needed in E1, not E2, and versions with accumulate seem
to need Rd as a source in E2, even if the instruction is unconditional!

Anyone know if this level of accuracy is characteristic of the rest of the
instruction set?

Ben

Paul Gotch

Oct 2, 2010, 11:22:50 AM

Ben Avison <usenet...@avison.me.uk> wrote:
> Anyone know if this level of accuracy is characteristic of the rest of the
> instruction set?

I would recommend that you take this up with ARM support.

-p
--
Paul Gotch
--------------------------------------------------------------------

Nils

Oct 2, 2010, 2:59:37 PM

On 10/02/2010 04:25 PM, Ben Avison wrote:
> I've been taking a look at Cortex-A8 scheduling again, and I'm beginning to
> think there are errors as well as omissions in the TRM.

:-) Thanks for the information..

> Anyone know if this level of accuracy is characteristic of the rest of the
> instruction set?

I can tell you that the SMULxy and SMLAxy instructions are twice as fast
as specified in the TRM. They have the exact same scheduling rules as
SMULWy (aka 1 cycle latency instead of two).

I verified that with two different OMAP3530 Cores (engineering sample
two and a production CPU).

Nils


Ben Avison

Oct 2, 2010, 6:22:35 PM

On Sat, 02 Oct 2010 19:59:37 +0100, Nils <n.pipe...@cubic.org> wrote:
> I can tell you that the SMULxy and SMLAxy instructions are twice as fast
> as specified in the TRM. They have the exact same scheduling rules as
> SMULWy (aka 1 cycle latency instead of two).

Thanks, I'm glad it's not just me! With my interest piqued, I've been
going back over the timing tables. I can't see anything wrong with the
data processing instructions, but there's more wrong with the multiplies
than I previously noticed, including faults with SMULxy and SMLAxy - although
they look like different problems to me. SMLAxy seems to read Rd, Rm and Rs
one pipeline stage later than described in the TRM (but still takes 2 cycles),
whereas SMULxy has been incorrectly grouped with SMLAxy, when its timings
actually match the 1-cycle instructions like SMULWy.

The table could be simplified somewhat, as there are actually only 6 distinct
timing patterns, not the 11 ARM used! I'd rewrite it thus:

MLA, MLS, MUL, SMMLA(R), SMMLS(R), SMMUL(R)

Cycles: 2 Src: Rm:E1 Rs:E1 [Rd:E3] {Rn:E3/E5} Res: Rd:E5

SMULxy, SMULWy, SMUAD(X), SMUSD(X)

Cycles: 1 Src: Rm:E1 Rs:E1 [Rd:E2] Res: Rd:E5

SMLAxy, SMLAWy, SMLAD(X), SMLSD(X)

Cycles: 2 Src: Rm:E2 Rs:E2 [Rd:E3] Rn:E3/E5 Res: Rd:E5

SMULL, UMULL

Cycles: 3 Src: Rm:E1 Rs:E1 [RdLo:E3] [RdHi:E3] Res: RdLo:E4 RdHi:E5

SMLAL, UMAAL, UMLAL

Cycles: 3 Src: Rm:E1 Rs:E1 RdLo:E2 RdHi:E1 Res: RdLo:E4 RdHi:E5

SMLALxy, SMLALD(X)

Cycles: 2 Src: Rm:E1 Rs:E1 RdLo:E1 RdHi:E2 Res: RdLo:E4 RdHi:E5

Rather simpler than the existing table, I think you'll agree! It's shocking
that this many errors have got through - this is need-to-know stuff for
people trying to write highly optimised ARM code...
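
For anyone wanting to use the table programmatically, here is a rough Python sketch of it as data. It is only an illustration: the names are invented, the bracketed and braced entries and the accumulator (Rn) stages are left out, the (R) and (X) suffixes are dropped, and it assumes result stages are relative to the producer's final cycle, as in the MUL example at the start of the thread.

```python
# The six timing classes, transcribed from the table above.
# Each entry: (mnemonics, issue cycles, source need stages, result stages).
# Left out of this sketch: [..] and {..} entries and the Rn accumulator stages.
TIMING_CLASSES = [
    (("MLA", "MLS", "MUL", "SMMLA", "SMMLS", "SMMUL"),
     2, {"Rm": 1, "Rs": 1}, {"Rd": 5}),
    (("SMULxy", "SMULWy", "SMUAD", "SMUSD"),
     1, {"Rm": 1, "Rs": 1}, {"Rd": 5}),
    (("SMLAxy", "SMLAWy", "SMLAD", "SMLSD"),
     2, {"Rm": 2, "Rs": 2}, {"Rd": 5}),
    (("SMULL", "UMULL"),
     3, {"Rm": 1, "Rs": 1}, {"RdLo": 4, "RdHi": 5}),
    (("SMLAL", "UMAAL", "UMLAL"),
     3, {"Rm": 1, "Rs": 1, "RdLo": 2, "RdHi": 1}, {"RdLo": 4, "RdHi": 5}),
    (("SMLALxy", "SMLALD"),
     2, {"Rm": 1, "Rs": 1, "RdLo": 1, "RdHi": 2}, {"RdLo": 4, "RdHi": 5}),
]


def lookup(mnemonic):
    """Return (cycles, sources, results) for a mnemonic, or raise KeyError."""
    for names, cycles, sources, results in TIMING_CLASSES:
        if mnemonic in names:
            return cycles, sources, results
    raise KeyError(mnemonic)


def stall_between(producer, result_reg, consumer, source_reg):
    """Stall cycles when consumer directly follows producer and reads its result."""
    _, _, results = lookup(producer)
    _, sources, _ = lookup(consumer)
    return max(0, results[result_reg] - sources[source_reg])


print(lookup("SMULxy")[0])                          # 1
print(stall_between("MUL", "Rd", "SMLAxy", "Rm"))   # 3
print(stall_between("SMULL", "RdLo", "MUL", "Rm"))  # 3
```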

Ben

Nils

Oct 6, 2010, 5:52:33 PM

On 10/03/2010 12:22 AM, Ben Avison wrote:
>
> Rather simpler than the existing table, I think you'll agree! It's shocking
> that this many errors have got through - this is need-to-know stuff for
> people trying to write highly optimised ARM code...

Ben, is it okay with you if I take your findings and post them on my blog?
(I'll give credit, of course, if you want.)

I already wrote about the SMULxy behavior some months ago, and I think a
little update would be nice.

Btw: Which Cortex-A8 based CPU did you use to verify the timings? I used
two OMAP3530 cores. Maybe the timing is just the way it is on this
specific family and different on others.

Cheers,
Nils

Oh - that blog: http://www.hilbert-space.de

Ben Avison

Oct 8, 2010, 2:39:45 PM

On Wed, 06 Oct 2010 22:52:33 +0100, Nils <n.pipe...@cubic.org> wrote:
> Ben, is it okay for you if I take your findings and post them on my blog
> (of course I give credits if you want to).

No problem. I've been checking various other instructions too - there are
hardly any where the TRM seems to be totally accurate. I'll stick them
online somewhere when I'm done, but for now you might want to note that
USAD8 is actually another member of the SMULxy family in timing terms, and
USADA8 likewise a member of the SMLAxy family, right down to the
accumulator forwarding quirk.

> Btw: Which Cortex-A8 based CPU did you use to verify the timings? I used
> two OMAP3530 cores.. Maybe the timing is just the way it is on this
> specific family and different on others..

I did my testing on an OMAP3530, engineering sample 3.0. It's the only
Cortex-A8 I have, so I can only speculate on variations (and I wouldn't be
in a great hurry to repeat the investigations, as they have been quite
time-consuming). However, nearly all of the differences have resulted in
improved performance, so I can't help wondering if the text in the TRM
represented a very early development version of the A8 core...

Ben

Ben Avison

Nov 4, 2010, 4:49:04 PM

On Fri, 08 Oct 2010, I wrote:
> No problem. I've been checking various other instructions too - there are
> hardly any where the TRM seems to be totally accurate. I'll stick them
> online somewhere when I'm done

As promised, here's my writeup:

http://www.avison.me.uk/ben/programming/cortex-a8.html

Ben

WebShaker

Nov 6, 2010, 4:40:13 AM

On 02/10/2010 16:25, Ben Avison wrote:
> I've been taking a look at Cortex-A8 scheduling again, and I'm beginning to
> think there are errors as well as omissions in the TRM.

Your explanation is a little bit too complex for me ;)
Is there a program that will print the cycle and pipeline breakdown
of some assembly code for the Cortex-A8 (or A9)?

For example, you give it this code

MUL r0,r1,r2
MOV r3,r4
...

And as output you get

MUL r0,r1,r2 ; Cycle 1 pipeline 0
MOV r3,r4 ; Cycle 1 pipeline 1
...

Thanks
Etienne

Ben Avison

Nov 16, 2010, 9:08:45 PM

On Sat, 06 Nov 2010 08:40:13 -0000, WebShaker <eti...@tlk.fr> wrote:
> Is there a program that will print the cycle and pipeline breakdown
> of some assembly code for the Cortex-A8 (or A9)?

As it happens, one of the things I've been doing with my findings is exactly
that - although it probably won't be as useful as it could be unless I add all
the VFP / Advanced SIMD instructions. It's currently bundled with the
commercial toolchain for RISC OS (plug: http://www.riscosopen.org/content/sales)
which I'm currently maintaining - but perhaps I should be considering making a
wider release...

Here's an example of its output, run on some real-world unoptimised code:

Cycle  Pipeline 0          Pipeline 1
================================================================================
    1  LDR r12,0x104       output conflict, wait for r12
    2  wait for r12        wait for r12
    3  ADD r12,r12,#8      need pipeline 0 for multi-cycle op
    4  wait for r12        need pipeline 0 for multi-cycle op
    5  STMIA r12,{r0-r8}   blocked during multi-cycle op
    6  STM (cycle 2)       blocked during multi-cycle op
    7  STM (cycle 3)       blocked during multi-cycle op
    8  STM (cycle 4)       blocked during multi-cycle op
    9  STM (cycle 5)       SUB r0,pc,#0x174
   10  LDRB r1,0x15c       wait for r1
   11  wait for r1         wait for r1
   12  CMP r1,#0           BNE 0x1bc
   13  MOV r1,#1           STRB r1,0x15c
   14  LDR r0,0x104        MOV r2,#0
   15  wait for r0         wait for r0
   16  wait for r0         wait for r0
   17  STR r2,[r0]         ADD r1,r0,#0x34
   18  STR r1,[r0,#4]      LS unit busy
   19  LDR r0,0x108        output conflict, wait for r0
   20  wait for r0         wait for r0
   21  SUBS r0,r0,#0x10    ADD r3,r1,#0x10
   22  STR r3,[r1,#0xc]    wait for r3
   23  MOV r1,r3           CMP r1,r0
   24  BLT 0x1a0           SUB r1,r1,#0x10
   25  wait for r1         wait for r1
   26  STR r2,[r1,#0xc]    LS unit busy
   27  LDR r3,0x104        MOV r4,#0
   28  MRS r4,CPSR         blocked during multi-cycle op
   29  MRS (cycle 2)       blocked during multi-cycle op
   30  MRS (cycle 3)       blocked during multi-cycle op
   31  MRS (cycle 4)       blocked during multi-cycle op
   32  MRS (cycle 5)       blocked during multi-cycle op
   33  MRS (cycle 6)       blocked during multi-cycle op
   34  MRS (cycle 7)       blocked during multi-cycle op
   35  MRS (cycle 8)       wait for r4
   36  TST r4,#0x1c

Etienne

Nov 17, 2010, 9:11:59 AM

On 17/11/2010 03:08, Ben Avison wrote:

> Here's an example of its output, run on some real-world unoptimised code:
>
> Cycle  Pipeline 0          Pipeline 1
> ================================================================================
>     1  LDR r12,0x104       output conflict, wait for r12
>     2  wait for r12        wait for r12
> ...

Hmm, it looks great.
Do you know if RISC OS works on the Cortex-A9 (OMAP4)?

I've reserved a Pandaboard.
You have a Beagleboard, I guess!

Thanks, Etienne

Ben Avison

Nov 17, 2010, 4:30:26 PM

On Wed, 17 Nov 2010 14:11:59 -0000, Etienne <eti...@tlk.fr> wrote:
> Do you know if cortex A9 (omap4) Works on RiscOS ?
>
> I've reserved a pandaboard.
> You have a beagleboard I guess !!

Yes, I've been using a Beagleboard for the last couple of years. Instruction
scheduling is very different with the Cortex-A9 of course, with its
speculative out-of-order execution.

There's been talk of porting RISC OS to Pandaboard, but I don't think any of
our developers have one so far, so the answer is "not yet". To be honest,
it's probably not a priority because we're still lacking SMP capability and
there are still plenty of single-core Cortex-A8 chips out there we can make
better use of. But if it's something you're interested in, we can always use
more volunteers!

Ben

WebShaker

Mar 6, 2011, 4:31:41 AM

On 02/10/2010 16:25, Ben Avison wrote:
> MUL r0,r1,r2
> MOV r3,r0

>
> I've confirmed by timing real silicon that we
> do indeed get a 4-cycle stall, so I'm reasonably confident that this is a
> correct interpretation.

I agree.
If the MUL starts in cycle 1, the MOV will execute in cycle 7:
2 cycles for the MUL,
4 cycles waiting for r0,
so the MOV starts on cycle 7.
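
The same cycle count can be reproduced with a small Python sketch. This is a toy model only: it assumes a single pipeline (no dual issue), the class and function names are made up for illustration, result stages are taken relative to the producer's final cycle, and source stages relative to the consumer's first cycle.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Insn:
    text: str
    cycles: int                     # issue cycles
    dest: Optional[str] = None      # register written, if any
    produce_stage: int = 0          # E-stage (of the final cycle) that outputs dest
    needs: Dict[str, int] = field(default_factory=dict)  # source reg -> E-stage


def schedule(insns: List[Insn]) -> List[int]:
    """Return the 1-based start cycle of each instruction (single pipeline)."""
    ready: Dict[str, int] = {}      # register -> first cycle in which it can be read
    next_free = 1
    starts = []
    for insn in insns:
        start = next_free
        for reg, stage in insn.needs.items():
            if reg in ready:
                # The consumer is in stage `stage` during cycle start + stage - 1.
                start = max(start, ready[reg] - (stage - 1))
        starts.append(start)
        final = start + insn.cycles - 1   # cycle in which the final issue cycle is in E1
        if insn.dest is not None:
            # Output happens in cycle final + produce_stage - 1, usable one cycle later.
            ready[insn.dest] = final + insn.produce_stage
        next_free = start + insn.cycles
    return starts


mul = Insn("MUL r0,r1,r2", cycles=2, dest="r0", produce_stage=5)
mov = Insn("MOV r3,r0", cycles=1, needs={"r0": 1})
print(schedule([mul, mov]))  # [1, 7]

ldr = Insn("LDR r3,[r4]", cycles=1, dest="r3", produce_stage=3)
mla_measured = Insn("MLA r0,r1,r2,r3", cycles=2, needs={"r3": 3})  # accumulator in E3
mla_trm = Insn("MLA r0,r1,r2,r3", cycles=2, needs={"r3": 2})       # accumulator in E2
print(schedule([ldr, mla_measured]))  # [1, 2]
print(schedule([ldr, mla_trm]))       # [1, 3]
```

With the accumulator needed in E3, the LDR/MLA pair finishes issuing in cycle 3; with the TRM's E2 it would finish in cycle 4.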

> MUL r3,r4,r5
> MLA r0,r1,r2,r3
>
> will execute back-to-back (no stalls), but it also says that r3 is needed by
> the MLA in E4 (and is produced in E5 of the MUL, as before). By the logic
> above, this ought to imply a 1-cycle stall. Experimentation shows that the
> back-to-back assertion is the correct one.

I totally agree with your interpretation.
There should be a stall cycle between the two instructions.

My guess is that the forwarding is done at the end of the cycle.
The result is forwarded at the end of E5 (of the MUL) to the end of the E4
stage of the MLA. So we can think of the end of E4 as the same thing as the
beginning of E5!

> I thought I'd also check the case where accumulator forwarding is not used:
>
> LDR r3,[r4]
> MLA r0,r1,r2,r3
>
> I chose LDR because it outputs quite late (in E3, confirmed by measuring a
> 2-cycle stall between LDR r3,[r4] and MOV r0,r3). This also shows no stall,
> so the TRM is wrong here too: the accumulator is needed in E3 (not E2) when
> forwarding is not in use, otherwise this would have stalled.

Once again, I agree with you.

This code should take 4 cycles,
but on real hardware it takes only 3 cycles.

So, like you, I'd say that the accumulator is needed in E3.

> At this point, I was willing to accept that someone had had a bad day when
> writing the MUL section of the TRM, so I thought I'd do some testing on
> another class of instructions. I arbitrarily chose the sign-extension
> instructions for some more experimental tests. But more inconsistencies
> became clear: Rn is needed in E1, not E2, and versions with accumulate seem
> to need Rd as a source in E2, even if the instruction is unconditional!

That's for sure.
There is no reference to the MLA instruction in the cycle table web page!


> Anyone know if this level of accuracy is characteristic of the rest of the
> instruction set?

I've run some tests.
I've written a cycle calculator too.

But for the moment, I'm using your cycle table page when I'm in doubt ;)

> Ben

Etienne
