Scheduling on Cortex

Ben Avison

May 9, 2009, 4:59:36 PM
I've been trying to get my head around how to schedule instructions,
including all the intricacies of dual issuing, but there just doesn't seem
to be enough information in the Cortex-A8 TRM to work out cycle counts even
for relatively simple real-world ARM code.

Some examples would be:

1)
LDREQ r0,[r0]
LSLNE r0,r0,#1

You'd hope that this would issue one instruction to each pipeline, because
they're mutually exclusive, but the data output hazard means that they can't
be dual issued. However, nothing is said about whether you still get a
further 2-cycle stall, supposedly between the output to Rt of the LDR and
the input to Rm of the MOV. In other words, does the decision to stall take
the condition codes into account?

2)
LDM r0,{r1-r2} ; loads r1 in 1st cycle, r2 in 2nd cycle
MUL r3,rn,r3

Does the number of stalls before the MUL change depending upon whether rn is
r1 or r2?

3)
MOVS r0,r1 ; sets or clears Z flag
MOVEQ r2,r3

Do these get dual-issued? The TRM doesn't say they don't.

4)

There are a number of examples in the Cortex-A8 TRM where an unconditional
branch instruction is shown as executing in pipeline 0 while the following
instruction is dual-issued in pipeline 1. Is this correct? In other words,
is there some sort of signal from pipeline 0 to pipeline 1 to abort the
instruction being decoded? (Something similar would permit dual-issue in
example 3.)

I'm sure there are many other cases I've not thought of yet. Perhaps I'm
missing some other document that describes scheduling in more detail?

Ben

Nils

May 9, 2009, 7:20:57 PM
Hi Ben,

> I'm sure there are many other cases I've not thought of yet. Perhaps I'm
> missing some other document that describes scheduling in more detail?

I think you'll find all this information in the latest ARM ARM (ARM
Architecture Reference Manual).

However, the last time I tried to get my hands on it (about two months
ago) it was *still* confidential.

Shame on you, ARM. Selling IP that ends up as real silicon and not
giving out the details of how to achieve the best performance is
counterproductive.

That's something new, by the way: I never used to have problems getting
the latest ARM ARM from the website.

Cheers,
Nils

Ben Avison

May 10, 2009, 7:19:26 AM
On Sun, 10 May 2009, Nils <n.pipe...@cubic.org> wrote:
>> I'm sure there are many other cases I've not thought of yet. Perhaps I'm
>> missing some other document that describes scheduling in more detail?
>
> I think you'll find all this information in the latest ARM ARM (ARM
> Architecture Reference Manual).

I seem to have been one of the chosen few to have been permitted to see
this, but as far as I can tell, like earlier ARM ARMs, it doesn't describe
implementation details like cache structures and instruction pipelines.
That sort of thing has always previously been documented in the manuals for
the CPU core (which is why I was looking in the Cortex-A8 TRM).

I'm sure the information must exist though, otherwise how could the
RealView and GCC compiler people develop their scheduling compilers? It's
possible that the information has not been made public, but as you say, that
would be counter-productive, especially since real silicon is now available,
so the rules could be (with a fair bit of effort) deduced experimentally.

Ben

Wilco Dijkstra

May 11, 2009, 7:29:22 AM

"Ben Avison" <usenet...@avison.me.uk> wrote in message news:op.utokl...@balaptop.ba...

> I've been trying to get my head around how to schedule instructions,
> including all the intricacies of dual issuing, but there just doesn't seem
> to be enough information in the Cortex-A8 TRM to work out cycle counts even
> for relatively simple real-world ARM code.

TRMs are written by hardware people for hardware people who already have a
very detailed understanding of the core. It is telling, for example, that the
pipeline is not described at all: not even a single pipeline diagram is shown,
nor an explanation of what E1, E2 etc. mean. So in order to understand the TRM
at all, you'll need to search for presentations and papers on Cortex-A8, e.g.:

http://www.arm.com/pdfs/TigerWhitepaperFinal.pdf

There are various old presentations that you can find on the internet that go
into more detail.

> Some examples would be:
>
> 1)
> LDREQ r0,[r0]
> LSLNE r0,r0,#1
>
> You'd hope that this would issue one instruction to each pipeline, because
> they're mutually exclusive, but the data output hazard means that they can't
> be dual issued. However, nothing is said about whether you still get a
> further 2-cycle stall, supposedly between the output to Rt of the LDR and
> the input to Rm of the MOV. In other words, does the decision to stall take
> the condition codes into account?

Instructions are statically scheduled based on the assumption they will be
executed, so you do indeed get a 2 cycle stall.

> 2)
> LDM r0,{r1-r2} ; loads r1 in 1st cycle, r2 in 2nd cycle
> MUL r3,rn,r3
>
> Does the number of stalls before the MUL change depending upon whether rn is
> r1 or r2?

Yes. The first cycle is assumed to transfer only the first register, the
second cycle the next 2, and so on, even if the base address is 64-bit
aligned. For STM it reads the first 2 registers in the first cycle but may
only transfer 1 register.

In your example the LDM takes 2 cycles to execute, with r1 available in E1
in the 4th cycle, and r2 in E1 in the 5th cycle. So the multiply stalls for 1 cycle
if rn = r1, and 2 if rn = r2.
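
The numbers above can be turned into a small model. This is only a sketch in Python that extrapolates from the two data points given (1 stall for the first register, 2 for the second) and from the stated transfer pattern (one register in the first cycle, two per cycle afterwards). The function names, and the assumption that each additional transfer cycle costs exactly one additional stall, are illustrative and not taken from the TRM.

```python
def ldm_transfer_cycle(position):
    """Cycle (0-based) in which the register at `position` in the LDM list loads.

    Assumes one register in the first cycle and two per cycle afterwards.
    """
    if position == 0:
        return 0
    return 1 + (position - 1) // 2


def mul_stall_cycles(reg_list, rn):
    """Estimated stalls for a MUL issued right after the LDM and reading `rn`.

    Assumption: 1 stall for a register loaded in the first transfer cycle,
    plus one more per later transfer cycle; 0 if the LDM does not load `rn`.
    Other hazards are not modelled.
    """
    if rn not in reg_list:
        return 0
    return 1 + ldm_transfer_cycle(reg_list.index(rn))


# LDM r0,{r1-r2} followed by MUL r3,rn,r3
print(mul_stall_cycles(["r1", "r2"], "r1"))  # 1
print(mul_stall_cycles(["r1", "r2"], "r2"))  # 2
```

For longer register lists the model is untested guesswork; measuring on real silicon would be the only way to confirm it.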

> 3)
> MOVS r0,r1 ; sets or clears Z flag
> MOVEQ r2,r3
>
> Do these get dual-issued? The TRM doesn't say they don't.

The flags are updated in E2 and conditionally executed ALU instructions
must resolve in E2, so these cannot be dual issued. However, Cortex-A8
can dual-issue 2 flag-setting instructions and merge their flags in E2;
this is essential for Thumb-2 code.

> 4)
>
> There are a number of examples in the Cortex-A8 TRM where an unconditional
> branch instruction is shown as executing in pipeline 0 while the following
> instruction is dual-issued in pipeline 1. Is this correct? In other words,
> is there some sort of signal from pipeline 0 to pipeline 1 to abort the
> instruction being decoded? (Something similar would permit dual-issue in
> example 3.)

A branch can be dual-issued with either the previous instruction (even if it is
a compare) or the next predicted instruction.

> I'm sure there are many other cases I've not thought of yet. Perhaps I'm
> missing some other document that describes scheduling in more detail?

Not that I know of. One can only guess what ARM has in mind when not
giving software people essential information they need to get the best
out of their cores. What we need is a detailed software optimization manual.

Wilco


Torben Ægidius Mogensen

May 12, 2009, 6:20:33 AM
"Wilco Dijkstra" <Wilco.remove...@ntlworld.com> writes:

> "Ben Avison" <usenet...@avison.me.uk> wrote in message news:op.utokl...@balaptop.ba...

>> 1)


>> LDREQ r0,[r0]
>> LSLNE r0,r0,#1
>>
>> You'd hope that this would issue one instruction to each pipeline, because
>> they're mutually exclusive, but the data output hazard means that they can't
>> be dual issued. However, nothing is said about whether you still get a
>> further 2-cycle stall, supposedly between the output to Rt of the LDR and
>> the input to Rm of the MOV. In other words, does the decision to stall take
>> the condition codes into account?
>
> Instructions are statically scheduled based on the assumption they will be
> executed, so you do indeed get a 2 cycle stall.

Can you get around this by using the ITE instruction in Thumb-2? I.e.,
does the scheduler track sequential dependencies between all
instructions after an I(T|E)*, or does it only track dependencies
between instructions in the same branch?

For example, if you have

ITETE (some condition)
instruction 1
instruction 2
instruction 3
instruction 4
instruction 5

does it track dependencies 1->3->5 and 2->4->5 or 1->2->3->4->5?

Torben

Marcus Harnisch

May 12, 2009, 3:55:19 PM
"Wilco Dijkstra" <Wilco.remove...@ntlworld.com> writes:

> "Ben Avison" <usenet...@avison.me.uk> wrote in message news:op.utokl...@balaptop.ba...

>> 1)
>> LDREQ r0,[r0]
>> LSLNE r0,r0,#1
>>
>> You'd hope that this would issue one instruction to each pipeline, because
>> they're mutually exclusive, but the data output hazard means that they can't
>> be dual issued. However, nothing is said about whether you still get a
>> further 2-cycle stall, supposedly between the output to Rt of the LDR and
>> the input to Rm of the MOV. In other words, does the decision to stall take
>> the condition codes into account?
>
> Instructions are statically scheduled based on the assumption they will be
> executed, so you do indeed get a 2 cycle stall.

Are you sure? ARM1136/76 avoids extra stall cycles in this case,
AFAIK. Has this feature been removed in Cortex-A8?

Regards
Marcus

--
note that "property" can also be used as syntaxtic sugar to reference
a property, breaking the clean design of verilog; [...]

(seen on http://www.veripool.com/verilog-mode_news.html)

Wilco Dijkstra

May 12, 2009, 5:04:41 PM

"Marcus Harnisch" <marcus....@gmx.net> wrote in message news:8763g6c...@harnisch.dyndns.org...

> "Wilco Dijkstra" <Wilco.remove...@ntlworld.com> writes:
>
>> "Ben Avison" <usenet...@avison.me.uk> wrote in message news:op.utokl...@balaptop.ba...
>>> 1)
>>> LDREQ r0,[r0]
>>> LSLNE r0,r0,#1
>>>
>>> You'd hope that this would issue one instruction to each pipeline, because
>>> they're mutually exclusive, but the data output hazard means that they can't
>>> be dual issued. However, nothing is said about whether you still get a
>>> further 2-cycle stall, supposedly between the output to Rt of the LDR and
>>> the input to Rm of the MOV. In other words, does the decision to stall take
>>> the condition codes into account?
>>
>> Instructions are statically scheduled based on the assumption they will be
>> executed, so you do indeed get a 2 cycle stall.
>
> Are you sure? ARM1136/76 avoids extra stall cycles in this case,
> AFAIK. Has this feature been removed in Cortex-A8?

The ARM11 also has stalls in this and similar cases, for example when you
conditionally write to a register that is also being written by a load.

Cortex-A8 has a completely different microarchitecture, and different
tradeoffs were made. To achieve a high frequency you can't do many of the
tricks ARM11 does. For example, instructions take the same number of cycles
even if they are not executed. Basically, the wider and faster one goes, the
simpler the pipeline has to be. Despite all this, Cortex-A8 achieves far
better efficiency per MHz than ARM11, mainly because of its dual-issue
capability and far more advanced branch prediction.

Wilco


Wilco Dijkstra

May 12, 2009, 5:33:07 PM

"Torben Ægidius Mogensen" <tor...@pc-003.diku.dk> wrote in message news:7zhbzqm...@pc-003.diku.dk...

I believe it implements the last option (1->2->3->4->5). Basically it has to
check whether up to 2 decoded instructions can be dual-issued. It has one
cycle to do this check, decide whether it can issue 0, 1, or 2 instructions,
and update the internal state with this decision. It involves comparing
bitmasks of the registers read, written, etc. against masks saying which
registers are available in the register file and which are available at
which pipeline stage.
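
As a rough illustration of the bitmask comparison described above, here is a minimal Python sketch. It checks register hazards only and deliberately ignores condition codes, in line with the static scheduling described earlier in the thread. The data layout, the function names, and the ADD/SUB pair are hypothetical; the real issue logic also tracks per-stage register availability, flags, and pipeline resources, none of which is modelled here.

```python
def reg_mask(*regs):
    """Build a 16-bit mask with one bit per register number (r0..r15)."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


def registers_allow_dual_issue(first, second):
    """Return True if `second` has no register hazard against `first`.

    Each instruction is a dict with 'reads' and 'writes' bitmasks.
    Condition codes are ignored on purpose: mutually exclusive
    conditions do not remove a hazard in this model.
    """
    raw = first["writes"] & second["reads"]    # second reads what first writes
    waw = first["writes"] & second["writes"]   # both write the same register
    return raw == 0 and waw == 0


# Example 1 from the thread: LDREQ r0,[r0] / LSLNE r0,r0,#1
ldreq = {"reads": reg_mask(0), "writes": reg_mask(0)}
lslne = {"reads": reg_mask(0), "writes": reg_mask(0)}
print(registers_allow_dual_issue(ldreq, lslne))  # False

# Hypothetical independent pair: ADD r1,r2,r3 / SUB r4,r5,r6
add = {"reads": reg_mask(2, 3), "writes": reg_mask(1)}
sub = {"reads": reg_mask(5, 6), "writes": reg_mask(4)}
print(registers_allow_dual_issue(add, sub))  # True
```

The point of the mask representation is that the whole check reduces to a few AND operations, which is the kind of work that plausibly fits in a single cycle.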

You're right that it would be easy to avoid dependency checks between
conditional instructions if they have opposite conditions. The issue is in
somehow updating the internal state to reflect that r0 is available to
following instructions with the NE condition in E1, in E4 if EQ and in E0
(register file) for any other condition.

To do this you need to keep track of far more state to decide whether the
next instruction(s) could be issued immediately after the LDR/LSL pair, and
this likely means a lower frequency. Given that about 5% of instructions are
conditional it's not worth doing this - the generic pairing means that if an
instruction cannot be dual-issued, it may be paired with the next instruction.

Wilco


Peter Nyström

May 14, 2009, 3:33:33 PM
Torben Ægidius Mogensen wrote:
<snip>

>
> Can you get around this by using the ITE instruction in Thumb-2? I.e.,
> does the scheduler track sequential dependencies between all
> instructions after an I(T|E)*, or does it only track dependencies
> between instructions in the same branch?
>
> For example, if you have
>
> ITETE (some condition)
> instruction 1
> instruction 2
> instruction 3
> instruction 4
> instruction 5
>
> does it track dependencies 1->3->5 and 2->4->5 or 1->2->3->4->5?
>
> Torben

The fact that an instruction within an IT-block can change the flags
and thereby affect which of the following instructions gets executed
makes it hard to remove dependencies.

//Peter

Torben Ægidius Mogensen

May 15, 2009, 8:14:10 AM

I see. I thought the execute/not-execute part was decided once and
for all at the I(T|E)* instruction, not tested against the current
flags at each instruction.

Is this an intentional design decision or a consequence of the
implementation (which may be expansion into the original 32-bit ARM
ISA)?

Torben

Marcus Harnisch

May 15, 2009, 11:50:20 AM
tor...@pc-003.diku.dk (Torben Ægidius Mogensen) writes:

> I see. I thought the execute/not-execute part was decided once and
> for all at the I(T|E)* instruction, not tested against the current
> flags at each instruction.

I was quite surprised when I first discovered the actual behavior. It
seems a bit counterintuitive at first[1], although certainly not
"wrong".

It'd be appreciated if the ARM ARM mentioned this explicitly. You have
to read between the lines to figure out what really happens.

Regards
Marcus

Footnotes:
[1] The often quoted C-language if-then statement analogy applies to
simple cases only.

Wilco Dijkstra

May 15, 2009, 12:07:13 PM

"Marcus Harnisch" <marcus....@gmx.net> wrote in message news:87d4aab...@harnisch.dyndns.org...

> tor...@pc-003.diku.dk (Torben Ægidius Mogensen) writes:
>
>> I see. I thought the execute/not-execute part was decided once and
>> for all at the I(T|E)* instruction, not tested against the current
>> flags at each instruction.
>
> I was quite surprised when I first discovered the actual behavior. It
> seems a bit counterintuitive at first[1], although certainly not
> "wrong".
>
> It'd be appreciated if the ARM ARM mentioned this explicitly. You have
> to read between the lines to figure out what really happens.

The reason is that you can conditionally execute comparisons like:

cmp x, #0
cmpne y, #100
addeq ...
movne ...

This kind of sequence is so frequent that you don't want to have to use
a new IT instruction every time the flags are updated.
The fact that IT only allows one condition and its inverse is already
a major constraint: assembler code I write often uses 3-4 different
conditions in as many instructions.
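
To make the behaviour of that sequence concrete, here is a small Python sketch that walks through it. Only the Z flag is modelled, and the assumption is that the ADDEQ and MOVNE do not themselves set flags; the function name and the returned list of mnemonics are purely illustrative.

```python
def run(x, y):
    """Return the instructions that execute for the given x and y.

    Each condition is tested against the flags as they stand when that
    instruction is reached, not as they were at the first compare.
    """
    executed = []

    z = (x == 0)                 # cmp x, #0
    executed.append("cmp")

    if not z:                    # cmpne y, #100 (may update Z again)
        z = (y == 100)
        executed.append("cmpne")

    if z:                        # addeq ...
        executed.append("addeq")
    if not z:                    # movne ...
        executed.append("movne")
    return executed


print(run(0, 5))    # ['cmp', 'addeq']
print(run(1, 100))  # ['cmp', 'cmpne', 'addeq']
print(run(1, 5))    # ['cmp', 'cmpne', 'movne']
```

In other words the ADDEQ runs when x == 0 or y == 100, which only works because the condition is evaluated per instruction rather than fixed once at the start of the sequence.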

Wilco


Ben Avison

May 15, 2009, 4:19:06 PM
On Fri, 15 May 2009, Marcus Harnisch <marcus....@gmx.net> wrote:
> It'd be appreciated if the ARM ARM mentioned this explicitly. You have
> to read between the lines to figure out what really happens.

I remember having some doubt about the behaviour when I first read about
the instruction in the ARM ARM too - but there's just enough information
to deduce what the intent is.

I find it easy to visualise Thumb-2 as effectively a mixture of 20-bit and
36-bit instructions, where 4 of the bits happen to come from the CPSR rather
than from RAM.
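
That picture can be sketched in Python. The model below follows the ITSTATE encoding as I understand it from the ARM ARM: bits 7:4 of ITSTATE hold the condition for the current instruction, and the low five bits shift left after each instruction in the block. The function names are illustrative, and the sketch should be checked against the manual before being relied on.

```python
def it_state(firstcond, pattern):
    """Build the 8-bit ITSTATE set up by an IT instruction.

    firstcond: 4-bit condition code (0b0000 is EQ, 0b0001 is NE).
    pattern:   the letters after 'I', e.g. "TETE" for ITETE.
    """
    assert 1 <= len(pattern) <= 4 and pattern[0] == "T"
    low = firstcond & 1
    mask = 0
    for letter in pattern[1:]:
        bit = low if letter == "T" else low ^ 1
        mask = (mask << 1) | bit
    mask = (mask << 1) | 1          # terminating 1 bit
    mask <<= 4 - len(pattern)       # pad the mask out to 4 bits
    return (firstcond << 4) | mask


def conditions(itstate):
    """Return the 4-bit condition applied to each instruction in the block."""
    out = []
    while itstate & 0x0F:           # inside an IT block while bits 3:0 are non-zero
        out.append(itstate >> 4)    # condition for this instruction: bits 7:4
        if (itstate & 0x07) == 0:
            itstate = 0             # that was the last instruction of the block
        else:
            itstate = (itstate & 0xE0) | ((itstate << 1) & 0x1F)
    return out


EQ, NE = 0b0000, 0b0001
print(conditions(it_state(EQ, "TETE")))  # [0, 1, 0, 1], i.e. EQ, NE, EQ, NE
```

Each instruction in the block therefore picks up its 4 condition bits from the CPSR, and the flags are still tested when that instruction executes.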

Ben
