Loop alignment on core duo

melis.gabor

unread,

Mar 4, 2009, 12:38:46 PM3/4/09

to

I read that aligning loop entry on a 16 byte boundary is unimportant
on core duo. While I don't doubt the wisdom of that there is clearly
something at play that I don't understand. Consider this loop:
MOV EAX, 2147483644
JMP L1
L0: SUB EAX, 4
L1: CMP EAX, 0
JNLE L0

I add a varying number of NOPS before L0 screwing with the alignment.
On an opteron this produces the expected result of being slow when the
loop crosses a 16 byte boundary. On a core duo it's producing strange
results: http://quotenil.com/tmp/coreduo.log (too long to include
here).

With 9,10,11 or (25,26,27 and so on in increments of 16) NOPs,
execution time increases rather sharply by ~15%. The only thing that I
notice is that JNLE is getting close to a 16 byte boundary. Once it's
_partly_ over it, like here at 5F-D0:
; 52: B8FCFFFF7F MOV EAX, 2147483644
; 57: EB03 JMP L1
; 59: L0: 83E804 SUB EAX, 4
; 5C: L1: 83F800 CMP EAX, 0
; 5F: 7FF8 JNLE L0
it's faster again.

I'd appreciate any help on this matter: why it happens, how to cure
it.

Thanks,
Gabor Melis

Terje Mathisen

unread,

Mar 5, 2009, 1:27:29 AM3/5/09

to

The only explanation I can think of is that the prefetching logic is
busy getting the next 16-byte block in, when the branch goes back into
the same/previous block, and that this slows things up by a fraction of
a cycle.

As long as there is at least 8 bytes left in the current block, that
mechanism doesn't occur.

I.e. it is the tail of the block that's important, not the front alignment?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Wolfgang Kern

unread,

Mar 5, 2009, 4:13:44 AM3/5/09

to

I experienced similar behaviour on AMDs, if two ore more branches
are within one code fetch then I encountered a few cycles penalty,
especially if the target is also within this fetch-cycle.

Several packs of 16 NOPs may also raise cache bound crossing issues.

A cure for it ? :) I'd avoid spaghetti and replace EB03 with 83C804.

But in case of:
if it really matters that much then only a known fetch start point
may do it, so either a serialising or flushing instruction ahead of
the loop, or at least cache bound code align may speed up the loop.
But perhaps not really a gain for the whole story, because some
time may be wasted before the loop or it needs more cache line reads.

hth
__
wolfgang

melis.gabor

unread,

Mar 5, 2009, 5:56:33 AM3/5/09

to

On Mar 5, 10:13 am, "Wolfgang Kern" <spamt...@crayne.org> wrote:
> melis.gabor wrote:
> > I read that aligning loop entry on a 16 byte boundary is unimportant
> > on core duo. While I don't doubt the wisdom of that there is clearly
> > something at play that I don't understand. Consider this loop:
> > MOV EAX, 2147483644
> > JMP L1
> > L0: SUB EAX, 4
> > L1: CMP EAX, 0
> > JNLE L0
>
> > I add a varying number of NOPS before L0 screwing with the alignment.
> > On an opteron this produces the expected result of being slow when the
> > loop crosses a 16 byte boundary. On a core duo it's producing strange

> > results:http://quotenil.com/tmp/coreduo.log(too long to include

I tried adding a cpuid instruction just before L0 and it didn't help.
Neither did inserting the NOPs just before L0 (as opposed to before
MOV EAX ...).

Wolfgang Kern

unread,

Mar 5, 2009, 4:11:32 PM3/5/09

to

melis.gabor wrote:
...

> I tried adding a cpuid instruction just before L0 and it didn't help.
> Neither did inserting the NOPs just before L0 (as opposed to before
> MOV EAX ...).

Perhaps it's just a matter of location relative to cache bounds
and code fetch start pont.

Have you tried to replace the 'JMP L1' with 'ADD eax,4' ?

__
wolfgang

ilya

unread,

Mar 5, 2009, 11:14:26 AM3/5/09

to

I have found that if you loop body is larger in bytes than single
cache line then alignments on 64 bytes & 128 help.
>From what I understand when processor jumps back on 64 or 128 aligned
address it has to deal only with single cache line coming up ahead.

James Harris

unread,

Mar 5, 2009, 2:38:57 PM3/5/09

to

On 4 Mar, 17:38, melis.gabor <spamt...@crayne.org> wrote:
> I read that aligning loop entry on a 16 byte boundary is unimportant
> on core duo. While I don't doubt the wisdom of that there is clearly
> something at play that I don't understand. Consider this loop:
> MOV EAX, 2147483644
> JMP L1
> L0: SUB EAX, 4
> L1: CMP EAX, 0
> JNLE L0
>
> I add a varying number of NOPS before L0 screwing with the alignment.
> On an opteron this produces the expected result of being slow when the
> loop crosses a 16 byte boundary. On a core duo it's producing strange

> results:http://quotenil.com/tmp/coreduo.log(too long to include

> here).
>
> With 9,10,11 or (25,26,27 and so on in increments of 16) NOPs,
> execution time increases rather sharply by ~15%. The only thing that I
> notice is that JNLE is getting close to a 16 byte boundary. Once it's
> _partly_ over it, like here at 5F-D0:
> ; 52: B8FCFFFF7F MOV EAX, 2147483644
> ; 57: EB03 JMP L1
> ; 59: L0: 83E804 SUB EAX, 4
> ; 5C: L1: 83F800 CMP EAX, 0
> ; 5F: 7FF8 JNLE L0
> it's faster again.
>
> I'd appreciate any help on this matter: why it happens, how to cure
> it.

I recently found something similar: that CPUs are sensitive not just
to branch target addresses but also to the location of the branch
instruction itself. For instance, on a Pentium M if I run a short
sequence of 9 instructions which includes a not-taken branch back the
native code runs in 66 cycles. If I add some extra nops before the
branch instruction the time decreases to 51 cycles. In fact here's a
summary:

00000059 75F5 jnz iter_top 66 cycles
0000005A 75F4 jnz iter_top 51 cycles
0000005B 75F3 jnz iter_top 51 cycles
0000005C 75F2 jnz iter_top 51 cycles
0000005D 75F1 jnz iter_top 66 cycles
0000005E 75F0 jnz iter_top 65 cycles
0000005F 75EF jnz iter_top 53 cycles

The times in cycles are the complete times taken for the rest of the
code (not shown) including the branch instruction. The only thing that
changes is the address of the jump as shown in the first column. The
second column, 75xx, is the opcode and operand in case the jump
distance is a contributory factor. As the branch is two bytes long the
first example sits in addresses 0x59 and 0x5a, for example. Like you,
I found that when the branch splits over a 16-byte line it gets fast
again as shown in the last example where the instruction sits in
addresses 0x5f and 0x60.

The slightly bizarre thing is I cannot match the effect with any of
the guidelines given in the optimisation manual.

To investigate this properly I'll need to set up a script to generate
and run a number of variants of the code. That should give a bigger
picture. Doing it by hand is not a good idea.

Could it be branch prediction? I don't know. Branch prediction is
notably organic in that it depends not just on the history of one
instruction but on the history of how one got to that instruction. The
reason I think it's possible is that very occasionally the figures do
change by about 15 cycles. But otherwise they stay consistently at the
figures shown run after run.

James

Jerome H. Fine

unread,

Mar 5, 2009, 11:31:44 PM3/5/09

to

Jerome Fine replies:

I have also been experimenting with similar loops with both a Pentium III and
a core 2 duo. I was not going to request additional help until more investigation.
However, perhaps I might add a suggestion. Unrolling the loop on a core 2 duo:

            MOV     EAX, 2147483644
            ADD      EAX, 4
L0:       SUB       EAX, 4
            CMP      EAX, 0
            JLE         L1
            SUB       EAX, 4
            CMP      EAX, 0
            JLE         L1
            SUB       EAX, 4
            CMP      EAX, 0
            JLE        L1
            SUB       EAX, 4
            CMP      EAX, 0
            JLE         L1
            SUB       EAX, 4
            CMP      EAX, 0
            JLE        L1
SUB       EAX, 4
            CMP      EAX, 0
            JLE         L1
            SUB       EAX, 4
            CMP      EAX, 0
            JLE        L1
            SUB      EAX, 4

CMP EAX, 0
JNLE L0

L1:

does not seem to cost anything at all for the CMP / JLE pairs which
are not taken. On a 2.66 GHz E8200, the above code probably takes
just under 0.8 seconds. The time to execute the following code on thAT
core 2 duo seems to be just about the same:

           MOV     EAX, 2147483644
           ADD      EAX, 4
L0:      SUB       EAX, 4
           SUB       EAX, 4
           SUB       EAX, 4
           SUB       EAX, 4
           SUB       EAX, 4
   SUB       EAX, 4
           SUB       EAX, 4
           SUB     EAX, 4

CMP EAX, 0
JNLE L0

L1:

In either case, the time to execute each SUB instruction is about 0.4
nanoseconds on a 2.66 GHz core 2 duo (E8200) CPU.

The conclusion is that a core 2 duo CPU is able to execute ahead
sufficiently to process the "not taken" conditional jump instructions
which do not require out of sequence instructions except at the end
of the loop.

With 750 MHz Pentium III, the first example takes about 5 seconds
while the second example takes just under 3 seconds. Can anyone
verify the above elapsed times for either of these CPUs?

By the way, I have 6 other loops which add one additional instruction
for each SUB instruction, specifically a memory reference. In all cases,
the inclusion of the CMP / JLE pairs does not seem to require additional
time to execute the unrolled loops. While I did not pay any attention to
alignment of the loop entry address, I would have expected that out of
8 loops, I would have noticed a systematic pattern of a time increase in
some cases. Instead, running under Windows XP, there was the usual
variation depending on the operating system activity. One thing I did
seem to notice was that there was a small but measurable decrease in
the execution time of the loop when I used task manager to restrict the
process to a single CPU. The possible reason may be that WXP has
less overhead when it interrupts the process to service other code and
then returns to the same core each time which is using the CPU at 100%
activity at that point - thereby also using the other core for all of the other
functions which were at minimal activity during each test. In fact, is it
possible that (since the other core was available to handle what for a
single core cpu would have been interrupt activity) the core dedicated
to the above code experienced little, if any, interrupts?

Can anyone comment on the above information? Is the above information
useful?

By the way, for the code I am testing, each SUB instruction uses the
LEA instruction instead. The JNLE instruction is a JB instruction and
each JLE instruction is a JAE instruction. I initially started out using an
ADD instruction, but found that the LEA instruction is perfect for the
loop I need to use. I have found that the execution time for the LEA
instruction which uses only registers is the same as the ADD instruction.

One final question. I would appreciate having another register. Is the
EBP register generally available? I can easily PUSH and POP the
EBP register if it can be available to use with the CMP instruction
as a fixed value during the execution of each loop. However, I really
don't understand the x86 instructions well enough to know if the EBP
register is used under special situations that I don't understand.

Sincerely yours,

Jerome Fine

Terje Mathisen

unread,

Mar 6, 2009, 2:19:02 AM3/6/09

to

Jerome H. Fine wrote:
> However, perhaps I might add a suggestion. Unrolling the loop on a core
> 2 duo:
>
> MOV EAX, 2147483644
> ADD EAX, 4
> L0: SUB EAX, 4
> CMP EAX, 0
> JLE L1
> SUB EAX, 4
> CMP EAX, 0
> JLE L1
> SUB EAX, 4

[snip]

> CMP EAX, 0
> JNLE L0
> L1:
>
> does not seem to cost anything at all for the CMP / JLE pairs which

And it should not take any time!

The SUB operations are sequentally dependent, and the intervening branch
operations are (nearly) perfectly predictable.

As long as you have two regular execution pipes and a branch unit, all
three can operate in parallel.

What you end up with is the latency of the back-to-back SUB operations.

Jerome H. Fine

unread,

Mar 6, 2009, 6:34:05 AM3/6/09

to

>Terje Mathisen wrote:

> >Jerome H. Fine wrote:
>
>> However, perhaps I might add a suggestion. Unrolling the loop on a
>> core 2 duo:
>> MOV EAX, 2147483644
>> ADD EAX, 4
>> L0: SUB EAX, 4
>> CMP EAX, 0
>> JLE L1
>

>> CMP EAX, 0
>> JNLE L0
>> L1:
>>
>> does not seem to cost anything at all for the CMP / JLE pairs which
>
> And it should not take any time!
> The SUB operations are sequentally dependent, and the intervening
> branch operations are (nearly) perfectly predictable.
> As long as you have two regular execution pipes and a branch unit, all
> three can operate in parallel.
> What you end up with is the latency of the back-to-back SUB operations.

That confirms what the times are providing.

I had one other question. I would appreciate having another register.

Is the EBP register generally available? I can easily PUSH and POP

the EBP register (to save the value for when I return from the subroutine
which includes the above code) if it can be made available for use with

the CMP instruction as a fixed value during the execution of each loop.

CMP EAX, EBP
JB L0

As an x86 dummy, I really don't understand the x86 instructions well

enough to know if the EBP register is used under special situations

(such as the ESP register which certainly can NOT be used as a
general purpose register). From what I have seen of various code
examples, EBP is available, but I thought it best to ask.

Sincerely yours,

Jerome Fine

Terje Mathisen

unread,

Mar 6, 2009, 3:00:46 PM3/6/09

to

Jerome H. Fine wrote:
> As an x86 dummy, I really don't understand the x86 instructions well
> enough to know if the EBP register is used under special situations
> (such as the ESP register which certainly can NOT be used as a
> general purpose register). From what I have seen of various code
> examples, EBP is available, but I thought it best to ask.

You can indeed use EBP to your hearth's content, but don't depend on a
high-level debugger to help you figuring out stack layouts etc.

BTW, you can also use ESP under some specific conditions, i.e. make
absolutely sure that you don't overwrite anything on the stack, and that
you recover the original ESP value before you leave that piece of code.

James Harris

unread,

Mar 6, 2009, 9:34:06 PM3/6/09

to

Sure. You can use it if you use it right. Here's a note on how I
understand the use of EBP. If it's not right hopefully someone will
correct me.

High level languages sometimes (often) use EBP as a frame pointer.
That's useful if your code changes the stack pointer as it runs - e.g.
by pushing and popping registers temporarily - but otherwise you can
address the stack frame off ESP.

If it's not being used as a frame pointer EBP is readily available to
store values as in

mov ebp, eax
add ebp, ecx
mov [val], ebp

but if you want to use it as an address

mov eax, [ebp + 4]

be aware that it uses the stack segment register for indexing so
rather than being resolved as

mov eax, [ds:ebp + 4]

the instruction is effectively

mov eax, [ss:ebp + 4]

If you run with SS = DS or use a DS: override there's no problem with
using EBP to hold an address.

James

Dirk Wolfgang Glomp

unread,

Mar 7, 2009, 3:56:49 AM3/7/09

to

Am Fri, 06 Mar 2009 06:34:05 -0500 schrieb Jerome H. Fine:
> As an x86 dummy, I really don't understand the x86 instructions well
> enough to know if the EBP register is used under special situations
> (such as the ESP register which certainly can NOT be used as a
> general purpose register).

> From what I have seen of various code
> examples, EBP is available, but I thought it best to ask.

No problem to use ESP or EBP as a general purpose register for a short
while in a closed routine, when the pointer to the stack will be first
stored in a fixed ramlocation before ...and at last will be reloaded before
any code use the stack again. But the default segment for ESP/EBP is the
stacksegemt (SS not DS) ich you want to use ESP/EBP as an adressregister.
For an access to DS:ESP or DS:EBP or in combination with other
segmentregister like ES, FS, GS use a segment-overide-prefix.

Dirk

H. Peter Anvin

unread,

Mar 8, 2009, 12:57:38 AM3/8/09

to

Careful. There are a *lot* of events which use the stack implicitly.

-hpa

Dirk Wolfgang Glomp

unread,

Mar 8, 2009, 6:22:08 AM3/8/09

to

But it´s possible. I didn´t know very well how to compute such things
like a task-switch of a multitasking-platform, but i remember that the
contents on this way will be stored.

Dirk