I add a varying number of NOPs before L0, screwing with the alignment.
On an Opteron this produces the expected result of being slow when the
loop crosses a 16-byte boundary. On a Core Duo it's producing strange
results: http://quotenil.com/tmp/coreduo.log (too long to include here).
With 9, 10, or 11 NOPs (or 25, 26, 27, and so on in increments of 16),
execution time increases rather sharply, by ~15%. The only thing that I
notice is that JNLE is getting close to a 16-byte boundary. Once it's
_partly_ over it, like here at 5F-60:
; 52: B8FCFFFF7F MOV EAX, 2147483644
; 57: EB03 JMP L1
; 59: L0: 83E804 SUB EAX, 4
; 5C: L1: 83F800 CMP EAX, 0
; 5F: 7FF8 JNLE L0
it's faster again.
I'd appreciate any help on this matter: why it happens and how to cure it.
The only explanation I can think of is that the prefetching logic is
busy getting the next 16-byte block in when the branch goes back into
the same/previous block, and that this slows things down by a fraction of
a cycle per iteration. As long as there are at least 8 bytes left in the
current block, that mechanism doesn't kick in.
I.e., it is the tail of the block that's important, not the front alignment?
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
I experienced similar behaviour on AMDs: if two or more branches
are within one code fetch, then I encountered a few cycles' penalty,
especially if the target is also within this fetch cycle.
Several packs of 16 NOPs may also raise cache-line boundary crossing issues.
A cure for it? :) I'd avoid spaghetti and replace EB03 (the JMP L1)
with 83C804 (OR EAX, 4).
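For reference, here is my reading of that suggestion as a sketch in NASM
syntax (not the poster's exact code). 83C804 decodes to OR EAX, 4, which
is a no-op here because bit 2 of 2147483644 (0x7FFFFFFC) is already set,
so falling through gives the same iteration count as jumping to L1:

```nasm
        mov  eax, 2147483644   ; 0x7FFFFFFC
        or   eax, 4            ; replaces JMP L1; leaves EAX unchanged
L0:     sub  eax, 4
        cmp  eax, 0
        jnle L0                ; loop while EAX > 0
```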
But in case that alone doesn't help: if it really matters that much,
then only a known fetch start point may do it, so either a serialising
or flushing instruction ahead of the loop, or at least cache-line
alignment of the code, may speed up the loop. But perhaps it is not
really a gain for the whole story, because some time may be wasted
before the loop, or it needs more cache-line reads.
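A sketch of that cure, assuming NASM syntax. Note that CPUID is
serialising but clobbers EAX-EDX, so the counter has to be loaded
afterwards:

```nasm
        cpuid                  ; serialising; clobbers EAX, EBX, ECX, EDX
        mov  eax, 2147483644   ; (re)load the counter after CPUID
        align 16               ; pad so L0 starts on a fresh 16-byte block
L0:     sub  eax, 4
        cmp  eax, 0
        jnle L0
```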
I tried adding a CPUID instruction just before L0 and it didn't help.
Neither did inserting the NOPs just before L0 (as opposed to before the
MOV EAX ...).
Perhaps it's just a matter of location relative to cache-line bounds
and the code fetch start point.
Have you tried replacing the 'JMP L1' with 'ADD eax, 4'?
I recently found something similar: that CPUs are sensitive not just
to branch target addresses but also to the location of the branch
instruction itself. For instance, on a Pentium M, if I run a short
sequence of 9 instructions which includes a not-taken backward branch,
the native code runs in 66 cycles. If I add some extra NOPs before the
branch instruction, the time decreases to 51 cycles. In fact, here's a
table:
00000059 75F5 jnz iter_top 66 cycles
0000005A 75F4 jnz iter_top 51 cycles
0000005B 75F3 jnz iter_top 51 cycles
0000005C 75F2 jnz iter_top 51 cycles
0000005D 75F1 jnz iter_top 66 cycles
0000005E 75F0 jnz iter_top 65 cycles
0000005F 75EF jnz iter_top 53 cycles
The times in cycles are the complete times taken for the rest of the
code (not shown) including the branch instruction. The only thing that
changes is the address of the jump as shown in the first column. The
second column, 75xx, is the opcode and operand in case the jump
distance is a contributory factor. As the branch is two bytes long, the
first example sits in addresses 0x59 and 0x5a. Like you,
I found that when the branch splits over a 16-byte line it gets fast
again as shown in the last example where the instruction sits in
addresses 0x5f and 0x60.
The slightly bizarre thing is I cannot match the effect with any of
the guidelines given in the optimisation manual.
To investigate this properly I'll need to set up a script to generate
and run a number of variants of the code. That should give a bigger
picture. Doing it by hand is not a good idea.
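A minimal sketch of such a generator script in Python. The file names,
the template, and the idea of assembling and timing each variant
externally are all my assumptions, not the poster's actual setup:

```python
# Hypothetical variant generator: emit one NASM source file per NOP
# count; a real run would assemble, time, and log each one.

LOOP_TEMPLATE = """\
        bits 32
        global iter_test
iter_test:
        mov  eax, 2147483644
{pads}        jmp  L1
L0:     sub  eax, 4
L1:     cmp  eax, 0
        jnle L0
        ret
"""

def make_variant(n_nops: int) -> str:
    """Return NASM source with n_nops one-byte NOPs before the jump."""
    pads = "        nop\n" * n_nops
    return LOOP_TEMPLATE.format(pads=pads)

if __name__ == "__main__":
    # Real script: write variant_00.asm .. variant_31.asm, assemble
    # with nasm, run each under a cycle counter, and log the timings.
    print(make_variant(9))
```

The interesting padding counts (9, 10, 11, then +16) would then fall out
of the log rather than out of hand-editing.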
Could it be branch prediction? I don't know. Branch prediction is
notably organic in that it depends not just on the history of one
instruction but on the history of how one got to that instruction. The
reason I think it's possible is that very occasionally the figures do
change by about 15 cycles. But otherwise they stay consistently at the
figures shown run after run.
And it should not take any time!
The SUB operations are sequentially dependent, and the intervening branch
operations are (nearly) perfectly predictable.
As long as you have two regular execution pipes and a branch unit, all
three can operate in parallel.
What you end up with is the latency of the back-to-back SUB operations.
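A back-of-envelope check of that claim. The 1-cycle SUB latency is an
assumption for illustration; given it, the dependent SUB chain alone
bounds the whole loop from below:

```python
# Lower bound on the loop's run time from the dependent SUB chain alone.
start = 2147483644           # initial EAX (0x7FFFFFFC)
step = 4                     # SUB EAX, 4 per iteration
sub_latency_cycles = 1       # assumed SUB latency on this core

iterations = start // step
min_cycles = iterations * sub_latency_cycles
print(iterations, min_cycles)    # 536870911 536870911
```

Everything else (the CMPs and the predicted branches) can overlap with
that chain, which is why the extra CMP/JLE pairs cost nothing.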
> >Jerome H. Fine wrote:
>> However, perhaps I might add a suggestion. Unrolling the loop on a
>> core 2 duo:
>> MOV EAX, 2147483644
>> ADD EAX, 4
>> L0: SUB EAX, 4
>> CMP EAX, 0
>> JLE L1
>> CMP EAX, 0
>> JNLE L0
>> does not seem to cost anything at all for the CMP / JLE pairs which
>> were added.
> And it should not take any time!
> The SUB operations are sequentially dependent, and the intervening
> branch operations are (nearly) perfectly predictable.
> As long as you have two regular execution pipes and a branch unit, all
> three can operate in parallel.
> What you end up with is the latency of the back-to-back SUB operations.
That confirms what the timings are showing.
I had one other question. I would appreciate having another register.
Is the EBP register generally available? I can easily PUSH and POP
the EBP register (to save the value for when I return from the subroutine
which includes the above code) if it can be made available for use with
the CMP instruction as a fixed value during the execution of each loop.
CMP EAX, EBP
As an x86 dummy, I really don't understand the x86 instructions well
enough to know if the EBP register is used under special situations
(such as the ESP register which certainly can NOT be used as a
general purpose register). From what I have seen of various code
examples, EBP is available, but I thought it best to ask.
You can indeed use EBP to your heart's content, but don't depend on a
high-level debugger to help you figure out stack layouts etc.
BTW, you can also use ESP under some specific conditions, i.e. make
absolutely sure that you don't overwrite anything on the stack, and that
you recover the original ESP value before you leave that piece of code.
Sure. You can use it if you use it right. Here's a note on how I
understand the use of EBP. If it's not right, hopefully someone will
correct me.
High level languages sometimes (often) use EBP as a frame pointer.
That's useful if your code changes the stack pointer as it runs - e.g.
by pushing and popping registers temporarily - but otherwise you can
address the stack frame off ESP.
If it's not being used as a frame pointer EBP is readily available to
store values as in
mov ebp, eax
add ebp, ecx
mov [val], ebp
but if you want to use it as an address
mov eax, [ebp + 4]
be aware that it uses the stack segment register for indexing so
rather than being resolved as
mov eax, [ds:ebp + 4]
the instruction is effectively
mov eax, [ss:ebp + 4]
If you run with SS = DS or use a DS: override there's no problem with
using EBP to hold an address.
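A minimal sketch of the pattern being asked about, in NASM syntax (the
surrounding subroutine is assumed):

```nasm
        push ebp               ; preserve the caller's EBP
        xor  ebp, ebp          ; loop-invariant compare value (0)
        mov  eax, 2147483644
L0:     sub  eax, 4
        cmp  eax, ebp          ; EBP replaces the immediate 0
        jnle L0
        pop  ebp               ; restore EBP before returning
```

Since EBP here only holds a data value and is never used as an address,
the SS-default rule below doesn't come into play.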
> From what I have seen of various code
> examples, EBP is available, but I thought it best to ask.
No problem to use ESP or EBP as a general-purpose register for a short
while in a closed routine, as long as the stack pointer is first stored
in a fixed RAM location and is reloaded before any code uses the stack
again. But the default segment for ESP/EBP is the stack segment (SS,
not DS) if you want to use ESP/EBP as an address register.
For an access to DS:ESP or DS:EBP, or in combination with another
segment register like ES, FS, or GS, use a segment-override prefix.
Careful. There are a *lot* of events which use the stack implicitly.
But it is possible. I don't know in detail how such things as a task
switch on a multitasking platform are handled, but I remember that the
contents do get stored along the way.