I read that aligning loop entry on a 16 byte boundary is unimportant on core duo. While I don't doubt the wisdom of that there is clearly something at play that I don't understand. Consider this loop: MOV EAX, 2147483644 JMP L1 L0: SUB EAX, 4 L1: CMP EAX, 0 JNLE L0
I add a varying number of NOPS before L0 screwing with the alignment. On an opteron this produces the expected result of being slow when the loop crosses a 16 byte boundary. On a core duo it's producing strange results: http://quotenil.com/tmp/coreduo.log (too long to include here).
With 9,10,11 or (25,26,27 and so on in increments of 16) NOPs, execution time increases rather sharply by ~15%. The only thing that I notice is that JNLE is getting close to a 16 byte boundary. Once it's _partly_ over it, like here at 5F-D0: ; 52: B8FCFFFF7F MOV EAX, 2147483644 ; 57: EB03 JMP L1 ; 59: L0: 83E804 SUB EAX, 4 ; 5C: L1: 83F800 CMP EAX, 0 ; 5F: 7FF8 JNLE L0 it's faster again.
I'd appreciate any help on this matter: why it happens, how to cure it.
melis.gabor wrote: > I read that aligning loop entry on a 16 byte boundary is unimportant > on core duo. While I don't doubt the wisdom of that there is clearly > something at play that I don't understand. Consider this loop: > MOV EAX, 2147483644 > JMP L1 > L0: SUB EAX, 4 > L1: CMP EAX, 0 > JNLE L0
> I add a varying number of NOPS before L0 screwing with the alignment. > On an opteron this produces the expected result of being slow when the > loop crosses a 16 byte boundary. On a core duo it's producing strange > results: http://quotenil.com/tmp/coreduo.log (too long to include > here).
> With 9,10,11 or (25,26,27 and so on in increments of 16) NOPs, > execution time increases rather sharply by ~15%. The only thing that I > notice is that JNLE is getting close to a 16 byte boundary. Once it's > _partly_ over it, like here at 5F-D0: > ; 52: B8FCFFFF7F MOV EAX, 2147483644 > ; 57: EB03 JMP L1 > ; 59: L0: 83E804 SUB EAX, 4 > ; 5C: L1: 83F800 CMP EAX, 0 > ; 5F: 7FF8 JNLE L0 > it's faster again.
> I'd appreciate any help on this matter: why it happens, how to cure > it.
The only explanation I can think of is that the prefetching logic is busy getting the next 16-byte block in, when the branch goes back into the same/previous block, and that this slows things up by a fraction of a cycle.
As long as there is at least 8 bytes left in the current block, that mechanism doesn't occur.
I.e. it is the tail of the block that's important, not the front alignment?
Terje
-- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
melis.gabor wrote: > I read that aligning loop entry on a 16 byte boundary is unimportant > on core duo. While I don't doubt the wisdom of that there is clearly > something at play that I don't understand. Consider this loop: > MOV EAX, 2147483644 > JMP L1 > L0: SUB EAX, 4 > L1: CMP EAX, 0 > JNLE L0
> I add a varying number of NOPS before L0 screwing with the alignment. > On an opteron this produces the expected result of being slow when the > loop crosses a 16 byte boundary. On a core duo it's producing strange > results: http://quotenil.com/tmp/coreduo.log (too long to include > here).
> With 9,10,11 or (25,26,27 and so on in increments of 16) NOPs, > execution time increases rather sharply by ~15%. The only thing that I > notice is that JNLE is getting close to a 16 byte boundary. Once it's > _partly_ over it, like here at 5F-D0: > ; 52: B8FCFFFF7F MOV EAX, 2147483644 > ; 57: EB03 JMP L1 > ; 59: L0: 83E804 SUB EAX, 4 > ; 5C: L1: 83F800 CMP EAX, 0 > ; 5F: 7FF8 JNLE L0 > it's faster again.
> I'd appreciate any help on this matter: why it happens, how to cure > it.
I experienced similar behaviour on AMDs, if two ore more branches are within one code fetch then I encountered a few cycles penalty, especially if the target is also within this fetch-cycle.
Several packs of 16 NOPs may also raise cache bound crossing issues.
A cure for it ? :) I'd avoid spaghetti and replace EB03 with 83C804.
But in case of: if it really matters that much then only a known fetch start point may do it, so either a serialising or flushing instruction ahead of the loop, or at least cache bound code align may speed up the loop. But perhaps not really a gain for the whole story, because some time may be wasted before the loop or it needs more cache line reads.
> melis.gabor wrote: > > I read that aligning loop entry on a 16 byte boundary is unimportant > > on core duo. While I don't doubt the wisdom of that there is clearly > > something at play that I don't understand. Consider this loop: > > MOV EAX, 2147483644 > > JMP L1 > > L0: SUB EAX, 4 > > L1: CMP EAX, 0 > > JNLE L0
> > I add a varying number of NOPS before L0 screwing with the alignment. > > On an opteron this produces the expected result of being slow when the > > loop crosses a 16 byte boundary. On a core duo it's producing strange > > results:http://quotenil.com/tmp/coreduo.log(too long to include > > here).
> > With 9,10,11 or (25,26,27 and so on in increments of 16) NOPs, > > execution time increases rather sharply by ~15%. The only thing that I > > notice is that JNLE is getting close to a 16 byte boundary. Once it's > > _partly_ over it, like here at 5F-D0: > > ; 52: B8FCFFFF7F MOV EAX, 2147483644 > > ; 57: EB03 JMP L1 > > ; 59: L0: 83E804 SUB EAX, 4 > > ; 5C: L1: 83F800 CMP EAX, 0 > > ; 5F: 7FF8 JNLE L0 > > it's faster again.
> > I'd appreciate any help on this matter: why it happens, how to cure > > it.
> I experienced similar behaviour on AMDs, if two ore more branches > are within one code fetch then I encountered a few cycles penalty, > especially if the target is also within this fetch-cycle.
> Several packs of 16 NOPs may also raise cache bound crossing issues.
> A cure for it ? :) I'd avoid spaghetti and replace EB03 with 83C804.
> But in case of: > if it really matters that much then only a known fetch start point > may do it, so either a serialising or flushing instruction ahead of > the loop, or at least cache bound code align may speed up the loop. > But perhaps not really a gain for the whole story, because some > time may be wasted before the loop or it needs more cache line reads.
> hth > __ > wolfgang
I tried adding a cpuid instruction just before L0 and it didn't help. Neither did inserting the NOPs just before L0 (as opposed to before MOV EAX ...).
> I tried adding a cpuid instruction just before L0 and it didn't help. > Neither did inserting the NOPs just before L0 (as opposed to before > MOV EAX ...).
Perhaps it's just a matter of location relative to cache bounds and code fetch start pont.
Have you tried to replace the 'JMP L1' with 'ADD eax,4' ?
> I read that aligning loop entry on a 16 byte boundary is unimportant > on core duo. While I don't doubt the wisdom of that there is clearly > something at play that I don't understand. Consider this loop: > MOV EAX, 2147483644 > JMP L1 > L0: SUB EAX, 4 > L1: CMP EAX, 0 > JNLE L0
> I add a varying number of NOPS before L0 screwing with the alignment. > On an opteron this produces the expected result of being slow when the > loop crosses a 16 byte boundary. On a core duo it's producing strange > results:http://quotenil.com/tmp/coreduo.log(too long to include > here).
> With 9,10,11 or (25,26,27 and so on in increments of 16) NOPs, > execution time increases rather sharply by ~15%. The only thing that I > notice is that JNLE is getting close to a 16 byte boundary. Once it's > _partly_ over it, like here at 5F-D0: > ; 52: B8FCFFFF7F MOV EAX, 2147483644 > ; 57: EB03 JMP L1 > ; 59: L0: 83E804 SUB EAX, 4 > ; 5C: L1: 83F800 CMP EAX, 0 > ; 5F: 7FF8 JNLE L0 > it's faster again.
> I'd appreciate any help on this matter: why it happens, how to cure > it.
I recently found something similar: that CPUs are sensitive not just to branch target addresses but also to the location of the branch instruction itself. For instance, on a Pentium M if I run a short sequence of 9 instructions which includes a not-taken branch back the native code runs in 66 cycles. If I add some extra nops before the branch instruction the time decreases to 51 cycles. In fact here's a summary:
The times in cycles are the complete times taken for the rest of the code (not shown) including the branch instruction. The only thing that changes is the address of the jump as shown in the first column. The second column, 75xx, is the opcode and operand in case the jump distance is a contributory factor. As the branch is two bytes long the first example sits in addresses 0x59 and 0x5a, for example. Like you, I found that when the branch splits over a 16-byte line it gets fast again as shown in the last example where the instruction sits in addresses 0x5f and 0x60.
The slightly bizarre thing is I cannot match the effect with any of the guidelines given in the optimisation manual.
To investigate this properly I'll need to set up a script to generate and run a number of variants of the code. That should give a bigger picture. Doing it by hand is not a good idea.
Could it be branch prediction? I don't know. Branch prediction is notably organic in that it depends not just on the history of one instruction but on the history of how one got to that instruction. The reason I think it's possible is that very occasionally the figures do change by about 15 cycles. But otherwise they stay consistently at the figures shown run after run.
>melis.gabor wrote: >I read that aligning loop entry on a 16 byte boundary is unimportant >on core duo. While I don't doubt the wisdom of that there is clearly >something at play that I don't understand. Consider this loop: > MOV EAX, 2147483644 > JMP L1 >L0: SUB EAX, 4 >L1: CMP EAX, 0 > JNLE L0
>I add a varying number of NOPS before L0 screwing with the alignment. >On an opteron this produces the expected result of being slow when the >loop crosses a 16 byte boundary. On a core duo it's producing strange >results: http://quotenil.com/tmp/coreduo.log (too long to include >here).
>With 9,10,11 or (25,26,27 and so on in increments of 16) NOPs, >execution time increases rather sharply by ~15%. The only thing that I >notice is that JNLE is getting close to a 16 byte boundary. Once it's >_partly_ over it, like here at 5F-D0: >; 52: B8FCFFFF7F MOV EAX, 2147483644 >; 57: EB03 JMP L1 >; 59: L0: 83E804 SUB EAX, 4 >; 5C: L1: 83F800 CMP EAX, 0 >; 5F: 7FF8 JNLE L0 >it's faster again.
>I'd appreciate any help on this matter: why it happens, how to cure >it.
>Thanks, >Gabor Melis
Jerome Fine replies:
I have also been experimenting with similar loops with both a Pentium III and a core 2 duo. I was not going to request additional help until more investigation. However, perhaps I might add a suggestion. Unrolling the loop on a core 2 duo:
MOV EAX, 2147483644 ADD EAX, 4 L0: SUB EAX, 4 CMP EAX, 0 JLE L1 SUB EAX, 4 CMP EAX, 0 JLE L1 SUB EAX, 4 CMP EAX, 0 JLE L1 SUB EAX, 4 CMP EAX, 0 JLE L1 SUB EAX, 4 CMP EAX, 0 JLE L1 SUB EAX, 4 CMP EAX, 0 JLE L1 SUB EAX, 4 CMP EAX, 0 JLE L1 SUB EAX, 4 CMP EAX, 0 JNLE L0 L1:
does not seem to cost anything at all for the CMP / JLE pairs which are not taken. On a 2.66 GHz E8200, the above code probably takes just under 0.8 seconds. The time to execute the following code on thAT core 2 duo seems to be just about the same:
MOV EAX, 2147483644 ADD EAX, 4 L0: SUB EAX, 4 SUB EAX, 4 SUB EAX, 4 SUB EAX, 4 SUB EAX, 4 SUB EAX, 4 SUB EAX, 4 SUB EAX, 4 CMP EAX, 0 JNLE L0 L1:
In either case, the time to execute each SUB instruction is about 0.4 nanoseconds on a 2.66 GHz core 2 duo (E8200) CPU.
The conclusion is that a core 2 duo CPU is able to execute ahead sufficiently to process the "not taken" conditional jump instructions which do not require out of sequence instructions except at the end of the loop.
With 750 MHz Pentium III, the first example takes about 5 seconds while the second example takes just under 3 seconds. Can anyone verify the above elapsed times for either of these CPUs?
By the way, I have 6 other loops which add one additional instruction for each SUB instruction, specifically a memory reference. In all cases, the inclusion of the CMP / JLE pairs does not seem to require additional time to execute the unrolled loops. While I did not pay any attention to alignment of the loop entry address, I would have expected that out of 8 loops, I would have noticed a systematic pattern of a time increase in some cases. Instead, running under Windows XP, there was the usual variation depending on the operating system activity. One thing I did seem to notice was that there was a small but measurable decrease in the execution time of the loop when I used task manager to restrict the process to a single CPU. The possible reason may be that WXP has less overhead when it interrupts the process to service other code and then returns to the same core each time which is using the CPU at 100% activity at that point - thereby also using the other core for all of the other functions which were at minimal activity during each test. In fact, is it possible that (since the other core was available to handle what for a single core cpu would have been interrupt activity) the core dedicated to the above code experienced little, if any, interrupts?
Can anyone comment on the above information? Is the above information useful?
By the way, for the code I am testing, each SUB instruction uses the LEA instruction instead. The JNLE instruction is a JB instruction and each JLE instruction is a JAE instruction. I initially started out using an ADD instruction, but found that the LEA instruction is perfect for the loop I need to use. I have found that the execution time for the LEA instruction which uses only registers is the same as the ADD instruction.
One final question. I would appreciate having another register. Is the EBP register generally available? I can easily PUSH and POP the EBP register if it can be available to use with the CMP instruction as a fixed value during the execution of each loop. However, I really don't understand the x86 instructions well enough to know if the EBP register is used under special situations that I don't understand.
>> However, perhaps I might add a suggestion. Unrolling the loop on a >> core 2 duo: >> MOV EAX, 2147483644 >> ADD EAX, 4 >> L0: SUB EAX, 4 >> CMP EAX, 0 >> JLE L1
>> CMP EAX, 0 >> JNLE L0 >> L1:
>> does not seem to cost anything at all for the CMP / JLE pairs which
> And it should not take any time! > The SUB operations are sequentally dependent, and the intervening > branch operations are (nearly) perfectly predictable. > As long as you have two regular execution pipes and a branch unit, all > three can operate in parallel. > What you end up with is the latency of the back-to-back SUB operations.
That confirms what the times are providing.
I had one other question. I would appreciate having another register. Is the EBP register generally available? I can easily PUSH and POP the EBP register (to save the value for when I return from the subroutine which includes the above code) if it can be made available for use with the CMP instruction as a fixed value during the execution of each loop. CMP EAX, EBP JB L0
As an x86 dummy, I really don't understand the x86 instructions well enough to know if the EBP register is used under special situations (such as the ESP register which certainly can NOT be used as a general purpose register). From what I have seen of various code examples, EBP is available, but I thought it best to ask.
Jerome H. Fine wrote: > As an x86 dummy, I really don't understand the x86 instructions well > enough to know if the EBP register is used under special situations > (such as the ESP register which certainly can NOT be used as a > general purpose register). From what I have seen of various code > examples, EBP is available, but I thought it best to ask.
You can indeed use EBP to your hearth's content, but don't depend on a high-level debugger to help you figuring out stack layouts etc.
BTW, you can also use ESP under some specific conditions, i.e. make absolutely sure that you don't overwrite anything on the stack, and that you recover the original ESP value before you leave that piece of code.
Terje
-- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
> >Terje Mathisen wrote: > > >Jerome H. Fine wrote:
> >> However, perhaps I might add a suggestion. Unrolling the loop on a > >> core 2 duo: > >> MOV EAX, 2147483644 > >> ADD EAX, 4 > >> L0: SUB EAX, 4 > >> CMP EAX, 0 > >> JLE L1
> >> CMP EAX, 0 > >> JNLE L0 > >> L1:
> >> does not seem to cost anything at all for the CMP / JLE pairs which
> > And it should not take any time! > > The SUB operations are sequentally dependent, and the intervening > > branch operations are (nearly) perfectly predictable. > > As long as you have two regular execution pipes and a branch unit, all > > three can operate in parallel. > > What you end up with is the latency of the back-to-back SUB operations.
> That confirms what the times are providing.
> I had one other question. I would appreciate having another register. > Is the EBP register generally available? I can easily PUSH and POP > the EBP register (to save the value for when I return from the subroutine > which includes the above code) if it can be made available for use with > the CMP instruction as a fixed value during the execution of each loop. > CMP EAX, EBP > JB L0
> As an x86 dummy, I really don't understand the x86 instructions well > enough to know if the EBP register is used under special situations > (such as the ESP register which certainly can NOT be used as a > general purpose register). From what I have seen of various code > examples, EBP is available, but I thought it best to ask.
Sure. You can use it if you use it right. Here's a note on how I understand the use of EBP. If it's not right hopefully someone will correct me.
High level languages sometimes (often) use EBP as a frame pointer. That's useful if your code changes the stack pointer as it runs - e.g. by pushing and popping registers temporarily - but otherwise you can address the stack frame off ESP.
If it's not being used as a frame pointer EBP is readily available to store values as in
mov ebp, eax add ebp, ecx mov [val], ebp
but if you want to use it as an address
mov eax, [ebp + 4]
be aware that it uses the stack segment register for indexing so rather than being resolved as
mov eax, [ds:ebp + 4]
the instruction is effectively
mov eax, [ss:ebp + 4]
If you run with SS = DS or use a DS: override there's no problem with using EBP to hold an address.
Am Fri, 06 Mar 2009 06:34:05 -0500 schrieb Jerome H. Fine:
> As an x86 dummy, I really don't understand the x86 instructions well > enough to know if the EBP register is used under special situations > (such as the ESP register which certainly can NOT be used as a > general purpose register). > From what I have seen of various code > examples, EBP is available, but I thought it best to ask.
No problem to use ESP or EBP as a general purpose register for a short while in a closed routine, when the pointer to the stack will be first stored in a fixed ramlocation before ...and at last will be reloaded before any code use the stack again. But the default segment for ESP/EBP is the stacksegemt (SS not DS) ich you want to use ESP/EBP as an adressregister. For an access to DS:ESP or DS:EBP or in combination with other segmentregister like ES, FS, GS use a segment-overide-prefix.
Dirk Wolfgang Glomp wrote: > Am Fri, 06 Mar 2009 06:34:05 -0500 schrieb Jerome H. Fine: >> As an x86 dummy, I really don't understand the x86 instructions well >> enough to know if the EBP register is used under special situations >> (such as the ESP register which certainly can NOT be used as a >> general purpose register).
>> From what I have seen of various code >> examples, EBP is available, but I thought it best to ask.
> No problem to use ESP or EBP as a general purpose register for a short > while in a closed routine, when the pointer to the stack will be first > stored in a fixed ramlocation before ...and at last will be reloaded before > any code use the stack again. But the default segment for ESP/EBP is the > stacksegemt (SS not DS) ich you want to use ESP/EBP as an adressregister. > For an access to DS:ESP or DS:EBP or in combination with other > segmentregister like ES, FS, GS use a segment-overide-prefix.
Careful. There are a *lot* of events which use the stack implicitly.
> Dirk Wolfgang Glomp wrote: >> Am Fri, 06 Mar 2009 06:34:05 -0500 schrieb Jerome H. Fine: >>> As an x86 dummy, I really don't understand the x86 instructions well >>> enough to know if the EBP register is used under special situations >>> (such as the ESP register which certainly can NOT be used as a >>> general purpose register).
>>> From what I have seen of various code >>> examples, EBP is available, but I thought it best to ask.
>> No problem to use ESP or EBP as a general purpose register for a short >> while in a closed routine, when the pointer to the stack will be first >> stored in a fixed ramlocation before ...and at last will be reloaded before >> any code use the stack again. But the default segment for ESP/EBP is the >> stacksegemt (SS not DS) ich you want to use ESP/EBP as an adressregister. >> For an access to DS:ESP or DS:EBP or in combination with other >> segmentregister like ES, FS, GS use a segment-overide-prefix.
> Careful. There are a *lot* of events which use the stack implicitly.
But it´s possible. I didn´t know very well how to compute such things like a task-switch of a multitasking-platform, but i remember that the contents on this way will be stored.