I don't think Intel is at fault here. PAUSE is not a "delay for 4
cycles" [*] instruction, and if you use it in a limited spin loop
you should not rely on its timing. It's easy to say it after the
event, of course.
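
To make the point about limited spin loops concrete, here is a minimal sketch in Rust that bounds the spin by elapsed time rather than by a count of PAUSE instructions. The function name `spin_until_set` and the 50-microsecond budget are hypothetical, chosen only for illustration. `std::hint::spin_loop()` emits PAUSE on x86, but the loop makes no assumption about how long each one takes.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::{Duration, Instant};

/// Spins until `flag` is observed set or `budget` has elapsed.
/// Returns true if the flag was observed set, false on timeout.
/// (Hypothetical helper, not taken from any library.)
fn spin_until_set(flag: &AtomicBool, budget: Duration) -> bool {
    let deadline = Instant::now() + budget;
    loop {
        if flag.load(Ordering::Acquire) {
            return true;
        }
        // The bound is wall-clock time, so a slower or faster PAUSE
        // changes how many iterations run, not how long we wait.
        if Instant::now() >= deadline {
            return false;
        }
        spin_loop(); // PAUSE on x86; its latency is not guaranteed
    }
}

fn main() {
    // Nobody sets the flag here, so this call times out.
    let flag = AtomicBool::new(false);
    let observed = spin_until_set(&flag, Duration::from_micros(50));
    println!("flag observed: {}", observed);
}
```

Reading the clock on every iteration has its own cost. A real implementation might check it only every N iterations, accepting a slightly looser bound in exchange.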
[*] It probably wasn't delaying for 4 cycles, but waiting until all pending loads were resolved, which happens to take 3-4 cycles when loading from L1. That prevents the CPU from unrolling the spin loop and executing multiple iterations of it in parallel, which would slow down the other tenant.
I don't think it is so simple. The latency to L1 is 3-4 cycles, but the load buffer is ~60 entries deep (depending on processor model), so the pending loads could not all be resolved in 4 cycles.
According to the manuals, PAUSE is a predefined delay that has been 2-40 cycles (depending on processor), plus a hint to avoid memory order violations from speculative execution. That delay has now jumped to 140 cycles with Skylake X. WRPAUSE on SPARC is nice in that you provide the number of cycles to delay for.
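
Since the delay differs this much between processors, one way to see what a given machine does is to measure it. The following is a rough sketch, not a rigorous benchmark. The helper name `avg_spin_hint_nanos` and the iteration counts are hypothetical, and it reports wall-clock nanoseconds rather than cycles, so the result has to be multiplied by the core clock in GHz to compare against the cycle figures above. On targets where `std::hint::spin_loop()` compiles to nothing, the number is meaningless.

```rust
use std::hint::spin_loop;
use std::time::Instant;

/// Average wall-clock nanoseconds per spin hint over `iterations` calls.
/// (Hypothetical helper; the figure includes loop overhead.)
fn avg_spin_hint_nanos(iterations: u32) -> f64 {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        spin_loop(); // PAUSE on x86
    }
    start.elapsed().as_nanos() as f64 / f64::from(iterations)
}

fn main() {
    // Short warm-up run so the timed loop starts from a steadier state.
    let _ = avg_spin_hint_nanos(100_000);
    let ns = avg_spin_hint_nanos(10_000_000);
    println!("~{:.2} ns per spin hint", ns);
}
```

Frequency scaling and activity on the sibling hyperthread will both skew the result, so it is only good for telling a roughly 10-cycle PAUSE from a roughly 140-cycle one.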