Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code

Jean-Philippe BEMPEL

Jun 19, 2018, 2:43:19 AM

to mechanical-sympathy

Here is a blog post you should find interesting, about the PAUSE instruction on Skylake:

https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/

Regards

Avi Kivity

Jun 19, 2018, 9:43:53 AM

to mechanica...@googlegroups.com, Jean-Philippe BEMPEL

I don't think Intel is at fault here. PAUSE is not a "delay for 4 cycles" [*] instruction, and if you use it in a limited spin loop you should not rely on its timing. It's easy to say it after the event, of course.

The increase in delay time makes sense as it allows the other hyperthread to make more progress while the first is spinning.

[*] It probably wasn't delaying for 4 cycles, but waiting until all pending loads were resolved. That happens to be 3-4 cycles when loading from L1. Waiting this way prevents the CPU from unrolling spin loops and executing multiple instances of them in parallel, which would slow down the other tenant.

Martin Thompson

Jun 19, 2018, 11:40:27 AM

to mechanical-sympathy

> [*] it probably wasn't delaying for 4 cycles, but waiting until all pending loads were resolved. That happens to be 3-4 cycles when loading from L1. That prevents the CPU from unrolling spin loops and executing multiple instances of them in parallel, slowing down the other tenant.

I don't think it is so simple. The latency to L1 is 3-4 cycles but the load buffer is ~60 entries deep depending on processor model so they could not all be resolved in 4 cycles.

PAUSE is a predefined delay according to the manuals that has been 2 - 40 cycles (depending on processor) plus a hint to avoid memory order violation from speculative execution. Now that has taken a jump to 140 cycles with Skylake X. WRPAUSE on Sparc is nice in that you provide the number of cycles to delay for.
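As a rough illustration of why this jump matters for code that spins a fixed number of times, the sketch below estimates how long such a loop runs when PAUSE dominates each iteration. The iteration count and clock frequency are hypothetical values chosen for the arithmetic, and 10 cycles is simply one point inside the 2 - 40 range quoted above; only the 140-cycle figure comes from the message.

```python
def spin_time_us(iterations: int, pause_cycles: int, clock_ghz: float) -> float:
    """Approximate duration, in microseconds, of a spin loop whose cost per
    iteration is dominated by the PAUSE latency."""
    cycles = iterations * pause_cycles
    # clock_ghz is cycles per nanosecond; 1000 ns make one microsecond.
    return cycles / (clock_ghz * 1_000)


ITERATIONS = 10_000   # hypothetical fixed spin count baked into the code
CLOCK_GHZ = 3.0       # hypothetical core clock

older = spin_time_us(ITERATIONS, 10, CLOCK_GHZ)       # assumed pre-Skylake latency
skylake_x = spin_time_us(ITERATIONS, 140, CLOCK_GHZ)  # latency quoted for Skylake X

print(f"older core: {older:.1f} us, Skylake X: {skylake_x:.1f} us")
print(f"same loop takes {skylake_x / older:.0f}x longer")
```

The ratio depends only on the two PAUSE latencies, so any loop tuned by iteration count on the older part stretches by the same factor regardless of the count chosen.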

Avi Kivity

Jun 19, 2018, 12:44:37 PM

to mechanica...@googlegroups.com, Martin Thompson

On 2018-06-19 18:40, Martin Thompson wrote:

>> [*] it probably wasn't delaying for 4 cycles, but waiting until all pending loads were resolved. That happens to be 3-4 cycles when loading from L1. That prevents the CPU from unrolling spin loops and executing multiple instances of them in parallel, slowing down the other tenant.

> I don't think it is so simple. The latency to L1 is 3-4 cycles but the load buffer is ~60 entries deep depending on processor model so they could not all be resolved in 4 cycles.

While they can't be resolved, the processor can stuff the load buffer with those 60 loads and keep stuffing one more per cycle, harming the other thread.

> PAUSE is a predefined delay according to the manuals

It is now; it used to be defined differently. I'll see if I have an old version somewhere.

> that has been 2 - 40 cycles (depending on processor) plus a hint to avoid memory order violation from speculative execution. Now that has taken a jump to 140 cycles with Skylake X. WRPAUSE on Sparc is nice in that you provide the number of cycles to delay for.

The length of the delay certainly wasn't specified, and it's a mistake to rely on it for timing (though again I understand how it happened, it's quite natural).

Hypervisors can set an exit after a number of PAUSE instructions, again screwing up any timing. It's a bad idea though, and Linux ended up having paravirtualized spinlocks.

Tony Finch

Jun 20, 2018, 5:21:01 AM

to mechanica...@googlegroups.com, Jean-Philippe BEMPEL

Avi Kivity <a...@scylladb.com> wrote:

> I don't think Intel is at fault here. PAUSE is not a "delay for 4 cycles" [*]
> instruction, and if you use it in a limited spin loop you should not rely on
> its timing. It's easy to say it after the event, of course.

What boggled me about that article was the spin loop being tuned to run
for many milliseconds rather than suspending the thread after about a
microsecond.
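
A minimal sketch of the spin-then-suspend pattern suggested above, written in Python purely for illustration. The function name `wait_for`, the 1000 ns default budget, and the timer-based producer are all hypothetical; Python has no PAUSE equivalent, so the spin phase here only polls. The point is that the spin is bounded by a clock rather than by an iteration count, so its length does not change when the cost of one iteration does.

```python
import threading
import time
from typing import Optional


def wait_for(flag: threading.Event, spin_budget_ns: int = 1_000,
             timeout_s: Optional[float] = None) -> bool:
    """Spin on `flag` for at most `spin_budget_ns`, then block until it is set.

    Returns True if the flag was observed set, False if `timeout_s` expired.
    """
    # Bound the spin phase with a monotonic clock, not a loop counter.
    deadline = time.monotonic_ns() + spin_budget_ns
    while time.monotonic_ns() < deadline:
        if flag.is_set():
            return True
        # Native code would issue PAUSE here; Python has no equivalent.
    # Budget exhausted: suspend the thread instead of burning CPU.
    return flag.wait(timeout_s)


ready = threading.Event()
threading.Timer(0.01, ready.set).start()  # hypothetical producer, about 10 ms away
print(wait_for(ready, timeout_s=1.0))
```

In a real implementation the budget would be tuned to the expected hand-off latency, and a budget of about a microsecond is only meaningful in native code where the clock read itself is cheap.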

Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/
Trafalgar: In southeast, cyclonic 6 or 7 decreasing 4 or 5 later, otherwise
northeasterly, backing northerly later, 4 or 5, occasionally 6 in northwest.
Moderate. Rain or thundery showers. Good occasionally poor.
