
Intel Skylake/Kaby Lake processors: broken hyper-threading


William Edwards

Jun 25, 2017, 11:15:48 AM
Saw this on the Debian lists:

https://lists.debian.org/debian-devel/2017/06/msg00308.html

Thought it might generally interest the crowd.

Anton Ertl

Jun 25, 2017, 1:01:10 PM
Interesting, but I find the recommendations excessive. Even if you
want to disable Hyperthreading on the system that has worked ok up to
now, you don't need to do it in the BIOS right away. You can do it in
Linux on the working machine (e.g., see
<https://serverfault.com/questions/235825/disable-hyperthreading-from-within-linux-no-access-to-bios>)
and then disable it in the BIOS when the next reboot occurs anyway.
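For concreteness, a minimal sketch of doing that from a running system (assuming a Linux sysfs layout, and using cpu1 purely as an example; the actual sibling numbering is machine-specific and listed in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list):

/* Minimal sketch, not a tested tool: take one logical CPU offline on a
 * running Linux system by writing "0" to its sysfs "online" file (needs
 * root).  Which CPU numbers are hyperthread siblings varies per machine;
 * check topology/thread_siblings_list first.  cpu1 is just an example. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/cpu/cpu1/online";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fputs("0\n", f);   /* "0" offlines the logical CPU, "1" brings it back */
    fclose(f);
    return 0;
}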

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Noob

Jun 25, 2017, 2:38:08 PM
Errata: SKZ7/SKW144/SKL150/SKX150/SKZ7/KBL095/KBW095
Short Loops Which Use AH/BH/CH/DH Registers May Cause
Unpredictable System Behavior.

Problem:     Under complex micro-architectural conditions, short loops
             of less than 64 instructions that use AH, BH, CH or DH
             registers as well as their corresponding wider register
             (e.g. RAX, EAX or AX for AH) may cause unpredictable
             system behavior. This can only happen when both logical
             processors on the same physical processor are active.

Implication: Due to this erratum, the system may experience
             unpredictable system behavior.

Quadibloc

Jun 25, 2017, 11:02:02 PM
There's an article on this here

http://www.theregister.co.uk/2017/06/25/intel_skylake_kaby_lake_microcode_bug/

as well.

John Savard

Quadibloc

Jun 25, 2017, 11:05:11 PM
On Sunday, June 25, 2017 at 11:01:10 AM UTC-6, Anton Ertl wrote:

> Interesting, but I find the recommendations excessive. Even if you
> want to disable Hyperthreading on the system that has worked ok up to
> now, you don't need to do it in the BIOS right away.

It is possible that many programs won't perform the types of operation which
trigger the condition. But if one is present on your system, it may behave
erratically at a random time, so immediately shutting down affected systems may
not be excessive, depending on the consequences of a program running wild.

Apparently, shutting down hyperthreading won't cut the performance of a system
in half, so that's good news.

John Savard

Anton Ertl

Jun 26, 2017, 2:22:49 AM
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Sunday, June 25, 2017 at 11:01:10 AM UTC-6, Anton Ertl wrote:
>
>> Interesting, but I find the recommendations excessive. Even if you
>> want to disable Hyperthreading on the system that has worked ok up to
>> now, you don't need to do it in the BIOS right away.
>
>It is possible that many programs won't perform the types of operation which
>trigger the condition. But if one is present on your system, it may behave
>erratically at a random time, so immediately shutting down affected systems may
>not be excessive, depending on the consequences of a program running wild.

As I wrote, you can disable hyperthreading without shutting down the
system (at least in Linux, and that's what the recommendation is for)
just as quickly, so it certainly is excessive.

If that was not possible, the sysadmin would have to balance the
consequences of shutting down the system against the probability and
consequences of "unpredictable system behaviour". If the system has
not shown any unpredictable system behaviour yet, and the set of
software running on the system does not change, the probability of
"unpredictable system behaviour" is small.

>Apparently, shutting down hyperthreading won't cut the performance of a system
>in half, so that's good news.

Yes, Hyperthreading has little performance benefit, but that is not news,
and when it was news, it was not good news.

Terje Mathisen

Jun 26, 2017, 3:30:10 AM
I have big batch pipelines where hyperthreading does indeed nearly
double my speed. :-(

OTOH, I suspect that I'm one of the few remaining programmers who have
written significant amounts of code that could trigger this particular
bug, i.e. it looks a lot like mishandling partial register updates.

I used code like this to very good effect back in the 486 and Pentium days:

REPT 128                                  ; unroll the pair of lookups 128 times
  mov bl,es:[di]                          ; fetch a byte into BL
  mov di,[si]                             ; load the next pointer
  add si,2
  add ax,increment_table[bx]              ; index with the combined BX register

  mov bh,es:[di]                          ; next byte goes into BH instead
  mov di,[si]
  add si,2
  add ax,transposed_increment_table[bx]   ; BX now combines the new BH with BL from above
ENDM

I.e. the key operation would alternate loading values in BL or BH, then
use the combined BX register as a table index.

As soon as the PentiumPro turned up I had to get rid of this style of
coding since the Partial Register Stalls absolutely killed my
performance. (I.e. on a Pentium it ran at 40 MB/s on a 60 MHz cpu, while
a 200 MHz PPro was significantly slower despite the 3X+ clock speed
upgrade.)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Bruce Hoult

Jun 26, 2017, 6:21:19 AM
HT gives very close to 2x when the workload is gcc/llvm, as mine is. Turning it off will mean I paid i7 prices for an i5. How will Intel compensate me?

already...@yahoo.com

Jun 26, 2017, 7:14:11 AM
I suppose the problem exists mostly on Debian and similar "ideological" distributions. Less ideological environments will simply use Intel-supplied binaries for the microcode update.

Bruce Hoult

Jun 26, 2017, 8:20:22 AM
Ahh .. I didn't read the whole thing before. So there *is* a fix for the microcode, but some people might be too paranoid to use it. I just checked and my 6700K returns 0x506e3 from "iucode_tool -S", so it has the problem, and also has a fix.
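For anyone who wants to check without installing iucode_tool, a rough sketch: the signature it prints is just the processor version information in EAX of CPUID leaf 1 (this assumes gcc or clang on x86 and the <cpuid.h> helper):

/* Rough sketch: print the CPUID leaf-1 processor signature (e.g. 0x506e3
 * for a Skylake-S 6700K), the same value iucode_tool -S reports.
 * Assumes gcc/clang on an x86 machine. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                          /* leaf 1 not supported */
    printf("processor signature: 0x%x\n", eax);
    return 0;
}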

Which also means none of the Ubuntu 16.04LTS updates have automagically updated the microcode. Googling didn't immediately find any Ubuntu-related information about it.

Casper H.S. Dik

Jun 26, 2017, 9:14:46 AM
Intel distributes the microcode for most of its processors in one large file;
some older CPUs have been removed from the newest version.

Last time I checked it missed the Avoton/Rangely CPUs, but they distribute those
in a different download. The Avoton/Rangely microcode is MUCH bigger than most
others (84K vs 30K for the next-largest CPU, and the rest are often smaller).

Casper

EricP

Jun 26, 2017, 9:26:18 AM
Terje Mathisen wrote:
>
> OTOH, I suspect that I'm one of the few remaining programmers who have
> written significant amounts of code that could trigger this particular
> bug, i.e. it looks a lot like mishandling partial register updates.

That is what I was thinking, that maybe they failed
to stall after a partial register update.

Which got me wondering: how could a microcode update fix this?
It sounds like a hardware error, and since most such
instructions would not have gone through the microcode,
I can't see how a microcode patch could fix it.
Since there are innumerable instructions that could write *H registers,
trapping each instruction (if that is possible) and doing something
like a pipeline drain wouldn't be acceptable.

I looked for more info on the microcode patch mechanism
but couldn't find anything very relevant
(just some patents intentionally trying to be obtuse).

Eric

Terje Mathisen

Jun 26, 2017, 10:18:29 AM
The fact that the bug could only happen when using the high half/byte
regs (AH,BH,CH,DH) is probably significant, since merging these into the
full reg will require a shift as well as the usual mask/and/or.
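As a purely conceptual sketch of that merge (not Intel's actual logic), writing back a new AH needs one more step than writing back a new AL:

/* Conceptual sketch of a partial-register merge; not Intel's actual RTL.
 * Merging a new AL only needs mask-and-or; merging a new AH also needs the
 * byte shifted up into bit positions 8..15 first. */
#include <stdint.h>

static uint64_t merge_al(uint64_t rax, uint8_t al)
{
    return (rax & ~0xffULL) | al;                    /* mask + or */
}

static uint64_t merge_ah(uint64_t rax, uint8_t ah)
{
    return (rax & ~0xff00ULL) | ((uint64_t)ah << 8); /* shift + mask + or */
}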

Bruce Hoult

Jun 26, 2017, 10:53:30 AM
Would I be correct in thinking that gcc/llvm would *never* do this?

(without inline asm)

Casper H.S. Dik

Jun 26, 2017, 11:14:19 AM
Terje Mathisen <terje.m...@tmsw.no> writes:

>The fact that the bug could only happen when using the high half/byte
>regs (AH,BH,CH,DH) is probably significant, since merging these into the
>full reg will require a shift as well as the usual mask/and/or.


And perhaps already handled in the microcode.

Casper

already...@yahoo.com

Jun 26, 2017, 11:26:46 AM
No, it's incorrect.
It took me just a few minutes to force gcc 6.3 into producing such code.
https://godbolt.org/g/bwLPvd
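In case the godbolt link goes stale: the kind of source that tends to do it is simply extracting the second-lowest byte of a wider value; whether a given gcc version and flag combination really emits %ah here is of course not guaranteed.

/* Hedged sketch: this shift-and-mask is the classic pattern that makes gcc
 * reach for %ah on x86 (e.g. "movzbl %ah, %eax"), i.e. an AH read combined
 * with use of the full EAX/RAX register. */
#include <stdint.h>

unsigned int second_byte(uint32_t x)
{
    return (x >> 8) & 0xff;
}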

already...@yahoo.com

Jun 26, 2017, 11:42:44 AM
In fact, all gcc versions available on godbolt that I tried generate similar code for any architecture and any optimization level except -O0. Even for -mtune=pentiumpro, despite the sequence being known to be horribly slow on the PPro.
It applies both to -m32 and -m64.

So far I haven't seen anything like that from clang, but I didn't try too hard.

Anton Ertl

Jun 26, 2017, 12:15:22 PM
EricP <ThatWould...@thevillage.com> writes:
>Which got me wondering how a microcode update could fix this?

Wild guess: They implemented a new optimization for partial register
stalls, and because there is always the possibility of such
optimizations having bugs, they provided a way to disable it. The
"microcode update" does exactly that, although it probably does not
change the microcode of microcoded instructiuons.

paul wallich

Jun 26, 2017, 1:05:00 PM
On 6/26/17 1:55 AM, Anton Ertl wrote:


> If that was not possible, the sysadmin would have to balance the
> consequences of shutting down the system against the probability and
> consequences of "unpredictable system behaviour". If the system has
> not shown any unpredictable system behaviour yet, and the set of
> software running on the system does not change, the probability of
> "unpredictable system behaviour" is small.

I'm not sure that's a helpful assessment unless the visible form of the
unpredictable behavior is highly predictable.

paul

Niels Jørgen Kruse

Jun 26, 2017, 1:11:45 PM
EricP <ThatWould...@thevillage.com> wrote:

> Terje Mathisen wrote:
> >
> > OTOH, I suspect that I'm one of the few remaining programmers who have
> > written significant amounts of code that could trigger this particular
> > bug, i.e. it looks a lot like mishandling partial register updates.
>
> That is what I was thinking, that maybe they failed
> to stall after a partial register update.

That would only be a problem within a single thread, not between
hyper-threads that each have their own registers.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Terje Mathisen

Jun 26, 2017, 4:16:56 PM
Almost certainly so, unless it happens as a side effect of a
merge/combine optimization:

When you have two byte variables, the compiler could figure out that
these are adjacent in memory/on the stack and then use a 16 or 32-bit
store after having done individual updates on the byte regs.

I have never seen any compiler do so however since the time of JPI
Pascal/C/Modula-2.
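The pattern being described would be something like the sketch below; whether any current compiler actually turns the two byte assignments into one 16-bit store assembled from AL/AH is exactly the open question.

/* Sketch of the pattern described above: two adjacent byte fields updated
 * individually.  A compiler *could* combine them into a single 16-bit store
 * (possibly built from AL and AH); nothing here guarantees that it will. */
#include <stdint.h>

struct pair { uint8_t lo; uint8_t hi; };

void set_pair(struct pair *p, uint8_t a, uint8_t b)
{
    p->lo = a;
    p->hi = b;
}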

Anders....@kapsi.spam.stop.fi.invalid

Jun 26, 2017, 4:23:22 PM
Bruce Hoult <bruce...@gmail.com> wrote:
> Which also means none of the Ubuntu 16.04LTS updates have
> automagically updated the microcode. Googling didn't immediately find
> any Ubuntu-related information about it.

The intel-microcode package is found in the restricted repository, but
at the time of writing the Xenial version hasn't been updated since
November 2015.

<https://launchpad.net/ubuntu/+source/intel-microcode>
<https://packages.ubuntu.com/xenial/intel-microcode>

-a

EricP

Jun 26, 2017, 5:07:53 PM
Hmmm... ok.

The erratum refers to the problem showing up on short loops
of less than 64 instructions that use AH, BH, CH or DH.

Looking at the Skylake microarch, the instruction decode queue
is 128 uOps for a single thread, 2*64 uOps when both threads are active.
The Loop Stream Detector "can stream the same sequence of µOPs directly
from the IDQ continuously without any additional fetching, decoding,
or utilizing additional caches or resources." ...
"capable of detecting loops up to 64 µOPs per thread".

https://en.wikichip.org/wiki/intel/microarchitectures/skylake#.C2.B5OP-Fusion_.26_LSD

So maybe the microcode update just shuts off the loop stream detector.

Eric




EricP

Jun 26, 2017, 5:07:55 PM
Terje Mathisen wrote:
>
> The fact that the bug could only happen when using the high half/byte
> regs (AH,BH,CH,DH) is probably significant, since merging these into the
> full reg will require a shift as well as the usual mask/and/or.
>
> Terje

They could track the register pieces individually,
so an operation on AH can proceed in parallel with one on AL.
But it would be complicated.

Another way to do a partial register write is to read the whole register
as a source, e.g. RAX for AH, and pass it as an input operand to the
functional unit, which then updates just the appropriate byte
and writes a new whole RAX. It also has to get the flags correct.

That way they don't have to fiddle with the rename mechanism
or data paths, just pass some extra control flags to the FU
and slap some muxes on the output to do the merge.

Eric


Anton Ertl

Jun 27, 2017, 2:58:07 AM
Terje Mathisen <terje.m...@tmsw.no> writes:
>When you have two byte variables, the compiler could figure out that
>these are adjacent in memory/on the stack and then use a 16 or 32-bit
>store after having done individual updates on the byte regs.
>
>I have never seen any compiler do so however since the time of JPI
>Pascal/C/Modula-2.

I have seen two stores to adjacent bytes become a 16-bit store in gcc.
I don't remember what the source of the 16-bit value was and if it was
synthesized out of xH/xL, though.

William Edwards

Jun 27, 2017, 3:00:17 AM
On Monday, 26 June 2017 23:07:53 UTC+2, EricP wrote:
> Niels Jørgen Kruse wrote:
So in this thread there have been some programmers pointing at workloads that get excellent parallelism from hyperthreading.

It would be really interesting to see whether those who have access to affected hardware can work out any before-and-after performance stats to show how whatever the microcode update does affects things.

Anton Ertl

Jun 27, 2017, 3:04:16 AM
Do you mean that the unpredictable behaviour could go unnoticed much
of the time like the Pentium FDIV bug? That seems unlikely in a bug
involving the xH/xL registers. But if any of us is eager enough to
unearth the original bug report of the Ocaml developers, we could gain
more insight into the nature of the "unpredictable behaviour". My
guess is that the value of xL or xH from the other thread becomes
visible in this thread, and/or that some value from the previous
contents of the physical register appears as xL/xH value.

already...@yahoo.com

Jun 27, 2017, 3:48:29 AM
On Monday, June 26, 2017 at 11:16:56 PM UTC+3, Terje Mathisen wrote:
> Bruce Hoult wrote:
> > On Monday, June 26, 2017 at 5:18:29 PM UTC+3, Terje Mathisen wrote:
> >> EricP wrote:
> >>> I looked for more info on the microcode patch mechanism but
> >>> couldn't find anything very relevant (just some patents
> >>> intentionally trying to be obtuse).
> >>
> >> The fact that the bug could only happen when using the high
> >> half/byte regs (AH,BH,CH,DH) is probably significant, since merging
> >> these into the full reg will require a shift as well as the usual
> >> mask/and/or.
> >
> > Would I be correct in thinking that gcc/llvm would *never* do this?
> >
> > (without inline asm)
> >
> Almost certainly so, unless it happens as a side effect of a
> merge/combine optimzation:
>
> When you have two byte variables, the compiler could figure out that
> these are adjacent in memory/on the stack and then use a 16 or 32-bit
> store after having done individual updates on the byte regs.

You found a long way to say 'No'.

>
> I have never seen any compiler do so however since the time of JPI
> Pascal/C/Modula-2.
>

See my post above.

Noob

Jun 27, 2017, 6:53:11 AM
On 27/06/2017 08:58, Anton Ertl wrote:

> Do you mean that the unpredictable behaviour could go unnoticed much
> of the time like the Pentium FDIV bug? That seems unlikely in a bug
> involving the xH/xL registers. But if any of us is eager enough to
> unearth the original bug report of the Ocaml developers, we could gain
> more insight into the nature of the "unpredictable behaviour". My
> guess is that the value of xL or xH from the other thread becomes
> visible in this thread, and/or that some value from the previous
> contents of the physical register appears as xL/xH value.

Segfaults or wrong code execution on Intel Skylake / Kaby Lake CPUs with hyperthreading enabled

https://caml.inria.fr/mantis/view.php?id=7452

NB: xleroy is https://en.wikipedia.org/wiki/Xavier_Leroy

Regards.

Anton Ertl

Jun 27, 2017, 8:36:03 AM
Thanks. It does not tell me enough about what went wrong in the buggy
cases, but at least it answers another question that came up in this
thread:

|The use of %ah is probably quite unusual, but GCC is generating it to
|deal with the GC tag bits inside a header word.

So yes, you can get this bug when just running gcc-generated code.

It also says:

|As an example we tried to deploy some tool on a large xeon skylake
|cluster, several hundred processes. They didn't crash in hours, but
|very quickly we saw corrupted data being sent over the network/written
|into the database.

(Note that the bug shows up in the Ocaml garbage collector, so it
probably affects all data processed with Ocaml. Other environments
are not immune to it, either, but I expect that the probability
of it occurring is low in general.)

EricP

Jun 27, 2017, 2:35:59 PM
According to the Intel manual, there are 3 performance counters
for the Loop Stream Detector (LSD):

LSD.UOPS            Number of uops delivered by the LSD.
LSD.CYCLES_ACTIVE   Cycles with at least one uop delivered
                    by the LSD and none from the decoder.
LSD.CYCLES_4_UOPS   Cycles with 4 uops delivered by the LSD
                    and none from the decoder.

The LSD is part of the Decoded Instruction Queue,
which comes before Rename and then the Reorder Buffer.
It seems to work by detecting a loop and replaying the decoded
instructions rather than tossing them after passing to the RAT&ROB.
That loop detection and replay has to be per-thread.

Speculation: I can imagine a scenario where the LSD fails
to properly preserve the thread ID, so maybe thread 1's *H
references get mistakenly accessed as thread 0's *H registers.
So thread 1 winds up reading thread 0's registers.
By disabling hyperthreading, there is only thread 0
so the problem disappears. Or disable LSD.

Eric

Anton Ertl

Jul 14, 2017, 8:55:49 AM
Bruce Hoult <bruce...@gmail.com> writes:
>HT gives very close to 2x when the workload is gcc/llvm, as mine is.

I found this very surprising when I read this, but did not find the
time to check this myself. Today, I see
<http://www.anandtech.com/print/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade>,
and there it says:

              Xeon E5-2699 v4 @ 3.6   EPYC 7601 @ 3.2   Xeon 8176 @ 3.8
403.gcc               137%                 119%              131%

The percentages are the speedup of using two threads on the same core
over using one thread on one core. So we see far less than 2x
throughput from hyperthreading when running gcc, on any of these CPUs.

already...@yahoo.com

Jul 14, 2017, 9:31:14 AM
The 403.gcc benchmark is shaped to be maximally CPU-bound. Real-world compilations have a non-trivial I/O component, which likely increases the advantage of SMT.
Besides, 403.gcc is a very old compiler (3.2*), not necessarily representative of newer gcc versions or of C++ compilation, which is the biggest pain in practice.

-----
* - Yes, I remember that in your opinion everything newer than 2.9 is a waste of time, but I also remember that Bruce has a different opinion.

Bruce Hoult

Jul 14, 2017, 2:13:31 PM
On Friday, July 14, 2017 at 3:55:49 PM UTC+3, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >HT gives very close to 2x when the workload is gcc/llvm, as mine is.
>
> I found this very surprising when I read this, but did not find the
> time to check this myself. Today, I see
> <http://www.anandtech.com/print/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade>,
> and there it says:
>
> Xeon E5-2699 v4 @ 3.6 EPYC 7601 @3.2 Xeon 8176 @ 3.8
> 403.gcc 137% 119% 131%
>
> The percentages are the speedup of using two threads on the same core
> over using one thread on one core. So we see far less than 2x
> throughput from hyperthreading when running gcc, on any of these CPUs.

Ok, sure 2x is an exaggeration and those look about right really.

Given a build that takes, say, 60 minutes without HT, a 1.37x factor makes it 43.8 minutes with HT, a saving of more than a quarter of an hour. That's significant.

All the more so if you're talking about a Jenkins server running pre-commit tests. Being able to get 32 commits into VC in a day instead of 24 is a pretty significant server upgrade -- or downgrade if you lose the HT.

Not double, yeah. But *well* worth having.

Stefan Monnier

Jul 17, 2017, 12:12:23 PM
>> HT gives very close to 2x when the workload is gcc/llvm, as mine is.
> I found this very surprising when I read this, but did not find the

Indeed. In my experience with various compiler workloads on various
Intel CPUs, I get somewhere around a 1.25x speedup on an Atom
Z530 and a 1.4x speedup on an i3-4170 and i7-L620.


Stefan

Bruce Hoult

Jul 17, 2017, 2:09:58 PM
I agree with 1.4x. 2x was a bit of an exaggeration in reaction to people who claim that you are better off turning HT off!

Anton Ertl

Jul 31, 2017, 12:09:19 PM
Bruce Hoult <bruce...@gmail.com> writes:
>On Friday, July 14, 2017 at 3:55:49 PM UTC+3, Anton Ertl wrote:
>> Bruce Hoult <bruce...@gmail.com> writes:
>> >HT gives very close to 2x when the workload is gcc/llvm, as mine is.
>>
>> I found this very surprising when I read this, but did not find the
>> time to check this myself. Today, I see
>> <http://www.anandtech.com/print/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade>,
>> and there it says:
>>
>> Xeon E5-2699 v4 @ 3.6 EPYC 7601 @3.2 Xeon 8176 @ 3.8
>> 403.gcc 137% 119% 131%
>>
>> The percentages are the speedup of using two threads on the same core
>> over using one thread on one core. So we see far less than 2x
>> throughput from hyperthreading when running gcc, on any of these CPUs.
>
>Ok, sure 2x is an exaggeration and those look about right really.
>
>Given a build that takes, say, 60 minutes without HT, and 1.37x, that's 43.8 minutes with HT, a saving of more than quarter of an hour. That's significant.

That assumes that the build is 100% parallelizable. That's not the case
in my experience (see below).

But before looking into that, let's see how a Ryzen 1600X compares to
a Core i7-4690K (sorry, no i7-6700K results; it died last December and I
replaced it with an i5-6600K) on parallel gcc runs. Unfortunately, the
gcc versions are different (4.9 on the Core i7-4690K, and 6.3 on the
Ryzen 5 1600X), but hopefully they have similar SMT characteristics.

The Ryzen 5 1600X results (from
<2017Jul3...@mips.complang.tuwien.ac.at>):

              no SMT         SMT
           6 threads         12 threads
         10118371296         18418188966  cycles
          7577287643          7581083282  instructions
          1650573138          1651537403  branches
            18694066            20626816  branch-misses
         2.744945461         5.000948523  seconds time elapsed

The Core i7-4690K results:

              no SMT         SMT
           4 threads         8 threads
         6426M348109        11105M612668  cycles
         5346M637704         5350M453749  instructions
         1178M509354         1179M229362  r04c4 all branches retired
           11M890818           13M009419  r04c5 all branches mispredicted
          730M290134          730M933274  r82d0 all stores retired
         1594M147390         1557M220350  r81d0 all loads retired
         1380M446126         1423M603159  r01d1 load retired l1 hit
          118M638571           53M518977  r08d1 load retired l1 miss
           63M980466           29M569260  r02d1 load retired l2 hit
           54M867258           23M867249  r10d1 load retired l2 miss
           32M789779           17M118782  r04d1 load retired l3 hit
           22M058530            6M739372  r20d1 load retired l3 miss

         1.515701724         2.649699715  seconds time elapsed

On the Ryzen 1600X, SMT provides a 10% speedup over running the
processes back-to-back; on the Core i7-4690K it is 14%. Looking at the Core
i7-4690K results, SMT astonishingly reduces cache misses; maybe the
prefetcher is more effective when the threads are slowed down by SMT.

Concerning build times, I built
<http://www.complang.tuwien.ac.at/forth/gforth/Snapshots/0.7.9_20170705/gforth-0.7.9_20170705.tar.xz>
on an otherwise idle machine with "time (./configure && make -j)".

On the Core i7-4690k I get:

         no SMT       SMT
real     0m17.346s    0m16.290s
user     0m41.884s    0m54.612s
sys      0m1.672s     0m1.924s

A 6% speedup. On the Ryzen 1600X:

         no SMT       SMT
real     0m25.434s    0m24.360s
user     1m17.023s    1m18.427s
sys      0m4.480s     0m4.804s

A 4% speedup that vanished when I ran the SMT case again (i.e., it's
in the noise). The small user-time difference (which is
also in the noise) indicates that there is little SMT use here; i.e.,
6 cores are enough for this build.

And the biggest speedup seems to be coming from using Debian 8
(gcc-4.9 etc.) instead of Debian 9 (gcc-6.3 etc.) :-).

Bruce Hoult

Jul 31, 2017, 5:11:06 PM
On Monday, July 31, 2017 at 7:09:19 PM UTC+3, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >On Friday, July 14, 2017 at 3:55:49 PM UTC+3, Anton Ertl wrote:
> >> Bruce Hoult <bruce...@gmail.com> writes:
> >> >HT gives very close to 2x when the workload is gcc/llvm, as mine is.
> >>
> >> I found this very surprising when I read this, but did not find the
> >> time to check this myself. Today, I see
> >> <http://www.anandtech.com/print/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade>,
> >> and there it says:
> >>
> >> Xeon E5-2699 v4 @ 3.6 EPYC 7601 @3.2 Xeon 8176 @ 3.8
> >> 403.gcc 137% 119% 131%
> >>
> >> The percentages are the speedup of using two threads on the same core
> >> over using one thread on one core. So we see far less than 2x
> >> throughput from hyperthreading when running gcc, on any of these CPUs.
> >
> >Ok, sure 2x is an exaggeration and those look about right really.
> >
> >Given a build that takes, say, 60 minutes without HT, and 1.37x, that's 43.8 minutes with HT, a saving of more than quarter of an hour. That's significant.
>
> That assumes that build is 100% parallelizable. That's not the case
> in my experience (see below).

No it doesn't. I'm timing actual huge projects that I actually build, repeatedly, every day, and looking at the wall-clock differences. Some things, such as running ./configure or linking, are not parallelisable, and those are included in the total times.

Anton Ertl

Aug 1, 2017, 5:13:29 AM
You have not presented any timings of your own, only computed a
supposed savings from the best factor measured by Anandtech (1.37,
based on SPEC 2006 403.gcc numbers, i.e., without ./configure and
linking, and 100% parallelizable) and an assumed build time without
SMT.

I cannot measure your builds, but I have measured one of mine, and the
benefit from SMT is small.

Noob

Aug 2, 2017, 4:42:44 AM
I've been meaning to post my anecdote.

CPUs: Kaby Lake 7700K and Haswell 4790K
OS: Linux 4.11 and 3.10

When there are <= 4 active jobs, the CPU runs at full tilt.
As soon as there is a 5th active job, performance is cut in half
for all jobs! (So 6 jobs take as long to run as 3 jobs!)

Offlining the virtual cores solves the problem, with the scheduler
assigning jobs to 4 physical cores.

This experience does not seem shared by everyone, so I don't know
what is going on with our systems... But we were better off turning
HT off altogether ;-)

Regards.