Bruce Hoult <bruce...@gmail.com> writes:
>On Friday, July 14, 2017 at 3:55:49 PM UTC+3, Anton Ertl wrote:
>> Bruce Hoult <bruce...@gmail.com> writes:
>> >HT gives very close to 2x when the workload is gcc/llvm, as mine is.
>>
>> I found this very surprising when I read this, but did not find the
>> time to check this myself. Today, I see
>> <http://www.anandtech.com/print/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade>,
>> and there it says:
>>
>>           Xeon E5-2699 v4 @ 3.6   EPYC 7601 @ 3.2   Xeon 8176 @ 3.8
>> 403.gcc   137%                    119%              131%
>>
>> The percentages are the speedup of using two threads on the same core
>> over using one thread on one core. So we see far less than 2x
>> throughput from hyperthreading when running gcc, on any of these CPUs.
>
>Ok, sure 2x is an exaggeration and those look about right really.
>
>Given a build that takes, say, 60 minutes without HT, and 1.37x, that's 43.8 minutes with HT, a saving of more than a quarter of an hour. That's significant.

That assumes that the build is 100% parallelizable. That's not the case
in my experience (see below).
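
To make the dependence on parallelizability concrete, here is a minimal
Amdahl-style sketch. It assumes that only a fraction of the non-SMT build
time benefits from the SMT speedup and the rest runs at unchanged speed.
The function name build_minutes and the fractions 0.9 and 0.5 are
hypothetical illustration values, not measurements; only the 60 minutes
and the 1.37x come from the quoted text.

```python
def build_minutes(base_minutes, smt_speedup, parallel_fraction):
    """Estimate build time with SMT when only part of the build benefits.

    parallel_fraction is the (assumed) share of the non-SMT build time
    during which all cores are busy, so that SMT can help at all.
    """
    unaffected = base_minutes * (1.0 - parallel_fraction)
    sped_up = base_minutes * parallel_fraction / smt_speedup
    return unaffected + sped_up


# 1.0 reproduces the quoted 43.8 minutes; 0.9 and 0.5 are made-up fractions.
for fraction in (1.0, 0.9, 0.5):
    minutes = build_minutes(60, 1.37, fraction)
    print(f"parallel fraction {fraction:.1f}: {minutes:.1f} minutes")
```

With a fraction of 1.0 this gives the 43.8 minutes from the quoted
estimate; any smaller fraction shrinks the saving accordingly.
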
But before looking into that, let's see how a Ryzen 1600X compares to
a Core i7-4690K (sorry, no i7-6700K results; it died last December and I
replaced it with an i5-6600K) on parallel gcc runs. Unfortunately, the
gcc versions are different (4.9 on the Core i7-4690K, and 6.3 on the
Ryzen 5 1600X), but hopefully they have similar SMT characteristics.

The Ryzen 5 1600X results (from
<2017Jul3...@mips.complang.tuwien.ac.at>):

       no SMT          SMT
    6 threads   12 threads
  10118371296  18418188966  cycles
   7577287643   7581083282  instructions
   1650573138   1651537403  branches
     18694066     20626816  branch-misses
  2.744945461  5.000948523  seconds time elapsed

The Core i7-4690K results:

       no SMT           SMT
    4 threads     8 threads
  6426M348109  11105M612668  cycles
  5346M637704   5350M453749  instructions
  1178M509354   1179M229362  r04c4 all branches retired
    11M890818     13M009419  r04c5 all branches mispredicted
   730M290134    730M933274  r82d0 all stores retired
  1594M147390   1557M220350  r81d0 all loads retired
  1380M446126   1423M603159  r01d1 load retired l1 hit
   118M638571     53M518977  r08d1 load retired l1 miss
    63M980466     29M569260  r02d1 load retired l2 hit
    54M867258     23M867249  r10d1 load retired l2 miss
    32M789779     17M118782  r04d1 load retired l3 hit
    22M058530      6M739372  r20d1 load retired l3 miss
  1.515701724   2.649699715  seconds time elapsed

On the Ryzen 1600X, SMT provides a 10% speedup over running the
processes back-to-back; on the Core i7-4690K, 14%. Looking at the Core
i7-4690K results, SMT astonishingly reduces cache misses; maybe the
prefetcher is more effective when the threads are slowed down by SMT.
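
The percentages above can be rechecked from the elapsed times with a
small sketch. It assumes, as "back-to-back" implies, that the SMT run
executes twice as many processes as the non-SMT run, so the baseline is
two non-SMT runs in sequence. The names smt_speedup, RUNS and
l1_miss_ratio are only illustrative; the numbers are copied from the
tables above.

```python
def smt_speedup(t_no_smt, t_smt):
    """Speedup of one SMT run over two back-to-back non-SMT runs."""
    return 2.0 * t_no_smt / t_smt


def l1_miss_ratio(l1_misses, loads):
    """Fraction of retired loads that missed in L1."""
    return l1_misses / loads


# (seconds without SMT, seconds with SMT), from the tables above
RUNS = {
    "Ryzen 5 1600X": (2.744945461, 5.000948523),
    "Core i7-4690K": (1.515701724, 2.649699715),
}

for cpu, (t_no_smt, t_smt) in RUNS.items():
    gain = 100.0 * (smt_speedup(t_no_smt, t_smt) - 1.0)
    print(f"{cpu}: {gain:.0f}% speedup from SMT")

# Core i7-4690K: r08d1 (L1 misses) relative to r81d0 (all loads retired)
no_smt_ratio = l1_miss_ratio(118_638_571, 1_594_147_390)
smt_ratio = l1_miss_ratio(53_518_977, 1_557_220_350)
print(f"L1 load miss ratio: {no_smt_ratio:.1%} without SMT, {smt_ratio:.1%} with SMT")
```

The miss ratio is taken per retired load, so it does not depend on how
many processes the counters cover.
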
Concerning build times, I built
<http://www.complang.tuwien.ac.at/forth/gforth/Snapshots/0.7.9_20170705/gforth-0.7.9_20170705.tar.xz>
on an otherwise idle machine with "time (./configure && make -j)".

On the Core i7-4690K I get:

         no SMT        SMT
real  0m17.346s  0m16.290s
user  0m41.884s  0m54.612s
sys    0m1.672s   0m1.924s

A 6% speedup. On the Ryzen 1600X:

         no SMT        SMT
real  0m25.434s  0m24.360s
user  1m17.023s  1m18.427s
sys    0m4.480s   0m4.804s

A 4% speedup that vanished when I ran the SMT case again (i.e., it's
in the noise). The small difference in user time (which is also in the
noise) indicates that there is little SMT use here; i.e., 6 cores are
enough for this build.
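
For reference, the build speedups and the change in user time can be
derived from the "time" output as in the following sketch. The times are
copied from the tables above and converted to seconds; the names BUILDS
and ratio are only illustrative. A large increase in user time suggests
that many processes actually shared cores through SMT, while a small
increase suggests they mostly did not.

```python
# (seconds without SMT, seconds with SMT), taken from the "time" output above
BUILDS = {
    "Core i7-4690K": {"real": (17.346, 16.290), "user": (41.884, 54.612)},
    "Ryzen 5 1600X": {"real": (25.434, 24.360), "user": (77.023, 78.427)},
}


def ratio(pair):
    """Return the no-SMT value divided by the SMT value."""
    no_smt, smt = pair
    return no_smt / smt


for cpu, times in BUILDS.items():
    real_gain = 100.0 * (ratio(times["real"]) - 1.0)
    # User time grows when hardware threads of the same core compete.
    user_growth = 100.0 * (1.0 / ratio(times["user"]) - 1.0)
    print(f"{cpu}: {real_gain:.0f}% faster wall clock, "
          f"{user_growth:.0f}% more user time with SMT")
```
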
And the biggest speedup seems to be coming from using Debian 8
(gcc-4.9 etc.) instead of Debian 9 (gcc-6.3 etc.):-).