Missing AVX512 Events on SkylakeX in Multithreaded Run


Marcel Koch

Jun 13, 2019, 5:51:35 AM
to likwid-users
Hi @all,

I'm currently experiencing an unexpected behaviour concerning the FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE event reported from likwid-perfctr on an Intel(R) Xeon(R) Gold 6148. I have attached a small test program which exhibits the following behaviour:
If the program is run using only one thread, i.e. `likwid-perfctr -g FLOPS_DP -f -V 3 -C 0`, perfctr reports the event as expected (see cpu1).
If the program is run using two or more threads, e.g. `likwid-perfctr -g FLOPS_DP -f -V 3 -C 0-1`, perfctr reports 0 FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE events for the first core and the expected count for the other cores (see cpu2).

This happens only with AVX512 vectorization. If I use only AVX2, the multithreaded run also produces the expected event count.

WRT main.cpp:
- vectorclass is a wrapper for SIMD types by Agner Fog (https://www.agner.org/optimize/#vectorclass)
- benchmark is the google benchmark library

Any ideas what could cause this behaviour?

Thanks,
Marcel.
main.cpp
cpu1
cpu2

Thomas Gruber

Jun 13, 2019, 8:20:26 AM
to likwid-users
Hi,

I cannot reproduce your faulty behavior. On my Skylake SP system (also Intel(R) Xeon(R) Gold 6148) the 512B counter increases for all threads as expected (I reduced the ITERATIONS to 10000):

$ g++ -mavx512f -mfma -fabi-version=0 -fopenmp -O3 main.cpp ~/SAFE/benchmark/build/src/libbenchmark.a -I ~/SAFE/benchmark/include/ -o main
$ for ((i=1;i<=10;i++)); do export OMP_NUM_THREADS=$i; likwid-perfctr -g FLOPS_DP -C 0-9 ./main 2>&1 | grep "FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE" | grep -v "STAT"; done
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |   640064 |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |   640064 |   640064 |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |   640064 |   640064 |   640064 |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |   640064 |   640064 |   640064 |   640064 |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |  640064 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |   640064 |  640064 |


Did you compile with -fopenmp? Without it, the program would be a single-threaded process with a cpuset of {0,1}, so it could run entirely on CPU 1 and the counter for CPU 0 would read zero.

Best regards,
Thomas

Marcel Koch

Jun 13, 2019, 8:35:28 AM
to likwid-users
Hi Thomas,

I've compiled the program the same way you did, but I still get the weird behaviour. Also, I needed to add the `-f` flag to perfctr, not sure if that is relevant.
$ g++-7 -mavx512f -mfma -fabi-version=0 -fopenmp -O3 main.cpp -lbenchmark -o main
$ for ((i=1;i<=10;i++)); do export OMP_NUM_THREADS=$i; likwid-perfctr -g FLOPS_DP -f -C 0-9 ./main 2>&1 | grep "FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE" | grep -v "STAT"; done
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |  51200000 |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |  51200000 |  51200000 |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |  51200000 |

On a side note, if I use `perf stat` I get the expected count regardless of the number of threads.

Marcel Koch

Jun 13, 2019, 9:17:28 AM
to likwid-users
I just want to add that putting likwid markers into the source file seems to make matters worse. I've attached the modified source, and with that I get:
$ for ((i=1;i<=10;i++)); do export OMP_NUM_THREADS=$i; likwid-perfctr -g FLOPS_DP -m -C 0-9 ./main 2>&1 | grep "FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE" | grep -v "STAT"; done
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |  51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |         0 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |         0 |         0 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |         0 |         0 |         0 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |         0 |

main.cpp

Thomas Gruber

Jun 13, 2019, 10:38:18 AM
to likwid-users
Which LIKWID version are you using? I see no problems on my side (neither with nor without the MarkerAPI). Are you sure nobody else is running LIKWID and occupying the counters? You copied my commands; can you retry them with -f? (You don't see the errors for #threads > 1 because of the 2>&1.)

Marcel Koch

Jun 14, 2019, 4:29:39 AM
to likwid-users
I'm using
likwid-perfctr -- Version 4.3.4 (commit: 233ab943543480cd46058b34616c174198ba0459)
but I also tried version 4.3.3 with the same results.

Also, I've noticed that this problem is not confined to this one program. It happens for any multithreaded program, even the likwid benchmarks.

With the likwid benchmarks it is even a bit worse. If the list of CPUs I pin the benchmark to is larger than the number of threads specified in the workgroup, no AVX512 events are counted at all, e.g.
$ likwid-perfctr -g FLOPS_DP -m -f -C 0-5 likwid-bench -t stream_avx512 -w M0:400KB:2 | grep "FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE" | grep -v 'STAT'
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |          0 |          0 |
Again, this behaviour does not appear for AVX2 events.

Thomas Gruber

Jun 14, 2019, 6:18:12 AM
to likwid-users
Hi,

> I'm using
> likwid-perfctr -- Version 4.3.4 (commit: 233ab943543480cd46058b34616c174198ba0459)
> but I also tried version 4.3.3 with the same results.

So, a release version. That's the main information for me.

> Also, I've noticed that this problem is not confined to this one program. It happens for any multithreaded program, even the likwid benchmarks.
>
> With the likwid benchmarks it is even a bit worse. If the list of CPUs I pin the benchmark to is larger than the number of threads specified in the workgroup, no AVX512 events are counted at all, e.g.
> $ likwid-perfctr -g FLOPS_DP -m -f -C 0-5 likwid-bench -t stream_avx512 -w M0:400KB:2 | grep "FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE" | grep -v 'STAT'
> | FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |          0 |          0 |

I don't have the problem:
$ likwid-perfctr -g FLOPS_DP -m -f -C 0-5 likwid-bench -t stream_avx512 -w M0:400KB:2 | grep "FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE" | grep -v 'STAT'
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |  545259500 |  545259500 |

 
Again, this behaviour does not appear for AVX2 events.

This fact gives me a headache, because LIKWID does not care which event is being measured when it configures, starts, reads, and stops a counter. From the logs in your initial post you can see that the 512B event is configured on PMC3 (event 0xC7, umask 0x40):
SETUP_PMC [cpu=0] Register 0x189 , Flags: 0x4140C7
SETUP_PMC [cpu=1] Register 0x189 , Flags: 0x4140C7


and all four PMC counters are started (the 0xF in the low bits; the 0x7 in the upper half is for the three fixed counters):
UNFREEZE_PMC_AND_FIXED [cpu=0] Register 0x38F , Flags: 0x70000000F
UNFREEZE_PMC_AND_FIXED [cpu=1] Register 0x38F , Flags: 0x70000000F

and read:
READ_PMC [cpu=0] Register 0xC4 , Flags: 0x0
READ_PMC [cpu=1] Register 0xC4 , Flags: 0x2625A000
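
The three register values in the log excerpts above can be decoded by hand. A minimal Python sketch, assuming the standard Intel layouts of IA32_PERFEVTSELx (MSR 0x189 is the select register for counter 3) and IA32_PERF_GLOBAL_CTRL (MSR 0x38F). The function names are illustrative and not part of LIKWID:

```python
def decode_perfevtsel(value):
    """Split an IA32_PERFEVTSELx value into a few of its architectural fields."""
    return {
        "event": value & 0xFF,               # bits 0-7: event select
        "umask": (value >> 8) & 0xFF,        # bits 8-15: unit mask
        "usr": bool((value >> 16) & 1),      # bit 16: count in user mode
        "os": bool((value >> 17) & 1),       # bit 17: count in kernel mode
        "enable": bool((value >> 22) & 1),   # bit 22: counter enabled
    }


def decode_global_ctrl(value):
    """Split IA32_PERF_GLOBAL_CTRL into its two enable masks."""
    return {
        "pmc_mask": value & 0xFF,            # low bits: general-purpose counters
        "fixed_mask": (value >> 32) & 0xFF,  # bits 32 and up: fixed counters
    }


setup = decode_perfevtsel(0x4140C7)      # SETUP_PMC flags from the log
started = decode_global_ctrl(0x70000000F)  # UNFREEZE_PMC_AND_FIXED flags
count_cpu1 = 0x2625A000                  # READ_PMC value for cpu=1

print(hex(setup["event"]), hex(setup["umask"]), setup["enable"])
print(bin(started["pmc_mask"]), bin(started["fixed_mask"]))
print(count_cpu1)
```

Under these assumptions the configuration on both CPUs is identical (event 0xC7, umask 0x40, user-mode counting, enabled) and PMC3 is started on both, so the zero read on cpu=0 is not explained by a different setup.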

Can you try to use the 512B event in all four counters, i.e. with this event set:
FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC0,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC1,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC2,FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE:PMC3
Maybe it's counter register PMC3 that has a problem...
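
Such an event set is just a comma-separated list of EVENT:COUNTER pairs, so it can be generated rather than typed. A small Python sketch; the helper name is hypothetical, and the resulting string is meant to be passed to `likwid-perfctr -g` in place of a group name:

```python
def same_event_on_counters(event, counters):
    """Build an EVENT:COUNTER,... string that puts one event on several counters."""
    return ",".join("%s:%s" % (event, counter) for counter in counters)


event_set = same_event_on_counters(
    "FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE",
    ["PMC0", "PMC1", "PMC2", "PMC3"],
)
print(event_set)
```

If all counters but one report the same value, the odd one out points at the counter register rather than at the event.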

Regards,

Marcel Koch

Jun 14, 2019, 7:50:12 AM
to likwid-users
Ok, it seems like there is something wrong with PMC3, see
$ likwid-perfctr -g ALL_AVX512 -m -C 0-5 main | grep '512B' | grep -v 'STAT'
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC0  |     51200000 |     51200000 |     51200000 |     51200000 |     51200000 |     51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC1  |     51200000 |     51200000 |     51200000 |     51200000 |     51200000 |     51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC2  |     51200000 |     51200000 |     51200000 |     51200000 |     51200000 |     51200000 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |            0 |            0 |            0 |            0 |            0 |            0 |
where ALL_AVX512 is the performance group you described.

I checked with other performance groups, and PMC3 is always 0 if I use more than one thread.

Can you say if that is a software or a hardware fault?

Thomas Gruber

Jun 14, 2019, 8:15:57 AM
to likwid-users


> Can you say if that is a software or a hardware fault?

I checked the Intel Skylake SP specification update and found this erratum:
SKX90:
Performance Monitoring General Purpose Counter 3 May Contain Unexpected Values

Problem: When Restricted Transactional Memory (RTM) is supported (CPUID.07H.EBX.RTM [bit 11] = 1) and when TSX_FORCE_ABORT=0, Performance Monitor Unit (PMU) general purpose counter 3 (IA32_PMC3, MSR C4H and IA32_A_PMC3, MSR 4C4H) may contain unexpected values. Further, IA32_PERFEVTSEL3 (MSR 189H) may also contain unexpected configuration values.

Implication: Due to this erratum, software that uses PMU general purpose counter 3 may read an unexpected count and configuration.

Workaround: Software can avoid this erratum by writing 1 to bit 0 of TSX_FORCE_ABORT (MSR 10FH) which will cause all Restricted Transactional Memory (RTM) transactions to abort with EAX code 0. TSX_FORCE_ABORT MSR is available when CPUID.07H.EDX [bit 13]=1.

Status: No fix.

So yes, you hit a hardware bug. The workaround looks like something that could be integrated when the access daemon mode is used. I haven't checked the kernel sources to see whether perf_event contains the proposed workaround. At the least, I can add a warning when RTM is supported and the TSX_FORCE_ABORT register value is zero. I'm not sure whether LIKWID should abort all RTM transactions.

If you have sudo privileges, you can use sudo wrmsr 0x10f 0x1 before running LIKWID.
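
The condition from the erratum (RTM supported and bit 0 of TSX_FORCE_ABORT equal to 0) can also be checked from a script. A minimal Python sketch, assuming a Linux system where /proc/cpuinfo lists the `rtm` flag and the `msr` kernel module exposes /dev/cpu/N/msr; the function names are hypothetical and this is not how LIKWID implements its check:

```python
import os

TSX_FORCE_ABORT = 0x10F  # MSR address quoted in erratum SKX90


def has_rtm(cpuinfo_text):
    """Return True if a 'flags' line of /proc/cpuinfo lists the rtm flag."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, flags = line.partition(":")
            if "rtm" in flags.split():
                return True
    return False


def read_msr(cpu, reg):
    """Read one 64-bit MSR via the Linux msr driver (needs root and 'modprobe msr')."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        # The msr device uses the register number as the file offset.
        return int.from_bytes(os.pread(fd, 8, reg), "little")
    finally:
        os.close(fd)


def pmc3_affected(rtm_supported, tsx_force_abort_value):
    """Erratum condition: RTM is supported and TSX_FORCE_ABORT bit 0 is 0."""
    return rtm_supported and (tsx_force_abort_value & 1) == 0
```

As root, one could then evaluate `pmc3_affected(has_rtm(open("/proc/cpuinfo").read()), read_msr(0, TSX_FORCE_ABORT))`. On CPUs that do not provide the register (CPUID.07H.EDX bit 13 = 0) the read is expected to fail with an OSError, which the sketch does not handle.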

Have a nice weekend,
Thomas

Marcel Koch

Jun 14, 2019, 8:27:37 AM
to likwid-users
Thanks for your help, Thomas. Using `sudo wrmsr -a 0x10f 0x1` yields the expected results.

Thomas Gruber

Jun 25, 2019, 12:33:50 PM
to likwid-users
You're welcome. I just added a warning to LIKWID for this special case. It doesn't write TSX_FORCE_ABORT but disables PMC3 when running on an affected system. It also prints the wrmsr command.