Hello,
Is there a recommended methodology to count flops on Intel Skylake servers? I’m having particular trouble identifying FMA floating-point operations, and the counters appear to undercount on my benchmarks.
Thanks,
Brian
Brian J Gravelle
Graduate Research Assistant, HPC-ENV
The FP_ARITH_INST_RETIRED performance counter event on Skylake Xeon processors appears to count correctly in all of my tests – at least “correctly” according to how Intel has defined the event.
The notes for the sub-events in the table at https://download.01.org/perfmon/SKX/skylakex_core_v1.24.json are reasonably specific:
"EventName": "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE"
"PublicDescription": "Number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element."
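For the “packed” events, the user needs to scale the result by the number of elements held in the SIMD register that the event is measuring (for double precision, 2, 4, or 8 elements for the 128-bit, 256-bit, and 512-bit events; twice that for single precision). All of these events, scalar and packed, increment twice when the operation is any variation of FMA, so no additional FMA factor is needed.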
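To make the scaling concrete, here is a minimal sketch of how the raw sub-event counts might be combined into a flop total. The sub-event names follow the Intel event list; the function name and the example counts are made up for illustration, and the counts would come from whatever tool you use to read the counters.

```python
# Elements per increment for each FP_ARITH_INST_RETIRED sub-event.
# FMA is NOT given an extra factor of 2 here, because the hardware
# already increments these events twice for FMA instructions.
ELEMENTS_PER_COUNT = {
    "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 1,
    "FP_ARITH_INST_RETIRED.SCALAR_SINGLE": 1,
    "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE": 2,
    "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE": 4,
    "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE": 4,
    "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE": 8,
    "FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE": 8,
    "FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE": 16,
}


def flops_from_counts(counts):
    """Sum raw event counts, each scaled by its elements per increment.

    `counts` maps sub-event names to raw counter values (hypothetical
    input format; adapt to your measurement tool).
    """
    return sum(ELEMENTS_PER_COUNT[name] * value for name, value in counts.items())


# Hypothetical counter readings, for illustration only.
example_counts = {
    "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE": 1000,
    "FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE": 500,
}
print(flops_from_counts(example_counts))  # 1000*1 + 500*8 = 5000
```

The total is "flops as Intel defines them": divides and square roots count the same as adds, and masked-off AVX-512 lanes are included.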
None of the AVX-512 packed events takes the contents of the mask register (if used) into account, so masked-off elements are counted as if they had been computed. That would cause overcounting, not undercounting.
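As a rough illustration of the masking effect, the sketch below compares what the scaled counter would report for a single masked 512-bit double-precision instruction against the number of operations actually performed on active lanes. The function and its parameters are hypothetical; this is arithmetic on the event definition, not a measurement.

```python
def counted_vs_useful_flops(mask, lanes=8, is_fma=True):
    """Return (counted, useful) flops for ONE masked packed instruction.

    mask   -- integer write-mask; bit i set means lane i is active
    lanes  -- elements per register (8 doubles in a 512-bit register)
    is_fma -- FMA increments the event twice, other ops once
    """
    increments = 2 if is_fma else 1
    # The event ignores the mask, so every lane is counted.
    counted = increments * lanes
    # Only the lanes selected by the mask do useful work.
    active = bin(mask & ((1 << lanes) - 1)).count("1")
    useful = increments * active
    return counted, useful


# Half of the lanes active: the counter reports twice the useful work.
print(counted_vs_useful_flops(0x0F))  # (16, 8)
```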
The events are defined in a way that is a compromise between the “hardware-centric” view of instructions issued and the “user-centric” view of traditional floating-point operation counting.
They are not quite “hardware-centric” because they increment twice for FMA operations (even though it is a single instruction dispatch), and they are not quite “user-centric” because they ignore masks and they don’t provide any way to deal with division or square root operations differently than add/sub/multiply. (It would also be nice to be able to count FMA instructions separately – sometimes the compiler splits these into separate add and multiply and it would be nice to be able to track this without resorting to the much more heavyweight binary instrumentation approach used by Intel Advisor.)
In my experience, the most common cause of (apparent) undercounting is common sub-expression elimination in the generated code: the compiler performs fewer operations than a count taken from the source would suggest. Of course, it is possible that there are counter bugs that did not show up in my testing. I did not test with any denormal or out-of-range input or output values, for example.
--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin