Ampere Altra -- FLOP_DP missing

Nichols A. Romero

Jun 8, 2023, 1:57:45 AM
to likwid...@googlegroups.com
Hi,

I am trying to use LIKWID with an 80-core Ampere Altra
https://solutions-portal-cms-prod-bucket.s3.amazonaws.com/Altra_Rev_A1_DS_v1_30_20220728_8170025756.pdf

I believe it's essentially an ARM Neoverse N1, similar but not identical to an AWS Graviton 2.

I seem to have the MEM performance group available, but not the FLOPS_DP performance group.

Based on the LIKWID wiki, ARM support is listed as experimental. When I look at the counter groups source file, it has not been touched in 3 years:

Does this mean there is a technical obstacle to creating the FLOPS_DP group?

Thanks,
--
Nichols A. Romero, Ph.D.

Thomas Gruber

Jun 8, 2023, 7:07:35 AM
to likwid-users
Hi,

the basic event set defined by ARM for the ARM Neoverse N1 does not contain reliable FP events. There is:
  • ASE_SPEC, Advanced SIMD instruction speculatively executed
  • VFP_SPEC, Floating point instruction speculatively executed
It clearly states "speculatively executed" and not "retired". Moreover, there is no separation between single- and double-precision. The group folder arm8_n1 contains only groups that can be measured with the basic set.

Vendors in the ARM ecosystem can add custom events to the chip and the extension might contain better FP events. Your linked documentation does not contain any events.

Best,
Thomas

P.S. ARM is marked as experimental because it uses perf_event under the hood, so the whole measurement phase is out of LIKWID's control.

Nichols A. Romero

Jun 22, 2023, 8:01:13 PM
to likwid...@googlegroups.com
Hi Thomas,

I have very limited experience with ARM and have a few questions.

1. If I understand correctly, the problem with *speculative* events versus *retired* events is that one will be overcounting? Potentially by a lot.

2. In the GitHub repo, I do see FLOPS_DP groups defined for arm8_tx2 and arm64fx. They appear to use speculative events, so they would be overcounting as well?

3. Is the lack of *retired* FP events pervasive in the ARM ecosystem?

4. (more general question, not specific to ARM). My understanding is that these hardware counters are implemented by hardware engineers when they are designing the processors. Since they didn’t get implemented for ARM Neoverse, there is just no hope?
(unless this vendor has added the *retired* FP event to the Ampere Altra and has just failed to document it).

Thomas Gruber

Jun 23, 2023, 1:56:27 AM
to likwid-users
Hi Nichols,

1. Yes, they potentially overcount, but by how much depends on your code and the actual implementation.

2. The arm8_tx2 FLOPS_DP group contains the note "expect some overcounting" (I forgot to mention the speculative counting in the group description). The arm64fx FLOPS_* groups use non-base events added by Fujitsu (FP_*P_FIXED_OPS_SPEC and FP_*P_SCALE_OPS_SPEC), which were quite accurate in our analyses ([1], [2]).

3. There are retired events for ARM, but the FP events seem to be almost all speculative. Fujitsu added some retired events like SIMD_INST_RETIRED, see [3].

4. Yes, the counters/events are added by hardware engineers. My impression is that nowadays they are no longer used only for design validation; some events are added for other reasons, like profile-guided optimization by the compiler. To my understanding, the basic set of the ARM Neoverse N1 will not change over the N1 lifetime. But implementations that use the N1 design, like the Ampere Altra, could have added such events. So far, I haven't found official documentation for the Ampere Altra that contains the event list.

Best,
Thomas


Nichols A. Romero

Jun 27, 2023, 1:23:32 AM
to likwid...@googlegroups.com
Hi Thomas,

I was finally able to find the documentation for the Ampere Altra core PMU events. It is the first link on this page:

Click on Altra Family Monitor Events.

Unfortunately, I only see speculative events. It seems like the best one can do is something similar to arm8_tx2:

and hope that the FP overcounting is not too bad. If I understand speculative events, this should be an upper bound to the FP count.

An idea, one could measure the FP count on a machine with retired FP events (e.g. Intel) and compare it with the speculative FP events on Ampere for some workload. That could give at least some sense of the magnitude of the FP overcounting for the Ampere case. Additionally, one could make the additional leap of faith that the overcounting factor that occurs for similar workloads would be identical.

If I understand speculative execution, one must always be in the overcounting case -- it should not be possible to *under*count FP with these speculative events.
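
The comparison proposed above can be sketched as follows. This is a minimal illustration, assuming hypothetical counter totals for the same workload on both machines; the function name `overcount_factor` and all numbers are made up for the example and are not measurements.

```python
def overcount_factor(speculative_count: int, retired_reference: int) -> float:
    """Ratio of a speculative FP event count to a retired-FP reference count."""
    if retired_reference <= 0:
        raise ValueError("reference count must be positive")
    return speculative_count / retired_reference


# Hypothetical totals for one workload (illustrative values only).
ampere_speculative = 1_250_000  # e.g. ASE_SPEC + VFP_SPEC summed on the Altra
x86_retired = 1_000_000         # retired FP instruction count on the reference machine

factor = overcount_factor(ampere_speculative, x86_retired)
print(f"overcount factor: {factor:.2f}")
```

A factor above 1 would be consistent with overcounting, but the ratio also absorbs any difference in how the two instruction sets express the same computation, so it is only a rough indicator.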

Cheers,

Thomas Gruber

Jun 28, 2023, 9:46:54 PM
to likwid-users
Hi Nichols,

> I was finally able to find the documentation for the Ampere Altra core PMU events. It is the first link on this page:
>
> Click on Altra Family Monitor Events.
>
> Unfortunately, I only see speculative events. It seems like the best one can do is something similar to arm8_tx2:
>
> and hope that the FP overcounting is not too bad. If I understand speculative events, this should be an upper bound to the FP count.

This is correct. We could build something like the group for TX2 and it will be an upper bound.
 

> An idea, one could measure the FP count on a machine with retired FP events (e.g. Intel) and compare it with the speculative FP events on Ampere for some workload. That could give at least some sense of the magnitude of the FP overcounting for the Ampere case. Additionally, one could make the additional leap of faith that the overcounting factor that occurs for similar workloads would be identical.

I'm not sure about that. Of course, the FP calculations performed should be the same, and that's what you get from the retired FP instruction events on x86 plus some scaling. But the ARM FP events count not only FP calculations but any FP-related instruction, so the numbers are probably not directly comparable.
 
> If I understand speculative execution, one must always be in the overcounting case -- it should not be possible to *under*count FP with these speculative events.

Correct, that's also my assumption.
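
As a rough sketch of what a TX2-style metric could compute from the two base events: the weights below are assumptions of this example (each ASE_SPEC event treated as one double-precision operation on a full 128-bit vector, i.e. two doubles, and each VFP_SPEC event as one scalar operation), not the formula of an existing LIKWID group. The function name and the counter values are hypothetical.

```python
def estimated_dp_mflops(ase_spec: int, vfp_spec: int, runtime_s: float,
                        doubles_per_vector: int = 2) -> float:
    """Estimate DP MFLOP/s from speculative event counts.

    Assumption: every ASE_SPEC event is a DP vector operation on
    `doubles_per_vector` elements and every VFP_SPEC event is one scalar
    operation. The events do not separate single from double precision
    and also count non-arithmetic FP instructions, so this is an
    estimate, not an exact count.
    """
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    flops = ase_spec * doubles_per_vector + vfp_spec
    return flops / runtime_s / 1.0e6


# Hypothetical counter readings over a 2-second region (illustrative only).
rate = estimated_dp_mflops(ase_spec=4_000_000, vfp_spec=2_000_000, runtime_s=2.0)
print(f"estimated DP rate: {rate:.1f} MFLOP/s")
```

Note that a fused multiply-add counts as a single instruction under this weighting, so the interpretation of the result depends on the instruction mix of the code being measured.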

Best,
Thomas