A64FX memory bandwidth issue

vamshi krishna

Nov 18, 2021, 8:00:59 AM

to likwid...@googlegroups.com

Hello all,

I am running the triad benchmarks on an ARM A64FX node, but I am not getting the expected memory bandwidth.

I get 310 GB/s for the node, whereas it should be around 800 GB/s; the theoretical bandwidth is 1024 GB/s.

I ran the triad_sve512_fma benchmark.

On an Intel processor, I got the expected results using likwid-bench commands.

I do not know whether I am doing anything wrong.

I would appreciate any help with this issue.

Regards

Vamshi

Thomas Gruber

Nov 18, 2021, 12:25:43 PM

to likwid-users

Hi Vamshi,

How did you run the benchmark? If you specify an affinity domain that spans multiple NUMA domains, you should use -W <workgroup>, as it causes the threads to initialize their chunk of data themselves (thread-local initialization). This is a general recommendation, also on Intel nodes with CoD/SNC active. The runs below are on a single node in the OOKAMI cluster (thx).

$ likwid-bench -t triad_sve512_fma -W N:8GB:48
MByte/s: 628728.60

$ likwid-perfctr -g MEM -m likwid-bench -t triad_sve512_fma -W N:8GB:48

MByte/s: 602873.95

Memory bandwidth [MBytes/s] STAT: 794103.1160

So the benchmark itself reports around 600 to 630 GByte/s, while the MEM group measures about 794 GByte/s of memory traffic. With huge pages, you might get a little bit more on A64FX.
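As a rough cross-check of the figures in the output above, the following sketch converts the reported MByte/s values to GByte/s and relates them to the 1024 GByte/s theoretical peak quoted in the question. It assumes decimal units (1 GByte = 1000 MByte); the function and constant names are made up for illustration and are not part of LIKWID.

```python
# Figures copied from the likwid-bench / likwid-perfctr output above (MByte/s).
BENCH_MBYTES_S = 628728.60               # likwid-bench alone
BENCH_WITH_PERFCTR_MBYTES_S = 602873.95  # likwid-bench under likwid-perfctr
COUNTER_MBYTES_S = 794103.1160           # MEM group, "Memory bandwidth ... STAT"

# Theoretical peak quoted in the original question (GByte/s).
THEORETICAL_PEAK_GBYTES_S = 1024.0


def to_gbytes(mbytes_per_s):
    """Convert MByte/s to GByte/s, assuming 1 GByte = 1000 MByte."""
    return mbytes_per_s / 1000.0


def fraction_of_peak(mbytes_per_s, peak_gbytes_s=THEORETICAL_PEAK_GBYTES_S):
    """Return the measured bandwidth as a fraction of the given peak."""
    return to_gbytes(mbytes_per_s) / peak_gbytes_s


if __name__ == "__main__":
    rows = [
        ("likwid-bench", BENCH_MBYTES_S),
        ("likwid-bench under likwid-perfctr", BENCH_WITH_PERFCTR_MBYTES_S),
        ("MEM group counters", COUNTER_MBYTES_S),
    ]
    for label, value in rows:
        print(f"{label}: {to_gbytes(value):.1f} GByte/s "
              f"({fraction_of_peak(value):.0%} of peak)")
```

The counter-based value and the benchmark-reported value are kept separate here because they come from different sources (hardware counters versus the benchmark's own accounting).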

Unfortunately, if you use -C N or similar for likwid-perfctr, the reported performance/bandwidth drops significantly, which is caused by the rather slow execution of non-optimized code by the A64FX cores. Since the threads only need to be pinned for MarkerAPI usage, and likwid-bench pins the threads itself, we can leave it out.

I hope this helps.

Best,
Thomas

vamshi krishna

Nov 18, 2021, 8:05:51 PM

to likwid...@googlegroups.com

Hi Thomas,

Thank you for your reply.

I understood my mistake from your example on A64FX.

I have one question: can we use LIKWID to find the memory latency of each memory domain of the processor? On A64FX, there are 4 HBM2 RAMs on die.

Could you give some examples of how to find it?

Thanks and Regards

Vamshi

Thomas Gruber

Nov 19, 2021, 6:50:04 AM

to likwid-users

Hi Vamshi,

You can find the latency with the clload, clstore and clcopy benchmarks, as they touch a cache line only once. The output contains cycles-per-cacheline. I am not sure about A64FX, though, as there is no reliable cycle counter besides the hardware performance counters. So you might have to measure the total number of cycles with likwid-perfctr, get the number of cache lines used by likwid-bench (data volume / CLsize), and calculate the 'cycles-per-CL' yourself.
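The manual calculation described above could be sketched as follows. The function name is hypothetical, the default cache line size of 256 bytes is an assumption for A64FX that should be checked against your system, and the input numbers in the example are invented purely to show the arithmetic. Whether to use per-thread cycles or another cycle count depends on how the measurement was taken and is left to the reader.

```python
def cycles_per_cacheline(total_cycles, data_volume_bytes, cacheline_bytes=256):
    """Compute cycles per cache line as total_cycles / (data volume / CL size).

    cacheline_bytes=256 is an assumed value for A64FX; pass your own if it differs.
    """
    if data_volume_bytes <= 0 or cacheline_bytes <= 0:
        raise ValueError("data volume and cache line size must be positive")
    num_cachelines = data_volume_bytes / cacheline_bytes
    return total_cycles / num_cachelines


if __name__ == "__main__":
    # Invented example values, not measurements:
    # 2e9 cycles measured with likwid-perfctr, 8e9 bytes moved by likwid-bench.
    example_cycles = 2.0e9
    example_bytes = 8.0e9
    print(cycles_per_cacheline(example_cycles, example_bytes))
```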

Best,
Thomas

Thomas Gruber

Nov 19, 2021, 9:01:19 AM

to likwid-users

Hi Vamshi,

Just as a remark: these are not the real L1/L2/L3/MEM latencies, as the HW prefetchers play an important role. For real latency measurements, you need something like pointer-chasing code (Intel MLC, for example).
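To illustrate what pointer chasing means, here is a minimal sketch of the access pattern only. It is not how Intel MLC or LIKWID work internally, and timings taken in Python are dominated by interpreter overhead, so a real latency measurement would need compiled code with a buffer sized for the memory level of interest. The idea is that each load address depends on the previous load's result, in a random order that prefetchers cannot predict. All names are hypothetical.

```python
import random
import time


def build_single_cycle(n, seed=0):
    """Build a random permutation forming one cycle of length n (Sattolo's algorithm).

    Following nxt[i] repeatedly from any start visits every element exactly once
    before returning to the start.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, which guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent accesses and return the final index."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]  # the next address depends on the value just loaded
    return idx


if __name__ == "__main__":
    n = 1 << 16
    chain = build_single_cycle(n)
    t0 = time.perf_counter()
    end = chase(chain, n)
    elapsed = time.perf_counter() - t0
    # Time per step here reflects Python overhead, not memory latency.
    print(f"ended at index {end}, {elapsed / n * 1e9:.1f} ns per step")
```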

Best,
Thomas
