Hi Vamshi,
How did you run the benchmark? If you specify an affinity domain that spans multiple NUMA domains, you should use -W <workgroup>, as it causes each thread to initialize its own chunk of data (thread-local initialization). This is a general recommendation that also applies to Intel nodes with CoD/SNC active. The runs below are on a single node of the OOKAMI cluster (thx).
$ likwid-bench -t triad_sve512_fma -W N:8GB:48
MByte/s: 628728.60
$ likwid-perfctr -g MEM -m likwid-bench -t triad_sve512_fma -W N:8GB:48
MByte/s: 602873.95
Memory bandwidth [MBytes/s] STAT: 794103.1160
So we reach around 794 GByte/s of measured memory bandwidth with the benchmark. With huge pages, you might get a little bit more on A64FX.
Unfortunately, if you use -C N or similar with likwid-perfctr, the reported performance/bandwidth drops significantly, which is caused by the rather slow execution of non-optimized code on the A64FX cores. The MarkerAPI only requires that the threads are pinned, and likwid-bench pins its threads itself, so we can leave -C out.
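If you want to check the effect of the initialization on your own node, a small script along these lines might help. It is only a sketch: it assumes likwid-bench is in your PATH, that your build accepts both -w (which, as far as I know, initializes the data from one thread) and -W, and that the output contains a "MByte/s:" line as in the runs above. The helper names extract_bandwidth and to_gbyte are made up for this example.

```shell
#!/bin/sh
# Sketch: run the same kernel with -w and with -W (thread-local
# initialization) and print the reported bandwidth for each.
# Assumption: kernel and workgroup are the ones used in the runs above.
WORKGROUP="N:8GB:48"
KERNEL="triad_sve512_fma"

# Hypothetical helper: pull the value from a "MByte/s: <value>" line.
extract_bandwidth() {
    awk '/^MByte\/s:/ { print $2 }'
}

# Hypothetical helper: convert MByte/s to GByte/s (factor 1000),
# printed with one decimal place.
to_gbyte() {
    awk '{ printf "%.1f\n", $1 / 1000 }'
}

if command -v likwid-bench >/dev/null 2>&1; then
    for flag in -w -W; do
        bw=$(likwid-bench -t "$KERNEL" "$flag" "$WORKGROUP" | extract_bandwidth)
        echo "$flag $WORKGROUP: $(echo "$bw" | to_gbyte) GByte/s"
    done
else
    echo "likwid-bench not found, skipping runs" >&2
fi
```

Adjust the workgroup string to your machine. If a run fails and no "MByte/s:" line is found, the script prints 0.0 for that flag.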
I hope this helps.
Best,
Thomas