Ampere Altra -- MEM group gives questionable results

Nichols A. Romero

Jul 4, 2023, 1:40:54 AM

to likwid...@googlegroups.com

Hi,

I am using the MEM group on an Ampere Altra (Arm Neoverse N1). I have two observations and a more general question.

Observations:

1. If I use the MEM group on an Intel architecture, I get values of 0 for most of the MPI ranks and see a non-zero value for only 1-2 ranks. Do the non-zero values correspond to the number of unique sockets being used? The Ampere Altra has only one socket, yet every MPI rank gives a non-zero value.

2. On the Ampere Altra, summing the MEM group values over all MPI ranks gives a memory bandwidth that far exceeds the theoretical maximum; it was something over 400 GB/s. Is this a bug, and should I file a GitHub issue for it?

General question:

Suppose I have a dual socket system with N hardware threads and a total memory bandwidth of B. Does the maximum achievable memory bandwidth scale linearly with the number of ranks in the program?

In other words,

achievable memory bandwidth = (# ranks / N) x B

modulo some NUMA effects which may or may not be severe.

So that if I am only using half the hardware threads, the best that a benchmark like STREAM could do is achieve 0.5 x B?
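The linear model in the question can be written down as a small sketch. The function name and the example numbers are hypothetical, and the code encodes only the assumption being asked about, not a measured result:

```python
def linear_bandwidth_estimate(ranks: int, hw_threads: int, total_bw: float) -> float:
    """Bandwidth predicted by the linear assumption: (ranks / N) x B."""
    if hw_threads <= 0:
        raise ValueError("hw_threads must be positive")
    if not 0 <= ranks <= hw_threads:
        raise ValueError("ranks must be between 0 and hw_threads")
    return ranks / hw_threads * total_bw


# Hypothetical system: N = 64 hardware threads, B = 200 GB/s.
# Half the threads gives 0.5 x B under this assumption.
print(linear_bandwidth_estimate(32, 64, 200.0))
```

Whether a real system follows this line is exactly the open question; the sketch only makes the assumed relationship explicit.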

Thanks

--

Nichols A. Romero, Ph.D.

Thomas Gruber

Jul 4, 2023, 1:51:56 AM

to likwid-users

Hi Nichols,

1.) Yes, as documented, the memory controllers and all other Uncore units on Intel platforms are per socket. LIKWID reads the counters with one HW thread per socket. If you use likwid-mpirun, it likewise measures the socket events with only one core per socket.

1.+2.) In the default event/counter set of ARM chips, all counters are per HW thread, so yes, you get counts for each HW thread. The problem is not the granularity but how the vendor who has to implement the events interprets them. The two events are named MEM_ACCESS_RD and MEM_ACCESS_WR, but my impression is that they count almost the same thing as LD_SPEC and ST_SPEC, i.e. loads and stores (speculatively) executed by the core. You can test that yourself with MEM_ACCESS_RD:PMC0,LD_SPEC:PMC1,MEM_ACCESS_WR:PMC2,ST_SPEC:PMC3 and compare the results.
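One way to do the comparison is a small script like the following sketch. The counts are placeholders, not measurements, and the helper name is hypothetical; substitute the values reported for the four events:

```python
def relative_difference(count: float, reference: float) -> float:
    """Relative difference of count with respect to reference, as a fraction."""
    if reference == 0:
        raise ValueError("reference count must be non-zero")
    return abs(count - reference) / abs(reference)


# Placeholder counts, not measurements: replace with the values from your run.
counts = {
    "MEM_ACCESS_RD": 1_010_000,
    "LD_SPEC": 1_000_000,
    "MEM_ACCESS_WR": 495_000,
    "ST_SPEC": 500_000,
}

# Compare each memory-access event with its speculative counterpart.
for mem_event, spec_event in (("MEM_ACCESS_RD", "LD_SPEC"),
                              ("MEM_ACCESS_WR", "ST_SPEC")):
    diff = relative_difference(counts[mem_event], counts[spec_event])
    print(f"{mem_event} vs {spec_event}: {diff:.1%}")
```

If the two events in each pair differ by only a few percent, that would support the impression that they count nearly the same thing.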

Best,
Thomas

Nichols A. Romero

Jul 13, 2023, 12:55:45 AM

to likwid...@googlegroups.com

Hi Thomas,

For the workload I tested:

MEM_ACCESS_RD was within 1-2% of LD_SPEC, and likewise MEM_ACCESS_WR was within 1-2% of ST_SPEC.

Additionally, the memory bandwidth values calculated by the MEM group exceed what the STREAM benchmark achieves.

So, my conclusion is that MEM_ACCESS_[RD,WR] count speculatively executed accesses and are not reliable for memory bandwidth measurements.
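The plausibility check behind this conclusion can be sketched as follows. The per-rank values and the reference bandwidth are placeholders, not the numbers from my runs, and the function name is hypothetical:

```python
def check_against_reference(per_rank_gbs, reference_gbs):
    """Sum per-rank bandwidths and flag a total above the reference value."""
    total = sum(per_rank_gbs)
    return total, total > reference_gbs


# Placeholder inputs, not measurements: 80 ranks at 5.5 GB/s each,
# checked against a hypothetical reference (e.g. a STREAM result) of 200 GB/s.
total, implausible = check_against_reference([5.5] * 80, 200.0)
print(f"total = {total:.1f} GB/s, exceeds reference: {implausible}")
```

A summed value above what STREAM reaches on the same machine suggests the events are not counting actual memory traffic.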

Thanks,

Thomas Gruber

Jul 13, 2023, 3:09:19 AM

to likwid-users

Hi Nichols,

Thanks for the confirmation for the Ampere Altra. Unfortunately, there seems to be no way to do reliable memory traffic measurements on that chip.

Best,
Thomas
