Understanding MEM1 and MEM2 on AMD, also variations in data movement among different architectures


Nichols A. Romero

Nov 30, 2023, 2:11:11 AM
to likwid...@googlegroups.com
Hello LIKWID developers,

Two newbie questions. I will put the shorter question first.

#1
Suppose I have an AMD system and an Intel system, and I measure the data movement of the same applications on both. How much variation should I expect in the measured data movement? I am seeing about 20%, which I think is probably reasonable given compiler differences. But if you think this variation is too large, please let me know.

#2 (This is the longer question)
I am double-checking to make sure I am not off by a factor of *two* in my data movement and memory bandwidth measurements.

Here are the machine specs:
n.a.romero@sal-hyperplane01:~/miniAMR_WL/mpi/1sphere_likwid_amd_64$ likwid-topology
--------------------------------------------------------------------------------
CPU name:       AMD EPYC 7313 16-Core Processor
CPU type:       AMD K19 (Zen3) architecture
CPU stepping:   1
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:                2
Cores per socket:       16
Threads per core:       2
...
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:           2
--------------------------------------------------------------------------------
Domain:                 0
Processors:             ( 0 32 1 33 2 34 3 35 4 36 5 37 6 38 7 39 8 40 9 41 10 42 11 43 12 44 13 45 14 46 15 47 )
Distances:              10 32
Free memory:            253541 MB
Total memory:           257840 MB
--------------------------------------------------------------------------------
Domain:                 1
Processors:             ( 16 48 17 49 18 50 19 51 20 52 21 53 22 54 23 55 24 56 25 57 26 58 27 59 28 60 29 61 30 62 31 63 )
Distances:              32 10
Free memory:            233706 MB
Total memory:           257987 MB
--------------------------------------------------------------------------------

And the relevant performance groups:
n.a.romero@sal-hyperplane01:~/miniAMR_WL/mpi/1sphere_likwid_amd_64$ likwid-perfctr -a
    MEM2        Main memory bandwidth in MBytes/s (channels 4-7)
...
    MEM1        Main memory bandwidth in MBytes/s (channels 0-3)

Here is my question:
It doesn't look like I can measure MEM1 and MEM2 at the same time, but for the most part the measurements appear to be equal in value.

The total data movement of my application should be equal to the sum of the data volume reported by MEM1 + MEM2, correct?

Here is the part that is tripping me up:
Is the total memory bandwidth rate also equal to the sum of memory bandwidth rates reported by MEM1 and MEM2 groups? Or is it the average?

If it's the sum, then the values appear to exceed what is obtainable with likwid-bench for a kernel bound by main memory bandwidth.

To be specific:
If I am running my application with 8 MPI ranks on 1 socket, I would not expect the combined MEM1 + MEM2 memory bandwidth of the application to exceed the value reported by:
likwid-bench  -t copy_mem -w S1:1GB:8
(I am assuming my application is bound by main memory bandwidth -- which might be incorrect)

Thanks again for your help,
--
Nichols A. Romero, Ph.D.

Thomas Gruber

Dec 1, 2023, 4:39:52 AM
to likwid-users
Hi Nichols,

Correct, you cannot measure MEM1 and MEM2 at the same time. I couldn't create a single MEM group because there are 8 memory channels but only 4 counters. But in order to get the full picture, you should measure both groups, especially when using NPS2 or NPS4 mode. The total bandwidth should be the sum. You have to assume that while measuring MEM1, the memory channels covered by MEM2 behave similarly and vice versa. Moreover, the runtimes of both measurements should be close to equal for the sum to be correct.

You should probably get the maximum memory bandwidth with copy_mem, but keep in mind that it uses scalar loads and (non-temporal) stores.
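As a rough illustration of the bookkeeping described above, here is a minimal Python sketch that combines one run measured with MEM1 and one run measured with MEM2. It is not part of LIKWID: the function names, the dictionary keys, the 5% runtime tolerance, and all numbers are made-up assumptions, and the values would have to be copied by hand from the likwid-perfctr and likwid-bench output.

```python
def combine_mem_groups(mem1, mem2, runtime_tolerance=0.05):
    """Combine two separate runs, one per performance group.

    Each argument is a dict with hypothetical keys (not LIKWID names):
      "runtime_s": wall-clock runtime of that run in seconds
      "volume_gb": memory data volume reported by the group in GBytes
    """
    t1, t2 = mem1["runtime_s"], mem2["runtime_s"]
    # The sum is only meaningful if both runs took about the same time.
    rel_runtime_diff = abs(t1 - t2) / max(t1, t2)
    # Total data volume is the sum over both halves of the channels.
    total_volume_gb = mem1["volume_gb"] + mem2["volume_gb"]
    # Total bandwidth is the sum of the per-group bandwidths, not the average.
    total_bw_gbs = mem1["volume_gb"] / t1 + mem2["volume_gb"] / t2
    return {
        "total_volume_gb": total_volume_gb,
        "total_bandwidth_gbs": total_bw_gbs,
        "runtimes_comparable": rel_runtime_diff <= runtime_tolerance,
    }


def fraction_of_ceiling(total_bw_gbs, bench_bw_gbs):
    """Ratio of application bandwidth to a likwid-bench reference value."""
    return total_bw_gbs / bench_bw_gbs


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real measurements.
    run_mem1 = {"runtime_s": 10.0, "volume_gb": 400.0}
    run_mem2 = {"runtime_s": 10.2, "volume_gb": 408.0}
    combined = combine_mem_groups(run_mem1, run_mem2)
    print(combined)
    # 100.0 GB/s stands in for a copy_mem result on the same socket.
    print(fraction_of_ceiling(combined["total_bandwidth_gbs"], 100.0))
```

A ratio noticeably above 1.0 would suggest that something is off, for example very different runtimes between the two runs or a reference benchmark that does not reach the actual hardware maximum.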

Best,
Thomas

P.S. Not sure whether I mentioned it already but the events are marked by AMD as approximate and are only valid in NPS1 mode.