Memory Bandwidth and Roofline Model on zen3 Architecture (AMD EPYC)

René

Jan 24, 2022, 8:48:27 AM

to likwid-users

Hi,

first of all thanks for developing likwid! Very useful :).

I am currently trying to do a roofline analysis of a program on an AMD EPYC (Zen3 architecture), and I wanted to ask whether I am looking at the correct numbers, since this architecture does not have a MEM_DP group. I'm interested in the naive roofline model, looking at memory transfers from/to RAM. I give more details about the job at the end of this email.

The performance measurement using the FLOPS_DP group works fine and is similar to a measurement done with different tools.

For the intensity I measure the MEM1 and the MEM2 group and look at the memory bandwidth. I then calculate the intensity via:

performance_from_FLOPS_DP / (memory_bandwidth_from_MEM1 + memory_bandwidth_from_MEM2)
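As a rough illustration, the ratio above can be written as a small Python helper. This is only a sketch: the function name and the input values are hypothetical, and it assumes the FLOPS_DP rate is taken in MFLOP/s and both bandwidths in MByte/s, so that the quotient comes out in FLOP/byte. Because MEM1 and MEM2 are measured in separate runs, it also assumes those runs are comparable.

```python
def arithmetic_intensity(mflops, mem1_mbytes_s, mem2_mbytes_s):
    """Return FLOP/byte from a FLOP rate and two memory bandwidths.

    mflops         -- FLOP rate in MFLOP/s (e.g. from the FLOPS_DP group)
    mem1_mbytes_s  -- bandwidth of channels 0-3 in MByte/s (MEM1 group)
    mem2_mbytes_s  -- bandwidth of channels 4-7 in MByte/s (MEM2 group)
    """
    total_bandwidth = mem1_mbytes_s + mem2_mbytes_s
    if total_bandwidth <= 0:
        raise ValueError("total memory bandwidth must be positive")
    # MFLOP/s divided by MByte/s leaves FLOP/byte.
    return mflops / total_bandwidth


# Hypothetical numbers, chosen only to illustrate the units.
example_ai = arithmetic_intensity(600000.0, 12000.0, 8000.0)
print(f"arithmetic intensity: {example_ai:.1f} FLOP/byte")
```

The helper deliberately takes rates rather than raw counts, so the three inputs should cover the same marked region.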

I am not sure if these are the numbers I should look at and would be happy to be pointed in the right direction. The reason I am asking is that the intensity is higher than expected (compared with an estimate by hand and numbers reported in the literature). I get an intensity of around 30 FLOP/byte instead of something between 5 and 10.

Job Details (in case it is not just looking at the wrong numbers):

- I run an MPI-parallel job using all cores (64) of one socket on an AMD EPYC 7713

- I use the marker API around the region of interest

- The job is big enough to fall out of all caches (around 200 MB per rank, which is roughly 12.8 GB for the whole job), so it should not matter too much what data is in the L3 cache when the marker starts.

- No other jobs running.

- If useful I could add some output here or give more detailed explanations.

- Using likwid from current master.
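The working-set estimate in the list above can be checked with a few lines of Python. This is a sketch only: the variable names are made up for illustration, and the L3 size of 256 MB per socket is an assumption taken from the commonly listed EPYC 7713 specification, not a number from this thread.

```python
RANKS = 64                 # one MPI rank per core, as stated in the list above
MB_PER_RANK = 200          # approximate working set per rank, from the list above
ASSUMED_L3_MB = 256        # assumption: total L3 per socket on an EPYC 7713

total_mb = RANKS * MB_PER_RANK
total_gb = total_mb / 1000.0
ratio_to_l3 = total_mb / ASSUMED_L3_MB

print(f"total working set: {total_mb} MB ({total_gb:.1f} GB)")
print(f"working set / assumed L3 size: {ratio_to_l3:.0f}x")
```

Under that assumption the working set exceeds the L3 capacity by well over an order of magnitude, which is consistent with the claim that cache contents at marker start should not matter much.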

Thanks a lot in advance,

best regards,

René

Thomas Gruber

Jan 28, 2022, 8:09:21 AM

to likwid-users

Hi René,

Your approach looks sane to me:

ai = sum(FLOPS_DP) / sum(MEM1, MEM2)

There is no separate MEM_DP group because on AMD you cannot differentiate between DP and SP FP operations from the counter. So mixed SP/DP codes are hard to measure. Moreover, if you use some uncommon FP operations, they might not be included in the events. The events include 'ADD_SUB', 'MULT', 'DIV' and 'MAC' (=FMA).

Generally, AMD specifies the memory bandwidth metric only for systems in NPS1 mode.

I just did some simple measurements, and the memory bandwidth looks good (even though it requires two measurements and the system is in NPS4 mode):

$ likwid-perfctr -C 0 -g MEM1 -m likwid-bench -t copy_avx -w N:4GB:1

MByte/s: 25904.86

| Memory bandwidth (channels 0-3) [MBytes/s] | 38967.7380 |

$ likwid-perfctr -C 0 -g MEM2 -m likwid-bench -t copy_avx -w N:4GB:1

MByte/s: 25921.11

| Memory bandwidth (channels 4-7) [MBytes/s] | 104.5372 |

-> scale to include RFOs: (25904.86 MByte/s * 1.5) = 38857.29 MByte/s
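The factor 1.5 can be reproduced with a short Python sketch. It rests on assumptions not spelled out in the thread: that the likwid-bench figure counts one load and one store per copied element, and that each store additionally triggers a read-for-ownership (RFO) of its cache line, so the memory controller sees three transfers where the benchmark counts two. The function name is hypothetical.

```python
def traffic_with_rfo(reported_mbytes_s, loads=1, stores=1):
    """Scale a benchmark-reported bandwidth to include RFO traffic.

    Assumes every store causes one extra read (write-allocate),
    so actual traffic is loads + 2 * stores per iteration.
    """
    counted = loads + stores          # streams the benchmark accounts for
    actual = loads + 2 * stores       # streams the memory controller sees
    return reported_mbytes_s * actual / counted


reported = 25904.86                   # MByte/s printed by likwid-bench (MEM1 run)
measured = 38967.7380                 # MByte/s from the MEM1 group, channels 0-3

expected = traffic_with_rfo(reported)
relative_deviation = abs(measured - expected) / expected

print(f"expected with RFOs: {expected:.2f} MByte/s")
print(f"relative deviation from counter value: {relative_deviation:.2%}")
```

With the numbers from the MEM1 run, the scaled benchmark value and the counter-based value differ by well under one percent.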

I will try to add AMD Zen3 to the Accuracy Tests <https://github.com/RRZE-HPC/likwid/wiki/TestAccuracy>.

Best,
Thomas

René Heß

Jan 29, 2022, 4:48:26 AM

to 'Thomas Gruber' via likwid-users

Hi,

I also have some updates:

- I did the roofline analysis on a Haswell machine using Intel Advisor
(running on a single core) and got similar results.

- A more careful analysis by hand shows that the results might actually
be possible.

- The numbers I had in my head from the literature were for a
slightly different case.

All in all I guess that the numbers are indeed correct. I was just
expecting a lower intensity, so it really surprised me.

Thanks for the notes regarding flop counting on Zen3 and the NPS mode.
Our machine currently uses NPS1 so it should be fine.

Thanks a lot for your reply,
best regards,

René


Thomas Gruber

Feb 3, 2022, 9:02:41 AM

to likwid-users

Thanks for the updates. I'm glad you could solve your riddle.

Best,
Thomas
