AMD Milan Memory bandwidth


meta

Aug 9, 2024, 8:47:17 PM
to likwid-users
Hi everyone,
I am running a streaming app (triad kernel) with three arrays of 5G (5,000,000,000) entries of double (8 bytes each) on a Milan CPU (AMD EPYC 7313P 16-Core Processor). The expected data volume is
4 (3 arrays + 1 extra read before the write) * 5G (array size) * 8 bytes = 160 GB
I am using this command 

 likwid-perfctr -C 0 -g MEM1 stream.5G
 likwid-perfctr -C 0 -g MEM2 stream.5G

LIKWID gives me 12x the expected value. What am I missing? I am pretty sure that no other app is running during this experiment.

Memory data volume (channels 0-3) [GBytes] |   978.9239
Memory data volume (channels 4-7) [GBytes] |   977.7203
------------------------------------------------------------------------------------------------
1955 GB ~ 12x the expected value

Thomas Gruber

Aug 10, 2024, 6:27:18 AM
to likwid-users
Hello,

There is a PR for fixing the DataFabric units for AMD Zen4: https://github.com/RRZE-HPC/likwid/pull/618

The MEM1 and MEM2 groups use events that are no longer available on Zen4.

Best,
Tom

meta

Aug 10, 2024, 12:58:10 PM
to likwid-users
Thanks for your reply.
Isn't the AMD EPYC 7313P 16-Core Processor "Zen3"?
Which counters should we collect for memory activity on the different channels on AMD Genoa and Milan?

Thomas Gruber

Aug 11, 2024, 1:03:52 PM
to likwid-users
Sorry, I was answering while sitting at an airport gate shortly before boarding.

The linked PR also contains fixes for Zen3. I'm not really good at writing commit messages :)

Best,
Tom

meta

Aug 12, 2024, 12:32:24 AM
to likwid-users
Hi Tom,
Thanks for your reply during your trip.
I am using this branch: amd_zen4_df_unit

my commands are again:
likwid-perfctr -C 0 -g MEM2 stream.5G
likwid-perfctr -C 0 -g MEM1 stream.5G

|    DRAM_CHANNEL_2    |   DFC2  |   7635933709 |
|    DRAM_CHANNEL_3    |   DFC3  |   7635482682 |

+----------------------+---------+--------------+


Group 1: MEM2
+----------------------+---------+--------------+
|         Event        | Counter |  HWThread 0  |
+----------------------+---------+--------------+
|   ACTUAL_CPU_CLOCK   |  FIXC1  | 236267295736 |
|     MAX_CPU_CLOCK    |  FIXC2  | 190314462990 |
| RETIRED_INSTRUCTIONS |   PMC0  |       -      |
|  CPU_CLOCKS_UNHALTED |   PMC1  | 173163653472 |
|    DRAM_CHANNEL_4    |   DFC0  |   7642840419 |
|    DRAM_CHANNEL_5    |   DFC1  |   7642426898 |

I guess these are counts of cache lines (64 B), so that would be 28G * 64 = 1792 GB; however, the expected value is 5G * 8 (bytes) * 4 (arrays) = 160 GB.
Even at the expected bandwidth of 3 GB/s, that would only be about 30 GB (15 GB on CH0-3 and ~15 GB on CH4-7).
 
Which PMU (performance monitoring) events correspond to DRAM_CHANNEL_0-7, so that I can use them with perf stat -e XXXX? What is XXXX? Thanks.

Thomas Gruber

Aug 16, 2024, 2:06:14 PM
to likwid-users
Hi,

Are you sure you are executing the triad kernel only once? The default NTIMES in the original STREAM benchmark is 10. Then you would have 1600 GB vs. 1792 GB. With initialization and validation, this might be correct. Generally, the DataFabric events are not 100% accurate. The DataFabric is the interconnect, not the memory controllers.

I cannot really tell how you have to set it for perf stat, but it is probably something like amd_df/config=0x3807/ or similar. You have to check AMD's documentation for the exact config. LIKWID splits the documented config into multiple parts, and I cannot check on a Zen3 system at the moment how to combine them back. It depends on the formats in /sys/devices/amd_df/format. The unit name might differ, e.g. uncore_amd_df. It might also be that perf does not provide it on your system. This does not mean that the counters are not present, just that perf does not support them for your CPU model.

Best,
Tom

meta

Aug 22, 2024, 12:58:23 PM
to likwid-users
Hi @thomas,
In the LIKWID documentation, I found the following, which explains the PMU counters and umasks for the fabric events:

EVENT_DRAM_CHANNEL_0            0x07 DFC
UMASK_DRAM_CHANNEL_0            0x38

EVENT_DRAM_CHANNEL_1            0x47 DFC
UMASK_DRAM_CHANNEL_1            0x38

EVENT_DRAM_CHANNEL_2            0x87 DFC
UMASK_DRAM_CHANNEL_2            0x38

EVENT_DRAM_CHANNEL_3            0xC7 DFC
UMASK_DRAM_CHANNEL_3            0x38

EVENT_DRAM_CHANNEL_4            0x107 DFC
UMASK_DRAM_CHANNEL_4            0x38

EVENT_DRAM_CHANNEL_5            0x147 DFC
UMASK_DRAM_CHANNEL_5            0x38

EVENT_DRAM_CHANNEL_6            0x187 DFC
UMASK_DRAM_CHANNEL_6            0x38

EVENT_DRAM_CHANNEL_7            0x1C7 DFC
UMASK_DRAM_CHANNEL_7            0x38


Are these in total giving me the entire number of memory "cachelines"? Each cacheline is 64 B?
Assuming so, I get 20x the expectation -- details follow below.

Can you help me out? 

When I am using perf stat like this 
 perf stat -e r3807 -e r3847 -e r3887 -e r38c7 -e r38107 -e r38147 -e r38187 -e r381c7 ./stream_triad_5G_1x
-------------------------------------------------------------
STREAM version $Revision: 5.10 $

Performance counter stats for './stream_triad_5G_1x':

           354      r3807       (74.96%)
             0      r3847       (74.97%)
             0      r3887       (74.98%)
       728,818      r38c7       (75.00%)
           422      r38107      (75.03%)
 5,005,685,977      r38147      (75.04%)
46,096,463,802      r38187      (75.04%)
       723,644      r381c7      (74.98%)

      10.106396313 seconds time elapsed

     107.478371000 seconds user
     105.454696000 seconds sys


total events = 51,103,603,017 --> total data = 51,103,603,017 * 64 (cacheline size) bytes; expectation = 5G * 8 bytes * 4 (arrays) = 1.6E+11 bytes ~ 20x

meta

Aug 22, 2024, 2:39:10 PM
to likwid-users
Hi Likwid Team,
To collect all memory-controller reads and writes on Milan or Genoa (AMD Zen3 and Zen4), which PMU counters should I monitor?
Please give me the event numbers and umasks.

Thomas Gruber

Aug 22, 2024, 7:03:30 PM
to likwid-users
Hi,

As I wrote in my previous answer, you have to tell perf which unit to use. In LIKWID, you do that by selecting the right counters (DFC* = DataFabric counter). For perf, you have to specify the unit explicitly; if you do not specify anything, the 'cpu' unit is used.

This is not a forum for perf but for LIKWID. I am trying to help you and me (in case there is a bug), but your answers do not really help because there are no answers! You haven't even answered the simple question whether you run the triad kernel only once or multiple times. Unfortunately, you cannot send me the STREAM code, because publishing changed STREAM source code is against its license. If you have set NTIMES=1 in the original STREAM code, you will not get reliable results; it has to execute multiple times.

How about you switch to likwid-bench for this?

likwid-perfctr -C 0 -g MEM1 -m likwid-bench -t stream -W N:15GB
likwid-perfctr -C 0 -g MEM2 -m likwid-bench -t stream -W N:15GB

This runs the kernel A[i] = B[i] + scalar * C[i], where A, B, and C are each 5 GB in size, on a single core with only serial instructions. It outputs the bandwidth (MByte/s). Scale it up by 1.333 to include the write-allocate/read-for-ownership traffic, and compare this scaled-up bandwidth to the sum of the measurements (MEM1+MEM2). The same approach works for Zen4, but the groups are named differently.

Best,
Thomas (the LIKWID team)

P.S. Yes, the DataFabric events required for bandwidth are cacheline counts, so you have to scale the raw values up by 64 B. But this is not the case for all DataFabric events; check AMD's documentation for this.

meta

Aug 24, 2024, 10:55:19 PM
to likwid-users
Hi Thomas,

I'm sorry, I think I missed your previous questions.
I was running the triad only once; I changed the code so that it runs just once.
As you suggested, I ran those commands; for MEM1 and MEM2 I got 100 GBytes each:

+----------------------+---------+-------------+
|         Event        | Counter |  HWThread 0 |
+----------------------+---------+-------------+
|   ACTUAL_CPU_CLOCK   |  FIXC1  | 17136980000 |
|     MAX_CPU_CLOCK    |  FIXC2  | 13803410000 |
| RETIRED_INSTRUCTIONS |   PMC0  | 29687510000 |
|  CPU_CLOCKS_UNHALTED |   PMC1  | 17063080000 |
|    DRAM_CHANNEL_0    |   DFC0  |           0 |
|    DRAM_CHANNEL_1    |   DFC1  |           0 |
|    DRAM_CHANNEL_2    |   DFC2  |   786820500 |
|    DRAM_CHANNEL_3    |   DFC3  |   786840900 |
+----------------------+---------+-------------+

+--------------------------------------------+------------+
|                   Metric                   | HWThread 0 |
+--------------------------------------------+------------+
|             Runtime (RDTSC) [s]            |     4.6010 |
|            Runtime unhalted [s]            |     5.7123 |
|                 Clock [MHz]                |  3724.5263 |
|                     CPI                    |     0.5748 |
| Memory bandwidth (channels 0-3) [MBytes/s] | 21889.6846 |
| Memory data volume (channels 0-3) [GBytes] |   100.7143 |
+--------------------------------------------+------------+
I guess you are saying that we have 10 iterations * 15 GB (3 arrays * 5 GB) = 150 GB, scaled up by a factor of 1.3 to ~195 GB ~ 100 GB (MEM2) + 100 GB (MEM1).

I used the LIKWID marker API in my code to mark only the triad region and ran the command again:
likwid-perfctr -C 0 -g MEM1 -m ./stream_triad_likwid_5G_10x   --> This time I am running on three arrays of size 5G * 8 bytes for 10 iterations (NTIMES = 10), executing only the triad kernel,
so I am expecting 5GB * 8 * 3 (arrays) * 10 (times) = 1200GB   -->  (LIKWID gives me 2 * 812GB = 1600GB)
 I am getting 
| Memory bandwidth (channels 0-3) [MBytes/s] |  8658.9994 |
| Memory data volume (channels 0-3) [GBytes] |   812.2083 |

Thanks for your suggestion. 

meta

Aug 24, 2024, 10:56:57 PM
to likwid-users
I know this is not a perf forum and not a good place to ask, but it would be much appreciated if anyone could help. When I look at DRAM_CHANNEL_X in the LIKWID report, I get this:

DRAM_CHANNEL_0 0
DRAM_CHANNEL_1 0
DRAM_CHANNEL_2 6344369000
DRAM_CHANNEL_3 6344458000
DRAM_CHANNEL_4 6347401000
DRAM_CHANNEL_5 6348221000
DRAM_CHANNEL_6 0
DRAM_CHANNEL_7 0
==============================
Sum = 25G

but when I collect the corresponding PMU events with perf, I get much lower values than what LIKWID reports:

        1,297      r3807       (75.01%)
            0      r3847       (75.00%)
            0      r3887       (75.01%)
   18,992,415      r38c7       (75.00%)
        1,457      r38107      (75.00%)
5,322,830,413      r38147      (75.00%)
9,882,455,382      r38187      (74.99%)
   19,013,627      r381c7      (74.99%)
=============================
sum = 15G

You mentioned that I need to use the unit name (DFC) here, but in the perf stat command I can't find any relevant flag to enter the unit name. In perf, we can only define the event with an event number and umask.

Running likwid-perfctr -e gives me the following:

DRAM_CHANNEL_0, 0x7, 0x38, DFC
DRAM_CHANNEL_1, 0x47, 0x38, DFC
DRAM_CHANNEL_2, 0x87, 0x38, DFC
DRAM_CHANNEL_3, 0xC7, 0x38, DFC
DRAM_CHANNEL_4, 0x107, 0x38, DFC
DRAM_CHANNEL_5, 0x147, 0x38, DFC
DRAM_CHANNEL_6, 0x187, 0x38, DFC
DRAM_CHANNEL_7, 0x1C7, 0x38, DFC

Again, thanks for your suggestions, I appreciate your help and time.

Thomas Gruber

Aug 26, 2024, 2:56:48 AM
to likwid-users
Hi,

likwid-bench outputs the data volume and bandwidth. If it outputs a data volume of 150GB and you measure 200GB, it is correct.

You made a mistake calculating the data volume. It is not "5GB * 8 * 3 (arrays) * 10 (times) = 1200GB" but "5GB * 8 * (3 (arrays) + 1 write-allocate) * 10 (times) = 1600GB". In your first post, you had it correct with (3+1) accesses.

Best,
Thomas

Thomas Gruber

Aug 26, 2024, 3:13:09 AM
to likwid-users
You are comparing apples with oranges here. LIKWID measures at the DataFabric unit. Your perf settings are for the cpu-local unit.

The events 0x07 and 0xC7 do not exist for the cpu-local unit, so I have no clue what they count. There is the event 0x47 (MISALIGNED_LOADS_*), but the umask 0x38 is invalid; the same holds for 0x87 (STALLED_CYCLES_*). Depending on how perf handles it internally (ignoring excess bits, or they mean something else), r147 and r187 are valid event settings for MISALIGNED_LOADS_COUNT_MA_64B and STALLED_CYCLES_BACKEND, respectively.

Also, there commonly exist undocumented events.

You can define a separate unit in perf, and I already wrote how it is done. If you write r3807, this translates to cpu/config=0x07,umask=0x38/. You can get the option names from /sys/devices/<unit>/format (here, unit = cpu).

My AMD Milan test system also does not offer the amd_df or amd_l3 units. This does not mean that they are not present; that's one of the downsides of having the measurement support inside the kernel. The kernel's update rate is commonly slower than that of user-space applications, and not all fixes are backported to older kernel versions.


Best,
Thomas