Bytes transferred between LLC and DRAM

Alaul Monil

Jun 24, 2020, 6:17:10 PM

to ptools-perfapi

Hi,

I am using PAPI counters to calculate the bytes transferred between the LLC and DRAM on an Intel Broadwell processor (Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz) for a simple vector addition code.

Each iteration of the vector addition performs two loads and one store (streaming access).

So the total amount of data transferred between LLC and DRAM should equal the load + store bytes. (nvprof reports the same when we do this for a CUDA code.)

For CPUs:

The idea was to count the LLC misses and multiply by the cache line length to get the number of bytes transferred.

So we used PAPI_L3_TCM, but it only counts the misses for loads, not stores.

For example, for a 100M-element array of floats (two loads and one store): 100M * 4 * 3 [4 = size of float, 3 = 2 loads + 1 store]

= 1.2 GB

Dividing by the cache line size (64 bytes): 1.2 GB / 64 bytes

= roughly 18M cache lines (18.75M, to be exact) that should be transferred between LLC and DRAM.
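The arithmetic above can be checked with a minimal C sketch. The function name `expected_cache_lines` and its parameters are made up for illustration, and it assumes a pure streaming kernel in which every element of every array crosses the LLC-DRAM boundary exactly once.

```c
#include <assert.h>
#include <stdint.h>

/* Expected number of cache lines moved between LLC and DRAM for a
 * streaming kernel. Hypothetical helper, not part of PAPI. */
uint64_t expected_cache_lines(uint64_t elements,      /* elements per array   */
                              uint64_t element_bytes, /* 4 for float          */
                              uint64_t arrays,        /* 2 loads + 1 store = 3 */
                              uint64_t line_bytes)    /* 64 on this CPU       */
{
    assert(line_bytes > 0);
    uint64_t total_bytes = elements * element_bytes * arrays;
    /* Round up so that a partially used last line still counts. */
    return (total_bytes + line_bytes - 1) / line_bytes;
}
```

With 100M floats, three arrays, and 64-byte lines this gives 18,750,000 lines. With only the two load arrays it gives 12,500,000.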

But when we gather the PAPI counters, we get PAPI_L3_TCA = 18M (approx.) but PAPI_L3_TCM = 12M.

I have also used the native offcore counters OFFCORE_RESPONSE_0:L3_MISS and OFFCORE_RESPONSE_1:L3_MISS and found the same numbers. The cache-miss count only shows load misses.

Just to make sure, I removed the store from the vector addition, and the PAPI_L3_TCM value did not change. This confirms that neither PAPI_L3_TCM nor the native offcore counters include writes in the count.

Without the writes, I cannot produce the complete LLC-DRAM transfer figure. Please help me. What am I missing?

Thanks in advance.

Monil,

PhD student,

University of Oregon.

Lawrence Stewart (Larry)

Jun 24, 2020, 6:44:18 PM

to Alaul Monil, Lawrence Stewart (Larry), ptools-perfapi

It is <very hard> to make the L3 do what you want.

In particular, you kind of have to use 1 GB pages to guarantee that you get an even distribution across the entire L3 because that is the only way to control the physical addresses. If you just use malloc, you will likely get 4K pages with no particular distribution in physical memory. Same problem with the L2, which is 256K IIRC.

When I was doing this, I asked the sysadmin to set up HUGETLBFS with an adequate number of 1G pages, and then allocated memory there using mmap.

Without that, you could easily be heavily using part of the L3 and not using other parts at all.

I don’t think this is a total explanation, because obviously there isn’t enough L3 to hold all the missing misses. With only a 35 MB cache it isn’t going to shield the DRAM!

Also, a store to a cache line that misses in the L1 will show up at the L3 as a “read for ownership” rather than as a store. The eventual write-back to DRAM is only caused much later by an L1 eviction to make room for something else, followed by a later L2 eviction to make room for something else. Those are not coincident with <any> core activity, since the core stores are heavily buffered by the (large) store buffer and (large) command queue that connects the L2 to the uncore.

You may have better luck with the various offcore-reference counters.

Unfortunately I don’t have access to the PAPI setup I used for this sort of thing, so I can’t give better advice.

-L

--
You received this message because you are subscribed to the Google Groups "ptools-perfapi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ptools-perfap...@icl.utk.edu.
To view this discussion on the web visit https://groups.google.com/a/icl.utk.edu/d/msgid/ptools-perfapi/5ffad495-1913-4e42-8467-45b00551a4b8o%40icl.utk.edu.

John McCalpin

Jun 25, 2020, 12:37:34 PM

to Alaul Monil, ptools-perfapi

The “core” performance counters on Xeon E5 v4 (Broadwell EP) don’t have the ability to count writebacks from the L3 to DRAM. In the Xeon E5 v1 (Sandy Bridge EP) there was a bit in the OFFCORE_RESPONSE field that was supposed to allow you to count writebacks, but it never worked, and it was removed from the tables for subsequent processor generations – up to and including the next-generation Ice Lake platform. This is all documented in the many sections (one per processor family) entitled “Off-core Response Performance Monitoring” in Chapter 18 of Volume 3 of the Intel Architecture SW Developer’s Manual (document 325384-072, May 2020, available from https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html).

So if you need to count WriteBacks from L3 to DRAM, you will need to use performance counters in the “uncore”. The uncore has a lot of different “boxes” – the ones that can count L3 to memory traffic are:

many “Caching Agents” (CBo) – one per L3 slice (==16 on the Xeon E5-2683 v4)
two “Home Agent” (HA) interfaces (one per memory controller)
two “Integrated Memory Controllers” (IMC), each controlling two DRAM channels.

I typically use the IMC performance counters to measure memory write traffic because they are easy to use and I have never found an implementation in which they give incorrect/unexpected counts. The downside of using the IMC counters is that you lose the ability to tie counts back to threads/cores, and you get the counts for everything happening in the system (all user apps, all OS behavior, all IO). (Counting everything may not be a negative – it is a good way to verify that the system is actually idle except for your user job.) The IMC counters are easy to use because you only have to deal with counters for 4 DDR4 channels, and you can easily get all the information you need from a single set of four performance counter events.

Each “Home Agent” handles the coherence between the mesh and the memory controller. These can also count memory reads and writes, and there are only two of them (vs 4 DRAM channels), so a bit less to count. The Home Agent should be able to use opcode filtering to distinguish between memory writes due to L3 writebacks and those due to other sources (e.g., IO, streaming stores), but it doesn’t provide the option to tie counts back to individual cores.

Each CBo has lots of performance monitoring events, including writebacks (LLC_VICTIMS.M_STATE). The CBo counters can filter events to those coming from a single Logical Processor, but writebacks are specifically *not* associated with any specific logical processor. (They can still be counted, but they require that you set a bit field to count “non-thread-related events”.)

Some installations of PAPI provide uncore support, but it will depend on the PAPI installation, the underlying OS support for your chip, and the security configuration of your OS (perf_event_paranoid). Recent versions of Linux should have mature support for the Xeon E5 v4.

--

John D. McCalpin, Ph.D.

Texas Advanced Computing Center

University of Texas at Austin

https://www.tacc.utexas.edu/about/directory/john-mccalpin

--

Anthony Danalis

Jun 25, 2020, 4:41:36 PM

to John McCalpin, Alaul Monil, ptools-perfapi

Thanks for the elaborate explanation John.

We also use the IMC counters for internal testing, and on Broadwell specifically you can look at bdx_unc_imc[0|1|4|5]::UNC_M_CAS_COUNT:[RD|WR]:cpu=0 (or whatever CPU your code is bound to, if not zero).

Keep in mind that the uncore component does _not_ have the strict restrictions of the core component on the number of counters that can be measured simultaneously, so you can measure a lot of things in one run. Also, the uncore component is built by default, you don't have to do anything, but as John pointed out, you do need to set the paranoid flag (run "echo 0 > /proc/sys/kernel/perf_event_paranoid" as root), or execute your test programs with root permissions (not highly recommended).
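As an illustration of the `[0|1|4|5]` and `[RD|WR]` shorthand above, here is a small C sketch that expands it into the eight full event names. The helper names are hypothetical. The resulting strings are meant to be handed to PAPI's named-event interface (for example `PAPI_add_named_event`), which is not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* IMC PMU indices named in the message above for this Broadwell part. */
static const int bdx_imc_units[] = {0, 1, 4, 5};

/* Format one uncore event name, e.g.
 * "bdx_unc_imc0::UNC_M_CAS_COUNT:RD:cpu=0".
 * direction must be "RD" or "WR"; cpu is a core on the socket to monitor.
 * Returns the snprintf result (length written, or that would be written). */
int imc_event_name(char *buf, size_t len, int imc, const char *direction, int cpu)
{
    assert(strcmp(direction, "RD") == 0 || strcmp(direction, "WR") == 0);
    return snprintf(buf, len, "bdx_unc_imc%d::UNC_M_CAS_COUNT:%s:cpu=%d",
                    imc, direction, cpu);
}

/* Fill names[0..7] with the RD and WR events for all four channels.
 * Returns the number of names written (always 8). */
int imc_all_event_names(char names[][64], int cpu)
{
    static const char *dirs[] = {"RD", "WR"};
    int n = 0;
    for (int i = 0; i < 4; i++)
        for (int d = 0; d < 2; d++)
            imc_event_name(names[n++], 64, bdx_imc_units[i], dirs[d], cpu);
    return n;
}
```

Summing the four RD counts and the four WR counts, then multiplying by the 64-byte line size, gives the read and write byte totals for the socket.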

thanks,

Anthony


Alaul Monil

Jun 26, 2020, 7:01:13 AM

to ptools-perfapi, mcca...@tacc.utexas.edu, mon...@gmail.com

Thanks a lot, John for your explanation. Very informative. (currently working on it).

Thanks, Anthony, thanks for the help.

(I will probably get back with more questions.)

Thanks.


Alaul Monil

Jul 6, 2020, 5:33:05 AM

to ptools-perfapi, mcca...@tacc.utexas.edu, mon...@gmail.com, adan...@icl.utk.edu, lste...@gmail.com

Thanks a lot, everyone. I was able to measure L3-DRAM traffic (RD and WR) following your suggested method. (using the UNC_M_CAS_COUNT). And it matches with the "should be" transfer considering a streaming access pattern.

I have two more questions.

1. I used this counter (as @Anthony suggested) bdx_unc_imc[0|1|4|5]::UNC_M_CAS_COUNT:[RD|WR]:cpu=x, where x is the cpu core. If I run my code in multiple cores, then what should be the counter name? And where can I find the explanation of these uncore counters?

2. I used a strided code to observe the impact on memory read-write.

void vecMul(float *a, float *b, float *c, int n)
{
    int stride = 200;  /* varied per run: 1, 8, 20, 40, 60, 100, 200 */
    for (int i = 0; i < n; i = i + stride)
    {
        c[i] = a[i] * b[i];
    }
}

For an array size of 100M and varying strides, we got the results below:

Stride	Read count	Write count
1	13,171,678.20	6,832,089.20
8	13,092,354.44	6,750,445.80
20	13,040,420.45	6,745,406.20
40	11,854,984.17	6,713,341.40
60	8,645,749.53	6,657,180.20
100	2,390,520.03	6,511,863.80
200	1,373,017.14	6,442,882.00

The read counts (multiplied by the cache line length) make sense: as the stride grows, the read count drops. (At first it stays constant because several accesses fall in the same cache line; then it drops significantly.)

But the write counts stay nearly constant even though we are writing far less data. For example, with stride 200 on 100M elements, we only write 0.5M times, yet the counters report write counts as high as for streaming (stride = 1) access.
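The "0.5M" estimate can be reproduced with a short C sketch that counts the distinct cache lines one array sees for a given stride. The function name `lines_touched` is hypothetical. It assumes the array starts on a cache-line boundary and ignores prefetching and any other hardware effects.

```c
#include <assert.h>
#include <stdint.h>

/* Distinct cache lines touched in ONE array by the loop
 *     for (i = 0; i < n; i += stride) ... x[i] ...
 * assuming x is aligned to a cache-line boundary. */
uint64_t lines_touched(uint64_t n, uint64_t stride,
                       uint64_t element_bytes, uint64_t line_bytes)
{
    assert(n > 0 && stride > 0 && element_bytes > 0 && line_bytes > 0);
    uint64_t accesses = (n + stride - 1) / stride;   /* loop trip count */
    if (stride * element_bytes >= line_bytes)
        return accesses;             /* each access lands in its own line */
    /* Stride shorter than a line: no line is skipped, so count the lines
     * up to and including the one holding the last accessed element. */
    uint64_t last_index = (accesses - 1) * stride;
    return last_index * element_bytes / line_bytes + 1;
}
```

For 100M floats and 64-byte lines this gives 6,250,000 lines at stride 1 and stride 8, 5,000,000 at stride 20, and 500,000 at stride 200. The measured write count at stride 200 (about 6.44M) is close to the stride-1 figure, not to 500,000.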

What could be the reason for such behavior? Why would the hardware initiate all these unnecessary write-back transfers?

Please let me know what you think.

Thanks.

Monil.

John McCalpin

Jul 6, 2020, 12:14:40 PM

to Alaul Monil, ptools-perfapi, adan...@icl.utk.edu, lste...@gmail.com

The documentation for the uncore performance monitoring units in the Xeon E5 v4 is not one of the easier files to find on the Intel web site.

Most of the uncore manuals are linked near the bottom of https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html, but the Xeon E5 v4 is not in that bunch. The document number is 334291, and I found a link to it at https://software.intel.com/content/www/us/en/develop/blogs/documentation-for-uncore-performance-monitoring-units.html
The PDF is in a zip archive with some supporting files, available at https://www.intel.com/content/dam/www/public/us/en/zip/xeon-e5-e7-v4-uncore-performance-monitoring.zip

The counts for your test are going to be strongly influenced by compiler options and by the operation of the hardware prefetchers.

Starting with the unit stride numbers….

your read counts are about 5% higher than the number of lines in 2 arrays and
your write counts are about 9% higher than the number of lines in one array.
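The two percentages can be reproduced from the stride-1 row of the table, assuming 100M-element float arrays and 64-byte lines (6.25M lines per array). The helper name `excess_percent` is made up for this sketch.

```c
#include <assert.h>

/* Percentage by which a measured count exceeds the expected count. */
double excess_percent(double measured, double expected)
{
    assert(expected > 0.0);
    return (measured - expected) / expected * 100.0;
}
```

With the measured 13,171,678.20 reads against 12.5M expected lines (two arrays) the excess is about 5.4%. With 6,832,089.20 writes against 6.25M lines (one array) it is about 9.3%.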

The compiler has two options for the type of store instruction to use for your kernel.

When the “ordinary” (default) store instruction misses in a cache, the target cache line is *read* into the cache.

For the kernel c[i] = a[i] * b[i], this would generate reads for *three* arrays, not reads for the *two* arrays that your counts suggest.
This suggests that your code is compiled with the other kind of store instructions….

The other option for stores is the “non-temporal” (aka “streaming”) store instruction.

These do not read the data into the cache. Instead, they collect the stored data in a “write-combining” buffer and send the data directly from the core to DRAM.
These stores work best when the data is contiguous and is not expected to be re-used.
Why contiguous?

If the code writes to all of the bytes in a 64-Byte write-combining buffer, then the hardware knows that every byte in the cache line has been modified, so it is safe to write the data directly to DRAM.
If the code does *not* write to all of the bytes in the write-combining buffer, then the memory controller must read the cache line from memory so that it can correctly merge the previous bytes with the newly updated bytes. The “read” part of this operation is called an “underfill read” and it can be counted separately from “regular” cache line reads.
The specific definition of the UNC_M_CAS_COUNT:RD event is, unfortunately, buried in the kernel source code in files that seem to move about randomly from place to place….

In CentOS 7.6, the event is defined in a way that should include underfill read transactions, but other versions would need to be checked.

This is described (mostly implicitly) in the descriptions of several IMC performance monitoring events in section 2.6.7 of the Xeon E5 v4 Uncore Performance Monitoring Reference Manual.
(Aside: standard DDR2/3/4 DRAM does not support writes of subsets of the bytes of a cache line, but even if it did, this could not be used in server products – the memory controller needs to merge the original byte values with the updated byte values in order to compute (and re-write) the error-correction codes for the cache line.)

You can control the generation of store instructions with the Intel C compiler using the compiler flag “-qopt-streaming-stores never” or “-qopt-streaming-stores always”.

Streaming stores are not generated very often by the compiler – mostly in loops that look very much like yours and that have very large trip counts that are visible to the compiler.
The existence of streaming store instructions is partly my fault – they significantly improve the performance of the STREAM benchmark (http://www.cs.virginia.edu/stream/).
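For compilers without such a flag, a streaming store can also be requested by hand. The following is a minimal sketch using the SSE intrinsic `_mm_stream_ps`, not the code the Intel compiler would generate. The function name `vec_mul_stream` is hypothetical, and the routine falls back to ordinary stores when SSE is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_loadu_ps, _mm_mul_ps, _mm_stream_ps, _mm_sfence */
#endif

/* c[i] = a[i] * b[i], contiguous, using non-temporal stores where possible.
 * c must be 16-byte aligned because _mm_stream_ps requires it. */
void vec_mul_stream(const float *a, const float *b, float *c, size_t n)
{
    assert(((uintptr_t)c & 15u) == 0);
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        /* Non-temporal store of 16 bytes; four consecutive stores
         * cover one full 64-byte line. */
        _mm_stream_ps(c + i, prod);
    }
    _mm_sfence();  /* order the streamed stores before later stores */
#endif
    for (; i < n; i++)  /* scalar tail, or the whole loop without SSE */
        c[i] = a[i] * b[i];
}
```

Because the loop writes every byte of each destination line, it matches the contiguous case described above. A strided variant would leave lines only partly written.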

Intel processors have very aggressive hardware prefetchers that improve performance for contiguous accesses and for fixed-stride accesses (with strides less than ~2KiB).

The algorithms and heuristics used by the hardware prefetchers have not been fully disclosed.
Some clues are provided in the Intel Optimization Reference Manual (https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-optimization-reference-manual.html)
I don’t know of any comprehensive evaluation of the behavior of the hardware prefetchers (especially for strided stores), but Intel has provided information on how to selectively control the prefetchers in many of their processors: https://software.intel.com/content/www/us/en/develop/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors.html

Testing with the prefetchers enabled and disabled is especially important when trying to understand performance counters related to cache and memory access.

Additional information to help understand your results may be available in performance monitoring units of other “boxes”.

By the time transactions get to the IMC, information about the source of the transaction has been lost.

For example, at the IMC it is not possible to distinguish between prefetch and demand load operations.

The core performance counter event “OFFCORE_RESPONSE” can be programmed to count transactions initiated by the L2 hardware prefetchers, including: reads to L2, reads to L3, Reads for Ownership (RFOs) to L2, RFOs to L3.
The CBo performance counters are capable of counting full-cacheline streaming stores and partial-cacheline streaming stores. (Not easy to use and probably not supported by the OS.)

Summary

You should try compiling with flag “-qopt-streaming-stores never” and with “-qopt-streaming-stores always” and compare the counter results.
You should try disabling all the hardware prefetchers and compare the counter results.
There are no guarantees that results will make sense -- this “simple” stuff is horribly complex and getting worse – but it is usually possible to make forward progress.

john

Heike Jagode

Jul 7, 2020, 3:25:09 PM

to Alaul Monil, ptools-perfapi, John McCalpin, Anthony Danalis, Lawrence Stewart

Regarding your question about the counter name and the cpu=x modifier:
Uncore events (such as ..._unc_...) are per-package (not per-process
like core events). Therefore, you need to make sure you are specifying
the CPU package to monitor. You can make this specification with the
"cpu=x" modifier. You can use lscpu on the respective node to see the
distribution of CPU-core identifiers across the sockets.
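Since the counts are per package, this suggests that one `cpu=x` value per socket is enough even when the code runs on several cores. A minimal C sketch of picking those values follows. The helper name and the CPU-to-socket table are hypothetical, and on a real node the table would be filled in from the `lscpu` output.

```c
#include <assert.h>

/* socket_of_cpu[i] is the package (socket) id of logical CPU i.
 * Writes one representative CPU per socket into cpu_for_socket[s]
 * (or -1 if no CPU was seen for socket s) and returns the number of
 * sockets that received a representative. */
int pick_cpu_per_socket(const int *socket_of_cpu, int ncpus,
                        int *cpu_for_socket, int max_sockets)
{
    assert(ncpus >= 0 && max_sockets >= 0);
    int found = 0;
    for (int s = 0; s < max_sockets; s++)
        cpu_for_socket[s] = -1;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        int s = socket_of_cpu[cpu];
        if (s >= 0 && s < max_sockets && cpu_for_socket[s] < 0) {
            cpu_for_socket[s] = cpu;  /* lowest-numbered CPU on this socket */
            found++;
        }
    }
    return found;
}
```

Each returned CPU number can then be used as the `x` in `cpu=x` for the IMC events of that socket.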

We have an example on the PAPI wiki that shows how to count uncore events:
https://bitbucket.org/icl/papi/src/master/src/components/perf_event_uncore/README.md#markdown-header-measuring-uncore-events

Thanks,
Heike


--
______________________________________
Heike Jagode, Ph.D., Research Asst. Professor
Innovative Computing Laboratory, University of Tennessee Knoxville
http://icl.utk.edu/~jagode/

Alaul Monil

Jul 24, 2020, 7:28:03 AM

to ptools-perfapi, mon...@gmail.com, mcca...@tacc.utexas.edu, adan...@icl.utk.edu, lste...@gmail.com

Thank you, professor Jagode.

Alaul Monil

Jul 24, 2020, 7:37:28 AM

to ptools-perfapi, mon...@gmail.com, adan...@icl.utk.edu, lste...@gmail.com

Hi John,

Thanks again for your informative reply.

These documents helped me a lot in understanding the counters and hardware prefetching on Intel processors.

As per your suggestion, I ran some experiments.

Here is what I found:

1. If the strides are long, the streaming-store compiler options do not have any impact (as expected). However, this still does not explain the high write counts compared to the reads.

2. I also disabled all hardware prefetchers; that did not have any impact on the write count either. As shown in the graph, only the read counts are affected when prefetching is disabled. The write counts stay almost constant.

Please let me know if you have any more suggestions.

Thanks.

Monil.
