--
You received this message because you are subscribed to the Google Groups "ptools-perfapi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ptools-perfap...@icl.utk.edu.
To view this discussion on the web visit https://groups.google.com/a/icl.utk.edu/d/msgid/ptools-perfapi/5ffad495-1913-4e42-8467-45b00551a4b8o%40icl.utk.edu.
The “core” performance counters on Xeon E5 v4 (Broadwell EP) don’t have the ability to count writebacks from the L3 to DRAM. In the Xeon E5 v1 (Sandy Bridge EP) there was a bit in the OFFCORE_RESPONSE field that was supposed to allow you to count writebacks, but it never worked, and it was removed from the tables for subsequent processor generations – up to and including the next-generation Ice Lake platform. This is all documented in the sections (one per processor family) entitled “Off-core Response Performance Monitoring” in Chapter 18 of Volume 3 of the Intel Architecture SW Developer’s Manual (document 325384-072, May 2020, available from https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html).
So if you need to count writebacks from L3 to DRAM, you will need to use performance counters in the “uncore”. The uncore has a lot of different “boxes” – the ones that can count L3 to memory traffic are the memory controllers (IMC), the Home Agents, and the CBos:
I typically use the IMC performance counters to measure memory write traffic because they are easy to use and I have never found an implementation in which they give incorrect/unexpected counts. The downside of using the IMC counters is that you lose the ability to tie counts back to threads/cores, and you get the counts for everything happening in the system (all user apps, all OS behavior, all IO). (Counting everything may not be a negative – it is a good way to verify that the system is actually idle except for your user job.) The IMC counters are easy to use because you only have to deal with counters for 4 DDR4 channels, and you can easily get all the information you need from a single set of four performance counter events.
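As a rough illustration of how the IMC counts turn into traffic numbers, here is a minimal Python sketch. It assumes you have already read the per-channel CAS write counts before and after your test (on Linux these are usually exposed by perf under names along the lines of uncore_imc_N/cas_count_write/, but check what your own kernel provides). The function names and the sample numbers are made up for the example; the only hardware fact used is that each CAS command moves one 64-byte cache line.

```python
# Sketch only: helper names and sample counts are hypothetical.
CACHE_LINE_BYTES = 64  # one DRAM CAS command transfers one 64-byte cache line


def dram_write_bytes(cas_write_counts):
    """Sum per-channel CAS write counts and convert the total to bytes."""
    return sum(cas_write_counts) * CACHE_LINE_BYTES


def write_bandwidth_gbs(cas_before, cas_after, seconds):
    """Write bandwidth in GB/s from two snapshots of the per-channel counts."""
    deltas = [after - before for before, after in zip(cas_before, cas_after)]
    return dram_write_bytes(deltas) / seconds / 1e9


if __name__ == "__main__":
    # Made-up snapshots for the 4 DDR4 channels, taken 1 second apart.
    before = [1_000, 2_000, 3_000, 4_000]
    after = [1_001_000, 1_002_000, 1_003_000, 1_004_000]
    print(write_bandwidth_gbs(before, after, seconds=1.0), "GB/s")
```

Because these counts are system-wide, it is worth taking the same pair of snapshots on an idle system first and subtracting that background from the measured run.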
Each “Home Agent” handles the coherence between the mesh and the memory controller. These can also count memory reads and writes, and there are only two of them (vs 4 DRAM channels), so a bit less to count. The Home Agent should be able to use opcode filtering to distinguish between memory writes due to L3 writebacks and those due to other sources (e.g., IO, streaming stores), but it does not provide the option to tie counts back to individual cores.
Each CBo has lots of performance monitoring events, including writebacks (LLC_VICTIMS.M_STATE). The CBo counters can filter events to those coming from a single Logical Processor, but writebacks are specifically *not* associated with any specific logical processor. (They can still be counted, but they require that you set a bit field to count “non-thread-related events”.)
Some installations of PAPI provide uncore support, but it will depend on the PAPI installation, the underlying OS support for your chip, and the security configuration of your OS (perf_event_paranoid). Recent versions of Linux should have mature support for the Xeon E5 v4.
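Before chasing PAPI configuration it can help to check what the OS itself exposes. The sketch below is one way to do that from Python, assuming a Linux system that uses the standard /proc/sys/kernel/perf_event_paranoid file and the /sys/bus/event_source/devices directory; the helper names are hypothetical. As far as I know, uncore events are system-wide, so an unprivileged user generally needs perf_event_paranoid to be 0 or lower (or needs root) to read them.

```python
# Sketch only: helper names are hypothetical; the paths are the usual Linux ones.
from pathlib import Path


def read_paranoid(path="/proc/sys/kernel/perf_event_paranoid"):
    """Return the perf_event_paranoid setting as an int, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def list_uncore_pmus(devices_dir="/sys/bus/event_source/devices"):
    """Return the sorted names of uncore PMUs that the kernel has registered."""
    root = Path(devices_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.name.startswith("uncore_"))


if __name__ == "__main__":
    print("perf_event_paranoid:", read_paranoid())
    print("uncore PMUs:", list_uncore_pmus())
```

If the second list is empty, the kernel has no uncore support loaded for the chip, and no PAPI installation on that system will be able to provide uncore events either.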
--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin
--
The documentation for the uncore performance monitoring units in the Xeon E5 v4 is not one of the easier files to find on the Intel web site.
The counts for your test are going to be strongly influenced by compiler options and by the operation of the hardware prefetchers.
Summary
john