On Wed, 15 Feb 2017, Waruna Ranasinghe wrote:
> I want to count LLC misses on an Intel Broadwell CPU. I'm trying to use the
> counter "OFFCORE_RESPONSE:request=DEMAND_DATA_RD:response=ANY_RESPONSE".
> Is it possible to use this counter in PAPI? If so, can anyone tell me which
> functions to use?
>
> Or is there another way to count LLC data misses in Broadwell?
Does PAPI_L3_TCM (which maps to LLC_MISSES) not work?
The OFFCORE_RESPONSE way should work under PAPI (although I can't comment
on whether it's better than LLC_MISSES or not). Does PAPI throw an error
if you try to use it?
Vince
Try running the utility "papi_native_avail" on your machine to see the
exact name of the event and its modifiers. On a machine I tested, the
proper name is "OFFCORE_RESPONSE_0:ANY_DATA_RD:LOCAL_DRAM"
As John McCalpin explains in the Intel forum
(https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/559227)
the answer to your first question is the second option (you can
measure two things at the same time).
As for your second question: if your machine has multiple CPUs, a
program will measure events on the CPU it's running on. Regarding
cores, this is an off-core event, so it applies to the whole CPU.
Thanks,
Anthony
The OFFCORE_RESPONSE events are a bit of a hybrid – they measure transactions between the core’s private L2 and the “ring” interconnect of the uncore, but they are limited to events caused by the specific logical processor that programmed the event.
(Aside: The location is one of the reasons that you can’t measure the “write” side of memory bandwidth with these counters. Sitting between the L2 and the “ring”, there is no way to be notified if a cache line that was written by this core gets evicted from the L3 to DRAM. The “WriteBack” event measures writebacks of dirty cache lines from the L2 to the L3 – not from the L3 to memory – so it can’t be used for memory bandwidth in general.)
So these events are “per-logical processor” (just like almost all of the rest of the core performance counters). It is a bit harder to tell what is going on if software gets involved. ☹
Typically, the “perf events” subsystem will save and restore all of the core performance counters (including the OFFCORE_RESPONSE auxiliary MSRs) on any context switch. This is an attempt to attribute the counts to a specific process, even if it is migrated to run on a different logical processor. PAPI uses the “perf events” subsystem, so this virtualization should apply.
With multi-threaded software, there are more opportunities for confusion. PAPI can be configured to report results for a single thread, or for all of the threads underneath a process.
It is best to configure PAPI to report results at the thread level, and to bind each thread to a separate logical processor. This will prevent the OS from migrating the threads (which forces lots of extra cache misses when the thread starts up again on the “new” core).
While we are here, it is important to note that the “LLC_MISSES” event on Intel processors typically maps to a specific “architectural” performance event (Event 0x2E, Umask 0x41). In Section 18.2.1.2 of Volume 3 of the Intel Architectures Software Developer’s Manual (document 325384, revision 060, September 2016), Intel warns that this event may contain “implementation-specific” characteristics. On most recent Intel systems, this event counts demand loads that miss in the L3, demand stores that miss in the L3, and L1 hardware prefetches that miss in the L3. In my experience, it does *not* count traffic due to L2 hardware prefetches that miss in the L3. These L2 hardware prefetches often make up the majority of the data traffic, so the counter is not useful for estimating bulk traffic. Instead, it is intended to identify loads that were not prefetched in time to be in the L3. These loads are the most likely to cause processor stalls, as the latency of these L3 misses is much too high to be “hidden” (overlapped) by the out-of-order execution mechanisms of the core.
--
John D. McCalpin, Ph.D.
Texas Advanced Computing Center
University of Texas at Austin
--
You received this message because you are subscribed to the Google Groups "ptools-perfapi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ptools-perfap...@icl.utk.edu.