Computing LLC misses


Waruna Ranasinghe

Feb 15, 2017, 11:29:35 AM
to ptools-...@eecs.utk.edu
Hi all,

I want to count LLC misses on an Intel Broadwell CPU. I'm trying to use the counter "OFFCORE_RESPONSE:request=DEMAND_DATA_RD:response=ANY_RESPONSE".
Is it possible to use this counter in PAPI? If so, can anyone tell me which functions to use?

Or is there another way to count LLC data misses on Broadwell?


Thanks,
Waruna


Vince Weaver

Feb 15, 2017, 11:34:12 AM
to Waruna Ranasinghe, ptools-...@eecs.utk.edu
Does PAPI_L3_TCM (which maps to LLC_MISSES) not work?

The OFFCORE_RESPONSE way should work under PAPI (although I can't comment
on whether it's better than LLC_MISSES or not). Does PAPI throw an error
if you try to use it?

Vince

Waruna Ranasinghe

Feb 15, 2017, 1:45:27 PM
to Vince Weaver, ptools-...@eecs.utk.edu
On Wed, Feb 15, 2017 at 9:34 AM, Vince Weaver <vincent...@maine.edu> wrote:
> Does PAPI_L3_TCM (which maps to LLC_MISSES) not work?

The counter works, but I'm not sure what it counts. Is it just the total data cache misses, or does it also include instruction cache misses? What about prefetches?

> The OFFCORE_RESPONSE way should work under PAPI (although I can't comment
> on whether it's better than LLC_MISSES or not). Does PAPI throw an error
> if you try to use it?

I'm not sure how to use it with PAPI.

I tried the following code and it returned "Event name to code failed: -7". Is there another way to work with "OFFCORE_RESPONSE"?

sprintf(EventCodeStr, "OFFCORE_RESPONSE:request=ALL_DATA_RD:response=L3_MISS.LOCAL_DRAM");
if ((retval = PAPI_event_name_to_code(EventCodeStr, &native)) != PAPI_OK) {
    printf("Event name to code failed: %d\n", retval);
    exit(1);
}

/* Add it to the eventset */
if ((retval = PAPI_add_event(EventSet, native)) != PAPI_OK) {
    printf("Event add failed: %d\n", retval);
    exit(1);
}


Thanks,
Waruna
 


Anthony Danalis

Feb 15, 2017, 1:53:38 PM
to Waruna Ranasinghe, ptools-...@eecs.utk.edu
Try running the utility "papi_native_avail" on your machine to see the
exact name of the event and its modifiers. On a machine I tested, the
proper name is "OFFCORE_RESPONSE_0:ANY_DATA_RD:LOCAL_DRAM".

Thanks,
Anthony
> --
> You received this message because you are subscribed to the Google Groups
> "ptools-perfapi" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to ptools-perfap...@icl.utk.edu.
> To post to this group, send email to ptools-...@icl.utk.edu.
> Visit this group at
> https://groups.google.com/a/icl.utk.edu/group/ptools-perfapi/.

Waruna Ranasinghe

Feb 15, 2017, 4:17:33 PM
to Anthony Danalis, ptools-...@eecs.utk.edu
Thanks for the pointer; it worked. But I have some related questions that I could not answer from reading Intel's manual.

There are two events called OFFCORE_RESPONSE_0 and OFFCORE_RESPONSE_1.

Q1: What is the purpose of having two events?
    If I want to count all the LLC misses, do I need to add up both OFFCORE_RESPONSE_0:ANY_DATA_RD:LOCAL_DRAM and OFFCORE_RESPONSE_1:ANY_DATA_RD:LOCAL_DRAM?

    Or is it the case that I can make two distinct measurements using these two events, e.g.
    OFFCORE_RESPONSE_0:ANY_DATA_RD:LOCAL_DRAM and OFFCORE_RESPONSE_1:PF_DATA_RD:LOCAL_DRAM?

Q2: Does this event account for requests from all the CPU cores, or only from core 0?


Thanks,
Waruna

Anthony Danalis

Feb 15, 2017, 4:55:24 PM
to Waruna Ranasinghe, ptools-...@eecs.utk.edu
As John McCalpin explains in the Intel forum
(https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/559227),
the answer to your first question is the second option (you can
measure two things at the same time).

As for your second question: if your machine has multiple CPUs, a
program measures events on the CPU it is running on. Regarding cores,
this is an off-core event, so it applies to the whole CPU.

Thanks,
Anthony

Waruna Ranasinghe

Feb 15, 2017, 4:59:28 PM
to Anthony Danalis, ptools-...@eecs.utk.edu
Thanks for the response.
--Waruna
 


John McCalpin

Feb 16, 2017, 3:16:33 PM
to Waruna Ranasinghe, ptools-...@eecs.utk.edu

The OFFCORE_RESPONSE events are a bit of a hybrid – they measure transactions between the core’s private L2 and the “ring” interconnect of the uncore, but they are limited to events caused by the specific logical processor that programmed the event.

 

(Aside: The location is one of the reasons that you can’t measure the “write” side of memory bandwidth with these counters.   Sitting between the L2 and the “ring”, there is no way to be notified if a cache line that was written by this core gets evicted from the L3 to DRAM.  The “WriteBack” event measures writebacks of dirty cache lines from the L2 to the L3 – not from the L3 to memory – so it can’t be used for memory bandwidth in general.)

 

So these events are “per-logical processor” (just like almost all of the rest of the core performance counters). It is a bit harder to tell what is going on if software gets involved.  

 

Typically, the “perf events” subsystem will save and restore all of the core performance counters (including the OFFCORE_RESPONSE auxiliary MSRs) on any context switch.  This is an attempt to attribute the counts to a specific process, even if it is migrated to run on a different logical processor.   PAPI uses the “perf events” subsystem, so this virtualization should apply.    

 

With multi-threaded software, there are more opportunities for confusion.   PAPI can be configured to report results for a single thread, or for all of the threads underneath a process.   

 

It is best to configure PAPI to report results at the thread level, and to bind each thread to a separate logical processor.  This will prevent the OS from migrating the threads (which forces lots of extra cache misses when the thread starts up again on the “new” core).
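
As a rough illustration of the binding step, here is a minimal sketch for Linux with glibc. It uses only sched_getaffinity and sched_setaffinity; the helper name pin_calling_thread and its "n-th allowed CPU" convention are assumptions made for this example and are not part of PAPI. Each worker thread would call it with a distinct index before starting its counters.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* required for cpu_set_t macros and sched_setaffinity */
#endif
#include <assert.h>
#include <sched.h>

/* Hypothetical helper (not a PAPI call): pin the calling thread to the
 * n-th logical processor it is currently allowed to run on.
 * Returns the chosen CPU number, or -1 on failure. */
int pin_calling_thread(int n)
{
    cpu_set_t allowed;

    assert(n >= 0);

    /* pid 0 means "the calling thread" for both affinity calls. */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;            /* not permitted for this thread */
        if (n-- > 0)
            continue;            /* skip until we reach the n-th allowed CPU */

        cpu_set_t target;
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);
        return sched_setaffinity(0, sizeof(target), &target) == 0 ? cpu : -1;
    }
    return -1;                   /* fewer than n+1 CPUs are available */
}
```

Picking from the current affinity mask, rather than hard-coding CPU numbers, keeps the sketch usable inside cpusets or containers where CPU 0 may not be available.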

 

While we are here, it is important to note that the “LLC_MISSES” event on Intel processors typically maps to a specific “architectural” performance event (Event 0x2E, Umask 0x41).   In Section 18.2.1.2 of Volume 3 of the Intel Architectures Software Developer’s Manual (document 325384, revision 060, September 2016), Intel warns that this event may contain “implementation-specific” characteristics.  On most recent Intel systems, this event counts demand loads that miss in the L3, demand stores that miss in the L3, and L1 hardware prefetches that miss in the L3.  In my experience, it does *not* count traffic due to L2 hardware prefetches that miss in the L3.  These L2 hardware prefetches often make up the majority of the data traffic, so the counter is not useful for estimating bulk traffic.  Instead, it is intended to identify loads that were not prefetched in time to be in the L3.  These loads are the most likely to cause processor stalls, as the latency of these L3 misses is much too high to be “hidden” (overlapped) by the out-of-order execution mechanisms of the core.
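
To make the "Event 0x2E, Umask 0x41" notation concrete, the sketch below packs the two fields the way Linux perf expects a raw Intel core event: event select in bits 0-7 and unit mask in bits 8-15. The function name intel_raw_event is an assumption for illustration only; it is not a PAPI or perf API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the raw config value for an Intel core PMU
 * event from its event-select and unit-mask bytes.
 * Layout: bits 0-7 = event select, bits 8-15 = umask. */
uint64_t intel_raw_event(unsigned event_select, unsigned umask)
{
    /* Both fields are one byte wide in the event-select register. */
    assert(event_select <= 0xFF && umask <= 0xFF);
    return ((uint64_t)umask << 8) | event_select;
}
```

With these inputs, intel_raw_event(0x2E, 0x41) yields 0x412E, which is the value one would pass as a raw event (for example "r412e" on the perf command line) to count the architectural LLC miss event directly.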

 

 

-- 

John D. McCalpin, Ph.D.

Texas Advanced Computing Center

University of Texas at Austin

https://www.tacc.utexas.edu/about/directory/john-mccalpin
