RE: A meaningful metric for stall cycles - likwid-perfctr


Rosario Cammarota

Nov 9, 2012, 1:05:03 PM
to likwid-d...@googlegroups.com, Spakho, r.cam...@gmail.com
Hi,
 
At the following link there is the complete list of events on K10:
 
 
Indeed, some important stall events are the following:
 
DISPATCH_STALLS, 0xD1, 0x0, PMC
DISPATCH_STALLS_BRANCH, 0xD2, 0x0, PMC
DISPATCH_STALLS_SERIAL, 0xD3, 0x0, PMC
DISPATCH_STALLS_SEGMENT_LOAD, 0xD4, 0x0, PMC
DISPATCH_STALLS_ROB_FULL, 0xD5, 0x0, PMC
DISPATCH_STALLS_RES_FULL, 0xD6, 0x0, PMC
DISPATCH_STALLS_FPU_FULL, 0xD7, 0x0, PMC
DISPATCH_STALLS_LS_FULL, 0xD8, 0x0, PMC
DISPATCH_STALLS_ALL_QUIT, 0xD9, 0x0, PMC
DISPATCH_STALLS_DRAIN, 0xDA, 0x0, PMC  
 
You probably cannot count all of the events above in a single run of a program. However, you can create your own likwid groups (STALL0, STALL1, etc.), each including clock unhalted and some of the stall events above, and define the operations to compute from them.
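As a rough sketch of how the ten events above could be split into such groups: the snippet below assumes four programmable counters per core and reserves one of them for CPU_CLOCKS_UNHALTED in every group. The function name make_groups and the STALLn labels are made up for illustration. It only plans the groups; it does not write likwid group files.

```python
# Stall events listed above (K10).
STALL_EVENTS = [
    "DISPATCH_STALLS",
    "DISPATCH_STALLS_BRANCH",
    "DISPATCH_STALLS_SERIAL",
    "DISPATCH_STALLS_SEGMENT_LOAD",
    "DISPATCH_STALLS_ROB_FULL",
    "DISPATCH_STALLS_RES_FULL",
    "DISPATCH_STALLS_FPU_FULL",
    "DISPATCH_STALLS_LS_FULL",
    "DISPATCH_STALLS_ALL_QUIT",
    "DISPATCH_STALLS_DRAIN",
]


def make_groups(events, counters_per_core=4, always=("CPU_CLOCKS_UNHALTED",)):
    """Split events into groups that fit the available counters.

    counters_per_core=4 is an assumption; adjust it for your machine.
    Every group also carries the events in `always`, so each run can be
    normalized against its own unhalted clock count.
    """
    slots = counters_per_core - len(always)
    if slots < 1:
        raise ValueError("no counters left for stall events")
    return [
        list(always) + events[i:i + slots]
        for i in range(0, len(events), slots)
    ]


groups = make_groups(STALL_EVENTS)
for n, group in enumerate(groups):
    print(f"STALL{n}: " + ", ".join(group))
```

Each group then needs its own run of the program, so counts from different groups come from different executions and should be compared as fractions of clock unhalted rather than as raw numbers.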
Mind you, working with stall cycles is tricky because on superscalar, out-of-order architectures stall cycles are not independent and, at least on Intel architectures, there is no hardware support to disambiguate them. This boils down to the following: the total stall count is typically less than the sum of the individual stall events, because a cycle that is stalled for more than one reason is counted once in the total but once per cause in the individual events.
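To make the overlap concrete, here is a small sketch with invented counts (these are not measurements). It assumes DISPATCH_STALLS can be read as the total and compares it with the sum of a few per-cause events. The function name stall_overlap is hypothetical.

```python
def stall_overlap(total_stalls, per_cause):
    """Return (sum of per-cause counts, excess of that sum over the total)."""
    summed = sum(per_cause.values())
    return summed, summed - total_stalls


# Invented counts, for illustration only.
total = 40_000_000_000  # assumed reading of DISPATCH_STALLS
causes = {
    "DISPATCH_STALLS_ROB_FULL": 25_000_000_000,
    "DISPATCH_STALLS_LS_FULL": 15_000_000_000,
    "DISPATCH_STALLS_FPU_FULL": 8_000_000_000,
}

summed, overlap = stall_overlap(total, causes)
print(f"sum of causes = {summed:.3e}, total = {total:.3e}")
print(f"excess        = {overlap:.3e} ({overlap / summed:.1%} of the sum)")
```

A large excess means the per-cause counts cannot simply be added up to a stall budget. A small one suggests the overlap is negligible for that program.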
 
In practice, the overlap between stall events is often negligible, but not always, so one must be careful when interpreting stall counts. Here are some references, in case you want to read more:
 
- Azimi et al. Online performance analysis by statistical sampling of microprocessor performance counters. ICS '05
 
- Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2009. A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst. 27, 2, Article 3 (May 2009), 37 pages.
 
- Levinthal. Performance Analysis Guide for Intel Core i7 processors and Intel Xeon 5500 processors. (see pp. 20-22)
 
- Cammarota et al. Pruning hardware evaluation space via correlation-driven application similarity analysis. Computing Frontiers 2011
 
Hope this helps.
 
Cheers,
- Ro
 
From: Spakho <amirba...@gmail.com>
Sent: November 9, 2012 7:18 AM
To: likwid-d...@googlegroups.com
Subject: A meaningful metric for stall cycles - likwid-perfctr
 
Hi all,

I am working with K10 machines.  likwid-perfctr -g CPI gives me the following metrics:

1) INSTRUCTIONS_RETIRED STAT (Sample value: 2.06153e+12)
2) CPU_CLOCKS_UNHALTED STAT (Sample value: 2.90543e+12)
3) UOPS_RETIRED STAT (Sample value: 2.41861e+12)
4) Runtime [s] STAT ( Sample value:1452.8)
5) CPI STAT( Sample value:22.5519)
6) CPI (based on uops) STAT (Sample value:19.223)
7) IPC STAT (Sample value:11.3526)

I am looking for a meaningful metric for stall cycles.
Do I need to consider CPU_CLOCKS_UNHALTED STAT - UOPS_RETIRED STAT or  CPU_CLOCKS_UNHALTED STAT - INSTRUCTIONS_RETIRED STAT?

Thanks.

Spakho

Nov 10, 2012, 11:21:10 AM
to likwid-d...@googlegroups.com, Spakho, r.cam...@gmail.com
Thanks Ro.

Can I approximately calculate stall cycles through the following formula?

Total Clocks = Instruction_Clocks + Stall Clocks
Stall Clocks =  Total Clocks - Instruction_Clocks

also

Instruction_Clocks = (average CPI) * Instruction_count

If so, can I use INSTRUCTIONS_RETIRED STAT as Instruction_count?

1) INSTRUCTIONS_RETIRED STAT (sample average value: 6.62398e+10)
2) CPU_CLOCKS_UNHALTED STAT (sample average value: 1.11164e+11)
3) UOPS_RETIRED STAT (sample average value: 7.70107e+10)
4) CPI STAT (sample average value: 1.68236)
5) CPI (based on uops) STAT (sample average value: 1.44815)


Thanks.

moebiusband

Nov 11, 2012, 3:47:46 AM
to likwid-d...@googlegroups.com, Spakho, r.cam...@gmail.com
Hi,

The CYCLES_UNHALTED_CORE metric is available on any processor, so it is not necessary to compute it from other events. You can of course compute the difference between total cycles and unhalted core cycles to get halted cycles. The other halted-cycle events mostly relate to what caused the halted cycles, which is what you are really interested in. Validate your results against simple consistency checks.

Search Google for the whitepapers by David Levinthal. He is a strong proponent of cycles-spent analysis, where every cycle is binned to a specific reason. His work covers Intel only, but you may be able to find similar events for AMD. The document listing all the events is the BKDG for the processor you are looking at. You can also get the event list with:

likwid-perfctr -e | less

Please note that the STAT table you pasted is the statistics table (MEAN, MIN, MAX, SUM). It is printed if you run LIKWID on more than one core. To make grepping in a script easier, I mark the labels there with a STAT tag.

Jan

PS:

Please consider posting questions like this on the user mailing list.

Spakho

Nov 14, 2012, 11:19:08 AM
to likwid-d...@googlegroups.com, Spakho, r.cam...@gmail.com
Thank you very much.

moebiusband

Nov 14, 2012, 1:36:19 PM
to likwid-d...@googlegroups.com, Spakho, r.cam...@gmail.com
Hi,

My previous answer mixed up halted and stalled cycles. These are not the same. If a core is in the halted state, it does not execute any instructions. You are probably interested in stalled cycles, where the processor is executing instructions but execution stalls because of a resource limit or some other hazard. So be careful not to mix the two up.

Jan



Rosario Cammarota

Nov 14, 2012, 4:14:24 PM
to Spakho, likwid-d...@googlegroups.com
Hi Amir,
 
If you are interested in the total stall cycle count, I would suggest using the resource stalls count available for AMD K10. The difference between clock unhalted and the total stall count gives you the number of cycles in which the pipeline was not stalled. Stalls can occur anywhere during the execution of your application, and their distribution is partly a property of your program and partly induced by architectural limits, e.g., saturation of architectural resources. For that reason I would suggest using the following relative quantities:
 
%Instruction cycles = ( clock unhalted - stall cycles) / clock unhalted x 100
 
and
 
%Stall cycles = 100 - %Instruction cycles
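The two quantities above can be sketched as a small helper. The clock unhalted value is the sample from earlier in the thread. The stall count is an invented placeholder, since no stall event was actually measured here, and the function name cycle_breakdown is hypothetical.

```python
def cycle_breakdown(clock_unhalted, stall_cycles):
    """Return (%instruction cycles, %stall cycles) of the unhalted clocks."""
    if clock_unhalted <= 0:
        raise ValueError("clock_unhalted must be positive")
    pct_instruction = (clock_unhalted - stall_cycles) / clock_unhalted * 100
    pct_stall = 100 - pct_instruction
    return pct_instruction, pct_stall


clock_unhalted = 1.11164e11  # CPU_CLOCKS_UNHALTED sample from the thread
stall_cycles = 4.0e10        # invented placeholder, not a measurement

pct_instruction, pct_stall = cycle_breakdown(clock_unhalted, stall_cycles)
print(f"%Instruction cycles = {pct_instruction:.1f}")
print(f"%Stall cycles       = {pct_stall:.1f}")
```

Working in percentages of clock unhalted also makes runs of different length comparable.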
 
I am not sure of the meaning of the counters that you mentioned. In particular, I am not sure of the meaning of STAT versus non-STAT. However, on superscalar, speculative, out-of-order architectures there are usually two main counters for x86-like dynamic instructions. The first counts instructions issued, whereas the second counts instructions retired, and instructions issued >= instructions retired. Unless you want to assert something about branch prediction, you should use the count of instructions retired.
One more thing about the CPI. I would not suggest using CPI to approximate the instruction clocks, because the CPI is itself usually derived from the clock count.
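A quick check of why that is circular, using the sample averages posted earlier in the thread. This is only a sketch and assumes the CPI metric is computed as clock unhalted divided by instructions retired.

```python
instructions = 6.62398e10  # INSTRUCTIONS_RETIRED sample average
clocks = 1.11164e11        # CPU_CLOCKS_UNHALTED sample average
reported_cpi = 1.68236     # CPI sample average as posted

# CPI derived from the same two counters (assumed definition).
cpi = clocks / instructions

# The proposed formula: Instruction_Clocks = CPI * Instruction_count
instruction_clocks = cpi * instructions
stall_clocks = clocks - instruction_clocks

print(f"derived CPI           = {cpi:.5f}")
print(f"instruction clocks    = {instruction_clocks:.5e}")
print(f"stall clocks (formula) = {stall_clocks:.3e}")

# With the posted average CPI the residual is small relative to the clocks
# and reflects how the averages were taken, not stall cycles.
residual = clocks - reported_cpi * instructions
print(f"residual with posted CPI = {residual:.3e}")
```

Multiplying the measured CPI by the instruction count just gives back the total clocks, so the formula cannot separate stall cycles from useful ones. A dedicated stall event is needed for that.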
 
Thanks,
- Ro
 
ps. Sorry for the late reply.