Indeed, some important stall events are the following:
DISPATCH_STALLS, 0xD1, 0x0, PMC
DISPATCH_STALLS_BRANCH, 0xD2, 0x0, PMC
DISPATCH_STALLS_SERIAL, 0xD3, 0x0, PMC
DISPATCH_STALLS_SEGMENT_LOAD, 0xD4, 0x0, PMC
DISPATCH_STALLS_ROB_FULL, 0xD5, 0x0, PMC
DISPATCH_STALLS_RES_FULL, 0xD6, 0x0, PMC
DISPATCH_STALLS_FPU_FULL, 0xD7, 0x0, PMC
DISPATCH_STALLS_LS_FULL, 0xD8, 0x0, PMC
DISPATCH_STALLS_ALL_QUIT, 0xD9, 0x0, PMC
DISPATCH_STALLS_DRAIN, 0xDA, 0x0, PMC
You probably cannot count all of the events above in a single run of a program, because the number of hardware counters is limited. However, you can create your own likwid groups (STALL0, STALL1, etc.), each including unhalted clock cycles and some of the stall events above, and define derived metrics on top of them.
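As a rough sketch of what such derived metrics could look like, the Python snippet below normalises each stall count by the unhalted cycles measured in the same run, so that values coming from different groups become comparable. All counter values are invented for illustration, and CPU_CLOCKS_UNHALTED is only assumed here as the name of the unhalted-clock event; check the event list for your CPU.

```python
# Hypothetical raw counts from two separate runs, one per custom group.
# Every number is made up; CPU_CLOCKS_UNHALTED is an assumed event name.
runs = {
    "STALL0": {
        "CPU_CLOCKS_UNHALTED": 1_000_000,
        "DISPATCH_STALLS": 400_000,
        "DISPATCH_STALLS_ROB_FULL": 150_000,
    },
    "STALL1": {
        "CPU_CLOCKS_UNHALTED": 1_020_000,
        "DISPATCH_STALLS_LS_FULL": 204_000,
        "DISPATCH_STALLS_FPU_FULL": 51_000,
    },
}


def stall_ratios(counts, clock_event="CPU_CLOCKS_UNHALTED"):
    """Return each stall count as a fraction of the unhalted cycles of the same run."""
    cycles = counts[clock_event]
    return {ev: n / cycles for ev, n in counts.items() if ev != clock_event}


# Merge the per-run ratios into one table keyed by event name.
ratios = {}
for group, counts in runs.items():
    ratios.update(stall_ratios(counts))

for event, ratio in sorted(ratios.items()):
    print(f"{event:28s} {ratio:6.1%}")
```

Normalising inside each run matters because the cycle count differs slightly from run to run, so raw counts from STALL0 and STALL1 should not be compared directly.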
Mind you, working with stall cycles is tricky: on superscalar, out-of-order architectures, stall causes are not independent, and, at least on Intel architectures, there is no hardware support to disambiguate them. This boils down to the following: a single stalled cycle can be counted by several stall events at once, so the total stall count is typically less than the sum of the individual stall events.
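To make the overlap concrete, here is a small Python sketch with invented numbers. It compares a total stall count with the sum of per-cause counts. Assuming the total counts each stalled cycle once and each per-cause event counts every cycle it applies to, the excess of the sum over the total is the number of extra attributions. The short cause names are placeholders, not real event names.

```python
def overlap_excess(total_stalls, per_cause):
    """Sum of per-cause counts minus the total: a rough indicator of double counting."""
    return sum(per_cause.values()) - total_stalls


# Invented counts, assumed to describe the same execution.
total = 400_000
per_cause = {
    "ROB_FULL": 150_000,
    "RES_FULL": 120_000,
    "LS_FULL": 200_000,
    "FPU_FULL": 50_000,
}

excess = overlap_excess(total, per_cause)
print(f"sum of causes : {sum(per_cause.values())}")
print(f"total stalls  : {total}")
print(f"excess        : {excess} ({excess / total:.1%} of total)")
```

A large excess suggests that per-cause counts should not be read as a clean breakdown of the total. Keep in mind that when the counts come from different runs, run-to-run variation is mixed into this number as well.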
In practice, the overlap between stall events is negligible in many cases, but not always, so one must be careful when interpreting stall counts. Here are some references, in case you want to read more:
- Azimi et al. Online performance analysis by statistical sampling of microprocessor performance counters. ICS '05
- Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst. 27, 2, Article 3 (May 2009), 37 pages.
- Levinthal. Performance Analysis Guide for Intel Core i7 Processors and Intel Xeon 5500 Processors. (see pp. 20-22)
- Cammarota et al. Pruning hardware evaluation space via correlation-driven application similarity analysis. Computing Frontiers 2011
Hope this helps.
Cheers,
- Ro
I am looking for a meaningful metric for stall cycles.
Thanks.