One more (and probably the best) from the Red Hat Summit:
Ulrich Drepper - Understanding Application Memory Performance
http://www.redhat.com/promo/summit/2008/downloads/pdf/Friday/Friday_1015am_Ulrich_Drepper_Decoding_the_Code.pdf
my off-the-cuff notes, likely including mistakes:
-----------------
Notes from Udrepper's session, Friday 6/20
Intel engineer quote ["if we developed processors the way we do
memory... processor as hot as the surface of the sun"]
Energy proportion.
energy = capacitance * voltage^2 * frequency
Notes that, as frequency goes up, circuits tend to become unstable
(call a EE guy), which requiring raising the voltage
you cannot really look at any single process in isolation, they are
influencing each other, flushing things out of the caches.
multi-core also introduces complications. <example processor> L3
shared, but the higher levels all by themselves. If the threads are
accessing the cores in order (i.e. no affinity), the cost is a
performance equivalent of accessing core memory. Making scheduler much
more complex!
memory controller, typically in north bridge
in the future we expect to have 4 or 5 levels of caches
L1->miss->L2->miss->lookup physical addess-> go to Memory controller->
go to memory, and simultaneously lookup in other processor's cache (!)-
>miss in other processor->read from memory module.
row adress, column address... it takes a while before the read can be
performed. The memory chips are running at 100-200MHz, it takes an
eternity in CPU time
some processors have inclusive caches (both L1 and L2 will store)
others have exclusive caches (only one will store).
most important, high level factors relating to memory performance:
cache line utilization
caches themselv are not addressing individual words, that would be
too wasteful in terms of storage (32b data + 64b admin? too much). Use
cache lines.
cache line utilization: it is obviously better to maximize use of the
cache line. Factors of 20 improvements you can achieve by using your
memory correctly, not 2-3% of compiler optimization!
page utilization
important to use the pages themselves as completely as possible, to
minimize cost of translation of logical to physical address
translation. This process is completely under the control of the
operating system, with data structures kept in memory. These
structures are up to 5-level deep, meaning that up to 5 memory lookups
have to be carried out to resolve the address!
consuming space in the tlb cache results in having to do these
lookups - use pages wisely to save tlb space.
just-in time reading
use software prefetching to pull in data before it is needed by the
process.
help hardware prefetching. It is getting better, but it will not (for
instance) cross a page boundary. we must help the hardware prefetching
perform best.
parallelism
concurrent cache line use is bad (can be rectified by program by
using thread-local storage... need to look into this with more
detail).
bigger problem, not so obvious, is fault sharing. Inadvertedly using
same cache line in multiple threads. Say a global is used in a thread,
a _different_ global is used in another, but they are both on the same
cache line. Thrasing of that cache line occurs, even though it is not
the same memory!
example: matrix multiplication:
for ()
for ()
for ()
8 billion multiplications. Run time on (...) test machine: 11
minutes.
this was a good number 20 years ago. Something must be wrong.
Oprofile. Statistical profiling. Not tracing every memory access (the
system would grind to a halt), but looking at, for example, 1 every
5,000 of accesses. Based on hardware counters. limit to running only 2
counters at the sam time... produced numbers in multiple runs over a
whole afternoon.
What do these numbers tell you? In most cases it should be absolutely
nothing (these are absolute numbers, it depends on what is running).
What you need is relative numbers, or ratios.
very nice documentation in the Intel optimization manual's appendix,
what these ratios mean ->look up (Appendix B, Intel Optimization
Manual)
ratios are independent of length of sampling.
important ratios:
CPU_CLK_HALTED/INST_RETIRED - clocks per instruction retired
how many cycles per instruction. In multi-scalar processors (able to
execute multiple instructions in parallel, out of order), one would
expect this number to be over 1 (sometimes considerably)
CYCLES_L1|_MEM_STALLED/CPU_CLK_HALTED - instruction fetch stall
in the case of virtual machines you are doing this deliberately, but
in most cases code should be predictable.
BR_IND_CALL_EXEC/INS_RETIRED - virtual table use
L1D_CACHE_LD.MESI/CPU_CLK_UNHALTED - load rate
Large number of loads means load/store buffers full at the time
hitting L1 is ideal of course, but we are talking about 32k or 64k -
because of mismatch in program size, this cannot be achieved. Want to
keep number low, indicates going to L2 or even lower.
STORE_BLOCK.ORDER/CPU_CLK_UNHALTED - store order block
instructions being held up because of write ordering. want to keep it
low
L1D_REPL/INST_RETIRED - L1 data cache miss rate
how many instructions cause L1 cache misses. "you can use the L1 cache
as an extension of the CPU registers. Not as fast" (but good enough
and best option)
L2 ....
every single tlb can cause up to 5 memory reads. A memory read can
cost 200-300 cycles. A single instruction can baloon to requiring 1500
cycles. ridiculous!
....
back to example. Explain why only 12MFLOPS of performance
desiderable:
1 : 0.3
2: 77% waiting for the data to arrive
5: L1D misses at 15%
6: L2 20%
9/10 and 11/12: it is loads, not stores, even though algorithm
has ... loads for each store
swaps memory order access... old trick, 94% improvement :)
difference in data in lines 1, 2, 6, 8, 9, 11
next step: use huge pages, 1 MB instead of 4k. We need a couple of
thousand pages and now you only need a dozen. How? refer to his paper.
next: tiling. Fill in entire cache lines before they are evicted.
loops much more complex.
in example, 64 is cache line size
where is time spent?
oprofile can be used in two different modes.
opannotate-source
opannotate-assembly
really really nice tool to drill down into the problem, while the
ratios help you narrow down on where to spend your effort.
RedHat working on a unified framework based on systemtap. Once we have
this, it is possible to script based on hardware counters.
annotated listing example
problems of parallelism
oprofile does not care if your program is single thread or not, it
works all the same
false sharing of cache lines
group variables and align them. C language gives compiler a lot of
freedom, only way you can guarantee this is if you group the variables
in a structure and align them on a boundary.
common working set.
data loaded into memory only once for efficiency reasons.
place results in local memory and _then_ put results, all together,
in the end
synchronization
if you are using synchronization primitives, conditional variables,
semaphores, uing atomic ops, etc - by definition the representation of
these objects has to be share and results in contested cache lines.
too little synchronization eliminates the potential for parallelism,
but too much of it has serious performance impact.
ratios for multi-thread
procmon (or procmon2?) in the future, as well as systemtap.
Questions:
big program, no hotspot. Reduce to smaller chunks, make problems
stand out.
NUMA:
hardware vendor tools: vtune (he does not use it b/c licensing), AMD
code analyst (freely available)
glibc help for big pages: problem is reliability (1M pages are hard
to get by on a running system b/c of fragmentation). Until this
problem is solved, no.