Hi parallel-haskell,
I am currently doing a PhD at the University of Leeds, hoping to
improve the tool support for all the people doing parallelism in
Haskell. Therefore I have been trying out a few ideas on what kind of
data we can meaningfully get out of running programs. This obviously
overlaps with the interest of the Parallel GHC project, therefore I
was asked to write up some of my ideas and experiments to enable
better coordination on the project.
One of the ideas has been to have another stab at the basic "What code
executed at that time on that HEC?" question. The starting point here
is to use the ticks inserted by the HPC framework. The most
characteristic property of this approach is that it allows the
optimizer quite a bit of freedom in moving the ticks around. That can
be a good thing, as we obviously want to instrument code that is as
close to the real thing as possible. But on the other hand, it can
also be a bad thing: While ticks seem indeed to fire exactly *if* the
corresponding code plays a role, there's no such guarantee that it
will fire *when* or *as often as* the code executes.
As a quick illustration, here's yet another version of Fibonacci:
fib :: Int -> Int
fib n = go n 0 1
where go 1 i j = i `seq` j
go n i j = go (n-1) j (i+j)
The inner loop will compile to something like follows:
$wgo1_XJ8 =
\ (ww3_XIV ::
GHC.Prim.Int#)
(ww4_XJ0 ::
GHC.Prim.Int#)
(ww5_XJ5 ::
GHC.Prim.Int#) ->
case ww3_XIV of wild2_X1t {
__DEFAULT ->
$wgo1_XJ8
(GHC.Prim.-# wild2_X1t y_anY)
ww5_XJ5
(GHC.Prim.+# ww4_XJ0 ww5_XJ5);
1 -> ...
};
Note the complete absence of ticks - GHC has moved all of them out to
the call site of $wgo! The resulting tight loop is obviously the ideal
state as far as the optimizer is concerned, but for instrumentation
this is a problem: We end up with basically no information about the
most critical part of the program!
One might be tempted to call for optimizer modifications here, but my
opinion is that restrictions are the wrong course of action. After
all, there might be parts of the loop code that we *want* to see moved
out, so ticks have to be present outside to maintain the
association. There are also situations where you would end up having
to push ticks inwards -- say, the function parameter of a "map"
getting inlined. There's really no way to win here: If we want to stay
honest about the program's performance, putting *any* kind of
restriction on the optimizer is opening a considerable can of worms.
The way I see it, it is better to approach the problem from the other
side: What we end up showing to the programmer will have the
handwriting of the optimizer all over it no matter how we do it. So
instead of using the "by-product" ticks to guide our instrumentation,
why not use the actual result? After all, we are already on "best
effort" basis as far as source code annotations are concerned. So let
us just assume that wherever we end up instrumenting, we might
find something to tell the programmer about it. That gives us the
freedom to formulate a wishlist:
1. We want good coverage ("If something interesting happens, we want
to hear about it.")
2. We want performance ("Don't call too often, though")
So let us look at optimized Core (aka Stg): Stg conversion -- as I
understand it -- is partly about finalizing the control flow of the
program. That's pretty much the essence of "interesting" as far as
instrumentation is concerned. The list of control structures also
happens to be nice and short:
+ Closures. Come annotated in two flavors:
- Updatable: Pure thunks (appear most often, but not executed often)
- Reentrant: Functions (pretty rare, but executed a lot)
+ Multi-alternative case
I would single out reentrant closures as the best candidate here in
terms of coverage. Until someone manages to write a program that
consists solely of thunks there's a very good chance that all
interesting code will involve reentrant closures getting executed.
So what about performance? Entering a closure is one of the more
costly operations in Stg, therefore the optimizer should already be
working hard for us to keep the cost down. To test this in practice, I
prototyped a solution that puts some moderately costly (~ 50 cycles)
instrumentation code at the entry of reentrant closures positions. It
seems to work reasonably well. Here is the resulting data if I
benchmark it against a version without any instrumentation code
generated:
Program Size Allocs Runtime Elapsed TotalMem
--------------------------------------------------------------------------------
blackscholes -0.0% -0.0% +5.7% +5.0% -1.5%
coins -0.1% +0.0% +10.7% +10.5% +8.6%
gray -0.3% -0.0% +7.7% +8.0% +0.0%
mandel -0.0% +0.0% +6.5% +7.0% +0.0%
matmult -0.0% +0.4% +46.4% +47.3% +0.0%
minimax -0.0% -0.0% +19.4% +19.4% +0.0%
nbody +0.0% +0.0% +24.4% +23.8% +0.0%
parfib +0.0% +0.0% +78.6% +78.4% +0.0%
partree -0.0% -0.0% +3.2% +2.9% +7.1%
prsa -0.0% +0.0% +0.1% -0.1% +0.0%
queens -0.0% +0.0% +18.4% +18.2% +0.0%
ray -0.0% -0.1% +8.1% +8.1% +0.0%
sumeuler -0.0% -0.0% +70.8% +70.3% +0.0%
transclos -0.1% -0.1% +4.6% +4.2% +0.0%
--------------------------------------------------------------------------------
Min -0.3% -0.1% +0.1% -0.1% -1.5%
Max +0.0% +0.4% +78.6% +78.4% +8.6%
Geometric Mean -0.0% +0.0% +19.7% +19.5% +1.0%
Quite unsurprising: The "micro-benchmarks" sumeuler and parfib perform
the worst, as they have tight inner loops that need to execute the
instrumentation a lot. I highly doubt there's anything we can improve
there in terms of better tick placement.
Other obervations:
- Overhead seems to drop quickly under 20% for many benchmarks
- No noticable effect on scaling (at least not up to my humble 4
cores)
- Even for the benchmark with the lowest overhead (prsa), the
instrumentation points seem to get decent exposure: The sampling
data shows at least 50,000 hits per second with no "holes".
Lastly, one piece of bad news: Note that I benchmarked against a
version without instrumentation, but which still had HPC-style source
ticks througout the Core phase. Even though HPC ticks are supposed to
be robust against optimizations, this ends up proving that ideally you
wouldn't want to do anything at all to the optimizer. Just having them
as NOPs causes considerable slowdowns across all benchmarks:
--------------------------------------------------------------------------------
Program Size Allocs Runtime Elapsed TotalMem
--------------------------------------------------------------------------------
blackscholes +0.2% +88.5% +13.7% +14.5% +0.0%
coins +0.5% +124.2% +53.6% +54.2% -14.4%
gray +6.8% +78.3% +51.6% +48.3% -26.7%
mandel +0.3% +4.6% +6.2% +8.7% +0.0%
matmult +0.2% +2818.1% +24.7% +22.8% -15.2%
minimax +1.0% +16.3% +56.3% +55.5% +0.0%
nbody +0.3% +66.6% +48.5% +46.7% +0.0%
parfib +0.1% +40.7% +168.8% +168.2% +0.0%
partree +0.2% +14.1% +32.9% +32.0% +55.6%
prsa +0.2% +5.3% +12.7% +12.9% +0.0%
queens +0.1% +289.3% +156.1% +153.2% +9.1%
ray +1.4% +29.1% +86.3% +86.9% +0.0%
sumeuler +0.6% +0.0% +3.8% +3.7% +0.0%
transclos +0.3% +51.7% +32.4% +32.4% +0.0%
--------------------------------------------------------------------------------
Min +0.1% +0.0% +3.8% +3.7% -26.7%
Max +6.8% +2818.1% +168.8% +168.2% +55.6%
Geometric Mean +0.9% +85.6% +46.7% +46.4% -0.7%
I haven't completely investigated this - my hunch is that for
maintaining the "only execute if" restriction, the optimizer has to
miss out on quite a bit of sharing. That could be the reason for
matmult having that massive increase in allocation.
For those interested, my source code can be found on GitHub:
http://github.com/scpmw/ghc/tree/tick-dumps
http://github.com/scpmw/packages-hpc/tree/tick-dumps
Comments obviously very welcome.
Greetings,
Peter Wortmann