Let's say I have a design where all threads frequently access a single IORef using atomicModifyIORef.
The standard reasoning is that this is a Bad Thing and a potential scaling bottleneck. And in fact, monad-par and I'm sure many other packages are committing this sin. But how do I measure *how bad* it is in a particular program in Haskell?
If you have good ideas here, please educate me. As far as I can see there are three ways we could get at this information, only the last of which is a pure-Haskell solution:
- Measure actual contention in terms of machine properties like cache coherence traffic (performance counters). (But how then to isolate the information relevant to a particular IORef?)
- Use binary rewriting to generate a detailed trace of execution, and then run that through a cache simulator (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.4016&rep=rep1&type=pdf)
- Use GHC events to count operations on particular IORefs. Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".
The trick with that third option is generating the model.
> - Use GHC events to count operations on particular IORefs. Then put that
> trace through our own model that reports whether the IORef is being used
> acceptably, or is "hot".
>
> The trick with that third option is generating the model.
This sounds straightforward, at least in an ad-hoc way. Just stick a
traceEvent "IORef #3" next to each atomicModifyIORef (pick different
#numbers for each one, obviously) and then see which ones are the most
frequent using "ghc-events show | grep IORef".
This will not tell you if it is *too* frequent, but it'll at least tell
you which ones are hotter than the others, and by roughly what order of
magnitude.
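For instance, a minimal sketch of that wrapper (atomicModifyIORefTraced
is just a made-up name; traceEventIO is the IO-flavoured traceEvent
mentioned below):

    import Data.IORef (IORef, atomicModifyIORef)
    import Debug.Trace (traceEventIO)

    -- Log a user event naming the IORef, then do the atomic update as usual.
    atomicModifyIORefTraced :: String -> IORef a -> (a -> (a, b)) -> IO b
    atomicModifyIORefTraced label ref f = do
      traceEventIO ("IORef " ++ label)   -- e.g. "IORef #3"
      atomicModifyIORef ref f

Compile with -eventlog, run with +RTS -l, and "ghc-events show
prog.eventlog | grep IORef" gives you the counts per label.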
Try it. Let us know.
BTW, this is probably a good moment to advertise that ghc-7.4 will
come with a new and improved traceEvent. Instead of just being exported
from GHC.Exts with type String -> IO (), we now have:
Debug.Trace.traceEvent :: String -> a -> a
Debug.Trace.traceEventIO :: String -> IO ()
The first one is like Debug.Trace.trace and can be used in pure code.
Also it'll be possible to turn on/off the user events separately from
other events with a +RTS flag.
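For instance, annotating a pure function (fib here is just an
illustrative stand-in): the event is emitted when the annotated thunk
is forced, so even pure code shows up in the eventlog.

    import Debug.Trace (traceEvent)

    -- The event fires when the annotated expression is evaluated.
    fib :: Int -> Integer
    fib n = traceEvent ("fib " ++ show n) $
            if n < 2 then fromIntegral n else fib (n - 1) + fib (n - 2)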
Duncan
> - Use binary rewriting to generate a detailed trace of execution, and
>   then run that through a cache simulator
>   (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.4016&rep=rep1&type=pdf)
While perhaps doable, it sounds like a heavyweight approach. You might get away with doing it once for some really important lock in the RTS, but it's not something I'd like to do on a daily basis.
> - Use GHC events to count operations on particular IORefs. Then put
>   that trace through our own model that reports whether the IORef is
>   being used acceptably, or is "hot".
>
> The trick with that third option is generating the model.
I'd like to see us output information about blocking on MVars/IORefs etc. from the RTS so we can look for hot locks in ThreadScope. I've seen such lock contention analysis systems in C++ before and I'd love to have them in Haskell. The basic version would be to tell the user (of ThreadScope) that many threads are blocked on MVar 123, where 123 is some opaque ID. That will not help them find the MVar in the source, but at least they'd know that contention is a problem. The Rolls-Royce version would be to map this ID back to a source location.
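In the meantime, a crude user-level approximation is to bracket the
blocking operation with user events yourself, so the time spent waiting
on a manually labelled MVar at least shows up in the eventlog
(takeMVarTraced is a hypothetical wrapper, not an RTS feature):

    import Control.Concurrent.MVar (MVar, takeMVar)
    import Debug.Trace (traceEventIO)

    -- The gap between the "begin" and "end" events is the time this
    -- thread spent blocked (plus the cost of the take itself).
    takeMVarTraced :: String -> MVar a -> IO a
    takeMVarTraced label mv = do
      traceEventIO ("take " ++ label ++ " begin")
      x <- takeMVar mv
      traceEventIO ("take " ++ label ++ " end")
      return x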
Is your reluctance because of the 20-100X+ runtime overhead? (Disclosure: the group I was in at Intel develops the "Pin" binary instrumentation tool, which is the basis for the heavyweight (rewriting-based) performance/parallelism analysis tools provided by Intel. But valgrind/cachegrind does similar stuff in OSS.)
> - Use GHC events to count operations on particular IORefs. Then put
>   that trace through our own model that reports whether the IORef is
>   being used acceptably, or is "hot".
>
> The trick with that third option is generating the model.
> I'd like to see us output information about blocking on MVars/IORefs
> etc. from the RTS so we can look for hot locks in ThreadScope. I've
> seen such lock contention analysis systems in C++ before and I'd love
> to have them in Haskell. The basic version would be to tell the user
> (of ThreadScope) that many threads are blocked on MVar 123, where 123
> is some opaque ID. That will not help them find the MVar in the
> source, but at least they'd know that contention is a problem. The
> Rolls-Royce version would be to map this ID back to a source location.

Is the idea that by logging only thread blocking, rather than all IORef accesses, it wouldn't be so terrible in the average case to do this for all MVars/IORefs? (Rather than just specific ones that are instrumented by the programmer.)
As for the MVar 123 problem -- within individual programs I tend to use StableNames for figuring out when one MVar is the same as another during debugging. But I know of no good trick to create a stable identity for an MVar between multiple executions of the same (deterministic) program. Correlating back to source location would help, but of course many MVars could be coined from the same static newEmptyMVar occurrence. If the "fork" mechanism could be hijacked, it would perhaps be possible to give an MVar a deterministic identity based on a counter and its position in the fork tree.

What would be the best path forward for tracking source locations? I take it that the simplest way is to use Template Haskell and replace "newEmptyMVar" with something like $(newEmptyMVar), which would grab the source location and generate trace events tagged with that location:
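Something along these lines, perhaps (a sketch only: newEmptyMVarTraced
is a made-up name, and the event text format is arbitrary):

    {-# LANGUAGE TemplateHaskell #-}
    module TracedMVar (newEmptyMVarTraced) where

    import Control.Concurrent.MVar (newEmptyMVar)
    import Debug.Trace (traceEventIO)
    import Language.Haskell.TH

    -- Splice as:  mv <- $(newEmptyMVarTraced)
    -- 'location' captures the call site at compile time, so each
    -- allocation is tagged with "file:line" in the eventlog.
    newEmptyMVarTraced :: Q Exp
    newEmptyMVarTraced = do
      loc <- location
      let site = loc_filename loc ++ ":" ++ show (fst (loc_start loc))
      [| traceEventIO ("newEmptyMVar at " ++ $(stringE site)) >> newEmptyMVar |]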
> Try it. Let us know.
>
> BTW, this is probably a good moment to advertise that ghc-7.4 will
> come with a new and improved traceEvent.

Very nice. Can we take a moment to confirm a few things about the performance of traceEvent? In the above example I'm sending it the same string repeatedly. But based on the definition of traceEvent:

    traceEvent :: String -> IO ()
    traceEvent msg = do
      withCString msg $ \(Ptr p) -> IO $ \s ->
        case traceEvent# p s of s' -> (# s', () #)

It looks like that [Char] list will be traversed every time I log an event to copy it to a C string. Is that correct? But if all I'm interested in is logging the occurrences of an event in time, I think a unique ID (Int) is all I really want to push on the log. Would it make sense to have variants of traceEvent with other types?
Cost-centre stack profiling doesn't currently work with multiple processors (+RTS -N2 and greater). I'm going to look into this as part of the profiling overhaul I'm currently working on.
Cheers,
Simon
We do want to give users the ability to define and generate binary
events too, though the details of how to do this haven't been worked out
yet. We do have plans to make the event log format extensible though
(this was worked out at the ThreadScope summit earlier this year, and I
think Paul Bone is working on it).
Cheers,
Simon
Just a thought, but where was the eventlog file being generated to? Sending it to a network file system can have that effect ;)
Cheers,
Simon
From: Ryan Newton
Sent: 31 October 2011 14:08
To: parallel-haskell@googlegroups.com
Subject: Re: Estimating contention on an IORef hammered with atomicModifyIORef
Any ideas on why I saw such platform-dependent behavior between an Intel Sandy Bridge Mac laptop and an Intel Westmere Linux box (same GHC 7.2.1 version):
Testing how many traceEvents we can do in a second.
Approx getCPUTime calls per second: 282,883

First doing a test on one thread:
    55,705,322 per/second average [IORef incrs]  ~ 27.58 cycles
    34,066,487 per/second average [traceEvent]   ~ 45.10 cycles

Second, doing a test on all 4 threads at once:
    51,708,077 per/second average [IORef incrs]  ~ 29.71 cycles
    30,432,770 per/second average [traceEvent]   ~ 50.48 cycles
So a slight slowdown with 4 threads, as we expect. Changes I made (a
rough sketch of the second change follows the list):

 - Use peek/poke for the counters instead of an IORef. I'm not sure
   how much difference this makes, but I wanted to factor out
   generational GC and the possibility that multiple IORefs got onto
   the same cache line. I did have to add some artificial allocation
   to the loop though, otherwise the killThread doesn't work.

 - Use the primop traceEvent# directly. This is the big one. The real
   GHC.Exts.traceEvent is doing a lot of marshalling to convert the
   String into a CString, including encoding to UTF-8 (in 7.2.1 and
   later). The encoding costs a *lot*; in fact I want to look into
   this to find out why.
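Roughly the shape of that second change (this is not the actual
benchmark code, and "bench-event" is an arbitrary label): marshal the
event string to a CString once, outside the loop, keep the counter in
malloc'd memory rather than an IORef, and call the traceEvent# primop
directly.

    {-# LANGUAGE MagicHash, UnboxedTuples #-}
    import Control.Monad (forM_)
    import Foreign.C.String (CString, newCString)
    import Foreign.Marshal.Alloc (mallocBytes)
    import Foreign.Storable (peek, poke, sizeOf)
    import GHC.Exts (traceEvent#)
    import GHC.IO (IO (..))
    import GHC.Ptr (Ptr (..))

    -- Emit an event from an already NUL-terminated C string, skipping
    -- the String -> CString (and UTF-8) marshalling on every call.
    traceEventCString :: CString -> IO ()
    traceEventCString (Ptr addr) =
      IO $ \s -> case traceEvent# addr s of s' -> (# s', () #)

    main :: IO ()
    main = do
      counter <- mallocBytes (sizeOf (undefined :: Int))  -- plain memory, not an IORef
      poke counter (0 :: Int)
      msg <- newCString "bench-event"                     -- marshalled once, up front
      forM_ [1 :: Int .. 1000000] $ \_ -> do
        n <- peek counter
        poke counter (n + 1 :: Int)
        traceEventCString msg
      print =<< peek counter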
It's not clear to me why there was such a drop in performance with
multiple threads, but the answer lies somewhere in the UTF-8 encoding
that traceEvent was doing. Avoid that and the problems go away.
Cheers,
Simon
On 31/10/2011 19:50, Ryan Newton wrote:
> Excellent theory; I wish it were the case! But, alas, presently I see:
>
>   * Equally bad performance in /tmp and in a network mounted homedir.
>   * Slightly *worse* performance if +RTS -l is not supplied (in which ...
>
>> Just a thought, but where was the eventlog file being generated to?
>> Sending it to a network file system can have that effect ;)
>
> From: Ryan Newton
> Sent: 31 October 2011 14:08
> Subject: Re: Estimating contention on an IORef hammered with atomicModifyIORef
>
> Any ideas on why I saw such platform-dependent behavior between an
> Intel Sandy Bridge Mac laptop and an Intel Westmere Linux box (same
> GHC 7.2.1 version):
>
> I saw the trace events per-thread per-second take only a modest hit
> on the laptop going from one to two threads:
>
>     425K -> 358K
>
> But on the linux desktop, the trace event throughput was both slower
> to start, and collapsed by a factor of 20 going from 1 -> 2 -> 4
> threads:
>
>     205K -> 34K -> 10K
>
> I'll double check this with GHC head as soon as I get it
> successfully built on both platforms ;-).
>
> Thanks,
>   -Ryan
> - Use the primop traceEvent# directly. This is the big one. The real
>   GHC.Exts.traceEvent is doing a lot of marshalling to convert the
>   String into a CString, including encoding to UTF-8 (in 7.2.1 and
>   later). The encoding costs a *lot*; in fact I want to look into
>   this to find out why.
You're right, we should have another version that takes an unterminated
byte string and a length.
Cheers,
Simon