Estimating contention on an IORef hammered with atomicModifyIORef

Ryan Newton

unread,

Oct 27, 2011, 11:20:59 AM10/27/11

to parallel-haskell, mona...@googlegroups.com

Let's say I have a design where all threads frequently access a single IORef using atomicModifyIORef.

The standard reasoning is that this is a Bad Thing and a potential scaling bottleneck. And in fact, monad-par and I'm sure many other packages are committing this sin. But how do I measure *how bad* it is in a particular program in Haskell?

If you have good ideas here, please educate me. As far as I can see there are three ways we could get at this information, only the last of which is a pure-Haskell solution:

Measure actual contention in terms of machine properties like cache coherence traffic (performance counters). (But how then to isolate the information relevant to a particular IORef?)
Use binary rewriting to generate a detailed trace of execution, and then run that through a cache simulator (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.4016&rep=rep1&type=pdf)
Use GHC events to count operations on particular IORefs. Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".

The trick with that third option is generating the model.

Ideas?

-Ryan

Ryan Newton

unread,

Oct 28, 2011, 9:15:23 AM10/28/11

to parallel...@googlegroups.com, mona...@googlegroups.com

Use binary rewriting to generate a detailed trace of execution, and then run that through a cache simulator (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.4016&rep=rep1&type=pdf)

While perhaps doable it sounds like a heavyweight approach. You might get away with doing it once for some really important lock in the RTS but it's not something I'd like to do on a daily basis.

Is your reluctance because of the 20-100X+ runtime overhead? (Disclosure: the group I was in at Intel develops the "Pin" binary instrumentation tool which is the basis for the heavyweight (rewriting based) performance/parallelism analysis tools provided by Intel. But valgrind/cachegrind does similar stuff in OSS.)

Use GHC events to count operations on particular IORefs. Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".
The trick with that third option is generating the model.

I'd like to see us output information about blocking on MVars/IORefs etc from the RTS so we can lock for hot locks in threadscope. I've seen such lock contention analysis systems in C++ before and I'd love to have them in Haskell. The basic version would be to tell the user (of threadscope) that many threads are blocked on MVar 123, where 123 is some opaque ID. That will not help them find the MVar in the source but at least they'd know that contention is a problem. The rolls-royce version would be to map this ID back to a source location.

Is the idea that by logging only threads blocking than all IORef accesses it wouldn't be so terrible in the average case to do this for all MVars/IORefs? (Rather than just specific ones that are instrumented by the programmer.)

As for the MVar 123 problem -- within individual programs I tend to use StableNames for figuring out when one MVar is the same as another during debugging. But I know of no good trick to create a stable identity for an MVar between multiple executions of the same (deterministic) program. Correlating back to source location would help, but of course many MVars could be coined from the same static newEmptyMVar occurrence. If the "fork" mechanism could be hijacked it would perhaps be possible to give an MVar a deterministic identity based on a counter and its position in the fork tree.

What would be the best path forward for tracking source locations? I take it that the simplest way is to use template haskell and replace "newEmptyMVar" with something like $(newEmptyMVar) which would grab the source location and generate trace events tagged with that location:

http://stackoverflow.com/questions/7073471/whats-the-correct-way-to-have-template-haskell-wrap-a-function-with-source-info

-Ryan

Ryan Newton

unread,

Oct 28, 2011, 11:29:52 AM10/28/11

to parallel...@googlegroups.com, mona...@googlegroups.com

Try it. Let us know.

Will do!

BTW, this is probably a good moment to advertise that in ghc-7.4 will
come with a new and improved traceEvent. Instead of just being exported

Very nice. Can we take a moment to confirm a few things about the performance of traceEvent? In the above example I'm sending it the same string repeatedly. But based on the definition of traceEvent:

traceEvent :: String -> IO ()
traceEvent msg = do
  withCString msg $ \(Ptr p) -> IO $ \s ->
    case traceEvent# p s of s' -> (# s', () #)

It looks like that [Char] list will be traversed every time I log an event to copy it to a C string. Is that correct? But if all I'm interested in is logging the occurrences of an event in time, I think a unique-ID (Int) is all I really want to push on the log. Would it make sense to have variants of traceEvent with other types?

I was curious how much overhead such a traceEvent would incur, so I wrote the attached little microbenchmark. There are results in the comments at the bottom of the file.

What it does is it prints the traces per-second per-thread and it runs twice -- once with a single thread and once with numCapabilities threads all hammering away.

The case where you simply increment an IORef in a loop shows good scalability. (IF you do not allocate the IORefs consecutively -- take a look at that #if switch.) But the max traceEvent throughput I'm seeing is about 500K/sec, AND when I tested it on a few Intel Westmere linux workstations I saw a weird effect where traceEvents/sec dropped all the way to 34K/sec and 10K/sec at 2 and 4 threads respectively! (Throughput per/thread decreased much less on my Mac laptop, so it's Arch/OS specific. Odd. )

-Ryan

{-# LANGUAGE BangPatterns, ScopedTypeVariables, CPP #-}

import Control.Exception

import Control.Monad

import Data.IORef

import Data.Int

import Data.List

import Data.List.Split hiding (split)

import GHC.Exts (traceEvent)

--import Benchmarks.BinSearch

import System.Random

import System.CPUTime (getCPUTime)

import System.CPUTime.Rdtsc

import Text.Printf

import GHC.Conc (forkIO, numCapabilities, threadDelay, killThread)

import Control.Concurrent.MVar

----------------------------------------------------------------------------------------------------

-- Main Script

----------------------------------------------------------------------------------------------------

main = do

putStrLn "Testing how many traceEvents we can do in a second."

-- binSearch 3 (0.99, 1.01) testTrace

freq <- measureFreq

putStrLn "\nFirst doing a test on one thread:"

timeAndKillThreads 1 freq "IORef incrs" (return ())

timeAndKillThreads 1 freq "traceEvent" (traceEvent "ConstString")

putStrLn$ "\nSecond, doing a test on all "++show numCapabilities++" threads at once "

timeAndKillThreads numCapabilities freq "IORef incrs" (return ())

timeAndKillThreads numCapabilities freq "traceEvent" (traceEvent "ConstString")

putStrLn "Done Testing."

--------------------------------------------------------------------------------

-- Helper and timing routines

--------------------------------------------------------------------------------

timeAndKillThreads :: Int -> Int64 -> String -> IO () -> IO ()

timeAndKillThreads numthreads freq msg action =

do

-- Note, if the IORefs are allocated on the main thread the throughput

-- in the parallel case plummets. Presumably this is due to false

-- sharing as they get bump-allocated.

#if 0

counters <- forM [1..numthreads] (const$ newIORef (1::Int64))

tids <- forM counters $ \counter ->

forkIO $ infloop counter

#else

mv <- newEmptyMVar

tids <- forM [1..numthreads] $ \_ ->

forkIO $ do r <- newIORef (1::Int64)

putMVar mv r

infloop r

counters <- forM [1..numthreads] (const$ takeMVar mv)

#endif

threadDelay (1000*1000) -- One second

mapM_ killThread tids

finals <- mapM readIORef counters

-- printf "Across %d threads got these throughputs: %s\n" numthreads (show finals)

let mean :: Double = fromIntegral (foldl1 (+) finals) / fromIntegral numthreads

cycles_per :: Double = fromIntegral freq / mean

printResult (round mean :: Int64) msg cycles_per

where

infloop !counter =

do action

incr counter

infloop counter

incr !counter =

do -- modifyIORef counter (+1) -- Not strict enough!

c <- readIORef counter

let c' = c+1

_ <- evaluate c'

writeIORef counter c'

printResult :: Int64 -> String -> Double -> IO ()

printResult total msg cycles_per =

putStrLn$ " "++ padleft 11 (commaint total) ++" per/second average "++ padright 27 ("["++msg++"]") ++" ~ "

++ fmt_num cycles_per ++" cycles"

-- Readable large integer printing:

commaint :: Integral a => a -> String

commaint n =

reverse $ concat $

intersperse "," $

chunk 3 $

reverse (show n)

padleft :: Int -> String -> String

padleft n str | length str >= n = str

padleft n str | otherwise = take (n - length str) (repeat ' ') ++ str

padright :: Int -> String -> String

padright n str | length str >= n = str

padright n str | otherwise = str ++ take (n - length str) (repeat ' ')

fmt_num :: (RealFrac a, PrintfArg a) => a -> String

fmt_num n = if n < 100

then printf "%.2f" n

else commaint (round n :: Integer)

-- Measure clock frequency, spinning rather than sleeping to try to

-- stay on the same core.

measureFreq :: IO Int64

measureFreq = do

let second = 1000 * 1000 * 1000 * 1000 -- picoseconds are annoying

t1 <- rdtsc

start <- getCPUTime

let loop !n !last =

do t2 <- rdtsc

when (t2 < last) $

putStrLn$ "COUNTERS WRAPPED "++ show (last,t2)

cput <- getCPUTime

if (cput - start < second)

then loop (n+1) t2

else return (n,t2)

(n,t2) <- loop 0 t1

putStrLn$ " Approx getCPUTime calls per second: "++ commaint (n::Int64)

when (t2 < t1) $

putStrLn$ "WARNING: rdtsc not monotonically increasing, first "++show t1++" then "++show t2++" on the same OS thread"

return$ fromIntegral (t2 - t1)

----------------------------------------------------------------------------------------------------

{-

Results:

Sandybridge macbook air, first WITHOUT -threaded:

Approx getCPUTime calls per second: 324,905

First doing a test on one thread:

43,804,752 per/second average [IORef incrs] ~ 41.53 cycles

461,589 per/second average [traceEvent] ~ 3,941 cycles

Westmere 3.1 ghz 4core noHT, first WITHOUT -threaded:

Approx getCPUTime calls per second: 558,696

First doing a test on one thread:

86,823,292 per/second average [IORef incrs] ~ 35.63 cycles

215,086 per/second average [traceEvent] ~ 14,383 cycles

Next, both WITH -threaded:

Sandybridge macbook air, -N2:

First doing a test on one thread:

57,743,676 per/second average [IORef incrs] ~ 30.34 cycles

425,853 per/second average [traceEvent] ~ 4,114 cycles

Second, doing a test on all 2 threads at once

50,622,096 per/second average [IORef incrs] ~ 34.61 cycles

358,316 per/second average [traceEvent] ~ 4,889 cycles

Finally, doing a test on 4 threads -- HYPERTHREADING

34,021,584 per/second average [IORef incrs] ~ 39.58 cycles

246,782 per/second average [traceEvent] ~ 5,457 cycles

Westmere 3.1 ghz 4core noHT:

First doing a test on one thread:

74,393,471 per/second average [IORef incrs] ~ 40.56 cycles

205,615 per/second average [traceEvent] ~ 14,673 cycles

Doing a test on 2 threads at once

76,597,784 per/second average [IORef incrs] ~ 39.39 cycles

34,764 per/second average [traceEvent] ~ 86,786 cycles

Doing a test on all 4 threads at once

71,818,103 per/second average [IORef incrs] ~ 39.02 cycles

10,360 per/second average [traceEvent] ~ 270,485 cycles

-}

Ryan Newton

unread,

Oct 28, 2011, 11:31:29 AM10/28/11

to parallel...@googlegroups.com, mona...@googlegroups.com

Very nice. Can we take a moment to confirm a few things about the performance of traceEvent? In the above example I'm sending it the same string repeatedly. But based on the definition of traceEvent:

Btw -- "above example" here was unclear. It referred to Duncan's example of (traceEvent "IORef #3") which would only be "above" if viewed in GMail thread view ;-).

Simon Marlow

unread,

Oct 31, 2011, 5:36:02 AM10/31/11

to parallel...@googlegroups.com, mona...@googlegroups.com

Cost-centre stack profiling doesn’t currently work with multiple processors (+RTS –N2 and greater). I’m going to look into this as part of the profiling overhaul I’m currently working on.

Cheers,

Simon

From: parallel...@googlegroups.com [mailto:parallel...@googlegroups.com] On Behalf Of Johan Tibell
Sent: 28 October 2011 16:26
To: parallel...@googlegroups.com
Cc: mona...@googlegroups.com
Subject: Re: Estimating contention on an IORef hammered with atomicModifyIORef

On Fri, Oct 28, 2011 at 6:15 AM, Ryan Newton <rrne...@gmail.com> wrote:

Is your reluctance because of the 20-100X+ runtime overhead? (Disclosure: the group I was in at Intel develops the "Pin" binary instrumentation tool which is the basis for the heavyweight (rewriting based) performance/parallelism analysis tools provided by Intel. But valgrind/cachegrind does similar stuff in OSS.)

I was more thinking of manual work on the part of the programmer. If there are tools that do it automatically that's much better.

Use GHC events to count operations on particular IORefs. Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".

The trick with that third option is generating the model.

I'd like to see us output information about blocking on MVars/IORefs etc from the RTS so we can lock for hot locks in threadscope. I've seen such lock contention analysis systems in C++ before and I'd love to have them in Haskell. The basic version would be to tell the user (of threadscope) that many threads are blocked on MVar 123, where 123 is some opaque ID. That will not help them find the MVar in the source but at least they'd know that contention is a problem. The rolls-royce version would be to map this ID back to a source location.

Is the idea that by logging only threads blocking than all IORef accesses it wouldn't be so terrible in the average case to do this for all MVars/IORefs? (Rather than just specific ones that are instrumented by the programmer.)

I couldn't think of a good use case for logging all accesses, that's all. At work we automatically track contention on all locks (in C++) and that seems to work fine.

As for the MVar 123 problem -- within individual programs I tend to use StableNames for figuring out when one MVar is the same as another during debugging. But I know of no good trick to create a stable identity for an MVar between multiple executions of the same (deterministic) program. Correlating back to source location would help, but of course many MVars could be coined from the same static newEmptyMVar occurrence. If the "fork" mechanism could be hijacked it would perhaps be possible to give an MVar a deterministic identity based on a counter and its position in the fork tree.

What would be the best path forward for tracking source locations? I take it that the simplest way is to use template haskell and replace "newEmptyMVar" with something like $(newEmptyMVar) which would grab the source location and generate trace events tagged with that location:

http://stackoverflow.com/questions/7073471/whats-the-correct-way-to-have-template-haskell-wrap-a-function-with-source-info

Here's an approach:

For blocking operations (e.g. takeMVar) record the amount of time spent waiting for a lock together with the stack trace* of the blocked thread. Saving the stack trace is almost as good as tracking the source location of the lock and can sometimes be more useful. To make this less expensive we can sample these events at a frequency of say 1%. Using this information we can show contention profile like so:

total time total time (%) cost center

0.005 50% foo

0.003 30% bar

and even make hierarchical charts like we currently do for CPU profiling.

* We can either try to use the current profiling cost stacks to get a stack trace or we could just get the innermost function using the same trick we use to get assert to output the current file and line number.

-- Johan

Reply all

Reply to author

Forward