Estimating contention on an IORef hammered with atomicModifyIORef

2 views
Skip to first unread message

Ryan Newton

unread,
Oct 27, 2011, 11:20:59 AM10/27/11
to parallel-haskell, mona...@googlegroups.com
Let's say I have a design where all threads frequently access a single IORef using atomicModifyIORef.

The standard reasoning is that this is a Bad Thing and a potential scaling bottleneck.  And in fact, monad-par and I'm sure many other packages are committing this sin.  But how do I measure *how bad* it is in a particular program in Haskell?

If you have good ideas here, please educate me.  As far as I can see there are three ways we could get at this information, only the last of which is a pure-Haskell solution:
  • Measure actual contention in terms of machine properties like cache coherence traffic (performance counters).  (But how then to isolate the information relevant to a particular IORef?)
  • Use binary rewriting to generate a detailed trace of execution, and then run that through a cache simulator (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.4016&rep=rep1&type=pdf)
  • Use GHC events to count operations on particular IORefs.  Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".
The trick with that third option is generating the model.

Ideas?

   -Ryan

Ryan Newton

unread,
Oct 28, 2011, 9:15:23 AM10/28/11
to parallel...@googlegroups.com, mona...@googlegroups.com
While perhaps doable it sounds like a heavyweight approach. You might get away with doing it once for some really important lock in the RTS but it's not something I'd like to do on a daily basis. 

Is your reluctance because of the 20-100X+ runtime overhead?  (Disclosure: the group I was in at Intel develops the "Pin" binary instrumentation tool which is the basis for the heavyweight (rewriting based) performance/parallelism analysis tools provided by Intel.  But valgrind/cachegrind does similar stuff in OSS.)
 
  • Use GHC events to count operations on particular IORefs.  Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".
The trick with that third option is generating the model.

I'd like to see us output information about blocking on MVars/IORefs etc from the RTS so we can lock for hot locks in threadscope. I've seen such lock contention analysis systems in C++ before and I'd love to have them in Haskell. The basic version would be to tell the user (of threadscope) that many threads are blocked on MVar 123, where 123 is some opaque ID. That will not help them find the MVar in the source but at least they'd know that contention is a problem. The rolls-royce version would be to map this ID back to a source location.

Is the idea that by logging only threads blocking than all IORef accesses it wouldn't be so terrible in the average case to do this for all MVars/IORefs?  (Rather than just specific ones that are instrumented by the programmer.)

As for the MVar 123 problem -- within individual programs I tend to use StableNames for figuring out when one MVar is the same as another during debugging.  But I know of no good trick to create a stable identity for an MVar between multiple executions of the same (deterministic) program.  Correlating back to source location would help, but of course many MVars could be coined from the same static newEmptyMVar occurrence.  If the "fork" mechanism could be hijacked it would perhaps be possible to give an MVar a deterministic identity based on a counter and its position in the fork tree.

What would be the best path forward for tracking source locations?  I take it that the simplest way is to use template haskell and replace "newEmptyMVar" with something like $(newEmptyMVar) which would grab the source location and generate trace events tagged with that location:


  -Ryan

Ryan Newton

unread,
Oct 28, 2011, 11:29:52 AM10/28/11
to parallel...@googlegroups.com, mona...@googlegroups.com
Try it. Let us know.

Will do!
 
BTW, this is probably a good moment to advertise that in ghc-7.4 will
come with a new and improved traceEvent. Instead of just being exported

Very nice.  Can we take a moment to confirm a few things about the performance of traceEvent?  In the above example I'm sending it the same string repeatedly.  But based on the definition of traceEvent:

traceEvent :: String -> IO ()
traceEvent msg = do
  withCString msg $ \(Ptr p) -> IO $ \s ->
    case traceEvent# p s of s' -> (# s', () #)

It looks like that [Char] list will be traversed every time I log an event to copy it to a C string.  Is that correct?  But if all I'm interested in is logging the occurrences of an event in time, I think a unique-ID (Int) is all I really want to push on the log.  Would it make sense to have variants of traceEvent with other types?

I was curious how much overhead such a traceEvent would incur, so I wrote the attached little microbenchmark.  There are results in the comments at the bottom of the file.

What it does is it prints the traces per-second per-thread and it runs twice -- once with a single thread and once with numCapabilities threads all hammering away.

The case where you simply increment an IORef in a loop shows good scalability.  (IF you do not allocate the IORefs consecutively -- take a look at that #if switch.)  But the max traceEvent throughput I'm seeing is about 500K/sec, AND when I tested it on a few Intel Westmere linux workstations I saw a weird effect where traceEvents/sec dropped all the way to 34K/sec and 10K/sec at 2 and 4 threads respectively!  (Throughput per/thread decreased much less on my Mac laptop, so it's Arch/OS specific.  Odd. )

  -Ryan



{-# LANGUAGE BangPatterns, ScopedTypeVariables, CPP #-}

import Control.Exception
import Control.Monad
import Data.IORef
import Data.Int
import Data.List
import Data.List.Split  hiding (split)
import GHC.Exts (traceEvent)
--import Benchmarks.BinSearch
import System.Random
import System.CPUTime  (getCPUTime)
import System.CPUTime.Rdtsc
import Text.Printf
import GHC.Conc (forkIO, numCapabilities, threadDelay, killThread)
import Control.Concurrent.MVar

----------------------------------------------------------------------------------------------------
--    Main Script
----------------------------------------------------------------------------------------------------

main = do 
  putStrLn "Testing how many traceEvents we can do in a second."
--  binSearch 3 (0.99, 1.01) testTrace
  freq <- measureFreq

  putStrLn "\nFirst doing a test on one thread:"
  timeAndKillThreads 1 freq "IORef incrs" (return ())
  timeAndKillThreads 1 freq "traceEvent" (traceEvent "ConstString")

  putStrLn$ "\nSecond, doing a test on all "++show numCapabilities++" threads at once "
  timeAndKillThreads numCapabilities freq "IORef incrs" (return ())
  timeAndKillThreads numCapabilities freq "traceEvent" (traceEvent "ConstString")

  putStrLn "Done Testing."


--------------------------------------------------------------------------------
-- Helper and timing routines
--------------------------------------------------------------------------------

timeAndKillThreads :: Int -> Int64 -> String -> IO () -> IO ()
timeAndKillThreads numthreads freq msg action =
  do 

-- Note, if the IORefs are allocated on the main thread the throughput
-- in the parallel case plummets.  Presumably this is due to false
-- sharing as they get bump-allocated.
#if 0
     counters <- forM [1..numthreads] (const$ newIORef (1::Int64)) 
     tids <- forM counters $ \counter -> 
        forkIO $ infloop counter
#else
     mv <- newEmptyMVar
     tids <- forM [1..numthreads] $ \_ -> 
         forkIO $ do r <- newIORef (1::Int64)
   putMVar mv r 
   infloop r 
     counters <- forM [1..numthreads] (const$ takeMVar mv)
#endif

     threadDelay (1000*1000) -- One second
     mapM_ killThread tids

     finals <- mapM readIORef counters
--     printf "Across %d threads got these throughputs: %s\n" numthreads (show finals)
     let mean :: Double = fromIntegral (foldl1 (+) finals) / fromIntegral numthreads
         cycles_per :: Double = fromIntegral freq / mean

     printResult (round mean :: Int64) msg cycles_per

 where 
   infloop !counter = 
     do action
incr counter        
infloop counter 

   incr !counter = 
     do -- modifyIORef counter (+1) -- Not strict enough!
c <- readIORef counter
let c' = c+1
_ <- evaluate c'
writeIORef counter c'     

printResult ::  Int64 -> String -> Double -> IO ()
printResult total msg cycles_per = 
     putStrLn$ "    "++ padleft 11 (commaint total) ++" per/second average  "++ padright 27 ("["++msg++"]") ++" ~ "
      ++ fmt_num cycles_per ++" cycles"


-- Readable large integer printing:
commaint :: Integral a => a -> String
commaint n = 
   reverse $ concat $
   intersperse "," $ 
   chunk 3 $ 
   reverse (show n)

padleft :: Int -> String -> String
padleft n str | length str >= n = str
padleft n str | otherwise       = take (n - length str) (repeat ' ') ++ str

padright :: Int -> String -> String
padright n str | length str >= n = str
padright n str | otherwise       = str ++ take (n - length str) (repeat ' ')


fmt_num :: (RealFrac a, PrintfArg a) => a -> String
fmt_num n = if n < 100 
   then printf "%.2f" n
   else commaint (round n :: Integer)

-- Measure clock frequency, spinning rather than sleeping to try to
-- stay on the same core.
measureFreq :: IO Int64
measureFreq = do 
  let second = 1000 * 1000 * 1000 * 1000 -- picoseconds are annoying
  t1 <- rdtsc 
  start <- getCPUTime
  let loop !n !last = 
       do t2 <- rdtsc 
 when (t2 < last) $
      putStrLn$ "COUNTERS WRAPPED "++ show (last,t2) 
 cput <- getCPUTime
 if (cput - start < second) 
  then loop (n+1) t2
  else return (n,t2)
  (n,t2) <- loop 0 t1
  putStrLn$ "  Approx getCPUTime calls per second: "++ commaint (n::Int64)
  when (t2 < t1) $ 
    putStrLn$ "WARNING: rdtsc not monotonically increasing, first "++show t1++" then "++show t2++" on the same OS thread"

  return$ fromIntegral (t2 - t1)

----------------------------------------------------------------------------------------------------
{-
  Results:

  Sandybridge macbook air, first WITHOUT -threaded:

     Approx getCPUTime calls per second: 324,905
   First doing a test on one thread:
     43,804,752 per/second average  [IORef incrs]               ~ 41.53 cycles
        461,589 per/second average  [traceEvent]                ~ 3,941 cycles
 
  Westmere 3.1 ghz 4core noHT, first WITHOUT -threaded:

     Approx getCPUTime calls per second: 558,696
     First doing a test on one thread:
86,823,292 per/second average  [IORef incrs]               ~ 35.63 cycles
  215,086 per/second average  [traceEvent]                ~ 14,383 cycles

  Next, both WITH -threaded:
  Sandybridge macbook air, -N2:
    First doing a test on one thread:
57,743,676 per/second average  [IORef incrs]               ~ 30.34 cycles
   425,853 per/second average  [traceEvent]                ~ 4,114 cycles
    Second, doing a test on all 2 threads at once 
50,622,096 per/second average  [IORef incrs]               ~ 34.61 cycles
   358,316 per/second average  [traceEvent]                ~ 4,889 cycles
    Finally, doing a test on 4 threads -- HYPERTHREADING
34,021,584 per/second average  [IORef incrs]               ~ 39.58 cycles
   246,782 per/second average  [traceEvent]                ~ 5,457 cycles

  Westmere 3.1 ghz 4core noHT:
    First doing a test on one thread:
74,393,471 per/second average  [IORef incrs]               ~ 40.56 cycles
   205,615 per/second average  [traceEvent]                ~ 14,673 cycles
    Doing a test on 2 threads at once 
76,597,784 per/second average  [IORef incrs]               ~ 39.39 cycles
    34,764 per/second average  [traceEvent]                ~ 86,786 cycles
    Doing a test on all 4 threads at once 
71,818,103 per/second average  [IORef incrs]               ~ 39.02 cycles
    10,360 per/second average  [traceEvent]                ~ 270,485 cycles

-}


Ryan Newton

unread,
Oct 28, 2011, 11:31:29 AM10/28/11
to parallel...@googlegroups.com, mona...@googlegroups.com
Very nice.  Can we take a moment to confirm a few things about the performance of traceEvent?  In the above example I'm sending it the same string repeatedly.  But based on the definition of traceEvent:

Btw -- "above example" here was unclear.  It referred to Duncan's example of (traceEvent "IORef #3") which would only be "above" if viewed in GMail thread view ;-).


Simon Marlow

unread,
Oct 31, 2011, 5:36:02 AM10/31/11
to parallel...@googlegroups.com, mona...@googlegroups.com

Cost-centre stack profiling doesn’t currently work with multiple processors (+RTS –N2 and greater).  I’m going to look into this as part of the profiling overhaul I’m currently working on.

 

Cheers,

                Simon

 

From: parallel...@googlegroups.com [mailto:parallel...@googlegroups.com] On Behalf Of Johan Tibell
Sent: 28 October 2011 16:26
To: parallel...@googlegroups.com
Cc: mona...@googlegroups.com
Subject: Re: Estimating contention on an IORef hammered with atomicModifyIORef

 

On Fri, Oct 28, 2011 at 6:15 AM, Ryan Newton <rrne...@gmail.com> wrote:

Is your reluctance because of the 20-100X+ runtime overhead?  (Disclosure: the group I was in at Intel develops the "Pin" binary instrumentation tool which is the basis for the heavyweight (rewriting based) performance/parallelism analysis tools provided by Intel.  But valgrind/cachegrind does similar stuff in OSS.)

 

I was more thinking of manual work on the part of the programmer. If there are tools that do it automatically that's much better.

 

 

  • Use GHC events to count operations on particular IORefs.  Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".

The trick with that third option is generating the model.


I'd like to see us output information about blocking on MVars/IORefs etc from the RTS so we can lock for hot locks in threadscope. I've seen such lock contention analysis systems in C++ before and I'd love to have them in Haskell. The basic version would be to tell the user (of threadscope) that many threads are blocked on MVar 123, where 123 is some opaque ID. That will not help them find the MVar in the source but at least they'd know that contention is a problem. The rolls-royce version would be to map this ID back to a source location.

 

Is the idea that by logging only threads blocking than all IORef accesses it wouldn't be so terrible in the average case to do this for all MVars/IORefs?  (Rather than just specific ones that are instrumented by the programmer.)

 

I couldn't think of a good use case for logging all accesses, that's all. At work we automatically track contention on all locks (in C++) and that seems to work fine.

 

As for the MVar 123 problem -- within individual programs I tend to use StableNames for figuring out when one MVar is the same as another during debugging.  But I know of no good trick to create a stable identity for an MVar between multiple executions of the same (deterministic) program.  Correlating back to source location would help, but of course many MVars could be coined from the same static newEmptyMVar occurrence.  If the "fork" mechanism could be hijacked it would perhaps be possible to give an MVar a deterministic identity based on a counter and its position in the fork tree.

 

What would be the best path forward for tracking source locations?  I take it that the simplest way is to use template haskell and replace "newEmptyMVar" with something like $(newEmptyMVar) which would grab the source location and generate trace events tagged with that location:

 

 

Here's an approach:

 

For blocking operations (e.g. takeMVar) record the amount of time spent waiting for a lock together with the stack trace* of the blocked thread. Saving the stack trace is almost as good as tracking the source location of the lock and can sometimes be more useful. To make this less expensive we can sample these events at a frequency of say 1%. Using this information we can show contention profile like so:

 

total time  total time (%)  cost center

     0.005             50%          foo

     0.003             30%          bar

 

and even make hierarchical charts like we currently do for CPU profiling.

 

* We can either try to use the current profiling cost stacks to get a stack trace or we could just get the innermost function using the same trick we use to get assert to output the current file and line number.

 

-- Johan

 

Reply all
Reply to author
Forward
0 new messages