Estimating contention on an IORef hammered with atomicModifyIORef


Ryan Newton

Oct 27, 2011, 11:20:59 AM
to parallel-haskell, mona...@googlegroups.com
Let's say I have a design where all threads frequently access a single IORef using atomicModifyIORef.

The standard reasoning is that this is a Bad Thing and a potential scaling bottleneck.  And in fact, monad-par and I'm sure many other packages are committing this sin.  But how do I measure *how bad* it is in a particular program in Haskell?

If you have good ideas here, please educate me.  As far as I can see there are three ways we could get at this information, only the last of which is a pure-Haskell solution:
  • Measure actual contention in terms of machine properties like cache coherence traffic (performance counters).  (But how then to isolate the information relevant to a particular IORef?)
  • Use binary rewriting to generate a detailed trace of execution, and then run that through a cache simulator (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.4016&rep=rep1&type=pdf)
  • Use GHC events to count operations on particular IORefs.  Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".
The trick with that third option is generating the model.

Ideas?

   -Ryan

Johan Tibell

Oct 27, 2011, 6:04:45 PM
to parallel...@googlegroups.com, mona...@googlegroups.com
On Thu, Oct 27, 2011 at 8:20 AM, Ryan Newton <rrne...@gmail.com> wrote:
Let's say I have a design where all threads frequently access a single IORef using atomicModifyIORef.

We have this today in the GHC I/O manager:
https://github.com/ghc/packages-base/blob/master/GHC/Event/Manager.hs#L154

I suspect this is a bottleneck. Even if the contention on the IORef isn't too bad, there's another potential concurrency issue: after a thread has CASed a new pointer (thunk) into the IORef, it goes ahead and evaluates it. If the thread is descheduled during that evaluation, it leaves a black hole behind. Now other threads come around trying to poke at the IORef and get blocked on the black hole (instead of on the CASed IORef)!
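Concretely, the pattern looks something like this (a minimal sketch, not the actual I/O manager code):

import Data.IORef
import Control.Exception (evaluate)

-- The CAS installs an unevaluated thunk (f x); whichever thread first
-- demands the result may be descheduled mid-evaluation, turning the
-- thunk into a black hole that blocks every other reader.
update :: IORef Int -> (Int -> Int) -> IO ()
update ref f = do
  new <- atomicModifyIORef ref (\x -> let y = f x in (y, y))
  _ <- evaluate new  -- forcing promptly narrows, but doesn't close, the window
  return ()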

(We also have a Data.IntMap in an MVar.)

The standard reasoning is that this is a Bad Thing and a potential scaling bottleneck.  And in fact, monad-par and I'm sure many other packages are committing this sin.  But how do I measure *how bad* it is in a particular program in Haskell?

I wish I knew. One reason I haven't worked more on scaling the I/O manager to more cores is that I don't know where the performance bottlenecks are at the moment.

If you have good ideas here, please educate me.  As far as I can see there are three ways we could get at this information, only the last of which is a pure-Haskell solution:
  • Measure actual contention in terms of machine properties like cache coherence traffic (performance counters).  (But how then to isolate the information relevant to a particular IORef?)
If you can figure out how to do this well please document it in a wiki page.
 
  • Use binary rewriting to generate a detailed trace of execution, and then run that through a cache simulator (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.4016&rep=rep1&type=pdf)
While perhaps doable, it sounds like a heavyweight approach. You might get away with doing it once for some really important lock in the RTS, but it's not something I'd like to do on a daily basis.
 
  • Use GHC events to count operations on particular IORefs.  Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".
The trick with that third option is generating the model.

I'd like to see us output information about blocking on MVars/IORefs etc. from the RTS so we can look for hot locks in threadscope. I've seen such lock-contention analysis systems in C++ before and I'd love to have them in Haskell. The basic version would be to tell the user (of threadscope) that many threads are blocked on MVar 123, where 123 is some opaque ID. That will not help them find the MVar in the source, but at least they'd know that contention is a problem. The Rolls-Royce version would be to map this ID back to a source location.

-- Johan

Duncan Coutts

Oct 28, 2011, 7:53:33 AM
to parallel...@googlegroups.com, mona...@googlegroups.com
On Thu, 2011-10-27 at 11:20 -0400, Ryan Newton wrote:
> Let's say I have a design where all threads frequently access a single IORef
> using atomicModifyIORef.
>
> The standard reasoning is that this is a Bad Thing and a potential scaling
> bottleneck. And in fact, monad-par and I'm sure many other packages are
> committing this sin. But how do I measure *how bad* it is in a particular
> program in Haskell?
>
> If you have good ideas here, please educate me. As far as I can see there
> are three ways we could get at this information, only the last of which is a
> pure-Haskell solution:


>    - Use GHC events to count operations on particular IORefs. Then put that
>      trace through our own model that reports whether the IORef is being used
>      acceptably, or is "hot".
>
> The trick with that third option is generating the model.

This sounds straightforward, at least in an ad-hoc way. Just stick a
traceEvent "IORef #3" next to each atomicModifyIORef (pick different
#numbers for each one, obviously) and then see which ones are the most
frequent using "ghc-events show | grep IORef".
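Concretely, something like this (a sketch; the event label and the counter function are made up):

import GHC.Exts (traceEvent)   -- traceEvent :: String -> IO ()  (pre-7.4)
import Data.IORef

-- One distinct event label per interesting IORef site:
bump :: IORef Int -> IO ()
bump ref = do
  traceEvent "IORef #3"
  atomicModifyIORef ref (\n -> (n+1, ()))

Compile with -eventlog, run with +RTS -l, and count the "IORef #3" lines in the output of ghc-events show.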

This will not tell you if it is *too* frequent, but it'll at least tell
you which ones are hotter than the others, and give you some orders of magnitude.

Try it. Let us know.

BTW, this is probably a good moment to advertise that ghc-7.4 will
come with a new and improved traceEvent. Instead of just being exported
from GHC.Exts with type String -> IO (), we now have:

Debug.Trace.traceEvent :: String -> a -> a
Debug.Trace.traceEventIO :: String -> IO ()

The first one is like Debug.Trace.trace and can be used in pure code.

Also it'll be possible to turn on/off the user events separately from
other events with a +RTS flag.

Duncan

Ryan Newton

Oct 28, 2011, 9:15:23 AM
to parallel...@googlegroups.com, mona...@googlegroups.com
While perhaps doable it sounds like a heavyweight approach. You might get away with doing it once for some really important lock in the RTS but it's not something I'd like to do on a daily basis. 

Is your reluctance because of the 20-100X+ runtime overhead?  (Disclosure: the group I was in at Intel develops the "Pin" binary instrumentation tool which is the basis for the heavyweight (rewriting based) performance/parallelism analysis tools provided by Intel.  But valgrind/cachegrind does similar stuff in OSS.)
 
  • Use GHC events to count operations on particular IORefs.  Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".
The trick with that third option is generating the model.

I'd like to see us output information about blocking on MVars/IORefs etc. from the RTS so we can look for hot locks in threadscope. I've seen such lock-contention analysis systems in C++ before and I'd love to have them in Haskell. The basic version would be to tell the user (of threadscope) that many threads are blocked on MVar 123, where 123 is some opaque ID. That will not help them find the MVar in the source, but at least they'd know that contention is a problem. The Rolls-Royce version would be to map this ID back to a source location.

Is the idea that by logging only blocking events, rather than all IORef accesses, it wouldn't be so terrible in the average case to do this for all MVars/IORefs?  (Rather than just specific ones that are instrumented by the programmer.)

As for the MVar 123 problem -- within individual programs I tend to use StableNames for figuring out when one MVar is the same as another during debugging.  But I know of no good trick to create a stable identity for an MVar between multiple executions of the same (deterministic) program.  Correlating back to source location would help, but of course many MVars could be coined from the same static newEmptyMVar occurrence.  If the "fork" mechanism could be hijacked it would perhaps be possible to give an MVar a deterministic identity based on a counter and its position in the fork tree.
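(The StableName trick, for reference -- a sketch; identities hold only within a single run, and hashes can collide, though equality on the StableNames themselves is exact:)

import System.Mem.StableName (makeStableName, hashStableName)
import Control.Concurrent.MVar

-- A printable within-run identity for an MVar.
mvarId :: MVar a -> IO Int
mvarId mv = hashStableName `fmap` makeStableName mv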

What would be the best path forward for tracking source locations?  I take it that the simplest way is to use Template Haskell and replace "newEmptyMVar" with something like $(newEmptyMVar), which would grab the source location and generate trace events tagged with that location:
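Roughly like so (a hypothetical sketch -- the splice name and event format are made up):

{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH
import Control.Concurrent.MVar
import Debug.Trace (traceEventIO)

-- $(newEmptyMVarLoc) behaves like newEmptyMVar but first emits an
-- event tagged with the source location of the splice.
newEmptyMVarLoc :: Q Exp
newEmptyMVarLoc = do
  loc <- location
  let tag = loc_filename loc ++ ":" ++ show (fst (loc_start loc))
  [| traceEventIO ("newEmptyMVar at " ++ tag) >> newEmptyMVar |]

-- Usage:  mv <- $(newEmptyMVarLoc)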


  -Ryan

Johan Tibell

Oct 28, 2011, 11:25:56 AM
to parallel...@googlegroups.com, mona...@googlegroups.com
On Fri, Oct 28, 2011 at 6:15 AM, Ryan Newton <rrne...@gmail.com> wrote:
Is your reluctance because of the 20-100X+ runtime overhead?  (Disclosure: the group I was in at Intel develops the "Pin" binary instrumentation tool which is the basis for the heavyweight (rewriting based) performance/parallelism analysis tools provided by Intel.  But valgrind/cachegrind does similar stuff in OSS.)

I was more thinking of manual work on the part of the programmer. If there are tools that do it automatically that's much better.
 
 
  • Use GHC events to count operations on particular IORefs.  Then put that trace through our own model that reports whether the IORef is being used acceptably, or is "hot".
The trick with that third option is generating the model.

I'd like to see us output information about blocking on MVars/IORefs etc. from the RTS so we can look for hot locks in threadscope. I've seen such lock-contention analysis systems in C++ before and I'd love to have them in Haskell. The basic version would be to tell the user (of threadscope) that many threads are blocked on MVar 123, where 123 is some opaque ID. That will not help them find the MVar in the source, but at least they'd know that contention is a problem. The Rolls-Royce version would be to map this ID back to a source location.

Is the idea that by logging only blocking events, rather than all IORef accesses, it wouldn't be so terrible in the average case to do this for all MVars/IORefs?  (Rather than just specific ones that are instrumented by the programmer.)

I couldn't think of a good use case for logging all accesses, that's all. At work we automatically track contention on all locks (in C++) and that seems to work fine.

As for the MVar 123 problem -- within individual programs I tend to use StableNames for figuring out when one MVar is the same as another during debugging.  But I know of no good trick to create a stable identity for an MVar between multiple executions of the same (deterministic) program.  Correlating back to source location would help, but of course many MVars could be coined from the same static newEmptyMVar occurrence.  If the "fork" mechanism could be hijacked it would perhaps be possible to give an MVar a deterministic identity based on a counter and its position in the fork tree.

What would be the best path forward for tracking source locations?  I take it that the simplest way is to use Template Haskell and replace "newEmptyMVar" with something like $(newEmptyMVar), which would grab the source location and generate trace events tagged with that location:


Here's an approach:

For blocking operations (e.g. takeMVar) record the amount of time spent waiting for a lock together with the stack trace* of the blocked thread. Saving the stack trace is almost as good as tracking the source location of the lock and can sometimes be more useful. To make this less expensive we can sample these events at a frequency of, say, 1%. Using this information we can show a contention profile like so:

total time  total time (%)  cost center
     0.005             50%          foo
     0.003             30%          bar

and even make hierarchical charts like we currently do for CPU profiling.

* We can either try to use the current profiling cost stacks to get a stack trace or we could just get the innermost function using the same trick we use to get assert to output the current file and line number.
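In library code the sampled version might look roughly like this (a sketch; the wrapper and site label are invented, and a real implementation would live in the RTS rather than wrap takeMVar):

import Control.Concurrent.MVar
import System.Random (randomRIO)
import Data.Time.Clock (getCurrentTime, diffUTCTime)
import Debug.Trace (traceEventIO)

-- Sample ~1% of takeMVar calls, recording how long each one blocked,
-- tagged with a caller-supplied site label.
takeMVarProfiled :: String -> MVar a -> IO a
takeMVarProfiled site mv = do
  sampled <- (== 0) `fmap` randomRIO (0, 99 :: Int)
  if not sampled
    then takeMVar mv
    else do
      t0 <- getCurrentTime
      x  <- takeMVar mv
      t1 <- getCurrentTime
      traceEventIO (site ++ ": blocked for " ++ show (diffUTCTime t1 t0))
      return x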

-- Johan

Ryan Newton

Oct 28, 2011, 11:29:52 AM
to parallel...@googlegroups.com, mona...@googlegroups.com
Try it. Let us know.

Will do!
 
BTW, this is probably a good moment to advertise that in ghc-7.4 will
come with a new and improved traceEvent. Instead of just being exported

Very nice.  Can we take a moment to confirm a few things about the performance of traceEvent?  In the above example I'm sending it the same string repeatedly.  But based on the definition of traceEvent:

traceEvent :: String -> IO ()
traceEvent msg = do
  withCString msg $ \(Ptr p) -> IO $ \s ->
    case traceEvent# p s of s' -> (# s', () #)

It looks like that [Char] list will be traversed every time I log an event to copy it to a C string.  Is that correct?  But if all I'm interested in is logging the occurrences of an event in time, I think a unique-ID (Int) is all I really want to push on the log.  Would it make sense to have variants of traceEvent with other types?

I was curious how much overhead such a traceEvent would incur, so I wrote the attached little microbenchmark.  There are results in the comments at the bottom of the file.

It prints the traceEvents per second per thread, and it runs twice -- once with a single thread and once with numCapabilities threads all hammering away.

The case where you simply increment an IORef in a loop shows good scalability.  (IF you do not allocate the IORefs consecutively -- take a look at that #if switch.)  But the max traceEvent throughput I'm seeing is about 500K/sec, AND when I tested it on a few Intel Westmere Linux workstations I saw a weird effect where traceEvents/sec dropped all the way to 34K/sec and 10K/sec at 2 and 4 threads respectively!  (Throughput per thread decreased much less on my Mac laptop, so it's arch/OS specific.  Odd.)

  -Ryan



{-# LANGUAGE BangPatterns, ScopedTypeVariables, CPP #-}

import Control.Exception
import Control.Monad
import Data.IORef
import Data.Int
import Data.List
import Data.List.Split  hiding (split)
import GHC.Exts (traceEvent)
--import Benchmarks.BinSearch
import System.Random
import System.CPUTime  (getCPUTime)
import System.CPUTime.Rdtsc
import Text.Printf
import GHC.Conc (forkIO, numCapabilities, threadDelay, killThread)
import Control.Concurrent.MVar

----------------------------------------------------------------------------------------------------
--    Main Script
----------------------------------------------------------------------------------------------------

main = do 
  putStrLn "Testing how many traceEvents we can do in a second."
--  binSearch 3 (0.99, 1.01) testTrace
  freq <- measureFreq

  putStrLn "\nFirst doing a test on one thread:"
  timeAndKillThreads 1 freq "IORef incrs" (return ())
  timeAndKillThreads 1 freq "traceEvent" (traceEvent "ConstString")

  putStrLn$ "\nSecond, doing a test on all "++show numCapabilities++" threads at once "
  timeAndKillThreads numCapabilities freq "IORef incrs" (return ())
  timeAndKillThreads numCapabilities freq "traceEvent" (traceEvent "ConstString")

  putStrLn "Done Testing."


--------------------------------------------------------------------------------
-- Helper and timing routines
--------------------------------------------------------------------------------

timeAndKillThreads :: Int -> Int64 -> String -> IO () -> IO ()
timeAndKillThreads numthreads freq msg action =
  do 

-- Note, if the IORefs are allocated on the main thread the throughput
-- in the parallel case plummets.  Presumably this is due to false
-- sharing as they get bump-allocated.
#if 0
     counters <- forM [1..numthreads] (const$ newIORef (1::Int64)) 
     tids <- forM counters $ \counter -> 
        forkIO $ infloop counter
#else
     mv <- newEmptyMVar
     tids <- forM [1..numthreads] $ \_ -> 
         forkIO $ do r <- newIORef (1::Int64)
                     putMVar mv r
                     infloop r
     counters <- forM [1..numthreads] (const$ takeMVar mv)
#endif

     threadDelay (1000*1000) -- One second
     mapM_ killThread tids

     finals <- mapM readIORef counters
--     printf "Across %d threads got these throughputs: %s\n" numthreads (show finals)
     let mean :: Double = fromIntegral (foldl1 (+) finals) / fromIntegral numthreads
         cycles_per :: Double = fromIntegral freq / mean

     printResult (round mean :: Int64) msg cycles_per

 where 
   infloop !counter = 
     do action
        incr counter
        infloop counter

   incr !counter = 
     do -- modifyIORef counter (+1) -- Not strict enough!
        c <- readIORef counter
        let c' = c+1
        _ <- evaluate c'
        writeIORef counter c'

printResult ::  Int64 -> String -> Double -> IO ()
printResult total msg cycles_per = 
     putStrLn$ "    "++ padleft 11 (commaint total) ++" per/second average  "++ padright 27 ("["++msg++"]") ++" ~ "
      ++ fmt_num cycles_per ++" cycles"


-- Readable large integer printing:
commaint :: Integral a => a -> String
commaint n = 
   reverse $ concat $
   intersperse "," $ 
   chunk 3 $ 
   reverse (show n)

padleft :: Int -> String -> String
padleft n str | length str >= n = str
padleft n str | otherwise       = take (n - length str) (repeat ' ') ++ str

padright :: Int -> String -> String
padright n str | length str >= n = str
padright n str | otherwise       = str ++ take (n - length str) (repeat ' ')


fmt_num :: (RealFrac a, PrintfArg a) => a -> String
fmt_num n = if n < 100 
   then printf "%.2f" n
   else commaint (round n :: Integer)

-- Measure clock frequency, spinning rather than sleeping to try to
-- stay on the same core.
measureFreq :: IO Int64
measureFreq = do 
  let second = 1000 * 1000 * 1000 * 1000 -- picoseconds are annoying
  t1 <- rdtsc 
  start <- getCPUTime
  let loop !n !last = 
       do t2 <- rdtsc
          when (t2 < last) $
            putStrLn$ "COUNTERS WRAPPED "++ show (last,t2)
          cput <- getCPUTime
          if (cput - start < second)
            then loop (n+1) t2
            else return (n,t2)
  (n,t2) <- loop 0 t1
  putStrLn$ "  Approx getCPUTime calls per second: "++ commaint (n::Int64)
  when (t2 < t1) $ 
    putStrLn$ "WARNING: rdtsc not monotonically increasing, first "++show t1++" then "++show t2++" on the same OS thread"

  return$ fromIntegral (t2 - t1)

----------------------------------------------------------------------------------------------------
{-
  Results:

  Sandybridge macbook air, first WITHOUT -threaded:

     Approx getCPUTime calls per second: 324,905
   First doing a test on one thread:
     43,804,752 per/second average  [IORef incrs]               ~ 41.53 cycles
        461,589 per/second average  [traceEvent]                ~ 3,941 cycles
 
  Westmere 3.1 GHz 4-core noHT, first WITHOUT -threaded:

     Approx getCPUTime calls per second: 558,696
   First doing a test on one thread:
     86,823,292 per/second average  [IORef incrs]               ~ 35.63 cycles
        215,086 per/second average  [traceEvent]                ~ 14,383 cycles

  Next, both WITH -threaded:
  Sandybridge macbook air, -N2:
    First doing a test on one thread:
     57,743,676 per/second average  [IORef incrs]               ~ 30.34 cycles
        425,853 per/second average  [traceEvent]                ~ 4,114 cycles
    Second, doing a test on all 2 threads at once:
     50,622,096 per/second average  [IORef incrs]               ~ 34.61 cycles
        358,316 per/second average  [traceEvent]                ~ 4,889 cycles
    Finally, doing a test on 4 threads -- HYPERTHREADING:
     34,021,584 per/second average  [IORef incrs]               ~ 39.58 cycles
        246,782 per/second average  [traceEvent]                ~ 5,457 cycles

  Westmere 3.1 GHz 4-core noHT:
    First doing a test on one thread:
     74,393,471 per/second average  [IORef incrs]               ~ 40.56 cycles
        205,615 per/second average  [traceEvent]                ~ 14,673 cycles
    Doing a test on 2 threads at once:
     76,597,784 per/second average  [IORef incrs]               ~ 39.39 cycles
         34,764 per/second average  [traceEvent]                ~ 86,786 cycles
    Doing a test on all 4 threads at once:
     71,818,103 per/second average  [IORef incrs]               ~ 39.02 cycles
         10,360 per/second average  [traceEvent]                ~ 270,485 cycles

-}


Ryan Newton

Oct 28, 2011, 11:31:29 AM
to parallel...@googlegroups.com, mona...@googlegroups.com
Very nice.  Can we take a moment to confirm a few things about the performance of traceEvent?  In the above example I'm sending it the same string repeatedly.  But based on the definition of traceEvent:

Btw -- "above example" here was unclear.  It referred to Duncan's example of (traceEvent "IORef #3") which would only be "above" if viewed in GMail thread view ;-).


Johan Tibell

Oct 28, 2011, 11:38:52 AM
to parallel...@googlegroups.com, mona...@googlegroups.com
On Fri, Oct 28, 2011 at 8:29 AM, Ryan Newton <rrne...@gmail.com> wrote:
BTW, this is probably a good moment to advertise that in ghc-7.4 will
come with a new and improved traceEvent. Instead of just being exported

Very nice.  Can we take a moment to confirm a few things about the performance of traceEvent?  In the above example I'm sending it the same string repeatedly.  But based on the definition of traceEvent:

traceEvent :: String -> IO ()
traceEvent msg = do
  withCString msg $ \(Ptr p) -> IO $ \s ->
    case traceEvent# p s of s' -> (# s', () #)

It looks like that [Char] list will be traversed every time I log an event to copy it to a C string.  Is that correct?  But if all I'm interested in is logging the occurrences of an event in time, I think a unique-ID (Int) is all I really want to push on the log.  Would it make sense to have variants of traceEvent with other types?

I was thinking the same. Doing anything performance-sensitive with String sounds like a bad idea.

-- Johan
 

Simon Marlow

Oct 31, 2011, 5:36:02 AM
to parallel...@googlegroups.com, mona...@googlegroups.com

Cost-centre stack profiling doesn't currently work with multiple processors (+RTS -N2 and greater).  I'm going to look into this as part of the profiling overhaul I'm currently working on.

 

Cheers,

                Simon

Simon Marlow

Oct 31, 2011, 9:18:01 AM
to parallel...@googlegroups.com

We do want to give users the ability to define and generate binary
events too, though the details of how to do this haven't been worked out
yet. We do have plans to make the event log format extensible though
(this was worked out at the ThreadScope summit earlier this year, and I
think Paul Bone is working on it).

Cheers,
Simon


Ryan Newton

Oct 31, 2011, 10:07:47 AM
to parallel...@googlegroups.com
Any ideas on why I saw such platform-dependent behavior between an Intel Sandybridge Mac laptop and an Intel Westmere Linux box (same GHC 7.2.1 version)?

I saw the trace events per-thread per-second take only a modest hit on the laptop going from one to two threads:

    425K -> 358K

But on the linux desktop, the trace event throughput was both slower to start, and collapsed by a factor of 20 going from 1 -> 2 -> 4 threads:

    205K -> 34K -> 10K 

I'll double check this with GHC head as soon as I get it successfully built on both platforms ;-).

Thanks,
  -Ryan

Simon Marlow

Oct 31, 2011, 10:11:19 AM
to parallel...@googlegroups.com

Just a thought, but where was the eventlog file being generated to?  Sending it to a network file system can have that effect ;)

 

Cheers,

                Simon

 


Ryan Newton

Oct 31, 2011, 3:50:07 PM
to parallel...@googlegroups.com
Excellent theory; I wish it were the case!  But, alas, presently I see:
  • Equally bad performance in /tmp and in a network mounted homedir.
  • Slightly *worse* performance if +RTS -l is not supplied (in which case it should write nowhere, correct?)
The code for the test is here:
I compiled and ran with:
    ghc-7.2.1 -O2 -eventlog -rtsopts traceTest.hs -o traceTest.exe -threaded
    ./traceTest.exe +RTS -l -N4

I tried again on a home desktop (3.33 GHz Nehalem, Ubuntu 11.04) with no network file system at all and got (for 1 -> 2 -> 4 threads):

   389K -> 343K -> 143K

So ... better.  The machines with the weird effects have RHEL 6.1.  To try to control for OS vs. arch I tried it on a Nehalem + RHEL machine (unfortunately 5.7, not 6.1).  This one exhibited a failure to scale as well.  It had 24 cores, so I measured 1, 2, 4, 8, 16, and 24 threads:

   205K -> 90K -> 40K -> 18K -> 8K -> 4.3K

That is, with two threads tracing away it only produced 90+90=180K events in a second -- less total events than on one thread.

Testing on other machines seemed to yield other bad results.  For example, on the "Core" architecture (Intel X7350), a 4-socket quad-core (RHEL 6.1), for 1, 2, 4, 8, 16 threads:

  130K -> 9.7K -> 3.4K -> 1.6K -> 0.7K

(I double-checked those on another machine of similar configuration and they stayed the same.)  That's a 185X drop in throughput going to 16 cores!  I checked on an AMD machine (4x4 Opteron 8356) as well:

  137K -> 110K -> 62K -> 19.7K -> 7.7K


What kind of numbers do you see?

Best,
  -Ryan

Simon Marlow

Nov 7, 2011, 8:56:49 AM
to parallel...@googlegroups.com, Ryan Newton
I've attached a better version of the test program. The results I get are:

Testing how many traceEvents we can do in a second.

Approx getCPUTime calls per second: 282,883

First doing a test on one thread:
  55,705,322 per/second average  [IORef incrs]  ~ 27.58 cycles
  34,066,487 per/second average  [traceEvent]   ~ 45.10 cycles

Second, doing a test on all 4 threads at once:
  51,708,077 per/second average  [IORef incrs]  ~ 29.71 cycles
  30,432,770 per/second average  [traceEvent]   ~ 50.48 cycles


So a slight slowdown with 4 threads, as we expect.  Changes I made:

 - use peek/poke for the counters instead of an IORef.  I'm not sure
   how much difference this makes, but I wanted to factor out
   generational GC and the possibility that multiple IORefs got
   onto the same cache line.  I did have to add some artificial
   allocation to the loop though, otherwise the killThread doesn't
   work.

 - use the primop traceEvent# directly.  This is the big one.  The
   real GHC.Exts.traceEvent is doing a lot of marshalling to convert
   the String into a CString, including encoding to UTF-8 (in 7.2.1
   and later).  The encoding costs a *lot*; in fact I want to look
   into this to find out why.

It's not clear to me why there was such a drop in performance with
multiple threads, but the answer lies somewhere in the UTF-8 encoding
that traceEvent was doing. Avoid that and the problems go away.
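Calling the primop directly looks roughly like this (a sketch; the wrapper name is mine -- the point is to marshal the label to a NUL-terminated CString once, outside the hot loop):

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (traceEvent#)
import GHC.Ptr (Ptr(..))
import GHC.IO (IO(..))
import Foreign.C.String (CString, newCString)

-- Trace a pre-marshalled CString, skipping the per-call
-- String -> UTF-8 CString conversion that GHC.Exts.traceEvent does.
traceEventCString :: CString -> IO ()
traceEventCString (Ptr addr) = IO $ \s ->
  case traceEvent# addr s of s' -> (# s', () #)

main :: IO ()
main = do
  label <- newCString "ConstString"          -- marshal once
  mapM_ (\_ -> traceEventCString label) [1..1000000::Int]

(Compiled with -eventlog and run with +RTS -l, as before.)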

Cheers,
Simon


traceTest.hs

Ryan Newton

Nov 7, 2011, 11:43:13 AM
to Simon Marlow, parallel...@googlegroups.com
Wow, thanks Simon for working your magic!

Here's an update for the 4 core desktop and 24 core machine I showed earlier:

old / new, for 1 -> 2 -> 4 and 1 -> 2 -> 4 -> 8 -> 16 -> 24 threads:

   old:  389K -> 343K -> 143K
   new:   64M ->  66M ->  65M

   old:  205K -> 90K -> 40K -> 18K -> 8K -> 4.3K
   new:   44M -> 42M -> 40M -> 36M -> 27M -> 22M

Being able to log events at 22 MHz on all 24 cores seems pretty nice!  (It's only a 2GHz machine too!)

Cheers,
  -Ryan





Ryan Newton

Mar 1, 2012, 7:11:35 AM
to Simon Marlow, parallel...@googlegroups.com
Hello,
 
 - use the primop traceEvent# directly.  This is the big one.  The
   real GHC.Exts.traceEvent is doing a lot of marshalling to convert
   the String into a CString, including encoding to UTF-8 (in 7.2.1
   and later).  The encoding costs a *lot*, in fact I want to look
   into this to find out why.

I started using traceEvent# again and I'm wondering -- is there any reason not to use a CStringLen instead of a CString?

The former would allow one to use ByteStrings for tracing without a copy (to add the NUL terminator).
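Something like this is what I have in mind (hypothetical -- traceEventLen# does not exist today; the current primop takes only a NUL-terminated Addr#):

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import Data.ByteString (ByteString)
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import GHC.Ptr (Ptr(..))
import GHC.IO (IO(..))
import GHC.Exts (Int(..))

-- Imagined primop:
--   traceEventLen# :: Addr# -> Int# -> State# s -> State# s
-- With it, a ByteString could be traced with no copy at all:
traceEventBS :: ByteString -> IO ()
traceEventBS bs =
  unsafeUseAsCStringLen bs $ \(Ptr addr, I# len) ->
    IO $ \s -> case traceEventLen# addr len s of s' -> (# s', () #)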

  -Ryan

Simon Marlow

Mar 2, 2012, 4:13:19 AM
to rrne...@gmail.com, parallel...@googlegroups.com

You're right, we should have another version that takes an unterminated
byte string and a length.

Cheers,
Simon

