Synchronizations in memory allocation?

198 views
Skip to first unread message

Andreas

unread,
Feb 10, 2012, 7:13:27 PM2/10/12
to parallel-haskell
Hi,

I'm looking into a scaling bottleneck in a multi-threaded server that
I am working on. I started to suspect that my problem had to do with a
scaling problem in allocating memory. I wrote a micro-benchmark to
explore the issue. The result surprised me.

The test program forks N worker threads each running a loop S times.
In each iteration of the loop, a worker allocates some memory. The
memory immediately becomes garbage. The worker threads do not
synchronize or share any data. I tried two variants of this, one where
each iteration allocates memory using "mallocPlainForeignPtrBytes
1024" (call this version v1) and another in which it uses "newArray
(1,250) 0 :: IO (IOUArray Int Int)", (call this v2). Here are the run
times for different N and S where N*S is always 10 million, so that
the total work done by the program is the same each time:

N S v1 v2
1 10M 1.4s 9.2s
2 5M 1.5s 5.5s
4 2.5M 6.2s 3.0s
8 1.25M 9.1s 2.4s

"M" means million. "s" means seconds. I used as many cores as Haskell
threads (i.e. equal to N).

The times are surprising. I would expect the run time to drop by half
going from 1 to 2 (and 2 to 4, etc), but for v1 the run times actually
increase. The behavior with IOUArray makes more sense. What causes
the behavior with mallocPlainForeignPtrBytes? Is there some
synchronization going on in the RTS when allocating foreign pointers?
I noticed the gc_alloc_block_sync count increases as I increase cores/
threads. Is this related? By the way, the reason I am using
mallocPlainForeignPtrBytes is because that is what ByteString uses to
create a new ByteString (createAndTrim eventually calls this), and my
code uses ByteString.

Here is the code:

{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE BangPatterns #-}

import System.Console.CmdArgs
import Control.Concurrent
import Foreign.ForeignPtr
import Data.Array.MArray
import Data.Array.IO
import GHC.ForeignPtr (mallocPlainForeignPtrBytes)

data AllocTest = AllocTest { numThreads :: Int, numLoops :: Int }
deriving (Show, Data, Typeable)

args0 = AllocTest { numThreads = def &= help "number of threads" &=
typ "NUMBER"
, numLoops = def &= help "number of loops to run" &=
typ "NUMBER"
}

main = do args <- cmdArgs args0
print args
forkJoin (replicate (numThreads args) (busyWorkerFP
(numLoops args)))
putStrLn "Test done"


busyWorkerA :: Int -> IO ()
busyWorkerA n_loops = go 0
where go !n | n >= n_loops = return ()
| otherwise = do p <- newArray (1,250) 0 :: IO
(IOUArray Int Int)
go (n+1)

busyWorkerFP :: Int -> IO ()
busyWorkerFP n_loops = go 0
where go !n | n >= n_loops = return ()
| otherwise = do p <- mallocPlainForeignPtrBytes
1024
go (n+1)

forkJoin :: [IO a] -> IO [a]
forkJoin xs =
do vs <- sequence [ do {v <- newEmptyMVar; forkIO (x >>= putMVar v);
return v } | x <- xs ]
sequence [ takeMVar v | v <- vs ]

I compiled with "ghc --make -O -threaded -rtsopts AllocTest.hs -o hs-
alloc-test" using ghc 7.4.1 running on Linux. I ran it like this:
"time ./hs-alloc-test --numthreads=2 --numloops=5000000 +RTS -s -N2 -
A128m". I also ran it with event logging and looked at it in
Threadscope, but that does show any obvious problems.

Thanks,
Andreas

Andreas Voellmy

unread,
Feb 10, 2012, 11:32:01 PM2/10/12
to parallel-haskell
The implementation of mallocPlainForeignPtrBytes uses newPinnedByteArray#, which seems to be the culprit. I tried using  newByteArray#, instead and the performance now at least does not degrade when the work is divided among the Haskell threads. In detail, I added the following two worker loop variants, one for each of these methods: 

busyWorkerB :: Int -> IO ()          
busyWorkerB n_loops = go 0
  where go !n | n >= n_loops = return ()
              | otherwise    = 
          do p <- (IO $ \s ->
                    case newPinnedByteArray# 1024# s      of 
                      { (# s', mbarr# #) ->
                           (# s', () #)
                      }         
                  )
             go (n+1)

busyWorkerC :: Int -> IO ()          
busyWorkerC n_loops = go 0
  where go !n | n >= n_loops = return ()
              | otherwise    = 
          do p <- (IO $ \s ->
                    case newByteArray# 1024# s      of 
                      { (# s', mbarr# #) ->
                           (# s', () #)
                      }         
                  )
             go (n+1)

Then the times for busyWorkerB are:
N=1, S=10M: 1.429s
N=4, S=2.5M: 6.055s
N=8, S=1.25M: 9.585s

But the times for busyWorkerC are:
N=1, S=10M: 0.668s
N=4, S=2.5M: 0.591s
N=8, S=1.25M: 0.767s

newByteArray# and newPinnedByteArray# seem to call allocate(Capability *cap, lnat n) and allocatePinned(Capability *cap, lnat n) respectively. Both of these seem take the sm_mutex lock for some executions, although allocate does have a comment that it doesn't take the lock in the "common case". So maybe allocatePinned takes this lock more often? 

-Andreas

Simon Marlow

unread,
Feb 13, 2012, 5:58:51 AM2/13/12
to andreas...@gmail.com, parallel-haskell
You're allocating 1KB repeatedly on multiple threads. GHC manages
memory in blocks of 4KB, and taking into account that each 1KB
allocation needs a couple of words of header, you need to allocate a new
block every 4th allocation.

allocatePinned() gets new blocks from the global block allocator, which
is protected by a lock. Obviously this lock is going to be pretty
heavily contended in your program. The reason that newByteArray# scales
better (or less badly) is that it gets new blocks from the nursery
rather than the global block allocator, and the nursery is a purely
local data structure.

So I changed allocatePinned() to work in the same way as allocate(), and
observed a nice improvement in your benchmark.

Before:

-N1: 1.43s
-N2: 9.02s
-N4: 22.71s

After:

-N1: 1.13s
-N2: 1.78s
-N4: 2.64s

Bear in mind that allocating memory at this rate (10+Gb/s) is a stress
test for any runtime. I don't expect your real program will be affected
this much.

The changes are pretty small, I might try to get them into 7.4.2.

Cheers,
Simon

> import Data.Array.IO <http://Data.Array.IO>

Andreas Voellmy

unread,
Feb 13, 2012, 8:58:22 AM2/13/12
to parallel-haskell
On Mon, Feb 13, 2012 at 5:58 AM, Simon Marlow <marl...@gmail.com> wrote:

Bear in mind that allocating memory at this rate (10+Gb/s) is a stress test for any runtime.  I don't expect your real program will be affected this much.

This test program is somewhat artificial, but something very similar actually arose in a program I am working on. My program has several Haskell threads each reading a different TCP socket. I was using the recv :: Socket -> Int -> IO ByteString function in Network.Socket.ByteString.  This function uses Data.ByteString.Internal.createAndTrim to create a new byte string which is then passed to the C recv function to fill in with data received from the socket. createAndTrim uses mallocPlainForeignPtrBytes, which ends up calling allocatePinned.  My program performance starts to degrade after about 20 cores, and I believe this is a major culprit. I am getting around the issue now by allocating some pinned memory for each socket, and repeatedly receiving data into that buffer.

Out of curiosity, do you think 10Gb/s is a challenge for the allocator regardless of the number of cores used? Or will that scale with the number of cores? 
 
The changes are pretty small, I might try to get them into 7.4.2.


Great :)
 
-Andreas

Simon Marlow

unread,
Feb 13, 2012, 9:20:32 AM2/13/12
to andreas...@gmail.com, parallel-haskell
On 13/02/2012 13:58, Andreas Voellmy wrote:

> This test program is somewhat artificial, but something very similar
> actually arose in a program I am working on. My program has several
> Haskell threads each reading a different TCP socket. I was using the
> recv :: Socket -> Int -> IO ByteString function in
> Network.Socket.ByteString. This function uses
> Data.ByteString.Internal.createAndTrim to create a new byte string which
> is then passed to the C recv function to fill in with data received from
> the socket. createAndTrim uses mallocPlainForeignPtrBytes, which ends up
> calling allocatePinned. My program performance starts to degrade after
> about 20 cores, and I believe this is a major culprit. I am getting
> around the issue now by allocating some pinned memory for each socket,
> and repeatedly receiving data into that buffer.

If you're still getting performance improvements up to 20 cores on a
non-trivial program, I'd say that's pretty good (many of our benchmarks
start to drop off well before that).

> Out of curiosity, do you think 10Gb/s is a challenge for the allocator
> regardless of the number of cores used? Or will that scale with the
> number of cores?

It depends very much on the allocator, but scaling allocation on
multiple cores is a well-known problem for managed runtimes. See for
example:

http://arnetminer.org/viewpub.do?pid=1253606

In GHC we do pretty well by (a) having small per-core nurseries and (b)
paying attention to locality, so that minor GC only touches core-local data.

Cheers,
Simon

Duncan Coutts

unread,
Feb 13, 2012, 10:45:32 AM2/13/12
to marl...@gmail.com, andreas...@gmail.com, parallel-haskell
On 13 February 2012 10:58, Simon Marlow <marl...@gmail.com> wrote:

> allocatePinned() gets new blocks from the global block allocator, which is
> protected by a lock.  Obviously this lock is going to be pretty heavily
> contended in your program.  The reason that newByteArray# scales better (or
> less badly) is that it gets new blocks from the nursery rather than the
> global block allocator, and the nursery is a purely local data structure.
>
> So I changed allocatePinned() to work in the same way as allocate(), and
> observed a nice improvement in your benchmark.

How do you allocate pinned memory from the nursary? I didn't think
that was possible with the current heap / GC design. I remember some
more flexible stuff from the local heaps paper, but I didn't think
that was being used yet.

Duncan

Simon Marlow

unread,
Feb 15, 2012, 3:53:05 AM2/15/12
to Duncan Coutts, andreas...@gmail.com, parallel-haskell

Duncan & I talked about this on IRC, but to summarise for other members
of the list: the change I made is for the pinned memory allocator to
steal complete blocks from the nursery rather than allocating them from
the global block allocator, thus avoiding a lock on this code path.
Since it is only stealing complete blocks, there is no issue with
intermingling pinned and unpinned objects in the same block, which is
something GHC's memory manager doesn't currently support.

The pinned memory allocator can now empty the nursery, and the GC will
then have to refill the nursery by allocating new blocks to replace the
stolen ones. Overall, it seems to be a win to do it this way though.

Cheers,
Simon

Ryan Newton

unread,
Mar 21, 2012, 2:33:48 PM3/21/12
to marl...@gmail.com, Duncan Coutts, andreas...@gmail.com, parallel-haskell
Here's a quick question:

What is the reason for GHC managing all this pinned memory for foreign pointers itself rather than using an external C malloc/free implementation and thus keeping disjoint Haskell and C heaps?

We've been thinking about taking one of our benchmarks that uses storable vectors and tweaking it to call "malloc" to create the vectors which would allow us to try hoard or TBB scalable allocator.

But... if there's well known reason that this is bad we won't bother trying it.

One of the things we're interested in is how to do better on NUMA platforms.  We've got a NUMA-aware work-stealing scheduler now for monad-par, but it isn't really helping much yet.  So we need to answer the question of how well our memory is being localized to socket-preferred physical addresses.

We've been looking at libs like hwloc.  Also we wanted to at least look into what GHC's my_mmap is doing (in OSMem.c) to see if anything more NUMA-aware would be possible...

Cheers,
  -Ryan

Johan Tibell

unread,
Mar 21, 2012, 3:41:58 PM3/21/12
to rrne...@gmail.com, marl...@gmail.com, Duncan Coutts, andreas...@gmail.com, parallel-haskell
On Wed, Mar 21, 2012 at 11:33 AM, Ryan Newton <rrne...@gmail.com> wrote:
> Here's a quick question:
>
> What is the reason for GHC managing all this pinned memory for foreign
> pointers itself rather than using an external C malloc/free implementation
> and thus keeping disjoint Haskell and C heaps?
>
> We've been thinking about taking one of our benchmarks that uses storable
> vectors and tweaking it to call "malloc" to create the vectors which would
> allow us to try hoard or TBB scalable allocator.
>
> But... if there's well known reason that this is bad we won't bother trying
> it.

I don't know of any good reason not to try it. One downside I can
think of is that GHC no longer knows how much memory you're holding on
to.

-- Johan

Ryan Newton

unread,
Mar 21, 2012, 4:59:58 PM3/21/12
to Johan Tibell, marl...@gmail.com, Duncan Coutts, andreas...@gmail.com, parallel-haskell
> We've been thinking about taking one of our benchmarks that uses storable
> vectors and tweaking it to call "malloc" to create the vectors which would
> allow us to try hoard or TBB scalable allocator.
>
> But... if there's well known reason that this is bad we won't bother trying
> it.

I don't know of any good reason not to try it. One downside I can
think of is that GHC no longer knows how much memory you're holding on
to.

True, but you also get into this situation if you're using any foreign libraries that do their own allocation internally, right?

  -Ryan
 

Johan Tibell

unread,
Mar 21, 2012, 5:06:06 PM3/21/12
to rrne...@gmail.com, marl...@gmail.com, Duncan Coutts, andreas...@gmail.com, parallel-haskell
On Wed, Mar 21, 2012 at 1:59 PM, Ryan Newton <rrne...@gmail.com> wrote:
> True, but you also get into this situation if you're using any foreign
> libraries that do their own allocation internally, right?

Sure. It's not a big downside. It's just nice if the GC can know the
current memory pressure.

Simon Marlow

unread,
Mar 22, 2012, 6:27:53 AM3/22/12
to rrne...@gmail.com, Duncan Coutts, andreas...@gmail.com, parallel-haskell
On 21/03/2012 18:33, Ryan Newton wrote:
> Here's a quick question:
>
> What is the reason for GHC managing all this pinned memory for foreign
> pointers itself rather than using an external C malloc/free
> implementation and thus keeping disjoint Haskell and C heaps?
>
> We've been thinking about taking one of our benchmarks that uses
> storable vectors and tweaking it to call "malloc" to create the vectors
> which would allow us to try hoard or TBB scalable allocator.
>
> But... if there's well known reason that this is bad we won't bother
> trying it.

Basically, mallocForeignPtrBytes is much faster than malloc()/free().
There are a few reasons for this:

- GHC's allocate() is much faster than malloc(), because it is
almost just bumping a pointer.

- The GC recovers the memory, so there's no need for free().

Indeed, this is why I added mallocForeignPtr in the first place, because
I was concerned about the performance of using ForeignPtrs with
malloc/free in FFI code like Gtk.

Having said that, you're welcome to try malloc/free instead, they're
even provided for you: see Foreign.Marshal.Alloc. When you make a
ForeignPtr from malloc'd memory, you have to specify free as the finalizer.

> One of the things we're interested in is how to do better on NUMA
> platforms. We've got a NUMA-aware work-stealing scheduler now for
> monad-par, but it isn't really helping much yet. So we need to answer
> the question of how well our memory is being localized to
> socket-preferred physical addresses.
>
> We've been looking at libs like hwloc. Also we wanted to at least look
> into what GHC's my_mmap is doing (in OSMem.c) to see if anything more
> NUMA-aware would be possible...

Ah yes, so NUMA is something we haven't looked at in the RTS yet. At
the very least you would want to ensure that each nursery is allocated
from core-local memory.

Here's a possible plan. We currently have one global pool of blocks,
managed by the block allocator (BlockAlloc.c). Make this one pool per
Capability. We would still need a lock on each pool, because although
we can ensure that a Capability only allocates from its own pool, it
will be difficult to ensure that a Capability only ever frees its own
blocks.

We have to keep track of which pool each block comes from, with some
bits in the BDescr, so we know which pool to free it back to. We can't
just free blocks to any pool, because the coalescing will go wrong.

Having done all this, you can then make sure each pool allocates from
core-local memory using whatever API there is for this, and make sure
that when nurseries are allocated that the memory comes from the right pool.

There would be a slight memory overhead for doing this: 1MB per core or
so. But there would probably be some benefit to the parallel GC, which
sometimes contends for the lock on the block allocator.

Cheers,
Simon

> Cheers,
> -Ryan
>
>
>
>
> On Wed, Feb 15, 2012 at 3:53 AM, Simon Marlow <marl...@gmail.com
> <mailto:marl...@gmail.com>> wrote:
>
> On 13/02/2012 15:45, Duncan Coutts wrote:
>
> On 13 February 2012 10:58, Simon Marlow<marl...@gmail.com

Ryan Newton

unread,
Mar 22, 2012, 10:23:34 AM3/22/12
to Simon Marlow, Duncan Coutts, andreas...@gmail.com, parallel-haskell
We've been looking at libs like hwloc.  Also we wanted to at least look
into what GHC's my_mmap is doing (in OSMem.c) to see if anything more
NUMA-aware would be possible...

Ah yes, so NUMA is something we haven't looked at in the RTS yet.  At the very least you would want to ensure that each nursery is allocated from core-local memory.

Here's a possible plan.  We currently have one global pool of blocks, managed by the block allocator (BlockAlloc.c).  Make this one pool per Capability.  We would still need a lock on each pool, because although we can ensure that a Capability only allocates from its own pool, it will be difficult to ensure that a Capability only ever frees its own blocks.

This definitely sounds like an interesting plan to explore.  It looks like there would be a design tradeoff:
  • Make it one pool per Capability (i.e. 32 pools on 32 cores)
  • Make it one pool per socket (more like 4 pools) 
There's no evidence of this yet, but one hypothesis is that some of our observed Intel/AMD differences are related to NUMA issues.

  -Ryan


Simon Marlow

unread,
Mar 22, 2012, 10:26:52 AM3/22/12
to rrne...@gmail.com, Duncan Coutts, andreas...@gmail.com, parallel-haskell
On 22/03/2012 14:23, Ryan Newton wrote:
> We've been looking at libs like hwloc. Also we wanted to at
> least look
> into what GHC's my_mmap is doing (in OSMem.c) to see if anything
> more
> NUMA-aware would be possible...
>
>
> Ah yes, so NUMA is something we haven't looked at in the RTS yet.
> At the very least you would want to ensure that each nursery is
> allocated from core-local memory.
>
> Here's a possible plan. We currently have one global pool of
> blocks, managed by the block allocator (BlockAlloc.c). Make this
> one pool per Capability. We would still need a lock on each pool,
> because although we can ensure that a Capability only allocates from
> its own pool, it will be difficult to ensure that a Capability only
> ever frees its own blocks.
>
>
> This definitely sounds like an interesting plan to explore. It looks
> like there would be a design tradeoff:
>
> * Make it one pool per Capability (i.e. 32 pools on 32 cores)
> * Make it one pool per socket (more like 4 pools)

>
> There's no evidence of this yet, but one hypothesis is that some of our
> observed Intel/AMD differences are related to NUMA issues.

Definitely: with the pools still being subject to a per-pool lock, you
would be able to experiment with sharing pools between Capabilities in
whatever way you like.

Cheers,
Simon

Andreas

unread,
Jun 22, 2012, 9:14:11 AM6/22/12
to parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts, andreas...@gmail.com
Has anyone made progress on this (NUMA-aware allocation in GHC)? Is there code that I could try out?

-Andreas

Simon Marlow

unread,
Jun 25, 2012, 9:12:25 AM6/25/12
to andreas...@gmail.com, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
On 22/06/2012 14:14, Andreas wrote:
> Has anyone made progress on this (NUMA-aware allocation in GHC)? Is
> there code that I could try out?

Not yet, sorry.

Cheers,
Simon

Andreas Voellmy

unread,
Jun 26, 2012, 3:44:01 PM6/26/12
to Simon Marlow, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
OK thanks. I ask because Ryan mentioned earlier in this thread that he saw performance differences between AMD and Intel machines and that they might be due to NUMA memory allocation issues. I have been using an AMD server for some time now with good results, and I just started working with an 8 socket Intel server with worse performance, despite it having faster processors. The problem might NUMA memory allocation related, but I'm not sure - it could be something else. I just thought I'd check in to see if there something easy I could try. 

One thing I've noticed is that GCs take long when running with lots of cores (e.g. 40). Oddly enough almost all of the GC time is spent in clearNurseries(), which doesn't seem to be doing very much - just looping through the blocks in each core's nursery and setting the block's free pointer to the start pointer. 

-Andreas

Andreas Voellmy

unread,
Jun 27, 2012, 1:55:32 AM6/27/12
to Simon Marlow, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
This (clearNurseries() being "slow") may be an issue that only affects users using a large allocation area. I was using -A128m in my test runs with 40 caps. In this case, clearNurseries has to clear 40*32678 blocks. It takes about 100ms in my runs, and this averages to 76ns per block, which seems reasonable. 

One way to reduce this time is to use a parallel-for loop as the outer loop of clearNurseries(). I tried this and it works fairly well: my first attempt at one thread per nursery reduced the times to about 20ms for the same allocation area and number of cores. 

Another option would be to somehow eliminate the need for clearNurseries() or reduce the number of blocks it has to work on. Would it be possible for blocks that are put back on to the nurseries lists to have these fields cleared when they are linked into the list? 

-Andreas

Simon Marlow

unread,
Jun 27, 2012, 8:15:19 AM6/27/12
to Andreas Voellmy, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
On 27/06/2012 06:55, Andreas Voellmy wrote:
> This (clearNurseries() being "slow") may be an issue that only affects
> users using a large allocation area. I was using -A128m in my test runs
> with 40 caps. In this case, clearNurseries has to clear 40*32678 blocks.
> It takes about 100ms in my runs, and this averages to 76ns per block,
> which seems reasonable.
>
> One way to reduce this time is to use a parallel-for loop as the outer
> loop of clearNurseries(). I tried this and it works fairly well: my
> first attempt at one thread per nursery reduced the times to about 20ms
> for the same allocation area and number of cores.
>
> Another option would be to somehow eliminate the need for
> clearNurseries() or reduce the number of blocks it has to work on. Would
> it be possible for blocks that are put back on to the nurseries lists to
> have these fields cleared when they are linked into the list?

Aha! We should certainly not be doing clearNurseries() on just one
thread, because that will cause cache lines to be moved between cores.
The best solution would be to have each GC thread clear its own nursery.
However clearNurseries() also calculates the amount of allocation,
which needs to be communicated back to the master thread.

It *might* be ok to have each GC thread clear its own nursery in
gcWorkerThread(), assuming that the rest of the GC won't get upset that
the bd->free pointers have been reset (I *think* that's ok). Also if
sanity checking is on we can't do this, because sanity checking causes
clearNurseries() to overwrite the nursery with 0xaa, so we would have to
back off to the existing clearNurseries() in that case.

(BTW, sorry for not getting around to looking at your scheduler patches
yet, I've been busy with some teaching and now I'm taking some time off,
but I plan to be back in action again soon).

Cheers,
Simon

Andreas Voellmy

unread,
Jun 27, 2012, 8:34:53 AM6/27/12
to Simon Marlow, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
On Wed, Jun 27, 2012 at 8:15 AM, Simon Marlow <marl...@gmail.com> wrote:
On 27/06/2012 06:55, Andreas Voellmy wrote:
This (clearNurseries() being "slow") may be an issue that only affects
users using a large allocation area. I was using -A128m in my test runs
with 40 caps. In this case, clearNurseries has to clear 40*32678 blocks.
It takes about 100ms in my runs, and this averages to 76ns per block,
which seems reasonable.

One way to reduce this time is to use a parallel-for loop as the outer
loop of clearNurseries(). I tried this and it works fairly well: my
first attempt at one thread per nursery reduced the times to about 20ms
for the same allocation area and number of cores.

Another option would be to somehow eliminate the need for
clearNurseries() or reduce the number of blocks it has to work on. Would
it be possible for blocks that are put back on to the nurseries lists to
have these fields cleared when they are linked into the list?

Aha! We should certainly not be doing clearNurseries() on just one thread, because that will cause cache lines to be moved between cores. The best solution would be to have each GC thread clear its own nursery.  However clearNurseries() also calculates the amount of allocation, which needs to be communicated back to the master thread.

It *might* be ok to have each GC thread clear its own nursery in gcWorkerThread(), assuming that the rest of the GC won't get upset that the bd->free pointers have been reset (I *think* that's ok).  Also if sanity checking is on we can't do this, because sanity checking causes clearNurseries() to overwrite the nursery with 0xaa, so we would have to back off to the existing clearNurseries() in that case.

(BTW, sorry for not getting around to looking at your scheduler patches yet, I've been busy with some teaching and now I'm taking some time off, but I plan to be back in action again soon).

Cheers,
       Simon


I was actually just looking into whether clearNurseries() can be totally eliminated.  I think one might be able to get rid of it by resetting free=start lazily, when the block is first used by an allocator. In particular, allocate() controls when a block is made the current allocation block for a cap, and at this point it can set free=start. 

The problem with this is that we don't get to collect allocated stats after a GC. Maybe this could be done in allocate() as well? 

-Andreas

Simon Marlow

unread,
Jun 27, 2012, 9:08:36 AM6/27/12
to Andreas Voellmy, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
I think it is likely to be more complicated than that. Blocks from the
nursery are stolen by multiple different allocators: the ordinary heap
allocator, allocate(), and allocatePinned() at least.

Hmm, perhaps a better model would be to have two lists for the nursery:
one list of "empty" blocks and another of used blocks, and the
convention is that we ignore the bd->free value in the "empty" blocks
and reset it when we first use it. One slight disadvantage is that it
is a good idea to keep the nursery blocks ordered, because we get to
take maximum advantage of automatic prefetching, but that is only a
small effect.

> The problem with this is that we don't get to collect allocated stats
> after a GC. Maybe this could be done in allocate() as well?

Not just in allocate(), no. Haskell threads allocate by bumping their
heap pointers, for example.

Cheers,
Simon

> -Andreas
>
>
>
> -Andreas
>
> On Tue, Jun 26, 2012 at 3:44 PM, Andreas Voellmy
> <andreas...@gmail.com <mailto:andreas...@gmail.com>
> <mailto:andreas.voellmy@gmail.__com

Andreas Voellmy

unread,
Jun 27, 2012, 12:02:58 PM6/27/12
to Simon Marlow, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
On Wed, Jun 27, 2012 at 8:15 AM, Simon Marlow <marl...@gmail.com> wrote:
On 27/06/2012 06:55, Andreas Voellmy wrote:
This (clearNurseries() being "slow") may be an issue that only affects
users using a large allocation area. I was using -A128m in my test runs
with 40 caps. In this case, clearNurseries has to clear 40*32678 blocks.
It takes about 100ms in my runs, and this averages to 76ns per block,
which seems reasonable.

One way to reduce this time is to use a parallel-for loop as the outer
loop of clearNurseries(). I tried this and it works fairly well: my
first attempt at one thread per nursery reduced the times to about 20ms
for the same allocation area and number of cores.

Another option would be to somehow eliminate the need for
clearNurseries() or reduce the number of blocks it has to work on. Would
it be possible for blocks that are put back on to the nurseries lists to
have these fields cleared when they are linked into the list?

Aha! We should certainly not be doing clearNurseries() on just one thread, because that will cause cache lines to be moved between cores. The best solution would be to have each GC thread clear its own nursery.  However clearNurseries() also calculates the amount of allocation, which needs to be communicated back to the master thread.

It *might* be ok to have each GC thread clear its own nursery in gcWorkerThread(), assuming that the rest of the GC won't get upset that the bd->free pointers have been reset (I *think* that's ok).  Also if sanity checking is on we can't do this, because sanity checking causes clearNurseries() to overwrite the nursery with 0xaa, so we would have to back off to the existing clearNurseries() in that case.

OK, I just hacked this up and it looks really good :) With the parameters I described above I am getting about 15ms pauses, down from about 115ms before the change. My change involved having each gcthread clear the nursery of its capability right after it returns from scavenge_until_all_done() call. Same with the main GarbageCollect() procedure. Then GarbageCollect calls shutdown_gc_threads() and when this returns all the nurseries have been cleared. Each gcthread fills in a new member of the gcthread_ struct indicating how much it allocated and this is read later in the main GarbageCollect() procedure to increment allocated. I'll post my code after some more testing and polishing.

-Andreas

Andreas Voellmy

unread,
Jun 27, 2012, 6:04:07 PM6/27/12
to Simon Marlow, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
I pushed my code here:


I made the change on top of a branch that contains my previous Scheduler changes (which was based on 7.4.1 branch), but these changes to how GC clears nurseries are totally independent of them. 

-Andreas

Simon Marlow

unread,
Jul 11, 2012, 4:30:04 AM7/11/12
to Andreas Voellmy, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
I committed an equivalent patch yesterday:

http://hackage.haskell.org/trac/ghc/changeset/713cf473de8a2ad7d0b8195d78860c25fec41839

Thanks again for identifying the problem.

Cheers,
Simon
> <mailto:andreas...@gmail.com>
> <mailto:andreas.voellmy@gmail.__com
> <mailto:marl...@gmail.com <mailto:marl...@gmail.com>>>

Andreas Voellmy

unread,
Jul 11, 2012, 8:58:28 AM7/11/12
to Simon Marlow, parallel...@googlegroups.com, rrne...@gmail.com, Duncan Coutts
Cool. I think your patch misses misses out on some potential improvement though. In your patch the main gc thread will wait for all gcWorkers to finish clearing the nursery (with the call to shutdown_gc_threads) before going on to do various things, including clearing its own nursery. As a result, the main gc thread is often waiting a while for the other threads to clear their nurseries, and then later all the other threads are waiting for the main gc thread to clear its nursery (which tends to have completely used its nursery and hence takes longest - see next paragraph). I found it better to have the main gc thread clear its nursery before calling shutdown_gc_threads (hopefully this didn't break anything!), and then gather the allocated counts from the gcWorkers after the shutdown_gc_threads barrier.

Also, is it possible to terminate the loop in clearNursery() early? It seems like you could terminate it as soon as you find the first block where the free and start pointers coincided. I added a patch for this in my github. I didn't see any big performance gain, but it seems like it can't hurt (assuming it doesn't break things), so I am using it (https://github.com/AndreasVoellmy/ghc-arv/commit/95dd315f64f71e041c01d63472d56c4e6921b1ea).

-Andreas

            <mailto:andreas.voellmy@gmail.com>
            <mailto:andreas.voellmy@gmail.__com
Reply all
Reply to author
Forward
0 new messages