allocatePinned() gets new blocks from the global block allocator, which
is protected by a lock. Obviously this lock is going to be pretty
heavily contended in your program. The reason that newByteArray# scales
better (or less badly) is that it gets new blocks from the nursery
rather than the global block allocator, and the nursery is a purely
local data structure.
So I changed allocatePinned() to work in the same way as allocate(), and
observed a nice improvement in your benchmark.
Before:
-N1: 1.43s
-N2: 9.02s
-N4: 22.71s
After:
-N1: 1.13s
-N2: 1.78s
-N4: 2.64s
Bear in mind that allocating memory at this rate (10+Gb/s) is a stress
test for any runtime. I don't expect your real program will be affected
this much.
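For reference, a minimal stress test in that spirit might look roughly
like the sketch below. This is an illustrative example rather than the
benchmark discussed above: each forked thread hammers
mallocPlainForeignPtrBytes, which goes through allocatePinned().

  import Control.Concurrent (forkIO, getNumCapabilities, newEmptyMVar,
                             putMVar, takeMVar)
  import Control.Monad (forM_, replicateM, replicateM_)
  import Data.Word (Word8)
  import Foreign.ForeignPtr (ForeignPtr)
  import GHC.ForeignPtr (mallocPlainForeignPtrBytes)

  -- Each thread allocates many small pinned buffers as fast as it can,
  -- exercising the pinned allocation path.
  -- Compile with -threaded -rtsopts and run with e.g. +RTS -N4.
  main :: IO ()
  main = do
    n     <- getNumCapabilities
    dones <- replicateM n newEmptyMVar
    forM_ dones $ \done -> forkIO $ do
      replicateM_ 1000000 $ do
        _ <- mallocPlainForeignPtrBytes 64 :: IO (ForeignPtr Word8)
        return ()
      putMVar done ()
    mapM_ takeMVar dones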
The changes are pretty small, I might try to get them into 7.4.2.
Cheers,
Simon
> This test program is somewhat artificial, but something very similar
> actually arose in a program I am working on. My program has several
> Haskell threads each reading a different TCP socket. I was using the
> recv :: Socket -> Int -> IO ByteString function in
> Network.Socket.ByteString. This function uses
> Data.ByteString.Internal.createAndTrim to create a new byte string which
> is then passed to the C recv function to fill in with data received from
> the socket. createAndTrim uses mallocPlainForeignPtrBytes, which ends up
> calling allocatePinned. My program performance starts to degrade after
> about 20 cores, and I believe this is a major culprit. I am getting
> around the issue now by allocating some pinned memory for each socket,
> and repeatedly receiving data into that buffer.
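(For illustration, a minimal sketch of that buffer-reuse workaround. It
assumes Network.Socket exports recvBuf :: Socket -> Ptr Word8 -> Int ->
IO Int; the helper names here are made up for this example.)

  import qualified Data.ByteString as BS
  import Data.Word (Word8)
  import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
  import Foreign.Ptr (castPtr)
  import GHC.ForeignPtr (mallocPlainForeignPtrBytes)
  import Network.Socket (Socket, recvBuf)

  -- Allocate one pinned buffer per socket, once.
  newRecvBuffer :: Int -> IO (ForeignPtr Word8)
  newRecvBuffer = mallocPlainForeignPtrBytes

  -- Receive into the long-lived buffer and copy out only the bytes
  -- actually read, instead of allocating a fresh pinned ByteString
  -- for every recv call.
  recvInto :: Socket -> ForeignPtr Word8 -> Int -> IO BS.ByteString
  recvInto sock buf maxLen =
    withForeignPtr buf $ \ptr -> do
      n <- recvBuf sock ptr maxLen
      BS.packCStringLen (castPtr ptr, n)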
If you're still getting performance improvements up to 20 cores on a
non-trivial program, I'd say that's pretty good (many of our benchmarks
start to drop off well before that).
> Out of curiosity, do you think 10Gb/s is a challenge for the allocator
> regardless of the number of cores used? Or will that scale with the
> number of cores?
It depends very much on the allocator, but scaling allocation on
multiple cores is a well-known problem for managed runtimes. See for
example:
http://arnetminer.org/viewpub.do?pid=1253606
In GHC we do pretty well by (a) having small per-core nurseries and (b)
paying attention to locality, so that minor GC only touches core-local data.
Cheers,
Simon
> allocatePinned() gets new blocks from the global block allocator, which is
> protected by a lock. Obviously this lock is going to be pretty heavily
> contended in your program. The reason that newByteArray# scales better (or
> less badly) is that it gets new blocks from the nursery rather than the
> global block allocator, and the nursery is a purely local data structure.
>
> So I changed allocatePinned() to work in the same way as allocate(), and
> observed a nice improvement in your benchmark.
How do you allocate pinned memory from the nursery? I didn't think
that was possible with the current heap / GC design. I remember some
more flexible stuff from the local heaps paper, but I didn't think
that was being used yet.
Duncan
Duncan & I talked about this on IRC, but to summarise for other members
of the list: the change I made is for the pinned memory allocator to
steal complete blocks from the nursery rather than allocating them from
the global block allocator, thus avoiding a lock on this code path.
Since it is only stealing complete blocks, there is no issue with
intermingling pinned and unpinned objects in the same block, which is
something GHC's memory manager doesn't currently support.
The pinned memory allocator can now empty the nursery, and the GC will
then have to refill the nursery by allocating new blocks to replace the
stolen ones. Overall, it seems to be a win to do it this way though.
Cheers,
Simon
I don't know of any good reason not to try it. One downside I can
think of is that GHC no longer knows how much memory you're holding on
to.
-- Johan
> We've been thinking about taking one of our benchmarks that uses storable
> vectors and tweaking it to call "malloc" to create the vectors which would
> allow us to try hoard or TBB scalable allocator.
>
> But... if there's well known reason that this is bad we won't bother trying
> it.
Sure. It's not a big downside. It's just nice if the GC can know the
current memory pressure.
Basically, mallocForeignPtrBytes is much faster than malloc()/free().
There are a few reasons for this:
  - GHC's allocate() is much faster than malloc(), because it is
    almost just bumping a pointer.
  - The GC recovers the memory, so there's no need for free().
Indeed, this is why I added mallocForeignPtr in the first place, because
I was concerned about the performance of using ForeignPtrs with
malloc/free in FFI code like Gtk.
Having said that, you're welcome to try malloc/free instead, they're
even provided for you: see Foreign.Marshal.Alloc. When you make a
ForeignPtr from malloc'd memory, you have to specify free as the finalizer.
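A rough sketch of the two options, using only the standard
Foreign.ForeignPtr and Foreign.Marshal.Alloc APIs (the function names
here are just illustrative):

  import Data.Word (Word8)
  import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes,
                             newForeignPtr)
  import Foreign.Marshal.Alloc (finalizerFree, mallocBytes)

  -- RTS-managed: bump-pointer allocation, memory recovered by the GC.
  viaRTS :: Int -> IO (ForeignPtr Word8)
  viaRTS = mallocForeignPtrBytes

  -- C heap: malloc'd memory, with free() attached as the finalizer so
  -- the memory is released when the ForeignPtr is no longer referenced.
  viaMalloc :: Int -> IO (ForeignPtr Word8)
  viaMalloc n = do
    p <- mallocBytes n
    newForeignPtr finalizerFree p

The same newForeignPtr pattern would apply if you wanted to experiment
with an alternative allocator such as hoard or TBB's, with its
allocation and free functions imported via the FFI.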
> One of the things we're interested in is how to do better on NUMA
> platforms. We've got a NUMA-aware work-stealing scheduler now for
> monad-par, but it isn't really helping much yet. So we need to answer
> the question of how well our memory is being localized to
> socket-preferred physical addresses.
>
> We've been looking at libs like hwloc. Also we wanted to at least look
> into what GHC's my_mmap is doing (in OSMem.c) to see if anything more
> NUMA-aware would be possible...
Ah yes, so NUMA is something we haven't looked at in the RTS yet. At
the very least you would want to ensure that each nursery is allocated
from core-local memory.
Here's a possible plan. We currently have one global pool of blocks,
managed by the block allocator (BlockAlloc.c). Make this one pool per
Capability. We would still need a lock on each pool, because although
we can ensure that a Capability only allocates from its own pool, it
will be difficult to ensure that a Capability only ever frees its own
blocks.
We have to keep track of which pool each block comes from, with some
bits in the BDescr, so we know which pool to free it back to. We can't
just free blocks to any pool, because the coalescing will go wrong.
Having done all this, you can then make sure each pool allocates from
core-local memory using whatever API there is for this, and make sure
that when nurseries are allocated that the memory comes from the right pool.
There would be a slight memory overhead for doing this: 1MB per core or
so. But there would probably be some benefit to the parallel GC, which
sometimes contends for the lock on the block allocator.
Cheers,
Simon
Definitely: with the pools still being subject to a per-pool lock, you
would be able to experiment with sharing pools between Capabilities in
whatever way you like.
Cheers,
Simon
On 27/06/2012 06:55, Andreas Voellmy wrote:

Aha! We should certainly not be doing clearNurseries() on just one
thread, because that will cause cache lines to be moved between cores.
The best solution would be to have each GC thread clear its own nursery.
However clearNurseries() also calculates the amount of allocation, which
needs to be communicated back to the master thread.
This (clearNurseries() being "slow") may be an issue that only affects
users using a large allocation area. I was using -A128m in my test runs
with 40 caps. In this case, clearNurseries has to clear 40*32768 blocks.
It takes about 100ms in my runs, and this averages to 76ns per block,
which seems reasonable.
One way to reduce this time is to use a parallel-for loop as the outer
loop of clearNurseries(). I tried this and it works fairly well: my
first attempt at one thread per nursery reduced the times to about 20ms
for the same allocation area and number of cores.
Another option would be to somehow eliminate the need for
clearNurseries() or reduce the number of blocks it has to work on. Would
it be possible for blocks that are put back on to the nurseries lists to
have these fields cleared when they are linked into the list?
It *might* be ok to have each GC thread clear its own nursery in
gcWorkerThread(), assuming that the rest of the GC won't get upset that
the bd->free pointers have been reset (I *think* that's ok). Also, if
sanity checking is on we can't do this, because sanity checking causes
clearNurseries() to overwrite the nursery with 0xaa, so we would have to
back off to the existing clearNurseries() in that case.

(BTW, sorry for not getting around to looking at your scheduler patches
yet; I've been busy with some teaching and now I'm taking some time off,
but I plan to be back in action again soon.)
Cheers,
Simon