Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion Synchronizations in memory allocation?
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Andreas Voellmy  
View profile  
 More options Jul 11 2012, 8:58 am
From: Andreas Voellmy <andreas.voel...@gmail.com>
Date: Wed, 11 Jul 2012 08:58:28 -0400
Local: Wed, Jul 11 2012 8:58 am
Subject: Re: Synchronizations in memory allocation?

Cool. I think your patch misses misses out on some potential improvement
though. In your patch the main gc thread will wait for all gcWorkers to
finish clearing the nursery (with the call to shutdown_gc_threads) before
going on to do various things, including clearing its own nursery. As a
result, the main gc thread is often waiting a while for the other threads
to clear their nurseries, and then later all the other threads are waiting
for the main gc thread to clear its nursery (which tends to have completely
used its nursery and hence takes longest - see next paragraph). I found it
better to have the main gc thread clear its nursery before calling
shutdown_gc_threads (hopefully this didn't break anything!), and then
gather the allocated counts from the gcWorkers after the
shutdown_gc_threads barrier.

Also, is it possible to terminate the loop in clearNursery() early? It
seems like you could terminate it as soon as you find the first block where
the free and start pointers coincided. I added a patch for this in my
github. I didn't see any big performance gain, but it seems like it can't
hurt (assuming it doesn't break things), so I am using it (
https://github.com/AndreasVoellmy/ghc-arv/commit/95dd315f64f71e041c01...
).

-Andreas

On Wed, Jul 11, 2012 at 4:30 AM, Simon Marlow <marlo...@gmail.com> wrote:
> I committed an equivalent patch yesterday:

> http://hackage.haskell.org/**trac/ghc/changeset/**
> 713cf473de8a2ad7d0b8195d78860c**25fec41839<http://hackage.haskell.org/trac/ghc/changeset/713cf473de8a2ad7d0b8195...>

> Thanks again for identifying the problem.

> Cheers,
>         Simon

> On 27/06/2012 23:04, Andreas Voellmy wrote:

>> I pushed my code here:

>> https://github.com/**AndreasVoellmy/ghc-arv/commit/**
>> f5dca46086f4d51f30479e3a031c31**dbdbbb2bed<https://github.com/AndreasVoellmy/ghc-arv/commit/f5dca46086f4d51f3047...>

>> I made the change on top of a branch that contains my previous Scheduler
>> changes (which was based on 7.4.1 branch), but these changes to how GC
>> clears nurseries are totally independent of them.

>> -Andreas

>> On Wed, Jun 27, 2012 at 12:02 PM, Andreas Voellmy
>> <andreas.voel...@gmail.com <mailto:andreas.voellmy@gmail.**com<andreas.voel...@gmail.com>>>
>> wrote:

>>     On Wed, Jun 27, 2012 at 8:15 AM, Simon Marlow <marlo...@gmail.com
>>     <mailto:marlo...@gmail.com>> wrote:

>>         On 27/06/2012 06:55, Andreas Voellmy wrote:

>>             This (clearNurseries() being "slow") may be an issue that
>>             only affects
>>             users using a large allocation area. I was using -A128m in
>>             my test runs
>>             with 40 caps. In this case, clearNurseries has to clear
>>             40*32678 blocks.
>>             It takes about 100ms in my runs, and this averages to 76ns
>>             per block,
>>             which seems reasonable.

>>             One way to reduce this time is to use a parallel-for loop as
>>             the outer
>>             loop of clearNurseries(). I tried this and it works fairly
>>             well: my
>>             first attempt at one thread per nursery reduced the times to
>>             about 20ms
>>             for the same allocation area and number of cores.

>>             Another option would be to somehow eliminate the need for
>>             clearNurseries() or reduce the number of blocks it has to
>>             work on. Would
>>             it be possible for blocks that are put back on to the
>>             nurseries lists to
>>             have these fields cleared when they are linked into the list?

>>         Aha! We should certainly not be doing clearNurseries() on just
>>         one thread, because that will cause cache lines to be moved
>>         between cores. The best solution would be to have each GC thread
>>         clear its own nursery.  However clearNurseries() also calculates
>>         the amount of allocation, which needs to be communicated back to
>>         the master thread.

>>         It *might* be ok to have each GC thread clear its own nursery in
>>         gcWorkerThread(), assuming that the rest of the GC won't get
>>         upset that the bd->free pointers have been reset (I *think*
>>         that's ok).  Also if sanity checking is on we can't do this,
>>         because sanity checking causes clearNurseries() to overwrite the
>>         nursery with 0xaa, so we would have to back off to the existing
>>         clearNurseries() in that case.

>>     OK, I just hacked this up and it looks really good :) With the
>>     parameters I described above I am getting about 15ms pauses, down
>>     from about 115ms before the change. My change involved having each
>>     gcthread clear the nursery of its capability right after it returns
>>     from scavenge_until_all_done() call. Same with the main
>>     GarbageCollect() procedure. Then GarbageCollect calls
>>     shutdown_gc_threads() and when this returns all the nurseries have
>>     been cleared. Each gcthread fills in a new member of the gcthread_
>>     struct indicating how much it allocated and this is read later in
>>     the main GarbageCollect() procedure to increment allocated. I'll
>>     post my code after some more testing and polishing.

>>     -Andreas

>>         (BTW, sorry for not getting around to looking at your scheduler
>>         patches yet, I've been busy with some teaching and now I'm
>>         taking some time off, but I plan to be back in action again soon).

>>         Cheers,
>>                 Simon

>>             -Andreas

>>             On Tue, Jun 26, 2012 at 3:44 PM, Andreas Voellmy
>>             <andreas.voel...@gmail.com
>>             <mailto:andreas.voellmy@gmail.**com<andreas.voel...@gmail.com>

>>             <mailto:andreas.voellmy@gmail.**__com

>>             <mailto:andreas.voellmy@gmail.**com<andreas.voel...@gmail.com>>>>
>> wrote:

>>                 OK thanks. I ask because Ryan mentioned earlier in this
>>             thread that
>>                 he saw performance differences between AMD and Intel
>>             machines and
>>                 that they might be due to NUMA memory allocation issues.
>>             I have been
>>                 using an AMD server for some time now with good results,
>>             and I just
>>                 started working with an 8 socket Intel server with worse
>>                 performance, despite it having faster processors. The
>>             problem might
>>                 NUMA memory allocation related, but I'm not sure - it
>>             could be
>>                 something else. I just thought I'd check in to see if
>> there
>>                 something easy I could try.

>>                 One thing I've noticed is that GCs take long when
>>             running with
>>                 lots of cores (e.g. 40). Oddly enough almost all of the
>>             GC time is
>>                 spent in clearNurseries(), which doesn't seem to be
>>             doing very much
>>                 - just looping through the blocks in each core's nursery
>>             and setting
>>                 the block's free pointer to the start pointer.

>>                 -Andreas

>>                 On Mon, Jun 25, 2012 at 9:12 AM, Simon Marlow
>>             <marlo...@gmail.com <mailto:marlo...@gmail.com>
>>                 <mailto:marlo...@gmail.com <mailto:marlo...@gmail.com>>>

>>             wrote:

>>                     On 22/06/2012 14:14, Andreas wrote:

>>                         Has anyone made progress on this (NUMA-aware
>>             allocation in
>>                         GHC)? Is
>>                         there code that I could try out?

>>                     Not yet, sorry.

>>                     Cheers,
>>                             Simon

>>                         -Andreas

>>                         On Thursday, March 22, 2012 10:26:52 AM UTC-4,
>>             Simon Marlow
>>                         wrote:

>>                             On 22/03/2012 14:23, Ryan Newton wrote:
>>                          > We've been looking at libs like hwloc. Also
>>             we wanted to at
>>                          > least look
>>                          > into what GHC's my_mmap is doing (in OSMem.c)
>>             to see if
>>                         anything
>>                          > more
>>                          > NUMA-aware would be possible...

>>                          > Ah yes, so NUMA is something we haven't
>>             looked at in the
>>                         RTS yet.
>>                          > At the very least you would want to ensure
>>             that each
>>                         nursery is
>>                          > allocated from core-local memory.

>>                          > Here's a possible plan. We currently have one
>>             global pool of
>>                          > blocks, managed by the block allocator
>>             (BlockAlloc.c).
>>                         Make this
>>                          > one pool per Capability. We would still need
>>             a lock on
>>                         each pool,
>>                          > because although we can ensure that a
>>             Capability only
>>                         allocates from
>>                          > its own pool, it will be difficult to ensure
>>             that a
>>                         Capability only
>>                          > ever frees its own blocks.

>>                          > This definitely sounds like an interesting
>>             plan to
>>                         explore. It looks
>>                          > like there would be a design tradeoff:

>>                          > * Make it one pool per Capability (i.e. 32
>>             pools on 32 cores)
>>                          > * Make it one pool per socket (more like 4
>> pools)

>>                          > There's no evidence of this yet, but one
>>             hypothesis is
>>                         that some
>>                             of our
>>                          > observed Intel/AMD differences are related to
>>             NUMA issues.

>>                             Definitely: with the pools still being
>>             subject to a
>>                         per-pool lock, you
>>                             would be able to experiment with sharing
>>             pools between
>>                         Capabilities in
>>                             whatever way you like.

>>                             Cheers,
>>                             Simon


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.