Received: by 10.204.149.210 with SMTP id u18mr2569387bkv.1.1340834649018; Wed, 27 Jun 2012 15:04:09 -0700 (PDT) X-BeenThere: parallel-haskell@googlegroups.com Received: by 10.204.128.67 with SMTP id j3ls2236176bks.9.gmail; Wed, 27 Jun 2012 15:04:07 -0700 (PDT) Received: by 10.205.133.196 with SMTP id hz4mr470279bkc.4.1340834647841; Wed, 27 Jun 2012 15:04:07 -0700 (PDT) Received: by 10.205.133.196 with SMTP id hz4mr470277bkc.4.1340834647773; Wed, 27 Jun 2012 15:04:07 -0700 (PDT) Return-Path: Received: from mail-lb0-f182.google.com (mail-lb0-f182.google.com [209.85.217.182]) by gmr-mx.google.com with ESMTPS id iv15si19083208bkc.0.2012.06.27.15.04.07 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 27 Jun 2012 15:04:07 -0700 (PDT) Received-SPF: pass (google.com: domain of andreas.voel...@gmail.com designates 209.85.217.182 as permitted sender) client-ip=209.85.217.182; Authentication-Results: gmr-mx.google.com; spf=pass (google.com: domain of andreas.voel...@gmail.com designates 209.85.217.182 as permitted sender) smtp.mail=andreas.voel...@gmail.com; dkim=pass header...@gmail.com Received: by mail-lb0-f182.google.com with SMTP id n10so3923246lbo.27 for ; Wed, 27 Jun 2012 15:04:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=XWXR768OrCZBR8XtwUNsiMzoTwMmPacK/NYX5R5pUWE=; b=NpLOijAa311PiV/zV3qqOZYaIP6iz/AMV943Oz6weh3hU2kQmOHehejVs/RGHEOzjn crheQW5IxuNsA8mfHQD2vVACj7Kp878XoxBiqwSlaBtqb+LLnLHfaPWC95on8iGQqBk+ wZQJPrPh9o+/me+zGhs7h4vbwLHGolRvAotdlIvrmSvFUwoFVGX/WDYFXeALgwy7kmer 2ufHTcdMiioYxBYMhvNkOrDBx2gdzKcS9+yEI82sguGIoKgbGx3+FRkN6rUGlr8zjUse DDt4HAGqiHDqWsJJlh2st0D2h+qLdvBrv1pifv1ZGR+0UHU1UuO1MlVrO/OL+l++ZFqI B6Kg== MIME-Version: 1.0 Received: by 10.152.102.234 with SMTP id fr10mr22678532lab.32.1340834647404; Wed, 27 Jun 2012 15:04:07 -0700 (PDT) Received: by 10.112.45.194 with HTTP; Wed, 27 Jun 2012 15:04:07 -0700 (PDT) In-Reply-To: References: <4a44b19c-6391-46a1-9743-94a0e47ec...@bs8g2000vbb.googlegroups.com> <4F38ECEB.10...@gmail.com> <4F3B7271.1090...@gmail.com> <4F6AFEA9.90...@gmail.com> <4F6B36AC.8010...@gmail.com> <4FE863B9.4030...@gmail.com> <4FEAF957.7070...@gmail.com> Date: Wed, 27 Jun 2012 18:04:07 -0400 Message-ID: Subject: Re: Synchronizations in memory allocation? From: Andreas Voellmy To: Simon Marlow Cc: parallel-haskell@googlegroups.com, rrnew...@gmail.com, Duncan Coutts Content-Type: multipart/alternative; boundary=f46d04071259dc70a804c37b6451 --f46d04071259dc70a804c37b6451 Content-Type: text/plain; charset=ISO-8859-1 I pushed my code here: https://github.com/AndreasVoellmy/ghc-arv/commit/f5dca46086f4d51f30479e3a031c31dbdbbb2bed I made the change on top of a branch that contains my previous Scheduler changes (which was based on 7.4.1 branch), but these changes to how GC clears nurseries are totally independent of them. -Andreas On Wed, Jun 27, 2012 at 12:02 PM, Andreas Voellmy wrote: > > > On Wed, Jun 27, 2012 at 8:15 AM, Simon Marlow wrote: > >> On 27/06/2012 06:55, Andreas Voellmy wrote: >> >>> This (clearNurseries() being "slow") may be an issue that only affects >>> users using a large allocation area. I was using -A128m in my test runs >>> with 40 caps. In this case, clearNurseries has to clear 40*32678 blocks. >>> It takes about 100ms in my runs, and this averages to 76ns per block, >>> which seems reasonable. >>> >>> One way to reduce this time is to use a parallel-for loop as the outer >>> loop of clearNurseries(). I tried this and it works fairly well: my >>> first attempt at one thread per nursery reduced the times to about 20ms >>> for the same allocation area and number of cores. >>> >>> Another option would be to somehow eliminate the need for >>> clearNurseries() or reduce the number of blocks it has to work on. Would >>> it be possible for blocks that are put back on to the nurseries lists to >>> have these fields cleared when they are linked into the list? >>> >> >> Aha! We should certainly not be doing clearNurseries() on just one >> thread, because that will cause cache lines to be moved between cores. The >> best solution would be to have each GC thread clear its own nursery. >> However clearNurseries() also calculates the amount of allocation, which >> needs to be communicated back to the master thread. >> >> It *might* be ok to have each GC thread clear its own nursery in >> gcWorkerThread(), assuming that the rest of the GC won't get upset that the >> bd->free pointers have been reset (I *think* that's ok). Also if sanity >> checking is on we can't do this, because sanity checking causes >> clearNurseries() to overwrite the nursery with 0xaa, so we would have to >> back off to the existing clearNurseries() in that case. >> > > OK, I just hacked this up and it looks really good :) With the parameters > I described above I am getting about 15ms pauses, down from about 115ms > before the change. My change involved having each gcthread clear the > nursery of its capability right after it returns from > scavenge_until_all_done() call. Same with the main GarbageCollect() > procedure. Then GarbageCollect calls shutdown_gc_threads() and when this > returns all the nurseries have been cleared. Each gcthread fills in a new > member of the gcthread_ struct indicating how much it allocated and this is > read later in the main GarbageCollect() procedure to increment allocated. > I'll post my code after some more testing and polishing. > > -Andreas > > > >> >> (BTW, sorry for not getting around to looking at your scheduler patches >> yet, I've been busy with some teaching and now I'm taking some time off, >> but I plan to be back in action again soon). >> >> Cheers, >> Simon >> >> >> -Andreas >>> >>> On Tue, Jun 26, 2012 at 3:44 PM, Andreas Voellmy >>> >> >>> wrote: >>> >>> OK thanks. I ask because Ryan mentioned earlier in this thread that >>> he saw performance differences between AMD and Intel machines and >>> that they might be due to NUMA memory allocation issues. I have been >>> using an AMD server for some time now with good results, and I just >>> started working with an 8 socket Intel server with worse >>> performance, despite it having faster processors. The problem might >>> NUMA memory allocation related, but I'm not sure - it could be >>> something else. I just thought I'd check in to see if there >>> something easy I could try. >>> >>> One thing I've noticed is that GCs take long when running with >>> lots of cores (e.g. 40). Oddly enough almost all of the GC time is >>> spent in clearNurseries(), which doesn't seem to be doing very much >>> - just looping through the blocks in each core's nursery and setting >>> the block's free pointer to the start pointer. >>> >>> -Andreas >>> >>> >>> On Mon, Jun 25, 2012 at 9:12 AM, Simon Marlow >> > wrote: >>> >>> On 22/06/2012 14:14, Andreas wrote: >>> >>> Has anyone made progress on this (NUMA-aware allocation in >>> GHC)? Is >>> there code that I could try out? >>> >>> >>> Not yet, sorry. >>> >>> Cheers, >>> Simon >>> >>> >>> >>> -Andreas >>> >>> On Thursday, March 22, 2012 10:26:52 AM UTC-4, Simon Marlow >>> wrote: >>> >>> On 22/03/2012 14:23, Ryan Newton wrote: >>> > We've been looking at libs like hwloc. Also we wanted to at >>> > least look >>> > into what GHC's my_mmap is doing (in OSMem.c) to see if >>> anything >>> > more >>> > NUMA-aware would be possible... >>> > >>> > >>> > Ah yes, so NUMA is something we haven't looked at in the >>> RTS yet. >>> > At the very least you would want to ensure that each >>> nursery is >>> > allocated from core-local memory. >>> > >>> > Here's a possible plan. We currently have one global pool >>> of >>> > blocks, managed by the block allocator (BlockAlloc.c). >>> Make this >>> > one pool per Capability. We would still need a lock on >>> each pool, >>> > because although we can ensure that a Capability only >>> allocates from >>> > its own pool, it will be difficult to ensure that a >>> Capability only >>> > ever frees its own blocks. >>> > >>> > >>> > This definitely sounds like an interesting plan to >>> explore. It looks >>> > like there would be a design tradeoff: >>> > >>> > * Make it one pool per Capability (i.e. 32 pools on 32 >>> cores) >>> > * Make it one pool per socket (more like 4 pools) >>> > >>> > There's no evidence of this yet, but one hypothesis is >>> that some >>> of our >>> > observed Intel/AMD differences are related to NUMA issues. >>> >>> Definitely: with the pools still being subject to a >>> per-pool lock, you >>> would be able to experiment with sharing pools between >>> Capabilities in >>> whatever way you like. >>> >>> Cheers, >>> Simon >>> >>> >>> >>> >>> >> > --f46d04071259dc70a804c37b6451 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable I pushed my code here:


I made the change on top of a branch that contains my p= revious Scheduler changes (which was based on 7.4.1 branch), but these chan= ges to how GC clears nurseries are totally independent of them.=A0

-Andreas

On Wed, Jun 27, 2= 012 at 12:02 PM, Andreas Voellmy <andreas.voel...@gmail.com>= ; wrote:


On Wed, Jun 27, 2012 at 8:15 AM, Simon Marlow &l= t;marlo...@gmail.co= m> wrote:
On 27/06/2012 06:55, Andreas Voellmy wrote:
This (clearNurseries() being "slow") may be an issue that only af= fects
users using a large allocation area. I was using -A128m in my test runs
with 40 caps. In this case, clearNurseries has to clear 40*32678 blocks. It takes about 100ms in my runs, and this averages to 76ns per block,
which seems reasonable.

One way to reduce this time is to use a parallel-for loop as the outer
loop of clearNurseries(). I tried this and it works fairly well: my
first attempt at one thread per nursery reduced the times to about 20ms
for the same allocation area and number of cores.

Another option would be to somehow eliminate the need for
clearNurseries() or reduce the number of blocks it has to work on. Would it be possible for blocks that are put back on to the nurseries lists to have these fields cleared when they are linked into the list?

Aha! We should certainly not be doing clearNurseries() on just one thread, = because that will cause cache lines to be moved between cores. The best sol= ution would be to have each GC thread clear its own nursery. =A0However cle= arNurseries() also calculates the amount of allocation, which needs to be c= ommunicated back to the master thread.

It *might* be ok to have each GC thread clear its own nursery in gcWorkerTh= read(), assuming that the rest of the GC won't get upset that the bd-&g= t;free pointers have been reset (I *think* that's ok). =A0Also if sanit= y checking is on we can't do this, because sanity checking causes clear= Nurseries() to overwrite the nursery with 0xaa, so we would have to back of= f to the existing clearNurseries() in that case.

OK, I just hacked this up and it loo= ks really good :) With the parameters I described above I am getting about = 15ms pauses, down from about 115ms before the change. My change involved ha= ving each gcthread clear the nursery of its capability right after it retur= ns from scavenge_until_all_done() call. Same with the main GarbageCollect()= procedure. Then GarbageCollect calls shutdown_gc_threads() and when this r= eturns all the nurseries have been cleared. Each gcthread fills in a new me= mber of the gcthread_ struct indicating how much it allocated and this is r= ead later in the main GarbageCollect() procedure to increment allocated. I&= #39;ll post my code after some more testing and polishing.

-Andreas

=A0

(BTW, sorry for not getting around to looking at your scheduler patches yet= , I've been busy with some teaching and now I'm taking some time of= f, but I plan to be back in action again soon).

Cheers,
=A0 =A0 =A0 =A0Simon


-Andreas

On Tue, Jun 26, 2012 at 3:44 PM, Andreas Voellmy
<andreas.= voel...@gmail.com <mailto:andreas.voellmy@gmail.com>> wrote:
=A0 =A0OK thanks. I ask because Ryan mentioned earlier in this thread that=
=A0 =A0he saw performance differences between AMD and Intel machines and =A0 =A0that they might be due to NUMA memory allocation issues. I have bee= n
=A0 =A0using an AMD server for some time now with good results, and I just=
=A0 =A0started working with an 8 socket Intel server with worse
=A0 =A0performance, despite it having faster processors. The problem might=
=A0 =A0NUMA memory allocation related, but I'm not sure - it could be<= br> =A0 =A0something else. I just thought I'd check in to see if there
=A0 =A0something easy I could try.

=A0 =A0One thing I've noticed is that GCs take long when running with<= br> =A0 =A0lots of cores (e.g. 40). Oddly enough almost all of the GC time is<= br> =A0 =A0spent in clearNurseries(), which doesn't seem to be doing very = much
=A0 =A0- just looping through the blocks in each core's nursery and se= tting
=A0 =A0the block's free pointer to the start pointer.

=A0 =A0-Andreas


=A0 =A0On Mon, Jun 25, 2012 at 9:12 AM, Simon Marlow <marlo...@gmail.com
=A0 =A0<mailto:= marlo...@gmail.com>> wrote:

=A0 =A0 =A0 =A0On 22/06/2012 14:14, Andreas wrote:

=A0 =A0 =A0 =A0 =A0 =A0Has anyone made progress on this (NUMA-aware alloca= tion in
=A0 =A0 =A0 =A0 =A0 =A0GHC)? Is
=A0 =A0 =A0 =A0 =A0 =A0there code that I could try out?


=A0 =A0 =A0 =A0Not yet, sorry.

=A0 =A0 =A0 =A0Cheers,
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Simon



=A0 =A0 =A0 =A0 =A0 =A0-Andreas

=A0 =A0 =A0 =A0 =A0 =A0On Thursday, March 22, 2012 10:26:52 AM UTC-4, Simo= n Marlow
=A0 =A0 =A0 =A0 =A0 =A0wrote:

=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0On 22/03/2012 14:23, Ryan Newton wrote:
=A0 =A0 =A0 =A0 =A0 =A0 > We've been looking at libs like hwloc. Al= so we wanted to at
=A0 =A0 =A0 =A0 =A0 =A0 > least look
=A0 =A0 =A0 =A0 =A0 =A0 > into what GHC's my_mmap is doing (in OSMe= m.c) to see if
=A0 =A0 =A0 =A0 =A0 =A0anything
=A0 =A0 =A0 =A0 =A0 =A0 > more
=A0 =A0 =A0 =A0 =A0 =A0 > NUMA-aware would be possible...
=A0 =A0 =A0 =A0 =A0 =A0 >
=A0 =A0 =A0 =A0 =A0 =A0 >
=A0 =A0 =A0 =A0 =A0 =A0 > Ah yes, so NUMA is something we haven't l= ooked at in the
=A0 =A0 =A0 =A0 =A0 =A0RTS yet.
=A0 =A0 =A0 =A0 =A0 =A0 > At the very least you would want to ensure th= at each
=A0 =A0 =A0 =A0 =A0 =A0nursery is
=A0 =A0 =A0 =A0 =A0 =A0 > allocated from core-local memory.
=A0 =A0 =A0 =A0 =A0 =A0 >
=A0 =A0 =A0 =A0 =A0 =A0 > Here's a possible plan. We currently have= one global pool of
=A0 =A0 =A0 =A0 =A0 =A0 > blocks, managed by the block allocator (Block= Alloc.c).
=A0 =A0 =A0 =A0 =A0 =A0Make this
=A0 =A0 =A0 =A0 =A0 =A0 > one pool per Capability. We would still need = a lock on
=A0 =A0 =A0 =A0 =A0 =A0each pool,
=A0 =A0 =A0 =A0 =A0 =A0 > because although we can ensure that a Capabil= ity only
=A0 =A0 =A0 =A0 =A0 =A0allocates from
=A0 =A0 =A0 =A0 =A0 =A0 > its own pool, it will be difficult to ensure = that a
=A0 =A0 =A0 =A0 =A0 =A0Capability only
=A0 =A0 =A0 =A0 =A0 =A0 > ever frees its own blocks.
=A0 =A0 =A0 =A0 =A0 =A0 >
=A0 =A0 =A0 =A0 =A0 =A0 >
=A0 =A0 =A0 =A0 =A0 =A0 > This definitely sounds like an interesting pl= an to
=A0 =A0 =A0 =A0 =A0 =A0explore. It looks
=A0 =A0 =A0 =A0 =A0 =A0 > like there would be a design tradeoff:
=A0 =A0 =A0 =A0 =A0 =A0 >
=A0 =A0 =A0 =A0 =A0 =A0 > * Make it one pool per Capability (i.e. 32 po= ols on 32 cores)
=A0 =A0 =A0 =A0 =A0 =A0 > * Make it one pool per socket (more like 4 po= ols)
=A0 =A0 =A0 =A0 =A0 =A0 >
=A0 =A0 =A0 =A0 =A0 =A0 > There's no evidence of this yet, but one = hypothesis is
=A0 =A0 =A0 =A0 =A0 =A0that some
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0of our
=A0 =A0 =A0 =A0 =A0 =A0 > observed Intel/AMD differences are related to= NUMA issues.

=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Definitely: with the pools still being subj= ect to a
=A0 =A0 =A0 =A0 =A0 =A0per-pool lock, you
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0would be able to experiment with sharing po= ols between
=A0 =A0 =A0 =A0 =A0 =A0Capabilities in
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0whatever way you like.

=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Cheers,
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Simon







--f46d04071259dc70a804c37b6451--