On Wed, Jul 11, 2012 at 4:30 AM, Simon Marlow <marlo
...@gmail.com> wrote:
> I committed an equivalent patch yesterday:
> http://hackage.haskell.org/**trac/ghc/changeset/**
> 713cf473de8a2ad7d0b8195d78860c**25fec41839<http://hackage.haskell.org/trac/ghc/changeset/713cf473de8a2ad7d0b8195...>
> Thanks again for identifying the problem.
> Cheers,
> Simon
> On 27/06/2012 23:04, Andreas Voellmy wrote:
>> I pushed my code here:
>> https://github.com/**AndreasVoellmy/ghc-arv/commit/**
>> f5dca46086f4d51f30479e3a031c31**dbdbbb2bed<https://github.com/AndreasVoellmy/ghc-arv/commit/f5dca46086f4d51f3047...>
>> I made the change on top of a branch that contains my previous Scheduler
>> changes (which was based on 7.4.1 branch), but these changes to how GC
>> clears nurseries are totally independent of them.
>> -Andreas
>> On Wed, Jun 27, 2012 at 12:02 PM, Andreas Voellmy
>> <andreas.voel...@gmail.com <mailto:andreas.voellmy@gmail.**com<andreas.voel...@gmail.com>>>
>> wrote:
>> On Wed, Jun 27, 2012 at 8:15 AM, Simon Marlow <marlo...@gmail.com
>> <mailto:marlo...@gmail.com>> wrote:
>> On 27/06/2012 06:55, Andreas Voellmy wrote:
>> This (clearNurseries() being "slow") may be an issue that
>> only affects
>> users using a large allocation area. I was using -A128m in
>> my test runs
>> with 40 caps. In this case, clearNurseries has to clear
>> 40*32678 blocks.
>> It takes about 100ms in my runs, and this averages to 76ns
>> per block,
>> which seems reasonable.
>> One way to reduce this time is to use a parallel-for loop as
>> the outer
>> loop of clearNurseries(). I tried this and it works fairly
>> well: my
>> first attempt at one thread per nursery reduced the times to
>> about 20ms
>> for the same allocation area and number of cores.
>> Another option would be to somehow eliminate the need for
>> clearNurseries() or reduce the number of blocks it has to
>> work on. Would
>> it be possible for blocks that are put back on to the
>> nurseries lists to
>> have these fields cleared when they are linked into the list?
>> Aha! We should certainly not be doing clearNurseries() on just
>> one thread, because that will cause cache lines to be moved
>> between cores. The best solution would be to have each GC thread
>> clear its own nursery. However clearNurseries() also calculates
>> the amount of allocation, which needs to be communicated back to
>> the master thread.
>> It *might* be ok to have each GC thread clear its own nursery in
>> gcWorkerThread(), assuming that the rest of the GC won't get
>> upset that the bd->free pointers have been reset (I *think*
>> that's ok). Also if sanity checking is on we can't do this,
>> because sanity checking causes clearNurseries() to overwrite the
>> nursery with 0xaa, so we would have to back off to the existing
>> clearNurseries() in that case.
>> OK, I just hacked this up and it looks really good :) With the
>> parameters I described above I am getting about 15ms pauses, down
>> from about 115ms before the change. My change involved having each
>> gcthread clear the nursery of its capability right after it returns
>> from scavenge_until_all_done() call. Same with the main
>> GarbageCollect() procedure. Then GarbageCollect calls
>> shutdown_gc_threads() and when this returns all the nurseries have
>> been cleared. Each gcthread fills in a new member of the gcthread_
>> struct indicating how much it allocated and this is read later in
>> the main GarbageCollect() procedure to increment allocated. I'll
>> post my code after some more testing and polishing.
>> -Andreas
>> (BTW, sorry for not getting around to looking at your scheduler
>> patches yet, I've been busy with some teaching and now I'm
>> taking some time off, but I plan to be back in action again soon).
>> Cheers,
>> Simon
>> -Andreas
>> On Tue, Jun 26, 2012 at 3:44 PM, Andreas Voellmy
>> <andreas.voel...@gmail.com
>> <mailto:andreas.voellmy@gmail.**com<andreas.voel...@gmail.com>
>> <mailto:andreas.voellmy@gmail.**__com
>> <mailto:andreas.voellmy@gmail.**com<andreas.voel...@gmail.com>>>>
>> wrote:
>> OK thanks. I ask because Ryan mentioned earlier in this
>> thread that
>> he saw performance differences between AMD and Intel
>> machines and
>> that they might be due to NUMA memory allocation issues.
>> I have been
>> using an AMD server for some time now with good results,
>> and I just
>> started working with an 8 socket Intel server with worse
>> performance, despite it having faster processors. The
>> problem might
>> NUMA memory allocation related, but I'm not sure - it
>> could be
>> something else. I just thought I'd check in to see if
>> there
>> something easy I could try.
>> One thing I've noticed is that GCs take long when
>> running with
>> lots of cores (e.g. 40). Oddly enough almost all of the
>> GC time is
>> spent in clearNurseries(), which doesn't seem to be
>> doing very much
>> - just looping through the blocks in each core's nursery
>> and setting
>> the block's free pointer to the start pointer.
>> -Andreas
>> On Mon, Jun 25, 2012 at 9:12 AM, Simon Marlow
>> <marlo...@gmail.com <mailto:marlo...@gmail.com>
>> <mailto:marlo...@gmail.com <mailto:marlo...@gmail.com>>>
>> wrote:
>> On 22/06/2012 14:14, Andreas wrote:
>> Has anyone made progress on this (NUMA-aware
>> allocation in
>> GHC)? Is
>> there code that I could try out?
>> Not yet, sorry.
>> Cheers,
>> Simon
>> -Andreas
>> On Thursday, March 22, 2012 10:26:52 AM UTC-4,
>> Simon Marlow
>> wrote:
>> On 22/03/2012 14:23, Ryan Newton wrote:
>> > We've been looking at libs like hwloc. Also
>> we wanted to at
>> > least look
>> > into what GHC's my_mmap is doing (in OSMem.c)
>> to see if
>> anything
>> > more
>> > NUMA-aware would be possible...
>> > Ah yes, so NUMA is something we haven't
>> looked at in the
>> RTS yet.
>> > At the very least you would want to ensure
>> that each
>> nursery is
>> > allocated from core-local memory.
>> > Here's a possible plan. We currently have one
>> global pool of
>> > blocks, managed by the block allocator
>> (BlockAlloc.c).
>> Make this
>> > one pool per Capability. We would still need
>> a lock on
>> each pool,
>> > because although we can ensure that a
>> Capability only
>> allocates from
>> > its own pool, it will be difficult to ensure
>> that a
>> Capability only
>> > ever frees its own blocks.
>> > This definitely sounds like an interesting
>> plan to
>> explore. It looks
>> > like there would be a design tradeoff:
>> > * Make it one pool per Capability (i.e. 32
>> pools on 32 cores)
>> > * Make it one pool per socket (more like 4
>> pools)
>> > There's no evidence of this yet, but one
>> hypothesis is
>> that some
>> of our
>> > observed Intel/AMD differences are related to
>> NUMA issues.
>> Definitely: with the pools still being
>> subject to a
>> per-pool lock, you
>> would be able to experiment with sharing
>> pools between
>> Capabilities in
>> whatever way you like.
>> Cheers,
>> Simon