Nested scheduler brought back up to date with merge... some naughty bug introduced

Ryan Newton

Oct 24, 2011, 10:43:45 AM
to mona...@googlegroups.com
Ah, if only the only problem with version control merges were the explicit conflicts (and not the non-conflicting bug introductions).

The nested branch seems to have worked on "parfib nested" before pulling new changes from master this morning.  Now it works for non-nested tests, but it gets a stack overflow on "parfib nested 2" (yes, even with an input of 2!).

I'm just putting this out there in case anyone else wants to take a look.  After it's fixed it would be nice to run benchmark.hs on the nested branch to look for performance regressions.

Speaking of performance regressions we've been seeing some pretty bad results on older Intel and AMD architectures (see results/).  Simon M's 24 core machine always did well with monad-par -- what was its configuration again?  It would be nice to get results from there in the results/ collection.

  -Ryan


Simon Marlow

Oct 24, 2011, 11:30:34 AM
to mona...@googlegroups.com, Ryan Newton

4x Intel Xeon E7450 (2.4GHz), Windows Server 2008.

I think I used +RTS -A1m

Make sure you're using at least GHC 7.2.1, because there's a little
optimisation in runPar_internal that affects the initial scheduling of
workers to OS threads.

Cheers,
Simon

Ryan Newton

Oct 24, 2011, 7:41:45 PM
to Simon Marlow, mona...@googlegroups.com
FYI, my prior comments about a bug resulting from the merge weren't really true.  It was just a problem with the parfib benchmark itself.  It's fixed and I did a little regression testing (attached below).  Because there was no performance regression at all on parfib, I went ahead and merged branch "nested" into "master".  If there are any objections I'll roll it back.

  -Ryan 



[2011.10.24] {Timing nested scheduler version}
----------------------------------------------

Checking for performance regression.  This is on a 3.1 GHz Westmere
with hyperthreading disabled.  First a plain fib on the nested branch:

Data Schema:            User, system, productivity, alloc
   fib(38) 1 thread :   20.2  19.7   94.1%  82GB   -- TraceNested
   fib(38) 4 threads:   6.23  24.2   90.6%  85GB   -- TraceNested

And for argument's sake with a cutoff of 10:
   fib(42) 1 thread :   5.5   5.5    89.2%  8.2GB  -- TraceNested
   fib(42) 4 threads:   1.72  6.38   87.5%  8.4GB  -- TraceNested

And with the Sparks scheduler:
   fib(38) 1 thread :   2.2
   fib(38) 4 threads:   .75   2.7    69.0%  7.5GB
   fib(42) 1 thread :   14.8  14.5   82.8%  52GB
   fib(42) 4 threads:   4.7   18.3   71.1%  52GB
   fib(42) 4 threads:   1.0   3.8    100%   11MB -- cutoff 10

And the plain par/pseq version:
   fib(42) 1 thread :   8.7   8.6    86.2%  17GB
   fib(42) 4 threads:   2.8   10.5   73.9%  17GB

And then for regression testing the ORIGINAL Trace scheduler (no nesting support):
   fib(38) 1 thread :   22.1  21.5   93.8%  97GB -- TraceOrig
   fib(38) 4 threads:   7.5   28.6   90.4%  97GB -- TraceOrig

Indeed, rather than regression, it would seem that Daniel improved the
parfib performance!
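
For readers who have not seen the benchmark, here is a minimal sketch of a par/pseq parfib with a cutoff. It is not the benchmark source: the names fib and parFib, the base case of 1, and the reading of "cutoff" as the argument value at or below which the recursion runs sequentially are all assumptions made for illustration. It uses only `par` and `pseq` from GHC.Conc in base.

```haskell
import GHC.Conc (par, pseq)

-- Sequential Fibonacci, used at or below the cutoff.
-- The base case of 1 is an assumption, not taken from the benchmark.
fib :: Int -> Integer
fib n
  | n < 2     = 1
  | otherwise = fib (n - 1) + fib (n - 2)

-- Hypothetical parallel version: spark one branch, evaluate the other,
-- then combine. Arguments at or below `cutoff` fall back to `fib`.
parFib :: Int -> Int -> Integer
parFib cutoff n
  | n <= cutoff = fib n
  | otherwise   = x `par` (y `pseq` (x + y))
  where
    x = parFib cutoff (n - 1)
    y = parFib cutoff (n - 2)
```

A call such as parFib 10 30 creates sparks only for arguments above 10, which is the kind of granularity control the "cutoff 10" rows are measuring.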

Super-nested parfib:
-----------------------
And the perversely Nested parfib:
   nfib(38) 1 thread :   3.3   3.2    82.7%  12GB     -- nested but Sparks.hs
   nfib(38) 4 threads:   1.1   4.1    70.9%  12.9GB   -- nested but Sparks.hs

Oops!  That was with the sparks scheduler!  Here's the actual Trace/nested:    
   nfib(30) 4 threads:   1.3   4.8    93.5%  7GB    -- super nested fib / trace
   nfib(32) 4 threads:   3.26  11.7   92.9%  18GB
   nfib(42) 4 threads:   6.5   23.5   94.7%  29.6GB -- cutoff 10
 (Note, those only used 376% cpu.)

Finally, this is the original Trace scheduler on the perversely nested parfib:

   nfib(30) 1 thread :   1.8   1.8    92.1%  5GB
   nfib(32) 1 thread :   4.9   4.8    91.7%  14.9GB
   nfib(32) 4 threads:   -- memory explosion
   nfib(28) 4 threads:   9.7   37.2   33.8%  5.8GB -- 2GB ram usage

One interesting consequence here is that while the Sparks scheduler
has an 8X advantage over Trace (and par/pseq an additional 60%
advantage, 13.8X total), that advantage widens to over 256X in the
case of the perversely nested parfib!!!
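
One possible reading of where the first two ratios come from, assuming they compare the first time column of the 4-thread rows (the log does not say which rows were used). The constant names below are made up for illustration; the values are copied from the tables above.

```haskell
-- First time column of the 4-thread rows, copied from the log above.
traceNestedFib38, sparksFib38, sparksFib42, parPseqFib42 :: Double
traceNestedFib38 = 6.23  -- fib(38), TraceNested
sparksFib38      = 0.75  -- fib(38), Sparks
sparksFib42      = 4.7   -- fib(42), Sparks, no cutoff
parPseqFib42     = 2.8   -- fib(42), plain par/pseq

-- How many times faster the second measurement is than the first.
speedup :: Double -> Double -> Double
speedup slower faster = slower / faster

sparksOverTrace :: Double
sparksOverTrace = speedup traceNestedFib38 sparksFib38   -- about 8.3

parPseqOverSparks :: Double
parPseqOverSparks = speedup sparksFib42 parPseqFib42     -- about 1.68

combined :: Double
combined = sparksOverTrace * parPseqOverSparks           -- about 13.9
```

These land close to the 8X, 60% and 13.8X quoted above. The 256X figure for the nested case compares runs at different problem sizes and is not reproduced here.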

Simon Marlow

Oct 25, 2011, 4:15:52 AM
to rrne...@gmail.com, mona...@googlegroups.com
Ok, so clearly the sparks scheduler has much lower overhead - as we
expect, given that it is basically the Eval monad. However, the sparks
scheduler has subtly different semantics than the Trace scheduler. For
example, try this with the Sparks and the Trace schedulers:

main = print (runPar (spawn_ (error "help!") >> return 42))

The Sparks semantics is perfectly fine (better even), but it is not
implementable in the Trace scheduler.
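
To make the difference concrete, here is a minimal sketch of the spark-style behaviour using only `par` from GHC.Conc in base, not monad-par itself. The name sparkAndIgnore is made up for illustration, and treating the Sparks scheduler as equivalent to a bare `par` is an assumption based on the "basically the Eval monad" remark above.

```haskell
import GHC.Conc (par)

-- Hypothetical stand-in for `spawn_ (error "help!") >> return 42`:
-- spark a failing computation, never demand its value, and return 42.
sparkAndIgnore :: Int
sparkAndIgnore = (error "help!" :: Int) `par` 42
```

Because nothing ever demands the sparked value, the surrounding expression can still produce its result. A scheduler that has to run every forked task has no way to skip the failing one, which is presumably the behaviour the Trace scheduler cannot reproduce.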

Also, the Sparks scheduler will be affected by the fixed-size spark pools.

On the other hand, the new spark tracing in ThreadScope can be used for
debugging performance issues when using the Sparks scheduler.

Cheers,
Simon

Ryan Newton

Oct 25, 2011, 10:22:26 AM
to Simon Marlow, mona...@googlegroups.com
Good point about the semantic differences.  For futures-only I think there's a lot to be said for the "serial equivalence" semantics that Cilk uses, which would disallow the "spawn . error"  program.

I'm not really advocating the Sparks scheduler -- first of all I do want IVars and all the stuff we can build on top of them!  It's more that I find it useful to have as a point of comparison.  As we do more benchmarks we don't want to write all of them with par/pseq as well as monad-par.  The Sparks scheduler should give us some idea of how the par/pseq approach would fare.

  -Ryan