Make Compiler Go Fast


Scott Blum

Jul 18, 2008, 12:01:18 PM
to Lex Spoon, BobV, Google-Web-Toolkit-Contributors
Hi guys,

Could you please review this patch for trunk?  It makes the compiler go much faster when you have:

a) Several permutations
b) More than one processor

Actually, it even helps a little bit with several permutations and only one processor.

Here's the basic design:

1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved deferred binding decisions as explicit nodes
2) Do optimizations on this tree.
3) Serialize into a byte buffer (thanks to Bruce for this tasty suggestion)
4) Crank up N worker threads
5) Each worker thread:
  a) Grabs a particular permutation and deserializes its own copy of the unified Java AST
  b) Resolves all deferred binding for that particular permutation
  c) Optimizes the permutation with the new knowledge of deferred binding decisions
6) Profit!

Some notes:
- Doing step 2 up front on the unified tree actually realizes some performance gains by making step 5c take less time in each permutation, even when you have only one worker thread
- The number of worker threads created is Math.min(numberOfMachineProcessors, estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, there's no point creating extra threads if you don't have the memory for them, or if you can't actually utilize the hardware.  Permutation optimization barely touches I/O; it's all CPU-bound.
- Making the Java and JS ASTs serializable was surprisingly easy.
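[A minimal sketch of the thread-count heuristic described in the notes above: cap the worker count by both the processor count and how many AST copies fit in memory. The method signature and the per-AST size estimate are illustrative, not the patch's actual code.]

```java
public class ThreadCountSketch {
    // processors and astSizeEstimate are passed in for clarity; the real
    // code would use Runtime.getRuntime().availableProcessors() and its own
    // memory estimate.
    static int computeThreadCount(int processors, long freeMemory, long astSizeEstimate) {
        int memoryBound = (int) Math.max(1, freeMemory / astSizeEstimate);
        return Math.max(1, Math.min(processors, memoryBound));
    }

    public static void main(String[] args) {
        // 8 processors, 512MB free, ~128MB per AST -> memory-bound at 4 workers.
        System.out.println(computeThreadCount(8, 512L * 1024 * 1024, 128L * 1024 * 1024));
    }
}
```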

Toby's about to email around some results from testing Showcase on his phatty 4-processor machine, if he hasn't already done so.

Thanks!
Scott

compiler-parallel-r3250.patch

Kelly Norton

Jul 18, 2008, 12:17:11 PM
to Google-Web-Tool...@googlegroups.com, Lex Spoon, BobV
Scott,

I'm kind of surprised to see all the code to hand-manage a worker pool of threads. Why did you not use a fixed thread pool from java.util.concurrent?

/kel
--
If you received this communication by mistake, you are entitled to one free ice cream cone on me. Simply print out this email including all relevant SMTP headers and present them at my desk to claim your creamy treat. We'll have a laugh at my emailing incompetence, and play a game of ping pong. (offer may not be valid in all States).

Eric Ayers

Jul 18, 2008, 12:31:15 PM
to Google-Web-Tool...@googlegroups.com
FYI:

I ran this on my desktop - 2.4 GHz Intel Core 2 Duo (2 cores) - and noted
Showcase compile time went from 02:02 down to 01:19.

I noted that interactive performance of using my desktop did not
suffer while the build was going. Still, I was thinking that it might
be wise to set some way (Java Property?) in computeThreadCount() to
throttle the number of threads. Someone running an automated build on
a shared machine might not like the behavior of grabbing every CPU on
the system. I did a web search and didn't see a JVM argument that
might artificially lower the # of processors.
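[A minimal sketch of the throttle Eric suggests: let a system property cap the computed thread count. The property name "gwt.compiler.threads" is an assumed placeholder, not anything the patch defines.]

```java
public class ThrottleSketch {
    static int computeThreadCount(int processors) {
        int count = processors;
        // Hypothetical property name; a user could pass -Dgwt.compiler.threads=2
        // on a shared build machine to avoid grabbing every CPU.
        String override = System.getProperty("gwt.compiler.threads");
        if (override != null) {
            count = Math.min(count, Integer.parseInt(override));
        }
        return Math.max(1, count);
    }

    public static void main(String[] args) {
        System.setProperty("gwt.compiler.threads", "2");
        System.out.println(computeThreadCount(8)); // throttled from 8 down to 2
    }
}
```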


On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote:

--
Eric Z. Ayers - GWT Team - Atlanta, GA USA
http://code.google.com/webtoolkit/

Scott Blum

Jul 18, 2008, 12:54:08 PM
to Lex Spoon, BobV, Google-Web-Toolkit-Contributors
And here's Toby's data:

---------- Forwarded message ----------
From: Toby Reyelts <to...@google.com>
Date: Fri, Jul 18, 2008 at 12:20 PM
Subject: Some perf numbers on Scott's parallel compilation improvements

Hi guys,

I sat down with Scott and ran some compilations of Showcase and DynaTable on my four-core machine. All compilations were run with JDK 1.6, -Xmx1G -Xms1G, and specific compilations were run with -server versus -client, where noted. Showcase got about 60% faster for the fastest case, and DynaTable got about 33% faster.

If you look closely at the numbers, you can see that, even for the single-threaded compiles in trunk, the JVM is taking advantage of multiple processors. In particular, you can see that using -server causes a significant amount of JIT compiling to occur on a background thread. The numbers for Showcase make it astoundingly obvious that -server can have some huge performance effects.

Executive Summary: 

1) Nice patch Scott!
2) Use a multi-core machine even if you're not using Scott's patch
3) Use -server !!

Trunk

Showcase
(server)
real    2m46.133s
user    3m42.396s
sys     0m3.604s

(client)
real    3m43.187s
user    3m41.401s
sys     0m1.716s

DynaTable
(server)
real    0m23.787s
user    0m48.124s
sys     0m0.748s

(client)
real    0m21.779s
user    0m22.541s
sys     0m1.471s


Scott's patch

Showcase
(server)
real    1m6.743s
user    3m18.468s
sys     0m2.254s

(client)
real    1m16.876s
user    2m48.144s
sys     0m1.850s

DynaTable
(server)
real    0m18.870s
user    0m49.098s
sys     0m0.753s

(client)
real    0m14.637s
user    0m21.867s
sys     0m0.535s



Scott Blum

Jul 18, 2008, 12:55:35 PM
to Google-Web-Tool...@googlegroups.com, Lex Spoon, BobV
Because I have no idea what I'm doing and just did the simplest thing I could think of to get it working. :)

Seriously, I would love suggestions on how to do this better.  I'd also love for you to get your butt in to work so we can try it on your 8-core monster box. :)

Ray Cromwell

Jul 18, 2008, 1:42:02 PM
to Google-Web-Tool...@googlegroups.com, Lex Spoon, BobV
Scott,
First of all, kickass job. The code you are looking for is probably
Executors.newFixedThreadPool(nThreads).
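[A minimal sketch of what Ray's suggestion could look like, assuming the per-permutation work is factored into a Runnable; "compile(perm)" is an illustrative stand-in, not the patch's actual method.]

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSketch {
    public static void main(String[] args) throws InterruptedException {
        int threads = 2;
        int permutations = 4;
        // Fixed-size pool replaces the hand-managed worker threads.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final CountDownLatch done = new CountDownLatch(permutations);
        for (int i = 0; i < permutations; i++) {
            final int perm = i;
            pool.execute(new Runnable() {
                public void run() {
                    // The real compile(perm) would run here.
                    done.countDown();
                }
            });
        }
        done.await();   // wait for every permutation to finish
        pool.shutdown();
        System.out.println("compiled " + permutations + " permutations");
    }
}
```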

I have a question though: is deserializing byte buffers as fast as using
a deep tree clone method? I'm not an expert in Java serialization, but
I always had the impression it was not the greatest performer. I would
have thought that a copy() method on every JNode subclass would have
been faster, especially in cases where it is effectively final (not
overridden after HotSpot type-tightens a call site) and can be inlined
by HotSpot.

I can't wait to try this on my Mac Pro!
-Ray

Matthew Mastracci

Jul 18, 2008, 1:53:17 PM
to Google-Web-Tool...@googlegroups.com
Wow, awesome patch. This is going to be fantastic for dual-core boxes.

Is there a lot of time spent serializing/deserializing? If so, it
might be useful to try out JBoss Serialization as a drop-in
replacement. It's quite a bit faster for serializing large trees, but
(at least in my experience) increases the output size by a few percent
(< 10%).

You could also serialize to a MappedByteBuffer as well to take
advantage of platform mmap() and allow the OS to page the cache out if
necessary. Our current build server is memory constrained and pages a
fair bit during the compile.
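[A minimal sketch of the MappedByteBuffer idea: serialize to a byte array, write it through a mapping backed by a temp file so the bytes live in the page cache rather than on the Java heap, then read it back. The payload string stands in for a real serialized AST.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
    public static void main(String[] args) throws Exception {
        byte[] payload = serialize("fake AST"); // any Serializable stands in for the AST
        File f = File.createTempFile("ast", ".ser");
        f.deleteOnExit();
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        FileChannel ch = raf.getChannel();
        // The mapping is backed by the OS page cache, so it can be paged out.
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, payload.length);
        buf.put(payload);
        buf.flip();
        byte[] back = new byte[payload.length];
        buf.get(back);
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(back));
        System.out.println(in.readObject());
        ch.close();
        raf.close();
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bos);
        out.writeObject(o);
        out.close();
        return bos.toByteArray();
    }
}
```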

> <compiler-parallel-r3250.patch>

John Tamplin

Jul 18, 2008, 2:07:04 PM
to Google-Web-Tool...@googlegroups.com
On Fri, Jul 18, 2008 at 12:31 PM, Eric Ayers <zun...@google.com> wrote:
I noted that interactive performance of using my desktop did not
suffer while the build was going.  Still, I was thinking that it might
be wise to set some way (Java Property?) in computeThreadCount() to
throttle the number of threads.  Someone running an automated build on
a shared machine might not like the behavior of grabbing every CPU on
the system.  I did a web search and didn't see a JVM argument that
might artificially lower the # of processors.

Actually, I would rather have the same number of threads but set the OS priority such that interactive tasks will get preference.

--
John A. Tamplin
Software Engineer (GWT), Google

Alex Rudnick

Jul 18, 2008, 2:14:57 PM
to Google-Web-Tool...@googlegroups.com
On Fri, Jul 18, 2008 at 2:07 PM, John Tamplin <j...@google.com> wrote:
> Actually, I would rather have the same number of threads but set the OS
> priority such that interactive tasks will get preference.

Linux typically does that anyway, if I'm not mistaken. Do other
platforms? (if they don't by default, is there some way to make them?)

--
Alex Rudnick
swe, gwt, atl

Toby Reyelts

Jul 18, 2008, 2:21:51 PM
to Google-Web-Tool...@googlegroups.com
On Fri, Jul 18, 2008 at 1:53 PM, Matthew Mastracci <mat...@mastracci.com> wrote:

Wow, awesome patch.  This is going to be fantastic for dual-core boxes.

Is there a lot of time spent serializing/deserializing?  If so, it
might be useful to try out JBoss Serialization as a drop-in
replacement.  It's quite a bit faster for serializing large trees, but
(at least in my experience) increases the output size by a few percent
(< 10%).

You could also serialize to a MappedByteBuffer as well to take
advantage of platform mmap() and allow the OS to page the cache out if
necessary.  Our current build server is memory constrained and pages a
fair bit during the compile.

I'm very surprised by this. Any statistics you could give us here? 

Scott's patch checks the amount of available free memory, so if you don't set the JVM's max memory high enough, you won't get a parallel build at all.

Kelly Norton

Jul 18, 2008, 2:22:34 PM
to Google-Web-Tool...@googlegroups.com, Lex Spoon, BobV
My box is only a mere 4 cores.
2 x 2.66GHz Dual Core Intel Xeon

And for all that hardware, there is little difference from what Eric saw. Shouldn't I get a partial refund from Apple?

Showcase:
Before:  (0:02:04.754)
After:  (0:01:03.398)

/kel

John Tamplin

Jul 18, 2008, 2:39:53 PM
to Google-Web-Tool...@googlegroups.com, Lex Spoon, BobV
On Fri, Jul 18, 2008 at 2:22 PM, Kelly Norton <kno...@google.com> wrote:
My box is only a mere 4 cores.
2 x 2.66GHz Dual Core Intel Xeon

And for all that hardware, there is little difference from what Eric saw. Shouldn't I get a partial refund from Apple?

Showcase:
Before:  (0:02:04.754)
After:  (0:01:03.398)

Multicore Intel solutions have a front-side bus bandwidth constraint, so you generally have trouble keeping them busy if they don't have a really high L1/L2 hit rate.  I don't know if this is related to what you saw or not, just a data point.

Eric Ayers

Jul 18, 2008, 2:47:39 PM
to Google-Web-Tool...@googlegroups.com
That we can do with 'nice' on the unix cmdline. I don't know that
we'd need to add explicit support from inside the compiler.

Still, not all multi-processor machines are built alike. Allowing a
manual override for the calculation in computeThreadCount would at
least add a way to experiment.

On Fri, Jul 18, 2008 at 2:07 PM, John Tamplin <j...@google.com> wrote:

--

Ray Cromwell

Jul 18, 2008, 2:51:27 PM
to Google-Web-Tool...@googlegroups.com, Lex Spoon, BobV
4-core Mac Pro, 4x2.66GHz, 6GB of RAM, compiling Showcase

real 0m56.745s
user 2m43.539s
sys 0m9.241s

I used the following JVM args:
/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Commands/java
-server -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=40 -XX:+AggressiveOpts
-XstartOnFirstThread -Xmx1024m -Xms1024m

I clearly need a 3Ghz 8-core Mac Pro :)
-Ray

Matthew Mastracci

Jul 18, 2008, 3:03:58 PM
to Google-Web-Tool...@googlegroups.com
On 18-Jul-08, at 12:21 PM, Toby Reyelts wrote:

> You could also serialize to a MappedByteBuffer as well to take
> advantage of platform mmap() and allow the OS to page the cache out if
> necessary. Our current build server is memory constrained and pages a
> fair bit during the compile.
>
> I'm very surprised by this. Any statistics you could give us here?
>
> Scott's patch checks the free amount of available memory, so if you
> don't set the JVM's max memory high enough, you won't get a parallel
> build at all.
>

I'll try to dig up concrete numbers later today, but our build server
is fairly old and only has 1/2 GB of RAM. With the stock GWT 1.5
compiler, I get OOM exceptions (I have to force it to allocate a 512MB
heap to avoid this). It's not a powerhouse by far, and our current AST
as-is seems to occupy all of physical RAM. :)

Strangely enough, adding -server to the JVM on this old machine made
the compile+test *slower* (from 20 minutes to 36 minutes).

This build machine is due to be replaced at some point (sooner rather
than later), so this might not be a useful datapoint.

Matt.

John Tamplin

Jul 18, 2008, 3:53:45 PM
to Google-Web-Tool...@googlegroups.com
On Fri, Jul 18, 2008 at 2:47 PM, Eric Ayers <zun...@google.com> wrote:
That we can do with 'nice' on the unix cmdline.  I don't know that
we'd need to add explicit support from inside the compiler.

Still, not all multi-processor machines are built alike. Allowing a
manual override for the calculation in computeThreadCount would at
least add a way to experiment.

That is harder to do when building from inside an IDE.  Yes, you can go look up the process ids and renice them, but then you are spending more time fooling with it than you should have to.  I agree, if I am running ./FooApp-compile it is easy enough to nice it but I would like some option from an IDE launch config.

Toby Reyelts

Jul 18, 2008, 4:12:16 PM
to Google-Web-Tool...@googlegroups.com
On Fri, Jul 18, 2008 at 3:03 PM, Matthew Mastracci <mat...@mastracci.com> wrote:

On 18-Jul-08, at 12:21 PM, Toby Reyelts wrote:

> You could also serialize to a MappedByteBuffer as well to take
> advantage of platform mmap() and allow the OS to page the cache out if
> necessary.  Our current build server is memory constrained and pages a
> fair bit during the compile.
>
> I'm very surprised by this. Any statistics you could give us here?
>
> Scott's patch checks the free amount of available memory, so if you
> don't set the JVM's max memory high enough, you won't get a parallel
> build at all.
>

I'll try to dig up concrete numbers later today, but our build server
is fairly old and only has 1/2 GB of RAM.  With the stock GWT 1.5
compiler, I get OOM exceptions (I have to force it to allocate a 512MB
heap to avoid this).  

If you're running with a 1.5 JVM, the default heap size is very low (something along the lines of 64M), so this is no surprise. Unlike native apps which will just run away with all of your machine's memory, Java apps are bounded by default :)

Java 1.6 makes this better, in that it will attempt to set the max heap size based on a percentage of your machine's total memory.

It's not a powerhouse by far and our current AST
as-is seems to occupy all of physical RAM.  :)

Strangely enough, adding -server to the JVM on this old machine made
the compile+test *slower* (from 20 minutes to 36 minutes).

If you're actually paging (which wouldn't be surprising with only 512M RAM), then all bets are off about performance. We can probably find ways to reduce memory usage, but it hasn't been a priority since RAM is cheap and plentiful. If people are running out of heap space with reasonable amounts of physical memory, then we need to know about it.

This build machine is due to be replaced at some point (sooner rather
than later), so this might not be a useful datapoint.

My wife's old Pentium II has 768M RAM. I think that officially puts you outside of useful datapoints.

We might need to put down some minimum recommended specs for running the compiler. Something like - "Your machine should be no more than seven years old and have more memory than comes in an MP3 player" ;)
 
Matt.



Matthew Mastracci

Jul 18, 2008, 4:34:28 PM
to Google-Web-Tool...@googlegroups.com
On 18-Jul-08, at 2:12 PM, Toby Reyelts wrote:

>
> My wife's old pentium II has 768M RAM. I think that officially puts
> you outside of useful datapoints.
>
> We might need to put down some minimum recommended specs for running
> the compiler. Something like - "Your machine should be no more than
> seven years old and have more memory than comes in an MP3 player" ;)
>

Agreed, wholeheartedly. I'd rather have a fast compile on my machine
with more memory than worry about an old build machine. It'd have
more memory in it if I could get the damn thing past the bios screen
with more than one stick of RAM. ;)

Matt.

Cameron Braid

Jul 18, 2008, 10:11:19 PM
to Google-Web-Tool...@googlegroups.com
Fantastic patch.

Some stats

System :

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
8G Ram
Ubuntu hardy
java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) 64-Bit Server VM (build 10.0-b22, mixed mode)

Compiler JVM Args

<jvmarg value="-Xmx1G" />
<jvmarg value="-Xms1G" />
<jvmarg value="-XX:MaxPermSize=256m" />
<jvmarg value="-XX:+UseConcMarkSweepGC" />
<jvmarg value="-XX:+UseParNewGC" />
<jvmarg value="-XX:MinHeapFreeRatio=25" />
<jvmarg value="-XX:MaxHeapFreeRatio=40" />
<jvmarg value="-XX:+AggressiveOpts" />
<jvmarg value="-server" />


Compiling two modules in my project :

54s -> 34s
102s -> 57s

Compiling Sample Modules :

Note : I modified samples/common.ant.xml to add the above jvm args
Note : I tried with and without -server, and both seem to be about the same

Showcase 111s -> 53s
DynaTable 17s -> 13.5s

Awesome !

Regards

Cameron

Cameron Braid

Jul 18, 2008, 10:13:35 PM
to Google-Web-Tool...@googlegroups.com
If I build gwt samples using the default gwtc jvm arg -Xmx256m the following error occurs

If I increase this to 512m it is fine

gwtc:
     [java] Compiling module com.google.gwt.sample.showcase.Showcase
     [java] Compiling permutations
     [java]    Analyzing permutation #3
     [java]       [WARN] Out of memory; will retry permutation using fewer concurrent threads
     [java]    Analyzing permutation #13
     [java]       [WARN] Out of memory; will retry permutation using fewer concurrent threads
     [java]    Analyzing permutation #11
     [java]       [ERROR] An internal compiler exception occurred
     [java] com.google.gwt.dev.jjs.InternalCompilerException: Unexpected error during visit.
     [java]     at com.google.gwt.dev.jjs.ast.JVisitor.translateException(JVisitor.java:552)
     [java]     at com.google.gwt.dev.jjs.ast.JVisitor.doTraverse(JVisitor.java:543)
     [java]     at com.google.gwt.dev.jjs.ast.JVisitor.doAccept(JVisitor.java:529)
     [java]     at com.google.gwt.dev.jjs.ast.JVisitor.accept(JVisitor.java:77)
     [java]     at com.google.gwt.dev.jjs.ast.JProgram.traverse(JProgram.java:777)
     [java]     at com.google.gwt.dev.jjs.ast.JVisitor.doTraverse(JVisitor.java:541)
     [java]     at com.google.gwt.dev.jjs.ast.JVisitor.doAccept(JVisitor.java:523)
     [java]     at com.google.gwt.dev.jjs.ast.JVisitor.accept(JVisitor.java:69)
     [java]     at com.google.gwt.dev.jjs.impl.GenerateJavaScriptAST.execImpl(GenerateJavaScriptAST.java:1854)
     [java]     at com.google.gwt.dev.jjs.impl.GenerateJavaScriptAST.exec(GenerateJavaScriptAST.java:1650)
     [java]     at com.google.gwt.dev.jjs.JavaToJavaScriptCompiler.compile(JavaToJavaScriptCompiler.java:487)
     [java]     at com.google.gwt.dev.GWTCompiler$PermutationWorker.run(GWTCompiler.java:347)
     [java]     at java.lang.Thread.run(Thread.java:619)
     [java] Caused by: java.lang.OutOfMemoryError: Java heap space
     [java]     at java.util.Arrays.copyOf(Arrays.java:2734)
     [java]     at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
     [java]     at java.util.ArrayList.add(ArrayList.java:351)
     [java]     at com.google.gwt.dev.jjs.impl.GenerateJavaScriptLiterals.popList(GenerateJavaScriptLiterals.java:114)
     [java]     at com.google.gwt.dev.jjs.impl.GenerateJavaScriptAST$GenerateJavaScriptVisitor.endVisit(GenerateJavaScriptAST.java:495)
     [java]     at com.google.gwt.dev.jjs.ast.JClassType.traverse(JClassType.java:64)
     [java]     at com.google.gwt.dev.jjs.ast.JVisitor.doTraverse(JVisitor.java:541)
     [java]     ... 11 more
     [java]          [ERROR] Out of memory; to increase the amount of memory, use the -Xmx flag at startup (java -Xmx128M ...)
     [java]          [ERROR] at RichTextToolbar.java(40): final class RichTextToolbar extends Composite
     [java]             com.google.gwt.dev.jjs.ast.JClassType
     [java]          [ERROR] <no source info>: <JProgram>
     [java]             com.google.gwt.dev.jjs.ast.JProgram
     [java]       [ERROR] Permutation failed
     [java] com.google.gwt.core.ext.UnableToCompleteException: (see previous log entries)
     [java]     at com.google.gwt.dev.jjs.JavaToJavaScriptCompiler.logAndTranslateException(JavaToJavaScriptCompiler.java:670)
     [java]     at com.google.gwt.dev.jjs.JavaToJavaScriptCompiler.compile(JavaToJavaScriptCompiler.java:543)
     [java]     at com.google.gwt.dev.GWTCompiler$PermutationWorker.run(GWTCompiler.java:347)
     [java]     at java.lang.Thread.run(Thread.java:619)
     [java] [ERROR] Build failed

Regards

Cameron

On Sat, Jul 19, 2008 at 2:01 AM, Scott Blum <sco...@google.com> wrote:

Scott Blum

Jul 19, 2008, 7:45:53 PM
to Google-Web-Tool...@googlegroups.com
Yeah, this is the intended behavior.  Basically, it tried to run 3 threads initially, then as memory got tight, had to drop to fewer and fewer threads.  Finally, even running just 1 thread it still ran out.

I think something is slightly fishy, and we need to investigate this a bit.  The memory shouldn't be going up much permutation over permutation, unless the SoftReference in StandardCompilationResult isn't doing its job correctly.
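[A minimal sketch of the SoftReference pattern Scott mentions: the cached compilation result stays only softly reachable, so the GC may reclaim it under memory pressure instead of letting a later permutation OOM. Names here are illustrative, not the real StandardCompilationResult.]

```java
import java.lang.ref.SoftReference;

public class SoftCacheSketch {
    private SoftReference<String> cachedJs;

    String getJs() {
        // get() returns null if the GC has reclaimed the referent.
        String js = (cachedJs == null) ? null : cachedJs.get();
        if (js == null) {
            js = recompute();                        // regenerate if collected
            cachedJs = new SoftReference<String>(js);
        }
        return js;
    }

    String recompute() { return "function(){}"; }

    public static void main(String[] args) {
        System.out.println(new SoftCacheSketch().getJs());
    }
}
```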

Scott Blum

Jul 19, 2008, 7:48:05 PM
to Google-Web-Tool...@googlegroups.com, Lex Spoon, BobV
There are pros and cons to both.

Serialization:
- Was cake to implement
- Lower memory ceiling in tight situations since you don't have two copies of the AST in memory at the same time
- Anticipates out of process / distributed parallelized build

Clone:
- Probably faster
- Immutables such as String could be shared across permutations
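[A minimal sketch of the clone alternative: each node type implements its own copy(), and immutable fields like String are shared by reference across copies rather than duplicated. "Node" and its fields are illustrative stand-ins for the real JNode hierarchy.]

```java
import java.util.ArrayList;
import java.util.List;

public class CloneSketch {
    static class Node {
        final String name;          // immutable, safe to share across copies
        final List<Node> children = new ArrayList<Node>();
        Node(String name) { this.name = name; }

        Node copy() {
            Node c = new Node(name); // String reference shared, not duplicated
            for (Node child : children) {
                c.children.add(child.copy());
            }
            return c;
        }
    }

    public static void main(String[] args) {
        Node root = new Node("JProgram");
        root.children.add(new Node("JClassType"));
        Node copy = root.copy();
        System.out.println(copy != root);                      // distinct tree
        System.out.println(copy.children.get(0).name == root.children.get(0).name);
    }
}
```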

Cameron Braid

Jul 19, 2008, 8:28:13 PM
to Google-Web-Tool...@googlegroups.com
I guess part of my point is that the default build (before applying the patch) worked out of the box.

Also, my machine is capable of building with 4 threads (4 cores and 8 GB of RAM), yet the 'out of the box' configuration prevents this due to JVM arguments.

Would it be such a bad thing to have the defaults use -Xmx1G ?

What would the drawback be for people with < 1G ram ?

Also, when the "Out of memory; will retry permutation using fewer concurrent threads" occurs, do you call System.gc() to try and free up some memory?

Regards

Cameron

Lex Spoon

Jul 21, 2008, 11:04:12 AM
to Google-Web-Tool...@googlegroups.com
On Sat, Jul 19, 2008 at 8:28 PM, Cameron Braid <cam...@braid.com.au> wrote:
> I guess part of my point is that the default build (before applying the
> patch) worked out of the box.

Indeed. Scott's argument seems pretty good for why this should not
happen. And yet it does. :)


> Also, when the "Out of memory; will retry permutation using fewer
> concurrent threads" occurs, do you call System.gc() to try and free up some
> memory ?

That should be irrelevant. The JVM should do its own System.gc()
before throwing an OOM, and I've never seen evidence that it would not
do so.


-Lex

Scott Blum

Jul 21, 2008, 12:39:47 PM
to Google-Web-Tool...@googlegroups.com
On Fri, Jul 18, 2008 at 1:53 PM, Matthew Mastracci <mat...@mastracci.com> wrote:
Is there a lot of time spent serializing/deserializing?  If so, it
might be useful to try out JBoss Serialization as a drop-in
replacement.  It's quite a bit faster for serializing large trees, but
(at least in my experience) increases the output size by a few percent
(< 10%).

On my machine, it's about 2 seconds per permutation spent deserializing Showcase AST, so it's worth looking at.

Toby also suggested that there's a way to use bytecode rewriting to make your serializable objects externalizable in a way that makes them serialize/deserialize much faster, so we might look into this as well.
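[A minimal hand-written sketch of why Externalizable can beat default serialization: the class writes its own fields directly instead of relying on reflective field discovery. "Node" is an illustrative stand-in for an AST node, not GWT's actual classes.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

public class ExternalizableSketch {
    public static class Node implements Externalizable {
        String name;
        int children;
        public Node() {}                 // Externalizable requires a no-arg ctor
        Node(String name, int children) { this.name = name; this.children = children; }
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(name);          // fields written directly, no reflection
            out.writeInt(children);
        }
        public void readExternal(ObjectInput in) throws IOException {
            name = in.readUTF();
            children = in.readInt();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bos);
        out.writeObject(new Node("JProgram", 3));
        out.close();
        ObjectInputStream in = new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray()));
        Node n = (Node) in.readObject();
        System.out.println(n.name + ":" + n.children);
    }
}
```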

Brad Leupen

Jul 21, 2008, 1:12:17 PM
to Google Web Toolkit Contributors
This patch sounds fantastic! Any chance it will make it into RC2?

On Jul 21, 12:39 pm, "Scott Blum" <sco...@google.com> wrote:
> On Fri, Jul 18, 2008 at 1:53 PM, Matthew Mastracci <matt...@mastracci.com>

Scott Blum

Jul 21, 2008, 2:23:58 PM
to Google-Web-Tool...@googlegroups.com
It would be pretty risky to throw it in at this point; the target was 1.6.

Lex Spoon

Jul 21, 2008, 4:09:39 PM
to Scott Blum, BobV, Google-Web-Toolkit-Contributors
This is a welcome improvement, and the implementation is concise.
You can tell how exciting it is just by looking at the number of people who
applied the patch and tried it locally. Very nice work.

I have reviewed the code now and think one aspect of it should be
looked at. More on that below.

Before getting to that, I have two questions that can hopefully be
brushed off but that might be cause for concern:

First, is it now possible, when findEntryPoints() gets entry points
via deferred binding, that users can get multiple entries? It looks like
the answer before-patch was no, but after-patch is yes. Before, there
was a rebind call that generated a single class name. Now, there is a
rebind call that generates a list of class names, and they are all
processed. What is going on there? I am not sure to begin with how a
rebind can get multiple results, but furthermore I don't see why to
change behavior in this patch.

Less importantly, does method tracing still work? I see a lot of
System.out.println's. As an expedient fixup, if there really is a
break, it would seem fine to simply force a sequential mode if tracing
is on. It is not necessary to write all the code that would be
necessary to serialize the logs after all the workers are done.

Aside from these questions, my only significant concern is about the
mechanics of farming out jobs to workers and accumulating their
answers. There are a number of little issues I see, enough that it
seems worth another iteration on this area. The issues are:

1. Some posters have suggested using an existing Java thread pool.
Nice idea, although I don't know off hand if there would be an out of the
box solution that can handle the out-of-memory condition as desired.
It seems worth a look.

2. I think OutOfMemory handling looks good for the most likely cases, but
I am unsure how well it's handled in less common cases.
First, if it happens, nextPerm is set to perm.number, possibly causing
intermediate permutations to be computed more than once. Second, I
am not 100% sure it is guaranteed that even one worker will survive.
Isn't it possible that an OOM will be sent to one worker while
another worker is still handling its own OOM? For this reason,
before bailing entirely, it would be nice to go through the cycle
once of going to zero workers, starting one worker, and then that
one really does die in isolation. The third problem is that there
is no guarantee that I can see that a worker will be killed in favor
of the manager thread. It would be very nice if there were some
way to tell the JVM which threads to kill first. Does anyone know
JVM threading well enough to say if this is possible?

3. As a gut reaction, PermutationManager has more synchronized
methods than it should need. For example, hasNextPermutation() is
exposed as a public synchronized method, but I don't think it should
be. In general, getting synchronization right can be mind bending,
so the fewer places it happens the better. See below for one idea.

4. The deserialization of ASTs is currently serialized by waiting on
myLockObject. Instead, it would be faster and no more complex to
say that it's worker 0 on its first deserialize that gets to reuse
the original AST. Then, myLockObject could even be removed.

5. GWTCompiler.compilePermutations catches and ignores
InterruptedException's. If I read the API docs correctly, such an
exception means that the *current* thread has been interrupted. In
that case, I don't think continuing to loop is the right thing. I'm
not sure, honestly, what you *are* supposed to do, but it would seem
nicer to give an explicit ICE.


Again, none of these is huge, but they seem enough to merit another
iteration on this aspect of the patch. Two ideas to address them:

1. Have the PermutationManager and PermutationWorker communicate via
queues? There could be one queue of permutations needing work and
one queue to hold the results. An individual result would be a small
class that holds a permutation plus an enum of success/error/OOM.
This strategy should solve a few problems at once. It improves
parallelization by reducing the synchronization point, it simplifies
the handling of failures (just put failed permutations back on the
work queue). Also, the workers could have only references to these
queues instead of having a reference to the whole manager, thus
really limiting the number of coordination points between threads.
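[A minimal sketch of the queue-based design Lex describes: a work queue of permutation ids and a result queue carrying an outcome enum, with workers that hold references only to the two queues. All names (Outcome, Result) are illustrative, not from the patch.]

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueSketch {
    enum Outcome { SUCCESS, ERROR, OUT_OF_MEMORY }

    static class Result {
        final int permutation;
        final Outcome outcome;
        Result(int permutation, Outcome outcome) {
            this.permutation = permutation;
            this.outcome = outcome;
        }
    }

    public static void main(String[] args) throws Exception {
        final BlockingQueue<Integer> work = new LinkedBlockingQueue<Integer>();
        final BlockingQueue<Result> results = new LinkedBlockingQueue<Result>();
        int perms = 3;
        for (int i = 0; i < perms; i++) {
            work.put(i);
        }
        Runnable worker = new Runnable() {
            public void run() {
                Integer perm;
                while ((perm = work.poll()) != null) {
                    // The real compile(perm) would run here; on OOM the worker
                    // could put perm back on the work queue and exit.
                    try {
                        results.put(new Result(perm, Outcome.SUCCESS));
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        };
        Thread t1 = new Thread(worker);
        Thread t2 = new Thread(worker);
        t1.start();
        t2.start();
        int succeeded = 0;
        for (int i = 0; i < perms; i++) {
            if (results.take().outcome == Outcome.SUCCESS) {
                succeeded++;
            }
        }
        t1.join();
        t2.join();
        System.out.println(succeeded + " permutations succeeded");
    }
}
```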

2. I started to say, drop out of memory handling for now. I'm going
back and forth on it. It looks fine to me to drop it so long as
there is some way for users to force a single-thread compile.
However, that would appear to require a flag or at least a
well-known environment variable. As a second choice, maybe if any
OOM happens at all then shut down all workers, wait for them to
finish, and then switch to sequential mode, but this is rather
complicated.


Nits:

JavaToJavaScriptCompiler.createReboundModuleLoad -- Isn't "resultType" a
type corresponding to "mainClassName"? If so, a name like "mainType"
looks clearer.

JavaToJavaScriptCompiler.compile: InternalCompilerException looks more
canonical than RuntimeException.

IMHO, PermutationManager and PermutationWorker are large and important
enough to merit at least one top-level class. In addition, making
PermutationWorker a top-level class will remove a bunch of instance
variables from its scope, thus giving an additional check that it is
not accessing more shared state than intended.

PermutationManager.failed() would be better named getFailed().

The copyright year in the header comments is not always updated.


Again, good stuff. Let's just take a second look at the actual
mechanics of the parallelization.

Lex Spoon

Ray Cromwell

Jul 21, 2008, 4:30:47 PM
to Google-Web-Tool...@googlegroups.com
On Mon, Jul 21, 2008 at 1:09 PM, Lex Spoon <sp...@google.com> wrote:
>
> 1. Some posters have suggested using an existing Java thread pool.
> Nice idea, although I don't know off hand if there would be an out of the
> box solution that can handle the out-of-memory condition as desired.
> It seems worth a look.

You can have complete control over thread creation by
passing a ThreadFactory and/or subclassing AbstractExecutorService or
ThreadPoolExecutor. I'm not sure what else is needed.

-Ray

Toby Reyelts

Jul 21, 2008, 7:20:25 PM
to Google-Web-Tool...@googlegroups.com
Doug Lea's a smart man. We should be using his code. I have been for about the past eight years or so.

Ray Cromwell

Jul 21, 2008, 7:28:14 PM
to Google-Web-Tool...@googlegroups.com
Yep, I've found that using Doug's classes has prevented countless
concurrency bugs and saved me time.

-Ray

Scott Blum

Jul 22, 2008, 8:25:09 PM
to Lex Spoon, BobV, Google-Web-Toolkit-Contributors
On Mon, Jul 21, 2008 at 4:09 PM, Lex Spoon <sp...@google.com> wrote:
First, is it now possible, when findEntryPoints() gets entry points
via deferred binding, that users can get multiple entries?  It looks like
the answer before-patch was no, but after-patch is yes.  Before, there
was a rebind call that generated a single class name.  Now, there is a
rebind call that generates a list of class names, and they are all
processed.  What is going on there?  I am not sure to begin with how a
rebind can get multiple results, but furthermore I don't see why to
change behavior in this patch.

The behavior *should* be the same as before.  The major confusing thing is that you have to figure out all possible entry points up front, instead of doing a single deferred binding.  This is because you have to know the possible entry points to begin optimizing.  But the same logic should apply as before:

- If an entry point has a static onModuleLoad(), it is not rebound
- Otherwise, rebind
 - If the rebind result has a static onModuleLoad(), call it directly
 - If the rebind result has an instance onModuleLoad(), create a new instance of that class and call onModuleLoad() on it
 - Otherwise, fail
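The rules above can be sketched as a small decision function. This is a hypothetical illustration only; the Type class and its fields are stand-ins, not the real GWT compiler's AST types.

```java
public class EntryPointRules {
  /** Stand-in for a type in the unified AST; all fields are hypothetical. */
  static class Type {
    final String name;
    final boolean staticOnModuleLoad;
    final boolean instanceOnModuleLoad;
    Type reboundResult;  // what deferred binding would resolve this type to

    Type(String name, boolean isStatic, boolean isInstance) {
      this.name = name;
      this.staticOnModuleLoad = isStatic;
      this.instanceOnModuleLoad = isInstance;
    }
  }

  static String entryCall(Type entry) {
    if (entry.staticOnModuleLoad) {
      return entry.name + ".onModuleLoad()";             // not rebound
    }
    Type rebound = entry.reboundResult;                  // otherwise, rebind
    if (rebound.staticOnModuleLoad) {
      return rebound.name + ".onModuleLoad()";           // call directly
    }
    if (rebound.instanceOnModuleLoad) {
      return "new " + rebound.name + "().onModuleLoad()";  // instantiate
    }
    throw new IllegalStateException("no onModuleLoad() on " + rebound.name);
  }

  public static void main(String[] args) {
    Type impl = new Type("MyEntryImpl", false, true);
    Type entry = new Type("MyEntry", false, false);
    entry.reboundResult = impl;
    System.out.println(entryCall(entry));
  }
}
```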


Less importantly, does method tracing still work?  I see a lot of
System.out.println's.  As an expedient fixup, if there really is a
break, it would seem fine to simply force a sequential mode if tracing
is on.  It is not necessary to write all the code that would be
necessary to serialize the logs after all the workers are done.

Good point, but I'm willing to punt on this for now, because in practice I always edit the gwt.xml to only have one perm when I'm doing method tracing.
 
Aside from these questions, my only significant concern is about the
mechanics of farming out jobs to workers and accumulating their
answers.  There are a number of little issues I see, enough that it
seems worth another iteration on this area.  The issues are:

 1. Some posters have suggested using an existing Java thread pool.
 Nice idea, although I don't know off hand if there would be an out of the
 box solution that can handle the out-of-memory condition as desired.
 It seems worth a look.

I'm going to look over Ray's suggestions and see what I can come up with here.
 
 2. I think OutOfMemory handling looks good for the most likely cases, but
 I am unsure how well it's handled in less common cases.
 First, if it happens, nextPerm is set to perm.number, possibly causing
 intermediate permutations to be computed more than once.  Second, I
 am not 100% sure it is guaranteed that even one worker will survive.
 Isn't it possible that an OOM will be sent to one worker while
 another worker is still handling its own OOM?  For this reason,
 before bailing entirely, it would be nice to go through the cycle
 once of going to zero workers, starting one worker, and then that
 one really does die in isolation.  The third problem is that there
 is no guarantee that I can see that a worker will be killed in favor
 of the manager thread.  It would be very nice if there were some
 way to tell the JVM which threads to kill first.  Does anyone know
 JVM threading well enough to say if this is possible?

I can hopefully address most of your concerns:
- I'm okay with the OOM recovery to be on a best effort basis until we get some live data back from users.

- 99% of the time, the OOM will occur while a thread is creating a Java AST off the serialized form.  That code block is synchronized, so in theory only one thread should fail in this spot at a time.  Once the OOM is thrown, the whole AST that was just created will become collectible, and the JVM should aggressively collect before any more OOMs get thrown.
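The guarded-deserialize pattern described here can be sketched as follows. All names are hypothetical; the point is only that the inflate step is serialized behind one lock, so an OutOfMemoryError hits at most one thread in that section and the partially built copy becomes collectible as the stack unwinds.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class AstSource {
  private final byte[] serializedAst;
  private final Object deserializeLock = new Object();

  public AstSource(byte[] serializedAst) {
    this.serializedAst = serializedAst;
  }

  /** Inflates a fresh AST copy; only one worker may do this at a time. */
  public Object deserializeCopy() throws IOException, ClassNotFoundException {
    synchronized (deserializeLock) {
      ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(serializedAst));
      try {
        return in.readObject();  // an OOM here frees the partial copy
      } finally {
        in.close();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Round-trip a stand-in payload the way the compiler round-trips the AST.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject("unified AST stand-in");
    out.close();
    AstSource source = new AstSource(bytes.toByteArray());
    System.out.println(source.deserializeCopy());
  }
}
```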

- The manager thread should be about impossible to OOM since it's just waiting on the other threads and not allocating anything.  If it did OOM, the whole compile would immediately halt with failure.

That being said, I'll rethink this when I look at the concurrency stuff.
 
 3. As a gut reaction, PermutationManager has more synchronized
 methods than it should need.  For example, hasNextPermutation() is
 exposed as a public synchronized method, but I don't think it should
 be.  In general, getting synchronization right can be mind bending,
 so the fewer places it happens the better.  See below for one idea.

In practice, it shouldn't matter since most of them will be very quick relative to the compiler.  But I will rethink when I look at the concurrency.  I had actually thought of reformulating it such that a PermutationWorker was an inner class of PermutationManager and PM exposed some protected methods.
 
 4. The deserialization of ASTs is currently serialized by waiting on
 myLockObject.  Instead, it would be faster and no more complex to
 say that it's worker 0 on its first deserialize that gets to reuse
 the original AST.  Then, myLockObject could even be removed.

But then it'd be much easier to get multiple simultaneous OOMs.  And I don't want to complicate the API of JJS; there are some non-GWT callers, believe it or not (IDE integration).
 
 5. GWTCompiler.compilePermutations catches and ignores
 InterruptedException's.  If I read the API docs correctly, such an
 exception means that the *current* thread has been interrupted.  In
 that case, I don't think continuing to loop is the right thing.  I'm
 not sure, honestly, what you *are* supposed to do, but it would seem
 nicer to give an explicit ICE.

You have to catch the checked exception; in practice the main thread should never get interrupted, but I think I read in Effective Java that it could happen spuriously and retrying the wait was the correct solution.
 
 1. Have the PermutationManager and PermutationWorker communicate via
 queues?  There could be one queue of permutations needing work and
 one queue to hold the results. An individual result would be a small
 class that holds a permutation plus an enum of success/error/OOM.
 This strategy should solve a few problems at once.  It improves
 parallelization by reducing the synchronization point, it simplifies
 the handling of failures (just put failed permutations back on the
 work queue).  Also, the workers could have only references to these
 queues instead of having a reference to the whole manager, thus
 really limiting the number of coordination points between threads.

This seems promising.  Maybe a follow-up patch once the original is in?
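Lex's queue-based design could look something like the sketch below: workers pull permutation ids from a work queue and push tagged results to a result queue, so the queues are the only coordination points. All names are hypothetical; the actual patch under review used a hand-rolled manager/worker pair instead.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueModel {
  enum Status { SUCCESS, ERROR, OOM }

  /** A permutation plus an outcome tag, as Lex describes. */
  static class Result {
    final int permutation;
    final Status status;
    Result(int permutation, Status status) {
      this.permutation = permutation;
      this.status = status;
    }
  }

  public static void main(String[] args) throws Exception {
    final BlockingQueue<Integer> work = new LinkedBlockingQueue<Integer>();
    final BlockingQueue<Result> results = new LinkedBlockingQueue<Result>();
    for (int i = 0; i < 4; i++) {
      work.add(i);  // four permutations to compile
    }

    Runnable worker = new Runnable() {
      public void run() {
        Integer perm;
        while ((perm = work.poll()) != null) {
          // ... compile the permutation here; on OOM, a manager could
          // put perm back on the work queue instead ...
          results.add(new Result(perm, Status.SUCCESS));
        }
      }
    };
    Thread t1 = new Thread(worker);
    Thread t2 = new Thread(worker);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    System.out.println(results.size());
  }
}
```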
 
 2. I started to say, drop out of memory handling for now.  I'm going
 back and forth on it.  It looks fine to me to drop it so long as
 there is some way for users to force a single-thread compile.
 However, that would appear to require a flag or at least a
 well-known environment variable.  As a second choice, maybe if any
 OOM happens at all then shut down all workers, wait for them to
 finish, and then switch to sequential mode, but this is rather
 complicated.

All possibilities.  In an ideal world, it would simply do the right thing and the user would get optimal behavior without having to touch any levers.  I think this needs to run in the wild to see how well we hit that goal and then iterate.
 
JavaToJavaScriptCompiler.createReboundModuleLoad -- Isn't "resultType" a
 type corresponding to "mainClassName"?  If so, a name like "mainType"
 looks clearer.

They're not identical; the first is actually a deferred binding result on the mainType.  I renamed these to "reboundEntryType" and "originalMainClassName" for clarity.
 
JavaToJavaScriptCompiler.compile: InternalCompilerException looks more
canonical than RuntimeException.

Deeper in the compile process, I'd agree, but I think we're at a high enough level where this doesn't really involve the compiler at all; it should literally be impossible to get this exception since you can't have previously serialized a class that isn't loaded and available.
 
IMHO, PermutationManager and PermutationWorker are large and important
enough to merit at least one top-level class.  In addition, making
PermutationWorker a top-level class will remove a bunch of instance
variables from its scope, thus giving an additional check that it is
not accessing more shared state than intended.

I can't argue with you, but I would ask if it's good enough that I can hit this in a second pass when I rethink the concurrency stuff.
 
PermutationManager.failed() would be better named getFailed().

Agreed; renamed them to recordFailure() and isFailed().
 
The copyright year in the header comments is not always updated.

Thanks, will fix.

Do you see any other issues that would potentially block a commit, assuming I come back and hit the concurrency / worker model in a second pass?

Thanks,
Scott

Toby Reyelts

unread,
Jul 23, 2008, 10:23:17 AM7/23/08
to Google-Web-Tool...@googlegroups.com, Lex Spoon, BobV

No, in fact, this is exactly the opposite of the right thing to do. Perhaps you're confusing spurious wakeups with interrupts? It's possible to get a spurious wakeup, which will wake you from a wait. This is why you test for a condition in a while loop. This is often called a "wait loop".

In contrast, if your thread has been interrupted, it's because somebody is requesting your thread to stop executing. If you choose to spin anyway, it usually means they then have to resort to more drastic and unsafe measures, such as Thread.stop or even killing your process. There is an example of proper Thread.interrupt use in Effective Java, even though it's not the go-to book on concurrency.
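The contrast Toby is drawing can be sketched in a few lines: re-test the condition in a loop to absorb spurious wakeups, but let InterruptedException propagate, because it means someone asked the thread to stop. The class and field names here are hypothetical.

```java
public class WaitLoop {
  private boolean done;

  public synchronized void awaitDone() throws InterruptedException {
    while (!done) {  // wait loop: guards against spurious wakeups
      wait();        // InterruptedException propagates: stop waiting
    }
  }

  public synchronized void markDone() {
    done = true;
    notifyAll();
  }

  public static void main(String[] args) throws Exception {
    final WaitLoop loop = new WaitLoop();
    Thread signaller = new Thread(new Runnable() {
      public void run() {
        loop.markDone();
      }
    });
    signaller.start();
    loop.awaitDone();
    System.out.println("done");
  }
}
```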
 
 
 1. Have the PermutationManager and PermutationWorker communicate via
 queues?  There could be one queue of permutations needing work and
 one queue to hold the results. An individual result would be a small
 class that holds a permutation plus an enum of success/error/OOM.
 This strategy should solve a few problems at once.  It improves
 parallelization by reducing the synchronization point, it simplifies
 the handling of failures (just put failed permutations back on the
 work queue).  Also, the workers could have only references to these
 queues instead of having a reference to the whole manager, thus
 really limiting the number of coordination points between threads.

This seems promising.  Maybe a follow-up patch once the original is in?
 
 2. I started to say, drop out of memory handling for now.  I'm going
 back and forth on it.  It looks fine to me to drop it so long as
 there is some way for users to force a single-thread compile.
 However, that would appear to require a flag or at least a
 well-known environment variable.  As a second choice, maybe if any
 OOM happens at all then shut down all workers, wait for them to
 finish, and then switch to sequential mode, but this is rather
 complicated.

All possibilities.  In an ideal world, it would simply do the right thing and the user would get optimal behavior without having to touch any levers.  I think this needs to run in the wild to see how well we hit that goal and then iterate.

Almost all real-world Java programs that have thread pools have a lever or so to tweak configuration such as the max number of threads in the pool.
 

Lex Spoon

unread,
Jul 23, 2008, 2:57:04 PM7/23/08
to Scott Blum, BobV, Google-Web-Toolkit-Contributors
I think we should keep working on the concurrency issues until we are
at least sure we have not introduced instability. This isn't a
random style complaint, but a real concern that we are introducing
much more instability than we have to.

I find concurrency bugs terrifying. I think we all should.
Concurrency bugs cause spurious hangs on users' desks that cannot be
replicated in house. Normal testing doesn't catch them, so your only
defense is to write perfect code. How scary is that? To improve our
chances, we can be defensive by avoiding shared state and by avoiding
synchronous cross-thread communication. This may sound awkward, but I
have found in practice that the only cost is in extra upfront design.
Once the general arrangement is decided, the code tends to be no more
complex than a shared state and locks approach.


Aside from concurrency, the out of memory handling is also a possible
increase in instability. While I agree we need to see live data
before tuning, we should go ahead and remove possible instability
problems. My favorite solution would be a Java property limiting the
number of threads, because we can do it now and it's simple to
implement. To contrast, the built-in OOM handling seems to have a lot
of questions and there's not a thorough answer on the table yet.
Particularly crippling is that we don't have a strong reason to
believe the dispatch thread won't get an OOM exception when memory
gets tight. If we had a way to be sure of that, then I think we could
work out something that robustly works so long as a 1-thread compile
can succeed.
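The lever Lex proposes could be as small as reading a system property once at startup to cap the worker count. The property name below is purely hypothetical, not one the patch defined.

```java
public class ThreadCap {
  /** Caps workers at the CPU count, or lower if the user sets a property. */
  static int workerCount() {
    int processors = Runtime.getRuntime().availableProcessors();
    String cap = System.getProperty("gwt.jjs.maxThreads");  // hypothetical name
    if (cap != null) {
      return Math.max(1, Math.min(processors, Integer.parseInt(cap)));
    }
    return processors;
  }

  public static void main(String[] args) {
    // Setting the property to 1 forces a sequential compile.
    System.setProperty("gwt.jjs.maxThreads", "1");
    System.out.println(workerCount());
  }
}
```

Users who hit memory trouble could then force a single-thread compile without waiting for smarter built-in OOM handling.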


On the concurrency issues:


>> 3. As a gut reaction, PermutationManager has more synchronized
>> methods than it should need. For example, hasNextPermutation() is
>> exposed as a public synchronized method, but I don't think it should
>> be. In general, getting synchronization right can be mind bending,
>> so the fewer places it happens the better. See below for one idea.
>
> In practice, it shouldn't matter since most of them will be very quick
> relative to the compiler.

This is an example of what I mean by being optimistic about
concurrency bugs. While I didn't see any particular bug in this code,
I propose that we be more defensive.

>> 4. The deserialization of ASTs is currently serialized by waiting on
>> myLockObject. Instead, it would be faster and no more complex to
>> say that it's worker 0 on its first deserialize that gets to reuse
>> the original AST. Then, myLockObject could even be removed.
>
> But then, it'd be much easier to get multiple simultaneous OOMs. And I
> don't want to complicate the API of JJS, there are some non-GWT callers,
> believe it or not (IDE integration).

In this case, there is no outright bug, but I don't understand why we
would pick the slower approach. The API would be unaffected by the
choice of approach, and the memory usage would be within dozens of
bytes of each other. The substantive differences are that one
solution runs faster and has one fewer synchronized block.


>> 5. GWTCompiler.compilePermutations catches and ignores
>> InterruptedException's. If I read the API docs correctly, such an
>> exception means that the *current* thread has been interrupted. In
>> that case, I don't think continuing to loop is the right thing. I'm
>> not sure, honestly, what you *are* supposed to do, but it would seem
>> nicer to give an explicit ICE.
>
> You have to catch the checked exception; in practice the main thread should
> never get interrupted, but I think I read in Effective Java that it could
> happen spuriously and retrying the wait was the correct solution.

It depends on which kind of interrupt it is. I agree that it's weird
for there to be a checked exception for this situation, as opposed to
the one you described, but that's my read of the Javadocs.

The non-concurrency issues all sound great:

> The behavior *should* be the same as before. The major confusing thing is
> that you have to figure out all possible entry points up front, instead of doing
> a single deferred binding. This is because you have to know the possible entry
> points to begin optimizing.

Excellent, gotcha. It's just like with JGwtCreate.


>> JavaToJavaScriptCompiler.createReboundModuleLoad -- Isn't "resultType" a
>> type corresponding to "mainClassName"? If so, a name like "mainType"
>> looks clearer.
>
> They're not identical; the first is actually a deferred binding result on
> the mainType. I renamed these to "reboundEntryType" and
> "originalMainClassName" for clarity.

Cool.

>> JavaToJavaScriptCompiler.compile: InternalCompilerException looks more
>> canonical than RuntimeException.
>
> Deeper in the compile process, I'd agree, but I think we're at a high enough
> level where this doesn't really involve the compiler at all; it should
> literally be impossible to get this exception since you can't have
> previously serialized a class that isn't loaded and available.

Okay.


>> IMHO, PermutationManager and PermutationWorker are large and important
>> enough to merit at least one top-level class. In addition, making
>> PermutationWorker a top-level class will remove a bunch of instance
>> variables from its scope, thus giving an additional check that it is
>> not accessing more shared state than intended.
>
> I can't argue with you, but I would ask if it's good enough that I can hit
> this in a second pass when I rethink the concurrency stuff.

Yes, that makes sense.

-Lex
