I ran this on my desktop - 2.4 GHz Intel Core 2 Duo (2 cores) - and saw the
Showcase compile time drop from 02:02 to 01:19.
I noted that interactive performance of using my desktop did not
suffer while the build was going. Still, I was thinking that it might
be wise to set some way (Java Property?) in computeThreadCount() to
throttle the number of threads. Someone running an automated build on
a shared machine might not like the behavior of grabbing every CPU on
the system. I did a web search and didn't see a JVM argument that
might artificially lower the # of processors.
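Something like this is what I had in mind for computeThreadCount() - just a sketch, and the property name "gwt.jjs.maxThreads" is invented for illustration:

```java
public class ThreadCountSketch {
  // Sketch only: "gwt.jjs.maxThreads" is a hypothetical property name.
  static int computeThreadCount() {
    int threads = Runtime.getRuntime().availableProcessors();
    // Allow e.g. -Dgwt.jjs.maxThreads=1 to throttle a build on a shared box.
    String cap = System.getProperty("gwt.jjs.maxThreads");
    if (cap != null) {
      try {
        threads = Math.min(threads, Math.max(1, Integer.parseInt(cap)));
      } catch (NumberFormatException ignored) {
        // Fall back to the CPU count on a malformed value.
      }
    }
    return threads;
  }
}
```

That way an automated build could set a single JVM property rather than needing a special JVM argument.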
On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote:
--
Eric Z. Ayers - GWT Team - Atlanta, GA USA
http://code.google.com/webtoolkit/
I have a question though: is deserializing ByteBuffers as fast as using
a deep tree clone method? I'm not an expert in Java serialization, but
I always had the impression it was not the greatest performer. I would
have thought that a copy() method on every JNode subclass would have
been faster, especially in cases where it is effectively final (not
overridden after HotSpot type-tightens a call site) and can be inlined
by HotSpot.
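What I'm picturing is roughly this - illustrative stand-in classes, not the real JNode hierarchy:

```java
// Stand-in node types (not the real GWT AST): a virtual copy() on each
// subclass clones the tree with no serialization round trip, and a
// monomorphic call site can be inlined by HotSpot.
abstract class Node {
  abstract Node copy();
}

final class Leaf extends Node {
  final int value;
  Leaf(int value) { this.value = value; }
  Node copy() { return new Leaf(value); } // effectively final, inlinable
}

final class Pair extends Node {
  final Node left, right;
  Pair(Node left, Node right) { this.left = left; this.right = right; }
  Node copy() { return new Pair(left.copy(), right.copy()); }
}
```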
I can't wait to try this on my Mac Pro!
-Ray
> <compiler-parallel-r3250.patch>
> I noted that interactive performance of using my desktop did not
> suffer while the build was going. Still, I was thinking that it might
> be wise to set some way (Java Property?) in computeThreadCount() to
> throttle the number of threads. Someone running an automated build on
> a shared machine might not like the behavior of grabbing every CPU on
> the system. I did a web search and didn't see a JVM argument that
> might artificially lower the # of processors.
Linux typically does that anyway, if I'm not mistaken. Do other
platforms? (if they don't by default, is there some way to make them?)
--
Alex Rudnick
swe, gwt, atl
Wow, awesome patch. This is going to be fantastic for dual-core boxes.
Is there a lot of time spent serializing/deserializing? If so, it
might be useful to try out JBoss Serialization as a drop-in
replacement. It's quite a bit faster for serializing large trees, but
(at least in my experience) increases the output size by a few percent
(< 10%).
You could also serialize to a MappedByteBuffer as well to take
advantage of platform mmap() and allow the OS to page the cache out if
necessary. Our current build server is memory constrained and pages a
fair bit during the compile.
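The MappedByteBuffer idea could look something like this sketch (the helper and file name are invented for illustration):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
  // Write serialized bytes through a memory-mapped file: the buffer lives
  // in the OS page cache rather than on the Java heap, so the OS can page
  // it out under memory pressure.
  static boolean writeMapped(File file, byte[] payload) {
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
         FileChannel channel = raf.getChannel()) {
      MappedByteBuffer buf =
          channel.map(FileChannel.MapMode.READ_WRITE, 0, payload.length);
      buf.put(payload);
      buf.force(); // flush dirty pages to disk
      return true;
    } catch (IOException e) {
      return false;
    }
  }
}
```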
My box is only a mere 4 cores: 2 x 2.66GHz Dual Core Intel Xeon. And for
all that hardware, there is little difference from what Eric saw.
Shouldn't I get a partial refund from Apple?
Showcase:
Before: (0:02:04.754)
After: (0:01:03.398)
Still, not all multi-processor machines are built alike. Allowing a
manual override for the calculation in computeThreadCount would at
least add a way to experiment.
On Fri, Jul 18, 2008 at 2:07 PM, John Tamplin <j...@google.com> wrote:
--
real 0m56.745s
user 2m43.539s
sys 0m9.241s
I used the following JVM args:
/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Commands/java
-server -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
-XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=40 -XX:+AggressiveOpts
-XstartOnFirstThread -Xmx1024m -Xms1024m
I clearly need a 3Ghz 8-core Mac Pro :)
-Ray
> You could also serialize to a MappedByteBuffer as well to take
> advantage of platform mmap() and allow the OS to page the cache out if
> necessary. Our current build server is memory constrained and pages a
> fair bit during the compile.
>
> I'm very surprised by this. Any statistics you could give us here?
>
> Scott's patch checks the free amount of available memory, so if you
> don't set the JVM's max memory high enough, you won't get a parallel
> build at all.
>
I'll try to dig up concrete numbers later today, but our build server
is fairly old and only has a 1/2 GB of RAM. With the stock GWT 1.5
compiler, I get OOM exceptions (I have to force it to allocate a 512MB
heap to avoid this). It's not a powerhouse by far and our current AST
as-is seems to occupy all of physical RAM. :)
Strangely enough, adding -server to the JVM on this old machine made
the compile+test *slower* (from 20 minutes to 36 minutes).
This build machine is due to be replaced at some point (sooner rather
than later), so this might not be a useful datapoint.
Matt.
That we can do with 'nice' on the unix cmdline. I don't know that
we'd need to add explicit support from inside the compiler.
> Still, not all multi-processor machines are built alike. Allowing a
> manual override for the calculation in computeThreadCount would at
> least add a way to experiment.
On 18-Jul-08, at 12:21 PM, Toby Reyelts wrote:
>
> My wife's old pentium II has 768M RAM. I think that officially puts
> you outside of useful datapoints.
>
> We might need to put down some minimum recommended specs for running
> the compiler. Something like - "Your machine should be no more than
> seven years old and have more memory than comes in an MP3 player" ;)
>
Agreed, wholeheartedly. I'd rather have a fast compile on my machine
with more memory than worry about an old build machine. It'd have
more memory in it if I could get the damn thing past the bios screen
with more than one stick of RAM. ;)
Matt.
Indeed. Scott's argument seems pretty good for why this should not
happen. And yet it does. :)
> Also, when the "Out of memory; will retry permutation using fewer
> concurrent threads" occurs, do you call System.gc() to try and free up some
> memory ?
That should be irrelevant. The JVM should do its own System.gc()
before throwing an OOM, and I've never seen evidence that it would not
do so.
-Lex
I have reviewed the code now and think one aspect of it should be
looked at. More on that below.
Before getting to that, I have two questions that can hopefully be
brushed off but that might be cause for concern:
First, is it now possible, when findEntryPoints() gets entry points
via deferred binding, that users can get multiple entries? It looks like
the answer before-patch was no, but after-patch is yes. Before, there
was a rebind call that generated a single class name. Now, there is a
rebind call that generates a list of class names, and they are all
processed. What is going on there? I am not sure to begin with how a
rebind can get multiple results, but furthermore I don't see why to
change behavior in this patch.
Less importantly, does method tracing still work? I see a lot of
System.out.println's. As an expedient fixup, if there really is a
break, it would seem fine to simply force a sequential mode if tracing
is on. It is not necessary to write all the code that would be
necessary to serialize the logs after all the workers are done.
Aside from these questions, my only significant concern is about the
mechanics of farming out jobs to workers and accumulating their
answers. There are a number of little issues I see, enough that it
seems worth another iteration on this area. The issues are:
1. Some posters have suggested using an existing Java thread pool.
Nice idea, although I don't know offhand if there is an out-of-the-box
solution that can handle the out-of-memory condition as desired.
It seems worth a look.
2. I think OutOfMemory handling looks good for the most likely cases, but
I am unsure how well it's handled in less common cases.
First, if it happens, nextPerm is set to perm.number, possibly causing
intermediate permutations to be computed more than once. Second, I
am not 100% sure it is guaranteed that even one worker will survive.
Isn't it possible that an OOM will be sent to one worker while
another worker is still handling its own OOM? For this reason,
before bailing entirely, it would be nice to go through the cycle
once of going to zero workers, starting one worker, and then that
one really does die in isolation. The third problem is that there
is no guarantee that I can see that a worker will be killed in favor
of the manager thread. It would be very nice if there were some
way to tell the JVM which threads to kill first. Does anyone know
JVM threading well enough to say if this is possible?
3. As a gut reaction, PermutationManager has more synchronized
methods than it should need. For example, hasNextPermutation() is
exposed as a public synchronized method, but I don't think it should
be. In general, getting synchronization right can be mind bending,
so the fewer places it happens the better. See below for one idea.
4. The deserialization of ASTs is currently serialized by waiting on
myLockObject. Instead, it would be faster and no more complex to
say that it's worker 0 on its first deserialize that gets to reuse
the original AST. Then, myLockObject could even be removed.
5. GWTCompiler.compilePermutations catches and ignores
InterruptedExceptions. If I read the API docs correctly, such an
exception means that the *current* thread has been interrupted. In
that case, I don't think continuing to loop is the right thing. I'm
not sure, honestly, what you *are* supposed to do, but it would seem
nicer to give an explicit ICE.
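For point 5, the conventional handling would be something like this sketch (the helper name is invented) - either restore the interrupt status, as here, or rethrow wrapped in an internal compiler error:

```java
public class InterruptSketch {
  // Sketch: don't swallow InterruptedException and keep looping.
  static boolean joinQuietly(Thread worker) {
    try {
      worker.join();
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }
}
```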
Again, none of these is huge, but they seem enough to merit another
iteration on this aspect of the patch. Two ideas to address them:
1. Have the PermutationManager and PermutationWorker communicate via
queues? There could be one queue of permutations needing work and
one queue to hold the results. An individual result would be a small
class that holds a permutation plus an enum of success/error/OOM.
This strategy should solve a few problems at once. It improves
parallelization by reducing the synchronization point, it simplifies
the handling of failures (just put failed permutations back on the
work queue). Also, the workers could have only references to these
queues instead of having a reference to the whole manager, thus
really limiting the number of coordination points between threads.
2. I started to say, drop out of memory handling for now. I'm going
back and forth on it. It looks fine to me to drop it so long as
there is some way for users to force a single-thread compile.
However, that would appear to require a flag or at least a
well-known environment variable. As a second choice, maybe if any
OOM happens at all then shut down all workers, wait for them to
finish, and then switch to sequential mode, but this is rather
complicated.
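Idea 1 might be sketched like so - all names invented, and a real worker would compile a permutation (and report error/OOM outcomes) rather than stub it:

```java
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
  enum Status { SUCCESS, ERROR, OUT_OF_MEMORY }

  // A small result record: a permutation plus its outcome.
  static final class Result {
    final int permutation;
    final Status status;
    Result(int permutation, Status status) {
      this.permutation = permutation;
      this.status = status;
    }
  }

  // Workers see only the two queues, never the manager, so the queues are
  // the sole coordination points between threads.
  static void runWorkers(BlockingQueue<Integer> work,
                         BlockingQueue<Result> results, int workers) {
    Thread[] pool = new Thread[workers];
    for (int i = 0; i < workers; i++) {
      pool[i] = new Thread(new Runnable() {
        public void run() {
          Integer perm;
          while ((perm = work.poll()) != null) {
            // A real worker would compile the permutation here; on OOM it
            // would put the permutation back on the work queue instead.
            results.add(new Result(perm, Status.SUCCESS));
          }
        }
      });
      pool[i].start();
    }
    for (Thread t : pool) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
```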
Nits:
JavaToJavaScriptCompiler.createReboundModuleLoad -- Isn't "resultType" a
type corresponding to "mainClassName"? If so, a name like "mainType"
looks clearer.
JavaToJavaScriptCompiler.compile: InternalCompilerException looks more
canonical than RuntimeException.
IMHO, PermutationManager and PermutationWorker are large and important
enough to merit at least one top-level class. In addition, making
PermutationWorker a top-level class will remove a bunch of instance
variables from its scope, thus giving an additional check that it is
not accessing more shared state than intended.
PermutationManager.failed() would be better named getFailed().
The copyright year in the header comments is not always updated.
Again, good stuff. Let's just take a second look at the actual
mechanics of the parallelization.
Lex Spoon
You can have complete control over thread creation by passing a
ThreadFactory and/or subclassing AbstractExecutorService or
ThreadPoolExecutor. I'm not sure what else is needed.
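For example, something like this sketch (the pool helper and thread-name prefix are invented):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class WorkerPoolSketch {
  // A fixed-size pool whose ThreadFactory controls how worker threads
  // are constructed (name, daemon flag, priority, and so on).
  static ExecutorService newWorkerPool(int threads) {
    ThreadFactory factory = new ThreadFactory() {
      private int count = 0;
      public synchronized Thread newThread(Runnable r) {
        Thread t = new Thread(r, "permutation-worker-" + count++);
        t.setDaemon(true); // workers won't keep the JVM alive on exit
        return t;
      }
    };
    return Executors.newFixedThreadPool(threads, factory);
  }
}
```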
-Ray
> 1. Have the PermutationManager and PermutationWorker communicate via
> queues? There could be one queue of permutations needing work and
> one queue to hold the results. An individual result would be a small
> class that holds a permutation plus an enum of success/error/OOM.
> This strategy should solve a few problems at once. It improves
> parallelization by reducing the synchronization point, it simplifies
> the handling of failures (just put failed permutations back on the
> work queue). Also, the workers could have only references to these
> queues instead of having a reference to the whole manager, thus
> really limiting the number of coordination points between threads.

This seems promising. Maybe a follow-up patch once the original is in?

> 2. I started to say, drop out of memory handling for now. I'm going
> back and forth on it. It looks fine to me to drop it so long as
> there is some way for users to force a single-thread compile.
> However, that would appear to require a flag or at least a
> well-known environment variable. As a second choice, maybe if any
> OOM happens at all then shut down all workers, wait for them to
> finish, and then switch to sequential mode, but this is rather
> complicated.

All possibilities. In an ideal world, it would simply do the right thing and the user would get optimal behavior without having to touch any levers. I think this needs to run in the wild to see how well we hit that goal and then iterate.
I find concurrency bugs terrifying. I think we all should.
Concurrency bugs cause spurious hangs on users' desks that cannot be
replicated in house. Normal testing doesn't catch them, so your only
defense is to write perfect code. How scary is that? To improve our
chances, we can be defensive by avoiding shared state and by avoiding
synchronous cross-thread communication. This may sound awkward, but I
have found in practice that the only cost is in extra upfront design.
Once the general arrangement is decided, the code tends to be no more
complex than a shared state and locks approach.
Aside from concurrency, the out-of-memory handling is another possible
source of instability. While I agree we need to see live data
before tuning, we should go ahead and remove possible instability
problems. My favorite solution would be a Java property limiting the
number of threads, because we can do it now and it's simple to
implement. By contrast, the built-in OOM handling seems to have a lot
of questions and there's not a thorough answer on the table yet.
Particularly crippling is that we don't have a strong reason to
believe the dispatch thread won't get an OOM exception when memory
gets tight. If we had a way to be sure of that, then I think we could
work out something that robustly works so long as a 1-thread compile
can succeed.
On the concurrency issues:
>> 3. As a gut reaction, PermutationManager has more synchronized
>> methods than it should need. For example, hasNextPermutation() is
>> exposed as a public synchronized method, but I don't think it should
>> be. In general, getting synchronization right can be mind bending,
>> so the fewer places it happens the better. See below for one idea.
>
> In practice, it shouldn't matter since most of them will be very quick
> relative to the compiler.
This is an example of what I mean by being optimistic about
concurrency bugs. While I didn't see any particular bug in this code,
I propose that we be more defensive.
>> 4. The deserialization of ASTs is currently serialized by waiting on
>> myLockObject. Instead, it would be faster and no more complex to
>> say that it's worker 0 on its first deserialize that gets to reuse
>> the original AST. Then, myLockObject could even be removed.
>
> But then, it'd be much easier to get multiple simultaneous OOMs. And I
> don't want to complicate the API of JJS, there are some non-GWT callers,
> believe it or not (IDE integration).
In this case, there is no outright bug, but I don't understand why we
would pick the slower approach. The API would be unaffected by the
choice of approach, and the memory usage would be within dozens of
bytes of each other. The substantive differences are that one
solution runs faster and has one fewer synchronized block.
>> 5. GWTCompiler.compilePermutations catches and ignores
>> InterruptedException's. If I read the API docs correctly, such an
>> exception means that the *current* thread has been interrupted. In
>> that case, I don't think continuing to loop is the right thing. I'm
>> not sure, honestly, what you *are* supposed to do, but it would seem
>> nicer to give an explicit ICE.
>
> You have to catch the checked exception; in practice the main thread should
> never get interrupted, but I think I read in Effective Java that it could
> happen spuriously and retrying the wait was the correct solution.
It depends on which kind of interrupt it is. I agree that it's weird
for there to be a checked exception for this situation, as opposed to
the one you described, but that's my read of the Javadocs.
The non-concurrency issues all sound great:
> The behavior *should* be the same as before. The major confusing thing is
> that you have to figure out all possible entry points up front, instead of doing
> a single deferred binding. This is because you have to know the possible entry
> points to begin optimizing.
Excellent, gotcha. It's just like with JGwtCreate.
>> JavaToJavaScriptCompiler.createReboundModuleLoad -- Isn't "resultType" a
>> type corresponding to "mainClassName"? If so, a name like "mainType"
>> looks clearer.
>
> They're not identical; the first is actually a deferred binding result on
> the mainType. I renamed these to "reboundEntryType" and
> "originalMainClassName" for clarity.
Cool.
>> JavaToJavaScriptCompiler.compile: InternalCompilerException looks more
>> canonical than RuntimeException.
>
> Deeper in the compile process, I'd agree, but I think we're at a high enough
> level where this doesn't really involve the compiler at all; it should
> literally be impossible to get this exception since you can't have
> previously serialized a class that isn't loaded and available.
Okay.
>> IMHO, PermutationManager and PermutationWorker are large and important
>> enough to merit at least one top-level class. In addition, making
>> PermutationWorker a top-level class will remove a bunch of instance
>> variables from its scope, thus giving an additional check that it is
>> not accessing more shared state than intended.
>
> I can't argue with you, but I would ask if it's good enough that I can hit
> this in a second pass when I rethink the concurrency stuff.
Yes, that makes sense.
-Lex