Hi guys, Could you please review this patch for trunk? It makes the compiler go much faster when you have:
a) Several permutations b) More than one processor
Actually, it even helps a little bit with several permutations and only one processor.
Here's the basic design:
1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved deferred binding decisions as explicit nodes 2) Do optimizations on this tree. 3) Serialize into a byte buffer (thanks to Bruce for this tasty suggestion) 4) Crank up N worker threads 5) Each worker thread: a) Grabs a particular permutation and deserializes its own copy of the unified Java AST b) Resolves all deferred binding for that particular permutation c) Optimizes the permutation with the new knowledge of deferred binding decisions 6) Profit!
Some notes: - Doing step 2 up front on the unified tree actually realizes some performance gains by making step 5c take less time in each permutation, even when you have only one worker thread - The number of worker threads created is Math.min(numberOfMachineProcessors, estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, there's no point creating extra threads if you don't have the memory for them, or you can't actually utilize the hardware. Permutation optimization barely touches the I/O, it's all CPU bound. - Making the Java and JS ASTs serializable was surprisingly easy.
Toby's about to email around some results from testing Showcase on his phatty 4-processor machine, if he hasn't already done so.
Scott, I'm kind of surprised to see all the code the hand manage a worker pool of threads. Why did you not use a fixed thread pool from java.util.concurrency?
On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote: > Hi guys, > Could you please review this patch for trunk? It makes the compiler go > much faster when you have:
> a) Several permutations > b) More than one processor
> Actually, it even helps a little bit with several permutations and only one > processor.
> Here's the basic design:
> 1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved > deferred binding decisions as explicit nodes > 2) Do optimizations on this tree. > 3) Serialize into a byte buffer (thanks to Bruce for this tasty suggestion) > 4) Crank up N worker threads > 5) Each worker thread: > a) Grabs a particular permutation and deserializes its own copy of the > unified Java AST > b) Resolves all deferred binding for that particular permutation > c) Optimizes the permutation with the new knowledge of deferred binding > decisions > 6) Profit!
> Some notes: > - Doing step 2 up front on the unified tree actually realizes some > performance gains by making step 5c take less time in each permutation, even > when you have only one worker thread > - The number of worker threads created is > Math.min(numberOfMachineProcessors, > estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, > there's no point creating extra threads if you don't have the memory for > them, or you can't actually utilize the hardware. Permutation optimization > barely touches the I/O, it's all CPU bound. > - Making the Java and JS ASTs serializable was surprisingly easy.
> Toby's about to email around some results from testing Showcase on his > phatty 4-processor machine, if he hasn't already done so.
> Thanks! > Scott
-- If you received this communication by mistake, you are entitled to one free ice cream cone on me. Simply print out this email including all relevant SMTP headers and present them at my desk to claim your creamy treat. We'll have a laugh at my emailing incompetence, and play a game of ping pong. (offer may not be valid in all States).
I ran this on my desktop - 2.4 GHz Intel Core Duo2 (2 cores) and noted Showcase compile time from 02:02 down to 01:19.
I noted that interactive performance of using my desktop did not suffer while the build was going. Still, I was thinking that it might be wise to set some way (Java Property?) in computeThreadCount() to throttle the number of threads. Someone running an automated build on a shared machine might not like the behavior of grabbing every CPU on the system. I did a web search and didn't see a JVM argument that might artificially lower the # of processors.
On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote: > Hi guys, > Could you please review this patch for trunk? It makes the compiler go much > faster when you have: > a) Several permutations > b) More than one processor > Actually, it even helps a little bit with several permutations and only one > processor. > Here's the basic design: > 1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved > deferred binding decisions as explicit nodes > 2) Do optimizations on this tree. > 3) Serialize into a byte buffer (thanks to Bruce for this tasty suggestion) > 4) Crank up N worker threads > 5) Each worker thread: > a) Grabs a particular permutation and deserializes its own copy of the > unified Java AST > b) Resolves all deferred binding for that particular permutation > c) Optimizes the permutation with the new knowledge of deferred binding > decisions > 6) Profit! > Some notes: > - Doing step 2 up front on the unified tree actually realizes some > performance gains by making step 5c take less time in each permutation, even > when you have only one worker thread > - The number of worker threads created is > Math.min(numberOfMachineProcessors, > estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, > there's no point creating extra threads if you don't have the memory for > them, or you can't actually utilize the hardware. Permutation optimization > barely touches the I/O, it's all CPU bound. > - Making the Java and JS ASTs serializable was surprisingly easy. > Toby's about to email around some results from testing Showcase on his > phatty 4-processor machine, if he hasn't already done so. > Thanks! > Scott
---------- Forwarded message ---------- From: Toby Reyelts <to...@google.com> Date: Fri, Jul 18, 2008 at 12:20 PM Subject: Some perf numbers on Scott's parallel compilation improvements
Hi guys,
I sat down with Scott and ran some compilations of Showcase and Dynatable on my four core machine. All compilations were run with JDK 1.6, -Xmx=1G -Xms=1G, and specific compilations were run with -server versus -client, where noted. Showcase got about 60% faster for the fastest case, and DynaTable got about 33% faster.
If you look closely at the numbers, you can see that, even for the single-threaded compiles in trunk, the JVM is taking advantage of multiple processors. In particular, you can see that using -server causes a significant amount of JIT compiling to occur on a background thread. The numbers for Showcase make it astoundingly obvious that -server can have some huge performance affects.
Executive Summary:
1) Nice patch Scott! 2) Use a multi-core machine even if you're not using Scott's patch 3) Use -server !!
Trunk
Showcase (server) real 2m46.133s user 3m42.396s sys 0m3.604s
(client) real 3m43.187s user 3m41.401s sys 0m1.716s
DynaTable (server) real 0m23.787s user 0m48.124s sys 0m0.748s
(client) real 0m21.779s user 0m22.541s sys 0m1.471s
Scott's patch
Showcase (server) real 1m6.743s user 3m18.468s sys 0m2.254s
(client) real 1m16.876s user 2m48.144s sys 0m1.850s
DynaTable (server) real 0m18.870s user 0m49.098s sys 0m0.753s
(client) real 0m14.637s user 0m21.867s sys 0m0.535s
Because I have no idea what I'm doing and just did the simplest thing I could think of to get it working. :) Seriously, I would love suggestions on how to do this better. I'd also love for you to get your butt in to work so we can try it on your 8-core monster box. :)
On Fri, Jul 18, 2008 at 12:17 PM, Kelly Norton <knor...@google.com> wrote: > Scott, > I'm kind of surprised to see all the code the hand manage a worker pool of > threads. Why did you not use a fixed thread pool from java.util.concurrency?
> /kel
> On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote:
>> Hi guys, >> Could you please review this patch for trunk? It makes the compiler go >> much faster when you have:
>> a) Several permutations >> b) More than one processor
>> Actually, it even helps a little bit with several permutations and only >> one processor.
>> Here's the basic design:
>> 1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved >> deferred binding decisions as explicit nodes >> 2) Do optimizations on this tree. >> 3) Serialize into a byte buffer (thanks to Bruce for this tasty >> suggestion) >> 4) Crank up N worker threads >> 5) Each worker thread: >> a) Grabs a particular permutation and deserializes its own copy of the >> unified Java AST >> b) Resolves all deferred binding for that particular permutation >> c) Optimizes the permutation with the new knowledge of deferred binding >> decisions >> 6) Profit!
>> Some notes: >> - Doing step 2 up front on the unified tree actually realizes some >> performance gains by making step 5c take less time in each permutation, even >> when you have only one worker thread >> - The number of worker threads created is >> Math.min(numberOfMachineProcessors, >> estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, >> there's no point creating extra threads if you don't have the memory for >> them, or you can't actually utilize the hardware. Permutation optimization >> barely touches the I/O, it's all CPU bound. >> - Making the Java and JS ASTs serializable was surprisingly easy.
>> Toby's about to email around some results from testing Showcase on his >> phatty 4-processor machine, if he hasn't already done so.
>> Thanks! >> Scott
> -- > If you received this communication by mistake, you are entitled to one free > ice cream cone on me. Simply print out this email including all relevant > SMTP headers and present them at my desk to claim your creamy treat. We'll > have a laugh at my emailing incompetence, and play a game of ping pong. > (offer may not be valid in all States).
Scott, First of all, kickass job. The code you are looking for is probably Executors.newFixedThreadPool(nThreads).
I have a question tho, is deserializing bytebuffers as fast as using a deep tree clone method? I'm not an expert in Java serialization, but I always had the impression it was not the greatest performer. I would have thought that a copy() method on every JNode subclass would have been faster, especially in cases where it is effectively final (not overriden after HotSpot type-tightens a callsite) and can be inlined by HotSpot.
On Fri, Jul 18, 2008 at 9:55 AM, Scott Blum <sco...@google.com> wrote: > Because I have no idea what I'm doing and just did the simplest thing I > could think of to get it working. :) > Seriously, I would love suggestions on how to do this better. I'd also love > for you to get your butt in to work so we can try it on your 8-core monster > box. :)
> On Fri, Jul 18, 2008 at 12:17 PM, Kelly Norton <knor...@google.com> wrote:
>> Scott, >> I'm kind of surprised to see all the code the hand manage a worker pool of >> threads. Why did you not use a fixed thread pool from java.util.concurrency? >> /kel
>> On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote:
>>> Hi guys, >>> Could you please review this patch for trunk? It makes the compiler go >>> much faster when you have: >>> a) Several permutations >>> b) More than one processor >>> Actually, it even helps a little bit with several permutations and only >>> one processor. >>> Here's the basic design: >>> 1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved >>> deferred binding decisions as explicit nodes >>> 2) Do optimizations on this tree. >>> 3) Serialize into a byte buffer (thanks to Bruce for this tasty >>> suggestion) >>> 4) Crank up N worker threads >>> 5) Each worker thread: >>> a) Grabs a particular permutation and deserializes its own copy of the >>> unified Java AST >>> b) Resolves all deferred binding for that particular permutation >>> c) Optimizes the permutation with the new knowledge of deferred binding >>> decisions >>> 6) Profit! >>> Some notes: >>> - Doing step 2 up front on the unified tree actually realizes some >>> performance gains by making step 5c take less time in each permutation, even >>> when you have only one worker thread >>> - The number of worker threads created is >>> Math.min(numberOfMachineProcessors, >>> estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, >>> there's no point creating extra threads if you don't have the memory for >>> them, or you can't actually utilize the hardware. Permutation optimization >>> barely touches the I/O, it's all CPU bound. >>> - Making the Java and JS ASTs serializable was surprisingly easy. >>> Toby's about to email around some results from testing Showcase on his >>> phatty 4-processor machine, if he hasn't already done so. >>> Thanks! >>> Scott
>> -- >> If you received this communication by mistake, you are entitled to one >> free ice cream cone on me. Simply print out this email including all >> relevant SMTP headers and present them at my desk to claim your creamy >> treat. We'll have a laugh at my emailing incompetence, and play a game of >> ping pong. (offer may not be valid in all States).
Wow, awesome patch. This is going to be fantastic for dual-core boxes.
Is there a lot of time spent serializing/deserializing? If so, it might be useful to try out JBoss Serialization as a drop-in replacement. It's quite a bit faster for serializing large trees, but (at least in my experience) increases the output size by a few percent (< 10%).
You could also serialize to a MappedByteBuffer as well to take advantage of platform mmap() and allow the OS to page the cache out if necessary. Our current build server is memory constrained and pages a fair bit during the compile.
> Could you please review this patch for trunk? It makes the compiler > go much faster when you have:
> a) Several permutations > b) More than one processor
> Actually, it even helps a little bit with several permutations and > only one processor.
> Here's the basic design:
> 1) Build a unified Java AST from Eclipse's JDT tree; represent > unresolved deferred binding decisions as explicit nodes > 2) Do optimizations on this tree. > 3) Serialize into a byte buffer (thanks to Bruce for this tasty > suggestion) > 4) Crank up N worker threads > 5) Each worker thread: > a) Grabs a particular permutation and deserializes its own copy of > the unified Java AST > b) Resolves all deferred binding for that particular permutation > c) Optimizes the permutation with the new knowledge of deferred > binding decisions > 6) Profit!
> Some notes: > - Doing step 2 up front on the unified tree actually realizes some > performance gains by making step 5c take less time in each > permutation, even when you have only one worker thread > - The number of worker threads created is > Math.min(numberOfMachineProcessors, > estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other > words, there's no point creating extra threads if you don't have the > memory for them, or you can't actually utilize the hardware. > Permutation optimization barely touches the I/O, it's all CPU bound. > - Making the Java and JS ASTs serializable was surprisingly easy.
> Toby's about to email around some results from testing Showcase on > his phatty 4-processor machine, if he hasn't already done so.
On Fri, Jul 18, 2008 at 12:31 PM, Eric Ayers <zun...@google.com> wrote: > I noted that interactive performance of using my desktop did not > suffer while the build was going. Still, I was thinking that it might > be wise to set some way (Java Property?) in computeThreadCount() to > throttle the number of threads. Someone running an automated build on > a shared machine might not like the behavior of grabbing every CPU on > the system. I did a web search and didn't see a JVM argument that > might artificially lower the # of processors.
Actually, I would rather have the same number of threads but set the OS priority such that interactive tasks will get preference.
-- John A. Tamplin Software Engineer (GWT), Google
On Fri, Jul 18, 2008 at 2:07 PM, John Tamplin <j...@google.com> wrote: > Actually, I would rather have the same number of threads but set the OS > priority such that interactive tasks will get preference.
Linux typically does that anyway, if I'm not mistaken. Do other platforms? (if they don't by default, is there some way to make them?)
On Fri, Jul 18, 2008 at 1:53 PM, Matthew Mastracci <matt...@mastracci.com> wrote:
> Wow, awesome patch. This is going to be fantastic for dual-core boxes.
> Is there a lot of time spent serializing/deserializing? If so, it > might be useful to try out JBoss Serialization as a drop-in > replacement. It's quite a bit faster for serializing large trees, but > (at least in my experience) increases the output size by a few percent > (< 10%).
> You could also serialize to a MappedByteBuffer as well to take > advantage of platform mmap() and allow the OS to page the cache out if > necessary. Our current build server is memory constrained and pages a > fair bit during the compile.
I'm very surprised by this. Any statistics you could give us here?
Scott's patch checks the free amount of available memory, so if you don't set the JVM's max memory high enough, you won't get a parallel build at all.
> > Could you please review this patch for trunk? It makes the compiler > > go much faster when you have:
> > a) Several permutations > > b) More than one processor
> > Actually, it even helps a little bit with several permutations and > > only one processor.
> > Here's the basic design:
> > 1) Build a unified Java AST from Eclipse's JDT tree; represent > > unresolved deferred binding decisions as explicit nodes > > 2) Do optimizations on this tree. > > 3) Serialize into a byte buffer (thanks to Bruce for this tasty > > suggestion) > > 4) Crank up N worker threads > > 5) Each worker thread: > > a) Grabs a particular permutation and deserializes its own copy of > > the unified Java AST > > b) Resolves all deferred binding for that particular permutation > > c) Optimizes the permutation with the new knowledge of deferred > > binding decisions > > 6) Profit!
> > Some notes: > > - Doing step 2 up front on the unified tree actually realizes some > > performance gains by making step 5c take less time in each > > permutation, even when you have only one worker thread > > - The number of worker threads created is > > Math.min(numberOfMachineProcessors, > > estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other > > words, there's no point creating extra threads if you don't have the > > memory for them, or you can't actually utilize the hardware. > > Permutation optimization barely touches the I/O, it's all CPU bound. > > - Making the Java and JS ASTs serializable was surprisingly easy.
> > Toby's about to email around some results from testing Showcase on > > his phatty 4-processor machine, if he hasn't already done so.
On Fri, Jul 18, 2008 at 12:55 PM, Scott Blum <sco...@google.com> wrote: > Because I have no idea what I'm doing and just did the simplest thing I > could think of to get it working. :) > Seriously, I would love suggestions on how to do this better. I'd also > love for you to get your butt in to work so we can try it on your 8-core > monster box. :)
> On Fri, Jul 18, 2008 at 12:17 PM, Kelly Norton <knor...@google.com> wrote:
>> Scott, >> I'm kind of surprised to see all the code the hand manage a worker pool of >> threads. Why did you not use a fixed thread pool from java.util.concurrency?
>> /kel
>> On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote:
>>> Hi guys, >>> Could you please review this patch for trunk? It makes the compiler go >>> much faster when you have:
>>> a) Several permutations >>> b) More than one processor
>>> Actually, it even helps a little bit with several permutations and only >>> one processor.
>>> Here's the basic design:
>>> 1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved >>> deferred binding decisions as explicit nodes >>> 2) Do optimizations on this tree. >>> 3) Serialize into a byte buffer (thanks to Bruce for this tasty >>> suggestion) >>> 4) Crank up N worker threads >>> 5) Each worker thread: >>> a) Grabs a particular permutation and deserializes its own copy of the >>> unified Java AST >>> b) Resolves all deferred binding for that particular permutation >>> c) Optimizes the permutation with the new knowledge of deferred binding >>> decisions >>> 6) Profit!
>>> Some notes: >>> - Doing step 2 up front on the unified tree actually realizes some >>> performance gains by making step 5c take less time in each permutation, even >>> when you have only one worker thread >>> - The number of worker threads created is >>> Math.min(numberOfMachineProcessors, >>> estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, >>> there's no point creating extra threads if you don't have the memory for >>> them, or you can't actually utilize the hardware. Permutation optimization >>> barely touches the I/O, it's all CPU bound. >>> - Making the Java and JS ASTs serializable was surprisingly easy.
>>> Toby's about to email around some results from testing Showcase on his >>> phatty 4-processor machine, if he hasn't already done so.
>>> Thanks! >>> Scott
>> -- >> If you received this communication by mistake, you are entitled to one >> free ice cream cone on me. Simply print out this email including all >> relevant SMTP headers and present them at my desk to claim your creamy >> treat. We'll have a laugh at my emailing incompetence, and play a game of >> ping pong. (offer may not be valid in all States).
-- If you received this communication by mistake, you are entitled to one free ice cream cone on me. Simply print out this email including all relevant SMTP headers and present them at my desk to claim your creamy treat. We'll have a laugh at my emailing incompetence, and play a game of ping pong. (offer may not be valid in all States).
Multicore Intel solutions have a front-side bus bandwidth constraint, so you generally have trouble keeping them busy if they don't have a really high L1/L2 hit rate. I don't know if this is related to what you saw or not, just a data point.
-- John A. Tamplin Software Engineer (GWT), Google
That we can do with 'nice' on the unix cmdline. I don't know that we'd need to add explicit support from inside the compiler.
Still, not all mutl-processor machines are built alike. allowing a manual override for the calculation in compteThreadCount would at least add a way to experiement.
On Fri, Jul 18, 2008 at 2:07 PM, John Tamplin <j...@google.com> wrote: > On Fri, Jul 18, 2008 at 12:31 PM, Eric Ayers <zun...@google.com> wrote:
>> I noted that interactive performance of using my desktop did not >> suffer while the build was going. Still, I was thinking that it might >> be wise to set some way (Java Property?) in computeThreadCount() to >> throttle the number of threads. Someone running an automated build on >> a shared machine might not like the behavior of grabbing every CPU on >> the system. I did a web search and didn't see a JVM argument that >> might artificially lower the # of processors.
> Actually, I would rather have the same number of threads but set the OS > priority such that interactive tasks will get preference.
> -- > John A. Tamplin > Software Engineer (GWT), Google
4-core Mac Pro, 4x2.66Ghz, 6Gb of RAM, compiling Showcase
real 0m56.745s user 2m43.539s sys 0m9.241s
I used the following JVM args: /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Commands/java -server -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=40 -XX:+AggressiveOpts -XstartOnFirstThread -Xmx1024m -Xms1024m
On Fri, Jul 18, 2008 at 11:22 AM, Kelly Norton <knor...@google.com> wrote: > My box is only a mere 4 cores. > 2 x 2.66GHz Dual Core Intel Xeon > And for all that hardware, there is little difference from what Eric saw. > Shouldn't I get a partial refund from Apple? > Showcase: > Before: (0:02:04.754) > After: (0:01:03.398) > /kel > On Fri, Jul 18, 2008 at 12:55 PM, Scott Blum <sco...@google.com> wrote:
>> Because I have no idea what I'm doing and just did the simplest thing I >> could think of to get it working. :) >> Seriously, I would love suggestions on how to do this better. I'd also >> love for you to get your butt in to work so we can try it on your 8-core >> monster box. :)
>> On Fri, Jul 18, 2008 at 12:17 PM, Kelly Norton <knor...@google.com> wrote:
>>> Scott, >>> I'm kind of surprised to see all the code the hand manage a worker pool >>> of threads. Why did you not use a fixed thread pool from >>> java.util.concurrency? >>> /kel
>>> On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote:
>>>> Hi guys, >>>> Could you please review this patch for trunk? It makes the compiler go >>>> much faster when you have: >>>> a) Several permutations >>>> b) More than one processor >>>> Actually, it even helps a little bit with several permutations and only >>>> one processor. >>>> Here's the basic design: >>>> 1) Build a unified Java AST from Eclipse's JDT tree; represent >>>> unresolved deferred binding decisions as explicit nodes >>>> 2) Do optimizations on this tree. >>>> 3) Serialize into a byte buffer (thanks to Bruce for this tasty >>>> suggestion) >>>> 4) Crank up N worker threads >>>> 5) Each worker thread: >>>> a) Grabs a particular permutation and deserializes its own copy of the >>>> unified Java AST >>>> b) Resolves all deferred binding for that particular permutation >>>> c) Optimizes the permutation with the new knowledge of deferred >>>> binding decisions >>>> 6) Profit! >>>> Some notes: >>>> - Doing step 2 up front on the unified tree actually realizes some >>>> performance gains by making step 5c take less time in each permutation, even >>>> when you have only one worker thread >>>> - The number of worker threads created is >>>> Math.min(numberOfMachineProcessors, >>>> estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, >>>> there's no point creating extra threads if you don't have the memory for >>>> them, or you can't actually utilize the hardware. Permutation optimization >>>> barely touches the I/O, it's all CPU bound. >>>> - Making the Java and JS ASTs serializable was surprisingly easy. >>>> Toby's about to email around some results from testing Showcase on his >>>> phatty 4-processor machine, if he hasn't already done so. >>>> Thanks! >>>> Scott
>>> -- >>> If you received this communication by mistake, you are entitled to one >>> free ice cream cone on me. Simply print out this email including all >>> relevant SMTP headers and present them at my desk to claim your creamy >>> treat. We'll have a laugh at my emailing incompetence, and play a game of >>> ping pong. (offer may not be valid in all States).
> -- > If you received this communication by mistake, you are entitled to one free > ice cream cone on me. Simply print out this email including all relevant > SMTP headers and present them at my desk to claim your creamy treat. We'll > have a laugh at my emailing incompetence, and play a game of ping pong. > (offer may not be valid in all States).
> You could also serialize to a MappedByteBuffer as well to take > advantage of platform mmap() and allow the OS to page the cache out if > necessary. Our current build server is memory constrained and pages a > fair bit during the compile.
> I'm very surprised by this. Any statistics you could give us here?
> Scott's patch checks the free amount of available memory, so if you > don't set the JVM's max memory high enough, you won't get a parallel > build at all.
I'll try to dig up concrete numbers later today, but our build server is fairly old and only has a 1/2 GB or RAM. With the stock GWT 1.5 compiler, I get OOM exceptions (I have to force it to allocate a 512MB heap to avoid this). It's not a powerhouse by far and our current AST as-is seems to occupy all of physical RAM. :)
Strangely enough, adding -server to the JVM on this old machine made the compile+test *slower* (from 20 minutes to 36 minutes).
This build machine is due to be replaced at some point (sooner rather than later), so this might not be a useful datapoint.
On Fri, Jul 18, 2008 at 2:47 PM, Eric Ayers <zun...@google.com> wrote: > That we can do with 'nice' on the unix cmdline. I don't know that > we'd need to add explicit support from inside the compiler.
> Still, not all mutl-processor machines are built alike. allowing a > manual override for the calculation in compteThreadCount would at > least add a way to experiement.
That is harder to do when building from inside an IDE. Yes, you can go look up the process ids and renice them, but then you are spending more time fooling with it than you should have to. I agree, if I am running ./FooApp-compile it is easy enough to nice it but I would like some option from an IDE launch config.
-- John A. Tamplin Software Engineer (GWT), Google
> > You could also serialize to a MappedByteBuffer as well to take > > advantage of platform mmap() and allow the OS to page the cache out if > > necessary. Our current build server is memory constrained and pages a > > fair bit during the compile.
> > I'm very surprised by this. Any statistics you could give us here?
> > Scott's patch checks the free amount of available memory, so if you > > don't set the JVM's max memory high enough, you won't get a parallel > > build at all.
> I'll try to dig up concrete numbers later today, but our build server > is fairly old and only has a 1/2 GB or RAM. With the stock GWT 1.5 > compiler, I get OOM exceptions (I have to force it to allocate a 512MB > heap to avoid this).
If you're running with a 1.5 JVM, the default heap size is very low (something along the lines of 64M), so this is no surprise. Unlike native apps which will just run away with all of your machine's memory, Java apps are bounded by default :)
Java 1.6 makes this better, in that it will attempt to set the max heap size based on a percentage of your machine's total memory.
It's not a powerhouse by far and our current AST
> as-is seems to occupy all of physical RAM. :)
> Strangely enough, adding -server to the JVM on this old machine made > the compile+test *slower* (from 20 minutes to 36 minutes).
If you're actually paging (which wouldn't be surprising with only 512M RAM), then all bets are off about performance. We can probably find ways to reduce memory usage, but it hasn't been a priority since RAM is cheap and plentiful. If people are running out of heap space with reasonable amounts of physical memory, then we need to know about it.
This build machine is due to be replaced at some point (sooner rather
> than later), so this might not be a useful datapoint.
My wife's old pentium II has 768M RAM. I think that officially puts you outside of useful datapoints.
We might need to put down some minimum recommended specs for running the compiler. Something like - "Your machine should be no more than seven years old and have more memory than comes in an MP3 player" ;)
> My wife's old pentium II has 768M RAM. I think that officially puts > you outside of useful datapoints.
> We might need to put down some minimum recommended specs for running > the compiler. Something like - "Your machine should be no more than > seven years old and have more memory than comes in an MP3 player" ;)
Agreed, wholeheartedly. I'd rather have a fast compile on my machine with more memory than worry about an old build machine. It'd have more memory in it if I could get the damn thing past the bios screen with more than one stick of RAM. ;)
On Sat, Jul 19, 2008 at 2:01 AM, Scott Blum <sco...@google.com> wrote: > Hi guys, > Could you please review this patch for trunk? It makes the compiler go > much faster when you have:
> a) Several permutations > b) More than one processor
> Actually, it even helps a little bit with several permutations and only one > processor.
> Here's the basic design:
> 1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved > deferred binding decisions as explicit nodes > 2) Do optimizations on this tree. > 3) Serialize into a byte buffer (thanks to Bruce for this tasty suggestion) > 4) Crank up N worker threads > 5) Each worker thread: > a) Grabs a particular permutation and deserializes its own copy of the > unified Java AST > b) Resolves all deferred binding for that particular permutation > c) Optimizes the permutation with the new knowledge of deferred binding > decisions > 6) Profit!
> Some notes: > - Doing step 2 up front on the unified tree actually realizes some > performance gains by making step 5c take less time in each permutation, even > when you have only one worker thread > - The number of worker threads created is > Math.min(numberOfMachineProcessors, > estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, > there's no point creating extra threads if you don't have the memory for > them, or you can't actually utilize the hardware. Permutation optimization > barely touches the I/O, it's all CPU bound. > - Making the Java and JS ASTs serializable was surprisingly easy.
> Toby's about to email around some results from testing Showcase on his > phatty 4-processor machine, if he hasn't already done so.
If I build gwt samples using the default gwtc jvm arg -Xmx256m the following error occurs
If I increase this to 512m it is fine
gwtc: [java] Compiling module com.google.gwt.sample.showcase.Showcase [java] Compiling permutations [java] Analyzing permutation #3 [java] [WARN] Out of memory; will retry permutation using fewer concurrent threads [java] Analyzing permutation #13 [java] [WARN] Out of memory; will retry permutation using fewer concurrent threads [java] Analyzing permutation #11 [java] [ERROR] An internal compiler exception occurred [java] com.google.gwt.dev.jjs.InternalCompilerException: Unexpected error during visit. [java] at com.google.gwt.dev.jjs.ast.JVisitor.translateException(JVisitor.java:552) [java] at com.google.gwt.dev.jjs.ast.JVisitor.doTraverse(JVisitor.java:543) [java] at com.google.gwt.dev.jjs.ast.JVisitor.doAccept(JVisitor.java:529) [java] at com.google.gwt.dev.jjs.ast.JVisitor.accept(JVisitor.java:77) [java] at com.google.gwt.dev.jjs.ast.JProgram.traverse(JProgram.java:777) [java] at com.google.gwt.dev.jjs.ast.JVisitor.doTraverse(JVisitor.java:541) [java] at com.google.gwt.dev.jjs.ast.JVisitor.doAccept(JVisitor.java:523) [java] at com.google.gwt.dev.jjs.ast.JVisitor.accept(JVisitor.java:69) [java] at com.google.gwt.dev.jjs.impl.GenerateJavaScriptAST.execImpl(GenerateJavaScri ptAST.java:1854) [java] at com.google.gwt.dev.jjs.impl.GenerateJavaScriptAST.exec(GenerateJavaScriptAS T.java:1650) [java] at com.google.gwt.dev.jjs.JavaToJavaScriptCompiler.compile(JavaToJavaScriptCom piler.java:487) [java] at com.google.gwt.dev.GWTCompiler$PermutationWorker.run(GWTCompiler.java:347) [java] at java.lang.Thread.run(Thread.java:619) [java] Caused by: java.lang.OutOfMemoryError: Java heap space [java] at java.util.Arrays.copyOf(Arrays.java:2734) [java] at java.util.ArrayList.ensureCapacity(ArrayList.java:167) [java] at java.util.ArrayList.add(ArrayList.java:351) [java] at com.google.gwt.dev.jjs.impl.GenerateJavaScriptLiterals.popList(GenerateJava ScriptLiterals.java:114) [java] at com.google.gwt.dev.jjs.impl.GenerateJavaScriptAST$GenerateJavaScriptVisitor .endVisit(GenerateJavaScriptAST.java:495) [java] at com.google.gwt.dev.jjs.ast.JClassType.traverse(JClassType.java:64) [java] at com.google.gwt.dev.jjs.ast.JVisitor.doTraverse(JVisitor.java:541) [java] ... 11 more [java] [ERROR] Out of memory; to increase the amount of memory, use the -Xmx flag at startup (java -Xmx128M ...) [java] [ERROR] at RichTextToolbar.java(40): final class RichTextToolbar extends Composite [java] com.google.gwt.dev.jjs.ast.JClassType [java] [ERROR] <no source info>: <JProgram> [java] com.google.gwt.dev.jjs.ast.JProgram [java] [ERROR] Permutation failed [java] com.google.gwt.core.ext.UnableToCompleteException: (see previous log entries) [java] at com.google.gwt.dev.jjs.JavaToJavaScriptCompiler.logAndTranslateException(Ja vaToJavaScriptCompiler.java:670) [java] at com.google.gwt.dev.jjs.JavaToJavaScriptCompiler.compile(JavaToJavaScriptCom piler.java:543) [java] at com.google.gwt.dev.GWTCompiler$PermutationWorker.run(GWTCompiler.java:347) [java] at java.lang.Thread.run(Thread.java:619) [java] [ERROR] Build failed
On Sat, Jul 19, 2008 at 2:01 AM, Scott Blum <sco...@google.com> wrote: > Hi guys, > Could you please review this patch for trunk? It makes the compiler go > much faster when you have:
> a) Several permutations > b) More than one processor
> Actually, it even helps a little bit with several permutations and only one > processor.
> Here's the basic design:
> 1) Build a unified Java AST from Eclipse's JDT tree; represent unresolved > deferred binding decisions as explicit nodes > 2) Do optimizations on this tree. > 3) Serialize into a byte buffer (thanks to Bruce for this tasty suggestion) > 4) Crank up N worker threads > 5) Each worker thread: > a) Grabs a particular permutation and deserializes its own copy of the > unified Java AST > b) Resolves all deferred binding for that particular permutation > c) Optimizes the permutation with the new knowledge of deferred binding > decisions > 6) Profit!
> Some notes: > - Doing step 2 up front on the unified tree actually realizes some > performance gains by making step 5c take less time in each permutation, even > when you have only one worker thread > - The number of worker threads created is > Math.min(numberOfMachineProcessors, > estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, > there's no point creating extra threads if you don't have the memory for > them, or you can't actually utilize the hardware. Permutation optimization > barely touches the I/O, it's all CPU bound. > - Making the Java and JS ASTs serializable was surprisingly easy.
> Toby's about to email around some results from testing Showcase on his > phatty 4-processor machine, if he hasn't already done so.
Yeah, this is the intended behavior. Basically, it tried to run 3 threads initially, then as memory got tight, had to drop to fewer and fewer threads. Finally, even running just 1 thread it still ran out. I think something is slightly fishy, and we need to investigate this a bit. The memory shouldn't be going up much permutation over permutation, unless the SoftReference in StandardCompilationResult isn't doing its job correctly.
On Fri, Jul 18, 2008 at 10:13 PM, Cameron Braid <came...@braid.com.au> wrote:
There are pros and cons to both. Serialization: - Was cake to implement - Lower memory ceiling in tight situations since you don't have two copies of the AST in memory at the same time - Anticipates out of process / distributed parallelized build
Clone: - Probably faster - Immutables such as String could be shared across permutations
On Fri, Jul 18, 2008 at 1:42 PM, Ray Cromwell <cromwell...@gmail.com> wrote:
> Scott, > First of all, kickass job. The code you are looking for is probably > Executors.newFixedThreadPool(nThreads).
> I have a question tho, is deserializing bytebuffers as fast as using > a deep tree clone method? I'm not an expert in Java serialization, but > I always had the impression it was not the greatest performer. I would > have thought that a copy() method on every JNode subclass would have > been faster, especially in cases where it is effectively final (not > overriden after HotSpot type-tightens a callsite) and can be inlined > by HotSpot.
> I can't wait to try this on my Mac Pro! > -Ray
> On Fri, Jul 18, 2008 at 9:55 AM, Scott Blum <sco...@google.com> wrote: > > Because I have no idea what I'm doing and just did the simplest thing I > > could think of to get it working. :) > > Seriously, I would love suggestions on how to do this better. I'd also > love > > for you to get your butt in to work so we can try it on your 8-core > monster > > box. :)
> > On Fri, Jul 18, 2008 at 12:17 PM, Kelly Norton <knor...@google.com> > wrote:
> >> Scott, > >> I'm kind of surprised to see all the code the hand manage a worker pool > of > >> threads. Why did you not use a fixed thread pool from > java.util.concurrency? > >> /kel
> >> On Fri, Jul 18, 2008 at 12:01 PM, Scott Blum <sco...@google.com> wrote:
> >>> Hi guys, > >>> Could you please review this patch for trunk? It makes the compiler go > >>> much faster when you have: > >>> a) Several permutations > >>> b) More than one processor > >>> Actually, it even helps a little bit with several permutations and only > >>> one processor. > >>> Here's the basic design: > >>> 1) Build a unified Java AST from Eclipse's JDT tree; represent > unresolved > >>> deferred binding decisions as explicit nodes > >>> 2) Do optimizations on this tree. > >>> 3) Serialize into a byte buffer (thanks to Bruce for this tasty > >>> suggestion) > >>> 4) Crank up N worker threads > >>> 5) Each worker thread: > >>> a) Grabs a particular permutation and deserializes its own copy of > the > >>> unified Java AST > >>> b) Resolves all deferred binding for that particular permutation > >>> c) Optimizes the permutation with the new knowledge of deferred > binding > >>> decisions > >>> 6) Profit! > >>> Some notes: > >>> - Doing step 2 up front on the unified tree actually realizes some > >>> performance gains by making step 5c take less time in each permutation, > even > >>> when you have only one worker thread > >>> - The number of worker threads created is > >>> Math.min(numberOfMachineProcessors, > >>> estimatedNumberOfASTsThatCanFitInMemoryConcurrently); in other words, > >>> there's no point creating extra threads if you don't have the memory > for > >>> them, or you can't actually utilize the hardware. Permutation > optimization > >>> barely touches the I/O, it's all CPU bound. > >>> - Making the Java and JS ASTs serializable was surprisingly easy. > >>> Toby's about to email around some results from testing Showcase on his > >>> phatty 4-processor machine, if he hasn't already done so. > >>> Thanks! > >>> Scott
> >> -- > >> If you received this communication by mistake, you are entitled to one > >> free ice cream cone on me. Simply print out this email including all > >> relevant SMTP headers and present them at my desk to claim your creamy > >> treat. We'll have a laugh at my emailing incompetence, and play a game > of > >> ping pong. (offer may not be valid in all States).
I guess part of my point is that the default build (before applying the patch) worked out of the box.
Also my machine is capable of building with 4 threads (4 cores and 8 gig ram) yet the 'out of the box' configuration prevents this due to jvm arguments.
Would it be such a bad thing to have the defaults use -Xmx1G ?
What would the drawback be for people with < 1G ram ?
Also, when the "Out of memory; will retry permutation using fewer concurrent threads" occurs, do you call System.gc() to try and free up some memory ?
On Sun, Jul 20, 2008 at 9:45 AM, Scott Blum <sco...@google.com> wrote: > Yeah, this is the intended behavior. Basically, it tried to run 3 threads > initially, then as memory got tight, had to drop to fewer and fewer threads. > Finally, even running just 1 thread it still ran out. > I think something is slightly fishy, and we need to investigate this a bit. > The memory shouldn't be going up much permutation over permutation, unless > the SoftReference in StandardCompilationResult isn't doing its job > correctly.
> On Fri, Jul 18, 2008 at 10:13 PM, Cameron Braid <came...@braid.com.au> > wrote:
>> If I build gwt samples using the default gwtc jvm arg -Xmx256m the >> following error occurs
>> If I increase this to 512m it is fine
>> gwtc: >> [java] Compiling module com.google.gwt.sample.showcase.Showcase >> [java] Compiling permutations >> [java] Analyzing permutation #3 >> [java] [WARN] Out of memory; will retry permutation using fewer >> concurrent threads >> [java] Analyzing permutation #13 >> [java] [WARN] Out of memory; will retry permutation using fewer >> concurrent threads >> [java] Analyzing permutation #11 >> [java] [ERROR] An internal compiler exception occurred >> [java] com.google.gwt.dev.jjs.InternalCompilerException: Unexpected >> error during visit.
On Sat, Jul 19, 2008 at 8:28 PM, Cameron Braid <came...@braid.com.au> wrote: > I guess part of my point is that the default build (before applying the > patch) worked out of the box.
Indeed. Scott's argument seems pretty good for why this should not happen. And yet it does. :)
> Also, when the "Out of memory; will retry permutation using fewer > concurrent threads" occurs, do you call System.gc() to try and free up some > memory ?
That should be irrelevant. The JVM should do its own System.gc() before throwing an OOM, and I've never seen evidence that it would not do so.
On Fri, Jul 18, 2008 at 1:53 PM, Matthew Mastracci <matt...@mastracci.com> wrote:
> Is there a lot of time spent serializing/deserializing? If so, it > might be useful to try out JBoss Serialization as a drop-in > replacement. It's quite a bit faster for serializing large trees, but > (at least in my experience) increases the output size by a few percent > (< 10%).
On my machine, it's about 2 seconds per permutation spent deserializing Showcase AST, so it's worth looking at.
Toby also suggested that there's a way to use bytecode rewriting to make your serializable objects externalizable in a way that makes then serialize/deserialize much faster, so we might look into this as well.