I ran into a very strange effect when some Sun folks tried to benchmark JRuby's multi-thread scalability. In short, adding more threads actually caused the benchmarks to take longer.
The source of the problem (at least the source that, when fixed, allowed normal thread scaling), was an increment, mask, and test of a static int field. The code in question looked like this:
private static int count = 0;
public void pollEvents(ThreadContext context) { if ((count++ & 0xFF) == 0) context.poll();
}
So the basic idea was that this would call poll() every 256 hits, incrementing a counter all the while. My first attempt to improve performance was to comment out the body of poll() in case it was causing a threading bottleneck (it does some locking and such), but that had no effect. Then, as a total shot in the dark, I commented out the entire line above. Thread scaling went to normal.
So I'm rather confused here. Is a ++ operation on a static int doing some kind of atomic update that causes multiple threads to contend? I never would have expected this, so I wrote up a small Java benchmark:
The benchmark does basically the same thing, with a single main counter and another "fired" counter to prevent hotspot from optimizing things completely away. I've been running this on a dual-core MacBook Pro with both Apple's Java 5 and the soylatte Java 6 release. The results are very confusing:
Normal scaling here...1 thread on my system uses about 60-65% CPU, so the extra thread uses up the remaining 35-40% and the numbers show it. Then there's soylatte Java 6:
Don't compare the times directly, since these are two pretty different codebases and they each have different general performance characteristics. Instead pay attention to the trend...the soylatte Java 6 run with two threads is significantly slower than the run with a single thread. This mirrors the results with JRuby when there was a single static counter being incremented.
> I ran into a very strange effect when some Sun folks tried to benchmark > JRuby's multi-thread scalability. In short, adding more threads actually > caused the benchmarks to take longer.
> The source of the problem (at least the source that, when fixed, allowed > normal thread scaling), was an increment, mask, and test of a static int > field. The code in question looked like this:
> private static int count = 0;
> public void pollEvents(ThreadContext context) { > if ((count++ & 0xFF) == 0) context.poll(); > }
> So the basic idea was that this would call poll() every 256 hits, > incrementing a counter all the while. My first attempt to improve > performance was to comment out the body of poll() in case it was causing > a threading bottleneck (it does some locking and such), but that had no > effect. Then, as a total shot in the dark, I commented out the entire > line above. Thread scaling went to normal.
That is rather odd.
Shouldn't count be volatile?
If it's declared as volatile does that make any difference?
John Wilson wrote: > On 4/2/08, Charles Oliver Nutter <charles.nut...@sun.com> wrote: >> I ran into a very strange effect when some Sun folks tried to benchmark >> JRuby's multi-thread scalability. In short, adding more threads actually >> caused the benchmarks to take longer.
>> The source of the problem (at least the source that, when fixed, allowed >> normal thread scaling), was an increment, mask, and test of a static int >> field. The code in question looked like this:
>> private static int count = 0;
>> public void pollEvents(ThreadContext context) { >> if ((count++ & 0xFF) == 0) context.poll(); >> }
>> So the basic idea was that this would call poll() every 256 hits, >> incrementing a counter all the while. My first attempt to improve >> performance was to comment out the body of poll() in case it was causing >> a threading bottleneck (it does some locking and such), but that had no >> effect. Then, as a total shot in the dark, I commented out the entire >> line above. Thread scaling went to normal.
> That is rather odd.
> Shouldn't count be volatile?
> If it's declared as volatile does that make any difference?
I had not tried it because I expected volatile would only make it slower. And in this case, the code in question didn't really care about perfect accuracy for the counter since it's just a rough guide. But here's numbers with volatile on my machine:
So as expected, volatile does slow things down a lot, but Java 6 does do a little better here. Also interesting to see that volatile completely obliterates any gain from running multiple threads on both Java 5 and Java 6, and the total time is almost 3x slower than a single thread on Java 5.
I tried another non-volatile run using i += 1 rather than i++ and the numbers were almost identical, with Java 6 severely degrading with multiple threads running and Java 5 improving.
Here's another set of numbers from Vladimir Sizikov, on a dual-core windows machine running Sun Java 5 and Sun Java 6:
No, I haven't...no access to a machine that can run debug builds at the moment, but I may try it later if the answer to my riddle does not present itself.
On 4/2/08, Charles Oliver Nutter <charles.nut...@sun.com> wrote:
> John Wilson wrote: > > That is rather odd.
> > Shouldn't count be volatile?
> > If it's declared as volatile does that make any difference?
> I had not tried it because I expected volatile would only make it > slower. And in this case, the code in question didn't really care about > perfect accuracy for the counter since it's just a rough guide. But > here's numbers with volatile on my machine:
Interesting. I thought that the runtime system might have inferred that count should have been volatile but you numbers show that this is not the case. [snip]
> I tried another non-volatile run using i += 1 rather than i++ and the > numbers were almost identical, with Java 6 severely degrading with > multiple threads running and Java 5 improving.
It would be interesting to try:
tmp = i; tmp++; i = tmp;
for tmp as a local variable and again for tmp as a public instance field.
(I'm sorry I can't do the tests for mystelf as I don't have access to a muti core machine here).
Attila Szegedi wrote: > No. You know your JVM bytecodes, Charlie - the only incrementing > bytecode in existence is IINC and it only works on an integer local > variable.
Yes, I know that...but I hadn't dug into what code was actually being generated for a field++. Looking now they don't appear to be any different. Either way it doesn't give me any answers...
> Pretty much the only sane implementation of ++ on a static field would > be:
> Not even volatility of the field will ensure atomic increment > operations, as the value must be temporarily held on the thread > operand stack.
> If you want atomic updates, java.util.concurrent.atomic.AtomicInteger > might give you what you need on Java 5 and above.
I'm more interested in finding out why the performance is so bad with multiple threads under Java 6. I don't need atomic updates, for which I would certainly use AtomicInteger instead.
On 4/2/08, Charles Oliver Nutter <charles.nut...@sun.com> wrote:
> I ran into a very strange effect when some Sun folks tried to benchmark > JRuby's multi-thread scalability. In short, adding more threads actually > caused the benchmarks to take longer.
Was this work being done on Intel hardware?
I'm just wondering if this might be some issue with the cache on Intel mutli core processors (i.e. the cache is constantly being invalidated).
This could be tested by running the equivalent C code on Intel kit and running the Java code on non Intel kit.
John Wilson wrote: > On 4/2/08, Charles Oliver Nutter <charles.nut...@sun.com> wrote: >> I ran into a very strange effect when some Sun folks tried to benchmark >> JRuby's multi-thread scalability. In short, adding more threads actually >> caused the benchmarks to take longer.
> Was this work being done on Intel hardware?
> I'm just wondering if this might be some issue with the cache on Intel > mutli core processors (i.e. the cache is constantly being > invalidated).
> This could be tested by running the equivalent C code on Intel kit and > running the Java code on non Intel kit.
In both cases (my benchmarking and "Sun guys" benchmarking) it was on x86-based hardware, but my runs were on OS X Intel (Core Duo) and theirs were on Solaris AMD x86-64 (Opteron). I have not yet heard back whether my fix improved thread scaling for them.
> I ran into a very strange effect when some Sun folks tried to benchmark > JRuby's multi-thread scalability. In short, adding more threads actually > caused the benchmarks to take longer. [...] > Instead pay attention to the trend...the soylatte Java > 6 run with two threads is significantly slower than the run with a > single thread. This mirrors the results with JRuby when there was a > single static counter being incremented.
I think there is not enough data to see a trend. I modified your test, made it run from 1-20 threads and for 50 loops, making an average time containing the time it took to execute all threads and put these in a diagram. I used a Q6600 Quadcore intel CPU with java 1.6.0_03-b05 on Linux 2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that the time constantly goes up until 4 Threads are reached, my number of CPUs. Using 5 threads is takes less time than using 4, but after that the time looks more or less constant.
Jochen Theodorou wrote: > I think there is not enough data to see a trend. I modified your test, > made it run from 1-20 threads and for 50 loops, making an average time > containing the time it took to execute all threads and put these in a > diagram. I used a Q6600 Quadcore intel CPU with java 1.6.0_03-b05 on > Linux 2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that > the time constantly goes up until 4 Threads are reached, my number of > CPUs. Using 5 threads is takes less time than using 4, but after that > the time looks more or less constant.
> This looks quite scalable to me.
You did notice the code splits the number of iterations by the number of threads, right? At the very least, the amount of time taken to run all threads should go down.
Run the same test in Java 5 and you'll see that Java 6 does not scale properly at all for this test.
> I'm just wondering if this might be some issue with the cache on Intel > mutli core processors (i.e. the cache is constantly being > invalidated).
Yes, that code will in general cause cache ping-ponging, which in the long run will harm sharing and hence scalability. On a single socket system, the sharing ought to be pretty good, though. And that doesn't explain the difference between 5 and 6.
P.S. Today I'm pushing the disassembler plugin for Hotspot 13, to come out with the first flood of pending OpenJDK changes, RSN. It will support disassembly of code even on product versions of Hotspot, making it *much* easier to discuss issues like this. Webrev (slightly out-of-date) is here: http://webrev.invokedynamic.info/jrose/6667042/
The only current implementation of the disassembler uses Gnu binutils. We don't link Gnu code into Hotspot, but this isn't a problem, since the disassembler is loaded into the JVM as a plugin. The downside is you'll have to build the disassembler plugin yourself. There are build instructions (using Gnu binutils) in the Hotspot sources.
// if ((i++ & 0xFF) == 0) firedCount++; // SLOW // if ((i++ & 0xFF) == 0) Counter.firedCount++; // FAST
The difference in timings on my Dual-Core Core2Duo CPU on JDK 6u5 is about 2x! And the only difference in code is where the static counter lives: either on main class (then it slow) or in static member class (then it's 2x faster!)
The test was executed with 2 threads and with -server switch: D:/re/java6/bin/java -server Trouble 2
JDK 5 and JDK 1.4 behave as expected and there is no significant difference.
Thanks, --Vladimir
On Wed, Apr 2, 2008 at 5:51 PM, Charles Oliver Nutter
> Jochen Theodorou wrote: > > I think there is not enough data to see a trend. I modified your test, > > made it run from 1-20 threads and for 50 loops, making an average time > > containing the time it took to execute all threads and put these in a > > diagram. I used a Q6600 Quadcore intel CPU with java 1.6.0_03-b05 on > > Linux 2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that > > the time constantly goes up until 4 Threads are reached, my number of > > CPUs. Using 5 threads is takes less time than using 4, but after that > > the time looks more or less constant.
> > This looks quite scalable to me.
> You did notice the code splits the number of iterations by the number of > threads, right? At the very least, the amount of time taken to run all > threads should go down.
> Run the same test in Java 5 and you'll see that Java 6 does not scale > properly at all for this test.
> Jochen Theodorou wrote: >> I think there is not enough data to see a trend. I modified your test, >> made it run from 1-20 threads and for 50 loops, making an average time >> containing the time it took to execute all threads and put these in a >> diagram. I used a Q6600 Quadcore intel CPU with java 1.6.0_03-b05 on >> Linux 2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that >> the time constantly goes up until 4 Threads are reached, my number of >> CPUs. Using 5 threads is takes less time than using 4, but after that >> the time looks more or less constant.
>> This looks quite scalable to me.
> You did notice the code splits the number of iterations by the number of > threads, right? At the very least, the amount of time taken to run all > threads should go down.
why? If I have 1 CPU and two threads, each of them running t seconds, shouldn't the time then be 2*t? Everything over 2*t would be the thread handling overhead. since you let 1 Thread run n iteration taking time t and then 2 Threads with n/2 iterations... shouldn't the total time then still be t?
If you want to tell me that running with 3 threads takes longer than running with one thread and this should not happen, then yes, probably, but that is not what I understand from scalability. I do understand that if you go to big n and the total time increases much, then the scalability is bad. If you look at 7 or 20 Threads, then this does not happen. Only when the number of threads is lower this happens. So you could say the JVM in 1.6 does not scale well for a lower number of Threads, but does for a high number of threads...
> Run the same test in Java 5 and you'll see that Java 6 does not scale > properly at all for this test.
On Wed, Apr 2, 2008 at 9:59 PM, Jochen Theodorou <blackd...@gmx.org> wrote: > why? If I have 1 CPU and two threads, each of them running t seconds, > shouldn't the time then be 2*t? Everything over 2*t would be the thread > handling overhead. since you let 1 Thread run n iteration taking time t > and then 2 Threads with n/2 iterations... shouldn't the total time then > still be t?
This particular example under discussion has a total number of iterations, and it tries to reach that number either via single thread, or two threads, etc. So, amount of work is roughly the same. What varies is the way the work done.
The point is that with two treads on two-CPU system, the work should be done faster, because there are two cores running things in parallel. But with JDK6 this does not happen, both cores *ARE* 100% busy, but the time it takes to reach the end (to reach the total number of iterations limit) is longer than expected.
On Apr 2, 2008, at 12:55 PM, Vladimir Sizikov wrote:
> Here's a slightly reduced example, with most important tings in just > two lines, one marked SLOW and another marked FAST.
Hotspot has optimizations specifically for codes that look like micro- benchmarks, but there are better ways to get performance, and to measure it.
Part of the Trouble here is that the Thread.run methods never exit. The compiler kicks in part of the way through, and replaces the interpreter frame with a compiled frame. (The technique is called OSR, or on-stack replacement.) The code is only partially optimized as a result. The difference between releases could be due to a subtle change in tuning of OSR, and would be irrelevant except for micro-benchmarks.
Here are some rules, in priority order, to know about when you write micro-benchmarks:
Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before timing phase(s). (Fewer iterations is OK on the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)
Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.
Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.
Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. -XX:+PrintCompilation reports OSR compilations with an at-sign to denote the non-initial entry point: Trouble$1::run @ 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.
Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside the of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effect.
Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.
Enclosed are the disassembly of the two loop kernels, from the Java 5 and Java 6 32-bit. The default JIT (on the machine I tested) was the server JIT. The loops look essentially the same to me on both releases. I also included the client version; those kernels look the same too.
One difference is the effective addresses of the two shared variables happen to fall in distinct quadwords on Java 5 and in the same quadword on Java 6:
0x082025b4 &i in Java 5 0x082025b8 &firedCount in Java 5 0xb61dcae8 &i in Java 6 0xb61dcaec &firedCount in Java 6
This may be the best way to explain the performance difference, especially since moving firedCount into a separate class seems to erase the effect.
Welcome to the multicore world! Memory is not flat like Kansas anymore.
John Rose wrote: > Enclosed are the disassembly of the two loop kernels, from the Java 5 > and Java 6 32-bit. The default JIT (on the machine I tested) was the > server JIT. The loops look essentially the same to me on both releases. > I also included the client version; those kernels look the same too.
> One difference is the effective addresses of the two shared variables > happen to fall in distinct quadwords on Java 5 and in the same quadword > on Java 6:
> 0x082025b4 &i in Java 5 0x082025b8 &firedCount in Java 5 0xb61dcae8 > &i in Java 6 0xb61dcaec &firedCount in Java 6
> This may be the best way to explain the performance > difference, especially since moving firedCount into a separate class > seems to erase the effect.
> Welcome to the multicore world! Memory is not flat like Kansas anymore.
Pardon my ignorance, but why does that explain the performance difference? Talking with Wayne Meissner he thought perhaps because the two variables were closer together, updating one caused a cache flush of both. That sounds reasonable, I suppose...but doesn't this still qualify as a regression since performance is significantly worse under Java 6?
I guess my concern is that one reason many of these new languages are gaining traction on the JVM (especially in the case of Jython and JRuby) is because they can be *actually* parallel. Without experimentally finding this effect, we might have gone months or forever with poor thread scaling, all because we used a single static variable instead of separate thread-local variables. It's rather surprising, and makes me worry about the rest of the codebase.
So I guess there's a rule of thumb here: don't share data across threads unless you absolutely need to. And perhaps: don't use statics (Gilad would be proud). But something more concrete to say why this is bad, or an explanation of how this effect is a trade-off for something much more important (cache locality for field reads?) would help ease the pain...
what do you expect? That the VM spread out all global variables throughout the heap so that any broken access pattern cause not too many cache flushes? You're hammering a global from multiple threads, I don't think you can blame the JVM for that, be it for a new language or not.
Matthias
On Thu, Apr 3, 2008 at 10:18 AM, Charles Oliver Nutter
> John Rose wrote: > > Enclosed are the disassembly of the two loop kernels, from the Java 5 > > and Java 6 32-bit. The default JIT (on the machine I tested) was the > > server JIT. The loops look essentially the same to me on both releases. > > I also included the client version; those kernels look the same too.
> > One difference is the effective addresses of the two shared variables > > happen to fall in distinct quadwords on Java 5 and in the same quadword > > on Java 6:
> > 0x082025b4 &i in Java 5 0x082025b8 &firedCount in Java 5 0xb61dcae8 > > &i in Java 6 0xb61dcaec &firedCount in Java 6
> > This may be the best way to explain the performance > > difference, especially since moving firedCount into a separate class > > seems to erase the effect.
> > Welcome to the multicore world! Memory is not flat like Kansas anymore.
> Pardon my ignorance, but why does that explain the performance > difference? Talking with Wayne Meissner he thought perhaps because the > two variables were closer together, updating one caused a cache flush of > both. That sounds reasonable, I suppose...but doesn't this still qualify > as a regression since performance is significantly worse under Java 6?
> I guess my concern is that one reason many of these new languages are > gaining traction on the JVM (especially in the case of Jython and JRuby) > is because they can be *actually* parallel. Without experimentally > finding this effect, we might have gone months or forever with poor > thread scaling, all because we used a single static variable instead of > separate thread-local variables. It's rather surprising, and makes me > worry about the rest of the codebase.
> So I guess there's a rule of thumb here: don't share data across threads > unless you absolutely need to. And perhaps: don't use statics (Gilad > would be proud). But something more concrete to say why this is bad, or > an explanation of how this effect is a trade-off for something much more > important (cache locality for field reads?) would help ease the pain...
> what do you expect? That the VM spread out all global variables > throughout the heap so that any broken access pattern cause not too > many cache flushes? You're hammering a global from multiple threads, I > don't think you can blame the JVM for that, be it for a new language > or not.
I'm just looking for a definitive answer on why it's better this way in Java 6. If this is a good optimization for other reasons, that's perfectly acceptable and I'll be happy with it. At the moment, however, I have a real-world case where it severely impacted execution performance...so severely that I'd be very surprised if other parts of the system weren't suffering from the same effect. And I'm sure there are other apps out there secretly harboring similar issues.
I'm writing up a blog post on this now, so any additional information is very welcome. And I'm not necessarily "blaming" the JVM...but slower is slower.
Charles Oliver Nutter wrote: > I'm writing up a blog post on this now, so any additional information is > very welcome. And I'm not necessarily "blaming" the JVM...but slower is > slower.
On Apr 3, 2008, at 2:20 AM, Charles Oliver Nutter wrote:
> I'm just looking for a definitive answer on why it's better this > way in > Java 6.
This particular micro-benchmark happens to be slower in Java 6 than in Java 5, but if the layouts were changed by four or twelve bytes the opposite might be true, and we'd have a "win" for Java 6. (If my previous analysis is correct, which I'm going to assume in the rest of this note.) Adding an odd number of four-byte static fields before the two in the micro-benchmark might make Java 5 faster than Java 6. So this is not a question of quality of implementation that distinguishes 5 from 6. It is indeterminate behavior that comes from questionable coding (unsynch. non-volatile static writes).
To me as a VM guy, it's a case of: P. Doctor, it hurts when I do this. D. Then don't do that.
The JVM could notice that there is lots of cache ping-ponging on a pair of separable variables, and reallocate them onto separate lines. We'll probably eventually do stuff like that, given the new realities about memory architecture. But at present this is mainly a research topic that you read about in conference papers.
I agree there are sometimes reasons, especially for language implementors, to bend the rules, and do things like increment unlocked counters. (We do it in the JVM all the time, at the C++ level.) But if you are going to bend the rules, you can't expect the JVM to be ready to know what you mean. You might have to help it accept the weird code. A good help in this case would be to make sure the bits of mutable global state are isolated in a corner somewhere.
Even better (if applicable) would be to have fast thread-local counters, with a slow background phase which occasionally tallies them into a trailing global counter. Use compare-and-swap to collect the fast-moving data into the slow-moving data. This would be faster and more accurate, at the cost of complexity and extra per-thread storage. It reflects the reality of non-uniform memory. (This verges on the old fetch-and-add instruction for massively parallel machines like the Connection Machine.)