I ran into a very strange effect when some Sun folks tried to benchmark JRuby's multi-thread scalability. In short, adding more threads actually caused the benchmarks to take longer.
The source of the problem (at least the source that, when fixed, allowed normal thread scaling), was an increment, mask, and test of a static int field. The code in question looked like this:
private static int count = 0;
public void pollEvents(ThreadContext context) { if ((count++ & 0xFF) == 0) context.poll();
}
So the basic idea was that this would call poll() every 256 hits, incrementing a counter all the while. My first attempt to improve performance was to comment out the body of poll() in case it was causing a threading bottleneck (it does some locking and such), but that had no effect. Then, as a total shot in the dark, I commented out the entire line above. Thread scaling went to normal.
So I'm rather confused here. Is a ++ operation on a static int doing some kind of atomic update that causes multiple threads to contend? I never would have expected this, so I wrote up a small Java benchmark:
The benchmark does basically the same thing, with a single main counter and another "fired" counter to prevent hotspot from optimizing things completely away. I've been running this on a dual-core MacBook Pro with both Apple's Java 5 and the soylatte Java 6 release. The results are very confusing:
Normal scaling here...1 thread on my system uses about 60-65% CPU, so the extra thread uses up the remaining 35-40% and the numbers show it. Then there's soylatte Java 6:
Don't compare the times directly, since these are two pretty different codebases and they each have different general performance characteristics. Instead pay attention to the trend...the soylatte Java 6 run with two threads is significantly slower than the run with a single thread. This mirrors the results with JRuby when there was a single static counter being incremented.
> I ran into a very strange effect when some Sun folks tried to benchmark > JRuby's multi-thread scalability. In short, adding more threads actually > caused the benchmarks to take longer.
> The source of the problem (at least the source that, when fixed, allowed > normal thread scaling), was an increment, mask, and test of a static int > field. The code in question looked like this:
> private static int count = 0;
> public void pollEvents(ThreadContext context) { > if ((count++ & 0xFF) == 0) context.poll(); > }
> So the basic idea was that this would call poll() every 256 hits, > incrementing a counter all the while. My first attempt to improve > performance was to comment out the body of poll() in case it was causing > a threading bottleneck (it does some locking and such), but that had no > effect. Then, as a total shot in the dark, I commented out the entire > line above. Thread scaling went to normal.
That is rather odd.
Shouldn't count be volatile?
If it's declared as volatile does that make any difference?
John Wilson wrote: > On 4/2/08, Charles Oliver Nutter <charles.nut...@sun.com> wrote: >> I ran into a very strange effect when some Sun folks tried to benchmark >> JRuby's multi-thread scalability. In short, adding more threads actually >> caused the benchmarks to take longer.
>> The source of the problem (at least the source that, when fixed, allowed >> normal thread scaling), was an increment, mask, and test of a static int >> field. The code in question looked like this:
>> private static int count = 0;
>> public void pollEvents(ThreadContext context) { >> if ((count++ & 0xFF) == 0) context.poll(); >> }
>> So the basic idea was that this would call poll() every 256 hits, >> incrementing a counter all the while. My first attempt to improve >> performance was to comment out the body of poll() in case it was causing >> a threading bottleneck (it does some locking and such), but that had no >> effect. Then, as a total shot in the dark, I commented out the entire >> line above. Thread scaling went to normal.
> That is rather odd.
> Shouldn't count be volatile?
> If it's declared as volatile does that make any difference?
I had not tried it because I expected volatile would only make it slower. And in this case, the code in question didn't really care about perfect accuracy for the counter since it's just a rough guide. But here's numbers with volatile on my machine:
So as expected, volatile does slow things down a lot, but Java 6 does do a little better here. Also interesting to see that volatile completely obliterates any gain from running multiple threads on both Java 5 and Java 6, and the total time is almost 3x slower than a single thread on Java 5.
I tried another non-volatile run using i += 1 rather than i++ and the numbers were almost identical, with Java 6 severely degrading with multiple threads running and Java 5 improving.
Here's another set of numbers from Vladimir Sizikov, on a dual-core windows machine running Sun Java 5 and Sun Java 6:
No, I haven't...no access to a machine that can run debug builds at the moment, but I may try it later if the answer to my riddle does not present itself.
On 4/2/08, Charles Oliver Nutter <charles.nut...@sun.com> wrote:
> John Wilson wrote: > > That is rather odd.
> > Shouldn't count be volatile?
> > If it's declared as volatile does that make any difference?
> I had not tried it because I expected volatile would only make it > slower. And in this case, the code in question didn't really care about > perfect accuracy for the counter since it's just a rough guide. But > here's numbers with volatile on my machine:
Interesting. I thought that the runtime system might have inferred that count should have been volatile but you numbers show that this is not the case. [snip]
> I tried another non-volatile run using i += 1 rather than i++ and the > numbers were almost identical, with Java 6 severely degrading with > multiple threads running and Java 5 improving.
It would be interesting to try:
tmp = i; tmp++; i = tmp;
for tmp as a local variable and again for tmp as a public instance field.
(I'm sorry I can't do the tests for mystelf as I don't have access to a muti core machine here).
Attila Szegedi wrote: > No. You know your JVM bytecodes, Charlie - the only incrementing > bytecode in existence is IINC and it only works on an integer local > variable.
Yes, I know that...but I hadn't dug into what code was actually being generated for a field++. Looking now they don't appear to be any different. Either way it doesn't give me any answers...
> Pretty much the only sane implementation of ++ on a static field would > be:
> Not even volatility of the field will ensure atomic increment > operations, as the value must be temporarily held on the thread > operand stack.
> If you want atomic updates, java.util.concurrent.atomic.AtomicInteger > might give you what you need on Java 5 and above.
I'm more interested in finding out why the performance is so bad with multiple threads under Java 6. I don't need atomic updates, for which I would certainly use AtomicInteger instead.
On 4/2/08, Charles Oliver Nutter <charles.nut...@sun.com> wrote:
> I ran into a very strange effect when some Sun folks tried to benchmark > JRuby's multi-thread scalability. In short, adding more threads actually > caused the benchmarks to take longer.
Was this work being done on Intel hardware?
I'm just wondering if this might be some issue with the cache on Intel mutli core processors (i.e. the cache is constantly being invalidated).
This could be tested by running the equivalent C code on Intel kit and running the Java code on non Intel kit.
John Wilson wrote: > On 4/2/08, Charles Oliver Nutter <charles.nut...@sun.com> wrote: >> I ran into a very strange effect when some Sun folks tried to benchmark >> JRuby's multi-thread scalability. In short, adding more threads actually >> caused the benchmarks to take longer.
> Was this work being done on Intel hardware?
> I'm just wondering if this might be some issue with the cache on Intel > mutli core processors (i.e. the cache is constantly being > invalidated).
> This could be tested by running the equivalent C code on Intel kit and > running the Java code on non Intel kit.
In both cases (my benchmarking and "Sun guys" benchmarking) it was on x86-based hardware, but my runs were on OS X Intel (Core Duo) and theirs were on Solaris AMD x86-64 (Opteron). I have not yet heard back whether my fix improved thread scaling for them.
> I ran into a very strange effect when some Sun folks tried to benchmark > JRuby's multi-thread scalability. In short, adding more threads actually > caused the benchmarks to take longer. [...] > Instead pay attention to the trend...the soylatte Java > 6 run with two threads is significantly slower than the run with a > single thread. This mirrors the results with JRuby when there was a > single static counter being incremented.
I think there is not enough data to see a trend. I modified your test, made it run from 1-20 threads and for 50 loops, making an average time containing the time it took to execute all threads and put these in a diagram. I used a Q6600 Quadcore intel CPU with java 1.6.0_03-b05 on Linux 2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that the time constantly goes up until 4 Threads are reached, my number of CPUs. Using 5 threads is takes less time than using 4, but after that the time looks more or less constant.