The source of the problem (at least the source that, when fixed, allowed
normal thread scaling) was an increment, mask, and test of a static int
field. The code in question looked like this:
    private static int count = 0;

    public void pollEvents(ThreadContext context) {
        if ((count++ & 0xFF) == 0) context.poll();
    }
So the basic idea was that this would call poll() every 256 hits,
incrementing a counter all the while. My first attempt to improve
performance was to comment out the body of poll() in case it was causing
a threading bottleneck (it does some locking and such), but that had no
effect. Then, as a total shot in the dark, I commented out the entire
line above. Thread scaling went to normal.
So I'm rather confused here. Is a ++ operation on a static int doing
some kind of atomic update that causes multiple threads to contend? I
never would have expected this, so I wrote up a small Java benchmark:
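The benchmark source isn't included in this message; a minimal reconstruction, consistent with the description and the fired counts below (1 billion total iterations, so 1000000000/256 = 3906250 fires), might look like this. The class name Trouble matches the command lines below; everything else is guesswork:

```java
// Hypothetical reconstruction of the benchmark: a fixed total iteration
// count is split across N threads, all bumping the same static counter.
public class Trouble {
    static int count = 0;       // shared static counter under test
    static int firedCount = 0;  // "fired" counter, keeps HotSpot from
                                // optimizing the loop away entirely

    // Runs `total` iterations split across `threads` threads; returns
    // the (racy, approximate) number of times the check fired.
    static int run(int threads, int total) {
        count = 0;
        firedCount = 0;
        final int perThread = total / threads;
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        if ((count++ & 0xFF) == 0) firedCount++;
                    }
                }
            });
            ts[t].start();
        }
        try {
            for (int t = 0; t < threads; t++) ts[t].join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return firedCount;
    }

    public static void main(String[] args) {
        int threads = Integer.parseInt(args[0]);
        for (int pass = 0; pass < 5; pass++) {
            long start = System.currentTimeMillis();
            int fired = run(threads, 1000000000);
            System.out.println("time: " + (System.currentTimeMillis() - start));
            System.out.println("fired: " + fired);
        }
    }
}
```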
The benchmark does basically the same thing, with a single main counter
and another "fired" counter to prevent hotspot from optimizing things
completely away. I've been running this on a dual-core MacBook Pro with
both Apple's Java 5 and the soylatte Java 6 release. The results are
very confusing:
First on Apple's Java 5
~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 3924
fired: 3906250
time: 3945
fired: 3906250
time: 1841
fired: 3906250
time: 1882
fired: 3906250
time: 1896
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 3243
fired: 4090645
time: 3245
fired: 4100505
time: 1173
fired: 3906049
time: 1233
fired: 3906188
time: 1173
fired: 3906134
Normal scaling here...1 thread on my system uses about 60-65% CPU, so
the extra thread uses up the remaining 35-40% and the numbers show it.
Then there's soylatte Java 6:
~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 1772
fired: 3906250
time: 1973
fired: 3906250
time: 2748
fired: 3906250
time: 2114
fired: 3906250
time: 2294
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 3402
fired: 3848648
time: 3805
fired: 3885471
time: 4145
fired: 3866850
time: 4140
fired: 3839130
time: 3658
fired: 3880202
Don't compare the times directly, since these are two pretty different
codebases and they each have different general performance
characteristics. Instead pay attention to the trend...the soylatte Java
6 run with two threads is significantly slower than the run with a
single thread. This mirrors the results with JRuby when there was a
single static counter being incremented.
So what's up here?
- Charlie
Patrick
That is rather odd.
Shouldn't count be volatile?
If it's declared as volatile does that make any difference?
John Wilson
I had not tried it because I expected volatile would only make it
slower. And in this case, the code in question didn't really care about
perfect accuracy for the counter, since it's just a rough guide. But
here are the numbers with volatile on my machine:
Apple Java 5:
~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 9047
fired: 3906250
time: 9007
fired: 3906250
time: 9613
fired: 3906250
time: 9846
fired: 3906250
time: 10005
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 22349
fired: 3540696
time: 27641
fired: 3341096
time: 26815
fired: 3546695
time: 26641
fired: 3542920
time: 26789
fired: 3534386
soylatte Java 6:
~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 9777
fired: 3906250
time: 9701
fired: 3906250
time: 9070
fired: 3906250
time: 8656
fired: 3906250
time: 9065
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 24781
fired: 3464957
time: 23668
fired: 3758204
time: 21235
fired: 3783215
time: 22198
fired: 3761491
time: 22937
fired: 3752534
So as expected, volatile does slow things down a lot, but Java 6 does do
a little better here. Also interesting to see that volatile completely
obliterates any gain from running multiple threads on both Java 5 and
Java 6, and the total time is almost 3x slower than a single thread on
Java 5.
I tried another non-volatile run using i += 1 rather than i++ and the
numbers were almost identical, with Java 6 severely degrading with
multiple threads running and Java 5 improving.
Here's another set of numbers from Vladimir Sizikov, on a dual-core
windows machine running Sun Java 5 and Sun Java 6:
D:\work>D:/re/java5/bin/java -server Trouble 2
time: 1666
fired: 3345620
time: 1453
fired: 4033604
time: 629
fired: 3592687
time: 569
fired: 3595772
time: 578
D:\work>D:/re/java6/bin/java -server Trouble 2
time: 1595
fired: 3896153
time: 1588
fired: 3900934
time: 2090
fired: 3896066
time: 2133
fired: 3891300
time: 2154
fired: 3892321
Again, significantly worse performance in Java 6 with multiple threads.
- Charlie
No, I haven't...no access to a machine that can run debug builds at the
moment, but I may try it later if the answer to my riddle does not
present itself.
- Charlie
No. You know your JVM bytecodes, Charlie - the only incrementing
bytecode in existence is IINC and it only works on an integer local
variable.
Pretty much the only sane implementation of ++ on a static field would
be:
GETSTATIC someClass.someField
ICONST_1
IADD
PUTSTATIC someClass.someField
Not even volatility of the field will ensure atomic increment
operations, as the value must be temporarily held on the thread
operand stack.
If you want atomic updates, java.util.concurrent.atomic.AtomicInteger
might give you what you need on Java 5 and above.
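For example, an atomic version of the counter could look like this (a sketch; the class and method names are invented, not the actual JRuby code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the same every-256th-call check with an atomic counter.
class EventPoller {
    private static final AtomicInteger count = new AtomicInteger();

    static boolean shouldPoll() {
        // getAndIncrement() is a single atomic read-modify-write,
        // unlike the GETSTATIC/ICONST_1/IADD/PUTSTATIC sequence
        // emitted for count++ on a plain static field.
        return (count.getAndIncrement() & 0xFF) == 0;
    }
}
```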
Attila.
Interesting. I thought that the runtime system might have inferred
that count should have been volatile, but your numbers show that this is
not the case.
[snip]
> I tried another non-volatile run using i += 1 rather than i++ and the
> numbers were almost identical, with Java 6 severely degrading with
> multiple threads running and Java 5 improving.
>
It would be interesting to try:
tmp = i;
tmp++;
i = tmp;
for tmp as a local variable and again for tmp as a public instance field.
(I'm sorry I can't do the tests myself as I don't have access to
a multi-core machine here.)
John Wilson
Yes, I know that...but I hadn't dug into what code was actually being
generated for a field++. Looking now, they don't appear to be any
different. Either way it doesn't give me any answers...
> Pretty much the only sane implementation of ++ on a static field would
> be:
>
> GETSTATIC someClass.someField
> ICONST_1
> IADD
> PUTSTATIC someClass.someField
>
> Not even volatility of the field will ensure atomic increment
> operations, as the value must be temporarily held on the thread
> operand stack.
>
> If you want atomic updates, java.util.concurrent.atomic.AtomicInteger
> might give you what you need on Java 5 and above.
I'm more interested in finding out why the performance is so bad with
multiple threads under Java 6. I don't need atomic updates; if I did, I
would certainly use AtomicInteger instead.
- Charlie
Was this work being done on Intel hardware?
I'm just wondering if this might be some issue with the cache on Intel
multi-core processors (i.e. the cache is constantly being
invalidated).
This could be tested by running the equivalent C code on Intel kit and
running the Java code on non Intel kit.
John Wilson
In both cases (my benchmarking and "Sun guys" benchmarking) it was on
x86-based hardware, but my runs were on OS X Intel (Core Duo) and theirs
were on Solaris AMD x86-64 (Opteron). I have not yet heard back whether
my fix improved thread scaling for them.
- Charlie
I think there is not enough data to see a trend. I modified your test,
made it run from 1-20 threads and for 50 loops, taking an average of the
time it took to execute all threads, and put these in a diagram. I used
a Q6600 quad-core Intel CPU with Java 1.6.0_03-b05 on Linux
2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that
the time constantly goes up until 4 threads are reached, my number of
CPUs. Using 5 threads takes less time than using 4, but after that
the time looks more or less constant.
This looks quite scalable to me.
bye blackdrag
--
Jochen "blackdrag" Theodorou
The Groovy Project Tech Lead (http://groovy.codehaus.org)
http://blackdragsview.blogspot.com/
http://www.g2one.com/
You did notice the code splits the number of iterations by the number of
threads, right? At the very least, the amount of time taken to run all
threads should go down.
Run the same test in Java 5 and you'll see that Java 6 does not scale
properly at all for this test.
- Charlie
> I'm just wondering if this might be some issue with the cache on Intel
> multi-core processors (i.e. the cache is constantly being
> invalidated).
Here's a slightly reduced example, with the most important things in
just two lines, one marked SLOW and another marked FAST.
// if ((i++ & 0xFF) == 0) firedCount++; // SLOW
// if ((i++ & 0xFF) == 0) Counter.firedCount++; // FAST
The difference in timings on my Dual-Core Core2Duo CPU on JDK 6u5 is about 2x!
And the only difference in the code is where the static counter lives:
either on the main class (then it's slow) or in a static member class
(then it's 2x faster!)
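Reconstructed, the two variants might look like the sketch below (not Vladimir's exact source, and shown single-threaded just to illustrate the shape; the 2x timing difference only shows up when two threads run these loops concurrently):

```java
// Sketch of the two variants: identical loops, differing only in
// which class the static "fired" counter lives on.
public class Layout {
    static int firedCount;          // SLOW: counter on the main class

    static class Counter {
        static int firedCount;      // FAST: counter in a member class
    }

    static int slow(int total) {
        firedCount = 0;
        for (int i = 0; i < total; i++)
            if ((i & 0xFF) == 0) firedCount++;          // SLOW
        return firedCount;
    }

    static int fast(int total) {
        Counter.firedCount = 0;
        for (int i = 0; i < total; i++)
            if ((i & 0xFF) == 0) Counter.firedCount++;  // FAST
        return Counter.firedCount;
    }
}
```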
The test was executed with 2 threads and with -server switch:
D:/re/java6/bin/java -server Trouble 2
JDK 5 and JDK 1.4 behave as expected and there is no significant difference.
Thanks,
--Vladimir
why? If I have 1 CPU and two threads, each of them running t seconds,
shouldn't the time then be 2*t? Everything over 2*t would be the thread
handling overhead. Since you let 1 thread run n iterations taking time t,
and then 2 threads with n/2 iterations... shouldn't the total time then
still be t?
If you want to tell me that running with 3 threads takes longer than
running with one thread and this should not happen, then yes, probably,
but that is not what I understand by scalability. I do understand that
if you go to big n and the total time increases much, then the
scalability is bad. If you look at 7 or 20 threads, then this does not
happen. It only happens when the number of threads is lower. So you
could say the JVM in 1.6 does not scale well for a low number of
threads, but does for a high number of threads...
> Run the same test in Java 5 and you'll see that Java 6 does not scale
> properly at all for this test.
I will later
On Wed, Apr 2, 2008 at 9:59 PM, Jochen Theodorou <blac...@gmx.org> wrote:
> why? If I have 1 CPU and two threads, each of them running t seconds,
> shouldn't the time then be 2*t? Everything over 2*t would be the thread
> handling overhead. since you let 1 Thread run n iteration taking time t
> and then 2 Threads with n/2 iterations... shouldn't the total time then
> still be t?
This particular example under discussion has a total number of
iterations, and it tries to reach that number either via a single
thread, or two threads, etc. So the amount of work is roughly the same.
What varies is the way the work is done.
The point is that with two threads on a two-CPU system, the work should
be done faster, because there are two cores running things in
parallel. But with JDK 6 this does not happen: both cores *ARE* 100%
busy, but the time it takes to reach the end (to reach the total
number of iterations limit) is longer than expected.
Thanks,
--Vladimir
> Here's a slightly reduced example, with the most important things in
> just two lines, one marked SLOW and another marked FAST.
Hotspot has optimizations specifically for code that looks like
micro-benchmarks, but there are better ways to get performance, and to
measure it.
Part of the Trouble here is that the Thread.run methods never exit.
The compiler kicks in part of the way through, and replaces the
interpreter frame with a compiled frame. (The technique is called
OSR, or on-stack replacement.) The code is only partially optimized
as a result. The difference between releases could be due to a
subtle change in tuning of OSR, and would be irrelevant except
for micro-benchmarks.
Here are some rules, in priority order, to know about when you write
micro-benchmarks:

Rule 0: Read a reputable paper on JVMs and micro-benchmarking. I suggest
Brian Goetz, 2005:
http://www-128.ibm.com/developerworks/java/library/j-jtp02225.html?ca=drs-j0805

Rule 1: Always include a warmup phase which runs your test kernel all
the way through, enough to trigger all initializations and compilations
before timing phase(s). (Fewer iterations is OK on the warmup phase. The
rule of thumb is several tens of thousands of inner loop iterations.)

Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you
can verify that the compiler and other parts of the JVM are not doing
unexpected work during your timing phase.

Rule 2.1: Print messages at the beginning and end of timing and warmup
phases, so you can verify that there is no output from Rule 2 during the
timing phase.

Rule 3: Be aware of the difference between -client and -server, and OSR
and regular compilations. -XX:+PrintCompilation reports OSR compilations
with an at-sign to denote the non-initial entry point: Trouble$1::run @ 2
(41 bytes). Prefer server to client, and regular to OSR, if you are
after best performance.

Rule 4: Be aware of initialization effects. Do not print for the first
time during your timing phase, since printing loads and initializes
classes. Do not load new classes outside of the warmup phase (or final
reporting phase), unless you are testing class loading specifically (and
in that case load only the test classes). Rule 2 is your first line of
defense against such effects.

Rule 5: Be aware of deoptimization and recompilation effects. Do not
take any code path for the first time in the timing phase, because the
compiler may junk and recompile the code, based on an earlier optimistic
assumption that the path was not going to be used at all. Rule 2 is your
first line of defense against such effects.
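As a sketch (not an official harness), Rules 1 and 2.1 might translate into a shape like this:

```java
// Minimal warmup-then-measure shape: run the kernel enough to trigger
// compilation, announce phase boundaries so any -XX:+PrintCompilation
// or -verbose:gc output can be matched against them, then time.
public class Harness {
    // Kernel under test; trivial summing keeps the example self-contained.
    static long kernel(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    static long timed(int reps, int n) {
        long start = System.nanoTime();
        long sink = 0;
        for (int r = 0; r < reps; r++) sink += kernel(n);
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink); // keep the result live
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println("warmup: begin");   // Rule 2.1 markers
        timed(20000, 1000);                    // Rule 1: tens of thousands of iterations
        System.out.println("warmup: end / timing: begin");
        long ns = timed(20000, 1000);
        System.out.println("timing: end, ns=" + ns);
    }
}
```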
Best wishes,
-- John
P.S. Here's another recent discussion about micro-benchmarks:
http://mail.openjdk.java.net/pipermail/hotspot-dev/2008-February/000254.html
Googling popped this up, which looks good:
http://www.concentric.net/~Ttwang/tech/microbench.htm
Here's a wiki about low-level information about Hotspot optimizations:
http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques
I put the rules here for easier reference:
http://wikis.sun.com/display/HotSpotInternals/MicroBenchmarks
-- John
Pardon my ignorance, but why does that explain the performance
difference? Talking with Wayne Meissner, he thought perhaps that because
the two variables were closer together, updating one caused a cache
flush of both. That sounds reasonable, I suppose...but doesn't this
still qualify as a regression, since performance is significantly worse
under Java 6?
I guess my concern is that one reason many of these new languages are
gaining traction on the JVM (especially in the case of Jython and JRuby)
is because they can be *actually* parallel. Without experimentally
finding this effect, we might have gone months or forever with poor
thread scaling, all because we used a single static variable instead of
separate thread-local variables. It's rather surprising, and makes me
worry about the rest of the codebase.
So I guess there's a rule of thumb here: don't share data across threads
unless you absolutely need to. And perhaps: don't use statics (Gilad
would be proud). But something more concrete to say why this is bad, or
an explanation of how this effect is a trade-off for something much more
important (cache locality for field reads?) would help ease the pain...
- Charlie
what do you expect? That the VM spreads out all global variables
throughout the heap so that a broken access pattern doesn't cause too
many cache flushes? You're hammering a global from multiple threads; I
don't think you can blame the JVM for that, be it for a new language
or not.
Matthias
I'm just looking for a definitive answer on why it's better this way in
Java 6. If this is a good optimization for other reasons, that's
perfectly acceptable and I'll be happy with it. At the moment, however,
I have a real-world case where it severely impacted execution
performance...so severely that I'd be very surprised if other parts of
the system weren't suffering from the same effect. And I'm sure there
are other apps out there secretly harboring similar issues.
I'm writing up a blog post on this now, so any additional information is
very welcome. And I'm not necessarily "blaming" the JVM...but slower is
slower.
- Charlie
http://headius.blogspot.com/2008/04/shared-data-considered-harmful.html
- Charlie
I'm just looking for a definitive answer on why it's better this way in
Java 6.
Btw, thanks for posting this, along with the links and references, was
very helpful.
Thanks,
--Vladimir
> Even better (if applicable) would be to have fast thread-local counters,
> with a slow background phase which occasionally tallies them into a trailing
> global counter.
That reminds me of an old question of mine. In the case where a
thread has been created from an object whose class is a subclass of
Thread, is Thread.currentThread() guaranteed to give you back the same
object? IOW, if I say:
    class MyThread extends Thread { ... }

    myThread = new MyThread();
    myThread.start();
    ...
then in the new thread, is Thread.currentThread() always == to the
original value of myThread? If so, that provides a different approach
to thread-local state, where the state is held in the instance
variables of MyThread objects, and the JVM is used to thread them
through :-) the code until the point where they're needed.
--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures
> That reminds me of an old question of mine. In the case where a
> thread has been created from an object whose class is a subclass of
> Thread, is Thread.currentThread() guaranteed to give you back the same
> object?
Yes. You can then cast it to the subclass and go on from there.
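Concretely, that approach might look like this sketch (names invented):

```java
// Per-thread state carried in a Thread subclass and recovered anywhere
// on that thread's stack via a cast of Thread.currentThread().
class MyThread extends Thread {
    int localCount;  // per-thread state, no synchronization needed

    MyThread(Runnable r) { super(r); }

    // Callable from anywhere in the thread's call chain:
    static int bump() {
        MyThread t = (MyThread) Thread.currentThread();
        return ++t.localCount;
    }

    // Demo helper: runs bump() twice on a fresh MyThread.
    static int demo() {
        final int[] out = new int[1];
        MyThread t = new MyThread(new Runnable() {
            public void run() {
                bump();
                out[0] = bump();  // second bump sees this thread's count
            }
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return out[0];
    }
}
```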
> If so, that provides a different approach
> to thread-local state, where the state is held in the instance
> variables of MyThread objects, and the JVM is used to thread them
> through :-) the code until the point where they're needed.
That's how Java thread locals are implemented in the JVM.
You can only use this approach if (a) you control the creation of the
thread and can specify your own subclass for it, or (b) you can edit
the rt.jar code for java.lang.Thread. Plan (c) is to make a
ThreadLocal.
You might be interested to know that Hotspot intrinsifies
Thread.currentThread to a couple of instructions; it's cheap. We did
that to make things like thread locals efficient.
Best,
-- John
> Yes. You can then cast it to the subclass and go on from there.
That's what I had hoped, but the JLS doesn't actually seem to guarantee this.
> > If so, that provides a different approach
> > to thread-local state, where the state is held in the instance
> > variables of MyThread objects, and the JVM is used to thread them
> > through :-) the code until the point where they're needed.
>
> That's how Java thread locals are implemented in the JVM.
I don't understand how that can be, as I can create as many
ThreadLocals as I want. They can't all fit in the Thread object,
surely. And how can references to them be fast if they have to
indirect through a Thread subclass object?
> You can only use this approach if (a) you control the creation of the
> thread and can specify your own subclass for it,
That's what I had in mind.
> You might be interested to know that Hotspot intrinsifies
> Thread.currentThread to a couple of instructions; it's cheap. We did
> that to make things like thread locals efficient.
Good.
the source for ThreadLocal explains the first bit:
    public void set(T value) {
        Thread t = Thread.currentThread();
        ThreadLocalMap map = getMap(t);
        if (map != null)
            map.set(this, value);
        else
            createMap(t, value);
    }
apparently, every thread has a map that holds all the ThreadLocals
registered on it
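A quick illustrative sketch of the consequence: each thread has its own slot in its own map, so a write on one thread is invisible to another:

```java
// Each thread gets an independent slot in its own ThreadLocalMap, so
// setting the value on the main thread does not affect a fresh thread.
public class TLDemo {
    static final ThreadLocal<Integer> slot = new ThreadLocal<Integer>() {
        protected Integer initialValue() { return 0; }
    };

    static int demo() {
        slot.set(99);                  // main thread's private copy
        final int[] seen = new int[1];
        Thread t = new Thread(new Runnable() {
            public void run() {
                seen[0] = slot.get();  // fresh thread: initialValue, not 99
            }
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return seen[0];
    }
}
```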
--
[]'s
Marcelo Takeshi Fukushima
> They can't all fit in the Thread object, surely.
Yeah, clever, and basically the same as the eventual fix I had for JRuby,
since it gives each thread its own counter. I think the primary rule
here is: don't expect unsynchronized, uncontrolled updates to the same
shared resource to either behave the way you want or perform the way
you want.
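A sketch of that per-thread-counter shape (illustrative, not the actual JRuby patch):

```java
// Per-thread context carrying its own counter: since one Context belongs
// to one thread, the hot-path check never touches shared memory.
public class Context {
    private int count = 0;  // thread-private, no contention

    // Returns true every 256th call, like the original static version,
    // but against state this thread owns exclusively.
    public boolean pollNeeded() {
        return (count++ & 0xFF) == 0;
    }
}
```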
- Charlie
I guess once you're into concurrency, everything argues for immutable
data structures.
It's also nice to see how this fits with Gilad Bracha's arguments
against the Java 'static' keyword (which allows mutable structures to
be shared implicitly between threads), as Charles already noted.
Regards,
Martin
Whatever machinery involved: if multiple actors contend for the same
memory location they will suffer. The best strategy is to distribute
jobs, work on each of them independently on local resources, be it
processor registers, thread locals, local memory or disks and to
synchronize the results with others as infrequently as possible. It's
the basic principle behind fork-join in a multi-threaded process and
map-reduce on a large cluster. Nothing new that Gilad Bracha
discovered in 2008, is it?
Matthias
Who's surprised?
> It's the basic principle behind fork-join in a multi-threaded
> process and
> map-reduce on a large cluster.
> Nothing new that Gilad Bracha discovered in 2008, is it?
I was referring to Bracha's stuff about how static fields and global
names in general harm software engineering, deployment (live code
updating), concurrency, testability, etc. I'm not sure that this
stuff is novel, and it's certainly building on the research of others.
But it's definitely non-trivial, and it goes far beyond the basic
insight that resource contention is a bottleneck in
parallel/distributed systems.
The interesting part is that it indeed comes to similar solutions
looking from a different angle and a different problem set. That's
what I wanted to point out.
Regards,
Martin
Not you in particular. I just picked your message as the last in this
thread. I was just commenting on the overall sentiment that -to me-
sounds like "wow, we may have discovered something new in concurrent
programming, and great how all these pieces (Erlang, Bracha) finally
fit together now". Gilad's compilation of arguments was certainly well
put together, but global variables, hey, ...
Whatever. I'm not trying to provoke here.
Cheers
Matthias