Performance characteristics of mutable static primitives?

Charles Oliver Nutter

unread,

Apr 2, 2008, 4:48:04 AM4/2/08

to jvm-la...@googlegroups.com

I ran into a very strange effect when some Sun folks tried to benchmark
JRuby's multi-thread scalability. In short, adding more threads actually
caused the benchmarks to take longer.

The source of the problem (at least the source that, when fixed, allowed
normal thread scaling), was an increment, mask, and test of a static int
field. The code in question looked like this:

private static int count = 0;

public void pollEvents(ThreadContext context) {
if ((count++ & 0xFF) == 0) context.poll();
}

So the basic idea was that this would call poll() every 256 hits,
incrementing a counter all the while. My first attempt to improve
performance was to comment out the body of poll() in case it was causing
a threading bottleneck (it does some locking and such), but that had no
effect. Then, as a total shot in the dark, I commented out the entire
line above. Thread scaling went to normal.

So I'm rather confused here. Is a ++ operation on a static int doing
some kind of atomic update that causes multiple threads to contend? I
never would have expected this, so I wrote up a small Java benchmark:

http://pastie.org/173993

The benchmark does basically the same thing, with a single main counter
and another "fired" counter to prevent hotspot from optimizing things
completely away. I've been running this on a dual-core MacBook Pro with
both Apple's Java 5 and the soylatte Java 6 release. The results are
very confusing:

First on Apple's Java 5

~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 3924
fired: 3906250
time: 3945
fired: 3906250
time: 1841
fired: 3906250
time: 1882
fired: 3906250
time: 1896
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 3243
fired: 4090645
time: 3245
fired: 4100505
time: 1173
fired: 3906049
time: 1233
fired: 3906188
time: 1173
fired: 3906134

Normal scaling here...1 thread on my system uses about 60-65% CPU, so
the extra thread uses up the remaining 35-40% and the numbers show it.
Then there's soylatte Java 6:

~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 1772
fired: 3906250
time: 1973
fired: 3906250
time: 2748
fired: 3906250
time: 2114
fired: 3906250
time: 2294
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 3402
fired: 3848648
time: 3805
fired: 3885471
time: 4145
fired: 3866850
time: 4140
fired: 3839130
time: 3658
fired: 3880202

Don't compare the times directly, since these are two pretty different
codebases and they each have different general performance
characteristics. Instead pay attention to the trend...the soylatte Java
6 run with two threads is significantly slower than the run with a
single thread. This mirrors the results with JRuby when there was a
single static counter being incremented.

So what's up here?

- Charlie

Patrick Wright

unread,

Apr 2, 2008, 4:59:25 AM4/2/08

to jvm-la...@googlegroups.com

Have you tried looking at the generated code from Hotspot? See e.g.
http://weblogs.java.net/blog/kohsuke/archive/2008/03/deep_dive_into.html

Patrick

John Wilson

unread,

Apr 2, 2008, 5:10:52 AM4/2/08

to jvm-la...@googlegroups.com

On 4/2/08, Charles Oliver Nutter <charles...@sun.com> wrote:
>
> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.
>
> The source of the problem (at least the source that, when fixed, allowed
> normal thread scaling), was an increment, mask, and test of a static int
> field. The code in question looked like this:
>
> private static int count = 0;
>
> public void pollEvents(ThreadContext context) {
> if ((count++ & 0xFF) == 0) context.poll();
> }
>
> So the basic idea was that this would call poll() every 256 hits,
> incrementing a counter all the while. My first attempt to improve
> performance was to comment out the body of poll() in case it was causing
> a threading bottleneck (it does some locking and such), but that had no
> effect. Then, as a total shot in the dark, I commented out the entire
> line above. Thread scaling went to normal.

That is rather odd.

Shouldn't count be volatile?

If it's declared as volatile does that make any difference?

John Wilson

Charles Oliver Nutter

unread,

Apr 2, 2008, 5:43:46 AM4/2/08

to jvm-la...@googlegroups.com

I had not tried it because I expected volatile would only make it
slower. And in this case, the code in question didn't really care about
perfect accuracy for the counter since it's just a rough guide. But
here's numbers with volatile on my machine:

Apple Java 5:

~/NetBeansProjects/jruby ➔ java -server Trouble 1

time: 9047
fired: 3906250
time: 9007
fired: 3906250
time: 9613
fired: 3906250
time: 9846
fired: 3906250
time: 10005

fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2

time: 22349
fired: 3540696
time: 27641
fired: 3341096
time: 26815
fired: 3546695
time: 26641
fired: 3542920
time: 26789
fired: 3534386

soylatte Java 6:

~/NetBeansProjects/jruby ➔ java -server Trouble 1

time: 9777
fired: 3906250
time: 9701
fired: 3906250
time: 9070
fired: 3906250
time: 8656
fired: 3906250
time: 9065

fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2

time: 24781
fired: 3464957
time: 23668
fired: 3758204
time: 21235
fired: 3783215
time: 22198
fired: 3761491
time: 22937
fired: 3752534

So as expected, volatile does slow things down a lot, but Java 6 does do
a little better here. Also interesting to see that volatile completely
obliterates any gain from running multiple threads on both Java 5 and
Java 6, and the total time is almost 3x slower than a single thread on
Java 5.

I tried another non-volatile run using i += 1 rather than i++ and the
numbers were almost identical, with Java 6 severely degrading with
multiple threads running and Java 5 improving.

Here's another set of numbers from Vladimir Sizikov, on a dual-core
windows machine running Sun Java 5 and Sun Java 6:

D:\work>D:/re/java5/bin/java -server Trouble 2
time: 1666
fired: 3345620
time: 1453
fired: 4033604
time: 629
fired: 3592687
time: 569
fired: 3595772
time: 578

D:\work>D:/re/java6/bin/java -server Trouble 2
time: 1595
fired: 3896153
time: 1588
fired: 3900934
time: 2090
fired: 3896066
time: 2133
fired: 3891300
time: 2154
fired: 3892321

Again, significantly worse performance in Java 6 with multiple threads.

- Charlie

Charles Oliver Nutter

unread,

Apr 2, 2008, 5:47:08 AM4/2/08

to jvm-la...@googlegroups.com

Patrick Wright wrote:
> Have you tried looking at the generated code from Hotspot? See e.g.
> http://weblogs.java.net/blog/kohsuke/archive/2008/03/deep_dive_into.html

No, I haven't...no access to a machine that can run debug builds at the
moment, but I may try it later if the answer to my riddle does not
present itself.

- Charlie

Attila Szegedi

unread,

Apr 2, 2008, 6:35:30 AM4/2/08

to jvm-la...@googlegroups.com

On 2008.04.02., at 10:48, Charles Oliver Nutter wrote:
>
> So I'm rather confused here. Is a ++ operation on a static int doing
> some kind of atomic update that causes multiple threads to contend?

No. You know your JVM bytecodes, Charlie - the only incrementing
bytecode in existence is IINC and it only works on an integer local
variable.

Pretty much the only sane implementation of ++ on a static field would
be:

GETSTATIC someClass.someField
ICONST_1
IADD
PUTSTATIC someClass.someField

Not even volatility of the field will ensure atomic increment
operations, as the value must be temporarily held on the thread
operand stack.

If you want atomic updates, java.util.concurrent.atomic.AtomicInteger
might give you what you need on Java 5 and above.

Attila.

John Wilson

unread,

Apr 2, 2008, 6:36:14 AM4/2/08

to jvm-la...@googlegroups.com

On 4/2/08, Charles Oliver Nutter <charles...@sun.com> wrote:
>

> John Wilson wrote:
> > That is rather odd.
> >
> > Shouldn't count be volatile?
> >
> > If it's declared as volatile does that make any difference?
>
>
> I had not tried it because I expected volatile would only make it
> slower. And in this case, the code in question didn't really care about
> perfect accuracy for the counter since it's just a rough guide. But
> here's numbers with volatile on my machine:

Interesting. I thought that the runtime system might have inferred
that count should have been volatile but you numbers show that this is
not the case.
[snip]

> I tried another non-volatile run using i += 1 rather than i++ and the
> numbers were almost identical, with Java 6 severely degrading with
> multiple threads running and Java 5 improving.
>

It would be interesting to try:

tmp = i;
tmp++;
i = tmp;

for tmp as a local variable and again for tmp as a public instance field.

(I'm sorry I can't do the tests for mystelf as I don't have access to
a muti core machine here).

John Wilson

Charles Oliver Nutter

unread,

Apr 2, 2008, 6:46:22 AM4/2/08

to jvm-la...@googlegroups.com

Attila Szegedi wrote:
> No. You know your JVM bytecodes, Charlie - the only incrementing
> bytecode in existence is IINC and it only works on an integer local
> variable.

Yes, I know that...but I hadn't dug into what code was actually being
generated for a field++. Looking now they don't appear to be any
different. Either way it doesn't give me any answers...

> Pretty much the only sane implementation of ++ on a static field would
> be:
>
> GETSTATIC someClass.someField
> ICONST_1
> IADD
> PUTSTATIC someClass.someField
>
> Not even volatility of the field will ensure atomic increment
> operations, as the value must be temporarily held on the thread
> operand stack.
>
> If you want atomic updates, java.util.concurrent.atomic.AtomicInteger
> might give you what you need on Java 5 and above.

I'm more interested in finding out why the performance is so bad with
multiple threads under Java 6. I don't need atomic updates, for which I
would certainly use AtomicInteger instead.

- Charlie

John Wilson

unread,

Apr 2, 2008, 7:01:53 AM4/2/08

to jvm-la...@googlegroups.com

On 4/2/08, Charles Oliver Nutter <charles...@sun.com> wrote:
>

> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.

Was this work being done on Intel hardware?

I'm just wondering if this might be some issue with the cache on Intel
mutli core processors (i.e. the cache is constantly being
invalidated).

This could be tested by running the equivalent C code on Intel kit and
running the Java code on non Intel kit.

John Wilson

Charles Oliver Nutter

unread,

Apr 2, 2008, 7:10:09 AM4/2/08

to jvm-la...@googlegroups.com

In both cases (my benchmarking and "Sun guys" benchmarking) it was on
x86-based hardware, but my runs were on OS X Intel (Core Duo) and theirs
were on Solaris AMD x86-64 (Opteron). I have not yet heard back whether
my fix improved thread scaling for them.

- Charlie

Jochen Theodorou

unread,

Apr 2, 2008, 11:27:44 AM4/2/08

to jvm-la...@googlegroups.com

Charles Oliver Nutter schrieb:

> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.

[...]

> Instead pay attention to the trend...the soylatte Java
> 6 run with two threads is significantly slower than the run with a
> single thread. This mirrors the results with JRuby when there was a
> single static counter being incremented.

I think there is not enough data to see a trend. I modified your test,
made it run from 1-20 threads and for 50 loops, making an average time
containing the time it took to execute all threads and put these in a
diagram. I used a Q6600 Quadcore intel CPU with java 1.6.0_03-b05 on
Linux 2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that
the time constantly goes up until 4 Threads are reached, my number of
CPUs. Using 5 threads is takes less time than using 4, but after that
the time looks more or less constant.

This looks quite scalable to me.

bye blackdrag

--
Jochen "blackdrag" Theodorou
The Groovy Project Tech Lead (http://groovy.codehaus.org)
http://blackdragsview.blogspot.com/
http://www.g2one.com/

Charles Oliver Nutter

unread,

Apr 2, 2008, 11:51:30 AM4/2/08

to jvm-la...@googlegroups.com

Jochen Theodorou wrote:
> I think there is not enough data to see a trend. I modified your test,
> made it run from 1-20 threads and for 50 loops, making an average time
> containing the time it took to execute all threads and put these in a
> diagram. I used a Q6600 Quadcore intel CPU with java 1.6.0_03-b05 on
> Linux 2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that
> the time constantly goes up until 4 Threads are reached, my number of
> CPUs. Using 5 threads is takes less time than using 4, but after that
> the time looks more or less constant.
>
> This looks quite scalable to me.

You did notice the code splits the number of iterations by the number of
threads, right? At the very least, the amount of time taken to run all
threads should go down.

Run the same test in Java 5 and you'll see that Java 6 does not scale
properly at all for this test.

- Charlie

John Rose

unread,

Apr 2, 2008, 3:43:03 PM4/2/08

to jvm-la...@googlegroups.com

On Apr 2, 2008, at 4:01 AM, John Wilson wrote:

I'm just wondering if this might be some issue with the cache on Intel

mutli core processors (i.e. the cache is constantly being

invalidated).

Yes, that code will in general cause cache ping-ponging,

which in the long run will harm sharing and hence scalability.

On a single socket system, the sharing ought to be

pretty good, though. And that doesn't explain the difference

between 5 and 6.

This looks like a question for the hotspot-runtime-dev list.

http://mail.openjdk.java.net/mailman/listinfo/hotspot-runtime-dev

-- John

P.S. Today I'm pushing the disassembler plugin for Hotspot 13,

to come out with the first flood of pending OpenJDK changes, RSN.

It will support disassembly of code even on product versions of Hotspot,

making it *much* easier to discuss issues like this.

Webrev (slightly out-of-date) is here:

http://webrev.invokedynamic.info/jrose/6667042/

The only current implementation of the disassembler

uses Gnu binutils. We don't link Gnu code into Hotspot,

but this isn't a problem, since the disassembler is loaded

into the JVM as a plugin. The downside is you'll have

to build the disassembler plugin yourself. There are

build instructions (using Gnu binutils) in the Hotspot sources.

Vladimir Sizikov

unread,

Apr 2, 2008, 3:55:24 PM4/2/08

to jvm-la...@googlegroups.com

Hi,

Here's a slightly reduced example, with most important tings in just
two lines, one marked SLOW and another marked FAST.

http://pastie.org/174059

// if ((i++ & 0xFF) == 0) firedCount++; // SLOW
// if ((i++ & 0xFF) == 0) Counter.firedCount++; // FAST

The difference in timings on my Dual-Core Core2Duo CPU on JDK 6u5 is about 2x!
And the only difference in code is where the static counter lives:
either on main class (then it slow) or in static member class (then
it's 2x faster!)

The test was executed with 2 threads and with -server switch:

D:/re/java6/bin/java -server Trouble 2

JDK 5 and JDK 1.4 behave as expected and there is no significant difference.

Thanks,
--Vladimir

Jochen Theodorou

unread,

Apr 2, 2008, 3:59:02 PM4/2/08

to jvm-la...@googlegroups.com

Charles Oliver Nutter schrieb:

> Jochen Theodorou wrote:
>> I think there is not enough data to see a trend. I modified your test,
>> made it run from 1-20 threads and for 50 loops, making an average time
>> containing the time it took to execute all threads and put these in a
>> diagram. I used a Q6600 Quadcore intel CPU with java 1.6.0_03-b05 on
>> Linux 2.6.22-14-generic #1 SMP x86_64 GNU/Linux. What I can see is that
>> the time constantly goes up until 4 Threads are reached, my number of
>> CPUs. Using 5 threads is takes less time than using 4, but after that
>> the time looks more or less constant.
>>
>> This looks quite scalable to me.
>
> You did notice the code splits the number of iterations by the number of
> threads, right? At the very least, the amount of time taken to run all
> threads should go down.

why? If I have 1 CPU and two threads, each of them running t seconds,
shouldn't the time then be 2*t? Everything over 2*t would be the thread
handling overhead. since you let 1 Thread run n iteration taking time t
and then 2 Threads with n/2 iterations... shouldn't the total time then
still be t?

If you want to tell me that running with 3 threads takes longer than
running with one thread and this should not happen, then yes, probably,
but that is not what I understand from scalability. I do understand that
if you go to big n and the total time increases much, then the
scalability is bad. If you look at 7 or 20 Threads, then this does not
happen. Only when the number of threads is lower this happens. So you
could say the JVM in 1.6 does not scale well for a lower number of
Threads, but does for a high number of threads...

> Run the same test in Java 5 and you'll see that Java 6 does not scale
> properly at all for this test.

I will later

Vladimir Sizikov

unread,

Apr 2, 2008, 4:05:14 PM4/2/08

to jvm-la...@googlegroups.com

Hi Jochen,

On Wed, Apr 2, 2008 at 9:59 PM, Jochen Theodorou <blac...@gmx.org> wrote:
> why? If I have 1 CPU and two threads, each of them running t seconds,
> shouldn't the time then be 2*t? Everything over 2*t would be the thread
> handling overhead. since you let 1 Thread run n iteration taking time t
> and then 2 Threads with n/2 iterations... shouldn't the total time then
> still be t?

This particular example under discussion has a total number of
iterations, and it tries to reach that number either via single
thread, or two threads, etc. So, amount of work is roughly the same.
What varies is the way the work done.

The point is that with two treads on two-CPU system, the work should
be done faster, because there are two cores running things in
parallel. But with JDK6 this does not happen, both cores *ARE* 100%
busy, but the time it takes to reach the end (to reach the total
number of iterations limit) is longer than expected.

Thanks,
--Vladimir

John Rose

unread,

Apr 2, 2008, 5:55:35 PM4/2/08

to jvm-la...@googlegroups.com

On Apr 2, 2008, at 12:55 PM, Vladimir Sizikov wrote:

> Here's a slightly reduced example, with most important tings in just
> two lines, one marked SLOW and another marked FAST.

Hotspot has optimizations specifically for codes that look like micro-
benchmarks,
but there are better ways to get performance, and to measure it.

Part of the Trouble here is that the Thread.run methods never exit.
The compiler kicks in part of the way through, and replaces the
interpreter frame with a compiled frame. (The technique is called
OSR, or on-stack replacement.) The code is only partially optimized
as a result. The difference between releases could be due to a
subtle change in tuning of OSR, and would be irrelevant except
for micro-benchmarks.

Here are some rules, in priority order, to know about when you write
micro-benchmarks:

Rule 0: Read a reputable paper on JVMs and micro-benchmarking.
I suggest Brian Goetz, 2005:
http://www-128.ibm.com/developerworks/java/library/j-jtp02225.html?
ca=drs-j0805 .

Rule 1: Always include a warmup phase which runs your test kernel
all the way through,
enough to trigger all initializations and compilations before timing
phase(s).
(Fewer iterations is OK on the warmup phase. The rule of thumb is
several tens
of thousands of inner loop iterations.)

Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so
you can verify
that the compiler and other parts of the JVM are not doing unexpected
work during
your timing phase.

Rule 2.1: Print messages at the beginning and end of timing and
warmup phases,
so you can verify that there is no output from Rule 2 during the
timing phase.

Rule 3: Be aware of the difference between -client and -server, and
OSR and
regular compilations. -XX:+PrintCompilation reports OSR compilations
with
an at-sign to denote the non-initial entry point: Trouble$1::run @ 2
(41 bytes).
Prefer server to client, and regular to OSR, if you are after best
performance.

Rule 4: Be aware of initialization effects. Do not print for the
first time during
your timing phase, since printing loads and initializes classes. Do
not load
new classes outside the of the warmup phase (or final reporting phase),
unless you are testing class loading specifically (and in that case load
only the test classes). Rule 2 is your first line of defense against
such effect.

Rule 5: Be aware of deoptimization and recompilation effects.
Do not take any code path for the first time in the timing phase,
because the compiler may junk and recompile the code,
based on an earlier optimistic assumption that the path was
not going to be used at all. Rule 2 is your first line of defense
against such effects.

Best wishes,
-- John

P.S. Here's another recent discussion about micro-benchmarks:
http://mail.openjdk.java.net/pipermail/hotspot-dev/2008-February/
000254.html

Googling popped this up, which looks good:
http://www.concentric.net/~Ttwang/tech/microbench.htm

Here's a wiki about low-level information about Hotspot optimizations:
http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques

John Rose

unread,

Apr 2, 2008, 6:31:47 PM4/2/08

to jvm-la...@googlegroups.com

My apologies for the atrocious formatting of the last message.

I put the rules here for easier reference:
http://wikis.sun.com/display/HotSpotInternals/MicroBenchmarks

-- John

John Rose

unread,

Apr 2, 2008, 7:38:29 PM4/2/08

to JVM Languages

Enclosed are the disassembly of the two loop kernels, from the Java 5 and Java 6 32-bit. The default JIT (on the machine I tested) was the server JIT. The loops look essentially the same to me on both releases. I also included the client version; those kernels look the same too.

One difference is the effective addresses of the two shared variables happen to fall in distinct quadwords on Java 5 and in the same quadword on Java 6:

0x082025b4 &i in Java 5 0x082025b8 &firedCount in Java 5 0xb61dcae8 &i in Java 6 0xb61dcaec &firedCount in Java 6

This may be the best way to explain the performance difference, especially since moving firedCount into a separate class seems to erase the effect.

Welcome to the multicore world! Memory is not flat like Kansas anymore.

-- John

java6-trouble.kernel.txt

java5-trouble.kernel.txt

Charles Oliver Nutter

unread,

Apr 3, 2008, 4:18:01 AM4/3/08

to jvm-la...@googlegroups.com

Pardon my ignorance, but why does that explain the performance
difference? Talking with Wayne Meissner he thought perhaps because the
two variables were closer together, updating one caused a cache flush of
both. That sounds reasonable, I suppose...but doesn't this still qualify
as a regression since performance is significantly worse under Java 6?

I guess my concern is that one reason many of these new languages are
gaining traction on the JVM (especially in the case of Jython and JRuby)
is because they can be *actually* parallel. Without experimentally
finding this effect, we might have gone months or forever with poor
thread scaling, all because we used a single static variable instead of
separate thread-local variables. It's rather surprising, and makes me
worry about the rest of the codebase.

So I guess there's a rule of thumb here: don't share data across threads
unless you absolutely need to. And perhaps: don't use statics (Gilad
would be proud). But something more concrete to say why this is bad, or
an explanation of how this effect is a trade-off for something much more
important (cache locality for field reads?) would help ease the pain...

- Charlie

Matthias Ernst

unread,

Apr 3, 2008, 5:11:01 AM4/3/08

to jvm-la...@googlegroups.com

Charles,

what do you expect? That the VM spread out all global variables
throughout the heap so that any broken access pattern cause not too
many cache flushes? You're hammering a global from multiple threads, I
don't think you can blame the JVM for that, be it for a new language
or not.

Matthias

Charles Oliver Nutter

unread,

Apr 3, 2008, 5:20:20 AM4/3/08

to jvm-la...@googlegroups.com

Matthias Ernst wrote:
> Charles,
>
> what do you expect? That the VM spread out all global variables
> throughout the heap so that any broken access pattern cause not too
> many cache flushes? You're hammering a global from multiple threads, I
> don't think you can blame the JVM for that, be it for a new language
> or not.

I'm just looking for a definitive answer on why it's better this way in
Java 6. If this is a good optimization for other reasons, that's
perfectly acceptable and I'll be happy with it. At the moment, however,
I have a real-world case where it severely impacted execution
performance...so severely that I'd be very surprised if other parts of
the system weren't suffering from the same effect. And I'm sure there
are other apps out there secretly harboring similar issues.

I'm writing up a blog post on this now, so any additional information is
very welcome. And I'm not necessarily "blaming" the JVM...but slower is
slower.

- Charlie

Charles Oliver Nutter

unread,

Apr 3, 2008, 5:43:34 AM4/3/08

to jvm-la...@googlegroups.com

Charles Oliver Nutter wrote:
> I'm writing up a blog post on this now, so any additional information is
> very welcome. And I'm not necessarily "blaming" the JVM...but slower is
> slower.

http://headius.blogspot.com/2008/04/shared-data-considered-harmful.html

- Charlie

John Rose

unread,

Apr 3, 2008, 1:43:17 PM4/3/08

to jvm-la...@googlegroups.com

On Apr 3, 2008, at 2:20 AM, Charles Oliver Nutter wrote:

I'm just looking for a definitive answer on why it's better this way in

Java 6.

This particular micro-benchmark happens to be slower in Java 6 than in Java 5, but if the layouts were changed by four or twelve bytes the opposite might be true, and we'd have a "win" for Java 6. (If my previous analysis is correct, which I'm going to assume in the rest of this note.) Adding an odd number of four-byte static fields before the two in the micro-benchmark might make Java 5 faster than Java 6. So this is not a question of quality of implementation that distinguishes 5 from 6. It is indeterminate behavior that comes from questionable coding (unsynch. non-volatile static writes).

To me as a VM guy, it's a case of:

P. Doctor, it hurts when I do this.

D. Then don't do that.

The JVM could notice that there is lots of cache ping-ponging on a pair of separable variables, and reallocate them onto separate lines. We'll probably eventually do stuff like that, given the new realities about memory architecture. But at present this is mainly a research topic that you read about in conference papers.

I agree there are sometimes reasons, especially for language implementors, to bend the rules, and do things like increment unlocked counters. (We do it in the JVM all the time, at the C++ level.) But if you are going to bend the rules, you can't expect the JVM to be ready to know what you mean. You might have to help it accept the weird code. A good help in this case would be to make sure the bits of mutable global state are isolated in a corner somewhere.

Even better (if applicable) would be to have fast thread-local counters, with a slow background phase which occasionally tallies them into a trailing global counter. Use compare-and-swap to collect the fast-moving data into the slow-moving data. This would be faster and more accurate, at the cost of complexity and extra per-thread storage. It reflects the reality of non-uniform memory. (This verges on the old fetch-and-add instruction for massively parallel machines like the Connection Machine.)

-- John

Vladimir Sizikov

unread,

Apr 3, 2008, 1:57:40 PM4/3/08

to jvm-la...@googlegroups.com

Hi John,

Btw, thanks for posting this, along with the links and references, was
very helpful.

Thanks,
--Vladimir

John Cowan

unread,

Apr 3, 2008, 3:31:56 PM4/3/08

to jvm-la...@googlegroups.com

On Thu, Apr 3, 2008 at 1:43 PM, John Rose <John...@sun.com> wrote:

> Even better (if applicable) would be to have fast thread-local counters,
> with a slow background phase which occasionally tallies them into a trailing
> global counter.

That reminds me of an old question of mine. In the case where a
thread has been created from an object whose class is a subclass of
Thread, is Thread.currentThread() guaranteed to give you back the same
object? IOW, if I say:

class MyThread extends Thread { ... }
myThread = new MyThread();
myThread.start();
...

then in the new thread, is Thread.currentThread() always == to the
original value of myThread? If so, that provides a different approach
to thread-local state, where the state is held in the instance
variables of MyThread objects, and the JVM is used to thread them
through :-) the code until the point where they're needed.

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures

John Rose

unread,

Apr 3, 2008, 7:01:54 PM4/3/08

to jvm-la...@googlegroups.com

On Apr 3, 2008, at 12:31 PM, John Cowan wrote:

> That reminds me of an old question of mine. In the case where a
> thread has been created from an object whose class is a subclass of
> Thread, is Thread.currentThread() guaranteed to give you back the same
> object?

Yes. You can then cast it to the subclass and go on from there.

> If so, that provides a different approach
> to thread-local state, where the state is held in the instance
> variables of MyThread objects, and the JVM is used to thread them
> through :-) the code until the point where they're needed.

That's how Java thread locals are implemented in the JVM.

You can only use this approach if (a) you control the creation of the
thread and can specify your own subclass for it, or (b) you can edit
the rt.jar code for java.lang.Thread. Plan (c) is to make a
ThreadLocal.

You might be interested to know that Hotspot intrinsifies
Thread.currentThread to a couple of instructions; it's cheap. We did
that to make things like thread locals efficient.

Best,
-- John

John Cowan

unread,

Apr 3, 2008, 7:26:21 PM4/3/08

to jvm-la...@googlegroups.com

On Thu, Apr 3, 2008 at 7:01 PM, John Rose <John...@sun.com> wrote:

> Yes. You can then cast it to the subclass and go on from there.

That's what I had hoped, but the JLS doesn't actually seem to guarantee this.

> > If so, that provides a different approach
> > to thread-local state, where the state is held in the instance
> > variables of MyThread objects, and the JVM is used to thread them
> > through :-) the code until the point where they're needed.
>
> That's how Java thread locals are implemented in the JVM.

I don't understand how that can be, as I can create as many
ThreadLocals as I want. They can't all fit in the Thread object,
surely. And how can reference to them be fast if they have to
indirect through a Thread subclass object?

> You can only use this approach if (a) you control the creation of the
> thread and can specify your own subclass for it,

That's what I had in mind.

> You might be interested to know that Hotspot intrinsifies
> Thread.currentThread to a couple of instructions; it's cheap. We did
> that to make things like thread locals efficient.

Good.

Marcelo Fukushima

unread,

Apr 3, 2008, 7:37:06 PM4/3/08

to jvm-la...@googlegroups.com

On 4/3/08, John Cowan <johnw...@gmail.com> wrote:
>
> On Thu, Apr 3, 2008 at 7:01 PM, John Rose <John...@sun.com> wrote:
>
> > Yes. You can then cast it to the subclass and go on from there.
>
>
> That's what I had hoped, but the JLS doesn't actually seem to guarantee this.
>
>
> > > If so, that provides a different approach
> > > to thread-local state, where the state is held in the instance
> > > variables of MyThread objects, and the JVM is used to thread them
> > > through :-) the code until the point where they're needed.
> >
> > That's how Java thread locals are implemented in the JVM.
>
>
> I don't understand how that can be, as I can create as many
> ThreadLocals as I want. They can't all fit in the Thread object,
> surely. And how can reference to them be fast if they have to
> indirect through a Thread subclass object?

the source for ThreadLocal explains the first bit:

public void set(T value) {
Thread t = Thread.currentThread();
ThreadLocalMap map = getMap(t);
if (map != null)
map.set(this, value);
else
createMap(t, value);
}

apparently, every thread has a map that holds all ThreadLocal's registered

>
>
> > You can only use this approach if (a) you control the creation of the
> > thread and can specify your own subclass for it,
>
>
> That's what I had in mind.
>
>
> > You might be interested to know that Hotspot intrinsifies
> > Thread.currentThread to a couple of instructions; it's cheap. We did
> > that to make things like thread locals efficient.
>
>
> Good.
>
>
> --
> GMail doesn't have rotating .sigs, but you can see mine at
> http://www.ccil.org/~cowan/signatures
>
> >
>

--
[]'s
Marcelo Takeshi Fukushima

John Rose

unread,

Apr 3, 2008, 7:39:10 PM4/3/08

to jvm-la...@googlegroups.com

On Apr 3, 2008, at 4:26 PM, John Cowan wrote:

They can't all fit in the Thread object, surely.

That's what collections are for.

http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/75fca0b0ab83/src/share/classes/java/lang/Thread.java

class Thread { ...

/* ThreadLocal values pertaining to this thread. This map is maintained

* by the ThreadLocal class. */

ThreadLocal.ThreadLocalMap threadLocals;

... }

-- John

Rémi Forax

unread,

Apr 4, 2008, 4:46:07 AM4/4/08

to jvm-la...@googlegroups.com

John Cowan a écrit :

> On Thu, Apr 3, 2008 at 1:43 PM, John Rose <John...@sun.com> wrote:
>
>
>> Even better (if applicable) would be to have fast thread-local counters,
>> with a slow background phase which occasionally tallies them into a trailing
>> global counter.
>>
>
> That reminds me of an old question of mine. In the case where a
> thread has been created from an object whose class is a subclass of
> Thread, is Thread.currentThread() guaranteed to give you back the same
> object? IOW, if I say:
>
> class MyThread extends Thread { ... }
> myThread = new MyThread();
> myThread.start();
> ...
>
> then in the new thread, is Thread.currentThread() always == to the
> original value of myThread?

yes

> If so, that provides a different approach
> to thread-local state, where the state is held in the instance
> variables of MyThread objects,

yes, ThreadLocal is currently implemented has a field of
java.lang.Thread storing a map,
so there is no synchronization.

> and the JVM is used to thread them
> through :-) the code until the point where they're needed.
>
>

cheers,
Rémi

hlovatt

unread,

Apr 7, 2008, 6:02:48 AM4/7/08

to JVM Languages

This is a very interesting discussion and reveals a lot about the
difficulties of programming for multiple cores.

I was trying to understand what was going on and was 'messing about'
with the code and I noticed that almost any change I made slowed the
code down considerably - more than the 1 or 2 thread options in Java 6
does. Therefore I am not sure how realistic the code is. If the code
did more than simply increment a variable or two than the problem
might go away (because contention would be less and because the
uncontested operations would be a bigger percentage).

Going back to the original code. If it is OK to miss a few increments
(hence non-volitile statics) then the following should be OK:

threads[ j ] =
new Thread() {
public void run() {
final int total = totalSize / threadCount;
for ( int k = 0; k < total; k++ ) {
if ( ( k & 0xFF ) == 0 ) { // New
i += 256; // New
firedCount++; // New
} // New
// if ( ( i++ & 0xFF ) == 0 ) { firedCount++; } //
Original
}
}
};

The above version doesn't show the problem on Java 6 and is quicker
than the original.

On Apr 2, 6:48 pm, Charles Oliver Nutter <charles.nut...@sun.com>
wrote:

> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.
>

> The source of the problem (at least the source that, when fixed, allowed
> normal thread scaling), was an increment, mask, and test of a static int
> field. The code in question looked like this:
>
> private static int count = 0;
>
> public void pollEvents(ThreadContext context) {
> if ((count++ & 0xFF) == 0) context.poll();
>
> }
>
> So the basic idea was that this would call poll() every 256 hits,
> incrementing a counter all the while. My first attempt to improve
> performance was to comment out the body of poll() in case it was causing
> a threading bottleneck (it does some locking and such), but that had no
> effect. Then, as a total shot in the dark, I commented out the entire
> line above. Thread scaling went to normal.
>
> So I'm rather confused here. Is a ++ operation on a static int doing
> some kind of atomic update that causes multiple threads to contend? I
> never would have expected this, so I wrote up a small Java benchmark:
>
> http://pastie.org/173993
>
> The benchmark does basically the same thing, with a single main counter
> and another "fired" counter to prevent hotspot from optimizing things
> completely away. I've been running this on a dual-core MacBook Pro with
> both Apple's Java 5 and the soylatte Java 6 release. The results are
> very confusing:
>
> First on Apple's Java 5
>
> ~/NetBeansProjects/jruby ➔ java -server Trouble 1
> time: 3924
> fired: 3906250
> time: 3945
> fired: 3906250
> time: 1841
> fired: 3906250
> time: 1882
> fired: 3906250
> time: 1896
> fired: 3906250
> ~/NetBeansProjects/jruby ➔ java -server Trouble 2
> time: 3243
> fired: 4090645
> time: 3245
> fired: 4100505
> time: 1173
> fired: 3906049
> time: 1233
> fired: 3906188
> time: 1173
> fired: 3906134
>
> Normal scaling here...1 thread on my system uses about 60-65% CPU, so
> the extra thread uses up the remaining 35-40% and the numbers show it.
> Then there's soylatte Java 6:
>
> ~/NetBeansProjects/jruby ➔ java -server Trouble 1
> time: 1772
> fired: 3906250
> time: 1973
> fired: 3906250
> time: 2748
> fired: 3906250
> time: 2114
> fired: 3906250
> time: 2294
> fired: 3906250
> ~/NetBeansProjects/jruby ➔ java -server Trouble 2
> time: 3402
> fired: 3848648
> time: 3805
> fired: 3885471
> time: 4145
> fired: 3866850
> time: 4140
> fired: 3839130
> time: 3658
> fired: 3880202
>
> Don't compare the times directly, since these are two pretty different
> codebases and they each have different general performance
> characteristics. Instead pay attention to the trend...the soylatte Java

> 6 run with two threads is significantly slower than the run with a
> single thread. This mirrors the results with JRuby when there was a
> single static counter being incremented.
>

> So what's up here?
>
> - Charlie

hlovatt

unread,

Apr 7, 2008, 6:03:56 AM4/7/08

to JVM Languages

Charles Oliver Nutter

unread,

Apr 7, 2008, 11:17:51 AM4/7/08

to jvm-la...@googlegroups.com

hlovatt wrote:
> This is a very interesting discussion and reveals a lot about the
> difficulties of programming for multiple cores.
>
> I was trying to understand what was going on and was 'messing about'
> with the code and I noticed that almost any change I made slowed the
> code down considerably - more than the 1 or 2 thread options in Java 6
> does. Therefore I am not sure how realistic the code is. If the code
> did more than simply increment a variable or two than the problem
> might go away (because contention would be less and because the
> uncontested operations would be a bigger percentage).
>
> Going back to the original code. If it is OK to miss a few increments
> (hence non-volitile statics) then the following should be OK:

Yeah, clever, and basically the same as the eventual fix I had for JRuby
since it gives each thread its own counter. I think the primary rule
here is that don't expect any sort of unsynchronized, uncontrolled
updates to the same shared resource to either behave the way you want or
perform the way you want.

- Charlie

David Pollak

unread,

Apr 8, 2008, 12:55:01 PM4/8/08

to jvm-la...@googlegroups.com

It may be a bit off-topic, but it's interesting how this argues for immutable data structures and Erlang's concurrency model.

- Charlie

--
Scala lift off unconference, May 10th 2008, San Francisco http://scalaliftoff.com
lift, the secure, simple, powerful web framework http://liftweb.net
Collaborative Task Management http://much4.us

Martin Probst

unread,

Apr 8, 2008, 1:25:53 PM4/8/08

to jvm-la...@googlegroups.com

>> Yeah, clever, and basically the same as the eventual fix I had for
>> JRuby
>> since it gives each thread its own counter. I think the primary rule
>> here is that don't expect any sort of unsynchronized, uncontrolled
>> updates to the same shared resource to either behave the way you
>> want or
>> perform the way you want.
>
> It may be a bit off-topic, but it's interesting how this argues for
> immutable data structures and Erlang's concurrency model.

I guess once you're into concurrency, everything argues for immutable
data structures.

It's also nice to see how this fits with Gilad Brachas arguments
against the Java 'static' keyword (which allows mutable structures to
be shared implicitly between threads), as Charles already noted.

Regards,
Martin

Matthias Ernst

unread,

Apr 8, 2008, 1:55:04 PM4/8/08

to jvm-la...@googlegroups.com

I'm surprised you're surprised.

Whatever machinery involved: if multiple actors contend for the same
memory location they will suffer. The best strategy is to distribute
jobs, work on each of them independently on local resources, be it
processor registers, thread locals, local memory or disks and to
synchronize the results with others as infrequently as possible. It's
the basic principle behind fork-join in a multi-threaded process and
map-reduce on a large cluster. Nothing new that Gilad Bracha
discovered in 2008, is it?

Matthias

Martin Probst

unread,

Apr 8, 2008, 4:14:35 PM4/8/08

to jvm-la...@googlegroups.com

> I'm surprised you're surprised.

Who's surprised?

> It's the basic principle behind fork-join in a multi-threaded
> process and
> map-reduce on a large cluster.
> Nothing new that Gilad Bracha discovered in 2008, is it?

I was referring to Bracha's stuff about how static fields and global
names in general harm software engineering, deployment (live code
updating), concurrency, testability, etc. pp. I'm not sure that this
stuff is novel, and it's certainly building on research of others. But
it's definitely non-trivial, and it goes far beyond the basic insight
that resource contention is a bottleneck in parallel/distributed
systems.

The interesting part is that it indeed comes to similar solutions
looking from a different angle and a different problem set. That's
what I wanted to point out.

Regards,
Martin

Matthias Ernst

unread,

Apr 8, 2008, 4:32:10 PM4/8/08

to jvm-la...@googlegroups.com

On Tue, Apr 8, 2008 at 10:14 PM, Martin Probst <ma...@martin-probst.com> wrote:
>
> > I'm surprised you're surprised.
>
> Who's surprised?

Not you in particular. I just picked your message as the last in this
thread. I was just commenting on the overall sentiment that -to me-
sounds like "wow we may have discovered something new in concurrent
programming and great how all these pieces (Erlang, Bracha) finally
fit together now". Gilad's compilation was certainly well compiled but
global variables, hey, ...

Whatever. I'm not trying to provoke here.

Cheers
Matthias

John Rose

unread,

Apr 8, 2008, 6:05:14 PM4/8/08

to jvm-la...@googlegroups.com

On Apr 8, 2008, at 10:25 AM, Martin Probst wrote:

I guess once you're into concurrency, everything argues for immutable

data structures.

Some ad lib observations on memory sharing, concurrency, and the JVM:

- Probably the hardest part about updating the Java memory model was making sure the rules for immutable objects (final fields) would work properly without synchronization. It was necessary, because Java needs to work well on multiprocessors.

- Hotspot put in thread-local allocation buffers a decade ago, again to avoid avoid contention by memory partitioning. We weren't quite as painfully cache conscious back then, but cache locality was also a consideration. (Still, somebody needs to experiment more with cache-conscious coloring for TLABs. Sounds like a masters' project to me.)

- An immutable object is not immutable at first. The trick is that the mutable part of its lifecycle is private, and so the system can cleverly protect it from contention, e.g., using TLABs. (Occurs to me that "larvatus" = "hidden"; an immutable object has a larval stage during which it is constructed.) This can be generalized to other controlled mutation phases; that's what GC safepoints do, for example: Allow immutable objects to be temporarily mutated. There might be an interesting "phased immutability" software abstraction to be investigated along these lines.

- The JVM needs to know about immutability in order to optimize it properly. However, there are interesting (semi-)global analyses which can help it discover immutability even when it is not clearly marked in the source program. (Optimistic ones are the coolest, and require a framework for dependency checking and deoptimization to recover. Hotspot is good at that, and can get better. I think there are PhD projects there.)

- In modern memory systems, main memory is usually many tens of cycles away from each CPU. Caches try to hide this. As a corollary, there are at least two protocols in the hidden network that connects memories, caches, and CPUs, and they differ greatly in intrinsic performance characteristics. In essence, one protocol is (often) called RTS (read to share) and one is RTO (read to own). RTO means your CPU & cache have to handshake with the rest of the world to make sure your have an exclusive right to a block of memory (cache line), before your can either read or write it. Your program incurs this cost because it intends to write the memory. But you want to use RTS when you can, because any number of CPUs can RTS the same block of memory (cache line) and happily work on the same data, as long as it remains immutable. (I'm tempted to think RTS:RTO :: UDP:TCP; there are some basic similarities I suppose.) In any case, when memory is shared, immutable memory is faster than mutable, and the JVM works hard to exploit the difference.

Best,

-- John

beelzabub

unread,

Apr 15, 2008, 1:04:24 PM4/15/08

to JVM Languages

What you should do is run the thing through a profiler. This will
tell you how many times poll is being called. My guess is that
perhaps the thread contention is prolonging the amount of time
necessary to increment count - Threads 1-3 all load count into memory
(i.e 0), all add 1 to it, and then write it back out to memory, so
after 3 threads have "written" to count, it only contains 1. This
seems to be the case - tested with linux jdk1.6, intel Q6600 (quad-
core):

$ java -server Trouble 1
time: 1995
fired: 3906250
time: 1966
fired: 3906250
time: 930
fired: 3906250
time: 916
fired: 3906250
time: 894
fired: 3906250
$ java -server Trouble 2
time: 734
fired: 3654450
time: 769
fired: 3802532
time: 821
fired: 3755766
time: 829
fired: 3775890
time: 800
fired: 3779483
$ java -server Trouble 3
time: 1836
fired: 3934611
time: 4518
fired: 3566843
time: 3102
fired: 3618666
time: 2886
fired: 3648195
time: 3844
fired: 3667866

On a side note, it's way faster to just optimize away the algorithm
(at least in your benchmark):
int numCalls = total / 0xFF;
i += total;
for (int k = 0; k < numCalls; ++k) doNothing();

Here's the same way, using AtomicIntegers:
int tmp = i.getAndAdd(total);
for (int k = 0; k < numCalls; ++k)
{
if ((k & 0xFF) == 0) doNothing();
}

However, none of this is related to the actual code, just the
benchmark. The way to optimize the actual code would be to push count
into the ThreadContext & instead of doing a binary and, simply reset
the counter & do a comparison of 256 / numThreads. Or you can keep
that code if you want every thread to poll once every 256 calls rather
than a thread getting polled once every 256 calls.

Other ways to speed it up: Don't count & simply call pollEvents with
longer intervals in-between & pollEvents then simply calls doNothing
directly.

Summary: The problem is that multiple threads are overwriting the
global variable, so that it takes longer to actually increment it by
1.

Think of it this way:
count = 1
Threads A, B, C are running.
The CPU loads count (with a value of 1) into the register.
The CPU increments the values in its registers to 2
The CPU then writes out the values of each of the registers into
count. So after all 3 threads increment count, the value is still 1.
Add in the overhead of switching between threads, and you can see why
you get a slowdown. Adding in synchronization will help, provided you
do it intelligently - too often, and synchronization penalty takes
over.

On Apr 2, 1:48 am, Charles Oliver Nutter <charles.nut...@sun.com>

Reply all

Reply to author

Forward