On Thu, Apr 3, 2008 at 1:43 PM, John Rose <John.R...@sun.com> wrote:
> Even better (if applicable) would be to have fast thread-local counters,
> with a slow background phase which occasionally tallies them into a trailing
> global counter.
That reminds me of an old question of mine. In the case where a thread has been created from an object whose class is a subclass of Thread, is Thread.currentThread() guaranteed to give you back the same object? IOW, if I say:
class MyThread extends Thread { ... }
myThread = new MyThread();
myThread.start();
...
then in the new thread, is Thread.currentThread() always == to the original value of myThread? If so, that provides a different approach to thread-local state, where the state is held in the instance variables of MyThread objects, and the JVM is used to thread them through :-) the code until the point where they're needed.
> That reminds me of an old question of mine. In the case where a > thread has been created from an object whose class is a subclass of > Thread, is Thread.currentThread() guaranteed to give you back the same > object?
Yes. You can then cast it to the subclass and go on from there.
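A minimal sketch of that cast, assuming a hypothetical `CounterThread` subclass with a `hits` field and a `recordHit()` helper (none of these names come from the thread). It shows per-thread state reached through `Thread.currentThread()` from code that was never handed the thread object:

```java
public class CurrentThreadCast {
    // Hypothetical Thread subclass carrying per-thread state in a plain field.
    static class CounterThread extends Thread {
        int hits; // written only by this thread; read by others after join()

        @Override
        public void run() {
            for (int k = 0; k < 1000; k++) {
                recordHit();
            }
        }
    }

    // Can be called from any depth; recovers the per-thread state via the cast.
    // Throws ClassCastException if the caller is not a CounterThread.
    static void recordHit() {
        CounterThread self = (CounterThread) Thread.currentThread();
        self.hits++;
    }

    public static void main(String[] args) throws InterruptedException {
        CounterThread a = new CounterThread();
        CounterThread b = new CounterThread();
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(a.hits + " " + b.hits); // prints "1000 1000"
    }
}
```

The cast only works on threads you created yourself; calling `recordHit()` from the main thread or a pool thread would fail, which is the limitation noted in (a) below.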
> If so, that provides a different approach > to thread-local state, where the state is held in the instance > variables of MyThread objects, and the JVM is used to thread them > through :-) the code until the point where they're needed.
That's how Java thread locals are implemented in the JVM.
You can only use this approach if (a) you control the creation of the thread and can specify your own subclass for it, or (b) you can edit the rt.jar code for java.lang.Thread. Plan (c) is to make a ThreadLocal.
You might be interested to know that Hotspot intrinsifies Thread.currentThread to a couple of instructions; it's cheap. We did that to make things like thread locals efficient.
On Thu, Apr 3, 2008 at 7:01 PM, John Rose <John.R...@sun.com> wrote:
> Yes. You can then cast it to the subclass and go on from there.
That's what I had hoped, but the JLS doesn't actually seem to guarantee this.
> > If so, that provides a different approach > > to thread-local state, where the state is held in the instance > > variables of MyThread objects, and the JVM is used to thread them > > through :-) the code until the point where they're needed.
> That's how Java thread locals are implemented in the JVM.
I don't understand how that can be, as I can create as many ThreadLocals as I want. They can't all fit in the Thread object, surely. And how can reference to them be fast if they have to indirect through a Thread subclass object?
> You can only use this approach if (a) you control the creation of the > thread and can specify your own subclass for it,
That's what I had in mind.
> You might be interested to know that Hotspot intrinsifies > Thread.currentThread to a couple of instructions; it's cheap. We did > that to make things like thread locals efficient.
> On Thu, Apr 3, 2008 at 7:01 PM, John Rose <John.R...@sun.com> wrote:
> > Yes. You can then cast it to the subclass and go on from there.
> That's what I had hoped, but the JLS doesn't actually seem to guarantee this.
> > > If so, that provides a different approach > > > to thread-local state, where the state is held in the instance > > > variables of MyThread objects, and the JVM is used to thread them > > > through :-) the code until the point where they're needed.
> > That's how Java thread locals are implemented in the JVM.
> I don't understand how that can be, as I can create as many > ThreadLocals as I want. They can't all fit in the Thread object, > surely. And how can reference to them be fast if they have to > indirect through a Thread subclass object?
the source for ThreadLocal explains the first bit:
public void set(T value) {
    Thread t = Thread.currentThread();
    ThreadLocalMap map = getMap(t);
    if (map != null)
        map.set(this, value);
    else
        createMap(t, value);
}
Apparently, every thread has a map, keyed by ThreadLocal, that holds the values of all the ThreadLocals set on that thread.
> > You can only use this approach if (a) you control the creation of the > > thread and can specify your own subclass for it,
> That's what I had in mind.
> > You might be interested to know that Hotspot intrinsifies > > Thread.currentThread to a couple of instructions; it's cheap. We did > > that to make things like thread locals efficient.
class Thread {
    ...
    /* ThreadLocal values pertaining to this thread. This map is maintained
     * by the ThreadLocal class. */
    ThreadLocal.ThreadLocalMap threadLocals;
    ...
}
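To see the consequence of that per-thread map from user code, here is a small sketch using only the public `ThreadLocal` API. The class name `ThreadLocalDemo`, the two thread locals, and the `work` helper are made up for illustration. Each `ThreadLocal` acts as a key, and each thread sees only its own values:

```java
public class ThreadLocalDemo {
    // Two independent ThreadLocals; each is a key into the current thread's map.
    static final ThreadLocal<Integer> COUNT = new ThreadLocal<Integer>() {
        @Override
        protected Integer initialValue() {
            return 0;
        }
    };
    static final ThreadLocal<String> NAME = new ThreadLocal<String>();

    // Sets this thread's name, bumps this thread's counter n times, reports both.
    static String work(String name, int n) {
        NAME.set(name);
        for (int k = 0; k < n; k++) {
            COUNT.set(COUNT.get() + 1);
        }
        return NAME.get() + "=" + COUNT.get();
    }

    public static void main(String[] args) throws InterruptedException {
        final String[] results = new String[2];
        Thread a = new Thread(() -> results[0] = work("a", 3));
        Thread b = new Thread(() -> results[1] = work("b", 5));
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(results[0] + " " + results[1]); // prints "a=3 b=5"
    }
}
```

Neither thread's updates are visible to the other, and no locking is involved, because each lookup goes through the calling thread's own map.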
> On Thu, Apr 3, 2008 at 1:43 PM, John Rose <John.R...@sun.com> wrote:
>> Even better (if applicable) would be to have fast thread-local counters, >> with a slow background phase which occasionally tallies them into a trailing >> global counter.
> That reminds me of an old question of mine. In the case where a > thread has been created from an object whose class is a subclass of > Thread, is Thread.currentThread() guaranteed to give you back the same > object? IOW, if I say:
> class MyThread extends Thread { ... }
> myThread = new MyThread();
> myThread.start();
> ...

> then in the new thread, is Thread.currentThread() always == to the
> original value of myThread?

yes

> If so, that provides a different approach
> to thread-local state, where the state is held in the instance
> variables of MyThread objects,
yes, ThreadLocal is currently implemented as a field of java.lang.Thread storing a map, so there is no synchronization.
> and the JVM is used to thread them > through :-) the code until the point where they're needed.
This is a very interesting discussion and reveals a lot about the
difficulties of programming for multiple cores.
I was trying to understand what was going on and was 'messing about'
with the code and I noticed that almost any change I made slowed the
code down considerably - more than the 1 or 2 thread options in Java 6
does. Therefore I am not sure how realistic the code is. If the code
did more than simply increment a variable or two then the problem
might go away (because contention would be less and because the
uncontested operations would be a bigger percentage).
Going back to the original code. If it is OK to miss a few increments
(hence non-volatile statics) then the following should be OK:
threads[ j ] =
    new Thread() {
        public void run() {
            final int total = totalSize / threadCount;
            for ( int k = 0; k < total; k++ ) {
                if ( ( k & 0xFF ) == 0 ) { // New
                    i += 256; // New
                    firedCount++; // New
                } // New
                // if ( ( i++ & 0xFF ) == 0 ) { firedCount++; } // Original
            }
        }
    };
The above version doesn't show the problem on Java 6 and is quicker
than the original.
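For anyone who wants to try this, a self-contained sketch of the batched version follows. The original benchmark source is not included in this thread, so the class name `BatchedCounter`, the `runTrial` helper, and the iteration counts are assumptions; only the loop body is taken from the snippet above. The statics are deliberately non-volatile and unsynchronized, so with more than one thread some updates may be lost and the printed totals can vary from run to run:

```java
public class BatchedCounter {
    // Deliberately non-volatile and unsynchronized, as in the snippet above.
    static int i = 0;
    static int firedCount = 0;

    // Hypothetical harness: splits totalSize iterations across threadCount threads.
    static void runTrial(int threadCount, int totalSize) throws InterruptedException {
        Thread[] threads = new Thread[threadCount];
        final int total = totalSize / threadCount;
        for (int j = 0; j < threadCount; j++) {
            threads[j] = new Thread() {
                @Override
                public void run() {
                    for (int k = 0; k < total; k++) {
                        // Touch the shared statics once per 256 iterations
                        // instead of on every iteration.
                        if ((k & 0xFF) == 0) {
                            i += 256;
                            firedCount++;
                        }
                    }
                }
            };
        }
        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(threadCount + " thread(s): " + elapsedMs
                + " ms, i=" + i + ", firedCount=" + firedCount);
    }

    public static void main(String[] args) throws InterruptedException {
        runTrial(1, 100_000_000);
        i = 0;
        firedCount = 0;
        runTrial(2, 100_000_000);
    }
}
```

The shared fields are written 256 times less often than in the original loop, which is why contention on them mostly disappears. Timings from a loop this small are easily distorted by the JIT, so treat the numbers as a trend only.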
On Apr 2, 6:48 pm, Charles Oliver Nutter <charles.nut...@sun.com>
wrote:
> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.
> The source of the problem (at least the source that, when fixed, allowed
> normal thread scaling), was an increment, mask, and test of a static int
> field. The code in question looked like this:
> private static int count = 0;
> public void pollEvents(ThreadContext context) {
> if ((count++ & 0xFF) == 0) context.poll();
> }
> So the basic idea was that this would call poll() every 256 hits,
> incrementing a counter all the while. My first attempt to improve
> performance was to comment out the body of poll() in case it was causing
> a threading bottleneck (it does some locking and such), but that had no
> effect. Then, as a total shot in the dark, I commented out the entire
> line above. Thread scaling went to normal.
> So I'm rather confused here. Is a ++ operation on a static int doing
> some kind of atomic update that causes multiple threads to contend? I
> never would have expected this, so I wrote up a small Java benchmark:
> The benchmark does basically the same thing, with a single main counter
> and another "fired" counter to prevent hotspot from optimizing things
> completely away. I've been running this on a dual-core MacBook Pro with
> both Apple's Java 5 and the soylatte Java 6 release. The results are
> very confusing:
> Normal scaling here...1 thread on my system uses about 60-65% CPU, so
> the extra thread uses up the remaining 35-40% and the numbers show it.
> Then there's soylatte Java 6:
> Don't compare the times directly, since these are two pretty different
> codebases and they each have different general performance
> characteristics. Instead pay attention to the trend...the soylatte Java
> 6 run with two threads is significantly slower than the run with a
> single thread. This mirrors the results with JRuby when there was a
> single static counter being incremented.
hlovatt wrote: > This is a very interesting discussion and reveals a lot about the > difficulties of programming for multiple cores.
> I was trying to understand what was going on and was 'messing about' > with the code and I noticed that almost any change I made slowed the > code down considerably - more than the 1 or 2 thread options in Java 6 > does. Therefore I am not sure how realistic the code is. If the code > did more than simply increment a variable or two than the problem > might go away (because contention would be less and because the > uncontested operations would be a bigger percentage).
> Going back to the original code. If it is OK to miss a few increments > (hence non-volitile statics) then the following should be OK:
Yeah, clever, and basically the same as the eventual fix I had for JRuby, since it gives each thread its own counter. I think the primary rule here is: don't expect any sort of unsynchronized, uncontrolled updates to the same shared resource to either behave the way you want or perform the way you want.
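A rough sketch of the "each thread gets its own counter" idea, for illustration only. This is not the actual JRuby fix, which is not shown in this thread; the class name `PerThreadTally`, the `count` helper, and the flush interval of 256 are assumptions. Each thread counts in a local variable and occasionally folds its count into a shared `AtomicLong`. Unlike the background tally suggested earlier, here each thread does its own flushing:

```java
import java.util.concurrent.atomic.AtomicLong;

public class PerThreadTally {
    // Trailing global counter; touched only once per 256 local increments.
    static final AtomicLong GLOBAL = new AtomicLong();

    // Hypothetical helper: runs threadCount threads, each counting perThread events.
    static long count(int threadCount, final int perThread) throws InterruptedException {
        GLOBAL.set(0);
        Thread[] threads = new Thread[threadCount];
        for (int j = 0; j < threadCount; j++) {
            threads[j] = new Thread(() -> {
                int local = 0; // private to this thread, so no contention
                for (int k = 0; k < perThread; k++) {
                    local++;
                    if ((local & 0xFF) == 0) {
                        GLOBAL.addAndGet(256); // fold a full batch into the global
                    }
                }
                GLOBAL.addAndGet(local & 0xFF); // flush the remainder
            });
        }
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        return GLOBAL.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(count(4, 1000)); // prints 4000
    }
}
```

The global value lags behind the true count while threads are running, but no increments are lost, and the shared cache line is written far less often than with a single static counter.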
> hlovatt wrote: > > This is a very interesting discussion and reveals a lot about the > > difficulties of programming for multiple cores.
> > I was trying to understand what was going on and was 'messing about' > > with the code and I noticed that almost any change I made slowed the > > code down considerably - more than the 1 or 2 thread options in Java 6 > > does. Therefore I am not sure how realistic the code is. If the code > > did more than simply increment a variable or two than the problem > > might go away (because contention would be less and because the > > uncontested operations would be a bigger percentage).
> > Going back to the original code. If it is OK to miss a few increments > > (hence non-volitile statics) then the following should be OK:
> Yeah, clever, and basically the same as the eventual fix I had for JRuby > since it gives each thread its own counter. I think the primary rule > here is that don't expect any sort of unsynchronized, uncontrolled > updates to the same shared resource to either behave the way you want or > perform the way you want.
It may be a bit off-topic, but it's interesting how this argues for immutable data structures and Erlang's concurrency model.