On Thu, Apr 3, 2008 at 1:43 PM, John Rose <John.R...@sun.com> wrote: > Even better (if applicable) would be to have fast thread-local counters, > with a slow background phase which occasionally tallies them into a trailing > global counter.
That reminds me of an old question of mine. In the case where a thread has been created from an object whose class is a subclass of Thread, is Thread.currentThread() guaranteed to give you back the same object? IOW, if I say:
class MyThread extends Thread { ... } myThread = new MyThread(); myThread.start(); ...
then in the new thread, is Thread.currentThread() always == to the original value of myThread? If so, that provides a different approach to thread-local state, where the state is held in the instance variables of MyThread objects, and the JVM is used to thread them through :-) the code until the point where they're needed.
> That reminds me of an old question of mine. In the case where a > thread has been created from an object whose class is a subclass of > Thread, is Thread.currentThread() guaranteed to give you back the same > object?
Yes. You can then cast it to the subclass and go on from there.
> If so, that provides a different approach > to thread-local state, where the state is held in the instance > variables of MyThread objects, and the JVM is used to thread them > through :-) the code until the point where they're needed.
That's how Java thread locals are implemented in the JVM.
You can only use this approach if (a) you control the creation of the thread and can specify your own subclass for it, or (b) you can edit the rt.jar code for java.lang.Thread. Plan (c) is to make a ThreadLocal.
You might be interested to know that Hotspot intrinsifies Thread.currentThread to a couple of instructions; it's cheap. We did that to make things like thread locals efficient.
On Thu, Apr 3, 2008 at 7:01 PM, John Rose <John.R...@sun.com> wrote: > Yes. You can then cast it to the subclass and go on from there.
That's what I had hoped, but the JLS doesn't actually seem to guarantee this.
> > If so, that provides a different approach > > to thread-local state, where the state is held in the instance > > variables of MyThread objects, and the JVM is used to thread them > > through :-) the code until the point where they're needed.
> That's how Java thread locals are implemented in the JVM.
I don't understand how that can be, as I can create as many ThreadLocals as I want. They can't all fit in the Thread object, surely. And how can reference to them be fast if they have to indirect through a Thread subclass object?
> You can only use this approach if (a) you control the creation of the > thread and can specify your own subclass for it,
That's what I had in mind.
> You might be interested to know that Hotspot intrinsifies > Thread.currentThread to a couple of instructions; it's cheap. We did > that to make things like thread locals efficient.
> On Thu, Apr 3, 2008 at 7:01 PM, John Rose <John.R...@sun.com> wrote:
> > Yes. You can then cast it to the subclass and go on from there.
> That's what I had hoped, but the JLS doesn't actually seem to guarantee this.
> > > If so, that provides a different approach > > > to thread-local state, where the state is held in the instance > > > variables of MyThread objects, and the JVM is used to thread them > > > through :-) the code until the point where they're needed.
> > That's how Java thread locals are implemented in the JVM.
> I don't understand how that can be, as I can create as many > ThreadLocals as I want. They can't all fit in the Thread object, > surely. And how can reference to them be fast if they have to > indirect through a Thread subclass object?
the source for ThreadLocal explains the first bit:
public void set(T value) { Thread t = Thread.currentThread(); ThreadLocalMap map = getMap(t); if (map != null) map.set(this, value); else createMap(t, value); }
apparently, every thread has a map that holds all ThreadLocal's registered
> > You can only use this approach if (a) you control the creation of the > > thread and can specify your own subclass for it,
> That's what I had in mind.
> > You might be interested to know that Hotspot intrinsifies > > Thread.currentThread to a couple of instructions; it's cheap. We did > > that to make things like thread locals efficient.
class Thread { ... /* ThreadLocal values pertaining to this thread. This map is maintained * by the ThreadLocal class. */ ThreadLocal.ThreadLocalMap threadLocals; ... }
> On Thu, Apr 3, 2008 at 1:43 PM, John Rose <John.R...@sun.com> wrote:
>> Even better (if applicable) would be to have fast thread-local counters, >> with a slow background phase which occasionally tallies them into a trailing >> global counter.
> That reminds me of an old question of mine. In the case where a > thread has been created from an object whose class is a subclass of > Thread, is Thread.currentThread() guaranteed to give you back the same > object? IOW, if I say:
> class MyThread extends Thread { ... } > myThread = new MyThread(); > myThread.start(); > ...
> then in the new thread, is Thread.currentThread() always == to the > original value of myThread? yes > If so, that provides a different approach > to thread-local state, where the state is held in the instance > variables of MyThread objects,
yes, ThreadLocal is currently implemented has a field of java.lang.Thread storing a map, so there is no synchronization.
> and the JVM is used to thread them > through :-) the code until the point where they're needed.
This is a very interesting discussion and reveals a lot about the
difficulties of programming for multiple cores.
I was trying to understand what was going on and was 'messing about'
with the code and I noticed that almost any change I made slowed the
code down considerably - more than the 1 or 2 thread options in Java 6
does. Therefore I am not sure how realistic the code is. If the code
did more than simply increment a variable or two than the problem
might go away (because contention would be less and because the
uncontested operations would be a bigger percentage).
Going back to the original code. If it is OK to miss a few increments
(hence non-volitile statics) then the following should be OK:
threads[ j ] =
new Thread() {
public void run() {
final int total = totalSize / threadCount;
for ( int k = 0; k < total; k++ ) {
if ( ( k & 0xFF ) == 0 ) { // New
i += 256; // New
firedCount++; // New
} // New
// if ( ( i++ & 0xFF ) == 0 ) { firedCount++; } //
Original
}
}
};
The above version doesn't show the problem on Java 6 and is quicker
than the original.
On Apr 2, 6:48 pm, Charles Oliver Nutter <charles.nut...@sun.com>
wrote:
> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.
> The source of the problem (at least the source that, when fixed, allowed
> normal thread scaling), was an increment, mask, and test of a static int
> field. The code in question looked like this:
> private static int count = 0;
> public void pollEvents(ThreadContext context) {
> if ((count++ & 0xFF) == 0) context.poll();
> }
> So the basic idea was that this would call poll() every 256 hits,
> incrementing a counter all the while. My first attempt to improve
> performance was to comment out the body of poll() in case it was causing
> a threading bottleneck (it does some locking and such), but that had no
> effect. Then, as a total shot in the dark, I commented out the entire
> line above. Thread scaling went to normal.
> So I'm rather confused here. Is a ++ operation on a static int doing
> some kind of atomic update that causes multiple threads to contend? I
> never would have expected this, so I wrote up a small Java benchmark:
> The benchmark does basically the same thing, with a single main counter
> and another "fired" counter to prevent hotspot from optimizing things
> completely away. I've been running this on a dual-core MacBook Pro with
> both Apple's Java 5 and the soylatte Java 6 release. The results are
> very confusing:
> Normal scaling here...1 thread on my system uses about 60-65% CPU, so
> the extra thread uses up the remaining 35-40% and the numbers show it.
> Then there's soylatte Java 6:
> Don't compare the times directly, since these are two pretty different
> codebases and they each have different general performance
> characteristics. Instead pay attention to the trend...the soylatte Java
> 6 run with two threads is significantly slower than the run with a
> single thread. This mirrors the results with JRuby when there was a
> single static counter being incremented.
This is a very interesting discussion and reveals a lot about the
difficulties of programming for multiple cores.
I was trying to understand what was going on and was 'messing about'
with the code and I noticed that almost any change I made slowed the
code down considerably - more than the 1 or 2 thread options in Java 6
does. Therefore I am not sure how realistic the code is. If the code
did more than simply increment a variable or two than the problem
might go away (because contention would be less and because the
uncontested operations would be a bigger percentage).
Going back to the original code. If it is OK to miss a few increments
(hence non-volitile statics) then the following should be OK:
threads[ j ] =
new Thread() {
public void run() {
final int total = totalSize / threadCount;
for ( int k = 0; k < total; k++ ) {
if ( ( k & 0xFF ) == 0 ) { // New
i += 256; // New
firedCount++; // New
} // New
// if ( ( i++ & 0xFF ) == 0 ) { firedCount++; } //
Original
}
}
};
The above version doesn't show the problem on Java 6 and is quicker
than the original.
On Apr 2, 6:48 pm, Charles Oliver Nutter <charles.nut...@sun.com>
wrote:
> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.
> The source of the problem (at least the source that, when fixed, allowed
> normal thread scaling), was an increment, mask, and test of a static int
> field. The code in question looked like this:
> private static int count = 0;
> public void pollEvents(ThreadContext context) {
> if ((count++ & 0xFF) == 0) context.poll();
> }
> So the basic idea was that this would call poll() every 256 hits,
> incrementing a counter all the while. My first attempt to improve
> performance was to comment out the body of poll() in case it was causing
> a threading bottleneck (it does some locking and such), but that had no
> effect. Then, as a total shot in the dark, I commented out the entire
> line above. Thread scaling went to normal.
> So I'm rather confused here. Is a ++ operation on a static int doing
> some kind of atomic update that causes multiple threads to contend? I
> never would have expected this, so I wrote up a small Java benchmark:
> The benchmark does basically the same thing, with a single main counter
> and another "fired" counter to prevent hotspot from optimizing things
> completely away. I've been running this on a dual-core MacBook Pro with
> both Apple's Java 5 and the soylatte Java 6 release. The results are
> very confusing:
> Normal scaling here...1 thread on my system uses about 60-65% CPU, so
> the extra thread uses up the remaining 35-40% and the numbers show it.
> Then there's soylatte Java 6:
> Don't compare the times directly, since these are two pretty different
> codebases and they each have different general performance
> characteristics. Instead pay attention to the trend...the soylatte Java
> 6 run with two threads is significantly slower than the run with a
> single thread. This mirrors the results with JRuby when there was a
> single static counter being incremented.
hlovatt wrote: > This is a very interesting discussion and reveals a lot about the > difficulties of programming for multiple cores.
> I was trying to understand what was going on and was 'messing about' > with the code and I noticed that almost any change I made slowed the > code down considerably - more than the 1 or 2 thread options in Java 6 > does. Therefore I am not sure how realistic the code is. If the code > did more than simply increment a variable or two than the problem > might go away (because contention would be less and because the > uncontested operations would be a bigger percentage).
> Going back to the original code. If it is OK to miss a few increments > (hence non-volitile statics) then the following should be OK:
Yeah, clever, and basically the same as the eventual fix I had for JRuby since it gives each thread its own counter. I think the primary rule here is that don't expect any sort of unsynchronized, uncontrolled updates to the same shared resource to either behave the way you want or perform the way you want.
> hlovatt wrote: > > This is a very interesting discussion and reveals a lot about the > > difficulties of programming for multiple cores.
> > I was trying to understand what was going on and was 'messing about' > > with the code and I noticed that almost any change I made slowed the > > code down considerably - more than the 1 or 2 thread options in Java 6 > > does. Therefore I am not sure how realistic the code is. If the code > > did more than simply increment a variable or two than the problem > > might go away (because contention would be less and because the > > uncontested operations would be a bigger percentage).
> > Going back to the original code. If it is OK to miss a few increments > > (hence non-volitile statics) then the following should be OK:
> Yeah, clever, and basically the same as the eventual fix I had for JRuby > since it gives each thread its own counter. I think the primary rule > here is that don't expect any sort of unsynchronized, uncontrolled > updates to the same shared resource to either behave the way you want or > perform the way you want.
It may be a bit off-topic, but it's interesting how this argues for immutable data structures and Erlang's concurrency model.
>> Yeah, clever, and basically the same as the eventual fix I had for >> JRuby >> since it gives each thread its own counter. I think the primary rule >> here is that don't expect any sort of unsynchronized, uncontrolled >> updates to the same shared resource to either behave the way you >> want or >> perform the way you want.
> It may be a bit off-topic, but it's interesting how this argues for > immutable data structures and Erlang's concurrency model.
I guess once you're into concurrency, everything argues for immutable data structures.
It's also nice to see how this fits with Gilad Brachas arguments against the Java 'static' keyword (which allows mutable structures to be shared implicitly between threads), as Charles already noted.
Whatever machinery involved: if multiple actors contend for the same memory location they will suffer. The best strategy is to distribute jobs, work on each of them independently on local resources, be it processor registers, thread locals, local memory or disks and to synchronize the results with others as infrequently as possible. It's the basic principle behind fork-join in a multi-threaded process and map-reduce on a large cluster. Nothing new that Gilad Bracha discovered in 2008, is it?
On Tue, Apr 8, 2008 at 7:25 PM, Martin Probst <m...@martin-probst.com> wrote:
> >> Yeah, clever, and basically the same as the eventual fix I had for > >> JRuby > >> since it gives each thread its own counter. I think the primary rule > >> here is that don't expect any sort of unsynchronized, uncontrolled > >> updates to the same shared resource to either behave the way you > >> want or > >> perform the way you want.
> > It may be a bit off-topic, but it's interesting how this argues for > > immutable data structures and Erlang's concurrency model.
> I guess once you're into concurrency, everything argues for immutable > data structures.
> It's also nice to see how this fits with Gilad Brachas arguments > against the Java 'static' keyword (which allows mutable structures to > be shared implicitly between threads), as Charles already noted.
> It's the basic principle behind fork-join in a multi-threaded > process and > map-reduce on a large cluster. > Nothing new that Gilad Bracha discovered in 2008, is it?
I was referring to Bracha's stuff about how static fields and global names in general harm software engineering, deployment (live code updating), concurrency, testability, etc. pp. I'm not sure that this stuff is novel, and it's certainly building on research of others. But it's definitely non-trivial, and it goes far beyond the basic insight that resource contention is a bottleneck in parallel/distributed systems.
The interesting part is that it indeed comes to similar solutions looking from a different angle and a different problem set. That's what I wanted to point out.
On Tue, Apr 8, 2008 at 10:14 PM, Martin Probst <m...@martin-probst.com> wrote:
> > I'm surprised you're surprised.
> Who's surprised?
Not you in particular. I just picked your message as the last in this thread. I was just commenting on the overall sentiment that -to me- sounds like "wow we may have discovered something new in concurrent programming and great how all these pieces (Erlang, Bracha) finally fit together now". Gilad's compilation was certainly well compiled but global variables, hey, ...
> I guess once you're into concurrency, everything argues for immutable > data structures.
Some ad lib observations on memory sharing, concurrency, and the JVM:
- Probably the hardest part about updating the Java memory model was making sure the rules for immutable objects (final fields) would work properly without synchronization. It was necessary, because Java needs to work well on multiprocessors.
- Hotspot put in thread-local allocation buffers a decade ago, again to avoid avoid contention by memory partitioning. We weren't quite as painfully cache conscious back then, but cache locality was also a consideration. (Still, somebody needs to experiment more with cache- conscious coloring for TLABs. Sounds like a masters' project to me.)
- An immutable object is not immutable at first. The trick is that the mutable part of its lifecycle is private, and so the system can cleverly protect it from contention, e.g., using TLABs. (Occurs to me that "larvatus" = "hidden"; an immutable object has a larval stage during which it is constructed.) This can be generalized to other controlled mutation phases; that's what GC safepoints do, for example: Allow immutable objects to be temporarily mutated. There might be an interesting "phased immutability" software abstraction to be investigated along these lines.
- The JVM needs to know about immutability in order to optimize it properly. However, there are interesting (semi-)global analyses which can help it discover immutability even when it is not clearly marked in the source program. (Optimistic ones are the coolest, and require a framework for dependency checking and deoptimization to recover. Hotspot is good at that, and can get better. I think there are PhD projects there.)
- In modern memory systems, main memory is usually many tens of cycles away from each CPU. Caches try to hide this. As a corollary, there are at least two protocols in the hidden network that connects memories, caches, and CPUs, and they differ greatly in intrinsic performance characteristics. In essence, one protocol is (often) called RTS (read to share) and one is RTO (read to own). RTO means your CPU & cache have to handshake with the rest of the world to make sure your have an exclusive right to a block of memory (cache line), before your can either read or write it. Your program incurs this cost because it intends to write the memory. But you want to use RTS when you can, because any number of CPUs can RTS the same block of memory (cache line) and happily work on the same data, as long as it remains immutable. (I'm tempted to think RTS:RTO :: UDP:TCP; there are some basic similarities I suppose.) In any case, when memory is shared, immutable memory is faster than mutable, and the JVM works hard to exploit the difference.
What you should do is run the thing through a profiler. This will
tell you how many times poll is being called. My guess is that
perhaps the thread contention is prolonging the amount of time
necessary to increment count - Threads 1-3 all load count into memory
(i.e 0), all add 1 to it, and then write it back out to memory, so
after 3 threads have "written" to count, it only contains 1. This
seems to be the case - tested with linux jdk1.6, intel Q6600 (quad-
core):
On a side note, it's way faster to just optimize away the algorithm
(at least in your benchmark):
int numCalls = total / 0xFF;
i += total;
for (int k = 0; k < numCalls; ++k) doNothing();
Here's the same way, using AtomicIntegers:
int tmp = i.getAndAdd(total);
for (int k = 0; k < numCalls; ++k)
{
if ((k & 0xFF) == 0) doNothing();
}
However, none of this is related to the actual code, just the
benchmark. The way to optimize the actual code would be to push count
into the ThreadContext & instead of doing a binary and, simply reset
the counter & do a comparison of 256 / numThreads. Or you can keep
that code if you want every thread to poll once every 256 calls rather
than a thread getting polled once every 256 calls.
Other ways to speed it up: Don't count & simply call pollEvents with
longer intervals in-between & pollEvents then simply calls doNothing
directly.
Summary: The problem is that multiple threads are overwriting the
global variable, so that it takes longer to actually increment it by
1.
Think of it this way:
count = 1
Threads A, B, C are running.
The CPU loads count (with a value of 1) into the register.
The CPU increments the values in its registers to 2
The CPU then writes out the values of each of the registers into
count. So after all 3 threads increment count, the value is still 1.
Add in the overhead of switching between threads, and you can see why
you get a slowdown. Adding in synchronization will help, provided you
do it intelligently - too often, and synchronization penalty takes
over.
On Apr 2, 1:48 am, Charles Oliver Nutter <charles.nut...@sun.com>
wrote:
> I ran into a very strange effect when some Sun folks tried to benchmark
> JRuby's multi-thread scalability. In short, adding more threads actually
> caused the benchmarks to take longer.
> The source of the problem (at least the source that, when fixed, allowed
> normal thread scaling), was an increment, mask, and test of a static int
> field. The code in question looked like this:
> private static int count = 0;
> public void pollEvents(ThreadContext context) {
> if ((count++ & 0xFF) == 0) context.poll();
> }
> So the basic idea was that this would call poll() every 256 hits,
> incrementing a counter all the while. My first attempt to improve
> performance was to comment out the body of poll() in case it was causing
> a threading bottleneck (it does some locking and such), but that had no
> effect. Then, as a total shot in the dark, I commented out the entire
> line above. Thread scaling went to normal.
> So I'm rather confused here. Is a ++ operation on a static int doing
> some kind of atomic update that causes multiple threads to contend? I
> never would have expected this, so I wrote up a small Java benchmark:
> The benchmark does basically the same thing, with a single main counter
> and another "fired" counter to prevent hotspot from optimizing things
> completely away. I've been running this on a dual-core MacBook Pro with
> both Apple's Java 5 and the soylatte Java 6 release. The results are
> very confusing:
> Normal scaling here...1 thread on my system uses about 60-65% CPU, so
> the extra thread uses up the remaining 35-40% and the numbers show it.
> Then there's soylatte Java 6:
> Don't compare the times directly, since these are two pretty different
> codebases and they each have different general performance
> characteristics. Instead pay attention to the trend...the soylatte Java
> 6 run with two threads is significantly slower than the run with a
> single thread. This mirrors the results with JRuby when there was a
> single static counter being incremented.