Hi all,
For quite a while I have been looking for the best approach to tackle the Groovy speed problem, but it seems I will not be able to fully solve it without solving my concurrency problem.
My problem is more or less the following. A method call is done using the data of a meta class, which is essentially a caching structure that can be recreated unless certain conditions are met (conditions we don't need to go into for now). For each method call I need to consult this meta class. And here part of the problem starts: how do I get that object?
If it is stored in a volatile field, a ThreadLocal, or even retrieved through a data structure that uses synchronized, I get into problems with HotSpot.
The ideal mechanism for me, for this and other things, would be the ability to write a fully initialized object to main memory in such a way that other threads can, but don't have to, pick up that value. They would learn of it either as part of other synchronization, or through an instruction the user issues explicitly. The advantage is that this does not have to happen on every read, as it does with volatile. In theory ThreadLocal could be good enough, if there were no speed issue. True, ThreadLocal has been improved considerably over the years, but it is still slower than, for example, volatile, because I cannot hold on to the fetched value for long. So I would still have to do a ThreadLocal lookup for each method invocation, and that is bitter.
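For illustration, a hypothetical sketch (not Groovy's actual runtime code) of the per-call cost described above -- every dynamic invocation must first locate the meta class, either through a volatile read or a ThreadLocal lookup:

class MetaClass { /* cached dispatch data, elided */ }

class MetaClassHolder {
    // Option 1: a volatile slot -- always up to date, but every read is a
    // volatile load that the JIT cannot hoist or fold away.
    private volatile MetaClass current;

    // Option 2: a thread-local cache -- plain reads of the stored value,
    // but the ThreadLocal.get() lookup itself repeats on every invocation.
    private final ThreadLocal<MetaClass> perThread = new ThreadLocal<MetaClass>();

    Object invoke(Object receiver, String name, Object[] args) {
        MetaClass mc = current;            // volatile read on *every* call
        // MetaClass mc = perThread.get(); // or: ThreadLocal lookup on every call
        return dispatch(mc, receiver, name, args);
    }

    private Object dispatch(MetaClass mc, Object r, String n, Object[] a) {
        return null; // dispatch details elided
    }
}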
As I see it, I can either use native code and maybe cause problems for the JVM, or I am on a completely wrong track here and am simply not seeing something.
So I am asking here for ideas. Anyone having one?
bye Jochen
Charles Oliver Nutter wrote:
[...] I spent quite some time getting an overview and counted about 20 points where concurrency is involved on the first call of a method. If the method is called again, this is reduced to some volatile accesses to see if a category is active, and some other things.
Can you explain in terms of the Groovy codebase what you mean by "the
Groovy speed problem"? I may be able to explain what we have done in
JRuby for similar situations.
The key problem is caching the result of a partial evaluation. (For example, an expected method dispatch at a call site, or a metaclass which summarizes the shape of a type.) The cached value can be made "pure" (read-only) but it is subject (at least potentially) to very slow change, once per "epoch". Let's call such a slowly varying value a "stable value".
(Compare "stationary values" Ian Rogers's paper for PPPJ'08. What dynamic language folks need is an epochal variation of this concept, a value which is stationary most of the type and changes once in a blue moon. Hence, "stable value".)
The typical dynamic language technique is to use a volatile variable to hold the cached value. The volatility ensures that once-per-epoch changes get immediately picked up by threads. The problem is that the volatile reference defeats caching and inlining of all sorts.
The JVM has better tricks. It can cache and inline a vtable lookup if it can prove (from class hierarchy analysis) that a method is not overridden. The invokedynamic instruction has similar capabilities. If the epoch changes (because a new overriding method is loaded), code needs to be edited or discarded, and threads need to be interrupted. This happens at a safepoint.
Here's a sketch for a user-level hook to stable values:
public class StableValue<T> {
/** Get the current value. */
public T getStableValue() {...}
/** Set the current value. This may be costly. */
public void setStableValue(T x) {...}
/** Commit the current value. Future calls to set() will throw. */
public void freezeStableValue() {...}
}
I'm going to abbreviate "getStableValue" as "get", etc., from here on. It is tempting to make an analogy with java.lang.ref.Reference, but I don't know if that makes enough sense. By using the full term "getStableValue" we can contemplate mixing stable values into other types:
public class CallSite extends StableValue<MethodHandle> { ... }
Anyway, the idea is that the JVM is encouraged (by the user) to perform optimizations on the value returned by the get() call. If it does so, it must somehow watch for future calls to set(). The freeze() call (from the user) informs the JVM that it can stop watching. This should probably be a system-defined class, because it may have private fields which participate in the JVM's task of tracking dependencies.
Of course, static variables in Java are somewhat amenable to this treatment "automagically". But the above abstraction guides the JVM much more clearly. I don't think we can retrofit this behavior onto normal Java variables. (We can with "static final" of course, but those guys can't evolve over the epochs.) It's better to be explicit here, I think.
StableValue should be subclassable, in the same style as java.lang.ref classes. That will remove gratuitous indirections. For example, a StableHashTable could be built whose entries subclass StableValue, much like a WeakHashTable.
It should be possible to build lazy values on top of StableValues, by overriding get() to check if the current value is null. (Another reason to subclass.)
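For instance, a rough sketch only, against the hypothetical API above, with a made-up computeStableValue hook:

public abstract class LazyValue<T> extends StableValue<T> {
    /** Compute the value lazily, on first use. */
    protected abstract T computeStableValue();

    @Override
    public T getStableValue() {
        T value = super.getStableValue();
        if (value == null) {
            synchronized (this) {
                value = super.getStableValue();
                if (value == null) {
                    value = computeStableValue();
                    setStableValue(value);
                    freezeStableValue();   // never changes again
                }
            }
        }
        return value;
    }
}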
The StableValue.set call should perform the memory effects of a release (cf. monitorexit) before the value is published.
The StableValue.get call should perform the memory effects of an acquire (cf. monitorenter) before the value is fetched, the *first* time any given thread calls get() on a given StableValue. Subsequent get() calls are "for free". This is why a safepoint mechanism is needed, to force querying threads to re-acquire relevant memory blocks.
A StableValue can be "entangled" with optimized code if its value is folded into the code. If this happens, the JVM needs to keep track of the StableValue so that it can adjust or discard the code the next time the value changes. The StableValue class probably needs invisible synthetic fields to help it link into the JVM's data structures for tracking code dependencies.
The caching behavior of java.dyn.CallSite with respect to setTarget could then be defined in terms of StableValue. It will be "as if" the CallSite held its current target in a StableValue.
Class metadata (of type MyMD) can be stored in a ClassValue<StableValue<MyMD>>. For information on ClassValue, see:
http://cr.openjdk.java.net/~jrose/pres/indy-javadoc-mlvm/java/dyn/ClassValue.html
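For illustration, metaclass caching might then look roughly like this. A sketch only: MyMD.buildFor is a made-up placeholder, and the ClassValue shape follows the computeValue/get hooks in the linked javadoc, which are still draft:

public class MetadataCache {
    private final ClassValue<StableValue<MyMD>> metadata =
        new ClassValue<StableValue<MyMD>>() {
            @Override
            protected StableValue<MyMD> computeValue(Class<?> type) {
                StableValue<MyMD> sv = new StableValue<MyMD>();
                sv.setStableValue(MyMD.buildFor(type));   // initial epoch (placeholder)
                return sv;
            }
        };

    /** Per-call lookup: ClassValue.get plus a stable read. */
    public MyMD metadataFor(Object receiver) {
        return metadata.get(receiver.getClass()).getStableValue();
    }

    /** A schema change starts a new epoch for one class only. */
    public void invalidate(Class<?> type, MyMD fresh) {
        metadata.get(type).setStableValue(fresh);
    }
}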
-- John
P.S. None of the above helps with the special case of getting invalidation when types gain new subtypes. There's another hook needed there, I think. For example, ClassValues could be optionally made sensitive to class hierarchy changes, so that they get recomputed if new subtypes are loaded. Not sure if this is needed, though.
For StableValue, I think there needs to be (a) a better guarantee of responsiveness to changes than normal fields provide, and (b) a clear path to optimizing the stable value as if it were "mostly constant" in the same sense that the bodies of not-yet-overridden virtual methods are "mostly constant" today.
Does something like StableValue belong in the Fences API, I wonder?
-- John
On Fri, Oct 1, 2010 at 3:44 AM, John Rose <john....@oracle.com> wrote:
> I had a chance to sit down with Rich at JavaOne to discuss this. (Thanks Rich!)
>
> The key problem is caching the result of a partial evaluation. (For example, an expected method dispatch at a call site, or a metaclass which summarizes the shape of a type.) The cached value can be made "pure" (read-only) but it is subject (at least potentially) to very slow change, once per "epoch". Let's call such a slowly varying value a "stable value".
>
> (Compare "stationary values" Ian Rogers's paper for PPPJ'08. What dynamic language folks need is an epochal variation of this concept, a value which is stationary most of the type and changes once in a blue moon. Hence, "stable value".)
>
> The typical dynamic language technique is to use a volatile variable to hold the cached value. The volatility ensures that once-per-epoch changes get immediately picked up by threads. The problem is that the volatile reference defeats caching and inlining of all sorts.
>
> The JVM has better tricks. It can cache and inline a vtable lookup if it can prove (from class hierarchy analysis) that a method is not overridden. The invokedynamic instruction has similar capabilities. If the epoch changes (because a new overriding method is loaded), code needs to be edited or discarded, and threads need to be interrupted. This happens at a safepoint.
>
> Here's a sketch for a user-level hook to stable values:
>
> public class StableValue<T> {
> /** Get the current value. */
> public T getStableValue() {...}
> /** Set the current value. This may be costly. */
> public void setStableValue(T x) {...}
> /** Commit the current value. Future calls to set() will throw. */
> public void freezeStableValue() {...}
> }
>
> I'm going to abbreviate "getStableValue" as "get", etc., from here on. It is tempting to make an analogy with java.lang.ref.Reference, but I don't know if that makes enough sense. By using the full term "getStableValue" we can contemplate mixing stable values into other types:
> public class CallSite extends StableValue<MethodHandle> { ... }
>
> Anyway, the idea is that the JVM is encouraged (by the user) to perform optimizations on the value returned by the get() call. If it does so, it must somehow watch for future calls to set(). The freeze() call (from the user) informs the JVM that it can stop watching. This should probably be a system-defined class, because it may have private fields which participate in the JVM's task of tracking dependencies.
It seems like it's in the family of Atomic* to me, and as you
mentioned in a followup email we need to make the set+freeze
operation (at minimum) be atomic. So why not just borrow the
AtomicReference API and have AtomicStableReference or
StableAtomicReference (and perhaps AtomicStable*) that also implement
getAndSet, incrementAndSet, incrementAndFreeze, and so on?
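Purely as an illustration of the shape (none of these names exist in any JDK today):

public interface StableAtomicReference<V> {
    V get();
    void set(V newValue);                     // costly: may involve a safepoint
    boolean compareAndSet(V expect, V update);
    V getAndSet(V newValue);
    void freeze();                            // commits: further sets throw
    void setAndFreeze(V newValue);            // the atomic set+freeze combination
}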
I also liked Doug Lea's "fences" API, though I guess that's off the
table for Java 7 now. It was certainly more flexible, since it allowed
us more freedom about which fields to use (and therefore we would not
have to change current impls...we just fence them), but it is perhaps
more invasive at a VM level.
> Of course, static variables in Java are somewhat amenable to this treatment "automagically". But the above abstraction guides the JVM much more clearly. I don't think we can retrofit this behavior onto normal Java variables. (We can with "static final" of course, but those guys can't evolve over the epochs.) It's better to be explicit here, I think.
>
> StableValue should be subclassable, in the same style as java.lang.ref classes. That will remove gratuitous indirections. For example, a StableHashTable could be built whose entries subclass StableValue, much like a WeakHashTable.
>
> It should be possible to build lazy values on top of StableValues, by overriding get() to check if the current value is null. (Another reason to subclass.)
compareAndSet?
> The StableValue.set call should perform the memory effects of a release (cf. monitorexit) before the value is published.
>
> The StableValue.get call should perform the memory effects of an acquire (cf. monitorenter) before the value is fetched, the *first* time any given thread calls get() on a given StableValue. Subsequent get() calls are "for free". This is why a safepoint mechanism is needed, to force querying threads to re-aquire relevant memory blocks.
>
> A StableValue can be "entangled" with optimized code if its value is folded into the code. If this happens, the JVM needs to keep track of the StableValue so that it can adjust or discard the code the next time the value changes. The StableValue class probably needs invisible synthetic fields to help it link into the JVM's data structures for tracking code dependencies.
One disadvantage this has from "fences" is that we will have to always
walk a reference to get to the StableValue object. Perhaps we can roll
in the optimization Tom R. came up with to fold away repeated accesses
of the same final field reference? Or will Hotspot see through the
field reference + StableValue.get and fold the whole thing together?
> The caching behavior of java.dyn.CallSite with respect to setTarget could then be defined in terms of StableValue. It will be "as if" the CallSite held its current target in a StableValue.
>
> Class metadata (of type MyMD) can be stored in a ClassValue<StableValue<MyMD>>. For information on ClassValue, see:
> http://cr.openjdk.java.net/~jrose/pres/indy-javadoc-mlvm/java/dyn/ClassValue.html
Ah-ha, this is what I missed before. This definitely does help with
the Class/MetaClass relationship most of us will have to deal with, so
with this and StableValue we're really getting some powerful tools
here.
> -- John
>
> P.S. None of the above helps with the special case of getting invalidation when types gain new subtypes. There's another hook needed there, I think. For example, ClassValues could be optionally made sensitive to class hierarchy changes, so that they get recomputed if new subtypes are loaded. Not sure if this is needed, though.
Now I get this statement as well. If a new subtype comes in you may or
may not want the parent's ClassValue to apply down-hierarchy. If you
don't want it to apply, you need a way to invalidate and re-prime
ClassValues for the super and subtype. Agreed...there's a mechanism of
some sort needed here, OR ClassValues could be limited to exactly one
type from one classloader, and we'll handle the overhead of binding
new types that enter the system.
- Charlie
Begin forwarded message:
From: Doug Lea <d...@cs.oswego.edu>
Date: October 2, 2010 7:10:47 AM PDT
To: John Rose <john....@oracle.com>
Subject: Re: is there a fence for this?
On 10/01/10 17:56, John Rose wrote:
> I'm beginning to think that both mutable invokedynamic call sites and cached
> meta-classes are instances of the same pattern, a "stable value", which the
> JVM does not support well. But JSR 292 can supply a new hook, if JSR 166
> does not already.
We (JSR166) don't have anything directly applicable.
Here are a few thoughts. Feel free to relay any or all of this
to jvm-languages list.
Memory-model-wise, I suppose you'd classify this as a
weird form of "Release consistency":
Readers running between releases are allowed to act as if
values do not change. Here, the release points correspond
to epochs, and we assume that there is some sort of sync
so that writers don't interfere with each other upon
setting up new epochs. And possibly (but, I suspect, not
necessarily) a full all-thread r/w barrier upon epoch advance.
But the main twist on release-consistency is that reader-sync
may (or may not) entail new code generation.
In principle, this could be addressed with scoped fences.
(as in the ones at
http://gee.cs.oswego.edu/dl/jsr166/dist/docs/java/util/concurrent/atomic/Fences.html)
Note: I suspect that the Fences API won't make Java7.
It wasn't in our recent jsr166/openjdk integration, due to
the combination of mismatches with existing JLS ch 17
specs (which would need even more revisions/fixes than they
already do) and the continual moral hazard complaints --
i.e., arguments that the world is better off if the only
people using them are those who can do so via Unsafe.
Scoped fences map pretty well to release consistency (as
well as to most others), and so ought to apply here.
(As a thought experiment, consider an implementation of
even ordinary fences that might involve code regeneration
on some forms of updates.) Still, the actual APIs would
probably need some adaptation if only because there is
no Java notion of the associated scope/epoch. And pragmatically,
you'd want these fences to be treated very specially anyway.
I think you'd only need one or two forms (not the usual three),
mainly (using made-up Epoch type):
orderReadsByEpoch(Epoch e);
and maybe (but probably left for JVM-internal use only)
orderWritesByEpoch(Epoch e)
with specs similar to plain orderReads/orderWrites.
Interestingly enough, the similarities appear to overlap those
wrt "freeze" actions described (with bugs) in JLS sec 17.5
(http://java.sun.com/docs/books/jls/third_edition/html/memory.html)
that do not have an independent implementation in any JVM
(but could be). "Freezes" are an attempt to provide an account
of some final-field oddities. Beware that there are
unexpected but allowed behaviors with not-really-final
final fields, that may also be shared by epoch-based control.
Given the above, "stable" would not be my favorite
term for this mode. Also, my experience with trying
to use per-variable type-qualifier-like or magic-type/class
notions in Java (as in your "StableValue")
to map to fences is not very positive. Even "volatile" and
"final" are problematic. On the other hand, it is fine,
and often preferable to only get fences as parts of
special-mode accesses. You implicitly do this in methods
like "getStableValue", but doing it only per-access
leads to the usual problems we face in not being
able to efficiently support l-value operations.
(Aside: A different solution to this will probably eventually
be part of X10 spec: using flavored blocks for fence control.
As in
finish { p = x; } f = p.f // volatile-style field read
I wish there were some syntax holes for this sort of
thing in Java....)
Which leads me at the moment to suggest just adding one or two
fence-like methods for epochs.
-Doug
> I will say that this helps the volatile reference problem we all face,
> but I know Jochen and I are still looking for a way to completely get
> rid of our soft/weak proxy/metaclass wrappers :( I'm this || close to
> hacking Maxine to have an additional mutable reference on all objects
> just so I can see how it would play out...
Thomas Wuerthinger's enhanced hotswap work gets moderately close to this, and he and I talked at JavaOne about what it would take to go the rest of the way. The main problem is that the HotSpot JVM thinks it knows some hard-coded offsets into well-known classes like String. (It's an old problem http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4953140 .) The 292 work makes it worse, since there is more coupling between the JVM and the java.dyn classes.
(Long ago, I was on the technical due-diligence team to see if Sun wanted to acquire this strange HotSpot technology. The code looked well-factored, so we tried adding a field to all object headers, a field containing the backtrace as of the point the object was allocated. It took an afternoon, and worked the first time.)
> It seems like it's in the family of Atomic* to me, and as you
> mentioned in a followup email we need to make the set+freeze
> operation (at minimum) be atomic. So why not just borrow the
> AtomicReference API and have AtomicStableReference or
> StableAtomicReference (and perhaps AtomicStable*) that also implement
> getAndSet, incrementAndSet, incrementAndFreeze, and so on?
That's a good idea. It's not likely to happen in a user visible way in JDK 7, though.
> I also liked Doug Lea's "fences" API, though I guess that's off the
> table for Java 7 now. It was certainly more flexible, since it allowed
> us more freedom about which fields to use (and therefore we would not
> have to change current impls...we just fence them), but it is perhaps
> more invasive at a VM level.
Not very much more invasive. The JVM immediately cooks those things to fences, anyway:
http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/share/classes/java/util/concurrent/atomic/AtomicInteger.java
http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/share/vm/opto/library_call.cpp
(See lazySet, and compare inline_unsafe_ordered_store with inline_unsafe_access.)
>> A StableValue can be "entangled" with optimized code if its value is folded into the code. If this happens, the JVM needs to keep track of the StableValue so that it can adjust or discard the code the next time the value changes. The StableValue class probably needs invisible synthetic fields to help it link into the JVM's data structures for tracking code dependencies.
>
> One disadvantage this has from "fences" is that we will have to always
> walk a reference to get to the StableValue object. Perhaps we can roll
> in the optimization Tom R. came up with to fold away repeated accesses
> of the same final field reference?
Yes, that's already partially present in HotSpot, but just for method handles. Tom's version is much more general.
(The key problem here is the finals are not really final, in the JVM's view, because constructors and reflection are allowed to change them.)
> Or will Hotspot see through the
> field reference + StableValue.get and fold the whole thing together?
It depends on the root reference to the StableValue, of course. In the case of an invokedynamic call site, there is a stable reference to the CallSite, so if it incorporates a StableValue (or equivalent behavior), the extra indirection is routinely folded up inside the compilation of the call site.
In the case of metaclass extraction from a non-constant x.getClass call, it's a chain of however many indirections there are in the data structure.
>> The caching behavior of java.dyn.CallSite with respect to setTarget could then be defined in terms of StableValue. It will be "as if" the CallSite held its current target in a StableValue.
>>
>> Class metadata (of type MyMD) can be stored in a ClassValue<StableValue<MyMD>>. For information on ClassValue, see:
>> http://cr.openjdk.java.net/~jrose/pres/indy-javadoc-mlvm/java/dyn/ClassValue.html
>
> Ah-ha, this is what I missed before. This definitely does help with
> the Class/MetaClass relationship most of us will have to deal with, so
> with this and StableValue we're really getting some powerful tools
> here.
Yes. Let's get that one right.
>> P.S. None of the above helps with the special case of getting invalidation when types gain new subtypes. There's another hook needed there, I think. For example, ClassValues could be optionally made sensitive to class hierarchy changes, so that they get recomputed if new subtypes are loaded. Not sure if this is needed, though.
>
> Now I get this statement as well. If a new subtype comes in you may or
> may not want the parent's ClassValue to apply down-hierarchy. If you
> don't want it to apply, you need a way to invalidate and re-prime
> ClassValues for the super and subtype. Agreed...there's a mechanism of
> some sort needed here, OR ClassValues could be limited to exactly one
> type from one classloader, and we'll handle the overhead of binding
> new types that enter the system.
The only version I have now is ignorant of subtyping relations, so each class is treated as a unique thing. That's sufficient for now, and is probably all that can be done for JDK 7.
Thanks for the searching comments and questions.
-- John
TL;DR version: The scheme you laid out here seems sound.
(To avoid relays/forks of discussions, I tried joining and CCing the group)
My only concern about this is whether it precludes
further useful support for the general problem
of "phased computation", which is in turn either a
variant of or a generalization of fork/join computation,
depending on how you look at it. Since JDK7 includes
integrated support for both ForkJoinTask and Phaser,
both of which will probably be used in JDK8 lambdas etc.,
it seems almost a sure thing that people will further
explore underlying JVM support. Here's a synopsis of
issues. (Please bear with the background before
returning to questions at hand.)
Phased and fork/join computations impose enough structure
on parallel computations to allow cheaper low-level
memory-level accesses. Well, at least when they are used in
the intended ways; among the main differences between JDK7
support and explicitly parallel languages like X10
(supporting "clocks" (a form of Phasers) and async/finish
(a form of fork/join)) is that Java compilers cannot
enforce correct usage (although tools and IDE plugins
could be improved/developed that could do so in most cases).
As warmup, for fork/join, the main economies are due to the
static DAG structure of computation. Among other impacts,
accessing a variable modified by a joined task requires no
memory-level sync beyond what was required anyway
to determine the join. (We currently have no way to
designate such variables or usages; better support
probably awaits this.)
For phased computations, the main economies apply
to "phased" variables -- those that may take a
new value on each phase but are constant within phases.
(We also don't have a way of designating these.)
Conceptually, a phased variable is a holder of
two values/variables, call them "current" and "next",
where reads are from "current" and writes are
to "next". In existing JMM-ese, "current" is basically a
non-volatile variable, and "next" is volatile. And
conceptually, upon each phase advance, each party
does "current = next". There are a few ways to
implement this differently though. One common option
is to use only a single location per phased variable,
and to arrange an action taken upon phase advance
to do in-place updates (see for example Phaser.onAdvance,
http://gee.cs.oswego.edu/dl/jsr166/dist/docs/java/util/concurrent/Phaser.html
which must by nature be performed in exclusion since
all other parties are blocked until advance). Another
option is to use a two-slot array per phased variable,
indexed by the lowest bit of phase number: reads are
from slot[phase & 1], writes are to slot[~phase & 1]
(assuming that phase numbers are explicit and sequential).
This avoids the need for copying. Also, across
all of these, it is possible and sometimes desirable for
parties to use their own writes early, before advance.
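A rough sketch of the two-slot option, with a made-up PhasedVariable class and the phase number supplied by the caller (e.g. from Phaser.getPhase()); the slots are plain fields because the phase advance itself is assumed to supply the ordering:

public class PhasedVariable<T> {
    @SuppressWarnings("unchecked")
    private final T[] slot = (T[]) new Object[2];

    /** Read the value for the given (current) phase: a plain read. */
    public T read(int phase) {
        return slot[phase & 1];
    }

    /** Write the value that becomes current in the next phase. */
    public void writeNext(int phase, T value) {
        slot[~phase & 1] = value;
    }
}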
Sadly enough, I don't know of a recipe determining which
of these options is best in a given situation. (Although
I do know one of the ingredients: whether you require
atomicity of updates across all parties upon advance.)
Maybe some further study and exploration could arrive at a
good recipe.
In the mean time, note that only one of these options
(explicit current+next) even has a reasonable interpretation
in existing JVM-ese. The others are mismatched in that
some of the operations have JMM-volatile semantics and some
don't, so it doesn't make sense to declare the variable
as either volatile or non-volatile.
John's proposal for MutableCallSite falls into this category,
hence the controversy about how to express it.
It is a specialization of the onAdvance (aka sync()) approach
with in-place updates, and safepoint generations serving
as phases. This seems reasonable given the likely need for
atomicity during safepoints anyway.
While ideally it would be nice to integrate this with
other phased computation even now (I can imagine defining
adjunct classes for use with j.u.c.Phaser), I'm not too
tempted to do so right now, because it is not always the
best available option.
However, someday, it would be great to be able
to allow use of the same mechanics in each. As parallelism
becomes more widespread, we must make it easiest and
most attractive for people to use simpler constrained
execution control patterns like phasers and FJ rather than the
ad-hoc unstructured use of threads/volatiles/locks. One
way to help reach this goal is to improve JVM-level support
so that the most structured parallel code is also the fastest
parallel code.
-Doug
> On 12/09/10 19:09, John Rose wrote:
>> I started a thread on Google Groups to get more advice on safepoint-based
>> invalidation, which the EG is naming MutableCallSite#sync.
>>
>> http://groups.google.com/group/jvm-languages/browse_thread/thread/9c9d3e84fc745676#
>>
>
> TL;DR version: The scheme you laid out here seems sound.
Thanks very much, Doug. I assume you are primarily referring to part (a) of my question:
>> In order to work, the specification has to (a) make logical sense in the terms of the JMM, (b) be reasonably implementable by JVMs, and (c) be useful to programmers.
Does anyone have a take on (c)? (See my comments below on using this API...)
> My only concern about this is whether it precludes
> further useful support for the general problem
> of "phased computation"...
I hope not, though JSRs 292 and 166 are an instance of:
http://en.wikipedia.org/wiki/Conway's_Law
The worst case is that we are seeing the good being the enemy of the better. The best case is we are doing initial exercises in JDK 7 which will get well-structured in JDK 8. I vote for best case.
The device of associating a "nonce-volatile" with a safepoint-phased variable update is generally applicable, not just to MutableCallSite. Perhaps once we have a suitable LValue abstraction (able to efficiently refer to x.f or a[i], which are your two implementations of phased variables), we can make a standard operation LValue.setOnceVolatile.
Doug, you've asked before for LValue. As we've discussed, in order to do LValue, the JVM needs very reliable scalarization. This means alleviating certain obstacles from the JOM (Java Object Model), notably pointer comparison and monitor-per-object. Tuples would help also. The MLVM project needs to work on this stuff; the key shortage is in the most valuable resource, which is people (esp. HotSpot hackers) to think and work on it.
> For phased computations, the main economies apply
> to "phased" variables -- those that may take a
> new value on each phase but are constant within phases.
Thanks for this concept.
> Sadly enough, I don't know of a recipe determining which
> of these options is best in a given situation. (Although
> I do know one of the ingredients: whether you require
> atomicity of updates across all parties upon advance.)
> Maybe some further study and exploration could arrive at a
> good recipe.
In the JSR 292 case, we are supporting languages (like Java!) which can dynamically change their type schemas, but do so infrequently, perhaps at phase changes. The changes will in general require complex sets of call sites to be updated, implying a dependency mechanism (see the Switcher class for this). But the actual change-over can be slow, and may involve safepoints, as today with devirtualization that happens when new classes are loaded.
(See Universe::flush_dependents_on around line 1100 of:
http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/share/vm/memory/universe.cpp )
The idea is that the type schema change would be prepared (as a "next global state") by a writer thread #1. At this point some call sites become "invalid in next state". Those call sites would be switched, either by a switcher (which would in general control many call sites at once) or by editing them individually, and then calling 'sync' on them.
The new state of all call sites must at this point be consistent with both the current and next global state of the type schema. Typically, a lock would be held on the schema by writer thread #1, and the updated call site paths (now switched in and sync-ed) would pile up on the schema lock. After the next global state is installed as current, the writer thread #1 lets go of the lock, and callers are free to shake out the new state of affairs.
The shake-out may include installing optimized call paths in various call sites. This means that a mutable call site might undergo two mutations, the first to get into a state consistent with both old and new schemas (perhaps seizing a lock, as above) and the second to optimize calling within the new state. Both transitions generally need to grab a lock so they can have a consistent view of the global schema state.
As an extra complication, the editing of call sites is probably performed in a concurrent manner, and lazily. There might be a writer thread #2, etc., performing updates to mutable call sites. It is not clear to me whether the second write also needs a sync, but it looks like that may be the case to get performance. Perhaps the right way to go is to avoid the second write, have the first write install the optimized path immediately, and queue a 'sync' request to be executed after a short delay. One key concern is to avoid "safepoint storms", by batching 'sync' requests.
> In the mean time, note that only one of these options
> (explicit current+next) even has a reasonable interpretation
> in existing JVM-ese. The others are mismatched in that
> some of the operations have JMM-volatile semantics and some
> don't, so it doesn't make sense to declare the variable
> as either volatile or non-volatile.
>
> John's proposal for MutableCallSite falls into this category,
> hence the controversy about how to express it.
> It is a specialization of the onAdvance (aka sync()) approach
> with in-place updates, and safepoint generations serving
> as phases. This seems reasonable given the likely need for
> atomicity during safepoints anyway.
I'd like to build both on top of LValue operations, but not in JDK 7.
Someone with a thesis to write should propose some conservative additions to the JMM to operate on LValues (or something like them). They can start with nonce-volatiles, but there are probably better ideas waiting to be recognized.
> While ideally it would be nice to integrate this with
> other phased computation even now (I can imagine defining
> adjunct classes for use with j.u.c.Phaser), I'm not too
> tempted to do so right now, because it is not always the
> best available option.
Let's keep thinking about building blocks, for inclusion in a JDK 7+O(1).
> One
> way to help reach this goal is to improve JVM-level support
> so that the most structured parallel code is also the fastest
> parallel code.
Yes. If the fastest code is unstructured, the fastest code will be unreasonably expensive in most cases. Which would be dumb economics.
-- John
We'd be happy to take further comments on this alias also.
-- John
On Dec 11, 2010, at 6:18 AM, Rich Hickey wrote:
> Thanks for this work, John!
>
> I'm concerned about the separation of update and publication in this
> API...
Yes, syncTargets would be very powerful. But it would also be hard to
require of all JVMs. Speaking only for HotSpot, we don't do everything
with safepoints, because they are expensive. We use racy updates
whenever we can get away with it. The cost of a racy update is the
co-existence of two states, unpredictably visible to various threads.
I think that's a normal complexity cost for doing good language
implementations.
I think mapping a global atomic update to the JMM would require
more "magic edges" in the happens-before graph. The proposal
I posted, while weaker, has a correspondingly simpler impact on
the JMM. This is another way of observing that JVMs are likely to
have an easier time of adding the proposed functionality.
So a globally atomic update is harder to implement and harder
to specify. It is also overkill for a common use case, which is
delayed optimization of call sites. See below...
> Rich,
> I don't think you can provide an optimized method handle when syncing but
> more probably a default generic method that will later installs a fast path.
Thanks, Remi, for explaining this. I'm going to pile on here.
(I have one comment on your code; see below.)
I think of the pattern Remi sketches as a 2-phase commit.
Phase 0 and phase 2 are long-term phases. Phase 1 is brief but not atomic.
Phase 0 is the reign of the old target T0, before any MCS.setTarget.
Phase 1 starts when metadata locks are grabbed.
Under the locks, MCS.setTarget installs a default generic method T1.
(This is analogous to the JVM trick of setting a call site to the "unlinked" state.)
T1 is not racy. It is careful to grab a reader lock on metadata.
It is likely to install an optimized method T2, via a simple setTarget.
This may happen after an invocation count, or after user-level profiling.
(Therefore, it does not make sense to try to guess at T2 during phase 1.)
The MCS.sync operation is performed during phase 1, after all
relevant setTargets are done. It has the effect of excluding threads
from observing target T0. (I.e., it "flushes" T0 from the system.)
Phase 2 starts when metadata locks are released. During phase 2,
individual threads eventually execute T1. T1 lazily decides to install T2.
(Or several equivalent versions of T2.)
Threads which observe T1 (because of caching or inlining) will perform
sync actions which will force them to observe more recent call site targets.
During phase 0, T2 cannot be observed. During phase 2, T0 cannot be observed.
The intermediate target T1 can be observed during any phase.
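For concreteness, a hedged sketch of the writer and reader sides, using this thread's draft names (java.dyn, a MutableCallSite.sync over the impacted sites); the real signatures may differ, and the lock, metadata mutation and fast-path builder are placeholders:

import java.dyn.MethodHandle;
import java.dyn.MutableCallSite;
import java.util.List;

class SchemaUpdater {
    final Object schemaLock = new Object();

    // Writer: phase 1. Retarget everything to the generic path T1, then sync.
    void invalidate(List<MutableCallSite> impacted, MethodHandle genericT1) {
        synchronized (schemaLock) {
            // ... mutate the type schema here ...
            for (MutableCallSite cs : impacted) {
                cs.setTarget(genericT1);            // slow but always-correct path
            }
            MutableCallSite.sync(impacted.toArray(new MutableCallSite[0]));
        }                                           // phase 2 begins after release
    }

    // Reader side, called lazily from inside the generic target T1.
    void installFastPath(MutableCallSite cs) {
        synchronized (schemaLock) {                 // consistent view of the schema
            MethodHandle t2 = buildFastPath(cs);    // guard + fast path, elided
            cs.setTarget(t2);                       // plain setTarget, under the lock
        }
    }

    private MethodHandle buildFastPath(MutableCallSite cs) {
        return cs.getTarget();                      // placeholder
    }
}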
Compare that with Rich's proposed atomic syncTargets operation,
which would exclude phase 1 and target T1, for better and worse.
Another way of comparing syncTargets with setTarget+sync is
simply that syncTargets excludes target T1 from phase 0, whereas
the weaker proposal does not exclude T1.
This weakness can also be described in terms of two reader
threads, a Fast Reader and a Slow Reader. The Fast Reader
sees the result of the writer's setTarget of T1 in the same
nanosecond. The Slow Reader sees only T0 until it is
forced to pick up T1 by the sync operation.
-- John
> So if the result of setTarget is visible before the sync call, it will
> execute
> a default generic method which will also enter in a synchronized block.
> This will effectively coordinate things between readers and writers.
>
> invariant broken (writer):
> synchronized(masterLock) {
> // mutate meta data
> // update all callsites
> foreach(impacted callsites) {
> callsite.setTarget(callsite.defaultGenericPath);
> }
> sync(impacted callsites);
> }
>
> default generic method (reader):
> synchronized(masterLock) {
> // check arguments
> // use meta data to create a fast path
> }
(This next line should also be synchronized. -- JRR)
> setTarget(guard + fastpath);
>
>
> This pattern intends to mimick the way the VM do lazy deoptimization
> i.e mark the callsite as deoptimized and later when the callsite is used
> re-create a fast path.
[...]
>> So if the result of setTarget is visible before the sync call, it will
>> execute
>> a default generic method which will also enter in a synchronized block.
>> This will effectively coordinate things between readers and writers.
>>
>> invariant broken (writer):
>> synchronized(masterLock) {
>> // mutate meta data
>> // update all callsites
>> foreach(impacted callsites) {
>> callsite.setTarget(callsite.defaultGenericPath);
>> }
>> sync(impacted callsites);
>> }
>>
>> default generic method (reader):
>> synchronized(masterLock) {
>> // check arguments
>> // use meta data to create a fast path
>> }
> (This next line should also be synchronized. -- JRR)
>
>> setTarget(guard + fastpath);
Correct me if I'm wrong.
The last line can be in the synchronized block, but it doesn't have to be.
The JMM guarantees that the current thread will see the new target (T2).
Other threads can see T1 and in that case will update to a T2' which is
semantically equivalent to T2.
It's a racy update.
>>
>> This pattern intends to mimick the way the VM do lazy deoptimization
>> i.e mark the callsite as deoptimized and later when the callsite is used
>> re-create a fast path.
Rémi
This API has several presumptions:
- There is some singular global metadata.
- It is protected by a lock. What if it is an immutable structure in
an AtomicReference and modified via CAS? Or passed by value to targets/
handles?
- Call sites will always use a two-phase update, i.e. they will need
to rediscover their binding vs directly target it (during metadata
update) given their cached data and the new metadata. You are
piggybacking on this presumption to provide a consistency hook
(blocking on the read lock).
While T1 is not strictly MT racy (given the adoption of this pattern
wholesale), it is prone to error, in that should the update phase
interleave ordinary code and setTarget calls, it can see an
inconsistent state (i.e. nothing keeps call sites the update thread
encounters from moving to T1.5 prior to the sync call, since the
updater thread holds the lock). Many dynamic language implementations
use their own dynamic code in a way that could trip over this.
This is an API that only works correctly given an extremely
constrained pattern of use. That's fine, if unfortunate, but needs to
go in the docs I think. It is essential that the pattern be - grab the
lock, then 'modify' all metadata (no setTargets), then update all
targets (no calls through callsites), then sync (under the lock), and
that the targets *must be* stubs that grab the lock and then a) do
nothing (if not lazy) or b) setTarget (under the lock), if lazy.
The phases you describe are derived from this pattern and do not fall
out of the API.
I proposed one thing (syncTargets), and asked one question (does that
obsolete setTarget).
The use case Remi and you describe makes it clear that something like
setTarget (with ordinary field synchronization semantics) would still
have utility for two-phase callsites. It would be overkill to sync
again, presuming fastpath calculation is idempotent and acceptable to
do over in multiple threads. However, it's error prone, as Remi's
example is broken, as you pointed out - setting the target outside of
the lock means the site can (permanently) miss an update.
If syncTargets were possible, you'd require no locks, and you might
want something more like a CAS-based swapTarget for lazy sites.
Coordination around metadata (if any) would be completely orthogonal.
invariant broken (writer):
{
// calculate new metadata
// determine new targets
// update all callsites
syncTargets(impacted callsites, new targets);
}
default generic method (reader):
{
// check arguments
// create a fast path
swapTarget(this?, guard + fastpath);
}
I'll leave it to your expertise whether this combination is harder to
implement and harder to specify. That certainly dominates. The
benefits of syncTargets and swapTarget from an API perspective are
that, given appropriate regard for the expense of syncTargets - you
can't get it wrong, the API itself delivers the coordination
semantics, there is no elaborate pattern to replicate, and the field
of use is broader.
Rich
>>> default generic method (reader):
>>> synchronized(masterLock) {
>>> // check arguments
>>> // use meta data to create a fast path
>>> }
>> (This next line should also be synchronized. -- JRR)
>>
>>> setTarget(guard + fastpath);
>
> Correct me if I'm wrong.
> Le last line can be in the synchronized block but it don't have to.
> The JMM guarantees that the current thread will see the new target (T2).
> Other threads can see T1 and in that case will update to a T2' which is
> semantically equivalent to T2.
> It's a racy update.
Racy updates have the risk of not getting picked up by Slow Reader threads.
But, as you point out, in this case T1 synchronizes any Slow Reader and forces him to see T2.
So, you're right.
I guess the principle is that racy updates are OK as long as the previous state (T1) is allowed and will eventually force all readers to go to the new state (T2).
-- John
Why can't T3 happen between the synchronized block and the setUpdate
call, and get overwritten with a T1-based decision?
Rich
It can. I was wrong.
setTarget() has to be in the synchronized block.
Rémi
>> Why can't T3 happen between the synchronized block and the setUpdate call, and get overwritten with a T1-based decision?
>>
>> Rich
>>
>
> It can. I was wrong.
> setTarget() has to be in the synchronized block.
Foo; I agree. The T1 update can float into the far future and screw things up (overwriting T3).
In our EG meeting this morning we decided to provide better usage guides in the javadoc.
Rich, I'm going to think a while about your alternative proposal of syncTargets and swapTarget:
On Dec 15, 2010, at 9:22 AM, Rich Hickey wrote:
> The benefits of syncTargets and swapTarget from an API perspective are that, given appropriate regard for the expense of syncTargets - you can't get it wrong, the API itself delivers the coordination semantics, there is no elaborate pattern to replicate, and the field of use is broader.
Quick responses:
SyncTargets would be a true global atomic transaction, with barriers in both directions to floating reads and writes.
If you were willing to update the targets one at a time, but still wanted two-way barriers, you could use VolatileCallSite, and file a performance bug for JVMs to optimize the case of a stable volatile.
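For the one-at-a-time case, that might look like this (purely illustrative; draft java.dyn names from this thread):

import java.dyn.MethodHandle;
import java.dyn.VolatileCallSite;

class SingleSiteRetarget {
    // VolatileCallSite gives each setTarget volatile publication semantics,
    // so no explicit sync is needed, at the cost of a volatile read per
    // invocation until JVMs learn to optimize the stable-volatile case.
    void retarget(VolatileCallSite site, MethodHandle newTarget) {
        site.setTarget(newTarget);   // visible to other threads on their next read
    }
}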
The CAS idea does not translate well to function pointers, because they are necessarily opaque. (Function equivalence is undecidable...) The Object.equals method on them is pretty meaningless. I think you would need a double-CAS of some sort, which replaced a <descriptor, method handle> pair with a new version.
I think I need a better understanding of how STM techniques are implemented on a generic JVM, and what the pain points are.
-- John