HTM and synchronized blocks


Gil Tene

Feb 2, 2014, 1:48:56 PM
to mechanica...@googlegroups.com
Peter Lawrey just blogged on "Hardware Transactional Memory in Java, or why synchronized will be cool again," and there is a lively discussion on the use of newly available HTM features (e.g. in Haswell x86) for synchronized blocks.

For reference, I uploaded an ancient slide deck of mine on the subject, titled "Speculative Locking: Breaking the Scale Barrier (JAOO 2005)". As you can imagine, with Haswell-based commodity servers likely coming next year, this subject will be reopened and much discussed.

Having built up quite a bit of practical, real-world experience with synchronized blocks and HTM (through three generations of Vega hardware and supporting JVMs), I think we'll see mixed results for code that wasn't written with HTM in mind. But to Peter's point, and especially for you mechanically sympathetic developers out there, internal use of synchronized blocks when a JVM auto-backs them up with HTM can be an interesting new capability.

Peter Lawrey

Feb 2, 2014, 2:30:28 PM
to mechanica...@googlegroups.com

In terms of timelines, I have a couple of Haswell laptops and am planning to order a dual E5-2600v2 system this week.


Martin Thompson

Feb 2, 2014, 2:56:46 PM
to mechanica...@googlegroups.com
Attached is a presentation on what's new in Haswell from Ravi Rajwar, who invented Lock Elision. I managed to get him out of hiding to speak at QCon SF.


QConSF2012_rajwar.pdf

Michael Barker

Feb 2, 2014, 3:53:45 PM
to mechanica...@googlegroups.com
Was there a video of his talk?

Mike.

On 3 February 2014 08:56, Martin Thompson <mjp...@gmail.com> wrote:
> Attached is a presentation on what's new in Haswell from Ravi Rajwar, who
> invented Lock Elision. I managed to get him out of hiding to speak at QCon SF.
>
>

Martin Thompson

Feb 2, 2014, 4:25:10 PM
to mechanica...@googlegroups.com
Unfortunately it was either not recorded or not posted, as far as I can tell.


On Sunday, 2 February 2014 20:53:45 UTC, mikeb01 wrote:
Was there a video of his talk?

Mike.

On 3 February 2014 08:56, Martin Thompson <mjp...@gmail.com> wrote:
> Attached is a presentation on what's new in Haswell from Ravi Rajwar, who
> invented Lock Elision. I managed to get him out of hiding to speak at QCon SF.
>
>

ben.c...@alumni.rutgers.edu

Feb 2, 2014, 10:54:33 PM
to mechanica...@googlegroups.com

Thank you, Peter, for sharing this (and to Martin and Gil for providing the Intel/Azul presentations ... very helpful).

> [...] and especially for you mechanically sympathetic developers out there, internal use of synchronized blocks when a JVM auto-backs-them-up with HTM can be an interesting new capability.

It is an exceedingly interesting new capability. However, I am almost certain that my current understanding cannot possibly be correct.

Can someone kindly correct me where I am wrong here (please be blunt, forceful, even rude where I am way off -- I would be very grateful):

Consider the two synchronized tx methods defined below and how they operate on an HTM_CACHE operand along a vertical time axis. Because the methods are both defined as
synchronized, they exhibit isolation=PESSIMISTIC_SERIALIZABLE from the Java byte-code view. Pessimistically, neither can impact the other while one owns the block's lock. However, if the HTM run-time re-writes them to be isolation=OPTIMISTIC_READ_COMMITTED ... can the actual behaviour at run-time REALLY be what is shown in the execution time map below?

Please, be blunt if I am way, way off.

Musing openly, TBD: does the @t=2 bal=access_cache(); operation get rolled back automatically @t=3 (DIRTY_READ conflict discovered) by the HTM cache coherency protocol?

Execution time map (vertical time axis, @t=0 through @t=4):

HTM_Thrd_1 (Operator) executes:

public synchronized void tx_mutate_cache() {

    /*
     * From the Java byte-code view this method executes with
     * isolation=PESSIMISTIC_SERIALIZABLE, i.e. when HTM_Thrd_1 enters/owns the
     * lock, HTM_Thrd_2 may not perform any CRUD operation on the HTM_CACHE
     * operand.
     *
     * From the HTM-modified machine-code view, however, the native locking
     * protocol is re-written to be isolation=OPTIMISTIC_(SPECULATIVE)_READ_COMMITTED,
     * i.e. when HTM_Thrd_1 enters/owns this block, HTM_Thrd_2 may still perform
     * a READ operation. However, if the HTM coherency protocol at any time
     * determines that any optimistic READ has been exposed to DIRTY_READ risk
     * (or any other coherency conflict), the HTM-modified machine code reverts
     * back to the isolation=PESSIMISTIC_SERIALIZABLE flow of control and
     * retries the entire transaction.
     */

    mutate_cache($200);      // @t=1

    /* does processing */

    /* ROLLBACK */           // @t=3
}

HTM_Thrd_2 (Operator) executes:

public synchronized void tx_access_cache() {

    /* (Same byte-code view vs. HTM-modified machine-code view isolation
       commentary as in tx_mutate_cache() above.) */

    /* from the Java view, @t=2 nothing happens (isolation=SERIALIZABLE) */

    /* from the HTM native view, @t=2 the READ is allowed to proceed,
       speculating that isolation=OPTIMISTIC_READ_COMMITTED will succeed */

    bal = access_cache();    // @t=2: bal assigned $200.00

    /* does processing */

    /* from the HTM native view, @t=3 the READ performed @t=2 (bal=$200) must
       now be scored as a DIRTY_READ conflict */

    if (no_coherency_conflict()) {
        Commit();
    } else {
        RollBackRetry();     /* this is what EXECUTES from the HTM view */
    }
}

HTM_CACHE (Operand) value along the time axis:

@t=0   $100.00   (initial value; both threads enter their blocks)
@t=1   $200      (mutated by HTM_Thrd_1, not yet committed)
@t=2   $200      (speculatively read by HTM_Thrd_2 into bal)
@t=3   $100      (HTM_Thrd_1 rolls back; DIRTY_READ conflict discovered)
@t=4   $100      (HTM_Thrd_2 takes the RollBackRetry() path)


Gil Tene

Feb 3, 2014, 1:04:14 AM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu
Ben, the thing that makes the situation you describe impossible is that the mutated $200 value placed in the HTM_CACHE operand will not be made visible to any other thread until the end of the transaction it was mutated in. If another thread attempts to observe the value while the transaction is still in flight, the read will cause the transaction to fail, rolling back any mutations before they are ever visible to others.

I'm not sure where the isolation states you describe (PESSIMISTIC_SERIALIZABLE, OPTIMISTIC_READ_COMMITTED, etc.) are from. I don't think of HTM (in the Vega and Intel TSX sense) in those terms. The HTM we are talking about is localized between the CPU core and its cache (usually the L1), and uses regular (unmodified) coherence protocol behavior to provide the transactional conflict detection.

I like describing this HTM thing simplistically, as I find it builds simple understanding and intuition. A simple description of how the speculate/commit/abort stuff works in both Vega and x86 (AFAIK) looks something like the following (I'm intentionally ignoring the core-internal store buffer for the purpose of this discussion):

- Each cache line in the L1 tracks "speculatively read" and "speculatively written" states. These states are always (logically) clear outside of speculative transaction execution.

Read and write access under speculation:
- Whenever a processor reads from a memory location while in speculative state, the associated L1 cache line is marked with something indicating "speculatively read". Such lines will necessarily be held in L1 in a readable state, which means that no other processor has an exclusive (or dirty) copy of the line.
- Whenever a processor writes to a memory location while in speculative state, the associated L1 cache line is marked with something indicating "speculatively written". Such lines will necessarily be held in L1 in an exclusive state, which means that no other processor has a readable copy of the line. 
- [Small corner case: when a dirty line (not speculatively written) in L1 is written to while in speculative state, the contents of the line will first be pushed to a lower-level (closer to memory) cache before the speculative store affects it. Let's skip the various interplay details this creates and assume the hardware takes care of things transparently.]

Commits and aborts:
- When a processor commits a transaction (leaves the speculative state via a "commit" operation), all L1 cache lines have their "speculatively read" and "speculatively written" states cleared. Nothing else needs to happen at the cache level.
- When a processor aborts a transaction (for any reason, including an explicit "abort" operation or a detected conflict), all L1 cache lines marked as "speculatively written" are marked invalid, and all L1 cache lines have their "speculatively read" and "speculatively written" states cleared.

Causes for aborts:
- An abort will be triggered by an explicit "abort" operation, or by any processor operation that may be required to abort (e.g. on x86 I believe a ring change causes an abort).
- Any situation where an L1 line marked as "speculatively read" or "speculatively written" is evicted from the cache will cause an abort, and any situation where a "speculatively written" line loses exclusivity will cause an abort (the abort will occur "before" the eviction or loss of exclusivity happens). Reasons for such eviction include:
  - Another processor attempting to write to a "speculatively read" or "speculatively written" line. Before the other processor successfully does so, it will require an exclusive copy of the line, which will first require the "speculatively read" or "speculatively written" line to be evicted from this L1.
  - Another processor attempting to read from a "speculatively written" line. Before the other processor can successfully read from the line, it must first obtain a copy of the line, which will first require that this L1 lose exclusivity.
  - The cache may be first to evict a "speculatively read" or "speculatively written" line due to capacity reasons. E.g. a 9th (different location) read that falls in the same set of an 8-way associative cache can cause such an eviction. An abort must be taken because the cache will lose its ability to track speculative states past the eviction.
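To make those rules concrete, here is a toy Java model of the per-line speculative state tracking described above; all class, field, and method names are invented for this sketch, and real hardware implements this in cache-controller state rather than software:

import java.util.HashMap;
import java.util.Map;

/** Toy model of one core's L1 speculative-state tracking and abort rules. */
class SpeculativeL1 {
    static final class Line {
        boolean speculativelyRead;
        boolean speculativelyWritten;
        boolean valid = true;
    }
    private final Map<Long, Line> lines = new HashMap<>();
    private boolean inTransaction;

    void begin() { inTransaction = true; }

    void read(long addr) {
        // Reads under speculation mark the line "speculatively read".
        if (inTransaction) line(addr).speculativelyRead = true;
    }

    void write(long addr) {
        // Writes under speculation mark the line "speculatively written"
        // (the line is held exclusively; no other core keeps a readable copy).
        if (inTransaction) line(addr).speculativelyWritten = true;
    }

    void commit() {
        // Commit: just clear the speculative marks; the data stays valid.
        for (Line l : lines.values()) clearMarks(l);
        inTransaction = false;
    }

    void abort() {
        // Abort: invalidate speculatively written lines, clear all marks.
        for (Line l : lines.values()) {
            if (l.speculativelyWritten) l.valid = false;
            clearMarks(l);
        }
        inTransaction = false;
    }

    /** Another core requests this line (shared read, or exclusive for a write). */
    void remoteRequest(long addr, boolean exclusive) {
        Line l = lines.get(addr);
        if (l == null || !inTransaction) return;
        // A remote write to any marked line, or a remote read of a
        // speculatively written line, forces an abort before it proceeds.
        if ((exclusive && (l.speculativelyRead || l.speculativelyWritten))
                || (!exclusive && l.speculativelyWritten)) {
            abort();
        }
    }

    void evictForCapacity(long addr) {
        Line l = lines.get(addr);
        // Losing a marked line means speculation can no longer be tracked: abort.
        if (l != null && (l.speculativelyRead || l.speculativelyWritten)) abort();
        lines.remove(addr);
    }

    private Line line(long addr) { return lines.computeIfAbsent(addr, a -> new Line()); }
    private static void clearMarks(Line l) { l.speculativelyRead = l.speculativelyWritten = false; }
}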


Gil Tene

Feb 3, 2014, 1:13:33 AM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu
BTW, our Vega machines have been shipping with this HTM capability since 2004, and our Vega JVM has been using it for automatically executing synchronized blocks transactionally since late 2005. Beyond the hardware magic (which is fairly simple to control), we found that the JVM benefits only when adaptive use of the speculative logic is made. I.e. much like the adaptive "Thin/Thick" locking modes optimize for un-contended cases (with biased locking being a further optimization), using speculation only in the presence of contention helps "pay the extra speculation costs" only when doing so will actually be beneficial. Similarly, avoiding speculation for degenerately aborting transactions helps avoid abort/retry costs. The JVM "learns" very quickly which lock instances are worth speculating on and which are either not worth it or don't need it, much like it does with the common monitor inflation and deflation heuristics that all JVMs already use for thin/thick locking optimizations.
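A minimal sketch of that kind of per-monitor adaptation, with invented names and thresholds; a real JVM does this inside the monitor implementation, not in Java code:

import java.util.concurrent.atomic.AtomicInteger;

/** Toy per-monitor heuristic: speculate only while it keeps paying off. */
class AdaptiveMonitorStats {
    private final AtomicInteger speculativeCommits = new AtomicInteger();
    private final AtomicInteger aborts = new AtomicInteger();
    private static final int MIN_SAMPLES = 32;          // invented threshold
    private static final double MAX_ABORT_RATE = 0.25;  // invented threshold

    boolean shouldSpeculate(boolean contended) {
        if (!contended) return false;            // uncontended: a thin lock is already cheap
        int c = speculativeCommits.get(), a = aborts.get();
        if (c + a < MIN_SAMPLES) return true;    // still learning this lock instance
        return a < MAX_ABORT_RATE * (c + a);     // give up on degenerately aborting locks
    }

    void recordCommit() { speculativeCommits.incrementAndGet(); }
    void recordAbort()  { aborts.incrementAndGet(); }
}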

ben.c...@alumni.rutgers.edu

Feb 3, 2014, 2:16:18 AM
to mechanica...@googlegroups.com

> Ben, the thing that makes the situation you describe [...]


Gil, first and foremost, thank you so much for the effort you took to even assess the situation I was describing.  I marvel at the incredibly generous and informative responses you (and others) provide in this forum.  Again, thank you.



> the mutated $200 value placed in the HTM_CACHE operand will not be made visible to any other thread until the end of the transaction it was mutated in. If another thread attempts to observe the value while the transaction is still in flight, the read will cause the transaction to fail, rolling back any mutations before they are ever visible to others.

Ah, there is the point of clarity I needed, provided and confirmed. I mistakenly assumed that an "HTM speculative read" might behave like an "ACID optimistic read" of the uncommitted, mutated (i.e. $200) cached data. In ACID, to prevent that read optimism from being betrayed with the potential disaster of data inconsistency, the ACID-style TM must verify -- at the time that the reader thread commits -- that no "DIRTY_READ" conflict took place. An "ACID pessimistic read" would just outright BLOCK at the time the thread attempted access on the uncommitted, mutated (i.e. $200) cached data.


> I'm not sure where the isolation states you describe (PESSIMISTIC_SERIALIZABLE, OPTIMISTIC_READ_COMMITTED, etc.) are from.

They actually don't exist as literal isolation domain values in the form I presented them! (I took a little too much license -- I provided the suffixes strictly to be illustrative.)

These things (of course) basically come from the java.sql.Connection API ... where we apply the familiar "do me an ACID transaction" coding pattern of

//get DataSource
//get Connection
connection.setAutoCommit(Boolean.FALSE);
try {
       // indicate to the ACID TM that the Txn is DIRTY_READ intolerant
       connection.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
       //atomic CRUD Op 1
       //atomic CRUD Op 2
      ....
       //atomic CRUD Op N
       connection.commit();
} catch (Exception e) {
      connection.rollback();
}

which is the classic way to code a DIRTY_READ intolerant ACID transactional thread in Java.


> I don't think of HTM (in the Vega and Intel TSX sense) in those terms. The HTM we are talking about is localized between the CPU core and its cache (usually the L1), and uses regular (unmodified) coherence protocol behavior to provide the transactional conflict detection.

Yep, I am seeing it now too ... it is possibly a bit unfortunate that HTM uses the word "Transactional" ... an HTM Transaction is not at all an ACID Transaction. An HTM coherency protocol manager is different from an ACID transaction manager. These are points of clarity I lacked a few hours ago ... and that I am now starting to render into (at least a little bit better) focus. :-)

> I like describing this HTM thing simplistically, as I find it builds simple understanding and intuition

+1 ... but for me, I kind of first have to "gargle everything" until I realize and establish what I don't know before I can have credibility with myself that I understand.

Off now to study your response! Again, Gil, thank you.

Rüdiger Möller

Feb 3, 2014, 9:34:46 AM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu
Good post! It answers a lot of the questions that come to mind regarding what triggers an abort.
Actually there are many conditions triggering an abort.
Some questions:
If the programmer uses coarse-grained locking, this will only be successful on lightly contended data structures, won't it?
Does allocation inside speculative code increase the probability of an abort due to modifications of common data in the GC?
If cache eviction triggers an abort, there will be even more need to control memory data layout (or let the VM get more clever doing that).
How expensive is an abort?

Peter Lawrey

Feb 3, 2014, 9:52:43 AM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu

See inline.

On 3 Feb 2014 14:35, "Rüdiger Möller" <moru...@gmail.com> wrote:
> Some questions:
> If the programmer uses coarse-grained locking, this will only be successful on lightly contended data structures, won't it?

What matters is low write contention on cache lines. Different threads can alter different parts of a data structure concurrently.

> Does allocation inside speculative code increase the probability of an abort due to modifications of common data in the GC?

AFAICS it doesn't have to if the constructor has no side effects. The memory allocation is usually thread-local.

> If cache eviction triggers an abort, there will be even more need to control memory data layout (or let the VM get more clever doing that).

Aborts can happen rarely even for fairly optimal code, e.g. cache lines happen to exceed the 8-way associativity, or your hyperthreaded CPU does something unusual in the other thread (which uses the same L1 cache).

> How expensive is an abort?

I imagine it can be pretty expensive, so you want the JVM to monitor how often a block of code aborts and change the code to regular locking if this is happening too much.

Gil Tene

Feb 3, 2014, 11:00:35 AM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu


On Monday, February 3, 2014 6:52:43 AM UTC-8, Peter Lawrey wrote:

See inline.

On 3 Feb 2014 14:35, "Rüdiger Möller" <moru...@gmail.com> wrote:
> Some questions:
> If the programmer uses coarse-grained locking, this will only be successful on lightly contended data structures, won't it?

What matters is low write contention on cache lines. Different threads can alter different parts of a data structure concurrently.

Right. To be specific, it's cache line contention between a write and either a read or a write that matters. Shared lines are not an issue. Contention between writers will cause an abort, and a writer contending with some reader (or vice versa) is also an issue. It's also important to note that it is enough for one side of the contention to be in a transaction to cause an abort.
 

> Does allocation inside speculative code increase the probability of an abort due to modifications of common data in the GC?

AFAICS it doesn't have to if the constructor has no side effects. The memory allocation is usually thread-local.

 
With TLAB allocation schemes (which all server JVMs use), the only contention possible through allocation will occur when multiple threads take a new TLAB. Very rare.

The contents of the allocation itself cannot be a contention point (by definition) if it is done within a transaction, because it is impossible for it to be visible to other threads.

> If cache eviction triggers an abort, there will be even more need to control memory data layout (or let the VM get more clever doing that).

Aborts can happen rarely even for fairly optimal code, e.g. cache lines happen to exceed the 8-way associativity, or your hyperthreaded CPU does something unusual in the other thread (which uses the same L1 cache).

Self-eviction (due to capacity) will not be a data layout issue. But contention-driven aborts have the same false-sharing related issues that drove the introduction of @Contended and (without it being available) the various layout tricks discussed on the Mechanical Sympathy blog and elsewhere. False sharing is a bigger problem for transactions than it is for simple writer-writer contention situations, because in those simple contention situations things just slow down through serialization of access to the cache line, while with transactions they will cause transaction aborts, and because they also occur with reader-writer contention.
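For illustration, a hypothetical layout fix in Java 8 terms; the field names are invented, and @sun.misc.Contended requires -XX:-RestrictContended for user code:

// Two hot fields updated by different threads. If they share a cache line,
// every transaction touching one aborts on writes to the other, even though
// there is no logical conflict. @Contended pads each field onto its own line.
class PaddedCounters {
    @sun.misc.Contended   // JDK 8 location of the annotation
    volatile long ingressCount;

    @sun.misc.Contended
    volatile long egressCount;
}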
  

> How expensive is an abort?

I imagine it can be pretty expensive, so you want the JVM to monitor how often a block of code aborts and change the code to regular locking if this is happening too much.


Well, that depends on the platform. In Vega, aborts (and speculation start) were fairly expensive: similar in cost (and in code, actually) to a longjmp and a setjmp, with about 10-20 cycles of additional cost dominated by clearing of the cache state tracking speculative state for all cache lines (that's a 512-bit write operation, and Vega broke it into several cycles). Vega only used HTM to deal with memory state, and left the CPU state to software (hence the longjmp and setjmp).

AFAIK on x86 (Haswell) the CPU takes care of much of the longjmp/setjmp cost under the hood, in a mechanism that can be thought of as a hardware register checkpoint. When using explicit speculation (speculate/commit), software does not have to save CPU state (do a setjmp) coming in to a speculation, or specifically restore it on an abort, but an abort still vectors to different code which has to handle the abort logic and where to go next. Haswell also has an HLE mode that transparently speculates simple CAS-based locks (with no additional software handling beyond an indication on the CAS that HLE is desired).

AFAIK, the abort cost on x86 is similar to a mispredicted branch and an atomic (LOCK) access. And the cost of speculate is similar to an atomic.

HLE is potentially useful even for simple (non-adaptive) locking code. But in a JVM, and in the context of synchronized blocks (Java monitors), it's still an open question whether or not HLE is useful, since the JVM already knows how to separate un-contended monitors from contended ones, and there is no need for HLE on un-contended monitor instances.

In the long run, I expect the explicit (speculate/commit) mode to be used in JVMs along with adaptive (per monitor instance) decisions on whether or not to use it. That's what we do on Vega, and it took more than a year of work with real applications to settle on a monitor-instance-specific heuristic that is good enough to use as a default: one that never hurts (compared to never using speculation), and makes use of speculation only when beneficial. The hardest part in this is dealing with false positive aborts, of which there are plenty (think timer interrupts for example), and we found that failure-cause indication from the hardware was critical for a good heuristic. So much so that Vega2 and Vega3 had specific enhancements in this area (of abort cause indication) compared to our first (Vega1) implementation.
 




 
 

Francis Stephens

Feb 3, 2014, 12:02:21 PM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu
Previously I had spent some time reading Cliff Click's account of the experience Azul had with HTM in the Vega boxes. The big takeaway seemed to be that HTM could work very well except that a lot of code tends to hit a common piece of data, e.g. the mod-count field in collection classes. I would expect that this problem could still dominate most existing code. Is anyone considering building HTM-friendly collections libraries etc.?

As an aside, as I was re-reading Cliff's articles I noticed you guys were using a micro-kernel OS on the Vega boxes. How did you find that?

Ben Cotton

Feb 3, 2014, 12:36:12 PM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu
> I'm not sure where the isolation states you describe (PESSIMISTIC_SERIALIZABLE, OPTIMISTIC_READ_COMMITTED, etc.) are from. I don't think of HTM (in the Vega and Intel TSX sense) in those terms. 

Gil, Peter, et.al.

If possible, could you share your understanding(s) of HTM's complete isolation expectation/obligation terms?  I assume that HTM transactions must provide

isolation=SERIALIZABLE (a global view (for all views) that transactional operands are not impacted by the interleaving of CRUD operations)

isolation=LINEARIZABLE (a local view (for all views) that transactional CRUD operations' sequential correctness is not impacted by the interleaving of threads)

but that HTM is likely completely unbothered wrt these isolation terms (all of which are needed by ACID transactions):

isolation=READ_COMMITTED (accommodates DIRTY_READ-intolerant transactions)

isolation=REPEATABLE_READ (accommodates transactions that depend on repeatable-read guarantees -- even if another tx successfully commits the read operand)

isolation=SERIALIZABLE (accommodates all transactional operation intolerances -- including PHANTOM_READ-intolerant operations -- complete operation isolation)

Those newbies (like myself) first learning HTM "Transactions" will likely benefit from separating out HTM's exact isolation expectations/obligations (and leaving any familiar understanding of ACID "Transactions" isolation expectations/obligations behind).

> The HTM we are talking about is localized between the CPU core and its cache (usually the L1), and uses regular (unmodified) coherence protocol behavior to provide the transactional conflict detection.

It would be cool if HTM were renamed "Hardware Transitional Memory" ... this would avoid the reflexive name-collision confusion that results when newbies learning HTM associate the word "Transactions" with their (likely familiar) understanding of ACID "Transactions". ACID had the term "Transactions" first (since the 1960s).

P.S. Come to think of it, ACID also had the term "Serializable" first too! To avoid newbie name-collision confusion, it would be polite if Java respectfully renamed java.io.Serializable to something else. (LOL, obviously.)

P.P.S. Thanks again for all these (gentle, generous, genius) responses in this thread. Lots and lots of studying (and re-studying) remains for me. This forum is truly a special place.

Gil Tene

Feb 3, 2014, 1:20:49 PM
to <mechanical-sympathy@googlegroups.com>, ben.c...@alumni.rutgers.edu
On Feb 3, 2014, at 9:02 AM, Francis Stephens <francis...@gmail.com> wrote:

> Previously I had spent some time reading Cliff Click's account of the experience Azul had with HTM in the Vega boxes. The big takeaway seemed to be that HTM could work very well except that a lot of code tends to hit a common piece of data, e.g. the mod-count field in collection classes. I would expect that this problem could still dominate most existing code.

At Azul, we call the JVM feature Optimistic Thread Concurrency (OTC). It's the term we use to describe adaptive HTM-assisted java synchronized blocks.

Our real world experience with OTC showed some interesting things. We got it to the "it never hurts" level with about 1.5 years of work, and did see it benefit some real world apps with high thread counts (sometimes you'd see an actual 3x improvement in something real). But in most large/complex applications (usually with app server stacks involved) we did not see anything more than a few percent, and the benefits were very inconsistent.

Think of it this way: OTC and HTM (as applied to locks and synchronized blocks) will win when lock contention is significantly higher than actual data contention in the critical sections the locks protect. This is an obviously common case in software design as locking is an inherently pessimistic thing: We cover a critical section with a lock not because we know the data will be contended, but because we know it *might* be contended, and since we can't allow the contention to occur without atomicity guarantees (for the entire critical section), we end up serializing access to it.

So coarse-grained locking (for whatever you think "coarse" means) is an obvious opportunity.

However, as multi-core platforms started to become more and more common, people obviously care more and more about serializing operations that prevent multiple cores from being used at the same time, and those bottlenecks get worked on. Two things happen because of that:

1. The biggest bottlenecks that people run into get addressed first, leaving less and less for a future OTC/HTM thing to help with. This one is obvious.

2. People spend time to "study" the bottlenecks even if they don't fix them. This one has a less obvious and strongly detrimental effect for OTC/HTM: The first thing people seem to do when they identify a scale-limiting contention point is to instrument it. This often takes the form of adding some counters to the operation. And that in turn makes data contention (on the counters, not the actual data protected by the critical section) equal to lock contention. The very act of adding counters inside critical sections removes most of the OTC/HTM ability to improve them.

So the side effect is that OTC/HTM can help things nobody cared enough about to study...

As I noted earlier, I think the value of transparent synchronized block stuff is limited in code that was not designed with them in mind. But I also think it's a very powerful tool when used right. I.e. when a mechanically-sympathetic developer is aware of OTC, they can much more easily construct code to do stuff with it that would be much harder to build otherwise.

The simplest example of this would be to use synchronized blocks to wrap atomic operations across multiple fields, knowing that serialization will only occur when contention occurs. This can often result in cleaner/simpler/faster code compared to applying concurrent algorithms that can only do single-field atomics.
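A hypothetical sketch of that pattern, with invented names: one coarse monitor guards multi-word updates, and an HTM-assisted JVM can run non-conflicting updates concurrently:

/** One coarse lock guards many slots; both arrays must change together for a
 *  slot, so a synchronized block is the simplest correct code. With HTM-assisted
 *  synchronized, two threads updating slots on different cache lines never
 *  actually serialize on this monitor -- serialization happens only on a real
 *  data conflict. */
class SlotTable {
    private final long[] values = new long[1024];
    private final long[] timestamps = new long[1024];

    synchronized void update(int slot, long value, long nowNanos) {
        values[slot] = value;        // multi-field atomic update, trivially correct
        timestamps[slot] = nowNanos;
    }
}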

> Is anyone considering building HTM-friendly collections libraries etc.?

Look at the presentation link I sent at the start of this thread. It includes a walk-through on how to make a hash table more OTC/HTM-friendly.

> As an aside, as I was re-reading Cliff's articles I noticed you guys were using a micro-kernel OS on the Vega boxes. How did you find that?

"Micro-Kernel" is not the right technical term for the Aztek kernel in Vega.  People often use "micro-kernel" to describe a small, dedicated kernel with no crud in it (which is certainly a good description of Aztek), but in kernel design term a micro kernel refers to a form of modular architecture (ala the Mach operating system, and the kernel design in parts of Darwin and Mac OS).

So the answer will depend on what you mean by "micro-kernel" in the question. Is it Cliff's general use of "small good kernel", or the textbook meaning?



Todd Lipcon

Feb 3, 2014, 1:30:24 PM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu
On Mon, Feb 3, 2014 at 10:20 AM, Gil Tene <g...@azulsystems.com> wrote:

> However, as multi-core platforms started to become more and more common, people obviously care more and more about serializing operations that prevent multiple cores from being used at the same time, and those bottlenecks get worked on. Two things happen because of that:
>
> 1. The biggest bottlenecks that people run into get addressed first, leaving less and less for a future OTC/HTM thing to help with. This one is obvious.
>
> 2. People spend time to "study" the bottlenecks even if they don't fix them. This one has a less obvious and strongly detrimental effect for OTC/HTM: The first thing people seem to do when they identify a scale-limiting contention point is to instrument it. This often takes the form of adding some counters to the operation. And that in turn makes data contention (on the counters, not the actual data protected by the critical section) equal to lock contention. The very act of adding counters inside critical sections removes most of the OTC/HTM ability to improve them.
>
> So the side effect is that OTC/HTM can help things nobody cared enough about to study...

I wonder if there is a certain selection bias in the observations above. In my experience, the folks who run Azul (especially in years past when it was on non-commodity hardware) are primarily running applications with soft realtime requirements and heavy concurrency (HFT, etc.). So, these are probably folks who have done the analysis you describe above, and probably purpose-built their software to be fairly concurrent-happy.

I'm coming at this list from another perspective as someone who works on Hadoop. In many cases we originally designed the software to be throughput-oriented and didn't concern ourselves much with latency in the early days. This resulted in some really coarse-grained locking in certain areas that is really difficult to untangle now, years later. I think something like HTM is fairly promising for such legacy applications as a stop-gap until we're able to rewrite components for more fine-grained synchronization.

-Todd

Gil Tene

Feb 3, 2014, 1:46:48 PM
to <mechanical-sympathy@googlegroups.com>, ben.c...@alumni.rutgers.edu
I think that trying to think of the HTM capability available in Haswell (and in Vega) in ACID terms will be highly confusing and over-complicated. It's actually a very simple feature.

At Azul, we called the hardware capability "Simultaneous Multi-address Atomicity", and I think this term captures what it does perfectly. These forms of HTM basically present a guaranteed atomicity across multiple memory accesses (reads and writes), and nothing more.

They can obviously be used to provide an nCAS (multi-word CAS) capability (so things like doubly linked lists with atomic inserts and removes work trivially), and can obviously be used in all sorts of cool concurrent algorithms, but since the atomicity is much more generic than CAS, and covers arbitrary program logic in the atomic block, it can be used to execute most lock-protected critical blocks that do not involve non-memory side effects (i.e. no I/O, interrupt consumption, and probably no system calls).
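As a sketch of the nCAS point, with invented names: a doubly-linked-list insert touches four memory words, and wrapping it in synchronized lets an HTM-assisted JVM make all four one atomic unit while eliding the lock itself:

class Node {
    Node prev, next;
    long value;
}

class DList {
    private final Node head = new Node();
    { head.prev = head; head.next = head; }  // circular, empty list

    // Four words change together: n.prev, n.next, pos.next.prev, pos.next.
    // Under speculation, all four commit atomically or none of them do.
    synchronized void insertAfter(Node pos, Node n) {
        n.prev = pos;
        n.next = pos.next;
        pos.next.prev = n;
        pos.next = n;
    }
}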

The key to doing that is Lock Elision, which Ravi Rajwar pioneered in his SLE paper [1]. I still have a copy of my short "Wow!" e-mail from 2002, sent when I ran across it. The notion that the atomic operation can be used to cover the "inside" of the lock but not the lock itself, and that the lock itself is elided (not written to) is key, as without elision, the lock state would present a data contention level equal to that of the lock contention.
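A Java-flavored sketch of the SLE idea; speculate(), commitTx(), and abortTx() are invented stand-ins for the hardware primitives (XBEGIN/XEND/XABORT-style), stubbed out here only so the sketch compiles:

import java.util.concurrent.locks.ReentrantLock;

class LockElision {
    // Stubs for the hypothetical HTM primitives (invented for this sketch).
    static boolean speculate() { return false; } // false = control resumed here after an abort
    static void commitTx() {}
    static void abortTx() {}

    static void elided(ReentrantLock lock, Runnable body) {
        if (speculate()) {               // begin hardware transaction
            if (lock.isLocked()) {       // READ the lock word: it joins the read set,
                abortTx();               // but is never WRITTEN -- that's the elision
            }
            body.run();                  // speculative execution of the critical section
            commitTx();                  // atomically publish all writes
        } else {
            // Abort path (conflict, capacity, lock held, ...): take the lock for real.
            lock.lock();
            try { body.run(); } finally { lock.unlock(); }
        }
        // If another thread acquires the lock mid-transaction, its WRITE to the
        // lock word conflicts with our read-set entry and aborts us automatically.
    }
}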

BTW, as is often the case in scientific discovery, there was concurrent work on the same subject going on elsewhere, with an interesting publication by Prof. Josep Torrellas from Illinois on "Thread-Level Speculation" [2]. The two works seem to be unaware of each other and were published in a similar time frame (2001).

-- Gil.

[1] "Speculative lock elision: enabling highly concurrent multithreaded execution" http://courses.cs.washington.edu/courses/cse590g/02wi/rajwar_r.pdf


Gil Tene

Feb 3, 2014, 1:59:02 PM
to <mechanical-sympathy@googlegroups.com>, ben.c...@alumni.rutgers.edu
Todd, this observation comes not from our current crop of latency-sensitive apps (which use Zing) but from our Vega systems, which were/are primarily used for powering Web-based applications with big fat web containers (J2EE). Vega's proxy architecture (which introduced a transparent network hop to all traffic) made it unsuitable for HFT and super-low-latency systems, but its massive core counts (Vega3 goes up to 864 cores in an SMP configuration) worked great with throughput-oriented "naturally parallel" workloads such as OLTP, and the concurrent GC capabilities and large heap support meant that we could run Tomcat and WebLogic apps and drive 40-50+ cores before hitting most bottlenecks.

Our experience is that it's the instrumentation of the coarse-grained locks that gets in the way of throughput. And the more people cared about throughput and studied the bottlenecks that kept them from driving more and more concurrent cores, the more their instrumentation got in the way of OTC.

There certainly is selection bias here, but so far Vega presents the only real-world use cases for OTC/HTM in servers, at least in the Java world. We may find that OTC/HTM helps things out-of-the-box in more cases than I expect. We certainly did run into cases like that with Vega, but they seemed to be getting more and more rare over time as people running on multi-core systems (with no OTC) learned to look for their coarse-grained bottlenecks.

Remember, BTW, that we were not looking at 5% wins with this (and I wouldn't consider a 5% win a big deal). We were looking to un-bottleneck multi-core use with massive scale benefits. When OTC did kick in, you'd usually see throughput increases as a result of 3x as many cores being able to actively produce work.

-- Gil.

Gil Tene

Feb 3, 2014, 2:03:16 PM
to <mechanical-sympathy@googlegroups.com>, ben.c...@alumni.rutgers.edu
Correction: SMA stood for "Speculative Multi-Address Atomicity", and not "Simultaneous Multi-Address Atomicity" as I said below. Both are probably good descriptions, but the speculative word captures more of what is going on.

Rüdiger Möller

Feb 3, 2014, 2:29:08 PM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu
Peter, Gil: Thanks for the enlightenment. Extra credits to Gil for extended super awesomeness :-)

Michael Barker

Feb 3, 2014, 3:27:47 PM
to mechanica...@googlegroups.com
> 2. People spend time to "Study" the bottlenecks even if they don't fix them.
> This one has a less obvious and strongly detrimental effect for OTC/HTM: The
> first thing people seem to do when they identify a scale-limiting contention
> point is to instrument it. This often takes the form of adding some counters
> to the operation. And that in turn makes data contention (on the counters,
> not the actual data protected by the critical section) equal to lock
> contention. The very act of adding counters inside critical sections removes
> most of the OTC/HTM ability to improve them.

Just to reinforce this point: the Intel software optimisation
manual [0] (section 12) talks about explicitly avoiding
instrumentation-style counters within TSX critical sections for the
same reason. It increases the probability of data sharing (both false
and true).

Mike.

[0] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
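A tiny hypothetical illustration of the point: the shared counter turns every speculated block into a guaranteed write-write conflict, even when the protected data doesn't conflict; moving the instrumentation out of the critical section (here to a per-thread counter) restores HTM-friendliness. All names are invented:

class InstrumentedTable {
    private final long[] slots = new long[1024];
    private long opCount; // shared counter: every caller writes this field

    synchronized void update(int slot, long v) {
        opCount++;        // defeats HTM: all transactions write the same line
        slots[slot] = v;
    }
}

// HTM-friendlier: keep shared mutable metadata out of the critical section,
// e.g. per-thread counters aggregated lazily when someone asks for a total.
class FriendlyTable {
    private final long[] slots = new long[1024];
    private static final ThreadLocal<long[]> OPS =
            ThreadLocal.withInitial(() -> new long[1]);

    synchronized void update(int slot, long v) {
        slots[slot] = v;  // no shared counters inside the transaction
    }

    void recordOp() { OPS.get()[0]++; } // called outside the synchronized block
}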

Michael Hamrick

Feb 4, 2014, 5:11:15 AM
to mechanica...@googlegroups.com
Reviewing a very specific part of this discussion, I still have a question. My question's answer may have been implied in the aggregate of this discussion, but I am still hoping I could get a quick, short, specific reply that re-confirms.

1.  Peter Lawrey's original blog introduced a desired Java-on-HTM capability to potentially take a coarsely grained synchronized .java code block (strict isolation=SERIALIZABLE) and re-write its JVM run-time machine code in a way that it could then execute a more concurrent, finer-grained locking isolation on that same synchronized block, and would be able to use the HTM's built-in auto coherency protocol to abort/retry the re-write (back to strict isolation=SERIALIZABLE) in the event that attempt encountered a conflict.
2.  Ben Cotton then introduced a test case that led to his (cautious) question to Peter of the form "is it possible that an HTM run-time re-write of a strict isolation=SERIALIZABLE .java synchronized block can potentially be automatically re-factored into something that delivers an optimistic isolation=READ_COMMITTED capability (transparently, w/o the programmer having to explicitly use the Java ACID isolation APIs)?"
3.  Gil Tene then indicated to Ben that this was impossible -- that his test's assertion would always fail at one specific point, i.e. @t=2, when the reader thread attempts to access the HTM_CACHE operand's mutated $200 value (a mutation done by the writer thread, but not yet committed by the writer thread). Gil said that access attempt @t=2 will always trigger an immediate HTM abort. Thus, no ACID-style isolation=READ_COMMITTED realization should ever be expected from an HTM coarse-grained --> fine-grained locking re-write attempt.

So my question is (and with all due respect to this thread's cautioning not to think of HTM transactions from familiar ACID transaction viewpoints):

Can an HTM re-write of a strict .java synchronized code block ever lead to producing a run-time whose concurrent access scheme exhibits anything other than isolation=SERIALIZABLE?

BTW, agreeing with others already mentioned admiration/appreciation of this awesome thread, THANK YOU.



Francis Stephens

Feb 4, 2014, 5:13:34 AM
to mechanica...@googlegroups.com, ben.c...@alumni.rutgers.edu
Thanks Gil, I find it really valuable to hear about HTM in practice. It would be hard to guess at some of those quirks without exposing HTM to a large number of code bases.

I had interpreted Cliff's use of 'micro-kernel' to have the textbook meaning. Cheers.

Gil Tene

Feb 4, 2014, 10:27:09 AM
to <mechanical-sympathy@googlegroups.com>, mechanica...@googlegroups.com
> Can an HTM re-write of a strict .java synchronized code block ever lead to producing a run-time whose concurrent access scheme exhibits anything other than isolation=SERIALIZABLE?

My answer would be: any valid implementation of synchronized would always need to exhibit atomic behavior for the block in its entirety (atomic against any other blocks synchronizing on the same monitor instance), and would have to adhere to the ordering rules in the JMM. An HTM-assisted implementation of a synchronized block will only be valid if it meets these requirements.

I *think* this means the answer to your question is no (as in no exhibited effects that differ from isolation=SERIALIZABLE). But not because it is necessarily impossible to have such an effect when (mis)using HTM for the purpose of emulating strict lock-based synchronized blocks, but because JVMs should not have buggy implementations of monitor enter/exit semantics.

BTW, keep in mind that we are talking about a specific form of HTM behavior (which Vega and TSX share). HTM is a wide field, and I've seen plenty of academic work on many forms of HTM that are not as strict as the Vega or TSX ones, and in which all sorts of looser consistency is possible. E.g. the HTM described in the original transactional memory concept in the 1993 Herlihy & Moss paper (http://cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf) would allow some (explicit) operations to have transactional behavior while others did not.


Ben Cotton

Feb 4, 2014, 10:39:28 AM
to mechanica...@googlegroups.com
> So my question is (and with all due respect to this thread's cautioning not to think of HTM transactions from familiar ACID transaction viewpoints) [...]

I strongly recommend (despite myself being an original offender against this recommendation) that you make the effort to leave all your familiar notions of ACID transactions behind when you start to learn HTM transactions (which, BTW, I am only in my 2nd real day of learning). This is really more than a caution; it may be a necessity. ACID and HTM both use the same-sounding, same-spelling, same-industry word "transactions", but they are not at all the same thing. Mixing a notion of one of these without leaving behind pre-conceived notions of the other will likely betray your capability to learn both effectively. ACID transactions are managed by a Transaction Manager to which Java applications communicate their ACID intentions via the java.sql.Connection API. HTM transactions are managed by a coherency protocol manager. In my very newbie understanding, Java applications do not communicate anything to HTM in any way (certainly not with anything like an API). What Peter is proposing is that the new JEPs for Off-Heap/FFI et al. native capabilities consider empowering the Java run-time with some kind of official join point to HTM capabilities, to empower apps with finer-grained locking and performance benefits (all completely transparent from the application developer view).

Another HUGE (and now obvious to me) difference in these views on what defines a "transaction" is in their distinctive outcome managements. ACID outcomes are one of (commit/rollback); HTM outcomes are one of (commit/abort). The ACID rollback outcome is potentially extremely complicated, as it implies a full (stop/recover) capability. As Gil pointed out, the HTM abort outcome is actually very simple: a full stop, throwing away everything for which there were 'speculative' lines.


> Can an HTM re-write of a strict .java synchronized code block  ever lead to producing  a run-time whose concurrent access scheme exhibits anything other than isolation=SERIALIZABLE?

I would say 'No'. All of ACID's non-Serializable isolations include some notion of managing access risk concurrently. But, as Gil explicitly pointed out and as you eloquently recapped in your post, the @t=2 access attempt by HTM_reader_thread on the non-committed $200 mutation made by HTM_write_thread can never take place in the HTM world. It is impossible.

Again, try not to be bothered by isolation concepts when learning HTM (IMHO).

My apologies for kind of "learning out loud" here in this response. Peter, Gil, et al., please correct me where applicable.

BTW, agreeing with others already mentioned admiration/appreciation of this awesome thread, THANK YOU.

Great, great stuff, right?  :-)

Michael Hamrick

Feb 4, 2014, 11:00:47 AM
to mechanica...@googlegroups.com
Thanks Gil (and Ben, as I have now read your response too).

So, if both of you agree that nothing looser than strict isolation=SERIALIZABLE can be realized in an HTM finer-grained locking re-write of a .java synchronized code block, what would be a user/application view of any benefit (of any kind) taking place from the HTM performing the re-write vs. the HTM leaving the original .java-produced byte-code completely intact?

> What Peter is proposing is that the new JEPs for Off-Heap/FFI et al. native capabilities consider empowering the Java run-time with some kind of official join point to HTM capabilities, to empower apps with finer-grained locking and performance benefits (all completely transparent from the application developer view).

What would be a user-view example of HTM capabilities empowering apps with performance benefits, if the HTM re-write can't get past strict isolation=SERIALIZABLE? I don't see what is gained.


Ben Cotton

Feb 4, 2014, 11:23:02 AM
to mechanica...@googlegroups.com
> the HTM described in the original transactional memory concept in the 1993 Herlihy & Moss parer (http://cs.brown.edu/~mph/HerlihyM93/herlihy93transactional.pdf

Great reference, Gil! In this white paper, the original HTM authors explicitly take custody of defining their notion of a "transaction" to only include capabilities for Atomicity and Serializable Consistency (the 'A' and the 'C' in ACID); no mention at all of Isolation or Durability is made (the 'I' and 'D' in ACID).



Martin Grajcar

Feb 4, 2014, 11:32:05 AM
to mechanica...@googlegroups.com
On Tue, Feb 4, 2014 at 5:00 PM, Michael Hamrick <michael.to...@gmail.com> wrote:
> Thanks Gil (and Ben, as I have now read your response too).
> What would be a user-view example of HTM capabilities empowering apps with performance benefits, if the HTM re-write can't get past strict isolation=SERIALIZABLE? I don't see what is gained.

I guess I could answer this part (learning by answering, you know). A synchronized block as such allows no concurrency, but if it gets implemented via HTM it can. Imagine two threads reading from a synchronized HashMap (I mean what you get from Collections.synchronizedMap(new HashMap())). Normally, one must wait till the other finishes, but with HTM both can work in parallel, as the outcome is guaranteed to be the same. It's sort of a better implementation of SERIALIZABLE. Using a synchronized block is the simplest and least concurrent implementation; using HTM is a clever trick allowing concurrent access as long as no conflict occurs (a write plus any access to the same cache line is a conflict; reading alone is fine, and so is accessing different cache lines).

For writes to a synchronized HashMap, this would nearly work, too. As long as both threads access parts of the HashMap belonging to different cache lines, they could work in parallel. Unfortunately, there are fields like size and modCount, which get modified on (nearly) each write, so this wouldn't work in this case. It might work for one writer and one (or more) readers.
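For instance (a sketch of the read-only case, with invented names): both reader threads below contend on the map's single monitor today, but an HTM-assisted monitor could overlap them, since nothing enters any transaction's write set:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class SynchronizedMapReaders {
    public static void main(String[] args) {
        Map<String, Long> balances = Collections.synchronizedMap(new HashMap<>());
        balances.put("some-key", 100L);

        // Each get() takes the map's monitor. Without HTM, concurrent readers
        // serialize on it; with HTM-assisted synchronized, read-only critical
        // sections can run in parallel because no cache line is written.
        Runnable reader = () -> System.out.println(balances.get("some-key"));
        new Thread(reader).start();
        new Thread(reader).start();
    }
}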

Peter Lawrey

Feb 4, 2014, 11:46:20 AM
to mechanica...@googlegroups.com

HashMap could work concurrently if the size doesn't change, e.g. a replace, plus some altering of how modCount is handled.

HashMap is in the JDK so it could be optimised/changed. What I had in mind is code which is not easily changed but doesn't have side effects like a mod count.


Michael Hamrick

Feb 4, 2014, 11:58:02 AM
to mechanica...@googlegroups.com

Ah! Great answer, Martin. Thank you.

> A synchronized block as such allows no concurrency, but if it gets implemented via HTM it can.

OK, I'm with you (but at this stage I still don't see how you can do concurrency if you can't somehow relax strict SERIALIZABLE?)

> It's sort of a better implementation of SERIALIZABLE

Bingo. Now "the light" has come on in my head. It is no longer strict SERIALIZABLE; it is now a better SERIALIZABLE -- established not through "strictness" but through "validated speculation".

So cool.

Now my question is (learning through endless questions, you know): what will Peter's JEP ask the Java run-time to adopt so that it can assist HTM in re-writing (strict SERIALIZABLE) .java synchronized code blocks into (better SERIALIZABLE) HTM machine code? What does Peter's JEP ask the Java run-time to do to help?

Thanks!

Gil Tene

Feb 4, 2014, 1:15:14 PM
To see a practical user view of the actual effects of semantically identical synchronized block execution with HTM-assist, look at slides 22/23 in my "Speculative Locking: Breaking the Scale Barrier (JAOO 2005)" presentation. These are not hypothetical or modeled numbers; they are actual measurements on actual Vega hardware and actual production JVMs. This is just how Vega JVMs behave, and we've been shipping them with OTC on by default since 2006.

Also note that the vertical axis is logarithmic ;-).

I highly recommend people go through the whole deck though (step by step). It's 8+ years old, but re-reading my own material in the Intel TSX-capable commodity server context, it reads as if I had a time-machine view of what you'll need to know later this year in the commodity hardware world. The presentation presents the motivation and logic in detail, including hints on how to write better "HTM friendly" code in your synchronized blocks. As long as Vega was the only machine that did this stuff, we didn't really expect much HTM-aware code writing to happen, but with TSX showing up on every commodity Intel x86 server starting later this year, writing HTM-sympathetic code is something to start thinking of.

I should probably start submitting the presentation (almost as is) to conference talks this year...

Ben Cotton

Feb 4, 2014, 4:35:49 PM
to mechanica...@googlegroups.com
Man, this is great stuff. Exceedingly interesting; I want to effectively devour it all (I know, ambitious newbie "want to know it all" talk; I'll calm down, I promise).

In the continued (and confirmed!) spirit of "learning out loud", I recap my current understanding as:

1.  HTM "Transactions" are an originally academic concept, with 1993-published ambitions that are distinctly different from ACID "Transactions". HTM may improve the performance of strict isolation=SERIALIZABLE .java synchronized blocks' execution by re-writing their run-time to realize an HTM-savvy better isolation=SERIALIZABLE (a sort of concurrent serializable, strange as that sounds) capability.

2.  Around 2008, Azul's Vega VM used a technique called "Optimistic Thread Concurrency" as its mechanism to HTM-assist its Vega hardware product with the re-write described in [1]. Gil published a presentation URL where slides 22/23 show the explicit and real performance benefits Azul observed from its OTC-based HTM-assist experience (Nice!). It is very interesting that Azul uses the word "Optimistic" in its implementation solution. The word indeed suggests a breakthrough (suggested in his presentation's title) has occurred -- definitely a breakthrough wrt traditional views on Serializable isolation. No way does the ACID world ever use the word "Optimistic" in its isolation=Serializable accounting.

3.  The tactical basis for implementing HTM-assist, both theoretically and in Azul products, is:

4.  RE-WRITE strict isolation=SERIALIZABLE .java byte code into better isolation=SERIALIZABLE run-time code. This re-write will include code paths that do (see the sketch after this list):
    4.1  SPECULATION access/mutate views
    4.2  VALIDATION views (valid ? commit : abort)
    4.3  COMMIT views
    4.4  ABORT views
    4.5  on ABORT paths, RE-TRY views (but with strict isolation=SERIALIZABLE)

5.  On execution, follow the code paths' views that result from [4], hoping that a performance improvement is realized, but assured that even with no performance gain the "transaction" still experiences a safe outcome.

6.  CONSEQUENCE of ALL THIS? Have your cake (a safe, decades-reliable, strict isolation=SERIALIZABLE, will-definitely-work backup code execution path) and eat it too (a credibly optimistic, no-risk, higher-performance code execution path to a better isolation=SERIALIZABLE capability).
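Recapping [4] and [5] as a Java-flavored sketch; speculate() and commitTx() are invented stand-ins for the hardware primitives (stubbed so the sketch compiles), and the retry budget is an invented tuning knob:

class HtmAssisted {
    static final int MAX_SPECULATIVE_ATTEMPTS = 3;   // invented threshold
    static boolean speculate() { return false; }     // stub: false = we aborted
    static void commitTx() {}                        // stub: hardware commit

    static void run(Object monitor, Runnable criticalSection) {
        for (int i = 0; i < MAX_SPECULATIVE_ATTEMPTS; i++) {
            if (speculate()) {            // 4.1 speculative access/mutate path
                criticalSection.run();
                commitTx();               // 4.2/4.3 hardware validates and commits atomically
                return;                   // better-SERIALIZABLE path succeeded
            }
            // 4.4 abort path: hardware discarded all speculative lines; retry (4.5)
        }
        synchronized (monitor) {          // 4.5 final fallback: strict SERIALIZABLE path
            criticalSection.run();
        }
    }
}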

Win. Win. 

Nice.

Back to studying (and re-studying).

 

On Tuesday, February 4, 2014 1:03:06 PM UTC-5, Gil Tene wrote:
The see a practical user-view of the actual effects of the semantically identical synchronized block execution with HTM-assist, look at slides 22/23 in the my "Speculative Locking: Breaking the Scale Barrier (JAOO 2005)" presentation.  These are not hypothetical or modeled numbers, they are actual measurements on actual Vega hardware and actual production JVMs. This is just how Vega JVMs behave, and we've been shipping them wwith OTC on by default since 2006.

Also note that the vertical axis is logarithmic ;-).


Michael Hamrick

unread,
Feb 4, 2014, 4:44:20 PM2/4/14
to mechanica...@googlegroups.com

In the continued (but not confirmed) spirit of "learning by endlessly questioning":

Does Peter's current JEP request that Oracle now do in 2014, with the OpenJDK VM providing HTM-assist on behalf of Intel TSX-capable hardware, what Azul did in 2008 with Vega OTC providing HTM-assist on behalf of Vega hardware?

Peter Lawrey

unread,
Feb 4, 2014, 4:51:30 PM2/4/14
to mechanica...@googlegroups.com

Not quite.  I am a step ahead of that in the off-heap JEP.

I am assuming HTM will be available one day and that the concurrency library, e.g. Lock or similar, will have support for it as well. This will need native/intrinsic operations to implement, and traditionally these would have been added to Unsafe, but Unsafe is becoming restricted to internal code.

So the suggested JEP includes support for HTM in the replacement for Unsafe, in particular so it can be applied to off-heap memory as well.
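
Purely as illustration of what "HTM applied to off-heap memory" might look like if the Unsafe replacement grew transactional operations, here is a sketch. Everything in it is invented for the example; no such API exists in the JDK today:

    // Hypothetical accessor for raw native memory, as in the off-heap JEP.
    interface OffHeapMemory {
        long allocate(long bytes);
        long getLong(long address);
        void putLong(long address, long value);
        boolean txBegin(); // start a hardware transaction; false if unavailable
        void txEnd();      // commit the transaction
    }

    class OffHeapTxExample {
        static void readThenWrite(OffHeapMemory mem, long addr) {
            if (mem.txBegin()) {
                long a = mem.getLong(addr);   // speculative off-heap read
                mem.putLong(addr + 8, a + 1); // speculative off-heap write
                mem.txEnd();                  // both accesses become visible atomically
            } else {
                synchronized (OffHeapTxExample.class) { // lock-based fallback path
                    long a = mem.getLong(addr);
                    mem.putLong(addr + 8, a + 1);
                }
            }
        }
    }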


Michael Hamrick

unread,
Feb 4, 2014, 5:07:48 PM2/4/14
to mechanica...@googlegroups.com
Thanks Peter.  So in your off-heap JEP, are you formally requesting that Oracle replace Unsafe with a supported library/API  that will allow/encourage potential  HTM-assist "providers" to deliver to OpenJDK@Intel-TSX stacks  the same capabilities that the Azul-provided OTC solution delivered to  AzulVM@AzulVegaHardware stacks?


Peter Lawrey

unread,
Feb 4, 2014, 5:16:54 PM2/4/14
to mechanica...@googlegroups.com

I suspect that Oracle doesn't have the same experience that Azul has in this space, and I expect that their use of HTM in Java 9 will be minimal. I hope it will be more than just the proposed 128-bit CAS, which is a little underwhelming.

However, given there are just three machine code instructions for Intel TSX, all you need for the "new" Unsafe is three intrinsic methods, and perhaps a method to say whether it is supported. Ideally this would be compatible with Vega and AMD's implementation.
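
As a sketch of what those three intrinsics plus a capability check might look like, with method names of my own invention loosely following RTM's XBEGIN/XEND/XABORT (this is not a real or proposed API, and the bodies are stubbed):

    public final class HtmSupport {
        private HtmSupport() { }

        // True if the CPU/JVM combination can run hardware transactions at all.
        public static boolean isSupported() { return false; }

        // Start a transaction (XBEGIN); a negative return stands in for
        // "could not start / aborted", telling the caller to take its fallback.
        public static int begin() { return -1; }

        // Commit the current transaction (XEND).
        public static void end() { }

        // Explicitly abort with a caller-supplied reason code (XABORT).
        public static void abort(int reasonCode) { }
    }

Callers would wrap a critical section in begin()/end() and branch to a lock-based fallback when begin() reports failure, much like the transparent Vega OTC path described earlier in the thread.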


Gil Tene

unread,
Feb 4, 2014, 5:30:20 PM2/4/14
to <mechanical-sympathy@googlegroups.com>
The Azul Vega OTC thing (which started in 2005, btw, not 2008) was basically "transparent HTM-assisted synchronized blocks": when it is able to, it transparently exposes additional concurrent execution opportunities in code that is currently serialized by conservatively executed, non-HTM-assisted synchronized blocks.

(I think) What Peter is talking about is an API that would allow programmatic control of HTM. The same can be asked for not only for off-heap but for on-heap, and it certainly shouldn't be an "Unsafe" thing. I actually think it belongs in its own JEP/JSR, and not as part of the "get rid of Unsafe" JEP.

The two (transparent OTC and an HTM API) are potentially orthogonal, but things can get "messy" when they end up overlapping in use of the underlying HTM resources. I think that user-exposed HTM APIs are important to build, but I also think that the new abstractions needed are going to be tricky to get right. E.g. assuming a specific HTM capability in the API definition is obviously wrong, especially since even within x86 we should expect TSX to mature over multiple processor generations.

The main area of immaturity I expect to run into will be around the wish to use HTM at multiple potentially conflicting levels at the same time. E.g. the JVM may use HTM for internal, transparent-to-user-code things (OTC for example), while user code may be using HTM-controlling APIs at the same time, and the OS may choose to use HTM for in-kernel-cool-stuff.

I worry that the current HTM designs may fail to work well when multiple, non-coordinating mechanisms try to make use of HTM together. They will probably "function correctly" from a semantics point of view (nested transactions work fine in both OTC/SMA and TSX), but all such mechanisms will need to deal with failure-cause-driven heuristics, and providing that failure information in a way that is useful across multiple un-coordinating layers creates interesting questions that no real-world experience exists for yet. [At Azul we only had one mechanism (the JVM's OTC) use HTM, so we didn't have that conflict, and even then it was tricky.]

Still, I look forward to seeing an HTM-API JEP/JSR. I'll be happy to chime in on it as it goes.

-- Gil.

Bingo.  Now "the light" has come on in my head.  It is no longer strict SERIALIZABLE; it is now a better SERIALIZABLE, established not through "strictness" but through "validated speculation".

So cool.

Now my question is (learning through endless questions, you know): what will Peter's JEP ask the Java run-time to adopt so that it can assist HTM in re-writing (strict SERIALIZABLE) .java synchronized code blocks into (better SERIALIZABLE) HTM machine code? What does Peter's JEP ask the Java run-time to do to help?

Thanks!

On Tuesday, February 4, 2014 11:32:05 AM UTC-5, Martin Grajcar wrote:
On Tue, Feb 4, 2014 at 5:00 PM, Michael Hamrick <michael.to...@gmail.com> wrote:
Thanks Gil (and Ben, as I have now read your response too).
What would be a user-view example of HTM capabilities empowering apps with performance benefits, if the HTM re-write can't get past strict isolation=SERIALIZABLE? I don't see what is gained.

I guess I could answer this part (learning by answering, you know). A synchronized block as such allows no concurrency, but if it gets implemented via HTM it can. Imagine two threads reading from a synchronized HashMap (I mean what you get from Collections.synchronizedMap(new HashMap())).  Normally, one must wait till the other finishes, but with HTM both can work in parallel, as the outcome is guaranteed to be the same. It's sort of a better implementation of SERIALIZABLE. Using a synchronized block is the simplest and the least concurrent implementation. Using HTM is a clever trick allowing concurrent access as long as no conflict occurs (a write plus any access to the same cache line is a conflict; reading alone is fine, and so is accessing different cache lines).

For writes to a synchronized HashMap, this would nearly work, too. As long as both threads access parts of the HashMap belonging to different cache lines, they could work in parallel. Unfortunately, there are fields like size and modCount which get modified on (nearly) every write, so this wouldn't work in this case. It might work for one writer and one (or more) readers.
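
Here is a minimal, self-contained version of that reader scenario in plain JDK code. Whether the two readers actually proceed in parallel depends entirely on the JVM eliding the map's monitor via HTM, which is the hypothetical part; the code itself is ordinary:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class SyncMapReaders {
        public static void main(String[] args) throws InterruptedException {
            // Every get()/put() below takes the map's single monitor.
            final Map<String, Integer> map =
                Collections.synchronizedMap(new HashMap<String, Integer>());
            map.put("a", 1);
            map.put("b", 2);

            // Two read-only threads: with plain locking their critical sections
            // serialize; with HTM-backed lock elision they could run concurrently,
            // since read-read sharing of cache lines never conflicts.
            Runnable reader = new Runnable() {
                public void run() {
                    for (int i = 0; i < 1_000_000; i++) {
                        map.get("a");
                        map.get("b");
                    }
                }
            };
            Thread t1 = new Thread(reader);
            Thread t2 = new Thread(reader);
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            // A writer thread would keep touching size and modCount, aborting the
            // readers' transactions, exactly the conflict described above.
        }
    }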
