Not listed yet in the above bibliography, but I thought this new paper
from Sun was fascinating - simple ideas for future hardware TM
support. It's not an entirely fleshed-out solution yet, but you can
see they are trying to get away from CAS and LL/SC operations as the
basic primitives. In particular, I wondered if it could be used in
conjunction with Intel's promotion of Fully Buffered DIMMs:
Tim Harris, now at Microsoft, seems to be leading the charge on some
innovative approaches to software TM. Some say that it's too soon to
put stuff into hardware, that we need to figure out the best way to
model TM in software first (assuming TM is the best approach for
concurrency at all).
This particular paper on Composable Memory Transactions is quite
interesting:
He's also leading a conference this month that is generating a lot of
interest:
Lastly, I have found some of the observations in this IBM Technical
Report on Atomic Sets and Concurrency helpful - a new theory (with
proofs!) for modeling concurrency in languages, to be published in
POPL'06.
(from Sun paper)
"We introduce Transient Blocking Synchronization (TBS), a new approach
to hardware synchronization for mega-scale distributed-memory
multiprocessor machines. Such machines, with thousands of processors
and controller based memory modules, are essentially distributed
networks, and one must search for new paradigms that provide hardware
synchronization support with high levels of robustness and minimal
protocol and communication overhead. It is our claim that the semantics
of non-blocking synchronization primitives such as Compare&Swap and
LoadLinked/StoreConditional on the one hand, and blocking ones such as
Full/Empty-bits on the other, will introduce high communication and
space costs when implemented on large scale machines.
TBS is a novel hardware synchronization paradigm resting between the
classic blocking and non-blocking approaches to synchronization. It is
an example of how, based on very weak ''transient'' hardware
blocking, that is, blocking that may be revoked at any time, one can
provide non-blocking universal synchronization with low communication
and space complexity. This paper presents a set of simple TBS
single-location synchronization operations and shows that they provide
low cost non-blocking emulations of all known read-modify-write
operations. Moreover, it shows that the combination of TBS with
hardware supported transactional bits, a variation on traditional
hardware full/empty bits, can provide low message cost implementations
of multi-word transactional operations."
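A rough way to picture the transient-blocking idea from the abstract
(a toy single-threaded model of my own invention; the paper's actual
operations and names differ): a location can be locked, but the
"hardware" may revoke the lock at any time, so the writer checks at
store time and retries, which is what gives non-blocking progress.

```c
#include <stdbool.h>

/* Toy model of transient blocking: a word with a revocable lock.
 * This is illustrative only, not the TBS paper's interface. */
struct tbs_word { long value; bool locked; };

bool tbs_load_and_lock(struct tbs_word *w, long *out)
{
    if (w->locked)
        return false;        /* someone else holds the transient lock */
    w->locked = true;
    *out = w->value;
    return true;
}

void tbs_revoke(struct tbs_word *w)   /* "hardware" may do this anytime */
{
    w->locked = false;
}

bool tbs_store_and_unlock(struct tbs_word *w, long v)
{
    if (!w->locked)
        return false;        /* lock was revoked: caller must retry */
    w->value = v;
    w->locked = false;
    return true;
}

/* a read-modify-write (fetch-and-add) emulated on the transient lock */
long tbs_fetch_add(struct tbs_word *w, long d)
{
    long old;
    for (;;) {
        if (!tbs_load_and_lock(w, &old))
            continue;                    /* toy model; real HW would back off */
        if (tbs_store_and_unlock(w, old + d))
            return old;                  /* store survived: done */
    }
}
```

The point being that the blocking is so weak (revocable at any instant)
that it never compromises system-wide progress, yet it is enough to
emulate the usual read-modify-write operations.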
IMHO it's too early in the game for actual hardware support, even
though they really do need it in order for TM to become really useful. All
of the software emulations I have seen are loaded with expensive atomic
operations; mutexes seem more efficient...
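For concreteness, this is the kind of CAS retry loop those software
emulations lean on for every shared-word update (a sketch, not any
particular STM's code) - each failed compare-exchange is a full
read-modify-write bus transaction, which is where the cost piles up
under contention:

```c
#include <stdatomic.h>

/* Sketch of a lock-free increment via a CAS retry loop, the building
 * block most software TM emulations hammer on.  Returns the new value. */
long cas_increment(atomic_long *counter)
{
    long old = atomic_load(counter);
    /* retry until nobody else slipped an update in between our
     * load and our compare-exchange; 'old' is refreshed on failure */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1))
        ;
    return old + 1;
}
```

A mutex-based version pays its synchronization cost once per critical
section rather than once per contended word, which is presumably why the
mutex paths look comparatively cheap.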
original 801 had a form ... it was used by journal filesystem
for aix in rs/6000.
filesystem metadata was organized ... and the memory system caught all
changes ... so basically you avoided having to do all the explicit
logging/locking statements in the filesystem code ... you did have to
add commit calls.
there was some disagreement whether the overhead of explicit
logs/locks were more expensive than the implicit overhead involved in
transaction memory. turns out there was the overhead of scanning the
whole memory area for changes at commits.
a portable version of the journal filesystem code with explicit
lock/logging changes in the code turned up the explicit lock/logging
calls had lower overhead than transactional memory (at least in the
journal filesystem case).
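the difference, roughly ... an illustrative sketch only, not the actual
801/aix code: implicit transactional memory has to find the dirty words
at commit (here modeled as diffing against a shadow copy, cost
proportional to the whole region), while explicit logging records each
change at the update site and commit cost is proportional to the
changes.

```c
#include <stddef.h>

/* implicit style: scan the whole region for words that changed
 * (the commit-time overhead described above) */
size_t commit_by_scan(const long *region, const long *shadow, size_t nwords,
                      size_t *dirty, size_t max_dirty)
{
    size_t n = 0;
    for (size_t i = 0; i < nwords; i++)      /* full scan on every commit */
        if (region[i] != shadow[i] && n < max_dirty)
            dirty[n++] = i;
    return n;                                /* indices to journal */
}

/* explicit style: the filesystem code logs each change itself */
struct chg_log { size_t idx[64]; size_t n; };

void log_change(struct chg_log *l, size_t word_idx)  /* at each update site */
{
    if (l->n < 64)
        l->idx[l->n++] = word_idx;
}

size_t commit_from_log(const struct chg_log *l)  /* cost = changes made */
{
    return l->n;
}
```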
and unrelated old reference from about same period as aixv3 ... the
Proceedings of the 20th International Symposium on Fault-Tolerant
Computing Systems, pp 89--96, Newcastle, June 1990.
A Fault Tolerant Tightly Coupled Multiprocessor Architecture based on
Stable Transactional Memory
Authors: Michel Banatre and Philippe Joubert
Traditionally, tightly coupled multiprocessors allow data sharing
between multiple caches by keeping cached copies of memory blocks
coherent with respect to shared memory. This is difficult to achieve
in a fault tolerant environment due to the need to save global
checkpoints in shared memory from where consistent cache states can be
recovered after a failure.
The architecture presented in this report solves this problem by
encapsulating the memory modifications done by a process into an
atomic transaction. Caches record dependencies between the
transactions associated with processes modifying the same memory
blocks. Dependent transactions may then be atomically committed. Such
an operation requires a cache coherence protocol responsible for
recording process dependencies as well as keeping coherent cached
copies of blocks and a shared Stable Transactional Memory owing to
which all memory updates are atomic to allow recovery after a failure.
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
[ ... transactional memory, what about it? ... ]
lynn> original 801 had a form ... it was used by journal
lynn> filesystem for aix in rs/6000. [ ... ] a portable version
lynn> of the journal filesystem code with explicit lock/logging
lynn> changes in the code turned up the explicit lock/logging
lynn> calls had lower overhead than transactional memory (at
lynn> least in the journal filesystem case).
Ah the usual interesting historical and technical detail from
you. Thanks, as usual, and I am particularly interested in this
as I just switched my Linux based PC to the latter day version
of that «portable version of the journal filesystem code», the
JFS in recent Linux kernels.
As an aside, I switched because I was terrified by some
informal file system tests I have done, where I discovered
that performance degradation under extensive use of 'ext3'
can be quite apocalyptic:
[ ... ]
lynn> A Fault Tolerant Tightly Coupled Multiprocessor
lynn> Architecture based on Stable Transactional Memory
lynn> Authors: Michel Banatre and Philippe Joubert
Banatre? There was a Banatre at UofRennes who wrote interesting
stuff about Algol 68 long ago, including an utterly fascinating
analysis of how much micro parallelism/locking (a topic that
has come up recently in comp.arch IIRC) there was in his Algol
68 compiler taken as an example.
Ahhh, I could not resist looking that up. That Banatre is
Jean-Pierre Banatre, and perhaps Michel is a relation or a
second generation geek.
Which category is happening a lot (e.g. IIRC famously with
Larry Wall, but I also met once this awesome Oz girl both
whose parents are CompSci professors, and she got a CompSci
degree and is doing a CompSci PhD too :->).
However, a little web searching with the obvious keywords and
the microparallelism/microlocking paper is:
"An event-driven compiling technique"
where the technique used to resolve Algol 68's mysterious
forward references (which IIRC normally require at least four
left-to-right passes) is basically microthreads...
As another note, I feel compelled to mention MC/CWU's Bohm's
''oscillating'' compiler passes (surely revealed to him by
aliens), which cuts the four passes down to two, where the
first pass compiler is left-to-right, and the second is
right-to-left, which as the author notes in effect gives
infinite right lookahead. I have never seen it mentioned anywhere else.
No, not enough -- I could not resist looking up about that
Bohm, and he seems to be this guy:
and I did not know that he had also worked at V.U. Manchester's
legendary CompSci department, and on their legendary dataflow
systems (I had the privilege many years ago to see a full
dataflow machine actually working over there, a rare experience).
Sorry, but at least his paper list contains a lot of nice
references on microparallelism (and perhaps locking too) so it
is not totally offtopic. :-)
[ ... ]
Humm... I do agree that the forward-progress/throughput of
"obstruction-free" based data-structures "can be" a bit unstable under load:
As for the presented implementation, it seems to definitely be an
improvement over "obstruction-freedom" based designs. However, I do have a
little problem with some of the overhead involved. AFAICT, it seems to
require readers to keep in sync with the "transaction descriptor", and the
proxy collector structures. I really have to study their implementation some
more, but it does seem that some memory barriers would be required to sync
with the commit operation, and the proxy garbage collector "scanning logic"
the code happens to use.
off-topic: AFAICT, the proxy gc algorithm doesn't allow for references to
exist outside of a "critical-region", just like RCU, so you are going to
basically have to invoke it for every access to a transaction and/or shared
object if you want the collector to be able to handle any kind of "load"; it
looks like it executes a StoreLoad style memory barrier when you enter and
leave a GC region, and I think the algorithm has to observe three GC epochs
before it can dump its lists...
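To make the shape of that concrete, here is roughly what I mean (my own
naming and simplifications, not the paper's code): readers bracket every
access in a critical region, RCU-style, paying a full StoreLoad-class
barrier (seq_cst here) on entry and exit, and a retired object only
becomes reclaimable after the collector has advanced through enough
quiescent epochs - three, per my reading above.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define EPOCH_LAG 3                /* "three GC epochs" before freeing */

static atomic_long global_epoch = 1;
static atomic_long active_readers = 0;

long gc_enter(void)                /* returns the epoch we pinned */
{
    atomic_fetch_add(&active_readers, 1);   /* seq_cst: full barrier */
    return atomic_load(&global_epoch);
}

void gc_leave(void)
{
    atomic_fetch_sub(&active_readers, 1);   /* seq_cst: full barrier */
}

bool gc_try_advance(void)          /* collector: only when quiescent */
{
    if (atomic_load(&active_readers) != 0)
        return false;              /* a reader still holds a region */
    atomic_fetch_add(&global_epoch, 1);
    return true;
}

bool gc_can_free(long retired_epoch)  /* dump lists after EPOCH_LAG epochs */
{
    return atomic_load(&global_epoch) >= retired_epoch + EPOCH_LAG;
}
```

Note the cost structure this implies: two full barriers per access if
every transaction/object access needs its own region, which is exactly
the overhead I was grumbling about.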
> A Fault Tolerant Tightly Coupled Multiprocessor Architecture based on
> Stable Transactional Memory
> Authors: Michel Banatre and Philippe Joubert
Humm... Very interesting. Thank you. Any information on transactional memory?
Ahhh. It seems that they are indeed moving away from "simple" word-sized
CAS, to "giant-and-costly LL/SC-type logic" that has to scan at commit for
any observable multi-word modifications in shared memory...
No. The "simple" (*) proposals exploit/extend cache-coherence logic, so
there's no scan at commit time. I'd suggest reading the articles from
the bibliography above if you're actually interested...
*: There's also some more complex proposals, where transactions are not
limited to fitting in the cache.
David Gay, not speaking for Intel
Agreed. I only very briefly skimmed through some of them. As you can see, I
am not really that excited about transactional memory... However, after I get
through RTFM... ;)
Things could change...
> *: There's also some more complex proposals, where transactions are not
> limited to fitting in the cache.
That was one of my main concerns...
I've skimmed some of the TCC info (tcc.stanford.edu), and it seems to
imply that the hardware atomic transaction support will make efficient
parallel programming much easier. But atomic transactions only ensure
*some* serial schedule will result. Programming is all about processing
data with well-defined operations in a carefully planned schedule in
order to obtain a desired result. Atomic transactions only deal with
the "well-defined operations" part of that, specifically with allowing
more complex operations, ones that continue to be well-defined even in a
concurrent setting.
Parallel programming is also, in large part, about hiding latency, and
the snoopy caches in the TCC proposals don't really address that. Then
again, TCC is apparently being proposed only for CMPs and SMTs at this time.
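To illustrate the point about *some* serial schedule (toy example of my
own, not TCC's interface): run two atomic transactions against an
account with no ordering constraint between them, and both serial orders
are legal outcomes - atomicity alone never picks the one the programmer
wanted.

```c
/* Two transactions, each perfectly atomic and well-defined:
 * a deposit, and a 10% fee.  Atomicity guarantees the result matches
 * one of the two serial orders below -- but not which one. */
typedef struct { long balance; } account;

void txn_deposit(account *a, long amt) { a->balance += amt; }
void txn_apply_fee(account *a)         { a->balance -= a->balance / 10; }

long outcome_deposit_first(long start)  /* serial order 1 */
{
    account a = { start };
    txn_deposit(&a, 100);
    txn_apply_fee(&a);
    return a.balance;
}

long outcome_fee_first(long start)      /* serial order 2 */
{
    account a = { start };
    txn_apply_fee(&a);
    txn_deposit(&a, 100);
    return a.balance;
}
```

Starting from 1000, the two orders give 990 and 1000 respectively; both
are "correct" by the atomicity contract, so the ordering has to come
from somewhere else.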
So, what about just augmenting TCC or other TMs with some nice
programming constructs for ordering? (TCC already proposes a few.) For
example, ScalPL (www.elepar.com/ScalPL.html) is also based on "all
transactions, all the time" (to steal TCC's catch phrase), but the vast
majority of ScalPL's constructs are for helping to manage the data and
control flow between the atomic transactions, and to nest and modularize
them. So maybe TCC (or other TMs) could offer hardware transaction
support and ScalPL could provide the ordering and higher-order software
constructs.
In fact, TCC and ScalPL tend to work at cross purposes and/or duplicate
work, for very basic reasons. TCC strives to be non-blocking, but ScalPL
blocks (or, if you prefer, doesn't start) transactions until the program
allows. TCC transactions fetch data on demand, creating embedded
latency, but ScalPL has a good idea of what data the transaction will
need beforehand, so could pre-fetch the data while other transactions
are working with already-prefetched data, hiding latency. Maybe most
importantly, the whole idea behind TCC (and other TMs) is to support an
atomic commit, but ScalPL doesn't allow conflicting transactions to
execute concurrently in the first place so doesn't need a commit at all.
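The contrast, in sketch form (my naming, not ScalPL's actual notation):
since each transaction declares its data set up front, the scheduler can
simply refuse to start one that overlaps anything already running, so
nothing ever needs to abort, retry, or commit atomically.

```c
#include <stdbool.h>
#include <stddef.h>

/* A transaction with its declared read/write set (ids of shared vars). */
struct txn { const int *vars; size_t nvars; };

static bool conflicts(const struct txn *a, const struct txn *b)
{
    for (size_t i = 0; i < a->nvars; i++)
        for (size_t j = 0; j < b->nvars; j++)
            if (a->vars[i] == b->vars[j])
                return true;       /* overlapping data: would conflict */
    return false;
}

/* may 'candidate' start now, given the currently running transactions?
 * blocking here (i.e. not starting) replaces commit/rollback entirely */
bool can_start(const struct txn *candidate,
               const struct txn *running, size_t nrunning)
{
    for (size_t i = 0; i < nrunning; i++)
        if (conflicts(candidate, &running[i]))
            return false;
    return true;
}
```

Which is the crux of the duplication: hardware TM detects conflicts
after the fact, while this style prevents them from arising at all.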
In fact, the one place that ScalPL could perhaps productively use the
TCC transactions would be in its scheduling algorithm to determine which
ScalPL transaction(s) to run locally next...and if someone really wants
to design hardware just to optimize ScalPL-like transaction scheduling,
I might suggest some other directions.
IOW, I'm not yet convinced that transactional memory is anything more
than a neat hardware hack, without much practical utility except for
cases where you've got rather extreme parallelism and very lax ordering
requirements.