On Thursday, July 1, 2021 at 1:52:18 PM UTC-4, Anton Ertl wrote:
> Around 2009 transactional memory (both software transactional memory
> and hardware transactional memory) were hot topics. Sun had announced
> it for the Rock processor (canceled in 2009). AMD made the ASF
> proposal (and never implemented it). IBM announced transactional
> memory for Blue Gene/Q (2011), and later had it in Power 8 and Power
> 9. Intel added TSX (which is actually two mechanisms, hardware lock
> elision (HLE) and restricted transactional memory (RTM)) to Haswell
> (2013), Broadwell (2014) and Skylake (since 2015). ARM has TME (not
> sure if it is implemented anywhere). Looks like a winner, doesn't it?
>
> But Intel's TSX was plagued with functionality bugs, and security
> bugs, which have led to disabling TSX temporarily and apparently now
> permanently on many processors through firmware updates. IBM removed
> transactional memory in Power 10. AMD never implemented ASF, nor TSX.
>
> So what happened? Is it too hard to implement hardware transactional
> memory correctly? Or does it offer too little to software to make
> software writers write extra code paths for it? Or something else?
Sun's Rock implementation attempted to be general but always
failed under many conditions (even a function call, IIRC). Rock was
also similar to Pentium 4 in being highly innovative, so design
resources were presumably stretched and issues tended to avalanche,
with issues in one aspect magnifying issues in other aspects (and
few spare design resources to address them).
Azul Systems' experience indicated that shared software (performance)
counters prevented broad use of its HTM. (They could not rewrite
all the libraries. Cliff Click had a blog post on this.)
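To make the shared-counter problem concrete, here is a toy model
(not any vendor's actual conflict detection; the names make_tx,
conflicts and "stats.ops" are made up for illustration). Each
transaction is just a read set and a write set, and two transactions
conflict if one writes what the other reads or writes:

```python
def conflicts(tx_a, tx_b):
    """Two transactions conflict if either writes a location the other touches."""
    reads_a, writes_a = tx_a
    reads_b, writes_b = tx_b
    return bool(writes_a & (reads_b | writes_b)) or bool(writes_b & reads_a)


def make_tx(private_key, shared_counter=None):
    """Build (read_set, write_set) for a transaction updating one private location."""
    reads = {private_key}
    writes = {private_key}
    if shared_counter is not None:
        # counter += 1 both reads and writes the shared location
        reads.add(shared_counter)
        writes.add(shared_counter)
    return reads, writes


# Two threads updating disjoint buckets: no conflict.
plain = [make_tx("bucket_1"), make_tx("bucket_2")]
# Same work, but each transaction also bumps a shared statistics counter.
counted = [make_tx("bucket_1", "stats.ops"), make_tx("bucket_2", "stats.ops")]

plain_conflict = conflicts(*plain)
counted_conflict = conflicts(*counted)
print(plain_conflict, counted_conflict)
```

In this model the otherwise independent transactions always conflict
once the counter is added, which is the effect a library-wide
statistics counter would have on any transaction that calls into it.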
The Blue Gene architecture and implementation were rather
embedded/HPC in style. As such, I would count that more as a
potential learning experience, with limited exposure to software
developers.
AMD's ASF probably failed to be implemented in part because
Intel would not sign on. AMD's experience with its SIMD extensions
(which had little adoption in software, I suspect) probably discouraged
'going it alone', especially for an extension primarily useful for
scale-up systems. With Intel using market segmentation tactics for RTM,
AMD's incentive to implement such would be limited. (Was this also
a time when AMD was in decline?)
I am extremely uninformed on the implementation and performance
aspects of IBM's POWER implementations, so I will not comment on
such.
Intel's RTM (and HLE) implementations seem to have suffered
from trying to be "full-featured" in the first implementation and not
providing speed for small transactions. If I understand correctly,
RTM failures were very expensive (i.e., more than a branch
misprediction).
The side-channel issues seem inherent in shared permission
or shared storage. With shared permission, an inquisitive thread
can introduce failures that are address-dependent, producing
data-dependent timing variability. With conservative filters (and
no distinction of permission domains), even with no access to
shared memory locations (and prefetches failing on permission
violation), an inquisitive thread could introduce data-dependent
timing variation.
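A toy illustration of the conservative-filter case, assuming a
hypothetical 16-slot conflict filter indexed by cache-line number
(FILTER_BITS, slot, victim_signature and probe_aborts are all invented
for this sketch; real filters are larger and hashed differently). The
attacker only ever touches its own memory, yet which of its lines
would abort the victim depends on the victim's secret-dependent
address:

```python
FILTER_BITS = 16   # hypothetical, deliberately tiny filter
LINE = 64          # cache-line size in bytes


def slot(addr):
    """Map a byte address to a filter slot (toy hash: line index modulo size)."""
    return (addr // LINE) % FILTER_BITS


def victim_signature(secret, table_base=0x10000):
    """Victim reads table[secret] in a transaction; one cache line per entry."""
    return {slot(table_base + secret * LINE)}


def probe_aborts(signature, attacker_base=0x800000):
    """Attacker writes one private line per slot; return the probes that alias."""
    hits = []
    for i in range(FILTER_BITS):
        addr = attacker_base + i * LINE   # never overlaps the victim's table
        if slot(addr) in signature:
            hits.append(i)
    return hits


leak_3 = probe_aborts(victim_signature(3))
leak_9 = probe_aborts(victim_signature(9))
print(leak_3, leak_9)
```

In this model the aliasing probe index tracks the secret, even though
no memory location is shared. Whether the abort is observed by the
attacker directly or only through the victim's timing depends on
details this sketch does not try to model.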
In my opinion, the implementation difficulty (probably comparable
to cache coherence with high-performance fences and weak consistency)
was a significant problem; that difficulty, IMO, suggests
introducing a more limited feature set initially. The commonness of
failure modes for common software (e.g., tracking counters/logging),
the high cost of success (compared to ll/sc with a few added memory
accesses), and the very high cost of failures also seem to be
(somewhat avoidable) issues.
I disagree with Linus Torvalds about the potential for HTM, but some
of his arguments on Real World Technologies do make sense to me.
Linus Torvalds does think that a kind of extended ll/sc might be
useful.
One Real World Technologies thread on HTM:
https://www.realworldtech.com/forum/?threadid=201184&curpostid=201184
Failure prediction (and automatic retry) seem sensible. I suspect
some of the interesting features would only make sense for a third
or fourth generation implementation.
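As a rough software sketch of what failure prediction with automatic
retry could look like (purely a toy model; SitePredictor, run and the
2-bit counter policy are my own assumptions, not a description of any
shipped hardware): a per-site saturating counter skips the transaction
when it has recently kept failing, and otherwise retries a bounded
number of times before taking the fallback (e.g., lock-based) path.

```python
class SitePredictor:
    """2-bit saturating counter; values >= 2 predict the transaction will abort."""

    def __init__(self):
        self.counter = 0

    def predicts_abort(self):
        return self.counter >= 2

    def update(self, aborted):
        if aborted:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def run(site, try_transaction, fallback, max_retries=2):
    """Try the transaction with bounded retry unless the site is predicted to abort."""
    if site.predicts_abort():
        site.counter -= 1   # age the prediction so the site is eventually retried
        return "fallback", fallback()
    for _ in range(1 + max_retries):
        ok, value = try_transaction()
        site.update(aborted=not ok)
        if ok:
            return "transaction", value
    return "fallback", fallback()


attempt_log = []


def always_aborts():
    """Stand-in for a transaction that can never commit (e.g., too large)."""
    attempt_log.append("abort")
    return False, None


def locked_path():
    """Stand-in for the non-transactional fallback path."""
    return 42


site = SitePredictor()
paths = [run(site, always_aborts, locked_path)[0] for _ in range(4)]
print(paths, len(attempt_log))
```

Over the four calls the hopeless transaction is attempted six times
rather than the twelve that blind retry would spend, with the second
and third calls going straight to the fallback path.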