Whatever happened to transactional memory?

Anton Ertl

Jul 1, 2021, 1:52:18 PM

Around 2009 transactional memory (both software transactional memory
and hardware transactional memory) was a hot topic. Sun had announced
it for the Rock processor (canceled in 2009). AMD made the ASF
proposal (and never implemented it). IBM announced transactional
memory for Blue Gene/Q (2011), and later had it in Power 8 and Power
9. Intel added TSX (which is actually two mechanisms, hardware lock
elision (HLE) and restricted transactional memory (RTM)) to Haswell
(2013), Broadwell (2014) and Skylake (since 2015). ARM has TME (not
sure if it is implemented anywhere). Looks like a winner, doesn't it?

But Intel's TSX was plagued with functionality bugs, and security
bugs, which have led to disabling TSX temporarily and apparently now
permanently on many processors through firmware updates. IBM removed
transactional memory in Power 10. AMD never implemented ASF, nor TSX.

So what happened? Is it too hard to implement hardware transactional
memory correctly? Or does it offer too little to software to make
software writers write extra code paths for it? Or something else?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

Jul 1, 2021, 2:39:27 PM

On Thursday, July 1, 2021 at 12:52:18 PM UTC-5, Anton Ertl wrote:
> Around 2009 transactional memory (both software transactional memory
> and hardware transactional memory) were hot topics. Sun had announced
> it for the Rock processor (canceled in 2009). AMD made the ASF
> proposal (and never implemented it). IBM announced transactional
> memory for Blue Gene/Q (2011), and later had it in Power 8 and Power
> 9. Intel added TSX (which is actually two mechanisms, hardware lock
> elision (HLE) and restricted transactional memory (RTM)) to Haswell
> (2013), Broadwell (2014) and Skylake (since 2015). ARM has TME (not
> sure if it is implemented anywhere). Looks like a winner, doesn't it?
>
> But Intel's TSX was plagued with functionality bugs, and security
> bugs, which have led to disabling TSX temporarily and apparently now
> permanently on many processors through firmware updates. IBM removed
> transactional memory in Power 10. AMD never implemented ASF, nor TSX.
>
> So what happened? Is it too hard to implement hardware transactional
> memory correctly? Or does it offer too little to software to make
> software writers write extra code paths for it? Or something else?
<

In the case of AMD, I left and there was no one left who understood its
complexities well enough to lead the charge. But ASF was NOT TM, it
could be used to help build a TM but it was NOT TM.
<
In the case of Intel, I think Meltdown and Spectre along with the myriad
of bugs in TSX-HLE-RTM meant that continuing support was too costly;
OR that it is better to wait for some future theory to develop
before bringing it back.
<
In my case, I decided TM is too much for HW, and developed large scale
ATOMIC events instead: the Exotic Synchronization Method (ESM) of
My 66000 {a follow-on to ASF}.

Paul A. Clayton

Jul 1, 2021, 2:48:42 PM

On Thursday, July 1, 2021 at 1:52:18 PM UTC-4, Anton Ertl wrote:
> Around 2009 transactional memory (both software transactional memory
> and hardware transactional memory) were hot topics. Sun had announced
> it for the Rock processor (canceled in 2009). AMD made the ASF
> proposal (and never implemented it). IBM announced transactional
> memory for Blue Gene/Q (2011), and later had it in Power 8 and Power
> 9. Intel added TSX (which is actually two mechanisms, hardware lock
> elision (HLE) and restricted transactional memory (RTM)) to Haswell
> (2013), Broadwell (2014) and Skylake (since 2015). ARM has TME (not
> sure if it is implemented anywhere). Looks like a winner, doesn't it?
>
> But Intel's TSX was plagued with functionality bugs, and security
> bugs, which have led to disabling TSX temporarily and apparently now
> permanently on many processors through firmware updates. IBM removed
> transactional memory in Power 10. AMD never implemented ASF, nor TSX.
>
> So what happened? Is it too hard to implement hardware transactional
> memory correctly? Or does it offer too little to software to make
> software writers write extra code paths for it? Or something else?

Sun's Rock implementation attempted to be general, but transactions
always failed under many conditions (even a function call, IIRC). Rock
was also similar to Pentium 4 in being highly innovative, so design
resources were presumably stretched and issues tended to avalanche,
with issues in one aspect magnifying issues in other aspects (and
few spare design resources to address them).

Azul Systems' experience indicated that shared software (performance)
counters prevented broad use of its HTM. (They could not rewrite
all the libraries. Cliff Click had a blog post on this.)
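
To illustrate the counter problem: a statistics counter that every thread
increments puts one cache line into every transaction's write set, so any
two concurrent transactions conflict even when their real data is disjoint.
The usual software workaround is to stripe the counter. A minimal C11
sketch follows; the names (striped_counter_t, STRIPES, CACHE_LINE) and the
64-byte line size are assumptions for illustration, not anything taken
from Azul's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define STRIPES 8       /* assumed: roughly one slot per thread or core */
#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Each slot sits on its own cache line, so an increment by one thread
   does not touch a line that another thread's transaction is using. */
typedef struct {
    _Alignas(CACHE_LINE) atomic_ulong value;
} stripe_t;

typedef struct {
    stripe_t slot[STRIPES];
} striped_counter_t;

static void striped_init(striped_counter_t *c) {
    assert(c != NULL);
    for (size_t i = 0; i < STRIPES; i++)
        atomic_init(&c->slot[i].value, 0);
}

/* Hot path: thread_id is any stable per-thread index. */
static void striped_inc(striped_counter_t *c, size_t thread_id) {
    atomic_fetch_add_explicit(&c->slot[thread_id % STRIPES].value, 1,
                              memory_order_relaxed);
}

/* Cold path (e.g. when reporting): sums all slots. */
static unsigned long striped_total(striped_counter_t *c) {
    unsigned long sum = 0;
    for (size_t i = 0; i < STRIPES; i++)
        sum += atomic_load_explicit(&c->slot[i].value,
                                    memory_order_relaxed);
    return sum;
}
```

This only helps code one can rewrite, which is exactly the limitation
mentioned above: counters buried in existing libraries stay shared.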

The Blue Gene architecture and implementation was rather
embedded/HPC in style. As such, I would count that more as a
potential learning experience and limited exposure to software
developers.

AMD's ASF probably failed to be implemented in part because
Intel would not sign-on. AMD's experience with its SIMD extensions
(which had little adoption in software, I suspect) probably discouraged
'going it alone', especially for an extension primarily useful for scale
up systems. With Intel using market segmentation tactics for RTM,
AMD's incentive to implement such would be limited (was this also
a time when AMD was in decline?)

I am extremely uninformed on the implementation and performance
aspects of IBM's POWER implementations, so I will not comment on
such.

Intel's RTM (and HLE) implementations seemed to have suffered
from trying to be "full-featured" in the first implementation and not
providing speed for small transactions. If I understand correctly,
RTM failures were very expensive (i.e., more than a branch
misprediction).

The side channel issues seem inherent within shared permission
or shared storage. With shared permission, an inquisitive thread
can introduce failures that are address-dependent producing a
data-dependent timing variability. With conservative filters (and
no distinction of permission domains), even with no access to
shared memory locations (and prefetches failing on permission
violation) an inquisitive thread could introduce data-dependent
timing variation.

In my opinion, the implementation difficulty (probably comparable to
cache coherence with high performance fences and weak consistency)
was a significant problem, which suggests introducing a more limited
feature set initially. The commonness of failure modes for common
software (e.g., tracking counters/logging), the high cost of success
(compared to ll/sc with a few added memory accesses), and the very
high cost of failures also seem to be (somewhat avoidable) issues.

I disagree with Linus Torvalds about the potential for HTM, but some
of his arguments on Real World Technologies do make sense to me.
Linus Torvalds does think that a kind of expanded ll/sc might be
useful.
One Real World Technologies thread on HTM:
https://www.realworldtech.com/forum/?threadid=201184&curpostid=201184

Failure prediction (and automatic retry) seem sensible. I suspect
some of the interesting features would only make sense for a third
or fourth generation implementation.

Thomas Koenig

Jul 1, 2021, 2:56:53 PM

Paul A. Clayton <paaron...@gmail.com> wrote:

> I am extremely uninformed on the implementation and performance
> aspects of IBM's POWER implementations, so I will not comment on
> such.

It appears to have worked on POWER8, but it did not work on
at least one release of POWER9 chips (DD 2.1). Must have
been brown paper bag day at IBM...

MitchAlsup

Jul 1, 2021, 4:11:03 PM

This is not how the game is played between I and A. ASF had the full
weight of MS behind it, and where MS goes so follows I--witness
x86-64 as compared to IA-64.
<
Also note:: ASF was a performance oriented ATOMIC accelerator
and the OS needs lots of this; whereas HTM is a general purpose
multithreaded coordinator. MS wanted ASF for OS stuff, not necessarily
to give it away to applications.

<
> AMD's experience with its SIMD extensions
> (which had little adoption in software, I suspect) probably discouraged
> 'going it alone', especially for an extension primarily useful for scale
> up systems. With Intel using market segmentation tactics for RTM,
> AMD's incentive to implement such would be limited (was this also
> a time when AMD was in decline?)
>
> I am extremely uninformed on the implementation and performance
> aspects of IBM's POWER implementations, so I will not comment on
> such.
>
> Intel's RTM (and HLE) implementations seemed to have suffered
> from trying to be "full-featured" in the first implementation and not
> providing speed for small transactions. If I understand correctly,
> RTM failures were very expensive (i.e., more than a branch
> misprediction).
>
> The side channel issues seem inherent within shared permission
> or shared storage. With shared permission, an inquisitive thread
> can introduce failures that are address-dependent producing a
> data-dependent timing variability. With conservative filters (and
> no distinction of permission domains), even with no access to
> shared memory locations (and prefetches failing on permission
> violation) an inquisitive thread could introduce data-dependent
> timing variation.
<

ASF and ESM solve this problem in that if a thread reaches a certain
(well defined) point, interference gets NaKed while the current
owner is allowed to make forward progress. So the interferer
takes the hit, not the thread making forward progress.

Ivan Godard

Jul 1, 2021, 5:20:21 PM

The Mill atomic facility is semantically equivalent to ASF and ESM.

Ivan Godard

Jul 1, 2021, 5:30:48 PM

True transactions are beyond hardware solution; hardware doesn't do
arbitrarily-big. Bounded atomicity (frequently misnamed "transaction")
is doable in hardware. There are two issues, identified by Paul and
others, that must be addressed without falling into the subset trap. One
is distinguishing participants (which must be atomic) from concurrent
non-participants (like logging, single-step, and perfcounts). The other
is performant retry.

The failing implementations got these issues wrong.

MitchAlsup

Jul 1, 2021, 6:10:30 PM

Excellent !
<
How do you query the "miss buffer" to see if it is OK to proceed to writes ?

Quadibloc

Jul 1, 2021, 11:00:17 PM

On Thursday, July 1, 2021 at 12:39:27 PM UTC-6, MitchAlsup wrote:

> In the case of Intel, I think Meltdown and Spectre along with the myriad
> of bugs in TSX-HLE-RTM meant that continuing support was too costly to
> continue; OR that it is better to wait for some future theory to develop
> before bringing it back.

According to Wikipedia, Intel has announced that it's going to bring back
a new version of TSX in the Sapphire Rapids processors, hopefully having
fixed the bugs of prior versions.

John Savard

Anton Ertl

Jul 2, 2021, 8:08:41 AM

"Paul A. Clayton" <paaron...@gmail.com> writes:
>On Thursday, July 1, 2021 at 1:52:18 PM UTC-4, Anton Ertl wrote:
>> So what happened? Is it too hard to implement hardware transactional
>> memory correctly? Or does it offer too little to software to make
>> software writers write extra code paths for it? Or something else?

[interesting stuff snipped]

>I disagree with Linus Torvalds about the potential for HTM, but some
>of his arguments on Real World Technologies do make sense to me.
>Linus Torvalds does think that a kind of expanded ll/sc might be
>useful.
>One Real World Technologies thread on HTM:
>https://www.realworldtech.com/forum/?threadid=201184&curpostid=201184

This thread contains quite informative postings. "anon2" posts
answers to my questions:

https://www.realworldtech.com/forum/?threadid=201184&curpostid=201233

I don't know enough about this stuff to assess anon2's take, but it
does sound plausible and would explain why transactional memory has
not taken the world by storm. In particular, he writes:

|And what's more the modern implementation doesn't all come crashing
|down in a heap as soon as one processor aborts because it collided
|with an update or took a slightly different path that used a few more
|cache lines, or took an interrupt. When you have to take the fallback
|and the fallback path taking the lock causes all other threads to
|abort their transactions and get stuck in the fallback path and
|everything falls off the cliff. So then you need to fix your fallback
|path, at which point you find it works better than TSX anyway.

And here is Linus Torvalds' take on the situation:

https://www.realworldtech.com/forum/?threadid=201184&curpostid=201214

He wrote additional postings, but this seems to have been the major one.

Andy

Jul 2, 2021, 9:30:33 PM

Apologies for the formatting errors; my accidentally sending it through
email has clearly done it no favors, oops.

On 30/06/21 5:39 am, MitchAlsup wrote:
> On Monday, June 28, 2021 at 10:45:41 PM UTC-5, Bitter blue pill wrote:

>> Ummm, so no love for Call Return Tables?, wouldn't they go some way to
>> reducing the number of conditional instructions spent checking for error
>> return codes and having otherwise unneeded instructions situated right
>> after the original call.
> <
> I am not sure I understand your 3 word phrase--can you give an example ?

Ah, could be I'm not using the correct terminology for modern ears, or
perhaps more likely it's such an obscure thing to use, only the most die
hard assembler programmers back in the day considered it worthwhile.

> I have seen a lot of code that does stuff like::
> <
>     if( function( arg1, arg2, arg3 ) == SUCCESS )
>         if( next_function( arg4, arg5 ) == SUCCESS )
>             return SUCCESS;
>     return FAILURE;
> <
> And I would expect the code at the return point to have a CMP-BC or BC0,
> depending on whether SUCCESS was == 0 or not.
>
>> And that would be an example of what those programmers were trying to
>> avoid, making a call in the middle of some performance critical code and
>> then having to check a return code to determine success or failure, and
>> if so which of possibly multiple errors had occurred.
>> Also it allows for error handling code to be pushed far enough away from
>> the main loop that the icache never fills with error handling instructions.
>>
>> The idea being that after a CALL, BSR, JSR, JAL, etc type instruction a
>> table of offsets was built that pointed to each of the error handlers.
>> And all going well after calling the subroutine, execution resumes right
>> at the instructions following the return table, no error code check needed.
>
> You have just pushed the transfer of control on error into the called
> subroutine and are making the called subroutine have access to a
> set of labels of the calling subroutine. So, you do not in any way
> eliminate the complexity, you just bury it somewhere you cannot
> see the complexity.

It's not about reducing complexity, just making the common execution
path a little faster by replacing a condition code comparison in the
main routine with an add in the subroutine.

>> Should an error occur in the subroutine, it's handled close to where
>> it's detected, and with the aid of the return offset table jumps
>> straight to the error handler as required.

> The caller still has to supply the table of labels.

Yes of course.

>> Your 'indexed subroutine table' triggered something of a flashback I
>> guess, and it occurs to me that maybe your mechanism is generalizable
>> enough that it could also be used for multi path return tables as well.
>> And if so, it does so as a privileged operation that could let such
>> tables be referenced in code segments marked as executable but not
>> readable nor writable?
>
> As currently defined, you could return from called
> subroutine and use its return code as a switch index::
>
>     if( (code = some_function( arg1, arg2, arg3 )) == SUCCESS )
>     {
>     }
>     else switch( code )
>     {
>     case RANGE:     break;
>     case DOMAIN:    break;
>     case OPERATION: break;
>     }

Hmmm, the whole idea is an indexed switch, just an unconventional one
that is typically never used if there is no error condition arising in
the subroutine.

>> Which was kind of a drawback when I saw it used, in that using general
>> data referencing instructions to monkey around in a table sitting smack
>> in the middle of an instruction sequence, meant that segment had to be
>> marked as readable.
>
> The switch tables of My 66000 are not readable
> and only accessed by the FETCH stream, so they only need EXECUTE
> access rights.

That's what I was thinking it might do.

>> In the current era of return oriented malware programming, maybe it's
>> possible to close off at least one such avenue of exploitation.
>
> My EXIT
> instruction makes it difficult for ROP to exploit My 66000 codes.
>
>> And for programmers who want/need to use return tables on modestly
>> resourced embedded CPUs which may also be connected to the internet,
>> maybe that is worth at least some small consideration?
>
> The problem I see, here, is that the called subroutine has to use a
> table of labels in a domain it does not "understand"*,

Not sure what you mean here.

On an error the subroutine pops the stack for the address of the return
table (or it's already sitting in the link register of a RISC style
ISA), uses a precoded index to fetch the appropriate offset from the
table, adds offset + table address, then jumps to the resulting address.
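
The address arithmetic described above can be modeled in C. Portable C
cannot place a table after a call instruction or jump to a computed return
address, so this sketch only computes where execution would resume. The
names return_table_t, resume_address and NO_ERROR, and the 32-bit offset
width, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the words that follow a call instruction: a table of byte
   offsets, one per error handler, relative to the table's own address. */
typedef struct {
    uintptr_t table_addr;    /* what the call saved as "return address" */
    const int32_t *offsets;  /* offsets[k] leads to the handler for error k */
    size_t count;            /* number of table entries */
} return_table_t;

#define NO_ERROR ((size_t)-1)

/* Where the callee transfers control when it finishes. */
static uintptr_t resume_address(const return_table_t *t, size_t error_index) {
    if (error_index == NO_ERROR)
        /* Success: skip the table and fall into the caller's main path. */
        return t->table_addr + t->count * sizeof(int32_t);
    assert(error_index < t->count);
    /* Error: fetch the precoded offset and add it to the table address.
       Offsets may be negative (handler placed before the call site). */
    return t->table_addr + (uintptr_t)(intptr_t)t->offsets[error_index];
}
```

On the success path the caller executes no compare or branch; the extra
work is the add in the callee, which is the trade being argued over here.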

> furthermore: it appears
> to be no less actual code (dynamic) to perform the transfer of control
> after arriving back at the call point.

As above, no error -> no conditional control transfer in the return path
or in the main code after the subroutine return, hence the speedup,
small as it may be on various ISAs.

> (*) could be a cross language subroutine call.

MitchAlsup

Jul 2, 2021, 9:59:59 PM

Subroutine X needs labels from calling routine Y. The labels of Y are
not known in X. What you are proposing is that this table of labels in Y
is used after registers are restored in X and, instead of returning to
the call point in Y, you transfer control to Label[k] in Y.
<
So, you have the same amount of flow control, but in your method,
you are perverting subroutine boundaries to do it. That is:
Instead of simply returning with the error code, you test the error code
after 95% of epilogue is done and then branch over return to label[k].
<
Whereas in the unperverted case, you return the index to the table
and branch-on-zero over the switch into error handlers.
<
Thus, there is the same amount of control flow--assuming an epilogue
is needed for this function. {You have saved nothing} And since this
branch is easily predicted, you may have lost a trifle.
<
Now assuming you do not have an Epilogue in the function I think the
same arithmetic holds.
<
The reason for pointing at Epilogue is that C++ functions have to call
destructors in Epilogue before restoring registers, deallocating stack,
and returning. And then there is the EXIT instruction in My 66000 ISA
that does all of the restoration, deallocation, and control transfer back.

a...@littlepinkcloud.invalid

Jul 3, 2021, 6:00:55 AM

Ivan Godard <iv...@millcomputing.com> wrote:
> True transactions are beyond hardware solution; hardware doesn't do
> arbitrarily-big. Bounded atomicity (frequently misnamed "transaction")
> is doable in hardware. There are two issues, identified by Paul and
> others, that must be addressed without falling into the subset trap. One
> is distinguishing participants (which must be atomic) from concurrent
> non-participants (like logging, single-step, and perfcounts). The other
> is performant retry.
>
> The failing implementations got these issues wrong.

Trying to utilize bounded hardware "transactions" to accelerate
software transactions is extraordinarily difficult. To quote Matveev
and Shavit,

"For many years, the accepted wisdom has been that the key to adoption
of best-effort hardware transactions is to guarantee progress by
combining them with an all software slow-path, to be taken if the
hardware transactions fail repeatedly. However, all known generally
applicable hybrid transactional memory solutions suffer from a major
drawback: the coordination with the software slow-path introduces an
unacceptably high instrumentation overhead into the hardware
transactions." They go on to describe an algorithm which is reasonably
performant, but it is awfully complicated. [1]
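
The fast-path/slow-path structure in the quote can be sketched as a retry
loop. Everything named here (try_hw_fn, run_sw_fn, run_hybrid, the stubs)
is hypothetical; a real fast path would use the platform's transaction
begin/commit instructions instead of a callback. The sketch deliberately
leaves out the coordination between the two paths, which is where the
quoted instrumentation overhead comes from.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical hooks: try_hw runs the critical section as a best-effort
   hardware transaction and reports whether it committed; run_sw runs the
   same critical section on a software slow path that always completes. */
typedef bool (*try_hw_fn)(void *ctx);
typedef void (*run_sw_fn)(void *ctx);

typedef struct {
    int hw_commits;
    int sw_fallbacks;
} hybrid_stats_t;

static void run_hybrid(try_hw_fn try_hw, run_sw_fn run_sw, void *ctx,
                       int max_hw_attempts, hybrid_stats_t *stats) {
    assert(max_hw_attempts >= 0);
    for (int attempt = 0; attempt < max_hw_attempts; attempt++) {
        if (try_hw(ctx)) {          /* committed in hardware */
            stats->hw_commits++;
            return;
        }
    }
    run_sw(ctx);                    /* repeated aborts: guaranteed progress */
    stats->sw_fallbacks++;
}

/* Demonstration stub: aborts until the counter in ctx reaches zero. */
static bool flaky_hw(void *ctx) {
    int *remaining_aborts = ctx;
    if (*remaining_aborts > 0) {
        (*remaining_aborts)--;
        return false;
    }
    return true;
}

/* Demonstration stub: a slow path that does nothing. */
static void noop_sw(void *ctx) { (void)ctx; }
```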

One other thing worth mentioning: ARMv9 has the Transactional Memory
Extension, TME. I don't know what hardware supports it.

Andrew.

[1] https://dspace.mit.edu/handle/1721.1/90886

Anton Ertl

Jul 3, 2021, 7:29:47 AM

a...@littlepinkcloud.invalid writes:
>One other thing worth mentioning: ARMv9 has the Transactional Memory
>Extension, TME.

https://developer.arm.com/architectures/cpu-architecture/a-profile?_ga=2.25090490.1816501146.1617130244-708743682.1613576602#arm-v9a

says that TME is a major feature of ARMv9.

https://www.arm.com/why-arm/architecture/cpu

does not mention TME as a key feature of ARMv9.

Is it major, but not key? Or did they have a change of mind?

> I don't know what hardware supports it.

They recently announced the Cortex-X2, Cortex-A710, and Cortex-A510
that are implementations of ARMv9; hardware is expected in 2022. We
will see if they all implement TME.

Ivan Godard

Jul 3, 2021, 8:09:17 AM

On 7/2/2021 6:59 PM, MitchAlsup wrote:
<snip>

> Subroutine X needs labels from calling routine Y. The labels of Y are
> not known in X. What you are proposing is that this table of labels in Y
> is used after registers are restored in X and, instead of returning to
> the call point in Y, you transfer control to Label[k] in Y.
> <
> So, you have the same amount of flow control, but in your method,
> you are perverting subroutine boundaries to do it. That is:
> Instead of simply returning with the error code, you test the error code
> after 95% of epilogue is done and then branch over return to label[k].
> <
> Whereas in the unperverted case, you return the index to the table
> and branch-on-zero over the switch into error handlers.
> <
> Thus, there is the same amount of control flow--assuming an epilogue
> is needed for this function. {You have saved nothing} And since this
> branch is easily predicted, you may have lost a trifle.
> <
> Now assuming you do not have an Epilogue in the function I think the
> same arithmetic holds.
> <
> The reason for pointing at Epilogue is that C++ functions have to call
> destructors in Epilogue before restoring registers, deallocating stack,
> and returning. And then there is the EXIT instruction in My 66000 ISA
> that does all of the restoration, deallocation, and control transfer back.

It's a little more complicated than that, because the control transfer
can be a remote throw instead of a return. In addition, if the throw is
not caught, the user wants to enter the debugger at the throw site with
the stack not unwound and no destructors called. Consequently you get
two trips down through the stack, one looking for catch and one unwinding.
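
The two trips can be sketched over a toy stack model. This is not how any
real unwinder is implemented (those walk metadata in the load module, as
described next); uframe_t and throw_two_phase are hypothetical names, and
setting a flag stands in for running a frame's destructors.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool catches;   /* frame has a handler matching the thrown exception */
    bool unwound;   /* set once this frame's destructors have "run" */
} uframe_t;

/* frames[0] is the outermost frame, frames[depth-1] is the throw site.
   Returns the index of the catching frame, or -1 if nothing catches,
   in which case no frame has been touched. */
static int throw_two_phase(uframe_t *frames, int depth) {
    assert(depth > 0);
    int handler = -1;

    /* Trip 1: search for a catch without modifying anything. */
    for (int i = depth - 1; i >= 0; i--) {
        if (frames[i].catches) {
            handler = i;
            break;
        }
    }
    if (handler < 0)
        return -1;  /* uncaught: the stack stays intact for the debugger */

    /* Trip 2: unwind every frame between the throw site and the handler. */
    for (int i = depth - 1; i > handler; i--)
        frames[i].unwound = true;
    return handler;
}
```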

This all is complicated enough that the standard implementation puts
metadata about the stack and catch situation in the load module and
invokes a software library to do the unwind. Which works except when the
throwing code is the OS kernel and has no file facility to read the load
module...

Then there's also the question of what to do when the stack contains
calls between protection domains A->B->A, and what happens when an inner
function in A throws to an outer function also in A, but the throw
passes through middle functions in B. As far as I know we are the only
ones addressing this.

EricP

Jul 3, 2021, 11:22:38 PM

Anton Ertl wrote:
> a...@littlepinkcloud.invalid writes:
>> One other thing worth mentioning: ARMv9 has the Transactional Memory
>> Extension, TME.
>
> https://developer.arm.com/architectures/cpu-architecture/a-profile?_ga=2.25090490.1816501146.1617130244-708743682.1613576602#arm-v9a
>
> says that TME is a major feature of ARMv9.
>
> https://www.arm.com/why-arm/architecture/cpu
>
> does not mention TME as a key feature of ARMv9.
>
> Is it major, but not key? Or did they have a change of mind?
>
>> I don't know what hardware supports it.
>
> They recently announced the Cortex-X2, Cortex-A710, and Cortex-A510
> that are implementations of ARMv9; hardware is expected in 2022. We
> will see if they all implement TME.
>
> - anton

The key to making HTM usable is in conflict management,
which none of the current implementations do.
They only do conflict detection with "new contender wins"
so they basically bog down at the first sign of conflict.
(The Alohanet model I've mentioned before.)
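
A toy model shows why "new contender wins" bogs down, compared with the
policy described earlier in the thread for ASF and ESM, where interference
is NAKed once the owner passes a defined point. All names and the step
model are assumptions for illustration; in particular, letting the holder
abort before it reaches that point is my guess, not something stated here.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { REQUESTER_WINS, NAK_PAST_POINT } policy_t;
typedef enum { HOLDER_ABORTS, REQUESTER_NAKED } outcome_t;

/* Decide one conflict between the current holder and a new requester. */
static outcome_t resolve(policy_t policy, bool holder_past_point) {
    if (policy == NAK_PAST_POINT && holder_past_point)
        return REQUESTER_NAKED;
    return HOLDER_ABORTS;
}

/* The holder needs `work` uninterrupted steps. A conflicting request
   arrives every `period` steps. Under NAK_PAST_POINT the holder is
   protected once it has made `point` steps of progress.
   Returns the number of holder aborts before completion, or -1 if the
   holder has not completed within max_steps. */
static int aborts_until_done(policy_t policy, int work, int point,
                             int period, int max_steps) {
    assert(work > 0 && period > 0);
    int progress = 0, aborts = 0;
    for (int t = 1; t <= max_steps; t++) {
        progress++;
        if (progress == work)
            return aborts;
        if (t % period == 0 &&
            resolve(policy, progress >= point) == HOLDER_ABORTS) {
            progress = 0;   /* holder restarts from scratch */
            aborts++;
        }
    }
    return -1;
}
```

With work = 10 and a conflict every 4 steps, the requester-wins holder
never finishes, while the NAK policy with point = 3 finishes without a
single abort.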

ARM has been busy adding TME support to various compilers.

There is almost no documentation on how TME works. The blog entry
below has a diagram with a box labeled "conflict management"
but without a detailed description of how it detects conflicts,
and what, if anything, it does to manage conflicts,
it is impossible to assess if it is similar to the others.

I found some technical documentation on ARM's TME dated May 2021 here.
There is a download button for the PDF on the upper left.

https://developer.arm.com/documentation/ddi0608/latest

https://developer.arm.com/documentation/102527/latest/

The only info I can find on how TME might work is in
this web blog article by an ARM employee who added TME
support to the gem5 CPU microarchitecture cycle simulator in 2021.

Arm’s Transactional Memory Extension Support in gem5
https://community.arm.com/developer/research/b/articles/posts/arms-transactional-memory-extension-support-

It says in the simulator that since L2 is inclusive they use it to
hold the "before" image of a cache line and L1 to hold "after" image.
If TME commits, they keep L1, if it aborts they toss the L1 copies.

That could imply that an L1 conflict eviction will abort the transaction,
meaning it is sensitive to each model's cache architecture,
in particular the number of "ways" in each row.
Cache Directory controllers can also force evictions if their global
tracking rows fill up, so global cache interactions are possible.
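
The sensitivity to the number of ways can be made concrete. The sketch
below checks whether a transaction's write set can be resident in a
set-associative L1 all at once, assuming 64-byte lines, a simple modulo
set index, and that evicting any transactional line aborts. Those
assumptions and the names (write_set_fits, LINE_BYTES, MAX_SETS) are mine,
not taken from the TME documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64   /* assumed cache line size */
#define MAX_SETS 1024   /* upper bound on sets this sketch supports */

/* Returns true if every distinct cache line touched by addrs[0..n-1]
   can sit in a cache with `sets` rows of `ways` entries each. */
static bool write_set_fits(const uintptr_t *addrs, size_t n,
                           unsigned sets, unsigned ways) {
    unsigned used[MAX_SETS] = {0};
    assert(sets > 0 && sets <= MAX_SETS);
    for (size_t i = 0; i < n; i++) {
        uintptr_t line = addrs[i] / LINE_BYTES;
        /* Skip addresses whose line was already counted. */
        bool seen = false;
        for (size_t j = 0; j < i && !seen; j++)
            seen = (addrs[j] / LINE_BYTES == line);
        if (seen)
            continue;
        /* One row overflowing is enough to force an eviction. */
        if (++used[line % sets] > ways)
            return false;
    }
    return true;
}
```

With 64 sets and 8 ways (32 KiB), nine lines at a 4096-byte stride all
land in one row and do not fit, while nine consecutive lines do.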

Chris M. Thomasson

Jul 4, 2021, 4:03:21 PM

On 7/1/2021 10:26 AM, Anton Ertl wrote:
> Around 2009 transactional memory (both software transactional memory
> and hardware transactional memory) were hot topics. Sun had announced
> it for the Rock processor (canceled in 2009). AMD made the ASF
> proposal (and never implemented it). IBM announced transactional
> memory for Blue Gene/Q (2011), and later had it in Power 8 and Power
> 9. Intel added TSX (which is actually two mechanisms, hardware lock
> elision (HLE) and restricted transactional memory (RTM)) to Haswell
> (2013), Broadwell (2014) and Skylake (since 2015). ARM has TME (not
> sure if it is implemented anywhere). Looks like a winner, doesn't it?
>
> But Intel's TSX was plagued with functionality bugs, and security
> bugs, which have led to disabling TSX temporarily and apparently now
> permanently on many processors through firmware updates. IBM removed
> transactional memory in Power 10. AMD never implemented ASF, nor TSX.
>
> So what happened? Is it too hard to implement hardware transactional
> memory correctly? Or does it offer too little to software to make
> software writers write extra code paths for it? Or something else?

Iirc, the last time I tried out a TM, it would abort a write transaction
if a read was performed. Think of a LL/SC that would abort if something
read a word in the reservation granule. This would be very, very BAD!

EricP

Jul 4, 2021, 4:06:56 PM

EricP wrote:
> Anton Ertl wrote:
>> a...@littlepinkcloud.invalid writes:
>>> One other thing worth mentioning: ARMv9 has the Transactional Memory
>>> Extension, TME.

On a related topic I came across another recent ARM addition, EDE.

"Unfortunately, current ISAs do not provide a way to describe such an
execution dependence between two instructions that have no register or
memory dependences. As a result, programmers must place fences,
which unnecessarily serialize many unrelated instructions.
To remedy this limitation, we propose an ISA extension capable
of describing these execution dependences. We call the proposal
Execution Dependence Extension (EDE), and add it to Arm’s AArch64 ISA."

Execution Dependence Extension (EDE):
ISA Support for Eliminating Fences, 2021
https://conferences.computer.org/iscapub/pdfs/ISCA2021-4ghucdBnCWYB7ES2Pe4YdT/333300a456/333300a456.pdf

David W Schroth

Jul 4, 2021, 6:41:45 PM

On Sat, 3 Jul 2021 05:09:13 -0700, Ivan Godard
<iv...@millcomputing.com> wrote:

>On 7/2/2021 6:59 PM, MitchAlsup wrote:
><snip>
>

<snip>

>
>Then there's also the question of what to do when the stack contains
>calls between protection domains A->B->A, and what happens when an inner
>function in A throws to an outer function also in A, but the throw
>passes through middle functions in B. As far as I know we are the only
>ones addressing this.

It is always difficult/challenging to translate from/to OS2200
terms/concepts, but if my understanding is correct, OS2200 provided
this capability in the mid-1980s, and the capability has been used in
customer production ever since.

I applaud your addressing this.

Ivan Godard

Jul 4, 2021, 7:22:39 PM

I once wrote a Mary I compiler that ran on the 1108/Exec8 cross to the
NDE Nord-1, but the system was still batch then. The current version
OS2200 doesn't really do A->B->A; if I understand it it's a micro-task
system that can spawn and queue tasks with different rights, but each
task has a fixed right-set and is independent of all other spawned
tasks, including multiple instances of itself.

That's different than calls. If the call history is A1->B1->A2 ("->"
denotes a call, there are no returns; all three function frames are
still active on the stack) then frame A2 can see and change the contents
of frame A1 because both are in the A protection domain, but neither can
see or change the content of frame B1 which is in a different domain.
When A2 returns there is a transit to domain B and the code of B1 picks
up at the point of call, just as a normal call-return. And there is
another domain-transit when B1 returns, and we are back running A1 in
the A domain.
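
As a toy model of the visibility rule just described (dframe_t, can_access
and transits_on_unwind are hypothetical names, and a one-character tag
stands in for a protection domain):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;   /* e.g. "A1" */
    char domain;        /* protection domain tag, e.g. 'A' */
} dframe_t;

/* stack[0] is the outermost frame. The frame at index `running` may see
   and change the frame at index `target` only if both are in the same
   protection domain. */
static bool can_access(const dframe_t *stack, size_t depth,
                       size_t running, size_t target) {
    assert(running < depth && target < depth);
    return stack[running].domain == stack[target].domain;
}

/* Number of domain transits as every frame returns, innermost first. */
static size_t transits_on_unwind(const dframe_t *stack, size_t depth) {
    size_t transits = 0;
    for (size_t i = depth; i > 1; i--)
        if (stack[i - 1].domain != stack[i - 2].domain)
            transits++;
    return transits;
}
```

For the history A1->B1->A2, A2 can reach A1 but not B1, and returning all
the way out crosses a domain boundary twice.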

If we denote spawn as "=>", then OS2200 does A1=>B1=>A2 OK, but A2 has
no visibility into A1. And if it does A1=>B1=>A2=>A1, that's not a
return and it's not a call either; it's another independent instance of A1.

I think so anyway. It sounds like you are current on OS2200 - do I have
that right?

David W Schroth

Jul 5, 2021, 12:57:31 AM

On Sun, 4 Jul 2021 16:22:37 -0700, Ivan Godard

<iv...@millcomputing.com> wrote:

>On 7/4/2021 3:42 PM, David W Schroth wrote:
>> On Sat, 3 Jul 2021 05:09:13 -0700, Ivan Godard
>> <iv...@millcomputing.com> wrote:
>>
>>> On 7/2/2021 6:59 PM, MitchAlsup wrote:
>>> <snip>
>>>
>> <snip>
>>>
>>> Then there's also the question of what to do when the stack contains
>>> calls between protection domains A->B->A, and what happens when an inner
>>> function in A throws to an outer function also in A, but the throw
>>> passes through middle functions in B. As far as I know we are the only
>>> ones addressing this.
>>
>>
>> It is always difficult/challenging to translate from/to OS2200
>> terms/concepts, but if my understanding is correct, OS2200 provided
>> this capability in the mid-1980s, and the capability has been used in
>> customer production ever since.
>>
>> I applaud your addressing this.
>>
>
>I once wrote a Mary I compiler that ran on the 1108/Exec8 cross to the
>NDE Nord-1, but the system was still batch then. The current version
>OS2200 doesn't really do A->B->A; if I understand it, it's a micro-task
>system that can spawn and queue tasks with different rights, but each
>task has a fixed right-set and is independent of all other spawned
>tasks, including multiple instances of itself.
>

While I am not at all clear on what you consider a micro-task, your
understanding is not correct. Each equivalent of a process gets its
own protection domain; each shared subsystem (think database) also
gets its own protection domain. As an example, a call from a
transaction (with its own domain) to a database domain experiences a
domain transition: before the call, the thread runs with the
transaction's domain; when in the database code, the thread runs with
the database domain.
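
A minimal Python sketch of the domain transition described above, for readers who want something concrete. The names `Thread`, `call_into`, and the domain labels are hypothetical and chosen only for illustration; this is a toy model, not OS2200 code or its API.

```python
from contextlib import contextmanager


class Thread:
    """Toy thread that tracks which protection domain it currently runs under."""

    def __init__(self, home_domain):
        self.domain = home_domain
        self.transitions = []  # log of (from_domain, to_domain) pairs

    @contextmanager
    def call_into(self, target_domain):
        # Cross-domain call: switch to the callee's domain...
        caller = self.domain
        self.transitions.append((caller, target_domain))
        self.domain = target_domain
        try:
            yield self
        finally:
            # ...and switch back to the caller's domain on return.
            self.transitions.append((target_domain, caller))
            self.domain = caller


t = Thread("transaction")
with t.call_into("database"):
    inside = t.domain   # domain while running the database code
after = t.domain        # domain once the call has returned
```

The point of the sketch is only that the thread stays the same while the domain it runs with changes at the call and again at the return.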

>That's different than calls. If the call history is A1->B1->A2 ("->"
>denotes a call, there are no returns; all three function frames are
>still active on the stack) then frame A2 can see and change the contents
>of frame A1 because both are in the A protection domain, but neither can
>see or change the content of frame B1 which is in a different domain.
>When A2 returns there is a transit to domain B and the code of B1 picks
>up at the point of call, just as a normal call-return. And there is
>another domain-transit when B1 returns, and we are back running A1 in
>the A domain.

That's pretty much how things work, except that the Activity Local
Stack can be read or written regardless of which key the thread is
currently running under. Other thread-level storage can be (and usually
is) owned, if you will, by the shared subsystem, and can only be
read/written when running with the key of the shared subsystem.
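
The access rule in the paragraph above can be sketched roughly as follows. This is a simplified model under stated assumptions: `Storage`, `can_access`, and the `ANY_KEY` marker are hypothetical names, and real key checking is certainly more involved than a single equality test.

```python
from dataclasses import dataclass

ANY_KEY = None  # hypothetical marker: storage that is not locked to one key


@dataclass
class Storage:
    name: str
    owner_key: object  # ANY_KEY, or the key the thread must be running under


def can_access(storage, current_key):
    """True if a thread running under current_key may read/write storage."""
    return storage.owner_key is ANY_KEY or storage.owner_key == current_key


# The Activity Local Stack is reachable under any key; subsystem-owned
# thread-level storage only under the subsystem's key.
activity_local_stack = Storage("activity local stack", ANY_KEY)
subsystem_data = Storage("subsystem thread-level storage", "database")
```

With these definitions, `can_access(activity_local_stack, "transaction")` is true, while `can_access(subsystem_data, "transaction")` is false until the thread is running with the `"database"` key.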

>
>If we denote spawn as "=>", then OS2200 does A1=>B1=>A2 OK, but A2 has
>no visibility into A1. And if it does A1=>B1=>A2=>A1, that's not a
>return and it's not a call either; it's another independent instance of A1.
>
>I think so anyway. It sounds like you are current on OS2200 - do I have
>that right?

That isn't how it works on OS2200. And I am reasonably current on
OS2200, as I am one of the people responsible for
enhancing/maintaining the OS.

Ivan Godard

Jul 5, 2021, 1:42:31 AM

Ah! I hadn't understood; thank you for the explanation.

What you describe is close to but not quite the same as ours. We don't
have a shared activity stack; instead the stack frames created while
under a particular key are reachable only by code also running under the
same key. Consequently if there are cross-key calls then the stack winds
up looking like Swiss cheese from the view of any single key. The
advantage is that stack frames can be sure that they are immune to
stack-smash exploits by code in other keys that they call or are called by.
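
As an illustration of the "Swiss cheese" view, here is a small Python sketch. It assumes each frame simply records the key it was created under; `visible_frames` and the frame names are hypothetical, and this models only visibility, not how any hardware enforces it.

```python
def visible_frames(stack, key):
    """Return the stack as seen from `key`.

    Frames created under a different key show up as holes (None).
    """
    return [name if frame_key == key else None for name, frame_key in stack]


# Call history A1 -> B1 -> A2, oldest frame first.
# Each entry is (frame name, key the frame was created under).
stack = [("A1", "A"), ("B1", "B"), ("A2", "A")]

view_a = visible_frames(stack, "A")  # A sees its own frames, with a hole for B1
view_b = visible_frames(stack, "B")  # B sees only B1
```

Code running under key A gets `["A1", None, "A2"]` and code under key B gets `[None, "B1", None]`, so neither side can reach, or smash, the other's frames.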

David W Schroth

Jul 5, 2021, 2:02:55 PM

On Sun, 4 Jul 2021 22:42:25 -0700, Ivan Godard

Thank you for the elaboration; most of what I know about the Mill
architecture is gleaned from posts in this forum. I look forward to
the day when an explanation of how you perform the cross-domain magic
appears.

I should probably mention that the Activity Local Stack does not
contain the return address and architectural information saved as part
of crossing protection domains; that information is kept in a separate
stack that is only accessible by threads running with either the
Exec's key or the master key.

Ivan Godard

Jul 5, 2021, 7:08:46 PM

Yes, dual threaded stacks is the only viable solution for ROP-style and
stack smash attacks that we have found; Mitch is wrestling with that
right now in My66. I didn't know that OS2200 had it; are you aware of
any others?

We keep the architectural state for normal calls in the second stack
too, as well as that of cross-domain calls. It makes the hardware
handling simpler and adds additional protection versus accident and attack.
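
To make the second-stack idea concrete, here is a toy Python model. It assumes a machine where return addresses live on a control stack that only call and return touch; `Machine`, `smash`, and the addresses are hypothetical and do not describe the Mill, OS2200, or My66 implementations.

```python
class Machine:
    """Toy machine with a data stack and a separate control stack."""

    def __init__(self):
        self.data_stack = []      # locals; writable by running code
        self._control_stack = []  # return addresses; touched only by call/ret

    def call(self, return_address, locals_size):
        self._control_stack.append(return_address)
        self.data_stack.extend([0] * locals_size)

    def smash(self, value):
        """Simulate an overrun that overwrites every data-stack slot."""
        for i in range(len(self.data_stack)):
            self.data_stack[i] = value

    def ret(self, locals_size):
        # Drop the callee's locals, then resume at the saved address.
        del self.data_stack[len(self.data_stack) - locals_size:]
        return self._control_stack.pop()


m = Machine()
m.call(return_address=0x1000, locals_size=4)
m.smash(0xDEADBEEF)               # attacker scribbles over all locals
resumed_at = m.ret(locals_size=4)  # still the address saved by call()
```

Because the overrun can only reach the data stack, the return still resumes at the address saved by `call`, which is the property that defeats ROP-style and stack-smash attacks in this model.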

David W Schroth

Jul 5, 2021, 11:42:44 PM

On Mon, 5 Jul 2021 16:08:42 -0700, Ivan Godard

I'm not aware of any others, but I could be considered to have lived a
somewhat sheltered life when it comes to computer architecture - I
know the one I work on/with in some depth, and otherwise all I know is
what I read...

That largely matches what we do in OS2200. For compatibility with
programs written for earlier 1100s, we have instructions that jump to
the target address while saving the return address in a register or
(ugh) memory. I believe code generated by current compilers always
uses the instructions that save architectural state in the return
stack entry.