
[ANN] AMD Advanced Synchronization Facility: HTM


Dmitriy V'jukov

Apr 28, 2009, 3:29:36 PM
It has finally happened! An HTM proposal for x86 from AMD!
Main page:
http://developer.amd.com/cpu/ASF/Pages/default.aspx

AMD "Advanced Synchronization Facility" Proposal

The AMD Advanced Synchronization Facility (ASF) is an experimental
instruction set extension for the AMD64 architecture that would
provide new capabilities for efficient synchronization of access to
shared data in highly multithreaded applications as well as operating
system kernels. ASF provides a means for software to inspect and
update multiple shared memory locations atomically without having to
rely on locks for mutual exclusion. It is intended to facilitate lock-
free programming for highly concurrent shared data structures,
allowing more complex and higher performance manipulation of such
structures than is practical with traditional techniques based on
compare-swap instructions such as CMPXCHG16B. ASF code can also
interoperate with lock-based code, or with Software Transactional
Memory.

Some basic usage examples of ASF are provided in the specification.
However, we expect the programming community could readily use the
power and flexibility of ASF to implement very sophisticated, robust
and innovative concurrent data structure algorithms, and we encourage
such experimentation. AMD will be releasing a simulation framework in
the near future to facilitate this.

AMD is releasing this proposal to encourage the parallel programming
community to review and comment on it. Such input will help shape the
ultimate direction of this feature, so that it may best serve the
needs of advanced parallel application developers.

--
Dmitriy V'jukov

Chris M. Thomasson

Apr 28, 2009, 4:11:29 PM
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:6480162b-b11a-4b07...@d2g2000pra.googlegroups.com...

> It has finally happened! An HTM proposal for x86 from AMD!
> Main page:
> http://developer.amd.com/cpu/ASF/Pages/default.aspx
>
> AMD "Advanced Synchronization Facility" Proposal
>
> [...]

FWIW, here is previous discussion on this:

http://groups.google.com/group/comp.arch/browse_frm/thread/68a9fda9386048d7

I think they need to clarify one tiny little point in section 7.2.1. They
basically write that one cannot remove multiple objects from a LIFO without
ASF. This is not true: one can remove all items with a wait-free atomic
swap, and if the LIFO contains more than one item, that removes
multiple items in a single operation. Also, it would be nice if they mentioned
some other solutions to the ABA problem, and memory-reclamation issues.
Anyway, IMO, AMD would be crazy not to implement this in their upcoming
processors. Well done.


I would really love to benchmark this against existing solutions. I want to
check the performance of a standard non-blocking LIFO vs the ASF LIFO. That
would be interesting.

AMD is going in the right direction!

Bernd Paysan

Apr 29, 2009, 4:52:42 AM
Dmitriy V'jukov wrote:
> AMD is releasing this proposal to encourage the parallel programming
> community to review and comment on it. Such input will help shape the
> ultimate direction of this feature, so that it may best serve the
> needs of advanced parallel application developers.

I've done some highly parallel code in the recent past, in the field of
cryptography. The code is screaming for some really lightweight
synchronization, but I don't find what I'm looking for in the AMD proposal.
What AMD basically proposes is a transaction-based system, i.e. one where
independent threads can update common data structures without stepping on
each other's toes - by aborting and retrying the whole transaction when
someone else modifies the same memory locations in between. This can be
useful, but it is not the only case.

What I need is rather a way to synchronize all threads upon completion of
their work. Something like Occam's "fork ... join" control structure.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Anthony Williams

Apr 29, 2009, 5:48:10 AM
"Dmitriy V'jukov" <dvy...@gmail.com> writes:

> It has finally happened! An HTM proposal for x86 from AMD!
> Main page:
> http://developer.amd.com/cpu/ASF/Pages/default.aspx

Though I can see why they've limited it to cache lines, I can also see
that this will make it hard to use in practice because the cache line
size is not available at compile time. For example, it means that
separate protected objects must be at least one cache line apart in
memory, otherwise accesses to one will clash with accesses to the
other. On some CPUs, 64 byte spacing is sufficient, but on others
128-byte or 256-byte spacing might be required to avoid false sharing.

Anthony
--
Anthony Williams
Author of C++ Concurrency in Action | http://www.manning.com/williams
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Just Software Solutions Ltd, Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK

Chris Dodd

Apr 29, 2009, 2:43:09 PM
"Chris M. Thomasson" <n...@spam.invalid> wrote in
news:C2PJl.99218$GU6....@newsfe09.iad:

> "Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
> news:6480162b-b11a-4b07...@d2g2000pra.googlegroups.com...
>> It has finally happened! An HTM proposal for x86 from AMD!
>> Main page:
>> http://developer.amd.com/cpu/ASF/Pages/default.aspx
>>
>> AMD "Advanced Synchronization Facility" Proposal
>>
>> [...]
>
> FWIW, here is previous discussion on this:
>
> http://groups.google.com/group/comp.arch/browse_frm/thread/68a9fda9386048
> d7

Looking at the contention table in section 6.2.1, I can't see how this can
possibly make forward progress in the presence of contention. Any time you
have two threads contending on the same location(s) for a SPECULATE section,
they'll just repeatedly abort each other. In order for forward progress to
be guaranteed, at least some of the 'Abort B' entries in the upper part of the
table need to be changed to 'Abort A'. What am I missing?

Chris Dodd
cd...@acm.org

EricP

Apr 29, 2009, 2:42:18 PM

I raised this same question in the original discussion back in Aug 2008.

http://groups.google.com/group/comp.arch/browse_thread/thread/cedc2e76568ab0c5/934e3485d711d4a0?hl=en#934e3485d711d4a0

The key is that the cache-coherency bus protocol is enhanced
to allow cache-line read requests to be NAK'ed.
That allows one cpu to gather its update set together,
while all the others fail and replay-trap back to the ACQUIRE.
There is also a guaranteed minimum capacity of 4 cache lines,
so gathering the set is achievable.

My concern is that *any read* operation by another cpu on a
protected location triggers an abort in the lock holder.
That might prove to be a tricky limitation to work
within... needs more thought.

Eric


EricP

Apr 29, 2009, 3:04:23 PM

I should have said any read on a modified protected location.
That could allow a sufficient number of readers to block
all writers.

It might be nicer if a writer could set a guard flag on the
cache line so that readers could see an update was
in progress and spin-wait without disrupting the transaction.

Eric

MitchAlsup

Apr 29, 2009, 4:05:52 PM
Well, hooray, my invention has finally seen the light of day.

AMD have diddled with the instruction ordering (and the spelling of the
instruction names) but most of the original intent remains.

As far as I can tell, they have lost the ability to beat the big-O
(n**2) problem in DCAS-based non-blocking algorithms. This is shown in
table 6.2.1: any downgrade of cache-line status aborts the CPU that
has potentially made the most progress. When I looked into this, this
kind of scenario leads to livelock under certain timing arrangements
between CPUs/cores/nodes.

So: Chris Dodd is correct in his observation.

Eric writes:
"The key is that the cache coherency bus protocol is enhanced
to allow cache line read requests to be NAK'ed. "

Yes, indeed, that WAS the key; however, table 6.2.1 indicates this NAK
is not present in this version of ASF. One can only hope that it
is restored as time marches on.

"It might be nicer if a writer could set a guard flag on the
cache line so that readers would could see an update was
in progress and spin wait without disrupting the transaction. "

That remains possible, for example by setting a 'LOCK' in advance of
the data structures being manipulated (like in the queue header).
However, at this point it's time for real software writers to delve in
and see what works, what is idiosyncratic, and what needs work.

DCAS, TriCAS, and QuadCAS are certainly possible, with 'references' as
big as 64 bytes each. That is, a reference can contain a pointer, a
timestamp, a cpuID, and ancillary information that prevents the A-B-A
synchronization problem.

Mitch

Chris Dodd

Apr 29, 2009, 7:55:46 PM
EricP <ThatWould...@thevillage.com> wrote in
news:d5a90$49f89fc7$45c49ea8$16...@TEKSAVVY.COM:

> Chris Dodd wrote:
>> Looking at the contention table in section 6.2.1, I can't see how this
>> can possibly make forward progress in the presence of contention. Any
>> time you have two threads contending on the same location(s) for a
>> SPECULATE section, they'll just repeatedly abort each other. In order
>> for forward progress to be guaranteed, at least some of the 'Abort B'
>> entries in the upper part of the table need to be changed to 'Abort A'.
>

> I raised this same question in the original discussion back in Aug 2008.
>
> http://groups.google.com/group/comp.arch/browse_thread/thread/cedc2e76568
> ab0c5/934e3485d711d4a0?hl=en#934e3485d711d4a0
>
> The key is that the cache coherency bus protocol is enhanced
> to allow cache line read requests to be NAK'ed.

But that's precisely my point -- the table in 6.2.1 says NAKs never occur
in the protocol. If they did, they'd show up as 'Abort A' in the table.
No 'Abort A' means no NAKs in the protocol, and without them you'll get
livelock.

Chris Dodd
cd...@acm.org

Dmitriy Vyukov

Apr 29, 2009, 7:57:49 AM


AFAIS, ASF allows an efficient implementation of the join (atomic
counting).

--
Dmitriy V'jukov

Dmitriy Vyukov

Apr 29, 2009, 7:56:01 AM
On Apr 29, 1:48 pm, Anthony Williams <anthony....@gmail.com> wrote:

> "Dmitriy V'jukov" <dvyu...@gmail.com> writes:
> > It has finally happened! An HTM proposal for x86 from AMD!
> > Main page:
> >http://developer.amd.com/cpu/ASF/Pages/default.aspx
>
> Though I can see why they've limited it to cache lines, I can also see
> that this will make it hard to use in practice because the cache line
> size is not available at compile time. For example, it means that
> separate protected objects must be at least one cache line apart in
> memory, otherwise accesses to one will clash with accesses to the
> other. On some CPUs, 64 byte spacing is sufficient, but on others
> 128-byte or 256-byte spacing might be required to avoid false sharing.


Well, I think for now we may just hardcode 64 bytes. In the worst
case we will get more falls back to the slow path (mutex-based or
something), which must be present anyway.
Also, it will probably be enough to just tune the allocator for the
cache-line size, which is not so hard to do in C/C++.
If it becomes a problem anyway, one is always free to ship the
sources and a compiler in the installer and compile on the target
machine :)
Another way would be to parametrize the whole core of the
library/application by the cache-line size as a template parameter,
and choose the right core at startup (so that there are no further
overheads).


--
Dmitriy V'jukov

Dmitriy Vyukov

Apr 29, 2009, 2:16:17 PM
On Apr 29, 10:43 pm, Chris Dodd <cd...@acm.org> wrote:

> Looking at the contention table in section 6.2.1, I can't see how this can
> possibly make forward progress in the presence of contention. Any time you
> have two threads contending on the same location(s) for a SPECULATE section,
> they'll just repeatedly abort each other. In order for forward progress to
> be guaranteed, at least some of the 'Abort B' entries in the upper part of
> the table need to be changed to 'Abort A'. What am I missing?


Contention must be handled at higher levels (in software). If an operation
encounters repeated aborts due to contention, software may employ
(exponential) back-off. If aborts continue to happen, software may
fall back to mutex-based synchronization.


--
Dmitriy V'jukov

EricP

Apr 29, 2009, 11:47:56 PM

Oh yeah, you are right. I was thinking it was just
a minor variation on the original, but it is not.
It should be A aborting if B already has it Owned,
at least as a first cut.

Eric

Mayan Moudgill

Apr 30, 2009, 9:28:33 AM
Dmitriy V'jukov wrote:
> It has finally happened! An HTM proposal for x86 from AMD!
> Main page:
> http://developer.amd.com/cpu/ASF/Pages/default.aspx
>
> AMD "Advanced Synchronization Facility" Proposal
>

Interesting. While I understand the motivations, I'm not sure I like the
current architecture.

They appear to have mated 4 ideas:
1. Do (strong) load-link/store-conditional with multiple link addresses
2. Do all the store-conditionals atomically
3. If the block store-conditional fails, partially restore modified
state to a checkpointed position.
4. Fail to checkpointed position as soon as it is known that a
reservation has been violated.

Now, based on their initial motivation (atomic updates to various
regions), why don't they:
- lock: lock an address:
- force line to E (writing back any modified lines)
- may always succeed or
- may fail, returning a success/failure indication.
- while a cache-line is locked, any requests to that cache line get deferred
- invalidate: flush the locked cache lines.
- force cache line from M to I.
- unlock: remove locks on all locked lines
- can augment the ALL variant with per address variants.
- an unlock can fail: in this case unlock = invalidate
- unlock returns success/failure

Under this proposal, DCAS becomes (pseudocode, using the lock/unlock
primitives above):

retry:
    lock A
    lock B
    vA = *A;
    vB = *B;
    if (vA == cA && vB == cB) {
        *A = uA;
        *B = uB;
        swap = 1;
    } else {
        swap = 0;
    }
    if (unlock() == failed) goto retry;

It is intended that if a process is interrupted while holding at least
one reservation, then an invalidate will be executed, and all subsequent
locks by that process will have no effect till an unlock is executed.
The unlock will fail. This can be implemented using two bits of
architected state, RESERVATION_HELD and RESERVATION_FAILED. The
RESERVATION_HELD bit is set when a lock succeeds; the RESERVATION_FAILED
bit is set whenever the RESERVATION_HELD bit is set and an interrupt
occurs. A lock or unlock with the RESERVATION_FAILED bit set fails. An
unlock clears both RESERVATION_HELD and RESERVATION_FAILED bits.

A straightforward implementation would add an extra state to the cache;
converting MESI to SMILE. The extra transitions are:
* lock: force to E then to L (will cause writeback if M)
* unlock: L->M
* invalidate: L->I
Alternatively, add another state U (=modified,locked). In that case
* on store: L->U, U->U
* unlock: L->E, U-> M
* invalidate: L->E U->I
This has its pros and cons. The biggest drawback is that the cache state
logic has to be able to update multiple entries on unlock and invalidate
at once. Another negative is that to guarantee 4 lockable cache-lines,
one must have at least 5-way associative caches, or a regular cache
paired with a smaller but higher associativity cache (e.g. victim cache).

MitchAlsup

Apr 30, 2009, 1:58:59 PM
On Apr 30, 8:28 am, Mayan Moudgill <ma...@bestweb.net> wrote:
> They appear to have mated 4 ideas:

Actually, the original intent was "How do we give Microsoft the DCAS
they want?". Which degrades into: how do we guarantee in the processor
that multiple cache lines stay co-resident for several-to-many cycles,
allow SW to express a number of manipulations on those cache
lines, and make it all appear to be ATOMIC. Having it smell a little like
LL and SC was completely coincidental. We did not recognize until
significantly later that LL and SC could be easily synthesized from
the primitives we created -- we were targeting more DCAS-like stuff than
LL-SC stuff.

> 1. Do (strong) load-link/store-conditional with multiple link addresses

The stores are not conditional all by themselves; SW has to create the
conditionality around the stores using compares and conditional
branches.

> 2. Do all the store-conditionals atomically

All locked stores are made to appear to have been performed at one
instant in time. Instant before-> none visible, Instant after-> all
visible.

> 3. If the block store-conditional fails, partially restore modified
> state to a checkpointed position.

There are all sorts of other failure modes, including spurious ones.
HW is guaranteed to restore 100% of the protected state; HW does not
guarantee to restore or leave alone the unprotected state. HW does
guarantee that unprotected state remains visible to third-party
observers in causal memory order.

> 4. Fail to checkpointed position as soon as it is known that a
> reservation has been violated.

Falling back to the checkpoint may not happen "as soon as known", especially
if you are thinking about the architecture as presented by the debugger
to the SW person. Failure will be detected before COMMIT.

> Now, based on their initial motivation (atomic updates to various
> regions), why don't they:
> - lock: lock an address:
>         - force line to E (writing back any modified lines)
>         - may always succeed or
>         - may fail, returning a success/failure indication.
> - while a cache-line is locked, any requests to that cache line get deferred

This leads to unsolvable deadlock/livelock issues -- depending on the
protocol on the interconnect(s). And it creates massive headaches for
interrupts, and more than a few issues for exceptions.

> - invalidate: flush the locked cache lines.
>         - force cache line from M to I.
> - unlock: remove locks on all locked lines
>         - can augment the ALL variant with per address variants.
>         - an unlock can fail: in this case unlock = invalidate
>         - unlock returns success/failure

<snip>

> A straight forward implementation would add an extra state to the cache;
> converting MESI to SMILE. The extra transitions are:
>         * lock: force to E then to L (will cause writeback if M)
>         * unlock:  L->M
>         * invalidate: L->I
> Alternatively, add another state U (=modified,locked). In that case
>         * on store: L->U, U->U
>         * unlock: L->E, U-> M
>         * invalidate: L->E U->I
> This has its pros and cons. The biggest drawback is that the cache state
> logic has to be able to update multiple entries on unlock and invalidate
> at once.

This requires a search (directed or not) through the cache to modify
the states after successful conclusion. See below.

> Another negative is that to guarantee 4 lockable cache-lines,
> one must have at least 5-way associative caches, or a regular cache
> paired with a smaller but higher associativity cache (e.g. victim cache).

ASF does this whole thing in the memory access (miss) buffers** and
can do the whole visibility thing in one clock. Nor does ASF have to do
anything other than make it all visible (no flushes or cache-state
changes, ...). This makes it a lot easier on the HW. And it is not
associated with the associativity of the cache in any way, shape or
form. In fact, ASF works even if there is no cache in the whole
system.

(**) this is where the limitation on the number of protected lines
comes from. Even a processor without a cache will have miss buffers.

----------------------------------------------

Back when I was working on this, we were intending to support 8
protected lines (with 7 as a backup). There are useful atomic
primitives that use 5 protected lines (moving a concurrent data
structure (CDS) element from one place in a CDS to another in a single
atomic event.)

And here is where I *think* this current ASF architecture is *screwed
up*. This atomic move basically requires finding the element to be
moved (a many-multi-cycle process), then finding a place to put it
(another many-multi-cycle process), and then doing the move atomically.
The original ASF allowed one to lock the protected cache line(s) as
they got found, before entering the atomic region. At entry to the
atomic region, if the protected cache lines have been downgraded from
the cache, the entry fails; otherwise a few SW checks are performed,
and then the atomic operations are performed.

This current ASF requires either that the atomic region be very much
longer (around both searches), or that each search move a bunch of
state to a thread-safe memory buffer that will be compared later
against the current state(s), with all that state then compared in the
atomic region before modifying CDS state atomically. By deferring
entry into the atomic phase until after SW has determined which lines
are participating, but leaving HW watching for interference in the
meantime, the duration of the atomic region is made significantly
smaller.

In addition, in the original proposal, SW got an indicator of how much
contention was going on (and whether the failure was a spurious
issue), and SW could use same to modulate what to do on failures (try
again immediately, exponential back-off, context switch, become a
helper for another atomic event, ...).

While this ASF may be a step forward in atomic stuff, it is
significantly less powerful than the previous version described in the
previous thread.

So, it's nice to have it see the light of day. It is shameful to have
lost so much power and expressivity along the way. It is truly sad
that ASF lost the ability to attack the big-O(n**2) problem along the
way.

Mitch

(*) But I recognize that my opinion on this is completely irrelevant
at the same time.

EricP

Apr 30, 2009, 5:38:42 PM
MitchAlsup wrote:
>
> And here is where I* think this current ASF architecture is screwed
> up*. This atomic move basically requires finding the element to be
> moved (a many-multi-cycle process) and then finding a place to put it
> (another many-multi-cycle process) and then doing the move atomically.
> The original ASF allowed one to lock the protected cache line(s) as
> they got found before entering the atomic region. At entry to the
> atomic region, if the protected cache lines have been downgraded from
> the cache, the entry fails, otherwise a few SW checks are performed,
> and then the atomic operations are performed.
>
> This current ASF requires either that the atomic region be very much
> longer (around both searches) or each search has to move a bunch of
> state to a thread -safe memory buffer that will be compared later
> agains the current state(s), and then compare all that state in teh
> atomic region before modifying CDS state atomically. By defering the
> entry into the atomic phase until after SW has determined which lines
> are participating, but leaving HW watching for interference in the
> meantime, the duration of the atomic region is made signficantly
> smaller.

The current ASF would also seem to preclude hardware
accelerators, as it never forms an update "set" until the commit.
I rather liked that idea, given the wealth of transistors available,
and I think it would have given a distinct competitive advantage
in the multi-core arena.

> So, its nice to have it see the light of day, It is shameful to have
> lost so much power and expressivity along the way. It is truley sad
> that ASF lost the ability to attack the BigO(n**2) problem along the
> way.

I assume you mean the N**2 cost of collision-retry loops.
Yeah... and I don't like their unbounded queue delays.

But there may be other ways around it (I was thinking about this
lying in bed this morning). For example, back off to some FIFO
queueing mechanism (event flag in user mode, spin FIFO in kernel).
Set a flag that re-routes everyone through a slow path.

***But*** you don't have to sync with absolutely everybody, because
if someone is still walking the data structure then the worst that
happens is they trigger your abort and retry. Which means you may
avoid incrementing and decrementing counters most of the time.

In other words, it can allow the algorithms to be somewhat sloppy
WRT synchronization, and in doing so be much more efficient.

Eric


Mayan Moudgill

Apr 30, 2009, 10:38:46 PM
MitchAlsup wrote:
> On Apr 30, 8:28 am, Mayan Moudgill <ma...@bestweb.net> wrote:
>
>>- while a cache-line is locked, any requests to that cache line get deferred
>
>
> This leads to unsolvable deadlock/livelock issues--depending on the
> protocol on the interconnect(s). And created massive headaches for
> interrupts, and more that a few issues for exceptions.

1) The abort-on-probe in ASF can lead to live-lock.
2) If the interconnect does NOT allow deferred access to cache-lines,
then the ASF proposal does not work either; the ASF has to defer access
whilst committing the locked regions to the cache.
3) The mechanism described specifically allows interrupts and exceptions
to be taken (and reservations to fail)
4) The mechanism described allows the h/w to implement any failure
policy, including abort-on-probe.


>
> This requires a search (directed or not) through the cache to modify
> the states after successful conclusion. See below.

Not really; what it requires is easy to do _IF_ you've implemented your
cache state using flops instead of SRAMs. In that case, there will be
an extra three control lines to each cell (EtoL, LtoI, LtoE) and probably
4-6 gates. The area and speed cost is insignificant (the high fanout of
the LtoI and LtoE controls is compensated for by the fact that there is
no address decode; it goes to ALL cells).

If you've implemented the states as SRAMs, then it depends on whether
they are implemented as x2 SRAMs (just MESI) or xN SRAMs (MESI+LRU
state, N dependent on LRU algorithm) or some other variant (e.g. M for
sub-cache-line), then there may be a larger area penalty, but it is
still insignificant.

>
>> Another negative is that to guarantee 4 lockable cache-lines,
>>one must have at least 5-way associative caches, or a regular cache
>>paired with a smaller but higher associativity cache (e.g. victim cache).
>
>
> ASF does this whole thing in the memory access (miss) buffers** and
> can do the whole visiblity thing in one clock. Nor does ASF have to do
> anything other than make it all visible (flushes cache state
> changes,...). This makes it a lot esier on the HW. And is not (sic)
> associated with the associativity of the cache in any way shape or
> form. In fact, ASF works even if there is no cache in the whole
> system.
>
> (**) this is where the limitation on the number of protected lines
> comes from. Even a processor without a cache will have miss buffers.

Sorry, there is no requirement for miss buffers at the L1. Consider a
write-through L1 with an allocate-on-load policy (i.e. store misses do
not cause L1 line fetches), backed by an on-chip L2. Every load miss is
sent to the L2 along with the control information needed to
complete the load (e.g. the identity of the load causing the miss or the LQ
entry). Every cache refill is sent to the L1 by the L2 along with the
associated control information. Any load-miss combining is done at the L2.

BTW: you still have two copies of the state: old and
updated-but-not-committed. Unless your miss-buffer is different than how
I understand it, it only holds the identity of outstanding misses. So,
you have to have some structure to hold the copies. Is this the
write-buffers, or some other buffers? And on a successful commit or
invalidate, the cache and this buffer have to be reconciled.


> ----------------------------------------------
>
> Back when I was working on this, we were intending to support 8
> protected lines (with 7 as a back up). There are useful atomic
> primatives that use 5 protected lines (moving a concurrent data
> structure (CDS) element from one place in a CDS to another in a single
> atomic event.)

Did you talk to any processor designers when you did this?

Terje Mathisen

May 1, 2009, 3:13:52 AM
Mayan Moudgill wrote:

> MitchAlsup wrote:
>> Back when I was working on this, we were intending to support 8
>> protected lines (with 7 as a backup). There are useful atomic
>> primitives that use 5 protected lines (moving a concurrent data
>> structure (CDS) element from one place in a CDS to another in a single
>> atomic event.)
>
> Did you talk to any processor designers when you did this?

Are you serious?

From LinkedIn:


Mitch Alsup's Experience

*
Architect
AMD

(Public Company; Mechanical or Industrial Engineering industry)

1999 – 2007 (8 years)
*
Chief Architect
Ross Technology

(Public Company; Mechanical or Industrial Engineering industry)

1991 – 1998 (7 years)
*
Architect
Motorola SPS

(Mechanical or Industrial Engineering industry)

1983 – 1992 (9 years)
*
Processor Architect
Motorola

(Public Company; Mechanical or Industrial Engineering industry)

1983 – 1991 (8 years)

I sort of have the feeling Mitch has spent way more time than most, even
here at c.arch, talking to processor designers. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Mayan Moudgill

May 1, 2009, 7:20:03 AM
My *sincere* apologies for doubting his abilities. Sorry.

In my defence, I will say that a lot of the objections looked very
academic/tunnel-visioned. In particular, the implication that one needed
a state machine to update the cache tag bits or that a miss buffer is
absolutely required pressed some buttons.

Again, I apologize for doubting Mitch Alsup's credentials. However, I
still think that he is suffering from a little bit of tunnel vision.

EricP

May 1, 2009, 10:02:43 AM
MitchAlsup wrote:
>
> So, its nice to have it see the light of day, It is shameful to have
> lost so much power and expressivity along the way. It is truley sad
> that ASF lost the ability to attack the BigO(n**2) problem along the
> way.

Hmmmm... actually now I'm thinking they may have
screwed the pooch with their changes to the ASF design.

1) Because the new design resolves contention by aborting the owner,
it means even an update to a single cache line can livelock.
So even a simple DCAS has to deal with the possibility.

The NAK could fix that but is insufficient for multiple lines.

2) There is no guaranteed forward progress, so every usage
must deal with livelock, and ultimately make
a system call to block the thread to guarantee progress
(after trying various tricks to avoid doing so).
This makes every usage more complex and more expensive.

3) The new design does not collect a set of protected lines
and verify them first, but rather does it laissez-faire,
so there is no formal point at which a go/no-go decision is made,
other than the final commit.

So to upgrade the design and ensure there is forward progress
requires making COMMIT do what ACQUIRE did before.
(And there can be conflicting transactions, with cpu 1 owning
line A and wanting B, and cpu 2 owning B and wanting A,
except with 4 or more cache lines and/or cpus.
To guarantee a winner you must know the whole update set.)

But moving update-set dispute resolution to COMMIT means you must
change LOCK MOVx (store) instructions to not actually transfer line
ownership until the commit, but then that breaks A-B-A detection,
which uses ownership transfer to do this detection.

So it may be that the livelock issue is unfixable,
or vastly more complex to fix.

Eric

Dmitriy Vyukov

May 1, 2009, 10:54:30 AM
On May 1, 7:02 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:

> So it may be that livelock issue is unfixable,
> or vastly more complex.


I also think that 100% live-lock prevention, even if possible, is not
feasible to do in hardware (I'm not a hardware guy!). And here is
another point: the ultimate goal must be not only the absence of
live-locks but also reasonable performance (a low abort rate). What
sense would a hardware implementation make if it guaranteed the absence
of live-locks but still provided crappy performance? It would still
have to be fixed by software (the same back-offs, etc). So personally
for me it's OK that the specification is somewhat amenable to
live-locks; anyway, I don't believe that a heavily update-contended
data structure can have reasonable performance and scalability.
However, as for the formal specification, they probably must at least
allow NAKs. Future implementations of ASF can then include more and
more intelligent logic for live-lock prevention (and for unnecessary
aborts in general).


--
Dmitriy V'jukov

Dmitriy Vyukov

unread,
May 1, 2009, 11:04:19 AM5/1/09
to
On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> Well, hooray, my invention has finally seen the light of day.

I wanted to ask some question on AMD forum since they are asking for
feedback. But since Mitch is here I will ask here.

Table 6.2.1 postulates that if CPU B holds a cache line in the
Protected Owned state, and CPU A makes a plain (non-transactional)
read, then CPU B aborts.
Why is it necessary to abort CPU B? I can imagine an implementation
which would satisfy the read from CPU A with the old value (the one
that was current before CPU B started its transaction), and not abort
CPU B. I.e., to software it would look as if the read on CPU A simply
happened-before the transaction on CPU B. At first glance this does
not conflict with transactional semantics, while allowing more
parallelism (fewer aborts). Such an implementation would use a
separate buffer for transactional writes (not the cache).


--
Dmitriy V'jukov

Dmitriy Vyukov

unread,
May 1, 2009, 11:13:21 AM5/1/09
to
On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> Well, hooray, my invention has finally seen the light of day.

Another question:
Why is it necessary to use LOCK MOV? Why can't all reads and writes
inside the transactional region be treated as transactional? That is
the approach taken by Sun's HTM in Rock. On one hand AMD's approach is
more flexible (I may choose not to include some reads/writes in the
transaction); on the other hand, with Sun's approach I may do the
following thing.

Assume I have a large mutex-based code base:

LOCK(x);
some_c_or_cpp_code_here;
UNLOCK(x);

I may replace it with:

TRANSACTIONAL_LOCK(x);
some_c_or_cpp_code_here;
TRANSACTIONAL_UNLOCK(x);

where TRANSACTIONAL_LOCK/TRANSACTIONAL_UNLOCK first tries to use a
transaction to execute the code, then after several attempts falls
back to the mutex (so-called transactional lock elision).
Sun's work shows that transactional lock elision is extremely easy to
apply to a large code base and does give substantial performance
improvements.
With AMD's approach (LOCK MOV) I would have to rewrite all the code in
asm... not very cool...
As for the additional flexibility, well, I don't see many use cases
for non-transactional reads/writes inside a transaction; anyway, we
want to make the transaction as slim as possible...
So what is the rationale behind explicit LOCK MOVs?
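The lock-elision wrapper described above might be sketched as follows. `htm_begin`/`htm_commit` are hypothetical intrinsics for a modal, Rock-style HTM (stubbed here to always report an abort, so the fallback path is what actually runs); the toy mutex likewise just stands in for a real lock.

```c
#include <assert.h>

/* Hypothetical modal-HTM interface (Rock-style: ordinary loads/stores
 * inside the region are transactional). Stubbed to always report
 * "abort" so the wrapper demonstrates its mutex fallback path. */
static int  htm_begin(void)  { return 0; }  /* 1 = speculating, 0 = abort */
static void htm_commit(void) { }

/* Toy mutex standing in for a real lock. */
typedef struct { volatile int held; } mutex_t;
static void mutex_lock(mutex_t *m)   { while (m->held) { } m->held = 1; }
static void mutex_unlock(mutex_t *m) { m->held = 0; }

enum { ELISION_ATTEMPTS = 3 };

/* Returns 1 if the critical section runs speculatively (lock elided),
 * 0 if we had to take the lock for real. The caller passes the result
 * to transactional_unlock(). */
int transactional_lock(mutex_t *m)
{
    for (int i = 0; i < ELISION_ATTEMPTS; i++) {
        if (htm_begin()) {
            /* Reading m->held puts the lock word in our read-set, so a
             * real acquisition by another thread aborts us. */
            if (m->held) continue;  /* real HTM: explicit abort here */
            return 1;               /* run the body without the lock */
        }
    }
    mutex_lock(m);  /* too many aborts: fall back to the plain mutex */
    return 0;
}

void transactional_unlock(mutex_t *m, int elided)
{
    if (elided) htm_commit();
    else        mutex_unlock(m);
}
```

The point of the question above is that this wrapper only works if the unmodified C/C++ body between lock and unlock is transactional by default; with explicit LOCK MOVs the body itself would have to be rewritten.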

--
Dmitriy V'jukov

EricP

unread,
May 1, 2009, 11:39:05 AM5/1/09
to

But as a minimum they must at least ensure that single cache line
operations that don't take "too long" are guaranteed to succeed.
I.e. the same rules that Alpha has for LL/SC.
This doesn't necessarily mean NAKs, just holding onto
a locked cache line for some small clock window to give
the commit a chance to occur.

Eric

Dmitriy Vyukov

unread,
May 1, 2009, 12:13:48 PM5/1/09
to


Agreed. The CPU may just delay the ACK.
It would be beneficial for the formal specification to support
different implementations. [Very] hopefully ASF will be supported by
Intel too; who knows what approach they will take.


--
Dmitriy V'jukov

MitchAlsup

unread,
May 1, 2009, 12:30:50 PM5/1/09
to
On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> Why it's necessary to use LOCK MOV, why all reads and writes inside
> transactional region may not be treated as transactional?

A good question.

But consider that one must end up dereferencing memory references to
set up the cache lines for the critical section. And notice that there
is a strict limit to the number of protected cache lines. Thus, if
there is any additional level of indirection, you run out of
protectable resources.

Secondly, in order to debug these critical regions, one must build a
thread-safe memory buffer and dump critical-region state into same so
that you can print it out (or otherwise examine it) after success or
failure. You really (REALLY) do not want these to be protected and
rolled back on failure.

Thus, the limited number of protectable lines, and the absolute need
to be able to figure out what went on inside the critical region,
prevent treating everything as protected.

In my opinion, this is a good thing, a very good thing.
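The debugging argument can be made concrete with a small sketch: trace records emitted from inside a speculative region via (hypothetical) unprotected stores land in a buffer that survives an abort, so there is something left to examine afterwards. Everything here is illustrative; in real ASF the appends would be the explicitly non-protected stores.

```c
#include <assert.h>
#include <string.h>

/* Per-thread trace buffer. In a real ASF critical region these appends
 * would use unprotected stores, so they are NOT rolled back when the
 * speculative region aborts. */
enum { TRACE_CAP = 64 };
static const char *trace[TRACE_CAP];
static int trace_len = 0;

static void trace_log(const char *msg)
{
    if (trace_len < TRACE_CAP)
        trace[trace_len++] = msg;
}

/* Sketch of an instrumented critical region: it logs its progress,
 * then (in this stub) "aborts". Protected state would be rolled back;
 * the trace, written with unprotected stores, is not. */
int run_critical_region(void)
{
    trace_log("enter");
    trace_log("read line A");
    trace_log("conflict on line A");   /* the reason we are aborting */
    return 0;                          /* simulated abort */
}
```

If the trace stores were protected, an abort would erase exactly the evidence you need to diagnose it.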

Mitch

MitchAlsup

unread,
May 1, 2009, 12:43:59 PM5/1/09
to
On May 1, 6:20 am, Mayan Moudgill <ma...@bestweb.net> wrote:
> Again, I apologize for doubting Mitch Alsup's credentials. However, I
> still think that he is suffering from a little bit of tunnel vision.

For the record, I thank Terje for his defense and would like to
indicate that I take no offense upon Myan.

This whole synchronization philosophy was developed from the hardware
level (make the HW work first); then we figured out how to program it
(present a reasonable abstraction to the SW levels). It is 'different'
because we ultimately felt that the current directions {DCAS, xxTM}
were lacking at various points in their (sub)architectures.

I would (happily) submit that the field of synchronization has so many
mind-numbing details that all paths to any potential solutions have a
certain amount of tunnel vision.

However, the more original ASF proposal was the first (that I know of)
that showed a way to attack the BigO(n**2) problem. This problem has
to do with the cache coherence protocols and how memory systems are
architected with performance towards the 'normal' access patterns and
a near-complete blind eye to synchronization access patterns (repeated
access to hot lines). DCAS (and variants) and Transactional Memory
have these exact same problems. The problem is not DCAS, nor xxTM, but
inside the cache hierarchy and memory systems. And we went into
significant detail in the aforementioned earlier thread.

ASF evolved over 2-2.5 years (while I was there) with insight from the
best architects within AMD and from selected AMD partners who gave
useful and illuminating input into what they might have liked and what
they did not. I'm pretty sure that this evolution did not stop after I
left.

Mitch

MitchAlsup

unread,
May 1, 2009, 12:47:01 PM5/1/09
to
On Apr 30, 4:38 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> The current ASF would also seem to preclude the hardware
> accelerators as it never forms an update "set" until the commit.
> I rather liked that idea, given the wealth of transistors available,
> and I think it would have given a distinct competitive advantage
> in the multi-core arena.

I saw nothing in the specification that would have precluded an ASF
implementation from modifying the cache at the normal pipeline
(LOCKed) store timing, with the caveat that the premodified data be
moved to the miss data buffers where it can be recovered upon failure.
Thus success can be made fast, penalizing failures; instead of the
other way around.

Mitch

MitchAlsup

unread,
May 1, 2009, 12:58:49 PM5/1/09
to
On Apr 30, 4:38 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> I assume you mean the N**2 cost of collision-retry loops.
> Yeah... and I don't like their unbounded queue delays.

N**2 is the amount of memory traffic needed for N processor threads
accessing (for/with write permission) a single cache line and all N of
these processors agreeing on who "got" that line. It's simple cache
coherence at play/work.

Software programming practices and SW synchronization architectures
can make it worse than this.

Mitch

MitchAlsup

unread,
May 1, 2009, 1:19:55 PM5/1/09
to
On Apr 30, 9:38 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> MitchAlsup wrote:
> > On Apr 30, 8:28 am, Mayan Moudgill <ma...@bestweb.net> wrote:
>
> >>- while a cache-line is locked, any requests to that cache line get deferred
>
> > This leads to unsolvable deadlock/livelock issues--depending on the
> > protocol on the interconnect(s). And created massive headaches for
> > interrupts, and more than a few issues for exceptions.
>
> 1) The abort-on-probe in ASF can lead to live-lock.

Agreed. This is why the earlier earlyASF {that I know about} had a HW
mechanism that could be used to prevent live-lock.

> 2) If the interconnect does NOT allow deferred access to cache-lines,
> then the ASF proposal does not work either; the ASF has to defer access
> whilst committing the locked regions to the cache.

This is why I was lobbying for a NAK to be added to the HyperTransport
interconnect fabric, and only for ASF events.

> 3) The mechanism described specifically allows interrupts and exceptions
> to be taken (and reservations to fail)

My point was that unconstrained NAKing makes interrupts really tricky,
whereas tightly controlled NAKing avoids these pitfalls without
losing the value of the NAKing {i.e. preventing livelock} and does
not make interrupts any more (usefully) difficult than they already
are.

The earlyASF that I was associated with set a timeout at the point when
the critical section "became" critical. During this short window of
time, interrupts would be deferred. This time was supposed to be about
2x the worst-case DRAM access. Thus, the critical-region program should
have most of its protected data in its cache, and if it does, then the
timeout would not become visible. If the critical region had lost
several of its lines, there is a good chance that interference would
cause a failure in the critical region anyway.

Thus, one NEEDS an instruction to MARK the transition from collecting
the lines to be processed to the manipulation of the lines in the
ATOMIC event. This is what was lost. In addition, the instruction that
used to do this delivered an integer (+0-), and embedded in this
integer was the success indicator (0) or failure indicators. Negative
failure numbers were assigned to spurious errors (buffer overflows,
table overflows, ...) to allow HW simplification and allow SW to ignore
these and simply try again. Positive numbers were assigned by the
order of interference on a given set of protected lines. Thus one
could attempt to ACQUIRE the lines needed to perform a critical
section, and receive back an indication that 6 other threads have
touched (at least part of) those same lines to be protected. SW was
supposed to use this indicator to do something other than just try
again. Just trying again leads to BigO(N**2) contention. Knowing that
you are 6th in line allows that routine to march down the linked list
and then attempt an access that has a low probability of others
accessing it. This is what allows SW to attack the BigO(N**2) problem.

Consider: a timer goes off and 15 threads jump at the head of the run
queue. With normal synchronization, this (let's just say) takes a long
time. In earlyASF, however, everybody wakes up, jumps at the head of
the queue, is given an interference number, uses this number to access
independent elements on the subsequent attempt, and BigO(N**2) is now
BigO(3). Since timing cannot be guaranteed, BigO(3) degrades to
"about" BigO(log(N)).

However, it appears that this has been lost.
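The interference-number idiom above can be sketched as follows. `acquire_head()` is a stub standing in for the early-ASF ACQUIRE result (0 = success, k > 0 = k interfering threads); here it simply hands out increasing interference numbers, and each caller uses its number to claim an independent element instead of retrying on the head. All names are illustrative.

```c
#include <assert.h>

enum { NELEM = 16 };
static int claimed_by[NELEM];  /* which "thread" claimed each element */
static int interferers = 0;

/* Stub for the early-ASF ACQUIRE result: 0 = you own the head,
 * k > 0 = k other threads touched the protected lines before you. */
static int acquire_head(void)
{
    return interferers++;
}

/* Instead of all N threads hammering element 0 (BigO(N**2) coherence
 * traffic), each thread walks k elements down the list and contends
 * where almost nobody else is. */
int claim_element(int thread_id)
{
    int k = acquire_head();   /* interference number */
    int idx = k % NELEM;      /* march k entries down the list */
    claimed_by[idx] = thread_id;
    return idx;
}
```

With the 15-threads-on-the-run-queue example, each waker lands on a distinct element on its second attempt rather than all re-colliding on the head.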

Mitch

MitchAlsup

unread,
May 1, 2009, 1:26:55 PM5/1/09
to
On Apr 30, 9:38 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> MitchAlsup wrote:
> > This requires a search (directed or not) through the cache to modify
> > the states after successful conclusion. See below.
>
> Not really; what it requires is easy to do _IF_ you've implemented your
> cache state using flops instead of SRAMs.

Look, NOBODY is going to build their cache(s) in register technology;
it's just not dense enough. However, everybody is going to have a number
of miss buffers that ARE built in register technology. Thus, if you can
accept a reasonable limit to the number of protected lines, then the
miss buffers are actually easy to use for this purpose.

Now, you might be able to afford to build all or part of your TAG
array in register technology, and some of this is (and has been) done
(especially for invalidations). But you will never be able to afford the
data portion of the cache to be implemented in register technology.

In any event, the miss buffers already have all the desired
characteristics. They already deal with inbound and outbound data, the
demand address and the victim address, and they are snooped
continuously. So they are the perfect place to add a "little"
functionality. In addition, for all intents and purposes, they ARE a
cache; a cache of memory things in progress that need constant
tending.

Mitch

MitchAlsup

unread,
May 1, 2009, 1:30:45 PM5/1/09
to
On May 1, 10:04 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > Well, hooray, my invention has finally seen the light of day.
>
> I wanted to ask some question on AMD forum since they are asking for
> feedback. But since Mitch is here I will ask here.
>
> Table 6.2.1. postulates that if CPU B holds cache-line in Protected
> Owned state, and CPU A make plain (non-transactional) read, then CPU B
> aborts.
> Why it's necessary to abort CPU B?

You have lost the illusion of atomicity on that cache line. Thus, it's
time to abort and retry (potentially later).

ASF is not guaranteeing atomicity; it is guaranteeing the ILLUSION of
atomicity.

Mitch

MitchAlsup

unread,
May 1, 2009, 1:38:59 PM5/1/09
to
On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> With AMD's approach (LOCK MOV) I will have to rewrite all code in
> asm... not very cool...

AMD has worked with SUN and we jointly came to the conclusion that ASF
would be a powerful tool to make transactional memory work better/
faster.

I think it was Mark Moir that said "Transactional memory is for people
who just want synchronization to work" at which point I said "ASF if
for people who want synchronization to work fast", and we all had a
good chuckle. There is a completely different philosophy here. ASF is
for coders who understand atomicity, synchronization, nonBlocking,
WaitFree, and the intricate techniques involved therein, and are
willing to work for their performance. TM maybe/is for everybody else.

If you like TM write in TM, only if TM is not giving you the
performance you are looking for, then selectively recode those time
critical section in ASF to beat down the costs.

I also think that over 5-ish years SW people will come up with better
abstractions than did we so as to present SW coders with powerful more
easy to use primatives. This is why I spend so much effort building
primatives than solutions (DCAS,TCAS,QCAS).

Mitch

Dmitriy Vyukov

unread,
May 1, 2009, 1:43:32 PM5/1/09
to


Personally I like precise, fine-grained control over the read- and
write-sets (anyway, it's a thing for crazy low-level people); the
possibility to RELEASE a line from the read-set is awesome (especially
for linked lists).
But then maybe just choose different defaults:
LOCK MOV -> MOV
MOV -> NONLOCKED MOV
Consider: non-transactional code inside a transaction must deal with
something special: debugging, contention control, wicked
optimizations, etc. I.e., something which is introduced deliberately
and receives corresponding attention, probably something emitted by an
aware compiler, something which is intended "to appear inside of a
transaction". Basically, something for which it's OK to explicitly add
'UNLOCKED'. Plain synchronization code, on the other hand, is just
"normal" code surrounded with enter/leave_critical_region
(transaction- or mutex-based).
So IMVHO it makes perfect sense to allow normal synchronization code
to use normal MOVs, and at the same time allow system code too (which
would use UNLOCKED MOVs).


--
Dmitriy V'jukov

Dmitriy Vyukov

unread,
May 1, 2009, 1:51:15 PM5/1/09
to
On May 1, 10:30 am, MitchAlsup <MitchAl...@aol.com> wrote:
> On May 1, 10:04 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > > Well, hooray, my invention has finally seen the light of day.
>
> > I wanted to ask some question on AMD forum since they are asking for
> > feedback. But since Mitch is here I will ask here.
>
> > Table 6.2.1. postulates that if CPU B holds cache-line in Protected
> > Owned state, and CPU A make plain (non-transactional) read, then CPU B
> > aborts.
> > Why it's necessary to abort CPU B?
>
> You have lost the illusion of atomicity on that cache line. Thus, its
> time to abort and retry (potentially later).

For whom exactly have I lost the illusion of atomicity?
CPU A is doing non-transactional reads, so it's OK for it to see
inconsistent data (if it reads more than one variable); it's only
important for it not to see speculative stores.
CPU B has not committed yet and its speculative stores are still in the
special buffer, so it is not in the game yet: all its reads are still
valid, and its speculative stores are not observed by anybody yet.


> ASF is not guarenteeing atomicity, it is guarenteeing the ILLUSION of
> atomicity.

Isn't the illusion of atomicity IS-A atomicity? :)

--
Dmitriy V'jukov

Mayan Moudgill

unread,
May 1, 2009, 1:53:14 PM5/1/09
to
MitchAlsup wrote:
> On Apr 30, 9:38 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
>
>>MitchAlsup wrote:
>>
>>>This requires a search (directed or not) through the cache to modify
>>>the states after successful conclusion. See below.
>>
>>Not really; what it requires is easy to do _IF_ you've implemented your
>>cache state using flops instead of SRAMs.
>
>
> Look, NOBODY is going to build their cache(s) in register technology,
> its just not dense enough. However everybody is going to have a number
> of miss buffers that ARE build in register technology. Thus if you can
> accept a reasonable limit to the number of protected liines, then the
> miss buffers are actually easy to use for this purpose.
>
> Now, you might be able to afford to build all or part of your TAG
> array in register technology, and some of this is (and has been done)
> (especially invalidations. But you will never be able to afford the
> data portion of the cache to be implemented in register technology.

Absolutely. I was only talking about the cache-state (i.e. MESI) bits,
not even the whole tag (MESI+LRU+sub-cache occupancy+index etc.), which
might be 512x2 bits. If you're going to support two loads per cycle
against each set, there is a lot of state that you might want
implemented as flops rather than arrays.

Given the requirement that the cache start up with all entries invalid,
you may already be implementing the MESI bits (partly) using flops.
(Alternative approaches include using a state machine to clear each
state, which also has area implications).

> In any event, the miss buffers already have all the desired
> characteristics. They already deal with inbound and outbound data, the
> demand address and the victim address, and they are snooped
> continuously. So they are the perfect place to add a "little"
> functionality. In addition, for all intents and purposes, they ARE a
> cache; a cache of memory things in progress that need constant
> tending.
>
> Mitch

I dislike using MAF/miss-buffers. I would much rather send the control
information with the miss request, and have it echoed back along with
the miss-return data, deferring all the miss-combining etc. to the L2.
{By control information, I mean stuff like
- register to be written back
- insn to be retired
}
This costs extra pipeline flops, but it removes an extra queuing
resource from the core, and takes one? stage out of the cache miss
completion path.

Dmitriy Vyukov

unread,
May 1, 2009, 2:11:18 PM5/1/09
to
On May 1, 10:38 am, MitchAlsup <MitchAl...@aol.com> wrote:
> On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> > With AMD's approach (LOCK MOV) I will have to rewrite all code in
> > asm... not very cool...
>
> AMD has worked with SUN and we jointly came to the conclusion that ASF
> would be a powerful tool to make transactional memory work better/
> faster.
>
> I think it was Mark Moir that said "Transactional memory is for people
> who just want synchronization to work" at which point I said "ASF if
> for people who want synchronization to work fast"

:)
Indeed!
We need synchronization primitives only to work, and to work fast. I
am even Ok with the name SPECULATE :)


> , and we all had a
> good chuckle. There is a completely different philosophy here. ASF is
> for coders who understand atomicity, synchronization, nonBlocking,
> WaitFree, and the intricate techniques involved therein, and are
> willing to work for their performance. TM maybe/is for everybody else.
>
> If you like TM write in TM, only if TM is not giving you the
> performance you are looking for, then selectively recode those time
> critical section in ASF to beat down the costs.
>
> I also think that over 5-ish years SW people will come up with better
> abstractions than did we so as to present SW coders with powerful more
> easy to use primatives. This is why I spend so much effort building
> primatives than solutions (DCAS,TCAS,QCAS).


If some design decision is a trade-off which affects performance, then
I vote with three hands for the solution which provides more
performance/scalability. Well, but why not just provide UNLOCKED MOV
instead? This does not sacrifice performance (at first glance).

I have to thank you for such a great job. Especially I am happy that
it was done in the "x86 company" :)
The RELEASE feature is super-cool! Something which even today's STMs
lack. I was talking with the developers of Intel's STM about "partial
commits" or "partly overlapping transactions"; not sure whether they
are going to support them...

Btw, here is one question I have to ask. AMD was working with Sun, but
did AMD have any discussions with Intel on this (about support in
Intel processors)? IMVHO such initiatives must be a joint effort of
both x86 giants. Not sure whether Intel had discussions with AMD about
AVX...


--
Dmitriy V'jukov

MitchAlsup

unread,
May 1, 2009, 3:51:22 PM5/1/09
to
On May 1, 1:11 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 1, 10:38 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > I also think that over 5-ish years SW people will come up with better
> > abstractions than did we so as to present SW coders with powerful more
> > easy to use primatives. This is why I spend so much effort building
> > primatives than solutions (DCAS,TCAS,QCAS).
>
> If some design decision is a trade-off decision which affects
> performance, then I vote with 3 hands for solution which provides more
> performance/scalability.

What would you do if the intellectual burden were such that, after you
developed/exposed those primitives, only two dozen people in the world
understood the subtleties to the point needed to code reliable
synchronizations? We stumbled over more than a few such possibilities
uncovered along the way.

> Well. but why not just provide UNLOCKED MOV
> instead, this does not sacrifice performance (at my first glance).

We talked about this in the previous thread. Placing synchronization/
atomicity on top of the cache coherence protocol makes complexity go
through the roof. Placing it beside the CC protocols only makes it
really hard to get right. You are playing HW games with the cache
coherence protocol. The cache IS actually incoherent for a short period
of time, on exactly those protected lines, and in addition there is the
potential that the memory request protocol is different/enhanced/more-
dangerous/more-complicated during part of that period of time. This
leaves vast windows in time where you can screw up royally and not
have the ability to recover (the architecture is fundamentally broken).
What we are talking about here, however, comes after all these
subtleties (at the 99.9%-ish level) have been discovered, talked about,
dealt with, applied in (demonstration-like) practice, and written down.

> I have to thank you for such a great job. Especially I am happy that
> it was in the "x86 company" :)

Thanks.

> RELEASE feature is super-cool! Something which even todays STM lack
> of. I was talking with developers of Intel STM about "partial commits"
> or "partly overlapping transactions", not sure whether they are going
> to support them...

I have to defer credit to Dave Christie, here.

> Btw, here is one question I have to ask. AMD was working with SUN, but
> did AMD have some discussions with Intel on this (about support in
> Intel processors)?  IMVHO such initiatives must be a join effort of
> both x86 giants. Not sure whether Intel had discussions with AMD about
> AVX...

The necessities of non-contamination of Intellectual Property mean
that AMD/Intel cannot work together until something is at the level of
the just-issued specification. However, I am 99.44% sure that a few
people within Intel were aware of the general nature of this effort,
as we were aware of several TM-like investigations inside Intel. This
is how NonDisclosure agreements work in practice. IP stuff is not
directly disclosed, but the questions from NDA signers are phrased in
such a way that the NDA-holding party does end up understanding that
somebody else is looking at <blah>. And generally one can google up
enough intermediate data to infer what the other party may be thinking
about.

Mitch

MitchAlsup

unread,
May 1, 2009, 3:54:03 PM5/1/09
to
On May 1, 12:51 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 1, 10:30 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > You have lost the illusion of atomicity on that cache line. Thus, its
> > time to abort and retry (potentially later).
>
> For whom exactly I lost illusion of atomicity?

Interested third parties (like a DMA device, graphics, other threads)
may have seen the cache line. So, you must revert back to the pre-
atomic-event state. At this point, it's simply easier, and less
perilous, to abandon the current critical section and take a fresh
look at the concurrent data structure. And then figure out the best
course of events from here onwards.

Mitch

MitchAlsup

unread,
May 1, 2009, 3:57:00 PM5/1/09
to

Sort of depends on whether you are thinking at the instruction-set
level or at the clock-by-clock level.

At the clock-by-clock level, once you attempt to provide an atomic
event that is wider (in number of stores) than one can perform stores
on a per-clock basis, the illusion of atomicity is all that can be
provided.

Mitch

dave dice

unread,
May 1, 2009, 5:06:04 PM5/1/09
to
Dmitriy V'jukov wrote:

> Another question:
> Why it's necessary to use LOCK MOV, why all reads and writes inside
> transactional region may not be treated as transactional? This is the
> approach taken by Sun's HTM in Rock. On one hand AMD's approach is
> more flexible (I may not include some reads/writes into transaction),
> on the other hand with Sun's approach I may do following thing.
>
> Assume I have large mutex-based code-base:
>
> LOCK(x);
> some_c_or_cpp_code_here;
> UNLOCK(x);
>
> I may replace it with:
>
> TRANSACTIONAL_LOCK(x);
> some_c_or_cpp_code_here;
> TRANSACTIONAL_UNLOCK(x);
>

Following up on Dmitriy's comment, it's extremely convenient for lock
elision on ROCK that HTM is modal and we can execute the same code
either under a lock or within a transaction. That avoids yet another
dimensionality explosion for emitted code. And of course it gives
you a chance to wrap legacy code in transactions.

Shifting topics, I'm appreciative of the progress properties afforded
by Mitch's original design. Having said that, NACKing might open up
some worrisome denial-of-service attacks and related issues,
potentially necessitating watchdog timers for transactions. It's all
solvable, but might not be trivial and the remedies could expose yet
another set of annoyances. Perhaps Mitch might comment. Having
transactions that are only obstruction-free (possibly admitting mutual
abort with no progress) is reasonable as it shifts the responsibility
for contention management and progress guarantees from hardware into
software where we can apply more flexible & adaptive policies. With
lock elision on ROCK, for instance, if the abort rate is too high we
can simply wrap the transactions with some simple admission control
protocols to restrict concurrency to a reasonable level. In the
extreme case we just revert to the lock. But this probably reflects
my bias (or tunnel vision, as Mitch suggested (:>)) that I see
hardware TM more as an accelerator.
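The simple admission control mentioned above — watch the abort rate and throttle how many threads may even attempt the speculative path, reverting to the lock in the extreme — might look roughly like this. This is only a sketch; all names and thresholds are illustrative, not anything from ROCK or ASF.

```c
#include <assert.h>

/* Simple admission controller for lock elision: sample attempts/aborts
 * over a window; if the abort rate is too high, halve the number of
 * threads admitted to the speculative path (minimum 1: at the limit
 * everyone just takes the lock). */
typedef struct {
    int window;    /* attempts per sampling window */
    int attempts;
    int aborts;
    int admitted;  /* max concurrent speculative threads */
} admission_t;

void admission_init(admission_t *a, int window, int max_threads)
{
    a->window = window;
    a->attempts = a->aborts = 0;
    a->admitted = max_threads;
}

void admission_record(admission_t *a, int aborted)
{
    a->attempts++;
    a->aborts += aborted;
    if (a->attempts < a->window) return;
    /* End of window: abort rate above 50% -> restrict concurrency;
     * below 10% -> cautiously re-admit one more thread. */
    if (a->aborts * 2 > a->attempts && a->admitted > 1)
        a->admitted /= 2;
    else if (a->aborts * 10 < a->attempts)
        a->admitted++;
    a->attempts = a->aborts = 0;
}
```

This keeps the progress responsibility in software, where the policy can be tuned or replaced, which is exactly the flexibility argument for obstruction-free hardware transactions.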

Regards
Dave

p.s., at the risk of pointing out the obvious, another advantage of
transactional lock elision is that we can avoid sloshing the $ lines
underlying the lock metadata, reducing coherency traffic and
potentially improving performance for "promiscuous" locks -- locks
that are shared but not contended, or only lightly contended.

Mayan Moudgill

unread,
May 1, 2009, 5:30:15 PM5/1/09
to
MitchAlsup wrote:

> This is why I was lobbying for a NAK to be added to the HyperTansport
> interconnect fabric and only for ASF events.
>

Interesting; I didn't realize that HyperTransport does not have a Retry
read response. That means that once a message is put on the bus, it
will either complete or come back with an error.

BTW: What typically happens if the response is Target Abort? Is it
retried, or just an error? How about Data Error?

MitchAlsup

unread,
May 1, 2009, 6:12:26 PM5/1/09
to

HT only has completion and failure responses. Failures are machine-
check-like events--you don't want to go there. Failures have generally
indicated that the HT routing tables are screwed, or that the TLB
mapping tables are screwed. Neither is good for your current OS's
assumptions. What is needed is a message that says "yes I have it,
but I cannot give it to you right now"--which is neither a
success nor a failure response.

What we wanted was for the NAK to arrive back at the requestor, and if
it was attempting to perform an ASF event, then the ASF event would
fail (and backtrack -- the CDS is undergoing modification as we speak),
but if it was not attempting an ASF event, the request was simply
reissued (innocent or malevolent bystander). Thus, innocent
interaction with data undergoing synchronization gets deferred while
the synchronization event makes forward progress. Programming
languages cannot prevent innocent or malevolent threads from accessing
the CDS any time they want--only SW discipline and coding practices
can; however, earlyASF gave priority to the good guys over the
innocents and malevolents. It appears that this ASF does not.

Since the "loop" of request->MC->PROBES->miss-buffer->requestor is
many hundreds of CPU clocks long, simply retrying is sufficient and
does NOT flood the interconnect with messages. Since the various
messages are short (4-8 HT clocks) and ordered at the memory
controllers (1 memory access in the active state per cache line--the
rest queued), and traffic is throttled in the HT fabric, hundreds of
CPUs can be contending with little overhead (<10%-ish) in the HT
fabric due to these innocent and malevolent retries.

Thus, the good guys make forward progress, and the innocent get delayed
and see the latest state of the concurrent data structure when their
access completes.

The rather similar-looking Intel interconnect fabric does not have
this 'lack-of-flooding' property, and if Intel were to implement
something like ASF they would make different choices about some of
these details. I am available to help them if they so desire.

Mitch

Dmitriy Vyukov

unread,
May 2, 2009, 3:26:49 AM5/2/09
to

Sorry, I still do not understand.
A non-transactional read on CPU A is satisfied with the non-speculative
value (the value that was current before the transaction on CPU B
began), so nobody has seen anything criminal yet. The illusion of
atomicity here is that the read on CPU A simply happens-before the
whole transaction on CPU B. Note that CPU A makes a
*non-transactional* read. I understand why atomicity would be broken
if CPU A made a *transactional* read (the two transactions may end up
with neither happening-before the other), or if CPU A made a write
(either transactional or not).

--
Dmitriy V'jukov

Dmitriy Vyukov

unread,
May 2, 2009, 7:23:53 AM5/2/09
to
On 1 май, 23:51, MitchAlsup <MitchAl...@aol.com> wrote:
> On May 1, 1:11 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > On May 1, 10:38 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > > I also think that over 5-ish years SW people will come up with better
> > > abstractions than did we, so as to present SW coders with powerful,
> > > more easy-to-use primitives. This is why I spent so much effort
> > > building primitives rather than solutions (DCAS,TCAS,QCAS).
>
> > If some design decision is a trade-off decision which affects
> > performance, then I vote with 3 hands for solution which provides more
> > performance/scalability.
>
> What would you do if the intellectual burden was such that, after you
> developed/exposed those primitives, only 2 dozen people in the world
> understood the subtleties to the point needed to code reliable
> synchronizations? We stumbled over more than a few such possibilities
> uncovered along the way.


I have no answer. I guess the boss wanted primitives that are extremely
fast and scalable and at the same time extremely high-level and easy
to use. A kind of impossible. Not-so-fast but easy-to-use primitives
we already have (mutexes, STM, message-passing, etc). So extremely
fast and scalable but extremely hard-to-use primitives do not look so
bad in our toolbox. IMHO.
How many people are able to deal with relaxed memory models with
separated atomicity and ordering (SPARC RMO)? I believe ASF will not
be fundamentally harder; in any case, relaxed memory models already
force me to think at the hardware level (store buffers, reordering in
the pipeline, etc).

>
> >                                   Well. but why not just provide UNLOCKED MOV
> > instead, this does not sacrifice performance (at my first glance).
>
> We talked about this on the previous thread. Placing synchronization/
> atomicity on top of the Cache coherence protocol makes complexity go
> through the roof. Placing beside the CC protocols only makes it really
> hard to get right. You are playing HW games with the cache coherence
> protocol. The cache IS actually incoherent for a short period of time,
> on exactly those protected lines, and in addition there is the
> potential that the memory request protocol is different/enhanced/more-
> dangerous/more-complicated during part of that period of time. This
> leaves vast windows-in-time where you can screw up royally, and not
> have the ability to recover (architecture is fundamentally broken).
> What we are talking about here, however, is if after all these
> subtleties (99.9%-ish level) have been discovered, talked about, dealt
> with, applied in (demonstration-like) practice, and written down.


I am not sure why you are talking about complexity and the disposition
of atomicity relative to cache coherence; what I was talking about is
no more than a renaming. It need not affect complexity, the
disposition of atomicity relative to cache coherence, etc. Here I
described the rationale for the renaming in some more detail:
http://groups.google.ru/group/comp.programming.threads/msg/1b2dfbbeda9a7fdd
Currently I consider such renaming a win-win solution.


--
Dmitriy V'jukov

Dmitriy Vyukov

unread,
May 2, 2009, 8:18:19 AM5/2/09
to
On 2 май, 01:06, dave dice <david.d...@gmail.com> wrote:

> Following up on  Dmitriy's comment, it's extremely convenient for lock
> elision on ROCK that HTM is modal and we can execute the same code
> either under a lock or within a transaction.   That avoids yet another
> dimensionality explosion for emitted code.   And of course it gives
> you a chance to wrap legacy code in transactions.


The chapter where you redefine the mutex lock/unlock macros to use
transactions on the main path and just recompile some code base looks
very nice indeed.
ASF's fine-grained control over read- and write-sets and the ability
to release a line look very nice too.

What do you think about combining both advantages, i.e. to treat plain
MOVs as transactional inside of an active transaction, and provide
special "UNLOCKed MOVs" which will be treated non-transactionally
inside a transaction? IMVHO it's a win-win solution.

> Shifting topics, I'm appreciative of the progress properties afforded
> by Mitch's original design.   Having said that, NACKing might open up
> some worrisome denial-of-service attacks and related issues,
> potentially necessitating watchdog timers for transactions.   It's all
> solvable, but might not be trivial and the remedies could expose yet
> another set of annoyances.  Perhaps Mitch might comment.   Having
> transactions that are only obstruction-free (possibly admitting mutual
> abort with no progress) is reasonable as it shifts the responsibility
> for contention management and progress guarantees from hardware into
> software where we can apply more flexible & adaptive policies.    With
> lock elision on ROCK, for instance, if the abort rate is too high we
> can simply wrap the transactions with some simple admission control
> protocols to restrict concurrency to a reasonable level.   In the
> extreme case we just revert to the lock.   But this probably reflects
> my bias (or tunnel vision, as Mitch suggested (:>)) that I see
> hardware TM more as an accelerator.

I am unable to provide intelligent comments on how the first
generation of ASF must be implemented. However I see two different
things here. The first is the actual implementation of the
first-generation ASF. The second is the formal ASF specification,
which will be a base for several generations of implementations from
several vendors.
It is definitely worth allowing the specification to support many
different implementations, especially taking into account that most
software developers do not actually care who, when and how will abort.
I guess software developers will differentiate implementations by the
speed of execution of their code.
Maybe it is better to remove chapter 6.2 from the specification
altogether and take Sun's "best effort" approach, i.e. just say
"hardware will do its best to execute transactions as fast as
possible; if you notice special abort codes or too high an abort rate,
apply backoff, fall back to mutexes, or whatever".

--
Dmitriy V'jukov

EricP

unread,
May 2, 2009, 12:54:51 PM5/2/09
to

So you are suggesting that only locked loads are protected -
normal loads are unsafe. If someone wants transaction semantics
their code must be written as such. There is utility in being
able to do unsafe reads but whether that should be default
or explicit is debatable.

Also, looking at the MOESI coherency protocol, it is possible they
may be trying to retain the lines in either the Exclusive (not updated,
no shared copies) or Modified (updated, no shared copies) states.
That might allow the Commit to do its thing without
transmitting any bus messages.

Allowing a read would transition Modified lines to the Owned state,
necessitating possibly multiple Invalidate messages on Commit.
All such lines would have to be held in a limbo state until the
replies were received, which implies resources must be tied down,
expands timing windows for responses to requests for those lines, etc.
That in turn might limit the ability to expand the number
of lockable lines in the future.

That having been said, I can see the utility in having an explicit
'unsafe read' which says 'I know what I'm doing so get out of my way'.
In particular it would allow a cache line being updated to be marked
with a guard flag and counter. An unsafe read would be used to read
the flag without disturbing an update in progress.
As it currently stands such a guard flag would have to be located
in a separate adjacent cache line.
Of course this also causes the MOESI issues I mentioned above.

Eric


Dmitriy Vyukov

unread,
May 3, 2009, 4:28:34 AM5/3/09
to
On May 2, 9:54 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:

> So you are suggesting that only locked loads are protected -
> normal loads are unsafe. If someone wants transaction semantics
> their code must be written as such. There is utility in being
> able to do unsafe reads but whether that should be default
> or explicit is debatable.


No, I didn't intend to change observable behavior in any way.
Since the activity on CPU B is speculative, the hardware may just act
as if nothing has actually happened on CPU B yet.
If CPU A made a write, then it would be senseless to continue the
transaction on CPU B - it will abort on commit anyway. But if CPU A
makes a read, we may continue the transaction on CPU B - there is no
reason to abort; the only thing we must ensure is that CPU A will not
see speculative values.


> Also, looking at the MOESI coherency protocol, it is possible they
> may be trying to retain the lines in either the Exclusive (not updated,
> no shared copies) or Modified (updated, no shared copies) states.
> That might allow the Commit to do its thing without
> transmitting any bus messages.
>
> Allowing a read would transition Modified lines to the Owned state,
> necessitating possibly multiple Invalidate messages on Commit.
> All such lines would have to be held in a limbo state until the
> replies were received, which implies resources must be tied down,
> expands timing windows for responses to requests for those lines, etc.
> That in turn might limit the ability to expand the number
> of lockable lines in the future.


My primitive view is that we may (1) transmit some bus messages on
commit, or (2) redo some work and transmit the same number of messages
during re-execution.
Though now I am not sure that such details must be included in the
specification at all.


> That having been said, I can see the utility in having an explicit
> 'unsafe read' which says 'I know what I'm doing so get out of my way'.

Once again, my intent is to reduce the number of aborts without
affecting observable behavior.
Mitch said that I lose visible atomicity; if so, please show me where
and how (this is not my intent).

> In particular it would allow a cache line being updated to be marked
> with a guard flag and counter. An unsafe read would be used to read
> flag without disturbing and update in progress.
> As it currently stands such a guard flag would have to be located
> in a separate adjacent cache line.
> Of course this also causes MOSEI issues I mentioned above.

The use-case I have in mind is checking the stack for emptiness.

--
Dmitriy V'jukov

EricP

unread,
May 3, 2009, 11:23:28 AM5/3/09
to
Dmitriy Vyukov wrote:
> On May 2, 9:54 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>
>> So you are suggesting that only locked loads are protected -
>> normal loads are unsafe. If someone wants transaction semantics
>> their code must be written as such. There is utility in being
>> able to do unsafe reads but whether that should be default
>> or explicit is debatable.
>
>
> No, I didn't intend to change observable behavior in any way.
> Since the activity on CPU B is speculative, the hardware may just act
> as if nothing has actually happened on CPU B yet.
> If CPU A made a write, then it would be senseless to continue the
> transaction on CPU B - it will abort on commit anyway. But if CPU A
> makes a read, we may continue the transaction on CPU B - there is no
> reason to abort; the only thing we must ensure is that CPU A will not
> see speculative values.

I agree, having CPU A's read abort CPU B appears
logically unnecessary.

ASF does not have a concept of a protected reader -
perhaps it should. When multiple protected cache lines
are updated, they all commit together or not at all.
That is not a strong enough guarantee by itself to
prevent readers from seeing partial multiple updates.

Cpu A could do a normal read on location #1, sleep for a while,
then read location #2 (on separate cache lines). While asleep
cpu B performs a multiple update on #1 and #2 and commits.
Cpu A sees the old #1 and new #2.

Since cpu A can see old #1 and new #2 anyway, it is pointless
to have either of those reads abort cpu B if they happen
to fall within B's protected update window.

Does that just about cover your concerns?

> The use-case I have in mind is checking of the stack for emptiness.

What do you have in mind?

Eric

EricP

unread,
May 3, 2009, 12:35:54 PM5/3/09
to
EricP wrote:
>
> ASF does not have a concept of a protected reader -
> perhaps it should.

Rather I should say that it is not explicitly documented as
such. There does not appear to be any reason that you
could not create a protected read-only code section by
just doing a SPECULATE...LOCK MOVEx (load)... COMMIT
with no updates, and if an interrupt or task switch occurs
then you trap back to the SPECULATE.

Perhaps that was so self evident they just never mentioned it.

Eric

Dmitriy Vyukov

unread,
May 3, 2009, 1:01:21 PM5/3/09
to
On May 3, 8:23 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:


> I agree, having CPU A's read abort CPU B appears
> logically unnecessary.


I guess the authors of the ASF proposal just based their reasoning on
some particular implementation. However IMHO it's a mistake to base a
formal specification, which may have several implementations, on some
particular implementation. IMHO the better place for such rules is the
software developer's manual for a particular processor family (where
the more details the better).


> ASF does not have a concept of a protected reader -
> perhaps it should. When multiple protected cache lines
> are updated, they all commit together or not at all.
> That is not a strong enough guarantee by itself to
> prevent readers from seeing partial multiple updates.

Why? AFAIU, ASF does support protected readers:

SPECULATE
LOCK MOV EAX, [ESI]
LOCK MOV EBX, [EDI]
COMMIT
; at this point EAX and EBX have consistent values regarding
concurrent transactional modifications of [ESI] and [EDI].


Now I think that it's Ok to not abort CPU B in some cases even if CPU
A makes transactional reads. Consider:

; CPU B
1. SPECULATE
2. LOCK PREFETCHW [ESI]
3. LOCK PREFETCHW [EDI]
4. LOCK MOV [ESI], 1
5. LOCK MOV [EDI], 1
6. COMMIT

; CPU A
10. SPECULATE
11. LOCK MOV EAX, [ESI]
12. LOCK MOV EBX, [EDI]
13. COMMIT

ESI on CPU A == ESI on CPU B, EDI on CPU A == EDI on CPU B, initially
[ESI] = [EDI] = 0

Consider following order of events (as seen from interconnect between
CPU A and CPU B):
1, 2, 3, 4, 10, 11, 12, 13, 5, 6

Provided that EAX and EBX will be equal to 0 on CPU A, it's perfectly
OK not to abort CPU B. COMMIT on CPU B will probably issue additional
cache coherence messages, but the number of messages will be no
greater than that required to re-execute the transaction on CPU B.

It's possible to avoid the additional cache-coherence messages in this
case. It's called 'lazy acquire' in STM terminology. LOCK PREFETCHW
must not make any global activity, only local bookkeeping; then COMMIT
will acquire all required cache lines in the MODIFIED state and make
the necessary updates. Not sure how applicable this is to hardware.
Though hard-wiring a particular contention-resolution protocol into
the specification looks more and more unattractive to me.


> Cpu A could do a normal read on location #1, sleep for a while,
> then read location #2 (on separate cache lines). While asleep
> cpu B performs a multiple update on #1 and #2 and commits.
> Cpu A sees the old #1 and new #2.
>
> Since cpu A can see old #1 and new #2 anyway, it is pointless
> to have either of those reads abort cpu B if they happen
> to fall within B's protected update window.
>
> Does that just about cover your concerns?
>
> > The use-case I have in mind is checking of the stack for emptiness.
>
> What do you have in mind?

Consider a LIFO stack which may be empty with substantial probability.
Threads often check whether the stack is empty or not. I want this
check to be codable with as low overhead as possible, and to not abort
potential concurrent modifications:

if (stack_head != 0)
{
    SPECULATE
    // try to pop element from the stack
    COMMIT
}
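For comparison, here is the same cheap emptiness probe on an ordinary
CAS-based (Treiber-style) stack, sketched in C11. This is a
single-threaded sketch; a real concurrent pop of this form still has
the ABA problem, which is precisely what the SPECULATE region would
remove:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int value; struct node *next; };
static _Atomic(struct node *) stack_head;

static void push(struct node *n)
{
    n->next = atomic_load_explicit(&stack_head, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               &stack_head, &n->next, n,
               memory_order_release, memory_order_relaxed))
        ;
}

/* The pattern from the text: one plain load up front, so probing an
 * empty stack costs a single read and never aborts concurrent writers. */
static struct node *try_pop(void)
{
    struct node *head = atomic_load_explicit(&stack_head,
                                             memory_order_acquire);
    while (head != NULL) {   /* empty: no speculative region at all */
        if (atomic_compare_exchange_weak_explicit(
                &stack_head, &head, head->next,
                memory_order_acquire, memory_order_acquire))
            return head;     /* this CAS stands in for SPECULATE..COMMIT */
        /* a failed CAS reloaded head; retry, or exit if now empty */
    }
    return NULL;
}
```

The empty check costs one load either way; the difference is only in
what protects the head->next dereference on the non-empty path.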


--
Dmitriy V'jukov

EricP

unread,
May 3, 2009, 1:45:28 PM5/3/09
to
Dmitriy Vyukov wrote:
> On May 3, 8:23 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>> ASF does not have a concept of a protected reader -
>
> Why? AFAIU, ASF does support protected readers:

Yep... forget I said that.
Giant brain fart on a global scale.
I don't know what I was thinking.

Eric

dave dice

unread,
May 4, 2009, 2:05:23 PM5/4/09
to
On May 2, 8:18 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
>
> What do you think about combining both advantages, i.e. to treat plain
> MOVs as transactional inside of an active transaction, and provide
> special "UNLOCKed MOVs" which will be treated non-transactionally
> inside a transaction? IMVHO it's a win-win solution.

Exactly. In addition I'd prefer the decorated "UNLOCKED:" loads and
stores to be legal and have normal semantics in normal non-
transactional mode, so we could run the same code both within and
outside of txns.

Dave

EricP

unread,
May 4, 2009, 7:41:49 PM5/4/09
to
Dmitriy Vyukov wrote:
>
> It's possible to avoid additional cache-coherence messages in this
> case. It's called "lazy acquire' in STM terminology. LOCK PREFETCHW
> must not make any global activity, only local bookkeeping, then COMMIT
> will acquire all required cache-lines in MODIFIED state and make
> necessary updates. Not sure how much this is applicable to hardware.
> Though hard-wiring particular contention resolution protocol into
> specification looks more and more unattractive to me.

I am thinking along similar lines. I like the idea of a
hardware lock accelerator but have difficulty seeing it
within the SPECULATE... COMMIT framework.

I'm thinking that since we can't NAK (that being a no-no for
some bus protocols, I'm guessing coherent HT is one) we need
some way to wait. So how about LOCK WAIT? It works as follows:
you emit a series of LOCK PREFETCH or LOCK PREFETCHW instructions
declaring your intent to access the locations, but since no data
is actually transferred, there are no logical consequences.
That is followed by a LOCK WAIT, which works similarly to a
WaitForInterrupt in that it may delay, or it may not.
It could return a hint code as to what to do next.
Then you do the normal SPECULATE...COMMIT.

If the hardware lock accelerator is present then it decides
who should really wait and who should go immediately.
Maybe it could return a hint in EAX as to what to do next.

There is no need for bus NAKs because nothing gets transferred.

E.g.
Replay:
PREFETCH x
PREFETCH y
PREFETCHW z
LOCK WAIT
JNZ ReplayDecide
SPECULATE
JNZ ReplayDecide
...
COMMIT
JZ Done
ReplayDecide:
blah
blah
JMP Replay

Eric

EricP

unread,
May 4, 2009, 8:00:06 PM5/4/09
to

Oops, I mean
LOCK PREFETCH
LOCK PREFETCHW

Dmitriy Vyukov

unread,
May 5, 2009, 3:28:29 AM5/5/09
to
On 5 май, 03:41, EricP <ThatWouldBeTell...@thevillage.com> wrote:

> E.g.
>   Replay:
>    PREFETCH  x
>    PREFETCH  y
>    PREFETCHW z
>    LOCK WAIT
>    JNZ ReplayDecide
>    SPECULATE
>    JNZ ReplayDecide
>    ...
>    COMMIT
>    JZ Done
>   ReplayDecide:
>    blah
>    blah
>    JMP Replay


How would it look if some of the load/store locations depend on
previously loaded data? i.e.:
int* x = *g_x;
int y = *g_y;
x[y * 5 + 11] = 1;

--
Dmitriy V'jukov

EricP

unread,
May 5, 2009, 1:07:07 PM5/5/09
to
Dmitriy Vyukov wrote:

Firstly, I should have put the LOCK PREFETCHes inside the SPECULATE
region, as their virtual-to-physical translation is part of the
transaction.

Now to handle dependent data reads, maybe just an initial locked
prefetch and wait would do. Once any data value has been seen,
the reader becomes a candidate for a potential abort.

My thinking is that LOCK PREFETCH and LOCK PREFETCHW don't
cause any visible state change, so they can be defined to do
whatever the designers require.
On some CPUs they prefetch and interfere with peer locks.
On others they are batched up and shipped to a scheduler.

To be able to optimize the locking order, all LOCK PREFETCHW should
occur prior to the LOCK WAIT, the actual stores should occur after
the wait, and no new locked reads should occur after the first store.
That is not a restriction; it is just that the odds of aborting go up
if you change the sequence.

Attempt #2...

int* x = *g_x;
int y = *g_y;
x[y * 5 + 11] = 1;

g_z++;

Replay:
SPECULATE
JNZ RetryDecide
LOCK PREFETCH g_x
LOCK PREFETCH g_y
LOCK PREFETCHW g_z
LOCK WAIT
LOCK MOV rbx, g_y
LOCK MOV rbx, [rbx]
IMUL rbx, rbx, 5
ADD rbx, 11
LOCK MOV rcx, g_x
LOCK MOV rdx, g_z
INC rdx
LOCK PREFETCHW [rcx+8*rbx]
LOCK WAIT
LOCK MOV [rcx+8*rbx], 1
LOCK MOV g_z, rdx
COMMIT
JZ Done
RetryDecide:

MitchAlsup

unread,
May 5, 2009, 7:36:29 PM5/5/09
to
On May 2, 6:23 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> On 1 ÍÁÊ, 23:51, MitchAlsup <MitchAl...@aol.com> wrote:
> I am not sure why you are talking about complexity and disposition of
> the atomicity relative to cache coherence, what I was talking about is
> no more than renaming of names. It must not affect complexity,
> disposition of the atomicity relative to cache coherence, etc.

The HW in question had 8 memory (miss) buffers and 22 store buffers.
Some of the atomic activities we wanted to be able to perform would
have overflowed the 22-store limit in the store buffer if one were
programming with cursors {pointer, timestamp, other cursor data == 3
stores per 'reference'}. If the atomic event takes more room than is
available in the execution window, then it cannot be recovered (like a
branch misprediction is recovered). Since we also wanted coders to be
able to debug these regions, other stores would be needed. Thus,
complexity would have been increased, or the "ASF model" would have
had to shrink to fit the execution window of a particular machine.
Neither was desirable.

So, ASF was designed so that its SW model remains intact with any
associativity in the cache hierarchy (including zero) and with any
number of execution-window resources. Its only mandate was enough MABs
to allow forward progress once a number of them were locked down for
ASF activities. This was the smallest and cheapest resource that
already existed in the correct places for the activities we wanted to
perform. The MABs also had the property that they got snooped, were
(or could be) associated with line buffers (for recovery purposes),
and the decoration needed was only a couple of flip-flops and some
state logic.

Thus, I disagree with your assumption that it would not have affected
complexity (relative to almost anything in that machine). This
functionality could be installed in a machine by removing the #UD
traps on memref instructions that happen to have a LOCK prefix, plus a
couple of new instructions and a couple of new bits in an already
existing, functioning piece of the machine.

Mitch

MitchAlsup

unread,
May 5, 2009, 7:40:20 PM5/5/09
to
On May 2, 11:54 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> Also, looking at the MOESI coherency protocol, it is possible they
> may be trying to retain the lines in either the Exclusive (not updated,
> no shared copies) or Modified (updated, no shared copies) states.
> That might allow the Commit to do its thing without
> transmitting any bus messages.

To retain the illusion of atomicity, all lines must remain in the
prior state or become visible in the after state in one instant.
Passing bus messages around makes this very significantly harder.

> Allowing a read would transition Modified lines to the Owned state,
> necessitating possibly multiple Invalidate messages on Commit.

Which makes the set of lines not transition from the before state to
the after state in one instant.

> All such lines would have to be held in a limbo state until the
> replies were received, which implies resources must be tied down,
> expands timing windows for responses to requests for those lines, etc.
> That in turn might limit the ability to expand the number
> of lockable lines in the future.

And, in practice, does not work.

Mitch

MitchAlsup

unread,
May 5, 2009, 7:45:39 PM5/5/09
to
On May 3, 3:28 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 2, 9:54 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> My primitive view is that we may (1) transmit some bus messages on
> commit, or (2) redo some work + transmit the same number of messages
> during re-execution.

Now the COMMIT instruction takes an essentially unbounded amount of
time. And I don't think the HT has the infrastructure to smell atomic
under these assumptions.

> Though now I am not sure that such details must be included into the
> specification at all.
>
> > That having been said, I can see the utility in having an explicit
> > 'unsafe read' which says 'I know what I'm doing so get out of my way'.
>
> Once again, my intent is to reduce number of aborts w/o affecting
> observable behavior.
> Mitch said that I lose visible atomicity; if so, please show me where
> and how (this is not my intent).

It may very well be that it remains possible to retain atomicity under
these circumstances. However, with the HW limits imposed (and
self-imposed*) upon us, I could not.

Mitch

(*) In order to qualify for inclusion in a particular rev of the given
machine, this had to be a "small" change to a bounded number of
existing machine circuits.

MitchAlsup

unread,
May 5, 2009, 7:47:51 PM5/5/09
to

MS noticed this immediately (the utility and existence). FWIW

Mitch

Michael Hohmuth

unread,
May 7, 2009, 4:24:46 AM5/7/09
to
Disclaimer:
I work for AMD and am one of the coauthors of the ASF proposal. AMD
has not announced support for ASF (Advanced Synchronization
Facility) in any future product.

I'd like to thank everyone for the very interesting discussion. I've
read though all of it, and many of you made some very good points.

MitchAlsup <Mitch...@aol.com> writes:

> Well, hooray, my invention has finally seen the light of day.
>
> AMD have diddled with the instruction ordering (and spelling of the
> instruction names) but most of the original intent remains.

[ and in another posting: ]

> While this ASF may be a step forward in atomic stuff, it is
> significantly less powerful that the previous version described in
> the previous thread.
>
> So, it's nice to have it see the light of day. It is shameful to have
> lost so much power and expressivity along the way. It is truly sad
> that ASF lost the ability to attack the BigO(n**2) problem along the
> way.

Congratulations! And sorry for all the diddling that we did. At
least we had some fun doing it. :)

> As far as I can tell, they have lost the ability to beat the BigO
> (n**2) problem in DCAS-based non-blocking algorithms. This is shown
> in table 6.2.1. Any downgrade of cache line status, aborts the CPU
> that has potentially made the most progress. When I looked into
> this, this kind of scenario leads to livelock under certain timing
> arrangements between CPUs/cores/Nodes.

Yes, we removed deterministic mode from the original design (described
in your patents). This definitely puts more burden on software to
manage contention and, specifically, to deal with possible livelock.

In general, we chose a more hardware-friendly approach. The
specification now intends to allow a wide range of implementations,
including very simple ones, but without precluding sophisticated
implementations that do some contention management in hardware.

> Back when I was working on this, we were intending to support 8
> protected lines (with 7 as a back up). There are useful atomic
> primitives that use 5 protected lines (moving a concurrent data
> structure (CDS) element from one place in a CDS to another in a
> single atomic event.)
>
> And here is where I* think this current ASF architecture is screwed
> up*. This atomic move basically requires finding the element to be
> moved (a many-multi-cycle process) and then finding a place to put
> it (another many-multi-cycle process) and then doing the move
> atomically. The original ASF allowed one to lock the protected
> cache line(s) as they got found before entering the atomic
> region. At entry to the atomic region, if the protected cache lines
> have been downgraded from the cache, the entry fails, otherwise a
> few SW checks are performed, and then the atomic operations are
> performed.
>
> This current ASF requires either that the atomic region be very much
> longer (around both searches) or that each search has to move a bunch
> of state to a thread-safe memory buffer that will be compared later
> against the current state(s), and then compare all that state in the
> atomic region before modifying CDS state atomically. By deferring the
> entry into the atomic phase until after SW has determined which
> lines are participating, but leaving HW watching for interference in
> the meantime, the duration of the atomic region is made significantly
> smaller.

Yes, shorter speculative regions are better. But we wanted to enable
one additional use case not covered by the approach you outlined:
Often you cannot even safely do the search (i.e., walk a list,
dereference "next" pointers) without making sure that the followed
references are still valid (unless you use locking or make some
far-reaching assumptions, such as type-stable memory). We considered
various alternatives, and the present design offers one solution: Do
the search in the speculative region and protect the references as you
go. Undoubtedly there are other solutions.

Some other benefits of the current proposal in comparison to the older
one:

* Ability to dynamically discover and add new data to the set of
protected memory locations in the speculative region (a
generalization of the solution to the safe-references problem
described above)

* Nested speculative regions

[ from yet another posting: ]

> ASF evolved over 2-2.5 years (while I was there) with insight from
> the best architects within AMD and from selected AMD partners who
> gave useful and illuminating input into what they might have liked
> and what they did not. I'm pretty sure that this evolution did not
> stop after I left.

Indeed. Based on all this input, I believe that we arrived at a
design point that is easier to implement, allows more flexibility for
software, and addresses the needs of software better than your
original proposal.

Michael
--
Michael Hohmuth, AMD Operating System Research Center, Dresden, Germany
michael...@amd.com, www.amd64.org

Michael Hohmuth

unread,
May 7, 2009, 5:31:05 AM5/7/09
to
Chris M. Thomasson writes:

> I think they need to clarify one tiny little point in section
> 7.2.1. They basically write that one cannot remove multiple objects
> from a LIFO without ASF. This is not true. One can remove all items
> with a wait-free atomic swap, and if the LIFO contains more than one
> item, well, that is removing multiple items in a single operation.

That's true -- thanks!

> Also, it would be nice if they mention some other solutions to the
> ABA problem, and memory reclamation issues.

Could you elaborate a little what you were thinking of? Deferred
reclamation, or virtual-memory tricks? (BTW, do you have a good
citable reference for PDR?)

Wouldn't you think that would be a bit out of scope for the spec
proposal's motivation / use cases / examples section?

BTW, I do believe that ASF can make memory reclamation much easier.

> Anyway, IMO, AMD would be crazy not to implement this in their
> upcoming processors. Well done.

Thanks for your kind words!

Chris M. Thomasson

May 11, 2009, 12:31:58 AM
"Michael Hohmuth" <Michael...@amd.com> wrote in message
news:3yzbpq5...@sieglinde.amd.com...

> Chris M. Thomasson writes:
>
>> I think they need to clarify one tiny little point in section
>> 7.2.1. They basically write that one cannot remove multiple objects
>> from a LIFO without ASF. This is not true. One can remove all items
>> with a wait-free atomic swap, and if the LIFO contains more than one
>> item, well, that is removing multiple items in a single operation.
>
> That's true -- thanks!

Also, you can totally eliminate ABA and any internal memory reclamation
issues for the lock-free LIFO if you always use an atomic swap for
consumers.
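
The swap-based consume-all trick is easy to show concretely. Below is a
minimal single-threaded sketch using C11 atomics; the names (`lifo_push',
`lifo_flush') and the node layout are illustrative, not from the ASF
proposal:

```c
#include <stdatomic.h>
#include <stddef.h>

struct lifo_node {
    struct lifo_node *next;
    int value;
};

/* Head of the LIFO; producers push with CAS, consumers take the
 * whole list at once with an unconditional atomic exchange. */
static _Atomic(struct lifo_node *) g_lifo_head = NULL;

static void lifo_push(struct lifo_node *n) {
    struct lifo_node *head = atomic_load(&g_lifo_head);
    do {
        n->next = head;  /* head is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak(&g_lifo_head, &head, n));
}

/* Wait-free consume-all: one exchange detaches every node, so there
 * is no ABA window and no per-node reclamation race for the consumer. */
static struct lifo_node *lifo_flush(void) {
    return atomic_exchange(&g_lifo_head, NULL);
}
```

Because `lifo_flush()' detaches the entire list in one unconditional
exchange, the consumer never races other consumers for individual nodes,
which is what removes the ABA and internal-reclamation hazards.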


>> Also, it would be nice if they mention some other solutions to the
>> ABA problem, and memory reclamation issues.
>
> Could you elaborate a little what you were thinking of? Deferred
> reclamation, or virtual-memory tricks? (BTW, do you have a good
> citable reference for PDR?)

PDR stands for `Partial copy-on-write Deferred Reclamation'. It is a term
that was coined by Joe Seigh over on `comp.programming.threads' in the
following discussion:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/82376b7295e6ce1a


Here is Joe's website:

http://atomic-ptr-plus.sourceforge.net


And some further descriptions and usage examples:

http://groups.google.com/group/comp.programming.threads/msg/c8206c2535fe816e


I hope that helps.


Anyway, you can use PDR to avoid the ABA problem and get around memory
reclamation issues in basically any non-blocking algorithm.


> Wouldn't you think that would be a bit out of scope for the spec
> proposal's motivation / use cases / examples section?

Yeah. Now that I think about it, you're probably right. However, it might be
useful if the paper shows how one can use ASF alongside PDR. For instance,
one can use ASF to allow writer threads to insert and remove nodes from
complex tree algorithms while using PDR to allow reader threads to perform
highly efficient concurrent traversals.


> BTW, I do believe that ASF can make memory reclamation much easier.

Easier in what sense? I mean, it does completely solve the internal memory
reclamation problem wrt lock-free stacks and queues, but how about
node traversal? I don't think that ASF is sufficient to handle the
reader portion of a PDR algorithm; something like:
______________________________________________________________
struct node {
    struct node* next;
    void* state;
};


static node* g_head = [...];


// read-only threads
void traverse() {
    struct node* node = ATOMIC_LOAD_DEPENDS(&g_head);

    while (node) {
        struct node* next = ATOMIC_LOAD_DEPENDS(&node->next);
        read_only_processing_of_state(node->state);
        node = next;
    }
}
______________________________________________________________


ASF would not only have to protect the nodes themselves, but the actual
memory pointed to by the `node::state' member which is accessed by the
`read_only_processing_of_state()' procedure.
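
As a concrete (single-threaded, hence hedged) rendering of the sketch
above: `ATOMIC_LOAD_DEPENDS' is replaced by C11 `memory_order_acquire'
loads, and the per-node processing is a stand-in counter; everything not
named in the original sketch is illustrative:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    _Atomic(struct node *) next;
    void *state;
};

static _Atomic(struct node *) g_head = NULL;

/* Illustrative per-node processing: count visited nodes. */
static int g_visited;
static void read_only_processing_of_state(void *state) {
    (void)state;
    g_visited++;
}

/* Read-only traversal; acquire loads stand in for the
 * dependency-ordered loads (ATOMIC_LOAD_DEPENDS) of the sketch. */
void traverse(void) {
    struct node *n = atomic_load_explicit(&g_head, memory_order_acquire);
    while (n) {
        struct node *next =
            atomic_load_explicit(&n->next, memory_order_acquire);
        read_only_processing_of_state(n->state);
        n = next;
    }
}
```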


Now, ASF can __definitely__ help out with the writer side of the algorithm.
One can use ASF to create elaborate non-blocking writer logic, and use PDR
to handle all of the memory reclamation issues for the reader logic. AFAICT,
using ASF and PDR together would create an incredibly powerful toolset.


>> Anyway, IMO, AMD would be crazy not to implement this in their
>> upcoming processors. Well done.
>
> Thanks for your kind words!

No problem. I think that ASF is going to turn out to be __very__ useful.
Thank you, and all of your colleagues for creating it!


:^)

Dmitriy Vyukov

May 11, 2009, 6:32:25 PM
On 11 May, 08:31, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > BTW, I do believe that ASF can make memory reclamation much easier.
>
> Easier in what sense? I mean it does completely solve the internal memory
> reclaiming issues problem wrt lock-free stacks and queues, but how about
> node traversal? I don't think that ASF is sufficient enough to handle the
> reader portion of a PDR algorithm; something like:
> ______________________________________________________________
> struct node {
>   struct node* next;
>   void* state;
>
> };
>
> static node* g_head = [...];
>
> // read-only threads
> void traverse() {
>   struct node* node = ATOMIC_LOAD_DEPENDS(&g_head);
>
>   while (node) {
>     struct node* next = ATOMIC_LOAD_DEPENDS(&node->next);
>     read_only_processing_of_state(node->state);
>     node = next;
>   }}
>
> ______________________________________________________________
>
> ASF would not only have to protect the nodes themselves, but the actual
> memory pointed to by the `node::state' member which is accessed by the
> `read_only_processing_of_state()' procedure.


AFAIU, ASF does solve memory reclamation for readers *provided* that
the whole reader processing is included in the transaction.
Consider:

void traverse()
{
    SPECULATE;
    node** prev = &g_head;
    node* n = LOCK MOV *prev;
    while (n)
    {
        node* next = LOCK MOV n->next;
        RELEASE prev;
        prev = &n->next;
        process(n);
        n = next;
    }
    COMMIT;
}

void remove()
{
    SPECULATE;
    node* n = get_node_for_deletion();
    n->next = 0;
    COMMIT;
    free(n);
}


Assume that the reader is processing node N. When the writer sets N->next
to 0, the reader will *instantly* abort. No chance of catching SIGSEGV.
Maybe there are some minor bugs here, but I believe this scheme is intended
to work in general. I.e. the reader uses some flag as a
present_in_the_list indication, and because of the instant
notifications about modifications, there is no chance of processing an
already removed element. N->next=0 will propagate to the readers before
any side effect of free(N).

Not saying that PDR will be completely useless. PDR may still be
useful to reduce the number of aborts, and PDR has the nice property of
being wait-free: no retries, no aborts, no live-locks. Though ASF may
greatly simplify PDR implementation. For example, at first glance, it's
possible to build virtually zero-overhead SMR (limited number of deferred
objects, usable in user-space).


--
Dmitriy V'jukov

Chris M. Thomasson

May 13, 2009, 9:28:18 AM
> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> news:1d8f34a8-a1a5-497f-b79d->
> e2ba16...@o14g2000vbo.googlegroups.com...

> On 11 May, 08:31, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
> > > BTW, I do believe that ASF can make memory reclamation much easier.
> >
> > Easier in what sense? I mean it does completely solve the internal
> > memory
> > reclaiming issues problem wrt lock-free stacks and queues, but how about
> > node traversal? I don't think that ASF is sufficient enough to handle
> > the
> > reader portion of a PDR algorithm; something like:
> > ______________________________________________________________
> > [...]

> > ______________________________________________________________
> >
> > ASF would not only have to protect the nodes themselves, but the actual
> > memory pointed to by the `node::state' member which is accessed by the
> > `read_only_processing_of_state()' procedure.


> AFAIU, ASF does solve memory reclamation for readers *provided* that
> whole reader processing is included into transaction.

> [...]

> Assume that reader processes node N. When writer will set N->next to
> 0, reader will *instantly* abort. No chance of catching SIGSEGV.
> Maybe here are some minor bugs, but I believe this scheme is intended
> to work in general. I.e. reader uses some flag as a
> present_in_the_list indication, and because of the instant
> notifications about modifications, no chance of processing already
> removed element. N->next=0 will propagate to the readers before any
> side effect of free(N).

Ahh yes. You're 100% correct. Okay, that sounds good; I need to re-read the
paper more closely...


> Not saying that PDR will be completely useless.

At least it won't be useless on future architectures that do not implement
something like ASF...

;^)


> PDR may still be
> useful to reduce number of aborts, PDR has a nice of being wait-free,
> no retries, not aborts, no live-locks.

Indeed. Humm... I do wonder about a speculative region aborting on
interruption. If the reader must traverse through a very large number of
objects, and the per-object processing is a little complex, it seems like it
would open itself up to a fairly wide window for an interrupt to occur.


> Though ASF may greatly simplify
> PDR implementation. For example, at first glance, it's possible to
> build virtually zero-overhead SMR (limited number of deferred objects,
> usability in user-space).

I would be REALLY interested in testing the performance of ASF-based SMR vs.
classic membar-based SMR, and of course Joe Seigh's excellent RCU+SMR. That
would be neat.


Thanks Dmitriy.

Chris M. Thomasson

May 13, 2009, 9:39:35 AM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:JVzOl.61548$ew.4...@newsfe24.iad...

>> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
>> news:1d8f34a8-a1a5-497f-b79d->
>> e2ba16...@o14g2000vbo.googlegroups.com...
>> [...]

>> Not saying that PDR will be completely useless.
>
> At least it won't be useless on future architectures that do not implement
> something like ASF...

Well, I do think that using PDR for readers and ASF for writers would give
the best of both worlds: highly efficient wait-free read-side performance,
and the ability for non-blocking complex writer logic. That does sound
attractive. I think it would be better than using fine-grained locking for
writers... Something like:
____________________________________________________
void read_only_traverse() {
    pdr_region_t* const pdr = pdr_acquire();
    /* perform the read-only traversal */
    pdr_release(pdr);
}


void remove_a() {
    pdr_region_t* const pdr = pdr_acquire();

    SPECULATE;
    node* n = get_node_for_deletion();
    COMMIT;

    pdr_release_and_defer(pdr, n);
}


/* or, perhaps even: */


void remove_b() {
    SPECULATE;
    node* n = get_node_for_deletion();
    COMMIT;

    pdr_defer(n);
}
____________________________________________________


Does that make sense to you?


> ;^)
> [...]


Dmitriy Vyukov

May 13, 2009, 10:01:31 AM
On May 13, 5:28 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Ahh yes. Your 100% correct. Okay, that sounds good.; I need to re-read the
> paper more closely...
>
> > Not saying that PDR will be completely useless.
>
> At least it won't be useless on future architectures that do not implement
> something like ASF...
>
> ;^)


In the bright future we will have ASF on all widespread hardware, and
for everything else we will have heavyweight atomic_ptr/pc_sample just
as a backoff :)

Btw, Windows 7 has an interesting feature which may also simplify
implementation of user-space PDR - UMS (user-mode scheduled) threads.
Basically UMS delivers events to the application about thread blocking/
unblocking. Unfortunately it does not deliver events about thread
time-slice expiration.


> > PDR may still be
> > useful to reduce number of aborts, PDR has a nice of being wait-free,
> > no retries, not aborts, no live-locks.
>
> Indeed. Humm... I do wonder about a speculative region aborting on
> interruption. If the reader must traverse through a very large number of
> objects, and the per-object processing is a little complex, it seems like it
> would open itself up to a fairly wide window for an interrupt to occur.


Note the usage of RELEASE in my example. Basically the reader thread
holds only ONE node at a time. If the writer deletes/modifies/inserts a
node before or after that node, this will NOT cause the reader to abort.
Though in my example the reader does not see the list as a consistent
whole: assume the writer removes an element from the tail and inserts it
at the beginning of the list in a SINGLE transaction; my reader may miss
that element entirely. I believe this is not a problem in many scenarios.
However, if the writer deletes/modifies the node that the reader is
currently processing, this definitely will cause the reader to retry from
the beginning of the list. Probably the algorithm may be somehow
improved...


> > Though ASF may greatly simplify
> > PDR implementation. For example, at first glance, it's possible to
> > build virtually zero-overhead SMR (limited number of deferred objects,
> > usability in user-space).
>
> I would be REALLY interested in testing the performance of ASF-based SMR vs.
> classic membar-based SMR, and of course Joe Seighs excellent RCU+SMR. That
> would be neat.


The cool thing about ASF is that, AFAIU, it does not cause execution of
an MFENCE (AFAIU the main cause of atomic RMW's poor performance
on x86). So an ASF transaction is a kind of bunch of plain loads and
stores from the performance POV, and if all the locations are already
in the core's cache, then I guess performance of the ASF
transaction may be basically equal to that of a bunch of plain loads
and stores.
In SMR all the locations are usually in the core's cache.
If the data structure is read-mostly then once again all the locations
are usually in the core's cache.


--
Dmitriy V'jukov

Chris M. Thomasson

May 13, 2009, 10:50:24 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:c5725202-5663-4c48...@r13g2000vbr.googlegroups.com...

On May 13, 5:28 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > Ahh yes. Your 100% correct. Okay, that sounds good.; I need to re-read
> > the
> > paper more closely...
> >
> > > Not saying that PDR will be completely useless.
> >
> > At least it won't be useless on future architectures that do not
> > implement
> > something like ASF...
> >
> > ;^)


> In the bright future we will have ASF on all widespread hardware, and
> for everything else we will have heavyweight atomic_ptr/pc_sample just
> as an backoff :)

;^D


> Btw, Windows 7 has an interesting feature which may also simplify
> implementation of user-space PDR - UMS (user-mode scheduled) threads.
> Basically UMS delivers to the application events about thread blocking/
> unblocking. Unfortunately it does not deliver events about thread time-
> slice end.

Perhaps. It does not seem that it could be a general-purpose PDR because it
would only work with UMS threads. AFAICT, an application has to specifically
create UMS threads using `CreateRemoteThreadEx()'. Perhaps it would be
better to stick with `FlushProcessWriteBuffers()' or
`NtQuerySystemInformation()'...


> > > PDR may still be
> > > useful to reduce number of aborts, PDR has a nice of being wait-free,
> > > no retries, not aborts, no live-locks.
> >
> > Indeed. Humm... I do wonder about a speculative region aborting on
> > interruption. If the reader must traverse through a very large number of
> > objects, and the per-object processing is a little complex, it seems
> > like it
> > would open itself up to a fairly wide window for an interrupt to occur.


> Note the usage of the RELEASE in my example. Basically reader thread
> holds only ONE node at a time. If writer will delete/modify/insert
> node before or after that node, this will NOT cause reader to abort.
> Though in my example reader does not see consistent list as a whole,
> assume writer removes element from the tail and inserts it into
> beginning of the list in SINGLE transaction, my reader may miss that
> element at all. I believe this is not a problem in many scenarios.
> However if writer deletes/modifies node that reader currently
> processes, this definitely will cause reader to retry from the
> beginning of the list. Probably the algorithm may be somehow
> improved...

I was more concerned with interrupts occurring during length traversals. If
I understand correctly, an interrupt that occurs within a speculative region
will cause it to abort. That is, between a SPECULATE and COMMIT... Not
between a LOCK MOV and RELEASE. What am I missing?


> > > Though ASF may greatly simplify
> > > PDR implementation. For example, at first glance, it's possible to
> > > build virtually zero-overhead SMR (limited number of deferred objects,
> > > usability in user-space).
> >
> > I would be REALLY interested in testing the performance of ASF-based SMR
> > vs.
> > classic membar-based SMR, and of course Joe Seighs excellent RCU+SMR.
> > That
> > would be neat.


> The cool thing about ASF is that AFAIU it is not causing execution of
> the MFENCE (AFAIU the main cause of the atomic RWM's poor performance
> on x86). So ASF transaction is a kind of a bunch of plain loads and
> stores from the performance POV, and if all the locations are already
> in the core's cache, then well I guess performance of the ASF
> transaction may be basically equal to that of a bunch of plain loads
> and stores.
> In SMR all the locations are usually in the core's cache.
> If data-structure in read-mostly then once again all the locations are
> usually in the core's cache.


If ASF-based SMR is just about as good as MFENCE-free SMR, well, that's
great! Humm... Still, I would probably not use ASF-SMR to traverse a
list such that an SMR reference is obtained on each and every node... I
would probably use ASF-SMR to implement a proxy GC. Although that would
remove the bounded-garbage property of SMR... Well, everything has its
tradeoffs.


Question to the ASF Inventors:

Would you not still need to insert an MFENCE instruction in an ASF-based SMR
implementation? Does an MFENCE in a speculative region behave differently
than an MFENCE in a non-speculative region? It seems like the load would
still be able to rise above the store when everything becomes visible during
the COMMIT...

Also, it does seem as if the SPECULATE instruction behaves as a membar of
sorts. I read in section 6.5 that all writes occurring before SPECULATE will
be rendered visible before writes within the speculative region. I would be
interested to see performance of a "naked" speculative region compared to an
SFENCE or MFENCE... Humm, interesting...


I guess that the acquire portion of an ASF-SMR impl might look like this
(membar-free / pseudo-code):
_____________________________________________________________
void* smr_acquire(
    void** pshared_data,
    void** pthread_local
) {
    SPECULATE;
    void* shared_data = LOCK MOV [pshared_data];
    MOV [pthread_local], shared_data;
    void* shared_data_cmp = MOV [pshared_data];
    if (shared_data != shared_data_cmp) ABORT;
    COMMIT;
    return shared_data;
}
_____________________________________________________________


or should it go like:
_____________________________________________________________
void* smr_acquire(
    void** pshared_data,
    void** pthread_local
) {
    SPECULATE;
    void* shared_data = LOCK MOV [pshared_data];
    MOV [pthread_local], shared_data;
    MFENCE;
    void* shared_data_cmp = MOV [pshared_data];
    if (shared_data != shared_data_cmp) ABORT;
    COMMIT;
    return shared_data;
}
_____________________________________________________________

Chris M. Thomasson

May 13, 2009, 11:13:38 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:0TNNl.30255$uD3....@newsfe20.iad...
> "Michael Hohmuth" <Michael...@amd.com> wrote in message [...]
>>[...] Wouldn't you think that would be a bit out of scope for the spec

>> proposal's motivation / use cases / examples section?

Perhaps not entirely. In section 7.2.1 the paper mentions that one cannot
traverse a lock-free LIFO because other threads may be
popping/pushing/deleting nodes concurrently. Well, this is not true if the
threads are using a PDR algorithm to manage the memory.


Also, it would be nice if the paper included an ASF-based SMR
implementation. Not the entire impl, just the acquire and releasing of
shared pointers portion:
________________________________________________________
void*
smr_acquire(
    void** pshared_ptr,
    void** pthread_local
) {
    do {
        *pthread_local = *pshared_ptr;
        if (! *pthread_local) break;
        // full memory barrier
    } while (*pthread_local != *pshared_ptr);
    // acquire memory barrier
    return *pthread_local;
}


void
smr_release(
    void** pthread_local
) {
    // release memory barrier
    *pthread_local = NULL;
}
________________________________________________________
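
For concreteness, the barrier comments in the sketch above can be filled
in with C11 fences. This is a minimal sketch of a single hazard slot,
assuming a seq_cst fence stands in for the "full memory barrier"; it is
an illustration, not code from the paper:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Classic membar-based hazard-pointer acquire: publish the pointer,
 * issue a full fence, then re-check that it is still current. */
void *smr_acquire(_Atomic(void *) *pshared_ptr,
                  _Atomic(void *) *pthread_local) {
    void *p;
    do {
        p = atomic_load_explicit(pshared_ptr, memory_order_relaxed);
        atomic_store_explicit(pthread_local, p, memory_order_relaxed);
        if (!p) break;
        /* full memory barrier: order the hazard publication
         * before the validating re-load */
        atomic_thread_fence(memory_order_seq_cst);
    } while (p != atomic_load_explicit(pshared_ptr, memory_order_relaxed));
    atomic_thread_fence(memory_order_acquire); /* acquire barrier */
    return p;
}

void smr_release(_Atomic(void *) *pthread_local) {
    /* release semantics, then clear the hazard slot */
    atomic_store_explicit(pthread_local, NULL, memory_order_release);
}
```

The seq_cst fence in the loop is exactly the per-acquire cost that the
ASF-based variants discussed above would hope to avoid.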


> Yeah. Now that I think about it, your probably right. However, it might be
> useful if the paper shows how one can use ASF alongside PDR. For instance,
> one can use ASF to allow writer threads to insert and remove nodes from
> complex tree algorithms while using PDR to allow reader threads to perform
> highly efficient concurrent traversals.

> [...]

Writing about mixing PDR and ASF... Is it okay for CPU A to load from a
shared location S using normal MOV instruction from outside a speculative
region, and CPU B to concurrently store and/or load from S while inside a
speculative region?

EricP

May 13, 2009, 12:05:01 PM
Chris M. Thomasson wrote:
>
> Is it okay for CPU A to load from a
> shared location S using normal MOV instruction from outside a
> speculative region, and CPU B to concurrently store and/or load from S
> while inside a speculative region?

No, not ok. As per table 6.2.1, if cpu B has performed a protected
write to one or more cache lines, anyone accessing that line
triggers B to abort. That includes normal read, write and prefetch.

(I assume this also includes the automatic prefetches that
the memory subsystem does of its own volition.
I wonder if that has an impact...?)

Eric

EricP

May 13, 2009, 12:22:35 PM
Dmitriy Vyukov wrote:
> On May 13, 5:28 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>> Indeed. Humm... I do wonder about a speculative region aborting on
>> interruption. If the reader must traverse through a very large number of
>> objects, and the per-object processing is a little complex, it seems like it
>> would open itself up to a fairly wide window for an interrupt to occur.
>
>
> Note the usage of the RELEASE in my example. Basically reader thread
> holds only ONE node at a time. If writer will delete/modify/insert
> node before or after that node, this will NOT cause reader to abort.
> Though in my example reader does not see consistent list as a whole,
> assume writer removes element from the tail and inserts it into
> beginning of the list in SINGLE transaction, my reader may miss that
> element at all. I believe this is not a problem in many scenarios.
> However if writer deletes/modifies node that reader currently
> processes, this definitely will cause reader to retry from the
> beginning of the list. Probably the algorithm may be somehow
> improved...

No, the RELEASE doesn't save you from interrupts; it *MAY*,
as an implementation-dependent feature, save you from collisions
on cache lines that you no longer reference.
A sufficiently long list would always get a timer interrupt
which aborts either a reader or writer back to the beginning.

Which means no matter what, the SPECULATE...COMMIT region
must be small and any list or tree walking done iteratively
with additional mechanisms for ensuring referenced objects
don't suddenly disappear.

Eric

Dmitriy Vyukov

May 13, 2009, 12:24:54 PM
On May 13, 6:50 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > Btw, Windows 7 has an interesting feature which may also simplify
> > implementation of user-space PDR - UMS (user-mode scheduled) threads.
> > Basically UMS delivers to the application events about thread blocking/
> > unblocking. Unfortunately it does not deliver events about thread time-
> > slice end.
>
> Perhaps. It does not seem that it could be a general-purpose PDR because it
> would only work with UMS threads. AFAICT, an application has to specifically
> create UMS threads using `CreateRemoteThreadEx()'. Perhaps it would be
> better to stick with `FlushProcessWriteBuffers()' or
> `NtQuerySystemInformation()'...


I meant UMS as a means to automatically release PDR in the case of
blocking calls.


> > > > PDR may still be
> > > > useful to reduce number of aborts, PDR has a nice of being wait-free,
> > > > no retries, not aborts, no live-locks.
>
> > > Indeed. Humm... I do wonder about a speculative region aborting on
> > > interruption. If the reader must traverse through a very large number of
> > > objects, and the per-object processing is a little complex, it seems
> > > like it
> > > would open itself up to a fairly wide window for an interrupt to occur.
> > Note the usage of the RELEASE in my example. Basically reader thread
> > holds only ONE node at a time. If writer will delete/modify/insert
> > node before or after that node, this will NOT cause reader to abort.
> > Though in my example reader does not see consistent list as a whole,
> > assume writer removes element from the tail and inserts it into
> > beginning of the list in SINGLE transaction, my reader may miss that
> > element at all. I believe this is not a problem in many scenarios.
> > However if writer deletes/modifies node that reader currently
> > processes, this definitely will cause reader to retry from the
> > beginning of the list. Probably the algorithm may be somehow
> > improved...
>
> I was more concerned with interrupts occurring during length traversals. If
> I understand correctly, an interrupt that occurs within a speculative region
> will cause it to abort. That is, between a SPECULATE and COMMIT... Not
> between a LOCK MOV and RELEASE. What am I missing?


Hmmm... I missed the word 'interrupts'. I think you are right... It
seems that lengthy ASF transactions will be problematic even w/o
contention...


> > > > Though ASF may greatly simplify
> > > > PDR implementation. For example, at first glance, it's possible to
> > > > build virtually zero-overhead SMR (limited number of deferred objects,
> > > > usability in user-space).
>
> > > I would be REALLY interested in testing the performance of ASF-based SMR
> > > vs.
> > > classic membar-based SMR, and of course Joe Seighs excellent RCU+SMR.
> > > That
> > > would be neat.
> > The cool thing about ASF is that AFAIU it is not causing execution of
> > the MFENCE (AFAIU the main cause of the atomic RWM's poor performance
> > on x86). So ASF transaction is a kind of a bunch of plain loads and
> > stores from the performance POV, and if all the locations are already
> > in the core's cache, then well I guess performance of the ASF
> > transaction may be basically equal to that of a bunch of plain loads
> > and stores.
> > In SMR all the locations are usually in the core's cache.
> > If data-structure in read-mostly then once again all the locations are
> > usually in the core's cache.
>
> If ASF-based SMR is just about as good as MFENCE-free SMR, well, that's
> great! Humm... Still, I would still probably not use ASF-SMR to traverse a
> list such that a SMR reference is obtained on each and every node... I would
> probably use ASF-SMR to implement proxy-gc. Although, that would remove the
> bounded garbage property of SMR... Well, everything has its tradeoffs.


If you are traversing a lengthy list and doing some processing per
node then I think plain atomic_ptr/pc_sample will work reasonably. I
hope that ASF will give us means to implement fine-grained PDR with
bounded garbage and low overhead, usable from user-space w/o
caveats... Am I asking for too much? :)


I've thought a bit more about ASF-powered SMR and realized that there is
something I don't understand.
The reload and comparison look unnecessary in your first example,
because values can't change inside a transaction. However, then the
writer must remove a node and check all hazard pointers inside a single
transaction. Not sure now how to implement this...


--
Dmitriy V'jukov

Dmitriy Vyukov

May 13, 2009, 12:27:23 PM
On May 13, 7:13 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> Writing about mixing PDR and ASF... Is it okay for CPU A to load from a
> shared location S using normal MOV instruction from outside a speculative
> region, and CPU B to concurrently store and/or load from S while inside a
> speculative region?


The specification states that it's Ok. However if CPU B made
speculative store to S, and then CPU A made normal load from S,
transaction on CPU A will be aborted.

--
Dmitriy V'jukov

Dmitriy Vyukov

May 13, 2009, 12:58:12 PM


Yes, I completely missed the point with interrupts. I am still
thinking in the old terms: no contention, no problems :)

Btw, I am going to post some totally crazy idea about ASF which tries
to take advantage of aborts on interrupts! Not sure whether it will
work at all...

--
Dmitriy V'jukov

Chris M. Thomasson

May 13, 2009, 1:24:59 PM
"EricP" <ThatWould...@thevillage.com> wrote in message
news:46dd5$4a0aefad$45c49ea8$11...@TEKSAVVY.COM...

> Chris M. Thomasson wrote:
>>
>> Is it okay for CPU A to load from a shared location S using normal MOV
>> instruction from outside a speculative region, and CPU B to concurrently
>> store and/or load from S while inside a speculative region?
>
> No, not ok. As per table 6.2.1, if cpu B has performed a protected
> write to one or more cache lines, anyone accessing that line
> triggers B to abort. That includes normal read, write and prefetch.

This is not good because it could live-lock writer threads. I don't really
see why CPU B has to be aborted. After all, if a CPU loads from a currently
protected location outside of a region, well, it should be assumed that the
CPU does not really care if it loads stale data wrt concurrent
transactions. Also, since the operation is a load, no data is getting
mutated, which means that the correctness of the algorithms being performed
by concurrent transactions will not be adversely affected. Sometimes it's
perfectly okay for readers to observe stale data. Sometimes the readers
don't even need to observe the writers' actions as a singular atomic
operation.


> (I assume the also includes the automatic prefetches that
> the memory subsystem does of its own volition.
> I wonder if that has an impact...?)

Humm...

Chris M. Thomasson

May 13, 2009, 1:28:24 PM

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:062b4193-2452-47b4...@r3g2000vbp.googlegroups.com...

You mean that CPU B will be aborted right? Keep in mind that CPU A is using
a normal MOV instruction _outside_ of a region to load from a shared
location that is currently protected in CPU B's region.

Dmitriy Vyukov

May 13, 2009, 1:37:11 PM
On May 13, 8:58 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> Btw, I am going to post some totally crazy idea about ASF which tries
> to take advantage of aborts on interrupts! Not sure whether it will
> work at all...


Ok, here is that crazy idea. HOLY COW!

Can't ASF be used to detect execution of blocking calls and as an
async periodic timer? Just what is needed for user-space PDR!

Assume that we have a heavyweight PDR implementation (like pc_sample):

typedef void* pdr_handle;
pdr_handle pdr_acquire();
void pdr_release(pdr_handle);

The straightforward usage is:
int hash_map_search(...)
{
    pdr_handle h = pdr_acquire();
    int res = hash_map_search_impl(...);
    pdr_release(h);
    return res;
}

However this is frequently too costly.

The idea is to acquire a PDR handle at the beginning of
hash_map_search(), then start an ASF transaction and NOT release the PDR
handle at the end of hash_map_search(). Just to make it clear -
there is no speculative activity (LOCK MOV, LOCK PREFETCH), just the
transaction started.
Then 2 variants are possible:
1. When the thread calls hash_map_search() next time the transaction
is still valid, so the thread can just execute hash_map_search_impl() -
we are still in the protected region. No overheads related to PDR in
this case.
2. The transaction aborts before the next call to hash_map_search()
(because of an interrupt or a blocking call - I guess sysenter will
abort the trx). On abort we release the PDR handle and resume execution
from the point where it was aborted. So next time the thread executes
hash_map_search(), it will reacquire the PDR handle.

Here is crude code sketch:

struct thread_desc
{
    int pdr_acquired;
    int inside_of_pdr_region;
    int was_abort;
    pdr_handle handle;
};

__declspec(thread) thread_desc t_desc;

int hash_map_search(...)
{
    thread_desc* d = &t_desc;
    assert(d->inside_of_pdr_region == 0);
    d->inside_of_pdr_region = 1;
    if (d->pdr_acquired == 0)
    {
        d->was_abort = 0;
reenter_trx:
        RAX = SPECULATE;
        if (RAX)
        {
            if (t_desc.pdr_acquired == 1
                && t_desc.inside_of_pdr_region == 0
                && (ERROR_REASON(RAX) == ILLEGAL_INSTRUCTION
                    || ERROR_REASON(RAX) == INTERRUPT))
            {
                pdr_release(t_desc.handle);
                t_desc.pdr_acquired = 0;
                JMP ABORT_ADDR;
            }
            else
            {
                t_desc.was_abort = 1;
                goto reenter_trx;
            }
        }
        if (d->was_abort)
        {
            JMP ABORT_ADDR;
        }
        t_desc.handle = pdr_acquire();
        t_desc.pdr_acquired = 1; // was "= 0" - must mark the handle as held
    }
    int res = hash_map_search_impl(...);
    d->inside_of_pdr_region = 0;
    return res;
}


There are definitely some stupid errors in all this logic, and the
abort handler must be carefully coded in asm. But I want to hear from
the ASF developers and hardcore thread monkeys - do you think the idea
will work? Will the "JMP ABORT_ADDR" work as I think it does? I.e.
just resume execution from the abort address. Since no registers/
memory are modified during the abort, I think/hope it must work.

Some additional notes.
Other parts of the program must not use ASF.
The abort handler may release the PDR handle not on the first
interrupt but on, let's say, the 10th interrupt.
If the thread is preempted for a long time while holding a PDR handle,
then yes, the handle will be released only after the thread is
scheduled next time.

ARRRRGH! What do you think?

--
Dmitriy V'jukov

Dmitriy Vyukov

May 13, 2009, 1:44:43 PM
On May 13, 9:28 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

That's what I was saying everywhere:
http://groups.google.com/group/comp.programming.threads/msg/a488029494a21e97
http://forums.amd.com/devforum/messageview.cfm?catid=203&threadid=112333&highlight_key=y

Mitch said that this would complicate COMMIT and make it perform
cache-line transfers. They want the CPU to hold all locked cache lines
in the cache.


--
Dmitriy V'jukov

Chris M. Thomasson

May 13, 2009, 1:57:05 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:e2f7bd60-615a-4c33...@r34g2000vbi.googlegroups.com...

On May 13, 9:28 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message
> >
> > news:062b4193-2452-47b4...@r3g2000vbp.googlegroups.com...
> > On May 13, 7:13 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> >
> > > > Writing about mixing PDR and ASF... Is it okay for CPU A to load
> > > > from a
> > > > shared location S using normal MOV instruction from outside a
> > > > speculative
> > > > region, and CPU B to concurrently store and/or load from S while
> > > > inside
> > > > a
> > > > speculative region?
> > > The specification states that it's Ok. However if CPU B made
> > > speculative store to S, and then CPU A made normal load from S,
> > > transaction on CPU A will be aborted.
> >
> > You mean that CPU B will be aborted right? Keep in mind that CPU A is
> > using
> > a normal MOV instruction _outside_ of a region to load from a shared
> > location that is currently protected in CPU B's region.

I completely agree with you. It seems a bit "insane" to have CPU B abort
because of a plain load from CPU A.


> Mitch said that this will complicate COMMIT and let it make cache-line
> transfers. They want CPU to hold all locked cache lines in the cache.

Damn! There has to be a way to allow for a normal plain load from CPU A to
abort CPU B's speculative region. Grrr...

Chris M. Thomasson

May 13, 2009, 1:58:14 PM
"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:IRDOl.52002$bi7....@newsfe07.iad...
> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message [...]
>> [...]

>> Mitch said that this will complicate COMMIT and let it make cache-line
>> transfers. They want CPU to hold all locked cache lines in the cache.
>
> Damn! There has to be a way to allow for a normal plain load from CPU A to
> abort CPU B's speculative region. Grrr...


Ummm... I meant to say:

Damn! There has to be a way to allow for a normal plain load from CPU A to

NOT abort CPU B's speculative region. Grrr..


;^)

EricP

May 13, 2009, 2:39:25 PM

Yes, I mentioned that in an earlier msg. The memory locations
don't have to be shared across cpus to be protected.

The OS might have multiple memory locations that are read by
the main OS and updated by an interrupt routine.
Guarding the reads in a speculate...commit block ensures you
cannot see partial updates and avoids an expensive interrupt disable.
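A software analogue of that point (a sketch, not ASF): a sequence counter lets the main-OS reader see a consistent snapshot of several fields updated by a single interrupt-context writer, without disabling interrupts. The names are made up, and volatile/memory barriers are omitted for brevity:

```c
#include <assert.h>

/* Sequence-counter snapshot: even seq = stable, odd = update in
 * progress.  One writer (the interrupt routine), any readers. */

struct stats {
    unsigned seq;
    long a, b;          /* the multiple protected locations */
};

/* Writer side (interrupt context). */
void stats_update(struct stats *s, long a, long b)
{
    s->seq++;           /* now odd: update in progress */
    s->a = a;
    s->b = b;
    s->seq++;           /* even again: stable */
}

/* Reader side: retry until a stable, unchanged sequence is observed,
 * so partial updates are never returned. */
void stats_read(const struct stats *s, long *a, long *b)
{
    unsigned seq0;
    do {
        seq0 = s->seq;
        *a = s->a;
        *b = s->b;
    } while ((seq0 & 1) || seq0 != s->seq);
}
```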

In user mode, you could abort on a task switch or any other kernel
transition that occurred during a speculate block.
The only practical use I can think of for such might be for
benchmarks to filter out interrupts and other task times.
Note though that, strangely, the RDTSC instr. is not allowed
inside a speculate block.

Eric


MitchAlsup

May 13, 2009, 2:52:49 PM
On May 13, 12:58 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Chris M. Thomasson" <n...@spam.invalid> wrote in messagenews:IRDOl.52002$bi7....@newsfe07.iad...
>
> > "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message [...]

> >> [...]
> >> Mitch said that this will complicate COMMIT and let it make cache-line
> >> transfers. They want CPU to hold all locked cache lines in the cache.
>
> > Damn! There has to be a way to allow for a normal plain load from CPU A to
> > abort CPU B's speculative region. Grrr...
>
> Ummm... I meant to say:
>
> Damn! There has to be a way to allow for a normal plain load from CPU A to
> NOT abort CPU B's speculative region. Grrr..

What you want, here, is the ability to perform a READ_ONCE on [A].

READ_ONCE reads the value, but leaves write permission where it
already is, so if an ASF event already has write permission, it
retains write permission, while giving the reader the old copy of the
data.

In theory, this works just fine (assuming you have access to
READ_ONCE); however, in practice, the line that gets transferred to
the READ_ONCE is the original incarnation of the line, so the CPU
holding that line may have to reconstruct the old image before
transmitting, and then reassemble the new image to proceed with the
ASF event in progress. These are not difficult, but are cumbersome.
They are easier in a HT environment, where only 1 memory ref is
active to a given cache line at a time, than in the QP environment,
where several can be in progress simultaneously.

Mitch

Dmitriy Vyukov

May 13, 2009, 3:27:27 PM
On May 13, 10:39 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:

> In user mode, you could abort on a task switch or any other kernel
> transition that occurred during a speculate block.
> The only practical use I can think of for such might be for
> benchmarks to filter out interrupts and other task times.
> Note though that, strangely, the RDTSC instr. is not allowed
> inside a speculate block.

Here it is:
http://groups.google.com/group/comp.programming.threads/msg/7c5b08fc7761c917

Schematically: a thread allocates a large memory block, then frees it.
During free we have 2 options. (1) We may honestly return the block to
the OS. Drawback: if the thread requests a large memory block again
soon, we will have to call the OS again - too costly. (2) We may defer
the block in a thread-private cache. Drawback: if the thread does not
request a large memory block again soon and does not call into the
memory allocator at all in the near future, the block will leak.
With ASF I am trying to do the following. We defer the memory block,
but get an async notification from ASF in the form of an abort on
interrupt/syscall/etc; if we see that the thread does not reuse the
memory block soon, we return it to the OS.
ASF is used as a kind of async UNIX signal which is automatically and
periodically triggered on interrupts/syscalls/etc.
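The scheme can be sketched in plain C (all names invented for illustration; the epoch notification here is just a function you would call from the abort handler, or from a portable fallback such as a timer signal):

```c
#include <assert.h>
#include <stdlib.h>

/* One-slot thread-local cache for a large block: free() defers the
 * block so an immediate re-allocation can reuse it; a later "epoch"
 * notification flushes it back to the OS. */

static __thread void  *t_cached_block = 0;
static __thread size_t t_cached_size  = 0;

void big_free(void *p, size_t sz)
{
    if (t_cached_block)            /* cache already full: really free */
        free(t_cached_block);
    t_cached_block = p;            /* defer this block instead */
    t_cached_size  = sz;
}

void *big_alloc(size_t sz)
{
    if (t_cached_block && t_cached_size >= sz) {
        void *p = t_cached_block;  /* fast path: reuse deferred block */
        t_cached_block = 0;
        return p;
    }
    return malloc(sz);
}

/* Called from the async notification: the thread was interrupted or
 * blocked, so it is not reusing the block "soon" - return it to the OS. */
void big_epoch_flush(void)
{
    free(t_cached_block);          /* free(NULL) is a no-op */
    t_cached_block = 0;
}
```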

--
Dmitriy V'jukov

Dmitriy Vyukov

May 13, 2009, 3:34:36 PM


Yikes! It's a kind of async UNIX signal which is triggered
automatically and *periodically* on interrupts/syscalls/etc.

We can do a whole lot of stuff with this:

void our_thread_func_wrapper(void* p)
{
    RAX = SPECULATE;
    if (RAX)
    {
        switch (what_we_have_to_do())
        {
        case execute_periodic_memory_fence:
            MFENCE;
            note_fence_executed_by_thread();
            break;
        case check_for_excessive_private_memory_caching:
            flush_some_private_memory();
            break;
        case drop_pdr_handle:
            release_pdr_handle();
            break;
        case ADD_YOUR_STUFF_HERE:
            break;
        }
        REENTER_TRX_AND_RESUME_EXECUTION;
    }
    user_thread_func(p);
    COMMIT;
}

Looks too good to be true...

--
Dmitriy V'jukov

Chris M. Thomasson

May 13, 2009, 3:37:28 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:154db8a0-da2d-4119...@q2g2000vbr.googlegroups.com...
> [...]

I really need to think about this. I did not think about using an
abort on interrupt as a synchronization epoch detection mechanism.
Clever.


It seems like you're not only deferring the call to `pdr_release()'
until an "abort on interrupt" condition arises, but you're also not
making a call to COMMIT. Did you leave the call to COMMIT out by
mistake? Because it seems like it could possibly be problematic. Think
of something like:


void worker_thread() {
for (;;) {
item_t* item = hash_map_search(...);
if (item) {
do_lengthy_processing(item);
do_some_more_lengthy_processing(item);
perhaps_some_more_processing(item);
}
}
}

If the speculative region can extend beyond the scope of
`hash_map_search()', then the
`do_lengthy_processing()/do_some_more_lengthy_processing()/perhaps_some_more_processing()'
calls will essentially be included in the region as well. If an
interrupt occurs during one of those processing functions, then all of
a sudden the program flow jumps back into `hash_map_search()' and it
will have to be executed all over again. This seems like a major
opportunity for live-lock to occur.

I know I am missing something important from your idea. Humm... Please
enlighten me!

:^)


Now, if an abort on interrupt could somehow automatically jump to a handler
and resume at the point of abortion, well, that could work.

Dmitriy Vyukov

May 13, 2009, 3:40:10 PM


Mitch, can you provide some comments on whether it's possible to make
it work?
AFAIR, the abort address is accessible to software. If the transaction
does not actually execute any speculative code, then nothing will be
rolled back on abort. Registers will be intact. Everything executed
inside the transaction will be treated as normal code.
So the question is: is it possible to restart the transaction and just
jump to the abort address in the abort handler?


--
Dmitriy V'jukov

Chris M. Thomasson

May 13, 2009, 3:40:22 PM

"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:1d2af276-665a-4238...@r34g2000vbi.googlegroups.com...

On May 13, 9:37 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 13, 8:58 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> [...]


> Yikes! It's a kind of async UNIX signal which is triggered
> automatically and *periodically* on interrupts/syscalls/etc.

> We can do a whole lot of stuff with this:

> void our_thread_func_wrapper(void* p)
> {
> RAX = SPECULATE;
> if (RAX)
> {
> switch (what_we_have_to_do())
> {
> case execute_periodic_memory_fence:
> MFENCE;
> note_fence_executed_by_thread();
> break;
> case check_for_excessive_private_memory_caching:
> flush_some_private_memory();
> break;
> case drop_pdr_handle:
> release_pdr_handle();
> break;
> case ADD_YOUR_STUFF_HERE:
> break;
> }
> REENTER_TRX_AND_RESUME_EXECUTION;
> }
> user_thread_func(p);
> COMMIT;
> }

> Looks too good to be true...


If ASF allows one to resume at the point of abortion, then it should work
fine.

Chris M. Thomasson

May 13, 2009, 3:43:16 PM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:8f114aa1-4a91-43b9...@b1g2000vbc.googlegroups.com...

On May 13, 11:34 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> > On May 13, 9:37 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> > [...]

> > Looks too good to be true...


> Mitch, can you provide some comments on whether it's possible to make
> it work?

I hope so. I could implement vZOOM with this. It would be great.

Dmitriy Vyukov

May 13, 2009, 3:57:18 PM
On May 13, 11:40 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> Mitch, can you provide some comments on whether it's possible to make
> it work?
> AFAIR, abort address is accessible by software. If transaction does
> not actually execute any speculative code then nothing will be rolled-
> back on abort. Registers will be intact. Everything executed inside of
> transaction will be treated as normal code.
> So the question is: is it possible to restart transaction and just
> jump to the abort address in the abort handler?


Well, it seems that ASF does NOT support "resume". On abort, RIP is
saved in the ASF_EXCEPTION_IP MSR; however, RSP and EFLAGS are LOST,
so we can't resume :(
It would be nice if ALL state were saved on abort...


--
Dmitriy V'jukov

Dmitriy Vyukov

May 13, 2009, 4:01:53 PM
On May 13, 11:34 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:

> We can do a whole lot of stuff with this:
>
> void our_thread_func_wrapper(void* p)
> {
>     RAX = SPECULATE;
>     if (RAX)
>     {
>         switch (what_we_have_to_do())
>         {
>         case execute_periodic_memory_fence:
>             MFENCE;
>             note_fence_executed_by_thread();
>             break;
>         case check_for_excessive_private_memory_caching:
>             flush_some_private_memory();
>             break;
>         case drop_pdr_handle:
>             release_pdr_handle();
>             break;
>         case ADD_YOUR_STUFF_HERE:
>             break;
>         }
>         REENTER_TRX_AND_RESUME_EXECUTION;
>     }
>     user_thread_func(p);
>     COMMIT;
>
> }
>
> Looks too good to be true...


If we hit a disallowed instruction then we can't resume execution
inside the transaction (we would hit the same error again and again),
so in general we have to resume execution outside of the transaction
(for interrupts/syscalls/farcalls it is probably Ok to resume inside
the transaction).
So this mechanism can be used only as a one-shot signal. I.e. it's
possible to drop a cached PDR reference or flush excessive memory, but
not possible to execute periodic memory fences... Ok, this probably
does not matter anymore since ASF does not support execution resume...


--
Dmitriy V'jukov

Dmitriy Vyukov

May 13, 2009, 4:03:16 PM
On May 13, 11:37 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message

Indeed. Nothing will be rolled back during an abort since that code is
not speculative. So my intention was to "resume execution" from the
abort point.

--
Dmitriy V'jukov

EricP

May 13, 2009, 7:31:23 PM

Do you mean that READ_ONCE does not cache the line on CPU A
(the reader)?

If so then it would allow a reader to flood the bus with
read transactions. I can see this "unsafe read" operation being
useful in spin-waits.

If not then the modified line transitions to an "Owned"
(updated, shared) state and either the invalidates must be
sent on Commit (if the write buffer holds the new version) or
on Abort (if the write buffer holds the original version).

Either way, you have to deal with that
'holding multiple lines in limbo waiting for many acks' problem.
If we are talking about up to, say, 256 cpus
and 8 cache lines, that could take a while.

Eric

MitchAlsup

May 14, 2009, 2:08:06 PM

No, it is neither nice to save state on abort, nor possible to reenter
an atomic region after an abort; unless you so mangle the definition
of atomic as to make it completely useless.

A) Where would you put this state where absolutely nobody could see it
(i.e. remain atomic-smelling), including other threads, OS and VM
services?
B) What guarantees can you provide that the data structure(s) being
manipulated will be the same once you get back from the abort? (A:
low)
C) What if the OS or VM needs to access the same data structure while
you are between the abort and reentering? (A: Screwed)

When we looked at this, the conclusion we came to is that it is simply
better to throw away everything you understand about the state of the
CDS and start over. This prevents having to invent a new place to dump
state (and makes the OS people happy), and enables the OS or VM to be
essentially unconcerned with what you (the application) may be doing
at any given moment in time--even if the OS or VM need to perform
similar kinds of accesses to potentially shared CDSs.

Atomic means that nobody can see any intermediate state. ASF is
designed under the train of thought that you want to perform many
millions of alterations to a concurrent data structure per second.
Basically this means that if you blink your eyes, the CDS has changed
under your nose. Having as little state as possible to support said
event means that you basically have to punt and back up if anything
prevents your successful completion of the event at hand.

D) What is the probability that this saved and restored state is
completely useless after returning? (A: high)

The original ASF provided an indicator to inform SW WHY the abort path
was taken, and this list included {interrupt, exception, spurious,
interference} so SW could make an educated and rational decision as to
what to do next. Apparently the current ASF does not trust SW writers
to be sufficiently educated or rational.

The original ASF was not designed to be sufficient for writing
transactions. It was designed to be sufficient for writing complicated
ATOMIC primitives such as BOOLEAN cursor::DCAS(a,b,c,d,e,f) { } or
BOOLEAN timestamped::CDS_move(from, to, element) { }. And these
procedures retain the property that the amount of code inside the
atomic critical section is "as small as possible". Transactions are a
higher level abstraction that can be supported by ASF and ASF-like
procedures, but do not have the property that what is inside the
critical section is "as small as possible".
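For concreteness, here is a sequential model of the DCAS flavor mentioned above (plain C, not real ASF code); under ASF the body would sit between SPECULATE and COMMIT so that both locations are inspected and updated atomically:

```c
#include <assert.h>

/* Sequential model of a double compare-and-swap: update two independent
 * words only if both still hold their expected values.  Under ASF the
 * two loads and two stores would be enclosed in SPECULATE...COMMIT;
 * here the atomicity is only implied, not enforced. */

int dcas(long *a, long cmp_a, long new_a,
         long *b, long cmp_b, long new_b)
{
    if (*a != cmp_a || *b != cmp_b)
        return 0;       /* interference: caller backs up and retries */
    *a = new_a;         /* the two stores must become */
    *b = new_b;         /* visible together           */
    return 1;
}
```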

Mitch

nm...@cam.ac.uk

May 14, 2009, 3:13:01 PM
In article <80f45c69-403d-4210...@g20g2000vba.googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>
>No, it is neither nice to save state on abort, nor possible to reenter
>an atomic region after an abort; unless you so mangle the definition
>of atomic to be completely useless.

Whether you are talking about hardware or software ....

>The original ASF was not designed to be sufficient for writing
>transactions. It was designed to be sufficient for writing complicated
>ATOMIC primatives such as BOOLEAN cursor::DCAS(a,b,c,d,e,f) { } or
>BOOLEAN timestamped::CDS_move(from, to, element) { }. And these
>procedures retain the property that they amount of code inside the
>atomic critical section is "as small as possible". Transactions are a
>higher level abstraction that can be supported by ASF and ASF-like
>procedures, but do not have the property that what is inside the
>critical secton is "as small as possible".

Yeah. As you know, I believe that it would be a good idea to move
to the higher level, but I am very well aware of the consequences.
Inter alia, it would need MUCH more disciplined programming language
paradigms. You can't get there starting from the existing junk.


Regards,
Nick Maclaren.

Chris M. Thomasson

May 15, 2009, 7:10:39 AM
"Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
news:7bee054e-5939-40fd...@r13g2000vbr.googlegroups.com...

On May 13, 6:50 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > Btw, Windows 7 has an interesting feature which may also simplify
> > > implementation of user-space PDR - UMS (user-mode scheduled) threads.
> > > Basically UMS delivers to the application events about thread
> > > blocking/
> > > unblocking. Unfortunately it does not deliver events about thread
> > > time-
> > > slice end.
> >
> > Perhaps. It does not seem that it could be a general-purpose PDR because
> > it
> > would only work with UMS threads. AFAICT, an application has to
> > specifically
> > create UMS threads using `CreateRemoteThreadEx()'. Perhaps it would be
> > better to stick with `FlushProcessWriteBuffers()' or
> > `NtQuerySystemInformation()'...


> I meant UMS as a mean to automatically release PDR in the case of
> blocking calls.

Yes. It would certainly provide the means to do that.

Indeed. This has me wary of using ASF for iterating through a large
linked data-structure.

=^)

All we need is a means for user-space code to detect per-cpu membar
usage. Not sure if this needs to be in the hardware, as existing
operating systems already provide the means of doing this. Although,
it's all hacks! Well, except for the Windows Vista and above
`FlushProcessWriteBuffers()' API procedure.

Humm... That is a problem. The darn non-blocking algorithm would need
to incorporate the SMR scanning phase into the transaction. Well,
perhaps not... But then, I think the MFENCE would be needed in the
transaction to deal with its "weakly" ordered memory visibility issues
wrt the way it presents loads and stores to any observers as a
singular atomic operation.

Also, wrt creating virtually zero-overhead SMR with ASF... I am
wondering about the performance implications of the `SPECULATE'
instruction. It does indeed sound like it behaves as a membar of
sorts. (e.g. SFENCE?)... Humm... Perhaps using hackish automatic
synchronization epoch detection is not so bad after all?

;^o

Chris M. Thomasson

May 15, 2009, 7:18:46 AM

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:A4cPl.29282$%_2.1...@newsfe04.iad...

> "Dmitriy Vyukov" <dvy...@gmail.com> wrote in message
> [...]

>> If you are traversing lengthy list and doing some processing per every
>> node then I think plain atomic_ptr/pc_sample will work reasonably. I
>> hope that ASF will give us means to implement fine-grained PDR with
>> bounded garbage and low overhead and usable from user-space w/o
>> caveats... Am I want too much? :)
>
> =^)
>
> All we need is a means of user-space code detecting per-cpu membar usage.
> Not sure of this needs to be in the hardware as existing operating systems
> already provide the means of doing this. Although, its all hacks! Well,
> except for Windows Vista and above `FlushProcessWriteBuffers()' API
> procedure.
>

Although, that alone does not address putting an upper bound on
accumulated garbage. Even Joe Seigh's RCU+SMR does not have this
property. Humm...

Dmitriy Vyukov

May 15, 2009, 8:48:34 AM
On May 15, 3:10 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > If you are traversing lengthy list and doing some processing per every
> > node then I think plain atomic_ptr/pc_sample will work reasonably. I
> > hope that ASF will give us means to implement fine-grained PDR with
> > bounded garbage and low overhead and usable from user-space w/o
> > caveats... Am I want too much? :)
>
> =^)
>
> All we need is a means of user-space code detecting per-cpu membar usage.
> Not sure of this needs to be in the hardware as existing operating systems
> already provide the means of doing this. Although, its all hacks! Well,
> except for Windows Vista and above `FlushProcessWriteBuffers()' API
> procedure.


What bothers me regarding user-space PDR is the high-overhead+bounded-
garbage (SMR, pc_sample) vs low-overhead+unbounded-garbage (RCU,
amortized proxy-collector) trade-off. I like the amortized proxy-
collector, however it can create too much garbage if a thread is
waiting for console input inside an amortized proxy-collected region.
What I need to make it bullet-proof (low-overhead+bounded-garbage) is
detection of thread blocking, detection of thread time-slice end, and
some periodic async callbacks.
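One crude way to get at least a hard garbage bound without any of those OS facilities (my addition, not from the thread) is to cap the deferral list and pay for the expensive quiescence path only when the cap is hit:

```c
#include <assert.h>

/* Cap the per-thread deferral list at GARBAGE_BOUND nodes; past that
 * the thread falls back to the slow quiescence path.  This trades the
 * async-callback requirement for a hard bound enforced at defer time.
 * flush_garbage() is a stub standing in for "quiesce, then free all". */

enum { GARBAGE_BOUND = 8 };

static void *deferred[GARBAGE_BOUND];
static int   n_deferred = 0;
static int   n_flushes  = 0;      /* counts slow-path invocations */

static void flush_garbage(void)
{
    n_deferred = 0;               /* real code: wait for quiescence,
                                     then free every deferred node */
    n_flushes++;
}

void defer_free(void *p)
{
    if (n_deferred == GARBAGE_BOUND)
        flush_garbage();          /* bound reached: pay the cost now */
    deferred[n_deferred++] = p;
}
```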

> > The reload and comparison look unnecessary in your first example,
> > because values can't change inside of transaction. However then writer
> > must remove a node and check all hazard pointers inside of single
> > transaction. Not sure now how to implement this...
>
> Humm... That is a problem. The darn non-blocking algorihtm would need to
> incorporate SMR scanning phase into transaction. Well perhaps not... But
> then, I think the MFENCE would be needed in the transaction to deal with its
> "weakly" ordered memory visibility issues wrt the way it presents loads and
> stores to any observers as a singular atomic operation.
>
> Also, wrt creating virtually zero-overhead SMR with ASF... I am wondering
> about the performance implications of the `SPECULATE' instruction. It does
> indeed sound like it behaves of a membar of sorts. (e.g. SFENCE?)... Humm...
> Perhaps using hackish automatic synchronization epoch detection is not so
> bad after all?


As I understand from the specs, SPECULATE does not issue any memory
barriers. There are just the normal x86 rules - loads are acquire,
stores are release - plus the guarantee that the transaction will
appear atomic to other transactions. So I guess a canonical
implementation of ASF SMR would be:

void* smr_dereference(void** hp, void** pp)
{
    void* p;
retry:
    SPECULATE;
    if (RAX) goto retry;
    p = *pp;
    *hp = p;
    COMMIT;
    return p; // was missing - the dereferenced pointer must be returned
}

void remove(void** pp)
{
    void* p;
    int quiescent;
retry:
    SPECULATE;
    if (RAX) goto retry;
    p = *pp;
    *pp = 0;
    quiescent = scan_all_hazard_pointers(p);
    COMMIT;
    if (quiescent)
        free(p);
    else
        defer_freeing(p);
}

--
Dmitriy V'jukov

EricP

May 15, 2009, 11:18:24 AM

No that won't work.

1) You have to treat it like a LL/SC: no more than 10 to 20
instructions inside a speculate block.

I don't know what your 'scan_all_hazard_pointers' does,
but I'm guessing it violates this rule.

2) You *must* deal with possible livelock. As the ASF spec currently
stands you must do this even when accessing a single cache line,
though I imagine that will eventually be addressed because
of the problems it causes for even the simplest code.

As ASF currently stands, ping-pong'ing cache lines is possible.

This means that your goto Retry in both 'smr_dereference'
and 'remove' is NOT sufficient.

Exponential back-off, which I don't like because it just
wastes time and offers no guarantees of success or fairness,
is probably the minimum acceptable solution.
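A minimal retry-with-exponential-back-off skeleton, as a sketch (try_region is a stand-in for one SPECULATE...COMMIT attempt; the cap, the burn loop, and demo_region are all invented for illustration):

```c
#include <assert.h>

/* Retry a speculative region, doubling a busy-wait delay after every
 * failed attempt so colliding CPUs spread out in time. */

int run_with_backoff(int (*try_region)(void), int max_tries)
{
    unsigned delay = 1;
    volatile unsigned spin;
    int t;
    for (t = 0; t < max_tries; t++) {
        if (try_region())
            return 1;               /* region committed */
        for (spin = 0; spin < delay; spin++)
            ;                       /* back off: burn a little time */
        if (delay < (1u << 16))
            delay <<= 1;            /* double the wait, with a cap */
    }
    return 0;                       /* give up - e.g. fall back to a lock */
}

/* Hypothetical region that fails its first two attempts (demo only). */
static int demo_calls = 0;
int demo_region(void)
{
    return ++demo_calls >= 3;
}
```

As Eric notes, this wastes time and offers no fairness guarantee; it only makes livelock statistically unlikely.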

3) All the examples so far, from AMD, Chris or yourself,
have glossed over the abort handling and ignored the livelock.
If I may suggest, the following structure covers the usage:

for (;;)
{
SPECULATE
{
ok = do_something; // No more than 10-20 instructions
if (ok)
{ COMMIT(); // Ignore commit status for now.
break;
}
else
{ ABORT; // Jump to catch handler
}
}
catch (int reason)
{
if (reason == blah1)
handle1();
else if (reason == blah2)
handle2();
}
}

The reason I like this structure is it makes one think
about what the catch handler is going to do.

Eric


EricP

May 15, 2009, 11:52:35 AM
EricP wrote:
>
> 2) You *must* deal with possible livelock.

1) Locking

Just to continue on this, I had been thinking about how ASF could
be used for simple solutions first (not the non-blocking code),
and what the consequences would be for them.

Whereas previously I would guard with a Read-Write Critical Section
(whatever one calls that, SlimReadWriteLock, etc.), which has fast and
slow paths, with ASF there are 3 paths:
non-locking concurrent read-write, fast, and slow.

The new path, concurrent read-write, has zero overhead and allows
concurrent readers and writers as long as they don't collide.
When they do, it switches over to using atomic ops to manage the
fast and slow paths of a concurrent-read, exclusive-write lock.
When the backlog of lock users is gone, it reverts to non-locking.

I'm still playing with this approach, but I think a minor
modification to my reader-writer lock would suffice.
If a non-locked collision occurs it falls back to a
FIFO ordered read-write lock.

2) Allocations

Another consequence is the need to ensure that all protected
data structures are allocated on separate cache lines.
In this case, the lock and the list head, in addition to the
object nodes, must all be heap allocated since the linker
doesn't know the cache line size and cannot prevent line sharing.

To completely avoid any false sharing, the cache line aware
heap allocator would return pointers to the start of a cache line,
all allocations would be integral line sized, and no heap
management info would be in the same line.
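That allocation rule can be sketched as follows (64 is an assumed line size; the raw malloc() pointer is stashed in the slack just below the aligned block so it can be freed later - names invented):

```c
#include <assert.h>
#include <stdlib.h>
#include <stdint.h>

#define LINE 64u   /* assumed cache line size */

/* Return a pointer that starts on a cache line boundary and spans a
 * whole number of lines, with no heap metadata inside those lines. */
void *line_alloc(size_t sz)
{
    size_t rounded = (sz + LINE - 1) & ~(size_t)(LINE - 1);
    unsigned char *raw = malloc(rounded + LINE + sizeof(void *));
    uintptr_t user;
    if (!raw)
        return NULL;
    /* Align up past the slot that remembers the raw pointer. */
    user = ((uintptr_t)(raw + sizeof(void *)) + LINE - 1)
           & ~(uintptr_t)(LINE - 1);
    ((void **)user)[-1] = raw;     /* stash raw pointer for line_free */
    return (void *)user;
}

void line_free(void *p)
{
    if (p)
        free(((void **)p)[-1]);
}
```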

Eric

