
Mill architecture combo talk at UC Berkeley 10/30


Ivan Godard

Oct 24, 2013, 2:18:20 PM
View this posting in your browser:
http://us7.campaign-archive1.com/?u=daeaa4982a7a1c6b864c6b557&id=29827c3560&e=5699b707e9

Out-of-the-Box Computing

Ivan Godard, CTO of Out-of-the-Box Computing, will be giving a talk at
UC Berkeley’s EECS ASPIRE.

The particulars:
Wednesday, October 30, 2013
Noon - 1:30PM
Sutardja Dai Hall in room 250

This will be a combination of the previous three topics publicly
presented on the Mill general-purpose CPU architecture. It will cover
the encoding, belt and memory hierarchy portions of the design. The talk
will assume a familiarity with modern CPUs but not with any particular
instruction set architecture.


Encoding, the belt and memory hierarchy in the Mill CPU
Enabling performance in a statically-scheduled architecture
The Mill is a new CPU architecture designed for very high single-thread
performance within a very small power envelope. It achieves DSP-like
power/performance on general purpose codes, without reprogramming. The
Mill is a wide-issue, statically scheduled design with exposed pipeline.
High-end Mills can decode, issue, and execute over thirty MIMD
operations per cycle, sustained. The pipeline is very short, with a
mispredict penalty of only four cycles.

To sustain decode at Mill rates, the instructions are split into two
streams of operation bundles, each with its own program counter and
instruction cache, which are decoded in lock step together. Each
variable-length bundle of variable-length operations is decoded from
both ends, requiring only two pipeline stages for 30+ operations.

Supplying data to 30+ operations per cycle requires a completely new
programming model. The Mill has no general registers. Instead, data is
passed from functional unit to functional unit via the Belt, a small
structure with conveyor-belt-like semantics. Operations use temporal
addressing to obtain data from the belt, while operation results are
implicitly dropped at the front, moving the previous contents along. The
Belt is single assignment so contents are never modified; this
eliminates pipeline hazards and obviates the hundreds of power-hungry
rename registers of legacy machines.

It is well known that exposed-pipe static scheduling yields near-perfect
code with minimal power – except when there is a miss in the cache. In a
conventional VLIW, a miss stalls the whole machine, whereas an
out-of-order architecture can sometimes find other useful operations to
execute while waiting on the memory hierarchy. The Mill uses a novel
load instruction that tolerates load misses as well as hardware
out-of-order approaches can do, while avoiding the need for expensive
load buffers and completely avoiding false aliasing. In addition, store
misses are impossible on a Mill, and a large fraction of the memory
traffic of a conventional processor can be omitted entirely.

The presentation will cover these and other technical aspects of the
memory hierarchy in the Mill design. Videos and other material about
other aspects of the Mill can be found at ootbcomp.com/docs.

Chris M. Thomasson

Oct 24, 2013, 3:54:49 PM
> "Ivan Godard" wrote in message news:l4boas$2aq$1...@dont-email.me...
[...]

Hi Ivan. First of all, AFAICT, the Mill is extremely innovative
and interesting at the same time. Thank you for creating
such an awesome child Ivan! :^)

I work with clever multi-threaded synchronization algorithms,
and was wondering: how does a local Mill chip communicate
with remote Mill chips on the same board? Is it a clever hardware
message-passing algorithm?

Is there any room for a bakery synchronization algorithm such as:

https://groups.google.com/forum/#!original/comp.arch/aLVoxdQdRac/9DtioMswSmUJ


Or any other clever sync algorithms?

https://groups.google.com/forum/#!searchin/lock-free/hash/lock-free/qCYGGkrwbcA/YqYecEGva18J


on the Mill? Or, am I bound to a msgpass algo?


I guess it boils down to, can a programmer handcraft novel
synchronization between Mill processors and/or threads?

What is the granularity of the memory barriers on the Mill?

I am fond of the Sparc wrt the membar instruction.

Well, is the Mill 100% totally SEQ_CST?

Chris M. Thomasson

Oct 24, 2013, 3:59:59 PM
> > "Chris M. Thomasson" wrote in message
> > news:l4btud$t1q$1...@speranza.aioe.org...

> > > "Ivan Godard" wrote in message news:l4boas$2aq$1...@dont-email.me...
> > [...]

> > Hi Ivan. First of all, AFAICT, the Mill is extremely innovative
> > and interesting at the same time. Thank you for creating
> > such an awesome child Ivan! :^)

> > [...]

> Well, is the Mill 100% totally SEQ_CST?

Some further context:

http://www.1024cores.net

http://software.intel.com/en-us/forums/topic/296471

Ivan Godard

Oct 24, 2013, 8:39:15 PM
On 10/24/2013 12:54 PM, Chris M. Thomasson wrote:
>> "Ivan Godard" wrote in message news:l4boas$2aq$1...@dont-email.me...
> [...]
>
> Hi Ivan. First of all, AFAICT, the Mill is extremely innovative
> and interesting at the same time. Thank you for creating
> such an awesome child Ivan! :^)

Thank you :-)

> I work with clever multi-threaded synchronization algorithms,
> as was wondering how a local Mill chip communications
> wrt remote Mill chips on the same board? Is it a clever hardware
> message passing algorithm?

Multicore is the subject of a future talk, but I can give some of the
bottom lines without going into detail about how it works.

Multiple cores on one chip may be configured to have hardware shared
address spaces with hardware coherent caches. It is also possible to
configure a chip with incoherent or mixed in-/coherent; the latter is
common in embedded work.

We are not a fabric vendor, so between chips we will support whatever
fabric protocols the market demands. That said, it appears that the
market is going to message passing over the fabric.

> Is there any room for a bakery synchronization algorithm such as:
>
> https://groups.google.com/forum/#!original/comp.arch/aLVoxdQdRac/9DtioMswSmUJ
>

Any algorithm that will work on e.g. a Haswell will also work on a Mill.

> Or any other clever sync algorithms?

The Mill uses an optimistic concurrency synchronization primitive based
on using the caches to keep track of read and write sets. It is quite
similar to the optimistic primitives recently added to x86 and PowerPC,
and all are based on academic work of some age. x86 didn't get it quite
right, so we fixed a few things, but the underlying idea is the same.

There are no read-modify-write pessimistic primitives at all; these may
all be implemented using the underlying optimistic primitive.

> https://groups.google.com/forum/#!searchin/lock-free/hash/lock-free/qCYGGkrwbcA/YqYecEGva18J

Much of the arm-waving required by these algorithms is unnecessary on a
Mill. While true data sharing will cause data to ping-pong between cores
just like on any architecture, the Mill is immune to false sharing. In
addition, the Mill's sequential consistency eliminates all of the
membars and data race bugs.

> on the Mill? Or, am I bound to a msgpass algo?

No; just share your data. Your performance will be best if you use a
transaction model rather than locks.

> I guess it boils down to, can a programmer handcraft novel
> synchronization between Mill processors and/or threads?

Certainly can, but probably has no need to.


> What is the granularity of the memory barriers on the Mill?

There are none.

> I am fond of the Sparc wrt the membar instruction.

Membar is for masochists. :-)

> Well, is the Mill 100% totally SEQ_CST?

Yes, completely, both in a monocore and in a multicore when the cores
are configured to share memory.

Ivan Godard

Oct 24, 2013, 8:42:30 PM
I'm familiar with this work - I once designed and implemented a
commercial distributed object-oriented database, so I know what real
transactions are :-)

The Mill is designed to enable safe and efficient lock-free algorithms,
using optimistic concurrency.

Chris M. Thomasson

Oct 25, 2013, 10:35:10 PM
> "Ivan Godard" wrote in message news:l4cer5$5rb$1...@dont-email.me...
Am I bound to a "cascading" LL/SC type interface for such
algorithms?

If so, AFAICT, this basically means I cannot really do
any loop-less wait-free algorithms. There are several
loop-less algorithms that would be destroyed if I were
forced to implement the atomic operations with loops.

A very simple example, and an open question to anybody
who might know:

I still do not know exactly why SPARC removed the simple
atomic swap instruction. It was a nice single instruction
performing an atomic swap where the entire operation was
hardware based. Well, what's left but CAS? So, now, one
needs to code a CAS-loop in software to perform a simple
atomic swap on the SPARC! I liked it better when the swap
logic was completely contained in hardware.

Well, is the optimistic sync scheme that you are using
in any way similar to something like:

http://developer.amd.com/wordpress/media/2013/02/45432-ASF_Spec_2.1.pdf

http://llvm.org/pubs/2010-04-EUROSYS-DresdenTM.pdf

Ivan Godard

Oct 25, 2013, 11:04:40 PM
On 10/25/2013 7:35 PM, Chris M. Thomasson wrote:
>> "Ivan Godard" wrote in message news:l4cer5$5rb$1...@dont-email.me...


>> The Mill is designed to enable safe and efficient lock-free
>> algorithms, using optimistic concurrency.
>
> Am I bound to a "cascading" LL/SC type interface for such algorithms?

It is optimistic (as LL/SC is), but where LL/SC can have at most one
value in the read set and one in the write set, the Mill's form permits
much more, with guaranteed set sizes up to the wayness of the L1 cache.
Thus (for example) you can do a doubly-linked list insert without a lock.

We once had a DLL/DSC design, and dumped it when we discovered how to
use the cache instead.

> If so, AFAICT, this basically means I cannot really do any loop-less
> wait-free algorithms. There are several loop-less algorithms that would
> be destroyed if I were forced to implement the atomic operations with
> loops.

I would appreciate a pointer to an algorithm that is destroyed by
looping synchronization when eight-wide read and write sets are
supported by the primitive. We want very much to make sure that
inter-thread and inter-core communication is easy and fast, and are
looking for counter-examples. The PowerPC is an existence proof that
optimistic is sufficient for functionality, and ours is demonstrably a
superset of that, but there's always the possibility of better.

We are unlikely to do anything that requires atomic read-modify-write
hardware. You have no idea what a mess that causes :-(


> A very simple example, and an open question to anybody who might know:
>
> I still do not know exactly why SPARC removed the simple
> atomic swap instruction. It was a nice single instruction performing an
> atomic swap where the entire operation was hardware based. Well, whats
> left but CAS? So, now, one needs to code a CAS-loop in software to
> perform a simple atomic swap on the SPARC! I liked it better when the swap
> logic was completely contained in hardware.
>
> Well, is the opportunistic sync scheme that you are using
> in anyway similar to something like:
>
> http://developer.amd.com/wordpress/media/2013/02/45432-ASF_Spec_2.1.pdf
>
> http://llvm.org/pubs/2010-04-EUROSYS-DresdenTM.pdf

Yes, it uses the same mechanism of using the cache to hold read and
write sets. Being the Mill, there are a few improvements to these AMD
and Intel designs :-) Among other things:
* you can single-step a debugger through the primitive region
* you can log info and do other actions in the primitive region that
will not be rolled back on failure
* you have hardware help for contention recovery

Ivan Godard

Oct 25, 2013, 11:54:06 PM
On 10/25/2013 7:35 PM, Chris M. Thomasson wrote:

> Well, is the opportunistic sync scheme that you are using
> in anyway similar to something like:
>
> http://developer.amd.com/wordpress/media/2013/02/45432-ASF_Spec_2.1.pdf
>
> http://llvm.org/pubs/2010-04-EUROSYS-DresdenTM.pdf

Reading through the AMD doc (I was familiar with the Intel version but
not AMDs) there are further differences with the Mill semantics:
* the Mill program need not pre-reserve region lines
* the Mill read/write sets are in terms of program objects, not of cache
lines, so there is no false contention due to other cores accessing
non-region parts of the cache line containing the region.
* the Mill doesn't use the semantically flawed nested region approach
* the Mill does not abort a region on interrupt, fault or function call,
so long as the inner code does not start its own region. (It's not clear
whether this actually makes a difference in the case of fault/interrupt,
for which the handlers will in practice nearly always want to do some
sync of their own).
* aborts and other region failures do not cause a transfer of control, so
the Mill does not cause an imprecise exception in such a case.
* there is no equivalent of the RELEASE operation. It's a good idea, and
we might add it if there are not IP issues.

Paul A. Clayton

Oct 26, 2013, 1:10:02 PM
On Friday, October 25, 2013 11:04:40 PM UTC-4, Ivan Godard wrote:
[snip]
> Yes, it uses the same mechanism of using the cache to hold read and
> write sets. Being the Mill, there are a few improvements to these AMD
> and Intel designs :-) Among other things:
> * you can single-step a debugger through the primitive region
> * you can log info and do other actions in the primitive region that
> will not be rolled back on failure
> * you have hardware help for contention recovery

Well, AMD's ASF provided the means to have non-transactional
accesses. (The version of ASF that Mitch Alsup sought while
working at AMD provided more conflict information. [He
discussed such previously on comp.arch.]) IIRC, ASF only
preserved the stack pointer.

IBM's zArchitecture (360 mainframe descendants) transactional
memory design supports software defined GPR saving and also
supports constrained transactions that are guaranteed to
complete (falling back to hardware locking if necessary).

Intel's Restricted Transactional Memory provides complete
state checkpointing. It does not provide much information
(as currently architected) about the nature of failures
beyond their basic cause (explicit abort, insufficient
capacity, conflict--including a bit indicating if a retry
may succeed). It is defined to fail under a significant
number of conditions (like interrupts) and some conditions
are defined as potentially causing failures (and there is
no guarantee that a transaction will ever succeed; an
implementation could implement XBEGIN as an unconditional
jump).

Sun's TM implementation for Rock was even more broken,
even function calls could cause an abort, IIRC.

(There does not seem to be much public documentation
of the TM architecture for Azul Systems Vega processors.
Cliff Click had some blog posts discussing some aspects,
including the issue that a substantial portion of code
used shared counters that introduced somewhat
unnecessary conflicts. [This implies that Vega did not
provide non-transactional accesses--Cliff Click might
have explicitly mentioned this at some point.])

I do not know how much information is publicly
available for Blue Gene/Q's TM architecture. The
implementation is rather peculiar because there was an
attempt to minimize changes to the core, particularly
the L1 data cache. (That TM was limited to a single
chip is also a bit odd.)

Paul A. Clayton

Oct 26, 2013, 1:44:30 PM
On Friday, October 25, 2013 11:54:06 PM UTC-4, Ivan Godard wrote:
[snip]
> Reading through the AMD doc (I was familiar with the Intel version but
> not AMDs) there are further differences with the Mill semantics:
> * the Mill program need not pre-reserve region lines

I did not think that AMD required any pre-reservation, it
merely supported transactional prefetch.

> * the Mill read/write sets are in terms of program objects, not of cache
> lines, so there is no false contention due to other cores accessing
> non-region parts of the cache line containing the region.

I am guessing that this is using the per-byte valid bits
to define extent, or is this actually based on objects so
that false contention is possible? "Locking" objects
rather than bytes has some advantage (no need to explicitly
touch every byte that is guarded). Would such be based on
the bounds-oriented pointer system discussed earlier?

Cache line granularity has some advantage in hardware
simplicity (most systems do not provide per-byte valid
bits!).

> * the Mill doesn't use the semantically flawed nested region approach

I am curious why you view nesting as flawed. Is nesting
of locks likewise flawed?

> * the Mill does not abort a region on interrupt, fault or function cal,
> so long as the inner code does not start its own region. (It's not clear
> whether this actually makes a difference in the case of fault/interrupt,
> for which the handlers will in practice nearly always want to do some
> sync of their own).

I think the Sun Rock TM was the only hardware TM system
that aborted on function calls.

> * aborts and other region failure do not cause a transfer of control, so
> the Mill does not caise an imprecise exception in such a case.

What do you mean by "transfer of control"? If you mean a
branch/jump, then this would seem to imply that an aborted
transaction would flow through as if not aborted until it
reached the end of its transaction. This would be similar
to ll/sc and would have some nice features (and not
supporting nesting would make such less unattractive).

If you mean fault into a different permission space, then
I did not think any of the HTM architectures did that.

> * there is no equivalent of the RELEASE operation. It's a good idea, and
> we might add it if there are not IP issues.

The mention of ASF's RELEASE instruction makes me curious
if supporting (unprivileged) invalidate cache block
instructions had been considered for The Mill. (I receive
the impression that such operation is supported for the
stack, where blocks are invalidated on stack frame
deallocation [function return].) While such has very
limited practical value (avoiding unnecessary writebacks
and explicitly supporting early eviction), for the
micro-optimization obsessed such has some attraction
even if it is not worthwhile.

(I wanted to make some comments on the memory system talk--
in part ignoring your "I'm lying for the sake of simplicity"
caveat--, but if I manage to procrastinate long enough such
would be pointless.)

Ivan Godard

Oct 26, 2013, 3:45:59 PM
On 10/26/2013 10:44 AM, Paul A. Clayton wrote:
> On Friday, October 25, 2013 11:54:06 PM UTC-4, Ivan Godard wrote:
> [snip]
>> Reading through the AMD doc (I was familiar with the Intel version but
>> not AMDs) there are further differences with the Mill semantics:
>> * the Mill program need not pre-reserve region lines
>
> I did not think that AMD required any pre-reservation, it
> merely supported transactional prefetch.
>
Terminology: they use a prefetch opcode because that's convenient in the
encoding, but what it actually does is reserve the line for the
transaction. Yes, they probably prefetch the line too, but depending on
the cache organization they might not actually need to do that.

>> * the Mill read/write sets are in terms of program objects, not of cache
>> lines, so there is no false contention due to other cores accessing
>> non-region parts of the cache line containing the region.
>
> I am guessing that this is using the per-byte valid bits
> to define extent or is this actually based on objects so
> that false contention is possible. "Locking" objects
> rather than bytes has some advantage (no need to explicitly
> touch every byte that is guarded). Would such be based on
> the bounds-oriented pointer system discussed earlier?

I use "object" sloppily; hardware doesn't know nuttin' about objects :-)
It uses the valid bits; there is no false contention. The checked
pointers are only a proposal, not actually part of the Mill at this point.

> Cache line granularity has some advantage in hardware
> simplicity (most systems do not provide per-byte valid
> bits!).

At software cost. We're on the side of the programmer :-)

>> * the Mill doesn't use the semantically flawed nested region approach
>
> I am curious why you view nesting as flawed. Is nesting
> of locks likewise flawed?

The standard example is a library function that provides a take-a-number
facility with the following guarantees:
* each number is unique
* the taken numbers are dense

This library is trivial with a lock and a counter. It is also easy using
optimistic synchronization in the absence of nested rollback semantics.

Now consider the following sequence with rollback:
counter value is 7
thread 1 enters xaction A
thread 1 calls takeanumber (containing xaction B), gets number 8
thread 2 calls takeanumber (containing xaction C), gets number 9
thread 1 aborts xaction A, rolls back counter to 7
thread 3 calls takeanumber twice, gets 8 and 9.
thread 2 and thread 3 both have 9, violating uniqueness.

The AMD design avoids this by merging the takeanumber counter (xaction
B) into xaction A, so thread 2 (xaction C) collides with A and one must
be broken. As far as I recall, the AMD document did not say who wins in
such a conflict.

Assume for a moment that A wins, and thread 2 as a result enters a retry
loop. If thread 1 then hangs without committing A, then the takeanumber
facility is locked out.

Assume for a moment that B wins, getting 8, and A rolls back and
retries. If takeanumber is a hot point, and A is at all lengthy, then A
will starve, even though A may be very infrequent and never itself collides.

Note that the whole sequence has no problem at all using locks; that
there have been no untoward events like interrupts; and that the actual
read/write sets do not exceed capacity.

The problem is inherent in nesting of xactions of differing duration and
frequency using rollback semantics.

>> * the Mill does not abort a region on interrupt, fault or function cal,
>> so long as the inner code does not start its own region. (It's not clear
>> whether this actually makes a difference in the case of fault/interrupt,
>> for which the handlers will in practice nearly always want to do some
>> sync of their own).
>
> I think the Sun Rock TM was the only hardware TM system
> that aborted on function calls.
>
>> * aborts and other region failure do not cause a transfer of control, so
>> the Mill does not caise an imprecise exception in such a case.
>
> What do you mean by "transfer of control"? If you mean a
> branch/jump, then this would seem to imply that an aborted
> transaction would flow through as if not aborted until it
> reached the end of its transaction. This would be similar
> to ll/sc and would have some nice features (and not
> supporting nesting would make such less unattractive).

Yes, execution continues until a commit is attempted; the result of the
commit op tells you what happened, and you can branch if you want to.
Only the write set is rolled back on failure.

> If you mean fault into a different permission space, then
> I did not think any of the HTM architectures did that.

No, no fault (by the hardware - software can do what it wants). It's a
generalized LL/SC if you'd like to think of it that way.

This is a data synchronization facility, not a code synchronization
facility like monitors in the sense of Concurrent Pascal or the
equivalent Java construct. That is, it is not coupled to a block in the
host language with exclusive right to enter. You give it the read set
and the write set and pull the trigger.

Language runtimes can use this to implement monitors of course. Speaking
as a language designer, I am quite opposed to monitors - they are fine
for toys but do not scale with program complexity. Note that so-called
"transactional memory" (TM) is actually optimistic monitors, and is not
a transaction at all as the term is understood by databases and our
much-maligned COBOL brethren.

>> * there is no equivalent of the RELEASE operation. It's a good idea, and
>> we might add it if there are not IP issues.
>
> The mention of ASF's RELEASE instruction makes me curious
> if supporting (unprivileged) invalidate cache block
> instructions had been considered for The Mill. (I receive
> the impression that such operation is supported for the
> stack, where blocks are invalidated on stack frame
> deallocation [function return].) While such has very
> limited practical value (avoiding unnecessary writebacks
> and explicitly supporting early eviction), for the
> micro-optimization obsessed such has some attraction
> even if it is not worthwhile.

Per-line cache control operations are available to the app; the cache is
on the far side of protection, so the app must have rights to the line.
For example, a JIT would use this to get newly created code visible to
the instruction fetch logic.

More global cache control operations also exist but are reached via
MMIO, and so require permissions to the MMIO address, typically not
given to apps. A typical example is a total cache flush in anticipation
of deep sleep mode.

> (I wanted to make some comments on the memory system talk--
> in part ignoring your "I'm lying for the sake of simplicity"
> caveat--, but if I manage to procrastinate long enough such
> would be pointless.)
>

Please reconstruct (the talk video is up now - ootbcomp.com/docs/memory)
- I find your comments insightful and appreciate them.

Chris M. Thomasson

Oct 26, 2013, 4:19:26 PM
> "Ivan Godard" wrote in message news:l4cel1$4rn$1...@dont-email.me...

> On 10/24/2013 12:54 PM, Chris M. Thomasson wrote:
> >> "Ivan Godard" wrote in message news:l4boas$2aq$1...@dont-email.me...
> > [...]
> >
> > Hi Ivan. First of all, AFAICT, the Mill is extremely innovative
> > and interesting at the same time. Thank you for creating
> > such an awesome child Ivan! :^)

> Thank you :-)

No problem Ivan. The Mill is awesome. Splitting the traffic up by going in
opposite directions is genius. The decoders must love this novel aspect!

;^)



> > I work with clever multi-threaded synchronization algorithms,
> > as was wondering how a local Mill chip communications
> > wrt remote Mill chips on the same board? Is it a clever hardware

> > message passing algorithm?

/*
> > I would appreciate a pointer to an algorithm that is destroyed by
> > looping synchronization when eight-wide read and write sets are
> > supported by the primitive.
*/

A simple example; I can give you many more if you are interested:


http://webpages.charter.net/appcore/misc/rwmutex_win_hpp.html


Notice that the following code implements the acquisition of a read lock:
______________________________________________________________
void rdlock() throw() {
    assert(m_count <= LONG_MAX);
    LONG count = InterlockedDecrement(&m_count);
    if (count < 0) {
        if (WaitForSingleObject(m_rdwset,
                                INFINITE) != WAIT_OBJECT_0) {
            RWMUTEX_WIN_UNEXPECTED();
        }
    }
}
______________________________________________________________


On systems that support atomic RMW, the `InterlockedDecrement' is a
loop-less operation contained within the hardware itself (e.g., LOCK XADD or
DEC). Keep in mind, this is from the software's point of view. For instance,
imagine changing all of the XADD and XCHG instructions contained within:

http://webpages.charter.net/appcore/appcore/src/cpu/i686/ac_i686_gcc_asm.html

to loops!

;^/

Yikes!!!

Chris M. Thomasson

Oct 26, 2013, 4:22:15 PM
> "Chris M. Thomasson" wrote in message
> news:l4h84i$5qn$1...@speranza.aioe.org...

> > "Ivan Godard" wrote in message news:l4cel1$4rn$1...@dont-email.me...
[...]

> On systems that support atomic RMW, the `InterlockedDecrement' is a
> loop-less operation contained within the hardware itself (e.g., LOCK XADD
> or DEC). Keep in mind, this is from the software's point of view. For
> instance, imagine changing all of the XADD and XCHG instructions contained
> within:

or:

http://webpages.charter.net/appcore/vzoom/refcount/refcount-ia32-gcc.asm

to loops.

[...]

Ivan Godard

Oct 26, 2013, 7:17:29 PM
I think you are mixing up the software cost with the hardware cost.

You do understand that (for example) "interlockDecrement" (hereafter ID)
can be implemented in terms of an optimistic primitive, so the question
is which is more costly, the optimistic or pessimistic versions.

In normal coding practice, an optimistic version of ID will be in a
library function that you call; you wouldn't put the loop in your own
code, which would be unchanged from your examples.

The body of that function (on a Mill) would have an ENTER, a LOAD, a
SUBTRACT, a STORE, a COMMIT, a conditional BRANCH, and a conditional
RETURN; other code to throw an exception on an unrecoverable error would
never be entered. On even middling Mill members that would code in four
instructions; how long it takes depends on whether the LOAD hits in the
D$1. The call will typically share with other code and adds nothing.
Optimistic on other machines would take more or less, but in all cases
should be no more than a couple of dozen cycles.

Now get down into the machine itself. The Mill code does not have to
clear the load/store pipelines. ENTER does nothing but set a few control
bits, checking that there isn't another interlock in process. The LOAD
has the same timing as a normal load, and does not require that there be
memory behind the counter. The SUBTRACT is a normal 1-cycle arithmetic
op. The STORE as usual will update the D$1 directly. However, if the
line already exists in the D$1, is dirty, and is not marked as a
participant in the write set already then the STORE will evict the
existing line to the D$2 before creating an empty new line and updating
it. That evict can delay the STORE by a couple of cycles, depending on
what else is happening in the cache.

The COMMIT (assuming no contention) is inserted into the request stream,
will take a cycle to do its thing in the cache, and reports to the belt;
overall latency probably the same as a hitting load.

The loop and error branches will assuredly be predicted not-taken, so
those occupy no time, and we are back at the point of call. So for the
monocore case, we are looking at two load times (one for the load and
one for the commit), plus 4-5 cycles for everything else; maybe 3 more
if the line was present and dirty and had to be evicted. So unless we
miss on the load (unlikely if the counter is a hot spot) the overall
cost is ~10 cycles.

In the multicore case, if we don't have the counter in cache and some
other core has the line then the load has to bring it in, no different
than any other load. If we already have the counter in cache in shared
state then we have to invalidate other shares, but that is *very* cheap
on a Mill (details in future talks). Consequently the cost is the same:
~10 cycles unless we miss on the load, in which case we pay the normal
load penalty.


Now consider the RMW cost. Fair warning: I have never worked at the
RMW-level in hardware that has it, and my understanding of the costs may
well be wrong. However, if so I'm sure that Andy or someone else will
set me right.

My understanding is that RMW always actually updates DRAM, as it is
implemented by seizing master control of the lines to memory. Moreover,
because it is impractical to interleave RMW and normal load/store to the
same line, the first step of a RMW is to drain the load and store
buffers completely. That is, RMW acts like a MEMBAR both before and
after the RMW itself (although perhaps some machines may require that
the software insert the MEMBARs rather than having the hardware do it).

As with the Mill, the R side of RMW may miss, and it takes as long as it
takes. The Mill could hit on the counter without having the rest of the
line, avoiding a miss that RMW would have. The W side of RMW has a more
time-consuming invalidate. And the write back to DRAM will be several
hundred cycles.

So at the machine level, the Mill's optimistic concurrency does need a
loop in a function rather than a native operation, but it runs entirely
in cache and takes a dozen cycles or so, while RMW involves DRAM and is
good for several hundred cycles.

Your choice.


Ivan

p.s. Does anybody have a microbenchmark that just does an infinite
number of IDs to the same counter? What's the cycle count per ID?

Chris M. Thomasson

unread,
Oct 26, 2013, 7:19:47 PM10/26/13
to
>"Chris M. Thomasson" wrote in message
>news:l4h84i$5qn$1...@speranza.aioe.org...
>> "Ivan Godard" wrote in message news:l4cel1$4rn$1...@dont-email.me...
>[...]
>/*
>> > I would appreciate a pointer to an algorithm that is destroyed by
>> > looping synchronization when eight-wide read and write sets are
>> > supported by the primitive.
> */


FWIW, here is another example of a loop-less algorithm:


http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue

the code for the push operation is nice and simple:
_________________________________________________
void mpscq_push(mpscq_t* self, mpscq_node_t* n)
{
    n->next = 0;

    // serialization-point wrt producers, acquire-release
    mpscq_node_t* prev = XCHG(&self->head, n);

    // serialization-point wrt consumer, release
    prev->next = n;
}
_________________________________________________



No loops in the software. Pretty nice so far... Obviously, this assumes
that the XCHG instruction is totally contained in hardware. From the
software point of view, it is a single atomic RMW operation. However, if
XCHG and all of its RMW counterparts were taken away, what's left but
CMPXCHG? This means, or at least implies, loops. And, AFAICT, that comes
with a cost and a can of worms usually called: no expectation of
forward-progress guarantees (refer to the contention manager for further
information). Can the Mill get by the cost of converting a single
atomic RMW operation to a “transactional” loop?
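For the curious, the push above drops almost verbatim into portable
C++11, where XCHG becomes std::atomic::exchange. This is only a sketch
of the producer side (the struct layout and names here are illustrative;
Vyukov's full queue also carries a consumer-side tail and stub node,
omitted for brevity):

```cpp
#include <atomic>
#include <cassert>

// Illustrative C++11 rendering of the loop-free push. The exchange
// is one unconditional hardware RMW (XCHG on x86): no retry loop.
struct mpscq_node_t {
    std::atomic<mpscq_node_t*> next{nullptr};
};

struct mpscq_t {
    std::atomic<mpscq_node_t*> head;       // producers swap here
    explicit mpscq_t(mpscq_node_t* stub) : head(stub) {}
};

void mpscq_push(mpscq_t* self, mpscq_node_t* n)
{
    n->next.store(nullptr, std::memory_order_relaxed);
    // serialization-point wrt producers: a single atomic exchange
    mpscq_node_t* prev = self->head.exchange(n, std::memory_order_acq_rel);
    // serialization-point wrt consumer: link old head to new node
    prev->next.store(n, std::memory_order_release);
}
```

Single-threaded, two pushes leave the nodes linked stub -> a -> b, with
head pointing at the last node pushed.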

Chris M. Thomasson

unread,
Oct 26, 2013, 7:26:37 PM10/26/13
to
> "Ivan Godard" wrote in message news:l4hijm$i4d$1...@dont-email.me...

> On 10/26/2013 1:19 PM, Chris M. Thomasson wrote:
> [...]

> So at the machine level, the Mill's optimistic concurrency does need a
> loop in a function rather than a native operation, but it runs entirely
> in cache and takes a dozen cycles or so, while RMW involves DRAM and is
> good for several hundred cycles.

> Your choice.

Can you make a highly efficient, and hardware-contained, RMW using
the Mill's existing optimistic concurrency architecture? IMHO, it would
be nice to not have to handcraft a loop for every atomic RMW under the
sun: XADD, XOR, AND, etc... the looping brings up forward-progress issues.

Humm...

Chris M. Thomasson

unread,
Oct 26, 2013, 7:43:18 PM10/26/13
to
> "Ivan Godard" wrote in message news:l4hijm$i4d$1...@dont-email.me...

> On 10/26/2013 1:19 PM, Chris M. Thomasson wrote:
> [...]

> I think you are mixing up the software cost with the hardware cost.
> [...]

> My understanding is that RMW always actually updates DRAM, as it is
> implemented by seizing master control of the lines to memory. Moreover,
> because it is impractical to interleave RMW and normal load/store to the
> same line, the first step of a RMW is to drain the load and store buffers
> completely. That is, RMW acts like a MEMBAR both before and after the RMW
> itself (although perhaps some machines may require that the software
> insert the MEMBARs rather than having the hardware do it).

The membars are there to enforce the "required" ordering of operations.
The strength of the ordering depends on the memory visibility requirement of
the algorithm.

For instance, take a simple atomic LIFO construct:

<C++'ish pseudo code>
______________________________________________
struct node
{
    node* m_next; // = NULL
};


struct stack
{
    node* m_head; // = NULL


    void push(node* n)
    {
        node* h = LOAD(&m_head);

        do
        {
            n->m_next = h;
            // Release Membar...

            // the following CAS acts like C++11 where the
            // comparand is updated upon failure.
        }
        while (! CAS(&m_head, &h, n));
    }


    node* flush_acquire()
    {
        node* n = SWAP(&m_head, NULL);

        if (n) { /* Acquire Membar... */ }

        return n;
    }


    node* flush_consume()
    {
        node* n = SWAP(&m_head, NULL);

        if (n) { /* Consume Membar... (basically a
                    data-dependent load barrier) */ }

        return n;
    }
};
______________________________________________


I can break the API apart according to the level of synchronization I
actually need.

AFAICT, this is a pretty nice ability.



> p.s. Does anybody have a microbenchmark that just does an infinite number
> of IDs to the same counter? What's the cycle count per ID?

http://www.rdrop.com/~paulmck/RCU/LCA2004.02.13a.pdf


This paper has some fairly decent information:

http://sigops.org/sosp/sosp13/papers/p33-david.pdf

The latter paper is a little misleading in a couple of areas...

Chris M. Thomasson

unread,
Oct 26, 2013, 7:47:34 PM10/26/13
to
Sorry for all of the posts!

;^/

> "Chris M. Thomasson" wrote in message
> news:l4hk2u$vef$1...@speranza.aioe.org...

> > "Ivan Godard" wrote in message news:l4hijm$i4d$1...@dont-email.me...
> [...]

> > p.s. Does anybody have a microbenchmark that just does an infinite
> > number of IDs to the same counter? What's the cycle count per ID?

> http://www.rdrop.com/~paulmck/RCU/LCA2004.02.13a.pdf

Information from Slide Number 4:

Operation Costs: How Bad???

4-CPU 700MHz i386 P-III

Operation                                  Nanoseconds

Instruction                                        0.7
Clock Cycle                                        1.4
L2 Cache Hit                                      12.9
Atomic Increment                                  58.2
Cmpxchg Atomic Increment                         107.3
Atomic Incr. Cache Transfer                      113.2
Main Memory                                      162.4
CPU-Local Lock                                   163.7
Cmpxchg Blind Cache Transfer                     170.4
Cmpxchg Cache Transfer and Invalidate            360.9

Ivan Godard

unread,
Oct 26, 2013, 7:54:24 PM10/26/13
to
No reason; macrocode would be as good as the same thing in microcode.
The Mill doesn't do microcode anyway. Are you allergic to function calls?

Ivan Godard

unread,
Oct 26, 2013, 8:12:19 PM10/26/13
to
On 10/26/2013 4:43 PM, Chris M. Thomasson wrote:
>> "Ivan Godard" wrote in message news:l4hijm$i4d$1...@dont-email.me...
>
>> On 10/26/2013 1:19 PM, Chris M. Thomasson wrote:
>> [...]
>
>> I think you are mixing up the software cost with the hardware cost.
>> [...]
>
>> My understanding is that RMW always actually updates DRAM, as it is
>> implemented by seizing master control of the lines to memory.
>> Moreover, because it is impractical to interleave RMW and normal
>> load/store to the same line, the first step of a RMW is to drain the
>> load and store buffers completely. That is, RMW acts like a MEMBAR
>> both before and after the RMW itself (although perhaps some machines
>> may require that the software insert the MEMBARs rather than having
>> the hardware do it).
>
> The membars are there to enforce the "required" ordering of operations.
> The strength of the ordering depends on the memory visibility
> requirement of
> the algorithm.

Actually no, or rather not only, at least as I understand it. The way a
membar works in the hardware is to flush the load or store buffers
(depending on the kind of membar). This is cheap if you are not doing
any unrelated loads or stores and all the buffers are empty, but very
expensive if you are and they have to go to DRAM.

Inside the chip these ops are real gates, not abstractions, and
extraordinarily complex. The RMW ops cost much more than a simple load
and store pair, and their existence mucks up and adds cost to other
operations even when RMW is not used.

>> p.s. Does anybody have a microbenchmark that just does an infinite
>> number of IDs to the same counter? What's the cycle count per ID?
>
> http://www.rdrop.com/~paulmck/RCU/LCA2004.02.13a.pdf
>
>
> This paper has some fairly decent information:
>
> http://sigops.org/sosp/sosp13/papers/p33-david.pdf
>
> The latter paper is a little misleading in a couple of areas...


According to the paper, RMW was already a couple of hundred cycles even
in the PIII days. If you'd take a more expensive alternative over a
cheaper one then go ahead; I won't argue religion. But please explain:
why is a loop by definition anathema?

Ivan Godard

unread,
Oct 26, 2013, 8:19:55 PM10/26/13
to
Not sure where you are getting all this.

1) XCHG(...) can be compiled into an intrinsic, or compiled into a call
on a function containing a loop; the source code containing XCHG is the
same.

2) what's wrong with a loop? Do you never write "for(...)"?

3) Not sure what optimistic concurrency primitive you have been looking
at, but the forward-progress guarantee of all the hardware
implementations I know of is the same as for a hash-table or a C++
std::vector<>

4) What contention manager? This is hardware - there is no manager of
any kind, "contention" or otherwise. Perhaps you are thinking of
Software Transactional Memory?

5) it's pretty easy to get by negative costs :-)

Chris M. Thomasson

unread,
Oct 26, 2013, 8:54:50 PM10/26/13
to
> >"Ivan Godard" wrote in message news:l4hm8n$sb$1...@dont-email.me...

> >On 10/26/2013 4:19 PM, Chris M. Thomasson wrote:
> >>> "Chris M. Thomasson" wrote in message
> >>> news:l4h84i$5qn$1...@speranza.aioe.org...

> >[...]

> Not sure where you are getting all this.

From developing wait-free algorithms. The key is to avoid LL/SC and/or
CAS at all costs. Well, there is an exception for CAS. There are
situations where a static comparand can be used, thus avoiding the
load, mutate, loop.


> 1) XCHG(...) can be compiled into an intrinsic, or compiled into a call on
> a function containing a loop; the source code containing XCHG is the same.

> 2) what's wrong with a loop? Do you never write "for(...)"?

The problem with a loop is: how many times is this loop going to execute
before the operation is finally committed? What happens when I expose
this loop to heavy load?

A loopless algorithm gives me some guarantees about these questions. For
instance, a single atomic RMW swap is simple. No loop. However, if I need to
"emulate" the atomic RMW swap with something like:
_______________________________________________
void atomic_swap(void** ptr, void* v)
{
    for (;;)
    {
        void* cmp = LOAD(ptr);

        if (CAS(ptr, cmp, v)) return cmp;
    }
}
_______________________________________________

Well, now the simple swap operation is exposed to uncertainty due to the
conditional nature that has to be explicitly handled with looping.
Exactly how many loops am I going to get? I had a guarantee of a single
instruction; now I might have to retry the op a number of times before
it even executes...

This means that the performance of `atomic_swap()' is going to depend on
how many times this loop fails to do anything at all.


> 3) Not sure what optimistic concurrency primitive you have been looking
> at, but the forward=progress guarantee all hardware implementations I know
> of is the same as for a hash-table or a C++ std::vector<>

https://groups.google.com/forum/#!original/comp.arch/Bx-OAJTjU-U/-AWQURPzH10J


> 4) What contention manager? This is hardware - there is no manager of any
> kind, "contention" or otherwise. Perhaps you are thinking of Software
> Transactional Memory?

Yes I was. Sorry about that Ivan.

;^/



> 5) it's pretty easy to get by negative costs :-)

IMVHO, I consider looping around on a failed condition to have a cost...

Chris M. Thomasson

unread,
Oct 26, 2013, 8:57:05 PM10/26/13
to


"Chris M. Thomasson" wrote in message
news:l4ho92$78l$1...@speranza.aioe.org...
[...]


Yikes!

>_______________________________________________
> void atomic_swap(void** ptr, void* v)

Ummm... It would be nice if this function had a return type!


void* atomic_swap(void** ptr, void* v)

Chris M. Thomasson

unread,
Oct 26, 2013, 9:11:11 PM10/26/13
to
> "Ivan Godard" wrote in message news:l4hlqg$v4r$1...@dont-email.me...

> On 10/26/2013 4:43 PM, Chris M. Thomasson wrote:
> >> "Ivan Godard" wrote in message news:l4hijm$i4d$1...@dont-email.me...
> >
> >> On 10/26/2013 1:19 PM, Chris M. Thomasson wrote:
> >> [...]
> >
> >> I think you are mixing up the software cost with the hardware cost.
> >> [...]
> >
> >> My understanding is that RMW always actually updates DRAM, as it is
> >> implemented by seizing master control of the lines to memory.
> >> Moreover, because it is impractical to interleave RMW and normal
> >> load/store to the same line, the first step of a RMW is to drain the
> >> load and store buffers completely. That is, RMW acts like a MEMBAR
> >> both before and after the RMW itself (although perhaps some machines
> >> may require that the software insert the MEMBARs rather than having
> >> the hardware do it).
> >
> > The membars are there to enforce the "required" ordering of operations.
> > The strength of the ordering depends on the memory visibility
> > requirement of
> > the algorithm.

> Actually no, or rather not only, at least as I understand it. The way a
> membar works in the hardware is to flush the load or store buffers
> (depending on the kind of membar). This is cheap if you are not doing
> any unrelated loads or stores and all the buffers are empty, but very
> expensive if you are and they have to go to DRAM.

Well, the user algorithm has the requirements. So, let's say
the algo needs to load a pointer and be able to see what the
pointer points to. This requires a data-dependent load barrier.
In C++11 that would be memory_order_consume. However, if
the algorithm needs to see any other state that is not attached
to what is pointed to, then the app needs an acquire membar
(#LoadStore | #LoadLoad), or memory_order_acquire in C++11.

An acquire barrier is more expensive than a data-dependent load
barrier which is a nop on every machine I have seen, well, except
for DEC Alpha... ;^)

So, the ability for the software to specify exactly what memory
visibility requirements it needs can be very beneficial. For instance,
the performance of RCU basically depends on memory_order_consume
being compiled into a NOP and thus incurring no memory-barrier overhead.
SEQ_CST is plain overkill here. Big time.
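As a concrete illustration of the distinction, here is the
pointer-publication pattern in C++11 atomics (names and values are
illustrative; note that current compilers conservatively promote
memory_order_consume to memory_order_acquire, so the consume form
documents intent rather than guaranteeing the cheaper code):

```cpp
#include <atomic>
#include <cassert>

// Publisher/reader sketch. The producer writes the payload, then
// publishes the pointer with release; the reader only needs ordering
// for data reached *through* the pointer, which is exactly what
// memory_order_consume expresses.
struct payload { int ready_data; };

std::atomic<payload*> g_ptr{nullptr};

void publish(payload* p)
{
    p->ready_data = 42;                         // write the data first
    g_ptr.store(p, std::memory_order_release);  // then publish the pointer
}

int read_through()
{
    // data-dependent load: ordering only for *p, not unrelated state
    payload* p = g_ptr.load(std::memory_order_consume);
    return p ? p->ready_data : -1;
}
```

Swapping consume for acquire here is always correct but, on weakly
ordered machines, potentially pays for a barrier the algorithm never
needed.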

[...]


> But please explain:
> why is a loop is by definition anathema?

Can you give me a guarantee that this loop will never have to
loop around and retry?

Chris M. Thomasson

unread,
Oct 26, 2013, 9:17:07 PM10/26/13
to
> "Ivan Godard" wrote in message news:l4hkos$qg5$1...@dont-email.me...
I think it might be beneficial to integrate something as simple as
atomic SWAP into the Mill. It has to be more efficient than a loop in
software. For instance, how many times is the loop going to execute
before it finally commits?

This retry time could be reduced if the swap operation was contained in
hardware to begin with.


> Are you allergic to function calls?

Nope. :^)

Chris M. Thomasson

unread,
Oct 26, 2013, 10:17:52 PM10/26/13
to
> "Ivan Godard" wrote in message news:l4hm8n$sb$1...@dont-email.me...

> [...]

> 2) what's wrong with a loop? Do you never write "for(...)"?

In my experience, a CAS-loop is more expensive than a simple XADD.

Compare the following pseudo-codes:

version 1:
____________________________________________
uword atomic_xadd(uword* v, uword a)
{
    return XADD(v, a);
}
____________________________________________


or:


version 2:
____________________________________________
uword atomic_inc(uword* v, uword a)
{
    for (;;)
    {
        uword c = LOAD(v);

        if (CAS(v, c, c + a)) return c;
    }
}
____________________________________________

Where the XADD and CAS instructions are at hardware level.



IMVHO, version 1 is a winner. I can code up a simple benchmark comparing
the difference between XADD and CMPXCHG under extreme focused load, but
it would only end up showing that the CAS looping hurts performance.
This is a given on x86 and SPARC (e.g., the swap instruction vs. a cas
emulation of the swap instruction).

XADD always beats a CMPXCHG loop.


I think XADD can beat a LL/SC loop on PowerPC under heavy load as well.
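For reference, the two versions above map directly onto C++11 atomics,
where fetch_add is the XADD (one unconditional hardware RMW) and the
compare_exchange loop is the CAS emulation whose retry count depends on
contention. A sketch, not a benchmark; single-threaded behavior shown
only to pin down the semantics:

```cpp
#include <atomic>
#include <cassert>

using uword = unsigned long;

// Version 1: wait-free, exactly one hardware RMW (LOCK XADD on x86).
uword atomic_xadd(std::atomic<uword>* v, uword a)
{
    return v->fetch_add(a);
}

// Version 2: lock-free but not wait-free; may retry under contention.
uword atomic_inc(std::atomic<uword>* v, uword a)
{
    uword c = v->load();
    // on failure, compare_exchange_weak reloads c with the current value
    while (!v->compare_exchange_weak(c, c + a))
        ;
    return c;   // previous value, matching XADD's result
}
```

Both return the pre-increment value; the difference is purely in how
many times version 2's loop spins when other cores hit the same word.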

Ivan Godard

unread,
Oct 27, 2013, 1:25:15 AM10/27/13
to
Here's a brief tutorial, in hope that it helps.

Optimistic strategies are used in many places besides synchronization:
network protocols (Ethernet and friends), hash tables, memory and
translation caches, and many other places. In all cases, a plot of cost
vs. load factor shows a very low cost at low load, rising slowly until a
knee is reached, with very high cost past the knee as thrashing sets in.
Where the knee is varies with the details, but the curve shape is
characteristic.

For most places in which an optimistic strategy is available there is
also a pessimistic strategy available as well. These too have a
characteristic cost vs. load curve: relatively high at low load, rising
slowly but steadily with load in a reasonably linear fashion, up to a
maximum load at which load is constrained by other factors. There is no
knee, but below the optimistic knee the optimistic strategy is cheaper,
often significantly cheaper, than the pessimistic. After all, were it
not cheaper then optimistic would never be used at all, while in fact
you use such strategies all the time without thinking of it.

However, there may be non-cost reasons for choosing one strategy over
the other. In much real time work, the actual cost is irrelevant so long
as it never goes over a certain fixed value. In such a case, the higher
cost but greater certainty may be preferable to the average lower cost
but higher variability of optimistic. IBM invented Token Ring based on
just such an argument (see https://en.wikipedia.org/wiki/Token_ring for
the fate of Token Ring). I ignore such specialized situations herein.

Nearly all optimistic strategies are subject to a phenomenon called
"convoying". This is a local, emergent effect, in which minor
happenstantial variations in load cause retries that increase the
effective local load factor, leading to more failures, more retries, and
more load. When convoying is present, the global load factor may be well
below the knee but local convoying can raise the local load above the
knee and performance drops sharply until the local effect works itself
out. The convoying effect has been studied mathematically and is akin to
local onset of chaotic regions in the phase space.

There are two possible responses to the appearance of convoying in a
system. In one, the global load factor is kept low enough, and the
statistical character of the load is adjusted, so as to lessen the
probability of a local load cluster over some period of time. Thus, in
Ethernet you might add bandwidth while retaining the optimistic
algorithm unchanged.

The second approach is to prevent happenstantial failure from clustering
locally and hence thrashing. In hash tables, rather than just looking in
the next cell on a collision, you rehash the key and reprobe to
someplace completely different in the table. Uniformly reprobing gives
better performance in higher loads, because it avoids miss clustering.

The "try again someplace else far from here and different from what your
collider will do" way of avoiding local collision clusters can be
applied to most optimistic strategies. In the time domain, the common
approach is called "exponential backoff". It is used in Ethernet and
many other protocols. It is also used in the allocation domain so as to
achieve asymptotically constant overhead in C++ vectors: each time the
vector runs out of space its size is doubled.

These retry algorithms do not remove the knee that is inherent in
optimistic strategies, but they do remove statistical effects when load
is "bunchy", and tend to move the knee toward lower cost at higher load.
Measuring an optimistic strategy without using a backoff is shooting a
straw man.
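A minimal sketch of such a backoff, assuming a C++11 CAS loop (the
yield-based pause and the cap of 64 are arbitrary illustrative choices;
real implementations tune these, often with a PAUSE instruction or
randomized delays):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// CAS loop with bounded exponential backoff: on each failed attempt,
// wait roughly twice as long before colliding again, so contenders
// spread out in time instead of convoying.
unsigned long cas_add_backoff(std::atomic<unsigned long>* v,
                              unsigned long a)
{
    unsigned delay = 1;
    unsigned long c = v->load();
    while (!v->compare_exchange_weak(c, c + a)) {
        for (unsigned i = 0; i < delay; ++i)
            std::this_thread::yield();   // back off before retrying
        if (delay < 64)
            delay *= 2;                  // exponential growth, capped
    }
    return c;   // previous value
}
```

Under no contention the backoff path never executes, so the uncontended
cost is the same as the naive loop.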

Both optimistic and pessimistic show statistical variation in the cost
of any single access; you may think that pessimistic does not, but for
fixed bandwidth, if requests locally bunch higher than the bandwidth
then some requests must be queued, even though no retry is used; that's
the source of the rising cost curve of pessimistic. For any given load
factor, the mean of these request costs can be computed from the shape
of the respective curves and the underlying zero-load cost.

The cost of synchronization is but one part of the overall constraints
experienced by a program. Thus synchronization may be completely hidden
by a lack of memory bandwidth, or insufficient floating point units, or
so on. The designer must fit within externally imposed resource limits -
it might be nice to have 25 floating point units, but they won't fit in
the power and area budgets. Within those limits, the goal is to provide
minimal average cost over the universe of applications.

This goal is difficult to achieve; every application hits one limit or
another, and every application user feels that his limit (whatever it
is) is the only one that matters and has been unfairly shirked. He feels
mightily aggrieved.

To turn to the case at hand, the cost of synchronization is irrelevant
if the application is not sync-bound. The costs must be measured in
meaningful units. In this case you favor RMW because it is "one
instruction". The number of instructions only matters to decode, which
is not the concern here; that one instruction will promptly become
several micro-ops anyway. Instruction count is not a meaningful measure.

The appropriate measure here is the number of syncs per minute, or,
equivalently, the average number of cycles per sync. Note "average": if
you are doing hard real time and truly require a fixed maximum delay you
should not be using any conventional CPU; timing variations from cache,
TLB, memory refresh, and scheduler will completely swamp the variations
from any kind of sync.

The hardware can only do one RMW at a time. If there are 20 cores
contending, each core will have a sync bandwidth that is a 20th of what
is available without contention. So you cannot know what that single RMW
costs, until you are aware of what else is happening, and the RMW cost
will certainly not be "one instruction" by a long shot, at any load
approaching the optimistic knee.

Your comparison here is not only flawed by treating RMW as
instantaneous, it is also flawed by a failing implementation of
optimistic. As written, your loop will convoy because you have no
backoff, but that is a fault of your code and not of the strategy. These
things are subtle - any multicore aspect is - and it's easy to get it
wrong. That's why it's better to use a well written synchronization
library rather than your own loops.

Depending on the shape of the cost curves, how good a backoff is used,
and the phase of the moon, there may be a region of the load space in
which the system is sync bound and pessimistic (RMW) is less costly than
optimistic; that is far from certain, but it may be. Being close to
saturation, that region of the load space will be rarely occupied by
applications - very few applications get anywhere near sync bound in any
strategy. Short of that region (i.e. at lighter contention), optimistic
is less costly than pessimistic, which benefits the mass of applications
that are in that lighter load region.

It is a designer's choice whether to benefit the great mass at the
expense of the extreme and rare, or vice versa. The Mill design attempts
to have it both ways and benefit both, but sometimes you can't. In this
case it's not even clear that a dilemma exists; the examples you have
evinced so far do not in fact support the claim made. Our analysis of
the hardware, including the Mill extensions, has convinced us that our
optimistic approach will be less costly on average than RMW would be, at
any load factor up to saturation caused by reasons other than sync.
Consequently we feel that there is no load at which RMW would be better.
True, RMW will probably have more stable variation in response, but the
Mill is not suitable for hard real time in any case, so that doesn't
matter to us.

It may well be that we face surprises when the design reaches silicon -
I'm sure there will be a few, and sync cost may be one. If our analysis
turns out to have been wrong then we'll fix it then. But we won't change
it now on the basis of a misunderstood straw man.

YMMV.

This thread has probably gone on longer than is profitable to either of
us, so I'm dropping it at this point.

Ivan

Jean-Marc Bourguet

unread,
Oct 27, 2013, 9:21:52 AM10/27/13
to
"Chris M. Thomasson" <n...@spam.invalid> writes:

>> "Ivan Godard" wrote in message news:l4hlqg$v4r$1...@dont-email.me...
>> But please explain: why is a loop is by definition anathema?
>
> Can you give me a guarantee that this loop will never have to
> loop around and retry?

You seem to make a distinction between a loop in software and a loop
hidden in microcode or in hardware. I don't see why the first is worse.
The low level should provide primitives, the high level abstractions.
That's true for software. That's true for hardware. That's true when you
are spread over both.

Yours,

--
Jean-Marc

Andy (Super) Glew

unread,
Oct 27, 2013, 7:46:12 PM10/27/13
to
On 10/25/2013 8:04 PM, Ivan Godard wrote:
> On 10/25/2013 7:35 PM, Chris M. Thomasson wrote:
>>> "Ivan Godard" wrote in message news:l4cer5$5rb$1...@dont-email.me... On
>
>
>>> The Mill is designed to enable safe and efficient lock-free
>>> algorithms, using optimistic concurrency.
>>
>> Am I bound to a "cascading" LL/SC type interface for such algorithms?
>
> It is optimistic (as LL/SC is) but where LL/SC can have at most one
> value in the read set and one in the write set the Mill's form permits
> much more and guarantees a number up to the wayness of the L1 cache.
> Thus (for example) you can do a bi-linked list insert without a lock.
>
> We once had a DLL/DSC design, and dumped it when we discovered how to
> use the cache instead.

Do you "lock" entries in the "transaction set" between the beginning and
end of the transaction?

By "lock", I mean: what do you do if another processor tries to access
the location in question? Do you tell that other processor to "Go
away, come back later?" Or otherwise defer the other processor's
request? (Perhaps only after getting ownership of all locations (up to
8?) involved in the transaction?)

In other words, is there any guarantee of forward progress? Or could an
antagonist constantly interfere, e.g. by doing unlocked writes to one of
the up to 8 locations involved in a - I'll call it transaction.

In a hierarchical multicore system, with N cores, is the cache shared
amongst all N cores required to have associativity at least N*the
associativity of the L1 cache that is being used for the

MIPS currently uses the cache for (single-location) LL/SC, although
other mechanisms have been used historically. MIPS-based
microcontrollers can use bus-based locks for atomic set/clear bits.

ARM mostly seems to use a mechanism auxiliary to or separate from the
cache. This allows synchronization for uncached memory as well as cacheable.

Intel and AMD x86 CMPXCHG and other RMWs use mechanisms ranging from
cache based to bus lock based. They typically provide guarantees of
forward progress.

Intel HLE uses cache based mechanisms to almost transparently improve
the performance of legacy locking.

Intel RTM seems to use cache-based mechanisms. You are highly likely to
make forward progress up to the associativity of the cache, actually up
to the point where the transaction set thrashes. But as far as I can
tell Intel is not locking or deferring outside requests for items in the
transaction set, and can similarly be constrained by the associativity
of caches shared between cores and threads.

Unless the Mill is locking or deferring, your "performance
characteristics" would be very similar to Intel RTM's.

The big issue with locking and deferring is deadlock. Typically resolved
using one of two techniques:

a) acquire the lines in a standard order, e.g. sorted from low to high
physical address. Here "acquire" means "start deferring requests for
the line".

b) obtain the lines in any order, and acquire them (start deferring)
when you have obtained them all.

Actually, b) and a) are related: "obtain" may mean "get the line into my
cache in a state such that I can write to it" (typically invalidated in
all other caches (although being me I play with that)), and "acquire"
may mean "for a line that is obtained, switch to locking/deferring".
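The address-ordering idea in (a) has a familiar software analogue. A
hedged C++11 sketch with toy spinlocks (illustrative only - in the
hardware case the "locks" are cache lines and the order is over physical
line addresses):

```cpp
#include <atomic>
#include <cassert>

// Toy test-and-set spinlock standing in for "deferring requests for a
// line".
struct spinlock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    void lock()   { while (f.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { f.clear(std::memory_order_release); }
};

// Technique (a): always acquire in a fixed global order - lowest
// address first - so two parties can never each hold one lock while
// waiting on the other.
void lock_pair(spinlock* a, spinlock* b)
{
    if (b < a) { spinlock* t = a; a = b; b = t; }  // sort by address
    a->lock();
    b->lock();
}

void unlock_pair(spinlock* a, spinlock* b)
{
    // release order does not affect deadlock freedom
    b->unlock();
    a->unlock();
}
```

However the caller names the pair, both threads end up acquiring in the
same order, which is the whole point of the technique.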

---

Anticipating the usual confusion of terminology: when I say "locking"
here, I do NOT mean locked atomic RMW operations. I mean "whatever
mechanism allows the LL/SC or HTM mechanism to make forward progress".
Typically marking a line such that any request to it from another
processor or thread is deferred until the end of the transaction.

Many LL/SC or HTM mechanisms make no guarantees of forward progress. In
which case they may not use such mechanisms.

It is unfortunate that the terms collide. (But it is also natural: a low
level lock being used to implement a higher level non-lock-based
mechanism, or vice versa, and/or in many combinations.)



--
The content of this message is my personal opinion only. Although I am
an employee (currently Imagination Technologies, which acquired MIPS; in
the past of companies such as Intellectual Ventures/QIPS, Intel, AMD,
Motorola, and Gould), I reveal this only so that the reader may account
for any possible bias I may have towards my employers' products. The
statements I make here in no way represent my employers' positions on
the issue, nor am I authorized to speak on behalf of my employers, past
or present.

Andy (Super) Glew

unread,
Oct 27, 2013, 7:57:00 PM10/27/13
to
On 10/26/2013 12:45 PM, Ivan Godard wrote:
> It's a
> generalized LL/SC if you'd like to think of it that way.
>
> This is a data synchronization facility, not a code synchronization
> facility like monitors in the sense of Concurrent Pascal or the
> equivalent Java construct. That is, it is not coupled to a block in the
> host language with exclusive right to enter. You give it the read set
> and the write set and pull the trigger.
>
> Language runtimes can use this to implement monitors of course. Speaking
> as a language designer, I am quite opposed to monitors - they are fine
> for toys but do not scale with program complexity. Note that so-called
> "transactional memory - TM" is actually optimistic monitors and is not a
> transaction at all as the term is understood by databases and our
> much-maligned COBOL brethren.

While I share your distaste about monitors, I am not sure how I see that
hardware TM, like Intel's RTM, is actually "just" optimistic monitors.

I criticize it vociferously for its limitations, and I am quite doubtful
that its current implementation can make many, or any, guarantees of
forward progress.

But I fail to see how it is different from a transaction in an RDBMS,
except for scale, precision, and optimism.

It is not "give it the read set and write set and pull the trigger". It
discovers the read set and the write set dynamically. But at the end of
the transaction, it pulls the trigger and commits the write set, knowing
that the read set is undisturbed (and also that the write set is
exclusive in all caches).

It is not precise, in that there can be false collisions. No byte masks.

But I still think that it meets the definition of transaction.

In what way does it not meet your definition of transaction?

Ivan Godard

unread,
Oct 27, 2013, 9:39:44 PM10/27/13
to
On 10/27/2013 4:46 PM, Andy (Super) Glew wrote:
> On 10/25/2013 8:04 PM, Ivan Godard wrote:
>> On 10/25/2013 7:35 PM, Chris M. Thomasson wrote:
>>>> "Ivan Godard" wrote in message news:l4cer5$5rb$1...@dont-email.me... On
>>
>>
>>>> The Mill is designed to enable safe and efficient lock-free
>>>> algorithms, using optimistic concurrency.
>>>
>>> Am I bound to a "cascading" LL/SC type interface for such algorithms?
>>
>> It is optimistic (as LL/SC is) but where LL/SC can have at most one
>> value in the read set and one in the write set the Mill's form permits
>> much more and guarantees a number up to the wayness of the L1 cache.
>> Thus (for example) you can do a bi-linked list insert without a lock.
>>
>> We once had a DLL/DSC design, and dumped it when we discovered how to
>> use the cache instead.
>
> Do you "lock" entries in the "transaction set" between the beginning and
> end of the transaction?

No

> By "lock", I mean: what do you do if another processor tries to access
> the location in question? Do you tell that other processor to "Go
> away, come back later?" Or otherwise defer the other processor's
> request? (Perhaps only after getting ownership of all locations (upto
> 8?) involved in the transaction?)

Such a collision requires the other party to get write ownership of the
relevant part of the read set. The resulting invalidate, when received
by the xaction, will fail it.

> In other words, is there any guarantee of forward progress? Or could an
> antagonist constantly interfere, e.g. by doing unlocked writes to one of
> the upto 8 locations involved in a - I'll call it transaction.

There is no guarantee of forward progress in the xaction itself. The DOS
attack you describe can also occur outside the transaction context, by
forcing a reader to continually reacquire the line involved. Fairness is
thus the responsibility of the underlying cache coherence protocol; if
that is fair and precludes DOS then xactions are immune to DOS too.

> In a hierarchical multicore system, with N cores, is the cache shared
> amongst all N cores required to have associativity at least N*the
> associativity of the L1 cache that is being used for the

Implementation defined.

> MIPS currently uses the cache for (single location) LL/SC, although
> other mechanisms have been used historically. MIPS based
> microcontrollers can use bus-based locks for atomic set/clear bits.

In many ways our synch is a multiple LL/SC.
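For readers who haven't written one: an LL/SC update is a load, a local modify, and a conditional store that fails if the location changed in between; on machines without LL/SC it is conventionally emulated with compare-and-swap. A minimal C11 sketch of that shape (illustrative only, not Mill code; `atomic_update` and `incr` are invented names):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically apply f to *p, LL/SC style: load, compute, then a
 * conditional store (here a CAS) that fails and retries if another
 * thread touched the location in between. Returns the prior value. */
uint64_t atomic_update(_Atomic uint64_t *p, uint64_t (*f)(uint64_t))
{
    uint64_t old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, f(old)))
        ;   /* a failed CAS reloads old; just retry */
    return old;
}

uint64_t incr(uint64_t v) { return v + 1; }
```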

> ARM mostly seems to use a mechanism auxiliary to or separate from the
> cache. This allows synchronization for uncached memory as well as
> cacheable.

Not currently planned.

> Intel and AMD x86 CMPXCHG and other RMWs use mechanisms ranging from
> cache based to bus lock based. They typically provide guarantees of
> forward progress.

And round-robin to ensure fairness. We don't.

> Intel HLE uses cache based mechanisms to almost transparently improve
> the performance of legacy locking.

Yes; I was greatly impressed with the lock elision idea.

> Intel RTM seems to use cache based mechanisms. You are highly likely to
> make forward progress up to the associativity of the cache, actually up
> to the point where the the transaction set thrashes. But as far as I can
> tell Intel is not locking or deferring outside requests for items in the
> transaction set, and can similarly be constrained by the associativity
> of shared between cores and threads caches.
>
> Unless the Mill is locking or deferring, your "performance
> characteristics" would be very similar to Intel RTM's.

Yes, we expect so.

> The big issue with locking and deferring is deadlock. Typically resolved
> using one of two techniques:
>
> a) acquire the lines in a standard order, e.g. sorted from low to high
> physical address. Here "acquire" means "start deferring requests for
> the line".
>
> b) obtain the lines in any order, and acquire them (start deferring)
> when you have obtained them all.
>
> Actually, b) and a) are related: "obtain" may mean "get the line into my
> cache in a state such that I can write to it" (typically invalidated in
> all other caches (although being me I play with that)), and "acquire"
> may mean "for a line that is obtained, switch to locking/deferring".

Deferral is equivalent to multiple mutex locks, and subject to the
corresponding rules; acquire order is the most common way to assure
absence of deadlock, but there are priority issues that make mutexes a
real hard problem.

>
> Anticipating the usual confusion of terminology: when I say "locking"
> here, I do NOT mean locked atomic RMW operations. I mean "whatever
> mechanism allows the LL/SC or HTM mechanism to make forward progress".
> Typically marking a line such that any request to it from another
> processor or thread is deferred until the end of the transaction.

We explicitly do not do that. If you want pessimistic concurrency (i.e.
lockout/deferral) then you must use our primitive to implement a
conventional mutex, lock that, and do what a RMW pessimistic system does
now.
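As a concrete illustration of that fallback, a conventional mutex can be built from a single test-and-set primitive. A minimal C11 spinlock sketch (hypothetical stand-in, not Mill-specific; the names are invented):

```c
#include <stdatomic.h>

/* A conventional spinlock mutex built from one pessimistic-style
 * primitive (test-and-set). */
typedef struct { atomic_flag held; } spinlock;

void spin_lock(spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* spin until the flag was clear when we set it */
}

void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Once a thread holds the lock, it can do arbitrary non-idempotent work pessimistically, exactly as an RMW-based system does today.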

> Many LL/SC or HTM mechanisms make no guarantees of forward progress. In
> which case they may not use such mechanisms.
>
> It is unfortunate that the terms collide. (But it is also natural: a low
> level lock being used to implement a higher level non-lock-based
> mechanism, or vice versa, and/or in many combinations.)
>
>
>


There is no guarantee of forward progress or fairness, except as
statistically determined by backoff. The backoff makes lack of progress
and unfairness exponentially unlikely.
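The retry-with-backoff discipline looks like this in software. In the following hedged C11 sketch, a plain CAS stands in for the (undisclosed) Mill transaction primitive, and the loop bound is an arbitrary choice:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Optimistic commit with bounded exponential backoff. A plain CAS
 * models the transaction's commit point; "expect" plays the role of
 * the read set and "desired" the write set. Returns 1 on commit,
 * 0 if it gives up; there is no hard forward-progress guarantee. */
int commit_with_backoff(_Atomic uint64_t *loc, uint64_t expect,
                        uint64_t desired)
{
    for (unsigned delay = 1; delay < (1u << 16); delay <<= 1) {
        uint64_t e = expect;
        if (atomic_compare_exchange_strong(loc, &e, desired))
            return 1;                        /* committed */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                /* back off, then retry */
    }
    return 0;
}
```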

There are details I can't talk about yet. We make no claim to have
invented cache-based optimistic concurrency, but the application to the
rest of the Mill is of course novel.

Ivan Godard
Oct 27, 2013, 11:24:01 PM
In my terminology, an xaction must provide the ACID properties
(https://en.wikipedia.org/wiki/ACID). TM, in all systems I know of,
does not. Among other issues, updates are not persistent, there is no
provision for multiple participants or distribution, and connectivity is
assumed. The Oracle guys, even any COBOL programmer, are laughing. TM is
an optimistic synchronization device for transient state, no more.

There are two basic models for sync: data sync and code sync. In data
sync, the device provides programmatic means to identify read sets and
write sets, and to cause the write set to become visible atomically to
non-participants. The xaction has no syntactic relation to the host
language, may have multiple and distributed participants (and multiple
host languages), and has no scope.
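The data-sync model ("give it the read set and the write set and pull the trigger") can be sketched in software. The following is an illustrative stand-in only: a global lock provides the atomicity that real hardware would get from the cache protocol, and the names (`set_entry`, `commit`) are invented here:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

/* A read-set or write-set entry: an address, plus either the value it
 * must still hold (read set) or the value to publish (write set). */
typedef struct { _Atomic uint64_t *addr; uint64_t val; } set_entry;

/* Global lock standing in for hardware atomicity. */
static atomic_flag commit_lock = ATOMIC_FLAG_INIT;

/* "Pull the trigger": if every read-set entry is undisturbed, make the
 * whole write set visible atomically and return 1; else return 0. */
int commit(set_entry *rs, size_t nr, set_entry *ws, size_t nw)
{
    while (atomic_flag_test_and_set(&commit_lock))
        ;
    int ok = 1;
    for (size_t i = 0; i < nr; i++)
        if (atomic_load(rs[i].addr) != rs[i].val) { ok = 0; break; }
    if (ok)
        for (size_t i = 0; i < nw; i++)
            atomic_store(ws[i].addr, ws[i].val);
    atomic_flag_clear(&commit_lock);
    return ok;
}
```

Note that the sets are explicit arguments: nothing is tied to a syntactic block in the host language.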

Code sync is the idea of single occupant for a code region pursuant to
the syntax of a host language: only one active call of a particular
method, for example. There can be only one participant and one language,
by definition. Monitors in Concurrent Pascal started the whole thing;
note that they were added to a teaching language rather than a
production language.

Code sync is customarily implemented pessimistically by global locks,
but that raises the possibility of deadlock; because the locking is
implicit, avoiding deadlock requires detailed knowledge of the
(changing) implementation. TM is code sync using optimistic concurrency.
This avoids the deadlock issue, but raises the problem of non-idempotent
actions in the sync region. Pessimistic code sync permits i/o in the
region, whereas i/o and other non-idempotencies are semantically a
problem in optimistic code sync.

I oppose code sync in part because of those ill-defined semantics.
However, my largest beef is that the underlying read and write sets are
implicit and syntactically derived. In modern languages, it is
impossible to determine what variables are touched from inspection of
the source. Moreover, remote changes to source or libraries will change
what gets touched. Lastly, the naive will wrap code sync around what
they feel is a semantic unit, rather than around a reference unit; the
consequence is to drag a host of irrelevancies into the sync.

Watch what happens when the region calls "new" and half of malloc gets
added to the sets. Note that the call on "new" may be implicit and not
visible in the source. Incidentally, production code often provides its
own allocators, so a TM-aware malloc in clib is not a solution.

Then those naive TM users complain that their performance sucks due to
contention as loads get heavier. The constructs offer them no clue as to
what to do about this. Fixing the problem will commonly require them to
destroy the natural logic of the program so as to cluster the actual
sync'd data within an arbitrary syntactic wrapper, in effect
reconstructing data sync within a bastardized code sync wrapper.

So I prefer data sync, although I do admit that code sync is easier for
the novice; suitable for an introductory Java course at a junior
college, perhaps.

(Speaking of rants - where's Nick?)

Andy (Super) Glew
Oct 28, 2013, 12:19:26 AM
On 10/26/2013 4:17 PM, Ivan Godard wrote:
> On 10/26/2013 1:19 PM, Chris M. Thomasson wrote:
>>> "Ivan Godard" wrote in message news:l4cel1$4rn$1...@dont-email.me...
>>
>>> On 10/24/2013 12:54 PM, Chris M. Thomasson wrote:
> Now consider the RMW cost. Fair warning: I have never worked at the
> RMW-level in hardware that has it, and my understanding of the costs may
> well be wrong. However, if so I'm sure that Andy or someone else will
> set me right.
>
> My understanding is that RMW always actually updates DRAM, as it is
> implemented by seizing master control of the lines to memory.

No.

What you say sounds like the old "bus lock" implementation, which we
rendered obsolete in Intel P6 circa 1996.

Processors since then have implemented atomic RMWs in the cache, by
"locking" smaller or larger parts of the cache. Where "locking" means
deferring other processors requesting the line(s) in question until the
RMW transaction is finished.

Bus locks may have endured when accessing uncached memory. I know the
bus people tried to get rid of them completely, e.g. by mandating that
you would not be allowed to do


> Moreover,
> because it is impractical to interleave RMW and normal load/store to the
> same line, the first step of a RMW is to drain the load and store
> buffers completely.

Again, old implementation.

We did this on P6 - we did not need to, but did because I was scared
that something so new might have bugs. And it did - the pipeline drain
saved our butt.

Modern machines do not drain the pipe.


> That is, RMW acts like a MEMBAR both before and
> after the RMW itself (although perhaps some machines may require that
> the software insert the MEMBARs rather than having the hardware do it).

Artifact. Most ISAs that are not SC require membar semantics before and
after the RMW. E.g. x86.

But on many implementations obtaining such membar semantics is now
cheap. (Mainly those with an ordered interconnect.)

One of my minor contributions to Itanium may have been encouraging them
to separate the membar from the atomic RMW. There are situations where
all you need is the atomic RMW, not the memory ordering side effect.

(I say "may have been", because although I know I suggested this to
Itanium, others with more involvement and influence may have done so as
well.)


> As with the Mill, the R side of RMW may miss, and it takes as long as it
> takes. The Mill could hit in the counter without having the rest of the
> line, avoiding a miss that RMW would have. RMW W side has a more time
> consuming invalidate. And the write back to DRAM will be several hundred
> cycles.
>
> So at the machine level, the Mill's optimistic concurrency does need a
> loop in a function rather than a native operation, but it runs entirely
> in cache and takes a dozen cycles or so, while RMW involves DRAM and is
> good for several hundred cycles.
>
> Your choice.

If you are shopping for red herrings.

Andy (Super) Glew
Oct 28, 2013, 12:25:31 AM
It would be the same, if there were a loop in microcode or hardware.

There is not.

Or, looked at another way, hardware that defers remote snoops until the
transaction is over moves what is a LOCAL loop in an LL/SC
implementation to a REMOTE retry loop.

The tradeoff:

+ the hardware implementation of atomic RMWs makes many useful
guarantees of forward progress.

+ the optimistic concurrency control approach of LL/SC becomes
complicated when you try to make useful guarantees of forward progress.
So much so that many bus implementors have asked to NOT have to
implement LL/SC.

+ but LL/SC allows many different synchronization primitives to be
implemented in software. Whereas placing atomic RMWs in the ISA
necessarily has a limited repertoire. (Unless you manage to extract the
code between the LL and SC, and export that.)

Hybrid implementations are possible.
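The flexibility point can be made concrete: an optimistic primitive lets software synthesize RMWs the hardware repertoire lacks. Here a C11 CAS loop (standing in for the code between an LL and SC) builds a saturating fetch-and-add:

```c
#include <stdatomic.h>
#include <stdint.h>

/* A saturating fetch-and-add: an RMW that fixed hardware repertoires
 * (XADD, CMPXCHG, ...) do not provide directly, but that an
 * LL/SC-style loop synthesizes from plain CAS. Assumes *p <= limit. */
uint32_t fetch_add_sat(_Atomic uint32_t *p, uint32_t n, uint32_t limit)
{
    uint32_t old = atomic_load(p);
    for (;;) {
        uint32_t nv = (limit - old < n) ? limit : old + n;
        if (atomic_compare_exchange_weak(p, &old, nv))
            return old;   /* old is refreshed by each failed CAS */
    }
}
```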

Andy (Super) Glew
Oct 28, 2013, 12:42:34 AM
On 10/27/2013 6:39 PM, Ivan Godard wrote:
> On 10/27/2013 4:46 PM, Andy (Super) Glew wrote:
>> On 10/25/2013 8:04 PM, Ivan Godard wrote:
>>> On 10/25/2013 7:35 PM, Chris M. Thomasson wrote:
>>>>> "Ivan Godard" wrote in message news:l4cer5$5rb$1...@dont-email.me... On

>> In other words, is there any guarantee of forward progress? Or could an
>> antagonist constantly interfere, e.g. by doing unlocked writes to one of
>> the upto 8 locations involved in a - I'll call it transaction.
>
> There is no guarantee of forward progress in the xaction itself. The DOS
> attack you describe can also occur outside the transaction context, by
> forcing a reader to continually reacquire the line involved. Fairness is
> thus the responsibility of the underlying cache coherence protocol; if
> that is fair and precludes DOS then xactions are immune to DOS too.

Is your xaction a single bus operation, or multiple?

In other words, do you acquire the members of your read and write sets a
single load and store (augmented) at a time? With each taking cache
misses independently? Or do you specify the entire read and write set
as a single action?

If multiple, separate, bus operations, then you must have a different
definition of fairness than most bus systems do. Many systems implement
round robin. Consider a system with only two participants, the xaction
processor and the antagonist. On a round robin system the antagonist
will get an opportunity separated by at most one bus operation from the
xaction processor.

Stuff like this can be avoided by "batching" requests - giving the
xaction processor N adjacent bus operation slots. But getting this
right is one of the reasons why guaranteeing forward progress for
LL/SC can be challenging.

>
>> In a hierarchical multicore system, with N cores, is the cache shared
>> amongst all N cores required to have associativity at least N*the
>> associativity of the L1 cache that is being used for the
>
> Implementation defined.
>
>> MIPS currently uses the cache for (single location) LL/SC, although
>> other mechanisms have been used historically. MIPS based
> microcontrollers can use bus-based locks for atomic set/clear bits.
>
> In many ways our synch is a multiple LL/SC.
>
>> ARM mostly seems to use a mechanism auxiliary to or separate from the
>> cache. This allows synchronization for uncached memory as well as
>> cacheable.
>
> Not currently planned.
>
>> Intel and AMD x86 CMPXCHG and other RMWs use mechanisms ranging from
>> cache based to bus lock based. They typically provide guarantees of
>> forward progress.
>
> And round-robin to ensure fairness. We don't.

Ah, good, I missed that. What is your bus scheduling system?

I see exponential backoff below.
Ah, exponential backoff.

Konrad Lai would be happy.

Coming from a real-time background, I was uncomfortable with this.

But, also, coming from a real time background, I saw how so many times
hard real time guarantees gave way to "good enough" guarantees in the
marketplace. Some day I am sure we will see a plane (or spaceship)
malfunction because of this. But in the meantime, good enough beats
provably correct time and time again.

--

Actually, I think that exponential backoff would be the key to deferral.

Exponential backoff in the xaction can provide no guarantees of forward
progress.

Exponential backoff in the antagonist, or anyone else disrupting a
xaction, can make guarantees of forward progress.


>
> There are details I can't talk about yet. We make no claim to have
> invented cache-based optimistic concurrency, but the application to the
> rest of the Mill is of course novel.


Andy (Super) Glew
Oct 28, 2013, 12:54:50 AM
On 10/27/2013 8:24 PM, Ivan Godard wrote:
> On 10/27/2013 4:57 PM, Andy (Super) Glew wrote:
>> On 10/26/2013 12:45 PM, Ivan Godard wrote:
>>> It's a
>>> generalized LL/SC if you'd like to think of it that way.
>>>
>>> This is a data synchronization facility, not a code synchronization
>>> facility like monitors in the sense of Concurrent Pascal or the
>>> equivalent Java construct. That is, it is not coupled to a block in the
>>> host language with exclusive right to enter. You give it the read set
>>> and the write set and pull the trigger.
>>>
>>> Language runtimes can use this to implement monitors of course. Speaking
>>> as a language designer, I am quite opposed to monitors - they are fine
>>> for toys but do not scale with program complexity. Note that so-called
>>> "transactional memory - TM" is actually optimistic monitors and is not a
>>> transaction at all as the term is understood by databases and our
>>> much-maligned COBOL brethren.
>>
>> While I share your distaste for monitors, I am not sure I see how
>> hardware TM, like Intel's RTM, is actually "just" optimistic monitors.
>>
>> But I fail to see how it is different from a transaction in an RDBMS,
>> except for scale, precision, and optimism.
>>
>> In what way does it not meet your definition of transaction?
>>
>>
>
> In my terminology, an xaction must provide the ACID properties
> (https://en.wikipedia.org/wiki/ACID). TM, in all systems I know of, does
> not. Among other issues, updates are not persistent

?? Persistent as in written to tape and stored in Iron Mountain?
Probably not.

But many commercial database systems are successful synchronizing only
to memory. Possibly battery backed up.

(One thing I like about log based architectures and microarchitectures
is that they can work better when real persistence is the goal.)

Looking forward to hearing how the Mill provides persistence better than
battery backed up DRAM.

>, there is no
> provision for multiple participants or distribution, and connectivity is
> assumed.

Fair enough.

(Did I mention log based?)

> The Oracle guys, even any COBOL programmer, are laughing. TM is
> an optimistic synchronization device for transient state, no more.

Not so sure about the Oracle guys. I took a transactional database
class with the Oracle kernel team, given by Jim Gray.


> There are two basic models for sync: data sync and code sync.

Another annoying tutorial. If I were feeling like I felt a few days
ago, I would say "Blah, blah, blah."



> Watch what happens when the region calls "new" and half of malloc gets
> added to the sets. Note that the call on "new" may be implicit and not
> visible in the source. Incidentally, production code often provides its
> own allocators, so a TM-aware malloc in clib is not a solution.

That is a problem.

Many HTM and STM systems provide mechanisms to "escape" the transaction,
so that certain actions are not included.

Andy (Super) Glew
Oct 28, 2013, 12:57:42 AM
On 10/27/2013 9:19 PM, Andy (Super) Glew wrote:
> On 10/26/2013 4:17 PM, Ivan Godard wrote:
>> On 10/26/2013 1:19 PM, Chris M. Thomasson wrote:
>>>> "Ivan Godard" wrote in message news:l4cel1$4rn$1...@dont-email.me...
>>>
>>>> On 10/24/2013 12:54 PM, Chris M. Thomasson wrote:
>> Now consider the RMW cost. Fair warning: I have never worked at the
>> RMW-level in hardware that has it, and my understanding of the costs may
>> well be wrong. However, if so I'm sure that Andy or someone else will
>> set me right.
>>
>> My understanding is that RMW always actually updates DRAM, as it is
>> implemented by seizing master control of the lines to memory.
>
> No.
>
> What you say sounds like the old "bus lock" implementation, which we
> rendered obsolete in Intel P6 circa 1996.
>
> Processors since then have implemented atomic RMWs in the cache, by
> "locking" smaller or larger parts of the cache. Where "locking" means
> deferring other processors requesting the line(s) in question until the
> RMW transaction is finished.
>
> Bus locks may have endured when accessing uncached memory.

I forgot to mention that some CPUs, and nearly all GPUs, implement
atomics at memory and/or in a shared cache controller - by placing ALUs
at the cache controller. So they do not require locking cache lines.

Ivan Godard
Oct 28, 2013, 1:04:53 AM
On 10/27/2013 9:42 PM, Andy (Super) Glew wrote:
> On 10/27/2013 6:39 PM, Ivan Godard wrote:
>> On 10/27/2013 4:46 PM, Andy (Super) Glew wrote:
>>> On 10/25/2013 8:04 PM, Ivan Godard wrote:
>>>> On 10/25/2013 7:35 PM, Chris M. Thomasson wrote:
>>>>>> "Ivan Godard" wrote in message news:l4cer5$5rb$1...@dont-email.me... On
>
>>> In other words, is there any guarantee of forward progress? Or could an
>>> antagonist constantly interfere, e.g. by doing unlocked writes to one of
>>> the upto 8 locations involved in a - I'll call it transaction.
>>
>> There is no guarantee of forward progress in the xaction itself. The DOS
>> attack you describe can also occur outside the transaction context, by
>> forcing a reader to continually reacquire the line involved. Fairness is
>> thus the responsibility of the underlying cache coherence protocol; if
>> that is fair and precludes DOS then xactions are immune to DOS too.
>
> Is your xaction a single bus operation, or multiple?

Commonly none at all. As described in ootbcomp.com/docs/memory, Mill
data is often backless and there are no memory bus actions at all
arising from load and store. There is still cache coherency though, and
the CC protocol is per load/store within a xaction, not collected and
applied all at once.

> In other words, do you acquire the members of your read and write sets a
> single load and store (augmented) at a time? With each taking cache
> misses independently? Or do you specify the entire read and write set
> as a single action?
>
> If multiple, separate, bus operations, then you must have a different
> definition of fairness than most bus systems do. Many systems implement
> round robin. Consider a system with only two participants, the xaction
> processor and the antagonist. On a round robin system the antagonist
> will get an opportunity separated by at most one bus operation from the
> xaction processor.
>
> Stuff like this can be avoided by "batching" requests - giving the
> xaction processor N adjacent bus operation slots. But getting this
> right is one of the reasons why guaranteeing forward progress for
> LL/SC can be challenging.

Whether xaction or not, if requests actually get as far as the memory
bus/controller then there is a round robin. This has no specific impact
on xactions.


>>
>> And round-robin to ensure fairness. We don't.
>
> Ah, good, I missed that. What is your bus scheduling system?

There is RR on bus actions, but few load/stores get to the bus. The CC
(as opposed to the bus) may RR but is not defined to do so, and only
would on low end systems. High end will be point-to-point.


<snip>

>>
>>
>> There is no guarantee of forward progress or fairness, except as
>> statistically determined by backoff. The backoff makes lack of progress
>> and unfairness exponentially unlikely.
>
> Ah, exponential backoff.
>
> Konrad Lai would be happy.
>
> Coming from a real-time background, I was uncomfortable with this.
>
> But, also, coming from a real time background, I saw how so many times
> hard real time guarantees gave way to "good enough" guarantees in the
> marketplace. Some day I am sure we will see a plane (or spaceship)
> malfunction because of this. But in the meantime, good enough beats
> provably correct time and time again.
>
> --
>
> Actually, I think that exponential backoff would be the key to deferall.
>
> Exponential backoff in the xaction can provide no guarantees of forward
> progress.
>
> Exponential backoff in the antagonist, or anyone else disrupting a
> xaction, can make guarantees of forward progress.

The Mill is not suitable for hard real time. No machine with a cache is
suitable.

More must wait for the filings behind the multicore talk.

Ivan

Ivan Godard
Oct 28, 2013, 1:08:06 AM
On 10/27/2013 9:54 PM, Andy (Super) Glew wrote:

>
> Another annoying tutorial. If I were feeling like I felt a few days
> ago, I would say "Blah, blah, blah."

You are not the only one who reads these posts.


Noob
Oct 28, 2013, 7:52:02 AM
Ivan Godard wrote: [snip]

Wow! +1 Insightful. Thanks.

Michael S
Oct 28, 2013, 9:52:10 AM
Very interesting and insightful post but, IMHO, too philosophical for the question at hand.
Things are discussed at too high a level of abstraction.
Down to bits, bytes, cache lines, and coherence packet exchanges, it's not clear to me *why*, at light load, the performance of the optimistic solution (supposedly, an LL/SC loop) would be better than the performance of the "pessimistic" RMW.
It seems to me that in the best case, i.e. when the cache line of interest is already owned by the L1D of the requesting core, RMW will be at least as fast as LL/SC, but more likely somewhat faster.
In the second best case, i.e. when the cache line of interest is owned by another core, but no other agent was interested in this line at about the same time ("about the same time" defined as approximately the time of a single coherence exchange), it seems to me that RMW and LL/SC will be equally fast (or equally slow, relative to the first case).

What am I missing?

EricP
Oct 28, 2013, 10:41:50 AM
Andy (Super) Glew wrote:
>
> I forgot to mention that some CPUs, and nearly all GPUs, implement
> atomics at memory and/or in a shared cache controller - by placing ALUs
> at the cache controller. So they do not require locking cache lines.

I was thinking that would be a good way to do it, in the Cache Controller
because atomic FetchADD, FetchAND, FetchOR, FetchXOR, FetchNOT,
and AtomicXchg all fit into the existing load/store queue structure
- transport the operands to the CC instead of data values.

However, atomic CmpXchg, and the double-wide XchgDW and CmpXchgDW,
could require an address and two double-wide operands, so you wind up
having many resources in the load/store queue allocated to handle
these rare instructions.

That looks to force one's hand and require cache line pinning,
and move the data into the core then back out, and all the
associated handshaking that would necessitate.

Alternatively, there could be a special function unit
for handling the double wide operations (DWFU), but then
syncing with the load/store queue becomes the problem.
I suppose instead of a queue drain you could insert
a special command in the LSQ that when it reaches the
queue head triggers the DWFU to do its thing.

Eric

Andy (Super) Glew
Oct 28, 2013, 1:18:04 PM
On 10/28/2013 7:41 AM, EricP wrote:
> Andy (Super) Glew wrote:

>> I forgot to mention that some CPUs, and nearly all GPUs, implement
>> atomics at memory and/or in a shared cache controller - by placing
>> ALUs at the cache controller. So they do not require locking cache
>> lines.
>
> I was thinking that would be a good way to do it, in the Cache Controller
> because atomic FetchADD, FetchAND, FetchOR, FetchXOR, FetchNOT,
> and AtomicXchg all fit into the existing load/store queue structure
> - transport the operands to the CC instead of data values.
>
> However, atomic CmpXchg, and the double-wide XchgDW and CmpXchgDW,
> could require an address and two double-wide operands, so you wind up
> having many resources in the load/store queue allocated to handle
> these rare instructions.

At the moment all of the CPUs I am familiar with use the same load/store
queue structures to handle packed SIMD vectors (like Intel's SSE and
AVX, or MIPS MSA (MIPS SIMD Architecture)) that they use for integers.

I.e. 128 or 256 bit or wider buffers.

So a double wide 2x64 = 128 bit operand would fit.

GPUs tend to operate on much wider data items.

As long as the synchronization item is less than or equal to the widest
data item, no problem.

I think that it is wasteful to spend a 128- or 256-bit store buffer
entry on an 8/16/32/64-bit store. But that is what is done.




> That looks to force ones hand and require cache line pinning,
> and move the data into the core then back out, and all the
> associated handshaking that would necessitate.

Like I said, not an issue.

Not until you need to involve more than a single cache line.

Andy (Super) Glew
Oct 28, 2013, 1:26:14 PM
On 10/26/2013 10:25 PM, Ivan Godard wrote:
> The hardware can only do one RMW at a time. If there are 20 cores
> contending, each core will have a sync bandwidth that is a 20th of what
> is available without contention.

Not on any reasonably modern machine.

Since P6 in 1997, Intel x86 can do one RMW at a time per CPU, but every
CPU can be doing RMWs simultaneously and overlapped, so long as they are
not thrashing the same cache set.

I believe they can now do one RMW per thread.

Basically the same limitation as LL/SC.

They can't do more than one atomic RMW per thread. *That* is what
primitives such as LLn/SCn get you.

George Neuner
Oct 28, 2013, 3:27:00 PM
On Sun, 27 Oct 2013 22:04:53 -0700, Ivan Godard <iv...@ootbcomp.com>
wrote:

>No machine with a cache is suitable [for hard real time].

Caching affects the average behavior, but HRT is concerned with worst
case behavior. Algorithms which depend on precise, or statistical,
access times for correctness are broken a priori.

A (suitably fast) single core CPU provably can handle HRT regardless
of caching - if the program operates correctly with the cache turned
off, it *should* operate correctly with the cache turned on. If not,
it is the program rather than the cache which is at fault.

Multiprocessor systems are far more difficult to prove correct - most
developers sidestep the issue by tying HRT threads to a particular
core/CPU and, if necessary, placing data structures so as to minimize
cache line interference.

In any event, your blanket assertion is simply incorrect ... cached
single core CPUs are no longer a rarity in HRT. Big OoO CPUs are not
popular, but powerful sequential CPUs are no longer shunned. They
are, in fact, seeing more use as more low power models become
available. Multicores still are rare, but as computing needs continue
to grow, multicore chips are being considered more and more often.

Caching and multiprocessing in RT systems (both hard and soft) are
recurrent topics in comp.arch.embedded. You might want to search for
some of those discussions.

George

Ivan Godard
Oct 28, 2013, 4:29:02 PM
On 10/26/2013 10:25 PM, Ivan Godard wrote:
> On 10/26/2013 7:17 PM, Chris M. Thomasson wrote:
>>> "Ivan Godard" wrote in message news:l4hm8n$sb$1...@dont-email.me...

<snip tutorial>

Andy tells me that RMW no longer locks the memory bus the way it used
to. Shows you what you get when a software guy talks about other
people's hardware :-( Thank you, Andy - you sure I can't wheedle you
out of MIPS?

Without bus locking, the optimistic and pessimistic approaches should
have similar cost for those actions that can be expressed in both, for
example testAndSet and compareAndSwap. The pessimistic forms require the
ability to lock a cache line from preemption, to perform simple
arithmetic within the cache, and to refuse or buffer preemption
requests, none of which should be too hard or take too long.
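To make the comparison concrete, here is test-and-set written both ways in C11: once as a single native-style RMW, and once as the optimistic load-plus-conditional-store loop. Both return the prior value, and under light contention the optimistic loop body runs exactly once:

```c
#include <stdatomic.h>

/* Test-and-set both ways. The pessimistic form is one native-style
 * RMW; the optimistic form is a load plus a conditional store,
 * retried on interference. Both return the prior value. */
int tas_native(_Atomic int *p)
{
    return atomic_exchange(p, 1);        /* single hardware RMW */
}

int tas_optimistic(_Atomic int *p)
{
    int old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, 1))
        ;                                /* retry; usually runs once */
    return old;
}
```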

However, I'll stick my neck out into the hardware domain once again and
assert that all pessimistic primitives since DCAS on the 68000 have been
mono-addressed: only one datum participates in the primitive. If you
don't mind hacking your data structure you can treat that monodatum as
two adjacent data to provide the illusion of DCAS - see
https://en.wikipedia.org/wiki/Double_compare-and-swap - but that's not
the same thing.
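The hack looks like this in C11 (my sketch of the Wikipedia trick, assuming the two 32-bit data are adjacent and fit in one 64-bit word that the hardware can CAS as a unit):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Pack two adjacent 32-bit fields into one 64-bit word so a single-width
   CAS updates both atomically. Not a true DCAS: the two fields must live
   in the same word, not at two arbitrary addresses. */

static inline uint64_t pack(uint32_t lo, uint32_t hi) {
    return (uint64_t)lo | ((uint64_t)hi << 32);
}

static bool pseudo_dcas(_Atomic uint64_t *p,
                        uint32_t exp_lo, uint32_t exp_hi,
                        uint32_t new_lo, uint32_t new_hi) {
    uint64_t expected = pack(exp_lo, exp_hi);
    uint64_t desired  = pack(new_lo, new_hi);
    return atomic_compare_exchange_strong(p, &expected, desired);
}
```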

Early Mill had DLL/DLC using snoopers, which was easy in a monocore but
we were doubtful about multicore. We dropped it when we became aware of
the more general cache-based optimistic algorithms. A large number of
lock-free or wait-free algorithms require multidata synchronization (or
truly tortuous algorithms if they can be done at all), so we are happy
to provide the optimistic primitive for its richer semantics and freedom
from deadlock and priority issues, if for no other reasons.

Should we *also* provide a pessimistic primitive, to appease those who
are appalled by the possibility of taking a loop? Certainly we should if
we were to enter the real-time markets, but that seems unlikely.
Otherwise we can wait and see what the market cares about.


Ivan Godard

unread,
Oct 28, 2013, 4:43:14 PM10/28/13
to
On 10/28/2013 12:27 PM, George Neuner wrote:
> On Sun, 27 Oct 2013 22:04:53 -0700, Ivan Godard <iv...@ootbcomp.com>
> wrote:
>
>> No machine with a cache is suitable [for hard real time].
>
> Caching affects the average behavior, but HRT is concerned with worst
> case behavior. Algorithms which depend on precise, or statistical,
> access times for correctness are broken a priori.
>
> A (suitably fast) single core CPU provably can handle HRT regardless
> of caching - if the program operates correctly with the cache turned
> off, it *should* operate correctly with the cache turned on. If not,
> it is the program rather than the cache which is at fault.


Well, yes, of course, but a machine with the cache turned off is not the
architecture at point - don't buy a Mill with the cache turned off,
instead buy a Z80, it will be cheaper.


> Multiprocessor systems are far more difficult to prove correct - most
> developers sidestep the issue by tying HRT threads to a particular
> core/CPU and, if necessary, placing data structures so as to minimize
> cache line interference.
>
> In any event, your blanket assertion is simply incorrect ... cached
> single core CPUs are no longer a rarity in HRT. Big OoO CPUs are not
> popular, but powerful sequential CPUs are no longer shunned. They
> are, in fact, seeing more use as more low power models become
> available. Multicores still are rare, but as computing needs continue
> to grow, multicore chips are being considered more and more often.
>
> Caching and multiprocessing in RT systems (both hard and soft) are
> recurrent topics in comp.arch.embedded. You might want to search for
> some of those discussions.

And perhaps someone with a HRT program that needs no more than you can
get out of a cache-less Mill will buy Mills for the other advantages;
more power to them. They will have bought and paid for a waste of 80% or
so of the chip area, but market effects could make the choice cheaper in
dollars than using a HRT-specific design.

I've done HRT, and could do an HRT architecture. Almost certainly it
would be a belt machine, but it wouldn't be a Mill.

Paul A. Clayton

unread,
Oct 28, 2013, 5:50:11 PM10/28/13
to
On Monday, October 28, 2013 12:25:31 AM UTC-4, Andy (Super) Glew wrote:
[snip]
> Or, looked at another way, hardware that defers remote snoops until the
> transaction is over moves what is a LOCAL loop in an LL/SC
> implementation to a REMOTE retry loop.
>
> The tradeoff:
>
> + the hardware implementation of atomic RMWs makes many useful
> guarantees of forward progress.
>
> + the optimistic concurrency control approach of LL/SC becomes
> complicated when you try to make useful guarantees of forward progress.
> So much so that many bus implementors have asked to NOT have to
> implement LL/SC.
>
> + but LL/SC allows many different synchronization primitives to be
> implemented in software. Whereas placing atomic RMWs in the ISA
> necessarily has a limited repertoire. (Unless you manage to extract the
> code between the LL and SC, and export that.)
>
> Hybrid implementations are possible.

IBM zSeries constrained transactions provide guaranteed
forward progress with substantial limitations on the
code stream and the data accesses. At the most extreme
degree of failure, I think it falls back to hardware
locks.

Perhaps tokens/versions might be used to make the
required hardware complexity more manageable.

It seems that it would help if software could provide
a lock identifier (possibly including a mask to allow
different lock granularity??) that is guaranteed to
be sufficient by itself for guarding the accesses.
Such could be used to filter accesses since if the
lock identifier did not match, the access could not
conflict; but it might also be used to facilitate
the ordering of conflicting transactions by reducing
the search space. (No, I have not thought this through
nor am I especially familiar with the basic principles
in this area. I share this in case it actually has
some interesting aspect.)

Stefan Monnier

unread,
Oct 30, 2013, 3:15:15 PM10/30/13
to
> A (suitably fast) single core CPU provably can handle HRT regardless
> of caching - if the program operates correctly with the cache turned
> off, it *should* operate correctly with the cache turned on. If not,
> it is the program rather than the cache which is at fault.

You can do better than that. You can do static analysis of the cache
behavior of your program, predicting which memory accesses *will* hit
the cache (and which ones may miss or may hit).

It doesn't always give useful results, admittedly (and assumes
cooperative concurrency so you know when you cache might get
flushed/trashed by another process/thread).


Stefan

Andy (Super) Glew

unread,
Oct 30, 2013, 7:24:37 PM10/30/13
to
On 10/28/2013 12:27 PM, George Neuner wrote:
> On Sun, 27 Oct 2013 22:04:53 -0700, Ivan Godard <iv...@ootbcomp.com>
> wrote:
>
>> No machine with a cache is suitable [for hard real time].
>
> Caching affects the average behavior, but HRT is concerned with worst
> case behavior. Algorithms which depend on precise, or statistical,
> access times for correctness are broken a priori.
>
> A (suitably fast) single core CPU provably can handle HRT regardless
> of caching - if the program operates correctly with the cache turned
> off, it *should* operate correctly with the cache turned on. If not,
> it is the program rather than the cache which is at fault.

One of my friends used to do a lot of very close to the metal
programming - not Hard Real Time in the sense of being safety critical,
but hard because back in the day it was hard to do video and audio on a
PC without artifacts.

He complained bitterly about the transition from write-through to
write-back caches: on a non-speculative in-order machine he would know
when cache line fills were being done, and would arrange them so as to
maximize DRAM performance by being in the same DRAM page/bank.

But when he ran into write-back caches, and then speculation, and then
out-of-order, he could not really know when a cache line eviction was
going to occur. So, evictions occurred at almost random times, and
mucked up his careful scheduling of DRAM pages/banks.

=> Caching *can* affect the worst case behavior negatively.

Fortunately on average this unfortunate worst case degradation is
probably unlikely to occur. (I had fun writing that sentence: can we
make it even more twisty?)

Terje Mathisen

unread,
Nov 12, 2013, 7:04:02 AM11/12/13
to
Andy (Super) Glew wrote:
> On 10/28/2013 12:27 PM, George Neuner wrote:
>> A (suitably fast) single core CPU provably can handle HRT regardless
>> of caching - if the program operates correctly with the cache turned
>> off, it *should* operate correctly with the cache turned on. If not,
>> it is the program rather than the cache which is at fault.
>
> One of my friends used to do a lot of very close to the metal
> programming - not Hard Real Time in the sense of being safety critical,
> but hard because back in the day it was hard to do video and audio on a
> PC without artifacts.

I resemble that remark...
>
> He complained bitterly about the transition from write-through to
> write-back caches: on a non-speculative in-order machine he would know
> when cache line fills were being done, and would arrange them so as to
> maximize DRAM performance by being in the same DRAM page/bank.

but it wasn't me
>
> But when he ran into write-back caches, and then speculation, and then
> out-of-order, he could not really know when a cache line eviction was
> going to occur. So, evictions occurred at almost random times, and
> mucked up his careful scheduling of DRAM pages/banks.
>
> => Caching *can* affect the worst case behavior negatively.

I have found that it is _very_ hard to come up with consistent PREFETCH
wins on a machine with unpredictable (i.e. write back) cache behavior.
>
> Fortunately on average this unfortunate worst case degradation is
> probably unlikely to occur. (I had fun writing that sentence: can we
> make it even more twisty?)
>

Naa, it's done: Stick a fork in it!

Terje
(Just back from two weeks of cruising in the Caribbean, with no net access.)

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,
Nov 12, 2013, 7:24:32 AM11/12/13
to
Ivan Godard wrote:
> p.s. Does anybody have a microbenchmark that just does an infinite
> number of IDs to the same counter? What's the cycle count per ID?

I wrote a program to test the maximum communication rates for a shared
memory interface to an NTP reference clock:

A single producer which wrote (cache-line aligned) blocks of simulated
timestamps and N consumers which all tried to read as many of these as
possible.

It used XADD as the only primitive.

Let me look around... yes, I found it (see below).

I get around 20-25 M updates/second (each filling a full cache line,
plus updates to a few XADD'ed counters), so around 100-130 clock
cycles/update.

For the readers, retries happen when an external event has caused a
thread to be preempted in the middle of a (non-atomic) time stamp read,
and delayed it long enough that the writer FIFO wraps around.

Each of the reader threads is just a little bit faster than the
producer, so they will sometimes re-read the same value. When I did the
same test with just 4 threads total (i.e. one per physical core on this
4x2 hyperthreaded i7) the readers were slower, most probably because the
OS (Win7) was continuously moving these 3 cpu-bound reader threads back
& forth between the 7 available HW threads. :-(
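The wrap-detection idea is roughly the following (my single-threaded reconstruction of the scheme, not Terje's actual program; slot layout and names are assumed). A reader copies a slot non-atomically, then checks whether the writer has lapped it; real code would also need memory-ordering fences:

```c
#include <stdint.h>
#include <assert.h>

#define SLOTS 16  /* power of two; Terje's -b option */

struct slot { uint64_t seq; uint64_t stamp; };

struct ring {
    volatile uint64_t wseq;   /* next sequence the writer will use */
    struct slot buf[SLOTS];
};

static void ring_write(struct ring *r, uint64_t stamp) {
    uint64_t s = r->wseq;
    r->buf[s % SLOTS].seq   = s;
    r->buf[s % SLOTS].stamp = stamp;
    r->wseq = s + 1;          /* real code needs a store fence before this */
}

/* Returns 0 on a consistent read, -1 if the writer may have lapped the
   slot while we copied it (the copy may be torn: retry). */
static int ring_read(const struct ring *r, uint64_t seq, uint64_t *out) {
    struct slot copy = r->buf[seq % SLOTS];   /* non-atomic copy */
    if (copy.seq != seq || r->wseq - seq >= SLOTS)
        return -1;
    *out = copy.stamp;
    return 0;
}
```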

Terje

D:\c2\shm-test>Release\shm-test.exe /?
Usage shm-test [options]
-b buffer_slots (default 4, range 2-64 in powers of two)
-n nr_of_threads (default 4, range 2-256)
-t seconds_to_run (default 60)

D:\c2\shm-test>Release\shm-test.exe -b 16 -n 8 -t 30
Testing with 16 buffer slots, 8 threads for 30 seconds
Writer affinity = 1, old = ff
Reader 1 affinity = fe, old = ff
Reader 2 affinity = fe, old = ff
Reader 3 affinity = fe, old = ff
Reader 4 affinity = fe, old = ff
Reader 5 affinity = fe, old = ff
Reader 6 affinity = fe, old = ff
Reader 7 affinity = fe, old = ff
Done waiting 30 seconds

All threads done! (i = 9)
Writer:
623835335 iterations, 2.0795e+007 iterations/second
Thread 1:
681683801 iterations, 417307646 updating 10473 retries 0 errors
Thread 2:
708901882 iterations, 411172462 updating 7293 retries 0 errors
Thread 3:
727857586 iterations, 387066847 updating 4911 retries 0 errors
Thread 4:
729632983 iterations, 422608342 updating 3965 retries 0 errors
Thread 5:
681082191 iterations, 416579655 updating 6597 retries 0 errors
Thread 6:
686005060 iterations, 417456171 updating 5161 retries 0 errors
Thread 7:
680920039 iterations, 416288433 updating 3189 retries 0 errors
Reader totals:
4896083542 iterations, 2888479556 updating 41589 retries 0 errors
Reader totals:
1.632e+008 iterations/second, 59.00% updating 0.00% retries 0.000% errors

Chris M. Thomasson

unread,
Nov 12, 2013, 6:51:59 PM11/12/13
to
> "Terje Mathisen" wrote in message
> news:69j9la-...@ntp-sure.tmsw.no...

> Ivan Godard wrote:
> > p.s. Does anybody have a microbenchmark that just does an infinite
> > number of IDs to the same counter? What's the cycle count per ID?

> I wrote a program to test the maximum communication rates for a shared
> memory interface to an NTP reference clock:

> A single producer which wrote (cache-line aligned) blocks of simulated
> timestamps and N consumers which all tried to read as many of these as
> possible.

> It used XADD as the only primitive.

You probably know the difference between testing a fetch-and-add operation
implemented with a CMPXCHG loop vs a single XADD. Well, XADD seems to
clean its clock. This is why I dislike looping around on failure
conditions in software when performing simple fetch-and-add operations!

:^o

Ivan Godard

unread,
Nov 12, 2013, 8:21:46 PM11/12/13
to
Are you conflating the use of F&A as an optimization of "++i" in a
monocore, with the use of F&A as a synchronization device as in a
take-a-number?

On a Mill an F&A as an optimization buys nothing over a load/add/store
sequence - except for some entropy in the encoding - even if the adder
is in the top-level cache where the store would go. It simply takes as
many cycles to stage the value into a local adder as it does to get to
the belt and through a regular ALU. This result would not be true on an
OOO machine in which each of the three imputed ops has to pass through a
scheduler stage, but we don't have a scheduler stage. For us, handling
the general case, we have a cycle in the address adder for the load, two
cycles to {get to cache, search tag array, extract data, get the data to
an adder}, one for the add, and one to put the data back in the cache.
Everything else overlaps. A native F&A would go no faster.

F&A as a synchronization device cannot use that three-operation timing
on a Mill, because the load and store are disjoint and could see an
intervening interrupt. On the Mill's optimistic concurrency primitive,
the code must add two additional operations to the implementation, plus
whatever is desired to handle failure. The extra ops are
beginSynchronousRegion and commitSynchronousRegion (terser but less
informative names in reality). These do not add to the overall latency
because BSR can be in the same instruction as the load and CSR in the
same as the store, or at least that's what we think at this point,
because both just dump another entry in the request stream. CSR returns
a result; the latency of that result is (like most latencies)
member-dependent. If all you want is an unconditional retry then you
need two additional ops: eql (was the CSR result "success") and a
branch, which can be in the same instruction, after the CSR retires.
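Roughly, in pseudo-code (operation names here are illustrative, not the real mnemonics, and instruction bundling is shown only in the comments):

```
loop:
    BSR                    ; enter sync region - issues in the same instruction as the load
    c  = load(addr)        ; line enters the read set
    c2 = c + n             ; one cycle in an ordinary ALU
    store(addr, c2)        ; line enters the write set
    ok = CSR               ; commit - issues in the same instruction as the store
    if !eql(ok, success)   ; optimistic assumption: branch predicted not taken
        goto loop
```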

This sequence has the same timing as the F&A-as-optimization case except
for the BSR/CSR, test, and loopback. Under the usual optimistic
assumption, the test will always succeed and the branch will never be
taken, so prediction will eliminate any delay and the "eql" op simply
occupies a spare ALU slot with other code; the cost is really only
encoding entropy for two operations (perhaps 32 bits for these) and two
slots worth of slot pressure that can't be used for other stuff. That's
non-zero, true, but pretty minor overall.

The BSR and CSR take more detailed analysis. BSR tests a hardware flag
to see if the program is already in a region, and then sets it. Assuming
program correctness, the error handling won't happen, so the BSR will
overlap with other actions in the caches, and in particular with the
load. Consequently the BSR adds nothing.

Successful CSR has to flash-clear the write-set bits in the top-level
cache. This must be atomic, but for practical sized caches there's no
problem to do it in one cycle, although there may be a one-cycle stutter
in processing the request stream. Failed CSR has to both flash-clear the
write-set bits and also the valid bits in the participating lines. This
too should take at most one cycle.

However, being in a synchronization region alters the cache behavior of
participating stores. A cache-based synchronizer uses the L$1 for read
and write sets, and the L$2 for the rollback values. Consequently, if a
participating store finds that the target line is a) in the L$1 and b)
not already a participant and c) is dirty, then the existing L$1 line
must be evicted to L$2 before the new store can be placed in the cache.

This effect is equivalent to the evict-path contention that occurs when
we run out of non-dirty ways for a new store into the L$1, and uses the
same hardware. The cost is very heavily implementation-dependent, but in
the worst case should be no worse than a bank fetch and a little
marshaling arbitration, say two cycles. However, on a high end Mill with
perhaps four load/store units, if all four are doing participating
stores and all hit the evict requirement, almost certainly the resulting
convoying will exceed the buffering capacity in the paths and will back
up to the execution engines and trigger an issue stall until things get
unglued.

Machines as big as a high-end Mill (or any OOO) do not stop on a dime -
merely to get the stall signal across the core can take several clocks.
Consequently, contrived cases that hit issue-stalls can run longer than
the anticipated timing. This problem is not unique to synchronization -
code outside the region can suffer from the same convoy problem if it is
continually writing to different lines in the same L$1 way. Every
architecture has this issue, but except for verification codes the
effect is infrequent enough to be invisible to app code.

The Mill optimistic synch primitive works essentially the same way that
everybody else's does, so any difference among architectures in the
relative cost of optimistic/pessimistic is not based on the primitives
themselves, but on other aspects of the respective architectures. I
would expect that any in-order monocore would find an optimistic
implementation of F&A to be roughly the equal of pessimistic, the way
the Mill does. In contrast, an OOO machine is going to require several
more trips through the scheduler than a native pessimistic operation
would need.

In addition, a cache-coherent multicore implementation is going to wind
up with more CC delays than a native pessimistic F&A would. Here the
Mill has an advantage over even an in-order architecture because our
invalidation requests have zero latency - we have to send them, but they
do not delay the execution.

Chris: thank you for bringing this up. I've never really sat down and
worked through the analysis of F&A before - the Mill design has used
more abstract models for working out how to handle concurrency, and it's
good to reality-check on a concrete example.

Paul A. Clayton

unread,
Nov 13, 2013, 11:19:49 AM11/13/13
to
On Tuesday, November 12, 2013 8:21:46 PM UTC-5, Ivan Godard wrote:
[snip]
> F&A as a synchronization device cannot use that three-operation timing
> on a Mill, because the load and store are disjoint and could see an
> intervening interrupt. On the Mill's optimistic concurrency primitive,
> the code must add two additional operations to the implementation, plus
> whatever is desired to handle failure. The extra ops are
> beginSynchonousRegion and commitSynchronousRegion (terser but less
> informative names in reality). These do not add to the overall latency
> because BSR can be in the same instruction as the load and CSR in the
> same as the store, or at least that's what we think at this point,
> because both just dump another entry in the request stream. CSR returns
> a result; the latency of that result is (like most latencies)
> member-dependent. If all you want is an unconditional retry then you
> need two additional ops: eql (was the CSR result "success") and a
> branch, which can be in the same instruction, after the CSR retires.

I have been thinking about how the classic ll/sc could
be extended into a much broader form of atomic action,
and it seemed to me that linked-load could be used as
a form of begin limited transaction plus load and
store-conditional could be used as a form of end
transaction.

(My musing on an extensible ll/sc has encountered
difficulties. Supporting internal [register] state
rollback would seem to require a separate instruction
[to allow simple ISA family members not to require such
complexity], though preserving the stack pointer might
be allowed [with the restriction that simple members
can persistently fail transactions that modify the
stack pointer; allowing behavior to either save SP or
not might be acceptable unpredictable/implementation-
dependent behavior]. Nesting [which the Mill does not
support] is also troublesome. Given the small benefit
of not adding opcodes such musings have little value,
though requiring extended transactions to handle
persistent failure in software would allow simple
members to always abort extended transactions and use
the alternate software path so there would be significant
compatibility in software.)

For the Mill it is understandable that separate
operations (corresponding to instructions in a non-VLIW)
would be used. Not only would this allow low cost
orthogonality (ll often does not support all the load
forms supported by the ISA; for a typical RISC design
using a dummy load in a ll would not be much more
problematic than requiring a separate instruction to
begin atomic sections), but it might slightly simplify
decode and routing of operations while allowing
implementation flexibility.

> Successful CSR has to flash-clear the write-set bits in the top-level
> cache. This must be atomic, but for practical sized caches there's no
> problem to do it in one cycle, although there may be a one-cycle stutter
> in processing the request stream. Failed CSR has to both flash-clear the
> write-set bits and also the valid bits in the participating lines. This
> too should take at most one cycle.

If bit clearing was too slow, one could use versioning.

(By "participating lines" I assume you mean for the write
set. Cache lines in the read set can remain valid.)

> However, being in a synchronization region alters the cache behavior of
> participating stores. A cache-based synchronizer uses the L$1 for read
> and write sets, and the L$2 for the rollback values. Consequently, if a
> participating store finds that the target line is a) in the L$1 and b)
> not already a participant and c) is dirty, then the existing L$1 line
> must be evicted to L$2 before the new store can be placed in the cache.

Even if the line is clean one might want to push the
data back to L2 (assuming a non-inclusive L2) if there
is a significant chance of failure. Even with prefetching
based on accesses from the previous failed transaction,
a memory access would be relatively slow (and increase
the chance of another failure).

The choice of copying back to L2 could be made dynamically
based on chance of failure, L1-to-L2 bandwidth pressure,
etc.
[snip]
> Consequently, contrived cases that hit issue-stalls can run longer than
> the anticipated timing. This problem is not unique to synchronization -
> code outside the region can suffer from the same convoy problem if it is
> continually writing to different lines in the same L$1 way. Every
> architecture has this issue, but except for verification codes the
> effect is infrequent enough to be invisible to app code.

By "in the same L$1 way" did you mean "at the same cache
index"--assuming non-skewed associativity--(which is
usually called "set"), so that an N-way cache could only
hold writes to N lines?

Elbow/cuckoo caches can reduce the impact of such by
using the cuckoo hash mechanism. (This is in addition
to the conflict avoidance of using skewed associativity.)
Cuckoo caching is more appropriate for outer levels of
cache, but it could be selectively applied to L1. (If one
would be waiting anyway, copying a cache line within L1
might not add any delay.)

Ivan Godard

unread,
Nov 13, 2013, 2:00:00 PM11/13/13
to
On 11/13/2013 8:19 AM, Paul A. Clayton wrote:
> On Tuesday, November 12, 2013 8:21:46 PM UTC-5, Ivan Godard wrote:
> [snip]

<snip>


>> However, being in a synchronization region alters the cache behavior of
>> participating stores. A cache-based synchronizer uses the L$1 for read
>> and write sets, and the L$2 for the rollback values. Consequently, if a
>> participating store finds that the target line is a) in the L$1 and b)
>> not already a participant and c) is dirty, then the existing L$1 line
>> must be evicted to L$2 before the new store can be placed in the cache.
>
> Even if the line is clean one might want to push the
> data back to L2 (assuming a non-inclusive L2) if there
> is a significant chance of failure. Even with prefetching
> based on accesses from the previous failed transaction,
> a memory access would be relatively slow (and increase
> the chance of another failure).

The Mill cache policy on a cache miss is to raise the loaded-from line
one level only, not all the way to the top (while also returning the
loaded bytes to the retire station of course). As a result, if the line
is clean in the L$1 then there is almost certainly another copy
(possibly dirty) in the L$2. Hence there's no value to pushing clean
lines in general.

> The choice of copying back to L2 could be made dynamically
> based on chance of failure, L1-to-L2 bandwidth pressure,
> etc.
> [snip]
>> Consequently, contrived cases that hit issue-stalls can run longer than
>> the anticipated timing. This problem is not unique to synchronization -
>> code outside the region can suffer from the same convoy problem if it is
>> continually writing to different lines in the same L$1 way. Every
>> architecture has this issue, but except for verification codes the
>> effect is infrequent enough to be invisible to app code.
>
> By "in the same L$1 way" did you mean "at the same cache
> index"--assuming non-skewed associativity--(which is
> usually called "set"), so that an N-way cache could only
> hold writes to N lines?
>
> Elbow/cuckoo caches can reduce the impact of such by
> using the cuckoo hash mechanism. (This is in addition
> to the conflict avoidance of using skewed associativity.)
> Cuckoo caching is more appropriate for outer levels of
> cache, but it could be selectively applied to L1. (If one
> would be waiting anyway, copying a cache line within L1
> might not add any delay.)
>

There are a lot of schemes to reduce cache collisions that arise from
regular strides matching the hash algorithm and consequently clustering
(and colliding) in the cache. However, regardless of the scheme used it
is always possible to contrive test cases that will result in maximal
collisions, and in general the cache data paths are not redundant enough
to handle the traffic and will cause a stall on any architecture.

Chris M. Thomasson

unread,
Nov 14, 2013, 1:59:11 AM11/14/13
to
> "Ivan Godard" wrote in message news:5282D42A...@ootbcomp.com... On
> 11/12/2013 3:51 PM, Chris M. Thomasson wrote:
> >> "Terje Mathisen" wrote in message
> >> news:69j9la-...@ntp-sure.tmsw.no...
> >
> >> Ivan Godard wrote:
> >> > p.s. Does anybody have a microbenchmark that just does an infinite
> >> > number of IDs to the same counter? What's the cycle count per ID?
> >
> >> I wrote a program to test the maximum communication rates for a shared
> >> memory interface to an NTP reference clock:
> >
> >> A single producer which wrote (cache-line aligned) blocks of simulated
> >> timestamps and N consumers which all tried to read as many of these as
> >> possible.
> >
> >> It used XADD as the only primitive.
> >
> > You probably know the difference between testing a fetch-and-add
> > operation
> > implemented with a CMPXCHG-loop vs a single XADD. Well, XADD seems to
> > clean its clock. This is why I dislike looping around on failure
> > conditions in
> > software when performing simple fetch-and-add operations!
> >
> > :^o


> Are you conflating the use of F&A as an optimization of "++i" in a
> monocore, with the use of F&A as a synchronization device as in a
> take-a-number?

I mainly use F&A for implementing all sorts of bakery synchronization
algorithms. I can gain wait-free properties by using the XADD instruction
on an x86. That is, I can guarantee that an operation will take exactly
N synchronization instructions. And I can define N. A very simple example
would be a reference count to manage the lifetime of an object. When
I use XADD I can define N = 1:

<pseudo-code>
____________________________________________
word fetch_and_add(word* d, word n)
{
return XADD(d, n);
}
____________________________________________


This is a wait-free operation, which IMHO, is a great property to strive
for!



Okay. Now, let's say that XADD did not exist on the x86. Fine. I will use
CMPXCHG...

____________________________________________
word fetch_and_add(word* d, word n)
{
for (;;)
{
word c = LOAD(d);
if (CMPXCHG(d, c, c + n)) return c;
}
}
____________________________________________


This is NOT a wait-free operation anymore, which means I cannot
define N = 1. Instead I have to define N = ? because we were just
downgraded to a lock-free operation due to the use of a loop.
Basically, I have no idea how many times this function will need to
call CMPXCHG in order to complete the F&A.

In my experience, wait-free synchronization outperforms and scales
much better than lock-free...

Every time I need to design a synchronization scheme, I try extremely
hard to reap the benefits of wait-free whenever I can. This means
trying very hard to remove the loops from the sync algorithm.
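For reference, the two variants sketched above in compilable C11 atomics (my rendering, not Chris's original code):

```c
#include <stdatomic.h>
#include <assert.h>

/* Wait-free: exactly one atomic op per call (N = 1), like x86 XADD. */
static int fetch_and_add_waitfree(atomic_int *d, int n) {
    return atomic_fetch_add(d, n);
}

/* Lock-free only: the CAS can fail under contention, so the number of
   atomic ops per call is unbounded (N = ?). */
static int fetch_and_add_lockfree(atomic_int *d, int n) {
    int c = atomic_load(d);
    while (!atomic_compare_exchange_weak(d, &c, c + n))
        ;   /* on failure, c is reloaded with the current value */
    return c;
}
```

Both return the pre-increment value; only the first has a bounded cost per call.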


[...]

> Chris: thank you for bringing this up.

No problem!

:^)


> I've never really sat down and worked through the analysis of F&A before -
> the Mill design has used more abstract models for working out how to
> handle concurrency, and it's good to reality-check on a concrete example.

IMVVHO, it's good to have a means for HTM, but I still think it would
be beneficial to add "pessimistic" F&A and atomic SWAP (XCHG on
x86) instructions. They are very useful synchronization instructions,
and some fairly interesting algorithms can be based on them...


Can I implement wait-free synchronization primitives on the Mill?

Ivan Godard

unread,
Nov 14, 2013, 3:02:18 AM11/14/13
to
On 11/13/2013 10:59 PM, Chris M. Thomasson wrote:

<snip>


> IMVVHO, it's good to have a means for HTM, but I still think it would
> be beneficial to add "pessimistic" F&A and atomic SWAP (XCHG on
> x86) instructions. They are very useful synchronization instructions,
> and some fairly interesting algorithms can be based on them...
>
>
> Can I implement wait-free synchronization primitives on the Mill?

As currently defined, any true synchronization algorithm on a Mill
involves the possibility of failure and recovery, typically by iteration.

However, you seem to have misunderstood what the notion of "wait-free"
involves - see https://en.wikipedia.org/wiki/Wait-free. Wait-free
requires a guarantee of eventual per-thread progress; it does not
require the common-sense meaning of "immediate progress without delay"
which seems to be your understanding of the term. Yours is a common
misunderstanding - this stuff is subtle and easy to mix up.

Using the accepted meaning, the Mill optimistic primitive is in fact
wait-free when considered as a component of a code sequence that
implements retry. That is, a single invocation of the primitive may fail
(and hence is not wait-free), but a retry loop incorporating the
primitive guarantees forward progress and hence is wait-free.

This result considers only collision failure. The primitive may fail for
other reasons such as capacity, which arise because the form of the
construction is defective as written, and defective constructs are not
guaranteed to be wait-free nor even to make any progress whatsoever. The
same may occur with pessimistic protocols such as XADD and XCHG when
used with an invalid address: the construct is inherently defective and
no progress will be made.

I can only speak for the Mill form of optimistic concurrency; I do not
know enough of the internals of the offerings of other vendors (much of
which is not in the public documentation anyway), but from my cursory
inspection at least the Intel optimistic primitive cannot guarantee
wait-free based solely on the hardware, although suitable surrounding
software when coupled to the hardware primitive should be able to do so.
I know even less about other vendors than Intel (who is to be
complimented on the extent of their documentation), so cannot comment
on them.

In contrast, a simple retry loop is wait-free on the Mill hardware
without additional software requirements. Sorry - the explanation of why
that is true for us and not for Intel hasn't been filed yet.

Chris M. Thomasson

unread,
Nov 14, 2013, 3:44:34 PM11/14/13
to
> "Ivan Godard" wrote in message news:l6202m$7pe$1...@dont-email.me...

> On 11/13/2013 10:59 PM, Chris M. Thomasson wrote:

> <snip>

> > IMVVHO, it's good to have a means for HTM, but I still think it would
> > be beneficial to add "pessimistic" F&A and atomic SWAP (XCHG on
> > x86) instructions. They are very useful synchronization instructions,
> > and some fairly interesting algorithms can be based on them...
> >
> >
> > Can I implement wait-free synchronization primitives on the Mill?

> As currently defined, any true synchronization algorithm on a Mill
> involves the possibility of failure and recovery, typically by iteration.

> However, you seem to have misunderstood what the notion of "wait-free"
> involves - see https://en.wikipedia.org/wiki/Wait-free.

[...]

Well, I know that I do not have to explicitly handle a failure case of XADD
in software, for I know it is hardware-based, and IMHO, the hardware
knows better wrt handling actual contention on the cache line...

Also, IMVVVHO, I say "wait-free" because the computation of XADD is
hardware based and does not interfere with other calculations from
the synchronization algorithms point of view.

Therefore, AFAICT, loopless synchronization _is_ good. Also, boiling
down any library code you are using to the bare metal instructions
and making sure the F&A is a single loopless instruction can be beneficial.
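The loopless form being described is what C11 exposes as `atomic_fetch_add`, which on x86 compiles down to the single `lock xadd` instruction:

```c
#include <stdatomic.h>

/* Pessimistic (unconditional) fetch-and-add: no failure path and no
 * retry loop visible in software.  On x86 this compiles to a single
 * `lock xadd`; on an LL/SC machine the compiler emits a small retry
 * loop instead. */
long counter_next(atomic_long *counter)
{
    /* Returns the previous value; the increment itself cannot fail. */
    return atomic_fetch_add(counter, 1);
}
```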

Chris M. Thomasson

unread,
Nov 15, 2013, 3:44:37 PM11/15/13
to
> "Chris M. Thomasson" wrote in message
> news:l63cnl$9vr$1...@speranza.aioe.org... [...]

> Also, IMVVVHO, I say "wait-free" because the computation of XADD is
> hardware based and does not interfere with other calculations from
> the synchronization algorithms point of view.

> Therefore, AFAICT, loopless synchronization _is_ good. Also, boiling
> down any library code you are using to the bare metal instructions
> and making sure the F&A is a single loopless instruction can be
> beneficial.

I hope I did not drift off into Skyb_ck Ville!

:^o


---------------
I just really enjoy my loopless synch algorithms, and it breaks my heart to
be "forced" into a loop construct. However, the Mill is awesome. In other
words, the sheer innovations wrt the Mill simply surpass my little quirk
wrt loopless synch. I have issues with PowerPC as well. I mean you boil down
Windows InterlockedExchangeAdd into XADD on x86, and on PowerPC it
is an LL/SC loop!

;^o

snide...@gmail.com

unread,
Nov 21, 2013, 2:09:51 PM11/21/13
to
On Thursday, October 24, 2013 11:18:20 AM UTC-7, Ivan Godard wrote:

> The presentation will cover these and other technical aspects of the
> memory hierarchy in the Mill design. Videos and other material about
> other aspects of the Mill can be found at ootbcomp.com/docs.

Thanks for the pointer. I just dropped in out of the blue (read comp.arch years ago, off and on), and this is one of the first things I've found. I love reading overviews of architectures, so this should keep me busy for a while.

(I'm a former ASM journeyman, not an architect, but I've spent hours following links in Wikipedia and reading reviews of Haswell, etc. Experience with x86 through 32 bits, a touch of IA64, and more recently ARM.)

(OT: I am reluctant to recommend tenor-sans as a font for a web site. I actually turned on user-styles for that page, because Opera 12 on F16 doesn't seem to render that text well. Also, sans-serif fonts look clean, but may be harder to read; Tufte quotes Albers 1975 on this.)

/dps