How hardware implements CAS


Yunpeng Li

unread,
Jan 4, 2017, 8:43:25 AM1/4/17
to mechanical-sympathy
Hi there,
Could someone shed some light on how hardware actually implements atomic operations such as CAS? In particular, what are the differences and overheads across the spectrum from single-thread-single-core-single-socket to hyper-threaded-multi-core-multi-socket architectures?
The Google results are either too hardcore or too high-level; it would be great if someone could give a "middlecore" introduction for software guys like me😀

Thanks in advance
Yunpeng Li

Gil Tene

unread,
Jan 4, 2017, 2:15:12 PM1/4/17
to mechanical-sympathy
The bulk of how CAS is implemented matters only in multi-core situations (well, multi-hardware-threads at least) for obvious reasons.

The simplest CAS implementations (and the easiest mental model) will simply freeze the local cache coherence protocol state machine after the load part of the CAS brings the relevant cache line into the nearest (e.g. L1) cache in exclusive mode, and will unfreeze it after the (optional) store completes. This, by definition, makes the CAS operation as a whole atomic with relation to any other participant in the cache coherence protocol. This mental model works well for in-order, short pipeline, shallow store buffer, weak memory model cores.
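As a minimal sketch of that contract (not from the thread; names like cas_semantics_sketch are illustrative), the load, compare, and optional store below must appear to every other core as one indivisible step, which the simple model achieves by holding the line exclusively for the whole sequence:

#include <stdatomic.h>
#include <stdbool.h>

/* What the hardware must make indivisible, written out as separate steps.
 * In the "freeze the coherency" model, the cache line at addr is held in
 * exclusive state from the load until after the (optional) store. */
bool cas_semantics_sketch(long *addr, long expected, long desired) {
    long observed = *addr;      /* load: line now held exclusively          */
    if (observed != expected)   /* compare: ordinary ALU work               */
        return false;           /* no store issued on mismatch              */
    *addr = desired;            /* store: still under the same exclusivity  */
    return true;
}

/* The same contract as actually exposed to software: one atomic operation. */
static _Atomic long owner;

bool claim_slot(long my_id) {
    long expected = 0;
    return atomic_compare_exchange_strong(&owner, &expected, my_id);
}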

When you throw in stronger memory ordering requirements, deep store buffers, OOOE, and deep pipelines, a naive implementation that just freezes things like that would incur a huge cost (as some simpler implementations used to). E.g. most uses of CAS expect ordering guarantees, which may require that all previously issued stores and loads be visible before the CAS operation's load and (potential) store are. A naive/simple implementation would typically include flushing the store buffer and waiting for all pending load operations to complete before performing the CAS's load operation, which would equate to depleting the entire execution pipeline of a modern OOOE multi-issue machine. That's why CAS implementations get much more complicated than the simplistic "freeze the coherency" model, maintaining the semantic requirements but doing so without complete buffer, pipeline, and pending instruction flushes.
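To make those ordering requirements concrete at the software level, here is a hedged sketch (illustrative names, not from the thread) of a lock-free push: the plain store to n->next must be visible before the CAS publishes the node, which is exactly the kind of constraint that forces the hardware to drain earlier stores. How expensive the release ordering actually is depends on the microarchitecture.

#include <stdatomic.h>

struct node { struct node *next; int value; };
static _Atomic(struct node *) head;

void push(struct node *n) {
    struct node *old = atomic_load_explicit(&head, memory_order_relaxed);
    do {
        n->next = old;  /* this plain store must be visible before the CAS... */
    } while (!atomic_compare_exchange_weak_explicit(
                 &head, &old, n,
                 memory_order_release,    /* ...which 'release' ordering guarantees */
                 memory_order_relaxed));  /* failed attempts need no ordering       */
}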

Avi Kivity

unread,
Jan 4, 2017, 2:39:46 PM1/4/17
to mechanica...@googlegroups.com
Gil covered the implementation details; as to overhead, it can be quite low if there is no cacheline contention. Agner's tables list Skylake lock cmpxchg as having a throughput of 1 insn per 18 cycles, which is fairly amazing. However, as soon as you have contention, this tanks completely due to the associated barriers stopping everything else while the data is moved around.
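A rough way to see this for yourself (an assumed test harness, not something from the thread; absolute numbers vary wildly by machine): time the same number of CAS increments with one thread versus several threads hammering a single cache line. Compile with something like gcc -O2 -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define OPS_PER_THREAD 10000000L

static _Atomic long shared;

static void *worker(void *arg) {
    (void)arg;
    for (long i = 0; i < OPS_PER_THREAD; i++) {
        long expected = atomic_load_explicit(&shared, memory_order_relaxed);
        /* keep retrying until our CAS wins; under contention the line
         * ping-pongs between cores and each attempt gets far slower */
        while (!atomic_compare_exchange_weak(&shared, &expected, expected + 1))
            ;
    }
    return NULL;
}

static double run(int nthreads) {
    pthread_t tids[16];
    struct timespec t0, t1;
    atomic_store(&shared, 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("uncontended (1 thread) : %.2fs\n", run(1));
    printf("contended   (4 threads): %.2fs\n", run(4));
    return 0;
}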

Vitaly Davidovich

unread,
Jan 4, 2017, 2:59:39 PM1/4/17
to mechanical-sympathy
Probably worth a mention that "CAS" is a bit too generic. For instance, you can have weak and strong CAS, with some architectures providing only strong (e.g. Intel) and some providing/allowing both. Depending on whether a weak or strong CAS is used, the memory ordering/pipeline implications will be different (and thus the local cost, i.e. the cost to the core executing it, will be higher or lower). Then there are cases where the "CAS" operation can trigger bus escalation, e.g. lock'd instructions on Intel where the memory crosses cache lines or is uncacheable, which will be more costly than if the bus lock doesn't have to be asserted.

For cacheable memory, as Gil mentioned, I believe the basic gist of the model relies on the existing cache coherence protocol that's already built in (after all, plain stores to memory are already arbitrated properly by the caches).  The "CAS" instruction/use cases, however, can add ordering/memory constraints that will impact performance (no different in that sense than, say, a plain store followed by a full cpu memory fence).
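For what the weak/strong distinction looks like from software, here is a hedged sketch with made-up names (on x86 both forms compile to lock cmpxchg, while LL/SC machines typically need an internal retry loop for the strong form):

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic int flag;

/* One-shot attempt: a spurious failure would be a correctness bug here,
 * so the strong form is required. */
bool try_claim_once(void) {
    int expected = 0;
    return atomic_compare_exchange_strong(&flag, &expected, 1);
}

/* Retry loop: a spurious failure just costs one more iteration, so the
 * (potentially cheaper) weak form is fine. */
void increment(_Atomic int *ctr) {
    int expected = atomic_load(ctr);
    while (!atomic_compare_exchange_weak(ctr, &expected, expected + 1))
        ;  /* 'expected' is refreshed with the observed value on failure */
}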


Vitaly Davidovich

unread,
Jan 4, 2017, 3:01:44 PM1/4/17
to mechanical-sympathy
Oh, and I forgot to mention the LL/SC style of CAS that's offered by some architectures with weak (by default) memory models. With Load-Linked/Store-Conditional the read-modify-write is no longer a single atomic instruction underneath, but the CPU ensures that the store only completes if the underlying cacheline wasn't taken away in the meantime (even if the value at the address is still the expected one).
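As a hedged sketch of what that looks like underneath (the C is portable; the commented assembly is only an approximation of what an AArch64-style LL/SC machine emits, not something from the thread):

#include <stdatomic.h>
#include <stdbool.h>

bool cas_once(_Atomic long *addr, long *expected, long desired) {
    /* On an LL/SC machine a single weak CAS attempt lowers to roughly:
     *
     *     ldaxr  x0, [addr]          ; load-linked: read value, set reservation
     *     cmp    x0, expected
     *     b.ne   fail                ; value mismatch -> ordinary CAS failure
     *     stlxr  w1, desired, [addr] ; store-conditional: only lands if the
     *                                ; reservation survived, i.e. nobody took
     *                                ; the line, even if the value is unchanged
     *     cbnz   w1, fail            ; reservation lost -> spurious failure
     *
     * so the read-modify-write is split into two instructions, with the
     * hardware guaranteeing the store only happens if the line stayed ours. */
    return atomic_compare_exchange_weak(addr, expected, desired);
}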

Yunpeng Li

unread,
Jan 4, 2017, 11:35:21 PM1/4/17
to mechanical-sympathy
Thanks a lot for the explanations.

Rajiv Kurian

unread,
Jan 5, 2017, 8:02:19 PM1/5/17
to mechanical-sympathy
If you are really interested in the low-level details and can read (or are willing to learn to read) an HDL, I'd say take a look at the RISC-V project. You can either look at the rocket (in-order) or boom (out-of-order) cores. My recommendation would be to start with rocket since it is simpler; further, boom is built with a lot of components picked out of rocket. Note that RISC-V does not have CAS directly and instead uses load-reserved/store-conditional instructions. lr/sc is a stronger guarantee than CAS: it checks bus access rather than data bits (which is what CAS checks) at the h/w level, so the implementation is different, but a lot of the same concerns still apply.
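A small illustration of that difference (hypothetical names, a classic ABA scenario rather than anything taken from rocket): because CAS only compares the data bits, it can succeed when the value merely looks unchanged, whereas lr/sc would fail because the reservation on the line was lost in between.

#include <stdatomic.h>

struct node { struct node *next; };
static _Atomic(struct node *) top;

struct node *pop(void) {
    struct node *old = atomic_load(&top);
    while (old) {
        struct node *next = old->next;  /* read 'next' before the CAS */
        /* If another thread pops 'old', pops and frees 'next', then pushes
         * 'old' back, 'top' equals 'old' again: this CAS succeeds and
         * installs a dangling 'next'. With lr/sc the intervening writes to
         * 'top' would have cleared the reservation and forced a retry. */
        if (atomic_compare_exchange_weak(&top, &old, next))
            break;
    }
    return old;
}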

Here are some links that might help -

High level schematic of the rocket-core (might be out of date) - http://www.lowrisc.org/docs/tagged-memory-v0.1/rocket-core/


Follow the links to see related things like the AMOALU, arbiter etc.

For folks interested in hardware, you'll find a lot of other interesting things that are glossed over in most Computer Architecture/Digital Design courses when writing the toy MIPS processors. Implementations of things like the TLB, BTB, DMA, decode units, FPU, etc. are all written from scratch. Sadly the documentation is sparse and it is easy to get lost, but you can always ask questions on the RISC-V mailing list. I'd like to stress that even though it might seem like it, rocket is not simply a throwaway toy implementation. Last I checked, the rocket-core implementation was twice as energy efficient as the Cortex-A5 (take this with a grain of salt).

The rocket and boom sources (and the standalone FPU unit) are all written in Chisel, a DSL on top of Scala. If you understand Verilog, it will not take you too long to understand Chisel. Chisel is written at a much higher level than Verilog and is easier to read IMO. If Chisel seems like garbage to you and you want to read Verilog, Chisel translates to very clean and predictable Verilog using the Chisel toolchain.

Rajiv Kurian

unread,
Jan 5, 2017, 8:11:15 PM1/5/17
to mechanical-sympathy
For folks who know Verilog (and even for folks who don't), here are some resources on Chisel and HDLs in general that might be a good place to start:




I'd say that if you've gone through and are confident with the advanced Chisel lecture, you can get started on reading the rocket source.

Yunpeng Li

unread,
Jan 6, 2017, 6:19:17 AM1/6/17
to mechanical-sympathy
Thanks a lot, I will dig into it later.