Hi Peter,
I'm not sure that it addresses the issue of hardware fairness, but this
paper compares CAS and LOCK XADD behavior on relatively modern x86:
"Evaluating the Cost of Atomic Operations on Modern Architectures,"
Hermann Schweizer, Maciej Besta, and Torsten Hoefler.
https://spcl.inf.ethz.ch/Publications/.pdf/atomic-bench.pdf
Here's the abstract:
"""
Atomic operations (atomics) such as Compare-and-Swap (CAS) or
Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet,
performance tradeoffs between these operations and
various characteristics of such systems, such as the structure of
caches, are unclear and have not been thoroughly analyzed. In this paper
we establish an evaluation methodology, develop a performance model, and
present a set of detailed benchmarks for latency and bandwidth of
different atomics. We consider various state-of-the-art x86
architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer.
The results unveil surprising performance relationships between the
considered atomics and architectural properties such as the coherence
state of the accessed cache lines. One key finding is that all the
tested atomics have comparable latency and bandwidth even if they are
characterized by different consensus numbers. Another insight is that
the design of atomics prevents any instruction level parallelism even if
there are no dependencies between the issued operations. Finally, we
discuss solutions to the discovered performance issues in the analyzed
architectures. Our analysis can be used for making better design and
algorithmic decisions in parallel programming on various architectures
deployed in both off-the-shelf machines and large compute systems.
"""
Cheers,
Ross.