LOCK XADD: wait free or lock free?


Peter Veentjer

Aug 25, 2018, 1:00:22 AM8/25/18
to mechanical-sympathy
I'm polishing up my lock free knowledge and one question keeps on bugging me.

The question is about the LOCK XADD and why it is considered to be wait free.

AFAIK for wait freedom there needs to be some fairness.

So imagine a concurrent counter that spins on a CAS to increment: on every round at least 1 thread wins and makes progress, therefore this implementation is lock free. It isn't wait free, because some thread could keep spinning without bound. The problem here is that there is no fairness.
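The spin-on-CAS counter described above can be sketched in Java (class and method names are mine, for illustration). The loop is lock-free because some thread's compareAndSet always succeeds, but nothing bounds how many times any individual thread retries:

```java
import java.util.concurrent.atomic.AtomicLong;

// Lock-free (but not wait-free) counter: increment retries a CAS until it wins.
public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() {
        while (true) {
            long current = value.get();
            // The CAS fails if another thread updated value in between; retry.
            if (value.compareAndSet(current, current + 1)) {
                return current + 1;
            }
        }
    }

    public long get() {
        return value.get();
    }
}
```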

Now imagine this counter is implemented using a LOCK XADD instead, so there is no need for a loop. What guarantees that every thread makes progress in a bounded number of steps? E.g. could it happen that one thread is continuously denied exclusive access to the memory and therefore never completes in a bounded number of steps? Or do the requests get stored in some kind of list and processed in that order? Such an order would provide fairness, and then it would be wait free.
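For comparison, here is the loop-free version the question describes, sketched with Java's AtomicLong (class name is mine). On x86, getAndAdd is typically JIT-compiled to a single LOCK XADD, so each increment is one unconditional atomic step with no retry:

```java
import java.util.concurrent.atomic.AtomicLong;

// Wait-free counter: getAndAdd typically maps to LOCK XADD on x86, no retry loop.
public class XaddCounter {
    private final AtomicLong value = new AtomicLong();

    public long increment() {
        // Unconditionally adds 1 and returns the previous value in one atomic step.
        return value.getAndAdd(1) + 1;
    }

    public long get() {
        return value.get();
    }
}
```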

I have been looking at the Intel documentation of LOCK XADD, but it remains unclear.

Gil Tene

Aug 25, 2018, 2:39:13 AM8/25/18
to mechanica...@googlegroups.com
Wait-free, Lock-free, and Obstruction-free are well defined things. They are all non-blocking, such that suspended threads cannot indefinitely prevent all others from making progress, but they differ in the forward progress expectations of individual threads. The "Non-blocking algorithm" wikipedia entry does a fair job of explaining what each means. It is forward progress in a bounded number of steps, rather than "fairness", that is key. You can be extremely unfair (as in making forward progress at 1,000,000,000:1 ratios) and wait-free at the same time.

There is a difference between the wait-free/lock-free classification of an operation/instruction and the classification of a mechanism/algorithm using that operation/instruction...

Both the LOCK XADD and LOCK CAS operations are wait free. There is nothing any other processor can do to prevent your processor from making forward progress and eventually (in a bounded number of steps) executing either instruction. In fact, [at the very least] all user-mode x86 CPU instructions are trivially wait free, otherwise very bad things could be done by user-mode code...

The key difference between a LOCK XADD and a LOCK CAS is that the LOCK XADD will actually and unconditionally force a visible change to the shared memory state in a way that no other processor can prevent, while the shared memory effect of a LOCK CAS operation is conditional, and will only occur if the compare shows that the expected value and the memory value are the same. Simple (but by no means all) uses of CAS often perform retry loops in lock-free, but not wait-free, mechanisms.
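The conditional/unconditional distinction is visible directly in the Java API (AtomicLong used here as a stand-in for the underlying instructions): compareAndSet reports failure when the expected value does not match, while getAndAdd always takes effect:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ConditionalVsUnconditional {
    public static void main(String[] args) {
        AtomicLong v = new AtomicLong(10);

        // CAS is conditional: it only stores if the current value equals the expected one.
        boolean won  = v.compareAndSet(10, 11); // expected 10, matches -> succeeds
        boolean lost = v.compareAndSet(10, 12); // expected 10, but value is now 11 -> fails
        System.out.println(won + " " + lost + " " + v.get()); // true false 11

        // XADD-style getAndAdd is unconditional: it always changes memory.
        long previous = v.getAndAdd(5);
        System.out.println(previous + " " + v.get()); // 11 16
    }
}
```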

It is easy to see how one can use LOCK XADD operations to build wait-free mechanisms. E.g. implementing a wait-free atomic counter using XADD is trivial. It is similarly useful in more complicated synchronization primitives, like e.g. my WriterReaderPhaser, which is wait free on architectures that support atomic increment operations, but (in my current implementation) is only lock free on architectures that force us to resort to a CAS. It is somewhat harder to construct wait-free mechanisms using LOCK CAS, but it is by no means impossible, and usually just means some extra complexity and spending some more memory per concurrent thread or CPU. E.g. see "Wait-Free Queues With Multiple Enqueuers and Dequeuers", where wait-free queues are implemented using CAS constructs.
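As a small illustration of XADD-style increments beyond a plain counter, getAndAdd can hand out unique slots in a shared buffer: each caller claims its index in one unconditional atomic step, so claiming is wait-free. This is a sketch under my own assumptions (class and method names are mine; no wraparound or reclamation is handled):

```java
import java.util.concurrent.atomic.AtomicLong;

// Wait-free slot claiming: each writer reserves a distinct index with one getAndAdd.
public class ClaimingBuffer {
    private final long[] slots;
    private final AtomicLong next = new AtomicLong();

    public ClaimingBuffer(int capacity) {
        slots = new long[capacity];
    }

    // Claims and fills a slot; returns the index, or -1 if the buffer is full.
    public int publish(long item) {
        long index = next.getAndAdd(1); // one unconditional atomic step per caller
        if (index >= slots.length) {
            return -1; // out of capacity; real code would need a fuller protocol here
        }
        slots[(int) index] = item;
        return (int) index;
    }

    public long read(int index) {
        return slots[index];
    }
}
```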

Peter Veentjer

Aug 25, 2018, 11:18:53 AM8/25/18
to mechanical-sympathy
Hi Gil,

thanks for your answer.

However it still feels a bit like 'it works because it does' (or my skull is just too thick).

So what prevents a CPU from perpetually being denied the chance to execute the LOCK XADD? Imagine there are 2 CPUs, both executing a LOCK XADD in a loop; what prevents one of the CPUs from being perpetually starved of the resource? And even if it isn't perpetual, what guarantees that the CPU completes in a bounded number of steps?

Martin Thompson

Aug 25, 2018, 12:11:26 PM8/25/18
to mechanical-sympathy
To perform an update via the XADD instruction, the cache line containing the word must first be acquired. x86 uses the MESI cache coherence protocol, and to get the cache line for update an RFO (Read/Request For Ownership) message must be sent as a bus transaction. These requests are queued per core and in the uncore, which effectively avoids starvation. Performance events are available to observe this, e.g. OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO. From the perspective of instructions, each thread takes a bounded number of steps to complete. The CAS equivalent would be lock-free, rather than wait-free, as the number of steps per thread is not bounded.

Gil Tene

Aug 25, 2018, 1:26:50 PM8/25/18
to mechanical-sympathy


On Saturday, August 25, 2018 at 9:11:26 AM UTC-7, Martin Thompson wrote:
To perform an update via the XADD instruction, the cache line containing the word must first be acquired. x86 uses the MESI cache coherence protocol, and to get the cache line for update an RFO (Read/Request For Ownership) message must be sent as a bus transaction. These requests are queued per core and in the uncore, which effectively avoids starvation. Performance events are available to observe this, e.g. OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO. From the perspective of instructions, each thread takes a bounded number of steps to complete. The CAS equivalent would be lock-free, rather than wait-free, as the number of steps per thread is not bounded.

The CAS might never "succeed" in a capped number of steps (i.e. someone else could forever beat it to the memory location and the compare could fail in each and every attempt), but the CAS instruction will complete, one way or another, on all practical hardware implementations, in a capped number of steps. A fundamental design requirement in all hardware implementations I'm aware of is to prevent the starvation of any possible (at least user mode, and probably all) instructions, including CAS. This might be achieved in many different ways, often leveraging qualities of the cache protocols and the knowledge that the number of processors and other components (like memory controllers, coherence protocol coordination points, queue depths in various places, etc.) in the system is capped. "How exactly" doesn't matter. It's the job of whoever designs the hardware to make sure it is so.

Think about it this way: In any system where it is impossible for you to cause a hardware watchdog reset from user code, there's a "capped number of steps" for executing any individual user-mode instruction, which by definition makes them all (individually) wait-free. There are no non-wait-free user mode instructions in such systems. QED.

Gil Tene

Aug 25, 2018, 1:33:58 PM8/25/18
to mechanical-sympathy


On Saturday, August 25, 2018 at 8:18:53 AM UTC-7, Peter Veentjer wrote:
Hi Gil,

thanks for your answer.

However it still feels a bit like 'it works because it does' (or my skull is just too thick).

So what prevents a CPU from perpetually being denied the chance to execute the LOCK XADD? Imagine there are 2 CPUs, both executing a LOCK XADD in a loop; what prevents one of the CPUs from being perpetually starved of the resource?

What prevents it is the fact that if this was possible, you've just described a trivial way for you to crash any server and any laptop on earth from user-mode code. One of the jobs of the hardware designers is to make it impossible for you to do that, and they've usually done that job to an ok enough level on the systems we run on (assuming we haven't read anything in the news about "how to bring all of AWS down from a laptop").

Yichao Yu

Aug 25, 2018, 9:29:15 PM8/25/18
to mechanica...@googlegroups.com
> What prevents it is the fact that if this was possible, you've just described a trivial way for you to crash any server and any laptop on earth from user-mode code. One of the jobs of the hardware designers is to make it impossible for you to do that, and they've usually done that job to an ok enough level on the systems we run on (assuming we haven't read anything in the news about "how to bring all of AWS down from a laptop").

Sorry, I couldn't resist posting this.

Is the `F00F`[1] bug (or other HCF bugs[2]) what might happen if such
a property is not guaranteed by the hardware? I was actually very
surprised that modern hardware could be designed without any such
problems...

[1] https://en.wikipedia.org/wiki/Pentium_F00F_bug
[2] https://en.wikipedia.org/wiki/Halt_and_Catch_Fire


Ross Bencina

Aug 28, 2018, 7:18:19 AM8/28/18
to mechanica...@googlegroups.com
Hi Peter,

I'm not sure that it addresses the issue of hardware fairness, but this
paper compares CAS and LOCK XADD behavior on relatively modern x86:

"Evaluating the Cost of Atomic Operations on Modern Architectures,"
Hermann Schweizer, Maciej Besta, and Torsten Hoefler.

https://spcl.inf.ethz.ch/Publications/.pdf/atomic-bench.pdf

Here's the abstract:

"""
Atomic operations (atomics) such as Compare-and-Swap (CAS) or
Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet,
performance tradeoffs between these operations and
various characteristics of such systems, such as the structure of
caches, are unclear and have not been thoroughly analyzed. In this paper
we establish an evaluation methodology, develop a performance model, and
present a set of detailed benchmarks for latency and bandwidth of
different atomics. We consider various state-of-the-art x86
architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer.
The results unveil surprising performance relationships between the
considered atomics and architectural properties such as the coherence
state of the accessed cache lines. One key finding is that all the
tested atomics have comparable latency and bandwidth even if they are
characterized by different consensus numbers. Another insight is that
the design of atomics prevents any instruction level parallelism even if
there are no dependencies between the issued operations. Finally, we
discuss solutions to the discovered performance issues in the analyzed
architectures. Our analysis can be used for making better design and
algorithmic decisions in parallel programming on various architectures
deployed in both off-the-shelf machines and large compute systems.
"""

Cheers,

Ross.

Steven Stewart-Gallus

Jan 17, 2019, 2:41:32 AM1/17/19
to mechanical-sympathy
There are more than a few ways to deny service to other threads and/or cores on most personal computers today.

The most commonly encountered is probably using up too much memory, causing the system to swap to the hard drive.

This is partly a feature not a bug. You wouldn't like a system for personal use that would never let you do:
 kill -s STOP $PID

So it is inherent to the design of multiprocessor personal computers that a sufficiently privileged program can bog down other cores with a bunch of interrupts, or, you know, power off the machine.

Unless you are designing for very specialised systems such as medical equipment, you don't really need to worry about it, and you don't need to buy special processors or configure them in special modes to get the most determinism out of them and ensure they really are wait-free.

What really matters for your purposes is that cooperating threads are wait free in practice.

If you're worried that there will be some hardware bug letting people bypass cloud virtualisation and read your secured data, well, then I would be too. But I would be much less worried about a DoS attack against cloud services like AWS, because existing distributed systems are already explicitly designed for failover in case there is some kind of hardware failure.