are CAS-retry loops guaranteed-bounded under certain contention scenarios?


Ross Bencina

unread,
Apr 17, 2014, 6:01:30 AM4/17/14
to mechanica...@googlegroups.com
Hi Everyone,

I recently discovered this list and I'm enjoying following the discussions.
I develop real-time audio applications in C++ on consumer operating
systems (Windows, OS X). I've been using simple SPSC lock-free
techniques to avoid priority inversion. Now I'm exploring higher
parallelism multi-core use-cases. I need multi-producer and/or
multi-consumer queues.

For real-time use I'm interested in primitives with bounded worst-case
execution time. For now I'd prefer to stick to the simple, trusted,
mainstream algorithms rather than go down the wait-free route. In
theory, lock-free queues such as the IBM Freelist, and MS-Queue don't
guarantee bounded execution time due to CAS retry loops, but I'm
wondering whether in practice, under certain conditions, they can be
shown to have bounded time behavior.
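
For concreteness, the kind of retry loop in question can be sketched with C++11 atomics as a Treiber-style stack push (illustrative only; the `Node` type and memory orderings are my assumptions, not code from any of the queues named above):

```cpp
#include <atomic>

// Hypothetical node type for a Treiber-style lock-free LIFO.
struct Node {
    Node* next;
    int   value;
};

std::atomic<Node*> top{nullptr};

// Classic CAS retry loop: in theory it can fail and retry an
// unbounded number of times if other producers keep winning the CAS.
void push(Node* n) {
    Node* old = top.load(std::memory_order_relaxed);
    do {
        n->next = old; // re-link against the latest observed top
        // compare_exchange_weak reloads `old` on failure, so each
        // retry starts from the value that beat us.
    } while (!top.compare_exchange_weak(old, n,
                                        std::memory_order_release,
                                        std::memory_order_relaxed));
}
```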

I think this boils down to the following question:

Are there known conditions under which contending CAS retry loops are
time-bounded on a multi-processor system?

If the global operation rate is low enough, might it be possible to say
that the CAS retry count is somehow bounded by a function of min(#cpus,
#contending-threads)?

One of Martin's posts about Sandybridge [1] suggests that CAS has
fairness properties built in. Does this mean that the microarchitecture
guarantees starvation-free CAS loops? Comments on behavior of other
microarchitectures would also be interesting.

So far a literature search has only turned up [2], which proves that
under certain real-time scheduling methods, CAS retry loops can be
bounded *on a uniprocessor* (i.e. there are only so many opportunities
for a CAS to fail per unit time). However these arguments don't hold on
multiprocessors, so I'm looking for other arguments and/or evidence.

Thanks,

Ross.

[1]
http://mechanical-sympathy.blogspot.com.au/2013/01/further-adventures-with-cas.html

[2] https://www.cs.unc.edu/~anderson/papers/tocs97.pdf
JAMES H. ANDERSON, SRIKANTH RAMAMURTHY, and KEVIN JEFFAY, "Real-Time
Computing with Lock-Free Shared Objects," ACM Transactions on Computer
Systems, Vol. 15, No. 2, May 1997, Pages 134-165.

Martin Thompson

unread,
Apr 17, 2014, 6:37:43 AM4/17/14
to mechanica...@googlegroups.com
If you want a wait-free multi-producer, single-consumer queue you could try the following:

http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue

It could easily be reworked to do SPMC.

To avoid spinning CAS loops I tend to lean on algorithms that use LOCK XADD, or LOCK XCHG rather than LOCK CMPXCHG. We can get at XADD and XCHG now in Java 8 via the new intrinsics added to Unsafe :-)
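
For readers following along in C++, the distinction Martin draws can be sketched with std::atomic (on x86, fetch_add typically compiles to LOCK XADD; the counter and names here are illustrative, not from any specific queue):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> nextSlot{0};

// Wait-free claim: a single LOCK XADD on x86. There is no retry
// loop -- every caller gets a unique, monotonically increasing
// sequence number in one atomic instruction.
uint64_t claimSlot() {
    return nextSlot.fetch_add(1, std::memory_order_acq_rel);
}
```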

Martin...

Sergey Zubarev

unread,
Apr 17, 2014, 7:45:13 AM4/17/14
to mechanica...@googlegroups.com
Hi, Ross!

An MPSC lock-free queue based on a linked list
can be implemented using an atomic swap on the write side;
I guess it will be time-bounded on Intel using the XCHG instruction.

Sergey.

Chris Vest

unread,
Apr 17, 2014, 8:45:02 AM4/17/14
to mechanica...@googlegroups.com
Just to make sure I understand:

The XCHG instruction is semantically equivalent to a (Java) volatile read followed by a (Java) volatile write, but as a single atomic unit?

Cheers,
Chris

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Martin Thompson

unread,
Apr 17, 2014, 9:14:23 AM4/17/14
to mechanica...@googlegroups.com
Yes, XCHG is the getAndSet() variant in the Atomic* classes.

The JLS memory model says nothing about the Atomic classes. Hopefully this can be addressed for Java 9.

Vitaly Davidovich

unread,
Apr 17, 2014, 9:15:04 AM4/17/14
to mechanica...@googlegroups.com

Hi Ross,

Generally/theoretically speaking, CAS retry loops are unbounded.  If you have a low contention rate (as you hinted at), then the actual running time will be lower than if the state is highly contended.  Depending on what type of work is submitted to the queue, one can make the statement that the system/algorithm as a whole is making progress, though, even if individual workers are losing CAS attempts (e.g. a homogeneous queue).

As Martin mentioned, one way to avoid CAS retry is to "claim slots" in the underlying queue, such as by using a LOCK'd XADD instruction (think ring buffer or circular-array-backed queue).  The problem, still, is that under high contention throughput will go down as cache lines bounce around and RFO traffic dominates (most likely, if the rest of the work is dirt cheap).  Ideally what one wants here is per-CPU partitioning of work.
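
The slot-claiming idea can be sketched as follows (hypothetical names; the publication handshake between producer and consumer is omitted, this only shows the wait-free claim step):

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

constexpr std::size_t kCapacity = 8; // assumed power-of-two capacity

std::atomic<uint64_t> tail{0};       // producer sequence counter

// Each producer claims a unique sequence number with one LOCK XADD,
// then maps it onto a slot in the circular array. The claim itself
// never retries, though producers still contend on the cache line
// holding `tail` (the RFO traffic mentioned above).
std::size_t claim() {
    uint64_t seq = tail.fetch_add(1, std::memory_order_acq_rel);
    return static_cast<std::size_t>(seq & (kCapacity - 1));
}
```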

As for SandyBridge, there's no fairness built in (as far as I know, at least).  As Martin's blog states, the drop in throughput is purely because Intel improved the interconnect latency between cores, and what used to be "bulk" updates by a core no longer are.  In my opinion, hardware will never attempt to give fairness, if for no other reason than the performance implications of that.

Sent from my phone


Ross Bencina

unread,
Apr 17, 2014, 9:26:43 AM4/17/14
to mechanica...@googlegroups.com
Hi Martin, Sergey,

On 17/04/2014 8:37 PM, Martin Thompson wrote:
> If you want a wait-free multi-producer queue to single-consumer you
> could try the following:
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue

On 17/04/2014 9:45 PM, Sergey Zubarev wrote:
> MPSC lock-free queue based on the linked list
> can be implemented using atomic swap on write side

Thanks for the suggestion. I'm familiar with Mr Vyukov's algorithm. It
has many nice properties. The problem is that it isn't non-blocking. If
the producer is interrupted during mpscq_push at (*) then the consumer
is blocked. (Sergey: is this the algorithm you were thinking of too?)

void mpscq_push(mpscq_t* self, mpscq_node_t* n)
{
    n->next = 0;
    mpscq_node_t* prev = XCHG(&self->head, n);
    (*)
    prev->next = n;
}

If there are both high-priority and low-priority producers, a
low-priority producer can stall the queue. This is effectively a
priority inversion on the queue. FYI my priority structure is:

REAL-TIME: producer
NEAR-REAL-TIME: consumer
NORMAL: producer
LOW: producer

One possible way around the blocking issue would be to use Vyukov's
queue for high-priority producers and a CAS-loop based queue (e.g.
reversed IBM freelist, MS-queue) for lower priority producers, and then
work out how to reconstruct the output of both queues into a valid
message order in the consumer. I am not sure how to do this (maybe
something involving a global counter or barrier messages?).

In any case, if I can avoid multiple queues that would be nice. It
seemed easiest to me if I can just show that for my use case, the CAS
retry loops do not cause unbounded stalling. Hence my question about
known bounding conditions for CAS retry loops on multiprocessors.


> To avoid spinning CAS loops I tend to lean on algorithms that use
> LOCK XADD, or LOCK XCHG rather than LOCK CMPXCHG. We can get at XADD
> and XCHG now in Java 8 via the new intrinsics added to Unsafe :-)

Perhaps I could take a hybrid approach and implement two push()
operations: an XCHG based push() for high-priority producers and a
CAS-retry based push() for lower priority producers. I think that I can
do this with the "reversed IBM freelist" FIFO, not sure about others.

Thanks,

Ross.

Sergey Zubarev

unread,
Apr 17, 2014, 1:09:31 PM4/17/14
to mechanica...@googlegroups.com


On Thursday, April 17, 2014 5:26:43 PM UTC+4, Ross Bencina wrote:
Hi Martin, Sergey,

On 17/04/2014 8:37 PM, Martin Thompson wrote:
> If you want a wait-free multi-producer queue to single-consumer you
> could try the following:
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue

On 17/04/2014 9:45 PM, Sergey Zubarev wrote:
 > MPSC lock-free queue based on the linked list
 > can be implemented using atomic swap on write side

Thanks for the suggestion. I'm familiar with Mr Vyukov's algorithm. It
has many nice properties. The problem is that it isn't non-blocking. If
the producer is interrupted during mpscq_push at (*) then the consumer
is blocked. (Sergey: is this the algorithm you were thinking of too?)

Yes.

I do not think a CAS-retry push() will work properly with a pop() function
designed to work with XCHG. It looks like it will require a hybrid pop() as well.

By the way, do you have any numbers on how much time the lock-free
queue with XCHG can spend waiting for the next item?
I guess it depends on OS and hardware, but still quite interesting...


Ross Bencina

unread,
Apr 17, 2014, 1:55:24 PM4/17/14
to mechanica...@googlegroups.com
On 18/04/2014 3:09 AM, Sergey Zubarev wrote:
> I do not think a CAS-retry push() will work properly with a pop() function
> designed to work with XCHG. It looks like it will require hybrid pop()
> as well.

Agreed. Here's what I have so far for the reversed IBM freelist with
wait-free and non-wait-free push() options. The idea is that you only
use the wait-free version in threads with higher priority than the
consumer. That means that the consumer should only see bubbles in the
queue if it is running on a different core.

I realise that there are potential performance issues with reversing
LIFOs into FIFOs like this but I believe that I can afford to burn some
cycles in the consumer.


Node {
    Node *next_;
}

// MPSC FIFO
Queue {
    Node *atomicLifoTop_;
    Node *consumerFifo_; // local to consumer
    Node *consumerLifo_; // local to consumer
}


void Queue::init() {
    atomicLifoTop_ = 0;
    consumerLifo_ = 0;
    consumerFifo_ = 0;
}

// non-blocking push. cf. Treiber
void Queue::push_a(Node *n, bool& shouldWakeConsumer) {
    Node *top;
    do {
        top = atomicLifoTop_;
        n->next_ = top;
    } while (!CAS(&atomicLifoTop_, top, n));

    shouldWakeConsumer = (top == 0); // was empty
}

// wait-free push(). can block the consumer.
// should only be called by threads above the priority of the consumer.
// similar to Vyukov but note that we're building a LIFO a la Treiber.
// the second XCHG is used so we can handshake with the consumer and know
// whether to signal the consumer to wake up after a bubble.
void Queue::push_b(Node *n, bool& shouldWakeConsumer) {
    n->next_ = SENTINEL1; // can be address of atomicLifoTop_ for example.
    // _release_
    Node *top = XCHG(&atomicLifoTop_, n); // sync point with respect to producers and consumer
    Node *n_next = XCHG(&n->next_, top);  // n->next_ = top. sync point with respect to consumer

    // was empty or a bubble happened
    shouldWakeConsumer = (top == NULL || n_next == SENTINEL2);
}


Node* Queue::reversingLoop_() {
    assert( consumerLifo_ != 0 );
    for (;;) {
        Node *n = consumerLifo_;

        Node *n_next = n->next_; // passive atomic load

        if (n_next == SENTINEL1) { // queue is maybe blocked by an in-progress push
            n_next = XCHG(&n->next_, SENTINEL2); // signal the producer that it needs to
                                                 // wake us up when it's done; grab a trusted value
            if (n_next == SENTINEL1) { // we really are blocked
                return 0;
            }

            // otherwise we now have a good n_next value to use.
            // note that n_next can't be SENTINEL2 here.

        } else if (n_next == SENTINEL2) { // still blocked from previous reversingLoop_() call
            return 0;
        }

        if (n_next == NULL) {
            consumerLifo_ = 0;
            n->next_ = 0;
            return n;
        }

        // reverse onto FIFO stack
        consumerLifo_ = n_next;
        n->next_ = consumerFifo_;
        consumerFifo_ = n;
    }
}


Node* Queue::pop() {
    if (consumerLifo_) {
        // reversing in progress due to being blocked on a SENTINEL1
        return reversingLoop_();
    } else if (consumerFifo_) {
        Node *n = consumerFifo_;
        consumerFifo_ = n->next_;
        return n;
    } else if (atomicLifoTop_) { // don't try to modify it unless it's non-zero
        consumerLifo_ = XCHG(&atomicLifoTop_, 0);
        return reversingLoop_();
    } else {
        return 0;
    }
}



> By the way, have you any numbers how much time the lock-free
> queue with XCHG can spend waiting the next item?
> I guess it depends on OS and hardware, but still quite interesting...

I'm not sure what question you're asking here. But if you mean how long
the delay can be between XCHG and patching up the next pointer, it is
going to depend on the thread priorities, system load, OS, etc. I have
not measured, but for a real-time application you have to decide at some
point that you are not willing to take the risk. To me a priority
inversion related problem seems much riskier than a CAS-retry loop problem.

Ross.






Gil Tene

unread,
Apr 17, 2014, 9:17:15 PM4/17/14
to
It's "hard" (my way of saying "impossible" without committing to it) for hardware to provide fairness for CAS-retry loops. Hardware can try to provide some level of fairness in "who wins the CAS" when multiple CASs are in flight, but the retry logic part of a CAS-retry loop is not visible to the hardware. It's just regular non-atomic instructions, which eventually end up doing another CAS. I don't know of any hardware that tries to figure out that the second CAS is a "repeat" in order to deal with it any differently than it would when it sees an unrelated CAS...

On our Vega machines (which had 800+ cores), we found that even though the hardware guaranteed forward progress for CASs, using spinlocks became impractical for many things inside the kernel due to starvation problems, and we ended up moving to variations of fairlocks instead. Linux appears to have recently moved to ticketed spinlocks for similar reasons. I expect heavy-contention user-mode software to need the same when swap-based algorithms are not enough.

Atomic Swap (or atomic add) based algorithms do help in some places, but only on hardware that supports such operations. x86 does, but some other CPUs don't. A CAS, or TestAndSet, or LL/SC is sometimes all you get...

Ross Bencina

unread,
Apr 18, 2014, 4:06:02 AM4/18/14
to mechanica...@googlegroups.com
Thanks for your insights Gil. I have put a couple of specific questions
about your response in-line below, marked [GT?:].

May I make an additional humble request: would it be possible for you or
someone else to correct my ignorance about conditions under which CAS
and LL-SC can fail, please? I've tried to capture what I know below.


On 18/04/2014 11:15 AM, Gil Tene wrote:
> It's "hard" (my way of saying "impossible" without committing to it) for
> hardware to provide fairness for CAS-retry loops. Hardware can try to
> provide some level of fairness in "who wins the CAS" when multiple CASs
> are in flight, but the retry logic part of a CAS-retry loop is not
> visible to the hardware. It's just regular non-atomic instructions,
> which eventually end up doing another CAS. I don't know of any hardware
> that tries to figure out that the second CAS is a "repeat" in order to
> deal with it any differently than it would when it sees an unrelated CAS...
>
> On our Vega machines (which had 800+ cores), we found that even though
> the hardware guaranteed forward progress for CASs, using spinlocks
> became impractical for many things inside the kernel due to starvation
> problems

[GT?:] What does it mean for hardware to guarantee forward progress and
at the same time have starvation problems? Do you mean that the hardware
guarantees forward progress of at least one CAS? (global lock-free property)


> , and we ended up moving to variations of fairlocks instead.
> Linux appears to have recently moved to ticketed spinlocks for similar
> reasons.

That's interesting. It parallels the research focus on wait-free queues
-- i.e. the need for stronger software primitives to ensure fairness.


> I expect heavy-contention user-mode software to need the same
> when swap-based algorithms are not enough.

How do you define heavy contention? And when exactly are swap-based
algorithms not enough? My contention is limited. Can this fact be used
to give me certainty about time/retry bounds?

As I understand it, there are three classes of reasons for CAS or LL-SC
loops to retry (but maybe I'm wrong?):

1. The OS interrupts and/or reschedules during the CAS or LL-SC
loop/section. This invalidates LL-SC requirements and/or causes other
threads/cores to run and modify the atomic variable, so the operation
fails. It is caused by interrupt and OS scheduler behavior on the
current core only.

2. The cache coherency mechanism resolves a race on the memory address
of interest in favor of another core and the CAS or SC fails. This is
the unavoidable "true-contention" situation.

3. "False contention" triggers spurious weak LL-SC failure due to
another core's access to a different memory address (perhaps accesses
to the same cache line, for example). My understanding is that CAS never
fails spuriously unless the comparison fails, so false contention is not
relevant for CAS.

Have I missed anything?

Atomic-swap and atomic-add can dodge (1) and (3), and all such
operations get serialized with respect to races (2).

Some thoughts on bounding the effects of (1), (2), (3) for CAS and LL-SC:

(1) Interrupt/scheduler triggered failure is a property of the operating
system scheduler and the interrupt schema of the machine. In principle,
bounds for these effects can be arrived at by analysis of scheduler
behavior -- I've seen it done in research papers on using lock-free
algorithms for real-time applications on uniprocessors.

(2) Application caused contention can be managed and understood at the
application level. Although it's unclear exactly what application
conditions would guarantee boundedness aside from "never contend". What
I have in mind are constraints on operation density and minimum
time-spacing of operations in contending tasks.

(3) False contention induced failure is a property of the memory
consistency hardware. I guess that on some systems with weak LL-SC there
is no way to avoid unbounded false contention triggering spurious
retries. But on some hardware perhaps it is enough to isolate the
variables of interest on their own cache lines or pages. How common is
this kind of false contention in practice? Can it be avoided entirely on
modern systems by placing shared data on separate pages?

Thanks,

Ross.


> Atomic Swap (or atomic add) based algorithms do help in some places, but
> only on hardware that supports such operations. X86 does, but on some
> other CPUs don't. A CAS, or TestAndSet, or LL/SC is sometimes all you get...

Sergey Zubarev

unread,
Apr 18, 2014, 5:39:00 AM4/18/14
to mechanica...@googlegroups.com
Is it working code you use or pseudocode? 

My experience shows that sometimes (in a particular task) a lock-free approach
with more than one atomic operation gives almost the same performance results as a mutex.

Gil Tene

unread,
Apr 18, 2014, 5:53:42 AM4/18/14
to <mechanical-sympathy@googlegroups.com>


Sent from my iPad

> On Apr 18, 2014, at 1:06 AM, "Ross Bencina" <rossb...@audiomulch.com> wrote:
>
> Thanks for your insights Gil. I have put a couple of specific questions about your response in-line below, marked [GT?:].
>
> May I make an additional humble request: would it be possible for you or someone else to correct my ignorance about conditions under which CAS and LL-SC can fail, please? I've tried to capture what I know below.
>
>
>> On 18/04/2014 11:15 AM, Gil Tene wrote:
>> It's "hard" (my way of saying "impossible" without committing to it) for
>> hardware to provide fairness for CAS-retry loops. Hardware can try to
>> provide some level of fairness in "who wins the CAS" when multiple CASs
>> are in flight, but the retry logic part of a CAS-retry loop is not
>> visible to the hardware. It's just regular non-atomic instructions,
>> which eventually end up doing another CAS. I don't know of any hardware
>> that tries to figure out that the second CAS is a "repeat" in order to
>> deal with it any differently than it would when it sees an unrelated CAS...
>>
>> On our Vega machines (which had 800+ cores), we found that even though
>> the hardware guaranteed forward progress for CASs, using spinlocks
>> became impractical for many things inside the kernel due to starvation
>> problems
>
> [GT?:] What does it mean for hardware to guarantee forward progress and at the same time have starvation problems? Do you mean that the hardware guarantees forward progress of at least one CAS? (global lock-free property)

Yes. For some implementations of CAS or LL/SC (the ones that guarantee forward progress). E.g. X86 CAS, and I *think* the ARM LDREX/STREX both guarantee forward progress. But MIPS and PowerPC LL/SC usually don't.

>> , and we ended up moving to variations of fairlocks instead.
>> Linux appears to have recently moved to ticketed spinlocks for similar
>> reasons.
>
> That's interesting. It parallels the research focus on wait-free queues -- i.e. the need for stronger software primitives to ensure fairness.
>
>
>> I expect heavy-contention user-mode software to need the same
>> when swap-based algorithms are not enough.
>
> How to define heavy-contention?

I'd "define" it as "enough to cause extreme unfairness or starvation". From memory, I believe that on Vega we saw some threads take 40 seconds to grab a spinlock while others were getting the same spinlocks thousands of times a second. Usually at 30+ tightly competing cores.

> and when exactly are swap-based algorithms not enough?

When your algorithm needs conditional atomic operations. E.g. locks and counting semaphores are harder to implement with swap-only or add-only primitives, and would require spinning swap loops to get right (which are as bad or worse than spinning CAS loops for other needs).

> My contention is limited. Can this fact be used to give me certainty about time/retry bounds?

If you know everything about your system, including maximum contention rates and lock-held times, and the specific latencies involved in everything, you can probably prove a bound. But it would be complicated, and non-portable (e.g. to a different CPU frequency or core count).

>
> As I understand it, there are three classes of reasons for CAS or LL-SC loops to retry (but maybe I'm wrong?):
>
> 1. The OS interrupts and/or reschedules during the CAS or LL-SC loop/section. This invalidates LL-SC requirements and/or causes other threads/cores to run and modify the atomic variable, so the operation fails. It is caused by interrupt and OS scheduler behavior on the current core only.

CAS instructions are usually atomic and uninterruptible (as are all single instruction operations, usually). LL/SC has the interrupt susceptibility.

> 2. The cache coherency mechanism resolves a race on the memory address of interest in favor of another core and the CAS or SC fails. This is the unavoidable "true-contention" situation.

This is only the case for LL/SC, or for could-spuriously-fail CAS variants. The CAS form that only fails if the condition fails (x86 for example) is not susceptible to coherency races, just to actual races that make the condition fail because the value is not the one expected.

>
> 3. "False contention" triggers spurious weak LL-SC failure due to another core's access to a different memory address (perhaps accesses to the same cache line, for example). My understanding is that CAS never fails spuriously unless the comparison fails, so false contention is not relevant for CAS.

Most modern CAS implementations guarantee no spurious failure, but some (older) CAS implementation could fail even if the condition would have succeeded.

>
> Have I missed anything?
>
> Atomic-swap and atomic-add can dodge (1) and (3), and all such operations get serialized with respect to races (2).
>
> Some thoughts on bounding the effects of (1), (2), (3) for CAS and LL-SC:
>
> (1) Interrupt/scheduler triggered failure is a property of the operating system scheduler and the interrupt schema of the machine. In principle, bounds for these effects can be arrived at by analysis of scheduler behavior -- I've seen it done in research papers on using lock-free algorithms for real-time applications on uniprocessors.
>
> (2) Application caused contention can be managed and understood at the application level. Although it's unclear exactly what application conditions would guarantee boundedness aside from "never contend". What I have in mind are constraints on operation density and minimum time-spacing of operations in contending tasks.

Possible, but scary, and I'd want a lot of padding on those constraints. E.g. Keep in mind that certain persistent (but correctable) error conditions, like a single broken bit in DRAM combined with a MESIF protocol and cross-socket contention that involves DRAM reads and writes in the race, can cause longer operation times than normal.

>
> (3) False contention induced failure is a property of the memory consistency hardware. I guess that on some systems with weak LL-SC there is no way to avoid unbounded false contention triggering spurious retries. But on some hardware perhaps it is enough to isolate the variables of interest on their own cache lines or pages. How common is this kind of false contention in practice? Can it be avoided entirely on modern systems by placing shared data on separate pages?

Separate cache lines is usually enough. Same padding techniques as those used to avoid false-sharing bottlenecks.
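
The padding technique referred to here can be sketched as follows (assuming a 64-byte line, which holds on current x86; C++17's std::hardware_destructive_interference_size can replace the literal where available):

```cpp
#include <atomic>
#include <cstdint>

// One shared counter per cache line: writes to one counter cannot
// cause false contention on its neighbours. Assumes 8-byte atomics
// and 64-byte cache lines.
struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
    char pad[64 - sizeof(std::atomic<uint64_t>)];
};

static_assert(sizeof(PaddedCounter) == 64, "exactly one line per counter");
```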

Ross Bencina

unread,
Apr 18, 2014, 6:54:10 AM4/18/14
to mechanica...@googlegroups.com
On 18/04/2014 7:39 PM, Sergey Zubarev wrote:
> Is it working code you use or pseudocode?

That one is pseudocode (no explicit atomics and no memory barriers for a
start; the same goes for a lot of the code on Dmitry Vyukov's site too). It
is based on working reversed IBM freelist code that I have running here.
I don't claim that the pseudocode works, or that it performs, just that
it looks like it might. The complexity with the extra sentinel is only
there because I need to know when to signal the worker to wake up.


> My experience shows sometimes (in a particular task) lock-free approach
> with more than one atomic operations gives almost the same performance
> results as a mutex.

Unfortunately for my use-case, I can't use mutexes because they are
prone to priority inversion on my target platforms (Windows and OS X).
Even if the lock-free algorithms are slower than mutexes I would use them.

Also keep in mind that real-time is about bounding worst-case execution
time, not about maximising the average case. I don't get to amortise
essential work across multiple audio buffer deadlines. A mutex fastpath
is not very interesting to me if there's also a slow path.

Ross.

Ross Bencina

unread,
Apr 18, 2014, 8:06:58 AM4/18/14
to mechanica...@googlegroups.com
On 18/04/2014 7:53 PM, Gil Tene wrote:
> Sent from my iPad

Thanks Gil, comments in line...

>> [GT?:] What does it mean for hardware to guarantee forward progress
>> and at the same time have starvation problems? Do you mean that the
>> hardware guarantees forward progress of at least one CAS? (global
>> lock-free property)
>
> Yes. For some implementations of CAS or LL/SC (the ones that
> guarantee forward progress). E.g. X86 CAS, and I *think* the ARM
> LDREX/STREX both guarantee forward progress. But MIPS and PowerPC
> LL/SC usually don't.

Very interesting, I had no idea about that.


> If you know everything about your system, including maximum
> contention rates and lock-held times, and the specific latencies
> involved in everything, you can probably prove a bound. But it would
> be complicated, and non-portable (e.g. to a different CPU frequency
> or core count).
>>
>> As I understand it, there are three classes of reasons for CAS or
>> LL-SC loops to retry (but maybe I'm wrong?):
>>
>> 1. The OS interrupts and/or reschedules during the CAS or LL-SC
>> loop/section. This invalidates LL-SC requirements and/or causes
>> other threads/cores to run and modify the atomic variable, so the
>> operation fails. It is caused by interrupt and OS scheduler
>> behavior on the current core only.
>
> CAS instructions are usually atomic and uninterruptible (as are all
> single instruction operations, usually). LL/SC has the interrupt
> susceptibility.

What I was trying to get at there is that a CAS retry loop can fail even
if all operations are on one core, e.g.:

10 do {
20     oldVal := val
30     newVal := something
40 } while( !CAS(&val, oldVal, newVal) )

If the program is interrupted after line 20 and before the CAS
instruction is issued, then the CAS may fail due to another thread on
the same core modifying val. Any account of whether CAS-retry is
real-time safe needs to consider this case in addition to others.


>> 2. The cache coherency mechanism resolves a race on the memory
>> address of interest in favor of another core and the CAS or SC
>> fails. This is the unavoidable "true-contention" situation.
>
> This is only the case for LL/SC, or for could-spuriously-fail CAS
> variants. The CAS form that only fails if the condition fails (x86
> for example) is not susceptible to coherency races, just to actual
> races that make the condition fail because the value is not the one
> expected.

Perhaps I don't know what a "coherency race" is. What's the difference
between a coherency race and an actual race?

Aren't all races, in the end, coherency races? The coherence substrate
mediates all memory operations.
My understanding is that RAM can cause timing issues irrespective of the
use of CAS loops. And of course, caches induce wildly varying timing
behavior.

Ultimately these machines are not suitable for hard real-time work.

Maybe I am asking the wrong question trying to bound CAS retry. Perhaps
the real question is more about whether CAS retry is any more of a risk
to timing than any of a myriad of other uncontrolled system variables in
the average personal computer.


(Incidentally, if anyone is following along, I just found a nice
"tutorial" from The Mill's Ivan Godard on the topic of CAS retry loops
vs RMW ops:
https://groups.google.com/d/msg/comp.arch/RAm2iXLlCu4/ZcGi4EvqcbEJ )

Thank you,

Ross.

Vitaly Davidovich

unread,
Apr 18, 2014, 8:28:29 AM4/18/14
to mechanica...@googlegroups.com

A coherence race is basically a race to claim exclusive access to a cache line.  With a CAS retry loop, there's a chance that while your loop body is running another core has modified the condition variable; when you get around to the CAS, your core could get exclusive access to the line, but find the actual condition value is not the expected one - that's a normal race.

With LL/SC the condition value may be the expected one, but the cache line may have been updated in the meantime, even to the expected value; LL/SC will fail here, whereas CAS would succeed.  This is a coherence race.
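
A single-threaded sketch of the consequence (the two stores stand in for another thread racing us; this is the classic ABA window, where a value-comparing CAS succeeds even though the location was written twice, whereas an LL/SC pair over the same history would fail):

```cpp
#include <atomic>

std::atomic<int> x{0};

// Simulates A -> B -> A between our read and our CAS. CAS compares
// only the value, so it cannot see the intervening writes; a
// store-conditional would fail because the line was written at all.
bool casSucceedsAfterAba() {
    int expected = x.load();       // observe A (0)
    x.store(1);                    // "another thread": A -> B
    x.store(0);                    // "another thread": B -> A
    return x.compare_exchange_strong(expected, 42);
}
```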

Sent from my phone



Gil Tene

unread,
Apr 18, 2014, 10:14:22 AM4/18/14
to <mechanical-sympathy@googlegroups.com>
Exactly. That's what makes LL/SC variants unable to guarantee forward progress.

With [guaranteed forward progress forms of] CAS the result of increased contention is unfairness (to the point of starvation). With LL/SC the result of increased contention is a livelock, where nothing moves at all.

Sent from my iPad