
Why are lock-free implementations faster than traditional synchronization? Are they?


Michael Podolsky

Oct 8, 2011, 9:08:13 PM
Hi All,

I've discovered I have a very basic misunderstanding of the topic of
lock-free algorithms.

They possess many good qualities, but is performance one of these
qualities? And if it is, to what extent and for what reasons
exactly?

Of course, there may be some special situations where we may run
faster with lock-free and wait-free implementations using solely
interlocked-xadd-like operations.

But when it comes to using our main tool, the cmpxchg instruction,
our code gets a kind of transactional semantics with failing/retrying
attempts to modify data. It's as if we are spinning(!)
on attempts to modify some data with cmpxchg. Now I should ask: how
does it differ, performance-wise, from using just a spin-lock, so
that instead of

example 1
do { oldPtr = read(ptr)
     prepare newValue of ptr with optional pointed data
} while (!cmpxchg(&ptr, oldPtr, newValue));
recycle oldPtr

we write

example 2
lock(spinlock)
oldPtr = read(ptr)
prepare newValue of ptr with optional pointed data
recycle/delete oldPtr
ptr = newValue;
unlock(spinlock)
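
To make this concrete, I mean roughly the following in C++11-style
code (just a sketch: `data' and the function names are placeholders I
made up, std::mutex stands in for the spinlock, and I leave the
recycling out as it is a separate problem):

________________________________________________
#include <atomic>
#include <mutex>

struct data { int payload; };

std::atomic<data*> ptr{new data{0}};
std::mutex spinlock; // stand-in for a real spinlock

// example 1: prepare on the side, publish with a single CAS
void update_example1(int v)
{
    data* oldPtr = ptr.load();
    data* newValue = new data{v};
    while (!ptr.compare_exchange_weak(oldPtr, newValue))
        ; // a failed CAS refreshes oldPtr; retry (here newValue does
          // not depend on the old one, so it need not be rebuilt)
    // "recycle oldPtr" is elided: reclaiming memory that concurrent
    // readers may still hold is its own hard problem.
}

// example 2: everything, including the delete, under the lock
void update_example2(int v)
{
    std::lock_guard<std::mutex> g(spinlock);
    data* oldPtr = ptr.load();
    data* newValue = new data{v};
    delete oldPtr; // safe only if all accesses take the lock
    ptr.store(newValue);
}
________________________________________________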

One may argue that waiting on lock(spinlock) saturates the bus much
more than the rarely repeated cmpxchg in example 1, but this is
merely a question of implementing an efficient spinlock for multicore.
Supposing that spinning on our spinlock does not saturate the bus and
just wastes the internal resources of one core (which is also the case
in example 1), why should the code in example 1 be faster than in
example 2?

If there are other important contexts, not covered by my examples,
where lock-free runs faster, I'd be glad to have them pointed out;
until then let me suppose that what is written in example 1 is a
rather general pattern of lock-free code. So does it run faster than
its analogue which uses spinlocks?

Yes, I understand that example 2 might run slowly in some
circumstances, e.g. when a thread which locked a spinlock is
preempted and other threads start to spin. May a composite lock
(spinlock + kernel mutex) be helpful in this case? Or if we take a
pure kernel-level implementation where a 'thread' cannot be
'preempted', can we say that example 2 is as good as example 1?
Probably not, as Linux uses RCU, but what are the exact reasons?

How should one answer the question of why exactly lock-free may run
faster, in user space or in the kernel, wherever?

Thanks,
Michael

Casper H.S. Dik

Oct 9, 2011, 6:45:40 AM
Michael Podolsky <michael.p...@gmail.com> writes:

>How should one answer the question of why exactly lock-free may run
>faster, in user space or in the kernel, wherever?

When there is little or no contention, or on systems with few
competing threads.

Casper
--

Marcel Müller

Oct 10, 2011, 7:22:53 AM
On 09.10.11 03.08, Michael Podolsky wrote:
> I've discovered I have a very basic misunderstanding of the topic of
> lock-free algorithms.
>
> They possess many good qualities, but is performance one of these
> qualities? And if it is, to what extent and for what reasons
> exactly?

Well, the reason is that they do not lock and block.
Each time a thread is blocked at a mutex it loses its current time
slice. Even if it needs to stop for only a few clock cycles, the
scheduler will not immediately reschedule the thread as long as there
is something else to do. This also turns most of the cache content
into waste. A lock-free algorithm will not give up its time slice.

> Of course, there may be some special situations where we may run
> faster with lock-free and wait-free implementations using solely
> interlocked-xadd-like operations.
>
> But when it comes to using our main tool, the cmpxchg instruction,
> our code gets a kind of transactional semantics with failing/retrying
> attempts to modify data. It's as if we are spinning(!)
> on attempts to modify some data with cmpxchg.

You are right: cmpxchg is similar to a spin-lock, and it will have
performance comparable to a fast mutex implemented with a spin-lock.
But there are some important differences.

> Now I should ask: how does it differ, performance-wise, from using
> just a spin-lock, so that instead of
>
> example 1
> do { oldPtr = read(ptr)
>      prepare newValue of ptr with optional pointed data
> } while (!cmpxchg(&ptr, oldPtr, newValue));
> recycle oldPtr
>
> we write
>
> example 2
> lock(spinlock)
> oldPtr = read(ptr)
> prepare newValue of ptr with optional pointed data
> recycle/delete oldPtr
> ptr = newValue;
> unlock(spinlock)

Compare-exchange only comes into play if the old value is needed to
compute the new value in a non-trivial way, i.e. where no dedicated
machine instruction is available. Most platforms offer trivial
operations like x+value, x-value, x&value etc. directly, without a loop.
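
For example (a sketch in C++11 syntax; the counter and function names
are made up):

________________________________________________
#include <atomic>

std::atomic<unsigned> counter{0};

// Trivial update: maps directly to an atomic add instruction
// (e.g. lock xadd on x86); no retry loop is needed.
void bump() { counter.fetch_add(1, std::memory_order_relaxed); }

// Non-trivial update (here: a saturating increment): no single
// machine instruction computes this, so a compare-exchange loop
// is required.
void bump_saturating()
{
    unsigned old = counter.load(std::memory_order_relaxed);
    while (old != ~0u &&
           !counter.compare_exchange_weak(old, old + 1,
                                          std::memory_order_relaxed))
        ; // a failed CAS reloads `old'; retry
}
________________________________________________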

> One may argue that waiting on lock(spinlock) saturates the bus much
> more than the rarely repeated cmpxchg in example 1, but this is
> merely a question of implementing an efficient spinlock for multicore.

The collision probability scales with the square of the time taken in
one spinning loop cycle. So the basic idea is to keep this time as
short as possible. It is obvious that spinning on some spin-lock
variable and then modifying some other memory location holds the lock
longer than reading the old value and writing the new value back,
which implicitly releases the lock. The compare-exchange simply joins
the memory location of the mutex with the memory location of the data
to protect. This also shows when it makes sense: only as long as you
do not protect more than one memory location with the lock. But keep
in mind that fine-grained locking is a requirement for scalability.

And there is another important difference. The loop condition is moved
from the beginning of the critical section to the end. This has a
significant effect on scheduling issues. See below.

> Supposing that spinning on our spinlock does not saturate the bus and
> just wastes the internal resources of one core (which is also the case
> in example 1), why should the code in example 1 be faster than in
> example 2?

The loop is smaller. Example 1 results in a collision only if at
least two threads are between lines 1 and 3. In example 2 the window
extends from line 1 to line 6. OK, there is a bug: the delete oldPtr
should be kept out of this area, as it is in example 1. And note that
the collision probability scales with the square of this time (and
even higher powers for multiple collisions).


> Yes, I understand that example 2 might run slowly in some
> circumstances, e.g. when a thread which locked a spinlock is
> preempted and other threads start to spin.

This is an important worst case. For this reason most spin-lock
implementations fall back to ordinary blocking, or at least a
sleep(0), after some number of turns.
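
Such a fall-back might look like this (only a sketch, assuming C++11;
the cutoff of 100 spins is an arbitrary choice):

________________________________________________
#include <atomic>
#include <thread>

struct fallback_spinlock
{
    std::atomic<bool> locked{false};

    void lock()
    {
        for (int spins = 0; ; ++spins) {
            if (!locked.exchange(true, std::memory_order_acquire))
                return;                    // got the lock
            if (spins >= 100) {            // arbitrary cutoff
                std::this_thread::yield(); // the "sleep(0)" fall-back
                spins = 0;
            }
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }
};
________________________________________________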

This is where the second advantage of a compare-exchange loop comes
into play. The check condition is at the end of the critical section.
This is like optimistic locking.
Any process is allowed to proceed until it wants to update a value. At
that point the update is rejected if it is not based on the previous
version. This is almost the same way any reasonable version control
repository works. (Note that the compare-exchange loop still has a
potential race condition: other threads might have modified the value
and then restored the old value before the loop completes. This is
the classic ABA problem.)
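
To make that race concrete, here is a sequential sketch (my own
illustration; the two plain stores stand in for a second thread's
A -> B -> A update):

________________________________________________
#include <atomic>
#include <cassert>

int main()
{
    std::atomic<int> x{0};  // value "A"

    int cmp = x.load();     // thread 1 reads A

    x.store(1);             // "thread 2" changes A -> B
    x.store(0);             // "thread 2" restores B -> A

    // Thread 1's CAS still succeeds: it cannot tell that the value
    // was modified and restored in the meantime.
    bool ok = x.compare_exchange_strong(cmp, 2);
    assert(ok);
    return 0;
}
________________________________________________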

Checking the condition at the end of the loop has the important
advantage that if a thread is preempted or suspended for some reason,
no other thread will spin. Only the thread that did not do its job in
time pays the bill and takes one extra turn. In contrast, a spin-lock
mutex may spin indefinitely, possibly on many cores in parallel, if
only one thread causes a delay. It simply is not lock-free. For the
same reason compare-exchange loops cannot cause priority inversion
either.


> May a composite lock (spinlock + kernel mutex) be helpful in this case?

This is one workaround. It has the disadvantage that it requires
kernel resources even if they are never needed. It also makes
construction and destruction of the fast mutex expensive, which can
be an issue if they are used in temporaries.
In fact I have not seen such an implementation of a fast mutex so far.
Either you use fast mutexes, thereby asserting that you do not hold
the mutex for long, or you don't, and use kernel mutexes.

> Or if we take a pure kernel-level implementation where a 'thread'
> cannot be 'preempted', can we say that example 2 is as good as
> example 1? Probably not, as Linux uses RCU, but what are the exact
> reasons?

First of all, you incur considerable overhead from the required
switches to kernel context. But a spin-lock is not RCU, so I don't
understand your question.


Marcel

David Schwartz

Oct 12, 2011, 7:57:39 PM
On Oct 8, 6:08 pm, Michael Podolsky <michael.podolsky...@gmail.com>
wrote:

> Of course, there may be some special situations where we may run
> faster with lock-free and wait-free implementations using solely
> interlocked-xadd-like operations.

When either there isn't much contention or there's nothing else the
system could do that wouldn't contend anyway. Lock-free algorithms
increase contention; that's bad if there's something else the system
could do that wouldn't contend. Locks deschedule contending threads.
That's awful if all threads contend, but awesome if we can schedule a
thread that will run without contention.

> But when it comes to using our main tool, the cmpxchg instruction,
> our code gets a kind of transactional semantics with failing/retrying
> attempts to modify data. It's as if we are spinning(!)

Not for most real-world algorithms. They typically guarantee that
spinning is always accompanied by forward progress. If thread A is
spinning, it's only because other threads are constantly winning. So
the system as a whole is still making brisk forward progress.

It is a rare algorithm that permits a race to cause neither thread to
make forward progress. In that case, you are right.

> example 1
> do { oldPtr = read(ptr)
>      prepare newValue of ptr with optional pointed data
> } while (!cmpxchg(&ptr, oldPtr, newValue));
> recycle oldPtr

The only 'heavy' instruction here is cmpxchg. Every other operation is
all but free.

> we write
>
> example 2
> lock(spinlock)
> oldPtr = read(ptr)
> prepare newValue of ptr with optional pointed data
> recycle/delete  oldPtr
> ptr = newValue;
> unlock(spinlock)

Here, 'unlock' might be very expensive, as it may require unblocking
another thread. That will typically mean that in the contended case,
an excursion into the kernel is needed. Plus, the 'lock' operation
typically costs as much as cmpxchg.
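
For illustration, the contended unlock path of a futex-style lock on
Linux might look roughly like this (my sketch, not any particular
library's code; `futex_wake' is a hand-rolled wrapper and the 0/1/2
state encoding is just one common convention):

________________________________________________
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hand-rolled wrapper; Linux-specific.
static long futex_wake(std::atomic<int>* addr, int nwaiters)
{
    return syscall(SYS_futex, reinterpret_cast<int*>(addr),
                   FUTEX_WAKE, nwaiters, nullptr, nullptr, 0);
}

// Lock states: 0 = free, 1 = locked, 2 = locked with waiters.
void unlock(std::atomic<int>& m)
{
    // The fast path is a single atomic exchange, but if anyone is
    // waiting we must make a full kernel call to wake them.
    if (m.exchange(0, std::memory_order_release) == 2)
        futex_wake(&m, 1);
}
________________________________________________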

> One may argue that waiting on lock(spinlock) saturates the bus much
> more than the rarely repeated cmpxchg in example 1, but this is
> merely a question of implementing an efficient spinlock for multicore.
> Supposing that spinning on our spinlock does not saturate the bus and
> just wastes the internal resources of one core (which is also the case
> in example 1), why should the code in example 1 be faster than in
> example 2?

Because it has exactly one expensive operation, the cmpxchg, and no
horrifically expensive operations even when it's contended.

> If there are other important contexts, not covered by my examples,
> where lock-free runs faster, I'd be glad to have them pointed out;
> until then let me suppose that what is written in example 1 is a
> rather general pattern of lock-free code. So does it run faster than
> its analogue which uses spinlocks?

Yes, but. Running two threads that contend, even if every cmpxchg
succeeds, tends to hammer the bus with lots of cache ping-ponging. So
it will run faster if that contention is unavoidable. For example, if
every ready-to-run thread needs to access the same collection,
lock-free will win out.

> How should one answer the question of why exactly lock-free may run
> faster, in user space or in the kernel, wherever?

Because it avoids the very expensive 'unlock' operation in the
contended case and because the window in which conflict is possible is
much narrower.

DS

Chris M. Thomasson

Oct 14, 2011, 7:54:13 PM
"David Schwartz" <dav...@webmaster.com> wrote in message
news:75f226aa-00a1-4153...@v29g2000prd.googlegroups.com...
On Oct 8, 6:08 pm, Michael Podolsky <michael.podolsky...@gmail.com>
wrote:

> > Of course, there may be some special situations where we may run
> > faster with lock-free and wait-free implementations using solely
> > interlocked-xadd-like operations.

> When either there isn't much contention or there's nothing else the
> system could do that wouldn't contend anyway. Lock-free algorithms
> increase contention; that's bad if there's something else the system
> could do that wouldn't contend.

What about all of the algorithms that rely on lock/wait/whatever-free
operation for the fast-path, and eventcounts/futexes (very-slow-path)
to provide proper blocking semantics? IMVHO, blending lock-free with
lock-based _is_ usually a good thing.



> Locks deschedule contending threads.
> That's awful if all threads contend, but awesome if we can schedule a
> thread that will run without contention.

I am quite fond of giving threads the ability to exploit the `try_lock()'
failure condition... Perhaps, something like:

http://groups.google.com/group/comp.programming.threads/msg/c642ff9131bba9ab

Humm...
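
Roughly along these lines (just a sketch; the work functions are
made up):

________________________________________________
#include <mutex>

std::mutex m; // guards some shared work, e.g. a queue

void worker()
{
    if (m.try_lock()) {
        // We won the lock: do the shared work.
        // drain_queue(); // hypothetical
        m.unlock();
    } else {
        // Somebody else holds the lock and will do that work; instead
        // of spinning or blocking, go do something useful locally.
        // do_local_work(); // hypothetical
    }
}
________________________________________________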

[...]


Chris M. Thomasson

Oct 14, 2011, 8:57:50 PM
"Michael Podolsky" <michael.p...@gmail.com> wrote in message
news:c98061d5-bfd0-4170...@z19g2000vby.googlegroups.com...
> Hi All,
>
> I've discovered I have a very basic misunderstanding of the topic of
> lock-free algorithms.

Read this first:

http://groups.google.com/group/comp.arch/msg/4f39aeb2a3bf6829

And, take note of the author...

[...]


Michael Podolsky

Oct 26, 2011, 11:37:28 AM
On Oct 12, 7:57 pm, David Schwartz <dav...@webmaster.com> wrote:
> On Oct 8, 6:08 pm, Michael Podolsky <michael.podolsky...@gmail.com>
> wrote:

> > But when it comes to using our main tool, the cmpxchg instruction,
> > our code gets a kind of transactional semantics with failing/retrying
> > attempts to modify data. It's as if we are spinning(!)
>
> Not for most real-world algorithms. They typically guarantee that
> spinning is always accompanied by forward progress. If thread A is
> spinning, it's only because other threads are constantly winning. So
> the system as a whole is still making brisk forward progress.
>
> It is a rare algorithm that permits a race to cause neither thread to
> make forward progress. In that case, you are right.

Let me make myself clear.

I did not mean a scenario without progress, i.e. spinning without
progress. There is always progress for "correct" lock-free or
spinlock-based algorithms: in both cases at least one thread makes
progress, while the others may be in a loop making failing attempts
to cmpxchg some value (for lock-free), or failing attempts to lock a
spinlock (which is held by the thread making "progress").
My question, then, was in which respects spinning on attempts to make
cmpxchg in lock-free is better than spinning on attempts to gain a
spinlock.

For a spinlock, btw, the unlock() operation is not expensive, as it
may be implemented, unlike a full-fledged mutex, without calls to the
kernel or scheduler, i.e. the lock is merely a flag that waiters spin
on.

Then, I asked: how is spinning on a cmpxchg in lock-free better than
spinning on such a spinlock?

Thanks, Michael

Chris M. Thomasson

Oct 26, 2011, 5:19:57 PM
"Michael Podolsky" <michael.p...@gmail.com> wrote in message
news:af80f87d-ea93-4cb4...@a7g2000yqd.googlegroups.com...

[...]

> Let me make myself clear.

> I did not mean a scenario without progress, i.e. spinning without
> progress.

[...]

> Then, I asked: how is spinning on a cmpxchg in lock-free better than
> spinning on such a spinlock?

Spinning is pretty bad, period. However, wrt strong CAS (i.e., no
spurious failures) and lock-free algorithms in general, a failure
usually means that the actual shared data was updated by another
thread. Contrast this with a spinlock, where a failure merely means
that the spinlock state was updated. Take the following into account:

lock-free
________________________________________________
#include <atomic>

struct counter
{
    std::atomic<unsigned> m_count; // = 0

    unsigned
    inc_or_relaxed(unsigned incv, unsigned orv)
    {
        unsigned cmp = m_count.load(std::memory_order_relaxed);

        // Each failed CAS reloads `cmp' with the value another
        // thread just installed; then we simply retry.
        while (! m_count.compare_exchange_strong(
                     cmp,
                     (cmp + incv) | orv,
                     std::memory_order_relaxed));

        return cmp;
    }
};
________________________________________________

Each failure of the while loop means that `counter::m_count' was updated.
The "winning" thread is completely finished with the whole operation.


simple spin-lock
________________________________________________
#include <atomic>

void backoff(); // pause/yield helper; implementation elided

struct counter
{
    std::atomic<unsigned> m_splock; // = 0
    unsigned m_count; // = 0

    unsigned
    inc_or_relaxed(unsigned incv, unsigned orv)
    {
        // Spin until we swing the lock word from 0 to 1.
        while (m_splock.exchange(1, std::memory_order_relaxed))
            backoff();
        std::atomic_thread_fence(std::memory_order_acquire);

        unsigned count = m_count;
        m_count = (count + incv) | orv;

        std::atomic_thread_fence(std::memory_order_release);
        m_splock.store(0, std::memory_order_relaxed);

        return count;
    }
};
________________________________________________

Each failure of the while loop means that `counter::m_splock' was updated...
If the spinlock is swung into a "locked" state, then all failing threads
need to wait for that one "winning" thread to update the actual shared state
_and_ put the spinlock back into an "unlocked" state. Also, notice the
memory barriers?


David Schwartz

Oct 27, 2011, 4:00:33 AM
On Oct 26, 8:37 am, Michael Podolsky <michael.podolsky...@gmail.com>
wrote:

> My question, then, was in which respects spinning on attempts to make
> cmpxchg in lock-free is better than spinning on attempts to gain a
> spinlock.

Almost all of them.

While spinning to gain a spinlock, you have to spin until at least the
thread that holds the spinlock manages to release it. That will
require several instructions, typically around a dozen or so. It will
include at least the atomic operation to acquire the spinlock plus
whatever is done under the protection of the lock, which will almost
always require cache ping-ponging because whatever CPU last had the
spinlock also last had whatever shared data there is.

On the other hand, while spinning for another thread to do a cmpxchg,
your typical case is just spinning while the other CPU completes a
single instruction, and you are already trying to get the only cache
line that needs to ping-pong.

Basically, a lock-free algorithm typically moves the heavy lifting of
manipulating shared data structures outside the critical section.
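
A classic example is a lock-free stack push (a sketch, assuming C++11
atomics): the node is allocated and filled in private memory, and the
critical section shrinks to the single CAS that publishes it:

________________________________________________
#include <atomic>

struct node { int value; node* next; };

std::atomic<node*> head{nullptr};

void push(int v)
{
    // Heavy lifting first, on memory nobody else can see yet.
    node* n = new node{v, head.load(std::memory_order_relaxed)};

    // The only contended step: one CAS to publish the node.
    while (!head.compare_exchange_weak(
                n->next, n,
                std::memory_order_release,
                std::memory_order_relaxed))
        ; // a failed CAS re-reads head into n->next; retry
}
________________________________________________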

DS