>How should one answer the question of why exactly lock-free may run
>faster, in user space or in the kernel, wherever?
When there is little or no contention, or on systems with few competing
threads.
Casper
--
Well, the reason is that they do not lock and block.
Each time a thread is blocked at a mutex it loses its current time
slice. Even if it needs to stop for only a few clock cycles, the
scheduler will not immediately reschedule the thread as long as there is
something else to do. This also causes the cache content to become
mostly waste.
A lock-free algorithm never gives up a time slice.
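To illustrate the difference, here is a minimal sketch in C++ with
std::atomic (my own example, not code from this thread):

    #include <atomic>
    #include <mutex>

    std::atomic<long> lockfree_counter{0};
    long locked_counter = 0;
    std::mutex m;

    void lockfree_inc() {
        // Never blocks: at worst the cache line bounces between cores.
        lockfree_counter.fetch_add(1, std::memory_order_relaxed);
    }

    void locking_inc() {
        // May block: if the mutex is taken, the thread can lose its
        // time slice and its warm cache before it is rescheduled.
        std::lock_guard<std::mutex> guard(m);
        ++locked_counter;
    }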
> Of course, there may be some special situations where we may run
> faster with block-free and wait-free implementations using solely
> interlock-xadd like operations.
>
> But as it comes to using our main tool, cmpxchg instruction, our code
> gets a kind of transactional semantics with failing/recurring attempts
> to modify data. It's like we are spinning(!)
> on attempts to modify some data with cmpxchg.
You are right. Cmpxchg is similar to a spin-lock, and it will have
performance comparable to a fast mutex implemented with a spin-lock. But
there are some important differences.
> Now I should ask how
> it differs in terms of performance from using just a spin-lock, so
> that instead of
>
> example 1
> do { oldPtr = read(ptr)
>      prepare newValue of ptr with optional pointed data
> } while (!cmpxchg(&ptr, oldPtr, newValue));
> recycle oldPtr
>
> we write
>
> example 2
> lock(spinlock)
> oldPtr = read(ptr)
> prepare newValue of ptr with optional pointed data
> recycle/delete oldPtr
> ptr = newValue;
> unlock(spinlock)
Compare-exchange comes into play only if the old value is needed to
compute the new value in a non-trivial way, i.e. where no dedicated
machine instruction is available. Most platforms offer trivial
operations like x+value, x-value, x&value etc. directly, without a loop.
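A short sketch of that distinction in C++ (my example; the function
names are made up):

    #include <atomic>

    std::atomic<unsigned> x{0};

    void trivial() {
        x.fetch_add(5);    // single instruction, e.g. lock xadd on x86
        x.fetch_and(0xff); // likewise, no loop required
    }

    void nontrivial() {
        // No machine instruction computes x = x*x + 1 atomically, so
        // the update must be retried until the compare-exchange wins.
        unsigned old = x.load();
        while (!x.compare_exchange_weak(old, old * old + 1))
            ;  // on failure 'old' is reloaded with the current value
    }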
> One may argue that waiting on lock(spinlock) saturates the bus much
> more than rarely called cmpxchg in the example 2, but this is merely a
> question of implementation of efficient spinlock for multicore.
The collision probability scales with the square of the time spent in
one cycle of the spinning loop. So the basic idea is to keep this time
as short as possible. Obviously, spinning on a separate spin-lock
variable and then modifying some other memory location holds the lock
longer than reading the old value and writing the new value back, which
implicitly releases the lock. Compare-exchange simply joins the memory
location of the mutex with the memory location of the data to protect.
This also shows when it makes sense: only as long as you do not protect
more than one memory location with the lock. But keep in mind that
fine-grained locking is a requirement for scalability.
And there is another important difference. The loop condition is moved
from the beginning of the critical section to the end. This has a
significant effect on scheduling issues. See below.
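To make that concrete, here is example 1 written out as a lock-free
stack push in C++ (a sketch only; Node and head are my names, and
memory reclamation for a matching pop is deliberately left out):

    #include <atomic>

    struct Node { int value; Node* next; };
    std::atomic<Node*> head{nullptr};

    void push(int v) {
        Node* n = new Node{v, head.load()};
        // The CAS both detects the collision and publishes the new
        // head: the check sits at the end of the critical work, so a
        // preempted thread makes nobody else spin.
        while (!head.compare_exchange_weak(n->next, n))
            ;  // on failure n->next is reloaded with the current head
    }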
> Supposing that spinning on our spinlock does not saturate the bus and
> just wastes the internal resources of one core (which is also the case
> in example 1), why the code in example 1 should be faster than in
> example 2?
The loop is smaller. Example 1 results in a collision only if at least
two threads are between lines 1 and 3. In example 2 the window extends
from line 1 to line 6. (OK, there is a bug: the delete of oldPtr should
be kept out of this region, as it is in example 1.) And note that the
collision probability scales with the square of this time (and with
even higher powers for multiple collisions).
> Yes, I understand that example 2 might run slowly in some
> circumstances like when a thread which locked a spinlock is preempted
> and other threads start to spin.
This is an important worst case. For this reason most spin-lock
implementations fall back to ordinary blocking, or at least to sleep(0),
after some number of turns.
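A minimal sketch of such a fallback, assuming C++11 (the spin limit of
100 is an arbitrary number of mine):

    #include <atomic>
    #include <thread>

    std::atomic_flag lock_word = ATOMIC_FLAG_INIT;

    void lock() {
        int spins = 0;
        while (lock_word.test_and_set(std::memory_order_acquire))
            if (++spins > 100)             // stop burning the time slice
                std::this_thread::yield(); // roughly what sleep(0) does
    }

    void unlock() {
        lock_word.clear(std::memory_order_release);
    }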
This is where the second advantage of a compare-exchange loop comes into
play. The check condition is at the end of the critical section. This is
like optimistic locking.
Any process is allowed to proceed until it wants to update a value. At
that point the update is rejected if it is not based on the previous
version. This is almost exactly how any reasonable version-control
repository works. (Note that the compare-exchange loop still has a
potential race condition: other threads might have modified the value
and then restored the old value before the loop completes.)
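One common way to close that race is to pair the value with a version
counter that only ever grows, so a restored value still fails the
compare. A sketch, assuming a 64-bit CAS is available (the packing
scheme is my own choice):

    #include <atomic>
    #include <cstdint>

    // value in the low 32 bits, version counter in the high 32 bits
    std::atomic<uint64_t> cell{0};

    void update(uint32_t newValue) {
        uint64_t old = cell.load();
        uint64_t next;
        do {
            // Bump the version so a value that was changed and then
            // restored by other threads no longer compares equal.
            uint64_t version = (old >> 32) + 1;
            next = (version << 32) | newValue;
        } while (!cell.compare_exchange_weak(old, next));
    }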
Checking the condition at the end of the loop has the important
advantage that if a thread is preempted or suspended for some reason,
no other thread will spin. Only the thread that did not do its job in
time pays the bill and takes one extra turn. In contrast, a spin-lock
mutex may spin indefinitely, possibly even in parallel, if only one
thread causes a delay. It simply is not lock-free. For the same reason,
compare-exchange loops cannot cause priority inversion either.
> May a composite lock (spinlock + kernel mutex) be helpful in this case?
This is one workaround. It has the disadvantage that it requires kernel
resources even if they are never needed. It also makes construction and
destruction of the fast mutex expensive, which could be an issue if they
are used in temporaries.
In fact I have not seen such an implementation of a fast mutex so far.
Either you use fast mutexes, thereby asserting that you do not hold the
mutex for long, or you don't, and use kernel mutexes instead.
> Or if we take a pure kernel-level
> implementation where a 'thread' cannot be 'preempted', what about
> saying that example 2 is as good as example 1? Probably not, as Linux
> uses RCU, but what are the exact reasons?
First of all, you incur considerable overhead from the required
switches to kernel context. But a spin-lock is not RCU, so I don't
understand your question.
Marcel
> Of course, there may be some special situations where we may run
> faster with block-free and wait-free implementations using solely
> interlock-xadd like operations.
When either there isn't much contention, or there's nothing else the
system could do that wouldn't contend anyway. Lock-free algorithms
increase contention; that's bad if there's something else the system
could do that wouldn't contend. Locks deschedule contending threads.
That's awful if all threads contend, but awesome if we can schedule a
thread that will run without contention.
> But as it comes to using our main tool, cmpxchg instruction, our code
> gets a kind of transactional semantics with failing/recurring attempts
> to modify data. It's like we are spinning(!)
Not for most real-world algorithms. They typically guarantee that
spinning is always accompanied by forward progress. If thread A is
spinning, it's only because other threads are constantly winning. So
the system as a whole is still making brisk forward progress.
It is a rare algorithm that permits a race to cause neither thread to
make forward progress. In that case, you are right.
> example 1
> do { oldPtr = read(ptr)
>      prepare newValue of ptr with optional pointed data
> } while (!cmpxchg(&ptr, oldPtr, newValue));
> recycle oldPtr
The only 'heavy' instruction here is cmpxchg. Every other operation is
all but free.
> we write
>
> example 2
> lock(spinlock)
> oldPtr = read(ptr)
> prepare newValue of ptr with optional pointed data
> recycle/delete oldPtr
> ptr = newValue;
> unlock(spinlock)
Here, 'unlock' might be very expensive, as it may require unblocking
another thread. That will typically mean that in the contended case,
an excursion into the kernel is needed. Plus, the 'lock' operation
typically costs as much as cmpxchg.
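A sketch of why the contended 'unlock' is expensive, assuming a Linux
futex-based mutex whose word is 0 = free, 1 = locked, 2 = locked with
waiters (the three-state scheme is the usual futex mutex design, not
code from this thread):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <atomic>

    std::atomic<int> word{0};  // 0 = free, 1 = locked, 2 = contended

    void unlock() {
        // Fast path: nobody waits, one atomic store releases the lock.
        if (word.exchange(0) == 2)
            // Contended path: an excursion into the kernel is needed
            // to wake one of the blocked waiters.
            syscall(SYS_futex, &word, FUTEX_WAKE, 1, nullptr, nullptr, 0);
    }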
> One may argue that waiting on lock(spinlock) saturates the bus much
> more than rarely called cmpxchg in the example 2, but this is merely a
> question of implementation of efficient spinlock for multicore.
> Supposing that spinning on our spinlock does not saturate the bus and
> just wastes the internal resources of one core (which is also the case
> in example 1), why the code in example 1 should be faster than in
> example 2?
Because it has exactly one expensive operation, the cmpxchg, and no
horrifically expensive operations, even when it is contended.
> If there are other important contexts, not covered by my examples,
> where lock-free runs faster, I'd be glad to have them pointed out;
> until then let me suppose that what is written in example 1 is a
> rather general pattern of lock-free. So does it run faster than its
> analogue which uses spinlocks?
Yes, but. Running two threads that contend, even if every cmpxchg
succeeds, tends to hammer the bus with lots of cache ping-ponging. So
it will run faster when that contention is unavoidable. For example, if
every ready-to-run thread needs to access the same collection,
lock-free will win out.
> How should one answer the question of why exactly lock-free may run
> faster, in user space or in the kernel, wherever?
Because it avoids the very expensive 'unlock' operation in the
contended case, and because the window in which a conflict is possible
is much narrower.
DS