Linux Kernel Spin Locks


Michael Barker

Jan 9, 2013, 1:56:10 PM1/9/13
to mechanica...@googlegroups.com
This appeared on the concurrency-interest list today:

http://lwn.net/SubscriberLink/531254/027006e5f05701e0/

xeus...@googlemail.com

Jan 12, 2013, 7:41:22 PM1/12/13
to mechanica...@googlegroups.com
Very interesting read. Maybe the following is a foolish idea, but I asked myself why it shouldn't be possible to do the same thing we do on our streets. I mean, without traffic lights, what happens when multiple cars enter the same crossing? Simple: priority to the right, and every driver just watches the car to his right. So why not do the same thing in locking?

We need three variables for this: a pointer to the lock owner [owner], a pointer to the last waiting thread [lastWaiting], and a flag that signals when the lock (crossing) is free [state]. Every thread that enters the lock performs a CAS on [lastWaiting], remembering the value it currently holds (the car coming from the right) and setting it to itself (the current thread). The thread then loops, reading the [owner] pointer until [owner] contains the value it read from [lastWaiting], i.e. until the car that came from the right is currently driving. Now it checks whether [state] is FREE, and as soon as that is the case it first sets [state] to USED and then [owner] to itself (it now uses and owns the crossing). When the thread leaves the lock it simply sets [state] back to FREE (next car, please).

For n threads this should result in a worst case of ((n+1)*n)/2 total CAS operations; each individual thread performs at most n CAS operations. It therefore has a guaranteed worst-case run time and should be suitable for real-time systems. It also preserves the order of the pending threads, so it can be considered fair. Finally, it should reduce cache-coherency traffic, because waiting threads only read the cache line and never write to it until they know they have acquired ownership of the lock; only entering threads CAS the cache line.

So much for the theory. I tried to implement this in Java, but it doesn't seem to work. Below is my implementation; I wonder whether my theory is flawed, the code is buggy, or it is a problem with Java itself. It would be nice to get some help here.

import java.util.concurrent.atomic.AtomicReference;

public class SpinLock {
    private final AtomicReference<Thread> lastWaiting = new AtomicReference<Thread>();
    private volatile Thread owner;

    private static final int FREE = 0;
    private static final int USED = 1;
    private volatile int state;

    public final void leave() {
        state = FREE;
    }

    public final void enter() {
        final Thread currentThread = Thread.currentThread();
        Thread prevThread;
        do {
            prevThread = lastWaiting.get();
        } while (!lastWaiting.compareAndSet(prevThread, currentThread));

        while (true) {
            if (owner == prevThread) {
                if (state == FREE) {
                    state = USED;
                    owner = currentThread;
                    return;
                }
            }

            // this is important: if the thread to the right of us (so in front of us)
            // is running on the same CPU, we need to tell the scheduler to run it instead
            Thread.yield();
        }
    }

    private static volatile long test = 0;
    private static final SpinLock lock = new SpinLock();

    private static final class Test extends Thread {
        public Test(String name) {
            super(name);
        }

        @Override
        public void run() {
            for (int i = 0; i < 1000 * 1000; i++) {
                lock.enter();
                try {
                    if ((++test % 100000) == 0) {
                        System.out.println(Thread.currentThread().getName() + " = " + test);
                    }
                } finally {
                    lock.leave();
                }
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 2; i++) new Test("Test-" + i).start();
    }
}


Adam Browning

Jan 12, 2013, 9:50:17 PM1/12/13
to mechanica...@googlegroups.com
This is what I believe is happening:

1. Thread-0 succeeds the CAS and installs itself as the last waiter
2. Thread-0 sees that the state is FREE, so it installs itself as the owner
3. Thread-0 releases the lock
4. Thread-0 re-enters the lock and succeeds the CAS again to install itself as the last waiter
5. Thread-1 enters the lock and succeeds the CAS to install itself as the last waiter
6. Thread-1 sees that its predecessor, Thread-0, is the owner and that the state is FREE
7. Thread-1 sets the state to USED
8. Thread-1 sets the owner to Thread-1
9. Thread-0 sees the state is USED, so it spins waiting for the owner to be Thread-0 and the state to be FREE, and deadlocks, because the owner can now never be anything but Thread-1

xeus...@googlemail.com

Jan 13, 2013, 6:37:40 AM1/13/13
to mechanica...@googlegroups.com
Thank you very much, that is exactly what happened. So I changed my code to the following, which works well:

import java.util.concurrent.atomic.AtomicInteger;

public class SpinLock {
    private final AtomicInteger ticket = new AtomicInteger();
    private volatile int next; // must be volatile so waiters see the release

    public final void leave(int handle) {
        next = handle;
    }

    public final int enter() {
        final int myId = ticket.getAndIncrement();

        // spin lock without CAS in the wait loop
        while (true) {
            if (myId == next) {
                return myId + 1;
            }

            // this is important: if the thread in front of us is running
            // on the same CPU, we need to tell the scheduler to run it instead
            Thread.yield();
        }
    }
}

Works well for me.
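The handle-passing API is easy to misuse: the value returned by enter() must be handed back to leave(). A minimal self-contained usage sketch of the ticket idea, with `next` made volatile so the release is visible to waiters (the TicketLock/TicketDemo names are illustrative, not from the thread):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Ticket lock in the style posted above: enter() returns the handle
// that must be passed to leave() to admit the next waiter.
class TicketLock {
    private final AtomicInteger ticket = new AtomicInteger();
    private volatile int next; // volatile: leave()'s write must be visible to spinners

    public final int enter() {
        final int myId = ticket.getAndIncrement(); // take a ticket
        while (myId != next) {
            Thread.yield(); // let the ticket holder ahead of us run
        }
        return myId + 1;
    }

    public final void leave(int handle) {
        next = handle; // admit the next ticket in line
    }
}

public class TicketDemo {
    static int counter = 0; // plain field, protected by the lock
    static final TicketLock lock = new TicketLock();

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                int h = lock.enter();
                try {
                    counter++; // critical section
                } finally {
                    lock.leave(h); // always release, even on exception
                }
            }
        };
        Thread a = new Thread(task), b = new Thread(task);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(counter); // expected 200000 if mutual exclusion holds
    }
}
```

The volatile write in leave() followed by the volatile read of next in enter() establishes the happens-before edge that makes the unsynchronized counter increment safe.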

Gil Tene

Jan 13, 2013, 11:45:43 AM1/13/13
to
While interesting as a general subject, and for kernel or hypervisor programmers (something I happen to do), spinlocks in user-mode code have much bigger problems than this. So whenever I see user-mode code with spinlocks in it, I cringe.

Not that you should never-ever-ever-ever use them, but they are one of those things that should come with a huge BEWARE sign. The same kind that blindfolded knife-throwing demonstrations should come with. (So replace "never-ever-ever-ever" with just "never".) I've seen spinlocks used "responsibly" in libraries that usually provide a non-spinlock execution option - some written by people on this list - but I've also seen the effects of people using those libraries without understanding the implications.

As a general practice, the minute you start using spinlocks in user-mode code, you've left the generally-deployable software realm. If you have such spinlocks anywhere in the code you use (including any libraries), you had better have absolute control over what all CPUs in your system are doing, and how they are allocated to user processes and threads, and how many threads can possibly be competing for CPU resources. This control needs to assure that a spinlock-holding thread will never be in a situation where it may be pre-empted while holding the spinlock. Without this control, something as small as a 20msec load spike, or a background cron job perturbing your scheduling assumptions can completely ruin your day, making you wish (from a performance point of view) you didn't have those spinlocks.

The reason is simple: with user-level spinlocks, things will work well and as-expected ONLY as long as your user threads are NEVER interrupted or suspended for any reason that is outside of your control. Any interruption in the execution of a spinlock-holding user thread tends to have cascading bad behavior that collapses and stalls the performance of otherwise super-speedy code.

Spinlocks, as a general concept, start with the assumption that the critical section of code protected by the spinlock will complete in a short amount of time. Based on that assumption, code using spinlocks is making a critical performance tradeoff: it is willing to burn CPU time in one thread while it waits for another to let go of the spinlock. As long as the base assumption holds, the amount of CPU burned remains small, and the tradeoff is worth it. But think of what happens when a spinlock-holding thread "just happens to be" preempted by another thread that needs and deserves some CPU according to whatever scheduling mechanism controls such things. Imagine that this newly-deserving thread (or some other thread) wants to grab the spinlock that is currently held by a runnable-but-not-running thread. No forward progress will occur for any thread waiting for the spinlock for an entire scheduling quantum (typically 4-10 msec), even though all CPUs will appear to be pegged and busy (with threads wanting the spinlock).

The way this is handled in OS kernel code is simple: a spinlock-holding thread will disable preemption while holding the spinlock. It may also disable interrupts, but disabling preemption is the critical part. Unfortunately, user-mode code (almost by definition) has no ability to do the same.

-- Gil.

xeus...@googlemail.com

Jan 13, 2013, 5:00:47 PM1/13/13
to mechanica...@googlegroups.com
Hi Gil,

thank you for that comment, I learned a lot from it. I did a smoke test with my code above and noticed the following:

- As long as I can guarantee that each thread entering the lock runs on its own core and is not suspended in favor of other threads, the lock I wrote above consumes only around 50% of the time that synchronized or ReentrantLock consume (I removed the Thread.yield for that purpose).
- As soon as multiple threads run on one core and try to enter the lock, the performance is disastrous. It seems that each waiting thread gets some time slices, which means a lot of CPU time is burned in senseless looping. This is in addition to the problem you described, which happens if the owner of the lock gets suspended. I guess the senseless context switches are just as bad, but without a Thread.yield to calm the cost of the senseless looping down a bit, things would be much worse.

That is to say, the version above without the Thread.yield call should be very good on systems where it is guaranteed that each thread competing for the lock runs on its own core and is never suspended, but it is disastrous in normal environments. In a normal environment, and especially if there is a hypervisor, this seems to be a very bad idea.

I wonder if I can improve this so that the thread leaving the lock informs the next waiting thread. Let's assume the thread that wants to enter the lock calls [park] instead of looping, and the thread leaving the lock then calls [unpark] on the next waiting thread; shouldn't that solve the problem (apart from the fact that I currently have no clue how to implement that)? Do you know what the cost of park/unpark is? I mean, I never liked to poll; interrupting always seemed like a better solution to me.
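One way a park/unpark handoff is often sketched is the "spin-then-block" lock: spin a bounded number of times on the assumption the holder is running, then enqueue and park until a release wakes us. This is a minimal illustration only, not a production lock; the class names, the SPIN_LIMIT constant, and the queue bookkeeping are all assumptions of this sketch, not anything from the thread:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Spin-then-block lock: bounded spinning, then park; unlock unparks a waiter.
// Avoids burning a whole scheduling quantum when the holder is preempted.
class AdaptiveLock {
    private static final int SPIN_LIMIT = 1000; // arbitrary tuning knob
    private final AtomicBoolean locked = new AtomicBoolean();
    private final ConcurrentLinkedQueue<Thread> waiters = new ConcurrentLinkedQueue<>();

    public void lock() {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (locked.compareAndSet(false, true)) return; // fast path
        }
        // Spin budget exhausted: enqueue ourselves, then park between retries.
        Thread self = Thread.currentThread();
        waiters.add(self);
        while (!locked.compareAndSet(false, true)) {
            LockSupport.park(this); // returns immediately if unpark came first
        }
        waiters.remove(self);
    }

    public void unlock() {
        locked.set(false);
        Thread next = waiters.peek();
        if (next != null) LockSupport.unpark(next); // wake one waiter to retry
    }
}

public class AdaptiveDemo {
    static int counter = 0; // plain field, protected by the lock

    public static void main(String[] args) throws InterruptedException {
        AdaptiveLock lock = new AdaptiveLock();
        Runnable task = () -> {
            for (int i = 0; i < 50_000; i++) {
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
            }
        };
        Thread[] ts = new Thread[4];
        for (int i = 0; i < 4; i++) (ts[i] = new Thread(task)).start();
        for (Thread t : ts) t.join();
        System.out.println(counter); // expected 200000
    }
}
```

Note that park/unpark's one-permit semantics are what make the CAS-then-park loop safe here: an unpark that races ahead of the park is not lost, it just makes the park return immediately.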

Gil Tene

Jan 13, 2013, 7:49:30 PM1/13/13
to mechanica...@googlegroups.com
There are various backoff tricks people use (yield, park/unpark, direct use of futexes, etc.). Most of the blocking variants devolve into being equivalent to a POSIX lock or mutex, and the non-blocking ones always have both fall-off-a-cliff spin-cost and unfairness issues. When the only thing using your CPUs is threads contending for the exact same specific "spinlock" instance, various ordering tricks within that lock instance can get you past the unfairness issues. But in the real world, you end up with other things using and competing for the same CPUs (like other lock instances, or just other CPU consumers within or outside of your process). The spinlock-backoff yield tricks usually fall prey to those, usually limiting progress through the spinlock to something worse than single-threaded behavior, and letting other, unrelated code past you all the time (basically, you end up giving up quantums way too easily).

My rule of thumb is this: don't use spinlocks in user-mode code. Instead, try (very hard) to keep your code non-blocking and lock-free. Make it wait-free if you can help it, too. Atomics are fine, especially the various CAS- and SWAP-based things you can do for shared lock-free data structures. Just don't use them for locking, as in: don't use them to protect a critical section of code without a blocking backoff. Only the OS layer closest to the hardware can actually spinlock reliably (and yes, guest OSs that run on hypervisors can often run into this same issue - it's a common hypervisor performance pitfall).
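The classic textbook example of the kind of CAS-based lock-free structure described here is the Treiber stack: a single AtomicReference head that push and pop retry-CAS, so no thread ever spins waiting on another thread's critical section. A minimal sketch (the class name is mine, not something from this thread):

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal lock-free (Treiber) stack: CAS on the head pointer replaces
// locking a critical section, so a preempted thread never blocks others.
public class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    public void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            newHead = new Node<>(value, oldHead); // new node points at old head
        } while (!head.compareAndSet(oldHead, newHead)); // retry on contention
    }

    public T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) return null; // empty stack
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }

    public static void main(String[] args) {
        TreiberStack<Integer> s = new TreiberStack<>();
        s.push(1); s.push(2); s.push(3);
        System.out.println(s.pop() + " " + s.pop() + " " + s.pop()); // prints 3 2 1
    }
}
```

A failed CAS here means some other thread made progress, which is exactly the lock-free guarantee: the system as a whole never stalls on one suspended thread.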