RFC: lock-free read-write locks

Ronald Landheer-Cieslak

unread,

Jul 27, 2004, 6:05:38 PM7/27/04

to

Hello all,

The following link contains an explanation and pseudo-code for lock-free
rw-locks. I'd be interested in comments on this (especially, if
possible, before I start implementing).

http://cvs.sourceforge.net/viewcvs.py/jail-ust/libthread/how_rw_locks_work.txt?view=markup

Basically, the proposed rw-locks consist of two queues and two lists:
the queues are a general waiters queue and a writers queue, the lists
are both readers lists. One of the lists contains waiting (& sleeping)
readers whereas the other contains working ones. The general idea is
that the lock can be passed by either one writer or any number of
readers, but a new reader won't be allowed to pass if there's a writer
waiting, thus avoiding to starve the writers. When a writer is done, all
waiting readers are allowed to continue, thus also avoiding to starve
readers.

One design choice that may be important is that arriving threads may get
the job of scheduling other threads: i.e. the first thread to arrive
schedules itself and any thread that arrives in the general waiting
queue while it schedules (i.e. between the times it look whether the
queue is empty). The intent is to avoid context switches, but it is
possible that the thread that winds up scheduling everybody starves
itself (though I think that is extremely unlikely). It would be rather
trivial to adapt the algorithm to impose a maximum of other threads to
schedule to prevent this type of starvation, though.. It would be
slightly less trivial to have every thread schedule itself, which would
mean more context switches but less chance of starvation..

Comments?

rlc

SenderX

unread,

Aug 1, 2004, 6:41:41 PM8/1/04

to

> The following link contains an explanation and pseudo-code for lock-free
> rw-locks. I'd be interested in comments on this (especially, if
> possible, before I start implementing).
>
>
http://cvs.sourceforge.net/viewcvs.py/jail-ust/libthread/how_rw_locks_work.txt?view=markup

I will study this one. The only thing I can think of right now is that you
seem to be constructing you own waitsets. Should a waitset be a single
kernel object? SCHED_OTHER...

I know for a fact that you can do lock-free rw-mutex with 2 kernel objects
and double wide cas on 32-bit cpu, or normal cas on a 64-bit cpu.

The shared data looks like this:

typedef struct ac_RwMutexShared_
{
unsigned __int16 Reads;
unsigned __int16 Writes;
unsigned __int16 ReadWaiters;
unsigned __int16 WriteWaiters;

} ac_RwMutexShared;

The algo is simple because a cas can atomically modify the entire shared
rw-mutex state. You could omit the Writes and the WriteWaiter fields. Just
set a special bit to indicate writes and/or write waiters.

P.S.

Joe has mentioned a rw-spinlock that does not suffer from any starvation
issues... interesting.

Ronald Landheer-Cieslak

unread,

Aug 2, 2004, 10:52:17 AM8/2/04

to

SenderX wrote:

>>The following link contains an explanation and pseudo-code for lock-free
>>rw-locks. I'd be interested in comments on this (especially, if
>>possible, before I start implementing).
>>
>>
>
> http://cvs.sourceforge.net/viewcvs.py/jail-ust/libthread/how_rw_locks_work.txt?view=markup
>
> I will study this one.

Thanks :)

> The only thing I can think of right now is that you
> seem to be constructing you own waitsets. Should a waitset be a single
> kernel object? SCHED_OTHER...

Part of the goal is to not depend on the kernel to handle the ordering
of waiting threads - mostly because kernels can't be generally relied
upon to follow the same semantics - i.e. the ones I want.

> I know for a fact that you can do lock-free rw-mutex with 2 kernel objects
> and double wide cas on 32-bit cpu, or normal cas on a 64-bit cpu.

OK (interested..), but could you also guarantee fairness?

I.e. the major drawback of the algorithm as I wrote it in the link
above, is that a scheduling thread can be starved - but only up to a
certain point: once it has scheduled itself (which is the first thing it
does), it is sure to go through the lock at a given time, as long as the
number of threads available in the system it not infinite: at some
moment it will start blocking the progress of all other threads in
favour of itself.

There's also the possibility to slightly adapt the algorithm and make
every thread schedule itself - but that would add a lot of context
switches which I think would be more expensive that the current setup..

> The shared data looks like this:
>
> typedef struct ac_RwMutexShared_
> {
> unsigned __int16 Reads;
> unsigned __int16 Writes;
> unsigned __int16 ReadWaiters;
> unsigned __int16 WriteWaiters;
>
> } ac_RwMutexShared;
>
> The algo is simple because a cas can atomically modify the entire shared
> rw-mutex state. You could omit the Writes and the WriteWaiter fields. Just
> set a special bit to indicate writes and/or write waiters.

> Joe has mentioned a rw-spinlock that does not suffer from any starvation
> issues... interesting.
There's a spinning starvation-free rw-lock described here:
http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf

I haven't read the article in detail yet (just found it by typing "rw
spinlock starvation Joe Seigh" on Google), but it looks interesting and
the rw-lock they propose is pretty straight-forward, though perhaps not
very efficient..

rlc

Joe Seigh

unread,

Aug 2, 2004, 11:54:28 AM8/2/04

to

Ronald Landheer-Cieslak wrote:
>
> > Joe has mentioned a rw-spinlock that does not suffer from any starvation
> > issues... interesting.
> There's a spinning starvation-free rw-lock described here:
> http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf
>
> I haven't read the article in detail yet (just found it by typing "rw
> spinlock starvation Joe Seigh" on Google), but it looks interesting and
> the rw-lock they propose is pretty straight-forward, though perhaps not
> very efficient..
>

This is code I've posted before. The fetch and op functions are simulated
for the purposes of illustration and you use an interlocked implementation
to make them atomic.

#define EXCL 1
#define SHRD 0x10000
#define fa(n, i) ((*(n)+=(i))-(i))
#define fi(n) ((*(n))++)

volatile int n; /* next available ticket */
volatile int c; /* current request */

volatile int x; /* exclusive */
volatile int s; /* shared */

int main() {
int w; /* wait ticket */

/*-------------------------------------------------------------------*/
/* shared/exclusive access using fetch-and-add */
/*-------------------------------------------------------------------*/
n=0;c=0; /* initialize */

w=fa(&n, EXCL); /* request exclusive access */
while(w!=c); /* wait for it */
/* critical region */
fa(&c, EXCL); /* release exclusive access */

w=fa(&n, SHRD); /* request shared access */
while((w%SHRD)!=(c%SHRD)); /* wait for it */
/* critical region */
fa(&c, SHRD); /* release shared access */

/*-------------------------------------------------------------------*/
/* shared/exclusive access using fetch-and-increment */
/*-------------------------------------------------------------------*/
n=0;x=0;s=0; /* initialize */

w=fi(&n); /* request (shared) access */
while(w!=s); /* wait for it */
fi(&s); /* release shared */
/* critical region */
fi(&x); /* release exclusive */

w=fi(&n); /* request (exclusive) access */
while(w!=x); /* wait for it */
/* critical region */
fi(&s); /* release shared */
fi(&x); /* release exclusive */

return 0;
}

They're bakery algorithms with FIFO service order. In the first exclusive requests wait for
*all* prior accesses to complete. Shared requests just wait for prior exclusive accesses
to complete. Number of waiting threads has to be less than the field sizes. The algorithm
works even with wrap and carry out from the exclusive access field into the shared access field.

The second version is if you just have fetch and increment. The shared requests have to
unblock other waiting shared requests as they enter the critical section so it's a little
less efficient.

You can optimize this somewhat by using a power of 2 to define the field size and logically
anding to extract a field value, using a store w/ membar to release exclusive access on the
first one.

Bakery spinlocks may be more efficient than normal spinlocks since they acquire their wait
ticket value first and wait rather than waiting and then all waiting threads attempting to
acquire the lock all in the same instant. The interlocked updates from ticket requests
are more spread out and cause less contention on the cache.

Joe Seigh

Ronald Landheer-Cieslak

unread,

Aug 2, 2004, 3:40:37 PM8/2/04

to

Joe Seigh wrote:
>
> Ronald Landheer-Cieslak wrote:
>
>>> Joe has mentioned a rw-spinlock that does not suffer from any
>>> starvation issues... interesting.
>>
>> There's a spinning starvation-free rw-lock described here:
>> http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf
>>
>> I haven't read the article in detail yet (just found it by typing
>> "rw spinlock starvation Joe Seigh" on Google), but it looks
>> interesting and the rw-lock they propose is pretty
>> straight-forward, though perhaps not very efficient..
> This is code I've posted before. The fetch and op functions are
> simulated for the purposes of illustration and you use an interlocked
> implementation to make them atomic.

[snipped the code]

> They're bakery algorithms with FIFO service order. In the first
> exclusive requests wait for *all* prior accesses to complete. Shared
> requests just wait for prior exclusive accesses to complete. Number
> of waiting threads has to be less than the field sizes. The
> algorithm works even with wrap and carry out from the exclusive
> access field into the shared access field.

Now there's an elegant solution :)
Mind if I borrow it (i.e. add it to my libthread library)? (with proper
references, of course)

> The second version is if you just have fetch and increment. The
> shared requests have to unblock other waiting shared requests as they
> enter the critical section so it's a little less efficient.

That's actually a lot more like the one I had previously "whipped up":
http://tinyurl.com/6mrwp. Neither very efficient nor very fair, but for
the purpose the Hash was initially written for, it works (actually: it
initially wasn't written to be thread-safe at all..)

Your version (with fetch-and-increment) at least has the merit of being
fair ;)

AFAIK, a fetch-and-add "instruction" can easily be implemented with a
CAS: i.e.

int32_t fetch_and_add(int32_t * i, int32_t add)
{
int32_t retval = *i;
int32_t newval;

do
{
newval = retval + add;
} while (compare_and_exchange(&retval, i, newval) != 0);

return retval;
}
should do the trick.. "add" can, of course, be negative :)

[snip logical anding..]

> Bakery spinlocks may be more efficient than normal spinlocks since
> they acquire their wait ticket value first and wait rather than
> waiting and then all waiting threads attempting to acquire the lock
> all in the same instant. The interlocked updates from ticket
> requests are more spread out and cause less contention on the cache.

Yup, but bakery or no, it's still a busy-waiting lock, which means it
will only work well on multi-processors and no threads go to sleep when
waiting.

It is an elegant solution, though, and a nice way to implement spinlocks
in general (and rw-spinlock in particular ;) )

If you don't mind, I'll add a slighly modified version of your code to
libthread (i.e. modified to do what your code does in different
functions and use the above fetch-and-add).

Thanks,

rlc

SenderX

unread,

Aug 2, 2004, 5:43:03 PM8/2/04

to

> They're bakery algorithms with FIFO service order. In the first exclusive
requests wait for
> *all* prior accesses to complete. Shared requests just wait for prior
exclusive accesses
> to complete. Number of waiting threads has to be less than the field
sizes. The algorithm
> works even with wrap and carry out from the exclusive access field into
the shared access field.

Nice. Very nice...

I am wondering about the performance of a writer spin-waiting for a bunch of
readers... Should the writer block after a couple of spins?

I need to bench this against my cas version that will block threads after a
couple of spins...

Joe Seigh

unread,

Aug 2, 2004, 7:31:10 PM8/2/04

to

> It is an elegant solution, though, and a nice way to implement spinlocks
> in general (and rw-spinlock in particular ;) )
>
> If you don't mind, I'll add a slighly modified version of your code to
> libthread (i.e. modified to do what your code does in different
> functions and use the above fetch-and-add).
>

Go ahead. The algorithm is in the public domain. It was published in
SIGOPS Operating Systems Review in April 90 as a distributed algorithms
with a mention of the basic algorithm as well. Plus I've posted the
code example on usenet serveral times since then.

I forgot to mention. You can tweek it to be a promotable lock. I've
posted an example of the logic for that here

http://groups.google.com/groups?selm=1998Apr7.093117%40bose.com

It has the advantage of the upgrade doesn't have to be made
conditional.

Joe Seigh

unread,

Aug 2, 2004, 7:41:29 PM8/2/04

to

Well, it's a spin lock. Not really meant to be the other kind. If you add in
blocking, you'd need a blocking mechanism that would handle the service order
correctly.

You'd probably need the futex equivalent of an eventticket mechanism. But as
soon as you but in blocking the adaptive locks which allow queue jumping
will perform better under some conditions.

Joe Seigh

unread,

Aug 3, 2004, 8:01:58 AM8/3/04

to

Joe Seigh wrote:
>
> I forgot to mention. You can tweek it to be a promotable lock. I've
> posted an example of the logic for that here
>
> http://groups.google.com/groups?selm=1998Apr7.093117%40bose.com
>
> It has the advantage of the upgrade doesn't have to be made
> conditional.

Wrong link. I copied the wrong link from my bookmark file. That should be
http://groups.google.com/groups?selm=3C231C0A.6AB363D9%40genuity.com

Joe Seigh

unread,

Aug 7, 2004, 11:33:41 AM8/7/04

to

Joe Seigh wrote:
>
> SenderX wrote:

> >
> > I am wondering about the performance of a writer spin-waiting for a bunch of
> > readers... Should the writer block after a couple of spins?
> >
> > I need to bench this against my cas version that will block threads after a
> > couple of spins...
>
> Well, it's a spin lock. Not really meant to be the other kind. If you add in
> blocking, you'd need a blocking mechanism that would handle the service order
> correctly.
>
> You'd probably need the futex equivalent of an eventticket mechanism. But as
> soon as you but in blocking the adaptive locks which allow queue jumping
> will perform better under some conditions.
>

Actually, I'm not sure what queue jumping would by you in a read-write lock.
The fact that you are using a read-write lock means that contention is an
issue, otherwise you'd be using a mutex which has less overhead than rwlocks in most
implementations. So going with a FIFO service order may be an optimal strategy.

Joe Seigh