RCU for preemptive user threads

Joe Seigh

unread,

Jun 2, 2004, 1:50:25 PM6/2/04

to

At this point I haven't heard anything for a while so I assume
the project is DOA. So maybe not sooner than you think.

Some backgound.

If you have a preemptive environment where there is no way
to prevent preemption is a critical region (in preemptive kernels
you can mask interrupts), then what you have to do is only
count voluntary preemption as a quiesce point. The problem with
this is the quiesce points can have arbitrary periodicity In a non--
preemptive environment the periodicity was a timeslice. This makes
it difficult if you want to make the RCU write side garbage collection
achieve a certain performance level. The write side waits can be
arbitrarily long due to threads being involuntarily suspended for long
periods due to low priority or threads being compute bound and doing
little or no voluntary preemption.

If you don't have a problem with this, then you could do user space
RCU by polling thread getrusage info from /proc for the voluntary
context switch counts and for the thread run state.

I had suggested last August that if you could arrange for interrupt in
in a read critical region to restart at the start of the critical region then
by definition no interrupt would be in the middle of a critical region and
you could count involuntary context switches as a quiesce point and
not have the problems that just counting voluntary context switches
has.

Basically, the mechanism for this would have been for the kernel to
make a signal pending on an involuntary context switch. A setjmp
buffer would be set at the start of the critical region and the signal
handler would longjmp on that if it detected thread was in the critical
region.

But it turns out you don't need to do this in the kernel if you use
information from the thread /proc entries to decide if you need to
signal a thread. If you see a thread not doing voluntary context
switching then you signal it and then you can make certain
inferences from the next involuntary context switch. If gets a little
complicated. You have to decide if the signal was delivered or is
still pending. If the thread suspended at least once after the signal
then you can count it as a quiesce point that one time.

So basically, you can do this on any unix that has the right thread
info in /proc. Solaris has it. The only problem is that IBM has
all the RCU polling techniques patented, so there is no point in
doing it since you wouldn't know whether you could use it or not.

Joe Seigh

unread,

Jun 2, 2004, 7:43:46 PM6/2/04

to

The way this would look in the user application would be to have a
pointer to the setjmp buffer defined in thread local storage so the
signal hander could find it. This would normally be null. The code
around a read critical region would look like

setjmp(bufptr->jmpbuf);
rcubufptr = bufptr;
... // critical region
rcubufptr = NULL;

If an interrpt occurs in the critical region, the signal handler
does a longjmp to restart critical region. This means the code
in the critical region needs to be restartable which is a major
difference from kernel RCU. Basically in most cases this means
read only access.

This is a problem if the code needs to travers a data structure,
such as a linked list, and return a reference to a node. Kernel
RCU will typically do a interlocked increment of a reference count.
In user space RCU there would be race condition between the store
from the interlocked instruction and store setting rcubufptr to
NULL.

The way to deal with this is have the signal handler if it sees
rcubufptr not NULL copy the ucontext_t into a buffer supplied
along with the setjmp buffer and to set rcpbufptr to NULL to
prevent the ucontext_t info from being overwritten before it
can be examined. The read critcal region logic in this case
would look like

if (setjmp(bufptr->jmpbuf) == 0 || restart_needed(bufptr->ucontext)) {
rcubufptr = bufptr;
... // critical region
rcubufptr = NULL;
}

restart_needed() would be platform dependent code to determine from the machine
context whether the interlocked store succeeded or not (needs to look at instruction
pointer address and condition code registers). But at least the platform
dependent stuff would be decoupled from the base userspace RCU implemntation.

Unfortunately we need to use setjmp because there isn't any support for an
efficient C exception and stack unwinding mechanism that would have less overhead
in entering what is in effect an exception block. Just saving the instruction
pointer and restoring it is not an option in C. In assembler yes but not C.

Fortunately setjmp doesn't appear to be that bad, it about as expensive as an inline
compare and swap. A lock would have at least one interlocked instruction with
pipeline destroying memory barriers and with performance killing cache hits.

There's also the pthread_getspecific which I haven't measured perfmance of but
you could define rcubufptr directly in the thread local storage (tls.h in linux)
if you had to.

Joe Seigh

SenderX

unread,

Jun 3, 2004, 2:26:32 AM6/3/04

to

> But it turns out you don't need to do this in the kernel if you use
> information from the thread /proc entries to decide if you need to
> signal a thread. If you see a thread not doing voluntary context
> switching then you signal it and then you can make certain
> inferences from the next involuntary context switch. If gets a little
> complicated. You have to decide if the signal was delivered or is
> still pending. If the thread suspended at least once after the signal

> then you can count it as a quiescent point that one time.

>
> So basically, you can do this on any unix that has the right thread
> info in /proc. Solaris has it. The only problem is that IBM has
> all the RCU polling techniques patented, so there is no point in
> doing it since you wouldn't know whether you could use it or not.

;(...

Your stuff should work fine.... Joe's RCU User-Space could be a nice API for
any serious _nix OS to expose to experienced parallel programmers?

There are other ways to allow for 1,000's of threads to write and read in
parallel, in the exact same bucket of a collection, using a modified version
of my proxy gc, or your atomic_ptr logic( really ). For every 65535( user
can adjust at runtime ) concurrent deletions ( not pushes, "only" pops )
from a collection, you may get some reader threads blocked on a lock-free
semaphore. But, hey... That's better than using a mutex to manage every push
and pop! It does not have any problems with freeing resources, and it works
on 64-bit systems using simple offsets, like atomic_ptr would have to use.

I tracked hundreds-of-thousands of pushes and pops, performed by 1024
threads in NPTL and Windows threads. You may get 450-500 threads blocked on
a sem_t/HANDLE for rather quick intervals, per hundred-thousand or so
operations. I wanted to create a proxy gc that can handle many lock-free
writes in the presence of worst case read contention, and not run out of
memory...

Damn, the proxy gc algos are sooo clean, and simple...

Not sure if IBM owns this beauty already!!!!!!!!!!@?!?!?@!?@!?@!?@

DOH! $#%@

;(...

Joe Seigh

unread,

Jun 6, 2004, 4:02:53 PM6/6/04

to

How a possible reader side implementation might look.
No error checking in this example at this time.

pthread_key_t rcu_restart_key;

// the data structure to communicate between the program and the
// signal handler
typedef struct {
sig_atomic_t flag;
jmp_buf buf;
ucontext_t context;
} rcu_restart_t;

// signal handler
void rcu_restart(int sig, siginfo_t *info, void *xxx) {
rcu_restart_t *restart;

restart = (rcu_restart_t *)pthread_getspecific(rcu_restart_key);

if (restart->flag == 0)
return;

else if (restart->flag == 2)
memcpy(&(restart->context), xxx, sizeof(ucontext_t));

restart->flag = 0;
longjmp(restart->buf, 1);
}

// DCL like initialization for each thread. probably should be inlined in part
rcu_restart_t * rcu_restart_init() {
rcu_restart_t *restart;
struct sigaction sa;

if ((restart = (rcu_restart_t *)pthread_getspecific(rcu_restart_key)) == NULL) {
// initialize TSD (per thread)
restart = (rcu_restart_t *)malloc(sizeof(rcu_restart_t));
restart->flag = 0; // before setting signal handler

pthread_setspecific(rcu_restart_key, (void *)restart);

// initialize RCU signal handler (per thread)
sa.sa_sigaction = &rcu_restart; // signal handler
sigemptyset(&sa.sa_mask); // no signals masked
sa.sa_flags = SA_SIGINFO;

sigaction(SIGUSR2, &sa, NULL); // abort if sighandler previously set
}

return restart;
}

// a user suppplied routint to check if interlocked update succeeded or not
int needsRestart(ucontext_t *context) {
return 1; // or 0 depending
}

// in the threads

rcu_restart_t *restart;

// read only critical region
restart = rcu_restart_init();
setjmp(restart->buf);
restart->flag = 1;
// ... critical region
restart->flag = 0;

// read only w/ interlock critical region
restart = rcu_restart_init();
if (setjmp(restart->buf) == 0 || needsRestart(&(restart->context)) != 0) {
restart->flag = 2;
// ... critical region
restart->flag = 0;
}

Note the flag can be any data type that is atomic w.r.t. signals. Needs to be volatile
probably to prevent it from being moved by compiler optimization. The flag is actually
just being used as an efficient way to enable and disable signals. You could use a sig
function but it would not likely be as efficient.

pthread_getspecific is probably not too bad performance wise. A special define in tls
would be better and get it down to one or two load instructions.

All the logic here just has to be async safe. RCU is designed to not require synchronization
on the reader side for performance. No cache thrashing by modifying shared storage and no
serialization to destroy pipelined processor performance.

Signals don't have to be reliable, ie. no more than one has to be pending. Additional signals
may be dropped if one is already pending. I arbitrarily picked SIGUSR2 for this example.
I haven't researched which signal would work best.

And yes, TSD key is assumed to be initialized properly.

Joe Seigh

unread,

Jun 7, 2004, 7:36:38 AM6/7/04

to

The signal hander would be per process so I'll have to move the sigaction
to a once per process initialization section (where the TSD key would be
initalized also). I generally don't mess with signals too much since they're
generally useless for synchronization. This is the first real use I've
come up with for signals. The sigaction in the thread should probably be
replaced by pthread_sigmask to enable the signals for the thread and
some logic to intialize the TSD in the signal handler also since the
threads can inherit a signal mask with the signal already enabled.

I'll repost revised code later when I get the time. I'm not posting
the code as working code. The reason I'm posting it as code rather than
just writing about it is that getting accross the concepts is just a
little less ambiguous that way.

Joe Seigh

unread,

Jun 8, 2004, 9:16:50 AM6/8/04

to

The read side would look something like this only not like this

#include <stdlib.h>
#include <pthread.h>
#include <signal.h>
#include <setjmp.h>
#include <ucontext.h>
#include <errno.h>

pthread_key_t rcu_restart_key;

//
// thread local data
//

typedef struct {
sig_atomic_t flag;
jmp_buf buf;
ucontext_t context;
} rcu_restart_t;

//
// init thread local data
//

rcu_restart_t * rcu_restart_init() {
rcu_restart_t *restart;

sigset_t mask;

if ((restart = (rcu_restart_t *)pthread_getspecific(rcu_restart_key)) == NULL) {
// initialize TSD (per thread)
restart = (rcu_restart_t *)malloc(sizeof(rcu_restart_t));;
restart->flag = 0;

pthread_setspecific(rcu_restart_key, (void *)restart);

// enable signal for thread
setemptyset(&mask);
sigaddset(&mask, SIGUSR2);

pthread_sigmask(SIG_UNBLOCK, &mask, NULL);
}

return restart;
}

//
// signal handler
//

void rcu_restart(int sig, siginfo_t *info, void *xxx) {
rcu_restart_t *restart;

if ((restart = (rcu_restart_t *)pthread_getspecific(rcu_restart_key)) == NULL)
return;

if (restart->flag == 0)
return;

else if (restart->flag == 2)
memcpy(&(restart->context), xxx, sizeof(ucontext_t));

restart->flag = 0;
longjmp(restart->buf, 1);
}

//
// process init
//
void rcu_init() {
struct sigaction sa;

// initialize RCU TSD key (process)
pthread_key_create(&rcu_restart_key, &free);

// initialize RCU signal handler

sa.sa_sigaction = &rcu_restart; // signal handler
sigemptyset(&sa.sa_mask); // no signals masked
sa.sa_flags = SA_SIGINFO;

sigaction(SIGUSR2, &sa, NULL);
}

//
// arbitrary interlocked restart routine
//

int needsRestart(ucontext_t *context) {
return 1;
}

// main process somewhere

struct sigaction sa;
rcu_restart_t *restart;

// initialize RCU
rcu_init();

// in the threads somewhere

// read only critical region
restart = rcu_restart_init();
setjmp(restart->buf);
restart->flag = 1;
// ... critical region
restart->flag = 0;

// read only w/ interlock critical region
restart = rcu_restart_init();
if (setjmp(restart->buf) == 0 || needsRestart(&(restart->context)) != 0) {
restart->flag = 2;
// ... critical region
restart->flag = 0;
}

//---

There's sort of a problem in that pthread_getspecific is not async signal safe
which means that you have to have control of the signal masks when threads
are created and you don't necesarily have control of all thread creation (well
you do, you just can't used any libraries including libpthread). We need some
kind of pthread_atcreate function.

Don't you just love it when C/C++ pretends threads don't exist and pthreads pretends
signals don't exist. An then when you have to create an ad hoc and by definition
"non-portable" mechanism to deal with it, all the various standards twits come out
and complain that you are not following the rules.

Joe Seigh

unread,

Jun 8, 2004, 5:35:30 PM6/8/04

to

I wrote:
> There's sort of a problem in that pthread_getspecific is not async signal safe
> which means that you have to have control of the signal masks when threads
> are created and you don't necesarily have control of all thread creation (well
> you do, you just can't used any libraries including libpthread). We need some
> kind of pthread_atcreate function.
>

Interesting problem with an even more interesting solution. Also solves another
problem that I didn't mention. And yes, it does not follow the rules. :)

Joe Seigh

unread,

Jun 9, 2004, 2:54:03 PM6/9/04

to

We can go with a more conventional solution for the async problem since
the threads most likely have to register to establish a mapping to translate
the kernel thread id from /proc to a pthread id mapping for pthread_kill().

I don't see anything that says TSD deallocation after thread_exit has to
be synchronous so I might have to deal with it being synchronous, i.e.
the reuse of thread ids before old thread unregistered.

The RCU polling can be as simple or complicated as you want it depending
on how well you understand RCU and how the /proc entrails augury works
out.

#include <stdlib.h>
#include <pthread.h>
#include <signal.h>
#include <setjmp.h>
#include <ucontext.h>
#include <errno.h>

pthread_key_t rcu_restart_key;

//
// RCU thread data
//
typedef struct {
// thread local data

sig_atomic_t flag;
jmp_buf buf;
ucontext_t context;

int qcount; // optional quiesce state counter

// RCU thread polling data
// (accessed by separate polling thread so separate cache line
// wouldn't hurt)

// TBD ...

} rcu_restart_t;

//
// unregister thread and delete local data
//
void rcu_unregister(void *restart) {
// unregister (may use mutex)

// deallocate tls
free(restart);
}

//
// init thread local data
//
rcu_restart_t * rcu_restart_init() {
rcu_restart_t *restart;
sigset_t mask;

if ((restart = (rcu_restart_t *)pthread_getspecific(rcu_restart_key)) == NULL) {
// initialize TSD (per thread)
restart = (rcu_restart_t *)malloc(sizeof(rcu_restart_t));
restart->flag = 0;

pthread_setspecific(rcu_restart_key, (void *)restart);

// enable signal for thread
setemptyset(&mask);
sigaddset(&mask, SIGUSR2);

pthread_sigmask(SIG_UNBLOCK, &mask, NULL);

// Register this thread with RCU polling mechanism
// Signals will not get sent to this thread until
// after this is done, so pthread_getspecific in
// signal handler will be async safe. This may
// (but not necessarily) use mutexes so this function
// itself is not async safe and cannot be used in signal
// handlers.

// register_thread_for_rcu_polling();
}

return restart;
}

//
// signal handler
//
void rcu_restart(int sig, siginfo_t *info, void *xxx) {
rcu_restart_t *restart;

restart = (rcu_restart_t *)pthread_getspecific(rcu_restart_key);

// possible explicit RCU quiese point counter incremented
// here if necessary, i.e. /proc doesn't provide enough
// data or function.
// restart->qcount++;

if (restart->flag == 0)
return;

else if (restart->flag == 2)
memcpy(&(restart->context), xxx, sizeof(ucontext_t));

restart->flag = 0;
longjmp(restart->buf, 1);
}

//
// process init
//
void rcu_init() {
struct sigaction sa;

// initialize RCU TSD key (process)

pthread_key_create(&rcu_restart_key, &rcu_unregister);

// initialize RCU signal handler
sa.sa_sigaction = &rcu_restart; // signal handler
sigemptyset(&sa.sa_mask); // no signals masked
sa.sa_flags = SA_SIGINFO;

sigaction(SIGUSR2, &sa, NULL);

// initialize RCU polling data and mechanism

}

//
// arbitrary interlocked restart routine
//
int needsRestart(ucontext_t *context) {
return 1;
}

// --

struct sigaction sa;
rcu_restart_t *restart;

// initialize RCU
rcu_init();

// --

// read only critical region
restart = rcu_restart_init();
setjmp(restart->buf);
restart->flag = 1;
// ... critical region
restart->flag = 0;

// read only w/ interlock critical region
restart = rcu_restart_init();
if (setjmp(restart->buf) == 0 || needsRestart(&(restart->context)) != 0) {
restart->flag = 2;
// ... critical region
restart->flag = 0;
}

Joe SEigh

Joe Seigh

unread,

Jun 9, 2004, 6:16:57 PM6/9/04

to

We now return you to your regularly scheduled programming.
And don't hold your breath. :)

Joe Seigh

unread,

Dec 26, 2004, 11:44:15 PM12/26/04

to

SenderX wrote:

> > So basically, you can do this on any unix that has the right thread
> > info in /proc. Solaris has it. The only problem is that IBM has
> > all the RCU polling techniques patented, so there is no point in
> > doing it since you wouldn't know whether you could use it or not.
>
> ;(...
>
> Your stuff should work fine.... Joe's RCU User-Space could be a nice API for
> any serious _nix OS to expose to experienced parallel programmers?
>

...
> Damn, the proxy gc algos are sooo clean, and simple...
>
> Not sure if IBM owns this beauty already!!!!!!!!!!@?!?!?@!?@!?@!?@
>

I think generic RCU with GC or with proxy GC is public domain but specific
methods are patented. Sun has a patent for using RCU in Java with the built in
GC as far as I can tell. RCU with counters as the quiesce points and using
/proc to observe the counters is probably patented or otherwise owned by IBM.

This was the project I was asked if I didn't mind assigning patent rights to
IBM on. I assume it's because the others on the project were Linux Specialists
employed full time by IBM and IBM wanting to retain its IP rights from works
created by its employees rather than some scheme by IBM to exert control over
open source via its patents. The two aren't mutually exclusive.

I'm not being paid by anybody for this. The only possible thing would have
been some publicity but that wouldn't appear to be coming soon enought to do
me any good. The other Linux specialists are busy and I would probably be
asked to write code for free that I would have no right to use myself. So
there is zero benefit in staying on this project for me.

The previous postings were to make sure that the parts contributed by me
go into the public domain. Previous disclosure on RCU w/ restart here
http://groups.google.com/groups?selm=3F38BE84.115F0A17%40xemaps.com

That said, I don't think IBM will do a user space only implementation, as
opposed to a Linux kernel mod version, since the code could run on their
competitor's platforms.

Joe Seigh