
Re: RCU for preemptive user threads


Joe Seigh

Jun 24, 2004, 7:44:17 AM
I think I'll implement this for Solaris for the fun of it since I won't
be able to do anything with it. Linux is out of the question for now.
The /proc for the 2.4 kernel doesn't have the right stuff in it. /proc
in the newer kernel may have the right info, but I'm disinclined to waste
time or money just to see if that info might be there.

Joe Seigh

Joe Seigh

Jun 24, 2004, 9:19:35 AM

Also maybe have a performance shootout between RCU, atomic_ptr, and rwlock. :)

Joe Seigh

SenderX

Jun 24, 2004, 6:41:32 PM
> Also maybe have a performance shootout between RCU, atomic_ptr, and
> rwlock. :)

Could I throw one of my ia32 proxy gc algos in the ring?

Does it have to be a 100% open-source contest?

;)


Joe Seigh

Jun 24, 2004, 7:58:53 PM

It's not really that sort of contest. It will be more of a rough benchmark
against another lock-free algorithm since the rwlock will probably tank
miserably once I crank up the contention. It will be a simple copy on write
object so extra complexity of a proxy won't be needed. And it's on an
SB100 so IA32 won't work very well. Though, I really will need a 4-way
or 8-way to test it properly.

Joe Seigh

SenderX

Jun 24, 2004, 8:24:48 PM
> It's not really that sort of contest. It will be more of a rough benchmark
> against another lock-free algorithm since the rwlock will probably tank
> miserably once I crank up the contention.

:)


> It will be a simple copy on write
> object so extra complexity of a proxy won't be needed.

Probably, my stuff is great for protecting large shared collections.

Humm, actually, I do have a version that performs well for COW. The threads
do a xadd to acquire a proxy count, and a cas loop to release it. It's pretty
efficient...
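
For illustration, the general "fetch-and-add to acquire a count, CAS loop
to release it" shape might look like the sketch below, written with C11
atomics rather than raw ia32 XADD/CMPXCHG. This is not the collector being
referred to: the hard part of a real proxy collector -- obtaining the first
reference race-free against retirement -- is deliberately left out, and all
names are made up.

#include <stdatomic.h>
#include <stdlib.h>

#define PROXY_RETIRED 1u        /* low bit: the writer has retired this proxy */
#define PROXY_REF     2u        /* each reference adds 2                      */

struct proxy {
    _Atomic unsigned int state; /* packed: (refcount << 1) | retired flag     */
    void *object;               /* the copy-on-write data this proxy pins     */
};
/* A fresh proxy starts with state = PROXY_REF: the writer's own "current"
   reference, retired bit clear. */

/* Duplicate a reference we already hold: one XADD-style increment. */
static void proxy_ref(struct proxy *p)
{
    atomic_fetch_add(&p->state, PROXY_REF);
}

/* Drop a reference: a CAS loop, so the decrement and the "was that the last
   reference to a retired proxy?" test are made against one consistent value. */
static void proxy_unref(struct proxy *p)
{
    unsigned int old = atomic_load(&p->state);
    unsigned int newstate;

    do {
        newstate = old - PROXY_REF;
    } while (!atomic_compare_exchange_weak(&p->state, &old, newstate));

    if (newstate == PROXY_RETIRED) {    /* count hit zero after retirement */
        free(p->object);
        free(p);
    }
}

/* Writer side: mark the proxy retired and drop its "current" reference;
   whichever thread releases the last reference frees it. */
static void proxy_retire(struct proxy *p)
{
    atomic_fetch_or(&p->state, PROXY_RETIRED);
    proxy_unref(p);
}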


> And it's on an
> SB100 so IA32 won't work very well. Though, I really will need a 4-way
> or 8-way to test it properly.

Was it easy porting atomic_ptr?

casa, casxa...


Joe Seigh

Jun 24, 2004, 9:26:56 PM

SenderX wrote:

> > And it's on an
> > SB100 so IA32 won't work very well. Though, I really will need a 4-way
> > or 8-way to test it properly.
>
> Was it easy porting atomic_ptr?
>
> casa, casxa...

That was already done, with CASX, which is double wide in 32 bit mode. No plans to
do a version for 64 bit sparc or 64 bit AMD (unless someone pays me to). I probably
made the later updates to the atomic_ptr C++ code in vc++6.0 without checking
whether the updates would compile with gcc.

Joe Seigh

red floyd

Jun 25, 2004, 3:52:20 PM
Just remember to pay your license fees to SCO :)

Joe Seigh

Jun 25, 2004, 7:24:09 PM

red floyd wrote:
>
> Just remember to pay your license fees to SCO :)

Actually, I could cc Darl on my plans and there's not
much he could do about it since it's based on a mainframe
method. We may not be hearing about him too much longer.
I hear he's going into the Witless Protection Program.

Joe Seigh

SenderX

Jun 26, 2004, 12:38:16 AM
> > Was it easy porting atomic_ptr?
> >
> > casa, casxa...
>
> That was already done. CASX which is double wide in 32 bit mode.

It seems both atomic_ptr and all of my ia32 collectors easily port to 32-bit
mode sparc, without any change to the non-cpu specific code.

Using gas, gcc and pthreads makes lock-free algos very portable...


> No plans to
> do a version for 64 bit sparc or 64 bit AMD (unless someone pays me to).

Ahhh, 64-bit atomic_ptr...

<rant>
That thought makes me want to slap the s#$%#$ out the AMD64 designers! They
totally backstabbed every programmer that actually knows the power of the
cmpxchg8b instruction!

They forced us to use f%u%$ offsets into memory pools or indexes into arrays
as pointers! Damn!!
</rant>

;)


Joe Seigh

Jun 27, 2004, 7:54:48 PM

SenderX wrote:

> > No plans to
> > do a version for 64 bit sparc or 64 bit AMD (unless someone pays me to).
>
> Ahhh, 64-bit atomic_ptr...
>
> <rant>
> That thought makes me want to slap the s#$%#$ out the AMD64 designers! They
> totally backstabbed every programmer that actually knows the power of the
> cmpxchg8b instruction!
>
> They forced us to use f%u%$ offsets into memory pools or indexes into arrays
> as pointers! Damn!!
> </rant>
>

If I did put out a proxy GC api with RCU or atomic_ptr where RCU can't be implemented,
I'm tempted to use rwlock() for AMD processors just out of general principle.

Joe Seigh

Joe Seigh

Jun 28, 2004, 7:04:42 AM

Joe Seigh wrote:
> If I did put out a proxy GC api with RCU or atomic_ptr where RCU can't be implemented,
> I'm tempted to use rwlock() for AMD processors just out of general principle.
>

I'd have to have wrappers for mutexes so I could switch in an rwlock write lock. RCU does not
have these. I could just use a global mutex to simulate the interlocked updates. That
would work better (worse).

Joe Seigh

Joe Seigh

Jun 30, 2004, 9:41:45 PM

Well, the initial simple test seems to work. I even dummied up a thread
preempted in a critical section and the signal restart seems to work.
Now to write a really nasty testcase to abuse the hell out of it.
The testcase will probably be more complicated than the RCU code.

Joe Seigh

SenderX

Jul 2, 2004, 2:59:24 PM
> We may not be hearing about him too much longer.
> I hear he's going into the Witless Protection Program.

http://www.linuxworld.com/story/38137.htm
( Monty Python Meets Darl McBride )

lol


Joe Seigh

Jul 3, 2004, 10:38:31 AM

No go on solaris. The solaris signal dispatching mechanism doesn't work
the way I thought it would. It would have saved a lot of trouble if the
yes/no question I asked on comp.unix.solaris had been answered. A bunch
of bloody useless non-experts there.

I did find another mechanism that would allow implementing RCU for
preemptive user threads in solaris. It's a known technique though
and I'm not interested in reimplementing the wheel.

Nothing to see here. Move along.

Joe Seigh

Joe Seigh

Jul 5, 2004, 10:27:33 AM

I did find another mechanism that seems to work. So
I'll have something to play with for a while.

Joe Seigh

Joe Seigh

Jul 5, 2004, 10:19:49 PM

I think my RCU code would work better if solaris didn't forget I had
a signal handler installed. Time to install a new version of solaris.
It's really hard to get a realistic performance test on a single cpu
machine. Mutexes look better than they should because you are getting
serial execution anyhow. RCU performs at least as good as mutexes
with no contention on a single processor. You throw in contention
and mutex performance goes to hell. rwlocks do somewhat better
under "shared" contention (concurrent access rather than serial).

Here's some timings of specific functions, not the testcases, on a
502 mhz SB100 to give you an idea of relative performance.

empty loop; = 0.000000026
p = NULL; = 0.000000027
pthread_getspecific = 0.000000167
sctl = schedctl_lookup = 0.000002329
schedctl_start(sctl) = 0.000000037
setjmp(buf); = 0.000000063
sigsetjmp(buf, 0); = 0.000000129
rcu lock/unlock = 0.000000220
membar #LoadStore = 0.000000027
load/membar #LoadStore/store = 0.000000033
store/membar #StoreLoad/load = 0.000000056
mutex lock/unlock = 0.000000252
cas64(...); = 0.000000071
rw lock/unlock = 0.000000711


Everything is in cache and loop timing is not
subtracted out. So a setjmp would be about
37 nsec, for example. RCU locks only update
thread local storage so you don't get the cache
hits that the interlocked instructions for
mutex and rwlock have. The RCU (read) lock/unlock
is a pthread_getspecific, a setjmp and 2 stores.
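
To make that concrete, here is a speculative reconstruction of what such a
read-side lock/unlock could look like (this is not the actual code, just a
sketch built from the description above; the names are made up, and it
assumes, as the scheme itself does, that pthread_getspecific() can safely be
called from the signal handler):

#include <pthread.h>
#include <setjmp.h>
#include <signal.h>

struct rcu_thread {
    volatile sig_atomic_t in_cs;    /* non-zero while inside a read section */
    sigjmp_buf restart;             /* where a bounced reader resumes       */
};

static pthread_key_t rcu_key;       /* created once with pthread_key_create() */

/* Must be a macro: the sigsetjmp() has to live in the caller's stack frame
   so the jump target is still valid while the critical section is open.    */
#define rcu_read_lock(t)                                                \
    do {                                                                \
        (t) = pthread_getspecific(rcu_key);                             \
        sigsetjmp((t)->restart, 1);     /* non-zero savemask */         \
        (t)->in_cs = 1;                 /* store #1 */                  \
    } while (0)

#define rcu_read_unlock(t)  ((t)->in_cs = 0)    /* store #2 */

/* Signal handler installed with sigaction(): if the interrupted thread was
   caught inside a read critical section, bounce it back to the restart
   point so it drops whatever references it held and retries.               */
static void rcu_restart_handler(int sig)
{
    struct rcu_thread *t = pthread_getspecific(rcu_key);

    (void)sig;
    if (t != NULL && t->in_cs)
        siglongjmp(t->restart, 1);      /* also restores the signal mask */
}

The sigsetjmp()/siglongjmp() pairing with a non-zero savemask matters for
exactly the reason discussed later in the thread: the signal that gets
blocked on entry to the handler is unblocked again when the reader is
bounced back and restarted.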

The pthread_getspecific seems kind of expensive,
probably since it's been compromised for dynamic
deletion of keys. It should only be a few load
instructions.

The only downside with RCU is the same as for any
form of GC. You don't want to do a lot of
updates since you tie up a lot of resources
waiting to be GC'ed.

Joe Seigh

Casper H.S. Dik

Jul 6, 2004, 5:45:53 AM
Joe Seigh <jsei...@xemaps.com> writes:

>I think my RCU code would work better if solaris didn't forget I had
>a signal handler installed. Time to install a new version of solaris.

Are you using signal()? Solaris will remember that you have a signal
handler installed as long as you use the proper POSIX functions
(signal() should not be used; sigaction is what you need to use)
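
For reference, a minimal sigaction()-based installation looks something like
this (a generic sketch, not code from this thread):

#include <signal.h>

static void my_handler(int sig, siginfo_t *info, void *uctx)
{
    /* ... handler body ... */
    (void)sig; (void)info; (void)uctx;
}

int install_handler(int signo)
{
    struct sigaction sa;

    sa.sa_sigaction = my_handler;
    sigemptyset(&sa.sa_mask);   /* don't block anything extra while handling */
    sa.sa_flags = SA_SIGINFO;   /* use the three-argument handler            */
    return sigaction(signo, &sa, NULL);
}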

>It's really hard to get a realistic performance test on a single cpu
>machine. Mutexes look better than they should because you are getting
>serial execution anyhow. RCU performs at least as good as mutexes
>with no contention on a single processor. You throw in contention
>and mutex performance goes to hell. rwlocks do somewhat better
>under "shared" contention (concurrent access rather than serial).

>Here's some timings of specific functions, not the testcases, on a
>502 mhz SB100 to give you an idea of relative performance.

Which version of Solaris are you trying this with?

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Joe Seigh

Jul 6, 2004, 8:25:10 AM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >I think my RCU code would work better if solaris didn't forget I had
> >a signal handler installed. Time to install a new version of solaris.
>
> Are you using signal()? Solaris will remember that you have a signal
> handler installed as long as you use the proper POSIX functions
> (signal() should not be used; sigaction is what you need to use)

Yes, I'm using sigaction and it works for something like 43 times (in one case)
and then suddenly stops catching the signal. It's up in sched_yield but
I don't think the problem is sched_yield. That's what I'm using to cause
interleaved execution on a uniprocessor.

>
> >It's really hard to get a realistic performance test on a single cpu
> >machine. Mutexes look better than they should because you are getting
> >serial execution anyhow. RCU performs at least as good as mutexes
> >with no contention on a single processor. You throw in contention
> >and mutex performance goes to hell. rwlocks do somewhat better
> >under "shared" contention (concurrent access rather than serial).
>
> >Here's some timings of specific functions, not the testcases, on a
> >502 mhz SB100 to give you an idea of relative performance.
>
> Which version of Solaris are you trying this with?

A real old early version of Solaris 8 that originally came with the SB100.
I need to upgrade but downloading iso's off of a Sun server is not a lot
of fun. I may not bother since I've established proof of concept and I
have no real application of this. You may see something on Linux as
they have an application in mind.

I really need to get a new hobby. Inventing new synchronization algorithms
is about as useless a hobby as it gets.

Joe Seigh

Casper H.S. Dik

Jul 6, 2004, 2:33:08 PM
Joe Seigh <jsei...@xemaps.com> writes:

>Yes, I'm using sigaction and it works for something like 43 times (in one case)
>and then suddenly stops catching the signal. It's up in sched_yield but
>I don't think the problem is sched_yield. That's what I'm using to cause
>interleaved execution on a uniprocessor.

Are you using setjmp and such?

(With respect to your other posts: the /usr/lib/lwp implementation is
supposed to be faster but you may want to patch it; when it became
the default libthread in Solaris9, I don't think there were any
remaining performance regressions and quite a few improvements)

Joe Seigh

Jul 6, 2004, 3:43:54 PM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >Yes, I'm using sigaction and it works for something like 43 times (in one case)
> >and then suddenly stops catching the signal. It's up in sched_yield but
> >I don't think the problem is sched_yield. That's what I'm using to cause
> >interleaved execution on a uniprocessor.
>
> Are you using setjmp and such?

Yes. longjmp is supposed to be one of the allowable ways of returning from
a signal handler.

Joe Seigh

Casper H.S. Dik

Jul 6, 2004, 6:05:52 PM
Joe Seigh <jsei...@xemaps.com> writes:

I think you want to use siglongjmp() instead.

Casper

Joe Seigh

Jul 6, 2004, 6:32:42 PM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>

> >"Casper H.S. Dik" wrote:
> >>
> >> Are you using setjmp and such?
>
> >Yes. longjmp is supposed to be one of the allowable ways of returning from
> >a signal handler.
>
> I think you want to use siglongjmp() instead.
>

I'm not changing the signal mask so it doesn't matter whether the signal mask
gets saved or restored. sigsetjmp is a lot slower than setjmp and you need
to use it to use siglongjmp. The signal handler set up by sigaction doesn't
block any additional signals either.

Joe Seigh

Roger Faulkner

Jul 7, 2004, 3:08:58 AM
Joe Seigh <jsei...@xemaps.com> wrote in message news:<40EB02DF...@xemaps.com>...

Actually, if you do setjmp() at main level, then enter a signal handler
and call longjmp(), then after the setjmp() returns non-zero back at
what you thought was main level, you will actually still be in the
signal context. (If you call getcontext(), you will find a non-null
previous context pointer.)

Also, if you use sigaction() to set up the signal handler, then by default
when the signal handler is entered, the signal in question will be
blocked. Calling setjmp() will not change the signal mask, so you will
return with the signal still blocked.

If you use the SA_NODEFER flag to sigaction() to avoid this case,
then you will be subject to taking another signal of the same number
before your signal handler completes and reentering your signal handler
recursively, probably not what you want.

You should use sigsetjmp() / siglongjmp(), for the sake of correctness.
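
A generic illustration of that pattern (not code from this thread):
sigsetjmp() with a non-zero savemask at main level, siglongjmp() out of the
handler, so the mask that was blocked on entry to the handler is put back:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf env;

static void on_sig(int sig)
{
    (void)sig;
    siglongjmp(env, 1);             /* restores the mask saved by sigsetjmp */
}

int main(void)
{
    struct sigaction sa;

    sa.sa_handler = on_sig;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGUSR1, &sa, NULL);

    if (sigsetjmp(env, 1) == 0) {   /* savemask != 0: remember the mask */
        raise(SIGUSR1);             /* handler jumps back out here      */
    } else {
        printf("back at main level, SIGUSR1 no longer blocked\n");
    }
    return 0;
}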

Roger Faulkner
roger.f...@sun.com

Casper H.S. Dik

Jul 7, 2004, 4:16:37 AM
Joe Seigh <jsei...@xemaps.com> writes:

Calling a signal handler will always cause at least one additional signal to
be blocked: the signal itself.

Joe Seigh

Jul 7, 2004, 6:06:39 AM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >"Casper H.S. Dik" wrote:
> >>
> >> I think you want to use siglongjmp() instead.
> >>
> >I'm not changing the signal mask so it doesn't matter whether the signal mask
> >gets saved or restored. sigsetjmp is a lot slower than setjmp and you need
> >to use it to use siglongjmp. The signal handler set up by sigaction doesn't
> >block any additional signals either.
>
> Calling a signal handler will always cause at least one additional signal to
> be blocked: the signal itself.
>

That should be cleared on exiting the signal handler.

Joe Seigh

Joe Seigh

Jul 7, 2004, 6:31:35 AM

According to the official documentation the difference between setjmp and
sigsetjmp is that it is undefined whether setjmp/longjmp causes the signal
mask to be saved and restored. sigsetjmp/siglongjmp allow you to specify
whether or not that happens. The documentation says nothing about only one
being valid for exiting the signal handler context. My impression is that
longjmp/siglongjmp just modify the return context and then do a return
to let the signal handler do the actual return from signal handler context.

From the longjmp docs

As it bypasses the usual function call and
return mechanisms, longjmp() shall execute
correctly in contexts of interrupts, signals, and any
of their associated functions. However, if longjmp()
is invoked from a nested signal handler (that is,
from a function invoked as a result of a signal
raised during the handling of another signal), the
behavior is undefined.

I'm not doing nested signal handlers.

From siglongjmp docs

The siglongjmp() function shall be equivalent to
the longjmp() function, except as follows:

References to setjmp() shall be equivalent to
sigsetjmp().

The siglongjmp() function shall restore the
saved signal mask if and only if the env
argument was initialized by a call to
sigsetjmp() with a non-zero savemask
argument.

Joe Seigh

Casper H.S. Dik

Jul 7, 2004, 7:12:31 AM
Joe Seigh <jsei...@xemaps.com> writes:

Yes, but you don't exit the signal handler if you longjmp out of it.

Only if you siglongjmp() out of it.

Casper H.S. Dik

Jul 7, 2004, 7:17:21 AM
Joe Seigh <jsei...@xemaps.com> writes:


>According to the official documentation the difference between setjmp and
>sigsetjmp is that it is undefined whether setjmp/longjmp causes the signal
>mask to be saved and restored. sigsetjmp/siglongjmp allow you to specify
>whether or not that happens. The documentation says nothing about only one
>being valid for exiting the signal handler context. My impression is that
>longjmp/siglongjmp just modify the return context and then just do a return
>to let the signal hander do the actual return from signal handler context.

What return context? A signal handler is typically called through
a trampoline:

void trampoline(int sig, siginfo_t *infop, ucontext_t *context)
{
        userhandlers[sig]->handler(sig, infop, context);
        setcontext(context);    /* or other system specific cleanup */
}

When you longjmp() you bypass the signal handler return code;
you never get to the "cleanup" bit. (I don't see a way it could, either).

Joe Seigh

Jul 7, 2004, 7:34:31 AM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>

>
> Yes, but you don't exit the signal handler if you longjmp out of it.
>
> Only if you siglongjmp() out of it.
>

That's not what the official longjmp(), siglongjmp() docs say. Is there
documentation somewhere else on this?

Joe Seigh

Casper H.S. Dik

Jul 7, 2004, 8:09:18 AM
Joe Seigh <jsei...@xemaps.com> writes:

>> Yes, but you don't exit the signal handler if you longjmp out of it.
>>
>> Only if you siglongjmp() out of it.
>>

>That's not what the official longjmp(), siglongjmp() docs say. Is there
>documentation somewhere else on this?

The UNIX specification (version 2) says in the sigaction manual
page:

When the signal handler returns, the receiving process will
resume execution at the point it was interrupted unless the
signal handler makes other arrangements. If longjmp() or
_longjmp() is used to leave the signal handler, then the signal
mask must be explicitly restored by the process.

Which I would take as authoritative, and it indicates that you must
either restore the mask yourself or use siglongjmp.

If you are interested in the signal state of such a hung process
under Solaris you can use "psig <pid>" and "pflags <pid>"; the latter
prints the signal masks and the pending signals.
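
The other option the quoted text allows -- keeping plain setjmp()/longjmp()
and putting the mask right by hand -- would look roughly like this (a
generic example, not code from this thread; SIGUSR1 is just a placeholder):

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static jmp_buf env;

static void on_sig(int sig)
{
    (void)sig;
    longjmp(env, 1);        /* leaves the signal blocked on SVR4/Solaris */
}

int main(void)
{
    struct sigaction sa;
    sigset_t s;

    sa.sa_handler = on_sig;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGUSR1, &sa, NULL);

    if (setjmp(env) != 0) {
        /* Came back via longjmp() from the handler: the signal is still
           blocked, so the process has to restore the mask explicitly. */
        sigemptyset(&s);
        sigaddset(&s, SIGUSR1);
        sigprocmask(SIG_UNBLOCK, &s, NULL);
        printf("recovered, SIGUSR1 unblocked again\n");
        return 0;
    }
    raise(SIGUSR1);
    return 1;
}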

Casper

Joe Seigh

Jul 7, 2004, 9:24:13 AM

Ok and I just spotted it in man -s3head signal also. So sigsetjmp() would
be the right thing to use. Thanks. I would have never thought to look
at the sigaction man page rather than the longjmp man page to see how
longjmp should be used.

Joe Seigh

Roger Faulkner

Jul 7, 2004, 12:58:53 PM
Joe Seigh <jsei...@xemaps.com> wrote in message news:<40EBD2EE...@xemaps.com>...

> Roger Faulkner wrote:
> >
> > You should use sigsetjmp() / siglongjmp(), for the sake of correctness.
> >
>
> According to the official documentation the difference between setjmp and
> sigsetjmp is that it is undefined whether setjmp/longjmp causes the signal
> mask to be saved and restored.

Yes, and the Solaris (SVR4) implementation of setjmp()/longjmp()
does not save/restore the signal mask.

> sigsetjmp/siglongjmp allow you to specify
> whether or not that happens. The documentation says nothing about only one
> being valid for exiting the signal handler context.

True, but you have to be aware of the consequences.

> My impression is that
> longjmp/siglongjmp just modify the return context and then just do a return
> to let the signal hander do the actual return from signal handler context.

Wrong assumption about the Solaris (SVR4) implementation.

siglongjmp() does indeed perform a setcontext() system call
(which is what happens on normal return from a signal handler).

However, longjmp() just restores registers (stack pointer, etc) to
their state at the time of the setjmp() and branches to the saved
return location. No setcontext() is performed, and the signal mask
is not changed from its value at the time longjmp() is called.

> From the longjmp docs
>
> As it bypasses the usual function call and
> return mechanisms, longjmp() shall execute
> correctly in contexts of interrupts, signals, and any
> of their associated functions. However, if longjmp()
> is invoked from a nested signal handler (that is,
> from a function invoked as a result of a signal
> raised during the handling of another signal), the
> behavior is undefined.
>
> I'm not doing nested signal handlers.
>
> From siglongjmp docs
>
> The siglongjmp() function shall be equivalent to
> the longjmp() function, except as follows:
>
> References to setjmp() shall be equivalent to
> sigsetjmp().
>
> The siglongjmp() function shall restore the
> saved signal mask if and only if the env
> argument was initialized by a call to
> sigsetjmp() with a non-zero savemask
> argument.
>
> Joe Seigh

From the getcontext() / setcontext() docs:

APPLICATION USAGE

When a signal handler is executed, the current user context is saved
and a new context is created. If the thread leaves the signal handler
via longjmp(), then it is unspecified whether the context at the time
of the corresponding setjmp() call is restored and thus whether future
calls to getcontext() provide an accurate representation of the current
context, since the context restored by longjmp() does not necessarily
contain all the information that setcontext() requires. Signal handlers
should use siglongjmp() or setcontext() instead.

Roger Faulkner
roger.f...@sun.com

Joe Seigh

Jul 7, 2004, 2:16:43 PM

Roger Faulkner wrote:
>
> Joe Seigh <jsei...@xemaps.com> wrote in message news:<40EBD2EE...@xemaps.com>...
> > Roger Faulkner wrote:
> > >
> > > You should use sigsetjmp() / siglongjmp(), for the sake of correctness.
> > >
> >
> > According to the official documentation the difference between setjmp and
> > sigsetjmp is that it is undefined whether setjmp/longjmp causes the signal
> > mask to be saved and restored.
>
> Yes, and the Solaris (SVR4) implementation of setjmp()/longjmp()
> does not save/restore the signal mask.

Now that I think about it some more, I'm a little confused. Usually they
add new functions like sigsetjmp/siglongjmp so as to keep the behavior of
the old functions like setjmp/longjmp unchanged. But if setjmp/longjmp
is undefined as to whether signal mask save/restore is done then the old
definitions were the same? I'm trying to deconstruct the rationale for
sigsetjmp/siglongjmp. The problem was the old functions were nondeterministic
or what?

...


>
> From the getcontext() / setcontext() docs:
>
> APPLICATION USAGE
>
> When a signal handler is executed, the current user context is saved
> and a new context is created. If the thread leaves the signal handler
> via longjmp(), then it is unspecified whether the context at the time
> of the corresponding setjmp() call is restored and thus whether future
> calls to getcontext() provide an accurate representation of the current
> context, since the context restored by longjmp() does not necessarily
> contain all the information that setcontext() requires. Signal handlers
> should use siglongjmp() or setcontext() instead.
>

So why was getcontext/setcontext made obsolete? It seems you could have
used them instead of defining new functions that are essentially the same.

Joe Seigh

Roger Faulkner

Jul 7, 2004, 11:22:12 PM
Joe Seigh <jsei...@xemaps.com> wrote in message news:<40EC3FEF...@xemaps.com>...

> Roger Faulkner wrote:
> >
> > Joe Seigh <jsei...@xemaps.com> wrote in message news:<40EBD2EE...@xemaps.com>...
> > > Roger Faulkner wrote:
> > > >
> > > > You should use sigsetjmp() / siglongjmp(), for the sake of correctness.
> > >
> > > According to the official documentation the difference between setjmp and
> > > sigsetjmp is that it is undefined whether setjmp/longjmp causes the signal
> > > mask to be saved and restored.
> >
> > Yes, and the Solaris (SVR4) implementation of setjmp()/longjmp()
> > does not save/restore the signal mask.
>
> Now that I think about it some more, I'm a little confused. Usually they
> add new functions like sigsetjmp/siglongjmp so as to keep the behavior of
> the old functions like setjmp/longjmp unchanged. But if setjmp/longjmp
> is undefined as to whether signal mask save/restore is done then the old
> definitions were the same? I'm trying to deconstruct the rationale for
> sigsetjmp/siglongjmp. The problem was the old functions were nondeterministic
> or what?

The documents we (at least I) have been quoting are the POSIX/SUSv3
standards documents. The standards need to cover all conforming
implementations, and different conforming implementations behave
differently in the areas designated as "undefined" in the standard.

The behavior of setjmp()/longjmp() and sigsetjmp()/siglongjmp() has
always been defined for the Solaris (SVR4) implementation, since
circa 1990. They are as I described to you earlier.

> > From the getcontext() / setcontext() docs:
> >
> > APPLICATION USAGE
> >
> > When a signal handler is executed, the current user context is saved
> > and a new context is created. If the thread leaves the signal handler
> > via longjmp(), then it is unspecified whether the context at the time
> > of the corresponding setjmp() call is restored and thus whether future
> > calls to getcontext() provide an accurate representation of the current
> > context, since the context restored by longjmp() does not necessarily
> > contain all the information that setcontext() requires. Signal handlers
> > should use siglongjmp() or setcontext() instead.
>
> So why was getcontext/setcontext made obsolete? It seems you could have
> used them instead of defining new functions that are essentially the same.

Who ever said getcontext()/setcontext() was made obsolete?
Did you mean setjmp()/longjmp()? They are not obsolete either.

setjmp()/longjmp() are perfectly safe (and very fast since no system
calls are required) when used entirely in the main (not signal) context
(modulo the unsafeness of skipping back over function call frames where
locks have been acquired [all of the functions we are discussing are
dangerous to use in general in a multithreaded environment]).

sigsetjmp()/siglongjmp() are safe to use between contexts, as we have
been discussing, but are slower because they require system calls.

getcontext()/setcontext() are also safe but they are the slowest because
they save/restore the entire context, including the floating-point state
(which [sig]setjmp()/[sig]longjmp() do not).

Roger Faulkner
roger.f...@sun.com

Casper H.S. Dik

Jul 8, 2004, 4:47:16 AM
Joe Seigh <jsei...@xemaps.com> writes:

>Now that I think about it some more, I'm a little confused. Usually they
>add new functions like sigsetjmp/siglongjmp so as to keep the behavior of
>the old functions like setjmp/longjmp unchanged. But if setjmp/longjmp
>is undefined as to whether signal mask save/restore is done then the old
>definitions were the same? I'm trying to deconstruct the rationale for
>sigsetjmp/siglongjmp. The problem was the old functions were nondeterministic
>or what?

The POSIX standards were written after the fact, when divergence had happened
between various Unix variants.

In traditional Unix the only way to install a signal handler
was with signal(); its semantics were "unreliable":

- when a signal is delivered the signal handler is called and
reset to SIG_DFL
- a signal is not blocked when it has been delivered.

In these circumstances there was no need to longjmp() to ever consider
the signal mask state: there was none.

Then BSD Unix came along and made signals reliable; in doing so, they
did not realize that if you change the semantics you really also have to
change the name. signal() was made "reliable" (signals blocked, signal
handler not reset). That change to signal() in turn meant changes to
longjmp/setjmp (which means a bigger buffer size, etc).
And they added _longjmp()/_setjmp() to "just manipulate the C registers".
They also invented "sigvec".

Because AT&T Unix always cared a lot more about backward compatibility,
both in binary and in source, they could change neither signal() nor
*jmp.

I think it was when standardizing all of this mess that they invented
sigaction() and sig*jmp(); sigaction() was modelled after sigvec() but
because it was incompletely designed, some changes needed to be made and
so it was renamed.

Since we're now left with two signal() and two *jmp() specifications,
the standard had to leave the behaviour "undefined" to enable the
standard to be created in the first place.

So it is history which dictates the "unspecified" behaviour.

Joe Seigh

Jul 8, 2004, 6:41:28 AM

Roger Faulkner wrote:
>
> Joe Seigh <jsei...@xemaps.com> wrote in message news:<40EC3FEF...@xemaps.com>...
> > Roger Faulkner wrote:
> > So why was getcontext/setcontext made obsolete? It seems you could have
> > used them instead of defining new functions that are essentially the same.
>
> Who ever said getcontext()/setcontext() was made obsolete?
> Did you mean setjmp()/longjmp()? They are not obsolete either.

Issue 6

IEEE Std 1003.1-2001/Cor 2-2004, item
XSH/TC2/D6/45 is applied, updating the
SYNOPSIS and APPLICATION USAGE sections to
note that the getcontext() and setcontext()
functions are obsolescent.


Which is doubly strange since C and Unix go to extreme and ridiculous measures
for compatibility even when it impedes future functionality.

Joe Seigh

Joe Seigh

Jul 8, 2004, 8:36:13 AM

Roger Faulkner wrote:
>
> setjmp()/longjmp() are perfectly safe (and very fast since no system
> calls are required) when used entirely in the main (not signal) context
> (modulo the unsafeness of skipping back over function call frames where
> locks have been acquired [all of the functions we are discussing are
> dangerous to use in general in a multithreaded environment]).
>
> sigsetjmp()/siglongjmp() are safe to use between contexts, as we have
> been discussing, but are slower because they require system calls.
>
> getcontext()/setcontext() are also safe but they are the slowest because
> they save/restore the entire context, including the floating-point state
> (which [sig]setjmp()/[sig]longjmp() do not).
>

setjmp(buf); = 0.000000062
sigsetjmp(buf, 1); = 0.000000154
getcontext() = 0.000009015

Slowest is an understatement. I know some architectures have mechanisms
to determine whether the floating point regs are being used so that
they don't have to be saved if they aren't in use. There's also a floating
point state reset mechanism so that just because some subroutine used the fp
regs once, they don't have to be saved and restored for all time.
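
For what it's worth, numbers like these can be produced with a trivial
harness along the following lines (this is a guess at the method, not the
actual test code; gethrtime() is Solaris-specific, and loop overhead is left
in, matching the earlier note):

#include <stdio.h>
#include <setjmp.h>
#include <sys/time.h>               /* gethrtime() on Solaris */

int main(void)
{
    enum { N = 10 * 1000 * 1000 };
    sigjmp_buf buf;
    hrtime_t t0, t1;
    int i;

    t0 = gethrtime();
    for (i = 0; i < N; i++)
        sigsetjmp(buf, 1);
    t1 = gethrtime();

    /* seconds per call, loop overhead included, like the figures above */
    printf("sigsetjmp(buf, 1); = %.9f\n", ((double)(t1 - t0) / N) / 1e9);
    return 0;
}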

Joe Seigh

Casper H.S. Dik

Jul 8, 2004, 9:54:11 AM
Joe Seigh <jsei...@xemaps.com> writes:


>setjmp(buf); = 0.000000062
>sigsetjmp(buf, 1); = 0.000000154
>getcontext() = 0.000009015

>Slowest is an understatement. I know some architectures have mechanisms
>to determine whether the floating point regs are being used so that
>they don't have to be saved if they don't have to. Also a reset floating
>point state reset mechanism so that just because some subroutine used the fp
>regs once, they don't have to be saved and restored for all time.

But getcontext() cannot know that: the bit that says "fp registers
dirty" is only useful for the kernel, which has one place to save
state and can assign meaning to the bits. But getcontext needs
to save all the registers because it has no history to go back to.

Roger Faulkner

Jul 8, 2004, 11:33:52 AM
Joe Seigh <jsei...@xemaps.com> wrote in message news:<40ED26C2...@xemaps.com>...

Ah, I see.
They were struggling with the problem of makecontext() and swapcontext()
and ended up (perhaps with overkill) making getcontext() and setcontext()
obsolete. This was the problem:

Change Number: XSH/TC2/D6/57 [XSH ERN 53]

Change the function prototype from:
"void makecontext(ucontext_t *ucp, void (*func)(void), int argc, ...);"

To:
"void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);"

In the RATIONALE section
Change From:
"None."

To:
"With the incorporation of ISO/IEC 9899:1999 (C99) into this
specification it was found that the ISO C Standard (subclause 6.11.6)
specifies that the use of function declarators with empty parentheses
is an obsolescent feature. Therefore, using the function prototype:

void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);

is making use of an obsolescent feature of ISO C. Therefore, a strictly
conforming POSIX application can't use this form. Therefore, use of
getcontext(), makecontext() and swapcontext() is marked obsolescent.

They probably went overboard and may revisit the issue again.
There was really no need to make getcontext()/setcontext() obsolete,
just makecontext() and, by association, swapcontext().
But then again, who knows what a standards body will mandate?

Roger Faulkner
roger.f...@sun.com

Joe Seigh

Jul 8, 2004, 1:17:03 PM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >setjmp(buf); = 0.000000062
> >sigsetjmp(buf, 1); = 0.000000154
> >getcontext() = 0.000009015
>
> >Slowest is an understatement. I know some architectures have mechanisms
> >to determine whether the floating point regs are being used so that
> >they don't have to be saved if they don't have to. Also a reset floating
> >point state reset mechanism so that just because some subroutine used the fp
> >regs once, they don't have to be saved and restored for all time.
>
> But getcontext() cannot know that: the bit that says "fp registers
> dirty" is only useful for the kernel, it has one place where to save
> state and it can assign meaning to the bits. But getcontext needs
> to save all the registers because it has no history to go back to.
>

You need the instruction that saves the fp regs to save and restore
the fp dirty flag and conditionally save and restore the fp regs.
If the fp regs don't need saving and aren't saved it should run much
quicker.

Joe Seigh

Casper H.S. Dik

Jul 8, 2004, 4:27:15 PM
Joe Seigh <jsei...@xemaps.com> writes:


>You need the instruction that saves the fp regs to save and restore
>the fp dirty flag and conditionally save and restore the fp regs.
>If the fp regs don't need saving and aren't saved it should run much
>quicker.


That's not what I was getting at; getcontext() must *always* save
all the FP registers because the "dirty" bit has a meaning relative
to "last save". There is no "last save" for getcontext() to fall
back on.

Take an example:

float f = 1.0 * 2.0; /* registers dirty */
/* A */
getcontext(&ctxt1);

after the first instruction the FP registers are dirty;
if a context switch happens at /* A */ they
will be clean again when getcontext() is called. If not,
then they will be dirty. Should getcontext() do something
different if they are no longer dirty? Of course not,
it needs all FP registers to be able to restore them later.

Jonathan Adams

Jul 8, 2004, 8:14:30 PM
Joe Seigh <jsei...@xemaps.com> wrote:
> The pthread_getspecific seems kind of expensive,
> probably since it's been compromised for dynamic
> deletion of keys. It should only be a few load
> instructions.

This is probably because you're using the old libthread -- the
alternate libthread's TSD implementation is quite efficient -- the
fast path for pthread_getspecific() is one compare and one load.[1]
(the data for the first few TSD keys is stored directly in the thread
structure)

- jonathan

[1] The slow path totals three loads and three compares.
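
Purely as an illustration of that kind of layout (this is not the actual
libthread source; the structure, constant and names are invented):

#include <stddef.h>

#define TSD_NFAST 9                     /* number of inline slots; made up */

typedef struct fake_uthread {
    void  *tsd_fast[TSD_NFAST];         /* the first few keys live here    */
    void **tsd_slow;                    /* grown on demand for the rest    */
    unsigned int tsd_nslow;
} fake_uthread_t;

void *fast_getspecific(fake_uthread_t *self, unsigned int key)
{
    if (key < TSD_NFAST)                /* fast path: one compare ...      */
        return self->tsd_fast[key];     /* ... and one load                */

    /* slow path: a few more loads and compares against the grown array */
    if (self->tsd_slow != NULL && key - TSD_NFAST < self->tsd_nslow)
        return self->tsd_slow[key - TSD_NFAST];
    return NULL;
}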

Joe Seigh

Jul 8, 2004, 9:05:40 PM

That's good. That will help make up for some of the overhead
picked up going from setjmp to sigsetjmp.

Joe Seigh

Joe Seigh

Jul 8, 2004, 9:40:12 PM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >You need the instruction that saves the fp regs to save and restore
> >the fp dirty flag and conditionally save and restore the fp regs.
> >If the fp regs don't need saving and aren't saved it should run much
> >quicker.
>
> That's not what I was getting at; getcontext() must *always* save
> all the FP registers because the "dirty" bit has a meaning relative
> to "last save". There is no "last save" for getcontext() to fall
> back on.
>
> Take an example:
>
> float f = 1.0 * 2.0; /* registers dirty */
> /* A */
> getcontext(&ctxt1);
>
> after the first instruction the FP registers are dirty;
> if a context switch happens at /* A */ they
> will be clean again when getcontext() is called. If not,
> then they will be dirty. Should getcontext() do something
> different if they are no longer dirty? Of course not,
> it needs all FP registers to be able to restore them later.

I was thinking of a mechanism like that for pushing context
on the stack. But it would need to be an additional thing
since the way the kernel uses it is incompatible with the
way I'm thinking.

But it doesn't matter since in user space nobody saves floating
point so the functions that use fp regs have to explicitly save
the fp regs themselves. So sigsetjmp and setjmp are perfectly
adaquate for conventional use and setcontext is maybe a bit of
overkill since we're not context switching but unwinding a stack
with no fp regs being used.

Joe Seigh

SenderX

Jul 8, 2004, 11:29:45 PM
> I did find another mechanism that would allow implementing RCU for
> preemptive user threads in solaris. It's a known technique though
> and I'm not interested in reimplementing the wheel.

I was thinking of simple ways to track quiescent states on each thread...

A 100% user-space RCU subsystem could be driven by a separate thread.

The thread could be driven by a lock-free stack. It would take requests from
threads that wish to participate in, or revoke their ties to, RCU. It could
iterate the list of threads every so often to check each thread's two
per-thread generation counters. If a thread's old counter did not match its new
counter, it means the thread has passed through a quiescent state and a counter gets
incremented. When the quiescent state counter >= the number of threads in
the RCU system, then a quiescent period must have gone by, so we iterate the
RCU callback queues on each thread and see if they are waiting for the
current quiescent period, and if so free them.

The threads that are in the system would contain two per-thread counters (
old, new ) and a freelist of queues. If a bunch of objects come in on the
same quiescent period the per-thread queues will batch them up. The batches
themselves will contain a copy of the quiescent period they were collected
in. If a batch comes in with a different quiescent period than the last
batch, it means we can free all of the batches up to the new one.

The only problem I can see is that the threads will have to make sure they
change their new counters fairly often. That means before blocking of any
sort, or before a system call, etc... If a single thread simply does not
inc its generation, no RCU callbacks will be fired... That would be bad...


A super rough idea for the subsystem thread:

static int iQuiescentPeriod = 0;

void* RCU_SubSystem( void *pState )
{
    int iQuiescentState = 0;

    for (;;)
    {
        // This would adjust the latency for RCU cleanup
        pThread = WaitForRequest( 250ms timeout );

        LockRCUSubSystem();

        if ( pThread )
        {
            /* Add thread to RCU */
        }

        /* Attempt to detect quiescent states */

        for each thread in RCU
        {
            if ( thread.old != thread.new )
            {
                thread.old = thread.new;

                iQuiescentState++;
            }
        }

        if ( iQuiescentState >= number of threads in RCU )
        {
            iQuiescentState = 0;

            atomic_inc( &iQuiescentPeriod );

            for each thread in RCU
            {
                for each rcu callback batch in thread
                {
                    if ( thread.batch.iQuiescentPeriod != iQuiescentPeriod )
                    {
                        FlushRCUCallbackBatch( &thread.batch );
                    }
                }
            }
        }

        UnlockRCUSubSystem();
    }
}

I may try to tinker around with this crazy idea! If it works, I will post a
more detailed description.

;)


Joe Seigh

Jul 9, 2004, 8:10:57 AM

SenderX wrote:
>
> > I did find another mechanism that would allow implementing RCU for
> > preemptive user threads in solaris. It's a known technique though
> > and I'm not interested in reimplementing the wheel.
>
> I was thinking of simple ways to track quiescent states on each thread...
>
> A 100% user-space RCU subsystem could be driven by a separate thread.
>
> The thread could be run by a lock-free stack. It would take requests from
> threads that wish to participate with, or revoke their ties to RCU. It could
> iterate the list of threads every so often to check for per-thread two
> generation counters. If the threads old counter did not match its new
> counter, it means the thread is through a quiescent state and a counter gets
> incremented. When the quiescent state counter >= the number of threads in
> the RCU system, then a quiescent period must have went by, so we iterate the
> RCU callbacks queues on each thread and see if they are waiting for the
> current quiescent period, if so free them.
>

If you use global counter quiesce point logic like

local_counter = ++global_counter; // ++global_counter is atomic

or

local_counter = monotonic_clock();

then you can use a single deferred work queue. The overhead of atomically
incrementing a global counter isn't a problem since while quiesce points
are frequent, they are a lot less frequent than the read critical regions
or should be if you want any benefits from RCU.
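
A rough sketch of that global-counter variant (a reconstruction of the idea
as described, not actual code; the fixed-size thread table and the names are
made up):

#include <stdatomic.h>

#define RCU_MAXTHREADS 64       /* fixed table just to keep the sketch short */

static _Atomic unsigned long global_counter;
static _Atomic unsigned long local_counter[RCU_MAXTHREADS];

/* Quiesce point, executed by thread 'self' between read critical sections:
   local_counter = ++global_counter, as described above. */
void rcu_quiesce(int self)
{
    atomic_store(&local_counter[self],
                 atomic_fetch_add(&global_counter, 1) + 1);
}

/* Called right after unlinking an object: stamp the deferred-work entry.
   (The deferring thread should itself pass a quiesce point afterwards, or
   be left out of the check below.) */
unsigned long rcu_defer_stamp(void)
{
    return atomic_fetch_add(&global_counter, 1) + 1;
}

/* The single deferred-work queue is processed in order; an entry stamped S
   may be freed once every registered thread has quiesced past S. */
int rcu_stamp_expired(unsigned long stamp, int nthreads)
{
    int i;

    for (i = 0; i < nthreads; i++)
        if (atomic_load(&local_counter[i]) < stamp)
            return 0;
    return 1;
}

A single queue works because the stamps all come from one monotonic counter,
so entries expire in the order they were deferred.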

If you use local counter quiesce point logic like

local_counter++;

and poll to see if they've changed then you have to use entirely different
logic to process deferred work, some indirect inferencing, since the local
counters would have all different values with no relation to each other.

If you have a system that lets you see the per thread getrusage info
that has the number of voluntary context switches, you can use them for
counters which has a couple of benefits. The quiesce points are similar
to Linux kernel RCU and you don't need to explicitly seed your system with
quiesce point counters. But since the counters are in the system you
can't have quiescent period checking w/ the counters. It has to be
polling only.

The counters are, in effect, eventcounts if anyone wonders what they
actually are. There are other types of things that can be used as
quiesce points, regular events, transient states, etc... The logic
for those is somewhat different.

RCU is a form of proxy GC, and proxy GC, since it batches GC, has a problem with
threads being indefinitely blocked or suspended and holding up GC.
Boehm GC, SMR, and regular ref counting have smaller granularity and
less of a problem there.

Canceling and restarting the read critical section to force the
references to be dropped works if you have a mechanism that
doesn't require forward progress of the suspended thread (requiring
forward progress being the problem). You can do this in RCU if you
have the right asynchronous signal/exception throwing mechanism.


Joe Seigh

Joe Seigh

Jul 9, 2004, 12:57:30 PM

I wrote:
>
> I did find another mechanism that seems to work. So
> I'll have something to play with for a while.
>

Well, that doesn't work either. It's actually possible to create
a thread that runs indefinitely with an unblocked signal pending
and not take the signal.

So if this gets done it will be on Linux which will give Linux
a significant performance advantage over Solaris going by the
testcases I ran.

Joe Seigh

Joe Seigh

Jul 9, 2004, 7:15:11 PM


> Well, that doesn't work either. It's actually possible to create
> a thread that runs indefinitely with an unblocked signal pending
> and not take the signal.
>

No, that was a programming bug in my attempt to work around some signal issues. I still
think there are some problems with getting signals delivered to threads
in the run queue.

The testcase has 2 writer threads and 20 reader threads. There are
various sched_yields and nanosleeps to try to get the threads to
interleave and generate some contention.

With no loop delay in either the readers or writers except for
some sched_yields (flat out), the locks/sec/thread rates are
approximately (in the neighborhood)

reader writer
mutex 290 290
rwlock 290 290
atomic_ptr 3400 3400
rcu 1700 1700
rcu2 4000 2667
rcu2* 20000 1222


rcu is with the restart code around the read critical section.
rcu2 is with no code around the read critical region.

rcu2* is with no sched_yield in the writer loop, which
for some reason slowed down the writers and sped up
the readers.

I don't know why the reader and writer rates are the
same in most cases. Probably something with sched_yield
round robining the threads maybe.

This is on a uniprocessor. I really need to get
an 8-way to properly test this out.

Joe Seigh

Roger Faulkner

Jul 10, 2004, 1:49:10 AM
Joe Seigh <jsei...@xemaps.com> wrote in message news:<40EED065...@xemaps.com>...

Please qualify your statements by what release of Solaris and what
libthread you are using. What you say is not true of the new libthread
in Solaris 9 nor of the alternate libthread in Solaris 8.

There were many things wrong with the old libthread, especially in the
area of signals, I cannot quibble about that. That's why we gave up on
it and implemented the new libthread.

Roger Faulkner
roger.f...@sun.com

Joe Seigh

Jul 10, 2004, 10:03:28 AM

Apart from a programming bug, which was a red herring, there's no problem
with the thread libraries. The "problem" is with the signal delivery
mechanism, and it may not just be solaris; I suspect all the unices do
their run queues the same way. Unix signal delivery really is polling,
and the thread actually has to run and make forward progress to see the
signal, and only if it executes one of the polling points. Normally this
is not a problem for conventional unix programming. But I'm doing some
unconventional stuff. I thought it might work on Solaris, but now that
I know more about the actual signal delivery mechanism in Solaris I
can see it won't work. No problem, I just won't try it on Solaris.

The Linux signal delivery mechanism probably works the same way. But
there's one important difference. Somebody can go into the Linux
kernel and add in functionality to make what they want to do work. So
basically, when someone comes up with a new idea for scalability and
performance and it needs a kernel tweak, where do you think it's going
to go? Linux or Solaris? Sorry, Solaris has no future.

Joe Seigh

Joe Seigh

Jul 10, 2004, 12:12:10 PM

SenderX wrote:
>
> > I did find another mechanism that would allow implementing RCU for
> > preemptive user threads in solaris. It's a known technique though
> > and I'm not interested in reimplementing the wheel.
>
> I was thinking of simple ways to track quiescent states on each thread...
>
> A 100% user-space RCU subsystem could be driven by a separate thread.

Also, one of the problems that I haven't mentioned so far is unix signal
handlers. In theory, you can get signals delivered while you are in an
RCU critical section and they could do voluntary context switches. And
there is no way of knowing when the signal handlers "exit". So if you
use voluntary context switches as quiesce points you have to block all
signals unless you know for sure the signal handlers don't lose control.

Either that or do RCU with your own explicit quiesce points.

SMR is starting to look a lot more attractive.

Joe Seigh

Steve Watt

Jul 11, 2004, 3:08:52 PM
In article <40EFF921...@xemaps.com>,
Joe Seigh <jsei...@xemaps.com> wrote:
> [ ... ] The "problem" is with the signal delivery

>mechanism and it may not just be solaris, I suspect all the unices do
>their run queues the same way. Unix signal delivery really is polling
>and the thread actually has to run and make forward progress to see the
>signal and only if it executes one of the polling points. Normally this
>is not a problem for conventional unix programming. But I'm doing some
>unconventional stuff. I thought it might work on Solaris but now that
>I know more about the actual signal delivery mechanism in Solaris I
>can see it won't work. No problem, I just won't try it on Solaris.

I'm trying to grasp what you're getting at here, but I guess I'm
not seeing it. In the OS I'm intimately familiar with, signals are
delivered when the thread that's been selected to receive the signal
returns to user mode. If that's what you mean by "polling point",
then I suppose it is. However, some processor exception made the signal
pending (system call, irq/dev driver, bus error, whatever), so I would
opine that the victim thread will get it PDQ.

In an SMP environment, things are a wee bit trickier, but it seems that
pitching an interrupt to the CPU that's carrying the target thread would
do the trick nicely. The interrupt needn't do anything except get into
the kernel, because the return will deliver the signal.

[ snip ]
--
Steve Watt KD6GGD PP-ASEL-IA ICBM: 121W 56' 57.8" / 37N 20' 14.9"
Internet: steve @ Watt.COM Whois: SW32
Free time? There's no such thing. It just comes in varying prices...

Joe Seigh

Jul 11, 2004, 4:43:59 PM

Steve Watt wrote:
>
> In article <40EFF921...@xemaps.com>,
> Joe Seigh <jsei...@xemaps.com> wrote:
> > [ ... ] The "problem" is with the signal delivery
> >mechanism and it may not just be solaris, I suspect all the unices do
> >their run queues the same way. Unix signal delivery really is polling
> >and the thread actually has to run and make forward progress to see the
> >signal and only if it executes one of the polling points. Normally this
> >is not a problem for conventional unix programming. But I'm doing some
> >unconventional stuff. I thought it might work on Solaris but now that
> >I know more about the actual signal delivery mechanism in Solaris I
> >can see it won't work. No problem, I just won't try it on Solaris.
>
> I'm trying to grasp what you're getting at here, but I guess I'm
> not seeing it. In the OS I'm intimately familiar with, signals are
> delivered when the thread that's been selected to receive the signal
> returns to user mode. If that's what you mean by "polling point",
> then I suppose it is. However, some processor exception made the signal
> pending (system call, irq/dev driver, bus error, whatever), so I would
> opine that the victim thread will get it PDQ.
>
> In an SMP environment, things are a wee bit trickier, but it seems that
> pitching an interrupt to the CPU that's carrying the target thread would
> do the trick nicely. The interrupt needn't do anything except get into
> the kernel, because the return will deliver the signal.
>

Generally, you don't interrupt threads running on other processors as it
is usually an expensive operation. And if the thread is running I'm
not actually concerned about it, as it is making forward progress. It's
threads that are "runnable", not "running", i.e. they are in the run queue.
Unless they actually run (which is not guaranteed) and execute a signal
polling point, they won't see the signal. I don't really care if the
signal handler is run right away, but I do need to know if the signal
handler trap has been set. There's no way to do that if the target thread
is in the run queue as far as I can tell. And no way to get the target
thread off the run queue into a stopped state unless it executes and
maybe hits a polling point, so that's not an option.

I don't think you can do that in Linux either but I expect that it will
be put in if the project for Linux goes forward. Even if Linux proves
the worth of the technique, I doubt it will go into Solaris, Windows or
any of other unices except AIX and other IBM operating systems. One
thing about the NIH syndrome is that you can pretty much count on it.

Joe Seigh

Casper H.S. Dik

Jul 11, 2004, 6:21:38 PM
Joe Seigh <jsei...@xemaps.com> writes:

>I don't think you can do that in Linux either but I expect that it will
>be put in if the project for Linux goes forward. Even if Linux proves
>the worth of the technique, I doubt it will go into Solaris, Windows or
>any of other unices except AIX and other IBM operating systems. One
>thing about the NIH syndrome is that you can pretty much count on it.

I'm sorry but I must object; if a feature is important and interesting,
improves performance and doesn't break anything we will certainly
look at it.

The fact that you can stuff arbitrary "improvements" in Linux isn't
always an advantage; even the slightest changes of semantics have
broken things in the past, so changing signals to do something
differently is not a path to enter lightly.

But if it's cleanly architected and there are good reasons to
emulate it, then we're not afraid to imitate or copy bits; just
like we don't mind being copied.

But with all the interesting algorithms that I see go by here that
"perform so much better", I'm just wary of the old performance
improvement traps. Make sure to try your utmost to improve the
performance of the algorithm you want to beat. And realize that
performance and human intuition don't go well together.

Casper

Joe Seigh

Jul 11, 2004, 7:20:10 PM

"Casper H.S. Dik" wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> >I don't think you can do that in Linux either but I expect that it will
> >be put in if the project for Linux goes forward. Even if Linux proves
> >the worth of the technique, I doubt it will go into Solaris, Windows or
> >any of other unices except AIX and other IBM operating systems. One
> >thing about the NIH syndrome is that you can pretty much count on it.
>
> I'm sorry but I must object; if a feature is important and interesting,
> improves performance and doesn't break anything we will certainly
> look at it.

That's my impression of development organizations. NIH combined with
self-fulfilling prophecies. "rwlocks aren't needed in the kernel because
they aren't used that much" (actually not at all, since there were no rwlocks
implemented at the time). I heard that once somewhere.

>
> The fact that you can stuff arbitrary "improvements" in Linux isn't
> always an advantage; even the slightest changes of semantics have
> broken things in the past, so changing signals to do something
> differnetly is not a path to enter lightly.

It would probably be an addition to, not modification of, the current
mechanism. Current performance sensitive paths would be untouched
for the most part. But it's really up to whoever does the actual
work.

>
> But if it's cleanly architected and there are good reasons to
> emulate it, then we're not afraid to imitate or copy bits; just
> like we don't mind being copied.
>
> But all the interesting algorithms I see go by here, that
> "perform so much better", I'm just wary of the old performance
> improvement traps. Make sure to try your utmost to improve the
> performance of the algorithm you want to beat. And realize that
> performance and human intuition don't go well together.
>

McKenney has published performance studies of RCU. Anyway, I
try to do "AB" testing to the extent it's possible. I don't
draw conclusions on just "A" testing, or no testing. There are
some who do.

Joe Seigh

Steve Watt

Jul 11, 2004, 10:27:17 PM
In article <40F1A87E...@xemaps.com>,

Joe Seigh <jsei...@xemaps.com> wrote:
>
>Steve Watt wrote:
>>
>> In article <40EFF921...@xemaps.com>,
>> Joe Seigh <jsei...@xemaps.com> wrote:
>> > [ ... ] The "problem" is with the signal delivery mechanism [ ... ]

>>
>> I'm trying to grasp what you're getting at here, but I guess I'm
>> not seeing it. In the OS I'm intimately familiar with, signals are
>> delivered when the thread that's been selected to receive the signal
>> returns to user mode. If that's what you mean by "polling point",
>> then I suppose it is. However, some processor exception made the signal
>> pending (system call, irq/dev driver, bus error, whatever), so I would
>> opine that the victim thread will get it PDQ.
>>
>> In an SMP environment, things are a wee bit trickier, but it seems that
>> pitching an interrupt to the CPU that's carrying the target thread would
>> do the trick nicely. The interrupt needn't do anything except get into
>> the kernel, because the return will deliver the signal.
>
>Generally, you don't interrupt threads running on other processors as it
>is usually an expensive operation. And if the thread is running I'm

It rather depends on the why, and on how you feel about determinism. I
have an RTOS background, after all.

>not actually concerned about it as it is making forward progress. It's
>threads that are "runnable" not "running", i.e. they are in the run queue.

And as soon as they move from the run queue to a processor, the signal
frame will be set up before they get to the next user instruction, at
least on *BSD and LynxOS.

>Unless they actually run (which is not guaranteed) and execute a signal
>polling point, they won't see the signal. I don't really care if the
>signal handler is run right away but I do need to know if the signal
>handler trap has been set. There's no way to do that if the target thread
>is in the run queue as far as I can tell. And no way to get the target
>thread off the run queue into a stopped state unless it executes and
>maybe hits a polling point so that's not an option.

I don't think I'm clear on what you mean by "the signal handler trap has
been set". Do you mean that the next instruction executed will be signal
handler? That's true on some (and I suspect most) UNIXish OSes.

>I don't think you can do that in Linux either but I expect that it will

>be put in if the project for Linux goes forward. [ snip FUD ]

What are you proposing adding, exactly?

Steve

Joe Seigh

unread,
Jul 12, 2004, 8:27:13 AM7/12/04
to

Steve Watt wrote:
>
> In article <40F1A87E...@xemaps.com>,
> Joe Seigh <jsei...@xemaps.com> wrote:

...


> >Generally, you don't interrupt threads running on other processors as it
> >is usually an expensive operation. And if the thread is running I'm
>
> It rather depends on the why, and on how you feel about determinism. I
> have an RTOS background, after all.

In RTOS's they're willing to take the performance trade off for determinism.

>
> >not actually concerned about it as it is making forward progress. It's
> >threads that are "runnable" not "running", i.e. they are in the run queue.
>
> And as soon as they move from the run queue to a processor, the signal
> frame will be set up before they get to the next user instruction, at
> least on *BSD and LynxOS.

...


> I don't think I'm clear on what you mean by "the signal handler trap has
> been set". Do you mean that the next instruction executed will be signal
> handler? That's true on some (and I suspect most) UNIXish OSes.

Basically that. You change the instruction pointer in the thread context
to point to the unix signal handler and save the old one somewhere where
the signal handler can find it. At that point you know the next instruction,
if and when it gets executed, will be the signal handler. There's probably
a way of doing it, some api, but I don't know what it is. kill -9 probably
uses it. It needs to work for a thread in the run queue. Probably involves
getting it off the queue somehow which, depending on how the local processor
run queue is synchronized, could be an expensive operation. Or it may not.
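
For what it's worth, the user-visible half of that bookkeeping shows up
through sigaction() with SA_SIGINFO: the handler gets handed the saved
context, including the old instruction pointer the kernel stashed away.
A rough sketch, assuming Linux/glibc on ia32 (REG_EIP and the gregs layout
are platform-specific assumptions, and none of this says anything about
what the kernel will do for a thread that is merely runnable):

#define _GNU_SOURCE            /* for REG_EIP in <sys/ucontext.h> */
#include <signal.h>
#include <stdio.h>
#include <ucontext.h>

/* SA_SIGINFO handler: the third argument is the saved context of the
 * interrupted thread, including the old instruction pointer saved when
 * the kernel redirected the thread here.
 * (fprintf isn't async-signal-safe; demo only.)
 */
static void handler(int sig, siginfo_t *si, void *ctx)
{
    ucontext_t *uc = (ucontext_t *)ctx;
    (void)si;
    fprintf(stderr, "signal %d, interrupted EIP = %#lx\n",
            sig, (unsigned long)uc->uc_mcontext.gregs[REG_EIP]);
}

static void install_handler(void)
{
    struct sigaction sa;
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);
}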

>
> >I don't think you can do that in Linux either but I expect that it will
> >be put in if the project for Linux goes forward. [ snip FUD ]
>
> What are you proposing adding, exactly?
>

The ability to synchronously cancel RCU read critical sections*. But I'm not
in a position to actually talk any kernel developers into doing it. Someone
else may be working on that for Linux but I don't know the status of that
project or even if it got beyond the proposal stage. I was only involved
in one discussion on it and am no longer involved with it. I was only banging
on Solaris to see if it might work there since the docs weren't real clear
on how the signal delivery mechanism actually works. (Of course they're
always clear after the fact). It doesn't work, so end of story as far
as Solaris is concerned. We'll just have to wait and see if anything
comes out for Linux.

* The ability to cancel RCU critical sections is important in a preemptive
environment because RCU is proxy GC and a non-running low priority thread
can hold up GC. It's basically a form of priority inversion. Cancellation
is a cleaner solution than trying to hack in priority inheritance or something
to force forward progress on the non-running thread.

Joe Seigh

Joe Seigh

unread,
Jul 12, 2004, 12:42:33 PM7/12/04
to

>
> SMR is starting to look a lot more attractive.
>

I may do an SMR implementation just to see how it compares to
RCU. Michael says he has a technique to make it scale with
the number of threads. I'm not sure what that is. Also I don't
know what that array problem is that you're talking about.
Are you using the array for hazard pointers and reallocing
it to add new hazard pointers? In that case, it's easy to
know when all the threads have finished with the old array.

Joe Seigh

Ronald Landheer-Cieslak

unread,
Jul 12, 2004, 1:08:41 PM7/12/04
to
Mind if I butt in? ;)
I've got a working SMR implementation under a GPL/BSD dual license if
you're interested. If you want to test-drive it *and* report back the
results, I'll be happy to write you a GPL waiver so you can use it in
non-free software if you want :)
Basically, that would save you the effort of writing another
implementation and would get me independent testing results, so it looks
like a win-win situation :)

Just a thought :)

rlc

SenderX

unread,
Jul 12, 2004, 4:17:32 PM7/12/04
to
> I may do an SMR implementation just to see how it compares to
> RCU.

I have a working model ( so far ;) of a user-space RCU that is loosely
based on ideas from the previous post in this thread. It's running on Linux
and Windows i686 SMP. Tests are 64 reader threads and 5 writers, working on
a collection that softly fluctuates from 100,000 - 150,000 items. The
algorithm doesn't require that any thread call any special functions to do
an explicit quiescent state shift. However, the algo requires that every
reader thread that calls rcu_read_lock must call rcu_read_unlock...


It implements the following RCU API in user space:

assuming "TSO memory model and ia32"...

1. void call_rcu( rcu_head *pHead, void ( *fpDtor ) ( void* ), void *pState );
- Mine uses a "normal-cas" loop, call_rcu should be non-blocking?

- rcu_heads batch up in a per-thread queue, kept in its own cache line

- The quiescent period detection algo runs in a separate thread that is
controlled by a lock-free stack. The algo fires if it times out, or receives
an explicit request.


2. void rcu_read_lock( void );
- Mine uses a single store, i.e. iValue = 1;


3. void rcu_read_unlock( void );
- Mine uses a single store, and a non-atomic load-store


You can use this user-space RCU API to implement any "mostly-read"
collection, and basically any previous algo that uses the classic RCU API. I
need to test for some more weeks/months, but it looks rock solid!

A lot more details later...
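
Roughly, the header would look like this (the three prototypes are as
above; the rcu_head layout is only a placeholder here, not the real
implementation, which is private):

/* Illustrative header only -- rcu_head internals are a guess. */
#ifndef NP_RCU_H
#define NP_RCU_H

typedef struct rcu_head {
    struct rcu_head *pNext;            /* per-thread batch link        */
    void (*fpDtor)(void *);            /* deferred destructor          */
    void *pState;                      /* passed back to the dtor      */
} rcu_head;

/* Queue an object for deferred destruction; fpDtor(pState) runs after
   a quiescent period has elapsed for every registered reader thread. */
extern void call_rcu(rcu_head *pHead, void (*fpDtor)(void *), void *pState);

/* Bracket a read-side critical region: a plain store on entry, a store
   plus a non-atomic load-store on exit (TSO / ia32 assumed). */
extern void rcu_read_lock(void);
extern void rcu_read_unlock(void);

#endif /* NP_RCU_H */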


P.S.

The experimental user-space RCU should outperform SMR, because if a reader
thread needs to iterate 100,000 items... Using SMR it would need to
update a damn hazard pointer each time; using user-space RCU all it needs
is a single store, plus a membar on cpus that require one.

This is looking very promising...


Joe Seigh

unread,
Jul 12, 2004, 4:35:16 PM7/12/04
to

I don't think my uniprocessor setup would work too well. I'm seeing too many
scheduling artifacts as it is and trying to interpret them is a bit of detective work.
It's more for the fun of it, though; if I do do it I don't think I'll go all out
on the set theoretic operations. You use any set data type libraries or write
your own?

Joe Seigh

Joe Seigh

unread,
Jul 12, 2004, 5:08:10 PM7/12/04
to

SenderX wrote:
>
> > I may do an SMR implementation just to see how it compares to
> > RCU.
>
> I have a working model ( so far ;) of a user-space RCU that is loosely
> based on ideas from the previous post in this thread. It's running on Linux
> and Windows i686 SMP. Tests are 64 reader threads and 5 writers, working on
> a collection that softly fluctuates from 100,000 - 150,000 items. The
> algorithm doesn't require that any thread call any special functions to do
> an explicit quiescent state shift. However, the algo requires that every
> reader thread that calls rcu_read_lock must call rcu_read_unlock...
>
> It implements the following RCU API in user space:
>
> assuming "TSO memory model and ia32"...
>
> 1. void call_rcu( rcu_head *pHead, void ( *fpDtor ) ( void* ), void *pState );
> - Mine uses a "normal-cas" loop, call_rcu should be non-blocking?

You mean the functions that defer work or delay? Not really required
as RCU assumes the exclusive access ops are using a conventional lock
anyway. You have to communicate with the polling thread. I'm using
a mutex and condvar to tell the polling thread there's work. There's
no reason you can't use lock-free to do that also.
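
The shape of the mutex/condvar handoff is roughly this (names are invented
and error handling is omitted; the grace-period wait is just a placeholder
for the polling of the per-thread counters):

#include <pthread.h>
#include <stddef.h>

/* A deferred-destruction request; pNext links a simple batch list. */
typedef struct rcu_work {
    struct rcu_work *pNext;
    void (*fpDtor)(void *);
    void *pState;
} rcu_work;

static pthread_mutex_t poll_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  poll_cond = PTHREAD_COND_INITIALIZER;
static rcu_work *pending;               /* protected by poll_lock */

extern void wait_for_quiescent_period(void);  /* counter polling, assumed
                                                 to live elsewhere */

/* Writer side: hand work to the polling thread and wake it. */
void defer_work(rcu_work *w)
{
    pthread_mutex_lock(&poll_lock);
    w->pNext = pending;
    pending = w;
    pthread_cond_signal(&poll_cond);
    pthread_mutex_unlock(&poll_lock);
}

/* Polling thread: sleep until there's work, wait out a quiescent
   period, then run the deferred destructors for the whole batch. */
void *polling_thread(void *arg)
{
    (void)arg;
    for (;;) {
        rcu_work *batch;

        pthread_mutex_lock(&poll_lock);
        while (pending == NULL)
            pthread_cond_wait(&poll_cond, &poll_lock);
        batch = pending;
        pending = NULL;
        pthread_mutex_unlock(&poll_lock);

        wait_for_quiescent_period();

        while (batch != NULL) {
            rcu_work *w = batch;
            batch = w->pNext;
            w->fpDtor(w->pState);
        }
    }
    return NULL;
}

A lock-free queue could replace the list and the mutex; the condvar is
only there so the polling thread has something to sleep on.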

>
> - rcu_heads batch up in a per-thread queue, kept in its own cache line
>
> - The quiescent period detection algo runs in a separate thread, that is
> controlled by a lock-free stack. The algo fires if it timeouts, or receives
> an explicit request.
>
> 2. void rcu_read_lock( void );
> - Mine uses a single store, i.e. iValue = 1;
>
> 3. void rcu_read_unlock( void );
> - Mine uses a single store, and a non-atomic load-store

You usually want read lock/unlock functions even if there's no code so the
program source code is self-documenting. Are you putting in quiesce points
there? I did that in an alternate version and while there's a little bit of
extra overhead (roughly equivalent to an SMR hazard pointer op) it really
scales well since the more you use it the more quiesce points happen.

I have per-thread quiesce point counters in TSD/TLS and I took the pthread_getspecific
out of the read lock macro. So it's something like

#define rcu_read_lock(node) \
    node->qcount2 = node->qcount += 2; \
    StoreLoadMemBar;

#define rcu_read_unlock(node) \
    LoadStore-StoreStoreMemBar; \
    node->qcount2 = 0;


qcount is initialized to 1 so it's never 0 even if it wraps.

Polling logic goes something like

if (node->last_qcount != node->qcount2) {
    // quiesce point for this node
    ...
    node->last_qcount = node->qcount;  // save last quiesce count for node
}

Which is the hacked equivalent of

#define rcu_read_lock(node) \
    node->qcount++; \
    node->qcount2 = 1; \
    StoreLoadMemBar;

#define rcu_read_unlock(node) \
    LoadStore-StoreStoreMemBar; \
    node->qcount2 = 0;

and the polling logic being

if (node->last_qcount != node->qcount ||  // quiesce point happened
    node->qcount2 == 0)                   // or unlocked
{
    // quiesce point for this node
    ...
    node->last_qcount = node->qcount;  // save last quiesce count for node
}

You probably wouldn't call it qcount2 in this case since it's not a count.
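
Put together with a per-thread node type and a registry, the first (hacked)
version looks roughly like this; the registry handling and the accumulation
of results across polling passes are left out:

/* Per-thread quiesce counters (one per registered reader thread, each in
   its own cache line in practice).  qcount and last_qcount start at 1,
   qcount2 at 0. */
typedef struct rcu_node {
    struct rcu_node *next;        /* registry of reader threads          */
    unsigned long qcount;         /* bumped by 2 per read lock, so it is
                                     never 0 even if it wraps            */
    unsigned long qcount2;        /* == qcount while locked, 0 unlocked  */
    unsigned long last_qcount;    /* last value the poller saved         */
} rcu_node;

/* One polling pass: a node has passed a quiesce point if its count has
   moved on, or if it is currently outside a read section (qcount2 == 0,
   which never equals last_qcount).  A real grace period detector would
   remember, across passes, which nodes have already quiesced instead of
   requiring them all to quiesce in a single pass. */
int poll_once(rcu_node *head)
{
    rcu_node *node;
    int all_quiesced = 1;

    for (node = head; node != NULL; node = node->next) {
        if (node->last_qcount != node->qcount2) {
            /* quiesce point for this node */
            node->last_qcount = node->qcount;
        } else {
            all_quiesced = 0;  /* possibly still in the same read section */
        }
    }
    return all_quiesced;
}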

Joe Seigh

Ronald Landheer-Cieslak

unread,
Jul 12, 2004, 6:04:11 PM7/12/04
to
libmemory was primarily written to support the non-blocking data types
in libcontain, so I guess the answer to your question is that I write my
own :)

As for having a uniprocessor setup: I have a uniprocessor setup as well
(and not a great one at that) but *any* testing is better than just
mine: seeing as I wrote the code in the first place, chances are that I
overlook the majority of the possible bugs that could be hiding in the
woodwork.

For the time being, I am unaware of anyone testing libmemory (or
libcontain) on anything but a uniprocessor setup (though people have
downloaded it without reporting back to me, so I imagine there may be a
dual-processor running libmemory Out There somewhere).

Seriously, any data you could provide would be very welcome :)

Thanks,

rlc

SenderX

unread,
Jul 12, 2004, 9:19:08 PM7/12/04
to
> You mean the functions that defer work or delay?

Yes.


> Not really required
> as RCU assumes the exclusive access ops are using a conventional lock
> anyway.

I thought that call_rcu could be called outside the write side
critical-region. So I guess I could remove the cas in my call_rcu... I guess
there is no use trying to improve the write side throughput, wrt batching
rcu_head's per thread. RCU is all about the readers.


> You have to communicate with the polling thread. I'm using
> a mutex and condvar to tell the polling thread there's work. There's
> no reason you can't use lock-free to do that also.

My polling thread communicates with the user application by firing the
callback in rcu_head's, and by accepting requests to explicitly run a scan.


> You usually want read lock/unlock functions even if there's no code so the
> program source code is self-documenting. Are you putting in quiesce
> points there? I did that in an alternate version and while there's a little
> bit of extra overhead (roughly equivalent to an SMR hazard pointer op) it
> really scales well since the more you use it the more quiesce points happen.

Exactly. It does seem to scale. I was running the writer threads at 1 node
deletion per 10 milliseconds, pretty quick for RCU? The memory was not
flying high and out of control.


> I have per-thread quiesce point counters in TSD/TLS and I took the
> pthread_getspecific out of the read lock macro. So it's something like

Yes. I use pthread_getspecific as well. It can destroy performance ( OS
dependent ). I have a special version of the rcu_read_xxx functions:

/* Call tls once */
pRcu = pthread_getspecific( ... );

for ( ;; )
{
    np_rcu_read_lock( pRcu );

    < iterate the collection >

    np_rcu_read_unlock( pRcu );
}

So you call tls once and reuse the rcu handle many times. Good workaround
for crappy pthread_getspecific implementations.
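
Filled out a bit; np_rcu_register_thread and the opaque handle type are
made-up names here, just to show the shape. The side effect is that a
thread registers itself on first use, so only threads that have actually
done RCU reads ever get polled:

#include <pthread.h>
#include <stddef.h>

/* Opaque per-thread RCU handle; the real type lives inside the RCU
   implementation.  np_rcu_register_thread() is an assumed helper that
   creates and registers one for the calling thread. */
typedef struct np_rcu_thread np_rcu_thread;

extern np_rcu_thread *np_rcu_register_thread(void);
extern void np_rcu_read_lock(np_rcu_thread *);
extern void np_rcu_read_unlock(np_rcu_thread *);

static pthread_key_t  rcu_key;
static pthread_once_t rcu_key_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    pthread_key_create(&rcu_key, NULL);
}

/* Pay the pthread_getspecific cost once per call site, not per read. */
static np_rcu_thread *rcu_self(void)
{
    np_rcu_thread *pRcu;

    pthread_once(&rcu_key_once, make_key);
    pRcu = pthread_getspecific(rcu_key);
    if (pRcu == NULL) {
        pRcu = np_rcu_register_thread();   /* first use: register */
        pthread_setspecific(rcu_key, pRcu);
    }
    return pRcu;
}

void reader_loop(void)
{
    np_rcu_thread *pRcu = rcu_self();      /* TLS lookup happens here */

    for (;;) {
        np_rcu_read_lock(pRcu);
        /* ... iterate the collection ... */
        np_rcu_read_unlock(pRcu);
    }
}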


> Polling logic goes something like
>
> if (node->last_qcount != node->qcount2) {
>     // quiesce point for this node
>     ...
>     node->last_qcount = node->qcount;  // save last quiesce count for node
> }

That is very close to how I poll.

LocalOld = pThread->Old;
LocalNew = pThread->New;

if ( LocalOld != LocalNew )
{
    pThread->Old = LocalNew;

    ++pRCU->QuiesentState;
}


Does IBM have this method patented? Any way to bend around the damn thing?
This user-space RCU is looking too good.


SenderX

unread,
Jul 12, 2004, 9:44:32 PM7/12/04
to
> You usually want read lock/unlock functions even if there's no code so the
> program source code is self-documenting. Are you putting in quiesce
> points there?

Yes. It seems to be a perfect place for a user-space per-thread quiescent
state shift.


Joe Seigh

unread,
Jul 13, 2004, 7:21:30 AM7/13/04
to

Actually it's the worst place as far as RCU is concerned. While the quiesce
states should be frequently and periodically executed, you want fewer of them
than there are read critical sections. That's how you get the "almost free"
read access. However, if for one reason or another you can't get quiesce
points set up that way, then putting them in the read lock/unlock does give
you some of the other advantages of RCU, which are the "lock-free" part and
being strictly read only, i.e. no cache-line contention like you'd get for
synchronization that requires interlocked updates like mutexes or atomic_ptr.

Joe Seigh

Joe Seigh

unread,
Jul 13, 2004, 7:45:01 AM7/13/04
to

SenderX wrote:
>
> Does IBM have this method patented? Anyway to bend around the damn thing?
> This user-space RCU is looking too good.

This particular method? I don't know. I'd have to look at their patents in
detail.

The most general statement of the technique's GC part is to wait until all
threads have gone thru a quiesce point at least once since making the object
unreachable before deleting the object. Unfortunately we didn't put in a
claim this broad in the original patent. That may not have helped anyway, since
a lot of defensive/blocking patents are of the "in conjunction" type. So
we may have had to name every possible kind of event mechanism and still
might have missed a few.

In distributed algorithms for mutual consensus there's usually a "wait until
all threads see some condition" type logic. And there's been a lot of
research in how to implement that kind of logic in a scalable way. Some of
this can be used for "in conjunction claims".

I haven't really read any of the patents in detail. You can start here


5,442,758 5,608,893 6,219,690

Some of these are continuation patents. Look at the patents that reference
these or search on McKenney as the inventor. There may be patents in the
pipeline that don't show up.

Joe Seigh

SenderX

unread,
Jul 13, 2004, 8:53:31 PM7/13/04
to
> > Does IBM have this method patented? Any way to bend around the damn thing?
> > This user-space RCU is looking too good.
>
> This particular method? I don't know. I'd have to look at their patents in
> detail.

The method I came up with is driven by 1 thread and a normal cas. It doesn't
require any information from the OS wrt tracking per-thread quiescent
states.


> The most general statement of the technique's GC part is to wait until all
> threads have gone thru a quiesce point at least once since making the
> object unreachable before deleting the object.

> So we may have had to name every possible kind of event mechanism and still
> might have missed a few.

" is to wait until all threads "

If by all threads, you mean all the threads the OS has running at any one
time, or all of thef cpu's... That would probably make a difference.

My implementation tracks quiescent states on per-process threads that have
executed rcu_read_lock or call_rcu at least one time during their lifetime.
All other threads are not polled for quiescent state.


> In distributed algorithms for mutual consensus there's usually a "wait until
> all threads see some condition" type logic. And there's been a lot of
> research in how to implement that kind of logic in a scalable way. Some of
> this can be used for "in conjunction claims".
>
> I haven't really read any of the patents in detail. You can start here
>
>
> 5,442,758 5,608,893 6,219,690
>
> Some of these are continuation patents. Look at the patents that reference
> these or search on McKenney as the inventor. There may be patents in the
> pipeline that don't show up.

Thank you.


SenderX

unread,
Jul 13, 2004, 9:10:59 PM7/13/04
to
> Actually it's the worst place as far as RCU is concerned. While the quiesce
> states should be frequently and periodically executed, you want fewer of
> them than there are read critical sections.

I was wondering about that, but I could not come up with a better and
equally portable method... ;(

My system requires that threads make a quiescent state change before they
enter an rcu read critical region, and one more right before they exit.

So, luckily for me, the rcu_read_xxx functions were there. Now I can fulfill all
of the requirements of my rcu subsystem, and it seems to be working in
experimental heavy load conditions.

If the rcu_read_xxx functions were not there, my system would not be
compatible with the RCU API.


> However, if for one reason or another you can't get quiesce
> points set up that way,
> then putting them in the read lock/unlock does give
> you some of the other advantages of RCU which is the "lock-free" part and
> being strictly read only, i.e. no cache-line contention like you'd get for
> synchronization that requires interlocked updates like mutexes or atomic_ptr.

Being lock-free and cache friendly is OK with me.

:)


Joe Seigh

unread,
Jul 13, 2004, 10:04:47 PM7/13/04
to

SenderX wrote:
>
> " is to wait until all threads "
>
> If by all threads, you mean all the threads the OS has running at any one
> time, or all of the cpu's... That would probably make a difference.
>
> My implementation tracks quiescent states on per-process threads that have
> executed rcu_read_lock or call_rcu at least one time during their lifetime.
> All other threads are not polled for quiescent state.

You know they're not doing rcu reads. That's a quiescent state of sorts.
Quiescent states are somewhat abstract and you can pretty much come
up with all sorts of mechanisms.

>
> > In distributed algorithms for mutual consensus there's usually a "wait until
> > all threads see some condition" type logic. And there's been a lot of
> > research in how to implement that kind of logic in a scalable way. Some of
> > this can be used for "in conjunction claims".
> >
> > I haven't really read any of the patents in detail. You can start here
> >
> >
> > 5,442,758 5,608,893 6,219,690
> >
> > Some of these are continuation patents. Look at the patents that reference
> > these or search on McKenney as the inventor. There may be patents in the
> > pipeline that don't show up.
>

They seem somewhat generic. I didn't see any specific event types in the
claims. And none of the polling methods used in Linux like combining
trees.

If you do something commercial you should check with IBM first. Basic
business sense. If you are doing something like open source then just
check with McKenney and IBM. IBM would probably give you a license
for open source similar to the one for Linux just to keep control of
the technique. If you did it for Windows, they might have some problems,
I don't know.

Joe Seigh

Joe Seigh

unread,
Jul 13, 2004, 10:09:48 PM7/13/04
to

SenderX wrote:
>
> > Actually it's the worst place as far as RCU is concerned. While the quiesce
> > states should be frequently and periodically executed, you want fewer of
> > them than there are read critical sections.
>
> I was wondering about that, but I could not come up with a better and
> equally portable method... ;(
>
> My system requires that threads make a quiescent state change before they
> enter a rcu read critical region, and one more right before they exit.
>
> So, luckily for me, there was rcu_read_xxx functions. Now I can fulfill all
> of the requirements of my rcu subsystem, and it seems to be working in
> experimental heavy load conditions.

You should check very light load conditions. If a thread does really
infrequent rcu reads then it's going to be a long time between quiesce
points. You want to consider the unlocked state as a quiescent state.

Joe Seigh

SenderX

unread,
Jul 16, 2004, 8:23:06 PM7/16/04
to
> You should check very light load conditions. If a thread does really
> infrequent rcu reads then it's going to be a long time between quiesce
> points. You want to consider the unlocked state as a quiescent state.

Yes. If a thread is unlocked && an "active quiescent period is in progress",
then it's in a quiescent state.

I quickly ran into this issue when I was first messing around with the
algo. Threads would get blocked on a condvar, and some of them had rcu_heads
batched up. The rcu algo did not detect the threads as quiescent because
their generation numbers were a match. I needed to track one more quiescent
state per-thread: the unlocked state.


Joe Seigh

unread,
Jul 17, 2004, 3:28:49 PM7/17/04
to

Performance wise, RCU even in this form looks good. The only thing that performs
consistently as good or better is atomic_ptr but that's on a uniprocessor. There's
a little bit of throttling on the write side but not as bad as I thought it might
be. And of course RCU and atomic_ptr will take stuff that will tank mutexes so
bad that you just ctl-C rather than wait for the mutex testcase to finish.

Joe Seigh

Joe Seigh

unread,
Jul 18, 2004, 12:36:36 PM7/18/04
to

> Performance wise, RCU even in this form looks good. The only thing that performs
> consistently as good or better is atomic_ptr but that's on a uniprocessor. There's
> a little bit of throttling on the write side but not as bad as I thought it might
> be. And of course RCU and atomic_ptr will take stuff that will tank mutexes so
> bad that you just ctl-C rather than wait for the mutex testcase to finish.
>

Now this is what I need to test out user-space RCU on:

http://www.sgi.com/newsroom/press_releases/2004/july/ncsa.html

A 1024-way SMP computer. The apps on this are definitely going to
need something like RCU.

Joe Seigh
