Documentation NPTL

Alex

unread,

Jun 12, 2006, 5:31:07 AM6/12/06

to

Hello,
does anyone know a good source for documentation on NPTL?
The only thing i've found was a document from Ulrich Drepper,
nptl-design.pdf.

Any hint appreciated, thanks
Alex

Joe Seigh

unread,

Jun 12, 2006, 7:35:12 AM6/12/06

to

Alex wrote:
> Hello,
> does anyone know a good source for documentation on NPTL?
> The only thing i've found was a document from Ulrich Drepper,
> nptl-design.pdf.
>

Well, part of it is the futex. The latest version is here
http://people.redhat.com/drepper/futex.pdf

The futex is a bit of a hack and it's always interesting to see
the lengths they will go to in order to keep it.

The source is here
http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/nptl/sysdeps/pthread/?cvsroot=glibc

The kernel thread ids look like pids but they're not really like the
old Linux Threads though some syscalls that take a pid will take a
kernel thread id (not a pthread id) last time I checked. E.g. kill().
There's a gettid syscall which returns the tid.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

dav...@webmaster.com

unread,

Jun 12, 2006, 6:21:21 PM6/12/06

to

Joe Seigh wrote:

> The futex is a bit of a hack and it's always interesting to see
> the lengths they will go to in order to keep it.

I have to disagree. We recently re-wrote our core adaptation layer to
support Linux with futexes, using futexes natively. Not only were the
futex versions significantly smaller than the non-futex versions (and
thus much easier to validate) but their performance was significantly
better as well.

They are frequently misunderstood. All a futex really consists of is an
atomic "if this memory location has this value, sleep until woken (or
for at least a particular amount of time)". Combined with simple atomic
operations, they make the construction of high-performance
synchronization primitives almost trivial.

I ask myself constantly why other operating systems don't support them.

To give you a simple example, take a reader-writer lock. Our best futex
implementation took an average of 50nS of overhead per lock operation
and was 95 lines long. Our best non-futex implementation took an
average of 130nS and was 80 lines long. (P4 SMP)

Most of the actual arguments against them I've seen consist of
variations of "they can be used badly, so they're bad".

DS

Chris Thomasson

unread,

Jun 13, 2006, 4:42:56 PM6/13/06

to

<dav...@webmaster.com> wrote in message
news:1150150881....@c74g2000cwc.googlegroups.com...

>
> Joe Seigh wrote:
>
>> The futex is a bit of a hack and it's always interesting to see
>> the lengths they will go to in order to keep it.
>
> I have to disagree. We recently re-wrote our core adaptation layer to
> support Linux with futexes, using futexes natively. Not only were the
> futex versions significantly smaller than the non-futex versions (and
> thus much easier to validate) but their performance was significantly
> better as well.
>
> They are frequently misunderstood. All a futex really consists of is an
> atomic "if this memory location has this value, sleep until woken (or
> for at least a particular amount of time)". Combined with simple atomic
> operations, they make the construction of high-performance
> synchronization primitives almost trivial.
>
> I ask myself constantly why other operating systems don't support them.

You can "emulate" them fairly eaisly in user-space using CAS and
mutex/condvar:

http://groups.google.com/group/comp.programming.threads/msg/f46ad77b39935c93

http://appcore.home.comcast.net/appcore/src/ac_eventcount_algo1_c.html
http://appcore.home.comcast.net/appcore/include/ac_eventcount_algo1_h.html
http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html

My eventcount design peforms as good as futexs on the lock-free fast-paths,
and "almost" as good on the slow-paths... Futexs are basically "lock-free"
eventcounts anyway...

;)

dav...@webmaster.com

unread,

Jun 13, 2006, 5:26:00 PM6/13/06

to

Chris Thomasson wrote:

> You can "emulate" them fairly eaisly in user-space using CAS and
> mutex/condvar:

One more interlocked operation in the fast path negates any conceivable
benefit in the contended cases.

For example, my futex-based reader-writer lock has one locked operation
in the readlock fast path, one in the readunlock fast path, one in the
writelock fast path, none in the writeunlock fast path. Precisely
because of this, it's faster than most mutexes (and much faster than
glibc pthread mutexes).

For P4 SMP, the cost of a synchronization primitive is basically the
number of locked operations it has.

DS

Chris Thomasson

unread,

Jun 14, 2006, 6:04:29 PM6/14/06

to

<dav...@webmaster.com> wrote in message
news:1150233960....@y43g2000cwc.googlegroups.com...

>
> Chris Thomasson wrote:
>
>> You can "emulate" them fairly eaisly in user-space using CAS and
>> mutex/condvar:
>
> One more interlocked operation in the fast path negates any conceivable
> benefit in the contended cases.

Where is the atomic operation in the fast-path your worried about? The
eventcount example I posted has atomic operation free signaling for the
fast-path.

> For example, my futex-based reader-writer lock has one locked operation
> in the readlock fast path, one in the readunlock fast path, one in the
> writelock fast path, none in the writeunlock fast path. Precisely
> because of this, it's faster than most mutexes (and much faster than
> glibc pthread mutexes).
>
> For P4 SMP, the cost of a synchronization primitive is basically the
> number of locked operations it has.

Indeed. Here is a example rw-spinlock that has no atomic operation in the
write-unlock function on x86:

http://appcore.home.comcast.net/appcore/src/ac_rwspinlock_algo1_c.html
http://appcore.home.comcast.net/appcore/include/ac_rwspinlock_algo1_h.html

Fast-pathed writers do a single XCHG for the lock, and a store with release
semantics for an unlock. You can use the eventcount to do the same thing,
however it won't be a spinlock anymore...

;)

dav...@webmaster.com

unread,

Jun 15, 2006, 12:38:11 AM6/15/06

to

Chris Thomasson wrote:

> <dav...@webmaster.com> wrote in message
> news:1150233960....@y43g2000cwc.googlegroups.com...

> Indeed. Here is a example rw-spinlock that has no atomic operation in the

> write-unlock function on x86:
>
> http://appcore.home.comcast.net/appcore/src/ac_rwspinlock_algo1_c.html
> http://appcore.home.comcast.net/appcore/include/ac_rwspinlock_algo1_h.html
>
> Fast-pathed writers do a single XCHG for the lock, and a store with release
> semantics for an unlock. You can use the eventcount to do the same thing,
> however it won't be a spinlock anymore...
>
> ;)

Futexes are an efficient way for a process to block. Obviously, they
won't help you optimize a spinlock (unless it falls back to blocking).

For example, suppose you want to contruct a basic event that can be
locked or unlocked and releases all locked threads when unlocked.
Assume that this is used to wake any threads waiting for an operation
to finish, such that it's very rare to unlock an event that's already
unlocked or block on the event when it has long since unlocked.

With a futex, the operations are:

1) Lock: atomically set 'locked' to 'true'.
2) Unlock: atomically set 'locked' to 'false', wake futex.
3) Block: block on futex if 'locked' is 'true' (single operation)

In the fast paths, performance is perfect. Neither lock nor unlock
require an interlocked operation; unlock requires one kernel call but
you need that to wake a sleeping thread anyway; block makes one kernel
call but you need that to sleep anyway.

The problems happened when they made the futex implementation so
complex with creeping featurism that it became incomprehensible.

DS

dav...@webmaster.com

unread,

Jun 15, 2006, 3:11:04 AM6/15/06

to

I think you're right. The differences in the fast path were not due to
futexes but implementation differences. However, futexes do make many
circumstances surrounding blocking vastly more efficient.

The key futex operation is 'block if and only if this memory address
has this value, otherwise return immediately'. And it's *extremely*
useful. For example, here's part of a futex-based mutex implementation:

InterlockedInc/Dec inc/decrements with no return value and no special
memory visiblity semantics. InterlockedSet/Get are just normal memory
reads/writes on x86. 's' and 'waiters' are aligned 32-bit values. '1'
is LOCKED, '0' is UNLOCKED.

bool TryLock(void)
{
return InterlockedExchange(&s, LOCKED) == UNLOCKED;
}

void Lock(void)
{
while(!TryLock())
{
InterlockedInc(&waiters);
sys_futex(&s, FUTEX_WAIT, LOCKED, NULL); // block if s==LOCKED
InterlockedDec(&waiters);
}
}

void Unlock(void)
{
InterlockedSet(&s, UNLOCKED);
if(InterlockedGet(&waiters)!=0)
sys_futex(&s, FUTEX_WAKE, 1, NULL); // wake one thread blocked on s
}

On thing I wonder is if Lock is better this way:

void Lock(void)
{
if(TryLock()) return;
InterlockedInc(&waiters);
do
sys_futex(&s, FUTEX_WAIT, LOCKED, NULL); // block if s==LOCKED
while(!TryLock());
InterlockedDec(&waiters);
}

Can you implemented cleaner, more efficient mutexes?

DS

Chris Thomasson

unread,

Jun 15, 2006, 9:53:16 PM6/15/06

to

<dav...@webmaster.com> wrote in message
news:1150355464.1...@c74g2000cwc.googlegroups.com...

I am not really worried about mutex performance as much as I am worried
about rw-mutex performance. You can't really get rid of the nasty memory
barriers in your mutex, or any other mutex. However, you can get rid of them
in rw-mutexs. Lock-free reader patterns coupled with zero-overhead lock-free
proxy garbage collection are the way to go.

As for futexs, well, futexs are tricky:

http://people.redhat.com/~drepper/futex.pdf

Yikes! They should have used a parking interface with a waiters-bit... Alex
should like that.

;)

> more efficient mutexes?

There is always Petersons algorithm:

http://groups.google.com/group/comp.programming.threads/msg/1e45b4b16bad9784

Try to augment this with a waitset...

:)

Chris Thomasson

unread,

Jun 15, 2006, 10:03:49 PM6/15/06

to

> There is always Petersons algorithm:
>
> http://groups.google.com/group/comp.programming.threads/msg/1e45b4b16bad9784

IIRC, the old server program I was using to experiment with this algorithm
reaped some fairly major benefits (throughput and scalability) from the fact
that the lock prefix is not used anywhere in the algorithm...

dav...@webmaster.com

unread,

Jun 15, 2006, 11:00:37 PM6/15/06

to

Chris Thomasson wrote:

> I am not really worried about mutex performance as much as I am worried
> about rw-mutex performance. You can't really get rid of the nasty memory
> barriers in your mutex, or any other mutex. However, you can get rid of them
> in rw-mutexs. Lock-free reader patterns coupled with zero-overhead lock-free
> proxy garbage collection are the way to go.

You think it's possible to create reader-writer mutexes where the
'readlock' and 'readunlock' operations have no memory barriers or
interlocked operations (assuming no write locks acquired or pending,
just in the fast path) on platforms with atomic 32-bit read/write
operations? That I'd like to see.

I think in my other post, I completely forgot to make my main main
point. My main point was that converting, say a pure spin reader-writer
lock to a blocking one tends to increase the cost of the unblocking
operations. If all the acquiring threads are spinning, simply setting a
value is sufficient to get them going, but if they are blocked, you
must test for blocked threads and if so make a kernel call. Futexes
make this painless and efficient.

DS

Chris Thomasson

unread,

Jun 16, 2006, 4:27:11 AM6/16/06

to

<dav...@webmaster.com> wrote in message
news:1150426837....@y41g2000cwy.googlegroups.com...

>
> Chris Thomasson wrote:
>
>> I am not really worried about mutex performance as much as I am worried
>> about rw-mutex performance. You can't really get rid of the nasty memory
>> barriers in your mutex, or any other mutex. However, you can get rid of
>> them
>> in rw-mutexs. Lock-free reader patterns coupled with zero-overhead
>> lock-free
>> proxy garbage collection are the way to go.
>
> You think it's possible to create reader-writer mutexes where the
> 'readlock' and 'readunlock' operations have no memory barriers or
> interlocked operations (assuming no write locks acquired or pending,
> just in the fast path) on platforms with atomic 32-bit read/write
> operations? That I'd like to see.

You can accomplish this by taking advantage of lock-free proxy garbage
collection, and dependant load ordering which is supported by every
processor out there, except the DEC Alpha. Read access is accomplished by
simply acquiring reference to a proxy collector object. This operation can
be as simple as this:

#if ! defined (Alpha)
#define dependant_load(x) (*x)
#else
dependant_load(x) {
myx = *x;
rmb
return myx;
}
#endif

static void *proxy;

// acquire read access
void *myproxy = dependant_load(&proxy);

// release read access. GC polling algorithm ensures release semantics.
myproxy = 0;

I have a patent on a proxy collector that can do this and of course there is
Joe Seighs RCU-SMR. We have discussed many aspects of this "virtually
zero-overhead" type of proxy garbage collection in the past. It basically
involves amortizing memory barrier usage by segregating them into specific
epochs in time.

I have even implemented my collector using POSIX mutexs. I can achieve
memory barrier free reads by applying the periodic/episodic sync pattern
that my collector is based on. I have discussed the basics of the pattern
here many times. Here is an example of how my invention can use POSIX mutexs
to implement a 100% "lock-free" reader safe write operation:

http://groups.google.com/group/comp.programming.threads/msg/b7814ee86cd4b54d

All of the "new" lock-free and zero-overhead stuff is patented or has
patents pending. The main reason I patented my stuff was so that I could
actually use it. Patent activity in the area is active. So If you plan on
using lock-free commercially chances are that you will have to acquire the
appropriate licenses.

> I think in my other post, I completely forgot to make my main main
> point. My main point was that converting, say a pure spin reader-writer
> lock to a blocking one tends to increase the cost of the unblocking
> operations. If all the acquiring threads are spinning, simply setting a
> value is sufficient to get them going, but if they are blocked, you
> must test for blocked threads and if so make a kernel call. Futexes
> make this painless and efficient.

Yeah, futexs do come in handy. That is why I created my *portable eventcount
implmentation. I can use the lock-free eventcount as a futex.

* I say portable here because I can use the eventcount on any hardware/os
combination that supports normal CAS and pthread mutex/condvars.

;)

BTW, IIRC, Alex Terekhov pointed out some race-conditions in a linux futex
implementation. Something in the futex waitqueue... Speaking of waitqueues,
what do you think about futexs with waiters-bit and parking interface?
Something like this:

int futex_wait( int *futex, int futex_cmp )
{
register int cmp, old;

pin_memory_page( futex );
lock_wait_queue( futex );

old = *futex;

do
{
if ( ( old & ~WAITERS_BIT ) != futex_cmp && there_are_no_waiters )
{ goto bye; }

cmp = old;

old = atomic_cas( futex, old | WAITERS_BIT, cmp );

} while ( cmp != old );

if ( ( old & ~WAITERS_BIT ) == futex_cmp )
{
enqueue_us;
}

bye:

unlock_wait_queue( futex );
unpin_memory_page( futex );

if ( we_enqueued )
{
wait_on_gate( ... );
}

return 0;