Linux futex

Joe Seigh

unread,

Oct 11, 2003, 12:02:51 PM10/11/03

to

I was looking around at what they're up to with futexes
in Linux. It appears that they will allow futexes to
have timeouts, etc... But I don't see any evidence that
they have proper timeout logic. Anyone know if the
linux futex people have a clue or not?

Joe Seigh

Alexander Terekhov

unread,

Oct 11, 2003, 12:11:10 PM10/11/03

to

Same thoughts here. Futex stuff is basically brain-damaged.

And, it appears that the general idea is also somewhat silly.

regards,
alexander.

Joe Seigh

unread,

Oct 11, 2003, 1:08:59 PM10/11/03

to

Alexander Terekhov wrote:
>
> Joe Seigh wrote:
> >
> > I was looking around at what they're up to with futexes
> > in Linux. It appears that they will allow futexes to
> > have timeouts, etc... But I don't see any evidence that
> > they have proper timeout logic. Anyone know if the
> > linux futex people have a clue or not?
>
> Same thoughts here. Futex stuff is basically brain-damaged.

The kernel mailing list doesn't seem to be the place to hold
reasoned discussions. Too much background noise.

>
> And, it appears that the general idea is also somewhat silly.
>

Avoiding kernel syscalls?

Joe Seigh

Alexander Terekhov

unread,

Oct 11, 2003, 1:56:35 PM10/11/03

to

Joe Seigh wrote:
[...]

> > And, it appears that the general idea is also somewhat silly.
> >
>
> Avoiding kernel syscalls?

No. That's okay. I simply don't know how to make a futex based
*realtime* condvar. Did you try to make a futex based sema with
timeouts and cancels? (please assume that timeout on a futex
can consume a signal ;-) )

regards,
alexander.

Joe Seigh

unread,

Oct 11, 2003, 3:09:59 PM10/11/03

to

Timeouts on a semaphore should not consume a signal. Condvars are
different, you could implement it either way but you have to
document which way it is.

I'm not sure what you mean by "realtime" condvars. Condvars are
certainly not performance daemons and of course "realtime" has
almost nothing to do with performance. Condvars have a built in
bottleneck by definition. Actually, posix in general has painted
itself in a corner performance wise, and they're pretty defensive
about that corner as being the right place to be.

Joe Seigh

Alexander Terekhov

unread,

Oct 11, 2003, 5:00:10 PM10/11/03

to

Joe Seigh wrote:
[...]

> Timeouts on a semaphore should not consume a signal.

Of course. Now, what about timeouts on a futex? ;-)

> Condvars are
> different, you could implement it either way but you have to
> document which way it is.
>

> I'm not sure what you mean by "realtime" condvars. ...

http://www.redhat.com/archives/phil-list/2003-March/msg00151.html

regards,
alexander.

Joe Seigh

unread,

Oct 12, 2003, 1:28:26 PM10/12/03

to

Alexander Terekhov wrote:
>
> Joe Seigh wrote:
> [...]
> > Timeouts on a semaphore should not consume a signal.
>
> Of course. Now, what about timeouts on a futex? ;-)

Same thing for futex semaphores.

>
> > Condvars are
> > different, you could implement it either way but you have to
> > document which way it is.
> >
> > I'm not sure what you mean by "realtime" condvars. ...
>
> http://www.redhat.com/archives/phil-list/2003-March/msg00151.html
>

A waitset should be a single kernel object which knows the scheduling policy.
If you implement your own waitsets using multiple kernel synchronization
objects, you are going to have problems, so don't do it that way.

Joe Seigh

unread,

Oct 14, 2003, 5:01:29 PM10/14/03

to

I just found the futex user space code today. It's beyond
brain-damaged. All that kernel rewrite was for no good
reason at all.

Joe Seigh

Eric Dumazet

unread,

Oct 15, 2003, 1:56:42 AM10/15/03

to

"Joe Seigh" <jsei...@xemaps.com> a écrit dans le message de
news:3F8C66F1...@xemaps.com...

> I just found the futex user space code today. It's beyond
> brain-damaged. All that kernel rewrite was for no good
> reason at all.

So, it took sooo much time for you to 'discover' this code.

And of course, you are able to give a final word on it : bullshit ? I never
saw some requests/comments from you on the NPTL mailing list.

I'm using futexes in a real application, and it performs much better than
the existing alternatives.

>
> Joe Seigh

Eric Dumazet

Casper H.S. Dik

unread,

Oct 15, 2003, 6:53:54 AM10/15/03

to

"Eric Dumazet" <da...@cosmosbay.com> writes:

Shouldn't they just have improved the implementation of mutexes
instead? For portability, improving existing applications, perhaps?

(Any API with "fast" in its name is inherently suspect, I think)

OTOH, the documentation I found claims that it's a building block
for other things and it very much looks like how the underpinnings
of a pthread_mutex_t would ordinarily work in a decent implementation
(which would then perform on the same order as a "futex")

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Joe Seigh

unread,

Oct 15, 2003, 7:03:24 AM10/15/03

to

Eric Dumazet wrote:
>
> "Joe Seigh" <jsei...@xemaps.com> a écrit dans le message de
> news:3F8C66F1...@xemaps.com...
> > I just found the futex user space code today. It's beyond
> > brain-damaged. All that kernel rewrite was for no good
> > reason at all.
>
> So, it took sooo much time for you to 'discover' this code.

It wasn't for lack of trying. And it appears that the code I found,
the part that runs in user space, isn't even official code. Apparently
they assume that there is only one way to do a futex and it must be
obvious how it works, so why document it. And I kind of assumed
there was only one way to do a futex and that kernel developers and
thread library implementers are competent, so why check otherwise.

>
> And of course, you are able to give a final word on it : bullshit ? I never
> saw some requests/comments from you on the NPTL mailing list.

I don't recall being on the NPTL mailing list.

>
> I'm using futexes in a real application, and it performs much better than
> the existing alternatives.

I've seen significant performance gains from futexes on win32 and it didn't
require a rewrite of the win32 kernel.

Joe Seigh

Loic Domaigne

unread,

Oct 15, 2003, 10:41:21 AM10/15/03

to

Hello guys!

> I've seen significant performance gains from futexes on win32 and it didn't
> require a rewrite of the win32 kernel.

If M$ decides to implement futex in the win32 kernel, they could call it...
Mutex for example! Which would stand of course for "Microsoft futex" ;-)

Regards,
Loic.
--
Article posté via l'accès Usenet http://www.mes-news.com
Accès par Nnrp ou Web

Alexander Terekhov

unread,

Oct 15, 2003, 11:24:41 AM10/15/03

to

Loic Domaigne wrote:
>
> Hello guys!
>
> > I've seen significant performance gains from futexes on win32 and it didn't
> > require a rewrite of the win32 kernel.
>
> If M$ decides to implement futex in the win32 kernel, they could call it...
> Mutex for example! Which would stand of course for "Microsoft futex" ;-)

Just for fun, you might want to read this:

http://www.wintellect.com/resources/newsletters/2003/feb2003.aspx

regards,
alexander.

P.S. Follow the embedded links. ;-)

Joe Seigh

unread,

Oct 15, 2003, 12:40:51 PM10/15/03

to

Loic Domaigne wrote:
>
> Hello guys!
>
> > I've seen significant performance gains from futexes on win32 and it didn't
> > require a rewrite of the win32 kernel.
>
> If M$ decides to implement futex in the win32 kernel, they could call it...
> Mutex for example! Which would stand of course for "Microsoft futex" ;-)
>

Except you usually don't implement futexes in the kernel, they can be done
entirely in user space. Actually, Microsoft already has a futex. It's
called CriticalSection. There are many ways to implment futexes. Some
are better than others and it depends on what underlaying kernel synchronization
primatives you have to work with.

This isn't the first time something like this (screwing up the kernel) happened.
Around 89 or so, IBM needed to implement a event synchronization object (like
an ecb) for virtual machines to wait on other virtual machines. They wanted
it to have a fast path though, so they started working out one ugly api and
kernel hack (kernel objects existing in user memory, much like linux futex).
I figured out a cleaner way to do it without doing that. You basically come
up a "recursive" definition for event object in the form of a FSM (Mealy machine).
However the team doing the actual work got behind schedule and was too busy to
listen at that point. So I submitted it as an invention disclosure (KI920112)
to document it. Yes, IBM doesn't even know what IBM knows. I won't even mention
what disclosure KI92069 was for.

Joe Seigh

unread,

Oct 15, 2003, 12:43:06 PM10/15/03

to

Alexander Terekhov wrote:
>
> Loic Domaigne wrote:
> >
> > Hello guys!
> >
> > > I've seen significant performance gains from futexes on win32 and it didn't
> > > require a rewrite of the win32 kernel.
> >
> > If M$ decides to implement futex in the win32 kernel, they could call it...
> > Mutex for example! Which would stand of course for "Microsoft futex" ;-)
>
> Just for fun, you might want to read this:
>
> http://www.wintellect.com/resources/newsletters/2003/feb2003.aspx
>

Well, he's aware of the scheduling fairness problem. He just doesn't
know how to deal with it.

Joe Seigh

Eric Dumazet

unread,

Oct 15, 2003, 12:55:29 PM10/15/03

to

"Joe Seigh" <jsei...@xemaps.com> a écrit dans le message de

news:3F8D2C4B...@xemaps.com...

> It wasn't for lack of trying. And it appears that the code I found,
> the part that runs in user space, isn't even official code. Apparently
> they assume that there is only one way to do a futex and it must be
> obvious how it works, so why document it. And I kind of assumed
> there was only one way to do a futex and that kernel developers and
> thread library implementers are competent, so why check otherwise.

Instead of being vague, why dont you give pointers to the code you looked at
?

Maybe you found some code written by a student, or whatever...

Documentation of futex is old stuff. google for 'futex manpage' gives it if
you care.

http://homepages.cwi.nl/~aeb/linux/man2html/man4/futex.4.html

If you want to see a good example of futex uses, just look at glibc (current
CVS) sources, since NPTL is part of it.
pthread_mutex are build on top of futexes, instead of costly
signal()/sigsuspend()/kill() functions.

With the help of compilers, locking/unlocking a mutex is done with 2
instructions in fast path, inlined in the calling function.

> I don't recall being on the NPTL mailing list.

As futexes were designed with the help/ask for features asked by members of
this list, it is a good place to find some help/infos.

>
> I've seen significant performance gains from futexes on win32 and it
didn't
> require a rewrite of the win32 kernel.

futexes were added to linux, like many other new features. Linux was not
rewritten for futexes.
One can even build a linux kernel without futexes if he wants to save some
kernel space.

BTW, In dont know what you are calling futexes on win32. For sure that's not
the same thing. linux futexes are not win32 futexes.

Eric Dumazet

Alexander Terekhov

unread,

Oct 15, 2003, 1:14:25 PM10/15/03

to

Eric Dumazet wrote:
[...]

> With the help of compilers, locking/unlocking a mutex is done with 2
> instructions in fast path, inlined in the calling function.

Can you post some pseudo-code illustrating how NPTL's mutexes work?

regards,
alexander.

Joe Seigh

unread,

Oct 15, 2003, 1:50:57 PM10/15/03

to

Look in the ftp directory listed in that man page for a futex-2.2.tar.gz.
That's where I got the examples I looked at.

Sounds like they're pretty well convinced that the only way to implement
a fast mutex was by hacking the kernel.

Joe Seigh

Eric Dumazet

unread,

Oct 15, 2003, 4:07:20 PM10/15/03

to

"Alexander Terekhov" <tere...@web.de> a écrit dans le message de
news:3F8D8071...@web.de...

A 'simple' mutex can be implemented with one futex (that is a 32 bits word):
By convention , a 0 value means : free mutex

The i486 fast path to lock the mutex is :

mov $1,%eax
lock xaddl %eax,futex
testl %eax,%eax
jne out_of_line_lock_code

To unlock the mutex:

lock subl $1,futex
jne out_of_line_unlock_code

The linker places 'out_of_line' code out of the fastpath zone, to keep
icache footprint small.

The test are usually not taken, and destination address greater than current
EIP : CPU predicts correctly the branch not taken.

out_of_line code calls the futex syscalls, with FUTEX_WAIT and FUTEX_WAKE
ops

>
> regards,
> alexander.

Joe Seigh

unread,

Oct 15, 2003, 5:22:49 PM10/15/03

to

There will be a quiz to see if you've figured out how this works, Alexander.

1) What possible benefit they have from doing it this way.
2) Why futexes can only be SCHED_OTHER.

Answers to 1 and 2 are the same.

Joe Seigh

SenderX

unread,

Oct 16, 2003, 3:44:34 AM10/16/03

to

> I've seen significant performance gains from futexes on win32 and it
didn't
> require a rewrite of the win32 kernel.

futex's are basically lock-free semaphores?

The fast semaphore code that we posted can easily be used as a fast mutex...

http://appcore.home.comcast.net/pthreads/lfsema.cpp

http://appcore.home.comcast.net/pthreads/lfsema.h

or

http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6685

This really required a Linux kernel rewrite?

:/

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net

Alexander Terekhov

unread,

Oct 16, 2003, 4:58:51 AM10/16/03

to

Eric Dumazet wrote:
[...]

> A 'simple' mutex can be implemented with one futex (that is a 32 bits word):
> By convention , a 0 value means : free mutex

Hmm. I understand this bit. As for the rest, I'm asking you to
show *complete* illustration of NPTL's mutex trylock, timedlock,
lock, and unlock calls. And, BTW, if you can, use the ISO BASIC
(for the illustration, I mean), please.

regards,
alexander.

Joe Seigh

unread,

Oct 16, 2003, 7:10:49 AM10/16/03

to

Part of the logic is in the kernel which operates on the same 32 bit word that
the user space code does. You can infer some of the logic from the man page.

1 means a free mutex. To acquire it, you decrement it. If the result is
0 you've acquired the mutex, less than zero, contention and you have to
call into the kernel.

Trylock might require compare and swap logic which I don't think is available
on 386, but if you are actually using a 386 then you obviously don't care
about performance so why are you using a futex. Timeouts are handled in the
kernel, not user spaece.

Releasing the lock involves incrementing the futex. If the result is 1,
you're done, otherwise there is contention. Here's where things get a
little messed up. You should just go into the kernel to wake up the
next lock waiter, but the code changes the lock to 1 so that some other
running thread can acquire the lock while the first thread is transits
into the kernel (woo! that's a bit window). Once in the kernel, the
code verifies that there are still waiters and reattempts to acqire
the futex on behalf on one of the waiters (there's your latency window)
unless it's gone completely darwinian in which case it just wakes up
a thread and leaves it to its own devices (I don't think it does this
since I don't see a retry loop in user space).

Any way, the setting the futex to 1 (free) when there are other waiters
breaks scheduling and makes futexes SCHED_OTHER.

Since the kernel operates directly on userspace data with interlocked
instructions, changes had to be done to accommodate this since typically
the kernel only copies to and from userspace using specialized routines.

This is all just a guess since I haven't actually seen the kernel code.

Joe Seigh

unread,

Oct 16, 2003, 7:34:46 AM10/16/03

to

SenderX wrote:
>
> > I've seen significant performance gains from futexes on win32 and it
> didn't
> > require a rewrite of the win32 kernel.
>
> futex's are basically lock-free semaphores?
>
> The fast semaphore code that we posted can easily be used as a fast mutex...
>
> http://appcore.home.comcast.net/pthreads/lfsema.cpp
>
> http://appcore.home.comcast.net/pthreads/lfsema.h
>
> or
>
> http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6685
>
> This really required a Linux kernel rewrite?
>

Apparently, if you didn't know any better. One of the dangers of
having the ability to modify the kernel is that you don't try
hard enough to look for other less intrusive solutions. They're
going to have a hard time living this one down. We'll see how
cool they are anyway.

Joe Seigh

unread,

Oct 16, 2003, 7:53:43 AM10/16/03

to

> Trylock might require compare and swap logic which I don't think is available
> on 386, but if you are actually using a 386 then you obviously don't care
> about performance so why are you using a futex. Timeouts are handled in the
> kernel, not user spaece.
>

Maybe not. Just don't go into the kernel after doing the decrement. The
kernel has to handle late (getting into the kernel) waiters and a trylock
would look like a really late waiter.

Joe Seigh

Alexander Terekhov

unread,

Oct 16, 2003, 8:27:15 AM10/16/03

to

Joe Seigh wrote:
[...]

> This is all just a guess since I haven't actually seen the kernel code.

I've seen the kernel code (futex.c). I didn't really like what I saw
there. FWIW, here's my take on "reverse engineering" The Linux Futex
beast (stuff like "FD" and signals aside; and just ignore stolen_sig
for now ;-) ):

http://www.terekhov.de/DESIGN-futex-CV.cpp

//*** futex emulator
class futex {

//*** unimplemented since it's non-copyable/non-copy-constructible
futex(const futex &);
futex & operator=(const futex &);

//*** RAII helper [used inside {timed}wait() only]
struct waiter {

waiter(waiter * & queue ) : next(queue), pqueue(&queue) {
queue = this;
}

~waiter() {
if ( !is_reset() ) { // timedout/canceled
while (*pqueue && this != *pqueue)
pqueue = &(*pqueue)->next;
assert(this == *pqueue);
*pqueue = next;
}
}

waiter * reset() {
waiter * w = next;
next = this;
return w;
}

bool is_reset() {
return this == next;
}

waiter * next;
waiter * * pqueue;

} * m_queue;
int m_value;
mutex mutable m_mutex;
condvar m_condvar;

public:

futex() :
m_queue(0), m_value(0), m_mutex(), m_condvar() {
}

~futex() {
assert(!m_queue);
}

operator int() const {
mutex::guard guard(m_mutex);
return m_value;
}

futex & operator=(int value) {
mutex::guard guard(m_mutex);
m_value = value;
return *this;
}

futex & operator++() {
mutex::guard guard(m_mutex);
++m_value;
return *this;
}

futex & operator--() {
mutex::guard guard(m_mutex);
--m_value;
return *this;
}

// ... other atomic ops ...

void wait(int value) {
mutex::guard guard(m_mutex);
if (value == m_value) {
waiter this_waiter(m_queue);
do { m_condvar.wait(guard); }
while (!this_waiter.is_reset());
}
}

bool timedwait(int value, const timespec & abstime, bool & stolen_sig) {
mutex::guard guard(m_mutex);
if (value == m_value) {
waiter this_waiter(m_queue);
do {
if (m_condvar.timedwait(guard, abstime)) {
// timestamps aside for a moment...
stolen_sig = this_waiter.is_reset();
return true;
}
} while (!this_waiter.is_reset());
}
return stolen_sig = false;
}

void wake(unsigned int wakeups) {
mutex::guard guard(m_mutex);
while (m_queue && wakeups--)
m_queue = m_queue->reset();
m_condvar.broadcast();
}

}; //*** class futex

regards,
alexander.

Eric Dumazet

unread,

Oct 16, 2003, 9:53:56 AM10/16/03

to

"Alexander Terekhov" <tere...@web.de> a écrit dans le message de

news:3F8E5DCB...@web.de...

>
> Hmm. I understand this bit. As for the rest, I'm asking you to
> show *complete* illustration of NPTL's mutex trylock, timedlock,
> lock, and unlock calls. And, BTW, if you can, use the ISO BASIC
> (for the illustration, I mean), please.

Well, I didnt wrote the thing, so I looked at the NPTL sources. This is C
language.

As I dont know BASIC myself, I let you do the conversion.

The futex based mutexes are optimized in the sense that only one syscall is
done in case of contention.

>
> regards,
> alexander.

See you
Eric

Alexander Terekhov

unread,

Oct 17, 2003, 3:53:34 AM10/17/03

to

Eric Dumazet wrote:
[...]

> Well, I didnt wrote the thing, so I looked at the NPTL sources. This is C
> language.

It's GNU-C dialect.

>
> As I dont know BASIC myself, I let you do the conversion.
>
> The futex based mutexes are optimized in the sense that only one syscall is
> done in case of contention.

Here's the "non-generic" NPTL (just in case: licensed under LGPL) mutex stuff:

static inline int
__attribute__ ((always_inline))
__lll_mutex_trylock (int *futex)
{
return atomic_compare_and_exchange_val_acq (futex, 1, 0) != 0;
}
#define lll_mutex_trylock(futex) __lll_mutex_trylock (&(futex))

#define lll_mutex_timedlock(lock, abstime) \
({ \
int *__futex = &(lock); \
int __val = 0; \
if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex, 1, 0),\
0) != 0) \
__val = __lll_timedlock_wait (__futex, abstime); \
__val; \
})

int
__lll_timedlock_wait (int *futex, const struct timespec *abstime)
{
/* Reject invalid timeouts. */
if (abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000)
return EINVAL;

do
{
struct timeval tv;
struct timespec rt;

/* Get the current time. */
(void) __gettimeofday (&tv, NULL);

/* Compute relative timeout. */
rt.tv_sec = abstime->tv_sec - tv.tv_sec;
rt.tv_nsec = abstime->tv_nsec - tv.tv_usec * 1000;
if (rt.tv_nsec < 0)
{
rt.tv_nsec += 1000000000;
--rt.tv_sec;
}

/* Already timed out? */
if (rt.tv_sec < 0)
return ETIMEDOUT;

/* Wait. */
int oldval = atomic_compare_and_exchange_val_acq (futex, 2, 1);
if (oldval != 0)
lll_futex_timed_wait (futex, 2, &rt);
}
while (atomic_compare_and_exchange_bool_acq (futex, 2, 0) != 0);

return 0;
}

#define lll_mutex_lock(lock) \
(void) ({ \
int *__futex = &(lock); \
if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex, 1, 0),\
0) != 0) \
__lll_lock_wait (__futex); \
})

void
__lll_lock_wait (int *futex)
{
do
{
int oldval = atomic_compare_and_exchange_val_acq (futex, 2, 1);
if (oldval != 0)
lll_futex_wait (futex, 2);
}
while (atomic_compare_and_exchange_bool_acq (futex, 2, 0) != 0);
}

#define lll_mutex_unlock(lock) \
((void) ({ \
int *__futex = &(lock); \
int __val = atomic_exchange_rel (__futex, 0); \
if (__builtin_expect (__val > 1, 0)) \
lll_futex_wake (__futex, 1); \
}))

Can you see some problem(s) here?

Now, condvars aside, here's the NPTL counting sema stuff:

int
__new_sem_wait (sem_t *sem)
{
/* First check for cancellation. */
CANCELLATION_P (THREAD_SELF);

int *futex = (int *) sem;
int val;
int err;

do
{
if (*futex > 0)
{
val = atomic_decrement_if_positive (futex);
if (val > 0)
return 0;
}

/* Enable asynchronous cancellation. Required by the standard. */
int oldtype = __pthread_enable_asynccancel ();

err = lll_futex_wait (futex, 0);

/* Disable asynchronous cancellation. */
__pthread_disable_asynccancel (oldtype);
}
while (err == 0 || err == -EWOULDBLOCK);

__set_errno (-err);
return -1;
}

int
__new_sem_post (sem_t *sem)
{
int *futex = (int *) sem;
int err, nr;

nr = atomic_exchange_and_add (futex, 1);
err = lll_futex_wake (futex, nr + 1);
if (__builtin_expect (err, 0) < 0)
{
__set_errno (-err);
return -1;
}
return 0;
}

regards,
alexander.

Joe Seigh

unread,

Oct 17, 2003, 6:46:14 AM10/17/03

to

Alexander Terekhov wrote:
>
> Eric Dumazet wrote:
> [...]
> > Well, I didnt wrote the thing, so I looked at the NPTL sources. This is C
> > language.
>
> It's GNU-C dialect.
>
> >
> > As I dont know BASIC myself, I let you do the conversion.
> >
> > The futex based mutexes are optimized in the sense that only one syscall is
> > done in case of contention.
>
> Here's the "non-generic" NPTL (just in case: licensed under LGPL) mutex stuff:
>

What's with the 2's? And you didn't tell us how your scheme works so we can't
really tell you if it's doing what it's supposed to be doing.

Also, absolute timeouts are easier to work with than relative timeouts. If your
code preempts during your time calculations, it's going to throw things off a bit.
It can happen beforehand to the caller in his code but that's ok, it's his problem.
(there's a way around this problem but that's outside the scope of this discussion.)

And, unless you are getting the checking for free, checking for timeouts when
timeout is going to be checked later anyway is false optimization. The percent
of times where you would catch this is ~0, so 99.999...% of the time you are
wasting cycles doing the check. There are people who will call your code
with expired timeouts, but do you want to reward those idiots?

Joe Seigh

Alexander Terekhov

unread,

Oct 17, 2003, 7:11:06 AM10/17/03

to

Joe Seigh wrote:
[...]

> What's with the 2's?

2 indicates contention to the mutex owner; he calls futex_wake(), then.
There seems to be a problem here with respect to timedout waiters...
because they CAN consume a signal under this scheme no matter whether
timeouts on a futex are allowed to do it or not. Oder? (And I know that
you know how to fix it, Joe ;-) )

> And you didn't tell us how your scheme works so we can't
> really tell you if it's doing what it's supposed to be doing.

This is not my scheme, to begin with. And no one told me how this scheme
works either. Open source, you know.

>
> Also, absolute timeouts are easier to work with than relative timeouts.

Agreed.

regards,
alexander.

Joe Seigh

unread,

Oct 17, 2003, 7:51:30 AM10/17/03

to

Alexander Terekhov wrote:
>
> Joe Seigh wrote:
> [...]
> > What's with the 2's?
>
> 2 indicates contention to the mutex owner; he calls futex_wake(), then.
> There seems to be a problem here with respect to timedout waiters...
> because they CAN consume a signal under this scheme no matter whether
> timeouts on a futex are allowed to do it or not. Oder? (And I know that
> you know how to fix it, Joe ;-) )

I think that Linux futex uses negative values to indicate contention. And
actually that's not really true since the futex can be temporarily set to
1 (free). The real indication of contention is whether there are waiters
in the kernel's wait set and only the kernel knows that.

Timeouts are a little strange. You have to be aware that calling with
an expired timeout or relative timeout of zero can be used as a
conditional acquire of the lock. See posix sem_timedwait() manpage
for instance.

And that conditional try stuff has to be be revisited. With all this
fast path stuff, sometimes the only definitive answer is by going
into the kernel but you want to avoid that. Time to play with the
semantics a little which shouldn't be a problem since the caller
has to handle the "miss is as good as a mile" race conditions.

Joe Seigh

Alexander Terekhov

unread,

Oct 17, 2003, 8:14:02 AM10/17/03

to

Joe Seigh wrote: ...

I meant that

: /* Already timed out? */

: if (rt.tv_sec < 0)
: return ETIMEDOUT;

shall be changed to:

/* Already timed out? */
if (rt.tv_sec < 0)

{
lll_futex_wake (futex, 1);
return ETIMEDOUT;
}

Do you follow me?

regards,
alexander.

Eric Dumazet

unread,

Oct 17, 2003, 9:13:50 AM10/17/03

to

"Joe Seigh" <jsei...@xemaps.com> a écrit dans le message de
news:3F8FDA96...@xemaps.com...

> I think that Linux futex uses negative values to indicate contention. And
> actually that's not really true since the futex can be temporarily set to
> 1 (free). The real indication of contention is whether there are waiters
> in the kernel's wait set and only the kernel knows that.

AFAIK :

0 means : free lock
1 means : locked by one thread (no contention)
other values : locked and (maybe) contention

0 is a good 'free' lock value for compatibility with linuxthreads reasons.
(NPTL library can replace the linux thread library WITHOUT recompiling the
old application)

See you

Joe Seigh

unread,

Oct 17, 2003, 9:24:33 AM10/17/03

to

Ah, ok (looking at the manpage again). I'll have to look at the race
conditions again.

Joe Seigh

unread,

Oct 17, 2003, 11:26:43 AM10/17/03

to

> Eric Dumazet wrote:
> > AFAIK :
> >
> > 0 means : free lock
> > 1 means : locked by one thread (no contention)
> > other values : locked and (maybe) contention
> >
> > 0 is a good 'free' lock value for compatibility with linuxthreads reasons.
> > (NPTL library can replace the linux thread library WITHOUT recompiling the
> > old application)
> >
>
> Ah, ok (looking at the manpage again). I'll have to look at the race
> conditions again.
>

Ok, I absolutely have no idea how futex up, down, and sideways relate to
implmenting a mutex. I think I can see some extremely nasty race
conditions involving 3 or more threads but it all kind of depends on
what you mean by "getting a lock".

Joe Seigh

Alexander Terekhov

unread,

Oct 17, 2003, 1:20:46 PM10/17/03

to

Joe Seigh wrote:
[...]

> Ok, I absolutely have no idea how futex up, down, and sideways relate to
> implmenting a mutex. I think I can see some extremely nasty race
> conditions involving 3 or more threads but it all kind of depends on
> what you mean by "getting a lock".

"getting a lock" means that CAS for a lock status update 0->1 (on fast
path) or 0->2 (on slow path) succeeded. What "nasty race conditions
involving 3 or more threads" are you talking about? Details, please.

regards,
alexander.

Joe Seigh

unread,

Oct 17, 2003, 2:03:08 PM10/17/03

to

I take back my recent retraction. It is as I said earlier. 1 is lock is free and
0 locked w/o contention, < 0 locked w/ contention. Read the man page.
Read the code samples.

386 doesn't have CAS, just locked increment, decrement, and swap. You're
trying to do FSM logic which you can't do with those instructions.

Joe Seigh

Eric Dumazet

unread,

Oct 17, 2003, 6:18:26 PM10/17/03

to

"Joe Seigh" <jsei...@xemaps.com> a écrit dans le message de

news:3F9031B0...@xemaps.com...

>
> I take back my recent retraction. It is as I said earlier. 1 is lock is
free and
> 0 locked w/o contention, < 0 locked w/ contention. Read the man page.
> Read the code samples.

man pages are always out of date...

The futex logic in kernel doesnt force a value for 'locked' or 'free'. You
decide.
NPTL choice is 0 for free mutex.

The FUTEX_WAIT is used by a thread willing to 'sleep' by :
giving the address of the futex, AND the value it is supposed to be (last
value got by an atomic update).

NPTL uses futex with 0 for 'free mutex', 1 for 'locked by one thread' ,
other values for : 'locked', and contention (need to call FUTEX_WAKE)

So the code to lock is :

int addval = 1 ;
for (;;) {
int oldvalue = atomic_add(&futex, addval) ; // atomic_add returns the value
of futex BEFORE the increment
if (oldvalue != 0) // futex was already locked by somebody
futexsyscall(FUTEX_WAIT, &futex, oldvalue + addval) ; // +1 or 2 because
of the atomic_add
continue ;
}
break ;
}

The code to unlock is :

int newval = atomic_dec(&futex) ; // atomic_dec returns the new value of
futex AFTER the decrement
if (newval != 0) { // This thread must call futex syscall to wakeup other
thread
*futex = 0 ; // really free the futex
futexsyscall(FUTEX_WAKE, &futex, 1 /* wakeup one thread */) ;
}

>
> 386 doesn't have CAS, just locked increment, decrement, and swap. You're
> trying to do FSM logic which you can't do with those instructions.

NPTL doesnt care of i386. i486 is required. (lock xadd instruction to
implement atomic_inc/atomic_dec)

Eric

Eric Dumazet

unread,

Oct 18, 2003, 3:15:23 AM10/18/03

to

"Eric Dumazet" <da...@cosmosbay.com> a écrit dans le message de
news:3f906abf$0$13307$626a...@news.free.fr...

> man pages are always out of date...
>

And finally a new up2date doc is now available :
http://people.redhat.com/drepper/futex.pdf

I dont know if I will take time to read it all, it's *very* long !

Alexander Terekhov

unread,

Oct 18, 2003, 10:27:23 AM10/18/03

to

Joe Seigh wrote:
[...]

> I take back my recent retraction. It is as I said earlier. 1 is lock is free and
> 0 locked w/o contention, < 0 locked w/ contention. Read the man page.
> Read the code samples.

I've just noticed a link posted by Eric. I'm going to read it.

>
> 386 doesn't have CAS, just locked increment, decrement, and swap. You're
> trying to do FSM logic which you can't do with those instructions.

Yeah. "Untested"...

#define SWAP_BASED_MUTEX_FOR_WINDOWS_INITIALIZER { 0, 0 }

struct swap_based_mutex_for_windows {

atomic<int> m_lock_status; // -1: free, 0: locked, 1 lock-contention
atomic<auto_reset_event *> m_retry_event; // DCSI'd

void DCSI();
void slow_lock();
bool slow_trylock();
bool slow_timedlock(absolute_timeout const & timeout);
void release_one_waiter_if_any();

void lock() {
if (m_lock_status.swap(0, msync::acq) >= 0) slow_lock();
}

bool trylock() {
return (m_lock_status.swap(0, msync::acq) < 0) ? true : slow_trylock();
}

bool timedlock(absolute_timeout const & timeout) {
return (m_lock_status.swap(0, msync::acq) < 0) ? true : slow_timedlock(timeout);
}

void unlock() {
if (m_lock_status.swap(-1, msync::rel) > 0) release_one_waiter_if_any();
}

};

void swap_based_mutex_for_windows::slow_lock() {
DCSI();
while (m_lock_status.swap(1, msync::acq) >= 0)
m_retry_event.load(msync::none)->wait();
}

bool swap_based_mutex_for_windows::slow_trylock() {
DCSI();
return m_lock_status.swap(1, msync::acq) < 0;
}

bool swap_based_mutex_for_windows::slow_timedlock(absolute_timeout const & timeout) {
DCSI();
while (m_lock_status.swap(1, msync::acq) >= 0)
if (!m_retry_event.load(msync::none)->timedwait(timeout)) return false;
return true;
}

void swap_based_mutex_for_windows::release_one_waiter_if_any() {
if (auto_reset_event * retry_event = m_retry_event.load(msync::none))
retry_event->set();
}

void swap_based_mutex_for_windows::DCSI() {
if (!m_retry_event.load(msync::none)) {
named_windows_mutex_trick guard(this);
if (!m_retry_event.load(msync::none)) {
m_retry_event.store(new auto_reset_event(), msync::rel);
m_lock_status.store(-1, msync::rel);
}
}
}

Comments? TIA.

regards,
alexander.

Joe Seigh

unread,

Oct 18, 2003, 10:44:47 AM10/18/03

to

Eric Dumazet wrote:
>
> "Joe Seigh" <jsei...@xemaps.com> a écrit dans le message de
> news:3F9031B0...@xemaps.com...
> >
> > I take back my recent retraction. It is as I said earlier. 1 is lock is
> free and
> > 0 locked w/o contention, < 0 locked w/ contention. Read the man page.
> > Read the code samples.
>
> man pages are always out of date...
>
> The futex logic in kernel doesnt force a value for 'locked' or 'free'. You
> decide.
> NPTL choice is 0 for free mutex.
>
> The FUTEX_WAIT is used by a thread willing to 'sleep' by :
> giving the address of the futex, AND the value it is supposed to be (last
> value got by an atomic update).
>
> NPTL uses futex with 0 for 'free mutex', 1 for 'locked by one thread' ,
> other values for : 'locked', and contention (need to call FUTEX_WAKE)

Yeah, it appears they reversed everything and shifted by one to get
binary compatibility. "other values" > 1, negative values not allowed
AFAICT.

>
(snip)

>
> >
> > 386 doesn't have CAS, just locked increment, decrement, and swap. You're
> > trying to do FSM logic which you can't do with those instructions.
>
> NPTL doesnt care of i386. i486 is required. (lock xadd instruction to
> implement atomic_inc/atomic_dec)

They gave up 386 compatibility for that binary compatibility with the old libs.

>
> Eric

Alexander Terekhov

unread,

Oct 18, 2003, 10:48:17 AM10/18/03

to

Alexander Terekhov wrote:
[...]

No need for "if" here (assert() aside for a moment ;-) ).

void swap_based_mutex_for_windows::release_one_waiter_if_any() {
m_retry_event.load(msync::none)->set();

Alexander Terekhov

unread,

Oct 18, 2003, 10:48:28 AM10/18/03

to

Eric Dumazet wrote:
[...]
> So the code to lock is : ...

NPTL's mutex code that does increments and decrements is:

/* Copyright (C) 2002 Free Software Foundation, Inc.
This file is part of the GNU C Library.
Contributed by Ulrich Drepper <dre...@redhat.com>, 2002.

The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.

The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, write to the Free
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA. */

#include <atomic.h>

/* Implement generic mutex. Basic futex syscall support is required:

lll_futex_wait(futex, value) - call sys_futex with FUTEX_WAIT
and third parameter VALUE

lll_futex_wake(futex, value) - call sys_futex with FUTEX_WAKE
and third parameter VALUE
*/

/* Mutex lock counter:
bit 31 clear means unlocked;
bit 31 set means locked.

All code that looks at bit 31 first increases the 'number of
interested threads' usage counter, which is in bits 0-30.

All negative mutex values indicate that the mutex is still locked. */

static inline void
__generic_mutex_lock (int *mutex)
{
unsigned int v;

/* Bit 31 was clear, we got the mutex. (this is the fastpath). */
if (atomic_bit_test_set (mutex, 31) == 0)
return;

atomic_increment (mutex);

while (1)
{
if (atomic_bit_test_set (mutex, 31) == 0)
{
atomic_decrement (mutex);
return;
}

/* We have to wait now. First make sure the futex value we are
monitoring is truly negative (i.e. locked). */
v = *mutex;
if (v >= 0)
continue;

lll_futex_wait (mutex, v);
}
}

static inline void
__generic_mutex_unlock (int *mutex)
{
/* Adding 0x80000000 to the counter results in 0 if and only if
there are not other interested threads - we can return (this is
the fastpath). */
if (atomic_add_zero (0x80000000, mutex))
return;

/* There are other threads waiting for this mutex, wake one of them
up. */
lll_futex_wake (mutex, 1);
}

#define lll_mutex_lock(futex) __generic_mutex_lock (&(futex))
#define lll_mutex_unlock(futex) __generic_mutex_unlock (&(futex))

Joe Seigh

unread,

Oct 18, 2003, 10:56:11 AM10/18/03

to

I'll have to read though it, I just did a quick scan. It probably
should have just stuck to the facts and just presented the semantics.
The discussion of why other techniques don't work distracts from
things. It shouldn't be a tutorial on locking in general and it's
assuming how correct locking will look. It would be nice if the
orginal paper was available online, especially if it answers why
futexes need to hack the kernel when there are a number of ways
to implement futexes without kernel modifications.

Joe Seigh

unread,

Oct 18, 2003, 1:39:08 PM10/18/03

to

Alexander Terekhov wrote:
>
> Eric Dumazet wrote:
> [...]
> > So the code to lock is : ...
>
> NPTL's mutex code that does increments and decrements is:
>

(snip)

That's probably how I'd do it. Split out the mutex "token" from
the wait count so that "queue jumping" (SCHED_OTHER) can take
place. But an ordinary semaphore syscall would suffice. You
don't need any special custom syscalls. I guess they don't
know the trick. :)

Joe Seigh

unread,

Oct 18, 2003, 1:52:38 PM10/18/03

to

One thing I noticed and I guess a warning is warranted. It suggests
that asynchronous waits would be used for multiple events. Someone
may think this is a good way to wait for multiple locks.

DANGERWILLROBINSONDANGER! (there's your warning)

You will get deadlock. Queueing up a request for a lock isn't the same
as actually acquiring a lock. The locks could be granted in a different
order than you requested them. Even if you use asynchronous wait locks
with FIFO service order such as a bakery algorithm lock or a waitticket
based lock, the requests could be interleaved, e.g.

thread 1 requests lock A
thread 2 requests lock A
thread 2 requests lock B
thread 1 requests lock B

This will hang because thread 1 won't get lock B until after thread 2 gets it,
but thread 2 is waiting for lock A first, etc...

So don't asynchronous wait for multiple locks. You have been warned.

Joe Seigh

unread,

Oct 18, 2003, 2:08:30 PM10/18/03

to

Joe Seigh wrote:
>
> But an ordinary semaphore syscall would suffice. You
> don't need any special custom syscalls. I guess they don't
> know the trick. :)
>

Actually, you don't need the trick for timeouts. A few spurious
wakeups won't hurt since you have to handle them anyway since you
can get them from queue jumping.

Joe Seigh

Alexander Terekhov

unread,

Oct 18, 2003, 2:13:35 PM10/18/03

to

Joe Seigh wrote:
[...]

> That's probably how I'd do it. Split out the mutex "token" from
> the wait count so that "queue jumping" (SCHED_OTHER) can take
> place.

Yeah. I personally kinda like queue "uber jumping" (no waiters count)
and simply using xchg/swap (no cas/++/--). Still parsing my example?

;-)

regards,
alexander.

Joe Seigh

unread,

Oct 18, 2003, 3:15:02 PM10/18/03

to

What, the wake everybody up thundering herd solution? That's a little too
darwinian maybe. How about metered release?

Joe Seigh

Sean Burke

unread,

Oct 18, 2003, 3:15:17 PM10/18/03

to

Joe Seigh <jsei...@xemaps.com> writes:

> SenderX wrote:
> >
> > > I've seen significant performance gains from futexes on win32 and it
> > didn't
> > > require a rewrite of the win32 kernel.
> >
> > futex's are basically lock-free semaphores?
> >
> > The fast semaphore code that we posted can easily be used as a fast mutex...
> >
> > http://appcore.home.comcast.net/pthreads/lfsema.cpp
> >
> > http://appcore.home.comcast.net/pthreads/lfsema.h
> >
> > or
> >
> > http://intel.forums.liveworld.com/thread.jsp?forum=242&thread=6685
> >
> > This really required a Linux kernel rewrite?
> >
>
> Apparently, if you didn't know any better. One of the dangers of
> having the ability to modify the kernel is that you don't try
> hard enough to look for other less intrusive solutions. They're
> going to have a hard time living this one down. We'll see how
> cool they are anyway.

Well here is something that's pretty cool - the FUTEX_FD operation
lets you create a file descriptor that is tied to a futex, suitable
for use with select() or poll().

So, if i understand correctly, this would allow one to build a
message queue (including a process-shared queue) that would be
select()able or poll()able. That, IMHO, is a cool thing, and
worth extending the kernel to obtain.

-SEan

PS - Die SysV IPC, die!!!

Alexander Terekhov

unread,

Oct 18, 2003, 3:20:22 PM10/18/03

to

Joe Seigh wrote:
[...]

> > Yeah. I personally kinda like queue "uber jumping" (no waiters count)
> > and simply using xchg/swap (no cas/++/--). Still parsing my example?
> >
>

> What, the wake everybody up thundering herd solution? ...

struct swap_based_mutex_for_windows {

atomic<int> m_lock_status; // -1: free, 0: locked, 1 lock-contention
atomic<auto_reset_event *> m_retry_event; // DCSI'd

^^^^^^^^^^

<...>

void unlock() {
if (m_lock_status.swap(-1, msync::rel) > 0) release_one_waiter_if_any();

} ^^^

regards,
alexander.

Joe Seigh

unread,

Oct 18, 2003, 3:27:05 PM10/18/03

to

Sean Burke wrote:
>
>
> Well here is something that's pretty cool - the FUTEX_FD operation
> lets you create a file descriptor that is tied to a futex, suitable
> for use with select() or poll().
>
> So, if i understand correctly, this would allow one to build a
> message queue (including a process-shared queue) that would be
> select()able or poll()able. That, IMHO, is a cool thing, and
> worth extending the kernel to obtain.
>

The write up mentions that mechanism for detecting race conditions
isn't exposed by select(), poll(), and epoll(). So you could lose
events.

Joe Seigh

unread,

Oct 18, 2003, 3:40:32 PM10/18/03

to

even if the waiter hasn't quite shown up yet. That's a semaphore.

Joe Seigh

Alexander Terekhov

unread,

Oct 18, 2003, 3:50:32 PM10/18/03

to

http://groups.google.com/groups?q=auto-reset+terekhov+ABM

regards,
alexander.

Alexander Terekhov

unread,

Oct 18, 2003, 4:05:39 PM10/18/03

to

Eric Dumazet wrote:
>
> "Eric Dumazet" <da...@cosmosbay.com> a écrit dans le message de
> news:3f906abf$0$13307$626a...@news.free.fr...
> > man pages are always out of date...
> >
>
> And finally a new up2date doc is now available :
> http://people.redhat.com/drepper/futex.pdf

Bah! Ulrich and "class"! That's milestone!! "template" is next
(and, hopefully, DRB will follow Ulrich rather soon), I guess.

regards,
alexander.

P.S. More comments will follow.

Sean Burke

unread,

Oct 18, 2003, 4:29:19 PM10/18/03

to

Joe Seigh <jsei...@xemaps.com> writes:

Yes, the problem being that waiting on the futex requires
you to specify the last value you saw for the futex, so
the kernel can wake your thread (or in this case, decline
to sleep it) if the futex has changed. The select and poll
APIs don't provide a means to indicate what futex value
your thread last saw.

Perhaps it would be more useful if file descriptors derived
from futexes were treated as "wake on non-zero"? Then for
something like a ring buffer, a futex could be used to store
the amount data available to read, yielding a read file
descriptor. A second futex representing the free space
available in the buffer would yield a write file decriptor.

-SEan

SenderX

unread,

Oct 18, 2003, 5:06:48 PM10/18/03

to

> Yes, the problem being that waiting on the futex requires
> you to specify the last value you saw for the futex, so
> the kernel can wake your thread (or in this case, decline
> to sleep it) if the futex has changed. The select and poll
> APIs don't provide a means to indicate what futex value
> your thread last saw.

Humm... This futex thing sounds like its not quite ready for production
code...

Just use one of the lock-free semas that Joe and I previously posted.

Sean Burke

unread,

Oct 18, 2003, 7:21:29 PM10/18/03

to

"SenderX" <x...@xxx.xxx> writes:

It's no fault of the futex - the question is, what should
be the behavior of a file descriptor that is derived from
a futex?

Do the lock-free semaphores you refer to provide a file
descriptor that can be used with select()?

-SEan

SenderX

unread,

Oct 18, 2003, 10:38:13 PM10/18/03

to

> Do the lock-free semaphores you refer to provide a file
> descriptor that can be used with select()?

No, but It sounds like the API's in question are not totally ready for a
futex yet anyway...

lock-free semaphores are only good for avoiding the kernel 100% of the time
when the sema is under any sort of low/moderate to heavy contention... Its
all in user-space as well, except for the semaphore calls used to wake
waiters.

There are simple, and very effective" for protecting any number of lock-free
collections...

>>> So, if i understand correctly, this would allow one to build a
>>> message queue (including a process-shared queue) that would be
>>> select()able or poll()able.

Like Windows IOCP?

Please elaborate on how a futex and file-descriptor get used together?

"Super simple" pseudo-code would be good...

Valentin Nechayev

unread,

Oct 19, 2003, 2:21:11 AM10/19/03

to

>>> Joe Seigh wrote:

>> And finally a new up2date doc is now available :
>> http://people.redhat.com/drepper/futex.pdf
>>
>> I dont know if I will take time to read it all, it's *very* long !

JS> I'll have to read though it, I just did a quick scan. It probably
JS> should have just stuck to the facts and just presented the semantics.
JS> The discussion of why other techniques don't work distracts from
JS> things. It shouldn't be a tutorial on locking in general and it's
JS> assuming how correct locking will look. It would be nice if the
JS> orginal paper was available online, especially if it answers why
JS> futexes need to hack the kernel when there are a number of ways
JS> to implement futexes without kernel modifications.

As far as I read (Kernel Traffic and other sources), using kernel land
was declared as the way to provide efficient inter-process locking,
analogically to "adaptive mutexes" in Solaris.

-netch-

Sean Burke

unread,

Oct 19, 2003, 3:07:07 AM10/19/03

to

"SenderX" <x...@xxx.xxx> writes:

> > Do the lock-free semaphores you refer to provide a file
> > descriptor that can be used with select()?
>
> No, but It sounds like the API's in question are not totally ready for a
> futex yet anyway...

If you mean select/poll, yes. But we don't get to change those.

> lock-free semaphores are only good for avoiding the kernel 100% of the time
> when the sema is under any sort of low/moderate to heavy contention... Its
> all in user-space as well, except for the semaphore calls used to wake
> waiters.
>
> There are simple, and very effective" for protecting any number of lock-free
> collections...

I really don't understand what you think the difference is.
The futex-based mutexes only enter the kernel when a thread
needs to wait, or to waken a waiter.

The futex provides:

- The basic apis for waits, timed waits, and wakes that you
can use for mutexes, condvars, semaphores, whatever.

- Support for shifting extra waiters on a condvar directly
to the mutex wait queue when a condvar gets a broadcast.

- You don't have to do anything special for inter-process
sync objects.

- A neat API that lets you get a file descriptor for your futex.
(Tho not without caveats.)

And, if you feel the urge to "spin" on a held lock for a while
before sleeping, there is nothing in the futex that stops you
from doing this.

I really don't see what is so controversial about futexes.
Aside from the file descriptor thing, it does nothing beyond
the most basic essentials.

> >>> So, if i understand correctly, this would allow one to build a
> >>> message queue (including a process-shared queue) that would be
> >>> select()able or poll()able.
>
> Like Windows IOCP?

I don't know anything about Windows IOCP.

> Please elaborate on how a futex and file-descriptor get used together?
>
> "Super simple" pseudo-code would be good...

You should first read Ulrich Drepper's pdf. I found it very
illuminating, and the link was posted here earlier today.

Assume I have a queue, that consists of a list of nodes.
The queue contains a pointer to the first node, or NULL
if the queue is empty. When a thread goes to pop from
an empty queue, it does a futex_wait on the pointer.
When another thread pushes an item onto the queue, it
does a futex_wake, again indicating the queue pointer
as the futex.

The file descriptor becomes an alias for the futex,
one that can be passed to select or poll. Presumably,
when a thread calls select with one or more futex file
descriptors, it is placed on the wait queue for all of
the futexes simultaneously. So, any other thread that
calls futex_wake after a push will wake the thread that
called select.

(The tricky part, I suppose, is removing the woken thread
from all of the other futex wait queues.)

One other thought occurs to me though - the futex has been
defined as a 32-bit word, even on LP64 machines. This means
that you won't be able to use a pointer as a futex on LP64
machines. Perhaps that should be reconsidered.

-SEan

SenderX

unread,

Oct 19, 2003, 4:35:10 AM10/19/03

to

> If you mean select/poll, yes. But we don't get to change those.

Thats what I meant.

> I really don't understand what you think the difference is.

FD's && Kernel rewrite.

> The futex-based mutexes only enter the kernel when a thread
> needs to wait, or to waken a waiter.

Yeah, that sounds exactly like a lock-free semaphore.

> Assume I have a queue, that consists of a list of nodes.
> The queue contains a pointer to the first node, or NULL
> if the queue is empty. When a thread goes to pop from
> an empty queue, it does a futex_wait on the pointer.
> When another thread pushes an item onto the queue, it
> does a futex_wake, again indicating the queue pointer
> as the futex.

Yeah, that sounds exactly like a lock-free semaphore.

It seems that the difference between futexs and lock-free semaphores that
Joe and I wrote for win32 and pthreads, is the kernel re-write and
file-descriptor integration?

--
The designer of the experimental, SMP and HyperThread friendly, AppCore
library.

http://AppCore.home.comcast.net

Joe Seigh

unread,

Oct 19, 2003, 9:32:27 AM10/19/03

to

Valentin Nechayev wrote:
>
> JS> assuming how correct locking will look. It would be nice if the
> JS> orginal paper was available online, especially if it answers why
> JS> futexes need to hack the kernel when there are a number of ways
> JS> to implement futexes without kernel modifications.
>
> As far as I read (Kernel Traffic and other sources), using kernel land
> was declared as the way to provide efficient inter-process locking,
> analogically to "adaptive mutexes" in Solaris.
>

That wasn't the question.

Joe Seigh

unread,

Oct 19, 2003, 10:40:50 AM10/19/03

to

Sean Burke wrote:
> > The write up mentions that mechanism for detecting race conditions
> > isn't exposed by select(), poll(), and epoll(). So you could lose
> > events.
>
> Yes, the problem being that waiting on the futex requires
> you to specify the last value you saw for the futex, so
> the kernel can wake your thread (or in this case, decline
> to sleep it) if the futex has changed. The select and poll
> APIs don't provide a means to indicate what futex value
> your thread last saw.
>
> Perhaps it would be more useful if file descriptors derived
> from futexes were treated as "wake on non-zero"? Then for
> something like a ring buffer, a futex could be used to store
> the amount data available to read, yielding a read file
> descriptor. A second futex representing the free space
> available in the buffer would yield a write file decriptor.
>

Well, the futex api has a bunch of problems, so I'm not quite sure
where to start.

First of all, the futex api is not a very clean api and not very
well defined. It's chimeric synchronization primatives are unlike
anything anyone is used to. If implementing a fast userspace mutex
is tricky, implementing a fast user space mutex with the futex api
is trickier.

One thing is the FUTEX_WAIT with the EWOULDBLOCK is what? What does
it even mean? The thing I am sure it is not is an eventcount. Not
unless you think an eventcount which can lose events is a good eventcount.

I'm not sure what the race condition is that they think select() et al
will experience. But I don't think it matters if you use a cleaner
design. Just use an event as the synchronization object and tie the
fd to that. Events don't require an additional parameter to wait on them
so it will work for things like select() where you can't pass any per
fd information.

Joe Seigh

unread,

Oct 19, 2003, 10:45:51 AM10/19/03

to

Alexander Terekhov wrote:
>

> > even if the waiter hasn't quite shown up yet. That's a semaphore.
>
> http://groups.google.com/groups?q=auto-reset+terekhov+ABM
>

What?

Ok, I went back and took another look at your code and I see what you're
trying to do now. It might work. I don't know if I want to spend time
reviewing it since I don't believe the futex approach is the right way to go,
so alternative futex designs aren't a big thing with me. Sorry.

Joe Seigh

Sean Burke

unread,

Oct 19, 2003, 2:40:30 PM10/19/03

to

"SenderX" <x...@xxx.xxx> writes:

> > If you mean select/poll, yes. But we don't get to change those.
>
> Thats what I meant.
>
>
> > I really don't understand what you think the difference is.
>
> FD's && Kernel rewrite.

As I understand it, the kernel had to be rewritten anyway,
to allow a process to contain multiple "scheduler entities".
And the scheduling code was being reworked anyway for a
variety of reasons.

Not only the NPTL team, but also the NGPT team before them,
petitioned long and loud for new kernel facilities to enable
truly conformant pthreads on Linux. If new facilities were
not needed, why didn't they simply fix Linux Threads?

The suggestion that the kernel was rewritten solely to provide
futexes seems astonishingly ill-informed, by the usual standards
of this group.

> > The futex-based mutexes only enter the kernel when a thread
> > needs to wait, or to waken a waiter.
>
> Yeah, that sounds exactly like a lock-free semaphore.

Exactly. So I would think that all of the lock-free enthusiasts
here would be delighted. It should provide an excellent platform
for experimentation.

> > Assume I have a queue, that consists of a list of nodes.
> > The queue contains a pointer to the first node, or NULL
> > if the queue is empty. When a thread goes to pop from
> > an empty queue, it does a futex_wait on the pointer.
> > When another thread pushes an item onto the queue, it
> > does a futex_wake, again indicating the queue pointer
> > as the futex.
>
> Yeah, that sounds exactly like a lock-free semaphore.

Indeed. So why the big controversy.

> It seems that the difference between futexs and lock-free semaphores that
> Joe and I wrote for win32 and pthreads, is the kernel re-write and
> file-descriptor integration?

Well, if your lock-free semaphore was using pthreads, then on
Linux it was using a pthread implementation that left a great
deal to be desired. The futex allows you the choice of building
lock-free semaphores for Linux either

- On a pthreads implementation that is on par with other platforms,

- Directly to the futex calls.

-SEan

Joe Seigh

unread,

Oct 19, 2003, 3:00:32 PM10/19/03

to

Sean Burke wrote:

>
> "SenderX" <x...@xxx.xxx> writes:
>
> > > The futex-based mutexes only enter the kernel when a thread
> > > needs to wait, or to waken a waiter.
> >
> > Yeah, that sounds exactly like a lock-free semaphore.
>
> Exactly. So I would think that all of the lock-free enthusiasts
> here would be delighted. It should provide an excellent platform
> for experimentation.
>

No, it provides a terrible platform for experimentation. The sematics
of the futex ops are unclear. They appear to be hard coded for specific
protocols. They are quite clearly not general purpose.

Joe Seigh

Sean Burke

unread,

Oct 19, 2003, 3:27:26 PM10/19/03

to

Joe Seigh <jsei...@xemaps.com> writes:

> Sean Burke wrote:
> > > The write up mentions that mechanism for detecting race conditions
> > > isn't exposed by select(), poll(), and epoll(). So you could lose
> > > events.
> >
> > Yes, the problem being that waiting on the futex requires
> > you to specify the last value you saw for the futex, so
> > the kernel can wake your thread (or in this case, decline
> > to sleep it) if the futex has changed. The select and poll
> > APIs don't provide a means to indicate what futex value
> > your thread last saw.
> >
> > Perhaps it would be more useful if file descriptors derived
> > from futexes were treated as "wake on non-zero"? Then for
> > something like a ring buffer, a futex could be used to store
> > the amount data available to read, yielding a read file
> > descriptor. A second futex representing the free space
> > available in the buffer would yield a write file decriptor.
> >
>
> Well, the futex api has a bunch of problems, so I'm not quite sure
> where to start.
>
> First of all, the futex api is not a very clean api and not very
> well defined. It's chimeric synchronization primatives are unlike
> anything anyone is used to. If implementing a fast userspace mutex
> is tricky, implementing a fast user space mutex with the futex api
> is trickier.

I don't understand why you say that. I don't see it as a
synchronization primitive at all. I see a protocol that
allows threads that wish to wait for events to communicate
with other threads that wish to waken them. Something a lot
handier than sleep/sigwait/etc.

From Ulrich's paper, it seems that the responsibility to use
proper sync primitives falls primarily on the userspace code
that manipulate the futex's value.

> One thing is the FUTEX_WAIT with the EWOULDBLOCK is what? What does
> it even mean? The thing I am sure it is not is an eventcount. Not
> unless you think an eventcount which can lose events is a good eventcount.

It's completely analagous to the compare and swap instruction -

Compare and swap:
"Here is the value that I last saw at this address.
If it is still that value, replace it with this new value,
else return failure."

futex:
"Here is the value I last saw at this address.
If it is still that value, put me to sleep,
else return EWOULDBLOCK.

I won't opine as to whether EWOULDBLOCK is the most intuitive
error code, but the futex api is clearly intended for specialists,
anyway.

> I'm not sure what the race condition is that they think select() et al
> will experience.

I assume that the implementation of select() needs to
place the calling thread on that futex's wait queue.
But, the futex wait requires that you indicate your idea
of the futex's "current" value, just as you must do with
compare and swap.

Ulrich is assuming that, in select's implementation,
it will resort to reading the current futex value is, since
it has no better information. This creates a race condition
between a thread that is looping on select, and another
thread that is updating the futex - if the updating thread
modifies the futex prior to select reading the value, the
wakeup is missed.

My suggestion is that select should instead require that
the futexes that are used as read fd's be null when read
would block, and that futexes used as write fd's should
be null when write would block. This eliminates the race,
since a non-null value always means readable/writable.

> But I don't think it matters if you use a cleaner
> design. Just use an event as the synchronization object and tie the
> fd to that. Events don't require an additional parameter to wait on them
> so it will work for things like select() where you can't pass any per
> fd information.

Why an additional parameter?

For the same reason that a condvar needs an associated mutex:
to ensure that state cannot change between testing the predicate
and sleeping on the condition.

For the same reason that compare and swap does things that
simple atomic swap can't. It allows you to prevent races.

Weren't you among those arguing that WIN32 events are
not an adequate substitute for condvars? Don't the same
considerations apply here?

-SEAn

Sean Burke

unread,

Oct 19, 2003, 3:48:12 PM10/19/03

to

Joe Seigh <jsei...@xemaps.com> writes:

> Sean Burke wrote:
> >
> > "SenderX" <x...@xxx.xxx> writes:
> >
> > > > The futex-based mutexes only enter the kernel when a thread
> > > > needs to wait, or to waken a waiter.
> > >
> > > Yeah, that sounds exactly like a lock-free semaphore.
> >
> > Exactly. So I would think that all of the lock-free enthusiasts
> > here would be delighted. It should provide an excellent platform
> > for experimentation.
> >
>
> No, it provides a terrible platform for experimentation.
> The sematics of the futex ops are unclear.

I don't understand what you mean here. Granted, when
futex_wait() tests the current futex value against
the argument's value, it is not specified what sort
of memory barriers are used when reading the futex.

The assumption must then be that full barriers are used.
Perhaps it is unfortunate from an optimization standpoint
that more relaxed barriers cannot be specified. Is this
your objection?

> They appear to be hard coded for specific
> protocols. They are quite clearly not general purpose.

Could you please be more specific? It appears to be possible
to implement mutexes, semphores, and condvars. What's missing?

-SEan

Joe Seigh

unread,

Oct 19, 2003, 4:22:56 PM10/19/03

to

Sean Burke wrote:

>
> Joe Seigh <jsei...@xemaps.com> writes:
>
> > No, it provides a terrible platform for experimentation.
> > The sematics of the futex ops are unclear.
>
> I don't understand what you mean here. Granted, when
> futex_wait() tests the current futex value against
> the argument's value, it is not specified what sort
> of memory barriers are used when reading the futex.
>
> The assumption must then be that full barriers are used.
> Perhaps it is unfortunate from an optimization standpoint
> that more relaxed barriers cannot be specified. Is this
> your objection?

Nothing to do with membars.

>
> > They appear to be hard coded for specific
> > protocols. They are quite clearly not general purpose.
>
> Could you please be more specific? It appears to be possible
> to implement mutexes, semphores, and condvars. What's missing?
>

It had better be possible. That's what the futex api was written
for. But those are specific purposes, not general purpose.

I've done various implementations of one sort or another. The futex
api would not be my api of choice for doing that kind of stuff.

Joe Seigh

Sean Burke

unread,

Oct 19, 2003, 4:51:40 PM10/19/03

to

Joe Seigh <jsei...@xemaps.com> writes:

If you don't want to discuss it in detail, that's fine.
But to the best of my recollection, you did describe the
futex as "totally braindamaged". Have you changed this
verdict to "suboptimal"?

Or are "suboptimal" and "totally braindamaged" now synonymous
in this group?

-SEan

Joe Seigh

unread,

Oct 19, 2003, 5:25:28 PM10/19/03

to

It's "braindamaged" from the point of view that it's not the
cleanest kernel interface there is with respect to kernel
interactions with user memory. It's not clear whether they
did not know there was a way to do that, or aren't bothering
to explain why their method is better.

Joe Seigh

unread,

Oct 19, 2003, 5:44:56 PM10/19/03

to

Sean Burke wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>

> > One thing is the FUTEX_WAIT with the EWOULDBLOCK is what? What does
> > it even mean? The thing I am sure it is not is an eventcount. Not
> > unless you think an eventcount which can lose events is a good eventcount.
>
> It's completely analagous to the compare and swap instruction -
>
> Compare and swap:
> "Here is the value that I last saw at this address.
> If it is still that value, replace it with this new value,
> else return failure."
>
> futex:
> "Here is the value I last saw at this address.
> If it is still that value, put me to sleep,
> else return EWOULDBLOCK.
>
> I won't opine as to whether EWOULDBLOCK is the most intuitive
> error code, but the futex api is clearly intended for specialists,
> anyway.

I see what the problem is. I was assuming the kernel code was a little
smarter than it was. I was assuming that the kernel code would return
from FUTEX_WAIT without sleeping if it saw the futex count as zero. If
it saw the futex count as non zero, it would know that a later (by virtue
of the kernel code being the arbitrator) request to wake one thread would
happen. So as far as I could tell, there wasn't any problem being
solved by that extra value being passed on FUTEX_WAIT.

Joe Seigh

unread,

Oct 19, 2003, 5:45:37 PM10/19/03

to

Sean Burke wrote:
>
> Joe Seigh <jsei...@xemaps.com> writes:
>

> > One thing is the FUTEX_WAIT with the EWOULDBLOCK is what? What does
> > it even mean? The thing I am sure it is not is an eventcount. Not
> > unless you think an eventcount which can lose events is a good eventcount.
>
> It's completely analagous to the compare and swap instruction -
>
> Compare and swap:
> "Here is the value that I last saw at this address.
> If it is still that value, replace it with this new value,
> else return failure."
>
> futex:
> "Here is the value I last saw at this address.
> If it is still that value, put me to sleep,
> else return EWOULDBLOCK.
>
> I won't opine as to whether EWOULDBLOCK is the most intuitive
> error code, but the futex api is clearly intended for specialists,
> anyway.

I see what the problem is. I was assuming the kernel code was a little

SenderX

unread,

Oct 19, 2003, 7:34:09 PM10/19/03

to

> Well, if your lock-free semaphore was using pthreads, then on
> Linux it was using a pthread implementation that left a great
> deal to be desired. The futex allows you the choice of building
> lock-free semaphores for Linux either
>
> - On a pthreads implementation that is on par with other platforms,
>
> - Directly to the futex calls.

Cool.

> Exactly. So I would think that all of the lock-free enthusiasts
> here would be delighted.

=)

> It should provide an excellent platform
> for experimentation.

I am sort of confined to windows && pthreads at the moment.

When will we be seeing futexs in pthreads for win32?

;)

P.S. (O.T.)

> It should provide an excellent platform
> for experimentation.

Has anybody been tinkering around with that lock-free read, fast-pathed
write mutex I posted?

The testing on windows with pthreads has been going great so far...

SenderX

unread,

Oct 19, 2003, 7:44:29 PM10/19/03

to

> No, it provides a terrible platform for experimentation. The sematics
> of the futex ops are unclear.

We get to experiment, and discover the semantics...

;)

Joe Seigh

unread,

Oct 19, 2003, 8:18:11 PM10/19/03

to

I've already discovered too much. :(

And there is a dark side when you introduce a new api this
way. I wonder if I should join the dark side? I believe
it pays better.

Joe Seigh

Sean Burke

unread,

Oct 19, 2003, 8:54:40 PM10/19/03

to

Joe Seigh <jsei...@xemaps.com> writes:

I don't often look inside pthread implementations,
so I don't know how other kernels compare in this
regard. Perhaps the hype was overblown.

On the other hand, Linux Threads uses sigwait/signal
for sleep/wakeup, which has obvious problems once your
threads no longer have distinct pids. So it is a big
improvement over what was there before.

-SEan

Alexander Terekhov

unread,

Oct 20, 2003, 5:59:57 AM10/20/03

to

Joe Seigh wrote:
[...]

> Ok, I went back and took another look at your code and I see what you're
> trying to do now. It might work. I don't know if I want to spend time
> reviewing it since I don't believe the futex approach is the right way to go,
> so alternative futex designs aren't a big thing with me. Sorry.

The stuff I've posted was not using "alternative futex designs". It was
using MS auto-reset event (bin. sema).

regards,
alexander.

Joe Seigh

unread,

Oct 20, 2003, 6:56:39 AM10/20/03

to

Ok, sorry. Sometimes it's hard to tell with your mnemonics.

If you had a thread set the futex to 1 but not yet actaully
in the event wait set yet, auto-reset by another thread wouldn't
wake it up. Don't you need a semamphore to avoid that race
condition?

BTW, is the source for NPTL userspace code around anyware as a tar
file? I rather not deal with a Redhat source rpm. One, I've never
been able to get rpm to install source rpms. Two, I have no idea
what "install" of source means especially on a different level of
Linux (and I wouldn't take anybody's word that rpm source install is
"harmless" and I don't have a sacrificial linux box to try it out
myself).

I'm sort of curious of how contrived the futex implementations of
the other stuff might be.

Joe Seigh

Alexander Terekhov

unread,

Oct 20, 2003, 6:57:14 AM10/20/03

to

Sean Burke wrote:
[...]

> - Support for shifting extra waiters on a condvar directly
> to the mutex wait queue when a condvar gets a broadcast.

^^^^^^^^^^^^^^^^^^^^^^^

Shifting to the internal (CV's one) or the "external" mutex
(protecting the wait predicate(s))?

>
> - You don't have to do anything special for inter-process
> sync objects.

Well. "See above." ;-)

regards,
alexander.

Alexander Terekhov

unread,

Oct 20, 2003, 7:03:41 AM10/20/03

to

Joe Seigh wrote:
[...]

> If you had a thread set the futex to 1 but not yet actaully
> in the event wait set yet, auto-reset by another thread wouldn't
> wake it up. Don't you need a semamphore to avoid that race
> condition?

MS auto-reset event IS "a semaphore". I don't see a race condition.

>
> BTW, is the source for NPTL userspace code around anyware as a tar
> file?

http://listman.redhat.com/archives/phil-list/2003-September/msg00051.html
http://listman.redhat.com/archives/phil-list/2003-October/msg00018.html

regards,
alexander.

Joe Seigh

unread,

Oct 20, 2003, 7:10:56 AM10/20/03

to

Alexander Terekhov wrote:
>
> Joe Seigh wrote:
> [...]
> > If you had a thread set the futex to 1 but not yet actaully
> > in the event wait set yet, auto-reset by another thread wouldn't
> > wake it up. Don't you need a semamphore to avoid that race
> > condition?
>
> MS auto-reset event IS "a semaphore". I don't see a race condition.

Ok. I was confusing it with some of the pulse event semantics.

Joe Seigh

SenderX

unread,

Oct 20, 2003, 7:25:00 AM10/20/03

to

> MS auto-reset event IS "a semaphore". I don't see a race condition.

MS events do not know how many times they have been set. They are not
semaphores. No internal count?

calling SetEvent 10000 times on an event is about equal to calling SetEvent
once. The object knows no difference...

Joes EventCount or Semaphore would be the "correct" kernel object for a
waitset.

Joe Seigh

unread,

Oct 20, 2003, 8:05:44 AM10/20/03

to

SenderX wrote:
>
> > MS auto-reset event IS "a semaphore". I don't see a race condition.
>
> MS events do not know how many times they have been set. They are not
> semaphores. No internal count?
>
> calling SetEvent 10000 times on an event is about equal to calling SetEvent
> once. The object knows no difference...
>
> Joes EventCount or Semaphore would be the "correct" kernel object for a
> waitset.
>

That's ok in this case. You only want to wake up at most 1 thread. It's
using the same trick the futex is using with the woken up threads always
assuming other waiters.

Joe Seigh

Alexander Terekhov

unread,

Oct 20, 2003, 8:40:17 AM10/20/03

to

SenderX wrote:
>
> > MS auto-reset event IS "a semaphore". I don't see a race condition.
>
> MS events do not know how many times they have been set. They are not
> semaphores. No internal count?

http://groups.google.com/groups?selm=3CE0BCD3.EF695748%40web.de

>
> calling SetEvent 10000 times on an event is about equal to calling SetEvent
> once. The object knows no difference...
>
> Joes EventCount or Semaphore would be the "correct" kernel object for a
> waitset.

I can't see how Joe's "lock-free" semaphore is supposed to work
without any corrections of the count owned by the "real" semaphore
(and, beside that, his processCancels() can hit indeterminate value
in "test3"... if called from timedWait(), IIRC).

I continue to ignore "CAS64" stuff all together. ;-)

regards,
alexander.

Joe Seigh

unread,

Oct 20, 2003, 9:05:51 AM10/20/03

to

Basically, you have to show that a thread cannot exit semaphore wait
without a corresponding prior signal, and that a thread in semaphore
wait cannot hang. I don't bother with doing the proof because they
are a lot of work and a complete waste of time since nobody will bother
with them anyway. The convention here is that the code is "proof"
and it's everybody else's job to figure out the code. I can play that
game as well as anybody.

>
> I continue to ignore "CAS64" stuff all together. ;-)

Actually, the only reason the CAS64 atomic_ptr stuff is in
eventcount is to simulate a win32 eventcount object. I'd
prefer using win32 built ins.

Joe Seigh

Alexander Terekhov

unread,

Oct 20, 2003, 9:43:28 AM10/20/03

to

Joe Seigh wrote:
[...]

> I can play that game as well as anybody.

Let's continue playing the game. With NPTL stuff. We've basically done
with mutexes. My conclusion is that the current (0.60) NPTL has it least
one serious race condition and that they could have done "almost the
same stuff" with IA-386 friendly XCHG and a simple "kernel-based" bin.
semaphore. Let's take a look at their futex-based counting semaphore:

<LGPL'd stuff>

int
__new_sem_wait (sem_t *sem)
{
/* First check for cancellation. */
CANCELLATION_P (THREAD_SELF);

int *futex = (int *) sem;
int val;
int err;

do
{
if (*futex > 0)
{
val = atomic_decrement_if_positive (futex);
if (val > 0)
return 0;
}

/* Enable asynchronous cancellation. Required by the standard. */
int oldtype = __pthread_enable_asynccancel ();

err = lll_futex_wait (futex, 0);

/* Disable asynchronous cancellation. */
__pthread_disable_asynccancel (oldtype);
}
while (err == 0 || err == -EWOULDBLOCK);

__set_errno (-err);
return -1;
}

int
__new_sem_trywait (sem_t *sem)
{
int *futex = (int *) sem;
int val;

if (*futex > 0)
{
val = atomic_decrement_if_positive (futex);
if (val > 0)
return 0;
}

__set_errno (EAGAIN);
return -1;
}

int
sem_timedwait (sem_t *sem, const struct timespec *abstime)
{
/* First check for cancellation. */
CANCELLATION_P (THREAD_SELF);

int *futex = (int *) sem;
int val;
int err;

if (*futex > 0)
{
val = atomic_decrement_if_positive (futex);
if (val > 0)
return 0;
}

err = -EINVAL;
if (abstime->tv_nsec < 0 || abstime->tv_nsec >= 1000000000)
goto error_return;

do
{
struct timeval tv;
struct timespec rt;
int sec, nsec;

/* Get the current time. */
__gettimeofday (&tv, NULL);

/* Compute relative timeout. */
sec = abstime->tv_sec - tv.tv_sec;
nsec = abstime->tv_nsec - tv.tv_usec * 1000;
if (nsec < 0)
{
nsec += 1000000000;
--sec;
}

/* Already timed out? */
err = -ETIMEDOUT;
if (sec < 0)
goto error_return;

/* Do wait. */
rt.tv_sec = sec;
rt.tv_nsec = nsec;

/* Enable asynchronous cancellation. Required by the standard. */
int oldtype = __pthread_enable_asynccancel ();

err = lll_futex_timed_wait (futex, 0, &rt);

/* Disable asynchronous cancellation. */
__pthread_disable_asynccancel (oldtype);

if (err != 0 && err != -EWOULDBLOCK)
goto error_return;

val = atomic_decrement_if_positive (futex);
}
while (val <= 0);

return 0;

error_return:
__set_errno (-err);
return -1;
}

int
__new_sem_post (sem_t *sem)
{
int *futex = (int *) sem;
int err, nr;

nr = atomic_exchange_and_add (futex, 1);
err = lll_futex_wake (futex, nr + 1);
if (__builtin_expect (err, 0) < 0)
{
__set_errno (-err);
return -1;
}
return 0;
}

</LGPL'd stuff>

Do you see anything here that you don't "quite understand"?

regards,
alexander.

Joe Seigh

unread,

Oct 20, 2003, 10:06:24 AM10/20/03

to

Alexander Terekhov wrote:
>
> Joe Seigh wrote:
> [...]
> > I can play that game as well as anybody.
>
> Let's continue playing the game. With NPTL stuff. We've basically done
> with mutexes. My conclusion is that the current (0.60) NPTL has it least
> one serious race condition and that they could have done "almost the
> same stuff" with IA-386 friendly XCHG and a simple "kernel-based" bin.
> semaphore. Let's take a look at their futex-based counting semaphore:
>
> <LGPL'd stuff>

(snip)

> </LGPL'd stuff>
>
> Do you see anything here that you don't "quite understand"?
>

It appears to be ok. I'm going more by the DESIGN*.txt files.

Seems a little suboptimal, but not quite as suboptimal as the
condvar implementation. I need to develope a better feel for
the futex ops. I'll probably do a optimal condvar design as
an exercise.

Joe Seigh

Alexander Terekhov

unread,

Oct 20, 2003, 10:28:04 AM10/20/03

to

Joe Seigh wrote:
>
> Alexander Terekhov wrote:

[... NPTL's counting sema... ]

> It appears to be ok. I'm going more by the DESIGN*.txt files.
>

> Seems a little suboptimal, ....

Yeah,

sem_post(sem_t *sem)
{
n = atomic_increment(sem->count);
// Pass the new value of sem->count
futex_wake(&sem->count, n + 1);
}

why "n + 1" (n >= 0; their atomic_increment() returns old
value) instead of simply "1", to begin with? Any ideas?

> I'll probably do a optimal condvar design as an exercise.

Let's try to do "optimal" counting sema first. ;-)

regards,
alexander.

Joe Seigh

unread,

Oct 20, 2003, 11:00:51 AM10/20/03

to

Alexander Terekhov wrote:
>
> Joe Seigh wrote:
> >
> > Alexander Terekhov wrote:
>
> [... NPTL's counting sema... ]
>
> > It appears to be ok. I'm going more by the DESIGN*.txt files.
> >
> > Seems a little suboptimal, ....
>
> Yeah,
>
> sem_post(sem_t *sem)
> {
> n = atomic_increment(sem->count);
> // Pass the new value of sem->count
> futex_wake(&sem->count, n + 1);
> }
>
> why "n + 1" (n >= 0; their atomic_increment() returns old
> value) instead of simply "1", to begin with? Any ideas?

No. I'd have to work up some formalisms to reason about
futex ops with.

>
> > I'll probably do a optimal condvar design as an exercise.
>
> Let's try to do "optimal" counting sema first. ;-)

Well, that wasn't the part I was thinking about but part of
optimizing condvar applies to semaphores as well.

Joe Seigh

unread,

Oct 20, 2003, 11:18:47 AM10/20/03

to

> Alexander Terekhov wrote:
> >
> >
> > why "n + 1" (n >= 0; their atomic_increment() returns old
> > value) instead of simply "1", to begin with? Any ideas?
>
> No. I'd have to work up some formalisms to reason about
> futex ops with.
>

Yeah, "1" will work. A iterative proof will show it works.

I don't know about you, but I'm going to sanity check
my intuition with this new stuff for a while. Especially
if I run into unexplained discrepencies.

Joe Seigh

Sean Burke

unread,

Oct 20, 2003, 2:55:04 PM10/20/03

to

Alexander Terekhov <tere...@web.de> writes:

Not playing the game Alex. I write complete paragraphs,
not vague allusions. You can do the same.

-SEAn

Alexander Terekhov

unread,

Oct 20, 2003, 3:17:18 PM10/20/03

to

Joe Seigh wrote:
>
> > Alexander Terekhov wrote:
> > >
> > >
> > > why "n + 1" (n >= 0; their atomic_increment() returns old
> > > value) instead of simply "1", to begin with? Any ideas?
> >
> > No. I'd have to work up some formalisms to reason about
> > futex ops with.
> >
>
> Yeah, "1" will work.

Sure. I'd probably want to have a 'waiters' count and call
futex_wait() only if the incremented semaphore value is not
greater than the number of waiters. Ideally, this count shall
be maintained/published by the kernel. Hmm, this is would
still require quite expensive barriers -- we would need "the
most expensive" StoreLoad in unlock/post, to begin with.

Oder?

regards,
alexander.

Alexander Terekhov

unread,

Oct 20, 2003, 3:39:53 PM10/20/03

to

Sean Burke wrote:
>
> Alexander Terekhov <tere...@web.de> writes:
>
> > Sean Burke wrote:
> > [...]
> > > - Support for shifting extra waiters on a condvar directly
> > > to the mutex wait queue when a condvar gets a broadcast.
> > ^^^^^^^^^^^^^^^^^^^^^^^
> >
> > Shifting to the internal (CV's one) or the "external" mutex
> > (protecting the wait predicate(s))?
> >
> > >
> > > - You don't have to do anything special for inter-process
> > > sync objects.
> >
> > Well. "See above." ;-)
>
> Not playing the game Alex.

Damn.

> I write complete paragraphs,
> not vague allusions. You can do the same.

http://google.com/groups?selm=3F5E2D13.A2F5DA8C%40web.de
(that was true prior to NPTL 0.60)

http://listman.redhat.com/archives/phil-list/2003-October/msg00003.html
(and I'm still in total darkness WRT "how it supposed to work")

Got it now? "Or am I just missing or misunderstanding something?"

regards,
alexander.

Alexander Terekhov

unread,

Oct 20, 2003, 4:26:45 PM10/20/03

to

We won't need any of those barriers if kernel would maintain
"have waiters" bit inside futex. This seems to be the most
optimal solution for counting semas (and mutexes too, BTW).

I think.

regards,
alexander.

Alexander Terekhov

unread,

Oct 20, 2003, 4:35:14 PM10/20/03

to

Alexander Terekhov wrote:
>
> Alexander Terekhov wrote:
> >
> > Joe Seigh wrote:
> > >
> > > > Alexander Terekhov wrote:
> > > > >
> > > > >
> > > > > why "n + 1" (n >= 0; their atomic_increment() returns old
> > > > > value) instead of simply "1", to begin with? Any ideas?
> > > >
> > > > No. I'd have to work up some formalisms to reason about
> > > > futex ops with.
> > > >
> > >
> > > Yeah, "1" will work.
> >
> > Sure. I'd probably want to have a 'waiters' count and call
> > futex_wait() only if the incremented semaphore value is not

^^^^

Err. futex_WAKE(), not futex_wait(). Of course.

Joe Seigh

unread,

Oct 20, 2003, 4:44:05 PM10/20/03

to

To avoid a kernel call, yes. A simple solutions is to just
increment a wait count before and decrement it after the wait call
and check that just before the wakeup call.

Joe Seigh

Alexander Terekhov

unread,

Oct 20, 2003, 5:09:56 PM10/20/03

to

Joe Seigh wrote:
[...]

> To avoid a kernel call, yes. A simple solutions is to just
> increment a wait count before and decrement it after the wait call
> and check that just before the wakeup call.

Yes, but with relaxed memory ordering you'll have to impose proper
order on both sides, to begin with. On the signaling side, you'll
really need to "read" your wait count AFTER sema count increment is
made visible to all other processors. That's StoreLoad barrier. Very
expensive, as they say. Things are much simpler if we would have one
single "have waiters" bit maintained by kernel in the same "memory
location". Oder?

regards,
alexander.

Joe Seigh

unread,

Oct 20, 2003, 5:20:07 PM10/20/03

to

Alexander Terekhov wrote:
>
> We won't need any of those barriers if kernel would maintain
> "have waiters" bit inside futex. This seems to be the most
> optimal solution for counting semas (and mutexes too, BTW).
>

Well, speaking of optimization, I was going to wait to see
how long it would take someone to figure it out but I don't
have the patience.

futexes are almost like eventcounts, so for a condvar you
basically have a signal count (event count) that gets
atomically incremented everytime there is a signal/broadcast.

What the waiter does is read the signal count value, release
the mutex, futex_wait on the signal count, and then reacquire
the mutex. That's it.

For broadcast, you simply do a wake all after incrementing
the signal count.

For signal, you do a wake one after incrementing the signal
count. Some late arriving waiters may sneak in here on the
signal count change, but spurious wakeups are your friend
here.

And Bob's your uncle.

This is not the super optimized convar that I did earlier.

Joe Seigh