sem_post() is a signal safe function, which means that it's normally
heavier to use than mutexes.
However, I took a look to its implementation on Linux, and I don't see
where is the actual cost.
Below is the implementation on Linux, and there's nothing about
signals here.
Could anybody tell me if semaphore are really expensive ?
int
__new_sem_post (sem_t *sem)
{
struct new_sem *isem = (struct new_sem *) sem;
__typeof (isem->value) cur;
do
{
cur = isem->value;
if (isem->value == SEM_VALUE_MAX)
{
__set_errno (EOVERFLOW);
return -1;
}
}
while (atomic_compare_and_exchange_bool_acq (&isem->value, cur + 1,
cur));
atomic_full_barrier ();
if (isem->nwaiters > 0)
{
int err = lll_futex_wake (&isem->value, 1,
isem->private ^ FUTEX_PRIVATE_FLAG);
if (__builtin_expect (err, 0) < 0)
{
__set_errno (-err);
return -1;
}
}
return 0;
}
/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
Oh my gosh! Is it only me who thinks that here must be
atomic_compare_and_exchange_bool_rel(), i.e. "_rel" instead of "_acq"?
Otherwise it does not provides sequential consistency. Think of the
following scenario.
Initially semaphore value = 0. Thread 1 setups some dependent state,
then increments semaphore value with sem_post(). Thread 2 reads
semaphore value with sem_get_value() and observes semaphore value = 1,
then thread 2 reads dependent state. There is no synchronizes-with
edge (release->acquire) between threads, so read of dependent state
races with modification. Oh boy!
In general, in order to provide sequential consistency every function
that directly or indirectly mutates shared state must be 'release',
and every function that directly of indirectly reveals shared state
must be 'acquire'.
--
Dmitriy V'jukov
In general, 'signal safety' means only that function must be
reenterant (can be executed by a thread, and repeatedly entered from
signal handler). This in turn means only that one must use 'lock-free
techniques' (atomically change data from one consistent state to
another). And 'lock-free techniques' can be either faster or slower
than other techniques, they are actually orthogonal.
The implementation you shown looks quite fast and optimal, i.e. the
function would probably implemented the same if there would no 'signal
safety' requirement.
-
Dmitriy V'jukov
Ok, that's actually what I was thinking.
But the section of "Programming with POSIX threads", "6.6.6
Semaphores: synchronizing with a signal-catching function", insists on
the fact that sem_post() is more expensive because it must be signal
safe.
So it's not true anymore.
>> int
>> __new_sem_post (sem_t *sem)
>> {
>> struct new_sem *isem = (struct new_sem *) sem;
>>
>> __typeof (isem->value) cur;
>> do
>> {
>> cur = isem->value;
>> if (isem->value == SEM_VALUE_MAX)
>> {
>> __set_errno (EOVERFLOW);
>> return -1;
>> }
>> }
>> while (atomic_compare_and_exchange_bool_acq (&isem->value, cur + 1,
>> cur));
>
> /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
>
> Oh my gosh! Is it only me who thinks that here must be
> atomic_compare_and_exchange_bool_rel(), i.e. "_rel" instead of "_acq"?
No, I agree. We found a similar issue in the pthread mutex locking not
too long ago. To be fair, on x86 it probably makes no difference as the
rel/acq behaviour is implicit in the atomic instruction. I bet you
could get some interesting bugs on mips though.
Someone should probably file a bug against glibc. I think it's
currently maintained primarily by Ulrich Drepper, and uses the redhat
bugzilla:
http://sources.redhat.com/bugzilla/
Chris
By the same reason read "cur = isem->value" must be 'acquire', because
if the function returns EOVERFLOW, it reveals state of the semaphore.
--
Dmitriy V'jukov
I believe it is a problem on SPARC RMO too. And 64-bit Linux runs
SPARC in RMO mode.
And there also may be a problem on compiler level (compiler
reorderings), which may come out on all architectures.
--
Dmitriy V'jukov
And, of course, IA-64.
--
Dmitriy V'jukov
Think of the following program.
Initially semaphore value = SEM_VALUE_MAX - 1
int data1 = 0;
int data2 = 0;
// thread 1
data1 = 1;
if (sem_post(sem) && errno == EOVERFLOW)
assert(data2 == 1);
// thread 2
data2 = 1;
if (sem_post(sem) && errno == EOVERFLOW)
assert(data1 == 1);
This program has no races under sequential consistency. However with
current sem_post() implementation it has races. IF function (even
indirectly) reveals state (return -1 and sets errno to EOVERFLOW) it
MUST be 'acquire'.
--
Dmitriy V'jukov
Sorry for the stupid question but what do _acq/_rel mean ?
I took a look to the atomic_compare_and_exchange_bool_rel()
implementation and it's actually the same as
atomic_compare_and_exchange_bool_acq() for _all_ architectures.
I think good place to study acquire/release semantics is C++0x draft:
http://www.open-std.org/Jtc1/sc22/wg21/docs/papers/2009/
(see sections 1.10 and 29)
> I took a look to the atomic_compare_and_exchange_bool_rel()
> implementation and it's actually the same as
> atomic_compare_and_exchange_bool_acq() for _all_ architectures.
It's just an implementation detail, they definitely may be implemented
differently on IA-64, SPARC RMO and some other.
--
Dmitriy V'jukov
> Sorry for the stupid question but what do _acq/_rel mean ?
> I took a look to the atomic_compare_and_exchange_bool_rel()
> implementation and it's actually the same as
> atomic_compare_and_exchange_bool_acq() for _all_ architectures.
You need to also look at the glibc ports package, where the less common
architectures live. The following is for mips...notice that the last
two parameters have their positions swapped.
#define atomic_compare_and_exchange_bool_acq(mem, new, old) \
__atomic_bool_bysize (__arch_compare_and_exchange_bool, int, \
mem, new, old, "", MIPS_SYNC_STR)
#define atomic_compare_and_exchange_bool_rel(mem, new, old) \
__atomic_bool_bysize (__arch_compare_and_exchange_bool, int, \
mem, new, old, MIPS_SYNC_STR, "")
Chris
Ah ok thanks I missed that point.
Could you give me the pointer for the mips specific part ?
I would like to look at the implementation details more closely.
The answer is more complicated than that. I wrote from the perspective
of the history and intent of the POSIX working group, more than from the
perspective of any particular operating system implementation. There
were a lot of reasons for this, but one particular practical reason was
that there weren't many implementations at that time. (Really only two
"mainstream"; Solaris and Tru64 UNIX, and neither were at that time
complete or fully optimized.)
In general, semaphores represent a kernel operation, involving a trap
into kernel mode -- something of a "context switch". Compared to normal
mutex operations, which can be performed entirely in user mode, a kernel
call is expensive.
The basic notion of the "futex", sharing functions between kernel and
user space, wasn't unknown, and similar concepts had been applied for
shared (cross-process) mutexes and condition variables. I admit it
hadn't really occurred to me that someone would bother to do that for
semaphores, which are generally a more cumbersome and less flexible
synchronization model than mutexes and condition variables. (Though they
certainly do have their place.)
Furthermore, the futex wait and wake operations are kernel calls; if
there is a waiter, your signal handler's sem_post() will make a kernel
call. A kernel call is never necessary for a non-shared mutex or
condition variable. Never necessary in "pure POSIX" that is, or in many
implementations of POSIX threads. Since Linux mutexes and condition
variables are also based on futexes, and Linux threads are always 1:1
kernel threads, mutexes are put in the same position. (And in fairness,
the kernel linkage here is pretty fast.)
In the final analysis, I probably should have said something more
"weaselly" along the lines of "on many implementations sem_post() is
likely to be more expensive". At the time, this didn't seem justified;
but that's only one of many things that's changed in over 15 years.
It remains true that you're better off avoiding synchronization from a
signal catching function -- and ideally even avoiding asynchronous
signal catching at all. A signal pre-empts one of your active threads
for a completely different function, regardless of what it's doing at
the time, and that's almost always a bad idea. Use sigwait*() instead,
and you'll be better off... and questions like this become less important.
Dave Butenhof <david.b...@hp.com> writes:
> Francis Moreau wrote:
>>
>> But the section of "Programming with POSIX threads", "6.6.6
>> Semaphores: synchronizing with a signal-catching function", insists on
>> the fact that sem_post() is more expensive because it must be signal
>> safe.
>>
>> So it's not true anymore.
>
> The answer is more complicated than that.
Yes, definitely. I should have added "on gnu/linux systems".
> I wrote from the perspective of the history and intent of the POSIX
> working group, more than from the perspective of any particular
> operating system implementation. There were a lot of reasons for this,
> but one particular practical reason was that there weren't many
> implementations at that time. (Really only two "mainstream"; Solaris
> and Tru64 UNIX, and neither were at that time complete or fully
> optimized.)
>
> In general, semaphores represent a kernel operation, involving a trap
> into kernel mode -- something of a "context switch". Compared to
> normal mutex operations, which can be performed entirely in user mode,
> a kernel call is expensive.
This is true if threads are 1:n kernel threads, isn't it ?
However, as you said below, this is not the case on Linux system where
the kernel/user thread mapping is 1:1. And in this case mutex
operations
also need some kernel calls.
> The basic notion of the "futex", sharing functions between kernel and
> user space, wasn't unknown, and similar concepts had been applied for
> shared (cross-process) mutexes and condition variables. I admit it
> hadn't really occurred to me that someone would bother to do that for
> semaphores, which are generally a more cumbersome and less flexible
> synchronization model than mutexes and condition variables. (Though
> they certainly do have their place.)
Why not ?
Why did't you think it didn't worth to optimize the semphores
operations ?
>
> Furthermore, the futex wait and wake operations are kernel calls; if
> there is a waiter, your signal handler's sem_post() will make a kernel
> call. A kernel call is never necessary for a non-shared mutex or
> condition variable. Never necessary in "pure POSIX" that is, or in
> many implementations of POSIX threads. Since Linux mutexes and
> condition variables are also based on futexes, and Linux threads are
> always 1:1 kernel threads, mutexes are put in the same position. (And
> in fairness, the kernel linkage here is pretty fast.)
>
> In the final analysis, I probably should have said something more
> "weaselly" along the lines of "on many implementations sem_post() is
> likely to be more expensive". At the time, this didn't seem justified;
> but that's only one of many things that's changed in over 15 years.
Yes, that probably means it's high time to release a new version of
your
book ;)
>
>
> It remains true that you're better off avoiding synchronization from a
> signal catching function -- and ideally even avoiding asynchronous
> signal catching at all. A signal pre-empts one of your active threads
> for a completely different function, regardless of what it's doing at
> the time, and that's almost always a bad idea. Use sigwait*() instead,
> and you'll be better off... and questions like this become less
> important.
I understood this point and agree with you.
Download the latest glibc-ports package from
"http://ftp.gnu.org/gnu/glibc/" and it will have the mips version. I
think you can also get it via git.
Chris
>> In general, semaphores represent a kernel operation, involving a trap
>> into kernel mode -- something of a "context switch". Compared to
>> normal mutex operations, which can be performed entirely in user mode,
>> a kernel call is expensive.
>
> This is true if threads are 1:n kernel threads, isn't it ?
>
> However, as you said below, this is not the case on Linux system where
> the kernel/user thread mapping is 1:1. And in this case mutex
> operations also need some kernel calls.
And even that is a bit of a simplification. Given the POSIX 'process
shared' attribute for mutexes and condition variables, an implementation
must be prepared in some paths to make a kernel call to block and
unblock across address spaces. Essentially this comes to look somewhat
like the futex, although perhaps in most cases less generalized.
>> The basic notion of the "futex", sharing functions between kernel and
>> user space, wasn't unknown, and similar concepts had been applied for
>> shared (cross-process) mutexes and condition variables. I admit it
>> hadn't really occurred to me that someone would bother to do that for
>> semaphores, which are generally a more cumbersome and less flexible
>> synchronization model than mutexes and condition variables. (Though
>> they certainly do have their place.)
>
> Why not ?
>
> Why did't you think it didn't worth to optimize the semphores
> operations ?
To be accurate, I didn't say I thought it wasn't worthwhile. I said it
never occurred to me that anyone would do it. Not quite the same thing.
There's no reason you shouldn't be able to do all your synchronization
with semaphores if you're more comfortable that way. It just wasn't the
POSIX model, and we weren't specifically designing to accommodate
"semaphore intensive" operations because nobody there saw that as more
important than getting mutexes and condition variables right.
>> In the final analysis, I probably should have said something more
>> "weaselly" along the lines of "on many implementations sem_post() is
>> likely to be more expensive". At the time, this didn't seem justified;
>> but that's only one of many things that's changed in over 15 years.
>
> Yes, that probably means it's high time to release a new version of
> your book ;)
Sure; I've been saying that for years. The problem is finding time to do
the work. I have a full time job that does not include writing thread
books, and a full time family. I haven't found a way to take out a big
enough chunk of time to make headway.
Ok, I think I got the idea, although such paper seems not the most
appropriate documentation if you want to understand memory ordering,
IMHO.
So the acq/rel notation seems coming from the C++ world although I'm
not sure it's a good idea to hide the memory barriers inside this
naming convention (see which kind of bugs it can trigger).
BTW, why this 'atomic_full_barrier()' needed if
atomic_compare_and_exchange_bool_acq() should already have an implicit
barrier (well actually atomic_compare_and_exchange_bool_rel()) ?
Thanks
Here is a critical store-load sequence like in Dekker/Peterson mutual
exclusion algorithm. Critical store-load sequence always requires
#StoreLoad style memory barrier in between.
Critical store here is done to isem->value, and critical load is done
from isem->nwaiters.
W/o #StoreLoad barrier you may end up with the following situation.
Thread that executes sem_post() will set isem->value to 1, and will
load 0 from isem->nwaiters (i.e. no need to wake up anybody). And
thread that executes sem_wait() will set isem->nwaiters to 1, and will
load 0 from isem->value (i.e. have to wait). Thus you will end up with
a deadlock - sema count is > 0 but thread is blocked.
#StoreLoad guarantees that either thread 1 will see isem->nwaiters ==
1 or thread 2 will see isem->value == 1 or both.
--
Dmitriy V'jukov
Yes, the C++ standard is not light reading.
> So the acq/rel notation seems coming from the C++ world although I'm
> not sure it's a good idea to hide the memory barriers inside this
> naming convention (see which kind of bugs it can trigger).
The terms acquire and release predate the C++ memory model.
The basic idea is that a load-acquire is guaranteed to see the state of
memory visible to another thread at the time that thread did a
store-release. On some architectures this requires an explicit barrier
or a special store or load instruction, on others it's implicit in the
basic store and load instructions.
A relaxed load does not place any requirements on ordering, and may see
values written by other threads "out of sequence".
Slides 9-12 from my presentation at ACCU2008 provide a few examples to
illustrate.
http://www.justsoftwaresolutions.co.uk/files/future_of_concurrency.pdf
There's also a big section on the C++ memory model in chapter 5 of my
book:
http://www.manning.com/williams
Anthony
--
Author of C++ Concurrency in Action | http://www.manning.com/williams
just::thread C++0x thread library | http://www.stdthread.co.uk
Just Software Solutions Ltd | http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976
It's not to say that one may use "basic store and load" on the C
language level on such architectures. There is always some compiler
participation involved regardless of the hardware architecture (if you
do not code directly in assembly/machine codes, of course).
--
Dmitriy V'jukov
Agreed: you don't want the compiler rearranging your code so that a
write physically happens after the store-release that's supposed to make
it visible.
If you want a release operation, you need to mark it as such. On some
architectures the instruction generated will just be a basic store, but
the marking conveys important information to the compiler about how it
can rearrange your code in the optimizer. The same applies to acquire
operations.