Google 网上论坛不再支持新的 Usenet 帖子或订阅项。历史内容仍可供查看。

Optimize pthread_cond_wait loop?

已查看 164 次
跳至第一个未读帖子

Hallvard B Furuseth

未读,
2008年9月14日 11:45:362008/9/14
收件人
How much does it make sense to optimize a pthread_cond_wait() loop
in a consumer, to reduce the work needed at spurious wakeups? E.g.:

while (workqueue->length == 0) {
workqueue->idle++;
if (workqueue->done) goto done;
pthread_cond_wait(&workqueue->cv, &workqueue->mutex);
workqueue->idle--;
}

can become

if (workqueue->length == 0) {
workqueue->idle++;
/* moving this test outside needs extra code elsewhere when
* we are done - like an "exit consumer thread"-work item. */
if (workqueue->done) goto done;
do {
pthread_cond_wait(&workqueue->cv, &workqueue->mutex);
} while (workqueue->length == 0);
workqueue->idle--;
}

yet the program has in any case paid for a context switch, likely
some reads and writes to update the condition variable and mutex,
and a memory barrier.

(Assume contention for the mutex is noticeable. Spurious wakeups,
as far as the consumer threads are concerned, can come from the OS
or when another consumer finishes its work item and grabs a newly
signalled one, before a waiting thread can see it.)

--
Hallvard

Szabolcs Ferenczi

未读,
2008年9月14日 13:31:432008/9/14
收件人
On Sep 14, 5:45 pm, Hallvard B Furuseth <h.b.furus...@usit.uio.no>
wrote:

> How much does it make sense to optimize a pthread_cond_wait() loop
> in a consumer, to reduce the work needed at spurious wakeups?

It does not make much sense. Forget about it.

On the one hand, in a well balanced application, spurious wakeups do
not consume that much computing resource as it is commonly believed.
Furthermore, context switch time will improve much faster due to
hardware development than it would worth paying much attention to it.

On the other hand, spurious wakeups are left there in the
implementation because the implementors (not just Pthread guys since
the condition variable is not Pthread specific) thought it was
expensive to filter it out.

Nevertheless, you can try for fun.

> E.g.:
>
>   while (workqueue->length == 0) {
>     workqueue->idle++;
>     if (workqueue->done) goto done;
>     pthread_cond_wait(&workqueue->cv, &workqueue->mutex);
>     workqueue->idle--;
>   }
>
> can become
>
>   if (workqueue->length == 0) {
>     workqueue->idle++;
>     /* moving this test outside needs extra code elsewhere when
>      * we are done - like an "exit consumer thread"-work item. */
>     if (workqueue->done) goto done;
>     do {
>       pthread_cond_wait(&workqueue->cv, &workqueue->mutex);
>     } while (workqueue->length == 0);
>     workqueue->idle--;
>   }
>
> yet the program has in any case paid for a context switch, likely
> some reads and writes to update the condition variable and mutex,
> and a memory barrier.

I am afraid you did not improve anything, rather you made it more
complicated. The spurious wakeup is still there and the repeated
evaluation of the condition is also there only you duplicate it. You
think you optimise something that might be a YAGNI anyway (workqueue-
>idle).

> (Assume contention for the mutex is noticeable.  Spurious wakeups,
> as far as the consumer threads are concerned, can come from the OS
> or when another consumer finishes its work item and grabs a newly
> signalled one, before a waiting thread can see it.)

Just to make it clear: Spurious wakeup is meant for the situation when
the wait operation returns, however, there is no corresponding signal
operation for it.

I am afraid you mean something different by spurious wakeup.

Best Regards,
Szabolcs

Hallvard B Furuseth

未读,
2008年9月14日 14:40:292008/9/14
收件人
[Rearranging reply]

Szabolcs Ferenczi writes:
> On Sep 14, 5:45 pm, Hallvard B Furuseth <h.b.furus...@usit.uio.no>
> wrote:
>> How much does it make sense to optimize a pthread_cond_wait() loop
>> in a consumer, to reduce the work needed at spurious wakeups?
> (...)

>> (Assume contention for the mutex is noticeable.  Spurious wakeups,
>> as far as the consumer threads are concerned, can come from the OS
>> or when another consumer finishes its work item and grabs a newly
>> signalled one, before a waiting thread can see it.)
>
> Just to make it clear: Spurious wakeup is meant for the situation when
> the wait operation returns, however, there is no corresponding signal
> operation for it.
>
> I am afraid you mean something different by spurious wakeup.

I just explained what I meant. I did wonder at the end just when a
wakeup was considered "spurious", which is why I added that paragraph.
I guess I should just have said optimizing the situation when the wait
returns, but there is no work to do.

> I am afraid you did not improve anything, rather you made it more
> complicated. The spurious wakeup is still there

Which is the situation I'm asking about. And "more complicated" is of
course why I'm asking. Otherwise there'd be no reason _not_ to do it:
Might not help, but doesn't hurt either. Left to myself I guess I
might replace while...{} with if...{ do{}while... } for that reason,
but not move the done-test out of the loop since that is more work.

> and the repeated
> evaluation of the condition is also there only you duplicate it.

Slightly more code, slightly less work for the thread when it does loop.

--
Hallvard

Markus Elfring

未读,
2008年9月15日 01:27:102008/9/15
收件人
> I did wonder at the end just when a wakeup was considered "spurious",
> which is why I added that paragraph.

Would you like to look at any explanations from previous discussions on the
topic "spurious wakeup"?
http://groups.google.de/group/comp.programming.threads/msg/bb8299804652fdd7

Regards,
Markus

Szabolcs Ferenczi

未读,
2008年9月15日 03:36:542008/9/15
收件人
On Sep 14, 8:40 pm, Hallvard B Furuseth <h.b.furus...@usit.uio.no>

wrote:
> [Rearranging reply]
>
>
>
> Szabolcs Ferenczi writes:
> > On Sep 14, 5:45 pm, Hallvard B Furuseth <h.b.furus...@usit.uio.no>
> > wrote:
> >> How much does it make sense to optimize a pthread_cond_wait() loop
> >> in a consumer, to reduce the work needed at spurious wakeups?
> > (...)
> >> (Assume contention for the mutex is noticeable.  Spurious wakeups,
> >> as far as the consumer threads are concerned, can come from the OS
> >> or when another consumer finishes its work item and grabs a newly
> >> signalled one, before a waiting thread can see it.)
>
> > Just to make it clear: Spurious wakeup is meant for the situation when
> > the wait operation returns, however, there is no corresponding signal
> > operation for it.
>
> > I am afraid you mean something different by spurious wakeup.
>
> I just explained what I meant.  I did wonder at the end just when a
> wakeup was considered "spurious", which is why I added that paragraph.
> I guess I should just have said optimizing the situation when the wait
> returns, but there is no work to do.
>
> > I am afraid you did not improve anything, rather you made it more
> > complicated.  The spurious wakeup is still there
>
> Which is the situation I'm asking about.  And "more complicated" is of
> course why I'm asking.  Otherwise there'd be no reason _not_ to do it:

If you did not really mean spurious wakeup, you probably want to
return more close to the original monitor concept as defined by Hoare:

MONITORS: AN:OPERATING SYSTEM STRUCTURING CONCEPT
ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/73/401/CS-TR-73-401.pdf

1: pthread_mutex_lock(&workqueue->mutex);
2: while (workqueue->length == 0) { // Note: this could be an if
without spurious wakeups
3: pthread_cond_wait(&workqueue->cv, &workqueue->mutex);
4: }
5: ...
6: pthread_mutex_unlock(&workqueue->mutex);
7:...
8: pthread_mutex_lock(&workqueue->mutex);
9: ... // possible change of workqueue->length
10: if (workqueue->length != 0) {
11: pthread_cond_signal(&workqueue->cv);
12: }
13: pthread_mutex_unlock(&workqueue->mutex);

The idea is that when the condition does not hold (line 2) the caller
is suspended on the condition variable. When the condition is made
true (line 10) you give the wakeup signal.

Note 1: You still need the while loop at line 2 to filter out genuine
spurious wakeups.

Note 2: You have to check the predicate in two places which is a
duplication of code but it can be factored out.

Note 3: Checking the condition at the `sender side' is cheaper than
checking it at the `receiver side'

Best Regards,
Szabolcs

Hallvard B Furuseth

未读,
2008年9月15日 17:51:422008/9/15
收件人

OK, most places only talk about OS-generated wakeups as "spurious",
though Dave Butenhof includes "stolen wakeups" (by other threads)
in the term too. So it'd have been best if I didn't mention it.

Anyway, my question remains - is there a point in optimizing such a
pthread_cond_wait loop, to reduce the time the mutex is held? The one I
look at at the moment can be squeezed down from 12 to 5 instructions on
my machine, or to 7 instructions if the 'done' test is kept inside the
loop. But there are memory barriers etc involved too, so - how much
does that save compared to the pthread_cond_wait itself?

--
Hallvard

Dave Butenhof

未读,
2008年9月16日 06:33:252008/9/16
收件人
Hallvard B Furuseth wrote:
> Markus Elfring writes:
>>> I did wonder at the end just when a wakeup was considered "spurious",
>>> which is why I added that paragraph.
>> Would you like to look at any explanations from previous discussions on the
>> topic "spurious wakeup"?
>> http://groups.google.de/group/comp.programming.threads/msg/bb8299804652fdd7
>
> OK, most places only talk about OS-generated wakeups as "spurious",
> though Dave Butenhof includes "stolen wakeups" (by other threads)
> in the term too. So it'd have been best if I didn't mention it.

It depends on your point of view, really. In terms of the interface and
standard definition, a spurious wakeup is when a wait returns without a
corresponding signal or broadcast. However, the code that called wake
has no way to know WHY the wait returns -- only that the wait returned
and the predicate isn't satisfied. From the entirely pragmatic point of
view of the application, it makes no difference whether the
implementation did it for some obscure internal reason, whether the
return is due to a (UNIX) signal from some other independent facility in
the process, or because the application's own code either called
signal/broadcast "optimistically" or "stole" the work in another thread.

All the call site knows is that the wait returned and there's nothing to
do. That's spurious -- and it really doesn't matter why it happened.

> Anyway, my question remains - is there a point in optimizing such a
> pthread_cond_wait loop, to reduce the time the mutex is held? The one I
> look at at the moment can be squeezed down from 12 to 5 instructions on
> my machine, or to 7 instructions if the 'done' test is kept inside the
> loop. But there are memory barriers etc involved too, so - how much
> does that save compared to the pthread_cond_wait itself?

All of this code, either way, is inside the mutex. Minimizing contention
on the mutex, especially one used frequently by many threads, is always
"a good idea" -- but to what extent its worth spending lots of time
depends greatly on the nature of the application. So, sure; if it's
trivial to drop cycles off the hold time, or to remove locks (or
separate them), in general that's great. You may never be able to
measure the impact though.

In your particular case, it's not at all obvious that you HAVE dropped
any cycles off the hold time. If your predicate is almost always
satisfied, there's no difference in your "optimized" code at all (the
first test returns false and drops out, whether it's a while or if). If
your queues are short lived, the extra overhead of, as your comment
points out, added logic to make sure that waiters see the 'done'
condition, may outweigh any trivial average improvement in mutex hold
time per work item.

(On the other hand, if there's a lot of concurrent traffic on the queue,
it's possible that removing the per-iteration increase and decrease of
the idle count could have good effects on cache behavior; particularly
if your queue data shares a cache line with some other shared data.
Whether that will actually help a real application is a good deal less
clear.)

As Szabolcs pointed out, you've made it more complicated, harder to read
and maintain, for what is at best an uncertain real improvement. Is that
tradeoff worthwhile to you and your project? If you can generate real
data showing an improvement in throughput (or some other characteristic
that's meaningful to you), absolutely. If not... do what makes you feel
most comfortable, but my vote would be to keep it simple unless there's
a real reason to make it complicated.

Hallvard B Furuseth

未读,
2008年9月18日 14:57:382008/9/18
收件人
Dave Butenhof <david.b...@hp.com> writes:

> Hallvard B Furuseth wrote:
>> OK, most places only talk about OS-generated wakeups as "spurious",
>> though Dave Butenhof includes "stolen wakeups" (by other threads)
>> in the term too. So it'd have been best if I didn't mention it.
>
> It depends on your point of view, really.

> (..snip...)


> All the call site knows is that the wait returned and there's nothing to
> do. That's spurious -- and it really doesn't matter why it happened.

It certainly makes sense from a programmer's point of view to use the
same term for both. The existence of OS-generated spurious wakeups has
indeed scared me into fixing bugs I didn't know about, as an URL Markus
posted said was in part the intent. (Code which was buggy even in the
absence of OS-generated spurious wakeups, that is.)

>> Anyway, my question remains - is there a point in optimizing such a
>> pthread_cond_wait loop, to reduce the time the mutex is held? The one I
>> look at at the moment can be squeezed down from 12 to 5 instructions on
>> my machine, or to 7 instructions if the 'done' test is kept inside the
>> loop. But there are memory barriers etc involved too, so - how much
>> does that save compared to the pthread_cond_wait itself?
>
> All of this code, either way, is inside the mutex. Minimizing contention
> on the mutex, especially one used frequently by many threads, is always
> "a good idea" -- but to what extent its worth spending lots of time
> depends greatly on the nature of the application. So, sure; if it's
> trivial to drop cycles off the hold time, or to remove locks (or
> separate them), in general that's great.

Thanks. OK, for my current case that means it's worth having a closer
look.

> You may never be able to measure the impact though.

That's the hard part indeed. In particular when I can measure a
performance impact from not changing the program and just running it
again.

> (...) If you can generate real


> data showing an improvement in throughput (or some other characteristic
> that's meaningful to you), absolutely. If not... do what makes you feel
> most comfortable, but my vote would be to keep it simple unless there's
> a real reason to make it complicated.

Yup. Not to worry, that is why I'm not just going ahead and doing it.

--
Hallvard

0 个新帖子