[Boost-users] thread_group::interrupt_all is not reliable

69 views
Skip to first unread message

Stonewall Ballard

unread,
Nov 29, 2009, 7:27:41 PM11/29/09
to boost...@lists.boost.org
I've discovered that under circumstances apparently related to timing
and load, sending interrupt_all to a thread_group when all the threads
are waiting on a boost::condition_variable leaves one thread waiting
about 1/3 of the time. This is with boost 1_40_0 running on Mac OS X
10.6.2, with 32-bit boost libraries. Boost uses the posix thread
system here.

I boiled my app down to some test code that runs as a command-line
app. It's a bit longer than I'd like, but this configuration seems to
be necessary to invoke the problem. The test uses a queue to pass
"tasks" from the main thread to worker threads, and another queue to
pass "results" back to the main thread. The problem is most apparent
when all the tasks are finished and the queue empties, so that all the
worker threads are waiting on the input queue when the main thread
sends interrupt_all.

I've looked at the waiting thread in a debugger when this happens, and
found that it has been interrupted, but is still waiting on the
condition. It looks like it just got missed by the interrupt_all. This
is more likely to happen when there are a lot of worker threads (16,
or one per core in my testing).

The test code is parked at <http://sb.org/ThreadTest.zip>, 20KB. It's
an XCode 3.2 project, but the five source files could be readily
compiled and run in any Unix environment.

I don't see any errors in the code that could cause these failures.
There is a work-around, which is to interrupt the waiting thread
again. This required a modified version of thread_group so I could do
a timed_join_all on it.

I welcome any suggestions about what could be wrong here, or ways to
simplify the test to make it more suitable for a bug report.

- Stoney

--
Stonewall Ballard
sto...@sb.org http://stoney.sb.org/

_______________________________________________
Boost-users mailing list
Boost...@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users

Stonewall Ballard

unread,
Nov 30, 2009, 1:36:02 PM11/30/09
to boost...@lists.boost.org
I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.

This gave me the clue to try that:
<http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadcast.html>
> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().

thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp.

Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_pool::interrupt_all() appears to be 100% reliable.

I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting.

Of course, this could be a bug in the OS X pthreads implementation as well.

- Stoney

Roland Bock

unread,
Dec 1, 2009, 3:29:39 AM12/1/09
to boost...@lists.boost.org
Stonewall Ballard wrote:
> I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
>
> This gave me the clue to try that:
> <http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadcast.html>
>> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
>
> thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp.
>
> Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_pool::interrupt_all() appears to be 100% reliable.
>
> I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting.
>
> Of course, this could be a bug in the OS X pthreads implementation as well.

Hi,

FWIW, I ran that test of yours several times with varying parameters on
my machine (quad core, 64bit, linux) and it did not show a single
failure. Of course, since it is not a deterministic effect even on your
machine, failure to reproduce does not really mean much, but well, I
thought you might like to hear anyway :-)

And I totally agree: Predictable scheduling should not be required to
wake up all threads, especially since the document also says

<snip>


The pthread_cond_broadcast() or pthread_cond_signal() functions may be

called by a thread whether or not it currently owns the mutex [...]
</cite>


As for boiling down the application for others to inspect:
Your debugger showed that the thread is still in wait() after the
interrupt call.

Can you assure that ALL worker threads are in wait() prior to the interrupt?
* If yes: There seems to be no connection with the interlocked queue,
the sleep and so on. It should be possible to get rid of all that
for a much simpler test program
* If no: OK, there seems to be a connection between the wait(), the
interrupt and the sleep and/or mutex.

In any case, I would assume that by analysing the situation right before
the interrupt, you should be able to reproduce the problem with much
less code.

Hope that helps in any way?


Regards,

Roland

Peter Dimov

unread,
Dec 1, 2009, 9:50:25 AM12/1/09
to boost...@lists.boost.org, Anthony Williams
Stonewall Ballard wrote:
> I think I found the cause of this problem. It seems that the caller
> of interrupt_all should be holding the mutex associated with the
> condition on which the threads are waiting.

This looks like a classic CV pitfall when using "atomic" predicates that are
not protected by the mutex used for the wait. The basic outline is that
thread A does a CV wait on an atomic boolean variable, and thread B sets
this variable and does a notify. There exists a race in which A sees false,
is preempted, B stores true, does notify, waking up nobody, and then A
continues with the wait. The cure is to insert an empty lock+unlock of A's
mutex in B between the store and the notify; it doesn't need to encompass
the store, and it doesn't need to encompass the notify call, either.

I can't read the boost.thread code well enough to be able to diagnose the
problem, but from a cursory look, it looks possible to me that
condition_variable::wait may perform its interruption check, be preempted,
the interrupt can proceed with setting interrupt_requested and doing a
broadcast, and then condition_variable::wait to block on its
pthread_cond_wait.

Your workaround does prevent it, but I think that boost.thread should take
care of that internally, by storing a mutex pointer in the thread data,
along with the cv pointer.

Stonewall Ballard

unread,
Dec 1, 2009, 9:53:46 AM12/1/09
to Roland Bock, boost...@lists.boost.org

On Dec 1, 2009, at 3:29 AM, Roland Bock wrote:

> Stonewall Ballard wrote:
>> I think I found the cause of this problem. It seems that the caller of interrupt_all should be holding the mutex associated with the condition on which the threads are waiting.
>> This gave me the clue to try that:
>> <http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadcast.html>
>>> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
>> thread::interrupt() calls pthread_cond_broadcast in pthread/thread.cpp.
>> Although "predictable scheduling" doesn't seem like it should include a failure to wake up, taking the mutex around the call to thread_pool::interrupt_all() appears to be 100% reliable.
>> I can patch my app to do that, but I don't think there's a general solution. The documentation should include a note that thread::interrupt() isn't reliable unless the caller is holding the mutex associated with the condition variable on which the interrupted thread is waiting.
>> Of course, this could be a bug in the OS X pthreads implementation as well.
>
> Hi,
>
> FWIW, I ran that test of yours several times with varying parameters on my machine (quad core, 64bit, linux) and it did not show a single failure. Of course, since it is not a deterministic effect even on your machine, failure to reproduce does not really mean much, but well, I thought you might like to hear anyway :-)

Thanks, but this doesn't surprise me. Since the reliability drops rapidly as I add threads, I suspect it has something to do with running this on an 8-core (16 hyperthread) machine. I also suspect that it's a Mac OS bug.

> And I totally agree: Predictable scheduling should not be required to wake up all threads, especially since the document also says
>
> <snip>
> The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex [...]
> </cite>

Of course, that could just be a promise that it won't crash. I hope you're right, though.

> As for boiling down the application for others to inspect:
> Your debugger showed that the thread is still in wait() after the interrupt call.
>
> Can you assure that ALL worker threads are in wait() prior to the interrupt?
> * If yes: There seems to be no connection with the interlocked queue,
> the sleep and so on. It should be possible to get rid of all that
> for a much simpler test program
> * If no: OK, there seems to be a connection between the wait(), the
> interrupt and the sleep and/or mutex.

Yes, I added code to check that, and all 16 threads were waiting on that condition when I interrupted them.

> In any case, I would assume that by analysing the situation right before the interrupt, you should be able to reproduce the problem with much less code.

If I understand this correctly, I should be able to reproduce it with pthreads alone (no boost). I'm going to try that when I get some time and file a bug report.

> Hope that helps in any way?

Yes, thanks. Reasoning about threads requires code review.

- Stoney

_______________________________________________

Stonewall Ballard

unread,
Dec 1, 2009, 10:32:27 AM12/1/09
to Peter Dimov, Anthony Williams, boost...@lists.boost.org

On Dec 1, 2009, at 9:50 AM, Peter Dimov wrote:

> Stonewall Ballard wrote:
>> I think I found the cause of this problem. It seems that the caller
>> of interrupt_all should be holding the mutex associated with the
>> condition on which the threads are waiting.
>
> This looks like a classic CV pitfall when using "atomic" predicates that are not protected by the mutex used for the wait. The basic outline is that thread A does a CV wait on an atomic boolean variable, and thread B sets this variable and does a notify. There exists a race in which A sees false, is preempted, B stores true, does notify, waking up nobody, and then A continues with the wait. The cure is to insert an empty lock+unlock of A's mutex in B between the store and the notify; it doesn't need to encompass the store, and it doesn't need to encompass the notify call, either.
>
> I can't read the boost.thread code well enough to be able to diagnose the problem, but from a cursory look, it looks possible to me that condition_variable::wait may perform its interruption check, be preempted, the interrupt can proceed with setting interrupt_requested and doing a broadcast, and then condition_variable::wait to block on its pthread_cond_wait.

I've found that all the threads are waiting on the condition variable at the time that I interrupt them, so that scenario doesn't seem to apply.

The actual interrupt code in pthread/thread.cpp is:

void thread::interrupt()
{
detail::thread_data_ptr const local_thread_info=get_thread_info();
if(local_thread_info)
{
lock_guard<mutex> lk(local_thread_info->data_mutex);
local_thread_info->interrupt_requested=true;
if(local_thread_info->current_cond)
{
BOOST_VERIFY(!pthread_cond_broadcast(local_thread_info->current_cond));
}
}
}

As you can see, each thread has a mutex protecting its internal state, and the interrupter locks that mutex before setting interrupt_requested to true. I think that this precludes a race condition there.

I think that what's actually happening here is that 16 threads are waiting on the same condition, then they're individually interrupted (using the above code). Each interrupt wakes all the threads waiting on the condition, where all but one thread locks the mutex, see that it's not interrupted, then unlocks the mutex and waits. Given the hardware parallelism on my 16-hyperthread Mac Pro, I suspect that there's ample opportunity for a bug.

> Your workaround does prevent it, but I think that boost.thread should take care of that internally, by storing a mutex pointer in the thread data, along with the cv pointer.

As I'm getting more convinced that this is bug in pthreads, at least on OS X, it seems unwise to add that to boost::thread. A pthread condition variable knows its mutex, of course, but there doesn't appear to be any way to retrieve it.

I'm going to try to reproduce this using just pthreads calls.

Thanks for your help.

- Stoney

_______________________________________________

Peter Dimov

unread,
Dec 1, 2009, 11:43:34 AM12/1/09
to Stonewall Ballard, Anthony Williams, boost...@lists.boost.org
Stonewall Ballard wrote:
> On Dec 1, 2009, at 9:50 AM, Peter Dimov wrote:
>
>> I can't read the boost.thread code well enough to be able to
>> diagnose the problem, but from a cursory look, it looks possible to
>> me that condition_variable::wait may perform its interruption check,
>> be preempted, the interrupt can proceed with setting
>> interrupt_requested and doing a broadcast, and then
>> condition_variable::wait to block on its pthread_cond_wait.
>
> I've found that all the threads are waiting on the condition variable
> at the time that I interrupt them, so that scenario doesn't seem to
> apply.

They are before interrupt_all is issued, but then, as you yourself write:

> I think that what's actually happening here is that 16 threads are
> waiting on the same condition, then they're individually interrupted
> (using the above code). Each interrupt wakes all the threads waiting
> on the condition, where all but one thread locks the mutex, see that
> it's not interrupted, then unlocks the mutex and waits.

they are getting interrupted and awakened one by one, the end result being
that condition_variable::wait and thread::interrupt are being called in
parallel.

> The actual interrupt code in pthread/thread.cpp is:

...


> As you can see, each thread has a mutex protecting its internal
> state, and the interrupter locks that mutex before setting
> interrupt_requested to true. I think that this precludes a race
> condition there.

It doesn't.

inline void condition_variable::wait(unique_lock<mutex>& m)
{
detail::interruption_checker check_for_interruption(&cond);

// thread::interrupt is executed here

BOOST_VERIFY(!pthread_cond_wait(&cond,m.mutex()->native_handle()));

Stonewall Ballard

unread,
Dec 1, 2009, 12:52:39 PM12/1/09
to Peter Dimov, Anthony Williams, boost...@lists.boost.org

On Dec 1, 2009, at 12:12 PM, Anthony Williams wrote:

> Peter is right.
>
> The mutex that is locked across the broadcast protects the interrupt
> flag, but is distinct from the mutex that is associated with the
> condition variable. It is thus possible for the waiting thread to miss
> an interruption between when it checks for interruption and when it
> enters pthread_cond_wait.
>
> Anthony

I see that now. It's necessary for the interrupter to lock the condition variable mutex before interrupting the thread. The likelihood of failure would depend on the number of threads waiting on the condition variable, and the use of an interrupt_all() to interrupt them repeatedly.

So this really is a bug in boost::thread, at least with the pthreads implementation. I suppose that makes it easier to fix than a bug in pthreads.

Peter's suggestion:


On Dec 1, 2009, at 9:50 AM, Peter Dimov wrote:

> Your workaround does prevent it, but I think that boost.thread should take care of that internally, by storing a mutex pointer in the thread data, along with the cv pointer.

seems like the right thing to do for a general fix.

Thanks to all for your help. I'm confident now that my workaround will keep this from happening in my app, but a proper fix to boost::threads is necessary.

I'm surprised that I can't find any other reports of this problem.

- Stoney

_______________________________________________

Roland Bock

unread,
Dec 1, 2009, 4:07:13 PM12/1/09
to boost...@lists.boost.org
Stonewall Ballard wrote:
> The test code is parked at <http://sb.org/ThreadTest.zip>, 20KB. It's
> an XCode 3.2 project, but the five source files could be readily
> compiled and run in any Unix environment.

I admit it took several times of reading this thread until I finally
thought I got it (which might have something to do with several rather
long days of PHP & JavaScript hacking).

Apart from the original problem, I wonder if you could shed some light
on a comment in your WorkQueue code:

CONTROL:
void increment()
{
boost::mutex::scoped_lock lock(the_mutex);
++tasks_available;
lock.unlock();
// notify_one may fail to wake a worker if there
// are multiple items in the queue, so it's better
// to waste a bit of CPU and notify_all
the_condition_variable.notify_all();
}

WORKER:
void wait_and_decrement()
{
boost::mutex::scoped_lock lock(the_mutex);
while ( tasks_available == 0 ) {
the_condition_variable.wait(lock);
}
--tasks_available;
}

What would be a scenario in which notify_one would fail? I would assume
that a problem with notify_one would occur only if you would do
something like

tasks_available += 42;


instead of

++tasks_available;

In such a case, notify_one would wake less threads than required. But
how can that happen with notify_one being called for each and every task?

Regards,

Roland

PS: Thank you for starting this thread (and Peter and Anthony for
participating, of course). Very interesting, because I have written a
different kind of threadpool a few days ago (no queues, but you can call
a method addTask() which will hand the task to the next thread available
(blocking for as long as all threads are working on previously added
tasks)). The destructor calls interupt_all() with potentially many
threads in wait(). I probably would have run into the same problems as
you did when using it in production (8-core).

Stonewall Ballard

unread,
Dec 1, 2009, 5:09:13 PM12/1/09
to Roland Bock, boost...@lists.boost.org

Take a look at this:
<http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html>
which is where I got original design for the WorkQueue I'm using in my app. There is a comment and reply on that page:

> Okay, I think I have found an issue when multiple consumers use the queue. Assume all consumers are waiting for new data to be pushed onto the queue. When the producer then pushes multiple items in short succession, i.e. so quick that the first consumer to wake up cannot empty the queue again, then the_condition_variable.notify_one() is only called once (because it is blocked by the 'was_empty if' later). It seems to work for me if I replace notify_one() by notify_all().
>
> Btw, I hope that all consumer threads waking up at the same is not a problem, but as far as I understand the notification mechanism, only one of them will acquire control over the mutex...
>
> Let me know what you think...
> by David at 23:14:37 on Sunday, 07 September 2008
> Hi David,
>
> You're right. Thanks for spotting that. I guess my testing was not exhaustive enough :-(
>
> The only impact of waking all the consumers is that they consume CPU time: if there's nothing in the queue they just treat it as a spurious wake and go back to sleep.
>
> by Anthony Williams at 08:00:20 on Monday, 08 September 2008

I added the comment to the code in case someone came along later and decided that it was wasteful to notify_all().

> PS: Thank you for starting this thread (and Peter and Anthony for participating, of course). Very interesting, because I have written a different kind of threadpool a few days ago (no queues, but you can call a method addTask() which will hand the task to the next thread available (blocking for as long as all threads are working on previously added tasks)). The destructor calls interupt_all() with potentially many threads in wait(). I probably would have run into the same problems as you did when using it in production (8-core).

This is the first time I've used boost::thread, so I guess that code is fairly new and not sufficiently beaten-upon. This discussion was very enlightening to me too. I greatly appreciate having experienced people available to look at the problem.

My code is running in the idle loop of a dialog window, so I had to avoid blocking the producer for more than a moment. Since there could be many more available tasks than threads, a queue was the obvious choice.

- Stoney

_______________________________________________

Roland Bock

unread,
Dec 1, 2009, 6:12:01 PM12/1/09
to boost...@lists.boost.org
Stonewall Ballard wrote:
[...]

>> In such a case, notify_one would wake less threads than required. But how can that happen with notify_one being called for each and every task?
>
> Take a look at this:
> <http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html>
[...]

> I added the comment to the code in case someone came along later and decided that it was wasteful to notify_all().

Hmm. Have you read the section "UPDATE" on the page? To me that says
that the requirement for the notify_all has already been taken care of
in the presented code (by calling it each time a task is pushed).

>> PS: Thank you for starting this thread (and Peter and Anthony for participating, of course). Very interesting, because I have written a different kind of threadpool a few days ago (no queues, but you can call a method addTask() which will hand the task to the next thread available (blocking for as long as all threads are working on previously added tasks)). The destructor calls interupt_all() with potentially many threads in wait(). I probably would have run into the same problems as you did when using it in production (8-core).
>
> This is the first time I've used boost::thread, so I guess that code is fairly new and not sufficiently beaten-upon. This discussion was very enlightening to me too. I greatly appreciate having experienced people available to look at the problem.
>
> My code is running in the idle loop of a dialog window, so I had to avoid blocking the producer for more than a moment. Since there could be many more available tasks than threads, a queue was the obvious choice.

Yes, a blocking dialog would be bad. I am using the threadpool for a
merge process which can delegate the processing of chunks to its
workers. At the time I wrote it, I thought that blocking would not be
problem (and in the intended scenario it isn't), because I can rarely
use all available threads due to the sheer size of the chunks. The
merger blocks that anyway.

But after today, I think I will change my pool. Limiting the number of
tasks is part of the merge logic, the pool should not do something like
that.

Regards,

Roland

Stonewall Ballard

unread,
Dec 1, 2009, 6:22:47 PM12/1/09
to boost...@lists.boost.org

On Dec 1, 2009, at 6:05 PM, Roland Bock wrote:

> Stonewall Ballard wrote:
> [...]


>>> In such a case, notify_one would wake less threads than required. But how can that happen with notify_one being called for each and every task?
>> Take a look at this:
>> <http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html>

> [...]


>> I added the comment to the code in case someone came along later and decided that it was wasteful to notify_all().
>

> Hmm. Have you read the section "UPDATE" on the page? To me that says that the requirement for the notify_all has already been taken care of in the presented code (by calling it each time a task is pushed).

Ah, I'd missed that. Thanks for pointing it out. It takes several readings to understand all this.

>> My code is running in the idle loop of a dialog window, so I had to avoid blocking the producer for more than a moment. Since there could be many more available tasks than threads, a queue was the obvious choice.
>

> Yes, a blocking dialog would be bad. I am using the threadpool for a merge process which can delegate the processing of chunks to its workers. At the time I wrote it, I thought that blocking would not be problem (and in the intended scenario it isn't), because I can rarely use all available threads due to the sheer size of the chunks. The merger blocks that anyway.
>
> But after today, I think I will change my pool. Limiting the number of tasks is part of the merge logic, the pool should not do something like that.

May as well code a general-purpose pool so that it's debugged by the time you need it for another program.

Thanks again for your help.

Peter Dimov

unread,
Dec 3, 2009, 9:38:36 AM12/3/09
to Stonewall Ballard, Anthony Williams, boost...@lists.boost.org
Stonewall Ballard wrote:
> I see that now. It's necessary for the interrupter to lock the
> condition variable mutex before interrupting the thread. The
> likelihood of failure would depend on the number of threads waiting
> on the condition variable, and the use of an interrupt_all() to
> interrupt them repeatedly.
>
> So this really is a bug in boost::thread, at least with the pthreads
> implementation. I suppose that makes it easier to fix than a bug in
> pthreads.

...

> I'm surprised that I can't find any other reports of this problem.

You might want to file a ticket for this issue on svn.boost.org so that it
doesn't get lost.

Stonewall Ballard

unread,
Dec 8, 2009, 12:50:37 PM12/8/09
to Peter Dimov, Anthony Williams, boost...@lists.boost.org

On Dec 3, 2009, at 9:38 AM, Peter Dimov wrote:

> Stonewall Ballard wrote:
>> I see that now. It's necessary for the interrupter to lock the
>> condition variable mutex before interrupting the thread. The
>> likelihood of failure would depend on the number of threads waiting
>> on the condition variable, and the use of an interrupt_all() to
>> interrupt them repeatedly.
>>
>> So this really is a bug in boost::thread, at least with the pthreads
>> implementation. I suppose that makes it easier to fix than a bug in
>> pthreads.
>
> ...
>
>> I'm surprised that I can't find any other reports of this problem.
>
> You might want to file a ticket for this issue on svn.boost.org so that it doesn't get lost.

Done. Ticket #3735.

- Stoney

_______________________________________________

Reply all
Reply to author
Forward
0 new messages