A C++11 implementation bug: condition variable, atomic, threads


U.Mutlu

Jan 12, 2016, 1:35:12 PM
to std-dis...@isocpp.org
/*
condition variable example (but is [sometimes] buggy)

Code adapted from the example at [1]

Question:
How to fix this code?
--> it could be a low-level issue, ie. compiler-, stdlibrary-, or pthread-issue

What-it-does:
10 threads race for a job. The main thread supplies the job.
One of the threads takes the job and processes it.
This is repeated 10000 times.
BUT: sometimes the pgm freezes (ie. a deadlock bug happens) --> see below

Compile:
g++ -Wall -O2 -std=gnu++11 test_thread_cv_ex.cpp -lpthread

Run:
./a.out
repeat multiple times (up to 10 times should be ok for the runtime bug to
show up)

Symptoms of the bug:
sometimes pgm lands in an endless-loop! ie. deadlock happens --> pgm not
finishing

Test environment:
- g++ --version
g++ (Debian 4.9.2-10) 4.9.2

- uname -a
Linux my 4.2.0-0.bpo.1-amd64 #1 SMP Debian 4.2.6-3~bpo8+2 (2015-12-14)
x86_64 GNU/Linux

- dpkg -l | grep -i pthread
ii libpthread-stubs0-dev:amd64 0.3-4 amd64 pthread stubs not provided
by native libc, development files

See also:
[1] http://www.cplusplus.com/reference/condition_variable/condition_variable/
[2] http://en.cppreference.com/w/cpp/thread/condition_variable
*/

#include <iostream>
#include <atomic>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <unistd.h>

using namespace std;

mutex mtx;
condition_variable cv;

atomic<bool> fReady; // init done in main
atomic<bool> fQuit; // init done in main

void threadfunc(int id)
{
    /*
    Any thread that intends to wait on a condition_variable has to acquire a
    unique_lock<mutex> first.
    The wait operations atomically release the mutex and suspend the execution
    of the thread.
    When the condition variable is notified, the thread is awakened, and the
    mutex is reacquired [2].
    */

    while (!fQuit)
    {
        // wait for fReady:
        unique_lock<mutex> lck(mtx);
        cv.wait(lck); // unlocks lck, and waits for cv to be signalled;
                      // then automatically reacquires lck
        if (fQuit) break;
        if (!fReady) continue;

        // work, ie. process the job:
        cout << "thread " << id << " has grabbed the job\n";
        //...

        // notify parent with fReady=false;
        // this also wakes up all the other threads
        fReady = false;
        cv.notify_all();
    }
}

int main()
{
    fQuit = false;
    fReady = false;

    const size_t N = 10;
    thread threads[N];
    for (size_t i = 0; i < N; ++i)
        threads[i] = thread(threadfunc, i);

    for (size_t i = 0; i < 10000; ++i)
    {
        cout << N << " threads ready to race: ";

        // notify threads about work to do
        {
            unique_lock<mutex> lck(mtx);
            fReady = true;
            cv.notify_all();

            // wait till job was done
            while (fReady) cv.wait(lck);
        }
    }

    // set fQuit and wake up the threads
    {
        unique_lock<mutex> lck(mtx);
        fQuit = true;
        cv.notify_all();
    }

    // wait till all threads quit:
    for (auto& th : threads) th.join();

    cout << "pgm finished\n";
    return 0;
}




Howard Hinnant

Jan 12, 2016, 1:57:07 PM
to std-dis...@isocpp.org
On Jan 12, 2016, at 1:28 PM, U.Mutlu <for-...@mutluit.com> wrote:
>
> Question:
> How to fix this code?
> --> it could be a low-level issue, ie. compiler-, stdlibrary-, or pthread-issue

Or it could be programmer error. That’s where my money is.

How to fix? Check your predicate after you lock the mutex and before you wait on the condition_variable. Otherwise you risk waiting when there is no one to notify you. You may wait forever.

Howard
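
A minimal sketch of the pattern Howard describes (illustrative only, not code posted in the thread; it reuses the example's own globals mtx, cv, fReady, fQuit):

    std::unique_lock<std::mutex> lck(mtx);
    // Check the predicate after locking and *before* waiting: if the flag is
    // already set we never block, so a notification sent earlier is not lost.
    while (!fReady && !fQuit)
        cv.wait(lck);   // the predicate is re-checked on every wakeup, spurious or not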

Thiago Macieira

Jan 12, 2016, 2:08:44 PM
to std-dis...@isocpp.org
On Tuesday 12 January 2016 19:28:17 U.Mutlu wrote:
> /*
> condition variable example (but is [sometimes] buggy)
>
> Code adapted from the example at [1]
>
> Question:
> How to fix this code?
> --> it could be a low-level issue, ie. compiler-, stdlibrary-, or
> pthread-issue
>
> What-it-does:
> 10 threads race for a job. The main thread supplies the job.
> One of the threads takes the job and processes it.
> This is repeated 10000 times.
> BUT: sometimes the pgm freezes (ie. a deadlock bug happens) --> see below

I'm not convinced this is a bug in the implementation or in pthreads. Your
code does something really weird:

>     while (!fQuit)
>     {
>         // wait for fReady:
>         unique_lock<mutex> lck(mtx);
>         cv.wait(lck); // unlocks lck, and waits for cv to be signalled;
>                       // then automatically reacquires lck
>         if (fQuit) break;
>         if (!fReady) continue;
>
>         // work, ie. process the job:
>         cout << "thread " << id << " has grabbed the job\n";
>         //...
>
>         // notify parent with fReady=false;
>         // this also wakes up all the other threads
>         fReady = false;
>         cv.notify_all();
>     }

Why is each thread notifying all other threads and waking them up? You're
reusing the cv variable for notifying in both directions.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
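
A sketch of the separation Thiago hints at, giving each direction its own condition variable (cvWork, cvDone and the function names are illustrative, not from the posted code; shutdown handling is omitted):

    #include <condition_variable>
    #include <mutex>

    std::mutex mtx;
    std::condition_variable cvWork;  // main -> workers: "a job is available"
    std::condition_variable cvDone;  // workers -> main: "the job is finished"
    bool fReady = false;             // protected by mtx

    void post_job_and_wait()         // main thread, per iteration
    {
        { std::lock_guard<std::mutex> lck(mtx); fReady = true; }
        cvWork.notify_all();                      // wakes only job-waiters
        std::unique_lock<std::mutex> lck(mtx);
        cvDone.wait(lck, []{ return !fReady; });  // main sleeps on its own cv
    }

    void grab_job()                  // worker thread, per iteration
    {
        std::unique_lock<std::mutex> lck(mtx);
        cvWork.wait(lck, []{ return fReady; });   // only one waiter wins the job
        fReady = false;                           // ...process the job here...
        lck.unlock();
        cvDone.notify_one();                      // only main waits on cvDone
    }

With this split, a worker finishing a job no longer wakes all its sibling workers, only the one thread that is actually interested.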

U.Mutlu

Jan 12, 2016, 4:12:13 PM
to std-dis...@isocpp.org
I'm sure the application code is correct. There must be something
buggy deep in the libraries or the compiler itself.

I forgot to mention: I'm running the code on a Quadcore-CPU ("AMD A10-5750M
APU with Radeon(tm) HD Graphics").

I hereby challenge everybody to find the cause of the bug.
Or at least confirm the environment (OS, compiler, CPU etc.) where the code
works or fails.




U.Mutlu

Jan 12, 2016, 4:15:11 PM
to std-dis...@isocpp.org
As I wrote in my other reply:

Howard Hinnant

Jan 12, 2016, 4:26:49 PM
to std-dis...@isocpp.org
I can assure you that your code is susceptible to a hang on all platforms that allow threadfunc to be preempted between the unlock() at the end of the while loop, and the lock() at the beginning of the while loop. Oh, except for those platforms where spurious wakeups occur often.

I’ve given you two great pointers towards the bug in your code. However, if you want to believe this:

> I'm sure the application code is correct. There must be something
> buggy deep in the libraries or the compiler itself.

go right ahead. File a bug report!

Howard

PS: You’re welcome for debugging your code, but that’s all you get.

Andrey Semashev

Jan 12, 2016, 4:51:43 PM
to std-dis...@isocpp.org
[1]

>     while (!fQuit)
>     {
>         // wait for fReady:

[2]
[3]

>     {
>         unique_lock<mutex> lck(mtx);
>         fQuit = true;
>         cv.notify_all();
>     }
>
>     // wait till all threads quit:
>     for (auto& th : threads) th.join();
>
>     cout << "pgm finished\n";
>     return 0;
> }

At [1] you're not checking the loop condition under the lock. It is
possible that, when the lock is taken at [3], the other thread(s) are
blocked on acquiring the lock inside the loop at [2]. As a result, the
notify_all() is missed by those threads and they are blocked in
cv.wait() forever (unless a spurious wakeup happens).
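
Spelled out as one possible schedule (using Andrey's markers), the lost notification looks like this:

    worker: evaluates  while (!fQuit)  at [1]    -- fQuit is still false
    main:   locks mtx at [3], sets fQuit = true, calls notify_all(), unlocks
    worker: acquires mtx at [2] and calls cv.wait(lck)
            -- the notification was already delivered to no one; nothing will
               wake the worker again, so it blocks forever (barring a
               spurious wakeup)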

Tony V E

Jan 12, 2016, 5:14:06 PM
to U.Mutlu
> I'm sure the application code is correct

Care to provide a formal proof? Or even informal?

What happens if the main thread reaches notify_all() before any of the worker threads have reached wait()?



Sent from my BlackBerry portable Babbage Device
  Original Message  
From: U.Mutlu
Sent: Tuesday, January 12, 2016 4:12 PM
To: std-dis...@isocpp.org
Reply To: std-dis...@isocpp.org
Subject: [std-discussion] Re: A C++11 implementation bug: condition variable, atomic, threads

Thiago Macieira

Jan 12, 2016, 6:58:42 PM
to std-dis...@isocpp.org
On Tuesday 12 January 2016 22:11:55 U.Mutlu wrote:
> I hereby challenge everybody to find the cause of the bug.

Looks like both Howard and Tony found bugs (plural) and they're all in your
source code.

U.Mutlu

Jan 12, 2016, 7:05:50 PM
to std-dis...@isocpp.org
Sorry, I don't get your reasoning: a lock here is exclusive,
meaning only one thread (incl. the main thread) can get the lock,
and that's the whole purpose of a lock.

No Sir, the reason for the bug does not lie in the app-logic.
The deadlock definitely does not happen at the locations you say, especially
not at [1] or [3]; rather, it happens while the 10000-iteration loop is running.

So, just do it how you think it should be and prove what you say.



U.Mutlu

Jan 12, 2016, 7:10:07 PM
to std-dis...@isocpp.org
Thiago Macieira wrote on 01/13/2016 12:58 AM:
> On Tuesday 12 January 2016 22:11:55 U.Mutlu wrote:
>> I hereby challenge everybody to find the cause of the bug.
>
> Looks like both Howard and Tony found bugs (plural) and they're all in your
> source code.

I doubt that!
So, then why does nobody simply post a fixed version of the code?



Thiago Macieira

Jan 12, 2016, 7:25:32 PM
to std-dis...@isocpp.org
Because that's not our job and that's not the purpose of this mailing list.

Nathan Ernst

Jan 12, 2016, 7:32:31 PM
to std-dis...@isocpp.org
Because we're not here to debug or fix your code. We're here to discuss the standard.

If you want, I'm sure any number of us on this forum are available at substantial consulting rates. As already mentioned by others, there are several bugs in your usage. This does not constitute an issue with the standard nor an implementation.

Additionally, for what it's worth, 99.999% of the time I hear a developer state something like this:

> I'm sure the application code is correct. There must be something
> buggy deep in the libraries or the compiler itself.
> I forgot to mention: I'm running the code on a Quadcore-CPU ("AMD A10-5750M APU with Radeon(tm) HD Graphics").
> I hereby challenge everybody to find the cause of the bug.
> Or at least confirm the environment (OS, compiler, CPU etc.) where the code works or fails.

...they're usually in over their heads; they don't truly understand what they're doing, and they probably haven't done the legwork to eliminate sources of error on their own side.

Yes, compilers and library implementations occasionally have errors, but insisting upon this when your code has demonstrable defects, and refusing to accept that (and correct, and then retest), is a special combination of arrogance and ignorance that will not get your problem resolved.

Regards,

I hope you find your resolution (but I'm not holding my breath).

-Nate

U.Mutlu

Jan 12, 2016, 7:33:20 PM
to std-dis...@isocpp.org
Tony V E wrote on 01/12/2016 11:14 PM:
> > I'm sure the application code is correct
>
> Care to provide a formal proof? Or even informal?
>
> What happens if the main thread reaches notify_all() before any of the worker threads have reached wait()?

No, even if one starts the job-loop only after ensuring that
at least one worker thread has called "cv.wait(lck)", it still isn't working.
So, that is not the cause of the bug.

We could shorten this discussion if one of you allegedly super guys
would just post a fixed version of the damn code, instead of fabulating...




U.Mutlu

Jan 12, 2016, 7:39:42 PM
to std-dis...@isocpp.org
Thiago Macieira wrote on 01/13/2016 01:25 AM:
> On Wednesday 13 January 2016 01:07:09 U.Mutlu wrote:
>> Thiago Macieira wrote on 01/13/2016 12:58 AM:
>>> On Tuesday 12 January 2016 22:11:55 U.Mutlu wrote:
>>>> I hereby challenge everybody to find the cause of the bug.
>>>
>>> Looks like both Howard and Tony found bugs (plural) and they're all in
>>> your
>>> source code.
>>
>> I doubt that!
>> So, then why does nobody simply post a fixed version of the code?
>
> Because that's not our job and that's not the purpose of this mailing list.

The code is C++11-conformant, and none of you is able to find the bug,
so just say it clearly!




U.Mutlu

Jan 12, 2016, 7:45:09 PM
to std-dis...@isocpp.org
Nathan Ernst wrote on 01/13/2016 01:32 AM:
> Yes, compilers and library implementations occasionally have errors, but to
> insist upon this when your code has demonstrable defects and refusing to
> accept that (and correct, and then retest) is a special combination of
> arrogance and ignorance that will not get your problem resolved.

I bet NONE of you who replied here is able to fix this code, because the code
has no error!
The error is somewhere else, as the subject says!



Nathan Ernst

Jan 12, 2016, 7:55:11 PM
to std-dis...@isocpp.org
And yet, you have refused to acknowledge Howard, Tony and Andrey's suggestions and insights, which just reinforces my assertion of your arrogance and ignorance in perpetuating this "bug". You're earning no friends here.

Again, we are not here to fix *your* buggy code. That is your job.

U.Mutlu

Jan 12, 2016, 8:04:46 PM
to std-dis...@isocpp.org
Nathan Ernst wrote on 01/13/2016 01:55 AM:
> And, yet, you have refused to acknowledge Howard, Tony and Andrey's
> suggestions and incites, which just reinforces my assertion of your
> arrogance and ignorance in perpetuating this "bug". You're earning no
> friends here.
>
> Again, we are not here to fix *your* buggy code. That is your job.

You are just a provocateur and an I-help-only-for-money man;
I'm not much surprised by your ilk...

Tony V E

Jan 12, 2016, 9:04:34 PM
to std-discussion
On Tue, Jan 12, 2016 at 9:01 PM, Tony V E <tvan...@gmail.com> wrote:
On Tue, Jan 12, 2016 at 7:33 PM, U.Mutlu <for-...@mutluit.com> wrote:
Tony V E wrote on 01/12/2016 11:14 PM:
> I'm sure the application code is correct

Care to provide a formal proof? Or even informal?

What happens if the main thread reaches notify_all() before any of the worker threads have reached wait()?

No, even if one starts the job-loop only after ensuring that
at least one worker thread has called "cv.wait(lck)", it still isn't working.
So, that is not the cause of the bug.

We could shorten this discussion if one of you allegedly super guys
would just post a fixed version of the damn code, instead of fabulating...


Instead of fixing your bug we are trying to help you learn.  My question was "what happens when..." so the answer can't be "No".  I understand that you are answering the question "Is the bug (that I'm _currently_ seeing) here...".  That's not the question I asked, however.

What I'm trying to point out is a general issue, which is easiest to see in the main thread, but could also be seen in the worker threads.  The general issue is that if no one is waiting on a cond-var, then the notify() or notify_all() is "lost".

So, consider again my question.  Try to understand what I am describing.  Once you see how that problem could happen in the main thread, try to see how it could also happen in the worker threads - ie maybe the last worker thread once all the work is done - does the last thread miss the last notify?  Why?

And _think_ about what Howard said.  You were given advice from an absolute expert.  It was only one step away from writing the code for you.  And look at how most cond-var loops are constructed, and how yours differs.

Finally, imagine that I fixed the code for you.  How would you know that it is correct?  Your current code needs to run 10 times to see the bug.  What if the code I gave needed 100 runs to see a bug, or 1000?  Or _years_ - this has happened.  See the Ptolemy Project at Berkeley: "The Ptolemy II system itself began to be widely used, and every use of the system exercised this code. No problems were observed until the code deadlocked on April 26, 2004, four years later."

P.S. That's not to say that testing is useless. It is still useful.  But so is understanding.

Tony

U.Mutlu

Jan 12, 2016, 9:43:09 PM
to std-dis...@isocpp.org
Tony V E wrote on 01/13/2016 03:01 AM:
> On Tue, Jan 12, 2016 at 7:33 PM, U.Mutlu <for-...@mutluit.com> wrote:
>
>> Tony V E wrote on 01/12/2016 11:14 PM:
>>
>>> > I'm sure the application code is correct
>>>
>>> Care to provide a formal proof? Or even informal?
>>>
>>> What happens if the main thread reaches notify_all() before any of the
>>> worker threads have reached wait()?
>>>
>>
>> No, even if one starts the job-loop only after ensuring that
>> at least one worker thread has called "cv.wait(lck)", it still isn't
>> working.
>> So, that is not the cause of the bug.
>>
>> We could shorten this discussion if one of you allegedly super guys
>> would just post a fixed version of the damn code, instead of fabulating...
>>
>>
> Instead of fixing your bug we are trying to help you learn. My question
> was "what happens when..." so the answer can't be "No". I understand that
> you are answering the question "Is the bug (that I'm _currently_ seeing)
> here...". That's not the question I asked, however.

I understand your reasoning, and I have added the protection you suggested,
but it has no effect; things are simply the same.

> What I'm trying to point out is a general issue, which is easiest to see in
> the main thread, but could also be seen in the worker threads. The general
> issue is that if no one is waiting on a cond-var, then the notify() or
> notify_all() is "lost".

So, then how to protect against this?
Do you maybe mean that the two statements below have to be done as a whole,
atomically like a transaction, ie. using another mutex?

    unique_lock<mutex> lck(mtx);
    cv.wait(lck);

Ok, this is an idea that makes sense and could work, I'll try it out.

> So, consider again my question. Try to understand what I am describing.
> Once you see how that problem could happen in the main thread, try to see
> how it could also happen in the worker threads - ie maybe the last worker
> thread once all the work is done - does the last thread miss the last
> notify? Why?

By "last thread" do you mean the physically last thread, or the thread that
did the last job?

> And _think_ about what Howard said. You were given advice from an absolute
> expert. It was only one step away from writing the code for you. And look
> at how most cond-var loops are constructed, and how yours differs.

Mine is taken from the example at the link below, as stated in my initial
posting.

> Finally, imagine that I fixed the code for you. How would you know that it
> is correct? Your current code needs to run 10 times to see the bug. What
> if the code a gave need 100 runs to see a bug, or 1000? Or _years_ - this
> has happened. See the Ptolemy Project of Berkeley "The Ptolemy II system
> itself began to be widely used, and every use of the system exercised this
> code. No problems were observed until the code deadlocked on April 26,
> 2004, four years later.”

Congratulations!
My intention is just to avoid such a situation. And my motivation was also
to make people aware of a dangerous example code I found on this otherwise
good and popular site:
http://www.cplusplus.com/reference/condition_variable/condition_variable/




Howard Hinnant

Jan 12, 2016, 9:48:14 PM
to std-dis...@isocpp.org
On Jan 12, 2016, at 9:42 PM, U.Mutlu <for-...@mutluit.com> wrote:
>
> My intention is just to avoid such a situation. And my motivation was also
> to make people aware of a dangerous example code I found on this otherwise
> good and popular site:
> http://www.cplusplus.com/reference/condition_variable/condition_variable/

This code does not appear to have the bug I pointed out in your code. It may have other bugs that I haven’t noticed. But no one, including you, has demonstrated such hypothetical bugs.

Howard
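
For reference, the wait loop in that cplusplus.com example has the form

    std::unique_lock<std::mutex> lck(mtx);
    while (!ready) cv.wait(lck);

ie. it checks its predicate under the lock before ever waiting — exactly the guideline Howard gave at the start of this thread. (The same line is quoted again in a later message.)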

Tony V E

Jan 12, 2016, 10:43:00 PM
to std-discussion
No.

>     unique_lock<mutex> lck(mtx);
>     cv.wait(lck);
>
> Ok, this is an idea that makes sense and could work, I'll try it out.


When working with threads, sometimes (particularly when looking for bugs) you need to consider what state each thread _might_ be in. By 'state' I mean, in your example, what code is currently executing (or about to execute - imagine the thread is between two lines of code), and what are the states of all the important variables (ie fReady, fQuit, mtx, etc).  And then imagine the worst case scenario.

The problem is that depending on timing, there are multiple possible states, and they multiply with each other.  If a worker thread could be in any of 3 states, then 2 worker threads together might make 3 x 3 = 9 possible states.


So, consider again my question.  Try to understand what I am describing.
Once you see how that problem could happen in the main thread, try to see
how it could also happen in the worker threads - ie maybe the last worker
thread once all the work is done - does the last thread miss the last
notify?  Why?

By "last thread" do you mean the physical last thread or the last thread who did the last job?

I just mean the last thread to reach the point of waiting on cv, whichever worker thread that happens to be.  Think about what happens as each thread goes through each step of the last iteration of the loop (so notify_all() won't be called any more!): who notifies the last thread to reach cv.wait()?  Is it possible for a thread to see fQuit == false and then call cv.wait() after everyone else has finished calling notify_all() for the last time, ie. at the point Andrey marked as [2]?  When the last thread is at [2], could the main thread and all the other threads finish before the last thread reaches cv.wait()?
Note that Andrey marked those spots for a reason.



And _think_ about what Howard said.  You were given advice from an absolute
expert.  It was only one step away from writing the code for you.  And look
at how most cond-var loops are constructed, and how yours differs.

Mine is taken from an example from the link below, as was stated in my initial posting.

Yes, and what is different between the two?  In particular, what is different in the order of locking, waiting, and checking your booleans?
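
Side by side, the two orderings look like this (a sketch; "predicate()" is just a stand-in for the fReady/fQuit checks, and the second fragment mirrors the loop body posted above):

    // cplusplus.com example (and Howard's guideline):
    std::unique_lock<std::mutex> lck(mtx);
    while (!predicate())      // check first...
        cv.wait(lck);         // ...wait only if there is something to wait for

    // the code posted in this thread:
    std::unique_lock<std::mutex> lck(mtx);
    cv.wait(lck);             // waits unconditionally first -- a notification
    if (fQuit) break;         // sent before this point is lost, and no further
    if (!fReady) continue;    // one may ever arrive to end the wait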

The first reply to this thread was from Howard:

"Check your predicate AFTER you lock the mutex and BEFORE you wait on the condition_variable.  Otherwise you risk waiting when there is no one to notify you.  You may wait forever."  (capitalization for emphasis by me, not Howard)

By "predicate" he means the booleans.
 

Finally, imagine that I fixed the code for you.  How would you know that it
is correct?  Your current code needs to run 10 times to see the bug.  What
if the code I gave needed 100 runs to see a bug, or 1000?  Or _years_ - this
has happened.  See the Ptolemy Project of Berkeley "The Ptolemy II system
itself began to be widely used, and every use of the system exercised this
code. No problems were observed until the code deadlocked on April 26,
2004, four years later.”

Congratulations!
My intention is just to avoid such a situation. And my motivation was also
to make people aware of a dangerous example code I found on this otherwise
good and popular site:
http://www.cplusplus.com/reference/condition_variable/condition_variable/



There are important differences between the example code and yours.


So, at this point we've given you at least 3 different styles of responses:

1. Howard: Mentioned principles/guidelines you should follow when using cond-vars - how to order your checks and locks.  I'm sure a programmer could go far just by following Howard's guidelines, even without actually understanding why they are following them (beyond "because Howard said", which is a pretty good reason on its own).

2. Andrey: outline of specific points in the code and specific states, such that a deadlock could occur.  Maybe you should 1. understand the situation he is trying to describe. 2. add debug output to your code to see what state it is in when it deadlocks. (Even if it deadlocks in a different state than what Andrey describes, that doesn't mean Andrey's case couldn't happen as well.)

3. Tony: mostly just point out that cv.wait() will wait forever if the last notify_all() happened before the wait().  From that, try to think of how it could happen that there are no more notify_all() calls left, but someone is still about to wait().  (And once you see that, then try to understand if/how Howard's guidelines make that situation impossible.)


Finally, as mentioned before, we are WAY off topic for std-discussion.  If there really was a bug at the cplusplus.com website, it would still be offtopic, although probably of interest to many here.  Even a bug in the library or compiler is actually off topic here.  Only if there was a  library/compiler bug, and the bug existed because the standard said it must work that way, would it be on topic for this list.

But at this point it should be clear the bug is elsewhere (ie in your code).  stackoverflow.com is possibly a better place for questions like this.

Tony

P.S. I hope it is OK with you, I might use your example code (or code similar to it) as part of my concurrency training material.  Proper use of cond-vars is tricky at first until you really understand how all of your code can interact.


U.Mutlu

Jan 12, 2016, 11:43:45 PM
to std-dis...@isocpp.org
I now have finally found the bug. It is this line from the above example:

    while (!ready) cv.wait(lck);

I had used it also in my code. When I replace it with a lambda version,
the bug disappears!:

    // wait till job finished
    {
        unique_lock<mutex> lck(mtx);
        cv.wait(lck, []{ return !fReady; });
    }
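
(Note: per the C++11 standard — see also [2] — the predicate overload cv.wait(lck, pred) is specified to behave exactly as

    while (!pred())
        cv.wait(lck);

so the lambda form cannot by itself fix anything the explicit while loop gets wrong; if the hang disappears, the effective change must be elsewhere. Tony returns to this point near the end of the thread.)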




U.Mutlu

Jan 12, 2016, 11:53:02 PM
to std-dis...@isocpp.org
Ok, here is the fixed code using lambda versions of cv.wait() :

#include <iostream>
#include <atomic>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <unistd.h>
#include <vector>

using namespace std;

mutex mtx;
condition_variable cv;
atomic<bool> fReady; // init done in main
atomic<bool> fQuit; // init done in main

void threadfunc(int id)
{
    /*
    Any thread that intends to wait on a condition_variable has to acquire a
    unique_lock<mutex> first.
    The wait operations atomically release the mutex and suspend the execution
    of the thread.
    When the condition variable is notified, the thread is awakened, and the
    mutex is reacquired [2].
    */

    while (!fQuit)
    {
        // wait for fReady:
        unique_lock<mutex> lck(mtx);
#if 0
        // OK
        cv.wait(lck);
#else
        // ALSO OK
        cv.wait(lck, []{ return fReady || fQuit; });
#endif
        if (fQuit) break;
        if (!fReady) continue;

        // work, ie. process the job:
        cout << "thread " << id << " has grabbed the job\n";
        //...

        // notify parent with fReady=false (ie. "job done");
        // this also wakes up all the other threads
        fReady = false;
        cv.notify_all();
    }

    cout << "thread " << id << " finished\n";
}

int main()
{
    fQuit = false;
    fReady = false;

    const size_t N = 10;
    vector<thread> threads;
    for (size_t i = 0; i < N; ++i)
        threads.push_back(thread(threadfunc, i));

    for (size_t i = 0; i < 10000; ++i)
    {
        cout << i << " : " << N << " threads ready to race: ";

        // notify all threads about work to do
        {
            unique_lock<mutex> lck(mtx);
            fReady = true;
        }
        cv.notify_all();

#if 0
        // wait till job finished [THIS IS BUGGY!]
        {
            unique_lock<mutex> lck(mtx);
            while (fReady) cv.wait(lck); // <-- THIS IS THE CULPRIT!
        }
#else
        // wait till job finished [THIS IS OK]
        {
            unique_lock<mutex> lck(mtx);
            cv.wait(lck, []{ return !fReady; });
        }
#endif
    }

    // set fQuit and wake up all threads
    {
        unique_lock<mutex> lck(mtx);
        fQuit = true;
    }
    cv.notify_all();

    // wait till all threads have quit:
    cout << "join\n";

U.Mutlu

Jan 13, 2016, 12:08:00 AM
to std-dis...@isocpp.org
U.Mutlu wrote on 01/13/2016 05:52 AM:
> U.Mutlu wrote on 01/13/2016 05:43 AM:
>> Howard Hinnant wrote on 01/13/2016 03:48 AM:
>>> On Jan 12, 2016, at 9:42 PM, U.Mutlu <for-...@mutluit.com> wrote:
>>>>
>>>> My intention is just to avoid such a situation. And my motivation was
>>>> also to make people aware of a dangerous example code I found on this
>>>> otherwise good and popular site:
>>>> http://www.cplusplus.com/reference/condition_variable/condition_variable/
>>>
>>>>
>>> This code does not appear to have the bug I pointed out in your code. It
>>> may have other bugs that I haven’t noticed. But no one, including you,
>>> have demonstrated such hypothetical bugs.
>>>
>>> Howard
>>>
>>
>> I now have finally found the bug. It is this line from the above example:
>>     while (!ready) cv.wait(lck);
>>
>> I had used it also in my code. When I replace it with a lambda version,
>> the bug disappears!:
>>
>> // wait till job finished
>> {
>>     unique_lock<mutex> lck(mtx);
>>     cv.wait(lck, []{ return !fReady; });
>> }
>
> Ok, here is the fixed code using lambda versions of cv.wait() :

And here is the final working version.
In this version the two atomic vars have been converted to normal vars
as there was no need for them to be declared atomic b/c they are already
protected via the mtx.

#include <iostream>
#include <atomic>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <unistd.h>
#include <vector>

using namespace std;

mutex mtx;
condition_variable cv;
bool fReady = false;
bool fQuit = false;

void threadfunc(int id)
{
    /*
    Any thread that intends to wait on a condition_variable has to acquire a
    unique_lock<mutex> first.
    The wait operations atomically release the mutex and suspend the execution
    of the thread.
    When the condition variable is notified, the thread is awakened, and the
    mutex is reacquired [2].
    */

    while (!fQuit)
    {
        // wait for fReady:
        unique_lock<mutex> lck(mtx);
#if 1
        // OK
        cv.wait(lck);
#else
        // ALSO OK
        cv.wait(lck, []{ return fReady || fQuit; });
#endif
        if (fQuit) break;
        if (!fReady) continue;

        // work, ie. process the job:
        cout << "thread " << id << " has grabbed the job\n";
        //...

        // notify parent with fReady=false (ie. "job done");
        // this also wakes up all the other threads
        fReady = false;
        cv.notify_all();
    }

    cout << "thread " << id << " finished\n";
}

int main()
{

U.Mutlu

Jan 13, 2016, 12:48:54 AM
to std-dis...@isocpp.org
Ok, final comment:
The implementation of the non-lambda version of condition_variable::wait()
in my compiler (g++ (Debian 4.9.2-10) 4.9.2) seems to be buggy.
An official bug report has been filed.
Here's the very final version.
So long, and thanks for all the fish... :-)


#include <iostream>
// #include <atomic>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <unistd.h>
#include <vector>

using namespace std;

mutex mtx;
condition_variable cv;
bool fReady = false; // making this atomic<bool> is not necessary b/c it's protected via mtx
bool fQuit = false;  // ditto

void threadfunc(int id)
{
    /*
    Any thread that intends to wait on a condition_variable has to acquire a
    unique_lock<mutex> first.
    The wait operations atomically release the mutex and suspend the execution
    of the thread.
    When the condition variable is notified, the thread is awakened, and the
    mutex is reacquired [2].
    */

    while (!fQuit)
    {
        // wait for fReady:
        unique_lock<mutex> lck(mtx);
#if 0
        // THIS IS BUGGY!
        cv.wait(lck);
#else
        // THIS IS OK
        cv.wait(lck, []{ return fReady || fQuit; });
#endif
        if (fQuit) break;
        if (!fReady) continue;

        // work, ie. process the job:
        cout << "thread " << id << " has grabbed the job\n";
        //...

        // notify parent with fReady=false (ie. "job done")
        // wakes up also all the other threads
        fReady = false;
        cv.notify_all();
    }

    cout << "thread " << id << " finished\n";
}

// ... (body of main() trimmed in the archive; it ends with:)

    // wait till all threads have quit:
    {
        unique_lock<mutex> lck(mtx);
        cout << "join\n";
    }
    for (auto& th : threads) th.join();

    {
        unique_lock<mutex> lck(mtx);
        cout << "pgm finished\n";
    }
    return 0;
}

Thiago Macieira

Jan 13, 2016, 2:13:16 AM
to std-dis...@isocpp.org
Three people found bugs in your code. But even if we had found implementation
issues, there was no discussion about the standard, which is the topic of this
mailing list.

There's nothing wrong with the standard. Not even you claimed there was. That
makes this entire discussion off-topic.

Thiago Macieira

Jan 13, 2016, 2:15:16 AM
to std-dis...@isocpp.org
On Wednesday 13 January 2016 02:04:34 U.Mutlu wrote:
> > Again, we are not here to fix *your* buggy code. That is your job.
>
> You are just a provactor and I-help-only-for-money man,
> I'm not much suprized of your ilk...

Post your code to Stack Overflow or another, more suitable forum and you may
see many of the people here reply to you there with more detailed answers.

Please stop insisting on getting fixed code on std-discussion. It's not the
point of the mailing list.

Tony V E

Jan 14, 2016, 12:05:44 PM
to std-discussion
You are accessing fQuit without a mutex here, but writing it from another thread.  That's undefined behaviour.
 
      {

        // wait for fReady:
        unique_lock<mutex> lck(mtx);
#if 0
        // THIS IS BUGGY!
        cv.wait(lck);
#else
        // THIS IS OK
        cv.wait(lck, []{ return fReady || fQuit; });
#endif

The big difference in the above code isn't that one uses a lambda and the other doesn't.
The big difference is that one checks fQuit (AFTER acquiring the lock) and the other doesn't.  That is your bug, as mentioned in various other emails.

I don't think this line is the culprit.  I think your change of checking fQuit in threadfunc is what actually made the difference.
Have you tried using the old #if 0 code here, and only testing the changes to the threadfunc code?
 
        }
#else
        // wait till job finished [THIS IS OK]
        {
        unique_lock<mutex> lck(mtx);
        cv.wait(lck, []{ return !fReady; });
        }
#endif
      }

    // set fQuit and wake up all threads
    {
    unique_lock<mutex> lck(mtx);
    fQuit = true;
    }
    cv.notify_all();

    // wait till all threads have quit:
    {
    unique_lock<mutex> lck(mtx);
    cout << "join\n";
    }
    for (auto& th : threads) th.join();

    {
    unique_lock<mutex> lck(mtx);

    cout << "pgm finished\n";
    }
    return 0;
  }
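
For completeness, here is what the program looks like with the advice from this thread applied throughout — the predicate checked under the lock on both sides, and the plain bools touched only while holding the mutex. This is a sketch assembled from the thread's corrections, not a listing posted by any participant:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    using namespace std;

    mutex mtx;
    condition_variable cv;
    bool fReady = false;   // protected by mtx
    bool fQuit  = false;   // protected by mtx

    void threadfunc(int id)
    {
        for (;;)
        {
            unique_lock<mutex> lck(mtx);
            // predicate checked under the lock, before and after waiting:
            cv.wait(lck, []{ return fReady || fQuit; });
            if (fQuit) return;            // fQuit is only ever read under mtx
            fReady = false;               // grab the job
            cout << "thread " << id << " has grabbed the job\n";
            lck.unlock();
            cv.notify_all();              // tell main (and siblings) the job is done
        }
    }

    int main()
    {
        const size_t N = 10;
        vector<thread> threads;
        for (size_t i = 0; i < N; ++i)
            threads.emplace_back(threadfunc, (int)i);

        for (size_t i = 0; i < 10000; ++i)
        {
            {
                unique_lock<mutex> lck(mtx);
                fReady = true;
            }
            cv.notify_all();              // let the workers race for the job

            unique_lock<mutex> lck(mtx);
            cv.wait(lck, []{ return !fReady; });  // sleep until someone took it
        }

        {
            unique_lock<mutex> lck(mtx);
            fQuit = true;
        }
        cv.notify_all();

        for (auto& th : threads) th.join();
        cout << "pgm finished\n";
        return 0;
    }

This sketch keeps the single condition variable of the original; splitting it in two, as Thiago suggested earlier, would additionally avoid waking every worker whenever one job completes.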



