Suppose I have a graph that I want to search for a node that has some
characteristic (e.g., holds a value within some range). Suppose further that of
the many ways to search the graph (e.g., depth-first, breadth-first, random
walk, etc.), I can't predict which will be best, so I want to run them all
concurrently, stopping when either one of them finds a suitable node or they all
fail. That is, given something like this
Node* dfsSearch(Graph g, Predicate p); // do dfs search of g for
// a node satisfying p
Node* bfsSearch(Graph g, Predicate p); // do bfs search of g
Node* rwSearch(Graph g, Predicate p); // do random walk search of g
I want to do this:
concurrently invoke dfsSearch, bfsSearch, and rwSearch;
wait until one returns success or they all return lack of success;
if one returned success, tell the others to stop searching;
One of the problems in mapping this to C++0x is that C++0x has no notion of
thread cancellation/interruption, but that can be built manually, as Anthony
Williams shows in section 9.2 of his "C++ Concurrency in Action"
(http://www.manning.com/williams/), which I'm pleased to be able to plug here.
My question, however, has to do with implementing this part:
wait until one returns success or they all return lack of success;
Considering only the simplest case (where at least one search returns success),
what's the best way to wait? I'm going to assume that I have a future<Node*>
for each function.
I could wait on a condition variable and have each search do a notify_one(), but
condition variables are inherently tied to mutexes, and in this case, there is
no need for a mutex (no shared state), so that seems inappropriate to me.
I could poll the futures (waiting for each with a timeout of 0), but polling
constantly seems like a waste of machine resources, and polling with some lag
period between polls introduces a delay between when a future is ready and when
I detect that. The attraction of a condition variable is that I avoid that delay.
I have no MT experience, so I'm looking for advice from people who do. C++0x
threading experience is rare, of course, but I think this is more of a general
MT design question.
All insights appreciated.
Scott
The MT construct most fitting for this context sounds to be a semaphore.
A semaphore is somewhat like a mutex, with Wait being locking and Post
being unlocking. The difference is that Post may be done by some other
thread (or process) than the one that did Wait.
Your task that needs the results runs the tasks of the alternative
solutions and starts to Wait on the semaphore. The first task that
finishes does a Post, releasing the waiting task.
If it was not a success, then by your problem description it does not
sound likely that the others may still succeed, so the released main
task stops the other two in any case.
I do not know if C++0x contains a semaphore of some sort, but I imagine
that it is possible to construct one with mutexes as well.
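For what it's worth, such a semaphore is a short exercise over a C++0x
mutex and condition variable. A minimal sketch, assuming nothing beyond
those two primitives (the class name and interface are mine, not
anything in the draft):

#include <condition_variable>
#include <mutex>

class semaphore
{
    std::mutex m_mutex;
    std::condition_variable m_cond;
    unsigned m_count;
public:
    explicit semaphore(unsigned count = 0) : m_count(count) {}

    void post(unsigned n = 1)  // like unlock, but any thread may call it
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_count += n;
        m_cond.notify_all();
    }

    void wait()                // like lock: blocks until a post arrives
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_count == 0)
            m_cond.wait(lock);
        --m_count;
    }
};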
I don't get this part. The problem suggests that all runners do the same
search. So if any of them exhausts the input and reports 'fail', you can
accept that as the verdict. Why wait for the others to also report the same?
Though that is just a minor detail of signaling the stop condition.
If all runners have their separate location to store the result, I'd start
with a single atomic<bool> that is the finish sign. Each thread polls it
regularly and, if it finds it set, cancels its work and exits. And whoever
finishes naturally sets it.
This only leaves the problem of joining. A simple way looks like this: every
thread knows the 'next' one and joins it on exit, with the "main" thread
considered the first. So soon after anyone sets the bool you're back to a
single thread, and can proceed to find the result from the candidates.
Having no locks at all, it seems efficient for a single-core environment. On
multicore there will be a ton of reads around a shared byte, but unless
atomic<> has some weird penalty it shall not be a problem either; at least I
don't see how other sync stuff would be better.
I didn't check how the future<> stuff works in C++0x; my approach only builds
on what I get from the memory model and thread basics.
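A sketch of one runner under this scheme; searchStep() is a hypothetical
stand-in for visiting the next node:

#include <atomic>

struct Node;                          // from the original problem
Node* searchStep();                   // hypothetical: next node, NULL if none

std::atomic<bool> g_finished(false);  // the single finish sign

void searchWorker(Node** my_result)   // each runner has its own result slot
{
    while (!g_finished.load())        // poll the flag regularly
    {
        if (Node* n = searchStep())   // found a suitable node
        {
            *my_result = n;           // store it in this thread's location
            g_finished.store(true);   // whoever finishes sets the sign
            return;
        }
    }
    // the flag was set by someone else: cancel our work and exit
}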
But there is a shared state, namely the results. When the worker thread
has found some result, it has to write it somewhere where the main thread
can find it. This location is shared and needs some MT protection, e.g. a
mutex. You might also want to maintain a counter of finished worker
threads to decide when they have all failed, this counter would be shared
as well.
HTH
Paavo
If you don't care too much about performance, use futures and wait for
all of them (you do incur latency for the workers to terminate).
If you care about performance, you want something like a semaphore (but
C++0x doesn't seem to have one out of the box).
The main thread will do an acquire(3) on the semaphore (which has been
initialized to 0). Each worker thread will do a release(1) if it's not
successful, or a release(3) if it is successful. Your main thread will
wake up (return from the acquire), when one of the threads was
successful or all of them were unsuccessful. Threads still working when
another thread found a result can terminate asynchronously (your main
thread doesn't incur latency for them).
For the result, you can just use an atomic variable (initialized to some
empty state, set by a successful worker).
You have 3 shared variables/objects between the threads: semaphore,
success/termination flag and result (the flag variable and result must
be atomic). You probably want to use a thread-safe shared pointer to
manage the lifetime (workers may run longer than your main thread).
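A sketch of this scheme, reusing the hypothetical semaphore class sketched
earlier in the thread (acquire(3) spelled as three waits):

#include <atomic>

semaphore g_sem(0);                   // initialized to 0
std::atomic<Node*> g_result(nullptr); // the empty state

void worker(Node* (*search)(Graph, Predicate), Graph g, Predicate p)
{
    Node* n = search(g, p);
    if (n) { g_result.store(n); g_sem.post(3); }  // success: release(3)
    else   { g_sem.post(1); }                     // no result: release(1)
}

Node* wait_for_outcome()              // run by the main thread
{
    for (int i = 0; i < 3; ++i)       // acquire(3)
        g_sem.wait();
    return g_result.load();           // non-null iff some search succeeded
}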
Scott, I am sorry for the very crude and brief description of the algorithm.
I have no time to elaborate any further right now. So, I hope that you can
gain something from this!
:^)
I will give more info when I get some more free time.
Thanks.
------------------------------------------
I did something very similar when I needed to search a large, completely
unordered array. I created three tasks, one searched from left-to-right, one
searched from right-to-left, and the other from the center to both ends of
the array. I used an eventcount for the waiting. Luckily, an eventcount can
be fairly easily constructed in C++0x. IIRC, here is a high-level brief
sketch of how the data-structures were set up:
<high-level pseudo-code sketch>
___________________________________________________________
// an entry in the array to search for
struct entry
{
bool compare(unsigned) const;
};
// the entry array
struct array
{
entry m_data[DEPTH]; // large array of entry's
};
// search completion
#define FAILURE 0xFFFFFFFFU // special failure code
struct complete
{
eventcount m_ecount;
atomic<entry*> m_complete;
atomic<unsigned> m_refs;
complete(unsigned refs) : m_complete(NULL), m_refs(refs)
{
}
void signal(entry* e)
{
entry* cmp = NULL;
if (m_complete.compare_exchange_strong(cmp, e, mb_relaxed))
m_ecount.signal();
if (m_refs.fetch_sub(1, mb_release) == 1)
{
atomic_thread_fence(mb_acquire);
delete this;
}
}
entry* wait()
{
entry* e;
while (! (e = m_complete.load(mb_relaxed)))
{
eventcount::key const k = m_ecount.get();
if ((e = m_complete.load(mb_relaxed))) break;
m_ecount.wait(k);
}
if (m_refs.fetch_sub(1, mb_release) == 1)
{
atomic_thread_fence(mb_acquire);
delete this;
}
return (e != (entry*)FAILURE) ? e : NULL;
}
};
// a search task
struct stask; // forward declaration
typedef void (*fp_search_type) (stask&);
struct stask
{
array* m_array; // the target array
atomic<bool> m_cancel; // cancel flag == false
complete& m_complete; // completion
unsigned m_key; // search key
fp_search_type m_fp_search; // search funcptr
stask(array* a, unsigned key, fp_search_type s, complete& c)
: m_array(a), m_cancel(false), m_complete(c), m_key(key), m_fp_search(s)
{
}
// this is called by the main thread-pool
void work()
{
m_fp_search(*this);
}
};
// a left-to-right search
static void search_ltr(stask& s)
{
for (unsigned i = 0; i < DEPTH && ! s.m_cancel.load(mb_relaxed); ++i)
{
if (s.m_array->m_data[i].compare(s.m_key))
{
s.m_complete.signal(&s.m_array->m_data[i]);
return;
}
}
s.m_complete.signal((entry*)FAILURE);
}
// a right-to-left search
static void search_rtl(stask& s)
{
[...]; // completion logic is identical to left-to-right search
}
// a center-to-both-ends search
static void search_cbe(stask& s)
{
[...]; // completion logic is identical to left-to-right search
}
// the main search function... finally! ;^)
static entry* search(array& a, unsigned key)
{
// create a completion structure with a refcount of 4
complete* c = new complete(4);
// create search tasks
stask* stask_ltr = new stask(&a, key, search_ltr, *c);
stask* stask_rtl = new stask(&a, key, search_rtl, *c);
stask* stask_cbe = new stask(&a, key, search_cbe, *c);
// enqueue search tasks
g_thread_pool.push(stask_ltr);
g_thread_pool.push(stask_rtl);
g_thread_pool.push(stask_cbe);
// wait for a result...
entry* e = c->wait();
// cancel everything
stask_ltr->m_cancel = true;
stask_rtl->m_cancel = true;
stask_cbe->m_cancel = true;
// release references
stask_ltr->release();
stask_rtl->release();
stask_cbe->release();
// return result
return e;
}
___________________________________________________________
That's about all of the relevant parts of the scheme I was using. Basically,
the main search function would create a completion structure (e.g., `struct
complete' in the example) and three search tasks `struct stask'. Then it
enqueued the tasks in the main thread pool and finally waited on the
completion structure.
The completion and failure logic was very simple. First we can observe that
if one search fails, all of them failed. Also, if one succeeds then all of
them succeeded. So, if a thread happens to find the entry, it does a
compare-and-swap on the "result variable" (examine the `struct
complete::signal() procedure'). Also, if a thread fails to find the entry, it
does a CAS with a special failure code (e.g., the FAILURE macro). If the CAS
succeeds, then the entire search operation is complete and the result can be
reported back to the main search function (examine the `struct
complete::wait() function').
Finally, the main search operation obtains the result, releases its
references to the allocated tasks, and returns. Please note that it does not
need to wait or join with any of the tasks. As soon as the result is ready,
the main search function can finish. You can also do this without a thread
pool and use detached threads since there is no need to join. The completion
structure and tasks were all reference counted which worked fairly nicely in
that scheme.
I hope that makes some sense!
:^o
[...]
This is what a future is for. As I noted, my plan is for each search function
to have its own future. In C++0x, callees write their result to a promise, and
callers retrieve the result through a future. The state shared by a promise and
future is internally synchronized, so there is no need for clients to use a
mutex with it.
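For illustration, a minimal sketch of that round trip for one search (g and
p as in my original post):

#include <future>
#include <thread>

Node* runOneSearch(Graph g, Predicate p)
{
    std::promise<Node*> prom;
    std::future<Node*> fut = prom.get_future();

    // the callee writes its result to the promise
    std::thread worker([&prom, g, p]{ prom.set_value(dfsSearch(g, p)); });

    Node* n = fut.get();  // the promise/future shared state is internally
                          // synchronized; no client-side mutex needed
    worker.join();
    return n;
}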
> This location is shared and needs some MT protection, e.g. a
> mutex. You might also want to maintain a counter of finished worker
> threads to decide when they have all failed, this counter would be shared
> as well.
When all workers have finished, all their futures will be ready, so there is no
need for a counter to determine that (although it could be convenient).
I failed to explain that there are two kinds of failure:
- Normal: Search was completed, no suitable node was found.
- Exceptional: Search was aborted with an exception of some kind.
I agree that in the case of a normal failure, the other threads can be
interrupted at that point. If a failure is exceptional, the other threads
probably need to be allowed to continue.
> If all runners have their separate location to store the result, I'd
> start with a single atomic<bool> that is the finish sign. Each thread
> polls it regularly and if finds set, cancels its work, exiting. And
> whoever finishes naturally sets it.
Okay, so you'd poll, which was one of the strategies I mentioned.
> Having no locks at all it seem efficient for a single-core environment.
> On multicore there will be a ton of reads around a shared byte, but
> unless atomic<> has some weird penalty it shall not be a problem either,
> at least I don't see how other sync stuff would be better.
Do you have a reason for preferring polling to use of a condition variable or,
as others have suggested, a semaphore?
Thanks,
C++0x has no semaphore type, nor, from what I can tell, does pthreads, TBB or
Boost.Thread. My sense is that the preferred thread communication mechanism in
these libraries is condition variables. From Boost.Thread:
> The classes condition_variable and condition_variable_any provide a mechanism for one thread to wait for notification from another thread that a particular condition has become true.
From TBB blog post:
> Condition variables should be the method of choice to have a thread wait until a condition changes.
From draft C++0x:
> Condition variables provide synchronization primitives used to block a thread until notified by some other
> thread that some condition is met or until a system time is reached.
Should I conclude that if I don't want to poll, I should probably be using a
condition variable?
Thanks,
> Paavo Helde wrote:
>> But there is a shared state, namely the results. When the worker
>> thread has found some result, it has to write it somewhere where the
>> main thread can find it.
>
> This is what a future is for. As I noted, my plan is for each search
> function to have its own future. In C++0x, callees write their result
> to a promise, and callers retrieve the result through a future. The
> state shared by a promise and future is internally synchronized, so
> there is no need for clients to use a mutex with it.
I see. I overlooked the usage of C++0x future<> in the OP. I am not
familiar with this feature, but it appears to be a higher-level construct
having internal synchronisation. I guess what you need here is a more
special case making use of lower-level primitives.
Regards
Paavo
It also has the advantage that it can hold either a return value or an exception.
FWIW, here is the original fast-pathed eventcount algorithm I invented a
while back on c.p.t.:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380
As you can see, the simplistic naive eventcount can be fairly easily
implemented with 100% pure C++0x. Please take note of the fact that it
depends on a condition variable for the slow-path waiters case. So, a
condvar is a very natural solution for the slow path of an eventcount.
IMHO: Condvars/Monitors coupled with a standard atomic and membar API make
eventcounts as simple as pie!
;^)
This algorithm has been further extended by me and others to support many
different features. Perhaps I should code an example of the solution I
"tried" to present in Relacy. Humm...
http://groups.google.com/group/relacy
BTW, this tool is AWESOME!!!!!!
Read the topics section:
http://groups.google.com/group/relacy/topics
and you can see example impls of my fast-pathed eventcount algorithm that I
posted. Dmitriy Vyukov created a nice distributed eventcount based on
per-thread waitsets and clever organization of said waitsets here:
http://software.intel.com/en-us/forums/showthread.php?t=62364
And here is where Dmitriy gives me credit for presenting my original
fast-path eventcount algorithm:
http://software.intel.com/en-us/forums/showpost.php?p=72258
And to Joe Seigh for expressing his thoughts on the algorithm in the first
place. Joe gave me the initial idea/seed to create my fast-pathed solution.
Here is where I presented my first eventcount; Luckily, Joe commented on it:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/aa8c62ad06dbb380
Enjoy!
;^)
> Perhaps I should code an example of the solution I "tried" to present in
> Relacy.
Let me clarify:
Perhaps I should code an example of my solution to the problem you presented
in Relacy.
Sorry for any confusion!
GRRR
;^(...
Perhaps it's just my inexperience with MT development, but my impression is that
anything involving manual fences, CAS, relaxed memory ordering directives, and
arithmetic on atomics is unlikely to be viewed by most people as simple :-)
My primary goal is to come up with a "natural" design that can be expressed
using C++0x, as opposed to one that will run as fast as possible. As such, I'd
like to rely on fairly mainstream C++0x threading facilities such as futures,
condition variables, and, where necessary, atomics. I'd also like to be able to
handle the case where one or more of the functions I invoke throw exceptions.
(This is one of the reasons that futures are attractive.)
The problem I'd like to solve seems like it'd be pretty common, and I was hoping
there was a more or less "standard" design using C++0x for approaching it.
Digging a little further, I came across this from Anthony Williams in a thread
from last October:
> Currently there is no good way, other than condition variables and
> mutexes.
>
[...]
>
> I can certainly see the benefits of standardizing a lightweight
> mechanism such as a semaphore or event count or even a futex, but that
> will have to wait for TR2 as it is currently far too late for getting
> new proposals in for C++0x.
So it looks like condvars are the "standard" C++0x way of communicating events
between threads.
In the meantime, I'd been wondering about the idea of writing my own mutex that
didn't actually do any locking, using that in conjunction with a condition
variable in an attempt to reduce the cost of that approach. Anthony had already
been there. In the same message, he wrote:
> Well, std::condition_variable_any only requires a "lockable" object, so
> you could pass a class that implements lock/try_lock/unlock as empty
> functions. However, the implementation probably just uses an internal
> mutex in this case, so you don't necessarily gain anything.
Am I correct in assuming that the internal mutex would be for the condition
variable and would be used to help manage the data structure keeping track of
which threads were blocked on the condvar? If so, wouldn't using a no-op mutex
type eliminate any cost that would be associated with acquiring the
(uncontended) mutex used in conjunction with the condvar (i.e., the mutex passed
to the condvar)? My understanding is that the cost of such acquisition is low
on some platforms, but not so low on others.
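For concreteness, the sort of no-op lockable I have in mind:

// a lockable that does nothing, for use with condition_variable_any
struct null_mutex
{
    void lock() {}
    bool try_lock() { return true; }
    void unlock() {}
};

// std::condition_variable_any cv;
// null_mutex nm;
// cv.wait(nm);  // per Anthony, the implementation may still take an
//               // internal mutex, so this may buy nothing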
Ahh, but the complexity can be hidden by the final result. C++0x provides a
perfect medium for an eventcount to thrive within. IMVHO, it's easier to
build an efficient eventcount than an efficient semaphore in C++0x.
Eventcounts have a concrete usage pattern for consumers/waiters/etc.:
http://groups.google.com/group/comp.programming.threads/msg/8e7a0379b55557c0
(refer to the latter part of the post)
> My primary goal is to come up with a "natural" design that can be
> expressed using C++0x, as opposed to one that will run as fast as
> possible.
Ohhh. I thought you wanted a speedy design. Well, create 3 search futures.
Join with all of them. Pick the result of any of them. Done.
Or use mutexes and condvars with explicit state to pick a winner. You can use
mutexes and condvars to implement my example without using any atomics and/or
eventcounts. Question... Do you need to explicitly join with a future once
you spawn it?
Not very exciting now is it?
C++0x atomics work great. My example in essence used CAS on a single result
variable to solve the problem. The eventcount allows the main search thread to
wait on that value to become either usable or report an error. Think
condition variable for lockless algorithms.
> As such, I'd like to rely on fairly mainstream C++0x threading facilities
> such as futures, condition variables,
Not fun! :^)
> and, where necessary, atomics.
I have to study the semantics of C++0x futures a lot more Scott. All I know
wrt the problem at hand is that when _any_ search future hits a result, it
means that all other search futures are going to hit the exact same result.
This fact makes it so you do not have to wait on all spawned futures; let
them die on their own (e.g., detached futures?). There is no need to join
with any of them; just use a single eventcount to wait on a single result
from all search futures. First search future to hit a successful CAS wins.
Keep in mind that a total search can complete long before all the futures
complete, or even one future completes. An eventcount can efficiently and
__portably__ handle all the conditional waiting. No need to explicitly join
with all of the futures' promises.
My design is absolutely 100% portable in C++0x terms and seems as natural as
breathing air to me. Why not explore it?
> I'd also like to be able to handle the case where one or more of the
> functions I invoke throw exceptions. (This is one of the reasons that
> futures are attractive.)
I am not a fan of cross-thread exceptions.
> The problem I'd like to solve seems like it'd be pretty common, and I was
> hoping there was a more or less "standard" design using C++0x for
> approaching it.
Simple. Create 3 search futures. Join with all of them. Pick a result of any
of them. Done.
IMVVHO, polling is just crappy conditional logic. It can work okay for some
short-lived wait periods with a backoff scheme, but it's kind of a hack.
> Digging a little further, I came across this from Anthony Williams in a
> thread from last October:
>
>> Currently there is no good way, other than condition variables and
>> mutexes.
>>
> [...]
>>
>> I can certainly see the benefits of standardizing a lightweight
>> mechanism such as a semaphore or event count or even a futex, but that
>> will have to wait for TR2 as it is currently far too late for getting
>> new proposals in for C++0x.
I need to read this post to provide a complete response. But I can
definitely say that an eventcount in C++0x does not really need to be
standardized, simply because it can be so readily constructed out of
atomics, membars, and a condition variable.
[...]
Except that when one is done, I want the others to stop working. And I don't
want to waste resources polling to see which one finishes first. And if the one
that finished first threw an exception, I may want to allow the others to continue.
The problem is easier if we are willing to let all the searches run to
completion and if we assume that none will throw. But I'm not willing to make
those assumptions. I mean, there's a *little* fun here, right?
> I have to study the semantics of C++0x futures a lot more Scott. All I know
> wrt the problem at hand is that when _any_ search future hits a result, it
> means that all other search futures are going to hit the exact same result.
Not necessarily. There may be multiple nodes that satisfy the search criterion.
And one or more of the search algorithms might be super-fast, but use a
heuristic that sometimes returns a false negative (hence would always be run
concurrently with a slower but never-wrong algorithm).
> I am not a fan of cross-thread exceptions.
They're one of the nicest features of C++0x threading, IMO. It means that a lot
of the experience we have with writing ST code using exceptions can be ported to
an MT environment. Why don't you like them?
You might want to take a look at N2709
(http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2709.html), which
shows the nine-line spawn_task function that's the essence of what was
ultimately standardized as std::async. (std::async is more sophisticated, but
spawn_task is at its heart.) If something is sufficiently common, making it a
standard library facility is not unreasonable. I'm also inclined to say that
anything that's commonly useful and that requires the use of membars is a good
candidate for standardization. We have decades of experience showing that
programmers, as a large group, can't avoid such easy-to-understand problems as
deadlocks and races. This suggests to me that asking them, as a group, to
master membars and relaxed memory orderings is asking too much.
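For the curious, spawn_task is along these lines; this is my reconstruction
from memory, so see N2709 for the authoritative nine lines:

#include <future>
#include <thread>
#include <type_traits>
#include <utility>

template<typename F, typename A>
std::future<typename std::result_of<F(A&&)>::type>
spawn_task(F&& f, A&& a)
{
    typedef typename std::result_of<F(A&&)>::type result_type;
    std::packaged_task<result_type(A&&)> task(std::move(f));
    std::future<result_type> res(task.get_future());
    std::thread t(std::move(task), std::move(a));
    t.detach();      // the future carries the result (or exception) back
    return res;
}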
Well, you would have to incorporate a cancellation check with the various
search algorithms. My very crude pseudo-code handles this.
> And I don't want to waste resources polling to see which one finishes
> first.
A condition variable or eventcount will work fine.
> And if the one that finished first threw an exception, I may want to allow
> the others to continue.
This throws more complexity into the mix. Need to think.
> The problem is easier if we are willing to let all the searches run to
> completion and if we assume that none will throw.
INDEED!
> But I'm not willing to make those assumptions. I mean, there's a *little*
> fun here, right?
:^)
Are you in control of creating the search functions?
>> I have to study the semantics of C++0x futures a lot more Scott. All I
>> know wrt the problem at hand is that when _any_ search future hits a
>> result, it means that all other search futures are going to hit the exact
>> same result.
>
> Not necessarily. There may be multiple nodes that satisfy the search
> criterion. And one or more of the search algorithms might be super-fast,
> but use a heuristic that sometimes returns a false negative (hence would
> always be run concurrently with a slower but never-wrong algorithm).
So how could you possibly ever trust the result of the algorithm that might
return a false negative in this case? Here is a question... Should the false
negative stop other searches? If not, why use an algorithm that can return
false negative at all?
BTW, you seem to be revealing more and more "details/requirements" about/for
the problem. Is this a real problem that you are stuck on, or an "academic"
question?
Can the graph mutate while readers are performing searches on it?
Do you NEED information from _all_ searches before you can return a result
that is _not_ a false negative? IMVVO, if you launch several _identical_
searches over a datum, then one result should represent all results. This
makes synchronization ohh so much easier; 1 CAS! Unless you need combined
information from all results.
>> I am not a fan of cross-thread exceptions.
>
> They're one of the nicest features of C++0x threading, IMO. It means that
> a lot of the experience we have with writing ST code using exceptions can
> be ported to an MT environment. Why don't you like them?
Does Thread A have to explicitly wait on something in order to "receive" an
exception? Does it have to manually check for exceptions? I see a
future<>::has_exception() member function...
Well, I am not really a fan of exceptions at all because they tend to "force"
me into a transactional style of programming. Think ScopeGuard. I mean, it
works well for the problem...
I am a C guy dam% it! Ouch... I am old!
;^(...
that looks like a boost::wait_for_any( begin, end ) job:
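something like this sketch, perhaps (I'm assuming the iterator-range
overload returns an iterator to whichever future became ready; check the
docs of your Boost.Thread version):

std::vector< boost::unique_future<Node*> > futures;
//...launch a packaged_task for each of dfsSearch/bfsSearch/rwSearch
//and move each task's future into `futures'...
auto ready = boost::wait_for_any( futures.begin( ), futures.end( ) );
Node* n = ready->get( ); //result of the first search to finish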
I believe it is actually more interesting if you don't want to wait
but want to go ahead and do_stuff( ) in the main thread;
then you're going to need something like a task observer:
template<typename Task>
class task_observer {
public:
task_observer( );
task_observer( const Task & );
~task_observer( );
const Task & task( ) const;
//basic task state
bool finished( ) const;
bool running( ) const;
bool started( ) const;
//etc
//watch signals
template< typename Callable > void whenStarted( Callable c );
template< typename Callable > void whenFinished( Callable c );
//etc
//progress monitoring
int progressValue( ) const;
template< typename Callable > void whenProgress( int p, Callable c );
//etc
}; //end task_observer
to take care of finishing the redundant threads when on the slow path.
I'm not sure what's in the works on boost::thread but the above
functionality (callbacks and progress monitoring) was not available
last time I checked.
the usage would probably be something like this:
//create tasks and task_observers for each search
packaged_task< Node* > dfs( dfsSearch ), bfs( bfsSearch ),
rw( rwSearch );
task_observer< packaged_task< Node* > > to_dfs( dfs ), to_bfs( bfs ),
to_rw( rw );
//use a thread_group for simplicity of its interrupt_all method
thread_group threads;
to_dfs.whenFinished( bind( &thread_group::interrupt_all, &threads ) );
to_bfs.whenFinished( bind( &thread_group::interrupt_all, &threads ) );
to_rw.whenFinished( bind( &thread_group::interrupt_all, &threads ) );
//launch the threads
threads.add_thread( new thread( move( dfs ) ) );
threads.add_thread( new thread( move( bfs ) ) );
threads.add_thread( new thread( move( rw ) ) );
//continue main thread activity, _no_ polling
do_stuff( );
your question is legit: some threading frameworks provide signals and
callbacks on observable futures for no-polling thread management.
some time ago I wrote my own observable_task class.
hth,
gil
Great points Sir! I would love to see eventcounts standardized in C++0x.
IMHO, it's a __very__ useful synchronization primitive indeed. It works
perfectly as an efficient conditional blocking mechanism for existing/new
lockless algorithms.
Yes.
> BTW, you seem to be revealing more and more "details/requirements" about/for
> the problem. Is this a real problem that you are stuck on, or an "academic"
> question?
It's academic. I'm trying to figure out what a general solution to this kind of
problem would look like so that I can try to map it to what C++0x gives me.
> Can the graph mutate while readers are performing searches on it?
No. I'm only interested in how to organize the communication between the
threads doing the searching and the thread needing the result of the search.
> Do you NEED information from _all_ searches before you can return a result
> that is _not_ a false negative? IMVVO, if you launch several _identical_
> searches over a datum, then one result should represent all results. This
> makes synchronization ohh so much easier; 1 CAS! Unless you need combined
> information from all results.
I'm happy to assume that any successful result is as good as any other, but I
want to allow for the possibility that one or more searches may not be able to
run to completion due to an exception. This is C++. In general, functions may
throw.
> Does Thread A have to explicitly wait on something in order to "receive" an
> exception? Does it have to manually check for exceptions? I see a
> future<>::has_exception() member function...
Only in old drafts :-) That member function got removed. The current draft is
N3092, but I think there will be an even newer draft in a few days. As far as I
know, there was no movement to put the exception-polling function back in.
Futures are really quite nice. You invoke a function asynchronously, and you
get a future back at the point of invocation. At that point, you are running
concurrently with the invoked function. (Probably. There are edge cases, but
we'll ignore them.) When you want the result of the invocation, you call get()
on the future. If no result is yet available, you block until it is. If the
result was a normal return, you get the return value. If the result was an
exception, you get the exception. It's just like a normal synchronous call: you
get back either a return value or an exception, depending on what the function
you called delivered.
If you never call get(), you never get either the return value or the exception.
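In code, using std::async from the current draft (dfsSearch, g, and p as
before):

#include <future>

Node* getSearchResult(Graph g, Predicate p)
{
    std::future<Node*> f = std::async(dfsSearch, g, p);

    // ...here we (probably) run concurrently with dfsSearch...

    try
    {
        return f.get();  // blocks if needed; yields the return value...
    }
    catch (...)          // ...or rethrows whatever dfsSearch threw
    {
        return NULL;     // handle/translate the exception as desired
    }
}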
> I am a C guy dam% it! Ouch... I am old!
Well, I'm old, and I'm a C++ guy :-)
Humm... Not exactly sure what you're getting at here. I know that an
eventcount allows one to skip a mutex around the "predicate". What advantage
does this idea have over it?
WRT condvar, IIRC, you need that mutex in order to correctly synchronize
wait generations with the predicate.
> In the same message, he wrote:
>
> > Well, std::condition_variable_any only requires a "lockable" object, so
> > you could pass a class that implements lock/try_lock/unlock as empty
> > functions. However, the implementation probably just uses an internal
> > mutex in this case, so you don't necessarily gain anything.
>
> Am I correct in assuming that the internal mutex would be for the
> condition variable and would be used to help manage the data structure
> keeping track of which threads were blocked on the condvar?
I think that condvar wait generations are "synchronized" in a sense with the
user predicate because the condvar impl atomically unlocks the user mutex
and waits on the condition. A condvar can be implemented in many different
ways. A condvar's internal waiting logic can be based on futexes.
> If so, wouldn't using a no-op mutex type eliminate any cost that would be
> associated with acquiring the (uncontended) mutex used in conjunction with
> the condvar (i.e., the mutex passed to the condvar)? My understanding is
> that the cost of such acquisition is low on some platforms, but not so low
> on others.
Still not sure what you are getting at. Can you show me a simple example of
how the no-op mutex can be used? Here is the general pattern for an eventcount:
___________________________________________________
static eventcount g_ec;
static lock_free_queue g_queue; // 100% lock-free queue
// producer
g_queue.push(new node());
g_ec.signal();
// consumer
node* n;
while (! (n = g_queue.try_pop())) // try op
{
eventcount::key const k = g_ec.get(); // get wait key
if ((n = g_queue.try_pop())) break; // try again!
g_ec.wait(k); // wait on the key.
}
// `n' now contains a popped node!
___________________________________________________
The cool thing about this is that `lock_free_queue' has no notion of
conditional waiting. The eventcount "transforms" it into a conditional,
waitable, algorithm.
Yes, I read a bit; it seems that the condition variable is good for the
problem description. Unless the main task fulfills one of the
alternative sub-tasks itself, it should not poll for the others.
I solved this communication under the impression that all searches would
either find an item, or fail to find an item. NOT, throw an exception for
some other reason:
http://groups.google.com/group/comp.lang.c++/browse_frm/thread/9ca81bcf32799d75
Ouch! Well, I could easily return simple integer error codes in the
CAS'able result contained within the `struct complete' data-structure.
Cross-thread exceptions seem to make things more complex. If you spawn 3
futures, and 2 of them throw, where do all of those exceptions get
caught?
<pseudo-code>
__________________________________________________________
static unsigned g_exceptions = 0;
try
{
auto f1 = spawn_task(...);
auto f2 = spawn_task(...);
auto f3 = spawn_task(...);
// wait for infinity...
Sleep(INFINITE);
}
catch (future_exception const& e)
{
++g_exceptions;
cout << "caught " << g_exceptions << " exception(s)" << endl;
}
__________________________________________________________
If `f2' and `f3' throw, should the output be:
caught 1 exception(s)
caught 2 exception(s)
? I am confused!
You are. You only get an exception when you call get() on a future that holds
one, so the above code will wait forever (unless one of the spawn_task calls
throws...or the Sleep throws :-}).
To get the result held by f1 (or an exception if it holds one), you'd do
something like this:
try {
auto result = f1.get();
// f1 held a return value, work with it here
}
catch(...) {
// f1 held an exception, handle it here
}
BTW, spawn_task is not part of (draft) C++0x, but std::async is.
While toying around to familiarize myself with C++0x I too found out
that the high-level interface (std::future, std::async, std::promise,
std::packaged_task and so on) is severely lacking a select-like
functionality. I think that, more generally, it is not possible to
"wait for one future among several" (unless busy-waiting). At the time I
was toying with a thread pool class (since the libstdc++ std::async is
minimalist at this point and spawns a thread if and only if passed
std::launch::async), so I tried to do the simplest thing that would be
correct: any time a worker completes a task, it signals it using
a std::condition_variable (as you pointed out else-thread, it is a
convenient way to signal among threads); this condition_variable is a
member of the pool. Meanwhile, when a thread calls the select member
of the thread pool instance, for every future passed it checks
whether it is valid and, if so, whether it is ready. If that's the case
it returns a reference to that future. Finally, the thread waits on the
(same) member std::condition_variable and repeats the check any time
it's woken up.
Glossing over the details (overloads for passing the future and return
type and so on), there are still some kinks:
- there is no std::future::is_ready method anymore (it was in a paper
but it's not in the FCD), but you can call std::future::wait_for
passing a zero-length std::chrono::duration as an argument, which is
IMO a bit silly (see the sketch after this list)
- as also pointed out elsethread, std::condition_variable works in
tandem with std::mutex. Out of desperation (I had no need for
synchronisation), I constructed a dummy mutex each time I needed to
wait on the condition_variable. Needless to say, if you check e.g. the
posix requirements for pthread_cond_wait, this is Really Bad. After
spurious errors I resorted to adding a std::mutex as a pool member
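Something like this helper packages the zero-timeout check from the first
bullet (a sketch; depending on the draft, wait_for returns either bool or
a future_status -- this assumes the latter):

#include <chrono>
#include <future>

template <typename T>
bool is_ready(std::future<T> const& f)
{
    // a zero-length timeout returns immediately, telling us whether
    // the future is already ready
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}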
Implementing a dummy Lockable is interesting, but I suspect that it
still needs to work as a synchronization primitive; perhaps a lock-
free Lockable would work (i.e. when using std::condition_variable as a
signalling device collisions are rare, so spinning is acceptable), but
if an implementation uses the std::condition_variable to access some
location (which it is my understanding pthreads do) then we need to
explicitly remember when locking succeeds.
In any case (whether using a regular std::mutex or some other
Lockable) I'll probably refactor this in some sort of signalling
class. A sort of repeatable rendez-vous but with spurious wake-ups.
> > Paavo Helde wrote:
> >> But there is a shared state, namely the results. When the
> >> worker thread has found some result, it has to write it
> >> somewhere where the main thread can find it.
> > This is what a future is for. As I noted, my plan is for
> > each search function to have its own future. In C++0x,
> > callees write their result to a promise, and callers
> > retrieve the result through a future. The state shared by
> > a promise and future is internally synchronized, so there is
> > no need for clients to use a mutex with it.
> I see. I overlooked the usage of C++0x future<> in the OP.
> I am not familiar with this feature, but it appears to be
> a higher level construct having internal synchronisation.
Yes. The intent, if I understand it correctly, is to
encapsulate the creation of the thread and its join. The
results data are set in the thread, and only read after the join
in the creating thread.
And again, IIUC, Scott's problem can be summarized as "you can
only wait for one future at a time". You can poll them, in
a loop, but Scott, very understandably, doesn't want to do that.
I'm not as familiar as I'd like to be with the proposed
threading in the standard, but I have handled a similar problem
with Posix threads: I made the threads detached (so no join was
available), and used a (single) condition to communicate, with
a table of results, one per thread.
--
James Kanze
Clear now :)
>> If all runners have their separate location to store the result, I'd
>> start with a single atomic<bool> that is the finish sign. Each thread
>> polls it regularly and if finds set, cancels its work, exiting. And
>> whoever finishes naturally sets it.
>
> Okay, so you'd poll, which was one of the strategies I mentioned.
No, the poll you suggested is a different kind of animal. What I describe
here is internal to the worker and is done before proceeding with useful
work -- to provide early exit. IOW that is the cooperative way of thread
canceling. I doubt you can avoid it if you actually aim not to waste
processor cycles after the goal is reached.
Where you did suggest polling as a strategy, it is replaced by using join.
There are no excess wakeups and no busy or semi-busy waits.
>> Having no locks at all it seem efficient for a single-core environment.
>> On multicore there will be a ton of reads around a shared byte, but
>> unless atomic<> has some weird penalty it shall not be a problem either,
>> at least I don't see how other sync stuff would be better.
>
> Do you have a reason for preferring polling to use of a condition variable
> or, as others have suggested, a semaphore?
Sure. :) The main reason is that none of that is actually needed in the
scheme. The thread ITSELF is a sync primitive, and works much like a
cond/event. If you can use it, there is no point in creating an extra one
manually.
I'd use those if the threads did not stop immediately but lingered on to do
some other job. Like when a thread pool is used.
Reading other parts of this topic, it seems std::future is not serving too
well for your setup. So if you want performance, possibly you'd better go
with just launching threads. Besides the problem of lacking a
WaitForMultipleObjects equivalent, I don't see a straightforward way to
avoid creating an extra thread. (As I pointed out, the main thread can run
one search so only 2 others are created; in setups where you have the same
number of cores as parallel paths it is quite important not to have
parasites...)
Still there may be a way to chain the promises. I imagine a lisp-like
solution, having a list processor that calculates the result for (car)
itself, returning a joint result with that of (cdr). IOW your promise also
has a future it executes. On the top you wait for the first, which will
complete waiting on the second, which waits on the third... And passing in
the extra atomic<bool> for the early-termination signal should not be a
hard addition. May be worth exploring.
>> I need to read this post to provide a complete response. But I can
>> definitely say that event count in C++0x does not really need to be
>> standardized simply because it can be so readily constructed out of
>> atomics, membar, and conditional variable.
>
> You might want to take a look at N2709
> (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2709.html),
> which shows the nine-line spawn_task function that's the essence of what
> was ultimately standardized as std::async. (std::async is more
> sophisticated, but spawn_task is at its heart.) If something is
> sufficiently common, making it a standard library facility is not
> unreasonable. I'm also inclined to say that anything that's commonly
> useful and that requires the use of membars is a good candidate for
> standardization. We have decades of experience showing that programmers,
> as a large group, can't avoid such easy-to-understand problems as
> deadlocks and races. This suggests to me that asking them, as a group, to
> master membars and relaxed memory orderings is asking too much.
I'd guess all that is fallout of the "Cona compromise". Which wasn't
unreasonable at the time, too bad the deadlines for C++0x got missed anyway.
I'd be glad with the memory model and the thread stuff as it is in the
standard if it were finished and issued last year or earlier...
Can you summarize what an eventcount is and the conditions under which one would
use it? I'm not familiar with the idea, and googling around doesn't give me a
clear notion. All the people who discuss it seem to already know what it is :-)
> Still not sure what you are getting at. Can you show me a simple example of
> how the no-op mutex can be used?
Traditionally, a condvar is used to let a thread sleep until some shared state
is ready to be examined or modified. When the condvar is notified, the thread
is awakened and given the mutex for the shared state. In my case, what I want
is to have a thread spawn several search tasks, then sleep until one of them is
done. When it's awakened, there is no shared state to examine (because each
search task will return a separate future, and the state referred to by each
future is internally synchronized), so there is no need for a mutex. But the
condvar API requires a mutex, so the idea is that passing a no-op mutex would
avoid the cost of acquiring a mutex that the API requires but that isn't needed
in this scenario.
A similar scenario is discussed in the thread I referred to earlier:
http://tinyurl.com/22mvpso .
This isn't quite true. The calling thread can read the future as soon as it's
set by the callee (via the corresponding promise). The callee thread may then
run for a while longer, e.g., to invoke destructors of local objects and
thread-local objects. If the thread runs detached, there may never be a join,
although that complicates matters at program shutdown.
> I'm not as familiar as I'd like to be with the proposed
> threading in the standard, but I have handled a similar problem
> with Posix threads: I made the threads detached (so no join was
> available), and used a (single) condition to communicate, with
> a table of results, one per thread.
This sounds like your design had the master thread wait until all worker threads
were done, i.e., until the table was completely filled in. Is that correct? My
problem is more along the lines of wanting to wake the master when any entry in
the table is filled in with a non-exception result. But that just pushes the
notification problem into the table object: how is it to know when a thread has
posted a non-exception result if it doesn't use polling?
I'd worry that between checking the last future and waiting on the condvar, all
the threads would notify. Because this happened before you waited, you'd not
see those notifications, and because all threads had notified, you'd end up
waiting forever. Is that not a problem?
> - as also pointed out elsethread, std::condition_variable works in
> tandem with std::mutex. Out of desperation (I had no need for
> synchronisation), I constructed a dummy mutex each time I needed to
> wait on the condition_variable. Needless to say, if you check e.g. the
> posix requirements for pthread_cond_wait, this is Really Bad. After
> spurious errors I resorted to adding a std::mutex as a pool member
Regardless of what Posix says, I *think* what you did should be acceptable under
C++0x. Anthony Williams, C++0x threading expert extraordinaire, has suggested
that in an earlier thread, but draft C++0x may have changed since then, and I
may have misunderstood his suggestion. But I think that the idea of a no-op
mutex for use with a condvar should be valid.
> [...] The calling thread can read the future [...]
I already like C++, I think I'll love the upcoming standard!
I know what you guys are speaking about, of course, I just feel good
reading sentences like that ;-)
Cheers!
"Kona." It's in Hawaii :-)
> I'd be glad with the memory model and the thread stuff as it is in the
> standard if it were finished and issued last year or earlier...
I'm not sure how much difference it's making in practice. gcc 4.5 and VC10 both
have big chunks of C++0x implemented, though the threading stuff is
unfortunately still missing. I'm not sure how much more of C++0x we'd have in
our compilers now if the standard had been finished, say, last year. Perhaps a
bit more, but I'm guessing not a huge amount more.
> Paavo Helde wrote:
>> I see. I overlooked the usage of C++0x future<> in the OP. I am not
>> familiar with this feature, but it appears to be a higher level
>> construct having internal synchronisation.
>
> It also has the advantage that it can hold either a return value or an
> exception.
Which makes it more high-level in some sense. And this exception throwing
is not useful anyway in this case as far as I can see. One does not want
to trigger an exception if another thread can still produce a result, and
if all fail, one wants to throw a new exception instead (possibly
combining information from all of the worker thread exceptions).
In other words, in my experience an ordinary mutex and condition would
work nicely here, with a protected result location and a thread
counter. Alternatively, C++0x could provide a way to wait for multiple
futures simultaneously, but I have understood from other replies that it
has no such facility. If this is true, this means IMO that the future<>
feature has only very specific and restricted usage. I'm sorry if I am
messing things up here, I really do not know anything about C++0x
threading!
Regards
Paavo
I'm afraid I don't follow you here. Can you elaborate?
In the scenario I'm discussing, if one thread throws, it has no effect on other
threads. The exception for that thread will just be stored in the thread's
future. The other threads will continue to run.
> if all fail, one wants to throw a new exception instead (possibly
> combining information from all of the worker thread exceptions).
And the caller can do that if it wants to. It can just do a get() on each
future, catch any exception that may be present, then use whatever logic it
wants to decide what, if anything, should be thrown from there.
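A sketch of that caller-side logic, assuming one future per search (the
helper name is mine):

#include <cstddef>
#include <future>
#include <stdexcept>
#include <vector>

Node* collectResults(std::vector< std::future<Node*> >& futures)
{
    Node* found = NULL;
    std::size_t threw = 0;
    for (std::size_t i = 0; i != futures.size(); ++i)
    {
        try
        {
            Node* n = futures[i].get();  // value, or rethrown exception
            if (n) found = n;            // a normal success
        }
        catch (...)                      // this search threw
        {
            ++threw;  // could also stash std::current_exception() here
        }
    }
    if (found) return found;             // at least one search succeeded
    if (threw == futures.size())         // every search threw
        throw std::runtime_error("all searches failed with exceptions");
    return NULL;                         // all completed normally, none found
}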
> In other words, in my experience an ordinary mutex and condition would
> work nicely here, with a protected result location and a thread
> counter.
So you'd run all searches concurrently, passing them all a common result
location to write to. What would you do if all ended by throwing an exception?
How would you distinguish "none of the searches found anything" from "all the
searches ended by throwing an exception"?
It does. I'll take a look. Thanks for the pointer.
> lucdanton wrote:
>> member of the pool. Meanwhile, when a thread calls the select member
>> of the thread pool instance, for every future passed it checks:
>> whether it is valid, and if it is if it's ready. If that's the case
>> it returns a reference to that future. Finally, the thread waits on
>> the (same) member std::condition_variable and repeats the check
>> anytime it's woken up.
>
> I'd worry that between checking the last future and waiting on the
> condvar, all the threads would notifiy. Because this happened before
> you waited, you'd not see those notifications, and because all threads
> had notified, you'd end up waiting forever. Is that not a problem?
This is why you need a counter of finished threads, or something similar.
Notification counting is not reliable and is not meant to be used in this
way.
This explains also why condition waiting is so tightly related to mutex
locking in boost::thread; studying the mutex-protected data is (often?
always?) the only reliable way for the waiter thread to find out about
the current state of affairs.
Regards
Paavo
Then I don't see how a condvar can be used at all to let the caller know that it
needs to check the state of the callees (i.e., if they have returned something).
If it blocks on the condvar, it will reliably wake only if the condvar is
notified, but how can it guarantee that the workers aren't performing their
notifies just before it blocks on the condvar?
From past experience I recall only 3 methods for thread cancelation:
1. you have an API that kills a thread. The docs warn you that you shall not
use it. Using it removes the thread's execution from the scheduler while it
can be in any state, so you have a good chance of leaking resources, and a
considerable chance of being in a critical section or something, leaving
locked mutexes, etc., making the program unstable. I have only ever used it
in program shutdown code when other ways failed (the thread got stuck due to
some design error...)
2. thread cancel triggering an exception. Some systems (maybe java?) define
points where the runtime checks the request on the thread and emits a
specific exception. So the programmer has some chance to juggle with
finally{} and recover -- certainly say goodbye to the NOTHROW concept...
3. the cooperative way I was talking about: the thread itself watches a
variable or condition that is set outside, and interprets it as a request to
abandon whatever is done and exit ASAP.
The points where it looks at the variable are left to the thread, and the
method of access varies. One common way is to have a pthread mutex/cond
pair, or on W32 an Event (optionally with a mutex), and the usual
producer/consumer queue, with the cancel request either in-band as a message
in the queue or in a separate variable that is checked along the way. The
mess begins when the processing involves other kinds of blocking too, like
waiting on a socket, file IO, etc. A common way to fight the complexity is
to give in and use some timeout in the blocks, then check, say, every 1 sec.
The situation you described is way more friendly: you have a process without
external blocking, and a natural loop with reasonably short processing
steps, so here the best way is just to read a variable. cond/event is not
needed. For correctness you still need a mutex or at least a membar for the
regular case, but atomic<> stuff has those semantics in the pack
implicitly. So you can read it before visiting every next node. (If the
visit is too short AND atomics are hosed on the particular system, there may
be a counter to check only every Nth visit; I doubt it would pay off.)
atomic<bool> is good for sending in a request to stop. You can use an
atomic<int> as a kind of semaphore or completion count to report status too.
Say, start with 0, and every thread that completes does an increment. For
your situation, as soon as it is at 3 the work is done. Certainly that
leaves you with your original problem in the main thread: what to do until
all the bumps have happened.
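In code the counter idea is just this (the names are mine):

#include <atomic>

std::atomic<bool> g_stop(false);  // the request for the others to stop
std::atomic<int>  g_done(0);      // completion count, starts at 0

void on_worker_exit()             // every thread runs this when finished
{
    g_done.fetch_add(1);          // bump the count
}

bool all_done()                   // true once all 3 workers have exited
{
    return g_done.load() == 3;
}
// note: the main thread still needs some way to block until all_done(),
// which is the original problem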
Doing a poll from this really IDLE thread is really not good. You already
fiddled with cond+fake mutex; that is a working way. I could suggest an
alternative: use just a mutex.
1. the main thread sets an atomic<int> variable to N, the number of threads
(N=3 in your example).
2. create and lock a mutex
3. launch N threads. On exit each decrements the variable; if the result is
zero it unlocks the mutex. (They use the cancel signal towards each other
too, certainly.)
4. the main thread waits on the mutex.
Unless I overlooked something, you proceed on the main thread exactly when
appropriate. As I imagine the setup there may be other problems: if the
objects for the threads are local in the function frame, you may have them
auto-destroyed while the thread is not completely finished. I generally
prefer joins as the easiest way to avoid that situation.
> Paavo Helde wrote:
>> Which makes it more high-level in some sense. And this exception
>> throwing is not useful anyway in this case as far as I can see. One
>> does not want to trigger an exception if another thread can still
>> produce a result
>
> In the scenario I'm discussing, if one thread throws, it has no effect
> on other threads. The exception for that thread will just be stored
> in the thread's future. The other threads will continue to run.
>
>> if all fail, one wants to throw a new exception instead (possibly
>> combining information from all of the worker thread exceptions).
>
> And the caller can do that if it wants to. It can just do a get() on
> each future, catch any exception that may be present, then use
> whatever logic it wants to to decide what, if anything, should be
> thrown from there.
>
>> In other words, in my experience an ordinary mutex and condition
>> would work nicely here, with a protected result location and a
>> thread counter.
>
> So you'd run all searches concurrently, passing them all a common
> result location to write to. What would you do if all ended by
> throwing an exception? How would you distinguish "none of the searches
> found anything" from "all the searches ended by throwing an
> exception"?
Hmm, it seems I have lost the initial task setup. I thought that a thread
either finds something or throws an exception. Ternary logic
(true/false/not_found) complicates things a bit ;-)
Anyway, in my setup, each worker thread would decrement a thread counter
and notify the master thread when terminating, regardless of the result.
A found result would be stored in one location (overwriting the previous
result or discarded as needed), error messages collected in another
location (in a string or in something more elaborate) (plus logged to a
log file if they mean abnormal behavior), and the thread counter would be
the third location. These would be the data items protected by the mutex
the master thread uses for condition waiting.
When notified, the master thread would check what it has got. If result
is found, fine, return the result. If not, and all threads have
terminated, check if there are error messages. If yes, then something
went wrong, throw an exception including the error messages. If not,
return "not found".
If there is a way to wait for multiple future<> results, all this could
be done by the master thread instead, which can then sort out all this
ternary logic by itself. This might be a bit cleaner, especially in
regard to dealing with orphaned threads (those not yet finished when the
master thread returns the found result).
Regards
Paavo
MT design is mostly about avoiding sharing as much as possible. Sharing
will involve extra sync primitives, extra lock operations, task switches.
> What would you do if all ended by throwing an exception?
And also extra complexity in cases like that: a race in both time and
space, with extra chances to overlook some interaction.
I would not start considering it before having multiple (per-thread) result
locations is proven to be an issue, which is hard to imagine even on a
system without dynamic allocation.
> Paavo Helde wrote:
>> This explains also why condition waiting is so tightly related to
>> mutex locking in boost::thread; studying the mutex-protected data is
>> (often? always?) the only reliable way for the waiter thread to find
>> out about the current state of affairs.
>
> Then I don't see how a condvar can be used at all to let the caller
> know that it needs to check the state of the callees (i.e., if they
> have returned something).
> If it blocks on the condvar, it will reliably wake only if the
> condvar is
> notified, but how can it guarantee that the workers aren't performing
> their notifies just before it blocks on the condvar?
It has to check the state of mutex-protected variables before entering
the wait. As the mutex is locked, other threads cannot change the state
between this check and entering the wait. The system guarantees that
entering the wait and releasing the lock on mutex are done atomically, in
the sense that no notification is lost at this point. Any other thread
wanting to notify the condvar will lock the mutex, change the data state,
then notify (before or after releasing the lock, this does not matter
AFAIK). This guarantees the waiter thread always wakes up when needed.
Notifications may get lost when the master thread has released the
lock and is running somewhere else, and also when two notifications arrive
at almost exactly the same time (after the mutex is released it is not
guaranteed who will get the next lock, so two worker threads may lock the
mutex, update the data and call notify before the master thread wakes up).
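In C++0x terms, that discipline is exactly what the predicate form of
condition_variable::wait packages up; a minimal illustration (the names are
mine):
_____________________________________________________________
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false;   // the mutex-protected state

void worker()
{
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;               // change the state under the lock...
    }
    cv.notify_one();                // ...then notify (lock already released)
}

int main()
{
    std::thread t(worker);
    std::unique_lock<std::mutex> lk(m);
    // wait(lk, pred) checks the predicate before blocking and after every
    // wakeup, so a notify that happens before we block here is never lost
    cv.wait(lk, [] { return ready; });
    lk.unlock();
    t.join();
}
_____________________________________________________________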
Regards
Paavo
Here is a pseudo-code sketch of a much more conventional C++0x solution that
should meet the requirements of the problem. Keep in mind that I do not have
access to a C++0x threading library...
_____________________________________________________________
struct entry
{
    bool compare(unsigned key) const;
};

struct array
{
    entry m_data[DEPTH]; // DEPTH: the array size, defined elsewhere
};

struct complete
{
    std::condition_variable_any m_cond; // _any: can wait on a plain std::mutex
    std::mutex m_mutex;
};

struct search_thread
{
    array& m_array;
    unsigned m_key;
    complete& m_complete;
    std::promise<entry*>& m_promise; // promises are move-only; hold a reference
    std::atomic<bool> m_cancel;
    std::thread m_thread;

    search_thread(array& a,
                  unsigned key,
                  complete& c,
                  std::promise<entry*>& p)
    :   m_array(a),
        m_key(key),
        m_complete(c),
        m_promise(p),
        m_cancel(false),
        m_thread(&search_thread::entry, this)
    {
    }

    ~search_thread()
    {
        m_thread.join();
    }

    void entry() // the search thread entry
    {
        for (unsigned i = 0;
             i < DEPTH && ! m_cancel.load(std::memory_order_relaxed);
             ++i)
        {
            if (m_array.m_data[i].compare(m_key))
            {
                m_complete.m_mutex.lock();
                m_promise.set_value(&m_array.m_data[i]);
                m_complete.m_mutex.unlock();
                m_complete.m_cond.notify_one();
                return;
            }
        }
        m_complete.m_mutex.lock();
        m_promise.set_value(NULL);
        m_complete.m_mutex.unlock();
        m_complete.m_cond.notify_one();
    }
};
// the main search function
static entry* search(array& a, unsigned key)
{
    // create our completion structure
    complete c;

    // create a promise and future for each search
    std::promise<entry*> p_1;
    std::promise<entry*> p_2;
    std::future<entry*> f_1 = p_1.get_future();
    std::future<entry*> f_2 = p_2.get_future();

    // create our wait state
    std::future<entry*>* waiting[2] = { &f_1, &f_2 };

    // create a thread for each promise
    search_thread t_1(a, key, c, p_1);
    search_thread t_2(a, key, c, p_2);

    // wait for any results
    entry* result = NULL;
    unsigned exceptions = 0;
    c.m_mutex.lock();
    for (;;)
    {
        for (unsigned i = 0; i < 2; ++i)
        {
            if (waiting[i] && waiting[i]->is_ready())
            {
                // okay, a result is actually ready from a future.
                try
                {
                    result = waiting[i]->get();
                    goto bailout;
                }
                catch (...)
                {
                    // well, the future holds an exception!
                }
                waiting[i] = NULL;
                if (++exceptions == 2)
                {
                    // damn... both of the futures hold exceptions!
                    goto bailout;
                }
            }
        }
        c.m_cond.wait(c.m_mutex); // _any condvar can wait on the plain mutex
    }
bailout:
    c.m_mutex.unlock();
    // cancel all of the search threads
    t_1.m_cancel.store(true, std::memory_order_relaxed);
    t_2.m_cancel.store(true, std::memory_order_relaxed);
    // that's all folks! ;^D
    return result;
}
_____________________________________________________________
Please examine the code sketch Scott. I use a simple condvar and mutex to
avoid spin-polling for the futures state. The wait logic allows for graceful
handling of futures with exceptions. It should do all you want.
Now... An eventcount would get around the need to use condvar+mutex on the
fast-path. It would be much more efficient. Perhaps I will post an
eventcount version when I get some time.
IIRC, the pthread docs state that after you return from the condvar wait,
you MUST evaluate the condition again, as the system may wake you up out of
order. It's a QoI issue for that not to happen often, but it must be
considered in the design.
Yeah, quoting the man page for pthread_cond:
" The pthread_cond_wait() function is used to block on a condition
variable. It is called with mutex locked by the calling thread. The mutex
is released and the calling thread is blocked atomically, waiting for the
associated condition to be signalled by another thread. Upon successful
completion the mutex is re-locked and owned by the calling thread. The
predicate associated with the condition variable should be tested and the
pthread_cond_wait() call repeated if necessary.
"
> (i.e., if they have returned something). If it blocks on the condvar, it
> will reliably wake only if the condvar is notified, but how can it
> guarantee that the workers aren't performing their notifies just before it
> blocks on the condvar?
It's not a practical problem (discounting people unaware of the quoted
obligation): you evaluate the same original condition that the party which
supposedly signaled had evaluated, and go back to waiting if it's not true.
A pthread_cond is NOT the condition itself, just a handy way to poll it as
few times as possible.
Part of the reason the mutex is there is to prevent notification races
(without it, you might lose a notification). You do have shared data as
well: your notification flag (or whatever you are using).
All things being equal, condition_variable_any should have more overhead
than condition_variable. If you look at Boost threads
condition_variable_any simply uses an additional internal mutex to do
what you would do with the standard condition_variable usage pattern by
hand.
E.g. (from Boost boost/thread/pthread/condition_variable.hpp):
class condition_variable_any
{
[...]
template<typename lock_type>
void wait(lock_type& m)
{
int res=0;
{
detail::interruption_checker check_for_interruption(&cond);
{
boost::pthread::pthread_mutex_scoped_lock
internal_lock(&internal_mutex);
m.unlock();
res=pthread_cond_wait(&cond,&internal_mutex);
}
m.lock();
}
if(res)
{
boost::throw_exception(condition_error());
}
}
[...]
> But the condvar API
> requires a mutex, so the idea is that passing a no-op mutex would avoid
> the cost of acquiring a mutex that the API requires but that isn't
> needed in this scenario.
>
> A similar scenario is discussed in the thread I referred to earlier:
> http://tinyurl.com/22mvpso .
Not really true. For condition_variable_any, you can get away with that,
since it already uses an additional internal mutex. For
condition_variable you need to use a valid mutex, not just some dummy.
This pseudo-code will only work when a future is obtained from a promise:
________________________________________________________
struct complete
{
    std::condition_variable_any m_cond; // _any: can wait on a plain std::mutex
    std::mutex m_mutex;
};

template<typename T>
struct promise_ex
{
    std::promise<T> m_promise;
    complete& m_complete;

    void set_value(T v)
    {
        m_complete.m_mutex.lock();
        m_promise.set_value(v);
        m_complete.m_mutex.unlock();
        m_complete.m_cond.notify_one();
    }
};

template<typename T>
struct multi_wait
{
    complete m_complete;
    linked_list<std::future<T>*> m_list;

    void push_future(std::future<T>* f)
    {
        m_list.push(f);
    }

    std::future<T>* wait()
    {
        std::future<T>* result = NULL;
        m_complete.m_mutex.lock();
        while (! m_list.is_empty())
        {
            for (std::future<T>* f : m_list) // assumes linked_list is iterable
            {
                if (f->is_ready())
                {
                    result = f;
                    m_list.pop(f);
                    goto bailout;
                }
            }
            m_complete.m_cond.wait(m_complete.m_mutex);
        }
    bailout:
        m_complete.m_mutex.unlock();
        return result;
    }
};
________________________________________________________
That should allow a thread to wait on the result of multiple futures without
spin-polling.
Interesting idea, thanks. I'd been thinking that each worker would notify the
main thread when it was done. At each notification, the main thread would check
the result to see if it was "real" or an exception, and if real, it would tell
the remaining workers to stop working. In his book, Anthony Williams uses TLS
for each worker's "stop now" flag, and, within each worker, checking that flag
would presumably be faster than accessing a shared flag such as you suggest.
Certainly it would be more encapsulated; having multiple threads share the
"stop now" flag makes me think of global variables, which is something I really
don't like to think about. Your approach also makes it tougher to detect that
e.g., some workers threw an exception while others ran to completion. But I see
how it works, thanks again for the idea.
Okay, then this means we can't use the readiness of a future as a notification
mechanism, because the mutex controlling that state is hidden. We have to have
some other data structure whose synchronization is under user control. Good to
know.
I suspect that this is implemented internally by polling the futures, because
there is, AFAIK, no API that allows a future to issue a notification that it's
ready, at least not in C++0x.
I took a look at the source code for boost::wait_for_any, but after wandering
around in the wilderness for a while, I retreated, scratched and bruised,
without having figured out what was really going on. (wait_for_any waits on a
future_waiter, which holds a vector of registered_waiters named futures, each
element of which holds a shared pointer named future to a future_object_base....)
Can somebody confirm or deny that boost::wait_for_any is implemented by polling
the futures?
> I believe it is actually more interesting if you don't want to wait
> but want to go ahead and do_stuff( ) in the main thread;
I agree, but my philosophy is:
- First we crawl.
- Then we crawl on broken glass.
I'm still working on the crawling part.
As I noted at the outset, I'm no MT expert, so my judgment is suspect, but this
looks like it would work. I like that I can wait for the first result to come
back, and if I don't like it (e.g., because it's an exception), I can easily
continue waiting on the rest of the worker threads. Once I have a result I
like, there's a handy list of futures that tells me which workers need to be told
to stop working. (There's no API for that in C++0x, but Chris's solution
augments the promise API, so it's not surprising that we'd have to augment
(i.e., build on) the standard future API, too.)
This looks to be essentially the same approach you took in another posting that
I already replied to. As I wrote there, I don't see any problems with it, and I
think the approach has some nice properties.
Thanks for posting this approach.
It doesn't poll. It uses a single condition variable internally to wait
for any future to become ready (condition_variable_any). That doesn't
help you if you want to stay within C++0x, though, since it also relies on
the internal mutex that each of the Boost futures has.
So presumably there is logic in the promise corresponding to a future that does
a notify on the condvar associated with the future?
Thanks,
The unsafe window happens in the case of a spurious wake-up, when I check
all futures for readiness but don't find any. If at that point one future
comes to completion before I wait again on the condvar, the associated
worker will signal, but no one is waiting, so the signal will be missed. So
in all likelihood, once all but one future is ready, the wait will be
entered one last time, waiting for that elusive signal.
Is that correct?
I guess I'm back to using a semaphore.
Well, the technique should work fine. EXCEPT for the fact that I cannot seem
to find a function that is able to poll a future for "readiness". It appears
that the `future<>::is_ready()' function is nowhere to be found within the
following draft:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf
I see the `future<>::valid()' function, but I don't quite know what it is
supposed to do.
DAMN! It appears that one has to call a timed wait with a timeout of zero in
order to perform a poll operation wrt the readiness of a future. Why is
`future<>::is_ready()' missing in this draft!?
;^(...
SH#@I#@$I#@$I!!!!
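For what it's worth, the zero-timeout poll can be wrapped in a tiny helper
(this uses the final C++11 names; a given C++0x draft may spell the return
type differently):
_____________________________________________________________
#include <chrono>
#include <future>

// poll a future without blocking: "ready" means get() returns immediately
// nb: reports false for a deferred future (std::future_status::deferred)
template<typename T>
bool is_ready(std::future<T>& f)
{
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}

int main()
{
    std::future<int> f = std::async(std::launch::async, [] { return 42; });
    while (! is_ready(f)) { /* spin-poll; for illustration only */ }
    return f.get() == 42 ? 0 : 1;
}
_____________________________________________________________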
Humm... Perhaps I speak too fast here Scott. I think I can totally get
around the fact that `future<>::is_ready()' does not seem to be included in
the current C++0x draft. It should be simple. Let me think for a moment, and
I will try to post a solution...
>> I believe it is actually more interesting if you don't want to wait
>> but want to go ahead and do_stuff( ) in the main thread;
>
> I agree, but my philosophy is:
> - First we crawl.
> - Then we crawl on broken glass.
>
> I'm still working on the crawling part.
LOL. The interesting thing is that, of the approaches I considered for your
example, those that execute one search on the main thread look and feel
less complex than those where the main thread implements an idle wait
instead. My idea of 'learn to crawl first' for this situation would take
another route: first create the solution using classic elements, and leave
the translation to std::future for the second step.
The lighter way is to not look at the counter value in the worker: send the
notify blindly and inspect in the main thread. That removes a little
knowledge-export but creates extra wakeups.
What you describe is checking the whole 'done this far' state, effectively
re-evaluating it. That means re-inspecting all your threads' result areas,
probably marking some, etc. -- which smells like a ton of redundant
processing, needed only because you moved the check from where the native
information is present to where it is lost. Overall complexity grows.
> In his book, Anthony Williams uses TLS for each worker's "stop now" flag,
> and, within each worker, checking that flag would presumably be faster
> than accessing a shared flag such as you suggest.
Hm, I seriously doubt Anthony would suggest that -- I'd think the
comparison probably used some other method (like a mutex-protected
variable). I also fail to see a simple way to change the flag's state from
another thread if it is in TLS... Please give me a reference; then, if
still in doubt, we can ask him (normally he's active on ACCU...).
If you use a mutex, then contention is really a thing to consider, as
everyone will keep flipping it, with possible collisions too, so separation
may be due.
If you use an atomic instead, the picture is different, mostly because it
keeps its state and changes only once. You have a zillion reads, but those
do not block; they only make sure the cache is not invalidated. So I don't
see how separation would make anything better contention-wise.
> Certainly it would be more encapsulated; having multiple threads share
> the "stop now" flag makes me think of global variables, which is something
> I really don't like to think about.
Global? The solutions I presented had all the state in the stack frame where
you start the search. Can't imagine anything more local ;-) You only pass
addresses around.
And sharing them naturally maps to your requirement: IIRC, to stop *all*
siblings as soon as any one has found the solution. So that signal is best
shared by all. Reasons may surface to organize it differently, but I see
that as a deviation, or de-normalization.
> Your approach also makes it tougher to detect that e.g., some workers
> threw an exception while others ran to completion.
I fail to see how that is harder or easier -- once done, the main thread
scans the results and evaluates them. Same algo. (We did NOT share the
'final result' area, for reasons discussed earlier; that would make it a
simple read. I hope you did not compare against that.)
> But I see how it works, thanks again for the idea.
On the encapsulation/logic-distribution point, I think you miss a detail. I
wrote the logic in the 'abstract'. In real life it is often packaged
differently, which as a side effect covers your concerns.
For instance, instead of passing in the atomic<int> counter and the mutex,
I'd pass in a functor that the thread shall call on finishing. (In a
general framework I'd probably have two functors, called before and after
the payload work.)
The functor -- a simple instance of boost::bind or sigc::bind -- holds the
logic to decrement and unlock, or to do whatever else you want delegated to
help the main thread. You may even include the logic to inspect the result
and set the shared stop flag. Or join the neighbor, etc.
Alternatively, an implementation could have a base class doing the control
logic, with your "search" implementation as a virtual function or strategy.
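A sketch of that packaging, with a C++0x lambda standing in for the
boost::bind/sigc::bind instance; all names here are illustrative:
_____________________________________________________________
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// run the payload, then delegate all bookkeeping to a caller-supplied functor
void run_search(std::function<void()> payload, std::function<void()> on_finish)
{
    payload();      // the actual search
    on_finish();    // decrement/notify/whatever the main thread delegated
}

int main()
{
    std::mutex m;
    std::condition_variable cv;
    int remaining = 2;

    // the functor holds the decrement-and-notify logic
    auto on_finish = [&] {
        std::lock_guard<std::mutex> lk(m);
        if (--remaining == 0) cv.notify_one();
    };

    std::thread t1(run_search, [] { /* search #1 */ }, on_finish);
    std::thread t2(run_search, [] { /* search #2 */ }, on_finish);

    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return remaining == 0; });
    lk.unlock();
    t1.join(); t2.join();
}
_____________________________________________________________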
The Boost futures have support for condition variables (part of future,
not promise). You can register condition_variable_any's with a future.
It looks like quite a bit of stuff was cut for C++0x.
A known way to reduce multithreading complexity is to use inter-thread
message queues. I have successfully implemented such message queues to
introduce parallelism in a script language which does not otherwise have
any script-level MT primitives.
With a message queue, the task becomes simple: each worker thread just
posts its results to the queue, and the main thread reads the messages in
a loop and just counts successes, failures and exceptions as it likes.
All the MT synching would be hidden in the (reusable) queue component.
I don't know whether one could implement such a queue in terms of
future<>s; they seem to be about the same level of abstraction.
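A minimal sketch of such a reusable queue on top of the C++0x primitives
(the message type and names are illustrative):
_____________________________________________________________
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// a minimal, reusable, unbounded inter-thread message queue
template<typename T>
class msg_queue
{
    std::queue<T> m_q;
    std::mutex m_mutex;
    std::condition_variable m_cond;
public:
    void post(T msg)
    {
        {
            std::lock_guard<std::mutex> lk(m_mutex);
            m_q.push(std::move(msg));
        }
        m_cond.notify_one();
    }
    T read() // blocks until a message is available
    {
        std::unique_lock<std::mutex> lk(m_mutex);
        m_cond.wait(lk, [this] { return ! m_q.empty(); });
        T msg = std::move(m_q.front());
        m_q.pop();
        return msg;
    }
};
_____________________________________________________________
Each worker would then post a small message (worker id plus
result-or-error), and the master would simply read three messages, tallying
successes, failures and exceptions as it goes.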
Regards
Paavo
> This isn't quite true. The calling thread can read the future
> as soon as it's set by the callee (via the corresponding
> promise). The callee thread may then run for a while longer,
> e.g., to invoke destructors of local objects and thread-local
> objects. If the thread runs detached, there may never be
> a join, although that complicates matters at program shutdown.
In other words, it uses a condition in its implementation. (I'm
not sure how this works with exceptions, though.)
> > I'm not as familiar as I'd like to be with the proposed
> > threading in the standard, but I have handled a similar problem
> > with Posix threads: I made the threads detached (so no join was
> > available), and used a (single) condition to communicate, with
> > a table of results, one per thread.
> This sounds like your design had the master thread wait until
> all worker threads were done, i.e., until the table was
> completely filled in. Is that correct?
In my case, yes. In practice, the master thread was woken up
each time a worker thread updated the table. It then checked
whether it had all the information it needed, and went back to
sleep if not.
> My problem is more along the lines of wanting to wake the
> master when any entry in the table is filled in with
> a non-exception result. But that just pushes the notification
> problem into the table object: how is it to know when
> a thread has posted a non-exception result if it doesn't use
> polling?
When the worker thread updates the table, it locks the mutex,
makes the modifications, then calls pthread_cond_signal on the
condition and frees the mutex. The master thread then wakes up
and checks the conditions.
--
James Kanze
I'd actually been thinking about this design. C++0x has no thread-safe queue,
but both TBB and PPL support concurrent_queue with compatible interfaces, so
that's likely to become a de facto standard.
He creates a class interruptible_thread that mimics the std::thread
interface, except that it adds an interrupt() function. The TLS flag is set
when somebody calls interrupt() on the interruptible_thread. You can see
the whole thing in section 9.2 of his "C++ Concurrency in Action"
(http://www.manning.com/williams/), which I'm again pleased to be able to plug.
> Global? The solutions I presented had all the state in the stack frame
> where you start the search. Can't imagine anything more local ;-) You
> only pass addresses around.
The problem with globals is that you can't tell where they are read/written
without examining your entire code base. Taking a local variable and passing
out a bunch of pointers to it has the same problem. From an analysis point of
view, it's no longer local.
>> Your approach also makes it tougher to detect that e.g., some workers
>> threw an exception while others ran to completion.
>
> I fail to see how that is harder or easier -- once done the main thread
> scan the results and evaluate them. Same algo. (We did NOT share the
> 'final result' area for reasons discussed earlier, that would make it a
> simple read, hope you did not compare against that.)
I was referring to this comment of yours:
> Still there may be a way to chain the promises. I imagine a lisp-like solution, having a list processor that calculates the result for (car) itself, returning a joint result with that of (cdr).
This sounds more complicated to me than simply having one future for each worker.
On what platform?
>
> Suppose I have a graph that I want to search for a node that has some
> characteristic (e.g., holds a value within some range). Suppose
> further that of the many ways to search the graph (e.g., depth-first,
> breadth-first, random walk, etc.), I can't predict which will be best,
> so I want to run
> them all concurrently, stopping when either one of them finds a
> suitable node or they all fail. That is, given something like this
>
> Node* dfsSearch(Graph g, Predicate p); // do dfs search of g for
> // a node satisfying p
> Node* bfsSearch(Graph g, Predicate p); // do bfs search of g
> Node* rwSearch(Graph g, Predicate p); // do random walk search
> of g
> I want to do this:
>
> concurrently invoke dfsSearch, bfsSearch, and rwSearch;
> wait until one returns success or they all return lack of success;
> if one returned success, tell the others to stop searching;
>
> One of the problems in mapping this to C++0x is that C++0x has no
> notion of thread cancellation/interruption, but that can be built
> manually, as Anthony Williams shows in section 9.2 of his "C++
> Concurrency in Action" (http://www.manning.com/williams/), which I'm
> pleased to be able to plug here.
> My question, however, has to do with implementing this part:
>
> wait until one returns success or they all return lack of success;
See the Windows API WaitForMultipleObjects. This kind of stuff is trivial
in Windows.
http://msdn.microsoft.com/en-us/library/ms687025(VS.85).aspx
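For illustration only (this is Win32, not C++0x), a minimal wait-for-any
sketch; the event handles are hypothetical stand-ins for per-searcher
completion events:
_____________________________________________________________
#include <windows.h>

int main()
{
    // one manual-reset event per searcher; a worker would SetEvent its
    // own handle when done
    HANDLE done[3];
    for (int i = 0; i < 3; ++i)
        done[i] = CreateEvent(NULL, TRUE, FALSE, NULL);

    SetEvent(done[1]); // stand-in for "searcher 1 finished"

    // wait for ANY one of the three (bWaitAll == FALSE)
    DWORD r = WaitForMultipleObjects(3, done, FALSE, INFINITE);
    int which = (int)(r - WAIT_OBJECT_0); // index of the signaled handle

    for (int i = 0; i < 3; ++i) CloseHandle(done[i]);
    return which;
}
_____________________________________________________________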
>
> Considering only the simplest case (where at least one search returns
> success), what's the best way to wait? I'm going to assume that I
> have a future<Node*> for each function.
>
> I could wait on a condition variable and have each search do a
> notify_one(), but condition variables are inherently tied to mutexes,
> and in this case, there is no need for a mutex (no shared state), so
> that seems inappropriate to
> me.
> I could poll the futures (waiting for each with a timeout of 0), but
> polling constantly seems like a waste of machine resources, and
> polling with some lag period between polls introduces a delay between
> when a future is ready and when I detect that. The attraction of a
> condition variable is that I
> avoid that delay.
> I have no MT experience, so I'm looking for advice from people who
> do. C++0x threading experience is rare, of course, but I think this
> is more of a general MT design question.
>
> All insights appreciated.
>
> Scott
Something like this should be able to "work around the problem":
_______________________________________________________
struct complete
{
    std::condition_variable_any m_cond; // _any: can wait on a plain std::mutex
    std::mutex m_mutex;
};

// new data-structure...
template<typename T>
struct future_ex
{
    std::future<T> m_future;
    bool m_ready; // == false;

    bool is_ready() const
    {
        return m_ready;
    }
};

template<typename T>
struct promise_ex
{
    std::promise<T> m_promise;
    complete& m_complete;
    /* new -> */ future_ex<T>& m_future_ex;

    void set_value(T v)
    {
        m_complete.m_mutex.lock();
        m_promise.set_value(v);
        /* new -> */ m_future_ex.m_ready = true;
        m_complete.m_mutex.unlock();
        m_complete.m_cond.notify_one();
    }
};

// new ... Changed all `std::future<T>' to `future_ex<T>'
template<typename T>
struct multi_wait
{
    complete m_complete;
    linked_list<future_ex<T>*> m_list;

    void push_future(future_ex<T>* f)
    {
        m_list.push(f);
    }

    future_ex<T>* wait()
    {
        future_ex<T>* result = NULL;
        m_complete.m_mutex.lock();
        while (! m_list.is_empty())
        {
            for (future_ex<T>* f : m_list) // assumes linked_list is iterable
            {
                if (f->is_ready())
                {
                    result = f;
                    m_list.pop(f);
                    goto bailout;
                }
            }
            m_complete.m_cond.wait(m_complete.m_mutex);
        }
    bailout:
        m_complete.m_mutex.unlock();
        return result;
    }
};
_______________________________________________________
Humm... I now have to explicitly extend futures and promises in order for
this `multi_wait<>' scheme to work out. Scott, is there truly no way to poll
a `std::future<>' for completion such that a subsequent call to
`std::future<>::get' will be 100% guaranteed to contain either a result or
an exception? What is that `std::future<>::valid' call all about!?
Confused again!
;^(...
I don't know that, but what's wrong with doing the wait with a zero timeout?
AFAIK, that's the accepted way to poll.
A personal bias... :^(
Perhaps the wait function... Ahhhh, I almost tried to optimize a very slow
path...
:^o
> AFAIK, that's the accepted way to poll.
It works.
Which compilers are you folks using to test this?
With MinGW g++ 4.4.1 and Visual C++ 10.0 I find no <future> header...
Cheers,
- Alf
--
blog at <url: http://alfps.wordpress.com>
I was totally confused for some reason. Anyway, I think I am beginning to
understand the future/promise API much better.
> You only get an exception when you call get() on a future that holds one,
> so the above code will wait forever (unless one of the spawn_task calls
> throws...or the Sleep throws :-}).
;^)
[...]
I use VC10 and the just::thread library from Just Software Solutions
(http://www.stdthread.co.uk/). From the library's web site:
# Compatible with Microsoft Visual Studio 2008, Microsoft Visual C++ Express
2008, Microsoft Visual Studio 2010 and Microsoft Visual C++ Express 2010 for
both 32-bit and 64-bit Windows targets.
# Compatible with g++ 4.3 and g++ 4.4 for 32-bit and 64-bit Ubuntu linux
targets, making full use of the C++0x support from g++ including rvalue
references and variadic templates.
g++ 4.5 on Linux ships with <future> (and std::future, std::promise,
std::packaged_task std::shared_future, std::async are in there). Can't
remember for 4.4, and last time I checked nobody was working on making
<future> work for MinGW.
> On what platform?
If he's asking about C++0x, he's asking about a more or less
portable solution (a solution portable to all platforms which
support C++0x). The platform shouldn't matter.
At least in Windows XP, WaitForMultipleObjects did NOT know how
to wait for a C++0x future. (Understandably so, since the
proposal for futures hadn't even been drafted when Windows XP
came out.) Scott mentioned one possible C++0x solution, using
condition variables, which are considerably more flexible than
WaitForMultipleObjects (although in simple cases,
WaitForMultipleObjects may be easier to use). And is at least
somewhat portable, even today, since Boost::thread supports
condition variables for both Windows and Unix. (They're the
standard idiom in Unix, supported directly by pthreads. And I
believe that Windows 7 supports them directly as well---under
earlier versions they have to be emulated.)
--
James Kanze
I had an inkling that it was a computer scientist question!
I didn't know if the futures thing was a key part of his post (I don't
usually read long posts in detail, especially ones with a different
vocabulary than my own, and I don't know "futures"... maybe in the future
I will ;)), but I thought I'd add what I did just in case it was
relevant. I noticed that he said he didn't know much about MT development
and also noticed that everyone was just responding from a UNIX-like
(pthreads and such) perspective, so I thought I'd just wing it out there
just in case.
This is sort of a pet peeve of mine, but please don't suggest to
anyone to actually directly use the windows threading API if at all
possible. It's borked, at least for windows XP and earlier, as it
lacks condition variables. When working at the level of mutexes or
semaphores (and not hardware memory barriers or lock free algorithms),
you really need the "primitive" "release mutex and wait on a signal"
aka pthread_cond_wait in order to do sane threading programming at the
mutex or semaphore level. Implementing the equivalent of
pthread_cond_wait with the windows XP and earlier API is surprisingly
difficult, and generally results in much less efficient code than if
they had properly implemented the kernel and scheduler to directly
understand pthread_cond_wait or equivalent.
My favorite papers on the subject are:
I don't follow why it would be faster to use TLS. The location of the
shared state is immaterial. The master thread needs to modify some
shared memory which the worker thread periodically reads. It doesn't
matter if it's a global, passed as an argument to the worker thread,
or in a TLS of the worker thread.
> Certainly it would be more encapsulated; having multiple threads share the
> "stop now" flag makes me think of global variables, which is something I really
> don't like to think about.
I don't agree. I wouldn't make such a blanket statement based on some
passing similarity. Actual globals, aka namespace scope objects and
static storage objects, should in general be avoided because it
restricts you to having one per process. Sometimes, you may find that
whatever it was doing, you suddenly need to do 2, or N, of those
things concurrently in a single process.
Instead, you could have a unique context object for a job which you
explicitly pass to worker threads. Unlike globals, this in no way
prevents multiple concurrent executions of jobs.
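A minimal sketch of that idea; the names are illustrative:
_____________________________________________________________
#include <atomic>
#include <thread>

// a per-job context passed explicitly to each worker; nothing global,
// so N jobs can run concurrently in one process
struct job_context
{
    std::atomic<bool> stop_flag;
    job_context() : stop_flag(false) {}
    // ... per-job result slots, logging callback, etc. ...
};

void worker(job_context& ctx)
{
    while (! ctx.stop_flag.load()) { /* do a slice of the search */ }
}

int main()
{
    job_context job1, job2;      // two independent jobs in one process
    std::thread t1(worker, std::ref(job1));
    std::thread t2(worker, std::ref(job2));
    job1.stop_flag.store(true);  // stopping job1 doesn't touch job2
    job2.stop_flag.store(true);
    t1.join(); t2.join();
}
_____________________________________________________________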
As a specific example, in my company's product, we have an engine
which runs a domain specific language. Originally, we assumed that
there would be at most one engine per process. However, later we
decided that it would be desirable to have multiple concurrently
executing engines in a single process. We took all of the offending
globals and changed them into global TLS objects, that is a TLS object
declared at global scope.
I just hit an issue last Friday because they were put into a global
TLS instead of some job context object. A client can call functions on
the executing server, such as send input to the engine, receive output
from the engine, stop it, get a logging callback, etc. When a client
thread accesses the engine, the engine implicitly sets the global TLS
entry for that client thread. The problem arose when a single client
thread tried to access multiple engines. The engines conflicted
because they both kept trying to set the global TLS entry for the
client thread.
A global TLS object for the stop flag could restrict thread reuse and
thread pooling. Suppose in your example you wanted to start off two
searches on two independent data sets, each search using three
threads. You couldn't use a thread pool for the search threads because
of the global TLS stop flag.
I would lean towards having a job context object which is explicitly
passed around where it is needed over a global TLS object. Then again,
I haven't had enough experience, so I don't want to make any strong
global statements. I just thought I'd bring up this other point of
view.
Well, ya see, me a programmer on Wintel has no need for UNIX/pthreads. I
am not a computer scientist like you are.
> When working at the level of mutexes or
> semaphores (and not hardware memory barriers or lock free algorithms),
> you really need the "primitive" "release mutex and wait on a signal"
> aka pthread_cond_wait in order to do sane threading programming at the
> mutex or semaphore level. Implementing the equivalent of
> pthread_cond_wait with the windows XP and earlier API is surprisingly
> difficult, and generally results in much less efficient code than if
> they had properly implemented the kernel and scheduler to directly
> understand pthread_cond_wait or equivalent.
And you didn't get tenure why? (Because you didn't go to grad school? Go
figure! :P).
>
> My favorite papers on the subject are:
>
> http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
>
> http://www.google.com/url?sa=t&source=web&cd=1&ved=0CBIQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.125.3384%26rep%3Drep1%26type%3Dpdf&rct=j&q=implementing%20condition%20variables%20with%20semaphores&ei=P815TKOnJ4TksQO8zZTtCg&usg=AFQjCNEHC9EfI5eaHE5YGzcRpTV6etikpg&sig2=zc_MuXn5jBHA-9aavofdGw&cad=rja
Is there a 1-900 number for you that can help? Sorry, I didn't mean to be
insensitive. I was using "1-900" as a vehicle to "1-999" (talk to someone
who cares, for free). I wasn't dissing you. Not that I care.
In Anthony Williams' design, the TLS flag is read and written only by the owning
thread, i.e., the flag is not shared across threads. When the master wants to
change the value of the flag, it invokes a member function on the thread object,
and within that thread object the TLS flag is modified.
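For reference, here is a rough sketch of the general shape of such a
design. This is NOT Williams' actual code (see the book for that), and note
that in this simplified version the master still writes the flag through a
stored pointer, so the flag itself remains shared state:
_____________________________________________________________
#include <atomic>
#include <future>
#include <thread>

// thread-local interruption flag; each thread sees its own copy
thread_local std::atomic<bool> this_thread_interrupted(false);

class interruptible_thread
{
    std::thread m_thread;
    std::atomic<bool>* m_flag; // points at the worker's own TLS flag
public:
    template<typename F>
    explicit interruptible_thread(F f)
    {
        std::promise<std::atomic<bool>*> p;
        m_thread = std::thread([f, &p] {
            p.set_value(&this_thread_interrupted); // publish the TLS address
            f();
        });
        m_flag = p.get_future().get(); // blocks until the address is published
    }
    void interrupt() { m_flag->store(true); } // called by the master
    void join() { m_thread.join(); }
};

int main()
{
    interruptible_thread t([] {
        while (! this_thread_interrupted.load())
            ; // the worker's periodic "stop now" check
    });
    t.interrupt();
    t.join();
}
_____________________________________________________________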
Could you write up a simple example of multiple producers and multiple
consumers without conditions variables? Do you have any nontrivial
examples where windows event classes result in simpler and more
efficient code than the condition variable alternative? My claim is
that this is inordinately difficult compared to the much simpler
solution with condition variables.
> > When working at the level of mutexes or
> > semaphores (and not hardware memory barriers or lock free algorithms),
> > you really need the "primitive" "release mutex and wait on a signal"
> > aka pthread_cond_wait in order to do sane threading programming at the
> > mutex or semaphore level. Implementing the equivalent of
> > pthread_cond_wait with the windows XP and earlier API is surprisingly
> > difficult, and generally results in much less efficient code than if
> > they had properly implemented the kernel and scheduler to directly
> > understand pthread_cond_wait or equivalent.
>
> And you didn't get tenure why? (Because you didn't go to grad school? Go
> figure! :P).
>
> > My favorite papers on the subject are:
>
> >http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
>
> >http://www.google.com/url?sa=t&source=web&cd=1&ved=0CBIQFjAA&url=http...
>
> Is there a 1-900 number for you that can help? Sorry, I didn't mean to be
> insensitive. I was using "1-900" as a vehicle to "1-999" (talk to someone
> who cares, for free). I wasn't dissing you. Not that I care.
You seem to have some problem with education, as though it's not
applicable to the real world. If you are not a troll, please consider
talking about the issues, such as my claim that the windows threading
API pre windows vista is truly horrible and no one should use it
directly (except for those few people who wrap it in a sane and
correct way) instead of making ad hominem attacks.
Usually :-) If the future you got is from std::async and you didn't specify the
launch policy, the invocation of the "asynchronous" function may be deferred
until you call wait or get, in which case your "zero time" wait will have to
wait until the "asynchronous" function (which, in this case, is being invoked
synchronously) returns.
To avoid this possibility, specify a launch policy of std::launch::async when
you call std::async.
(BTW, I don't specify this stuff, I just try to figure out how to explain it.)
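A minimal illustration (slow_search is a hypothetical stand-in):
_____________________________________________________________
#include <future>

int slow_search() { return 42; } // hypothetical stand-in for a real search

int main()
{
    // default policy: the implementation may defer the call, in which
    // case it runs synchronously in the thread that does the wait/get
    std::future<int> f1 = std::async(slow_search);

    // launch::async: the call is truly asynchronous, so a zero-timeout
    // wait on f2 is a genuine poll and never runs the function inline
    std::future<int> f2 = std::async(std::launch::async, slow_search);

    return f1.get() == f2.get() ? 0 : 1;
}
_____________________________________________________________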
Give me a moment to look else-thread for the particulars, but I can't
see how.
If you want to do the standard cooperative thread cancellation, then
1- the master thread modifies some shared state, using some sort of
synchronization,
2- and the worker thread must periodically read that shared state,
using some sort of synchronization.
It sounds like Anthony Williams' design is:
1- Master thread calls a function on the worker thread's thread
object. This sets some flag X.
2- (??) Worker thread periodically checks flag X somehow. This is the
part which I'm not quite clear on. When the worker thread notices that
flag X is set, it sets its own thread-specific entry in the global
stop-flag TLS object.
3- Periodically, the worker thread will check its own thread-specific
entry in the global stop-flag TLS object. The worker thread will stop
processing when it notices that it is set.
Presumably this isn't what he's talking about. It doesn't buy you
anything in terms of performance.
Alternatively, perhaps he is using signals to accomplish this? When
the master thread calls the member function on the worker thread's
thread object, it could send a signal to that thread, and in the
signal handler the worker thread could set the thread specific entry
in the global TLS object. This is the only workable interpretation
which I can think of offhand which improves performance. (In which
case, the thread specific entries of the global TLS object must be
sig_atomic_t and modified only through volatile lvalues. Funnily
enough, I think we have a legitimate portable correct use of volatile
in a threading situation in place of synchronization.) At which point,
I would hesitate to use this approach due to the prevalence of
volatile bugs in modern compilers.
"I believe you". I'm sure I don't, and it doesn't make you them.
Your accusation written.
>This is sort of a pet peeve of mine, but please don't suggest to
>anyone to actually directly use the windows threading API if at all
>possible. It's borked, at least for windows XP and earlier, as it
>lacks condition variables.
Wow. By the same logic the pthread API is even more "broken", as it does
not have interlocked atomic ops, thread cancel, WaitForMultipleObjects,
just off the top of my head... For some reason I could solve all my MT
problems pretty well using Events and Interlocked*. I'd say it is more like
broken logic ;-)
>When working at the level of mutexes or
>semaphores (and not hardware memory barriers or lock free algorithms),
>you really need the "primitive" "release mutex and wait on a signal"
>aka pthread_cond_wait in order to do sane threading programming at the
>mutex or semaphore level.
I guess you needed them so much to work around the lack of interlocked
increment and interlocked compare-and-swap. Having those atomics removes
many of the cases where you need to lock a mutex.
> Implementing the equivalent of
>pthread_cond_wait with the windows XP and earlier API is surprisingly
>difficult, and generally results in much less efficient code than if
>they had properly implemented kernel and scheduler to directly
>understand pthread_cond_wait or equivalent.
Implementing the perfect equivalent, maybe. But covering the same
high-level use cases is not. Having an auto-locking mutex attached to the
signaling looks convenient, but more often than not it is not needed. And
the remaining cases can probably be covered better too, using
WaitForMultipleObjects. (Btw, if you just use the latter to wait on an
Event and a mutex with the ALL option, isn't the effect the same as what
you are after?)
>Give me a moment to look else-thread for the particulars, but I can't
>see how.
So at least I'm not alone with this. ;) I asked Anthony for an explanation
on ACCU, hopefully he says something.
>Could you write up a simple example of multiple producers and multiple
>consumers without conditions variables?
Yeah, that is probably the scenario where a condvar is good. But with a ton
of MT programs behind me, I have used such a queue exactly zero times. And
the actually common multi-producer, single-consumer queue is okay with
simple events.
Work with me a second. It's late, and I'm not the most familiar with
the Windows event APIs.
The important quality of condition variables is not "the mutex is
automatically acquired after returning from wait". Don't get me wrong,
it's nice, and probably the only sane way to present it to users.
However, the important part of condition variables is releasing the
lock and going on the wait queue, and doing that atomically.
You can use membars, CAS, lock free algorithms, and other "low level".
In that case, you don't need condition variables. However, there are
situations where you want a thread A to go to sleep until another
thread B has completed some work. In such a case, you need to work
with the OS scheduler to put thread A to sleep. Ideally, you also do
not want to poll in order to "wake up" thread A.
To do this without thread A polling (or some other thread polling),
you need thread A to check to see if thread B has finished its work,
and go to sleep, and to do these atomically with regards to the OS
scheduler.
I assume that this is a rather common situation, telling the OS
scheduler to put a thread to sleep until some other thread has
finished its work. What APIs do you generally use to interact with the
OS scheduler to do this?
Also, for the record, a large source of my hate is the following
snippet from MSDN
http://msdn.microsoft.com/en-us/library/ms684914%28VS.85%29.aspx
> Remarks
> A thread waiting on a synchronization object can be momentarily removed from the wait state by a kernel-mode APC, and then returned to the wait state after the APC is complete. If the call to PulseEvent occurs during the time when the thread has been removed from the wait state, the thread will not be released because PulseEvent releases only those threads that are waiting at the moment it is called. Therefore, PulseEvent is unreliable and should not be used by new applications. Instead, use condition variables.
Basically, the entire function is beyond broken. I like the
conclusion: "Instead, use condition variables".
Err, hit submit too soon. So, while PulseEvent is obviously broken,
you can write perfectly correct code with SetEvent and ResetEvent on
manual-reset events, and SetEvent on auto-reset events (I think??). I
admit that I'm heavily biased against them, though.
I'm going to shut up about this before I shoot myself in the foot
further.
>The important quality of condition variables is not "the mutex is
>automatically acquired after returning from wait". Don't get me wrong,
>it's nice, and probably the only sane way to present it to users.
>However, the important part of condition variables is releasing the
>lock and going on the wait queue, and doing that atomically.
I don't see any difference compared to Win32's Event if you drop the
attached mutex handling.
>However, there are
>situations where you want a thread A to go to sleep until another
>thread B has completed some work. In such a case, you need to work
>with the OS scheduler to put thread A to sleep. Ideally, you also do
>not want to poll in order to "wake up" thread A.
If the other thread finishes, you can WaitForSingleObject directly on the
thread handle. If it continues, you do the same on an Event that is set
(reset? I never can remember which ;-) by that thread.
>I assume that this is a rather common situation, telling the OS
>scheduler to put a thread to sleep until some other thread has
>finished its work. What APIs do you generally use to interact with the
>OS scheduler to do this?
The mentioned Event. The scenario you described is indeed quite
fundamental; did you think it is not covered on the world's most widespread
platform?
>Also, for the record, a large source of my hate is the following
>snippet from MSDN
>http://msdn.microsoft.com/en-us/library/ms684914%28VS.85%29.aspx
Somewhat strange that it marks PulseEvent as a Win2000 thing; it was
definitely present on NT 4.0, and possibly earlier. The condition variables
it suggests using made it in only at XP.
I'm not sure whether PulseEvent had this chance to lose wakeups back then.
In my scenarios only Set and Reset were used -- in a straightforward way at
different ends.
OTOH, it comes to mind that Pulse never played well with
WaitForMultipleObjects (for obvious reasons), which makes it kind of dead
beef outside very special scenarios that may not even exist ;-)
>Basically, the entire function is beyond broken. I like the
>conclusion: "Instead, use condition variables".
Okay, that makes it 'ballast' of the sort you can find in any API. What is
important in practice is whether there is a set of non-broken functions
that helps you cover your actual problems (with fair performance, safety,
complexity management, etc.).
I think I'm going to stick with shutting up before I continue putting
my foot into my mouth. Off day for me I guess.
> The condition variables it suggests using made it in only at XP.
Correction: *after* XP, first available on Vista.
Here is a simple multi-producer/consumer queue using a single manual reset
event for waiters:
<pseudo-code sketch>
_________________________________________________________
struct node
{
    node* m_next;
};

struct queue
{
    node* m_head; // = NULL;
    node* m_tail;
    HANDLE m_waitset; // manual-reset event; initially nonsignaled
    CRITICAL_SECTION m_lock;

    void push(node* n)
    {
        n->m_next = NULL;
        EnterCriticalSection(&m_lock);
        if (! m_head)
        {
            m_head = n;
            SetEvent(m_waitset); // queue went non-empty: wake consumers
        }
        else
        {
            m_tail->m_next = n;
        }
        m_tail = n;
        LeaveCriticalSection(&m_lock);
    }

    node* pop()
    {
        node* n;
        EnterCriticalSection(&m_lock);
        while (! (n = m_head))
        {
            LeaveCriticalSection(&m_lock);
            WaitForSingleObject(m_waitset, INFINITE);
            EnterCriticalSection(&m_lock);
        }
        if (! (m_head = n->m_next))
        {
            ResetEvent(m_waitset); // queue went empty again
        }
        LeaveCriticalSection(&m_lock);
        return n;
    }
};
_________________________________________________________
> Do you have any nontrivial
> examples where windows event classes result in simpler and more
> efficient code than the condition variable alternative?
An event can make some code simpler. Take the slow path of, say, a simple
mutex algorithm:
<event version - pseudo-code sketch>
_________________________________________________________
struct mutex
{
    LONG m_state;     // = 0; 0 = unlocked, 1 = locked, 2 = locked w/ waiters
    HANDLE m_waitset; // auto-reset event; initially nonsignaled

    void lock()
    {
        if (InterlockedExchange(&m_state, 1))
        {
            while (InterlockedExchange(&m_state, 2))
            {
                WaitForSingleObject(m_waitset, INFINITE);
            }
        }
    }

    void unlock()
    {
        if (InterlockedExchange(&m_state, 0) == 2)
        {
            SetEvent(m_waitset); // wake one waiter
        }
    }
};
_________________________________________________________
If Windows _only_ had condition variables then I would have to do something
like this:
<condvar version - pseudo-code>
_________________________________________________________
struct mutex
{
    LONG m_state; // = 0
    BOOL m_set;   // = FALSE
    COND_MUTEX m_mutex;
    COND_VAR m_cond;

    void lock()
    {
        if (InterlockedExchange(&m_state, 1))
        {
            while (InterlockedExchange(&m_state, 2))
            {
                m_mutex.lock();
                while (! m_set) m_cond.wait(m_mutex);
                m_set = false;
                m_mutex.unlock();
            }
        }
    }

    void unlock()
    {
        if (InterlockedExchange(&m_state, 0) == 2)
        {
            m_mutex.lock();
            m_set = true;
            m_mutex.unlock();
            m_cond.signal();
        }
    }
};
_________________________________________________________
I would say that is more complex than using a single kernel event.
> My claim is
> that this is inordinately difficult compared to the much simpler
> solution with condition variables.
[...]
> You can use membars, CAS, lock free algorithms, and other "low level".
> In that case, you don't need condition variables.
FWIW, condition variables can work very well with lock-free algorithms. A
condition variable allows for a straight forward implementation of an
eventcount. An eventcount makes it fairly easy to add efficient conditional
waiting logic to new and existing lock-free algorithms. IMVHO, a clever
marriage between lock-free and lock-based algorithms is a beautiful thing!
;^)
[...]
Here is some information on eventcounts:
http://portal.acm.org/citation.cfm?id=359060.359076
FWIW, here are some of the issues:
http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
Well, actually, it's not all _that_ hard to get a correct condvar impl on
Windows. Here is a sketch of a simple broadcast-only condvar for Windows.
Keep in mind that POSIX allows broadcast-only impls, as
`pthread_cond_signal()' is allowed to wake more than one thread.
<pseudo-code sketch>
_______________________________________________________
struct condvar
{
    struct waitset
    {
        HANDLE m_event; // = manual-reset event; initially nonsignaled
        LONG m_refs;    // = 0
    };

    waitset* m_waitset; // = NULL
    LONG m_refs;        // = 0
    CRITICAL_SECTION m_mutex;

    void wait(LPCRITICAL_SECTION umutex)
    {
        // grab a reference to the current waitset
        EnterCriticalSection(&m_mutex);
        waitset* w = m_waitset;
        if (! w) w = m_waitset = new waitset();
        ++m_refs;
        LeaveCriticalSection(&m_mutex);

        // unlock the user mutex and wait on the waitset we obtained
        LeaveCriticalSection(umutex);
        WaitForSingleObject(w->m_event, INFINITE);

        // decrement the waitset's refcount
        if (! InterlockedDecrement(&w->m_refs))
        {
            delete w;
        }
        EnterCriticalSection(umutex);
        // that's all folks! ;^)
    }

    void broadcast()
    {
        // swap a reference to the current/new waitset with NULL
        EnterCriticalSection(&m_mutex);
        waitset* w = m_waitset;
        LONG refs = m_refs;
        m_waitset = NULL;
        m_refs = 0;
        LeaveCriticalSection(&m_mutex);
        if (w)
        {
            // signal all waiters
            SetEvent(w->m_event);
            // transfer the swapped reference count to the waitset refcount
            if (InterlockedExchangeAdd(&w->m_refs, refs) == -refs)
            {
                delete w;
            }
        }
    }

    void signal()
    {
        broadcast();
    }
};
_______________________________________________________
Barring any damn typos, the algorithm works perfectly fine. BTW, there are
several enhancements that can be made to it... Perhaps a waitset cache?
;^)
You'll have to help me out here please.
----
I assume the event is a manual reset event from CreateEvent?
----
Second, don't we have a race condition in wait? Luckily, I see it with
plain instruction interleaving (as opposed to something more complex, like
two writes becoming visible to different cores in different orders).
thread X enters wait,
thread X creates a waitset,
thread X assigns it to m_waitset,
thread X calls WaitForSingleObject(w->m_event, INFINITE),
thread X returns from it,
thread X uses an atomic decrement and sees 0.
thread Y enters wait,
thread Y assigns m_waitset to the local w,
thread Y runs if (! w) and sees w is non-zero,
thread X calls "delete w;"
thread Y then calls WaitForSingleObject(w->m_event, INFINITE), which
waits on an object which was just deleted.
We could move "EnterCriticalSection(umutex);" up above the code:
> // decrement the waitset's refcount
> if (! InterlockedDecrement(&w->m_refs))
> {
> delete w;
> }
That might remove that race, I think. Alternatively, we can use
m_mutex to guard the decrement and delete if 0 part, and I think that
also removes the race condition.
----
What is the type "waitset"? I can't really understand any more or
comment any more as I don't really know what that is.
Nevermind.