Anyone know of a good example?
I can find examples that don't work but I would like to study an
example that does work.
Steve
--
Neural Planner Software Ltd www.NPSL1.com
EasyNN-plus. Neural Networks plus. www.easynn.com
SwingNN. Forecast with Neural Networks. www.swingnn.com
JustNN. Just Neural Networks. www.justnn.com
OK, I just wrote this example, but you can tell me if you think it won't work:
CEvent evtThread1, evtThread2, evtThread3;
// assume these events have been created and have some meaning somewhere
CSyncObject * apEvents[] = { &evtThread1, &evtThread2, &evtThread3 };
CMultiLock lockmany( apEvents, _countof(apEvents), FALSE );
DWORD dwResult = lockmany.Lock( 5000, FALSE );
switch ( dwResult )
   {
    case WAIT_OBJECT_0:
    case WAIT_OBJECT_0+1:
    case WAIT_OBJECT_0+2:
       {
        DWORD dwEvt = dwResult - WAIT_OBJECT_0;
        // do something in response to the event apEvents[dwEvt];
       }
       break;
    case WAIT_TIMEOUT:
       // Do something if the timeout expired
       break;
    case WAIT_ABANDONED_0:
    case WAIT_ABANDONED_0+1:
    case WAIT_ABANDONED_0+2:
       // Do something if a wait was abandoned; note WFMO reports abandonment
       // as WAIT_ABANDONED_0 + n, not plain WAIT_ABANDONED, and only mutexes
       // (not events) can actually be abandoned
       break;
   }
>OK, I just wrote this example, but you can tell me if you think it won't work:
> CEvent evtThread1, evtThread2, evtThread3;
> // assume these events have been created and have some meaning somewhere
Thanks, that's a useful example. I am currently using a CSingleLock
with CMutex to prevent problems with linked lists within the thread. The
trouble is that there is a performance problem in locking the whole
thread. There are multiple linked lists modified within the thread. I
should be able to use CMultiLock to lock a different CMutex for each
list.
>Thanks, that's a useful example. I am currently using a CSingleLock
>with CMutex to prevent problems with link lists within the thread. The
>trouble is that there is a performance problem locking the whole
>thread. There are multiple link lists modified within the thread. I
>should be able to use CMultiLock to lock a different CMutex for each
>list.
A couple of things. For synchronization within a single process, using the
Windows lightweight mutex, i.e. CRITICAL_SECTION, is preferable, but even
though using MFC's CCriticalSection with CMultiLock compiles fine, it blows
up at run-time. This is because a CRITICAL_SECTION is not referenced through a
HANDLE, which is required by the WaitForMultipleObjects function that
underlies CMultiLock, yet CCriticalSection derives from CSyncObject, just
like CMutex. That's one of the many design flaws in the MFC sync classes.
Another is that CSingleLock by default does not acquire the lock! You
always have to remember to specify TRUE in its ctor to get the default
behavior of every other lock class that's ever been written. Reading the
documentation for CMultiLock's ctor, I have no idea what it does WRT WFMO's
wait-all and timeout parameters. This is just scratching the surface; MFC's
sync classes are really bad. I believe some people say ATL has better
classes for this.
Getting back to the CRITICAL_SECTION usage and the impossibility of using
it with WFMO, I have to wonder if you can't approach your problem in a
different way. If you really do need to protect access to multiple lists at
the same time, you can lock them individually in series. Just be sure to
acquire them in the same order, and release them in reverse order, in all
your threads.
This is necessary to avoid deadlock. This avoids WFMO and will allow you to
use the more efficient CRITICAL_SECTION. Of course, if you're doing this
every time you access the lists, what you really need is a single mutex to
guard them all. Whatever the case, the access protocol has to be consistent
and make sense from a bird's eye view of all the threads that use the
lists.
--
Doug Harrison
Visual C++ MVP
>Getting back to the CRITICAL_SECTION usage and the impossibility of using
>it with WFMO, I have to wonder if you can't approach your problem in a
>different way. If you really do need to protect access to multiple lists at
>the same time, you can lock them individually in series. Just be sure to do
>it the same order and release them in reverse order in all your threads.
>This is necessary to avoid deadlock. This avoids WFMO and will allow you to
>use the more efficient CRITICAL_SECTION. Of course, if you're doing this
>every time you access the lists, what you really need is a single mutex to
>guard them all. Whatever the case, the access protocol has to be consistent
>and make sense from a bird's eye view of all the threads that use the
>lists.
The threads are multiple instances of the same code. I'm trying to get
an application to run a background calculation on two and four
processor systems. The data is referenced by multiple lists. It works
fine on a single processor but needs to exhibit a performance
improvement with multiple processors. Therefore I need to start one
thread instance on each processor reading and writing the same data
referenced by numerous CTypedPtrList. I need to prevent concurrent
access to the data. If I use a single mutex only one thread runs at a
time so only one processor is doing the calculation.
This has to be simple but so far it has defeated me.
I just modified a library that had dozens of locking points, many of which were completely
unnecessary anyway (reading a scalar variable that was DWORD aligned, for example), and
most of which were completely gratuitous because the chances of hitting the data from two
threads simultaneously in a real program was vanishingly small (guaranteed zero in almost
all cases, since the only locked data structure is only used under some very obscure
conditions that applications would never encounter). So I just rewrote it, removed all
locks, and added a way that the main GUI thread could ask the background thread to make a
change or read the data in a guaranteed-FIFO way. All locking requirements vanished. It
made one utility program slightly more complex, but every other program ran much
faster, and the entire library became much simpler.
I have found very, very few instances in which concurrent access is *actually* required
for performance reasons. It is usually possible to re-engineer the code to eliminate all
need to lock. The problem is that most people think first of locking, because it appears
superficially to be a "simpler" solution. It usually isn't.
joe
On Sun, 16 Aug 2009 15:37:53 -0500, "Doug Harrison [MVP]" <d...@mvps.org> wrote:
Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
CTypedPtrList (or any list structure) can also be a factor; if the lists are very large,
you can even add paging overhead to the whole mess, moving the downward-curve-point even
further towards the y-axis (it shows up with fewer processors). In general, taking a
naive algorithm and simply running it concurrently on multiple processors, with tons of
synchronization between the threads, is not going to give you a real performance
improvement. You have to look at the nature of the algorithm and data, and redesign the
algorithm to maximize concurrency. The simple approach only works when the threads can
work on independent data.
Your discovery that locking doesn't work terribly well in these cases illustrates the
point. Note that you *cannot* use N lists where N is the number of processors, because
with this approach you cannot guarantee canonical locking order (which is what Doug was
talking about) so you will end up with deadlock.
The key here is that scaling requires insight and sophistication. Tossing locks at the
problem won't result in a satisfactory solution, and will probably generate an unworkable
solution, or one with pitiful performance. You have to decompose the algorithm into
pieces which can be worked on independently, and ideally not require any locking at all.
If there is locking, it is done only at the point where work is parceled out to the
threads, and not on a per-iteration basis.
Sometimes redesign is the only solution. Without knowing more about your data and the
nature of the algorithms, it is hard to offer specific advice.
joe
On Mon, 17 Aug 2009 16:35:28 +0100, Stephen Wolstenholme <st...@tropheus.demon.co.uk>
wrote:
>On Sun, 16 Aug 2009 15:37:53 -0500, "Doug Harrison [MVP]"
>The threads are multiple instances of the same code. I'm trying to get
>an application to run a background calculation on two and four
>processor systems. The data is referenced by multiple lists. It works
>fine on a single processor but needs to exhibit a performance
>improvement with multiple processors. Therefore I need to start one
>thread instance on each processor reading and writing the same data
>referenced by numerous CTypedPtrList. I need to prevent concurrent
>access to the data. If I use a single mutex only one thread runs at a
>time so only one processor is doing the calculation.
>
>This has to be simple but so far it has defeated me.
The threads need to work on private data as much as possible and update
shared data as infrequently as possible. That may pose a challenge. :)
In one famous example in the history of multithreading, a colleague of mine, Pete
Oleinick, observed that in a multiprocessor partial-differential-equation evaluator, you
could get orders of magnitude performance improvement if you just ignored locking
completely. So at the boundaries of the rectangles, you needed to compute the "adjacent
points" possibly using obsolete data from another rectangle. His insight: so what? It is
a *converging* algorithm! The errors grow smaller and smaller, even if the data is
"obsolete" at the time it is fetched (the hardware interlocked the fetching). Result:
huge factors of performance improvement with no locking at all!
I do want to point out that "no locking" is not always the same as "lock free". Lock free
algorithms use hardware locking to do atomic computations, or rely on clever tricks such
as OpenMP uses, which also rely on hardware locking. "No locking" algorithms simply work
correctly in the absence of any locking. In the case of the OP, I think it requires some
careful thought and data refactoring to get it to work right without locking; that's the
usual case.
joe
( warning: no direct attempt at helping OP here :-) )
In all honesty, this is not at all a good group for this kind of
question, and the problem is probably not simple, either. Of course, people like
a challenge so they chime in, but you need either a careful thought on
what you can run in parallel, or (more likely) a prior art on the
problem. (In the "scientific" part of the computing field, there is
research and results WRT parallel processing, and that is clearly
gaining interest with the recent (sic!) attack of the multi-core. I
would be surprised that there's nothing in the field of neural
networks WRT parallel processing - seek and ye might find!)
Or else, if you can trim down the problem to a short explanation,
perhaps us MFC philistines could help you.
Goran.
Not only had I not signed an NDA, so I had *no idea* what their product was (we taught
off-premises at a hotel, and they couldn't say what they were working on, which didn't
stop them from complaining that I hadn't solved their problem), but parallel decomposition is also
a deep, complex and arcane art form. It is not something one accomplishes by tossing a
few mutexes at the problem and hoping it now works in parallel.
joe
>( warning: no direct attempt at helping OP here :-) )
>
>In all honesty, this is not at all a good group for this kind of
>questions, and is probably not simple, either. Of course, people like
>a challenge so they chime in, but you need either a careful thought on
>what you can run in parallel, or (more likely) a prior art on the
>problem. (In the "scientific" part of the computing field, there is
>research and results WRT parallel processing, and that is clearly
>gaining interest with the recent (sic!) attack of the multi-core. I
>would be surprised that there's nothing in the field of neural
>networks WRT parallel processing - seek and ye might find!)
>
>Or else, if you can trim down the problem to a short explanation,
>perhaps us MFC philistines could help you.
>
>Goran.
This group is suitable because I am using MFC classes.
I should point out that I have many years of experience working on
multi-processor systems. My difficulties now are because I was spoilt
by those systems as they did all the concurrency and data access
checking by hardware.
Wait, what!?
Am I right in concluding that you have data (a CList<> or some such
object, but that's irrelevant) that you read/modify from multiple
threads at once, and you just tried to see if that would work? And you
wonder why it worked in debug, but not in release!?
If so, I really hope Joe N. pulls one of his rants on you.
You just can't do that, and it's quite bad that you seem not to
understand why (I'm assuming that if you did understand, you
wouldn't even have tried).
You must prevent any one piece of data from being modified by multiple
threads at once. You use a mutex for that (in MFC parlance, that's
CCriticalSection or perhaps CMutex).
You must also prevent one piece of data from being modified in one thread
while it is being read in another. You use a thing called a read-write
mutex for that (see e.g.
http://www.boost.org/doc/libs/1_34_1/doc/html/thread/concepts.html#thread.concepts.ReadWriteMutex).
There is no MFC equivalent; I saw some MFC shots at it on the net
that looked dubious, but it's been a long time.
Those are the two basic things you must not allow. Dare I say: absolutely all
the theory and practice stems from there. Observe those rules and you'll be
fine. Assume that if nothing is done, assaulting data with more than
one thread at a time will break said data. There are only very
specific situations where it would not happen (e.g. Joe said something
about reading a scalar variable that was DWORD aligned - that AFAIK is
true on i386, but it is not a general truth).
You can of course read data from as many threads as you want
simultaneously. That's easier said than done IMO (read-write mutex
helps).
Also... For very small amounts of data, you have APIs that are in fact
wrappers for CPU test-and-set instructions (see http://en.wikipedia.org/wiki/Test-and-set
and "Interlocked Variable Access" in MSDN). These are sometimes
helpful (doesn't seem to be your case, but you have provided little
specifics).
>The threads are multiple instances of the same code. I'm trying to get
>an application to run a background calculation on two and four
>processor systems. The data is referenced by multiple lists. It works
>fine on a single processor but needs to exhibit a performance
>improvement with multiple processors.
That's of course the hard bit. You must somehow make your threads not
work on the same data at the same time (otherwise they will compete
for access and speed will go poof to the point of being considerably
slower than a single thread solution). Without knowing any specifics,
I can offer only the most general advice: isolate "calculation" and
"store" parts of the code. Do your calculation in any given thread in
isolation, using a copy of the data, to avoid contention (copying goes
a long way in parallel work - it's a classic speed vs. memory usage
trade-off right there). Store results when done. If you can make your
threads work on separate data well, you're done! BTW, are you doing
neural network "training"? If so, I would guess this split should be
possible because your network structure should be viewable as isolated
parts. (That's what I seem to remember from the days of yore. Don't
take my word on that, though.)
HTH,
Goran.
Seriously, though, locking is *mandatory* if two or more threads are accessing non-scalar
data in ways that involve one or more of the threads doing potential non-monotonic
non-robust concurrent modification of the data.
By "removing all locking", I implied that you create situations in which there is no
concurrent modification of data. Deep and fundamental to the architecture of your system,
there is no concurrent modification of the data. One technique is to have one thread "set
up" the structure so it is "ready-to-go" and then fire off all the computational threads,
which can use the data but never modify it. So if you have a set of linked list objects
that are directing the computation, it is essential that under no circumstances will those
lists ever be modified while the threads are running. Furthermore, if there is any data
referenced by the lists, that data must not be modified except in a synchronized fashion
during the computation. Note that I said "in a synchronized fashion", not "under control
of a lock", because in some cases (e.g., the partial differential equation solver I
mentioned earlier) the hardware synchronized access to the memory (so you couldn't get a
half-float value fetched while the other half-float was still being stored). But critical
to this design was the realization that hardware synchronization existed to protect the
atomic values, and the logic of the program did not require that the value be The Very
Latest Value; the algorithm would work correctly if it operated on obsolete data, it just
might take an extra iteration to converge to the same value. This worked because the
modifications were monotonic and robust. By "robust" I mean that if a thread did a
computation with obsolete data, it *still* got a correct answer. This means that
operations like a++ or a--, if a is a shared variable, *must* have locking, because the
computation is non-robust (hint: use InterlockedIncrement/InterlockedDecrement and let the
hardware handle it). But setting a variable to FALSE in one thread, where no other thread
can ever set it back to TRUE, is monotonic, and in many contexts robust (e.g., thread
shutdown by setting a "threads should continue to execute" BOOL to FALSE).
If you just take out all the locks and you have complex data (more than one value
interacting, say, for example, the pointers in a linked list) that is modified, then your
code is again incorrect. This is because linked list manipulation of a 2-way linked list
(as is CList) is non-monotonic and non-robust. Or, more simply put, it's wrong to do this
without a lock.
But note that you only have to apply the lock when you CHANGE the list. You don't lock
the entire list at the start of the computation and hold it for the entire computation
(that just serializes all the computations, as you already observed). And you cannot lock
more than one list at a time, unless you guarantee that the lists are *always* locked in
the same order, which must be monotonic (this is the deadlock avoidance issue Doug was
talking about). If you have four lists, call them lists 1, 2, 3 and 4, the valid locking
orders are 1; 2; 3; 4; (that is, never more than one list at a time!) or 1-2; 1-2-3;
1-2-3-4; 1-3; 1-3-4; 1-4; 2-3; 2-3-4; 2-4; 3-4; AND THAT'S ALL! Any locking sequence that
locks a list such that the number of the list is not strictly greater than all previous
lists locked along that execution path means you will deadlock and your threads will hang.
To make the "no lock" code work correctly, you need certain kinds of atomic operations;
for example, I often use an I/O Completion Port as an interthread queue, and sometimes a
PostMessage queue as an interthread queue. Note these actually do use locking internally
so they work correctly, *but I don't have to reason about those locks being correct*. They
are also "leaf locks" meaning that you cannot get deadlock by using them; they are the
leaves of any complex locking tree. They do not themselves do any secondary locking so
they can't deadlock with other threads using those same internal locks, so a whole class
of lock reasoning disappears from your concern.
So you have to do some evaluations. You have not said what the lists represent, for
example. If the lists represent "work lists" of things to do, you can replace them
completely by an I/O Completion Port and a worker thread, putting things in with
PostQueuedCompletionStatus and removing them with GetQueuedCompletionStatus. Or, if there
are a very small number of these, it might be simpler to create a secondary UI thread,
with a Message Map to dispatch the requests, and use PostThreadMessage.
But given you have told us practically nothing about the nature of the data structures, it
is hard to make suggestions for a specific implementation strategy.
Hints:
(a) you can't take existing single threaded code and use it without locks
(b) you can't sprinkle mutexes/CRITICAL_SECTIONs around like pixie dust
and expect your code to fly
Strategy:
Wait a minute, we've all been saying this in nearly every reply: you have to
REDESIGN your code to work in a multithreaded environment.
What surprised me was this comment:
>I should point out that I have many years of experience working on
>multi-processor systems. My difficulties now are because I was spoilt
>by those systems as they did all the concurrency and data access
>checking by hardware.
I have been using multiprocessors since 1975. At no time in the entire history of
multiprocessors am I aware of *any* architecture that would do "concurrency and data
access checking by hardware" in a way that would allow any arbitrary piece of code to
survive without explicit locking. They did not lock arrays, or linked lists, or queues,
or anything else I am aware of. Now some would do streaming data instructions on vectors
and/or arrays, but that code was carefully crafted so no other processor would do
streaming data instructions on the *same* data. Compilers like FORTRAN could easily
decompose problems into disjoint concurrent streams by doing the partitioning we've been
telling you about; it worked because the computations followed extremely rigid patterns
which the compiler could recognize.
Hardware locking can protect at most one scalar value from concurrent access. Sometimes
that "scalar" is wide, e.g., LOCK CMPXCHG8B (32-bit) or LOCK CMPXCHG16B (64-bit), which can
manipulate wide values atomically (for example, the management of queued spin locks in the
kernel involves using one of these instructions to clear the lock and unlink the queue
element in a single hardware-atomic operation), but it is still a "scalar" value from the
viewpoint of the instruction involved.
I could suggest looking at OpenMP as one of the options, but it only works correctly in
very limited domains, usually dealing with mathematical computations. And what it often
does is handle the fork/join logic that is tricky in C/MFC. But without any understanding
of the basic nature of your data, it is hard to tell if OpenMP is of positive or negative
value in solving your problem.
P.S., there is new support in Vista for multiple-reader-single-writer locking (the slim
reader/writer, or SRW, lock). But it uses spin locks for the exclusion, and doesn't
guarantee FIFO ordering or have antistarvation mechanisms.
joe
>I have been using multiprocessors since 1975. At no time in the entire history of
>multiprocessors am I aware of *any* architecture that would do "concurrency and data
>access checking by hardware" in a way that would allow any arbitrary piece of code to
>survive without explicit locking. They did not lock arrays, or linked lists, or queues,
>or anything else I am aware of.
I'll look at the rest of your comments and suggestions later but for
now a bit of history:-
The hardware that I worked on was originally developed by ICL in the
1970's. Its latest incarnation is the Fujitsu Trimetra range. There
are multi-node versions and each node can have multiple processors.
The last system I worked on had 16 processors. The hardware and low
levels of the operating system do all the access checking using data
descriptors. The descriptors control read, write and execute access to
any data. You may be able to find some more information on the web.
http://en.wikipedia.org/wiki/ICL_VME has a brief description of the
OS. http://en.wikipedia.org/wiki/ICL_Series_39 and
http://en.wikipedia.org/wiki/ICL_2900_Series for brief descriptions of
the hardware.
I was not aware that there was any contemporary architecture that handled synchronization
at this level.
Note that this only works when all the data can be described by a single descriptor and
there is no case of multiple pointers; for example, if you have CPU 0 and CPU 1, and CPU 0
tries to access data via Descriptor A, then CPU 1 can be locked out because Descriptor A
is busy. But if Descriptor A has a reference to storage managed by Descriptor B, then you
can't do it in hardware any longer; A cannot be held locked while B is accessed or you can
get deadlock (if the list is circular, for example, and another CPU has accessed B while A
is locked but before CPU 0 locks B, and the second CPU has to access Descriptor A, it's
deadlock time). So it doesn't generalize to a list: either the entire list has to
be contained in a single descriptor, which means each list manipulation has to be
atomic within a single call that locks the descriptor, does the manipulation, and unlocks
the descriptor; or the list is represented by a structure that spans multiple descriptors,
in which case canonical locking order has to be maintained by the users of the data,
because otherwise the OS would deadlock.
Whether the OS provides fine-grain or coarse-grain locking doesn't change the fact that
multilocking can't work automatically. On the x86, the hardware provides fine-grain
(single instruction, single data reference) synchronization. There is no coarse-grain
locking except by explicit locks. But your approach of locking an entire list for the
duration of the computation isn't consistent with what your original architecture could
have done.
I find it best to build something like
template <class T> class CLockedList : public CList<T> {
    public:
       void Add(T p) { EnterCriticalSection(&lock);
                       list.Add(p);
                       LeaveCriticalSection(&lock); }
       void Remove(T p) { EnterCriticalSection(&lock);
                          list.Remove(p);
                          LeaveCriticalSection(&lock); }
       CLockedList<T>() { InitializeCriticalSection(&lock); }
       virtual ~CLockedList<T>() { DeleteCriticalSection(&lock); }
    protected:
       CList<T> list;
       CRITICAL_SECTION lock;
};
Note: I do not use template classes very much, and I didn't bother to look at my existing
code, so this is suggestive syntax rather than definitive.
I worked on a 16-processor multiprocessor in 1975, and several other multiprocessors since
then (the x86 being merely the latest) and all were classical memory architectures.
Therefore, we always had to do all the locking, but each object was represented by a
"descriptor" that had a lock in it.
You can simulate the notion of "descriptor" by using a C++ class to handle the locking,
but note that no matter how you do the locking, it is impossible to have a low-level
system that provides perfectly correct behavior for more than a single object at a time.
Multi-object synchronization is always the responsibility of the application. Algorithm
decomposition is also the responsibility of the application.
joe
On Wed, 19 Aug 2009 16:58:13 +0100, Stephen Wolstenholme <st...@tropheus.demon.co.uk>
wrote:
>On Wed, 19 Aug 2009 10:44:24 -0400, Joseph M. Newcomer
>I find it best to build something like
>
>template <class T> class CLockedList : public CList<T> {
> public:
> void Add(T p) { EnterCriticalSection(&lock);
> list.Add(p);
> LeaveCriticalSection(&lock); }
> void Remove(T p) { EnterCriticalSection(&lock);
> list.Remove(p);
> LeaveCriticalSection(&lock); }
> CLockedList<T>() { InitializeCriticalSection(&lock); }
> virtual ~CLockedList<T>() { DeleteCriticalSection(&lock); }
> protected:
> CList<T> list;
> CRITICAL_SECTION lock;
> };
The problem with that is always, "How do you lock around a series of
operations that must be atomic?" You can either expose the locking
mechanism or give up on making the class "inherently" thread-safe and
require users to impose locking entirely from the outside. STL containers
choose the latter. Also, I wouldn't derive publicly from CList, because
doing so allows users to manipulate the list through base class functions
which don't respect the locking. Oh wait, you also have "list" as a member
variable. That or (much less likely) private inheritance is the way I'd go.
(Obligatory disclaimer: YMMV)
Third option: apply the interface segregation principle
(http://www.objectmentor.com/resources/articles/isp.pdf). That is, expose a
specific interface for said series of operations (and also expose other
interfaces for other use-cases).
e.g.
struct WholeBunchAtOnce { virtual void DoIt(params) = 0; };
struct BasicOps { virtual void Op1() = 0; /*etc*/ };
class ThreadSafe :
  public WholeBunchAtOnce,
  public /*protected?*/ BasicOps
{ /*implement all here, ye dawg!*/ };
then
void UseCase1(WholeBunchAtOnce& me, params) { ... me.DoIt(params) ... }
void UseCase2(BasicOps& me, params) { ... me.Op1(); ... }
then
ThreadSafe Me;
UseCase1(Me, params);
and (in other code, hopefully far, far away)
UseCase2(me, otherParams);
Fourth option: for "grouped" operations, force clients to use a
"locked" proxy object, e.g.
class LockedRefToX
{
public:
  LockedRefToX(X& x) : m_x(x) { LockX(); }
  ~LockedRefToX() { UnlockX(); }
  X* operator->() { return &m_x; }
private:
  X& m_x;
};
(This requires e.g. that LockedRefToX is a friend of X for Lock/Unlock.)
Goran.
Key here is that all such classes that intrinsically lock must be "leaves" of the locking
hierarchy, and cannot themselves call classes that set locks, because otherwise you can
get inadvertent deadlock. This raises many interesting questions about architecture and
abstraction; in the case of locking, the "details" of the abstraction must be exposed, if
only through documentation.
joe