Hi everyone, I'm a computer science student with a little experience (exercises) in programming PC clusters with MPI. I'd like some references on using the MPI library with multicore processors.
In article <fipd2v$nd...@aioe.org>, giuseppe <giuse...@bobo.it> wrote: > I'd like to have > some reference about using the mpi library with multicore processors.
It's basically the same as using MPI on multi-processor machines, which have been around for a long time.
If you are running a distributed/parallel program on a single physical machine, there are more efficient ways to "pass information" around: shared memory, for example (take a look at OpenMP... or implement your own with pthreads). MPI's real advantage is for multi-machine communication/coordination.
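To make the contrast between the two programming models concrete, here is a minimal sketch. It is not MPI, OpenMP or pthreads: it uses Python's standard `threading` and `queue` modules as stand-ins, and the function names `sum_shared` and `sum_messages` are made up for illustration. Python threads will not show the real performance difference; the point is only how the code is organised in each style.

```python
import queue
import threading


def sum_shared(chunks):
    """Shared-memory style: workers update one value guarded by a lock."""
    total = [0]                      # shared state visible to every thread
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)         # local work, nothing shared yet
        with lock:                   # the shared value must be protected
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def sum_messages(chunks):
    """Message-passing style: workers send partial sums, no shared data."""
    inbox = queue.Queue()            # stand-in for the communication channel

    def worker(chunk):
        inbox.put(sum(chunk))        # "send" the result to the collector

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    result = sum(inbox.get() for _ in chunks)   # one "receive" per worker
    for t in threads:
        t.join()
    return result


# Four disjoint chunks that together cover 0..99.
chunks = [list(range(i, 100, 4)) for i in range(4)]
print(sum_shared(chunks), sum_messages(chunks))
```

In the first version the programmer has to protect the shared value; in the second the data only moves through explicit sends and receives, which is the part that costs extra inside a single machine.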
If you just want to play around with MPI (which is what I'm doing)... doesn't really matter what you run it on. Single-core, multi-core... multi-machine. It'll "run", but obviously different setups are going to have different advantages over one another.
> If running a distributed/parallel program on a single physical > machine. There are more efficient ways to "pass information" around. > Shared memory for example (take a look at OpenMP... or implement your > own with pthreads). MPI's real advantage is for multi-machine > communication/coordination.
Hmm, you are right! It may be better to use an efficient and stable library like OpenMP than my own implementation (or not?)
> If you just want to play around with MPI
I want to play around with something that lets me program real-life machines such as multicore processors effectively (not the HPC systems at NASA).
> doesn't really matter what you run it on. Single-core, multi-core... > multi-machine. It'll "run", but obviously different setups are going > to have different advantages over one another.
That was my doubt; I hadn't realized that MPI is optimized for network communication even though it can work on a single multi-core machine.
giuseppe <giuse...@bobo.it> writes: >> doesn't really matter what you run it on. Single-core, multi-core... >> multi-machine. It'll "run", but obviously different setups are going >> to have different advantages over one another.
> that was my doubt, i haven't thought that mpi is optimized for network > comunications even if it can work on a single multi-core machine.
MPI - as the expanded acronym says - is based on a messaging paradigm, which incurs a loss of efficiency inside an SMP node: you not only have to transfer the data, you also pay the messaging overhead (setting up and tearing down the connections between processes and so on).
Good MPI implementations use something like shared-memory IPC inside an SMP node, but if you're concerned about the last bit of performance, a thread-based programming model like OpenMP would be better suited.
In article <firl1c$tf...@news.lrz-muenchen.de>, Sebastian Hanigk <han...@in.tum.de> wrote:
>MPI - as the expanded acronym says - is based on a messaging paradigm >which incurs loss of efficiency inside a SMP node because you would not >only have to transfer data, but you will also have the messaging >overhead (setting up and tearing down the connections between processes >and so on).
And with MPI, you get the increase of efficiency of never having false sharing and other locality problems.
Which is why it's frequently the case that codes with both OpenMP and MPI implementations run faster in pure MPI mode on big SMPs.
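For readers who have not met false sharing: it happens when two threads write to different variables that happen to sit on the same cache line, so the line keeps bouncing between cores even though no data is logically shared. The sketch below only does the address arithmetic; it does not measure anything. The 64-byte line size is an assumption (common, but check your hardware), and the helper names are made up for illustration.

```python
CACHE_LINE_BYTES = 64   # assumption: a common line size, not universal


def cache_line(offset_bytes, line_bytes=CACHE_LINE_BYTES):
    """Index of the cache line that a byte offset falls into."""
    return offset_bytes // line_bytes


def shares_line(offset_a, offset_b, line_bytes=CACHE_LINE_BYTES):
    """True if two offsets land on the same cache line."""
    return cache_line(offset_a, line_bytes) == cache_line(offset_b, line_bytes)


def padded_offsets(n_counters, item_bytes=8, line_bytes=CACHE_LINE_BYTES):
    """Offsets that give every counter a cache line of its own."""
    lines_per_item = -(-item_bytes // line_bytes)   # ceiling division
    stride = lines_per_item * line_bytes
    return [i * stride for i in range(n_counters)]


# Four 8-byte per-thread counters packed back to back: all on line 0.
packed = [i * 8 for i in range(4)]
# The same counters padded out to one cache line each.
padded = padded_offsets(4)

print(shares_line(packed[0], packed[3]))   # packed counters collide
print(shares_line(padded[0], padded[1]))   # padded counters do not
```

With one process per core and private address spaces, as in pure MPI, per-process data cannot end up interleaved on a line like the packed counters here, which is the locality advantage the post refers to.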
lind...@pbm.com (Greg Lindahl) writes: > And with MPI, you get the increase of efficiency of never having false > sharing and other locality problems.
> Which is why it's frequently the case that codes with both OpenMP and > MPI implementations run faster in pure MPI mode on big SMPs.
It's good that you mention the threading problems that can occur.
One of the major drawbacks of MPI on SMP machines is, in my opinion, the synchronisation required for communication; one-sided communication directives (which MPI supports only half-heartedly) are a really nice way of achieving loose coupling, especially if your hardware supports them natively.
In article <fj5tr1$us...@news.lrz-muenchen.de>, Sebastian Hanigk <han...@in.tum.de> wrote:
>One of the major drawbacks of MPI on SMP machines is in my opinion the >necessary synchronisation for communication; one-sided communication >directives (which MPI supports only half-hearted) are a really nice way >of loose coupling, especially if your hardware supports it natively.
Yes, although many programmers are displeased to discover that they often need just as much synchronization with one-sided communications. So they end up sprinkling their code with barriers, and sometimes have to resort to double-buffering.
lind...@pbm.com (Greg Lindahl) writes: > Yes, although many programmers are unpleased to discover that they > often need just as much synchronization with one-sided > communications. So they end up sprinkling their code with barriers, > and sometimes have to resort to double-buffering.
I have had good experiences with one-sided communication in cases where the data layout is unpredictable (in my case, plugging newly developed algorithms into an existing legacy codebase). The buffering issues can sometimes (often?) be used for non-blocking communication, which is especially useful if your interconnect supports some kind of RDMA operations.
Regarding the synchronisation subroutine calls: I surmise that MPI codes usually employ send-receives in which many if not all processes take part, which means an implicit synchronisation step at the end of every communication epoch - whether it's needed or not; at least in theory one could use less synchronisation, albeit explicit, by employing RDMA communication. I'm currently using a BlueGene for some tests, and the low-level messaging layer gives you the opportunity to specify callbacks for the sender and receiver of those messages, so you could for example simply notify the target whenever you put something into its memory.
In article <fj7ck9$tc...@news.lrz-muenchen.de>, Sebastian Hanigk <han...@in.tum.de> wrote:
>Regarding the synchronisation subroutine calls: I surmise that MPI codes >usually employing send-receives where many if not all processes take >part which means an implicit synchronisation step at the end of every >communication epoch - if it's needed or not; at least in theory one >could use less synchronisation,
Many of the MPI codes I've looked at have the minimum of synchronization.
BTW, you may not want to use "RDMA" the way you're using it, it's been hijacked by one community and redefined to be more and less than actual remote direct memory access.
> I'm currently using a BlueGene for some tests and the >low-level messaging layer gives you the opportunity to specifiy >callbacks for sender and receiver of those messages so you could for >example simply notify the target whenever you put something into its memory.
This is a typical feature -- it's needed because you still need synchronization.
lind...@pbm.com (Greg Lindahl) writes: >>Regarding the synchronisation subroutine calls: I surmise that MPI codes >>usually employing send-receives where many if not all processes take >>part which means an implicit synchronisation step at the end of every >>communication epoch - if it's needed or not; at least in theory one >>could use less synchronisation,
> Many of the MPI codes I've looked at have the minimum of > synchronization.
I think we talk about slightly different things; if you mean by "synchronisation" explicit calls to the barrier subroutine, you're right. I was more referring to the (sometimes unnecessary) synchronisation due to the two-sided communication model of MPI (let's not talk about eager vs. rendezvous at the moment).
Simple example: ghost cell exchange in a CFD code. In the MPI case, every send/receive incurs synchronisation, but you could simply read the remote processes' memory without the - explicit - help of the target. Of course, you have to ensure that you're reading consistent data, but this is simply one barrier before the next update step.
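A rough sketch of the one-sided variant described here, with Python threads standing in for processes and direct list reads standing in for remote get operations; it is not MPI or any real RDMA library, and `run_stencil` is a made-up name. One assumption goes beyond the post: the sketch keeps two copies of the data (read from one, write to the other), because with a single copy a fast neighbour could overwrite cells that a slow one is still reading, and one barrier per step would no longer be enough.

```python
import threading


def run_stencil(n_ranks=4, cells_per_rank=3, steps=2):
    """Periodic 1D three-point average with get-style ghost reads."""
    # Double buffering: bufs[0] and bufs[1] each hold every rank's cells.
    bufs = [
        [[float(r * cells_per_rank + i) for i in range(cells_per_rank)]
         for r in range(n_ranks)],
        [[0.0] * cells_per_rank for _ in range(n_ranks)],
    ]
    barrier = threading.Barrier(n_ranks)

    def rank_main(rank):
        left = (rank - 1) % n_ranks
        right = (rank + 1) % n_ranks
        for step in range(steps):
            cur, nxt = bufs[step % 2], bufs[(step + 1) % 2]
            # "Get": read the neighbours' edge cells without their help.
            ghost_left = cur[left][-1]
            ghost_right = cur[right][0]
            padded = [ghost_left] + cur[rank] + [ghost_right]
            for i in range(cells_per_rank):
                nxt[rank][i] = (padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            # The single synchronisation per update step.
            barrier.wait()

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bufs[steps % 2]           # buffer written by the last step


result = run_stencil()
print(result)
```

The barrier guarantees that nobody starts writing into a buffer until every rank has finished reading it in the previous step.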
> BTW, you may not want to use "RDMA" the way you're using it, it's been > hijacked by one community and redefined to be more and less than > actual remote direct memory access.
It is? I'm not really sure what would be the best terminology, I'm often using RDMA, SHMEM or distributed shared memory whenever I'm referring to (more or less) passive-target, one-sided communication in a cluster.
>> I'm currently using a BlueGene for some tests and the >>low-level messaging layer gives you the opportunity to specifiy >>callbacks for sender and receiver of those messages so you could for >>example simply notify the target whenever you put something into its memory.
> This is a typical feature -- it's needed because you still need > synchronization.
Depends. Current work on a 3D-FFT could be realised solely with get-communication on disjoint buffers, so barrier synchronisation is barely needed. I've dabbled with the implementation of an accumulation routine prototype which uses a put operation into remote memory, where the respective callback on the target process does the accumulation operation, but I'm still thinking about how to implement atomicity.
In article <fj9pr6$dr...@news.lrz-muenchen.de>, Sebastian Hanigk <han...@in.tum.de> wrote:
>> Many of the MPI codes I've looked at have the minimum of >> synchronization.
>I think we talk about slightly different things; if you mean by >"synchronisation" explicit calls to the barrier subroutine, you're >right.
No, I'm referring to all forms of synchronization, including 2-sided communication synchronization.
>Simple example: ghost cell exchange in a CFD code. In the MPI case, >every send/receive incurs synchronisation,
No, it doesn't. For example, I can irecv/isend and then waitall. That results in one synchronization with my neighbors. Nothing extra.
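The post-receives, post-sends, wait-once pattern can be sketched as follows. This is an analogy only, not MPI: Python threads play the ranks, unbounded `queue.Queue` objects play the network, and the local helpers `irecv`, `isend` and `waitall` are made-up stand-ins that merely borrow the MPI names. In real MPI the send requests would also be passed to the wait call; here the queue makes the sends complete at once, which is a simplification.

```python
import queue
import threading


def halo_exchange(n_ranks=4):
    """Each rank swaps one message with its left and right neighbour."""
    # inbox[dest][src] carries messages from src to dest.
    inbox = [{src: queue.Queue() for src in range(n_ranks)}
             for _ in range(n_ranks)]
    received = [None] * n_ranks

    def irecv(rank, src):
        # Returns at once with a "request"; nothing has arrived yet.
        return inbox[rank][src]

    def isend(rank, dest, payload):
        # Returns at once; the unbounded queue stands in for buffering.
        inbox[dest][rank].put(payload)

    def waitall(requests):
        # The only place where a rank blocks.
        return [req.get() for req in requests]

    def rank_main(rank):
        left = (rank - 1) % n_ranks
        right = (rank + 1) % n_ranks
        requests = [irecv(rank, left), irecv(rank, right)]
        isend(rank, left, ("from", rank))
        isend(rank, right, ("from", rank))
        received[rank] = waitall(requests)   # one wait for both neighbours

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(halo_exchange())
```

Adding more exchange directions would add more requests to the list, but each rank would still block in a single place.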
> but you could simply read the >remote processes' memory without the - explicit - help of the target. Of >course, you have to ensure that you're reading consistent data, but this >is simply one barrier before the next update step.
That's a synchronization, too. So there you have it: one in each case.
lind...@pbm.com (Greg Lindahl) writes: >>Simple example: ghost cell exchange in a CFD code. In the MPI case, >>every send/receive incurs synchronisation,
> No, it doesn't. For example, I can irecv/isend and then waitall. That > results in one synchronization with my neighbors. Nothing extra.
But this only works for eager sends or receives! If the amount of data you're about to transfer exceeds some buffer limit, even the i-routines will behave like the synchronous ones. Many MPI implementations let you fiddle with the buffer limit, and you could use the more unusual immediate buffered send/receive routines.
>> but you could simply read the >>remote processes' memory without the - explicit - help of the target. Of >>course, you have to ensure that you're reading consistent data, but this >>is simply one barrier before the next update step.
> That's a synchronization, too. So there you have it: one in each case.
It is one synchronisation per update cycle with one-sided communication regardless of the number of dimensions etc. whereas the synchronisations in the MPI case would be two times the number of exchange dimensions for the rendezvous protocol; it can be brought down to one synchronisation if immediate routines are used and they do not have to switch to a synchronous mode of communication.
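The counting in this post can be written down as a small table. The sketch below only restates the numbers claimed above; whether a blocking exchange should be counted as a synchronisation at all is exactly what is being disputed in this thread, so treat it as one poster's accounting rather than a fact about MPI. The function name and scheme labels are made up for illustration.

```python
def syncs_per_update(dims, scheme):
    """Synchronisations per update cycle, as counted in the post above."""
    if dims < 1:
        raise ValueError("dims must be at least 1")
    if scheme == "blocking_rendezvous":
        # Two exchanges (one per direction) for every exchange dimension.
        return 2 * dims
    if scheme in ("nonblocking_waitall", "one_sided_barrier"):
        # A single wait or a single barrier, regardless of dimensions.
        return 1
    raise ValueError("unknown scheme: " + scheme)


for dims in (1, 2, 3):
    print(dims,
          syncs_per_update(dims, "blocking_rendezvous"),
          syncs_per_update(dims, "nonblocking_waitall"),
          syncs_per_update(dims, "one_sided_barrier"))
```

For a 3D halo exchange this accounting gives six synchronisations for the blocking case against one for either of the other two.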
In article <fjb9hf$c1...@news.lrz-muenchen.de>, Sebastian Hanigk <han...@in.tum.de> wrote:
>But this only works for eager sends or receives! If the amount of data >you're about to transfer exceeds some buffer limit, even the i-routines >will behave like the synchronous ones.
Not only is this implementation-dependent behavior, but your comment doesn't make any sense. MPI_RECV always blocks until the data is available. MPI_IRECV never does. So no, large transfers never make MPI_IRECV behave like MPI_RECV. With IRECV, the blocking happens at the MPI_WAIT.
And there is usually only one MPI_WAIT, no matter how many dimensions your halo exchange has.
Now perhaps you're using a funny definition of "synchronization". But it doesn't sound like a useful one.
lind...@pbm.com (Greg Lindahl) writes: >>But this only works for eager sends or receives! If the amount of data >>you're about to transfer exceeds some buffer limit, even the i-routines >>will behave like the synchronous ones.
> Not only is this implementation-dependent behavior, but your comment > doesn't make any sense. MPI_RECV always blocks until the data is > available. MPI_IRECV never does. So no, large transfers never make > MPI_IRECV behave like MPI_RECV. With IRECV, the blocking happens at > the MPI_WAIT.
I'm sorry for any misunderstanding, my comment above has been written in a slight hurry ...
Regarding MPI_Irecv I cannot say anything at the moment - I strongly assume that your description is what should be expected. But its complementary sending routine switches from immediate return to blocking behaviour after exceeding an implementation-dependent message size threshold.
> And there is usually only one MPI_WAIT, no matter how many dimensions > your halo exchange has.
Yes. But if your halo's exchange buffer size is larger than the implementation's threshold, you will end up with blocking behaviour on each exchange while the zero-copy RDMA (without any connotation I'm perhaps unaware of) access can obviate this.
> Now perhaps you're using a funny definition of "synchronization". But > it doesn't sound like a useful one.
I don't think I have given or used an unusual definition of synchronisation; in MPI, there is an implicit synchronisation between the sending and receiving party hidden in the respective calls to the send or receive routines, with the exception of the immediate versions of those routines whose behaviour depends on the transfer size.
Could it be that this discussion goes in some kind of circle while we're misunderstanding each other? I'm in no way dismissing MPI as inferior, but for some purposes it is very nice to have the means for one-sided, passive-target communication available. Without doubt the RDMA scheme has its own set of problems (I just remembered a short article: <http://www.hpcwire.com/hpc/815242.html>), I'm still struggling with the registration/pinning issues - compute node kernels without swapping capability are a godsend for that purpose.
In article <fjcgqe$2t...@news.lrz-muenchen.de>, Sebastian Hanigk <han...@in.tum.de> wrote:
>Regarding MPI_Irecv I cannot say anything at the moment - I strongly >assume that your description should be expected. But its complementary >sending routine switches from immediate return to blocking behaviour >after exceeding an implementation-dependend message size threshold.
No. Isend returns immediately in all cases. What work it does before returning is implementation dependent, and that's what you seem to be referring to, incorrectly.
>Could it be that this discussion goes in some kind of circle while we're >misunderstanding each other?
It's entirely possible.
> I'm in no way dismissing MPI as inferior, > but for some purposes it is very nice to have the means for one-sided, > passive-target communication available.
Indeed, it is sometimes useful. But now you've returned to the beginning of the discussion, and I have the same reply as before.
lind...@pbm.com (Greg Lindahl) writes: > No. Isend returns immediately in all cases. What work it does before > returning is implementation dependent, and that's what you seem to be > referring to, incorrectly.
I beg to differ. Now it seems that you have an unusual definition of "immediately". Take a look at the data from <http://www.cs.sandia.gov/smb/overhead.html> and you see in fig. 2 (Overhead as a function of message size for MPI_Isend) that interconnects without good communication offload capabilities suffer a penalty proportional to the message size.
lind...@pbm.com (Greg Lindahl) writes: > The fact that ISend sometimes does a significant amount of work before > returning has nothing to do with synchronization or blocking.
Fair enough. Now I have to try to explain the difference to our users :-)
I'll try to do the same measurements with the MPI calls replaced by ARMCI calls (that's the library I'm currently using) and post the results.
>> You might want to ask Doug about my objections to his experimental method.
>Care to explain?
Doug is asking "how much work can I get done while communicating?" But he's measuring a loop that doesn't touch main memory. You've probably heard of Don Becker's comment on zero copy: it's when you get someone else to do the copy. Everyone likes to pretend that this copy is free, but it isn't. Well, all that DMA memory traffic costs. So Doug's number is an upper bound; if you used the Stream benchmark as the work you'd get a lower bound. And a real app would be somewhere in between. (Since you have a framework for measuring this, perhaps you could do the stream measurement for us.)
Another issue I have with Doug's paper is that many readers misinterpreted it. It only applies to the modest fraction of codes which do large messages and can overlap. Most codes aren't like that.
lind...@pbm.com (Greg Lindahl) writes: > How did they notice?
Parameter space exploration in some newly implemented parallelisations; mostly we noticed the (more or less) sudden rise in run time due to non-overlap.
> Doug is asking "how much work can I get done while communicating?" But > he's measuring a loop that doesn't touch main memory.
Yes, this is not really realistic and you have to be careful that your optimiser does not remove the loop.
One part of my diploma thesis' work was the implementation of the SRUMMA matrix multiplication algorithm where the key idea is maximal overlap; the working part sandwiched between get and wait calls was a BLAS call.
> You've probably heard of Don Becker's comment on zero copy: it's when > you get someone else to do the copy. Everone likes to pretend that > this copy is free, but it isn't. Well, all that DMA memory traffic > costs.
On the upside, you can probably decrease the transfer latency and if memory is tight, it could help to save the memory which would have been used for transfer buffers.
> So Doug's number is an upper bound; if you used the Stream > benchmark as the work you'd get a lower bound. And a real app would be > somewhere in between. (Since you have a framework for measuring this, > perhaps you could do the stream measurement for us.)
I wouldn't call it a framework, but I think I can do something useful with my allotted CPU time.
> Another issue I have with Doug's paper is that many readers > misinterpreted it. It only applies to the modest fraction of codes > which do large messages and can overlap. Most codes aren't like that.
I had the luxury of tackling a very easy problem in that respect (matrix multiplication) so for me that hasn't been unusual; for other codes I do concur with you.