MPI and multicore processors

giuseppe

Nov 30, 2007, 11:17:59 AM

Hi everyone,
I'm a computer science student with a little experience (exercises) in
programming PC clusters with MPI. I'd like to have some references on
using the MPI library with multicore processors.

thanks a lot,
Giuseppe

Greg Lindahl

Nov 30, 2007, 2:19:10 PM

In article <fipd2v$ndh$1...@aioe.org>, giuseppe <gius...@bobo.it> wrote:

> I'd like to have some references on using the MPI library with
> multicore processors.

It's basically the same as using MPI on multi-processor machines, which
have been around for a long time.

-- greg

Justin W

Nov 30, 2007, 6:40:02 PM

If you're running a distributed/parallel program on a single physical
machine, there are more efficient ways to "pass information" around:
shared memory, for example (take a look at OpenMP... or implement your
own with pthreads). MPI's real advantage is for multi-machine
communication/coordination.
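
A minimal sketch of the two styles being contrasted, written in Python so that it runs without an MPI or OpenMP installation. Python threads stand in for OpenMP/pthreads workers and a `queue.Queue` stands in for message passing; the function names are hypothetical, and the sketch shows only the structure of each approach, not their relative performance.

```python
import queue
import threading


def chunks(data, n_workers):
    """Split data into n_workers contiguous slices (later ones may be shorter)."""
    size = -(-len(data) // n_workers)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n_workers)]


def run_workers(work, n_workers):
    """Start one thread per worker and return the list of started threads."""
    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    return threads


def sum_shared(data, n_workers=4):
    """Shared-memory style: every worker writes into a common result list."""
    parts = chunks(data, n_workers)
    partial = [0] * n_workers            # visible to all threads

    def work(i):
        partial[i] = sum(parts[i])       # each worker owns exactly one slot

    for t in run_workers(work, n_workers):
        t.join()
    return sum(partial)


def sum_messages(data, n_workers=4):
    """Message-passing style: workers write no shared state, they send results."""
    parts = chunks(data, n_workers)
    inbox = queue.Queue()                # stands in for the interconnect

    def work(i):
        inbox.put(sum(parts[i]))         # explicit "send" of the partial result

    threads = run_workers(work, n_workers)
    total = sum(inbox.get() for _ in range(n_workers))  # explicit "receives"
    for t in threads:
        t.join()
    return total
```

Both functions return the same total; the difference is whether the partial results are written in place or travel as explicit messages.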

If you just want to play around with MPI (which is what I'm doing)...
doesn't really matter what you run it on. Single-core, multi-core...
multi-machine. It'll "run", but obviously different setups are going
to have different advantages over one another.

-Justin

giuseppe

Dec 1, 2007, 3:32:19 AM

Justin W wrote:

> If you're running a distributed/parallel program on a single physical
> machine, there are more efficient ways to "pass information" around:
> shared memory, for example (take a look at OpenMP... or implement your
> own with pthreads). MPI's real advantage is for multi-machine
> communication/coordination.

Hmm, you are right! It may be better to use an efficient and stable
library like OpenMP than my own implementation (or not?)

> If you just want to play around with MPI

I want to play around with something that lets me program real-life
machines such as multicore processors effectively (not the HPC systems
at NASA).

> doesn't really matter what you run it on. Single-core, multi-core...
> multi-machine. It'll "run", but obviously different setups are going
> to have different advantages over one another.

That was my doubt: I hadn't realized that MPI is optimized for network
communication even though it can work on a single multicore machine.

Thanks,
Giuseppe

Sebastian Hanigk

Dec 1, 2007, 7:45:33 AM

giuseppe <gius...@bobo.it> writes:

>> doesn't really matter what you run it on. Single-core, multi-core...
>> multi-machine. It'll "run", but obviously different setups are going
>> to have different advantages over one another.
>
> That was my doubt: I hadn't realized that MPI is optimized for network
> communication even though it can work on a single multicore machine.

MPI, as the expanded acronym says, is based on a messaging paradigm,
which incurs a loss of efficiency inside an SMP node because you not
only have to transfer data but also pay the messaging overhead (setting
up and tearing down the connections between processes and so on).

Good MPI implementations would use something like shared memory IPC
inside an SMP node, but if you're concerned with the last bit of
performance, a thread-based programming model like OpenMP would be
better suited.

Sebastian

Greg Lindahl

Dec 1, 2007, 8:22:45 PM

In article <firl1c$tfj$1...@news.lrz-muenchen.de>,
Sebastian Hanigk <han...@in.tum.de> wrote:

>MPI, as the expanded acronym says, is based on a messaging paradigm,
>which incurs a loss of efficiency inside an SMP node because you not
>only have to transfer data but also pay the messaging overhead (setting
>up and tearing down the connections between processes and so on).

And with MPI, you get the increase of efficiency of never having false
sharing and other locality problems.

Which is why it's frequently the case that codes with both OpenMP and
MPI implementations run faster in pure MPI mode on big SMPs.
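
For readers new to the term: false sharing happens when threads write to different variables that happen to sit on the same cache line, so the line bounces between cores even though no data is logically shared. The sketch below only does the address arithmetic; the 64-byte line size, the line-aligned array and the function names are assumptions for illustration.

```python
def cache_lines_touched(n_threads, stride_bytes, line_bytes=64):
    """Map each cache-line index to the threads whose counter lives on it.

    Assumes thread t's counter starts at byte offset t * stride_bytes in a
    line-aligned array and does not straddle a line boundary.
    """
    lines = {}
    for t in range(n_threads):
        lines.setdefault((t * stride_bytes) // line_bytes, []).append(t)
    return lines


def falsely_shared(n_threads, stride_bytes, line_bytes=64):
    """True if some cache line holds counters written by more than one thread."""
    owners = cache_lines_touched(n_threads, stride_bytes, line_bytes)
    return any(len(threads) > 1 for threads in owners.values())
```

With eight 8-byte counters packed back to back, all eight threads write to one line; padding each counter to 64 bytes gives every thread its own line. Separate MPI processes keep their private data in separate address spaces, so this kind of accidental packing typically does not arise between ranks.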

-- greg

Sebastian Hanigk

Dec 5, 2007, 5:17:11 AM

lin...@pbm.com (Greg Lindahl) writes:

> And with MPI, you get the increase of efficiency of never having false
> sharing and other locality problems.
>
> Which is why it's frequently the case that codes with both OpenMP and
> MPI implementations run faster in pure MPI mode on big SMPs.

It's good that you mention the threading problems that can occur.

One of the major drawbacks of MPI on SMP machines is, in my opinion, the
necessary synchronisation for communication; one-sided communication
directives (which MPI supports only half-heartedly) are a really nice way
of loose coupling, especially if your hardware supports it natively.

Sebastian

Greg Lindahl

Dec 5, 2007, 3:51:33 PM

In article <fj5tr1$us6$1...@news.lrz-muenchen.de>,
Sebastian Hanigk <han...@in.tum.de> wrote:

>One of the major drawbacks of MPI on SMP machines is, in my opinion, the
>necessary synchronisation for communication; one-sided communication
>directives (which MPI supports only half-heartedly) are a really nice way
>of loose coupling, especially if your hardware supports it natively.

Yes, although many programmers are unpleased to discover that they
often need just as much synchronization with one-sided
communications. So they end up sprinkling their code with barriers,
and sometimes have to resort to double-buffering.

-- greg

Sebastian Hanigk

Dec 5, 2007, 6:35:42 PM

lin...@pbm.com (Greg Lindahl) writes:

> Yes, although many programmers are unpleased to discover that they
> often need just as much synchronization with one-sided
> communications. So they end up sprinkling their code with barriers,
> and sometimes have to resort to double-buffering.

I have had good experiences with one-sided communication in cases where
the data layout is unpredictable (in my case, plugging newly developed
algorithms into an existing legacy codebase). The buffering issues can
sometimes (often?) be used for non-blocking communication, which is
especially useful if your interconnect supports some kind of RDMA
operations.

Regarding the synchronisation subroutine calls: I surmise that MPI codes
usually employ send-receives in which many if not all processes take
part, which means an implicit synchronisation step at the end of every
communication epoch, whether it's needed or not; at least in theory one
could use less synchronisation, albeit explicit, by employing RDMA
communication. I'm currently using a BlueGene for some tests, and the
low-level messaging layer gives you the opportunity to specify
callbacks for the sender and receiver of those messages, so you could,
for example, simply notify the target whenever you put something into
its memory.

Sebastian

Greg Lindahl

Dec 6, 2007, 3:06:51 PM

In article <fj7ck9$tc8$1...@news.lrz-muenchen.de>,
Sebastian Hanigk <han...@in.tum.de> wrote:

>Regarding the synchronisation subroutine calls: I surmise that MPI codes
>usually employ send-receives in which many if not all processes take
>part, which means an implicit synchronisation step at the end of every
>communication epoch, whether it's needed or not; at least in theory one
>could use less synchronisation,

Many of the MPI codes I've looked at have the minimum of synchronization.

BTW, you may not want to use "RDMA" the way you're using it, it's been
hijacked by one community and redefined to be more and less than
actual remote direct memory access.

> I'm currently using a BlueGene for some tests and the
>low-level messaging layer gives you the opportunity to specify
>callbacks for sender and receiver of those messages so you could for
>example simply notify the target whenever you put something into its memory.

This is a typical feature -- it's needed because you still need
synchronization.

-- greg

Sebastian Hanigk

Dec 6, 2007, 4:33:33 PM

lin...@pbm.com (Greg Lindahl) writes:

>>Regarding the synchronisation subroutine calls: I surmise that MPI codes
>>usually employ send-receives in which many if not all processes take
>>part, which means an implicit synchronisation step at the end of every
>>communication epoch, whether it's needed or not; at least in theory one
>>could use less synchronisation,
>
> Many of the MPI codes I've looked at have the minimum of
> synchronization.

I think we are talking about slightly different things; if by
"synchronisation" you mean explicit calls to the barrier subroutine,
you're right. I was referring more to the (sometimes unnecessary)
synchronisation due to the two-sided communication model of MPI (let's
not talk about eager vs. rendezvous at the moment).

Simple example: ghost cell exchange in a CFD code. In the MPI case,
every send/receive incurs synchronisation, but you could simply read the
remote processes' memory without the - explicit - help of the target. Of
course, you have to ensure that you're reading consistent data, but this
is simply one barrier before the next update step.
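
A toy model of this ghost-cell scheme, with Python threads standing in for processes and direct reads of a neighbour's list standing in for a one-sided get. All names are hypothetical. To make a single barrier per update step sufficient, the sketch double-buffers the cells (the technique Greg mentioned earlier in the thread); that is a design choice of this sketch, not something the post specifies.

```python
import threading


def smooth_serial(cells, steps):
    """Reference: periodic 3-point average computed by a single thread."""
    n = len(cells)
    cur = list(cells)
    for _ in range(steps):
        cur = [(cur[i - 1] + cur[i] + cur[(i + 1) % n]) / 3.0 for i in range(n)]
    return cur


def smooth_one_sided(cells, steps, n_ranks):
    """Each 'rank' owns one chunk and reads neighbour edges directly."""
    assert len(cells) % n_ranks == 0
    size = len(cells) // n_ranks
    # Two copies of every chunk: read from bufs[step % 2], write to the other.
    bufs = [[list(cells[r * size:(r + 1) * size]) for r in range(n_ranks)],
            [[0.0] * size for _ in range(n_ranks)]]
    barrier = threading.Barrier(n_ranks)

    def rank_main(r):
        left, right = (r - 1) % n_ranks, (r + 1) % n_ranks
        for step in range(steps):
            old, new = bufs[step % 2], bufs[(step + 1) % 2]
            # One-sided reads: the neighbours take no action here.
            padded = [old[left][-1]] + old[r] + [old[right][0]]
            for i in range(size):
                new[r][i] = (padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            barrier.wait()  # the single synchronisation per update step

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [x for chunk in bufs[steps % 2] for x in chunk]
```

The barrier guarantees that nobody starts overwriting a buffer while a neighbour may still be reading it; without the second buffer, a second barrier per step would be needed.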

> BTW, you may not want to use "RDMA" the way you're using it, it's been
> hijacked by one community and redefined to be more and less than
> actual remote direct memory access.

It has? I'm not really sure what the best terminology would be; I
often use RDMA, SHMEM or distributed shared memory when referring to
(more or less) passive-target, one-sided communication in a cluster.

>> I'm currently using a BlueGene for some tests and the
>>low-level messaging layer gives you the opportunity to specify
>>callbacks for sender and receiver of those messages so you could for
>>example simply notify the target whenever you put something into its memory.
>
> This is a typical feature -- it's needed because you still need
> synchronization.

Depends. My current work on a 3D-FFT could be realised solely with
get-communication on disjoint buffers, so barrier synchronisation is
barely needed. I've dabbled with the implementation of an accumulation
routine prototype which uses a put operation into remote memory, where
the respective callback on the target process does the accumulation
operation, but I'm still thinking about how to implement atomicity.
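
To make the atomicity concern concrete, here is a shared-memory toy model of accumulate-by-callback. It is not the BlueGene messaging layer or ARMCI: the `Target` class and `on_put` callback are hypothetical, the callback runs on whichever thread delivers the put, and a plain lock is assumed as the way to serialise concurrent accumulations.

```python
import threading


class Target:
    """A 'target process' whose memory other ranks accumulate into."""

    def __init__(self, n):
        self.memory = [0] * n
        self._lock = threading.Lock()  # assumption: a lock provides atomicity

    def on_put(self, offset, values):
        """Callback run when a put arrives: add values into target memory."""
        with self._lock:  # keeps read-modify-write updates from interleaving
            for i, v in enumerate(values):
                self.memory[offset + i] += v


def accumulate_from(n_origins, n, repeats):
    """Let n_origins threads each put the vector [rank + 1] * n, repeats times."""
    target = Target(n)

    def origin(rank):
        for _ in range(repeats):
            target.on_put(0, [rank + 1] * n)

    threads = [threading.Thread(target=origin, args=(r,)) for r in range(n_origins)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return target.memory
```

An alternative design would run all callbacks on a single target-side thread, which serialises them without a lock; which option is available depends on the messaging layer.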

Sebastian

Greg Lindahl

Dec 6, 2007, 6:45:00 PM

In article <fj9pr6$drf$1...@news.lrz-muenchen.de>,
Sebastian Hanigk <han...@in.tum.de> wrote:

>> Many of the MPI codes I've looked at have the minimum of
>> synchronization.
>
>I think we are talking about slightly different things; if by
>"synchronisation" you mean explicit calls to the barrier subroutine,
>you're right.

No, I'm referring to all forms of synchronization, including
2-sided communication synchronization.

>Simple example: ghost cell exchange in a CFD code. In the MPI case,
>every send/receive incurs synchronisation,

No, it doesn't. For example, I can irecv/isend and then waitall. That
results in one synchronization with my neighbors. Nothing extra.

> but you could simply read the
>remote processes' memory without the - explicit - help of the target. Of
>course, you have to ensure that you're reading consistent data, but this
>is simply one barrier before the next update step.

That's a synchronization, too. So there you have it: one in each case.

-- greg

Sebastian Hanigk

Dec 7, 2007, 6:07:38 AM

lin...@pbm.com (Greg Lindahl) writes:

>>Simple example: ghost cell exchange in a CFD code. In the MPI case,
>>every send/receive incurs synchronisation,
>
> No, it doesn't. For example, I can irecv/isend and then waitall. That
> results in one synchronization with my neighbors. Nothing extra.

But this only works for eager sends or receives! If the amount of data
you're about to transfer exceeds some buffer limit, even the i-routines
will behave like the synchronous ones. Many MPI implementations let you
fiddle with the buffer limit and you could use the more unusual
immediate buffered send/receive routines.

>> but you could simply read the
>>remote processes' memory without the - explicit - help of the target. Of
>>course, you have to ensure that you're reading consistent data, but this
>>is simply one barrier before the next update step.
>
> That's a synchronization, too. So there you have it: one in each case.

It is one synchronisation per update cycle with one-sided communication
regardless of the number of dimensions etc. whereas the synchronisations
in the MPI case would be two times the number of exchange dimensions for
the rendezvous protocol; it can be brought down to one synchronisation
if immediate routines are used and they do not have to switch to a
synchronous mode of communication.

Sebastian

Greg Lindahl

Dec 7, 2007, 2:50:38 PM

In article <fjb9hf$c12$1...@news.lrz-muenchen.de>,
Sebastian Hanigk <han...@in.tum.de> wrote:

>But this only works for eager sends or receives! If the amount of data
>you're about to transfer exceeds some buffer limit, even the i-routines
>will behave like the synchronous ones.

Not only is this implementation-dependent behavior, but your comment
doesn't make any sense. MPI_RECV always blocks until the data is
available. MPI_IRECV never does. So no, large transfers never make
MPI_IRECV behave like MPI_RECV. With IRECV, the blocking happens at
the MPI_WAIT.

And there is usually only one MPI_WAIT, no matter how many dimensions
your halo exchange has.
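
The pattern described here (post all receives, post all sends, then wait once) can be modelled without an MPI installation. In the toy below, threads play the ranks and queues play the network; `ToyComm`, `Request` and `wait_all` are hypothetical stand-ins for what real code would do with MPI_Irecv, MPI_Isend and MPI_Waitall. The toy `isend` copies into an unbounded queue, which is only one possible implementation strategy, so the sketch says nothing about how a real library handles large messages.

```python
import queue
import threading

TO_LEFT, TO_RIGHT = 0, 1  # message tags: direction the data travels


class Request:
    """Handle for a posted operation. Only wait() may block."""

    def __init__(self, complete):
        self._complete = complete

    def wait(self):
        return self._complete()


class ToyComm:
    def __init__(self, size):
        # One mailbox per (source, destination, tag) triple.
        self._boxes = {(s, d, tag): queue.Queue()
                       for s in range(size) for d in range(size)
                       for tag in (TO_LEFT, TO_RIGHT)}

    def isend(self, src, dst, tag, data):
        self._boxes[(src, dst, tag)].put(data)  # unbounded queue: never blocks
        return Request(lambda: None)

    def irecv(self, src, dst, tag):
        # Posting does nothing yet; the queue is read when wait() is called.
        return Request(self._boxes[(src, dst, tag)].get)


def wait_all(requests):
    return [req.wait() for req in requests]


def halo_exchange(edges):
    """edges[r] = (left_edge, right_edge) of rank r on a periodic 1-D ring.

    Returns ghosts[r] = (value from left neighbour, value from right neighbour).
    """
    size = len(edges)
    comm = ToyComm(size)
    ghosts = [None] * size

    def rank_main(r):
        left, right = (r - 1) % size, (r + 1) % size
        requests = [
            comm.irecv(left, r, TO_RIGHT),   # left neighbour's right edge
            comm.irecv(right, r, TO_LEFT),   # right neighbour's left edge
            comm.isend(r, left, TO_LEFT, edges[r][0]),
            comm.isend(r, right, TO_RIGHT, edges[r][1]),
        ]
        results = wait_all(requests)         # the only blocking point
        ghosts[r] = (results[0], results[1])

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ghosts
```

Adding more exchange dimensions would add more entries to `requests`, but there would still be a single `wait_all` per rank.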

Now perhaps you're using a funny definition of "synchronization". But
it doesn't sound like a useful one.

-- greg

Sebastian Hanigk

Dec 7, 2007, 5:17:59 PM

lin...@pbm.com (Greg Lindahl) writes:

>>But this only works for eager sends or receives! If the amount of data
>>you're about to transfer exceeds some buffer limit, even the i-routines
>>will behave like the synchronous ones.
>
> Not only is this implementation-dependent behavior, but your comment
> doesn't make any sense. MPI_RECV always blocks until the data is
> available. MPI_IRECV never does. So no, large transfers never make
> MPI_IRECV behave like MPI_RECV. With IRECV, the blocking happens at
> the MPI_WAIT.

I'm sorry for any misunderstanding, my comment above has been written in
a slight hurry ...

Regarding MPI_Irecv I cannot say anything at the moment; I strongly
assume that it behaves as you describe. But its complementary
sending routine switches from immediate return to blocking behaviour
after exceeding an implementation-dependent message size threshold.

> And there is usually only one MPI_WAIT, no matter how many dimensions
> your halo exchange has.

Yes. But if your halo's exchange buffer size is larger than the
implementation's threshold, you will end up with blocking behaviour on
each exchange while the zero-copy RDMA (without any connotation I'm
perhaps unaware of) access can obviate this.

> Now perhaps you're using a funny definition of "synchronization". But
> it doesn't sound like a useful one.

I don't think I have given or used an unusual definition of
synchronisation; in MPI, there is an implicit synchronisation between
the sending and receiving party hidden in the respective calls to the
send or receive routines, with the exception of the immediate versions of
those routines whose behaviour depends on the transfer size.

Could it be that this discussion goes in some kind of circle while we're
misunderstanding each other? I'm in no way dismissing MPI as inferior,
but for some purposes it is very nice to have the means for one-sided,
passive-target communication available. Without doubt the RDMA scheme
has its own set of problems (I just remembered a short article:
<http://www.hpcwire.com/hpc/815242.html>); I'm still struggling with the
registration/pinning issues. Compute node kernels without swapping
capability are a godsend for that purpose.

Sebastian

Greg Lindahl

Dec 7, 2007, 5:38:25 PM

In article <fjcgqe$2to$1...@news.lrz-muenchen.de>,
Sebastian Hanigk <han...@in.tum.de> wrote:

>Regarding MPI_Irecv I cannot say anything at the moment - I strongly
>assume that your description should be expected. But its complementary
>sending routine switches from immediate return to blocking behaviour
>after exceeding an implementation-dependent message size threshold.

No. Isend returns immediately in all cases. What work it does before
returning is implementation dependent, and that's what you seem to be
referring to, incorrectly.

>Could it be that this discussion goes in some kind of circle while we're
>misunderstanding each other?

It's entirely possible.

> I'm in no way dismissing MPI as inferior,
> but for some purposes it is very nice to have the means for one-sided,
> passive-target communication available.

Indeed, it is sometimes useful. But now you've returned to the
beginning of the discussion, and I have the same reply as before.

-- greg

Sebastian Hanigk

Dec 7, 2007, 7:58:28 PM

lin...@pbm.com (Greg Lindahl) writes:

> No. Isend returns immediately in all cases. What work it does before
> returning is implementation dependent, and that's what you seem to be
> referring to, incorrectly.

I beg to differ. Now it seems that you have an unusual definition of
"immediately". Take a look at the data from
<http://www.cs.sandia.gov/smb/overhead.html> and you see in fig. 2
(Overhead as a function of message size for MPI_Isend) that
interconnects without good communication offload capabilities suffer a
penalty proportional to the message size.

On the available Blue Gene
(<http://www.epcc.ed.ac.uk/facilities/blue-gene/>) I had done some
measurements a few weeks ago and put the resulting data files and
overhead plots on the web just now
(<http://www.fs.tum.de/~shanigk/mpi_overhead/>).

Sebastian

Greg Lindahl

Dec 7, 2007, 8:47:09 PM

In article <fjcq7l$649$1...@news.lrz-muenchen.de>,
Sebastian Hanigk <han...@in.tum.de> wrote:

>I beg to differ. Now it seems that you have an unusual definition of
>"immediately".

OK, "does not block".

The fact that ISend sometimes does a significant amount of work before
returning has nothing to do with synchronization or blocking.

><http://www.cs.sandia.gov/smb/overhead.html>

You might want to ask Doug about my objections to his experimental method.

-- greg

Sebastian Hanigk

Dec 7, 2007, 9:01:18 PM

lin...@pbm.com (Greg Lindahl) writes:

> The fact that ISend sometimes does a significant amount of work before
> returning has nothing to do with synchronization or blocking.

Fair enough. Now I have to try to explain the difference to our users :-)

I'll try to do the same measurements with the MPI calls replaced by ARMCI
calls (that's the library I'm currently using) and post the results.

>><http://www.cs.sandia.gov/smb/overhead.html>
>
> You might want to ask Doug about my objections to his experimental method.

Care to explain?

Anyway, have a good night!

Sebastian

Greg Lindahl

Dec 8, 2007, 4:26:37 PM

In article <fjcttf$87m$1...@news.lrz-muenchen.de>,

Sebastian Hanigk <han...@in.tum.de> wrote:
>lin...@pbm.com (Greg Lindahl) writes:
>
>> The fact that ISend sometimes does a significant amount of work before
>> returning has nothing to do with synchronization or blocking.
>
>Fair enough. Now I have to try to explain the difference to our users :-)

How did they notice?

>>><http://www.cs.sandia.gov/smb/overhead.html>
>>
>> You might want to ask Doug about my objections to his experimental method.
>
>Care to explain?

Doug is asking "how much work can I get done while communicating?" But
he's measuring a loop that doesn't touch main memory. You've probably
heard of Don Becker's comment on zero copy: it's when you get someone
else to do the copy. Everyone likes to pretend that this copy is free,
but it isn't. Well, all that DMA memory traffic costs. So Doug's
number is an upper bound; if you used the Stream benchmark as the work
you'd get a lower bound. And a real app would be somewhere in between.
(Since you have a framework for measuring this, perhaps you could do the
stream measurement for us.)

Another issue I have with Doug's paper is that many readers
misinterpreted it. It only applies to the modest fraction of codes
which do large messages and can overlap. Most codes aren't like that.

-- greg

Sebastian Hanigk

Dec 8, 2007, 7:57:49 PM

lin...@pbm.com (Greg Lindahl) writes:

> How did they notice?

Parameter space exploration in some newly implemented parallelisations;
mostly we noticed the (more or less) sudden rise in run time due to
non-overlap.

> Doug is asking "how much work can I get done while communicating?" But
> he's measuring a loop that doesn't touch main memory.

Yes, this is not really realistic and you have to be careful that your
optimiser does not remove the loop.

One part of my diploma thesis' work was the implementation of the SRUMMA
matrix multiplication algorithm where the key idea is maximal overlap;
the working part sandwiched between get and wait calls was a BLAS call.
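
The get/work/wait sandwich can be sketched as a small prefetch pipeline. This is not SRUMMA itself: `fetch_pair` is a hypothetical stand-in for a non-blocking remote get, a pure-Python triple loop replaces the BLAS call, and a one-thread executor models the communication running in the background.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain triple-loop product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_into(c, d):
    """Accumulate matrix d into matrix c in place."""
    for i, row in enumerate(d):
        for j, v in enumerate(row):
            c[i][j] += v


def overlapped_product(fetch_pair, n_blocks, rows, cols):
    """Compute the sum over k of A_k * B_k, fetching pair k+1 during product k.

    fetch_pair(k) must return the tuple (A_k, B_k).
    """
    c = [[0] * cols for _ in range(rows)]
    if n_blocks == 0:
        return c
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_pair, 0)              # first get
        for k in range(n_blocks):
            a_k, b_k = pending.result()                   # wait for block k
            if k + 1 < n_blocks:
                pending = pool.submit(fetch_pair, k + 1)  # start the next get
            add_into(c, matmul(a_k, b_k))                 # work during the get
    return c
```

Whether the overlap pays off in practice depends on the interconnect and on how much memory traffic the work itself generates, which is the point under discussion above.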

> You've probably heard of Don Becker's comment on zero copy: it's when
> you get someone else to do the copy. Everyone likes to pretend that
> this copy is free, but it isn't. Well, all that DMA memory traffic
> costs.

On the upside, you can probably decrease the transfer latency and if
memory is tight, it could help to save the memory which would have been
used for transfer buffers.

> So Doug's number is an upper bound; if you used the Stream
> benchmark as the work you'd get a lower bound. And a real app would be
> somewhere in between. (Since you have a framework for measuring this,
> perhaps you could do the stream measurement for us.)

I wouldn't call it a framework, but I think I can do something useful
with my allotted CPU time.

> Another issue I have with Doug's paper is that many readers
> misinterpreted it. It only applies to the modest fraction of codes
> which do large messages and can overlap. Most codes aren't like that.

I had the luxury of tackling a very easy problem in that respect (matrix
multiplication) so for me that hasn't been unusual; for other codes I do
concur with you.

Sebastian