Fwd: Presentation of new program using PyPar

15 views

Skip to first unread message

Ole Nielsen

unread,

Oct 19, 2011, 12:19:19 PM10/19/11

to pypar-...@googlegroups.com

This is a mail thread with Tom Grimes in regard to Python-Accelerated Coherent-states Electron-nuclear dynamics (PACE) which is about to be released as Open Source and is using pypar.

-o

---------- Forwarded message ----------
From: Ole Nielsen <ole.molle...@gmail.com>
Date: Mon, Oct 10, 2011 at 10:56 PM
Subject: Re: Presentation of new program using PyPar
To: tom.g...@ttu.edu

Hi Tom

Thanks for the details and for sending the details when you are ready. Making your software open source is a great move and it doesn't matter if it is "finished" - it is amazing what the community can do if they need it.

As for your question I can't say for sure, but have the following thoughts to share.

From what you describe the problem is trivially parallel (the best kind :-)) and should therefore give you decent speed up. In general there are three kinds of obstacles to good speed up

Some of the program cannot be parallelised. This is typically a set up phase where e.g. equations are being constructed as in for example finite-element codes. This phase will always put a limit on the speed up according to what is known as Amdahl's law. If for instance half of your sequential code was not parallisable, the maximum overall speedup you could get would be 2!
Communication overhead. If the program needs to communicate a lot (for example swapping boundaries and at each time step in case of domain decomposition), then the parallel efficiency (speedup divided by the number of processors) will taper off as more processors are used. In your case, this should not be a big factor as the only communication takes place at the end. How much data are you transferring when gathering the results on the master?
Load balancing. Unless each process has roughly the same amount of work to do, the speedup will not be ideal. In other words, you have to wait for the last processor to finish and if that has been allocated 90 percent of the work for instance, then your speed up will never exceed a mere 10%.

As I don't know the details of your code, I cannot tell you which one it is. However, here are some things you could try:

Measure the network speed. Run the test code in pypar on two nodes (with one process per node). There are two versions network_timing.c and network_timing.py, one uses pure C, the other pypar. Usually they produce the same results, though. They both work by sending data blocks of increasing size through mpi.send and mpi.receive and then solve for network latency and bandwidth. The process is repeated 10 times to allow for hiccups in the network. If you suspect your network is problematic, try to time it that way.
To get good speedup in parallel computing, it is paramount to keep things as simple and clean as possible. I learned this 15 years ago from Dr Ron Bell of IBM and his advice has helped me a lot over the years. To keep things simple in your case I would make sure (as you probably have already) that your code has only two parts to do with MPI:

A computation phase, where work is split up in some straightforward way using pypar.myid to determine which part should be computed. By straightforward I mean either by carving the problem up into equally sized blocks or by using a cyclic manner. You can see the example of each in the mandelbroot fractal example. For problems like yours I usually use a cyclic distribution. But try not to use mpi communication to distribute the work unless you absolutely need to in case of dynamic load balancing. There is also an example of that in the fractal example.
A communication phase where results are collected. Here it makes sense to use one of the collective MPI commands but you can also do a simple for-loop on the master to get all the pieces back. Again see the fractal example bundled with pypar.

Do some timing of the individual phase to identify where it goes wrong e.g. time only phase 1 and check the speed up. I know that at the end of the day it is the wall clock from the time the program starts until it ends that matters, but for troubleshooting it makes sense to time the different parts separately.
Parallel programming is an order of magnitude harder to debug than sequential programming, so it is important to make sure the sequential code is as clean, efficient and tested as it can be before parallelising it: "Get it right, then fast!"
Finally, expecting to measure speedup when the sequential runtime is less than a few minutes usually doesn't work to a bunch of overheads. Usually, you'd use something with a sequential run time of at least 5 to 10 minutes to verify speedup.

So, I have no doubt that your algorithm can be succesfully parallelised - and it probably already is. So feel free to try some of this and get back to me if you want to throw a few ideas around and get feedback.

Cheers

Ole

PS - do you mind if I post this thread on the pypar mailing list as others might benefit?

I will definitely send you a link when we get PACE open-source'd. It
should be within the next year.

Right now, our MPI support is experimental. For some reason,
communications calls (broadcast, send, etc.) seem to be taking much
longer than anticipated, but it is unknown whether that is our system
administrators fault (problems with the network itself) or a design flaw
in PACE. Currently, the program breaks a very large set of integrals
into chunks and assigns them to slave nodes. All the chunks are
completely independent and no inter-node communication is used, save
slave-to-master. There is only one MPI process per node because the
computational parts in the compiled extensions are multithreaded.
(Indeed, I have found it impossible to restrict TBB code to one
processor.) However, on a set of integrals that takes twelve seconds on
one node, splitting the job between three nodes still takes about ten
seconds, and unless the transfer speeds are considerably less that 100
Mbps (the "usual" network is 1000 Mbps, though Infiniband is available)
I just can't seem to find the issue. If I can't resolve the problem when
I work on the MPI-based part again, would you be willing to give me some
pointers?

If you would like, I would be happy to send you a copy of the abstract
and/or the slides for the talk.

Sincerely,
Tom

--
Tom Grimes, Ph.D.
Postdoctoral Research Assistant
Jorge Morales Group

Texas Tech University
Dept. of Chemistry and Biochemistry (040)
Box 41061
Lubbock, TX 79409-1061

(806) 742-0065 (voice); (806) 742-1289 (fax)

On Mon, 2011-10-10 at 03:41 -0500, Ole Nielsen wrote:
> Dear Tom
>
> Thank you very much for your mail and for the good news about the
> imminent release of your PACE. I am always extremely happy when
> someone is using software I have written - after all that is the
> reason to write and release open source.
>
> I am not too worried about the exact licensing to be honest. I picked
> GPL because it is one of the main ones, but you have my permission to
> license PACE anyway you like and keep pypar in there :-)
>
> Two requests though:

> * I try to keep track of those project that I know of which are

> using pypar, so when you have a link to your release
> repository and/or some journal papers or even slides from your
> presentations, I would like to include links to those on the
> pypar page under the heading "Some projects and publications
> that use Pypar" - see bottom of
> http://code.google.com/p/pypar/

> * If you have made modifications to the code, I would be very

> happy to get the patch and release it for others to use
> I am looking forward to hear how the presentations goes.
>
> Cheers
> Ole
>
>
> On Sat, Oct 8, 2011 at 1:59 AM, Tom Grimes <tom.g...@ttu.edu> wrote:
> Mr./Dr. Nielsen,
>
> The purpose of this e-mail is to inform you that our
> group has used
> your open source package, PyPar, in a new computational
> chemistry
> program. Python-Accelerated Coherent-states Electron-nuclear
> dynamics,
> PACE, is being presented in a talk by myself, Tom Grimes, at
> the
> Southwest Theoretical Chemistry Conference this month. You are
> mentioned
> in the acknowledgements for the talk, and as soon as PACE is
> completely
> verified for accuracy and we receive clearance from TTU legal
> PACE will
> be released to the public as open source. The exact nature of
> the
> license is not currently known, but we will ensure it complies
> with your
> package licensing.
>
> I also want to personally thank you for making your
> software available.
> If you want any additional information on PACE, I will be
> happy to
> provide it. I will also notify you when PACE becomes
> available for
> distribution.
>
> Sincerely,
> Tom Grimes
>
> --
> Tom Grimes, Ph.D.
> Postdoctoral Research Assistant
> Jorge Morales Group
>
> Texas Tech University
> Dept. of Chemistry and Biochemistry (040)
> Box 41061
> Lubbock, TX 79409-1061
>
> (806) 742-0065 (voice); (806) 742-1289 (fax)
>
>
>
>
>

Ole Nielsen

unread,

Oct 19, 2011, 12:22:44 PM10/19/11

to tom.g...@ttu.edu, pypar-...@googlegroups.com

Sounds great - let me know how you go. And bear in mind that no "real" parallel code ever scales perfectly.

Over the years I have come to appreciate any parallel efficiency (speedup/processors) above 50%

I flick the past conversation to the list (CC'd) so please chime in as you go.

Cheers and good luck with the OS release.

Ole

On Mon, Oct 10, 2011 at 11:59 PM, Tom Grimes <tom.g...@ttu.edu> wrote:

Ole,

Thanks for the advice. I have a few other things to do before I get
back to MPI, but I hope to start working on it again soon.

You are correct that the integral evaluations are embarrassingly
parallel, which is I why I am puzzled by the poor scaling. The time to
evaluate each integral is approximately the same and while the integral
load changes from run to run, within a run the requests are constant, so
I have load-balancing code to divide the work (roughly) evenly. I did
notice one thing you said that might be key here: the master node
communicates the range of integrals to the slave nodes (required by the
load balancing code). If I can eliminate that and divide the problem
based on ID only, then a barrier is avoided. Part of the reason I chose
to communicate the ranges is that the load balancing takes into account
how fast each node did its share of the work in the previous batch (to
cope with heterogeneous clusters, busy nodes, and sharing nodes with
other jobs). But I will try a more simplistic approach to optimize it
for dedicated cluster usage (completely homogeneous, exclusive access).

The rate determining step in the program is the computation of these
integrals. But, by Amdahl's Law, I do not expect proportional overall
speedup. Everything that can be parallelized has been parallelized, at
least for one node, so the parallel portions are the majority of the run
time.

By all means go ahead and post this on the mailing list. When I get
back to the MPI stuff, I'll post directly to the list.

-Tom

P.S.: If it were up to me, I would have made the code open source from
the beginning. Unfortunately, I have to deal with TTU legal. While I
don't expect PACE to be finished, according to my advisor it needs to be
numerically correct before we release it.

--
Tom Grimes, Ph.D.
Postdoctoral Research Assistant
Jorge Morales Group

Texas Tech University
Dept. of Chemistry and Biochemistry (040)
Box 41061
Lubbock, TX 79409-1061

(806) 742-0065 (voice); (806) 742-1289 (fax)

On Mon, 2011-10-10 at 10:56 -0500, Ole Nielsen wrote:
> Hi Tom
>
>
> Thanks for the details and for sending the details when you are ready.
> Making your software open source is a great move and it doesn't matter
> if it is "finished" - it is amazing what the community can do if they
> need it.
>
>
> As for your question I can't say for sure, but have the following
> thoughts to share.
> From what you describe the problem is trivially parallel (the best
> kind :-)) and should therefore give you decent speed up. In general
> there are three kinds of obstacles to good speed up

> 1. Some of the program cannot be parallelised. This is typically

> a set up phase where e.g. equations are being constructed as
> in for example finite-element codes. This phase will always
> put a limit on the speed up according to what is known as
> Amdahl's law. If for instance half of your sequential code was
> not parallisable, the maximum overall speedup you could get
> would be 2!

> 2. Communication overhead. If the program needs to communicate a

> lot (for example swapping boundaries and at each time step in
> case of domain decomposition), then the parallel efficiency
> (speedup divided by the number of processors) will taper off
> as more processors are used. In your case, this should not be
> a big factor as the only communication takes place at the end.
> How much data are you transferring when gathering the results
> on the master?

> 3. Load balancing. Unless each process has roughly the same

> amount of work to do, the speedup will not be ideal. In other
> words, you have to wait for the last processor to finish and
> if that has been allocated 90 percent of the work for
> instance, then your speed up will never exceed a mere 10%.
>
>
> As I don't know the details of your code, I cannot tell you which one
> it is. However, here are some things you could try:
>
>

> 1. Measure the network speed. Run the test code in pypar on two

> nodes (with one process per node). There are two versions
> network_timing.c and network_timing.py, one uses pure C, the
> other pypar. Usually they produce the same results, though.
> They both work by sending data blocks of increasing size
> through mpi.send and mpi.receive and then solve for network
> latency and bandwidth. The process is repeated 10 times to
> allow for hiccups in the network. If you suspect your network
> is problematic, try to time it that way.

> 2. To get good speedup in parallel computing, it is paramount to

> keep things as simple and clean as possible. I learned this 15
> years ago from Dr Ron Bell of IBM and his advice has helped me
> a lot over the years. To keep things simple in your case I
> would make sure (as you probably have already) that your code
> has only two parts to do with MPI:

> 1. A computation phase, where work is split up in some

> straightforward way using pypar.myid to determine
> which part should be computed. By straightforward I
> mean either by carving the problem up into equally
> sized blocks or by using a cyclic manner. You can see
> the example of each in the mandelbroot fractal
> example. For problems like yours I usually use a
> cyclic distribution. But try not to use mpi
> communication to distribute the work unless you
> absolutely need to in case of dynamic load balancing.
> There is also an example of that in the fractal
> example.

> 2. A communication phase where results are collected.

> Here it makes sense to use one of the collective MPI
> commands but you can also do a simple for-loop on the
> master to get all the pieces back. Again see the
> fractal example bundled with pypar.

> 3. Do some timing of the individual phase to identify where it

> goes wrong e.g. time only phase 1 and check the speed up. I
> know that at the end of the day it is the wall clock from the
> time the program starts until it ends that matters, but for
> troubleshooting it makes sense to time the different parts
> separately.

> 4. Parallel programming is an order of magnitude harder to debug

> than sequential programming, so it is important to make sure
> the sequential code is as clean, efficient and tested as it
> can be before parallelising it: "Get it right, then fast!"

> 5. Finally, expecting to measure speedup when the sequential

Tom Grimes

unread,

Oct 19, 2011, 4:36:13 PM10/19/11

to pypar-discuss

Only 50%? PACE has an asymptotic parallelism of 86% (in the limit of
very large problems, see http://homunculus.dyndns.org/pics/PACE_parallelism.tif
). But to be fair, that figure is for multithreading. The MPI code
still needs to be updated.

-Tom

> ...
>
> read more »

Ole Nielsen

unread,

Oct 19, 2011, 9:51:42 PM10/19/11

to pypar-...@googlegroups.com

Hi again

With over 80% efficiency you should be very happy. That is excellent.
Regarding the figure, how many processing elements are being used?

Cheers
Ole

Reply all

Reply to author

Forward

0 new messages