OpenMP & MPI for very large simulations


Brian L

Feb 26, 2016, 11:53:28 AM
to FDS and Smokeview Discussions
Hello,

Lots of good information here.
Just wanted to consult.

For very large simulations, say 20 million+ cells, what combination yields the best results on a machine/cluster that has 20 physical cores?

Is it a hybrid combo of MPI + OpenMP?


Using OpenMP on its own to run a large single-mesh simulation doesn't seem very efficient.
Is the alternative to create 20 individual mesh pieces and use MPI to solve them simultaneously? When I do this, I find that CPU utilization fluctuates and is not 100%. I know this approach creates many open boundaries and a "communication penalty".


From reading the User Guide, it appears the optimal number of OpenMP threads is 4. Should I therefore break my large simulation into 5 mesh pieces, assign an MPI process to each, and then let OpenMP do its work, assigning 4 cores to each of the 5 meshes?


I have not run any benchmarks yet, but with this method I seem to get full CPU load (which I think is what matters for simulations).
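For concreteness, the 5-mesh, 4-threads-per-mesh layout described above would be launched something like this (a sketch, assuming a Linux machine with an MPI distribution such as Open MPI or Intel MPI; `job.fds` is a placeholder input file name):

```shell
# Hypothetical layout: 20 physical cores, 5 meshes, 4 OpenMP threads each.
export OMP_NUM_THREADS=4      # threads used by each MPI process
mpiexec -np 5 fds job.fds     # one MPI process per mesh in job.fds
```

The pure-MPI alternative would instead use `OMP_NUM_THREADS=1` with `mpiexec -np 20` and a 20-mesh input file.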


There is also the opportunity to daisy-chain machines together, so I could combine two 20-core machines. What's the fastest combination there?


I guess ultimately my question is: is there a point where OpenMP + MPI is faster than MPI by itself? I find that beyond a certain number of MPI processes I see large RAM usage and inefficiencies, especially if I can't balance the mesh count exactly to 20 segments.


Thanks in advance.




 

Randy McDermott

Feb 26, 2016, 12:42:06 PM
to FDS and Smokeview Discussions
I don't consider myself an expert on this, so Kevin and Lukas (or others) may want to chime in, but I think of it like this: the sweet spot for MPI is meshes of something like 32^3 cells. We are trying to get scaling down past 16^3, but that gets difficult. For your problem, 20e6/(32^3) = 610 meshes, so you are not going to get there; you are at something more like 20e6/(100^3) = 20. I would stick with MPI with OMP_NUM_THREADS=1. The main reason is that for 20 meshes, MPI scales close to linearly (see the attached strong scaling plot from the FDS User Guide), whereas OpenMP is more like 50% efficient (2 times speed up for 4 meshes). So, you want to take the linear scaling as far as you can before you start using the weaker scaling. This supposes that cores are limited, as in your case.
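The mesh-count arithmetic above can be checked directly (plain Python; no assumptions beyond the numbers already quoted):

```python
cells = 20e6  # total cells in the simulation

# Meshes needed if each mesh were the "sweet spot" size of 32^3 cells.
meshes_at_sweet_spot = cells / 32**3
print(round(meshes_at_sweet_spot))  # ~610 meshes, far more than 20 cores

# With only 20 cores, each of the 20 meshes ends up around 100^3 cells.
meshes_at_100_cubed = cells / 100**3
print(round(meshes_at_100_cubed))   # 20 meshes, one per core
```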


strong_scaling_test.pdf

Randy McDermott

Feb 26, 2016, 12:43:34 PM
to FDS and Smokeview Discussions
I meant "2 times speed up for 4 cores".

Brian L

Feb 26, 2016, 1:28:27 PM
to FDS and Smokeview Discussions
Hi Randy,

Thanks for the thoughts and quick reply.

Do you have a reference for 32^3 being the sweet spot?

Also, any thoughts on the "communication penalty" for MPI, since the meshes need to communicate with each other? And that scaling seems somewhat theoretical, or for a simple FDS case, because a mesh that contains a fire or other area of interest slows down drastically.


I will try 20 meshes at OMP_NUM_THREADS=1 and see where that goes.

Cheers

Randy McDermott

Feb 26, 2016, 2:31:00 PM
to FDS and Smokeview Discussions
32^3 is my experience with FDS. But, in general, as the surface-to-volume ratio goes up, communication becomes the bottleneck. I've never heard of strong scaling holding up much below 16^3.

With the exception of droplets, which are not dynamically load balanced (something we are thinking about), FDS does the same amount of work in a mesh whether there is "action" there or not. If your calculation slows down, it is because the time step shrinks, but that is true for all meshes. The CPU time per time step stays the same; you slow down because you take more time steps.
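The time-step point can be made concrete with a toy calculation (a sketch; all numbers are made up, only the proportionality matters):

```python
# Wall time = (number of time steps) x (CPU cost per step).
# FDS keeps the cost per step roughly constant for a mesh; a fire
# shrinks the stable time step (via the CFL condition), so the run
# takes more steps, not more work per step.

SIM_TIME = 100.0       # seconds of simulated time (made up)
COST_PER_STEP = 2.0    # wall-clock seconds per time step (made up)

def wall_time(dt):
    """Wall-clock seconds to finish the run at a fixed time step dt."""
    steps = SIM_TIME / dt
    return steps * COST_PER_STEP

slow = wall_time(dt=0.5)    # quiescent flow: larger stable time step
fast = wall_time(dt=0.25)   # fire present: CFL halves the time step
print(slow, fast, fast / slow)  # 400.0 800.0 2.0
```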

Kisrobert

Feb 27, 2016, 2:26:36 PM
to FDS and Smokeview Discussions
Regarding "with the exception of droplets, which are not dynamically load balanced":
Does that mean meshes in which droplets will appear (e.g., near sprinklers) should be smaller than the others?
Or could it be useful to enable more cores (OpenMP) for the mesh containing the droplets?

 

Randy McDermott

Feb 28, 2016, 11:54:59 AM
to FDS and Smokeview Discussions
You can look at the _cpu.csv file to see how much time is spent in PART on a given MPI process. You could try to balance this manually by changing the mesh sizes, as you suggest.
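Pulling the PART column out of that file is straightforward (a sketch; the exact column set of a CHID_cpu.csv file varies by FDS version, so the two-row sample below is made up and only the Rank and PART columns are assumed):

```python
import csv
import io

# Hypothetical excerpt of a CHID_cpu.csv file (made-up numbers).
sample = """Rank,VELO,PRES,PART,Total T_USED (s)
0,120.5,80.2,5.1,300.0
1,118.9,79.8,96.4,390.0
"""

def part_time_by_rank(text):
    """Map MPI rank -> seconds spent in PART (droplet transport)."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["Rank"]): float(row["PART"]) for row in reader}

times = part_time_by_rank(sample)
print(times)  # rank 1 carries the droplet load in this made-up example
```

A rank whose PART time dwarfs the others is a candidate for a smaller mesh.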
