Using MPI and OpenMP effectively with multiple cores


fletchjams

unread,
Feb 24, 2020, 4:06:25 PM2/24/20
to FDS and Smokeview Discussions
We have a server (dual processor) with 36 physical cores available, and we have run FDS speed tests to try to find the most efficient settings for the following command lines with multiple meshes.

set OMP_NUM_THREADS=XX

mpiexec -n YY fds CHID_.fds

Setting XX = 02 and YY = 12 seems to work quite efficiently, gives fast results, and allows us to split the model into 12 or more smaller connected meshes.

We have found that running two parallel simulations together with these settings and our available CPUs does not hinder the speed too much; however, when using a greater number of MPI processes (OpenMP has been limited to 2 on all simulations now), we get slower results. How do the number of MPI processes and the number of OpenMP threads affect the simulation speed? Is there any resource that indicates the best setup for single or multiple simultaneous simulations? I wonder if we are throttling the system by over-using the cores?

Any help to explain the setup for multi-core machines would be very helpful to us.
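As a quick sanity check on the core arithmetic in the question, each MPI process spawns OMP_NUM_THREADS threads, so one job demands (MPI processes) × (threads per process) cores. A minimal sketch, using the counts quoted above:

```python
# One FDS job demands (MPI processes) x (OMP_NUM_THREADS) cores in total.
def cores_demanded(mpi_processes, omp_threads):
    return mpi_processes * omp_threads

PHYSICAL_CORES = 36  # dual-socket server, 18 cores per socket

one_job = cores_demanded(12, 2)   # 24 cores: fits comfortably
two_jobs = 2 * one_job            # 48 cores: more than the machine has
print(one_job <= PHYSICAL_CORES)  # True
print(two_jobs <= PHYSICAL_CORES) # False -> the cores are over-subscribed
```

Two concurrent jobs at these settings demand 48 cores on a 36-core machine, which is consistent with the observed slowdown.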

JMejias

unread,
Feb 24, 2020, 4:53:11 PM2/24/20
to FDS and Smokeview Discussions
In my opinion, the benchmarking you did to obtain an efficient setup for your server is probably the best way to find what you are looking for. The relevant information in the guide is in Chapter 3, which I think you have already read.

I am not an expert, but I think the efficiency of these calculations is also determined by the way the cores are connected, the RAM they have available, the time spent by MPI in communication between processes, etc. (beyond the number of cores and the MPI/OpenMP setup). In that sense, efficiency is also specific to the cluster and to the simulation, so if you are looking for the shortest time to complete a given simulation, your empirical approach is a good way to get there.

Kevin McGrattan

unread,
Feb 24, 2020, 5:10:40 PM2/24/20
to fds...@googlegroups.com
I would also suggest that while a job is running, you log in to the nodes running the job and issue the "top" command (I assume this is a Linux cluster). Check that each of your MPI processes is running at close to 100% times the number of OpenMP threads assigned to it. In other words, if OMP_NUM_THREADS=2, your MPI processes should be running close to 200%. In addition, while running "top", press the 1 key and you will see a breakdown of each core and its CPU usage.

o...@aquacoustics.biz

unread,
Feb 25, 2020, 1:43:42 AM2/25/20
to FDS and Smokeview Discussions

You want to avoid core saturation.  In my experience this is always detrimental to computational efficiency.  Each processor wants to have an unallocated core for best performance.


Computational efficiency can be very model dependent.  You want to distribute the load between processes (note this is NOT the same as processors) because in any time step the slowest process will cause others to sit idle.  Assuming your model involves a fire then you’re likely to see the highest computational burden in the meshes containing the fire and plume.  You may want to allocate more processes or adjust the mesh size to improve the performance in this part of the domain.
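The point about the slowest process can be made concrete with a toy calculation (the per-mesh timings below are invented): each time step costs roughly the maximum over the per-process work, so cycles given to already-fast meshes are wasted.

```python
# Hypothetical per-time-step cost (seconds) of each MPI process's meshes.
work = [0.8, 0.9, 2.4, 0.7]  # process 2 holds the fire/plume meshes

step_time = max(work)              # every rank waits for the slowest one
idle = [step_time - w for w in work]
print(step_time)                   # 2.4 s per step
print(round(sum(idle), 1))         # total seconds per step spent idle
# Splitting the fire meshes across two processes, e.g. [0.8, 0.9, 1.2, 1.2, 0.7],
# would drop the step time toward 1.2 s even though total work is unchanged.
```

This is why splitting or shrinking the fire/plume meshes, rather than adding cores elsewhere, is usually where the gains are.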


If you would care to post your model I would be happy to review your resource allocation and process balance.


A useful tool is to run a limited-time or reduced-resolution case and look at the CHID_cpu.csv output.  You should expect to see a low and balanced time in MAIN for each Rank (this is where a process sits idle waiting for other processes to finish).
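That check can be sketched in a few lines. This is a hypothetical helper: the sample data is invented, and the exact column names in CHID_cpu.csv may differ between FDS versions, so treat "Rank" and "MAIN" here as assumptions to verify against your own file.

```python
import csv
import io

# Invented excerpt of a CHID_cpu.csv: per-rank timings in seconds,
# where MAIN is taken to be the idle/wait time referred to above.
SAMPLE = """Rank,MAIN,VELO,PRES
0,12.0,300.0,150.0
1,110.0,210.0,140.0
2,15.0,295.0,155.0
"""

def main_times(csv_text):
    """Return {rank: MAIN seconds} from cpu-csv text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["Rank"]): float(row["MAIN"]) for row in reader}

times = main_times(SAMPLE)
worst = max(times, key=times.get)
print(f"Rank {worst} idles {times[worst]:.0f} s "
      f"vs a minimum of {min(times.values()):.0f} s")
# A large spread like this means one rank is starved of work.
```

In a real run you would read the file from disk instead of the inline string and rebalance the meshes assigned to the idlest ranks.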


As is noted in the FDS User's Guide, MPI is almost always more productive than OpenMP.  For a given number of meshes you would do better to run two similar simulations concurrently rather than simply allocating unused cores to OpenMP.


You might be interested in the performance metrics in http://www.fire.aquacoustics.biz/HPC_Report_05.pdf.  Note that NIST provide a number of standard models for comparative performance measurements:  openmp_test128x.fds et al.

fletchjams

unread,
Feb 25, 2020, 6:48:20 AM2/25/20
to FDS and Smokeview Discussions
Thank you, I will read up on the comparison tests. I have attached the file, which has 15 meshes split among only 13 processes. There are usually sprinklers active in the model and a high heat release rate fire, so I have attempted to split the mesh more finely there. I suppose the key is to look at the cpu file and use it as a feedback loop to improve efficiency, but getting carried away with mesh splitting in this zone concerns me; I am not sure how to determine whether any spurious mesh boundary effects are brought in.

The machine itself has the following specifications

Intel(R) Xeon(R) Gold 6140 CPU @ 2.30 GHz (2 processors)
509 GB usable RAM
18 cores per processor

I am actually attempting to write a script to produce the most efficient mesh setup by editing the &MESH lines of the FDS text file without much user input. I have started by allocating MPI "groups" of meshes which are connected, i.e. where no mesh in the group is disconnected by the presence of another mesh blocking the way. Is it correct that they need to be connected in this way, or does each mesh need to be directly connected to every other for MPI to work properly? Perhaps the attached file of a current model is quite a good example of where this would be important.
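As a starting point for such a script, the &MESH parsing step can be sketched like this. The regex assumes the common `&MESH IJK=..., XB=... /` form and is a hypothetical helper, not a full namelist parser; real input files can order or wrap the fields differently.

```python
import re

# Invented &MESH line in the usual FDS namelist form.
LINE = "&MESH IJK=64,48,32, XB=0.0,6.4,0.0,4.8,0.0,3.2 /"

MESH_RE = re.compile(
    r"&MESH\s+IJK\s*=\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,"
    r"\s*XB\s*=\s*([-\d.,\s]+)/"
)

def parse_mesh(line):
    """Return ((i, j, k), (x0, x1, y0, y1, z0, z1)) for one &MESH line."""
    m = MESH_RE.search(line)
    ijk = tuple(int(g) for g in m.groups()[:3])
    xb = tuple(float(v) for v in m.group(4).split(","))
    return ijk, xb

ijk, xb = parse_mesh(LINE)
print(ijk, xb)  # (64, 48, 32) and the six XB bounds
```

With the IJK and XB values in hand, the script can test whether two meshes share a face (adjacency) and group them accordingly before rewriting the lines.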
test001.fds

Robin

unread,
Feb 26, 2020, 3:50:28 AM2/26/20
to FDS and Smokeview Discussions
With regard to sprinkler simulations, be careful as you use more meshes: you will then increase the number of particles in the simulation, which will dramatically slow it down. Avoid this by adjusting MAXIMUM_PARTICLES if you can. For sprinklers there are also large improvements possible with AGE and PARTICLES_PER_SECOND (sometimes also VERTICAL and HORIZONTAL_VELOCITY, depending on what you are modelling). Obviously, let the particles drain from the bottom of the mesh rather than bounce around on the floor. SAMPLING_FACTOR can also be useful to adjust, to save time writing files, save time in Smokeview and save disk space. Also, check that your number of particles is actually being limited to MAXIMUM_PARTICLES; I had a simulation where this was not the case, I think because the particles were coming in faster than they could be deleted, which caused massive part files and very slow simulations.

It is a very good idea to make a spreadsheet that uses WALL CLOCK TIME to calculate the rate of simulation at each step and compare this against the number of particles, so you can see whether your simulation might dramatically slow down later as more sprinklers activate. This also gives you a much better idea of how long the simulation will take to complete.
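The spreadsheet logic can equally be sketched in a few lines, assuming pairs of (simulation time, wall-clock time) samples such as a WALL CLOCK TIME device would provide; the numbers below are invented for illustration.

```python
# Invented samples: (simulation seconds, wall-clock seconds elapsed).
samples = [(0.0, 0.0), (10.0, 600.0), (20.0, 1500.0), (30.0, 3300.0)]

def rates(points):
    """Simulated seconds per wall-clock hour between successive samples."""
    out = []
    for (s0, w0), (s1, w1) in zip(points, points[1:]):
        out.append((s1 - s0) * 3600.0 / (w1 - w0))
    return out

for r in rates(samples):
    print(f"{r:.0f} sim-s per wall-clock hour")
# A falling rate (60, 40, 20 here) is the signature of the run slowing
# down, e.g. as more sprinkler particles enter the domain.
```

Plotting this rate against particle count makes the slowdown, and a completion-time estimate, easy to read off.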

Occasionally make copies of your CHID_cpu.csv somewhere else, so you can see how the usage changes over the simulation.

I don't think it is relevant for your case (509 GB RAM! I assume all memory channels are populated), but for others it might be: we had some "fast" PCs with relatively large amounts of RAM at 64 GB, but only two large sticks, i.e. only occupying two slots. I replaced the 2 x 32 GB sticks with 8 x 8 GB sticks of the same speed, occupying all eight slots. Even with the same amount of RAM, the speed increase was considerable, around 30%, because more memory bandwidth was available. https://lenovopress.com/lp0501.pdf


Tim O'Brien

unread,
Feb 26, 2020, 4:24:28 AM2/26/20
to fds...@googlegroups.com

Thank you for sending through the model.  I've had the briefest look at it in PyroSim.  My cluster is busy until Saturday; I'll get back to you shortly.  t.

--
You received this message because you are subscribed to a topic in the Google Groups "FDS and Smokeview Discussions" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/fds-smv/LLB_Olgx9cs/unsubscribe.
To unsubscribe from this group and all its topics, send an email to fds-smv+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/fds-smv/bf243f31-8bbd-4f73-ae04-2231c61218e2%40googlegroups.com.

o...@aquacoustics.biz

unread,
Feb 28, 2020, 7:55:34 AM2/28/20
to FDS and Smokeview Discussions
I've managed to have a brief 'play' with your model this evening using FDS 6.7.4 on FireNZE's Linux cluster.  Running the model to just 60 seconds (T_END=60) at the original mesh resolution, I've been able to reduce the run time from 1416 seconds to 1148 seconds (about a 20% saving) through allocation of computational resources and one mesh split, balancing the computational load intuitively based on CHID_cpu.csv.  My current hardware allocation is 15 MPI processes, four of which are assigned 2 OMP threads for balancing (a total of 19 cores, which is somewhat less than your 24 cores).  No computational nodes are saturated on the cluster, and the refinement process is very much hardware dependent.  I'm confident the saving can be pushed to about 25% (say 1,000 seconds) with further adjustments to mesh allocations and without changing the mesh resolutions in your model.

I'd be interested (and so might others) to know the run time of your model on your hardware to 60 seconds.

For more significant performance improvements you might think about where things are changing and adjust your meshes appropriately.  The jet fans and the plume ceiling jet are high, so away from the fire nothing much should be happening low down.  I won't be investigating this further.

As others have noted, the computational balance will shift when particles are injected into the domain; the associated meshes will require additional resources. While FDS is remarkably capable, don't expect to simulate real sprinkler droplet distributions or actual fire control.  There are comparative techniques for doing this, but I won't be going there.  In order to optimize the model for the full 600-second run time, I'll be reducing the mesh resolution by a factor of two in each dimension and reducing the particle generation from sprinklers.  More to follow on this...

Getting back to your original post, you are definitely over-subscribing your cores.  This is almost always detrimental to computational efficiency.  The best improvement with OMP allocation is a factor of two, but there is no performance gain if individual processes are significantly unbalanced. In essence you're allocating computational resources to speed up stuff that is off the critical path.   


Tim O'Brien

unread,
Mar 2, 2020, 6:15:04 AM3/2/20
to fds...@googlegroups.com

Dear fletchjams,

 

I’ve almost finished optimizing your model but I’d like to take this thread offline for a while.

 

There are several issues with your model that are going to cause you grief.  Would you please send me a return email to o...@aquacoustics.biz so I can discuss these issues.

 

With kindest regards,

 

Tim

 

T.G. O'Brien, PG Cert. Eng. (Fire), BE(Hons.), MSFPE, MNFPA, CMEngNZ, Int. PE

Consulting Fire Engineer

FireNZE (a trading division of Aquacoustics Limited)

http://www.fire.aquacoustics.biz

 

+64 (0)4 479 3963

+64 (0)27 641 3111

o...@aquacoustics.biz

 

This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error please notify the sender immediately and delete the email and any attached files.  Disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.

 

 


o...@aquacoustics.biz

unread,
Mar 5, 2020, 4:24:32 PM3/5/20
to FDS and Smokeview Discussions
With appropriate adjustments the model run time can be reduced by ~63%.  With model refinement it can be reduced to less than 15 minutes without loss of fidelity.  See attached.  t. 


Model Refinement.pdf
test3.fds

James Fletcher

unread,
Mar 6, 2020, 5:48:54 AM3/6/20
to fds...@googlegroups.com
Wow, this looks like a lot of work, and I appreciate the effort and advice. I haven't read it in detail yet, but I agree in principle that the most accurate-looking model is not always the most efficient and often wastes a lot of time; it is often there to appease the requirements of people outside technical circles. I will test the model on our machines and give you some feedback. Thanks

James


Alexander Bruns

unread,
Oct 17, 2022, 6:32:21 PM10/17/22
to FDS and Smokeview Discussions
Hi,

I am new to FDS and want to use it for a different purpose.

I am writing a Master's thesis on HPC performance, was searching for free, open-source tools like CalculiX, and found FDS.

As I am not interested in the results from FDS, but in the resource footprint FDS leaves on the hardware I have been given access to for my thesis, I am struggling to find large use cases for scaling tests.

I read the existing documentation (only the parts relevant to me) and did not find examples I could use directly. To be precise, I found individual examples, but I do not understand how to change the input to produce input files for 2, 4, 8, 16, 32 MPI processes...

In this thread I saw HPC_Report_05.pdf and the example job at the end, but I do not understand how to create from it a test input file that scales to 2, 4, 8, 16, 32 cores, nor how to change it from a small 20K-cell size up to the 2-million-cell size.
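For what it's worth, such a family of inputs can be sketched by splitting a single box into N equal slabs along x, one &MESH (and hence one MPI process) per slab, keeping the cell size fixed so total work is constant. All geometry and cell counts below are invented; a real study would start from one of the NIST benchmark inputs.

```python
def make_meshes(n, nx=128, ny=64, nz=64, lx=12.8, ly=6.4, lz=6.4):
    """Split an lx x ly x lz box into n slabs along x, one &MESH each.

    nx must be divisible by n so every mesh keeps the same cell size."""
    assert nx % n == 0, "choose n that divides nx"
    lines = []
    dx = lx / n
    for i in range(n):
        x0, x1 = i * dx, (i + 1) * dx
        lines.append(
            f"&MESH IJK={nx // n},{ny},{nz}, "
            f"XB={x0:.2f},{x1:.2f},0.00,{ly:.2f},0.00,{lz:.2f} /"
        )
    return lines

for n in (2, 4, 8, 16, 32):
    # Each list would be written into its own .fds file, run with
    # mpiexec -n <n>, and the wall-clock times compared.
    print(f"# {n} MPI processes, {(nx_total := 128 * 64 * 64):,} cells total")
```

Scaling the problem size (20K up to 2M cells) is the same idea with larger nx, ny, nz at fixed n.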

The test machines I can use have 32 cores (2 sockets, 16 cores per CPU), 384 GB of memory and a RAID-10 disk array for local scratch.

Can someone help me find good test cases that can load these machines? They need not fill the whole RAM, but I would like to see how 32 MPI processes, or 16 MPI processes with 2 OpenMP threads each, scale.

I hope I do not need to dig too deeply into the documentation, as, again, I am not interested in the FDS results; I want to use FDS to put load on the machines.

Could someone support me on that?

Greetings

Alexander

Randy McDermott

unread,
Oct 17, 2022, 6:35:57 PM10/17/22
to fds...@googlegroups.com

o...@aquacoustics.biz

unread,
Oct 19, 2022, 5:40:12 AM10/19/22
to FDS and Smokeview Discussions
Dear Alexander, do let me know if you want the full sequence of models (variations in grid dimensions and mesh counts) that I used in my HPC performance study (HPC-O05.pdf) and I will provide them.  The model uses only a fraction of FDS's capability, but it is generally representative of life-safety modelling in the built environment with a well-behaved fire scenario.