Question about scaling to multiple meshes


Dave Sheppard

Jun 21, 2018, 10:36:03 AM
to FDS and Smokeview Discussions

I am evaluating the performance of FDS using multiple meshes.  My test case is a single mesh, 2.4 m wide by 5 m long by 2.4 m tall, with a 16 kW fire, an open vent at one end, and a fixed-velocity vent at the other end of the domain.  I am adding meshes with the same size and grid spacing and running the models with as many CPUs as meshes.  I am using an online service, so there are supposedly more CPUs available than I need.

 

I was expecting the run time to stay roughly constant as meshes and CPUs were added, but with more than 4 meshes the run time grows with the number of meshes faster than I expected.  The following chart shows the wall-clock time to run the model divided by the simulated time, plotted against the number of meshes/CPUs assigned to the model.


Is this the scalability that I should expect or am I missing something? I have attached two of the input files.

 

Thanks


hall16.fds
hall02.fds

dr_jfloyd

Jun 21, 2018, 10:49:25 AM
to FDS and Smokeview Discussions
Look at the .out files for the pressure iterations and velocity error, and at the CPU time summary in the _cpu.csv files.

You have basically made a tunnel that gets longer as you add more meshes.  With a short tunnel and a small number of meshes, you probably have few, if any, iterations of the pressure solver to resolve velocity errors at the mesh boundaries.  As you add meshes, you are probably seeing more pressure iterations, which adds to the overall computational time.
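
If you want a quick summary rather than scrolling through the file, a rough sketch like this works (it assumes the time-step diagnostics in the .out file contain a line labelled "Pressure Iterations", and the file name is only an example -- use your own CHID.out):

import re

# Rough sketch: count the pressure iterations reported for each time step.
# Assumes each time-step block prints a line like "Pressure Iterations:  3".
iters = []
with open("hall16.out") as f:
    for line in f:
        m = re.search(r"Pressure Iterations:\s*(\d+)", line)
        if m:
            iters.append(int(m.group(1)))

if iters:
    print(f"time steps logged       : {len(iters)}")
    print(f"mean pressure iterations: {sum(iters)/len(iters):.2f}")
    print(f"max pressure iterations : {max(iters)}")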

Kevin

Jun 21, 2018, 10:54:58 AM
to FDS and Smokeview Discussions
Dave, this appears to be a "weak" scaling test. There's an example in the FDS User's Guide, chapter "Running FDS". The first thing to check is that all your cases run the same number of time steps. If not, the calculations are different and it's hard to interpret your results.

The next thing to consider is how your MPI processes are being assigned to the nodes. On our cluster, each node has 12 cores, and for our weak scaling test we assign the MPI processes so that they fill up the nodes. Running with 1 mesh is fastest because the MPI process has the node to itself: it is not competing with other processes for access to the RAM, and it has no MPI communication to do. The efficiency decreases as we fill up the node with 12 MPI processes. After that, the efficiency levels off somewhat. By the time we get to 432 MPI processes, our efficiency is 0.6, meaning the 432-mesh case runs in 1/0.6 ≈ 1.67 times the single-mesh time. Your performance is worse than mine, if I am interpreting things right, in that your 24-process case runs in about 2.25 times the single-mesh time.

Take a look at the _cpu.csv file and compare the relative costs of the major subroutines. In particular, look at COMMunications; that is the MPI cost.
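
As a rough sketch of how you might pull that number out of each run (assuming the header row of the _cpu.csv file uses the column names you see in the file, such as COMM and Total T_USED, with one row per MPI process; the file names below are only placeholders for your own CHIDs):

import csv

# Rough sketch: report the largest COMM time as a fraction of the largest total time.
def max_col(rows, name):
    key = next(k for k in rows[0] if name in k.replace(" ", "").upper())
    return max(float(r[key]) for r in rows)

def comm_fraction(path):
    with open(path) as f:
        rows = list(csv.DictReader(f))      # assumed: one row per MPI process
    return max_col(rows, "COMM") / max_col(rows, "T_USED")

for n in (2, 4, 8, 16, 24):
    path = f"hall{n:02d}_cpu.csv"           # placeholder file names
    print(f"{n:2d} meshes: COMM is {comm_fraction(path):.0%} of the total time")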

Kevin

Jun 21, 2018, 10:57:53 AM
to FDS and Smokeview Discussions
Following up on Jason's comment: in our weak scaling test there is nothing but an empty box, and I turn off all output because output doesn't scale well. This gives you a baseline for assessing the speed of your compute cluster. FDS just runs through the algorithm, and each MPI process ought to be doing exactly the same amount of work.

Salah Benkorichi

Jun 21, 2018, 10:58:21 AM
to FDS and Smokeview Discussions
Also, with these online services, you need to make sure that hyperthreading is disabled so that you use real cores rather than virtual ones, which would slow down your performance.

S.



Dave Sheppard

Jun 22, 2018, 1:57:37 PM
to FDS and Smokeview Discussions

Thank you for all of the great feedback.  I appreciate the suggestions about what to look at to improve run speed.


I have been looking at the items you all suggested.  Below is what I found.


For the _cpu.csv files, the following table and chart show the maximum value (CPU time in seconds) in each category for each run.



Mesh Count   MAIN   DIVG   MASS   VELO   PRES   WALL   DUMP   PART   RADI   FIRE   COMM   EVAC   HVAC   Total T_USED
         1      0      0      0      0      0      0      0      0      0      0      0      0      0              0
         2      9    578    234    333    211     72    121      0    224     14     46      0      0           1842
         4      8    555    229    316    223     70    126      0    209     13     65      0      0           1777
         8     21    730    327    408    481    102    153      0    267     15    112      0      0           2537
        15     29    829    349    488    395    128     98      0    326     17    557      0      0           3069
        16     32    858    361    505    395    133    193      0    337     17    594      0      0           3306
        24     40    965    410    572    521    146    220      0    372     20    919      0      0           3765

I tried removing the slice file from the 8 and 16 mesh models.  There was a measurable but negligible difference in the calculation speed.

 

With respect to the number of MPI processes and hyperthreading: the service that I am using for these runs has the job setup streamlined.  As far as I can tell, the command they are using to start the 16-mesh job is:


mpiexec -np 16 fds_mpi hall16.fds
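
If it would help, I can check how those 16 ranks are actually placed on the nodes by launching a trivial MPI script the same way FDS is launched (this assumes mpi4py is available on the service; the script name is made up):

# rank_map.py -- run the same way FDS is run, e.g.:  mpiexec -np 16 python rank_map.py
# Prints which host each MPI rank lands on, so you can see how the ranks fill up the nodes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
pairs = comm.gather((comm.Get_rank(), MPI.Get_processor_name()), root=0)
if comm.Get_rank() == 0:
    for rank, host in sorted(pairs):
        print(f"rank {rank:3d} -> {host}")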

 

Kevin

Jun 22, 2018, 2:28:50 PM
to FDS and Smokeview Discussions
Dave, attached are the results of the latest weak scaling test that is included in the FDS User's Guide. The only routine that increases substantially in CPU time (once we are beyond 8 processes) is COMMunications, the work done by MPI in exchanging info between the meshes. The initial increase in CPU time that affects all subroutines (1, 2, 4, 8 process cases) is due to the fact that a single process runs faster than 12 processes on a single node.

You can see that the total CPU time is only weakly affected by the communication time.

I still cannot tell if all of your calculations are doing the exact same thing. I also do not understand what is meant by 0 time used by the 1 mesh case.
Weak_Scaling_Plot.PDF

Dave Sheppard

Jun 22, 2018, 3:36:25 PM
to FDS and Smokeview Discussions

Kevin, Thanks for replying so quickly. 


The results that we are discussing are from a ‘weak’ scaling case.  I also performed a ‘strong’ scaling test, but I am trying to understand these results first. 


I chose this scenario because it best represents the modeling that we want to conduct.  For example, there is a fire located in one mesh and we add meshes so that we can calculate the propagation of the fire products to other areas in the building.  In this scenario all meshes, except for the first, are empty. 


The efficiency that I calculate starts to drop from near 100% when the model has more than 4 meshes.  Efficiency drops to about 53% by about 16 meshes and continues to decrease. A chart is attached.


The 0 values in my previous post for the one mesh case were an artifact from the program that I wrote to find the maximum values in the cpu files.  The table below has the correct values.


Mesh   MAIN   DIVG   MASS   VELO   PRES   WALL   DUMP   PART   RADI   FIRE   COMM   EVAC   HVAC   Total T_USED
   1      7    589    236    331    164     71    125      0    230     15    0.1      0      0           1768

To help understand what I did, I attached the input files that I used. They are identical except for the mesh lines and location of the velocity vent.

hall.zip

Kevin

Jun 22, 2018, 4:07:22 PM
to FDS and Smokeview Discussions
Do the following: simulate an empty box that does nothing for 10 s. Now add another box, then 4, 8, 16, 32. Each simulation will have the same number of time steps, and the work done by each process will be the same. This will check your cluster. It is possible that the nodes on the cluster you are using have 12, 16, 24, or 32 cores, or that each core is divided into two hardware threads. If the nodes have 24 "processing units", then you will see a degradation of performance as you load more MPI processes onto a single node. A core is not a wholly independent processor where we can expect 2 cores to do a job twice as fast as one. There is overhead associated with all these cores accessing RAM and other services of the node.

By doing this, you are assessing the best possible performance of this cluster. Don't complicate it by giving it jobs that have different amounts of work per MPI process.
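
If it saves you some typing, a short script along these lines would write the cases (just a sketch: I am assuming your 2.4 m x 5 m x 2.4 m mesh with 0.1 m cells, stacking identical meshes along y, and the CHID/file names are made up):

# make_empty_cases.py -- writes empty-box weak scaling inputs for 1, 2, 4, ... meshes.
# Mesh extent and 0.1 m grid spacing are assumed from the hall case; adjust to match yours.
LX, LY, LZ = 2.4, 5.0, 2.4        # extent of one mesh in metres (assumed)
I, J, K = 24, 50, 24              # cells per mesh, i.e. 0.1 m spacing (assumed)

for n in (1, 2, 4, 8, 16, 32):
    chid = f"empty_{n:02d}"       # made-up CHID/file names
    lines = [f"&HEAD CHID='{chid}', TITLE='Empty box weak scaling, {n} meshes' /"]
    for m in range(n):            # stack the meshes along y
        y0, y1 = m * LY, (m + 1) * LY
        lines.append(f"&MESH IJK={I},{J},{K}, XB=0.0,{LX},{y0:.1f},{y1:.1f},0.0,{LZ} /")
    lines += ["&TIME T_END=10. /", "&TAIL /"]
    with open(chid + ".fds", "w") as f:
        f.write("\n".join(lines) + "\n")
    print("wrote", chid + ".fds")

Since every added mesh is an identical copy, each MPI process does the same amount of work.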

Dave Sheppard

Jun 22, 2018, 7:34:36 PM
to FDS and Smokeview Discussions




Kevin, 


I set up the models as you suggested.  The meshes are lined up and touch. Are they supposed to touch?  I am seeing a similar decrease in efficiency. 


 

Mesh Count   Time Steps   Simulation Time (s)   Clock Time (s)   Clock/Simulation Time   Efficiency
         1          196                  9.95              120                    12.1          1
         2          196                  9.95              120                    12.1          1
         4          196                  9.95              120                    12.1          1
         8          196                  9.95              180                    18.1          0.7
        16          196                  9.95              240                    24.1          0.5
        32          196                  9.95              240                    24.1          0.5

Mesh Count    MAIN   DIVG   MASS   VELO   PRES   WALL   DUMP   PART   RADI   FIRE      COMM   EVAC   HVAC   Total T_USED
         1  0.7191  33.24  13.51  40.15  13.43  7.418  1.789      0  30.47      0  0.007896      0      0          140.7
         2  0.7006  32.02  13.17  38.22  13.02   6.83  1.681      0  27.74      0    0.4949      0      0          134
         4   1.326  35.06  14.43  42.11  14.31   8.64  2.052      0  31.64      0     3.518      0      0          151
         8    2.72  42.66  18.77  49.96  22.47  10.36  2.713      0  38.16      0     5.187      0      0          189
        16   4.682  45.54   19.4  59.77  20.14  13.41  4.072      0  43.14      0     59.82      0      0          251
        32   5.239  45.92  19.61  60.37  19.87  13.31   4.54      0  43.14      0     73.21      0      0          265

empty.zip

Kevin

Jun 23, 2018, 3:11:50 PM
to FDS and Smokeview Discussions
Yes, the meshes should touch so that we can assess the cost of the MPI exchanges.

I need to know the cluster architecture. If it's Linux, type

lscpu

This should tell you how many sockets, cores, and threads you have on each node. If you have 16 cores per node, then your timings look as I expect. It says that when you run 16 MPI processes on the same node, they run roughly half as fast as when you run a single process. As you add more and more MPI processes on other nodes, the efficiency does not dramatically decrease. You cannot expect to load up an entire node and have all those jobs run as fast as one job on the same node. It's not as if you have 16 independent processors and banks of RAM. 
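
If it's easier to collect, the relevant fields can be pulled straight out of lscpu with a short script (a sketch; it only reads the standard "Socket(s)", "Core(s) per socket", and "Thread(s) per core" fields):

import subprocess

# Rough sketch: read the standard lscpu fields to get the layout of one node.
out = subprocess.run(["lscpu"], capture_output=True, text=True).stdout
info = {}
for line in out.splitlines():
    key, _, val = line.partition(":")
    info[key.strip()] = val.strip()

sockets = int(info["Socket(s)"])
cores   = int(info["Core(s) per socket"])
threads = int(info["Thread(s) per core"])
print(f"{sockets} socket(s) x {cores} cores = {sockets * cores} physical cores per node")
print(f"{threads} thread(s) per core -> hyperthreading is {'on' if threads > 1 else 'off'}")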

Kevin

Jun 25, 2018, 9:59:38 AM
to FDS and Smokeview Discussions

The weak scaling tests are called "weak_scaling_test_XXX.fds". The ones labeled "scarc" and "glmat" test other pressure solvers. You want "weak_scaling_test".

Dave Sheppard

Jun 26, 2018, 1:36:51 PM
to FDS and Smokeview Discussions

I ran the NIST weak scaling input files for 1, 2, 4, 16, 32, 64, 128 meshes.  Unlike the NIST results for the same models, the efficiency doesn’t plateau as the number of meshes is increased. 


I will follow up with the engineers that run the service to see if they can determine why the scaling test results are different than the NIST results.



 

 


 


Kevin

Jun 26, 2018, 2:00:44 PM
to FDS and Smokeview Discussions
It certainly appears to me that the COMMunication times are much slower than on the NIST cluster. Do they use Infiniband, which is a very fast network connecting the nodes?

Dave Sheppard

Jul 6, 2018, 5:50:35 AM
to FDS and Smokeview Discussions

I wanted to close out this thread with the final outcome.  I contacted the company that runs the service and explained that FDS performed better on scaling tests on other clusters.  They agreed to run the NIST weak scaling test and to use the results to modify the way FDS runs on their cluster.  They also reached out to the folks at NIST for assistance.

 

The result is that FDS now scales to multiple MPI processes in a similar way as the scaling results on the NIST cluster.



Kevin

Jul 6, 2018, 8:52:42 AM
to FDS and Smokeview Discussions
Let me add a final thought. My interpretation of the plot is as follows: the cases were run on a compute cluster consisting of nodes with 16 cores. That is, one can run up to 16 MPI processes per node. For cases with 1, 4, and 16 MPI processes, the efficiency decreases because more and more processes are using the memory and hardware of a single node. From 16 to 32 processes, there is a further drop in efficiency. Why? My guess is that now there is MPI message passing across the network connecting the two nodes. The network is very fast, but not as fast as when the message passing was done internally on a single node which has one bank of RAM (Random Access Memory), which was the case for 1, 4, and 16 processes. As more MPI processes are added (64 and 128), there is no further decrease in efficiency because each node is doing comparable work and the network is able to easily handle the increased traffic. 