I am evaluating the performance of FDS using multiple meshes. My test case is a 2.4 m wide by 5 m long by 2.4 m tall mesh with a 16 kW fire, an open vent at one end, and a fixed-velocity vent at the other end of the domain. I am adding meshes of the same size and grid spacing and running the models with the same number of CPUs as meshes. I am using an online service, so there should be more CPUs available than I need.
I was expecting the run time to stay roughly constant as meshes and CPUs were added, but with more than 4 meshes the run time grows with the number of meshes more steeply than I expected. The following chart shows the wall-clock time to run the model divided by the simulated time, as a function of the number of meshes/CPUs assigned to the model.
Is this the scalability that I should expect or am I missing something? I have attached two of the input files.
Thanks
Thank you for all of the great feedback. I appreciate the suggestions about what to look at to improve run speed.
I have been looking into the items you all suggested. Below is what I found.
For the CPU timing files, the following table and chart show the maximum value in each category for each run.
| Mesh Count | MAIN | DIVG | MASS | VELO | PRES | WALL | DUMP | PART | RADI | FIRE | COMM | EVAC | HVAC | Total T_USED (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 9 | 578 | 234 | 333 | 211 | 72 | 121 | 0 | 224 | 14 | 46 | 0 | 0 | 1842 |
| 4 | 8 | 555 | 229 | 316 | 223 | 70 | 126 | 0 | 209 | 13 | 65 | 0 | 0 | 1777 |
| 8 | 21 | 730 | 327 | 408 | 481 | 102 | 153 | 0 | 267 | 15 | 112 | 0 | 0 | 2537 |
| 15 | 29 | 829 | 349 | 488 | 395 | 128 | 98 | 0 | 326 | 17 | 557 | 0 | 0 | 3069 |
| 16 | 32 | 858 | 361 | 505 | 395 | 133 | 193 | 0 | 337 | 17 | 594 | 0 | 0 | 3306 |
| 24 | 40 | 965 | 410 | 572 | 521 | 146 | 220 | 0 | 372 | 20 | 919 | 0 | 0 | 3765 |
I tried removing the slice file output from the 8- and 16-mesh models. There was a measurable but negligible difference in calculation speed.
With respect to the number of MPI processes and hyperthreading: the service that I am using for these runs has the job setup streamlined. As far as I can tell, the command they use to start the 16-mesh job is:
mpiexec -np 16 fds_mpi hall16.fds
Kevin, thanks for replying so quickly.
The results that we are discussing are from a ‘weak’ scaling case. I also performed a ‘strong’ scaling test, but I am trying to understand these results first.
I chose this scenario because it best represents the modeling that we want to conduct. For example, there is a fire located in one mesh, and we add meshes so that we can calculate the spread of the fire products to other areas of the building. In this scenario, all meshes except the first are empty.
The efficiency that I calculate starts to drop from near 100% once the model has more than 4 meshes. Efficiency falls to about 53% at 16 meshes and continues to decrease. A chart is attached.
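For clarity, here is a minimal sketch of the efficiency calculation, assuming it is simply the single-mesh wall-clock time divided by the N-mesh wall-clock time (the function name is illustrative, not from any of the input files):

```python
# Weak-scaling efficiency: every run simulates the same duration, so if adding
# meshes/CPUs carried no extra cost, the wall-clock time would stay constant
# and this ratio would stay near 1.0.
def weak_scaling_efficiency(wall_clock_1_mesh: float, wall_clock_n_mesh: float) -> float:
    return wall_clock_1_mesh / wall_clock_n_mesh

# Example with the totals in these tables: roughly 1768 s for 1 mesh
# (corrected value below) versus 3306 s for 16 meshes.
print(weak_scaling_efficiency(1768.0, 3306.0))  # ~0.53, i.e. about 53%
```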
The 0 values in my previous post for the one-mesh case were an artifact of the program that I wrote to find the maximum values in the CPU files. The table below has the correct values; a sketch of the extraction approach follows the table.
| Mesh | MAIN | DIVG | MASS | VELO | PRES | WALL | DUMP | PART | RADI | FIRE | COMM | EVAC | HVAC | Total T_USED (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7 | 589 | 236 | 331 | 164 | 71 | 125 | 0 | 230 | 15 | 0.1 | 0 | 0 | 1768 |
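For reference, a minimal sketch of how the per-category maxima can be pulled from the *_cpu.csv files. It assumes each file has one header row followed by one row per MPI rank, with the rank id in the first column; the file name in the usage line is hypothetical.

```python
import csv

def max_per_category(cpu_csv_path):
    """Return {category: maximum over MPI ranks} for one FDS *_cpu.csv file."""
    with open(cpu_csv_path, newline="") as f:
        rows = [r for r in csv.reader(f) if any(cell.strip() for cell in r)]
    header = [name.strip() for name in rows[0]]
    data = [[float(cell) for cell in row] for row in rows[1:]]
    # Skip the first column (the rank id) and take the maximum of each remaining
    # column over all ranks; with a single mesh there is only one data row.
    return {name: max(row[i] for row in data) for i, name in enumerate(header) if i > 0}

# Hypothetical usage for the 16-mesh run:
# print(max_per_category("hall16_cpu.csv"))
```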
To help show what I did, I have attached the input files that I used. They are identical except for the MESH lines and the location of the velocity vent.
Kevin,
I set up the models as you suggested. The meshes are lined up and touch. Are they supposed to touch? I am seeing a similar decrease in efficiency.
| Mesh Count | Time Steps | Simulation Time (s) | Clock Time (s) | Clock/Simulation Time | Efficiency |
|---|---|---|---|---|---|
| 1 | 196 | 9.95 | 120 | 12.1 | 1 |
| 2 | 196 | 9.95 | 120 | 12.1 | 1 |
| 4 | 196 | 9.95 | 120 | 12.1 | 1 |
| 8 | 196 | 9.95 | 180 | 18.1 | 0.7 |
| 16 | 196 | 9.95 | 240 | 24.1 | 0.5 |
| 32 | 196 | 9.95 | 240 | 24.1 | 0.5 |
| Mesh Count | MAIN | DIVG | MASS | VELO | PRES | WALL | DUMP | PART | RADI | FIRE | COMM | EVAC | HVAC | Total T_USED (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.7191 | 33.24 | 13.51 | 40.15 | 13.43 | 7.418 | 1.789 | 0 | 30.47 | 0 | 0.007896 | 0 | 0 | 140.7 |
| 2 | 0.7006 | 32.02 | 13.17 | 38.22 | 13.02 | 6.83 | 1.681 | 0 | 27.74 | 0 | 0.4949 | 0 | 0 | 134 |
| 4 | 1.326 | 35.06 | 14.43 | 42.11 | 14.31 | 8.64 | 2.052 | 0 | 31.64 | 0 | 3.518 | 0 | 0 | 151 |
| 8 | 2.72 | 42.66 | 18.77 | 49.96 | 22.47 | 10.36 | 2.713 | 0 | 38.16 | 0 | 5.187 | 0 | 0 | 189 |
| 16 | 4.682 | 45.54 | 19.4 | 59.77 | 20.14 | 13.41 | 4.072 | 0 | 43.14 | 0 | 59.82 | 0 | 0 | 251 |
| 32 | 5.239 | 45.92 | 19.61 | 60.37 | 19.87 | 13.31 | 4.54 | 0 | 43.14 | 0 | 73.21 | 0 | 0 | 265 |
I ran the NIST weak scaling input files for 1, 2, 4, 16, 32, 64, and 128 meshes. Unlike the NIST results for the same models, the efficiency does not plateau as the number of meshes increases.
I will follow up with the engineers who run the service to see if they can determine why these scaling results differ from the NIST results.
I wanted to close out this thread with the final outcome. I contacted the company that runs the service and explained that FDS performs better on scaling tests on other clusters. They agreed to run the NIST weak scaling test and use the results to modify the way FDS runs on their cluster. They also reached out to the folks at NIST for assistance.
The result is that FDS now scales across multiple MPI processes much as it does in the scaling results on the NIST cluster.