FDS5 multi-mesh benchmark runs


Cian Davis

Mar 13, 2014, 9:26:25 AM
to fds...@googlegroups.com

Hi All,
I'm wondering if anyone has a standard FDS5 file they use to benchmark
machines? The reason I ask for multi-mesh files is that I want to assess some MPI runs.

Ideally, they'd run for a few hours though obviously this depends on the
hardware and the number of cores.

I've had a look in the archive and nothing really suits. The benchmark
files distributed seem to be single mesh.

We have a number of new machines (one based on an Intel Core i7 4770K,
the other on an Intel Core i7 4930K) and I want to see their relative
performance, and specifically whether I should leave HyperThreading on
and not use the additional cores, or disable it in the BIOS.

If no one has anything suitable, I'll set up my own, but I thought it would
be interesting to have a set that we could all compare against.

Regards,
Cian

dr_jfloyd

Mar 13, 2014, 9:55:26 AM
to fds...@googlegroups.com
We would recommend for new work that you use FDS6.  The Validation suite for FDS6 does have some good multi-mesh cases in it such as the Sandia plume cases at the higher resolutions.

Cian Davis

Mar 15, 2014, 10:13:36 AM
to fds...@googlegroups.com

Thanks Dr. Floyd. We are transitioning to FDS6 but we use FDS5 for existing projects.

I've run a number of simulations based on a 4-mesh problem, each mesh of 125,000 cells, with a 2 MW steady fire and 2 vents. I started by running it on a single core and then on two or four cores to assess speedup.

The results (simulation and wall times in seconds):

Job  Meshes  Processor            Cores  Sim time  Wall time  Speedup
1    4       Intel Core i7 4770K  1      300       15130      1.00
2    4       Intel Core i7 4770K  2      300       8639       1.75
3    4       Intel Core i7 2600K  4      300       8518       1.78
4    4       Intel Core i7 4770K  4      300       6194       2.44
5    4       Intel Core i7 4930K  4      300       6623       2.28

This set of results suggests that the job isn't big enough to take full advantage of multiple processors and that the MPI overhead is reasonably high.

I'm running further jobs to assess the effect of HyperThreading. I have run the same job over four processes, but with two instances running simultaneously (so using all 8 threads at once). Wall time was 10817 s for each of the runs. I'll reboot the machine on Monday to see if I can disable HyperThreading and rerun.
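The speedup column is just serial wall time divided by parallel wall time, and the same arithmetic gives a throughput comparison for the two simultaneous hyper-threaded runs. A quick sketch using the times from the table:

```python
# Speedup = serial wall time / parallel wall time, using the table above.
serial = 15130.0   # job 1: 4 meshes on 1 core of the i7 4770K
quad = 6194.0      # job 4: 4 meshes on 4 cores of the i7 4770K
print(round(serial / quad, 2))              # 4-core speedup

# Two 4-process instances at once (all 8 HT threads) took 10817 s each;
# compare total throughput against running them back to back on 4 cores.
back_to_back = 2 * quad                     # two sequential jobs
concurrent = 10817.0                        # both jobs finish together
print(round(back_to_back / concurrent, 2))  # >1 means HT gained throughput
```

By this measure the hyper-threaded pair delivers roughly 15% more throughput than running the two jobs one after the other, even though each individual job is much slower.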

If anyone is interested in results or wants the test FDS file I used, let me know.

Regards,
Cian



--
You received this message because you are subscribed to the Google Groups "FDS and Smokeview Discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fds-smv+u...@googlegroups.com.
To post to this group, send email to fds...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/fds-smv/417aff2f-40fd-4c42-ab9b-850f4ec9cbb4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Garth Hay

Mar 26, 2014, 6:54:28 PM
to fds...@googlegroups.com
Hi Cian,

I'd be interested in the results you're getting, as we are also trying to carry out benchmarking. Similar to our experience, I think you have correctly identified that the model isn't demanding enough.

There is a range of results kindly shared by Chris Salter, available here: https://github.com/drezha/FDS_Resources/tree/master/FDS%20Benchmarking%20Files

We've also been looking at the Sandia plume at higher resolutions for benchmarking; currently we're just thinking about its suitability for higher numbers of meshes.

Nikolai Ortiz

Mar 28, 2014, 5:53:20 PM
to fds...@googlegroups.com
Hi,

I'm interested in reproducing some of these benchmarks, but the web page says:

"The following files allow this to be done. Distributed with FDS 5 (and for some reason stopped in FDS 6), the following files allow you to run a comparison between systems and compare how they perform."

Are those test files still broken?

Regards


Nikolai

Barbro Maria Storm

Mar 29, 2014, 2:08:28 PM
to fds...@googlegroups.com
Hi!
I'm finding this thread - and others like it - very interesting. I remember some of the same discussions from the transition between FDS4 and FDS5. I was looking for comparisons and test cases a few months ago and was surprised not to find much at all. I'm working on some myself now, and I wonder what the status is in the community as a whole?

Apparently more people than me are doing CPU comparisons with FDS6, which is very interesting in addition to the FDS5-FDS6 comparison. Cian Davis: did you buy a lot of machines with different specifications specifically to test, or was it a random occurrence (curious)?

I've been testing a few different configurations and comparing them to this: https://github.com/drezha/FDS_Resources . I do not know who Drezha is, but I would be very interested in setting something like this up as a "submit your results and configuration" page. If I make a typical file, is anyone interested in contributing (anonymized, with computational configurations)?

Best regards from Norway,
B. Storm





--
Barbro Storm

Nikolai Ortiz

Apr 2, 2014, 9:40:41 AM
to fds...@googlegroups.com
Hi,

I wonder if someone could help me a little with a few questions.

I ran the scale1.fds benchmark test on my own two PCs
(more office than scientific machines).

The case scale1.fds has 8 meshes by default.

The PC specs I have are:

W7, Intel Xeon E5645, 12 GB RAM, 2.4 GHz - 6 cores, 12 threads
W7, Intel Core i7-2600, 8 GB RAM, 3.4 GHz - 4 cores, 8 threads

The question is simple:
in the list published by Chris there are multiple core configurations,
so should scale1.fds be modified so that the number of meshes matches the number of cores, or the number of threads?

I hope Chris can read this topic.

I've pasted Chris's input file below if anyone wants to take a look.

&HEAD CHID='scale1', TITLE='General purpose input file to test FDS timings, SVN $Revision: 5015 $' /

&MULT ID='mesh multiplier', DX=1.0, DY=1.0, DZ=1.0, I_LOWER=-1, I_UPPER=0, J_LOWER=-1, J_UPPER=0, K_LOWER=0, K_UPPER=1 /

&MESH IJK=64,64,64, XB=0.0,1.0,0.0,1.0,0.0,1.0, MULT_ID='mesh multiplier' /

&TIME T_END=10.0 /

&SURF ID='HOT', VEL=-0.1, TMP_FRONT=100., COLOR='RED' /

&VENT MB='XMIN', SURF_ID='OPEN' /
&VENT MB='XMAX', SURF_ID='OPEN' /
&VENT MB='YMIN', SURF_ID='OPEN' /
&VENT MB='YMAX', SURF_ID='OPEN' /
&VENT PBZ=0.0,   SURF_ID='HOT' /
&VENT MB='ZMAX', SURF_ID='OPEN' /


&TAIL /

Chris Salter

Apr 2, 2014, 11:48:58 AM
to fds...@googlegroups.com
All,

The FDS files linked to on the GitHub page are files that were distributed with FDS 5. You should be able to find them in the samples folder when FDS is installed. I realise that they were designed for FDS 5 and not 6, but they import fine into FDS 6 via Pyrosim.

@Nikolai, are you sure it's the FDS files that are broken and not the MPI? I'm still having issues running FDS6 with MPI within Windows and am reduced to running FDS6 models through Pyrosim 2014, which has solved that issue for us.
The original files came with a script that would create the same number of meshes as the computer had cores. However, to get my results for FDS5 multicore, I simplified the process, as I relied on friends and volunteers to run the models - this was the easiest way to get a range of results. To make life simple, I left the FDS file as it was, which creates 8 meshes in an FDS 5 environment (and in FDS6 via Pyrosim).

@Barbro - There was talk of this on the FDS-SMV group on LinkedIn. Whilst I would love to get involved in something like this, my web programming skills are next to null, so unless it's a Google Spreadsheet, I'm reduced to doing things manually. There are also a number of variables I'm not taking into account, such as disk write speed (which could potentially affect memory-starved models), RAM speeds, and how much the user is using the PC in the background (in all the tests I had others run, I cannot guarantee that they weren't using the machine for something else computationally intensive and thereby skewing the results).

@Cian - I know we've had mixed results with HyperThreading - sometimes it seems to speed things up, other times it doesn't. As it spreads work across the areas of the processor not otherwise in use, I would have thought (with my limited knowledge of FDS) that FDS should be exercising the same areas of the CPU on each core, so no noticeable speed-up would be seen - except perhaps if the machine was in use for other things while modelling. If you're in London at some point in the future, it would perhaps be good to meet and try to set something out.

Whilst I admit my results are basic, I hadn't seen anything of the sort in the community and endeavoured to at least make something available. If there's interest in making something larger and more far-reaching, I would be happy to get involved. Creating a suite of tests would perhaps be good - something FDS-related would be best, compared to the helpful, but not 100% relevant, examples such as OpenBenchmarking.org or similar.

Nikolai Ortiz

Apr 2, 2014, 12:36:00 PM
to fds...@googlegroups.com
Hi Chris,
Thanks for your answer - this benchmark topic is important for many users (at least for me).

I ran scale1.fds in FDS 6 without problems.
I don't have a Pyrosim licence, so I must do it all in FDS directly.
My question is this: if my PC has 4 cores but 8 threads and I run a simulation with 8 meshes, MPI will assign 2 meshes to every core; and if my PC has 6 cores, MPI will assign 2 meshes each to 2 of the cores and one each to the other 4 - or at least that is the way I think it works.

When I look at the table you posted, I see, for example:
Intel Core i5-3470S 1381.113 2.9 4 8 Windows 7 64

That PC has almost the same resources as mine:

W7, Intel Core i7-2600, 8 GB RAM, 3.4 GHz - 4 cores, 8 threads

so I expected my times to be close.

When I run scale1.fds set up for 8 meshes (one mesh per thread), the time I get is 4616.496 s,
but when I run scale1.fds modified for only 4 meshes (one mesh per core), I get 1714.601 s.
This last value is much closer to the one in your list.
In the case of the Intel Core i7, which has HT technology, this can be tricky; for Xeon or AMD it could be different.

So, to be sure how scale1.fds is being run, maybe it would be good to add some advice in the input file - things like fixing the number of meshes or the number of processes assigned by MPI, and maybe a warning not to modify the number of meshes, etc.  ;)

What do you think? Would it be possible to set those conditions and run the tests again to be sure?

I hope to share my results so you can put them in your list too.

Chris Salter

Apr 2, 2014, 2:20:30 PM
to fds...@googlegroups.com
Nikolai,

I realise that the benchmarking isn't ideal. As I mention above, the aim was to make it easy to run across multiple systems with minimal user interaction, as I recruited the help of friends to run the models. Whilst these guys might be pretty good with computers, they had never used Pyrosim or FDS before, so I made it easy for them: I sent a batch script and the FDS program to run the tests with a single click, and for the multicore runs they installed the trial of Pyrosim and ran it through that. As there was no way of knowing how many cores a machine had (until they sent me back the results and told me), I couldn't tailor it individually.

You're right in your thinking of how FDS will allocate the meshes.

I realise that in an ideal world I would have benchmarked a 4-core machine with 4 meshes (but the same total number of cells throughout); however, as I said, I tried to keep it simple. The benefit of that is that all CPUs have run the same test and therefore should, in theory, be comparable.

i7s and Xeons are essentially the same CPUs. Xeons tend to have lower clock speeds, as they're designed for workstation and server use where they're likely to be on all the time. They also support ECC RAM, which does error checking: if a memory error appears, you're more likely to save your model with ECC RAM than without it. Maybe not a concern for small models, but for models that take weeks to run there's perhaps an issue. Like the i7s, they also have HyperThreading, and both have Turbo Boost, which makes benchmarking a pain.

I fully admit it's not the best solution, but up until now it hasn't been discussed much. However, with the release of FDS 6, which is showing a slowdown across the board, and with no scope to reduce resolution (now that we've been using 0.1 m x 0.1 m x 0.1 m cells for buildings, Building Control bodies aren't really accepting anything coarser), getting the most from processors and purchases has once again come to the fore.

Nikolai Ortiz

Apr 2, 2014, 2:41:20 PM
to fds...@googlegroups.com
Hi Chris,
Sorry if I sounded rude in my last post - maybe my English is not good enough.

I think the benchmark you made is easy to use and a good starting point.
Maybe later I will try to install Pyrosim and see how it configures MPI.

At the moment I don't have problems running simulations with W7 and MPI - or at least no problems that show up as warnings or similar.

It would be nice to see the .out files of the simulations run by your friends, to see how the machines were set up.
As you said, it is important to run the same case with the same MPI parameters in order to compare and correlate the results.
If you can, please share the batch file you sent to your friends; I will run it on every PC I can get.
If I can't install Pyrosim, at least I can work out a way to run it in FDS directly.

Regards,

Nikolai

Garth Hay

Apr 2, 2014, 5:44:53 PM
to fds...@googlegroups.com
One key factor in the benchmarking is definitely keeping the model simple and the variables the same for each PC.

It's interesting to see you've had mixed results with hyperthreading too, Chris. We found the optimum number of meshes (in some very limited testing) to fall around 14 meshes on a 12-core machine hyper-threaded to 24. This puzzled us, as it was more threads than there were physical cores - so there was a benefit to HT - but significantly fewer than the total threads available. We also didn't manage to achieve 90-100% utilisation of the system when scaling the model complexity as we did on other machines (although those had significantly fewer cores).

Nikolai: something to remember is that simulation performance is not solely linked to your PC specs: different users' systems will have more or fewer tasks running in the background, hardware of different ages, and different IT environments.

To exemplify this point: I work at a consultancy with a dedicated IT department who supply and maintain our PCs. As part of this, the PCs have a specific "corporate profile" which gives certain tasks increased priority and is constantly connected to the servers for software and profile updates. Simply by running disconnected from the network, and without the corporate profile installed on a new PC, we achieved somewhere around a 10% performance increase.

Other actions, like assigning higher priority to the processes running FDS and turning off visual themes in Windows 7, have also improved performance.

This makes benchmarking across different users/machines difficult for very fine-grained conclusions. We've found the most information has come from running different benchmark cases on the same PC with a number of different setups. In an ideal world, we could do a similar process on a number of different machines and make comparisons based on the relative performance changes.

I'm very keen to collaborate with you gentlemen on further benchmarking.

Nikolai Ortiz

Apr 2, 2014, 6:27:31 PM
to fds...@googlegroups.com
Hi Garth,
Wise words. I'm aware of the configurations and software running in the background.
To take this step by step, we can begin by defining the case and the running parameters.
I think scale1.fds is a good test case to start with because it doesn't take too long to run, although I'm not sure it makes full use of the processor.
With the Core i7 architecture, in my case I have 4 cores and 8 logical processors.
With the Xeon architecture, I see in my PC that it counts 6 cores and 6 logical processors.

My MPI command to run on the Core i7 would be one of these two:
mpiexec -localonly 8 fds_mpi scale1.fds 
mpiexec -localonly 4 fds_mpi scale1.fds

and for the Xeon:
mpiexec -localonly 6 fds_mpi scale1.fds

using 8 or 4 meshes in the first case, and 6 meshes in the second,

modifying the case in this way:

Xeon, 6 cores (6 meshes):

&HEAD CHID='scale1', TITLE='General purpose input file to test FDS timings, SVN $Revision: 5015 $' /

&MULT ID='mesh multiplier', DX=1.0, DY=1.0, DZ=1.0, I_LOWER=-1, I_UPPER=0, J_LOWER=-1, J_UPPER=1 /

&MESH IJK=64,64,64, XB=0.0,1.0,0.0,1.0,0.0,1.0, MULT_ID='mesh multiplier' /

&TIME T_END=10.0 /

&SURF ID='HOT', VEL=-0.1, TMP_FRONT=100., COLOR='RED' /

&VENT MB='XMIN', SURF_ID='OPEN' /
&VENT MB='XMAX', SURF_ID='OPEN' /
&VENT MB='YMIN', SURF_ID='OPEN' /
&VENT MB='YMAX', SURF_ID='OPEN' /
&VENT PBZ=0.0,   SURF_ID='HOT' /
&VENT MB='ZMAX', SURF_ID='OPEN' /


&TAIL /


Intel Core i7, 8 logical processors (8 meshes):

&HEAD CHID='scale1', TITLE='General purpose input file to test FDS timings, SVN $Revision: 5015 $' /

&MULT ID='mesh multiplier', DX=1.0, DY=1.0, I_LOWER=-1, I_UPPER=0, J_LOWER=-1, J_UPPER=2 /

&MESH IJK=64,64,64, XB=0.0,1.0,0.0,1.0,0.0,1.0, MULT_ID='mesh multiplier' /

&TIME T_END=10.0 /

&SURF ID='HOT', VEL=-0.1, TMP_FRONT=100., COLOR='RED' /

&VENT MB='XMIN', SURF_ID='OPEN' /
&VENT MB='XMAX', SURF_ID='OPEN' /
&VENT MB='YMIN', SURF_ID='OPEN' /
&VENT MB='YMAX', SURF_ID='OPEN' /
&VENT PBZ=0.0,   SURF_ID='HOT' /
&VENT MB='ZMAX', SURF_ID='OPEN' /


&TAIL /
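The number of meshes each &MULT variant generates can be sanity-checked by hand - it is just the product of the three index ranges, with an omitted axis counting as a single layer. A small sketch (plain arithmetic, not an FDS parser):

```python
def mult_mesh_count(i_lower, i_upper, j_lower, j_upper, k_lower=0, k_upper=0):
    """Meshes generated by a &MULT block: the product of the
    (upper - lower + 1) index ranges. K defaults to a single layer
    when the namelist omits the K bounds."""
    return ((i_upper - i_lower + 1)
            * (j_upper - j_lower + 1)
            * (k_upper - k_lower + 1))

print(mult_mesh_count(-1, 0, -1, 0, 0, 1))  # original scale1.fds: 8 meshes
print(mult_mesh_count(-1, 0, -1, 1))        # 6-mesh Xeon variant
print(mult_mesh_count(-1, 0, -1, 2))        # 8-mesh single-floor i7 variant
```

This confirms the two modified files above produce 6 and 8 meshes respectively, matching the intended one-mesh-per-process setups.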


It's just an idea. What I have in mind is to use only the real cores/processors for the job, trying not to make any one processor do more than one job.
Another idea could be to run several cases - one with only one mesh per processor, and another duplicating the jobs on every processor - and compile some statistics on the results.
If we run the same cases on PCs with the same hardware but different users, we could get some practical information.

What do you think?



Chris Salter

Apr 3, 2014, 3:36:55 AM
to fds...@googlegroups.com
Nikolai,

It didn't come across as rude - I was just trying to explain my reasons for doing it the way I did.

The single-core test can be found here on Dropbox. It was a simple batch script bundled with the FDS executables and benchmark tests. Users were then asked to send me the .out file when it completed.

Your comment below:

    run several cases: one with only one mesh per processor, and another duplicating the jobs in every processor, and making some statistics about the results.

Unfortunately, that won't prove particularly beneficial. Part of the reason for benchmarking the MPI runs is to see what overhead MPI adds to the models - which is exactly what you measure when running a single model on 8 cores, for example. Duplicating the job on every processor would only really net you 8 different single-core results.

Garth - I can only assume that some aspects of FDS make use of different parts of the processor at the same time, which is where the mixed results come from. I find the 10% loss due to corporate IT interesting - I'm in a similar position, and we're making a case this year that buying through the IT department makes no sense for us (it limits what we can purchase), and that if we bought directly we could look into not connecting the machines to the network.

Bear in mind that we've only been discussing Windows so far - I've also run some tests on OS X and Linux, with mixed results. Some Linux runs saw a 25% speed-up in a virtual machine compared to Windows, though within a corporate environment Windows is a lot more easily come by.

Ideally, I wanted to collect results from many people and then create averages. In theory, this should alleviate some of the variation that occurs between individual machines.

As this seems to be of interest to people, including those on LinkedIn, it might be worth getting a list of people who would be interested in taking this further and doing something more about it.

Nikolai Ortiz

Apr 3, 2014, 9:30:50 AM
to fds...@googlegroups.com
Well, then I will run the cases and send you the .out files - we have 3 kinds of machines.

Anyway, for the MPI cases there is a question to be asked:
how many processes should be used?
Because, as I said before, the Intel Core i7 has 4 cores and 8 logical processors, and the Xeon I have has 6 cores and 6 logical processors.

So I have these combinations.

My MPI command to run on the Core i7 would be one of these two:
mpiexec -localonly 8 fds_mpi scale1.fds
mpiexec -localonly 4 fds_mpi scale1.fds

and for the Xeon:
mpiexec -localonly 6 fds_mpi scale1.fds

If I have the time, maybe I will run all 3 ways and send the .out files.

cheers.

Chris Salter

Apr 4, 2014, 6:06:21 AM
to fds...@googlegroups.com
Nikolai,

They were run with the number of cores that Windows Task Manager sees (so in the case of your i7, it would be -n 8).
Pyrosim doesn't allow you to change the number of cores or threads used. It creates one process per mesh (I've seen a model with 72 meshes run on an 8-core machine - -n therefore equalled 72!).

Cheers
Chris

Cian Davis

Apr 4, 2014, 7:36:46 AM
to fds...@googlegroups.com

All,
@Barbro We have been buying machines as components as we need them. Generally, the best value is a top-end single-processor i7; the additional cost of the Extreme versions isn't justified by the small performance increase. When you go Xeon, the cost of the processor, motherboard and RAM all increase - and not in line with the performance increase. When we bought our first machine (almost 3 years ago), Sandy Bridge had just been released, so the most powerful chip per core was the 2600K. When we bought our next machine, Haswell had just been released, so we bought a 4770K. This year the 4770K was still the fastest per core, so one machine was based on this and the other on a 4930K, as I wanted one machine with 64GB RAM.

@Chris That's in line with our experience. I think that hyperthreading may help when the process isn't using the full power of the core - basically filling in the troughs of the primary process usage. It does seem to result in some slowdown of the primary process. I will invariably be down in London soon - Thursday next week looks likely. Meeting up would be useful. I do agree that it would be useful as a community to have a set of benchmarks. Incidentally, I use Linux exclusively for FDS processing so even the difference between Windows and Linux would be interesting. Also, if you get to the stage of running out of RAM and swap being used, then yes the disk speed will make a difference - but it would be the difference between walking and cycling when the benchmark (i.e. no swap) is a fighter jet!

@Garth That's a very interesting result on the optimum number of meshes on a HT machine. It fits with what I'm thinking: if you have more threads/processes than physical cores, the additional processes fill in the troughs of usage, but if you have too many processes, the result is an overall slowdown. Given your experiences, I am very glad I have managed to sidestep corporate IT and run Linux on our cluster!

I do have some web programming background and would be happy to setup some basic system. In the most simple form, it would log the benchmark name, the number of cores, the number of meshes, RAM, OS, wall time and maybe one or two other items.
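For what it's worth, the simplest version of that log could be a flat CSV behind a submission form. A sketch with illustrative field names (the schema is hypothetical, not agreed, and the RAM figure in the example row is a placeholder; the wall time is job 4 from the FDS5 table earlier in the thread):

```python
import csv
import io

# Hypothetical schema - field names are illustrative only.
FIELDS = ["benchmark", "cpu", "cores", "meshes", "ram_gb", "os", "wall_time_s"]

rows = [
    # Wall time from job 4 of the FDS5 timings posted earlier in the thread;
    # the RAM value here is a placeholder, not reported in the thread.
    {"benchmark": "4-mesh 2MW test", "cpu": "Intel Core i7 4770K",
     "cores": 4, "meshes": 4, "ram_gb": 16, "os": "Linux",
     "wall_time_s": 6194},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

A flat file like this would also be trivial to import into the Google Spreadsheet option Chris mentioned.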

Regards,
Cian


Nikolai Ortiz

Apr 4, 2014, 10:39:35 AM
to fds...@googlegroups.com
Hi Chris,

I'm not sure about what you say regarding the Core i7 - opening the Task Manager and reading off the number of processors - because logical processors and physical cores differ in performance, or so I suppose.

If you type this at the W7 command prompt:

WMIC CPU Get NumberOfCores /Format:List 
WMIC CPU Get NumberOfLogicalProcessors /Format:List

you will see what I'm talking about.
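For anyone scripting this from Python rather than cmd, the standard library reports the logical count (the same figure as WMIC's NumberOfLogicalProcessors); the physical-core count is not in the standard library, so a third-party module such as psutil is the usual route. A small sketch:

```python
import os

# Logical processors: threads, i.e. with HyperThreading counted twice.
logical = os.cpu_count()
print(f"logical processors: {logical}")

# Physical core count (WMIC's NumberOfCores) needs a third-party module.
try:
    import psutil  # not in the standard library
    print(f"physical cores: {psutil.cpu_count(logical=False)}")
except ImportError:
    print("physical cores: install psutil to query")
```

This makes it easy for a benchmark script to record both numbers alongside the timing results.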

It's not my area, but there is an endless debate over the number of cores vs the number of logical processors, and over processor architectures - AMD vs Intel vs whatever.
(I'm not a parallel computing expert, so I could be wrong in many, many ways.)

About Pyrosim: I prefer to use FDS directly.
For MPI I'm using mpich2-1.4.1p1-win-x86-64.
I'm not sure how Pyrosim configures FDS and MPI or what else it does; I suppose it asks the OS for the number of processors and sets up MPI for that number.
I suggest using FDS and MPI as raw as possible for this benchmarking.

So I ran the tests and uploaded the files and scripts.
The script bench.bat runs all the cases and extracts the number of cores and logical processors.
The script bench-a.bat does the same job as bench.bat but shuts down the PC at the end.

The cases:

scale1.fds and FDS6-bench1 are the same cases that Chris made.
scale1-4 is the same as scale1.fds, but I changed the CHID so it can run after scale1.fds with the 4-process MPI configuration.
scale1-6.fds is based on scale1.fds but modified to have only 6 meshes.
scale2-8.fds is based on scale1.fds but modified to have 8 meshes, all on the same floor level.


A summary of the main results:


Configuration A:
Processor: Core i7-2600, 3.4 GHz, 8M, VT-x, 95W, OptiPlex 990 - 4 cores, 8 logical processors
Memory: 8 GB non-ECC 1333 MHz DDR3 (2x4 GB), Dell OptiPlex 990

scale1.fds running with MPI, 8 processes:
 Total Elapsed Wall Clock Time (s):     4601.260
scale1.fds running with MPI, 4 processes:
 Total Elapsed Wall Clock Time (s):     6639.150
FDS6-bench1:
 Total Elapsed Wall Clock Time (s):      396.011

Configuration B:
Processor: Six-Core Intel Xeon E5645, 2.40 GHz, 12M L3, 5.86 GT/s, T5500 - 6 cores, 6 logical processors
Memory: 12 GB DDR3 ECC SDRAM, 1333 MHz (3x4 GB), Dell Precision T5500/T7500

scale1.fds running with MPI, 6 processes:
 Total Elapsed Wall Clock Time (s):     6622.739
FDS6-bench1:
 Total Elapsed Wall Clock Time (s):      558.473

Configuration C:
Processor: Six-Core Intel Xeon E5-2667, 2.90 GHz, 12M L3, T5500 - 6 cores, 12 logical processors
Memory: 16 GB DDR3 ECC SDRAM, 1333 MHz, Dell

scale1.fds running with MPI, 6 processes:
 Total Elapsed Wall Clock Time (s):     4137.810
scale1.fds running with MPI, 12 processes:
 Total Elapsed Wall Clock Time (s):    (to be run later)
FDS6-bench1:
 Total Elapsed Wall Clock Time (s):      397.189


Well, I didn't run configuration C with 12 MPI processes because I did this during a break and missed that case, but I think it will get a better result than the 6-process configuration. I will try to run it tonight.

I ran these cases at night, trying to keep background programs to a minimum, but that is difficult to control.

Regards..
Attachments: intel.zip, XEON.zip

MScFire Student

Apr 4, 2014, 5:26:54 PM
to fds...@googlegroups.com
To add a little fuel to the fire...

I think you will really have to factor the automatic Intel Turbo Boost into the benchmarking (as I think Chris mentioned).

The Intel CPUs and boards will have some variation of Intel Turbo Boost, with the 4th- and 2nd-gen i7s using TB 2.0.

My understanding of TB is that it automatically boosts the CPU speed under heavy load if conditions in the PC are right. The amount it boosts, and its effectiveness, are related to the number of cores in use and the individual CPU.

For example, my 2nd-gen i7-3930K 6-core processor has a base speed of 3.2 GHz, but it will automatically Turbo Boost to 3.8 GHz under heavy load - though only if one core is in use. The more cores in use, the less the automatic Turbo Boost.

The i7-3930K, 2nd gen, base 3.2 GHz, 6 cores:
1 core in use  = Turbo Boost speed of 3.8 GHz
2 cores in use = Turbo Boost speed of 3.8 GHz
3 cores in use = Turbo Boost speed of 3.7 GHz
4 cores in use = Turbo Boost speed of 3.6 GHz
5 cores in use = Turbo Boost speed of 3.5 GHz
6 cores in use = Turbo Boost speed of 3.5 GHz

The i7-4770K, 4th gen, base 3.5 GHz, 4 cores:
1 core in use  = Turbo Boost speed of 3.9 GHz
2 cores in use = Turbo Boost speed of 3.9 GHz
3 cores in use = Turbo Boost speed of 3.8 GHz
4 cores in use = Turbo Boost speed of 3.7 GHz

I also think the benchmarking would have to include the maximum memory bandwidth associated with each processor and its RAM, and whether the motherboard is employing Intel XMP on the RAM!

Benchmarking will be a nightmare, as I think Chris said!
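To see why this matters for timing comparisons, a naive aggregate-clock estimate (active cores x boosted speed) built from the i7-4770K figures above shows that total clock throughput does not scale linearly with core count - and this crude product ignores memory bandwidth, HT and everything else:

```python
# i7-4770K Turbo Boost speeds (GHz) by number of active cores, from the
# figures above (base clock 3.5 GHz).
boost = {1: 3.9, 2: 3.9, 3: 3.8, 4: 3.7}

for cores, ghz in boost.items():
    # Crude upper bound on aggregate clock throughput for this core count.
    aggregate = cores * ghz
    print(f"{cores} core(s) x {ghz} GHz = {aggregate:.1f} GHz aggregate")
```

So a "4x" speedup expectation from 4 cores is already optimistic before MPI overhead enters the picture, simply because each core is clocked lower when all of them are busy.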

Chris Salter

Apr 7, 2014, 3:51:18 AM
to fds...@googlegroups.com
Indeed - I think the best we can hope for is some form of guesstimate of how something will perform. I've not taken this into account, as my belief was the same as yours (that if all the cores are operating, the boost is lower).

@Nikolai, I agree that using FDS and MPI natively is best, but when I got others to run the tests for me, Pyrosim was a one-click install.
I realise that the Task Manager doesn't show the physical number of cores but the logical count (i.e. it takes the hyper-threaded cores into account), but again, I was restricted in what I could achieve. The aim of the benchmarking was to keep life simple.

MScFire Student

Apr 7, 2014, 5:29:20 AM
to fds...@googlegroups.com
@Chris,
 
Benchmarking that looks at the results for each individual processor setup might make more sense.
 
As an example, pick the Intel Core i7-4770K, base 3.5 GHz, 4 cores:
  • The test motherboard, processor, RAM, max memory bandwidth, and the FDS test file would be constant on the testing platform.
  • That means the test variables would be:
    • the number of active cores (1, 2, 3, 4),
    • Hyper-Threading (on/off)*,
    • Turbo Boost (on/off)*,
    • RAM XMP (on/off)*,
    • FDS MPI mesh number (1, 2, 3, 4)
    • (* = variables that can be turned on/off in the PC BIOS).
  • I think this would involve about 120 test runs and would result in a table, for the given processor, demonstrating the best setup for the quickest run time for a given FDS file size and complexity.
You then repeat this for the most common processors and voilà, you have some benchmarking (I think?).
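For what it's worth, the test matrix described above can be enumerated directly; the exact product works out at 128 combinations, in the region of the "about 120" quoted. A quick sketch:

```python
from itertools import product

# Hypothetical enumeration of the proposed per-CPU test matrix.
active_cores = [1, 2, 3, 4]
hyper_threading = [False, True]   # BIOS toggle
turbo_boost = [False, True]       # BIOS toggle
ram_xmp = [False, True]           # BIOS toggle
mpi_meshes = [1, 2, 3, 4]

runs = list(product(active_cores, hyper_threading, turbo_boost,
                    ram_xmp, mpi_meshes))
print(len(runs))  # 4 * 2 * 2 * 2 * 4 = 128 runs
```

Dropping even one BIOS toggle halves the matrix to 64 runs, which illustrates how quickly this kind of sweep grows.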

Chris Salter

unread,
Apr 7, 2014, 5:42:02 AM4/7/14
to fds...@googlegroups.com
I think that's slightly more in depth that most people would be willing to run.

At the end of the day, I think most people will be chasing a rule of thumb benchmark. If they had to perform 120 different test, I sorely doubt that we'd ever get any results. Certainly I'm not wanting to run 120 tests at about an hour each on a CPU - I doubt that even the guys at OpenBenchmarking or the major news sites go into that much detail. Maybe a single test that in depth would be good to try and highlight issues, but I feel it would be a waste of time for comparing lots of processors.

Also, keeping it to the most popular CPUs is a bit of an issue - I foresaw this as being a tool for the purchasing of processors. If we don't cover as many as possible, then it makes that idea pointless, and it also calls the usefulness of the benchmarking into question.

@Cian - I'd prefer to keep using Linux, but as I'm about the only one in the team with Linux knowledge, it would be hard for us all to use it.

Nikolai Ortiz

unread,
Apr 7, 2014, 10:59:22 AM4/7/14
to fds...@googlegroups.com
Hi...
@Chris, I think it's possible to install and configure FDS and MPICH with one click - FDS already installs that way, and MPICH only needs one command run after installation, so I think it's possible to put that in a batch script and run it as administrator.
Well, if we look at the results I got in my last post, it's interesting that the Xeon performs better than the Core i7, which has a faster clock. So the RAM and the processor architecture are making a difference.

@MScFire: In Linux I suppose it's easier to turn Hyper-Threading and Turbo Boost on and off, along with all the other processor and RAM settings; in Windows I'm not sure it's possible to do that from the command line. I still agree with @Chris's first observation about the necessity of running the same case on multiple computers, and with that in mind it may be better to run the case with each PC's optimised configuration switched on.
The configuration of FDS and MPI is another matter, because in the benchmark you need to see, at least, whether your computer works well with multiple meshes assigned to one core, with one mesh per core, and with a single mesh without MPI. I think those are the main situations the typical user faces when looking for a benchmark. Maybe the user is trying to answer this question:
If I have these PC configurations available, which one is better for running this multi-mesh case, or this single-mesh case?

Chris Salter

unread,
Apr 7, 2014, 11:23:58 AM4/7/14
to fds...@googlegroups.com
Well, it's technically two clicks because you'd need to install FDS and then click to install MPI, but that's being a pedant! However, I've found a few times that workarounds are needed to get MPI+FDS working - this might be a factor on our corporate networked boxes that isn't there on a home user or standalone box, but sometimes things don't work as they should.

That's part of the aim of the benchmarking - a new CPU is almost always likely to be faster than an older one, clock speed for clock speed, due to increases in efficiency and the instruction sets (if FDS can make use of those). In fact, I'm fairly sure that whilst the latest AMD chips can clock 4.7GHz(!), they'll be beaten by an i7 Haswell chip.

However, the issue is that you're comparing 4 cores to 6 cores running 4 and 6 threads. This comes back down to my original issue - keeping everything constant. Making a CPU with 2 cores run 8 threads allowed me to compare a file against an 8-core machine running it with 8 threads. I think keeping the single-core testing is also worthwhile, as not every model will be split into two or more meshes, so this benchmark indicates the speed of a model using a single core. In fact, on a number of occasions where I was able to run models over a weekend, I would set four models running single-core - each model would then use one core for its computation over the weekend and I'd have four complete models on the Monday morning.

I'm fairly sure that you can't disable these options in Linux without using the BIOS either - these are low level hardware instructions. However, I would be pleased to be corrected.

I emailed Kristopher Overholt who has provided some insight - the benchmark case is a model run by the FDS team to show the effect of changes to the code on the modelling time for various models. They also run all tests using Firebot with each new build. Timing benchmark files are stored in the repository (here).

Nikolai Ortiz

unread,
Apr 7, 2014, 12:47:01 PM4/7/14
to fds...@googlegroups.com
@Chris, well, the single click was an overstatement :)
For MPI in a standalone configuration it's easy enough; the only problem I found was the firewall permissions.

I keep insisting on the multi-mesh problem because the proposal I'm making keeps the test valid for future generations of processors.

My idea is this - you can have three cases:
1. One single-mesh FDS case, no MPI.
2. One case that uses exactly your full number of logical processors (so 8 if you have a Core i7, 6 if you have that Xeon, etc.).
3. One case that doubles the use of all your logical processors (so 16 if you have a Core i7, 12 if you have that Xeon, etc.).

This can be done with scripts that write the FDS files depending on the number of processors, so you only need to run a script and the work is done.
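A minimal sketch of such a generator in Python (the cubic domain, cell counts, and the choice to split along x are illustrative assumptions, not taken from any distributed FDS example):

```python
def mesh_lines(n_meshes, total_cells=64, size=1.0):
    """Split a cubic domain into n_meshes equal slabs along x and emit
    one &MESH namelist line per slab."""
    if total_cells % n_meshes:
        raise ValueError("total_cells must divide evenly across the meshes")
    cells_x = total_cells // n_meshes
    dx = size / n_meshes
    lines = []
    for i in range(n_meshes):
        x0, x1 = i * dx, (i + 1) * dx
        lines.append(f"&MESH IJK={cells_x},{total_cells},{total_cells}, "
                     f"XB={x0:.3f},{x1:.3f},0.000,{size:.3f},0.000,{size:.3f} /")
    return lines

# One mesh per core for a quad-core machine:
for line in mesh_lines(4):
    print(line)
```

A wrapper could take the core count from the machine (e.g. `os.cpu_count()` in Python) and write the lines straight into the input file.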


The first case will tell you how your PC performs when a simulation has only one mesh.
The second will help if you match the number of meshes in your simulation to the number of processors; this matters because MPI overhead slows the simulation down.
The third case will give you an idea of how your PC behaves under full consumption of its resources.

The calculations FDS performs will be almost the same in cases 2 and 3, except for their number, so in principle the case is the same, but case 3 is more computationally demanding because each processor has more calculations to do.

I think that's the comparison point the user needs. Thinking as a typical user: OK, I need to run this simulation and I have these two or three PCs; I can divide my space into 4 or 8 meshes. Looking at the benchmark, I will know whether it's better to overload the processors or to put more work into the meshing and write an input file with fewer meshes.


P.S.:
the cases in the K. Overholt thread are for a benchmark of FDS vs OpenMP FDS on the same computer.


I found myself in this predicament (and because of that I began to write in this nice forum thread) because I have a simulation with 6 meshes, and an old Xeon handled it better than my Core i7 - in fact the simulation took double the time on the Core i7.
So I asked myself:
  why is this happening?

Chris Salter

unread,
Apr 8, 2014, 3:47:26 AM4/8/14
to fds...@googlegroups.com
Correct, the Excel file does show the OpenMP benchmarks for the same machine, but I was referring to the FDS timing benchmark files themselves, which are no longer distributed with FDS 6 in the Examples folder.

In regards to your third test, I'm not sure I see the reason. I think we need to get the terminology straight now.
The i7s have 4 physical cores and 8 logical cores.
Xeons vary, but will have at least 2 physical cores.

I don't see the benefit of running your third proposed test - you're then running at over double the capacity of the chip. If you meant run one MPI test with the physical number of cores/meshes and one MPI test using all the logical cores/meshes, I can see the benefit of that.

On your point regarding the user: in my mind it's always better to use a smaller number of meshes - there can be issues with items happening at mesh boundaries, so by reducing the number of meshes you reduce the potential failure points for the model.

I can't answer why the Xeon is quicker - however, if you check the output files, you should be able to see the percentage of time spent on each calculation (I imagine they'll be similar, but one area might be significantly different, which might explain it?).

Nikolai Ortiz

unread,
Apr 8, 2014, 10:07:50 AM4/8/14
to fds...@googlegroups.com
Yep, you got the three tests right - what you describe is better.
So it would be one test run for the number of physical cores and one for the number of logical processors.

I think the Xeon did better because the test was made with 8 meshes. The Core i7 has 4 cores and 8 logical processors, so in some way the 8 meshes were making its cores work harder. The Xeon chip has 6 cores and 12 logical processors, so the meshes were distributed better. So even though the Xeon is a little older (same year, earlier month) and has a lower clock speed (Xeon 2.9 GHz, Core i7 3.4 GHz), the chip architecture and the RAM made the difference.

If you think it would help, I can try to write the scripts for installing and running FDS + MPI on Windows 7 64-bit using as few "Next" clicks as possible.

Garth Hay

unread,
May 13, 2014, 8:27:00 PM5/13/14
to fds...@googlegroups.com
OK guys, so I'm hopefully getting a bit of time in the next couple of weeks to document some of the benchmarking I have done - is there somewhere set up for sharing this? Otherwise I'll start a Google Drive explicitly for sharing benchmark runs.

I'll also run some of these simple test cases on the 5 different PC setups we currently have.

PH

unread,
May 21, 2014, 5:07:21 AM5/21/14
to fds...@googlegroups.com
Hello everyone,

I did some tests to find the fastest working configuration using the resources I already have (because of an existing project I had to use FDS 5.5.3).
For this I used scale1.fds (from the examples). By default it is an 8-mesh input file using MULT_ID (a "mesh multiplier"). I modelled 4-mesh, 8-mesh, and 16-mesh versions of scale1.fds.
Note:
  • In all cases the total number of cells (64³ = 262,144) was the same.
  • In the 4-mesh, 8-mesh, and 16-mesh calculations, all meshes had the same size.

For the calculations I used different configurations. The one using multiple computers I described in: https://groups.google.com/forum/print/msg/fds-smv/1BAAGOaHnB8/2S_CGKjFrZsJ

These are the results:

Meshes  Processes  Process placement   Cores assigned to meshes? (MPICH)  Time steps  Calc. time
4       4          All on PC1          No                                 2085        684 s
4       4          All on PC2          No                                 2085        865 s
4       4          All on PC1          Yes                                2074        265 s
4       4          All on PC2          Yes                                2074        365 s
4       4          2 on PC1, 2 on PC2  Yes                                2074        499 s
8       8          All on PC1          Yes                                1120        140 s
8       8          4 on PC1, 4 on PC2  Yes                                1120        396 s
16      16         All on PC1          Yes                                1124        277 s
16      16         8 on PC1, 8 on PC2  Yes                                1124        619 s
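Since the 8- and 16-mesh runs took fewer time steps than the 4-mesh runs, the wall-clock times are easier to compare per time step. A short sketch using the figures from the table (the labels are mine):

```python
# (wall-clock seconds, time steps) for each run in the table above.
runs = {
    "4 meshes, PC1, no cores assigned": (684, 2085),
    "4 meshes, PC2, no cores assigned": (865, 2085),
    "4 meshes, PC1, cores assigned":    (265, 2074),
    "4 meshes, PC2, cores assigned":    (365, 2074),
    "4 meshes, split PC1/PC2":          (499, 2074),
    "8 meshes, PC1":                    (140, 1120),
    "8 meshes, split PC1/PC2":          (396, 1120),
    "16 meshes, PC1":                   (277, 1124),
    "16 meshes, split PC1/PC2":         (619, 1124),
}

# Normalise by time steps so runs with different step counts are comparable.
per_step = {name: t / steps for name, (t, steps) in runs.items()}
baseline = per_step["4 meshes, PC1, no cores assigned"]
for name, cost in sorted(per_step.items(), key=lambda kv: kv[1]):
    print(f"{name:34s} {cost * 1000:6.1f} ms/step  speedup x{baseline / cost:.2f}")
```

On this basis every split (two-machine) run costs more per time step than its single-machine counterpart.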


Does anybody have an idea why calculations on a single computer are faster than running them on a cluster?

Thanks a lot!

Patrick


 

Chris Salter

unread,
May 21, 2014, 5:16:47 AM5/21/14
to fds...@googlegroups.com
I seem to have typed a reply and had it disappear.

Basically, the communication between machines in the cluster is slower than the communication within a single machine. The network connections are slower than the internal circuitry of the computer.

Hope that helps.

Cheers
Chris

PH

unread,
May 21, 2014, 10:51:28 AM5/21/14
to fds...@googlegroups.com
Chris, thanks a lot!

So in which cases could it be helpful to use a cluster?

Regards
Patrick

Chris Salter

unread,
May 21, 2014, 11:11:45 AM5/21/14
to fds...@googlegroups.com
It depends on the model really.

For a smallish model, you can probably quite comfortably run on a single machine, provided there are enough computer resources for (i.e. disk space and RAM). With four core machines you can easily run a model that has four meshes. You can even buy 10 core processors now that would allow 10 meshes in one machine as long as you have the RAM to run that many - I now have access to a 20 (2 x 10 core CPU's), 128GB RAM machine which is frankly a monster as it means we can run either a 20 mesh model or 5 x 4 mesh models at once!).

However, a cluster would come into its own if you were exceeding the number of meshes that a single machine could run. I've seen some monstrously large models that had 71 meshes - I'm unaware of a CPU with 71 cores! In this instance, you are likely to see an overall speed-up over running the 71 meshes on a single machine (assuming the single machine has, for example, 8 cores). Also, if you have a large number of meshes, you might not have enough RAM on a single machine, but spreading the model over two or more machines may reduce the RAM needed on any one of them. I know some places that use a mix of old and new machines, and this allows them to spread the load amongst the different machines (smaller meshes to the slower machines etc). NIST use a cluster as well - I assume this is to get through all the validation checks in time (as they re-run them daily when the code changes, from what I gather).

Back when I graduated, multi-core CPUs were fairly new. I paid a fairly large sum of money for an Intel Q6600 CPU (4 cores) to complete my final-year coursework, whilst I then worked for a company that was, at the time, running on a cluster of 5 or 6 single-core machines! If you were to buy a new machine now, chances are it is at least a dual-core machine, if not a quad-core. I personally believe a cluster is no longer required for the average FDS model (mind, that's based on our "average" CFD models, rather than the community as a whole) and that a new machine with sufficient RAM is enough.

Hopefully that helps clear that up?

PH

unread,
May 30, 2014, 3:06:04 AM5/30/14
to fds...@googlegroups.com
Chris,

Thank you very much for answering so fast and in such detail! What you've described sounds quite plausible to me. If I do additional tests, I will let you know the results!

Best regards

Patrick