FDS on CUDA-enabled GPUs


Waled Elsouki

Oct 31, 2017, 8:54:12 PM
to FDS and Smokeview Discussions
Hi

I have no experience in computer programming, but I was wondering if it is possible to compile and run FDS on CUDA-enabled GPUs.

I understand this was looked at a long time ago (https://groups.google.com/forum/#!topic/fds-smv/iJvyXuyxXAo), but it may be more feasible now with the release of CUDA Fortran (https://developer.nvidia.com/cuda-fortran).

It would be great to be able to utilise the GPUs we have in our CAD-dedicated PCs (NVIDIA Quadro M2000).
If it is indeed feasible, could decent performance be expected, say, compared to the CPUs in those PCs (Intel Xeon E5-1650 v4, 6 cores / 12 threads @ 3.60 GHz)?
What would the determining factor be in the potential performance of FDS on a GPU? Is it floating-point throughput (FLOPS)?

Thanks,
Waled

Kevin

Nov 1, 2017, 8:55:27 AM
to FDS and Smokeview Discussions
We do not want to tailor the code for one particular type of card.

There is no clear evidence that the GPU is going to help us because our current mode of shared-memory parallelization (OpenMP) only provides a speed up of a factor of two.

fde

Nov 28, 2017, 2:55:02 AM
to FDS and Smokeview Discussions
Currently, cryptocurrencies are being mined with CUDA and the equivalent system on ATI cards. In benchmark results, many cards are faster than a decent CPU, because the number of shader units runs into the hundreds even though the processing power of each unit is low. Here is a list of CUDA-supporting cards. It is not one type of card; there are many available and in use.


I do not know how OpenMP works, but the way CUDA works sounds like a perfect fit for CFD.

If FDS is not the right application for CUDA now, I hope that in the future we will be able to utilize GPUs alongside CPUs to increase calculation speed.

Kevin

Nov 28, 2017, 9:53:56 AM
to FDS and Smokeview Discussions
The bulk of the computational cost of FDS is in 3-D DO-loops, where some finite-difference term is calculated in every grid cell. OpenMP works by dividing the total number of grid cells among the allocated threads, thus reducing the time spent processing these loops. However, there are costs: the overhead of dividing the work among the threads, competition for access to RAM, serial parts of the code that cannot be parallelized, and so on. All in all, we can speed up the processing of a given mesh by about a factor of 2 using 4 to 6 OpenMP threads. I cannot see how CUDA is going to magically improve on this. The degree to which a program is sped up depends a lot on the type of program. GPUs are optimized for high-speed graphics processing, which is a completely different kind of calculation from CFD.

So you have to understand the limitations of OpenMP before you can make predictions about how CUDA is going to perform.

fde

Nov 28, 2017, 10:06:09 AM
to FDS and Smokeview Discussions
Thank you for the detailed information. I see that I was wrong in predicting it could help us somehow. It seems I should give up my hope of being able to compute faster in the near future.

Randy McDermott

Nov 28, 2017, 10:07:03 AM
to FDS and Smokeview Discussions
My limited understanding is that CUDA is good for ordinary differential equations (ODEs). So there could be a potential benefit in using CUDA for, say, particle tracking or chemical kinetics. But even then, the downside is that the codes take a long time to develop, and then they are only good for a particular chip architecture (again, my limited understanding).

--
You received this message because you are subscribed to the Google Groups "FDS and Smokeview Discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fds-smv+unsubscribe@googlegroups.com.
To post to this group, send email to fds...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/fds-smv/fd4ae233-814e-4aca-afdc-705937738299%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Salah Benkorichi

Nov 28, 2017, 10:12:29 AM
to fds...@googlegroups.com
We are currently computing faster. Look back at where we were and where we are now: computation speed has been improving at an exponential rate. With large clusters and good CPU performance there have been lots of improvements; we're at the 8th generation.
I'm optimistic about this, and surely more improvements will come in the future.


Kevin

Nov 28, 2017, 10:12:39 AM
to FDS and Smokeview Discussions
You should be using MPI. Then you can speed up simulations by factors of 10 to 100, depending on the number of cores you have.

Martin Dragland

Mar 17, 2023, 12:48:28 PM
to FDS and Smokeview Discussions

I inadvertently stumbled upon this old forum post.

Kevin, are you suggesting that to achieve the fastest simulation, one should always utilize as many MPI processes (cores) as available, or is there a point at which efficiency significantly declines, similar to the situation with OpenMP Threads?


Kevin McGrattan

Mar 17, 2023, 1:23:39 PM
to fds...@googlegroups.com
Take a look at the attached plot. It shows the relative speed-up of FDS using 1 through 8 OpenMP threads for two single-mesh cases -- one 64 cubed, the other 128 cubed. You see in both cases that we get a speed-up of about a factor of 2 using 8 threads. This is not great. If I really put some effort into it, I could probably get this up to, maybe, a factor of 4 speed-up. Still not great. The reason is that there are many parts of FDS that are still serial and would be difficult to parallelize at the loop level.

If I have 8 cores at my disposal, I am much better off breaking the mesh into 8 submeshes and running with 8 MPI processes. This will probably speed up my job by approximately a factor of 7.5.
openmp_timing_benchmarks.pdf