We have just converted one of our dedicated CFD machines to Linux so that we can run multiple models at once. However, the models have been running slower than expected. The machine's specifications are shown in the attached image.
Originally we were running this machine on Windows with the number of MPI processes set to the number of meshes, and it was quite fast: it could run a 1.5M-cell (26-mesh) model in 40 hours.
However, now that we are running four models at once and utilising all 48 cores on the CFD machine, the computing speed has dropped substantially (to 8-14 s/hour). Based on my limited understanding of MPI and computing, I was under the impression that if each mesh is given one core, then using all 48 cores shouldn't slow the system down as long as enough memory is available for each core. Currently the system is only utilising 10-20% of the available memory, so I assume memory isn't the cause. Additionally, quite a few of the cores appear to be running at only 50%, as shown in the second attached image.
Because it slows down when all of the models are on MPI (see the additional information below), I thought the cause could be one of the following:
- System is overloaded
- My understanding of MPI is incorrect and it cannot complete 48 processes at once efficiently
- For some unknown reason some of the cores are only working at 50% capacity
- Some cores are working on multiple models therefore slowing them down
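To help narrow down the first and third possibilities, here is a rough sketch of how the physical core count could be checked. It assumes the `lscpu` tool is installed; the function name `physical_cores` is just made up for illustration. If the 48 "cores" turned out to be 24 physical cores with two hardware threads each, that might explain cores appearing to sit at 50%, although I have not confirmed this on our machine.

```shell
# Compute the number of physical cores from `lscpu` output read on stdin.
# Relies on the English field names "Socket(s)" and "Core(s) per socket".
physical_cores() {
  awk -F: '
    /^Socket\(s\)/          { sockets = $2 }
    /^Core\(s\) per socket/ { per_socket = $2 }
    END                     { print sockets * per_socket }
  '
}

# Only query the real machine if lscpu is installed.
if command -v lscpu >/dev/null 2>&1; then
  # LC_ALL=C forces the English field names the function expects.
  echo "Logical CPUs available: $(nproc)"
  echo "Physical cores:         $(LC_ALL=C lscpu | physical_cores)"
fi
```

If the two numbers differ by a factor of two, the machine is presenting hardware threads as cores.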
I have been reading quite a bit about MPI and OMP to try to find a solution, but I am not making much progress. If anyone has suggestions for getting back to the original speed (40-50 s/hour), they would be greatly appreciated.
Cheers,
Harry
Additional information:
Prior to running them on MPI, we were running them with OMP set to the number of meshes, based on advice from someone. When I switched one of the 9-mesh models over to MPI while the other three were still running on OMP, its speed increased from 12 s/hour to 48 s/hour. However, after switching them all over to MPI, the speed dropped to 8-14 s/hour per model.
Code to start model:
OMP_NUM_THREADS=1
mpirun -np [number of meshes] fds [model name].fds
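To test the idea that some cores are serving more than one process, something like the following could be run while the models are going. It is only a sketch: it assumes the executable shows up in `ps` under the name `fds`, as in the command above, and the helper name `shared_cpus` is hypothetical. The `psr` column is the CPU a process last ran on, so this is a snapshot rather than proof of how the processes are pinned.

```shell
# Read lines of "PID CPU COMMAND" and print every CPU id that hosts more
# than one process, followed by the number of processes on it.
shared_cpus() {
  awk '{ count[$2]++ }
       END { for (cpu in count) if (count[cpu] > 1) print cpu, count[cpu] }' |
    sort -n
}

# Snapshot of running fds processes and the CPU each one last ran on.
if command -v ps >/dev/null 2>&1; then
  ps -eo pid=,psr=,comm= 2>/dev/null | awk '$3 == "fds"' | shared_cpus
fi
```

No output would mean that, at that moment, no two fds processes were on the same CPU. Note also that a bare `OMP_NUM_THREADS=1` line only sets a shell variable; unless it is exported (for example with `export OMP_NUM_THREADS=1`), it may not reach the fds processes.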
The four models are:
9 mesh = 550k cells
9 mesh = 550k cells
12 mesh = 600k cells
16 mesh = 1.5M cells
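For reference, the arithmetic behind the list above, as a small sketch. With one MPI process per mesh, the four models together start 46 processes on the 48 cores. The per-mesh averages assume the cells are split evenly across meshes, which I have not checked.

```shell
# Mesh counts and approximate cell counts of the four models listed above.
meshes="9 9 12 16"
cells="550000 550000 600000 1500000"

# One MPI process per mesh, so total processes = total meshes.
total=0
for m in $meshes; do
  total=$((total + m))
done
echo "Total MPI processes: $total"

# Average cells per mesh for each model (integer division).
set -- $cells
for m in $meshes; do
  echo "$m meshes: about $(( $1 / m )) cells per mesh"
  shift
done
```

For comparison, the earlier Windows run of a 1.5M-cell model used 26 meshes, or roughly 57,700 cells per mesh, versus about 93,750 for the 16-mesh model here.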