FDS error: REQ5 timed out for MPI process 0


Sander

Jul 20, 2016, 10:08:36 AM
to FDS and Smokeview Discussions
Hello,

I am currently simulating a tunnel fire, and it throws an error that I haven't seen before and can't find any information about:
REQ5 timed out for MPI process 0 running on xxxxxx
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
REQ5 timed out for MPI process 2 running on xxxxxx
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 2

Does anyone know this problem and how to fix it?

Thanks in advance!

Kevin

Jul 20, 2016, 1:01:08 PM
to FDS and Smokeview Discussions
This error occurs when a message passed from one computer to another fails. Does it occur at the start of the simulation? Does it always occur at the same point in the simulation?

There is no "fix" because the problem could be your computer(s). Are you using the latest version of FDS?
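
To illustrate what a message like "REQ5 timed out" reports, here is a minimal sketch of waiting for a message exchange with a deadline. It is not the FDS source code: the names `wait_with_timeout` and `report` are hypothetical, and the real program would poll an MPI request rather than a Python callable. The point is only that the waiting process gives up after a fixed time and reports which request stalled, instead of hanging forever.

```python
import time


def wait_with_timeout(is_complete, timeout_s, poll_s=0.01):
    """Poll is_complete() until it returns True or timeout_s elapses.

    Returns True if the exchange completed, False if it timed out.
    is_complete stands in for a non-blocking test on a message request.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_complete():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def report(request_name, rank, completed):
    """Build a status line; the wording only imitates the message in this thread."""
    if completed:
        return f"{request_name} completed for MPI process {rank}"
    return f"{request_name} timed out for MPI process {rank}"


if __name__ == "__main__":
    # Hypothetical request that never completes, e.g. because the peer stalled.
    completed = wait_with_timeout(lambda: False, timeout_s=0.05)
    print(report("REQ5", 0, completed))
```

A timeout of this kind tells you that an exchange stalled, not why it stalled, which is why the same message can come from a software bug or from a hardware or driver problem.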

Sander

Jul 21, 2016, 2:03:19 AM
to FDS and Smokeview Discussions
We use only one computer, but there are 2 CPUs inside the machine. Could this be the problem?
The message occurs partway through the simulation. I haven't run it again, since it takes a lot of time.
I am using the latest version of FDS.

On Wednesday, July 20, 2016 at 19:01:08 UTC+2, Kevin wrote:

Kevin

Jul 21, 2016, 9:04:37 AM
to FDS and Smokeview Discussions
Sometimes in the middle of some routine activity, my computer just shuts down. Blip, gone. What happened? Who knows? The only way to know if this is a problem with FDS or with your computer is for you to run the exact same case again and observe if the same thing happens at the same point in the calculation. That would imply a problem with FDS, for example, maybe something happens in the calculation like the removal of an obstruction that causes the MPI exchange to fail. If the problem does not occur again, or if the problem occurs somewhere else, that would suggest the problem is with your computer. Maybe you have run out of memory, for example. Monitor the CPU and memory usage throughout the run.

Have you successfully run MPI jobs similar to this one on this computer?
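
The rerun test described above can be written down as a small decision rule. This is only a sketch of that reasoning: the function name `classify_failures`, the one-second tolerance, and the example times are made up for illustration and are not part of FDS.

```python
def classify_failures(failure_times, tolerance_s=1.0):
    """Interpret where repeated runs of the identical case stopped.

    failure_times: one entry per run, giving the simulation time (s) at
    which the run aborted, or None if the run finished normally.
    """
    if len(failure_times) < 2:
        return "inconclusive: rerun the identical case at least once"
    failed = [t for t in failure_times if t is not None]
    if not failed:
        return "no failure reproduced"
    if len(failed) < len(failure_times):
        return "intermittent: suspect the machine or MPI setup"
    if max(failed) - min(failed) <= tolerance_s:
        return "reproducible at the same point: suspect FDS or the input file"
    return "fails at different points: suspect the machine or MPI setup"


if __name__ == "__main__":
    # Hypothetical abort times from two runs of the same input file.
    print(classify_failures([120.0, 410.5]))
```

The rule is deliberately coarse. It only separates "same point every time" from everything else, which is the distinction that matters for deciding where to look next.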

Sander

Jul 22, 2016, 5:08:30 AM
to FDS and Smokeview Discussions
I just reran the simulation and it happened again, but at a different time step.
CPU and memory are not the problem. This computer is new, so I am trying to figure out what the problem could be, since I currently cannot use it for these simulations.

On Thursday, July 21, 2016 at 15:04:37 UTC+2, Kevin wrote:

Kevin

Jul 22, 2016, 8:35:59 AM
to FDS and Smokeview Discussions
The next thing to do is to run the case on a different computer. Also, do other simulations fail in the same way? We (NIST) purchased a 36 node (12 cores per node) cluster several years ago. I would see the same error that you see about once in 20 calculations. In fact, this is why you get that message -- I added it specifically to try to determine what was wrong. We talked to the vendor, and they had no idea what the problem was, at least at first. We kept asking them, and they started updating drivers, libraries, etc. Then the problem was much better. I still get the error, but far less often. And when I get the error, I cannot reproduce it. That is, I run a case, get the error, run again, no error.

What kind of computer are you using? What operating system?

Sander

Jul 27, 2016, 2:38:13 AM
to FDS and Smokeview Discussions
It is a Dell Precision T7910 running Windows 7. I am first going to update all drivers using the Dell support website and then try the simulation again.

On Friday, July 22, 2016 at 14:35:59 UTC+2, Kevin wrote:

Sander

Jul 27, 2016, 2:29:26 PM
to FDS and Smokeview Discussions
Well I performed the same simulation again after updating and now it gives me this:
[mpiexec@1C4RJD2] ..\hydra\pm\pmiserv\pmiserv_cb.c (781): connection to proxy 0 at host 1C4RJD2 failed
[mpiexec@1C4RJD2] ..\hydra\tools\demux\demux_select.c (103): callback returned error status
[mpiexec@1C4RJD2] ..\hydra\pm\pmiserv\pmiserv_pmci.c (500): error waiting for event
[mpiexec@1C4RJD2] ..\hydra\ui\mpich\mpiexec.c (1130): process manager error waiting for completion

Still unable to run simulations on this machine, and I really don't know what's causing all this...

Does this mean anything to you?

On Wednesday, July 27, 2016 at 08:38:13 UTC+2, Sander wrote:

Kevin

Jul 27, 2016, 3:34:25 PM
to FDS and Smokeview Discussions
Can you run the case on another computer?

Sander

Jul 28, 2016, 2:32:02 AM
to FDS and Smokeview Discussions
I will set up a case that runs on 12 cores, since the other machine has only 12 cores.
Then I will compare the simulations to see what happens.

On Wednesday, July 27, 2016 at 21:34:25 UTC+2, Kevin wrote:

Sander

Jul 28, 2016, 9:06:56 AM
to FDS and Smokeview Discussions
And it crashed again. The other simulation is still running fine.
The errors:
REQ5 timed out for MPI process      6 running on xxxx
REQ5 timed out for MPI process      8 running on xxxx
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 8

There is also a deviation between the HRR curves of the two simulations. The blue one is the run that stops. It is strange that they differ, since I used the same FDS input file for both simulations.
The .out file doesn't say anything; it just stops. So I don't have any clue why this happens.
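
One way to put a number on the HRR deviation is to find the first output time at which the two curves differ by more than a tolerance. The sketch below assumes the HRR file layout as a units row, then a header row containing `Time` and `HRR`, then data rows; check this against your own files. It also assumes both runs wrote output at the same times. The function names, the tolerances, and the sample values are made up for illustration.

```python
import csv
import io


def read_hrr(text):
    """Parse HRR CSV text into a list of (time_s, hrr_kw) pairs.

    Assumes row 1 holds units and row 2 holds column names that
    include 'Time' and 'HRR'.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header = [name.strip() for name in rows[1]]
    t_col, q_col = header.index("Time"), header.index("HRR")
    return [(float(r[t_col]), float(r[q_col])) for r in rows[2:] if r]


def first_divergence(run_a, run_b, rel_tol=0.01, abs_tol=1.0):
    """Return the first time at which the HRR values differ by more than
    the tolerance, or None if they agree over the common rows."""
    for (t_a, q_a), (_, q_b) in zip(run_a, run_b):
        if abs(q_a - q_b) > max(abs_tol, rel_tol * max(abs(q_a), abs(q_b))):
            return t_a
    return None


# Made-up sample data standing in for the two machines' HRR files.
RUN_A = "s,kW\nTime,HRR\n0.0,0.0\n1.0,100.0\n2.0,200.0\n3.0,300.0\n"
RUN_B = "s,kW\nTime,HRR\n0.0,0.0\n1.0,100.2\n2.0,230.0\n3.0,310.0\n"

if __name__ == "__main__":
    print(first_divergence(read_hrr(RUN_A), read_hrr(RUN_B)))
```

With real files you would pass the file contents, for example `read_hrr(open(path).read())`, where `path` points to each run's HRR output.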




On Thursday, July 28, 2016 at 08:32:02 UTC+2, Sander wrote:

Kevin

Jul 28, 2016, 9:20:36 AM
to FDS and Smokeview Discussions
Are you saying that the calculation runs without stopping on machine A, and it stops on machine B? Does it do this consistently? If so, I cannot think of a way to debug it.

Sander

Jul 28, 2016, 9:35:00 AM
to FDS and Smokeview Discussions
Well, I never had one of those errors on the machine we are currently using.
And I have those errors on ALL the simulations on the new one....


On Thursday, July 28, 2016 at 15:20:36 UTC+2, Kevin wrote:

Sam J

Aug 5, 2016, 11:51:35 PM
to FDS and Smokeview Discussions
Hi Sander,

Did you have any luck finding the cause of this issue?

I have two nearly identical 4-node Linux clusters (each node has 4 cores). One of the clusters has been running stably for a while, but on the second cluster I get the same error as you about 2-3 days into running a very large job (about 20 million cells that are 0.2 m in all dimensions). What is surprising is that the error says "... timed out for MPI process 2 on node1" even though node1 is the main node that starts the simulation in the cluster.

Ida Ginstrup

Nov 1, 2018, 11:23:27 PM
to FDS and Smokeview Discussions
Hi Sam,

One of my colleagues is having the same issue as you and Sander.

Did you ever find a solution?

Thanks