srun: error: _server_read: fd 25 error reading header: Connection reset by peer

Mariana Rodriguez

Jul 3, 2018, 7:36:15 PM
to MOOSE
Hello,

I am getting this error message:

srun: error: _server_read: fd 25 error reading header: Connection reset by peer
srun: error: Aborting, io error and missing step on node 2
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 11706209.0 ON node068 CANCELLED AT 2018-07-03T19:18:48 ***
[mpi...@node068.cm.cluster] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert (!closed) failed
[mpi...@node068.cm.cluster] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpi...@node068.cm.cluster] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpi...@node068.cm.cluster] main (ui/mpich/mpiexec.c:331): process manager error waiting for completion

Any idea what "_server_read: fd 25 error reading header" means?

Thanks!
Mariana


Cody Permann

Jul 5, 2018, 10:16:13 AM
to moose...@googlegroups.com
It means one of the parallel processes crashed. When that happens, look in the error logs to see whether any of your ranks recorded an error. If the logs are clean, you have to do things like saving all of the output and looking through it, running in debug mode, etc.
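
A minimal sketch of that log check, in Python, in case it helps. It assumes the job's output was saved to text files in one directory. The function names (find_rank_errors, scan_logs), the "*.out" file pattern, and the keyword list are all made up for illustration and are not part of MOOSE or Slurm. The idea is to skip the launcher's own lines (srun, slurmstepd, mpiexec), since those only report that the step died, and to surface whatever else looks like an error from the application itself.

```python
import re
import sys
from pathlib import Path

# Assumption: launcher lines start with "srun: ", "slurmstepd: ", or "[mpi",
# as they do in the log quoted earlier in this thread.
LAUNCHER_NOISE = re.compile(r"^(srun: |slurmstepd: |\[mpi)")

# Assumption: a generic keyword list, not anything MOOSE-specific.
# Adjust it to the messages your application actually prints.
ERROR_HINT = re.compile(
    r"error|abort|assert|segmentation fault|killed|out of memory",
    re.IGNORECASE,
)


def find_rank_errors(text):
    """Return (line_number, line) pairs that look like application errors,
    ignoring messages printed by the launcher itself."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if LAUNCHER_NOISE.match(line):
            continue  # the launcher only reports that the step died
        if ERROR_HINT.search(line):
            hits.append((number, line.strip()))
    return hits


def scan_logs(directory, pattern="*.out"):
    """Scan every file matching `pattern` in `directory` and return
    {file_name: hits} for the files that contain suspicious lines."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        hits = find_rank_errors(path.read_text(errors="replace"))
        if hits:
            results[path.name] = hits
    return results


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, hits in scan_logs(log_dir).items():
        for number, line in hits:
            print(f"{name}:{number}: {line}")
```

If this comes back empty, the logs are "clean" in the sense above, and the next step is to rerun while keeping all of the output, or to run in debug mode.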