ogs5.7.1_IPQC_ MPI -- PARALLEL on EVE -- ERROR

Johannes Boog

Aug 7, 2018, 6:43:06 AM
to Ogs-users
Dear ogs-users,

I am having problems running parallelized simulations with ogs5.7.1 on the UFZ EVE cluster. My reactive transport simulations (using IPQC and MPI) crash when run on more than 8 cores, with the following error message:

ORTE has lost communication with its daemon located on node:

  hostname:  node033

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

Has anyone experienced this error already?

It seems to be a problem related to node communication on the cluster; however, the simulations with 4 and 8 cores finished. The benchmark isofrac_2d using 20 cores finishes as well.

Sometimes the model crashed after 5 min, sometimes after 2 h.
I am a little confused now, as I do not change any input files between the simulations except *.ddc.

Best,

Johannes

Wenqing Wang

Aug 8, 2018, 10:52:15 AM
to 'Johannes Boog' via ogs-users
Sounds like a network connection problem.