MPI error: ORTE daemon unexpectedly failed


RYCHKOV Valentin

Feb 20, 2019, 1:29:50 AM
to RAVEN users, pico...@buckeyemail.osu.edu

Hi,

 

Claudia and I are fighting a very strange error that we get when we run our analysis on the cluster.

We use RAVEN to run MAAP5 with the DET sampler (we could also do MC or any other type of analysis), using 552 cores: 1 node runs the RAVEN instance and 23 nodes run the code.

 

First, here is the error message that is written in the simulation folder:

--------------------------------------------------------------------------

 

An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).

--------------------------------------------------------------------------

 

Let me explain when it appears.

We run very big cases (10^5 MAAP runs), so the analysis often exceeds the maximum wall time, which is 7 days.

So we restart it once it is killed by SLURM due to the wall-time limit.

RAVEN has no DET restart, so we implemented our poor man's restart at the interface level.

The DET is essentially a grid sampler, so when we restart the analysis, the interface (in the generateNewCommand method) checks whether a previous successful run left data in the simulation folder; if so, it submits a pause command like this:

 

finished= True

SIMULATION ALREADY RUN

Execution Command: [(u'parallel', u' echo SIMULATION ALREADY RUN dummy.txt'), (u'parallel', u' sleep 8s'), (u'parallel', u' hostname')]

(    2.61 sec) Code                     : Message         -> Execution command submitted: mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none echo SIMULATION ALREADY RUN dummy.txt  && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none sleep 8s  && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none hostname

 

If it doesn’t find the successful MAAP output data it executes MAAP:

Execution Command: [(u'parallel', u' MP502a2LINUX_opt.exe test.inp  ')]

(   24.53 sec) Code                     : Message         -> Execution command submitted: mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_2  -n 1 --bind-to none MP502a2LINUX_opt.exe test.inp
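The check described above can be sketched roughly like this (a minimal illustration only: the function name, the marker file, and the signature are placeholders, not the actual RAVEN interface API; each (mode, command) tuple becomes its own mpiexec invocation, as in the logs above):

```python
import os

def generate_command(run_dir, executable="MP502a2LINUX_opt.exe", inp="test.inp"):
    # A branch counts as already run if the outputs a successful MAAP5
    # run leaves behind are present (marker file name is illustrative).
    already_run = os.path.isfile(os.path.join(run_dir, "dummy.txt"))
    if already_run:
        # Cheap placeholder commands instead of re-running the code.
        return [("parallel", " echo SIMULATION ALREADY RUN dummy.txt"),
                ("parallel", " sleep 8s"),
                ("parallel", " hostname")]
    # Otherwise launch MAAP5 normally.
    return [("parallel", " %s %s " % (executable, inp))]
```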

 

 

This method worked like a charm when we had 10^4 runs, but it started to fail with the error mentioned above once we processed more than 30,000 runs.

 

We contacted our EDF cluster support to get some help; they suggested that there might be clutter left by previous job runs in one of the nodes' /tmp/ folders. The error:

--------------------------------------------------------------------------

[atcn001:56426] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-71302@atcn001_0/1237) of (/tmp/openmpi-sessions-71302@atcn001_0/1237/0/0), mkdir failed [-1]

[atcn001:56426] [[1237,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 110

[atcn001:56426] [[1237,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 386

--------------------------------------------------------------------------

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS

--------------------------------------------------------------------------

atcn001 is the node on which the RAVEN instance is running.

So they suggested appending orte-clean at the end of our command; if we do so, it kills the process and we get an error like this:

 

(  146.74 sec) Code                     : Message         ->  Process Failed mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none echo SIMULATION ALREADY RUN dummy.txt  && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none sleep 15s  && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none hostname  && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none orte-clean --verbose   returnCode 213

(  146.74 sec) Code                     : Message         -> 'SIMULATION ALREADY RUN dummy.txt

atcn486

-------------------------------------------------------

Primary job  terminated normally, but 1 process returned

a non-zero exit code.. Per user-direction, the job has been aborted.

-------------------------------------------------------

--------------------------------------------------------------------------

mpiexec detected that one or more processes exited with non-zero status, thus causing

the job to be terminated. The first process to do so was:

 

  Process name: [[41250,1],0]

  Exit code:    213

--------------------------------------------------------------------------
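For what it's worth, an alternative to running orte-clean inside the job would be to sweep the stale session directories directly before a restart. This is a hedged sketch (the function name is made up; the glob matches the /tmp/openmpi-sessions-... paths from the error above), and it would need to run on every node, e.g. by looping ssh over the machinefile:

```shell
# Sketch: remove stale Open MPI session directories under a tmp root.
# Run on each node (e.g. via ssh over the machinefile) before a restart.
clean_ompi_sessions() {
  tmproot="${1:-/tmp}"
  # Matches directories like /tmp/openmpi-sessions-71302@atcn001_0
  rm -rf "$tmproot"/openmpi-sessions-*
}
```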

 

If you have read this far, then you probably have an idea of what is going on :)

 

Here are our thoughts/questions.

 

  1. We see this error only when we re-run our analysis (we tried 130,000 MC runs with MAAP and observed no failure).
  2. When we use a pause-for-XX-seconds command instead of running MAAP, it does not use all nodes but places the jobs on the first 2-3 nodes in the nodefile list.
  3. Could RAVEN itself run orte-clean at the end of the run cycle (generateCommand -> execute command on the node -> finalizeCodeOutput on the node) to clean up the mess?
  4. Maybe the two errors (the ORTE failure and the unable-to-create-sub-directory error) are different in nature?

Any help would be appreciated,

We are using Open MPI 2.0.1.

 

Valentin.

 

 


Valentin RYCHKOV
EDF Lab Paris-Saclay
7 Bd Gaspard Monge

91120 Palaiseau

valentin...@edf.fr

Tél. : +33 1 78 19 41 31


RYCHKOV Valentin

Feb 20, 2019, 2:15:15 AM
to RAVEN users, pico...@buckeyemail.osu.edu

In the meantime, as a test, I am now trying to run a case where I use the empty command echo SIMULATION ALREADY RUN dummy.txt in serial rather than parallel mode.

 

SIMULATION ALREADY RUN

Execution Command: [(u'serial', u' echo SIMULATION ALREADY RUN dummy.txt')]

(    4.09 sec) Code                     : Message         -> Execution command submitted:  echo SIMULATION ALREADY RUN dummy.txt

 

But anyway, any thoughts are welcome.

 

Valentin.

 


 

 

From: RYCHKOV VALENTIN <valentin...@edf.fr>
Date: Wednesday, 20 February 2019 at 07:29
To: RAVEN users <inl-rav...@googlegroups.com>
Cc: "pico...@buckeyemail.osu.edu" <pico...@buckeyemail.osu.edu>
Subject: MPI error: ORTE daemon unexpectedly failed

Diego Mandelli

Feb 20, 2019, 12:38:03 PM
to valentin.rychkov, RAVEN users, pico...@buckeyemail.osu.edu

Valentin,

Could you expand a bit more on how the restart has been added to RAVEN?

Diego

--
You received this message because you are subscribed to the Google Groups "INL RAVEN Users Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to inl-raven-use...@googlegroups.com.
To post to this group, send email to inl-rav...@googlegroups.com.
Visit this group at https://groups.google.com/group/inl-raven-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/inl-raven-users/519FD55F-C89F-411D-95FF-DD4ECE4F714D%40edf.fr.
For more options, visit https://groups.google.com/d/optout.

RYCHKOV Valentin

Feb 20, 2019, 6:03:01 PM
to diego.m...@inl.gov, RAVEN users, pico...@buckeyemail.osu.edu

Hi Diego,

 

The restart is not added to RAVEN. We simply read the existing simulation tree in a sort of "interface check" mode.

In the generateCommand method of the interface we check whether the folder has all the attributes of a successfully finished MAAP5 run; if yes, we read the output files and proceed with the DET; if not, we start a MAAP5 run.

We use this to fix failed MAAP5 runs and to effectively extend the wall time. It takes roughly 1-2 seconds to read a folder, so in 30 hours we can read 100,000 branches.
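A minimal sketch of that "interface check" predicate (the function name and the required file names are placeholders; a real check would test for whatever a successful MAAP5 run actually writes):

```python
import os

def is_finished_maap5_run(folder, required=("test.log", "test.csv")):
    # The branch folder counts as finished only if every expected output
    # of a successful MAAP5 run is present (file names are illustrative).
    return all(os.path.isfile(os.path.join(folder, name)) for name in required)
```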

 

Is it clearer now?

Valentin


 

 

From: "diego.m...@inl.gov" <diego.m...@inl.gov>
Date: Wednesday, 20 February 2019 at 18:38
To: RYCHKOV VALENTIN <valentin...@edf.fr>, RAVEN users <inl-rav...@googlegroups.com>
Cc: "pico...@buckeyemail.osu.edu" <pico...@buckeyemail.osu.edu>
Subject: Re: MPI error: ORTE daemon unexpectedly failed

Diego Mandelli

Feb 20, 2019, 6:23:18 PM
to valentin.rychkov, RAVEN users, pico...@buckeyemail.osu.edu

Valentin,

By any chance, do you have a file called ravenLocked.raven in any of these folders?

Diego

RYCHKOV Valentin

Feb 25, 2019, 11:33:58 AM
to Diego Mandelli, RAVEN users, pico...@buckeyemail.osu.edu

Hi there,

 

 

It seems that we managed to avoid the error by spawning the reading of the previous simulation without the mpiexec command.

I'm not sure, but the latency of the cluster file system (we have diskless nodes) could be causing the error.

 

 

So the bottom line:

We have the following RunInfo:

  <RunInfo>
    <WorkingDir>scenario_det</WorkingDir>
    <Sequence>testDummyStep,post</Sequence>
    <batchSize>552</batchSize>
    <maxQueueSize>800</maxQueueSize>
    <mode>mpi
      <nodefile>/scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/nodes</nodefile>
    </mode>
    <NodeParameter>-machinefile</NodeParameter>
    <precommand>--bind-to none</precommand>
  </RunInfo>

 

  1. When you use 'parallel' in the generated command, don't build concatenated commands with &&.
    • Using
        string = 'exec1 && exec2'
        returnCommand = [('parallel', string)]
      will lead to the following execution of exec1 and exec2:
        mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none exec1 && exec2
      so exec2 will run on the node where the RAVEN instance is running.
    • Instead use:
        returnCommand = [('parallel', 'exec1'), ('parallel', 'exec2')]
      which will lead to:
        mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none exec1 && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0  -n 1 --bind-to none exec2
  2. In our case exec1 is just a print to a file, so cluster I/O latency may be very important; instead of placing this job on a different node we run it on the RAVEN node by using 'serial' in the generateCommand method:
    • returnCommand = [('serial', 'exec1')]
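The tuple convention above can be illustrated with a tiny sketch (build_return_command is a made-up helper, not the actual RAVEN code that assembles the mpiexec line):

```python
def build_return_command(commands, mode="parallel"):
    # One (mode, command) tuple per executable, so each command gets its
    # own mpiexec prefix when RAVEN joins them with '&&'.
    return [(mode, cmd) for cmd in commands]

# Wrong: '&&' inside a single tuple; only exec1 ends up behind mpiexec,
# exec2 runs bare on the node hosting the RAVEN instance.
bad = [("parallel", "exec1 && exec2")]

# Right: separate tuples, one mpiexec invocation each.
good = build_return_command(["exec1", "exec2"])
```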

Voila,

 

Valentin.

 

 


 

 

From: "inl-rav...@googlegroups.com" <inl-rav...@googlegroups.com>
Reply-To: Diego Mandelli <diego.m...@inl.gov>
Date: Thursday, 21 February 2019 at 00:23

Paul W. Talbot

Feb 25, 2019, 12:06:10 PM
to valentin.rychkov, Diego Mandelli, RAVEN users, pico...@buckeyemail.osu.edu

Ah, interesting. This should be quite useful for others looking to extend the capabilities of individual code interfaces. I’m glad you were able to find that fix.
