Hi,
With Claudia we are fighting a very strange error that we get when we run our analysis on the cluster.
We use RAVEN to run MAAP5 with the DET sampler (we can also do MC or any other type of analysis), using 552 cores (1 node to run the RAVEN instance and 23 nodes to run the code).
First, here is the error message that is written in the simulation folder:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
Let me explain when it appears.
We run very big cases (10^5 MAAP runs), so the analysis often exceeds the maximum wall time, which is 7 days.
So we restart it once it is killed by Slurm due to the wall-time limit.
There is no DET restart in RAVEN, so we implemented our own poor man's restart at the interface level.
DET is essentially a grid sampler, so when we restart the analysis, the interface (in the generateCommand method) checks whether there are data from a previous successful run in the simulation folder; if so, it submits a placeholder command like this:
finished= True
SIMULATION ALREADY RUN
Execution Command: [(u'parallel', u' echo SIMULATION ALREADY RUN dummy.txt'), (u'parallel', u' sleep 8s'), (u'parallel', u' hostname')]
( 2.61 sec) Code : Message -> Execution command submitted: mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0 -n 1 --bind-to none echo SIMULATION ALREADY RUN dummy.txt && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0 -n 1 --bind-to none sleep 8s && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0 -n 1 --bind-to none hostname
If it doesn’t find the successful MAAP output data it executes MAAP:
Execution Command: [(u'parallel', u' MP502a2LINUX_opt.exe test.inp ')]
( 24.53 sec) Code : Message -> Execution command submitted: mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_2 -n 1 --bind-to none MP502a2LINUX_opt.exe test.inp
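In interface terms, the branching logic can be sketched roughly like this (a minimal sketch, not the actual EDF interface code; the function name, the `success_marker.txt` file, and the input-file name are assumptions used only for illustration):

```python
import os

def generate_command(run_dir, maap_exe="MP502a2LINUX_opt.exe", inp="test.inp"):
    """Return RAVEN-style (runMode, command) tuples for one DET branch.

    If the branch folder already holds a finished MAAP5 run, submit a cheap
    placeholder command instead of re-running the code. The marker-file check
    below is hypothetical; the real interface inspects the MAAP output files.
    """
    already_done = os.path.exists(os.path.join(run_dir, "success_marker.txt"))
    if already_done:
        # Placeholder: echo + short sleep + hostname so the job slot cycles quickly.
        return [("parallel", " echo SIMULATION ALREADY RUN dummy.txt"),
                ("parallel", " sleep 8s"),
                ("parallel", " hostname")]
    # Otherwise launch MAAP5 as usual.
    return [("parallel", " %s %s " % (maap_exe, inp))]
```

RAVEN then prefixes each `parallel` entry with `mpiexec -machinefile ... -n 1 --bind-to none`, which is exactly what the submitted commands above show.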
This method worked like a charm when we had 10^4 runs, but it starts to fail when we process more than 30,000 runs, with the error I mentioned above.
We contacted our EDF cluster support to get some help; they suggested that there might be clutter left in one node's /tmp/ folder after previous job runs.
The log on the node shows:
--------------------------------------------------------------------------
[atcn001:56426] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-71302@atcn001_0/1237) of (/tmp/openmpi-sessions-71302@atcn001_0/1237/0/0), mkdir failed [-1]
[atcn001:56426] [[1237,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 110
[atcn001:56426] [[1237,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 386
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
atcn001 is the node on which the RAVEN instance is running.
So they suggested appending orte-clean to the end of our command, but if we do so it kills the process and we get an error like this:
( 146.74 sec) Code : Message -> Process Failed mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0 -n 1 --bind-to none echo SIMULATION ALREADY RUN dummy.txt && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0 -n 1 --bind-to none sleep 15s && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0 -n 1 --bind-to none hostname && mpiexec -machinefile /athos-scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/scenario_det/node_0 -n 1 --bind-to none orte-clean --verbose returnCode 213
( 146.74 sec) Code : Message -> 'SIMULATION ALREADY RUN dummy.txt
atcn486
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[41250,1],0]
Exit code: 213
--------------------------------------------------------------------------
If you read until here, then you probably have an idea of what is going on :)
Here are our thoughts/questions. Any help would be appreciated.
We are using Open MPI 2.0.1.
Valentin.
Valentin RYCHKOV, 91120 Palaiseau, Tél. : +33 1 78 19 41 31
In the meantime, as a test, I am now running a case where I issue the empty command `echo SIMULATION ALREADY RUN dummy.txt` in serial rather than parallel mode:
SIMULATION ALREADY RUN
Execution Command: [(u'serial', u' echo SIMULATION ALREADY RUN dummy.txt')]
( 4.09 sec) Code : Message -> Execution command submitted: echo SIMULATION ALREADY RUN dummy.txt
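In interface terms, the change amounts to tagging the placeholder command as serial, so RAVEN runs it directly instead of wrapping it in mpiexec (a minimal sketch; the function name and internals are hypothetical, not the actual EDF interface code):

```python
def placeholder_command(use_mpi=False):
    """Placeholder command for an already-finished DET branch.

    Tagged 'parallel', RAVEN prefixes it with 'mpiexec -machinefile ...',
    which touches the ORTE session directories in /tmp on every call.
    Tagged 'serial', it runs directly and bypasses mpiexec entirely.
    """
    mode = "parallel" if use_mpi else "serial"
    return [(mode, " echo SIMULATION ALREADY RUN dummy.txt")]
```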
But anyway, any thoughts are welcome.
Valentin.
From: RYCHKOV VALENTIN <valentin...@edf.fr>
Date: Wednesday, 20 February 2019, 07:29
To: RAVEN users <inl-rav...@googlegroups.com>
Cc: "pico...@buckeyemail.osu.edu" <pico...@buckeyemail.osu.edu>
Subject: MPI error: ORTE daemon unexpectedly failed
Valentin,
Could you expand a bit more on how the restart has been added to RAVEN?
Diego
Hi Diego,
The restart is not added to RAVEN. We simply read the existing simulation tree in a sort of "interface check" mode.
In the generateCommand method of the interface we check whether the folder has all the attributes of a successfully finished MAAP5 run; if yes, we read the output files and proceed with the DET, and if not, we start a MAAP5 run.
We use that to fix failed MAAP5 runs and to effectively extend the wall time. It takes roughly 1-2 s to read a folder, so in 30 h we can read 100,000 branches.
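The folder check itself might look something like this (purely illustrative; the attributes of a finished MAAP5 run used below, the file names and the end-of-run marker, are assumptions, not what the EDF interface actually tests):

```python
import os

def is_finished_maap_run(run_dir, plot_file="test.plt", log_file="test.log"):
    """Heuristic check that a branch folder holds a completed MAAP5 run.

    Hypothetical criteria: the plot file exists and is non-empty, and the
    log ends with a normal-termination marker.
    """
    plot = os.path.join(run_dir, plot_file)
    log = os.path.join(run_dir, log_file)
    if not (os.path.isfile(plot) and os.path.getsize(plot) > 0):
        return False
    if not os.path.isfile(log):
        return False
    with open(log) as f:
        tail = f.read()[-200:]  # only the end of the log matters
    return "NORMAL TERMINATION" in tail  # hypothetical marker string
```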
Is it more clear now?
Valentin
From: "diego.m...@inl.gov" <diego.m...@inl.gov>
Date: Wednesday, 20 February 2019, 18:38
To: RYCHKOV VALENTIN <valentin...@edf.fr>, RAVEN users <inl-rav...@googlegroups.com>
Cc: "pico...@buckeyemail.osu.edu" <pico...@buckeyemail.osu.edu>
Subject: Re: MPI error: ORTE daemon unexpectedly failed
Valentin,
By any chance, do you have a file called ravenLocked.raven in any of these folders?
Diego
Hi there,
It seems that we managed to avoid the error by spawning the reading of the previous simulations without the mpiexec command.
I'm not sure, but the latency of the cluster file system (we have diskless nodes) could be causing the error.
So, the bottom line: we have the following RunInfo:
<RunInfo>
  <WorkingDir>scenario_det</WorkingDir>
  <Sequence>testDummyStep,post</Sequence>
  <batchSize>552</batchSize>
  <maxQueueSize>800</maxQueueSize>
  <mode>mpi
    <nodefile>/scratch/cp346b8n/2018_Simulation/18.11.20/5_24_bis/nodes</nodefile>
  </mode>
  <NodeParameter>-machinefile</NodeParameter>
  <precommand>--bind-to none</precommand>
</RunInfo>
Voila,
Valentin.
From: "inl-rav...@googlegroups.com" <inl-rav...@googlegroups.com>
Reply-To: Diego Mandelli <diego.m...@inl.gov>
Date: Thursday, 21 February 2019, 00:23
Ah, interesting. This should be quite useful for others looking to extend the capabilities of individual code interfaces. I’m glad you were able to find that fix.