[slurm-users] Reproducible irreproducible problem (timeout?)


Laurence Marks

Dec 20, 2023, 8:32:37 AM12/20/23
to slurm...@lists.schedmd.com
I know that sounds improbable, but please read on.

I am running a reasonably large job on a University supercomputer (not a national facility), using 12 nodes with 64 cores each. The job loops through a sequence of commands, some of which are single-CPU, but with a slow step in which three hybrid OMP/MPI tasks, each spanning 4 nodes, are launched. I use mpirun for this (Intel MPI), which in turn uses srun for each task. These slow steps run for about 50 minutes. The full job runs for 48 hours, and I typically queue 11 of these at a time to run in parallel on different nodes.
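
Schematically, the batch script looks something like this (a simplified sketch only; the executable names, loop count, and rank/thread split below are placeholders rather than the real code):

#!/bin/bash
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=64      # 64-core nodes
#SBATCH --time=48:00:00

# Hybrid split is a guess: 128 MPI ranks x 2 OMP threads fills 4 x 64 cores.
export OMP_NUM_THREADS=2

for i in $(seq 1 40); do          # loop count is illustrative
    ./setup_step                  # single-CPU commands (placeholder name)

    # Slow step: three hybrid OMP/MPI sub-tasks, each on its own 4-node
    # subset of the allocation, launched concurrently via Intel MPI's
    # mpirun (which calls srun underneath). The real script pins each
    # launch to a distinct set of nodes.
    mpirun -np 128 ./slow_task case_1 &
    mpirun -np 128 ./slow_task case_2 &
    mpirun -np 128 ./slow_task case_3 &
    wait                          # ~50 minutes until all three return
done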

After some (irreproducible) time, often one of the three slow tasks hangs. A symptom is that if I try to ssh into the main node of the subtask (which is running 128 MPI ranks across the 4 nodes) I get "Authentication failed". Sometimes I can kill the mpiexec on the main parent node and this propagates, and I can continue (with some fault handling).

I know most people expect a single srun to be used, rather than a complex loop as above. The reason is that it is much, much more efficient to subdivide the problem, and code maintenance is also better with subproblems. This is an established code (it has been around for 20+ years). I wonder if there are some timeouts or something similar which drop connectivity. I also wonder whether repeatedly launching srun subtasks might be doing something beyond what is normally expected.
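
For example, something along these lines would at least show which timeout and keepalive settings the cluster is running with (I am only guessing that one of these might matter):

# Dump timeout/keepalive-related parameters from the Slurm configuration.
scontrol show config | grep -i -E 'timeout|keepalive'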

--
Emeritus Professor Laurence Marks (Laurie)
Northwestern University
"Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Györgyi

Davide DelVento

Dec 20, 2023, 12:35:07 PM12/20/23
to laurenc...@gmail.com, Slurm User Community List
Not an answer to your question, but if the jobs need to be subdivided, why not submit smaller jobs?

Also, this does not sound like a slurm problem, but rather a code or infrastructure issue. 

Finally, are you normally able to ssh into the main node of each subtask? In many places that is not allowed, and you would get the "Authentication failed" error regardless. Some places (but definitely not all) instead allow logging in with something like

srun --jobid <nnnn> --pty bash

where <nnnn> is obviously your job ID. Hope this helps.
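
(On newer Slurm releases the extra step may refuse to share resources with the steps already running in the job; if I remember correctly, adding --overlap, i.e.

srun --jobid <nnnn> --overlap --pty bash

gets around that.)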

Gerhard Strangar

Dec 20, 2023, 1:57:19 PM12/20/23
to slurm...@lists.schedmd.com
Laurence Marks wrote:

> After some (irreproducible) time, often one of the three slow tasks hangs.
> A symptom is that if I try and ssh into the main node of the subtask (which
> is running 128 mpi on the 4 nodes) I get "Authentication failed".

How about asking an admin to check why it hangs?

Laurence Marks

Dec 20, 2023, 2:39:39 PM12/20/23
to Slurm User Community List
It is a University "supercomputer", not a national facility, so the support staff are not that expert, which is why I am asking here. I am pretty certain that it is some form of communication issue, but beyond that it is not clear.

If I get suggestions such as "why don't they look for ABC in XYZ", then I may be able to persuade them to look at specifics. They will need the coaching, alas.
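
To give an idea of the sort of suggestion I mean (I do not know that any of these points at the actual cause), something along the lines of:

sinfo -R                                 # any nodes down/drained, and why
scontrol show node <nodename>            # state of the node the hung sub-task runs on
grep -i error /var/log/slurm/slurmd.log  # slurmd log on that node (log path varies by site)

where <nodename> is the node the hung sub-task's mpiexec is running on.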

Renfro, Michael

Dec 20, 2023, 3:41:20 PM12/20/23
to laurenc...@gmail.com, Slurm User Community List

Is this Northwestern’s Quest HPC or another one? I know at least a few of the people involved with Quest, and I wouldn’t have thought they’d be in dire need of coaching.

 

And to follow on from Davide's point, this really sounds like a case for submitting multiple jobs with dependencies between them, as per [1, 2, 3]; a minimal sketch follows the links.

 

[1] https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1795

[2] https://bioinformaticsworkbook.org/Appendix/HPC/SLURM/submitting-dependency-jobs-using-slurm.html#gsc.tab=0

[3] https://slurm.schedmd.com/sbatch.html#OPT_dependency
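
Something like this (script names are placeholders; --parsable just makes sbatch print the bare job ID so it can be fed into the next --dependency):

jid_setup=$(sbatch --parsable setup.sh)
jid_slow=$(sbatch --parsable --dependency=afterok:$jid_setup slow_step.sh)
sbatch --dependency=afterok:$jid_slow next_iteration.sh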

 

From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Laurence Marks <laurenc...@gmail.com>
Date: Wednesday, December 20, 2023 at 1:40 PM
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Reproducible irreproducible problem (timeout?)



Laurence Marks

Dec 20, 2023, 3:45:54 PM12/20/23
to Renfro, Michael, Slurm User Community List
Dependencies are not an appropriate approach.

---
Professor Laurence Marks (Laurie)
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what nobody else has thought" Albert Szent-Györgyi

Laurence Marks

Dec 20, 2023, 3:54:29 PM12/20/23
to Renfro, Michael, Slurm User Community List
In terms of dependencies, please think about timing. Currently one loop takes ~70 minutes, and say there is a queue time T for any job. If you split the slow part to run serially, one loop takes ~190 minutes + 2T. The time for N iterations would be ~190N + 570T versus ~70N + T.

---
Professor Laurence Marks (Laurie)
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what nobody else has thought" Albert Szent-Györgyi