I can't speak to what happens on node failure, but I can at least get you a greatly simplified pair of scripts that will run only one copy on each node allocated:
#!/bin/bash
# notarray.sh
#SBATCH --nodes=28
#SBATCH --ntasks-per-node=1
#SBATCH --no-kill
echo "notarray.sh is running on $(hostname)"
srun --no-kill somescript.sh
and
#!/bin/bash
# somescript.sh
echo "somescript.sh is running on $(hostname)"
I verified this by submitting the job with "sbatch notarray.sh": one copy of somescript.sh ran on each allocated node.
No need to pass srun a set of parameters for how many tasks to run, since it can figure that out from the sbatch context.
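To see what srun has to work with, a minimal sketch like the one below could be dropped into the batch script. It only prints environment variables that sbatch normally exports for the job. The script name show_alloc.sh and the function show_alloc are hypothetical, and the "unset" fallback is my assumption, added because some of these variables are only set when the matching option was given.

```shell
#!/bin/bash
# show_alloc.sh (hypothetical helper, not one of the two scripts above)
# Print the allocation details that sbatch exports into the batch
# script's environment. Each value falls back to "unset" so the
# function also works outside of a Slurm job.
show_alloc() {
    echo "nodes=${SLURM_JOB_NUM_NODES:-unset}"
    echo "tasks_per_node=${SLURM_NTASKS_PER_NODE:-unset}"
    echo "tasks=${SLURM_NTASKS:-unset}"
}

show_alloc
```

Calling show_alloc just before the srun line in notarray.sh should show the node count and tasks-per-node requested in the #SBATCH header, which is the context srun picks up.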
From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Robert Peck <rp1...@york.ac.uk>
Date: Friday, April 16, 2021 at 2:40 PM
To: slurm...@schedmd.com <slurm...@schedmd.com>
Subject: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)
Hi Robert,
I hope your day is treating you well.
Thank you for your posts on the Slurm user list.
Would you be interested in a Slurm support contract for your systems at the University of York?
Sites running Slurm with support tell us that it is invaluable and a great return for the organization. Configs optimized by our experts give much better system utilization (which pays for the support contract in and of itself), issues get guaranteed resolutions, and sites no longer have to rely on in-house best-effort support hacks that get very expensive and turn into complicated chaos and potentially down systems.
Additionally, support keeps the Slurm project alive and going strong.
Please let me know your thoughts or if you would like me to reach out to another contact at University of York to chat about this further.
Take care,
Jess Arrington
Executive Director:Global Sales & Alliances je...@schedmd.com | 801-616-7823
240 N 1200 E #203 Lehi, UT 84043