Dear distinguished list,
I am new to SLURM. I have recently installed SLURM 20.11.3 on two separate three-node clusters. The first cluster is for testing purposes and uses three small RHEL 7.7 VMs (8 cores, 8 GB RAM). After a successful installation and some sbatch testing, I proceeded to the second cluster.
The production cluster is running on three RHEL 7.7 physical servers, two sockets, 24 cores each, 2 threads per core and 1TB RAM. This installation was also successful.
Yesterday, a user brought an issue to my attention. They reported that when submitting a job via srun using the dependency option (-d afterany:aaaa:bbbb:cccc:dddd...), the dependency was not being honored.
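As a rough sketch of the form of that request, the snippet below assembles an afterany dependency list and prints the resulting srun command. The job IDs are placeholders (not the user's actual jobs), the script path is the one from my test further down, and the command is only printed rather than executed:

```shell
#!/bin/sh
# Hypothetical job IDs that the new job should wait for.
job_ids="1001 1002 1003"

# Build the dependency list in the form afterany:<id>:<id>:...
dep="afterany"
for id in $job_ids; do
    dep="$dep:$id"
done

# Per the srun synopsis (srun [OPTIONS...] executable [args...]),
# the option is written ahead of the script path.
cmd="srun --dependency=$dep /wks01/data/slurm-jobs/clean-up.sh"

# Print the command instead of running it, so this works without a cluster.
echo "$cmd"
```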
I began by testing the srun -d option in my test cluster, which worked like a charm. The srun job went into the Pending state, waiting for resources. Once jobid 1296 completed, the srun executed.
$ srun /wks01/data/slurm-jobs/clean-up.sh -d afterany:1296
srun: job 1301 queued and waiting for resources
$ squeue
JOBID  PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
1301   slurm      clean-up  zzgowand  PD  0:00  1      (Resources)
1291   slurm      hostname  zzgowand  R   0:56  1      r7slurm01
1292   slurm      hostname  zzgowand  R   0:56  1      r7slurm01
1293   slurm      hostname  zzgowand  R   0:56  1      r7slurm01
1294   slurm      hostname  zzgowand  R   0:55  1      r7slurm01
1295   slurm      hostname  zzgowand  R   0:55  1      r7slurm02
1296   slurm      hostname  zzgowand  R   0:53  1      r7slurm02
However, when I ran the exact same test on the production cluster, it behaved as if the -d option had not been supplied. The job went straight into the Running state but never actually executed. It sat in that state for several minutes, when it should have finished in a few seconds. I finally aborted the foreground srun, which produced the following message:
srun launch/slurm: launch_p_step_launch: ... aborted before step completely launched.
Has anyone experienced this before?
Thank you.
Darin Gowan