Joseph - Here's an example: a simple job submission and its
corresponding log output on a newly deployed cluster. Note that with
'sleep 10' commented out, the job completes successfully.
-r
$ cat sleep10.sh
#!/bin/bash
date
sleep 10
date
$ sbatch sleep10.sh
$ sudo cat /apps/slurm/log/slurmctld.log
[...]
[2019-06-24T16:21:45.292] _slurm_rpc_submit_batch_job: JobId=4 InitPrio=4294901757 usec=1828
[2019-06-24T16:21:45.376] sched: Allocate JobId=4 NodeList=tm1-compute00000 #CPUs=2 Partition=tm-nc4-mem16
[2019-06-24T16:22:49.123] Node tm1-compute00000 now responding
[2019-06-24T16:23:01.407] update_node: node tm1-compute00000 reason set to: Instance stopped/deleted
[2019-06-24T16:23:01.407] requeue job JobId=4 due to failure of node tm1-compute00000
[2019-06-24T16:23:01.407] Requeuing JobId=4
[2019-06-24T16:23:01.407] update_node: node tm1-compute00000 state set to DOWN
[2019-06-24T16:23:01.423] node_did_resp: node tm1-compute00000 returned to service
[2019-06-24T16:24:01.814] update_node: node tm1-compute00000 reason set to: Instance stopped/deleted
[2019-06-24T16:24:01.814] update_node: node tm1-compute00000 state set to DOWN
[2019-06-24T16:25:02.726] sched: Allocate JobId=4 NodeList=tm1-compute00001 #CPUs=2 Partition=tm-nc4-mem16
[2019-06-24T16:25:38.951] Node tm1-compute00001 now responding
[2019-06-24T16:26:01.608] update_node: node tm1-compute00001 reason set to: Instance stopped/deleted
[2019-06-24T16:26:01.608] requeue job JobId=4 due to failure of node tm1-compute00001
[2019-06-24T16:26:01.608] Requeuing JobId=4
[2019-06-24T16:26:01.608] update_node: node tm1-compute00001 state set to DOWN
[2019-06-24T16:26:01.624] node_did_resp: node tm1-compute00001 returned to service
[2019-06-24T16:26:46.922] node_did_resp: node tm1-compute00000 returned to service
[2019-06-24T16:27:02.010] update_node: node tm1-compute00000 reason set to: Instance stopped/deleted
[2019-06-24T16:27:02.010] update_node: node tm1-compute00000 state set to DOWN
[2019-06-24T16:27:02.010] update_node: node tm1-compute00001 reason set to: Instance stopped/deleted
[2019-06-24T16:27:02.010] update_node: node tm1-compute00001 state set to DOWN
[2019-06-24T16:28:02.205] sched: Allocate JobId=4 NodeList=tm1-compute00002 #CPUs=2 Partition=tm-nc4-mem16
[2019-06-24T16:28:41.013] Node tm1-compute00002 now responding
[2019-06-24T16:29:01.832] update_node: node tm1-compute00002 reason set to: Instance stopped/deleted
[2019-06-24T16:29:01.832] requeue job JobId=4 due to failure of node tm1-compute00002
[2019-06-24T16:29:01.832] Requeuing JobId=4
[2019-06-24T16:29:01.832] update_node: node tm1-compute00002 state set to DOWN
[2019-06-24T16:29:01.847] node_did_resp: node tm1-compute00002 returned to service
[2019-06-24T16:30:02.231] update_node: node tm1-compute00002 reason set to: Instance stopped/deleted
[2019-06-24T16:30:02.231] update_node: node tm1-compute00002 state set to DOWN
[2019-06-24T16:31:02.499] sched: Allocate JobId=4 NodeList=tm1-compute00003 #CPUs=2 Partition=tm-nc4-mem16
[2019-06-24T16:31:38.717] Node tm1-compute00003 now responding
[2019-06-24T16:32:01.995] update_node: node tm1-compute00003 reason set to: Instance stopped/deleted
[2019-06-24T16:32:01.996] requeue job JobId=4 due to failure of node tm1-compute00003
[2019-06-24T16:32:01.996] Requeuing JobId=4
[2019-06-24T16:32:01.996] update_node: node tm1-compute00003 state set to DOWN
[2019-06-24T16:32:02.012] node_did_resp: node tm1-compute00003 returned to service
[2019-06-24T16:33:01.378] update_node: node tm1-compute00003 reason set to: Instance stopped/deleted
[2019-06-24T16:33:01.379] update_node: node tm1-compute00003 state set to DOWN
[2019-06-24T16:34:19.572] backfill: Started JobId=4 in tm-nc4-mem16 on tm1-compute00004
[2019-06-24T16:34:58.373] Node tm1-compute00004 now responding
[2019-06-24T16:35:02.171] update_node: node tm1-compute00004 reason set to: Instance stopped/deleted
[2019-06-24T16:35:02.171] requeue job JobId=4 due to failure of node tm1-compute00004
[2019-06-24T16:35:02.171] Requeuing JobId=4
[2019-06-24T16:35:02.171] update_node: node tm1-compute00004 state set to DOWN
[2019-06-24T16:35:02.187] node_did_resp: node tm1-compute00004 returned to service
[2019-06-24T16:36:01.559] update_node: node tm1-compute00004 reason set to: Instance stopped/deleted
[2019-06-24T16:36:01.559] update_node: node tm1-compute00004 state set to DOWN
[2019-06-24T16:37:19.574] backfill: Started JobId=4 in tm-nc4-mem16 on tm1-compute00005
[2019-06-24T16:37:58.411] Node tm1-compute00005 now responding
[2019-06-24T16:38:01.354] update_node: node tm1-compute00005 reason set to: Instance stopped/deleted
[2019-06-24T16:38:01.354] requeue job JobId=4 due to failure of node tm1-compute00005
[2019-06-24T16:38:01.354] Requeuing JobId=4
[2019-06-24T16:38:01.354] update_node: node tm1-compute00005 state set to DOWN
[2019-06-24T16:38:01.370] node_did_resp: node tm1-compute00005 returned to service
[2019-06-24T16:39:01.742] update_node: node tm1-compute00005 reason set to: Instance stopped/deleted
[2019-06-24T16:39:01.742] update_node: node tm1-compute00005 state set to DOWN
[2019-06-24T16:39:08.205] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=4 uid 326316723
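As an aside, the allocate/fail/requeue cycle above is easy to tally straight from the log. This is just a quick sketch assuming the log path and message format shown above:

```shell
# Count how many times the job was requeued, per failed node,
# by extracting the node name from each "requeue job" line.
grep 'requeue job' /apps/slurm/log/slurmctld.log \
  | sed 's/.*failure of node //' \
  | sort | uniq -c
```

Here each of tm1-compute00000 through tm1-compute00005 shows up exactly once, i.e. every node that picked up the job was marked failed within about a minute of allocation.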
$ sinfo -l
Mon Jun 24 16:43:47 2019
PARTITION     AVAIL TIMELIMIT JOB_SIZE   ROOT OVERSUBS GROUPS NODES STATE NODELIST
tm-nc4-mem16* up    infinite  1-infinite no   NO       all    4     down% tm1-compute[00002-00005]
tm-nc4-mem16* up    infinite  1-infinite no   NO       all    16    idle~ tm1-compute[00000-00001,00006-00019]
tm-nc16-mem32 up    infinite  1-infinite no   NO       all    20    idle~ tm1-compute[01000-01019]
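For completeness: once whatever is setting the "Instance stopped/deleted" reason is sorted out, the nodes left in the down% state above can be returned to service manually. This is the standard scontrol invocation (node range copied from the sinfo output; needs Slurm admin privileges, so treat it as an illustrative fragment rather than something to run blindly):

```shell
# Clear the DOWN state on the affected compute nodes so the
# scheduler will consider them for allocation again.
sudo scontrol update NodeName=tm1-compute[00002-00005] State=RESUME
```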
Joseph Schoonover wrote on 6/22/19 8:07 AM: