Dear slurm-user list,
I got this error:
Unable to start service slurmctld: Job for slurmctld.service failed
because the control process exited with error code.
See "systemctl status slurmctld.service" and
"journalctl -xeu slurmctld.service" for details.
but the output of "systemctl status slurmctld.service" shows nothing
suspicious to me:
slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: active (running) since Wed 2024-02-07 15:50:56 UTC; 19min ago
Main PID: 51552 (slurmctld)
Tasks: 21 (limit: 9363)
Memory: 10.4M
CPU: 1min 16.088s
CGroup: /system.slice/slurmctld.service
├─51552 /usr/sbin/slurmctld --systemd
└─51553 "slurmctld: slurmscriptd" "" "" "" "" "" ""
Feb 07 15:58:21 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: sched: _slurm_rpc_allocate_resources JobId=3 NodeList=(null) usec=959
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 WTERMSIG 2
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 cancelled by interactive user
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=3 done
Feb 07 15:58:23 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _slurm_rpc_complete_job_allocation: JobId=3 error Job/step already completing or completed
Feb 07 15:58:42 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: sched: _slurm_rpc_allocate_resources JobId=4 NodeList=cluster-master-2vt2bqh7ahec04c,cluster-worker-2vt2bqh7ahec04c-2 usec=512
Feb 07 16:06:04 cluster-master-2vt2bqh7ahec04c slurmctld[51553]: slurmctld: error: _run_script: JobId=0 resumeprog exit status 1:0
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=4 WTERMSIG 2
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _job_complete: JobId=4 done
Feb 07 16:09:33 cluster-master-2vt2bqh7ahec04c slurmctld[51552]: slurmctld: _slurm_rpc_complete_job_allocation: JobId=4 error Job/step already completing or completed
I am unsure how to debug this further. It might be a side effect of an
earlier problem I tried to fix (a few deprecated keys in the
configuration).
I will restart the entire cluster with those changes applied to rule
out any follow-up errors, but maybe a fellow list user can spot
something obvious.
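In case it helps with narrowing this down, here is a minimal sketch of
how the journal could be filtered for lines that slurmctld itself tags
as errors. The function name slurm_errors is hypothetical (it is not
part of Slurm), and it assumes the usual "error:" prefix seen in the
log excerpt above:

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm: keep only the journal lines
# that carry slurmctld's "error:" tag. Reads journal text on stdin.
slurm_errors() {
    grep 'error:'
}
```

It could be used as
"journalctl -u slurmctld.service --no-pager | slurm_errors".
Note that it deliberately matches "error:" with the colon, so lines
such as "JobId=4 error Job/step already completing or completed" are
not included.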
Best regards,
Xaver