We just ran into the same problem.
We recently upgraded to Slurm 24.11.7, and I woke up today to two crashed controllers. They would crash again immediately on restart.
Troubleshooting revealed multi-partition jobs to be the cause. We added the following job_submit.lua filter as a workaround:
-- Temporary safety guard for Slurm 24.11.7 crash investigation.
-- Reject multi-partition submissions
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.partition ~= nil and
       string.find(job_desc.partition, ",") then
        slurm.log_user("Multi-partition jobs are temporarily disabled. Please submit to exactly one partition.")
        slurm.log_info("Rejected multi-partition job from uid=%s partition=%s",
                       tostring(submit_uid),
                       tostring(job_desc.partition))
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    if job_desc.partition ~= nil and
       string.find(job_desc.partition, ",") then
        slurm.log_user("Changing jobs to multiple partitions is temporarily disabled. Please choose exactly one partition.")
        slurm.log_info("Rejected multi-partition job modification from uid=%s partition=%s",
                       tostring(modify_uid),
                       tostring(job_desc.partition))
        return slurm.ERROR
    end
    return slurm.SUCCESS
end
This has stabilized our cluster, and luckily we can operate without multi-partition jobs, but it was a really nasty surprise.
What did you end up doing about this problem? Is it specific to Slurm 24.11.7, so that I just need to upgrade again?
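For anyone wanting to try the same guard, here is a minimal sketch of how it might be enabled. It assumes the script above is saved as job_submit.lua in the same directory as slurm.conf and that no other job submit plugins are configured yet; if you already have a JobSubmitPlugins line, add lua to it rather than replacing it.

```
# slurm.conf
# Assumption: there is no existing JobSubmitPlugins entry on this cluster.
JobSubmitPlugins=lua
```

slurmctld will most likely need a restart to pick up a newly added plugin, so it is worth testing with a single multi-partition submission afterwards to confirm that the rejection message appears.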
Kind regards, Matze
--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com