I'm pretty new to using Slurm with cloud infrastructure, and I'm struggling with two issues in setting up my cluster:
My nodes keep getting stuck in the "DOWN*+CLOUD" state. This happens after the first job is allocated to a node: the node then gets stuck. I keep setting their state to POWER_UP, IDLE, or RESUME, and I can see them go IDLE for a few seconds before returning to DOWN*+CLOUD. What's the correct way to get Slurm to just forget about these nodes and move on?
I think the above is also related to the fact that my first job never truly dies. It gets stuck in the CG (completing) state and doesn't respond to scancel. Is there a way to force scancel?
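For concreteness, the kind of commands involved look roughly like the sketch below. The node name and job ID are placeholders, not values from my cluster, and the `run` wrapper is only a hypothetical helper that prints each command instead of executing it, so the script is safe to read through on a machine without Slurm.

```shell
#!/bin/sh
# Placeholders (hypothetical): substitute a real node name and job ID.
NODE="cloud-node-1"
JOBID="12345"
DRY_RUN=1   # set to 0 on a real cluster to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Inspect the node and the reason Slurm recorded for marking it down.
run scontrol show node "$NODE"
run sinfo -R

# State changes of the kind described above.
run scontrol update NodeName="$NODE" State=RESUME
run scontrol update NodeName="$NODE" State=POWER_DOWN

# Ask Slurm to send SIGKILL to the job stuck in CG.
run scancel --signal=KILL "$JOBID"
```

In my case the state changes take effect only briefly, and the job stays in CG after scancel.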
And in general, which config settings should I look at to make sure a cloud cluster is left in a healthy state after job allocation? I know my question is vague; I'm still trying to wrap my head around Slurm's cloud mode.
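To make the question a bit less vague, these are the slurm.conf power-saving settings that, as far as I can tell, govern how cloud nodes are started and stopped. This is only a sketch: every value, script path, and node name below is an illustrative assumption, not my actual configuration.

```
# slurm.conf fragment (illustrative values; paths and node names are hypothetical)
ResumeProgram=/opt/slurm/bin/resume.sh     # hypothetical script that launches a cloud instance
SuspendProgram=/opt/slurm/bin/suspend.sh   # hypothetical script that terminates the instance
ResumeTimeout=600     # seconds a node has to become available after a resume request before it is marked DOWN
SuspendTimeout=120    # seconds allowed between a suspend request and the node being shut down
SuspendTime=300       # seconds a node may sit idle before it is powered down
ReturnToService=2     # a DOWN node becomes usable again once it registers with a valid configuration
UnkillableStepTimeout=120   # seconds to wait after SIGKILL before a step's processes are treated as unkillable

NodeName=cloud-node-[1-4] CPUs=4 State=CLOUD
```

My guess is that a ResumeTimeout shorter than the real instance boot time would explain nodes flipping back to DOWN, but I have not confirmed that.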