Nodes stuck in DOWN*+CLOUD state, job stuck in CG

Milad Alizadeh

Oct 3, 2022, 6:58:08 AM
to google-cloud-slurm-discuss
I'm pretty new to using Slurm on cloud infrastructure and I'm struggling with two issues in setting up my cluster:

My nodes keep getting stuck in the "DOWN*+CLOUD" state. This happens after my first job is allocated: the node then gets stuck. I keep trying to set the node state to POWER_UP, IDLE, or RESUME, and I can see the nodes go IDLE for a few seconds before returning to DOWN*+CLOUD. What's the correct way to get Slurm to just forget about these nodes and move on?

I think the above is also related to the fact that my first job never truly dies. It gets stuck in the CG state and doesn't respond to scancel. Is there a way to force scancel?

And in general, which configs should I look at to make sure I leave a cloud cluster in a happy state after job allocation? I know my question is vague; I'm still trying to wrap my head around Slurm's cloud mode.

Milad Alizadeh

Oct 3, 2022, 7:03:11 AM
to google-cloud-slurm-discuss
I forgot to include the slurmctld logs from when I try to set the node state to IDLE:

```
[2022-10-03T10:59:39.939] error: _find_node_record(763): lookup failure for milad-cluster-compute-9
[2022-10-03T10:59:39.939] error: update_node: node milad-cluster-compute-9 does not exist
[2022-10-03T10:59:39.939] _slurm_rpc_update_node for milad-cluster-compute-9: Invalid node name specified
[2022-10-03T10:59:50.020] update_node: node milad-cluster-compute-0-9 state set to IDLE
[2022-10-03T11:00:11.940] update_node: node milad-cluster-compute-0-9 reason set to: Instance stopped/deleted
[2022-10-03T11:00:11.941] update_node: node milad-cluster-compute-0-9 state set to DOWN
```
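For anyone reading these lines later, here is a minimal Python sketch that groups the `update_node` entries by node so the sequence is easier to follow. The regular expressions and the `summarize` helper are illustrative assumptions based only on the six lines pasted above, not an official slurmctld log format.

```python
import re
from collections import defaultdict

# Sample lines copied verbatim from the slurmctld log above.
LOG = """\
[2022-10-03T10:59:39.939] error: _find_node_record(763): lookup failure for milad-cluster-compute-9
[2022-10-03T10:59:39.939] error: update_node: node milad-cluster-compute-9 does not exist
[2022-10-03T10:59:39.939] _slurm_rpc_update_node for milad-cluster-compute-9: Invalid node name specified
[2022-10-03T10:59:50.020] update_node: node milad-cluster-compute-0-9 state set to IDLE
[2022-10-03T11:00:11.940] update_node: node milad-cluster-compute-0-9 reason set to: Instance stopped/deleted
[2022-10-03T11:00:11.941] update_node: node milad-cluster-compute-0-9 state set to DOWN
"""

# Assumed pattern: "[ts] update_node: node <name> state|reason set to[:] <value>".
UPDATE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] update_node: node (?P<node>\S+) "
    r"(?P<field>state|reason) set to:? (?P<value>.+)$"
)
# Assumed pattern for the error logged when the node name is unknown.
LOOKUP_RE = re.compile(r"lookup failure for (?P<node>\S+)$")


def summarize(log_text):
    """Return (per-node list of (timestamp, field, value), unknown node names)."""
    events = defaultdict(list)
    unknown = []
    for line in log_text.splitlines():
        m = UPDATE_RE.match(line)
        if m:
            events[m["node"]].append((m["ts"], m["field"], m["value"]))
            continue
        m = LOOKUP_RE.search(line)
        if m:
            unknown.append(m["node"])
    return dict(events), unknown


if __name__ == "__main__":
    events, unknown = summarize(LOG)
    print("unknown node names:", unknown)
    for node, history in events.items():
        for ts, field, value in history:
            print(node, ts, field, value)
```

On this sample, the first request fails because `milad-cluster-compute-9` is not a known node name, while `milad-cluster-compute-0-9` is set to IDLE and roughly 22 seconds later is marked DOWN with the reason "Instance stopped/deleted".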

Olivier Martin

Oct 3, 2022, 8:50:29 AM
to Milad Alizadeh, google-cloud-slurm-discuss
Hi, how did you deploy Slurm in the first place?
What does the output of sinfo look like? And squeue?

