[slurm-users] Jobs showing running but not running


Sushil Mishra via slurm-users

May 29, 2024, 1:18:27 PM
to Slurm User Community List
Hi All,

I'm managing a 4-node cluster with Slurm, and one of the compute nodes appears to be having issues. 'squeue' on the head node reports that jobs are running, but when I connect to the problematic node I see no active processes and the GPUs are idle.

[sushil@ccbrc ~]$ sinfo -Nel
Wed May 29 12:00:08 2024
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
gag            1     defq*       mixed 48     2:24:1 370000        0      1   (null) none                
gag            1   glycore       mixed 48     2:24:1 370000        0      1   (null) none                
glyco1         1     defq* completing* 128    2:64:1 500000        0      1   (null) none                
glyco1         1   glycore completing* 128    2:64:1 500000        0      1   (null) none                
glyco2         1     defq*       mixed 128    2:64:1 500000        0      1   (null) none                
glyco2         1   glycore       mixed 128    2:64:1 500000        0      1   (null) none                
mannose        1     defq*       mixed 24     2:12:1 180000        0      1   (null) none                
mannose        1   glycore       mixed 24     2:12:1 180000        0      1   (null) none  
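
To confirm the mismatch between what the controller reports and what is actually happening on the node, a cross-check along these lines can help (a sketch only; the node name and format string are examples):

# On the head node: what Slurm believes is running on glyco1
squeue --nodelist=glyco1 --format="%i %u %T %M %R"

# On glyco1 itself: any job steps alive, and is the GPU doing anything?
pgrep -af slurmstepd
nvidia-smi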


On glyco1 (affected node!):
squeue # gets stuck
sudo systemctl restart slurmd  # gets stuck
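
When systemctl hangs, looking at slurmd directly instead of going through systemd can be informative (a sketch; the log path depends on SlurmdLogFile in slurm.conf):

pgrep -af slurmd                                     # is the daemon still alive?
ps -o pid,stat,wchan:32,cmd -p $(pgrep -o slurmd)    # a 'D' state means it is blocked in the kernel
journalctl -u slurmd -n 50 --no-pager                # recent daemon messages
tail -n 50 /var/log/slurm/slurmd.log                 # assumed default log location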

I tried the following to clear the jobs stuck in the CG state, but any new job still shows as 'running' without actually running:
scontrol update nodename=glyco1 state=down reason=cg
scontrol update nodename=glyco1 state=resume reason=cg
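
For completeness, the stuck CG jobs can also be listed and cancelled explicitly from the head node, where squeue still responds (a sketch; the job IDs below are placeholders):

squeue --states=COMPLETING --nodelist=glyco1 --format="%i %u %T"
scancel 12345 12346    # the job IDs reported by the command above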

There is no I/O issue on that node, and all file systems are below 30% usage. Any advice on how to resolve this without rebooting the machine?

Best,
Sushil

Laura Hild via slurm-users

May 29, 2024, 4:44:58 PM
to Sushil Mishra, Slurm User Community List
> sudo systemctl restart slurmd # gets stuck

Are you able to restart other services on this host? Anything weird in its dmesg?
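
A 'systemctl restart slurmd' that hangs often points at processes stuck in uninterruptible sleep, typically on storage. A quick check might look like this (a sketch):

dmesg -T | grep -iE 'hung|blocked'
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'    # anything stuck in D state?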


Ryan Novosielski via slurm-users

May 29, 2024, 4:49:25 PM
to Sushil Mishra, Slurm User Community List
One of the other states — down or fail, from memory — should cause it to completely drop the job. 
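
Roughly along these lines, if the down/resume cycle already tried did not clear them (a sketch; the reason string is arbitrary):

scontrol update nodename=glyco1 state=fail reason="stuck CG jobs"
scontrol update nodename=glyco1 state=resume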

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'
