[slurm-users] unable to kill namd3 process

Shaghuf Rahman
Apr 25, 2023, 10:28:29 AM
to Slurm User Community List
Hi,

We are facing an issue in our environment, and the behaviour looks strange to me. It is specifically associated with the namd3 application.
The issue is described below, along with the cases I have observed.

I am trying to understand how to kill the processes of a namd3 job submitted through sbatch without putting the node into the drain state.

What I observed is that when a user submits a single job on a node and then runs scancel on the namd3 job, the job is killed, the node returns to the idle state, and everything looks as expected.
But when the user submits multiple jobs on a single node and runs scancel on one of them, the node goes into the drain state. The other jobs, however, continue running without issue.

Because of this, multiple nodes end up in the drain state whenever a user runs scancel on a namd3 job.
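For context, the drain reason can be read straight from Slurm with standard commands; in a case like this we would expect a reason along the lines of "Kill task failed" ("node01" below is a placeholder for an affected node):

# List all drained nodes and the reason Slurm recorded for each.
sinfo -R

# Inspect one affected node in detail.
scontrol show node node01 | grep -i reason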

Note: when the user does not run scancel, all jobs complete successfully and the node states remain fine.

This does not happen with any other application, so we suspect the issue lies with the namd3 application itself.
Kindly suggest a solution or any ideas on how to fix this.

Thanks in advance,
Shaghuf Rahman

Shaghuf Rahman
Apr 25, 2023, 11:03:12 AM
to Slurm User Community List
Hi,

I also forgot to mention that the process is still running after the user does scancel, and that the epilog does not clean it up when one job finishes while multiple jobs are running on the node.
We tried the unkillable-step option, but it did not work. The process remains until it is killed manually.
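By the unkillable-step option I mean the slurm.conf settings below; the values and the script path shown are illustrative, not our exact configuration:

# slurm.conf (illustrative values)
# Seconds slurmd waits for a step's processes to die before
# declaring them unkillable and draining the node.
UnkillableStepTimeout=180
# Optional program slurmd runs when a step is declared unkillable.
UnkillableStepProgram=/usr/local/sbin/unkillable_cleanup.sh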

Shaghuf Rahman
May 3, 2023, 6:15:48 AM
to Slurm User Community List
Hi,

As an update, we tried one approach; please find it below.

We tried adding the script below to our epilog to kill the remaining namd3 processes.

# Kill any processes remaining from the job.
# (1234 is the UID this cleanup is restricted to.)
if [ "$SLURM_UID" -eq 1234 ]; then
        # "scontrol listpids" prints a header line (PID JOBID STEPID ...);
        # keep only the PID column and skip the header.
        STUCK_PIDS=$("${SLURM_BIN}scontrol" listpids "$SLURM_JOB_ID" | awk 'NR > 1 {print $1}')
        for kpid in $STUCK_PIDS
        do
                kill -9 "$kpid"
        done
fi


But it did not work out, as the epilog is unable to fetch the required PIDs with the "scontrol listpids" command.
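As a fallback we are considering the sketch below, which skips scontrol entirely and kills leftover namd3 processes by owner, but only when that user has no other jobs still running on the node. This is untested on our side; the squeue and pkill usage is standard, and namd3 is matched by process name:

# Fallback sketch for the epilog: if the job's user has no other
# running jobs on this node, kill any namd3 processes left behind.
if [ "$(squeue -h -t R -u "$SLURM_JOB_USER" -w "$(hostname -s)" | wc -l)" -eq 0 ]; then
        pkill -9 -u "$SLURM_JOB_USER" namd3
fi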

It looks like the slurmd had a problem with a job step that didn't end correctly, and the slurmd wasn't able to kill it after the timeout was reached.
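In the meantime we return drained nodes to service by hand once the leftover process has been killed ("node01" is again a placeholder):

# Clear the drain state after manual cleanup.
scontrol update NodeName=node01 State=RESUME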

Any help would be much appreciated.

Thanks,
Shaghuf Rahman
