[slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?


Thomas Arildsen
Apr 14, 2021, 3:23:45 AM
to slurm...@schedmd.com
I administer a Slurm cluster with many users. Operation of the cluster
currently appears completely normal for all users except one. For this one
user, every attempt to run a command through Slurm gets killed after 20-25
seconds (I suspect the trigger is another job rather than the elapsed time
itself; see further down).
The following minimal example reproduces the error:

$ sudo -u <the_user> srun --pty sleep 25
srun: job 110962 queued and waiting for resources
srun: job 110962 has been allocated resources
srun: Force Terminated job 110962
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 110962.0 ON <node> CANCELLED AT 2021-04-09T16:33:35 ***
srun: error: <node>: task 0: Terminated

When this happens, I find this line in the slurmctld log:

_slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110962 uid <the_users_uid>
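
In case it is useful, this is roughly how I have been inspecting the
accounting record for such a killed job; the field list is just an example:

$ sacct -j 110962 --format=JobID,JobName,State,ExitCode,Elapsed,Start,End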

It only happens for '<the_user>' and not for any other user that I know
of. This very similar but shorter-running example works fine:

$ sudo -u <the_user> srun --pty sleep 20
srun: job 110963 queued and waiting for resources
srun: job 110963 has been allocated resources

Note that when I run srun --pty sleep 20 as myself, srun does not
output the two srun: job... lines. This seems to me to be an additional
indication that srun is subject to some different settings for
'<the_user>'.
All settings that I have been able to inspect appear identical for
'<the_user>' and for other users. I have checked that 'MaxWall' is not set
for this user, nor for any other user. Other users belonging to the same
Slurm account do not experience this problem.
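
For reference, this is roughly how I compared the limits; the exact format
fields are only an example and may need adjusting:

$ sacctmgr show assoc where user=<the_user> format=User,Account,Partition,QOS,MaxWall,MaxJobs,GrpTRES
$ sacctmgr show qos format=Name,MaxWall,MaxTRESPU

Comparing this output for '<the_user>' and for the other users in the
account is what led me to conclude that the settings look identical.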

When this unfortunate user's jobs get allocated, I see messages like this
in '/var/log/slurm/slurmctld.log':

sched: _slurm_rpc_allocate_resources JobId=111855 NodeList=<node>

and shortly after, I see this message:

select/cons_tres: common_job_test: no job_resources info for JobId=110722_* rc=0

Job 110722_* is an array job submitted by another user; it is pending due
to 'QOSMaxGRESPerUser'. One pending task of this array job (110722_57)
eventually ends up taking over job 111855's CPU cores when 111855 gets
killed. This leads me to believe that 110722_57 causes 111855 to be
killed. However, 110722_57 remains pending afterwards.
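
For what it is worth, this is roughly how I have been checking that array
task's state and pending reason (the grep is only there to trim the output):

$ scontrol show job 110722_57 | grep -E 'JobState|Reason'
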
Some of the things I fail to understand here are:
- Why does a pending job kill another job, yet remain pending
afterwards?
- Why does the pending job even have privileges to kill another job
in the first place?
- Why does this only affect '<the_user>'s jobs but not those of other
users?

None of this is intended behaviour. I am guessing it must be caused by
some setting specific to '<the_user>', but I cannot figure out what it is.
If we admins somehow caused such a setting, it was unintentional.

NB: some details have been anonymized as <something> above.

I hope someone has a clue what is going on here. Thanks in advance,

Thomas Arildsen

--
Special Consultant | CLAAUDIA

Phone: (+45) 9940 9844 | Email: ta...@its.aau.dk | Web:
https://www.claaudia.aau.dk/
Aalborg University | Niels Jernes Vej 14, 3-013, 9220 Aalborg Ø,
Denmark

Thomas Arildsen
Apr 14, 2021, 3:54:21 AM
to slurm...@schedmd.com
Oh, and I forgot to mention that we are using Slurm version 20.11.3.
Best,

Thomas

Ole Holm Nielsen
Apr 15, 2021, 3:08:10 AM
to slurm...@lists.schedmd.com
Hi Thomas,

I wonder if your problem is related to the one reported in this list thread:
https://lists.schedmd.com/pipermail/slurm-users/2021-April/007107.html

You could try to restart the slurmctld service, and also make sure your
configuration (slurm.conf etc.) has been pushed correctly to the slurmd nodes.
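
Untested, but roughly something like this; the config path and node names
are only examples and depend on your site:

# systemctl restart slurmctld
# md5sum /etc/slurm/slurm.conf
# for n in <node1> <node2>; do ssh $n md5sum /etc/slurm/slurm.conf; done
# scontrol reconfigure

i.e. restart the controller, check that slurm.conf has the same checksum
on the head node and the compute nodes, and then ask the daemons to
re-read the configuration.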

/Ole

Thomas Arildsen
Apr 15, 2021, 3:46:49 PM
to slurm...@schedmd.com
Hi Ole

Thanks for the suggestion. I am afraid that is not the solution in our case; at least, restarting `slurmdbd` and `slurmctld` on the head node has made no difference.
It puzzles me why Slurm appears to treat this one user differently from all others; even other users under the same account are doing fine.
I think the possible relation to another (array) job that I speculated about in my original message was just coincidental.
I have now tried the following three steps in the hope of somehow fixing the problem, none of which has changed the situation:

- Deleted the user from Slurm using `sacctmgr remove user` and re-created the user afterwards (commands sketched after this list).
- Removed the user's home directory and let the login procedure populate a new home directory from scratch for the user.
- Restarted `slurmdbd` and `slurmctld` as mentioned above.
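
For reference, the remove/re-create step was essentially the following; the account name is anonymized like everything else:

$ sacctmgr remove user name=<the_user>
$ sacctmgr add user name=<the_user> account=<the_account>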

Thomas