[slurm-users] Extract job information after completion


O'Grady, Paul Christopher

Apr 27, 2021, 11:14:41 AM
to slurm...@lists.schedmd.com
Sometimes when a Slurm job fails I want to see what the user did, by getting the command/workdir/stdout/stderr information. I can see that with "scontrol show job <jobid>". However, after the job is done that command no longer works and reports "invalid job id". I tried sacct, which does keep history, but I can only find the "workdir" field there, not stdout/stderr/cmd. I also tried the "jobname" field of sacct, but when a job is submitted with the "--wrap" option of sbatch, jobname only shows the string "wrap", which isn't useful.

My question: is there an easy way for me to get command/workdir/stdout/stderr information after a job has completed? Thanks!

chris


Sean McGrath

Apr 27, 2021, 12:19:35 PM
to Slurm User Community List
Hi,

On Tue, Apr 27, 2021 at 03:14:04PM +0000, O'Grady, Paul Christopher wrote:

> Sometimes when a slurm job fails I want to see what a user did, getting the command/workdir/stdout/stderr information. I can see that with "scontrol show job <jobid>". However, after the job is done that command doesn't seem to work anymore, saying "invalid job id". I try to use sacct, which seems to save history, but I can only find the "workdir" parameter there, not stdout/stderr/cmd. I tried using the "jobname" field of sacct, but when I use the "wrap" option of sbatch, then jobname only shows the string "wrap" which isn't useful.
>
> My question: is there an easy way for me to get command/workdir/stdout/stderr information after a job has completed? Thanks!

Not sure if this is what you need. We do the following:

In slurm.conf set:

EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld

Which does a number of things, including the following:

root@pople01:/etc/slurm # tail -6 slurm.epilogslurmctld
# 20150210 - Sean
# Save the details of a job by doing an scontrol show job=job
# So it can be referenced for troubleshooting in future if needed
# should be run by the slurm epilog

/usr/bin/scontrol show job="$SLURM_JOB_ID" > "$recordsdir/$SLURM_JOB_ID.record"
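
The fragment above depends on a `recordsdir` variable that is set earlier in the script and not shown in the `tail` output. For anyone wanting to try the same approach, here is a minimal self-contained sketch rather than Sean's actual script. The default records path, the `RECORDS_DIR` and `SCONTROL` override variables, and the `save_job_record` function name are all hypothetical; the per-year subdirectory only mirrors the `2021` seen in the example path. It relies on slurmctld exporting `SLURM_JOB_ID` to the EpilogSlurmctld program.

```shell
# Sketch of an EpilogSlurmctld fragment that saves "scontrol show job" output.
# ASSUMPTION: the default path below is hypothetical; point it at storage
# that the user running slurmctld can write to.
recordsdir="${RECORDS_DIR:-/var/spool/slurm_job_records}/$(date +%Y)"
scontrol_bin="${SCONTROL:-/usr/bin/scontrol}"

# Write the full job description to <recordsdir>/<jobid>.record
save_job_record() {
    local jobid="$1"
    mkdir -p "$recordsdir" || return 1
    "$scontrol_bin" show job="$jobid" > "$recordsdir/$jobid.record"
}

# Only act when a job ID is present in the environment,
# as it is when slurmctld runs the epilog.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    save_job_record "$SLURM_JOB_ID"
fi
```

Because the record is captured while the job is still known to slurmctld, it will typically show a state such as COMPLETING rather than the final state.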

So it writes the following to the file system:

root@pople01:/etc/slurm # cat /home/support/root/slurm_job_records/pople/2021/6.record
JobId=6 JobName=sbatch.sh
UserId=smcgrat(5446) GroupId=smcgrat(9249) MCS_label=N/A
Priority=1104631 Nice=0 Account=tchpc QOS=normal
JobState=COMPLETING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2021-04-27T15:56:12 EligibleTime=2021-04-27T15:56:12
AccrueTime=2021-04-27T15:56:12
StartTime=2021-04-27T15:56:13 EndTime=2021-04-27T15:56:13 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-27T15:56:13
Partition=compute AllocNode:Sid=pople01:14314
ReqNodeList=(null) ExcNodeList=(null)
NodeList=
BatchHost=pople-n001
NumNodes=2 NumCPUs=32 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=126000M,node=2,billing=32
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=63000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/users/smcgrat/sbatch.sh
WorkDir=/home/users/smcgrat
StdErr=/home/users/smcgrat/slurm-6.out
StdIn=/dev/null
StdOut=/home/users/smcgrat/slurm-6.out
Power=
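
For the original question (command/workdir/stdout/stderr), only a few lines of such a record matter. A small sketch for pulling them out of a saved record follows; the `job_io_summary` function name is made up for illustration, and it assumes each of these fields sits on its own line, as in the record shown above.

```shell
# Print only the Command, WorkDir, StdErr and StdOut lines of a saved record,
# with any leading indentation removed. StdIn and all other fields are skipped.
job_io_summary() {
    sed -n -E 's/^[[:space:]]*((Command|WorkDir|StdErr|StdOut)=.*)$/\1/p' "$1"
}
```

Usage would look like `job_io_summary /home/support/root/slurm_job_records/pople/2021/6.record`, using the record path from the example above.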

Hope that helps.

Sean



--
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin

sean.m...@tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725


O'Grady, Paul Christopher

Apr 27, 2021, 6:36:51 PM
to slurm...@lists.schedmd.com


On Apr 27, 2021, at 10:44 AM, slurm-use...@lists.schedmd.com wrote:

> In slurm.conf set:
>
> EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld
>
> Which does a number of things, including the following:
>
> root@pople01:/etc/slurm # tail -6 slurm.epilogslurmctld
> # 20150210 - Sean
> # Save the details of a job by doing an scontrol show job=job
> # So it can be referenced for troubleshooting in future if needed
> # should be run by the slurm epilog
>
> /usr/bin/scontrol show job="$SLURM_JOB_ID" > "$recordsdir/$SLURM_JOBID.record"


Sean, that's a nice idea for a workaround. It's also good to know that somebody else has run into this problem of getting at completed job information for debugging purposes.

Thanks for taking the time to share your solution,

chris
