PBS Backend can't handle case where qstat/tracejob don't report info about completed jobs

Kyle Robertson

May 7, 2017, 6:38:13 PM
to gc3pie

I have encountered a cluster using the PBS/Torque scheduling system on which the qstat and tracejob commands give no information about jobs that have finished running; the checkjob command must be used instead. This prevents the PBS backend from determining which jobs have completed, so the Engine class is unable to transition completed jobs to the TERMINATED state. I think new regexes will need to be written to parse the checkjob output, as I believe it differs from the tracejob output.



Here is some sample output from the checkjob command:



job 8219309

AName: test.sh
State: Completed
Creds: user:MYUSERNAME group:MYUSERNAME account:ANACCOUNT class:ACLASS qos:debug
WallTime: 00:00:00 of 00:10:00
SubmitTime: Sun May 7 15:24:45
(Time Queued Total: 00:00:25 Eligible: 00:00:25)

StartTime: Sun May 7 15:25:10
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1

Req[0] TaskCount: 1 Partition: QDR
Memory >= 756M Disk >= 0 Swap >= 0
Dedicated Resources Per Task: PROCS: 1 MEM: 600M

Allocated Nodes:
[ACLUSTERNODE]

IWD: /home/MYUSERNAME/jobs
StartCount: 1
Partition List: QDR,DDR,SHARED
StartPriority: 50009
Reservation '8219309' (-00:01:21 -> 00:08:39 Duration: 00:10:00)
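
A minimal sketch of how the sample above might be parsed, in Python. The function name `parse_checkjob` and the returned dictionary keys are hypothetical and not part of GC3Pie; the regexes are derived only from the single sample shown here, so other installations may format these fields differently.

```python
import re

# Patterns derived from the one checkjob sample above.
# Hypothetical helper code, not taken from GC3Pie.
_JOBID_RE = re.compile(r'^job\s+(?P<jobid>\S+)', re.MULTILINE)
_STATE_RE = re.compile(r'^State:\s*(?P<state>\S+)', re.MULTILINE)
_WALLTIME_RE = re.compile(
    r'^WallTime:\s*(?P<used>[\d:]+)\s+of\s+(?P<limit>[\d:]+)', re.MULTILINE)


def parse_checkjob(output):
    """Extract job ID, state, and wall time from `checkjob` output.

    Fields that cannot be found are returned as None.
    """
    def first(regex, group):
        match = regex.search(output)
        return match.group(group) if match else None

    return {
        'jobid': first(_JOBID_RE, 'jobid'),
        'state': first(_STATE_RE, 'state'),
        'walltime_used': first(_WALLTIME_RE, 'used'),
        'walltime_limit': first(_WALLTIME_RE, 'limit'),
    }


# The first lines of the sample output above.
SAMPLE = """\
job 8219309

AName: test.sh
State: Completed
Creds: user:MYUSERNAME group:MYUSERNAME account:ANACCOUNT class:ACLASS qos:debug
WallTime: 00:00:00 of 00:10:00
"""

info = parse_checkjob(SAMPLE)
print(info['jobid'], info['state'], info['walltime_used'])
```

Note that nothing in this sample appears to report the job's exit status, so a parser along these lines could detect completion but not success or failure.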

Riccardo Murri

May 8, 2017, 8:04:30 AM
to gc3...@googlegroups.com
Hello Kyle,

> I have encountered a cluster using the PBS/Torque scheduling system on
> which the qstat and tracejob commands give no information about job IDs
> that have finished running. Instead the checkjob command must be used. This
> prevents the PBS backend from determining which jobs have completed and
> thus the Engine class is unable to transition completed jobs to the
> TERMINATED state. I think new regexes will need to be written to parse the
> checkjob output, as I think it is different from the tracejob output.

Thanks for reporting! I have already replied to the GC3Pie bug report --
but I thought I could provide (for future reference and for Google
search results) the set of questions that need to be answered for
implementing/extending batch-queuing system support in GC3Pie.

The following information and sample outputs are needed:

* what command is used to submit a job? does it require a shell script
or can it submit arbitrary (even binary) commands? what command-line
option (or other mechanism) is used to specify that a process requires
several CPUs, all on the same node?

* what command is used to check the queued/running/finished status of a
job? if the job is finished, does this check command exit with a
non-zero status? can you provide an example of such output for each of
the three statuses? (queued/running/finished)

* what command is used to check the exit status of a *finished* job? can
you provide a sample output? how long after the job has finished does
this information persist (i.e., it can be queried via the
aforementioned command)?

* what command (if any) is used to check the resource usage of a
*finished* job? (i.e., how much wall time did it consume, how much CPU
time, etc.) can you provide a sample output? how long after the job
has finished does this information persist (i.e., it can be queried
via the aforementioned command)?
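
The exit-status question in the second item matters because a backend may have to try one command and fall back to another once the first no longer knows the job. A rough sketch in Python, assuming (as in the report quoted above) that `qstat` stops reporting finished jobs while `checkjob` still does. The helper names `run_command` and `query_job`, the injectable `run` parameter, and the assumption that a command exits non-zero for a job it does not know are all hypothetical; this is not GC3Pie's API.

```python
import subprocess


def run_command(argv):
    """Run a command and return (exit_code, combined_output)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    return proc.returncode, proc.stdout


def query_job(jobid, commands, run=run_command):
    """Try each status command in turn until one exits with status 0.

    `commands` is a list of argv prefixes, e.g. [['qstat'], ['checkjob']].
    Returns (argv_prefix, output) for the first command that succeeds,
    or (None, None) if every command fails.
    """
    for prefix in commands:
        code, output = run(prefix + [jobid])
        if code == 0:
            return prefix, output
    return None, None


def fake_run(argv):
    """Stand-in for a real cluster; exit code and messages are made up."""
    if argv[0] == 'qstat':
        return 1, 'unknown job\n'
    return 0, 'job 8219309\nState: Completed\n'


used, output = query_job('8219309', [['qstat'], ['checkjob']], run=fake_run)
print(used, output)
```

Passing the runner as a parameter keeps the fallback logic testable without access to a cluster; whether exit codes alone are a reliable signal is exactly what the answers to the questions above would settle.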

Ciao,
R

--
Riccardo Murri, Schwerzenbacherstrasse 2, CH-8606 Nänikon, Switzerland