Challenges with PBS job IDs


Whitcomb, Mr. Tim

Dec 1, 2018, 2:25:25 PM
to cy...@googlegroups.com
I seem to be running into this on several platforms now. It didn't happen to me before, and I'm curious whether others have run into it.

On several of the platforms we run on, we have been tracing odd problems with Cylc job polling not working as expected: jobs would show as failed as soon as they started running, even though they were still running in the PBS queue, which makes for a very disconcerting view in 'cylc mon' vs. 'qstat'. I traced this back to the PBS job ID returned by 'qsub' not matching the PBS job ID shown in the first column of the 'qstat' output. In most cases this is due to truncation of the job ID in the status listing, so I get (as an example):

$ qsub <script>
12345.pbs-localhost1

Then later
$ qstat 12345.pbs-localhost1

Jobid a b c d
--------------------------------------------------------------------------------------
12345.pbs-l <something else> <something else> <etc> <etc>

This has royally messed with the built-in PBS polling mechanism in Cylc. I've gone so far as to add a custom batch manager module on some of these systems that automatically strips everything after the first period in the job ID, and that seems to have fixed things. It does seem very clunky, though, and I'm wondering if others have run into similar situations and how they've addressed them. I suppose the real solution is to get the sysadmins to adjust the qstat output, but I have a dim view of success there.
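
For illustration, the stripping step described above could look roughly like the sketch below. This is not the actual module in use on those systems; the function name `strip_server_suffix` is hypothetical, and the job IDs are the example values from this message.

```python
def strip_server_suffix(job_id):
    """Return the part of a PBS job ID before the first period.

    Hypothetical helper: '12345.pbs-localhost1' becomes '12345'.
    """
    return job_id.strip().split(".", 1)[0]


qsub_id = "12345.pbs-localhost1"  # as printed by qsub
qstat_id = "12345.pbs-l"          # as truncated in the qstat listing

# The raw IDs differ, so an exact-match comparison fails.
print(qsub_id == qstat_id)

# Once the server suffix is stripped from both, they agree.
print(strip_server_suffix(qsub_id) == strip_server_suffix(qstat_id))
```

The same function has to be applied to both the qsub and the qstat side, otherwise the comparison still fails.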

Tim

Hilary Oliver

Dec 1, 2018, 6:15:30 PM
to cy...@googlegroups.com
Hi Tim,

I'm not sure what determines the field width of the qstat columns (PBS itself, or some site-specific configuration?) but it seems odd that qstat would return what is essentially an invalid job ID!  I'll ask my PBS contacts.

I currently have access to a cluster managed with PBS 14.  By trial and error, if qsub returns a job ID <number>.<server-name> I can do "qstat <number>" but not "qstat <number>.<truncated-server-name>" - i.e. the truncated result returned by qstat is not a valid job ID, so it's no wonder that it screws up cylc's job query mechanism.

If you have to work around this in Cylc - which it seems you have already - the job ID returned by qsub is stored in the task's job.status file, and I think this has to exactly match the job ID returned by qstat in order for job poll (query) to work.

In the case you describe, this means you would have to strip the server-name off the job ID returned by both qsub and qstat.

The current master branch (and the new 7.8 release) has a new variant PBS handler called "pbs_multi_cluster" that does something very similar: it subclasses PBSHandler to append '@server-name' to the job ID returned by both qsub and qstat.  You can probably use this as a template: https://github.com/cylc/cylc/blob/master/lib/cylc/batch_sys_handlers/pbs_multi_cluster.py (if it is any better than the workaround you've already come up with).  Note that the `manip_job_id()` method, which manipulates the job ID returned by qsub, is not supported prior to cylc-7.8 because it relies on a small change in the batch system manager module.
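
As a rough sketch of that subclassing approach, applied to stripping the server name instead of appending it: the code below uses a stand-in base class so that it runs without Cylc installed. `PBSHandlerStandIn` and `PBSStripServerHandler` are hypothetical names, and `filter_poll_many_output` is an assumed hook name for post-processing qstat output; check both against the linked pbs_multi_cluster.py before relying on them.

```python
import re


class PBSHandlerStandIn(object):
    """Hypothetical stand-in for Cylc's PBSHandler, for illustration only."""
    POLL_CMD = "qstat"


class PBSStripServerHandler(PBSHandlerStandIn):
    """Reduce job IDs from qsub and qstat to the part before the first period."""

    @classmethod
    def manip_job_id(cls, job_id):
        """Strip '.server-name' from the job ID returned by qsub."""
        return job_id.strip().split(".", 1)[0]

    @classmethod
    def filter_poll_many_output(cls, out):
        """Return stripped job IDs from the first column of qstat output.

        Assumed hook name; header and separator lines are skipped because
        their first field does not start with a digit.
        """
        job_ids = []
        for line in out.splitlines():
            fields = line.split()
            if fields and re.match(r"\d+", fields[0]):
                job_ids.append(cls.manip_job_id(fields[0]))
        return job_ids
```

With this shape, the ID stored at submit time and the IDs parsed at poll time go through the same `manip_job_id()` logic, so they can be compared exactly.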

Hilary





Tom Coleman

Dec 2, 2018, 8:37:05 PM
to cy...@googlegroups.com
You could try making Cylc use "qstat -w" instead. I believe it gives more room for job IDs (30 characters). See https://github.com/PBSPro/pbspro/blob/master/src/cmds/qstat.c#L528

I think you would just update:
lib/cylc/batch_sys_handlers/pbs.py:    POLL_CMD = "qstat"

to

lib/cylc/batch_sys_handlers/pbs.py:    POLL_CMD = "qstat -w"

I haven't tried this, but it sounds like it should work.
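
An alternative to editing the installed pbs.py might be to override the attribute in a subclass. The sketch below is equally untested against a real PBS server. `PBSHandlerStandIn`, `PBSWideHandler` and `build_poll_command` are hypothetical names used so the example runs on its own; how Cylc actually assembles the poll command is an assumption here.

```python
import shlex


class PBSHandlerStandIn(object):
    """Hypothetical stand-in for Cylc's PBSHandler, for illustration only."""
    POLL_CMD = "qstat"


class PBSWideHandler(PBSHandlerStandIn):
    """Variant handler that polls with the wide qstat format."""
    POLL_CMD = "qstat -w"


def build_poll_command(handler, job_id):
    """Split POLL_CMD into arguments and append the job ID (assumed usage)."""
    return shlex.split(handler.POLL_CMD) + [job_id]


print(build_poll_command(PBSWideHandler, "12345.pbs-localhost1"))
```

Because POLL_CMD now contains a space, whatever runs it has to split it into separate arguments instead of treating it as a single executable name.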