Detecting wall time errors on PBS/TORQUE

Jeremy Cohen

Apr 26, 2016, 8:27:11 AM
to saga-users
When running jobs on PBS/TORQUE platforms, if a job exceeds the wall time limit, I simply get a callback saying the job is done. Is there any way, other than grepping the output/error files, to detect that the job was terminated for exceeding the wall time limit?

I'm currently tee'ing the output from the job processes to a separate output file, which doesn't seem to contain the "wall time exceeded" error. The error appears in the default output/error files generated by the job scheduler (I'm currently testing with TORQUE), which I don't have easy access to from my client application.

I can implement a workaround but wondered if the saga library has any way of directly identifying and returning the reason for job termination.

Thanks,
Jeremy

Jeremy Cohen

Apr 26, 2016, 12:52:38 PM
to saga-users
Just to add some information to the original query: I was initially testing with the PBS job adaptor, although targeting a TORQUE platform. Switching to the TORQUE job adaptor results in the job state correctly switching to 'Failed' when the wall time is exceeded, and I can access the TORQUE exit code via the exit_code attribute of the job object.

However, when testing with the PBS job adaptor against a PBS cluster, the job seems to disappear from the qstat output once it fails, so saga-python sees that the job is no longer listed, assumes it has completed, and switches the state to 'Done'. The exit_code attribute is 'None'. This is causing some issues, but I presume it is related to the PBS deployment that I'm targeting and that there's not a lot that can be done to address it on the saga-python side?
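
For the PBS case described above, one possible client-side workaround is to treat a 'Done' state with no exit code as unverified rather than as a success. This is only a sketch under assumptions: `classify_outcome` is a hypothetical helper, not part of saga-python, and it relies only on the job object exposing `state` and `exit_code` as described in this thread. The state strings 'Done' and 'Failed' are taken from the behaviour reported here, and `FakeJob` is a stand-in so the example runs without a cluster.

```python
from collections import namedtuple

# Stand-in for a job object; only the two attributes used below are modelled.
FakeJob = namedtuple("FakeJob", ["state", "exit_code"])


def classify_outcome(job):
    """Return 'succeeded', 'failed', 'unknown', or 'not-final' for a job.

    Assumes job.state is a string such as 'Done' or 'Failed' and
    job.exit_code is an int or None (hypothetical helper).
    """
    if job.state == "Failed":
        return "failed"
    if job.state == "Done":
        if job.exit_code is None:
            # The job vanished from the queue listing without an exit code,
            # so we cannot tell success from a scheduler-side kill.
            return "unknown"
        return "succeeded" if job.exit_code == 0 else "failed"
    # Any other state is treated as not yet final in this sketch.
    return "not-final"


print(classify_outcome(FakeJob("Done", None)))
```

A caller that gets 'unknown' back could then fall back to checking the scheduler's own output/error files, which is the grep-based approach mentioned in the first message.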

One suggestion/question: might it be useful to map error codes within an adaptor to a string description of the cause of the error? Then, when the job state is set to 'Failed', one could call job.error_info or similar to get a string description of the error.
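
To illustrate the kind of mapping being suggested, here is a rough client-side sketch in Python 3. Everything named in it is hypothetical: `describe_exit_code` and `EXIT_DESCRIPTIONS` are made-up names, `job.error_info` is only the proposal above and not an existing attribute, and the table holds a single placeholder entry that would need to be filled in from the scheduler's own documentation. The `signal_offset` default of 128 follows the common shell convention for processes killed by a signal; batch systems may use a different offset, so that value is an assumption to check.

```python
import signal

# Hypothetical table: populate with codes from your scheduler's documentation.
EXIT_DESCRIPTIONS = {
    0: "completed successfully",
}


def describe_exit_code(code, table=EXIT_DESCRIPTIONS, signal_offset=128):
    """Return a human-readable description for a job exit code."""
    if code is None:
        return "no exit code reported"
    if code in table:
        return table[code]
    if code > signal_offset:
        # Codes above the offset are interpreted as "killed by signal N".
        try:
            name = signal.Signals(code - signal_offset).name
            return "killed by signal %s" % name
        except ValueError:
            pass  # Not a valid signal number; fall through.
    return "unknown exit code %d" % code


print(describe_exit_code(None))
```

An adaptor-level version would carry one such table per scheduler, so that the same numeric code is not misread across PBS and TORQUE.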