How does cylc use task execution time limit? [SEC=UNCLASSIFIED]

Yi Xiao

Jan 31, 2019, 9:59:44 PM

to cy...@googlegroups.com

Hi, cylc developers,

Assume a task usually takes less than 30 minutes to complete, and it has the following setting:

    [[[job]]]
        execution time limit = PT30M

Will cylc report the task as failed if it is still in the running state (under PBS) after more than 30 minutes?

I am asking because we sometimes observe tasks staying in the "running" state (under PBS) long after their execution time limit and wall clock time limit have passed. This is usually due to system problems. I am hoping that cylc would consider such a task "failed" instead of "running", so that a retry can be triggered automatically and the suite becomes more robust.

Regards,

Xiao

Hilary Oliver

Jan 31, 2019, 11:47:25 PM

to cy...@googlegroups.com

Hi Xiao,

If you specify an "execution time limit", Cylc will a) automatically generate the PBS wall time directive for the job, and b) poll the job automatically if it hasn't succeeded or failed by the time the limit is up.

(Here's the relevant bit of the current User Guide: https://cylc.github.io/cylc/doc/built-sphinx/appendices/suiterc-config-ref.html#runtime-name-job-execution-time-limit)

Cylc does not need to kill a job that exceeds its time limit, because PBS does that. That also means you should not observe tasks staying in the "running" state (under PBS) long after their execution time limit and wall clock time limit. If you do observe that, it probably means that network problems stopped the final job status message from getting back to the suite server program, in which case Cylc's follow-up job polling should return the true state.

It might help to know more about your problems. If it happens again, first check the PBS queue to see if the job really is still running. If it is still running, the wall time value must not be what you thought (else PBS would have killed it already). If it is not still running, check the job.err file - are there errors showing that the final job status message could not be sent back? And if there are, check the Cylc suite log to find out if the job poll failed - because it should have determined that the job is no longer running, even if status messages cannot be sent back. Then if necessary, test that Cylc's job polling is working properly on your platform.

Hilary
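Point (a) in the reply above, deriving the PBS wall time from the execution time limit, can be illustrated with a minimal Python sketch. The helper names `duration_to_seconds` and `pbs_walltime_directive` are hypothetical and are not part of Cylc. The sketch handles only simple `PTnHnMnS` durations, and the `HH:MM:SS` directive text is an assumption about formatting; the directive Cylc actually writes into the job script may look different.

```python
import re

# Hypothetical helper: supports only the simple PTnHnMnS subset of ISO 8601.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_seconds(text):
    """Convert a duration such as 'PT30M' or 'PT5H' to a number of seconds."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError("unsupported duration: %r" % text)
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def pbs_walltime_directive(limit):
    """Format a PBS wall time directive for the given time limit.

    Assumption: HH:MM:SS is used here for readability only.
    """
    total = duration_to_seconds(limit)
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return "#PBS -l walltime=%02d:%02d:%02d" % (hours, minutes, seconds)


print(pbs_walltime_directive("PT30M"))
```

Because the wall time comes from the same setting, PBS should kill the job at about the moment Cylc's poll-on-timeout fires, which is why a job still shown as running well past the limit points to a lost status message or a polling problem.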


Hilary Oliver

Jan 31, 2019, 11:50:28 PM

to cy...@googlegroups.com

p.s. The only other explanation for a job that appears to still be running when it isn't is that someone (the job owner, or a system administrator) hard killed it ("kill -9 PID"). That signal is not trappable, so the job wrapper cannot send a message back before it dies.

Yi Xiao

Feb 1, 2019, 12:03:09 AM

to cy...@googlegroups.com

Thanks Hilary

Communication is certainly a problem for those tasks, because usually one of the nodes becomes faulty.

I will note down the task status next time this happens.

Regards,

Xiao

Yi Xiao

Feb 13, 2019, 9:01:34 PM

to cy...@googlegroups.com

Hi, Hilary

Another example today: the cylc gui is not showing the task as "failed"; it is still shown in green.

· PBS shows the tasks in "F" (failed) state.

· Cylc log shows the following

2019-02-13T23:22:26Z WARNING - [glm_ops_odb_to_odb2_surface.20190124T0000Z] -job started PT5H ago, but has not finished

2019-02-13T23:22:27Z INFO - [glm_ops_odb_to_odb2_surface.20190124T0000Z] -(current:running) started (polled)

Most likely the job landed on a bad node. The time limit for this task is 5 hours.

In this situation, will cylc be able to report the task as failed, so that a retry kicks in?

Thanks

Xiao


Hilary Oliver

Feb 13, 2019, 11:31:56 PM

to cy...@googlegroups.com

Hi Xiao,

Your log messages show that cylc polled the job when it timed out after 5 hours, and the poll found that the job was still running (that's what "started (polled)" means).

If the job was not actually still running (is that what you're saying?) then job polling is not returning a correct result on your platform, and we should figure out why. If the poll had detected or inferred job failure, then cylc would indeed trigger a retry (assuming you have retries configured for that task). Another question is why the poll-on-timeout was needed at all if the job failed. If the job failed internally or was killed gracefully, a job status message should have come back to report the failure as soon as it happened. If "bad node" means the node went down, or some such disaster, then no message would have been sent.

Hilary
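To collect cases like this one for debugging, the suite log can be scanned for timeout warnings and the poll result that follows them. The Python sketch below is not part of Cylc. The function name `find_overdue_tasks` is hypothetical, and the line pattern is inferred only from the two log lines quoted in this thread, so other Cylc versions may need a different pattern.

```python
import re

# Pattern inferred from the two log lines quoted in the thread (an assumption).
LOG_LINE = re.compile(
    r"^(?P<time>\S+) (?P<level>[A-Z]+) - \[(?P<task>[^\]]+)\] -(?P<message>.*)$"
)
# Captures the status word just before a trailing "(polled)" marker.
POLLED = re.compile(r"(\S+) \(polled\)$")


def find_overdue_tasks(lines):
    """Map each task that overran its time limit to its latest polled status.

    The value is None if no poll result for the task follows the warning.
    """
    overdue = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        task, message = match.group("task"), match.group("message")
        if "but has not finished" in message:
            overdue.setdefault(task, None)
        elif task in overdue:
            polled = POLLED.search(message)
            if polled:
                overdue[task] = polled.group(1)
    return overdue


log = [
    "2019-02-13T23:22:26Z WARNING - [glm_ops_odb_to_odb2_surface.20190124T0000Z] -job started PT5H ago, but has not finished",
    "2019-02-13T23:22:27Z INFO - [glm_ops_odb_to_odb2_surface.20190124T0000Z] -(current:running) started (polled)",
]
print(find_overdue_tasks(log))
```

A task whose polled status is "started" while PBS no longer lists the job as running is the mismatch discussed above, and is worth reporting together with the job.err contents.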