Hi, cylc developers,
Assume a task usually takes less than 30 minutes to complete, and it has the following setting:
    [[[job]]]
        execution time limit = PT30M
Will cylc report the task as failed if it is still in the running state (under PBS) after more than 30 minutes?
I am asking because we sometimes observe tasks that stay in the "running" state (under PBS)
long after their execution time limit and wall clock time limit have passed. This is usually due to a system
problem. I am hoping that cylc would consider such a task "failed" instead of "running", so that a retry can be triggered automatically
and the suite becomes more robust.
Regards,
Xiao
Thanks, Hilary.
Communication is certainly a problem for those tasks, because usually one of the nodes becomes faulty.
I will note down the task status next time this happens.
Regards,
Xiao
Hi, Hilary
Another example today: the cylc gui is not showing the task as "failed"; it is still shown in green.
· PBS shows the task in the "F" (failed) state.
· The cylc log shows the following:
2019-02-13T23:22:26Z WARNING - [glm_ops_odb_to_odb2_surface.20190124T0000Z] -job started PT5H ago, but has not finished
2019-02-13T23:22:27Z INFO - [glm_ops_odb_to_odb2_surface.20190124T0000Z] -(current:running) started (polled)
Most likely the job landed on a bad node. The time limit for this task is 5 hours.
In this situation, will cylc be able to report the task as failed, so that the retry kicks in?
Thanks
Xiao
From: cy...@googlegroups.com <cy...@googlegroups.com> On Behalf Of Hilary Oliver
Sent: Friday, 1 February 2019 3:50 PM
To: cy...@googlegroups.com
Subject: Re: [cylc-dev] How does cylc use task execution time limit? [SEC=UNCLASSIFIED]
P.S. The only other explanation for a job that appears to still be running when it isn't is that someone (the job owner or a system administrator) hard-killed it ("kill -9 PID"). That signal is not trappable, so the job wrapper cannot send a message back before it dies.