some fireworks termination is not detected


davidmich...@gmail.com

Mar 28, 2019, 6:05:43 AM
to fireworkflows
Hi,

I have a problem with certain FWs that have completed but still appear as RUNNING. It happens when the task runs on the nodes of our HPC facility, but not when I run it in an interactive bash session.

I was wondering how this can happen. Could it be a connection problem between the nodes and the MongoDB server? (But in that case the task could not have been launched.)

Furthermore, it happens very often with a single PyTask, which returns a FWAction object: return FWAction(update_spec={'bands': bands}). Could that be a cause?

Is there a way to explicitly tell the launchpad that the FW has completed, through the launchpad object? And possibly raise an exception if the launchpad does not answer?

Best regards,
David

Anubhav Jain

Apr 2, 2019, 5:02:37 PM
to fireworkflows
Hi David,

To be honest I am not sure what might be happening here.

The only edge case I can think of is that your job completes, but hits the walltime while FWS is communicating with the database to update the state to COMPLETED which occurs after the job completes. Typically, the database communication to update the state would only take a few seconds maximum, so the chances of your job completing but then hitting walltime during FWS communication would be quite small. 

One thing you could do to debug this is to see how long the database communication is taking. For example, pick a COMPLETED job and examine its Launch object in the database, particularly the timestamps that show when the job started RUNNING and when it was tagged as COMPLETED. Then compare that interval to the actual or expected runtime of your job (if you have that somewhere). A big discrepancy could indicate that the database is taking far too long to update for your job, and that the job is hitting walltime in the middle of the update.
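As a rough sketch of that comparison: the snippet below assumes the Launch document carries a state_history list whose entries have a state name and an ISO-formatted created_on timestamp. Those field names, the sample document, and the expected runtime are illustrative assumptions, so check them against a real document from your own database before relying on this.

```python
from datetime import datetime


def seconds_between_states(launch_doc, first="RUNNING", second="COMPLETED"):
    """Seconds from entering `first` to entering `second`, or None if either is missing.

    Assumes each state_history entry looks like
    {"state": <name>, "created_on": <ISO timestamp string>} (an assumption).
    """
    stamps = {
        entry["state"]: datetime.fromisoformat(entry["created_on"])
        for entry in launch_doc.get("state_history", [])
    }
    if first not in stamps or second not in stamps:
        return None
    return (stamps[second] - stamps[first]).total_seconds()


# Hypothetical Launch document, e.g. copied as JSON out of MongoDB.
sample = {
    "launch_id": 1,
    "state": "COMPLETED",
    "state_history": [
        {"state": "RUNNING", "created_on": "2019-03-28T06:00:00"},
        {"state": "COMPLETED", "created_on": "2019-03-28T06:05:43"},
    ],
}

expected_runtime_s = 330.0  # hypothetical: how long the task itself should take
recorded_s = seconds_between_states(sample)
overhead = recorded_s - expected_runtime_s
print(f"recorded {recorded_s:.0f} s, overhead {overhead:.0f} s")
```

A launch that is stuck in RUNNING has no COMPLETED entry, so the function returns None for it. The comparison is therefore only meaningful on launches that did finish.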

Do you happen to have very large workflows (e.g. 1000 FWS or more)? I could see this perhaps being a bigger problem as the workflows get larger, although I think we have done a lot recently to speed up database updates of large workflows.

Note that there is no way to explicitly mark a FW as completed. While if you are desperate and risk-taking you could try manually calling the Launchpad.complete_launch() method, I wouldn't really recommend this and suggest you try fixing the underlying problem.

Anubhav Jain

Apr 2, 2019, 5:04:04 PM
to fireworkflows
Also, if you have the Launch object (e.g. JSON from MongoDB) for one of the launches that is stuck in such a state, perhaps you could attach that JSON.

davidmich...@gmail.com

Apr 3, 2019, 12:33:18 PM
to fireworkflows
Hi Anubhav,

Thanks for your answer. I finally found what was responsible for this behaviour.

After inlining all the code in the PyTask, I removed lines of code until the end of the PyTask was correctly detected.

The problem was caused by:

    # transform user warnings into errors (that can be caught ...)
    warnings.simplefilter('error', UserWarning)

I used this to catch numpy warnings:

    for ifeat in range(nb_feat):
        try:
            mean_hrl = float(zs_hrl[ifeat][0][0])
        except:  # in case of a warning (i.e. NaNs), catch it and skip this segment
            continue

The trick was nice, but it has side effects ...
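To illustrate the kind of side effect involved, here is a minimal sketch. A filter installed with warnings.simplefilter() applies to the whole process and stays active after the code that installed it has returned, so any later code that emits a harmless UserWarning will raise instead. The function names below are hypothetical stand-ins, and the thread does not establish which later warning was actually affected in the FireWorks run. This only shows the general mechanism.

```python
import warnings


def task_body():
    # Installed process-wide; nothing removes it when the task returns.
    warnings.simplefilter("error", UserWarning)
    return {"bands": [1, 2, 3]}  # placeholder result


def later_unrelated_code():
    # Hypothetical stand-in for any code that runs after the task
    # and emits a normally harmless UserWarning.
    warnings.warn("non-fatal notice", UserWarning)
    return "finished"


def demo():
    # catch_warnings() is used here only as a sandbox, so that the demo
    # itself leaves the interpreter's filters untouched afterwards.
    with warnings.catch_warnings():
        task_body()
        try:
            return later_unrelated_code()
        except UserWarning as exc:
            return f"raised: {exc}"


print(demo())  # prints "raised: non-fatal notice"
```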

I hope this is useful to someone.

Best regards,
David

davidmich...@gmail.com

Apr 4, 2019, 4:56:25 AM
to fireworkflows
I solved the problem using a context manager:

    with warnings.catch_warnings():
        warnings.simplefilter('error', UserWarning)
        # ... the try/except loop goes here ...
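A slightly fuller sketch of the scoped version, for anyone who wants to reuse the pattern. The helper noisy_float is a hypothetical stand-in for the numpy computation that warns on NaNs; the point is that the "error" filter is only active inside the with block and the previous filters are restored on exit.

```python
import warnings


def noisy_float(value):
    """Hypothetical stand-in for a computation that warns on bad input."""
    if value != value:  # True only for NaN
        warnings.warn("NaN encountered", UserWarning)
    return float(value)


def usable_values(values):
    """Return the values that convert without triggering a UserWarning."""
    kept = []
    with warnings.catch_warnings():
        # Only inside this block are UserWarnings raised as exceptions.
        warnings.simplefilter("error", UserWarning)
        for v in values:
            try:
                kept.append(noisy_float(v))
            except UserWarning:
                continue  # skip this segment
    return kept


before = list(warnings.filters)
result = usable_values([1.0, float("nan"), 2.5])
after = list(warnings.filters)

print(result)           # prints [1.0, 2.5]
print(before == after)  # prints True: the filters are back to what they were
```

Catching UserWarning explicitly, rather than using a bare except, also avoids swallowing unrelated errors inside the loop.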

Sorry for the noise.

Anubhav Jain

Apr 4, 2019, 12:12:58 PM
to davidmich...@gmail.com, fireworkflows
Thanks for updating us! Glad the problem is solved.



--
Best,
Anubhav