Inconsistent Behavior In Workflows

msta...@gmail.com

Nov 4, 2015, 1:58:09 PM
to fireworkflows

Right now I have a master workflow that calls about 17 sub-workflows, most of which call further sub-workflows.  I am using the multi-launcher for everything.  There is a relatively intricate dependency tree.  There are also several other workflows that get called asynchronously and independently from this master workflow.

The main issues I'm having are the following:
   - randomly, a workflow firework within the master workflow will be updated to COMPLETE, even though the state of the workflow itself is RUNNING.  This ruins the dependency relationships of the other workflows.
   - randomly, after a sub-workflow is COMPLETED, the workflow firework in the master workflow will stay as RUNNING for several minutes and then fizzle.  This requires `lpad rerun_fws -s FIZZLED --task-level`, which is a waste of time and requires manual intervention.  It also reruns the entire workflow even though it already completed. This defeats the purpose of having automation.
   - randomly, but rarely, the master workflow will successfully run to completion.  This is desired but rarely occurs.

I am running on a single node with 12 cores. CPU and memory usage are both under 100%.

Any insight you can provide would be greatly appreciated.

Thanks,
Matt

Anubhav Jain

Nov 4, 2015, 2:28:40 PM
to msta...@gmail.com, fireworkflows
Hi Matt,

I'd like to help but I think I need more information.

First, I just want to confirm you are running the latest FWS (v1.1.8). In particular, we patched some bugs with workflow locking in v1.1.7 that could cause anomalous behavior.

Second, I am having trouble following your description:

i) What is the distinction between "master workflow" and "sub workflow"?  These terms are not part of the FWS lexicon. Is there just 1 Workflow object with many Fireworks, or are there multiple Workflow objects? If it is the latter, how exactly are the sub workflows related to the master? In FWS, different Workflow objects do not have any dependencies between them, so any such dependency that is added externally would need to be described.

ii) What is a "workflow firework"? Is it the same thing as a Firework?

Best
Anubhav

msta...@gmail.com

Nov 4, 2015, 3:21:28 PM
to fireworkflows, msta...@gmail.com, AJ...@lbl.gov
Sorry, I forget sometimes that I've kind of created my own terminology.

The master workflow is a workflow where all of the child Fireworks run ScriptTasks that call python scripts that create more workflows.
A sub-workflow is the workflow that is created by the ScriptTask in the master workflow.
Many of these subworkflows contain a Firework that runs a ScriptTask that creates another workflow.

What I meant by workflow firework is a firework that creates a workflow via ScriptTask.

I hope that clears up some confusion.

Also, apparently we're running 1.08, which sounds like it is very old.

msta...@gmail.com

Nov 5, 2015, 1:00:38 PM
to fireworkflows, msta...@gmail.com, AJ...@lbl.gov
I think I fixed the issue with workflows incorrectly fizzling: I was cleaning up the child fireworks too quickly.  I'm still having sporadic issues with fireworks executing out of order, and sometimes all fireworks in a workflow complete but the workflow itself doesn't update to COMPLETED until other workflows complete.

Does the latest version take care of these issues?

Thanks,
Matt

Anubhav Jain

Nov 5, 2015, 1:11:13 PM
to Matt Tannenbaum, fireworkflows
Hi Matt,

Glad to hear that the fizzling problem is no longer an issue.

1)
Regarding the issue where all the fireworks inside a workflow are complete, but the workflow state itself is incomplete, this was patched in v1.1.7. See the changelog:


and specifically this patch:
* fix WFLock causing inconsistent states in workflows; detect such cases in detect_lostruns; add --refresh as fix (G. Petretto)

I think that should fix the issue for future workflows. Also, once you update the code, running "lpad admin refresh -i <FW_ID>" on an existing workflow that is stuck in this inconsistent state should correct it. If there are several such workflows, query options other than -i can refresh more of them in one command. Refreshing a workflow should do no harm (it only takes time), and with the patch in FWS 1.1.7 future workflows should not need the refresh at all.

2)
Regarding the Fireworks executing out of order, this is a new issue that I haven't heard come up in the past and I can't think of anything that could cause it. My suggestion would be to try and see if #1 can fix your main issue, and then we can talk more about this second issue if you still see it.

Finally - and this is unrelated to your issues - the new versions contain some updated reporting and introspection commands (e.g. "lpad report -i months" and "lpad introspect") that might also be useful.

Best
Anubhav

msta...@gmail.com

Nov 5, 2015, 1:25:54 PM
to fireworkflows, msta...@gmail.com, AJ...@lbl.gov
Thank you for the response. I will definitely work on upgrading.

With regards to fireworks executing out of order:
   - the parent firework's task creates a workflow
   - the parent firework gets updated to COMPLETED while the child workflow is still in the RUNNING state
   - this causes dependent fireworks to execute before the jobs they depend on have completed

An example:
Workflow A:
   -Firework1[Script creates Workflow B]
      -Firework1a[script]
      -Firework1b[script]
      -Firework1c[script]
   -Firework2[Workflow C]
      -Firework2a[script]
      -Firework2b[script]
      -Firework2c[script]
   -Firework3 etc.

Let's say Firework2 depends on Firework1. In my scenario, Firework1 updates to COMPLETED while Workflow B is still in the RUNNING state.  This allows Firework2 to run too early.

Mind that this is a simplistic example. In reality I have many Fireworks running in parallel, each of which creates more Workflows, and creates a relatively complex dependency tree.

I hope that helps clarify things.

Thanks,
Matt

Anubhav Jain

Nov 5, 2015, 1:30:39 PM
to Matt Tannenbaum, fireworkflows
I should quickly mention that there are instructions on updating FWS here (scroll near the bottom):

(It should be very simple)

I will take a look at the out of order stuff soon.

Best
Anubhav

Anubhav Jain

Nov 6, 2015, 1:41:55 PM
to Matt Tannenbaum, fireworkflows
Hi Matt,

I am trying to better understand your problem with Firework dependencies. I can't tell from your example above what the dependency of Firework2 is intended to be. In the diagram you posted, it looks like it depends only on Firework1, so it behaved as it should have, i.e. it started running as soon as Firework1 completed.
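
The point about declared dependencies can be illustrated with a small toy model. This is not FireWorks code; `ready_to_run` and the dictionaries are hypothetical names used only for illustration. It assumes, as described above, that a Firework becomes eligible once every parent declared for it in its own workflow is COMPLETED, and that a separately inserted workflow is never consulted.

```python
# Toy model, not FireWorks code: a Firework may start once every parent
# declared for it in the SAME workflow has reached COMPLETED.

def ready_to_run(states, parents):
    """Return the WAITING fireworks whose declared parents are all COMPLETED."""
    return sorted(
        fw
        for fw, state in states.items()
        if state == "WAITING"
        and all(states[p] == "COMPLETED" for p in parents.get(fw, []))
    )


# Workflow A from the example: Firework2 declares only Firework1 as a parent.
parents_a = {"Firework2": ["Firework1"]}
states_a = {"Firework1": "COMPLETED", "Firework2": "WAITING"}

# Workflow B was inserted as a separate workflow, so nothing in Workflow A
# refers to it and its state is never looked at.
states_b = {"Firework1a": "RUNNING", "Firework1b": "WAITING", "Firework1c": "WAITING"}

# Firework2 is eligible even though Workflow B is still unfinished.
print(ready_to_run(states_a, parents_a))
```

Under these assumptions, the only way to hold Firework2 back is to make the extra work part of Workflow A's own dependency graph.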

A few notes that may or may not be helpful:
- Please make sure each Firework belongs to only 1 Workflow
- Related to above, note that dependencies should be between Fireworks within the same Workflow. There should not be any dependencies between Fireworks that belong to two different Workflows.
- Related to above, please confirm that you are not trying to use the ScriptTask to generate new Fireworks or Workflows dynamically that are also supposed to delay execution of your originally defined Workflow. If you are trying to do that, it is likely going to lead to errors like the ones you are seeing. See below for the proper way to do it in FWS.
- In terms of generating new Workflows for each Firework, the ScriptTask is one way to do it but if you are familiar with Python programming then you should really consider using PyTask or your own custom tasks along with dynamic actions to generate the new Workflows:


Look in particular at the section on dynamic workflows, with different diagrams on how you can make changes in the intended way. I would highly suggest you switch to this method rather than writing ScriptTasks.
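
The idea behind a dynamic action can be sketched with a toy model: the task hands its new Fireworks back when it finishes, and they are spliced into the same workflow so that the original children wait for them. As far as I know, FireWorks exposes this through the `detours` argument of `FWAction` (with `additions` for new work that the original children do not wait on), but please check those names against the FWAction documentation. The code below is not library code; `complete_with_detour`, `ready_to_run`, and the linear ordering of Firework1a/1b/1c are assumptions made for illustration.

```python
# Toy model, not FireWorks code: completing a Firework with a "detour"
# rewires its children so that they wait for the detour's final step.

def ready_to_run(parents, states):
    """Return the WAITING fireworks whose parents are all COMPLETED."""
    return sorted(
        fw
        for fw, state in states.items()
        if state == "WAITING"
        and all(states[p] == "COMPLETED" for p in parents[fw])
    )


def complete_with_detour(parents, states, fw, detour_chain):
    """Mark fw COMPLETED and splice detour_chain (run in order) before its children."""
    states[fw] = "COMPLETED"
    previous = fw
    for step in detour_chain:
        parents[step] = [previous]
        states[step] = "WAITING"
        previous = step
    # Children that waited on fw now wait on the last detour step instead.
    for child in list(parents):
        if child not in detour_chain and fw in parents[child]:
            parents[child] = [previous if p == fw else p for p in parents[child]]


parents = {"Firework1": [], "Firework2": ["Firework1"]}
states = {"Firework1": "RUNNING", "Firework2": "WAITING"}

complete_with_detour(
    parents, states, "Firework1", ["Firework1a", "Firework1b", "Firework1c"]
)
first_ready = ready_to_run(parents, states)
print(first_ready)  # only the first detour step; Firework2 is held back

for step in ["Firework1a", "Firework1b", "Firework1c"]:
    states[step] = "COMPLETED"
final_ready = ready_to_run(parents, states)
print(final_ready)  # Firework2 becomes eligible only now
```

In this model Firework1 can still be marked COMPLETED as soon as its task returns, yet Firework2 does not start early, because the new work lives inside the same dependency graph.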

Best
Anubhav

msta...@gmail.com

Nov 10, 2015, 6:05:00 PM
to fireworkflows, msta...@gmail.com, AJ...@lbl.gov
Hey Anubhav,

So I created a couple custom tasks to create workflows instead of using ScriptTask.  I'm still having the issue where Firework1 updates to COMPLETED while the workflow created by its task is still in the RUNNING state.  Do you have any idea why this is happening?

My custom task resembles the following:

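from fireworks.core.firework import FireTaskBase, Firework, FWAction, Workflow
from fireworks.core.fworker import FWorker
from fireworks.core.launchpad import LaunchPad
from fireworks.features.multi_launcher import launch_multiprocess
from fireworks.user_objects.firetasks.script_task import ScriptTask


class CustomTask(FireTaskBase):

    _fw_name = "Custom Task"
    required_params = ["param1", "param2"]

    def run_task(self, fw_spec):
        # Build the tasks for the child workflow.
        t1 = CustomTask2(param="param")
        t2 = ScriptTask.from_str(...)
        # etc.

        fw1 = Firework(t1, fw_id=1, name="name1")
        fw2 = Firework(t2, fw_id=2, name="name2")
        # etc.

        # Child workflow: the Fireworks plus a parent -> child links dict
        # (further Fireworks and links elided, as are fw3 and fw4 above).
        workflow = Workflow([fw1, fw2, fw3, fw4], {fw1: fw2, fw2: fw4, fw3: fw4}, name="workflow_name")

        # Add the child workflow and launch it from inside this task.
        launchpad = LaunchPad()
        launchpad.add_wf(workflow)
        launch_multiprocess(launchpad, FWorker(), "Info", 0, 12, 0, None, 12)

        # Fail this task if the child workflow fizzled. The state is compared
        # as a string; the original post used an undefined name FIZZLED.
        for fw_id in workflow.root_fw_ids:
            wf_summary = launchpad.get_wf_summary_dict(fw_id, mode="less")
            wf_state = wf_summary["state"]
            if wf_state == "FIZZLED":
                raise RuntimeError("Workflow Fizzled")

        return FWAction()


class CustomTask2(FireTaskBase):

    _fw_name = "Custom Task 2"
    required_params = ["param"]

    def run_task(self, fw_spec):

        # define another workflow of ScriptTasks and add to launchpad
        # multi-launch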

The main workflow contains about a dozen CustomTask fireworks with a complex dependency tree.  It appears that either CustomTask is completing before the Fireworks in its workflow have all completed, or the Firework containing the CustomTask is getting updated to COMPLETED before the task is actually complete.

Any thoughts?

Anubhav Jain

Nov 10, 2015, 9:54:07 PM
to Matt Tannenbaum, fireworkflows
Hi Matt,

I never tried using a FireTask to do the launch_multiprocess (or any other launch). I've always run the launching separately: I define the workflows and enter them in the database (they contain tasks to run, not launch commands), and then run a launch command separately to run the launches. If more tasks need to be created, that is done using the dynamic actions of FireWorks, but I still use the "root" launch command to run them, i.e. never a launch inside the FireTask run_task() method.

I am not completely sure why you have designed things the way you did, but one thing I would be careful about is making sure that launchpad.add_wf(workflow) has completed properly before running launch_multiprocess. Otherwise, launch_multiprocess will not see the added workflow and will skip over it, resulting in the current task being marked as completed. One way to test this is to use Python's time.sleep() to add a 10-second delay (this is very generous) between the time you add the workflow and the time you launch it. If that seems to fix it, let me know and we can talk about how to patch things more permanently.
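
The fixed delay is a diagnostic. If it does confirm a race, one common alternative is to poll for the condition with a timeout instead of sleeping for a fixed time. The sketch below uses only the standard library; `wait_until` and `workflow_visible` are hypothetical names, and in real code the predicate would query the LaunchPad (for example through the `get_wf_summary_dict` call already used in the task above).

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical stand-in for "the added workflow is visible in the database":
# it starts returning True on the third call.
calls = {"count": 0}


def workflow_visible():
    calls["count"] += 1
    return calls["count"] >= 3


found = wait_until(workflow_visible, timeout=2.0, interval=0.01)
print(found, calls["count"])
```

The same helper could be pointed at a predicate that checks whether the child workflow has reached COMPLETED, so that the task returns only after the work it launched has finished.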

Best
Anubhav

note - I will not be around for the next week or so, so please pardon any delays in response

