queue launcher crashes with empty queue


jkuck

Feb 8, 2017, 2:30:22 AM
to fireworkflows
Hi,

I'm trying to run a long workflow that dynamically creates new fireworks at every iteration.  I'm running the workflow with a queue launcher in infinite mode.  Usually after around 5 iterations (50-100 fireworks) the queue launcher crashes as follows:

2017-02-07 22:56:21,500 INFO Sleeping for 5 seconds...zzz...

2017-02-07 22:56:26,592 INFO Launching a rocket!

2017-02-07 22:56:26,616 INFO No jobs exist in the LaunchPad for submission to queue!

2017-02-07 22:56:26,616 ERROR ----|vvv|----

2017-02-07 22:56:26,616 ERROR Error with queue launcher rapid fire!

2017-02-07 22:56:26,618 ERROR Traceback (most recent call last):

  File "/atlas/u/jkuck/software/anaconda2/envs/anaconda_venv/lib/python2.7/site-packages/fireworks/queue/queue_launcher.py", line 216, in rapidfire

    raise RuntimeError("Launch unsuccessful!")

RuntimeError: Launch unsuccessful!


2017-02-07 22:56:26,619 ERROR ----|^^^|----


It looks like the queue launcher thinks a firework is ready to launch, but then finds that no jobs exist in the LaunchPad once launch_rocket_to_queue() is called.  Any tips would be appreciated!


Thanks,
Jonathan

Joseph Montoya

Feb 8, 2017, 2:51:50 AM
to jkuck, fireworkflows
Just to get a bit more info, does the issue persist when you restart the queue launcher?  Also, are you using fill mode?

Best,
Joey


jkuck

Feb 8, 2017, 2:58:32 AM
to fireworkflows, jdk...@gmail.com
Yes, the queue launcher crashes again after being restarted.  I'm calling rapidfire() with fill_mode=False:

rapidfire(launchpad, FWorker(), qadapter, launch_dir='.', nlaunches='infinite',
          njobs_queue=20, njobs_block=500, sleep_time=None, reserve=False,
          strm_lvl='INFO', timeout=None, fill_mode=False)

Thanks,
Jonathan

Anubhav Jain

Feb 8, 2017, 12:56:07 PM
to fireworkflows, jdk...@gmail.com
Hi Jonathan

Two things:

1) Can you paste the output of "lpad get_fws -s READY -d count" after the script crashes?
2) Would you mind running the script again with strm_lvl="DEBUG" and pasting the output again?

I haven't seen or heard of this error before, so it might take a little back and forth to figure out what's happening.

Best,
Anubhav

jkuck

Feb 8, 2017, 2:10:11 PM
to fireworkflows, jdk...@gmail.com
Hi Anubhav,

Thanks a lot for the help.  Here's the info: 
1) Can you paste the output of "lpad get_fws -s READY -d count" after the script crashes?
I've tried this after two crashes now.  The first time the count was 1, and the second time it was 2.
2) Would you mind running the script again with strm_lvl="DEBUG" and pasting the output again?

Here is the output; I've included a successful submission as well:

2017-02-08 10:53:34,428 INFO Job submission was successful and job_id is 1176878

2017-02-08 10:53:34,428 INFO Sleeping for 5 seconds...zzz...

2017-02-08 10:53:39,455 INFO Finished a round of launches, sleeping for 60 secs

2017-02-08 10:54:39,516 INFO Checking for Rockets to run...

2017-02-08 10:54:39,555 INFO The number of jobs currently in the queue is: 0

2017-02-08 10:54:39,555 INFO 0 jobs in queue. Maximum allowed by user: 20

2017-02-08 10:54:39,640 INFO Launching a rocket!

2017-02-08 10:54:39,647 DEBUG getting queue adapter

2017-02-08 10:54:39,733 INFO Created new dir /atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710

2017-02-08 10:54:39,733 INFO moving to launch_dir /atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710

2017-02-08 10:54:39,734 DEBUG writing queue script

2017-02-08 10:54:39,740 INFO submitting queue script

2017-02-08 10:54:41,842 INFO Job submission was successful and job_id is 1176879

2017-02-08 10:54:41,843 INFO Sleeping for 5 seconds...zzz...

2017-02-08 10:54:46,933 INFO Launching a rocket!

2017-02-08 10:54:46,940 DEBUG getting queue adapter

2017-02-08 10:54:46,961 INFO No jobs exist in the LaunchPad for submission to queue!

2017-02-08 10:54:46,961 ERROR ----|vvv|----

2017-02-08 10:54:46,962 ERROR Error with queue launcher rapid fire!

2017-02-08 10:54:46,965 ERROR Traceback (most recent call last):

  File "/atlas/u/jkuck/software/anaconda2/envs/anaconda_venv/lib/python2.7/site-packages/fireworks/queue/queue_launcher.py", line 216, in rapidfire

    raise RuntimeError("Launch unsuccessful!")

RuntimeError: Launch unsuccessful!


2017-02-08 10:54:46,965 ERROR ----|^^^|----

Best,
Jonathan

Anubhav Jain

Feb 8, 2017, 4:32:19 PM
to fireworkflows, jdk...@gmail.com
Hi Jonathan

Ok, unfortunately that doesn't provide much additional information, although it does seem like there are READY jobs in the LaunchPad waiting to run.

Can you try something else? Immediately after the crash, "cd" to the directory listed by the qlauncher and manually try to submit the script; e.g., for a PBS queue system this would involve typing "qsub FW_submit.script". The directory is listed in the debug output you printed, e.g., /atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710 for the previous time you tried this.

Sometimes, manually submitting the script can help clarify what errors (if any) are being thrown by the queuing system.
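
Concretely, that would look something like the following (PBS shown; on a SLURM cluster the submit command would be sbatch instead of qsub):

cd /atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710
qsub FW_submit.script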

Best,
Anubhav

Jonathan Kuck

Feb 8, 2017, 4:49:45 PM
to Anubhav Jain, fireworkflows
Hi Anubhav,

Correct me if I'm wrong, but I think the queue launcher is crashing before creating the launch directory.  It looks like '/atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710' is the directory created by the earlier successful submission.

The problem seems to be that somehow launchpad.run_exists(fworker) is evaluating to True in the while loop in rapidfire() in queue_launcher.py, but then to False in launch_rocket_to_queue().

Best,
Jonathan

jkuck

Feb 8, 2017, 8:44:39 PM
to fireworkflows, anubh...@gmail.com
A bit more info: I've replicated the problem on a second cluster with the latest version of FireWorks installed (1.4.0).  The second cluster uses SLURM instead of PBS.  The only observable difference is the line number where the error occurs, since I'm running the newer version of FireWorks:

2017-02-08 17:11:05,635 INFO Launching a rocket!

2017-02-08 17:11:05,637 DEBUG getting queue adapter

2017-02-08 17:11:05,673 INFO No jobs exist in the LaunchPad for submission to queue!

2017-02-08 17:11:05,673 ERROR ----|vvv|----

2017-02-08 17:11:05,673 ERROR Error with queue launcher rapid fire!

2017-02-08 17:11:05,674 ERROR Traceback (most recent call last):

  File "/home/kuck/.local/lib/python2.7/site-packages/fireworks/queue/queue_launcher.py", line 221, in rapidfire

    raise RuntimeError("Launch unsuccessful!")

RuntimeError: Launch unsuccessful!


2017-02-08 17:11:05,675 ERROR ----|^^^|----


Best,

Jonathan



Anubhav Jain

Feb 9, 2017, 11:47:37 AM
to jkuck, fireworkflows
Hi Jonathan

Thanks for the update; I'll take a closer look tomorrow.

Best
Anubhav


Anubhav Jain

Feb 10, 2017, 1:52:33 PM
to fireworkflows, jdk...@gmail.com
Hi Jonathan,

I am looking over this issue again.

- I agree with you: if launchpad.run_exists() evaluates to True inside the rapidfire() method but then evaluates to False within launch_rocket_to_queue() a short while later, you would see the traceback you mentioned.

- This could certainly happen if, for example, the following sequence of events occurred (see the sketch below):
* There is a READY job in the LaunchPad.
* That job gets submitted to the queue successfully, but it remains READY since we are not in reservation mode (that's fine).
* The rapidfire() code loops again, sees the (same) READY job in the LaunchPad, and goes ahead with calling launch_rocket_to_queue() to submit another queue job.
* However, before launch_rocket_to_queue() reaches the point where it checks again for the existence of a job (launchpad.run_exists()), the job queued earlier has started RUNNING. Thus, between the two calls to launchpad.run_exists(), the FW went from READY to RUNNING, leaving no jobs to run when the second call happened.
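
To make the race concrete, here is a condensed sketch of the control flow (a paraphrase for illustration, not the exact source):

# Sketch of the rapidfire() loop in fireworks/queue/queue_launcher.py
# (simplified; function names match the real API, bodies are condensed).
while launchpad.run_exists(fworker):             # check 1: a READY FW exists
    launch_rocket_to_queue(launchpad, fworker, qadapter, ...)
    # launch_rocket_to_queue() re-checks before writing the queue script:
    #     if not launchpad.run_exists(fworker):  # check 2, moments later
    #         # "No jobs exist in the LaunchPad for submission to queue!"
    # If the lone READY FW went RUNNING between check 1 and check 2, the
    # launch counts as unsuccessful and rapidfire() raises
    # RuntimeError("Launch unsuccessful!")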

Do you think this is the sequence of steps that is occurring?

I see two ways forward:
1. If a READY job "disappears" by the time a queue script is about to be submitted, simply consider the current iteration of rapidfire() to be finished.
2. Try to count jobs so that the same READY job doesn't lead to 2+ queue submissions. This would potentially have some benefits in creating 1:1 mappings of jobs to queue submissions, although it would be very difficult to prevent two simultaneous qlaunch processes (e.g., on different machines/workers) from colliding.
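
For (1), a rough sketch of the idea (illustrative only; the variable names and exact placement are approximations, not the actual patch):

# Sketch: inside launch_rocket_to_queue(), treat "no READY job found" as
# "nothing to do" rather than as a failed launch.
if not launchpad.run_exists(fworker):
    logger.info('No jobs exist in the LaunchPad for submission to queue!')
    return None  # rapidfire() then ends its current round of launches
                 # instead of raising RuntimeError("Launch unsuccessful!")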

Solution (1) is certainly the easier of the two, so that is what I implemented.

Please try FWS v1.4.1 (just released) and let me know if this fixes it.

Anubhav

jkuck

Feb 13, 2017, 5:44:13 PM
to fireworkflows, jdk...@gmail.com
Hi Anubhav,

That sequence of events sounds like the problem to me.  I've tried again with FWS v1.4.1 and, tentatively, the problem seems to be fixed; thanks a lot!  I do have a couple of additional questions:

- I'm using an Anaconda virtual environment on one of the clusters I have access to, but the latest version of FireWorks I can find is 1.3.9.  Is there a way to get access to the latest version?

- When I run a workflow, a folder named something like "block_2017-02-13-22-14-42-132705" is created.  Inside are a bunch of folders, one per firework, with names like "launcher_2017-02-13-22-15-50-842035".  When debugging, I'd like to inspect the error file in the folder corresponding to a particular firework, using a name I can look up in my database, such as the fw_id.  Is there a way to rename the "launcher..." folders with fw_ids?

Thanks,
Jonathan


Anubhav Jain

Feb 13, 2017, 5:49:56 PM
to fireworkflows
Hi Jonathan,

Great to hear the queue launcher issue seems to be fixed (and thanks for reporting it).

For the other two items, can you submit separate tickets? This will help keep things organized for people looking for answers to common questions.

Best
Anubhav

jkuck

Feb 13, 2017, 6:00:42 PM
to fireworkflows
Good point, done.

Jonathan