Trouble getting custom firetasks to run on remote servers using qlaunch


Michael B

Jul 7, 2016, 1:15:36 PM
to fireworkflows
Hello,

I'm currently running a Python script on my laptop which submits jobs to another computer's database (let's call it "Base") and I then connect to Base through ssh to execute "qlaunch -rh worker -ru michael rapidfire" in the command line (which will send the jobs to another computer called "worker"). I have Fireworks installed on all 3 computers and this works fine when my fireworks contain only built-in firetasks. However, when I include custom firetasks (in this case the RunQECalc task stored in run_qe_calc_task_v2.py) and try to run qlaunch I get the following error: 

michael@Base:~/FireworkFiles/rocketruns$ qlaunch -rh worker -rc /home/michael/FireworkFiles/rocketruns/ -ru michael rapidfire --nlaunches 1
[worker] run: qlaunch  rapidfire --nlaunches 1
[worker] out: 2016-07-07 09:56:59,786 INFO getting queue adapter
[worker] out: 2016-07-07 09:56:59,786 INFO Created new dir /home/michael/FireworkFiles/rocketruns/block_2016-07-07-15-56-59-786446
[worker] out: 2016-07-07 09:56:59,803 INFO The number of jobs currently in the queue is: 0
[worker] out: 2016-07-07 09:56:59,803 INFO 0 jobs in queue. Maximum allowed by user: 10
[worker] out: 2016-07-07 09:56:59,820 ERROR ----|vvv|----
[worker] out: 2016-07-07 09:56:59,820 ERROR Error with queue launcher rapid fire!
[worker] out: 2016-07-07 09:56:59,821 ERROR Traceback (most recent call last):
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/queue/queue_launcher.py", line 192, in rapidfire
[worker] out:     while jobs_in_queue < njobs_queue and launchpad.run_exists(fworker) \
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/core/launchpad.py", line 511, in run_exists
[worker] out:     return bool(self._get_a_fw_to_run(query=q, checkout=False))
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/core/launchpad.py", line 663, in _get_a_fw_to_run
[worker] out:     m_fw = self.get_fw_by_id(m_fw['fw_id'])
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/core/launchpad.py", line 316, in get_fw_by_id
[worker] out:     return Firework.from_dict(self.get_fw_dict_by_id(fw_id))
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 147, in _decorator
[worker] out:     new_args[0] = {k: _recursive_load(v) for k, v in args[0].items()}
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 147, in <dictcomp>
[worker] out:     new_args[0] = {k: _recursive_load(v) for k, v in args[0].items()}
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 108, in _recursive_load
[worker] out:     return {k: _recursive_load(v) for k, v in obj.items()}
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 108, in <dictcomp>
[worker] out:     return {k: _recursive_load(v) for k, v in obj.items()}
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 111, in _recursive_load
[worker] out:     return [_recursive_load(v) for v in obj]
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 103, in _recursive_load
[worker] out:     return load_object(obj)
[worker] out:   File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 306, in load_object
[worker] out:     mod = __import__(modname, globals(), locals(), [classname], 0)
[worker] out: ImportError: No module named run_qe_calc_task_v2
[worker] out:
[worker] out: 2016-07-07 09:56:59,821 ERROR ----|^^^|----
[worker] out:

Disconnecting from worker... done.

The error tells me that it cannot find the module run_qe_calc_task_v2 but here's what I don't understand. Base and worker share the same home directory so the same .bashrc file is run whenever I connect to either using ssh. Within .bashrc I've included the line export PYTHONPATH="${PYTHONPATH}:/home/michael/FireworkFiles/my_firetasks/" where I store all of my custom firetasks, so if I execute the line echo $PYTHONPATH on either computer then the my_firetasks directory is present. I've also set the my_firetasks directory within my laptop's PATH variable so there isn't a problem when I try to run qlaunch rapidfire on Base, or qlaunch rapidfire on worker when there's a my_launchpad.yaml file connecting it to Base's database. The problem only appears when I send the firework through remote qlaunch.

I tried replacing the custom firetask with task2 = ScriptTask(script = 'echo $PYTHONPATH ; python -c "import os; print(os.environ)"') in order to see if the PYTHONPATH variable was present when the firework was running. Again, /home/michael/FireworkFiles/my_firetasks/ was present for qlaunch rapidfire on both Base and worker but couldn't be found for qlaunch -rh worker -ru michael rapidfire, leading me to wonder why .bashrc isn't read on worker when it boots up to run the job.
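A quick way to check the same thing from inside a job is a short diagnostic script (a sketch; the module name is taken from the ImportError in the traceback above, and Python 3 syntax is shown even though the traceback indicates Python 2.7):

```python
import importlib.util
import os

# Module name taken from the ImportError in the traceback above.
name = "run_qe_calc_task_v2"

print("PYTHONPATH:", os.environ.get("PYTHONPATH", "<unset>"))
# find_spec() returns None when the module cannot be located on sys.path.
print("importable:", importlib.util.find_spec(name) is not None)
```

Running this as a ScriptTask on each machine shows directly whether the interpreter that executes the firework can see the my_firetasks directory.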

The only solution I found was to add run_qe_calc_task_v2.py to the fireworks.user_objects directory on each computer. That way I could change the import statement in the python script from from run_qe_calc_task_v2 import RunQECalc to from fireworks.user_objects.run_qe_calc_task_v2 import RunQECalc and then remote qlaunch would work fine. The problem is I will later be accessing other computing clusters where I don't have access to the site-packages directory of Python. The only way I've thought of to work around this is to create virtual environments on each computer where I can store all custom firetasks in fireworks.user_objects but I want to know if there's a simpler solution. How can I get remotely launched fireworks to remember the PYTHONPATH, or how can I make sure remote fireworkers read their .bashrc file when sent a remote firework?

Michael

Anubhav Jain

Jul 7, 2016, 1:37:02 PM
to Michael B, fireworkflows
Hi Michael,

I don't use remote qlaunch at all (that feature was coded by a collaborator), so I haven't run into this issue myself.

I am guessing the issue is related to the fact that remote qlaunch uses the "fabric" library to execute remote commands, and by default the commands are not run through a shell.


The actual remote execution is done in this line in the FireWorks code through the fabric run() method:

fireworks/scripts/qlaunch_run.py:170

which reads:

run("qlaunch {} {} {}".format(pre_non_default, args.command, non_default))

As you can see, fabric.run() is being run without a "shell=True" kwarg. I would try playing around with that line of code, trying something like:

run("qlaunch {} {} {}".format(pre_non_default, args.command, non_default), shell=True)

as a first guess. If not, I would try referring to the fabric docs (http://docs.fabfile.org/en/1.11/api/core/operations.html).

If this helps you in finding a solution, it would be great if you can submit a pull request back to fix the issue for future users.

Best,
Anubhav



Anubhav Jain

Jul 7, 2016, 1:41:02 PM
to Michael B, fireworkflows
A quick modification to the above:

It looks like "shell=True" is already the default for fabric's run() method. So you will likely need to try some other things, like:

run("source .bashrc; qlaunch {} {} {}".format(pre_non_default, args.command, non_default))

or 

run("/bin/bash qlaunch {} {} {}".format(pre_non_default, args.command, non_default))

or 

run("export PYTHONPATH={{xyz}}; qlaunch {} {} {}".format(pre_non_default, args.command, non_default))

Hopefully one of those will work and get you up and running in the short term.
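A sketch of the third suggestion, showing how the remote command string could be assembled with the PYTHONPATH export baked in (the variable values here are illustrative stand-ins for what qlaunch_run.py actually passes):

```python
# Illustrative values standing in for qlaunch_run.py's variables.
pre_non_default = ""
command = "rapidfire"
non_default = "--nlaunches 1"
pythonpath = "/home/michael/FireworkFiles/my_firetasks/"

# Prefix the remote qlaunch invocation with an explicit PYTHONPATH export,
# since a non-interactive fabric shell may not have sourced ~/.bashrc.
remote_cmd = 'export PYTHONPATH="$PYTHONPATH:{}"; qlaunch {} {} {}'.format(
    pythonpath, pre_non_default, command, non_default)
print(remote_cmd)
```

With fabric, this string would then be handed to run(remote_cmd) in place of the bare qlaunch command.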

Best,
Anubhav

Michael B

Jul 7, 2016, 5:03:02 PM
to fireworkflows, dinkysa...@gmail.com
Hi Anubhav,

Thank you for the recommendations. I tried all 3 of them and the last one worked but I'm not sure why the other two didn't. Putting that aside, however, why don't you use remote qlaunch? Sending jobs to run in other computers is the vital functionality of fireworks that I require (and I'm assuming is the reason why some other people use the program too) so is there another method you can use to achieve this? 

I also have another question regarding the use of the queue. If I want my job to run on multiple cpus, I will include mpirun at the start of the command which runs the calculator and specify cpus_per_task: 12 within my_qadapter.yaml (the queuing system I'm using is Slurm). However, when I read the output files from the calculation they state only 1 processor was used; the only way I can make it run on 12 processors is to write mpirun -np 12 within the run calculator line, but this negates the purpose of only needing to specify the number within my_qadapter.yaml. Is there something I'm missing?

Thanks,
Michael

Anubhav Jain

Jul 7, 2016, 5:25:00 PM
to Michael B, fireworkflows
>> Sending jobs to run in other computers is the vital functionality of fireworks that I require (and I'm assuming is the reason why some other people use the program too) so is there another method you can use to achieve this? 

I use a combination of strategies to run on remote computers:
(i) explicitly logging into those computers and typing qlaunch
(ii) having a crontab installed on the remote computers that automatically submits jobs every hour. This basically means that I am keeping the queue relatively close to my queue limits at all times without needing to run remote launch.

With (ii), I rarely need to do (i), and in the cases I need to do (i) (e.g. for testing new code), I usually have other reasons to want to be logged into the remote computer. So I never really spent much time with remote qlaunch myself, although I can see why it would be very useful. Note that there have been a couple of FWS users that have wanted to extend the remote functionality, adding more advanced features like preferentially launching to some clusters based on availability, but I am not sure how close any of them are to pushing any code back into the master that improves the situation.
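As a sketch, the crontab entry for approach (ii) might look like the following (a hypothetical config fragment; the schedule and the sourcing of ~/.bashrc are illustrative, the latter because cron jobs start with a minimal environment):

```
SHELL=/bin/bash
# Run qlaunch at the top of every hour (illustrative schedule).
0 * * * * source ~/.bashrc; qlaunch rapidfire
```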

>> However, when I read the output files from the calculation they state only 1 processor was used; the only way I can make it run on 12 processors is to write mpirun -np 12 within the run calculator line, but this negates the purpose of only needing to specify the number within my_qadapter.yaml. Is there something I'm missing?

You do need to set "-np" in your run line, along with setting the same number in the qlauncher. There are a couple of additional notes on this that might make life better:
* If you are running on a single node, your custom firetask can read the number of cpus (e.g. using the multiprocessing module) within Python and then auto-set the "np" parameter before running your mpirun command. Then you don't need to hard code the number of processors (it will be detected by the code automatically, and be different for different machines), and technically the number "12" will only be present in your my_qadapter.yaml file.
* If the above does not suffice (e.g. you are running on multiple nodes), another option is to not hard-code the number <X> in your "mpirun -n <X>" command, but rather to use the FW env framework for getting the number <X>. In this case, you will need to set the number "12" in both your my_qadapter.yaml file and your my_fworker.yaml file, but you will *not* need to write the number 12 anywhere in your Python code. Your Python code will read whatever <X> is put in the fw_env from the my_fworker.yaml. This will make sure that different computers can run different numbers of processors gracefully by modifying both the my_qadapter.yaml and the my_fworker.yaml in their configuration. See the docs on the FW env for more details if you're interested in this option.
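The first option above might look roughly like this inside a custom firetask (a sketch; "pw.x -in scf.in" is a hypothetical stand-in for whatever command the task actually runs):

```python
import multiprocessing

# Detect the CPU count on the node instead of hard-coding "12".
np = multiprocessing.cpu_count()

# Hypothetical run line; only the -np detection matters here.
cmd = ["mpirun", "-np", str(np), "pw.x", "-in", "scf.in"]
print(" ".join(cmd))
# Inside the firetask this list would be handed to subprocess.call(cmd).
```

The number of processors is then whatever the node reports, so the same firetask works unchanged across machines.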

There are probably more complicated setups possible if you have stricter demands, but those are the two most straightforward ways I can think of given your problem statement.

Best,
Anubhav




Anubhav Jain

Jul 7, 2016, 5:49:45 PM
to Michael B, fireworkflows
Btw, just a quick note that a few emails up, I wrote:

run("source .bashrc; qlaunch {} {} {}".format(pre_non_default, args.command, non_default))

which should have read:

run("source ~/.bashrc; qlaunch {} {} {}".format(pre_non_default, args.command, non_default))

Not sure if that will make any difference but anyway just wanted to correct the intended command.

Michael B

Jul 8, 2016, 11:18:16 AM
to fireworkflows, dinkysa...@gmail.com
Thank you Anubhav! That is very helpful to know. I will ask if I have any more questions.

Michael