I have two solutions for you.
The one requiring the least change from your current setup is to establish an exterior directory (outside of the launch dirs) to hold all the data for a set of runs (i.e., one complete MD simulation), and to store these directories somewhere in the Fireworks' specs (so you can look them up later if needed). In your bash script, after a checkpoint is made, you could make a dir specific to this set of jobs (if it doesn't already exist), copy the checkpoint data there, make a queue submission, etc. Then, when/if your MD sim finishes completely, have your bash script consolidate the data in this exterior directory into a format you can easily read.
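The copy step could be sketched like this in Python (a minimal sketch, not FireWorks API; `archive_checkpoint`, the directory layout, and the `run_id` naming are all my own assumptions):

```python
import os
import shutil

def archive_checkpoint(checkpoint_dir, run_archive_dir, run_id):
    """Copy one checkpoint out of a launch dir into the exterior directory
    for this MD run, creating the per-run subdirectory if needed.
    (Hypothetical helper -- names and layout are assumptions, not FireWorks.)"""
    dest = os.path.join(run_archive_dir, run_id, os.path.basename(checkpoint_dir))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copytree(checkpoint_dir, dest)
    return dest
```

You'd then record `run_archive_dir` (or the per-run subdirectory) in the Firework's spec so you can find the data later.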
One way to implement this is with a larger workflow. If your runs right now are just one Firework (let's call it VASP_FW), your dynamic workflow might look like this:
VASP_FW1 - Runs, realizes job won't finish in time. Checkpoints, dynamically adds new FW (VASP_FW2)
|
|
VASP_FW2 - Runs, realizes job won't finish in time. Checkpoints, dynamically adds new FW (VASP_FW3)
|
|
... (process repeats)
|
|
VASP_FW_N - Runs, job finishes. Consolidates all the data from Fireworks VASP_FW(1 thru N) into the launch_dir for this Firework, so you have all the checkpoint data in one place (the launch_dir of the final FW).
This scheme will probably require you to write custom Firetasks (see
here and
here for more info), if you are not already doing so. The main con is that there is some added complexity, but the pro is that once it is figured out you will have much more flexibility. You can add new Fireworks to the workflow (through the "additions" argument to the
FWAction object returned at the end of run_task in whatever Firetask you use to run your MD), and you can pass information to subsequent Fireworks, e.g. the directories of past checkpoints (either through the new FW's spec, through the
file-passing interface (_files_in and _files_out), or through the "mod_spec" or "update_spec" arguments to FWAction). Another added perk is that you will have one workflow for an entire MD run, rather than a bunch of separate Fireworks.
The Python pseudocode for your Firetask and Firework(s) could look something like:
import os

from fireworks import Firework, FWAction, LaunchPad, Workflow
from fireworks.core.firework import FiretaskBase
from fireworks.utilities.fw_utilities import explicit_serialize

@explicit_serialize
class RunMDDynamicTask(FiretaskBase):
    def run_task(self, fw_spec):
        prev_checkpoint_dirs = fw_spec.get("checkpoint_dirs", [])
        # run commands for VASP MD, checking walltime, creating a checkpoint, etc.
        ...
        if job_finished:
            # gather all the checkpoint data into this FW's launch_dir
            consolidate_checkpoints_to_this_dir(prev_checkpoint_dirs)
            return FWAction()
        else:
            # pass the full list of checkpoint dirs (including this one) along,
            # plus any other params the next FW needs
            new_spec = {"checkpoint_dirs": prev_checkpoint_dirs + [os.getcwd()]}
            new_fw = Firework(RunMDDynamicTask(), spec=new_spec)
            return FWAction(additions=new_fw)

if __name__ == "__main__":
    launchpad = LaunchPad.auto_load()
    vasp_fw1 = Firework(RunMDDynamicTask())
    wf = Workflow([vasp_fw1], name="MD Run for System Z")
    launchpad.add_wf(wf)
You'll notice there is no queue submission in the above workflow description. This is because I'd recommend having a cron job make queue submissions for you automatically (e.g., every 12 hours), completely separate from the operation of the workflow above; mixing workflow execution and queue submission tends to be confusing, for me at least. By having crontab submit your jobs automatically, as soon as one of your FWs finishes and the next one is "READY", a queue submission you made previously will pull and run the next job. While much faster than waiting for old jobs to finish before making queue submissions for new ones, it will not preserve the job_id AFAIK (though I'm not sure why that would be needed?).
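As a sketch, the crontab entry might look like the following (the path and the job cap are placeholders; this assumes FireWorks' qlaunch command is on your PATH and your my_launchpad.yaml/my_fworker.yaml/my_qadapter.yaml files live in that directory):

```shell
# Every 12 hours, keep up to 10 FireWorks jobs in the queue via qlaunch.
# /path/to/fw_config is a placeholder for wherever your FireWorks config lives.
0 */12 * * * cd /path/to/fw_config && qlaunch rapidfire -m 10
```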
If you prefer not to do that, I guess you could just add a command for submitting to the queue inside the else block of the above Firetask, i.e., "if the job is not finished, submit to the queue with job id X and add another FW to the workflow"; I've never done this, though, so it could wind up causing some goofy behavior.
Thanks,
Alex