Hi there,
I have a question concerning the handling of long runs with checkpoint/restart features:
Imagine I have a workflow that contains long MD runs (in my specific case with Gromacs). Each firework within the workflow executes a long MD run, with linear dependencies between the different fireworks (i.e. MD1 -> MD2 -> MD3 etc.).
Now it might well happen that the firework MD1 stops prior to full completion, e.g. because the wall time limit has been reached. Gromacs detects this, writes a checkpoint file and shuts down gracefully. Since no error is issued, the firework MD1 gets the status COMPLETED. If we then continue executing the workflow, it moves on to the firework MD2, even though MD1 should be restarted instead!
To circumvent this problem, it should be possible, at the end of the firework, to check whether the MD run has finished completely or not (I think I know how to do this), and then manually change the status from COMPLETED to something else (e.g. DEFUSED).
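To make the completeness check concrete, here is a rough sketch of what I have in mind. It rests on assumptions: that the MD log contains a "Step Time" header line followed by a line whose first field is the step number, and that the target number of steps is known from the run parameters. The function names are made up for illustration and are not part of Gromacs or FireWorks.

```python
import re

# Assumed log layout: a header line "Step Time" followed by a line of values,
# the first of which is the integer step number.
STEP_HEADER = re.compile(r"^\s*Step\s+Time\s*$")


def last_reported_step(log_text):
    """Return the last step number reported in the log text, or None if absent."""
    last = None
    lines = log_text.splitlines()
    # Walk over (header, following line) pairs.
    for header, values in zip(lines, lines[1:]):
        if STEP_HEADER.match(header):
            fields = values.split()
            if fields and fields[0].isdigit():
                last = int(fields[0])
    return last


def run_is_complete(log_text, target_steps):
    """True if the last reported step has reached the requested step count."""
    last = last_reported_step(log_text)
    return last is not None and last >= target_steps
```

The result of run_is_complete would then decide whether the firework should count as finished. How to act on that result inside FireWorks is exactly what I am asking about.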
I thought that maybe this could be done with the FWAction object, but if I understand correctly, it only allows defusing the children of the firework and not the firework itself.
Is there any alternative way to do this?
I was also looking at this thread, but the solution proposed there (dynamically adding new fireworks if a checkpoint is written) is not ideal. I would prefer to execute the entire MD run in one firework rather than in several.
Any help is appreciated.
Thanks,
Stephan