Best way to restart a lost VASP relaxation run


Sandip De

Aug 21, 2018, 7:46:59 AM
to atomate
Hello,
   I have recently started using atomate and am enjoying it a lot. I have a question about lost VASP geometry relaxation runs: the jobs hit the queue walltime limit and crashed. I know how to restart the whole workflow with an increased walltime, changed inputs, etc. But if I don't want to waste the old unfinished run and would rather continue from the last geometry obtained in the lost run, what is the best way to proceed? I know I can write a workflow that copies the output from the lost-run directory and starts from there, but I am wondering whether there is a simpler way.
Thanks a lot 
Best regards
Sandip

Anubhav Jain

Aug 21, 2018, 11:48:12 AM
to 1san...@gmail.com, atomate
Hi Sandip

Currently there is no easy way in atomate to restart only from where you left off. As you mentioned, you can either (i) restart from scratch or (ii) create a new workflow that starts where you left off. Another option would be to do it the way certain molecular dynamics workflows are written in atomate - these checkpoint the job before the walltime by writing a STOPCAR and restart the new job from where it left off. But that adds complication to the workflow programming, and thus we did not implement things this way for simple geometry optimizations; you would have to modify the geometry optimization workflow to also work in this manner. Finally, you could manually update the FireWorks database itself so that the structure in the spec of your Firework is the partially optimized structure, and thus when restarting the job you would start from that structure. To do this, you'd need to be comfortable with MongoDB/pymongo as well as with FireWork specs and how Structure objects are serialized in pymatgen.
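
For example, here is a minimal sketch of that last option using the FireWorks LaunchPad API (on top of pymongo). The fw_id, CONTCAR path, and spec key below are hypothetical and depend on how your workflow was built, so inspect the spec first with lp.get_fw_dict_by_id(fw_id)["spec"]:

from fireworks import LaunchPad
from pymatgen.io.vasp import Poscar

lp = LaunchPad.auto_load()   # reads your my_launchpad.yaml
fw_id = 1234                 # hypothetical: id of the lost/fizzled Firework

# Last geometry written by the interrupted run
structure = Poscar.from_file("/path/to/old_launch_dir/CONTCAR").structure

# Structure objects serialize via as_dict(); the spec key that holds the input
# structure (here assumed to be the first task's "structure" field) depends on
# how the Firework was constructed.
lp.update_spec([fw_id], {"_tasks.0.structure": structure.as_dict()})
lp.rerun_fw(fw_id)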

I am currently thinking about how to redesign some of the atomate workflows, and making this kind of use case easier is one thing I'll keep in mind for the future. For now, I think most of us just restart the job with a longer walltime.

Best,
Anubhav


Sandip De

Aug 21, 2018, 1:08:48 PM
to AJ...@lbl.gov, ato...@googlegroups.com
Dear Anubhav,
    Thanks for your prompt reply and the suggested workarounds. Overriding the spec with the latest geometry would indeed be quite a simple workaround, and I think I know how to do that.
Thank you
Best Regards
Sandip



Sandip De

Aug 23, 2018, 5:41:37 AM
to AJ...@lbl.gov, ato...@googlegroups.com
Dear Anubhav,
     One more question along the same lines. I have some FIZZLED workflows resulting from unconverged VASP geometry relaxations, with errors like the one below; basically, one of the steps in the double relaxation workflow did not converge. In such a situation, to restart the jobs do I have the same workaround choices as you mentioned before, or is there something built in?


{ 'actions': None,
  'errors': [ 'Non-converging '
              'job'],
  'handler': <custodian.vasp.handlers.NonConvergingErrorHandler object at 0x2aaac8a2bcf8>}
Unrecoverable error for handler: <custodian.vasp.handlers.NonConvergingErrorHandler object at 0x2aaac8a2bcf8>. Raising RuntimeError
Traceback (most recent call last):
  File "/gpfs/backup/users/home/desa/.local/lib/python3.6/site-packages/custodian/custodian.py", line 320, in run
    self._run_job(job_n, job)
  File "/gpfs/backup/users/home/desa/.local/lib/python3.6/site-packages/custodian/custodian.py", line 446, in _run_job
    raise CustodianError(s, True, x["handler"])
custodian.custodian.CustodianError: (CustodianError(...), 'Unrecoverable error for handler: <custodian.vasp.handlers.NonConvergingErrorHandler object at 0x2aaac8a2bcf8>. Raising RuntimeError')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/backup/users/home/desa/.local/lib/python3.6/site-packages/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/gpfs/backup/users/home/desa/.local/lib/python3.6/site-packages/atomate/vasp/firetasks/run_calc.py", line 204, in run_task
    c.run()
  File "/gpfs/backup/users/home/desa/.local/lib/python3.6/site-packages/custodian/custodian.py", line 330, in run
    .format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), 'Unrecoverable error for handler: <custodian.vasp.handlers.NonConvergingErrorHandler object at 0x2aaac8a2bcf8>. Raising RuntimeError'). Exited...
Walltime used is = 07:56:56
CPU Time used is = 316:10:19
Memory used is   = 5836620kb

Thank you
Best Regards
Sandip



Anubhav Jain

Aug 23, 2018, 11:50:09 AM
to Sandip De, atomate
The workaround choices are the same, but you should really examine the job (look in the directory at custodian.json and error.x.tar.gz). If you did a simple restart with a longer walltime, you would almost certainly hit the same "max custodian errors" problem again. This is because the job looks to have died not due to insufficient time, but because custodian could not find a way to converge the job by changing VASP input parameters. It is possible that restarting the job with a partially optimized geometry would work - although it is hard to say.
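
If it helps, here is a minimal sketch for skimming the custodian trace, assuming the standard custodian.json layout (a list of per-job records, each with a "corrections" list). Run it inside the old launch directory; note the file may be gzipped if the run was archived:

import json

with open("custodian.json") as f:
    runs = json.load(f)

for i, run in enumerate(runs):
    print("custodian job", i)
    for correction in run.get("corrections", []):
        print("  errors: ", correction.get("errors"))
        print("  actions:", correction.get("actions"))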

Typically in situations like this, our procedure would be:

- look at the output files of the job and custodian trace as mentioned above
- decide on any changes needed to custodian that would help bring the job back on the right track - or remove any custodian rules that are interfering with the job
- push those changes to custodian
- update the custodian version on our cluster to reflect the new set of rules for fixing jobs
- restart the job

This can be a long process since it involves updating the "fix" rules. For example, this PR (https://github.com/materialsproject/custodian/pull/76) has been up in the air for a while; even though I think it is useful, it is not sufficiently "proven", I guess, to be merged in. Another option is to update which handlers are being used by custodian for your specific job, which can again be done via the FW spec if you know how to "hack" it - this makes the most sense if you think a handler is interfering with the job's progress. You can also update the number of max errors for custodian, although in my experience this doesn't usually help; if the job hits max errors it's probably not recoverable using the current strategy.
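
As a rough sketch of that handler/max_errors option via the FW spec (assuming the fizzled Firework uses atomate's RunVaspCustodian task; the fw_id and task index are hypothetical, so check the spec first):

from fireworks import LaunchPad

lp = LaunchPad.auto_load()
fw_id = 1234  # hypothetical: id of the fizzled Firework

# Find which _tasks entry is RunVaspCustodian, e.g.:
#   lp.get_fw_dict_by_id(fw_id)["spec"]["_tasks"]
# RunVaspCustodian accepts "handler_group" (e.g. "default", "md", "no_handler")
# and "max_errors"; here it is assumed to sit at task index 1.
lp.update_spec([fw_id], {"_tasks.1.handler_group": "no_handler",
                         "_tasks.1.max_errors": 10})
lp.rerun_fw(fw_id)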

Best,
Anubhav




Sandip De

Aug 23, 2018, 6:27:19 PM
to AJ...@lbl.gov, ato...@googlegroups.com
Hi Anubhav,
   Thanks for the suggestion. I will check the custodian errors and try to figure out how to fix them.

Thank you
Best Regards
Sandip


