Flexibility of multi-computer (client) setups and workchain restarts

Kayahan Saritas

Feb 2, 2023, 5:18:23 PM
to aiidausers
Hello, 

I have a few questions regarding multi-computer setups and workchain restarts. My expertise is in quantum Monte Carlo calculations using QMCPACK, and I am trying to understand the value that AiiDA could bring to our QMC workflows, such that we would be interested in writing an AiiDA plugin for QMCPACK. QMCPACK interfaces most heavily with Quantum ESPRESSO and uses the DFT wavefunction as input, similar to GW codes. In my experience, DFT is hard to converge for some systems, mainly because of the very hard pseudopotentials (~500 Ry cutoff) required for compatibility with QMC, which creates a bottleneck.
My impression is that AiiDA could be helpful in this area with self-healing QE workchains.
Beyond that benefit, my questions are about using QMCPACK within the AiiDA framework. QMC workflows are quite different from typical DFT workflows in terms of the size of the data that has to be moved around and in how the QMC and DFT parts of a workflow are most efficiently combined.

1. For QMC calculations it is best if the number of nodes and the walltime can be determined on the fly, for example after the DFT run it depends on has finished. Can AiiDA submit jobs such that the number of nodes and the walltime for that job are determined within the workflow?

2. It would be a better use of computational resources if I could perform the DFT and QMC parts of a workflow on separate computers, for example the DFT calculation on a mid-size institutional cluster and the QMC part on a supercomputer. This requires that the wavefunction, several GB in size at least, be transferred from computer A to computer B as part of the workflow. Is there any implementation within AiiDA that could perform these transfers robustly?

3. This is something that is touched on in the workshops, but I couldn't find a detailed answer yet. Say one QE job in a workflow fails on the first attempt, and the error handling in AiiDA then tries several things, such as reducing the mixing or switching to CG diagonalization, to make the calculation work. However, let's say that task still fails after those tries. Think of a finite-displacement phonon calculation where all but one of the atomic-displacement DFT runs completed, and that one failed despite all the error-handling attempts; all the runs need to complete successfully to get the phonon band spectrum. How should I work with that persistently failing task within AiiDA so that the other, successful tasks can still be used and I can get the final result? Is there a way that I can manipulate the process object's parameters by hand and then resubmit the job to the queue?

This has turned into a long email, but hopefully it will be useful to others who have similar questions. I appreciate your help.

Thanks,
Kayahan

Sebastiaan Huber

Feb 3, 2023, 2:54:44 AM
to aiida...@googlegroups.com
Hi Kayahan,
>
> 1. For QMC calculations it is best if the number of nodes and the
> walltime can be determined on the fly, for example after the DFT run
> it depends on has finished. Can AiiDA submit jobs such that the number
> of nodes and the walltime for that job are determined within the
> workflow?
Yes. In AiiDA, workflows are implemented by what is called a `WorkChain`.
Since a work chain is written in Python, you can code any logic in it.
So in your example, the work chain would first launch a DFT calculation,
and once that is done, analyze the results in Python and set the
resources that are ideal for the QMC calculation to be launched next.

We already use a similar concept in the `aiida-quantumespresso` plugin
(for the Quantum ESPRESSO suite) in the work chain for pw.x.
There we first launch a cheap pw.x calculation that only computes the
problem size and then exits; the work chain then analyzes those results
to automatically determine the optimal parallelization settings and
resources.
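
To make the first point concrete, here is a minimal, untested sketch of
what such a work chain could look like. `QmcpackCalculation` and its
ports are hypothetical (no QMCPACK plugin exists yet), and the heuristic
for picking the number of machines is just a placeholder:

    from aiida.engine import ToContext, WorkChain
    from aiida.plugins import CalculationFactory

    PwCalculation = CalculationFactory('quantumespresso.pw')
    QmcpackCalculation = CalculationFactory('qmcpack')  # hypothetical entry point


    class DftThenQmcWorkChain(WorkChain):
        """Sketch: run a DFT calculation, then choose the QMC resources from its results."""

        @classmethod
        def define(cls, spec):
            super().define(spec)
            spec.expose_inputs(PwCalculation, namespace='dft')
            spec.expose_inputs(QmcpackCalculation, namespace='qmc')
            spec.outputs.dynamic = True
            spec.outline(cls.run_dft, cls.run_qmc, cls.results)

        def run_dft(self):
            inputs = self.exposed_inputs(PwCalculation, namespace='dft')
            return ToContext(dft=self.submit(PwCalculation, **inputs))

        def run_qmc(self):
            # Decide nodes/walltime on the fly from the finished DFT calculation.
            # 'number_of_electrons' is a key in the pw.x parsed output parameters.
            n_elec = self.ctx.dft.outputs.output_parameters['number_of_electrons']
            num_machines = max(1, int(n_elec) // 64)  # placeholder heuristic

            inputs = dict(self.exposed_inputs(QmcpackCalculation, namespace='qmc'))
            # Overwrite the scheduler options with the values chosen at run time.
            inputs['metadata'] = {'options': {
                'resources': {'num_machines': num_machines},
                'max_wallclock_seconds': 12 * 3600,
            }}
            return ToContext(qmc=self.submit(QmcpackCalculation, **inputs))

        def results(self):
            # 'output_parameters' is an assumed output of the hypothetical plugin.
            self.out('qmc_parameters', self.ctx.qmc.outputs.output_parameters)

The key point is simply that `run_qmc` executes in Python after the DFT
job has finished, so any of its outputs can feed into the scheduler
options of the next job.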

> 2. It would be a better use of computational resources if I could
> perform the DFT and QMC parts of a workflow on separate computers, for
> example the DFT calculation on a mid-size institutional cluster and
> the QMC part on a supercomputer. This requires that the wavefunction,
> several GB in size at least, be transferred from computer A to
> computer B as part of the workflow. Is there any implementation within
> AiiDA that could perform these transfers robustly?
Yes, this is possible, but unfortunately not directly from A to B: if
A and B are both different from the computer where AiiDA is running, the
data will first be retrieved from A to the AiiDA machine and then be
copied to B.
So it is important to keep in mind that this will put considerable data
transfer load on the AiiDA server.
If you were to run both calculations on the same machine (or two
machines that share the same file system), you could simply symlink the
data and make it very efficient, so this is a trade-off between
flexibility and efficiency.
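
For what it's worth, a rough, untested sketch of the "via the AiiDA
machine" route: it assumes the DFT job retrieved the wavefunction file
into AiiDA's repository and that a hypothetical QMCPACK plugin exposes a
`wavefunction` input of type `SinglefileData` (the file path, code label
and resources below are purely illustrative):

    from aiida import orm
    from aiida.engine import submit

    # Wavefunction produced on computer A; in a real workflow this node would
    # be created from the DFT calculation's retrieved files.
    wavefunction = orm.SinglefileData('/home/aiida/staging/pwscf.wfc1')

    # The code is configured on computer B; the engine uploads all file-like
    # inputs of the job to that computer when it submits the calculation.
    qmc_code = orm.load_code('qmcpack@supercomputer')  # illustrative label
    builder = qmc_code.get_builder()
    builder.wavefunction = wavefunction  # hypothetical input port
    builder.metadata.options.resources = {'num_machines': 16}
    builder.metadata.options.max_wallclock_seconds = 6 * 3600

    qmc_job = submit(builder)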

> 3. This is something that is touched on in the workshops, but I
> couldn't find a detailed answer yet. Say one QE job in a workflow
> fails on the first attempt, and the error handling in AiiDA then tries
> several things, such as reducing the mixing or switching to CG
> diagonalization, to make the calculation work. However, let's say that
> task still fails after those tries. Think of a finite-displacement
> phonon calculation where all but one of the atomic-displacement DFT
> runs completed, and that one failed despite all the error-handling
> attempts; all the runs need to complete successfully to get the phonon
> band spectrum. How should I work with that persistently failing task
> within AiiDA so that the other, successful tasks can still be used and
> I can get the final result? Is there a way that I can manipulate the
> process object's parameters by hand and then resubmit the job to the
> queue?
If a job that is part of a bigger workflow fails, you can very easily
construct a new job (with a single line of Python) from the failed one
and change certain parameters before resubmitting.
However, this will create a new job in the provenance graph (it won't
override or replace the failed one, as that would destroy provenance).
The new job will then not officially be part of the original workflow,
and the workflow will not automatically continue.
That being said, you can design large workflows to account for this
possibility, and together with the "caching" mechanism in AiiDA [1] it
becomes easy to relaunch the entire workflow with adjusted inputs for
just the failing job.
The caching mechanism (when enabled) ensures that whenever AiiDA is
asked to submit a job, it first checks the database whether the exact
same job (same inputs) has already been run.
If that is the case, instead of rerunning it, AiiDA simply takes the
outputs that already exist in the database and continues.
With this mechanism you could relaunch the failed workflow: only the
failed job would actually be rerun, and all the other jobs (which had
already completed) would be taken from the database.
This approach does take some careful thinking and design for the
top-level workflow, but it is commonly used.
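
In code, those two pieces could look roughly like the following; the pk,
the parameter tweak and the `PhononWorkChain`/`workflow_inputs` names
are placeholders:

    from aiida import orm
    from aiida.engine import submit
    from aiida.manage.caching import enable_caching

    # 1. Rebuild the failed pw.x job from its node and adjust the inputs.
    failed = orm.load_node(1234)                  # pk of the failed calculation (placeholder)
    builder = failed.get_builder_restart()        # the "single line of Python"
    parameters = builder.parameters.get_dict()
    parameters.setdefault('ELECTRONS', {})['mixing_beta'] = 0.2  # whatever should fix it
    builder.parameters = orm.Dict(parameters)
    submit(builder)

    # 2. Once the fixed job has finished, relaunch the top-level workflow with
    #    caching enabled: any sub-job whose inputs match an already successful
    #    calculation is taken from the database instead of being rerun.
    with enable_caching():
        submit(PhononWorkChain, **workflow_inputs)  # placeholder workflow and inputs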

Hope this answered your questions.

Regards,

Sebastiaan


[1]
https://aiida.readthedocs.io/projects/aiida-core/en/latest/howto/run_codes.html?highlight=caching#how-to-save-compute-time-with-caching

Kayahan Saritas

Feb 3, 2023, 9:16:27 AM
to aiidausers
Thank you for the detailed answers, Kayahan