Channel closed in PwBandsWorkChain

Ignacio Martin Alliati

Jun 17, 2021, 10:12:27 AM
to aiidausers
Hi all,

I'd like to ask about an exception that I often get.
I'm running QE's PwBandsWorkChain, but I don't think the workchain itself is the problem, because I have completed it successfully for this same material (now I am only changing some parameters and re-running, e.g. adding vdw_corr or requesting a different k-point density). Moreover, the QE calculations themselves finish successfully.

The relaxation steps are fine, but the scf step ends up excepted. If I log into the cluster, I see that the QE scf calculation finished successfully (the 'out' dir is 3.5 GB). But `verdi node show 4109` says:

state        Excepted <aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed>

Then, `verdi process report 4109` shows the errors in the attached file.

I wasn't able to reproduce the error to pinpoint the cause; it just happened a couple of times. Every time it happened, though, I found the daemon to be somewhat erratic. That is, after finding out that the WorkChain had crashed:
  • `verdi status` returned a 'running' daemon
  • However, anything that I tried to submit afterwards stayed as 'Created' (with a stop emoji) in the output of `verdi process list`.
  • Then, I stopped and started the daemon
  • And only then the newly submitted jobs changed to 'Running'.
Two final points may be relevant. First, this workchain took quite some time, so I put my computer to sleep a few times while it was active. (Would it be good practice to stop the daemon before the end of the working day and start it again the next morning?) Second, this particular cluster has rotary IP numbers that are assigned randomly when I connect. Honestly, I barely know what that means, but I think that's why I needed to set the key policy to WarningPolicy; otherwise `verdi computer configure ssh` would fail.

That's all the information I could gather about the error; I hope some of it makes sense.
Any ideas as to what may be going on?
Thanks in advance,
Ignacio.
________________________________
Ignacio Martin Alliati
PhD student, Maths and Physics
Queen's University Belfast.
error.txt

Giovanni Pizzi

Jun 19, 2021, 2:18:37 AM
to AiiDA users mailing list
Dear Ignacio - if you put your computer to sleep, this is most probably the issue.
When you restart the daemon, the connections to RabbitMQ that were opened just before your computer went to sleep will "fail".
(Similarly, if e.g. AiiDA is in the process of submitting, or retrieving, a calculation, the SSH connection will be interrupted).
If you can stop the daemon before putting the computer to sleep and start it again afterwards, this should solve the issue.
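
A minimal sketch of that precaution in Python, assuming you have some way to call a script before the machine sleeps and after it wakes (the OS hook itself is not shown and depends on your system). `verdi daemon stop` and `verdi daemon start` are the actual commands; the function name `handle_power_event` and the event labels "sleep" and "wake" are hypothetical names chosen for this illustration.

```python
import subprocess
from typing import Callable, Sequence

# verdi subcommands to run for each power event.
# The event names are an assumption of this sketch, not an AiiDA concept.
COMMANDS = {
    "sleep": ["verdi", "daemon", "stop"],
    "wake": ["verdi", "daemon", "start"],
}


def handle_power_event(
    event: str,
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Stop the daemon before sleep and start it again after wake.

    `run` is injectable so the logic can be exercised without verdi installed.
    Returns the exit code reported by `run`.
    """
    if event not in COMMANDS:
        raise ValueError(f"unknown power event: {event!r}")
    return run(COMMANDS[event])
```

Calling `handle_power_event("sleep")` before closing the laptop and `handle_power_event("wake")` afterwards would mirror the manual stop/start described above.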

AiiDA has indeed been designed to run on a machine that is always on (at least while the simulations it is monitoring are running), so these use cases have not been battle-tested.
Still, AiiDA has a way to detect most of these issues, and retry later and eventually pause the calculation.
If in your case the process excepts, this is probably something we could improve - AiiDA could just pause the calculation, and retry to retrieve the results later when your computer is back online.
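
To illustrate the "retry later and eventually pause" idea in general terms, here is a small Python sketch. It is not AiiDA's implementation: the names `retry_then_pause` and `Paused`, the doubling delay, and the choice of `ConnectionError` as the transient failure are all assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Paused(Exception):
    """Signals that retries are exhausted and the task should be paused, not failed."""


def retry_then_pause(
    task: Callable[[], T],
    max_attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on ConnectionError with a doubling delay.

    If every attempt fails, raise Paused so the caller can resume later
    instead of treating the task as permanently failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(max_attempts):
        try:
            return task()
        except ConnectionError as exc:
            last_error = exc
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(delay)
            delay *= 2
    raise Paused(f"gave up after {max_attempts} attempts") from last_error
```

The point of raising a dedicated `Paused` signal rather than letting the connection error propagate is that the caller can tell "try again when the machine is back online" apart from a genuine failure.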

Could you please open an issue on GitHub describing the problem in as much detail as possible (if you can, with the daemon logs in the `~/.aiida` folder, the output of `verdi process report`, etc.)?
This might be complex, but if you find a way to reproduce the error, that would be great - if we cannot reproduce it ourselves, we won't be able to understand how to fix it.
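
One possible way to gather that information in a single text blob is sketched below in Python. It only relies on the two commands already mentioned in this thread (`verdi process report` and `verdi node show`) and on searching `~/.aiida` for files ending in `.log`; the exact location and naming of the daemon logs is an assumption, and the helper names `run_verdi` and `build_issue_report` are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def run_verdi(args: Sequence[str]) -> str:
    """Run a verdi subcommand and return its combined stdout and stderr."""
    result = subprocess.run(["verdi", *args], capture_output=True, text=True)
    return result.stdout + result.stderr


def build_issue_report(
    pk: int,
    aiida_dir: Path = Path.home() / ".aiida",
    run: Callable[[Sequence[str]], str] = run_verdi,
) -> str:
    """Collect command output and log file paths for a GitHub issue."""
    sections = []
    for args in (["process", "report", str(pk)], ["node", "show", str(pk)]):
        sections.append(f"$ verdi {' '.join(args)}\n{run(args)}")
    # Assumption: log files end in ".log" somewhere below the AiiDA folder.
    logs = sorted(Path(aiida_dir).rglob("*.log"))
    sections.append("Log files found:\n" + "\n".join(str(p) for p in logs))
    return "\n\n".join(sections)
```

For the process in this thread, `print(build_issue_report(4109))` would produce text that can be pasted into the issue, with the listed log files attached separately.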

Best,
Giovanni





Ignacio Martin Alliati

Jun 23, 2021, 4:57:00 AM
to aiidausers
Hi Giovanni,

Thanks very much for your response.
Indeed, I was able to complete the workchain successfully by taking the precaution of stopping the daemon before going away from my computer.
Reproducing this might be tricky, as the entire workchain takes days. I'll give it a go with a smaller system, putting the Mac to sleep on purpose.
I'm not sure whether this will 'work', but either way, I'll open a GitHub issue with whatever information I was able to gather.

Regards,
Ignacio.
