AiiDA appears to drop connection


Nathan Keilbart

Sep 27, 2022, 12:19:15 PM
to aiidausers
Hello everyone,

I am very much enjoying using AiiDA and currently have v2.0.3 installed on a cluster at LLNL, where I work. It is set up with the RabbitMQ and PostgreSQL services hosted externally from AiiDA, and I am able to get calculations to go through and retrieve results afterwards.
I have recently been attempting to submit around 10K calculations with the NWChem plugin. I am getting decent throughput, but when I come back the next day I sometimes find that some of the services appear to have stopped. I am uncertain whether a setting we have on the server might be causing this, and I'm hoping someone can help me track it down. I have been running AiiDA on a login node of one of our clusters so that it can SSH freely between the different clusters without a password. I have also left it running in a `screen` instance, and when I come back I can still see all the daemon workers going.

I have looked through the daemon log for insight; it mainly shows errors indicating that it is unable to upload new jobs, or similarly related issues. It appears to happen at different times as well, or at least I haven't noticed a pattern.

Like I said, I suspect it's something going on with our servers, but I'd like to be able to point the right people in the right direction. Any pointers for tracking down the possible issue would be appreciated.

Nathan

Sebastiaan Huber

Sep 27, 2022, 5:07:25 PM
to aiida...@googlegroups.com
Hi Nathan,

Glad you are enjoying AiiDA so far.
To be able to help, I would need a little bit more information on the symptoms.
When you say "the services appear to have stopped", which services are you referring to, and what observation makes you think they have stopped?
How many daemon workers are you running?
Are you submitting all 10K calculations in one go, i.e. do you have 10K calculations active in the output of `verdi process list`?
The more concrete information and details you can provide on what you do exactly and what you observe, the better I will be able to suggest what to investigate next.

Regards,

Sebastiaan
--
AiiDA is supported by the NCCR MARVEL (http://nccr-marvel.ch/), funded by the Swiss National Science Foundation, and by the European H2020 MaX Centre of Excellence (http://www.max-centre.eu/).
 
Before posting your first question, please see the posting guidelines at http://www.aiida.net/?page_id=356 .
---
You received this message because you are subscribed to the Google Groups "aiidausers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aiidausers+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aiidausers/76e9fa04-9d81-4686-9a0b-1b86577f6b45n%40googlegroups.com.

Nathan Keilbart

Sep 27, 2022, 5:13:47 PM
to aiidausers
Hello Sebastiaan,

Yes, no problem at all. Whatever information I can give, I will.

My thought process is simply that it runs smoothly all day while I'm logged in myself, and then overnight or over the weekend it hits a point where all of the remaining calculations have issues. That makes me think one of the services, or perhaps the connections, can no longer be accessed. One of the common issues I see is that it can't upload files to the server. Other times, it seems to have issues downloading files, or thinks that it already has but was unable to complete that process.

For the daemons, I am submitting around 1000 structures at a time, each of which has the workflow I've developed, the base workflow, and the base calcjob node, making it 3000 things for the daemons to keep track of. I believe I had around 15 daemon workers running to keep the jobs going. As I mentioned, they are still up and running when I log back into the same login node.

As I mentioned, I have a feeling it might be a setting on our server, since it seems like an upload/download issue, but the admins aren't seeing anything on their side. If you have any thoughts on where I could suggest they look, I would appreciate it. Let me know if I didn't provide enough information or if you need more details. Thanks.

Nathan

Sebastiaan Huber

Sep 27, 2022, 5:25:43 PM
to aiida...@googlegroups.com
Hi Nathan,


> One of the common issues I see is that it can't upload files to the server. Other times, it seems to have issues downloading files, or thinks that it already has but was unable to complete that process.
How do you determine this?
Do you get this from the status of the job in the `verdi process list` output?
Is it showing "Process was paused because transport task excepted 5 times in a row" or something to that effect?


> For the daemons, I am submitting around 1000 structures at a time, each of which has the workflow I've developed, the base workflow, and the base calcjob node, making it 3000 things for the daemons to keep track of. I believe I had around 15 daemon workers running to keep the jobs going. As I mentioned, they are still up and running when I log back into the same login node.

Have you ever tried simply restarting the daemon when this happens?
If you do `verdi daemon stop && verdi daemon start 15`, do the jobs continue running again?


> As I mentioned, I have a feeling it might be a setting on our server, since it seems like an upload/download issue, but the admins aren't seeing anything on their side. If you have any thoughts on where I could suggest they look, I would appreciate it. Let me know if I didn't provide enough information or if you need more details. Thanks.
What is this server you are referring to? I take it this is the machine that AiiDA is running on.
Is it running on a node of the same cluster that the jobs are getting sent to?
For any of the problematic calcjobs that seem stuck, what does `verdi process report` return?

Regards,

Sebastiaan

Nathan Keilbart

Sep 27, 2022, 5:58:44 PM
to aiidausers
Hello Sebastiaan,

I do have many jobs with 'Pausing after failed transport task: upload_calculation failed 5 times consecutively', as you mentioned. I additionally have jobs that are stuck in QUEUED, and some that I have seen stuck in RUNNING as well. Restarting the daemon was one of the first things I attempted upon logging in and seeing the jobs in that state, thinking the daemon had gotten stuck. It does seem to restart some of the jobs, as can be seen in `verdi daemon logshow`. Perhaps this means they are becoming stuck, and if that's the case I'm not sure what needs to be done about it.

At the lab here we have multiple machines, and I have AiiDA running on a login node of the cluster named Quartz. The simulations are being submitted to another cluster named Lassen, where I am running the NWChem calculations. Here is an example from one of the jobs that hit the maximum of 5 resubmission attempts:

*** 165566: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 6 LOG MESSAGES:
+-> ERROR at 2022-09-25 01:38:25.295192+00:00
 | Traceback (most recent call last):
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 110, in request_transport
 |     yield transport_request.future
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
 |     execmanager.upload_calculation(node, transport, calc_info, folder)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/daemon/execmanager.py", line 105, in upload_calculation
 |     remote_user = transport.whoami()
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 727, in whoami
 |     retval, username, stderr = self.exec_command_wait(command)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 443, in exec_command_wait
 |     retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1394, in exec_command_wait_bytes
 |     ssh_stdin, stdout, stderr, channel = self._exec_command_internal(command, combine_stderr, bufsize=bufsize)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1354, in _exec_command_internal
 |     channel = self.sshclient.get_transport().open_session()
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 583, in sshclient
 |     raise TransportInternalError('Error, ssh method called for SshTransport without opening the channel first')
 | aiida.transports.transport.TransportInternalError: Error, ssh method called for SshTransport without opening the channel first
 |
 | During handling of the above exception, another exception occurred:
 |
 | Traceback (most recent call last):
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 79, in do_upload
 |     with transport_queue.request_transport(authinfo) as request:
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/contextlib.py", line 153, in __exit__
 |     self.gen.throw(typ, value, traceback)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 126, in request_transport
 |     transport_request.future.result().close()
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 572, in close
 |     raise InvalidOperation('Cannot close the transport: it is already closed')
 | aiida.common.exceptions.InvalidOperation: Cannot close the transport: it is already closed
+-> ERROR at 2022-09-25 01:38:45.463782+00:00
 | [same traceback as above]
+-> ERROR at 2022-09-25 01:39:25.677589+00:00
 | [same traceback as above]
+-> ERROR at 2022-09-25 01:40:45.937048+00:00
 | [same traceback as above]
+-> ERROR at 2022-09-25 01:43:26.203834+00:00
 | [same traceback as above]
+-> WARNING at 2022-09-25 01:43:26.210103+00:00
 | maximum attempts 5 of calling do_upload, exceeded
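As an aside, the retry timestamps above follow a doubling backoff: +20 s, +40 s, +80 s, +160 s between attempts. A small sketch (hypothetical, not AiiDA's actual code) that reproduces the schedule, assuming a 20-second base interval:

```shell
# Print the retry offsets for 5 attempts with a doubling wait, starting
# from an assumed 20 s base interval (inferred from the log timestamps).
interval=20
offset=0
for attempt in 1 2 3 4 5; do
    echo "attempt ${attempt} at +${offset}s"
    offset=$(( offset + interval ))
    interval=$(( interval * 2 ))
done
```

This yields offsets +0, +20, +60, +140 and +300 s, matching (to within a second) the ERROR entries at 01:38:25, 01:38:45, 01:39:25, 01:40:45 and 01:43:26.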

Please let me know what other information I can provide. Thanks.

Nathan

Jens Bröder

Sep 28, 2022, 5:35:33 AM
to aiida...@googlegroups.com

Hi Nathan,

If I understood correctly, some of your jobs did run, so at least some of your transport tasks succeed, right?

What RabbitMQ version did you set up? Are you on a version <= 3.8.14, or have you configured the `consumer_timeout`?
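For readers hitting this later: RabbitMQ versions newer than 3.8.14 enable a default `consumer_timeout` on the order of tens of minutes, after which unacknowledged tasks are dropped, which breaks long-running AiiDA processes. On such versions the timeout can be raised in `rabbitmq.conf` (value in milliseconds); a hypothetical fragment:

```
# Hypothetical rabbitmq.conf fragment for RabbitMQ >= 3.8.15:
# raise the consumer acknowledgement timeout (in milliseconds).
consumer_timeout = 36000000
```

Here 36000000 ms = 10 hours; the appropriate value depends on how long your longest-running process may stay unacknowledged.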

Best, Jens

-- 
----------------------------------------------------------
Dr. Jens Bröder
Research fellow, Data Steward
Helmholtz-Metadata Collaboration | Hub Information (IAS-1 project)
Institute for Advanced Simulation (IAS-9), Forschungszentrum Jülich
j.br...@fz-juelich.de  0241 92780348
https://helmholtz-metadaten.de/
----------------------------------------------------------

Sebastiaan Huber

Sep 28, 2022, 5:48:25 AM
to aiida...@googlegroups.com
Hi Nathan,

It seems like your daemon workers at some point have difficulty reaching the server to which the calculation jobs are to be submitted.
What does the authentication to those servers look like?
Is this just a normal configuration where you access the compute nodes with an SSH key from the login node where AiiDA is running?
I am not quite sure why that should stop working; could it be a setting on the server side that invalidates a session after a certain time?

If this is the case, simply restarting the daemon workers should work just fine, i.e., running `verdi daemon restart` should be sufficient.
Once the daemon workers restart, they should reinitialize the SSH connections to the various compute nodes.
You can then simply run `verdi process play --all` to resume all paused jobs.
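If idle SSH sessions are indeed being dropped by a firewall, client-side keepalives sometimes help. A hypothetical `~/.ssh/config` fragment on the machine running AiiDA (the host patterns are illustrative):

```
# Send an application-level keepalive every 60 s; drop the connection
# only after 5 consecutive unanswered probes.
Host lassen* quartz*
    ServerAliveInterval 60
    ServerAliveCountMax 5
```

Whether this addresses the problem depends on what is actually closing the sessions, so treat it as one thing to try, not a definitive fix.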

I don't think this is a problem with RabbitMQ, since that typically has different symptoms.
In that case we don't see problems with the transport; rather, jobs in AiiDA simply no longer advance because the daemon workers have lost the task and are no longer running it.
That doesn't seem to be the case here, at least from what I can surmise from the exception logs you shared.

Regards,

Sebastiaan

Nathan Keilbart

Sep 30, 2022, 2:00:14 PM
to aiidausers
Hello Sebastiaan,

Sorry for the delayed response.

From what I understand of the SSH connection, there is no key; once we are behind the initial firewall of the server that AiiDA is running on, it can SSH between the different servers without a password. I do know that I get kicked from servers every so often due to inactivity, so perhaps that is what's happening, or there is a maximum session time. If you think this is the case, a short-term workaround might be to leave a script running that restarts the daemon every so many hours to keep the connections fresh.
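A minimal sketch of such a workaround script, assuming it runs on the login node inside `screen`; the function name and defaults are illustrative, not part of AiiDA:

```shell
# Hypothetical workaround loop: periodically restart the AiiDA daemon and
# resume paused processes, so that fresh SSH transports are opened.
keep_daemon_fresh() {
    interval=${1:-7200}   # seconds between restarts (default: 2 h)
    cycles=${2:-0}        # number of iterations; 0 means loop forever
    n=0
    while :; do
        verdi daemon restart        # restart all daemon workers
        verdi process play --all    # resume calcjobs paused by failed transports
        n=$(( n + 1 ))
        if [ "$cycles" -gt 0 ] && [ "$n" -ge "$cycles" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

For example, `keep_daemon_fresh 7200` in a `screen` session restarts the daemon every two hours until interrupted.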

As for RabbitMQ, I was attempting to log into the dashboard they give us to access that information, but I am unable to log in at this time and am waiting for them to get back to me. I also have processes stuck in a QUEUED or RUNNING state that the daemon does not pick up when restarted, so perhaps there is an issue there as well? Let me know what information I can provide on that side.

Nathan

Nathan Keilbart

Oct 6, 2022, 1:04:56 PM
to aiidausers
An update on this: I have written a script that restarts the daemon every 2-3 hours, and I leave it running in a persistently active terminal. This appeared to work for a while, and I was getting decent throughput. This morning, however, I came back to find quite a few jobs with the status saying the transport task had failed five times. After `verdi process play --all` they appear to try to start over, but I'm not seeing any daemon activity in `verdi daemon logshow`. Checking the report for one of the jobs gives the following traceback:

+-> ERROR at 2022-10-06 16:49:34.185045+00:00

 | Traceback (most recent call last):
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 189, in do_update
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/contextlib.py", line 135, in __enter__
 |     return next(self.gen)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/manager.py", line 286, in request_job_info_update
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/contextlib.py", line 135, in __enter__
 |     return next(self.gen)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/manager.py", line 167, in request_job_info_update
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/manager.py", line 195, in _ensure_updating
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/manager.py", line 230, in _get_next_update_delay
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/manager.py", line 79, in get_minimum_update_interval
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/orm/authinfos.py", line 87, in computer
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/storage/psql_dos/orm/authinfos.py", line 74, in computer
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/storage/psql_dos/orm/utils.py", line 84, in __getattr__
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/storage/psql_dos/orm/utils.py", line 110, in is_saved
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/sqlalchemy/orm/attributes.py", line 482, in __get__
 |     return self.impl.get(state, dict_)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/sqlalchemy/orm/attributes.py", line 942, in get
 |     value = self._fire_loader_callables(state, key, passive)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/sqlalchemy/orm/attributes.py", line 973, in _fire_loader_callables
 |     return state._load_expired(state, passive)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/sqlalchemy/orm/state.py", line 712, in _load_expired
 |     self.manager.expired_attribute_loader(self, toload, passive)
 |   File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/sqlalchemy/orm/loading.py", line 1369, in load_scalar_attributes
 |     raise orm_exc.DetachedInstanceError(
 | sqlalchemy.orm.exc.DetachedInstanceError: Instance <DbAuthInfo at 0x2aabd381c130> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)

It seems these jobs have become "detached", as the link at the bottom says? I'm not quite sure what that means or what to do about it. Thanks for any suggestions.

Nathan

Nathan Keilbart

Oct 6, 2022, 1:29:02 PM
to aiidausers
Sorry I took a while to answer regarding the RabbitMQ version: I have v3.8.9 running, and I'm not sure about the configuration. We have a tool here that simply starts a RabbitMQ instance for us, and I don't believe we have access to its settings.