Hello Sebastiaan,
I do have many jobs that show 'Pausing after failed transport task: upload_calculation failed 5 times consecutively', as you mentioned. I also have jobs that are stuck in QUEUE, and I have seen some stuck in RUNNING as well. Restarting the daemon was one of the first things I tried when I logged in and saw the jobs in that state, since I thought the daemon had gotten stuck. It does seem to restart some of the jobs, as can be seen in `verdi daemon logshow`. Perhaps this means they are becoming stuck again, and if that's the case I'm not sure what needs to be done about it.
At the lab here we have multiple machines, and I have AiiDA running on a login node of the server named Quartz. The simulations are submitted to another server named Lassen, where I am running the NWChem simulations. Here is an example from one of the jobs that hit the maximum of 5 attempts.
*** 165566: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 6 LOG MESSAGES:
+-> ERROR at 2022-09-25 01:38:25.295192+00:00
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 110, in request_transport
| yield transport_request.future
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
| execmanager.upload_calculation(node, transport, calc_info, folder)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/daemon/execmanager.py", line 105, in upload_calculation
| remote_user = transport.whoami()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 727, in whoami
| retval, username, stderr = self.exec_command_wait(command)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 443, in exec_command_wait
| retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1394, in exec_command_wait_bytes
| ssh_stdin, stdout, stderr, channel = self._exec_command_internal(command, combine_stderr, bufsize=bufsize)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1354, in _exec_command_internal
| channel = self.sshclient.get_transport().open_session()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 583, in sshclient
| raise TransportInternalError('Error, ssh method called for SshTransport without opening the channel first')
| aiida.transports.transport.TransportInternalError: Error, ssh method called for SshTransport without opening the channel first
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
| result = await coro()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 79, in do_upload
| with transport_queue.request_transport(authinfo) as request:
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/contextlib.py", line 153, in __exit__
| self.gen.throw(typ, value, traceback)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 126, in request_transport
| transport_request.future.result().close()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 572, in close
| raise InvalidOperation('Cannot close the transport: it is already closed')
| aiida.common.exceptions.InvalidOperation: Cannot close the transport: it is already closed
+-> ERROR at 2022-09-25 01:38:45.463782+00:00
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 110, in request_transport
| yield transport_request.future
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
| execmanager.upload_calculation(node, transport, calc_info, folder)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/daemon/execmanager.py", line 105, in upload_calculation
| remote_user = transport.whoami()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 727, in whoami
| retval, username, stderr = self.exec_command_wait(command)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 443, in exec_command_wait
| retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1394, in exec_command_wait_bytes
| ssh_stdin, stdout, stderr, channel = self._exec_command_internal(command, combine_stderr, bufsize=bufsize)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1354, in _exec_command_internal
| channel = self.sshclient.get_transport().open_session()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 583, in sshclient
| raise TransportInternalError('Error, ssh method called for SshTransport without opening the channel first')
| aiida.transports.transport.TransportInternalError: Error, ssh method called for SshTransport without opening the channel first
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
| result = await coro()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 79, in do_upload
| with transport_queue.request_transport(authinfo) as request:
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/contextlib.py", line 153, in __exit__
| self.gen.throw(typ, value, traceback)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 126, in request_transport
| transport_request.future.result().close()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 572, in close
| raise InvalidOperation('Cannot close the transport: it is already closed')
| aiida.common.exceptions.InvalidOperation: Cannot close the transport: it is already closed
+-> ERROR at 2022-09-25 01:39:25.677589+00:00
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 110, in request_transport
| yield transport_request.future
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
| execmanager.upload_calculation(node, transport, calc_info, folder)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/daemon/execmanager.py", line 105, in upload_calculation
| remote_user = transport.whoami()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 727, in whoami
| retval, username, stderr = self.exec_command_wait(command)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 443, in exec_command_wait
| retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1394, in exec_command_wait_bytes
| ssh_stdin, stdout, stderr, channel = self._exec_command_internal(command, combine_stderr, bufsize=bufsize)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1354, in _exec_command_internal
| channel = self.sshclient.get_transport().open_session()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 583, in sshclient
| raise TransportInternalError('Error, ssh method called for SshTransport without opening the channel first')
| aiida.transports.transport.TransportInternalError: Error, ssh method called for SshTransport without opening the channel first
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
| result = await coro()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 79, in do_upload
| with transport_queue.request_transport(authinfo) as request:
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/contextlib.py", line 153, in __exit__
| self.gen.throw(typ, value, traceback)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 126, in request_transport
| transport_request.future.result().close()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 572, in close
| raise InvalidOperation('Cannot close the transport: it is already closed')
| aiida.common.exceptions.InvalidOperation: Cannot close the transport: it is already closed
+-> ERROR at 2022-09-25 01:40:45.937048+00:00
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 110, in request_transport
| yield transport_request.future
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
| execmanager.upload_calculation(node, transport, calc_info, folder)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/daemon/execmanager.py", line 105, in upload_calculation
| remote_user = transport.whoami()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 727, in whoami
| retval, username, stderr = self.exec_command_wait(command)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 443, in exec_command_wait
| retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1394, in exec_command_wait_bytes
| ssh_stdin, stdout, stderr, channel = self._exec_command_internal(command, combine_stderr, bufsize=bufsize)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1354, in _exec_command_internal
| channel = self.sshclient.get_transport().open_session()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 583, in sshclient
| raise TransportInternalError('Error, ssh method called for SshTransport without opening the channel first')
| aiida.transports.transport.TransportInternalError: Error, ssh method called for SshTransport without opening the channel first
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
| result = await coro()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 79, in do_upload
| with transport_queue.request_transport(authinfo) as request:
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/contextlib.py", line 153, in __exit__
| self.gen.throw(typ, value, traceback)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 126, in request_transport
| transport_request.future.result().close()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 572, in close
| raise InvalidOperation('Cannot close the transport: it is already closed')
| aiida.common.exceptions.InvalidOperation: Cannot close the transport: it is already closed
+-> ERROR at 2022-09-25 01:43:26.203834+00:00
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 110, in request_transport
| yield transport_request.future
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
| execmanager.upload_calculation(node, transport, calc_info, folder)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/daemon/execmanager.py", line 105, in upload_calculation
| remote_user = transport.whoami()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 727, in whoami
| retval, username, stderr = self.exec_command_wait(command)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/transport.py", line 443, in exec_command_wait
| retval, stdout_bytes, stderr_bytes = self.exec_command_wait_bytes(command=command, stdin=stdin, **kwargs)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1394, in exec_command_wait_bytes
| ssh_stdin, stdout, stderr, channel = self._exec_command_internal(command, combine_stderr, bufsize=bufsize)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 1354, in _exec_command_internal
| channel = self.sshclient.get_transport().open_session()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 583, in sshclient
| raise TransportInternalError('Error, ssh method called for SshTransport without opening the channel first')
| aiida.transports.transport.TransportInternalError: Error, ssh method called for SshTransport without opening the channel first
|
| During handling of the above exception, another exception occurred:
|
| Traceback (most recent call last):
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
| result = await coro()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 79, in do_upload
| with transport_queue.request_transport(authinfo) as request:
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/contextlib.py", line 153, in __exit__
| self.gen.throw(typ, value, traceback)
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/engine/transports.py", line 126, in request_transport
| transport_request.future.result().close()
| File "/usr/workspace/keilbart/envs/aiida/lib/python3.10/site-packages/aiida/transports/plugins/ssh.py", line 572, in close
| raise InvalidOperation('Cannot close the transport: it is already closed')
| aiida.common.exceptions.InvalidOperation: Cannot close the transport: it is already closed
+-> WARNING at 2022-09-25 01:43:26.210103+00:00
| maximum attempts 5 of calling do_upload, exceeded
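
As a side note on reading the log above: the five errors are spaced roughly 20 s, 40 s, 80 s and 160 s apart, which is what a retry that doubles its wait after each failure would produce. The following is only a minimal sketch of that pattern, not AiiDA's actual `exponential_backoff_retry` code. The names `retry_with_backoff`, `coro_factory` and `sleep` are hypothetical, and the 20 s initial interval is an assumption inferred from the timestamps.

```python
import asyncio


async def retry_with_backoff(coro_factory, initial_interval=20.0,
                             max_attempts=5, sleep=asyncio.sleep):
    """Call coro_factory() until it succeeds or max_attempts is reached.

    Hypothetical helper for illustration only. The wait doubles after
    every failed attempt; the last failure is re-raised to the caller.
    """
    interval = initial_interval  # assumed starting wait, in seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the caller decide what to do
                # (in the log above, the process is paused at this point).
                raise
            await sleep(interval)
            interval *= 2  # 20 s, 40 s, 80 s, 160 s, ...


def demo_waits():
    """Record the waits for a task that always fails, without really sleeping."""
    waits = []

    async def fake_sleep(seconds):
        waits.append(seconds)

    async def always_fail():
        raise RuntimeError("transport is closed")

    try:
        asyncio.run(retry_with_backoff(always_fail, sleep=fake_sleep))
    except RuntimeError:
        pass  # expected after the fifth failed attempt
    return waits
```

Under these assumptions `demo_waits()` records four waits between five attempts, which matches the number and spacing of the ERROR entries before the final WARNING. Since every attempt fails with the same "already closed" transport error, waiting longer between attempts would probably not help by itself.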
Please let me know what other information I can provide. Thanks.
Nathan