paramiko transport problems for remote cluster

Michael Wolloch

Jun 17, 2021, 12:42:27 PM
to aiidausers
Hi all,

I am just getting back to working with AiiDA after some years and a change of institution.

While I had no problems with the installation and trying out some stuff on my local workstation, I am running into problems setting up a connection to my supercomputer cluster.
Here are the versions of the relevant packages that I am using in my conda environment:
# Name                    Version                   Build  Channel
aiida-core                1.6.3              pyh6c4a22f_1    conda-forge
aiida-pseudo              0.6.2                    pypi_0    pypi
aiida-quantumespresso     3.4.2                    pypi_0    pypi
aiida-vasp                2.1.0                    pypi_0    pypi
paramiko                  2.7.2                      py_0

Here is my computer and its configuration:
(aiida) hodor:testdir> verdi computer show VSC4
--------------  ------------------------------------------------------------------------
Label           VSC4
PK              5
UUID            820dc927-983e-43cf-99e6-f196c59c2524
Description     VSC4 48c 96GB RAM
Hostname        l43.vsc.ac.at
Transport type  ssh
Scheduler type  slurm
Work directory  /gpfs/data/fs71411/mwo3/WORK/AiiDA
Shebang         #!/usr/bin/bash
Mpirun command  mpirun -np {tot_num_mpiprocs}
Prepend text    module purge; module load intel/19.0.4 intel-mpi/2019.4 intel-mkl/2019.4
Append text     module purge
--------------  ------------------------------------------------------------------------
(aiida) hodor:testdir> verdi computer configure show VSC4
* username               mwo3
* port                   27
* look_for_keys          True
* key_filename
* timeout                60
* allow_agent            True
* proxy_command
* compress               True
* gss_auth               False
* gss_kex                False
* gss_deleg_creds        False
* gss_host               l43.vsc.ac.at
* load_system_host_keys  True
* key_policy             AutoAddPolicy
* use_login_shell        True
* safe_interval          30.0


and the corresponding ssh config file:
Host VSC4
    HostName l43.vsc.ac.at
    User mwo3
    Port 27
    IdentityFile ~/.ssh/id_rsa


I tried to use this both with ssh and sftp as suggested in the documentation, and it worked fine, so I guess there is nothing in my .bashrc that produces output:
(aiida) hodor:testdir> ssh VSC4 whoami
mwo3
(aiida) hodor:testdir> sftp VSC4
Connected to VSC4.
sftp> version
SFTP protocol version 3
sftp> exit
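
Note that `ssh VSC4 whoami` prints the user name in any case, so on its own it cannot show that the login shell is silent. A stricter check runs a command that prints nothing, such as the `echo -n` that appears in the `verdi computer test` traceback, and reports anything that comes back. The following is only a minimal sketch, assuming the system `ssh` client and the `VSC4` alias from the config file above; `describe_spurious_output` and `check_silent_login` are hypothetical helper names, not part of AiiDA:

```python
import subprocess


def describe_spurious_output(stdout, stderr):
    """Return a short description of unexpected output, or None if silent."""
    problems = []
    if stdout:
        problems.append("stdout: {!r}".format(stdout))
    if stderr:
        problems.append("stderr: {!r}".format(stderr))
    return "; ".join(problems) or None


def check_silent_login(alias, timeout=60):
    """Run a command that prints nothing and report anything that appears.

    `alias` is a host alias from ~/.ssh/config (assumption: key-based
    login works without a prompt, as in the ssh test above).
    """
    result = subprocess.run(
        ["ssh", alias, "echo -n"],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.returncode, describe_spurious_output(result.stdout, result.stderr)
```

Calling `check_silent_login("VSC4")` goes through the system ssh client rather than paramiko, so a clean result only rules out noisy shell start-up files; it says nothing about how paramiko opens its channels.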


Now this is what I get when I test the computer. The connection works, but 4 out of 5 tests fail:
(aiida) hodor:testdir> verdi computer test VSC4
Info: Testing computer<VSC4> for user<michael...@univie.ac.at>...
* Opening connection... [OK]
* Checking for spurious output... 06/17/2021 05:18:59 PM <19336> paramiko.transport: [ERROR] Secsh channel 1 open FAILED: open failed: Connect failed
[Failed]: ChannelException: ChannelException(2, 'Connect failed')
  Use the `--print-traceback` option to see the full traceback.
* Getting number of jobs from scheduler... 06/17/2021 05:18:59 PM <19336> paramiko.transport: [ERROR] Secsh channel 2 open FAILED: open failed: Connect failed
[Failed]: ChannelException: ChannelException(2, 'Connect failed')
  Use the `--print-traceback` option to see the full traceback.
* Determining remote user name... 06/17/2021 05:18:59 PM <19336> paramiko.transport: [ERROR] Secsh channel 3 open FAILED: open failed: Connect failed
[Failed]: ChannelException: ChannelException(2, 'Connect failed')
  Use the `--print-traceback` option to see the full traceback.
* Creating and deleting temporary file... 06/17/2021 05:18:59 PM <19336> paramiko.transport: [ERROR] Secsh channel 4 open FAILED: open failed: Connect failed
[Failed]: ChannelException: ChannelException(2, 'Connect failed')
  Use the `--print-traceback` option to see the full traceback.
Warning: 4 out of 5 tests failed


Here is the output with traceback printing (up to the first test, the rest is equivalent):
(aiida) hodor:testdir> verdi computer test VSC4 --print-traceback
Info: Testing computer<VSC4> for user<michael...@univie.ac.at>...
* Opening connection... [OK]
* Checking for spurious output... 06/17/2021 05:20:59 PM <19480> paramiko.transport: [ERROR] Secsh channel 1 open FAILED: open failed: Connect failed
[Failed]: ChannelException: ChannelException(2, 'Connect failed')
  Full traceback:
  Traceback (most recent call last):
    File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/aiida/cmdline/commands/cmd_computer.py", line 485, in computer_test
      success, message = test(transport=transport, scheduler=scheduler, authinfo=authinfo)
    File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/aiida/cmdline/commands/cmd_computer.py", line 74, in _computer_test_no_unexpected_output
      retval, stdout, stderr = transport.exec_command_wait('echo -n')
    File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/aiida/transports/plugins/ssh.py", line 1300, in exec_command_wait
      ssh_stdin, stdout, stderr, channel = self._exec_command_internal(command, combine_stderr, bufsize=bufsize)
    File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/aiida/transports/plugins/ssh.py", line 1263, in _exec_command_internal
      channel = self.sshclient.get_transport().open_session()
    File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/transport.py", line 875, in open_session
      return self.open_channel(
    File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/transport.py", line 1017, in open_channel
      raise e
  paramiko.ssh_exception.ChannelException: ChannelException(2, 'Connect failed')
* Getting number of jobs from scheduler... 06/17/2021 05:20:59 PM <19480> paramiko.transport: [ERROR] Secsh channel 2 open FAILED: open failed: Connect failed


I have never worked with paramiko, but from the traceback it seems that the Transport method is problematic. I wrote a small Python script which shows that a normal paramiko SSH session works, but the Transport fails when I use SSH keys, while it works (on a different port) when I use my password. (The modified squeue command checks that Slurm also works by listing all of my jobs.)
import paramiko

host = "l43.vsc.ac.at"
user = "mwo3"
port = 27
squeue = 'date && squeue -u $USER  --format="%.18i %.7p %.15j %.8u %.2t %.10M %.6D %R %V %E"'

k = paramiko.RSAKey.from_private_key_file("/fs/home/wolloch/.ssh/id_rsa")
c = paramiko.SSHClient()
c.set_missing_host_key_policy(paramiko.AutoAddPolicy())

print('Try ssh client:')
c.connect( hostname = host, username = user, port = port, pkey = k )
stdin , stdout, stderr = c.exec_command(squeue)
print(stdout.read())
c.close()

print('\nTry transport with password:')
trans = paramiko.Transport((host, 22))
trans.connect(username=user, password="<password>")  # placeholder: insert the real password
sftp = paramiko.SFTPClient.from_transport(trans)
filepath = '/home/fs71411/mwo3/test/test_vsc4.txt'
localpath = '/fs/home/wolloch/aiida/test.txt'
sftp.put(localpath,filepath)
sftp.close()
trans.close()

print('\nTry transport with ssh-key:')
trans = paramiko.Transport((host, port))
trans.connect(username=user, pkey = k)
sftp = paramiko.SFTPClient.from_transport(trans)
filepath = '/home/fs71411/mwo3/test/test_vsc4.txt'
localpath = '/fs/home/wolloch/aiida/test.txt'
sftp.put(localpath,filepath)
sftp.close()
trans.close()

This is the output of the script (I found no verbose sftp.put option, but I checked that the file transfer worked in the password case. The job list is correctly empty as well):
(aiida) hodor:aiida> python paramiko_test.py
Try ssh client:
b'Thu Jun 17 17:49:17 CEST 2021\n             JOBID PRIORIT            NAME     USER ST       TIME  NODES NODELIST(REASON) SUBMIT_TIME DEPENDENCY\n'

Try transport with password:

Try transport with ssh-key:
Oops, unhandled type 3 ('unimplemented')
Traceback (most recent call last):
  File "paramiko_test.py", line 39, in <module>
    sftp = paramiko.SFTPClient.from_transport(trans)
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/sftp_client.py", line 164, in from_transport
    chan = t.open_session(
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/transport.py", line 875, in open_session
    return self.open_channel(
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/transport.py", line 1006, in open_channel
    raise e
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/transport.py", line 2055, in run
    ptype, m = self.packetizer.read_message()
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/packet.py", line 459, in read_message
    header = self.read_all(self.__block_size_in, check_rekey=True)
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/packet.py", line 303, in read_all
    raise EOFError()
EOFError
Exception ignored in: <function BufferedFile.__del__ at 0x14aaf7c08f70>
Traceback (most recent call last):
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/file.py", line 66, in __del__
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/channel.py", line 1392, in close
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/channel.py", line 991, in shutdown_write
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/channel.py", line 967, in shutdown
  File "/home/michael/miniconda3/envs/aiida/lib/python3.8/site-packages/paramiko/transport.py", line 1846, in _send_user_message
AttributeError: 'NoneType' object has no attribute 'time'


I really have no idea whether this last test has any connection to the failed verdi computer test, but I thought you might appreciate the information.

Could this be an issue with my paramiko installation, or maybe my cluster is not allowing something? We do get an OTP sent to our phones every 24 hours when we try to log in, but I had of course already logged in earlier today before running the test.

Any help is appreciated, many thanks,
Michael

Zhu, Bonan

Jun 17, 2021, 1:07:10 PM
to aiida...@googlegroups.com

Hi Michael,

 

So what type of authentication does your cluster use? For a normal session from your shell, do you need the public key, or a password (OTP) or both?

 

From your detailed tests you showed that using the password works fine, but using the private key does not. AiiDA's stock SshTransport only supports the PublicKey authentication method.

 

I think you are on the right track doing the tests with paramiko. Unfortunately I am not an expert, but in some cases I found myself extending its classes to implement the authentication sequence for a specific computer. Once you figure out how to do it, you can make a transport plugin with a specialised version of the transport (probably as a subclass of SshTransport), and then tell AiiDA to use it instead when setting up the computer.

 

Best wishes,

Bonan

--
AiiDA is supported by the NCCR MARVEL (http://nccr-marvel.ch/), funded by the Swiss National Science Foundation, and by the European H2020 MaX Centre of Excellence (http://www.max-centre.eu/).
 
Before posting your first question, please see the posting guidelines at http://www.aiida.net/?page_id=356 .
---
You received this message because you are subscribed to the Google Groups "aiidausers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aiidausers+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aiidausers/a9a7cef9-c94f-4676-8f93-fb99a068071cn%40googlegroups.com.

Michael Wolloch

Jun 17, 2021, 1:44:26 PM
to aiida...@googlegroups.com

Hi Bonan,

thanks for the quick answer!

Usually I log in via PublicKey. If I use the correct port (27), this works very well. For each fresh login from a new machine an OTP is sent to my phone, and additionally the connection has to be made from a university domain (today I am connecting from my office, so that is a given; when at home, I use a VPN). After entering the OTP, all subsequent logins from that machine are completed without another OTP for 24 hours. I just checked, and restarting the machine does not reset this: I was still able to log in with ssh VSC4, using the ssh config file presented previously, without needing another OTP. So I do not think that it is an issue with the OTP, although that could be a problem when actually running high-throughput production code around the clock. For this it would probably be better to run AiiDA on one of the login nodes of the cluster itself in a conda environment, but it is too early to worry about that.

I am not sure if I want to tweak the authentication sequence, especially as I am not sure what is even needed to succeed. I guess I will do some more testing with paramiko.Transport and try to find out what is the matter with this Oops, unhandled type 3 ('unimplemented') error that I get in my test script. That only makes sense, however, once I know what AiiDA is actually trying to do. I guess the best way would be to look into aiida/transports/plugins/ssh.py and go from there?

Thanks again for the quick response, maybe someone else has some ideas as well, and if I find out anything about this, I will post it of course.

Cheers, Michael

--
Dr. Michael Wolloch
Computational Materials Physics
Faculty of Physics
University of Vienna
Kolingasse 14-16, Room 3.64, 1090 Vienna
michael...@univie.ac.at
Tel: +43 1 4277 73316

Giovanni Pizzi

Jun 18, 2021, 2:25:23 AM
to AiiDA users mailing list
Hi Michael,

some quick pointers:
- The last issues you see might (?) be an issue of paramiko, but unrelated. See e.g. this:
(and apparently adding a sleep at the end of your script might help?)
- The good thing is that you are able to connect with SSH in your last test so paramiko *should* be able to connect in some way
- The main error you have is the "ChannelException: ChannelException(2, 'Connect failed')".
When paramiko connects, it will try to run every command in a different channel, multiplexed over the same connection.
Maybe one reason is that the server is highly limiting the number of possible channels?
You could check the value of MaxSessions in the sshd_config of your server, or ask your admins if this is the problem.
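
The channel-limit hypothesis can also be checked from the client side, without access to sshd_config, by opening sessions one after another on a single connection until the server refuses. What follows is only a sketch, assuming paramiko is installed and that key-based login through `SSHClient.connect()` works; `count_open_sessions` and `probe_host` are hypothetical helper names, not part of paramiko or AiiDA:

```python
def count_open_sessions(transport, upper_bound=10):
    """Open sessions on one connection until the server refuses.

    Returns (number_opened, error); error is None if upper_bound
    sessions could be held open at the same time.
    """
    channels = []
    error = None
    try:
        for _ in range(upper_bound):
            # Each session is a separate channel multiplexed over the
            # same SSH connection.
            channels.append(transport.open_session())
    except Exception as exc:  # paramiko raises ChannelException on refusal
        error = exc
    finally:
        for channel in channels:
            channel.close()
    return len(channels), error


def probe_host(hostname, username, port, keyfile):
    """Connect with a private key and report how many sessions fit."""
    import paramiko  # imported lazily so the helper above has no dependency

    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=hostname, username=username, port=port,
                   key_filename=keyfile)
    try:
        return count_open_sessions(client.get_transport())
    finally:
        client.close()
```

If the count returned by `probe_host(...)` is smaller than `upper_bound` and the error is a `ChannelException(2, 'Connect failed')`, that would point towards a per-connection session limit on the server.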

I'm not sure this really helps, but maybe it gives you some pointers.
One more resource are the two "recipes" in our "cookbook" in the documentation:
They are a stripped-down version of what AiiDA does when connecting.
You can try to start from these and check if you have the same issues, maybe working on these simpler scripts will help.

Best,
Giovanni




Keilbart, Nathan Daniel

Jun 21, 2021, 3:22:08 AM
to aiida...@googlegroups.com

Hello Michael,

 

I am by no means an expert on this part, but I have successfully installed AiiDA on our supercomputer at LLNL. What this entailed, though, was installing AiiDA on a server behind our firewall, which is then able to ssh to the other servers without a password. Essentially, I have to log in to this specific server to work with AiiDA, which has been working well for me. Additionally, the Postgres database and the RabbitMQ service are hosted somewhere else and I have to provide the information to connect. If you were to attempt this setup you would need to get your university to host a database and a RabbitMQ server. I am not sure if that is something they would be willing to do.

 

I also briefly used AiiDA back at Penn State which has 2FA that could be approved through an application. I had AiiDA installed on my workstation and could simply leave it running to check on the simulations while periodically receiving the 2FA push. This would need to be done once a day as you have mentioned. I didn’t get as much experience from that as I graduated soon after.

 

I hope that was helpful and let me know if you have any other questions.

 

Nathan

 

-----------------------------------------------------------------------------------

Nathan Keilbart, PhD

Postdoctoral Research Scientist, Quantum Simulations Group

Lawrence Livermore National Laboratory

(925) 423-6620

-----------------------------------------------------------------------------------

Michael Wolloch

Jun 21, 2021, 5:14:22 AM
to aiida...@googlegroups.com
Hi Nathan, hi Giovanni,

thanks for your input!

I have not been able to connect yet with a publickey, but I think I made some progress. First, I went through the cookbook example suggested by Giovanni, and checked what kind of transport channel AiiDA is trying to open. From that I got that it is an open session (transport.open_session()) and I found an example for a simple connection test for this channel online somewhere. I adapted that script to test the connection both with a password and the publickey, while also enabling logging on the debug level (something I should have done sooner I guess).
Here is the script if you are interested:

import paramiko
from time import sleep


def transport_test_password(ip, port, user, password, command):
    transport = paramiko.Transport((ip, port))
    try:
        transport.start_client()
        print('transport client started')
    except Exception as e:
        print(e)
    try:
        transport.auth_password(username=user, password=password)
        print('transport authentication did not fail')
        print('is transport authenticated: {}'.format(transport.is_authenticated()))
    except Exception as e:
        print(e)
    if transport.is_authenticated():
        print(transport.getpeername())
        channel = transport.open_session()
        channel.exec_command(command)
        response = channel.recv(1024)
        print('Command %r(%r)-->%s' % (command,user,response))


def transport_test_keyfile(ip, port, user, keyfile, command):
    transport = paramiko.Transport((ip, port))
    try:
        transport.start_client()
        print('transport client started')
    except Exception as e:
        print(e)
    try:
        k = paramiko.RSAKey.from_private_key_file(keyfile)
        transport.auth_publickey(username=user,key=k)
        sleep(1)
        print('transport authentication did not fail')
        print('is transport authenticated: {}'.format(transport.is_authenticated()))
    except Exception as e:
        print(e)
    if transport.is_authenticated():
        print(type(transport.is_authenticated()))
        print(transport.getpeername())
        channel = transport.open_session()
        channel.exec_command(command)
        response = channel.recv(1024)
        print('Command %r(%r)-->%s' % (command,user,response))


host = "vsc4.vsc.ac.at"
user = "mwo3"
keyfile = "/fs/home/wolloch/.ssh/id_rsa"
passwd = "<password>"  # placeholder: insert the real password
command = 'pwd'

paramiko.util.log_to_file("VSC4_test.log", level="DEBUG")

transport_test_password(ip=host, port=22, user=user, command=command, password=passwd)
transport_test_keyfile(ip=host, port=27, user=user, command=command, keyfile=keyfile)


And here is the output:

(aiida) hodor:aiida> python paramiko_test.py

transport client started
transport authentication did not fail
is transport authenticated: True
('193.170.79.54', 22)
Command 'pwd'('mwo3')-->b'/home/fs71411/mwo3\n'
transport client started
transport authentication did not fail
is transport authenticated: False


Apparently there is no exception generated for the authentication even with the publickey, but nevertheless, the transport is not authenticated and the command is not executed of course. (I tried also to put in some time delay with the sleep function, but it did not help). When looking at the log file (which I have attached), some things appear interesting to me:
  1. An older version of OpenSSH is used on the server side for the password authentication than for the publickey one! (7.4 vs 7.8; see line numbers 3,4 and 24,25)
  2. Both methods give the message userauth is OK (LNs: 13 and 35)
  3. This is followed in the password case by Authentication (password) successful! (LN: 14), while for the publickey version two lines follow and then the script finishes without further output to the logfile: Authentication continues...; Methods: ['keyboard-interactive'] (LNs: 36 and 37)
I fear I do not have enough knowledge to get much else out of the logfile, but maybe it gives a hint to some of you. I think the best thing to do now is to contact my cluster administration and ask them for help; maybe they have some insight into the whole thing. If they cannot help, I will definitely try to follow Nathan's advice and run AiiDA on a login node or VM inside the cluster infrastructure.
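
The lines of a paramiko debug log that matter for authentication can be pulled out mechanically rather than by reading the whole file. The sketch below simply filters on the message fragments quoted in the list above; the `Remote version/idstring` fragment is an assumption about how paramiko logs the server banner, and `summarize_auth_log` is a hypothetical helper name:

```python
# Message fragments to look for. All but the first are quoted in this
# thread; "Remote version/idstring" is an assumed paramiko log message.
MARKERS = (
    "Remote version/idstring",
    "userauth is OK",
    "Authentication (",
    "Authentication continues",
    "Methods:",
)


def summarize_auth_log(lines):
    """Return (line_number, text) pairs for lines containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.strip()))
    return hits
```

It could be used as `with open("VSC4_test.log") as handle: print(summarize_auth_log(handle))`. A run that ends on `Authentication continues...` followed by a `Methods:` line means the server accepted the key but still expects one of the listed methods before it considers the session authenticated.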

Thanks to all of you again, and all the best, Michael


VSC4_test.log

Michael Wolloch

Jun 24, 2021, 9:20:37 AM
to aiida...@googlegroups.com

Hi all,

I figured this thing out! In fact I spent a lot of time debugging an issue that was not even there. My example code did not authenticate correctly because I did not use the paramiko.SSHClient.connect() method and then open a transport channel, which is what AiiDA does. Instead I was using paramiko.Transport.auth_publickey(), and that runs into problems, as it turns out because of the OTP, even if there is no need to enter it at that given time. This can be fixed by calling the auth_interactive_dumb() method afterwards, but it is handled anyhow in an elegant way by the connect() method of SSHClient! The real problem was that my cluster limits MaxSessions in the sshd_config to 1, a problem that Giovanni already pointed out some days ago. I cannot access sshd_config on the cluster myself and had to wait for support to answer me, which they did today. I asked them if they can increase this to 10 (would that be enough?), but I have not heard back from them yet and I am a little doubtful that they will make the change just for one user.
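
For anyone who does want to stay on a bare paramiko.Transport, the two-step authentication described above could look roughly like this. It is only a sketch, assuming the server reports 'keyboard-interactive' as the remaining method after the key is accepted (as in the log discussed earlier in the thread); `authenticate_with_key` and `run_over_transport` are hypothetical helper names:

```python
def authenticate_with_key(transport, username, key):
    """Public-key auth on a bare Transport, followed by a
    keyboard-interactive step if the server still asks for one.

    Returns True if the transport ends up authenticated.
    """
    # auth_publickey() returns the list of auth types the server still
    # requires; an empty list means authentication is complete.
    remaining = transport.auth_publickey(username=username, key=key)
    if "keyboard-interactive" in remaining:
        # Prompts on the terminal only if the server actually asks
        # for input (for example a fresh OTP).
        transport.auth_interactive_dumb(username)
    return transport.is_authenticated()


def run_over_transport(hostname, port, username, keyfile, command):
    """Run one command over a Transport that was authenticated by hand."""
    import paramiko  # imported lazily so the helper above has no dependency

    key = paramiko.RSAKey.from_private_key_file(keyfile)
    transport = paramiko.Transport((hostname, port))
    try:
        transport.start_client()
        if not authenticate_with_key(transport, username, key):
            raise RuntimeError("authentication incomplete")
        channel = transport.open_session()
        channel.exec_command(command)
        return channel.recv(1024)
    finally:
        transport.close()
```

The return value of `auth_publickey()` is what was missing from the earlier test script: the call does not raise when the server accepts the key but still wants a second step, so `is_authenticated()` has to be checked explicitly.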

Here is some code that works and closely mimics what AiiDA does, but if I add another channel at the end (by uncommenting the two lines), I get the exact error I see when running verdi computer test VSC4, as expected with only one session allowed per client.

import paramiko

host = "l43.vsc.ac.at"
user = "mwo3"
keyfile = "/fs/home/wolloch/.ssh/id_rsa"
command = 'pwd'

k = paramiko.RSAKey.from_private_key_file(keyfile)
paramiko.util.log_to_file("min_test.log", level="DEBUG")

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.client.AutoAddPolicy())
client.connect(hostname=host, username=user, port=27, pkey=k)
channel = client.get_transport().open_session()
channel.close()
#channel2 = client.get_transport().open_session()
#channel2.close()
client.close()

I guess the lesson for me (other than knowing a lot more about paramiko and ssh connections) is that one should really try to follow the traceback closely, find out what really is going on, and not immediately start with simplified versions of the problem!

Thanks for all the help anyhow, I am very happy that this mailing list is so active and supportive,

Michael
