SSH Authentication failed on cluster with Torque PBS

Arthur-Ervin Avramiea

Jan 12, 2018, 10:12:17 AM
to scoop-users
Hi all,

I'm trying to run a script parallelized using scoop on a cluster (Torque PBS).
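
For context, a rough sketch of the kind of submission script I mean (the resource request, worker count, and script name below are just placeholders):

    #!/bin/bash
    #PBS -l nodes=2:ppn=8
    #PBS -l walltime=01:00:00
    cd $PBS_O_WORKDIR
    # as far as I understand, SCOOP detects the PBS environment, reads the
    # allocated hosts from $PBS_NODEFILE, and then starts its broker and
    # workers on those hosts over SSH
    python -m scoop -n 16 my_script.py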

I have generated the SSH keys in the home directory of my access node on the cluster, and confirmed that they work by copying them to my local computer and using them to log in to the cluster.

The home directory on my access node is a network folder that is shared with the processing nodes, so all nodes should have access to the ssh keys.
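
For reference, this is roughly how I set up the keys (a sketch, assuming the default key name and that ~/.ssh lives on the shared home directory):

    # on the access node; accept the defaults, empty passphrase
    ssh-keygen -t rsa
    # authorize the key for the same (shared) account
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    # sshd ignores these files if the permissions are too open
    chmod 700 ~/.ssh
    chmod 600 ~/.ssh/authorized_keys
    # sanity check: should log in without a password prompt
    ssh localhost hostname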

However, when I submit the job, I get the following error:

ERROR:root:Error while launching SCOOP subprocesses:
ERROR:root:Traceback (most recent call last):
  File "/home/aeavram/.local/lib/python2.7/site-packages/scoop/launcher.py", line 479, in main
    rootTaskExitCode = thisScoopApp.run()
  File "/home/aeavram/.local/lib/python2.7/site-packages/scoop/launcher.py", line 260, in run
    backend=self.backend,
  File "/home/aeavram/.local/lib/python2.7/site-packages/scoop/launch/brokerLaunch.py", line 157, in __init__
    "SSH process stderr:\n{stderr}".format(**locals()))
Exception: Could not successfully launch the remote broker.
Requested remote broker ports, received:
Port number decoding error:
need more than 1 value to unpack
SSH process stderr:
Authentication failed.

INFO:launcherLogger:Finished cleaning spawned subprocesses.

Any tips as to what may be wrong here? Thank you!

Carl Witt

Jun 20, 2018, 10:47:57 AM
to scoop-users
I ran into a similar issue, but for the worker processes and in a SLURM environment. The root cause in both cases seems to be a misconfigured SSH setup. I've double-checked my key setup and it seems to be correct. I suspected SLURM's job containment module of blocking the SSH connections, but it *shouldn't* be a problem: "The purpose of this module is to prevent users from sshing into nodes that they do not have a running job on, [...] This module does this by determining the job which originated the ssh connection. The user's connection is "adopted" into the "external" step of the job."

However, the output looks similar:

[2018-06-20 16:00:15,797] launcher  INFO    SCOOP 0.7 1.1 on linux2 using Python 2.7.5 (default, Aug  4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], API: 1013
[2018-06-20 16:00:15,797] launcher  INFO    Detected SLURM environment.
[2018-06-20 16:00:15,798] launcher  INFO    Deploying 11 worker(s) over 2 host(s).
[2018-06-20 16:00:15,798] launcher  INFO    Worker distribution:
[2018-06-20 16:00:15,798] launcher  INFO       node120: 5 + origin
[2018-06-20 16:00:15,798] launcher  INFO       node121: 5
[2018-06-20 16:00:16,086] workerLaunch (127.0.0.1:35159) WARNING Could not successfully launch the remote worker on node121.
Requested remote group process id, received:
Group id decoding error:
invalid literal for int() with base 10: ''
SSH process stderr:
Authentication failed.
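
What helped me narrow this down was checking whether one allocated node can reach another over SSH at all, outside of SCOOP. A sketch (node names and the allocation request are placeholders):

    # request an allocation spanning two nodes, e.g.
    #   salloc -N 2
    # then, from a shell on the first allocated node, try to reach the
    # second one non-interactively; BatchMode makes ssh fail instead of
    # prompting, which matches how SCOOP launches its remote processes
    ssh -o BatchMode=yes node121 hostname

If that already fails with an authentication or permission error, the problem is in the nodes' SSH/PAM setup rather than in SCOOP itself.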

Carl Witt

Jul 4, 2018, 4:35:49 AM
to scoop-users
Indeed, it turned out to be a configuration issue on the cluster (related to the job containment configuration), so a compute node allocated to my job couldn't ssh into another compute node allocated to the same job.