Could not resolve hostname when using SCOOP across nodes

Riter

Jul 7, 2019, 9:30:25 PM
to scoop-users
Hi, I ran into a problem when using the following job file to distribute SCOOP processes across nodes. Could anyone help me work through it? Thank you in advance.

================================
#!/bin/bash
#PJM -L rscgrp=lecture-flat
#PJM -L node=16

#------ Make hostfile for DDN-MVAPICH --------#
HOST_FILE=hostfile.${PJM_JOBID}
for i in `cat ${PJM_O_NODEINF}`; do
   echo ${i}:${PJM_PROC_BY_NODE}
done > ${HOST_FILE}

module load intelpython/3.6
source activate /usrpath/localpy
python -m scoop --hostfile ${HOST_FILE} -vv -n 600 /usrpath/scoopTest/map_doc.py
================================
There are 16 nodes (cat hostfile.258972248)


==============
[2019-07-08 10:15:56,452] launcher  INFO    SCOOP 0.7 1.1 on linux using Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0], API: 1013
[2019-07-08 10:15:56,454] launcher  INFO    Deploying 600 worker(s) over 16 host(s).
[2019-07-08 10:15:56,455] launcher  DEBUG   Using hostname/ip: "172.30.167.199:1" as external broker reference.
[2019-07-08 10:15:56,455] launcher  DEBUG   The python executable to execute the program with is: /usrpath/localpy/bin/python.
[2019-07-08 10:15:56,457] launcher  INFO    Worker distribution: 
[2019-07-08 10:15:56,458] launcher  INFO       172.30.167.199:1:37 + origin
[2019-07-08 10:15:56,458] launcher  INFO       172.30.167.201:1:38 
[2019-07-08 10:15:56,459] launcher  INFO       172.30.167.202:1:38 
[2019-07-08 10:15:56,459] launcher  INFO       172.30.167.203:1:38 
[2019-07-08 10:15:56,460] launcher  INFO       172.30.167.204:1:38 
[2019-07-08 10:15:56,460] launcher  INFO       172.30.167.205:1:38 
[2019-07-08 10:15:56,461] launcher  INFO       172.30.167.206:1:38 
[2019-07-08 10:15:56,461] launcher  INFO       172.30.167.207:1:38 
[2019-07-08 10:15:56,462] launcher  INFO       172.30.167.208:1:37 
[2019-07-08 10:15:56,462] launcher  INFO       172.30.167.209:1:37 
[2019-07-08 10:15:56,462] launcher  INFO       172.30.167.210:1:37 
[2019-07-08 10:15:56,463] launcher  INFO       172.30.167.211:1:37 
[2019-07-08 10:15:56,463] launcher  INFO       172.30.167.212:1:37 
[2019-07-08 10:15:56,464] launcher  INFO       172.30.167.213:1:37 
[2019-07-08 10:15:56,464] launcher  INFO       172.30.167.214:1:37 
[2019-07-08 10:15:56,464] launcher  INFO       172.30.167.215:1:37 
[2019-07-08 10:15:56,465] brokerLaunch DEBUG   Launching remote broker: ssh -x -n -oStrictHostKeyChecking=no 172.30.167.199:1 /work/pm7010/m07010/localpy/bin/python -m scoop.broker.__main__ --echoGroup --echoPorts --backend ZMQ 
ERROR:root:Error while launching SCOOP subprocesses:
ERROR:root:Traceback (most recent call last):
  File "/usrpath/localpy/lib/python3.6/site-packages/scoop/launch/brokerLaunch.py", line 143, in __init__
    self.brokerPort, self.infoPort = ports
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usrpath/localpy/lib/python3.6/site-packages/scoop/launcher.py", line 479, in main
    rootTaskExitCode = thisScoopApp.run()
  File "/usrpath/localpy/lib/python3.6/site-packages/scoop/launcher.py", line 260, in run
    backend=self.backend,
  File "/usrpath/localpy/lib/python3.6/site-packages/scoop/launch/brokerLaunch.py", line 157, in __init__
    "SSH process stderr:\n{stderr}".format(**locals()))
Exception: Could not successfully launch the remote broker.
Requested remote broker ports, received:
b''
Port number decoding error:
not enough values to unpack (expected 2, got 1)
SSH process stderr:
b'ssh: Could not resolve hostname 172.30.167.199:1: Name or service not known\r\n'

Derek Tishler

Jul 7, 2019, 11:04:37 PM
to scoop-users
Are you sure the host file is in the format SCOOP expects and that passwordless ssh is set up? It appears to be getting just the IP address, which possibly conflicts with your use of '-n 600'; I think the worker count is handled by the hostfile format:
https://scoop.readthedocs.io/en/0.7/usage.html#hostfile-format

Each line of my host file looks like:
user@ip N
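
For illustration, here is a minimal Python sketch of the conversion this implies. It assumes the generated hostfile contains lines such as 172.30.167.199:1 (which is what the job script writes and what the log reports as the hostname) and that SCOOP wants the host and the worker count separated by a space, as in the line above. The function names are hypothetical and not part of SCOOP.

```python
def to_scoop_hostfile_line(line: str) -> str:
    """Turn 'host:N' into 'host N'; leave lines without a trailing ':N' alone."""
    entry = line.strip()
    host, sep, count = entry.rpartition(":")
    if sep and host and count.isdigit():
        return f"{host} {count}"
    return entry


def convert_hostfile(text: str) -> str:
    """Convert every non-empty line of a hostfile and return the new content."""
    lines = [to_scoop_hostfile_line(l) for l in text.splitlines() if l.strip()]
    return "".join(l + "\n" for l in lines)


if __name__ == "__main__":
    # Sample lines in the colon-separated form produced by the job script.
    sample = "172.30.167.199:1\n172.30.167.201:1\n"
    print(convert_hostfile(sample), end="")
```

The same effect could probably be had in the job script itself by echoing a space instead of a colon between the host and the count. This sketch does not handle IPv6 addresses.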

Riter

Jul 8, 2019, 12:18:50 AM
to scoop-users
Derek, thank you for your reply.

Without '-n 600', the number after the IP address changes to 68 (the number of physical cores), but the error remains.
I then tried adding "usrname@" before each IP address in the host file. The "Could not resolve hostname" error still remains.

The new command:
python -m scoop --hostfile usrpath/hostfile  -vv usrpath/map_doc.py

The new host file is as follows.

Derek Tishler

Jul 8, 2019, 1:13:20 AM
to scoop-users
Can you ssh from a terminal into each of these user@IP hosts, ideally without a password since SCOOP requires that, without getting the same error? That would rule out a routing issue.

Oh, I also have to use the --tunnel flag after scoop in the launch command; that may be something to try. I think I got that trick from:
https://groups.google.com/forum/#!topic/scoop-users/XwibJVaefi4
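
The passwordless ssh check suggested above can be scripted rather than typed per node. The following is a rough sketch, not something taken from SCOOP: the function names are hypothetical, the hosts would come from your own hostfile, and StrictHostKeyChecking=no is included only because the launcher log shows SCOOP passing it. BatchMode=yes makes ssh fail instead of prompting for a password.

```python
import subprocess
from typing import List


def build_ssh_check_command(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that runs 'true' remotely and never prompts."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # fail rather than ask for a password
        "-o", "StrictHostKeyChecking=no",   # mirrors the option in the SCOOP log
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        "true",
    ]


def can_ssh_without_password(host: str, timeout_s: int = 5) -> bool:
    """Return True if a non-interactive ssh login to host succeeds."""
    try:
        result = subprocess.run(
            build_ssh_check_command(host, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

# Example use (the host is a placeholder; substitute entries from your hostfile):
#     print(can_ssh_without_password("user@172.30.167.199"))
```

A host entry that still carries a ":1" suffix would be expected to fail this check with the same "Could not resolve hostname" message, since ssh treats the whole string as the hostname.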

Riter

Jul 8, 2019, 3:09:09 AM
to scoop-users
Derek, thank you for the reminder. I tested ssh into the nodes and it failed. I am now waiting for a reply from the HPC administrator.

Thank you. 