Workers on the second node abort, while workers on the first node keep running

Amit Juneja

Nov 9, 2018, 7:16:28 PM
to scoop-users
Let me describe the problem at a high level; I can provide more details as needed. I have two hosts, host1 and host2. If host1 is on the first line of the hostfile, workers start on both hosts, but the ones on host2 abort. If I swap the order so that host2 is first and host1 is second, the workers on host1 abort instead. I don't see any error messages, and in both cases I'm launching the process from host1. In other words, workers only run to completion on the first host listed in the hostfile and abort on the other host.

Any ideas?

Derek Tishler

Nov 9, 2018, 10:36:13 PM
to scoop-users
How are you launching the program? Have you tried the verbosity flag (-vv) to get more information about the network error? Have you also tried tunneling (--tunnel) to work around IP routing issues, in case the network is not set up ideally?

ex:
python -m scoop --hostfile hosts -vv --tunnel deap.py
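
To take deap.py out of the picture, it may also help to run a tiny script that only reports where each task executed. This is a minimal sketch, not something from your setup: the file name whoami_scoop.py, the helper names where_am_i and tasks_per_host, and the task count of 32 are all made up for illustration. It falls back to the built-in map if SCOOP cannot be imported, so it also runs on a single machine.

```python
# whoami_scoop.py (hypothetical file name, for illustration only)
import os
import socket
from collections import Counter

try:
    from scoop import futures
    parallel_map = futures.map
except ImportError:
    # Assumption: fall back to the serial built-in map when SCOOP is absent,
    # so the script can still be sanity-checked on one machine.
    parallel_map = map


def where_am_i(task_id):
    """Return the task id plus the host name and PID that executed it."""
    return task_id, socket.gethostname(), os.getpid()


def tasks_per_host(results):
    """Count how many tasks each host completed."""
    return Counter(host for _, host, _ in results)


if __name__ == "__main__":
    # 32 is an arbitrary number of small tasks.
    results = list(parallel_map(where_am_i, range(32)))
    for host, count in sorted(tasks_per_host(results).items()):
        print(f"{host}: {count} tasks")
```

Launched the same way as above (python -m scoop --hostfile hosts -vv whoami_scoop.py), this should list every host that actually completed work. The tasks are very short, so the split between hosts may be uneven, but if only the first host in the hostfile ever shows up, the problem is more likely in the launch or network setup than in deap.py.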