Hanging Jobs: Peer-to-Peer Communication Issue on Torque/PBS Cluster?


Per Sederberg

Jun 10, 2015, 8:01:36 PM
to scoop...@googlegroups.com
Hi Scoop Devs/Users:

I'm quite excited about all SCOOP has to offer and I've incorporated it thoroughly into my work. I have various scripts that run perfectly well on stand-alone machines and on our cluster when started from the command line; however, they reliably fail when I submit them as batch parallel jobs. I can successfully submit and run small sample code, but when we try to scale up to an actual problem we get hangs as soon as workers spawn additional processes.

The only error that comes up (and it comes up every time) is the following issue about results not being sent directly back to a peer:

[2015-06-10 19:06:04,599] scoopzmq  (10.21.60.18:55198) DEBUG   10.21.60.18:55198: Could not send result directly to peer 10.21.45.1:56132, routing through broker.


Sometimes this error comes up, SCOOP recovers, and the process continues and finishes. Usually, though, as soon as a worker that was spawned with futures.submit makes its own futures.submit or futures.map call, the code executes, we see this error, and the entire process hangs indefinitely. We had a hunch that the workers might be spawning too many subprocesses (we have on the order of 20 to 80 top-level processes and 50 to 100 lower-level processes spawned by each top-level process). Reducing this to 20 top-level and 30 bottom-level processes did complete successfully once, but it still typically hangs.
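For concreteness, the nested-submission pattern looks roughly like the sketch below (function names and the counts are placeholders, not our actual analysis code):

# Rough sketch of the nested pattern (placeholder functions, not the real code).
from scoop import futures

def inner_task(j):
    # lower-level work unit
    return j * j

def outer_task(i):
    # each top-level task spawns its own batch of lower-level futures
    return sum(futures.map(inner_task, range(50)))

if __name__ == '__main__':
    # launched as a PBS batch job with: python -m scoop script.py
    results = list(futures.map(outer_task, range(20)))
    print(results)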


Here is some info about the Python and SCOOP versions, as reported at the top of the log output:


[2015-06-10 19:05:04,622] launcher  INFO    SCOOP 0.7.1 release on linux2 using Python 2.7.9 |Anaconda 2.2.0 (64-bit)| (default, Mar 9 2015, 16:20:48) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)], API: 1013

[2015-06-10 19:05:04,623] launcher  INFO    Detected PBS environment.

[2015-06-10 19:05:04,623] launcher  INFO    Deploying 96 worker(s) over 12 host(s).


As you can see, we're using recent versions of most libraries, and everything seems to work right up until the job spreads out across cluster nodes.


Here are my questions:


1) What causes this "Could not send result directly to peer" error, and how can I fix it?


2) How can I debug what is happening when the process hangs? I'm already using the -vv and --debug options, and there is no info in there other than the error above. I've also tried adding print statements and such, but the only thing I've been able to determine is that it hangs while trying to collect the result (hence the lock-up).


On a related note, whenever the job hangs like this I have to kill it with the batch scheduler (i.e., qdel), which orphans a bunch of zombie processes on the worker nodes. Is SCOOP supposed to clean those up, or is it the user's job to write a script that goes through the worker nodes and kills all the orphaned processes?
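(For now I clean up by hand with something along the lines of the sketch below; the node list from $PBS_NODEFILE and the pkill pattern are assumptions about our particular setup, not anything SCOOP provides.)

# Hand-rolled cleanup sketch: ssh to each node listed in the PBS node file
# and kill leftover worker processes. 'my_scoop_script.py' is a placeholder
# for whatever appears on the orphaned workers' command lines.
import os
import subprocess

nodefile = os.environ.get('PBS_NODEFILE', 'hosts')
with open(nodefile) as f:
    nodes = sorted(set(line.strip() for line in f if line.strip()))

for node in nodes:
    subprocess.call(['ssh', node, 'pkill', '-f', 'my_scoop_script.py'])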


Thanks for any insight into solving this problem!


Best,

Per


Per Sederberg

Jun 10, 2015, 10:54:04 PM
to scoop...@googlegroups.com
Some additional info:

I get the same hangs when running the very simple recurse.py example from SCOOP.
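That example is essentially a tiny recursive-futures test; a minimal sketch in the same spirit (not the exact file shipped with SCOOP) looks like:

# Minimal recursive-futures sketch (not the actual recurse.py from SCOOP).
from scoop import futures

def recurse(depth):
    if depth == 0:
        return 1
    # each call submits another future, so workers have to exchange results
    return futures.submit(recurse, depth - 1).result() + 1

if __name__ == '__main__':
    # run with: python -m scoop recurse_sketch.py
    print(futures.submit(recurse, 5).result())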


I get a list of errors like:

[2015-06-10 22:28:39,780] scoopzmq  (10.21.41.28:58624) DEBUG   10.21.41.28:58624: Could not send result directly to peer 10.21.41.28:49157, routing through broker.


and it just hangs. Eventually I have to kill the job; the workers are left hanging, and I have to ssh into each node and kill each one manually.


I should also note that the ssh keys seem to be set up correctly and I can ssh between all nodes on the cluster without a password.


Any thoughts on why I keep getting this error and hang?


Thanks,

Per

jiangcha...@gmail.com

Jul 16, 2018, 11:53:22 PM
to scoop-users
Excuse me, I'm running into the same problem. Did you ever solve it?

On Thursday, June 11, 2015 at 10:54:04 AM UTC+8, Per Sederberg wrote:

Derek Tishler

Aug 4, 2018, 11:34:23 AM
to scoop-users
Could this be related to the need for externally routable IPs, as mentioned in the docs? I had this issue when using a simple local cluster of PCs, and instead of messing with the routing I went with the --tunnel option, which quickly fixed things, though perhaps not at full performance:

python -m scoop --hostfile hosts --tunnel script.py

I think this is my source; sorry, I'm not a pro, but perhaps it can get you on a workable path:
https://groups.google.com/forum/#!searchin/scoop-users/%22--tunnel%22%7Csort:date/scoop-users/XwibJVaefi4/VE6v3_Ng7CsJ