[2015-06-10 19:06:04,599] scoopzmq (10.21.60.18:55198) DEBUG 10.21.60.18:55198: Could not send result directly to peer 10.21.45.1:56132, routing through broker.
Sometimes this error comes up, SCOOP recovers, and the process continues and finishes. Usually, though, as soon as a worker that was spawned with futures.submit makes its own futures.submit or futures.map call, the code executes, we see this error, and the entire process hangs indefinitely. We had a hunch that the workers were spawning too many subprocesses (we have on the order of 20 to 80 top-level processes, each spawning 50 to 100 lower-level processes), and reducing this to 20 top-level and 30 bottom-level did complete successfully once, but it still typically hangs.
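For reference, here is a minimal sketch of the nested-submission pattern I'm describing (the task names are made up; our real tasks are much heavier). So the shape of the nesting can be run outside the cluster, this sketch falls back to the standard-library thread pool when SCOOP isn't importable; on the cluster the scoop branch is taken:

```python
# Minimal sketch of nested map calls: an outer task, itself running in a
# worker, issues its own map call. inner_task/outer_task are hypothetical.
try:
    from scoop import futures  # SCOOP's futures API (used on the cluster)
    scoop_map = futures.map
except ImportError:
    # Fallback so the nesting pattern can be exercised without SCOOP.
    from concurrent.futures import ThreadPoolExecutor
    _pool = ThreadPoolExecutor(max_workers=8)
    scoop_map = _pool.map

def inner_task(x):
    return x * x

def outer_task(n):
    # A worker spawned by the outer map issues its own map call;
    # this nested call is where we see the error and the hang.
    return sum(scoop_map(inner_task, range(n)))

if __name__ == '__main__':
    results = list(scoop_map(outer_task, [3, 4, 5]))
    print(results)  # [5, 14, 30]
```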
Here is some info about the Python and SCOOP versions, as printed at the top of the log output:
[2015-06-10 19:05:04,622] launcher INFO SCOOP 0.7.1 release on linux2 using Python 2.7.9 |Anaconda 2.2.0 (64-bit)| (default, Mar 9 2015, 16:20:48) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)], API: 1013
[2015-06-10 19:05:04,623] launcher INFO Detected PBS environment.
[2015-06-10 19:05:04,623] launcher INFO Deploying 96 worker(s) over 12 host(s).
As you can see, we're using recent versions of the relevant libraries, and everything seems to work right up until the work spreads out across cluster nodes.
Here are my questions:
1) What causes this "Could not send result directly to peer" error, and how can I fix it?
2) How can I debug what is happening when the process hangs? I'm already using the -vv and --debug options, and they show nothing beyond the error above. I've also tried adding print statements, but the only thing I've been able to learn is that it hangs while trying to collect the result (hence the lockup).
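One general technique for debugging a hang like this (not SCOOP-specific, and assuming POSIX signals are available on the worker nodes) is to register a traceback dump on a signal, so a stuck worker can be inspected without killing it. faulthandler is standard library on Python 3; for the Python 2.7.9 we're running it would have to come from the pip-installable faulthandler backport:

```python
# At worker startup, arrange for every thread's Python traceback to be
# dumped to stderr on demand. Then, when the process hangs, run
#   kill -USR1 <worker-pid>
# on the node and read the worker's stderr to see where it is stuck.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)
```

This only shows Python-level frames; if the worker is blocked inside ZeroMQ's C code, the dump will end at the last Python call into it, which is still usually enough to tell a stuck recv from application code.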
On a related note, whenever the job hangs like this I have to kill it with the batch scheduler (i.e., qdel), which orphans a bunch of zombie processes on the worker nodes. Is SCOOP supposed to clean those up, or is it the user's job to write a script that goes through the worker nodes and kills the orphaned processes?
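Right now I do this by hand; something like the following sketch is what I have in mind for automating it. It assumes $PBS_NODEFILE lists the allocated hosts (one per line, possibly repeated) and that passwordless ssh works as described below; with dry_run=True it only prints the commands instead of running them:

```python
# Hypothetical cleanup helper: kill leftover SCOOP worker processes on
# every node of a PBS allocation after the job has been qdel'd.
import os
import subprocess

def kill_orphans(nodefile, user, dry_run=True):
    # PBS node files list one host per core, so deduplicate.
    with open(nodefile) as f:
        nodes = sorted(set(line.strip() for line in f if line.strip()))
    for node in nodes:
        # pkill -f matches the full command line of the orphaned workers.
        cmd = ['ssh', node, 'pkill -u %s -f scoop' % user]
        if dry_run:
            print(' '.join(cmd))
        else:
            subprocess.call(cmd)
    return nodes

if __name__ == '__main__' and 'PBS_NODEFILE' in os.environ:
    kill_orphans(os.environ['PBS_NODEFILE'], os.environ.get('USER', ''))
```

The pkill pattern 'scoop' is deliberately broad; on a shared node it may need tightening to match only this job's workers.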
Thanks for any insight in solving this problem!
Best,
Per
[2015-06-10 22:28:39,780] scoopzmq (10.21.41.28:58624) DEBUG 10.21.41.28:58624: Could not send result directly to peer 10.21.41.28:49157, routing through broker.
and it just hangs. Eventually I have to kill the job, the workers are left hanging, and I have to ssh into each node and kill each process manually.
I should also note that the ssh keys seem to be set up correctly and I can ssh between all nodes on the cluster without a password.
Any thoughts on why I keep getting this error and the hang?
Thanks,
Per