Reduce tasks that never complete


dataquerent

Dec 19, 2013, 2:01:31 AM
to rha...@googlegroups.com
I am trying to reproduce very simple examples from the basic tutorials.

With the current install, I can run "my first map reduce program" from

https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md

successfully.

When I try to run "my second map reduce program" from the same page, the reduce phase hangs for a very long time.
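
For reference, the job from that section of the tutorial is roughly the following (quoted from memory, so it may differ slightly from the page):

library(rmr2)

# Tabulate 50 binomial samples: the map emits a (value, 1) pair for each
# sample and the reduce counts how many pairs share the same key.
groups = rbinom(32, n = 50, prob = 0.4)
groups = to.dfs(groups)
from.dfs(
  mapreduce(
    input = groups,
    map = function(., v) keyval(v, 1),
    reduce = function(k, vv) keyval(k, length(vv))))

The map tasks finish, but the job never gets past the copy step of the reduce.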

The tail of the syslog file looks like this:
2013-12-19 14:54:52,597 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:378)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:473)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:203)
        at sun.net.www.http.HttpClient.New(HttpClient.java:290)
        at sun.net.www.http.HttpClient.New(HttpClient.java:306)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:995)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:931)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:849)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1636)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1593)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1493)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1401)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1333)

2013-12-19 14:54:52,597 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201312191055_0006_r_000000_0: Failed fetch #5 from attempt_201312191055_0006_m_000001_0
2013-12-19 14:54:52,597 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201312191055_0006_r_000000_0 adding host hit-nxdomain.opendns.com to penalty box, next contact in 37 seconds
2013-12-19 14:54:52,597 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201312191055_0006_r_000000_0: Got 1 map-outputs from previous failures
2013-12-19 14:55:22,599 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201312191055_0006_r_000000_0 Need another 2 map output(s) where 0 is already in progress
2013-12-19 14:55:22,599 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201312191055_0006_r_000000_0 Scheduled 0 outputs (1 slow hosts and0 dup hosts)
2013-12-19 14:55:22,599 INFO org.apache.hadoop.mapred.ReduceTask: Penalized(slow) Hosts:
2013-12-19 14:55:22,599 INFO org.apache.hadoop.mapred.ReduceTask: hit-nxdomain.opendns.com Will be considered after: 7 seconds.
2013-12-19 14:55:32,599 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201312191055_0006_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2013-12-19 14:56:22,604 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201312191055_0006_r_000000_0 Need another 2 map output(s) where 1 is already in progress
2013-12-19 14:56:22,604 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201312191055_0006_r_000000_0 Scheduled 0 outputs (0 slow hosts and1 dup hosts)


I would expect to find errors in stderr, but both stderr and stdout are empty (zero lines, zero bytes):

-rw-r--r-- 1 user user   152 Dec 19 14:56 log.index
-rw-rw-r-- 1 user user     0 Dec 19 14:37 stderr
-rw-rw-r-- 1 user user     0 Dec 19 14:37 stdout
-rw-rw-r-- 1 user user 22488 Dec 19 14:55 syslog

I would expect this "second map reduce" job to finish within a few minutes, but it has now been running for more than 23 minutes.

I suspect that no matter how long I let it run, it will keep repeating the same message: it has fetched one map output but still needs another two.

Any suggestions are welcome.
Thanks.

Antonio Piccolboni

Dec 19, 2013, 2:42:15 AM
to RHadoop Google Group
It appears to be a network problem. It shows up after the map phase but before the reduce phase because that is when the nodes try to copy map output from each other. You may want to run a test mapreduce job independent of rmr2, like one of the examples distributed with Hadoop. The other thing you may want to try is to ping each node from every other node and make sure they can all resolve all the hostnames. In general this looks like a Hadoop problem; just by googling the error message I found several threads covering the issue. I hope this helps.
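
For instance, something along these lines, run from R, would exercise plain Hadoop with rmr2 out of the picture. The examples jar name and location below are only a guess for a typical Hadoop 1.x install, so adjust them to your setup:

# Run one of the stock example jobs that ship with Hadoop, bypassing rmr2.
# The jar path is an assumption for a Hadoop 1.x install; adjust HADOOP_HOME
# and the jar name to match your installation.
system("hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 2 100")

If that job also stalls in the reduce phase, the problem is in the Hadoop setup rather than in rmr2.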


Antonio



dataquerent

Dec 19, 2013, 8:41:20 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Thank you for your response.

The Hadoop installation is running in pseudo-distributed mode, with the name node and data node on a single machine, which could be contributing to the problem.
I'll try testing Hadoop with a plain Java MapReduce job, and I'll also try the "ping"/name-resolution test, as sketched below.
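
Since the syslog shows the fetch being redirected to hit-nxdomain.opendns.com, I will start by checking that the machine can resolve its own hostname to a local address. A rough sketch of what I plan to run (the shell commands assume a Linux box):

# Name-resolution sanity checks for a single-node, pseudo-distributed setup.
system("hostname -f")               # the name this node advertises
system("ping -c 3 `hostname -f`")   # should answer from a local address, not an external one
system("cat /etc/hosts")            # the hostname should map to 127.0.0.1 or the machine's LAN IP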
