Recurring errors with EMR

Rupesh Mane

Dec 19, 2014, 5:27:16 PM
to snowpl...@googlegroups.com
Hi all,

Lately I have been seeing EMR ETL jobs fail on a regular basis. Every time, the failure is on an S3DistCp step. Digging a little deeper, the nodes/HDFS look fine, but SSHing into the cluster and trying to copy the file manually also fails. See the error copied below. All of our setup is in the US East region.
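
For reference, the manual copy I tried from the master node was along these lines (the destination path here is just an example):

    hadoop fs -copyToLocal /local/snowplow/shredded-events/com.snowplowanalytics.snowplow/ad_conversion/jsonschema/1-0-0/part-00002-00001 /tmp/part-00002-00001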

I don't think this is a Snowplow error. I believe this is something on the AWS side, but before I check with them, I want to see how many other folks are seeing this error.

One thing that could be done on the Snowplow side is to time out the jobs after a certain number of minutes. Right now, the job just waits in the background for hours.

Any help/info much appreciated.

Cheers - Rupesh

WARN hdfs.DFSClient: Failed to connect to /172.31.44.7:9200, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.46.216:38419 remote=/172.31.44.7:9200]
14/12/19 22:15:51 INFO hdfs.DFSClient: Could not obtain block blk_1762404823403011073_1091 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
14/12/19 22:16:54 WARN hdfs.DFSClient: Failed to connect to /172.31.44.7:9200, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.46.216:38444 remote=/172.31.44.7:9200]
14/12/19 22:16:54 INFO hdfs.DFSClient: Could not obtain block blk_1762404823403011073_1091 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
14/12/19 22:17:57 WARN hdfs.DFSClient: Failed to connect to /172.31.44.7:9200, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.46.216:38465 remote=/172.31.44.7:9200]
14/12/19 22:17:57 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_1762404823403011073_1091 file=/local/snowplow/shredded-events/com.snowplowanalytics.snowplow/ad_conversion/jsonschema/1-0-0/part-00002-00001
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2367)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2154)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2322)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:238)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
    at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:189)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:1763)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:1874)
Caused by: java.io.IOException: No live nodes contain current block
    at org.apache.hadoop.hdfs.DFSClient.bestNode(DFSClient.java:1375)
    at org.apache.hadoop.hdfs.DFSClient.access$1500(DFSClient.java:80)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2360)
    ... 18 more

copyToLocal: Could not obtain block: blk_1762404823403011073_1091 file=/local/snowplow/shredded-events/com.snowplowanalytics.snowplow/ad_conversion/jsonschema/1-0-0/part-00002-00001

Alex Dean

Dec 19, 2014, 7:40:44 PM
to snowpl...@googlegroups.com
Hi Rupesh,

Can you clarify the behavior you are seeing? You initially refer to jobs "failing on a regular basis", but later you say "the job just waits in the background for hours". Is the job dying, or is it getting stuck?

I'd be interested to hear what Amazon support says too.

A

--
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

Rupesh Mane

Dec 24, 2014, 6:32:31 PM
to snowpl...@googlegroups.com
The S3DistCp MR step gets stuck and does not respond. Sometimes a few files are copied over and sometimes none. I manually kill the run, terminate the EMR cluster, and start again from the beginning, skipping only the staging step.
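
For reference, the re-run is roughly the following (paths are placeholders, and the exact flags should be checked against your EmrEtlRunner version):

    bundle exec bin/snowplow-emr-etl-runner --config config/config.yml --skip staging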

I have not contacted AWS yet. One more thing I forgot to mention: we are running a cluster with only 2 data/worker nodes because the data volume is not huge. This might affect proper data replication for redundancy. Is there a recommendation on the minimum size of an EMR cluster?

Cheers - Rupesh

Alex Dean

Dec 25, 2014, 4:02:24 AM
to snowpl...@googlegroups.com
Hey Rupesh,

Increasing the quantity and specification of the boxes can't hurt - try bumping to 3 or 6 nodes...
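
In the EmrEtlRunner config.yml that means raising the core instance count (and optionally the instance types) in the emr/jobflow section. From memory it looks something like this, but do check the exact keys against your version:

    :emr:
      :jobflow:
        :master_instance_type: m1.medium
        :core_instance_count: 3
        :core_instance_type: m1.medium
        :task_instance_count: 0
        :task_instance_type: m1.medium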

Cheers,

Alex