Hi all,
Lately I've been seeing our EMR ETL jobs fail on a regular basis. Every time, the failure is on the S3DistCp step. Digging a little deeper, the problem appears to be at the node/HDFS level: SSHing into the cluster and trying to copy the file manually also fails, with the error copied below. All of our setup is in the US East region.
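For reference, this is roughly the manual copy I tried from the master node (the file path is taken from the trace below; exact invocation from memory):

    $ hadoop fs -copyToLocal /local/snowplow/shredded-events/com.snowplowanalytics.snowplow/ad_conversion/jsonschema/1-0-0/part-00002-00001 /tmp/

Next I plan to run fsck over the same directory to confirm whether the block really has no live replicas (standard Hadoop tooling):

    $ hadoop fsck /local/snowplow/shredded-events -files -blocks -locations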
I don't think this is a Snowplow error; I believe it's something on the AWS side. But before I raise it with them, I want to see how many other folks are hitting the same thing.
One thing that could be done on the Snowplow side is to time out the jobs after a certain number of minutes. Right now, the job just waits in the background for hours.
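As a stopgap on our side, something like wrapping the runner in a hard kill would do (a hypothetical sketch using GNU coreutils' timeout, not an existing Snowplow feature; the 90-minute cutoff and runner path are placeholders):

    $ timeout 90m bundle exec bin/snowplow-emr-etl-runner --config config/config.yml
    $ echo $?    # 124 here means the run was killed at the cutoff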
Any help/info much appreciated.
Cheers - Rupesh
WARN hdfs.DFSClient: Failed to connect to /172.31.44.7:9200, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.46.216:38419 remote=/172.31.44.7:9200]
14/12/19 22:15:51 INFO hdfs.DFSClient: Could not obtain block blk_1762404823403011073_1091 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
14/12/19 22:16:54 WARN hdfs.DFSClient: Failed to connect to /172.31.44.7:9200, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.46.216:38444 remote=/172.31.44.7:9200]
14/12/19 22:16:54 INFO hdfs.DFSClient: Could not obtain block blk_1762404823403011073_1091 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
14/12/19 22:17:57 WARN hdfs.DFSClient: Failed to connect to /172.31.44.7:9200, add to deadNodes and continuejava.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.46.216:38465 remote=/172.31.44.7:9200]
14/12/19 22:17:57 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_1762404823403011073_1091 file=/local/snowplow/shredded-events/com.snowplowanalytics.snowplow/ad_conversion/jsonschema/1-0-0/part-00002-00001
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2367)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2154)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2322)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:238)
at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:262)
at org.apache.hadoop.fs.FsShell.copyToLocal(FsShell.java:189)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1763)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1874)
Caused by: java.io.IOException: No live nodes contain current block
at org.apache.hadoop.hdfs.DFSClient.bestNode(DFSClient.java:1375)
at org.apache.hadoop.hdfs.DFSClient.access$1500(DFSClient.java:80)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2360)
... 18 more
copyToLocal: Could not obtain block: blk_1762404823403011073_1091 file=/local/snowplow/shredded-events/com.snowplowanalytics.snowplow/ad_conversion/jsonschema/1-0-0/part-00002-00001