Job fails with Could not obtain block errors
From: John Conwell <j...@iamjohn.me>
Date: Wed, 13 Jul 2011 12:29:34 -0700
Local: Wed, Jul 13 2011 3:29 pm
Subject: Job fails with Could not obtain block errors

I have an MR job that repeatedly fails during a join operation in the Mapper,
with the error "java.io.IOException: Could not obtain block".  I'm running
on EC2, on a 12-node cluster provisioned by Whirr.  Oddly enough, on a 5-node
cluster the MR job runs through without any problems.

The repeated exception the tasks are reporting in the web UI for this job
is:

java.io.IOException: Could not obtain block: blk_8346145198855916212_1340 file=/user/someuser/output_6_doc_tf_and_u/part-00002
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1993)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1800)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.sync(SequenceFile.java:2186)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:48)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
        at org.apache.hadoop.mapred.lib.DelegatingInputFormat.getRecordReader(DelegatingInputFormat.java:124)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)

When I look at the task log details for this failed job, they show that the
DFSClient failed to connect to a datanode holding a replica of the block it
was reading, and added that datanode's IP address to the list of deadNodes
(exception shown below).

11:25:19,204  INFO DFSClient:1835 - Failed to connect to /10.114.123.82:50010, add to deadNodes and continue
java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.202.163.95:43022, remote=/10.114.123.82:50010 for file /user/someuser/output_6_doc_tf_and_u/part-00002 for block 5843350240062345818_1332
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1465)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1437)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
        at org.apache.hadoop.mapred.lib.DelegatingInputFormat.getRecordReader(DelegatingInputFormat.java:124)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)

It then goes on to try the other two datanodes that contain replicas of this
block.  Each throws the same exception and is added to the list of dead
nodes, at which point the task fails.  This cycle of failures happens
multiple times during the job, against several different blocks.
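
To see whether the same few datanodes keep getting marked dead, something
like the following could tally the "add to deadNodes" messages in a task
log.  This is only a sketch: the regex is an assumption based on the single
DFSClient line quoted above, and count_dead_nodes is a made-up helper name.

```python
import re

# Pattern assumed from the one sample DFSClient log line in this post;
# other Hadoop versions may word the message differently.
DEAD_NODE = re.compile(
    r"Failed to connect to /\s*(\d+\.\d+\.\d+\.\d+):(\d+), add to deadNodes"
)


def count_dead_nodes(lines):
    """Return a dict mapping datanode IP -> number of failed connects."""
    counts = {}
    for line in lines:
        match = DEAD_NODE.search(line)
        if match:
            ip = match.group(1)
            counts[ip] = counts.get(ip, 0) + 1
    return counts
```

Feeding it the lines of a task log (for example, count_dead_nodes(open(path)))
would show whether failures cluster on a handful of nodes or are spread
across the whole cluster.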

I then looked in the namenode's log, to see what is going on with datanodes
that are getting added to the list of deadNodes, and found them associated
with the following error:

2011-07-13 05:33:55,161 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 10.83.109.118:50010

Looking through the rest of the namenode log, I count 36 different entries
for lost heartbeats.  Is this a common error?  The odd thing is that after
the job fails, HDFS seems to be able to recover itself, bringing these nodes
back online and re-replicating the files across the nodes again.  So when I
browse HDFS and look for one of the files that was causing the previous
failures, it's showing up in the correct directory, with its replication set
to 3.
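
For reference, the lost-heartbeat entries can be grouped per datanode with a
short script along these lines.  It is a sketch, not a tested tool: the regex
assumes every entry looks like the one namenode line quoted above, and
lost_heartbeats is a hypothetical helper name.

```python
import re

# Pattern assumed from the single namenode log line quoted in this post.
LOST_HEARTBEAT = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ INFO .*lost heartbeat from (\S+)"
)


def lost_heartbeats(lines):
    """Return a dict mapping datanode address -> list of timestamps."""
    by_node = {}
    for line in lines:
        match = LOST_HEARTBEAT.match(line)
        if match:
            timestamp, node = match.groups()
            by_node.setdefault(node, []).append(timestamp)
    return by_node
```

Comparing the timestamps per node against the job's start and end times
should show whether the heartbeats are lost only while the job is running.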

Also, I had read that this kind of error can be caused by the default
ulimit -n being too low, so I increased it to Cloudera's recommended value
of 16384, but I still have the same issue.
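
One thing that may be worth ruling out is whether the running datanode
process actually picked up the new limit, since limits are inherited when a
process starts.  On Linux, /proc/<pid>/limits reports the limits in effect
for a live process.  The sketch below parses that file; max_open_files and
read_limits are hypothetical names, and the datanode's pid has to be looked
up separately.

```python
def max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because a limit may read 'unlimited'.
    Returns None if the row is not present.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


def read_limits(pid):
    """Read the open-file limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return max_open_files(handle.read())
```

If read_limits for the datanode's pid still shows the old default rather
than 16384, the daemon was likely started before the change took effect.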

Any ideas why I'm seeing such instability in HDFS?  Why are these nodes
going down and causing my jobs to fail?  Any suggestions on what direction I
should take to troubleshoot this issue?

--

Thanks,
John C


 