Hypertable::Exception: Error appending 287 bytes to DFS fd 2 - DFS BROKER i/o error


David

Aug 13, 2013, 6:31:23 AM
to hyperta...@googlegroups.com
A RangeServer process terminated and reported the following ERROR:
1376152285 ERROR Hypertable.RangeServer : (/root/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:1077) Problem updating meta log with SPLIT_LOG_INSTALLED state for 2/1[1512383183351F84CC9BHA..1512385154951FF1F73BCA] split-point='1512384138852066871BHA'
1376152285 FATAL Hypertable.RangeServer : split_install_log (/root/src/hypertable/src/cc/Hypertable/RangeServer/Range.cc:1078): Hypertable::Exception: Error appending 287 bytes to DFS fd 2 - DFS BROKER i/o error
at virtual size_t Hypertable::DfsBroker::Client::append(int32_t, Hypertable::StaticBuffer&, uint32_t) (/root/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:318)
at virtual size_t Hypertable::DfsBroker::Client::append(int32_t, Hypertable::StaticBuffer&, uint32_t) (/root/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:307): java.io.IOException: All datanodes 10.190.115.54:50010 are bad. Aborting...

Meanwhile, the DfsBroker reported the following exception:
java.io.IOException: All datanodes 10.190.115.54:50010 are bad. Aborting...
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:935)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:755)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:424)

I am using version 0.9.7.8.
Is this an HDFS problem, or are all replicas of the meta log file bad?

Doug Judd

Aug 13, 2013, 2:00:59 PM
to hypertable-user
This is an HDFS problem.  Check the HDFS admin console to verify that all DataNodes are running.  If several DataNodes are down, restart them and try again.
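
If you prefer the command line to the admin console, the standard HDFS report command shows live and dead DataNodes (this is a generic Hadoop command, not something specific to Hypertable):

hdfs dfsadmin -report
(on older Hadoop releases: hadoop dfsadmin -report)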

- Doug



--
Doug Judd
CEO, Hypertable Inc.

David

Aug 13, 2013, 11:52:26 PM
to hyperta...@googlegroups.com, do...@hypertable.com
I also think it's an HDFS problem, but I don't see any DataNodes that are down. Only the RangeServer process terminated; the DfsBroker and ThriftBroker were both fine on 10.190.115.54, so I restarted the RangeServer service and everything went back to normal.
I looked at the DataNode log on 10.190.115.54 and found some SocketTimeoutExceptions, such as the following:
2013-08-11 00:54:46,623 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: HadoopServer114:50010:DataXceiver error processing READ_BLOCK operation  src: /10.190.115.56:36138 dest: /10.190.115.54:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.190.115.54:50010 remote=/10.190.115.56:36138]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:247)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:166)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:214)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:492)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:655)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:280)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:88)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:63)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
        at java.lang.Thread.run(Thread.java:662)

Doug Judd

Aug 15, 2013, 4:37:51 AM
to hypertable-user
Hi David,

Check out this post:

Re: DataXceiver error processing WRITE_BLOCK operation src: /x.x.x.x:50373 dest: /x.x.x.x:50010

Try increasing the nofile limit (a quick way to check and raise it is sketched after the properties below).  Also make sure you have the following properties set in your Hadoop configuration (hdfs-site.xml):

<property>
  <name>dfs.namenode.handler.count</name>
  <value>20</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
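
To check the nofile limit, run the following as the user that the DataNode and DfsBroker processes run under (standard Linux commands; the values shown are only examples):

ulimit -n    # current soft limit on open files
ulimit -Hn   # hard limit

# to raise it persistently, add entries like these to /etc/security/limits.conf
# and log in again (or restart the services):
*  soft  nofile  65536
*  hard  nofile  65536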

Let us know if this resolves the problem.

- Doug




David

Aug 18, 2013, 11:07:05 PM
to hyperta...@googlegroups.com, do...@hypertable.com
After I restarted the RangeServer service, everything has been fine; so far the problem has not reproduced.
The nofile limit was already 65536. I don't find the two configuration items in the hdfs-site.xml file, but I do find them on the Cloudera configuration page: 'dfs.namenode.handler.count' is set to 60, and 'dfs.datanode.max.xcievers' is set to 4096.
I also found the following output in the DataNode log:
......
2013-08-11 00:09:55,310 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: HadoopServer114:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.190.115.54:43584 dest: /10.190.115.54:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.190.115.54:50010 remote=/10.190.115.54:43584]
......
2013-08-11 00:57:51,471 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: HadoopServer114:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.190.115.63:57978 dest: /10.190.115.54:50010
java.io.IOException: Premature EOF from inputStream
......

But I cannot find a corresponding 'WRITE_BLOCK' error for port '50373'.

David

Sep 25, 2013, 11:21:24 AM
to hyperta...@googlegroups.com, do...@hypertable.com
Hi Doug,
I have encountered the same problem again and found some new hints.
When I upgraded the cluster from 0.9.7.8 to 0.9.7.10 and restarted it yesterday, the same problem appeared on most of the RangeServers: 'cap start' succeeded, but several minutes later the load on many RangeServer machines (as shown by 'top') was very high, even reaching 80+, and then the RangeServers terminated and reported the same exception in the log.
I increased the HDFS read timeout (the default is 60s) and decreased HdfsBroker.Workers and Hypertable.RangeServer.Workers, and this time the cluster restarted successfully and the exception above no longer appeared in the log (the kind of settings involved are sketched below). By the way, the write application was not stopped during the whole process, so I am not sure whether that could affect RangeServer startup.
Some hours later I found the write speed was too slow to tolerate, and as of today it is still too slow. So I stopped all writing applications, restored HdfsBroker.Workers and Hypertable.RangeServer.Workers to their previous values, and restarted the cluster again. This restart was also successful, but the write speed is still slow: a little faster than yesterday's, but only about half of the previous speed. The writing application is built against version 0.9.7.8; I don't know whether that could slow down the write speed.
In short, the 'All datanodes *.*.*.*:50010 are bad. Aborting' problem seems to be related to high load. I would appreciate your guidance on the two questions described above.
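
For reference, the changes were of roughly this form. The HDFS property name below is an assumption (it differs between Hadoop versions), and all values shown are illustrative rather than the exact ones used.

In hdfs-site.xml:

<!-- assumed property name; some Hadoop versions use dfs.client.socket-timeout;
     the 60000 ms default corresponds to the 60s timeout mentioned above -->
<property>
  <name>dfs.socket.timeout</name>
  <value>180000</value>
</property>

In the Hypertable config (hypertable.cfg):

# both lowered from their previous values; 20 is only an example
HdfsBroker.Workers=20
Hypertable.RangeServer.Workers=20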
Thanks a lot.


Doug Judd

Sep 25, 2013, 11:30:37 AM
to hypertable-user
Can you try upgrading to 0.9.7.11 and see if that fixes the problem?  It includes a fix for a file descriptor leak that was causing HDFS problems on some deployments.  Also, if you see a RangeServer consuming an extraordinary amount of CPU, capture stack traces with one of the following commands:

pstack <pid>

or

gdb --batch --quiet -ex "thread apply all bt full" -ex "quit" /opt/hypertable/current/bin/Hypertable.RangeServer <pid>

- Doug




David

Sep 25, 2013, 11:31:59 AM
to hyperta...@googlegroups.com, do...@hypertable.com
By the way, the cluster has 24 RangeServers and stores 90+ TB of data.

David

Sep 26, 2013, 1:51:57 AM
to hyperta...@googlegroups.com, do...@hypertable.com
Hi Doug,
Sorry, I cannot upgrade again to version 0.9.7.11 right now.
The cluster load has not been high from yesterday to today; checking with 'top' and on the Cloudera monitoring page, it is mostly below 1.
The writing application is compiled against version 0.9.7.8; could that slow down the write speed?
I want to compile the writing application against version 0.9.7.10, but it always reports an 'undefined symbol: _ZN10Hypertable14FailureInducer8instanceE' error when linking. The attachment is the CMakeLists file; can you take a look?

CMakeLists.txt

David

Sep 26, 2013, 1:52:26 AM
to hyperta...@googlegroups.com, do...@hypertable.com
CMakeLists.txt

Doug Judd

Sep 26, 2013, 1:57:49 AM
to hypertable-user
Try adding HyperCommon to the beginning and end of the TARGET_LINK_LIBRARIES list.  Sometimes the linker cannot resolve symbols when the libraries appear in the wrong order.
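
A minimal sketch of what that looks like in the CMakeLists.txt; the target name and the libraries other than HyperCommon are placeholders, not taken from the attached file:

# HyperCommon appears first and last so the linker can resolve symbols such as
# Hypertable::FailureInducer::instance regardless of the order of the other libraries.
target_link_libraries(writer_app
  HyperCommon
  Hypertable
  HyperComm
  HyperCommon)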

- Doug

