Problem syncing commit log: /hypertable/servers/rs18/log/user/3252: Error flushing DFS fd 73135


David

Feb 25, 2014, 1:21:12 AM
to hyperta...@googlegroups.com
One of the RangeServers initially reported the following ERROR at 00:26 AM:

1393259159 ERROR Hypertable.RangeServer : run_compaction (/root/src/hypertable/src/cc/Hypertable/RangeServer/AccessGroup.cc:790): 2/3[fIAB5E1506148,4600252381060051387036800..fIAB5E1506148,4600282321205571388678400](ONEDAY) Hypertable::Exception: Problem writing to DFS file '/hypertable/tables/2/3/ONEDAY/o3RQ4uVI0h-S57S8/cs848' : java.io.IOException: All datanodes 10.190.115.56:50010 are bad. Aborting... - DFS BROKER i/o error
at virtual void Hypertable::CellStoreV6::add(const Hypertable::Key&, Hypertable::ByteString) (/root/src/hypertable/src/cc/Hypertable/RangeServer/CellStoreV6.cc:478)
1393259159 ERROR Hypertable.RangeServer : ~CellStoreV6 (/root/src/hypertable/src/cc/Hypertable/RangeServer/CellStoreV6.cc:90): Hypertable::Exception: Error closing DFS fd: 74527 - DFS BROKER i/o error
at virtual void Hypertable::DfsBroker::Client::close(int32_t) (/root/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:232)
at virtual void Hypertable::DfsBroker::Client::close(int32_t) (/root/src/hypertable/src/cc/DfsBroker/Lib/Client.cc:229): java.io.IOException: All datanodes 10.190.115.56:50010 are bad. Aborting...

It then repeatedly reported ERRORs such as the following:

1393259161 ERROR Hypertable.RangeServer : (/root/src/hypertable/src/cc/Hypertable/Lib/CommitLog.cc:152) Problem syncing commit log: /hypertable/servers/rs18/log/user/3252: Error flushing DFS fd 73135
1393259161 ERROR Hypertable.RangeServer : (/root/src/hypertable/src/cc/Hypertable/RangeServer/RangeServer.cc:2711) Problem sync'ing user log fragment (/hypertable/servers/rs18/log/user/3252) - DFS BROKER i/o error
PERROR frag-3242 revision 1393256318014910549
PERROR frag-3242 distance 10
PERROR frag-3242 size 947145893
......
PERROR frag-3243 revision 1393256629546005597
PERROR frag-3243 distance 9
PERROR frag-3243 size 846830647
PERROR frag-3244 revision 1393256898652277621
PERROR revision 1393271799433710001 not found in map
......

Nine hours later, I restarted the RangeServer service and these ERRORs disappeared.
I have collected the related logs and the commit log file named 3252; if necessary, I will upload them here.
I would like to know what triggers these ERRORs. Any ideas are appreciated!

Doug Judd

Feb 25, 2014, 1:33:22 AM
to hypertable-user
Hi David,

Are all of your HDFS DataNodes up and running?  Did any of them crash?

- Doug






--
Doug Judd
CEO, Hypertable Inc.

David

Feb 25, 2014, 3:14:42 AM
to hyperta...@googlegroups.com, do...@hypertable.com
Yes, all DataNodes are running normally, and none of them crashed.
When these ERRORs first appeared, the cluster had a fairly high volume of HDFS reads and the problematic RangeServer was under a fairly high load. No application other than the writing application was accessing Hypertable at that point, and although the writing application runs 24/7, the amount of data being loaded at that time was fairly small.
Could it be that a compaction was running at that time, which caused the high load and finally resulted in timeouts when accessing the DataNode? But I don't understand why the timeouts would last for such a long time.
After I restarted the problematic RangeServer, the ERRORs disappeared; does that indicate that the problem is not caused by HDFS?

Doug Judd

Feb 27, 2014, 1:22:08 PM
to hypertable-user
Hi David,

Please check the DataNode logs at the time of the failure to see if they reported any errors and report back.  Thanks.

- Doug



Doug Judd

Feb 27, 2014, 1:51:14 PM
to hypertable-user
Can you also check for "Too many open files" in the datanode logs?  You can run a command similar to this:

cap shell
cap> with slave fgrep -i "Too many open files" /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*

Also, what is your open file limit on the slave machines (i.e. ulimit -n) ?
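
If it helps, something like the following (a rough sketch, assuming a standard Linux /proc layout) shows the limit the DataNode process is actually running with, which can differ from a login shell's ulimit -n:

pid=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode | head -1)
grep 'Max open files' /proc/$pid/limits   # limit in effect for the running daemon
ls /proc/$pid/fd | wc -l                  # file descriptors currently open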

- Doug

David

Feb 28, 2014, 2:43:49 AM
to hyperta...@googlegroups.com, do...@hypertable.com
The open file limit on all machines is 65536.
I checked all of the DataNode logs from that time and could not find the phrase "Too many open files", but I did find some other ERRORs. After analyzing them, I found that they can be classified into five types, as follows:

1. HadoopServer116:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.190.115.56:46399 dest: /10.190.115.56:50010
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
2. HadoopServer116:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.190.115.60:35559 dest: /10.190.115.56:50010
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:171)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:56)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:143)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
3. HadoopServer116:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.190.115.56:46439 dest: /10.190.115.56:50010
java.io.IOException: Interrupted receiveBlock
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:680)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
4. HadoopServer126:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.190.115.56:52631 dest: /10.190.115.66:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.190.115.66:50010 remote=/10.190.115.56:52631]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
5. HadoopServer126:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.190.115.56:53051 dest: /10.190.115.66:50010
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1184402547-10.190.115.41-1381226207784:blk_-7640167044334284731_223663338 already exists in state RBW and thus cannot be created.
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:813)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:92)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:155)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:451)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)

HadoopServer116 (10.190.115.56) is the problematic machine. I also noticed two patterns across these ERRORs:
1. Most ERRORs are related to HadoopServer116.
2. Most ERRORs are of the 3rd type above, i.e. java.io.IOException: Interrupted receiveBlock.
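
For reference, the exception types can be tallied with something like this (a rough sketch; the log path is the one from Doug's earlier fgrep command and may differ on your nodes):

grep -ho '[A-Za-z0-9_.]*Exception' /var/log/hadoop-hdfs/hadoop-hdfs-datanode-* \
  | sort | uniq -c | sort -rn    # count occurrences of each exception class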

Doug Judd

Mar 2, 2014, 7:45:30 PM
to hypertable-user
I'll be working on this during the upcoming week.  As soon as I have a fix, I'll let you know.

- Doug




Doug Judd

Mar 6, 2014, 2:08:52 AM
to hypertable-user
Hi David,

Can you try adding the following properties to your hdfs-site.xml file and let me know if this change makes the problem go away?

<property>
  <name>dfs.client.socket-timeout</name>
  <value>600000</value>
</property>

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value>
</property>

You'll need to stop Hypertable+Hadoop and then push out the config change.
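
For example (a rough sketch only; the cap tasks assume Hypertable's standard Capistrano recipes and the paths assume a CDH-style layout, so adjust for your deployment):

cap stop                                  # stop Hypertable
# stop HDFS, then copy the updated hdfs-site.xml to every DataNode/NameNode
for h in $(cat hdfs-hosts.txt); do        # hdfs-hosts.txt: one host per line (placeholder)
  scp hdfs-site.xml $h:/etc/hadoop/conf/hdfs-site.xml
done
# restart HDFS on all nodes, then bring Hypertable back up
cap start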

- Doug

David

Mar 6, 2014, 2:52:56 AM
to hyperta...@googlegroups.com, do...@hypertable.com
Hi Doug,
About six months ago, I already tried the properties you mention above, in connection with the thread 'https://groups.google.com/forum/?fromgroups#!topic/hypertable-user/fM9yRkt5Ywo'. I found that using them made the ERRORs go away, but a side effect was noticeably slower Hypertable read and write speeds.

Doug Judd

Mar 6, 2014, 1:24:45 PM
to hypertable-user
Hi David,

Ok, thanks.  I just wanted to confirm that increasing those properties helped.  I've been doing some investigation into this problem and I think that it may be caused by the RangeServers keeping open file handles to all of the CellStore files.  It appears that inside the FSDataInputStream and FSDataOutputStream objects (HDFS file handle objects), socket connections are maintained for each connection to a DataNode.   Before I make any changes to Hypertable, I would like to see if setting some kernel parameters will help.  Please run the attached script (as root) on all of your NameNode machines as well as the DataNode machines.  The script will output the original values, so capture the output if you would like to revert to the old settings (rebooting the machines will also revert).  For example, here is the output of the script when I ran it on one of our development machines:

Updating kernel parameters as follows:
net.core.somaxconn = 4000
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 40000
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_window_scaling = 1
vm.swappiness = 0

To revert settings, run the following commands:
sysctl -w net.core.somaxconn=128
sysctl -w net.ipv4.tcp_fin_timeout=60
sysctl -w net.ipv4.tcp_max_syn_backlog=1024
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_tw_reuse=0
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w vm.swappiness=60

Also, if possible, restart both Hypertable and HDFS after making these changes.  As soon as you have an idea as to whether or not these parameters help, please let me know.  Thanks!

- Doug

tune-kernel-parameters.sh
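
The gist of the script, reconstructed from the output shown above (the actual attachment may differ; run as root):

#!/bin/bash
PARAMS="net.core.somaxconn=4000
net.ipv4.tcp_fin_timeout=20
net.ipv4.tcp_max_syn_backlog=40000
net.ipv4.tcp_sack=1
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_window_scaling=1
vm.swappiness=0"

REVERT=""
echo "Updating kernel parameters as follows:"
while IFS='=' read -r key value; do
  old=$(sysctl -n "$key")                      # capture current value for revert
  REVERT="${REVERT}sysctl -w $key=$old"$'\n'
  sysctl -w "$key=$value"                      # prints "key = value"
done <<< "$PARAMS"

echo
echo "To revert settings, run the following commands:"
printf '%s' "$REVERT"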

David

Mar 6, 2014, 8:36:04 PM
to hyperta...@googlegroups.com, do...@hypertable.com
OK. Although the problem is rare, I have also encountered it on our testing cluster.
I will follow your instructions and apply these settings; as soon as I have some results, I will report back to you.