Lost connection and data loss after name node incident


Kristoffer Sjögren

Mar 18, 2015, 8:00:32 AM
to async...@googlegroups.com
Hi

We had an incident on our Hadoop/HBase production cluster (asynchbase-1.6.0 / hbase-0.94.6-cdh4.4.0) related to the name node, which caused HBaseClient to lose its connections to the region servers.

Normally, after one or more servers restart, HBaseClient reconnects as soon as the regions come back online. During this incident, however, HBaseClient wasn't able to reconnect. After a while it even stopped complaining about the connection and silently dropped writes.

It's hard to figure out exactly why this happened, but the first and last application log entries we saw from HBaseClient [1] report chan=null, which looks suspicious. I have also looked through the datanode, namenode, regionserver and master logs, but I'm not exactly sure what to look for.

Cheers,
-Kristoffer

[1] 

2015-03-15T21:31:56.819+0100 [New I/O worker #8] INFO org.hbase.async.HBaseClient:2158 [Added client for region RegionInfo(table="unique_rows", region_name="unique_rows,row", stop_key="order_795_1403881148949_"), which was updated in the regions cache.  Now we know that RegionClient@1076603013(chan=null, #pending_rpcs=0, #batched=0, #rpcs_inflight=0) is hosting 1 region.] ::: 
2015-03-15T21:31:56.871+0100 [New I/O boss #9] WARN org.hbase.async.HBaseClient:2755 [Couldn't connect to the RegionServer @ 10.3.24.22:60020] ::: 
2015-03-15T21:31:56.911+0100 [New I/O boss #9] ERROR org.hbase.async.RegionClient:1095 [Unexpected exception from downstream on [id: 0x117af469]] ::: java.net.ConnectException: Connection refused: /10.3.24.22:60020
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_25]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[na:1.8.0_25]
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) [netty-3.9.4.Final.jar:na]

2015-03-15T23:20:07.241+0100 [New I/O worker #4] INFO org.hbase.async.HBaseClient:2158 [Added client for region RegionInfo(table="unique_rows", region_name="unique_rows,row", stop_key=""), which was added to the regions cache.  Now we know that RegionClient@1586402841(chan=null, #pending_rpcs=0, #batched=0, #rpcs_inflight=0) is hosting 1 region.] ::: 
2015-03-15T23:20:07.241+0100 [New I/O boss #9] WARN org.hbase.async.HBaseClient:2755 [Couldn't connect to the RegionServer @ 10.3.24.41:60020] ::: 
2015-03-15T23:20:07.242+0100 [New I/O boss #9] ERROR org.hbase.async.RegionClient:1095 [Unexpected exception from downstream on [id: 0x01743554]] ::: java.net.ConnectException: Connection refused: /10.3.24.41:60020
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_25]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[na:1.8.0_25]
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.9.4.Final.jar:na]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.9.4.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_25]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_25]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25]

2015-03-15T23:20:07.861+0100 [New I/O worker #4] INFO org.hbase.async.HBaseClient:2158 [Added client for region RegionInfo(table="unique_rows", region_name="unique_rows,row", stop_key=""), which was added to the regions cache.  Now we know that RegionClient@1023451702(chan=null, #pending_rpcs=0, #batched=0, #rpcs_inflight=0) is hosting 1 region.] :::
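
In case it helps anyone looking at similar logs, here is a rough sketch of how the chan=null entries could be pulled out of an application log. It uses only the JDK and matches the RegionClient@...(chan=null text exactly as it appears in the excerpt above. The class name ChanNullScan and the method name are made up for illustration and are not part of asynchbase, and the sketch only counts occurrences; it says nothing about why the channel is null.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of asynchbase: scans log lines for
// RegionClient entries that were logged with chan=null.
public class ChanNullScan {
    // Matches the text seen in the log excerpt, e.g.
    // "RegionClient@1076603013(chan=null, #pending_rpcs=0, ..."
    private static final Pattern NULL_CHAN =
        Pattern.compile("RegionClient@(\\d+)\\(chan=null");

    /** Returns the number printed after "RegionClient@" for every chan=null entry. */
    static List<String> nullChannelClients(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            Matcher m = NULL_CHAN.matcher(line);
            while (m.find()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ChanNullScan <logfile>");
            return;
        }
        // Assumes the log file is UTF-8 text with one entry per line.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        List<String> ids = nullChannelClients(lines);
        System.out.println(ids.size() + " RegionClient entries with chan=null: " + ids);
    }
}
```

Running it against the application log (java ChanNullScan app.log, where app.log is a placeholder file name) prints how many such entries there are and which RegionClient numbers they belong to.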
