Hey Tsuna,
I hate to bump an old thread, but I seem to be hitting this same issue again. It had seemed to go away, perhaps because our system was more stable for a while. We recently upgraded to CDH4, and since the upgrade the problem has occurred every time we've lost a region server. The sequence of events went something like this:
1. A RegionServer goes down (it lost its connection to ZooKeeper and shut itself down, likely due to a long GC pause).
2. Client app starts throwing errors (timestamps in GMT here):
[ERROR] [2012-10-25 16:45:12,676] [New I/O client worker #1-2] o.h.a.RegionClient: Unexpected exception from downstream on [id: 0x51f88fbd, /10.10.2.80:43354 => /10.10.2.101:60020]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.write0(Native Method) ~[na:1.6.0_26]
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29) ~[na:1.6.0_26]
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72) ~[na:1.6.0_26]
at sun.nio.ch.IOUtil.write(IOUtil.java:28) ~[na:1.6.0_26]
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334) ~[na:1.6.0_26]
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool$PooledSendBuffer.transferTo(SocketSendBufferPool.java:239) ~[ccm_depends.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.write0(NioWorker.java:470) ~[ccm_depends.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.writeFromSelectorLoop(NioWorker.java:400) ~[ccm_depends.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:286) ~[ccm_depends.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200) ~[ccm_depends.jar:na]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_26]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_26]
at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
[WARN] [2012-10-25 16:45:12,706] [New I/O client boss #1] o.h.a.HBaseClient: Couldn't connect to the RegionServer @ 10.10.2.101:60020
[ERROR] [2012-10-25 16:45:12,848] [New I/O client boss #1] o.h.a.RegionClient: Unexpected exception from downstream on [id: 0x7a273ef7]
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.6.0_26]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567) ~[na:1.6.0_26]
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:384) ~[ccm_depends.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:354) ~[ccm_depends.jar:na]
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:276) ~[ccm_depends.jar:na]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_26]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_26]
at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
[WARN] [2012-10-25 16:45:12,861] [New I/O client boss #1] o.h.a.HBaseClient: Couldn't connect to the RegionServer @ 10.10.2.101:60020
...
[ERROR] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.RegionClient: Unexpected exception from downstream on [id: 0x44782ec7]
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: Couldn't connect to the RegionServer @ 10.10.2.101:60020
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
(the "When handling disconnection" WARN above was logged several more times with the same timestamp and nearly identical contents)
[ERROR] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.RegionClient: Unexpected exception from downstream on [id: 0x0cb91397]
3. The RegionServer was brought back online. According to the HMaster status page, its regions had already moved to other region servers by this point. Timestamp in PDT:
Thu Oct 25 09:48:35 PDT 2012 Starting regionserver on hb2
4. Several minutes later, the client app was still showing the errbacks being called (timestamp in GMT):
[WARN] [2012-10-25 16:55:13,274] [akka:event-driven:dispatcher:HbaseLogActorDispatcher-272] HBaseLogErrback: Problem logging transaction: 02519d6e74c to hbase, retrying in 5 seconds, exception was com.stumbleupon.async.DeferredGroupException: At least one of the Deferreds failed, first exception:
It seems that even though the regions were back online and the failed RegionServer had been restarted, the Deferreds were still failing. A restart of the client app and everything was happy again.
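In case it matters, our "retrying in 5 seconds" logic is essentially the pattern below. I've sketched it here with plain java.util.concurrent instead of the stumbleupon Deferred API, and the names are illustrative, but the shape is the same: on failure we just schedule another attempt of the whole operation, so if the Deferreds keep failing the app retries forever rather than recovering.

```java
import java.util.concurrent.*;

public class RetryDemo {
    static final ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);

    // Retry `op` until it succeeds, waiting `delayMs` between attempts
    // (roughly what our errback does with a 5-second delay).
    public static <T> CompletableFuture<T> withRetry(Callable<T> op, long delayMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(op, delayMs, result);
        return result;
    }

    private static <T> void attempt(Callable<T> op, long delayMs, CompletableFuture<T> result) {
        try {
            result.complete(op.call());
        } catch (Exception e) {
            // Errback path: log-and-reschedule, never give up.
            timer.schedule(() -> attempt(op, delayMs, result), delayMs, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulate an operation that fails twice ("connection reset") then succeeds.
        String v = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("connection reset");
            return "ok";
        }, 10).get(5, TimeUnit.SECONDS);
        System.out.println(v + " after " + calls[0] + " attempts");
        timer.shutdown();
    }
}
```

The real code hangs the retry off the Deferred's errback rather than a ScheduledExecutorService, but in both versions each retry is a brand-new call into HBaseClient, which is why I'd expect it to pick up the new region locations eventually.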
I'm sure I've left out a bunch of pertinent details, but I'd love to understand how to troubleshoot this situation better and get to a point where I don't have to restart the client apps to recover. Any ideas?
Thanks,
Steve