recovering from a region server failure


sb

May 2, 2012, 2:02:40 PM
to Async HBase
We sometimes have a region server fail, for various reasons. Sometimes the asynchbase client notices that the region came back online on another region server and sometimes it does not; the regions do fail over to another server in a reasonable amount of time. In the cases where the asynchbase client does not notice that the region came back online, it continues to report errors like Connection Refused or Server not serving region. In order to get things working again we shut down and restart our application. I'm not clear on what the difference is in the cases where the client does not recover. We've seen this in the other HBaseClient too. I would appreciate any feedback on how we can get to a more graceful recovery every time, or how we can troubleshoot why it does not recover sometimes.

Thanks,
Steve

tsuna

May 4, 2012, 5:35:22 PM
to sb, Async HBase
Hi Steve,

On Wed, May 2, 2012 at 11:02 AM, sb <s...@connexity.com> wrote:
> We sometimes have a region server fail, for various reasons.
> Sometimes the asynchbase client notices that the region came back
> online on another region server and sometimes it does not, the regions

What do you mean by "it does not notice"?

> are failing over to another server in a reasonable amount of time.

What makes you think the regions are failing over in a reasonable
amount of time? And what is a reasonable amount of time for you?

> In the cases where the asynchbase client does not notice that the region
> came back online, it continues to report errors like Connection
> Refused or Server not serving region - in order to get things working
> again we shutdown and restart our application.

"Connection refused" seems to indicate that asynchbase looked up the
location of a particular region in the META table, attempted to
connect to the RegionServer that hosts that region according to the
META table, and that RegionServer was down, but the machine itself was
up, so the operating system reset the connection.

It would be better to paste an excerpt of what you're seeing in the
logs to be sure.

I don't see how restarting your application could fix things.

> I'm not clear on what
> the difference is in the cases where the client does not recover.
> We've seen this in the other HbaseClient too.

What is it that you call "the other HBaseClient"? HBase's own
standard client, HTable?

> I would appreciate any
> feedback on how we can get to more graceful recovery every time or how
> we can troubleshoot why it does not recover sometimes.

If you give us more details about the nature of the failure, why the
region is moving to another RegionServer, and what you see in the
logs, then maybe I would be able to shed more light on what's going
on, and whether or not what you're seeing is a bug or is expected.

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

sb

May 4, 2012, 8:15:12 PM
to Async HBase
Hey tsuna,

Thanks for your response.

Looking at the HMaster master-status web page I can see that regions
from a failed region server have come back online on another region
server.

While I have not specifically timed it, it seems that when a region
server fails the regions come back online on another server in about 1
minute.

We get alerted from a monitoring system when the RS failure occurs and when the client app starts reporting errors (we get both alerts at about the same time). We are pretty good about restarting the failed RS quickly, though after the regions are back online on a different server. I wait until the failed RS is back up and regions have stopped moving before I address the client apps. When everything seems good on the HMaster master-status page I would expect that the clients would be able to recover and continue to function normally; sometimes they do not. We run 3 instances of the same client app; sometimes 2 of 3 will recover and 1 will not. At that point, I restart the client app that did not recover and it starts working as expected again.

By 'the other HBaseClient', I mean HBase's standard client, HTable. We have seen this problem with both the async and HTable clients. Since we see the problem in both clients, I'm not sure whether my app has a problem, HBase has a problem, or it's something else entirely.

St.Ack suggested that I use the hbase shell to query the .META. table for one of the regions in question when the problem is occurring. We have an RS failure every week or so. I'll do the hbase shell query and gather some logs the next time it happens.
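
In the meantime, here is roughly the same check done programmatically with the standard (sync) client instead of the shell, in case that's easier to script. This is only a sketch: the table name "mytable" below is a placeholder, and it assumes the 0.92-era layout where .META. keeps the assignment in the info:server column.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, ".META.");
    // .META. rows are keyed "table,startkey,timestamp.encodedname.", so scanning
    // from "mytable," up to "mytable-" covers just that table's regions
    // ("mytable" is a hypothetical table name).
    Scan scan = new Scan(Bytes.toBytes("mytable,"), Bytes.toBytes("mytable-"));
    ResultScanner scanner = meta.getScanner(scan);
    try {
      for (Result region : scanner) {
        // info:server is the RS that .META. says is hosting the region.
        byte[] server = region.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
        System.out.println(Bytes.toString(region.getRow()) + " => "
            + (server == null ? "unassigned" : Bytes.toString(server)));
      }
    } finally {
      scanner.close();
      meta.close();
    }
  }
}

Comparing that output against the HMaster status page while a client is stuck should show whether .META. itself is stale or the client just isn't picking up the new location.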

I appreciate your help.

Thanks,
Steve

tsuna

May 5, 2012, 12:53:46 AM
to sb, Async HBase
On Fri, May 4, 2012 at 5:15 PM, sb <s...@connexity.com> wrote:
> Looking at the HMaster master-status web page I can see that regions
> from a failed region server have come back online on another region
> server.

Interesting. I would like to think that the HMaster web page and META
should be in agreement, but ultimately the only thing that matters for
your app is what's in META. So if for any reason the master shows the
transition on its status page before it has actually happened in META,
it could mislead you. I do not know whether this happens in practice,
but it's one possibility to keep in mind.

> While I have not specifically timed it, it seems that when a region
> server fails the regions come back online on another server in about 1
> minute.

1 minute is very long. asynchbase will "give up" if an RPC cannot be
routed to a region after slightly less than 10 seconds. Right now the
exact number of attempts and the timing between attempts aren't
configurable (see the code at the end of HBaseClient.handleNSRE(), a
package-private method, if you want to know more).
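
If you need to ride out longer outages than that, one option is to retry at the application level from an errback. Here is a rough sketch of the idea (not something asynchbase does for you): it assumes SLF4J for logging and a shared ScheduledExecutorService for the delay, and the 5-second delay, the retry budget and the class/method names are all made up for illustration.

import java.nio.charset.Charset;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.stumbleupon.async.Callback;
import org.hbase.async.HBaseClient;
import org.hbase.async.PutRequest;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RetryingWriter {
  private static final Logger LOG = LoggerFactory.getLogger(RetryingWriter.class);
  private static final Charset UTF8 = Charset.forName("UTF-8");

  private final HBaseClient client;
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  public RetryingWriter(final HBaseClient client) {
    this.client = client;
  }

  /** Sends a put; if asynchbase gives up (after its ~10s of internal retries),
      waits 5 seconds and re-submits a fresh PutRequest, up to attemptsLeft times. */
  public void write(final byte[] table, final byte[] key, final byte[] family,
                    final byte[] qualifier, final byte[] value, final int attemptsLeft) {
    client.put(new PutRequest(table, key, family, qualifier, value))
      .addErrback(new Callback<Object, Exception>() {
        public Object call(final Exception e) {
          if (attemptsLeft <= 0) {
            LOG.error("Giving up on put for key " + new String(key, UTF8), e);
            return e;  // returning the exception keeps propagating the failure
          }
          LOG.warn("Put failed, retrying in 5s (" + attemptsLeft + " attempts left)", e);
          timer.schedule(new Runnable() {
            public void run() {
              write(table, key, family, qualifier, value, attemptsLeft - 1);
            }
          }, 5, TimeUnit.SECONDS);
          return null;  // failure handled here, nothing propagates further
        }
      });
  }
}

Whether you retry blindly like this or only on specific failures is up to the application; the point is just that once asynchbase's own retries are exhausted the Deferred fails, and it's the caller's decision what to do next.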

> We get alerted from a monitoring system when the RS failure occurs and
> when the client app starts reporting errors ( we get both alerts at
> about the same time ).  We are pretty good about restarting the
> failed RS quickly, though after the regions are back online on a
> different server.  I wait until the failed RS is back up and regions
> have stopped moving before I address the client apps.  When everything
> seems good on the HMaster master-status page I would expect that the
> clients would be able to recover and continue to function normally,
> sometimes they do not.  We run 3 instances of the same client app,
> sometimes 2 of 3 will recover and 1 will not.  At that point, I
> restart the client app that did not recover and it starts working as
> expected again.

That's weird.

> By 'the other HbaseClient', I mean Hbase's standard client, HTable.
> We have seen this problem with both async and htable clients.  Since
> we see the problem in both clients I'm not sure if my app has a
> problem, if hbase has a problem or ???

If you're seeing the same thing with asynchbase and HTable then yeah
it seems to indicate that the problem comes from HBase itself? But
then it wouldn't explain why 2 out of 3 app servers are able to recover
and not the other.

> St.Ack suggested that I use the hbase shell to query the .META. table
> for one of the regions in question when the problem is occurring.  We
> have an RS failure very week or so.  I'll do the hbase shell query and
> gather some logs the next time it happens.

Yeah, this sort of issue definitely requires digging in the logs on
both sides to get a good understanding of the picture and try to make
sense of what you're seeing.

sb

Oct 25, 2012, 6:04:34 PM
to async...@googlegroups.com, sb
Hey Tsuna,

I hate to bump an old thread, but I seem to be having this same issue again.  It seemed like it had gone away; perhaps our system had just been more stable for a while.  We recently updated to CDH4, and the problem has occurred every time we've lost a region server since the update.  The sequence of events went something like this:
1. RegionServer goes down (lost its connection to ZK and shut itself down, likely a long GC).

2. Client app starts throwing errors (timestamps in GMT here):
[ERROR] [2012-10-25 16:45:12,676] [New I/O client worker #1-2] o.h.a.RegionClient: Unexpected exception from downstream on [id: 0x51f88fbd, /10.10.2.80:43354 => /10.10.2.101:60020]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.write0(Native Method) ~[na:1.6.0_26]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29) ~[na:1.6.0_26]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72) ~[na:1.6.0_26]
        at sun.nio.ch.IOUtil.write(IOUtil.java:28) ~[na:1.6.0_26]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334) ~[na:1.6.0_26]
        at org.jboss.netty.channel.socket.nio.SocketSendBufferPool$PooledSendBuffer.transferTo(SocketSendBufferPool.java:239) ~[ccm_depends.jar:na]
        at org.jboss.netty.channel.socket.nio.NioWorker.write0(NioWorker.java:470) ~[ccm_depends.jar:na]
        at org.jboss.netty.channel.socket.nio.NioWorker.writeFromSelectorLoop(NioWorker.java:400) ~[ccm_depends.jar:na]
        at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:286) ~[ccm_depends.jar:na]
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200) ~[ccm_depends.jar:na]
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_26]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_26]
        at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
[WARN] [2012-10-25 16:45:12,706] [New I/O client boss #1] o.h.a.HBaseClient: Couldn't connect to the RegionServer @ 10.10.2.101:60020
[ERROR] [2012-10-25 16:45:12,848] [New I/O client boss #1] o.h.a.RegionClient: Unexpected exception from downstream on [id: 0x7a273ef7]
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.6.0_26]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567) ~[na:1.6.0_26]
        at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:384) ~[ccm_depends.jar:na]
        at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:354) ~[ccm_depends.jar:na]
        at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:276) ~[ccm_depends.jar:na]
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [na:1.6.0_26]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [na:1.6.0_26]
        at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
[WARN] [2012-10-25 16:45:12,861] [New I/O client boss #1] o.h.a.HBaseClient: Couldn't connect to the RegionServer @ 10.10.2.101:60020

...

[ERROR] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.RegionClient: Unexpected exception from downstream on [id: 0x44782ec7]
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: Couldn't connect to the RegionServer @ 10.10.2.101:60020
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299208) was in fact serving this region
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
[WARN] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.HBaseClient: When handling disconnection of RegionClient@1180655027(chan=null, #pending_rpcs=18, #batched=0, #rpcs_inflight=0) and removing RegionInfo(table=".META.", region_name=".META.,,1", stop_key="") from region2client, it was found that RegionClient@1482197042(chan=[id: 0x1fd4340c, /10.10.2.80:59970 => /10.10.2.106:60020], #pending_rpcs=0, #batched=512, #rpcs_inflight=299207) was in fact serving this region
[ERROR] [2012-10-25 16:45:29,911] [New I/O client boss #1] o.h.a.RegionClient: Unexpected exception from downstream on [id: 0x0cb91397]

3.  The RegionServer is brought back online. The regions had moved to other RSs by this point, according to the HMaster status page.  Timestamp in PDT:
Thu Oct 25 09:48:35 PDT 2012 Starting regionserver on hb2

4. Several minutes later, the client app still shows the ErrBacks being called (timestamp in GMT):
[WARN] [2012-10-25 16:55:13,274] [akka:event-driven:dispatcher:HbaseLogActorDispatcher-272] HBaseLogErrback: Problem logging transaction: 02519d6e74c to hbase, retrying in 5 seconds, exception was com.stumbleupon.async.DeferredGroupException: At least one of the Deferreds failed, first exception:


It seems that even though the regions were online and the failed RegionServer had been restarted, the Deferreds were still failing.  After a restart of the client app, everything was happy again.

I'm sure that I've left out a bunch of pertinent details but I'd love to understand how to better troubleshoot this situation and get to a point where I can avoid restarting the client apps to solve it.  Any ideas?

Thanks,
Steve

Shrijeet Paliwal

Oct 25, 2012, 6:34:51 PM
to sb, async...@googlegroups.com
We might have seen this a bunch of times ourselves (app not recovering from a region server crash until restarted), but it does not always happen. I am guilty of being lazy and not reporting/digging.

A weak guesstimate: in a high-throughput use case (with multiple clients), all clients realize around the same time that the local region cache has gone stale and needs to be rebuilt, so they bombard META with requests. I have seen this visually (look at the master page and check whether one region is seeing a lot more requests than the others, and whether it is serving META).

But asynchbase has a guard to keep this from going out of control; Tsuna explains it here: https://groups.google.com/forum/?fromgroups=#!topic/asynchbase/jOlYz1l5Ehs (keyword: meta storm).

Anyway, this is just a guess. I will provide some logs as well if I see it happening.

I had a question: what was the exact error here (the full trace of the DeferredGroupException)?

"[WARN] [2012-10-25 16:55:13,274] [akka:event-driven:dispatcher:HbaseLogActorDispatcher-272] HBaseLogErrback: Problem logging transaction: 02519d6e74c to hbase, retrying in 5 seconds, exception was com.stumbleupon.async.DeferredGroupException: At least one of the Deferreds failed, first exception:"
--
Shrijeet

sb

Oct 25, 2012, 6:45:34 PM
to async...@googlegroups.com, sb
Hey Shrijeet, I appreciate your response.  I wish I knew what that complete exception was; whatever I'm doing to produce that message seems to be leaving off part of the stack.  I'm just concatenating the Exception that comes into the Errback there, so I must need to do something more.  Fixing this hasn't been a priority until now.
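
Here's what I'll probably try: hand the exception object itself to the logger instead of concatenating it into the message, so the full stack trace gets printed. A rough sketch, assuming SLF4J (the logger name and message are just placeholders mirroring the log line above, not the app's real code):

import com.stumbleupon.async.Callback;
import com.stumbleupon.async.DeferredGroupException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class LoggingErrback implements Callback<Object, Exception> {
  private static final Logger LOG = LoggerFactory.getLogger("HBaseLogErrback");

  public Object call(final Exception e) {
    // Passing the Throwable as the last argument makes SLF4J print the whole
    // stack trace instead of just the exception's toString().
    LOG.warn("Problem logging transaction to hbase, retrying in 5 seconds", e);
    if (e instanceof DeferredGroupException) {
      // As far as I can tell, the first failure in the group is chained as the cause.
      LOG.warn("First underlying failure in the group was", e.getCause());
    }
    return e;  // keep propagating so any retry logic downstream still sees it
  }
}

Attached with addErrback() on the group Deferred, that should surface the real exception hiding behind the DeferredGroupException.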

tsuna

Oct 28, 2012, 9:52:36 PM
to sb, async...@googlegroups.com
Sorry for the belated reply.

On Thu, Oct 25, 2012 at 3:04 PM, sb <s...@connexity.com> wrote:
> I hate to bump an old thread but I seem to be having this same issue again.

Nothing wrong with that. It's actually better than starting a new thread.

> Seems like it had gone away, perhaps our system had been more stable for a
> while.

I tried digging in my email archives, but it looks like the email you
sent to HBase's user mailing list didn't lead anywhere. You
previously reported seeing the same issue with both asynchbase and
HTable, which led me to believe that maybe the problem didn't lie in
the client. Is that still the case?

Can you share more details about your setup? How many region servers
do you have? How many regions are they hosting? How many regions do
you typically use? What sort of RPC throughput are you sending to
HBase?

From the excerpts you pasted above, it appears that the server hosting
the META region itself was hosed? Did you just lose the RS hosting
META, or more than that?

--
Benoit "tsuna" Sigoure

sb

Oct 29, 2012, 2:38:10 PM
to async...@googlegroups.com, sb
Correct, the post I made to the HBase user list did not go anywhere.  Soon after, we enjoyed a period of stability and I never followed up.

Since the last time I posted on this topic, we've been using the async client more and the (sync) HBase client less.  Generally, we are using the async client constantly, with a constant stream of input to the app.  We've been using the (sync) HBase client for more periodic things; for instance, we do our own Increment coalescing and use the (sync) HBase client to write out the Increments (we had this before the async client provided similar functionality, and we have stuck with it because we flush different kinds of rows at different times).  In the case of this last failure, it does look like both the async client and the (sync) HBase client failed to recover (as was the case previously); the (sync) HBase client was reporting org.apache.hadoop.hbase.client.RetriesExhaustedException.
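
(For reference, the built-in coalescing I'm referring to is, as I understand it, HBaseClient.bufferAtomicIncrement() in newer asynchbase releases.  A rough sketch of what using it would look like; the table, family, and qualifier names are placeholders, not our schema:)

import java.nio.charset.Charset;

import com.stumbleupon.async.Deferred;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.HBaseClient;

final class CounterBump {
  private static final Charset UTF8 = Charset.forName("UTF-8");

  /** Buffers the increment client-side; asynchbase coalesces repeated
      increments of the same cell before sending them to the RS. */
  static Deferred<Long> bump(final HBaseClient client, final String row) {
    return client.bufferAtomicIncrement(
        new AtomicIncrementRequest("counters".getBytes(UTF8),  // placeholder table
                                   row.getBytes(UTF8),
                                   "f".getBytes(UTF8),         // placeholder family
                                   "hits".getBytes(UTF8),      // placeholder qualifier
                                   1L));
  }
}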

We have 12 region servers.  Each region server hosts 40 to 50 regions, with around 525 regions total.  I'd say that we use about 500 of those regions continuously.  Most of our tables have just a few regions, but a couple of the tables have >100 regions.  I don't currently have good stats on what a single instance of my app sends through the async client alone in terms of RPC throughput.  I can tell you that our total RPC load on HBase, across all of our apps and both the async and (sync) clients, is around 75K RPCs/second.  I would think that the RPC traffic through the async client for a single instance of my app would max out in the range of 15K RPCs/second in a worst-case scenario (at our absolute busiest time).

We lost a single region server that was hosting 40 some regions.  Turns out that it was the RS that was hosting the META table but the ROOT table was on another RS that did not go down.

Interestingly, we lost another RS last night, but the clients recovered without issue and everything continued functioning as expected.  That RS was not hosting META.  We lost 40-some regions in this case too.  The load through my app in terms of RPCs/second would have been lower last night than it was for the RS crash last Thursday; we tend to be busier during daylight hours in the US.

Thanks,
Steve


tsuna

Oct 30, 2012, 3:22:49 AM
to sb, async...@googlegroups.com
On Mon, Oct 29, 2012 at 11:38 AM, sb <s...@connexity.com> wrote:
> In the case of this last failure, it does look like both
> the async hbase client and the (sync) hbase client failed to recover (as was
> the case previously), the (sync) hbase client was reporting
> org.apache.hadoop.hbase.client.RetriesExhaustedException.

OK, that seems to indicate that HBase was genuinely unhappy; however,
I'm not sure why restarting your app seems to have fixed the problem
immediately.

In what you pasted above, the error messages start at roughly
16:45 GMT, and HBase's own log says the RegionServer restarted at
09:48 PDT, i.e. 16:48 GMT. I'm going to assume that means it took
roughly 3 minutes for the RegionServer to restart; however, the
RegionServer also needs time to bring the regions back online. Yet,
10 minutes into the event, at 16:55 GMT, the client was still unhappy.
It is possible that HBase wasn't ready to serve yet; I guess it depends
on how much log splitting needed to happen and whether or not you have
distributed log splitting. However, you said the regions appeared to
be online, which is weird. But it's hard to know what the client was
complaining about without the full error message of the
DeferredGroupException.

> We lost a single region server that was hosting 40 some regions. Turns out
> that it was the RS that was hosting the META table but the ROOT table was on
> another RS that did not go down.

Losing META should be fine too; it's just more annoying because you
temporarily lose the ability to locate regions you didn't already know
about.

> Interestingly, we lost another RS last night but the clients recovered
> without issue, everything continued functioning as expected. That RS was
> not hosting META. We lost 40 some regions in this case too. The load
> through my app in terms of RPCs/second would have been lower last night than
> it was for the last RS crash last Thursday, we tend to be busier during
> daylight hours in the US.

OK, that's weird. Sorry if I'm not of much help here; I don't think we
have enough information at this point. If you could get the full
exception and stack trace showing the client complaining about a
region that you think is online, as well as the logs from the
RegionServer that brought that region back online, maybe that could
help shed some light on the actual order of events, what happened,
and what should have happened instead.

--
Benoit "tsuna" Sigoure