Replication stuck after all region servers died abruptly

Pankil Doshi

Jun 24, 2016, 7:47:11 PM

to HBase Indexer Users

Hi,

We have a setup of 4 region servers, and all of them were killed abruptly. Since they came back up, replication is no longer taking place, or it is stuck at some point. The status 'replication' command shows lag:

SizeOfLogQueue=1 and Replication Lag=5976392

Also, in the HBase region server logs I see a variety of errors:

1) Most of the time, the logs are stuck on this message:

2016-06-24 16:40:37,855 INFO [test25:60020Replication Statistics #0] regionserver.Replication: Normal source for cluster Indexer_test: Total replicated edits: 1, currently replicating from: hdfs://server/hbase/WALs/test25,60020,1466804734059/test25%2C60020%2C1466804734059.1466804736943 at position: 18469

2) On one of the region servers I see:

Recovered source for cluster/machine(s) Indexer_test: Total replicated edits: 0, currently replicating from: hdfs://server/hbase/oldWALs/test25%2C60020%2C1466722298626.1466722301438 at position: 2464501

Recovered source for cluster/machine(s) Indexer_test: Total replicated edits: 0, currently replicating from: hdfs://social-vcell-qe5/hbase/oldWALs/test25%2C60020%2C1466802467810.1466802470622 at position: 443630

3) After some time I start seeing:

regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of a local or network error:

java.io.IOException: Call to test22/ip:38377 failed on local exception: java.io.IOException: Connection reset by peer
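One way to tell a stuck source from a slow one is to compare the position reported in successive "Replication Statistics" lines for the same WAL. The following is a minimal Python sketch, assuming the log line format quoted above; the regex and the function names (parse_stats_line, stuck_wals) are illustrative and not part of HBase, and other HBase versions may format these lines differently.

```python
import re
from urllib.parse import unquote

# Pattern assumed from the log lines quoted in this post (hypothetical helper,
# not an HBase API). Matches both "Normal" and "Recovered" source lines.
STATS_RE = re.compile(
    r"(?P<kind>Normal|Recovered) source for cluster(?:/machine\(s\))? (?P<peer>\S+): "
    r"Total replicated edits: (?P<edits>\d+), "
    r"currently replicating from: (?P<wal>\S+) at position: (?P<pos>\d+)"
)


def parse_stats_line(line):
    """Extract peer, WAL path and position from one statistics line, or None."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m.group("kind"),
        "peer": m.group("peer"),
        "edits": int(m.group("edits")),
        "wal": unquote(m.group("wal")),  # decode %2C back to ","
        "position": int(m.group("pos")),
    }


def stuck_wals(lines):
    """Return WAL paths seen at least twice whose position never changed."""
    seen = {}
    for line in lines:
        rec = parse_stats_line(line)
        if rec is not None:
            seen.setdefault(rec["wal"], []).append(rec["position"])
    return sorted(w for w, p in seen.items() if len(p) > 1 and len(set(p)) == 1)
```

Feeding a region server log through stuck_wals(open(path)) lists the WAL files whose position did not move between statistics lines. An unchanged position is only a hint that the source is stuck, not proof.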

I tried a couple of things:

1) A rolling restart of the region servers, one region server at a time.

2) A restart of all 4 indexer instances. (One odd thing I see in the indexer logs: I do get some events as soon as I restart, but after a few events nothing more arrives and it stops at that point.)

Has anyone faced something similar? This looks more like an HBase replication issue.

Thanks,

Pankil
