Hi,
We have a setup of 4 region servers, and all of them got killed abruptly. Once they were back up, replication either no longer takes place or gets stuck at a point. The status 'replication' command shows lag:
SizeOfLogQueue=1 and Replication Lag=5976392
Also, in the HBase region server logs I see a variety of errors:
1) Most of the time the logs are stuck on this message:
2016-06-24 16:40:37,855 INFO [test25:60020Replication Statistics #0] regionserver.Replication: Normal source for cluster Indexer_test: Total replicated edits: 1, currently replicating from: hdfs://server/hbase/WALs/test25,60020,1466804734059/test25%2C60020%2C1466804734059.1466804736943 at position: 18469
2) On one of the region servers I see:
Recovered source for cluster/machine(s) Indexer_test: Total replicated edits: 0, currently replicating from: hdfs://server/hbase/oldWALs/test25%2C60020%2C1466722298626.1466722301438 at position: 2464501
Recovered source for cluster/machine(s) Indexer_test: Total replicated edits: 0, currently replicating from: hdfs://social-vcell-qe5/hbase/oldWALs/test25%2C60020%2C1466802467810.1466802470622 at position: 443630
3) After some time I start seeing:
regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of a local or network error:
java.io.IOException: Call to test22/ip:38377 failed on local exception: java.io.IOException: Connection reset by peer
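To tell whether a source is really stuck rather than just slow, the position reported in successive statistics lines can be compared per WAL file. Below is a rough Python sketch of that check. It assumes the log line format is exactly as in the samples quoted above; the names parse_stat and stuck_sources are made up for illustration and are not part of HBase.

```python
import re

# Matches the "Normal source" / "Recovered source" statistics lines.
# The format is assumed from the sample log lines in this post only.
STAT_RE = re.compile(
    r"(?P<kind>Normal|Recovered) source for cluster(?:/machine\(s\))? "
    r"(?P<peer>\S+): Total replicated edits: (?P<edits>\d+), "
    r"currently replicating from: (?P<wal>\S+) at position: (?P<pos>\d+)"
)


def parse_stat(line):
    """Return the fields of one statistics line, or None if it does not match."""
    m = STAT_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m.group("kind"),
        "peer": m.group("peer"),
        "edits": int(m.group("edits")),
        "wal": m.group("wal"),
        "pos": int(m.group("pos")),
    }


def stuck_sources(lines):
    """Return WAL paths seen at least twice whose position never changed."""
    positions = {}
    for line in lines:
        stat = parse_stat(line)
        if stat is not None:
            positions.setdefault(stat["wal"], []).append(stat["pos"])
    return sorted(
        wal
        for wal, seen in positions.items()
        if len(seen) > 1 and len(set(seen)) == 1
    )
```

Feeding it the lines of a region server log (for example, from open(path) on a local copy) lists the WAL files whose position did not move between samples.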
I tried a couple of things:
1) A rolling restart of the region servers, one region server at a time.
2) A restart of all 4 indexer instances. (One odd thing I see in the indexer logs is that I do get some events as soon as I restart, but nothing after the first few, and it stops at that point.)
Has anyone faced something similar? It looks more like an HBase replication issue.
Thanks,
Pankil