I suspect this happens during a region split, although I am not sure. My current repro is to telnet to an instance, run 'help', and then run 'stats'. 'help' works almost all of the time, but 'stats' always hangs, and the socket then stops responding to any other commands. Here is a gist of the logs captured during the repro:
https://gist.github.com/mxk1235/6bec63717bb09b40f82f
One of the most troubling lines in the gist is the following:
INFO [ClientCnxn.run] - EventThread shut down
Does this affect any requests that come after it? Is the thread restarted if it is needed?
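In case it helps anyone reproduce this without typing into telnet by hand, here is a rough Python sketch of the same repro. It is only an illustration: the function name `run_commands` is made up, and it assumes the instance speaks the line-based telnet-style protocol on a single TCP port. It sends each command on one connection and records `None` for any command that gets no reply within the timeout.

```python
import socket


def run_commands(host, port, commands, timeout=5.0):
    """Send each command on one connection.

    Returns a dict mapping command -> reply text, or None if no reply
    arrived within `timeout` seconds (i.e. the command appears to hang).
    """
    results = {}
    with socket.create_connection((host, port), timeout=timeout) as sock:
        for cmd in commands:
            # Commands are newline-terminated, as when typed into telnet.
            sock.sendall((cmd + "\n").encode("ascii"))
            try:
                # A single recv may return only part of a long reply;
                # that is enough to tell "answered" from "hung".
                data = sock.recv(65536)
            except socket.timeout:
                results[cmd] = None
                continue
            results[cmd] = data.decode("utf-8", errors="replace")
    return results
```

Calling something like `run_commands("localhost", 4242, ["help", "stats"])` (host and port are placeholders for your own instance) should show a reply for 'help' and `None` for 'stats' if the hang reproduces.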
Here is how it happened. A lot of metrics got piped into OpenTSDB via the socket interface at the same time, and OpenTSDB appeared to hang: /api/version wasn't responding and metrics were not being recorded. Some messages from this group indicated that this may happen during compaction, so we turned compaction off and restarted. /api/version returned 200 for up to an hour, but then the problems started again. Metrics are not recorded at all, and the UI is not responsive either. We think the region is in the middle of a split, or a split never completed, and OpenTSDB is having trouble simply talking to HBase.
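To pin down when /api/version stops answering, so the time can be lined up against the HBase logs, a small probe along these lines could be run in a loop. This is a sketch under assumptions: `check_version` is a hypothetical helper, and the base URL is whatever address your instance listens on.

```python
import urllib.error
import urllib.request


def check_version(base_url, timeout=5.0):
    """Return the HTTP status of GET <base_url>/api/version.

    Returns None if the request times out or the connection fails,
    which is how a hung instance would show up.
    """
    url = base_url.rstrip("/") + "/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except OSError:
        # Covers URLError, connection refused, and socket timeouts.
        return None
```

Logging the result with a timestamp every few seconds, for example `check_version("http://localhost:4242")` with your own host and port, would give the moment the status changes from 200 to `None`.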
Do you have any advice on how to run diagnostics or repairs on the HBase side?
Any help is greatly appreciated. Thanks in advance.
-mike