Hello all,
We run OpenTSDB (2.3) at my company with workloads of ~250k new datapoints per second and over 2000 queries per minute on a fleet of read-only OpenTSDB instances. Salting is enabled to spread load more evenly across our HBase cluster, which runs on d2.xl instances.
Overall we’re quite satisfied with OpenTSDB and appreciate the efforts put towards its development!
Our biggest concern at the moment is service performance degradation caused by queries on high-cardinality metric paths run over wide time spans (weeks or months). From a backend perspective, such queries may return one row for every million rows scanned.
We are aware that these issues can be addressed by de-normalizing the metrics, doing rollups, or using features in OpenTSDB 2.4 which replace those scans with gets. In the meantime, however, we need some way to protect the service before offending metric paths are detected (e.g. by a heavy query).
An interesting observation was that when these queries happen, OpenTSDB sends RPCs to HBase that can take an unbounded amount of time to return. IIRC, by default a minimum of 128 rows is required before the OpenTSDB scanner callback is even called.
So if the query is very selective about which rows it needs (e.g. by filtering on the highest-cardinality tag), a single scan RPC could take many seconds or even minutes, keeping an HBase RPC Handler busy the whole time. Running a few of these simultaneously can immediately affect all request handling in HBase (we don’t split HBase RPC Queues between write and read requests), bringing the service to a halt.
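To put rough numbers on this, here is a back-of-the-envelope sketch. The 128-row batch and the one-in-a-million selectivity come from the description above; the scan rate of 500,000 rows per second is purely an assumed figure for illustration, not a measurement from our cluster, and the function names are made up for this example.

```python
def rows_scanned_per_rpc(min_rows_returned: int, rows_scanned_per_match: int) -> int:
    """Rows the region server must read before one batch can be returned."""
    return min_rows_returned * rows_scanned_per_match


def rpc_seconds(min_rows_returned: int, rows_scanned_per_match: int,
                scan_rate_rows_per_sec: float) -> float:
    """Estimated time a single scan RPC keeps a handler busy."""
    scanned = rows_scanned_per_rpc(min_rows_returned, rows_scanned_per_match)
    return scanned / scan_rate_rows_per_sec


BATCH_ROWS = 128                  # rows needed before the scanner callback fires (IIRC)
ROWS_PER_MATCH = 1_000_000        # one row returned per million scanned
ASSUMED_SCAN_RATE = 500_000       # rows/sec, hypothetical value for illustration only

if __name__ == "__main__":
    scanned = rows_scanned_per_rpc(BATCH_ROWS, ROWS_PER_MATCH)
    seconds = rpc_seconds(BATCH_ROWS, ROWS_PER_MATCH, ASSUMED_SCAN_RATE)
    print(f"rows scanned per RPC: {scanned:,}")
    print(f"estimated handler busy time: {seconds:.0f} s")
```

Under that assumed scan rate, one RPC would have to read 128 million rows and would occupy a handler for roughly four minutes, which is in line with what we observe qualitatively.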
As per https://issues.apache.org/jira/browse/HBASE-13090, there is a way to limit long-running scanners, but our attempts to utilize it have been unsuccessful so far. We suspect the Async HBase client sends RPCs in a way that disables this functionality (e.g. heartbeats are disabled).
1. How can a limit be enforced on the running time of a scan RPC so that HBase RPC Handler queues do not get congested?
2. What’s the suggested approach to limit the negative effect of such workloads, i.e. queries on high-cardinality metrics for wide time spans, with a very selective filter?
Thanks!
Stan