Wow, thanks for that detail! The timings do look about right. Some other things to check:
* See what the cache hit-rate is in HBase for the reads
* See if the UIDs have been cached for the queries in OpenTSDB. E.g. between benchmark runs, are you restarting the TSD VMs or letting them run? After a restart, a lot of latency is introduced by the lazy initialization of HBase connections and by the UID lookups.
* Enable query summaries to see where most of the time is spent.
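For the query summary, a minimal sketch of a request body, assuming OpenTSDB 2.2 or later where (as far as I recall) the `/api/query` endpoint accepts a `showSummary` flag. The metric names and time range are placeholders, and `build_query_body` is just a hypothetical helper for illustration:

```python
import json


def build_query_body(metrics, start="1h-ago", aggregator="sum"):
    """Build a JSON body for POST /api/query with timing summaries enabled.

    One sub-query is emitted per metric, mirroring how the TSD fans
    queries out per metric.
    """
    return {
        "start": start,
        "queries": [{"metric": m, "aggregator": aggregator} for m in metrics],
        # Assumption: `showSummary` is the flag name in OpenTSDB 2.2+.
        "showSummary": True,
    }


# Placeholder metric names; substitute the ones from your benchmark.
body = build_query_body(["sys.cpu.user", "sys.cpu.sys"])
print(json.dumps(body, indent=2))
```

POST the printed JSON to `/api/query` on one of your TSDs and compare the summary timings on a freshly restarted TSD against a warm one.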
I need some good docs on this. There is a big bottleneck that we need to address: for each region server, an asynchronous client is created, and it listens for responses from that server on a single thread. The TSD parallelizes queries on a per-metric basis (and per salt if salting is enabled, which it likely isn't in your case). So if you have 150 metrics, the requests are all sent to the region server at the same time, asynchronously. However, the responses are processed essentially synchronously: the region server keeps writing responses to a single network queue, and the AsyncHBase client processes that queue and hands results to the TSD one RPC response at a time. I'd like to add a dynamic client pool to AsyncHBase to help with this. We'll also likely enable use of the native HBase client for queries, as it uses multiple threads for reads.
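To make the bottleneck concrete, here is a toy model, not AsyncHBase code: every name in it is hypothetical, and the `sleep` is only a stand-in for the cost of handling one RPC response. It compares draining the same batch of responses with one reader thread versus a small pool:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_response(metric_id, cost_s=0.01):
    """Stand-in for handling one RPC response; the sleep models the work."""
    time.sleep(cost_s)
    return metric_id


def drain(num_responses, reader_threads):
    """Handle all responses with the given number of reader threads.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=reader_threads) as pool:
        handled = list(pool.map(handle_response, range(num_responses)))
    assert len(handled) == num_responses
    return time.perf_counter() - start


# All requests are already "in flight"; only the response side differs.
single = drain(20, reader_threads=1)  # one reader thread per region server
pooled = drain(20, reader_threads=4)  # hypothetical pool of four readers
print(f"single reader: {single:.3f}s, pooled readers: {pooled:.3f}s")
```

With a single reader the total time grows linearly with the number of metrics, no matter how quickly the requests were sent, which is the behavior a client pool would be meant to relieve.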