Read path in TSDB and optimization for queries


Avind

Mar 30, 2018, 12:13:52 AM
to OpenTSDB
This is our setup:
  • Cloud-based PaaS HBase (3 region nodes), ~50 regions each.
  • The TSDB table has been pre-split (150 splits across all available regions) and we have uniform writes with no hot spots (random metric UID generation enabled, no salting).
  • TSDs running on independent VMs behind a load balancer (reads only go through this path):
    • 2-core, 8 GB RAM machines with 4-5 GB allocated to the TSD JVM.
We have been trying to measure read performance, and this is a summary of our observations:
  • All testing was done directly against the TSDs with scripts running curl.
  • All metrics are unique, with no significant tag cardinality (< 2).
  • Data points are at 1-minute frequency for all metrics.
  • Data has been loaded for 2 years for all metrics.
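For context, a query of the kind issued in these tests might look like the following sketch. The metric name, TSD host, and tag layout are hypothetical; only the time range and downsample interval mirror the test parameters.

```python
import json

# Hypothetical example of the kind of query POSTed via curl in these tests:
# one metric, a 1-day window, downsampled to 60-minute averages.
# The metric name below is made up for illustration.
query = {
    "start": "1d-ago",
    "queries": [
        {
            "metric": "sys.test.metric.001",  # hypothetical metric name
            "aggregator": "sum",
            "downsample": "60m-avg",          # 60-minute downsample rate
        }
    ],
}

# Equivalent curl invocation against a hypothetical TSD behind the LB:
#   curl -s -X POST http://tsd-lb:4242/api/query -d '<payload above>'
print(json.dumps(query))
```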
Timing observations are also attached in case the table below breaks on posting.


Test Case   # Metrics   # TSD x # Metrics/query   Query Duration   Downsample (mins)   Avg RespTime (ms)   Max RespTime (ms)
Test#1      1           1x1                       1 day            60                  120                 450
Test#2      5           1x5                       1 day            60                  119                 405
Test#3      10          2x5                       1 day            60                  135                 1001
Test#4      20          4x5                       1 day            60                  170                 1675
Test#5      30          4x8                       1 day            60                  193                 1337
Test#6      90          12x8                      1 day            60                  257                 4010
Test#7      150         19x8                      1 day            60                  308                 7720

Test#8      1           1x1                       7 days           480                 522                 2688
Test#9      5           1x5                       7 days           480                 608                 1334
Test#10     10          2x5                       7 days           480                 646                 3067
Test#11     20          2x10                      7 days           480                 1135                1687
Test#12     20          4x5                       7 days           480                 730                 4350
Test#13     30          4x8                       7 days           480                 971                 3751
Test#14     90          12x8                      7 days           480                 1125                10823
Test#15     150         19x8                      7 days           480                 1825                27955

Test#16     1           1x1                       14 days          720                 808                 1237
Test#17     5           1x5                       14 days          720                 1362                2569
Test#18     10          1x10                      14 days          720                 1664                3437
Test#19     10          2x5                       14 days          720                 1147                2655
Test#20     20          4x5                       14 days          720                 1156                3180
Test#21     30          4x8                       14 days          720                 1641                5107
Test#22     90          12x8                      14 days          720                 2240                18999
Test#23     150         19x8                      14 days          720                 3587                30898

Test#24     1           1x1                       30 days          1440                1919                2913
Test#25     5           1x5                       30 days          1440                2694                3651
Test#26     10          2x5                       30 days          1440                2867                4681
Test#27     20          4x5                       30 days          1440                2939                6056
Test#28     30          4x8                       30 days          1440                3413                4170
Test#29     50          4x12                      30 days          1440                5480                6724
Test#30     90          12x8                      30 days          1440                5445                12840
Test#31     150         19x8                      30 days          1440                6978                13045


Questions
  • Is this reasonable performance for the setup we have, and can we do any better?
  • What is the read pattern for a query with multiple metrics? (We did not find any good doc explaining how the query is actually executed across the TSD and HBase.)
    • i.e., is there any parallelism in reads from HBase?
  • Note: the max response time is usually the 1st read.
  • All averages are over a 100-cycle run for each query.
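The averaging described above (avg and max over a 100-cycle run per query) can be sketched as follows. The `run_query` callable is a stand-in for the actual curl call against the TSD; everything here is illustrative, not the script used in the tests.

```python
import time
import statistics


def bench(run_query, cycles=100):
    """Time run_query over `cycles` runs; return (avg_ms, max_ms).

    run_query is a stand-in for the curl call made in the tests.
    """
    timings_ms = []
    for _ in range(cycles):
        t0 = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(timings_ms), max(timings_ms)


# Example with a stubbed "query" (a real run would POST to the TSD):
avg_ms, max_ms = bench(lambda: time.sleep(0.001), cycles=10)
```

The max of such a run is dominated by the first, cold-cache request, which matches the note above that the max response time is usually the first read.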
read test observations.JPG

ManOLamancha

May 22, 2018, 2:23:21 PM
to OpenTSDB

On Thursday, March 29, 2018 at 9:13:52 PM UTC-7, Avind wrote:

Questions
  • Is this reasonable performance for the setup we have, and can we do any better?

Wow, thanks for that detail! The timings do look about right. Some other things to check:
* Check the block cache hit rate in HBase for the reads.
* Check whether the UIDs have been cached in OpenTSDB for the queries. E.g., between each benchmark, are you restarting the TSD VMs or letting them run? When restarting, there is a lot of latency introduced by the lazy initialization of HBase connections and the UID lookups.
* Enable query summaries to see where most of the time is spent.
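As a sketch, OpenTSDB 2.2+ accepts a `showSummary` flag on `/api/query` that appends timing stats to the response; the exact fields returned vary by version, and the metric name below is hypothetical.

```python
import json

# Sketch of requesting per-query timing stats. OpenTSDB 2.2+ supports
# a "showSummary" flag on /api/query; the summary fields in the response
# (scan time, serialization time, etc.) vary by version.
query = {
    "start": "7d-ago",
    "showSummary": True,  # ask the TSD to append a timing summary
    "queries": [
        {
            "metric": "sys.test.metric.001",  # hypothetical metric name
            "aggregator": "sum",
            "downsample": "480m-avg",
        }
    ],
}
print(json.dumps(query))
```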
  • What is the read pattern for a query with multiple metrics? (We did not find any good doc explaining how the query is actually executed across the TSD and HBase.)
    • i.e., is there any parallelism in reads from HBase?
I need to write some good docs on this. There is a big bottleneck that we need to address: for each region server, an asynchronous client is created that listens for responses from that server on a single thread. The TSD will parallelize queries on a per-metric basis (and per salt if salting is enabled; in your case it isn't). So if you have 150 metrics, the scans will be sent to the region servers at the same time, asynchronously. However, the responses are processed essentially synchronously: the region server keeps sending responses to a single network queue that the AsyncHBase client drains, handing RPC responses to the TSD one at a time. I'd like to add a dynamic client pool to AsyncHBase to help with this. We'll also likely enable use of the native HBase client for queries, as it uses multiple threads for reads.
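The shape of that bottleneck can be illustrated with a toy model (this is not OpenTSDB or AsyncHBase code): per-metric scans are dispatched in parallel, but every response funnels through one queue drained by a single consumer, so response processing is effectively serialized.

```python
import queue
import threading
import time

# Toy model of the read path described above: parallel dispatch of
# per-metric scans, single-threaded processing of the responses.
responses = queue.Queue()  # stands in for the one network queue per client


def scan(metric):
    """Pretend region-server scan; all scans share one response queue."""
    time.sleep(0.01)
    responses.put(f"rows for {metric}")


metrics = [f"metric.{i}" for i in range(5)]
workers = [threading.Thread(target=scan, args=(m,)) for m in metrics]
for w in workers:
    w.start()  # all scans dispatched "asynchronously"

# A single consumer drains the shared queue one response at a time,
# which is where the serialization happens.
results = [responses.get() for _ in metrics]
for w in workers:
    w.join()
```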
  • Note: the max response time is usually the 1st read.
That would indicate you're populating the block cache in HBase. Or, if you're restarting the TSD, it's the time to connect to HBase, find the regions, and start the query.