OpenTSDB Latency


james taylor

Mar 25, 2018, 11:37:19 PM
to OpenTSDB
Hey, Guys

We just set up OpenTSDB on top of an AWS EMR HBase cluster, with Grafana querying it.

We are noticing that dashboard queries take minutes to load.

Looking for general guidance on which OpenTSDB or HBase metrics to use to troubleshoot the issue.

Key observations:

* We see an error related to the TagvFilter plugin not being found
* We see a high cache eviction rate -- wondering if this indicates the cache is too small, or whether it's just data expiring via TTL
* We see that tsd.hbase.latency from the 50th to the 95th percentile for scans is extremely high, at 2147483647 ms
* When we run a test query from the OpenTSDB GUI, it streams a steady flow of WARN messages about duplicate timestamps for as long as the query runs

Key configurations:
* We use S3 for storing HBase data
* The master node runs a tsd instance in (rw) mode
* Each data node runs one tsd instance (ro); we use an ELB to load-balance across the 7 data nodes, and the ELB endpoint is configured in Grafana
* We only store 3 days' worth of data, or about 300,000,000 data points

Configuration and Stats below:


tsd.jvm.thread.states 1522033556 68 state=runnable host=ip-10-146-220-77
tsd.jvm.thread.states 1522033556 0 state=blocked host=ip-10-146-220-77
tsd.jvm.thread.states 1522033556 19 state=waiting host=ip-10-146-220-77
tsd.jvm.thread.states 1522033556 0 state=terminated host=ip-10-146-220-77
tsd.jvm.thread.states 1522033556 2 state=timed_waiting host=ip-10-146-220-77
tsd.jvm.thread.count 1522033556 89 host=ip-10-146-220-77
tsd.uid.cache-hit 1522033556 81 kind=metrics host=ip-10-146-220-77
tsd.uid.cache-miss 1522033556 1 kind=metrics host=ip-10-146-220-77
tsd.uid.cache-size 1522033556 14 kind=metrics host=ip-10-146-220-77
tsd.uid.random-collisions 1522033556 0 kind=metrics host=ip-10-146-220-77
tsd.uid.rejected-assignments 1522033556 0 kind=metrics host=ip-10-146-220-77
tsd.uid.ids-used 1522033556 397 kind=metrics host=ip-10-146-220-77
tsd.uid.ids-available 1522033556 16776818 kind=metrics host=ip-10-146-220-77
tsd.uid.cache-hit 1522033556 340 kind=tagk host=ip-10-146-220-77
tsd.uid.cache-miss 1522033556 5 kind=tagk host=ip-10-146-220-77
tsd.uid.cache-size 1522033556 12 kind=tagk host=ip-10-146-220-77
tsd.uid.random-collisions 1522033556 0 kind=tagk host=ip-10-146-220-77
tsd.uid.rejected-assignments 1522033556 0 kind=tagk host=ip-10-146-220-77
tsd.uid.ids-used 1522033556 16 kind=tagk host=ip-10-146-220-77
tsd.uid.ids-available 1522033556 16777199 kind=tagk host=ip-10-146-220-77
tsd.uid.cache-hit 1522033556 102 kind=tagv host=ip-10-146-220-77
tsd.uid.cache-miss 1522033556 3 kind=tagv host=ip-10-146-220-77
tsd.uid.cache-size 1522033556 6 kind=tagv host=ip-10-146-220-77
tsd.uid.random-collisions 1522033556 0 kind=tagv host=ip-10-146-220-77
tsd.uid.rejected-assignments 1522033556 0 kind=tagv host=ip-10-146-220-77
tsd.uid.ids-used 1522033556 680 kind=tagv host=ip-10-146-220-77
tsd.uid.ids-available 1522033556 16776535 kind=tagv host=ip-10-146-220-77
tsd.uid.filter.rejected 1522033556 0 kind=raw host=ip-10-146-220-77
tsd.uid.filter.rejected 1522033556 0 kind=aggregate host=ip-10-146-220-77
tsd.jvm.ramfree 1522033556 1234051792 host=ip-10-146-220-77
tsd.jvm.ramused 1522033556 2963275776 host=ip-10-146-220-77
tsd.hbase.latency_50pct 1522033556 0 method=put host=ip-10-146-220-77 class=IncomingDataPoints
tsd.hbase.latency_75pct 1522033556 0 method=put host=ip-10-146-220-77 class=IncomingDataPoints
tsd.hbase.latency_90pct 1522033556 0 method=put host=ip-10-146-220-77 class=IncomingDataPoints
tsd.hbase.latency_95pct 1522033556 0 method=put host=ip-10-146-220-77 class=IncomingDataPoints
tsd.datapoints.added 1522033556 0 type=all host=ip-10-146-220-77 class=TSDB
tsd.hbase.latency_50pct 1522033556 2147483647 method=scan host=ip-10-146-220-77 class=TsdbQuery
tsd.hbase.latency_75pct 1522033556 2147483647 method=scan host=ip-10-146-220-77 class=TsdbQuery
tsd.hbase.latency_90pct 1522033556 2147483647 method=scan host=ip-10-146-220-77 class=TsdbQuery
tsd.hbase.latency_95pct 1522033556 2147483647 method=scan host=ip-10-146-220-77 class=TsdbQuery
tsd.hbase.root_lookups 1522033556 0 host=ip-10-146-220-77
tsd.hbase.meta_lookups 1522033556 58 type=uncontended host=ip-10-146-220-77
tsd.hbase.meta_lookups 1522033556 0 type=contended host=ip-10-146-220-77
tsd.hbase.rpcs 1522033556 0 type=increment host=ip-10-146-220-77
tsd.hbase.rpcs 1522033556 0 type=delete host=ip-10-146-220-77
tsd.hbase.rpcs 1522033556 13 type=get host=ip-10-146-220-77
tsd.hbase.rpcs 1522033556 0 type=put host=ip-10-146-220-77
tsd.hbase.rpcs 1522033556 0 type=append host=ip-10-146-220-77
tsd.hbase.rpcs 1522033556 0 type=rowLock host=ip-10-146-220-77
tsd.hbase.rpcs 1522033556 221 type=openScanner host=ip-10-146-220-77
tsd.hbase.rpcs 1522033556 16682 type=scan host=ip-10-146-220-77
tsd.hbase.rpcs.batched 1522033556 0 host=ip-10-146-220-77
tsd.hbase.flushes 1522033556 0 host=ip-10-146-220-77
tsd.hbase.connections.created 1522033556 48 host=ip-10-146-220-77
tsd.hbase.connections.idle_closed 1522033556 41 host=ip-10-146-220-77
tsd.hbase.nsre 1522033556 0 host=ip-10-146-220-77
tsd.hbase.nsre.rpcs_delayed 1522033556 0 host=ip-10-146-220-77
tsd.hbase.region_clients.open 1522033556 7 host=ip-10-146-220-77
tsd.hbase.region_clients.idle_closed 1522033556 41 host=ip-10-146-220-77
tsd.compaction.count 1522033556 2019369 host=ip-10-146-220-77
tsd.compaction.duplicates 1522033556 0 type=identical host=ip-10-146-220-77
tsd.compaction.duplicates 1522033556 18103298 type=variant host=ip-10-146-220-77
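The stats above follow OpenTSDB's telnet-style format (`metric timestamp value tag=value ...`). As an illustrative sketch (not part of the original post), a few lines of Python can parse that output and flag the saturated scan latencies; 2147483647 is Java's `Integer.MAX_VALUE`, which suggests the latency counter overflowed rather than measuring a real duration:

```python
def parse_stat_line(line):
    """Parse one OpenTSDB telnet-style stats line into (metric, ts, value, tags)."""
    parts = line.split()
    metric, ts, value = parts[0], int(parts[1]), int(parts[2])
    tags = dict(p.split("=", 1) for p in parts[3:])
    return metric, ts, value, tags

INT_MAX = 2147483647  # a latency stuck at this value means the counter saturated

def suspicious_scan_latencies(lines):
    """Return the names of scan-latency metrics pinned at Integer.MAX_VALUE."""
    flagged = []
    for line in lines:
        metric, _ts, value, tags = parse_stat_line(line)
        if (metric.startswith("tsd.hbase.latency")
                and tags.get("method") == "scan"
                and value >= INT_MAX):
            flagged.append(metric)
    return flagged

sample = [
    "tsd.hbase.latency_50pct 1522033556 2147483647 method=scan host=ip-10-146-220-77 class=TsdbQuery",
    "tsd.hbase.latency_50pct 1522033556 0 method=put host=ip-10-146-220-77 class=IncomingDataPoints",
]
print(suspicious_scan_latencies(sample))  # → ['tsd.hbase.latency_50pct']
```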

Current configuration:

"tsd.core.auto_create_metrics":"true",
"tsd.core.auto_create_tagks":"true",
"tsd.core.auto_create_tagvs":"true",
"tsd.core.connections.limit":"0",
"tsd.core.enable_api":"true",
"tsd.core.enable_ui":"true",
"tsd.core.meta.cache.enable":"false",
"tsd.core.meta.enable_realtime_ts":"false",
"tsd.core.meta.enable_realtime_uid":"false",
"tsd.core.meta.enable_tsuid_incrementing":"false",
"tsd.core.meta.enable_tsuid_tracking":"false",
"tsd.core.plugin_path":"/usr/share/opentsdb/plugins",
"tsd.core.preload_uid_cache":"false",
"tsd.core.preload_uid_cache.max_entries":"300000",
"tsd.core.socket.timeout":"0",
"tsd.core.stats_with_port":"false",
"tsd.core.storage_exception_handler.enable":"false",
"tsd.core.tree.enable_processing":"false",
"tsd.core.uid.random_metrics":"false",
"tsd.http.cachedir":"/tmp/opentsdb",
"tsd.http.query.allow_delete":"false",
"tsd.http.request.cors_domains":"",
"tsd.http.request.cors_headers":"Authorization, Content-Type, Accept, Origin, User-Agent, DNT, Cache-Control, X-Mx-ReqToken, Keep-Alive, X-Requested-With, If-Modified-Since",
"tsd.http.request.enable_chunked":"false",
"tsd.http.request.max_chunk":"4096",
"tsd.http.show_stack_trace":"true",
"tsd.http.staticroot":"/usr/share/opentsdb/static/",
"tsd.mode":"ro",
"tsd.network.async_io":"true",
"tsd.network.bind":"0.0.0.0",
"tsd.network.keep_alive":"true",
"tsd.network.port":"hidden",
"tsd.network.reuse_address":"true",
"tsd.network.tcp_no_delay":"true",
"tsd.network.worker_threads":"",
"tsd.no_diediedie":"false",
"tsd.query.allow_simultaneous_duplicates":"true",
"tsd.query.enable_fuzzy_filter":"true",
"tsd.query.filter.expansion_limit":"4096",
"tsd.query.skip_unresolved_tagvs":"false",
"tsd.query.timeout":"0",
"tsd.rtpublisher.enable":"false",
"tsd.rtpublisher.plugin":"",
"tsd.search.enable":"false",
"tsd.search.plugin":"",
"tsd.startup.enable":"false",
"tsd.startup.plugin":"",
"tsd.stats.canonical":"false",
"tsd.storage.compaction.flush_interval":"10",
"tsd.storage.compaction.flush_speed":"2",
"tsd.storage.compaction.max_concurrent_flushes":"10000",
"tsd.storage.compaction.min_flush_threshold":"100",
"tsd.storage.enable_appends":"false",
"tsd.storage.enable_compaction":"false",
"tsd.storage.fix_duplicates":"true",
"tsd.storage.flush_interval":"1000",
"tsd.storage.hbase.data_table":"tsdb",
"tsd.storage.hbase.meta_table":"tsdb-meta",
"tsd.storage.hbase.prefetch_meta":"false",
"tsd.storage.hbase.scanner.maxNumRows":"128",
"tsd.storage.hbase.tree_table":"tsdb-tree",
"tsd.storage.hbase.uid_table":"tsdb-uid",
"tsd.storage.hbase.zk_basedir":"/hbase",
"tsd.storage.hbase.zk_quorum":"ip-<ip address>.ec2.internal",
"tsd.storage.repair_appends":"false",
"tsd.storage.salt.buckets":"7",
"tsd.storage.salt.width":"1",
"tsd.timeseriesfilter.enable":"false",
"tsd.uidfilter.enable":"false"}

Any general guidance for troubleshooting or performance tuning would be appreciated.

Thanks

ManOLamancha

May 22, 2018, 2:13:14 PM
to OpenTSDB
On Sunday, March 25, 2018 at 8:37:19 PM UTC-7, james taylor wrote:
Hey, Guys

We just set up OpenTSDB on top of an AWS EMR HBase cluster, with Grafana querying it.

We are noticing that dashboard queries take minutes to load.

Looking for general guidance on which OpenTSDB or HBase metrics to use to troubleshoot the issue.

Key observations:

* We see an error related to the TagvFilter plugin not being found

This is safe to ignore; it should just be a debug message. All it means is that no extra filter plugins are loaded.
 
* We see a high cache eviction rate -- wondering if this indicates the cache is too small, or whether it's just data expiring via TTL

Yeah, that'll have a large impact if the block cache isn't sized correctly in HBase. You generally want your hit rate in the 60 to 80% range for a good experience.
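To make the 60-80% target concrete, the hit rate is just hits divided by total reads. A minimal sketch (the JMX endpoint and field names in the comment are assumptions that may vary by HBase version):

```python
def block_cache_hit_rate(hits, misses):
    """Block-cache hit rate as a percentage; None if there have been no reads."""
    total = hits + misses
    return None if total == 0 else 100.0 * hits / total

# Example: counters pulled from a RegionServer's metrics endpoint
# (e.g. http://regionserver:16030/jmx, blockCacheHitCount /
# blockCacheMissCount -- exact names depend on the HBase version).
rate = block_cache_hit_rate(720_000, 280_000)
print(f"{rate:.1f}%")  # → 72.0% -- within the healthy 60-80% range
```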
 
* We see that tsd.hbase.latency from the 50th to the 95th percentile for scans is extremely high, at 2147483647 ms

That's definitely abnormal and points to data issues (below) and possibly disk-access delays.
 
* When we run a test query from the OpenTSDB GUI, it streams a steady flow of WARN messages about duplicate timestamps for as long as the query runs

This could be a big part of the issue. If you have a lot of duplicate values in each row, you're pulling a lot of data out of HBase just to squash it on the TSD side. You might want to check your pipeline to reduce the dupes.
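One way to cut down the dupes is to deduplicate in the ingestion pipeline before points reach the TSDs, keeping only the last value seen per (metric, timestamp, tags) key. A minimal sketch; the function name and data shape are illustrative, not an OpenTSDB API:

```python
def dedupe_datapoints(points):
    """Keep only the last value for each (metric, timestamp, tags) key,
    preserving the first-seen order of the keys."""
    latest = {}
    for metric, ts, value, tags in points:
        key = (metric, ts, tuple(sorted(tags.items())))
        latest[key] = (metric, ts, value, tags)  # later points overwrite earlier
    return list(latest.values())

points = [
    ("sys.cpu.user", 1522033556, 10.0, {"host": "web01"}),
    ("sys.cpu.user", 1522033556, 12.5, {"host": "web01"}),  # duplicate timestamp
    ("sys.cpu.user", 1522033557, 11.0, {"host": "web01"}),
]
print(dedupe_datapoints(points))
# two points survive; the 12.5 value wins for timestamp 1522033556
```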
 
Key configurations:
* We use S3 for storing HBase data

S3 is very slow for querying and can *definitely* account for those massive scan times. So you want to tune the HBase servers so that most of the data stays in the block cache and only historical queries hit S3. There's a cache-on-write feature in HBase, so try bumping up the block cache, moving it off-heap, and enabling cache-on-write.
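As a sketch of those cache settings, they would go in `hbase-site.xml` on the RegionServers; the property names are from stock HBase, but the sizes here are placeholder values you would need to tune for your hardware:

```xml
<!-- hbase-site.xml (RegionServer): illustrative values, tune for your nodes -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value> <!-- fraction of heap for the on-heap (L1) block cache -->
</property>
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value> <!-- put the L2 bucket cache off-heap -->
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>8192</value> <!-- off-heap cache size in MB -->
</property>
<property>
  <name>hbase.rs.cacheblocksonwrite</name>
  <value>true</value> <!-- cache blocks as they are written (cache-on-write) -->
</property>
```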

* The master node runs a tsd instance in (rw) mode
* Each data node runs one tsd instance (ro); we use an ELB to load-balance across the 7 data nodes, and the ELB endpoint is configured in Grafana
* We only store 3 days' worth of data, or about 300,000,000 data points

I also finally started a tuning guide at http://opentsdb.net/docs/build/html/user_guide/tuning.html; I'll add this info in there.

AND if anyone else has some good pointers on tuning, please open PRs :) 