OpenTSDB Query Performance Problems

Haifeng Ding

May 17, 2013, 12:00:14 AM
to open...@googlegroups.com
Hi, all.

I have been exercising OpenTSDB with a large production data set and am seeing very slow queries.

Here is an example query over roughly 12 hours of data. It took 89 seconds to execute, which is hardly acceptable for either interactive analysis during troubleshooting or building monitoring dashboards.

Query stats:
8287176 points retrieved, 38111 points plotted in 89273ms.
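
As a quick sanity check, the stats line above implies the following end-to-end rates. This is only arithmetic on the reported figures; the variable names are illustrative, and the retrieved-to-plotted ratio is just a rough indicator of how much the downsampling and aggregation shrink the data.

```python
# Figures copied from the "Query stats" line; nothing else is assumed.
points_retrieved = 8_287_176
points_plotted = 38_111
elapsed_ms = 89_273

elapsed_s = elapsed_ms / 1000.0

# End-to-end rate at which raw data points were processed.
retrieved_per_s = points_retrieved / elapsed_s

# How many raw points were folded into each plotted point.
reduction = points_retrieved / points_plotted

print(f"{retrieved_per_s:,.0f} points retrieved per second")
print(f"about {reduction:.0f} retrieved points per plotted point")
```

That works out to a little under 93k retrieved points per second for the whole request.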

Query logs:
2013-05-17 11:13:51,470 INFO  [New I/O worker #3] TsdbQuery: TsdbQuery(start_time=1368676800, end_time=1368718200, metric=[0, 0, 3] (test.metric), tags={}, rate=false, aggregator=sum, group_bys=()) matched 557462 rows in 68253 spans
2013-05-17 11:15:09,772 INFO  [Gnuplot #7] Plot: Wrote Gnuplot script to /home/data1/build/tmp/tsd/bd6214b7.gnuplot
2013-05-17 11:15:09,857 INFO  [Gnuplot #7] HttpQuery: [id: 0xa893a6c0, /172.21.206.53:55282 => /10.42.230.49:8402] HTTP /q?start=2013/05/16-12:00:00&end=2013/05/16-23:30:00&m=sum:10m-avg:test.metric&o=&yrange=%5B0:%5D&wxh=1328x484&json done in 89273ms
2013-05-17 11:15:09,882 INFO  [New I/O worker #3] HttpQuery: [id: 0xa893a6c0, /172.21.206.53:55282 => /10.42.230.49:8402] HTTP /q?start=2013/05/16-12:00:00&end=2013/05/16-23:30:00&m=sum:10m-avg:test.metric&o=&yrange=%5B0:%5D&wxh=1328x484&png done in 1ms
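
The log timestamps above also allow a rough breakdown of where the 89 seconds went. The sketch below rests on two assumptions that the logs themselves do not confirm: that "done in 89273ms" is measured from the arrival of the request, and that the TsdbQuery line is logged once the matching rows have been read. The phase names are illustrative only.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S,%f"
ONE_MS = timedelta(milliseconds=1)

def parse(ts: str) -> datetime:
    """Parse a timestamp in the format used by the TSD log lines."""
    return datetime.strptime(ts, FMT)

# Timestamps copied from the three log lines of the slow (json) request.
rows_matched = parse("2013-05-17 11:13:51,470")
script_written = parse("2013-05-17 11:15:09,772")
query_done = parse("2013-05-17 11:15:09,857")
total_ms = 89273

# Assumption: "done in 89273ms" counts from when the request arrived.
query_start = query_done - timedelta(milliseconds=total_ms)

# Assumption: the TsdbQuery line marks the end of the row-matching phase.
until_rows_matched_ms = (rows_matched - query_start) // ONE_MS
until_script_written_ms = (script_written - rows_matched) // ONE_MS
until_done_ms = (query_done - script_written) // ONE_MS

print(until_rows_matched_ms, until_script_written_ms, until_done_ms)
```

Under those assumptions, roughly 10.9 s elapse before the rows are matched and roughly 78.3 s pass between that point and the Gnuplot script being written.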

I also profiled the CPU usage of the query process several times. HBase appeared to respond fast enough, while most of the time was spent generating Gnuplot scripts or ASCII text output. FYI, I have attached a screenshot of the profiling result, which shows that methods in SpanGroup$SGIterator are the main hot spots.

My OpenTSDB setup:
1. The query and push TSD daemons are deployed on two separate machines.
2. HBase runs in cluster mode on 10+ nodes.
3. The test metric above contains 60k-70k distinct time series, i.e. tagk-tagv combinations.

My questions are:
1. Are the query performance and profiling results above reasonable?
2. Are there any suggestions or best practices for improving OpenTSDB query performance? For example, is it feasible to get the above query down to 10 seconds?

If any information is missing here, please let me know. Thanks!

--
Ding Haifeng

opentsdb_query_profiling.png

tsuna

May 20, 2013, 4:04:18 AM
to Haifeng Ding, open...@googlegroups.com
Hi Ding,

On Thu, May 16, 2013 at 9:00 PM, Haifeng Ding <hank...@gmail.com> wrote:
> 8287176 points retrieved, 38111 points plotted in 89273ms.

That's just under 100k points per second. Not impressive by any
standard. How fast does the query return if you don't filter on any
tags? Or if you just scan the underlying HBase table for the key
range appropriate to your query?
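
For the raw-scan comparison, the key range could be sketched roughly as below. This assumes the OpenTSDB 1.x row key layout (a 3-byte metric UID, then a 4-byte base timestamp aligned to the hour, then tag UID pairs) and takes the metric UID [0, 0, 3] and the start/end times from the TsdbQuery log line. `row_key_prefix` is a hypothetical helper name, and the TSD itself may pad the range, so treat the bounds as approximate.

```python
import struct

# Assumed: each row holds one hour of data for one time series.
MAX_TIMESPAN = 3600

def row_key_prefix(metric_uid: bytes, timestamp: int) -> bytes:
    """Hypothetical helper: metric UID plus hour-aligned base time."""
    base_time = timestamp - (timestamp % MAX_TIMESPAN)
    return metric_uid + struct.pack(">I", base_time)

# Values taken from the TsdbQuery log line.
metric_uid = bytes([0, 0, 3])
start_time = 1368676800
end_time = 1368718200

# Inclusive start row: first row of the hour containing start_time.
start_key = row_key_prefix(metric_uid, start_time)

# Exclusive stop row: first row of the hour after the one containing end_time.
stop_key = row_key_prefix(metric_uid, end_time + MAX_TIMESPAN)

print(start_key.hex(), stop_key.hex())
```

The two byte strings could then serve as the start and stop rows of a plain HBase scan, to time the raw read in isolation from the TSD.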

> Query logs:
> 2013-05-17 11:13:51,470 INFO [New I/O worker #3] TsdbQuery:
> TsdbQuery(start_time=1368676800, end_time=1368718200, metric=[0, 0, 3]
> (test.metric), tags={}, rate=false, aggregator=sum, group_bys=()) matched
> 557462 rows in 68253 spans
> 2013-05-17 11:15:09,772 INFO [Gnuplot #7] Plot: Wrote Gnuplot script to
> /home/data1/build/tmp/tsd/bd6214b7.gnuplot
> 2013-05-17 11:15:09,857 INFO [Gnuplot #7] HttpQuery: [id: 0xa893a6c0,
> /172.21.206.53:55282 => /10.42.230.49:8402] HTTP
> /q?start=2013/05/16-12:00:00&end=2013/05/16-23:30:00&m=sum:10m-avg:test.metric&o=&yrange=%5B0:%5D&wxh=1328x484&json
> done in 89273ms

Hmm, this is particularly disappointing because you're only querying
one metric. If you were querying multiple metrics at a time, just
bear in mind that right now each metric gets handled sequentially
(even though in theory they could be handled in parallel), which can
contribute to slower response times than would be possible under
optimal circumstances.

> I also made several CPU profiling with the query process. I found that HBase
> was responding fast enough, while most of the time was spent on generating

I still find it dubious that we don't see the HBase access at all in
the screenshot you shared. It cannot possibly be so fast as to be
invisible to the profiler, can it? And since this code is still
written in a blocking fashion, the profiler should be able to see it
wait for HBase.

> My questions are:
> 1. Is it reasonable with the query performance and profiling results above?

Not really.

> 2. Is there any suggestion or best practice to improve query performance of
> OpenTSDB? For example, is it feasible to reach 10 seconds for the above
> query execution?

I would like to say that the answer is yes, it's possible, but we
first need to determine exactly why it's so slow right now.

Can you share your test data set maybe?

--
Benoit "tsuna" Sigoure