OpenTSDB with a large number of tag values.


gollumM

Oct 27, 2011, 3:24:00 PM
to OpenTSDB
I am thinking of using/testing OpenTSDB with sensor data.

We can have a very large number of sensors, and every sensor sends 2 or
3 types of timestamped metrics.
I plan to use 3 metric names, and to segregate the data per sensor
with a 'sensor=' tag.
The number of sensors can be large, up to 1 million.

Is this a good use of OpenTSDB?

Dave Barr

Oct 27, 2011, 6:13:36 PM
to gollumM, OpenTSDB
It depends on your access patterns.

TSDB's schema doesn't always deal well with what we call "sparse
metrics", by which we mean a metric that has a lot of tag
combinations. If you look at the schema:

http://opentsdb.net/schema.html

you can see that the metric ID and the timestamp form the first part
of the row key, and HBase stores data sorted by key. The worst case
is when you want to extract a single sensor's data from a large time
range that also contains other sensors' data: the HBase scan spends a
lot of time skipping over rows whose tags aren't the sensor you are
interested in. If your queries generally aggregate most of your
sensor data, then this schema works fine.

If you are doing lots of sparse queries as well as aggregation, then
you may have to make some compromises about which one to optimize
for. If you wanted to hack OpenTSDB's schema to optimize for sparse
queries, you could change the row key so that the sensor ID comes
before the timestamp. That way you could do quick range scans over a
given sensor for a given time range.
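(A rough sketch of the two row-key layouts being compared. The byte
widths, 3-byte IDs and a 4-byte base timestamp, follow the schema
page, but this is illustrative Python, not OpenTSDB's actual
serialization code.)

```python
import struct

def default_key(metric_id, base_ts, tag_pairs):
    """Stock schema: [metric][base_ts][tagk1][tagv1]...
    All sensors for the same metric/hour sort next to each other."""
    key = metric_id.to_bytes(3, "big") + struct.pack(">I", base_ts)
    for tagk, tagv in sorted(tag_pairs):
        key += tagk.to_bytes(3, "big") + tagv.to_bytes(3, "big")
    return key

def sensor_first_key(metric_id, sensor_id, base_ts):
    """Hypothetical reordering: [metric][sensor][base_ts], so one
    sensor's rows are contiguous and a time-range query becomes a
    tight prefix scan instead of a scan that skips other sensors."""
    return (metric_id.to_bytes(3, "big")
            + sensor_id.to_bytes(3, "big")
            + struct.pack(">I", base_ts))

# With the reordered layout, the scan start/stop keys pin down a
# single sensor for a whole time range:
start = sensor_first_key(1, 42, 1319700000)
stop  = sensor_first_key(1, 42, 1319786400)
```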

--Dave

Mandar

Oct 31, 2011, 10:47:00 PM
to Dave Barr, OpenTSDB
I tested with about 5 million points and, as the email above predicts, queries are getting slower as I continue to add points.

If instead I used the sensor name as the metric, it would have the same effect as moving the primary tag to the left of the metric/timestamp.

However, used this way I won't have to fork OpenTSDB right away.

Any issues with this?

I will still eventually want to fork OpenTSDB to have variable-length rows:
some rows would align their base timestamp to the day, and some to the hour.


Regards,
Mandar U Jog

tsuna

Oct 31, 2011, 11:14:39 PM
to Mandar, Dave Barr, OpenTSDB
On Mon, Oct 31, 2011 at 7:47 PM, Mandar <mand...@gmail.com> wrote:
> I tested with about 5 million points and, as the email above predicts, queries are getting slower as I continue to add points.
>
> If instead I used the sensor name as the metric, it would have the same effect as moving the primary tag to the left of the metric/timestamp.
>
> However, used this way I won't have to fork OpenTSDB right away.
>
> Any issues with this?

No, that's perfectly fine. It just makes it harder to express queries
such as "show me everything for this time range, regardless of that
tag", because now that tag is part of the metric.
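(The trade-off can be sketched with made-up names; neither scheme is
prescribed by OpenTSDB, this just illustrates the query shapes.)

```python
# Scheme A: sensor kept as a tag -- one metric name covers all
# sensors, so "everything in this time range" is a single query.
scheme_a = {"metric": "env.temperature", "tags": {"sensor": "s00042"}}

# Scheme B: sensor folded into the metric name -- the same question
# now requires one query per sensor, since the sensor is no longer
# a tag that can be left unspecified.
scheme_b = {"metric": "env.s00042.temperature", "tags": {}}

def metrics_to_query(sensor_ids, scheme):
    """How many metric names an all-sensors query must touch."""
    if scheme == "A":
        return ["env.temperature"]
    return ["env.%s.temperature" % s for s in sensor_ids]
```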

> I will still eventually want to fork OpenTSDB to have variable-length rows.

Rows are already variable length, but I guess you meant
"variable-length tag IDs". If you do this, I'd be interested in
seeing how you structure the key so that you can still scan your
table efficiently. The problem with variable-length IDs is that it's
hard to tell where IDs start and stop inside the row key, and knowing
that is required for efficient server-side (within HBase's
RegionServers) filtering.
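(The parsing problem can be sketched like this. Widths and encoding
are illustrative: with fixed-width IDs the key splits unambiguously
by position alone; one hypothetical way to keep variable-length IDs
parseable is to length-prefix each one, at the cost of extra bytes
and more complex server-side filters.)

```python
def split_fixed(key, width=3):
    """Fixed-width IDs: boundaries are known from position alone,
    which is what makes server-side key filtering cheap."""
    return [key[i:i + width] for i in range(0, len(key), width)]

def encode_varlen(ids):
    """Hypothetical fix: length-prefix each ID so a parser can find
    the boundaries, at the cost of one extra byte per ID."""
    out = b""
    for i in ids:
        out += bytes([len(i)]) + i
    return out

def split_varlen(key):
    """Walk the length prefixes to recover the individual IDs."""
    out, pos = [], 0
    while pos < len(key):
        n = key[pos]
        out.append(key[pos + 1:pos + 1 + n])
        pos += 1 + n
    return out
```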

> Some rows will align base ts for the day and some rows for the hour.

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Mandar

Nov 12, 2011, 11:31:08 AM
to tsuna, Dave Barr, OpenTSDB
Using the sensor name as the metric name has worked better. My access patterns are such that compaction of data more than 2 days old is a must. I would like to have 1/4 to 1 day of data per cell.

I saw something like this on the road map, has the work started?

Regards,
Mandar U Jog

tsuna

Nov 12, 2011, 5:25:47 PM
to Mandar, Dave Barr, OpenTSDB
On Sat, Nov 12, 2011 at 8:31 AM, Mandar <mand...@gmail.com> wrote:
> Using the sensor name as the metric name has worked better. My access patterns are such that compaction of data more than 2 days old is a must. I would like to have 1/4 to 1 day of data per cell.
>
> I saw something like this on the road map, has the work started?

Yes, we've been compacting every new data point at StumbleUpon since
September 27. I think it's stable and can be used by others now.
Three reasons why this hasn't been released yet:
1. The "fsck" command has support for compacted cells, but there are
a couple of bugs where it doesn't correctly report corrupted data.
2. The new version of OpenTSDB depends on asynchbase 1.1.x, which isn't released yet.
3. I haven't had time to work on OpenTSDB since early October.

The code is here: https://github.com/tsuna/opentsdb/tree/compact
If you want rows that span 1 day instead of 1 hour, you need to
change MAX_TIMESPAN to 86400 in Const.java.
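(The effect of that constant on row alignment can be sketched like
this; the modulo logic mirrors how a row's base time is computed from
MAX_TIMESPAN, though the exact code in OpenTSDB may differ.)

```python
def base_time(ts, max_timespan):
    """Align a timestamp to the start of its row; each row then
    holds up to max_timespan seconds' worth of data points."""
    return ts - (ts % max_timespan)

ts = 1321103468  # an example Unix timestamp

hour_row = base_time(ts, 3600)    # stock: one row per hour
day_row  = base_time(ts, 86400)   # patched: one row per day
```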

Hope this helps.
