TSDB's schema doesn't always deal well with what we call "sparse
metrics". By that we mean a metric that has a lot of tag
combinations. If you look at the schema:
http://opentsdb.net/schema.html
you see that the metric ID and the timestamp are the first part of the
key. HBase stores data sorted by key. The worst case for this layout
is extracting a single sensor's data over a large time range in which
many other sensors also reported. In this case the HBase scan spends
a lot of time 'skipping over' data that isn't for the sensor (tags)
you are interested in. If your queries generally aggregate most of
your sensor data, then this schema works fine.
If you are doing lots of sparse queries as well as aggregation, then
you may have to make some compromises on which one to optimize for.
If you wanted to hack OpenTSDB's schema to optimize for sparse
queries, you could change the row key so that the sensor ID comes
before the timestamp. That way you could do quick range scans of a
given sensor over a given time range.
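
To make the trade-off concrete, here is a minimal sketch (not OpenTSDB code) that builds row keys in both orders and counts how many rows lie between the first and last row of one sensor. The 3-byte IDs and 4-byte timestamp follow the published schema; the helper names and the sample IDs are made up for illustration.

```python
import struct

def uid(n: int) -> bytes:
    """Encode a hypothetical 3-byte unique ID (big-endian)."""
    return n.to_bytes(3, "big")

def metric_first_key(metric, base_ts, tagk, tagv):
    # Layout of the stock schema: metric, base timestamp, then the tag pair.
    return uid(metric) + struct.pack(">I", base_ts) + uid(tagk) + uid(tagv)

def sensor_first_key(metric, base_ts, tagk, tagv):
    # Hypothetical alternative: the tag pair moves ahead of the timestamp.
    return uid(metric) + uid(tagk) + uid(tagv) + struct.pack(">I", base_ts)

METRIC, TAGK = 1, 1                      # made-up IDs
HOURS = [3600 * h for h in range(3)]     # three hourly rows per sensor
SENSORS = [1, 2, 3]                      # made-up tag value IDs

def rows_scanned(make_key, wanted_sensor):
    """Rows between the first and last row of one sensor, in sorted key order."""
    keys = sorted(make_key(METRIC, ts, TAGK, s) for ts in HOURS for s in SENSORS)
    wanted = [make_key(METRIC, ts, TAGK, wanted_sensor) for ts in HOURS]
    first, last = keys.index(min(wanted)), keys.index(max(wanted))
    return last - first + 1

print(rows_scanned(metric_first_key, 2))  # 7: other sensors are interleaved
print(rows_scanned(sensor_first_key, 2))  # 3: the sensor's rows are contiguous
```

With only three sensors the difference is small, but the metric-first span grows with the number of tag combinations, while the sensor-first span stays equal to the number of rows for that sensor.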
--Dave
If I instead used the sensor name as the metric, it would have the same effect as moving the primary tag to the left of the metric/timestamp.
However, this way I won't have to fork OpenTSDB right away.
Any issues with this?
I will still eventually want to fork OpenTSDB to have variable length rows.
Some rows will align base ts for the day and some rows for the hour.
Regards,
Mandar U Jog
No, that's perfectly fine. It just makes it harder to express queries
such as "show me everything for this time range, regardless of that
tag", because now that tag is part of the metric.
> I will still eventually want to fork OpenTSDB to have variable length rows.
Rows are already variable length, but I guess you meant "variable
length tag IDs". If you do this, I'd be interested in seeing how you
structure the key so that you can scan your table efficiently. The
problem with variable length IDs is that it's hard to tell where IDs
start and stop inside of the row key, and this is required for
efficient server-side (within HBase's RegionServers) filtering.
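
As a rough illustration of why fixed widths matter, the sketch below splits a row key using nothing but constant offsets. The widths (3-byte metric, 4-byte base timestamp, 3-byte tag name and tag value IDs) follow the published schema; the function itself is hypothetical and not part of OpenTSDB.

```python
import struct

METRIC_WIDTH = 3      # bytes in a metric ID
TIMESTAMP_BYTES = 4   # bytes in the base timestamp
TAG_NAME_WIDTH = 3    # bytes in a tag name ID
TAG_VALUE_WIDTH = 3   # bytes in a tag value ID

def parse_row_key(key: bytes):
    """Split a fixed-width row key into (metric, base_ts, [(tagk, tagv), ...])."""
    prefix = METRIC_WIDTH + TIMESTAMP_BYTES
    pair = TAG_NAME_WIDTH + TAG_VALUE_WIDTH
    if len(key) < prefix or (len(key) - prefix) % pair != 0:
        raise ValueError("row key length does not match the fixed-width layout")
    metric = key[:METRIC_WIDTH]
    (base_ts,) = struct.unpack(">I", key[METRIC_WIDTH:prefix])
    tags = []
    # Every tag pair starts at a known offset: prefix + i * pair.
    for pos in range(prefix, len(key), pair):
        tagk = key[pos:pos + TAG_NAME_WIDTH]
        tagv = key[pos + TAG_NAME_WIDTH:pos + pair]
        tags.append((tagk, tagv))
    return metric, base_ts, tags

key = (b"\x00\x00\x01" + struct.pack(">I", 3600)
       + b"\x00\x00\x01\x00\x00\x02" + b"\x00\x00\x03\x00\x00\x04")
metric, base_ts, tags = parse_row_key(key)
print(base_ts, len(tags))  # 3600 2
```

With variable length IDs there is no such arithmetic: the key would need length prefixes or delimiters, and a server-side filter would have to decode them for every row it examines.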
> Some rows will align base ts for the day and some rows for the hour.
--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
I saw something like this on the road map. Has the work started?
Regards,
Mandar U Jog
Yes, we've been compacting every new data point at StumbleUpon since
September 27. I think it's stable and can be used by others now.
Three reasons why this hasn't been released yet:
1. The "fsck" command has support for compacted cells but there are a
couple of bugs where it doesn't correctly report corrupted data.
2. The new version of OpenTSDB depends on asynchbase 1.1.x, not released yet.
3. I haven't had time to work on OpenTSDB since early October.
The code is here: https://github.com/tsuna/opentsdb/tree/compact
If you want rows that span 1 day instead of 1 hour, you need to change
MAX_TIMESPAN to 86400 (seconds) in Const.java.
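
For illustration, here is a small sketch of how the row's base timestamp depends on that constant, assuming the base time is the data point's timestamp rounded down to a multiple of MAX_TIMESPAN. The function name and the sample timestamp are made up; this is not the OpenTSDB code itself.

```python
def base_time(timestamp: int, max_timespan: int) -> int:
    """Round a Unix timestamp (seconds) down to the start of its row."""
    return timestamp - (timestamp % max_timespan)

ts = 1325469723  # 2012-01-02 02:02:03 UTC, an arbitrary sample

hourly = base_time(ts, 3600)    # row starts at 02:00:00
daily = base_time(ts, 86400)    # row starts at 00:00:00

print(hourly, ts - hourly)  # 1325469600 123
print(daily, ts - daily)    # 1325462400 7323
```

The second number on each line is the offset of the data point within its row, which grows from under an hour to under a day when the constant is raised.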
Hope this helps.