New round of changes – please read before upgrading

tsuna

Jun 24, 2011, 1:06:10 AM

to OpenTSDB

Hi all,
I've just pushed a bunch of changes [1] to our public tree [2]. We've
been running with them in production for over 2 weeks with no
problems.

One of them [3] requires a bit of special care before upgrading.

To be rolled out cleanly, it should be deployed during the first 10
minutes past the hour. So to deploy this change, wait until about
1 minute past the hour, then you have 9 minutes to do a rolling
restart of all your TSDs. This assumes that your TSDs are only getting
recent data, and that you're not sending them "old" timestamps
(old = more than a minute old). If you live in a time zone whose
offset from GMT is not a whole number of hours (e.g. India at
GMT +5:30, or Afghanistan at GMT +4:30), you need to make sure you're
rolling the change out at the beginning of the hour in the GMT time
zone.
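
If you want a guard in your deploy script, here is a minimal sketch of
a check for that window. The function name in_rollout_window is made
up for this example and is not part of OpenTSDB; it only encodes the
"minute 1 through 9 past the hour, in GMT" rule described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenTSDB): succeeds only when the
# given minute-of-hour falls in the 1..9 rollout window.
in_rollout_window() {
    minute=${1#0}        # drop one leading zero, so "08" becomes "8"
    minute=${minute:-0}  # a bare "0" became empty above; restore it
    [ "$minute" -ge 1 ] && [ "$minute" -le 9 ]
}
```

You would call it with the current GMT minute, for example
in_rollout_window "$(date -u +%M)", and only start the rolling restart
when it succeeds. Using date -u keeps the check independent of the
local time zone.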

Why do you need to do this?

The TSDs used to store 10 minutes worth of data per row in HBase.
Now, instead of aligning rows on 10 minute boundaries, they are
aligned on 1h boundaries, and TSDs store up to 1h worth of data per
row. During the first 10 minutes of an hour, the 10 minute boundary
and the 1h boundary are the same, which is why the rollout has to
happen in that window. This makes read queries quite a bit more
efficient and is going to give us a lot more optimization
opportunities for the next round of changes (which will be easier to
roll out).
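
To see why the first 10 minutes of the hour are safe, here is a small
worked example. It assumes a row's base time is just the timestamp
rounded down to a multiple of the row span (600 seconds before, 3600
seconds now); the row_base function is made up for illustration and
the real TSD code may compute this differently.

```shell
#!/bin/sh
# Hypothetical helper: round a Unix timestamp down to a multiple of
# SPAN seconds. Usage: row_base TIMESTAMP SPAN
row_base() {
    echo $(( $1 - $1 % $2 ))
}

# 2011-06-09 10:05:00 GMT: both alignments give 10:00:00.
row_base 1307613900 600    # prints 1307613600
row_base 1307613900 3600   # prints 1307613600

# 2011-06-09 10:25:00 GMT: the old alignment gives 10:20:00,
# the new one gives 10:00:00.
row_base 1307615100 600    # prints 1307614800
row_base 1307615100 3600   # prints 1307613600
```

Under that assumption, a TSD restarted at 10:05 keeps writing to a row
with the same base time as before the upgrade, while one restarted at
10:25 does not.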

What happens if you screw up?

Don't panic! There's nothing to be scared of :)
At most 1h worth of data in your "tsdb" table will end up "out of
order". When you attempt to query it, you will either get an Internal
Server Error (HTTP 500) complaining about something being "added out
of order", or a gap at the point where there is the problem. Luckily,
this round of changes includes two new "fsck"-type tools to deal with
these errors and more. There is one "fsck" tool for the "tsdb-uid"
table, and one for the "tsdb" table. In case of "out of order"
problems, you can use the latter to detect and then automatically fix
problems.

Let's say you roll out the change on June 9 between 10 and 11 am, and
you run into the problem described above. All you need to do is run:

./src/tsdb fsck 2011/06/09-10:00:00 2011/06/09-11:00:00 sum your.metric.name

This finds what problems there are. Ignore the "sum" in the command
above; it doesn't matter. The output will tell you how to fix the
errors if they can be fixed automatically.

If you screw up, chances are that you will need to run the command
above for every single metric you have. This is easy to do since you
can list all the metrics using this command:

./src/tsdb uid grep metrics .

(yes the last "." is needed, it's a regexp that will match all the
metric names).
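
The two commands can be combined into a loop. This is only a sketch:
it assumes each line printed by "uid grep" looks like
"metrics <name>: [<id bytes>]", so check the output on your version and
adjust the parsing if it differs. The function names and the TSDB
variable are made up for this example.

```shell
#!/bin/sh
# Path to the tsdb wrapper script; override via the environment if needed.
TSDB=${TSDB:-./src/tsdb}

# Read "uid grep metrics ." output on stdin, print one metric name per line.
# ASSUMPTION: lines look like "metrics <name>: [<id bytes>]".
extract_metric_names() {
    awk '$1 == "metrics" { sub(/:$/, "", $2); print $2 }'
}

# Run fsck over the given time range for every metric.
# Usage: fsck_all_metrics START END
fsck_all_metrics() {
    start=$1
    end=$2
    "$TSDB" uid grep metrics . | extract_metric_names |
    while read -r metric; do
        echo "== $metric"
        # </dev/null keeps fsck from eating the rest of the metric list.
        "$TSDB" fsck "$start" "$end" sum "$metric" </dev/null
    done
}
```

For the example above you would run
fsck_all_metrics 2011/06/09-10:00:00 2011/06/09-11:00:00 and read the
output for each metric.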

I'm sorry for the small constraint on the rollout of this change, but
it was easier to do it this way. OpenTSDB is very careful about
compatibility, and is striving to make operations, upgrading and
monitoring easy. In fact this round of changes includes some
preliminary code for forward compatibility with future releases.
Also, the bigger features that are coming up soon will be disabled by
default, to allow you to roll them out at your own pace.

Please write back to this group if you need help with anything.

[1] New changes:
https://github.com/stumbleupon/opentsdb/compare/33dff14fe2...0b1a02f5dc
[2] Our public tree is at: https://github.com/stumbleupon/opentsdb
[3] This change needs special care:
https://github.com/stumbleupon/opentsdb/commit/0bccaabd754d4dc849da63dc88519c0fe08afd24