The hash table is good too. Having a relatively easy way of saying
"gimme all the metrics for host=blah" would be good. We may want a
way to reset/clear these, though, or perhaps bucketize them by
some time interval (week? month?). Metrics change and decay over
time, and roles of hosts change. Some box may have been a web server
one day and have a bunch of metrics one month, but later it's doing
something else and has a bunch of different metrics. At some point I
want to forget about associating that box with those old web metrics.
If we put like year + month in the key, those associations will be
automatically regenerated each month, but we will still be able to
search back and find old associations if we make queries against the
older data.
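A minimal sketch of the year+month bucketing idea, in Python for illustration (the key layout and function name here are hypothetical, not the actual TSD schema):

```python
def bucketed_key(metric, tags, year, month):
    """Build a lookup key that includes the time bucket, so metric/tag
    associations are regenerated in each new bucket while old buckets
    stay queryable for historical searches."""
    # Sort tags so the key is independent of insertion order.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    bucket = f"{year:04d}-{month:02d}"
    return f"{bucket}:{metric}:{tag_str}"

# The same series lands in a fresh bucket each month...
k1 = bucketed_key("proc.stat.cpu", {"host": "foo"}, 2011, 1)
k2 = bucketed_key("proc.stat.cpu", {"host": "foo"}, 2011, 2)
assert k1 != k2
# ...while a scan over the old bucket still finds the old association.
assert k1.startswith("2011-01:")
```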
--Dave
In our case we are constantly recycling hosts in different roles. We
essentially never rename hosts (they are named based on their physical
location).
When we move a host to a different role, though, the combination of
tags will generally be different. We use a hybrid scheme here, where
each service has a role name and a value (so different services can
overlap on a host, just not multiple instances of one service on a
given host). So for example, if a box is part of memcache pool "main"
and also webserver pool "default", it would have the tags
"role_memcache=main role_web=default".
So yeah, since the list of tags, and thus the hash, would change when
we moved a host's role, it would cover this case. Stale metrics would
only stick around if a service changed such that a metric was no
longer relevant or used. In this case it's arguable if you would want
to delete this association anyway as long as you wanted to retain the
historical data for that metric.
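To illustrate the point, a small sketch of how a tag-set hash changes when the role tags change (the hashing scheme below is made up for illustration, not what the TSD actually does):

```python
import hashlib

def series_hash(metric, tags):
    """Identity of a time series: a metric plus its full, sorted tag
    set. Any change to the tags produces a different hash."""
    canonical = metric + "|" + "|".join(
        f"{k}={v}" for k, v in sorted(tags.items()))
    return hashlib.sha1(canonical.encode()).hexdigest()

# Same box, same metric, but repurposed from web duty to db duty:
before = series_hash("proc.stat.cpu",
                     {"role_memcache": "main", "role_web": "default"})
after = series_hash("proc.stat.cpu",
                    {"role_memcache": "main", "role_db": "replica"})
assert before != after  # new role => new tag set => new association
```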
--Dave
Although I'm OK with JSON to store metadata in HBase, I'm also
thinking about ProtoBufs, because they have the advantage of having a
well-defined structure, as in once you write your .proto file, it's
pretty obvious what will be in HBase, whereas JSON is just a string
where you can make typos etc.
My other concern, regardless of whether or not JSON or PB is used as a
format, is that of concurrent updates. It sounds like the TSD will
need to lock the row in order to be able to update the metadata. Or
actually… maybe I can just implement support in asynchbase for the
atomic update RPC HBase has ("checkAndPut" or something like that).
Otherwise when you have multiple TSDs trying to update the same piece
of metadata the 2nd write might overwrite the first one.
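For illustration, here is the read-modify-CAS retry loop that an atomic compare-and-set RPC like checkAndPut enables, simulated against an in-memory dict rather than a real HBase table (all names here are hypothetical):

```python
import json

# In-memory stand-in for an HBase table: row key -> serialized metadata.
table = {}

def check_and_put(row, expected, new_value):
    """Mimics HBase's checkAndPut semantics: write only if the current
    value still matches what we read. Returns True on success."""
    if table.get(row) == expected:
        table[row] = new_value
        return True
    return False

def update_metadata(row, mutate):
    """Read-modify-write loop: retry until our CAS wins, so two TSDs
    updating the same row cannot silently overwrite each other."""
    while True:
        current = table.get(row)
        meta = json.loads(current) if current else {}
        mutate(meta)
        if check_and_put(row, current, json.dumps(meta, sort_keys=True)):
            return meta

update_metadata("proc.stat.cpu", lambda m: m.setdefault("tags", []).append("host"))
update_metadata("proc.stat.cpu", lambda m: m["tags"].append("cpu"))
```

If the CAS fails, the loop re-reads the row and re-applies the mutation on top of the other writer's update, which is exactly the "2nd write overwrites the 1st" hazard being avoided.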
Also the reason I originally decided not to keep track of associations
between metrics and tag combinations used in the wild is that they are
temporal as Dave pointed out, in the sense that they change as things
come and go. BTW just to correct the terminology a bit:
"proc.stat.cpu" is a metric, "host=foo" is a tag (where "host" is a
tag key and "foo" a tag value) and [proc.stat.cpu host=foo] is a time
series for the metric "proc.stat.cpu". In other words a metric + a
specific set of tags should be called a "time series".
The problem with keeping track of which time series exist is that there are
many many many of them. If you want to be able to answer queries like
"what metrics use this tag foo=bar", then you need a simple inverted
index of tags. But if you want to be able to answer queries like
"what metrics use tags foo=bar *and* qux=baz" then you need to store
the combinatorial explosion of all tags and all metrics, which can
easily go into the billions of items. A year ago I looked at our TSD
traffic and found over 200000 unique time series. Virtually all our
datapoints have at least 3 tags, generally more. This adds up to 1333 billion
possible combinations. Sure they won't all exist and thus won't all
be stored, but the size of the data explodes exponentially, which is
not good. Adding the temporal aspect of the data only makes things
worse.
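A toy sketch of why the simple inverted index stops being exact for multi-tag queries (the data and names below are made up):

```python
from collections import defaultdict

# Hypothetical set of time series: (metric, frozenset of tag=value pairs).
series = [
    ("proc.stat.cpu",  frozenset({"host=web1", "role=web"})),
    ("proc.stat.cpu",  frozenset({"host=db1",  "role=db"})),
    ("proc.net.bytes", frozenset({"host=web1", "role=web"})),
]

# Simple inverted index: one tag=value -> metrics that use it anywhere.
index = defaultdict(set)
for metric, tags in series:
    for tag in tags:
        index[tag].add(metric)

# Single-tag queries are exact:
assert index["host=web1"] == {"proc.stat.cpu", "proc.net.bytes"}

# An AND query by set intersection is only an upper bound:
# proc.stat.cpu matches even though no single series of it carries
# both host=web1 and role=db at once.
candidates = index["host=web1"] & index["role=db"]
assert "proc.stat.cpu" in candidates
exact = {m for m, t in series if {"host=web1", "role=db"} <= t}
assert exact == set()
```

Making the AND query exact means indexing every tag combination per metric, which is the combinatorial explosion described above.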
There is another strategy that can be used to auto-complete forms.
Once you have a metric, a start time and (optionally) an end time, you
can do a short scan to try to discover what kind of tags are used.
E.g. I entered the following in the TSD UI:
- Start: 2001/02/03-00:00:00
- End: 2001/03/04-00:00:00
- Metric: proc.stat.cpu
Then I can do two short scans of proc.stat.cpu, one right after the
start time, one right before the end time. Assuming I scan just a few
hundred rows, or a few thousand rows at most, it should be possible to
"guess" most of the combinations of tags in that time range.
--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
Yeah this row-lock business is pretty ugly, I'd like to avoid using it
or spreading the disease as much as possible.
> Gotcha, I'll refer to the metric + tags as a time series now :) I
> think that a well designed time series definition should be immutable.
> e.g. if you have "proc.stat.cpu host=foo", that should be created once
> and always refer to that metric on host "foo". If host "foo" dies or
> we rename it to "smoo", then we'll have a new time series but the old
> one is still relevant for historical purposes. It's all in the tags
> and combinations.
That's exactly how it works. A time series is a unique combination of
a metric and set of tags. If you change a tag in any way, we're now
talking about a new time series.
> For mapping, I was thinking of including arrays in each "metrics" and
> "tagk" meta objects that list all of the relevant associates.
> e.g. "metrics":[tagk_uid1, tagk_uid2, tagk_uid3...]
> "tagk":[metric_uid1, metric_uid2, metric_uid3...]
> This way it's easy to perform a quick lookup in HBase for the proper
> tagk or metric and find out what is associated.
>
> Hows that sound? thanks!
That sounds reasonable.
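A sketch of what those meta objects might look like as JSON rows, with made-up UID values (this is just the proposal above written out, not an actual OpenTSDB schema):

```python
import json

# Hypothetical metadata rows: each "metrics" meta object lists the
# tagk UIDs seen with it, and each "tagk" meta object lists the
# metric UIDs it appears on, so either direction is one HBase get.
metric_meta = {
    "uid": "000001",                  # UID of proc.stat.cpu (made up)
    "name": "proc.stat.cpu",
    "tagks": ["000001", "000002"],    # e.g. UIDs of "host" and "cpu"
}
tagk_meta = {
    "uid": "000001",                  # UID of "host" (made up)
    "name": "host",
    "metrics": ["000001"],            # metric UIDs this tag key is on
}

# Quick lookup in either direction:
assert "000001" in tagk_meta["metrics"]
row_value = json.dumps(metric_meta, sort_keys=True)  # value stored in HBase
```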
Hello Igor,
I was looking for the same kind of queries, like using suggest to get
tagv values based on other tagv values, but the URLs you listed are not
working. The query only runs up to the "q" part. Can you please help me
make it work? My data goes like this:
The drawback here is that I must not mention time, because fetching
user deviceId values is not based on time and metrics; it is based only
on the user name. That means I want to get deviceId values based on the
user value. I tried it with the suggest endpoint as follows:
and it's not working. Any suggestion would be very helpful.