On Fri, Jun 22, 2012 at 6:01 PM, Matt <ma...@cloudaloe.org> wrote:
> A question about OpenTSDB's HBase data structures as they relate to the
> slides about how time series are stored (the inside-HBase "Table:tsdb"
> slide). It relates to the relationship between timestamp sparsity and
> performance, which this thread has touched on.
Looks like there may be some confusion as to what I meant by "sparse".
I realize that the term is ambiguous in this context.
HBase, by definition, is like a big sparse hash map. HBase doesn't
care what columns you use or don't use.
key | column:value
bar | q1:v1
foo | q2:v2, q3:v3
qux | q1:v4
In the example above, we have a table with 3 rows and 4 KeyValues.
There are 3 different "columns" used (also referred to as "qualifiers"
in Bigtable/HBase-speak). But there is absolutely no cost incurred by
the fact that the table is "sparse", i.e. that the rows "bar" and "qux"
don't have values for q2/q3.
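For concreteness, here's a minimal sketch of writing those three rows with
the plain HBase Java client (the table name "t" and the column family "f"
are assumptions of mine; the example above omits the family). Each row
stores only the cells you actually write:

import java.util.Arrays;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SparsePuts {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table t = conn.getTable(TableName.valueOf("t"))) {
      byte[] f = Bytes.toBytes("f");  // single column family
      // "bar" and "qux" never pay any cost for not having q2/q3:
      // HBase only stores the KeyValues that are actually written.
      Put bar = new Put(Bytes.toBytes("bar"));
      bar.addColumn(f, Bytes.toBytes("q1"), Bytes.toBytes("v1"));
      Put foo = new Put(Bytes.toBytes("foo"));
      foo.addColumn(f, Bytes.toBytes("q2"), Bytes.toBytes("v2"));
      foo.addColumn(f, Bytes.toBytes("q3"), Bytes.toBytes("v3"));
      Put qux = new Put(Bytes.toBytes("qux"));
      qux.addColumn(f, Bytes.toBytes("q1"), Bytes.toBytes("v4"));
      t.put(Arrays.asList(bar, foo, qux));
    }
  }
}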
On disk the data is stored like this:
bar,q1:v1
foo,q2:v2
foo,q3:v3
qux,q1:v4
It doesn't matter what the column names are; they are completely
irrelevant to HBase. Bigtable/HBase are inherently sparse.
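Reading it back (same hypothetical "t"/"f" table as in the sketch above)
returns exactly those four KeyValues and nothing at all for the "missing"
columns:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanAll {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table t = conn.getTable(TableName.valueOf("t"));
         ResultScanner scanner = t.getScanner(new Scan())) {
      for (Result r : scanner) {
        for (Cell c : r.rawCells()) {  // one Cell per stored KeyValue
          // Prints "bar,q1:v1", "foo,q2:v2", "foo,q3:v3", "qux,q1:v4",
          // mirroring the on-disk layout listed above.
          System.out.println(Bytes.toString(CellUtil.cloneRow(c)) + ","
              + Bytes.toString(CellUtil.cloneQualifier(c)) + ":"
              + Bytes.toString(CellUtil.cloneValue(c)));
        }
      }
    }
  }
}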
What I meant by "sparse" in the context of OpenTSDB was that if you
have a metric with a tag, and that tag has a lot of possible values,
and you query your metric for a specific combination of tag=value,
then OpenTSDB has to tell HBase to "skip over" all the rows for the tag
values you don't care about, which is inefficient (poor data locality).
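This isn't OpenTSDB's actual code (OpenTSDB talks to HBase through
asynchbase), but roughly speaking the query becomes one scan over the
metric's row-key range plus a server-side regex on the row key, something
like the following sketch with the plain HBase client. The row-key bounds
and the UID regex here are placeholders:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;

public class TagFilterSketch {
  // Scan every row of the metric between startRow and stopRow, and let the
  // RegionServer drop the rows whose tag block doesn't match tagUidRegex.
  // The rows for all the other tag values still have to be read and skipped
  // server-side, which is the inefficiency described above.
  static Scan scanForTag(byte[] startRow, byte[] stopRow, String tagUidRegex) {
    Scan scan = new Scan(startRow, stopRow);
    scan.setFilter(new RowFilter(CompareOp.EQUAL,
        new RegexStringComparator(tagUidRegex)));
    return scan;
  }
}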
> What can we say about the efficiency of OpenTSDB in that case compared to
> the ideal case where columns are always reused?
There is no notion of "columns getting re-used" in HBase. I hope it's
clear why, given what I explained above.