sparse / irregular time series


Andrew Harbick

May 15, 2012, 10:33:55 AM
to open...@googlegroups.com
I was exploring the use of OpenTSDB for some data that I'm collecting, and I can't quite figure out whether it's going to work.

Here are the details of the data set:
  1.  The metric is duration_seconds computed as the total number of seconds that a user spends consuming a piece of content.
  2.  Each data point has one tag for the content_guid (the identifier for the content); see the example put line after this list
  3.  There are 20-30 different content_guids
  4.  In the data set linked below the metrics are reported only when a given user finishes a piece of content.  As such the data is sporadic/sparse.
  5.  In the data set linked below there are no duplicate timestamps (i.e. each metric/tag is reported at a different time)
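
For reference, a single data point as it would be sent to the TSD looks like this (the guid and timestamp here are made up):

   put duration_seconds 1337093635 761 content_guid=abc-123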


When I plot the data aggregating with average (no downsampling) I get this:  https://s3.amazonaws.com/aharbick/opentsdb/duration_seconds_agg_avg.png

When I plot the data aggregating with min (no downsampling) I get this: https://s3.amazonaws.com/aharbick/opentsdb/duration_seconds_agg_min.png

The main thing that I don't understand is how the aggregation works for each data point.  Concretely, in the plot aggregated by average, how is the first data point around 760 and the second around 250 when the first three raw data points are 761, 46, and 665?  In other words, how is OpenTSDB filling in the gaps?

Beyond that, is it possible to use OpenTSDB for my use case?  How should I be reporting this data?  Let OpenTSDB aggregate?  In other words, don't report 50 seconds once at the end of the session, but report it as the seconds are consumed, e.g. 10, 10, 10, 10, 10?  But what should I be reporting when a particular content_guid (tag) isn't being consumed at all?  It seems like doing that would require something like

   for content_guid in all_known_content_guids
      active_user=false
      for user in active_content_sessions_using(content_guid)
          active_user=true
          put duration_seconds timestamp 10 tag=content_guid
      end
      if active_user == false
          put duration_seconds timestamp 0 tag=content_guid
      end
   end

but then I'd be kicking out a lot of 0 metrics whenever no one is consuming a piece of content.

OpenTSDB seems really close (and is certainly awesome for operational metrics use cases!).  Can someone help?

Thanks!
Andy

tsuna

May 16, 2012, 1:37:42 PM
to Andrew Harbick, open...@googlegroups.com
Hi Andrew,

On Tue, May 15, 2012 at 7:33 AM, Andrew Harbick <ahar...@gmail.com> wrote:
> The main thing that I don't understand is how the aggregation works for
> each data point.  Concretely, in the plot aggregated by average, how is the
> first data point around 760 and the second around 250 when the first three
> raw data points are 761, 46, and 665?  In other words, how is OpenTSDB
> filling in the gaps?

I put an explanation up here:
http://tsunanet.net/~tsuna/opentsdb/misc/aggregation.html

I hope this will help you get a better understanding of what's going on.
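
The short version: when aggregating, each series is linearly interpolated at the timestamps where the other series have real data points, and the aggregator is then applied to those values.  Here is a rough sketch of that idea in Python (the two series are made up; this is an illustration, not OpenTSDB code):

    def lerp(p0, p1, ts):
        """Linear interpolation between two (timestamp, value) points."""
        (t0, v0), (t1, v1) = p0, p1
        return v0 + (v1 - v0) * (ts - t0) / float(t1 - t0)

    def value_at(series, ts):
        """Value of the series at ts: an exact point, or one interpolated
        between its two neighbours; None outside the series' range (in which
        case the series contributes nothing)."""
        for p0, p1 in zip(series, series[1:]):
            if p0[0] <= ts <= p1[0]:
                return lerp(p0, p1, ts)
        return None

    # Two made-up series of (timestamp, value) points.
    a = [(100, 761), (300, 665)]
    b = [(200, 46), (400, 120)]

    for ts in sorted({t for s in (a, b) for t, _ in s}):
        vals = [v for v in (value_at(a, ts), value_at(b, ts)) if v is not None]
        print(ts, sum(vals) / len(vals))  # the "avg" aggregator

With numbers loosely patterned on yours, the first aggregated point is just the lone 761 (the other series has no data yet), while the second is the average of 46 and a value interpolated on the first series, which is how the averaged graph can dip the way you described.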

> Beyond that, is it possible to use OpenTSDB for my use case?

Yes.

> How should I be reporting this data?  Let opentsdb aggregate?

Yes, let OpenTSDB aggregate.

> In other words, don't
> report 50 seconds once at the end of the session, but report it as the
> seconds are consumed, e.g. 10, 10, 10, 10, 10?

No, I wouldn't recommend doing that.

> But what should I be
> reporting when a particular content_guid (tag) isn't being consumed at all?
>  It seems like doing that would require something like
>
>    for content_guid in all_known_content_guids
>       active_user=false
>       for user in active_content_sessions_using(content_guid)
>           active_user=true
>           put duration_seconds timestamp 10 tag=content_guid
>       end
>       if active_user == false
>           put duration_seconds timestamp 0 tag=content_guid
>       end
>    end
>
> but then I'd be kicking out a lot of 0 metrics whenever no one is consuming
> a piece of content.

That shouldn't be a problem, and it'll help you get more useful graphs.
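
For example, something along these lines.  This is only a sketch: all_known_content_guids() and active_content_sessions_using() are placeholders for whatever lookups your collector already has, and "tsd-host" is a made-up address; none of it is an OpenTSDB API.  Each guid gets one data point per interval (10 seconds per active session, or an explicit 0), which also avoids writing duplicate timestamps for the same metric/tag combination:

    import socket
    import time

    def report_once(tsd, interval=10):
        now = int(time.time())
        for guid in all_known_content_guids():                  # placeholder lookup
            active = len(active_content_sessions_using(guid))   # placeholder lookup
            # 10 seconds of consumption per active session, or an explicit 0
            # when nothing is being played, as a single point per guid.
            line = "put duration_seconds %d %d content_guid=%s\n" % (
                now, active * interval, guid)
            tsd.sendall(line.encode())

    tsd = socket.create_connection(("tsd-host", 4242))  # made-up TSD address
    while True:
        report_once(tsd)
        time.sleep(10)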

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Matt

Jun 22, 2012, 9:01:51 PM
to open...@googlegroups.com, Andrew Harbick
Hi Tsuna,

A question about the OpenTSDB HBase data structures as they relate to the slides about how time series are stored (the inside-HBase "Table: tsdb" slide).  It concerns the relationship between time-stamp sparsity and performance, which this thread has touched upon.

I've read that time-stamps are provided by the collector agents, not decided by the OpenTSDB back-end or database.  In the slideshow mentioned above, it appeared to me that the columns that make up the 'postfix part' of the time-stamp are highly regular.  Wouldn't this, to an extent, be an idealization of a real-world OpenTSDB system?  I would assume that where time series flow from multiple servers, each server would frequently have a different time stamp even if the sampling interval is uniform.  E.g. servers A1, A2, ..., An each sample every 10 seconds, but each one started sampling at a practically random time, so there would be closer to 10 columns for every 10-second interval, assuming the number of servers n is large and the start times are random.  This would make the matrix sparser than the one in the slide, growing more and more sparse as the sampling interval widens.  What can we say about the efficiency of OpenTSDB in that case compared to the ideal case where columns are always reused?

If those assumptions are mostly relevant, two questions arise: (1) whether it is sensible to skew the time-stamps from all servers so they fit the same column, or to perform other trickery to align the samples, and (2) whether there is a clear approximate quantitative model for the relationship between the ratio of columns to interval size and read performance.  Aligning the time-stamps is not necessarily 'fun', as it slightly distorts the data and introduces the need for more pre-processing before storage, both of which can offset the benefit of the resulting compactness or introduce other issues.
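
For what it's worth, the "skewing" in (1) would amount to nothing more than rounding each sample's timestamp down to the sampling-interval boundary before the put.  A one-line sketch, with a made-up timestamp, just to frame the question:

    # Hypothetical alignment step, only if one wanted all servers' samples to
    # share the same timestamps; OpenTSDB itself does not require this.
    interval = 10
    ts = 1340383217                     # made-up raw sample timestamp
    aligned_ts = ts - (ts % interval)   # -> 1340383210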

I'd appreciate a concise discussion if it's not already out there.
Have I missed any existing post/doc about this aspect? 

Thanks in advance,
Matt

tsuna

Jun 23, 2012, 1:39:45 AM
to Matt, open...@googlegroups.com, Andrew Harbick
On Fri, Jun 22, 2012 at 6:01 PM, Matt <ma...@cloudaloe.org> wrote:
> A question about OpenTSDB HBase data structures as they relate to the slides
> about how time series are stored (inside HBase "Table:tsdb" slide). It
> relates to the relationship between time-stamp sparsity and performance,
> that this thread has touched upon.

Looks like there may be some confusion as to what I meant by "sparse".
I realize that the term is ambiguous in this context.
HBase, by definition, is like a big sparse hash map. HBase doesn't
care what columns you use or don't use.

key | column:value
bar | q1:v1
foo | q2:v2, q3:v3
qux | q1:v4

In the example above, we have a table with 3 rows and 4 KeyValues.
There are 3 different "columns" used (also referred to as "qualifiers"
in Bigtable/HBase-speak). But there is absolutely no cost due to the
fact that the table is "sparse" because the rows "bar" and "qux" don't
have values for q2/q3.

On disk the data is stored like this:

bar,q1:v1
foo,q2:v2
foo,q3:v3
qux,q1:v4

It doesn't matter what the column names are. It's completely
irrelevant to HBase. Bigtable/HBase are inherently sparse.
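
To make that concrete, here is a toy model in plain Python (not HBase code): think of the physical layout as a map keyed by (row, qualifier), where cells that don't exist are simply absent.

    # Toy model of HBase's sparse physical layout: only cells that exist are
    # stored; "missing" columns have no representation at all.
    table = {
        ("bar", "q1"): "v1",
        ("foo", "q2"): "v2",
        ("foo", "q3"): "v3",
        ("qux", "q1"): "v4",
    }
    # ("bar", "q2") simply has no entry: nothing is stored, nothing is scanned.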


What I meant by "sparse" in the context of OpenTSDB was that if you
have a metric with a tag, and that tag has a lot of possible values,
and you query your metric for a specific combination of tag=value,
then OpenTSDB has to tell HBase to "skip over" all the tags you don't
care about, which is inefficient (low data locality).

> What can we say about the efficiency of OpenTSDB in that case compared to
> the ideal case where columns are always reused?

There is no notion of "columns getting re-used" in HBase. I hope it's
clear why, given what I explained above.

Matt

Jun 23, 2012, 5:27:11 AM
to open...@googlegroups.com, Matt
Thanks for the reminder about the physical HBase model... appreciate it.
I guess my question related to how OpenTSDB works under the hood.

I guess the column family in my example would just be wider (wider in the logical HBase sense, as seen on the slide), but the wideness would not translate into any kind of verbosity in the physical HBase view, since empty cells are simply not physically stored.  So the logical widening of the column family (say, in my example, by a factor of 10 compared to a 'fully synchronized' ideal data-collection situation) would not affect storage size at all.  That's for storage size.  How should I think about query performance being affected by the logical sparsity?  Is there a loss of performance compared to having dense columns?  I guess for sequential scans it is meaningless, as HBase would run over a range anyway, and OpenTSDB doesn't need to juggle the data in its own memory in ways that care about the logical density.

Hopefully I got it right this time. 
Comments?

Matt


Pablo Chacin

Jun 29, 2012, 5:27:57 AM
to open...@googlegroups.com
tsuna
On 06/23/2012 07:39 AM, tsuna wrote:
> What I meant by "sparse" in the context of OpenTSDB was that if you
> have a metric with a tag, and that tag has a lot of possible values,
> and you query your metric for a specific combination of tag=value,
> then OpenTSDB has to tell HBase to "skip over" all the tags you don't
> care about, which is inefficient (low data locality).
Maybe instead of calling this "sparse" it may be better to talk about
"high cardinality".

"Sparse" is a very overloaded term; in every context where I've used it, it refers to something that is highly "dispersed or scattered" (that is, has low density).

If I understand your explanation correctly, even if a tag has high cardinality (many different possible values), for each value you can still have a dense sequence of data points (e.g. metric=%cpu, tag=hostname).

As the opposite situation, I would consider "sparse" the case where, for a given metric and tag combination, you have very few data points scattered over time (e.g. metric=%cpu, tag=jobname).

In both cases the cardinality of the tag may be high, but the density of points for a given metric is different.

My 2 Cents

Pablo

--
Pablo Chacin
R&D Engineer
SenseFields SL
Tlf (+34) 93 418 05 85
Baixada de Gomis 1,
08023 Barcelona (Spain)
http://www.sensefields.com/


tsuna

Jul 19, 2012, 5:27:02 PM
to Pablo Chacin, open...@googlegroups.com
On Fri, Jun 29, 2012 at 2:27 AM, Pablo Chacin <pch...@sensefields.com> wrote:
> Maybe instead of calling this "sparse" it may be better to talk about
> "high cardinality".

Yes, you're right.