Counters, Order of Operations, and Results that nobody would expect


Kyle Brandt

Aug 26, 2015, 1:08:20 PM
to OpenTSDB
Hi All,

I basically want to raise some awareness around two issues:


I really value counters in monitoring. They let me capture the full quantities of what I am monitoring without "gaps" between polls. So, for example, a spike might get smoothed out due to the polling frequency, but at least it won't be lost entirely given a reasonably high-frequency polling interval (say, 15 seconds).

This was a guiding principle when building scollector. A ton of time was invested in finding the raw counters for things like CPU time, network traffic, and website hits so we could have more visibility. Just getting CPU time as a counter on Windows took maybe a week of man-hours due to the strange way those values are stored - the function we needed ended up being the following:

func TSys100NStoEpoch(nsec uint64) int64 {
    nsec -= 116444736000000000
    seconds := nsec / 1e7
    return int64(seconds)
}
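For reference, here is a self-contained sanity check of that conversion (my own sketch, not scollector code; it assumes the constant is the number of 100-nanosecond intervals between the Windows epoch, 1601-01-01, and the Unix epoch, 1970-01-01):

```go
package main

import "fmt"

// TSys100NStoEpoch converts a Windows timestamp (100-nanosecond
// intervals since 1601-01-01) to Unix epoch seconds.
func TSys100NStoEpoch(nsec uint64) int64 {
	nsec -= 116444736000000000 // 100-ns intervals between 1601 and 1970
	seconds := nsec / 1e7      // 1e7 hundred-nanosecond units per second
	return int64(seconds)
}

func main() {
	fmt.Println(TSys100NStoEpoch(116444736000000000))       // the Unix epoch itself: 0
	fmt.Println(TSys100NStoEpoch(116444736000000000 + 1e7)) // one second later: 1
}
```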


However, because Rate Calculation is the last step in the order of operations, Aggregation and Downsampling are not going to do what people expect with functions like min and max:

1. Grouping
2. Downsampling
3. Interpolation
4. Aggregation
5. Rate Calculation

When downsampling happens first, instead of getting the max "rate" in the time window, you effectively just get the average. This leads users to make false assumptions about the data they are seeing. With aggregation, I believe you will just get whatever counter happens to have the highest number at the time. That will "fool" users, because things with high traffic will tend to be the max, but something might be higher just because it has been running for a long time and its counter hasn't been restarted.
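To make the downsampling point concrete, here is a small sketch of both orders (my own illustration in Go, not OpenTSDB code; maxOf and rate are hypothetical helpers) on a synthetic counter sampled every 15 seconds with one traffic spike:

```go
package main

import "fmt"

// maxOf takes the maximum over each window of n consecutive points.
func maxOf(points []float64, n int) []float64 {
	var out []float64
	for i := 0; i < len(points); i += n {
		end := i + n
		if end > len(points) {
			end = len(points)
		}
		m := points[i]
		for _, v := range points[i:end] {
			if v > m {
				m = v
			}
		}
		out = append(out, m)
	}
	return out
}

// rate computes per-second first differences for a fixed sample step.
func rate(points []float64, step float64) []float64 {
	var out []float64
	for i := 1; i < len(points); i++ {
		out = append(out, (points[i]-points[i-1])/step)
	}
	return out
}

func main() {
	// Counter sampled every 15s: +150 per interval (10/s), except one
	// +15000 spike (1000/s) early on.
	counter := []float64{0, 150, 300, 15300, 15450, 15600, 15750, 15900}

	// Downsample first (max over 1m = 4 samples), then rate: because the
	// counter is monotonic, each bucket max is just its last value, and
	// the rate between bucket maxes is the bucket-average rate.
	fmt.Println(rate(maxOf(counter, 4), 60)) // [10] - the spike is gone

	// Rate first, then max-downsample: the 1000/s spike survives.
	fmt.Println(maxOf(rate(counter, 15), 4)) // [1000 10]
}
```

Downsample-first collapses each minute to its last counter value, so the subsequent rate is just the bucket average (10/s); rate-first preserves the 1000/s spike.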

Since I can't imagine anyone expecting these behaviors when doing rate calculations, and monitoring counters has so many advantages, I would really like to see Rate Calculation moved down the stack and placed before Downsampling. Without that, people are probably proceeding on misinterpreted data - which is a pretty bad thing. I don't know Java, so it isn't something I can directly help with, but I think a lot has been contributed to the OpenTSDB ecosystem through Bosun and scollector. So I'm really hoping for some love on this.

Best,
Kyle 

V Kurien

Aug 26, 2015, 1:32:49 PM
to OpenTSDB
Like Kyle, I'd really love to see this happen. Right now, I have to calculate my own rates to work around this unexpected behavior. I expect this to be a common scenario: if I want to find the top_n talkers across hosts in terms of a monotonic counter like sent_bytes or received_bytes, I would want to use max or percentile as the aggregation function after the rate calculation is performed.

tsuna

Aug 28, 2015, 7:57:33 PM
to Kyle Brandt, V Kurien, OpenTSDB
I think this is a mistake I made early on. I wanted to allow different
orders but without a proper query language there was no way to
express rate(downsample(my.metric, 10m, avg)) vs
downsample(rate(my.metric), 10m, avg), and since I knew there
wouldn’t be a query language for a while, I needed to settle on an order.
I went with the current order for the reason that Chris guessed in issue
#476, namely that downsampling first allows us to make the remaining
steps cheaper as they look at less data.

The case shown in issue #476 didn’t occur to me, where a data point
would be “messed up” by a counter rolling over and then that
subsequently affecting the rate computation. It’s pretty obvious that
counter rollovers are going to mess up downsampling, but I didn’t
realize that simply doing rate computation first avoided that problem
(even though it’s obvious in hindsight).

Ideally the order of operations would be driven by the user’s query,
so you could even express things like downsample(rate(rate(foo)), 10m,
avg) or whatever, but again this limitation stems more from the lack
of flexibility in how you can express what you want than anything
else.

Having said all that, I agree that the current behavior is not user
friendly and leads to surprising results, and that unless we give
people the option of choosing the order in which operations are
performed, it probably makes sense to move rate computation to be
first.
--
Benoit "tsuna" Sigoure

Jonathan Creasy

Aug 28, 2015, 10:57:02 PM
to tsuna, Kyle Brandt, V Kurien, OpenTSDB
We added expressions to our fork and talked with Manolama early on about merging some of it upstream, but neither side has really made it a priority yet.


We added a new TSDB datasource to Grafana that supports the new query expressions.

From our internal docs:

Here is an example query:

curl -s "http://localhost:9000/api/query/query?start=1m-ago&x=difference(sum:proc.stat.cpu.percpu{cpu=1},sum:proc.stat.cpu.percpu{cpu=2})"

The main change here to the query API is the expression parameter: x=difference(sum:proc.stat.cpu.percpu{cpu=1},sum:proc.stat.cpu.percpu{cpu=2})

You can create more complex expressions like:
x=scale(difference(sum:proc.stat.cpu.percpu{cpu=1},sum:proc.stat.cpu.percpu{cpu=2}), 0.003) // scale the result of difference to .3%

x = alias(scale(difference(...{cpu=1}, ...{cpu=2}), 0.003), 'newname') //alias the result name to 'newname'

When you encode this string as a URL, please note that { and } need to be replaced by %7B and %7D respectively.

V Kurien

Aug 29, 2015, 9:12:02 PM
to OpenTSDB, ky...@kbrandt.com, the.v....@gmail.com
Hi Benoit
Any idea when we might see a patch for this, or have it land in a main release? I'd rather avoid doing the rate calculation myself if possible....

Kyle Brandt

Aug 30, 2015, 9:43:27 AM
to OpenTSDB, ky...@kbrandt.com, the.v....@gmail.com
Another possibility might be to change the order of operations depending on the downsampling or aggregation function. One of the most common aggregations I use is sum, and for downsampling I use average. In those cases, if I am not mistaken, rate calculation can remain where it is for performance reasons. Then, when you do something where the order does matter, such as max aggregation or downsampling, the rate calculation happens first.

In regards to counter resets: I still believe the best thing to do is to drop the datapoint instead of setting it to zero, because the rate between those samples wasn't really zero - it just isn't known. It could also be NaN, but that might end up being annoying. The reason I don't like the zero is that in Bosun we use reduction functions for alerting, so the average rate can be impacted a lot by zeros. Also, it makes for wonky graphs to have those drops when you just want to see how something has grown over time.
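A sketch of the drop behavior described above (my own illustration, not OpenTSDB's implementation; rateDropResets is a hypothetical helper): when the counter goes backwards, the interval is skipped entirely rather than emitted as a zero:

```go
package main

import "fmt"

// point is a (timestamp in seconds, counter value) pair.
type point struct {
	ts  int64
	val float64
}

// rateDropResets computes per-second rates and drops any interval in
// which the counter went backwards (a reset): the true rate across a
// reset is unknown, so no value is emitted for it.
func rateDropResets(pts []point) []float64 {
	var out []float64
	for i := 1; i < len(pts); i++ {
		delta := pts[i].val - pts[i-1].val
		if delta < 0 {
			continue // reset detected: drop rather than emit 0
		}
		out = append(out, delta/float64(pts[i].ts-pts[i-1].ts))
	}
	return out
}

func main() {
	// A 15s-interval counter that resets between t=30 and t=45.
	pts := []point{{0, 0}, {15, 150}, {30, 300}, {45, 5}, {60, 155}}
	fmt.Println(rateDropResets(pts)) // [10 10 10]: the reset interval is gone
}
```

With a zero emitted instead, an average over these four intervals would read 7.5/s rather than the true steady 10/s, which is exactly the alerting skew described above.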

V Kurien

Aug 30, 2015, 11:57:24 AM
to Kyle Brandt, OpenTSDB
That makes sense to me as well, and I agree with Kyle's point on counter resets. In terms of NaN (or missing data), IMO what we need is the ability to "detect" whether a series contains missing data and then treat it specially if so. I wonder if tcollector could add some metadata to an instance of a time series indicating that it contains a NaN, for example as a "well-known tag". If so, I could filter out such instances.

VK

V Kurien

Aug 31, 2015, 9:05:29 AM
to Kyle Brandt, OpenTSDB
Actually, the filter-tag idea won't work, since a series can have NaNs in only parts of it. Instead it seems more like metadata in the result of a query.

tsuna

Aug 31, 2015, 1:31:32 PM
to V Kurien, Kyle Brandt, OpenTSDB
There is no notion of “missing data” in OpenTSDB, since no assumptions
are made on the interval between data points. Things don’t have to
tick at a fixed rate.

For monotonically increasing counters, resets are generally easily
detected by the first derivative (aka rate) becoming negative,
typically only for one data point (although something crash looping
might produce a more erratic behavior). Again this is where a more
powerful query language comes in handy as it allows you to express
things like ge(rate(foo), 0) where ge(x, 0) means only keep values of
x for x >= 0.
--
Benoit "tsuna" Sigoure

V Kurien

Sep 1, 2015, 8:28:28 AM
to tsuna, Kyle Brandt, OpenTSDB
Agreed. BTW, in terms of Kyle's proposal, when do you think that we can see that happen?

V Kurien

Oct 19, 2015, 1:43:30 PM
to OpenTSDB

Bump. Any update on getting this moved into 2.2?


ManOLamancha

Nov 10, 2015, 6:25:57 PM
to OpenTSDB
On Monday, October 19, 2015 at 10:43:30 AM UTC-7, V Kurien wrote:

Bump. Any update on getting this moved into 2.2?

I won't be able to get it into 2.2, unfortunately. BUT I've ported the expression code up into the "put" branch; it's part of the framework for modifying the order of operations that should satisfy this.

Kyle Brandt

Nov 11, 2015, 2:12:54 PM
to OpenTSDB
So are you saying the put branch handles counters differently and I can try it now, or that the put branch lays the foundations for this change?

ManOLamancha

Nov 11, 2015, 2:53:36 PM
to OpenTSDB
On Wednesday, November 11, 2015 at 11:12:54 AM UTC-8, Kyle Brandt wrote:
So are you saying the put branch handles counters differently and I can try it now, or that the put branch lays the foundations for this change?

Yeah, just the foundation so far, sorry! We're getting there though :) 

Kyle Brandt

Feb 22, 2016, 11:28:02 PM
to OpenTSDB
Any progress on this front?

ManOLamancha

Feb 25, 2016, 12:52:35 PM
to OpenTSDB
On Monday, February 22, 2016 at 8:28:02 PM UTC-8, Kyle Brandt wrote:
Any progress on this front?

Not yet, sorry! :'(  We've been working on abuse prevention here and have some cool features to push upstream. 

Kyle Brandt

Feb 26, 2016, 10:57:46 AM
to OpenTSDB
Think it will be part of 2.3?

ManOLamancha

Feb 26, 2016, 2:00:37 PM
to OpenTSDB
On Friday, February 26, 2016 at 7:57:46 AM UTC-8, Kyle Brandt wrote:
Think it will be part of 2.3?

Folks are chomping at the bit for a 2.3 RC with the expression and graphite code, so I may need to cut that before HBaseCon and move the reordering to 2.4.

adam.l...@turn.com

Mar 2, 2016, 2:01:36 PM
to OpenTSDB, the.v....@gmail.com, ky...@kbrandt.com
On Monday, August 31, 2015 at 10:31:32 AM UTC-7, tsuna wrote:
There is no notion of “missing data” in OpenTSDB, since no assumptions
are made on the interval between data points.  Things don’t have to
tick at a fixed rate.

For monotonically increasing counters, resets are generally easily
detected by the first derivative (aka rate) becoming negative,
typically only for one data point (although something crash looping
might produce a more erratic behavior).  Again this is where a more
powerful query language comes in handy as it allows you to express
things like ge(rate(foo), 0) where ge(x, 0) means only keep values of
x for x >= 0.


There absolutely needs to be a notion of missing data. Half the problems in this thread wouldn't exist if TSDB had a concept of missing data. Here's a trivial way to support it without a fixed polling interval: the first time a reporter pushes a metric it marks a "this is my first report" flag that introduces a barrier for that data.

Using the first derivative to detect resets is a heuristic; there are many cases where it fails. Those seem like corner cases, but they're not ignorable.

The current approach of introducing 0, or worse yet letting the interpolation do its thing, is inventing data. In my opinion, inventing data is a cardinal sin for any product that has "DB" in its name.

At our scale (thousands of machines, each pushing thousands of metrics) we hit incorrect-but-working-as-designed behavior all the time. I can rattle off examples if you want.

ManOLamancha

Mar 2, 2016, 3:18:55 PM
to OpenTSDB, the.v....@gmail.com, ky...@kbrandt.com
On Wednesday, March 2, 2016 at 11:01:36 AM UTC-8, adam.l...@turn.com wrote:

There absolutely needs to be a notion of missing data. Half the problems in this thread wouldn't exist if TSDB had a concept of missing data. Here's a trivial way to support it without a fixed polling interval: the first time a reporter pushes a metric it marks a "this is my first report" flag that introduces a barrier for that data.

We kind of have that already with the metadata system; it just needs better tuning for high-throughput uses. What else would you have the first metric report?
 
Using the first derivative to detect resets is a heuristic; there are many cases where it fails. Those seem like corner cases, but they're not ignorable.

For a counter we now have the rate options that handle rollovers properly. What other cases are you thinking of?
 
The current approach of introducing 0, or worse yet letting the interpolation do its thing, is inventing data. In my opinion, inventing data is a cardinal sin for any product that has "DB" in its name.

With 2.2 we now have the NaN handling for missing data so that has been helping a ton to detect upstream issues. And I agree, we definitely don't want to invent data when writing to storage but at query time, it can be nice to interpolate or invent the data in some situations. But we should always have the fallback of showing the raw data.

Kyle Brandt

Mar 2, 2016, 3:33:08 PM
to OpenTSDB, the.v....@gmail.com, ky...@kbrandt.com
You mentioned "we now have rate options" - have these been updated somewhere? The main thing I wanted in regards to rate calculation was for the datapoint to be dropped instead of emitting a zero during rollover; is that possible now?

ManOLamancha

Mar 2, 2016, 3:37:14 PM
to OpenTSDB, the.v....@gmail.com, ky...@kbrandt.com
On Wednesday, March 2, 2016 at 12:33:08 PM UTC-8, Kyle Brandt wrote:
You mentioned "we now have rate options" - have these been updated somewhere? The main thing I wanted in regards to rate calculation was for the datapoint to be dropped instead of emitting a zero during rollover; is that possible now?

Ahh, shoot, no I don't think that will drop yet. I could modify it to emit a NaN or drop though, sorry! 

Kyle Brandt

Mar 2, 2016, 3:48:48 PM
to OpenTSDB, the.v....@gmail.com, ky...@kbrandt.com
In that case NaN seems more correct, yet also more annoying to consume :-P. I'm not sure NaN is needed there, because you can always query the series as a gauge and see when it reset.

In regards to inventing data: I understand trying to walk the balance between "statistically correct" and ease of use (i.e. Bosun's expression language vs. something like working in R/Pandas). This is why interpolation is so nice, IMO; in many cases it is what you want, and having to line up time series first would maybe be more correct but also inconvenient for the most common use case. However, in the case of rate calculation being late in the order of operations, I think the "sin" of making up data holds true, because people get data back but think it means something other than what it does.

For example, if I query a network bytes counter as a rate and select max downsampling, I would expect to get the max rate for each downsample bucket. Currently this is not true.

adam.l...@turn.com

Mar 2, 2016, 4:24:51 PM
to OpenTSDB, the.v....@gmail.com, ky...@kbrandt.com


On Wednesday, March 2, 2016 at 12:18:55 PM UTC-8, ManOLamancha wrote:
On Wednesday, March 2, 2016 at 11:01:36 AM UTC-8, adam.l...@turn.com wrote:

There absolutely needs to be a notion of missing data. Half the problems in this thread wouldn't exist if TSDB had a concept of missing data. Here's a trivial way to support it without a fixed polling interval: the first time a reporter pushes a metric it marks a "this is my first report" flag that introduces a barrier for that data.

We kinda have that already with the meta data system, it just needs better tuning for high-throughput uses. What else would you have the first metric report?
 
This isn't related to metadata in any way. When a machine restarts and the counter that used to be 200M is now 0, the metadata is the same. What I mean is that the reporter knows for sure that there is a hole in the data; all you need is that when it first reports a metric *since booting*, it marks the *datapoint* as first (not the metric). The point is to insert a break between the last time the metric was reported (i.e. before the restart) and the first report. That way TSDB knows 1) not to compute a rate across that hole and 2) not to interpolate across that hole.
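A sketch of what that barrier could look like at query time (purely hypothetical: OpenTSDB's schema has no such flag, and rateWithBarrier is an invented helper):

```go
package main

import "fmt"

// point carries a value plus the hypothetical "first report since boot"
// barrier flag described above.
type point struct {
	ts    int64
	val   float64
	first bool
}

// rateWithBarrier computes per-second rates but refuses to compute one
// across a barrier: the gap between the pre-restart and post-restart
// samples is a hole, not a measurement.
func rateWithBarrier(pts []point) []float64 {
	var out []float64
	for i := 1; i < len(pts); i++ {
		if pts[i].first {
			continue // hole in the data: no rate, no interpolation
		}
		out = append(out, (pts[i].val-pts[i-1].val)/float64(pts[i].ts-pts[i-1].ts))
	}
	return out
}

func main() {
	// The machine restarts after t=30; its first post-boot report is flagged.
	pts := []point{
		{0, 0, false}, {15, 150, false}, {30, 300, false},
		{300, 50, true}, // first report since boot: barrier
		{315, 200, false},
	}
	fmt.Println(rateWithBarrier(pts)) // [10 10 10]: nothing invented across the hole
}
```

Note this handles the restart case even when the first post-boot value is not zero, which the negative-delta heuristic cannot.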
 
Using the first derivative to detect resets is a heuristic; there are many cases where it fails. Those seem like corner cases, but they're not ignorable.

For a counter we now have the rate options that handle rollovers properly. What other cases are you thinking of?
 
The biggest case that's not handled properly is when the machine (aka the reporter) is restarted. Whenever I see rollover discussed with respect to rate, it's about integer overflow, which is a comparatively rare event in the world of 64-bit ints.

 
The current approach of introducing 0, or worse yet letting the interpolation do its thing, is inventing data. In my opinion, inventing data is a cardinal sin for any product that has "DB" in its name.

With 2.2 we now have the NaN handling for missing data so that has been helping a ton to detect upstream issues. And I agree, we definitely don't want to invent data when writing to storage but at query time, it can be nice to interpolate or invent the data in some situations. But we should always have the fallback of showing the raw data.


I disagree. Never invent data. Full stop. The correct way to handle missing data is to exclude it from computation. I've witnessed multiple cases where what's "nice" for you means I get incorrect plots. The counter+rate stuff at least has the courtesy to be obviously wrong, but the same problems occur with other metrics where the effect is subtle and easy to miss.

I understand that this is a big ask, because it needs to extend all the way to the UI layers.

adam.l...@turn.com

Mar 2, 2016, 4:38:27 PM
to OpenTSDB, the.v....@gmail.com, ky...@kbrandt.com
To be clear, the only parts I have a problem with are inserting zero and assuming interpolation is always a valid thing to do.

Kyle Brandt

Apr 18, 2016, 2:14:34 PM
to OpenTSDB, ky...@kbrandt.com
When it comes to the reordering and 2.3 vs. 2.4 and HBaseCon: not only is HBaseCon in May, but my birthday is in May as well. And I want proper counters for my birthday ... not sure if that changes priorities?

ManOLamancha

May 3, 2016, 4:17:36 PM
to OpenTSDB, the.v....@gmail.com, ky...@kbrandt.com
On Wednesday, March 2, 2016 at 1:24:51 PM UTC-8, adam.l...@turn.com wrote:
This isn't related to metadata in any way. When a machine restarts, and the counter that used to be 200M is now 0, the metadata is the same. What I mean is that the reporter knows for sure that there is a hole in the data; all you need is that when it first reports a metric *since booting* it marks the *datapoint* as first (not the metric). The point is to insert a break between the last time the metric was reported (i.e. before the restart) and the first report. That means that TSDB now knows to 1) not do rate across that hole and 2) not interpolate across that hole.

It could be handled, though; perhaps one way would be to insert an annotation between the last and the new report. The schema doesn't have a means of flagging the data point. Besides, some collectors may write 200M and then 50 instead of zero, depending on when the value was collected.
 
The biggest case that's not handled properly is when the machine (aka the reporter) is restarted. Whenever I see rollover discussed with respect to rate, it's about integer overflow, which is a comparatively rare event in the world of 64-bit ints.

Don't assume all counters are 64 bits. There are tons of apps and hardware deployed that still use 32-bit or even smaller counters. Failure to plan, and all that.

With 2.2 we now have the NaN handling for missing data so that has been helping a ton to detect upstream issues. And I agree, we definitely don't want to invent data when writing to storage but at query time, it can be nice to interpolate or invent the data in some situations. But we should always have the fallback of showing the raw data.


I disagree. Never invent data. Full stop. The correct way to handle missing data is to exclude it from computation. I've witnessed multiple cases where what's "nice" for you means I get incorrect plots. The counter+rate stuff at least has the courtesy to be obviously wrong, but the same problems occur with other metrics where the effect is subtle and easy to miss.
I understand that this is a big ask, because it needs to extend all the way to the UI layers.
 
We do need cross-series support for NaNs without downsampling, so hopefully we'll get that added soon. Then it's a matter of notifying users if we change an existing behavior, i.e. the default interpolation.

ManOLamancha

May 3, 2016, 4:19:16 PM
to OpenTSDB, ky...@kbrandt.com
On Monday, April 18, 2016 at 11:14:34 AM UTC-7, Kyle Brandt wrote:
When it comes to the reordering and 2.3 vs. 2.4 and HBaseCon: not only is HBaseCon in May, but my birthday is in May as well. And I want proper counters for my birthday ... not sure if that changes priorities?

Happy almost-birthday! And will 2.3 be enough for HBaseCon? :(

Kyle Brandt

Oct 13, 2016, 10:50:02 AM
to OpenTSDB, ky...@kbrandt.com
Is there someone I can bribe to get proper counter order of operations a priority item on the roadmap? :P

john....@skyscanner.net

Oct 18, 2016, 8:23:35 AM
to OpenTSDB, ky...@kbrandt.com
The proposal was to handle the counter problem by allowing the order of operations to be specified via the query API. That work is obviously a bit involved. Would it be a quicker interim step to add a config flag that globally makes rate calculation happen before downsampling?