GraphiteReporter reports "stale" metric values

Evgeniy Sorokin

Jan 5, 2015, 9:40:58 AM
to metric...@googlegroups.com
Hello everyone. As we know, GraphiteReporter's job is to iterate through the various kinds of metrics (Timer, Meter, etc.) registered in the MetricsRegistry and send their current values to the Graphite back-end, along with the current timestamp (current at the moment of the report cycle).

Everything works fine until you need to aggregate metrics reported from multiple instances.

For example, assume there are 2 web service instances running. Their resources are instrumented, so they have some Timers measuring handled-request latencies (InstrumentedHandler from metrics-jetty). Now let's assume we had a spike in incoming traffic to these services, and they both currently have some huge values (their current instant metric values):

instance1.get-requests.99percentile = 5000 ms
instance2.get-requests.99percentile = 3000 ms

The Graphite reporter would report 5000 ms and 3000 ms respectively. The Graphite back-end would aggregate them (assuming we use a max aggregation rule) to 5000 ms, which is correct.

After some time has passed and incoming traffic has become very low, only instance2 receives traffic now, and its percentile goes back to normal:

instance2.get-requests.99percentile = 200 ms

The problem is that the Graphite reporter on instance1 would still report 5000 ms (the latest 99th-percentile value, which remains the current value because instance1 no longer receives any traffic). So we would still get 5000 ms after aggregation, while the real picture is that latency is low (200 ms); instance1 is just reporting this "stale" metric.

Of course this example is rather contrived (another example could be metrics that are not tied to incoming traffic at all and measure some other distributed process, etc.), but I think it shows the point.

Has anyone else faced a similar issue? How do you aggregate your metrics?

A solution could be to add and maintain a "lastUpdate" timestamp attribute on all metrics, so that it would be possible, e.g., to filter metrics "that were updated no longer than 60 seconds ago" using a MetricFilter. Of course this would introduce gaps in the Graphite metric series, but that is better than having such "stale" values.
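As an illustrative sketch only (core Metrics does not track a last-update time, so this assumes the application records one itself whenever it updates a metric; the class and map names here are hypothetical):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricFilter;

public class RecentlyUpdatedFilter implements MetricFilter
{
    // Application code records an entry here alongside every
    // timer.update() / meter.mark() / counter.inc() call.
    public static final Map<String, Long> LAST_UPDATE_MILLIS = new ConcurrentHashMap<>();

    private final long maxAgeMillis;

    public RecentlyUpdatedFilter(long maxAge, TimeUnit unit)
    {
        this.maxAgeMillis = unit.toMillis(maxAge);
    }

    @Override
    public boolean matches(String name, Metric metric)
    {
        Long last = LAST_UPDATE_MILLIS.get(name);
        // Report only metrics that were updated within the last maxAge interval.
        return last != null && System.currentTimeMillis() - last <= maxAgeMillis;
    }
}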

Tyler Tolley

Jan 5, 2015, 8:13:10 PM
to metric...@googlegroups.com
We had a problem with our Carbon server getting overwhelmed with data, so I created a stale-data filter that doesn't send any metrics that haven't changed since the last time they were sent.
import java.util.Map;
import java.util.Objects;

import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Meter;
import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricFilter;
import com.codahale.metrics.Timer;
import com.google.common.collect.Maps;

public class StaticDataFilter implements MetricFilter
{
    // Last value seen per metric name, used to detect whether a metric has changed.
    private final Map<String, Object> previousValues = Maps.newHashMap();

    @Override
    public boolean matches(String name, Metric metric)
    {
        boolean retVal = true;
        Object prevValue = previousValues.get(name);

        // For count-based metrics the count acts as the change indicator;
        // for gauges the current value is compared directly.
        Object newValue = null;
        if (metric instanceof Gauge)
        {
            newValue = ((Gauge<?>) metric).getValue();
        }
        else if (metric instanceof Counter)
        {
            newValue = ((Counter) metric).getCount();
        }
        else if (metric instanceof Histogram)
        {
            newValue = ((Histogram) metric).getCount();
        }
        else if (metric instanceof Timer)
        {
            newValue = ((Timer) metric).getCount();
        }
        else if (metric instanceof Meter)
        {
            newValue = ((Meter) metric).getCount();
        }
        previousValues.put(name, newValue);

        if (prevValue != null)
        {
            // If the objects are not equal, the value has changed and should be reported.
            retVal = !Objects.equals(newValue, prevValue);
        }
        return retVal;
    }
}

Since most metrics keep a count of the number of recorded values, if the count hasn't changed then the associated reported metrics haven't changed either. The one exception is the gauge, but you can easily compare the gauge with the last reported gauge value. There is a small memory overhead to using this filter (since it keeps track of the previously reported values), but it can be used by any reporter to filter out stale data.
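Wiring such a filter into a reporter is just a builder call. A sketch against the Metrics 3.x GraphiteReporter builder, assuming an existing MetricRegistry named registry and a Graphite sender named graphite:

GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
        .filter(new StaticDataFilter())                // skip metrics that haven't changed
        .convertRatesTo(TimeUnit.SECONDS)
        .convertDurationsTo(TimeUnit.MILLISECONDS)
        .build(graphite);
reporter.start(1, TimeUnit.MINUTES);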

Ryan Rupp

Jan 5, 2015, 8:17:43 PM
to metric...@googlegroups.com
It sounds like maybe you want a SlidingTimeWindowReservoir with your Timer, or a similar type of reservoir that removes old entries without requiring new ones to have occurred (requiring new entries is how the ExponentiallyDecayingReservoir, which Timers use by default, behaves). For instance, if you use a sliding window of 2 minutes and no activity occurs, then after 2 minutes subsequent reports for the 99th percentile will be 0 (not to be confused with an actual timing of 0; rather, there are no entries). The downside of the time-window reservoir is that it is unbounded, so I haven't really used it much out of concern about a possible burst of timings; you could probably create a custom reservoir to handle that, though (basically a time window plus N-sample windows). Also, without reporting to another tool such as Graphite, the sliding time-window reservoir isn't that useful, since the average/max etc. are all tied to the snapshot of values, which only spans the length of your time window, so you would want that data to actually be going somewhere.
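For reference, a sketch of what registering such a Timer looks like in Metrics 3.x, assuming an existing MetricRegistry named registry (the reservoir has to be passed in explicitly, since the no-arg Timer constructor uses an ExponentiallyDecayingReservoir):

Timer getRequests = registry.register("get-requests",
        new Timer(new SlidingTimeWindowReservoir(2, TimeUnit.MINUTES)));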

Evgeniy Sorokin

Jan 6, 2015, 2:40:15 AM
to metric...@googlegroups.com
Cool, thanks for sharing. I was thinking about using the count to see whether a metric was updated, but as you said, we can't use it for gauges; having an update timestamp on all metrics would be a more general approach.

Evgeniy Sorokin

Jan 6, 2015, 3:25:26 AM
to metric...@googlegroups.com
Yes, it looks like SlidingTimeWindowReservoir could be an option; I can't change the Timer in InstrumentedHandler, though.

Justin Mason

Jan 6, 2015, 4:54:12 AM
to metric...@googlegroups.com
IMO the Metrics approach to reservoir decay/expiration isn't correct when publishing to a time-series store like Graphite.  When I see a data point for a 1-minute period in a graph, I want it to represent the data that occurred in _that period_ only -- not data from minutes (or even hours) before.  I blogged about this problem, and the workaround we use, here: http://taint.org/2014/01/16/145944a.html

This has come up several times before on this list -- would there be any interest in adding this Reporter (or something similar) to the official Metrics distro?

--j.


Evgeniy Sorokin

Jan 6, 2015, 5:24:04 AM
to metric...@googlegroups.com, j...@jmason.org
Thanks, good post. Isn't it possible to lose some data, though, if the GraphiteReporter takes the snapshot from the histogram to report, new values come in while the reporter is transmitting the data, and then you clear them all afterwards?

Justin Mason

Jan 6, 2015, 5:48:15 AM
to metric...@googlegroups.com
ah, yes, good point :(  I had overlooked that -- probably assuming that it was "good enough" and certainly better than the existing situation, but yeah it shouldn't really do that.

To fix this, I think it would be pretty easy to snapshot the values, clear the Timer, and then report afterwards...
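A minimal sketch of that ordering, assuming a Metrics 2.x Timer and Snapshot (where the Timer still exposes clear()); reportToGraphite() is a hypothetical placeholder for whatever actually transmits the values:

void reportAndReset(String name, Timer timer)
{
    Snapshot snapshot = timer.getSnapshot();   // capture this interval's values first
    timer.clear();                             // later updates land in the next interval
    reportToGraphite(name, snapshot);          // transmit after the timer has been cleared
}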

Panagiotis Partheniadis

Jan 7, 2015, 2:44:36 AM
to metric...@googlegroups.com
It would be interesting, though, not to view the issue from the publishing side only. Regardless of the reporting to Graphite, I think that a Timer that is queried for percentiles and provides values based on stale data is problematic and obviously misleading. I would prefer the metric to present nulls rather than a stale value.

Justin Mason

Jan 9, 2015, 7:00:17 PM
to metric...@googlegroups.com
Unfortunately I disagree ;)  I think the output aspect is key.

There are several outputs for Metrics. Some are time-series-data-oriented, and some are not.  Graphite and Ganglia, for instance, are time-series-oriented: measurements are written to them periodically, timestamped, and old measurements remain visible in the store.  In this situation, clearing the state between each time period's Histogram measurements makes sense, so that the periods are independent of each other -- the old measurements have already been written, and we gain more accurate data in the output store by clearing that state.

However, if I'm viewing the set of metrics using MetricsServlet (for example), I'm just seeing the _current_ state -- there's no way to see historical data using that interface. In that case, the min/max/mean/percentiles displayed for a Timer or Histogram should reflect a longer timescale than just the last 1 minute (or whatever), so that we gain a better idea of historical data.  I guess deferring to the Reservoir's default is appropriate in this case.

--j.



Justin Mason

Jan 12, 2015, 5:38:50 AM
to metric...@googlegroups.com
fyi: I've updated the class at https://gist.github.com/jmason/7024259 to use that approach.

It's still not immune to races -- it's possible for the recorded values of the timer/histogram to be inconsistent -- but only to the pre-existing degree of the rest of the code, at least. ;)

That class is Metrics 2.x-specific though; I think 3.x has a better design and would be more consistent in its data.  Once I get a chance (and we've upgraded to 3.x internally!) I'll come up with a 3.x port as well...

--j.


Panagiotis Partheniadis

Jan 12, 2015, 5:53:03 AM
to metric...@googlegroups.com, j...@jmason.org
I think you are mixing up the time window over which we report to Graphite (which matters for the time-series-oriented data you talk about) with the time window over which we gather and calculate values for a metric. So the 1 minute you mention is apparently a good value for the first type of window, but not for the second. You DO recognise, too, that there is some interval dt in there that we need to care about. I am talking about the second time window, though. Maybe 1 minute is rather aggressive, but shouldn't there be some dt beyond which the values reported by the metric have no meaning? In our case, the metric did not receive any updates for 2 days! You DO NOT have any "current state" when retrieving percentiles there! Traffic stopped 2 days ago and the percentile value still reflects the situation back then...

Marcin Biegan

Jan 12, 2015, 4:42:41 PM
to metric...@googlegroups.com, j...@jmason.org
Actually, what would be the best approach to achieve something like that in metrics-3? Timers don't seem to be resettable.

Justin Mason

Jan 12, 2015, 7:15:36 PM
to metric...@googlegroups.com
Good point, I hadn't spotted that yet :(  Sounds like this approach isn't likely to work without a patch in 3.x, which is unfortunate.

--j.

Panagiotis Partheniadis

Jan 13, 2015, 3:00:41 AM
to metric...@googlegroups.com, j...@jmason.org
The metric clear() methods are no longer present in Metrics 3.x. Do we know why it was decided to remove them? We use them in Metrics 2.x to clear things up.

Justin Mason

Mar 11, 2015, 12:19:23 PM
to metric...@googlegroups.com
BTW for people tracking this issue, the HDRHistogram metrics Reservoir implementation now has a fix for this, due in the next release:
https://bitbucket.org/marshallpierce/hdrhistogram-metrics-reservoir/issue/3/interval-histogramming

(thanks Marshall!)

--j.

Marshall Pierce

Mar 11, 2015, 11:55:19 PM
to metric...@googlegroups.com
1.1.0 is out with this fix included.

-Marshall

Marcin Biegan

Mar 14, 2015, 7:34:43 AM
to metric...@googlegroups.com, j...@jmason.org
Hystrix solves this problem nicely. It uses a circular buffer of K buckets, which implements a bucketed sliding time window. Each bucket stores the N latest measurements, and percentiles are calculated over a sample of up to N*K measurements.

E.g. if you report to Graphite every 20 seconds, you might use 10 buckets of capacity 100, each 2 seconds wide.

I like this approach because it combines a sliding time window with a sampling method that avoids storing all measurements in memory, while being simple to implement and understand. When I have some time I'll try to implement this and create a pull request.
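A rough sketch of that bucketed approach as a custom Metrics 3.x Reservoir (this is not the Hystrix implementation itself; the class name and the random-replacement policy used once a bucket fills up are illustrative assumptions, and it relies on the UniformSnapshot class from Metrics 3.1):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Reservoir;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.UniformSnapshot;

public class BucketedSlidingWindowReservoir implements Reservoir
{
    private final long[][] buckets;     // K buckets of up to N samples each
    private final int[] counts;         // samples seen by each bucket (may exceed N)
    private final long bucketNanos;     // width of one bucket
    private long currentBucketStart;
    private int currentBucket;

    public BucketedSlidingWindowReservoir(int numBuckets, int samplesPerBucket,
                                          long bucketWidth, TimeUnit unit)
    {
        this.buckets = new long[numBuckets][samplesPerBucket];
        this.counts = new int[numBuckets];
        this.bucketNanos = unit.toNanos(bucketWidth);
        this.currentBucketStart = System.nanoTime();
    }

    @Override
    public synchronized void update(long value)
    {
        advance();
        int seen = counts[currentBucket]++;
        if (seen < buckets[currentBucket].length)
        {
            buckets[currentBucket][seen] = value;   // bucket not full yet
        }
        else
        {
            // Bucket full: overwrite a random slot so it remains a uniform sample.
            int slot = ThreadLocalRandom.current().nextInt(buckets[currentBucket].length);
            buckets[currentBucket][slot] = value;
        }
    }

    @Override
    public synchronized Snapshot getSnapshot()
    {
        advance();
        List<Long> values = new ArrayList<Long>();
        for (int b = 0; b < buckets.length; b++)
        {
            int stored = Math.min(counts[b], buckets[b].length);
            for (int i = 0; i < stored; i++)
            {
                values.add(buckets[b][i]);
            }
        }
        // Empty when there has been no recent activity, so percentiles report 0
        // rather than a stale value.
        return new UniformSnapshot(values);
    }

    @Override
    public synchronized int size()
    {
        int total = 0;
        for (int b = 0; b < buckets.length; b++)
        {
            total += Math.min(counts[b], buckets[b].length);
        }
        return total;
    }

    // Rotate forward, clearing a bucket for every full bucket width that has elapsed.
    private void advance()
    {
        long now = System.nanoTime();
        while (now - currentBucketStart >= bucketNanos)
        {
            currentBucket = (currentBucket + 1) % buckets.length;
            counts[currentBucket] = 0;
            currentBucketStart += bucketNanos;
        }
    }
}

With the numbers above, that would be new Timer(new BucketedSlidingWindowReservoir(10, 100, 2, TimeUnit.SECONDS)): at most 1,000 samples, spanning the last 20 seconds.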

Panagiotis Partheniadis

Mar 16, 2015, 2:10:37 PM
to metric...@googlegroups.com
+1.
We already use Dropwizard + Hystrix metrics, and it would be great if we could use the same approach for how we handle latencies. The Hystrix way seems to be the way to go in most of our cases.