Hello everyone. As we know, GraphiteReporter's job is to iterate over the metrics (Timer, Meter, etc.) registered in the MetricsRegistry and send their current values to the Graphite back-end, along with the current timestamp (current as of the report cycle).
Everything works fine unless you need to aggregate metrics reported from multiple instances.
For example, assume there are two web service instances running. Their resources are instrumented, so they have Timers measuring the latency of handled requests (InstrumentedHandler from metrics-jetty). Now assume there was a spike in incoming traffic for these services, and both instances now show high values (current instantaneous metric values):
instance1.get-requests.99percentile = 5000 ms
instance2.get-requests.99percentile = 3000 ms
The Graphite reporter would report 5000 ms and 3000 ms respectively. The Graphite back-end would aggregate them to 5000 ms (assuming we use a max aggregation rule), which is correct.
After some time passes, incoming traffic becomes very low. Only instance2 receives traffic now, and its percentile gets back to normal:
instance2.get-requests.99percentile = 200 ms
The problem is that the Graphite reporter on instance1 would still report 5000 ms (the latest 99th percentile value, which remains the current value because instance1 does not receive any traffic). So we would still get 5000 ms after aggregation, while in reality latency is low (200 ms); it is just that instance1 keeps reporting this "stale" metric.
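To make the arithmetic concrete, here is a minimal sketch of the scenario in plain Java. It does not use Graphite or the metrics library at all; `MaxAggregationSketch` and `aggregateMax` are made-up names that only stand in for a max aggregation rule applied to the latest value reported by each instance.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a "max" aggregation rule over the values
// most recently reported by each instance. Not Graphite's actual code.
class MaxAggregationSketch {

    static double aggregateMax(Map<String, Double> reportedValues) {
        return Collections.max(reportedValues.values());
    }

    public static void main(String[] args) {
        Map<String, Double> reported = new LinkedHashMap<>();

        // During the spike both instances report fresh values.
        reported.put("instance1.get-requests.99percentile", 5000.0);
        reported.put("instance2.get-requests.99percentile", 3000.0);
        System.out.println(aggregateMax(reported)); // 5000.0, correct

        // Later: instance2 reports a fresh 200 ms, while instance1 gets no
        // traffic and simply re-reports its old 5000 ms on every cycle.
        reported.put("instance2.get-requests.99percentile", 200.0);
        System.out.println(aggregateMax(reported)); // still 5000.0, stale
    }
}
```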
Of course, this example is rather contrived (in another case, the metrics being measured might not be tied to incoming traffic at all and could represent some other distributed process), but I think it makes the point.
Has anyone else faced a similar issue? How do you aggregate your metrics?
One solution could be to add and maintain a "lastUpdate" timestamp attribute on all metrics, so that it would be possible, for example, to filter for metrics "that were updated no more than 60 seconds ago" using MetricFilter. Of course, this would introduce gaps in the Graphite metric series, but that is better than having such "stale" values.
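Roughly what I have in mind, as a self-contained sketch. Everything here is hypothetical: `TimestampedValue`, `FreshnessFilter` and `FreshnessFilterDemo` are made-up names, and nothing like "lastUpdate" exists on the library's metrics today. In a real setup the freshness check would presumably live inside a MetricFilter implementation passed to the reporter; here it is a plain class so the example runs without any dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical holder: a metric value plus the time it was last updated.
// Assumption: the real metrics would have to maintain this themselves.
class TimestampedValue {
    private volatile double value;
    private volatile long lastUpdateMillis;

    void update(double newValue, long nowMillis) {
        value = newValue;
        lastUpdateMillis = nowMillis;
    }

    double value() { return value; }
    long lastUpdateMillis() { return lastUpdateMillis; }
}

// Hypothetical freshness check: accepts only metrics updated recently enough.
class FreshnessFilter {
    private final long maxAgeMillis;

    FreshnessFilter(long maxAge, TimeUnit unit) {
        this.maxAgeMillis = unit.toMillis(maxAge);
    }

    boolean matches(TimestampedValue metric, long nowMillis) {
        return nowMillis - metric.lastUpdateMillis() <= maxAgeMillis;
    }
}

class FreshnessFilterDemo {
    // Returns only the values that would be sent in this report cycle.
    static Map<String, Double> report(Map<String, TimestampedValue> metrics,
                                      FreshnessFilter filter, long nowMillis) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, TimestampedValue> e : metrics.entrySet()) {
            if (filter.matches(e.getValue(), nowMillis)) {
                out.put(e.getKey(), e.getValue().value());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();

        TimestampedValue instance1 = new TimestampedValue();
        instance1.update(5000.0, now - TimeUnit.MINUTES.toMillis(5)); // stale

        TimestampedValue instance2 = new TimestampedValue();
        instance2.update(200.0, now - TimeUnit.SECONDS.toMillis(10)); // fresh

        Map<String, TimestampedValue> metrics = new LinkedHashMap<>();
        metrics.put("instance1.get-requests.99percentile", instance1);
        metrics.put("instance2.get-requests.99percentile", instance2);

        FreshnessFilter filter = new FreshnessFilter(60, TimeUnit.SECONDS);
        // Only instance2's value passes the filter; instance1 leaves a gap.
        System.out.println(report(metrics, filter, now));
    }
}
```

With the stale 5000 ms dropped, a max aggregation over what is actually reported would give 200 ms. The current time is passed in explicitly only to keep the sketch easy to test; a real implementation would more likely read a clock internally.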