Event-driven reporting


Ben Tatham

Jun 21, 2016, 10:44:56 AM
to metrics-user
I've been doing a lot of work with Grafana (fed by Graphite and InfluxDB) lately, with data from Metrics and other sources.  I've noticed a fairly large philosophical divide in how to use the system as a whole.

Here, in Metrics, the library tends to do a lot of the statistical calculations in the application itself, using reservoirs, EWMAs, etc.
Many of the functions available in Graphite/InfluxDB can do those calculations for you if you push all the raw data into the TSDB itself: per-second rates, histogram plots, moving averages, etc.  (I don't see EWMA explicitly, but I'm sure that could be added there as well.)

Now, there is clearly a trade-off here: more CPU/memory usage in the application versus more storage in the TSDB and more network traffic from sending every event.

Note that this concerns metrics that track the time it takes to do some action (like the request/response time of a web server) or other similar event-based metrics; the things that Meters and Timers are good at in Metrics.  However, in Metrics I don't see a way to push each "event" to the Reporter.  Therefore, as developers, we cannot change our side of the trade-off without changing the entire choice of metrics library in our application.

To me, having the power to send all the raw data into the TSDB, and then being able to "play" with that data later (without an application update) using all those downstream tools, would be great.

I guess my question is this: was this choice intentional?  Has there been talk of supporting both scheduled and event-driven reporting (with batching)?  Looking at the influxdb-java library, it seems their system is set up to batch events like this, but of course they don't have a full MetricRegistry system in front of it.
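
For example, the batching in influxdb-java looks roughly like this (a sketch from memory of their 2.x client; the measurement, tag, database, and retention policy names are just placeholders):

    import java.util.concurrent.TimeUnit;
    import org.influxdb.InfluxDB;
    import org.influxdb.InfluxDBFactory;
    import org.influxdb.dto.Point;

    public class BatchedEventWriter {
        public static void main(String[] args) {
            InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "user", "password");
            // Buffer up to 2000 points and flush at least every 100 ms.
            influxDB.enableBatch(2000, 100, TimeUnit.MILLISECONDS);

            // One point per "event", e.g. per handled request.
            Point point = Point.measurement("http_request")            // placeholder measurement name
                    .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                    .tag("endpoint", "/users")                         // placeholder tag
                    .addField("duration_ms", 42L)
                    .build();
            influxDB.write("mydb", "autogen", point);                  // placeholder database/retention policy
        }
    }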

Thoughts?

Ryan Rupp

Jun 22, 2016, 10:45:08 AM
to metrics-user
This was most likely intentional; what you're looking for is proposed for Metrics 4.0 here. Raw event data is generally preferable because it lets you do after-the-fact aggregations, e.g. slicing across tags for different timings, but the performance concern and the requirement for a TSDB that can ingest a potentially high frequency of events hold this back. Also, there are some simple use cases where people are probably just viewing the local aggregations via JMX or the metrics servlet without an external tool and benefit from derived statistics on the Java process. So local aggregation, in my mind, is generally for performance.

Unfortunately, local aggregation doesn't work well with things like percentiles: you generate some percentiles, send those to the TSDB, and then expect to do things on the TSDB like aggregating those timings across a cluster of nodes, but taking an aggregate of an aggregate is not accurate. There is a post about this for Netflix Servo (a library similar to Metrics) on using bucket timers (tracking counts within a given set of time ranges; a rough sketch of the idea is below) instead of percentiles, because that type of aggregation lets them roll up to the cluster level; see the post here. Additionally, they've experimented with capturing the raw data in the form of T-Digests, but that comes with performance overhead, mentioned briefly here.

I think capturing the raw data for events is maybe practical if it's opt-in for certain lower-frequency timers/meters. Then again, there are stats aggregators like StatsD that are designed around this philosophy: your application sends raw events to an external StatsD instance, and StatsD aggregates them across a timeframe (e.g. avg/min/max over a 1-minute window) and then pushes them downstream to a TSDB. When I was looking into this a while back, though, I got the impression that StatsD clients generally perform some type of local aggregation/sampling to guard against very high-frequency updates, to lower network utilization and avoid choking StatsD with millions of timings per second.
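
To make the bucket-timer idea concrete, here is a rough, hypothetical sketch (the BucketTimer class and its bucket boundaries are made up, not Servo's API): each node keeps only plain counts per latency bucket, and plain counts can simply be summed across a cluster, which pre-computed percentiles cannot.

    import java.util.concurrent.atomic.AtomicLongArray;

    /** Hypothetical bucket timer: counts observations per latency range. */
    public class BucketTimer {
        // Upper bounds (ms) for each bucket; the last bucket is "everything above".
        private static final long[] BOUNDS_MS = {10, 50, 100, 500, 1000};
        private final AtomicLongArray counts = new AtomicLongArray(BOUNDS_MS.length + 1);

        public void record(long durationMs) {
            int i = 0;
            while (i < BOUNDS_MS.length && durationMs > BOUNDS_MS[i]) {
                i++;
            }
            counts.incrementAndGet(i);
        }

        /** Per-bucket counts are plain sums, so they can be added across nodes,
            unlike a p99 computed locally on each node. */
        public long[] snapshot() {
            long[] out = new long[counts.length()];
            for (int i = 0; i < out.length; i++) {
                out[i] = counts.get(i);
            }
            return out;
        }
    }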

Logging and APM tools (New Relic, AppDynamics, etc.) have a similar problem. If it were practical from a performance/overhead standpoint, you'd collect "all the things!": every request profiled/traced, full debug logging on, and then you'd have that data and could analyze it after the fact. But overhead is heavily scrutinized with these monitoring tools, so sampling or local aggregation techniques are used to minimize it. Of course, with better hardware and technology this becomes more practical; newish tools like Elasticsearch/InfluxDB come to mind that can ingest a ton of data, along with library improvements like better asynchronous logging in Log4j 2 to reduce overhead.

Marshall Pierce

Jun 22, 2016, 10:57:08 AM
to metric...@googlegroups.com
I think that metrics-with-local-aggregation and metrics-that-stream-every-event are two different use cases. They’re both valid; they’re just very different.

A small digression re percentiles and Servo’s timing buckets — the ultimate goal of HdrHistogram (and my reservoir that uses it) is to have the HdrHistogram encoded format be shoved over the wire, because you CAN combine that data structure (e.g. have one per second or minute and add them later to get per-hour or per-day data). It’s space efficient and fast, and has implementations in various languages (notably C with various bindings to other languages, and I have a mostly-complete pure Rust port).
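
For example, merging histograms with the real org.HdrHistogram API looks roughly like this (the value-range setup and node names here are just an illustration):

    import org.HdrHistogram.Histogram;

    public class HistogramMergeExample {
        public static void main(String[] args) {
            // Track values from 1 ns up to 1 hour, with 3 significant digits of precision.
            Histogram nodeA = new Histogram(3_600_000_000_000L, 3);
            Histogram nodeB = new Histogram(3_600_000_000_000L, 3);

            nodeA.recordValue(12_000_000L);   // 12 ms, recorded on node A
            nodeB.recordValue(250_000_000L);  // 250 ms, recorded on node B

            // Histograms add losslessly, so per-second/per-node histograms
            // can be combined later into per-hour or cluster-wide views.
            Histogram combined = new Histogram(3_600_000_000_000L, 3);
            combined.add(nodeA);
            combined.add(nodeB);

            System.out.println("p99 (ns): " + combined.getValueAtPercentile(99.0));
        }
    }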

Also note that systems like Dapper (and its offshoots like Zipkin and OpenTracing) are designed around yet other goals: capturing timing or other data in detail for the work required for one request in a distributed system.

These are all useful types of data; they're just… different. A Counter I can safely call hundreds of thousands of times per second and not worry about it. Anything that writes to the network would be more problematic (I don't want to be making a syscall that often, for instance), but that doesn't mean it's not useful to record some other, much lower-frequency event by writing to InfluxDB. Ideally we would use data structures like HdrHistogram for everything and then we wouldn't care about having to aggregate server-side, but that's not always feasible. And then, orthogonally to that, I might have some OpenTracing trace data being collected that's eventually bound (à la SLF4J) to some tracing backend.

My vote would be that Metrics (for consistency in naming) should stay geared towards the "aggregate locally" use case, because that's the one it has targeted historically. However, if we want to have a separate thing ("Metrics Streaming"?) that's more of a tidy wrapper in front of InfluxDB and friends, I don't see anything wrong with that.

Patrick Valsecchi

Jun 23, 2016, 3:51:31 AM
to metric...@googlegroups.com
Hi,

Having two interfaces, one for "aggregate locally" and one for "streaming" metrics, would be annoying for users, because the current API suits both use cases perfectly. I really don't see why we should split this into two libraries and force users to change their code.

I spent 20 minutes on a little prototype that I've put in this branch:
https://github.com/pvalsecc/metrics/tree/streaming

It only covers timers and is untested, but you get the idea. It would have zero performance impact for users who don't want streaming and no impact on the API.
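
Roughly, the idea is something like the sketch below (illustrative only, not the exact code in the branch; TimerEventListener and StreamingTimer are made-up names, and only com.codahale.metrics.Timer and its update(long, TimeUnit) method are real Metrics API):

    import com.codahale.metrics.Timer;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.TimeUnit;

    /** Hypothetical listener that receives every individual timer update. */
    interface TimerEventListener {
        void onUpdate(String name, long duration, TimeUnit unit);
    }

    /** Sketch of a Timer that forwards each raw event in addition to the usual
        local aggregation. With no listeners registered, the extra cost is just
        iterating an empty list. */
    class StreamingTimer extends Timer {
        private final String name;
        private final List<TimerEventListener> listeners = new CopyOnWriteArrayList<>();

        StreamingTimer(String name) {
            this.name = name;
        }

        void addListener(TimerEventListener listener) {
            listeners.add(listener);
        }

        @Override
        public void update(long duration, TimeUnit unit) {
            super.update(duration, unit);          // keep the existing local aggregation
            for (TimerEventListener l : listeners) {
                l.onUpdate(name, duration, unit);  // stream the raw event (batching would go here)
            }
        }
    }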

CU.