How to handle changing metrics in the statsd_exporter


jackso...@gmail.com

Jul 19, 2017, 1:26:02 PM
to Prometheus Developers
First off, I wanted to say Prometheus is awesome :) We are currently moving our monitoring/metrics infrastructure from statsd/graphite to Prometheus, and the move has been great.


Since we are coming from statsd, we use the statsd_exporter pretty heavily (it makes the transition significantly less painful). While working on this, we have hit some artificial pain points caused by restrictions in the golang Prometheus client (which the statsd_exporter uses).


Some background for those unfamiliar with statsd or the exporter: statsd is a mechanism in which (as a gross over-simplification) the application fires UDP packets containing metrics at a specific endpoint whenever it likes. That endpoint then ingests and stores the metrics. The statsd_exporter is a process that can act as that "statsd endpoint" and convert metrics from the statsd format to the Prometheus format.
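
To make that concrete, here is a minimal Go sketch of the fire-and-forget UDP send a statsd client performs. The address and metric names are made up for illustration; the "name:value|type" line format and UDP port 8125 are standard statsd conventions:

    package main

    import "net"

    func main() {
        // Address is illustrative; 8125 is statsd's conventional UDP port.
        conn, err := net.Dial("udp", "statsd.example.com:8125")
        if err != nil {
            return // fire-and-forget: real clients silently drop on error
        }
        defer conn.Close()

        // Classic statsd line format: "name:value|type".
        conn.Write([]byte("myapp.requests:1|c"))        // counter increment
        conn.Write([]byte("myapp.request_time:320|ms")) // timer, in ms
    }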

Now for the actual problem! In statsd-land, the metrics aren't necessarily coming from a single process or application. As such, we run into two basic problems:

#1 Metrics change
As application code changes, the metric tags change with it. This isn't an issue if the application uses the Prometheus client directly, since the client is restarted along with the application. In the statsd case that isn't true: the Prometheus client lives in the exporter, a separate process, and isn't restarted. This means that if we change metric "foo" from tags {a=1} to {a=1,b=2} (adding a new tag), the new metric will never show up.
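
As a concrete illustration (a standalone sketch, not exporter code), here is how the golang client rejects the second shape of "foo" once the first is registered:

    package main

    import (
        "fmt"

        "github.com/prometheus/client_golang/prometheus"
    )

    func main() {
        reg := prometheus.NewRegistry()

        // First version of the app emits "foo" with label "a".
        v1 := prometheus.NewCounterVec(
            prometheus.CounterOpts{Name: "foo", Help: "example"},
            []string{"a"},
        )
        if err := reg.Register(v1); err != nil {
            fmt.Println("v1:", err) // first registration succeeds
        }

        // After a deploy, the app emits "foo" with labels "a" and "b".
        v2 := prometheus.NewCounterVec(
            prometheus.CounterOpts{Name: "foo", Help: "example"},
            []string{"a", "b"},
        )
        if err := reg.Register(v2); err != nil {
            // Fails: same fully-qualified name, different label dimensions.
            fmt.Println("v2:", err)
        }
    }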

#2 Inconsistent metric tagging
Since the statsd_exporter aggregates based on the "name" of the metric, it is entirely possible for two applications to emit the same metric with different tag sets (due to rolling out new code or whatnot). This means the exporter needs to support not just one tag set at a given time, but N. I've verified in Prometheus that if the exporter were to expose N (name/tag) sets, they would all show up and queries would work as you'd expect.
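
For example, a single scrape exposing both shapes at once (values made up) would look like this, and Prometheus ingests both series:

    # TYPE foo counter
    foo{a="1"} 10
    foo{a="1",b="2"} 3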


So, what does this mean? For the statsd_exporter I propose two changes.

Change #1: optional TTL for metrics
Since the metrics change over time (due to client code changing), we want to eventually TTL out those "dead" metrics instead of emitting the last-seen value forever. I specifically want to leave the TTL optional, so the change stays backwards compatible.
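
A rough sketch of the expiry bookkeeping I have in mind (all names here are hypothetical, not actual exporter code):

    package statsdttl

    import (
        "sync"
        "time"
    )

    // ttlEntry and ttlStore are hypothetical names for illustration only.
    type ttlEntry struct {
        value    float64
        lastSeen time.Time
    }

    type ttlStore struct {
        mu      sync.Mutex
        ttl     time.Duration // zero means "never expire", i.e. today's behavior
        metrics map[string]*ttlEntry
    }

    // observe records a value and refreshes the last-seen timestamp.
    func (s *ttlStore) observe(key string, v float64) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.metrics[key] = &ttlEntry{value: v, lastSeen: time.Now()}
    }

    // sweep drops metrics not seen within the TTL; run it periodically.
    func (s *ttlStore) sweep() {
        if s.ttl == 0 {
            return
        }
        s.mu.Lock()
        defer s.mu.Unlock()
        for key, e := range s.metrics {
            if time.Since(e.lastSeen) > s.ttl {
                delete(s.metrics, key)
            }
        }
    }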

Change #2: allow registering metrics with "conflicting" tag sets
The restriction in the golang Prometheus client is intended to help application owners/writers keep the metrics within their application consistent. Since the statsd_exporter aggregates many apps, this restriction is not terribly helpful there. From looking at the code, it seems we'd need to either (1) create a new registry type or (2) add an option to bypass that check (similar to how pedantic checks work).


To be clear (since this is becoming a fairly long post): the changes to the golang Prometheus client would be completely opt-in, behind a new registry constructor (similar to how NewPedanticRegistry works; we could call it "AggregatorRegistry" or "LaxRegistry").
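
To make the proposed shape concrete: NewLaxRegistry below does not exist in client_golang today; it is just the constructor this post is proposing, used the same way NewRegistry is:

    // Hypothetical API: prometheus.NewLaxRegistry() is the proposed
    // constructor; it does not exist in client_golang today.
    reg := prometheus.NewLaxRegistry()

    // Unlike with NewRegistry, both shapes of "foo" would register fine.
    reg.MustRegister(prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "foo", Help: "example"},
        []string{"a"},
    ))
    reg.MustRegister(prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "foo", Help: "example"},
        []string{"a", "b"},
    ))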

Brian Brazil

Jul 19, 2017, 1:38:56 PM
to jackso...@gmail.com, Prometheus Developers
The way I'd approach this would be to have a statsd_exporter per application instance, restarted whenever the application is. This avoids these issues, as well as the usual scaling issues with statsd.



Thomas Jackson

Jul 19, 2017, 2:01:53 PM
to Brian Brazil, Prometheus Developers
That would significantly increase complexity, as well as break from the statsd mold. The long-term plan, of course, is to move to Prometheus directly. This exporter should remain (IMO) a drop-in replacement for statsd endpoints.

Julius Volz

Jul 19, 2017, 5:04:44 PM
to jackso...@gmail.com, Prometheus Developers
On Wed, Jul 19, 2017 at 7:26 PM, <jackso...@gmail.com> wrote:
[...]

Change #1: optional TTL for metrics
Since the metrics change over time (due to client code changing), we want to eventually TTL out those "dead" metrics instead of emitting the last-seen value forever. I specifically want to leave the TTL optional, so the change stays backwards compatible.

Yeah, that'd be more consistent with StatsD's aggregation interval and final metrics output, though it's a bit icky: too long a TTL means stale metrics stick around for a long time; too short a TTL means you'll produce lots of disappearing/resetting/reappearing counter metrics (for counters that don't get incremented at least once per TTL interval), which also produces garbage data. The StatsD/Prometheus concepts just don't align well here.

I'm unsure about this. I see the argument that one just wants a single StatsD drop-in replacement rather than running multiple exporters, and that StatsD metrics fundamentally work that way (no new data points get written out for them if none were received in the last aggregation interval). On the other hand, I'd be worried that users will not understand that they shouldn't set an optional TTL to something as short as a usual StatsD aggregation interval, but rather to something more like 30 minutes. Otherwise they'll get garbage metrics and wonder why (which in turn frustrates people and increases support load).
 
Change #2: allow registering metrics with "conflicting" tag sets
The restriction in the golang Prometheus client is intended to help application owners/writers keep the metrics within their application consistent. Since the statsd_exporter aggregates many apps, this restriction is not terribly helpful there. From looking at the code, it seems we'd need to either (1) create a new registry type or (2) add an option to bypass that check (similar to how pedantic checks work).


To be clear (since this is becoming a fairly long post): the changes to the golang Prometheus client would be completely opt-in, behind a new registry constructor (similar to how NewPedanticRegistry works; we could call it "AggregatorRegistry" or "LaxRegistry").

I don't think we'd want to change the Go client library to support this kind of thing, but if we do decide to add TTLs and flexible label sets to the StatsD Exporter, perhaps we'd change the exporter to keep its own completely custom metrics state and then use ConstMetrics to bridge that state out during each scrape, as normal exporters do.
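
Sketching what that could look like (hypothetical names throughout; this assumes a client_golang version where a Collector whose Describe sends no descriptors is treated as "unchecked" and therefore not label-checked at registration):

    package statsdbridge

    import (
        "sync"

        "github.com/prometheus/client_golang/prometheus"
    )

    // series and bridge are hypothetical names for illustration only.
    type series struct {
        labelNames  []string
        labelValues []string
        value       float64
    }

    // bridge holds the exporter's own state for one metric name and
    // exposes it as ConstMetrics at scrape time.
    type bridge struct {
        mu     sync.Mutex
        name   string
        series map[string]series // keyed by serialized label set
    }

    // Describe intentionally sends nothing, making this an "unchecked"
    // collector so mixed label sets are not rejected up front.
    func (b *bridge) Describe(ch chan<- *prometheus.Desc) {}

    func (b *bridge) Collect(ch chan<- prometheus.Metric) {
        b.mu.Lock()
        defer b.mu.Unlock()
        for _, s := range b.series {
            desc := prometheus.NewDesc(b.name, "bridged from statsd", s.labelNames, nil)
            ch <- prometheus.MustNewConstMetric(
                desc, prometheus.GaugeValue, s.value, s.labelValues...)
        }
    }

With state held in a map like this, TTL expiry simply becomes deleting stale entries, and no client-library changes are needed.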

Thomas Jackson

Jul 20, 2017, 8:52:54 AM
to Julius Volz, Prometheus Developers
I think that's totally reasonable. If the mixed-label stuff doesn't seem useful to the main client, I'll make another registry in the statsd_exporter that is sufficiently lax (and leave it as an exported struct in case it is useful to others). As for the TTL on metrics, it will be off by default (since people are already using the exporter as-is). I'll add a section to the README, alongside the option docs, describing the issue and the tradeoffs.

