First off -- I wanted to say Prometheus is awesome :) We are currently moving our monitoring/metrics infrastructure from statsd/graphite to Prometheus and it has been a great experience.
Since we are coming from statsd we use the statsd exporter pretty heavily (it makes the transition significantly less painful). While working on this it seems there are some artificial pain points caused by restrictions in the golang Prometheus client (which the statsd_exporter uses).
Some background for those unfamiliar with statsd or the exporter: statsd is a mechanism in which (as a gross over-simplification) the app fires UDP packets containing metrics at a specific endpoint whenever it wants. That endpoint then ingests and stores the metrics. The statsd_exporter is a process that can act as that "statsd endpoint" and converts metrics from the statsd format to the Prometheus format.
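To make that concrete, here's a minimal sketch of what "firing a UDP packet with a metric in it" looks like from the application side. The address is a placeholder (9125 is the usual statsd listen port for the statsd_exporter, but yours may differ) and the metric name is made up:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Placeholder address: wherever your statsd_exporter (or statsd) listens for UDP.
	conn, err := net.Dial("udp", "127.0.0.1:9125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// A statsd line is just "<name>:<value>|<type>" -- here, a counter increment.
	// The app fires it and forgets; the endpoint ingests and stores it.
	fmt.Fprint(conn, "myapp.requests:1|c")
}
```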
For the actual problem! In statsd-land -- the metrics aren't necessarily coming from a single process or application. As such we run into two basic problems:
#1 metrics change
As application code changes, the metric tags change. This isn't an issue if the application uses the Prometheus client directly -- the client is restarted together with the application. In the statsd case that isn't true, since the Prometheus client lives in the exporter, a separate process that isn't restarted. So if we change metric "foo" from tags {a=1} to {a=1,b=2} (adding a new tag), the new metric will never show up, because the client rejects a metric whose tag set differs from the one already registered under that name.
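Here's a minimal sketch of that behaviour against the golang client (not exporter code, and the metric/label names are made up): once "foo" is registered with label set {a}, a second registration of "foo" with {a, b} is rejected, so a long-running exporter never exposes the new shape:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// First deploy of the app: foo is emitted with tag {a}.
	fooV1 := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "foo", Help: "example metric"},
		[]string{"a"},
	)
	fmt.Println("register foo{a}:", reg.Register(fooV1)) // no error

	// Later deploy: foo now carries {a,b}. In a long-running exporter the old
	// registration is still present, so this second registration is rejected
	// and the new tag shape never shows up on the /metrics page.
	fooV2 := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "foo", Help: "example metric"},
		[]string{"a", "b"},
	)
	fmt.Println("register foo{a,b}:", reg.Register(fooV2)) // inconsistent label names error
}
```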
#2 inconsistent metrics tagging
Since the statsd exporter aggregates based on the "name" of the metric, it is completely possible that 2 applications emit the same metric with different tag sets (due to rolling out new code or whatnot). That means we don't just need to support one tag set at a given time -- we need to support N (in the statsd exporter). I've verified in Prometheus that if the exporter were to expose N (name/tag) sets they all show up and queries work as you'd expect.
So -- what does this mean? For the statsd_exporter I propose 2 changes.
Change #1 -- optional TTL of metrics
Since the metrics change over time (due to client code changing) we want to eventually TTL out those "dead" metrics instead of emitting the last-seen value forever. I specifically want the TTL to be optional (and off by default), so the change stays backwards compatible.
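A rough sketch of the shape I have in mind (not actual exporter code; all names here are made up): track a last-seen timestamp per metric and sweep out anything older than the configured TTL, with a zero TTL meaning "never expire" so existing setups behave exactly as today:

```go
package ttlsketch

import (
	"sync"
	"time"
)

// tracker remembers when each metric was last updated. A ttl of 0 disables
// expiry entirely, preserving today's behaviour for backwards compatibility.
type tracker struct {
	mu       sync.Mutex
	ttl      time.Duration
	lastSeen map[string]time.Time
}

func newTracker(ttl time.Duration) *tracker {
	return &tracker{ttl: ttl, lastSeen: make(map[string]time.Time)}
}

// touch is called whenever an incoming statsd line updates a metric.
func (t *tracker) touch(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[name] = time.Now()
}

// sweep returns the metrics whose last update is older than the TTL; the
// exporter would stop exposing these instead of emitting their last value forever.
func (t *tracker) sweep() []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.ttl == 0 {
		return nil
	}
	var stale []string
	for name, seen := range t.lastSeen {
		if time.Since(seen) > t.ttl {
			stale = append(stale, name)
			delete(t.lastSeen, name)
		}
	}
	return stale
}
```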
Change #2 -- allow registering metrics with "conflicting" tag sets
The restriction in the golang Prometheus client is intended to help application owners/writers keep the metrics within their application consistent. Since the statsd_exporter aggregates many apps, this restriction is not terribly helpful there. From looking at the code it seems we'd need to either (1) create a new registry type or (2) add an option to bypass that check (similar to how the pedantic checks work).
To be clear (since this is becoming a fairly long post): the changes to the golang Prometheus client would be completely optional, behind a "new" registry constructor (similar to how NewPedanticRegistry works -- we could call it "AggregatorRegistry" or "LaxRegistry").
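For reference, this is the existing pattern I'm pointing at -- NewPedanticRegistry opts a registry into extra checks; the proposal is simply the mirror image. The constructor names below (NewLaxRegistry / NewAggregatorRegistry) are hypothetical, nothing like them exists today:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Exists today: a registry with *additional* consistency checks.
	pedantic := prometheus.NewPedanticRegistry()
	_ = pedantic

	// Proposed (hypothetical, does not exist): a registry that skips the
	// "same name => same tag set" check, for aggregators like the
	// statsd_exporter that collect metrics from many independent apps.
	//
	//   lax := prometheus.NewAggregatorRegistry() // or NewLaxRegistry()
}
```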