Handling removal of obsolete metrics


Marco Jantke

Nov 27, 2017, 4:25:20 AM
to Prometheus Users
Hi everyone,

I am working on improving the Prometheus metrics integration of the Traefik project.

One thing that I want to introduce is a traefik_backend_server_up metric, a Gauge with two labels: backend and url. For each server Traefik knows about, it will have either the value 1 (up) or 0 (down, e.g. due to a failing health check). The Traefik configuration is loaded dynamically, and backends and servers can come and go. In that respect Traefik is basically stateless: it only knows about the currently running configuration, not any previous one. My problem is: once Traefik has learned about a server, I call Set(1) on the GaugeVec. This makes the concrete metric (with that backend/url combination) appear on the /metrics endpoint, but it will never go away until Traefik is restarted.
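For illustration, this is roughly the pattern (a minimal sketch, not the actual Traefik code; the backend name and URL are made up):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var backendServerUp = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "traefik_backend_server_up",
		Help: "Whether a backend server is up (1) or down (0).",
	},
	[]string{"backend", "url"},
)

func main() {
	prometheus.MustRegister(backendServerUp)

	// As soon as a server shows up in the dynamic configuration,
	// a child series is created ...
	backendServerUp.WithLabelValues("backend1", "http://10.0.0.1:8080").Set(1)

	// ... and it keeps being exposed on /metrics even after the server
	// disappears from the configuration, until the process is restarted.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8081", nil)
}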

The idea of backend_server_up is to create an alert like the following (and potentially even more fine-grained ones, depending on the concrete backend):

traefik_backend_healthy_percent =
  round(100 * sum(traefik_backend_server_up) / count(traefik_backend_server_up), 0.01)

ALERT TraefikLowBackendHealthyPercentage
  IF min(traefik_backend_healthy_percent) < 95
  FOR 5m
  .....

The above-mentioned problem means that Traefik never forgets a traefik_backend_server_up series it has once exposed, which leads to a wrong result for this calculation and hence for the alert.

Has anyone else had a similar problem, and how did you approach it?

I am happy to provide more information about the problem!

Best
Marco

PS: I don't have this problem only with the traefik_backend_server_up metric, but I guess the pattern is the same for the others.
PPS: I am happy to improve the subject of the thread in case you have suggestions; I didn't know how to summarize the problem properly.

Ben Kochie

Nov 27, 2017, 4:31:55 AM
to Marco Jantke, Prometheus Users
For this up metric, I think you probably want to use prometheus.MustNewConstMetric() instead of a normal gauge metric.  This will allow you to dynamically generate the list of backends.  But you will have to gather this data on every scrape.
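Roughly like this (just a sketch; the server struct and the backendServers() helper are hypothetical stand-ins for however Traefik enumerates its current configuration):

package main

import "github.com/prometheus/client_golang/prometheus"

var serverUpDesc = prometheus.NewDesc(
	"traefik_backend_server_up",
	"Whether a backend server is up (1) or down (0).",
	[]string{"backend", "url"},
	nil,
)

// server is a hypothetical stand-in for Traefik's view of one backend server.
type server struct {
	Backend, URL string
	Healthy      bool
}

// backendServers would return the servers of the currently loaded configuration.
func backendServers() []server { return nil }

type backendCollector struct{}

func (backendCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- serverUpDesc
}

func (backendCollector) Collect(ch chan<- prometheus.Metric) {
	// The metric is regenerated from the current configuration on every
	// scrape; servers that no longer exist are simply not emitted.
	for _, s := range backendServers() {
		up := 0.0
		if s.Healthy {
			up = 1.0
		}
		ch <- prometheus.MustNewConstMetric(
			serverUpDesc, prometheus.GaugeValue, up, s.Backend, s.URL,
		)
	}
}

func main() {
	prometheus.MustRegister(backendCollector{})
}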


Marco Jantke

Nov 27, 2017, 5:35:18 AM
to Prometheus Users
Thanks for the quick reply. I already suspected this was the way to go.

Other ideas for how I could avoid having to build my own sampling logic are still welcome :) I have the problem described above with a lot of metrics, e.g. also traefik_backend_http_requests_total (a counter) or traefik_backend_req_duration_seconds (a histogram), which is data I cannot gather on the fly.


Ben Kochie

Nov 27, 2017, 7:11:35 AM
to Marco Jantke, Prometheus Users
Yes, it's tricky. The library is limited in this way because we want to keep developers from abusing it in ways they shouldn't.

I don't have any good answers for you; maybe one of the other developers can comment on how best to handle this.

From my SRE perspective, I don't see a lot of value in tracking backend instance metrics inside Traefik directly. In my recommended config I would get those HTTP metrics directly from the backends. Only the up-ness seems useful for comparing.

For backends, I really only care about traffic per backend pool in aggregate. That should have much lower label churn, so having those series stick around is less of a problem.


Björn Rabenstein

Nov 29, 2017, 8:17:20 AM
to Marco Jantke, Prometheus Users
On 27 November 2017 at 11:35, Marco Jantke <marco....@gmail.com> wrote:
>
> Other ideas how I could prevent having to build my own sampling logic are
> still welcome :) I have the problem described above with a lot of metrics,
> also e.g. traefik_backend_http_requests_total (counter) or
> traefik_backend_req_duration_seconds (histogram), which is data I can not
> gather on the fly.

In general, the happy case is that the metrics a binary exposes are static
throughout the lifetime of the binary. That's why you should avoid
metrics suddenly popping up (see
https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
) and initialize metric vectors with all the possible label values. In
the Go client, that can be done with
`someMetricVec.With(prometheus.Labels{"foo": "bar"})`; you don't even
need to call `Set` or anything, the metric child will be created with
0 as its starting value. See also the last paragraph of
https://prometheus.io/docs/instrumenting/writing_clientlibs/#labels
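For example (a minimal, generic sketch with made-up metric and label names):

package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	opsTotal := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "some_operations_total",
			Help: "Operations processed, by outcome.",
		},
		[]string{"outcome"},
	)
	prometheus.MustRegister(opsTotal)

	// Pre-create the children for all known label values so they are
	// exposed with value 0 from the start; no Inc or Add call is needed.
	for _, outcome := range []string{"success", "failure"} {
		opsTotal.With(prometheus.Labels{"outcome": outcome})
	}
}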

In reality, the happy case is not always feasible. Most of the time,
that's because you cannot predict all possible label values, or there
would just be too many (proverbial example: all possible HTTP status
codes). It's much rarer that you actually want to _remove_ a metric
child from the vector. As Ben said, if you just want to mirror metrics
from a 3rd party source, the "one shot" const metrics are the way to
go. If you still end up with the need to remove a metric child from a
metric vector, you can use the `Delete` method (in the Go client), see
https://godoc.org/github.com/prometheus/client_golang/prometheus#GaugeVec.Delete
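For illustration (metric and label names taken from your first mail, the
concrete values are made up):

package main

import "github.com/prometheus/client_golang/prometheus"

var backendServerUp = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "traefik_backend_server_up",
		Help: "Whether a backend server is up (1) or down (0).",
	},
	[]string{"backend", "url"},
)

func main() {
	prometheus.MustRegister(backendServerUp)
	backendServerUp.WithLabelValues("backend1", "http://10.0.0.1:8080").Set(1)

	// When a configuration reload drops the server, delete its child so
	// that it no longer shows up on /metrics. Delete reports whether the
	// child existed.
	backendServerUp.Delete(prometheus.Labels{
		"backend": "backend1",
		"url":     "http://10.0.0.1:8080",
	})

	// DeleteLabelValues is the positional-argument variant:
	// backendServerUp.DeleteLabelValues("backend1", "http://10.0.0.1:8080")
}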

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

Marco Jantke

Nov 30, 2017, 9:31:49 AM
to Prometheus Users
Thanks for your input, Ben and Björn! I agree with your suggestions, but my project requires that I implement those metrics in the load balancer, and a rudimentary implementation already exists in Traefik, so I should get it fixed.

In case it's interesting for you, here is how I went about it so far: in Traefik we use the go-kit/metrics abstraction to make the metrics implementation configurable. For the Prometheus implementation I created a wrapper around the Vec types of the Go client library, and on each Set/Add/Observe I pass the operation through to the metric and also send the touched metric over a channel. At the other end of the channel sits a PromState struct, which is my only registered Collector and which passes Collect through to all metrics it has received. This way I can control which metrics I track, and I can run clean-up routines after each Collect.
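Very roughly, the shape of it is the sketch below (heavily simplified, all names made up, not the actual Traefik code):

package main

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// update is what the wrapped Vec types send after each Set/Add/Observe:
// the touched metric child plus a key identifying it (e.g. metric name
// plus label values).
type update struct {
	key    string
	metric prometheus.Metric
}

// promState is the only Collector registered with Prometheus.
type promState struct {
	updates chan update

	mu      sync.Mutex
	metrics map[string]prometheus.Metric
}

func newPromState() *promState {
	s := &promState{
		updates: make(chan update, 100),
		metrics: make(map[string]prometheus.Metric),
	}
	go s.run()
	return s
}

// run records every metric child that has been touched so far.
func (s *promState) run() {
	for u := range s.updates {
		s.mu.Lock()
		s.metrics[u.key] = u.metric
		s.mu.Unlock()
	}
}

// Describe forwards the descriptors by running a collection.
func (s *promState) Describe(ch chan<- *prometheus.Desc) {
	prometheus.DescribeByCollect(s, ch)
}

// Collect exposes all metric children seen so far.
func (s *promState) Collect(ch chan<- prometheus.Metric) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, m := range s.metrics {
		ch <- m
	}
	// A clean-up routine could run here and drop entries whose backend or
	// server is no longer part of the current dynamic configuration.
}

func main() {
	state := newPromState()
	prometheus.MustRegister(state)
	// The go-kit wrappers around GaugeVec, CounterVec, HistogramVec, ...
	// would send an update on state.updates after every Set, Add or Observe.
}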