Trouble using telegraf's prometheus output


Romain Vrignaud

Jun 14, 2016, 4:53:03 AM
to prometheus...@googlegroups.com
Hello,

I'm looking for some advice. For various reasons I'm using Telegraf (https://github.com/influxdata/telegraf/), and would like to stick with it, with its prometheus output to expose metrics.

It seems that once a metric is published in the prometheus output, it never leaves it, even if the Telegraf input doesn't send it again (in my case, these are metrics for ephemeral RabbitMQ queues). You can find the bug description here: https://github.com/influxdata/telegraf/issues/1334.

If I understand the Telegraf maintainer correctly, this is the Prometheus way of doing things. Is that true?

To be more specific, I have an alert rule on the sum of messages in ephemeral queues. But with the Telegraf rabbitmq input plugin and the Telegraf prometheus output plugin, queues that are deleted keep the last value known before the queue was deleted. This is quite problematic for us, as it completely skews the computed number of messages. How should we handle this kind of use case?

Thanks for any advice.


Brian Brazil

Jun 14, 2016, 5:38:30 AM
to Romain Vrignaud, Prometheus Developers
This is a problem with Telegraf: the API it provides for outputs doesn't tell us which metrics do and don't exist. The original PR I proposed to add Prometheus support to Telegraf didn't have this issue.


--

Romain Vrignaud

Jun 14, 2016, 5:48:41 AM
to Brian Brazil, Prometheus Developers
Would you mind commenting on the Telegraf issue that the statement "Prometheus basically comes with the assumption that once a metric has been reported, it must be reported at every interval." is not true and not aligned with the Prometheus philosophy?
For various reasons:
  * only one project to maintain (Telegraf vs. lots of different exporters)
  * push to durable metric storage in InfluxDB (given that, AFAIK, Prometheus will drop its InfluxDB write support)
I would prefer to maintain only one tool for metric gathering. As exporters are not able to push metrics to InfluxDB, I would prefer to keep Telegraf.
 

--

Brian Brazil

Jun 14, 2016, 5:59:49 AM
to Romain Vrignaud, Prometheus Developers
On 14 June 2016 at 10:48, Romain Vrignaud <rvri...@gmail.com> wrote:



Would you mind commenting on the Telegraf issue that the statement "Prometheus basically comes with the assumption that once a metric has been reported, it must be reported at every interval." is not true and not aligned with the Prometheus philosophy?

This is not true.
 
 

For various reasons:
  * only one project to maintain (Telegraf vs. lots of different exporters)
  * push to durable metric storage in InfluxDB (given that, AFAIK, Prometheus will drop its InfluxDB write support)

We'll drop it, but not before an API is added (such as https://github.com/prometheus/prometheus/pull/1487) to allow users to do that themselves if they wish.
 
I would prefer to maintain only one tool for metric gathering. As exporters are not able to push metrics to InfluxDB, I would prefer to keep Telegraf.

We consider one exporter per machine to be an anti-pattern: it's a bottleneck both technically and operationally, and it increases the impact of a failure of that one exporter.

--

camero...@gmail.com

Jun 14, 2016, 8:34:51 AM
to Prometheus Developers, rvri...@gmail.com
My statement quoted above is a bit out of context. I meant that Prometheus would expect _Telegraf_ to continue reporting the same metrics, as I don't quite see a way that Telegraf could report ephemeral metrics to Prometheus.

@Brian-Brazil I'd be very interested to know how Telegraf can let prometheus know which metrics do and don't exist, and be able to unregister & reregister them later?

I'm certainly not a prometheus expert, but I was basing that statement off of this example code: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register

In that example, the user registers a prometheus counter and then unregisters it. After unregistering, they are not able to register a counter with the same name again. From the example it appears that the only way to register another metric is to change the name, which from the Telegraf perspective is a different metric.

The use-case is basically this:

1. user has 2 metrics: [m1 value=1] & [m2 value=1]
2. At 12:00, both of these metrics are reported into Telegraf, and Telegraf writes both of them to its configured outputs (let's say influxdb and prometheus).
3. user now has 1 metric: [m2 value=2]
4. At 12:10, only the "m2" metric exists. Telegraf now writes only the m2 metric to its outputs. This means that the m1 metric continues to appear on the prometheus client /metrics endpoint with value=1.
5. user again has 2 metrics: [m1 value=3] & [m2 value=3]
6. At 12:20, Telegraf writes both metrics to its outputs.

It's at step 4 that I have a problem. From what I understand, if I unregistered m1 at step 4, I would not be allowed to re-register m1 in step 6.
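
The timeline above can be reproduced with a small model. This is only a sketch in Python, not Telegraf's or client_golang's actual code: `UpsertOnlyOutput` is a hypothetical stand-in for an output that is handed each batch of metrics and is never told that a metric has stopped existing.

```python
class UpsertOnlyOutput:
    """Hypothetical output that can only add or update samples."""

    def __init__(self):
        self.samples = {}

    def write(self, batch):
        # A metric that is absent from the batch is simply not mentioned,
        # so there is nothing here that could trigger its removal.
        self.samples.update(batch)

    def scrape(self):
        # What a /metrics request would see at this moment.
        return dict(self.samples)


output = UpsertOnlyOutput()

output.write({"m1": 1, "m2": 1})   # 12:00, both metrics exist
at_1200 = output.scrape()

output.write({"m2": 2})            # 12:10, m1 no longer exists upstream
at_1210 = output.scrape()

output.write({"m1": 3, "m2": 3})   # 12:20, m1 is back
at_1220 = output.scrape()

print(at_1210)  # m1 is still exposed with its last known value
```

In this model, the 12:10 scrape still contains m1 with value 1, which is the stale series described in step 4.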

Brian Brazil

Jun 14, 2016, 9:06:06 AM
to camero...@gmail.com, Prometheus Developers, Romain Vrignaud
On 14 June 2016 at 13:34, <camero...@gmail.com> wrote:
@Brian-Brazil I'd be very interested to know how Telegraf can let prometheus know which metrics do and don't exist, and be able to unregister & reregister them later?

The way to do it is that when Telegraf gets a scrape, it goes and gets all the metrics from the input plugins, sends those to the registry, and responds to the HTTP request with them.
 

I'm certainly not a prometheus expert, but I was basing that statement off of this example code: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register

That's about direct instrumentation, which is not what we're doing here.

Brian
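
A minimal sketch of the scrape-time approach described above, written in Python with only the standard library rather than against client_golang. `ScrapeTimeCollector`, `collect`, `render`, and the `rabbitmq_queue_messages` metric name are hypothetical names for illustration. The point is only that the set of exposed series is rebuilt from the inputs on every request, so a deleted queue drops out without any unregister step.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]      # series name -> value
Input = Callable[[], Sample]   # an input plugin, modelled as a plain function


class ScrapeTimeCollector:
    def __init__(self, inputs: List[Input]):
        self.inputs = inputs

    def collect(self) -> Sample:
        # Gather from every input at request time; nothing is remembered
        # between scrapes.
        samples: Sample = {}
        for gather in self.inputs:
            samples.update(gather())
        return samples

    def render(self) -> str:
        # One "name value" line per series, sorted for stable output.
        lines = [f"{name} {value}" for name, value in sorted(self.collect().items())]
        return "\n".join(lines) + "\n"


# Hypothetical input: current message counts per queue.
queues = {"orders": 5.0, "tmp-123": 7.0}


def rabbitmq_input() -> Sample:
    return {f'rabbitmq_queue_messages{{queue="{q}"}}': n for q, n in queues.items()}


collector = ScrapeTimeCollector([rabbitmq_input])
first = collector.render()    # both queues are exposed
del queues["tmp-123"]         # the ephemeral queue is deleted
second = collector.render()   # only "orders" remains
```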
 
--

Brian Brazil

Jun 14, 2016, 9:30:35 AM
to Cameron Sparr, Prometheus Developers
On 14 June 2016 at 14:27, Cameron Sparr <camero...@gmail.com> wrote:
I'm certainly not a prometheus expert, but I was basing that statement off of this example code: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register

That's about direct instrumentation, which is not what we're doing here.
Brian

Telegraf is doing direct instrumentation. I don't quite follow: what should Telegraf be doing instead?



Brian 



--

Cameron Sparr

Jun 14, 2016, 9:47:11 AM
to Brian Brazil, Prometheus Developers
I see, and what would be stopping only Telegraf's "prometheus_client" output plugin from implementing the Collector interface, rather than the entire Telegraf agent?

I know that Telegraf's collection model in some ways goes against Prometheus's, but I'd like to integrate it as well as possible.

Brian Brazil

Jun 14, 2016, 9:54:29 AM
to Cameron Sparr, Prometheus Developers
On 14 June 2016 at 14:47, Cameron Sparr <camero...@gmail.com> wrote:
I see, and what would be stopping only Telegraf's "prometheus_client" output plugin from implementing the Collector interface, rather than the entire Telegraf agent?

You could do that, but you still need Telegraf to provide all metrics on request.

Brian



--

Cameron Sparr

Jun 14, 2016, 11:13:25 AM
to Brian Brazil, Prometheus Developers
It would be possible for the Telegraf prometheus output plugin to cache its metrics from each collection interval, and then send them down the channel any time Collect is called.

But I also think that doing this would in some ways go against a tenet of Prometheus, that the metrics should be "collected" at the time the HTTP request is made. Is that correct? Would you recommend this as an OK workaround for Telegraf to take, without changing its core workflow?
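
The caching workaround could look roughly like this. It is a sketch under one big assumption that the thread does not confirm: that each call to the output's write method carries the complete set of metrics for that collection interval. `CachedBatchOutput` and its method names are hypothetical.

```python
import threading


class CachedBatchOutput:
    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}

    def write(self, batch):
        # Replace, do not merge: anything absent from this batch disappears.
        # ASSUMPTION: the batch is the full set of metrics for the interval.
        with self._lock:
            self._latest = dict(batch)

    def collect(self):
        # Called on each scrape; serves the cache instead of gathering live.
        with self._lock:
            return dict(self._latest)


out = CachedBatchOutput()
out.write({"m1": 1, "m2": 1})
out.write({"m2": 2})          # m1 is not in the newest batch
after_delete = out.collect()  # so it is no longer exposed
```

The values served are as old as the last collection interval, which is the departure from scrape-time collection raised in this message.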

Brian Brazil

Jun 14, 2016, 11:16:53 AM
to Cameron Sparr, Prometheus Developers
On 14 June 2016 at 16:13, Cameron Sparr <camero...@gmail.com> wrote:
It would be possible for the Telegraf prometheus output plugin to cache its metrics from each collection interval, and then send them down the channel any time Collect is called.

Does the API guarantee that it sends all the metrics at once? I couldn't find any docs to indicate so.
 

But I also think that doing this would in some ways go against a tenet of Prometheus, that the metrics should be "collected" at the time the HTTP request is made. Is that correct?

Yes.
 
Would you recommend this as an OK workaround for Telegraf to take, without changing its core workflow?

It's a tradeoff that sometimes must be made. It will cause problems.
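
If a write cannot be assumed to carry every metric at once, one possible variation is to keep merging samples but drop any that have not been refreshed within some window. This is a sketch of that idea, not a description of what Telegraf does; `ExpiringOutput`, `max_age`, and the injected clock are hypothetical.

```python
import time


class ExpiringOutput:
    def __init__(self, max_age, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self._samples = {}  # name -> (value, time last written)

    def write(self, batch):
        # Merge the batch, stamping each sample with the time it was refreshed.
        now = self.clock()
        for name, value in batch.items():
            self._samples[name] = (value, now)

    def collect(self):
        # Drop samples that were last refreshed more than max_age ago.
        cutoff = self.clock() - self.max_age
        self._samples = {
            name: (value, written)
            for name, (value, written) in self._samples.items()
            if written >= cutoff
        }
        return {name: value for name, (value, _) in self._samples.items()}


# A fake clock keeps the example deterministic.
now = [0.0]
out = ExpiringOutput(max_age=15.0, clock=lambda: now[0])

out.write({"m1": 1, "m2": 1})
now[0] = 10.0
out.write({"m2": 2})
fresh = out.collect()     # m1 is 10s old, still served
now[0] = 20.0
expired = out.collect()   # m1 is 20s old and dropped; m2 is 10s old
```

The cost is that a deleted metric lingers for up to `max_age`, and a window shorter than the collection interval would make live metrics flicker in and out.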



--