Dropping metrics (counters)


gary.w...@comparethemarket.com

Aug 12, 2016, 11:10:52 AM
to Prometheus Developers
Hi
I'm relatively new to Prometheus and I'm not a code expert, but from a user perspective we've recently seen drops in our trend data, as counters haven't been successfully extracted from the on-prem and AWS instances we're monitoring.

It's as though the counters weren't collected successfully. However, we have a log-exporter service running on our app boxes which should hold the counters until they are successfully scraped by Prometheus. I'm wondering if this is something you are familiar with?

Our process flow for metrics is: log files => log-exporter (which collects the logs, converts them into a Prometheus-compatible format, and exposes them for collection) => Prometheus servers in AWS.

We're losing counts at seemingly random times: the expression sum(increase(metric{}[15m])) by (instance) drops to zero for a short period.

It's almost as if Prometheus failed to scrape the counters from the log-exporter service; however, if that were the case, these counts would actually end up in Prometheus at a later time... yet they appear to have gone completely.

I'd like to attach an image (chart), but it doesn't look like it's possible to attach one here.

Any thoughts on common reasons why this happens would be greatly appreciated.

Regards
Gary

Brian Brazil

Aug 12, 2016, 11:15:00 AM
to gary.w...@comparethemarket.com, Prometheus Developers
On 12 August 2016 at 16:10, <gary.w...@comparethemarket.com> wrote:
Hi
I'm relatively new to Prometheus and I'm not a code expert, but from a user perspective we've recently seen drops in our trend data, as counters haven't been successfully extracted from the on-prem and AWS instances we're monitoring.

It's as though the counters weren't collected successfully. However, we have a log-exporter service running on our app boxes which should hold the counters until they are successfully scraped by Prometheus. I'm wondering if this is something you are familiar with?

That doesn't sound right. Exporters should be stateless and not change what they export based on a scrape.
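The stateless pattern Brian describes can be sketched in plain Python (the names here are hypothetical, not the actual log-exporter's API): the counter only ever increases, and a scrape simply reads its current value without buffering or clearing anything.

```python
# Minimal sketch (hypothetical, not the actual log-exporter): a stateless
# counter that only ever increases; a scrape just reads the current value.

class LogCounter:
    def __init__(self):
        self.total = 0  # monotonically increasing since process start

    def observe_log_line(self):
        self.total += 1  # counting happens independently of scrapes

    def scrape(self):
        # A scrape reports the *current* cumulative total; it must not
        # reset or hold back the counter based on who scraped, or when.
        return f"log_lines_total {self.total}"

c = LogCounter()
for _ in range(3):
    c.observe_log_line()
print(c.scrape())  # log_lines_total 3
print(c.scrape())  # same value again: scraping is read-only
```

Because scraping is read-only, a missed scrape loses nothing permanently; Prometheus simply picks up the (larger) cumulative value on the next successful scrape.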

 

Our process flow for metrics is: log files => log-exporter (which collects the logs, converts them into a Prometheus-compatible format, and exposes them for collection) => Prometheus servers in AWS.

We're losing counts at seemingly random times: the expression sum(increase(metric{}[15m])) by (instance) drops to zero for a short period.

It's almost as if Prometheus failed to scrape the counters from the log-exporter service; however, if that were the case, these counts would actually end up in Prometheus at a later time... yet they appear to have gone completely.

Why do you think that data from a failed scrape would reappear later? There's no mechanism for this in Prometheus.

Brian
 

I'd like to attach an image (chart), but it doesn't look like it's possible to attach one here.

Any thoughts on common reasons why this happens would be greatly appreciated.

Regards
Gary

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Julius Volz

Aug 12, 2016, 11:59:18 AM
to Gary Williams, Prometheus Developers
On Fri, Aug 12, 2016 at 5:10 PM, <gary.williams@comparethemarket.com> wrote:
Hi
I'm relatively new to Prometheus and I'm not a code expert, but from a user perspective we've recently seen drops in our trend data, as counters haven't been successfully extracted from the on-prem and AWS instances we're monitoring.

It's as though the counters weren't collected successfully. However, we have a log-exporter service running on our app boxes which should hold the counters until they are successfully scraped by Prometheus. I'm wondering if this is something you are familiar with?

My guess is that the log exporter is supposed to count how many log events of certain types flow through it, right?

As Brian mentioned, exported counters should just increase forever, and their current value should be independent of any scrapes. A scrape simply asks a service instance or an exporter about the *current* state of each metric.
 

Our process flow for metrics is: log files => log-exporter (which collects the logs, converts them into a Prometheus-compatible format, and exposes them for collection) => Prometheus servers in AWS.

We're losing counts at seemingly random times: the expression sum(increase(metric{}[15m])) by (instance) drops to zero for a short period.

It's almost as if Prometheus failed to scrape the counters from the log-exporter service; however, if that were the case, these counts would actually end up in Prometheus at a later time... yet they appear to have gone completely.

Occasionally failing to scrape a counter does not usually lead to dips to zero; it just means the collected data has lower resolution, and thus the computed rates would be smoother.

However, one thing that *could* be happening here is that the exporters are crashlooping for some reason (maybe for specific log messages?), and thus there are very frequent counter resets (counters in Prometheus only reset when the exporting process restarts). If you get too many counter resets in a very short time, counted-up events between scrapes effectively get lost from Prometheus's perspective. So I would check whether the exporters are running ok and not crashlooping.
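A rough sketch in Python of the reset handling Julius describes (simplified, not Prometheus's actual implementation): when a sample is lower than its predecessor, the delta logic assumes the counter restarted from zero and credits only the new value, so anything counted between the last good scrape and the crash is lost.

```python
# Sketch of reset-aware delta logic, similar in spirit to how Prometheus's
# increase()/rate() treat counter resets: a sample lower than its
# predecessor is taken as a process restart, so only the post-restart
# count is credited.

def total_increase(samples):
    increase = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Counter reset: credit only the value counted since restart.
            increase += cur
    return increase

# Healthy counter: scrapes see 100, 150, 210 -> increase of 110.
print(total_increase([100, 150, 210]))  # 110

# Crashloop: the process counted 50 more events after the scrape at 100,
# then restarted before the next scrape. Those 50 events vanish:
print(total_increase([100, 5, 60]))  # 60
```

With frequent restarts between scrapes, these lost windows add up, which would show as dips toward zero in increase()-based graphs.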

If it's not that, we'd really have to see what exactly the exporter is doing.
 
I'd like to attach an image (chart), but it doesn't look like it's possible to attach one here.

Any thoughts on common reasons why this happens would be greatly appreciated.

Regards
Gary


gary.w...@comparethemarket.com

Aug 15, 2016, 3:02:06 AM
to Prometheus Developers, gary.w...@comparethemarket.com
Thanks Brian and Julius,
I'm not sure the log-exporter can be crashlooping, as this is happening on multiple boxes, in multiple environments, in multiple server locations. The only common factor is that the metrics are being lost (and not recovered after the 'event' has passed) on one of our 3 Prometheus instances. So the issue seems to be within Prometheus itself.

Regards
Gary