Best practices when monitoring a variable set of hosts?


Douglas Burrill

Apr 15, 2016, 4:34:44 PM
to Prometheus Developers
I have an R&D environment where we use Prometheus to monitor about 200 test VMs.  At any point, we may turf these VMs and create a new set with new host names.

Instead of changing the Prometheus YAML file each time, we opted to push metrics to a gateway every 15 seconds and have Prometheus scrape only the gateway.  This gives us the flexibility to add or remove monitored systems without altering anything in the Prometheus configuration.  It also makes on-prem/cloud hybrid solutions easy, i.e., pushing out of a system is more secure and less painful than allowing something to pull from the box.
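For context, the push from each VM can be sketched roughly like this, using only the standard library. The gateway address, job name, and metric name below are made up for illustration, not taken from our actual setup:

```python
import urllib.request

def group_url(gateway, job, instance):
    # Each job/instance pair forms a "grouping key" on the Pushgateway.
    return f"http://{gateway}/metrics/job/{job}/instance/{instance}"

def exposition(name, value):
    # Prometheus text exposition format: one sample per line.
    return f"{name} {value}\n"

def push(gateway, job, instance, name, value):
    """POST one sample to the Pushgateway; metrics with the same name
    in this group are replaced by the new value."""
    req = urllib.request.Request(
        group_url(gateway, job, instance),
        data=exposition(name, value).encode(),
        method="POST",
    )
    req.add_header("Content-Type", "text/plain; version=0.0.4")
    with urllib.request.urlopen(req) as resp:
        return resp.status

# e.g. push("localhost:9091", "testvm", "vm-042", "cpu_temp_celsius", 42)
```

Each VM runs something like this from a timer every 15 seconds.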

However, I'm running into a staleness issue.  Let's say one of my 200 VMs spontaneously combusts and no longer pushes metrics to the gateway.  The gateway retains the most recent metrics that were pushed, and this data gets happily scraped each interval.  In chart form it looks like the patient flatlined at whatever the most recent value was, say 42.

My ultimate goal would be for Prometheus to only scrape up-to-date data and not continue scraping the value of 42 each time.

One workaround was to flush the gateway every x minutes or hours, so that we would at least only get a limited amount of stale data.  Of course, if Prometheus happens to scrape at that moment it will get an error.

Questions

Is using the Pushgateway a good way to avoid having to modify the Prometheus YAML file every time I need to change the hosts?

If not, what might a good alternative approach be that still keeps a push model?

If yes, is there a way to tell the gateway to discard any metrics that are greater than x seconds old? 

Any help is very much appreciated!

Douglas


Julius Volz

Apr 15, 2016, 5:12:59 PM
to Douglas Burrill, Prometheus Developers
On Fri, Apr 15, 2016 at 10:34 PM, Douglas Burrill <dwbu...@gmail.com> wrote:
I have an R&D environment where we use Prometheus to monitor about 200 test VMs.  At any point, we may turf these VMs and create a new set with new host names.

Instead of changing the Prometheus YAML file each time, we opted to push metrics to a gateway every 15 seconds and have Prometheus scrape only the gateway.  This gives us the flexibility to add or remove monitored systems without altering anything in the Prometheus configuration.  It also makes on-prem/cloud hybrid solutions easy, i.e., pushing out of a system is more secure and less painful than allowing something to pull from the box.

However, I'm running into a staleness issue.  Let's say one of my 200 VMs spontaneously combusts and no longer pushes metrics to the gateway.  The gateway retains the most recent metrics that were pushed, and this data gets happily scraped each interval.  In chart form it looks like the patient flatlined at whatever the most recent value was, say 42.

My ultimate goal would be for Prometheus to only scrape up-to-date data and not continue scraping the value of 42 each time.

One workaround was to flush the gateway every x minutes or hours, so that we would at least only get a limited amount of stale data.  Of course, if Prometheus happens to scrape at that moment it will get an error.

Questions

Is using the Pushgateway a good way to avoid having to modify the Prometheus YAML file every time I need to change the hosts?

Not really. The Pushgateway is really only recommended for service-level metrics that are not labeled as coming from a specific instance (in fact, usually one would only push a "job" label and no "instance" label to it). So usually, you'd use it for service-level batch jobs that cannot be scraped. Host- or instance-level metrics are better tracked right on the host (via the node exporter and its textfile collector) or in the instance of a service (e.g. via direct instrumentation or another suitable exporter).

The reason for this is exactly what you encountered. When you track instance-level metrics via the PGW, the lifecycle of the metrics producers will not automatically coincide with the lifecycle of the PGW, and thus stale data remains in the PGW until you manually clean it up. You also don't get up-ness monitoring for free this way.

The PGW does have an API that lets you automate deletion of stale metrics, but you still need to take care of it explicitly yourself.
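As a rough sketch of that automation: the PGW lets you issue an HTTP DELETE against a grouping-key URL to remove everything pushed under it. The gateway address and label values below are placeholders:

```python
import urllib.request

def group_url(gateway, job, instance):
    # Grouping-key URL; DELETE removes every metric pushed under it.
    return f"http://{gateway}/metrics/job/{job}/instance/{instance}"

def delete_group(gateway, job, instance):
    """Delete all metrics for one job/instance group on the Pushgateway."""
    req = urllib.request.Request(group_url(gateway, job, instance),
                                 method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.status

# A cron job could call delete_group() for every VM that is no longer
# in your inventory, so only live groups remain in the gateway.
```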
 
If not, what might a good alternative approach be that still keeps a push model?

I'm afraid there's no really good push-based solution for what you're doing.

You could use client-side timestamps with the PGW, which will mean that stale series will keep the same timestamp forever and not be stored repeatedly by Prometheus (the server will simply ignore repeated samples). So that would save you some storage space and also lead to series not being present in graphs anymore for time periods in which they are stale.

However, it's easy to shoot yourself in the foot with client-side timestamps, so we don't generally recommend it.
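For reference, a client-side timestamp in the text exposition format is an optional milliseconds-since-epoch value after the sample value. The metric name and timestamp here are invented for illustration:

```
# <metric_name>{<labels>} <value> <timestamp_ms>
vm_cpu_load{instance="vm-042"} 42 1460759684000
```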
 
If yes, is there a way to tell the gateway to discard any metrics that are greater than x seconds old? 

Not yet, but you could automate that via the PGW API.

Generally, I would recommend finding some way to pull your data :)
 
Any help is very much appreciated!

Douglas



Ben Kochie

Apr 15, 2016, 6:25:30 PM
to Douglas Burrill, Prometheus Developers
If you are allocating VMs dynamically, I highly recommend setting up one of the various target discovery methods.

We support a number of discovery APIs, documented here: https://prometheus.io/docs/operating/configuration/

Possibly the simplest one for you to use is the file_sd_config method.  Instead of updating the Prometheus YAML config, you can read your list of VMs from wherever you store it, write the targets out to a JSON file, and Prometheus will automatically watch and reload the target file as it changes.
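To sketch what that could look like (file paths, job name, and port are assumptions, not prescriptions), the scrape config would point at a file glob:

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: 'test-vms'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
```

and the target file itself is a JSON list of target groups:

```json
[
  {
    "targets": ["vm-001:9100", "vm-002:9100"],
    "labels": {"env": "rnd"}
  }
]
```

Rewriting that JSON file whenever VMs come and go is all that's needed; no Prometheus restart or config edit.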


Brian Brazil

Apr 15, 2016, 6:26:21 PM
to Julius Volz, Douglas Burrill, Prometheus Developers
To expand on this a bit, you should use some form of service discovery to tell Prometheus where your nodes are. If you happen to be on EC2, Azure (PR pending), or using Consul on your nodes, there's built-in support for this.

Failing that, you can write a cron job to write out a list of your nodes in JSON or YAML. See http://www.robustperception.io/using-json-file-service-discovery-with-prometheus/
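A minimal generator for such a cron job might look like this. The inventory source, output path, and port are placeholders for however you actually list your VMs:

```python
import json

def render_file_sd(hostnames, port=9100):
    """Render a host list into the JSON structure file_sd expects:
    a list of target groups, each with a "targets" list."""
    return json.dumps(
        [{"targets": [f"{host}:{port}" for host in hostnames]}],
        indent=2,
    )

def write_targets(hostnames, path="/etc/prometheus/targets/vms.json"):
    # Prometheus watches this file and reloads targets when it changes;
    # ideally write to a temp file and rename so it never reads a
    # half-written file.
    with open(path, "w") as f:
        f.write(render_file_sd(hostnames))
```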

--