Preventing data loss from poor network communication

Mathieu Tétreault

Jun 12, 2020, 2:45:12 PM

to Prometheus Users

We plan to use Prometheus to fetch data from multiple servers, and the link between the metrics servers and the Prometheus servers is known to be unreliable. The instability can last a couple of minutes, and there is nothing we can do about it.

Most of the time Prometheus will be able to fetch the metrics. However, when Prometheus is unable to pull the data, the metrics server will need to cache it until the connection is back.

Since the connection will be up most of the time, I was thinking about setting up a watchdog that is refreshed by each metrics pull. When the watchdog triggers, the data would be cached until the Pushgateway is pulled again.
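A minimal Python sketch of the watchdog idea described above. The class and method names (`ScrapeWatchdog`, `notify_scrape`, `record`, `drain`), the 60-second timeout, and the bounded cache are all hypothetical and only illustrate the mechanism; this is not part of Prometheus or the Pushgateway.

```python
import time
from collections import deque


class ScrapeWatchdog:
    """Hypothetical helper: caches samples while no scrape has been seen recently."""

    def __init__(self, timeout_s=60.0, max_cached=1000, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_scrape = clock()
        # Bounded buffer so a long outage cannot exhaust memory;
        # once full, the oldest samples are discarded first.
        self._cache = deque(maxlen=max_cached)

    def notify_scrape(self):
        """Call from the metrics HTTP handler each time a scrape is served."""
        self._last_scrape = self._clock()

    def expired(self):
        """True when no scrape has arrived within the timeout."""
        return self._clock() - self._last_scrape > self._timeout_s

    def record(self, name, value):
        """Cache the sample only while the watchdog has expired; return True if cached."""
        if not self.expired():
            return False
        self._cache.append((self._clock(), name, value))
        return True

    def drain(self):
        """Return and clear everything cached during the outage."""
        cached = list(self._cache)
        self._cache.clear()
        return cached
```

The application would call `notify_scrape()` whenever it serves a scrape and `record()` whenever a metric changes. What to do with the result of `drain()` once the link recovers is left open here.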

If anyone has any advice on that, it would be appreciated.

Cheers,

Mathieu

Stuart Clark

Jun 13, 2020, 8:25:23 AM

to Mathieu Tétreault, Prometheus Users

Is it possible to run the Prometheus server on the other end of the link?

In general it is advised to run Prometheus servers as close as possible
to the things being monitored. For example a server per datacenter
instead of a single global server, etc.

Mathieu Tétreault

Jun 14, 2020, 7:20:31 AM

to Stuart Clark, Prometheus Users

I will have to double-check. At first glance, the metrics servers did not appear to have enough resources available to run Prometheus alongside their application.

That's the main reason I started investigating a watchdog combined with the Pushgateway.

My understanding is that running Prometheus on the metrics servers would also prevent Grafana from displaying the data properly from time to time, since it would sometimes be unable to query the metrics server. That issue would be less visible with a global Prometheus instance that stores all the data.

Cheers,

Mathieu

Stuart Clark

Jun 14, 2020, 7:32:19 AM

to promethe...@googlegroups.com, Mathieu Tétreault, Prometheus Users

What you'd generally do is look at using federation or one of the global storage systems like VictoriaMetrics, Thanos or Cortex.

You'd have a Prometheus server in each location, and then central systems for global views and alerts.


Mathieu Tétreault

Jun 15, 2020, 8:30:07 AM

to Stuart Clark, Prometheus Users

Alright, I'll look into it.

In case we don't have the resources required to run Prometheus and the Thanos sidecar on the metrics server, would there be any issues with using the Pushgateway to cache the metrics while the network is down? I understand that it would be more complicated to implement, but other than that? I'll do some testing this week, but I was wondering whether there was anything I was missing.

Thanks for your help, it is really appreciated.

Cheers,

Mathieu

Stuart Clark

Jun 15, 2020, 2:19:40 PM

to promethe...@googlegroups.com, Mathieu Tétreault, Prometheus Users

The Pushgateway isn't a caching system. If the Prometheus server can't connect to scrape it due to network issues, you will miss data. The server needs to have reliable connectivity to the systems it is scraping.
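To illustrate why this loses data, here is a toy Python model of a push target that keeps only the most recently pushed value per key, which is how the Pushgateway behaves for a given metric and grouping key. `LastValueStore` and the `queue_depth` metric are hypothetical names, and this is not the Pushgateway API.

```python
class LastValueStore:
    """Toy model: keeps only the latest value per key, with no history."""

    def __init__(self):
        self._values = {}

    def push(self, key, value):
        # A new push replaces whatever was stored for this key.
        self._values[key] = value

    def scrape(self):
        # A scrape sees a snapshot of the current values only.
        return dict(self._values)


store = LastValueStore()
# Three readings arrive while the scraper cannot reach the store.
for reading in [10, 20, 30]:
    store.push("queue_depth", reading)

# When the link recovers, only the last reading is still there.
snapshot = store.scrape()
```

The readings 10 and 20 were overwritten before any scrape could see them, so nothing pushed during the outage can be recovered afterwards.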

Mathieu Tétreault

Jun 15, 2020, 2:41:23 PM

to Stuart Clark, Prometheus Users

Alright, thank you for your time.

Brian Candler

Jun 16, 2020, 7:10:34 AM

to Prometheus Users

However, remote_write does buffer.

If you have a local Prometheus server (i.e. close to the data you're collecting), and configure it to do remote_write back to a central data store (e.g. VictoriaMetrics), it will deal with temporary loss of connectivity and back-fill missing data when the connection comes back up.
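The buffer-and-back-fill behaviour described here can be pictured with a small Python sketch. It is a toy model under stated assumptions, not how Prometheus implements remote_write: `BufferedForwarder`, its `send` callback, and the `max_buffered` limit are hypothetical names chosen for illustration.

```python
from collections import deque


class BufferedForwarder:
    """Toy model: queue samples locally and forward them in order when the link is up."""

    def __init__(self, send, max_buffered=10000):
        # `send` is a caller-supplied function that raises ConnectionError on failure.
        self._send = send
        # Bounded queue: a very long outage discards the oldest samples first.
        self._pending = deque(maxlen=max_buffered)

    def enqueue(self, timestamp, name, value):
        """Store a timestamped sample until it can be delivered."""
        self._pending.append((timestamp, name, value))

    def flush(self):
        """Send pending samples oldest first; stop at the first failure.

        Returns the number of samples delivered.
        """
        sent = 0
        while self._pending:
            sample = self._pending[0]
            try:
                self._send(sample)
            except ConnectionError:
                break  # Link is down: keep the sample and retry on the next flush.
            self._pending.popleft()
            sent += 1
        return sent

    def pending(self):
        """Number of samples still waiting to be delivered."""
        return len(self._pending)
```

Because each sample carries its own timestamp, samples delivered after the outage still land at the time they were collected. The buffer is finite, so this only covers outages short enough to fit within it.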

Aliaksandr Valialkin

Jun 19, 2020, 8:33:01 AM

to Mathieu Tétreault, Stuart Clark, Prometheus Users

Hi Mathieu!

What kind of resources are available on the metrics server? Probably, vmagent could be placed on each metrics server in order to reliably collect data and then send it to a centralized storage when the connection is available. This is one of the main use cases for vmagent - see https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/app/vmagent/README.md#iot-and-edge-monitoring for details.
