Hello all,
I have a basic bare-metal Prometheus setup, and it seems the recommended way to set up HA is with a pair of Prometheus servers scraping the same targets. I have a few questions on this:
a) With an HA pair, the Prometheus data will be local to each of the two instances. Is it a good idea to have these two instances write to some sort of network-mounted filesystem like NFS or GlusterFS, so that the data is identical for both instances? Has anyone tried this?
b) With both instances of the HA pair scraping the same targets, how do I build a global view of these local Prometheus instances? Is federation (another Prometheus instance scraping the HA pair) the only way?
c) When the two instances scrape the same targets, are the metric values identical or slightly different due to time offsets between the scrapes? What happens if my scrape interval for Prometheus A is 15s and for Prometheus B is 16s — do I still need to dedup, given that the values will differ? What's the right strategy for deduplicating the metrics?
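For reference, the standard HA setup being discussed is two Prometheus servers running identical scrape configs, distinguished only by an external label. A minimal sketch — hostnames and label values here are made up:

```yaml
# prometheus.yml on replica A (replica B is identical except replica: "B")
global:
  scrape_interval: 15s
  external_labels:
    cluster: prod        # hypothetical cluster label
    replica: "A"         # distinguishes the two HA replicas

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node1:9100", "node2:9100"]   # same targets on both replicas
```

The `replica` external label is what downstream systems later use to deduplicate the two copies of each series.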
One option is Thanos: its query layer (Thanos Query) fans out to both instances of the HA pair and deduplicates on a replica label, which gives you the global view and the dedup in one place. If you use Grafana, you then use the Thanos Query HTTP API endpoint as your Prometheus datasource URL.

I have been using this kind of setup in production and have been very happy with it. See https://thanos.io/

Other choices are Cortex and VictoriaMetrics (the latter is advertised as having an easier setup and better performance than either Cortex or Thanos).
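To make the moving parts concrete, here is a sketch of the two Thanos components involved. Treat it as illustrative only — the addresses, paths, and the `replica` label name are assumptions, not a tested deployment:

```shell
# Run a sidecar next to each Prometheus replica; it exposes the
# local TSDB over gRPC for the query layer.
thanos sidecar \
  --tsdb.path=/var/lib/prometheus \
  --prometheus.url=http://localhost:9090 \
  --grpc-address=0.0.0.0:10901

# Run one (or more) queriers that fan out to both sidecars and
# deduplicate on the label that distinguishes the HA replicas.
thanos query \
  --http-address=0.0.0.0:9091 \
  --store=prom-a:10901 \
  --store=prom-b:10901 \
  --query.replica-label=replica
```

The querier's HTTP address is what you would hand to Grafana as the datasource URL.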
> a) With an HA pair, the Prometheus data will be local to each of the two instances. Is it a good idea to have these two instances write to some sort of network-mounted filesystem like NFS or GlusterFS, so that the data is identical for both instances? Has anyone tried this?
> b) With both instances of the HA pair scraping the same targets, how do I build a global view of these local Prometheus instances? Is federation (another Prometheus instance scraping the HA pair) the only way?
> c) When the two instances scrape the same targets, are the metric values identical or slightly different due to time offsets between the scrapes? What happens if my scrape interval for Prometheus A is 15s and for Prometheus B is 16s — do I still need to dedup, given that the values will differ? What's the right strategy for deduplicating the metrics?
As has been discussed elsewhere, two Prometheus instances cannot share the same data store — the Prometheus storage documentation explicitly warns that non-POSIX-compliant filesystems such as NFS are unsupported. I'd also add that NFS introduces extra, unnecessary failure modes.
If the two instances are both scraping the same targets you don't need a global view. I just point my Grafana instance at the load balancer sitting in front of my parallel Prometheus instances; I've never noticed any display glitches caused by the load balancer switching between instances. The reality is that while their datasets may not be identical, they'll be "close enough".
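For what it's worth, the load-balancer part is nothing exotic. A minimal sketch with nginx — the upstream hostnames and ports are hypothetical placeholders:

```nginx
# Two parallel Prometheus replicas behind one nginx frontend.
upstream prometheus_ha {
    server prom-a.example.internal:9090;
    server prom-b.example.internal:9090;   # round-robin; add "backup" for active/passive
}

server {
    listen 80;
    location / {
        proxy_pass http://prometheus_ha;
    }
}
```

Grafana then gets `http://<this-frontend>/` as its single Prometheus datasource URL.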
You would only need deduplication if you were sending the data on to another system (Prometheus federation, Thanos, VictoriaMetrics, etc.). I'm currently experimenting with sending data from all of my instances (DEV, paired TEST, paired PROD) to a single VictoriaMetrics server. I haven't yet played with deduplication, though.
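The remote-write side of that is just a couple of lines in each replica's Prometheus config, and VictoriaMetrics can deduplicate the HA pair's samples on its side via its `-dedup.minScrapeInterval` flag. A sketch, with a made-up hostname and assuming a single-node VictoriaMetrics instance on its default port:

```yaml
# In each Prometheus replica's prometheus.yml
remote_write:
  - url: http://victoriametrics.internal:8428/api/v1/write
```

Starting VictoriaMetrics with `-dedup.minScrapeInterval=15s` (matched to the scrape interval) keeps a single sample per interval, which collapses the duplicates arriving from the two replicas.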
I believe the downside, or at least a corner case, with Prometheus instances behind a load balancer is this: if one of the instances goes down, you may have gaps in your graphs, and you can't backfill the data (not supported by Prometheus) once that instance comes back up.
Out of curiosity, I'd like to know why you decided to go with VictoriaMetrics over Thanos. I'd really like to hear your PoV and your experiences with VM.