HA Prometheus with federation

Matt Bostock

Dec 6, 2016, 5:35:08 PM

to Prometheus Users

Hello,

We're using Prometheus to monitor multiple datacentres, and federating key service-level metrics up to global Prometheus instances:

https://prometheus.io/docs/operating/federation/

In each datacentre, we have Prometheus installed on 1-3 machines, depending on the size of the datacentre. This is partly for consistency (these machines are configured alike, so it makes sense for them all to run Prometheus) and partly for high availability - if we lose one machine, we don't lose monitoring for that datacentre.

The difficulty comes when federating metrics - all the instances are federating up, so we see each federated metric duplicated 1-3 times. The size is not an issue (we are federating only select metrics), but it makes it clumsy to query them at the top-level instance, since you have to pick which federated instance you're querying against.

So far my thoughts were to either:

a) use a local forward HTTP proxy, running on the same machine as the top-level Prometheus, that round-robins between the HA Promethei in the datacentres and fails over if one of them dies

b) use relabelling somehow to drop the redundant metrics (using something along the lines of hashmod: https://prometheus.io/docs/operating/configuration/#relabel_config)

Does anyone have experience of this setup or suggestions on best practices for dealing with duplicate federated metrics?

Thanks,

Matt

Brian Brazil

Dec 7, 2016, 12:27:45 AM

to Matt Bostock, Prometheus Users

On 6 December 2016 at 22:34, Matt Bostock <ma...@mattbostock.com> wrote:

a) use a local forward HTTP proxy, running on the same machine as the top-level Prometheus, that round-robins between the HA Promethei in the datacentres and fails over if one of them dies

That may cause artifacts.

Generally I'd suggest either scraping just one of the per-DC Prometheus servers, or scraping all of them. A key point is that all the Prometheus servers should have distinct external_labels to avoid clashes, and then do a min/max/mean in your queries.
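As a rough illustration of the "scrape all of them, then aggregate" approach, here is a minimal Python sketch of what a query-time max across replicas does once each per-DC server carries a distinct external label. The label name `replica`, the metric name and the sample values are all hypothetical; in PromQL the equivalent would be something along the lines of `max without (replica) (...)`.

```python
from collections import defaultdict

def aggregate_without(samples, drop_label, agg=max):
    """Group samples by every label except drop_label, then aggregate the values."""
    groups = defaultdict(list)
    for labels, value in samples:
        # The grouping key is the label set with the replica-identifying label removed.
        key = frozenset((k, v) for k, v in labels.items() if k != drop_label)
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

# Hypothetical federated samples: the same series arrives once per replica,
# kept distinct only by the (assumed) external label "replica".
samples = [
    ({"__name__": "job:http_requests:rate5m", "dc": "lhr", "replica": "a"}, 102.0),
    ({"__name__": "job:http_requests:rate5m", "dc": "lhr", "replica": "b"}, 101.5),
    ({"__name__": "job:http_requests:rate5m", "dc": "sfo", "replica": "a"}, 40.0),
]

deduped = aggregate_without(samples, "replica", agg=max)
for key, value in sorted(deduped.items(), key=lambda kv: sorted(kv[0])):
    print(dict(key), value)
```

Whether min, max or mean is the right aggregator depends on the metric; if one replica is down, the aggregate is simply computed over whichever replicas still report.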

b) use relabelling somehow to drop the redundant metrics (using something along the lines of hashmod: https://prometheus.io/docs/operating/configuration/#relabel_config)

That won't help. All relabelling is stateless.
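A small sketch of what "stateless" implies here, under the assumption that hashmod behaves roughly like "hash the label value, take it modulo N". The hash below is an approximation for illustration, not a byte-for-byte reimplementation, and the target addresses are made up. The point is that the outcome depends only on the label value and the modulus, so nothing is reassigned when the selected server dies.

```python
import hashlib

def hashmod(value: str, modulus: int) -> int:
    """Approximate a hashmod-style relabel action: hash of the value, modulo N."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus

def select(targets, modulus, shard):
    """Keep only the targets whose hash falls into the given shard."""
    return [t for t in targets if hashmod(t, modulus) == shard]

# Hypothetical per-DC Prometheus servers that a global instance could federate from.
targets = [
    "prom-a.dc1.example:9090",
    "prom-b.dc1.example:9090",
    "prom-c.dc1.example:9090",
]

kept = select(targets, modulus=3, shard=0)

# Pretend every kept target has just failed. The selection has no input that
# describes health, so re-running it yields exactly the same answer.
down = set(kept)
still_kept = select(targets, modulus=3, shard=0)
survivors = [t for t in still_kept if t not in down]

print("kept:", kept)
print("kept after failure:", still_kept)
print("healthy targets still selected:", survivors)
```

Note also that nothing guarantees a given shard contains exactly one target: depending on the hash values it may hold none or several.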

Brian

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAH6-%3DC%2BtmkhOrHGGjT7s%2Bh5ywrfa_xPMGxae5zs_ajoRmjkc9g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

--

Brian Brazil

www.robustperception.io

govinda...@gmail.com

Jun 13, 2017, 1:17:11 PM

to Prometheus Users, ma...@mattbostock.com

Hi,

We have 4 federation nodes scraping metrics from multiple scraper nodes. When we run a Prometheus query, each of the federation nodes shows a different output. When we run the same query against the scraper nodes, the output is different again. Why do the federation nodes show different output for the same query? Thoughts?

Thanks,

Govind

Ionut Ilie

Jun 14, 2017, 2:40:49 AM

to Prometheus Users, ma...@mattbostock.com, govinda...@gmail.com

Is there any load balancer in front of the federation nodes?

That is the only idea I have, because I have done that myself :)

https://github.com/prometheus/prometheus/issues/2761

govinda...@gmail.com

Jun 14, 2017, 10:46:05 AM

to Prometheus Users, ma...@mattbostock.com, govinda...@gmail.com

We do have a GSLB which round-robins to one of the 4 Prometheus servers.
