Federation Failed


dzh...@contextlogic.com

May 26, 2017, 8:35:33 PM
to Prometheus Users
Hello Prometheus Users,

I am having an issue with federation in Prometheus right now. I keep getting the error message in the logs of the child Prometheus:

"ERRO[0177] federation failed  err=write tcp <IP of the Child Prometheus>:9090-><IP of the federation Prometheus>:47396: write: broken pipe source=federate.go:124"

And on the federation Prometheus under the Targets tab, the status of the child Prometheus is "DOWN" and the error message is "context deadline exceeded". 

My federation Prometheus is configured as follows:

global:
  scrape_interval: 15s

scrape_configs:
- job_name: 'federate'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="node_exporter"}'
  static_configs:
    - targets: ['<IP of the child Prometheus>:9090']

Any help is greatly appreciated. Thanks so much in advance!



Mark Smith

May 26, 2017, 11:16:10 PM
to promethe...@googlegroups.com
Hiya,

I'm pretty new to Prometheus, but in my experience I've seen this happen when too much data is being federated. The parent Prometheus will only wait for scrape_timeout seconds before it hangs up; the child Prometheus then hits the write failure because the parent has gone away.

You could fix this by addressing why your federate requests are taking too long, and/or raising your scrape_timeout (and scrape_interval).
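For example, something like this on the parent should give the federate scrape more time (the exact values are just illustrative, and scrape_timeout must not be larger than scrape_interval):

scrape_configs:
- job_name: 'federate'
  scrape_interval: 60s   # scrape the child less often
  scrape_timeout: 55s    # allow the large federate response more time
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="node_exporter"}'
  static_configs:
    - targets: ['<IP of the child Prometheus>:9090']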

You can verify if this is the case by using 'curl' or similar to see how long your child Prometheus is taking to serve the federate request. If it's longer than 10 seconds (the default scrape_timeout) then you will see the issue you're reporting.
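Something like this should give you a rough timing (the -g stops curl from globbing the brackets and braces in the URL; substitute your child's address):

curl -sg -o /dev/null -w '%{time_total}\n' \
  'http://<IP of the child Prometheus>:9090/federate?match[]={job="node_exporter"}'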

--
Mark Smith

dzh...@contextlogic.com

May 30, 2017, 1:34:22 PM
to Prometheus Users
Hello Mark,

Thanks a lot for the reply. Right now I am only federating from one child Prometheus. If the child Prometheus is able to scrape from all the exporters I have set up within the scrape_timeout, the federation Prometheus should be able to do the same thing by federating from the single child Prometheus.

Also, I have tried to curl the /federate endpoint of the child Prometheus and it responded with a 200 OK in less than 3 seconds (scrape_timeout is set to the default value of 15s). However, in the expression browser under the Targets tab it still says that the child Prometheus target is down. I don't think it's a network issue.

Thanks.

Mark Smith

May 30, 2017, 4:11:44 PM
to dzh...@contextlogic.com, Prometheus Users
Hiya,

My experience with federation is that there is definitely an efficiency loss, so I don't think your assertion is correct -- the usual goal of federation is to distribute the work of scraping/aggregating. You usually only federate the aggregated data. If you are federating every single metric being pulled by the child Prometheus, it's probably not a stable model.

As for debugging this particular issue though, I've noticed that the /federate endpoint is highly variable in its response time. Sometimes it takes 1-2 seconds and I've seen it take over a minute in my system. Are you sure that the /federate endpoint on the child always takes 3 seconds, or did you just get lucky?

How many metrics are you attempting to federate up? I don't have a great grasp of the limitations yet but in my experience thousands is fine, tens of thousands might work, and anything >100,000 is probably not going to work very well.
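If you want a rough count of how many samples your match[] selector pulls, something along these lines should do it (it just counts the non-comment lines of the federate output; -g keeps curl from globbing the URL):

curl -sg 'http://<IP of the child Prometheus>:9090/federate?match[]={job="node_exporter"}' | grep -vc '^#'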

--
Mark Smith

dzh...@contextlogic.com

May 30, 2017, 4:23:55 PM
to Prometheus Users, dzh...@contextlogic.com
Hi Mark,

Thanks for the reply. I tried curling the endpoint again and you're right, it's now taking up to 3 minutes to get all the metrics from a single exporter. I'm attempting to federate up approximately 3 million metrics, and I think the sheer number of time series is causing this issue. I just read Brian Brazil's blog article here: https://www.robustperception.io/federation-what-is-it-good-for/ and he recommended "doing the scrapes via an actual proxy server using proxy_url". Do you know how I can use the proxy_url config option to do the scrapes?

Thanks a lot again for your help.

Julius Volz

May 30, 2017, 4:28:35 PM
to dzh...@contextlogic.com, Prometheus Users
Federating 3 million metrics is definitely not really what federation was meant for - as Mark said, typically you'll only federate aggregated metrics to a higher-level Prometheus (global view with less dimensional detail vs. local view with all detail).

Federating metrics out of a Prometheus server can be much more expensive than scraping the metrics directly from the original endpoints, for multiple reasons:

- one huuuuuuge single scrape that cannot happen in parallel
- all samples need to be actually produced from the storage (and a single process at that) vs. being generated on the fly in a bunch of distributed targets

Why do you want to federate 3 million metrics?

dzh...@contextlogic.com

May 30, 2017, 4:54:02 PM
to Prometheus Users, dzh...@contextlogic.com
Hello Julius,

I see what you mean. Thanks a lot for the clarification. The reason is that we will potentially add more metrics to Prometheus and we need a scalable setup that can handle a large number of metrics. Since federation is evidently not a feasible option, do you know how I can use proxy servers to scrape metrics and aggregate them? The documentation for proxy_url is very limited and I am not sure where to start.

Thanks a lot,
DZ

Julius Volz

May 30, 2017, 5:04:26 PM
to dzh...@contextlogic.com, Prometheus Users
On Tue, May 30, 2017 at 10:54 PM, dzhang via Prometheus Users
<promethe...@googlegroups.com> wrote:
> Hello Julius,
>
> I see what you mean. Thanks a lot for the clarification. The reason why is
> that we will potentially add more metrics to Prometheus and we need a
> scalable setup that can handle large amount of metrics. So since federation

There are a few ways of scaling:

- building hierarchical federation trees where you only federate
aggregated metrics (like job-level aggregations, no longer having
instance-level detail) into the higher tiers of the tree

- manual functional sharding (example: each service or team gets their
own Prometheus)

- hashmod-based horizontal sharding (not used often)

Federating all the metrics of one Prometheus server to another doesn't
help with scaling.
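For reference, hashmod-based sharding is typically set up with relabel_configs roughly like this (the modulus and shard number here are purely illustrative; each shard keeps only the targets that hash to its index):

scrape_configs:
- job_name: 'node_exporter'
  # ... your usual service discovery config ...
  relabel_configs:
  - source_labels: [__address__]
    modulus: 4               # total number of shards
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: '2'               # this shard's index (0-3)
    action: keep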

> evidently is not a feasible option, do you know how I can use proxy servers
> to scrape metrics and aggregate them? The documentation for proxy_url is
> very limited and I am not sure where to start.

Do you need a proxy because the targets are in a remote network for
the other Prometheus? Do you really need to get *all* metrics out of
that network segment instead of just having one Prometheus there with
all the detail right where the targets are (that is the usual
approach)?

You can find information about how to configure a proxy here:
https://prometheus.io/docs/operating/configuration/#scrape_config
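If a proxy really is what you need, the relevant option is proxy_url on a scrape config, roughly like this (the proxy address and target are just placeholders):

scrape_configs:
- job_name: 'node_exporter'
  proxy_url: 'http://your-proxy.example.com:3128'
  static_configs:
    - targets: ['<target host>:9100']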

vit...@contextlogic.com

May 30, 2017, 6:07:06 PM
to Prometheus Users, dzh...@contextlogic.com

Hi Julius,

Thanks for the quick response. Seems like we definitely misunderstood the use of federation.

We have two data centers each with their own set of Prometheus servers. We use service discovery to have each Prometheus scrape only the instances in its respective data center.

Our application servers (which generate these large amounts of metrics) are split roughly 50/50 between the two data centers. We wanted to use federation to have a global Prometheus that could scrape from the Prometheus servers of each respective data center, so that we could build dashboards/alerts that reflect all of our application servers.

The alternative we tried out that worked before was having a global Prometheus that scraped all the application servers across all data centers (we scaled up the instance type if we needed more capacity). Our worry was that if a data center went down, the global Prometheus's availability/performance would be affected (which might be an incorrect assumption).

Are there any drawbacks you can think of with our previous approach?

Brian Brazil

May 30, 2017, 6:31:16 PM
to vit...@contextlogic.com, Prometheus Users, dzh...@contextlogic.com

You're crossing failure domains there.

What you want is a Prometheus in each datacenter, aggregating up the metrics for its servers.

Then a global Prometheus in each datacenter (for redundancy), taking those aggregates via federation and calculating anything that can't be done in the dc-level Prometheus servers.
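As a rough sketch (metric and rule names here are illustrative, and the recording rule is shown in the newer YAML rules format), the dc-level Prometheus servers would carry rules like:

groups:
- name: dc_aggregation
  rules:
  - record: job:http_requests:rate5m
    expr: sum without (instance) (rate(http_requests_total[5m]))

and the global Prometheus servers would federate only those aggregated series:

scrape_configs:
- job_name: 'federate'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"job:.*"}'
  static_configs:
    - targets: ['<dc-level Prometheus>:9090']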

Brian
 
govinda...@gmail.com

Jun 13, 2017, 1:32:52 PM
to Prometheus Users, vit...@contextlogic.com, dzh...@contextlogic.com
Hi Brian,

We have a similar setup, with one Prometheus scraper per data center and a global federation server scraping from these Prometheus servers in the individual data centers. For some reason, when we run a PromQL query, the federate nodes show a different count than the individual scraper nodes. Any thoughts?

Thanks,
Govind



Brian Brazil

Jun 13, 2017, 1:44:16 PM
to Govindaraj Venkatesan, Prometheus Users, vit...@contextlogic.com, dzh...@contextlogic.com

That's normal; there are plenty of races in monitoring that can make the numbers slightly different. You only notice it when you're doing it twice.

Brian
 

govindaraj

Jun 13, 2017, 1:47:00 PM
to Brian Brazil, Prometheus Users, vit...@contextlogic.com, dzh...@contextlogic.com
Thanks Brian. We ran the query below against all 4 federate nodes and got this output:

topk(5, sum(sum_over_time(publish[30m])) by (root_topic))

Node 1: 530
Node 2: 560
Node 3: 500
Node 4: 510

When we run the same query against the scraper nodes:

Scraper 1: 2340
Scraper 2: 2125

Not sure why there is such a huge difference between the federate nodes and the scraper nodes.

Thanks,
Govind
--

Thanks & Regards
Govindaraj Venkatesan

govindaraj

Jun 14, 2017, 3:00:43 PM
to Brian Brazil, Prometheus Users, vit...@contextlogic.com, dzh...@contextlogic.com
Hi Brian,

This is roughly what our architecture looks like (see the attached diagram). For some reason the query output from all 4 federate nodes looks different. :(

Thoughts?

Thanks,
Govind

[Attachment: Screen Shot 2017-06-14 at 2.58.41 PM.png]