federation failed with error message "write: broken pipe"


wangchao...@gmail.com

May 16, 2018, 5:13:50 AM
to Prometheus Users
We are running Prometheus v2.0.0. It had been working well, but today I saw many of the following errors. The errors persist even after restarting the local Prometheus instance.

level=error ts=2018-05-16T08:39:32.143676493Z caller=federate.go:163 component=web msg="federation failed" err="write tcp 192.168.243.145:9090->10.0.0.12:33494: write: broken pipe"
level=error ts=2018-05-16T08:40:32.146520927Z caller=federate.go:163 component=web msg="federation failed" err="write tcp 192.168.243.145:9090->10.0.0.12:33504: write: broken pipe"
level=error ts=2018-05-16T08:41:32.154047845Z caller=federate.go:163 component=web msg="federation failed" err="write tcp 192.168.243.145:9090->10.0.0.12:33506: write: broken pipe"
level=error ts=2018-05-16T08:42:32.145742427Z caller=federate.go:163 component=web msg="federation failed" err="write tcp 192.168.243.145:9090->10.0.0.12:33508: write: broken pipe"

Does anybody have the same issue? 

daniel....@invision.de

May 23, 2018, 5:57:50 AM
to Prometheus Users
Yes, this seems to be happening to our Prometheus scraper as well. It runs in Kubernetes and exclusively scrapes metrics from our Kubernetes cluster. I have tried increasing the scrape_timeout from 10s to 15s, to no avail.

erezm...@gmail.com

Jul 12, 2018, 6:47:13 AM
to Prometheus Users
It happens to me as well when I add more jobs to the prometheus.yml file (on the main Prometheus server).
The main Prometheus federates from 3 other Prometheus servers (one for each AWS account).
The only thing that helps is increasing the scrape interval (from 5s to 10s): 'scrape_interval: 10s'. But that is only a workaround, since we cannot add new jobs unless we increase the interval again (and again, each time).
Is there a limit to how many jobs Prometheus can handle in a given time? The servers aren't working hard at all...

Martin Chodúr

Jul 13, 2018, 1:23:02 AM
to Prometheus Users
Hi,

I have seen this caused by a timeout when another Prometheus federates from this one.
Is that your case? If so, you will need to shard your instances or lower the amount of federated data.

Matthias Rampke

Jul 13, 2018, 7:44:39 AM
to Martin Chodúr, Prometheus Users
Martin is correct; federation is sadly not a very efficient process. In effect, Prometheus acts as if it were a single massive exporter, and there can be a lot of data to transfer. You can check this by curl'ing the federation endpoint directly (try with and without compression). Federation is a very simple approach, but you are running up against its inherent scaling limits.

A workaround that may get you some way further is to set up several parallel scrape jobs. Even from and to the same servers, that will give you at least some breathing room. Shard the federation selectors on the last 1-2 characters of the instance label, or something similar; you'll have to experiment with what makes sense for you. The idea is to have roughly 10 scrapes going on in parallel, each taking a different slice of the metrics. A fancier way (that I haven't tried yet) would be to use relabelling with the hashmod operation on the source Prometheus to partition the metrics, then use that label to select time series during federation, and drop it again on the central Prometheus.
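
The hashmod idea can be sketched as config. A minimal sketch, assuming a `shard` label, a modulus of 4, and placeholder job/target names (none of these are prescribed by the thread):

```yaml
# --- On the SOURCE Prometheus: assign each series to one of 4 shards ---
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node-a:9100', 'node-b:9100']
    metric_relabel_configs:
      - source_labels: [instance]
        modulus: 4
        target_label: shard        # temporary partitioning label
        action: hashmod

# --- On the CENTRAL Prometheus: one parallel federation job per shard ---
  - job_name: federate-shard-0     # repeat for shards 1..3
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{shard="0"}'
    static_configs:
      - targets: ['source-prom:9090']
    metric_relabel_configs:
      - regex: shard               # drop the helper label again
        action: labeldrop
```

Each shard job then transfers roughly a quarter of the series, and the scrapes run in parallel instead of one long serial transfer.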

Long term, you will have to consider a different approach than federation to get all your metrics in one place. One option that comes to mind is Thanos <https://github.com/improbable-eng/thanos>, which is designed to handle this problem (among others).

/MR

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f5ef63e2-4612-4c07-ae3c-00e76ff3042a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Erez Gedalyahu

Jul 16, 2018, 3:33:35 AM
to Martin Chodúr, Prometheus Users
Might be a timeout. It says "Broken Pipe", same as the original example:

level=error ts=2018-05-16T08:39:32.143676493Z caller=federate.go:163 component=web msg="federation failed" err="write tcp 192.168.243.145:9090->10.0.0.12:33494: write: broken pipe"

Different time and sockets, of course. I can't post my own errors, since I've increased the parameter 'scrape_interval' from 5s to 10s and now it's working just fine, but it would return errors again if I added more jobs...
Do you have any idea what causes it? Is there a limit to how many jobs can be federated?

Thanks in advance,
Erez



Ben Kochie

Jul 16, 2018, 4:59:13 AM
to er...@erezmaiden.com, m.ch...@seznam.cz, Prometheus Users
Federation is not intended to be replication, only for gathering summary recording rules.
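
The pattern Ben describes is usually implemented by pre-aggregating with recording rules on the source Prometheus and federating only those series. A sketch, assuming an illustrative rule name and placeholder target:

```yaml
# Source Prometheus rule file: pre-aggregate before federating.
groups:
  - name: federation
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

```yaml
# Central Prometheus: pull only the aggregated job:* series.
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['source-prom:9090']
```

This keeps the federation payload to a handful of aggregated series per job instead of every raw series.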


Matthias Rampke

Jul 16, 2018, 6:10:08 AM
to Ben Kochie, er...@erezmaiden.com, m.ch...@seznam.cz, Prometheus Users
> Do you have any idea what causes it?

Most likely, collecting and transferring all the data to be federated takes longer than the scrape timeout.

> Is there a limit of how many jobs can be federated?

It's not really a limit on the number of jobs, but on the number of time series. The specifics depend on the specs of the server you are federating from. As Ben noted, this is not what federation was designed for, so if pressed for a number I'd say: if you have more than 10,000 time series to federate, look for a different solution.
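
For reference, a federation scrape job with room in the timeout might look like this; the target name and values are placeholders, not recommendations from the thread:

```yaml
scrape_configs:
  - job_name: federate
    metrics_path: /federate
    scrape_interval: 30s
    scrape_timeout: 30s    # default is 10s; large federation responses often exceed it
    honor_labels: true
    params:
      'match[]':
        - '{job=~".+"}'
    static_configs:
      - targets: ['source-prom:9090']
```

Note that scrape_timeout cannot exceed scrape_interval, so raising the timeout usually means raising the interval with it.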

/MR


