dial tcp i/o timeout


ja...@42lines.net

Mar 24, 2016, 4:46:58 PM
to Prometheus Developers
Hello,
I have a Prometheus instance scraping ~2500 nodes, and a few of them (including the local Prometheus instance) show the following error on /status:

Get http://hostname:9090/metrics: dial tcp IP_ADDRESS:9090: i/o timeout


I don't see any kind of error message printed in the logs related to this. Any pointers on how to troubleshoot/resolve this?


Thanks,

Jarod

Brian Brazil

Mar 24, 2016, 4:52:57 PM
to ja...@42lines.net, Prometheus Developers
The host on the other end didn't respond to the HTTP connection. Is it down?


Jarod Watkins

Mar 25, 2016, 11:36:33 AM
to Brian Brazil, Prometheus Developers
It is not. The weird thing is that I can curl the metrics endpoint from another host, but not from the Prometheus machine. Also, if I try to ping a machine that gives me that error, or even localhost, I get the following:

$ ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
ping: sendmsg: Invalid argument
ping: sendmsg: Invalid argument

The total number of TCP connections on the box is low (~150) at any given time. The number of open file handles appears to be sane as well. iptables is enabled with conntrack, but again the conntrack counts are within the configured limits.

If I reduce the number of hosts I monitor, the issue appears to go away. Are there any kernel parameters I should tune when monitoring thousands of hosts?

Thanks,
Jarod

Brian Brazil

Mar 25, 2016, 12:50:51 PM
to Jarod Watkins, Prometheus Developers
On 25 March 2016 at 15:36, Jarod Watkins <ja...@42lines.net> wrote:
> It is not. The weird thing is that I can curl the metrics endpoint from another host, but not from the Prometheus machine. Also, if I try to ping a machine that gives me that error, or even localhost, I get the following:

There's something weird going on here with your networking that's not related to Prometheus. I'd suggest starting by checking your routing tables.

Brian
 


Jack Neely

Mar 29, 2016, 1:37:57 PM
to Brian Brazil, Jarod Watkins, Prometheus Developers
To follow up on this: it turned out to be an overflow of the ARP table. The kernel was suppressing most of the log messages, and it was failing to send the network packets needed to write the rest to syslog, so finding the actual problem was a lot harder than it should have been.

# sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

fixed things up for us.
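
A value set with sysctl -w is lost at the next reboot. As a minimal sketch of making the change persistent, the helper below prints a sysctl configuration fragment for the three neighbour-table thresholds. Only the gc_thresh3 value of 4096 comes from this thread; the function name, the gc_thresh1 and gc_thresh2 values in the usage note, and the file name are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a sysctl.d fragment for the three neighbour
# table (ARP cache) thresholds.
# Arguments: gc_thresh1 gc_thresh2 gc_thresh3 (integers).
render_neigh_sysctl_conf() {
    if [ "$#" -ne 3 ]; then
        echo "usage: render_neigh_sysctl_conf THRESH1 THRESH2 THRESH3" >&2
        return 1
    fi
    # The thresholds are meant to be increasing: thresh1 <= thresh2 <= thresh3.
    if [ "$1" -gt "$2" ] || [ "$2" -gt "$3" ]; then
        echo "thresholds must satisfy thresh1 <= thresh2 <= thresh3" >&2
        return 1
    fi
    printf 'net.ipv4.neigh.default.gc_thresh1 = %s\n' "$1"
    printf 'net.ipv4.neigh.default.gc_thresh2 = %s\n' "$2"
    printf 'net.ipv4.neigh.default.gc_thresh3 = %s\n' "$3"
}
```

For example, running "render_neigh_sysctl_conf 1024 2048 4096 > /etc/sysctl.d/90-neigh.conf" as root and then "sysctl --system" would reload the settings from the configuration files. The right values depend on how many hosts sit on the directly connected networks.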

Perhaps this is something that should be noted in the scaling documentation.
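
Since the kernel messages were the missing clue here, a rough sketch of how one might look for them in the kernel log follows. The function name is hypothetical, and the pattern is an assumption based on the overflow message wording differing between kernel versions ("Neighbour" on older kernels, "neighbor" on newer ones), so it matches both spellings.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print only the
# lines that report a neighbour (ARP) table overflow. The match is
# case-insensitive and accepts both "neighbour" and "neighbor".
find_neigh_overflow() {
    grep -iE 'neighbou?r table overflow'
}
```

Typical use would be "dmesg | find_neigh_overflow". The function exits with a non-zero status when nothing matches. Because the kernel rate-limits these messages, an empty result does not prove that the table never overflowed.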

Jack Neely




--
Jack Neely
Operations Engineer
42 Lines, Inc.

shaikib...@gmail.com

Jan 24, 2017, 10:17:25 AM
to Prometheus Developers, brian....@robustperception.io, ja...@42lines.net
Hi,

I followed the instructions and set sysctl -w net.ipv4.neigh.default.gc_thresh3=4096, but for some reason I still get the same kind of error as mentioned above, a read tcp i/o timeout. I can curl the metrics from the host machine where Prometheus is installed, but the status page shows the target as 'Down' with a read tcp timeout error.

Any clue on this?

Jarod Watkins

Jan 24, 2017, 11:23:30 AM
to Prometheus Developers
How many ARP entries do you have? (arp | wc -l)

Also, what are the values for these settings?

net.ipv4.neigh.default.gc_thresh1
net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh3
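
Both questions can be answered in one go. The sketch below counts entries in /proc/net/arp and compares the count with the thresholds read from /proc/sys. The function names and output wording are hypothetical, and the count is approximate (IPv4 only). The comparison treats gc_thresh2 as the soft limit and gc_thresh3 as the hard limit on the number of entries.

```shell
#!/bin/sh
# Count neighbour entries in a /proc/net/arp style file. The first line of
# that file is a column header, so it is not counted.
count_arp_entries() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "${1:-/proc/net/arp}"
}

# Compare an entry count with the soft and hard limits.
# Arguments: count gc_thresh2 gc_thresh3
arp_pressure() {
    if [ "$1" -ge "$3" ]; then
        echo "hard limit reached"
    elif [ "$1" -ge "$2" ]; then
        echo "above soft limit"
    else
        echo "ok"
    fi
}

# Print the entry count, the three thresholds, and the verdict for the
# system this runs on.
arp_report() {
    dir=/proc/sys/net/ipv4/neigh/default
    count=$(count_arp_entries)
    t1=$(cat "$dir/gc_thresh1")
    t2=$(cat "$dir/gc_thresh2")
    t3=$(cat "$dir/gc_thresh3")
    echo "entries=$count gc_thresh1=$t1 gc_thresh2=$t2 gc_thresh3=$t3"
    arp_pressure "$count" "$t2" "$t3"
}
```

Calling arp_report with no arguments reads the live /proc files. Note that arp | wc -l also counts the header line, so it reports one more than count_arp_entries does.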

--
Jarod Watkins

pppas...@gmail.com

Dec 17, 2018, 12:43:55 PM
to Prometheus Developers
Hi folks, I'm having the same problem. I already raised gc_thresh1, gc_thresh2, and gc_thresh3, and it made no difference.
I'm using Prometheus 2.5.0 in a Docker container. Is there any solution?

Thanks!!
