Hi All,
This is my first post, so please excuse me if I break any forum rules :-)
Background:
We are running a proxy server written in Go that takes a request from a client, fetches the content the client asked for, and sends a response back. The server runs on port 8080.
Go version is 1.2.1. The code runs on CentOS 6. TCP parameters are tuned for faster use of connections.
The process's max open file limit is set to 200000.
Issue:
Once every 3-5 days, the server gets into an unresponsive state, logging:
Accept error: accept tcp [::]:8080: too many open files; retrying in 1s
Accept error: accept tcp [::]:8080: too many open files; retrying in 1s
A manual restart is required every time we see this.
Suspicious findings:
-> When we looked at lsof output during internal testing, some connections stayed open forever, and their number grew at random. So we suspect there is an FD leak, and that this number goes much higher under real-world load.
Ex:
agent 13771 agent 7u IPv6 75640925 0t0 TCP 128.199.211.246:webcache->123.63.202.169:57585 (ESTABLISHED)
agent 13771 agent 9u IPv4 75636003 0t0 TCP 10.130.203.8:37627->10.130.203.8:6379 (ESTABLISHED)
agent 13771 agent 10u IPv6 75640287 0t0 TCP 128.199.211.246:webcache->222.186.129.5:fotogcad (ESTABLISHED)
agent 13771 agent 11u IPv6 75912851 0t0 TCP 128.199.211.246:webcache->222.186.129.5:ieee-mih (ESTABLISHED)
agent 13771 agent 12u IPv6 76102384 0t0 TCP 128.199.211.246:webcache->123.63.202.169:56670 (ESTABLISHED)
agent 13771 agent 13u IPv6 76080513 0t0 TCP 128.199.211.246:webcache->123.63.202.169:49662 (ESTABLISHED)
agent 13771 agent 14u IPv4 75645718 0t0 TCP 10.130.203.8:37962->10.130.203.8:6379 (ESTABLISHED)
agent 13771 agent 15u IPv6 76080655 0t0 TCP 128.199.211.246:webcache->123.63.202.169:49666 (ESTABLISHED)
agent 13771 agent 18u IPv6 76080514 0t0 TCP 128.199.211.246:webcache->123.63.202.169:49663 (ESTABLISHED)
-> In the production lsof output, we found several hundred lines like this one:
agent 15976 agent 1563u sock 0,6 0t0 39770995 can't identify protocol
-> In the server handler, we set the connection to close:
w.Header().Set("Connection", "close")
This is how we fetch the content:
page, err := GetPage(origURL)
if err != nil { return } // page may be nil on error, so there is no body to close
defer page.Body.Close()
This is how we listen:
l, e := net.Listen(proto, srv.Addr)
These are the only things controlled in application code. No timeouts are specified. We read in other similar posts that we need to specify timeouts for reads and writes.
Please let me know how to identify the problem, along with any pointers for solving it. We are not sure how to reproduce the issue. Thanks in advance.