net/http.Server: Lots of idle connections over HTTP/2 from Cloudflare


Matt Joiner

Apr 26, 2016, 7:30:01 AM
to golang-nuts
I have a server behind Cloudflare that accumulates goroutines with this stack:

175 @ 0x44fc10 0x44ab2c 0x44a2ff 0x4e1cee 0x4e1d4a 0x4e3323 0x4f3493 0x5d3fc7 0x5d4484 0x5d786b 0x54948c 0x6aaca1 0x6ab86b 0x6ab9da 0x6ee432 0x6ee251 0x544395 0x54a68d 0x54e177 0x47c391
#	0x44a2fe	net.runtime_pollWait+0x4e			/root/src/go/src/runtime/netpoll.go:160
#	0x4e1ced	net.(*pollDesc).wait+0x2d			/root/src/go/src/net/fd_poll_runtime.go:73
#	0x4e1d49	net.(*pollDesc).waitRead+0x29			/root/src/go/src/net/fd_poll_runtime.go:78
#	0x4e3322	net.(*netFD).Read+0x182				/root/src/go/src/net/fd_unix.go:212
#	0x4f3492	net.(*conn).Read+0x62				/root/src/go/src/net/net.go:173
#	0x5d3fc6	crypto/tls.(*block).readFromUntil+0x86		/root/src/go/src/crypto/tls/conn.go:457
#	0x5d4483	crypto/tls.(*Conn).readRecord+0xb3		/root/src/go/src/crypto/tls/conn.go:559
#	0x5d786a	crypto/tls.(*Conn).Read+0xfa			/root/src/go/src/crypto/tls/conn.go:1006
#	0x54948b	net/http.(*connReader).Read+0x12b		/root/src/go/src/net/http/server.go:550
#	0x6aaca0	bufio.(*Reader).fill+0x100			/root/src/go/src/bufio/bufio.go:97
#	0x6ab86a	bufio.(*Reader).ReadSlice+0xaa			/root/src/go/src/bufio/bufio.go:330
#	0x6ab9d9	bufio.(*Reader).ReadLine+0x29			/root/src/go/src/bufio/bufio.go:359
#	0x6ee431	net/textproto.(*Reader).readLineSlice+0x51	/root/src/go/src/net/textproto/reader.go:55
#	0x6ee250	net/textproto.(*Reader).ReadLine+0x20		/root/src/go/src/net/textproto/reader.go:36
#	0x544394	net/http.readRequest+0x94			/root/src/go/src/net/http/request.go:772
#	0x54a68c	net/http.(*conn).readRequest+0xfc		/root/src/go/src/net/http/server.go:729
#	0x54e176	net/http.(*conn).serve+0x3a6			/root/src/go/src/net/http/server.go:1469

These appear when there has been no activity over HTTP for a long period. I'm not sure whether it's Cloudflare keeping connections open or a bug in Go's HTTP/2. In this instance, netstat also shows 182 established TCP connections to the server's socket.
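These are goroutine dumps in the format produced by net/http/pprof's /debug/pprof/goroutine?debug=1 endpoint. A minimal sketch of exposing that endpoint on a side port (the port and the way it's wired up here are illustrative, not necessarily how this server does it):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiler on localhost only; fetch stacks with:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the real HTTPS server would be started here ...
	select {}
}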

Matt Joiner

Apr 26, 2016, 9:04:33 PM
to golang-nuts
The same server has now climbed to 347 goroutines in this state. There are 349 TCP connections, and 2 goroutines with this stack:
2 @ 0x44fc10 0x44ab2c 0x44a2ff 0x4e1cee 0x4e1d4a 0x4e3323 0x4f3493 0x5d3fc7 0x5d4484 0x5d786b 0x6aaca1 0x6ab282 0x4c8df8 0x4c8f4e 0x51e26e 0x51ea31 0x539b5c 0x53969c 0x47c391
#	0x44a2fe	net.runtime_pollWait+0x4e			/root/src/go/src/runtime/netpoll.go:160
#	0x4e1ced	net.(*pollDesc).wait+0x2d			/root/src/go/src/net/fd_poll_runtime.go:73
#	0x4e1d49	net.(*pollDesc).waitRead+0x29			/root/src/go/src/net/fd_poll_runtime.go:78
#	0x4e3322	net.(*netFD).Read+0x182				/root/src/go/src/net/fd_unix.go:212
#	0x4f3492	net.(*conn).Read+0x62				/root/src/go/src/net/net.go:173
#	0x5d3fc6	crypto/tls.(*block).readFromUntil+0x86		/root/src/go/src/crypto/tls/conn.go:457
#	0x5d4483	crypto/tls.(*Conn).readRecord+0xb3		/root/src/go/src/crypto/tls/conn.go:559
#	0x5d786a	crypto/tls.(*Conn).Read+0xfa			/root/src/go/src/crypto/tls/conn.go:1006
#	0x6aaca0	bufio.(*Reader).fill+0x100			/root/src/go/src/bufio/bufio.go:97
#	0x6ab281	bufio.(*Reader).Read+0x1a1			/root/src/go/src/bufio/bufio.go:209
#	0x4c8df7	io.ReadAtLeast+0x97				/root/src/go/src/io/io.go:315
#	0x4c8f4d	io.ReadFull+0x4d				/root/src/go/src/io/io.go:333
#	0x51e26d	net/http.http2readFrameHeader+0x6d		/root/src/go/src/net/http/h2_bundle.go:744
#	0x51ea30	net/http.(*http2Framer).ReadFrame+0x90		/root/src/go/src/net/http/h2_bundle.go:966
#	0x539b5b	net/http.(*http2clientConnReadLoop).run+0x6b	/root/src/go/src/net/http/h2_bundle.go:5734
#	0x53969b	net/http.(*http2ClientConn).readLoop+0x9b	/root/src/go/src/net/http/h2_bundle.go:5695


Peter Waller

Apr 28, 2016, 4:45:14 AM
to Matt Joiner, golang-nuts
For what it's worth, I have observed the same leak (same stack trace) with Go 1.6.1 -> Go 1.6.1 connections.

Last time I measured, it seemed to be leaking on the order of 1 file descriptor every 5 requests in my circumstances.

The only reason I haven't investigated more thoroughly yet is that the rate is low enough that it will take a long time to run out of file descriptors, and there are also frequent deploys and server restarts. But I'd love to see this fixed!

I can't see an existing issue for it, Matt; could you file one?

There is https://github.com/golang/go/issues/14375 but as far as I read that thread it looks unrelated to the stack trace you've posted.


Peter Waller

Apr 28, 2016, 5:26:36 AM
to Matt Joiner, golang-nuts
I should add, the usual culprit for this sort of thing is a missing resp.Body.Close(). I haven't found one yet in my code, but it's quite possible there are strange circumstances (e.g., error cases) that cause me to miss one. I wish there were a way to discover them automatically.

In this case, I suppose the missing resp.Body.Close() could be on CloudFlare's side.
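The closest thing I can think of is a crude check in a test: drive the client code against a test server and see whether the goroutine count settles back down afterwards. A rough sketch (the test shape, sleep, and threshold are made up, and it only catches leaks in code a test can drive):

package example

import (
	"net/http"
	"net/http/httptest"
	"runtime"
	"testing"
	"time"
)

func TestNoLeakedConnGoroutines(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	defer srv.Close()

	before := runtime.NumGoroutine()

	for i := 0; i < 50; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			t.Fatal(err)
		}
		resp.Body.Close() // comment this out and the goroutine count below climbs
	}

	// Drop pooled keep-alive connections so only genuinely stuck goroutines
	// remain, then give everything a moment to wind down.
	http.DefaultTransport.(*http.Transport).CloseIdleConnections()
	time.Sleep(200 * time.Millisecond)

	if after := runtime.NumGoroutine(); after > before+3 {
		t.Errorf("possible connection leak: %d goroutines before, %d after", before, after)
	}
}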

Brian Ketelsen

Apr 28, 2016, 12:05:05 PM
to golang-nuts, anac...@gmail.com
I've seen this same behavior on a Caddy server behind Cloudflare. Not adding anything useful, but it is a +1 to the problem.

msilv...@cloudflare.com

Apr 28, 2016, 12:15:04 PM
to golang-nuts, anac...@gmail.com
Hi all—I mentioned this to Matt Joiner in a separate off-list thread, but CloudFlare doesn't do HTTP/2 to the origin (CF -> origin) yet. We definitely do HTTP/2 to the user. If you're seeing hanging (idle) HTTP/2 connections, it's extremely unlikely they are coming from us. Further, our default timeout settings (connect, send, read) in nginx should prevent any issues on that front over HTTP/1.1.

Still, if you're seeing problems and think there's something related, I'd be curious to see it. If you can supply a minimal reproducible test case (e.g. Caddy version & config, or a small Go server that triggers the issue), shoot me an email so I can investigate.

Cheers,
Matt Silverlock
Solutions Engineer

CloudFlare, Inc.

Matt Joiner

Apr 29, 2016, 1:31:11 AM
to Peter Waller, golang-nuts
Yes, it does make sense that leaking connections like this would be more likely on HTTP != 2, since HTTP/2 should reuse them even with incomplete request/response pairs. The fact that the 350-odd connections all come from 3-4 Cloudflare IPs similarly matches Matt Silverlock's statement that they're not using HTTP/2 to the origin (otherwise I'd presumably see only one connection from each IP). I have found one resp.Body that wasn't closed, but it was completely consumed, and I don't believe that causes a leak; I'll continue investigating. Thanks for the help so far.

Peter Waller

Apr 29, 2016, 2:56:36 AM
to Matt Joiner, golang-nuts
On 29 April 2016 at 06:30, Matt Joiner <anac...@gmail.com> wrote:
I have found one resp.Body that wasn't closed, but it was completely consumed, and I don't believe this causes a leak, but I'll continue investigating.


The client must close the response body when finished with it: 

Clients must call resp.Body.Close when finished reading resp.Body. 

Callers should close resp.Body when done reading from it. If resp.Body is not closed, the Client's underlying RoundTripper (typically Transport) may not be able to re-use a persistent TCP connection to the server for a subsequent "keep-alive" request.

I also believed Close wasn't necessary if you read the body out. But if that was ever the case, it is not the case now.
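In other words, the pattern that keeps the connection reusable is to drain the body and always close it. A sketch (the function name and URL handling are placeholders):

package example

import (
	"io"
	"io/ioutil"
	"net/http"
)

// fetchAndDiscard drains the response body and always closes it, so the
// Transport can return the underlying connection to its keep-alive pool.
func fetchAndDiscard(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(ioutil.Discard, resp.Body)
	return err
}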

See also my pain with Keep-Alive not working, where I first learned this:

Both Keep-Alives and HTTP2 not working. High latency for every connection.
https://groups.google.com/d/topic/golang-nuts/K5sJ3gX1YO8/discussion

Peter Waller

Apr 30, 2016, 5:07:27 PM
to Matt Joiner, golang-nuts
FWIW, I just did a little experiment with this, which is some reasonably minimal standalone code with a TLS HTTP server:



You'll see exactly the stack trace you posted.

If you uncomment line 56 ("defer resp.Body.Close()"), the problem goes away. So I think our problem is currently likely to be an unclosed response body. I still haven't found mine yet, but now I know to look harder :)

Peter Waller

May 1, 2016, 6:40:05 AM
to Matt Joiner, golang-nuts
I've finally figured out why my goroutines were leaking. I was unintentionally and periodically creating many more *HTTPClients than I need, and sending requests through those new clients. This explains why the leak was so slow relative to the request rate. The old clients seem to keep their Keep-Alive pools indefinitely. I tried calling CloseIdleConnections before replacing the HTTPClient, but this doesn't seem to have any effect.

I can "fix" my leak by being a bit more careful about how often I create HTTPClients.

However, is it expected that an HTTPClient will keep connections around indefinitely? The documentation implies that it might:

By default, Transport caches connections for future re-use. This may leave many open connections when accessing many hosts. This behavior can be managed using Transport's CloseIdleConnections method and the MaxIdleConnsPerHost and DisableKeepAlives fields.

I wonder why CloseIdleConnections doesn't work in my case.

My expectation was that the idle connections would only hang around for some limited time, but I can't see obvious evidence for that.
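For what it's worth, the shape I'm moving towards is a single long-lived client shared by everything, roughly like this sketch (the field values and the retireClient helper are illustrative, not code from my server):

package example

import (
	"net/http"
	"time"
)

// One long-lived client shared by all requests, instead of constructing a
// fresh client per batch of work.
var sharedClient = &http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		MaxIdleConnsPerHost: 4,
	},
}

// retireClient shows what replacing a client would need to look like:
// CloseIdleConnections must be called on the Transport that owns the pool,
// and it only releases connections that are actually idle, i.e. whose
// response bodies have already been read and closed.
func retireClient(c *http.Client) {
	if t, ok := c.Transport.(*http.Transport); ok {
		t.CloseIdleConnections()
	}
}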

For reference, here are the goroutine stacks I saw:

10 @ 0x431c5d 0x42c971 0x42be13 0x688826 0x6888b4 0x68ad5b 0x6a5031 0x6d1fcd 0x60d463 0x60e02a 0x61405d 0x6fada5 0x6fbbae 0x5b5d59 0x5b5f07 0x73d6bd 0x73f4d6 0x75b7e7 0x75aeec 0x464681
# 0x42be13 net.runtime_pollWait+0x63 /usr/local/go/src/runtime/netpoll.go:160
# 0x688826 net.(*pollDesc).Wait+0x56 /usr/local/go/src/net/fd_poll_runtime.go:73
# 0x6888b4 net.(*pollDesc).WaitRead+0x44 /usr/local/go/src/net/fd_poll_runtime.go:78
# 0x68ad5b net.(*netFD).Read+0x27b /usr/local/go/src/net/fd_unix.go:250
# 0x6a5031 net.(*conn).Read+0x121 /usr/local/go/src/net/net.go:172
# 0x6d1fcd net.(*TCPConn).Read+0x7d <autogenerated>:68
# 0x60d463 crypto/tls.(*block).readFromUntil+0x143 /usr/local/go/src/crypto/tls/conn.go:460
# 0x60e02a crypto/tls.(*Conn).readRecord+0x3aa /usr/local/go/src/crypto/tls/conn.go:562
# 0x61405d crypto/tls.(*Conn).Read+0x1bd /usr/local/go/src/crypto/tls/conn.go:939
# 0x6fada5 bufio.(*Reader).fill+0x365 /usr/local/go/src/bufio/bufio.go:97
# 0x6fbbae bufio.(*Reader).Read+0x54e /usr/local/go/src/bufio/bufio.go:207
# 0x5b5d59 io.ReadAtLeast+0x119 /usr/local/go/src/io/io.go:297
# 0x5b5f07 io.ReadFull+0x77 /usr/local/go/src/io/io.go:315

And the server side leaked connection:

10 @ 0x431c5d 0x42c971 0x42be13 0x688826 0x6888b4 0x68ad5b 0x6a5031 0x6d1fcd 0x60d463 0x60e02a 0x61405d 0x5b5d59 0x5b5f07 0x49b80d 0x49d4f6 0x4b2c61 0x464681
# 0x42be13 net.runtime_pollWait+0x63 /usr/local/go/src/runtime/netpoll.go:160
# 0x688826 net.(*pollDesc).Wait+0x56 /usr/local/go/src/net/fd_poll_runtime.go:73
# 0x6888b4 net.(*pollDesc).WaitRead+0x44 /usr/local/go/src/net/fd_poll_runtime.go:78
# 0x68ad5b net.(*netFD).Read+0x27b /usr/local/go/src/net/fd_unix.go:250
# 0x6a5031 net.(*conn).Read+0x121 /usr/local/go/src/net/net.go:172
# 0x6d1fcd net.(*TCPConn).Read+0x7d <autogenerated>:68
# 0x60d463 crypto/tls.(*block).readFromUntil+0x143 /usr/local/go/src/crypto/tls/conn.go:460
# 0x60e02a crypto/tls.(*Conn).readRecord+0x3aa /usr/local/go/src/crypto/tls/conn.go:562
# 0x61405d crypto/tls.(*Conn).Read+0x1bd /usr/local/go/src/crypto/tls/conn.go:939
# 0x5b5d59 io.ReadAtLeast+0x119 /usr/local/go/src/io/io.go:297
# 0x5b5f07 io.ReadFull+0x77 /usr/local/go/src/io/io.go:315
# 0x49b80d net/http.http2readFrameHeader+0xcd /usr/local/go/src/net/http/h2_bundle.go:715
# 0x49d4f6 net/http.(*http2Framer).ReadFrame+0x166 /usr/local/go/src/net/http/h2_bundle.go:906
# 0x4b2c61 net/http.(*http2serverConn).readFrames+0x81 /usr/local/go/src/net/http/h2_bundle.go:3067

10 @ 0x431c5d 0x442140 0x441532 0x4b4715 0x4b03b0 0x512e96 0x4f19c3 0x464681
# 0x4b4715 net/http.(*http2serverConn).serve+0x1205 /usr/local/go/src/net/http/h2_bundle.go:3163
# 0x4b03b0 net/http.(*http2Server).ServeConn+0x1020 /usr/local/go/src/net/http/h2_bundle.go:2711
# 0x512e96 net/http.http2ConfigureServer.func1+0x116 /usr/local/go/src/net/http/h2_bundle.go:2603
# 0x4f19c3 net/http.(*conn).serve+0xa33 /usr/local/go/src/net/http/server.go:1414

Matt Joiner

May 1, 2016, 10:33:24 AM
to Peter Waller, golang-nuts
Mine are just the server serve goroutines, with nothing HTTP/2-related in the stack and no client leaks. :| Still all from Cloudflare.

Matt Joiner

May 11, 2016, 10:45:13 AM
to golang-nuts
A common theme here seems to be TLS servers. I had a filter on TLS error messages that I've now disabled in my production server, to see whether I can correlate the goroutine leak count with the number of error messages.

Matt Silverlock

May 12, 2016, 1:48:29 AM
to golang-nuts
Are you calling `http.ListenAndServe`, or are you creating a *http.Server and explicitly setting the ReadTimeout & WriteTimeout?

Matt Joiner

May 12, 2016, 2:10:54 AM
to Matt Silverlock, golang-nuts

srv := http.Server{
	Addr:      flags.TLSAddr.String(),
	Handler:   h,
	TLSConfig: &tls.Config{},
	// ErrorLog:  log.New(&httpServerLoggerWriter{os.Stderr}, log.Prefix(), log.Flags()),
}
srv.TLSConfig.Certificates, err = loadCertificates(filepath.Join(dataDir, "ssl"))
if err != nil {
	log.Fatal(err)
}
if len(srv.TLSConfig.Certificates) == 0 {
	log.Fatal("no certificates; no tls")
}
srv.TLSConfig.BuildNameToCertificate()
// http2.ConfigureServer(&srv, nil)
log.Fatal(srv.ListenAndServeTLS("", ""))

So I'm not setting ReadTimeout or WriteTimeout at all. Does that matter?


Matt Silverlock

May 12, 2016, 9:20:07 AM
to golang-nuts
> So I'm not setting ReadTimeout or WriteTimeout at all. Does that matter?

Yes, because you will potentially leak FDs when you don't time out.
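
Roughly along these lines (the durations are illustrative rather than recommendations, and the certificate paths are placeholders):

package example

import (
	"log"
	"net/http"
	"time"
)

// serveTLS sets explicit timeouts so a client that goes quiet can't hold a
// connection, its goroutine, and its file descriptor open forever.
func serveTLS(addr string, h http.Handler) {
	srv := &http.Server{
		Addr:         addr,
		Handler:      h,
		ReadTimeout:  30 * time.Second, // limit on reading each request
		WriteTimeout: 60 * time.Second, // limit on writing each response
	}
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}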

Matt Joiner

May 12, 2016, 9:21:46 AM
to Matt Silverlock, golang-nuts
Won't that require the remote end to hold the connection open? Otherwise, wouldn't the TCP timeout take care of it?

On 12 May 2016 at 23:20, Matt Silverlock <elit...@gmail.com> wrote:
> So I'm not setting ReadTimeout or WriteTimeout at all. Does that matter?

Yes, because you will potentially leak FDs when you don't time out.

Matt Joiner

May 16, 2016, 2:06:48 AM
to Matt S, golang-nuts
I've done some investigating:

Since I'm using http.Server.ListenAndServeTLS, TCPConn.SetKeepAlive(true) is being called with a period of 3 minutes. My system also has a tcp_keepalive_time of 2 hours, which is being exceeded by a large margin. Evidently neither is having an effect. The Linux docs state that you must explicitly set the socket keep-alive option for the default (2 hours) to take effect, and that call is being made correctly in the Go code.

I'm fairly certain that something is holding these goroutines open. I'm pretty sure this isn't supposed to happen:

goroutine 273959793 [IO wait, 1607 minutes]:
net.runtime_pollWait(0x7f2f44618248, 0x72, 0xc829ab1000)
	/root/src/go/src/runtime/netpoll.go:160 +0x60
net.(*pollDesc).Wait(0xc829bee290, 0x72, 0x0, 0x0)
	/root/src/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc829bee290, 0x0, 0x0)
	/root/src/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).Read(0xc829bee230, 0xc829ab1000, 0x1000, 0x1000, 0x0, 0x7f2f4ef91050, 0xc820016180)
	/root/src/go/src/net/fd_unix.go:250 +0x23a
net.(*conn).Read(0xc82a7bf230, 0xc829ab1000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/root/src/go/src/net/net.go:172 +0xe4
net/http.(*connReader).Read(0xc82c9244c0, 0xc829ab1000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/root/src/go/src/net/http/server.go:526 +0x196
bufio.(*Reader).fill(0xc821d50420)
	/root/src/go/src/bufio/bufio.go:97 +0x1e9
bufio.(*Reader).ReadSlice(0xc821d50420, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0)
	/root/src/go/src/bufio/bufio.go:328 +0x21a
bufio.(*Reader).ReadLine(0xc821d50420, 0x0, 0x0, 0x0, 0xc23400, 0x0, 0x0)
	/root/src/go/src/bufio/bufio.go:357 +0x53
net/textproto.(*Reader).readLineSlice(0xc835c44150, 0x0, 0x0, 0x0, 0x0, 0x0)
	/root/src/go/src/net/textproto/reader.go:55 +0x81
net/textproto.(*Reader).ReadLine(0xc835c44150, 0x0, 0x0, 0x0, 0x0)
	/root/src/go/src/net/textproto/reader.go:36 +0x40
net/http.readRequest(0xc821d50420, 0x0, 0xc8289e3260, 0x0, 0x0)
	/root/src/go/src/net/http/request.go:721 +0xb6
net/http.(*conn).readRequest(0xc823c64f00, 0x0, 0x0, 0x0)
	/root/src/go/src/net/http/server.go:705 +0x359
net/http.(*conn).serve(0xc823c64f00)
	/root/src/go/src/net/http/server.go:1425 +0x947
created by net/http.(*Server).Serve
	/root/src/go/src/net/http/server.go:2137 +0x44e

On 12 May 2016 at 23:48, Matt S <ma...@eatsleeprepeat.net> wrote:
It does not.

If you are seeing connections held open for > 90s (default CloudFlare read timeout) or > 100s (write timeout) as per our nginx config today, then the remote end isn't holding the connection open—the local machine just thinks that to be the case.

