How to quickly/reliably cancel an HTTP Request?

John Potocny

Dec 11, 2015, 12:14:25 PM
to golang-nuts
Hey everyone, I've been tearing my hair out over this problem, but maybe the community can help :)

I have a service written in Go that runs as a data collector. It sends batches of data up to a server once a minute over HTTPS, and also does a "check-in" with the server to report its current health. Recently I've been dealing with a problem where the collector's requests posting new batches of data get stuck for a long time, causing it to get backed up.

My collector is built with Go 1.5.2 and uses the standard net/http library to make requests to the server; the http.Client is constructed like so:

timeout := time.Minute

timeoutDialer := net.Dialer{
    Timeout:   timeout,
    KeepAlive: timeout,
}

// logTransport is just a wrapper on the normal http.Transport with some debug statements added.
transport := &logTransport{}
transport.Proxy = noProxy
transport.ResponseHeaderTimeout = timeout
transport.TLSHandshakeTimeout = timeout / 3
transport.Dial = timeoutDialer.Dial

client := &http.Client{
    CheckRedirect: doNotFollowRedirects,
    Timeout:       timeout, // if we can't contact the api in one minute, just give up.
    Transport:     transport,
}


As shown above, my http.Client has timeouts set just about everywhere they can be defined in the standard interface. Even so, I'm seeing requests to my server get blocked for much longer than the configured one minute. Logs show requests taking more than 5 minutes to cancel (my logTransport struct measures the time spent waiting for http.Transport.CancelRequest(req) to finish), but from what I've seen looking through the http package there's no reason for it to take that long.


Is there something I'm missing with respect to the way HTTP requests are cancelled in the standard library (i.e., is the above behavior expected)? Any help here would be much appreciated.

Cheers,
John

Brad Fitzpatrick

Dec 11, 2015, 2:23:55 PM
to John Potocny, golang-nuts
Show the code for your wrapper.

My guess is it doesn't have a CancelRequest method (the old style of cancelation), which the Client looks for when using Timeout.

But I'd use the new style of cancelation instead: set the Cancel channel on your http.Request, and then close it to cancel the request when you want to.



Tim K

Dec 11, 2015, 2:50:39 PM
to golang-nuts, jo...@vividcortex.com
Not to sidetrack this thread, but I didn't know about the Cancel channel. Are there any clever tricks to easily hook it up to a time.After without any glue code?
Would it have been better to declare the channel as <-chan Time instead, so it could be used directly with time.After for easy timeouts, while still working on channel close for other cancellation rules not based on time?
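Right now the best I can come up with is a few lines of glue along these lines (just a sketch; req is the *http.Request being sent):

cancel := make(chan struct{})
req.Cancel = cancel
go func() {
    // close the Cancel channel once the timeout fires
    <-time.After(time.Minute)
    close(cancel)
}()

which works, but still needs the extra goroutine (or a time.AfterFunc), hence the question.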

What happens if you simply write to the Cancel channel without closing it?

John Potocny

Dec 11, 2015, 2:51:21 PM
to golang-nuts, jo...@vividcortex.com
Sure thing, sorry, I should have included it originally. It's really simple; here it is in its entirety:

type logTransport struct {
    http.Transport
}

func (t *logTransport) CancelRequest(req *http.Request) {
    start := time.Now()

    log.Warnf("Cancelling request with uri %q", req.URL.RequestURI())
    t.Transport.CancelRequest(req)
    log.Warnf("Request with uri %q successfully cancelled after %v", req.URL.RequestURI(), time.Now().Sub(start))
}

I should also have noted that originally the wrapper wasn't involved at all; I added it so I could wrap calls into the transport and log when methods were called, to help me understand what was going on when requests were hanging. Sorry again for not making that clear in the original post.

So, to make sure I understand, the recommended method of cancelling a request is to do something like the following:

var request *http.Request
request, err = http.NewRequest(httpVerb, fullURI, rc)
if err != nil {
    return
}

cancel := make(chan struct{})
request.Cancel = cancel
timer := time.AfterFunc(time.Minute, func() {
    close(cancel)
})

res, err := http.DefaultClient.Do(request)
stopped := timer.Stop()

// Check if the request failed or was cancelled below.


Tamás Gulácsi

Dec 11, 2015, 4:12:30 PM
to golang-nuts

Brad Fitzpatrick

Dec 11, 2015, 7:49:06 PM
to John Potocny, golang-nuts
I see no obvious problems with that.

I don't know why you're seeing 5 minutes instead of 1 minute. Maybe send the process SIGQUIT when it's being slow and get the dump of goroutines and see if one is stuck somewhere?
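If sending a signal to the process is awkward, you can also dump the stacks from inside the program; a rough sketch (the dumpGoroutines name is made up):

import (
    "os"
    "runtime/pprof"
)

// dumpGoroutines writes every goroutine's stack to stderr. debug=2 produces
// full stack traces, similar to what SIGQUIT prints.
func dumpGoroutines() {
    pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}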

John Potocny

Jan 11, 2016, 5:00:27 PM
to golang-nuts, jo...@vividcortex.com
I've been away from this for a while due to the holidays, but I was able to make some progress and figure out why my requests were taking so long to cancel. I figure I should update the thread for anyone who stumbles upon this down the road.

I instrumented the CancelRequest method in my Transport to take a snapshot of the program's goroutines when it got stuck (I don't have console access to the machine running the program, so I couldn't just send a SIGQUIT), and isolated the routines that were blocked. The partial stack traces are shown below (I've removed references to bits I can't share, but they don't affect the output in a meaningful way):


goroutine 73 [semacquire, 1 minutes]:
sync.runtime_Semacquire(0xc8204a7784)
    /usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*Mutex).Lock(0xc8204a7780)
    /usr/local/go/src/sync/mutex.go:82 +0x1c4
crypto/tls.(*Conn).sendAlert(0xc8204a7600, 0x939300, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:684 +0x44
crypto/tls.(*Conn).Close(0xc8204a7600, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:950 +0xa6
net/http.(*persistConn).closeLocked(0xc820283340)
    /usr/local/go/src/net/http/transport.go:1248 +0x4e
net/http.(*persistConn).cancelRequest(0xc820283340)
    /usr/local/go/src/net/http/transport.go:859 +0x8c
net/http.(*persistConn).(net/http.cancelRequest)-fm()
    /usr/local/go/src/net/http/transport.go:1108 +0x20
net/http.(*Transport).CancelRequest(0xc820094990, 0xc8202d5500)
    /usr/local/go/src/net/http/transport.go:285 +0xca
The goroutine attempting to cancel the request is waiting for a mutex in the tls.(*Conn) implementation. I looked up the source, and the mutex in question controls access to writing data over the connection. It appears that when CancelRequest attempts to call conn.Close(), the underlying tls.Conn tries to send an alert to the server that the connection is being closed.

Normally this wouldn't be a problem, except for the behavior I see in the goroutine making the request:

goroutine 71 [IO wait, 2 minutes]:
net.runtime_pollWait(0x7f58739daa18, 0x77, 0xc82000c1c0)
    /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820344990, 0x77, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitWrite(0xc820344990, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:82 +0x36
net.(*netFD).Write(0xc820344930, 0xc82030a000, 0x101d, 0x2000, 0xf2, 0x7f58739d5050, 0xc82000c1c0)
    /usr/local/go/src/net/fd_unix.go:318 +0x414
net.(*conn).Write(0xc820026078, 0xc82030a000, 0x101d, 0x2000, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:184 +0xe4
crypto/tls.(*Conn).writeRecord(0xc8204a7600, 0x939317, 0xc820415000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:745 +0x548
crypto/tls.(*Conn).Write(0xc8204a7600, 0xc820415000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:877 +0x40e
bufio.(*Writer).flush(0xc820544c00, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:562 +0xe0
bufio.(*Writer).ReadFrom(0xc820544c00, 0x7f58739e3b88, 0xc8201575c0, 0x4aeb9, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:686 +0x134
io.copyBuffer(0x7f58739e3a88, 0xc820544c00, 0x7f58739e3b88, 0xc8201575c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:375 +0x180
io.Copy(0x7f58739e3a88, 0xc820544c00, 0x7f58739e3b88, 0xc8201575c0, 0x0, 0x0, 0x0)
    /usr/local/go/src/io/io.go:351 +0x64
net/http.(*transferWriter).WriteBody(0xc820344620, 0x7f58739e3a88, 0xc820544c00, 0x0, 0x0)
    /usr/local/go/src/net/http/transfer.go:225 +0x8fe
net/http.(*Request).write(0xc8202d5500, 0x7f58739e3a88, 0xc820544c00, 0x0, 0xc8202f7350, 0x0, 0x0)
    /usr/local/go/src/net/http/request.go:462 +0xbb9
net/http.(*persistConn).writeLoop(0xc820283340)
    /usr/local/go/src/net/http/transport.go:1015 +0x27c
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:686 +0xc9d

I can see from the above stack trace that the blocked request is stuck in the writeLoop, trying to send data over the tls.Conn. Since we're waiting for a write to finish, there's no way for CancelRequest to actually complete until whatever is blocking the request unblocks on its own.

It's not clear to me why the write gets stuck for so long, though; I guess that will be the next thing for me to take a look at.

Brad Fitzpatrick

Jan 11, 2016, 7:47:55 PM
to John Potocny, Adam Langley, golang-nuts
[+agl as FYI]

John,

Are you defining your own Dial or DialTLS func for the transport? Or are you using DefaultTransport?

If this is the default, this is a little worrying. It's easy enough to make CancelRequest not block by calling net.Conn.Close in its own goroutine, but it will still leak that goroutine forever if the mutex is acquired by an also-deadlocked goroutine stuck in a Write to a dead TCP connection with no timeout. The question is really how to prevent that Write goroutine from deadlocking forever (in your case: 2 minutes so far).

The default TCP keep-alive time should be more like 3 minutes, IIRC, so maybe the goroutine would actually go away and die in 3 minutes. But CancelRequest shouldn't take 3 minutes. It could run in its own goroutine, and I suppose it could even close the underlying TCP connection directly, rather than ask TLS to do it with the alertCloseNotify. The whole point of this is to be forceful, not graceful. We've already failed and are giving up on the peer.
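As a stopgap in user code, something like this in your wrapper would at least keep the cancelling goroutine from blocking (a sketch only; per the above, it can still leak a goroutine until the stuck Write unblocks):

func (t *logTransport) CancelRequest(req *http.Request) {
    // Fire the cancel asynchronously so a Close that is stuck behind the TLS
    // write mutex doesn't block the caller.
    go t.Transport.CancelRequest(req)
}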


John Potocny

Jan 11, 2016, 9:42:04 PM
to golang-nuts, jo...@vividcortex.com, a...@golang.org
Hi Brad,

The transport is defined as the wrapper above, with dial/keep-alive timeouts set to twice what the DefaultTransport sets for the dialer (1 minute in our case). There is also a ResponseHeaderTimeout set to 1 minute. Other than the modified timeouts, though, there isn't anything different in terms of transport configuration.

Regarding how long the write goroutine blocks, as a couple of data points I've seen requests take anywhere from 5 to 25 minutes to complete once the timeout triggers (I believe the one that produced the trace I shared completed after 6 minutes), so it can take longer than the default TCP keep-alive.

My next step to try to address this will likely be to wrap the net.TCPConn returned by the Transport's dialer and set write deadlines on all write operations, and see whether that has any effect on request duration. If you have any other suggestions, or if there's any additional information I can provide to add context to what I'm seeing here, please let me know.
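Roughly what I have in mind (a sketch only; deadlineConn is a made-up name, and timeout/timeoutDialer/transport are the values from my first message):

type deadlineConn struct {
    net.Conn
    writeTimeout time.Duration
}

func (c *deadlineConn) Write(p []byte) (int, error) {
    // Refresh the write deadline before every write so a dead peer can't block
    // the underlying TCP write forever.
    if err := c.Conn.SetWriteDeadline(time.Now().Add(c.writeTimeout)); err != nil {
        return 0, err
    }
    return c.Conn.Write(p)
}

// Hooked into the transport in place of the plain dialer:
transport.Dial = func(network, addr string) (net.Conn, error) {
    conn, err := timeoutDialer.Dial(network, addr)
    if err != nil {
        return nil, err
    }
    return &deadlineConn{Conn: conn, writeTimeout: timeout}, nil
}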

Cheers,
John 

Brad Fitzpatrick

Jan 13, 2016, 7:32:42 PM
to John Potocny, golang-nuts, Adam Langley
John,

As an update on this, we submitted this for Go 1.6: 

crypto/tls: don't block in Conn.Close if Writes are in-flight

That should help with this, if not fix it entirely.

There might be more yet to do, though. I didn't have time to investigate enough. If you continue to see this, I recommend you file a bug at https://golang.org/issue/new and provide repro instructions. I'd love to help fix it.

John Potocny

Jan 13, 2016, 9:50:17 PM
to golang-nuts, jo...@vividcortex.com, a...@golang.org
Thanks for taking the time to submit a fix for this, Brad, especially when it's so close to the release date. Much appreciated :)

On my side, I've had good success over the last day or so mitigating the problem with a wrapper on the tls.Conn that calls conn.SetDeadline() before calling conn.Read/Write. I'll probably remove it once I can update the code to Go 1.6, so if I see any further trouble after that I'll be sure to file an issue.

Thanks again for your help here!