I've been away from this for a while due to the holidays, but I was able to make some progress and figure out why my requests were taking so long to cancel. I figure I should update the thread for anyone who stumbles upon this down the road.
I instrumented the CancelRequest method in my Transport to take a snapshot of the program's goroutines when it got stuck (I don't have console access to the machine running the program, so I couldn't just send it a SIGQUIT), and isolated the goroutines that were blocked.
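Something along these lines works for capturing that kind of dump; the wrapper type and the 10-second threshold here are illustrative rather than my exact code:

package transportdebug

import (
    "net/http"
    "os"
    "runtime"
    "time"
)

// cancelDumpTransport wraps the real *http.Transport so that a cancellation
// that takes too long triggers a full goroutine dump to stderr.
type cancelDumpTransport struct {
    *http.Transport
}

func (t *cancelDumpTransport) CancelRequest(req *http.Request) {
    done := make(chan struct{})
    go func() {
        t.Transport.CancelRequest(req)
        close(done)
    }()
    select {
    case <-done:
    case <-time.After(10 * time.Second):
        // Cancellation looks stuck: snapshot every goroutine's stack so we
        // can see what CancelRequest is blocked behind.
        buf := make([]byte, 1<<20)
        n := runtime.Stack(buf, true)
        os.Stderr.Write(buf[:n])
        <-done // still wait for the real cancellation to finish
    }
}

The partial stack traces I pulled out of the dump are shown below (I've removed the references to bits I can't share, but they don't affect the output in a meaningful way):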
goroutine 73 [semacquire, 1 minutes]:
sync.runtime_Semacquire(0xc8204a7784)
/usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*Mutex).Lock(0xc8204a7780)
/usr/local/go/src/sync/mutex.go:82 +0x1c4
crypto/tls.(*Conn).sendAlert(0xc8204a7600, 0x939300, 0x0, 0x0)
/usr/local/go/src/crypto/tls/conn.go:684 +0x44
crypto/tls.(*Conn).Close(0xc8204a7600, 0x0, 0x0)
/usr/local/go/src/crypto/tls/conn.go:950 +0xa6
net/http.(*persistConn).closeLocked(0xc820283340)
/usr/local/go/src/net/http/transport.go:1248 +0x4e
net/http.(*persistConn).cancelRequest(0xc820283340)
/usr/local/go/src/net/http/transport.go:859 +0x8c
net/http.(*persistConn).(net/http.cancelRequest)-fm()
/usr/local/go/src/net/http/transport.go:1108 +0x20
net/http.(*Transport).CancelRequest(0xc820094990, 0xc8202d5500)
/usr/local/go/src/net/http/transport.go:285 +0xca
The goroutine attempting to cancel the request is waiting on a mutex inside the tls.(*Conn) implementation. I looked up the source, and the mutex in question guards writes to the connection: when CancelRequest closes the persistConn, the underlying tls.Conn tries to send an alert telling the server the connection is being closed, and sending that alert requires taking the write lock.
Normally this wouldn't be a problem, except for the behavior I see in the goroutine that's actually writing the request:
goroutine 71 [IO wait, 2 minutes]:
net.runtime_pollWait(0x7f58739daa18, 0x77, 0xc82000c1c0)
/usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820344990, 0x77, 0x0, 0x0)
/usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitWrite(0xc820344990, 0x0, 0x0)
/usr/local/go/src/net/fd_poll_runtime.go:82 +0x36
net.(*netFD).Write(0xc820344930, 0xc82030a000, 0x101d, 0x2000, 0xf2, 0x7f58739d5050, 0xc82000c1c0)
/usr/local/go/src/net/fd_unix.go:318 +0x414
net.(*conn).Write(0xc820026078, 0xc82030a000, 0x101d, 0x2000, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:184 +0xe4
crypto/tls.(*Conn).writeRecord(0xc8204a7600, 0x939317, 0xc820415000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/crypto/tls/conn.go:745 +0x548
crypto/tls.(*Conn).Write(0xc8204a7600, 0xc820415000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/crypto/tls/conn.go:877 +0x40e
bufio.(*Writer).flush(0xc820544c00, 0x0, 0x0)
/usr/local/go/src/bufio/bufio.go:562 +0xe0
bufio.(*Writer).ReadFrom(0xc820544c00, 0x7f58739e3b88, 0xc8201575c0, 0x4aeb9, 0x0, 0x0)
/usr/local/go/src/bufio/bufio.go:686 +0x134
io.copyBuffer(0x7f58739e3a88, 0xc820544c00, 0x7f58739e3b88, 0xc8201575c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/io/io.go:375 +0x180
io.Copy(0x7f58739e3a88, 0xc820544c00, 0x7f58739e3b88, 0xc8201575c0, 0x0, 0x0, 0x0)
/usr/local/go/src/io/io.go:351 +0x64
net/http.(*transferWriter).WriteBody(0xc820344620, 0x7f58739e3a88, 0xc820544c00, 0x0, 0x0)
/usr/local/go/src/net/http/transfer.go:225 +0x8fe
net/http.(*Request).write(0xc8202d5500, 0x7f58739e3a88, 0xc820544c00, 0x0, 0xc8202f7350, 0x0, 0x0)
/usr/local/go/src/net/http/request.go:462 +0xbb9
net/http.(*persistConn).writeLoop(0xc820283340)
/usr/local/go/src/net/http/transport.go:1015 +0x27c
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:686 +0xc9d
From the above stack trace I can see that the blocked request is stuck in the transport's writeLoop, trying to send the request body over the tls.Conn. tls.(*Conn).Write takes the same write mutex that sendAlert needs, and it holds it for as long as the underlying network write is blocked, so there's no way for CancelRequest to complete until whatever is stalling that write clears up on its own.
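To make that interaction concrete, here's a toy sketch (not the stdlib code, just the shape of the problem as I understand it): one goroutine holds a write mutex while it's stuck in a slow write, and Close can't take that same mutex to send its alert until the write returns.

package main

import (
    "fmt"
    "sync"
    "time"
)

// fakeTLSConn stands in for crypto/tls.Conn: both Write and the alert sent
// by Close need the same write mutex.
type fakeTLSConn struct {
    writeMu sync.Mutex
}

func (c *fakeTLSConn) Write(p []byte) {
    c.writeMu.Lock()
    defer c.writeMu.Unlock()
    // Pretend the peer stopped reading and the kernel send buffer is full,
    // so the underlying network write doesn't return for a long time.
    time.Sleep(5 * time.Second)
}

func (c *fakeTLSConn) Close() {
    start := time.Now()
    // Like tls.(*Conn).sendAlert, Close has to take the write lock before it
    // can send anything, so it sits behind the stuck Write above.
    c.writeMu.Lock()
    defer c.writeMu.Unlock()
    fmt.Printf("Close got the write lock after %v\n", time.Since(start))
}

func main() {
    c := &fakeTLSConn{}
    go c.Write(make([]byte, 1)) // plays the role of the transport's writeLoop
    time.Sleep(100 * time.Millisecond)
    c.Close() // plays the role of CancelRequest -> closeLocked -> Close
}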
It's not clear to me why the write is getting stuck for so long, though; I guess that will be the next thing for me to take a look at.
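In the meantime, one mitigation that seems plausible is putting a write deadline on the underlying connection via a custom Dial, so a stalled write eventually errors out and releases the TLS write lock instead of holding it forever. A rough sketch; the deadlineConn wrapper and the timeout values are my own assumptions, not something net/http provides:

package httpclient

import (
    "net"
    "net/http"
    "time"
)

// deadlineConn pushes out a write deadline before every write so a stalled
// peer can't block a write indefinitely.
type deadlineConn struct {
    net.Conn
    writeTimeout time.Duration
}

func (c *deadlineConn) Write(p []byte) (int, error) {
    if err := c.Conn.SetWriteDeadline(time.Now().Add(c.writeTimeout)); err != nil {
        return 0, err
    }
    return c.Conn.Write(p)
}

func newTransport() *http.Transport {
    return &http.Transport{
        Dial: func(network, addr string) (net.Conn, error) {
            conn, err := net.DialTimeout(network, addr, 10*time.Second)
            if err != nil {
                return nil, err
            }
            // For https, net/http wraps this conn with tls.Client, so TLS
            // record writes (including the writeLoop's) go through
            // deadlineConn.Write and pick up the deadline.
            return &deadlineConn{Conn: conn, writeTimeout: 30 * time.Second}, nil
        },
    }
}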