Why am I getting "i/o timeout" for this scenario

7,107 views
Skip to first unread message

Davis Ford

unread,
Jul 8, 2015, 9:20:43 PM7/8/15
to golan...@googlegroups.com
I am seeing an issue for the code we have in production.  I am processing requests of a transactional nature.  I need to know if they have failed.  If they fail for a reason that can be retried, they should be retried (e.g. intermittent network failure).

I also need to limit them to a deadline and/or timeout.  

The problem is that I am at risk for generating a duplicate transaction -- which would be really bad in my case.  I suspect this may be happening, so I am really hoping for some outside insight on this issue.

Problem Summary: the http/net package is reporting a timeout error way too soon as far as I can tell considering how I setup the dialer/connection

I set the Connect timeout for the dialer to 1 second, and I set the Read/Write Deadline on the connection for 16 seconds.  The problem I am seeing at runtime is something like this:

log: 2015/07/09 00:13:44 some-file.go:147: POST {payload}

log: 2015/07/09 00:13:45 some-file.go:65: error is temporary=true, timed out=true for network error: Post https://somewhere.com/path: read tcp xx.xx.xx.xx:443: i/o timeout


Every once in a while, I post the payload, and 1 second later it reports an "i/o timeout".  


So, I have a transaction wrapper that will repost the payload if it receives a network error that is a timeout, or it indicates it is temporary (which it is in this case).  

Here is how I have configured the dialer, etc.

func timeoutDialer(config *Config) func(network, addr string) (c net.Conn, err error) {
   
return func(network, addr string) (net.Conn, error) {
     conn
, err := net.DialTimeout(network, addr, config.ConnectTimeout)
     
if err != nil {
       
return nil, err
     
}
     conn
.SetDeadline(time.Now().Add(config.ReadWriteTimeout))
     
return conn, nil
   
}
}


client
:= &http.Client{
   
Transport: &http.Transport{
     
Proxy: nil,
     
Dial:  timeoutDialer(config),
     
TLSClientConfig: &tls.Config{
       
InsecureSkipVerify: insecure,
     
},
   
},
}

So, I'm either misunderstanding how to properly set the deadline/timeout to get the desired effect, or I'm misunderstanding how to interpret a network error event that indicates it was a timeout -- or something else is going wrong.

I can understand how a remote endpoint could do a TCP RESET -- but that should just close the connection and I would not expect the network stack to say this is an "i/o timeout" -- I would consider that an incorrect categorization.

All three are really not good for me right now, as it appears I am at risk for generating duplicate transactions -- which as I stated before...is not a good thing.

Any insight is greatly appreciated.  

Using Golang 1.4.x on Ubuntu 14.04


Davis Ford

unread,
Jul 8, 2015, 9:30:37 PM7/8/15
to golan...@googlegroups.com
I should add one more piece of information.

One might come to the quick conclusion that it is indeed the 1 second connection timeout that is being tripped here.  I thought so too.  Except that one of my remote endpoints has their own duplicate detection mechanism built in, and it seems when this scenario happens:

log: 2015/07/09 00:13:44 some-file.go:147: POST {payload}
log: 2015/07/09 00:13:45 some-file.go:65: error is temporary=true, timed out=true for network error: Post https://somewhere.com/path: read tcp xx.xx.xx.xx:443: i/o timeout
log: 2015/07/09 00:13:46 some-file.go:33: retry (1/3) because network error read tcp xx.xx.xx.xx:443: i/o timeout
log: 2015/07/09 00:13:46 some-file.go:33: POST {payload}
log: 2015/07/09 00:13:47 some-file.go:66: error remote endpoint detected duplicate for {payload}

If it were indeed a connect timeout, then I would expect in the above scenario that we would only ever send one payload to the remote endpoint.  Why would the network stack report a connect timeout to the caller of the api, then suddenly connect, and push the payload out?  I can't imagine it behaving this way.

So, my only rational conclusion is that the R/W deadline is somehow being immediately tripped - despite being set for 16s, and it isn't the connect timeout.

Or I'm somehow misunderstanding how the client uses a dialer and re-uses connections.  Perhaps it caches connections, so all deadlines end up expiring pretty quickly.

I'm hoping that is the problem.

On Wednesday, July 8, 2015 at 9:20:43 PM UTC-4, Davis Ford wrote:

James Bardin

unread,
Jul 8, 2015, 10:09:05 PM7/8/15
to golan...@googlegroups.com
Don't set a deadline on a connection after Dial. The Transport will try to reuse that connection, and it will get closed when you don't expect it to.

Use the timeout settings in the http Client or Transport to timeout requests.

Davis Ford

unread,
Jul 8, 2015, 11:02:06 PM7/8/15
to golan...@googlegroups.com
Thanks James -- after writing this up, I sorta came to the same conclusion.  The transport caches the connections and the deadline expires fairly quickly.  I have instead employed the client.Timeout "nuclear option" instead of the custom dialer, and it is looking good so far.  We execute a lot of transactions so I suspect I would have seen the problem likely repeat itself by now.
Reply all
Reply to author
Forward
0 new messages