DNS lookup timeouts could return net.DNSError instead of poll.TimeoutError

664 views
Skip to first unread message

thereal...@gmail.com

unread,
May 19, 2020, 7:27:56 PM5/19/20
to golang-nuts
Hi,

I get DNS lookup timeouts many times a day when running Go HTTP clients on Kubernetes, where UDP packets are sometimes dropped due to a race condition in the Linux kernel. I'd like to propose a small standard library change to help debug these scenarios and get some feedback on the idea and implementation here before posting it as a Github issue.

For any http.Client use with a timeout set on the transport, when the timeout is exceeded and an error is returned it can be challenging to discover the cause of the timeout from the error message. Is it DNS related or TCP related or something else? Consider the following simulation of a DNS lookup that does not return within the timeout.

package main
  
import (
    "context"
    "net"
    "net/http"
    "time"
)

var timeout = 50 * time.Millisecond
var host = "http://example.com/"

// delayedDialer introduces a delay when performing DNS lookups.
func delayedDialer(ctx context.Context, network, address string) (net.Conn, error) {
    time.Sleep(2 * timeout)
    d := net.Dialer{}
    return d.DialContext(ctx, network, address)
}

func main() {
    r := &net.Resolver{
        PreferGo: true,
        Dial:     delayedDialer,
    }
    tr := &http.Transport{
        DialContext: (&net.Dialer{
            Timeout:  timeout,
            Resolver: r,
        }).DialContext,
    }
    client := &http.Client{Transport: tr}
    _, err := client.Get(host)
    if err != nil {
        panic(err)
    }
}

In the event of a DNS lookup timeout, the output is:

panic: Get "http://example.com/": dial tcp: i/o timeout

In the event of a TCP connection timeout (comment out the time.Sleep() call) the output is:

panic: Get "http://example.com/": dial tcp 93.184.216.34:80: i/o timeout

The missing IP address in the DNS lookup timeout output is the only clue that the root cause is a DNS problem, rather than a TCP problem. If users do not expect to see an IP address in that error message, they won't know it is missing, and their debugging and testing effort may be wasted investigating possible TCP-related causes, instead of investigating DNS (which normally runs over UDP).

The error message could communicate the root cause to users more clearly. After a DNS lookup timeout, the error returned by client.Get() looks like this in Go 1.14.3:

&url.Error{
  Op:  "Get",
  Err: &net.OpError{
    Op:     "dial",
    Net:    "tcp",
    Source: nil,
    Addr:   nil,
    Err:    &poll.TimeoutError{},
  },
}

Note: The &poll.TimeoutError{} will become &net.timeoutError{} in future releases after commit d422f54.

The resolver in net/lookup.go is the source of the *poll.TimeoutError. Instead it could return a *net.DNSError error when a lookup times out. Doing so would provide additional error context for all use cases (not only http.Client) and may save debugging time. For example client.Get() could return:

&url.Error{
  Op:  "Get",
  Err: &net.OpError{
    Op:     "dial",
    Net:    "tcp",
    Source: nil,
    Addr:   nil,
    Err:    &net.DNSError{
      Err:         "i/o timeout",
      Name:        "example.com",
      Server:      "",
      IsTimeout:   true,
      IsTemporary: false,
      IsNotFound:  false,
    },
  },
}

This would preserve the existing "i/o timeout" string and the ability to call err.Timeout(), while adding "lookup example.com: " to the Error() output and the capability to explicitly check for a *net.DNSError. The output from my simulator code would also be more explicit:

panic: Get "http://example.com/": dial tcp: lookup example.com: i/o timeout

The following change to implement this works in my usage and passes the tests in all.bash.

$ git diff
diff --git a/src/net/lookup.go b/src/net/lookup.go
index 5f7119872a..b1a22dd4e6 100644
--- a/src/net/lookup.go
+++ b/src/net/lookup.go
@@ -314,6 +314,9 @@ func (r *Resolver) lookupIPAddr(ctx context.Context, network, host string) ([]IP
                        }()
                }
                err := mapErr(ctx.Err())
+               if t, ok := err.(timeout); ok && t.Timeout() {
+                       err = &DNSError{Err: err.Error(), Name: host, IsTimeout: true}
+               }
                if trace != nil && trace.DNSDone != nil {
                        trace.DNSDone(nil, false, err)
                }

I can submit a Github issue and pull request if anyone thinks that this is a good idea and implementation.

Thanks.

Ian Lance Taylor

unread,
May 19, 2020, 7:49:57 PM5/19/20
to thereal...@gmail.com, golang-nuts
I don't know if that is the right implementation, but what you say
sounds plausible, and I encourage you to open an issue. Thanks.

Ian

thereal...@gmail.com

unread,
May 20, 2020, 1:55:08 PM5/20/20
to golang-nuts
Thanks Ian. I've raised #39178.
Reply all
Reply to author
Forward
0 new messages