
Golang DNS lookup tracing


Cipov Peter

May 9, 2025, 12:40:57 PM
to golang-nuts
Hello Community

I have a question regarding the native Go DNS lookup, as my app is compiled statically (CGO_ENABLED=0). For some reason this setup behaves unpredictably: sometimes (a few times a day) a DNS lookup takes >2s. I am using httptrace to get this number. I have tried to look into the code to see whether there is a way to drill down into those 2s (no luck so far). My use case is a quick HTTP call to an integration (optimal total request time < 500ms).

using
GODEBUG="netdns=2"
CGO_ENABLED=0
GOOS=linux
GOARCH=amd64

running in Docker (Debian bookworm) as a k8s pod

logs:
go package net: confVal.netCgo = false netGo = false
go package net: cgo resolver not supported; using Go's DNS resolver

I have checked the source code, but I have not seen much tracing information about why these DNS spikes occur. Did I miss some option to get insight into why a DNS lookup takes so long? I cannot distinguish whether it is waiting on a network call or on some internal timeout.
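
For reference, this is roughly how I take the measurement today (simplified sketch; the URL and error handling are placeholders, not the real call site):

package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	var dnsStart time.Time
	trace := &httptrace.ClientTrace{
		DNSStart: func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone: func(info httptrace.DNSDoneInfo) {
			// duration of the DNS phase as seen by the HTTP client
			log.Printf("dns took %s err=%v coalesced=%v",
				time.Since(dnsStart), info.Err, info.Coalesced)
		},
	}

	ctx := httptrace.WithClientTrace(context.Background(), trace)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://integration.example.com/ping", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}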

Thank you

Brian Candler

May 9, 2025, 1:38:56 PM
to golang-nuts
I suspect if you tcpdump/wireshark the DNS traffic, you'll find a query goes out, and either the response is delayed by 2 seconds, or no response is received and your client re-sends the request.

To understand this, inside your pod you'll need to find out what your upstream DNS recursive server is. This might be `cat /etc/resolv.conf`, but if it's using systemd for resolution it could be `resolvectl status` or the like. And then you need to work out what's going on upstream.
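
If you want to see those round trips from inside the program rather than with tcpdump, you can also point a net.Resolver at the upstream explicitly and log every connection it opens, so retries become visible. Untested sketch; the server address and hostname are placeholders (use whatever your pod's /etc/resolv.conf points at):

package main

import (
	"context"
	"log"
	"net"
	"time"
)

func main() {
	const upstream = "10.96.0.10:53" // replace with the nameserver from /etc/resolv.conf

	r := &net.Resolver{
		PreferGo: true,
		// Dial is called for each query attempt, so you see every
		// connection the Go resolver opens to the DNS server.
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			log.Printf("resolver dialing %s %s", network, upstream)
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, upstream)
		},
	}

	start := time.Now()
	addrs, err := r.LookupHost(context.Background(), "example.com") // the name you're resolving
	log.Printf("lookup took %s addrs=%v err=%v", time.Since(start), addrs, err)
}

If Dial is logged more than once for a single lookup, that's the client retrying after a lost or late reply.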

You should note that a 2 second delay a few times per day for DNS resolution is not unusual. There are lots of reasons. It could be as simple as some network packet loss between your k8s server and your DNS recursor (since DNS is usually sent over UDP, and UDP does not guarantee delivery). Just one lost packet can cause a 1-2 second delay, depending on what the client's retransmission policy is.
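
You can also get a feel for how often these spikes happen, independently of your HTTP calls, by running a trivial prober next to the app. Untested sketch; substitute the real hostname and whatever threshold you care about:

package main

import (
	"context"
	"log"
	"net"
	"time"
)

func main() {
	const host = "example.com" // the integration hostname
	const slow = 500 * time.Millisecond

	// One lookup per second through the same in-process Go resolver;
	// log anything slower than the threshold or that errors out.
	for range time.Tick(time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		start := time.Now()
		_, err := net.DefaultResolver.LookupHost(ctx, host)
		if d := time.Since(start); d > slow || err != nil {
			log.Printf("slow lookup: %s err=%v", d, err)
		}
		cancel()
	}
}

If the prober shows spikes at the same times as your app, the problem is the path to the recursor or the recursor itself, not your application.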

However, a more likely explanation is this: the record has expired from the cache in the DNS recursor. When it next gets a query for this expired name, the recursive DNS server needs to locate the upstream authoritative DNS servers for that domain. If the one it chooses first is down, it will time out and retry with a different one. Furthermore, it also needs to resolve the *names* of the authoritative servers (from NS records) into addresses, and if those have expired, there can be delays with that too. A delay of several seconds for all this is quite common.

This is just life: many DNS domains are broken in this way, because people don't know how to delegate properly or run their authoritative nameservers properly. If you tell us the actual domain you're querying, maybe we can identify the problem with the domain - but you'll have to get the domain owner to fix it.

As a sticking-plaster over the problem: if you run your own DNS recursor with suitable software, then you can get it to refresh the record *before* it expires. In powerdns-recursor this is controlled by refresh-on-ttl-perc. Bind calls it "prefetch". (Other nameserver software may or may not have this feature).

At the end of the day though, DNS issues are not related to the Go programming language. 

Brian Candler

May 9, 2025, 1:46:21 PM
to golang-nuts
I will just add: this is a very common problem with domains which return a very low TTL, and/or use funky dynamic responses like geo-load balancing.

These companies think that it makes their infrastructure more reliable, because the short TTL allows them to change the address quickly if required.  In practice, it makes their service way *less* reliable.
