blackbox_exporter tcp probe resolves names via TCP protocol


terrible person

Aug 24, 2022, 11:56:57 PM
to Prometheus Users
Hi. I'm currently debugging DNS lookup warnings (more than 3 s) and need to figure out whether our network, our DNS, or the exporter is misbehaving.
So I'm checking SSH endpoints with the tcp module:

2022-08-25_13-22-13.png

but I experience a 3+ second delay resolving the SSH hostnames, which triggers the DNSLookupDuration3s alert.

2022-08-25_13-26-19.png 
The problem looks like this on different hosts: 3.0+ seconds of delay, which looks very much like a generic TCP timeout.

I checked on the DNS server and yes, after the UDP queries there is a TCP DNS query for the A record. I don't see any UDP checksum corruption or delays that would explain such a failover. Is this intended? Can someone help me out with this?

Ben Kochie

Aug 25, 2022, 12:03:27 AM
to terrible person, Prometheus Users
DNS lookups will switch to TCP if the response is larger than fits in a single UDP packet. But that fallback should happen immediately.
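The fallback Ben describes is driven by the TC (truncated) bit in the DNS header: a server that can't fit the answer in UDP sets TC=1, and the client retries over TCP. A small Go sketch of how a client detects it (illustrative only, not blackbox_exporter's actual code):

```go
package main

import "fmt"

// truncated reports whether a raw DNS message has the TC (truncated) bit
// set. A resolver that sees TC=1 on a UDP response retries the query over
// TCP. Real resolvers use a full DNS library; this only inspects the header.
func truncated(msg []byte) bool {
	if len(msg) < 12 { // the DNS header is 12 bytes
		return false
	}
	return msg[2]&0x02 != 0 // TC is bit 0x02 of the third header byte
}

func main() {
	// Minimal 12-byte header: ID=0x1234, QR=1 (response), TC=1.
	resp := []byte{0x12, 0x34, 0x82, 0x00, 0, 0, 0, 0, 0, 0, 0, 0}
	fmt.Println(truncated(resp)) // true: the client should retry over TCP
}
```

If tcpdump shows no TC bit on the UDP response, an immediate TCP retry would be unexpected, which is relevant later in this thread.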



--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/16d8f137-a7e3-4361-a624-6719d71b1d29n%40googlegroups.com.

terrible person

Aug 25, 2022, 1:34:58 AM
to Prometheus Users
Thank you; actually I found out about this behaviour just after I posted here.
Strangely, I don't see TCP connections with either nslookup or dig, even though the response is about 860 bytes; only outgoing UDP traffic is present. When I probe with blackbox, there is also TCP.

How does blackbox perform such probes, in parallel or sequentially? And is there a way to suppress this behaviour, analogous to dig's +notcp option?

Ben Kochie

Aug 25, 2022, 3:34:37 AM
to terrible person, Prometheus Users
The blackbox_exporter uses the built-in Go resolver library[0]. The only option it exposes is which address family you want returned.
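For illustration, here is roughly how a preferred address family could map onto the network argument of Go's net.Resolver.LookupIP ("ip", "ip4", or "ip6"). The helper name and the fallback flag are hypothetical, not blackbox_exporter's real code:

```go
package main

import "fmt"

// lookupNetwork maps a blackbox-style preferred_ip_protocol value to the
// network argument accepted by Go's net.Resolver.LookupIP. This is a
// hypothetical sketch of the idea, not the exporter's actual implementation.
func lookupNetwork(preferred string, fallback bool) string {
	if fallback {
		return "ip" // let the resolver return either family
	}
	switch preferred {
	case "ip4":
		return "ip4" // IPv4 addresses only
	case "ip6":
		return "ip6" // IPv6 addresses only
	default:
		return "ip" // no preference configured
	}
}

func main() {
	fmt.Println(lookupNetwork("ip4", false)) // ip4
	fmt.Println(lookupNetwork("ip6", true))  // ip
}
```

Note there is no knob at this level for UDP vs. TCP transport; that choice stays inside the resolver, which is why a dig-style +notcp switch isn't available.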


Brian Candler

Aug 25, 2022, 4:40:23 AM
to Prometheus Users
What is this "DNSLookupDuration3s" you talk about?  Is it an alerting rule?  Can you show the expr?

To me, it sounds like the opposite problem. My guess is that blackbox_exporter first makes a UDP DNS query, and either the query or the response is being blocked, so after 3 seconds it retries with TCP, and that succeeds.

You can check this theory using tcpdump (especially if you can also run tcpdump on the caching resolver). Do you see an outbound UDP DNS query but no response? The resolution then is to fix the underlying UDP communication problem.

Are there any virtual machines involved in this? That's the one case where I have seen this exact problem before, with UDP traffic but not TCP. The packet is sent without a correct UDP checksum, because checksum offloading is enabled and the client expects the NIC to insert a correct one; but the receiver doesn't know this, and just sees a packet with a bad checksum and discards it.

The solution, or at least a workaround, is to disable UDP transmit checksum offloading on the VM's network interface (probably just the one running blackbox_exporter).

Try:
    ethtool --offload eth0 tx off

and if that doesn't work, also try:
    ethtool --offload eth0 gso off gro off tso off

terrible person

Aug 25, 2022, 6:36:59 AM
to Prometheus Users
1) 2022-08-25_20-10-14.png

2) I was checking with tcpdump. I don't know if this squares with your theory, because the client (blackbox) sends a SYN immediately after receiving the "large" UDP packet. As I said, I don't see this behaviour with dig, nor do I see the truncated flag; the UDP response from the server is 860 bytes. My hypothesis is that the DNS server is getting clogged by the number of TCP requests (more than 100 hosts) and resets some of them; then there is a 3 s TCP timeout, followed by a successful retry on a new connection. I will check for RST flags from port 53 on the DNS server host tomorrow.

3) Yep, this is something I learned today. I was reading this article, but I'm not sure about it. As I understood it, you see incorrect checksums with tcpdump because of this:
2022-08-25_20-27-13.png
but it has no effect on the actual traffic. I observed that tcpdump shows incorrect checksums for outgoing UDP traffic, while the receiver shows the checksums as fine. Maybe I can attach some dumps later.
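That observation is consistent with transmit offloading: the kernel leaves the checksum field for the NIC to fill in, so a capture taken before the NIC looks "incorrect" even though the packet is correct on the wire. The checksum in question is the RFC 1071 16-bit one's-complement sum; a small Go sketch of the computation, using the well-known IPv4 header example with the checksum field zeroed:

```go
package main

import "fmt"

// internetChecksum computes the RFC 1071 16-bit one's-complement checksum
// used by IPv4, UDP and TCP: sum the data as big-endian 16-bit words,
// fold the carries back in, and complement the result.
func internetChecksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(b[i])<<8 | uint32(b[i+1])
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8 // pad a trailing odd byte
	}
	for sum>>16 != 0 {
		sum = sum>>16 + sum&0xffff // fold carries
	}
	return ^uint16(sum)
}

func main() {
	// Well-known IPv4 header example; bytes 10-11 (the checksum) are zeroed,
	// as the kernel leaves them when offloading is enabled.
	hdr := []byte{
		0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00,
		0x40, 0x11, 0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01,
		0xc0, 0xa8, 0x00, 0xc7,
	}
	fmt.Printf("%#04x\n", internetChecksum(hdr)) // 0xb861
}
```

tcpdump validates this sum against the bytes it captured; with offloading, the field is still zero (or a partial pseudo-header sum) at capture time, hence the warning.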

So for now I see some ways to overcome this:

1) Somehow shrink the DNS response (AUTHORITY SECTION and ADDITIONAL SECTION), though I don't know if I can do that (I'm using FreeIPA).
2) Make changes on the client side: either custom changes to blackbox itself, or architectural changes to spread the probing load across DNS servers.

don't know, hard stuff

Ben Kochie

Aug 25, 2022, 7:25:08 AM
to terrible person, Prometheus Users
For many reasons, I've been deploying node local DNS caching for production servers for a while now.

I can highly recommend CoreDNS for this. It should also provide good metrics as to the behavior of your central resolvers.
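For concreteness, a node-local CoreDNS cache along these lines is a minimal Corefile sketch; the upstream resolver addresses are placeholders, and cache TTLs should match your environment:

```
.:53 {
    # forward everything to the central resolvers (placeholder addresses)
    forward . 10.0.0.1 10.0.0.2
    # cache answers locally for up to 300 seconds
    cache 300
    # expose Prometheus metrics about hits, misses and upstream latency
    prometheus :9153
    errors
}
```

With this in place, blackbox_exporter's lookups hit 127.0.0.1 first, and the coredns_* metrics show how the central resolvers are behaving.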

Brian Candler

Aug 25, 2022, 10:15:35 AM
to Prometheus Users
On Thursday, 25 August 2022 at 11:36:59 UTC+1 melee.j...@gmail.com wrote:
1) 2022-08-25_20-10-14.png

2) I was checking with tcpdump. I don't know if this squares with your theory, because the client (blackbox) sends a SYN immediately after receiving the "large" UDP packet.

Does tcpdump decode this "large" udp response?  Is it a valid DNS packet?

If the resolver switches to TCP immediately (which it is entitled to do), this raises the question of where the 3-second delay is coming from. Again, more tcpdump analysis may be required.

terrible person

Aug 28, 2022, 7:02:55 AM
to Brian Candler, Prometheus Users
I fixed the problem with the BIND option
    minimal-responses yes;
There was saturation of the DNS server's TCP accept queue (BIND's default is 10). Nothing to do with blackbox in general, I guess. Thanks to Ben Kochie for the insight on DNS response size.
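Both knobs mentioned here live in the options block of named.conf; a minimal sketch, assuming BIND 9 (the queue depth of 128 is an illustrative value, not a recommendation):

```
options {
    // Shrink responses so most answers fit in UDP and no TCP retry is
    // needed: omit authority/additional records unless required.
    minimal-responses yes;

    // Related knob: depth of the TCP listen backlog (default 10), which
    // was saturating under the probe load described above.
    tcp-listen-queue 128;
};
```

After changing named.conf, run named-checkconf and reload the server to apply it.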
