BIND and UDP tuning

Alex

Sep 26, 2018, 12:52:47 PM

to bind-...@lists.isc.org

Hi,

I reported a few weeks ago that I was experiencing a really high
number of "SERVFAIL" messages in my bind-9.11.4-P1 system running on
fedora28, and I haven't yet found a solution. This is all now running
on a 165/35 cable system.

I found a program named dropwatch which is showing a significant
number of dropped UDP packets, particularly when there are bursts of
email traffic:

12 drops at skb_queue_purge+13 (0xffffffff9f79a0c3)
1 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
4 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
5 drops at nf_hook_slow+a7 (0xffffffff9f7faff7)
3 drops at sk_stream_kill_queues+48 (0xffffffff9f7a1158)
3 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
...

# netstat -us
...
Udp:
    23449482 packets received
    1724269 packets to unknown port received
    8248 packet receive errors
    31394909 packets sent
    8243 receive buffer errors
    0 send buffer errors
    InCsumErrors: 5
    IgnoredMulti: 43247
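
To put those UDP counters in proportion, here is a minimal Python sketch that pulls the "Udp:" section out of `netstat -us` text and computes the share of receive buffer errors. The function name `parse_udp_counters` is made up for illustration, and the layout it expects is simply the one shown in this post; other netstat versions may format the section differently.

```python
import re

def parse_udp_counters(netstat_text):
    """Extract the counters from the 'Udp:' section of `netstat -us` output."""
    counters = {}
    in_udp = False
    for raw in netstat_text.splitlines():
        line = raw.strip()
        if line.endswith(":") and " " not in line:
            # Section header such as "Udp:" or "UdpLite:".
            in_udp = (line == "Udp:")
            continue
        if not in_udp or not line:
            continue
        m = re.match(r"(\d+) (.+)$", line)       # e.g. "8243 receive buffer errors"
        if m:
            counters[m.group(2)] = int(m.group(1))
            continue
        m = re.match(r"(\w+): (\d+)$", line)     # e.g. "InCsumErrors: 5"
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

# Sample copied from the netstat output in this post.
SAMPLE = """\
Udp:
    23449482 packets received
    1724269 packets to unknown port received
    8248 packet receive errors
    31394909 packets sent
    8243 receive buffer errors
    0 send buffer errors
    InCsumErrors: 5
    IgnoredMulti: 43247
"""

counters = parse_udp_counters(SAMPLE)
ratio = counters["receive buffer errors"] / counters["packets received"]
print(f"receive buffer errors: {ratio:.4%} of packets received")
```

On these figures the receive buffer errors come to about 0.035% of packets received since boot. Because the failures arrive in bursts, sampling the counters periodically and comparing the deltas is likely to be more telling than the cumulative totals.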

The SERVFAIL messages don't necessarily correspond to the UDP packet
errors shown by netstat, but the dropwatch output is continuous. The
netstat packet receive errors also don't seem to correspond to
"SERVFAIL" or "Name service" errors:

26-Sep-2018 12:42:49.743 query-errors: info: client @0x7fb3c41634d0
127.0.0.1#44104 (46.36.47.104.wl.mailspike.net): query failed
(SERVFAIL) for 46.36.47.104.wl.mailspike.net/IN/A at
../../../bin/named/query.c:8580

Sep 26 12:47:11 mail03 postfix/dnsblog[22821]: warning: dnsblog_query:
lookup error for DNS query 196.91.107.80.bl.spameatingmonkey.net: Host
or domain name not found. Name service error for
name=196.91.107.80.bl.spameatingmonkey.net type=A: Host not found, try
again

I've been following this thread from some time ago, but nothing I've
done has made a difference. I really don't know what the buffer sizes
should be.
http://bind-users-forum.2342410.n4.nabble.com/Tuning-suggestions-for-high-core-count-Linux-servers-td3899.html

Are there specific bind tunables you might recommend? edns-udp-size, perhaps?

Any ideas on other tunables such as net.core.*mem_default etc?

Browne, Stuart

Sep 26, 2018, 10:12:37 PM

to Alex, bind-...@lists.isc.org

> http://bind-users-forum.2342410.n4.nabble.com/Tuning-suggestions-for-high-core-count-Linux-servers-td3899.html

>
> Are there specific bind tunables you might recommend? edns-udp-size,
> perhaps?
>
> Any ideas on other tunables such as net.core.*mem_default etc?

*chuckles to self*

I was just referring back to that thread myself to try to remember what I did.

I ended up tuning the following items:

- name: SYSCTL system tuning, basics
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    sysctl_set: yes
    state: present
  with_items:
    - { name: 'vm.swappiness', value: 0 }
    - { name: 'net.core.netdev_max_backlog', value: 32768 }
    - { name: 'net.core.netdev_budget', value: 2700 }
    - { name: 'net.ipv4.tcp_sack', value: 0 }
    - { name: 'net.core.somaxconn', value: 2048 }
    - { name: 'net.core.rmem_default', value: 16777216 }
    - { name: 'net.core.rmem_max', value: 16777216 }
    - { name: 'net.core.wmem_default', value: 16777216 }
    - { name: 'net.core.wmem_max', value: 16777216 }

(Yeah, I was using ansible for that testing!)

Checking /proc/net/softnet_stat is what drove some of those settings, so you may want to dig into that. I never did resolve the errors netstat was showing, though, so keep that in mind.
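
For readers who have not looked at /proc/net/softnet_stat before: it has one row per CPU, with hexadecimal columns. A minimal Python sketch for decoding the first three columns follows. The function name `parse_softnet_stat` and the sample rows are invented for illustration, and the column meanings given in the comments are the commonly documented ones, so check them against your kernel version.

```python
def parse_softnet_stat(text):
    """Parse /proc/net/softnet_stat text: one row per CPU, hexadecimal columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        rows.append({
            "row": len(rows),                     # normally corresponds to the CPU index
            "processed": int(fields[0], 16),      # packets handled by this CPU
            "dropped": int(fields[1], 16),        # backlog queue full (see netdev_max_backlog)
            "time_squeeze": int(fields[2], 16),   # softirq ran out of budget (see netdev_budget)
        })
    return rows

# Invented sample rows, NOT taken from any system in this thread.
SAMPLE = """\
0000a1b2 00000003 0000000f 00000000
00000010 00000000 00000001 00000000
"""

rows = parse_softnet_stat(SAMPLE)
for row in rows:
    print(row)
```

On a live system the text would come from reading /proc/net/softnet_stat. A growing second column suggests the backlog setting is too small, and a growing third column suggests the budget is being exhausted.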

If you are running high query throughput and have many CPU cores, the pinning of cores was a significant performance improvement.

You haven't said what sort of query throughput you are seeing, however. Be aware that if this is running in a virtualized environment, you may want to look at the host machine instead of the guest, as the network performance there can have a significant impact.

Whilst mentioned only in passing on that thread, there was also poking around with TOE, pause, adaptive coalescing and ring size settings (look at ethtool -K, ethtool -A, ethtool -C and ethtool -G), but sadly I have lost the specific commands.

Stuart Browne
Neustar, Inc. / Sr Systems Admin
Level 8, 10 Queens Road, Melbourne, Australia VIC 3004
Office: +61.3.9866.3710
stuart...@team.neustar / home.neustar


Tony Finch

Sep 27, 2018, 3:28:30 AM

to Browne, Stuart, Alex, bind-...@lists.isc.org

Browne, Stuart via bind-users <bind-...@lists.isc.org> wrote:

> - { name: 'net.ipv4.tcp_sack', value: 0 }

Why? SACK is super important for TCP performance over links that have any
degree of lossiness, and I don't recall hearing of any caveats.

Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/
a just distribution of the rewards of success

Browne, Stuart

Sep 27, 2018, 4:01:27 AM

to bind-...@lists.isc.org

> -----Original Message-----
> From: Tony Finch [mailto:d...@dotat.at]
>
> > - { name: 'net.ipv4.tcp_sack', value: 0 }
>
> Why? SACK is super important for TCP performance over links that have any
> degree of lossiness, and I don't recall hearing of any caveats.
>
> Tony.
> --
> f.anthony.n.finch <d...@dotat.at>

If I recall correctly, it had to do with the fact that we were in a very-network-close test environment with very-small packets so it wasn't necessary to even consider resends. I don't recall whether it did anything at all to the results; it is just one of the various things I stuck into the blender in order to see if it made a difference and was still in at the end of testing. The number of test iterations I went through was in the hundreds and most of it was "Moar! MOAR!" rather than good arguments; more about proving a design could reach a theoretical limit than whether it would be 100% stable in production.

The environment design that these tests were preparing for hasn't been implemented yet; that's what I'm working on over the next few weeks, so I'll be going over these settings with kid gloves and being a little gentler, as we don't need a single location churning out 2.5M qps; we're quite happy with 2M.

Let's hear it for overkill!

Stuart

Alex

Sep 27, 2018, 10:53:50 AM

to Stuart...@team.neustar, bind-...@lists.isc.org

> - { name: 'net.ipv4.tcp_sack', value: 0 }

> - { name: 'net.core.somaxconn', value: 2048 }
> - { name: 'net.core.rmem_default', value: 16777216 }
> - { name: 'net.core.rmem_max', value: 16777216 }
> - { name: 'net.core.wmem_default', value: 16777216 }
> - { name: 'net.core.wmem_max', value: 16777216 }

Were you troubleshooting the same problems as I'm experiencing?

Many of these values I've already tweaked and have had no effect on my
SERVFAIL issues :-(

I've also been following the performance tuning variables in this RH document:
https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf

These errors appear to occur in spurts - there is typically ten or
more in a row at a time, then any number of minutes/seconds before the
next one.

It looks like there are periods of as many as 500 queries per second,
although the usual amount is closer to 200 per second.

I don't believe this is a bind configuration problem, as the "Name
service error" errors from postfix also occur when testing with
unbound.

This is also only happening on the two identical systems connected to
the 165/35mbit cable modem. I've verified with Optonline, and they've
emphatically asserted there are no problems with the circuit. The
systems are 8-core Xeon E31240 with 16GB RAM. I've also tried other
systems, including a 12-core i7 with 32GB.

We have several other systems connected to a 10mbit DIA ethernet
circuit where these errors don't generally occur. They are also
similarly configured fedora systems with the same version of bind.

I'm really at a loss as to what the problem(s) are, but feel like it's
really impacting our ability to query RBLs for processing mail.

> Whilst mentioned in passing on that thread, there was also poking around with TOE, pause, coalesce adaptive and ring size settings (look at ethtool -K, ethtool -A, ethtool -C and ethtool -G), but sadly have lost the specific commands.

I've also tried configuring the NIC with ethtool according to the
variables defined in the RH document listed above and have had no
success.

This really is just a stock system. I can't believe these problems
would be so elusive or uncommon. Could it have to do with some
characteristic of the cable circuit itself?

I've also experimented with QoS, using tc to prioritize interactive
traffic, including tcp and udp port 53, with plenty of bandwidth.

I really hope there is someone with some additional ideas.
Thanks,
Alex

Sten Carlsen

Sep 27, 2018, 11:04:19 AM

to Alex, bind-...@lists.isc.org

Just a wild thought:
It works with a lower speed line (at least I read it that way) but has problems with higher speeds.
Could it be that the line is so fast that it "overtakes" the host in question?

A faster incoming line will give less time between the packets for processing.


_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-...@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Mukund Sivaraman

Sep 27, 2018, 11:38:15 AM

to Alex, Stuart...@team.neustar, bind-...@lists.isc.org

On Thu, Sep 27, 2018 at 10:53:25AM -0400, Alex wrote:
> Many of these values I've already tweaked and have had no effect on my
> SERVFAIL issues :-(

If you are getting SERVFAILs from a BIND resolver you administer, then
it has responded to your query. If you turn up the log level to
something like -d 99, it'll print the steps that led to that SERVFAIL.
Usually you'll find something there that directs you to next steps.

On this topic, my home resolver also runs a stock packaged BIND version,
like yours, and I too see spurious SERVFAILs sometimes. I used to think this
was due to too much indirection, e.g., when named starts up and you run:

dig -x 176.9.81.50

on a cold cache. However it seems to be returning SERVFAIL sometimes for
what should be a cached answer. I'll also turn up the debug logging and
watch it.

Mukund

G.W. Haywood

Sep 27, 2018, 12:02:22 PM

to bind-...@lists.isc.org

Hi there,

On Thu, 27 Sep 2018, Alex wrote

> This is also only happening on the two identical systems connected
> to the 165/35mbit cable modem.

> ...

> I really hope there is someone with some additional ideas.

Is it the modem?

--

73,
Ged.

Alex

Sep 27, 2018, 12:03:56 PM

to Stuart...@team.neustar, bind-...@lists.isc.org

Hi,

> On Thu, Sep 27, 2018 at 10:53:25AM -0400, Alex wrote:
> > Many of these values I've already tweaked and have had no effect on my
> > SERVFAIL issues :-(
>
> If you are getting SERVFAILs from a BIND resolver you administer, then
> it has responded to your query. If you turn up the log level to
> something like -d 99, it'll print the steps that led to that SERVFAIL.
> Usually you'll find something there that directs you to next steps.
>
> On this topic, my home resolver is also a stock packaged BIND version as
> you, and I too see spurious SERVFAILs sometimes. I used to think this
> was due to too much indirection, e.g., when named starts up and you run:
>
> dig -x 176.9.81.50

It doesn't typically happen when running from the command-line. It
does occasionally happen, though. I usually run something like "dig
+all +trace +nodnssec <hostname>". It sometimes times out in the
middle, with something like "cannot resolve xyz host", which may even
be one of the root servers.

I also typically run it with "rndc trace 11", which shows me quite a
bit of debugging info - too much to look through manually. With trace
99, I can imagine it being an overwhelming amount of info. Do you have
any ideas of what to look for? "query-errors"?

I also see other SERVFAIL errors that really are SERVFAIL errors
- when querying the host manually, it still responds immediately with
SERVFAIL.

Thanks,
Alex

Alex

Sep 27, 2018, 12:05:04 PM

to bi...@jubileegroup.co.uk, bind-...@lists.isc.org

Hi,

> > This is also only happening on the two identical systems connected
> > to the 165/35mbit cable modem.
> > ...
> > I really hope there is someone with some additional ideas.
>
> Is it the modem?

No, it's been replaced at least once, and I've been assured by both
the cable tech that was here and the dimwits on the other end that
it's operating normally. I really wish it were that easy.

Thanks,
Alex

>
> --
>
> 73,
> Ged.

Ben Croswell

Sep 27, 2018, 12:06:47 PM

to bind-...@lists.isc.org

When we ran into UDP tuning issues on high traffic devices it presented as silent discards rather than SERVFAIL.

Alex

Sep 27, 2018, 12:07:40 PM

to st...@s-carlsen.dk, bind-...@lists.isc.org

Hi,

> Just a wild thought:
> It works with a lower speed line (at least I read it that way) but has problems with higher speeds.
> Could it be that the line is so fast that it "overtakes" the host in question?
>
> A faster incoming line will give less time between the packets for processing.

No, I actually upgraded from a 65/20mbit to a 165/35mbit recently,
thinking it was too slow because it was happening at the slower speeds
as well. I've also implemented some basic QoS to throttle outgoing
smtp and prioritize DNS but it made no difference.

Thanks,
Alex

Noel Butler

Sep 27, 2018, 8:46:42 PM

to bind-...@lists.isc.org

Hi Alex,

Have you tried on a separate physical server? To rule out the actual hardware as being the problem?

Is this a consumer-grade PC with either an onboard or external ethernet interface, or proper server-grade equipment? Age of equipment? What else does that machine do?

Cheers


--

Kind Regards,

Noel Butler


Lee

Sep 28, 2018, 12:18:22 AM

to Alex, bind-...@lists.isc.org

On 9/27/18, Alex <mysqls...@gmail.com> wrote:
> Hi,
>
>> Just a wild thought:
>> It works with a lower speed line (at least I read it that way) but has
>> problems with higher speeds.
>> Could it be that the line is so fast that it "overtakes" the host in
>> question?
>>
>> A faster incoming line will give less time between the packets for
>> processing.
>
> No, I actually upgraded from a 65/20mbit to a 165/35mbit recently,
> thinking it was too slow because it was happening at the slower speeds
> as well. I've also implemented some basic QoS to throttle outgoing
> smtp and prioritize DNS but it made no difference.

Has your provider enabled qos? I'd bet their dropping packets that
exceed qos rate limits would be considered "working as expected".

Which brings up the question of exactly what does SERVFAIL mean? Can
no response to a query result in SERVFAIL? Is there a way to tell the
difference between no response & getting a response indicating a
failure?

Lee

Alex

Sep 28, 2018, 9:06:38 AM

to noel....@ausics.net, bind-...@lists.isc.org

Hi,

> Hi Alex,
>
> Have you tried on a separate physical server? To rule out the actual hardware as being the problem?
>
> Is this some user grade PC with either onboard or external ethernet interface, or a proper server grade equipment? Age of equipment? What else does that machine do?

This is a Xeon 8-core E31240 3.30GHz with 16GB. It's a few years old.
I've also recently tried with an i7 8700 with 32GB running the same
version of fedora28 with the same bind and had the same problem. I've
also mentioned previously that I've tried unbound and had the same
postfix "Name service error" error.

I believe this error is not a recent thing - it goes back in the logs
for as long as I can see, meaning into previous versions of postfix
and fedora and bind. I've only now started to notice it and the impact
that I'd imagine it's having on our ability to effectively use RBLs
and process mail.

This server does only mail/spam filtering with
postfix/amavis/spamassassin using bind. It's configured as a recursive
caching server and not otherwise authoritative for any of our domains.

I've recently tried to configure it with "edns no;" and/or
"edns-udp-size 512;" and it's had no effect.

Thanks so much for your help.
Alex

Alex

Sep 28, 2018, 9:26:25 AM

to ler...@gmail.com, bind-...@lists.isc.org

Hi,

On Fri, Sep 28, 2018 at 12:18 AM Lee <ler...@gmail.com> wrote:
>
> On 9/27/18, Alex <mysqls...@gmail.com> wrote:
> > Hi,
> >
> >> Just a wild thought:
> >> It works with a lower speed line (at least I read it that way) but has
> >> problems with higher speeds.
> >> Could it be that the line is so fast that it "overtakes" the host in
> >> question?
> >>
> >> A faster incoming line will give less time between the packets for
> >> processing.
> >
> > No, I actually upgraded from a 65/20mbit to a 165/35mbit recently,
> > thinking it was too slow because it was happening at the slower speeds
> > as well. I've also implemented some basic QoS to throttle outgoing
> > smtp and prioritize DNS but it made no difference.
>
> Has your provider enabled qos? I'd bet their dropping packets that
> exceed qos rate limits would be considered "working as expected".

I asked and they had no idea what that even meant. The technician that
was here replacing the modem also had no idea outside of what the
hardware does.

I've also asked on dslreports about this, and no one answered.

It certainly seems to be more pronounced now than it ever was in the
past. Sometimes so many queries are failing that it's impossible to
use the network.

> Which brings up the question of exactly what does SERVFAIL mean? Can
> no response to a query result in SERVFAIL? Is there a way to tell the
> difference between no response & getting a response indicating a
> failure?

Early in this thread or another, I provided a packet trace that
appeared to show the replies were never received - the query just times
out. Also, the "Server Failure" messages are always on the loopback
interface. I'd be happy to provide another trace if someone knows how
to properly read it. I really have no idea what's causing the problem.

Also, I recently raised the trace level to 99, but I don't see
anything in the logs beyond level 4. Where do I find what the
different trace levels are supposed to report?

27-Sep-2018 16:57:29.688 query-errors: info: client @0x7fc7b0169ac0
127.0.0.1#31675 (72.212.15.199.backscatter.spameatingmonkey.net):
query failed (SERVFAIL) for
72.212.15.199.backscatter.spameatingmonkey.net/IN/A at
../../../bin/named/query.c:8580
26-Sep-2018 15:16:32.507 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for
b74c2d3722fbce8841edc1808ea0a31e.ix.dnsbl.manitu.net/A in 30.000092:
timed out/success
[domain:manitu.net,referral:0,restart:5,qrysent:17,timeout:16,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]

There are also tons of messages involving disabling EDNS:
27-Sep-2018 16:57:29.549 edns-disabled: debug 1: success resolving
'232.123.75.208.dnsbl-3.uceprotect.net/A' (in
'dnsbl-3.uceprotect.net'?) after disabling EDNS

I've also just installed 'netdata', which is an app that reports on
system parameters, and find it frequently reporting messages like:
ipv4 tcp listen overflows = 4 overflows
inbound packets dropped = 22 packets
ipv4 udp receive buffer errors = 184 errors

I've also now made the following buffer adjustments based on this and
other perf tuning docs:
https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
net.core.rmem_default = 8388608
net.core.rmem_max = 33554432
net.core.wmem_default = 52428800
net.core.wmem_max = 134217728
net.ipv4.udp_early_demux = 0
net.ipv4.udp_mem=764304 1019072 1528608
net.ipv4.tcp_rmem=16384 349520 16777216
net.core.rmem_max=16777216
net.ipv4.udp_rmem_min = 18192
net.ipv4.udp_wmem_min = 8192
net.core.netdev_budget = 10000
net.core.netdev_max_backlog = 2000
net.core.netdev_max_backlog=100000
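
Note that net.core.rmem_max and net.core.netdev_max_backlog each appear twice in that list with different values. A minimal Python sketch for spotting such overrides follows. It assumes all the lines live in a single file that is applied top to bottom, so that the last assignment wins, and the helper name `effective_sysctls` is made up for illustration.

```python
def effective_sysctls(lines):
    """Return (effective, overridden); the last value wins for a repeated key."""
    effective = {}
    overridden = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in effective and effective[key] != value:
            # Record (key, earlier value, later value) for each conflicting repeat.
            overridden.append((key, effective[key], value))
        effective[key] = value
    return effective, overridden

# The settings exactly as listed in this post.
SETTINGS = """\
net.core.rmem_default = 8388608
net.core.rmem_max = 33554432
net.core.wmem_default = 52428800
net.core.wmem_max = 134217728
net.ipv4.udp_early_demux = 0
net.ipv4.udp_mem=764304 1019072 1528608
net.ipv4.tcp_rmem=16384 349520 16777216
net.core.rmem_max=16777216
net.ipv4.udp_rmem_min = 18192
net.ipv4.udp_wmem_min = 8192
net.core.netdev_budget = 10000
net.core.netdev_max_backlog = 2000
net.core.netdev_max_backlog=100000
""".splitlines()

effective, overridden = effective_sysctls(SETTINGS)
for key, old, new in overridden:
    print(f"{key}: {old} is overridden by {new}")
```

Under that assumption the values actually in force would be the later ones, so it is worth confirming with `sysctl <name>` which value the running kernel reports.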

Thanks,
Alex

Lee

Sep 28, 2018, 10:45:52 AM

to Alex, bind-...@lists.isc.org

On 9/28/18, Alex <mysqls...@gmail.com> wrote:
> Hi,
>
> On Fri, Sep 28, 2018 at 12:18 AM Lee <ler...@gmail.com> wrote:
>>
>> On 9/27/18, Alex <mysqls...@gmail.com> wrote:
>> > Hi,
>> >
>> >> Just a wild thought:
>> >> It works with a lower speed line (at least I read it that way) but has
>> >> problems with higher speeds.
>> >> Could it be that the line is so fast that it "overtakes" the host in
>> >> question?
>> >>
>> >> A faster incoming line will give less time between the packets for
>> >> processing.
>> >
>> > No, I actually upgraded from a 65/20mbit to a 165/35mbit recently,
>> > thinking it was too slow because it was happening at the slower speeds
>> > as well. I've also implemented some basic QoS to throttle outgoing
>> > smtp and prioritize DNS but it made no difference.
>>
>> Has your provider enabled qos? I'd bet their dropping packets that
>> exceed qos rate limits would be considered "working as expected".
>
> I asked and they had no idea what that even meant.

Escalate? Which is assuming you have a ticket open..

I had it a bit easier than you; I was in an enterprise environment &
had control of the routers on both sides + it was relatively easy to
demonstrate packet loss.

> The technician that
> was here replacing the modem also had no idea outside of what the
> hardware does.
>
> I've also asked on dslreports about this, and no one answered.
>
> It certainly seems to be more pronounced now than it ever was in the
> past. Sometimes so many queries are failing that it's impossible to
> use the network.

Can you make it happen on demand? Troubleshooting is so much easier
if you can demonstrate the problem vs. trying to reconstruct what
happened after the fact.

>> Which brings up the question of exactly what does SERVFAIL mean? Can
>> no response to a query result in SERVFAIL? Is there a way to tell the
>> difference between no response & getting a response indicating a
>> failure?
>
> Early in this thread or another, I provided a packet trace that showed
> what appears to me to never have received the replies - it just times
> out. Also, the "Server Failure" messages are always on the loopback
> interface. I'd be happy to provide another trace if someone knows how
> to properly read it. I really have no idea what's causing the problem.

It would be nice if there was a way to tell if the problem was packet
drops (ie. no response to a query), getting a bad response from the
server or something else. At least then you'd know where to direct
your attention..

> Also, I recently raised the trace level to 99, but I don't see
> anything in the logs beyond level 4. Where do I find what the
> different trace levels are supposed to report?

No idea. I'm running bind at home and very occasionally see things like
28-Sep-2018 1:04:32.552 query-errors: info: client @000001F0C86745C0
127.0.0.1#63459 (www.Amazon.com): query failed (SERVFAIL) for
www.Amazon.com/IN/A at ..\query.c:8580

so I'd be interested in knowing if you get a resolution to the problem.

Lee

Alan Clegg

Sep 28, 2018, 11:17:24 AM

to bind-...@lists.isc.org

On 9/28/18 9:26 AM, Alex wrote:

>> Has your provider enabled qos? I'd bet their dropping packets that
>> exceed qos rate limits would be considered "working as expected".
>

> I asked and they had no idea what that even meant. The technician that

> was here replacing the modem also had no idea outside of what the
> hardware does.

You may want to consider buying a VPS somewhere other than behind the
modem at your (assumed) residence.

There are lots of 'em, some costing less than $5/month for a decent
little box (I have several scattered around the world) and when you have
a problem, they have a good chance of understanding what you are asking.

AlanC
--
Why don't we wander and follow la vie dansante.

Blake Hudson

Sep 28, 2018, 11:31:02 AM

to bind-...@lists.isc.org

Alex wrote on 9/26/2018 11:52 AM:
> This is all now running on a 165/35 cable system.
>

> Early in this thread or another, I provided a packet trace that showed
> what appears to me to never have received the replies - it just times
> out.

> It looks like there are periods of as many as 500 queries per second,
> although the usual amount is closer to 200 per second.

DOCSIS cable systems use an upstream request/grant system to avoid
collisions (they act as a hub where only one cable modem in the node can
transmit at the same time). This leads to low pps rates compared with
ethernet. Even a 10M ethernet connection (1k-10k pps) will outperform a
1gig cable connection (a few hundred pps).

Based on the info you've provided, I suspect that you may be running
into this limit. As another poster suggested, you might consider moving
your DNS server to a VPS hosted on an ethernet connection at a location
more suited for DNS server operation or otherwise try to leverage your
upstream provider's DNS or an outside DNS server.
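
As a rough back-of-envelope check of that suspicion, the upstream packet rate a resolver generates can be estimated from the client query rate. The Python sketch below is only illustrative: the cache hit ratio and the number of upstream queries per cache miss are hypothetical placeholders, not measurements from this thread, and the function name `upstream_pps` is made up. Only the 200 and 500 queries per second figures come from the earlier messages.

```python
def upstream_pps(client_qps, cache_hit_ratio, upstream_queries_per_miss):
    """Estimate packets per second sent upstream by a recursive resolver."""
    return client_qps * (1.0 - cache_hit_ratio) * upstream_queries_per_miss

# Query rates reported earlier in the thread.
TYPICAL_QPS = 200
PEAK_QPS = 500

# HYPOTHETICAL values, chosen only to show the arithmetic.
ASSUMED_HIT_RATIO = 0.5
ASSUMED_QUERIES_PER_MISS = 2

typical = upstream_pps(TYPICAL_QPS, ASSUMED_HIT_RATIO, ASSUMED_QUERIES_PER_MISS)
peak = upstream_pps(PEAK_QPS, ASSUMED_HIT_RATIO, ASSUMED_QUERIES_PER_MISS)
print(f"typical: {typical:.0f} pps upstream, peak: {peak:.0f} pps upstream")
```

DNSBL lookups are keyed on the connecting IP address, so cache hit ratios for that traffic may well be lower than assumed here. If the "few hundred pps" ceiling holds for a given cable plant, bursts could plausibly exceed it, which would be consistent with failures arriving in clusters.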

--Blake

Alex

Sep 29, 2018, 3:41:04 PM

to bl...@ispn.net, bind-...@lists.isc.org

Hi,

> DOCSIS cable systems use an upstream request/grant system to avoid
> collisions (they act as a hub where only one cable modem in the node can
> transmit at the same time). This leads to low pps rates compared with
> ethernet. Even a 10M ethernet connection (1k-10k pps) will outperform a
> 1gig cable connection (a few hundred pps).
>
> Based on the info you've provided, I suspect that you may be running
> into this limit. As another poster suggested, you might consider moving
> your DNS server to a VPS hosted on an ethernet connection at a location
> more suited for DNS server operation or otherwise try to leverage your
> upstream provider's DNS or an outside DNS server.

I remember hearing this some time ago, and had even made mention very
early on that I questioned if it was the cable itself.

However, I've tried using Optonline's DNS and the "Name service error"
errors from postfix continued. Could it be affecting that traffic as
well, considering effectively the same UDP packets are being
transferred?

I also used socat to build an encrypted tunnel between this system
connected to the cable modem and our VPS system, and the SERVFAIL
messages stopped. However, there are still quite a few "Name service
error" errors from postfix.

I realize this is bind-users, not a postfix list, but any idea if
those errors could also be due to it being a cable circuit?

Sep 29 14:33:54 mail03 postfix/dnsblog[3290]: warning: dnsblog_query:
lookup error for DNS query 123.139.28.66.dnsbl.sorbs.net: Host or

domain name not found. Name service error for

name=123.139.28.66.dnsbl.sorbs.net type=A: Host not found, try again

I'd really be interested in people's input here.

Thanks,
Alex

G.W. Haywood

Sep 30, 2018, 10:25:27 AM

to bind-...@lists.isc.org

Hi there,

On Sun, 30 Sep 2018, Alex wrote:

> Sep 29 14:33:54 mail03 postfix/dnsblog[3290]: warning:
> dnsblog_query: lookup error for DNS query
> 123.139.28.66.dnsbl.sorbs.net: Host or domain name not found. Name
> service error for name=123.139.28.66.dnsbl.sorbs.net type=A: Host
> not found, try again
>
> I'd really be interested in people's input here.

Are your requests being dropped by the service(s)?

(Or: are you inadvertently abusing the said service(s)?)

--

73,
Ged.

Alex

Sep 30, 2018, 11:59:50 AM

to bind-...@lists.isc.org

Hi,

> > Sep 29 14:33:54 mail03 postfix/dnsblog[3290]: warning:
> > dnsblog_query: lookup error for DNS query
> > 123.139.28.66.dnsbl.sorbs.net: Host or domain name not found. Name
> > service error for name=123.139.28.66.dnsbl.sorbs.net type=A: Host
> > not found, try again
> >
> > I'd really be interested in people's input here.
>
> Are your requests being dropped by the service(s)?
> (Or: are you inadvertently abusing the said service(s)?)

I don't believe so - oftentimes a follow-up host query succeeds
without issue. It's also failing for invaluement and spamhaus, both of
which we subscribe to.

30-Sep-2018 11:42:04.345 query-errors: info: client @0x7f7910197080
127.0.0.1#46806 (177.32.208.162.bad.psky.me): query failed (SERVFAIL)
for 177.32.208.162.bad.psky.me/IN/A at ../../../bin/named/query.c:8580
30-Sep-2018 11:32:31.245 query-errors: info: client @0x7f7920170d30
127.0.0.1#30816 (86.131.2.198.zz.countries.nerd.dk): query failed
(SERVFAIL) for 86.131.2.198.zz.countries.nerd.dk/IN/A at
../../../bin/named/query.c:8580

# host 177.32.208.162.bad.psky.me
Host 177.32.208.162.bad.psky.me not found: 3(NXDOMAIN)
# host 61.200.226.173.zz.countries.nerd.dk
61.200.226.173.zz.countries.nerd.dk has address 127.0.3.72

It also tends to happen in bulk - there may be 25 SERVFAILs within the
same second, then nothing for another few minutes.
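
One way to quantify that burstiness is to count SERVFAIL entries per second in the query-errors log. The following is a minimal Python sketch. It assumes each log entry sits on a single line in the log file (the entries quoted in this thread are wrapped by the mail client), and the names `servfails_per_second` and `bursts` are made up for illustration.

```python
import re
from collections import Counter

# Timestamp (to the second) of a query-errors line that reports SERVFAIL.
SERVFAIL_RE = re.compile(
    r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{1,2}:\d{2}:\d{2})\.\d+ query-errors: "
    r".*query failed \(SERVFAIL\)"
)

def servfails_per_second(log_lines):
    """Count SERVFAIL entries per one-second timestamp."""
    counts = Counter()
    for line in log_lines:
        m = SERVFAIL_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def bursts(counts, threshold):
    """Return (timestamp, count) pairs with at least `threshold` failures."""
    return [(ts, n) for ts, n in counts.most_common() if n >= threshold]

# Two entries from this thread, unwrapped onto single lines.
SAMPLE = [
    "30-Sep-2018 11:42:04.345 query-errors: info: client @0x7f7910197080 "
    "127.0.0.1#46806 (177.32.208.162.bad.psky.me): query failed (SERVFAIL) "
    "for 177.32.208.162.bad.psky.me/IN/A at ../../../bin/named/query.c:8580",
    "30-Sep-2018 11:32:31.245 query-errors: info: client @0x7f7920170d30 "
    "127.0.0.1#30816 (86.131.2.198.zz.countries.nerd.dk): query failed "
    "(SERVFAIL) for 86.131.2.198.zz.countries.nerd.dk/IN/A at "
    "../../../bin/named/query.c:8580",
]

counts = servfails_per_second(SAMPLE)
print(bursts(counts, threshold=1))
```

Run against a full log with a higher threshold, the timestamps of the bursts can then be lined up against mail volume or against the UDP drop counters.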

I believe an early tcpdump trace showed that we were just not
receiving the responses, although I don't know if it was due to the
service itself (doubtful, particularly for the reasons mentioned
above), or something along the way was dropping the packets.

This appears to indicate the response was never received:
27-Sep-2018 16:57:06.509 query-errors: info: client @0x7fc7a42f6900
127.0.0.1#46680 (fidelity.com.wild.pccc.com): query failed (SERVFAIL)
for fidelity.com.wild.pccc.com/IN/A at ../../../bin/named/query.c:8580
27-Sep-2018 16:57:06.510 query-errors: debug 2: fetch completed at
../../../lib/dns/resolver.c:3927 for fidelity.com.wild.pccc.com/A in
30.000130: timed out/success
[domain:wild.pccc.com,referral:0,restart:7,qrysent:7,timeout:6,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
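
The bracketed counters at the end of a "fetch completed" line are the informative part. A minimal Python sketch for extracting them follows; the function name `parse_fetch_counters` is made up for illustration. The interpretation in the comments is hedged: as I read the BIND ARM, qrysent is the number of queries the resolver sent for the fetch, and timeout counts timeouts since the last response was received.

```python
import re

def parse_fetch_counters(log_line):
    """Pull the [key:value,...] counter block out of a 'fetch completed' line."""
    m = re.search(r"\[([^\]]+)\]", log_line)
    if not m:
        return {}
    counters = {}
    for item in m.group(1).split(","):
        key, _, value = item.partition(":")
        # Numeric counters become ints; the domain stays a string.
        counters[key] = int(value) if value.isdigit() else value
    return counters

# The entry from this message, unwrapped onto a single line.
LINE = (
    "27-Sep-2018 16:57:06.510 query-errors: debug 2: fetch completed at "
    "../../../lib/dns/resolver.c:3927 for fidelity.com.wild.pccc.com/A in "
    "30.000130: timed out/success "
    "[domain:wild.pccc.com,referral:0,restart:7,qrysent:7,timeout:6,lame:0,"
    "quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]"
)

counters = parse_fetch_counters(LINE)
print(counters["domain"], counters["qrysent"], counters["timeout"])
```

Here seven queries were sent and six timeouts were counted, with neterr and badresp both zero, which is consistent with replies simply not arriving rather than with the remote servers returning errors.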

I attempted to search github for query.c line 8580, but there weren't
even that many lines in the file.

Is there any further bind debugging that can be done to determine
this? I've tried increasing the tracing level to 99, but it doesn't
appear to show any more than trace level 4.

@lbutlr

Sep 30, 2018, 1:18:41 PM

to bind-...@lists.isc.org

On 30 Sep 2018, at 09:59, Alex <mysqls...@gmail.com> wrote:
> It also tends to happen in bulk - there may be 25 SERVFAILs within the
> same second, then nothing for another few minutes.

That really makes it seem like either your modem or your ISP is interfering somehow, or is simply not able to keep up.

--
'Who's that playing now, Mr. Dibbler?' "'And you".' 'Sorry, Mr.
Dibbler?' 'Only they write it &U,' said Dibbler. --Soul Music

Alex

Sep 30, 2018, 8:28:14 PM

to kre...@kreme.com, bind-...@lists.isc.org

Hi,

On Sun, Sep 30, 2018 at 1:19 PM @lbutlr <kre...@kreme.com> wrote:
>
> On 30 Sep 2018, at 09:59, Alex <mysqls...@gmail.com> wrote:
> > It also tends to happen in bulk - there may be 25 SERVFAILs within the
> > same second, then nothing for another few minutes.
>
> That really makes it seem like either your modem or your ISP is interfering somehow, or is simply not able to keep up.

I'm leaning towards that, too. The problem persists even when using
the provider's DNS servers. I thought for sure I'd see some verifiable
info from other people having problems with cable, such as from
dslreports, etc, but there really hasn't been anything. The comment
made about DOCSIS earlier in this thread was helpful.

Do you believe it could be impacting all data, not just bind/DNS/UDP?

Do people not generally use cable as even a fallback link for
secondary services? I figured it was because there's no SLA, not
because it doesn't work well with many protocols. I'd imagine services
like Netflix and YouTube don't have problems because 1) they don't
require a lot of DNS traffic, 2) http is a really simple protocol,
and 3) the link is probably engineered to be used for that?

Browne, Stuart

Sep 30, 2018, 8:45:16 PM

to bind-...@lists.isc.org

> -----Original Message-----
> From: bind-users On Behalf Of Alex

<snip>

> I'm leaning towards that, too. The problem persists even when using
> the provider's DNS servers. I thought for sure I'd see some verifiable
> info from other people having problems with cable, such as from
> dslreports, etc, but there really hasn't been anything. The comment
> made about DOCSIS earlier in this thread was helpful.
>
> Do you believe it could be impacting all data, not just bind/DNS/UDP?
>
> Do people not generally use cable as even a fallback link for
> secondary services? I figured it was because there's no SLA, not
> because it doesn't work well with many protocols. I'd imagine services
> like Netflix and YouTube don't have problems because 1) they don't
> require a lot of DNS traffic, 2) http is a really simple protocol,
> and 3) the link is probably engineered to be used for that?

I use sendmail and bind at home for my purposes, and don't have these sorts of issues. But what probably solves this for most users is the fact that most home-type usage runs over TCP rather than UDP.

UDP is designed as a lossy protocol; no resends, no guaranteed delivery at a protocol level. If you're really concerned with the occasional SERVFAIL due to this (which your stub resolver should be retrying), you could try convincing BIND to recurse using TCP only. It's not a good idea (and I'm pretty sure BIND doesn't have an option to do it natively)...
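
To make the UDP/TCP difference concrete at the wire level: the DNS message itself is the same in both cases, and over TCP it is simply prefixed with a two-byte length (RFC 1035, section 4.2.2), with TCP then taking care of retransmitting lost segments. A minimal sketch, with function names and the transaction ID chosen only for illustration:

```python
import struct


def build_query(name, txid=0x1234, qtype=1):
    """Build a minimal DNS query: 12-byte header plus one question, class IN.

    qtype=1 is an A record; txid is an arbitrary example transaction ID.
    """
    # ID, flags (0x0100 = recursion desired), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Each label is encoded as a length byte followed by the label bytes
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message):
    """Prefix a DNS message with its two-byte big-endian length for TCP transport."""
    return struct.pack("!H", len(message)) + message
```

Over UDP the output of build_query() would be sent as a single datagram with nothing to recover it if it is dropped; over TCP the framed bytes go down a connection that retries on its own, at the cost of a handshake per connection.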

Stuart

G.W. Haywood

Oct 1, 2018, 8:30:48 AM

to bind-...@lists.isc.org

Hello again,

On Mon, 1 Oct 2018, Alex wrote:

> > Are your requests being dropped by the service(s)?
> >
> > (Or: are you inadvertently abusing the said service(s)?)
>
> I don't believe so - often times a follow-up host query succeeds
> without issue. It's also failing for invaluement and spamhaus, both
> of which we subscribe.

> [...]

> It also tends to happen in bulk - there may be 25 SERVFAILs within
> the same second, then nothing for another few minutes.

Hmmm. If it isn't the modem and it isn't the BLs then it more or less
has to be the service, no?

I'd be tempted by Mr. Clegg's suggestion to spin up a VPS somewhere
with a decent connection, which will at least offload a lot of retries.
Talk to it through OpenVPN, which is very easy to set up, and it can
(a) put the VPS on your LAN, (b) take much of the unreliability out of
the presumably unreliable connection between you and the VPS, and (c)
write very verbose logs if you wish. On occasion on unreliable
connections I've had to use TCP for the VPN link but UDP is the norm;
OpenVPN has its own ways of dealing with lost packets.

Then you'll probably have a whole new can of worms to investigate, but
the worms will definitely tell you something. :)

--

73,
Ged.

Blake Hudson

Oct 1, 2018, 9:57:42 AM

to bind-...@lists.isc.org

Alex wrote on 9/30/2018 7:27 PM:
> Hi,
>
> On Sun, Sep 30, 2018 at 1:19 PM @lbutlr <kre...@kreme.com> wrote:
>> On 30 Sep 2018, at 09:59, Alex <mysqls...@gmail.com> wrote:
>>> It also tends to happen in bulk - there may be 25 SERVFAILs within the
>>> same second, then nothing for another few minutes.
>> That really makes it seem like either your modem or your ISP is interfering somehow, or is simply not able to keep up.
> I'm leaning towards that, too. The problem persists even when using
> the provider's DNS servers. I thought for sure I'd see some verifiable
> info from other people having problems with cable, such as from
> dslreports, etc, but there really hasn't been anything. The comment
> made about DOCSIS earlier in this thread was helpful.
>
> Do you believe it could be impacting all data, not just bind/DNS/UDP?
>
> Do people not generally use cable as even a fallback link for
> secondary services? I figured it was because there's no SLA, not
> because it doesn't work well with many protocols. I'd imagine services
> like Netflix and YouTube don't have problems because 1) they don't
> require a lot of DNS traffic, 2) http is a really simple protocol,
> and 3) the link is probably engineered to be used for that?
>

Overall it probably depends on volume and application. Cable works well
as a transport, but is not the same as DSL, ethernet, or GPON. If you
have the need to send 500+ pps, then Cable may not meet your needs.

If you are running a high volume mail server you probably do need to run
a local resolver to query services like SpamHaus, SORBs, and others due
to the terms of service of these services and the rate limiting that
they apply which would prevent you from using your upstream provider's
DNS servers or a public DNS service like Google/Quad9/1.1.1.1. I would,
however, recommend that you ensure your system has at least 2 resolvers
configured in /etc/resolv.conf. If the first (local resolver) fails to
resolve a query, then your system should retry the second server before
giving up and returning a failure to Postfix. Again, if you're using
free RBL services that second resolver may need to be one of your own
and not one shared with other folks.
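
A quick way to confirm the fallback is actually in place is to count the nameserver lines in the file. This is a small sketch; the function names are made up for illustration and the addresses in the usage note are examples only.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses from resolv.conf-style text, in file order."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Lines starting with '#' or ';' are comments
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def has_fallback(resolv_conf_text):
    """True if at least two resolvers are listed, so a failed lookup can be retried."""
    return len(nameservers(resolv_conf_text)) >= 2
```

For example, has_fallback("nameserver 127.0.0.1\nnameserver 192.0.2.53\n") is True, while a file listing only 127.0.0.1 gives False. On a real host the text would come from reading /etc/resolv.conf.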

The occasional timeout might delay email, but should not prevent SMTP
from functioning because A) DNS timeouts are considered to be a
temporary error, and B) the default behavior of SMTP is to queue and
retry if there is a timeout or temporary failure. Another angle to look
at the problem from is if you believe the network can't handle more than
X query volume, reduce your query volume below X to see if this resolves
your issue. I operate dozens of email servers and they do not generate
the query volume you describe. Perhaps you are querying too many RBLs
and it may pay to be more selective. I find SpamHaus and SpamCop to be
the best two RBLs. If you want to pick another one or two, that seems
reasonable. I would not recommend more RBLs within Postfix.

--Blake

Lee

Oct 1, 2018, 11:17:12 AM

to Alex, bind-...@lists.isc.org

On 9/30/18, Alex <mysqls...@gmail.com> wrote:
> Hi,
>
> On Sun, Sep 30, 2018 at 1:19 PM @lbutlr <kre...@kreme.com> wrote:
>>
>> On 30 Sep 2018, at 09:59, Alex <mysqls...@gmail.com> wrote:
>> > It also tends to happen in bulk - there may be 25 SERVFAILs within the
>> > same second, then nothing for another few minutes.
>>
>> That really makes it seem like either your modem or your ISP is interfering
>> somehow, or is simply not able to keep up.
>
> I'm leaning towards that, too. The problem persists even when using
> the provider's DNS servers.

Is this a personal project or can you get help from the network staff
& open trouble tickets with the various providers?

I'm making a big guess here, but you mentioned dnsbl.sorbs.net earlier so
$ dig dnsbl.sorbs.net.
<.. snip ..>
;; ANSWER SECTION:
dnsbl.sorbs.net. 86400 IN A 113.52.8.154
dnsbl.sorbs.net. 86400 IN A 113.52.8.155
dnsbl.sorbs.net. 86400 IN A 208.43.139.188
dnsbl.sorbs.net. 86400 IN A 113.52.8.153
dnsbl.sorbs.net. 86400 IN A 208.43.110.204

go here: https://wq.apnic.net/apnic-bin/whois.pl
and search for 113.52.8.154
which gives me
inetnum: 113.52.8.0 - 113.52.8.255
netname: DIGITALSENSE
descr: Digital Sense, Data Centres, Brisbane, Colocation
country: AU

on the other hand
https://whois.arin.net/rest/net/NET-208-43-0-0-1/pft?s=208.43.139.188
gives me
City Dallas
State/Province TX

If this is a packet drop issue as well as a personal project, you
might be stuck with figuring out just how fast you can send queries
before things start to break and adjusting your setup accordingly.
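
If you do end up probing for that limit, one common way to cap a send rate is a token bucket. The sketch below is only an illustration of the idea: the class name and the numbers in the usage note are hypothetical, and it says nothing about how BIND or postfix pace their own queries.

```python
class TokenBucket:
    """Allow at most `rate` queries per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens that can accumulate
        self.tokens = float(burst)   # start with a full bucket
        self.last = now              # time of the previous call, in seconds

    def allow(self, now):
        """Return True if one query may be sent at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A test client could create TokenBucket(rate=100, burst=5), call allow(time.monotonic()) before each query, and raise the rate step by step until the timeouts start to appear.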

> I thought for sure I'd see some verifiable
> info from other people having problems with cable, such as from
> dslreports, etc, but there really hasn't been anything. The comment
> made about DOCSIS earlier in this thread was helpful.
>
> Do you believe it could be impacting all data, not just bind/DNS/UDP?
>
> Do people not generally use cable as even a fallback link for
> secondary services? I figured it was because there's no SLA, not
> because it doesn't work well with many protocols.

I think it's more of a you-get-what-you-pay-for thing. "business
class" cable costs more & might even be provisioned better, but at
least the first question you get when calling support isn't "have you
tried turning it off and on?"

wrt your earlier

> I attempted to search github for query.c line 8580

there's probably a github answer; I went to https://ftp.isc.org/isc/bind9/
found my release and downloaded the BIND-xxx.tar.gz source code file.

It'd be nice if ISC made "no response to a query" a separate error
instead of lumping it in with all the other "Something has gone wrong."
possibilities.

Lee

Alex

Oct 1, 2018, 12:20:14 PM

to bi...@jubileegroup.co.uk, bind-...@lists.isc.org

Hi,

> > It also tends to happen in bulk - there may be 25 SERVFAILs within
> > the same second, then nothing for another few minutes.
>

> Hmmm. If it isn't the modem and it isn't the BLs then it more or less
> has to be the service, no?

Yes, most likely, but I was looking for more definitive proof that the
circuit wasn't doing what it should be (or at least, what I expect). I
also wasn't sure if it was a tuning issue (network, bind, server
itself, etc).

> I'd be tempted by Mr. Clegg's suggestion to spin up a VPS somewhere
> with decent connection, which will at least offload a lot of retries.

I built an encrypted tunnel using socat with a VPS and a decent
connection and the bind SERVFAIL messages almost entirely went away.
The remaining ones seem to be actual SERVFAIL problems.

> Then you'll probably have a whole new can of worms to investigate, but
> the worms will definitely tell you something. :)

Yeah, socat isn't a good permanent solution. Looks like I'll get
libreswan going. Building a VPN for a specific port/service is a
little more difficult, I believe.

Alex

Oct 1, 2018, 12:52:05 PM

to bind-...@lists.isc.org

Hi,

On Mon, Oct 1, 2018 at 9:58 AM Blake Hudson <bl...@ispn.net> wrote:
>
> Alex wrote on 9/30/2018 7:27 PM:
> > Hi,
> >
> > On Sun, Sep 30, 2018 at 1:19 PM @lbutlr <kre...@kreme.com> wrote:
> >> On 30 Sep 2018, at 09:59, Alex <mysqls...@gmail.com> wrote:
> >>> It also tends to happen in bulk - there may be 25 SERVFAILs within the
> >>> same second, then nothing for another few minutes.
> >> That really makes it seem like either your modem or your ISP is interfering somehow, or is simply not able to keep up.
> > I'm leaning towards that, too. The problem persists even when using
> > the provider's DNS servers. I thought for sure I'd see some verifiable
> > info from other people having problems with cable, such as from
> > dslreports, etc, but there really hasn't been anything. The comment
> > made about DOCSIS earlier in this thread was helpful.
> >
> > Do you believe it could be impacting all data, not just bind/DNS/UDP?
> >
> > Do people not generally use cable as even a fallback link for
> > secondary services? I figured it was because there's no SLA, not
> > because it doesn't work well with many protocols. I'd imagine services
> > like Netflix and YouTube don't have problems because 1) they don't
> > require a lot of DNS traffic, 2) http is a really simple protocol,
> > and 3) the link is probably engineered to be used for that?
> >
>
> Overall it probably depends on volume and application. Cable works well
> as a transport, but is not the same as DSL, ethernet, or GPON. If you
> have the need to send 500+ pps, then Cable may not meet your needs.

I believe I said as many as 500 qps, but I believe that's wrong. It's
more like a sustained 200 q/s.

> If you are running a high volume mail server you probably do need to run
> a local resolver to query services like SpamHaus, SORBs, and others due
> to the terms of service of these services and the rate limiting that

Yes, doing all of that. That's why I'm posting to the bind-users list.

For RBLs, I'm using invaluement (amazing), spamhaus, spamcop, sorbs,
senderscore and barracuda.

> they apply which would prevent you from using your upstream provider's
> DNS servers or a public DNS service like Google/Quad9/1.1.1.1. I would,
> however, recommend that you ensure your system has at least 2 resolvers
> configured in /etc/resolv.conf. If the first (local resolver) fails to
> resolve a query, then your system should retry the second server before

That turned out to be a key factor in this.

I managed to get rid of most of the SERVFAIL bind errors after
temporarily tunneling the queries through socat, but there were still a
few others. I thought that using just one entry in /etc/resolv.conf
would force all queries to go through there, but apparently some were
dropped(?). It wasn't until I added another resolver on a local
network (also on that cable connection) that the 'Name service error'
postfix errors really stopped.

> The occasional timeout might delay email, but should not prevent SMTP
> from functioning because A) DNS timeouts are considered to be a
> temporary error, and B) the default behavior of SMTP is to queue and

It doesn't prevent the email from being delivered, but the RBL queries
time out and consequently don't get consulted, perhaps allowing email
to pass that otherwise shouldn't have.

> retry if there is a timeout or temporary failure. Another angle to look
> at the problem from is if you believe the network can't handle more than
> X query volume, reduce your query volume below X to see if this resolves
> your issue. I operate dozens of email servers and they do not generate
> the query volume you describe. Perhaps you are querying too many RBLs

I've also experimented with QoS, prioritizing interactive traffic like
DNS, and it appears to help, but I don't believe it's a bandwidth
issue. The errors also sometimes happen when processing only a few
emails.

For a while I thought it couldn't be a bandwidth issue because it's a
165/35 Mbit link, and we have 10 Mbit ethernet links where it never
happens with otherwise very similar configurations, but now I'm
pretty sure it's because of how the cable (DOCSIS?) link is
designed...

Shaun

Oct 1, 2018, 2:16:01 PM

to Alex, bind-...@lists.isc.org

Hi Alex,

On Mon, 1 Oct 2018 12:51:46 -0400
Alex <mysqls...@gmail.com> wrote:

> I believe I said as many as 500 qps, but I believe that's wrong. It's
> more like a sustained 200 q/s.

One other thing you might double check is whether or not any consumer
equipment (cable modem, router) has a firewall setting that could be
interfering.

My newest router came with a built-in DDOS protection feature, which
caused me some difficulty with UDP applications until I disabled it. The
default threshold for UDP was something like 200 or 300 pps. The manual
isn't clear on how the "protection" works, but I assume it starts
dropping packets on the floor when the threshold is exceeded. I turned
off that feature and the problem went away.
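
For a feel of how a query rate maps onto a packets-per-second threshold like that, here is a back-of-the-envelope sketch. The function and its inputs are assumptions for illustration: the cache hit ratio and the number of upstream queries per cache miss are guesses you would have to measure, and a router may count packets differently.

```python
def estimated_udp_pps(client_qps, cache_hit_ratio, upstream_queries_per_miss):
    """Rough UDP packets/second crossing the router for recursive lookups.

    Each upstream query is counted as two packets: the query going out
    and the response coming back.
    """
    misses_per_second = client_qps * (1.0 - cache_hit_ratio)
    return misses_per_second * upstream_queries_per_miss * 2
```

With made-up inputs of 200 q/s, a 50% cache hit ratio and 1.5 upstream queries per miss, estimated_udp_pps(200, 0.5, 1.5) comes to 300 packets per second, which is already in the range of the threshold described above.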

Apologies if you've already looked into this; long thread and I'm
jumping in late.

-s