DNS lookup randomly fails

Stevo Novkovski

Dec 26, 2016, 3:39:51 PM
to gce-discussion
After a lot of debugging in my application, I discovered that my OS (CentOS) is actually experiencing random DNS lookup failures.
I captured traffic on port 53 and found "bad udp cksum" errors. After some googling, I found many articles saying that this happens on many clouds.

http://www.pkdavies.co.uk/172-using-tcpdump-to-monitor-dns-requests.html - Analysis of the [bad udp cksum xx] reveals that this is a common issue with virtual/cloud servers.
https://ubuntuforums.org/showthread.php?t=1940190 - The problem is related to the hardware offload engine used in VMware virtual NICs.
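
For context on what tcpdump is checking: over IPv4, the UDP checksum is the 16-bit one's-complement sum of a pseudo-header (source address, destination address, protocol number, UDP length) plus the UDP header and payload, as defined in RFC 768. When transmit checksum offload is enabled, the kernel leaves that field for the NIC to complete, so a capture taken on the sending host can report "bad udp cksum" for packets that are correct on the wire. That alone may or may not explain the lookup failures. Below is a minimal Python sketch of the calculation; the addresses, ports, and payload are made-up illustration values, not taken from the capture discussed here.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Return the 16-bit one's-complement sum of data (zero-padded to even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """Build the IPv4 pseudo-header that is covered by the UDP checksum."""
    return (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length)
    )


def udp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Compute the checksum for a UDP segment whose checksum field is zero."""
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    checksum = ~total & 0xFFFF
    return checksum or 0xFFFF  # 0 means "no checksum", so a computed 0 is sent as 0xFFFF


def build_udp_segment(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    """Return UDP header + payload with the checksum filled in."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = udp_checksum(src_ip, dst_ip, header + payload)
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload


def checksum_is_valid(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """A segment with a correct checksum sums to 0xFFFF together with the pseudo-header."""
    total = ones_complement_sum(pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    return total == 0xFFFF


# Hypothetical values for illustration only.
segment = build_udp_segment("10.0.0.2", "10.0.0.53", 40000, 53, b"example")
print("valid:", checksum_is_valid("10.0.0.2", "10.0.0.53", segment))
```

Capturing on the receiving side, or on the sender with transmit checksum offload temporarily disabled, is one way to tell an offload artifact from real corruption.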

Here is the output from my server:
ethtool --show-offload eth0
Features for eth0:
rx-checksumming: on [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
busy-poll: off [fixed]

Here is the output from a new CentOS 7 server on GCE:
Features for eth0:
rx-checksumming: on [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-mpls-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
busy-poll: off [fixed]
tx-sctp-segmentation: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
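
Comparing the two listings by eye is error-prone. As far as I can tell they differ only in tx-scatter-gather-fraglist, tx-nocache-copy, and three features that only the newer server reports. Here is a minimal Python sketch of how such a comparison could be automated; the function names are my own, and the sample text is a shortened excerpt of the two outputs above rather than the full listings.

```python
def parse_offload(text: str) -> dict:
    """Map feature name -> (state, is_fixed) from `ethtool --show-offload` output."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        parts = value.split()
        if not sep or not parts:
            continue  # skips blank lines and the "Features for eth0:" header
        features[name] = (parts[0], "[fixed]" in parts)
    return features


def diff_offload(old: dict, new: dict) -> dict:
    """Return {feature: (old_value, new_value)} for every feature that differs."""
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Shortened excerpts of the two listings, for illustration.
OLD_SERVER = """\
Features for eth0:
tx-checksumming: on
        tx-checksum-ip-generic: on
scatter-gather: on
        tx-scatter-gather-fraglist: on
tx-nocache-copy: on
busy-poll: off [fixed]
"""

NEW_SERVER = """\
Features for eth0:
tx-checksumming: on
        tx-checksum-ip-generic: on
scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tx-nocache-copy: off
busy-poll: off [fixed]
tx-sctp-segmentation: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
"""

for name, (before, after) in diff_offload(
    parse_offload(OLD_SERVER), parse_offload(NEW_SERVER)
).items():
    print(name, before, "->", after)
```

A value of None on one side means that server's ethtool did not report the feature at all, which usually points to a different kernel or ethtool version rather than a different setting.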

Can anyone official from GCE explain why this is happening or how to fix it?

George (Google Cloud Support)

Dec 27, 2016, 4:58:44 PM
to gce-discussion
Hello Stevo,

A bad checksum can be caused by packets larger than 1460 bytes, since the GCE network has a maximum transmission unit (MTU) of 1460.
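
To make the MTU figure concrete: with an MTU of 1460, an IPv4 header of 20 bytes (assuming no IP options) and a UDP header of 8 bytes leave 1432 bytes for a DNS message in a single unfragmented packet. The short Python sketch below only does that arithmetic; the helper names are hypothetical, and whether oversized packets are actually involved in this case has not been established.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only
UDP_HEADER = 8    # bytes


def max_udp_payload(mtu: int, ipv6: bool = False) -> int:
    """Largest UDP payload that fits in one packet without fragmentation."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - UDP_HEADER


def fits_in_one_packet(payload_len: int, mtu: int, ipv6: bool = False) -> bool:
    """True if a UDP payload of this size needs no IP fragmentation."""
    return payload_len <= max_udp_payload(mtu, ipv6)


print(max_udp_payload(1460))           # GCE MTU quoted above
print(max_udp_payload(1500))           # common Ethernet default, for comparison
print(fits_in_one_packet(512, 1460))   # classic DNS-over-UDP message size limit
```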

Are you by any chance using a custom CentOS image? The OS images provided by GCE are configured with the above-mentioned MTU.

I hope this helps.

Sincerely,
George

Stevo Novkovski

Dec 28, 2016, 3:44:34 AM
to gce-discussion
It is the Google CentOS 7 image, upgraded to CloudLinux 7.
However, here is the output for eth0:

eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP qlen 1000

Any other solution?

George (Google Cloud Support)

Dec 28, 2016, 5:06:30 PM
to gce-discussion
Hello Stevo,

As the issue is happening on a non-Google default image, I would suggest checking the configuration on the machine, which may have an improper NIC driver or NIC settings that need tuning (offloading, buffers, TX/RX negotiation, etc.).

I would recommend turning off the offloading options as a first troubleshooting step, which might give you some insight into the issue. I also suggest that you post this question on the StackExchange network with the relevant tags, as the CloudLinux community is active there.
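
As a rough illustration of that first step, the Python sketch below reads `ethtool --show-offload` text and builds (but does not run) an `ethtool -K` command that turns off the offload groups that are currently on and not marked [fixed]. It uses only the short aliases from the ethtool man page (rx, tx, sg, tso, gso, gro); the function names and the sample excerpt are my own assumptions. Running the resulting command needs root, and the change does not persist across reboots.

```python
import shlex

# Long names printed by `ethtool --show-offload` -> short aliases accepted by `ethtool -K`.
SHORT_ALIASES = {
    "rx-checksumming": "rx",
    "tx-checksumming": "tx",
    "scatter-gather": "sg",
    "tcp-segmentation-offload": "tso",
    "generic-segmentation-offload": "gso",
    "generic-receive-offload": "gro",
}


def offloads_to_disable(show_offload_text: str) -> list:
    """Return aliases of offload groups that are on and can be changed."""
    aliases = []
    for line in show_offload_text.splitlines():
        name, sep, value = line.strip().partition(":")
        state = value.split()
        if not sep or not state:
            continue
        if name in SHORT_ALIASES and state[0] == "on" and "[fixed]" not in state:
            aliases.append(SHORT_ALIASES[name])
    return aliases


def disable_command(interface: str, aliases: list) -> str:
    """Build one `ethtool -K` command line; returns '' when nothing can be changed."""
    if not aliases:
        return ""
    args = ["ethtool", "-K", interface]
    for alias in aliases:
        args += [alias, "off"]
    return " ".join(shlex.quote(arg) for arg in args)


# Excerpt in the format posted earlier in this thread, for illustration.
SAMPLE = """\
Features for eth0:
rx-checksumming: on [fixed]
tx-checksumming: on
        tx-checksum-ip-generic: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
"""

print(disable_command("eth0", offloads_to_disable(SAMPLE)))
```

Disabling one group at a time and re-testing DNS after each change makes it easier to see which setting, if any, is involved.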

I hope this helps.

Sincerely,

George

Stevo Novkovski

Dec 29, 2016, 7:53:19 AM
to gce-discussion
Thanks for the tips.

But I also want to mention that on one of my servers this problem occurred from 19 to 26 December. Before those dates my system was exactly the same: we didn't make a single update to the OS or any RPM, or change any setting.
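
Since the failures come and go over days, a timestamped record of lookups may help correlate them with other events. Here is a minimal Python sketch of such a probe, assuming the standard-library resolver call `socket.getaddrinfo`; the function names are hypothetical, and a local cache such as nscd could hide some failures from it.

```python
import socket
import time
from datetime import datetime, timezone


def probe_once(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once and record the timestamp, outcome, and duration."""
    started = time.monotonic()
    try:
        resolver(hostname, None)
        error = None
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = str(exc)
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "host": hostname,
        "ok": error is None,
        "ms": round(elapsed_ms, 1),
        "error": error,
    }


def run_probe(hostname, attempts, interval_s=1.0,
              resolver=socket.getaddrinfo, sleep=time.sleep):
    """Run several lookups, pausing interval_s seconds between them."""
    results = []
    for i in range(attempts):
        results.append(probe_once(hostname, resolver))
        if i + 1 < attempts:
            sleep(interval_s)
    return results


def failure_rate(results):
    """Fraction of lookups that failed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(not r["ok"] for r in results) / len(results)


# Example use (performs real lookups, so it is left commented out):
# results = run_probe("www.google.com", attempts=60, interval_s=1.0)
# for r in results:
#     if not r["ok"]:
#         print(r["time"], r["host"], r["error"])
# print("failure rate:", failure_rate(results))
```

The resolver and sleep functions are parameters so the probe can be exercised with a stub instead of real network traffic.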
