RHEL6 clients w/ iptables lose access to NFS exports


Dan Pritts

Apr 17, 2017, 3:26:51 PM
to isilon-u...@googlegroups.com
Hi all -

we've got a cluster with 7.2.1.2, and a bunch of RHEL6 clients.

Recently two of those clients have had multiple "mount disappears"
events - good old "NFS server not responding still trying". In some
cases we've had no choice but to do hard reboots.

Once, I got in and tried disabling iptables - the mount returned
immediately.

The two clients are on different lan segments, and the problem has
occurred with mounts against (at least) two different isilon nodes.
One was very busy, the other not so much.

In one case, the export is mounted only by that client, which was
hammering on it. In the other case, it was our central shared software
repository, not heavily accessed at all.

Neither iptables configs nor the isilon have had any significant
configuration changes in the recent past.


One common thread is that the two clients are running recent kernels:

kernel-2.6.32-696.1.1.el6
and
kernel-2.6.32-696.el6

These are not our only examples of these kernels, but most of our
machines aren't on this yet, and the other machines running these
kernels aren't heavily used.

Anyone else run into this? I asked around on our campus and nobody had
run into it (yet?).

thanks
danno
--
Dan Pritts
ICPSR Computing & Network Services
University of Michigan

Chris Pepper

Apr 18, 2017, 9:19:17 AM
to isilon-u...@googlegroups.com
Dan,

Are they running NFS over TCP or UDP? Have you whitelisted the Isilon cluster (*all* IPs) in iptables? We had trouble with SNMP a few years ago -- Nagios would send an SNMP packet to a dynamic SmartConnect IP, and snmpd on the receiving node would send it back via a static node IP. This broke iptables response detection until we whitelisted SNMP in iptables. iptables LOG rules and tcpdump on the problem Linux client should help; tcpdump on Isilon nodes should help too.
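As a rough illustration of the whitelisting and LOG-rule suggestion above, here is a minimal Python sketch that builds the iptables commands for a list of cluster addresses. The addresses, the `INPUT` chain, and the `isilon: ` log prefix are placeholders, not values from this thread; the output is meant to be reviewed by hand before anything is run as root.

```python
import ipaddress
import shlex


def build_whitelist_rules(cluster_ips, chain="INPUT"):
    """Return one iptables argv list per cluster IP that accepts its traffic."""
    rules = []
    for raw in cluster_ips:
        # ip_address() raises ValueError on a malformed address.
        ip = ipaddress.ip_address(raw)
        rules.append(["iptables", "-I", chain, "1", "-s", str(ip), "-j", "ACCEPT"])
    return rules


def build_log_rule(cluster_ip, prefix="isilon: ", chain="INPUT"):
    """Return an argv list that logs (without accepting) traffic from one IP."""
    ip = ipaddress.ip_address(cluster_ip)
    return ["iptables", "-I", chain, "1", "-s", str(ip),
            "-j", "LOG", "--log-prefix", prefix]


if __name__ == "__main__":
    # Hypothetical addresses: list *all* static and dynamic cluster IPs here.
    ips = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
    for rule in build_whitelist_rules(ips) + [build_log_rule(ips[0])]:
        print(" ".join(shlex.quote(arg) for arg in rule))
```

LOG is a non-terminating target, so a logged packet continues down the chain to whatever rule accepts or drops it.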

Troubleshooting on the Isilon side would be simpler if you address an individual node directly, but might circumvent your issue as well. I believe Isilon support has a custom tool that basically runs tcpdump cluster-wide and dumps into log files to help with troubleshooting. Unfortunately I don't recall the name of the binary.

Regards,

Chris
> --
> You received this message because you are subscribed to the Google Groups "Isilon Technical User Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isilon-user-gr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Dan Pritts

Apr 18, 2017, 12:16:54 PM
to isilon-u...@googlegroups.com
I must admit I haven't sniffed the traffic, but I would be shocked if it weren't using TCP - I certainly haven't forced UDP in the mount options.

I haven't whitelisted the cluster in iptables (except very recently on our production servers, in response to this issue).  I've been depending on iptables connection tracking to, you know, track connections.  And it's worked fine, for several years. 

I wonder whether rpc statd requests from Isilon to client are coming from the IP address that's being mounted, or the master address of the node, like in your SNMP situation.   I'm a bit fuzzy on the gritty details of how locks & statd operate but that might explain what's happening. 

Thanks for the troubleshooting hints - I've been tied up on other projects but had planned along similar lines.  We have a small cluster, and the pools that serve these clients are only on 3 nodes, so sniffing is relatively easy. 

thanks
danno



Dan Pritts

May 10, 2017, 2:21:40 PM
to isilon-u...@googlegroups.com
FWIW,

I have further narrowed this down to the iptables conntrack INVALID match.  I'm DROPping INVALID packets; removing that rule fixes my problem. 
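For anyone wanting to check whether their own clients carry a similar rule, the following is a small sketch that scans `iptables-save` style output for rules that match the INVALID state and jump to DROP. The sample ruleset is made up for illustration and is not the ruleset from this thread.

```python
import shlex

# Illustrative ruleset only, not taken from the affected clients.
SAMPLE = """\
*filter
:INPUT ACCEPT [0:0]
-A INPUT -m state --state INVALID -j DROP
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
COMMIT
"""


def find_invalid_drop_rules(iptables_save_text):
    """Return rule lines that match state INVALID and jump to DROP."""
    hits = []
    for line in iptables_save_text.splitlines():
        line = line.strip()
        if not line.startswith("-A "):
            continue  # skip table headers, chain policies, COMMIT
        tokens = shlex.split(line)
        states = set()
        # The state list follows --state (state match) or --ctstate (conntrack match).
        for flag in ("--state", "--ctstate"):
            if flag in tokens:
                idx = tokens.index(flag)
                negated = idx > 0 and tokens[idx - 1] == "!"
                if not negated and idx + 1 < len(tokens):
                    states.update(tokens[idx + 1].split(","))
        target = None
        if "-j" in tokens:
            idx = tokens.index("-j")
            if idx + 1 < len(tokens):
                target = tokens[idx + 1]
        if "INVALID" in states and target == "DROP":
            hits.append(line)
    return hits


if __name__ == "__main__":
    for rule in find_invalid_drop_rules(SAMPLE):
        print(rule)
```

On a real client the input would come from the output of `iptables-save`; rules that reach DROP indirectly through a user-defined chain are not followed by this sketch.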

grumble grumble

thanks
danno


Dan Pritts

May 19, 2017, 2:09:50 PM
to isilon-u...@googlegroups.com
Looks like I ran into a (now-)known issue with RHEL6.9.

https://access.redhat.com/solutions/3018371

Issue

    A RHEL 6.9 client fails to reconnect to the NFS server and server not responding messages are seen. At the TCP level, the second step (the SYN,ACK) of the 3-way TCP handshake is failing with ICMP 102 Destination unreachable (Host administratively prohibited) sent by the NFS client.
    NFS share cannot reconnect due to the 3-way TCP handshake failure. The NFS client erroneously sends multiple SYN packets from the same TCP port but different sequence numbers, the NFS server responds with SYN,ACK to one of the SYNs, but not the others, and this sequence leads to confusion between the NFS client and server's TCP stacks. As a result of the confusion, the NFS client never sends the final ACK and so the NFS share cannot be reconnected, leading to a DoS of the NFS share.
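To make the pattern described above concrete, here is a minimal sketch that flags connections where several SYNs left the same client port with different sequence numbers. It assumes the SYN packets have already been extracted from a capture into simple records; the `Syn` type and the sample addresses are hypothetical and not part of any capture tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Syn(NamedTuple):
    """One observed SYN packet (hypothetical record format)."""
    src: str
    sport: int
    dst: str
    dport: int
    seq: int


def find_conflicting_syns(syns):
    """Map each 4-tuple to its sequence numbers if it saw more than one."""
    seqs = defaultdict(set)
    for s in syns:
        seqs[(s.src, s.sport, s.dst, s.dport)].add(s.seq)
    # A plain retransmission repeats the same sequence number and is not flagged.
    return {conn: sorted(v) for conn, v in seqs.items() if len(v) > 1}


if __name__ == "__main__":
    capture = [
        Syn("192.0.2.50", 875, "192.0.2.11", 2049, 1000),
        Syn("192.0.2.50", 875, "192.0.2.11", 2049, 1000),  # retransmit
        Syn("192.0.2.50", 875, "192.0.2.11", 2049, 2000),  # conflicting SYN
        Syn("192.0.2.50", 876, "192.0.2.11", 2049, 5),
    ]
    for conn, seq_list in find_conflicting_syns(capture).items():
        print(conn, seq_list)
```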

Red Hat Enterprise Linux 6

    A fix for this problem is tracked in the following Red Hat bug:
        Bug 1448170 - RHEL6.9: sunrpc reconnect logic now may trigger a SYN storm when a TCP connection drops and a burst of RPC commands hit the transport
        As of Mon, May 15, 2017, the status of the bug is MODIFIED.
        Red Hat can reproduce this issue, a patch has been submitted internally, and is built for inclusion in the next minor release.

Workarounds

    Any of the following should work around the problem
    1) Disable iptables, even only temporarily. After the connection to the NFS server is established, iptables can be restarted with unchanged rules - however, the issue is likely to reoccur. NOTE: This workaround is only appropriate if iptables is active in the environment.
    2) Boot back to a kernel prior to 2.6.32-696.*el6
    3) Use some approach to ensure the NFS (and lockd) TCP connections do not go idle and drop. One possibility is to use a variation of a script to monitor for hung NFS mount points as described in https://access.redhat.com/solutions/97873.
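One way workaround 3 might be approached is a periodic probe of each mount. The sketch below is not the script from the Red Hat article; the mount path, the interval, and the choice of `stat -f` as the probe are assumptions, and whether a given probe generates enough traffic to keep the TCP connection from idling should be confirmed with a packet capture.

```python
import subprocess
import time


def touch_mount(path, timeout=10.0, probe_cmd=("stat", "-f")):
    """Run a probe against path in a child process; True if it exits 0 in time."""
    proc = subprocess.Popen(
        [*probe_cmd, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort: do not wait for the child, since a probe stuck on a
        # hung mount may not exit promptly.
        proc.kill()
        return False


def keep_warm(paths, interval=60.0):
    """Probe every mount forever, reporting the ones that fail or time out."""
    while True:
        for path in paths:
            if not touch_mount(path):
                print("probe failed or timed out:", path)
        time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical mount point; replace with the real NFS mounts.
    keep_warm(["/mnt/isilon/software"])
```

Running the probe in a child process keeps the monitor itself from blocking on a hung mount, which an in-process `os.statvfs()` call could do.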

