> We still have some issues on nodes with only gateway-less subnets, and a nasty one regarding NFS
> locks. The impact of this last issue is quite heavy for some applications. It is still under
> investigation by EMC.
The flexnet reloading issue was not the root cause of our "NFS server not responding" deluge; the
problem happened again.
Further investigation revealed that some of our nodes have no gateway and therefore cannot reach our
DNS servers. We have been running in this configuration since day 1 (4.5 years ago).
From what I understand, every node in an Isilon cluster refreshes the NFS exports every once in a
while. When an export has an FQDN in its client list, a name resolution is performed. The kernel
implementation of NFS didn't seem to care if the resolution failed. The userland implementation,
however, seems to need it to complete: otherwise NFS threads stay busy waiting for the resolution,
hence unavailable to handle NFS traffic.
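As I understand it, the failure mode is a fixed worker pool starved by resolutions that never
return. Here is a minimal simulation of that idea; everything in it (names, pool size, queue shape)
is made up for illustration, nothing Isilon-specific:

```python
import queue
import threading
import time

# Toy model of the failure mode: a small fixed pool of "nfsd" worker
# threads that must resolve an FQDN during an export refresh. On a node
# whose resolver never answers, every worker ends up stuck waiting,
# and real NFS traffic is queued but never served.

def resolve(fqdn):
    # Simulated resolver on a gateway-less node: the answer never comes.
    threading.Event().wait()  # blocks forever

served = []
work = queue.Queue()

def nfsd_worker():
    while True:
        kind, payload = work.get()
        if kind == "export-refresh":
            resolve(payload)          # worker is now stuck for good
        else:
            served.append(payload)    # normal NFS request handled

for _ in range(2):                    # tiny pool: two worker threads
    threading.Thread(target=nfsd_worker, daemon=True).start()

# Two export refreshes with FQDN clients consume both workers...
work.put(("export-refresh", "client1.example.com"))
work.put(("export-refresh", "client2.example.com"))
time.sleep(0.2)
# ...so this request sits in the queue: "NFS server not responding".
work.put(("nfs-request", "READ /home/somefile"))
time.sleep(0.2)
print(served)  # -> []
```

With resolutions that do complete, `served` would contain the request; with stuck resolutions and a
pool no larger than the number of FQDN exports, nothing is ever served.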
Name resolution is handled locally on each node (see /etc/resolv.conf) by the isi_cbind_d resolver
before falling back to the servers defined in the groupnet configuration. While digging through a
network trace, we saw that isi_cbind_d asks some InfiniBand node peers for an answer (we guessed
this, because the traffic was nothing Wireshark would recognize; EMC support later confirmed it).
If the request was sent to a node without a gateway, isi_cbind_d never received an answer.
It has been identified as a bug and will be corrected. In the meantime, we changed the configuration
of our nodes so they can reach our DNS servers.
Again, I'm not 100% certain of the exact behaviour of all that, but I think the big picture is correct.
"""
It's not DNS
There's no way it's DNS
It was DNS
"""
We still have other ongoing NFS issues regarding network link aggregation. A workaround suggested by
support is to implement what we call the "cron of shame", running every 4 minutes (flock
/mnt/nlm/lckfile -c "sleep 10") to keep the connection open.
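For reference, the workaround boils down to a crontab entry like this one (the lock-file path is the
one from the command above; a sketch of our setup, not an official EMC recommendation):

```
# "cron of shame": every 4 minutes, take an NLM lock on a dummy file
# over the NFS mount and hold it for 10 seconds, so the lock traffic
# keeps the connection from going idle.
*/4 * * * * flock /mnt/nlm/lckfile -c "sleep 10"
```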
Jean-Baptiste