
NFS 4.1 RECLAIM_COMPLETE FS failed error in combination with ESXi client


NAGY Andreas

Mar 2, 2018, 1:58:59 AM

Hi,

I am trying to get a FreeBSD NFS 4.1 export working with VMware ESXi 6.5u1, but it is always mounted as read-only.

After some research, I found out that this is a known problem, and there are threads about it from 2015 in the mailing list archive as well.

Since it seems VMware will not change the behavior of their NFS 4.1 client, I wanted to ask here whether a patch or workaround is available.

Here is the thread at VMware:
https://communities.vmware.com/thread/517788

And here is what I found in the list archive:
https://lists.freebsd.org/pipermail/freebsd-stable/2015-May/082381.html

Thanks,
Andi




NAGY Andreas

Mar 3, 2018, 9:41:00 AM

Hi and thanks!

This is the first time I have needed to use a patch; could you give me brief advice on how to apply it, and for which version?

So far I have made a fresh FreeBSD 11.1-RELEASE install as a VM on an ESXi host, updated the system, and did an svn checkout of http://svn.freebsd.org/base/release/11.1.0/

Then I tried to apply the patch in /usr/src/sys via patch < /tmp/reclaimcom2.patch

Output was:
Hmm... Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- fs/nfsserver/nfs_nfsdserv.c.savrecl 2018-02-10 20:34:31.166445000 -0500
|+++ fs/nfsserver/nfs_nfsdserv.c 2018-02-10 20:36:07.947490000 -0500
--------------------------
Patching file fs/nfsserver/nfs_nfsdserv.c using Plan A...
No such line 4225 in input file, ignoring
Hunk #1 succeeded at 4019 (offset -207 lines).
done

So I think this was not correct, as I also noticed that nfs_nfsdserv.c only has 4102 lines.

andi



-----Original Message-----
From: Rick Macklem [mailto:rmac...@uoguelph.ca]
Sent: Saturday, March 3, 2018 03:01
To: NAGY Andreas <Andrea...@frequentis.com>; freebsd...@freebsd.org
Subject: Re: NFS 4.1 RECLAIM_COMPLETE FS failed error in combination with ESXi client

NAGY Andreas wrote:
>I am trying to get a FreeBSD NFS 4.1 export working with VMware ESXi 6.5u1, but it is always mounted as read-only.
>
>After some research, I found out that this is a known problem, and there are threads about it from 2015 in the mailing list archive as well.
>
>Since it seems VMware will not change the behavior of their NFS 4.1 client, I wanted to ask here whether a patch or workaround is available.
I believe the attached small patch deals with the ReclaimComplete issue.
However, someone else who tested this had additional issues with the mount:
- The client logged a couple of things (that sounded weird to me;-)
- Something about Readdir seeing directories change too much..
- Something about "wrong reason for not issuing a delegation"...
(I don't know what causes either of these, or whether they result in serious breakage of the mount.)
They also ran into a hang when transferring a large file. It sounded to me like something that might be a network interface device driver issue and I suggested they disable TSO, LRO and jumbo frames, but I never heard back from them, so I don't know more about this.
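
(For anyone who wants to try that suggestion, disabling those features on the FreeBSD server side would look roughly like the following; the interface name ix0 is only a placeholder for whatever NIC is actually in use:)

# turn off TCP segmentation offload and large receive offload on the NIC
ifconfig ix0 -tso -lro
# drop back from jumbo frames to the standard MTU
ifconfig ix0 mtu 1500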

So, feel free to test with the attached patch and if you run into problems with the mount, email w.r.t. what they are. If we persevere we might get it going ok.

rick
[stuff snipped]

Rick Macklem

unread,
Mar 4, 2018, 12:55:22 AM3/4/18
to
NAGY Andreas wrote:
>Hi and thanks!
>
>This is the first time I have needed to use a patch; could you give me brief advice on how to apply it, and for which version?
The only difference with kernel versions will be the line#s.
>So far I have made a fresh FreeBSD 11.1-RELEASE install as a VM on an ESXi host, updated the system, and did an svn checkout of http://svn.freebsd.org/base/release/11.1.0/
>
>Then tried to apply the patch in /usr/src/sys via patch < /tmp/reclaimcom2.patch
>
>Output was:
>Hmm... Looks like a unified diff to me...
>The text leading up to this was:
>--------------------------
>|--- fs/nfsserver/nfs_nfsdserv.c.savrecl 2018-02-10 20:34:31.166445000 -0500
>|+++ fs/nfsserver/nfs_nfsdserv.c 2018-02-10 20:36:07.947490000 -0500
>--------------------------
>Patching file fs/nfsserver/nfs_nfsdserv.c using Plan A...
>No such line 4225 in input file, ignoring
>Hunk #1 succeeded at 4019 (offset -207 lines).
>done
Since it says "Hunk #1 succeeded...", I think it patched ok.
However, you can check by looking at nfsrvd_reclaimcomplete() in
sys/fs/nfsserver/nfs_nfsdserv.c.
Before the patch it would look like:

    if (*tl == newnfs_true)
        nd->nd_repstat = NFSERR_NOTSUPP;
    else
        nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);

whereas after being patched, it will look like:

    nd->nd_repstat = nfsrv_checkreclaimcomplete(nd);
    if (*tl == newnfs_true)
        nd->nd_repstat = 0;
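
(To actually use the patched code, the kernel has to be rebuilt and installed. A minimal sequence, assuming the checked-out sources live in /usr/src and the stock GENERIC kernel configuration is used, would be roughly:)

cd /usr/src
make buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now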

NAGY Andreas

Mar 4, 2018, 8:30:41 AM

Thanks, got it working with your patch.

So far I have not seen any issue with the mount. Only in the vmkernel.log there are often entries like the following:
WARNING: NFS41: NFS41ValidateDelegation:608: Server returned improper reason for no delegation: 2

At the moment I have only a single link between the ESXi host and the FreeBSD host, but as soon as I figure out the right way to configure multiple paths for NFS I will do more testing.

I also need to check what can be tuned. I expected that writes to the NFS datastore would be slower than iSCSI, but not as slow as they are now.

andi


-----Original Message-----
From: Rick Macklem [mailto:rmac...@uoguelph.ca]
Sent: Sunday, March 4, 2018 06:48
To: NAGY Andreas <Andrea...@frequentis.com>; freebsd...@freebsd.org
Subject: Re: NFS 4.1 RECLAIM_COMPLETE FS failed error in combination with ESXi client

NAGY Andreas

Mar 4, 2018, 1:26:50 PM

Okay, the slow writes were not an NFS problem; the HW RAID controller had switched to write-through because of a broken battery.

In the source I saw nfs_async = 0; is it right that NFS will work in async mode if I compile the kernel with nfs_async = 1?

I know the risk of running it async, but is it not the same risk as having the datastore connected via iSCSI, which by default is also not sync?

Over the last few weeks I tested the following setup:
Two FreeBSD hosts with a more or less good HW RAID controller in a HAST cluster, providing a datastore to two ESXi hosts via iSCSI.
This setup worked quite well, but I now want to switch to NFS and hope to get equivalent speeds.

Thanks so far,
andi

NAGY Andreas

Mar 5, 2018, 8:26:34 AM

Thanks, I am currently compiling with both patches.

I am now trying to get NFS 4.1 multipathing working. I now have two connections on different subnets between the ESXi host and the FreeBSD host, with exports for the same mount point on both subnets.
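
(For anyone trying the same thing, an /etc/exports along these lines should express that kind of setup; the path and subnets below are only placeholders, not my actual values:)

V4: / -sec=sys
/tank/nfsds1 -network 10.0.0.0 -mask 255.255.255.0
/tank/nfsds1 -network 10.0.1.0 -mask 255.255.255.0
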
Now I get the following errors in the vmkernel.log:
2018-03-05T13:06:07.488Z cpu10:66503)WARNING: NFS41: NFS41_Bug:2361: BUG - Invalid BIND_CONN_TO_SESSION error: NFS4ERR_NOTSUPP

Is session trunking available in the FreeBSD NFS 4.1 implementation?

Br,
andi

-----Original Message-----
From: Rick Macklem [mailto:rmac...@uoguelph.ca]
Sent: Monday, March 5, 2018 02:16
To: NAGY Andreas <Andrea...@frequentis.com>; 'freebsd...@freebsd.org' <freebsd...@freebsd.org>
Subject: Re: NFS 4.1 RECLAIM_COMPLETE FS failed error in combination with ESXi client

NAGY Andreas wrote:
[stuff snipped]
>In the source I saw nfs_async = 0; is it right that NFS will work in async mode if I compile the kernel with nfs_async = 1?
>
>I know the risk of running it async, but is it not the same risk as having the datastore connected via iSCSI, which by default is also not sync?
If you want to use it, you can just set the sysctl vfs.nfsd.async=1.
- If the server crashes/reboots you can lose data. Also, after the reboot, the
client will only see a temporarily unresponsive server and will not have any
indication of data loss.
(I am not familiar with iSCSI, so I can't comment on how safe that is.)
- If you are using ZFS, there is also a ZFS config (sync=disabled). I'm not a ZFS
guy, so I don't know anything more, but I'm sure others reading this list can
tell you how to set it.
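
(As a sketch of both options; the ZFS dataset name is only a placeholder, and either setting puts recently written data at risk if the server crashes:)

# NFS server async writes, effective immediately
sysctl vfs.nfsd.async=1
# add vfs.nfsd.async=1 to /etc/sysctl.conf to keep it across reboots

# ZFS equivalent, per dataset
zfs set sync=disabled tank/nfsds1
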
[more stuff snipped]
>So far I have not seen any issue with the mount. Only in the vmkernel.log there are often entries like the following:
>WARNING: NFS41: NFS41ValidateDelegation:608: Server returned improper reason for no delegation: 2
The attached patch *might* get rid of these, although I don't think it matters much, since it is just complaining about the "reason" the server returns for not issuing a delegation (issuing delegations is entirely at the discretion of the server and is disabled by default).
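
(For completeness: on the FreeBSD server, issuing delegations is controlled by a sysctl, so you can check or experiment with it; this is only a pointer, not a recommendation to turn it on:)

sysctl vfs.nfsd.issue_delegations      # 0 = server never issues delegations (default)
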
[more stuff snipped]
Good luck with it, rick

NAGY Andreas

Mar 5, 2018, 8:38:13 AM

Compiling with the last patch also failed:

error: use of undeclared identifier 'NFSV4OPEN_WDSUPPFTYPE'



-----Original Message-----
From: NAGY Andreas

Rick Macklem

Mar 5, 2018, 5:53:44 PM

Nope, that isn't supported, rick
(Hope no one is too upset by a top post.)

________________________________________
From: NAGY Andreas <Andrea...@frequentis.com>
Sent: Monday, March 5, 2018 8:22:10 AM
To: Rick Macklem; 'freebsd...@freebsd.org'
Subject: RE: NFS 4.1 RECLAIM_COMPLETE FS failed error in combination with ESXi client

NAGY Andreas

Mar 6, 2018, 1:18:28 PM

Okay, that was the main reason for using NFS 4.1.
Is it planned to implement it, or is the focus on pNFS?

Thanks,
Andi


________________________________
From: Rick Macklem <rmac...@uoguelph.ca>
Sent: March 5, 2018, 11:49 PM
To: NAGY Andreas; 'freebsd...@freebsd.org'
Subject: Re: NFS 4.1 RECLAIM_COMPLETE FS failed error in combination with ESXi client

Rick Macklem

Mar 6, 2018, 5:58:08 PM

NAGY Andreas wrote:
>Okay, that was the main reason for using NFS 4.1.
>Is it planned to implement it, or is the focus on pNFS?
Do the VMware people claim that this improves performance?
(I know nothing about the world of VMs, but for real hardware
I can't see any advantage of having more than one TCP connection?
As far as I know, the Linux client never tries to acquire a second TCP
connection. I would have assumed trunking would be handled below
TCP.)

This is the first client that I am aware of (and just yesterday when you pointed
it out) that uses BIND_CONN_TO_SESSION for an additional TCP connection.
(Up until now I was only aware of it being used for RDMA setups and I have no
hardware to play with such things.)

If the VMware folk claim it does improve performance, I might get around to
it someday, although you are correct that I am working on pNFS support for the
server right now.

rick
[stuff snipped]

Rick Macklem

Mar 7, 2018, 10:11:09 AM

NAGY Andreas wrote:
>Okay, that was the main reason for using NFS 4.1.
>Is it planned to implement it, or is the focus on pNFS?
I took a quick look and implementing this for some cases will be pretty
easy. Binding a FORE channel is implied, so for that case all the server
does is reply OK to the BIND_CONN_TO_SESSION.

To know if the ESXi client case is a simple one, I need to see what the
BIND_CONN_TO_SESSION arguments look like.
If you can capture packets for when this second connection is done and
email it to me as an attachment, I can look at what the BIND_CONN_TO_SESSION args are.
# tcpdump -s 0 -w <file.pcap> host <client-host-for-this-connection>
run on the FreeBSD server should get the <file.pcap> I need.

Alternately, if you have wireshark handy, you can just use it to look
for the BIND_CONN_TO_SESSION request and see if it specifies
(FORE, BACK, FORE_OR_BOTH or BACK_OR_BOTH) in it.
FORE or FORE_OR_BOTH means it is easy to do and I can probably have
a patch for testing in a day or two.
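
(If you have the command-line tools handy, something like this should pull those requests out of the capture; this assumes Wireshark's NFSv4 dissector exposes the operation number as nfs.opcode, and that BIND_CONN_TO_SESSION is operation 41:)

tshark -r file.pcap -Y 'nfs.opcode == 41' -V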

NAGY Andreas

Mar 8, 2018, 9:39:46 AM

Thank you, it is really great how fast you adapt the source and make patches for this. I saw so many posts where people did not get NFS 4.1 working with ESXi and FreeBSD, and now I already have it running with your changes.

I have now compiled the kernel with all 4 patches, and it works now.

Some problems are still left:

- the "Server returned improper reason for no delegation: 2" warnings are still in the vmkernel.log.
2018-03-08T11:41:20.290Z cpu0:68011 opID=488969b0)WARNING: NFS41: NFS41ValidateDelegation:608: Server returned improper reason for no delegation: 2

- can't delete a folder with the VMware host client datastore browser:
2018-03-08T11:34:00.349Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.349Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.349Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.350Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.350Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.350Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.351Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.351Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.351Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.351Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.352Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
2018-03-08T11:34:00.352Z cpu1:67981 opID=f5159ce3)WARNING: UserFile: 2155: hostd-worker: Directory changing too often to perform readdir operation (11 retries), returning busy

- after a reboot of the FreeBSD machine, ESXi does not restore the NFS datastore again; it logs the following warning (just disconnecting the links is fine):
2018-03-08T12:39:44.602Z cpu23:66484)WARNING: NFS41: NFS41_Bug:2361: BUG - Invalid BIND_CONN_TO_SESSION error: NFS4ERR_NOTSUPP

So far I have only made some quick benchmarks with ATTO in a Windows VM that has a vmdk on the NFS 4.1 datastore, which is mounted over two 1Gb links in different subnets.
Read is nearly double that of a single connection, and write is just a bit faster. I don't know if write speed could be improved; currently the share is UFS on a HW RAID controller that has local write speeds of about 500MB/s.

At the following link is the vmkernel.log from mounting the NFS share, attaching a vmdk from the share to a Windows VM, running the ATTO benchmark on it, disconnecting/reconnecting the network, and also the problem with the BIND_CONN_TO_SESSION error: NFS4ERR_NOTSUPP after the reboot.
Up to the reboot I also made a trace on one of the two links (nfs41_trace_before_reboot.pcap and nfs41_trace_after_reboot.pcap).

https://files.fm/u/wvybmdmc

andi

-----Original Message-----
From: Rick Macklem [mailto:rmac...@uoguelph.ca]
Sent: Thursday, March 8, 2018 03:48
To: NAGY Andreas <Andrea...@frequentis.com>; 'freebsd...@freebsd.org' <freebsd...@freebsd.org>
Subject: Re: NFS 4.1 RECLAIM_COMPLETE FS failed error in combination with ESXi client

NAGY Andreas wrote:
>Attached is the trace. If I see it correctly, it uses FORE_OR_BOTH.
>(bctsa_dir: CDFC4_FORE_OR_BOTH (0x00000003))
Yes. The scary part is the ExchangeID before the BindConnectiontoSession.
(Normally that is only done at the beginning of a new mount to get a ClientID, followed immediately by a CreateSession. I don't know why it would do this?)

The attached patch might get BindConnectiontoSession to work. I have no way to test it beyond seeing it compile. Hopefully it will apply cleanly.

>The trace is only with the first patch; I have not compiled the wantdeleg patches so far.
That's fine. I don't think that matters much.

>I think this is related to the BIND_CONN_TO_SESSION; after a disconnect the ESXi cannot connect to the NFS server either, with this warning:
>2018-03-07T16:55:11.227Z cpu21:66484)WARNING: NFS41: NFS41_Bug:2361: BUG - Invalid BIND_CONN_TO_SESSION error: NFS4ERR_NOTSUPP
If the attached patch works, you'll find out what it fixes.

>Another thing I noticed today is that it is not possible to delete a folder with the ESXi datastore browser on the NFS mount. Maybe it is a VMware bug, but with NFSv3 it works.
>
>Here is the vmkernel.log with only one connection; it contains mounting, trying to delete a folder, and a disconnect:
>
>2018-03-07T16:46:04.543Z cpu12:68008 opID=55bea165)World: 12235: VC opID c55dbe59 maps to vmkernel opID 55bea165
>2018-03-07T16:46:04.543Z cpu12:68008 opID=55bea165)NFS41: NFS41_VSIMountSet:423: Mount server: 10.0.0.225, port: 2049, path: /, label: nfsds1, security: 1 user: , options: <none>
>2018-03-07T16:46:04.543Z cpu12:68008 opID=55bea165)StorageApdHandler: 977: APD Handle Created with lock[StorageApd-0x43046e4c6d70]
>2018-03-07T16:46:04.544Z cpu11:66486)NFS41: NFS41ProcessClusterProbeResult:3873: Reclaiming state, cluster 0x43046e4c7ee0 [7]
>2018-03-07T16:46:04.545Z cpu12:68008 opID=55bea165)NFS41: NFS41FSCompleteMount:3791: Lease time: 120
>2018-03-07T16:46:04.545Z cpu12:68008 opID=55bea165)NFS41: NFS41FSCompleteMount:3792: Max read xfer size: 0x20000
>2018-03-07T16:46:04.545Z cpu12:68008 opID=55bea165)NFS41: NFS41FSCompleteMount:3793: Max write xfer size: 0x20000
>2018-03-07T16:46:04.545Z cpu12:68008 opID=55bea165)NFS41: NFS41FSCompleteMount:3794: Max file size: 0x800000000000
>2018-03-07T16:46:04.545Z cpu12:68008 opID=55bea165)NFS41: NFS41FSCompleteMount:3795: Max file name: 255
>2018-03-07T16:46:04.545Z cpu12:68008 opID=55bea165)WARNING: NFS41: NFS41FSCompleteMount:3800: The max file name size (255) of file system is larger than that of FSS (128)
>2018-03-07T16:46:04.546Z cpu12:68008 opID=55bea165)NFS41: NFS41FSAPDNotify:5960: Restored connection to the server 10.0.0.225 mount point nfsds1, mounted as 1a7893c8-eec764a7-0000-000000000000 ("/")
>2018-03-07T16:46:04.546Z cpu12:68008 opID=55bea165)NFS41: NFS41_VSIMountSet:435: nfsds1 mounted successfully
>2018-03-07T16:47:19.869Z cpu21:67981 opID=e47706ec)World: 12235: VC opID c55dbe91 maps to vmkernel opID e47706ec
>2018-03-07T16:47:19.869Z cpu21:67981 opID=e47706ec)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4c6
I have no idea if getting BindConnectiontoSession working will fix this or not?

Rick Macklem

Mar 8, 2018, 5:58:34 PM

NAGY Andreas wrote:
>Thank you, it is really great how fast you adapt the source and make patches for this. I saw so many posts where people did not get NFS 4.1 working with ESXi and FreeBSD, and now I already have it running with your changes.
>
>I have now compiled the kernel with all 4 patches, and it works now.
Ok. Sounds like we are making progress. It also takes someone willing to test patches, so
thanks for doing so.
>Some problems are still left:
>
>- the "Server returned improper reason for no delegation: 2" warnings are still in the >vmkernel.log.
> 2018-03-08T11:41:20.290Z cpu0:68011 opID=488969b0)WARNING: NFS41: >NFS41ValidateDelegation:608: Server returned improper reason for no delegation: 2
I'll take another look and see if I can guess why it doesn't like "2" as a reason for not
issuing a delegation. (As noted before, I don't think this is serious, but???)

>- can't delete a folder with the VMware host client datastore browser:
> 2018-03-08T11:34:00.349Z cpu1:67981 opID=f5159ce3)WARNING: NFS41: NFS41FileOpReaddir:4728: Failed to process READDIR result for fh 0x43046e4cb158: Transient file system condition, suggest retry
[more of these snipped]
> 2018-03-08T11:34:00.352Z cpu1:67981 opID=f5159ce3)WARNING: UserFile: 2155: hostd-worker: Directory changing too often to perform readdir operation (11 retries), returning busy
This one is a mystery to me. It seemed to be upset that the directory is changing (I
assume either the Change or ModifyTime attributes). However, if entries are being
deleted, the directory is changing and, as far as I know, the Change and ModifyTime
attributes are supposed to change.
I might try posting on nf...@ietf.org in case somebody involved with this client reads
that list and can explain what this is?

>- after a reboot of the FreeBSD machine, ESXi does not restore the NFS datastore again; it logs the following warning (just disconnecting the links is fine):
> 2018-03-08T12:39:44.602Z cpu23:66484)WARNING: NFS41: NFS41_Bug:2361: BUG - Invalid BIND_CONN_TO_SESSION error: NFS4ERR_NOTSUPP
Hmm. Normally after a server reboot, the clients will try some RPC that starts with a
Sequence (the session op) and the server will reply NFS4ERR_BAD_SESSION.
This triggers recovery in the client.
The BindConnectiontoSession operation is done in an RPC by itself, so there is no
Sequence op to trigger NFS4ERR_BAD_SESSION.
Maybe this client expects to see NFS4ERR_BAD_SESSION for the BindConnectiontoSession.
I'll post a patch that modifies the BindConnectiontoSession to do that.

>So far I have only made some quick benchmarks with ATTO in a Windows VM that has a vmdk on the NFS 4.1 datastore, which is mounted over two 1Gb links in different subnets.
>Read is nearly double that of a single connection, and write is just a bit faster. I don't know if write speed could be improved; currently the share is UFS on a HW RAID controller that has local write speeds of about 500MB/s.
Yes, before I posted that I didn't understand why multiple TCP links would be faster.
I didn't notice at the time that you mentioned using different subnets and, as such,
links couldn't be trunked below TCP. In your case trunking above TCP makes sense.

Getting slower write rates than read rates from NFS is normal.
Did you try "sysctl vfs.nfsd.async=1"?
The other thing that might help for UFS is increasing the size of the buffer cache.
(If this server is mainly an NFS server you could probably make the buffer cache
greater than half of the machine's ram.
Note to others, since ZFS doesn't use the buffer cache, the opposite is true for
ZFS.)
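
(As a very rough sketch of that idea; kern.nbuf is a boot-time loader tunable, and the value below is only an illustration that would have to be sized against the machine's RAM, not a recommendation:)

# /boot/loader.conf
kern.nbuf="200000"

# after a reboot, compare the limit against current usage:
sysctl vfs.maxbufspace vfs.bufspace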

NAGY Andreas

Mar 9, 2018, 10:30:57 AM

>The attached patch changes BindConnectiontoSession to reply NFS4ERR_BAD_SESSION when the session doesn't exist. This might trigger recovery after a server reboot.
>This patch must be applied after bindconn.patch.

Works perfectly! The ESXi host reconnects to the datastore as soon as the NFS server is available.

>I took a quick look at your packet trace and it appears that this client does all writes FILESYNC. As such, setting vfs.nfsd.async=1 won't have any effect.
>If you apply the attached patch, it should change the FILESYNC->UNSTABLE so that vfs.nfsd.async=1 will make a difference. Again, doing this does put data at risk when the server crashes.

Yes, ESXi writes everything sync. With the patch + vfs.nfsd.async=1 (only for testing) writes are a bit faster.
I have not tuned anything in the fs settings so far (just formatted it as standard UFS); compared to multipath iSCSI with VMFS, reads are a little bit faster, but writes are a bit slower.
That is just what I see from a simple ATTO benchmark within a Windows VM; I have not done any detailed IOPS benchmarks.
The RAID controller in these machines does not support IT mode, but I think I will still use ZFS, only as the filesystem on top of a single HW RAID disk. I must check what the best settings are for HW RAID + ZFS for NFS.
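
(Something along these lines is probably where I will start; the pool/dataset name and the values below are only placeholders to experiment with, not recommendations:)

zfs create tank/nfsds1
zfs set compression=lz4 tank/nfsds1
zfs set atime=off tank/nfsds1
zfs set recordsize=64K tank/nfsds1   # smaller records can suit VM disk images
# same trade-off as vfs.nfsd.async: faster writes, data at risk on a crash
# zfs set sync=disabled tank/nfsds1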

>This one is a mystery to me. It seemed to be upset that the directory is changing (I assume either the Change or ModifyTime attributes). However, if entries are being deleted, the directory is changing and, as far as I know, the Change and ModifyTime attributes are supposed to change.
>I might try posting on nf...@ietf.org in case somebody involved with this client reads that list and can explain what this is?

Maybe it is really just a bug in the VMware integrated host client browser. Deleting folders on the mounted datastore in the ESXi shell is no problem and also does not generate any warnings in the vmkernel.log.

>I'll take another look and see if I can guess why it doesn't like "2" as a reason for not issuing a delegation. (As noted before, I don't think this is serious, but???)

These warnings are still there, but don't seem to have any impact. It looks like they only appear when files are created or modified on the datastore from the datastore browser or from the shell; I have not seen these warnings when working in a VM on a virtual disk that is stored on the NFS datastore.

>Yes, before I posted that I didn't understand why multiple TCP links would be faster.
>I didn't notice at the time that you mentioned using different subnets and, as such, links couldn't be trunked below TCP. In your case trunking above TCP makes sense.

Yes, I am currently working on a lab environment/test system. I often get servers as leftovers from projects, and there are also plenty of Cisco 1Gb switches, but 10Gb ones are rare. I already did some tests with iSCSI multipathing, but I prefer NFS, and now with this I get the same speed.
I had never seen a working setup with multiple paths for NFS, so I am really happy that you got this working in such a short time.

andi

NAGY Andreas

Mar 10, 2018, 8:14:37 AM

Thanks, the warnings about not issuing a delegation disappeared with this patch.

But now there are some new warnings I haven't seen so far:
2018-03-10T13:01:39.441Z cpu8:68046)WARNING: NFS41: NFS41FSOpGetObject:2148: Failed to get object 0x43910e71b386 [36 c6b10167 9b157f95 5aa100fb 8ffcf2c1 c 2 9f22ad6d 0 0 0 0 0]: Stale file handle

These only appear several times after the NFS share is mounted, or remounted after a connection loss.
Everything works fine, but I had not seen them until I applied the last patch.

Rick Macklem

Mar 10, 2018, 5:25:06 PM

NAGY Andreas wrote:
>Thanks, the warnings about not issuing a delegation disappeared with this patch.
>
>But now there are some new warnings I haven't seen so far:
>2018-03-10T13:01:39.441Z cpu8:68046)WARNING: NFS41: NFS41FSOpGetObject:2148: Failed to get object 0x43910e71b386 [36 c6b10167 9b157f95 5aa100fb 8ffcf2c1 c 2 9f22ad6d 0 0 0 0 0]: Stale file handle
I doubt these would be related to the patch. A stale FH means that the client tried to
access a file via its FH after it was removed. (Normally this is a client bug, but hopefully
not one that will cause grief.)
>These only appear several times after the NFS share is mounted, or remounted after a connection loss.
>Everything works fine, but I had not seen them until I applied the last patch.
>
>andi
Ok. Thanks for testing all of these patches. I will probably get cleaned up versions of
them committed in April.

The main outstanding issue is the Readdir one about directory changing too much.
Hopefully I can find out something about it via email.

Have fun with it, rick

NAGY Andreas

Mar 11, 2018, 1:52:17 AM

Thanks! Please keep me updated if you find out more or when an updated version is available.

As I now know it is working, I will start tomorrow to build up a test system with 3 NFS servers (two of them in an HA setup with CARP and HAST) and several ESXi hosts, which will all access their NFS datastores over 4 uplinks with NICs on different subnets.
It should always be possible to do some testing there.

andi


________________________________
From: Rick Macklem <rmac...@uoguelph.ca>
Sent: March 10, 2018, 11:20 PM
To: NAGY Andreas; 'freebsd...@freebsd.org'
Subject: Re: NFS 4.1 RECLAIM_COMPLETE FS failed error in combination with ESXi client

Rick Macklem

Mar 11, 2018, 5:18:12 PM

NAGY Andreas wrote:
>Thanks! Please keep me updated if you find out more or when an updated version is available.
Will try to remember to do so.

>As I now know it is working, I will start tomorrow to build up a test system with 3 NFS servers (two of them in an HA setup with CARP and HAST) and several ESXi hosts, which will all access their NFS datastores over 4 uplinks with NICs on different subnets.
>It should always be possible to do some testing there.
Have fun. That's way beyond anything I do for NFS testing.

Btw, unless you don't want me to, I will list you as "Tested by:" on the commits.
(If you don't want me to do this, just email.)

Good luck with the testing, rick