
Terrible NFS4 performance: FreeBSD 9.1 + ZFS + AWS EC2


Berend de Boer

Jul 7, 2013, 6:54:09 PM
Hi All,

I've just completed a round of NFS testing on FreeBSD 9.1 on AWS. The
underlying file system is ZFS. I have a real nfs killer test: doing
an "svn update" of a directory of 3541 files.

Performance on the NFS server itself is good: checking this out takes
11 seconds (doing the same on a not really comparable Linux NFS server
is 43 seconds).

Doing this on a client writing to an NFS-mounted home directory is,
however, terrible. Really terrible. This takes 25 minutes! With
"sync=disabled" it's 16 minutes.

(doing this against the underpowered Linux NFS server is about 4.5 minutes).
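
In other words, the test boils down to something like the following
(the path is just an example):

  cd /nfs/home/working-copy && time svn update

run once locally on the server and once against the NFS-mounted copy
from the client.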

The problem might be that the NFS server (nfsd) runs at 70-80% CPU. So
my writing speed is cpu bound (go figure).

Repeating this with nfs4 + udp: doesn't work at all, get input/output
error after a few files.

nfs3: 1m55s (nfsd cpu is in the 6%-8% range)

Varying tcp/udp, lock/nolock, sync/async, or enabling zfs
sync=standard doesn't make any particular difference. All timings were
within the 1m49s - 2m3s range.

So what's up with NFS4 on FreeBSD?

--
All the best,

Berend de Boer


------------------------------------------------------
Awesome Drupal hosting: https://www.xplainhosting.com/

Rick Macklem

Jul 7, 2013, 8:19:19 PM
Berend de Boer wrote:
> Hi All,
>
> I've just completed a round of NFS testing on FreeBSD 9.1 on AWS. The
> underlying file system is ZFS. I have a real nfs killer test: doing
> an "svn update" of a directory of 3541 files.
>
> Performance on the NFS server itself is good: checking this out takes
> 11 seconds (doing the same on a not really comparable Linux NFS
> server
> is 43 seconds).
>
> Doing this on a client writing to an NFS mounted home directory is
> however terrible. Really terrible. This takes 25 minutes! With
> "sync=disable" it's 16 minutes.
>
> (doing this against the underpowered Linux NFS server is about 4.5
> minutes).
>
> The problem might be that the NFS server (nfsd) runs at 70-80% CPU.
> So
> my writing speed is cpu bound (go figure).
>
See below w.r.t. a patch that reduces cpu overheads in the DRC (the duplicate request cache).

> Repeating this with nfs4 + udp: doesn't work at all, get input/output
> error after a few files.
>
The RFCs for NFSv4 require use of transport protocols that include
congestion control --> no UDP support. What client are you using?
(If it is a FreeBSD one, I need to patch it to make the mount fail.
Since NFSv4 over UDP was done in early testing, the client might
still have that bogus code in it.)

> nfs3: 1m55s (nfsd cpu is in the 6%-8% range)
>
> Varying tcp/udp, lock/nolock, sync/async, or enabling zfs
> sync=standard doesn't make any particular difference. All timing were
> within the 1m49s - 2m3s range.
>
> So what's up with NFS4 on FreeBSD?
>
Please try this patch:
http://people.freebsd.org/rmacklem/~drc4.patch
- After you apply the patch and boot the rebuilt kernel, the cpu
overheads should be reduced after you increase the value of
vfs.nfsd.tcphighwater. The larger you make it, the more space
(mbuf clusters and other kernel malloc'd data structures) the DRC
uses, but hopefully with reduced CPU overheads.
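Roughly, the steps are (assuming a source tree under /usr/src and the
GENERIC kernel config; adjust for your setup):

  # cd /usr/src/sys
  # patch -p0 < /path/to/drc4.patch
  # cd /usr/src
  # make buildkernel KERNCONF=GENERIC
  # make installkernel KERNCONF=GENERIC
  # shutdown -r now
  (after reboot)
  # sysctl vfs.nfsd.tcphighwater=5000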
The plan is to commit a patch semantically similar to this to head and
then MFC it someday. ivoras@ has a similar patch, but written in cleaner
C. However, I've never gotten around to combining the patches into a
version for head. Someday I'll get to it, but not in time for 9.2.

Good luck with it, rick
ps: There is also ken@'s file handle affinity patch, which is in head
(and I think stable/9), but it only works for NFSv3 at this point.
Hopefully we'll come up with a patch for NFSv4 for it someday.

> --
> All the best,
>
> Berend de Boer
>
>
> ------------------------------------------------------
> Awesome Drupal hosting: https://www.xplainhosting.com/
>

Berend de Boer

Jul 7, 2013, 9:19:14 PM
>>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:

>> Repeating this with nfs4 + udp: doesn't work at all, get
>> input/output error after a few files.
>>
Rick> The RFCs for NFSv4 require use of transport protocols that
Rick> include congestion control --> no UDP support. What client
Rick> are you using?

Happened to be Ubuntu 10.04 LTS.


>>
Rick> Please try this patch:
Rick> http://people.freebsd.org/rmacklem/~drc4.patch - After you
Rick> apply the patch and boot the rebuilt kernel,

Ah, a kernel rebuild. The problem is I'm on Amazon, and I doubt I can
rebuild the kernel that easily. The patches to make that work only
recently landed.

I'll need to ask Colin Percival how I can safely do that.

Berend de Boer

Jul 8, 2013, 10:20:05 PM
>>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:


Rick> After you apply the patch and boot the rebuilt kernel, the
Rick> cpu overheads should be reduced after you increase the value
Rick> of vfs.nfsd.tcphighwater.

Have set it to 10,000, max cpu for nfsd I've seen is below 6%. Makes
no real difference whatsoever to the great slowness of nfs4 in this
use-case.

I.e. I did two tests: 17.5 minutes with sync=disabled, 21.5 minutes with
sync=enabled, but the difference in this case could simply be due to
whatever else was going on at the time.

FYI, in the nfs3 mount nfsd is at 0% at all times, basically uses no
cpu whatsoever.

The weird thing is that nfs3 performance seems to have been greatly
affected: the same test which ran at 2 minutes on udp is now between 7-11
minutes.

As this could be a problem with how I'm testing now (I recompiled the
kernel), I'll try to see what numbers I get when I undo the patch and
test against a kernel recompiled without it.

Rick Macklem

Jul 8, 2013, 9:41:01 PM
Berend de Boer wrote:
> >>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:
>
>
> Rick> After you apply the patch and boot the rebuilt kernel, the
> Rick> cpu overheads should be reduced after you increase the
> value
> Rick> of vfs.nfsd.tcphighwater.
>
> Do I need to umount the share? Restart something after changing this
> value?
>
I think you can safely change it "on the fly". It simply defines how
large the DRC cache can grow before the nfsd thread will try to trim
it down. The default of 0 means "trim every RPC", which keeps it at a
minimal size, but can result in significant CPU overhead.
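
For example, on the running server something like:

  # sysctl vfs.nfsd.tcphighwater=5000

should take effect right away; no nfsd restart or client remount needed.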

rick

> --
> All the best,
>
> Berend de Boer
>
>
> ------------------------------------------------------
> Awesome Drupal hosting: https://www.xplainhosting.com/
>
>

Berend de Boer

Jul 8, 2013, 9:04:37 PM
>>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:


Rick> After you apply the patch and boot the rebuilt kernel, the
Rick> cpu overheads should be reduced after you increase the value
Rick> of vfs.nfsd.tcphighwater.

Do I need to umount the share? Restart something after changing this value?

Berend de Boer

Jul 8, 2013, 9:02:18 PM
>>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:

Rick> After you apply the patch and boot the rebuilt kernel, the
Rick> cpu overheads should be reduced after you increase the value
Rick> of vfs.nfsd.tcphighwater.

What number would I be looking at? 100? 100,000?

Berend de Boer

Jul 8, 2013, 7:27:12 PM
>>>>> "Outback" == Outback Dingo <outbac...@gmail.com> writes:

Outback> yupp just hit that also, whats the fix ???

Just add a line with:

int i;

at the beginning of the function, below the other variable
declarations. Then everything compiles.

Garrett Wollman

Jul 8, 2013, 10:24:36 PM
<<On Mon, 8 Jul 2013 21:43:52 -0400 (EDT), Rick Macklem <rmac...@uoguelph.ca> said:

> Berend de Boer wrote:
>> >>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:
>>
Rick> After you apply the patch and boot the rebuilt kernel, the
Rick> cpu overheads should be reduced after you increase the
>> value
Rick> of vfs.nfsd.tcphighwater.
>>
>> What number would I be looking at? 100? 100,000?
>>
> Garrett Wollman might have more insight into this, but I would say on
> the order of 100s to maybe 1000s.

On my production servers, I'm running with the following tuning
(after Rick's drc4.patch):

----loader.conf----
kern.ipc.nmbclusters="1048576"
vfs.zfs.scrub_limit="16"
vfs.zfs.vdev.max_pending="24"
vfs.zfs.arc_max="48G"
#
# Tunable per mps(4). We had significant numbers of allocation failures
# with the default value of 2048, so bump it up and see whether there's
# still an issue.
#
hw.mps.max_chains="4096"
#
# Simulate the 10-CURRENT autotuning of maxusers based on available memory
#
kern.maxusers="8509"
#
# Attempt to make the message buffer big enough to retain all the crap
# that gets spewed on the console when we boot. 64K (the default) isn't
# enough to even list all of the disks.
#
kern.msgbufsize="262144"
#
# Tell the TCP implementation to use the specialized, faster but possibly
# fragile implementation of soreceive. NFS calls soreceive() a lot and
# using this implementation, if it works, should improve performance
# significantly.
#
net.inet.tcp.soreceive_stream="1"
#
# Six queues per interface means twelve queues total
# on this hardware, which is a good match for the number
# of processor cores we have.
#
hw.ixgbe.num_queues="6"

----sysctl.conf----
# Make sure that device interrupts are not throttled (10GbE can make
# lots and lots of interrupts).
hw.intr_storm_threshold=12000

# If the NFS replay cache isn't larger than the number of operations nfsd
# can perform in a second, the nfsd service threads will spend all of their
# time contending for the mutex that protects the cache data structure so
# that they can trim them. If the cache is big enough, it will only do this
# once a second.
vfs.nfsd.tcpcachetimeo=300
vfs.nfsd.tcphighwater=150000

----modules/nfs/server/freebsd.pp----
exec {'sysctl vfs.nfsd.minthreads':
  command => "sysctl vfs.nfsd.minthreads=${min_threads}",
  onlyif  => "test $(sysctl -n vfs.nfsd.minthreads) -ne ${min_threads}",
  require => Service['nfsd'],
}

exec {'sysctl vfs.nfsd.maxthreads':
  command => "sysctl vfs.nfsd.maxthreads=${max_threads}",
  onlyif  => "test $(sysctl -n vfs.nfsd.maxthreads) -ne ${max_threads}",
  require => Service['nfsd'],
}

($min_threads and $max_threads are manually configured based on
hardware, currently 16/64 on 8-core machines and 16/96 on 12-core
machines.)

As this is the summer, we are currently very lightly loaded. There's
apparently still a bug in drc4.patch, because both of my non-scratch
production servers show a negative CacheSize in nfsstat.

(I hope that all of these patches will make it into 9.2 so we don't
have to maintain our own mutant NFS implementation.)

-GAWollman

Berend de Boer

Jul 8, 2013, 1:56:16 AM
>>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:

Rick> Please try this patch:

Hi Rick,

Could you please reroll the patch against 9.1-RELEASE? Not sure what
version of FreeBSD you made this for. I have 9.1-RELEASE and get three
failures:

# patch --check -p0 < ~/drc4.patch
Hmm... Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- fs/nfsserver/nfs_nfsdcache.c.orig 2013-01-07 09:04:13.000000000 -0500
|+++ fs/nfsserver/nfs_nfsdcache.c 2013-03-12 22:42:05.000000000 -0400
--------------------------
Patching file fs/nfsserver/nfs_nfsdcache.c using Plan A...
Hunk #1 succeeded at 160.
Hunk #2 succeeded at 216.
Hunk #3 succeeded at 271.
Hunk #4 succeeded at 357.
Hunk #5 succeeded at 370.
Hunk #6 succeeded at 381.
Hunk #7 succeeded at 396 with fuzz 2.
Hunk #8 succeeded at 426.
Hunk #9 succeeded at 444.
Hunk #10 succeeded at 468.
Hunk #11 failed at 476.
Hunk #12 failed at 501.
Hunk #13 succeeded at 523.
Hunk #14 succeeded at 531.
Hunk #15 succeeded at 547.
Hunk #16 succeeded at 568.
Hunk #17 succeeded at 579.
Hunk #18 succeeded at 601.
Hunk #19 succeeded at 665.
Hunk #20 succeeded at 674.
Hunk #21 failed at 683.
Hunk #22 succeeded at 718.
Hunk #23 succeeded at 729.
Hunk #24 succeeded at 750.
Hunk #25 succeeded at 779.
Hunk #26 succeeded at 788.
Hunk #27 succeeded at 803.
Hunk #28 succeeded at 828.
Hunk #29 succeeded at 927.
Hunk #30 succeeded at 943.
3 out of 30 hunks failed--saving rejects to fs/nfsserver/nfs_nfsdcache.c.rej
Hmm... The next patch looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- fs/nfsserver/nfs_nfsdport.c.orig 2013-03-02 18:19:34.000000000 -0500
|+++ fs/nfsserver/nfs_nfsdport.c 2013-03-12 17:51:31.000000000 -0400
--------------------------
Patching file fs/nfsserver/nfs_nfsdport.c using Plan A...
Hunk #1 succeeded at 59 with fuzz 1 (offset -2 lines).
Hunk #2 succeeded at 3284 (offset -22 lines).
Hunk #3 succeeded at 3351 with fuzz 1 (offset -5 lines).
Hmm... The next patch looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- fs/nfs/nfsport.h.orig 2013-03-02 18:35:13.000000000 -0500
|+++ fs/nfs/nfsport.h 2013-03-12 17:51:31.000000000 -0400
--------------------------
Patching file fs/nfs/nfsport.h using Plan A...
Hunk #1 succeeded at 547 (offset -62 lines).
Hmm... The next patch looks like a unified diff to me...
The text leading up to this was:
--------------------------
|--- fs/nfs/nfsrvcache.h.orig 2013-01-07 09:04:15.000000000 -0500
|+++ fs/nfs/nfsrvcache.h 2013-03-12 18:02:42.000000000 -0400
--------------------------
Patching file fs/nfs/nfsrvcache.h using Plan A...
Hunk #1 succeeded at 41.
done


I just want to make sure I can apply it cleanly, so that if it doesn't
work it isn't because of a mistake I made while patching.

Rick Macklem

Jul 8, 2013, 9:43:52 PM
Berend de Boer wrote:
> >>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:
>
> Rick> After you apply the patch and boot the rebuilt kernel, the
> Rick> cpu overheads should be reduced after you increase the
> value
> Rick> of vfs.nfsd.tcphighwater.
>
> What number would I be looking at? 100? 100,000?
>
Garrett Wollman might have more insight into this, but I would say on
the order of 100s to maybe 1000s.

rick

> --
> All the best,
>
> Berend de Boer
>
>
> ------------------------------------------------------
> Awesome Drupal hosting: https://www.xplainhosting.com/
>
>

Rick Macklem

Jul 8, 2013, 6:22:52 PM
Berend de Boer wrote:
> >>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:
>
> Rick> Please try this patch:
>
> Hi Rick,
>
> Could you please reroll the patch against 9.1-RELEASE? Not sure what
> version of FreeBSD you made this for.
I haven't a clue, either, to be honest;-)

> I have 9.1-RELEASE. Get three
> failures:
>
I don't have a copy of 9.1-RELEASE handy, but here's one for stable/9.
If that still fails, just email and I'll download 9.1-RELEASE and create
one for it.
http://people.freebsd.org/~rmacklem/drc4-stable9.patch

rick

Berend de Boer

Jul 8, 2013, 7:12:28 PM
>>>>> "Outback" == Outback Dingo <outbac...@gmail.com> writes:

Outback> this patches cleanly against 9/stable updated as of 20
Outback> mins ago

It does, but it does not compile: the `i' variable wasn't declared. I
just fixed that and started the compile again.

Outback Dingo

Jul 8, 2013, 7:28:22 PM
On Mon, Jul 8, 2013 at 7:27 PM, Berend de Boer <ber...@pobox.com> wrote:

> >>>>> "Outback" == Outback Dingo <outbac...@gmail.com> writes:
>
> Outback> yupp just hit that also, whats the fix ???
>
> Just add a line with:
>
> int i;
>
> at the beginning of the function, below the other variable
> declarations. Then everything compiles.
>

jeeeeeez damned if im not a programmer but considered that.... :) thanks
for clarifying

>
>
> --
> All the best,
>
> Berend de Boer
>
>
> ------------------------------------------------------
> Awesome Drupal hosting: https://www.xplainhosting.com/
>
>

Outback Dingo

Jul 8, 2013, 7:18:06 PM
On Mon, Jul 8, 2013 at 7:12 PM, Berend de Boer <ber...@pobox.com> wrote:

> >>>>> "Outback" == Outback Dingo <outbac...@gmail.com> writes:
>
> Outback> this patches cleanly against 9/stable updated as of 20
> Outback> mins ago
>
> It does, but it does not compile, the `i' variable wasn't declared. I
> just fixed that and started the compile again.
>
>
yupp just hit that also, whats the fix ???


Outback Dingo

Jul 8, 2013, 7:04:46 PM
On Mon, Jul 8, 2013 at 6:22 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Berend de Boer wrote:
> > >>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:
> >
> > Rick> Please try this patch:
> >
> > Hi Rick,
> >
> > Could you please reroll the patch against 9.1-RELEASE? Not sure what
> > version of FreeBSD you made this for.
> I haven't a clue, either, to be honest;-)
>
> > I have 9.1-RELEASE. Get three
> > failures:
> >
> I don't have a copy of 9.1-RELEASE handy, but here's one for stable/9.
> If that still fails, just email and I'll download 9.1-RELEASE and create
> one for it.
> http://people.freebsd.org/~rmacklem/drc4-stable9.patch


this patches cleanly against 9/stable updated as of 20 mins ago

Berend de Boer

Jul 9, 2013, 3:08:47 AM
>>>>> "Garrett" == Garrett Wollman <wol...@bimajority.org> writes:

Garrett> <snip>

Great stuff, thanks Garrett!

Garrett> ----modules/nfs/server/freebsd.pp----

Where does this go?

Berend de Boer

unread,
Jul 9, 2013, 3:48:05 AM7/9/13
to
>>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:

Rick> After you
Rick> apply the patch and boot the rebuilt kernel, the cpu
Rick> overheads should be reduced after you increase the value of
Rick> vfs.nfsd.tcphighwater.

OK, completely disregard my previous email. I actually was testing
against a server in a different data centre, didn't think it would
matter too much, but clearly it does (ping times 2-3 times higher).

So moved server + disks into the same data centre as the nfs client.

1. Does not affect nfs3.

2. When I do not set vfs.nfsd.tcphighwater, I get a "Remote I/O error"
on the client. On the server I see:

nfsd server cache flooded, try to increase nfsrc_floodlevel

(this is just FYI).

3. With vfs.nfsd.tcphighwater set to 150,000 I get very high cpu, 50%.

Performance is now about 8m15s, which is better, but still twice as
slow as a lower-spec Linux NFS4 server, and four times slower than
nfs3 on the same box.

4. With Garrett's settings, I looked at when the cpu starts to
increase. It starts slowly, but rises quickly to 50% in about 1
minute.

Time was similar, 7m54s.

5. I lowered vfs.nfsd.tcphighwater to 10,000, but then it actually
became worse: cpu quickly went to 70%, i.e. not much difference
from FreeBSD without the patch. Didn't keep this test running to
see if it became slower over time.

Making it 300,000 seems to make the cpu increase more slowly (but it
keeps rising).

So what I observe is that the patch makes the rise in cpu slower,
but doesn't stop it. I.e. after a few minutes, even with a setting
of 300,000 the cpu got to 50%, dropped a bit after a while to hover
around 40%, then crept back over 50%.

6. So the conclusion is: this patch helps somewhat, but nfs4 behaviour
is still majorly impaired compared to nfs3.

Daniel Kalchev

Jul 9, 2013, 4:11:01 AM

On 09.07.13 10:48, Berend de Boer wrote:
> OK, completely disregard my previous email. I actually was testing
> against a server in a different data centre, didn't think it would
> matter too much, but clearly it does (ping times 2-3 times higher).
>


Could you please actually post a diagram of your setup, with all the
components, including the "low spec Linux server"? Do not forget the RTT
(ping) between these hosts, and mention any network tuning you have done.

Networking protocols like NFS are heavily influenced by factors like
RTT. An "underpowered" box that is "nearby" (has lower RTT) usually
performs much better than a "powerful box" with larger RTT and other
network bottlenecks.

Unfortunately, AWS is far from perfect hardware emulation and there
might be other layers that interfere with the NFS protocol.

Daniel

Berend de Boer

Jul 9, 2013, 4:43:01 AM
>>>>> "Berend" == Berend de Boer <ber...@pobox.com> writes:

Berend> 1. Does not affect nfs3.

One update: with zfs sync=standard, I can get nfs3 to perform as badly
as the nfs4+patch.

1. without nolock, without async, sync=standard: 8m21s

2. with nolock, without async, sync=standard: 5m37s

3. without nolock, with async, sync=standard: 4m56s.

4. with nolock, with async, sync=standard: 4m23s.

5. without nolock, without async, sync=disabled: 1m57.


PS: the nfs4 test was done with sync=disabled.
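
For clarity: nolock and async above are mount options on the Linux
client, while sync=... is the ZFS property on the server. A case-2
style mount looks roughly like this, with server and paths as
placeholders:

  mount -t nfs -o vers=3,tcp,nolock server:/tank/home /home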

Berend de Boer

Jul 9, 2013, 4:50:03 AM
>>>>> "Daniel" == Daniel Kalchev <dan...@digsys.bg> writes:

Daniel> Could you please actually post a diagram of your setup,
Daniel> with all the components, including the "low spec Linux
Daniel> server". Do not forget the RTT (ping) between these
Daniel> hosts. If you have made any network tuning too.

To be honest, I think that's of little use. I have no control over where
my server is placed in the AWS data centre, nor over what other load (cpu
or i/o) is taking place on those boxes.

So having approximate values (I'm posting factors of 2 or 4) is the
only useful strategy.

As you can see I'm not talking about 5% differences here, only about
200% (or 2000%!!!) differences.

Pings within an AWS data centre are similar, let's say about
0.450ms, but with a reasonable range.

Unless you know AWS I think it's no use posting hardware specs (there
are none). The Linux box is a c1.medium, the FreeBSD box an
m1.large. The FreeBSD instance is EBS-optimised (but the disks are not).


Daniel> Networking protocols like NFS are heavily influenced by
Daniel> factors like RTT. An "underpowered" box that is "nearby"
Daniel> (has lover RTT) usually performs much better than a
Daniel> "powerful box" with larger RTT and other network
Daniel> bottlenecks.

I fully agree with that.


Daniel> Unfortunately, AWS is far from perfect hardware emulation
Daniel> and there might be other layers that intervene with the
Daniel> NFS protocol.

Exactly right. Given that we are talking factors of difference here I
think the hardware in this case does not matter. The problem is in the
software.

Garrett Wollman

Jul 9, 2013, 11:21:59 AM
<<On Tue, 09 Jul 2013 19:08:47 +1200, Berend de Boer <ber...@pobox.com> said:

Garrett> ----modules/nfs/server/freebsd.pp----

> Where does this go?

On your Puppet server.

-GAWollman

Rick Macklem

Jul 9, 2013, 7:38:01 PM
Berend de Boer wrote:
> >>>>> "Berend" == Berend de Boer <ber...@pobox.com> writes:
>
> Berend> 1. Does not effect nfs3.
>
> One update: with zfs sync=standard, I can get nfs3 to perform as bad
> as the nfs4+patch.
>
Hmm, this is interesting. ken@'s file handle affinity patch works for
NFSv3, but not NFSv4. If I understood his posts correctly, the fh affinity
patch was needed so that ZFS's heuristic for recognizing sequential reading
would function correctly. (A file handle affinity patch for NFSv4 will take some
time, since all RPCs in NFSv4 are compounds, with reads/writes embedded in them,
along with other ops.)

If you could do some testing where you export a UFS volume, the results might
help to isolate the issue to ZFS vs nfsd.

> 1. without nolock, without async, sync=standard: 8m21s
>
> 2. with nolock, without async, sync=standard: 5m37s
>
> 3. without nolock, with async, sync=standard: 4m56s.
>
> 4. with nolock, with async, sync=standard: 4m23s.
>
> 5. without nolock, without async, sync=disabled: 1m57.
>
>
> PS: the nfs4 test was done with sync=disabled.
>
W.r.t. CPU overheads and the size of vfs.nfsd.tcphighwater:
NFSRVCACHE_HASHSIZE was increased to 500 by the patch. 500 seems
about right for a vfs.nfsd.tcphighwater of a few thousand.

If you are going to use a very large value (100000), I'd suggest
you increase NFSRVCACHE_HASHSIZE to something like 10000. (You have
to edit sys/fs/nfs/nfsrvcache.h and recompile to change this.)
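Roughly, that is a one-line change in sys/fs/nfs/nfsrvcache.h, from what
the patch sets:

  #define NFSRVCACHE_HASHSIZE 500

to something matching your tcphighwater, e.g.:

  #define NFSRVCACHE_HASHSIZE 10000

followed by a kernel rebuild.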

I'd suggest you try something like:
vfs.nfsd.tcphighwater=5000
vfs.nfsd.tcpcachetimeo=300 (5 minutes instead of 12hrs)

Also, I can't remember if you've bumped up the # of nfsd threads,
but I'd go for 256.
nfs_server_flags="-u -t -n 256"
- in your /etc/rc.conf

I never use ZFS, so I can't help w.r.t. ZFS tuning, rick

>
> --
> All the best,
>
> Berend de Boer
>
>
> ------------------------------------------------------
> Awesome Drupal hosting: https://www.xplainhosting.com/
>

Rick Macklem

Jul 9, 2013, 7:53:54 PM
Berend de Boer wrote:
> >>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:
>
> Rick> After you
> Rick> apply the patch and boot the rebuilt kernel, the cpu
> Rick> overheads should be reduced after you increase the value of
> Rick> vfs.nfsd.tcphighwater.
>
> OK, completely disregard my previous email. I actually was testing
> against a server in a different data centre, didn't think it would
> matter too much, but clearly it does (ping times 2-3 times higher).
>
> So moved server + disks into the same data centre as the nfs client.
>
> 1. Does not effect nfs3.
>
> 2. When I do not set vfs.nfsd.tcphighwater, I get a "Remote I/O
> error"
> on the client. On server I see:
>
> nfsd server cache flooded, try to increase nfsrc_floodlevel
>
> (this just FYI).
>
> 3. With vfs.nfsd.tcphighwater set to 150,000. I get very high cpu,
> 50%.
>
The patch I sent you does not tune nfsrc_floodlevel based on what
you set vfs.nfsd.tcphighwater to. That needs to be added to the patch.
(I had some code that did this, but others recommended that it should
be done as a part of the sysctl, but I haven't gotten around to coding that.)

--> For things to work ok, vfs.nfsd.tcphighwater needs to be less than
nfsrc_floodlevel (which is 16384).
*** Again, I'd recommend setting vfs.nfsd.tcphighwater=5000 to 10000, which
is well under 16384 and for which a hash table size of 500 should be ok.
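
A quick sanity check while testing would be something like:

  # sysctl vfs.nfsd.tcphighwater=10000   (well under 16384)
  # nfsstat -e -s                        (watch the server cache size between runs)

where "nfsstat -e" shows the stats for the new NFS server, including the
cache size.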

Believe it or not, this server was developed about 10 years ago on a PIII
with 32 (no, not Gbytes, but Mbytes) of RAM. The sizing worked well for that
hardware, but is obviously a bit small for newer hardware;-)

> Performance is now about 8m15s. Which is better, but still twice
> above a lower spec Linux NFS4 server, and four times slower than
> nfs3 on the same box.
>
> 4. With Garrett's settings, I looked at when the cpu starts to
> increase. It starts slow, but raises quickly to 50% in about 1
> minute.
>
I think his code uses a nfsrc_floodlevel tuned based on vfs.nfsd.tcphighwater
and I suspect a much larger hash table size, too.

> Time was similar 7m54s.
>
One other thing you can try is enabling delegations.
On the server:
vfs.nfsd.issue_delegations=1

> 5. I lowered vfs.nfsd.tcphighwater to 10,000 but then it actually
> became worse, cpu quickly went to 70%, i.e. not much difference
> with FreeBSD without patch. Didn't keep this test running to see
> if
> it became slower over time.
>
> Making it 300,000 seems that the cpu increases are slower (but it
> keeps rising).
>
> So from what I observe from the patch is that it makes the rise in
> cpu increase slower, but doesn't stop it. I.e. after a few
> minutes,
> even with setting 300,000 the cpu is getting to 50%, but dropped a
> bit after a while to hover around 40%. Then it crept back to over
> 50%.
>
> 6. So the conclusion is: this patch helps somewhat, but nfs4
> behaviour
> is still majorly impaired compared to nfs3.
>
Well, reading and writing are the same for NFSv4 as NFSv3, except there isn't
any file handle affinity support for NFSv4 (which ties a set of nfsd thread(s) to
the reading/writing of a file). File handle affinity results in a more sequential
series of VOP_READ()/VOP_WRITE() calls to the server file system.

The other big difference between NFSv3 and NFSv4 is the Open operations. The
only way to reduce the # of these may be enabling delegations. How much
effect this has depends on the client.

Rick Macklem

Jul 9, 2013, 7:57:02 PM
Garrett Wollman wrote:
> <<On Mon, 8 Jul 2013 21:43:52 -0400 (EDT), Rick Macklem
> <rmac...@uoguelph.ca> said:
>
> > Berend de Boer wrote:
> >> >>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:
> >>
> Rick> After you apply the patch and boot the rebuilt kernel, the
> Rick> cpu overheads should be reduced after you increase the
> >> value
> Rick> of vfs.nfsd.tcphighwater.
> >>
> >> What number would I be looking at? 100? 100,000?
> >>
> > Garrett Wollman might have more insight into this, but I would say
> > on
> > the order of 100s to maybe 1000s.
>
Afraid not. I was planning on getting it in, but the 9.2 release schedule
was announced with a short time before code slush. Hopefully a cleaned-up
version of this will be in 10.0 and 9.3.

rick

> -GAWollman

Berend de Boer

Jul 9, 2013, 7:57:44 PM
>>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:

Rick> Hmm, this is interesting. ken@'s file handle affinity patch
Rick> works for NFSv3, but not NFSv4. If I understood his posts
Rick> correctly, the fh affinity patch was needed, so ZFS's
Rick> heuristic for recognizing sequential reading would function
Rick> correctly. (A file handle affinity patch for NFSv4 will take
Rick> some time, since all RPCs in NFSv4 are compounds, with
Rick> reads/writes imbedded in them, along with other ops.)

These are all very small files, nothing large, so it's not some kind
of sequential reading/writing test. Just thousands of small files
being written, and a fair number of reads.


Rick> Also, I can't remember if you've bumped up the # of nfsd
Rick> threads, but I'd go for 256. nfs_server_flags="-u -t -n
Rick> 256" - in your /etc/rc.conf

Didn't. But only one client was writing, so I figured that shouldn't matter.

Berend de Boer

Jul 13, 2013, 1:48:02 AM
>>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:

Rick> If you could do some testing where you export a UFS volume,
Rick> the results might help to isolate the issue to ZFS vs nfsd.

Indeed! Have changed the subject, as indeed ZFS is a red herring. The issue
shows up with UFS as well. Very high cpu for nfsd, about the same time
to do the operation.

Tried it with these settings:

vfs.nfsd.tcphighwater=5000
vfs.nfsd.tcpcachetimeo=300
nfs_server_flags="-u -t -n 256"
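
For reference, the usual persistent homes for these would be (the sysctls
can also be set at runtime with sysctl(8)):

----/etc/sysctl.conf----
vfs.nfsd.tcphighwater=5000
vfs.nfsd.tcpcachetimeo=300

----/etc/rc.conf----
nfs_server_flags="-u -t -n 256"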


On nfs3 + ufs everything is back to normal. I.e. nfs4 was about 15
minutes, while the same operation was 241 seconds with nfs3, with nfsd
using basically no cpu at all.

Per other reply, tried this too:

vfs.nfsd.issue_delegations=1

Locks up the client at the first write access. Ctrl+C doesn't work; I need
to explicitly send a KILL signal from another terminal.

I think it locks up the server in some way as well. Doing an ls on an
exported path locks up, Ctrl+C won't work anymore, and the process doesn't
react to any signal. In the end I rebooted the server to get rid of
this.

Rick Macklem

Jul 13, 2013, 6:45:40 PM
Berend de Boer wrote:
> >>>>> "Rick" == Rick Macklem <rmac...@uoguelph.ca> writes:
>
> Rick> If you could do some testing where you export a UFS volume,
> Rick> the results might help to isolate the issue to ZFS vs nfsd.
>
> Indeed! Have changed subject, as indeed ZFS is a red herring. Issue
> shows up with UFS as well. Very high cpu for nfds, about the same
> time
> to do the operation.
>
> Tried it with these settings:
>
> vfs.nfsd.tcphighwater=5000
> vfs.nfsd.tcpcachetimeo=300
> nfs_server_flags="-u -t -n 256"
>
>
> On nfs3 + ufs everything back to normal. I.e. nfs4 was about 15
> minutes, same operation was 241s minutes with nfs3 and nfsd using no
> cpu at all basically.
>
All I can suggest is capturing packets and then emailing me the captured
packet trace. You can use tcpdump to do the capture, since Wireshark will
understand it:
# tcpdump -s 0 -w <file>.pcap host <client-host>
and then emailing me <file>.pcap.

I can take a look at the packet capture and maybe see what is going on.

I think you mentioned that you were using a Linux client, but not what
version. I'd suggest a recent kernel from kernel.org. (Fedora tracks
updates/fixes for NFSv4 pretty closely, so the newest Fedora release
should be pretty current.)


> Per other reply, tried this too:
>
> vfs.nfsd.issue_delegations=1
>
> Locks up the client at first write access. Ctrl+C doesn't work, need
> to explicitly send a KILL signal from other terminal.
>
> I think it locks up the server in some way as well. Doing an ls on an
> exported path locks up. Ctrl+C won't work anymore. Process doesn't
> react to any signal. In the end I rebooted the server to get rid of
> this.
>
Obviously broken, but without a lot more information, I can't say anything.
(Assuming you can still run commands on the server, you could start with
something like "ps axl" and "procstat -kk".)

rick

> --
> All the best,
>
> Berend de Boer
>
>
> ------------------------------------------------------
> Awesome Drupal hosting: https://www.xplainhosting.com/
>