
[NFS] Maximum transfer size in NFSv3


Bruce Blinn

Feb 13, 2001, 5:26:59 PM
I was trying to set the NFS transfer size (rsize= and wsize= mount
options) to 32k, which should be supported in NFS version 3. However,
when I tried it, the values kept getting set back to 8k. I tracked this
down and found that nfs_read_super() was resetting the value because the
variable fsinfo.rtmax was 8k. This value appears to have been retrieved
from the server (nfsd3_proc_fsinfo), where it turns out to be the
constant NFSSVC_MAXBLKSIZE, which is 8k.
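
For reference, the clamp looks roughly like this (my paraphrase of the
2.2-era logic in fs/nfs/inode.c, not a verbatim quote of the source):

    /* The server-reported FSINFO maxima silently override the
     * rsize=/wsize= mount options, so rsize=32768 becomes 8192. */
    if (fsinfo.rtmax && server->rsize > fsinfo.rtmax)
            server->rsize = fsinfo.rtmax;   /* rtmax == NFSSVC_MAXBLKSIZE == 8k */
    if (fsinfo.wtmax && server->wsize > fsinfo.wtmax)
            server->wsize = fsinfo.wtmax;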

Is there still a need for this limit, or did I miss something?

I am using Linux 2.2.18 without any patches. The output of /proc/mounts
confirms that I am using NFSv3, and it shows the 8k transfer sizes.

Thanks,
Bruce
--
Bruce Blinn __o
Mission Critical Linux, Inc. _`\<;_ 408-615-9100
www.MissionCriticalLinux.com ( )/ ( ) bl...@MissionCriticalLinux.com

_______________________________________________
NFS maillist - N...@lists.sourceforge.net
http://lists.sourceforge.net/lists/listinfo/nfs

Trond Myklebust

Feb 13, 2001, 6:24:41 PM
>>>>> " " == Bruce Blinn <bl...@MissionCriticalLinux.com> writes:

> I was trying to set the NFS transfer size (rsize= and wsize=
> mount options) to 32k, which should be supported in NFS version
> 3. However, when I tried it, the values kept getting set back
> to 8k. I tracked this down and found that nfs_read_super() was
> resetting the value because the variable fsinfo.rtmax was 8k.
> This value appears to have been retrieved from the server
> (nfsd3_proc_fsinfo), where it turns out to be the constant
> NFSSVC_MAXBLKSIZE, which is 8k.

> Is there still a need for this limit, or did I miss something?

The Linux NFS server is UDP-only, so any blocksize over 1k can give
rise to fragmentation losses: an 8k request on 1500-byte Ethernet is
split into six IP fragments, and losing any one of them loses the
whole RPC. 8k is the historical maximum r/wsize that was adopted by
the NFSv2 standard, so we're more or less obliged to support it;
however, even that has been known to give unacceptably large timeouts
on low-end hardware.

With TCP support, concerns about the reliability of the transport
layer are no longer an issue; that is why NFSv3 removed the absolute
limit and replaced it with a set of server-specified maxima.

If you do increase NFSSVC_MAXBLKSIZE on your setup, you should
probably edit fs/nfsd/nfs3proc.c to set f_rtpref and f_wtpref (the
preferred, as opposed to maximum, r/wsizes) to 8k because of the UDP
limitations.
That way people are still allowed to set larger values by hand, but
the recommended mount defaults (which aren't used by 2.2.x but are
used by 2.4.x) stay at 8k.
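
Concretely, something along these lines (an untested sketch against
the 2.2-era nfsd; double-check the field names in your tree):

    /* in nfsd3_proc_fsinfo(): advertise the raised hard limit, but
     * keep the preferred sizes at 8k so that default mounts stay
     * fragmentation-friendly over UDP. */
    resp->f_rtmax  = NFSSVC_MAXBLKSIZE;     /* e.g. now 32k */
    resp->f_rtpref = 8192;                  /* preferred read size  */
    resp->f_wtmax  = NFSSVC_MAXBLKSIZE;
    resp->f_wtpref = 8192;                  /* preferred write size */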

Cheers,
Trond

Mark Hemment

Feb 14, 2001, 7:06:44 AM

On 14 Feb 2001, Trond Myklebust wrote:
> The Linux NFS server is UDP-only, so any blocksize over 1k can give
> rise to fragmentation losses. 8k is the historical maximum r/wsize
> that was adopted by the NFSv2 standard, so we're more or less obliged
> to support it; however, even that has been known to give unacceptably
> large timeouts on low-end hardware.

From a quick look at David Miller's zero-copy patches, it appears
that with his changes the pull-up of the IP fragments happens inside
the sunrpc code.
Wouldn't it be possible to add a receive buffer to the svc_rqst
structure (to complement the result buffer), and to use this buffer
for the pull-up?
OK, there is still the problem of allocating a contiguous buffer, but
that would only happen once per thread.
Now, if there were two receive buffers, one for the protocol header and
another for the data payload (nicely aligned on a page boundary), we
might get the necessary infrastructure for a few performance tricks...
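
Roughly this (the field names are invented for illustration; the
allocation would happen once, at thread creation):

    /* hypothetical per-thread buffers in struct svc_rqst */
    struct svc_rqst {
            /* ... existing fields ... */
            void    *rq_hdrbuf;     /* pull-up buffer for RPC/NFS headers */
            void    *rq_databuf;    /* page-aligned buffer for the payload */
    };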

I'll try to do some patches for next week.

Mark

Trond Myklebust

Feb 14, 2001, 2:01:10 PM
>>>>> " " == Mark Hemment <mar...@veritas.com> writes:

> From a quick look at David Miller's zero-copy patches, it
> appears that with his changes the pull-up of the IP fragments
> happens inside the sunrpc code.
> Wouldn't it be possible to add a receive buffer to the
> svc_rqst structure (to complement the result buffer), and to
> use this buffer for the pull-up?
> OK, there is still the problem of allocating a contiguous
> buffer, but that would only happen once per thread.

Please note: with UDP receives we don't actually use the receive
buffer, but instead hook the skb into rq_skbuff.

With this in mind, why linearize the skbs at all?

As you point out, this forces you into yet another buffer allocation
and another copy. However, the RPC header stuff (and all the data
except for write requests?) should normally fit within one fragment
(as long as the MTU is >~ 1200).
As for the larger write requests, perhaps one could just fill an iovec
array for the nfsd daemon?
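
Untested sketch (assuming the 2.4 zero-copy skb layout, where a
reassembled datagram carries its fragments on skb_shinfo(skb)->frag_list;
the array bound is hypothetical):

    /* describe the datagram to nfsd as an iovec instead of
     * flattening it: linear head first, then each fragment */
    struct iovec iov[MAX_UDP_FRAGS];        /* hypothetical bound */
    struct sk_buff *frag;
    int n = 0;

    iov[n].iov_base  = skb->data;
    iov[n++].iov_len = skb->len - skb->data_len;
    for (frag = skb_shinfo(skb)->frag_list; frag; frag = frag->next) {
            iov[n].iov_base  = frag->data;
            iov[n++].iov_len = frag->len;
    }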

Cheers,
Trond

Mark Hemment

Feb 14, 2001, 3:01:50 PM
On 14 Feb 2001, Trond Myklebust wrote:

> >>>>> " " == Mark Hemment <mar...@veritas.com> writes:
>
> > From a quick look at David Miller's zero-copy patches, it
> > appears that with his changes the pull-up of the IP fragments
> > happens inside the sunrpc code.
> > Wouldn't it be possible to add a receive buffer to the
> > svc_rqst structure (to complement the result buffer), and to
> > use this buffer for the pull-up?
> > OK, there is still the problem of allocating a contiguous
> > buffer, but that would only happen once per thread.
>
> Please note: with UDP receives we don't actually use the receive
> buffer, but instead hook the skb into rq_skbuff.

With David's patches, we could have a function similar to
skb_linearize(), but which takes a pointer to a thread's pre-allocated
pull-up/receive buffer:
skb_linearize_buff(skb, rqstp->some_pointer_to_buffer);

OK, there would need to be a bit of extra crud, but if it avoids an
allocation on each incoming request, so much the better. If it avoids
a GFP_ATOMIC allocation for sizes greater than 4096, then we're
winning all the way.
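
In full, something like this (untested; skb_copy_bits() as in David's
patches, and the caller guarantees buf holds at least skb->len bytes):

    /* like skb_linearize(), but flatten into a caller-supplied
     * buffer instead of allocating one with GFP_ATOMIC */
    static int skb_linearize_buff(struct sk_buff *skb, void *buf)
    {
            return skb_copy_bits(skb, 0, buf, skb->len);
    }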

> With this in mind, why linearize the skbs at all?

If we copy-and-cksum at the same time (with IP fragments we'll need
to cksum, right?), then the cost of linearizing is low, and it helps
the filesystem code.

> As you point out, this forces you into yet another buffer allocation
> and another copy. However, the RPC header stuff (and all the data
> except for write requests?) should normally fit within one fragment
> (as long as the MTU is >~ 1200).

The receive (aka pull-up) buffer is allocated at thread creation, and
so is removed from the critical code path.
Maybe it is my workload, but I see many more writes that span more
than one fragment than writes that fit in a single one.

> As for the larger write requests, perhaps one could just fill an iovec
> array for the nfsd daemon?

With some (many!) filesystems, the code to handle an iovec adds a
slight weight to the code path. That is not important to some users
(usually those I/O-bound with sync mounts), but it matters seriously
when you're CPU-bound.

Mark

Trond Myklebust

Feb 15, 2001, 3:31:18 AM
>>>>> " " == Mark Hemment <mar...@veritas.com> writes:

> With David's patches, we could have a function similar to
> skb_linearize(), but which takes a pointer to a thread's
> pre-allocated pull-up/receive buffer:
> skb_linearize_buff(skb, rqstp->some_pointer_to_buffer);

> OK, there would need to be a bit of extra crud, but if it
> avoids an allocation on each incoming request, so much the
> better. If it avoids a GFP_ATOMIC allocation for sizes greater
> than 4096, then we're winning all the way.

I fully agree with this. Actually, I don't understand why we need
GFP_ATOMIC at all, even in the existing zero-copy patch. You can
rely on svc_udp_recvfrom() always being called in the context of the
nfsd/lockd thread. David?

>> With this in mind, why linearize the skbs at all?

> If we copy-and-cksum at the same time (with IP fragments
> we'll need to cksum, right?), then the cost of linearizing is
> low, and it helps the filesystem code.

Yes, but I was thinking of delaying that until we're ready to copy
into the page cache.
All it would take would be a modification of generic_file_write() to
allow us to hook the copy operation (see for instance how read_actor_t
is used in the sendfile operation).
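
By analogy, purely hypothetical (no such hook exists; the type is
named after read_actor_t):

    /* a "write actor" that generic_file_write() would call in place
     * of copy_from_user(), copying (and checksumming) straight from
     * the skb into the page-cache page */
    typedef int (*write_actor_t)(void *desc, struct page *page,
                                 unsigned long offset, unsigned long size);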

The problem with that, of course, is that you only find out the packet
is corrupt after you've already corrupted the page cache by copying
into it. Ah well, it was a nice dream...

>> As you point out, this forces you into yet another buffer
>> allocation and another copy. However, the RPC header stuff (and
>> all the data except for write requests?) should normally fit
>> within one fragment (as long as the MTU is >~ 1200).

> The receive (aka pull-up) buffer is allocated at thread
> creation, and so is removed from the critical code path.
> Maybe it is my workload, but I see many more writes that span
> more than one fragment than writes that fit in a single one.

Writes, yes, but they are the special case. The bulk of NFS traffic
usually takes the form of filehandle lookups and read requests, which
all fit into a single Ethernet frame.

>> As for the larger write requests, perhaps one could just fill
>> an iovec array for the nfsd daemon?

> With some (many!) filesystems, the code to handle an iovec
> adds a slight weight to the code path. That is not important
> to some users (usually those I/O-bound with sync mounts), but
> it matters seriously when you're CPU-bound.

Now that every filesystem is required to use the page cache and to
support generic_file_write(), we should be able to design around this
particular problem.

Cheers,
Trond

Mark Hemment

Feb 15, 2001, 5:12:55 AM
On 15 Feb 2001, Trond Myklebust wrote:
> >>>>> " " == Mark Hemment <mar...@veritas.com> writes:
> > With David's patches, we could have a function similar to
> > skb_linearize(), but which takes a pointer to a thread's
> > pre-allocated pull-up/receive buffer:
> > skb_linearize_buff(skb, rqstp->some_pointer_to_buffer);
>
> > OK, there would need to be a bit of extra crud, but if it
> > avoids an allocation on each incoming request, so much the
> > better. If it avoids a GFP_ATOMIC allocation for sizes
> > greater than 4096, then we're winning all the way.
>
> I fully agree with this. Actually, I don't understand why we need
> GFP_ATOMIC at all, even in the existing zero-copy patch. You can
> rely on svc_udp_recvfrom() always being called in the context of the
> nfsd/lockd thread. David?

Hmm, probably because sk_data has been set to zero.
If multiple requests have arrived and the allocation blocks, then the
requests already sitting on the socket won't get serviced very quickly.
Couldn't we set sk_data to 1 immediately after the skb_recv_datagram(),
and then call svc_sock_received(svsk, 0) before doing the
allocation/pull-up/checksum?

Doesn't this also expose a limitation in the page allocator?
Here, we're calling it with GFP_ATOMIC because we don't want to block
(fine). But we're not inside an interrupt or bottom-half handler, so
why can't we steal from the inactive page lists? Unfortunately,
GFP_ATOMIC doesn't allow that (it assumes the worst).

Mark

Trond Myklebust

Feb 15, 2001, 6:37:15 AM
>>>>> " " == Mark Hemment <mar...@veritas.com> writes:

>> I fully agree with this. Actually, I don't understand why we
>> need GFP_ATOMIC at all, even in the existing zero-copy
>> patch. You can rely on svc_udp_recvfrom() always being called
>> in the context of the nfsd/lockd thread. David?

> Hmm, probably because sk_data has been set to zero. If
> multiple requests have arrived and the allocation blocks, then
> the requests already sitting on the socket won't get serviced
> very quickly.
> Couldn't we set sk_data to 1 immediately after the
> skb_recv_datagram(), and then call svc_sock_received(svsk, 0)
> before doing the allocation/pull-up/checksum?

You'd have to call svc_sock_received() or else sk_busy will remain
set. There's no reason why you can't do that immediately after calling
skb_recv_datagram().

In any case, the setting of sk_data there is wrong. It's done without
holding the svsk->sk_lock, and it sets sk_data to 1.

Ideally, sk_data is supposed to be incremented by svc_udp_data_ready()
and then decremented by svc_udp_recvfrom(). That way it reflects the
number of requests remaining in the buffer (well sort of - UDP may
discard requests).
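
In other words (sketch only, using the svsk->sk_lock mentioned above):

    /* in svc_udp_data_ready(), network bottom half: */
    spin_lock(&svsk->sk_lock);
    svsk->sk_data++;                /* one more datagram queued */
    spin_unlock(&svsk->sk_lock);

    /* in svc_udp_recvfrom(), process context: */
    spin_lock_bh(&svsk->sk_lock);
    svsk->sk_data--;                /* one datagram picked up */
    spin_unlock_bh(&svsk->sk_lock);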

> Doesn't this also expose a limitation in the page allocator?
> Here, we're calling it with GFP_ATOMIC because we don't want
> to block (fine). But we're not inside an interrupt or
> bottom-half handler, so why can't we steal from the inactive
> page lists? Unfortunately, GFP_ATOMIC doesn't allow that (it
> assumes the worst).

There's no need for it. We can block AFAICS.

Cheers,
Trond

Mark Hemment

Feb 15, 2001, 7:38:29 AM
Hi,


> You'd have to call svc_sock_received() or else sk_busy will remain
> set. There's no reason why you can't do that immediately after calling
> skb_recv_datagram().

Yes, it works.

> In any case, the setting of sk_data there is wrong. It's done without
> holding the svsk->sk_lock, and it sets sk_data to 1.

I don't think you need any protection on sk_data there.
In fact, I don't think you even need a memory barrier after it;
svc_sock_received() should take care of it all nicely.

> Ideally, sk_data is supposed to be incremented by svc_udp_data_ready()
> and then decremented by svc_udp_recvfrom(). That way it reflects the
> number of requests remaining in the buffer (well sort of - UDP may
> discard requests).

Hmmm, as the TCP code uses it?

Since unconditionally setting sk_data always causes another nfsd to
be scheduled (which tends to spin on the global kernel lock as soon as
it starts to run), it is far from ideal.
I've tried making the setting of sk_data conditional on a peek at the
socket. This works fine for low/medium workloads, but loses out on
high loads (where there will almost always be a new request waiting
for an nfsd thread, even if there wasn't one a few cycles ago).
Turning sk_data into a counter for UDP is worth investigating; then
it would need the lock in svc_udp_recvfrom().


> > Doesn't this also expose a limitation in the page allocator?
> > Here, we're calling it with GFP_ATOMIC because we don't want
> > to block (fine). But we're not inside an interrupt or
> > bottom-half handler, so why can't we steal from the inactive
> > page lists? Unfortunately, GFP_ATOMIC doesn't allow that (it
> > assumes the worst).
>
> There's no need for it. We can block AFAICS.

OK, but let's make sure svc_sock_received() is called before blocking.

Mark

Trond Myklebust

Feb 15, 2001, 8:13:47 AM
>>>>> " " == Mark Hemment <mar...@veritas.com> writes:

>> Ideally, sk_data is supposed to be incremented by
>> svc_udp_data_ready() and then decremented by
>> svc_udp_recvfrom(). That way it reflects the number of requests
>> remaining in the buffer (well sort of - UDP may discard
>> requests).

> Hmmm, as the TCP code uses it?

Yes. That would give fewer false positives, and hence fewer
unnecessary wakeups of the threads.

> I've tried making the setting of sk_data conditional on a
> peek at the socket. This works fine for low/medium workloads,
> but loses out on high loads (where there will almost always be
> a new request waiting for an nfsd thread, even if there wasn't
> one a few cycles ago).

That's race prone unless you hold the spinlock and keep bottom halves
disabled over the call to the peek, something which may be dangerous
(we're not supposed to know what goes on in the networking layer).

> Turning sk_data into a counter for UDP is worth
> investigating; then it would need the lock in
> svc_udp_recvfrom().

No, just use svc_sock_received() to subtract the number of skbs that
you've just picked up from the accounting. Something like:

-       svsk->sk_data = 0;
        while ((skb = skb_recv_datagram(svsk->sk_sk, 0, 1, &err)) == NULL) {
-               svc_sock_received(svsk, 0);
+               svc_sock_received(svsk, 1);
                if (err == -EAGAIN)
                        return err;
                /* possibly an icmp error */
                dprintk("svc: recvfrom returned error %d\n", -err);
        }
+       svsk->sk_sk->stamp = skb->stamp;
+       svc_sock_received(svsk, 1);

and then remove the call to svc_sock_received() at the end.

Cheers,
Trond

Mark Hemment

Feb 15, 2001, 8:54:31 AM
Hi Trond,

> > I've tried making the setting of sk_data conditional on a
> > peek at the socket. This works fine for low/medium workloads,
> > but loses out on high loads (where there will almost always be
> > a new request waiting for an nfsd thread, even if there wasn't
> > one a few cycles ago).
>
> That's race prone unless you hold the spinlock and keep bottom halves
> disabled over the call to the peek, something which may be dangerous
> (we're not supposed to know what goes on in the networking layer).

No, you don't need the lock if using sk_data as a boolean (you can only
race in setting it to true - which isn't a lost event).
Obviously, with it as a counter you won't even be doing this (and I now
much prefer it as a counter).



> No just use svc_sock_received() to subtract the number of skbs that
> you've just picked up from the accounting. Something like:

True.
I'll look over the code sometime and double-check that it's not
possible to lose events.

Thanks,
Mark

David S. Miller

Feb 17, 2001, 4:49:04 PM

Trond Myklebust writes:
> >>>>> " " == Mark Hemment <mar...@veritas.com> writes:
>
> > With David's patches, we could have a function similar to
> > skb_linearize(), but which takes a pointer to a thread's
> > pre-allocated pull-up/receive buffer:
> > skb_linearize_buff(skb, rqstp->some_pointer_to_buffer);
>
> > OK, there would need to be a bit of extra crud, but if it
> > avoids an allocation on each incoming request, so much the
> > better. If it avoids a GFP_ATOMIC allocation for sizes
> > greater than 4096, then we're winning all the way.
>
> I fully agree with this. Actually, I don't understand why we need
> GFP_ATOMIC at all, even in the existing zero-copy patch. You can
> rely on svc_udp_recvfrom() always being called in the context of the
> nfsd/lockd thread. David?

It is a brain fart, nothing more. GFP_ATOMIC is not required here.

Or maybe Alexey was just being lazy and did not want to verify
all code paths to see if any spinlocks were held or not :-)

Later,
David S. Miller
da...@redhat.com
