
Inner workings of NFS


Per S

Jun 16, 2001, 9:54:42 AM
Is it correct to state that an NFS client (v2 and v3, both UDP and TCP)
works single-threaded, i.e.
send one write_block
wait for reply
send next write_block
...
...
commit (if v3)

We are designing another protocol that uses RPC over TCP and I need
an educated guess about performance.

Given a server that works like nfsd and
a single-threaded client
(no commit in the protocol, as in NFSv2)
TCP
block size 4K
100 Mbit Ethernet
could one possibly get a throughput of 1 megabyte/s?
What about performance if you change the protocol to work as NFSv3?
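
To make the question concrete, here is a minimal C sketch of such a
stop-and-wait client loop. The wire format (a bare 4-byte status reply)
is a placeholder of my own, not NFS or ONC RPC:

#include <stdio.h>
#include <unistd.h>

enum { BLOCK = 4096 };

/* Send the whole buffer, handling short writes. */
static int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* One block per round trip: however fat the pipe, throughput is
 * capped at BLOCK bytes per round-trip time. */
int copy_stop_and_wait(int sock, FILE *src)
{
    char block[BLOCK];
    char reply[4];          /* placeholder status reply */
    size_t n;

    while ((n = fread(block, 1, sizeof block, src)) > 0) {
        if (send_all(sock, block, n) < 0)
            return -1;
        /* Wait for the server's reply before sending the next block;
         * this wait is exactly what serializes the transfer. */
        if (read(sock, reply, sizeof reply) <= 0)
            return -1;
    }
    return 0;
}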

Casper H.S. Dik - Network Security Engineer

Jun 16, 2001, 11:20:56 AM
[[ PLEASE DON'T SEND ME EMAIL COPIES OF POSTINGS ]]

uab...@my-deja.com (Per S) writes:

>Is it correct to state that an NFS client (v2 and v3, both UDP and TCP)
>works single-threaded, i.e.
>send one write_block
>wait for reply
>send next write_block
>...
>...
>commit (if v3)

No, that's not correct. Some clients act that way (I think Linux
does or did), but that's not wise performance-wise. Sun's implementations
have used various mechanisms such as "biod" and now async kernel threads
to get better throughput.

This is especially important for version 2, which isn't allowed to
reply to a WRITE until the data has been committed to stable storage.

In NFSv3 you may get away with waiting *if* the server implements async
writes (it could also make COMMIT a no-op and make writes synchronous).
Typically there is an upper limit on outstanding write requests; in all
cases you need to keep the complete request around in order to retransmit
it if it gets lost (in v2 until you get the reply to the WRITE, in v3
until you get the reply to the COMMIT).
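
A minimal sketch in C of that bookkeeping, with an assumed window size
of 8; the structure and names are illustrative, not taken from any real
NFS client:

#include <stdbool.h>
#include <stddef.h>

#define MAX_OUTSTANDING 8      /* assumed cap; real clients vary */
#define BLOCK 4096

struct pending_write {
    unsigned long xid;         /* RPC transaction id */
    long long     offset;      /* file position of the block */
    size_t        len;
    char          data[BLOCK]; /* kept around for retransmission */
    bool          safe;        /* v2: WRITE replied; v3: COMMIT replied */
};

struct write_window {
    struct pending_write slot[MAX_OUTSTANDING];
    int in_flight;             /* sent but not yet safe to forget */
};

/* The sender must block (or queue) once the window is full. */
static bool window_has_room(const struct write_window *w)
{
    return w->in_flight < MAX_OUTSTANDING;
}

/* v3: only a successful COMMIT covering these writes lets the client
 * drop the buffered data; until then a server crash means resend. */
static void commit_done(struct write_window *w)
{
    w->in_flight = 0;
}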

>We are designing another protocol that uses RPC over TCP and I need
>an educated guess about performance.

>Given a server that works like nfsd and
>a single-threaded client
>(no commit in the protocol, as in NFSv2)
>TCP
>block size 4K
>100 Mbit Ethernet
>could one possibly get a throughput of 1 megabyte/s?

Possibly; but the latency might kill you. 1 MB/s at 4K per request is 256 req/s.
(In other words your disk must have a 4 ms latency, so that's cutting it close.)
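
The arithmetic, spelled out; only the 4K block size and the 1 MB/s
target come from the question, the 100 Mbit wire time is added for
comparison:

#include <stdio.h>

int main(void)
{
    double block  = 4096.0;     /* bytes per request */
    double target = 1048576.0;  /* 1 MB/s */

    double reqs      = target / block;          /* 256 req/s */
    double budget_ms = 1000.0 / reqs;           /* ~3.9 ms per round trip */
    double wire_ms   = block * 8.0 / 100e6 * 1000.0; /* ~0.33 ms on the wire */

    printf("%.0f req/s; %.2f ms per round trip, of which %.2f ms is wire time\n",
           reqs, budget_ms, wire_ms);
    return 0;
}

Almost the entire 4 ms budget is client, server, and disk latency, not
the wire, which is why pipelining writes pays off so much.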

>What about performance if you change the protocol to work as NFSv3?

Changing the implementation (to pipeline requests) is better.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Per S

Jun 16, 2001, 5:27:04 PM
> No, that's not correct. Some clients act that way (I think Linux
> does or did), but that's not wise performance-wise. Sun's implementations
> have used various mechanisms such as "biod" and now async kernel threads
> to get better throughput.
>
> This is especially important for version 2, which isn't allowed to
> reply to a WRITE until the data has been committed to stable storage.
>
That's very interesting.
Does this protocol behaviour assume that all write_block(block, filepos)
requests arrive in the "correct" order? It seems very messy to implement
the server in such a way that it can go back and forth in a file that is
copied from the local to the remote filesystem - the server would have to
fill in temporary data for an intermediate block that hasn't arrived yet!

Also, if it assumes correct order - does that presume RPC over TCP?
(I.e., does RPC handle UDP's unreliability so that all RPC/UDP packets
arrive exactly once and in the right order?)

What is a typical block size for high-performance NFS v2/v3?

What throughput can you get when both client and server are on the same
local 100 Mbit net? (Let's say only one hub between the machines.)

And last: how much of the throughput in the previous scenario would be
lost if you run over UDP - 20%, 50% or 80%?

Casper H.S. Dik - Network Security Engineer

Jun 16, 2001, 5:56:37 PM
[[ PLEASE DON'T SEND ME EMAIL COPIES OF POSTINGS ]]

uab...@my-deja.com (Per S) writes:

>That's very interesting.
>Does this protocol behaviour assume that all write_block(block, filepos)
>requests arrive in the "correct" order? It seems very messy to implement
>the server in such a way that it can go back and forth in a file that is
>copied from the local to the remote filesystem - the server would have to
>fill in temporary data for an intermediate block that hasn't arrived yet!

The protocol makes no such assumption. The intermediate blocks read as
zero, as is typical on Unix. (In many implementations the intermediate
blocks are not even allocated on disk; they are unallocated blocks that
simply "read as zero".)

>Also, if it assumes correct order - does that presume RPC over TCP?
>(I.e., does RPC handle UDP's unreliability so that all RPC/UDP packets
>arrive exactly once and in the right order?)

NFS is stateless; write operations are idempotent, so there's really no
problem at all. (NFS servers typically keep some state (a reply cache)
to prevent a retried non-idempotent operation from returning an
error, as with, e.g., a retransmitted remove.)
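
A minimal sketch in C of such a reply cache, keyed by RPC transaction
id; the size and names are made up for illustration:

#include <stdbool.h>

#define CACHE_SLOTS 128

struct cached_reply {
    unsigned long xid;     /* RPC transaction id of the request */
    bool          valid;
    int           status;  /* the reply we sent the first time */
};

static struct cached_reply cache[CACHE_SLOTS];

/* On arrival: if we already answered this xid, replay that answer
 * instead of re-executing a possibly non-idempotent operation. */
bool lookup_reply(unsigned long xid, int *status)
{
    struct cached_reply *e = &cache[xid % CACHE_SLOTS];
    if (e->valid && e->xid == xid) {
        *status = e->status;
        return true;   /* duplicate: replay */
    }
    return false;      /* new request: execute, then record_reply() */
}

void record_reply(unsigned long xid, int status)
{
    struct cached_reply *e = &cache[xid % CACHE_SLOTS];
    e->xid = xid;
    e->status = status;
    e->valid = true;
}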

>What is a typical block size for high-performance NFS v2/v3?

I wouldn't know about high performance, but for NFS over UDP 8K
is fairly typical; over TCP, 32K is.

>What throughput can you get when both client and server are on the same
>local 100 Mbit net? (Let's say only one hub between the machines.)

Wirespeed (or as fast as the disk would go)

>And last: how much of the throughput in the previous scenario would be
>lost if you run over UDP - 20%, 50% or 80%?

Similar, I suspect.

John Maddalozzo

Jun 20, 2001, 1:31:28 AM
> >What is a typical block size for high-performance NFS v2/v3?
>
> I wouldn't know about high performance, but for NFS over UDP 8K
> is fairly typical; over TCP, 32K is.

Presuming you mean read/write size over the wire, bigger blocks are better.
You can tune AIX to do 64K blocks if you know the magic, but performance
seemed to top out at 32K. To the kernel the blocks are 512 bytes, but they
are aggregated into the larger over-the-wire packets. Protocol dependencies
are large, and making data transfers work with full filesystem semantics, as
you get with NFS, is difficult. That consideration far outweighs the block
size. If you just want to shoot a file over the wire, stick with scp and
friends.

>
>
> >What throughput can you get when both client and server are on the same
> >local 100 Mbit net? (Let's say only one hub between the machines.)
>
> Wirespeed (or as fast as the disk would go)
>
> >And last: how much of the throughput in the previous scenario would be
> >lost if you run over UDP - 20%, 50% or 80%?
>

Two years ago, when I was still closely involved, UDP was always FASTER on
local, clean networks (no packet drops) than TCP. I don't recall the exact
percentage, but I think it was around 10-15% faster under optimal conditions.
Nobody could explain exactly why, other than vague references to the
complexity and length of the TCP stack. I always thought of UDP in that
respect as something like assembly code: more work for the application
programmer, but potentially a larger payoff. Definitely not recommended for
Internet or lossy intranet use, though.

Eric Werme - replace nospam with werme

Jun 26, 2001, 2:57:40 PM
uab...@my-deja.com (Per S) writes:

>> No, that's not correct. Some clients act that way (I think Linux
>> does or did), but that's not wise performance-wise. Sun's implementations
>> have used various mechanisms such as "biod" and now async kernel threads
>> to get better throughput.
>>
>> This is especially important for version 2, which isn't allowed to
>> reply to a WRITE until the data has been committed to stable storage.
>>
>That's very interesting.
>Does this protocol behaviour assume that all write_block(block, filepos)
>requests arrive in the "correct" order?

No; uniprocessor and multiprocessor clients often behave quite differently.
There may be clients that try to keep writes ordered, but I think most apply
a couple of heuristics and hope for the best.

>It seems very messy to implement the server
>in such a way that it can go back and forth in a file that is copied from
>the local to the remote filesystem - the server would have to fill in
>temporary data for an intermediate block that hasn't arrived yet!

IIRC, UFS handles that pretty well: when it creates a hole, it leaves
unallocated disk space such that a later fill-in results in contiguous
data.

>Also, if it assumes correct order - does that presume RPC over TCP?
>(I.e., does RPC handle UDP's unreliability so that all RPC/UDP packets
>arrive exactly once and in the right order?)

On Tru64 we present data to a TCP socket in the same order as we do for UDP.
It turns out that with several threads waiting for window space on a TCP
connection, we wind up delaying some writes multiple times while others make
it out quicker. So our write stream is actually more disordered than UDP's!

>What is a typical block size for high-performance NFS v2/v3?

32-64KB.

>What throughput can you get when both client and server are on the same
>local 100 Mbit net? (Let's say only one hub between the machines.)

As long as the disks can keep up, you should saturate the wire. Between
a pair of Compaq ES40s I can saturate Gigabit reading a file from the
server's cache. (UDP, haven't tried TCP.)

>And last: how much of the throughput in the previous scenario would be
>lost if you run over UDP - 20%, 50% or 80%?

UDP should be faster. In Tru64's case, we bypass the socket code and most
of the UDP processing on input, and do our own UDP and IP fragmentation on
output.

IP checksumming on input gets done on the CPU that is about to use the data,
so it comes into cache on the right CPU.

-Ric Werme
--
<> Eric (Ric) Werme <> The above is unlikely to contain <>
<> ROT-13 addresses: <> official claims or policies of <>
<> <jr...@mx3.qrp.pbz> <> Compaq Computer Corp. <>
<> <jr...@zrqvnbar.arg> <> http://people.ne.mediaone.net/werme <>
