scp more perfectly fills the pipe than NFS/TCP

Zaphod Beeblebrox

Dec 19, 2009, 12:49:17 AM

to freebsd...@freebsd.org

Here's an interesting conundrum. I don't know what's different
between the TCP that scp uses and the TCP that NFS uses, but given
the same two FreeBSD machines, scp fills the pipe with packets better.

Examine the following graphic: http://www.eicat.ca/~dgilbert/example-mrtg.png

The system doing the scp and the NFS server is FreeBSD-7.2-p1. The
system receiving the scp and the NFS client is FreeBSD-8.0-p1.

The scp transfer is the left hand side of the graph and the NFS
transfer is on the right.

The NFS is mounted with "-3 -T -b -l -i" and no other options. Files
are being moved over NFS with the system "mv" command. The files in
each case are large (50 to 500 meg files).

The connection is a DSL that terminates on the local lan near the
server (I own and run the DSL and the ISP).

In either case, the connection is lightly used by only me --- and I'm
fairly certain that this isn't another network factor at play.
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"

Dan Nelson

Dec 20, 2009, 12:27:44 AM

to Zaphod Beeblebrox, freebsd...@freebsd.org

In the last episode (Dec 19), Zaphod Beeblebrox said:
> Here's an interesting conundrum. I don't know what's different between
> the TCP that scp uses and the TCP that NFS uses, but given the same two
> FreeBSD machines, scp fills the pipe with packets better.
>
> Examine the following graphic: http://www.eicat.ca/~dgilbert/example-mrtg.png
>
> The system doing the scp and the NFS server is FreeBSD-7.2-p1. The system
> receiving the scp and the NFS client is FreeBSD-8.0-p1
>
> The scp transfer is the left hand side of the graph and the NFS transfer
> is on the right.
>
> The NFS is mounted with "-3 -T -b -l -i" and no other options. Files are
> being moved over NFS with the system "mv" command. The files in each case
> are large (50 to 500 meg files).

If you increase the NFS blocksize (-r 32768 for example) you will get
slightly better performance, but you will likely never match the scp
results. They're doing two different things under the hood: scp is
streaming the entire file in one operation, while NFS is performing many
"read 8k at offset 0", "read 8k at offset 8k", etc requests one after
another, so a high-latency connection will take a performance hit due to the
latency in issuing each command. According to the mount_nfs manpage, it
looks like there is some prefetching that can be enabled with the "-a ##"
option. It doesn't say what the default is, though.
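
To put rough numbers on the one-request-at-a-time picture, here is a minimal Python sketch. It assumes a link of 500,000 bytes/sec (about 4 megabit) and a 10 ms round trip, and it assumes the client never has more than one read outstanding. The function name is made up for illustration. Real NFS clients usually do some read-ahead, so treat this as a pessimistic model, not a measurement.

```python
# Rough model: each block is requested only after the previous one has arrived.
# Link rate and round-trip time are illustrative assumptions, not measurements.

def sync_read_throughput(block_bytes, link_bytes_per_sec, rtt_sec):
    """Bytes/sec achieved with exactly one read request outstanding at a time."""
    transfer_time = block_bytes / link_bytes_per_sec  # time the block spends on the wire
    return block_bytes / (rtt_sec + transfer_time)    # one idle round trip per block

link = 500_000  # bytes per second (assumed)
rtt = 0.010     # seconds (assumed)

for block in (8192, 32768):
    rate = sync_read_throughput(block, link, rtt)
    print(f"{block:6d}-byte reads: {rate / link:.0%} of link rate")
```

In this model a larger block size wastes a smaller share of the time on idle round trips, while a streaming transfer such as scp has no per-block idle time at all.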

--
Dan Nelson
dne...@allantgroup.com

Zaphod Beeblebrox

Dec 21, 2009, 12:49:48 AM

to Dan Nelson, freebsd...@freebsd.org

On Sun, Dec 20, 2009 at 12:27 AM, Dan Nelson <dne...@allantgroup.com> wrote:
> In the last episode (Dec 19), Zaphod Beeblebrox said:
>> Here's an interesting conundrum. I don't know what's different between
>> the TCP that scp uses and the TCP that NFS uses, but given the same two
>> FreeBSD machines, scp fills the pipe with packets better.
>>
>> Examine the following graphic: http://www.eicat.ca/~dgilbert/example-mrtg.png
>>
>> The system doing the scp and the NFS server is FreeBSD-7.2-p1. The system
>> receiving the scp and the NFS client is FreeBSD-8.0-p1
>>
>> The scp transfer is the left hand side of the graph and the NFS transfer
>> is on the right.
>>
>> The NFS is mounted with "-3 -T -b -l -i" and no other options. Files are
>> being moved over NFS with the system "mv" command. The files in each case
>> are large (50 to 500 meg files).
>
> If you increase the NFS blocksize (-r 32768 for example) you will get
> slightly better performance, but you will likely never match the scp
> results. They're doing two different things under the hood: scp is
> streaming the entire file in one operation, while NFS is performing many
> "read 8k at offset 0", "read 8k at offset 8k", etc requests one after
> another, so a high-latency connection will take a performance hit due to the
> latency in issuing each command. According to the mount_nfs manpage, it
> looks like there is some prefetching that can be enabled with the "-a ##"
> option. It doesn't say what the default is, though.

While the link is slow, it is really directly connected with a latency
of 10ms or so. Isn't mv mmap()'ing large enough regions to cause
there to be a reasonable queue to transfer?

Dan Nelson

Dec 21, 2009, 2:28:27 AM

to Zaphod Beeblebrox, freebsd...@freebsd.org

I've never been impressed with FreeBSD's ability to detect sequential read
patterns and prefetch blocks ahead of time, even on local ufs filesystems.

--
Dan Nelson
dne...@allantgroup.com

Dag-Erling Smørgrav

Dec 21, 2009, 7:35:04 AM

to Zaphod Beeblebrox, freebsd...@freebsd.org, Dan Nelson

Zaphod Beeblebrox <zbe...@gmail.com> writes:
> While the link is slow, it is really directly connected with a latency
> of 10ms or so.

10 ms is pretty high. A "direct connection" (same Ethernet segment)
should have a round-trip latency well below 1 ms.

DES
--
Dag-Erling Smørgrav - d...@des.no

Matthew Dillon

Dec 21, 2009, 3:54:41 PM

to Zaphod Beeblebrox, freebsd...@freebsd.org

Play with the read-ahead mount options for NFS, but it might require
more work with that kind of latency. You need to be able to have
a lot of RPC's in-flight to maintain the pipeline with higher latencies.
At least 16 and possibly more.

It might be easier to investigate why the latency is so high and fix
that first. 10ms is way too high for a LAN.

I remember there was some work in the FreeBSD tree to clean up the
client-side NFS rpc mechanics but if they are still threaded (kernel
thread or user thread, doesn't matter) with one synchronous RPC per
thread then a large amount of read-ahead will cause the requests to be
issued out of order over the wire (for both TCP and UDP NFS mounts),
which really messes up the server-side heuristics. Plus the
client-side threads wind up competing with each other for the
socket lock. So there is a limit to how large a read-ahead you
can specify and still get good results.

If they are using a single kernel thread for socket reading and a
single kernel thread for socket writing (i.e. a 100% async RPC model,
which is what DFly uses), then you can boost the read-ahead to 50+.
At that point the socket buffer becomes the limiting factor in the
pipeline.

Make sure the NFS mount is TCP (It defaults to TCP in FreeBSD 8+). UDP
mounts will not perform well with any read-ahead greater than 3 or 4
RPCs because occasional seek latencies on the server will cause
random UDP RPCs to timeout and retry, which completely destroys
performance. UDP mounts have no understanding of the RPC queue backlog
on the server and treat each RPC independently for timeout/retry
purposes. So one minor stall can blow up every single pending RPC
backed up behind the one that stalled.

-Matt

Zaphod Beeblebrox

Dec 21, 2009, 4:07:04 PM

to Matthew Dillon, freebsd...@freebsd.org

I must say that I often deeply respect your position and your work,
but your recent willingness to jump into a conversation without
reading the whole of it ... simply to point out some point where your
pet is better than the subject of the list... is disappointing. Case
in point...

On Mon, Dec 21, 2009 at 3:42 PM, Matthew Dillon
<dil...@apollo.backplane.com> wrote:
> Play with the read-ahead mount options for NFS, but it might require
> more work with that kind of latency. You need to be able to have
> a lot of RPC's in-flight to maintain the pipeline with higher latencies.
> At least 16 and possibly more.

I should almost label that ObContent.

> It might be easier to investigate why the latency is so high and fix
> that first. 10ms is way too high for a LAN.

Ref. my original post. The connection is DSL, but completely
managed. 10ms is fairly good for DSL.

> Make sure the NFS mount is TCP (It defaults to TCP in FreeBSD 8+). UDP
> mounts will not perform well with any read-ahead greater than 3 or 4
> RPCs because occasional seek latencies on the server will cause
> random UDP RPCs to timeout and retry, which completely destroys
> performance. UDP mounts have no understanding of the RPC queue backlog
> on the server and treat each RPC independently for timeout/retry
> purposes. So one minor stall can blow up every single pending RPC
> backed up behind the one that stalled.

Again, from the original post, not only was -T specified, but (as you
say) it is the default for FreeBSD 8.

For a 4 megabit pipe, very few transactions need to be in flight to
fill it. Does NFS over TCP use features like selective ACK? Is it the
same stack as the one that scp is using?

Matthew Dillon

Dec 21, 2009, 4:40:24 PM

to Zaphod Beeblebrox, freebsd...@freebsd.org

I'm just covering all the bases. To be frank, half the time when
someone posts they are doing something a certain way it turns out that
they actually aren't. I've learned that covering the bases tends to
lead to solutions more quickly than assuming a perfect rendition.

For example, is that 10ms latency with a ping? What about a
ping -s 4000? If you are talking about 16KB RPC transactions over
TCP then the real question is what is the latency for 16KB of data
coming back along the wire?

In your case we can calculate the read-ahead needed to keep the pipe
full. 500 KBytes/sec divided by 16KB is about 31 transactions per second,
so each 16KB reply takes about 32ms on the wire, plus probably 5-10ms for
the RPC to be sent... so probably more around 40ms. Not 10ms. And if you
are using 32KB transactions the latency is going to be more around 70ms.

500 KBytes/sec x 40ms is about 20KB, so theoretically a read-ahead of
2 RPCs (16KB each) should do the trick.
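
The arithmetic above can be checked with a short Python sketch. It assumes "KB" means 1024 bytes and takes the 40 ms figure as given. The helper names are invented for illustration and are not part of any NFS tooling.

```python
import math

KB = 1024  # assumption: the post's "KB" is treated as 1024 bytes

def wire_time_ms(rpc_bytes, link_bytes_per_sec):
    """Time one RPC reply occupies the link, in milliseconds."""
    return 1000.0 * rpc_bytes / link_bytes_per_sec

def read_ahead_needed(link_bytes_per_sec, latency_ms, rpc_bytes):
    """Bytes in flight (bandwidth x latency) and the RPC count that covers them."""
    in_flight_bytes = link_bytes_per_sec * latency_ms / 1000.0
    return in_flight_bytes, math.ceil(in_flight_bytes / rpc_bytes)

link = 500 * KB  # 500 KBytes/sec, roughly a 4 megabit pipe

print(f"16KB RPC wire time: {wire_time_ms(16 * KB, link):.0f} ms")
print(f"32KB RPC wire time: {wire_time_ms(32 * KB, link):.0f} ms")

in_flight, rpcs = read_ahead_needed(link, 40, 16 * KB)
print(f"bytes in flight at 40 ms: {in_flight / KB:.0f} KB, about {rpcs} RPCs of 16KB")
```

The same helper can be rerun with a measured latency (for example from a large ping) in place of the 40 ms estimate.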

There's a catch, however. Depending on the client-side implementation
the read-ahead requests may be transmitted out of order. That is
if the cp or dd program wants to read blocks 0, 1, 2, 3, 4, the
actual RPC's sent over the wire might be sent like this: 0, 2, 1, 4, 3,
or even 0, 4, 1, 2, 3. Someone who knows what work was done on the
FreeBSD NFS stack can tell you whether that is still the case. If
the nfsiod's (whether kernel threads or not) are separate synchronous
RPCs then the read-ahead can transmit the RPC requests out of order.
The server may also respond to them out of order... (typically there
being 4 server-side threads handling RPCs). The combination is deadly.

If the read-aheads transmit out of order what happens is that
cp/dd/whatever on the client winds up stalling waiting for the
next linear block to come back, which might be BEHIND a later
read-ahead block coming back down the wire. That is, during the stall
the RPC latency winds up being multiplied by N. A 40ms turn can
turn into an 80 or 120ms turn before the cp/dd/whatever unstalls.
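
The stall can be illustrated with a toy Python model. It assumes one reply arrives per 40 ms "turn" in the order given and that the reader consumes blocks strictly in sequence. The function name and the turn length are assumptions for illustration, not a description of the FreeBSD client.

```python
def delivery_turns(arrival_order):
    """Turn at which a strictly sequential reader can consume each block."""
    arrived = set()
    next_needed = 0
    delivered_at = {}
    for turn, block in enumerate(arrival_order, start=1):
        arrived.add(block)
        # Hand over every block that is now contiguous with what was already read.
        while next_needed in arrived:
            delivered_at[next_needed] = turn
            next_needed += 1
    return delivered_at

TURN_MS = 40  # assumed per-RPC turn time, taken from the estimate in the post

for order in ([0, 1, 2, 3, 4], [0, 2, 1, 4, 3], [0, 4, 1, 2, 3]):
    turns = delivery_turns(order)
    waits = [turns[0]] + [turns[b] - turns[b - 1] for b in range(1, len(order))]
    print(order, "-> wait per block (ms):", [w * TURN_MS for w in waits])
```

In the reordered cases the reader waits two turns for block 1 instead of one, which is the doubling described above.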

To deal with this you want to set the read-ahead higher... probably at
least 3 or 4 RPCs.

As I said, there are other issues as the amount of read-ahead
increases. The only way to really figure out what is going on is
to tcpdump the link and determine why the pipeline is not being
maintained. Look for out of order requests, out of order responses,
and stalls (actual lost packets).

Actual lost packets are not likely in your case, assuming you are
using something like fair-share scheduling and not RED (RED should
only be used by routers in the middle of a large network, it should
never be used at the end-points).

-Matt

Matthew Dillon

Dec 21, 2009, 6:06:26 PM

to Zaphod Beeblebrox, freebsd...@freebsd.org

Oh, one more thing... I'm assuming you haven't used tcpdump with
NFS much. tcpdump has issues parsing the NFS RPC's out of a TCP
stream. For the purposes of testing you may want to temporarily
use a UDP NFS mount. tcpdump can parse the NFS RPCs out of the UDP
stream far more easily. If you use a UDP mount use the dumbtimer
option and set it to something big, like 10 seconds, so you don't
get caught up in NFS/UDP's retry code (which will confuse your
parsing of the output).

A typical tcpdump line would be something like this:

tcpdump -n -i nfe0 -l -s 4096 -vvv port 2049

Where the port is whatever port the NFS RPC's are running over
while you are running the test. You'd want to display it on a
max-width xterm, or record a bunch of it to a file and then review
it via less.

The purpose of running the tcpdump is to validate all your assumptions
as well as determine whether basic features such as read-ahead are
actually running. You can also determine if packet loss is occurring,
if requests are being sent or responded to out of order (the RPCs
tcpdump parses include the request IDs and the file offsets so it
should be easy to figure that out). You can also determine the
actual latency by looking at the timestamps for the request vs
the reply.

Once you've figured out as much as you can from that you can try
tcpdumping the TCP stream. In this case you may not be able to
pick out RPCs but you should be able to determine whether the
requests are being pipelined and whether any packet loss is occurring
or not. You can also determine whether the TCP link is working
properly... i.e. that the TCP packets are properly flagging the
'P'ushes and not delaying the responses, and that the link isn't
blowing out its socket buffer or TCP window (those are two separate
things). The kernel should be scaling things properly but you never
know.

-Matt