Examine the following graphic: http://www.eicat.ca/~dgilbert/example-mrtg.png
The system doing the scp and the NFS server is FreeBSD-7.2-p1. The
system receiving the scp and the NFS client is FreeBSD-8.0-p1.
The scp transfer is the left hand side of the graph and the NFS
transfer is on the right.
The NFS is mounted with "-3 -T -b -l -i" and no other options. Files
are being moved over NFS with the system "mv" command. The files in
each case are large (50 to 500 MB each).
The connection is a DSL line that terminates on the local LAN near the
server (I own and run the DSL and the ISP).
In either case, the connection is lightly used by only me --- and I'm
fairly certain that this isn't another network factor at play.
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"
If you increase the NFS blocksize (-r 32768 for example) you will get
slightly better performance, but you will likely never match the scp
results. They're doing two different things under the hood: scp is
streaming the entire file in one operation, while NFS is performing many
"read 8k at offset 0", "read 8k at offset 8k", etc requests one after
another, so a high-latency connection will take a performance hit due to the
latency in issuing each command. According to the mount_nfs manpage, it
looks like there is some prefetching that can be enabled with the "-a ##"
option. It doesn't say what the default is, though.
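A rough sketch of the effect described above, assuming a strictly synchronous client that waits for each reply before issuing the next read (real clients do some read-ahead, so this is a worst case). The function name, the 10 ms round trip, and the 500,000 bytes/sec link rate are illustrative assumptions, not measurements from this setup:

```python
def sync_read_throughput(block_bytes, rtt_s, link_bytes_per_s):
    """Throughput when every read waits for the previous reply.

    Each request costs one round trip plus the time needed to push the
    block through the link, so the link sits idle for one round trip
    per block.
    """
    time_per_block = rtt_s + block_bytes / link_bytes_per_s
    return block_bytes / time_per_block


if __name__ == "__main__":
    link = 500_000  # bytes/sec, assumed: roughly a 4 megabit link
    rtt = 0.010     # seconds, assumed round-trip time
    for block in (8192, 32768):
        rate = sync_read_throughput(block, rtt, link)
        print(f"{block:6d}-byte reads: {rate / 1000:.0f} KB/s "
              f"of {link / 1000:.0f} KB/s available")
```

Larger blocks amortize the round trip over more data, which is why raising the read size helps a little, but in this model the link is never completely full.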
--
Dan Nelson
dne...@allantgroup.com
While the link is slow, it is really directly connected with a latency
of 10ms or so. Isn't mv mmap()'ing large enough regions to cause
there to be a reasonable queue to transfer?
I've never been impressed with FreeBSD's ability to detect sequential read
patterns and prefetch blocks ahead of time, even on local ufs filesystems.
--
Dan Nelson
dne...@allantgroup.com
10 ms is pretty high. A "direct connection" (same Ethernet segment)
should have a round-trip latency well below 1 ms.
DES
--
Dag-Erling Smørgrav - d...@des.no
It might be easier to investigate why the latency is so high and fix
that first. 10ms is way too high for a LAN.
I remember there was some work in the FreeBSD tree to clean up the
client-side NFS rpc mechanics but if they are still threaded (kernel
thread or user thread, doesn't matter) with one synchronous RPC per
thread then a large amount of read-ahead will cause the requests to be
issued out of order over the wire (for both TCP and UDP NFS mounts),
which really messes up the server-side heuristics. Plus the
client-side threads wind up competing with each other for the
socket lock. So there is a limit to how large a read-ahead you
can specify and still get good results.
If they are using a single kernel thread for socket reading and a
single kernel thread for socket writing (i.e. a 100% async RPC model,
which is what DFly uses), then you can boost the read-ahead to 50+.
At that point the socket buffer becomes the limiting factor in the
pipeline.
Make sure the NFS mount is TCP (It defaults to TCP in FreeBSD 8+). UDP
mounts will not perform well with any read-ahead greater than 3 or 4
RPCs because occasional seek latencies on the server will cause
random UDP RPCs to time out and retry, which completely destroys
performance. UDP mounts have no understanding of the RPC queue backlog
on the server and treat each RPC independently for timeout/retry
purposes. So one minor stall can blow up every single pending RPC
backed up behind the one that stalled.
-Matt
On Mon, Dec 21, 2009 at 3:42 PM, Matthew Dillon
<dil...@apollo.backplane.com> wrote:
>    Play with the read-ahead mount options for NFS, but it might require
>    more work with that kind of latency.  You need to be able to have
>    a lot of RPC's in-flight to maintain the pipeline with higher latencies.
>    At least 16 and possibly more.
I should almost label that ObContent.
>    It might be easier to investigate why the latency is so high and fix
>    that first.  10ms is way too high for a LAN.
Ref. my original post. The connection is DSL, but completely
managed. 10ms is fairly good for DSL.
>    Make sure the NFS mount is TCP (It defaults to TCP in FreeBSD 8+).  UDP
>    mounts will not perform well with any read-ahead greater than 3 or 4
>    RPCs because occasional seek latencies on the server will cause
>    random UDP RPCs to time out and retry, which completely destroys
>    performance.  UDP mounts have no understanding of the RPC queue backlog
>    on the server and treat each RPC independently for timeout/retry
>    purposes.  So one minor stall can blow up every single pending RPC
>    backed up behind the one that stalled.
Again, from the original post, not only was -T specified, but (as you
say) it is the default for FreeBSD 8.
For a 4 megabit pipe, very few transactions need to be in flight to
fill it. Does the TCP NFS use tech like selective ack? Is it the
same stack as the one that scp is using?
For example, is that 10ms latency with a ping? What about a
ping -s 4000? If you are talking about 16KB RPC transactions over
TCP then the real question is what is the latency for 16KB of data
coming back along the wire?
In your case we can calculate the read-ahead needed to keep the pipe
full. 500 KBytes/sec divided by 16KB is 31 transactions per second,
or an effective latency of 32ms + probably 5-10 for the RPC to be
sent... so probably more around 40ms. Not 10ms. And if you are using
32KB transactions the latency is going to be more around 70ms.
500 KBytes/sec x 40ms is about 20KB, so theoretically a read-ahead of
2 RPCs should do the trick.
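The arithmetic above can be sketched as follows. This is only a restatement of the estimate, not a measurement; the function name is made up for illustration, and the 8 ms request overhead is an assumed value inside the 5-10 ms range mentioned above:

```python
import math


def rpcs_in_flight(link_bytes_per_s, rpc_bytes, overhead_s):
    """Estimate how many RPCs must be outstanding to keep the link busy.

    Returns (effective latency in seconds, bytes that must be in
    flight, number of RPCs needed to cover those bytes).
    """
    # Time for one RPC's worth of data to cross the link, plus the
    # assumed time for the request itself to be sent.
    effective_latency_s = rpc_bytes / link_bytes_per_s + overhead_s
    # Bandwidth-delay product: bytes in flight needed to fill the pipe.
    bytes_in_flight = link_bytes_per_s * effective_latency_s
    return (effective_latency_s, bytes_in_flight,
            math.ceil(bytes_in_flight / rpc_bytes))


if __name__ == "__main__":
    latency, in_flight, rpcs = rpcs_in_flight(500 * 1024, 16 * 1024, 0.008)
    print(f"effective latency: {latency * 1000:.0f} ms")
    print(f"bytes in flight:   {in_flight / 1024:.0f} KB")
    print(f"RPCs needed:       {rpcs}")
```

With 16KB RPCs on a 500 KBytes/sec link this gives about 40 ms, 20KB, and 2 RPCs, matching the figures above.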
There's a catch, however. Depending on the client-side implementation
the read-ahead requests may be transmitted out of order. That is
if the cp or dd program wants to read blocks 0, 1, 2, 3, 4, the
actual RPC's sent over the wire might be sent like this: 0, 2, 1, 4, 3,
or even 0, 4, 1, 2, 3. Someone who knows what work was done on the
FreeBSD NFS stack can tell you whether that is still the case. If
the nfsiod's (whether kernel threads or not) are separate synchronous
RPCs then the read-ahead can transmit the RPC requests out of order.
The server may also respond to them out of order... (typically there
being 4 server-side threads handling RPCs). The combination is deadly.
If the read-aheads transmit out of order what happens is that
cp/dd/whatever on the client winds up stalling waiting for the
next linear block to come back, which might be BEHIND a later
read-ahead block coming back down the wire. That is, during the
stall the RPC latency winds up being multiplied by N. A 40ms turn can
turn into an 80 or 120ms turn before the cp/dd/whatever unstalls.
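A toy model of that stall, under simplifying assumptions: replies come back one at a time, each taking one fixed "turn" on the wire, and the reader can only consume blocks in linear order. The function name and the model itself are illustrative and say nothing about how the FreeBSD client actually schedules RPCs:

```python
def in_order_ready_times(arrival_order, turn_ms):
    """Return when each block becomes usable by a sequential reader.

    arrival_order lists block numbers in the order their replies come
    back; each reply is assumed to occupy one turn of turn_ms.
    """
    # Arrival time of each block: its slot in the reply stream.
    arrival = {block: (slot + 1) * turn_ms
               for slot, block in enumerate(arrival_order)}
    ready = []
    latest = 0
    for block in sorted(arrival):
        # A block is only usable once it and all earlier blocks are in.
        latest = max(latest, arrival[block])
        ready.append(latest)
    return ready


if __name__ == "__main__":
    print(in_order_ready_times([0, 1, 2, 3, 4], 40))
    print(in_order_ready_times([0, 4, 1, 2, 3], 40))
```

In the second ordering, block 1 sits behind block 4 on the wire, so the reader waits 80ms after block 0 instead of 40ms.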
To deal with this you want to set the read-ahead higher... probably at
least 3 or 4 RPCs.
As I said, there are other issues as the amount of read-ahead
increases. The only way to really figure out what is going on is
to tcpdump the link and determine why the pipeline is not being
maintained. Look for out of order requests, out of order responses,
and stalls (actual lost packets).
Actual lost packets are not likely in your case, assuming you are
using something like fair-share scheduling and not RED (RED should
only be used by routers in the middle of a large network, it should
never be used at the end-points).
-Matt
A typical tcpdump line would be something like this:
tcpdump -n -i nfe0 -l -s 4096 -vvv port 2049
Where the port is whatever port the NFS RPC's are running over
while you are running the test. You'd want to display it on a
max-width xterm, or record a bunch of it to a file and then review
it via less.
The purpose of running the tcpdump is to validate all your assumptions
as well as determine whether basic features such as read-ahead are
actually running. You can also determine if packet loss is occurring,
or if requests are being sent or responded to out of order (the RPC
output tcpdump parses includes the request IDs and the file offsets, so it
should be easy to figure that out). You can also determine the
actual latency by looking at the timestamps for the request vs
the reply.
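Once the trace has been reduced to one record per RPC message, the checks described above are easy to script. The sketch below assumes a hypothetical, already-parsed record format of (timestamp, kind, request id, offset) tuples; it does not parse tcpdump output itself, and turning the real output into these tuples is left out:

```python
def count_out_of_order(offsets):
    """Count places where an offset is lower than the one before it."""
    return sum(1 for prev, cur in zip(offsets, offsets[1:]) if cur < prev)


def analyze_rpcs(records):
    """Summarize latency and ordering from pre-parsed RPC records.

    records: iterable of (timestamp_s, kind, xid, offset) tuples, where
    kind is "call" or "reply". Replies are matched to calls by xid, so
    the offset field of a reply is ignored (it may be None).
    """
    call_time = {}
    call_offset = {}
    latencies = {}
    call_offsets = []
    reply_offsets = []
    for ts, kind, xid, offset in records:
        if kind == "call":
            call_time[xid] = ts
            call_offset[xid] = offset
            call_offsets.append(offset)
        elif kind == "reply" and xid in call_time:
            latencies[xid] = ts - call_time[xid]
            reply_offsets.append(call_offset[xid])
    return (latencies,
            count_out_of_order(call_offsets),
            count_out_of_order(reply_offsets))


if __name__ == "__main__":
    sample = [  # made-up records for illustration only
        (0.000, "call", 1, 0),
        (0.001, "call", 2, 32768),
        (0.002, "call", 3, 16384),
        (0.040, "reply", 1, None),
        (0.080, "reply", 3, None),
        (0.120, "reply", 2, None),
    ]
    latencies, bad_calls, bad_replies = analyze_rpcs(sample)
    print(latencies, bad_calls, bad_replies)
```

A non-zero out-of-order count for calls points at the client side; a non-zero count for replies with in-order calls points at the server.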
Once you've figured out as much as you can from that you can try
tcpdumping the TCP stream. In this case you may not be able to
pick out RPCs but you should be able to determine whether the
requests are being pipelined and whether any packet loss is occurring
or not. You can also determine whether the TCP link is working
properly... i.e. that the TCP packets are properly flagging the
'P'ushes and not delaying the responses, and that the link isn't
blowing out its socket buffer or TCP window (those are two separate
things). The kernel should be scaling things properly but you never
know.
-Matt