Scott Larson, Lead Systems Administrator, Wiredrive <https://www.wiredrive.com/>
On Thu, Jun 25, 2015 at 5:52 AM, Gerrit Kühn <gerrit...@aei.mpg.de>
wrote:
Everyone's talking about network performance and, to some extent, NFS
tuning.
I would argue that, given your iperf results, the network itself is not at
fault.
In your first post I see no information regarding the local performance of
your disks, without NFS, that is.
You may want to look into that first and make sure you get good read and write
results on the Solaris box, before trying to fix something that might not be at
fault.
Perhaps your NFS implementation is already giving you the maximum speed the
disks can achieve, or close enough.
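For a rough baseline, something like the following, run locally on the Solaris
box, shows what the disks can do without NFS in the picture (the scratch path
below is just a placeholder):

# write test against local storage; /pool/scratch/testfile is a placeholder path
dd if=/dev/zero of=/pool/scratch/testfile bs=1024k count=1000
# read it back (note the result may be inflated by the server's cache)
dd if=/pool/scratch/testfile of=/dev/null bs=1024k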
You may also want to compare the results with another NFS client against the
Oracle server, say, god forbid, a *nux box.
However, I need to note that NFS traffic is very different from what iperf
generates, and a good result from iperf does not imply that there isn't a
network-related problem causing NFS grief.
A couple of examples:
- NFS generates TSO segments that are sometimes just under 64K in length.
If the network interface has TSO enabled but cannot handle a list of
35 or more transmit segments (mbufs in the list), this can cause problems.
Systems more than about 1 year old could fail completely when the TSO
segment + IP header exceeded 64K for network interfaces limited to 32
transmit segments (32 * MCLBYTES == 64K). Also, some interfaces used
m_collapse() to try to fix the case where the TSO segment had too many
transmit segments in it, and this almost always failed (you need to use
m_defrag()).
--> The worst case failures have been fixed by reducing the default
maximum TSO segment size to slightly less than 64K (by the maximum
MAC header length).
However, drivers limited to fewer than 35 transmit segments (which
includes at least one of the most common Intel chips) still end up
generating a lot of overhead by calling m_defrag() over and over and
over again (with the possibility of failure if mbuf clusters become
exhausted).
--> To fix this well, net device drivers need to set a field called
if_hw_tsomaxsegcount, but if you look in -head, you won't find it
set in many drivers. (I've posted to freebsd-net multiple times
asking the net device driver authors to do this, but it hasn't happened
yet.)
Usually avoided by disabling TSO.
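On the FreeBSD client that would be roughly the following (assuming the
interface is ix0, as in the pciconf output further down):

# turn off TSO on the running interface
ifconfig ix0 -tso
# to make it persistent, add -tso to the ifconfig_ix0 line in /etc/rc.conf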
Another failure case I've seen in the past was where a network interface
would drop a packet in a stream of closely spaced packets on the receive
side while concurrently transmitting. (NFS traffic is bi-directional and
it is common to be receiving and transmitting on a TCP socket concurrently.)
NFS traffic is also very bursty, and that seems to cause problems for certain
network interfaces.
These can usually be worked around by reducing rsize and wsize. (Reducing rsize
and wsize also "fixes" the 64K TSO segment problem, since the TSO segments won't
be as large.)
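As an illustration only (the server path and mount point are placeholders), a
mount with smaller I/O sizes would look something like this:

# 32K transfers keep TSO segments well under the problematic ~64K mark
mount_nfs -o nfsv3,tcp,rsize=32768,wsize=32768 server:/export /mnt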
There are also issues w.r.t. kernel address space exhaustion (in the area used
for mbuf cluster mapping) when jumbo packets are used, since they result in the
allocation of mbuf clusters of multiple sizes.
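A quick way to see whether cluster exhaustion is actually happening is with the
stock tools, for example:

# look for "requests for mbufs denied" and cluster usage near the limits
netstat -m
# current limits for regular and jumbo mbuf clusters
sysctl kern.ipc.nmbclusters kern.ipc.nmbjumbop kern.ipc.nmbjumbo9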
I think you can see that not all of these will be evident from iperf results.
rick
For example, from the "ix" driver:
#define IXGBE_82598_SCATTER 100
#define IXGBE_82599_SCATTER 32
This implies that the 82598 won't have problems with 64K TSO segments, but
the 82599 will end up calling m_defrag(), which copies the entire mbuf list
into 32 new mbuf clusters every time it happens.
--> Even for one driver, different chips may result in different NFS performance.
Btw, it appears that the driver in head/current now sets if_hw_tsomaxsegcount,
but the driver in stable/10 does not. This means that the 82599 chip will end
up doing the m_defrag() calls for 10.x.
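If you want to check what your own tree does, a grep of the driver sources will
tell you (assuming the standard /usr/src layout):

grep -rn if_hw_tsomaxsegcount /usr/src/sys/dev/ixgbe/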
rick
> The fact iperf gives you the expected throughput but NFS
> does not would have me looking at tuning for the NFS platform. Other things
> to look at: Are all the servers involved negotiating the correct speed and
> duplex, with TSO? Does the network stack need to be tuned with
> whatever its equivalent of maxsockbuf and send/recvbuf is? Do the switch
> ports and NIC counters show any drops or errors? On the FBSD servers you
> could also run 'netstat -i -w 1' under load to see if drops are occurring
> locally, or 'systat -vmstat' for resource contention problems. But again, we
> have a similar setup here and no such issues have appeared.
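On FreeBSD those checks map onto roughly the following commands (all read-only,
nothing here changes a setting):

# socket buffer limits mentioned above
sysctl kern.ipc.maxsockbuf net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max
# per-second packet and drop counters while the NFS load is running
netstat -i -w 1
# overall resource contention view
systat -vmstat 1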
>
> On Thu, Jun 25, 2015 at 5:52 AM, Gerrit Kühn <gerrit...@aei.mpg.de>
> wrote:
>
> > Hi all,
> >
> > We have a recent FreeBSD 10.1 installation here that is supposed to act as
> > an NFS (v3) client to an Oracle X4-2L server running Solaris 11.2.
> > We have Intel 10-Gigabit X540-AT2 NICs on both ends, and iperf shows
> > plenty of bandwidth (9.x Gbit/s) in both directions.
> > However, NFS appears to be terribly slow, especially for writing:
> >
> > root@crest:~ # dd if=/dev/zero of=/net/hellpool/Z bs=1024k count=1000
> > 1000+0 records in
> > 1000+0 records out
> > 1048576000 bytes transferred in 20.263190 secs (51747824 bytes/sec)
> >
> >
> > Reading appears to be faster, but still far from full bandwidth:
> >
> > root@crest:~ # dd of=/dev/null if=/net/hellpool/Z bs=1024k
> > 1000+0 records in
> > 1000+0 records out
> > 1048576000 bytes transferred in 5.129869 secs (204406000 bytes/sec)
> >
> >
> > We have already tried tuning the rsize/wsize parameters, but they appear to
> > have little (if any) impact on these results. Also, neither disabling rxcsum,
> > txcsum, TSO, etc. on the interface nor increasing the MTU to 9000 for
> > jumbo frames improved anything.
> > It is quite embarrassing to achieve far less than 1GbE performance with
> > 10GbE equipment. Are there any hints as to what else might be causing this
> > (and how to fix it)?
> >
> >
> On Fri, 26 Jun 2015 20:42:08 -0400 (EDT) Rick Macklem
> <rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:
>
> RM> Btw, can you tell us what Intel chip(s) you're using?
>
> I have
>
> ix0@pci0:5:0:0: class=0x020000 card=0x00028086 chip=0x15288086 rev=0x01
> hdr=0x00 vendor = 'Intel Corporation'
> device = 'Ethernet Controller 10-Gigabit X540-AT2'
> class = network
> subclass = ethernet
>
> RM> For example, from the "ix" driver:
> RM> #define IXGBE_82598_SCATTER 100
> RM> #define IXGBE_82599_SCATTER 32
>
> Hm, I cannot figure out which chipset number this translates to for my
> device...
>
>
Extract the first four hex digits of the "chip" value (1528 here), then grep for them:

grep 1528 /usr/src/sys/dev/ixgbe/*
/usr/src/sys/dev/ixgbe/ixgbe_type.h:#define IXGBE_DEV_ID_X540T  0x1528

=> So your chipset is the X540.
On 06/29/2015 02:20 PM, Rick Macklem wrote:
> If the Solaris server is using ZFS, setting sync=disabled might help w.r.t.
> write performance. It is, however, somewhat dangerous w.r.t. loss of recently
> written data when the server crashes. (The server has told the client the data
> is safely on stable storage, so the client will not re-write the block(s), even
> though the data wasn't on stable storage and is lost.)
> (I'm not a ZFS guy, so I can't suggest more w.r.t. ZFS.)
>
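For anyone with a plain ZFS-backed export, Rick's suggestion amounts to
something like the following ("tank/export" is a placeholder dataset name):

# WARNING: recently written data can be lost if the server crashes
zfs set sync=disabled tank/export
# revert to the default behaviour
zfs set sync=standard tank/export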
The system on the other side uses SAM/QFS, i.e. there is no such option
for the file system per se (only the file system metadata is in a zvol, so
it is not a full-featured ZFS).
In parallel, we are also working with Oracle to see whether there is a
matching knob to turn, as we see roughly the same performance issues from a
Linux host (NFS client, Debian Jessie) with a Mellanox Technologies
MT27500 Family [ConnectX-3] controller.
Cheers
Carsten
--
Dr. Carsten Aulbert, Atlas cluster administration
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Callinstraße 38, 30167 Hannover, Germany
Tel: +49 511 762 17185, Fax: +49 511 762 17193