
NFS on 10G interface terribly slow


Gerrit Kühn

Jun 25, 2015, 2:16:29 PM
Hi all,

We have a recent FreeBSD 10.1 installation here that is supposed to act as
an NFS (v3) client to an Oracle X4-2L server running Solaris 11.2.
We have Intel 10-Gigabit X540-AT2 NICs on both ends, and iperf is showing
plenty of bandwidth (9.x Gbit/s) in both directions.
However, NFS appears to be terribly slow, especially for writing:

root@crest:~ # dd if=/dev/zero of=/net/hellpool/Z bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 20.263190 secs (51747824 bytes/sec)


Reading appears to be faster, but still far away from full bandwidth:

root@crest:~ # dd of=/dev/null if=/net/hellpool/Z bs=1024k
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 5.129869 secs (204406000 bytes/sec)


We have already tried to tune the rsize/wsize parameters, but they appear to
have little (if any) impact on these results. Also, neither disabling
rxcsum, txcsum, tso, etc. on the interface nor increasing the MTU to 9000 for
jumbo frames improved anything.
It is quite embarrassing to achieve far less than 1GbE performance with
10GbE equipment. Are there any hints as to what else might be causing this
(and how to fix it)?


cu
Gerrit

Scott Larson

Jun 25, 2015, 7:56:31 PM
We've got 10.0 and 10.1 servers accessing Isilon and Nexenta via NFS
with Intel 10G gear, and bursting to near wire speed with the stock
MTU/rsize/wsize works as expected. TSO definitely needs to be enabled for
that performance. The fact that iperf gives you the expected throughput but
NFS does not would have me looking at tuning on the NFS platform. Other
things to look at: Are all the servers involved negotiating the correct
speed and duplex, with TSO? Does it need to have the network stack tuned
with whatever its equivalents of maxsockbuf and send/recvbuf are? Do the
switch ports and NIC counters show any drops or errors? On the FreeBSD
servers you could also run 'netstat -i -w 1' under load to see if drops are
occurring locally, or 'systat -vmstat' for resource contention problems.
But again, we have a similar setup here and no such issues have appeared.



On Thu, Jun 25, 2015 at 5:52 AM, Gerrit Kühn <gerrit...@aei.mpg.de>
wrote:

Rick Macklem

Jun 25, 2015, 8:49:39 PM
Gerrit Kuhn wrote:
> Hi all,
>
> We have a recent FreeBSD 10.1 installation here that is supposed to act as
> an NFS (v3) client to an Oracle X4-2L server running Solaris 11.2.
> We have Intel 10-Gigabit X540-AT2 NICs on both ends, and iperf is showing
> plenty of bandwidth (9.x Gbit/s) in both directions.
> However, NFS appears to be terribly slow, especially for writing:
>
> root@crest:~ # dd if=/dev/zero of=/net/hellpool/Z bs=1024k count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 20.263190 secs (51747824 bytes/sec)
>
Recent commits to stable/10 (not in 10.1) done by Alexander Motin (mav@)
might help w.r.t. write performance (they avoid large writes being done
synchronously when wcommitsize is exceeded). If you can try stable/10, that
might be worth it.

Otherwise, the main mount option you can try is "wcommitsize", which you
probably want to make larger.
(It sounds like you already tried most of what I could suggest.)

>
> Reading appears to be faster, but still far away from full bandwidth:
>
> root@crest:~ # dd of=/dev/null if=/net/hellpool/Z bs=1024k
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes transferred in 5.129869 secs (204406000 bytes/sec)
>
You could try increasing readahead. Look for the mount option and try
cranking it up to 8 or 16.
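
For example, both options together might look something like this (the
values are purely illustrative, and the server name and paths are just
placeholders):

mount -t nfs -o nfsv3,tcp,wcommitsize=1048576,readahead=8 server:/export /mnt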

Good luck with it, rick

Gerrit Kühn

Jun 26, 2015, 5:56:38 AM
On Thu, 25 Jun 2015 12:56:36 -0700 Scott Larson <s...@wiredrive.com> wrote
about Re: NFS on 10G interface terribly slow:

SL> We've got 10.0 and 10.1 servers accessing Isilon and Nexenta via
SL> NFS with Intel 10G gear and bursting to near wire speed with the stock
SL> MTU/rsize/wsize works as expected.

That sounds promising. So we should be able to improve here, too.

SL> TSO definitely needs to be enabled for that performance.

Ok, I switched it back on.

SL> Other things to look at: Are all the servers involved negotiating the
SL> correct speed and duplex, with TSO?

We have a direct link between the systems, with only one switch in-between
acting as a transceiver to get from fibre to copper media. Both machines
and the switch show a 10G full-duplex link, not a single error or
collision to be spotted. The switch only carries these two lines, nothing
else.

SL> Does it need to have the network
SL> stack tuned with whatever it's equivalent of maxsockbuf and
SL> send/recvbuf are?

On the FreeBSD side we set

kern.ipc.maxsockbuf=33554432
net.inet.tcp.sendbuf_max=33554432
net.inet.tcp.recvbuf_max=33554432

I don't know what the equivalent for Solaris would be, still doing research
on that.
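
(For reference, the values in effect can be read back with

sysctl kern.ipc.maxsockbuf net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max

and the same name=value lines can go into /etc/sysctl.conf to persist them
across reboots.)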

SL> Do the switch ports and NIC counters show any drops
SL> or errors?

No, nothing bad to be seen there.

SL> On the FBSD servers you could also run 'netstat -i -w 1'
SL> under load to see if drops are occurring locally, or 'systat -vmstat'
SL> for resource contention problems. But again, a similar setup here and
SL> no such issues have appeared.

No errors, no collisions, no drops.
I cannot spot any bottlenecks in netstat, either. One thing I do wonder
about is that all IRQs (about 700 under load) are routed to only one queue
on the ix interface (there seems to be one queue per core by default).
Should the load be spread across the queues, or is that expected behaviour?
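
(For reference, one rough way to see how the interrupts are spread across
the ix queues, assuming the interface is ix0:

vmstat -i | grep ix0

Each queue shows up there as its own interrupt line.)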

Gerrit Kühn

Jun 26, 2015, 6:00:15 AM
On Thu, 25 Jun 2015 20:49:11 -0400 (EDT) Rick Macklem
<rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:


RM> Recent commits to stable/10 (not in 10.1) done by Alexander Motin
RM> (mav@) might help w.r.t. write performance (it avoids large writes
RM> doing synchronous writes when the wcommitsize is exceeded). If you can
RM> try stable/10, that might be worth it.

Ok, I'll schedule an update then, I guess. OTOH, Scott reported that a
similar setup is working fine for him with 10.0 and 10.1, so there is
probably not much to gain. I'll try anyway...

RM> Otherwise, the main mount option you can try is "wcommitsize", which
RM> you probably want to make larger.

Hm, which size would you recommend? I cannot find anything about this
setting, not even what the default value would be. Is this reflected in
some sysctl, or how can I find out what the actual value is?

Damien Fleuriot

Jun 26, 2015, 6:29:49 AM
Gerrit,


Everyone's talking about the network performance and to some extent NFS
tuning.
I would argue that given your iperf results, the network itself is not at
fault.

In your first post I see no information regarding the local performance of
your disks, that is, without NFS in the picture.

You may want to look into that first and ensure you get good read and write
results on the Solaris box, before trying to fix that which might not be at
fault.
Perhaps your NFS implementation is already giving you the maximum speed the
disks can achieve, or close enough.

You may also want to compare the results with another NFS client to the
Oracle server, say, god forbid, a *nux box for example.
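
For instance, a quick local write test run directly on the Solaris box
might look like this (the target path is just a placeholder somewhere on
the exported filesystem):

dd if=/dev/zero of=/export/testfile bs=1024k count=1000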

Rick Macklem

Jun 26, 2015, 7:54:20 PM
Damien Fleuriot wrote:
> Gerrit,
>
>
> Everyone's talking about the network performance and to some extent NFS
> tuning.
> I would argue that given your iperf results, the network itself is not at
> fault.
>
In this case, I think you might be correct.

However, I need to note that NFS traffic is very different than what iperf
generates and a good result from iperf does not imply that there isn't a
network related problem causing NFS grief.
A couple of examples:
- NFS generates TSO segments that are sometimes just under 64K in length.
  If the network interface has TSO enabled but cannot handle a list of
  35 or more transmit segments (mbufs in the list), this can cause problems.
  Systems more than about 1 year old could fail completely when the TSO
  segment + IP header exceeded 64K for network interfaces limited to 32
  transmit segments (32 * MCLBYTES == 64K). Also, some interfaces used
  m_collapse() to try to fix the case where the TSO segment had too many
  transmit segments in it, and this almost always failed (you need to use
  m_defrag()).
  --> The worst-case failures have been fixed by reducing the default
      maximum TSO segment size to slightly less than 64K (by the maximum
      MAC header length).
      However, drivers limited to fewer than 35 transmit segments (which
      includes at least one of the most common Intel chips) still end up
      generating a lot of overhead by calling m_defrag() over and over and
      over again (with the possibility of failure if mbuf clusters become
      exhausted).
  --> To fix this properly, net device drivers need to set a field called
      if_hw_tsomaxsegcount, but if you look in -head, you won't find it
      set in many drivers. (I've posted to freebsd-net multiple times
      asking the net device driver authors to do this, but it hasn't
      happened yet.)
  This is usually avoided by disabling TSO.
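
For testing, TSO can be toggled off on a FreeBSD client with something like
the following (assuming the interface is ix0; adding "-tso" to the
ifconfig_ix0 line in /etc/rc.conf would make it persistent):

ifconfig ix0 -tso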

Another failure case I've seen in the past was where a network interface
would drop a packet in a stream of closely spaced packets on the receive
side while concurrently transmitting. (NFS traffic is bi-directional and
it is common to be receiving and transmitting on a TCP socket concurrently.)

NFS traffic is also very bursty, and that seems to cause problems for certain
network interfaces.
These can usually be worked around by reducing rsize, wsize. (Reducing rsize, wsize
also "fixes" the 64K TSO segment problem, since the TSO segments won't be as
large.)
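
As an illustration only (32K is an arbitrary choice, and the server name
and paths are placeholders), such a mount might look like:

mount -t nfs -o nfsv3,tcp,rsize=32768,wsize=32768 server:/export /mnt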

There are also issues w.r.t. exhaustion of kernel address space (the area
used for mbuf cluster mapping) when jumbo packets are used, since they
result in the allocation of mbuf clusters of multiple sizes.

I think you can see that not all of these will be evident from iperf results.

rick

Rick Macklem

Jun 26, 2015, 7:59:02 PM
Gerrit Kuhn wrote:
> On Thu, 25 Jun 2015 20:49:11 -0400 (EDT) Rick Macklem
> <rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:
>
>
> RM> Recent commits to stable/10 (not in 10.1) done by Alexander Motin
> RM> (mav@) might help w.r.t. write performance (it avoids large writes
> RM> doing synchronous writes when the wcommitsize is exceeded). If you can
> RM> try stable/10, that might be worth it.
>
> Ok, I'll schedule an update then, I guess. OTOH, Scott reported that a
> similar setup is working fine for him with 10.0 and 10.1, so there is
> probably not much to gain. I'll try anyway...
>
> RM> Otherwise, the main mount option you can try is "wcommitsize", which
> RM> you probably want to make larger.
>
> Hm, which size would you recommend? I cannot find anything about this
> setting, not even what the default value would be. Is this reflected in
> some sysctl, or how can I find out what the actual value is?
>
The default (auto-tuned) value is reported by "nfsstat -m".
It can be set with a mount option (there should be something in "man mount_nfs").
If you are doing a test with 1-megabyte writes, I'd set it to at least
1 megabyte. (Basically, writing will be slower for write(2) syscalls that
are larger than wcommitsize. After mav@'s patch, the difference isn't nearly
as noticeable. His other commit makes the auto-tuned value more reasonable.)

If you set it large enough with the "wcommitsize=<N>" mount option, you
don't need the updates in stable/10.
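
For example, the value currently in effect on the client shows up in the
mount options line printed by:

nfsstat -m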

rick

Rick Macklem

Jun 26, 2015, 8:42:32 PM
Scott Larson wrote:
> We've got 10.0 and 10.1 servers accessing Isilon and Nexenta via NFS

> with Intel 10G gear and bursting to near wire speed with the stock
> MTU/rsize/wsize works as expected. TSO definitely needs to be enabled for
> that performance.
Btw, can you tell us what Intel chip(s) you're using?

For example, from the "ix" driver:
#define IXGBE_82598_SCATTER 100
#define IXGBE_82599_SCATTER 32

This implies that the 82598 won't have problems with 64K TSO segments, but
the 82599 will end up doing calls to m_defrag() which copies the entire
list of mbufs into 32 new mbuf clusters for each of them.
--> Even for one driver, different chips may result in different NFS perf.

Btw, it appears that the driver in head/current now sets if_hw_tsomaxsegcount,
but the driver in stable/10 does not. This means that the 82599 chip will end
up doing the m_defrag() calls for 10.x.

rick

> The fact iperf gives you the expected throughput but NFS
> does not would have me looking at tuning for the NFS platform. Other things
> to look at: Are all the servers involved negotiating the correct speed and
> duplex, with TSO? Does it need to have the network stack tuned with
> whatever it's equivalent of maxsockbuf and send/recvbuf are? Do the switch
> ports and NIC counters show any drops or errors? On the FBSD servers you
> could also run 'netstat -i -w 1' under load to see if drops are occurring
> locally, or 'systat -vmstat' for resource contention problems. But again, a
> similar setup here and no such issues have appeared.
>
>
> On Thu, Jun 25, 2015 at 5:52 AM, Gerrit Kühn <gerrit...@aei.mpg.de>

> wrote:
>
> > Hi all,
> >
> > We have a recent FreeBSD 10.1 installation here that is supposed to act as
> > an NFS (v3) client to an Oracle X4-2L server running Solaris 11.2.
> > We have Intel 10-Gigabit X540-AT2 NICs on both ends, and iperf is showing
> > plenty of bandwidth (9.x Gbit/s) in both directions.
> > However, NFS appears to be terribly slow, especially for writing:
> >
> > root@crest:~ # dd if=/dev/zero of=/net/hellpool/Z bs=1024k count=1000
> > 1000+0 records in
> > 1000+0 records out
> > 1048576000 bytes transferred in 20.263190 secs (51747824 bytes/sec)
> >
> >

> > Reading appears to be faster, but still far away from full bandwidth:
> >
> > root@crest:~ # dd of=/dev/null if=/net/hellpool/Z bs=1024k
> > 1000+0 records in
> > 1000+0 records out
> > 1048576000 bytes transferred in 5.129869 secs (204406000 bytes/sec)
> >
> >

> > We have already tried to tune the rsize/wsize parameters, but they appear to
> > have little (if any) impact on these results. Also, neither disabling
> > rxcsum, txcsum, tso, etc. on the interface nor increasing the MTU to 9000 for
> > jumbo frames improved anything.
> > It is quite embarrassing to achieve far less than 1GbE performance with
> > 10GbE equipment. Are there any hints as to what else might be causing this
> > (and how to fix it)?
> >
> >

Gerrit Kühn

Jun 29, 2015, 3:20:39 AM
On Fri, 26 Jun 2015 20:42:08 -0400 (EDT) Rick Macklem
<rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:

RM> Btw, can you tell us what Intel chip(s) you're using?

I have

ix0@pci0:5:0:0: class=0x020000 card=0x00028086 chip=0x15288086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Controller 10-Gigabit X540-AT2'
    class      = network
    subclass   = ethernet

RM> For example, from the "ix" driver:
RM> #define IXGBE_82598_SCATTER 100
RM> #define IXGBE_82599_SCATTER 32

Hm, I cannot find out into which chipset number this translates for my
device...

RM> Btw, it appears that the driver in head/current now sets
RM> if_hw_tsomaxsegcount, but the driver in stable/10 does not. This means
RM> that the 82599 chip will end up doing the m_defrag() calls for 10.x.

So the next step could even be updating to -current...
OTOH, I get the same (bad) results, no matter if TSO is enabled or
disabled on the interface.

Gerrit Kühn

Jun 29, 2015, 3:36:03 AM
On Fri, 26 Jun 2015 19:58:42 -0400 (EDT) Rick Macklem
<rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:


RM> The default (auto tuned) value is reported by "nfsstat -m".
RM> It can be set with a mount option (should be something in "man
RM> mount_nfs"). If you are doing a test with 1 megabyte writes, I'd set
RM> it to at least 1 megabyte. (Basically, writing will be slower for write
RM> (2) syscalls that are larger than wcommitsize. After mav@'s patch, the
RM> difference isn't nearly as noticeable. His other commit makes the auto
RM> tuned value more reasonable).
RM>
RM> If you set it large enough with the "wcommitsize=<N>" mount option, you
RM> don't need the updates stable/10.

Ok, I set it way over 1MB now:

hellpool:/samqfs/K1/Gerrit on /net/hellpool
nfsv3,tcp,resvport,hard,cto,lockd,rdirplus,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=8192,readahead=1,wcommitsize=2048576,timeout=120,retrans=2

However, this still gives me the same bad write performance:

root@crest:~ # dd if=/dev/zero of=/net/hellpool/Z bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 22.939049 secs (45711398 bytes/sec)


So I guess I can postpone the update for now, and look for some other
reason for this instead.

Rick Macklem

Jun 29, 2015, 8:20:40 AM
Gerrit Kuhn wrote:
> On Fri, 26 Jun 2015 20:42:08 -0400 (EDT) Rick Macklem
> <rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:
>
> RM> Btw, can you tell us what Intel chip(s) you're using?
>
> I have
>
> ix0@pci0:5:0:0: class=0x020000 card=0x00028086 chip=0x15288086 rev=0x01
> hdr=0x00 vendor = 'Intel Corporation'
> device = 'Ethernet Controller 10-Gigabit X540-AT2'
> class = network
> subclass = ethernet
>
> RM> For example, from the "ix" driver:
> RM> #define IXGBE_82598_SCATTER 100
> RM> #define IXGBE_82599_SCATTER 32
>
> Hm, I cannot find out into which chipset number this translates for my
> device...
>
> RM> Btw, it appears that the driver in head/current now sets
> RM> if_hw_tsomaxsegcount, but the driver in stable/10 does not. This means
> RM> that the 82599 chip will end up doing the m_defrag() calls for 10.x.
>
> So the next step could even be updating to -current...
> OTOH, I get the same (bad) results, no matter if TSO is enabled or
> disabled on the interface.
>
Since disabling TSO had no effect, I don't think updating would matter.

If you can test against a different NFS server, that might indicate whether
or not the Solaris server is the bottleneck.

If the Solaris server is using ZFS, setting sync=disabled might help w.r.t.
write performance. It is, however, somewhat dangerous w.r.t. loss of recently
written data if the server crashes. (The server has told the client the data
is safely on stable storage, so the client will not re-write the block(s),
even though the data never made it to stable storage and is lost.)
(I'm not a ZFS guy, so I can't suggest more w.r.t. ZFS.)
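
For illustration, on a ZFS-backed export this would be something along the
lines of the following (the dataset name is a placeholder, and the data-loss
caveat above applies):

zfs set sync=disabled tank/export
# and to return to the safe default later:
zfs set sync=standard tank/export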

rick

Rick Macklem

Jun 29, 2015, 8:22:48 AM
Gerrit Kuhn wrote:
> On Fri, 26 Jun 2015 20:42:08 -0400 (EDT) Rick Macklem
> <rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:
>
> RM> Btw, can you tell us what Intel chip(s) you're using?
>
> I have
>
> ix0@pci0:5:0:0: class=0x020000 card=0x00028086 chip=0x15288086 rev=0x01
> hdr=0x00 vendor = 'Intel Corporation'
> device = 'Ethernet Controller 10-Gigabit X540-AT2'
> class = network
> subclass = ethernet
>
Yea, I don't know how to decode this either. I was actually interested in
what chip Scott was using and getting wire speed.
As noted in the other reply, since disabling TSO didn't help, you probably
aren't affected by this issue.

rick

> RM> For example, from the "ix" driver:
> RM> #define IXGBE_82598_SCATTER 100
> RM> #define IXGBE_82599_SCATTER 32
>
> Hm, I cannot find out into which chipset number this translates for my
> device...
>
> RM> Btw, it appears that the driver in head/current now sets
> RM> if_hw_tsomaxsegcount, but the driver in stable/10 does not. This means
> RM> that the 82599 chip will end up doing the m_defrag() calls for 10.x.
>
> So the next step could even be updating to -current...
> OTOH, I get the same (bad) results, no matter if TSO is enabled or
> disabled on the interface.
>
>

Rick Macklem

Jun 29, 2015, 8:47:59 AM
I wrote:
> Gerrit Kuhn wrote:
> > On Fri, 26 Jun 2015 20:42:08 -0400 (EDT) Rick Macklem
> > <rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:
> >
> > RM> Btw, can you tell us what Intel chip(s) you're using?
> >
> > I have
> >
> > ix0@pci0:5:0:0: class=0x020000 card=0x00028086 chip=0x15288086 rev=0x01
> > hdr=0x00 vendor = 'Intel Corporation'
> > device = 'Ethernet Controller 10-Gigabit X540-AT2'
> > class = network
> > subclass = ethernet
> >
> Yea, I don't know how to decode this either.
I took a look at the driver and, if I read it correctly, most chips (including
all the X540 ones) use IXGBE_82599_SCATTER.

As such, you will be doing lots of m_defrag() calls, but since disabling TSO
didn't help, that doesn't seem to be the bottleneck.
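
For anyone who wants to check their own source tree, something along these
lines will show the per-chip limits:

grep -rn SCATTER /usr/src/sys/dev/ixgbe/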

rick

Olivier Cochard-Labbé

Jun 29, 2015, 8:56:09 AM
On Mon, Jun 29, 2015 at 9:19 AM, Gerrit Kühn <gerrit...@aei.mpg.de>
wrote:

> On Fri, 26 Jun 2015 20:42:08 -0400 (EDT) Rick Macklem
> <rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly slow:
>
> RM> Btw, can you tell us what Intel chip(s) you're using?
>
> I have
>
> ix0@pci0:5:0:0: class=0x020000 card=0x00028086 chip=0x15288086 rev=0x01
> hdr=0x00 vendor = 'Intel Corporation'
> device = 'Ethernet Controller 10-Gigabit X540-AT2'
> class = network
> subclass = ethernet
>

> RM> For example, from the "ix" driver:
> RM> #define IXGBE_82598_SCATTER 100
> RM> #define IXGBE_82599_SCATTER 32
>
> Hm, I cannot find out into which chipset number this translates for my
> device...
>
>

Extract the first 4 digits of "chip" (after the 0x), then try a grep:
grep 1528 /usr/src/sys/dev/ixgbe/*
/usr/src/sys/dev/ixgbe/ixgbe_type.h:#define IXGBE_DEV_ID_X540T        0x1528

=> Then your chipset is X540

Carsten Aulbert

Jun 29, 2015, 9:09:25 AM
Hi Rick

On 06/29/2015 02:20 PM, Rick Macklem wrote:
> If the Solaris server is using ZFS, setting sync=disabled might help w.r.t.
> write performance. It is, however, somewhat dangerous w.r.t. loss of recently
> written data when the server crashes. (Server has told client data is safely
> on stable storage so client will not re-write the block(s) although data wasn't
> on stable storage and is lost.)
> (I'm not a ZFS guy, so I can't suggest more w.r.t. ZFS.)
>

The system on the other side uses SAM/QFS, i.e. there is no such option
for the file system per se (only the file system metadata is in a zvol,
so it is not a full-featured ZFS).

In parallel, we are also working with Oracle to see whether there is a
matching knob to turn, as we see about the same performance issues from a
Linux host (NFS client, Debian Jessie) with a Mellanox Technologies
MT27500 Family [ConnectX-3] controller.

Cheers

Carsten

--
Dr. Carsten Aulbert, Atlas cluster administration
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Callinstraße 38, 30167 Hannover, Germany
Tel: +49 511 762 17185, Fax: +49 511 762 17193

Mike Tancsa

Jun 29, 2015, 9:25:22 AM
On 6/29/2015 8:20 AM, Rick Macklem wrote:
> If the Solaris server is using ZFS, setting sync=disabled might help w.r.t.

On my FreeBSD ZFS server, this is a must for decent and consistent write
throughput. Using FreeBSD as an iSCSI target and a Linux initiator, I
can saturate a 1G NIC with no problem with sync disabled. It's barely usable
with the default sync=standard, as it's so bursty.

---Mike


--
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mi...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/

Scott Larson

Jun 29, 2015, 12:33:53 PM
82599 in our case. One problem I do have is the stack likes to blow up
on occasion with the right combo of high load and high throughput while TSO
is enabled, possibly relating to the 10.x driver issue you've pointed out.
But when it comes to the throughput they'll blast 10G with no problem.
On Mon, Jun 29, 2015 at 5:22 AM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Gerrit Kuhn wrote:
> > On Fri, 26 Jun 2015 20:42:08 -0400 (EDT) Rick Macklem
> > <rmac...@uoguelph.ca> wrote about Re: NFS on 10G interface terribly
> slow:
> >
> > RM> Btw, can you tell us what Intel chip(s) you're using?
> >
> > I have
> >
> > ix0@pci0:5:0:0: class=0x020000 card=0x00028086 chip=0x15288086 rev=0x01
> > hdr=0x00 vendor = 'Intel Corporation'
> > device = 'Ethernet Controller 10-Gigabit X540-AT2'
> > class = network
> > subclass = ethernet
> >
> Yea, I don't know how to decode this either. I was actually interested in
> what chip Scott was using and getting wire speed.
> As noted in the other reply, since disabling TSO didn't help, you probably
> aren't affected by this issue.
>
> rick
>
> > RM> For example, from the "ix" driver:
> > RM> #define IXGBE_82598_SCATTER 100
> > RM> #define IXGBE_82599_SCATTER 32
> >
> > Hm, I cannot find out into which chipset number this translates for my
> > device...
> >
> > RM> Btw, it appears that the driver in head/current now sets
> > RM> if_hw_tsomaxsegcount, but the driver in stable/10 does not. This
> means
> > RM> that the 82599 chip will end up doing the m_defrag() calls for 10.x.
> >
> > So the next step could even be updating to -current...
> > OTOH, I get the same (bad) results, no matter if TSO is enabled or
> > disabled on the interface.
> >
> >
> > cu
> > Gerrit

Rick Macklem

Jun 29, 2015, 4:49:33 PM
Scott Larson wrote:
> 82599 in our case. One problem I do have is the stack likes to blow up
> on occasion with the right combo of high load and high throughput while TSO
> is enabled, possibly relating to the 10.x driver issue you've pointed out.
> But when it comes to the throughput they'll blast 10G with no problem.
>
Thanks for the info. So long as your mbuf cluster pool is large enough, I
think the m_defrag() calls will just result in increased CPU overheads and
probably don't introduce much delay.

I have no idea why the stack would blow up sometimes.
If you can catch the backtrace for one of these and post it, it might become
obvious. (Or you could just try increasing KSTACK_PAGES in sys/amd64/include/param.h
and see if the stack still blows up. Alternately, I think you can set KSTACK_PAGES in
your kernel config file.)
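
A sketch of the kernel-config route (the value 6 is only an example; the
stock amd64 default at the time was, I believe, 4) would be a line like
this in the custom kernel configuration file:

options KSTACK_PAGES=6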

rick