
Re: 9.2 ixgbe tx queue hang


Christopher Forgeron

Mar 19, 2014, 3:17:34 PM
Hello,



I can report this problem as well on 10.0-RELEASE.



I think it's the same as kern/183390?



I have two physically identical machines, one running 9.2-STABLE, and one
on 10.0-RELEASE.



My 10.0 machine used to be running 9.0-STABLE for over a year without any
problems.



I'm not having the problem with 9.2-STABLE as far as I can tell, but it
does seem to be a load-based issue more than anything. Since my 9.2 system
is in production, I'm unable to load it up to see if the problem exists there.
I have a ping_logger.py running on it now to see whether it's experiencing
brief problems.



I am able to reproduce it fairly reliably within 15 min of a reboot by
loading the server via NFS with iometer and some large NFS file copies at
the same time. I seem to need to sustain ~2 Gbps for a few minutes.



It will happen with just ix0 (no lagg) or with lagg enabled across ix0 and
ix1.



I've been load-testing new FreeBSD-10.0-RELEASE SANs for production use
here, so I'm quite willing to put time into this to help find out where
it's coming from. It took me a day to track down my iometer issues as
being network-related, and another day to isolate the problem and write
scripts to reproduce it.



The symptom I notice is:

- A running flood ping (ping -f 172.16.0.31) to the same hardware
(running 9.2) will come back with "ping: sendto: File too large" when the
problem occurs

- Network connectivity is very spotty during these incidents

- It can run with sporadic ping errors, or it can run a straight
set of errors for minutes at a time

- After a long run of ping errors, ESXi will show a disconnect
from the hosted NFS stores on this machine.

- I've yet to see it happen right after boot. Fastest is around 5
min, normally it's within 15 min.



System Specs:



- Dell PowerEdge M610x Blade

- 2 Xeon 6600 @ 2.40GHz (24 Cores total)

- 96 Gig RAM

- 35.3 TB ZFS Mirrored pool, lz4 compression on my test pool (ZFS
pool is the latest)

- Intel 520-DA2 10 Gb dual-port Blade Mezz. Cards



Currently this 10.0 testing machine is running default sysctls, other than
hw.intr_storm_threshold=9900. I have the problem whether or not that's set,
so I leave it on.



(I used to set nmbclusters, etc. manually as per the Intel README, but I
notice that the defaults on the new 10.0 system are larger. I did try using
all of the old sysctls from an older 9.0-STABLE and still had the problem,
but it did seem to take longer to occur; I haven't run enough tests to
confirm that timing observation.)



What logs / info can I provide to help?



I have written a small script called ping_logger.py that pings an IP, and
checks to see if there is an error. On error it will execute and log:

- netstat -m

- sysctl hw.ix

- sysctl dev.ix



then go back to pinging. It will also log those values on the startup of
the script, and every 5 min (so you can see the progression on the system).
I can add any number of things to the reporting, so I'm looking for
suggestions.
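Roughly, the logic is like this (a simplified sketch only, not the full
script; the target IP, log path, and intervals below are placeholders):

#!/usr/bin/env python
# Simplified sketch of the ping_logger.py idea (illustrative, not the real
# script). Ping a target; on a ping error, and every 5 minutes regardless,
# dump netstat/sysctl output to a log so the counters can be compared later.
import subprocess
import time

TARGET = "172.16.0.31"                    # placeholder: host to ping
LOGFILE = "/var/log/ping_logger.log"      # placeholder: where to log
COMMANDS = ["netstat -m", "sysctl hw.ix", "sysctl dev.ix"]
PERIOD = 300                              # periodic dump every 5 min

def dump(reason):
    with open(LOGFILE, "a") as log:
        log.write("==== %s %s ====\n" % (time.strftime("%Y-%m-%d %H:%M:%S"),
                                         reason))
        for cmd in COMMANDS:
            log.write("--- %s ---\n" % cmd)
            p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT)
            log.write(p.communicate()[0].decode("ascii", "replace"))

dump("startup")
last = time.time()
while True:
    # A failed ping (e.g. "ping: sendto: File too large") exits non-zero.
    if subprocess.call("ping -c 1 %s > /dev/null 2>&1" % TARGET, shell=True):
        dump("ping error")
    if time.time() - last >= PERIOD:
        dump("periodic")
        last = time.time()
    time.sleep(0.2)

Everything in COMMANDS ends up in the same log, so adding more reporting is
just a matter of appending to that list.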



This results in some large log files, but I can email a .gz directly to
anyone who needs them, or perhaps put them up on a website.



I will also make the ping_logger.py script available if anyone else wants
it.





LASTLY:



The one thing I can see that is different between my 10.0 system and my 9.2 system is:



9.2's netstat -m:



37965/16290/54255 mbufs in use (current/cache/total)

4080/8360/12440/524288 mbuf clusters in use (current/cache/total/max)

4080/4751 mbuf+clusters out of packet secondary zone in use (current/cache)

0/452/452/262144 4k (page size) jumbo clusters in use
(current/cache/total/max)

32773/4129/36902/96000 9k jumbo clusters in use (current/cache/total/max)

0/0/0/508538 16k jumbo clusters in use (current/cache/total/max)

312608K/59761K/372369K bytes allocated to network (current/cache/total)

0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)

0/0/0 requests for jumbo clusters delayed (4k/9k/16k)

0/0/0 requests for jumbo clusters denied (4k/9k/16k)

0/0/0 sfbufs in use (current/peak/max)

0 requests for sfbufs denied

0 requests for sfbufs delayed

0 requests for I/O initiated by sendfile

0 calls to protocol drain routines





10.0's netstat -m:



21512/24448/45960 mbufs in use (current/cache/total)

4080/16976/21056/6127254 mbuf clusters in use (current/cache/total/max)

4080/16384 mbuf+clusters out of packet secondary zone in use (current/cache)

0/23/23/3063627 4k (page size) jumbo clusters in use
(current/cache/total/max)

16384/158/16542/907741 9k jumbo clusters in use (current/cache/total/max)

0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)

160994K/41578K/202572K bytes allocated to network (current/cache/total)

17488/13290/20464 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)

0/0/0 requests for jumbo clusters delayed (4k/9k/16k)

7/16462/0 requests for jumbo clusters denied (4k/9k/16k)

0 requests for sfbufs denied

0 requests for sfbufs delayed

0 requests for I/O initiated by sendfile



Far more mbuf clusters are in use, but also I never get denied/delayed results
on 9.2 - yet I have them on 10.0 right away after a reboot.



Thanks for any help.

Christopher Forgeron

Mar 19, 2014, 3:41:40 PM
(Sorry for the formatting on that last message, that was weird)



Today I wanted to test the assertion that this is an NFS issue, since we all
seem to be running NFS.

I shut down my NFS daemon in rc.conf, configured the FreeBSD10 iSCSI ctld,
rebooted, and then ran all my tests exclusively from the iSCSI connection
to my FreeBSD-10 box.

The only other thing I had active on the FreeBSD-10 box was a flood-ping
so I could see if/when the problem occurred.

I still had the problem. It triggered around 40 min into my iometer
benchmark, continued sporadically for a while, then settled down and didn't
give me problems.

That error behaviour isn't abnormal - I really need to push the SAN to put
its network into a 'dead and not coming back' mode.

Any thoughts?

Rick Macklem

Mar 19, 2014, 10:29:36 PM
Christopher Forgeron wrote:
> Hello,
>
>
>
> I can report this problem as well on 10.0-RELEASE.
>
>
>
> I think it's the same as kern/183390?
>
>
>
> I have two physically identical machines, one running 9.2-STABLE, and
> one
> on 10.0-RELEASE.
>
>
>
> My 10.0 machine used to be running 9.0-STABLE for over a year without
> any
> problems.
>
>
>
> I'm not having the problems with 9.2-STABLE as far as I can tell, but
> it
> does seem to be a load-based issue more than anything. Since my 9.2
> system
> is in production, I'm unable to load it to see if the problem exists
> there.
> I have a ping_logger.py running on it now to see if it's experiencing
> problems briefly or not.
>
>
>
> I am able to reproduce it fairly reliably within 15 min of a reboot
> by
> loading the server via NFS with iometer and some large NFS file
> copies at
> the same time. I seem to need to sustain ~2 Gbps for a few minutes.
>
If you can easily do so, testing with the attached patch might shed
some light on the problem. It just adds a couple of diagnostic checks
before and after m_defrag() is called when bus_dmamap_load_mbuf_sg()
returns EFBIG.

If the "before" printf happens, it would suggest a problem with the
loop in tcp_output() that creates TSO segments.

If the "after" printf happens, it would suggest that m_defrag() somehow
doesn't create a list of 32 or fewer mbufs for the TSO segment.

I don't have any ix hardware, so this patch is completely untested.

Just something maybe worth trying, rick
ixgbe.patch

Markus Gebert

Mar 20, 2014, 6:40:18 AM

On 19.03.2014, at 20:17, Christopher Forgeron <csfor...@gmail.com> wrote:

> Hello,
>
>
>
> I can report this problem as well on 10.0-RELEASE.
>
>
>
> I think it's the same as kern/183390?

Possible. We still see this on NFS clients only, but I’m not convinced that NFS is the only trigger.


> I have two physically identical machines, one running 9.2-STABLE, and one
> on 10.0-RELEASE.
>
>
>
> My 10.0 machine used to be running 9.0-STABLE for over a year without any
> problems.
>
>
>
> I'm not having the problems with 9.2-STABLE as far as I can tell, but it
> does seem to be a load-based issue more than anything. Since my 9.2 system
> is in production, I'm unable to load it to see if the problem exists there.
> I have a ping_logger.py running on it now to see if it's experiencing
> problems briefly or not.

In our case, when it happens, the problem persists for quite some time (minutes or hours) if we don’t interact (ifconfig or reboot).


> I am able to reproduce it fairly reliably within 15 min of a reboot by
> loading the server via NFS with iometer and some large NFS file copies at
> the same time. I seem to need to sustain ~2 Gbps for a few minutes.

That’s probably why we can’t reproduce it reliably here. Although our blade servers have 10gig cards, the affected ones are connected to a 1gig switch.


> It will happen with just ix0 (no lagg) or with lagg enabled across ix0 and
> ix1.

Same here.


> I've been load-testing new FreeBSD-10.0-RELEASE SAN's for production use
> here, so I'm quite willing to put time into this to help find out where
> it's coming from. It took me a day to track down my iometer issues as
> being network related, and another day to isolate and write scripts to
> reproduce.
>
>
>
> The symptom I notice is:
>
> - A running flood ping (ping -f 172.16.0.31) to the same hardware
> (running 9.2) will come back with "ping: sendto: File too large" when the
> problem occurs
>
> - Network connectivity is very spotty during these incidents
>
> - It can run with sporadic ping errors, or it can run a straight
> set of errors for minutes at a time
>
> - After a long run of ping errors, ESXi will show a disconnect
> from the hosted NFS stores on this machine.
>
> - I've yet to see it happen right after boot. Fastest is around 5
> min, normally it's within 15 min.

Can you try this when the problem occurs?

for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto; done

It will tie ping to certain CPUs to test the different TX queues of your ix interface. If the pings reliably fail only on some queues, then your problem is more likely to be the same as ours.

Also, if you have dtrace available:

kldload dtraceall
dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack(); }'

while you run pings over the affected interface. This will give you hints about where the EFBIG error comes from.

> […]


Markus

Christopher Forgeron

Mar 20, 2014, 11:34:49 AM
On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert
<markus...@hostpoint.ch> wrote:

>
>
> Possible. We still see this on nfsclients only, but I'm not convinced that
> nfs is the only trigger.
>
>
Just to clarify, I'm experiencing this error with NFS, but also with iSCSI
- I turned off my NFS server in rc.conf and rebooted, and I'm still able to
create the error. This is not just an NFS issue on my machine.

> In our case, when it happens, the problem persists for quite some time
> (minutes or hours) if we don't interact (ifconfig or reboot).

The first few times that I ran into it, I had similar issues, because I
was keeping my system up and treating it like a temporary problem/issue.
Worst case scenario resulted in reboots to reset the NIC. Then again, I
find the ix's to be cranky if you ifconfig them too much.

Now, I'm trying to find a root cause, so as soon as I start seeing any
errors, I abort and reboot the machine to test the next theory.

Additionally, I'm often able to create the problem with just 1 VM running
iometer on the SAN storage. When the problem occurs, that connection is
broken temporarily, taking network load off the SAN - That may improve my
chances of keeping this running.


>
> > I am able to reproduce it fairly reliably within 15 min of a reboot by
> > loading the server via NFS with iometer and some large NFS file copies at
> > the same time. I seem to need to sustain ~2 Gbps for a few minutes.
>
> That's probably why we can't reproduce it reliably here. Although having
> 10gig cards in our blade servers, the ones affected are connected to a 1gig
> switch.
>
>
It seems that it needs a lot of traffic. I have a 10 gig backbone between
my SANs and my ESXi machines, so I can saturate quite quickly (just now I
hit a record: the error occurred within ~5 min of reboot and testing). In
your case, I recommend firing up multiple VMs running iometer on different
1 gig connections and seeing if you can make it pop. I also often turn off ix1
to drive all traffic through ix0 - I've noticed it happens faster this way,
but once again I'm not taking enough observations to make decent time
predictions.


>
>
> Can you try this when the problem occurs?
>
> for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2 -c 2
> -W 1 10.0.0.1 | grep sendto; done
>
> It will tie ping to certain cpus to test the different tx queues of your
> ix interface. If the pings reliably fail only on some queues, then your
> problem is more likely to be the same as ours.
>
> Also, if you have dtrace available:
>
> kldload dtraceall
> dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack();
> }'
>
> while you run pings over the interface affected. This will give you hints
> about where the EFBIG error comes from.
>
> > [...]
>
>
> Markus
>
>
Will do. I'm not sure which shell the first script was written for; it doesn't
work in csh. Here's a rewrite that does work in csh, in case others are
using the default shell:

#!/bin/csh
foreach CPU (`seq 0 23`)
    echo "CPU$CPU"
    cpuset -l $CPU ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto
end

Thanks for your input. I should have results to post to the list shortly.

Christopher Forgeron

Mar 20, 2014, 11:50:21 AM
Markus,

I just wanted to clarify what dtrace will output in a 'no-error'
situation. I'm seeing the following during a normal ping (no errors) on
ix0, or even on a non-problematic bge NIC:





On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert
<markus...@hostpoint.ch> wrote:

> Also, if you have dtrace available:
>
> kldload dtraceall
> dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack();
> }'
>
> while you run pings over the interface affected. This will give you hints
> about where the EFBIG error comes from.
>
>
>
>
>

Christopher Forgeron

Mar 20, 2014, 11:52:15 AM
(Struggling with this mail client for some reason, sorry, here's the paste)

# dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / {
stack(); }'
dtrace: description 'fbt:::return ' matched 24892 probes
CPU ID FUNCTION:NAME
19 29656 maybe_yield:return
kernel`uiomove_faultflag+0xab
kernel`m_uiotombuf+0xfd
kernel`sosend_generic+0x367
kernel`kern_sendit+0x224
kernel`sendit+0x116
kernel`sys_sendto+0x4d
kernel`amd64_syscall+0x357
kernel`0xffffffff80c7567b

19 29656 maybe_yield:return
kernel`uiomove_faultflag+0xab
kernel`ttydisc_write+0xde
kernel`ttydev_write+0x143
kernel`devfs_write_f+0xef
kernel`dofilewrite+0x8a
kernel`kern_writev+0x65
kernel`sys_write+0x63
kernel`amd64_syscall+0x357
kernel`0xffffffff80c7567b

11 29656 maybe_yield:return
kernel`uiomove_faultflag+0xab
kernel`m_uiotombuf+0xfd
kernel`sosend_generic+0x367
kernel`kern_sendit+0x224
kernel`sendit+0x116
kernel`sys_sendto+0x4d
kernel`amd64_syscall+0x357
kernel`0xffffffff80c7567b

11 29656 maybe_yield:return
kernel`uiomove_faultflag+0xab
kernel`ttydisc_write+0xde
kernel`ttydev_write+0x143
kernel`devfs_write_f+0xef
kernel`dofilewrite+0x8a
kernel`kern_writev+0x65
kernel`sys_write+0x63
kernel`amd64_syscall+0x357
kernel`0xffffffff80c7567b

Markus Gebert

Mar 20, 2014, 1:01:31 PM

On 20.03.2014, at 16:50, Christopher Forgeron <csfor...@gmail.com> wrote:

> Markus,
>
> I just wanted to clarify what dtrace will output in a 'no-error'
> situation. I'm seeing the following during a normal ping (no errors) on
> ix0, or even on a non-problematic bge NIC:
>
>

This is expected. This dtrace probe will fire if any kernel function that is run in the context of a process named “ping” returns 27, which is what EFBIG stands for. Kernel functions can return 27 for many reasons, not just an error, or they can return void (not sure how dtrace handles this case). Anyway, this dtrace one-liner is only meant to be used in the error case. Otherwise it’s probably useless. And even when the problem occurs, you need to pick the right stack trace, and dig around kernel sources to verify that the functions indeed return EFBIG and not any integer that happens to be 27 by accident.
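If you want to double-check the numeric value on your box, two lines of Python will do it (it should come back as 27):

import errno
print(errno.EFBIG, errno.errorcode[27])   # 27 and 'EFBIG' on FreeBSD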


Markus

Christopher Forgeron

Mar 20, 2014, 3:38:41 PM
Output from the patch you gave me (I have screens of it; let me know what
you're hoping to see).


Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: before pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:23 SAN0 kernel: before pklen=65538 actl=65538
Mar 20 16:37:23 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 20 16:37:23 SAN0 kernel: before pklen=65542 actl=65542
Mar 20 16:37:23 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 20 16:37:23 SAN0 kernel: before pklen=65542 actl=65542
Mar 20 16:37:23 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 20 16:37:23 SAN0 kernel: before pklen=65542 actl=65542
Mar 20 16:37:23 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65



On Wed, Mar 19, 2014 at 11:29 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
> > Hello,
> >
> >
> >
> > I can report this problem as well on 10.0-RELEASE.
> >
> >
> >
> > I think it's the same as kern/183390?
> >
> >
> >
> > I have two physically identical machines, one running 9.2-STABLE, and
> > one
> > on 10.0-RELEASE.
> >
> >
> >
> > My 10.0 machine used to be running 9.0-STABLE for over a year without
> > any
> > problems.
> >
> >
> >
> > I'm not having the problems with 9.2-STABLE as far as I can tell, but
> > it
> > does seem to be a load-based issue more than anything. Since my 9.2
> > system
> > is in production, I'm unable to load it to see if the problem exists
> > there.
> > I have a ping_logger.py running on it now to see if it's experiencing
> > problems briefly or not.
> >
> >
> >
> > I am able to reproduce it fairly reliably within 15 min of a reboot
> > by
> > loading the server via NFS with iometer and some large NFS file
> > copies at
> > the same time. I seem to need to sustain ~2 Gbps for a few minutes.
> >
> If you can easily do so, testing with the attached patch might shed
> some light on the problem. It just adds a couple of diagnostic checks
> before and after m_defrag() is called when bus_dmamap_load_mbuf_sg()
> returns EFBIG.
>
> If the "before" printf happens, it would suggest a problem with the
> loop in tcp_output() that creates TSO segments.
>
> If the "after" printf happens, it would suggest that m_defrag() somehow
> doesn't create a list of 32 or fewer mbufs for the TSO segment.
>
> I don't have any ix hardware, so this patch is completely untested.
>
> Just something maybe worth trying, rick
>
> >
> >
> > It will happen with just ix0 (no lagg) or with lagg enabled across
> > ix0 and
> > ix1.
> >
> >
> >

Christopher Forgeron

Mar 20, 2014, 3:56:44 PM
BTW,

When I have the problem, this is what I see from netstat -m

4080/2956/7036/6127254 mbuf clusters in use (current/cache/total/max)
4080/2636 mbuf+clusters out of packet secondary zone in use (current/cache)
0/50/50/3063627 4k (page size) jumbo clusters in use
(current/cache/total/max)
32768/155/32923/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
312541K/9182K/321724K bytes allocated to network (current/cache/total)
34481/2600/4091 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
50/27433/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

It doesn't look that bad to me, other than all of the denied counts - but I
can't see any sysctl buffer numbers that look too low.

For those who are interested, here is a dump of hw.ix and dev.ix.0 (I have
ix1 off):

hw.ix.enable_aim: 1
hw.ix.max_interrupt_rate: 31250
hw.ix.rx_process_limit: 256
hw.ix.tx_process_limit: 256
hw.ix.enable_msix: 1
hw.ix.num_queues: 8
hw.ix.txd: 2048
hw.ix.rxd: 2048

2014-03-20 16:29:05.291 - INFO - dev.ix.0.%desc: Intel(R) PRO/10GbE
PCI-Express Network Driver, Version - 2.5.15
dev.ix.0.%driver: ix
dev.ix.0.%location: slot=0 function=0
dev.ix.0.%pnpinfo: vendor=0x8086 device=0x10f8 subvendor=0x8086
subdevice=0x000c class=0x020000
dev.ix.0.%parent: pci5
dev.ix.0.fc: 3
dev.ix.0.enable_aim: 1
dev.ix.0.advertise_speed: 0
dev.ix.0.dropped: 0
dev.ix.0.mbuf_defrag_failed: 0
dev.ix.0.watchdog_events: 0
dev.ix.0.link_irq: 5
dev.ix.0.queue0.interrupt_rate: 500000
dev.ix.0.queue0.irqs: 452969
dev.ix.0.queue0.txd_head: 319
dev.ix.0.queue0.txd_tail: 319
dev.ix.0.queue0.tso_tx: 61107
dev.ix.0.queue0.no_tx_dma_setup: 0
dev.ix.0.queue0.no_desc_avail: 0
dev.ix.0.queue0.tx_packets: 257636
dev.ix.0.queue0.rxd_head: 531
dev.ix.0.queue0.rxd_tail: 530
dev.ix.0.queue0.rx_packets: 522771
dev.ix.0.queue0.rx_bytes: 1318022421
dev.ix.0.queue0.rx_copies: 224837
dev.ix.0.queue0.lro_queued: 424583
dev.ix.0.queue0.lro_flushed: 181580
dev.ix.0.queue1.interrupt_rate: 125000
dev.ix.0.queue1.irqs: 22756
dev.ix.0.queue1.txd_head: 1169
dev.ix.0.queue1.txd_tail: 1169
dev.ix.0.queue1.tso_tx: 0
dev.ix.0.queue1.no_tx_dma_setup: 0
dev.ix.0.queue1.no_desc_avail: 0
dev.ix.0.queue1.tx_packets: 23202
dev.ix.0.queue1.rxd_head: 337
dev.ix.0.queue1.rxd_tail: 336
dev.ix.0.queue1.rx_packets: 337
dev.ix.0.queue1.rx_bytes: 32988
dev.ix.0.queue1.rx_copies: 225
dev.ix.0.queue1.lro_queued: 335
dev.ix.0.queue1.lro_flushed: 320
dev.ix.0.queue2.interrupt_rate: 500000
dev.ix.0.queue2.irqs: 20256
dev.ix.0.queue2.txd_head: 1201
dev.ix.0.queue2.txd_tail: 1201
dev.ix.0.queue2.tso_tx: 0
dev.ix.0.queue2.no_tx_dma_setup: 0
dev.ix.0.queue2.no_desc_avail: 0
dev.ix.0.queue2.tx_packets: 20962
dev.ix.0.queue2.rxd_head: 1021
dev.ix.0.queue2.rxd_tail: 1020
dev.ix.0.queue2.rx_packets: 1021
dev.ix.0.queue2.rx_bytes: 99126
dev.ix.0.queue2.rx_copies: 891
dev.ix.0.queue2.lro_queued: 396
dev.ix.0.queue2.lro_flushed: 391
dev.ix.0.queue3.interrupt_rate: 71428
dev.ix.0.queue3.irqs: 25072
dev.ix.0.queue3.txd_head: 1465
dev.ix.0.queue3.txd_tail: 1465
dev.ix.0.queue3.tso_tx: 0
dev.ix.0.queue3.no_tx_dma_setup: 0
dev.ix.0.queue3.no_desc_avail: 0
dev.ix.0.queue3.tx_packets: 25726
dev.ix.0.queue3.rxd_head: 310
dev.ix.0.queue3.rxd_tail: 309
dev.ix.0.queue3.rx_packets: 310
dev.ix.0.queue3.rx_bytes: 36886
dev.ix.0.queue3.rx_copies: 150
dev.ix.0.queue3.lro_queued: 309
dev.ix.0.queue3.lro_flushed: 286
dev.ix.0.queue4.interrupt_rate: 500000
dev.ix.0.queue4.irqs: 21251
dev.ix.0.queue4.txd_head: 308
dev.ix.0.queue4.txd_tail: 308
dev.ix.0.queue4.tso_tx: 0
dev.ix.0.queue4.no_tx_dma_setup: 0
dev.ix.0.queue4.no_desc_avail: 0
dev.ix.0.queue4.tx_packets: 22090
dev.ix.0.queue4.rxd_head: 589
dev.ix.0.queue4.rxd_tail: 588
dev.ix.0.queue4.rx_packets: 589
dev.ix.0.queue4.rx_bytes: 57938
dev.ix.0.queue4.rx_copies: 558
dev.ix.0.queue4.lro_queued: 585
dev.ix.0.queue4.lro_flushed: 585
dev.ix.0.queue5.interrupt_rate: 41666
dev.ix.0.queue5.irqs: 20123
dev.ix.0.queue5.txd_head: 314
dev.ix.0.queue5.txd_tail: 314
dev.ix.0.queue5.tso_tx: 0
dev.ix.0.queue5.no_tx_dma_setup: 0
dev.ix.0.queue5.no_desc_avail: 0
dev.ix.0.queue5.tx_packets: 20618
dev.ix.0.queue5.rxd_head: 112
dev.ix.0.queue5.rxd_tail: 111
dev.ix.0.queue5.rx_packets: 112
dev.ix.0.queue5.rx_bytes: 10224
dev.ix.0.queue5.rx_copies: 84
dev.ix.0.queue5.lro_queued: 109
dev.ix.0.queue5.lro_flushed: 109
dev.ix.0.queue6.interrupt_rate: 71428
dev.ix.0.queue6.irqs: 18418
dev.ix.0.queue6.txd_head: 732
dev.ix.0.queue6.txd_tail: 732
dev.ix.0.queue6.tso_tx: 45
dev.ix.0.queue6.no_tx_dma_setup: 0
dev.ix.0.queue6.no_desc_avail: 0
dev.ix.0.queue6.tx_packets: 19137
dev.ix.0.queue6.rxd_head: 824
dev.ix.0.queue6.rxd_tail: 823
dev.ix.0.queue6.rx_packets: 824
dev.ix.0.queue6.rx_bytes: 92838
dev.ix.0.queue6.rx_copies: 583
dev.ix.0.queue6.lro_queued: 818
dev.ix.0.queue6.lro_flushed: 716
dev.ix.0.queue7.interrupt_rate: 62500
dev.ix.0.queue7.irqs: 17681
dev.ix.0.queue7.txd_head: 721
dev.ix.0.queue7.txd_tail: 721
dev.ix.0.queue7.tso_tx: 0
dev.ix.0.queue7.no_tx_dma_setup: 0
dev.ix.0.queue7.no_desc_avail: 0
dev.ix.0.queue7.tx_packets: 18067
dev.ix.0.queue7.rxd_head: 1407
dev.ix.0.queue7.rxd_tail: 1406
dev.ix.0.queue7.rx_packets: 1407
dev.ix.0.queue7.rx_bytes: 252631
dev.ix.0.queue7.rx_copies: 884
dev.ix.0.queue7.lro_queued: 1400
dev.ix.0.queue7.lro_flushed: 1390
dev.ix.0.mac_stats.crc_errs: 0
dev.ix.0.mac_stats.ill_errs: 0
dev.ix.0.mac_stats.byte_errs: 0
dev.ix.0.mac_stats.short_discards: 0
dev.ix.0.mac_stats.local_faults: 2
dev.ix.0.mac_stats.remote_faults: 3
dev.ix.0.mac_stats.rec_len_errs: 0
dev.ix.0.mac_stats.xon_txd: 0
dev.ix.0.mac_stats.xon_recvd: 0
dev.ix.0.mac_stats.xoff_txd: 0
dev.ix.0.mac_stats.xoff_recvd: 0
dev.ix.0.mac_stats.total_octets_rcvd: 1320732697
dev.ix.0.mac_stats.good_octets_rcvd: 1320713370
dev.ix.0.mac_stats.total_pkts_rcvd: 527648
dev.ix.0.mac_stats.good_pkts_rcvd: 527365
dev.ix.0.mac_stats.mcast_pkts_rcvd: 25
dev.ix.0.mac_stats.bcast_pkts_rcvd: 75
dev.ix.0.mac_stats.rx_frames_64: 128032
dev.ix.0.mac_stats.rx_frames_65_127: 100057
dev.ix.0.mac_stats.rx_frames_128_255: 115733
dev.ix.0.mac_stats.rx_frames_256_511: 1210
dev.ix.0.mac_stats.rx_frames_512_1023: 3075
dev.ix.0.mac_stats.rx_frames_1024_1522: 179258
dev.ix.0.mac_stats.recv_undersized: 0
dev.ix.0.mac_stats.recv_fragmented: 0
dev.ix.0.mac_stats.recv_oversized: 0
dev.ix.0.mac_stats.recv_jabberd: 0
dev.ix.0.mac_stats.management_pkts_rcvd: 0
dev.ix.0.mac_stats.management_pkts_drpd: 0
dev.ix.0.mac_stats.checksum_errs: 0
dev.ix.0.mac_stats.good_octets_txd: 2815129453
dev.ix.0.mac_stats.total_pkts_txd: 640355
dev.ix.0.mac_stats.good_pkts_txd: 640355
dev.ix.0.mac_stats.bcast_pkts_txd: 2
dev.ix.0.mac_stats.mcast_pkts_txd: 25
dev.ix.0.mac_stats.management_pkts_txd: 0
dev.ix.0.mac_stats.tx_frames_64: 39831
dev.ix.0.mac_stats.tx_frames_65_127: 166390
dev.ix.0.mac_stats.tx_frames_128_255: 72116
dev.ix.0.mac_stats.tx_frames_256_511: 2072
dev.ix.0.mac_stats.tx_frames_512_1023: 1339
dev.ix.0.mac_stats.tx_frames_1024_1522: 358607


And lastly, the default kern.ipc sysctls:

kern.ipc.maxsockbuf: 2097152
kern.ipc.sockbuf_waste_factor: 8
kern.ipc.max_linkhdr: 16
kern.ipc.max_protohdr: 60
kern.ipc.max_hdr: 76
kern.ipc.max_datalen: 92
kern.ipc.maxmbufmem: 50194468864
kern.ipc.nmbclusters: 6127254
kern.ipc.nmbjumbop: 3063627
kern.ipc.nmbjumbo9: 2723223
kern.ipc.nmbjumbo16: 2042416
kern.ipc.nmbufs: 39214440
kern.ipc.maxpipekva: 1610002432
kern.ipc.pipekva: 147456
kern.ipc.pipefragretry: 0
kern.ipc.pipeallocfail: 0
kern.ipc.piperesizefail: 0
kern.ipc.piperesizeallowed: 1
kern.ipc.msgmax: 16384
kern.ipc.msgmni: 40
kern.ipc.msgmnb: 2048
kern.ipc.msgtql: 40
kern.ipc.msgssz: 8
kern.ipc.msgseg: 2048
kern.ipc.semmni: 50
kern.ipc.semmns: 340
kern.ipc.semmnu: 150
kern.ipc.semmsl: 340
kern.ipc.semopm: 100
kern.ipc.semume: 50
kern.ipc.semusz: 632
kern.ipc.semvmx: 32767
kern.ipc.semaem: 16384
kern.ipc.shmmax: 536870912
kern.ipc.shmmin: 1
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 131072
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
kern.ipc.soacceptqueue: 128
kern.ipc.numopensockets: 79
kern.ipc.maxsockets: 3144540
kern.ipc.sendfile.readahead: 1

Christopher Forgeron

Mar 20, 2014, 4:05:22 PM
Re: cpuset ping

I can report that I do not get any failures with this ping - I have screens of
failed flood pings on the ix0 NIC, but these always pass (I have that
cpuset ping looping constantly).

I can't report about the dtrace yet, as I'm running Rick's ixgbe patch, and
there seems to be a .ko conflict someplace that keeps dtrace from running.

I'm going to try locking my flood pings down to specific CPUs to see if
there is any pattern there. After that I'll restore GENERIC and try the
dtrace line.


On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert
<markus...@hostpoint.ch> wrote:

>
>
> Can you try this when the problem occurs?
>
> for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2 -c 2
> -W 1 10.0.0.1 | grep sendto; done
>
> It will tie ping to certain cpus to test the different tx queues of your
> ix interface. If the pings reliably fail only on some queues, then your
> problem is more likely to be the same as ours.
>
> Also, if you have dtrace available:
>
> kldload dtraceall
> dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack();
> }'
>
> while you run pings over the interface affected. This will give you hints
> about where the EFBIG error comes from.
>
> > [...]
>
>
> Markus

Garrett Wollman

Mar 20, 2014, 5:13:07 PM
In article
<CAB2_NwAOmPtZjB03pdDiTK2O...@mail.gmail.com>,
csfor...@gmail.com writes:

>50/27433/0 requests for jumbo clusters denied (4k/9k/16k)

This is going to screw you. You need to make sure that no NIC driver
ever allocates 9k jumbo pages -- unless you are using one of those
mythical drivers that can't do scatter/gather DMA on receive, which
you don't appear to be.

These failures occur when the driver is trying to replenish its
receive queue, but is unable to allocate three *physically* contiguous
pages of RAM to construct the 9k jumbo cluster (of which the remaining
3k is simply wasted). This happens on any moderately active server,
once physical memory gets checkerboarded with active single pages,
particularly with ZFS where those pages are wired in kernel memory and
so can't be evicted.
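To put numbers on it (a quick illustrative calculation, nothing
driver-specific):

# pages of physically contiguous memory each cluster size ties up
PAGE = 4096
for name, size in [("4k (page size) jumbo", 4096),
                   ("9k jumbo", 9 * 1024),
                   ("16k jumbo", 16 * 1024)]:
    pages = -(-size // PAGE)              # ceiling division
    print("%-20s %d contiguous page(s), %d bytes unused"
          % (name, pages, pages * PAGE - size))

The 9k case needs the three-page contiguity and still throws away 3072 bytes
per cluster.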

-GAWollman

Christopher Forgeron

Mar 20, 2014, 5:22:46 PM
Any recommendations on what to do? I'm experimenting with disabling TSO
right now, but it's too early to tell if it fixes my problem.

On my 9.2 box, we don't see this number climbing. With TSO off on 10.0, I
also see the number is not climbing.

I'd appreciate any links you may have so I can read up on this.

Thanks for the comment.

Christopher Forgeron

Mar 20, 2014, 5:34:04 PM
I have found this:

http://lists.freebsd.org/pipermail/freebsd-net/2013-October/036955.html

I think what you're saying is that:
- an MTU of 9000 doesn't need to equal a 9k mbuf / jumbo cluster
- modern NIC drivers can gather 9000 bytes of data from various memory
locations
- the fact that I'm seeing 9k jumbo clusters shows me that my driver is
trying to allocate 9k of contiguous space, and it's failing.

Please correct me if I'm off here, I'd love to understand more.

Jack Vogel

Mar 20, 2014, 6:00:07 PM
What he's saying is that the driver should not be using 9K mbuf clusters. I
thought this had been changed, but I see the code in HEAD is still using the
larger clusters when you up the MTU. I will put it on my list to change with
the next update to HEAD.


What version of ixgbe are you using?

Jack



On Thu, Mar 20, 2014 at 2:34 PM, Christopher Forgeron
<csfor...@gmail.com> wrote:

Christopher Forgeron

Mar 20, 2014, 6:12:49 PM
Hi Jack,

I'm on ixgbe 2.5.15

I see a few other threads about using MJUMPAGESIZE instead of MJUM9BYTES.

If you have a patch you'd like me to test, I'll compile it in and let you
know. I was just looking at Garrett's if_em.c patch and thinking about
applying it to ixgbe.

As it stands, I seem not to be having the problem now that I have disabled
TSO on ix0, but I still need more test runs to confirm - which is also in
line (I think) with what you are all saying.

Jack Vogel

Mar 20, 2014, 6:24:46 PM
I strongly discourage anyone from disabling TSO on 10G; it's necessary to
get the performance one wants to see on the hardware.

Here is a patch to do what i'm talking about:

*** ixgbe.c Fri Jan 10 18:12:20 2014
--- ixgbe.jfv.c Thu Mar 20 23:04:15 2014
*************** ixgbe_init_locked(struct adapter *adapte
*** 1140,1151 ****
*/
if (adapter->max_frame_size <= 2048)
adapter->rx_mbuf_sz = MCLBYTES;
- else if (adapter->max_frame_size <= 4096)
- adapter->rx_mbuf_sz = MJUMPAGESIZE;
- else if (adapter->max_frame_size <= 9216)
- adapter->rx_mbuf_sz = MJUM9BYTES;
else
! adapter->rx_mbuf_sz = MJUM16BYTES;

/* Prepare receive descriptors and buffers */
if (ixgbe_setup_receive_structures(adapter)) {
--- 1140,1147 ----
*/
if (adapter->max_frame_size <= 2048)
adapter->rx_mbuf_sz = MCLBYTES;
else
! adapter->rx_mbuf_sz = MJUMPAGESIZE;

/* Prepare receive descriptors and buffers */
if (ixgbe_setup_receive_structures(adapter)) {






On Thu, Mar 20, 2014 at 3:12 PM, Christopher Forgeron

Christopher Forgeron

Mar 20, 2014, 6:32:17 PM
I agree, performance is noticeably worse with TSO off, but I thought it
would be a good step in troubleshooting. I'm glad you're a regular reader
of the list, so I don't have to settle for slow performance. :-)

I'm applying your patch now, I think it will fix it - but I'll report in
after it's run iometer for the night regardless.

On another note: What's so different about memory allocation in 10 that is
making this an issue?

Jack Vogel

Mar 20, 2014, 6:42:01 PM
Your 4K mbuf pool is not being used; make sure you increase its size once you
are using it, or you'll just be having the same issue with a different pool.

Oh, and that patch was against the code in HEAD, so it might need some manual
hacking if you're using anything older.

Not sure what you mean about memory allocation in 10; this change is not
10-specific, it's something I intended on doing and it just slipped between
the cracks.

Jack



On Thu, Mar 20, 2014 at 3:32 PM, Christopher Forgeron

Christopher Forgeron

Mar 20, 2014, 7:01:00 PM
Ah, good point about the 4k buffer size: I will allocate more to
kern.ipc.nmbjumbop, perhaps taking it from the 9k and 16k pools.

Yes, I did have to tweak the patch slightly to work on 10.0, but it's
basically the same thing I was trying after looking at Garrett's notes.

I see this is part of a larger problem, but I didn't see any issues with a
9.0 system for over a year, and my 9.2 system seems to be stable (all the
same hardware, same use). I was thinking it was an issue with later 9.2's
or 10, but ultimately I guess it's just a problem on any system that can't
allocate 3 contiguous 4k memory pages quickly enough(?). I do notice ~30%
more NFS speed to my ZFS pool with 10.0 - perhaps that extra throughput is
what it takes to start noticing this problem.

Then again, my 10.0 system starts out with denied 9k bufs at boot, where my
9.2 doesn't. With 96G of RAM, I would expect no real memory pressure at boot.

(I also wonder if I shouldn't be considering an MTU that fits inside a
MJUMPAGESIZE. I don't think my switches support an MTU equal to 3 or 4
full MJUMPAGESIZE clusters. Then again, wasting a bit of memory on the server
may be worth it to have slightly fewer TCP frames.)

What should be done about the other network drivers that still call
MJUM9BYTES? http://fxr.watson.org/fxr/ident?im=excerpts;i=MJUM9BYTES

I have a collection of different NICs; I could test a few of these to verify
they work okay with the same sort of patch we're talking about. I appreciate
the help everyone gives me here, so I'm willing to help out if it's needed.

Thanks again.

Rick Macklem

Mar 20, 2014, 10:13:22 PM
Christopher Forgeron wrote:
>
> Output from the patch you gave me (I have screens of it.. let me know
> what you're hoping to see.
>
>
> Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
Hmm. I think this means that the loop that generates TSO segments in
tcp_output() is broken, since I'm pretty sure that the maximum size
should be IP_MAXPACKET (65535).

Either that or some non-TCP socket is trying to send a packet that
exceeds IP_MAXPACKET for some reason.
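For what it's worth, the lengths in the log only exceed that limit by a few
bytes (just arithmetic on the logged values):

# how far the logged TSO packet lengths go past IP_MAXPACKET
IP_MAXPACKET = 65535
for pklen in (65538, 65542):              # values seen in the printf output
    print("pklen=%d is %d bytes over IP_MAXPACKET" %
          (pklen, pklen - IP_MAXPACKET))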

Would it be possible to add a printf() for m->m_pkthdr.csum_flags
to the before case, in the "if" that generates the before printf?
I didn't think to put this in, but CSUM_TSO will be set if it
is a TSO segment, I think? My networking is very rusty.
(If how to add this isn't obvious, just email and I'll update
the patch.)

Thanks for doing this, rick
> > 4080/8360/12440/524288 mbuf clusters in use
> > (current/cache/total/max)
> >
> > 4080/4751 mbuf+clusters out of packet secondary zone in use
> > (current/cache)
> >
> > 0/452/452/262144 4k (page size) jumbo clusters in use
> > (current/cache/total/max)
> >
> > 32773/4129/36902/96000 9k jumbo clusters in use
> > (current/cache/total/max)
> >
> > 0/0/0/508538 16k jumbo clusters in use (current/cache/total/max)
> >
> > 312608K/59761K/372369K bytes allocated to network
> > (current/cache/total)
> >
> > 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> >
> > 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> >
> > 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> >
> > 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> >
> > 0/0/0 sfbufs in use (current/peak/max)
> >
> > 0 requests for sfbufs denied
> >
> > 0 requests for sfbufs delayed
> >
> > 0 requests for I/O initiated by sendfile
> >
> > 0 calls to protocol drain routines
> >
> >
> >
> >
> >
> > 10.0's netstat -m:
>
> >
> >
> >
> > 21512/24448/45960 mbufs in use (current/cache/total)
> >
> > 4080/16976/21056/6127254 mbuf clusters in use
> > (current/cache/total/max)
> >
> > 4080/16384 mbuf+clusters out of packet secondary zone in use
> > (current/cache)
> >
> > 0/23/23/3063627 4k (page size) jumbo clusters in use
> > (current/cache/total/max)
> >
> > 16384/158/16542/907741 9k jumbo clusters in use
> > (current/cache/total/max)
> >
> > 0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
> >
> > 160994K/41578K/202572K bytes allocated to network
> > (current/cache/total)
> >
> > 17488/13290/20464 requests for mbufs denied
> > (mbufs/clusters/mbuf+clusters)
> >
> > 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> >
> > 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> >
> > 7/16462/0 requests for jumbo clusters denied (4k/9k/16k)
> >
> > 0 requests for sfbufs denied
> >
> > 0 requests for sfbufs delayed
> >
> > 0 requests for I/O initiated by sendfile
> >
> >
> >
> > Way more mbuf clusters in use, but also I never get denied/delayed
> > results
> > in 9.2 - but I have them in 10.0 right away after a reboot.
> >
> >
> >
> > Thanks for any help..

Rick Macklem

Mar 20, 2014, 10:25:51 PM
Christopher Forgeron wrote:
>
>
>
>
>
>
> On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert <
> markus...@hostpoint.ch > wrote:
>
>
>
>
>
> Possible. We still see this on nfsclients only, but I’m not convinced
> that nfs is the only trigger.
>
>
Since Christopher is getting a bunch of the "before" printf()s from
my patch, it indicates that a packet/TSO segment that is > 65535 bytes
in length is showing up at ixgbe_xmit(). I've asked him to add a printf()
for the m_pkthdr.csum_flags field to see if it is really a TSO segment.

If it is a TSO segment, that indicates to me that the code in tcp_output() that should
generate a TSO segment no greater than 65535 bytes in length is busted.
And this would imply just about any app doing large sosend()s could cause
this, I think? (NFS read replies/write requests of 64K would be one of them.)

rick

>
>
>
> Just to clarify, I'm experiencing this error with NFS, but also with
> iSCSI - I turned off my NFS server in rc.conf and rebooted, and I'm
> still able to create the error. This is not just a NFS issue on my
> machine.
>
>
>
> I our case, when it happens, the problem persists for quite some time
> (minutes or hours) if we don’t interact (ifconfig or reboot).
>
>
>
> The first few times that I ran into it, I had similar issues -
> Because I was keeping my system up and treating it like a temporary
> problem/issue. Worst case scenario resulted in reboots to reset the
> NIC. Then again, I find the ix's to be cranky if you ifconfig them
> too much.
>
> Now, I'm trying to find a root cause, so as soon as I start seeing
> any errors, I abort and reboot the machine to test the next theory.
>
>
> Additionally, I'm often able to create the problem with just 1 VM
> running iometer on the SAN storage. When the problem occurs, that
> connection is broken temporarily, taking network load off the SAN -
> That may improve my chances of keeping this running.
>
>
>
>
>

> > I am able to reproduce it fairly reliably within 15 min of a reboot
> > by
> > loading the server via NFS with iometer and some large NFS file
> > copies at
> > the same time. I seem to need to sustain ~2 Gbps for a few minutes.
>

> That’s probably why we can’t reproduce it reliably here. Although
> having 10gig cards in our blade servers, the ones affected are
> connected to a 1gig switch.
>
>
>
>
>
> It seems that it needs a lot of traffic. I have a 10 gig backbone
> between my SANs and my ESXi machines, so I can saturate quite
> quickly (just now I hit a record.. the error occurred within ~5 min
> of reboot and testing). In your case, I recommend firing up multiple
> VM's running iometer on different 1 gig connections and see if you
> can make it pop. I also often turn off ix1 to drive all traffic
> through ix0 - I've noticed it happens faster this way, but once
> again I'm not taking enough observations to make decent time
> predictions.
>
>
>
>
>
>

> Can you try this when the problem occurs?
>
> for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2
> -c 2 -W 1 10.0.0.1 | grep sendto; done
>
> It will tie ping to certain cpus to test the different tx queues of
> your ix interface. If the pings reliably fail only on some queues,
> then your problem is more likely to be the same as ours.
>
> Also, if you have dtrace available:
>
> kldload dtraceall
> dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / {
> stack(); }'
>
> while you run pings over the interface affected. This will give you
> hints about where the EFBIG error comes from.
>

> > […]


>
>
> Markus
>
>
>
>
> Will do. I'm not sure what shell the first script was written for,
> it's not working in csh, here's a re-write that does work in csh in
> case others are using the default shell:
>
> #!/bin/csh
> foreach CPU (`seq 0 23`)
> echo "CPU$CPU";

> cpuset -l $CPU ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto;


> end
>
>
> Thanks for your input. I should have results to post to the list
> shortly.
>
>

Christopher Forgeron

Mar 20, 2014, 10:32:59 PM
Yes, there is something broken in TSO for sure, as disabling it allows me
to run without error. It is possible that the drop in performance is
allowing me to stay under a critical threshold for the problem, but I'd
feel happier testing to make sure.

I understand what you're asking for in the patch, I'll make the edits
tomorrow and recompile a test kernel and see.

Right now I'm running tests with the ixgbe patch that Jack sent. Even if his
patch fixes the issue, I wonder if something else isn't broken in TSO, as the
ixgbe code has had these lines for a long time, and it's only on this 10.0
build that I have issues.

I'll be following up tomorrow with info on either outcome.

Thanks for your help - your rusty networking is still better than mine. :-)


On Thu, Mar 20, 2014 at 11:13 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
> >
> > Output from the patch you gave me (I have screens of it.. let me know
> > what you're hoping to see.
> >
> >
> > Mar 20 16:37:22 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
> > Mar 20 16:37:22 SAN0 kernel: before pklen=65538 actl=65538
> Hmm. I think this means that the loop that generates TSO segments in
> tcp_output() is broken, since I'm pretty sure that the maximum size
> should be is IP_MAXPACKET (65535).
>
> Either that or some non-TCP socket is trying to send a packet that
> exceeds IP_MAXPACKET for some reason.
>
> Would it be possible to add a printf() for m->m_pkthdr.csum_flags
> to the before case, in the "if" that generates the before printf?
> I didn't think to put this in, but CSUM_TSO will be set if it
> is a TSO segment, I think? My networking is very rusty.
> (If how to add this isn't obvious, just email and I'll update
> the patch.)
>
> Thanks for doing this, rick
>
>

Christopher Forgeron

Mar 20, 2014, 10:47:44 PM
BTW - I think this will end up being a TSO issue, not the patch that Jack
applied.

When I boot with Jack's patch (MJUM9BYTES removal), this is what netstat -m shows:

21489/2886/24375 mbufs in use (current/cache/total)
4080/626/4706/6127254 mbuf clusters in use (current/cache/total/max)
4080/587 mbuf+clusters out of packet secondary zone in use (current/cache)
16384/50/16434/3063627 4k (page size) jumbo clusters in use
(current/cache/total/max)
0/0/0/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
79068K/2173K/81241K bytes allocated to network (current/cache/total)
18831/545/4542 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
15626/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

Here is an un-patched boot:

21550/7400/28950 mbufs in use (current/cache/total)
4080/3760/7840/6127254 mbuf clusters in use (current/cache/total/max)
4080/2769 mbuf+clusters out of packet secondary zone in use (current/cache)
0/42/42/3063627 4k (page size) jumbo clusters in use
(current/cache/total/max)
16439/129/16568/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
161498K/10699K/172197K bytes allocated to network (current/cache/total)
18345/155/4099 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
3/3723/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile



See how removing MJUM9BYTES just pushes the problem from the 9k jumbo
clusters into the 4k jumbo clusters?

Compare this to my FreeBSD 9.2-STABLE machine from ~Dec 2013: exact same
hardware, revisions, zpool size, etc.; it's just running an older FreeBSD.

# uname -a
FreeBSD SAN1.XXXXX 9.2-STABLE FreeBSD 9.2-STABLE #0: Wed Dec 25 15:12:14
AST 2013 aatech@FreeBSD-Update Server:/usr/obj/usr/src/sys/GENERIC
amd64

root@SAN1:/san1 # uptime
7:44AM up 58 days, 38 mins, 4 users, load averages: 0.42, 0.80, 0.91

root@SAN1:/san1 # netstat -m
37930/15755/53685 mbufs in use (current/cache/total)
4080/10996/15076/524288 mbuf clusters in use (current/cache/total/max)
4080/5775 mbuf+clusters out of packet secondary zone in use (current/cache)
0/692/692/262144 4k (page size) jumbo clusters in use
(current/cache/total/max)
32773/4257/37030/96000 9k jumbo clusters in use (current/cache/total/max)
0/0/0/508538 16k jumbo clusters in use (current/cache/total/max)
312599K/67011K/379611K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Lastly, please note this link:

http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033660.html

It's so old that I assume the TSO leak that he speaks of has been patched,
but perhaps not. More things to look into tomorrow.

Rick Macklem

Mar 20, 2014, 10:47:58 PM
Christopher Forgeron wrote:
> Yes, there is something broken in TSO for sure, as disabling it
> allows me
> to run without error. It is possible that the drop in performance is
> allowing me to stay under a critical threshold for the problem, but
> I'd
> feel happier testing to make sure.
>
> I understand what you're asking for in the patch, I'll make the edits
> tomorrow and recompile a test kernel and see.
>
I also suggested a small change (basically reverting it to the 9.1 code)
for tcp_output() in sys/netinet/tcp_output.c (around line# 777-778).
You might as well throw that in at the same time.

Thanks for all your work with this (and this applies to others that
have been working on this as well.)

rick

Christopher Forgeron

Mar 20, 2014, 10:51:44 PM
Sorry Rick, what's the small change you wanted in sys/netinet/tcp_output.c
at 777-778? I see it's calculating length... or did you want me to take the
whole file back to 9.1-RELEASE?

Christopher Forgeron

Mar 20, 2014, 10:53:21 PM
Pardon the delay in receiving messages. I see your edits for 777-778; I will
attempt them tomorrow.

Christopher Forgeron

Mar 21, 2014, 7:47:35 AM
Hello all,

I ran Jack's ixgbe MJUM9BYTES removal patch and let iometer hammer away
at the NFS store overnight - but the problem is still there.

From what I read, I think the MJUM9BYTES removal is probably good cleanup
(as long as it doesn't trade performance on a lightly memory-loaded system
for performance on a heavily memory-loaded system). If I can stabilize my
system, I may attempt those benchmarks.

I think the fix will be obvious at boot for me - my 9.2 has a 'clean'
netstat. Until I can boot and see a 'netstat -m' that looks similar to that,
I'm going to have this problem.

Markus: Do your systems show denied mbufs at boot like mine do?

Turning off TSO works for me, but with a performance hit.

I'll compile Rick's patch (and extra debugging) this morning and let you
know soon.

Markus Gebert

Mar 21, 2014, 9:04:07 AM

On 21.03.2014, at 12:47, Christopher Forgeron <csfor...@gmail.com> wrote:

> Hello all,
>
> I ran Jack's ixgbe MJUM9BYTES removal patch, and let iometer hammer away
> at the NFS store overnight - But the problem is still there.
>
> From what I read, I think the MJUM9BYTES removal is probably good cleanup
> (as long as it doesn't trade performance on a lightly memory loaded system
> for performance on a heavily memory loaded system). If I can stabilize my
> system, I may attempt those benchmarks.
>
> I think the fix will be obvious at boot for me - My 9.2 has a 'clean'
> netstat
> - Until I can boot and see a 'netstat -m' that looks similar to that, I'm
> going to have this problem.
>
> Markus: Do your systems show denied mbufs at boot like mine does?

No. Our systems never show denied mbufs. Not on boot, not during normal operations, and also not when the problem is occurring. I don’t know what you do differently, but in our case neither 4k nor 9k mbufs get used, only the normal ones.

I’m beginning to think that we are looking at different problems, or at least quite different symptoms of a similar problem. Have you had any luck finding out where EFBIG originates in your case?


Markus

Christopher Forgeron

Mar 21, 2014, 9:16:47 AM
Hi Markus,

Yes, we may have different problems, or perhaps the same problem is
manifesting itself in different ways in our systems.

Have you tried a 10.0-RELEASE system yet? If we were on the same OS
version, we could then compare system specs a bit deeper, and see what is
different. Perhaps under 10.0 your symptoms would be closer to mine, which
may not be progress, but would be something.

Markus Gebert

Mar 21, 2014, 10:21:36 AM

On 21.03.2014, at 14:16, Christopher Forgeron <csfor...@gmail.com> wrote:

> Hi Markus,
>
> Yes, we may have different problems, or perhaps the same problem is manifesting itself in different ways in our systems.
>
> Have you tried a 10.0-RELEASE system yet? If we were on the same OS version, we could then compare system specs a bit deeper, and see what is different. Perhaps under 10.0 your symptoms would be closer to mine, which may not be progress, but would be something.

I’m afraid we can’t. We can only reproduce this on production systems, and they cannot be easily upgraded right now. For one, we don’t have our build infrastructure ready for 10.x, because we usually skip the X.0 release, which means there is no poudriere/pkg repo with the locally patched packages I need before these systems can go into production. Then there’s testing, configuration changes needed, etc. It will probably be a year until we can consider upgrading anything in production to FreeBSD 10.x.

I could set up a test system on identical spare hardware, but since we cannot trigger the problem without production load so far, there’s no point.


Markus

Christopher Forgeron

Mar 21, 2014, 10:49:26 AM
Ah, I understand the difficulties of testing production systems.

However, if you can make a spare tester of the same hardware, that's
perfect - and you can generate all the load you need with benchmark
software like iometer, large NFS copies, or perhaps a small replica of your
network. Synthetic load is easier to control, which makes it easier to
reproduce the issue and speeds up testing. Heck, you may be able to do it
all by looping through your two ix adapters and never using an external
client.

It's a bit of a pain to set up, but it's worth the effort IMO.

Christopher Forgeron

Mar 21, 2014, 10:54:36 AM
Rick:

Unfortunately your patch didn't work. I expected as much as soon as I saw
my boot time 'netstat -m', but I wanted to run the tests to make sure.

First, here is where I put in your additional line - let me know if that's
what you were hoping for; I'm using mmm->m_pkthdr.csum_flags, since m
doesn't exist until the call to m_defrag a few lines below.

printf("before pklen=%d actl=%d csum=%lu\n", mmm->m_pkthdr.len, iii,
mmm->m_pkthdr.csum_flags);

With this in place, here is the first set of logs after ~ 5min of load:







On Thu, Mar 20, 2014 at 11:25 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
> >
> >
> >
> >
> >
> >
> > On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert <
> > markus...@hostpoint.ch > wrote:
> >
> >
> >
> >
> >
> > Possible. We still see this on nfsclients only, but I'm not convinced
> > that nfs is the only trigger.
> >
> >
> Since Christopher is getting a bunch of the "before" printf()s from
> my patch, it indicates that a packet/TSO segment that is > 65535 bytes
> in length is showing up at ixgbe_xmit(). I've asked him to add a printf()
> > > [...]
> >
> >
> > Markus
> >
> >
> >
> >
> > Will do. I'm not sure what shell the first script was written for,
> > it's not working in csh, here's a re-write that does work in csh in
> > case others are using the default shell:
> >
> > #!/bin/csh
> > foreach CPU (`seq 0 23`)
> > echo "CPU$CPU";
> > cpuset -l $CPU ping -i 0.2 -c 2 -W 1 10.0.0.1 | grep sendto;
> > end
> >
> >
> > Thanks for your input. I should have results to post to the list
> > shortly.
> >
> >
>

Christopher Forgeron

unread,
Mar 21, 2014, 11:01:08 AM3/21/14
to
(Pardon me, for some reason my gmail is sending on my cut-n-pastes if I cr
down too fast)

First set of logs:

Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
Mar 21 11:07:00 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
Mar 21 11:07:00 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
Mar 21 11:07:00 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116
Mar 21 11:07:00 SAN0 kernel: after mbcnt=33 pklen=65542 actl=65542
Mar 21 11:07:00 SAN0 kernel: before pklen=65542 actl=65542 csum=4116

Here's a few later on.

Mar 21 11:10:09 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:10:09 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 21 11:10:09 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:10:09 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 21 11:10:09 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:10:09 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 21 11:10:09 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:10:09 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538

Mar 21 11:23:00 SAN0 kernel: after mbcnt=33 pklen=65546 actl=65546
Mar 21 11:23:01 SAN0 kernel: before pklen=65546 actl=65546 csum=4116
Mar 21 11:23:01 SAN0 kernel: after mbcnt=33 pklen=65546 actl=65546
Mar 21 11:23:03 SAN0 kernel: before pklen=65546 actl=65546 csum=4116
Mar 21 11:23:03 SAN0 kernel: after mbcnt=33 pklen=65546 actl=65546
Mar 21 11:23:04 SAN0 kernel: before pklen=65546 actl=65546 csum=4116
Mar 21 11:23:04 SAN0 kernel: after mbcnt=33 pklen=65546 actl=65546

Mar 21 11:41:25 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:41:25 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 21 11:41:25 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:41:25 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 21 11:41:25 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:41:25 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 21 11:41:25 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:41:25 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 21 11:41:26 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:41:26 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538
Mar 21 11:41:26 SAN0 kernel: before pklen=65538 actl=65538 csum=4116
Mar 21 11:41:26 SAN0 kernel: after mbcnt=33 pklen=65538 actl=65538

To be clear, I changed tp->t_tsomax to IP_MAXPACKET at ~ 777 in
sys/netinet/tcp_output.c like so:

if (len > IP_MAXPACKET - hdrlen) {
        len = IP_MAXPACKET - hdrlen;
        sendalot = 1;
}

I notice there is more that is different between 9.1 and 10 for this file:
http://fxr.watson.org/fxr/diff/netinet/tcp_output.c?v=FREEBSD10;diffval=FREEBSD91;diffvar=v

I'm going to attempt inserting a 9.1 tcp_output.c and see if that makes any
difference.

Otherwise, I await further ideas from the list.

Thanks.

Christopher Forgeron

unread,
Mar 21, 2014, 11:22:01 AM3/21/14
to
Markus,

I don't know why I didn't notice this before... I copied your cpuset ping
verbatim, not realizing that I should be using 172.16.0.x, as that's my
network on the ix's.

On this tester box, 10.0.0.1 goes out a different interface, thus it never
reported back any problems.

Now that I've corrected that, I see I have problems on the same queues:

CPU0
ping: sendto: No buffer space available
ping: sendto: No buffer space available
CPU1
CPU2
CPU3
CPU4
CPU5
CPU6
CPU7
CPU8
ping: sendto: No buffer space available
ping: sendto: No buffer space available
CPU9
CPU10
CPU11
CPU12
CPU13
CPU14
CPU15
CPU16
ping: sendto: No buffer space available
ping: sendto: No buffer space available
CPU17
CPU18
CPU19
CPU20
CPU21
CPU22
CPU23

I can run that three times and get the same CPUs. I'll try a reboot and
see if they always fail on the same queues, though I don't know if that would
show anything.

At this stage, NFS connections coming into the box are down, but I can
still ping out. Incoming pings show 'host is down'

Here is the dump of ix0 's sysctls (only ix0 is in use on this machine for
testing)

dev.ix.0.queue0.interrupt_rate: 500000
dev.ix.0.queue0.irqs: 100179
dev.ix.0.queue0.txd_head: 0
dev.ix.0.queue0.txd_tail: 0
dev.ix.0.queue0.tso_tx: 104156
dev.ix.0.queue0.no_tx_dma_setup: 0
dev.ix.0.queue0.no_desc_avail: 5
dev.ix.0.queue0.tx_packets: 279480
dev.ix.0.queue0.rxd_head: 513
dev.ix.0.queue0.rxd_tail: 512
dev.ix.0.queue0.rx_packets: 774424
dev.ix.0.queue0.rx_bytes: 281916
dev.ix.0.queue0.rx_copies: 4609
dev.ix.0.queue0.lro_queued: 0
dev.ix.0.queue0.lro_flushed: 0
dev.ix.0.queue1.interrupt_rate: 71428
dev.ix.0.queue1.irqs: 540682
dev.ix.0.queue1.txd_head: 1295
dev.ix.0.queue1.txd_tail: 1295
dev.ix.0.queue1.tso_tx: 15
dev.ix.0.queue1.no_tx_dma_setup: 0
dev.ix.0.queue1.no_desc_avail: 0
dev.ix.0.queue1.tx_packets: 93248
dev.ix.0.queue1.rxd_head: 0
dev.ix.0.queue1.rxd_tail: 2047
dev.ix.0.queue1.rx_packets: 462225
dev.ix.0.queue1.rx_bytes: 0
dev.ix.0.queue1.rx_copies: 0
dev.ix.0.queue1.lro_queued: 0
dev.ix.0.queue1.lro_flushed: 0
dev.ix.0.queue2.interrupt_rate: 71428
dev.ix.0.queue2.irqs: 282801
dev.ix.0.queue2.txd_head: 367
dev.ix.0.queue2.txd_tail: 367
dev.ix.0.queue2.tso_tx: 312757
dev.ix.0.queue2.no_tx_dma_setup: 0
dev.ix.0.queue2.no_desc_avail: 0
dev.ix.0.queue2.tx_packets: 876533
dev.ix.0.queue2.rxd_head: 0
dev.ix.0.queue2.rxd_tail: 2047
dev.ix.0.queue2.rx_packets: 2324954
dev.ix.0.queue2.rx_bytes: 0
dev.ix.0.queue2.rx_copies: 0
dev.ix.0.queue2.lro_queued: 0
dev.ix.0.queue2.lro_flushed: 0
dev.ix.0.queue3.interrupt_rate: 71428
dev.ix.0.queue3.irqs: 1424108
dev.ix.0.queue3.txd_head: 499
dev.ix.0.queue3.txd_tail: 499
dev.ix.0.queue3.tso_tx: 1263116
dev.ix.0.queue3.no_tx_dma_setup: 0
dev.ix.0.queue3.no_desc_avail: 0
dev.ix.0.queue3.tx_packets: 1590798
dev.ix.0.queue3.rxd_head: 0
dev.ix.0.queue3.rxd_tail: 2047
dev.ix.0.queue3.rx_packets: 8319143
dev.ix.0.queue3.rx_bytes: 0
dev.ix.0.queue3.rx_copies: 0
dev.ix.0.queue3.lro_queued: 0
dev.ix.0.queue3.lro_flushed: 0
dev.ix.0.queue4.interrupt_rate: 71428
dev.ix.0.queue4.irqs: 138019
dev.ix.0.queue4.txd_head: 1620
dev.ix.0.queue4.txd_tail: 1620
dev.ix.0.queue4.tso_tx: 29235
dev.ix.0.queue4.no_tx_dma_setup: 0
dev.ix.0.queue4.no_desc_avail: 0
dev.ix.0.queue4.tx_packets: 200853
dev.ix.0.queue4.rxd_head: 6
dev.ix.0.queue4.rxd_tail: 5
dev.ix.0.queue4.rx_packets: 218327
dev.ix.0.queue4.rx_bytes: 1527
dev.ix.0.queue4.rx_copies: 0
dev.ix.0.queue4.lro_queued: 0
dev.ix.0.queue4.lro_flushed: 0
dev.ix.0.queue5.interrupt_rate: 71428
dev.ix.0.queue5.irqs: 131367
dev.ix.0.queue5.txd_head: 330
dev.ix.0.queue5.txd_tail: 330
dev.ix.0.queue5.tso_tx: 9907
dev.ix.0.queue5.no_tx_dma_setup: 0
dev.ix.0.queue5.no_desc_avail: 0
dev.ix.0.queue5.tx_packets: 150955
dev.ix.0.queue5.rxd_head: 0
dev.ix.0.queue5.rxd_tail: 2047
dev.ix.0.queue5.rx_packets: 72814
dev.ix.0.queue5.rx_bytes: 0
dev.ix.0.queue5.rx_copies: 0
dev.ix.0.queue5.lro_queued: 0
dev.ix.0.queue5.lro_flushed: 0
dev.ix.0.queue6.interrupt_rate: 71428
dev.ix.0.queue6.irqs: 839814
dev.ix.0.queue6.txd_head: 1402
dev.ix.0.queue6.txd_tail: 1402
dev.ix.0.queue6.tso_tx: 327633
dev.ix.0.queue6.no_tx_dma_setup: 0
dev.ix.0.queue6.no_desc_avail: 0
dev.ix.0.queue6.tx_packets: 1371262
dev.ix.0.queue6.rxd_head: 0
dev.ix.0.queue6.rxd_tail: 2047
dev.ix.0.queue6.rx_packets: 2559592
dev.ix.0.queue6.rx_bytes: 0
dev.ix.0.queue6.rx_copies: 0
dev.ix.0.queue6.lro_queued: 0
dev.ix.0.queue6.lro_flushed: 0
dev.ix.0.queue7.interrupt_rate: 71428
dev.ix.0.queue7.irqs: 150693
dev.ix.0.queue7.txd_head: 1965
dev.ix.0.queue7.txd_tail: 1965
dev.ix.0.queue7.tso_tx: 248
dev.ix.0.queue7.no_tx_dma_setup: 0
dev.ix.0.queue7.no_desc_avail: 0
dev.ix.0.queue7.tx_packets: 145736
dev.ix.0.queue7.rxd_head: 0
dev.ix.0.queue7.rxd_tail: 2047
dev.ix.0.queue7.rx_packets: 19030
dev.ix.0.queue7.rx_bytes: 0
dev.ix.0.queue7.rx_copies: 0
dev.ix.0.queue7.lro_queued: 0
dev.ix.0.queue7.lro_flushed: 0



On Thu, Mar 20, 2014 at 7:40 AM, Markus Gebert
<markus...@hostpoint.ch>wrote:

>
>
> Can you try this when the problem occurs?
>
> for CPU in {0..7}; do echo "CPU${CPU}"; cpuset -l ${CPU} ping -i 0.2 -c 2
> -W 1 10.0.0.1 | grep sendto; done
>
> It will tie ping to certain cpus to test the different tx queues of your
> ix interface. If the pings reliably fail only on some queues, then your
> problem is more likely to be the same as ours.
>
> Also, if you have dtrace available:
>
> kldload dtraceall
> dtrace -n 'fbt:::return / arg1 == EFBIG && execname == "ping" / { stack();
> }'
>
> while you run pings over the interface affected. This will give you hints
> about where the EFBIG error comes from.
>
> > [...]
>
>
> Markus
>
>
>

Markus Gebert

unread,
Mar 21, 2014, 12:30:15 PM3/21/14
to

On 21.03.2014, at 16:22, Christopher Forgeron <csfor...@gmail.com> wrote:

> Markus,
>
> I don't know why I didn't notice this before.. I copied your cpuset ping
> verbatim, not realizing that I should be using 172.16.0.x as that's my
> network on the ix's
>
> On this tester box, 10.0.0.1 goes out a different interface, thus it never
> reported back any problems.

I’m sorry, I could have mentioned that. The good news is, this makes our two problems look very similar again.
While this is not EFBIG, we’ve seen this too. It usually starts out as EFBIG because _bus_dmamap_load_mbuf_sg fails, and at some point turns into ENOBUFS when the software tx queue fills up too.

If there’s no flow id, CPU cores and tx queues have a direct relationship, which seems to be the case with ping packets.

ixgbe.c
798 /* Which queue to use */
799 if ((m->m_flags & M_FLOWID) != 0)
800 i = m->m_pkthdr.flowid % adapter->num_queues;
801 else
802 i = curcpu % adapter->num_queues;

In your example, queue 0 got stuck, which is the one that CPUs 0, 8 and 16 will queue their ping packets in. num_queues defaults to 8 on systems with 8 cores or more, so it’s enough to run this test on the first 8 CPUs to cover all tx queues.
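
For illustration, here is a standalone sketch of that modulo mapping (a user-space toy, not driver code), assuming 24 cores and the default of 8 tx queues:

#include <stdio.h>

int
main(void)
{
        int num_queues = 8;     /* default with 8 or more cores */
        int cpu;

        /* Same curcpu % num_queues selection as the driver snippet above
         * (the no-flowid case that ping packets hit). */
        for (cpu = 0; cpu < 24; cpu++)
                printf("CPU%-2d -> tx queue %d\n", cpu, cpu % num_queues);
        return (0);
}

CPUs 0, 8 and 16 all land on queue 0, which lines up with the three CPUs where your pings failed.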


> I can run that three times and get the same CPU's. I'll try a reboot and
> see if they always fail on the same queues, tho I don't know if that would
> show anything.

I’ve seen it happen on 2 queues at the same time. And I’ve seen it go away after a couple of hours leaving the system idle.


> At this stage, NFS connections coming into the box are down, but I can
> still ping out. Incoming pings show 'host is down’

I guess what will still work or not really depends on which queue is affected and which flows are tied to that queue.


Markus

Markus Gebert

unread,
Mar 21, 2014, 12:40:40 PM3/21/14
to

On 21.03.2014, at 15:49, Christopher Forgeron <csfor...@gmail.com> wrote:

> However, if you can make a spare tester of the same hardware, that's
> perfect - And you can generate all the load you need with benchmark
> software like iometer, large NFS copies, or perhaps a small replica of your
> network. Synthetic load is easier to control, and thus easier to reproduce
> the issue and speed testing. Heck, you may be able to do it all by looping
> through your two ix adapters and never using an external client.
>
> It's a bit of a pain to setup, but it's worth the effort imo.

The main problem is that all the affected systems are blades which are only connected at 1gig. I think that’s the main reason we have trouble reproducing the problem, and I cannot change that, because we simply lack the parts to produce any kind of 10gig connection between blades. So I will postpone this idea, especially since our problems seem very similar again. 9.2 or 10.0 does not seem to matter, at least for now.

Christopher Forgeron

unread,
Mar 21, 2014, 1:22:18 PM3/21/14
to
Fair enough.

Have you tried disabling TSO on the ix's? That does fix the problem for
me, however there is a performance penalty to be paid.

I'm now regressing through the ixgbe drivers - I see there have been changes
to how the queues are drained between 9.1 and 10.0; I'll see if the older
ixgbe 2.4.8 works under 10.0.


On Fri, Mar 21, 2014 at 1:40 PM, Markus Gebert
<markus...@hostpoint.ch>wrote:

Christopher Forgeron

unread,
Mar 21, 2014, 3:51:21 PM3/21/14
to
Update:

I've noticed a fair number of differences in the ixgbe driver between 9.2
and 10.0-RELEASE, even though they have the same 2.5.15 version. Mostly
Netmap integration.

I've loaded up a 9.2-STABLE ixgbe driver from Dec 25th as it was handy (I
had to hack the source a bit since some #def's had changed), and I
immediately notice one difference:

netstat -m
21486/1464/22950 mbufs in use (current/cache/total)
4080/168/4248/6127254 mbuf clusters in use (current/cache/total/max)
4080/149 mbuf+clusters out of packet secondary zone in use (current/cache)
0/3/3/3063627 4k (page size) jumbo clusters in use (current/cache/total/max)
16384/0/16384/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
160987K/714K/161701K bytes allocated to network (current/cache/total)
17721/108/4104 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
2/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

Not a homerun, but definitely better on the jumbo clusters denied.

(For reference, I'd normally see:
2/13185/0 requests for jumbo clusters denied (4k/9k/16k) )

It still gives me errors, but you can see it's really not hitting the wall
for jumbo clusters on boot. Perhaps those jumbo clusters are being denied
as the buffers are being set up?

This is after it starts to blow up:


netstat -m
21632/12838/34470 mbufs in use (current/cache/total)
4116/4808/8924/6127254 mbuf clusters in use (current/cache/total/max)
4080/4050 mbuf+clusters out of packet secondary zone in use (current/cache)
0/36/36/3063627 4k (page size) jumbo clusters in use
(current/cache/total/max)
16439/121/16560/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
161591K/14058K/175649K bytes allocated to network (current/cache/total)
20581/3188/5880 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
33/84/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

..after another 5 min of blowups

netstat -m
28065/8040/36105 mbufs in use (current/cache/total)
4482/4644/9126/6127254 mbuf clusters in use (current/cache/total/max)
4112/4018 mbuf+clusters out of packet secondary zone in use (current/cache)
0/36/36/3063627 4k (page size) jumbo clusters in use
(current/cache/total/max)
16384/176/16560/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
163436K/13026K/176462K bytes allocated to network (current/cache/total)
22223/3199/5880 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
33/84/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile


My next attempt is the ixgbe from 10.0-STABLE. I will come back to the
9.2-STABLE driver a bit later.

As for what queues are locked from this:
CPU0, 8, 16 - Three down, like last time.

Rick Macklem

unread,
Mar 21, 2014, 7:44:53 PM3/21/14
to
Christopher Forgeron wrote:
>
>
>
>
>
>
> Hello all,
>
> I ran Jack's ixgbe MJUM9BYTES removal patch, and let iometer hammer
> away at the NFS store overnight - But the problem is still there.
>
>
> From what I read, I think the MJUM9BYTES removal is probably good
> cleanup (as long as it doesn't trade performance on a lightly memory
> loaded system for performance on a heavily memory loaded system). If
> I can stabilize my system, I may attempt those benchmarks.
>
>
> I think the fix will be obvious at boot for me - My 9.2 has a 'clean'
> netstat
> - Until I can boot and see a 'netstat -m' that looks similar to that,
> I'm going to have this problem.
>
>
> Markus: Do your systems show denied mbufs at boot like mine does?
>
>
> Turning off TSO works for me, but at a performance hit.
>
> I'll compile Rick's patch (and extra debugging) this morning and let
> you know soon.
>
>
>
>
>
>
> On Thu, Mar 20, 2014 at 11:47 PM, Christopher Forgeron <
> csfor...@gmail.com > wrote:
>
>
>
>
>
>
>
>
> BTW - I think this will end up being a TSO issue, not the patch that
> Jack applied.
>
> When I boot Jack's patch (MJUM9BYTES removal) this is what netstat -m
> shows:
>
> 21489/2886/24375 mbufs in use (current/cache/total)
> 4080/626/4706/6127254 mbuf clusters in use (current/cache/total/max)
> 4080/587 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 16384/50/16434/3063627 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 0/0/0/907741 9k jumbo clusters in use (current/cache/total/max)
>
> 0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
> 79068K/2173K/81241K bytes allocated to network (current/cache/total)
> 18831/545/4542 requests for mbufs denied
> (mbufs/clusters/mbuf+clusters)
>
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 15626/0/0 requests for jumbo clusters denied (4k/9k/16k)
>
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
>
> Here is an un-patched boot:
>
> 21550/7400/28950 mbufs in use (current/cache/total)
> 4080/3760/7840/6127254 mbuf clusters in use (current/cache/total/max)
> 4080/2769 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 0/42/42/3063627 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 16439/129/16568/907741 9k jumbo clusters in use
> (current/cache/total/max)
>
> 0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
> 161498K/10699K/172197K bytes allocated to network
> (current/cache/total)
> 18345/155/4099 requests for mbufs denied
> (mbufs/clusters/mbuf+clusters)
>
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 3/3723/0 requests for jumbo clusters denied (4k/9k/16k)
>
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
>
>
>
> See how removing the MJUM9BYTES is just pushing the problem from the
> 9k jumbo cluster into the 4k jumbo cluster?
>
> Compare this to my FreeBSD 9.2 STABLE machine from ~ Dec 2013 : Exact
> same hardware, revisions, zpool size, etc. Just it's running an
> older FreeBSD.
>
> # uname -a
> FreeBSD SAN1.XXXXX 9.2-STABLE FreeBSD 9.2-STABLE #0: Wed Dec 25
> 15:12:14 AST 2013 aatech@FreeBSD-Update
> Server:/usr/obj/usr/src/sys/GENERIC amd64
>
> root@SAN1:/san1 # uptime
> 7:44AM up 58 days, 38 mins, 4 users, load averages: 0.42, 0.80, 0.91
>
> root@SAN1:/san1 # netstat -m
> 37930/15755/53685 mbufs in use (current/cache/total)
> 4080/10996/15076/524288 mbuf clusters in use
> (current/cache/total/max)
> 4080/5775 mbuf+clusters out of packet secondary zone in use
> (current/cache)
> 0/692/692/262144 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 32773/4257/37030/96000 9k jumbo clusters in use
> (current/cache/total/max)
>
> 0/0/0/508538 16k jumbo clusters in use (current/cache/total/max)
> 312599K/67011K/379611K bytes allocated to network
> (current/cache/total)
>
> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> 0/0/0 sfbufs in use (current/peak/max)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines
>
> Lastly, please note this link:
>
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033660.html
>
Hmm, this mentioned the ethernet header being in the TSO segment. I think
I already mentioned my TCP/IP is rusty and I know diddly about TSO.
However, at a glance it does appear the driver uses ether_output() for
TSO segments and, as such, I think an ethernet header is prepended to the
TSO segment. (This makes sense, since how else would the hardware know
what ethernet header to use for the TCP segments generated.)

I think prepending the ethernet header could push the total length
over 64K, given a default if_hw_tsomax == IP_MAXPACKET. And over 64K
isn't going to fit in 32 * 2K (mclbytes) clusters, etc and so forth.

Anyhow, I think the attached patch will reduce if_hw_tsomax, so that
the result should fit in 32 clusters and avoid EFBIG for this case,
so it might be worth a try?
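
(The actual change is in the attached patch rather than inlined here; its
core is essentially a one-liner in the driver's interface setup, roughly
like the snippet below, although the exact placement in ixgbe.c may differ.)

if ((adapter->num_segs * MCLBYTES - ETHER_HDR_LEN) < IP_MAXPACKET) {
        /* Cap the TSO size to what 32 mbuf clusters can hold once the
         * ethernet header has been prepended. */
        ifp->if_hw_tsomax = adapter->num_segs * MCLBYTES - ETHER_HDR_LEN;
}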
(I still can't think of why the CSUM_TSO bit isn't set for the printf()
case, but it seems TSO segments could generate EFBIG errors.)

Maybe worth a try, rick
ixgbe.patch

Christopher Forgeron

unread,
Mar 21, 2014, 7:55:55 PM3/21/14
to
Thanks Rick, trying it now. I'm currently working with the 9.2 ixgbe code
as a starting point, as I'm curious/encouraged by the lack of jumbo cluster
denials in netmap.

I'll let you know how it works out.

Christopher Forgeron

unread,
Mar 21, 2014, 8:15:04 PM3/21/14
to
Well I can tell you that your if statement in that patch is being
activated. I added a printf in there, and it showed up in my dmesg.

I still see bad things for netstat -m , but I'm starting a load run to see
if it makes a difference.

Next compile I'll add printouts of what ifp->if_hw_tsomax is once it's been
set, and I'll print it out in the trouble spot (ixgbe_xmit) as well.

Christopher Forgeron

unread,
Mar 21, 2014, 9:25:18 PM3/21/14
to
It may be a little early, but I think that's it!

It's been running without error for nearly an hour - It's very rare it
would go this long under this much load.

I'm going to let it run longer, then abort and install the kernel with the
extra printfs so I can see what value ifp->if_hw_tsomax is before you set
it.

It still had netstat -m denied entries on boot, but they are not climbing
like they did before:


$ uptime
9:32PM up 25 mins, 4 users, load averages: 2.43, 6.15, 4.65
$ netstat -m
21556/7034/28590 mbufs in use (current/cache/total)
4080/3076/7156/6127254 mbuf clusters in use (current/cache/total/max)
4080/2281 mbuf+clusters out of packet secondary zone in use (current/cache)
0/53/53/3063627 4k (page size) jumbo clusters in use
(current/cache/total/max)
16444/118/16562/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
161545K/9184K/170729K bytes allocated to network (current/cache/total)
17972/2230/4111 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
35/8909/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

- Started off bad with the 9k denials, but it's not going up!

uptime
10:20PM up 1:13, 6 users, load averages: 2.10, 3.15, 3.67
root@SAN0:/usr/home/aatech # netstat -m
21569/7141/28710 mbufs in use (current/cache/total)
4080/3308/7388/6127254 mbuf clusters in use (current/cache/total/max)
4080/2281 mbuf+clusters out of packet secondary zone in use (current/cache)
0/53/53/3063627 4k (page size) jumbo clusters in use
(current/cache/total/max)
16447/121/16568/907741 9k jumbo clusters in use (current/cache/total/max)
0/0/0/510604 16k jumbo clusters in use (current/cache/total/max)
161575K/9702K/171277K bytes allocated to network (current/cache/total)
17972/2261/4111 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
35/8913/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

This is the 9.2 ixgbe that I'm patching into 10.0; I'll move to the base
10.0 code tomorrow.

Rick Macklem

unread,
Mar 21, 2014, 10:39:36 PM3/21/14
to
Christopher Forgeron wrote:
> It may be a little early, but I think that's it!
>
> It's been running without error for nearly an hour - It's very rare
> it
> would go this long under this much load.
>
> I'm going to let it run longer, then abort and install the kernel
> with the
> extra printfs so I can see what value ifp->if_hw_tsomax is before you
> set
> it.
>
I think you'll just find it set to 0. Code in if_attach_internal()
{ in sys/net/if.c } sets it to IP_MAXPACKET (which is 65535) if it
is 0. In other words, if the if_attach routine in the driver doesn't
set it, this code sets it to the maximum possible value.

Here's the snippet:
/* Initialize to max value. */
657 if (ifp->if_hw_tsomax == 0)
658 ifp->if_hw_tsomax = IP_MAXPACKET;

Anyhow, this sounds like progress.

As far as NFS is concerned, I'd rather set it to a smaller value
(maybe 56K) so that m_defrag() doesn't need to be called, but I
suspect others wouldn't like this.

Hopefully Jack can decide if this patch is ok?

Thanks yet again for doing this testing, rick
ps: I've attached it again, so Jack (and anyone else who reads this)
can look at it.
pss: Please report if it keeps working for you.
ixgbe.patch

Christopher Forgeron

unread,
Mar 21, 2014, 10:56:06 PM3/21/14
to
No errors for 1h 46m - That's a record. This is using the 9.2-STABLE ixgbe
in a 10.0-RELEASE system, with Rick's suggested code below.

I decided this must be it, so I aborted, and modified the ixgbe driver from
10.0-STABLE with Rick's suggestion. Installed and rebooted. Here are the
extra values I print out:

if ((adapter->num_segs * MCLBYTES - ETHER_HDR_LEN) < IP_MAXPACKET) {
        printf("CF - Ricks Test! ifp->if_hw_tsomax = %d\n",
            ifp->if_hw_tsomax);
        ifp->if_hw_tsomax = adapter->num_segs * MCLBYTES - ETHER_HDR_LEN;
        printf("CF - After Init, ifp->if_hw_tsomax = %d\n",
            ifp->if_hw_tsomax);
        printf("CF - adapter->num_segs=%d, ETHER_HDR_LEN=%d, IP_MAXPACKET=%d\n",
            adapter->num_segs, ETHER_HDR_LEN, IP_MAXPACKET);
}

Which shows me:

ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - stable-2.5.15> port 0xfcc0-0xfcdf mem 0xd9000000-0xd93fffff,0xd9bf8000-0xd9bfbfff irq 45 at device 0.0 on pci5
Mar 21 23:00:08 SAN0 kernel: ix0: Using MSIX interrupts with 9 vectors
Mar 21 23:00:08 SAN0 kernel: CF - Ricks Test! ifp->if_hw_tsomax = 0
Mar 21 23:00:08 SAN0 kernel: CF - After Init, ifp->if_hw_tsomax = 65522
Mar 21 23:00:08 SAN0 kernel: CF - adapter->num_segs=32, ETHER_HDR_LEN=14, IP_MAXPACKET=65535
ix0: Ethernet address: 00:1b:21:d6:4c:4c


I don't see where the TSO max is being set in any other place. I see
IXGBE_TSO_SIZE = 262140 in ixgbe.h, and I suppose something similar is
happening in ixgbe_tso_setup, setting it to that 262140 default.

However: This 10.0-STABLE ixgbe has the error. I'm getting it at 25 min of
runtime. I don't have the full printf's in this one yet, so I can't tell
you more about it.


I'm going back to the 9.2-STABLE ixgbe with the above tso modification for
a bit longer to confirm that I can run overnight without the error.



On Fri, Mar 21, 2014 at 10:25 PM, Christopher Forgeron <csfor...@gmail.com
> wrote:

> It may be a little early, but I think that's it!
>
> It's been running without error for nearly an hour - It's very rare it
> would go this long under this much load.
>
> I'm going to let it run longer, then abort and install the kernel with the
> extra printfs so I can see what value ifp->if_hw_tsomax is before you set
> it.
>

Christopher Forgeron

unread,
Mar 21, 2014, 11:14:07 PM3/21/14
to
Ah yes, I see it now: Line #658

#if defined(INET) || defined(INET6)
        /* Initialize to max value. */
        if (ifp->if_hw_tsomax == 0)
                ifp->if_hw_tsomax = IP_MAXPACKET;
        KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
            ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
            ("%s: tsomax outside of range", __func__));
#endif


Should this be the location where it's being set rather than in ixgbe? I
would assume that other drivers could fall prey to this issue.

Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax to make
sure VLANs fit?

Perhaps there is something in the newer network code that is filling up the
frames to the point where they are full - thus a TSO = IP_MAXPACKET is just
now causing problems.

I'm back on the 9.2-STABLE ixgbe with the tso patch for now. I'll make it
run overnight while copying a few TB of data to make sure it's stable there
before investigating the 10.0-STABLE driver more.

..and there is still the case of the denied jumbo clusters on boot -
something else is off someplace.

BTW - In all of this, I did not mention that my ix0 uses an MTU of 9000 - I
assume others assumed this.



On Fri, Mar 21, 2014 at 11:39 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
> > It may be a little early, but I think that's it!
> >
> > It's been running without error for nearly an hour - It's very rare
> > it
> > would go this long under this much load.
> >
> > I'm going to let it run longer, then abort and install the kernel
> > with the
> > extra printfs so I can see what value ifp->if_hw_tsomax is before you
> > set
> > it.
> >
> I think you'll just find it set to 0. Code in if_attach_internal()
> { in sys/net/if.c } sets it to IP_MAXPACKET (which is 65535) if it
> is 0. In other words, if the if_attach routine in the driver doesn't
> set it, this code sets it to the maximum possible value.
>
> Here's the snippet:
> /* Initialize to max value. */
> 657 if (ifp->if_hw_tsomax == 0)
> 658 ifp->if_hw_tsomax = IP_MAXPACKET;
>
> Anyhow, this sounds like progress.
>
> As far as NFS is concerned, I'd rather set it to a smaller value
> (maybe 56K) so that m_defrag() doesn't need to be called, but I
> suspect others wouldn't like this.
>
> Hopefully Jack can decide if this patch is ok?
>
> Thanks yet again for doing this testing, rick
> ps: I've attached it again, so Jack (and anyone else who reads this)
> can look at it.
> pss: Please report if it keeps working for you.
>

Christopher Forgeron

unread,
Mar 22, 2014, 7:55:59 AM3/22/14
to
Status Update: Hopeful, but not done.

So the 9.2-STABLE ixgbe with Rick's TSO patch has been running all night
while iometer hammered away at it. It's got over 8 hours of test time on
it.

It's still running, the CPU queues are not clogged, and everything is
functional.

However, my ping_logger.py did record 23 incidents of "sendto: File too
large" over the 8 hour run.

That's really nothing compared to what I usually run into - Normally I'd
have 23 incidents within a 5 minute span.

During those 23 incidents (ping_logger.py triggers a cpuset ping), I see
it's having the same symptoms of clogging on a few CPU cores. That clogging
does go away, a symptom that Markus says he sometimes experiences.

So I would say the TSO patch makes things remarkably better, but something
else is still up. Unfortunately, with the TSO patch in place it's now
harder to trigger the error, so testing will be more difficult.

Could someone confirm for me where the jumbo clusters denied/mbuf denied
counters come from for netstat? Would it be from an m_defrag call that fails?

I feel the netstat -m stats on boot are part of this issue - I was able to
greatly reduce them during one of my test iterations. I'm going to see if I
can repeat that with the TSO patch.

Getting this working on the 10-STABLE ixgbe:

Mike's contributed some edits (in a slightly different thread) that I want to
try on that driver. At the same time, a diff of 9.2 <-> 10.0 may give hints,
as the 10.0 driver with the TSO patch has issues quickly, and frequently...
it's doing something that aggravates this condition.


Thanks for all the help, please keep the suggestions or tidbits of info
coming.

Rick Macklem

unread,
Mar 22, 2014, 5:18:14 PM3/22/14
to
Christopher Forgeron wrote:
> Status Update: Hopeful, but not done.
>
> So the 9.2-STABLE ixgbe with Rick's TSO patch has been running all
> night
> while iometer hammered away at it. It's got over 8 hours of test time
> on
> it.
>
> It's still running, the CPU queues are not clogged, and everything is
> functional.
>
> However, my ping_logger.py did record 23 incidents of "sendto: File
> too
> large" over the 8 hour run.
>
Well, you could try making if_hw_tsomax somewhat smaller. (I can't see
how the packet including ethernet header would be more than 64K with the
patch, but?? For example, the ether_output() code can call ng_output()
and I have no idea if that might grow the data size of the packet?)

To be honest, the optimum for NFS would be setting if_hw_tsomax == 56K,
since that would avoid the overhead of the m_defrag() calls. However,
it is suboptimal for other TCP transfers.

One other thing you could do (if you still have them) is scan the logs
for the code with my previous printf() patch and see if there is ever
a size > 65549 in it. If there is, then if_hw_tsomax needs to be smaller
by at least that size - 65549. (65535 + 14 == 65549)

If I were you, I'd try setting it to 57344 (56K) instead of
"num_segs * MCLBYTES - ETHER_HDR_LEN"
ie. replace
ifp->if_hw_tsomax = adapter->num_segs * MCLBYTES - ETHER_HDR_LEN;
with
ifp->if_hw_tsomax = 57344;
in the patch.

Then see if all the errors go away. (Jack probably won't like making it
that small, but it will show if decreasing it a bit will completely
fix the problem.)

> That's really nothing compared to what I usually run into - Normally
> I'd
> have 23 incidents within a 5 minute span.
>
> During those 23 incidents, (ping_logger.py triggers a cpuset ping) I
> see
> it's having the same symptoms of clogging on a few CPU cores. That
> clogging
> does go away, a symptom that Markus says he sometimes experiences.
>
> So I would say the TSO patch makes things remarkably better, but
> something
> else is still up. Unfortunately, with the TSO patch in place it's now
> harder to trigger the error, so testing will be more difficult.
>
> Could someone confirm for me where the jumbo clusters denied/mbuf
> denied
> counters come from for netstat? Would it be from a m_defrag call that
> fails?
>
I'm not familiar enough with the mbuf/uma allocators to "confirm" it,
but I believe the "denied" refers to cases where m_getjcl() fails to get
a jumbo mbuf and returns NULL.

If this were to happen in m_defrag(), it would return NULL and the ix
driver returns ENOBUFS, so this is not the case for EFBIG errors.

I don't know if increasing the limits for the jumbo mbufs via sysctl
will help. If you are using the code without Jack's patch, which uses
9K mbufs, then I think it can fragment the address space and result
in no 9K contiguous areas to allocate from. (I'm just going by what
Garrett and others have said about this.)

Rick Macklem

unread,
Mar 22, 2014, 5:41:23 PM3/22/14
to
Christopher Forgeron wrote:
>
>
>
>
>
>
> Ah yes, I see it now: Line #658
>
> #if defined(INET) || defined(INET6)
> /* Initialize to max value. */
> if (ifp->if_hw_tsomax == 0)
> ifp->if_hw_tsomax = IP_MAXPACKET;
> KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
> ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
> ("%s: tsomax outside of range", __func__));
> #endif
>
>
> Should this be the location where it's being set rather than in
> ixgbe? I would assume that other drivers could fall prey to this
> issue.
>
All of this should be prepended with "I'm an NFS guy, not a networking
guy, so I might be wrong".

Other drivers (and ixgbe for the 82598 chip) can handle a packet that
is in more than 32 mbufs. (I think the 82598 handles 100, grep for SCATTER
in *.h in sys/dev/ixgbe.)

Now, since several drivers do have this 32 mbufs limit, I can see an argument
for making the default a little smaller to make these work, since the
driver can override the default. (About now someone usually jumps in and says
something along the lines of "You can't do that until all the drivers that
can handle IP_MAXPACKET are fixed to set if_hw_tsomax" and since I can't fix
drivers I can't test, that pretty much puts a stop on it.)

You see the problem isn't that IP_MAXPACKET is too big, but that the hardware
has a limit of 32 non-contiguous chunks (mbufs)/packet and 32 * MCLBYTES = 64K.
(Hardware/network drivers that can handle 35 or more chunks (they like to call
them transmit segments, although ixgbe uses the term scatter) shouldn't have
any problems.)

I have an untested patch that adds a tsomaxseg count to use along with tsomax
bytes so that a driver could inform tcp_output() it can only handle 32 mbufs
and then tcp_output() would limit a TSO segment using both, but I can't test
it, so who knows when/if that might happen.

I also have a patch that modifies NFS to use pagesize clusters (reducing the
mbuf count in the list), but that one causes grief when testing on an i386
(seems to run out of kernel memory to the point where it can't allocate something
called "boundary tags" and pretty well wedges the machine at that point.)
Since I don't know how to fix this (I thought of making the patch "amd64 only"),
I can't really commit this to head, either.

As such, I think it's going to be "fix the drivers one at a time" and tell
folks to "disable TSO or limit rsize,wsize to 32K" when they run into trouble.
(As you might have guessed, I'd rather just be "the NFS guy", but since NFS
"triggers the problem" I'm kinda stuck with it;-)

> Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax to
> make sure VLANs fit?
>
No idea. (I wouldn't know a VLAN if it jumped up and tried to
bite me on the nose.;-) So, I have no idea what does this, but
if it means the total ethernet header size can be > 14bytes, then I'd agree.

> Perhaps there is something in the newer network code that is filling
> up the frames to the point where they are full - thus a TSO =
> IP_MAXPACKET is just now causing problems.
>
Yea, I have no idea why this didn't bite running 9.1. (Did 9.1 have
TSO enabled by default?)

> I'm back on the 9.2-STABLE ixgbe with the tso patch for now. I'll
> make it run overnight while copying a few TB of data to make sure
> it's stable there before investigating the 10.0-STABLE driver more.
>
I have no idea what needs to be changed to back-port a 10.0 driver to
9.2.

Good luck with it and thanks for what you've learned so far, rick

Rick Macklem

unread,
Mar 22, 2014, 10:58:11 PM3/22/14
to
Christopher Forgeron wrote:
>
>
>
>
>
>
> Ah yes, I see it now: Line #658
>
> #if defined(INET) || defined(INET6)
> /* Initialize to max value. */
> if (ifp->if_hw_tsomax == 0)
> ifp->if_hw_tsomax = IP_MAXPACKET;
> KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
> ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
> ("%s: tsomax outside of range", __func__));
> #endif
>
>
> Should this be the location where it's being set rather than in
> ixgbe? I would assume that other drivers could fall prey to this
> issue.
>
> Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax to
> make sure VLANs fit?
>
I took a look and, yes, this does seem to be needed. It will only be
needed for the case where a vlan is in use and hwtagging is disabled,
if I read the code correctly.

Do you use vlans?

I've attached an updated patch.

It might be nice to have the printf() patch in the driver too, so
we can see how big the ones that are too big are?

Good luck with it, rick

> Perhaps there is something in the newer network code that is filling
> up the frames to the point where they are full - thus a TSO =
> IP_MAXPACKET is just now causing problems.
>
> I'm back on the 9.2-STABLE ixgbe with the tso patch for now. I'll
> make it run overnight while copying a few TB of data to make sure
> it's stable there before investigating the 10.0-STABLE driver more.
>
ixgbe.patch

Christopher Forgeron

unread,
Mar 23, 2014, 10:54:54 AM3/23/14
to
Hi Rick, very helpful as always.


On Sat, Mar 22, 2014 at 6:18 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
>
> Well, you could try making if_hw_tsomax somewhat smaller. (I can't see
> how the packet including ethernet header would be more than 64K with the
> patch, but?? For example, the ether_output() code can call ng_output()
> and I have no idea if that might grow the data size of the packet?)
>

That's what I was thinking - I was going to drop it down to 32k, which is
extreme, but I wanted to see if it cured it or not. Something would have to
be very broken to be adding nearly 32k to a packet.


> To be honest, the optimum for NFS would be setting if_hw_tsomax == 56K,
> since that would avoid the overhead of the m_defrag() calls. However,
> it is suboptimal for other TCP transfers.
>

I'm very interested in NFS performance, so this is interesting to me - do
you have the time to educate me on this? I was going to spend this week
hacking out the NFS server cache, as I feel ZFS does a better job, and my
cache stats are always terrible, as is to be expected when I have such a wide
data usage on these SANs.

>
> One other thing you could do (if you still have them) is scan the logs
> for the code with my previous printf() patch and see if there is ever
> a size > 65549 in it. If there is, then if_hw_tsomax needs to be smaller
> by at least that size - 65549. (65535 + 14 == 65549)
>

There were some 65548's for sure. Interestingly enough, the amount that it
ruptures by seems to be increasing slowly. I should possibly let it rupture
and run for a long time to see if there is a steadily increasing pattern...
perhaps something is accidentally incrementing the packet by say 4 bytes in
a heavily loaded error condition.

>

> I'm not familiar enough with the mbuf/uma allocators to "confirm" it,
> but I believe the "denied" refers to cases where m_getjcl() fails to get
> a jumbo mbuf and returns NULL.
>
> If this were to happen in m_defrag(), it would return NULL and the ix
> driver returns ENOBUFS, so this is not the case for EFBIG errors.
>
BTW, the loop that your original printf code is in, just before the 'retry:'
goto label - that's an error loop, and it looks to me like all/most packets
traverse it at some point?


> I don't know if increasing the limits for the jumbo mbufs via sysctl
> will help. If you are using the code without Jack's patch, which uses
> 9K mbufs, then I think it can fragment the address space and result
> in no 9K contiguous areas to allocate from. (I'm just going by what
> Garrett and others have said about this.)
>
>
I never seem to be running out of mbufs - 4k or 9k. Unless it's possible
for a starvation to occur without incrementing the counters. Additionally,
netstat -m is recording denied mbufs on boot, so on a 96 Gig system that is
just starting up, I don't think I am.. but a large increase in the buffers
is on my list of desperation things to try.

Thanks for the hint on m_getjcl().. I'll dig around and see if I can find
what's happening there. I guess it's time for me to learn basic dtrace as
well. :-)

Christopher Forgeron

unread,
Mar 23, 2014, 11:19:25 AM3/23/14
to
On Sat, Mar 22, 2014 at 6:41 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
> > #if defined(INET) || defined(INET6)
> > /* Initialize to max value. */
> > if (ifp->if_hw_tsomax == 0)
> > ifp->if_hw_tsomax = IP_MAXPACKET;
> > KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
> > ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
> > ("%s: tsomax outside of range", __func__));
> > #endif
> >
> >
> > Should this be the location where it's being set rather than in
> > ixgbe? I would assume that other drivers could fall prey to this
> > issue.
> >
> All of this should be prepended with "I'm an NFS guy, not a networking
> guy, so I might be wrong".
>
> Other drivers (and ixgbe for the 82598 chip) can handle a packet that
> is in more than 32 mbufs. (I think the 82598 handles 100, grep for SCATTER
> in *.h in sys/dev/ixgbe.)
>

[...]

Yes, I agree we have to be careful about the limitations of other drivers,
but I'm thinking setting TSO to IP_MAXPACKET is a bad idea, unless all of
the header subtractions are happening elsewhere. Then again, perhaps every
other driver (and possibly ixgbe... I need to look more) does a maxtso -
various_headers to set a limit for data packets.

I'm not familiar with the FreeBSD network conventions/styles - I'm just
asking questions, something I have a bad habit of, but I'm in charge of
code stability issues at my work so it's hard to stop.

>
> Now, since several drivers do have this 32 mbufs limit, I can see an
> argument
> for making the default a little smaller to make these work, since the
> driver can override the default. (About now someone usually jumps in and
> says
> something along the lines of "You can't do that until all the drivers that
> can handle IP_MAXPACKET are fixed to set if_hw_tsomax" and since I can't
> fix
> drivers I can't test, that pretty much puts a stop on it.)
>
>
Testing is a problem, isn't it? I once again offer my stack of network cards
and systems for some sort of testing... I still have coax and token ring
around. :-)


> You see the problem isn't that IP_MAXPACKET is too big, but that the
> hardware
> has a limit of 32 non-contiguous chunks (mbufs)/packet and 32 * MCLBYTES =
> 64K.
> (Hardware/network drivers that can handle 35 or more chunks (they like to
> call
> them transmit segments, although ixgbe uses the term scatter) shouldn't
> have
> any problems.)
>
> I have an untested patch that adds a tsomaxseg count to use along with
> tsomax
> bytes so that a driver could inform tcp_output() it can only handle 32
> mbufs
> and then tcp_output() would limit a TSO segment using both, but I can't
> test
> it, so who knows when/if that might happen.
>
>
I think you give that to me in the next email - if not, please send.


> I also have a patch that modifies NFS to use pagesize clusters (reducing
> the
> mbuf count in the list), but that one causes grief when testing on an i386
> (seems to run out of kernel memory to the point where it can't allocate
> something
> called "boundary tags" and pretty well wedges the machine at that point.)
> Since I don't know how to fix this (I thought of making the patch "amd64
> only"),
> I can't really commit this to head, either.
>
>
Send me that one too. I love NFS patches.


> As such, I think it's going to be "fix the drivers one at a time" and tell
> folks to "disable TSO or limit rsize,wsize to 32K" when they run into
> trouble.
> (As you might have guessed, I'd rather just be "the NFS guy", but since NFS
> "triggers the problem" I\m kinda stuck with it;-)
>
I know in some circumstances disabling TSO can be a benefit, but in
general you'd want it on a modern system with a heavy data load.

> Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax to
> > make sure VLANs fit?
> >
> No idea. (I wouldn't know a VLAN if it jumped up and tried to
> bite me on the nose.;-) So, I have no idea what does this, but
> if it means the total ethernet header size can be > 14bytes, then I'd
> agree.
>
Yeah, you need another 4 bytes for the VLAN header if you're not using
hardware that strips it before the TCP stack gets it. I have a mix of
hardware and software VLANs running on our backbone, mostly due to a mixed
FreeBSD/OpenBSD/Windows environment.


> > Perhaps there is something in the newer network code that is filling
> > up the frames to the point where they are full - thus a TSO =
> > IP_MAXPACKET is just now causing problems.
> >
> Yea, I have no idea why this didn't bite running 9.1. (Did 9.1 have
> TSO enabled by default?)
>

I believe 9.0 has TSO on by default.. I seem to recall it always being
there, but I can't easily confirm it now. My last 9.0-STABLE doesn't have
an ixgbe card in it.

Christopher Forgeron

unread,
Mar 23, 2014, 11:25:59 AM3/23/14
to
On Sat, Mar 22, 2014 at 11:58 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
> >
>
> > Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax to
> > make sure VLANs fit?
> >
> I took a look and, yes, this does seem to be needed. It will only be
> needed for the case where a vlan is in use and hwtagging is disabled,
> if I read the code correctly.
>

Yes, or in the rare case where you configure your switch to pass the VLAN
header through to the NIC.

>
> Do you use vlans?
>

(Answered in above email)


>
> I've attached an updated patch.
>
> It might be nice to have the printf() patch in the driver too, so
> we can see how big the ones that are too big are?
>
>
Yes, I'm going to leave those in until I know we have this fixed - I will
probably leave them in a while longer, as looping like that should only have
a minor performance impact, and I'd like to see what the story is a few
months down the road.

Thanks for the patches, will have to start giving them code-names so we can
keep them straight. :-) I guess we have printf, tsomax, and this one.

Christopher Forgeron

unread,
Mar 23, 2014, 4:28:02 PM3/23/14
to
Update:

For giggles, I set IP_MAXPACKET = 32768.

Over an hour of runtime, and no issues. This is better than with the TSO
patch and the 9.2 ixgbe, as that was just a drastic reduction in errors.

Still have an 'angry' netstat -m on boot, and I'm still incrementing
denied mbuf calls, so something else is wrong.

I'm going to modify Rick's printf in ixgbe to also output when we're over
32768. I'm sure it's still happening, but with an extra 32k of space,
we're not busting like we did before.


I notice a few interesting ip->ip_len changes since 9.2 - Like here, at
line 720

http://fxr.watson.org/fxr/diff/netinet/ip_output.c?v=FREEBSD10;im=kwqeqdhhvovqn;diffval=FREEBSD92;diffvar=v


Looks like the older code didn't byteswap with ntohs - I see that often in
tcp_output.c, and in ip_options.c.

I'm also curious about this, at line 524:
http://fxr.watson.org/fxr/diff/netinet/ip_options.c?v=FREEBSD10;diffval=FREEBSD92;diffvar=v

New 10 code:

ip->ip_len = htons(ntohs(ip->ip_len) + optlen);

Old 9.2 Code:

ip->ip_len += optlen;


I wonder if there are any unexpected consequences of these changes, or
perhaps a line someplace that doesn't make the change.

Is there a dtrace command I could use to watch these functions and compare
the new ip_len with ip->ip_len or other variables?

Rick Macklem

unread,
Mar 23, 2014, 7:57:09 PM3/23/14
to
Christopher Forgeron wrote:
>
>
>
>
>
>
> On Sat, Mar 22, 2014 at 6:41 PM, Rick Macklem < rmac...@uoguelph.ca
> > wrote:
Well, IP_MAXPACKET is simply the largest # that fits in the 16bit length
field of an IP header (65535). This limit is on the TSO segment (which
is really just a TCP/IP packet greater than the MTU) and does not include
a MAC level (ethernet) header.

Beyond that, it is the specific hardware that limits things, such as
this case, which is limited to 32 mbufs (which happens to imply 64K
total, including ethernet header using 2K mbuf clusters).
(The 64K limit is just a quirk caused by the 32mbuf limit and the fact
that mbuf clusters hold 2K of data each.)
> > Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax
> > to
> > make sure VLANs fit?
> >
> No idea. (I wouldn't know a VLAN if it jumped up and tried to
> bite me on the nose.;-) So, I have no idea what does this, but
> if it means the total ethernet header size can be > 14bytes, then I'd
> agree.
>
>
>
> Yeah, you need another 4 bytes for VLAN header if you're not using
> hardware that strips it before the TCP stack gets it. I have a mix
> of hardware and software VLANs running on our backbone, mostly due
> to a mixed FreeBSD/OpenBSD/Windows environment.
>
>
>
>
> > Perhaps there is something in the newer network code that is
> > filling
> > up the frames to the point where they are full - thus a TSO =
> > IP_MAXPACKET is just now causing problems.
> >
> Yea, I have no idea why this didn't bite running 9.1. (Did 9.1 have
> TSO enabled by default?)
>
>
>
> I believe 9.0 has TSO on by default.. I seem to recall it always
> being there, but I can't easily confirm it now. My last 9.0-STABLE
> doesn't have an ixgbe card in it.
>
>
>
>
Ok, I've attached 3 patches:
ixgbe.patch - A slightly updated version of the one that sets if_hw_tsomax,
which subtracts out the additional 4bytes for the VLAN header.
*** If you can test this, it would be nice to know if this gets rid of all
the EFBIG replies, since I think Jack might feel it is ok to commit if
it does do so.

4kmcl.patch - This one modifies NFS to use pagesize mbuf clusters for the
large RPC messages. It is NOT safe to use on a small i386,
but might be ok on a large amd64 box. On a small i386, using
a mix of 2K and 4K mbuf clusters seems to fragment kernel memory
enough that allocation of "boundary tags" (whatever those are?)
fail and this trainwrecks the system.
Using pagesize (4K) clusters reduces the mbuf count for an
IP_MAXPACKET sized TSO segment to 19, avoiding the 32 limit
and any need to call m_defrag() for NFS.
*** Only use on a test system, at your own risk.

tsomaxseg.patch - This one adds support for if_hw_tsomaxseg, which is a limit on
the # of mbufs in an output TSO segment (and defaults to 32).
*** This one HAS NOT BEEN TESTED and probably doesn't even work at this point.

rick


ixgbe.patch
4kmcl.patch
tsomaxseg.patch

Rick Macklem

unread,
Mar 23, 2014, 8:38:59 PM3/23/14
to
Christopher Forgeron wrote:
> Hi Rick, very helpful as always.
>
>
> On Sat, Mar 22, 2014 at 6:18 PM, Rick Macklem <rmac...@uoguelph.ca>
> wrote:
>
> > Christopher Forgeron wrote:
> >
> > Well, you could try making if_hw_tsomax somewhat smaller. (I can't
> > see
> > how the packet including ethernet header would be more than 64K
> > with the
> > patch, but?? For example, the ether_output() code can call
> > ng_output()
> > and I have no idea if that might grow the data size of the packet?)
> >
>
> That's what I was thinking - I was going to drop it down to 32k,
> which is
> extreme, but I wanted to see if it cured it or not. Something would
> have to
> be very broken to be adding nearly 32k to a packet.
>
>
> > To be honest, the optimum for NFS would be setting if_hw_tsomax ==
> > 56K,
> > since that would avoid the overhead of the m_defrag() calls.
> > However,
> > it is suboptimal for other TCP transfers.
> >
>
Ok, here is the critical code snippet from tcp_output():
/*
774 * Limit a burst to t_tsomax minus IP,
775 * TCP and options length to keep ip->ip_len
776 * from overflowing or exceeding the maximum
777 * length allowed by the network interface.
778 */
779 if (len > tp->t_tsomax - hdrlen) {
780 len = tp->t_tsomax - hdrlen;
781 sendalot = 1;
782 }
783
784 /*
785 * Prevent the last segment from being
786 * fractional unless the send sockbuf can
787 * be emptied.
788 */
789 if (sendalot && off + len < so->so_snd.sb_cc) {
790 len -= len % (tp->t_maxopd - optlen);
791 sendalot = 1;
792 }
The first "if" at #779 limits the len to if_hw_tsomax - hdrlen.
(tp->t_tsomax == if_hw_tsomax and hdrlen == size of TCP/IP header)
The second "if" at #789 reduces the len to an exact multiple of the output
MTU if it won't empty the send queue.

Here's how I think things work:
- For a full 64K of read/write data, NFS generates an mbuf list with
32 MCLBYTES clusters of data and two small header packets prepended
in front of them (one for the RPC header + one for the NFS args that
come before the data).
Total data length is a little over 64K (something like 65600bytes).
- When the above code processes this, it reduces the length to
if_hw_tsomax (65535 by default). { if at #779 }
- Second "if" at #789 reduces it further (63000 for a 9000byte MTU).
tcp_output() prepends an mbuf with the TCP/IP header in it, resulting
in a total data length somewhat less than 64K, and passes this to the
ixgbe.c driver.
- The ixgbe.c driver prepends an ethernet header (14 or maybe 18bytes in
length) by calling ether_output() and then hands it (a little less than
64K bytes of data in 35mbufs) to ixgbe_xmit().
ixgbe_xmit() calls bus_dmamap_load_mbuf_sg() which fails, returning
EFBIG, because the list has more than 32 mbufs in it.
- then it calls m_defrag(), which copies the slightly less than 64K
of data to a list of 32 mbuf clusters.
- bus_dmamap_load_mbuf_sg() is called again and succeeds this time
because the list is only 32 mbufs long.
(The call to m_defrag() adds some overhead and does have the potential
to fail if mbuf clusters are exhausted, so this works, but isn't ideal.)

The problem case happens when the size of the I/O is a little less than
the full 64K (hit EOF for read, or a smaller-than-64K dirty region in a
buffer cache block for write).
- Now, for example, the total data length for the mbuf chain (including
RPC, NFS and TCP/IP headers) could be 65534 (slightly less than 64K).
The first "if" doesn't change the "len", since it is less than if_hw_tsomax.
The second "if" doesn't change the "len" if there is no additional data in
the send queue.
--> Now the ixgbe driver prepends an ethernet header, increasing the total
data length to 65548 (a little over 64K).
- First call to bus_dmamap_load_mbuf_sg() fails with EFBIG because the
mbuf list has more than 32 entries.
- calls m_defrag(), which copies the data to a list of 33 mbuf clusters.
(> 64K requires 33 * 2K clusters)
- Second call to bus_dmamap_load_mbuf_sg() fails again with EFBIG, because
the list has 33 mbufs in it.
--> Returns EFBIG and throws away the TSO segment without sending it.
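
To make the cluster arithmetic above concrete, here is a small stand-alone
sketch (illustrative only, not driver code; the 32-segment limit and the
example TCP/IP lengths are the ones discussed in this thread):

/*
 * Sketch of the arithmetic above: after m_defrag() the whole frame sits
 * in 2K (MCLBYTES) clusters, and the 82599 can DMA at most 32 transmit
 * segments per TSO send.
 */
#include <stdio.h>

#define MCLBYTES        2048
#define TX_SEG_LIMIT    32      /* 82599 transmit segment limit */
#define ETHER_HDR_LEN   14

static int
clusters_after_defrag(int framelen)
{
        /* ceiling division: clusters needed to hold framelen bytes */
        return ((framelen + MCLBYTES - 1) / MCLBYTES);
}

int
main(void)
{
        int tcpip_len[] = { 63000, 65518, 65534 };
        unsigned int i;

        for (i = 0; i < sizeof(tcpip_len) / sizeof(tcpip_len[0]); i++) {
                int frame = tcpip_len[i] + ETHER_HDR_LEN;
                int ncl = clusters_after_defrag(frame);

                printf("TCP/IP len %5d -> frame %5d -> %2d clusters: %s\n",
                    tcpip_len[i], frame, ncl,
                    ncl <= TX_SEG_LIMIT ? "ok" : "EFBIG even after m_defrag()");
        }
        return (0);
}

The 65534-byte case is exactly the one described above: once the ethernet
header is prepended, the frame still needs 33 clusters after m_defrag(), so
the second bus_dmamap_load_mbuf_sg() fails as well.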

For NFS, the ideal would be to not only never fail with EFBIG, but to not
have the overhead of calling m_defrag().
- One way is to use pagesize (4K) clusters, so that the mbuf list only has
19 entries.
- Another way is to teach tcp_output() to limit the mbuf list to 32 mbufs
as well as 65535 bytes in length.
- Yet another is to make if_hw_tsomax small enough that the mbuf list
doesn't exceed 32 mbufs. (56K would do this for NFS, but is suboptimal
for other traffic.)

I am hoping that the only reason you still saw a few EFBIGs with the
patch that reduced if_hw_tsomax by ETHER_HDR_LEN was that some had
the additional 4bytes for a vlan header. If that is the case, the
slightly modified patch should make all EFBIG error returns go away.

> I'm very interested in NFS performance, so this is interesting to me
> - Do
> you have the time to educate me on this? I was going to spend this
> week
> hacking out the NFS server cache, as I feel ZFS does a better job,
> and my
> cache stats are always terrible, as to be expected when I have such a
> wide
> data usage on these sans.
>
The DRC (duplicate request cache) is not for performance, it is for
correctness. It comes from the fact that Sun RPC is "at least once"
and can retry non-idempotent operations (ones that modify the file
system) resulting in a corrupted file system on the server.

The risk is lower when using TCP vs UDP, but is still non-zero.

If you had a "perfect network fabric that never lost packets", the
hit rate of the DRC is 0 and could safely be disabled. However, each
hit implies a case where file system corruption has been avoided, so
I think most environments want it, despite the overhead.
--> The better your network environment, the lower the hit rate.
(Which means you want to see "terrible" cache stats.;-)
--> It is always some amount of overhead for the sake of correctness
and never improves performance (well technically it does avoid
redoing file system ops when there is a hit, but the performance
effect is miniscule and not relevant).

This was all fixed by NFSv4.1, which uses a mechanism called Sessions
to provide "exactly once" RPC semantics. As such, the NFSv4.1 server
(still in a projects branch on svn) doesn't use the DRC.

> >
> > One other thing you could do (if you still have them) is scan the
> > logs
> > for the code with my previous printf() patch and see if there is
> > ever
> > a size > 65549 in it. If there is, then if_hw_tsomax needs to be
> > smaller
> > by at least that size - 65549. (65535 + 14 == 65549)
> >
>
> There were some 65548's for sure. Interestingly enough, the amount
> that it
> ruptures by seems to be increasing slowly. I should possibly let it
> rupture
> and run for a long time to see if there is a steadily increasing
> pattern...
> perhaps something is accidentally incrementing the packet by say 4
> bytes in
> a heavily loaded error condition.
>
I doubt it, since people run other network interfaces that don't have
the 32mbuf (transmit segment) limitation without difficulties, as far
as I know.

> >
>
> > I'm not familiar enough with the mbuf/uma allocators to "confirm"
> > it,
> > but I believe the "denied" refers to cases where m_getjcl() fails
> > to get
> > a jumbo mbuf and returns NULL.
> >
> > If this were to happen in m_defrag(), it would return NULL and the
> > ix
> > driver returns ENOBUFS, so this is not the case for EFBIG errors.
> >
> > BTW, the loop that your original printf code is in, just before the
> > retry:
> goto label: That's an error loop, and it looks to me that all/most
> packets
> traverse it at some time?
>
It does a second try at calling bus_dmamap_load_mbuf_sg() after doing
the compaction copying of the mbuf list via m_defrag() and then returns
EFBIG if the second attempt fails.

>
> > I don't know if increasing the limits for the jumbo mbufs via
> > sysctl
> > will help. If you are using the code without Jack's patch, which
> > uses
> > 9K mbufs, then I think it can fragment the address space and result
> > in no 9K contiguous areas to allocate from. (I'm just going by what
> > Garrett and others have said about this.)
> >
> >
> I never seem to be running out of mbufs - 4k or 9k. Unless it's
> possible
> for a starvation to occur without incrementing the counters.
> Additionally,
> netstat -m is recording denied mbufs on boot, so on a 96 Gig system
> that is
> just starting up, I don't think I am.. but a large increase in the
> buffers
> is on my list of desperation things to try.
>
> Thanks for the hint on m_getjcl().. I'll dig around and see if I can
> find
> what's happening there. I guess it's time for me to learn basic
> dtrace as
> well. :-)

Rick Macklem

unread,
Mar 23, 2014, 8:47:06 PM3/23/14
to
Christopher Forgeron wrote:
>
>
>
>
>
>
>
>
> Update:
>
> For giggles, I set IP_MAXPACKET = 32768.
>
Well, I'm pretty sure you don't want to do that, except for an experiment.
You can just set if_hw_tsomax to whatever you want to try, at the place
my ixgbe.patch put it (just before the call to ether_ifattach()).
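
In code, the kind of change being discussed looks roughly like this in a
driver's attach path (a sketch only, not the actual ixgbe.patch; lladdr is
a placeholder for the device's MAC address):

        /*
         * Sketch of the pattern: set if_hw_tsomax on the ifnet before
         * ether_ifattach(), and tcp_mss()/tcp_maxmtu() will pick it up
         * as tp->t_tsomax for new connections.
         */
        ifp->if_hw_tsomax = IP_MAXPACKET -
            (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN); /* one of the values tried here */
        ether_ifattach(ifp, lladdr);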

> Over an hour of runtime, and no issues. This is better than with the
> TSO patch and the 9.2 ixgbe, as that was just a drastic reduction in
> errors.
>
So now the question becomes "how much does if_hw_tsomax need to be
reduced from 65535 to get this?". If reducing it by the additional
4bytes for a vlan header is sufficient, then I understand what is
going on. If it needs to be reduced by more than that, then there
is something going on that I still don't understand.

> Still have an 'angry' netstat -m on boot, and I'm still incrementing
> denied netbuf calls, so something else is wrong.
>
> I'm going to modify Rick's printf in ixgbe to also output when we're
> over 32768. I'm sure it's still happening, but with an extra 32k of
> space, we're not busting like we did before.
>
>
> I notice a few interesting ip->ip_len changes since 9.2 - Like here,
> at line 720
>
> http://fxr.watson.org/fxr/diff/netinet/ip_output.c?v=FREEBSD10;im=kwqeqdhhvovqn;diffval=FREEBSD92;diffvar=v
>
> Looks like older code didn't byteswap with ntohs - I see that often
> in tcp_output.c, and in tcp_options.c.
>
>
> I'm also curious about this: Line 524
> http://fxr.watson.org/fxr/diff/netinet/ip_options.c?v=FREEBSD10;diffval=FREEBSD92;diffvar=v
>
>
> New 10 code:
>
> ip->ip_len = htons(ntohs(ip->ip_len) + optlen);
>
> Old 9.2 Code:
>
> ip->ip_len += optlen;
>
Well, TSO segments aren't generated when optlen > 0, so I doubt this
matters for our issue (and I would find it hard to believe that this
would have been broken?). You can always look at the svn commit logs
to see why/how something was changed.

>
>
> I wonder if there are any unexpected consequences of these changes,
> or perhaps a line someplace that doesn't make the change.
>
> Is there a dtrace command I could use to watch these functions and
> compare the new ip_len with ip->ip_len or other variables?
>
>
>
>
>
>
>
> On Sun, Mar 23, 2014 at 12:25 PM, Christopher Forgeron <
> csfor...@gmail.com > wrote:
>
>
>
>
>
>
>
> On Sat, Mar 22, 2014 at 11:58 PM, Rick Macklem < rmac...@uoguelph.ca
> > wrote:
>
>
>
>
> Christopher Forgeron wrote:
> >
>
> > Also should we not also subtract ETHER_VLAN_ENCAP_LEN from tsomax
> > to
> > make sure VLANs fit?
> >
> I took a look and, yes, this does seem to be needed. It will only be
> needed for the case where a vlan is in use and hwtagging is disabled,
> if I read the code correctly.
>
>
>
> Yes, or in the rare case where you configure your switch to pass the
> VLAN header through to the NIC.
>
>
>
> Do you use vlans?
>
>
> (Answered in above email)
>
>
>
>
>
> I've attached an updated patch.
>
> It might be nice to have the printf() patch in the driver too, so
> we can see how big the ones that are too big are?
>
>
>
> Yes, I'm going to leave those in until I know we have this fixed..
> will probably leave it in a while longer as it should only have a
> minor performance impact to iter-loop like that, and I'd like to see
> what the story is a few months down the road.
>
>
> Thanks for the patches, will have to start giving them code-names so
> we can keep them straight. :-) I guess we have printf, tsomax, and
> this one.
>
>

Christopher Forgeron

unread,
Mar 24, 2014, 1:14:06 AM3/24/14
to
Hi,

I'll follow up more tomorrow, as it's late and I don't have time for
detail.

The basic TSO patch didn't work, as packets were still going over
65535 by a fair amount. I thought I wrote that earlier, but I am dumping a
lot of info into a few threads, so I apologize if I'm not as concise as I
could be.

However, setting IP_MAXPACKET did. 4 hours of continuous run-time, no
issues. No lost pings, no issues. Of course this isn't a fix - but it helps
isolate the problem.

I used IP_MAXPACKET = 32k originally, and I'm currently on 65495 bytes now
(40 bytes shorter than IP_MAXPACKET). Of course, it's still sometimes
going over IP_MAXPACKET, but since there is plenty of space to the real
65535 limit, we're not having any bad effects.

Looking at my logs, I currently see the debug printf's showing packets of
65506 (11 bytes larger than IP_MAXPACKET). That's the real error. I should
dump these packets to see what's in them.. perhaps wireshark or tcpdump is
in order.

One item of note is that setting the if's TSO doesn't matter in all output
cases. I'll quote TSO code tomorrow that is still set to IP_MAXPACKET, thus
setting the if's TSO isn't effective. From looking deeper at the transmit
code, I think I understand that we shouldn't be setting TSO to a value that
has headers subtracted, as the output routine should cover all angles,
including TCP Options which can add size.

I also booted up a 9.2-RELEASE system to make sure it was clean, and it
is. No netstat -m denied stats. I didn't have time to put it under heavy
load.

The src diff between 10.0 and 9.2 holds the key to this problem.

The largest changes I see are the use of htons / ntohs, and I have to
wonder if there isn't a problem in there someplace. 9.2 -> 10 seems to
have a lot of changes with how we calculate packet length and react to it,
and this problem is one of packet length.

Either we're not storing packet length properly, or we're not detecting it
and fragmenting the packet properly.

Lastly, I really should compile a debug kernel of 10-STABLE and check the
asserts and the like. I will resume tomorrow as I feel we're maddeningly
close. It's been ages since I had to do this much TCP, but I'm starting to
get the hang of it again.


Will resume tomorrow...

Christopher Forgeron

unread,
Mar 24, 2014, 10:56:55 AM3/24/14
to
I'm going to split this into different posts to focus on each topic. This
is about setting IP_MAXPACKET to 65495

Update on Last Night's Run:

(Last night's run is a kernel with IP_MAXPACKET = 65495)

- Uptime on this run: 10:53AM up 13:21, 5 users, load averages: 1.98,
2.09, 2.13
- Ping logger records no ping errors for the entire run.
- At Mar 24th 10:57 I did a grep through the night's log for 'before'
(which is the printf logging that Rick suggested a few days ago), and saved
it to before_total.txt
- With wc -l on before_total.txt I can see that we have 504 lines, thus 504
incidents of the packet being above IP_MAXPACKET during this run.
- I did tr -c '[:alnum:]' '[\n*]' < before_total.txt | sort | uniq -c |
sort -nr | head -50 to list the most common words. Ignoring the non-pklen
output. The relevant output is:

344 65498 (3)
330 65506 (11)
330 65502 (7)

- First # being the # of times. (Each pklen is printed twice on the log,
thus 2x the total line count).
- Last (#) being the byte overrun from 65495
- A fairly even distribution of each type of packet overrun.

You will recall that my IP_MAXPACKET is 65495, so each of these packet
lengths represents an overshoot.

The fact that we have only 3 different types of overrun is good - It
suggests a non-random event, more like a broken 'if' statement for a
particular case.

If IP_MAXPACKET was set to 65535 as it normally is, I would have had 504
incidents of errors, with a chance that any one of them could have blocked
the queue for considerable time.

Question: Should there be logic that discards packets that are over
IP_MAXPACKET to ensure that we don't end up in a blocked queue situation
again?


Moving forward, I am doing two things:

1) I'm running a longer test with TSO disabled on my ix0 adapter. I want
to make sure that over say 4 hours I don't have even 1 packet over 65495.
This will at least locate the issue to TSO related code.

2) I have tcpdump running, to see if I can capture the packets over 65495.
Here is my command. Any suggestions on additional switches I should include?

tcpdump -ennvvXS greater 65495

I'll report in on this again once I have new info.

Thanks for reading.

On Mon, Mar 24, 2014 at 2:14 AM, Christopher Forgeron
<csfor...@gmail.com>wrote:

> Hi,
>
> I'll follow up more tomorrow, as it's late and I don't have time for
> detail.
>
> The basic TSO patch didn't work, as packets were still going over
> 65535 by a fair amount. I thought I wrote that earlier, but I am dumping a
> lot of info into a few threads, so I apologize if I'm not as concise as I
> could be.
>
> However, setting IP_MAXPACKET did. 4 hours of continuous run-time, no
> issues. No lost pings, no issues. Of course this isn't a fix - but it helps
> isolate the problem.

Christopher Forgeron

unread,
Mar 24, 2014, 11:21:26 AM3/24/14
to
This is regarding the TSO patch that Rick suggested earlier. (With many
thanks for his time and suggestion)

As I mentioned earlier, it did not fix the issue on a 10.0 system. It did
make it less of a problem on 9.2, but either way, I think it's not needed,
and shouldn't be considered as a patch for testing/etc.

Patching TSO to anything other than a max value (and by default the code
gives it IP_MAXPACKET) is confusing the matter, as the packet length
ultimately needs to be adjusted for many things on the fly like TCP
Options, etc. Using static header sizes won't be a good idea.

Additionally, it seems that setting nic TSO will/may be ignored by code
like this in sys/netinet/tcp_output.c:

10.0 Code:

780 if (len > tp->t_tsomax - hdrlen) {     !!
781         len = tp->t_tsomax - hdrlen;   !!
782         sendalot = 1;
783 }


I've put debugging here, set the nic's max TSO as per Rick's patch (set to
say 32k), and have seen that tp->t_tsomax == IP_MAXPACKET. It's being set
someplace else, and thus our attempts to set TSO on the nic may be in vain.

It may have mattered more in 9.2, as I see the code doesn't use
tp->t_tsomax in some locations, and may actually default to what the nic is
set to.

The NIC may still win, I didn't walk through the code to confirm, it was
enough to suggest to me that setting TSO wouldn't fix this issue.

However, this is still a TSO related issue, it's just not one related to
the setting of TSO's max size.

A 10.0-STABLE system with tso disabled on ix0 doesn't have a single packet
over IP_MAXPACKET in 1 hour of runtime. I'll let it go a bit longer to
increase confidence in this assertion, but I don't want to waste time on
this when I could be logging problem packets on a system with TSO enabled.

Comments are very welcome..

Markus Gebert

unread,
Mar 24, 2014, 12:14:08 PM3/24/14
to
I just applied Rick’s ixgbe TSO patch and additionally wanted to be able to easily change the value of hw_tsomax, so I made a sysctl out of it.
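
Roughly, such a hookup could look like this from the driver attach path (a
sketch of the idea only, not the exact change; note that a runtime change
only affects connections established afterwards, since tcp_mss() copies the
value into tp->t_tsomax at connection setup):

        /* Sketch: export ifp->if_hw_tsomax as a per-device read/write oid. */
        SYSCTL_ADD_UINT(device_get_sysctl_ctx(dev),
            SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO,
            "hw_tsomax", CTLFLAG_RW, &ifp->if_hw_tsomax, 0,
            "Maximum TSO packet size handed to this interface");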

While doing that, I asked myself the same question: where and how will this value actually be used, and how come tcp_output() uses that other value in struct tcpcb?

The only place tcpcb->t_tsomax gets set, that I have found so far, is in tcp_input.c’s tcp_mss() function. Some subfunctions get called:

tcp_mss() -> tcp_mss_update() -> tcp_maxmtu()

Then tcp_maxmtu() indeed uses the interface’s hw_tsomax value:

1746 cap->tsomax = ifp->if_hw_tsomax;

It gets passed back to tcp_mss(), where it is set at the connection level and will be used in tcp_output() later on.

tcp_mss() gets called from multiple places, I’ll look into that later. I will let you know if I find out more.


Markus

Christopher Forgeron

unread,
Mar 24, 2014, 12:23:21 PM3/24/14
to
I think making hw_tsomax a sysctl would be a good patch to commit - It
could enable easy debugging/performance testing for the masses.

I'm curious to hear how your environment is working with a tso turned off
on your nics.

My testbed just hit the 2 hour mark. With TSO off, I don't get a single
packet over IP_MAXPACKET. That puts my confidence at around 95% in the
statement 'turning off tso negates this issue for me'.

I'm now rebooting into a +tso env to see if I can capture the bad packets.

I am also sure that the netstat -m mbuf denied is a completely separate
issue. I'm going around the lab and powering up different boxes with
10.0-RELEASE, and they all have mbuf/mbuf clusters denied on boot, and that
number increases with network traffic. It's probably not helping the
IP_MAXPACKET issue.



I'll create a separate thread for that one shortly.


On Mon, Mar 24, 2014 at 1:14 PM, Markus Gebert
<markus...@hostpoint.ch>wrote:

>

Markus Gebert

unread,
Mar 24, 2014, 12:36:03 PM3/24/14
to

On 24.03.2014, at 17:23, Christopher Forgeron <csfor...@gmail.com> wrote:

> I think making hw_tsomax a sysctl would be a good patch to commit - It
> could enable easy debugging/performance testing for the masses.
>
> I'm curious to hear how your environment is working with a tso turned off
> on your nics.

This will take some more time. Only one of the affected systems is running the test kernel with printfs, the additional sysctl, and Rick’s patch. I want to be able to reproduce the problem with that patch before changing another variable (like turning TSO off), but that can take days on one server. I’ll probably be able to equip some more servers with that kernel soon, and might run a subgroup without TSO. But first I have to make sure the new kernel doesn’t add any new problems; we can’t afford them on production servers.


> My testbed just hit the 2 hour mark. With TSO off, I don't get a single
> packet over IP_MAXPACKET. That puts my confidence at around 95% in the
> statement 'turning off tso negates this issue for me'.
>
> I'm now rebooting into a +tso env to see if I can capture the bad packets.
>
> I am also sure that the netstat -m mbuf denied is a completely separate
> issue. I'm going around the lab and powering up different boxes with
> 10.0-RELEASE, and they all have mbuf/mbuf clusters denied on boot, and that
> number increases with network traffic. It's probably not helping the
> IP_MAXPACKET issue.

While we have most symptoms in common, I’ve still not seen any allocation error in netstat -m. So I tend to agree that this is most probably a different problem.


Markus

Julian Elischer

unread,
Mar 24, 2014, 12:54:06 PM3/24/14
to
On 3/23/14, 4:57 PM, Rick Macklem wrote:
> Christopher Forgeron wrote:
>>
>>
>>
>>
>>
>> On Sat, Mar 22, 2014 at 6:41 PM, Rick Macklem < rmac...@uoguelph.ca
>>> wrote:
>>
>>
>> Christopher Forgeron wrote:
>>> #if defined(INET) || defined(INET6)
>>> /* Initialize to max value. */
>>> if (ifp->if_hw_tsomax == 0)
>>> ifp->if_hw_tsomax = IP_MAXPACKET;
>>> KASSERT(ifp->if_hw_tsomax <= IP_MAXPACKET &&
>>> ifp->if_hw_tsomax >= IP_MAXPACKET / 8,
>>> ("%s: tsomax outside of range", __func__));
>>> #endif
>>>
>>>
>>> Should this be the location where it's being set rather than in
>>> ixgbe? I would assume that other drivers could fall prey to this
>>> issue.
>>>
>> All of this should be prepended with "I'm an NFS guy, not a
>> networking
>> guy, so I might be wrong".
>>
>> Other drivers (and ixgbe for the 82598 chip) can handle a packet that
>> is in more than 32 mbufs. (I think the 82598 handles 100, grep for
>> SCATTER
>> in *.h in sys/dev/ixgbe.)
>>

The Xen backend cannot handle more than 32 segments in some versions
of Xen.

Christopher Forgeron

unread,
Mar 24, 2014, 5:52:44 PM3/24/14
to
Well, a few more hours of running, and it's fairly easy to catch the
packets with tcpdump, but not as easy to see if there is a pattern to them
or what is different about them from the other packets that do pass with
normal sizes.

I'm using:

tcpdump -ennvvvSuxx -i ix0 -s 64 greater 65495

here's some output.

18:41:41.311025 00:1b:21:d6:4c:4c > 00:50:56:7d:b8:ff, ethertype IPv4
(0x0800), length 65502: (tos 0x0, ttl 64, id 37273, offset 0, flags [DF],
proto TCP (6), length 65488, bad cksum 0 (->50ee)!)
172.16.0.30.2049 > 172.16.0.97.947: Flags [P.], seq
3009729118:3009794554, ack 3477042952, win 28478, options [nop,nop,TS[|tcp]>
0x0000: 0050 567d b8ff 001b 21d6 4c4c 0800 4500

18:42:11.284028 00:1b:21:d6:4c:4c > 00:50:56:7d:b8:ff, ethertype IPv4
(0x0800), length 65502: (tos 0x0, ttl 64, id 52388, offset 0, flags [DF],
proto TCP (6), length 65488, bad cksum 0 (->15e3)!)
172.16.0.30.2049 > 172.16.0.97.947: Flags [.], seq
1533469358:1533534794, ack 478673276, win 29127, options [nop,nop,TS[|tcp]>
0x0000: 0050 567d b8ff 001b 21d6 4c4c 0800 4500

18:42:31.385082 00:1b:21:d6:4c:4c > 00:50:56:7d:b8:ff, ethertype IPv4
(0x0800), length 65498: (tos 0x0, ttl 64, id 25808, offset 0, flags [DF],
proto TCP (6), length 65484, bad cksum 0 (->7dbb)!)
172.16.0.30.2049 > 172.16.0.97.947: Flags [P.], seq
3658906462:3658971894, ack 1460462120, win 29127, options [nop,nop,TS[|tcp]>
0x0000: 0050 567d b8ff 001b 21d6 4c4c 0800 4500

18:42:45.200094 00:1b:21:d6:4c:4c > 00:50:56:7d:b8:ff, ethertype IPv4
(0x0800), length 65502: (tos 0x0, ttl 64, id 43985, offset 0, flags [DF],
proto TCP (6), length 65488, bad cksum 0 (->36b6)!)
172.16.0.30.2049 > 172.16.0.97.947: Flags [P.], seq
805280454:805345890, ack 2122788052, win 29127, options [nop,nop,TS[|tcp]>
0x0000: 0050 567d b8ff 001b 21d6 4c4c 0800 4500

18:43:16.601738 00:1b:21:d6:4c:4c > 00:50:56:7d:b8:ff, ethertype IPv4
(0x0800), length 65502: (tos 0x0, ttl 64, id 5657, offset 0, flags [DF],
proto TCP (6), length 65488, bad cksum 0 (->cc6e)!)
172.16.0.30.2049 > 172.16.0.97.947: Flags [.], seq
3978046962:3978112398, ack 3596907688, win 29127, options [nop,nop,TS[|tcp]>
0x0000: 0050 567d b8ff 001b 21d6 4c4c 0800 4500

18:43:37.345685 00:1b:21:d6:4c:4c > 00:50:56:7d:b8:ff, ethertype IPv4
(0x0800), length 65506: (tos 0x0, ttl 64, id 41062, offset 0, flags [DF],
proto TCP (6), length 65492, bad cksum 0 (->421d)!)
172.16.0.30.2049 > 172.16.0.97.947: Flags [P.], seq
1419570518:1419635958, ack 104148460, win 29127, options [nop,nop,TS[|tcp]>
0x0000: 0050 567d b8ff 001b 21d6 4c4c 0800 4500

18:45:50.266944 00:1b:21:d6:4c:4c > 00:50:56:7d:b8:ff, ethertype IPv4
(0x0800), length 65506: (tos 0x0, ttl 64, id 5853, offset 0, flags [DF],
proto TCP (6), length 65492, bad cksum 0 (->cba6)!)
172.16.0.30.2049 > 172.16.0.97.947: Flags [P.], seq
2161102562:2161168002, ack 2086338240, win 29127, options [nop,nop,TS[|tcp]>

With the IP_MAXPACKET = 65495, I've had zero problems with networking.


On Mon, Mar 24, 2014 at 1:23 PM, Christopher Forgeron
<csfor...@gmail.com>wrote:

> I think making hw_tsomax a sysctl would be a good patch to commit - It
> could enable easy debugging/performance testing for the masses.
>
> I'm curious to hear how your environment is working with a tso turned off
> on your nics.
>
> My testbed just hit the 2 hour mark. With TSO off, I don't get a single
> packet over IP_MAXPACKET. That puts my confidence at around 95% in the
> statement 'turning off tso negates this issue for me'.
>
> I'm now rebooting into a +tso env to see if I can capture the bad packets.
>
> I am also sure that the netstat -m mbuf denied is a completely separate
> issue. I'm going around the lab and powering up different boxes with
> 10.0-RELEASE, and they all have mbuf/mbuf clusters denied on boot, and that
> number increases with network traffic. It's probably not helping the
> IP_MAXPACKET issue.
>
>
>

Rick Macklem

unread,
Mar 24, 2014, 6:29:05 PM3/24/14
to
Oops, poorly worded. I should have said "Some other drivers...". Yes,
there are several (I once did a find/grep, but didn't keep the output)
that have this 32 limit.

Also, I have no idea if the limit can easily be increased to 35 for them?
(Bryan was able to do that for the virtio network driver.)

rick
ps: If it was just "ix" I wouldn't care as much about this.

Rick Macklem

unread,
Mar 24, 2014, 6:47:13 PM3/24/14
to
Christopher Forgeron wrote:
> I'm going to split this into different posts to focus on each topic.
> This
> is about setting IP_MAXPACKET to 65495
>
> Update on Last Night's Run:
>
> (Last night's run is a kernel with IP_MAXPACKET = 65495)
>
> - Uptime on this run: 10:53AM up 13:21, 5 users, load averages:
> 1.98,
> 2.09, 2.13
> - Ping logger records no ping errors for the entire run.
> - At Mar 24th 10:57 I did a grep through the night's log for 'before'
> (which is the printf logging that Rick suggested a few days ago), and
> saved
> it to before_total.txt
> - With wc -l on before_total.txt I can see that we have 504 lines,
> thus 504
> incidents of the packet being above IP_MAXPACKET during this run.
> - I did tr -c '[:alnum:]' '[\n*]' < before_total.txt | sort | uniq -c
> |
> sort -nr | head -50 to list the most common words. Ignoring the
> non-pklen
> output. The relevant output is:
>
> 344 65498 (3)
> 330 65506 (11)
> 330 65502 (7)
>
This makes sense to me, since tp->t_tsomax is used in tcp_output() for
the TCP/IP packet, which does not include the link level (ethernet)
header. When that is added, I would expect the length to be up to 14
(or maybe 18 for vlan cases) greater than IP_MAXPACKET. Since none of
these are greater than 65509, this looks fine to me.

So, unless you get ones greater than (65495 + 18 = 65513), this makes
sense and does not indicate a problem.

In another post, you indicate that having the driver set if_hw_tsomax
didn't set tp->t_tsomax to the same value.
--> I believe that is a bug and would mean my ixgbe.patch would not
fix the problem, because it is tp->t_tsomax that must be decreased
to at least (65536 - 18 = 65518).
--> Now, have you tried a case between 65495 and 65518 and seen
any EFBIG errors?
If so, then I don't understand why 65518 isn't small enough?

rick

> - First # being the # of times. (Each pklen is printed twice on the
> log,
> thus 2x the total line count).
> - Last (#) being the byte overrun from 65495
> - A fairly even distribution of each type of packet overrun.
>
> You will recall that my IP_MAXPACKET is 65495, so each of these
> packet
> lengths represents a overshoot.
>
> The fact that we have only 3 different types of overrun is good - It
> suggests a non-random event, more like a broken 'if' statement for a
> particular case.
>
I think it just means that your load happens to do only 3 sizes of I/O
that are a little less than 65536.

> If IP_MAXPACKET was set to 65535 as it normally is, I would have had
> 504
> incidents of errors, with a chance that any one of them could have
> blocked
> the queue for considerable time.
>
If tp->t_tsomax hasn't been set to a smaller value than 65535, the
ixgbe.patch didn't do what I thought it would.
> > issues. No lost pings, no issues. Of course this isn't a fix - but
> > it helps
> > isolate the problem.
> > > what the story is a few months down the road.
> > >
> > >
> > > Thanks for the patches, will have to start giving them code-names
> > > so
> > > we can keep them straight. :-) I guess we have printf, tsomax,
> > > and
> > > this one.
> > >
> > >
> >
> >

Rick Macklem

unread,
Mar 24, 2014, 9:04:18 PM3/24/14
to
> It get’s passed back to tcp_mss() where it is set on the connection

> level which will be used in tcp_output() later on.
>
> tcp_mss() gets called from multiple places, I’ll look into that
> later. I will let you know if I find out more.
>
>
> Markus
>
Well, if tp->t_tsomax isn't set to a value of 65518, then the ixgbe.patch
isn't doing what I thought it would.

The only explanation I can think of for this is that there might be
another net interface driver stacked on top of the ixgbe.c one and
that the setting doesn't get propagated up.
Does this make any sense?

IP_MAXPACKET can't be changed from 65535, but I can see an argument
for setting the default value of if_hw_tsomax to a smaller value.
For example, in sys/net/if.c change it from:
657 if (ifp->if_hw_tsomax == 0)
658         ifp->if_hw_tsomax = IP_MAXPACKET;

to

657 if (ifp->if_hw_tsomax == 0)
658         ifp->if_hw_tsomax = 65536 - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);

This is a slightly smaller default which won't have much impact unless
the hardware device can only handle 32 mbuf clusters for transmit of
a segment and there are several of those.

Christopher, can you do your test run with IP_MAXPACKET set to 65518,
which should be the same as the above. If that gets rid of all the
EFBIG error replies, then I think the above patch will have the same
effect.

Thanks, rick

>
> > However, this is still a TSO related issue, it's just not one
> > related to
> > the setting of TSO's max size.
> >
> > A 10.0-STABLE system with tso disabled on ix0 doesn't have a single
> > packet
> > over IP_MAXPACKET in 1 hour of runtime. I'll let it go a bit longer
> > to
> > increase confidence in this assertion, but I don't want to waste
> > time on
> > this when I could be logging problem packets on a system with TSO
> > enabled.
> >
> > Comments are very welcome..

Rick Macklem

unread,
Mar 24, 2014, 9:18:08 PM3/24/14
to
Christopher Forgeron wrote:
>
>
>
> This is regarding the TSO patch that Rick suggested earlier. (With
> many thanks for his time and suggestion)
>
>
> As I mentioned earlier, it did not fix the issue on a 10.0 system. It
> did make it less of a problem on 9.2, but either way, I think it's
> not needed, and shouldn't be considered as a patch for testing/etc.
>
>
> Patching TSO to anything other than a max value (and by default the
> code gives it IP_MAXPACKET) is confusing the matter, as the packet
> length ultimately needs to be adjusted for many things on the fly
> like TCP Options, etc. Using static header sizes won't be a good
> idea.
>
If you look at tcp_output(), you'll notice that it doesn't do TSO if
there are any options. That way it knows that the TCP/IP header is
just hdrlen.

If you don't limit the TSO packet (including TCP/IP and ethernet headers)
to 64K, then the "ix" driver can't send them, which is the problem
you guys are seeing.

There are other ways to fix this problem, but they all may introduce
issues that reducing if_hw_tsomax by a small amount does not.
For example, m_defrag() could be modified to use 4K pagesize clusters,
but this might introduce memory fragmentation problems. (I observed
what I think are memory fragmentation problems when I switched NFS
to use 4K pagesize clusters for large I/O messages.)

If setting IP_MAXPACKET to 65518 fixes the problem (no more EFBIG
error replies), then that is the size that if_hw_tsomax can be set
to (just can't change IP_MAXPACKET, but that is defined for other
things). (It just happens that IP_MAXPACKET is what if_hw_tsomax
defaults to. It has no other effect w.r.t. TSO.)

>
> Additionally, it seems that setting nic TSO will/may be ignored by
> code like this in sys/netinet/tcp_output.c:
>
Yes, but I don't know why.
The only conjecture I can come up with is that another net driver is
stacked above "ix" and the setting for if_hw_tsomax doesn't propagate
up. (If you look at the commit log message for r251296, the intent
of adding if_hw_tsomax was to allow device drivers to set a smaller
tsomax than IP_MAXPACKET.)

Are you using any of the "stacked" network device drivers like
lagg? I don't even know what the others all are?
Maybe someone else can list them?

rick
>
> 10.0 Code:
>
> 780 if (len > tp->t_tsomax - hdrlen) { !!
> 781 len = tp->t_tsomax - hdrlen; !!
> 782 sendalot = 1;
> 783 }
>
>
>
>
> I've put debugging here, set the nic's max TSO as per Rick's patch (
> set to say 32k), and have seen that tp->t_tsomax == IP_MAXPACKET.
> It's being set someplace else, and thus our attempts to set TSO on
> the nic may be in vain.
>
>
> It may have mattered more in 9.2, as I see the code doesn't use
> tp->t_tsomax in some locations, and may actually default to what the
> nic is set to.
>
> The NIC may still win, I didn't walk through the code to confirm, it
> was enough to suggest to me that setting TSO wouldn't fix this
> issue.
>
>

Rick Macklem

unread,
Mar 24, 2014, 10:00:57 PM3/24/14
to
Julian Elischer wrote:
----- Original Message -----
> I wrote (and snipped):
>> Other drivers (and ixgbe for the 82598 chip) can handle a packet that
>> is in more than 32 mbufs. (I think the 82598 handles 100, grep for
>> SCATTER
>> in *.h in sys/dev/ixgbe.)
>>
>
> the Xen backend can not handle mor ethan 32 segments in some versions
> of Xen.
Btw, I just did a quick find/grep (so I may have missed some), but here
is the list of net devices that appear to support TSO, but limited to
32 transmit segments for at least some supported chips:

jme, fxp, age, sge, msk, als, ale, ixgbe/ix, nfe, e1000/em, re

Also, several of these call m_collapse() instead of m_defrag() when
they run into a transmit mbuf list with > 32 elements.
m_collapse() isn't likely to squeeze the 35-mbuf 64Kbyte NFS I/O
message into 32 mbufs, so I don't think these ones will work
at all for NFS with default 64K I/O size and TSO enabled.

rick

Markus Gebert

unread,
Mar 25, 2014, 8:16:14 AM3/25/14
to
Is this confirmed or still an ‘it seems’? Have you actually seen a tp->t_tsomax value in tcp_output() bigger than if_hw_tsomax, or was this just speculation because the values are stored in different places? (Sorry if you already stated this in another email, it’s currently hard to keep track of all the information.)

Anyway, this dtrace one-liner should be a good test if other values appear in tp->t_tsomax:

# dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 && args[0]->t_tsomax != 65518 / { printf("unexpected tp->t_tsomax: %i\n", args[0]->t_tsomax); stack(); }'

Remember to adjust the value in the condition to whatever you’re currently expecting. The value seems to be 0 for new connections, probably when tcp_mss() has not been called yet. So that seems normal, and I have excluded that case too. This will also print a kernel stack trace in case it sees an unexpected value.


> Yes, but I don't know why.
> The only conjecture I can come up with is that another net driver is
> stacked above "ix" and the setting for if_hw_tsomax doesn't propagate
> up. (If you look at the commit log message for r251296, the intent
> of adding if_hw_tsomax was to allow device drivers to set a smaller
> tsomax than IP_MAXPACKET.)
>
> Are you using any of the "stacked" network device drivers like
> lagg? I don't even know what the others all are?
> Maybe someone else can list them?

I guess the most obvious are lagg and vlan (and probably carp on FreeBSD 9.x or older).

On request from Jack, we’ve eliminated lagg and vlan from the picture, which gives us plain ixgbe interfaces with no stacked interfaces on top of it. And we can still reproduce the problem.


Markus


>
> rick
>>
>> 10.0 Code:
>>
>> 780 if (len > tp->t_tsomax - hdrlen) { !!
>> 781 len = tp->t_tsomax - hdrlen; !!
>> 782 sendalot = 1;
>> 783 }
>>
>>
>>
>>
>> I've put debugging here, set the nic's max TSO as per Rick's patch (
>> set to say 32k), and have seen that tp->t_tsomax == IP_MAXPACKET.
>> It's being set someplace else, and thus our attempts to set TSO on
>> the nic may be in vain.
>>
>>
>> It may have mattered more in 9.2, as I see the code doesn't use
>> tp->t_tsomax in some locations, and may actually default to what the
>> nic is set to.
>>
>> The NIC may still win, I didn't walk through the code to confirm, it
>> was enough to suggest to me that setting TSO wouldn't fix this
>> issue.
>>
>>
>> However, this is still a TSO related issue, it's just not one related
>> to the setting of TSO's max size.
>>
>> A 10.0-STABLE system with tso disabled on ix0 doesn't have a single
>> packet over IP_MAXPACKET in 1 hour of runtime. I'll let it go a bit
>> longer to increase confidence in this assertion, but I don't want to
>> waste time on this when I could be logging problem packets on a
>> system with TSO enabled.
>>
>>
>> Comments are very welcome..
>>
>>
>>

Johan Kooijman

unread,
Mar 25, 2014, 8:21:43 AM3/25/14
to
Hey guys,

I have nothing on your code level to add, but.. while investigating this
issue I ran into the guy that originally created the bug (
http://www.freebsd.org/cgi/query-pr.cgi?pr=183390&cat=). In the email
exchange that followed he told me that had found a workaround by running a
specific -STABLE revision:

"Yes, we found a workaround.
We upgraded to the -STABLE branch of the 9.2, so we use this currently:
[root@storagex ~]# uname -a
FreeBSD storagex.lan.granaglia.com 9.2-STABLE FreeBSD 9.2-STABLE #0
r257712: Tue Nov 5 23:02:49 CET 2013
ro...@storagex.lan.granaglia.com:/usr/obj/usr/src/sys/GENERIC
amd64"

Maybe this could help you in your quest to hunt this bug down.


On Tue, Mar 25, 2014 at 1:16 PM, Markus Gebert
<markus...@hostpoint.ch>wrote:

>
--
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E ma...@johankooijman.com

Christopher Forgeron

unread,
Mar 25, 2014, 10:16:56 AM3/25/14
to
Hi guys,

I'm in meetings today, so I'll respond to the other emails later.

Just wanted to clarify about tp->t_tsomax: I can't make a solid assertion
about its value as I only tracked it briefly. I did see it being !=
if_hw_tsomax, but that was a short test and should really be checked more
carefully. For now we should treat it as possible, but not confirmed.

However, setting if_hw_tsomax as low as 32k did not fix the problem for
me. So either setting TSO is not the fix, or not everything is paying
attention to if_hw_tsomax. It has to be one or the other.

Setting IP_MAXPACKET does fix it for me, but of course that's not a solid
fix.


On Tue, Mar 25, 2014 at 9:16 AM, Markus Gebert

Christopher Forgeron

unread,
Mar 25, 2014, 3:25:00 PM3/25/14
to
I'm quite positive that IP_MAXPACKET = 65518 would fix this, as I've
never seen a packet overshoot by more than 11 bytes, although that's just
in my case. It's next up on my test list.

BTW, to answer the next message: I am experiencing the error with a raw ix
or lagg interface. Originally I was on lagg, but have dropped down to a
single ix for testing.


Thanks for your continued help.


On Mon, Mar 24, 2014 at 10:04 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Markus Gebert wrote:
> >
> > On 24.03.2014, at 16:21, Christopher Forgeron <csfor...@gmail.com>
> > wrote:
> >
> > > This is regarding the TSO patch that Rick suggested earlier. (With
> > > many
> > > thanks for his time and suggestion)
> > >
> > > As I mentioned earlier, it did not fix the issue on a 10.0 system.
> > > It did
> > > make it less of a problem on 9.2, but either way, I think it's not
> > > needed,
> > > and shouldn't be considered as a patch for testing/etc.
> > >
> > > Patching TSO to anything other than a max value (and by default the
> > > code
> > > gives it IP_MAXPACKET) is confusing the matter, as the packet
> > > length
> > > ultimately needs to be adjusted for many things on the fly like TCP
> > > Options, etc. Using static header sizes won't be a good idea.
> > >
> > > Additionally, it seems that setting nic TSO will/may be ignored by
> > > code
> > > like this in sys/netinet/tcp_output.c:
> > >
> > > 10.0 Code:
> > >
> > > 780 if (len > tp->t_tsomax - hdrlen) {     !!
> > > 781         len = tp->t_tsomax - hdrlen;   !!
> > > 782         sendalot = 1;
> > > 783 }
> > >
> > >
> > > I've put debugging here, set the nic's max TSO as per Rick's patch
> > > ( set to
> > > say 32k), and have seen that tp->t_tsomax == IP_MAXPACKET. It's
> > > being set
> > > someplace else, and thus our attempts to set TSO on the nic may be
> > > in vain.
> > >
> > > It may have mattered more in 9.2, as I see the code doesn't use
> > > tp->t_tsomax in some locations, and may actually default to what
> > > the nic is
> > > set to.
> > >
> > > The NIC may still win, I didn't walk through the code to confirm,
> > > it was
> > > enough to suggest to me that setting TSO wouldn't fix this issue.
> >
> >

Rick Macklem

unread,
Mar 25, 2014, 5:46:00 PM3/25/14
to
Markus Gebert wrote:
>
> On 25.03.2014, at 02:18, Rick Macklem <rmac...@uoguelph.ca> wrote:
>
> > Christopher Forgeron wrote:
> >>
> >>
> >>
> >> This is regarding the TSO patch that Rick suggested earlier. (With
> >> many thanks for his time and suggestion)
> >>
> >>
> >> As I mentioned earlier, it did not fix the issue on a 10.0 system.
> >> It
> >> did make it less of a problem on 9.2, but either way, I think it's
> >> not needed, and shouldn't be considered as a patch for
> >> testing/etc.
> >>
> >>
> >> Patching TSO to anything other than a max value (and by default
> >> the
> >> code gives it IP_MAXPACKET) is confusing the matter, as the packet
> >> length ultimately needs to be adjusted for many things on the fly
> >> like TCP Options, etc. Using static header sizes won't be a good
> >> idea.
> >>
> > If you look at tcp_output(), you'll notice that it doesn't do TSO
> > if
> > there are any options. That way it knows that the TCP/IP header is
> > just hdrlen.
> >
> > If you don't limit the TSO packet (including TCP/IP and ethernet
> > headers)
> > to 64K, then the "ix" driver can't send them, which is the problem
> > you guys are seeing.
> >
> > There are other ways to fix this problem, but they all may
> > introduce
> > issues that reducing if_hw_tsomax by a small amount does not.
> > For example, m_defrag() could be modified to use 4K pagesize
> > clusters,
> > but this might introduce memory fragmentation problems. (I observed
> > what I think are memory fragmentation problems when I switched NFS
> > to use 4K pagesize clusters for large I/O messages.)
> >
> > If setting IP_MAXPACKET to 65518 fixes the problem (no more EFBIG
> > error replies), then that is the size that if_hw_tsomax can be set
> > to (just can't change IP_MAXPACKET, but that is defined for other
> > things). (It just happens that IP_MAXPACKET is what if_hw_tsomax
> > defaults to. It has no other effect w.r.t. TSO.)
> >
> >>
> >> Additionally, it seems that setting nic TSO will/may be ignored by
> >> code like this in sys/netinet/tcp_output.c:
> >>
>
> Is this confirmed or still a ‘it seems’? Have you actually seen a
> tp->t_tsomax value in tcp_output() bigger than if_hw_tsomax or was
> this just speculation because the values are stored in different
> places? (Sorry, if you already stated this in another email, it’s
> currently hard to keep track of all the information.)
>
> Anyway, this dtrace one-liner should be a good test if other values
> appear in tp->t_tsomax:
>
> # dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
> args[0]->t_tsomax != 65518 / { printf("unexpected tp->t_tsomax:
> %i\n", args[0]->t_tsomax); stack(); }'
>
> Remember to adjust the value in the condition to whatever you’re
> currently expecting. The value seems to be 0 for new connections,
> probably when tcp_mss() has not been called yet. So that’s seems
> normal and I have excluded that case too. This will also print a

> kernel stack trace in case it sees an unexpected value.
>
>
> > Yes, but I don't know why.
> > The only conjecture I can come up with is that another net driver
> > is
> > stacked above "ix" and the setting for if_hw_tsomax doesn't
> > propagate
> > up. (If you look at the commit log message for r251296, the intent
> > of adding if_hw_tsomax was to allow device drivers to set a smaller
> > tsomax than IP_MAXPACKET.)
> >
> > Are you using any of the "stacked" network device drivers like
> > lagg? I don't even know what the others all are?
> > Maybe someone else can list them?
>
> I guess the most obvious are lagg and vlan (and probably carp on
> FreeBSD 9.x or older).
>
> On request from Jack, we’ve eliminated lagg and vlan from the
> picture, which gives us plain ixgbe interfaces with no stacked
> interfaces on top of it. And we can still reproduce the problem.
>
This was related to the "did if_hw_tsomax set tp->t_tsomax to the
same value?" question. Since you reported that my patch that set
if_hw_tsomax in the driver didn't fix the problem, that suggests
that tp->t_tsomax isn't being set to if_hw_tsomax from the driver,
but we don't know why?

rick

>
> Markus
>
>
> >
> > rick


> >>
> >> 10.0 Code:
> >>
> >> 780 if (len > tp->t_tsomax - hdrlen) { !!
> >> 781 len = tp->t_tsomax - hdrlen; !!
> >> 782 sendalot = 1;
> >> 783 }
> >>
> >>
> >>
> >>
> >> I've put debugging here, set the nic's max TSO as per Rick's patch
> >> (
> >> set to say 32k), and have seen that tp->t_tsomax == IP_MAXPACKET.
> >> It's being set someplace else, and thus our attempts to set TSO on
> >> the nic may be in vain.
> >>
> >>
> >> It may have mattered more in 9.2, as I see the code doesn't use
> >> tp->t_tsomax in some locations, and may actually default to what
> >> the
> >> nic is set to.
> >>
> >> The NIC may still win, I didn't walk through the code to confirm,
> >> it
> >> was enough to suggest to me that setting TSO wouldn't fix this
> >> issue.
> >>
> >>

Markus Gebert

unread,
Mar 25, 2014, 6:11:17 PM3/25/14
to
Jack asked us to remove lagg/vlans in the very beginning of this thread, and when we had done that, the problem was still there. So my answer was not related to your recent patch. I wanted to clarify that we have been testing with ixgbe only for quite some time and that stacked interfaces could not be a source of problems in our test scenario.

We just started testing your patch that sets if_hw_tsomax yesterday. So far I have it running on two systems along with some printfs and the dtrace one-liner that watches over tp->t_tsomax in tcp_output(). We haven’t had any problems with these two servers, and the dtrace probe never fired; so far it looks like tp->t_tsomax always gets set from if_hw_tsomax. But it’s too soon to draw a conclusion; it may take days to trigger the problem again. It might also be fixed by your patch.

I’m booting more systems with the test kernel and I will be watching all of them with dtrace to see if I find an occurrence where tp->t_tsomax is off. I hope that with more systems, I’ll have an answer more quickly.

But digging around the code, I still don’t see how tp->t_tsomax could not have been set from if_hw_tsomax when there are no stacked interfaces…


Markus

Rick Macklem

unread,
Mar 25, 2014, 6:21:50 PM3/25/14
to
Righto. Setting if_hw_tsomax in the driver is supposed to set tp->t_tsomax
and I could see it work in a trivial test (I hacked the code so the assignments
are done for the non-tso case and it worked for the non-tso "re" driver I run.)
{ As an aside, one of these assignments does happen for non-tso cases, since
although it is indented, there are no {} for the block. In tcp_subr.c if I
recall. However, doing the assignment for the non-tso case seems harmless to me. }

> I’m booting more systems with the test kernel and I will be watching
> all of them with dtrace to see I i find an occurence where
> tp->t_tsomax is off. I hope that with more systems, I’ll have an
> answer more quickly.
>
> But digging around the code, I still don’t see a way how tp->tsomax
> could not have been set from if_hw_tsomax when there are no stacked
> interfaces…
>

It seems to happen where you mentioned before. Since it only gets set
from cap.tsomax and that gets set from if_hw_tsomax, it would be 0
otherwise. Christopher sees a change when he changes IP_MAXPACKET, so
the default setting works, but for him setting it in the driver didn't,
for some reason?

Thanks for doing the testing, rick

Christopher Forgeron

unread,
Mar 25, 2014, 7:06:43 PM3/25/14
to
Update:

I'm changing my mind, and I believe Rick's TSO patch is fixing things
(sorry). In looking at my notes, it's possible I had lagg on for those
tests. lagg does seem to negate the TSO patch in my case.

kernel.10stable_basicTSO_65535/

- IP_MAXPACKET = 65535;
- manually forced (no if statement) ifp->if_hw_tsomax = IP_MAXPACKET -
(ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
- Verified on boot via printf that ifp->if_hw_tsomax = 65517
- Boot in a NON LAGG environment. ix0 only.

ixgbe's printf is showing packets up to 65530. Haven't run long enough yet
to see if anything will go over 65535

I have this tcpdump running to check packet size.
tcpdump -ennvvXS -i ix0 greater 65518

I do expect to get packets over 65518, but I was just curious to see if any
of them would go over 65535. Time will tell.

In a separate test, if I enable lagg, we have LOTS of oversized packet
problems. It looks like tsomax is definitely not making it through in
if_lagg.c - any recommendations there? I will eventually need lagg, as I'm
sure will others.

With dtrace, it's showing t_tsomax >= 65518. Shouldn't that not be
happening?


dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
args[0]->t_tsomax >= 65518 / { printf("unexpected tp->t_tsomax: %i\n",
args[0]->t_tsomax); stack(); }'


6 31403 tcp_output:entry unexpected tp->t_tsomax: 65535

kernel`tcp_do_segment+0x2c99
kernel`tcp_input+0x11a2
kernel`ip_input+0xa2
kernel`netisr_dispatch_src+0x5e
kernel`ether_demux+0x12a
kernel`ether_nh_input+0x35f
kernel`netisr_dispatch_src+0x5e
kernel`bce_intr+0x765
kernel`intr_event_execute_handlers+0xab
kernel`ithread_loop+0x96
kernel`fork_exit+0x9a
kernel`0xffffffff80c75b2e

3 31403 tcp_output:entry unexpected tp->t_tsomax: 65535

kernel`tcp_do_segment+0x2c99
kernel`tcp_input+0x11a2
kernel`ip_input+0xa2
kernel`netisr_dispatch_src+0x5e
kernel`ether_demux+0x12a
kernel`ether_nh_input+0x35f
kernel`netisr_dispatch_src+0x5e
kernel`bce_intr+0x765
kernel`intr_event_execute_handlers+0xab
kernel`ithread_loop+0x96
kernel`fork_exit+0x9a
kernel`0xffffffff80c75b2e

6 31403 tcp_output:entry unexpected tp->t_tsomax: 65535

kernel`tcp_do_segment+0x2c99
kernel`tcp_input+0x11a2
kernel`ip_input+0xa2
kernel`netisr_dispatch_src+0x5e
kernel`ether_demux+0x12a
kernel`ether_nh_input+0x35f
kernel`netisr_dispatch_src+0x5e
kernel`bce_intr+0x765
kernel`intr_event_execute_handlers+0xab
kernel`ithread_loop+0x96
kernel`fork_exit+0x9a
kernel`0xffffffff80c75b2e

1 31403 tcp_output:entry unexpected tp->t_tsomax: 65535

kernel`tcp_do_segment+0x2c99
kernel`tcp_input+0x11a2
kernel`ip_input+0xa2
kernel`netisr_dispatch_src+0x5e
kernel`ether_demux+0x12a
kernel`ether_nh_input+0x35f
kernel`netisr_dispatch_src+0x5e
kernel`bce_intr+0x765
kernel`intr_event_execute_handlers+0xab
kernel`ithread_loop+0x96
kernel`fork_exit+0x9a
kernel`0xffffffff80c75b2e

Markus Gebert

unread,
Mar 25, 2014, 7:07:18 PM3/25/14
to
>>>> # dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
>>>> args[0]->t_tsomax != 65518 / { printf("unexpected tp->t_tsomax:
>>>> %i\n", args[0]->t_tsomax); stack(); }'
>>>>
Sorry, my sentence was probably a bit misleading. What you’re saying is what I meant. There’s the tcp_mss() -> tcp_mss_update() -> tcp_maxmtu() call chain that ultimately sets tp->t_tsomax from if_hw_tsomax (via the cap struct). tp->t_tsomax is indeed 0 on fresh connections when tcp_mss() has not been called yet, I could confirm that with dtrace. As soon as the connection gets running, it’s set to whatever the interface’s if_hw_tsomax is.

What I have _not_ found is another place that alters tp->t_tsomax, so I really don’t get how Christopher can see different values for tp->t_tsomax.


> Christopher sees a change when he changes IP_MAXPACKET, so
> the default setting works, but for him setting it in the driver didn't,
> for some reason?

Christopher, can you run your tests again with the default IP_MAXPACKET and just Rick's if_hw_tsomax patch? I think it’s important to confirm that tp->t_tsomax is really off in that case, which you can easily test with my dtrace one-liner. I’m running this test too, but it will take me much more time until I can make a statement.

> Thanks for doing the testing, rick

No problem. Thank you guys!

Rick Macklem

unread,
Mar 25, 2014, 7:16:29 PM3/25/14
to
Christopher Forgeron wrote:
> Update:
>
> I'm changing my mind, and I believe Rick's TSO patch is fixing
> things
> (sorry). In looking at my notes, it's possible I had lagg on for
> those
> tests. lagg does seem to negate the TSO patch in my case.
>
Ok, that's useful information. It implies that r251296 doesn't quite
work and needs to be fixed for "stacked" network interface drivers
before it can be used. I've cc'd Andre who is the author of that
patch, in case he knows how to fix it.

Thanks for checking this, rick

> kernel.10stable_basicTSO_65535/
>
> - IP_MAXPACKET = 65535;
> - manually forced (no if statement) ifp->if_hw_tsomax = IP_MAXPACKET
> -
> (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
> - Verified on boot via printf that ifp->if_hw_tsomax = 65517
> - Boot in a NON LAGG environment. ix0 only.
>
> ixgbe's printf is showing packets up to 65530. Haven't run long
> enough yet
> to see if anything will go over 65535
>
> I have this tcpdump running to check packet size.
> tcpdump -ennvvXS -i ix0 greater 65518
>
> I do expect to get packets over 65518, but I was just curious to see
> if any
> of them would go over 65535. Time will tell.
>
> In a separate test, If I enable lagg, we have LOTS of oversized
> packet
> problems. It looks like tsomax is definitely not making it through in
> if_lagg.c - Any recommendations there? I will eventually need lagg,
> as I'm
> sure will others.
>
> With dtrace, it's showing t_tsomax >= 65518. Shouldn't that not be
> happening?
>
>
> dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 &&
> args[0]->t_tsomax >= 65518 / { printf("unexpected tp->t_tsomax:
> %i\n",
> args[0]->t_tsomax); stack(); }'
>
>

Markus Gebert

unread,
Mar 25, 2014, 7:21:38 PM3/25/14
to

On 26.03.2014, at 00:06, Christopher Forgeron <csfor...@gmail.com> wrote:

> Update:
>
> I'm changing my mind, and I believe Rick's TSO patch is fixing things
> (sorry). In looking at my notes, it's possible I had lagg on for those
> tests. lagg does seem to negate the TSO patch in my case.

I’m glad to hear you could check that scenario again. In the other email I just sent, I asked you to redo this test. Now it makes perfect sense why you saw oversized packets despite Rick’s if_hw_tsomax patch.


> kernel.10stable_basicTSO_65535/
>
> - IP_MAXPACKET = 65535;
> - manually forced (no if statement) ifp->if_hw_tsomax = IP_MAXPACKET -
> (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
> - Verified on boot via printf that ifp->if_hw_tsomax = 65517

Is 65517 correct? With Rick’s patch, I get this:

dev.ix.0.hw_tsomax: 65518

Also the dtrace command you used excludes 65518...

> - Boot in a NON LAGG environment. ix0 only.
>
> ixgbe's printf is showing packets up to 65530. Haven't run long enough yet
> to see if anything will go over 65535
>
> I have this tcpdump running to check packet size.
> tcpdump -ennvvXS -i ix0 greater 65518
>
> I do expect to get packets over 65518, but I was just curious to see if any
> of them would go over 65535. Time will tell.
>
> In a separate test, If I enable lagg, we have LOTS of oversized packet
> problems. It looks like tsomax is definitely not making it through in
> if_lagg.c - Any recommendations there? I will eventually need lagg, as I'm
> sure will others.

I think somebody has to invent a way to propagate if_hw_tsomax to interfaces stacked on top of each other.


> With dtrace, it's showing t_tsomax >= 65518. Shouldn't that not be
> happening?

Looks like these all come from bce interfaces (bce_intr in the stack trace), which probably have another value for if_hw_tsomax.


Markus

Rick Macklem

unread,
Mar 25, 2014, 10:04:55 PM3/25/14
to
Markus Gebert wrote:
>
> On 26.03.2014, at 00:06, Christopher Forgeron <csfor...@gmail.com>
> wrote:
>
> > Update:
> >
> > I'm changing my mind, and I believe Rick's TSO patch is fixing
> > things
> > (sorry). In looking at my notes, it's possible I had lagg on for
> > those
> > tests. lagg does seem to negate the TSO patch in my case.
>
> I’m glad to hear you could check that scenario again. In the other
> email I just sent, I just asked you to redo this test. Now it makes
> perfect sense why you saw oversized packets despite Rick’s
> if_hw_tsomax patch.
>
>
> > kernel.10stable_basicTSO_65535/
> >
> > - IP_MAXPACKET = 65535;
> > - manually forced (no if statement) ifp->if_hw_tsomax =
> > IP_MAXPACKET -
> > (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
> > - Verified on boot via printf that ifp->if_hw_tsomax = 65517
>
> Is 65517 correct? With Ricks patch, I get this:
>
> dev.ix.0.hw_tsomax: 65518
>
> Also the dtrace command you used excludes 65518...
>
I am using 32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN) which
is 65518. Although IP_MAXPACKET (maximum IP len, not including ethernet header)
is 65535 (largest # that fits in 16bits), the maximum data length
(including ethernet header) that will fit in 32 mbuf clusters is 65536.
(In practice 65517 or anything <= 65518 should fix the problem.)
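
Spelled out with the stock values (MCLBYTES = 2048, ETHER_HDR_LEN = 14,
ETHER_VLAN_ENCAP_LEN = 4), the two numbers under discussion are:

32 * MCLBYTES                                          = 32 * 2048  = 65536
32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN) = 65536 - 18 = 65518
IP_MAXPACKET  - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN) = 65535 - 18 = 65517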

rick

> > - Boot in a NON LAGG environment. ix0 only.
> >
> > ixgbe's printf is showing packets up to 65530. Haven't run long
> > enough yet
> > to see if anything will go over 65535
> >

With the ethernet header length, it can be <= 65536, because that
is 32 * MCLBYTES.

rick

Christopher Forgeron

unread,
Mar 25, 2014, 10:27:33 PM3/25/14
to
That's interesting. I see that in the r251296 commit Andre says:

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

I wonder if we add your same TSO patch to if_lagg.c before line 356's
ether_ifattach() will fix it.

Ultimately, it will need to load the if_hw_tsomax from the if below it -
but then again, if the calculation for ixgbe is good enough for that
driver, why wouldn't it be good enough for lagg?

Unless people think I'm crazy, I'll compile that in at line 356 in
if_lagg.c and give it a test run tomorrow.

This may need to go into vlan and carp as well, I'm not sure yet.
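
Roughly, what I'd try there is the following (untested sketch only; the exact
placement relative to if_lagg.c's ether_ifattach() call is from memory and not
checked):

	/*
	 * Untested sketch: same clamp as the ixgbe change, dropped in just
	 * before lagg's ether_ifattach() call.  It does not (yet) pull the
	 * limit up from the member ports.
	 */
	if (ifp->if_hw_tsomax == 0)
		ifp->if_hw_tsomax = IP_MAXPACKET -
		    (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);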


On Tue, Mar 25, 2014 at 8:16 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

> Christopher Forgeron wrote:
> > Update:
> >
> > I'm changing my mind, and I believe Rick's TSO patch is fixing
> > things
> > (sorry). In looking at my notes, it's possible I had lagg on for
> > those
> > tests. lagg does seem to negate the TSO patch in my case.
> >
> Ok, that's useful information. It implies that r251296 doesn't quite
> work and needs to be fixed for "stacked" network interface drivers
> before it can be used. I've cc'd Andre who is the author of that
> patch, in case he knows how to fix it.
>
> Thanks for checking this, rick
>
> > kernel.10stable_basicTSO_65535/
> >
> > - IP_MAXPACKET = 65535;
> > - manually forced (no if statement) ifp->if_hw_tsomax = IP_MAXPACKET
> > -
> > (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
> > - Verified on boot via printf that ifp->if_hw_tsomax = 65517
> > - Boot in a NON LAGG environment. ix0 only.
> >
> > ixgbe's printf is showing packets up to 65530. Haven't run long
> > enough yet
> > to see if anything will go over 65535
> >
> > I have this tcpdump running to check packet size.
> > tcpdump -ennvvXS -i ix0 greater 65518
> >
> > I do expect to get packets over 65518, but I was just curious to see
> > if any
> > of them would go over 65535. Time will tell.
> >
> > In a separate test, If I enable lagg, we have LOTS of oversized
> > packet
> > problems. It looks like tsomax is definitely not making it through in
> > if_lagg.c - Any recommendations there? I will eventually need lagg,
> > as I'm
> > sure will others.
> >
> > With dtrace, it's showing t_tsomax >= 65518. Shouldn't that not be
> > happening?
> >
> >

Christopher Forgeron

unread,
Mar 25, 2014, 10:33:44 PM3/25/14
to
On Tue, Mar 25, 2014 at 8:21 PM, Markus Gebert
<markus...@hostpoint.ch>wrote:

>
>
> Is 65517 correct? With Ricks patch, I get this:
>
> dev.ix.0.hw_tsomax: 65518
>

Perhaps a difference between 9.2 and 10 for one of the macros? My code is:

ifp->if_hw_tsomax = IP_MAXPACKET - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
printf("CSF - 3 Init, ifp->if_hw_tsomax = %d\n", ifp->if_hw_tsomax);

(BTW, you should submit the hw_tsomax sysctl patch, that's useful to others)


> Also the dtrace command you used excludes 65518...
>

Oh, I thought it was giving every packet that is greater than or equal to
65518 - Could you show me the proper command? That's the third time I've
used dtrace, so I'm making this up as I go. :-)

Christopher Forgeron

unread,
Mar 26, 2014, 1:44:37 PM3/26/14
to
Up for almost 19 hours under load without a single error. I would say the
TSO patch does work; now I'm going to run lagg tests.

The more I think of it, the more I wonder if setting tsomax in if.c at line
660 isn't the better idea, like below.

660: if (ifp->if_hw_tsomax == 0)
661: ifp->if_hw_tsomax = IP_MAXPACKET - (ETHER_HDR_LEN +
ETHER_VLAN_ENCAP_LEN);


I know there are concerns about the impact on various cards, but right now
if.c will set if_hw_tsomax to IP_MAXPACKET, which we know is bad for
ixgbe, and I believe bad for lagg (tests will show). If the driver isn't
specifically setting it to a different value, is there any harm in limiting
all if's to a default of IP_MAXPACKET - (ETHER_HDR_LEN +
ETHER_VLAN_ENCAP_LEN)? When is a TSO of 65535 going to be useful?

I can confirm that with just the TSO patch in ixgbe, and lagg enabled, the
problem still exists. Last night's tests never went above a packet of
65530. Now with lagg enabled, I'm seeing packets of 65543 within 5 minutes,
so we're already breaking.
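
The same tcpdump check works on the aggregate interface, e.g. (assuming the
lagg shows up as lagg0):

tcpdump -ennvvXS -i lagg0 greater 65518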

Christopher Forgeron

unread,
Mar 26, 2014, 6:20:25 PM3/26/14
to
Confirmed that adding this to sys/net/if.c fixes the issue for lagg as well
as ixgbe.

660: if (ifp->if_hw_tsomax == 0)
661: ifp->if_hw_tsomax = IP_MAXPACKET - (ETHER_HDR_LEN +
ETHER_VLAN_ENCAP_LEN);

The existing code (introduced in 9.2 via r251296, as Rick mentions above)
just sets ifp->if_hw_tsomax = IP_MAXPACKET, so I really don't see much
downside in setting it to ~65518 if it enhances compatibility.

This should also fix it for carp, vlan, and others.

I'm going to do a few more tests, but in the meantime let's discuss the cons
of doing this.

Rick Macklem

unread,
Mar 26, 2014, 8:31:43 PM3/26/14
to
Christopher Forgeron wrote:
>
>
>
>
>
> On Tue, Mar 25, 2014 at 8:21 PM, Markus Gebert <
> markus...@hostpoint.ch > wrote:
>
>
>
>
>
> Is 65517 correct? With Ricks patch, I get this:
>
> dev.ix.0.hw_tsomax: 65518
>
>
>
> Perhaps a difference between 9.2 and 10 for one of the macros? My
> code is:
>
> ifp->if_hw_tsomax = IP_MAXPACKET - (ETHER_HDR_LEN +
> ETHER_VLAN_ENCAP_LEN);
> printf("CSF - 3 Init, ifp->if_hw_tsomax = %d\n", ifp->if_hw_tsomax);
>
The difference is simply that IP_MAXPACKET == 65535, but I've been using
32 * MCLBYTES == 65536 (the latter is the amount of data m_defrag() can
squeeze into 32 mbuf clusters).

ie. I've suggested:
ifp->if_hw_tsomax = min(32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN),
IP_MAXPACKET);
- I put the min() in just so it wouldn't break if MCLBYTES is increased someday.

rick

>
> (BTW, you should submit the hw_tsomax sysctl patch, that's useful to
> others)
>
>
>
>
>
> Also the dtrace command you used excludes 65518...
>
>
>
> Oh, I thought it was giving every packet that is greater than or
> equal to 65518 - Could you show me the proper command? That's the
> third time I've used dtrace, so I'm making this up as I go. :-)
>

Rick Macklem

unread,
Mar 26, 2014, 8:35:48 PM3/26/14
to
Christopher Forgeron wrote:
> That's interesting. I see here in the r251296 commit Andre says :
>
> Drivers can set ifp->if_hw_tsomax before calling ether_ifattach()
> to
> change the limit.
>
> I wonder if we add your same TSO patch to if_lagg.c before line
> 356's
> ether_ifattach() will fix it.
>
I think the value(s) for underlying hardware drivers have to somehow
be propagated up through lagg. I haven't looked at the code, so I
don't know what that would be.

Putting the patch for ixgbe.c in lagg wouldn't make sense, since it
doesn't know if the underlying devices have the 32 limit.

I've suggested in the other thread what you suggested in a recent
post...ie. to change the default, at least until the propagation
of driver set values is resolved.

rick

> Ultimately, it will need to load the if_hw_tsomax from the if below
> it -
> but then again, if the calculation for ixgbe is good enough for that
> driver, why wouldn't it be good enough for lagg?
>
> Unless people think I'm crazy, I'll compile that in at line 356 in
> if_lagg.c and give it a test run tomorrow.
>
> This may need to go into vlan and carp as well, I'm not sure yet.
>
>
> On Tue, Mar 25, 2014 at 8:16 PM, Rick Macklem <rmac...@uoguelph.ca>
> wrote:
>
> > Christopher Forgeron wrote:
> > > Update:
> > >
> > > I'm changing my mind, and I believe Rick's TSO patch is fixing
> > > things
> > > (sorry). In looking at my notes, it's possible I had lagg on for
> > > those
> > > tests. lagg does seem to negate the TSO patch in my case.
> > >
> > Ok, that's useful information. It implies that r251296 doesn't
> > quite
> > work and needs to be fixed for "stacked" network interface drivers
> > before it can be used. I've cc'd Andre who is the author of that
> > patch, in case he knows how to fix it.
> >
> > Thanks for checking this, rick
> >
> > > kernel.10stable_basicTSO_65535/
> > >
> > > - IP_MAXPACKET = 65535;
> > > - manually forced (no if statement) ifp->if_hw_tsomax =
> > > IP_MAXPACKET
> > > -
> > > (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);

Christopher Forgeron

unread,
Mar 27, 2014, 8:17:02 AM3/27/14
to
On Wed, Mar 26, 2014 at 9:31 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

>
> ie. I've suggested:
> ifp->if_hw_tsomax = min(32 * MCLBYTES - (ETHER_HDR_LEN +
> ETHER_VLAN_ENCAP_LEN),
> IP_MAXPACKET);
> - I put the min() in just so it wouldn't break if MCLBYTES is increased
> someday.
>

I like the added safety for future changes - Good forward thinking. I'll
adjust my testing code to reflect this.

Christopher Forgeron

unread,
Mar 27, 2014, 8:23:52 AM3/27/14
to
On Wed, Mar 26, 2014 at 9:35 PM, Rick Macklem <rmac...@uoguelph.ca> wrote:

>
>
> I've suggested in the other thread what you suggested in a recent
> post...ie. to change the default, at least until the propagation
> of driver set values is resolved.
>
> rick
>

I wonder if we need to worry about propagating values up from the sub-ifs
at all - setting the default in if.c means this is set for all ifs, and it's a
simple one-line code change. If a specific interface needs a different value, it
can be set before ether_ifattach() is called.

I'm more concerned with the equation we use to calculate if_hw_tsomax - Are
we considering the right variables? Are we thinking on the wrong OSI layer
for headers?

Markus Gebert

unread,
Mar 27, 2014, 10:13:57 AM3/27/14
to

On 26.03.2014, at 03:33, Christopher Forgeron <csfor...@gmail.com> wrote:

> On Tue, Mar 25, 2014 at 8:21 PM, Markus Gebert
> <markus...@hostpoint.ch>wrote:
>
>>
>>
>> Is 65517 correct? With Ricks patch, I get this:
>>
>> dev.ix.0.hw_tsomax: 65518
>>
>
> Perhaps a difference between 9.2 and 10 for one of the macros? My code is:
>
> ifp->if_hw_tsomax = IP_MAXPACKET - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
> printf("CSF - 3 Init, ifp->if_hw_tsomax = %d\n", ifp->if_hw_tsomax);

Hm, I’m using Rick’s patch:

if ((adapter->num_segs * MCLBYTES - (ETHER_HDR_LEN +
    ETHER_VLAN_ENCAP_LEN)) < IP_MAXPACKET)
        ifp->if_hw_tsomax = adapter->num_segs * MCLBYTES -
            (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);



> (BTW, you should submit the hw_tsomax sysctl patch, that's useful to others)

My patch added a sysctl that is writable, but if I got this right, if_hw_tsomax is not expected to change after the interface is attached. That’s why I didn’t post it. But here’s a read-only version:

--- sys/dev/ixgbe/ixgbe.c 2013-12-19 14:24:10.624279412 +0100
+++ sys/dev/ixgbe/ixgbe.c 2014-03-27 15:00:59.503424634 +0100
@@ -577,6 +582,12 @@
 	if (ixgbe_setup_interface(dev, adapter) != 0)
 		goto err_late;
 
+	/* expose the interface's hw_tsomax as a read-only sysctl */
+	SYSCTL_ADD_INT(device_get_sysctl_ctx(dev),
+	    SYSCTL_CHILDREN(device_get_sysctl_tree(dev)),
+	    OID_AUTO, "hw_tsomax", CTLTYPE_INT|CTLFLAG_RD,
+	    &adapter->ifp->if_hw_tsomax, 1, "hardware TSO limit");
+
 	/* Initialize statistics */
 	ixgbe_update_stats_counters(adapter);



>> Also the dtrace command you used excludes 65518...
>>
>
> Oh, I thought it was giving every packet that is greater than or equal to
> 65518 - Could you show me the proper command? That's the third time I've
> used dtrace, so I'm making this up as I go. :-)

No, what looks like a comment (between slashes) is actually the predicate (the conditions) in dtrace:

dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 && args[0]->t_tsomax != 65518 / { printf("unexpected tp->t_tsomax: %i\n", args[0]->t_tsomax); stack(); }'

You have to read the above like this:

- fbt::tcp_output:entry -> Add a probe to the beginning of the kernel function tcp_output()
- / args[0]->t_tsomax != 0 && args[0]->t_tsomax != 65518 / -> only match if t_tsomax is neither 0 nor 65518 (args[0] is the struct tcpcb in the case of tcp_output())
- { printf("unexpected tp->t_tsomax: %i\n", args[0]->t_tsomax); stack(); } -> this is only executed when the probe fires and the conditions are true. In that case the t_tsomax value gets printed and a stack trace is generated

In your case, you stated that your if_hw_tsomax is 65517. Since my version of the dtrace one-liner does _not_ ignore 65517, you should have seen a lot of output, which you didn’t mention (you’ve just posted dtrace output that was generated from bce interfaces). That’s why I thought 65517 was a typo on your part, and I wanted to clarify that.
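
(If your interfaces really are at 65517, the equivalent check is the same
one-liner with the constant swapped, e.g.:

dtrace -n 'fbt::tcp_output:entry / args[0]->t_tsomax != 0 && args[0]->t_tsomax != 65517 / { printf("unexpected tp->t_tsomax: %i\n", args[0]->t_tsomax); stack(); }'

It will of course still flag the bce interfaces, which carry a different
if_hw_tsomax.)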


Markus

Rick Macklem

unread,
Mar 27, 2014, 6:44:01 PM3/27/14
to
Christopher Forgeron wrote:
>
>
>
>
>
>
> On Wed, Mar 26, 2014 at 9:35 PM, Rick Macklem < rmac...@uoguelph.ca
> > wrote:
>
>
>
>
> I've suggested in the other thread what you suggested in a recent
> post...ie. to change the default, at least until the propagation
> of driver set values is resolved.
>
> rick
>
>
>
> I wonder if we need to worry about propagating values up from the
> sub-if's - Setting the default in if.c means this is set for all
> if's, and it's a simple 1 line code change. If a specific 'if' needs
> a different value, it can be set before ether_attach() is called.
>
>
> I'm more concerned with the equation we use to calculate if_hw_tsomax
> - Are we considering the right variables? Are we thinking on the
> wrong OSI layer for headers?
>
Well, I'm pragmatic (which means I mostly care about some fix that works),
but it seems to me that:
- The problem is that some TSO enabled network drivers/hardware can only
handle 32 transmit segments (or 32 mbufs in the chain for the TSO packet
to be transmitted, if that is clearer).
--> Since the problem is in certain drivers, it seems that those drivers
should be where the long term fix goes.
--> Since some hardware can't handle more than 32, it seems that the
driver should be able to specify that limit, which tcp_output() can
then apply.

I have an untested patch that does this by adding if_hw_tsomaxseg.
(The attachment called tsomaxseg.patch.)
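
The driver-facing half of that idea is tiny; as a sketch of the intent only
(this is not the attached patch, and the value shown is just the 32-segment
limit discussed above):

	/*
	 * Sketch only: in ixgbe_setup_interface(), before the existing
	 * ether_ifattach() call, the driver would advertise its transmit
	 * segment limit and tcp_output() would honour it.
	 */
	ifp->if_hw_tsomaxseg = adapter->num_segs;	/* 32 for this hardware */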

Changing if_hw_tsomax or its default value is just a hack that gets tcp_output()
to generate TSO packets small enough that the driver can then squeeze the
mbuf chain into 32 clusters via m_defrag().

Since if_hw_tsomax (and if_hw_tsomaxseg in the untested patch) aren't
propagated up through lagg, that needs to be fixed.
(Yet another attached untested patch called lagg.patch.)
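
The gist of the lagg side is to advertise the smallest limit among the member
ports. A rough sketch of the idea (this is not the attached lagg.patch, and the
if_lagg.c structure/field names used here are assumptions):

	/*
	 * Rough sketch, not the attached patch: recompute the lagg
	 * interface's TSO limit as the minimum over its ports whenever the
	 * port list changes.  sc_ports/lp_ifp/lp_entries are assumed names.
	 */
	static void
	lagg_update_tsomax(struct lagg_softc *sc)
	{
		struct lagg_port *lp;
		u_int tsomax = IP_MAXPACKET;

		SLIST_FOREACH(lp, &sc->sc_ports, lp_entries) {
			if (lp->lp_ifp->if_hw_tsomax != 0 &&
			    lp->lp_ifp->if_hw_tsomax < tsomax)
				tsomax = lp->lp_ifp->if_hw_tsomax;
		}
		sc->sc_ifp->if_hw_tsomax = tsomax;
	}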

As I said before, I don't see these patches getting tested/reviewed etc
in time for 9.3, so I think reducing the default value of if_hw_tsomax
is a reasonable short term hack to work around the problem.
(And it sounds like Pyun YongHyeon has volunteered to fix many of the
drivers, where the 32 limit isn't a hardware one.)

rick

tsomaxseg.patch
lagg.patch