Gavin Atkinson wrote:
> On Tue, 2009-11-03 at 08:32 -0500, Weldon S Godfrey 3 wrote:
> >
> > If memory serves me right, sometime around Yesterday, Gavin Atkinson told me:
> >
> > Gavin, thank you A LOT for helping us with this; I have answered as much
> > as I can from the most recent crash below. We did hit the mbuf cluster
> > max. It is at 25K clusters, which is the default. I have upped it to 32K
> > because a rather old article mentioned that as the top end, and since I
> > need to get into work I am not trying to go higher over a remote console.
> > I have already set it to reboot next with 64K clusters. I already have
> > kmem maxed to what is (or at least at one time was) bootable in 8.0,
> > 4GB; how high can I safely go? This is an NFS server running ZFS with
> > sustained 5-minute averages of 120-200Mb/s, acting as a store for a
> > mail system.
> >
> > > Some things that would be useful:
> > >
> > > - Does "arp -da" fix things?
> >
> > no, it hangs like ssh, route add, etc
> >
> > > - What's the output of "netstat -m" while the networking is broken?
> > Tue Nov 3 07:02:11 CST 2009
> > 36971/2033/39004 mbufs in use (current/cache/total)
> > 24869/731/25600/25600 mbuf clusters in use (current/cache/total/max)
> > 24314/731 mbuf+clusters out of packet secondary zone in use
> > (current/cache)
> > 0/35/35/12800 4k (page size) jumbo clusters in use
> > (current/cache/total/max)
> > 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
> > 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
> > 58980K/2110K/61091K bytes allocated to network (current/cache/total)
> > 0/201276/90662 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> > 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> > 0/0/0 sfbufs in use (current/peak/max)
> > 0 requests for sfbufs denied
> > 0 requests for sfbufs delayed
> > 0 requests for I/O initiated by sendfile
> > 0 calls to protocol drain routines
>
> OK, at least we've figured out what is going wrong then. As a
> workaround to get the machine to stay up longer, you should be able to
> set kern.ipc.nmbclusters=256000 in /boot/loader.conf - but hopefully we
> can resolve this soon.
I'll chip in with a report of exactly the same situation, and I'm on 8.0-RELEASE.
We've been struggling with this for some time; the box was rebooted as recently as yesterday, and it wedged again last night. We're at a whopping
kern.ipc.nmbclusters: 524288
and I've just doubled it once more, to 1048576 clusters, which at 2 KB per cluster means up to 2 GB allocated to networking for clusters alone.
Much like the original poster, we're seeing this on an amd64 storage server with a large ZFS array shared through NFS; the network interfaces are two em(4) ports combined in a lagg(4) interface (LACP). Using either of the two em interfaces without lagg shows the same problem, just with lower performance.
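For reference (following Gavin's suggestion above), the boot-time tunable would look like this in /boot/loader.conf:

  kern.ipc.nmbclusters="1048576"

and the same setting can also be raised at runtime via the sysctl of the same name.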
> Firstly, what kernel was the above output from? And what network card
> are you using? In your initial post you mentioned testing both bce(4)
> and em(4) cards; be aware that em(4) had a bug that would cause
> exactly this issue, which was fixed with a commit on September 11th
> (r197093). Make sure your kernel is from after that date if you are
> using em(4). I guess it is also possible that bce(4) has the same
> issue; I'm not aware of any recent fixes to it.
We're on GENERIC.
> So, from here, I think the best thing would be to just use the em(4) NIC
> and an up-to-date kernel, and see if you can reproduce the issue.
em(4) on 8.0-RELEASE still shows this problem.
> How important is this machine? If em(4) works, are you able to help
> debug the issues with the bce(4) driver?
We have no bce(4), but we have the problem on em(4), so we can help debug there. The server is important, but making it stable is more important. See below the sig for some debug info.
/Eirik
Output from sysctl dev.em.[0,1].debug=1:
em0: Adapter hardware address = 0xffffff80003ac530
em0: CTRL = 0x140248 RCTL = 0x8002
em0: Packet buffer = Tx=20k Rx=12k
em0: Flow control watermarks high = 10240 low = 8740
em0: tx_int_delay = 66, tx_abs_int_delay = 66
em0: rx_int_delay = 32, rx_abs_int_delay = 66
em0: fifo workaround = 0, fifo_reset_count = 0
em0: hw tdh = 92, hw tdt = 92
em0: hw rdh = 225, hw rdt = 224
em0: Num Tx descriptors avail = 256
em0: Tx Descriptors not avail1 = 0
em0: Tx Descriptors not avail2 = 0
em0: Std mbuf failed = 0
em0: Std mbuf cluster failed = 11001
em0: Driver dropped packets = 0
em0: Driver tx dma failure in encap = 0
em1: Adapter hardware address = 0xffffff80003be530
em1: CTRL = 0x140248 RCTL = 0x8002
em1: Packet buffer = Tx=20k Rx=12k
em1: Flow control watermarks high = 10240 low = 8740
em1: tx_int_delay = 66, tx_abs_int_delay = 66
em1: rx_int_delay = 32, rx_abs_int_delay = 66
em1: fifo workaround = 0, fifo_reset_count = 0
em1: hw tdh = 165, hw tdt = 165
em1: hw rdh = 94, hw rdt = 93
em1: Num Tx descriptors avail = 256
em1: Tx Descriptors not avail1 = 0
em1: Tx Descriptors not avail2 = 0
em1: Std mbuf failed = 0
em1: Std mbuf cluster failed = 17765
em1: Driver dropped packets = 0
em1: Driver tx dma failure in encap = 0
Output from netstat -m (note that I just doubled the mbuf cluster max, so max > total, and the box currently works):
544916/3604/548520 mbufs in use (current/cache/total)
543903/3041/546944/1048576 mbuf clusters in use (current/cache/total/max)
543858/821 mbuf+clusters out of packet secondary zone in use (current/cache)
0/77/77/262144 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/131072 9k jumbo clusters in use (current/cache/total/max)
0/0/0/65536 16k jumbo clusters in use (current/cache/total/max)
1224035K/7291K/1231326K bytes allocated to network (current/cache/total)
0/58919/29431 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
How about disabling TSO/Tx checksum offloading of em(4)?
Last time I checked the driver, em(4) seemed to assume it could access
the IP/TCP headers in mbuf chains without computing the required header size.
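Something along these lines should do it (em0/em1 as in your earlier debug output):

  ifconfig em0 -tso -txcsum -rxcsum
  ifconfig em1 -tso -txcsum -rxcsum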
Hi,
I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising. I'll wait and see if it goes down again, then reboot with those values to see how it behaves. But right away it doesn't look too good ..
/Eirik
> I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising.
> I'll wait and see if it goes down again, then reboot with those values to
> see how it behaves. But right away it doesn't look too good ..
It would be interesting to know if any of the counters in the output of
netstat -s grow linearly with the allocation count in netstat -m. Often times
leaks are associated with edge cases in the stack (typically because if they
are in common cases the bug is detected really quickly!) -- usually error
handling, where in some error case the unwinding fails to free an mbuf that it
should free. These are notoriously hard to track down, unfortunately, but the
stats output (especially where delta alloc is linear to delta stat) may inform
the situation some more.
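A crude way to do that is to just diff two snapshots taken a while apart, e.g.:

  netstat -m > m.before ; netstat -s > s.before
  sleep 600
  netstat -m > m.after  ; netstat -s > s.after
  diff -u s.before s.after

and then compare the counter deltas against the growth of the "current" columns in the netstat -m output.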
Robert N M Watson
Computer Laboratory
University of Cambridge
> On Sun, 29 Nov 2009, Eirik Øverby wrote:
>
>> I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising. I'll wait and see if it goes down again, then reboot with those values to see how it behaves. But right away it doesn't look too good ..
>
> It would be interesting to know if any of the counters in the output of netstat -s grow linearly with the allocation count in netstat -m. Often times leaks are associated with edge cases in the stack (typically because if they are in common cases the bug is detected really quickly!) -- usually error handling, where in some error case the unwinding fails to free an mbuf that it should free. These are notoriously hard to track down, unfortunately, but the stats output (especially where delta alloc is linear to delta stat) may inform the situation some more.
I'm collecting the output of netstat -m|grep "mbuf clusters" and netstat -s every 10 seconds now. Will allow it to run until the box stops responding. Not sure what to do with the data after the box has "died" .. Will try to see if there are any obvious similarities.
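The collection loop is roughly this (10 second interval, log path illustrative):

  #!/bin/sh
  while true; do
      date
      netstat -m | grep "mbuf clusters"
      netstat -s
      sleep 10
  done >> /var/tmp/netstat.log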
Suggestions welcome - both how to collect data and how to analyse ;)
/Eirik
> On Sun, 29 Nov 2009, Eirik Øverby wrote:
>
>> I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising. I'll wait and see if it goes down again, then reboot with those values to see how it behaves. But right away it doesn't look too good ..
>
> It would be interesting to know if any of the counters in the output of netstat -s grow linearly with the allocation count in netstat -m. Often times leaks are associated with edge cases in the stack (typically because if they are in common cases the bug is detected really quickly!) -- usually error handling, where in some error case the unwinding fails to free an mbuf that it should free. These are notoriously hard to track down, unfortunately, but the stats output (especially where delta alloc is linear to delta stat) may inform the situation some more.
From what I can tell, all that goes up with mbuf usage is traffic/packet counts. I can't say I see anything fishy in there.
From the last few samples in
http://anduin.net/~ltning/netstat.log
you can see the host stops receiving any packets, but does a few retransmits before the session where this script ran timed out.
/Eirik
If the system has exhausted all available mbufs it still should not crash
the box. Use the -d option of netstat(1) to see whether the packet drop
counter still goes up when you know the system can't receive any
frames. AFAIK em(4) was carefully written to recover from Rx
resource shortage such that it just drops incoming frames when it
can't get a new mbuf. This may result in dropping incoming connection
requests, but it means it still tries to recover from the resource
exhaustion.
It's not clear where the mbuf leak comes from, though.
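For example, something like

  netstat -d -w 1 -I em0

should show whether the interface drop counter keeps climbing while the
box is otherwise unreachable.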
> From the last few samples in
> http://anduin.net/~ltning/netstat.log
404
> you can see the host stops receiving any packets, but does a few retransmits before the session where this script ran timed out.
>
By chance do you use pf/ipfw/ipf?
> On Mon, Nov 30, 2009 at 12:21:16AM +0100, Eirik Øverby wrote:
>> On 29. nov. 2009, at 15.29, Robert Watson wrote:
>>
>>> On Sun, 29 Nov 2009, Eirik Øverby wrote:
>>>
>>>> I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising. I'll wait and see if it goes down again, then reboot with those values to see how it behaves. But right away it doesn't look too good ..
>>>
>>> It would be interesting to know if any of the counters in the output of netstat -s grow linearly with the allocation count in netstat -m. Often times leaks are associated with edge cases in the stack (typically because if they are in common cases the bug is detected really quickly!) -- usually error handling, where in some error case the unwinding fails to free an mbuf that it should free. These are notoriously hard to track down, unfortunately, but the stats output (especially where delta alloc is linear to delta stat) may inform the situation some more.
>>
>> From what I can tell, all that goes up with mbuf usage is traffic/packet counts. I can't say I see anything fishy in there.
>>
>
> If the system has exhausted all available mbufs it still should not crash
> the box. Use the -d option of netstat(1) to see whether the packet drop
> counter still goes up when you know the system can't receive any
> frames. AFAIK em(4) was carefully written to recover from Rx
> resource shortage such that it just drops incoming frames when it
> can't get a new mbuf. This may result in dropping incoming connection
> requests, but it means it still tries to recover from the resource
> exhaustion.
> It's not clear where the mbuf leak comes from, though.
The box does not crash; connecting to the console (via IP-KVM) shows the box is just fine, except that no networking works. I can raise the kern.ipc.nmbclusters value from the command line, and after a few seconds things start moving again.
The em(4) debug output shows that it fails to allocate mbuf clusters.
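(Concretely, the runtime bump is just something along the lines of

  sysctl kern.ipc.nmbclusters=1048576

from the console, with the value picked to double the current limit.)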
>> From the last few samples in
>> http://anduin.net/~ltning/netstat.log
>
> 404
Uh? Unpossible :)
The file is there, and I can view it here ...
>> you can see the host stops receiving any packets, but does a few retransmits before the session where this script ran timed out.
>>
>
> By chance do you use pf/ipfw/ipf?
No... Unfortunately ;)
/Eirik
I've seen this on the proxy boxes that I've set up. There's a lot of
data being tied up in socket buffers as well as being routed between
interfaces (i.e., stuff that isn't being intercepted). Take a look at
"netstat -an" when things are locked up; see if there are any sockets
which have full send/receive queues.
I'm going to take a complete stab in the dark here and say this sounds
a little like a livelock. I.e., something is queuing data and allocating
mbufs for TX (and something else is generating mbufs - I dunno, packet
headers?) far faster than the NIC is able to TX them out, and there's
not enough backpressure on whatever (say, the stuff filling socket
buffers) to stop the mbuf exhaustion. Again, I've seen this kind of
crap on proxy boxes.
See if you have full socket buffers showing up in netstat -an. Have
you tweaked the socket/TCP send/receive sizes? I typically lock mine
down to something small (32k-64k for the most part) so I don't hit
mbuf exhaustion on very busy proxies.
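A quick way to spot them is something like this (untested; Recv-Q/Send-Q
are the 2nd/3rd columns for inet sockets):

  netstat -an | awk '$2 ~ /^[0-9]+$/ && ($2 > 0 || $3 > 0)'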
2c,
Adrian
2009/11/30 Eirik Øverby <ltn...@anduin.net>:
> On 29. nov. 2009, at 15.29, Robert Watson wrote:
>
>> On Sun, 29 Nov 2009, Eirik Øverby wrote:
>>
>>> I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising. I'll wait and see if it goes down again, then reboot with those values to see how it behaves. But right away it doesn't look too good ..
>>
>> It would be interesting to know if any of the counters in the output of netstat -s grow linearly with the allocation count in netstat -m. Often times leaks are associated with edge cases in the stack (typically because if they are in common cases the bug is detected really quickly!) -- usually error handling, where in some error case the unwinding fails to free an mbuf that it should free. These are notoriously hard to track down, unfortunately, but the stats output (especially where delta alloc is linear to delta stat) may inform the situation some more.
>
> From what I can tell, all that goes up with mbuf usage is traffic/packet counts. I can't say I see anything fishy in there.
>
> From the last few samples in
> http://anduin.net/~ltning/netstat.log
> you can see the host stops receiving any packets, but does a few retransmits before the session where this script ran timed out.
>
> /Eirik
>
> That URL works for me. So how much traffic is this box handling during
> peak times?
Depends on how you define load. It's a storage box (14TB ZFS) with a small handful of NFS clients pushing backup data to it, so lots of traffic in bytes/sec, but not many clients.
> I've seen this on the proxy boxes that I've set up. There's a lot of
> data being tied up in socket buffers as well as being routed between
> interfaces (i.e., stuff that isn't being intercepted). Take a look at
> "netstat -an" when things are locked up; see if there are any sockets
> which have full send/receive queues.
If you're referring to the Send-Q and Recv-Q values, they are zero everywhere as far as I can tell.
> I'm going to take a complete stab in the dark here and say this sounds
> a little like a livelock. I.e., something is queuing data and allocating
> mbufs for TX (and something else is generating mbufs - I dunno, packet
> headers?) far faster than the NIC is able to TX them out, and there's
> not enough backpressure on whatever (say, the stuff filling socket
> buffers) to stop the mbuf exhaustion. Again, I've seen this kind of
> crap on proxy boxes.
Not sure if this applies in our case. See the (very) end of this mail for some debug/stats output from em1 (the interface currently in use; I disabled lagg/lacp to ease debugging).
> See if you have full socket buffers showing up in netstat -an. Have
> you tweaked the socket/TCP send/receive sizes? I typically lock mine
> down to something small (32k-64k for the most part) so I don't hit
> mbuf exhaustion on very busy proxies.
I haven't touched any defaults except the mbuf clusters. What does your sysctl.conf look like?
Thanks,
/Eirik
> 2c,
>
>
>
> Adrian
>
> 2009/11/30 Eirik Øverby <ltn...@anduin.net>:
>> On 29. nov. 2009, at 15.29, Robert Watson wrote:
>>
>>> On Sun, 29 Nov 2009, Eirik Øverby wrote:
>>>
>>>> I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising. I'll wait and see if it goes down again, then reboot with those values to see how it behaves. But right away it doesn't look too good ..
>>>
>>> It would be interesting to know if any of the counters in the output of netstat -s grow linearly with the allocation count in netstat -m. Often times leaks are associated with edge cases in the stack (typically because if they are in common cases the bug is detected really quickly!) -- usually error handling, where in some error case the unwinding fails to free an mbuf that it should free. These are notoriously hard to track down, unfortunately, but the stats output (especially where delta alloc is linear to delta stat) may inform the situation some more.
>>
>> From what I can tell, all that goes up with mbuf usage is traffic/packet counts. I can't say I see anything fishy in there.
>>
>> From the last few samples in
>> http://anduin.net/~ltning/netstat.log
>> you can see the host stops receiving any packets, but does a few retransmits before the session where this script ran timed out.
>>
>> /Eirik
>>
em1: link state changed to UP
em1: Adapter hardware address = 0xffffff80003be530
em1: CTRL = 0x140248 RCTL = 0x8002
em1: Packet buffer = Tx=20k Rx=12k
em1: Flow control watermarks high = 10240 low = 8740
em1: tx_int_delay = 66, tx_abs_int_delay = 66
em1: rx_int_delay = 32, rx_abs_int_delay = 66
em1: fifo workaround = 0, fifo_reset_count = 0
em1: hw tdh = 25, hw tdt = 25
em1: hw rdh = 222, hw rdt = 221
em1: Num Tx descriptors avail = 256
em1: Tx Descriptors not avail1 = 0
em1: Tx Descriptors not avail2 = 0
em1: Std mbuf failed = 0
em1: Std mbuf cluster failed = 0
em1: Driver dropped packets = 0
em1: Driver tx dma failure in encap = 0
em1: Excessive collisions = 0
em1: Sequence errors = 0
em1: Defer count = 0
em1: Missed Packets = 0
em1: Receive No Buffers = 0
em1: Receive Length Errors = 0
em1: Receive errors = 0
em1: Crc errors = 0
em1: Alignment errors = 0
em1: Collision/Carrier extension errors = 0
em1: RX overruns = 0
em1: watchdog timeouts = 0
em1: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
em1: XON Rcvd = 0
em1: XON Xmtd = 0
em1: XOFF Rcvd = 0
em1: XOFF Xmtd = 0
em1: Good Packets Rcvd = 5704113
em1: Good Packets Xmtd = 3617612
em1: TSO Contexts Xmtd = 0
em1: TSO Contexts Failed = 0
>> That URL works for me. So how much traffic is this box handling during
>> peak times?
>
> Depends on how you define load. It's a storage box (14TB ZFS) with a small handful of NFS clients pushing backup data to it, so lots of traffic in bytes/sec, but not many clients.
Ok.
> If you're referring to the Send-Q and Recv-Q values, they are zero everywhere as far as I can tell.
Hm, I was. Ok.
>> See if you have full socket buffers showing up in netstat -an. Have
>> you tweaked the socket/TCP send/receive sizes? I typically lock mine
>> down to something small (32k-64k for the most part) so I don't hit
>> mbuf exhaustion on very busy proxies.
> I haven't touched any defaults except the mbuf clusters. What does your sysctl.conf look like?
I just set these:
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=65536
I tweak a lot of other TCP stack stuff to deal with satellite
latencies; it's not relevant here.
I'd love to see where those mbufs are hiding and whether they're a
leak, or whether the NFS server is just pushing too much data out for
whatever reason. Actually, something I also set was this:
# Handle slightly more packets per interrupt tick
net.inet.ip.intr_queue_maxlen=512
It was defaulting to 50 which wasn't fast enough for small packet loads.
Adrian
In fact it's mostly receiving. Other boxes on the LAN (or other
internal subnets) are pushing data to it, rarely reading any except to
check status and clean up.
> whatever reason. Actually, something I also set was this:
>
> # Handle slightly more packets per interrupt tick
> net.inet.ip.intr_queue_maxlen=512
>
> It was defaulting to 50 which wasn't fast enough for small packet
> loads.
I'll try all those and then some, but I'm no optimist. Might try on
different hardware later.
Thanks,
/Eirik
>> I'd love to see where those mbufs are hiding and whether they're a
>> leak, or whether the NFS server is just pushing too much data out for
>
> In fact it's mostly receiving. Other boxes on the LAN (or other internal
> subnets) are pushing data to it, rarely reading any except to check status
> and clean up.
Right, but it also has to queue response packets for the NFS transactions.
What do your NFS mounts look like? Are they TCP or UDP? Have you toyed
with the mount settings at all?
> 2009/11/30 Eirik Øverby <ltn...@anduin.net>:
>
>>> I'd love to see where those mbufs are hiding and whether they're a
>>> leak, or whether the NFS server is just pushing too much data out for
>>
>> In fact it's mostly receiving. Other boxes on the LAN (or other internal
>> subnets) are pushing data to it, rarely reading any except to check status
>> and clean up.
>
> Right, but it also has to queue response packets for the NFS transactions.
>
> What do your NFS mounts look like? Are they TCP or UDP? Have you toyed
> with the mount settings at all?
Nope ... All default, and mostly FreeBSD<->FreeBSD NFS mounts. The other FreeBSDs are at 7.x, and there's one OpenBSD 4.4 box. So nothing fancy. I'm at a complete loss here.
/Eirik
I have something that might be more interesting than any counter ...
It seems to me as if the problem *only* manifests itself when an OpenBSD box is backing up to this FreeBSD 8.0-NFS-ZFS server. All other boxes are FreeBSD, and I have so far today been unable to reproduce the problem from any of those. As soon as I interrupted the backup running from OpenBSD, the mbuf cluster usage stabilized.
How's that for a mystery in the morning?
/Eirik
Short follow-up: Making OpenBSD use TCP mounts (it defaults to UDP) seems to solve the issue.
So this is a UDP-NFS-related problem, it would seem?
/Eirik
> Short follow-up: Making OpenBSD use TCP mounts (it defaults to UDP) seems to solve the issue.
>
> So this is a UDP-NFS-related problem, it would seem?
Could well be. Let's try another debugging tactic -- there are two possible things going on here: resource leak, and resource exhaustion leading to deadlock. If you shut down to single user mode from multi-user, and let the system quiesce for a few minutes, then run netstat -m, what does it look like? Do vast numbers of mbufs+clusters get freed, or do they remain accounted for as allocated?
(If they remain allocated, they were likely leaked, since most/all sockets will have been closed, releasing their resources on shutdown to single user when all processes are killed)
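In practice that is just something like:

  shutdown now        # drop from multi-user to single-user
  # ... let it sit for a few minutes, then:
  netstat -m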
The theory of an mbuf leak in NFS isn't an unlikely theory -- the socket code there continues to change, and rare edge cases frequently lead to leaks (per my earlier e-mail). Perhaps there's a case the OpenBSD client is triggering that other NFS clients normally don't. If we think that's the case, the next step is usually to narrow down what causes the leak to trigger a lot (i.e., the backup starting), and then grab a packet trace that we can analyze with wireshark. We'll want to look at the types of errors being returned for RPCs and, in particular, if there's one that happens about the same number of times as the resource has leaked over the same window, look at the code and see if that error case is handled properly.
If this is definitely an NFS leak bug, we should get the NFS folks attention by sticking "NFS mbuf leak" in the subject line and CC'ing rmacklem/dfr. :-)
Robert
> /Eirik
>
> On 30. nov. 2009, at 11.22, Eirik Øverby wrote:
>
>> Hi,
>>
>> I have something that might be more interesting than any counter ...
>> It seems to me as if the problem *only* manifests itself when an OpenBSD box is backing up to this FreeBSD 8.0-NFS-ZFS server. All other boxes are FreeBSD, and I have so far today been unable to reproduce the problem from any of those. As soon as I interrupted the backup running from OpenBSD, the mbuf cluster usage stabilized.
>>
>> How's that for a mystery in the morning?
>>
>> /Eirik
>>
>> On 29. nov. 2009, at 15.29, Robert Watson wrote:
>>
>>> On Sun, 29 Nov 2009, Eirik Øverby wrote:
>>>
>>>> I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising. I'll wait and see if it goes down again, then reboot with those values to see how it behaves. But right away it doesn't look too good ..
>>>
>>> It would be interesting to know if any of the counters in the output of netstat -s grow linearly with the allocation count in netstat -m. Often times leaks are associated with edge cases in the stack (typically because if they are in common cases the bug is detected really quickly!) -- usually error handling, where in some error case the unwinding fails to free an mbuf that it should free. These are notoriously hard to track down, unfortunately, but the stats output (especially where delta alloc is linear to delta stat) may inform the situation some more.
>>>
>>> Robert N M Watson
>>> Computer Laboratory
>>> University of Cambridge
>>
>
> On 30 Nov 2009, at 05:36, Eirik Øverby wrote:
>
>> Short follow-up: Making OpenBSD use TCP mounts (it defaults to UDP) seems to solve the issue.
>>
>> So this is a UDP-NFS-related problem, it would seem?
>
> Could well be. Let's try another debugging tactic -- there are two possible things going on here: resource leak, and resource exhaustion leading to deadlock. If you shut down to single user mode from multi-user, and let the system quiesce for a few minutes, then run netstat -m, what does it look like? Do vast numbers of mbufs+clusters get freed, or do they remain accounted for as allocated?
It's been sitting in single-user mode for about 15 minutes now, no change in allocation.
I'll reboot in about 15 minutes, then try to mount from a FreeBSD box using UDP - if that causes the same issues, I guess it's not an OpenBSD specific issue but a UDP issue "in general". Next step would be to try to reproduce the same between two VMs on my own box, as this box needs to return to production soonish - if we manage to reproduce elsewhere..
Other ideas/suggestions?
/Eirik
>>> Short follow-up: Making OpenBSD use TCP mounts (it defaults to UDP) seems to solve the issue.
>>>
>>> So this is a UDP-NFS-related problem, it would seem?
>>
>> Could well be. Let's try another debugging tactic -- there are two possible things going on here: resource leak, and resource exhaustion leading to deadlock. If you shut down to single user mode from multi-user, and let the system quiesce for a few minutes, then run netstat -m, what does it look like? Do vast numbers of mbufs+clusters get freed, or do they remain accounted for as allocated?
>
> It's been sitting in single-user mode for about 15 minutes now, no change in allocation.
> I'll reboot in about 15 minutes, then try to mount from a FreeBSD box using UDP - if that causes the same issues, I guess it's not an OpenBSD specific issue but a UDP issue "in general". Next step would be to try to reproduce the same between two VMs on my own box, as this box needs to return to production soonish - if we manage to reproduce elsewhere..
This sounds like a good plan -- especially reproducing it on a non-production box :-). I agree it's most likely that the OpenBSD NFS client simply does something a little differently than the other NFS clients you are dealing with, triggering an edge case in our NFS server code. But, to be clear, I think it's much more likely that the bug is in the NFS over UDP code than UDP itself, given the complexity of the NFS code (although a UDP bug can't be ruled out).
Robert
I meant NFS-UDP ... However I was wrong even there; Using NFS over UDP from FreeBSD boxes does not cause the same issue. So OpenBSD seems to be a special case here.
I'm no Wireshark expert (to be fair, I've seen it a few times and tried it once or twice, and that's so long ago it's almost no longer true), so I'd need some input on how to gather useful data. I assume tcpdump, which options? And would it be OK if I made the dump available for download somewhere, so you or someone else can take a look with whichever tools you'd like?
Thanks for your time,
/Eirik
On Mon, 30 Nov 2009, Robert N. M. Watson wrote:
>
> On 30 Nov 2009, at 05:36, Eirik Øverby wrote:
>
>> Short follow-up: Making OpenBSD use TCP mounts (it defaults to UDP) seems to solve the issue.
>>
>> So this is a UDP-NFS-related problem, it would seem?
>
> Could well be. Let's try another debugging tactic -- there are two possible things going on here: resource leak, and resource exhaustion leading to deadlock. If you shut down to single user mode from multi-user, and let the system quiesce for a few minutes, then run netstat -m, what does it look like? Do vast numbers of mbufs+clusters get freed, or do they remain accounted for as allocated?
>
> (If they remain allocated, they were likely leaked, since most/all sockets will have been closed, releasing their resources on shutdown to single user when all processes are killed)
>
> The theory of an mbuf leak in NFS isn't an unlikely theory -- the socket code there continues to change, and rare edge cases frequently lead to leaks (per my earlier e-mail). Perhaps there's a case the OpenBSD client is triggering that other NFS clients normally don't. If we think that's the case, the next step is usually to narrow down what causes the leak to trigger a lot (i.e., the backup starting), and then grab a packet trace that we can analyze with wireshark. We'll want to look at the types of errors being returned for RPCs and, in particular, if there's one that happens about the same number of times as the resource has leaked over the same window, look at the code and see if that error case is handled properly.
>
> If this is definitely an NFS leak bug, we should get the NFS folks attention by sticking "NFS mbuf leak" in the subject line and CC'ing rmacklem/dfr. :-)
>
It's a bit of a shot in the dark, but could you please test the following
patch? It patches for a possible mbuf leak + a possible M_SONAME leak (I
have no idea if these ever occur in practice?). It also fixes a case where
the return value for svc_dg_reply() would have been TRUE for failure. It
was all I could see from a quick look.
rick
--- rpc/svc_dg.c.sav    2009-12-07 15:37:45.000000000 -0500
+++ rpc/svc_dg.c        2009-12-07 15:48:50.000000000 -0500
@@ -221,6 +221,8 @@
     xdrmbuf_create(&xdrs, mreq, XDR_DECODE);
     if (! xdr_callmsg(&xdrs, msg)) {
         XDR_DESTROY(&xdrs);
+        if (raddr != NULL)
+            free(raddr, M_SONAME);
         return (FALSE);
     }
@@ -259,11 +261,13 @@
         m_fixhdr(mrep);
         error = sosend(xprt->xp_socket, addr, NULL, mrep, NULL,
             0, curthread);
-        if (!error) {
-            stat = TRUE;
+        if (error) {
+            stat = FALSE;
         }
     } else {
         m_freem(mrep);
+        if (m != NULL)
+            m_freem(m);
     }
     XDR_DESTROY(&xdrs);
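(For reference, applying this against a RELENG_8 source tree and rebuilding would be roughly:

  cd /usr/src/sys && patch < /path/to/svc_dg.diff
  cd /usr/src && make buildkernel && make installkernel && shutdown -r now

where the patch file name/path is of course illustrative.)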
>
>
> On Thu, 10 Dec 2009, Eirik Øverby wrote:
>
>> Hi,
>>
>> this applies to 8.0-RELEASE?
>>
> It should. sys/rpc/svc_dg.c hasn't changed in a while.
>
>> I'll try to test today.
>>
> Thanks. I have no idea if it will help, but there was a case that could
> leak mbufs, if it ever occurs, that is fixed by this.
It didn't seem to help much, anyway. After a couple of backup runs the mbuf cluster allocation had gone from ~4k to ~25k, and as I had reverted the sysctl to its default, it wedged right there.
I'll have to go back to TCP mounts from the OpenBSD box again I guess ;)
/Eirik
> Good luck with it, rick
No luck with it, unfortunately :) But thanks for trying. Any other ideas?
/Eirik
> I meant NFS-UDP ... However I was wrong even there; Using NFS over UDP from FreeBSD boxes does not cause the same issue. So OpenBSD seems to be a special case here.
>
> I'm no Wireshark expert (to be fair, I've seen it a few times and tried it once or twice, and that's so long ago it's almost no longer true), so I'd need some input on how to gather useful data. I assume tcpdump, which options? And would it be OK if I made the dump available for download somewhere, so you or someone else can take a look with whichever tools you'd like?
Aii. Over a month zips past in the blink of an eye.
Are you still experiencing this problem? I can certainly look at a wireshark trace, but make no promises. If you do do a trace, then what we should do is have you run a script that dumps a bunch of relevant stats with nfsstat, netstat, vmstat, etc., before the trace starts, grabs exactly ${someval} seconds of trace data, then dumps all the same stats afterwards. Then we can use the stats to work out about how many leaked packets (or whatever) were present, and try to correlate it to a count of some type of event in the trace.
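Roughly like this (untested sketch; interface, client host and duration are placeholders, and vmstat -z is there for the mbuf/cluster zone counters):

  #!/bin/sh
  secs=60
  dump_stats() { date; nfsstat; netstat -m; netstat -s; vmstat -z; }
  dump_stats > stats.before
  tcpdump -i em1 -s 0 -w nfs-trace.pcap host <openbsd-client> &
  sleep $secs
  kill $!
  dump_stats > stats.after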
Robert
Happens all the time. ;)
> Are you still experiencing this problem? I can certainly look at a wireshark trace, but make no promises. If you do do a trace, then what we should do is have you run a script that dumps a bunch of relevant stats with nfsstat, netstat, vmstat, etc., before the trace starts, grabs exactly ${someval} seconds of trace data, then dumps all the same stats afterwards. Then we can use the stats to work out about how many leaked packets (or whatever) were present, and try to correlate it to a count of some type of event in the trace.
I got a patch from Rick in early December, which I have tried, but it unfortunately did not seem to make any difference. I cvsup'ed (RELENG_8_0), patched, compiled, installed, booted and tested on the 12th - any reason to think things might have changed on 8.0 since then? I'm assuming not, in which case the answer is yes, I am still experiencing this problem (except I've made all our OpenBSD boxen speak TCP now, avoiding the problem entirely).
I'll perform whichever tests you'd like me to on that particular system. Just send me the scripts or pseudocode, and I'll get it done as soon as I can.
I was also planning on setting up a pair of VMs (one FreeBSD and one OpenBSD) and try to reproduce there - but that month that zipped past has not allowed the time to do so.
Thanks,
/Eirik
On Fri, 8 Jan 2010, Eirik Øverby wrote:
>
> On 8. jan. 2010, at 16.33, Robert N. M. Watson wrote:
>> On 30 Nov 2009, at 19:13, Eirik Øverby wrote:
>>
>>> I meant NFS-UDP ... However I was wrong even there; Using NFS over UDP from FreeBSD boxes does not cause the same issue. So OpenBSD seems to be a special case here.
>>>
As a data point, I tried to reproduce it here using an OpenBSD 4.5 client
and wasn't able to see a leak. So it might be only certain versions of
OpenBSD or certain network configs or ??? (I used NFSv3 over UDP.)
I doubt it, but if the version of OpenBSD you were using happened to
use NFSv2 by default with UDP vs NFSv3 for TCP, there is a patch that
fixes an mbuf leak that is specific to NFSv2. However, if you were
using NFSv3 over UDP, then this won't be relevant.
I can look at a packet trace, but I kinda doubt that it's going to
shed light on the cause of the mbuf leak.
("tcpdump -s 0 -w <file> host <server>" should get a packet capture
that Wireshark will make sense of.)
rick
10.1.5.200:/data/backup/alge.anart.no on /mnt type nfs (v3, tcp, timeo=100)
10.1.5.200:/data/backup/alge.anart.no on /mnt2 type nfs (v3, udp, timeo=100)
The first one is the mount I'm currently using (v3, tcp), which does not cause the problem.
The second one is the result of
mount 10.1.5.200:/data/backup/alge.anart.no /mnt2
and it does cause the problem - note that it's OpenBSD 4.4, which defaults to v3 udp.
I'll try gathering a trace next week. Any other ideas, let me know.
/Eirik
> I can look at a packet trace, but I kinda doubt that it's going to
> shed light on the cause of the mbuf leak.
> ("tcpdump -s 0 -w <file> host <server>" should get a packet capture
> that Wireshark will make sense of.)
>
> rick