
dev.bce.X.com_no_buffers increasing and packet loss


Ian FREISLICH

Mar 5, 2010, 6:20:57 AM
Hi

I have a system that is experiencing mild to severe packet loss.
The interfaces are configured as follows:

lagg0: bce0, bce1, bce2, bce3 lagproto lacp

lagg0 is then used as the parent (vlandev) for the vlan interfaces.
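
For reference, a setup like this is typically created with something along the following lines (the vlan name and tag are placeholders, not the exact config):

ifconfig lagg0 create
ifconfig lagg0 laggproto lacp laggport bce0 laggport bce1 laggport bce2 laggport bce3 up
ifconfig vlan100 create vlan 100 vlandev lagg0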

I have pf with a few queues for bandwidth management.

There isn't that much traffic on it (200-500Mbit/s).

I see only the following suspect for packet loss:

dev.bce.0.com_no_buffers: 140151466
dev.bce.1.com_no_buffers: 514723247
dev.bce.2.com_no_buffers: 10454050
dev.bce.3.com_no_buffers: 369371

Most of the time these counters are static, but every once in a
while they jump by several thousand, though only on 2 of the
interfaces. The 1 minute average rate on those interfaces is 266/s
and 123/s.
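
A crude way to sample that rate is to difference the counter over an interval, for example (the interface index and the 60 second window are just examples):

a=$(sysctl -n dev.bce.1.com_no_buffers); sleep 60; b=$(sysctl -n dev.bce.1.com_no_buffers); echo $(( (b - a) / 60 ))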

Does anyone think this is related to the packet loss or are these
counters just a red herring? Is there anything that can be done
to reduce this count?

Ian

--
Ian Freislich

Pyun YongHyeon

Mar 5, 2010, 12:56:39 PM
On Fri, Mar 05, 2010 at 01:20:57PM +0200, Ian FREISLICH wrote:
> Hi
>
> I have a system that is experiencing mild to severe packet loss.
> The interfaces are configured as follows:
>
> lagg0: bce0, bce1, bce2, bce3 lagproto lacp
>
> lagg0 then is used as the hwdev for the vlan interfaces.
>
> I have pf with a few queues for bandwidth management.
>
> There isn't that much traffic on it (200-500Mbit/s).
>
> I see only the following suspect for packet loss:
>
> dev.bce.0.com_no_buffers: 140151466
> dev.bce.1.com_no_buffers: 514723247
> dev.bce.2.com_no_buffers: 10454050
> dev.bce.3.com_no_buffers: 369371
>
> Most of the time, these numbers are static, but every once in a
> while they increase massively by several thousand, but only on 2
> interfaces. The 1 minute average rate on those interfaces is 266/s
> and 123/s.
>
> Does anyone think this is related to the packet loss or are these
> counters just a red herring? Is there anything that can be done
> to reduce this count?
>

I think this sysctl node indicates the number of frames dropped by
the completion processor of the NetXtreme II. The counter is
incremented when the processor receives a frame successfully but
cannot pass it to the system because there are no available RX
buffers, so the completion processor drops the received frame.
If you see an mbuf shortage in netstat that would be normal. But if
the system has plenty of free mbuf resources it may indicate another
issue. bce(4) may not be able to replenish the controller with RX
buffers if the system is suffering from high load.

Ian FREISLICH

Mar 5, 2010, 1:16:31 PM

I don't think I've ever seen an mbuf shortage on this host, and
load isn't that high, typically 12% CPU use (88% idle), which is just
2 of the 16 cores busy. There's plenty of free memory (~12G) if I
need to increase the number of buffers available, but I'm not sure
which tunable to use for that. The routing table also isn't large,
at about 4000 prefixes.

[firewall1.jnb1] ~ # netstat -m
4118/7147/11265 mbufs in use (current/cache/total)
3092/6850/9942/131072 mbuf clusters in use (current/cache/total/max)
2060/4212 mbuf+clusters out of packet secondary zone in use (current/cache)
0/678/678/65536 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/32768 9k jumbo clusters in use (current/cache/total/max)
0/0/0/16384 16k jumbo clusters in use (current/cache/total/max)
7214K/18198K/25412K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

I currently set the following in loader.conf:

net.isr.maxthreads="8"
net.isr.direct=0
if_igb_load="yes"
kern.ipc.nmbclusters="131072"
kern.maxusers="1024"

Ian

--
Ian Freislich

Pyun YongHyeon

Mar 5, 2010, 1:40:46 PM

Would you show me the output of dmesg(bce(4)/brgphy(4) only) and
the output of "pciconf -lcbv" for the controller?

Ian FREISLICH

Mar 5, 2010, 3:20:18 PM
Pyun YongHyeon wrote:
>
> Would you show me the output of dmesg(bce(4)/brgphy(4) only) and
> the output of "pciconf -lcbv" for the controller?

[firewall1.jnb1] ~ # egrep "bce|brgphy" /var/run/dmesg.boot
bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem 0xe6000000-0xe7ffffff irq 72 at device 0.0 on pci4
miibus0: <MII bus> on bce0
brgphy0: <BCM5708C 10/100/1000baseTX PHY> PHY 1 on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bce0: Ethernet address: 00:1e:c9:4a:33:b9
bce0: [ITHREAD]
bce0: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); B/C (4.0.3); Flags (MSI|MFW); MFW (ipms 1.6.0)
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem 0xe8000000-0xe9ffffff irq 75 at device 0.0 on pci6
miibus1: <MII bus> on bce1
brgphy1: <BCM5708C 10/100/1000baseTX PHY> PHY 1 on miibus1
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bce1: Ethernet address: 00:1e:c9:4a:33:bb
bce1: [ITHREAD]
bce1: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); B/C (4.0.3); Flags (MSI|MFW); MFW (ipms 1.6.0)
bce2: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem 0xea000000-0xebffffff irq 33 at device 0.0 on pci8
miibus2: <MII bus> on bce2
brgphy2: <BCM5708C 10/100/1000baseTX PHY> PHY 1 on miibus2
brgphy2: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bce2: Ethernet address: 00:1e:4f:fb:cf:c5
bce2: [ITHREAD]
bce2: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); B/C (4.0.3); Flags (MSI|MFW); MFW (ipms 1.6.0)
bce3: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem 0xec000000-0xedffffff irq 37 at device 0.0 on pci10
miibus3: <MII bus> on bce3
brgphy3: <BCM5708C 10/100/1000baseTX PHY> PHY 1 on miibus3
brgphy3: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
bce3: Ethernet address: 00:1e:4f:fb:cf:c7
bce3: [ITHREAD]
bce3: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); B/C (4.0.3); Flags (MSI|MFW); MFW (ipms 1.6.0)

bce0@pci0:4:0:0: class=0x020000 card=0x02231028 chip=0x164c14e4 rev=0x12 hdr=0x00
vendor = 'Broadcom Corporation'
device = 'Broadcom NetXtreme II Gigabit Ethernet Adapter (BCM5708)'
class = network
subclass = ethernet
bar [10] = type Memory, range 64, base 0xe6000000, size 33554432, enabled
cap 07[40] = PCI-X 64-bit supports 133MHz, 512 burst read, 8 split transactions
cap 01[48] = powerspec 2 supports D0 D3 current D0
cap 03[50] = VPD
cap 05[58] = MSI supports 1 message, 64 bit enabled with 1 message
bce1@pci0:6:0:0: class=0x020000 card=0x02231028 chip=0x164c14e4 rev=0x12 hdr=0x00
vendor = 'Broadcom Corporation'
device = 'Broadcom NetXtreme II Gigabit Ethernet Adapter (BCM5708)'
class = network
subclass = ethernet
bar [10] = type Memory, range 64, base 0xe8000000, size 33554432, enabled
cap 07[40] = PCI-X 64-bit supports 133MHz, 512 burst read, 8 split transactions
cap 01[48] = powerspec 2 supports D0 D3 current D0
cap 03[50] = VPD
cap 05[58] = MSI supports 1 message, 64 bit enabled with 1 message
bce2@pci0:8:0:0: class=0x020000 card=0x1f121028 chip=0x164c14e4 rev=0x12 hdr=0x00
vendor = 'Broadcom Corporation'
device = 'Broadcom NetXtreme II Gigabit Ethernet Adapter (BCM5708)'
class = network
subclass = ethernet
bar [10] = type Memory, range 64, base 0xea000000, size 33554432, enabled
cap 07[40] = PCI-X 64-bit supports 133MHz, 512 burst read, 8 split transactions
cap 01[48] = powerspec 2 supports D0 D3 current D0
cap 03[50] = VPD
cap 05[58] = MSI supports 1 message, 64 bit enabled with 1 message
bce3@pci0:10:0:0: class=0x020000 card=0x1f121028 chip=0x164c14e4 rev=0x12 hdr=0x00
vendor = 'Broadcom Corporation'
device = 'Broadcom NetXtreme II Gigabit Ethernet Adapter (BCM5708)'
class = network
subclass = ethernet
bar [10] = type Memory, range 64, base 0xec000000, size 33554432, enabled
cap 07[40] = PCI-X 64-bit supports 133MHz, 512 burst read, 8 split transactions
cap 01[48] = powerspec 2 supports D0 D3 current D0
cap 03[50] = VPD
cap 05[58] = MSI supports 1 message, 64 bit enabled with 1 message


--
Ian Freislich

Pyun YongHyeon

Mar 5, 2010, 4:04:35 PM

Thanks for the info. Frankly, I have no idea how to explain the
issue given that you have no heavy load.
I have a bce(4) patch which fixes a couple of bus_dma(9) issues as
well as some minor bugs. However I don't know whether the
patch can fix the RX issue you're suffering from. Anyway, would you
give the patch at the following URL a try?
http://people.freebsd.org/~yongari/bce/bce.20100305.diff
The patch was generated against CURRENT and you may see a message
like "Disabling COAL_NOW timedout!" during interface up. You can
ignore that message.

Ian FREISLICH

Mar 5, 2010, 4:16:41 PM
Pyun YongHyeon wrote:
> Thanks for the info. Frankly, I have no idea how to explain the
> issue given that you have no heavy load.

How many cores would be involved in handling the traffic and running
PF rules on this machine? There are 4x
CPU: Quad-Core AMD Opteron(tm) Processor 8354 (2194.51-MHz K8-class CPU)
in this server. I'm also using carp extensively.

> I have a bce(4) patch which fixes a couple of bus_dma(9) issues as
> well as fixing some minor bugs. However I don't know whether the
> patch can fix the RX issue you're suffering from. Anyway, would you
> give it try the patch at the following URL?
> http://people.freebsd.org/~yongari/bce/bce.20100305.diff
> The patch was generated against CURRENT and you may see a message
> like "Disabling COAL_NOW timedout!" during interface up. You can
> ignore that message.

Thanks. I'll give the patch a go on Monday when there are people
nearby in case something goes wrong during the boot. I don't want to
lose the redundancy over the weekend.

Otherwise, is there another interface chip we can try? It's got
an igb(4) quad port in there as well, but the performance is worse
on that chip than the bce(4) interface. It's also riddled with
vlan and other hardware offload bugs. I had good success in the
past with em(4), but it looks like igb is the PCI-e version.

Ian

--
Ian Freislich

Pyun YongHyeon

Mar 5, 2010, 4:55:39 PM
On Fri, Mar 05, 2010 at 11:16:41PM +0200, Ian FREISLICH wrote:
> Pyun YongHyeon wrote:
> > Thanks for the info. Frankly, I have no idea how to explain the
> > issue given that you have no heavy load.
>
> How many cores would be involved in handling the traffic and runnig
> PF rules on this machine? There are 4x
> CPU: Quad-Core AMD Opteron(tm) Processor 8354 (2194.51-MHz K8-class CPU)
> In this server. I'm also using carp extensively.
>

pf(4) uses a single lock for processing, so the number of cores
won't give much benefit.

> > I have a bce(4) patch which fixes a couple of bus_dma(9) issues as
> > well as fixing some minor bugs. However I don't know whether the
> > patch can fix the RX issue you're suffering from. Anyway, would you
> > give it try the patch at the following URL?
> > http://people.freebsd.org/~yongari/bce/bce.20100305.diff
> > The patch was generated against CURRENT and you may see a message
> > like "Disabling COAL_NOW timedout!" during interface up. You can
> > ignore that message.
>
> Thanks. I'll give the patch a go on Monday when there are people
> nearby if something goes wrong during the boot. I don't want to
> loose the redundancy over the week end.
>

From my testing on a quad-port BCM5709 controller, it was stable. But
I agree that your plan would be better.

> Otherwise, is there another interface chip we can try? It's got

I guess bce(4) and igb(4) would be among the best controllers.

> an igb(4) quad port in there as well, but the performance is worse
> on that chip than the bce(4) interface. It's also riddled with

Yeah, I also noticed that. bce(4) seems to give better
performance numbers than igb(4).

> vlan and other hardware offload bugs. I had good success in the
> past with em(4), but it looks like igb is the PCI-e version.
>

It may depend on the specific workload. The last time I tried igb(4),
the driver had a couple of bugs, and after patching it igb(4) also
seemed to work well, even though the performance was slightly lower
than I initially expected. One thing I saw was that using LRO on
igb(4) gave slightly worse performance. Another thing in igb(4)'s
case: it began to support multiple TX queues as well as RSS.
Theoretically the current multi-TX-queue implementation can reorder
packets, which can have negative effects.

bce(4) still lacks multi-TX-queue support as well as RSS. bce(4)
controllers also support MSI-X as well as RSS, so I plan to
implement that in the future, but it's hard to tell when I can find
the time to do it.

Ian FREISLICH

Mar 8, 2010, 9:45:20 AM
Pyun YongHyeon wrote:
> On Fri, Mar 05, 2010 at 11:16:41PM +0200, Ian FREISLICH wrote:
> > Pyun YongHyeon wrote:
> > > Thanks for the info. Frankly, I have no idea how to explain the
> > > issue given that you have no heavy load.
> >
> > How many cores would be involved in handling the traffic and runnig
> > PF rules on this machine? There are 4x
> > CPU: Quad-Core AMD Opteron(tm) Processor 8354 (2194.51-MHz K8-class CPU)
> > In this server. I'm also using carp extensively.
> >
>
> pf(4) uses a single lock for processing, number of core would have
> no much benefit.

What's interesting is the effect on CPU utilisation and interrupt
generation that net.inet.ip.fastforwarding has:

net.inet.ip.fastforwarding=1
interrupt rate is around 10000/s per bce interface
cpu 8.0% interrupt

net.inet.ip.fastforwarding=0
interrupt rate is around 5000/s per bce interface
cpu 13.0% interrupt
It also appears not to drop packets, but I'll have to watch it for longer.
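
For anyone wanting to repeat the comparison, toggling the sysctl and watching the interrupt counters and CPU split is enough; these commands are only an example, not necessarily what was used here:

sysctl net.inet.ip.fastforwarding=1
vmstat -i | grep bce      # per-interface interrupt totals and rates
top -SH                   # CPU time spent in the bce interrupt threads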

Ian

--
Ian Freislich

Pyun YongHyeon

Mar 8, 2010, 12:49:49 PM
On Mon, Mar 08, 2010 at 04:45:20PM +0200, Ian FREISLICH wrote:
> Pyun YongHyeon wrote:
> > On Fri, Mar 05, 2010 at 11:16:41PM +0200, Ian FREISLICH wrote:
> > > Pyun YongHyeon wrote:
> > > > Thanks for the info. Frankly, I have no idea how to explain the
> > > > issue given that you have no heavy load.
> > >
> > > How many cores would be involved in handling the traffic and runnig
> > > PF rules on this machine? There are 4x
> > > CPU: Quad-Core AMD Opteron(tm) Processor 8354 (2194.51-MHz K8-class CPU)
> > > In this server. I'm also using carp extensively.
> > >
> >
> > pf(4) uses a single lock for processing, number of core would have
> > no much benefit.
>
> What's interesting is the effect on CPU utilisation and interrupt
> generation that net.inet.ip.fastforwarding has:
>
> net.inet.ip.fastforwarding=1
> interrupt rate is around 10000/s per bce interface
> cpu 8.0% interrupt
>

Yes, this is one of the intentional changes in the patch. Stock bce(4)
seems to generate too many interrupts on BCM5709, so I rewrote the
interrupt handling with David's help. sysctl nodes are also
exported to control interrupt moderation, so you can change them if
you want. The default values were tuned to generate fewer than 10k
interrupts per second while trying to minimize latency.

> net.inet.ip.fastforwarding=0
> interrupt rate is around 5000/s per bce interface
> cpu 13.0% interrupt
> It also appears to not drop packets, but I'll have to watch it for longer.
>

Hmm, actually that's not what I originally expected. :-)
The patch replaced some suspicious memory barrier instructions with
bus_dmamap_sync(9) calls, and you may be seeing the effect of that.

Ian FREISLICH

Mar 9, 2010, 3:26:29 AM
Pyun YongHyeon wrote:
> patch can fix the RX issue you're suffering from. Anyway, would you
> give it try the patch at the following URL?
> http://people.freebsd.org/~yongari/bce/bce.20100305.diff
> The patch was generated against CURRENT and you may see a message
> like "Disabling COAL_NOW timedout!" during interface up. You can
> ignore that message.

It's been running for about 1:23 on the patched driver. I'm still
seeing the com_no_buffers increase:

[firewall2.jnb1] ~ # sysctl dev.bce |grep com_no_buffers
dev.bce.0.com_no_buffers: 5642
dev.bce.1.com_no_buffers: 497
dev.bce.2.com_no_buffers: 6260612
dev.bce.3.com_no_buffers: 4871338

Interrupt rate is down now, at about 3500 per second per interface.

Interestingly, setting net.inet.ip.fastforwarding=0 reduces CPU
consumption from 25% to 9% and results in less packet loss.

Ian

--
Ian Freislich

Ian FREISLICH

Mar 9, 2010, 8:31:55 AM

Can you explain the tunables please - I'm guessing they're these:

dev.bce.$i.tx_quick_cons_trip_int
dev.bce.$i.tx_quick_cons_trip
dev.bce.$i.tx_ticks_int
dev.bce.$i.tx_ticks
dev.bce.$i.rx_quick_cons_trip_int
dev.bce.$i.rx_quick_cons_trip
dev.bce.$i.rx_ticks_int
dev.bce.$i.rx_ticks

Pyun YongHyeon

Mar 9, 2010, 3:49:27 PM

> dev.bce.$i.tx_quick_cons_trip_int

This value controls the number of TX Quick BD Chain entries that
must be completed before a status block is generated during an
interrupt.

> dev.bce.$i.tx_quick_cons_trip

This value controls the number of TX Quick BD Chain entries that
must be completed before a status block is generated. Setting this
to 0 disables TX Quick BD Chain consumption from generating status
blocks.

> dev.bce.$i.tx_ticks_int

This value controls the number of 1us ticks that will be counted
for status block updates generated due to TX activity during
interrupt processing. Setting this value to 0 disables the TX
timer feature during interrupts.

> dev.bce.$i.tx_ticks

This value controls the number of 1us ticks that will be counted
before a status block update is generated due to TX activity.
Setting this value to 0 disables the TX timer feature.

> dev.bce.$i.rx_quick_cons_trip_int

This value controls the number of RX Quick BD entries that must be
completed before a status block is generated during interrupt
processing.

> dev.bce.$i.rx_quick_cons_trip

This value controls the number of RX Quick BD Chain entries that
must be completed before a status block is generated. Setting this
to 0 disables RX Event consumption from generating status blocks.

> dev.bce.$i.rx_ticks_int

This value controls the number of 1us ticks that will be counted
for status block updates generated due to RX activity during
interrupt processing. Setting this value to 0 disables the RX
timer feature during interrupts.

> dev.bce.$i.rx_ticks
>

This value controls the number of 1us ticks that will be counted
before a status block update is generated due to RX activity.
Setting this value to 0 disables the RX timer feature.
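
For example, to make the controller coalesce more RX completions per interrupt, the values can be changed at runtime with sysctl; the numbers below are only illustrative, not recommendations:

sysctl dev.bce.0.rx_quick_cons_trip=64
sysctl dev.bce.0.rx_quick_cons_trip_int=64
sysctl dev.bce.0.rx_ticks=100
sysctl dev.bce.0.rx_ticks_int=100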

Pyun YongHyeon

Mar 9, 2010, 4:21:39 PM
On Tue, Mar 09, 2010 at 10:26:29AM +0200, Ian FREISLICH wrote:
> Pyun YongHyeon wrote:
> > patch can fix the RX issue you're suffering from. Anyway, would you
> > give it try the patch at the following URL?
> > http://people.freebsd.org/~yongari/bce/bce.20100305.diff
> > The patch was generated against CURRENT and you may see a message
> > like "Disabling COAL_NOW timedout!" during interface up. You can
> > ignore that message.
>
> It's been running for about 1:23 on the patched driver. I'm still
> seeing the com_no_buffers increase:
>
> [firewall2.jnb1] ~ # sysctl dev.bce |grep com_no_buffers
> dev.bce.0.com_no_buffers: 5642
> dev.bce.1.com_no_buffers: 497
> dev.bce.2.com_no_buffers: 6260612
> dev.bce.3.com_no_buffers: 4871338
>

I still have no idea why these counters are increasing here.
Actually the counter is read from the completion processor's
scratchpad. The datasheet does not even document the counter.
Maybe David knows better what's happening here (CCed).

> Interupt rate is down now, at about 3500 per second per interface.
>
> Interestingly setting net.inet.ip.fastforwarding=0 reduces CPU
> consumption from 25% to 9% and less packet loss.
>
> Ian
>
> --
> Ian Freislich

Pyun YongHyeon

Mar 9, 2010, 4:40:13 PM
On Tue, Mar 09, 2010 at 01:31:55PM -0800, David Christensen wrote:
> > > > patch can fix the RX issue you're suffering from. Anyway,
> > would you
> > > > give it try the patch at the following URL?
> > > > http://people.freebsd.org/~yongari/bce/bce.20100305.diff
> > > > The patch was generated against CURRENT and you may see a message
> > > > like "Disabling COAL_NOW timedout!" during interface up. You can
> > > > ignore that message.
> > >
> > > It's been running for about 1:23 on the patched driver. I'm still
> > > seeing the com_no_buffers increase:
> > >
> > > [firewall2.jnb1] ~ # sysctl dev.bce |grep com_no_buffers
> > > dev.bce.0.com_no_buffers: 5642
> > > dev.bce.1.com_no_buffers: 497
> > > dev.bce.2.com_no_buffers: 6260612
> > > dev.bce.3.com_no_buffers: 4871338
> > >
> >
> > Still have no idea why these counters are increasing here.
> > Actually the counter is read from scratch pad of completion
> > processor. The datasheet does not even document the counter.
> > Maybe david know better what's happening here(CCed).
> >
> > > Interupt rate is down now, at about 3500 per second per interface.
> > >
> > > Interestingly setting net.inet.ip.fastforwarding=0 reduces CPU
> > > consumption from 25% to 9% and less packet loss.
>
> The com_no_buffers statistic comes from firmware and indicates how
> many times a valid frame was received but could not be placed into
> the receive chain because there were no available RX buffers. The
> firmware will then drop the frame but that dropped frame won't be
> reflected in any of the hardware based statistics.
>

Yeah, but the question is why bce(4) has no available RX buffers.
The system has a lot of available mbufs so I don't see the root
cause here.

> Dave

Ian FREISLICH

Mar 9, 2010, 4:55:30 PM
Pyun YongHyeon wrote:
> On Tue, Mar 09, 2010 at 03:31:55PM +0200, Ian FREISLICH wrote:
> > Can you explain the tunables please - I'm guessing it's these:

I think I asked the wrong question. What is a "Quick BD Chain"?
What relation should this number have to the traffic rate? Is there a
maximum, and what are reasonable numbers to set this to?

I set the RX value as high as 512, in increments of 64, but it made
little difference to the interrupt rate. At times when we experience
the packet loss and com_no_buffers increases, the interrupt rate on
between 1 and 3 of the 4 bce interfaces fell from about 3200/s to 130/s.

We're wondering if the switches we're using could be causing this
problem - they're Dell PowerConnect 5448. I've seen complaints of
random packet loss caused by these switches on the Internet. We
have some new H3C 5100 series switches which we're planning on
swapping for the Dells tomorrow to see if it makes a difference.

Ian

--
Ian Freislich

David Christensen

Mar 9, 2010, 5:04:57 PM
> > > > It's been running for about 1:23 on the patched driver.
> I'm still
> > > > seeing the com_no_buffers increase:
> > > >
> > > > [firewall2.jnb1] ~ # sysctl dev.bce |grep com_no_buffers
> > > > dev.bce.0.com_no_buffers: 5642
> > > > dev.bce.1.com_no_buffers: 497
> > > > dev.bce.2.com_no_buffers: 6260612
> > > > dev.bce.3.com_no_buffers: 4871338
> > > >
> > >
> > > Still have no idea why these counters are increasing here.
> > > Actually the counter is read from scratch pad of completion
> > > processor. The datasheet does not even document the counter.
> > > Maybe david know better what's happening here(CCed).
> > >
> > > > Interupt rate is down now, at about 3500 per second per
> interface.
> > > >
> > > > Interestingly setting net.inet.ip.fastforwarding=0 reduces CPU
> > > > consumption from 25% to 9% and less packet loss.
> >
> > The com_no_buffers statistic comes from firmware and indicates how
> > many times a valid frame was received but could not be
> placed into the
> > receive chain because there were no available RX buffers. The
> > firmware will then drop the frame but that dropped frame won't be
> > reflected in any of the hardware based statistics.
> >
>
> Yeah, but the question is why bce(4) has no available RX buffers.
> The system has a lot of available mbufs so I don't see the
> root cause here.

What's the traffic look like? Jumbo, standard, short frames? Any
good ideas on profiling the code? I haven't figured out how to use
the CPU TSC but there is a free running timer on the device that
might be usable to calculate where the driver's time is spent.

Dave

Pyun YongHyeon

Mar 9, 2010, 5:12:40 PM
On Tue, Mar 09, 2010 at 11:55:30PM +0200, Ian FREISLICH wrote:
> Pyun YongHyeon wrote:
> > On Tue, Mar 09, 2010 at 03:31:55PM +0200, Ian FREISLICH wrote:
> > > Can you explain the tunables please - I'm guessing it's these:
>
> I think I asked the wrong question. What is a "Quick BD Chain"?

I don't know why Broadcom uses 'Quick'. It's just a buffer
descriptor chain for TX/RX.

> What relation should this number have to traffic rate. Is there a
> maximum and what are reasonable numbers for setting this to?
>

The maximum would be the number of configured TX/RX descriptors,
and 1 would be the minimum value, if you want a status block for
every TX/RX completion. Finding the best value may depend on the
specific load.

> I set the RX as high as 512 in 64 quanta but it made little difference
> to the interrupt rate. At times where we experience the packet
> loss and com_no_buffers increases, the interrupt rate on between 1
> and 3 of the 4 bce interfaces fell from about 3200/s to 130/s.
>

The BD chain trip is just one of the parameters. bce(4) controllers
also provide more advanced features for fine-grained control of
interrupt moderation (the TX/RX ticks). It's hard to explain all the
details, so you may want to read the public data sheet for bce(4):
http://www.broadcom.com/collateral/pg/NetXtremeII-PG203-R.pdf
See the host coalescing registers (page 484).

> We're wondering if the switches we're using could be causing this
> problem - they're Dell PowerConnect 5448. I've seen complaints of
> random packet loss caused by these switches on the Internet. We
> have some new H3C 5100 series switches which we're planning on
> swapping for the Dells tomorrow to see if it makes a difference.
>

Not sure, but that would not explain the increasing
dev.bce.X.com_no_buffers counter.

> Ian
>
> --
> Ian Freislich

Ryan Stone

Mar 9, 2010, 5:30:52 PM
> What's the traffic look like?  Jumbo, standard, short frames?  Any
> good ideas on profiling the code?  I haven't figured out how to use
> the CPU TSC but there is a free running timer on the device that
> might be usable to calculate where the driver's time is spent.
>
> Dave

In my experience hwpmc is the best and easiest way to profile anything
on FreeBSD. Here's something I sent to a different thread a couple of
months ago explaining how to use it:

1) If device hwpmc is not compiled into your kernel, kldload hwpmc (you
will need the HWPMC_HOOKS option in either case)
2) Run pmcstat to begin taking samples (make sure that whatever you are
profiling is busy doing work first!):

pmcstat -S unhalted-cycles -O /tmp/samples.out

The -S option specifies what event you want to use to trigger
sampling. The unhalted-cycles is the best event to use if your
hardware supports it; pmc will take a sample every 64K non-idle CPU
cycles, which is basically equivalent to sampling based on time. If
the unhalted-cycles event is not supported by your hardware then the
instructions event will probably be the next best choice (although it's
nowhere near as good, as it will not be able to tell you, for example,
if a particular function is very expensive because it takes a lot of
cache misses compared to the rest of your program). One caveat with
the unhalted-cycles event is that time spent spinning on a spinlock or
adaptively spinning on a MTX_DEF mutex will not be counted by this
event, because most of the spinning time is spent executing an hlt
instruction that idles the CPU for a short period of time.

Modern Intel and AMD CPUs offer a dizzying array of events. They're
mostly only useful if you suspect that a particular kind of event is
hurting your performance and you would like to know what is causing
those events. For example, if you suspect that data cache misses are
causing you problems you can take samples on cache misses.
Unfortunately on some of the newer CPUs (namely the Core2 family,
because that's what I'm doing most of my profiling on nowadays) I find
it difficult to figure out just what event to use to profile based on
cache misses. man pmc will give you an overview of pmc, and there are
manpages for every CPU family supported (e.g. man pmc.core2)

3) After you've run pmcstat for "long enough" (a proper definition of
long enough requires a statistician, which I most certainly am not,
but I find that for a busy system 10 seconds is enough), Control-C it
to stop it*. You can use pmcstat to post-process the samples into
human-readable text:

pmcstat -R /tmp/samples.out -G /tmp/graph.txt

The graph.txt file will show leaf functions on the left and their
callers beneath them, indented to reflect the callchain. It's not too
easy to describe and I don't have sample output available right now.


Another interesting tool for post-processing the samples is
pmcannotate. I've never actually used the tool before but it will
annotate the program's source to show which lines are the most
expensive. This of course needs unstripped modules to work. I think
that it will also work if the GNU "debug link" is in the stripped
module pointing to the location of the file with symbols.


* Here's a tip I picked up from Joseph Koshy's blog: to collect
samples for a fixed period of time (say 1 minute), have pmcstat run the
sleep command:

pmcstat -S unhalted-cycles -O /tmp/samples.out sleep 60
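
Putting the steps above together, a complete run looks something like this (the file names and the 60 second window are just the examples already used):

kldload hwpmc
pmcstat -S unhalted-cycles -O /tmp/samples.out sleep 60
pmcstat -R /tmp/samples.out -G /tmp/graph.txt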

David Christensen

Mar 9, 2010, 6:00:39 PM

> -----Original Message-----
> From: Ryan Stone [mailto:rys...@gmail.com]
> Sent: Tuesday, March 09, 2010 2:31 PM
> To: David Christensen
> Cc: pyu...@gmail.com; Ian FREISLICH; cur...@freebsd.org
> Subject: Re: dev.bce.X.com_no_buffers increasing and packet loss
>
> > What's the traffic look like?  Jumbo, standard, short frames?  Any
> > good ideas on profiling the code?  I haven't figured out how to use
> > the CPU TSC but there is a free running timer on the device
> that might
> > be usable to calculate where the driver's time is spent.
> >
> > Dave
>
> In my experience hwpmc is the best and easiest way to profile
> anything on FreeBSD. Here's something I sent to a different
> thread a couple of months ago explaining how to use it:
>
> 1) If device hwpmc is not compiled into your kernel, kldload
> hwpmc(you will need the HWPMC_HOOKS option in either case)
> 2) Run pmcstat to begin taking samples(make sure that
> whatever you are profiling is busy doing work first!):
>
> pmcstat -S unhalted-cycles -O /tmp/samples.out
>
> The -S option specifies what event you want to use to trigger
> sampling. The unhalted-cycles is the best event to use if

> 3) After you've run pmcstat for "long enough"(a proper

> definition of long enough requires a statistician, which I
> most certainly am not, but I find that for a busy system 10
> seconds is enough), Control-C it to stop it*. You can use
> pmcstat to post-process the samples into human-readable text:
>
> pmcstat -R /tmp/samples.out -G /tmp/graph.txt
>
> The graph.txt file will show leaf functions on the left and
> their callers beneath them, indented to reflect the
> callchain. It's not too easy to describe and I don't have
> sample output available right now.

Below is a quick sample I obtained running netperf. We're
interested in the bce(4) driver so I assume I'm interested
in the time spent in bce and the functions it calls. Looks
to me like memory allocation/freeing is a major source of
CPU cycles in this test. Am I reading this right?


@ CPU_CLK_UNHALTED_CORE [1091924 samples]

49.25% [537739] sched_idletd @ /boot/kernel/kernel
100.0% [537739] fork_exit

20.89% [228070] trash_dtor @ /boot/kernel/kernel
85.45% [194883] mb_dtor_clust
100.0% [194883] uma_zfree_arg
100.0% [194883] mb_free_ext
14.55% [33186] mb_dtor_mbuf
100.0% [33186] uma_zfree_arg
84.27% [27966] mb_free_ext
15.73% [5220] m_freem
00.00% [1] mb_dtor_pack
100.0% [1] uma_zfree_arg
100.0% [1] mb_free_ext

02.34% [25542] bce_intr @ /boot/kernel/if_bce.ko
100.0% [25542] intr_event_execute_handlers @ /boot/kernel/kernel
100.0% [25542] ithread_loop
100.0% [25542] fork_exit

02.20% [24055] trash_ctor @ /boot/kernel/kernel
96.41% [23192] mb_ctor_clust
100.0% [23192] uma_zalloc_arg
100.0% [23192] bce_fill_rx_chain @ /boot/kernel/if_bce.ko
03.39% [815] mb_ctor_mbuf @ /boot/kernel/kernel
100.0% [815] uma_zalloc_arg
99.39% [810] bce_fill_rx_chain @ /boot/kernel/if_bce.ko
00.49% [4] m_copym @ /boot/kernel/kernel
00.12% [1] tcp_output
00.20% [48] uma_zalloc_arg
100.0% [48] bce_fill_rx_chain @ /boot/kernel/if_bce.ko
100.0% [48] bce_intr

Dave

Ryan Stone

Mar 9, 2010, 6:08:13 PM
trash_ctor and trash_dtor? You're running with INVARIANTS.

Fabien Thomas

Mar 9, 2010, 5:55:35 PM
If you are on head/stable_7/stable_8 you can also do a quick test with the top mode, pmcstat -S unhalted-cycles -T (http://wiki.freebsd.org/PmcTools/PmcTop).
For more in-depth post-processing with source code (C+asm) you can output to KCachegrind (http://wiki.freebsd.org/PmcTools/PmcKcachegrind).

Fabien


>> What's the traffic look like? Jumbo, standard, short frames? Any
>> good ideas on profiling the code? I haven't figured out how to use
>> the CPU TSC but there is a free running timer on the device that
>> might be usable to calculate where the driver's time is spent.
>>
>> Dave
>
> In my experience hwpmc is the best and easiest way to profile anything
> on FreeBSD. Here's something I sent to a different thread a couple of
> months ago explaining how to use it:
>
> 1) If device hwpmc is not compiled into your kernel, kldload hwpmc(you
> will need the HWPMC_HOOKS option in either case)
> 2) Run pmcstat to begin taking samples(make sure that whatever you are
> profiling is busy doing work first!):
>
> pmcstat -S unhalted-cycles -O /tmp/samples.out
>
> The -S option specifies what event you want to use to trigger

> sampling. The unhalted-cycles is the best event to use if your
> hardware supports it; pmc will take a sample every 64K non-idle CPU
> cycles, which is basically equivalent to sampling based on time. If
> the unhalted-cycles event is not supported by your hardware then the
> instructions event will probably be the next best choice(although it's
> nowhere near as good, as it will not be able to tell you, for example,
> if a particular function is very expensive because it takes a lot of
> cache misses compared to the rest of your program). One caveat with
> the unhalted-cycles event is that time spent spinning on a spinlock or
> adaptively spinning on a MTX_DEF mutex will not be counted by this
> event, because most of the spinning time is spent executing an hlt
> instruction that idles the CPU for a short period of time.
>
> Modern Intel and AMD CPUs offer a dizzying array of events. They're
> mostly only useful if you suspect that a particular kind of event is
> hurting your performance and you would like to know what is causing
> those events. For example, if you suspect that data cache misses are
> causing you problems you can take samples on cache misses.
> Unfortunately on some of the newer CPUs(namely the Core2 family,
> because that's what I'm doing most of my profiling on nowadays) I find
> it difficult to figure out just what event to use to profile based on
> cache misses. man pmc will give you an overview of pmc, and there are
> manpages for every CPU family supported(eg man pmc.core2)
>

> 3) After you've run pmcstat for "long enough"(a proper definition of
> long enough requires a statistician, which I most certainly am not,
> but I find that for a busy system 10 seconds is enough), Control-C it
> to stop it*. You can use pmcstat to post-process the samples into
> human-readable text:
>
> pmcstat -R /tmp/samples.out -G /tmp/graph.txt
>
> The graph.txt file will show leaf functions on the left and their
> callers beneath them, indented to reflect the callchain. It's not too
> easy to describe and I don't have sample output available right now.
>
>

> Another interesting tool for post-processing the samples is
> pmcannotate. I've never actually used the tool before but it will
> annotate the program's source to show which lines are the most
> expensive. This of course needs unstripped modules to work. I think
> that it will also work if the GNU "debug link" is in the stripped
> module pointing to the location of the file with symbols.
>
>
> * Here's a tip I picked up from Joseph Koshy's blog: to collect
> samples for a fixed period of time(say 1 minute), have pmcstat run the
> sleep command:
>

> pmcstat -S unhalted-cycles -O /tmp/samples.out sleep 60

Ian FREISLICH

Mar 10, 2010, 12:48:11 AM
Pyun YongHyeon wrote:
> On Tue, Mar 09, 2010 at 11:55:30PM +0200, Ian FREISLICH wrote:
> > I set the RX as high as 512 in 64 quanta but it made little difference
> > to the interrupt rate. At times where we experience the packet
> > loss and com_no_buffers increases, the interrupt rate on between 1
> > and 3 of the 4 bce interfaces fell from about 3200/s to 130/s.
> >
>
> BD chain is just one of parameters. bce(4) controllers also provide
> more advanced features that fine control interrupt moderation(TX/RX
> ticks). It's hard to explain all the details so you may want to
> read public data sheet of bce(4).

Thanks. I'll have a read over that.

I meant to state above that whenever the interrupt rate on a
controller (or several) falls off, the interrupt CPU usage climbs
from about 4% to about 20%. So it seems like something is happening
on the host that jams up interrupt processing.

Ian

--
Ian Freislich

Ian FREISLICH

Mar 10, 2010, 4:05:22 AM
"David Christensen" wrote:
> > Yeah, but the question is why bce(4) has no available RX buffers.
> > The system has a lot of available mbufs so I don't see the
> > root cause here.
>
> What's the traffic look like? Jumbo, standard, short frames? Any

> good ideas on profiling the code? I haven't figured out how to use
> the CPU TSC but there is a free running timer on the device that
> might be usable to calculate where the driver's time is spent.

It looks like the traffic that provoked it was this:

10:18:42.319370 IP X.4569 > X.4569: UDP, length 12
10:18:42.319402 IP X.4569 > X.4569: UDP, length 12
10:18:42.319438 IP X.4569 > X.4569: UDP, length 12
10:18:42.319484 IP X.4569 > X.4569: UDP, length 12
10:18:42.319517 IP X.4569 > X.4569: UDP, length 12

A flurry of UDP tinygrams on an IAX2 trunk. The packet rate isn't
spectacular at about 30kpps, which on top of the base load of 60kpps
still isn't a fantastic packet rate. The interesting thing is that
while this storm was in progress, it almost entirely excluded other
traffic on the network.

There have been reports of backplane congestion on the switches we
use when UDP packets smaller than 400 bytes arrive within 40us of
each other. But that still doesn't explain the counter increases
and high interrupt CPU usage, unless the switch was producing garbage
output in response.

David Christensen

Mar 10, 2010, 2:11:13 PM
> > What's the traffic look like? Jumbo, standard, short
> frames? Any
> > good ideas on profiling the code? I haven't figured out how to use
> > the CPU TSC but there is a free running timer on the device
> that might
> > be usable to calculate where the driver's time is spent.
>
> It looks like the traffic that provoked it was this:
>
> 10:18:42.319370 IP X.4569 > X.4569: UDP, length 12
> 10:18:42.319402 IP X.4569 > X.4569: UDP, length 12
> 10:18:42.319438 IP X.4569 > X.4569: UDP, length 12
> 10:18:42.319484 IP X.4569 > X.4569: UDP, length 12
> 10:18:42.319517 IP X.4569 > X.4569: UDP, length 12
>
> A flurry of UDP tinygrams on an IAX2 trunk. The packet rate
> isn't spectacular at about 30kpps which on top of the base
> load of 60kpps still isn't a fantastic packet rate. The
> interesting thing is that while this storm was inprogress, it
> almost entirely excluded other traffic on the network.

OK, small-packet performance is involved; this narrows down
the range of problems. The current design of bce_rx_intr()
attempts to process all RX frames in the receive ring. After
all available frames have been processed, the function
attempts to refill the ring with new buffers. It's
likely that there's a long gap between the time the last
receive buffer is consumed and the time the RX ring is
refilled and the buffers are posted to the hardware, causing
a burst of dropped frames and the com_no_buffers firmware
counter to increment.

Changing the high-level design of bce_rx_intr() and
bce_fill_rx_chain() slightly to post a new buffer as each
frame is passed to the OS would likely avoid these gaps
during bursts of small frames, but I'm not sure whether
it would have a negative impact on the more common case of
streams of MTU-sized frames. I've considered this in the
past but never coded the change and tested the resulting
performance.

Does anyone have some experience with one method over
the other?

Dave

Pyun YongHyeon

Mar 10, 2010, 2:52:06 PM

I successfully reproduced the issue with netperf on BCM5709. You
can use UDP frame size 1 to trigger the issue.
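
For the record, a netperf invocation along these lines is enough to trigger it; the target address is only a placeholder:

netperf -H 192.0.2.10 -t UDP_STREAM -- -m 1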

> Changing the high level design of bce_rx_intr() and
> bce_rx_fill_chain() slightly to post a new buffer as each
> frame is passed to the OS would likely avoid these gaps
> during bursts of small frames but I'm not sure whether
> they'll have a negative impact on the more common case of
> streams of MTU sized frames. I've considered this in the
> past but never coded the change and tested the resulting
> performance.
>

I guess this may slightly increase performance at the cost of
additional bus_dma(9) overhead, but I think one reason for dropping
frames under a heavy UDP load may be a lack of free RX descriptors.
Because bce(4) uses just a single RX ring, the number of
available RX buffers would be 512. However, it seems it's not
possible to increase the number of RX buffers per RX ring, so the
next possible approach would be switching to multiple RX rings
with RSS. Even though FreeBSD does not dynamically adjust load
among CPUs, I guess using RSS would be the way to go.

David Christensen

Mar 10, 2010, 5:45:47 PM

> I successfully reproduced the issue with netperf on BCM5709.
> You can use UDP frame size 1 to trigger the issue.
>
> > Changing the high level design of bce_rx_intr() and
> > bce_rx_fill_chain() slightly to post a new buffer as each frame is
> > passed to the OS would likely avoid these gaps during
> bursts of small
> > frames but I'm not sure whether they'll have a negative
> impact on the
> > more common case of streams of MTU sized frames. I've
> considered this
> > in the past but never coded the change and tested the resulting
> > performance.
> >
>
> I guess this may slightly increase performance with additional
> bus_dma(9) overheads but I think one of reason of dropping
> frames under heavy UDP frames may come from lack of free RX
> descriptors.
> Because bce(4) just uses a single RX ring so the number of
> available RX buffers would be 512. However it seems it's not
> possible to increase the number of RX buffers per RX ring so
> the next possible approach would be switching to use multiple
> RX rings with RSS. Even though FreeBSD does not dynamically
> adjust loads among CPUs I guess using RSS would be the way to go.

The bce(4) hardware supports a linked list of pages for RX
buffer descriptors. The stock build supports 2 pages (RX_PAGES)
with a total of 511 BD's per page. The hardware can support a
maximum of 64K BD's but that would be an unnecessarily large
amount of mbufs for an infrequent problem.

The middle road would probably involve changing RX_PAGES from a
#define to a sysctl variable to allow tuning for specific
environments along with a change in bce_rx_intr() to fill the
ring after all frames have been processed or when more than
256 BDs have been consumed, whichever comes first.

RSS would be great as well though it wouldn't make a dent in
this case since RSS is only supported for TCP, not UDP.

Pyun YongHyeon

Mar 10, 2010, 6:02:20 PM

Thanks for the info. I guess 2048 or 4096 BDs would be necessary to
get satisfactory RX performance. I'll have to experiment with this.

> The middle road would probably involve changing RX_PAGES from a
> #define to a sysctl variable to allow tuning for specific
> environments along with a change in bce_rx_intr() to fill the
> ring after all frames have been processed or when more than
> 256 BDs have been consumed, whichever comes first.
>
> RSS would be great as well though it wouldn't make a dent in
> this case since RSS is only supported for TCP, not UDP.
>

Even though UDP is not supported in RSS, RSS can handle IP. This
wouldn't distribute UDP load coming from a single host, but if the
source IP addresses are different it may help, I guess.

Ian FREISLICH

Mar 11, 2010, 1:46:20 AM
Pyun YongHyeon wrote:
> I successfully reproduced the issue with netperf on BCM5709. You
> can use UDP frame size 1 to trigger the issue.

Now I wish I had paid closer attention ages ago. I actually saw
this when I benchmarked the system post purchase, but didn't
investigate further. I tested and specified the hardware and that
system was forwarding 1MPPS. Then the bean counters got interested
in Dell and they bought something entirely different.

Thanks very, very much for your interest. This is one of the reasons
I love FreeBSD - invariably, the right people are available and
interested.

Ian

--
Ian Freislich

Ian FREISLICH

Mar 11, 2010, 2:06:04 AM
Pyun YongHyeon wrote:
> On Wed, Mar 10, 2010 at 02:45:47PM -0800, David Christensen wrote:
> > The bce(4) hardware supports a linked list of pages for RX
> > buffer descriptors. The stock build supports 2 pages (RX_PAGES)
> > with a total of 511 BD's per page. The hardware can support a
> > maximum of 64K BD's but that would be an unnecessarily large
> > amount of mbufs for an infrequent problem.

I think that depends on how you define infrequent. Our use case
is a largish core router. It's highly likely that we'll see this
again and again in various packet storms on our network.

David Christensen

Mar 12, 2010, 6:58:31 PM
> Pyun YongHyeon wrote:
> > On Wed, Mar 10, 2010 at 02:45:47PM -0800, David Christensen wrote:
> > > The bce(4) hardware supports a linked list of pages for RX buffer
> > > descriptors. The stock build supports 2 pages (RX_PAGES) with a
> > > total of 511 BD's per page. The hardware can support a
> maximum of
> > > 64K BD's but that would be an unnecessarily large amount of mbufs
> > > for an infrequent problem.
>
> I think that depends on how you define infrequent. Our use
> case is a largish core router. It's highly likely that we'll
> see this again and again in various packet storms on our network.
>

Are the packet storms always from the same host or do they come
from multiple hosts? The hardware supports RSS which can spread the
network load across multiple receive queues and multiple CPU cores, but
only when the traffic is spread across several hosts. (The current
bce(4) driver doesn't include support for RSS.) If a storm of
small frames comes from a single host then almost all adapters will be
challenged to handle the flow.

Dave

Ian FREISLICH

Mar 13, 2010, 12:06:13 PM
"David Christensen" wrote:
> > Pyun YongHyeon wrote:
> > > On Wed, Mar 10, 2010 at 02:45:47PM -0800, David Christensen wrote:
> > > > The bce(4) hardware supports a linked list of pages for RX buffer
> > > > descriptors. The stock build supports 2 pages (RX_PAGES) with a
> > > > total of 511 BD's per page. The hardware can support a
> > maximum of
> > > > 64K BD's but that would be an unnecessarily large amount of mbufs
> > > > for an infrequent problem.
> >
> > I think that depends on how you define infrequent. Our use
> > case is a largish core router. It's highly likely that we'll

> > see this again and again in various packet storms on our network.
> >

>
> Are the packet storms always always from the same host or do they come
> from multiple hosts? The hardware supports RSS which can spread the
> network load across multiple receive queues and multiple CPU cores, but
> only when the traffic is spread across several hosts. (The current
> bce(4) driver doesn't include support for RSS.) If a storm of
> small frames comes from a single host then almost all adapters will be
> challenged to handle the flow.

In this case the storm only involved 2 hosts. While it's an
exceptional circumstance it isn't unusual in our environment (core
router in a datacenter). Fortunately we controlled both machines
in this instance. Perhaps if the load is spread across more CPUs,
only those flows unlucky enough to hash to the CPU handling
the storm will be degraded. That is a marginally better situation
than all flows being degraded. From the sounds of it RSS isn't the
cure for this particular situation, but it may improve performance
in general.

It does sound like reworking the buffer handling will solve the problem.
Perhaps have 2 receive rings, so that once one ring is handed off
for processing, the other, already filled and cleared, ring can be
used for receiving frames.

Ian

--
Ian Freislich
