dummynet dropping too many packets

rihad

unread,

Oct 4, 2009, 9:47:23 AM10/4/09

to

Hi, we have around 500-600 mbit/s traffic flowing through a 7.1R Dell
PowerEdge w/ 2 GigE bce cards. There are currently around 4 thousand ISP
users online limited by dummynet pipes of various speeds. According to
netstat -s output around 500-1000 packets are being dropped every second
(this accounts for wasting around 7-12 mbit/s worth of traffic according
to systat -ifstat):

# while :; do netstat -z -s 2>/dev/null | fgrep -w "output packets
dropped"; sleep 1; done
16824 output packets dropped due to no bufs, etc.
548 output packets dropped due to no bufs, etc.
842 output packets dropped due to no bufs, etc.
709 output packets dropped due to no bufs, etc.
652 output packets dropped due to no bufs, etc.
^C

Pipes have been created like this:
ipfw pipe 1024 config bw 1024kbit/s mask dst-ip 0xffffffff queue 350KBytes
etc., and then assigned to users by application (ipfw tablearg).

I've tried playing with the queue setting, from as little as 1 slot to
as much as 4096KBytes - packets are still being dropped, more or less.
Should I somehow calculate the proper queue value for the given pipe
width? The manpage says 50 slots is typical for Ethernet devices (not
mentioning whether it's 10, 100 or 1000 mbit/s), and that's it.

sysctls:
kern.ipc.nmbclusters=50000
net.inet.ip.dummynet.io_fast=1

Polling can't be enabled with bce.

Any hints? Should I provide any further info?

Thanks.
_______________________________________________
freeb...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net...@freebsd.org"

Luigi Rizzo

unread,

Oct 4, 2009, 10:49:09 AM10/4/09

to

On Sun, Oct 04, 2009 at 06:47:23PM +0500, rihad wrote:
> Hi, we have around 500-600 mbit/s traffic flowing through a 7.1R Dell
> PowerEdge w/ 2 GigE bce cards. There are currently around 4 thousand ISP
> users online limited by dummynet pipes of various speeds. According to
> netstat -s output around 500-1000 packets are being dropped every second
> (this accounts for wasting around 7-12 mbit/s worth of traffic according
> to systat -ifstat):

what kind of packets are you seeing as dropped ?
please give the output of 'netstat -s output | grep drop'

At those speeds you might be hitting various limits with your
config (e.g. 50k nmbclusters is probably way too small for
4k users -- means you have an average of 10-15 buffers per user;
the queue size of 350kbytes = 2.6Mbits means 2.6 seconds of buffering,
which is quite high besides the fact that in order to scale to 4k users
you would need over 1GB of kernel memory just for the buffers).

I'd most likely suspect the nmbclusters argument.
netstat -m might give you some useful stats.

cheers
luigi

rihad

unread,

Oct 4, 2009, 10:53:23 AM10/4/09

to

Luigi Rizzo wrote:
> On Sun, Oct 04, 2009 at 06:47:23PM +0500, rihad wrote:
>> Hi, we have around 500-600 mbit/s traffic flowing through a 7.1R Dell
>> PowerEdge w/ 2 GigE bce cards. There are currently around 4 thousand ISP
>> users online limited by dummynet pipes of various speeds. According to
>> netstat -s output around 500-1000 packets are being dropped every second
>> (this accounts for wasting around 7-12 mbit/s worth of traffic according
>> to systat -ifstat):
>
> what kind of packets are you seeing as dropped ?
> please give the output of 'netstat -s output | grep drop'
>

The output packets, like I said:
# netstat -s output | grep drop
2 connections closed (including 2 drops)
0 embryonic connections dropped
2 connections dropped by rexmit timeout
0 connections dropped by persist timeout
0 Connections (fin_wait_2) dropped because of timeout
0 connections dropped by keepalive
0 dropped
2 dropped due to no socket
0 dropped due to full socket buffers
0 fragments dropped (dup or out of space)
2 fragments dropped after timeout
7538 output packets dropped due to no bufs, etc.

The statistics are zeroed every 15 seconds in another window as I'm
investigating the issue, but the rate is around 500-1000 lost packets
every second at the current ~530 mbit/s load.

> At those speeds you might be hitting various limits with your
> config (e.g. 50k nmbclusters is probably way too small for

I bet it isn't:
1967/5009/6976/50000 mbuf clusters in use (current/cache/total/max)

> 4k users -- means you have an average of 10-15 buffers per user;
> the queue size of 350kbytes = 2.6Mbits means 2.6 seconds of buffering,
> which is quite high besides the fact that in order to scale to 4k users
> you would need over 1GB of kernel memory just for the buffers).

Aha. Can you be more specific about the kernel memory stuff? Which
setting needs tweaking?

I have another similar box with 2 em GigE interfaces working @220-230
mbit/s and virtually no out-of-bufs dropped packets as with bce @500-600
mbit. It too has 350KBytes dummynet queue sizes. And it too has adequate
mbuf load:
3071/10427/13498/25600 mbuf clusters in use (current/cache/total/max)

rihad

unread,

Oct 4, 2009, 10:55:57 AM10/4/09

to

Luigi Rizzo wrote:
> On Sun, Oct 04, 2009 at 06:47:23PM +0500, rihad wrote:
>> Hi, we have around 500-600 mbit/s traffic flowing through a 7.1R Dell
>> PowerEdge w/ 2 GigE bce cards. There are currently around 4 thousand ISP
>> users online limited by dummynet pipes of various speeds. According to
>> netstat -s output around 500-1000 packets are being dropped every second
>> (this accounts for wasting around 7-12 mbit/s worth of traffic according
>> to systat -ifstat):
>

> At those speeds you might be hitting various limits with your
> config (e.g. 50k nmbclusters is probably way too small for
> 4k users -- means you have an average of 10-15 buffers per user;
> the queue size of 350kbytes = 2.6Mbits means 2.6 seconds of buffering,
> which is quite high besides the fact that in order to scale to 4k users
> you would need over 1GB of kernel memory just for the buffers).

top output:
Mem: 2037M Active, 1248M Inact, 450M Wired, 184M Cache, 214M Buf, 17M Free

I guess we're quite far from reaching 1GB of kernel memory.

Luigi Rizzo

unread,

Oct 4, 2009, 11:11:17 AM10/4/09

to

On Sun, Oct 04, 2009 at 07:55:57PM +0500, rihad wrote:
> Luigi Rizzo wrote:
> >On Sun, Oct 04, 2009 at 06:47:23PM +0500, rihad wrote:
> >>Hi, we have around 500-600 mbit/s traffic flowing through a 7.1R Dell
> >>PowerEdge w/ 2 GigE bce cards. There are currently around 4 thousand ISP
> >>users online limited by dummynet pipes of various speeds. According to
> >>netstat -s output around 500-1000 packets are being dropped every second
> >>(this accounts for wasting around 7-12 mbit/s worth of traffic according
> >>to systat -ifstat):
> >
> >At those speeds you might be hitting various limits with your
> >config (e.g. 50k nmbclusters is probably way too small for
> >4k users -- means you have an average of 10-15 buffers per user;
> >the queue size of 350kbytes = 2.6Mbits means 2.6 seconds of buffering,
> >which is quite high besides the fact that in order to scale to 4k users
> >you would need over 1GB of kernel memory just for the buffers).
>
>
> top output:
> Mem: 2037M Active, 1248M Inact, 450M Wired, 184M Cache, 214M Buf, 17M Free
>
> I guess we're quite far from reaching 1GB of kernel memory.

of course, you'd have to configure also 500k nmbclusters (and then
probably this would not fit)

Luigi Rizzo

unread,

Oct 4, 2009, 11:15:18 AM10/4/09

to

I think a quick way to tell if the problem is in dummynet/ipfw or elsewhere
would be to reconfigure the pipes (for short times, e.g. 1-2 minutes
while you test things) as

# first, try to remove the shaping to see if the drops
# are still present or not
ipfw pipe XX delete; ipfw pipe XX config // no buffering

# second, do more traffic aggregation to see if the number of
# pipes influences the drops. These are several different
# configs to be tried.
ipfw pipe XX delete; ipfw pipe XX config bw 500Mbits/s
ipfw pipe XX delete; ipfw pipe XX config bw 50Mbits/s mask src-ip 0xffffff00
ipfw pipe XX delete; ipfw pipe XX config bw 5Mbits/s mask src-ip 0xfffffff0

and see if things change. If losses persist even removing dummynet,
then of course it is a device problem.
Also note that dummynet introduces some burstiness in the output,
which might saturate the output queue in the card (no idea what is
used by bce). This particular phenomenon could be reduced by raising
HZ to 2000 or 4000.

cheers
luigi

rihad

unread,

Oct 4, 2009, 11:28:59 AM10/4/09

to

Thanks for the tip. although I took an easier route by simply doing
"ipfw add allow ip from any to any" before the pipe rules, and the buf
drop rate instantly became 0. So the problem is dummynet/ipfw. Should I
go to setting HZ to 2000? Mine is 1000. I somehow don't think the change
would improve things. Maybe there's another way not involving a machine
reboot? This is a production machine ;-(

Eugene Grosbein

unread,

Oct 4, 2009, 10:55:21 PM10/4/09

to

On Sun, Oct 04, 2009 at 06:47:23PM +0500, rihad wrote:

> sysctls:
> kern.ipc.nmbclusters=50000
> net.inet.ip.dummynet.io_fast=1

I guess you should also try to increase pipes length:

net.inet.ip.dummynet.hash_size=65536
net.inet.ip.dummynet.pipe_slot_limit=1000

And reconfigure pipes like this:

ipfw pipe NNN config bw ... queue 1000

And default 'taildrop' policy of dummynet pipes may be guilty,
you'd use GRED to prevent excessive drops.

Eugene Grosbein

Luigi Rizzo

unread,

Oct 5, 2009, 2:10:25 AM10/5/09

to

On Mon, Oct 05, 2009 at 10:55:21AM +0800, Eugene Grosbein wrote:
> On Sun, Oct 04, 2009 at 06:47:23PM +0500, rihad wrote:
>
> > sysctls:
> > kern.ipc.nmbclusters=50000
> > net.inet.ip.dummynet.io_fast=1
>
> I guess you should also try to increase pipes length:
>
> net.inet.ip.dummynet.hash_size=65536
> net.inet.ip.dummynet.pipe_slot_limit=1000

in fact, i forgot to ask, we'd need to know the output of
"ipfw pipe show" just to get the idea if there is any
known reason for the drop (e.g. queues too short, etc.)

cheers
luigi

rihad

unread,

Oct 5, 2009, 4:53:20 AM10/5/09

to

Luigi Rizzo wrote:
> On Mon, Oct 05, 2009 at 10:55:21AM +0800, Eugene Grosbein wrote:
>> On Sun, Oct 04, 2009 at 06:47:23PM +0500, rihad wrote:
>>
>>> sysctls:
>>> kern.ipc.nmbclusters=50000
>>> net.inet.ip.dummynet.io_fast=1
>> I guess you should also try to increase pipes length:
>>
>> net.inet.ip.dummynet.hash_size=65536
>> net.inet.ip.dummynet.pipe_slot_limit=1000
>
> in fact, i forgot to ask, we'd need to know the output of
> "ipfw pipe show" just to get the idea if there is any
> known reason for the drop (e.g. queues too short, etc.)
>

The pipes are fine, each normally having 100-120 concurrent consumers
(i.e. active users). (The max number of consumers allowed per pipe is
hash_size*max_chain_len, normally 64*16==1024.)

kern.ipc.nmbclusters=111111

And these are by default:
#net.inet.ip.dummynet.hash_size=64
#net.inet.ip.dummynet.max_chain_len=16

Yesterday I set up a cronjob logging the number of drops in the past 15
minutes:
*/15 * * * * (echo -n "$(date) "; netstat -z -s 2>/dev/null | fgrep
-w "output packets dropped") >> /tmp/bufs.log

And here's bufs.log:
Sun Oct 4 21:45:00 AZST 2009 418869 output packets dropped due to no
bufs, etc.
Sun Oct 4 22:00:00 AZST 2009 851693 output packets dropped due to no
bufs, etc.
Sun Oct 4 22:15:01 AZST 2009 932885 output packets dropped due to no
bufs, etc.
Sun Oct 4 22:30:00 AZST 2009 890522 output packets dropped due to no
bufs, etc.
Sun Oct 4 22:45:00 AZST 2009 1065931 output packets dropped due to no
bufs, etc.
Sun Oct 4 23:00:00 AZST 2009 937863 output packets dropped due to no
bufs, etc.
Sun Oct 4 23:15:01 AZST 2009 1018822 output packets dropped due to no
bufs, etc.
Sun Oct 4 23:30:00 AZST 2009 981922 output packets dropped due to no
bufs, etc.
Sun Oct 4 23:45:00 AZST 2009 1015124 output packets dropped due to no
bufs, etc.
Mon Oct 5 00:00:01 AZST 2009 1123926 output packets dropped due to no
bufs, etc.
Mon Oct 5 00:15:01 AZST 2009 948161 output packets dropped due to no
bufs, etc.
Mon Oct 5 00:30:00 AZST 2009 937277 output packets dropped due to no
bufs, etc.
Mon Oct 5 00:45:00 AZST 2009 875218 output packets dropped due to no
bufs, etc.
Mon Oct 5 01:00:00 AZST 2009 803527 output packets dropped due to no
bufs, etc.
Mon Oct 5 01:15:00 AZST 2009 728639 output packets dropped due to no
bufs, etc.
Mon Oct 5 01:30:00 AZST 2009 626154 output packets dropped due to no
bufs, etc.
Mon Oct 5 01:45:00 AZST 2009 519441 output packets dropped due to no
bufs, etc.
Mon Oct 5 02:00:00 AZST 2009 371098 output packets dropped due to no
bufs, etc.
Mon Oct 5 02:15:00 AZST 2009 681243 output packets dropped due to no
bufs, etc.
Mon Oct 5 02:30:00 AZST 2009 562909 output packets dropped due to no
bufs, etc.
Mon Oct 5 02:45:00 AZST 2009 426734 output packets dropped due to no
bufs, etc.
Mon Oct 5 03:00:00 AZST 2009 344619 output packets dropped due to no
bufs, etc.
Mon Oct 5 03:15:00 AZST 2009 90006 output packets dropped due to no
bufs, etc.
Mon Oct 5 03:30:00 AZST 2009 17064 output packets dropped due to no
bufs, etc.
Mon Oct 5 03:45:00 AZST 2009 3851 output packets dropped due to no
bufs, etc.
Mon Oct 5 04:00:00 AZST 2009 1323 output packets dropped due to no
bufs, etc.
Mon Oct 5 04:15:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 04:30:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 04:45:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 05:00:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 05:15:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 05:30:01 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 05:45:01 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 06:00:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 06:15:01 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 06:30:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 06:45:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 07:00:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 07:15:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 07:30:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 07:45:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 08:00:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 08:15:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 08:30:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 08:45:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 09:00:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 09:15:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 09:30:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 09:45:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 10:00:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 10:15:00 AZST 2009 0 output packets dropped due to no bufs, etc.
Mon Oct 5 10:30:00 AZST 2009 177 output packets dropped due to no
bufs, etc.
Mon Oct 5 10:45:00 AZST 2009 1701 output packets dropped due to no
bufs, etc.
Mon Oct 5 11:00:01 AZST 2009 19933 output packets dropped due to no
bufs, etc.
Mon Oct 5 11:15:00 AZST 2009 30003 output packets dropped due to no
bufs, etc.
Mon Oct 5 11:30:00 AZST 2009 56712 output packets dropped due to no
bufs, etc.
Mon Oct 5 11:45:00 AZST 2009 78721 output packets dropped due to no
bufs, etc.
Mon Oct 5 12:00:01 AZST 2009 112518 output packets dropped due to no
bufs, etc.
Mon Oct 5 12:15:00 AZST 2009 7229 output packets dropped due to no
bufs, etc.
Mon Oct 5 12:30:01 AZST 2009 24965 output packets dropped due to no
bufs, etc.
Mon Oct 5 12:45:00 AZST 2009 75900 output packets dropped due to no
bufs, etc.
Mon Oct 5 13:00:00 AZST 2009 45002 output packets dropped due to no
bufs, etc.
Mon Oct 5 13:15:00 AZST 2009 67161 output packets dropped due to no
bufs, etc.
Mon Oct 5 13:30:00 AZST 2009 112591 output packets dropped due to no
bufs, etc.

As you can see the drops gradually went away completely at about 4:00
a.m., and started coming up at about 10:30 a.m., although at a lower
rate, probably thanks to me bumping "ipfw ... queue NNN" up to 5000 at
10a.m. this morning. The traffic flow between 4a.m. and 10:30a.m., the
"quiet" times, is about 200-330 mbit/s 5 minute average, without a
single drop. But after that, in come the drops, no matter how high I set
the queue. Should I try 10000 slots? 20000? I'm really sure there are
plenty of heavy downloaders between 4a.m. and 10a.m., but still without
a single drop before approx. 330 mbit/s! Strange, isn't it? This makes
me believe I'm hitting some other global memory limit, rather than
per-user limit, at around 340-350 mbit/s, causing drops. top and netstat
-m are OK right now, though:

Mem: 1870M Active, 1220M Inact, 481M Wired, 201M Cache, 214M Buf, 164M Free
15595/5385/20980/111112 mbuf clusters in use (current/cache/total/max)

Eugene Grosbein

unread,

Oct 5, 2009, 5:01:02 AM10/5/09

to

On Mon, Oct 05, 2009 at 01:53:20PM +0500, rihad wrote:

> As you can see the drops gradually went away completely at about 4:00
> a.m., and started coming up at about 10:30 a.m., although at a lower
> rate, probably thanks to me bumping "ipfw ... queue NNN" up to 5000 at
> 10a.m. this morning. The traffic flow between 4a.m. and 10:30a.m., the
> "quiet" times, is about 200-330 mbit/s 5 minute average, without a
> single drop. But after that, in come the drops, no matter how high I set
> the queue. Should I try 10000 slots? 20000?

First switch from taildrop (default) to GRED, it is designed to fight
your problem.

Eugene Grosbein

rihad

unread,

Oct 5, 2009, 5:28:58 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 01:53:20PM +0500, rihad wrote:
>
>> As you can see the drops gradually went away completely at about 4:00
>> a.m., and started coming up at about 10:30 a.m., although at a lower
>> rate, probably thanks to me bumping "ipfw ... queue NNN" up to 5000 at
>> 10a.m. this morning. The traffic flow between 4a.m. and 10:30a.m., the
>> "quiet" times, is about 200-330 mbit/s 5 minute average, without a
>> single drop. But after that, in come the drops, no matter how high I set
>> the queue. Should I try 10000 slots? 20000?
>
> First switch from taildrop (default) to GRED, it is designed to fight
> your problem.
>

Oh, I almost forgot... Right now I've googled up and am reading this
intro: http://www-rp.lip6.fr/~sf/WebSF/PapersWeb/iscc01.ps

So turning to GRED would turn my FreeBSD router from dumb into a smart
router that knows TCP? I thought pushing bits around at a lower level,
and a sufficient queue size were enough.
Still not sure why increasing queue size as high as I want doesn't
completely eliminate drops.

rihad

unread,

Oct 5, 2009, 5:48:29 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 01:53:20PM +0500, rihad wrote:
>
>> As you can see the drops gradually went away completely at about 4:00
>> a.m., and started coming up at about 10:30 a.m., although at a lower
>> rate, probably thanks to me bumping "ipfw ... queue NNN" up to 5000 at
>> 10a.m. this morning. The traffic flow between 4a.m. and 10:30a.m., the
>> "quiet" times, is about 200-330 mbit/s 5 minute average, without a
>> single drop. But after that, in come the drops, no matter how high I set
>> the queue. Should I try 10000 slots? 20000?
>
> First switch from taildrop (default) to GRED, it is designed to fight
> your problem.
>

red | gred w_q/min_th/max_th/max_p
Make use of the RED (Random Early Detection) queue
management algo-
rithm. w_q and max_p are floating point numbers between 0
and 1 (0
not included), while min_th and max_th are integer numbers
specify-
ing thresholds for queue management (thresholds are computed in
bytes if the queue has been defined in bytes, in slots
otherwise).
The dummynet(4) also supports the gentle RED variant (gred).

Do you or someone else know what w_q and max_p are?

There's just too much info for me to grasp here:
http://www.icir.org/floyd/red.html

Eugene Grosbein

unread,

Oct 5, 2009, 5:56:00 AM10/5/09

to

On Mon, Oct 05, 2009 at 02:28:58PM +0500, rihad wrote:

> Oh, I almost forgot... Right now I've googled up and am reading this
> intro: http://www-rp.lip6.fr/~sf/WebSF/PapersWeb/iscc01.ps
>
> So turning to GRED would turn my FreeBSD router from dumb into a smart
> router that knows TCP? I thought pushing bits around at a lower level,
> and a sufficient queue size were enough.

No, it will still deal with IP packets but more clever.

> Still not sure why increasing queue size as high as I want doesn't
> completely eliminate drops.

The goal is to make sources of traffic to slow down, this is the only
way to descrease drops - any finite queue may be overhelmed with traffic.
Taildrop does not really help with this. GRED does much better.

Eugene Grosbein

Luigi Rizzo

unread,

Oct 5, 2009, 6:04:46 AM10/5/09

to

On Mon, Oct 05, 2009 at 05:56:00PM +0800, Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 02:28:58PM +0500, rihad wrote:
>
> > Oh, I almost forgot... Right now I've googled up and am reading this
> > intro: http://www-rp.lip6.fr/~sf/WebSF/PapersWeb/iscc01.ps
> >
> > So turning to GRED would turn my FreeBSD router from dumb into a smart
> > router that knows TCP? I thought pushing bits around at a lower level,
> > and a sufficient queue size were enough.
>
> No, it will still deal with IP packets but more clever.
>
> > Still not sure why increasing queue size as high as I want doesn't
> > completely eliminate drops.
>
> The goal is to make sources of traffic to slow down, this is the only
> way to descrease drops - any finite queue may be overhelmed with traffic.
> Taildrop does not really help with this. GRED does much better.

i think the first problem here is figure out _why_ we have
the drops, as the original poster said that queues are configured
with a very large amount of buffer (and i think there is a
misconfiguration somewhere because the mbuf stats do not make
sense)

cheers
luigi

Eugene Grosbein

unread,

Oct 5, 2009, 5:59:16 AM10/5/09

to

On Mon, Oct 05, 2009 at 02:48:29PM +0500, rihad wrote:

> >First switch from taildrop (default) to GRED, it is designed to fight
> >your problem.
>
> red | gred w_q/min_th/max_th/max_p
> Make use of the RED (Random Early Detection) queue
> management algo-
> rithm. w_q and max_p are floating point numbers between 0
> and 1 (0
> not included), while min_th and max_th are integer numbers
> specify-
> ing thresholds for queue management (thresholds are computed in
> bytes if the queue has been defined in bytes, in slots
> otherwise).
> The dummynet(4) also supports the gentle RED variant (gred).
>
> Do you or someone else know what w_q and max_p are?
>
> There's just too much info for me to grasp here:
> http://www.icir.org/floyd/red.html

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=126518+0+archive/2009/freebsd-net/20091004.freebsd-net

Eugene Grosbein

unread,

Oct 5, 2009, 6:05:32 AM10/5/09

to

On Mon, Oct 05, 2009 at 12:04:46PM +0200, Luigi Rizzo wrote:

> > The goal is to make sources of traffic to slow down, this is the only
> > way to descrease drops - any finite queue may be overhelmed with traffic.
> > Taildrop does not really help with this. GRED does much better.
>
> i think the first problem here is figure out _why_ we have
> the drops, as the original poster said that queues are configured
> with a very large amount of buffer (and i think there is a
> misconfiguration somewhere because the mbuf stats do not make
> sense)

That may be very simple, f.e. wide uplink channel and policy that
dictates slower client speeds. Any taildrop queue would drop lots
of packets.

Eugene Grosbein

rihad

unread,

Oct 5, 2009, 6:20:58 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 12:04:46PM +0200, Luigi Rizzo wrote:
>
>>> The goal is to make sources of traffic to slow down, this is the only
>>> way to descrease drops - any finite queue may be overhelmed with traffic.
>>> Taildrop does not really help with this. GRED does much better.
>> i think the first problem here is figure out _why_ we have
>> the drops, as the original poster said that queues are configured
>> with a very large amount of buffer (and i think there is a
>> misconfiguration somewhere because the mbuf stats do not make
>> sense)
>
> That may be very simple, f.e. wide uplink channel and policy that
> dictates slower client speeds. Any taildrop queue would drop lots
> of packets.
>

If uplink is e.g. 100 mbit/s, but data is fed to client by dummynet at 1
mbit/s, doesn't the _client's_ TCP software know to slow things down to
not overwhelm 1 mbit/s? Where has TCP slow-start gone? My router box
isn't some application proxy that starts downloading at full 100 mbit/s
thus quickly filling client's 1 mbit/s link. It's just a router.

Although it doesn't yet make sense to me, I'll try going to GRED soon.

rihad

unread,

Oct 5, 2009, 6:52:39 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 02:28:58PM +0500, rihad wrote:
>
>> Still not sure why increasing queue size as high as I want doesn't
>> completely eliminate drops.
>
> The goal is to make sources of traffic to slow down, this is the only
> way to descrease drops - any finite queue may be overhelmed with traffic.
> Taildrop does not really help with this. GRED does much better.
>

Alright, so I changed to gred by adding to each config command:
ipfw ... gred 0.002/900/1000/0.1 queue 1000
and reconfigured. Still around 300-400 drops per second, which was
typical at this load level before with taildrop anyway. There are around
3-5 mbit/s being wasted according to systat -ifstat.

Should I now increase slots to 5-10-20k?
Very strange.

"ipfw pipe show" correctly shows that gred is at work. For example:
00512: 512.000 Kbit/s 0 ms 1000 sl. 79 queues (64 buckets)
GRED w_q 0.001999 min_th 900 max_th 1000 max_p 0.099991
mask: 0x00 0x00000000/0x0000 -> 0xffffffff/0x0000
...

Luigi Rizzo

unread,

Oct 5, 2009, 7:07:26 AM10/5/09

to

On Mon, Oct 05, 2009 at 03:52:39PM +0500, rihad wrote:
> Eugene Grosbein wrote:
> >On Mon, Oct 05, 2009 at 02:28:58PM +0500, rihad wrote:
> >
> >>Still not sure why increasing queue size as high as I want doesn't
> >>completely eliminate drops.
> >
> >The goal is to make sources of traffic to slow down, this is the only
> >way to descrease drops - any finite queue may be overhelmed with traffic.
> >Taildrop does not really help with this. GRED does much better.
> >
>
> Alright, so I changed to gred by adding to each config command:
> ipfw ... gred 0.002/900/1000/0.1 queue 1000
> and reconfigured. Still around 300-400 drops per second, which was
> typical at this load level before with taildrop anyway. There are around
> 3-5 mbit/s being wasted according to systat -ifstat.
>
> Should I now increase slots to 5-10-20k?
> Very strange.
>
> "ipfw pipe show" correctly shows that gred is at work. For example:
> 00512: 512.000 Kbit/s 0 ms 1000 sl. 79 queues (64 buckets)
> GRED w_q 0.001999 min_th 900 max_th 1000 max_p 0.099991
> mask: 0x00 0x00000000/0x0000 -> 0xffffffff/0x0000
> ...

you keep omitting the important info i.e. whether individual
pipes have drops, significant queue lenghts and so on.

i am giving up!

rihad

unread,

Oct 5, 2009, 7:29:02 AM10/5/09

to

Luigi Rizzo wrote:
> On Mon, Oct 05, 2009 at 03:52:39PM +0500, rihad wrote:
>> Eugene Grosbein wrote:
>>> On Mon, Oct 05, 2009 at 02:28:58PM +0500, rihad wrote:
>>>
>>>> Still not sure why increasing queue size as high as I want doesn't
>>>> completely eliminate drops.
>>> The goal is to make sources of traffic to slow down, this is the only
>>> way to descrease drops - any finite queue may be overhelmed with traffic.
>>> Taildrop does not really help with this. GRED does much better.
>>>
>> Alright, so I changed to gred by adding to each config command:
>> ipfw ... gred 0.002/900/1000/0.1 queue 1000
>> and reconfigured. Still around 300-400 drops per second, which was
>> typical at this load level before with taildrop anyway. There are around
>> 3-5 mbit/s being wasted according to systat -ifstat.
>>
>> Should I now increase slots to 5-10-20k?
>> Very strange.
>>
>> "ipfw pipe show" correctly shows that gred is at work. For example:
>> 00512: 512.000 Kbit/s 0 ms 1000 sl. 79 queues (64 buckets)
>> GRED w_q 0.001999 min_th 900 max_th 1000 max_p 0.099991
>> mask: 0x00 0x00000000/0x0000 -> 0xffffffff/0x0000
>> ...
>
> you keep omitting the important info i.e. whether individual
> pipes have drops, significant queue lenghts and so on.
>

Sorry. Almost everyone has 0 in the last Drp column, but some have above
zero. I'm not just sure how this can be helpful to anyone.

05120: 5.120 Mbit/s 0 ms 5000 sl. 66 queues (64 buckets)
GRED w_q 0.001999 min_th 4500 max_th 5000 max_p 0.099991

mask: 0x00 0x00000000/0x0000 -> 0xffffffff/0x0000

BKT Prot ___Source IP/port____ ____Dest. IP/port____ Tot_pkt/bytes
Pkt/Byte Drp
0 ip 0.0.0.0/0 <client_ip> 1 131 0 0 0
1 ip 0.0.0.0/0 <client_ip> 39 53360 0 0 0
2 ip 0.0.0.0/0 <client_ip> 382206 418022848 0
0 0
3 ip 0.0.0.0/0 <client_ip> 34 2008 0 0 0
4 ip 0.0.0.0/0 <client_ip> 4868510 6277077787 15
20452 9
5 ip 0.0.0.0/0 <client_ip> 14 16675 0 0 0
5 ip 0.0.0.0/0 <client_ip> 3 4158 0 0 0
6 ip 0.0.0.0/0 <client_ip> 38 43576 0 0 0
7 ip 0.0.0.0/0 <client_ip> 1265954 1475400663 0
0 0
8 ip 0.0.0.0/0 <client_ip> 1081461 1247681879 0
0 749
9 ip 0.0.0.0/0 <client_ip> 6186589 8737048919 0
0 19243
10 ip 0.0.0.0/0 <client_ip> 21607 5636447 0 0 5
11 ip 0.0.0.0/0 <client_ip> 437 94576 0 0 0
12 ip 0.0.0.0/0 <client_ip> 22915 18634779 0 0 0
13 ip 0.0.0.0/0 <client_ip> 557988 688051579 0
0 0
14 ip 0.0.0.0/0 <client_ip> 50339 65685647 0 0 0
15 ip 0.0.0.0/0 <client_ip> 554835 546223485 0
0 140
16 ip 0.0.0.0/0 <client_ip> 32 13104 0 0 0
17 ip 0.0.0.0/0 <client_ip> 2034099 2719966792 0
0 0
18 ip 0.0.0.0/0 <client_ip> 282 36551 0 0 0
19 ip 0.0.0.0/0 <client_ip> 8351766 8947643162 0
0 0
20 ip 0.0.0.0/0 <client_ip> 4 624 0 0 0
21 ip 0.0.0.0/0 <client_ip> 22391 29922375 0 0 0
22 ip 0.0.0.0/0 <client_ip> 9 424 0 0 0
23 ip 0.0.0.0/0 <client_ip> 750322 935365326 0
0 0
24 ip 0.0.0.0/0 <client_ip> 1 40 0 0 0
25 ip 0.0.0.0/0 <client_ip> 3617690 3501375619 0
0 602
26 ip 0.0.0.0/0 <client_ip> 12116 12039435 0 0 0
27 ip 0.0.0.0/0 <client_ip> 524311 653399507 0
0 8
28 ip 0.0.0.0/0 <client_ip> 3 417 0 0 0
29 ip 0.0.0.0/0 <client_ip> 16 2034 0 0 0
30 ip 0.0.0.0/0 <client_ip> 64 82661 3 4432 0
31 ip 0.0.0.0/0 <client_ip> 946389 1175221367 0
0 66
32 ip 0.0.0.0/0 <client_ip> 1 168 0 0 0
32 ip 0.0.0.0/0 <client_ip> 28 41776 0 0 0
33 ip 0.0.0.0/0 <client_ip> 6 6433 0 0 0
34 ip 0.0.0.0/0 <client_ip> 1 536 0 0 0
35 ip 0.0.0.0/0 <client_ip> 2021 2641048 0 0 0
36 ip 0.0.0.0/0 <client_ip> 350 264039 0 0 0
37 ip 0.0.0.0/0 <client_ip> 167578 137763107 0
0 0
38 ip 0.0.0.0/0 <client_ip> 250404 128905757 0
0 0
39 ip 0.0.0.0/0 <client_ip> 385139 287006012 0
0 0
40 ip 0.0.0.0/0 <client_ip> 49 68696 0 0 0
41 ip 0.0.0.0/0 <client_ip> 23 1813 0 0 0
42 ip 0.0.0.0/0 <client_ip> 129 135256 0 0 0
43 ip 0.0.0.0/0 <client_ip> 3232 2191027 0 0 0
44 ip 0.0.0.0/0 <client_ip> 27935157 24307287646 0
0 18802
45 ip 0.0.0.0/0 <client_ip> 2166 212635 0 0 0
46 ip 0.0.0.0/0 <client_ip> 1127307 1392467620 0
0 3
47 ip 0.0.0.0/0 <client_ip> 1216900 1258200836 0
0 0
48 ip 0.0.0.0/0 <client_ip> 2 2984 1 1492 0
49 ip 0.0.0.0/0 <client_ip> 1 112 0 0 0
50 ip 0.0.0.0/0 <client_ip> 1409 326389 0 0 0
51 ip 0.0.0.0/0 <client_ip> 46674 47291021 10 14920 0
52 ip 0.0.0.0/0 <client_ip> 86667 66834983 0 0 0
53 ip 0.0.0.0/0 <client_ip> 434998 302827189 0
0 0
54 ip 0.0.0.0/0 <client_ip> 542 277669 0 0 0
55 ip 0.0.0.0/0 <client_ip> 1088072 919495021 0
0 0
56 ip 0.0.0.0/0 <client_ip> 64 81240 0 0 0
57 ip 0.0.0.0/0 <client_ip> 41028 59193278 0 0 0
58 ip 0.0.0.0/0 <client_ip> 1 210 0 0 0
59 ip 0.0.0.0/0 <client_ip> 4 310 0 0 0
60 ip 0.0.0.0/0 <client_ip> 2 2984 0 0 0
61 ip 0.0.0.0/0 <client_ip> 42874 36616688 0 0 0
62 ip 0.0.0.0/0 <client_ip> 4 498 0 0 0
63 ip 0.0.0.0/0 <client_ip> 530137 717027403 0
0 0

Eugene Grosbein

unread,

Oct 5, 2009, 7:30:37 AM10/5/09

to

On Mon, Oct 05, 2009 at 03:20:58PM +0500, rihad wrote:

> >>>Taildrop does not really help with this. GRED does much better.
> >>i think the first problem here is figure out _why_ we have
> >>the drops, as the original poster said that queues are configured
> >>with a very large amount of buffer (and i think there is a
> >>misconfiguration somewhere because the mbuf stats do not make
> >>sense)
> >
> >That may be very simple, f.e. wide uplink channel and policy that
> >dictates slower client speeds. Any taildrop queue would drop lots
> >of packets.
> >
> If uplink is e.g. 100 mbit/s, but data is fed to client by dummynet at 1
> mbit/s, doesn't the _client's_ TCP software know to slow things down to
> not overwhelm 1 mbit/s?

That's not client's TCP software feeding your router with traffic
but server side.

> Where has TCP slow-start gone? My router box
> isn't some application proxy that starts downloading at full 100 mbit/s
> thus quickly filling client's 1 mbit/s link. It's just a router.

While there is no or little competition for bandwidth from the router
to clients, TCP would work just fine. I suspect your shaping policy
makes heavy competition between clients. In this case, TCP behaves
not-so-well without help of router's good shaping algorythms
and taildrop is not good one.

> Although it doesn't yet make sense to me, I'll try going to GRED soon.

"Works for me" :-)

Eugene Grosbein

unread,

Oct 5, 2009, 7:46:36 AM10/5/09

to

On Sun, Oct 04, 2009 at 08:28:59PM +0500, rihad wrote:

> Thanks for the tip. although I took an easier route by simply doing
> "ipfw add allow ip from any to any" before the pipe rules, and the buf
> drop rate instantly became 0. So the problem is dummynet/ipfw.

You should also estimate volume of non-TCP traffic
that is generally has not flow control capabilities of TCP.

Or, try something like this:

ipfw add 100 skipto 200 tcp from any to any # direct only TCP to dummynet
ipfw add 150 allow ip from any to any # pass non-TCP
ipfw add 200 ... # here dummynet rules go

And take a look at drop counters.

Eugene Grosbein

rihad

unread,

Oct 5, 2009, 7:47:59 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 03:20:58PM +0500, rihad wrote:
>
>>>>> Taildrop does not really help with this. GRED does much better.
>>>> i think the first problem here is figure out _why_ we have
>>>> the drops, as the original poster said that queues are configured
>>>> with a very large amount of buffer (and i think there is a
>>>> misconfiguration somewhere because the mbuf stats do not make
>>>> sense)
>>> That may be very simple, f.e. wide uplink channel and policy that
>>> dictates slower client speeds. Any taildrop queue would drop lots
>>> of packets.
>>>
>> If uplink is e.g. 100 mbit/s, but data is fed to client by dummynet at 1
>> mbit/s, doesn't the _client's_ TCP software know to slow things down to
>> not overwhelm 1 mbit/s?
>
> That's not client's TCP software feeding your router with traffic
> but server side.
>

Yup, I meant client-side decreasing congestion window, delaying ACKs
etc. etc., typical for TCP congestion control. Or...?

GRED didn't solve the problem :(

rihad

unread,

Oct 5, 2009, 7:50:10 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 03:20:58PM +0500, rihad wrote:

>> Where has TCP slow-start gone? My router box
>> isn't some application proxy that starts downloading at full 100 mbit/s
>> thus quickly filling client's 1 mbit/s link. It's just a router.
>
> While there is no or little competition for bandwidth from the router
> to clients, TCP would work just fine. I suspect your shaping policy
> makes heavy competition between clients. In this case, TCP behaves
> not-so-well without help of router's good shaping algorythms
> and taildrop is not good one.
>

Nothing fancy (i.e. no competition). Only tons of per-user pipes
simulating the given throughput.

Luigi Rizzo

unread,

Oct 5, 2009, 8:04:18 AM10/5/09

to

On Mon, Oct 05, 2009 at 04:29:02PM +0500, rihad wrote:
> Luigi Rizzo wrote:

...
> >you keep omitting the important info i.e. whether individual
> >pipes have drops, significant queue lenghts and so on.
> >
> Sorry. Almost everyone has 0 in the last Drp column, but some have above
> zero. I'm not just sure how this can be helpful to anyone.

because you were complaining about 'dummynet causing drops and
waste of bandwidth'.
Now, drops could be due to either
1) some saturation in the dummynet machine (memory shortage, cpu
shortage, etc.) which cause unwanted drops;

2) intentional drops introduced by dummynet because a flow exceeds
its queue size. These drops are those shown in the 'Drop'
column in 'ipfw pipe show' (they are cumulative, so you
should do an 'ipfw pipe delete; ipfw pipe 5120 config ...'
whenever you want to re-run the stats, or compute the
differences between subsequent reads, to figure out what
happens.

If all drops you are seeing are of type 2, then there is nothing
you can do to remove them: you set a bandwidth limit, the
client is sending faster than it should, perhaps with UDP
so even RED/GRED won't help you, and you see the drops
once the queue starts to fill up.
Examples below: the entries in bucket 4 and 44

If you are seeing drops that are not listed in 'pipe show'
then yun need to investigate where the packets are lost,
again it could be on the output queue of the interface
(due to the burstiness introduced by dummynet), or shortage
of mbufs (but this did not seem to be the case from your
previous stats) or something else.

It's all up to you to run measurements, possibly
without omitting potentially significant data
(e.g. sysctl -a net.inet.ip)
or making assumptions (e.g. you have configured
5000 slots per queue, but with only 50k mbufs in total
there is no chance to guarantee 5000 slots to each
queue -- all you will achieve is give a lot of slots
to the greedy nodes, and very little to the other ones)

cheers
luigi

Eugene Grosbein

unread,

Oct 5, 2009, 8:00:57 AM10/5/09

to

On Mon, Oct 05, 2009 at 04:50:10PM +0500, rihad wrote:

> >>Where has TCP slow-start gone? My router box
> >>isn't some application proxy that starts downloading at full 100 mbit/s
> >>thus quickly filling client's 1 mbit/s link. It's just a router.
> >
> >While there is no or little competition for bandwidth from the router
> >to clients, TCP would work just fine. I suspect your shaping policy
> >makes heavy competition between clients. In this case, TCP behaves
> >not-so-well without help of router's good shaping algorythms
> >and taildrop is not good one.
> >
>
> Nothing fancy (i.e. no competition). Only tons of per-user pipes
> simulating the given throughput.

You've mentioned previously: "The pipes are fine, each normally having

100-120 concurrent consumers (i.e. active users)."

This IS competition between TCP flows inside each pipe.

Eugene Grosbein

rihad

unread,

Oct 5, 2009, 8:12:11 AM10/5/09

to

Luigi Rizzo wrote:
> On Mon, Oct 05, 2009 at 04:29:02PM +0500, rihad wrote:
>> Luigi Rizzo wrote:
> ...
>>> you keep omitting the important info i.e. whether individual
>>> pipes have drops, significant queue lenghts and so on.
>>>
>> Sorry. Almost everyone has 0 in the last Drp column, but some have above
>> zero. I'm not just sure how this can be helpful to anyone.
>
> because you were complaining about 'dummynet causing drops and
> waste of bandwidth'.
> Now, drops could be due to either
> 1) some saturation in the dummynet machine (memory shortage, cpu
> shortage, etc.) which cause unwanted drops;
>

I too think the box is hitting some other global limit and dropping
packets. If not, then how come that between 4a.m. and 10a.m. when the
traffic load is at 250-330 mbit/s there isn't a single drop?

> 2) intentional drops introduced by dummynet because a flow exceeds
> its queue size. These drops are those shown in the 'Drop'
> column in 'ipfw pipe show' (they are cumulative, so you
> should do an 'ipfw pipe delete; ipfw pipe 5120 config ...'
> whenever you want to re-run the stats, or compute the
> differences between subsequent reads, to figure out what
> happens.
>
> If all drops you are seeing are of type 2, then there is nothing
> you can do to remove them: you set a bandwidth limit, the
> client is sending faster than it should, perhaps with UDP
> so even RED/GRED won't help you, and you see the drops
> once the queue starts to fill up.
> Examples below: the entries in bucket 4 and 44
>

Then I guess I'm left with increasing slots and see how it goes.
Currently it's set to 10000 for each pipe. Thanks for yours and Eugene's
efforts, I appreciate it.

> If you are seeing drops that are not listed in 'pipe show'
> then yun need to investigate where the packets are lost,
> again it could be on the output queue of the interface
> (due to the burstiness introduced by dummynet), or shortage
> of mbufs (but this did not seem to be the case from your
> previous stats) or something else.
>

This indeed is not a problem, proved by the fact that, like I said,
short-circuiting "ipfw allow ip from any to any" before dummynet pipe
rules instantly eliminates all drops, and bce0 and bce1 load evens out
(bce0 used for input, and bce1 for output).

> It's all up to you to run measurements, possibly
> without omitting potentially significant data
> (e.g. sysctl -a net.inet.ip)
> or making assumptions (e.g. you have configured
> 5000 slots per queue, but with only 50k mbufs in total
> there is no chance to guarantee 5000 slots to each
> queue -- all you will achieve is give a lot of slots
> to the greedy nodes, and very little to the other ones)
>

Well, I've been monitoring this stuff. It has never reached above 20000
mbufs (111111 is the current limit).

rihad

unread,

Oct 5, 2009, 8:18:29 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 04:50:10PM +0500, rihad wrote:
>
>>>> Where has TCP slow-start gone? My router box
>>>> isn't some application proxy that starts downloading at full 100 mbit/s
>>>> thus quickly filling client's 1 mbit/s link. It's just a router.
>>> While there is no or little competition for bandwidth from the router
>>> to clients, TCP would work just fine. I suspect your shaping policy
>>> makes heavy competition between clients. In this case, TCP behaves
>>> not-so-well without help of router's good shaping algorythms
>>> and taildrop is not good one.
>>>
>> Nothing fancy (i.e. no competition). Only tons of per-user pipes
>> simulating the given throughput.
>
> You've mentioned previously: "The pipes are fine, each normally having
> 100-120 concurrent consumers (i.e. active users)."
> This IS competition between TCP flows inside each pipe.
>

Well, each user gets instantiated with a new copy of the pipe. Each such
user counts towards the limit imposed by hash_size*max_chain_len for
that pipe only. It would have been competition had I used dst-ip dst-ip
0xffffff00 or similar and not dst-ip 0xffffffff, _then_ all 256 users
(determined by the mask) would compete for the pipe's bandwidth. So the
only competition is in the uplink at our main Cisco, I guess.

Luigi Rizzo

unread,

Oct 5, 2009, 8:32:30 AM10/5/09

to

On Mon, Oct 05, 2009 at 05:12:11PM +0500, rihad wrote:
> Luigi Rizzo wrote:
> >On Mon, Oct 05, 2009 at 04:29:02PM +0500, rihad wrote:
> >>Luigi Rizzo wrote:
> >...
> >>>you keep omitting the important info i.e. whether individual
> >>>pipes have drops, significant queue lenghts and so on.
> >>>
> >>Sorry. Almost everyone has 0 in the last Drp column, but some have above
> >>zero. I'm not just sure how this can be helpful to anyone.
> >
> >because you were complaining about 'dummynet causing drops and
> >waste of bandwidth'.
> >Now, drops could be due to either
> >1) some saturation in the dummynet machine (memory shortage, cpu
> > shortage, etc.) which cause unwanted drops;
> >
> I too think the box is hitting some other global limit and dropping
> packets. If not, then how come that between 4a.m. and 10a.m. when the
> traffic load is at 250-330 mbit/s there isn't a single drop?

there may be different reasons, e.g. the big offenders were
idle when you saw no drops. You still do not have enough
information on which packets are dropped and where,
so you cannot prove your assumptions.

Also, below:
1. increasing the queue size won't help at all. Those
who overflow a queue of 1000 slots will also overflow
a queue of 10k slots.

2. your test with 'ipfw allow ip from any to any' does not
prove that the interface queue is not saturating, because
you also remove the burstiness that dummynet introduces,
and so the queue is driven differently.

good luck
luigi

rihad

unread,

Oct 5, 2009, 9:08:47 AM10/5/09

to

There's one thing I noticed:
net.inet.ip.dummynet.io_pkt_drop doesn't grow! But still around 400
packets dropped per second.
net.inet.ip.dummynet.tick_lost is always zero
net.inet.ip.dummynet.tick_diff: grows at about 50 per second.
net.inet.ip.dummynet.tick_adjustment: grows at about 5 per second.

How do I investigate and fix this burstiness issue?

$ netstat -i
Name Mtu Network Address Ipkts Ierrs Opkts
Oerrs Coll
bce0 1500 <Link#1> 00:1d:09:xx:xx:xx 24777049059 0 75426020
0 0
bce0 1500 xx.xx.xx.xx/xx my.hostname 159293969 - 75282225
- -
bce1 1500 <Link#2> 00:1d:09:xx:xx:xx 724725 0 24514919344
0 0
bce1 1500 192.168.94.0 local.hostname 656243 - 83024869
- -

Eugene Grosbein

unread,

Oct 5, 2009, 9:56:32 AM10/5/09

to

On Mon, Oct 05, 2009 at 05:18:29PM +0500, rihad wrote:

> >You've mentioned previously: "The pipes are fine, each normally having
> >100-120 concurrent consumers (i.e. active users)."
> >This IS competition between TCP flows inside each pipe.
> >
> Well, each user gets instantiated with a new copy of the pipe. Each such
> user counts towards the limit imposed by hash_size*max_chain_len for
> that pipe only. It would have been competition had I used dst-ip dst-ip
> 0xffffff00 or similar and not dst-ip 0xffffffff, _then_ all 256 users
> (determined by the mask) would compete for the pipe's bandwidth. So the
> only competition is in the uplink at our main Cisco, I guess.

Hmm, yes, you are rigth. I've missed 'mask'.

Try to disable net.inet.ip.dummynet.io_fast to see if there is a bug
in 'fast' dummynet mode.

Eugene Grosbein

rihad

unread,

Oct 5, 2009, 10:03:51 AM10/5/09

to

Eugene Grosbein wrote:
> Try to disable net.inet.ip.dummynet.io_fast to see if there is a bug
> in 'fast' dummynet mode.
>

Already tried that, no difference. Interestingly, now I checked sysctls
and found out that net.inet.ip.dummynet.io_pkt_fast counter is still
growing despite io_fast being disabled.

Eugene Grosbein

unread,

Oct 5, 2009, 10:04:09 AM10/5/09

to

On Mon, Oct 05, 2009 at 06:08:47PM +0500, rihad wrote:

> How do I investigate and fix this burstiness issue?

Please also show:

sysctl net.isr
sysctl net.inet.ip.intr_queue_maxlen

Eugene Grosbein

rihad

unread,

Oct 5, 2009, 10:30:15 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 06:08:47PM +0500, rihad wrote:
>
>> How do I investigate and fix this burstiness issue?
>
> Please also show:
>
> sysctl net.isr
> sysctl net.inet.ip.intr_queue_maxlen
>

net.isr.swi_count: 65461359
net.isr.drop: 0
net.isr.queued: 32843752
net.isr.deferred: 0
net.isr.directed: -723075002
net.isr.count: -723074001
net.isr.direct: 1

net.inet.ip.intr_queue_maxlen: 50

Eugene Grosbein

unread,

Oct 5, 2009, 10:50:37 AM10/5/09

to

On Mon, Oct 05, 2009 at 07:30:15PM +0500, rihad wrote:

> >>How do I investigate and fix this burstiness issue?
> >
> >Please also show:
> >
> >sysctl net.isr
> >sysctl net.inet.ip.intr_queue_maxlen
>
> net.isr.swi_count: 65461359
> net.isr.drop: 0
> net.isr.queued: 32843752
> net.isr.deferred: 0
> net.isr.directed: -723075002
> net.isr.count: -723074001
> net.isr.direct: 1
>
> net.inet.ip.intr_queue_maxlen: 50

What is CPU load in when the load is maximum?

Eugene

rihad

unread,

Oct 5, 2009, 11:07:18 AM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 07:30:15PM +0500, rihad wrote:
>
>>>> How do I investigate and fix this burstiness issue?
>>> Please also show:
>>>
>>> sysctl net.isr
>>> sysctl net.inet.ip.intr_queue_maxlen
>> net.isr.swi_count: 65461359
>> net.isr.drop: 0
>> net.isr.queued: 32843752
>> net.isr.deferred: 0
>> net.isr.directed: -723075002
>> net.isr.count: -723074001
>> net.isr.direct: 1
>>
>> net.inet.ip.intr_queue_maxlen: 50
>
> What is CPU load in when the load is maximum?
>

It has 2 quad-cores, so I'm not sure. Here's the output of top -S:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 1 171 ki31 0K 16K CPU7 7 120.2H 98.39%
idle: cpu7
12 root 1 171 ki31 0K 16K CPU6 6 120.3H 95.75%
idle: cpu6
16 root 1 171 ki31 0K 16K CPU2 2 115.5H 93.99%
idle: cpu2
14 root 1 171 ki31 0K 16K CPU4 4 114.6H 91.89%
idle: cpu4
18 root 1 171 ki31 0K 16K RUN 0 97.7H 88.96%
idle: cpu0
13 root 1 171 ki31 0K 16K CPU5 5 117.5H 85.35%
idle: cpu5
15 root 1 171 ki31 0K 16K CPU3 3 110.2H 79.79%
idle: cpu3
29 root 1 -68 - 0K 16K WAIT 1 59.1H 54.05%
irq256: bce0
17 root 1 171 ki31 0K 16K CPU1 1 63.5H 42.29%
idle: cpu1
467 root 1 -68 - 0K 16K - 3 21.4H 11.57%
dummynet
19 root 1 -32 - 0K 16K WAIT 4 268:29 3.66%
swi4: clock sio
31 root 1 -68 - 0K 16K WAIT 2 310:19 2.98%
irq257: bce1
30 root 1 -64 - 0K 16K WAIT 6 40:02 0.29%
irq16: mfi0
3 root 1 -8 - 0K 16K - 5 55:42 0.00% g_up
4 root 1 -8 - 0K 16K - 1 51:22 0.00% g_down
22 root 1 44 - 0K 16K - 3 13:30 0.00% yarrow
21 root 1 -44 - 0K 16K WAIT 0 9:09 0.00%
swi1: net
51 root 1 20 - 0K 16K syncer 2 7:25 0.00% syncer
599 root 1 44 0 5660K 1132K select 0 1:40 0.00% syslogd
52 root 1 -16 - 0K 16K sdflus 5 0:34 0.00%
softdepflush
2 root 1 -8 - 0K 16K - 2 0:31 0.00% g_event

Eugene Grosbein

unread,

Oct 5, 2009, 11:42:36 AM10/5/09

to

On Mon, Oct 05, 2009 at 08:07:18PM +0500, rihad wrote:

> >What is CPU load in when the load is maximum?
> >
> It has 2 quad-cores, so I'm not sure. Here's the output of top -S:

There is a rumour about FreeBSD's shedulers...
That they are not so good for 8 cores and that you may get MORE speed
by disabling 4 cores if it's possible for your system.
Or even using uniprocessor kernel.

Only rumour, though :-)

Eugene Grosbein

rihad

unread,

Oct 5, 2009, 12:09:49 PM10/5/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 08:07:18PM +0500, rihad wrote:
>
>>> What is CPU load in when the load is maximum?
>>>
>> It has 2 quad-cores, so I'm not sure. Here's the output of top -S:
>
> There is a rumour about FreeBSD's shedulers...
> That they are not so good for 8 cores and that you may get MORE speed
> by disabling 4 cores if it's possible for your system.
> Or even using uniprocessor kernel.
>
> Only rumour, though :-)
>

I'd really like to try this as the last option ;-)

It's 21:07 where I live, and we're again wasting 9-10 mbit/s w/ 4k users
online.

systat -ifstat:

bce1 in 0.000 Mb/s 0.003 Mb/s 49.004 MB
out 470.102 Mb/s 470.102 Mb/s 22.637 TB

bce0 in 479.754 Mb/s 479.754 Mb/s 22.858 TB
out 0.148 Mb/s 0.207 Mb/s 6.950 GB

Julian Elischer

unread,

Oct 5, 2009, 1:03:16 PM10/5/09

to

unread,

Oct 5, 2009, 2:54:50 PM10/5/09

to

rihad wrote:
> Julian Elischer wrote:
>> rihad wrote:
>>> Julian Elischer wrote:
>>>> Luigi Rizzo wrote:
>>>>
>>>> Is it possible to know what sessions are losing packets?
>>>>
>>> Yes, of course, by running ipfw pipe show ;-)
>>> There's one confusing thing, though: net.inet.ip.dummynet.io_pkt_drop
>>> isn't increasing while around 800-1000 packets per second are being
>>> dropped right now. And so "ipfw pipe show" Drp column wouldn't grow
>>> either. So it's either not dummynet dropping packets, or a bug (?).
>>
>>
>> I suspect your interface queues.
>>
>
> You mean in hardware? Any way to tweak those?

no, there is (usually) a software queue before the hardware queue.

look at /sys/net/if_ethersubr.c

Eugene Grosbein

unread,

Oct 6, 2009, 12:21:53 AM10/6/09

to

On Mon, Oct 05, 2009 at 07:30:15PM +0500, rihad wrote:

> >>How do I investigate and fix this burstiness issue?
> >
> >Please also show:
> >
> >sysctl net.isr
> >sysctl net.inet.ip.intr_queue_maxlen
>
> net.isr.swi_count: 65461359
> net.isr.drop: 0
> net.isr.queued: 32843752
> net.isr.deferred: 0
> net.isr.directed: -723075002
> net.isr.count: -723074001
> net.isr.direct: 1
>
> net.inet.ip.intr_queue_maxlen: 50

Try to increase net.inet.ip.intr_queue_maxlen uptio 4096.

Eugene

Eugene Grosbein

unread,

Oct 6, 2009, 12:23:35 AM10/6/09

to

On Mon, Oct 05, 2009 at 10:49:45AM -0700, Julian Elischer wrote:

> >There is a rumour about FreeBSD's shedulers...
> >That they are not so good for 8 cores and that you may get MORE speed
> >by disabling 4 cores if it's possible for your system.
> >Or even using uniprocessor kernel.
> >Only rumour, though :-)
> true for 6.x, less true for 7.x, and not true for 8.x

That's great. For both of ULE and BSD?

Eugene

rihad

unread,

Oct 6, 2009, 12:38:38 AM10/6/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 07:30:15PM +0500, rihad wrote:
>
>>>> How do I investigate and fix this burstiness issue?
>>> Please also show:
>>>
>>> sysctl net.isr
>>> sysctl net.inet.ip.intr_queue_maxlen
>> net.isr.swi_count: 65461359
>> net.isr.drop: 0
>> net.isr.queued: 32843752
>> net.isr.deferred: 0
>> net.isr.directed: -723075002
>> net.isr.count: -723074001
>> net.isr.direct: 1
>>
>> net.inet.ip.intr_queue_maxlen: 50
>
> Try to increase net.inet.ip.intr_queue_maxlen uptio 4096.
>

You sure? Packets are never dropped once I add "allow ip from any to
any" before pipes, effectively turning dummynet off. Yet I've doubled it
for starters (50->100) let's see if it works in an hour or so, when it's
10:30 here.

rihad

unread,

Oct 6, 2009, 1:21:17 AM10/6/09

to

> Eugene Grosbein wrote:

>> Try to increase net.inet.ip.intr_queue_maxlen uptio 4096.
>>
> You sure? Packets are never dropped once I add "allow ip from any to
> any" before pipes, effectively turning dummynet off. Yet I've doubled it
> for starters (50->100) let's see if it works in an hour or so, when it's
> 10:30 here.
>

Besides,
net.inet.ip.intr_queue_maxlen: Maximum size of the IP input queue
but the drops have been occurring on the _output_.
I'll see in about 10 minutes if it's true or not.

I've also set up kern.hz=2000 in loader.conf the effects of which will
be seen in an hour or so, when I'm permitted to reboot the box.

rihad

unread,

Oct 6, 2009, 2:38:52 AM10/6/09

to

Eugene Grosbein wrote:
> On Mon, Oct 05, 2009 at 07:30:15PM +0500, rihad wrote:
>
>>>> How do I investigate and fix this burstiness issue?
>>> Please also show:
>>>
>>> sysctl net.isr
>>> sysctl net.inet.ip.intr_queue_maxlen
>> net.isr.swi_count: 65461359
>> net.isr.drop: 0
>> net.isr.queued: 32843752
>> net.isr.deferred: 0
>> net.isr.directed: -723075002
>> net.isr.count: -723074001
>> net.isr.direct: 1
>>
>> net.inet.ip.intr_queue_maxlen: 50
>
> Try to increase net.inet.ip.intr_queue_maxlen uptio 4096.
>

Apparently increasing the input queue had no effect on the output drops.
Besides, net.inet.ip.intr_queue_drops has only reached 70 for 6 days of
uptime.

I'll soon be rebooting to change HZ 1000 -> 2000, and we'll see how that
goes. There must be some global limit, not nmbclusters, not per-user
output queues, not the interface input queue, but something else that
triggers at around 360-370 mbit/s...

rihad

unread,

Oct 6, 2009, 4:26:18 AM10/6/09

to

Julian Elischer wrote:
> rihad wrote:
>> Luigi Rizzo wrote:
>>> 2. your test with 'ipfw allow ip from any to any' does not
>>> prove that the interface queue is not saturating, because
>>> you also remove the burstiness that dummynet introduces,
>>> and so the queue is driven differently.
>>>
>>

>> How do I investigate and fix this burstiness issue?
>
> higher Hz rate?
>

Rebooted with HZ=2000 10 minutes ago. Due to application design the ipfw
table (pipe tablearg) was flushed, so there are now 350 (and
increasing at a rate 1 per 1-2 seconds as I type this) or so users in
the table, and not 4k as normally would be. The box is servicing 450+
mbit/s without a single drop. I want to monitor how things change once
the number of users in ipfw tables gradually increases up to several
thousands.

rihad

unread,

Oct 6, 2009, 5:21:38 AM10/6/09

to

rihad wrote:
> Julian Elischer wrote:
>> rihad wrote:
>>> Luigi Rizzo wrote:
>>>> 2. your test with 'ipfw allow ip from any to any' does not
>>>> prove that the interface queue is not saturating, because
>>>> you also remove the burstiness that dummynet introduces,
>>>> and so the queue is driven differently.
>>>>
>>>
>>> How do I investigate and fix this burstiness issue?
>>
>> higher Hz rate?
>>
>
> Rebooted with HZ=2000 10 minutes ago. Due to application design the ipfw
> table (pipe tablearg) was flushed, so there are now 350 (and increasing
> at a rate 1 per 1-2 seconds as I type this) or so users in the table,
> and not 4k as normally would be. The box is servicing 450+ mbit/s
> without a single drop. I want to monitor how things change once the
> number of users in ipfw tables gradually increases up to several thousands.
>

It starts dropping packets at around 2000 online users (ipfw table
load). I've set up a shell script to monitor this:

# while :; do ipfw table 0 list | wc -l; netstat -s 2>/dev/null |fgrep
-w 'output packets dropped'; sleep 10; done

... # all zeroes above this
1999
0 output packets dropped due to no bufs, etc.
2001
0 output packets dropped due to no bufs, etc.
2008
0 output packets dropped due to no bufs, etc.
2017
0 output packets dropped due to no bufs, etc.
2027
156 output packets dropped due to no bufs, etc.
2037
156 output packets dropped due to no bufs, etc.
2045
156 output packets dropped due to no bufs, etc.
2372
202 output packets dropped due to no bufs, etc.
2377
207 output packets dropped due to no bufs, etc.
2391
338 output packets dropped due to no bufs, etc.
2402
394 output packets dropped due to no bufs, etc.
2415
531 output packets dropped due to no bufs, etc.
2421
725 output packets dropped due to no bufs, etc.

Is there some limit on the number of IP addresses in an ipfw table?

unread,

Oct 6, 2009, 6:17:47 AM10/6/09

to

io_pkt_drop only reports packets dropped to errors (missing pipes,
randomly forced packet drops which you don't use, no buffers and so on).

The packets intentionally dropped in dummynet because queues are full
are listed by 'ipfw pipe show'.

Even if pipes expire, there is a difference between having partial
information and completely ignoring what is available and claiming
"it's plain useless".

BTW at least while you try to debug the problem you can temporarily
disable the pipe expire with 'sysctl net.inet.ip.dummynet.expire=0'
and also you could poll the stats more frequently (say every 1-2-5 sec)
to get a better idea of what happens.

The one time you sent the 'pipe show' info there were clearly
a few pipes with thousand packet drops -- as i said those are
unavoidable and correspond to clients that systematically
exceed their share (500k/1m as you set) e.g. because they are flooding
the net with TCP SYN or UDP requests. This may be due to viruses,
aggressive p2p, and so on. A single client can easily generate
the extra 2000 packets per seconds that you are seeing.

It's up to you to open your eyes looking for evidence, or
close them and randomly blame one or another piece of the system.

cheers
luigi

rihad

unread,

Oct 6, 2009, 9:14:58 AM10/6/09

to

Eugene Grosbein wrote:
> On Tue, Oct 06, 2009 at 02:21:38PM +0500, rihad wrote:
>
>> Is there some limit on the number of IP addresses in an ipfw table?
>
> No, generally handles much more. Please show your ipfw rule(s)
> containing 'tablearg'.
>

01031 x x allow ip from any to any
01040 x x skipto 1100 ip from table(127) to any out
recv bce0 xmit bce1
01060 x x pipe tablearg ip from any to table(0) out
recv bce0 xmit bce1
01070 x x allow ip from any to table(0) out recv bce0
xmit bce1
01100 x x pipe tablearg ip from any to table(2) out
65535 x x allow ip from any to any

table(127) contains country-wide ISPs' netblocks (under 100 entries).
table(0) and table(2) contain same user IP addresses, but different pipe
IDs - normally around 3-4k entries each.

Now please pay special attention to rule 1031. I've added it to bypass
dummynet and stop packets from being dropped for now. Normally the rule
isn't there.

As I found out today after rebooting, drops only start occurring when
the number of entries in table(0) exceeds 2000 or so (please see my
previous email). Maybe it's a coincidence - I don't know. Global traffic
load doesn't matter - it was approximately the same before and after the
drops (around 450 mbit/s).

rihad

unread,

Oct 6, 2009, 9:33:20 AM10/6/09

to

Luigi Rizzo wrote:
> On Tue, Oct 06, 2009 at 02:34:32PM +0500, rihad wrote:
>> Luigi Rizzo wrote:

>> 8664 output packets dropped due to no bufs, etc.
>> net.inet.ip.dummynet.io_pkt_drop: 111
>
> io_pkt_drop only reports packets dropped to errors (missing pipes,
> randomly forced packet drops which you don't use, no buffers and so on).
>
> The packets intentionally dropped in dummynet because queues are full
> are listed by 'ipfw pipe show'.
>
> Even if pipes expire, there is a difference between having partial
> information and completely ignoring what is available and claiming
> "it's plain useless".

Ok, without looking I can say: there _are_ always some users overrunning
their queues that have non-zero Drp column in ipfw pipe show. But they
are also there downloading like crazy when the number of online users is
below 2000 or so, so how come that they're overrunning their share
without getting reflected in the global netstat -s drop counter? There
are 0 drops! netstat -s only starts growing when the number of online
users exceeds 2000.

>
> BTW at least while you try to debug the problem you can temporarily
> disable the pipe expire with 'sysctl net.inet.ip.dummynet.expire=0'
> and also you could poll the stats more frequently (say every 1-2-5 sec)
> to get a better idea of what happens.
>

I've tried expire=0 for a while, too. No difference whatsoever.

> The one time you sent the 'pipe show' info there were clearly
> a few pipes with thousand packet drops -- as i said those are
> unavoidable and correspond to clients that systematically
> exceed their share (500k/1m as you set) e.g. because they are flooding
> the net with TCP SYN or UDP requests. This may be due to viruses,
> aggressive p2p, and so on. A single client can easily generate
> the extra 2000 packets per seconds that you are seeing.
>

Only downloads (i.e. traffic arriving from the Net) crosses the PC at
hand. What makes me think this is a global problem, is that the drops
_never_ happen when the number of online users in IPFW tables is below
2000 or so, no matter how high global traffic load is. Please see my
previous email with ipfw rules.

Eugene Grosbein

unread,

Oct 6, 2009, 10:21:52 AM10/6/09

to

On Tue, Oct 06, 2009 at 06:14:58PM +0500, rihad wrote:

> >No, generally handles much more. Please show your ipfw rule(s)
> >containing 'tablearg'.
>
> 01031 x x allow ip from any to any
> 01040 x x skipto 1100 ip from table(127) to any out
> recv bce0 xmit bce1
> 01060 x x pipe tablearg ip from any to table(0) out
> recv bce0 xmit bce1
> 01070 x x allow ip from any to table(0) out recv bce0
> xmit bce1
> 01100 x x pipe tablearg ip from any to table(2) out
> 65535 x x allow ip from any to any
>
> table(127) contains country-wide ISPs' netblocks (under 100 entries).
> table(0) and table(2) contain same user IP addresses, but different pipe
> IDs - normally around 3-4k entries each.
>
> Now please pay special attention to rule 1031. I've added it to bypass
> dummynet and stop packets from being dropped for now. Normally the rule
> isn't there.
>
> As I found out today after rebooting, drops only start occurring when
> the number of entries in table(0) exceeds 2000 or so (please see my
> previous email). Maybe it's a coincidence - I don't know. Global traffic
> load doesn't matter - it was approximately the same before and after the
> drops (around 450 mbit/s).

It's possible that pipe lookup by its number is inefficient
and firewall keeps its lock for too much time while searching the pipe,
just a guess. And packets start to drop, eh?

Try setting net.isr.direct to 0 and make large net.inet.ip.intr_queue_maxlen.
This way, one of your cores may run bce's thread, enqueue incoming
packets and return to work immediately. The rest of processing may be
performed by another kernel thread, hopefully using another core.
Just to see if this changes anything. top -S should help here too.

Eugene

rihad

unread,

Oct 6, 2009, 11:28:35 AM10/6/09

to

Since this is a remote production box, I'm really scared of toggling
such on/off flags I've never used before, particularly under heavy
traffic loads, they're way too eager to lock up the whole system. I
might try this tomorrow morning, though, first on another less critical
box. p.s.: I've just tested toggling the flag on my virtual machine, it
went fine.

I don't think net.inet.ip.intr_queue_maxlen is relevant to this problem,
as net.inet.ip.intr_queue_drops is normally zero or very close to it at
all times.

Eugene Grosbein

unread,

Oct 6, 2009, 12:12:40 PM10/6/09

to

On Tue, Oct 06, 2009 at 08:28:35PM +0500, rihad wrote:

> I don't think net.inet.ip.intr_queue_maxlen is relevant to this problem,
> as net.inet.ip.intr_queue_drops is normally zero or very close to it at
> all times.

When net.isr.direct is 1, this queue is used very seldom.
Would you change it to 0, it will be used extensively.

Eugene

rihad

unread,

Oct 6, 2009, 12:25:04 PM10/6/09

to

Eugene Grosbein wrote:
> On Tue, Oct 06, 2009 at 08:28:35PM +0500, rihad wrote:
>
>> I don't think net.inet.ip.intr_queue_maxlen is relevant to this problem,
>> as net.inet.ip.intr_queue_drops is normally zero or very close to it at
>> all times.
>
> When net.isr.direct is 1, this queue is used very seldom.
> Would you change it to 0, it will be used extensively.
>

Ah, ok, thanks. I'll try that tomorrow.

But still...

> Try setting net.isr.direct to 0 and make large net.inet.ip.intr_queue_maxlen.
> This way, one of your cores may run bce's thread, enqueue incoming
> packets and return to work immediately. The rest of processing may be
> performed by another kernel thread, hopefully using another core.
> Just to see if this changes anything. top -S should help here too.

It isn't the incoming bce0 losing packets, but rather the outgoing bce1,
which is almost idle interrupt-wise:

29 root 1 -68 - 0K 16K CPU1 1 223:39 55.86%
irq256: bce0
31 root 1 -68 - 0K 16K WAIT 2 19:27 4.10%
irq257: bce1

Robert Watson

unread,

Oct 6, 2009, 1:06:37 PM10/6/09

to

On Wed, 7 Oct 2009, Eugene Grosbein wrote:

> On Tue, Oct 06, 2009 at 08:28:35PM +0500, rihad wrote:
>
>> I don't think net.inet.ip.intr_queue_maxlen is relevant to this problem, as
>> net.inet.ip.intr_queue_drops is normally zero or very close to it at all
>> times.
>
> When net.isr.direct is 1, this queue is used very seldom. Would you change
> it to 0, it will be used extensively.

Just to clarify this more specifically:

With net.isr.direct set to 0, the netisr will always be used when processing
inbound IP packets.

With net.isr.direct set to 1, the netisr will only be used for special cases,
such as loopback traffic, IPSEC decapsulation, and other processing types
where there's a risk of recursive processing.

In the default 8.0 configuration, we use one netisr thread; however, you can
specify to use multiple threads at boot time. This is not the default
currently because we're still researching load distribution schemes, and on
current high-performance systems the hardware tends to take care of that
already pretty well (i.e., most modern 10gbps cards).

Also, ipfw/dummynet have fairly non-granular locking, so adding parallelism
won't necessarily help currently.

Robert

rihad

unread,

Oct 6, 2009, 1:21:18 PM10/6/09

to

Robert Watson wrote:
> and on current high-performance systems the hardware tends to
> take care of that already pretty well (i.e., most modern 10gbps cards).
>

Do you think that us switching to 10gbps cards would solve the problem
discussed? We're currently at 500-550 mbps and rising, so we might as
well switch to it now without waiting to hit the ceiling.

Oleg Bulyzhin

unread,

Oct 6, 2009, 1:30:39 PM10/6/09

to

On Tue, Oct 06, 2009 at 12:17:47PM +0200, Luigi Rizzo wrote:

> io_pkt_drop only reports packets dropped to errors (missing pipes,
> randomly forced packet drops which you don't use, no buffers and so on).

You are mistaken here. io_pkt_drop is total number of packets dropped by
dummynet_io().

--
Oleg.

================================================================
=== Oleg Bulyzhin -- OBUL-RIPN -- OBUL-RIPE -- ol...@rinet.ru ===
================================================================

Luigi Rizzo

unread,

Oct 6, 2009, 5:05:13 PM10/6/09

to

On Tue, Oct 06, 2009 at 09:30:39PM +0400, Oleg Bulyzhin wrote:
> On Tue, Oct 06, 2009 at 12:17:47PM +0200, Luigi Rizzo wrote:
>
> > io_pkt_drop only reports packets dropped to errors (missing pipes,
> > randomly forced packet drops which you don't use, no buffers and so on).
>
> You are mistaken here. io_pkt_drop is total number of packets dropped by
> dummynet_io().

hmm... you are right.

unread,

Oct 7, 2009, 5:01:19 AM10/7/09

to

On Wed, 7 Oct 2009, rihad wrote:

> rihad wrote:
>> I've yet to test how this direct=0 improves extensive dummynet drops.
>
> Ooops... After a couple of minutes, suddenly:
>
> net.inet.ip.intr_queue_drops: 1284
>
> Bumped it up a bit.

Yes, I was going to suggest that moving to deferred dispatch has probably
simply moved the drops to a new spot, the queue between the ithreads and the
netisr thread. In your setup, how many network interfaces are in use, and
what drivers?

If what's happening is that you're maxing out a CPU then moving to multiple
netisrs might help if your card supports generating flow IDs, but most
lower-end cards don't. I have patches to generate those flow IDs in software
rather than hardware, but there are some downsides to doing so, not least that
it takes cache line misses on the packet that generally make up a lot of the
cost of processing the packet.

My experience with most reasonable cards is that letting them doing the work
distribution with RSS and use multiple ithreads is a more performant strategy
than using software work distribution on current systems, though.

Someone has probably asked for this already, but -- could you send a snapshot
of the top -SH output in the steady state? Let top run for a few minutes and
then copy/paste the first 10-20 lines into an e-mail.

Robert N M Watson
Computer Laboratory
University of Cambridge

rihad

unread,

Oct 7, 2009, 5:22:52 AM10/7/09

to

Robert Watson wrote:
> On Wed, 7 Oct 2009, rihad wrote:
>
>> rihad wrote:
>>> I've yet to test how this direct=0 improves extensive dummynet drops.
>>
>> Ooops... After a couple of minutes, suddenly:
>>
>> net.inet.ip.intr_queue_drops: 1284
>>
>> Bumped it up a bit.
>
> Yes, I was going to suggest that moving to deferred dispatch has
> probably simply moved the drops to a new spot, the queue between the
> ithreads and the netisr thread. In your setup, how many network
> interfaces are in use, and what drivers?
>

bce -- Broadcom NetXtreme II (BCM5706/BCM5708) PCI/PCIe Gigabit Ethernet
adapter driver
device bce compiled into a 7.1-RELEASE-p8 kernel.
2 network cards: bce0 used for ~400-500 mbit/s input, bce1 for output,
i.e. acting as a smart router. It has 2 quad core CPUs.

Now the probability of drops (as monitored by netstat -s's "output
packets dropped due to no bufs, etc.") is definitely a function of
traffic load and the number of items in a ipfw table. I've just
decreased the size of the two tables from ~2600 to ~1800 each and the
drops instantly went away, even though the traffic passing through the
box didn't decrease, it even increased a bit due to now shaping fewer
clients (luckily "ipfw pipe tablearg" passes packets failing a table
lookup untouched).

> If what's happening is that you're maxing out a CPU then moving to
> multiple netisrs might help if your card supports generating flow IDs,
> but most lower-end cards don't. I have patches to generate those flow
> IDs in software rather than hardware, but there are some downsides to
> doing so, not least that it takes cache line misses on the packet that
> generally make up a lot of the cost of processing the packet.
>
> My experience with most reasonable cards is that letting them doing the
> work distribution with RSS and use multiple ithreads is a more
> performant strategy than using software work distribution on current
> systems, though.
>

So should we prefer a bunch of expensive quality 10 gig cards? Any you
would recommend?

> Someone has probably asked for this already, but -- could you send a
> snapshot of the top -SH output in the steady state? Let top run for a
> few minutes and then copy/paste the first 10-20 lines into an e-mail.
>

Sure. Mind you: now there's only 1800 entries in each of the two ipfw
tables, so any drops have stopped. But it only takes another 200-300
entries to start dropping.

155 processes: 10 running, 129 sleeping, 16 waiting
CPU: 2.4% user, 0.0% nice, 2.0% system, 9.3% interrupt, 86.2% idle
Mem: 1691M Active, 1491M Inact, 454M Wired, 130M Cache, 214M Buf, 170M Free
Swap: 2048M Total, 12K Used, 2048M Free

PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
15 root 171 ki31 0K 16K CPU3 3 22.4H 97.85% idle: cpu3
14 root 171 ki31 0K 16K CPU4 4 23.0H 96.29% idle: cpu4
12 root 171 ki31 0K 16K CPU6 6 23.8H 94.58% idle: cpu6
16 root 171 ki31 0K 16K CPU2 2 22.5H 90.72% idle: cpu2
13 root 171 ki31 0K 16K CPU5 5 23.4H 90.58% idle: cpu5
18 root 171 ki31 0K 16K RUN 0 20.3H 85.60% idle: cpu0
17 root 171 ki31 0K 16K CPU1 1 910:03 78.37% idle: cpu1
11 root 171 ki31 0K 16K CPU7 7 23.8H 65.62% idle: cpu7
21 root -44 - 0K 16K CPU7 7 19:03 48.34% swi1: net
29 root -68 - 0K 16K WAIT 1 515:49 19.63% irq256: bce0
31 root -68 - 0K 16K WAIT 2 56:05 5.52% irq257: bce1
19 root -32 - 0K 16K WAIT 5 50:05 3.86% swi4:
clock sio
983 flowtools 44 0 12112K 6440K select 0 13:20 0.15% flow-capture
465 root -68 - 0K 16K - 3 51:19 0.00% dummynet
3 root -8 - 0K 16K - 1 7:41 0.00% g_up
4 root -8 - 0K 16K - 2 7:14 0.00% g_down
30 root -64 - 0K 16K WAIT 6 5:30 0.00% irq16: mfi0

rihad

unread,

Oct 7, 2009, 5:23:47 AM10/7/09

to

Oleg Bulyzhin wrote:
> Please show your 'sysctl net.inet.ip' output.
>

net.inet.ip.portrange.randomtime: 45
net.inet.ip.portrange.randomcps: 10
net.inet.ip.portrange.randomized: 1
net.inet.ip.portrange.reservedlow: 0
net.inet.ip.portrange.reservedhigh: 1023
net.inet.ip.portrange.hilast: 65535
net.inet.ip.portrange.hifirst: 49152
net.inet.ip.portrange.last: 65535
net.inet.ip.portrange.first: 49152
net.inet.ip.portrange.lowlast: 600
net.inet.ip.portrange.lowfirst: 1023
net.inet.ip.forwarding: 1
net.inet.ip.redirect: 1
net.inet.ip.ttl: 64
net.inet.ip.rtexpire: 3600
net.inet.ip.rtminexpire: 10
net.inet.ip.rtmaxcache: 128
net.inet.ip.sourceroute: 0
net.inet.ip.intr_queue_maxlen: 300
net.inet.ip.intr_queue_drops: 1365
net.inet.ip.accept_sourceroute: 0
net.inet.ip.keepfaith: 0
net.inet.ip.gifttl: 30
net.inet.ip.same_prefix_carp_only: 0
net.inet.ip.subnets_are_local: 0
net.inet.ip.fastforwarding: 0
net.inet.ip.fw.dyn_keepalive: 1
net.inet.ip.fw.dyn_short_lifetime: 5
net.inet.ip.fw.dyn_udp_lifetime: 10
net.inet.ip.fw.dyn_rst_lifetime: 1
net.inet.ip.fw.dyn_fin_lifetime: 1
net.inet.ip.fw.dyn_syn_lifetime: 20
net.inet.ip.fw.dyn_ack_lifetime: 300
net.inet.ip.fw.static_count: 12
net.inet.ip.fw.dyn_max: 4096
net.inet.ip.fw.dyn_count: 0
net.inet.ip.fw.curr_dyn_buckets: 256
net.inet.ip.fw.dyn_buckets: 256
net.inet.ip.fw.default_rule: 65535
net.inet.ip.fw.verbose_limit: 0
net.inet.ip.fw.verbose: 0
net.inet.ip.fw.debug: 1
net.inet.ip.fw.one_pass: 0
net.inet.ip.fw.autoinc_step: 100
net.inet.ip.fw.enable: 1
net.inet.ip.maxfragpackets: 3472
net.inet.ip.maxfragsperpacket: 16
net.inet.ip.fragpackets: 0
net.inet.ip.check_interface: 0
net.inet.ip.random_id: 0
net.inet.ip.sendsourcequench: 0
net.inet.ip.process_options: 1
net.inet.ip.dummynet.debug: 0
net.inet.ip.dummynet.pipe_byte_limit: 1048576
net.inet.ip.dummynet.pipe_slot_limit: 2000
net.inet.ip.dummynet.io_pkt_drop: 4515
net.inet.ip.dummynet.io_pkt_fast: 120785085
net.inet.ip.dummynet.io_pkt: 639360193
net.inet.ip.dummynet.io_fast: 1
net.inet.ip.dummynet.tick_lost: 0
net.inet.ip.dummynet.tick_diff: 4252218
net.inet.ip.dummynet.tick_adjustment: 950800
net.inet.ip.dummynet.tick_delta_sum: 173
net.inet.ip.dummynet.tick_delta: 502
net.inet.ip.dummynet.red_max_pkt_size: 1500
net.inet.ip.dummynet.red_avg_pkt_size: 512
net.inet.ip.dummynet.red_lookup_depth: 256
net.inet.ip.dummynet.max_chain_len: 16
net.inet.ip.dummynet.expire: 1
net.inet.ip.dummynet.search_steps: 710716767
net.inet.ip.dummynet.searches: 639360196
net.inet.ip.dummynet.extract_heap: 0
net.inet.ip.dummynet.ready_heap: 496
net.inet.ip.dummynet.curr_time: 181442521
net.inet.ip.dummynet.hash_size: 65536

Robert Watson

unread,

Oct 7, 2009, 5:36:58 AM10/7/09

to

On Wed, 7 Oct 2009, rihad wrote:

>> snapshot of the top -SH output in the steady state? Let top run for a few
>> minutes and then copy/paste the first 10-20 lines into an e-mail.
>>
> Sure. Mind you: now there's only 1800 entries in each of the two ipfw
> tables, so any drops have stopped. But it only takes another 200-300 entries
> to start dropping.

Could you do the same in the net.isr.direct=1 configuration so we can compare?

Robert

rihad

unread,

Oct 7, 2009, 5:55:44 AM10/7/09

to

Robert Watson wrote:
>
> On Wed, 7 Oct 2009, rihad wrote:
>
>>> snapshot of the top -SH output in the steady state? Let top run for
>>> a few minutes and then copy/paste the first 10-20 lines into an e-mail.
>>>
>> Sure. Mind you: now there's only 1800 entries in each of the two ipfw
>> tables, so any drops have stopped. But it only takes another 200-300
>> entries to start dropping.
>
> Could you do the same in the net.isr.direct=1 configuration so we can
> compare?
>

net.isr.direct=1:

last pid: 92152; load averages: 0.99, 1.18, 1.15
up 1+01:42:28 14:53:09
162 processes: 9 running, 136 sleeping, 17 waiting
CPU: 2.1% user, 0.0% nice, 5.4% system, 7.0% interrupt, 85.5% idle
Mem: 1693M Active, 1429M Inact, 447M Wired, 197M Cache, 214M Buf, 170M Free

Swap: 2048M Total, 12K Used, 2048M Free

PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND

12 root 171 ki31 0K 16K CPU6 6 24.3H 100.00% idle: cpu6
13 root 171 ki31 0K 16K CPU5 5 23.8H 95.95% idle: cpu5
14 root 171 ki31 0K 16K CPU4 4 23.4H 93.12% idle: cpu4
16 root 171 ki31 0K 16K CPU2 2 23.0H 90.19% idle: cpu2
11 root 171 ki31 0K 16K CPU7 7 24.2H 87.26% idle: cpu7
15 root 171 ki31 0K 16K CPU3 3 22.8H 86.18% idle: cpu3
18 root 171 ki31 0K 16K RUN 0 20.6H 84.96% idle: cpu0
17 root 171 ki31 0K 16K CPU1 1 933:23 47.85% idle: cpu1
29 root -68 - 0K 16K WAIT 1 522:02 46.88% irq256: bce0
465 root -68 - 0K 16K - 7 55:15 12.65% dummynet
31 root -68 - 0K 16K WAIT 2 57:29 4.74% irq257: bce1
21 root -44 - 0K 16K WAIT 0 34:55 4.64% swi1: net
19 root -32 - 0K 16K WAIT 4 51:41 3.96% swi4:
clock sio
30 root -64 - 0K 16K WAIT 6 5:43 0.73% irq16: mfi0

Almost 2000 entries in the table, traffic load= 420-430 mbps, drops
haven't yet started.

Previous net.isr.direct=0:

Oleg Bulyzhin

unread,

Oct 7, 2009, 6:05:03 AM10/7/09

to

On Wed, Oct 07, 2009 at 02:23:47PM +0500, rihad wrote:

_______________________________________________

rihad

unread,

Oct 7, 2009, 6:52:56 AM10/7/09

to

Oleg Bulyzhin wrote:
> On Wed, Oct 07, 2009 at 03:16:27PM +0500, rihad wrote:
>> Oleg Bulyzhin wrote:
>>> On Wed, Oct 07, 2009 at 02:23:47PM +0500, rihad wrote:
>>>
>>> Few questions:
>>> 1) why are you not using fastforwarding?
>>> 2) search_steps/searches ratio is not that good, are you using 'buckets'
>>> keyword in your pipe configuration?
>>> 3) you have net.inet.ip.fw.one_pass = 0, is it intended?
>>>
>> 1) and 3): the box does traffic accounting and shaping, so I need
>> one_pass=0 to do both ngtee and pipes.
> Still can not see any objection for not using fastforwarding, and usually
> ipfw ruleset can be rearranged for using dummynet & netgraph with one_pass=1.

You probably have some special sources of documentation ;-) According to
man ipfw, both "netgraph/ngtee" and "pipe" decide the fate of the packet
unless one_pass=0. Or do you mean sprinkling smart skiptos here and
there? ;-)

> Could you show your 'ipfw show' output? (hide ip addresses if you wish but
> keep counters please).
>

Here it is, in its whole glory:

00100 10434423 1484891105 allow ip from any to any via lo0
00200 2 14 deny ip from any to 127.0.0.0/8
00300 1 4 deny ip from 127.0.0.0/8 to any
01000 3300039938 327603104711 allow ip from any to any in
01010 26214900 421138433 allow ip from me to any out
01020 5453857 46806278 allow icmp from any to any out
01030 3268289053 327224694165 ngtee 1 ip from any to any out
01040 18681181 1089636054 skipto 1100 ip from table(127) to any out
recv bce0 xmit bce1
01060 777488848 76743392754 pipe tablearg ip from any to table(0) out
recv bce0 xmit bce1
01070 776831109 76682499457 allow ip from any to table(0) out recv
bce0 xmit bce1
01100 13102697 808411842 pipe tablearg ip from any to table(2) out
65535 662648946 66711487830 allow ip from any to any

table(127) is static in nature and is under 100 entries.
table(0) and table(2) have the same IP clients' addresses but different
pipe IDs.

>> 2) Hm, I'm not using "buckets", but rather
>> net.inet.ip.dummynet.hash_size. It's at default, 64. I've tried setting
>> net.inet.ip.dummynet.hash_size=65536 in sysctl.conf but somehow it was
>> still 64 after reboot, so I left it at 64. Should I make it 128? 256?
>> Does it matter that much? The load is at approx. 70-120 consumers per
>> pipe, so I thought 64 bucket size was enough.
> It depends on traffic pattern, try to increase it and watch
> search_steps/searches ratio (~1.001 is good enough)
>

Hm, thanks, I'll try that.

rihad

unread,

Oct 7, 2009, 7:23:36 AM10/7/09

to

After reconfiguring all pipes with hash_size=256 (4 times as much as
before), the ratio has started decreasing slowly. Run by me every 5-100
seconds:
[rihad@billing ~]$ echo "$(sysctl -n
net.inet.ip.dummynet.search_steps)/$(sysctl -n
net.inet.ip.dummynet.searches)" | bc -l
1.10639566354978963640
[rihad@billing ~]$ echo "$(sysctl -n
net.inet.ip.dummynet.search_steps)/$(sysctl -n
net.inet.ip.dummynet.searches)" | bc -l
1.10638988711274017516
[rihad@billing ~]$ echo "$(sysctl -n
net.inet.ip.dummynet.search_steps)/$(sysctl -n
net.inet.ip.dummynet.searches)" | bc -l
1.10637649664889937145
[rihad@billing ~]$ echo "$(sysctl -n
net.inet.ip.dummynet.search_steps)/$(sysctl -n
net.inet.ip.dummynet.searches)" | bc -l
1.10636898392044547569
[rihad@billing ~]$ echo "$(sysctl -n
net.inet.ip.dummynet.search_steps)/$(sysctl -n
net.inet.ip.dummynet.searches)" | bc -l
1.10634798328730542254
[rihad@billing ~]$ echo "$(sysctl -n
net.inet.ip.dummynet.search_steps)/$(sysctl -n
net.inet.ip.dummynet.searches)" | bc -l
1.10608591323771604268
[rihad@billing ~]$ echo "$(sysctl -n
net.inet.ip.dummynet.search_steps)/$(sysctl -n
net.inet.ip.dummynet.searches)" | bc -l
1.10600110020578292697

but the number of drops is still there. Run every minute:
Wed Oct 7 11:00:44 UTC 2009 34630 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:01:44 UTC 2009 34630 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:02:44 UTC 2009 34729 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:03:44 UTC 2009 34729 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:04:44 UTC 2009 34861 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:05:44 UTC 2009 34932 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:06:44 UTC 2009 35499 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:07:45 UTC 2009 35780 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:08:45 UTC 2009 35841 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:09:45 UTC 2009 36348 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:10:45 UTC 2009 36568 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:11:45 UTC 2009 36673 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:12:45 UTC 2009 36673 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:13:46 UTC 2009 36673 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:14:46 UTC 2009 36673 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:15:46 UTC 2009 36673 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:16:46 UTC 2009 36849 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:17:46 UTC 2009 37234 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:18:46 UTC 2009 37949 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:19:47 UTC 2009 38043 output packets dropped due to no
bufs, etc.
Wed Oct 7 11:20:47 UTC 2009 38549 output packets dropped due to no
bufs, etc.

2200-2350 users online (ipfw table load). I'll wait and see if the drop
rate approaches 500-1000 per second as the number of online users comes
close to 3-4K.

net.isr.direct=0

Barney Cordoba

unread,

Oct 7, 2009, 7:29:21 AM10/7/09

to

--- On Wed, 10/7/09, rihad <ri...@mail.ru> wrote:

Its frightening to me that someone is managing such a large network
with dummynet. Talk about stealing your customer's money.

BC

rihad

unread,

Oct 7, 2009, 7:38:37 AM10/7/09

to

> Its frightening to me that someone is managing such a large network
> with dummynet. Talk about stealing your customer's money.
>

We have no customers - we're a charity ISP.
Any alternatives? ALTQ?

Oleg Bulyzhin

unread,

Oct 7, 2009, 7:54:25 AM10/7/09

to

On Wed, Oct 07, 2009 at 03:52:56PM +0500, rihad wrote:

> You probably have some special sources of documentation ;-) According to
> man ipfw, both "netgraph/ngtee" and "pipe" decide the fate of the packet
> unless one_pass=0. Or do you mean sprinkling smart skiptos here and
> there? ;-)

you can
1) use ng_ether & ng_netflow. (so no need in 'ngtee' rule).
2) use 'tee' rule with ng_ksocket & ng_netflow

>
> > Could you show your 'ipfw show' output? (hide ip addresses if you wish but
> > keep counters please).
> >

> Here it is, in its whole glory:
>
> 00100 10434423 1484891105 allow ip from any to any via lo0
> 00200 2 14 deny ip from any to 127.0.0.0/8
> 00300 1 4 deny ip from 127.0.0.0/8 to any
> 01000 3300039938 327603104711 allow ip from any to any in
> 01010 26214900 421138433 allow ip from me to any out
> 01020 5453857 46806278 allow icmp from any to any out
> 01030 3268289053 327224694165 ngtee 1 ip from any to any out
> 01040 18681181 1089636054 skipto 1100 ip from table(127) to any out
> recv bce0 xmit bce1
> 01060 777488848 76743392754 pipe tablearg ip from any to table(0) out
> recv bce0 xmit bce1
> 01070 776831109 76682499457 allow ip from any to table(0) out recv
> bce0 xmit bce1
> 01100 13102697 808411842 pipe tablearg ip from any to table(2) out
> 65535 662648946 66711487830 allow ip from any to any

I guess this one would be better(faster):

00050 allow ip from any to any in
00100 allow ip from any to any via lo0
01010 allow ip from me to any
01020 allow icmp from any to any
01030 ngtee 1 ip from any to any
01035 skipto 1040 ip from any to any recv bce0 xmit bce1
01036 allow ip from any to any
01040 skipto 1100 ip from table(127) to any
01060 pipe tablearg ip from any to table(0)
01070 allow ip from any to any
01100 pipe tablearg ip from any to table(2)
65535 allow ip from any to any

P.S. have you tried net.inet.ip.fastforwarding=1?

--
Oleg.

================================================================
=== Oleg Bulyzhin -- OBUL-RIPN -- OBUL-RIPE -- ol...@rinet.ru ===
================================================================

_______________________________________________

rihad

unread,

Oct 7, 2009, 8:10:29 AM10/7/09

to

Oleg Bulyzhin wrote:
> On Wed, Oct 07, 2009 at 03:52:56PM +0500, rihad wrote:
>
>> You probably have some special sources of documentation ;-) According to
>> man ipfw, both "netgraph/ngtee" and "pipe" decide the fate of the packet
>> unless one_pass=0. Or do you mean sprinkling smart skiptos here and
>> there? ;-)
>
> you can
> 1) use ng_ether & ng_netflow. (so no need in 'ngtee' rule).
> 2) use 'tee' rule with ng_ksocket & ng_netflow
>

Thanks, I'll see into that if fast-forwarding doesn't help.

Phew, were "out" that bad? I left them in as commentary.
And the localhost anti-spoof check isn't such a bad security ring to get
rid of in the name of performance ;-)
Ok, got you, I'll take a note of it, thanks.

> P.S. have you tried net.inet.ip.fastforwarding=1?
>

man 4 inet:
IPCTL_FASTFORWARDING (ip.fastforwarding) Boolean: enable/disable
the use
of fast IP forwarding code. Defaults to
off. When
fast IP forwarding is enabled, IP packets
are for-
warded directly to the appropriate network
inter-
face with direct processing to completion, which
greatly improves the throughput. All
packets for
local IP addresses, non-unicast, or with IP
options
are handled by the normal IP input
processing path.
All features of the normal (slow) IP forwarding
path are supported including firewall (through
pfil(9) hooks) checking, except ipsec(4) tunnel
brokering. The IP fastforwarding path does not
generate ICMP redirect or source quench
messages.

I'm afraid a bit that it will lock up the live remote system. Is it a
drop in replacement given my ipfw rules? Why isn't it enabled by default?

rihad

unread,

Oct 7, 2009, 8:13:05 AM10/7/09

to

> Why isn't it enabled by default?
>

Answering myself: probably because of this:

The IP fastforwarding path does not generate ICMP redirect or source
quench messages.

Robert Watson

unread,

Oct 7, 2009, 8:21:43 AM10/7/09

to

On Wed, 7 Oct 2009, rihad wrote:

> Robert Watson wrote:
>>
>> On Wed, 7 Oct 2009, rihad wrote:
>>
>>>> snapshot of the top -SH output in the steady state? Let top run for a
>>>> few minutes and then copy/paste the first 10-20 lines into an e-mail.
>>>>
>>> Sure. Mind you: now there's only 1800 entries in each of the two ipfw
>>> tables, so any drops have stopped. But it only takes another 200-300
>>> entries to start dropping.
>>
>> Could you do the same in the net.isr.direct=1 configuration so we can
>> compare?
>
> net.isr.direct=1:

So it seems that CPU exhaustion is likely not the source of drops -- what I
was looking for in both configurations were signs that any individual thread
was approaching 80% utilization, which in a peak load situation might mean it
hitting 100% and therefore leading to packet loss for that reason.

The statistic you're monitoring has a couple of interpretations, but the most
likely interpretation is overfilling the output queue on the network interface
you're transmitting on. In turn there are various possible reasons for this
happening, but the two most common would be:

(1) Average load is exceeding the transmit capacity of the driver/hardware
pipeline -- the pipe is just too small.

(2) Peak capacity (burstiness) is exceeding the transmit capacity of the
driver/hardware pipeline.

The questions that Luigi and others have been asking about your dummynet
configuration are to some extent oriented around determining whether the
burstiness introduced by dummynet could be responsible for that. Suggestions
like increasing timer resolution are intended to spread out the injection of
packets by dummynet to attempt to reduce the peaks of burstiness that occur
when multiple queues inject packets in a burst that exceeds the queue depth
supported by combined hardware descriptor rings and software transmit queue.

The two solutions, then are (a) to increase the timer resolution significantly
so that packets are injected in smaller bursts and (b) increase the queue
capacities. The hardware queue limits likely can't be raised w/o new
hardware, but the ifnet transmit queue sizes can be increased. Timer
resolution going up is almost certainly not a bad idea in your configuration,
although does require a reboot as you have observed.

On a side note: one other possible interpretation of that statistic is that
you're seeing fragmentation problems. Usually in forwarding scenarios this is
unlikely. However, it wouldn't hurt to make sure you have LRO turned off on
the network interfaces you're using, assuming it's supported by the driver.

Robert N M Watson
Computer Laboratory
University of Cambridge

>

rihad

unread,

Oct 7, 2009, 8:29:00 AM10/7/09

to

Oleg Bulyzhin wrote:
> On Wed, Oct 07, 2009 at 03:52:56PM +0500, rihad wrote:
>
> P.S. have you tried net.inet.ip.fastforwarding=1?
>

Yup, it didn't help at all. Reverting it back to 0 for now.

rihad

unread,

Oct 7, 2009, 8:42:48 AM10/7/09

to

Robert Watson wrote:
> Suggestions like increasing timer resolution are intended to spread out
> the injection of packets by dummynet to attempt to reduce the peaks of
> burstiness that occur when multiple queues inject packets in a burst
> that exceeds the queue depth supported by combined hardware descriptor
> rings and software transmit queue.
>

Raising HZ from 1000 to 2000 has helped. There are now 200-300 global
drops/s, as opposed to 300-1000 with HZ=1000. Or maybe net.isr.direct
from 1 to 0 help. Or maybe hash_size from 64 to 256. Or maybe...

> The two solutions, then are (a) to increase the timer resolution
> significantly so that packets are injected in smaller bursts

But isn't that bad that it can actually become worse? From /sys/conf/NOTES:

# The granularity of operation is controlled by the kernel option HZ whose
# default value (1000 on most architectures) means a granularity of 1ms
# (1s/HZ). Historically, the default was 100, but finer granularity is
# required for DUMMYNET and other systems on modern hardware. There are
# reasonable arguments that HZ should, in fact, be 100 still; consider,
# that reducing the granularity too much might cause excessive overhead in
# clock interrupt processing, potentially causing ticks to be missed and
thus
# actually reducing the accuracy of operation.

> and (b) increase the queue capacities. The hardware queue limits likely can't
> be raised w/o new hardware, but the ifnet transmit queue sizes can be
> increased.

Can someone please say how to increase the "ifnet transmit queue sizes"?

> Timer resolution going up is almost certainly not a bad idea in your configuration, although does require a reboot as you have observed.
>

OK, I'll try HZ=4000, but there are some required servers like
flowtools/radius/mysql/perl app that are also running.

> On a side note: one other possible interpretation of that statistic is
> that you're seeing fragmentation problems. Usually in forwarding
> scenarios this is unlikely. However, it wouldn't hurt to make sure you
> have LRO turned off on the network interfaces you're using, assuming
> it's supported by the driver.
>

I don't think fragments are the problem. The numbers are too small ;-)
$ netstat -s|fgrep fragment
5318 fragments received
147 fragments dropped (dup or out of space)
5157 fragments dropped after timeout
4088 output datagrams fragmented
8180 fragments created
0 datagrams that can't be fragmented

There's no such option as LRO shown, so I guess it's off:
options=1bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4>

Robert Watson

unread,

Oct 7, 2009, 9:40:20 AM10/7/09

to

On Wed, 7 Oct 2009, rihad wrote:

>> Suggestions like increasing timer resolution are intended to spread out the
>> injection of packets by dummynet to attempt to reduce the peaks of
>> burstiness that occur when multiple queues inject packets in a burst that
>> exceeds the queue depth supported by combined hardware descriptor rings and
>> software transmit queue.
>
> Raising HZ from 1000 to 2000 has helped. There are now 200-300 global
> drops/s, as opposed to 300-1000 with HZ=1000. Or maybe net.isr.direct from 1
> to 0 help. Or maybe hash_size from 64 to 256. Or maybe...

Or maybe other random factors such as traffic load corresponding to major
sports events, etc. :-)

It's also possible that combining multiple changes cancels out the effect of
one or another change. Given the rather large number of possible
combinations of things to try, I'd suggest being fairly strategic in how you
try them. Starting with just an original config + significant HZ increase is
probably the best starting point. Changing hash_size is really about reducing
CPU use, so if in the whole you're not getting close to the capacity of a core
for any given thread involved in the work, it may not make much difference
(tuning these data structures is a bit of a black art).

>> The two solutions, then are (a) to increase the timer resolution
>> significantly so that packets are injected in smaller bursts
>
> But isn't that bad that it can actually become worse? From /sys/conf/NOTES:
>
> # The granularity of operation is controlled by the kernel option HZ whose
> # default value (1000 on most architectures) means a granularity of 1ms
> # (1s/HZ). Historically, the default was 100, but finer granularity is
> # required for DUMMYNET and other systems on modern hardware. There are
> # reasonable arguments that HZ should, in fact, be 100 still; consider,
> # that reducing the granularity too much might cause excessive overhead in
> # clock interrupt processing, potentially causing ticks to be missed and thus
> # actually reducing the accuracy of operation.

Right: we fire the timer on every CPU at 1/HZ seconds, which means quite a lot
of work being done. On systems where timers are proportionally more expensive
-- especially when using hardware virtualization, for example, we do recommend
tuning the timers down. And our boot loader will actually do it for you: we
auto-detect vmware, parallels, kqemu, virtualbox, etc, and adjust the timer
rate from from 1000 to 100 during the boot.

That said, in your configuration I see little argument for a lower timer rate:
you need to burst packets at frequent intervals or risk overfilling queues,
and the overheads of additional timer tickets on your system shouldn't be too
bad as you have both very fast hardware and a lot of idle time.

I would suggest making just the HZ -> 4000 change for now and see how it goes.

>> and (b) increase the queue capacities. The hardware queue limits likely
>> can't be raised w/o new hardware, but the ifnet transmit queue sizes can be
>> increased.
>
> Can someone please say how to increase the "ifnet transmit queue sizes"?

Unfortunately, I fear that this is driver-specific, and in the case of bce
requires a recompile. In the driver init code in if_bce, the following code
appears:

ifp->if_snd.ifq_drv_maxlen = USABLE_TX_BD;
IFQ_SET_MAXLEN(&ifp->if_snd, ifp->if_snd.ifq_drv_maxlen);
IFQ_SET_READY(&ifp->if_snd);

Which evaluates to a architecture-specific value due to varying pagesize. You
might just try forcing it to 1024.

>> Timer resolution going up is almost certainly not a bad idea in your
>> configuration, although does require a reboot as you have observed.
>
> OK, I'll try HZ=4000, but there are some required servers like
> flowtools/radius/mysql/perl app that are also running.

That should be fine.

>> On a side note: one other possible interpretation of that statistic is that
>> you're seeing fragmentation problems. Usually in forwarding scenarios this
>> is unlikely. However, it wouldn't hurt to make sure you have LRO turned
>> off on the network interfaces you're using, assuming it's supported by the
>> driver.
>>
> I don't think fragments are the problem. The numbers are too small ;-)
> $ netstat -s|fgrep fragment
> 5318 fragments received
> 147 fragments dropped (dup or out of space)
> 5157 fragments dropped after timeout
> 4088 output datagrams fragmented
> 8180 fragments created
> 0 datagrams that can't be fragmented
>
> There's no such option as LRO shown, so I guess it's off:
> options=1bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4>

That probably rules that out as a source of problems then.

Robert

rihad

unread,

Oct 7, 2009, 10:43:54 AM10/7/09

to

Robert Watson wrote:

> I would suggest making just the HZ -> 4000 change for now and see how it
> goes.
>

OK, I will try testing HZ=4000 tomorrow morning, although I'm pretty
sure there still will be some drops.

>> Can someone please say how to increase the "ifnet transmit queue sizes"?
>
> Unfortunately, I fear that this is driver-specific, and in the case of
> bce requires a recompile. In the driver init code in if_bce, the
> following code appears:
>
> ifp->if_snd.ifq_drv_maxlen = USABLE_TX_BD;
> IFQ_SET_MAXLEN(&ifp->if_snd, ifp->if_snd.ifq_drv_maxlen);
> IFQ_SET_READY(&ifp->if_snd);
>
> Which evaluates to a architecture-specific value due to varying
> pagesize. You might just try forcing it to 1024.
>

I think I'll try this too, if HZ=4000 doesn't help, thanks a lot.
In the long run we'll switch to some high-quality 10 GiGE cards.

Ingo Flaschberger

unread,

Oct 7, 2009, 10:49:21 AM10/7/09

to

Hi,

can you send me the dmesg ouput from your networkcards when they are
detected at booting?

can you also send me a lspci and lspci -v ?

Kind regards,
ingo flaschberger

rihad

unread,

Oct 7, 2009, 10:58:17 AM10/7/09

to

Robert Watson wrote:
> In the driver init code in if_bce, the following code appears:
>
> ifp->if_snd.ifq_drv_maxlen = USABLE_TX_BD;
> IFQ_SET_MAXLEN(&ifp->if_snd, ifp->if_snd.ifq_drv_maxlen);
> IFQ_SET_READY(&ifp->if_snd);
>
> Which evaluates to a architecture-specific value due to varying
> pagesize. You might just try forcing it to 1024.
>

In dev/bce/if_bcereg.h:
#define TX_PAGES 2
#define TOTAL_TX_BD_PER_PAGE (BCM_PAGE_SIZE / sizeof(struct tx_bd))
#define USABLE_TX_BD_PER_PAGE (TOTAL_TX_BD_PER_PAGE - 1)
#define TOTAL_TX_BD (TOTAL_TX_BD_PER_PAGE * TX_PAGES)
#define USABLE_TX_BD (USABLE_TX_BD_PER_PAGE * TX_PAGES)
#define MAX_TX_BD (TOTAL_TX_BD - 1)

meaning that ifq_drv_maxlen is expected to end up smaller than
MAX_TX_BD. What if MAX_TX_BD is itself way smaller than 1024?

rihad

unread,

Oct 7, 2009, 11:02:21 AM10/7/09

to

Ingo Flaschberger wrote:
> Hi,
>
> can you send me the dmesg ouput from your networkcards when they are
> detected at booting?
>

Hello,

bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem
0xf4000000-0xf5ffffff irq 16 at device 0.0 on pci7
bce0: Ethernet address: 00:1d:09:xx:xx:xx
bce0: [ITHREAD]
bce0: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); F/W
(0x03050C05); Flags( MFW MSI )
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B2)> mem
0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci3
bce1: Ethernet address: 00:1d:09:xx:xx:xx
bce1: [ITHREAD]
bce1: ASIC (0x57081020); Rev (B2); Bus (PCI-X, 64-bit, 133MHz); F/W
(0x03050C05); Flags( MFW MSI )

> can you also send me a lspci and lspci -v ?
>

Sorry, this is FreeBSD, not Linux ;-)

# pciconf -l
...
bce0@pci0:7:0:0: class=0x020000 card=0x01b21028 chip=0x164c14e4
rev=0x12 hdr=0x00
bce1@pci0:3:0:0: class=0x020000 card=0x01b21028 chip=0x164c14e4
rev=0x12 hdr=0x00