tbf/htb qdisc limitations

Steven Brudenell

Oct 8, 2010, 4:58:36 PM10/8/10
to net...@vger.kernel.org
hi folks,

i was disappointed recently to find that i can't set the "burst"
parameters very high on the tbf or htb qdiscs. the actual limit of the
burst parameters varies, according to the rate parameter. at the
relatively low rate i want to set, i want to have the burst parameter
be several gigabytes, but i'm actually limited to only a few
megabytes.

(motivation: a fully-automated way to stay inside the monthly transfer
limits imposed by many ISPs these days, without resorting to a
constant rate limit. for example, comcast limits its customers to
250GB/month, which is about 101KB/s; many cellular data plans in the
US limit to 5GB/month =~ 2KB/s).

i'll gladly code a patch, but i'd like the list's advice on whether
this is necessary, and a little bit about how to proceed:

1) what is the purpose of the "rate tables" used in these qdiscs --
why use them in favor of dividing bytes by time to compute a rate? i
assume the answer has something to do with restrictions on using
floating point math (maybe even integer division?) at different places
/ interruptibility states in the kernel. maybe this is documented on
kernelnewbies somewhere but i couldn't find it. (for reference, i've
pasted what i think the fast-path lookup looks like below, after
question 2.)

2) is there an established procedure for versioning a netlink
interface? today the netlink interface for tbf and htb is horribly
implementation-coupled (the "burst" parameters need to be munged
according to the "rate" parameters and kernel tick rate). i think i
would need to change these interfaces in order to change the
accounting implementation in the corresponding qdisc. however, i
probably want to remain compatible with old userspace.
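
(for reference, here is roughly what that fast-path lookup looks like,
as far as i can tell from include/net/sch_generic.h -- paraphrased
from memory, with the handling of sizes past the last table slot
elided:)

/* per-packet cost is a pure table lookup, so the fast path never has
 * to divide; the 256-entry table itself is computed by userspace (tc)
 * and passed down over netlink. */
static inline u32 qdisc_l2t(struct qdisc_rate_table *rtab, unsigned int pktlen)
{
	int slot = pktlen + rtab->rate.cell_align + rtab->rate.overhead;

	if (slot < 0)
		slot = 0;
	slot >>= rtab->rate.cell_log;

	return rtab->data[slot < 256 ? slot : 255];	/* time in sched ticks */
}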

~steve
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Jarek Poplawski

Oct 10, 2010, 7:23:23 AM10/10/10
to Steven Brudenell, net...@vger.kernel.org
Steven Brudenell wrote:
> hi folks,
>
> i was disappointed recently to find that i can't set the "burst"
> parameters very high on the tbf or htb qdiscs. the actual limit of the
> burst parameters varies, according to the rate parameter. at the
> relatively low rate i want to set, i want to have the burst parameter
> be several gigabytes, but i'm actually limited to only a few
> megabytes.
>
> (motivation: a fully-automated way to stay inside the monthly transfer
> limits imposed by many ISPs these days, without resorting to a
> constant rate limit. for example, comcast limits its customers to
> 250GB/month, which is about 101KB/s; many cellular data plans in the
> US limit to 5GB/month =~ 2KB/s).

I'm not sure you've checked how "burst" actually works, and I doubt
it could help you here. Anyway, do you think a config of rate 2KB/s
with a 5GB burst would really be useful for you?

>
> i'll gladly code a patch, but i'd like the list's advice on whether
> this is necessary, and a little bit about how to proceed:
>
> 1) what is the purpose of the "rate tables" used in these qdiscs --
> why use them in favor of dividing bytes by time to compute a rate? i
> assume the answer has something to do with restrictions on using
> floating point math (maybe even integer division?) at different places
> / interruptibility states in the kernel. maybe this is documented on
> kernelnewbies somewhere but i couldn't find it.
>
> 2) is there an established procedure for versioning a netlink
> interface? today the netlink interface for tbf and htb is horribly
> implementation-coupled (the "burst" parameters need to be munged
> according to the "rate" parameters and kernel tick rate). i think i
> would need to change these interfaces in order to change the
> accounting implementation in the corresponding qdisc. however, i
> probably want to remain compatible with old userspace.

My proposal is that you don't bother with 1) and 2) yet, but first do
the hack in tbf or htb directly, using or omitting rate tables as you
like, and test the idea.

But it seems the right way is to collect monthly stats with some
userspace tool and change the qdisc config dynamically. You might look
on network admins' lists for small ISPs for example scripts doing such
nasty things to their users, or have a look at ppp accounting tools.
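
Just to show what I mean, such a tool could be as crude as the sketch
below (untested, interface name and numbers made up): watch the tx
byte counter and clamp the rate once the quota is nearly gone.

/* untested sketch: poll eth0's tx byte counter and fall back to a
 * hard tbf limit when 90% of the monthly quota has been used up. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUOTA (250ULL * 1000 * 1000 * 1000)	/* 250GB / month */

static unsigned long long tx_bytes(void)
{
	unsigned long long v = 0;
	FILE *f = fopen("/sys/class/net/eth0/statistics/tx_bytes", "r");

	if (f) {
		if (fscanf(f, "%llu", &v) != 1)
			v = 0;
		fclose(f);
	}
	return v;
}

int main(void)
{
	unsigned long long start = tx_bytes();

	while (tx_bytes() - start < QUOTA * 9 / 10)
		sleep(60);

	/* last 10% of the quota: switch to a constant ~100KB/s limit */
	return system("tc qdisc replace dev eth0 root tbf "
		      "rate 100kbps burst 10k latency 70ms");
}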

Jarek P.

Steven Brudenell

Oct 12, 2010, 3:31:48 PM10/12/10
to Jarek Poplawski, net...@vger.kernel.org
> Yes, it's not allowed according to Documentation/HOWTO. Btw, as you
> can see e.g. in sch_hfsc comments, 64-bit division is avoided too.

i see sch_hfsc avoids do_div in critical areas for performance
reasons, but uses it in other places. it should still be alright to
use do_div in tbf_change and htb_change_class, right? it would be nice
to compute the rtabs in those functions instead of having userspace do
it.
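
(to be concrete, i'm imagining something like this hypothetical sketch
called from tbf_change() -- just to show where do_div would sit; i'm
glossing over the cell_log choice and the conversion to scheduler
ticks:)

/* hypothetical: fill the 256-slot rate table at config time instead of
 * trusting the one userspace computed.  do_div() should be fine here,
 * since ->change() runs in process context, off the fast path. */
static void fill_rtab(u32 *data, u32 bytes_per_sec, int cell_log)
{
	int i;

	for (i = 0; i < 256; i++) {
		/* nanoseconds needed to transmit ((i + 1) << cell_log) bytes */
		u64 t = (u64)((i + 1) << cell_log) * NSEC_PER_SEC;

		do_div(t, bytes_per_sec);
		data[i] = t;	/* still needs converting to scheduler ticks */
	}
}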

> I can only say there is no versioning, but backward compatibility
> is crucial, so you need to do some tricks or data duplication.
> You could probably try to get opinions about it with an RFC on
> moving tbf and htb schedulers to 64 bits if you're interested
> (decoupling it from your specific burst problem).

my burst problem is the only semi-legitimate motivation i can think
of. the only other possible motivations i can imagine are setting
"limit" to buffer more than 4GB of packets and setting "rate" to
something more than 32 gigabit; both of these seem kind of dubious. is
there something else you had in mind?

looking more at the netlink tc interface: why does the interface for
so many qdiscs consist of passing one big options struct as a single
netlink attr, instead of a bunch of individual attrs? this seems
contrary to the extensibility / flexibility spirit of netlink, and it
seems to be getting in the way of changing the interface. maybe i
should send an RFC about this instead ;)
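
(to be concrete: tbf's whole config travels as one fixed-layout struct
inside a single attribute -- paraphrasing from include/linux/pkt_sched.h
and sch_tbf.c, so the details below are from memory:)

struct tc_tbf_qopt {
	struct tc_ratespec	rate;
	struct tc_ratespec	peakrate;
	__u32			limit;
	__u32			buffer;
	__u32			mtu;
};

/* ...and the qdisc dumps/parses it as one opaque blob, so extending
 * the config means changing the struct layout rather than adding a
 * new attribute next to it: */
NLA_PUT(skb, TCA_TBF_PARMS, sizeof(opt), &opt);

individual attributes could grow new fields without breaking old
binaries, which is pretty much the versioning problem from my first
mail.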

Rick Jones

Oct 12, 2010, 6:17:18 PM10/12/10
to Jarek Poplawski, Steven Brudenell, net...@vger.kernel.org
>>my burst problem is the only semi-legitimate motivation i can think
>>of. the only other possible motivations i can imagine are setting
>>"limit" to buffer more than 4GB of packets and setting "rate" to
>>something more than 32 gigabit; both of these seem kind of dubious. is
>>there something else you had in mind?
>
>
> No, mainly 10 gigabit rates and additionally 64-bit stats.

Any issue for bonded 10 GbE interfaces? Now that the IEEE have ratified (June)
how far out are 40 GbE interfaces? Or 100 GbE for that matter.

rick jones

Jarek Poplawski

Oct 13, 2010, 2:26:49 AM10/13/10
to Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Tue, Oct 12, 2010 at 03:17:18PM -0700, Rick Jones wrote:
>>> my burst problem is the only semi-legitimate motivation i can think
>>> of. the only other possible motivations i can imagine are setting
>>> "limit" to buffer more than 4GB of packets and setting "rate" to
>>> something more than 32 gigabit; both of these seem kind of dubious. is
>>> there something else you had in mind?
>>
>>
>> No, mainly 10 gigabit rates and additionally 64-bit stats.
>
> Any issue for bonded 10 GbE interfaces? Now that the IEEE have ratified
> (June) how far out are 40 GbE interfaces? Or 100 GbE for that matter.

Alas packet schedulers using rate tables are still around 1G. Above 2G
they get less and less accurate, so hfsc is recommended.

Jarek P.

Bill Fink

Oct 13, 2010, 11:36:53 PM10/13/10
to Jarek Poplawski, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Wed, 13 Oct 2010, Jarek Poplawski wrote:

> On Tue, Oct 12, 2010 at 03:17:18PM -0700, Rick Jones wrote:
> >>> my burst problem is the only semi-legitimate motivation i can think
> >>> of. the only other possible motivations i can imagine are setting
> >>> "limit" to buffer more than 4GB of packets and setting "rate" to
> >>> something more than 32 gigabit; both of these seem kind of dubious. is
> >>> there something else you had in mind?
> >>
> >>
> >> No, mainly 10 gigabit rates and additionally 64-bit stats.
> >
> > Any issue for bonded 10 GbE interfaces? Now that the IEEE have ratified
> > (June) how far out are 40 GbE interfaces? Or 100 GbE for that matter.
>
> Alas packet schedulers using rate tables are still around 1G. Above 2G
> they get less and less accurate, so hfsc is recommended.

I was just trying to do an 8 Gbps rate limit on a 10-GigE path,
and couldn't get it to work with either htb or tbf. Are you
saying this currently isn't possible? Or are you saying to use
this hfsc mechanism, which there doesn't seem to be a man page
for?

-Bill

Bill Fink

Oct 14, 2010, 2:34:07 AM10/14/10
to Eric Dumazet, Jarek Poplawski, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Thu, 14 Oct 2010, Eric Dumazet wrote:

> On Wednesday, 13 October 2010 at 23:36 -0400, Bill Fink wrote:
>
> > I was just trying to do an 8 Gbps rate limit on a 10-GigE path,
> > and couldn't get it to work with either htb or tbf. Are you
> > saying this currently isn't possible? Or are you saying to use
> > this hfsc mechanism, which there doesn't seem to be a man page
> > for?
>

> man pages ? Oh well...
>
> 8Gbps rate limit sounds very optimistic with a central lock and one
> queue...
>
> Maybe its possible to split this into 8 x 1Gbps, using 8 queues...
> or 16 x 500 Mbps

Not when I'm trying to rate limit a single flow.

Bill Fink

Oct 14, 2010, 3:13:54 AM10/14/10
to Jarek Poplawski, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Thu, 14 Oct 2010, Jarek Poplawski wrote:

> On Wed, Oct 13, 2010 at 11:36:53PM -0400, Bill Fink wrote:
> > On Wed, 13 Oct 2010, Jarek Poplawski wrote:
> >
> > > On Tue, Oct 12, 2010 at 03:17:18PM -0700, Rick Jones wrote:
> > > >>> my burst problem is the only semi-legitimate motivation i can think
> > > >>> of. the only other possible motivations i can imagine are setting
> > > >>> "limit" to buffer more than 4GB of packets and setting "rate" to
> > > >>> something more than 32 gigabit; both of these seem kind of dubious. is
> > > >>> there something else you had in mind?
> > > >>
> > > >>
> > > >> No, mainly 10 gigabit rates and additionally 64-bit stats.
> > > >
> > > > Any issue for bonded 10 GbE interfaces? Now that the IEEE have ratified
> > > > (June) how far out are 40 GbE interfaces? Or 100 GbE for that matter.
> > >
> > > Alas packet schedulers using rate tables are still around 1G. Above 2G
> > > they get less and less accurate, so hfsc is recommended.
> >
> > I was just trying to do an 8 Gbps rate limit on a 10-GigE path,
> > and couldn't get it to work with either htb or tbf. Are you
> > saying this currently isn't possible?
>

> Let's start from reminding that no precise packet scheduling should be
> expected with gso/tso etc. turned on. I don't know current hardware
> limits for such a non-gso traffic, but for 8 Gbit rate htb or tbf
> would definitely have wrong rate tables (overflowed values) for packet
> sizes below 1500 bytes.

TSO/GSO was disabled and was using 9000-byte jumbo frames
(and specified mtu 9000 to tc command).

Here was one attempt I made using tbf:

tc qdisc add dev eth2 root handle 1: prio
tc qdisc add dev eth2 parent 1:1 handle 10: tbf rate 8900mbit buffer 1112500 limit 10000 mtu 9000
tc filter add dev eth2 protocol ip parent 1: prio 1 u32 match ip dst 192.168.1.23 flowid 10:1

I tried many variations of the above, all without success.

> > Or are you saying to use
> > this hfsc mechanism, which there doesn't seem to be a man page
> > for?
>

> There was a try:
> http://lists.openwall.net/netdev/2009/02/26/138

Thanks for the pointer. I will check it out later in detail,
but I'm already having difficulty with deciding if I have the
tc commands right for tbf and htb, and hfsc looks even more
involved.

Jarek Poplawski

Oct 14, 2010, 2:44:05 AM10/14/10
to Bill Fink, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Wed, Oct 13, 2010 at 11:36:53PM -0400, Bill Fink wrote:
> On Wed, 13 Oct 2010, Jarek Poplawski wrote:
>
> > On Tue, Oct 12, 2010 at 03:17:18PM -0700, Rick Jones wrote:
> > >>> my burst problem is the only semi-legitimate motivation i can think
> > >>> of. the only other possible motivations i can imagine are setting
> > >>> "limit" to buffer more than 4GB of packets and setting "rate" to
> > >>> something more than 32 gigabit; both of these seem kind of dubious. is
> > >>> there something else you had in mind?
> > >>
> > >>
> > >> No, mainly 10 gigabit rates and additionally 64-bit stats.
> > >
> > > Any issue for bonded 10 GbE interfaces? Now that the IEEE have ratified
> > > (June) how far out are 40 GbE interfaces? Or 100 GbE for that matter.
> >
> > Alas packet schedulers using rate tables are still around 1G. Above 2G
> > they get less and less accurate, so hfsc is recommended.
>
> I was just trying to do an 8 Gbps rate limit on a 10-GigE path,
> and couldn't get it to work with either htb or tbf. Are you
> saying this currently isn't possible?

Let's start from reminding that no precise packet scheduling should be
expected with gso/tso etc. turned on. I don't know current hardware
limits for such a non-gso traffic, but for 8 Gbit rate htb or tbf
would definitely have wrong rate tables (overflowed values) for packet
sizes below 1500 bytes.
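
Ignoring the overflow problem entirely, even the plain rounding to the
scheduler clock hurts at these rates. A rough back-of-the-envelope,
assuming the current 64 ns resolution (PSCHED_SHIFT = 6) and 8 Gbit:

#include <stdio.h>

/* rough illustration only: per-packet transmit times at 8 Gbit/s,
 * quantized to 64 ns scheduler ticks as a rate table would store them */
int main(void)
{
	const double bytes_per_sec = 8e9 / 8;
	const int sizes[] = { 40, 100, 300, 1500, 9000 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		double ns = sizes[i] * 1e9 / bytes_per_sec;
		unsigned int ticks = (unsigned int)(ns / 64);

		printf("%5d bytes: %6.0f ns -> %3u ticks (%+.0f%% error)\n",
		       sizes[i], ns, ticks, 100.0 * (ticks * 64 - ns) / ns);
	}
	return 0;
}

So below ~1500 bytes the stored cost is off by a lot, or even zero.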

> Or are you saying to use
> this hfsc mechanism, which there doesn't seem to be a man page
> for?

There was a try:
http://lists.openwall.net/netdev/2009/02/26/138

Jarek P.

Bill Fink

Oct 15, 2010, 2:37:49 AM10/15/10
to Jarek Poplawski, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Thu, 14 Oct 2010, Jarek Poplawski wrote:

> On Thu, Oct 14, 2010 at 08:09:39AM +0000, Jarek Poplawski wrote:


> > On Thu, Oct 14, 2010 at 03:13:54AM -0400, Bill Fink wrote:
> > > TSO/GSO was disabled and was using 9000-byte jumbo frames
> > > (and specified mtu 9000 to tc command).
> > >
> > > Here was one attempt I made using tbf:
> > >
> > > tc qdisc add dev eth2 root handle 1: prio
> > > tc qdisc add dev eth2 parent 1:1 handle 10: tbf rate 8900mbit buffer 1112500 limit 10000 mtu 9000
> > > tc filter add dev eth2 protocol ip parent 1: prio 1 u32 match ip dst 192.168.1.23 flowid 10:1
> > >
> > > I tried many variations of the above, all without success.
> >

> > The main problem are smaller packets. If you had (almost) only 9000b
> > frames this probably could work. [...]
>
> On the other hand, e.g. the limit above seems too low wrt mtu & rate.

Actually, I discovered my commands above work just fine on
a 2.6.35 box:

i7test7% nuttcp -T10 -i1 192.168.1.17
1045.3125 MB / 1.00 sec = 8768.3573 Mbps 0 retrans
1045.6875 MB / 1.00 sec = 8772.0292 Mbps 0 retrans
1049.5625 MB / 1.00 sec = 8804.2627 Mbps 0 retrans
1043.1875 MB / 1.00 sec = 8750.9960 Mbps 0 retrans
1048.6875 MB / 1.00 sec = 8796.3246 Mbps 0 retrans
1033.4375 MB / 1.00 sec = 8669.3188 Mbps 0 retrans
1040.7500 MB / 1.00 sec = 8730.7057 Mbps 0 retrans
1047.0000 MB / 1.00 sec = 8783.2063 Mbps 0 retrans
1040.0000 MB / 1.00 sec = 8724.0564 Mbps 0 retrans
1037.4375 MB / 1.00 sec = 8702.5434 Mbps 0 retrans

10431.5608 MB / 10.00 sec = 8749.7542 Mbps 25 %TX 35 %RX 0 retrans 0.11 msRTT

The problems I encountered were on a field system running
2.6.30.10. I will investigate upgrading the field system
to 2.6.35.

Eric Dumazet

Oct 15, 2010, 2:44:19 AM10/15/10
to Bill Fink, Jarek Poplawski, Rick Jones, Steven Brudenell, net...@vger.kernel.org

Yes, I noticed the same thing here on net-next-2.6.

Please report :

tc -s -d qdisc

Jarek Poplawski

Oct 15, 2010, 4:18:26 AM10/15/10
to Bill Fink, Rick Jones, Steven Brudenell, net...@vger.kernel.org

Bill Fink

Oct 15, 2010, 5:37:46 PM10/15/10
to Eric Dumazet, Jarek Poplawski, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Fri, 15 Oct 2010, Eric Dumazet wrote:

> On Friday, 15 October 2010 at 02:37 -0400, Bill Fink wrote:

Not sure why you want this on the older 2.6.30.10 kernel,
but here it is:

i7test6% nuttcp -T10 -i1 192.168.1.14
1169.1875 MB / 1.00 sec = 9807.2868 Mbps 0 retrans
1181.1875 MB / 1.00 sec = 9908.9054 Mbps 0 retrans
1181.1250 MB / 1.00 sec = 9907.9253 Mbps 0 retrans
1181.1875 MB / 1.00 sec = 9908.4991 Mbps 0 retrans
1180.6875 MB / 1.00 sec = 9904.3345 Mbps 0 retrans
1181.1250 MB / 1.00 sec = 9908.0838 Mbps 0 retrans
1181.1875 MB / 1.00 sec = 9908.4099 Mbps 0 retrans
1181.0625 MB / 1.00 sec = 9907.3911 Mbps 0 retrans
1181.3750 MB / 1.00 sec = 9910.2801 Mbps 0 retrans
1181.1875 MB / 1.00 sec = 9908.2118 Mbps 0 retrans

11801.1382 MB / 10.04 sec = 9858.7159 Mbps 24 %TX 40 %RX 0 retrans 0.11 msRTT

i7test6% tc -s -d qdisc show dev eth2
qdisc prio 1: root refcnt 32 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 12448974085 bytes 1381173 pkt (dropped 266, overlimits 0 requeues 12)
rate 0bit 0pps backlog 0b 0p requeues 12
qdisc tbf 10: parent 1:1 rate 8900Mbit burst 1111387b/64 mpu 0b lat 4295.0s
Sent 12448974043 bytes 1381172 pkt (dropped 266, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

I'm guessing this is probably related to the scheduler's
time resolution issue that Jarek mentioned.

And for completeness, here's the info for the working
2.6.35 case:

i7test7% nuttcp -T10 -i1 192.168.1.17

1045.5625 MB / 1.00 sec = 8770.6210 Mbps 0 retrans
1032.1875 MB / 1.00 sec = 8658.3825 Mbps 0 retrans
1039.8125 MB / 1.00 sec = 8722.7801 Mbps 0 retrans
1050.2500 MB / 1.00 sec = 8810.0739 Mbps 0 retrans
1050.6875 MB / 1.00 sec = 8813.9378 Mbps 0 retrans
1048.8125 MB / 1.00 sec = 8798.0857 Mbps 0 retrans
1046.1875 MB / 1.00 sec = 8775.9954 Mbps 0 retrans
1045.7500 MB / 1.00 sec = 8771.9307 Mbps 0 retrans
1051.1250 MB / 1.00 sec = 8817.8900 Mbps 0 retrans
1044.0625 MB / 1.00 sec = 8757.8019 Mbps 0 retrans

10454.7500 MB / 10.00 sec = 8769.2206 Mbps 26 %TX 35 %RX 0 retrans 0.11 msRTT

i7test7% tc -s -d qdisc show dev eth2
qdisc prio 1: root refcnt 33 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 11028687119 bytes 1223828 pkt (dropped 293, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc tbf 10: parent 1:1 rate 8900Mbit burst 1112500b/64 mpu 0b lat 4295.0s
Sent 11028687077 bytes 1223827 pkt (dropped 293, overlimits 593 requeues 0)
backlog 0b 0p requeues 0

I'm not sure how you can have so many dropped but not have
any TCP retransmissions (or not show up as requeues). But
there's probably something basic I just don't understand
about how all this stuff works.

-Bill

Jarek Poplawski

Oct 15, 2010, 6:05:35 PM10/15/10
to Bill Fink, Eric Dumazet, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Fri, Oct 15, 2010 at 05:37:46PM -0400, Bill Fink wrote:
...

> i7test7% tc -s -d qdisc show dev eth2
> qdisc prio 1: root refcnt 33 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> Sent 11028687119 bytes 1223828 pkt (dropped 293, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> qdisc tbf 10: parent 1:1 rate 8900Mbit burst 1112500b/64 mpu 0b lat 4295.0s
> Sent 11028687077 bytes 1223827 pkt (dropped 293, overlimits 593 requeues 0)
> backlog 0b 0p requeues 0
>
> I'm not sure how you can have so many dropped but not have
> any TCP retransmissions (or not show up as requeues). But
> there's probably something basic I just don't understand
> about how all this stuff works.

Me either, but it seems higher "limit" might help with these drops.

Jarek P.

Bill Fink

Oct 16, 2010, 12:51:06 AM10/16/10
to Jarek Poplawski, Eric Dumazet, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Sat, 16 Oct 2010, Jarek Poplawski wrote:

> On Fri, Oct 15, 2010 at 05:37:46PM -0400, Bill Fink wrote:
> ...
> > i7test7% tc -s -d qdisc show dev eth2
> > qdisc prio 1: root refcnt 33 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> > Sent 11028687119 bytes 1223828 pkt (dropped 293, overlimits 0 requeues 0)
> > backlog 0b 0p requeues 0
> > qdisc tbf 10: parent 1:1 rate 8900Mbit burst 1112500b/64 mpu 0b lat 4295.0s
> > Sent 11028687077 bytes 1223827 pkt (dropped 293, overlimits 593 requeues 0)
> > backlog 0b 0p requeues 0
> >
> > I'm not sure how you can have so many dropped but not have
> > any TCP retransmissions (or not show up as requeues). But
> > there's probably something basic I just don't understand
> > about how all this stuff works.
>
> Me either, but it seems higher "limit" might help with these drops.

You were of course correct about the higher limit helping.
I finally upgraded the field system to 2.6.35, and did some
testing on the real data path of interest, which has an RTT
of about 29 ms. I set up a rate limit of 8 Gbps using the
following commands:

tc qdisc add dev eth2 root handle 1: prio

tc qdisc add dev eth2 parent 1:1 handle 10: tbf rate 8000mbit limit 35000000 burst 20000 mtu 9000
tc filter add dev eth2 protocol ip parent 1: prio 1 u32 match ip protocol 6 0xff match ip dst 192.168.1.23 flowid 10:1

hecn-i7sl1% nuttcp -T10 -i1 -w50m 192.168.1.23
676.3750 MB / 1.00 sec = 5673.4646 Mbps 0 retrans
948.5625 MB / 1.00 sec = 7957.1508 Mbps 0 retrans
948.8125 MB / 1.00 sec = 7959.5902 Mbps 0 retrans
948.3750 MB / 1.00 sec = 7955.5382 Mbps 0 retrans
949.0000 MB / 1.00 sec = 7960.6696 Mbps 0 retrans
948.7500 MB / 1.00 sec = 7958.7873 Mbps 0 retrans
948.6875 MB / 1.00 sec = 7958.0959 Mbps 0 retrans
948.6250 MB / 1.00 sec = 7957.4205 Mbps 0 retrans
948.7500 MB / 1.00 sec = 7958.7237 Mbps 0 retrans
948.4375 MB / 1.00 sec = 7956.3648 Mbps 0 retrans

9270.5625 MB / 10.09 sec = 7707.7457 Mbps 24 %TX 36 %RX 0 retrans 29.38 msRTT

hecn-i7sl1% tc -s -d qdisc show dev eth2
qdisc prio 1: root refcnt 33 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 9779476756 bytes 1084943 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc tbf 10: parent 1:1 rate 8000Mbit burst 19000b/64 mpu 0b lat 35.0ms
Sent 9779476756 bytes 1084943 pkt (dropped 0, overlimits 1831360 requeues 0)
backlog 0b 0p requeues 0

No drops!

BTW the effective rate limit seems to be a very coarse adjustment
at these speeds. I was seeing some data path issues at 8.9 Gbps
so I tried setting slightly lower rates such as 8.8 Gbps, 8.7 Gbps,
etc, but they still gave me an effective rate limit of about 8.9 Gbps.
It wasn't until I got down to a setting of 8 Gbps that I actually
got an effective rate limit of 8 Gbps.

Also the man page for tbf seems to be wrong/misleading about
the burst parameter. It states:

"If your buffer is too small, packets may be dropped because more
tokens arrive per timer tick than fit in your bucket. The minimum
buffer size can be calculated by dividing the rate by HZ.

According to that, with a rate of 8 Gbps and HZ=1000, the minimum
burst should be 1000000 bytes. But my testing shows that a burst
of just 20000 works just fine. That's only two 9000-byte packets
or about 20 usec of traffic at the 8 Gbps rate. Using too large
a value for burst can actually be harmful as it allows the traffic
to temporarily exceed the desired rate limit.

-Thanks

-Bill

Jarek Poplawski

Oct 16, 2010, 4:58:24 PM10/16/10
to Bill Fink, Eric Dumazet, Rick Jones, Steven Brudenell, net...@vger.kernel.org

As I mentioned before, it could work, but your config is really on
the edge. Anyway, if a buffer below the minimum size is needed,
something else is definitely wrong. (Btw, this size can matter less
with high-resolution timers.) You could try whether my iproute patch
"tc_core: Use double in tc_core_time2tick()" (not merged) helps
here. While googling for this patch I found the page below, which
might be interesting to you (besides the link at the end to the
thread with the patch; take 1 or 2, it shouldn't matter):

http://code.google.com/p/pspacer/wiki/HTBon10GbE

If it doesn't help reconsider hfsc.

Thanks,
Jarek P.

Bill Fink

Oct 16, 2010, 9:24:34 PM10/16/10
to Jarek Poplawski, Eric Dumazet, Rick Jones, Steven Brudenell, net...@vger.kernel.org

Thanks for the link. From his results, it appears you can
get better accuracy by keeping TSO/GSO enabled and upping
the tc mtu parameter to 64000. I will have to try that out.

For the very high bandwidth cases I tend to deal with, would
there be any advantage to further reducing the PSCHED_SHIFT
from its current value of 6?

-Bill

Jarek Poplawski

Oct 17, 2010, 4:36:18 PM10/17/10
to Bill Fink, Eric Dumazet, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Sat, Oct 16, 2010 at 09:24:34PM -0400, Bill Fink wrote:
> On Sat, 16 Oct 2010, Jarek Poplawski wrote:
...

> > http://code.google.com/p/pspacer/wiki/HTBon10GbE
> >
> > If it doesn't help reconsider hfsc.
>
> Thanks for the link. From his results, it appears you can
> get better accuracy by keeping TSO/GSO enabled and upping
> the tc mtu parameter to 64000. I will have to try that out.

Sure, but you have to remember that scheduler doesn't know real packet
sizes and rate tables are less accurate especially for smaller packets,
so it depends on conditions.

> For the very high bandwidth cases I tend to deal with, would
> there be any advantage to further reducing the PSCHED_SHIFT
> from its current value of 6?

If you don't use low rates and/or large buffers it might be a good
idea, especially on x86-64 (with 32-bit longs, htb needs some changes
before this value can go below 5).
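
For reference, PSCHED_SHIFT only controls the clock granularity --
include/net/pkt_sched.h, more or less:

/* one scheduler tick is (1 << PSCHED_SHIFT) ns, i.e. 64 ns today;
 * a smaller shift gives finer timing, but makes the 32-bit tick
 * values (e.g. buffer/burst) overflow sooner. */
#define PSCHED_SHIFT		6
#define PSCHED_TICKS2NS(x)	((s64)(x) << PSCHED_SHIFT)
#define PSCHED_NS2TICKS(x)	((x) >> PSCHED_SHIFT)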

Jarek P.

Bill Fink

Oct 19, 2010, 3:37:24 AM10/19/10
to Jarek Poplawski, Eric Dumazet, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Sun, 17 Oct 2010, Jarek Poplawski wrote:

> On Sat, Oct 16, 2010 at 09:24:34PM -0400, Bill Fink wrote:
> > On Sat, 16 Oct 2010, Jarek Poplawski wrote:
> ...
> > > http://code.google.com/p/pspacer/wiki/HTBon10GbE
> > >
> > > If it doesn't help reconsider hfsc.
> >
> > Thanks for the link. From his results, it appears you can
> > get better accuracy by keeping TSO/GSO enabled and upping
> > the tc mtu parameter to 64000. I will have to try that out.
>
> Sure, but you have to remember that scheduler doesn't know real packet
> sizes and rate tables are less accurate especially for smaller packets,
> so it depends on conditions.

On my testing on the real data path, TSO/GSO enabled did seem
to give more accurate results for a single stream. But when
I tried multiple 10-GigE paths simultaneously, each with a
single stream across it, non-TSO/GSO seemed to fare better
overall.

-Bill

Jarek Poplawski

Oct 20, 2010, 7:06:13 AM10/20/10
to Bill Fink, Eric Dumazet, Rick Jones, Steven Brudenell, net...@vger.kernel.org
On Tue, Oct 19, 2010 at 03:37:24AM -0400, Bill Fink wrote:
> On Sun, 17 Oct 2010, Jarek Poplawski wrote:
>
> > On Sat, Oct 16, 2010 at 09:24:34PM -0400, Bill Fink wrote:
> > > On Sat, 16 Oct 2010, Jarek Poplawski wrote:
> > ...
> > > > http://code.google.com/p/pspacer/wiki/HTBon10GbE
> > > >
> > > > If it doesn't help reconsider hfsc.
> > >
> > > Thanks for the link. From his results, it appears you can
> > > get better accuracy by keeping TSO/GSO enabled and upping
> > > the tc mtu parameter to 64000. I will have to try that out.
> >
> > Sure, but you have to remember that scheduler doesn't know real packet
> > sizes and rate tables are less accurate especially for smaller packets,
> > so it depends on conditions.
>
> On my testing on the real data path, TSO/GSO enabled did seem
> to give more accurate results for a single stream. But when
> I tried multiple 10-GigE paths simultaneously, each with a
> single stream across it, non-TSO/GSO seemed to fare better
> overall.

Btw, if you find time I would be interested in checking the opposite
concept: an mtu lower than the real one (e.g. mtu 256 in your tbf
line), so the rate tables get used in a different way (the other tbf
parameters unchanged). The patch below is needed for this to work.

Thanks,
Jarek P.
---

diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 641a30d..9ac3460 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -123,9 +123,6 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc* sch)
 	struct tbf_sched_data *q = qdisc_priv(sch);
 	int ret;
 
-	if (qdisc_pkt_len(skb) > q->max_size)
-		return qdisc_reshape_fail(skb, sch);
-
 	ret = qdisc_enqueue(skb, q->qdisc);
 	if (ret != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret))
