HTB accuracy for high speed

Antonio Almeida

unread,

May 15, 2009, 10:49:31 AM5/15/09

to net...@vger.kernel.org, jar...@gmail.com, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

Hi!
I've been using HTB in a Linux bridge and recently I noticed that, for
high speed, the configured rate/ceil is not respected as for lower
speeds.
I'm using a packet generator/analyser to inject over 950Mpbs, and see
what returns back to it, in the other side of my bridge. Generated
packets have 800bytes. I noticed that, for several tc HTB rate/ceil
configurations the amount of traffic received by the analyser stays
the same. See this values:

HTB conf      Analyser reception
476000Kbit    544.260.329
500000Kbit    545.880.017
510000Kbit    544.489.469
512000Kbit    546.890.972
-------------------------
513000Kbit    596.061.383
520000Kbit    596.791.866
550000Kbit    596.543.271
554000Kbit    596.193.545
-------------------------
555000Kbit    654.773.221
570000Kbit    654.996.381
590000Kbit    655.363.253
605000Kbit    654.112.017
-------------------------
606000Kbit    728.262.237
665000Kbit    727.014.365
-------------------------

There are these steps and it looks like doesn't matter if I configure
HTB to 555Mbit or to 605Mbit - the result is the same: 654Mbit. This
is 18% more traffic than the configured value. I also realise that for
smaller packets it gets worse, reaching 30% more traffic than what I
configured. For packets of 1514bytes the accuracy is quiet good.
I'm using kernel 2.6.25

My 'tc -s -d class ls dev eth1' output:

class htb 1:10 parent 1:2 rate 1000Mbit ceil 1000Mbit burst 126375b/8
mpu 0b overhead 0b cburst 126375b/8 mpu 0b overhead 0b level 5
Sent 51888579644 bytes 62067679 pkt (dropped 0, overlimits 0 requeues 0)
rate 653124Kbit 97656pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 113 ctokens: 113

class htb 1:1 root rate 1000Mbit ceil 1000Mbit burst 126375b/8 mpu 0b
overhead 0b cburst 126375b/8 mpu 0b overhead 0b level 7
Sent 51888579644 bytes 62067679 pkt (dropped 0, overlimits 0 requeues 0)
rate 653123Kbit 97656pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 113 ctokens: 113

class htb 1:2 parent 1:1 rate 1000Mbit ceil 1000Mbit burst 126375b/8
mpu 0b overhead 0b cburst 126375b/8 mpu 0b overhead 0b level 6
Sent 51888579644 bytes 62067679 pkt (dropped 0, overlimits 0 requeues 0)
rate 653124Kbit 97656pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: 113 ctokens: 113

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
Sent 51888579644 bytes 62067679 pkt (dropped 27801917, overlimits 0 requeues 0)
rate 653124Kbit 97656pps backlog 0b 0p requeues 0
lended: 62067679 borrowed: 0 giants: 0
tokens: -798 ctokens: -798

As you can see, class htb 1:108 rate's is 653124Kbit! Much bigger that
it's ceil.

I also note that, for HTB rate configurations over 500Mbit/s on leaf
class, when I stop the traffic, in the output of "tc -s -d class ls
dev eth1" command, I see that leaf's rate (in bits/s) is growing
instead of decreasing (as expected since I've stopped the traffic).
Rate in pps is ok and decreases until 0pps. Rate in bits/s increases
above 1000Mbit and stays there for a few minutes. After two or three
minutes it becomes 0bit. The same happens for it's ancestors (also for
root class).Here's tc output of my leaf class for this situation:

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
Sent 120267768144 bytes 242475339 pkt (dropped 62272599, overlimits 0
requeues 0)
rate 1074Mbit 0pps backlog 0b 0p requeues 0
lended: 242475339 borrowed: 0 giants: 0
tokens: 8 ctokens: 8

Antonio Almeida
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Stephen Hemminger

unread,

May 15, 2009, 2:12:27 PM5/15/09

to Antonio Almeida, net...@vger.kernel.org, jar...@gmail.com, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

You are probably hitting the limit of the timer resolution. So it matters
what the clock source is.
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Also, is HFSC any better than HTB?

--

Jarek Poplawski

unread,

May 16, 2009, 4:31:58 AM5/16/09

to Antonio Almeida, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

On Fri, May 15, 2009 at 03:49:31PM +0100, Antonio Almeida wrote:
> Hi!
> I've been using HTB in a Linux bridge and recently I noticed that, for
> high speed, the configured rate/ceil is not respected as for lower
> speeds.
> I'm using a packet generator/analyser to inject over 950Mpbs, and see
> what returns back to it, in the other side of my bridge. Generated
> packets have 800bytes. I noticed that, for several tc HTB rate/ceil
> configurations the amount of traffic received by the analyser stays
> the same. See this values:
>
> HTB conf Analyser reception
> 476000Kbit 544.260.329

...

> As you can see, class htb 1:108 rate's is 653124Kbit! Much bigger that
> it's ceil.

Is it for sure there is no gso/tso enabled on this dev (with up to
date ethtool -k)? It would be nice to see also more details like
.config, ifconfigs before and after the test, tc -s qdisc and bytes/
packet number seen by this analyser, plus maybe some proof you can
obtain such flows with something simpler like tbf. Of course using
the current kernel, even if no difference, would give us more
valuable perspective.

Thanks,
Jarek P.

Jarek Poplawski

unread,

May 16, 2009, 10:14:30 AM5/16/09

to Antonio Almeida, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

On Fri, May 15, 2009 at 03:49:31PM +0100, Antonio Almeida wrote:

...

> I also note that, for HTB rate configurations over 500Mbit/s on leaf
> class, when I stop the traffic, in the output of "tc -s -d class ls
> dev eth1" command, I see that leaf's rate (in bits/s) is growing
> instead of decreasing (as expected since I've stopped the traffic).
> Rate in pps is ok and decreases until 0pps. Rate in bits/s increases
> above 1000Mbit and stays there for a few minutes. After two or three
> minutes it becomes 0bit. The same happens for it's ancestors (also for
> root class).Here's tc output of my leaf class for this situation:
>
> class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
> 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
> 70901b/8 mpu 0b overhead 0b level 0
> Sent 120267768144 bytes 242475339 pkt (dropped 62272599, overlimits 0
> requeues 0)
> rate 1074Mbit 0pps backlog 0b 0p requeues 0
> lended: 242475339 borrowed: 0 giants: 0
> tokens: 8 ctokens: 8

This looks like a regular bug. I guess it's an overflow in
gen_estimator(), but I'm not sure there is nothing more. Could you
try the patch below? (An offset warning when patching 2.6.25 is OK)

Thanks,
Jarek P.
---

net/core/gen_estimator.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..87f0ced 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -127,7 +127,11 @@ static void est_timer(unsigned long arg)
npackets = e->bstats->packets;
rate = (nbytes - e->last_bytes)<<(7 - idx);
e->last_bytes = nbytes;
- e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
+ if (rate > e->avbps)
+ e->avbps += (rate - e->avbps) >> e->ewma_log;
+ else
+ e->avbps -= (e->avbps - rate) >> e->ewma_log;
+
e->rate_est->bps = (e->avbps+0xF)>>5;

rate = (npackets - e->last_packets)<<(12 - idx);

Jarek Poplawski

unread,

May 17, 2009, 4:15:28 PM5/17/09

to Antonio Almeida, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

On Fri, May 15, 2009 at 03:49:31PM +0100, Antonio Almeida wrote:

Here is some additional explanation. It looks like these rates above
500Mbit hit the design limits of packet scheduling. Currently used
internal resolution PSCHED_TICKS_PER_SEC is 1,000,000. 550Mbit rate
with 800byte packets means 550M/8/800 = 85938 packets/s, so on average
1000000/85938 = 11.6 ticks per packet. Accounting only 11 ticks means
we leave 0.6*85938 = 51563 ticks per second, letting for additional
sending of 51563/11 = 4687 packets/s or 4687*800*8 = 30Mbit. Of course
it could be worse (0.9 tick/packet lost) depending on packet sizes vs.
rates, and the effect rises for higher rates.

Jarek P.

Vladimir Ivashchenko

unread,

May 17, 2009, 4:29:28 PM5/17/09

to Antonio Almeida, net...@vger.kernel.org, jar...@gmail.com, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

Hi Antonio,

FYI, these are exactly the same problems I get in real life.
Check the later posts in "bond + tc regression" thread.

On Fri, May 15, 2009 at 03:49:31PM +0100, Antonio Almeida wrote:

--
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

Jarek Poplawski

unread,

May 18, 2009, 2:56:29 AM5/18/09

to Stephen Hemminger, Antonio Almeida, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

Return non-zero tc_calc_xmittime() for rate tables

While looking at the problem of HTB accuracy for high speed (~500Mbit
rates) I've found that rate tables have cells filled with zeros for
the smallest sizes. It means such packets aren't accounted at all.
Apart from the correctness of such configs, let's make it safe with
rather overaccounting than living it unlimited.

Reported-by: Antonio Almeida <vex...@gmail.com>
Signed-off-by: Jarek Poplawski <jar...@gmail.com>
---

tc/tc_core.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..14f25bc 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -58,7 +58,9 @@ unsigned tc_core_ktime2time(unsigned ktime)

unsigned tc_calc_xmittime(unsigned rate, unsigned size)
{
- return tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+ unsigned t;
+ t = tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+ return t ? : 1;
}

unsigned tc_calc_xmitsize(unsigned rate, unsigned ticks)

Jarek Poplawski

unread,

May 18, 2009, 3:01:34 AM5/18/09

to Stephen Hemminger, Antonio Almeida, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

-----------> (One misspelling fixed.)

Return non-zero tc_calc_xmittime() for rate tables

While looking at the problem of HTB accuracy for high speed (~500Mbit
rates) I've found that rate tables have cells filled with zeros for
the smallest sizes. It means such packets aren't accounted at all.
Apart from the correctness of such configs, let's make it safe with

rather overaccounting than leaving it unlimited.

Reported-by: Antonio Almeida <vex...@gmail.com>
Signed-off-by: Jarek Poplawski <jar...@gmail.com>
---

tc/tc_core.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/tc/tc_core.c b/tc/tc_core.c
index 9a0ff39..14f25bc 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -58,7 +58,9 @@ unsigned tc_core_ktime2time(unsigned ktime)

unsigned tc_calc_xmittime(unsigned rate, unsigned size)
{
- return tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+ unsigned t;
+ t = tc_core_time2tick(TIME_UNITS_PER_SEC*((double)size/rate));
+ return t ? : 1;
}

unsigned tc_calc_xmitsize(unsigned rate, unsigned ticks)

Antonio Almeida

unread,

May 18, 2009, 6:01:21 AM5/18/09

to Stephen Hemminger, net...@vger.kernel.org, jar...@gmail.com, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

Hi!

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
returns "jiffies"

With HFSC the accuracy is good. Also with packets of 800 bytes I got
these values:
received configured error
904596519 900000000 0,51
804293658 800000000 0,54
703662853 700000000 0,52
603354059 600000000 0,56
502805411 500000000 0,56
402527055 400000000 0,63
301484904 300000000 0,49
201074301 200000000 0,54
100546656 100000000 0,55

Thanks
Antonio Almeida

Antonio Almeida

unread,

May 18, 2009, 6:39:02 AM5/18/09

to Jarek Poplawski, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

Hi!

Here the information you asked:

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

The bridge is between eth0 and eth1

---------------------------
Before traffic starts:
---------------------------
Analyser sent bytes: 0
Analyser sent packets: 0
Analyser received bytes: 0
Analyser received packets: 0

# tc -s -d class ls dev eth1
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 990 ctokens: 990

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 990 ctokens: 990

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 990 ctokens: 990

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 999 ctokens: 999

# ifconfig
br0 Link encap:Ethernet HWaddr 00:E0:ED:10:7C:6C
UP BROADCAST RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

eth0 Link encap:Ethernet HWaddr 00:E0:ED:10:7C:6C
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:69617616 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4154463648 (3.8 GiB) TX bytes:0 (0.0 b)
Base address:0x4000 Memory:e8200000-e8220000

eth1 Link encap:Ethernet HWaddr 00:E0:ED:10:7C:6D
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:50262048 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:1554907136 (1.4 GiB)
Base address:0x4040 Memory:e8220000-e8240000

eth3 Link encap:Ethernet HWaddr 00:11:25:C4:60:AF
inet addr:192.168.0.244 Bcast:19.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:461403 errors:0 dropped:0 overruns:0 frame:0
TX packets:13573 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:34150991 (32.5 MiB) TX bytes:1247864 (1.1 MiB)
Interrupt:27

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:4 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:188 (188.0 b) TX bytes:188 (188.0 b)

# tc -s qdisc
qdisc pfifo_fast 0: dev eth3 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1
1 1 1 1 1 1
Sent 5459409 bytes 25647 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc htb 1: dev eth0 root r2q 10 default 0 direct_packets_stat 0
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 108: dev eth0 parent 1:108 limit 127p quantum 1514b perturb 15sec
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc htb 1: dev eth1 root r2q 10 default 0 direct_packets_stat 0
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 108: dev eth1 parent 1:108 limit 127p quantum 1514b perturb 15sec
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

---------------------
Traffic running:
---------------------

Analyser sent rate: 704218764 bits/s
Analyser received rate: 624942839 bits/s

# tc -s -d class ls dev eth1
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
Sent 5772939852 bytes 7252437 pkt (dropped 0, overlimits 0 requeues 0)
rate 624826Kbit 97169pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 402 ctokens: 402

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
Sent 5772939852 bytes 7252437 pkt (dropped 0, overlimits 0 requeues 0)
rate 624826Kbit 97169pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 402 ctokens: 402

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
Sent 5772939852 bytes 7252437 pkt (dropped 0, overlimits 0 requeues 0)
rate 624826Kbit 97169pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 402 ctokens: 402

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 5773001940 bytes 7252515 pkt (dropped 916587, overlimits 0 requeues 0)
rate 624826Kbit 97169pps backlog 0b 78p requeues 0
lended: 7252437 borrowed: 0 giants: 0
tokens: -10 ctokens: -10

# tc -s qdisc
qdisc pfifo_fast 0: dev eth3 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1
1 1 1 1 1 1
Sent 5611186 bytes 26259 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc htb 1: dev eth0 root r2q 10 default 0 direct_packets_stat 0
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 108: dev eth0 parent 1:108 limit 127p quantum 1514b perturb 15sec
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc htb 1: dev eth1 root r2q 10 default 0 direct_packets_stat 0
Sent 7122619144 bytes 8948014 pkt (dropped 1130906, overlimits
10090666 requeues 0)
rate 0bit 0pps backlog 0b 70p requeues 0
qdisc sfq 108: dev eth1 parent 1:108 limit 127p quantum 1514b perturb 15sec
Sent 7122619144 bytes 8948014 pkt (dropped 1130906, overlimits 0 requeues 0)
rate 0bit 0pps backlog 55720b 70p requeues 0

---------------------------
After traffic stopped:
---------------------------
(traffic ran for 170 seconds)

Analyser sent bytes: 15143884800
Analyser sent packets: 18929856
Analyser received bytes: 13444564800
Analyser received packets: 16805706

# tc -s -d class ls dev eth1
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
Sent 13377341976 bytes 16805706 pkt (dropped 0, overlimits 0 requeues 0)
rate 1061Mbit 2066pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 708 ctokens: 708

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
Sent 13377341976 bytes 16805706 pkt (dropped 0, overlimits 0 requeues 0)
rate 1061Mbit 2066pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 708 ctokens: 708

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
Sent 13377341976 bytes 16805706 pkt (dropped 0, overlimits 0 requeues 0)
rate 1061Mbit 2066pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 708 ctokens: 708

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 13377341976 bytes 16805706 pkt (dropped 2124150, overlimits 0 requeues 0)
rate 1061Mbit 2066pps backlog 0b 0p requeues 0
lended: 16805706 borrowed: 0 giants: 0
tokens: 503 ctokens: 503

# ifconfig
br0 Link encap:Ethernet HWaddr 00:E0:ED:10:7C:6C
UP BROADCAST RUNNING NOARP MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

eth0 Link encap:Ethernet HWaddr 00:E0:ED:10:7C:6C
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:88547472 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2118475264 (1.9 GiB) TX bytes:0 (0.0 b)
Base address:0x4000 Memory:e8200000-e8220000

eth1 Link encap:Ethernet HWaddr 00:E0:ED:10:7C:6D
UP BROADCAST RUNNING PROMISC MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:67067754 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:2114553248 (1.9 GiB)
Base address:0x4040 Memory:e8220000-e8240000

eth3 Link encap:Ethernet HWaddr 00:11:25:C4:60:AF
inet addr:192.168.0.244 Bcast:19.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:476452 errors:0 dropped:0 overruns:0 frame:0
TX packets:27435 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:35918090 (34.2 MiB) TX bytes:5939712 (5.6 MiB)
Interrupt:27

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:4 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:188 (188.0 b) TX bytes:188 (188.0 b)

# tc -s qdisc
qdisc pfifo_fast 0: dev eth3 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1
1 1 1 1 1 1
Sent 5623502 bytes 26347 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc htb 1: dev eth0 root r2q 10 default 0 direct_packets_stat 0
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 108: dev eth0 parent 1:108 limit 127p quantum 1514b perturb 15sec
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc htb 1: dev eth1 root r2q 10 default 0 direct_packets_stat 0
Sent 13377341976 bytes 16805706 pkt (dropped 2124150, overlimits
18953263 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 108: dev eth1 parent 1:108 limit 127p quantum 1514b perturb 15sec
Sent 13377341976 bytes 16805706 pkt (dropped 2124150, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

Thanks
Antonio Almeida

config

Jarek Poplawski

unread,

May 18, 2009, 6:45:59 AM5/18/09

to Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Vladimir Ivashchenko

On Mon, May 18, 2009 at 11:01:21AM +0100, Antonio Almeida wrote:
> Hi!
>
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> returns "jiffies"
>
> With HFSC the accuracy is good. Also with packets of 800 bytes I got
> these values:
> received configured error
> 904596519 900000000 0,51
> 804293658 800000000 0,54
> 703662853 700000000 0,52
> 603354059 600000000 0,56
> 502805411 500000000 0,56
> 402527055 400000000 0,63
> 301484904 300000000 0,49
> 201074301 200000000 0,54
> 100546656 100000000 0,55
>

Looks great! But, since HFSC uses rates directly (without rate tables)
seems to be logical. So, it looks like the best choice for handling
>100Mbit configs now.

Thanks,
Jarek P.

Jarek Poplawski

unread,

May 18, 2009, 7:14:11 AM5/18/09

to Antonio Almeida, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

On Mon, May 18, 2009 at 11:39:02AM +0100, Antonio Almeida wrote:
> Hi!
>

> Here the information you asked:

Very nice, but there are some questions:
- if this analyser uses tcp we definitely need tso off as well during
these tests,
- it would be nice to use two patches I've sent to exclude known (now)
reasons.

With the above I expect accuracy should be better, but definitely not
like hfsc (plus no higher than 1000Mbit rate reported after stopping
effect).

Thanks,
Jarek P.

>
> # ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
> udp fragmentation offload: off
> generic segmentation offload: off
>
> # ethtool -k eth1
> Offload parameters for eth1:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
> udp fragmentation offload: off
> generic segmentation offload: off
>
> The bridge is between eth0 and eth1

...

Antonio Almeida

unread,

May 18, 2009, 8:05:32 AM5/18/09

to Jarek Poplawski, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

The analyser traffic is tcp. Setting tso off the accuracy stays the same

# ethtool -K eth0 tso off
# ethtool -K eth1 tso off

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on

tcp segmentation offload: off

udp fragmentation offload: off
generic segmentation offload: off

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on

tcp segmentation offload: off

udp fragmentation offload: off
generic segmentation offload: off

# tc -s -d class ls dev eth1 | head -24

class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5

Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
rate 652715Kbit 97655pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0
tokens: 402 ctokens: 402

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7

Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
rate 652715Kbit 97655pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0
tokens: 402 ctokens: 402

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6

Sent 164938012460 bytes 206824215 pkt (dropped 0, overlimits 0 requeues 0)
rate 652715Kbit 97655pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0
tokens: 402 ctokens: 402

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 164938040048 bytes 206824248 pkt (dropped 25827911, overlimits 0
requeues 0)
rate 652715Kbit 97655pps backlog 0b 33p requeues 0
lended: 206824215 borrowed: 0 giants: 0
tokens: -6 ctokens: -6

I'm applying the patches now. I'll get back to you.

Antonio Almeida

unread,

May 18, 2009, 8:27:08 AM5/18/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Vladimir Ivashchenko

> Looks great! But, since HFSC uses rates directly (without rate tables)

This matter about the use of rate tables is not very familiar to me.
In fact I keep wondering a lot of things about what kernel does with
packets. Is there any documentation explaining how queue disciplines
work and how it interacts with netfilter and tc_core? What about
packets dispatching?

Thanks
Antonio Almeida

Jarek Poplawski

unread,

May 18, 2009, 8:32:37 AM5/18/09

to Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Vladimir Ivashchenko

On Mon, May 18, 2009 at 01:27:08PM +0100, Antonio Almeida wrote:
> > Looks great! But, since HFSC uses rates directly (without rate tables)
>
> This matter about the use of rate tables is not very familiar to me.
> In fact I keep wondering a lot of things about what kernel does with
> packets. Is there any documentation explaining how queue disciplines
> work and how it interacts with netfilter and tc_core? What about
> packets dispatching?

Here are a few links:
http://yesican.chsoft.biz/lartc/index.html

Jarek P.

Antonio Almeida

unread,

May 18, 2009, 10:36:00 AM5/18/09

to Jarek Poplawski, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

This patch works perfectly!
rate (bits/s) is now decreasing along with pps when I stop the traffic
(doesn't grow as it used to for rates over 500Mbtis/s).

# tc -s -d class ls dev eth1 | head -21 | tail -1
rate 651960Kbit 97482pps backlog 0b 0p requeues 0
rate 541134Kbit 80911pps backlog 0b 0p requeues 0
rate 405850Kbit 60683pps backlog 0b 0p requeues 0
rate 304388Kbit 45512pps backlog 0b 0p requeues 0
rate 304388Kbit 45512pps backlog 0b 0p requeues 0
rate 228291Kbit 34134pps backlog 0b 0p requeues 0
rate 171218Kbit 25601pps backlog 0b 0p requeues 0
rate 171218Kbit 25601pps backlog 0b 0p requeues 0
rate 128414Kbit 19201pps backlog 0b 0p requeues 0
rate 96310Kbit 14400pps backlog 0b 0p requeues 0
rate 96310Kbit 14400pps backlog 0b 0p requeues 0
rate 72233Kbit 10800pps backlog 0b 0p requeues 0
rate 54174Kbit 8100pps backlog 0b 0p requeues 0

Thank's to you!
Antonio Almeida

Stephen Hemminger

unread,

May 18, 2009, 12:13:14 PM5/18/09

to Antonio Almeida, net...@vger.kernel.org, jar...@gmail.com, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

On Mon, 18 May 2009 11:01:21 +0100
Antonio Almeida <vex...@gmail.com> wrote:

> Hi!
>
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> returns "jiffies"

That is the slowest of the choices. Better ones are hpet and tsc, but you
hardware doesn't support them.

You should compile your kernel with HZ=1000 and the resolution will be better
(but with some loss of performance).

Eric Dumazet

unread,

May 18, 2009, 12:40:56 PM5/18/09

to Jarek Poplawski, Antonio Almeida, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

Jarek Poplawski a écrit :

With a typical estimator "1sec 8sec", ewma_log value is 3

At gigabit speeds, we are very close to overflow yes, since
we only have 27 bits available, so 134217728 bytes per second
or 1073741824 bits per second.

So formula :

e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;

is going to overflow.

One way to avoid the overflow would be to use a smaller estimator, like "500ms 4sec"

Or use a 64bits rate & avbps, this is needed fo 10Gb speeds I suppose...

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..150e2f5 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -86,9 +86,9 @@ struct gen_estimator
spinlock_t *stats_lock;
int ewma_log;
u64 last_bytes;
+ u64 avbps;
u32 last_packets;
u32 avpps;
- u32 avbps;
struct rcu_head e_rcu;
struct rb_node node;
};
@@ -115,6 +115,7 @@ static void est_timer(unsigned long arg)
rcu_read_lock();
list_for_each_entry_rcu(e, &elist[idx].list, list) {
u64 nbytes;
+ u64 brate;
u32 npackets;
u32 rate;

@@ -125,9 +126,9 @@ static void est_timer(unsigned long arg)

nbytes = e->bstats->bytes;

npackets = e->bstats->packets;

- rate = (nbytes - e->last_bytes)<<(7 - idx);
+ brate = (nbytes - e->last_bytes)<<(7 - idx);

e->last_bytes = nbytes;
- e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;

+ e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;

Antonio Almeida

unread,

May 18, 2009, 12:54:18 PM5/18/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

I'm not sure if I'm able to test this patch. What do you mean with
"smallest sizes"? Are you talking about packet's size? What kind of
sizes?
When I feed my bridge with 950Mbits/s of packets with 800 bytes that
is close to 150.000pps and CPUs start to get busy. For packets 100
bytes long, 150.000pps would be close to 125Mbits/s and CPUs start to
get busy already, so I'm not able to get close to 500Mbits/s. For
rates near 125bits/s the bad accuracy is not so expressive. For
packets of 100 bytes increasing analyser sent traffic, at some point
is not HTB shaping but the CPU that can't process so many packets. I
might misunderstood your point.

I applied this tc_core.c patch and for packets of 800 bytes it had no
effect in HTB accuracy with rates over 500Mbit.
Anyway I also test it with packets of 100 bytes, generating 200Mbits,
and the result is the same as without this patch:

With the patch:

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate

100000Kbit ceil 100000Kbit burst 14087b/8 mpu 0b overhead 0b cburst
14087b/8 mpu 0b overhead 0b level 0
Sent 2187884640 bytes 22790465 pkt (dropped 8624566, overlimits 0 requeues 0)
rate 124946Kbit 162691pps backlog 0b 0p requeues 0
lended: 22790465 borrowed: 0 giants: 0
tokens: 180 ctokens: 180

Without the patch:

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate

100000Kbit ceil 100000Kbit burst 14087b/8 mpu 0b overhead 0b cburst
14087b/8 mpu 0b overhead 0b level 0
Sent 1260235680 bytes 13127455 pkt (dropped 4531299, overlimits 0 requeues 0)
rate 124575Kbit 162207pps backlog 0b 0p requeues 0
lended: 13127455 borrowed: 0 giants: 0
tokens: 123 ctokens: 123

Thanks
Antonio Almeida

Antonio Almeida

unread,

May 18, 2009, 1:16:26 PM5/18/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

I forgot to tell you that I used tc source code from iproute2-2.6.16.
I couldn't use the newest version because I got errors when compiling.

Antonio Almeida

unread,

May 18, 2009, 2:56:12 PM5/18/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet

Precise measurements:

800 bytes:

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate

555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
Sent 46793626324 bytes 57771194 pkt (dropped 29920019, overlimits 0 requeues 0)
rate 621714Kbit 97631pps backlog 0b 126p requeues 0
lended: 57771068 borrowed: 0 giants: 0
tokens: -8 ctokens: -8

850 bytes:

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate

555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
Sent 63422144616 bytes 77714246 pkt (dropped 41012275, overlimits 0 requeues 0)
rate 600699Kbit 88756pps backlog 0b 127p requeues 0
lended: 77714119 borrowed: 0 giants: 0
tokens: -11 ctokens: -11

900 bytes:

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate

555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0
Sent 76868403562 bytes 92835297 pkt (dropped 48565133, overlimits 0 requeues 0)
rate 636195Kbit 88755pps backlog 0b 126p requeues 0
lended: 92835171 borrowed: 0 giants: 0
tokens: -7 ctokens: -7

If you need more values you're free to ask.

Antonio Almeida

Jarek Poplawski

unread,

May 18, 2009, 3:05:41 PM5/18/09

to Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet

Since you're so kind... :-) There is a line in net/sched/sch_htb.c:

#define HTB_HYSTERESIS 1 /* whether to use mode hysteresis for speedup */

Could you change 1 to 0, and repeat these tests above after recompiling?

More thanks,

David Miller

unread,

May 18, 2009, 5:52:33 PM5/18/09

to jar...@gmail.com, da...@cosmosbay.com, vex...@gmail.com, net...@vger.kernel.org, ka...@trash.net, de...@cdi.cz

From: Jarek Poplawski <jar...@gmail.com>
Date: Mon, 18 May 2009 19:23:49 +0200

> On Mon, May 18, 2009 at 06:40:56PM +0200, Eric Dumazet wrote:
>> With a typical estimator "1sec 8sec", ewma_log value is 3
>>
>> At gigabit speeds, we are very close to overflow yes, since
>> we only have 27 bits available, so 134217728 bytes per second
>> or 1073741824 bits per second.
>>
>> So formula :
>> e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
>> is going to overflow.
>>
>> One way to avoid the overflow would be to use a smaller estimator, like "500ms 4sec"
>>
>> Or use a 64bits rate & avbps, this is needed fo 10Gb speeds I suppose...
>
> Yes, I considered this too, but because of an overhead I decided to
> fix as designed (according to the comment) for now. But probably you
> are right, and we should go further, so I'm OK with your patch.

I like this patch too, Eric can you submit this formally with
proper signoffs etc.?

Thanks!

Stephen Hemminger

unread,

May 18, 2009, 6:02:34 PM5/18/09

to Stephen Hemminger, Antonio Almeida, net...@vger.kernel.org, jar...@gmail.com, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

On Mon, 18 May 2009 09:13:14 -0700
Stephen Hemminger <shemm...@vyatta.com> wrote:

> On Mon, 18 May 2009 11:01:21 +0100
> Antonio Almeida <vex...@gmail.com> wrote:
>
> > Hi!
> >
> > cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> > returns "jiffies"
>
> That is the slowest of the choices. Better ones are hpet and tsc, but you
> hardware doesn't support them.
>
> You should compile your kernel with HZ=1000 and the resolution will be better
> (but with some loss of performance).
> --

Are you using one of the AMD dual core machines? That processor has the bad
design flaw that the TSC counter is not synced between core's so the kernel can't
use it. You might even be better off running a non SMP kernel on that box.

Vladimir Ivashchenko

unread,

May 18, 2009, 7:14:39 PM5/18/09

to Jarek Poplawski, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Antonio Almeida

On Mon, 2009-05-18 at 15:36 +0100, Antonio Almeida wrote:
> This patch works perfectly!
> rate (bits/s) is now decreasing along with pps when I stop the traffic
> (doesn't grow as it used to for rates over 500Mbtis/s).

I'm not able to reach full speed with bond + HTB + sfq on 2.6.29.1, both
with and without these patches. I seem to get a lot of drops on sfq
qdiscs, whatever quantum I set. Playing with IRQ affinity doesn't help.
I didn't check without bond.

With bond + HFSC + sfq, I'm able to reach the speed. It doesn't seem to
overspill with 580 mbps load. Jarek, would your patches help with HSFC
overspill ? I will check tomorrow under 750 mbps load.

# ethtool -k eth0
Offload parameters for eth0:

Cannot get device flags: Operation not supported

rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

large receive offload: off

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

--
Best Regards,

Vladimir Ivashchenko
Chief Technology Officer

PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211

Vladimir Ivashchenko

unread,

May 18, 2009, 7:27:47 PM5/18/09

to Jarek Poplawski, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Antonio Almeida

> With bond + HFSC + sfq, I'm able to reach the speed. It doesn't seem to
> overspill with 580 mbps load. Jarek, would your patches help with HSFC
> overspill ? I will check tomorrow under 750 mbps load.

Please disregard my comment about HFSC. It still overspills heavily.

On a 400 mbps limit, I'm getting 520 mbps actual throughput.

Eric Dumazet

unread,

May 18, 2009, 7:59:55 PM5/18/09

to David Miller, jar...@gmail.com, vex...@gmail.com, net...@vger.kernel.org, ka...@trash.net, de...@cdi.cz

David Miller a écrit :

> From: Jarek Poplawski <jar...@gmail.com>
> Date: Mon, 18 May 2009 19:23:49 +0200
>
>> On Mon, May 18, 2009 at 06:40:56PM +0200, Eric Dumazet wrote:
>>> With a typical estimator "1sec 8sec", ewma_log value is 3
>>>
>>> At gigabit speeds, we are very close to overflow yes, since
>>> we only have 27 bits available, so 134217728 bytes per second
>>> or 1073741824 bits per second.
>>>
>>> So formula :
>>> e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
>>> is going to overflow.
>>>
>>> One way to avoid the overflow would be to use a smaller estimator, like "500ms 4sec"
>>>
>>> Or use a 64bits rate & avbps, this is needed fo 10Gb speeds I suppose...
>> Yes, I considered this too, but because of an overhead I decided to
>> fix as designed (according to the comment) for now. But probably you
>> are right, and we should go further, so I'm OK with your patch.
>
> I like this patch too, Eric can you submit this formally with
> proper signoffs etc.?
>

Sure, here it is. We might need a similar patch to get a correct pps value
too, since we currently are limited to ~ 2^21 packets per second.

[PATCH] pkt_sched: gen_estimator: use 64 bit intermediate counters for bps

gen_estimator can overflow bps (bytes per second) with Gb links, while
it was designed with a u32 API, with a theorical limit of 34360Mbit (2^32 bytes)

Using 64 bit intermediate avbps/brate counters can allow us to reach this
theorical limit.

Signed-off-by: Eric Dumazet <da...@cosmosbay.com>

Signed-off-by: Jarek Poplawski <jar...@gmail.com>
---

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 9cc9f95..ea28659 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -66,9 +66,9 @@

NOTES.

- * The stored value for avbps is scaled by 2^5, so that maximal
- rate is ~1Gbit, avpps is scaled by 2^10.
-
+ * avbps is scaled by 2^5, avpps is scaled by 2^10.
+ * both values are reported as 32 bit unsigned values. bps can
+ overflow for fast links : max speed being 34360Mbit/sec
* Minimal interval is HZ/4=250msec (it is the greatest common divisor
for HZ=100 and HZ=1024 8)), maximal interval
is (HZ*2^EST_MAX_INTERVAL)/4 = 8sec. Shorter intervals

--

David Miller

unread,

May 18, 2009, 10:27:29 PM5/18/09

to da...@cosmosbay.com, jar...@gmail.com, vex...@gmail.com, net...@vger.kernel.org, ka...@trash.net, de...@cdi.cz

From: Eric Dumazet <da...@cosmosbay.com>
Date: Tue, 19 May 2009 01:59:55 +0200

> Sure, here it is.

Applied, thanks!

> We might need a similar patch to get a correct pps value
> too, since we currently are limited to ~ 2^21 packets per second.

True, but it is a less urgent issue than bps overflow.

Jarek Poplawski

unread,

May 19, 2009, 3:02:52 AM5/19/09

to Eric Dumazet, David Miller, vex...@gmail.com, net...@vger.kernel.org, ka...@trash.net, de...@cdi.cz

On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
...
> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
...

> - e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
> + e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;

Btw., I'm a bit concerned about the syntax here: isn't such shifting
of signed ints implementation dependant?

Jarek P.

Eric Dumazet

unread,

May 19, 2009, 3:31:36 AM5/19/09

to Jarek Poplawski, David Miller, vex...@gmail.com, net...@vger.kernel.org, ka...@trash.net, de...@cdi.cz

Jarek Poplawski a écrit :

> On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
> ...
>> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
> ...
>> - e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
>> + e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
>
> Btw., I'm a bit concerned about the syntax here: isn't such shifting
> of signed ints implementation dependant?
>

You are right Jarek, I very often forget to never ever use signed quantities
at all ! (But also note original code has same undefined behavior)

Quoting wikipedia : (http://en.wikipedia.org/wiki/Arithmetic_shift)

The (1999) ISO standard for the, C programming language defines the C language's
right shift operator in terms of divisions by powers of 2. Because of the
aforementioned non-equivalence, the standard explicitly excludes from that
definition the right shifts of signed numbers that have negative values.
It doesn't specify the behaviour of the right shift operator in such circumstances,
but instead requires each individual C compiler to specify the behaviour of shifting
negative values right.

Apparently gcc does the *right* thing on x86_32, but we probably want something
stronger here. I could not find gcc documentation statement on right shifts of
negative values.

436: 8b 4b 14 mov 0x14(%ebx),%ecx
439: 89 73 18 mov %esi,0x18(%ebx)
43c: 89 7b 1c mov %edi,0x1c(%ebx)
43f: 8b 73 20 mov 0x20(%ebx),%esi
442: 8b 7b 24 mov 0x24(%ebx),%edi
445: 29 f0 sub %esi,%eax
447: 19 fa sbb %edi,%edx
449: 0f ad d0 shrd %cl,%edx,%eax
44c: d3 fa sar %cl,%edx << good >>
44e: f6 c1 20 test $0x20,%cl
451: 74 05 je 458 <est_timer+0xb8>
453: 89 d0 mov %edx,%eax
455: c1 fa 1f sar $0x1f,%edx
458: 01 f0 add %esi,%eax
45a: 8b 4b 0c mov 0xc(%ebx),%ecx
45d: 89 43 20 mov %eax,0x20(%ebx)
460: 11 fa adc %edi,%edx
462: 83 c0 0f add $0xf,%eax
465: 89 53 24 mov %edx,0x24(%ebx)
468: 83 d2 00 adc $0x0,%edx
46b: 0f ac d0 05 shrd $0x5,%edx,%eax
46f: 89 01 mov %eax,(%ecx)

Jarek Poplawski

unread,

May 19, 2009, 3:42:47 AM5/19/09

> Please disregard my comment about HFSC. It still overspills heavily.
>
> On a 400 mbps limit, I'm getting 520 mbps actual throughput.

I guess you should send some logs. Your previous report seem to show
the sum of sc rates of of children could be too high. You seem to
expect the parent's sc and ul should limit this, but actually children
rates decide and parent's rates are mainly for lending/borrowing (at
least in HTB). So, it would be nice to try with one leaf class first,
(similarly to Antonio) how high rates are respected.

High drop should be OK if the flow is much faster than scheduling/
hardware send rate. It could be a bit higher than in older kernels
because of limited requeuing, but this could be corrected with
longer queue lenghts (sfq has a very short queue: max 127).

Jarek P.

Denys Fedoryschenko

unread,

May 19, 2009, 7:04:50 AM5/19/09

to Antonio Almeida, Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet

On Tuesday 19 May 2009 13:55:43 Antonio Almeida wrote:
> Doesn't seem to make any diference seting HTB_HYSTERESIS to 0. Here're
> the values using #define HTB_HYSTERESIS 0
>
> 800 bytes:
> class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
> 555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
> 70901b/8 mpu 0b overhead 0b level 0
> Sent 9773257752 bytes 12277962 pkt (dropped 6292541, overlimits 0 requeues
> 0) rate 621796Kbit 97644pps backlog 0b 127p requeues 0
> lended: 12277835 borrowed: 0 giants: 0
> tokens: -7 ctokens: -7

6292541 dropped from 12277962 pkt, means 51% dropped. Maybe something fishy
here?

Can you try instead of SFQ - BFIFO? For 100ms buffer, 550Mbit/s it will be
~6875000 bytes bfifo.

It is by the way too short, IMHO, for this bandwidth, 127 packets is not
enough. 127 packets with 800 bytes can buffer 1 second for 812Kbit/s only,
and for 550Mbit/s it will buffer data for ~2ms only.

Jarek Poplawski

unread,

unread,

May 19, 2009, 9:18:47 AM5/19/09

to Jarek Poplawski, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet

On Mon, 18 May 2009, Jarek Poplawski wrote:

> Since you're so kind... :-) There is a line in net/sched/sch_htb.c:
>
> #define HTB_HYSTERESIS 1 /* whether to use mode hysteresis for speedup */
>
> Could you change 1 to 0, and repeat these tests above after recompiling?

Notice its runtime adjustable via:
/sys/module/sch_htb/parameters/htb_hysteresis

Since kernel version v2.6.26.

Cheers,
Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

Vladimir Ivashchenko

unread,

May 19, 2009, 10:04:16 AM5/19/09

to Jarek Poplawski, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Antonio Almeida

> > Please disregard my comment about HFSC. It still overspills heavily.
> >
> > On a 400 mbps limit, I'm getting 520 mbps actual throughput.
>
> I guess you should send some logs. Your previous report seem to show

Can you give some hints on which logs you would like to see?

> the sum of sc rates of of children could be too high. You seem to
> expect the parent's sc and ul should limit this, but actually children
> rates decide and parent's rates are mainly for lending/borrowing (at

The children's ceil rate is 70% of the parent 1:2 class rate.

> least in HTB). So, it would be nice to try with one leaf class first,
> (similarly to Antonio) how high rates are respected.

Unfortunately its difficult for me to play with classes as its real traffic.
I'll try to get a traffic generator.

> High drop should be OK if the flow is much faster than scheduling/
> hardware send rate. It could be a bit higher than in older kernels
> because of limited requeuing, but this could be corrected with
> longer queue lenghts (sfq has a very short queue: max 127).

I don't think its sfq, since I have the same sfq qdiscs with HSFC.

Also I'm comparing this to my production HTB box has 2.6.21.5 with esfq
and no bond (just eth), esfq also has 127p limit.

I tried to get rid of bond on the outbound traffic, I balanced traffic
via eth0 and eth2 manually by splitting routes going through them.

I still had the same issue with HTB not reaching the full speed.

I'm going to try testing exactly the same configuration on 2.6.29 as I have
on 2.6.21.5 tonight. The only difference would be that I use sfq(dst) instead of
esfq(dst) which is not available on 2.6.29.

--
Best Regards

Vladimir Ivashchenko
Chief Technology Officer

PrimeTel, Cyprus - www.prime-tel.com

Antonio Almeida

unread,

May 19, 2009, 10:31:30 AM5/19/09

to Jarek Poplawski, Denys Fedoryschenko, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet

I tested it with BFIFO using limit 6875000. (Analyser keeps sending
950Mbits/s of 800 bytes tcp packets - lots of drops for sure)
Backlog is now huge but the throughout stays much higher than the
configured ceil.

# tc -s -d class ls dev eth1
class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5
Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
rate 621765Kbit 97639pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: -186 ctokens: -186

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7
Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
rate 621765Kbit 97639pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: -186 ctokens: -186

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6
Sent 9542831672 bytes 11988482 pkt (dropped 0, overlimits 0 requeues 0)
rate 621765Kbit 97639pps backlog 0b 0p requeues 0
lended: 0 borrowed: 0 giants: 0
tokens: -186 ctokens: -186

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 9549705928 bytes 11997118 pkt (dropped 6092846, overlimits 0 requeues 0)
rate 621764Kbit 97639pps backlog 0b 8636p requeues 0
lended: 11988482 borrowed: 0 giants: 0
tokens: -1008 ctokens: -1008

# tc -s -d qdisc ls dev eth1
qdisc htb 1: root r2q 10 default 0 direct_packets_stat 11955 ver 3.17
Sent 9608660872 bytes 12071182 pkt (dropped 6124502, overlimits
18190041 requeues 0)
rate 0bit 0pps backlog 0b 8636p requeues 0
qdisc bfifo 108: parent 1:108 limit 6875000b
Sent 9599144692 bytes 12059227 pkt (dropped 6124502, overlimits 0 requeues 0)
rate 0bit 0pps backlog 6874256b 8636p requeues 0

Antonio Almeida

Eric Dumazet

unread,

May 19, 2009, 2:03:24 PM5/19/09

to Jarek Poplawski, David Miller, vex...@gmail.com, net...@vger.kernel.org, ka...@trash.net, de...@cdi.cz

Jarek Poplawski a écrit :
> On Tue, May 19, 2009 at 07:42:47AM +0000, Jarek Poplawski wrote:
>> On Tue, May 19, 2009 at 09:31:36AM +0200, Eric Dumazet wrote:
>>> Jarek Poplawski a écrit :
>>>> On Tue, May 19, 2009 at 01:59:55AM +0200, Eric Dumazet wrote:
>>>> ...
>>>>> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
>>>> ...
>>>>> - e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
>>>>> + e->avbps += ((s64)(brate - e->avbps)) >> e->ewma_log;
>>>> Btw., I'm a bit concerned about the syntax here: isn't such shifting
>>>> of signed ints implementation dependant?
>>>>
>>> You are right Jarek, I very often forget to never ever use signed quantities
>>> at all ! (But also note original code has same undefined behavior)
>> Sure, I've meant the original code including 5 lines below.
>>
>>> Apparently gcc does the *right* thing on x86_32, but we probably want something
>>> stronger here. I could not find gcc documentation statement on right shifts of
>>> negative values.
>> I guess gcc and most of others do this "right"; but it looks
>> "unkosher" anyway.
>
> I might have missed your point here, but would it be so costly to do
> these shifts separately here?

You replied to yourself Jarek :)

As I said earlier, I found your concern right, so please submit a patch ?

I found many occurrences of a right shift on a signed int/long in kernel.
One example being :

arch/x86/mm/init_64.c

int kern_addr_valid(unsigned long addr)
{
unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;

and another rate estimator in drivers/atm/idt77252.c

static void
idt77252_est_timer(unsigned long data)

We could aso check net/netfilter/ipvs/ip_vs_est.c (estimation_timer())

Jarek Poplawski

unread,

May 19, 2009, 3:09:15 PM5/19/09

to Eric Dumazet, David Miller, vex...@gmail.com, net...@vger.kernel.org, ka...@trash.net, de...@cdi.cz

On Tue, May 19, 2009 at 08:03:24PM +0200, Eric Dumazet wrote:
...

> As I said earlier, I found your concern right, so please submit a patch ?

OK, thanks,
Jarek P.
----------------->
pkt_sched: gen_estimator: Fix signed integers right-shifts.

Right-shifts of signed integers are implementation-defined so unportable.

With feedback from: Eric Dumazet <da...@cosmosbay.com>

Signed-off-by: Jarek Poplawski <jar...@gmail.com>
---

diff -Nurp a/net/core/gen_estimator.c b/net/core/gen_estimator.c
--- a/net/core/gen_estimator.c 2009-05-19 20:33:47.000000000 +0200
+++ b/net/core/gen_estimator.c 2009-05-19 20:40:58.000000000 +0200
@@ -128,12 +128,12 @@ static void est_timer(unsigned long arg)

unread,

May 21, 2009, 3:20:50 AM5/21/09

to Eric Dumazet, Vladimir Ivashchenko, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Antonio Almeida, Corey Hickey

Yes, sfq has its design limits, and as a matter of fact, because of
max length (127) it should be treated as a toy or "personal" qdisc.

I don't know why more of esfq wasn't merged, anyway similar
functionality could be achieved in current kernels with sch_drr +
cls_flow, alas not enough documented. Here is some hint:
http://markmail.org/message/h24627xkrxyqxn4k

Jarek P.

PS: I guess, you wasn't very consistent if your main problem was
exceeding or not reaching htb rate, and there is quite a difference.

Vladimir Ivashchenko wrote, On 05/08/2009 10:46 PM:

> Exporting HZ=1000 doesn't help. However, even if I recompile the kernel
> to 1000 Hz and the burst is calculated correctly, for some reason HTB on
> 2.6.29 is still worse at rate control than 2.6.21.
>
> With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
> With 2.6.29, same ceil/burst -> actual rate 890 mbits.
...

Vladimir Ivashchenko wrote, On 05/17/2009 10:29 PM:

> Hi Antonio,
>
> FYI, these are exactly the same problems I get in real life.
> Check the later posts in "bond + tc regression" thread.
...

Vladimir Ivashchenko

unread,

May 21, 2009, 3:44:00 AM5/21/09

to Jarek Poplawski, Eric Dumazet, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Antonio Almeida, Corey Hickey

> I don't know why more of esfq wasn't merged, anyway similar
> functionality could be achieved in current kernels with sch_drr +
> cls_flow, alas not enough documented. Here is some hint:
> http://markmail.org/message/h24627xkrxyqxn4k

Can I balance only by destination IP using this approach?
Normal IP flow-based balancing is not good for me, I need
to ensure equality between destination hosts.

>
> Jarek P.
>
> PS: I guess, you wasn't very consistent if your main problem was
> exceeding or not reaching htb rate, and there is quite a difference.

Yes indeed :(

I'm trying to migrate from 2.6.21 eth/htb/esfq to 2.6.29
bond/htb/sfq, and that introduces a lot of changes.

Apparently during some point I changed sfq divisor from 1024
to 2048 and forgot about it.

Now I realize that the problems I reported were as follows:

1) HTB exceeds target when I use HTB + sfq + divisor 1024
2) HFSC exceeds target when I use HFSC + sfq + divisor 1024
3) HTB does not reach target when I use HTB + sfq + divisor 2048

I will check again scenario 1) with the latest patches from
the list.

> Vladimir Ivashchenko wrote, On 05/08/2009 10:46 PM:
>
> > Exporting HZ=1000 doesn't help. However, even if I recompile the kernel
> > to 1000 Hz and the burst is calculated correctly, for some reason HTB on
> > 2.6.29 is still worse at rate control than 2.6.21.
> >
> > With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
> > With 2.6.29, same ceil/burst -> actual rate 890 mbits.
> ...
>
> Vladimir Ivashchenko wrote, On 05/17/2009 10:29 PM:
>
> > Hi Antonio,
> >
> > FYI, these are exactly the same problems I get in real life.
> > Check the later posts in "bond + tc regression" thread.
> ...

--
Best Regards

Vladimir Ivashchenko
Chief Technology Officer

PrimeTel, Cyprus - www.prime-tel.com

Jarek Poplawski

unread,

May 21, 2009, 4:28:05 AM5/21/09

to Vladimir Ivashchenko, Eric Dumazet, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Antonio Almeida, Corey Hickey

On Thu, May 21, 2009 at 10:44:00AM +0300, Vladimir Ivashchenko wrote:
> > I don't know why more of esfq wasn't merged, anyway similar
> > functionality could be achieved in current kernels with sch_drr +
> > cls_flow, alas not enough documented. Here is some hint:
> > http://markmail.org/message/h24627xkrxyqxn4k
>
> Can I balance only by destination IP using this approach?
> Normal IP flow-based balancing is not good for me, I need
> to ensure equality between destination hosts.

Yes, you need to use flow "dst" key, I guess. (tc filter add flow help)

Jarek P.

> > PS: I guess, you wasn't very consistent if your main problem was
> > exceeding or not reaching htb rate, and there is quite a difference.
>
> Yes indeed :(

Generally, the most common reasons are:
- too short (or zero) tx queue length or/plus some disturbances in
maintaining the flow - for not reaching the rate
- gso/tso or other non standard packets sizes - for exceeding the
rate.

Jarek Poplawski

unread,

May 21, 2009, 4:51:17 AM5/21/09

to Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz

On Mon, May 18, 2009 at 06:16:26PM +0100, Antonio Almeida wrote:
> I forgot to tell you that I used tc source code from iproute2-2.6.16.
> I couldn't use the newest version because I got errors when compiling.

I still have no clue about the reason, but it would be really nice to
do some short test with more current kernel (>= 2.6.27) and iproute2
(to exclude the possibility of some incomaptibility in configs e.g.
rate tables passed to htb).

Thanks,
Jarek P.

Eric Dumazet

unread,

May 21, 2009, 5:07:24 AM5/21/09

to Jarek Poplawski, Vladimir Ivashchenko, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Antonio Almeida, Corey Hickey

Jarek Poplawski a écrit :

> On Thu, May 21, 2009 at 10:44:00AM +0300, Vladimir Ivashchenko wrote:
>>> I don't know why more of esfq wasn't merged, anyway similar
>>> functionality could be achieved in current kernels with sch_drr +
>>> cls_flow, alas not enough documented. Here is some hint:
>>> http://markmail.org/message/h24627xkrxyqxn4k
>> Can I balance only by destination IP using this approach?
>> Normal IP flow-based balancing is not good for me, I need
>> to ensure equality between destination hosts.
>
> Yes, you need to use flow "dst" key, I guess. (tc filter add flow help)
>
> Jarek P.
>
>>> PS: I guess, you wasn't very consistent if your main problem was
>>> exceeding or not reaching htb rate, and there is quite a difference.
>> Yes indeed :(
>
> Generally, the most common reasons are:
> - too short (or zero) tx queue length or/plus some disturbances in
> maintaining the flow - for not reaching the rate

> - gso/tso or other non standard packets sizes - for exceeding the
> rate.

Could we detect this at runtime and emit a warning (once) ?

Or should we assume guys using this stuff should be smart enough ?
I confess I made this error once and this was not so easy to spot...

Jarek Poplawski

unread,

May 21, 2009, 5:22:12 AM5/21/09

to Eric Dumazet, Vladimir Ivashchenko, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Antonio Almeida, Corey Hickey

On Thu, May 21, 2009 at 11:07:24AM +0200, Eric Dumazet wrote:
...

> > - gso/tso or other non standard packets sizes - for exceeding the
> > rate.
>
> Could we detect this at runtime and emit a warning (once) ?

I guess, it's a rhetorical question...

Jarek P.

Antonio Almeida

unread,

May 22, 2009, 1:42:16 PM5/22/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Thu, May 21, 2009 at 9:51 AM, Jarek Poplawski wrote:
> I still have no clue about the reason, but it would be really nice to
> do some short test with more current kernel (>= 2.6.27) and iproute2
> (to exclude the possibility of some incomaptibility in configs e.g.
> rate tables passed to htb).

I installed kernel 2.6.29 (finaly! wasn't easy... I couldn't set
memory split 2G/2G),
but the results are the same. I've already applied gen_estimator.c
patches (works fine).

# tc -s -d class ls dev eth1 | head -24

class htb 1:1 root rate 900000Kbit ceil 900000Kbit burst 113962b/8 mpu
0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level 7

Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
rate 621844Kbit 97651pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 402 ctokens: 402

class htb 1:10 parent 1:2 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
5

Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
rate 621844Kbit 97651pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 402 ctokens: 402

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 119955366812 bytes 150697697 pkt (dropped 76696483, overlimits 0
requeues 0)
rate 621847Kbit 97652pps backlog 0b 79p requeues 0
lended: 150697618 borrowed: 0 giants: 0
tokens: -5 ctokens: -5

class htb 1:2 parent 1:1 rate 900000Kbit ceil 900000Kbit burst
113962b/8 mpu 0b overhead 0b cburst 113962b/8 mpu 0b overhead 0b level
6

Sent 119955303928 bytes 150697618 pkt (dropped 0, overlimits 0 requeues 0)
rate 621844Kbit 97651pps backlog 0b 0p requeues 0

lended: 0 borrowed: 0 giants: 0

tokens: 402 ctokens: 402

# cat /sys/module/sch_htb/parameters/htb_hysteresis
0

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

I'm working on a newer iproute2.

Antonio Almeida

Jarek Poplawski

unread,

May 23, 2009, 3:32:01 AM5/23/09

to Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Fri, May 22, 2009 at 06:42:16PM +0100, Antonio Almeida wrote:
> On Thu, May 21, 2009 at 9:51 AM, Jarek Poplawski wrote:
> > I still have no clue about the reason, but it would be really nice to
> > do some short test with more current kernel (>= 2.6.27) and iproute2
> > (to exclude the possibility of some incomaptibility in configs e.g.
> > rate tables passed to htb).
>
> I installed kernel 2.6.29 (finaly! wasn't easy... I couldn't set
> memory split 2G/2G),
> but the results are the same. I've already applied gen_estimator.c
> patches (works fine).

...

> I'm working on a newer iproute2.

Actually, from these two I was more interested in iproute2 more
fitting the kernel version. :-((It should be enough to have at least
tc compiled properly, I guess.)

Btw.: if at any point you think this testing is too disturbing to you
etc., feel free to stop this or delay in time as you like.

Thanks,
Jarek P.

Vladimir Ivashchenko

unread,

May 23, 2009, 6:37:32 AM5/23/09

to Jarek Poplawski, Eric Dumazet, net...@vger.kernel.org

> > > cls_flow, alas not enough documented. Here is some hint:
> > > http://markmail.org/message/h24627xkrxyqxn4k
> >
> > Can I balance only by destination IP using this approach?
> > Normal IP flow-based balancing is not good for me, I need
> > to ensure equality between destination hosts.
>
> Yes, you need to use flow "dst" key, I guess. (tc filter add flow
> help)

What is the number of DRR classes I need to create, a separate class for
each host? I have around 20000 hosts.

I figured out that WRR does what I want and its documented, so I'm using
a 2.6.27 kernel with WRR now.

I was still hitting a wall with bonding. I played with a lot of
combinations and could not find a way to make it scale to multiple
cores. Cores which handle incoming traffic would get hit to 0-20% idle.

So, I got rid of bonding completely and instead configured PBR on Cisco
+ Linux routing in such a way so that packet gets received and
transmitted using NICs connected to the same pair of cores with common
cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
case scenarios before.

> - gso/tso or other non standard packets sizes - for exceeding the
> rate.

Just FYI, kernel 2.6.29.1, sub-classes with sfq divisor 1024, tso & gso
off, netdevice.h and tc_core.c patches applied:

class htb 1:2 root rate 775000Kbit ceil 775000Kbit burst 98328b cburst
98328b
Sent 64883444467 bytes 72261124 pkt (dropped 0, overlimits 0 requeues 0)
rate 821332Kbit 112572pps backlog 0b 0p requeues 0
lended: 21736738 borrowed: 0 giants: 0

In any case, exceeding the rate is not big of a problem for me.

Thanks a lot to everyone for their help.

--
Best Regards,

Vladimir Ivashchenko
Chief Technology Officer

PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211

Jarek Poplawski

unread,

May 23, 2009, 10:34:32 AM5/23/09

to Vladimir Ivashchenko, Eric Dumazet, net...@vger.kernel.org

On Sat, May 23, 2009 at 01:37:32PM +0300, Vladimir Ivashchenko wrote:
>
> > > > cls_flow, alas not enough documented. Here is some hint:
> > > > http://markmail.org/message/h24627xkrxyqxn4k
> > >
> > > Can I balance only by destination IP using this approach?
> > > Normal IP flow-based balancing is not good for me, I need
> > > to ensure equality between destination hosts.
> >
> > Yes, you need to use flow "dst" key, I guess. (tc filter add flow
> > help)
>
> What is the number of DRR classes I need to create, a separate class for
> each host? I have around 20000 hosts.

One class per divisor.

> I figured out that WRR does what I want and its documented, so I'm using
> a 2.6.27 kernel with WRR now.

OK if it works for you.

> I was still hitting a wall with bonding. I played with a lot of
> combinations and could not find a way to make it scale to multiple
> cores. Cores which handle incoming traffic would get hit to 0-20% idle.
>
> So, I got rid of bonding completely and instead configured PBR on Cisco
> + Linux routing in such a way so that packet gets received and
> transmitted using NICs connected to the same pair of cores with common
> cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> case scenarios before.

As a matter of fact I don't understand this bonding idea vs. smp: I
guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
expert but I think the most efficient use is with separate NICs per
cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -
but they would currently need a common HTB etc., so again a common
locking/cache problem.

> > - gso/tso or other non standard packets sizes - for exceeding the
> > rate.
>
> Just FYI, kernel 2.6.29.1, sub-classes with sfq divisor 1024, tso & gso
> off, netdevice.h and tc_core.c patches applied:
>
> class htb 1:2 root rate 775000Kbit ceil 775000Kbit burst 98328b cburst
> 98328b
> Sent 64883444467 bytes 72261124 pkt (dropped 0, overlimits 0 requeues 0)
> rate 821332Kbit 112572pps backlog 0b 0p requeues 0
> lended: 21736738 borrowed: 0 giants: 0
>
> In any case, exceeding the rate is not big of a problem for me.

Anyway, I'd be interested with the full tc -s class & qdisc report.

Thanks,
Jarek P.

Vladimir Ivashchenko

unread,

May 23, 2009, 11:06:30 AM5/23/09

to Jarek Poplawski, Eric Dumazet, net...@vger.kernel.org

> > So, I got rid of bonding completely and instead configured PBR on Cisco
> > + Linux routing in such a way so that packet gets received and
> > transmitted using NICs connected to the same pair of cores with common
> > cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> > case scenarios before.
>
> As a matter of fact I don't understand this bonding idea vs. smp: I
> guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
> expert but I think the most efficient use is with separate NICs per
> cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -

I tried the following scenario: 2 NICs used for receive + another 2 NICs
used for transmit having HTB. Each NIC on a separate core. No bonding,
just manual load balancing using IP routing.

The result was that RX cores would be 20% and 40% idle respectively, even
though the amount of traffic they were receiving was roughly the same.
The TX cores were idling at around 90%.

I found this strange personally, but I'm completely ignorant in internals of
kernel operation.

--
Best Regards

Vladimir Ivashchenko
Chief Technology Officer

PrimeTel, Cyprus - www.prime-tel.com

Jarek Poplawski

unread,

May 23, 2009, 11:35:25 AM5/23/09

to Vladimir Ivashchenko, Eric Dumazet, net...@vger.kernel.org

On Sat, May 23, 2009 at 06:06:30PM +0300, Vladimir Ivashchenko wrote:
> > > So, I got rid of bonding completely and instead configured PBR on Cisco
> > > + Linux routing in such a way so that packet gets received and
> > > transmitted using NICs connected to the same pair of cores with common
> > > cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> > > case scenarios before.
> >
> > As a matter of fact I don't understand this bonding idea vs. smp: I
> > guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
> > expert but I think the most efficient use is with separate NICs per
> > cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -
>
> I tried the following scenario: 2 NICs used for receive + another 2 NICs
> used for transmit having HTB. Each NIC on a separate core. No bonding,
> just manual load balancing using IP routing.
>
> The result was that RX cores would be 20% and 40% idle respectively, even
> though the amount of traffic they were receiving was roughly the same.
> The TX cores were idling at around 90%.

There is not enough data to analyse this, but generally you should aim
at maintaining one flow (RX + TX) on the same cpu cache.

Jarek P.

Vladimir Ivashchenko

unread,

May 23, 2009, 11:53:21 AM5/23/09

to Jarek Poplawski, Eric Dumazet, net...@vger.kernel.org

> > > > So, I got rid of bonding completely and instead configured PBR on Cisco
> > > > + Linux routing in such a way so that packet gets received and
> > > > transmitted using NICs connected to the same pair of cores with common
> > > > cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> > > > case scenarios before.
> > >
> > > As a matter of fact I don't understand this bonding idea vs. smp: I
> > > guess Eric Dumazet wrote why it's wrong wrt. locking. I'm not an smp
> > > expert but I think the most efficient use is with separate NICs per
> > > cpu (so with separate HTB qdiscs if possible), or multiqueue NICs -
> >
> > I tried the following scenario: 2 NICs used for receive + another 2 NICs
> > used for transmit having HTB. Each NIC on a separate core. No bonding,
> > just manual load balancing using IP routing.
> >
> > The result was that RX cores would be 20% and 40% idle respectively, even
> > though the amount of traffic they were receiving was roughly the same.
> > The TX cores were idling at around 90%.
>
> There is not enough data to analyse this, but generally you should aim
> at maintaining one flow (RX + TX) on the same cpu cache.

Yep, that's what I did in the end (as per the top paragraph).

--
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

Jarek Poplawski

unread,

May 23, 2009, 12:02:07 PM5/23/09

to Vladimir Ivashchenko, Eric Dumazet, net...@vger.kernel.org

On Sat, May 23, 2009 at 06:53:21PM +0300, Vladimir Ivashchenko wrote:
> > > > > So, I got rid of bonding completely and instead configured PBR on Cisco
> > > > > + Linux routing in such a way so that packet gets received and
> > > > > transmitted using NICs connected to the same pair of cores with common
> > > > > cache. 65-70% idle on all cores now, compared to 0-30% idle in worst
> > > > > case scenarios before.

...

> > There is not enough data to analyse this, but generally you should aim
> > at maintaining one flow (RX + TX) on the same cpu cache.
>
> Yep, that's what I did in the end (as per the top paragraph).

So, stop writing: "I'm completely ignorant in internals of kernel
operation" because you're smp expert now! ;-)

Jarek P.

David Miller

unread,

May 26, 2009, 1:47:16 AM5/26/09

to jar...@gmail.com, da...@cosmosbay.com, vex...@gmail.com, net...@vger.kernel.org, ka...@trash.net, de...@cdi.cz

From: Jarek Poplawski <jar...@gmail.com>
Date: Tue, 19 May 2009 21:09:15 +0200

> pkt_sched: gen_estimator: Fix signed integers right-shifts.
>
> Right-shifts of signed integers are implementation-defined so unportable.
>
> With feedback from: Eric Dumazet <da...@cosmosbay.com>
>
> Signed-off-by: Jarek Poplawski <jar...@gmail.com>

Applied to net-next-2.6, thanks!

Antonio Almeida

unread,

May 28, 2009, 2:13:40 PM5/28/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Sat, May 23, 2009 at 8:32 AM, Jarek Poplawski wrote:
> Actually, from these two I was more interested in iproute2 more
> fitting the kernel version. :-((It should be enough to have at least
> tc compiled properly, I guess.)

I installed iproute2-ss090115 with the new patch but the results are
the same for my test scenery. HTB keeps sending 620Mbit/s when I
configure it's ceil to 555Mbit/s, with 800 bytes packets long.

> Btw.: if at any point you think this testing is too disturbing to you
> etc., feel free to stop this or delay in time as you like.

I'm working on this, don't worry. Since I have a traffic
generator/analyser, any modification you would make I can test it.
You're free to ask.

I've been looking inside htb source code. The granularity problem
could be in the use qdisc_rate_table or near that.

Antonio Almeida

Jarek Poplawski

unread,

May 28, 2009, 5:12:58 PM5/28/09

to Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Thu, May 28, 2009 at 07:13:40PM +0100, Antonio Almeida wrote:
> On Sat, May 23, 2009 at 8:32 AM, Jarek Poplawski wrote:
> > Actually, from these two I was more interested in iproute2 more
> > fitting the kernel version. :-((It should be enough to have at least
> > tc compiled properly, I guess.)
> I installed iproute2-ss090115 with the new patch but the results are
> the same for my test scenery. HTB keeps sending 620Mbit/s when I
> configure it's ceil to 555Mbit/s, with 800 bytes packets long.
>
> > Btw.: if at any point you think this testing is too disturbing to you
> > etc., feel free to stop this or delay in time as you like.
> I'm working on this, don't worry. Since I have a traffic
> generator/analyser, any modification you would make I can test it.
> You're free to ask.
>
> I've been looking inside htb source code. The granularity problem
> could be in the use qdisc_rate_table or near that.

Yes, but according to my assessment there should be "only" 50Mbit
difference for this rate/packet size. Anyway, could you try a testing
patch below, which should add some granularity to this rate table?

Thanks,
Jarek P.
---

include/net/pkt_sched.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index e37fe31..f0faf03 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -42,8 +42,8 @@ typedef u64 psched_time_t;
typedef long psched_tdiff_t;

/* Avoid doing 64 bit divide by 1000 */
-#define PSCHED_US2NS(x) ((s64)(x) << 10)
-#define PSCHED_NS2US(x) ((x) >> 10)
+#define PSCHED_US2NS(x) ((s64)(x) << 6)
+#define PSCHED_NS2US(x) ((x) >> 6)

#define PSCHED_TICKS_PER_SEC PSCHED_NS2US(NSEC_PER_SEC)
#define PSCHED_PASTPERFECT 0

Antonio Almeida

unread,

May 29, 2009, 1:02:39 PM5/29/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Thu, May 28, 2009 at 10:12 PM, Jarek Poplawski wrote:
> Yes, but according to my assessment there should be "only" 50Mbit
> difference for this rate/packet size. Anyway, could you try a testing
> patch below, which should add some granularity to this rate table?
>
> Thanks,
> Jarek P.
> ---
>
> include/net/pkt_sched.h | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index e37fe31..f0faf03 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -42,8 +42,8 @@ typedef u64 psched_time_t;
> typedef long psched_tdiff_t;
>
> /* Avoid doing 64 bit divide by 1000 */
> -#define PSCHED_US2NS(x) ((s64)(x) << 10)
> -#define PSCHED_NS2US(x) ((x) >> 10)
> +#define PSCHED_US2NS(x) ((s64)(x) << 6)
> +#define PSCHED_NS2US(x) ((x) >> 6)
>
> #define PSCHED_TICKS_PER_SEC PSCHED_NS2US(NSEC_PER_SEC)
> #define PSCHED_PASTPERFECT 0

It's better! This patch gives more accuracy to HTB. Here some values:
Note that these are boundary values, so, e.g., any HTB configuration
between 377000Kbit and 400000Kbit would fall in the same step - close
to 397977Kbit.
This test was made over the same conditions: generating 950Mbit/s of
unidirectional tcp traffic of 800 bytes packets long.

leaf class ceil leaf class sent rate (tc -s values)
376000Kbit 375379Kbit
--
377000Kbit 397977Kbit
400000Kbit 397973Kbit
--
401000Kbit 425199Kbit
426000Kbit 425199Kbit
--
427000Kbit 456389Kbit
457000Kbit 456409Kbit
--
458000Kbit 490111Kbit
492000Kbit 490138Kbit
--
493000Kbit 531957Kbit
533000Kbit 532078Kbit
--
534000Kbit 581835Kbit
581000Kbit 581820Kbit
--
582000Kbit 637809Kbit
640000Kbit 637709Kbit
--
641000Kbit 710526Kbit
711000Kbit 710553Kbit
--
712000Kbit 795921Kbit
800000Kbit 795901Kbit
--
801000Kbit 912706Kbit
914000Kbit 912782Kbit
--
915000Kbit --

Here more values for a HTB ceil configuration of 555Mbit/s changing packet size:

800 bytes:

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 18731000768 bytes 23531408 pkt (dropped 15715520, overlimits 0 requeues 0)
rate 581832Kbit 91368pps backlog 0b 110p requeues 0
lended: 23531298 borrowed: 0 giants: 0
tokens: -16091 ctokens: -16091

850 bytes:

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 30556163150 bytes 37645600 pkt (dropped 25746491, overlimits 0 requeues 0)
rate 565509Kbit 83556pps backlog 0b 15p requeues 0
lended: 37645585 borrowed: 0 giants: 0
tokens: -16010 ctokens: -16010

950 bytes

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate
555000Kbit ceil 555000Kbit burst 70901b/8 mpu 0b overhead 0b cburst
70901b/8 mpu 0b overhead 0b level 0

Sent 51363059854 bytes 60954074 pkt (dropped 40474346, overlimits 0 requeues 0)
rate 598925Kbit 83555pps backlog 0b 112p requeues 0
lended: 60953962 borrowed: 0 giants: 0
tokens: 12446 ctokens: 12446

I'm using
# tc -V
tc utility, iproute2-ss090115

and keeping tso and gso off:

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

Stephen Hemminger

unread,

May 29, 2009, 1:28:45 PM5/29/09

to Antonio Almeida, Jarek Poplawski, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

You really need to get a better box than the dual core AMD.
There is only millisecond (or worse with HZ=100) resolution possible because
there is no working TSC on that hardware.

--

Jarek Poplawski

unread,

May 29, 2009, 3:46:43 PM5/29/09

to Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Good news! So it seems there are no other reasons of this inaccuracy
than too coarse granularity, but I have to check this yet. Alas there
is needed something more than this patch, because it probably breaks
other things like hfsc.

Thanks,
Jarek P.

Jarek Poplawski

Jun 2, 2009, 6:12:45 AM6/2/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

I'm getting great values with this patch!

class htb 1:108 parent 1:10 leaf 108: prio 7 quantum 1514 rate

555000Kbit ceil 555000Kbit burst 70970b/8 mpu 0b overhead 0b cburst
70970b/8 mpu 0b overhead 0b level 0
Sent 14270693572 bytes 17928007 pkt (dropped 12579262, overlimits 0 requeues 0)
rate 552755Kbit 86802pps backlog 0b 127p requeues 0
lended: 17927880 borrowed: 0 giants: 0
tokens: -16095 ctokens: -16095

(for packets of 800 bytes)
I'll get back to you with more values.

Antonio Almeida

unread,

Jun 2, 2009, 7:45:28 AM6/2/09

to Jarek Poplawski, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Tue, Jun 2, 2009 at 11:12 AM, Antonio Almeida wrote:
> I'm getting great values with this patch!

>...

> I'll get back to you with more values.

The steps are much smaller and the error keeps lower than 1%.
Injecting over 950Mpbs of tcp packets of 800bytes I get these values:

Configuration Sent rate error (%)
498000Kbit 495023Kbit 0,60
499000Kbit 497456Kbit 0,31
500000Kbit 497498Kbit 0,50
501000Kbit 497496Kbit 0,70
502000Kbit 499986Kbit 0,40
503000Kbit 499978Kbit 0,60
504000Kbit 502520Kbit 0,29

696000Kbit 690964Kbit 0,72
697000Kbit 695782Kbit 0,17
698000Kbit 695783Kbit 0,32
699000Kbit 695783Kbit 0,46
700000Kbit 695795Kbit 0,60
701000Kbit 695786Kbit 0,74
702000Kbit 700703Kbit 0,18

896000Kbit 888383Kbit 0,85
897000Kbit 896289Kbit 0,08
904000Kbit 896389Kbit 0,84
905000Kbit 904542Kbit 0,05

Jarek Poplawski

unread,

Jun 2, 2009, 8:36:36 AM6/2/09

to Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, ka...@trash.net, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Tue, Jun 02, 2009 at 12:45:28PM +0100, Antonio Almeida wrote:
> On Tue, Jun 2, 2009 at 11:12 AM, Antonio Almeida wrote:
> > I'm getting great values with this patch!
> >...
> > I'll get back to you with more values.
>
> The steps are much smaller and the error keeps lower than 1%.
> Injecting over 950Mpbs of tcp packets of 800bytes I get these values:

Nice values - should be acceptable, I guess. Alas this is not all, and
I'll ask you soon for re-testing HFSC (after another patch) or maybe
even some simple CBQ setup ;-)

Thank you very much for testing,
Jarek P.

Patrick McHardy

unread,

Jun 2, 2009, 8:45:34 AM6/2/09

to Jarek Poplawski, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Jarek Poplawski wrote:
> On Tue, Jun 02, 2009 at 12:45:28PM +0100, Antonio Almeida wrote:
>> On Tue, Jun 2, 2009 at 11:12 AM, Antonio Almeida wrote:
>>> I'm getting great values with this patch!
>>> ...
>>> I'll get back to you with more values.
>> The steps are much smaller and the error keeps lower than 1%.
>> Injecting over 950Mpbs of tcp packets of 800bytes I get these values:
>
> Nice values - should be acceptable, I guess. Alas this is not all, and
> I'll ask you soon for re-testing HFSC (after another patch) or maybe
> even some simple CBQ setup ;-)

I didn't follow the full discussion, so I'm not sure which kind of
arithmetic error you're attempting to cure. For the HFSC scaling
factors, please just keep in mind that its also supposed to be
very accurate at low bandwidths.

Jarek Poplawski

unread,

Jun 2, 2009, 9:08:58 AM6/2/09

to Patrick McHardy, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Tue, Jun 02, 2009 at 02:45:34PM +0200, Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> On Tue, Jun 02, 2009 at 12:45:28PM +0100, Antonio Almeida wrote:
>>> On Tue, Jun 2, 2009 at 11:12 AM, Antonio Almeida wrote:
>>>> I'm getting great values with this patch!
>>>> ...
>>>> I'll get back to you with more values.
>>> The steps are much smaller and the error keeps lower than 1%.
>>> Injecting over 950Mpbs of tcp packets of 800bytes I get these values:
>>
>> Nice values - should be acceptable, I guess. Alas this is not all, and
>> I'll ask you soon for re-testing HFSC (after another patch) or maybe
>> even some simple CBQ setup ;-)
>
> I didn't follow the full discussion, so I'm not sure which kind of
> arithmetic error you're attempting to cure. For the HFSC scaling
> factors, please just keep in mind that its also supposed to be
> very accurate at low bandwidths.

It's all here:

http://permalink.gmane.org/gmane.linux.network/129301

Of course, I'd appreciate any suggestions.

Thanks,
Jarek P.

Patrick McHardy

unread,

Jun 2, 2009, 9:20:20 AM6/2/09

to Jarek Poplawski, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Jarek Poplawski wrote:
> On Tue, Jun 02, 2009 at 02:45:34PM +0200, Patrick McHardy wrote:
>> I didn't follow the full discussion, so I'm not sure which kind of
>> arithmetic error you're attempting to cure. For the HFSC scaling
>> factors, please just keep in mind that its also supposed to be
>> very accurate at low bandwidths.
>
> It's all here:
>
> http://permalink.gmane.org/gmane.linux.network/129301

I've read through the mails where you suggested to change the scaling
factors. I wasn't able to find the reasoning (IOW: where does it
overflow or loose precision in which case) though.

> Of course, I'd appreciate any suggestions.

The HFSC shifts would indeed need adjustments if the US<->NS conversion
factor were to change.

Jarek Poplawski

unread,

Jun 2, 2009, 5:37:23 PM6/2/09

to Patrick McHardy, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Tue, Jun 02, 2009 at 03:20:20PM +0200, Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> On Tue, Jun 02, 2009 at 02:45:34PM +0200, Patrick McHardy wrote:
>>> I didn't follow the full discussion, so I'm not sure which kind of
>>> arithmetic error you're attempting to cure. For the HFSC scaling
>>> factors, please just keep in mind that its also supposed to be
>>> very accurate at low bandwidths.
>>
>> It's all here:
>>
>> http://permalink.gmane.org/gmane.linux.network/129301
>
> I've read through the mails where you suggested to change the scaling
> factors. I wasn't able to find the reasoning (IOW: where does it
> overflow or loose precision in which case) though.

I described the reasoning here:
http://permalink.gmane.org/gmane.linux.network/128189

Of course, we could try some other solution than changing the scaling.
I considered a possibility to do it internally in htb, even with
skipping rate tables, but the change of the scaling seems to be the
most generic way (alas there are some odd compatibility issues in
iproute/tc like TIME_UNITS_PER_SEC or "if (nom == 1000000)" to make
it really consistent/readable).

Jarek P.

Jarek Poplawski

unread,

Jun 2, 2009, 5:50:42 PM6/2/09

to Patrick McHardy, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Jarek Poplawski wrote, On 06/02/2009 11:37 PM:
...

> I described the reasoning here:
> http://permalink.gmane.org/gmane.linux.network/128189

The link is stuck now, so here is a quote:

Jarek Poplawski wrote, On 05/17/2009 10:15 PM:

> Here is some additional explanation. It looks like these rates above
> 500Mbit hit the design limits of packet scheduling. Currently used
> internal resolution PSCHED_TICKS_PER_SEC is 1,000,000. 550Mbit rate
> with 800byte packets means 550M/8/800 = 85938 packets/s, so on average
> 1000000/85938 = 11.6 ticks per packet. Accounting only 11 ticks means
> we leave 0.6*85938 = 51563 ticks per second, letting for additional
> sending of 51563/11 = 4687 packets/s or 4687*800*8 = 30Mbit. Of course
> it could be worse (0.9 tick/packet lost) depending on packet sizes vs.
> rates, and the effect rises for higher rates.

Patrick McHardy

unread,

Jun 3, 2009, 3:06:37 AM6/3/09

to Jarek Poplawski, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Jarek Poplawski wrote:
> Jarek Poplawski wrote, On 06/02/2009 11:37 PM:
> ...
>
>> I described the reasoning here:
>> http://permalink.gmane.org/gmane.linux.network/128189
>
> The link is stuck now, so here is a quote:

Thanks.

> Jarek Poplawski wrote, On 05/17/2009 10:15 PM:
>
>> Here is some additional explanation. It looks like these rates above
>> 500Mbit hit the design limits of packet scheduling. Currently used
>> internal resolution PSCHED_TICKS_PER_SEC is 1,000,000. 550Mbit rate
>> with 800byte packets means 550M/8/800 = 85938 packets/s, so on average
>> 1000000/85938 = 11.6 ticks per packet. Accounting only 11 ticks means
>> we leave 0.6*85938 = 51563 ticks per second, letting for additional
>> sending of 51563/11 = 4687 packets/s or 4687*800*8 = 30Mbit. Of course
>> it could be worse (0.9 tick/packet lost) depending on packet sizes vs.
>> rates, and the effect rises for higher rates.

I see. Unfortunately changing the scaling factors is pushing the lower
end towards overflowing. For example Denys Fedoryshchenko reported some
breakage a few years ago when I changed the iproute-internal factors
triggered by this command:

.. tbf buffer 1024kb latency 500ms rate 128kbit peakrate 256kbit
minburst 16384

The burst size calculated by TBF with the current parameters is
64000000. Increasing it by a factor of 16 as in your patch results
in 1024000000. Which means we're getting dangerously close to
overflowing, a buffer size increase or a rate decrease of slightly
bigger than factor 4 will already overflow.

Mid-term we really need to move to 64 bit values and ns resolution,
otherwise this problem is just going to reappear as soon as someone
tries 10gbit. Not sure what the best short term fix is, I feel a bit
uneasy about changing the current factors given how close this brings
us towards overflowing.

unread,

Jun 3, 2009, 4:29:58 AM6/3/09

to Jarek Poplawski, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Jarek Poplawski wrote:
> On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
> ...
>> Yes, it would work perfectly fine with usecs, which is actually (and
>> unfortunately) the unit it uses in its ABI. But I think its better
>> to convert the values once during initialization, instead of again
>> and again when scheduling the watchdog. The necessary changes are
>> really trivial, all you need to do when changing the scaling factors
>> is to increase SM_MASK and decrease ISM_MASK accordingly.
>
> Right! (On the other hand we could consider a separate watchdog too...)

We could :) But I don't see any benefit doing that, especially given
that eventually everything should be using ns resolution anyways.

Jarek Poplawski

unread,

Jun 3, 2009, 4:45:41 AM6/3/09

to Patrick McHardy, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Wed, Jun 03, 2009 at 10:29:58AM +0200, Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
>> ...
>>> Yes, it would work perfectly fine with usecs, which is actually (and
>>> unfortunately) the unit it uses in its ABI. But I think its better
>>> to convert the values once during initialization, instead of again
>>> and again when scheduling the watchdog. The necessary changes are
>>> really trivial, all you need to do when changing the scaling factors
>>> is to increase SM_MASK and decrease ISM_MASK accordingly.
>>
>> Right! (On the other hand we could consider a separate watchdog too...)
>
> We could :) But I don't see any benefit doing that, especially given
> that eventually everything should be using ns resolution anyways.

The main benefit would be readability... I guess it's no problem for
you, but I'm currently trying to make sure things like this are/will
be OK :-)

dx = ((u64)d * PSCHED_TICKS_PER_SEC);
dx += USEC_PER_SEC - 1;

Jarek P.

Jarek Poplawski

unread,

Jun 3, 2009, 5:54:12 AM6/3/09

to Antonio Almeida, Patrick McHardy, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
...

> The necessary changes are
> really trivial, all you need to do when changing the scaling factors
> is to increase SM_MASK and decrease ISM_MASK accordingly.

OK, looks like it's really enough and I was confused with some
rounding, thanks Patrick.

Antonio, could you give this patch a try (with all the previous) and
repeat those HFSC tests you did before (plus maybe a few tries with
lower rates)?

Thanks,
Jarek P.
---

net/sched/sch_hfsc.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 5022f9c..7c53a36 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -384,8 +384,9 @@ cftree_update(struct hfsc_class *cl)
*
* 1.024us/byte 78.125 7.8125 0.78125 0.078125 0.0078125
*/
-#define SM_SHIFT 20
-#define ISM_SHIFT 18
+#define PSCHED_SHIFT 6 /* TODO: move to pkt_sched.h */
+#define SM_SHIFT (30 - PSCHED_SHIFT)
+#define ISM_SHIFT (8 + PSCHED_SHIFT)

#define SM_MASK ((1ULL << SM_SHIFT) - 1)
#define ISM_MASK ((1ULL << ISM_SHIFT) - 1)

Patrick McHardy

unread,

Jun 3, 2009, 6:01:55 AM6/3/09

to Jarek Poplawski, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Jarek Poplawski wrote:
> On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
> ...
>> The necessary changes are
>> really trivial, all you need to do when changing the scaling factors
>> is to increase SM_MASK and decrease ISM_MASK accordingly.
>
> OK, looks like it's really enough and I was confused with some
> rounding, thanks Patrick.

Looks fine in principle, but considering your change to the generic
scaling factors:

> -#define PSCHED_US2NS(x) ((s64)(x) << 10)
> -#define PSCHED_NS2US(x) ((x) >> 10)
> +#define PSCHED_US2NS(x) ((s64)(x) << 6)
> +#define PSCHED_NS2US(x) ((x) >> 6)

PSCHED_SHIFT should be 4, right?

> +#define PSCHED_SHIFT 6 /* TODO: move to pkt_sched.h */
> +#define SM_SHIFT (30 - PSCHED_SHIFT)
> +#define ISM_SHIFT (8 + PSCHED_SHIFT)

Patrick McHardy

unread,

Jun 3, 2009, 6:05:28 AM6/3/09

to Jarek Poplawski, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> On Wed, Jun 03, 2009 at 09:53:11AM +0200, Patrick McHardy wrote:
>> ...
>>> The necessary changes are
>>> really trivial, all you need to do when changing the scaling factors
>>> is to increase SM_MASK and decrease ISM_MASK accordingly.
>>
>> OK, looks like it's really enough and I was confused with some
>> rounding, thanks Patrick.
>
> Looks fine in principle, but considering your change to the generic
> scaling factors:
>
>> -#define PSCHED_US2NS(x) ((s64)(x) << 10)
>> -#define PSCHED_NS2US(x) ((x) >> 10)
>> +#define PSCHED_US2NS(x) ((s64)(x) << 6)
>> +#define PSCHED_NS2US(x) ((x) >> 6)
>
> PSCHED_SHIFT should be 4, right?

> -#define SM_SHIFT 20
> -#define ISM_SHIFT 18

> +#define PSCHED_SHIFT 6 /* TODO: move to pkt_sched.h */
> +#define SM_SHIFT (30 - PSCHED_SHIFT)
> +#define ISM_SHIFT (8 + PSCHED_SHIFT)

Actually I'm confused, why the additional change of 10?

Patrick McHardy

unread,

Jun 3, 2009, 6:06:10 AM6/3/09

to Jarek Poplawski, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

Patrick McHardy wrote:
>> PSCHED_SHIFT should be 4, right?
>
>> -#define SM_SHIFT 20
>> -#define ISM_SHIFT 18
>> +#define PSCHED_SHIFT 6 /* TODO: move to pkt_sched.h */
>> +#define SM_SHIFT (30 - PSCHED_SHIFT)
>> +#define ISM_SHIFT (8 + PSCHED_SHIFT)
>
> Actually I'm confused, why the additional change of 10?

OK, 10 - 6 = 4, got it :)

Jarek Poplawski

unread,

Jun 3, 2009, 6:27:47 AM6/3/09

to Patrick McHardy, Antonio Almeida, Stephen Hemminger, net...@vger.kernel.org, da...@davemloft.net, de...@cdi.cz, Eric Dumazet, Vladimir Ivashchenko

On Wed, Jun 03, 2009 at 12:06:10PM +0200, Patrick McHardy wrote:
> Patrick McHardy wrote:
>>> PSCHED_SHIFT should be 4, right?
>>
>>> -#define SM_SHIFT 20
>>> -#define ISM_SHIFT 18
>>> +#define PSCHED_SHIFT 6 /* TODO: move to pkt_sched.h */
>>> +#define SM_SHIFT (30 - PSCHED_SHIFT)
>>> +#define ISM_SHIFT (8 + PSCHED_SHIFT)
>>
>> Actually I'm confused, why the additional change of 10?
>
> OK, 10 - 6 = 4, got it :)
>

If you wanted to console me after my hfsc confusions, you did it!

Thanks again,
Jarek P.

David Miller

unread,

Jun 4, 2009, 12:53:14 AM6/4/09

to ka...@trash.net, jar...@gmail.com, vex...@gmail.com, shemm...@vyatta.com, net...@vger.kernel.org, de...@cdi.cz, da...@cosmosbay.com, haz...@francoudi.com

From: Patrick McHardy <ka...@trash.net>
Date: Wed, 03 Jun 2009 09:53:11 +0200

> Jarek Poplawski wrote:
>> On Wed, Jun 03, 2009 at 09:06:37AM +0200, Patrick McHardy wrote:
>>> Mid-term we really need to move to 64 bit values and ns resolution,
>>> otherwise this problem is just going to reappear as soon as someone
>>> tries 10gbit. Not sure what the best short term fix is, I feel a bit
>>> uneasy about changing the current factors given how close this brings
>>> us towards overflowing.
>> I completely agree it's on the verge of overflow, and actually would
>> overflow for some insanely low (for today's standards) rates. So I
>> treat it's as a temporary solution, until people start asking about
>> more than 1 or 2Gbit. And of course we will have to move to 64 bit
>> anyway. Or we can do it now...
>
> That (now) would certainly be the best solution, but its a non-trivial
> task since all the ABIs use 32 bit values.

We could pass in a new attribute which provides the upper-32bits
of the value. I'm not sure if that works in this case but it's
an idea.