
txqueuelen has wrong units; should be time


Albert Cahalan

Feb 27, 2011, 12:50:02 AM
(thinking about the bufferbloat problem here)

Setting txqueuelen to some fixed number of packets
seems pretty broken if:

1. a link can vary in speed (802.11 especially)

2. a packet can vary in size (9 KiB jumbograms, etc.)

3. there is other weirdness (PPP compression, etc.)

It really needs to be set to some amount of time,
with the OS accounting for packets in terms of the
time they will take to transmit. This would need
to account for physical-layer packet headers and
minimum spacing requirements.

I think it could also account for estimated congestion
on the local link, because that affects the rate at which
the queue can empty. An OS can directly observe this
on some types of hardware.

Nanoseconds seems fine; it's unlikely you'd ever want
more than 4.2 seconds (32-bit unsigned) of queue.
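
A minimal sketch of the enqueue-side check this implies (all names, the kernel-style types, and the fixed 24-byte physical-layer overhead are hypothetical, purely illustrative):

struct time_queue {
        u64 limit_ns;   /* admin-set budget, e.g. 2000000 for 2 ms */
        u64 backlog_ns; /* sum of estimated tx times of queued packets */
};

/* Estimated wire time for one packet, including an assumed
 * physical-layer header and minimum inter-frame spacing. */
static u64 pkt_tx_time_ns(u32 len, u64 link_bps, u32 overhead)
{
        u64 bits = ((u64)len + overhead) * 8;

        return bits * NSEC_PER_SEC / link_bps;
}

static bool time_queue_admit(struct time_queue *q, u32 len, u64 link_bps)
{
        u64 t = pkt_tx_time_ns(len, link_bps, 24);

        if (q->backlog_ns + t > q->limit_ns)
                return false;   /* caller drops or signals congestion */
        q->backlog_ns += t;     /* subtracted again when the packet leaves */
        return true;
}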

I guess there are at least 2 queues of interest, with the
second one being under control of the hardware driver.
Having the kernel split the max time as appropriate for
the hardware seems nicest.

Mikael Abrahamsson

Feb 27, 2011, 2:10:03 AM
On Sun, 27 Feb 2011, Albert Cahalan wrote:

> Nanoseconds seems fine; it's unlikely you'd ever want
> more than 4.2 seconds (32-bit unsigned) of queue.

I think this is shortsighted, and I'm sure someone will come up with a case
where 4.2 seconds isn't enough. Let's not build in those kinds of
limitations from the start.

Why not make it 64-bit and go to picoseconds from the start?

If you need to make it 32-bit unsigned, I'd suggest starting from
microseconds instead. It's less likely that someone would want less than a
microsecond of queue than that someone would want more than 4.2 seconds of queue.

--
Mikael Abrahamsson email: swm...@swm.pp.se

Eric Dumazet

Feb 27, 2011, 3:00:01 AM
On Sunday, February 27, 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>
> > Nanoseconds seems fine; it's unlikely you'd ever want
> > more than 4.2 seconds (32-bit unsigned) of queue.
>
> I think this is shortsighted and I'm sure someone will come up with a case
> where 4.2 seconds isn't enough. Let's not build in those kinds of
> limitations from start.
>
> Why not make it 64bit and go to picoseconds from start?
>
> If you need to make it 32bit unsigned, I'd suggest to start from
> microseconds instead. It's less likely someone would want less than a
> microsecond of queue, than someone wanting more than 4.2 seconds of queue.
>

32 or 64 bits doesn't matter a lot. At the Qdisc stage we have up to 40 bytes
available in skb->cb[] for our usage.

Problem is, some machines have slow high-resolution timing services.

_If_ we have a time limit, it will probably use low resolution (aka
jiffies), unless high-resolution services are cheap.

I was thinking of not having an absolute hard limit, but an EWMA-based one.
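
A rough sketch of one reading of that idea (this is not Eric's code; the fixed-point weight 1/16 and the 10% drop probability are illustrative assumptions):

struct ewma_limit {
        u64 avg_delay_ns;       /* EWMA of observed per-packet queue delay */
        u64 thresh_ns;          /* target delay */
};

/* Classic fixed-point EWMA: avg += (sample - avg) / 16 */
static void ewma_update(struct ewma_limit *e, u64 sample_ns)
{
        s64 diff = (s64)sample_ns - (s64)e->avg_delay_ns;

        e->avg_delay_ns += diff / 16;
}

/* At enqueue: no hard cutoff; drop (or return NET_XMIT_CN) with some
 * probability once the average delay exceeds the target. rnd_pct is
 * a uniform random number in [0, 100). */
static bool ewma_over_limit(const struct ewma_limit *e, u32 rnd_pct)
{
        return e->avg_delay_ns > e->thresh_ns && rnd_pct < 10;
}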

Albert Cahalan

Feb 27, 2011, 3:30:01 AM
On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.d...@gmail.com> wrote:
> On Sunday, February 27, 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>
>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> > more than 4.2 seconds (32-bit unsigned) of queue.
...

> Problem is some machines have slow High Resolution timing services.
>
> _If_ we have a time limit, it will probably use the low resolution (aka
> jiffies), unless high resolution services are cheap.

As long as that is totally internal to the kernel and never
gets exposed by some API for setting the amount, sure.

> I was thinking not having an absolute hard limit, but an EWMA based one.

The whole point is to prevent stale packets, especially to prevent
them from messing with TCP, so I really don't think so. I suppose
you do get this to some extent via early drop.

Jussi Kivilinna

Feb 27, 2011, 6:10:02 AM
Quoting Albert Cahalan <acah...@gmail.com>:

> On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.d...@gmail.com> wrote:
>> On Sunday, February 27, 2011 at 08:02 +0100, Mikael Abrahamsson wrote:
>>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>>
>>> > Nanoseconds seems fine; it's unlikely you'd ever want
>>> > more than 4.2 seconds (32-bit unsigned) of queue.
> ...
>> Problem is some machines have slow High Resolution timing services.
>>
>> _If_ we have a time limit, it will probably use the low resolution (aka
>> jiffies), unless high resolution services are cheap.
>
> As long as that is totally internal to the kernel and never
> getting exposed by some API for setting the amount, sure.
>
>> I was thinking not having an absolute hard limit, but an EWMA based one.
>
> The whole point is to prevent stale packets, especially to prevent
> them from messing with TCP, so I really don't think so. I suppose
> you do get this to some extent via early drop.

I made a simple hack on sch_fifo with per-packet time limits
(attachment) this weekend and have been doing limited testing on a
wireless link. I think a hard limit is fine; it's simple and does
roughly the same thing a packet-(hard-)limited buffer does: drop
packets when the buffer is 'full'. My hack checks for timed-out packets
on enqueue, which might be the wrong approach (on the other hand it
might allow some more burstiness).

-Jussi

sch_fifo_to.c

Eric Dumazet

Feb 27, 2011, 3:10:01 PM


A qdisc should return to the caller a good indication of whether a packet
is queued or dropped at enqueue() time... not later (aka: never).

Accepting a packet at t0, and dropping it later at t0+limit without
giving any indication to the caller, is a problem.

This is why I suggested using an EWMA plus a probabilistic drop or
congestion indication (NET_XMIT_CN) to the caller at enqueue() time.

The absolute time limit you are trying to implement should be checked at
dequeue time, to cope with enqueue bursts or pauses on the wire.
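
A sketch of such a dequeue-time check (hypothetical; it assumes the enqueue timestamp was stashed in skb->cb[] behind some my_cb() helper):

static struct sk_buff *dequeue_fresh(struct Qdisc *sch, u64 now_ns,
                                     u64 limit_ns)
{
        struct sk_buff *skb;

        while ((skb = __skb_dequeue(&sch->q)) != NULL) {
                if (now_ns - my_cb(skb)->enq_ns <= limit_ns)
                        return skb;     /* still fresh: transmit it */
                kfree_skb(skb);         /* stale: drop at dequeue time */
                sch->qstats.drops++;
        }
        return NULL;
}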

Jussi Kivilinna

Feb 27, 2011, 4:40:04 PM
Quoting Eric Dumazet <eric.d...@gmail.com>:

Ok, it is an ugly hack ;) I got the idea of dropping from the head from pfifo_head_drop.

>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.

Ok.

>
> This is why I suggested using an EWMA plus a probabilist drop or
> congestion indication (NET_XMIT_CN) to caller at enqueue() time.
>
> The absolute time limit you are trying to implement should be checked at
> dequeue time, to cope with enqueue bursts or pauses on wire.
>

Ok.

Albert Cahalan

Feb 27, 2011, 6:40:03 PM
On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna
<jussi.k...@mbnet.fi> wrote:

> I made simple hack on sch_fifo with per packet time limits (attachment) this
> weekend and have been doing limited testing on wireless link. I think
> hardlimit is fine, it's simple and does somewhat same as what
> packet(-hard)limited buffer does, drops packets when buffer is 'full'. My
> hack checks for timed out packets on enqueue, might be wrong approach (on
> other hand might allow some more burstiness).

Thanks!

I think the default is too high. 1 ms may even be a bit high.

I suppose there is a need to allow at least 2 packets despite any
time limits, so that it remains possible to use a traditional modem
even if a huge packet takes several seconds to send.

Jussi Kivilinna

Feb 28, 2011, 6:30:02 AM
Quoting Albert Cahalan <acah...@gmail.com>:

> On Sun, Feb 27, 2011 at 5:55 AM, Jussi Kivilinna
> <jussi.k...@mbnet.fi> wrote:
>
>> I made simple hack on sch_fifo with per packet time limits (attachment) this
>> weekend and have been doing limited testing on wireless link. I think
>> hardlimit is fine, it's simple and does somewhat same as what
>> packet(-hard)limited buffer does, drops packets when buffer is 'full'. My
>> hack checks for timed out packets on enqueue, might be wrong approach (on
>> other hand might allow some more burstiness).
>
> Thanks!
>
> I think the default is too high. 1 ms may even be a bit high.

Well, with a 10 ms buffer timeout, latency goes down to 10-20 ms on a
54 Mbit wifi link (zd1211rw driver) from >500 ms (ping RTT while iperf
is running at the same time). So for that it's good enough.

>
> I suppose there is a need to allow at least 2 packets despite any
> time limits, so that it remains possible to use a traditional modem
> even if a huge packet takes several seconds to send.
>

I made an EWMA version of my fifo hack (attached). I added a minimum
2-packet queue limit and probabilistic 1% ECN marking/dropping at
timeout/2.

-Jussi

sch_fifo_ewma.c

Jussi Kivilinna

Feb 28, 2011, 6:50:02 AM
Quoting Eric Dumazet <eric.d...@gmail.com>:

Would it be better to implement this as a generic feature instead of
qdisc-specific? Have qdisc_enqueue_root do the EWMA check:

static inline int qdisc_enqueue_root(struct sk_buff *skb, struct Qdisc *sch)
{
        int status, ret;

        qdisc_skb_cb(skb)->pkt_len = skb->len;
        if (likely(!sch->use_timeout)) {
ewma_ok:
                return qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
        }

        status = qdisc_check_ewma_status(sch);
        if (status == EWMA_OK)
                goto ewma_ok;

        if (status == EWMA_OVERLIMIT) {
                kfree_skb(skb);
                return NET_XMIT_DROP;
        }

        /* status == EWMA_CONGESTION: queue it, but tell the caller */
        ret = qdisc_enqueue(skb, sch) & NET_XMIT_MASK;
        return (ret == NET_XMIT_SUCCESS) ? NET_XMIT_CN : ret;
}

And add qdisc_dequeue_root:

static inline struct sk_buff *qdisc_dequeue_root(struct Qdisc *sch)
{
        struct sk_buff *skb = sch->dequeue(sch);

        if (skb && unlikely(sch->use_timeout))
                qdisc_update_ewma(sch, skb);

        return skb;
}

Then the user could specify with tc whether any qdisc uses a timeout or
not; see the example below. Maybe even go as far as having some default
timeout for the default qdisc(?)
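
If something like that existed, usage might look like this (purely hypothetical tc syntax; no such option exists today):

tc qdisc add dev wlan0 root pfifo limit 100 timeout 10ms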

-Jussi

Eric Dumazet

Feb 28, 2011, 8:20:01 AM

Problem is, you can have several virtual queues in a qdisc.

For example, pfifo_fast has 3 bands. You could have a global EWMA with
high values, but you would still want to let a high-priority packet go
through...

Hagen Paul Pfeifer

Feb 28, 2011, 10:50:02 AM

On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:

> I suppose there is a need to allow at least 2 packets despite any
> time limits, so that it remains possible to use a traditional modem
> even if a huge packet takes several seconds to send.

That is a good point! We talk as if we knew every use case of
Linux, but that is not true at all. One of my customers, for example, operates
the Linux network stack functionality on top of a proprietary MAC/driver
where the current packet-queue characteristic is just fine. The
time-drop approach is unsuitable there because the bandwidth can vary over a
great range (0 to max bandwidth) in a small amount of time. Sufficient
buffering proves superior in this environment (only IPv{4,6}/UDP).

Hagen

John W. Linville

Feb 28, 2011, 11:20:02 AM
On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:

> Qdisc should return to caller a good indication packet is queued or
> dropped at enqueue() time... not later (aka : never)
>
> Accepting a packet at t0, and dropping it later at t0+limit without
> giving any indication to caller is a problem.

Can you elaborate on what problem this causes? Is it any worse than
if the packet is dropped at some later hop?

Is there any API that could report the drop to the sender (at
least a local one) without having to wait for the ack timeout?
Should there be?

John
--
John W. Linville Someday the world will need a hero, and you
linv...@tuxdriver.com might be all we have. Be ready.

Albert Cahalan

Feb 28, 2011, 11:50:04 AM
On Mon, Feb 28, 2011 at 10:38 AM, Hagen Paul Pfeifer <ha...@jauu.net> wrote:
> On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:
>
>> I suppose there is a need to allow at least 2 packets despite any
>> time limits, so that it remains possible to use a traditional modem
>> even if a huge packet takes several seconds to send.
>
> That is a good point! We talk about as we may know every use case of
> Linux. But this is not true at all. One of my customer for example operates
> the Linux network stack functionality on top of a proprietary MAC/Driver
> where the current packet queue characteristic is just fine. The
> time-drop-approach is unsuitable because the bandwidth can vary in a small
> amount of time over a great range (0 till max. bandwidth). A sufficient
> buffering shows up superior in this environment (only IPv{4,6}/UDP).

I don't think the current non-time queue is just fine for him,
and I can see that time-based discard-on-enqueue would not be
fine either. He needs time-based discard-on-dequeue.
Good for him is probably this (sketch below):

On dequeue, discard all packets that are too old.
On enqueue, assume max bandwidth and discard all
packets that have no hope of surviving the dequeue check
(the enqueue check is only there to prevent wasting RAM).
Exception: always keep at least 2 packets.
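
A hypothetical sketch of that dequeue rule, including the two-packet floor (all names invented for illustration):

static struct sk_buff *dequeue_drop_old(struct Qdisc *sch, u64 now_ns,
                                        u64 limit_ns)
{
        /* Discard stale packets from the head, but never shrink the
         * queue below two packets (the slow-modem exception above). */
        while (sch->q.qlen > 2) {
                struct sk_buff *skb = skb_peek(&sch->q);

                if (now_ns - my_cb(skb)->enq_ns <= limit_ns)
                        break;          /* head is fresh enough */
                kfree_skb(__skb_dequeue(&sch->q));
                sch->qstats.drops++;
        }
        return __skb_dequeue(&sch->q);  /* may be NULL */
}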

Better is something that would allow random drop.
The trouble here is that bandwidth varies greatly.
Some sort of undelete functionality is needed...?

Assuming the difficulty with implementing random drop
is solvable, I think this would work for the rest of us too.

Keeping the timeout really low is important because it isn't
OK to eat up all the latency tolerance in one hop. You have
an end-to-end budget of 20 ms for usable GUI rubber-banding.
The budget for gaming is about 80 ms and for VoIP about 150 ms.

Eric Dumazet

Feb 28, 2011, 11:50:05 AM
On Monday, February 28, 2011 at 11:11 -0500, John W. Linville wrote:
> On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
>
> > Qdisc should return to caller a good indication packet is queued or
> > dropped at enqueue() time... not later (aka : never)
> >
> > Accepting a packet at t0, and dropping it later at t0+limit without
> > giving any indication to caller is a problem.
>
> Can you elaborate on what problem this causes? Is it any worse than
> if the packet is dropped at some later hop?
>
> Is there any API that could report the drop to the sender (at
> least a local one) without having to wait for the ack timeout?
> Should there be?
>

Not all protocols have ACKs ;)

dev_queue_xmit() returns an error code; some callers use it.

John W. Linville

Feb 28, 2011, 12:10:02 PM
On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote:
> On Monday, February 28, 2011 at 11:11 -0500, John W. Linville wrote:
> > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
> >
> > > Qdisc should return to caller a good indication packet is queued or
> > > dropped at enqueue() time... not later (aka : never)
> > >
> > > Accepting a packet at t0, and dropping it later at t0+limit without
> > > giving any indication to caller is a problem.
> >
> > Can you elaborate on what problem this causes? Is it any worse than
> > if the packet is dropped at some later hop?
> >
> > Is there any API that could report the drop to the sender (at
> > least a local one) without having to wait for the ack timeout?
> > Should there be?
> >
>
> Not all protocols have ACKS ;)
>
> dev_queue_xmit() returns an error code, some callers use it.

Well, OK -- I agree it is best if you can return the status at
enqueue time. The question becomes whether or not a dropped frame
is worse than living with high latency. The answer, of course, still
seems to be a bit subjective. But, if the admin has determined that
a link should be low latency...?

John
--
John W. Linville Someday the world will need a hero, and you
linv...@tuxdriver.com might be all we have. Be ready.

Eric Dumazet

Feb 28, 2011, 12:20:02 PM
On Monday, February 28, 2011 at 11:55 -0500, John W. Linville wrote:
> On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote:
> > On Monday, February 28, 2011 at 11:11 -0500, John W. Linville wrote:
> > > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
> > >
> > > > Qdisc should return to caller a good indication packet is queued or
> > > > dropped at enqueue() time... not later (aka : never)
> > > >
> > > > Accepting a packet at t0, and dropping it later at t0+limit without
> > > > giving any indication to caller is a problem.
> > >
> > > Can you elaborate on what problem this causes? Is it any worse than
> > > if the packet is dropped at some later hop?
> > >
> > > Is there any API that could report the drop to the sender (at
> > > least a local one) without having to wait for the ack timeout?
> > > Should there be?
> > >
> >
> > Not all protocols have ACKS ;)
> >
> > dev_queue_xmit() returns an error code, some callers use it.
>
> Well, OK -- I agree it is best if you can return the status at
> enqueue time. The question becomes whether or not a dropped frame
> is worse than living with high latency. The answer, of course, still
> seems to be a bit subjective. But, if the admin has determined that
> a link should be low latency...?
>

If the latency problem could be solved by an admin choice, it probably
would have been solved already.

The point is that the qdisc layer is able to immediately return an error
code to the caller, if the qdisc handlers are properly done. This can help
applications react immediately to congestion notifications.

Some applications, even running on a "low latency link", can afford a
long delay for their packets. Should we introduce a socket API to give
the upper bound for the limit, or share a global per-qdisc limit?

Bill Sommerfeld

Feb 28, 2011, 12:30:02 PM
On Mon, Feb 28, 2011 at 07:38, Hagen Paul Pfeifer <ha...@jauu.net> wrote:
> On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:
>> I suppose there is a need to allow at least 2 packets despite any
>> time limits, so that it remains possible to use a traditional modem
>> even if a huge packet takes several seconds to send.
>
> That is a good point! We talk about as we may know every use case of
> Linux. But this is not true at all. One of my customer for example operates
> the Linux network stack functionality on top of a proprietary MAC/Driver
> where the current packet queue characteristic is just fine. The
> time-drop-approach is unsuitable because the bandwidth can vary in a small
> amount of time over a great range (0 till max. bandwidth). A sufficient
> buffering shows up superior in this environment (only IPv{4,6}/UDP).

The tension is between the average queue length and the maximum amount
of buffering needed. Fixed-size tail-drop queues -- either long or
short -- are not ideal.

My understanding is that the best practice here is that you need
(bandwidth * path delay) buffering to be available to absorb bursts
and avoid drops, but you also need to use queue management algorithms
with ECN or random drop to keep the *average* queue length short;
unfortunately, researchers are still arguing about the details of the
second part...
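
To put rough, purely illustrative numbers on the first part: a 10 Mbit/s link with a 100 ms path RTT gives 10,000,000 / 8 * 0.1 = 125,000 bytes, i.e. about 83 full-size 1500-byte packets of buffering available to absorb a burst, while AQM keeps the average occupancy far below that.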

John W. Linville

Feb 28, 2011, 1:10:03 PM
On Mon, Feb 28, 2011 at 11:37:45AM -0500, Albert Cahalan wrote:

> Keeping the timeout really low is important because it isn't
> OK to eat up all the latency tolerance in one hop. You have
> an end-to-end budget of 20 ms for usable GUI rubber banding.
> The budget for gaming is about 80 and for VoIP is about 150.

Oooh, numbers! :-)

Where can I find estimates on average hop counts for internet
connections?

John
--
John W. Linville Someday the world will need a hero, and you
linv...@tuxdriver.com might be all we have. Be ready.

Jussi Kivilinna

Feb 28, 2011, 1:40:02 PM
Quoting Eric Dumazet <eric.d...@gmail.com>:

Ok. It would be better to have the EWMA/time limit at the leaf qdisc.
(Or have an in-the-middle qdisc handle the EWMA/time limit for the
leaf qdisc, sch_timelimit.)

-Jussi

John Heffner

Feb 28, 2011, 4:50:02 PM
On Mon, Feb 28, 2011 at 11:55 AM, John W. Linville
<linv...@tuxdriver.com> wrote:
> On Mon, Feb 28, 2011 at 05:48:14PM +0100, Eric Dumazet wrote:
>> On Monday, February 28, 2011 at 11:11 -0500, John W. Linville wrote:
>> > On Sun, Feb 27, 2011 at 09:07:53PM +0100, Eric Dumazet wrote:
>> >
>> > > Qdisc should return to caller a good indication packet is queued or
>> > > dropped at enqueue() time... not later (aka : never)
>> > >
>> > > Accepting a packet at t0, and dropping it later at t0+limit without
>> > > giving any indication to caller is a problem.
>> >
>> > Can you elaborate on what problem this causes?  Is it any worse than
>> > if the packet is dropped at some later hop?
>> >
>> > Is there any API that could report the drop to the sender (at
>> > least a local one) without having to wait for the ack timeout?
>> > Should there be?
>> >
>>
>> Not all protocols have ACKS ;)
>>
>> dev_queue_xmit() returns an error code, some callers use it.
>
> Well, OK -- I agree it is best if you can return the status at
> enqueue time.  The question becomes whether or not a dropped frame
> is worse than living with high latency.  The answer, of course, still
> seems to be a bit subjective.  But, if the admin has determined that
> a link should be low latency...?

Notably, TCP is one caller that uses the error code. The error code
is functionally equivalent to ECN, one of whose great advantages is
reducing delay jitter. If TCP didn't get the error, that would
effectively double the latency for a full window of data, since the
dropped segment would not be retransmitted for an RTT.
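
Schematically, the sending-side pattern looks like this (simplified; the helper name is invented, and the real TCP output path is more involved):

int err = dev_queue_xmit(skb);

if (err == NET_XMIT_CN) {
        /* queue congested: the packet may still go out, but back
         * off now instead of waiting an RTT to infer a loss */
        tcp_enter_cwr_like_state(sk);   /* hypothetical helper */
} else if (err != NET_XMIT_SUCCESS) {
        /* dropped locally: can retransmit without waiting for an RTO */
}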

-John

John Heffner

Feb 28, 2011, 5:00:02 PM
Right... while I generally agree that a fixed-length drop-tail queue
isn't optimal, isn't this problem what the various AQM schemes try to
solve?

-John


On Mon, Feb 28, 2011 at 12:20 PM, Bill Sommerfeld
<wsomm...@google.com> wrote:
> On Mon, Feb 28, 2011 at 07:38, Hagen Paul Pfeifer <ha...@jauu.net> wrote:
>> On Sun, 27 Feb 2011 18:33:39 -0500, Albert Cahalan wrote:
>>> I suppose there is a need to allow at least 2 packets despite any
>>> time limits, so that it remains possible to use a traditional modem
>>> even if a huge packet takes several seconds to send.
>>
>> That is a good point! We talk about as we may know every use case of
>> Linux. But this is not true at all. One of my customer for example operates
>> the Linux network stack functionality on top of a proprietary MAC/Driver
>> where the current packet queue characteristic is just fine. The
>> time-drop-approach is unsuitable because the bandwidth can vary in a small
>> amount of time over a great range (0 till max. bandwidth). A sufficient
>> buffering shows up superior in this environment (only IPv{4,6}/UDP).
>
> The tension is between the average queue length and the maximum amount
> of buffering needed.  Fixed-sized tail-drop queues -- either long, or
> short -- are not ideal.
>
> My understanding is that the best practice here is that you need
> (bandwidth * path delay) buffering to be available to absorb bursts
> and avoid drops, but you also need to use queue management algorithms
> with ECN or random drop to keep the *average* queue length short;
> unfortunately, researchers are still arguing about the details of the
> second part...

Mikael Abrahamsson

Feb 28, 2011, 7:50:01 PM
On Mon, 28 Feb 2011, John Heffner wrote:

> Right... while I generally agree that a fixed-length drop-tail queue
> isn't optimal, isn't this problem what the various AQM schemes try to
> solve?

I am not an expert on exactly how Linux does this, but on Cisco, for
instance with ATM interfaces, there are two stages of queueing. One is the
"hardware queue", which is a FIFO queue going into the ATM framer. If one
wants low CPU usage, this needs to be deep so multiple packets can be
put there per interrupt. Since AQM works before this, it also means
the low-latency queue will see higher latency, as its packets end up behind
larger packets in the hw queue.

So at what level does the AQM work in Linux? Does it work similarly, i.e.
txqueuelen is a FIFO queue to the hardware that AQM feeds packets into?

Also, when one uses WRED, the thinking is generally to keep the average
queue length down but still allow for bursts, by dynamically changing the
drop probability and where it happens. When there is no queueing, allow for
a big queue (so it can fill up if needed), but if the queue stays large for
several seconds, start to apply WRED to bring it down.

There is generally no need at all to constantly buffer > 50 ms of data;
past that it's better to just start selectively dropping. In times of
burstiness (perhaps when re-routing traffic) there is a need to buffer
200-500 ms of data during perhaps 1-2 seconds before things stabilize.

So one queueing scheme and one queue limit isn't going to solve this; there
needs to be some dynamics built into the system for it to work well.

AQM needs to feed into a relatively short hw queue, and AQM needs to run
on output also when the traffic is sourced from the box itself, not only
routed. It would also help if the default were to reserve, say, 25% of
the bandwidth for smaller packets (< 200 bytes or so), which generally are
for interactive use or are ACKs.

--
Mikael Abrahamsson email: swm...@swm.pp.se

David Miller

Feb 28, 2011, 11:20:02 PM
From: Albert Cahalan <acah...@gmail.com>
Date: Mon, 28 Feb 2011 23:11:13 -0500

> It sounds like you need a callback or similar, so that TCP can be
> informed later that the drop has occurred.

By that point we could have already sent an entire RTT's worth
of data, or more.

It needs to be synchronous, otherwise performance suffers.

Albert Cahalan

Feb 28, 2011, 11:20:02 PM

It sounds like you need a callback or similar, so that TCP can be
informed later that the drop has occurred.

Eric Dumazet

Mar 1, 2011, 12:10:02 AM
On Monday, February 28, 2011 at 23:11 -0500, Albert Cahalan wrote:

> It sounds like you need a callback or similar, so that TCP can be
> informed later that the drop has occurred.

There is the thing called the skb destructor / skb_orphan() mess, which is
not stackable... We might extend this to something more clever, and be able
to call functions (into the TCP stack, for example) giving a status of the
skb: sent, or dropped somewhere in the stack...

Eric Dumazet

Mar 1, 2011, 12:40:01 AM
On Tuesday, March 1, 2011 at 06:01 +0100, Eric Dumazet wrote:
> On Monday, February 28, 2011 at 23:11 -0500, Albert Cahalan wrote:
>
> > It sounds like you need a callback or similar, so that TCP can be
> > informed later that the drop has occurred.
>
> There is the thing called skb destructor / skb_orphan() mess, that is
> not stackable... Might extend this to something more clever, and be able
> to call functions (into TCP stack for example) giving a status of skb :
> Sent, or dropped somewhere in the stack...
>

One problem with such a scheme is the huge extra cost involved: extra
locking, extra memory allocations, extra atomic operations...

Albert Cahalan

Mar 1, 2011, 2:00:01 AM
On Mon, Feb 28, 2011 at 11:18 PM, David Miller <da...@davemloft.net> wrote:
> From: Albert Cahalan <acah...@gmail.com>
> Date: Mon, 28 Feb 2011 23:11:13 -0500
>
>> It sounds like you need a callback or similar, so that TCP can be
>> informed later that the drop has occurred.
>
> By that point we could have already sent an entire RTT's worth
> of data, or more.
>
> It needs to be synchronous, otherwise performance suffers.

Ouch. OTOH, the current situation: performance suffers.

In case it makes you feel any better, consider two cases
where synchronous feedback is already impossible.
One is when you're routing packets that merely pass through.
The other is when some other box is doing that to you.
Either way, packets go bye-bye and nobody tells TCP.

David Miller

Mar 1, 2011, 2:30:01 AM
From: Albert Cahalan <acah...@gmail.com>
Date: Tue, 1 Mar 2011 01:54:09 -0500

> In case it makes you feel any better, consider two cases
> where synchronous feedback is already impossible.
> One is when you're routing packets that merely pass through.
> The other is when some other box is doing that to you.
> Either way, packets go bye-bye and nobody tells TCP.

I consider ECN quite synchronous, and routers will set ECN bits to
propagate congestion information when they do or are about to drop
packets.

Eric Dumazet

Mar 1, 2011, 2:30:01 AM
On Tuesday, March 1, 2011 at 01:54 -0500, Albert Cahalan wrote:
> On Mon, Feb 28, 2011 at 11:18 PM, David Miller <da...@davemloft.net> wrote:
> > From: Albert Cahalan <acah...@gmail.com>
> > Date: Mon, 28 Feb 2011 23:11:13 -0500
> >
> >> It sounds like you need a callback or similar, so that TCP can be
> >> informed later that the drop has occurred.
> >
> > By that point we could have already sent an entire RTT's worth
> > of data, or more.
> >
> > It needs to be synchronous, otherwise performance suffers.
>
> Ouch. OTOH, the current situation: performance suffers.
>
> In case it makes you feel any better, consider two cases
> where synchronous feedback is already impossible.
> One is when you're routing packets that merely pass through.
> The other is when some other box is doing that to you.
> Either way, packets go bye-bye and nobody tells TCP.

So in a hurry we decide to drop packets blindly because the kernel took the
CPU to perform an urgent task?

Bufferbloat is a configuration/tuning problem, not an "everything must be
redone" problem. We add new qdiscs (CHOKe, SFB, QFQ, ...) and let admins
do their job. Problem is, most admins are unaware of the problems and
only buy more bandwidth.

And no, there is no "generic" solution, unless you have a lab with two
machines back to back (private link) and a known workload.

We might need some changes (including new APIs).

ECN is a forward step. Blindly dropping packets before ever sending them
is a step backward.

We should allow some traffic spikes, or many applications will stop
working. Unless all applications are fixed, we are stuck.

Only if the queue stays loaded for a long time (yet another parameter) can
we try to drop packets.

Albert Cahalan

Mar 1, 2011, 2:40:02 PM
On Tue, Mar 1, 2011 at 2:26 AM, Eric Dumazet <eric.d...@gmail.com> wrote:
> On Tuesday, March 1, 2011 at 01:54 -0500, Albert Cahalan wrote:
>> On Mon, Feb 28, 2011 at 11:18 PM, David Miller <da...@davemloft.net> wrote:
>> > From: Albert Cahalan <acah...@gmail.com>

>> >> It sounds like you need a callback or similar, so that TCP can be
>> >> informed later that the drop has occurred.
>> >
>> > By that point we could have already sent an entire RTT's worth
>> > of data, or more.
>> >
>> > It needs to be synchronous, otherwise performance suffers.
>>
>> Ouch. OTOH, the current situation: performance suffers.
>>
>> In case it makes you feel any better, consider two cases
>> where synchronous feedback is already impossible.
>> One is when you're routing packets that merely pass through.
>> The other is when some other box is doing that to you.
>> Either way, packets go bye-bye and nobody tells TCP.
>
> So in a hurry we decide to drop packets blindly because kernel took the
> cpu to perform an urgent task ?

Yes. If the system can't handle the load, it needs to fess up.

> Bufferbloat is a configuration/tuning problem, not a "everything must be
> redone" problem. We add new qdiscs (CHOKe, SFB, QFQ, ...) and let admins
> do their job. Problem is most admins are unaware of the problems, and
> only buy more bandwidth.

We could at least do as well as Windows. >:-)

You cannot expect some random Linux user to tune things
every time the link changes speed or the app mix changes.
What person NOT ON THIS MAILING LIST is going to mess
with their qdisc when they connect to a new access point
or switch from running Skype to running Netflix? Heck, how
many have any awareness of what a qdisc even is? Linux
networking needs to be excellent for people with no clue.

> We might need some changes (including new APIs).

If an app can't specify latency, adding the ability could
be nice. Still, stuff needs to JUST WORK more of the time.

> ECN is a forward step. Blindly dropping packets before ever sending them
> is a step backward.

Last I knew, ECN defaulted to a setting of "2", which means
it is only used in response. Perhaps it's time to change that;
it's been a while, and defective firewalls have largely been
replaced by faster hardware.

> We should allow some trafic spikes, or many applications will stop
> working. Unless all applications are fixed, we are stuck.

Such applications would stop working...

1. across a switch
2. across an older router

We certainly should allow some traffic spikes. 1 to 10 ms of
traffic ought to do nicely. Hundreds or thousands of ms is
getting way beyond "spike".

Eric Dumazet

Mar 1, 2011, 3:20:02 PM

OK.

Eric Dumazet

Mar 1, 2011, 3:20:02 PM

Hmm, user error, hit wrong button, sorry.

Mikael Abrahamsson

Mar 1, 2011, 10:20:01 PM
On Tue, 1 Mar 2011, Eric Dumazet wrote:

> We should allow some trafic spikes, or many applications will stop
> working. Unless all applications are fixed, we are stuck.
>
> Only if the queue stay loaded a long time (yet another parameter) we can
> try to drop packets.

Are we talking about forwarding packets or originating them ourselves, or
trying to use the same mechanism for both?

In the case of routing a packet, I envision that a WRED kind of behaviour
is the most efficient.

<http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/12stbwr.html>

"QoS: Time-Based Thresholds for WRED and Queue Limit for the Cisco 12000
Series Router" You can set the drop probabilites in milliseconds.
Unfortunately ECN isn't supported on this platform but on other platforms
it can be configured and used instead of WRED dropping packets.

For the case when we're ourselves originating the traffic (for instance to
a wifi card with varying speed and jitter due to retransmits at the wifi
layer), I think it's taking the too-easy way out to use the same
mechanisms (dropping packets or marking ECN for our own originated packets
seems really weird); here we should be able to push back information to the
applications somehow, and do prioritization between flows, since we're
sitting on all the information ourselves, including the application.

For this case, I think there is something to be learnt from:

<http://www.cisco.com/en/US/tech/tk39/tk824/technologies_tech_note09186a00800fbafc.shtml>

Here you have the IP part and the ATM part, and you can limit the number
of cells/packets sent to the ATM hardware at any given time (this queue is
FIFO, so no AQM once a packet has been sent there). We need the same here:
to properly keep latency down and make AQM work, the hardware FIFO queue
needs to be kept short.

--
Mikael Abrahamsson email: swm...@swm.pp.se

Stephen Hemminger

Mar 2, 2011, 1:30:01 AM

It is possible to build an equivalent to WRED out of the existing GRED
queueing discipline, but it does require a lot of tc knowledge to get right.
The inventor of RED (Van Jacobson) has issues with WRED because of
the added complexity of queue selection. RED requires some parameters
which the average user has no idea how to set.

There are several problems with RED that prevent VJ from
recommending it in its current form.

http://gettys.wordpress.com/2010/12/17/red-in-a-different-light/


--

Mikael Abrahamsson

Mar 2, 2011, 1:50:02 AM
On Tue, 1 Mar 2011, Stephen Hemminger wrote:

> It is possible to build an equivalent to WRED out existing GRED queuing
> discipline but it does require a lot of tc knowledge to get right.

To me, who has worked with Cisco routers for 10+ years and is used to
the different variants Cisco uses, tc is just weird. It must come from a
completely different school of thinking compared to what router people are
used to, because I have tried and failed twice to do anything sensible
with it.

> The inventor of RED (Van Jacobsen) has issues with WRED because of the
> added complexity of queue selection. RED requires some parameters which
> the average user has no idea how to set.

Of course there are issues, and some of them can be addressed by simply
lowering the queue depth. Yes, that might bring down the performance of
some sessions, but for most of the interactive traffic, never buffering
more than 40 ms is a good thing.

> There are several problems with RED that prevent prevent VJ from
> recommending it in the current form.

Ask him if he prefers FIFO+tail drop to RED in its current form.

--
Mikael Abrahamsson email: swm...@swm.pp.se

Stephen Hemminger

Mar 2, 2011, 2:10:02 AM
On Wed, 2 Mar 2011 07:41:30 +0100 (CET)
Mikael Abrahamsson <swm...@swm.pp.se> wrote:

> On Tue, 1 Mar 2011, Stephen Hemminger wrote:
>
> > It is possible to build an equivalent to WRED out existing GRED queuing
> > discipline but it does require a lot of tc knowledge to get right.
>
> To me who has worked with cisco routers for 10+ years and who is used to
> the different variants Cisco use, tc is just weird. It must come from a
> completely different school of thinking compared to what router people are
> used to, because I have tried and failed twice to do anything sensible
> with it.

Vyatta has scripting that handles all that:

vyatta@napa:~$ configure
[edit]
vyatta@napa# set traffic-policy random-detect MyWFQ bandwidth 1gbps
[edit]
vyatta@napa# set interfaces ethernet eth0 traffic-policy out MyWFQ
[edit]
vyatta@napa# commit
[edit]
vyatta@napa# exit
vyatta@napa:~$ show queueing ethernet eth0

eth0 Queueing:
Class Policy Sent Rate Dropped Overlimit Backlog
root weighted-random 16550 0 0 0

vyatta@napa:~$ /sbin/tc qdisc show dev eth0
qdisc dsmark 1: root refcnt 2 indices 0x0008 set_tc_index
qdisc gred 2: parent 1:
DP:0 (prio 8) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 82 (bytes 9540) ewma 3 Plog 17 Scell_log 3
DP:1 (prio 7) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0) ewma 2 Plog 17 Scell_log 2
DP:2 (prio 6) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0) ewma 2 Plog 17 Scell_log 2
DP:3 (prio 5) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0) ewma 2 Plog 16 Scell_log 2
DP:4 (prio 4) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0) ewma 2 Plog 16 Scell_log 2
DP:5 (prio 3) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0) ewma 2 Plog 16 Scell_log 2
DP:6 (prio 2) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0) ewma 2 Plog 15 Scell_log 2
DP:7 (prio 1) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0) ewma 1 Plog 15 Scell_log 1

QoS on Cisco has different/other problems, mostly because various groups
tried to fix the QoS problem over time and never got it quite right.
Also, WRED is not the default on faster links because it can't be done
fast enough.

Mikael Abrahamsson

Mar 2, 2011, 11:50:02 AM
On Tue, 1 Mar 2011, Stephen Hemminger wrote:

> Also WRED is not default on faster links because it can't be done fast
> enough.

Before this propagates as some kind of truth: modern Cisco core routers
have no problem doing WRED at wire speed; the above statement is not true.

--
Mikael Abrahamsson email: swm...@swm.pp.se

Eric Dumazet

Mar 2, 2011, 12:00:01 PM
On Wednesday, March 2, 2011 at 17:41 +0100, Mikael Abrahamsson wrote:
> On Tue, 1 Mar 2011, Stephen Hemminger wrote:
>
> > Also WRED is not default on faster links because it can't be done fast
> > enough.
>
> Before this propagates as some kind of truth. Cisco modern core routers
> have no problems doing WRED at wirespeed, the above statement is not true.
>

Looking at the Cisco docs you provided
( <http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/12stbwr.html> ),
it seems the WRED time limits (instead of byte/packet limits) are
internally converted to byte/packet limits.


quote :

When the queue limit threshold is specified in milliseconds, the number
of milliseconds is internally converted to bytes using the bandwidth
available for the class.


So it seems it's only a convenience facility, and queues are still managed
with byte/packet limits...

WRED is able to probabilistically drop a packet when the packet is
enqueued. At enqueue time, we don't yet know the time of dequeue, unless
the bandwidth is known.
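
As a purely illustrative example of that conversion: a 40 ms threshold on a 100 Mbit/s class becomes 100,000,000 / 8 * 0.040 = 500,000 bytes, roughly 333 full-size 1500-byte packets.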

Chris Friesen

Mar 2, 2011, 3:30:02 PM
On 03/01/2011 09:10 PM, Mikael Abrahamsson wrote:

> For the case when we're ourselves originating the traffic (for instance to
> a wifi card with varying speed and jitter due to retransmits on the wifi
> layer), I think it's taking the too easy way out to use the same
> mechanisms (dropping packets or marking ECN for our own originated packets
> seems really weird), here we should be able to pushback information to the
> applications somehow and do prioritization between flows since we're
> sitting on all information ourselves including the application.

Doesn't the socket tx buffer give all the app pushback necessary?
(Assuming it's set to a sane value.)

We should certainly do prioritization between flows. Perhaps if no
other information is available the scheduler priority could be used?

Chris

--
Chris Friesen
Software Developer
GENBAND
chris....@genband.com
www.genband.com
