Re: [bbr-dev] bbr does not use sch_fq but internal pacing of tcp stack when it is as server

Eric Dumazet

unread,

Dec 21, 2020, 9:16:18 AM12/21/20

to mingkun bian, BBR Development

On Mon, Dec 21, 2020 at 3:00 PM mingkun bian <bianm...@gmail.com> wrote:
>
> Hi,
> I found that bbr does not use sch_fq but internal pacing of tcp stack when it is as server,

Hi

What makes you think that ?

> I log at bbr_init, then find that:
> sk->sk_pacing_status is 0 as a server when client sent HTTP GET to it.
> sk->sk_pacing_status is 2 as a client when it send HTTP GET to other server.
>
> Then I search goole and find a patch as following:
> http://patchwork.ozlabs.org/project/netdev/patch/20191223191324....@google.com/
>
> Now my question is funtion "fq_classify":
> 1. sk = (struct sock *)((hash << 1) | 1UL)
> 2. current sk may not be a valid addr, why does sk can get sk_hash at "f->socket_hash = sk->sk_hash"

sk must be valid, many things in the kernel would break if this was
not the case.

>
>
> static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q)
> {
> struct rb_node **p, *parent;
> struct sock *sk = skb->sk;
> struct rb_root *root;
> struct fq_flow *f;
>
> if (unlikely((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL))
> return &q->internal;
> if (!sk || sk_listener(sk)) {
> unsigned long hash = skb_get_hash(skb) & q->orphan_mask;
>
> /* By forcing low order bit to 1, we make sure to not
> * collide with a local flow (socket pointers are word aligned)
> */
> sk = (struct sock *)((hash << 1) | 1UL);
> skb_orphan(skb);
> }
> ...
> if (skb->sk)
> f->socket_hash = sk->sk_hash;
>
> Thanks.
>
> --
> You received this message because you are subscribed to the Google Groups "BBR Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to bbr-dev+u...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/bbr-dev/fd57295c-3e3d-42b0-b1ea-2d8e1148b322n%40googlegroups.com.

mingkun bian

unread,

Dec 22, 2020, 9:17:11 AM12/22/20

to BBR Development

Hi Eric ,

I find the above problem:

when sending synack, sk is not a valid addr when runing "sk = (struct sock *)((hash << 1) | 1UL)", it can not exec "f->socket_hash = sk->sk_hash" because of skb_orphan(skb) which set skb->sk zero.

But I have the doubt about the patch "http://patchwork.ozlabs.org/project/netdev/patch/20191223191324....@google.com/", my kernel version is 4.18.0, I print some info by using bcc in function "fq_enqueue", it is as follow in my original kernel：

1. skb->sk's type is request_sock when sending synack, which does not have sk_pacing_status.

2. tcp_v4_syn_recv_sock will creat a new sock of child, whose state is TCP_SYN_RECV, then tcp_rcv_state_process will exec tcp_init_congestion_control, then sk_pacing_status is SK_PACING_NONE in bbr_init, then bbr change sk_pacing_status to SK_PACING_NEEDED, which will use tcp internal pacing.

3. client send a HTTP "GET" to us, then we response a ack, fq_enqueue will change sk_pacing_status to SK_PACING_FQ, then transmition later will use sch_fq but not tcp stack internal pacing.

So original fq_enqueue in kernel 4.18.0 looks like correct， I do not understand the patch clearly， can you explain it detailedly？

Thanks.

Eric Dumazet

unread,

Dec 22, 2020, 11:19:24 AM12/22/20

to mingkun bian, BBR Development

On Tue, Dec 22, 2020 at 3:17 PM mingkun bian <bianm...@gmail.com> wrote:
>
> Hi Eric ,
> I find the above problem:
> when sending synack, sk is not a valid addr when runing "sk = (struct sock *)((hash << 1) | 1UL)", it can not exec "f->socket_hash = sk->sk_hash" because of skb_orphan(skb) which set skb->sk zero.
>
> But I have the doubt about the patch "http://patchwork.ozlabs.org/project/netdev/patch/20191223191324....@google.com/", my kernel version is 4.18.0, I print some info by using bcc in function "fq_enqueue", it is as follow in my original kernel：
> 1. skb->sk's type is request_sock when sending synack, which does not have sk_pacing_status.
> 2. tcp_v4_syn_recv_sock will creat a new sock of child, whose state is TCP_SYN_RECV, then tcp_rcv_state_process will exec tcp_init_congestion_control, then sk_pacing_status is SK_PACING_NONE in bbr_init, then bbr change sk_pacing_status to SK_PACING_NEEDED, which will use tcp internal pacing.
> 3. client send a HTTP "GET" to us, then we response a ack, fq_enqueue will change sk_pacing_status to SK_PACING_FQ, then transmition later will use sch_fq but not tcp stack internal pacing.
> So original fq_enqueue in kernel 4.18.0 looks like correct， I do not understand the patch clearly， can you explain it detailedly？

I think the changelog is exactly describing the issue.

Not sure how to explain it in more details.

The problem was that a 'struct fq_flow´ could be recycled (containing
state for a stale flow), and in this case we were not setting
sk_pacing_status to SK_PACING_FQ

This means BBR (or more precisely core TCP stack) was forced to arm a
per-TCP-socket high resolution timer instead of relying on the shared
one in FQ qdisc.

> To view this discussion on the web visit https://groups.google.com/d/msgid/bbr-dev/e817f1b9-8754-42ab-a1ed-a277f1d9fa57n%40googlegroups.com.

mingkun bian

unread,

Dec 23, 2020, 10:36:08 AM12/23/20

to BBR Development

Hi,

I might understand what you explain by modified code.

1. original logic：

static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) {

f = fq_classify(skb, q);

if (fq_flow_is_detached(f)) {

fq_flow_add_tail(&q->new_flows, f);

if(...) {

smp_store_release(&sk->sk_pacing_status,SK_PACING_FQ);

}

static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q) {

root = &q->fq_root[hash_ptr(sk, q->fq_trees_log)];

p = &root->rb_node;

while(*p) {

if(f->sk == sk)

return f;

}

f = kmem_cache_zalloc;

fq_flow_set_detached(f);

rb_insert(f);

return f;

}

2. modified logic:

static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) {

f = fq_classify(skb, q);

if (fq_flow_is_detached(f)) {

fq_flow_add_tail(&q->new_flows, f);

...

}

static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q) {

root = &q->fq_root[hash_ptr(sk, q->fq_trees_log)];

p = &root->rb_node;

while(*p) {

if(f->sk == sk)

return f;

}

f = kmem_cache_zalloc;

fq_flow_set_detached(f);

if (skb->sk == sk)

smp_store_release(&sk->sk_pacing_status,SK_PACING_FQ);

rb_insert(f);

return f;

}

what does “a socket structure has been reallocated” mean? Two different sk have the same sk addr when first sk is destroyed but second sk alloc first sk's addr?

if it's right,

a. in original logic, the stale flow is f which returne from the red position “return f” is not detached, because f->next is NULL,which not be same as &detached,

so fq_enqueue can not set sk_pacing_status SK_PACING_FQ because fq_flow_is_detached return false.

My question is that how to insert the fq_flow to new_flows and how to send this skb， fq_flow_add_tail(&q->new_flows, f) can not be exec because fq_flow_is_detached return false too.

b. in fq_classify, fq_root will insert a invalid sk addr(sk = (struct sock *)((hash << 1) | 1UL)) when sending synack, tcp_v4_syn_recv_sock will create a new sock of child after recving ack of handshake, so the new sock must not be

the same as sk of synack, what is the point of doing this？

c. socket might have been reallocated, does request_sock(TCPF_NEW_SYN_RECV) and sock(TCP_ESTABLISHED) have the same problem?

I am not well with the tcp stack, I am sorry that these questions may be a bit too simple for you.

Thanks.

Eric Dumazet

unread,

Dec 23, 2020, 12:32:21 PM12/23/20

to mingkun bian, BBR Development

On Wed, Dec 23, 2020 at 4:36 PM mingkun bian <bianm...@gmail.com> wrote:
>
> Hi,

Pseudo code :

struct sock *sk = kmalloc(sizeof(TCP sock)); // old_sk

send_packets(sk);

kfree(sk)
....

struct sock *sk = kmalloc(...); // new_sk

-> Due to SLAB/SLUB caches, new_sk could be == old_sk

So FQ detects that sk points to another flow/socket.

> a. in original logic, the stale flow is f which returne from the red position “return f” is not detached, because f->next is NULL,which not be same as &detached,
> so fq_enqueue can not set sk_pacing_status SK_PACING_FQ because fq_flow_is_detached return false.
> My question is that how to insert the fq_flow to new_flows and how to send this skb， fq_flow_add_tail(&q->new_flows, f) can not be exec because fq_flow_is_detached return false too.
>
> b. in fq_classify, fq_root will insert a invalid sk addr(sk = (struct sock *)((hash << 1) | 1UL)) when sending synack, tcp_v4_syn_recv_sock will create a new sock of child after recving ack of handshake, so the new sock must not be
> the same as sk of synack, what is the point of doing this？
>
> c. socket might have been reallocated, does request_sock(TCPF_NEW_SYN_RECV) and sock(TCP_ESTABLISHED) have the same problem?

No, NEW_SYN_RECV are not real sockets, but pseudo ones (request
sockets). They can not be == to an established socket, they are
different kind of objects.

> To view this discussion on the web visit https://groups.google.com/d/msgid/bbr-dev/0ae156e7-2cc0-4c0c-8876-a2ca738a67e9n%40googlegroups.com.

mingkun bian

unread,

Dec 23, 2020, 10:23:53 PM12/23/20

to BBR Development

Hi,

Thanks your detailed reply.

I'm sorry I didn't read the patch seriously, I ignored the first modification of the patch, so the real modified logic should look like this, which I ignore the red position "smp_store_release(&sk->sk_pacing_status,SK_PACING_FQ)", it make sure

conflicting fq_flow sk have the right sk_pacing_status.

static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) {

f = fq_classify(skb, q);

if (fq_flow_is_detached(f)) {

fq_flow_add_tail(&q->new_flows, f);

...

}

static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q) {

root = &q->fq_root[hash_ptr(sk, q->fq_trees_log)];

p = &root->rb_node;

while(*p) {

if(f->sk == sk)

smp_store_release(&sk->sk_pacing_status,SK_PACING_FQ);

return f;

}

f = kmem_cache_zalloc;

fq_flow_set_detached(f);

if (skb->sk == sk)

smp_store_release(&sk->sk_pacing_status,SK_PACING_FQ);

rb_insert(f);

return f;

}

The scenario where this problem occurs is as follows, am I right?

1. struct sock *sk = kmalloc(sizeof(TCP sock)); // old_sk

send_packets(sk);

kfree(sk)

Then old_sk's fq_flow may change from new_flows to old_flows exactly, which is not detached and gc.

2. struct sock *sk = kmalloc(...); // new_sk

Then fq_classify will return a conflicting fq_flow of old_sk, which is not detached because of step 1， then fq_enqueue can not set sk_pacing_status SK_PACING_FQ，

And bbr can only use intertal pace of tcp stack.

3. Later, fq_flow will become detached when it have no skb because fq_flow belong to old_flows. Then fq_enqueue can set sk_pacing_status SK_PACING_FQ, then bbr change pacing from intertal pace of tcp stack to sch_fq.

static struct sk_buff *fq_dequeue(struct Qdisc *sch) {

skb = fq_dequeue_head(sch, f);

if (!skb) {

head->first = f->next;

/* force a pass through old_flows to prevent starvation */

if ((head == &q->new_flows) && q->old_flows.first) {

fq_flow_add_tail(&q->old_flows, f);

} else {

fq_flow_set_detached(f);

q->inactive_flows++;

}

goto begin;

}

Another questions:

A. in fq_classify, fq_root will insert a invalid sk addr(sk = (struct sock *)((hash << 1) | 1UL)) when sending synack, tcp_v4_syn_recv_sock will create a new sock of child after recving ack of handshake, so the new sock must not be

the same as sk of synack, so child sk can not search pseudo ones (request sockets) successly from rb_node, what is the point of doing this？

>>it may be used to retranmit synack, am I right?

B. What are the more advantages of sch_fq over intertal pace of tcp stack expect more cpu cycles? my kernal version is 4.18.0, which does not have EDT, in view of this situation, is intertal pace of tcp stack more suitable than sch_fq if