BIG TCP question


Brian Tierney

Mar 13, 2024, 10:31:13 AM
to BBR Development

I'm sorry for posting a question not related to BBR here, but I know there are folks in this group who would know the answer.

My understanding is that BIG TCP is supported starting with the 6.3 kernel, but I don't see the BIG TCP patch in the code here:
https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.6.21.tar.xz
nor in the code here:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/snapshot/net-next-6.7.tar.gz

What's the trick? Do I need to apply the patch myself? 

Thanks!

Eric Dumazet

Mar 13, 2024, 10:40:21 AM
to Brian Tierney, BBR Development
On Wed, Mar 13, 2024 at 3:31 PM Brian Tierney <blti...@gmail.com> wrote:
>
>
> I'm sorry for posting a question not related to BBR here, but I know there are folks in this group who would know the answer.
>
> My understanding is that BIG TCP is supported starting with the 6.3 kernel, but I don't see the BIG TCP patch in the code here:
> https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.6.21.tar.xz
> nor in the code here:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/snapshot/net-next-6.7.tar.gz
>

BIG TCP is there, starting from linux-6.1

You need a capable NIC, of course.

(look for tso_max_size in "ip -s -d link sh dev eth0" output)

Then you need to enable it.

For ipv6
ip link set dev eth0 gso_max_size 150000 gro_max_size 150000
For ipv4
... gso_ipv4_max_size 150000 gro_ipv4_max_size 150000
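Putting it together, a rough sketch (the device name and the 185000-byte sizes are purely illustrative, and the *_ipv4_* attributes need a recent iproute2):

ip -s -d link show dev eth0 | grep tso_max_size                           # confirm the NIC is capable
ip link set dev eth0 gso_max_size 185000 gro_max_size 185000              # IPv6
ip link set dev eth0 gso_ipv4_max_size 185000 gro_ipv4_max_size 185000    # IPv4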

> What's the trick? Do I need to apply the patch myself?
> https://lore.kernel.org/netdev/20220303181607.109...@gmail.com/
>
> Thanks!
>

Brian Tierney

Mar 13, 2024, 12:22:04 PM
to Eric Dumazet, BBR Development

Thanks for the quick response! I was looking for gso_ipv6_max_size.
This works for me:
ip link set dev eth100 gro_max_size 185000 gso_max_size 185000


But this gives an error:
ip link set dev eth100 gso_ipv4_max_size 150000 gro_ipv4_max_size 150000
Error: either "dev" is duplicate, or "gso_ipv4_max_size" is a garbage.

(BTW, this is a Mellanox NIC)

Eric Dumazet

Mar 13, 2024, 12:25:13 PM
to Brian Tierney, BBR Development
On Wed, Mar 13, 2024 at 5:22 PM Brian Tierney <blti...@gmail.com> wrote:
>
>
> Thanks for the quick response! I was looking for gso_ipv6_max_size.
> This works for me:
> ip link set dev eth100 gro_max_size 185000 gso_max_size 185000
>
>
> But this gives an error:
> ip link set dev eth100 gso_ipv4_max_size 150000 gro_ipv4_max_size 150000
> Error: either "dev" is duplicate, or "gso_ipv4_max_size" is a garbage.
>


Maybe your ip command is too old?
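(You can check which version is installed with "ip -V"; the gso_ipv4_max_size / gro_ipv4_max_size attributes are only understood by iproute2 releases newer than the IPv4 BIG TCP kernel support.)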

Neal Cardwell

Mar 13, 2024, 1:15:16 PM
to Eric Dumazet, Brian Tierney, BBR Development
> Maybe your ip command is too old?

Yeah, that seems likely. Brian, can you please try using the latest "ip" binary, perhaps with something like:
git clone https://git.kernel.org/pub/scm/network/iproute2/iproute2.git
cd iproute2/
./configure
make
ip/ip link set dev eth100 gso_ipv4_max_size 150000 gro_ipv4_max_size 150000   # use the freshly built binary

How does that work?

cheers,
neal



MUHAMMAD AHSAN

Mar 13, 2024, 7:45:50 PM
to Neal Cardwell, Eric Dumazet, Brian Tierney, BBR Development
Well, the question is: is BIG TCP good for home gateways? E.g. for the common wireless access points in homes and offices, can BIG TCP be deployed on Linux client machines with 1 Gb wireless Ethernet cards to get good speed gains and lower latencies? Or is it good for data centers only?


Regards,
Ahsan

Bob McMahon

Mar 13, 2024, 9:49:00 PM
to MUHAMMAD AHSAN, Neal Cardwell, Eric Dumazet, Brian Tierney, BBR Development
My opinions only and not those of my employer.

We don't seem to need BIG TCP to get full capacity over WiFi links (even beyond 1 Gb/s).

A primary issue now in home networks is latency. Some things that improve WiFi latency include:
  • Use of TCP_NOTSENT_LOWAT on the send side (a minimal sketch follows after this list)
  • Reduce air contention or media access contention: reduce the number of STAs per AP. One radio per client is good, as the link becomes (sorta) point to point. Modern APs should have a minimum of 3 radios, one per band (2.4G, 5G, and 6G)
  • Eliminate meshes; instead use a fronthaul network, fiber or coaxial, to each room and place an AP in each room. Ceiling-mounted APs are good for this.
  • Turn down the APs' power so the SINR can maximize the MCS (see this table) and better take advantage of the spatial dimensions
  • Don't put distant devices that use a very low MCS on the same AP as close devices that have a higher MCS
  • Minimize hidden nodes to mitigate the need for RTS/CTS
  • Use MLO when it becomes available on the market
  • Use APs with AQM and ECN support to mitigate bufferbloat (ECN in the L4S style is a work in progress; Comcast recently enabled support)
I probably forgot a few things but this would be a good start.
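
For the TCP_NOTSENT_LOWAT item, a minimal sketch (the 128KB threshold is purely illustrative and should be tuned for the workload):

sysctl -w net.ipv4.tcp_notsent_lowat=131072   # system-wide default for new sockets, in bytes

Applications can also set it per socket with the TCP_NOTSENT_LOWAT socket option.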

Long term, invest in a FiWi in-home network (which doesn't yet exist). Replace all the APs with remote radio heads, four per room, and use a 32-port 100G open source switch as the FiWi concentrator. Then build a small team, write a bunch of new software, and change the world ;)

Bob

 



Bob McMahon

Mar 13, 2024, 9:53:39 PM
to MUHAMMAD AHSAN, Neal Cardwell, Eric Dumazet, Brian Tierney, BBR Development
Sorry, don't use coaxial - I meant to say cat5-8, i.e. 4 twisted copper pairs (8 conductors) with the best shielding and the proper amount of twist.

Bob

Neal Cardwell

Mar 14, 2024, 9:50:56 AM
to MUHAMMAD AHSAN, Eric Dumazet, Brian Tierney, BBR Development
On Wed, Mar 13, 2024 at 7:45 PM MUHAMMAD AHSAN <muhamm...@umt.edu.pk> wrote:
Well, the question is , Is big tcp good for home gateways? E.g common wireless access points connectivity in home and offices, can big tcp be deployed on linux client machines with 1gb wireless ethernet cards to get good speed gains and lower latencies?? Or is it good for data centers only?

AFAIK generally BIG TCP will only help if a Linux TCP connection is going faster than roughly 512 Mbit/sec. This is because Linux TCP sizes skbuffs to sk_pacing_rate * 1ms, and 512 Mbit/sec * 1ms is 64KBytes, which will roughly fit in one non-BIG-TCP skbuff.
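As a back-of-the-envelope check: 512 Mbit/sec * 1 ms = 512 Kbit ≈ 64 KBytes, i.e. the classic 64KB ceiling of a non-BIG-TCP skbuff; only a flow pacing faster than that fills more than one such skbuff per millisecond.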

Typically, today's Linux TCP connections to the home are not going faster than 512 Mbit/sec, so typically they would not benefit. Though at the 1 Gbit/sec rate you mention, there is the potential for benefits.

Needless to say, having multiple flows going faster than 512 Mbit/sec is more common in datacenters and so the big benefits will more typically be present in datacenters.

cheers, 
neal

Jason Xing

Mar 14, 2024, 10:33:58 AM
to BBR Development
On Thursday, March 14, 2024 at 21:50:56 UTC+8, Neal Cardwell wrote:
On Wed, Mar 13, 2024 at 7:45 PM MUHAMMAD AHSAN <muhamm...@umt.edu.pk> wrote:
Well, the question is , Is big tcp good for home gateways? E.g common wireless access points connectivity in home and offices, can big tcp be deployed on linux client machines with 1gb wireless ethernet cards to get good speed gains and lower latencies?? Or is it good for data centers only?

AFAIK generally BIG TCP will only help if a Linux TCP connection is going faster than roughly 512 Mbit/sec. This is because Linux TCP sizes skbuffs to sk_pacing_rate * 1ms, and 512 Mbit/sec * 1ms is 64KBytes, which will roughly fit in one non-BIG-TCP skbuff.

Typically, today's Linux TCP connections to the home are not going faster than 512 Mbit/sec, so typically they would not benefit. Though at the 1 Gbit/sec rate you mention, there is the potential for benefits.

Needless to say, having multiple flows going faster than 512 Mbit/sec is more common in datacenters and so the big benefits will more typically be present in datacenters.


Thanks for the explanation.

However, the results of some tests on loopback do not behave as I expected. LWN reported some good numbers for the same kind of test I ran.

Here are some bad numbers if using "for i in {1..10}; do netperf -t TCP_RR -H 127.0.0.1 -l 10 -p 8888 -- -r128k,128k -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done":
with BIG TCP enabled, throughput is around 12061.97
without BIG TCP, it is around 13087.77

I see some unexpected numbers when the sending packet size is larger than 80000. Do you have any idea why?

[Actually, I've already sent this email, but I cannot see it. I have to rewrite it :S]

Thanks,
Jason

Eric Dumazet

Mar 14, 2024, 10:40:46 AM
to Jason Xing, BBR Development
On Thu, Mar 14, 2024 at 3:34 PM Jason Xing <kernelj...@gmail.com> wrote:
>
>
>
> On Thursday, March 14, 2024 at 21:50:56 UTC+8, Neal Cardwell wrote:
>
> On Wed, Mar 13, 2024 at 7:45 PM MUHAMMAD AHSAN <muhamm...@umt.edu.pk> wrote:
>
> Well, the question is , Is big tcp good for home gateways? E.g common wireless access points connectivity in home and offices, can big tcp be deployed on linux client machines with 1gb wireless ethernet cards to get good speed gains and lower latencies?? Or is it good for data centers only?
>
>
> AFAIK generally BIG TCP will only help if a Linux TCP connection is going faster than roughly 512 Mbit/sec. This is because Linux TCP sizes skbuffs to sk_pacing_rate * 1ms, and 512 Mbit/sec * 1ms is 64KBytes, which will roughly fit in one non-BIG-TCP skbuff.
>
> Typically, today's Linux TCP connections to the home are not going faster than 512 Mbit/sec, so typically they would not benefit. Though at the 1 Gbit/sec rate you mention, there is the potential for benefits.
>
> Needless to say, having multiple flows going faster than 512 Mbit/sec is more common in datacenters and so the big benefits will more typically be present in datacenters.
>
>
>
> Thanks for the explanation.
>
> However, the results of some tests on loopback do not behave like what I expect. Indeed, I saw some good numbers in LWN which is the same as what I tested.
>
> Here are some bad numbers if using "for i in {1..10}; do netperf -t TCP_RR -H 127.0.0.1 -l 10 -p 8888 -- -r128k,128k -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done":
> with BIG TCP enabled, throughput is around 12061.97
> without BIG TCP, it is around 13087.77
>
> I see some unexpected numbers if the sending packet size is larger than 80000. Do you have idea why?

Too small cpu caches... cache eviction with too big packets/queues.

loopback traffic is very different from 'real' traffic.

You might need to play with tcp_rmem[] and tcp_wmem[] values as well.
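
For example (values purely illustrative; each sysctl takes min, default, and max, in bytes):

sysctl -w net.ipv4.tcp_rmem='4096 131072 268435456'
sysctl -w net.ipv4.tcp_wmem='4096 16384 268435456'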

Eric Dumazet

Mar 14, 2024, 10:45:57 AM
to Jason Xing, BBR Development
And of course, with 128KB chunks, a receiver will be awakened when the last byte is received (with BIG TCP), instead of being awakened after byte #65536.

Without BIG TCP, there is an overlap, where the receiver has already copied the first 64KB part.

-> Having two wakeups (because the receiver receives two packets instead of one) is beneficial for latency in this particular case, but costs twice as much in cpu cycles.

If you are looking for the best latencies on a single-flow workload, you can also burn a cpu with busy polling.
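
For example (the 50-microsecond budget is purely illustrative):

sysctl -w net.core.busy_read=50   # busy-poll budget for blocking reads, in usec
sysctl -w net.core.busy_poll=50   # busy-poll budget for poll()/select(), in usec

(Per-socket control via the SO_BUSY_POLL socket option is also possible.)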

Brian Tierney

Mar 14, 2024, 11:33:50 AM
to Neal Cardwell, Eric Dumazet, BBR Development

Yep, the newer version of iproute2 fixed that problem.

I'm not seeing any performance improvements, but I do see lower CPU usage on the receive core (75% vs 90%).

Test environment: 
   Two 100G hosts connected to a 100G switch, Mellanox NICs, RTT: 0.1 ms, CPU: AMD EPYC 73F3 16-core, 3.5 GHz

Default settings:

numactl -N 0 netperf -t TCP_STREAM -H  10.0.0.8  -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
Minimum      90th         99th         Throughput
Latency      Percentile   Percentile
Microseconds Latency      Latency
             Microseconds Microseconds
3            10           254          48310.24

numactl -N 0 iperf3 -c 10.0.0.8
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  56.4 GBytes  48.4 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  56.4 GBytes  48.4 Gbits/sec                  receiver

netperf -t TCP_RR -H  10.0.0.8  -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
Minimum      90th         99th         Throughput
Latency      Percentile   Percentile
Microseconds Latency      Latency
             Microseconds Microseconds
56           125          169          10628.35



BIG TCP
ip link set dev eth100 gso_ipv4_max_size 185000 gro_ipv4_max_size 185000 #(send and receive host)
numactl -N 0 netperf -t TCP_STREAM  -H  10.0.0.8  -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
Minimum      90th         99th         Throughput
Latency      Percentile   Percentile
Microseconds Latency      Latency
             Microseconds Microseconds
3            10           252          48242.71

numactl -N 0 iperf3 -c 10.0.0.8
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  57.1 GBytes  49.0 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  57.1 GBytes  49.0 Gbits/sec                  receiver


netperf -t TCP_RR -H  10.0.0.8  -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT
Minimum      90th         99th         Throughput
Latency      Percentile   Percentile
Microseconds Latency      Latency
             Microseconds Microseconds
58           129          167          10315.34



Eric Dumazet

Mar 14, 2024, 11:38:30 AM
to Brian Tierney, Neal Cardwell, BBR Development
On Thu, Mar 14, 2024 at 4:33 PM Brian Tierney <blti...@gmail.com> wrote:
>
>
> Yep, the newer version of iproute2 fixed that problem.
>
> I'm not seeing any performance improvements, but I do see lower CPU usage on the receive core (75% vs 90%).

We get line rate (200Gbit NIC) on one single TCP flow, with BIG TCP.

In order to get this kind of performance:

- Zero copy on tx and rx

- 4K MTU (to enable rx zerocopy)

- BIG TCP

- CONFIG_MAX_SKB_FRAGS=45 kernel config

If you use 1500 MTU and standard tools, the bottlenecks are the copies
from/to user/kernel space.
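
Roughly, on the link-configuration side that corresponds to something like (device name and sizes purely illustrative; the zero-copy part needs application support, e.g. MSG_ZEROCOPY on tx and TCP_ZEROCOPY_RECEIVE on rx):

ip link set dev eth0 mtu 4096                                              # 4K MTU
ip link set dev eth0 gso_ipv4_max_size 185000 gro_ipv4_max_size 185000     # BIG TCP
# plus a kernel built with CONFIG_MAX_SKB_FRAGS=45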

Not sure if this discussion belongs on the BBR list.

Jason Xing

Mar 14, 2024, 10:50:06 PM
to BBR Development
Thanks!

I suspect that the cpu caches are the key factor behind the poorer performance. But what I can see in the VM is not that weird:
$ lscpu
……
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            36608K
……

I will try it on a physical server later.
 


loopback traffic is very different from 'real' traffic.

Sure. I noticed that some of our customers hope we can get better performance on loopback, so I resorted to BIG TCP...

I'm studying whether there is a way to make loopback run faster :)
 


You might need to play with tcp_rmem[] and tcp_wmem[] values as well.

It seems that the tcp memory settings don't have any impact on the test in the VM; that is to say, the performance is not limited by memory.

I saw your other reply as well. I think you're right. I will dig into this part more deeply.

Thanks,
Jason
 