Linux Route Cache performance tests


Paweł Staszewski

Nov 6, 2011, 10:57:22 AM
to Linux Network Development list, Eric Dumazet
Hello

I ran some networking performance tests on Linux 3.1.

Configuration:

Linux (pktgen) ----> Linux (router) ----> Linux (Sink)

pktgen config:
clone_skb 32
pkt_size 64
delay 0

pgset "flag IPDST_RND"
pgset "dst_min 10.0.0.0"
pgset "dst_max 10.18.255.255"
pgset "config 1"
pgset "flows 256"
pgset "flowlen 8"

TX performance for this host:
eth0: RX: 0.00 P/s TX: 12346107.73 P/s TOTAL: 12346107.73 P/s

On Linux (router):
grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:500
/proc/sys/net/ipv4/route/error_cost:100
grep: /proc/sys/net/ipv4/route/flush: Permission denied
/proc/sys/net/ipv4/route/gc_elasticity:4
/proc/sys/net/ipv4/route/gc_interval:60
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:2000000
/proc/sys/net/ipv4/route/gc_timeout:60
/proc/sys/net/ipv4/route/max_size:8388608
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:2
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:2048

For the first 30 seconds (maybe more) the router forwards ~5Mpps to the
Linux (Sink).
Some stats for these first 30 seconds are in the attached image:

http://imageshack.us/photo/my-images/684/test1ih.png/

Top left - pktgen Linux
Bottom left - Linux router (htop)
Top right - Linux router (bwm-ng, showing pps)
Bottom right - Linux router (lnstat)


And all is good - performance stays at 5Mpps until the Linux router
reaches ~1M route cache entries,
which you can see in the next attached image:

http://imageshack.us/photo/my-images/24/test2id.png/

Forwarding performance drops from 5Mpps to 1.8Mpps,
and after 3-4 minutes it settles at 0.7Mpps.


After flushing the route cache, performance increases from 0.7Mpps to 6Mpps,
which you can see in the next attached image:

http://imageshack.us/photo/my-images/197/test3r.png/

Is it possible to turn off the route cache, to see what the performance
would be without caching?


Thanks
Pawel

Eric Dumazet

Nov 6, 2011, 12:29:57 PM
to Paweł Staszewski, Linux Network Development list

The route cache cannot handle a DDoS situation, since it will be filled,
unless you have a lot of memory.

I am not sure what you expected here. If cache misses are too frequent,
a cache is useless, no matter how it's done.

If you disable the route cache, you'll get poor performance in the normal
situation (99.9999% of cases, non-DDoS), and the same performance under
DDoS, in 0.0001% of cases.

The trick to disable it is to use a big (and therefore negative) rebuild_count:

$ echo 3000000000 >/proc/sys/net/ipv4/rt_cache_rebuild_count
$ cat /proc/sys/net/ipv4/rt_cache_rebuild_count
-1294967296

(The sysctl stores a signed 32-bit integer, so 3,000,000,000 wraps around to
3,000,000,000 - 2^32 = -1,294,967,296; a negative rebuild count effectively
disables the cache.)

Paweł Staszewski

Nov 6, 2011, 1:28:40 PM
to Eric Dumazet, Linux Network Development list
On 2011-11-06 18:29, Eric Dumazet wrote:
hmm
but what counts as a DDoS situation for the route cache? New entries per
second? The total of ~1.2M entries in my tests?
Note that in a normal scenario you can hit 1245072 route cache entries -
this is normal for BGP configurations.

The performance of the route cache is OK up to the point where we reach
more than 1245072 entries.
The router starts forwarding packets at 5Mpps and ends at about
0.7Mpps when more than 1245072 entries are reached.
In my scenario, random IP generation starts at 10.0.0.0 and ends at
10.18.255.255 - that is 19 * 65536 = 1,245,184 possible random IPs.

> I am not sure what you expected here. If cache misses are too frequent,
> a cache is useless, no matter how it's done.

Yes, I understand this.


> If you disable the route cache, you'll get poor performance in the normal
> situation (99.9999% of cases, non-DDoS), and the same performance under
> DDoS, in 0.0001% of cases
>
> The trick to disable it is to use a big (and therefore negative) rebuild_count:
>
> $ echo 3000000000 >/proc/sys/net/ipv4/rt_cache_rebuild_count
> $ cat /proc/sys/net/ipv4/rt_cache_rebuild_count
> -1294967296

OK, so disabling the route cache:

echo 300000000000 > /proc/sys/net/ipv4/rt_cache_rebuild_count
cat /proc/sys/net/ipv4/rt_cache_rebuild_count
-647710720


I can reach 4Mpps forwarding performance without degradation over time.
/ iface           Rx              Tx              Total
==============================================================================
  lo:             0.00 P/s        0.00 P/s        0.00 P/s
  eth1:           1.00 P/s        1.00 P/s        2.00 P/s
  eth2:           0.00 P/s        3971015.09 P/s  3971015.09 P/s
  eth3:           3970941.17 P/s  0.00 P/s        3970941.17 P/s
------------------------------------------------------------------------------
  total:          3970942.17 P/s  3971016.09 P/s  7941958.26 P/s

lnstat -c -1 -i 1 -f rt_cache -k entries
rt_cache|
entries|
8|
6|
5|
10|
5|
7|
7|
11|
5|
11|
11|
6|
7|
6|


So with the route cache disabled, performance is better once there are
over 1M route cache entries.
And it is constant: I now get the same performance for 10k, 50k, 100k, or
1M generated random IPs.

So yes, in a scenario where you can count on the route cache never
exceeding ~1M entries, performance with the route cache enabled is 2x
better than with it disabled.
But when you somehow exceed ~1M entries in the route cache, the
router almost stops forwarding traffic.

Thanks
Pawel

Eric Dumazet

Nov 6, 2011, 1:48:46 PM
to Paweł Staszewski, Linux Network Development list

Then figure out the right tunables for your machine?

It's not a laptop or average server setup, so you need to allow your
kernel to consume a fair amount of memory for the route cache.

Or accept low performance :(

> The performance of the route cache is OK up to the point where we reach
> more than 1245072 entries.
> The router starts forwarding packets at 5Mpps and ends at about
> 0.7Mpps when more than 1245072 entries are reached.
> In my scenario, random IP generation starts at 10.0.0.0 and ends at
> 10.18.255.255 - that is 19 * 65536 = 1,245,184 possible random IPs.
>
>

I have no problem with 4 million entries in the route cache, at full
performance, not 80%.


You currently have one hash table with 524288 slots
(before you changed /proc/sys/net/ipv4/route/gc_thresh).

That is not optimal for your workload, because you have many slots with 4
chained items, and performance suffers.

You have to boot your machine with "rhash_entries=2097152", so that the
average chain length is less than 1.
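
The arithmetic behind that recommendation, as a quick shell sanity check
(the entry counts are the ones from this thread, not a general rule):

# average chain length ~= cached routes / hash slots
echo $(( 1400000 /  524288 ))   # ~2 on average, so some chains reach 4+
echo $(( 1400000 / 2097152 ))   # 0, i.e. most slots hold at most one entry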

Your problem is then solved :

# grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:5000
/proc/sys/net/ipv4/route/error_cost:1000
/proc/sys/net/ipv4/route/gc_elasticity:8
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:2097152
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_size:33554432


/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600

/proc/sys/net/ipv4/route/redirect_load:20
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:20480
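
A hypothetical example of setting the boot parameter - the GRUB2 path is an
assumption, adjust for your bootloader; the dmesg line is what one would
expect for 2M slots (8 bytes per bucket pointer on x86_64):

# /etc/default/grub
GRUB_CMDLINE_LINUX="... rhash_entries=2097152"

# after regenerating the bootloader config and rebooting, verify:
dmesg | grep 'IP route cache'
# expected: IP route cache hash table entries: 2097152 (order: 12, 16777216 bytes)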

Paweł Staszewski

Nov 6, 2011, 2:20:38 PM
to Eric Dumazet, Linux Network Development list
On 2011-11-06 19:48, Eric Dumazet wrote:
Yes, these parameters were deliberately left untuned :)
To see what the route cache performance limit is.

Because there were no optimal parameters for this test :)
No matter what I tuned, the results were always the same:
performance drops from 5Mpps to 0.7Mpps without tuning sysctl.

And with tuned parameters I can reach the same as with the route cache
turned off - when running these tests.
So yes, tuned performance is better:
performance drops from 5Mpps to 0.7Mpps without tuning,
and from 5Mpps to 3.7Mpps with tuned sysctl - so a little less than with
the route cache turned off.

So the point of this test was to figure out how many route cache entries
Linux can handle without dropping performance.


> Or accept low performance :(

Never :)

Eric Dumazet

Nov 6, 2011, 2:38:10 PM
to Paweł Staszewski, Linux Network Development list

Hmm, I thought you were asking for help on netdev?

> Because there were no optimal parameters for this test :)
> No matter what I tuned, the results were always the same:
> performance drops from 5Mpps to 0.7Mpps without tuning sysctl.
>
> And with tuned parameters I can reach the same as with the route cache
> turned off - when running these tests.
> So yes, tuned performance is better:
> performance drops from 5Mpps to 0.7Mpps without tuning,
> and from 5Mpps to 3.7Mpps with tuned sysctl - so a little less than with
> the route cache turned off.
>
> So the point of this test was to figure out how many route cache entries
> Linux can handle without dropping performance.

No need to even run a benchmark; it's pretty easy to understand how a hash
table behaves.

Allowing long chains is not good.

With your 512k-slot hash table, you cannot expect to handle 1.4M routes
with optimal performance. End of story.

Since the route hash table is allocated at boot time, the only way to change
its size is the "rhash_entries=2097152" boot parameter.

If it still doesn't fly, try "rhash_entries=4194304".

Paweł Staszewski

Nov 6, 2011, 3:25:18 PM
to Eric Dumazet, Linux Network Development list
On 2011-11-06 20:38, Eric Dumazet wrote:
The title said "tests" :)
And yes, maybe some help like you are giving me - about understanding how
the kernel works with and without the route cache.

>
>> Because there were no optimal parameters for this test :)
>> No matter what I tuned, the results were always the same:
>> performance drops from 5Mpps to 0.7Mpps without tuning sysctl.
>>
>> And with tuned parameters I can reach the same as with the route cache
>> turned off - when running these tests.
>> So yes, tuned performance is better:
>> performance drops from 5Mpps to 0.7Mpps without tuning,
>> and from 5Mpps to 3.7Mpps with tuned sysctl - so a little less than with
>> the route cache turned off.
>>
>> So the point of this test was to figure out how many route cache entries
>> Linux can handle without dropping performance.
> No need to even run a benchmark; it's pretty easy to understand how a hash
> table behaves.
>
> Allowing long chains is not good.
>
> With your 512k-slot hash table, you cannot expect to handle 1.4M routes
> with optimal performance. End of story.
>
> Since the route hash table is allocated at boot time, the only way to change
> its size is the "rhash_entries=2097152" boot parameter.
>
> If it still doesn't fly, try "rhash_entries=4194304".

Yes, there is a little problem with this in kernel 3.1, I think, because:
dmesg | egrep '(rhash)|(route)'
[ 0.000000] Command line: root=/dev/md2 rhash_entries=2097152
[ 0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
[ 4.697294] IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)


Thanks
Pawel

Eric Dumazet

Nov 6, 2011, 4:26:28 PM
to Paweł Staszewski, Linux Network Development list
On Sunday, November 6, 2011 at 21:25 +0100, Paweł Staszewski wrote:
> Yes, there is a little problem with this in kernel 3.1, I think, because:
> dmesg | egrep '(rhash)|(route)'
> [ 0.000000] Command line: root=/dev/md2 rhash_entries=2097152
> [ 0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
> [ 4.697294] IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)
>
>

Don't tell me you _still_ use a 32-bit kernel?

If so, you need to tweak alloc_large_system_hash() to use vmalloc,
because you hit the MAX_ORDER (10) page allocation limit.

But considering LOWMEM is about 700 MB, you won't be able to create a
lot of route cache entries.

Come on, do us a favor and enter the new era of computing.

Paweł Staszewski

Nov 6, 2011, 4:57:39 PM
to Eric Dumazet, Linux Network Development list
On 2011-11-06 22:26, Eric Dumazet wrote:

> On Sunday, November 6, 2011 at 21:25 +0100, Paweł Staszewski wrote:
>> Yes, there is a little problem with this in kernel 3.1, I think, because:
>> dmesg | egrep '(rhash)|(route)'
>> [ 0.000000] Command line: root=/dev/md2 rhash_entries=2097152
>> [ 0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
>> [ 4.697294] IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)
>>
>>
> Don't tell me you _still_ use a 32-bit kernel?
no, it is 64-bit :)
Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
:)

> If so, you need to tweak alloc_large_system_hash() to use vmalloc,
> because you hit the MAX_ORDER (10) page allocation limit.

Funny, then :)
Maybe I turned off too many kernel features.

Eric Dumazet

Nov 6, 2011, 6:08:35 PM
to Paweł Staszewski, Linux Network Development list
On Sunday, November 6, 2011 at 22:57 +0100, Paweł Staszewski wrote:
> On 2011-11-06 22:26, Eric Dumazet wrote:
> > On Sunday, November 6, 2011 at 21:25 +0100, Paweł Staszewski wrote:
> >> Yes, there is a little problem with this in kernel 3.1, I think, because:
> >> dmesg | egrep '(rhash)|(route)'
> >> [ 0.000000] Command line: root=/dev/md2 rhash_entries=2097152
> >> [ 0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
> >> [ 4.697294] IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)
> >>
> >>
> > Don't tell me you _still_ use a 32-bit kernel?
> no, it is 64-bit :)
> Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
> :)
>
> > If so, you need to tweak alloc_large_system_hash() to use vmalloc,
> > because you hit the MAX_ORDER (10) page allocation limit.
> Funny, then :)
> Maybe I turned off too many kernel features.
> > But considering LOWMEM is about 700 MB, you won't be able to create a
> > lot of route cache entries.
> >
> > Come on, do us a favor and enter the new era of computing.

OK, then your kernel does not have CONFIG_NUMA enabled.

That seems strange, given you probably have a NUMA machine (24 CPUs).

If so, your choices are:

1) Enable CONFIG_NUMA. Really, this is a must given the workload of your
machine.

2) Or: add "hashdist=1" to the boot parameters
and patch your kernel with the following patch:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..07f86e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5362,7 +5362,6 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,

int hashdist = HASHDIST_DEFAULT;

-#ifdef CONFIG_NUMA
static int __init set_hashdist(char *str)
{
if (!str)
@@ -5371,7 +5370,6 @@ static int __init set_hashdist(char *str)
return 1;
}
__setup("hashdist=", set_hashdist);
-#endif

/*
* allocate a large system hash table from bootmem
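
With the patch applied and "rhash_entries=2097152 hashdist=1" on the
command line, a quick sanity check after reboot (a sketch; the expected
line mirrors the boot log quoted earlier in this thread):

dmesg | grep 'IP route cache'
# should now report 2097152 entries instead of 524288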

Paweł Staszewski

Nov 7, 2011, 3:36:53 AM
to Eric Dumazet, Linux Network Development list
On 2011-11-07 00:08, Eric Dumazet wrote:

> On Sunday, November 6, 2011 at 22:57 +0100, Paweł Staszewski wrote:
>> On 2011-11-06 22:26, Eric Dumazet wrote:
>>> On Sunday, November 6, 2011 at 21:25 +0100, Paweł Staszewski wrote:
>>>> Yes, there is a little problem with this in kernel 3.1, I think, because:
>>>> dmesg | egrep '(rhash)|(route)'
>>>> [ 0.000000] Command line: root=/dev/md2 rhash_entries=2097152
>>>> [ 0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
>>>> [ 4.697294] IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)
>>>>
>>>>
>>> Don't tell me you _still_ use a 32-bit kernel?
>> no, it is 64-bit :)
>> Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
>> :)
>>
>>> If so, you need to tweak alloc_large_system_hash() to use vmalloc,
>>> because you hit the MAX_ORDER (10) page allocation limit.
>> Funny, then :)
>> Maybe I turned off too many kernel features.
>>> But considering LOWMEM is about 700 MB, you won't be able to create a
>>> lot of route cache entries.
>>>
>>> Come on, do us a favor and enter the new era of computing.
> OK, then your kernel does not have CONFIG_NUMA enabled.
>
> That seems strange, given you probably have a NUMA machine (24 CPUs).
Yes, NUMA was not enabled.
I ran some tests with and without NUMA to compare the performance of ixgbe
using the Node="" parameter of the ixgbe module.

> If so, your choices are:
>
> 1) Enable CONFIG_NUMA. Really, this is a must given the workload of your
> machine.
>
> 2) Or: add "hashdist=1" to the boot parameters
> and patch your kernel with the following patch:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..07f86e0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5362,7 +5362,6 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
>
> int hashdist = HASHDIST_DEFAULT;
>
> -#ifdef CONFIG_NUMA
> static int __init set_hashdist(char *str)
> {
> if (!str)
> @@ -5371,7 +5370,6 @@ static int __init set_hashdist(char *str)
> return 1;
> }
> __setup("hashdist=", set_hashdist);
> -#endif
>
> /*
> * allocate a large system hash table from bootmem
>

Yes, after enabling NUMA I can change rhash_entries at kernel boot.

And the most important thing for a big route cache is rhash_entries:
if the route cache size exceeds the hash size, performance drops 6x to 8x.
So the best settings for the route cache are:
rhash_entries = gc_thresh = max_size
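
A sketch of that rule with the values used in this thread, assuming
rhash_entries=2097152 is already on the kernel command line (this mirrors
the conclusion above, not an official recommendation):

echo 2097152 > /proc/sys/net/ipv4/route/gc_thresh
echo 2097152 > /proc/sys/net/ipv4/route/max_size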

Eric, tell me, what are the plans for removing the route cache from the kernel?
Because as you can see, performance with the route cache is better.
And without the route cache performance is not as good as with it
enabled, but it is stable in all situations, even a DDoS with 10M
random IPs.

So in the future, should we prepare for lower kernel IP forwarding
performance because there is no route cache?
Or will removing the route cache save some time in IP stack processing?


Thanks
Pawel

Eric Dumazet

Nov 7, 2011, 4:08:42 AM
to Paweł Staszewski, Linux Network Development list
On Monday, November 7, 2011 at 09:36 +0100, Paweł Staszewski wrote:

> Yes, after enabling NUMA I can change rhash_entries at kernel boot.
>
> And the most important thing for a big route cache is rhash_entries:
> if the route cache size exceeds the hash size, performance drops 6x to 8x.
> So the best settings for the route cache are:
> rhash_entries = gc_thresh = max_size
>
> Eric, tell me, what are the plans for removing the route cache from the kernel?
> Because as you can see, performance with the route cache is better.
> And without the route cache performance is not as good as with it
> enabled, but it is stable in all situations, even a DDoS with 10M
> random IPs.
>
> So in the future, should we prepare for lower kernel IP forwarding
> performance because there is no route cache?
> Or will removing the route cache save some time in IP stack processing?
>

Obviously, cache removal will be possible only when performance without
it is the same.

Work is in progress; it started a long time ago.

Eric Dumazet

Nov 7, 2011, 4:16:25 AM
to Paweł Staszewski, Linux Network Development list
On Monday, November 7, 2011 at 10:08 +0100, Eric Dumazet wrote:

> Obviously, cache removal will be possible only when performance without
> it is the same.
>
> Work is in progress; it started a long time ago.
>

One of the reasons to get rid of this cache is its memory use.

256 bytes per entry is a lot of memory if you need 2,000,000
entries...

Ben Hutchings

Nov 7, 2011, 8:42:44 AM
to Eric Dumazet, Paweł Staszewski, Linux Network Development list
On Sun, 2011-11-06 at 20:38 +0100, Eric Dumazet wrote:
> On Sunday, November 6, 2011 at 20:20 +0100, Paweł Staszewski wrote:
[...]

> > So the point of this test was to figure out how many route cache entries
> > Linux can handle without dropping performance.
>
> No need to even run a benchmark; it's pretty easy to understand how a hash
> table behaves.
>
> Allowing long chains is not good.
>
> With your 512k-slot hash table, you cannot expect to handle 1.4M routes
> with optimal performance. End of story.
>
> Since the route hash table is allocated at boot time, the only way to change
> its size is the "rhash_entries=2097152" boot parameter.
>
> If it still doesn't fly, try "rhash_entries=4194304".

A routing cache this big is not going to fit in the processor caches,
anyway; in fact even the hash table may not. So a routing cache hit is
likely to involve processor cache misses. After David's work to make
cacheless operation faster, I suspect that such a 'hit' can be a net
loss. But it *is* necessary to run a benchmark to answer this (and the
answer will obviously vary between systems).

Ben.

--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

Eric Dumazet

Nov 7, 2011, 9:33:42 AM
to Ben Hutchings, Paweł Staszewski, Linux Network Development list
On Monday, November 7, 2011 at 13:42 +0000, Ben Hutchings wrote:

> A routing cache this big is not going to fit in the processor caches,
> anyway; in fact even the hash table may not. So a routing cache hit is
> likely to involve processor cache misses. After David's work to make
> cacheless operation faster, I suspect that such a 'hit' can be a net
> loss. But it *is* necessary to run a benchmark to answer this (and the
> answer will obviously vary between systems).
>

I don't know why you think the full hash table should fit in the processor
cache. If it does, that's perfect, but it's not a requirement.

A lookup costs one cache miss, to get the pointer to the first element in
the chain. Of course this might be a cache hit if several packets for a
given flow are processed in a short period of time.

Given that a dst itself is 256 bytes (4 cache lines), one extra cache miss to
get the pointer to the dst is not very expensive.

At least in recent kernels we don't change dst->refcnt in the forwarding
path (using NOREF skb->dst).

One particular point is the atomic_inc(dst->refcnt) we have to perform
when queuing a UDP packet if the socket asked for PKTINFO (for example, a
typical DNS server has to set this option).

I have a patch somewhere that stores the information in skb->cb[] and
avoids the atomic_{inc|dec}(dst->refcnt).

Paweł Staszewski

Nov 7, 2011, 5:12:14 PM
to Eric Dumazet, Linux Network Development list
On 2011-11-07 10:16, Eric Dumazet wrote:

> On Monday, November 7, 2011 at 10:08 +0100, Eric Dumazet wrote:
>
>> Obviously, cache removal will be possible only when performance without
>> it is the same.
>>
>> Work is in progress; it started a long time ago.
>>
> One of the reasons to get rid of this cache is its memory use.
>
> 256 bytes per entry is a lot of memory if you need 2,000,000
> entries...
>
Yes, it is a lot for small embedded systems.
But in these times, when many systems have 12 / 24 / 48GB of memory, it
is not too much.

Eric Dumazet

Nov 9, 2011, 12:24:35 PM
to Ben Hutchings, David Miller, Paweł Staszewski, Linux Network Development list
On Monday, November 7, 2011 at 15:33 +0100, Eric Dumazet wrote:

> At least in recent kernels we don't change dst->refcnt in the forwarding
> path (using NOREF skb->dst).
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing a UDP packet if the socket asked for PKTINFO (for example, a
> typical DNS server has to set this option).
>
> I have a patch somewhere that stores the information in skb->cb[] and
> avoids the atomic_{inc|dec}(dst->refcnt).
>

OK, I found it. I did some extra tests and believe it's ready.

[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.

We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.

We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.

This removes two atomic operations per packet, and false sharing as
well.

On a benchmark using a mono-threaded receiver (doing only recvmsg()
calls), I can reach 720,000 pps instead of 570,000 pps.

IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.

Signed-off-by: Eric Dumazet <eric.d...@gmail.com>
---
include/net/ip.h | 2 +-
net/ipv4/ip_sockglue.c | 37 +++++++++++++++++++------------------
net/ipv4/raw.c | 3 ++-
net/ipv4/udp.c | 3 ++-
net/ipv6/raw.c | 3 ++-
net/ipv6/udp.c | 4 +++-
6 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index eca0ef7..fd1561e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -450,7 +450,7 @@ extern int ip_options_rcv_srr(struct sk_buff *skb);
* Functions provided by ip_sockglue.c
*/

-extern int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+extern void ipv4_pktinfo_prepare(struct sk_buff *skb);
extern void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
extern int ip_cmsg_send(struct net *net,
struct msghdr *msg, struct ipcm_cookie *ipc);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 09ff51b..b516030 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -55,20 +55,13 @@
/*
* SOL_IP control messages.
*/
+#define PKTINFO_SKB_CB(__skb) ((struct in_pktinfo *)((__skb)->cb))

static void ip_cmsg_recv_pktinfo(struct msghdr *msg, struct sk_buff *skb)
{
- struct in_pktinfo info;
- struct rtable *rt = skb_rtable(skb);
-
+ struct in_pktinfo info = *PKTINFO_SKB_CB(skb);
+
info.ipi_addr.s_addr = ip_hdr(skb)->daddr;
- if (rt) {
- info.ipi_ifindex = rt->rt_iif;
- info.ipi_spec_dst.s_addr = rt->rt_spec_dst;
- } else {
- info.ipi_ifindex = 0;
- info.ipi_spec_dst.s_addr = 0;
- }

put_cmsg(msg, SOL_IP, IP_PKTINFO, sizeof(info), &info);
}
@@ -992,20 +985,28 @@ e_inval:
}

/**
- * ip_queue_rcv_skb - Queue an skb into sock receive queue
+ * ipv4_pktinfo_prepare - transfert some info from rtable to skb
* @sk: socket
* @skb: buffer
*
- * Queues an skb into socket receive queue. If IP_CMSG_PKTINFO option
- * is not set, we drop skb dst entry now, while dst cache line is hot.
+ * To support IP_CMSG_PKTINFO option, we store rt_iif and rt_spec_dst
+ * in skb->cb[] before dst drop.
+ * This way, receiver doesnt make cache line misses to read rtable.
*/
-int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+void ipv4_pktinfo_prepare(struct sk_buff *skb)
{
- if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
- skb_dst_drop(skb);
- return sock_queue_rcv_skb(sk, skb);
+ struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
+ const struct rtable *rt = skb_rtable(skb);
+
+ if (rt) {
+ pktinfo->ipi_ifindex = rt->rt_iif;
+ pktinfo->ipi_spec_dst.s_addr = rt->rt_spec_dst;
+ } else {
+ pktinfo->ipi_ifindex = 0;
+ pktinfo->ipi_spec_dst.s_addr = 0;
+ }
+ skb_dst_drop(skb);
}
-EXPORT_SYMBOL(ip_queue_rcv_skb);

int ip_setsockopt(struct sock *sk, int level,
int optname, char __user *optval, unsigned int optlen)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 007e2eb..7a8410d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -292,7 +292,8 @@ static int raw_rcv_skb(struct sock * sk, struct sk_buff * skb)
{
/* Charge it to the socket. */

- if (ip_queue_rcv_skb(sk, skb) < 0) {
+ ipv4_pktinfo_prepare(skb);
+ if (sock_queue_rcv_skb(sk, skb) < 0) {
kfree_skb(skb);
return NET_RX_DROP;
}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ab0966d..6854f58 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1357,7 +1357,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
if (inet_sk(sk)->inet_daddr)
sock_rps_save_rxhash(sk, skb);

- rc = ip_queue_rcv_skb(sk, skb);
+ rc = sock_queue_rcv_skb(sk, skb);
if (rc < 0) {
int is_udplite = IS_UDPLITE(sk);

@@ -1473,6 +1473,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)

rc = 0;

+ ipv4_pktinfo_prepare(skb);
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
rc = __udp_queue_rcv_skb(sk, skb);
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 331af3b..204f2e8 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -383,7 +383,8 @@ static inline int rawv6_rcv_skb(struct sock *sk, struct sk_buff *skb)
}

/* Charge it to the socket. */
- if (ip_queue_rcv_skb(sk, skb) < 0) {
+ skb_dst_drop(skb);
+ if (sock_queue_rcv_skb(sk, skb) < 0) {
kfree_skb(skb);
return NET_RX_DROP;
}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 846f475..b4a4a15 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -538,7 +538,9 @@ int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
goto drop;
}

- if ((rc = ip_queue_rcv_skb(sk, skb)) < 0) {
+ skb_dst_drop(skb);
+ rc = sock_queue_rcv_skb(sk, skb);
+ if (rc < 0) {
/* Note that an ENOMEM error is charged twice */
if (rc == -ENOMEM)
UDP6_INC_STATS_BH(sock_net(sk),

David Miller

Nov 9, 2011, 4:37:08 PM
to eric.d...@gmail.com, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org
From: Eric Dumazet <eric.d...@gmail.com>
Date: Wed, 09 Nov 2011 18:24:35 +0100

> [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
>
> When a socket uses IP_PKTINFO notifications, we currently force a dst
> reference for each received skb. Reader has to access dst to get needed
> information (rt_iif & rt_spec_dst) and must release dst reference.
>
> We also forced a dst reference if skb was put in socket backlog, even
> without IP_PKTINFO handling. This happens under stress/load.
>
> We can instead store the needed information in skb->cb[], so that only
> softirq handler really access dst, improving cache hit ratios.
>
> This removes two atomic operations per packet, and false sharing as
> well.
>
> On a benchmark using a mono-threaded receiver (doing only recvmsg()
> calls), I can reach 720,000 pps instead of 570,000 pps.
>
> IP_PKTINFO is typically used by DNS servers, and any multihomed aware
> UDP application.
>
> Signed-off-by: Eric Dumazet <eric.d...@gmail.com>

Looks good, if it compiles I'll push it out to net-next :-)

Eric Dumazet

Nov 9, 2011, 5:03:03 PM
to David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org
On Wednesday, November 9, 2011 at 16:37 -0500, David Miller wrote:
> From: Eric Dumazet <eric.d...@gmail.com>
> Date: Wed, 09 Nov 2011 18:24:35 +0100
>
> > [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
> >
> > When a socket uses IP_PKTINFO notifications, we currently force a dst
> > reference for each received skb. Reader has to access dst to get needed
> > information (rt_iif & rt_spec_dst) and must release dst reference.
> >
> > We also forced a dst reference if skb was put in socket backlog, even
> > without IP_PKTINFO handling. This happens under stress/load.
> >
> > We can instead store the needed information in skb->cb[], so that only
> > softirq handler really access dst, improving cache hit ratios.
> >
> > This removes two atomic operations per packet, and false sharing as
> > well.
> >
> > On a benchmark using a mono-threaded receiver (doing only recvmsg()
> > calls), I can reach 720,000 pps instead of 570,000 pps.
> >
> > IP_PKTINFO is typically used by DNS servers, and any multihomed aware
> > UDP application.
> >
> > Signed-off-by: Eric Dumazet <eric.d...@gmail.com>
>
> Looks good, if it compiles I'll push it out to net-next :-)

Argh :( I'll cross my fingers :)

BTW, on my bnx2x adapter, even small UDP frames use more than PAGE_SIZE
bytes:

skb->truesize=4352 len=26 (payload only)

Truesize being more precise now, we badly hit the shared
udp_memory_allocated, even with single frames.

I wonder if we shouldn't increase SK_MEM_QUANTUM a bit to avoid
ping/pong...

-#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
+#define SK_MEM_QUANTUM ((int)PAGE_SIZE * 2)

Eric Dumazet

Nov 9, 2011, 7:29:00 PM
to David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org, Eilon Greenstein
On Wednesday, November 9, 2011 at 23:03 +0100, Eric Dumazet wrote:

> BTW, on my bnx2x adapter, even small UDP frames use more than PAGE_SIZE
> bytes:
>
> skb->truesize=4352 len=26 (payload only)
>
> I wonder if we shouldn't increase SK_MEM_QUANTUM a bit to avoid
> ping/pong...
>
> -#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
> +#define SK_MEM_QUANTUM ((int)PAGE_SIZE * 2)
>

The following patch also helps a lot, even with only two CPUs (one handling
device interrupts, one running the application thread).

[PATCH net-next] bnx2x: reduce skb truesize by ~50%

bnx2x uses the following formula to compute its rx_buf_sz:

dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8

Then the core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
skb_shared_info)).

Final allocated size for the skb head on x86_64 (L1_CACHE_BYTES = 64,
MTU=1500): 2112 bytes, which SLUB/SLAB rounds up to 4096 bytes.

Since skb truesize is then bigger than SK_MEM_QUANTUM, we get a lot of
false sharing because of memory reclaim in the UDP stack.

One possible way to halve truesize is to lower the need by 64 bytes (2112
-> 2048 bytes).

This way, skb->truesize is lower than SK_MEM_QUANTUM and we get better
performance.

(760,000 pps on an RX UDP mono-threaded benchmark, instead of 720,000 pps)


Signed-off-by: Eric Dumazet <eric.d...@gmail.com>
CC: Eilon Greenstein <eil...@broadcom.com>
---
drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index aec7212..ebbdc55 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -1185,9 +1185,14 @@ struct bnx2x {
#define ETH_MAX_PACKET_SIZE 1500
#define ETH_MAX_JUMBO_PACKET_SIZE 9600

- /* Max supported alignment is 256 (8 shift) */
-#define BNX2X_RX_ALIGN_SHIFT ((L1_CACHE_SHIFT < 8) ? \
- L1_CACHE_SHIFT : 8)
+/* Max supported alignment is 256 (8 shift)
+ * It should ideally be min(L1_CACHE_SHIFT, 8)
+ * Choosing 5 (32 bytes) permits to get skb heads of 2048 bytes
+ * instead of 4096 bytes.
+ * With SLUB/SLAB allocators, data will be cache line aligned anyway.
+ */
+#define BNX2X_RX_ALIGN_SHIFT 5
+
/* FW use 2 Cache lines Alignment for start packet and size */
#define BNX2X_FW_RX_ALIGN (2 << BNX2X_RX_ALIGN_SHIFT)
#define BNX2X_PXP_DRAM_ALIGN (BNX2X_RX_ALIGN_SHIFT - 5)

Eilon Greenstein

Nov 10, 2011, 10:05:26 AM
to Eric Dumazet, David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org

Hi Eric,

This can seriously hurt PCI utilization, so in scenarios where the
PCI bus is the bottleneck you will see performance degradation. We are
looking at alternatives to reduce the allocation, but it is taking a
while. Please hold off on this patch.

Thanks,
Eilon

Eric Dumazet

Nov 10, 2011, 10:27:58 AM
to eil...@broadcom.com, David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org

What do you mean exactly?

This patch doesn't change skb->data alignment; it is still 64-byte
aligned (cqe_fp->placement_offset == 2). PCI utilization is the same.

Only SLOB could get a misalignment, but who uses SLOB for performance?

An alternative would be to check why the hardware needs 2*L1_CACHE_BYTES of
extra room for alignment... Normally it could be 1*L1_CACHE_BYTES?

/* FW use 2 Cache lines Alignment for start packet and size */

-#define BNX2X_FW_RX_ALIGN (2 << BNX2X_RX_ALIGN_SHIFT)
+#define BNX2X_FW_RX_ALIGN (1 << BNX2X_RX_ALIGN_SHIFT)

Eilon Greenstein

Nov 10, 2011, 11:27:03 AM
to Eric Dumazet, David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org

Obviously you are right... But the FW is configured to the wrong
alignment, and that will affect the end alignment (padding), which is
significant in small-packet scenarios where the PCI bus is the bottleneck.

> An alternative would be to check why the hardware needs 2*L1_CACHE_BYTES of
> extra room for alignment... Normally it could be 1*L1_CACHE_BYTES?

Again, you are a mind reader :) This is what we are looking into right
now. The problem is that *if* the buffer is not aligned (SLOB), we can
overstep the allocated boundaries by configuring the FW to align.

Eric Dumazet

Nov 10, 2011, 11:45:12 AM
to eil...@broadcom.com, David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org

Yes, I fully understand.

>
> > An alternative would be to check why the hardware needs 2*L1_CACHE_BYTES of
> > extra room for alignment... Normally it could be 1*L1_CACHE_BYTES?
>
> Again, you are a mind reader :) This is what we are looking into right
> now. The problem is that *if* the buffer is not aligned (SLOB), we can
> overstep the allocated boundaries by configuring the FW to align.
>
> > /* FW use 2 Cache lines Alignment for start packet and size */
> > -#define BNX2X_FW_RX_ALIGN (2 << BNX2X_RX_ALIGN_SHIFT)
> > +#define BNX2X_FW_RX_ALIGN (1 << BNX2X_RX_ALIGN_SHIFT)
> >
> >

I did a SLOB test (with my patch included as well):

skb->len=66 pad=26 skb->data=0xffff8801194da048 truesize=2304

So skb->data + pad -> 0xffff8801194da062: a 32-byte alignment + 2
bytes to align the IP header. (BTW, we don't really need that; NET_IP_ALIGN
is now 0 on most x86 platforms?)

In the end, we get 98 bytes of skb reserve, and also 64 bytes of extra
headroom _after_ the end of the full frame.

In my understanding, the hardware alignment should be between 0 and 63, not
0 and 127.

Eilon Greenstein

Nov 13, 2011, 1:53:39 PM
to Eric Dumazet, David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org

I'm not sure I'm following the math here. Assuming L1 is 64 bytes,
we need up to 63 bytes to align the start address (assuming SLOB is
being used) and an additional (up to) 63 bytes at the end. That can sum up
to 126 bytes; am I missing something?

> So maybe only BNX2X_FW_RX_ALIGN is twice the needed amount.

I agree that it does not make much sense to optimize for SLOB. After
checking with our FW expert, it seems that we can change the FW to have
two different configuration flags: one for start-address alignment and one
for end-of-packet padding. This way, we can set only the end padding and add
only 64 bytes. The only downside is that the FW team is preoccupied, so
this new FW will be ready for submission only in about a month.

Thanks,
Eilon

Eric Dumazet

Nov 13, 2011, 2:42:18 PM
to eil...@broadcom.com, David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org
On Sunday, November 13, 2011 at 20:53 +0200, Eilon Greenstein wrote:

> I'm not sure I'm following the math here. Assuming L1 is 64 bytes,
> we need up to 63 bytes to align the start address (assuming SLOB is
> being used) and an additional (up to) 63 bytes at the end. That can sum up
> to 126 bytes; am I missing something?
>

What do you really mean by aligning the end?

How can both the start and the end of a frame be aligned?

If the hardware needs extra room after the end of the frame, then we already
have it (since we store struct skb_shared_info there).

I ran the following patch and everything is fine here:

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index aec7212..ddc94cc 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -1188,8 +1188,8 @@ struct bnx2x {

 /* Max supported alignment is 256 (8 shift) */
 #define BNX2X_RX_ALIGN_SHIFT ((L1_CACHE_SHIFT < 8) ? \
 L1_CACHE_SHIFT : 8)
-/* FW use 2 Cache lines Alignment for start packet and size */
-#define BNX2X_FW_RX_ALIGN (2 << BNX2X_RX_ALIGN_SHIFT)
+/* FW use Cache line Alignment for start packet and size */
+#define BNX2X_FW_RX_ALIGN (1 << BNX2X_RX_ALIGN_SHIFT)

 #define BNX2X_PXP_DRAM_ALIGN (BNX2X_RX_ALIGN_SHIFT - 5)

 struct host_sp_status_block *def_status_blk;

Eilon Greenstein

Nov 13, 2011, 3:08:15 PM
to Eric Dumazet, David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org
On Sun, 2011-11-13 at 11:42 -0800, Eric Dumazet wrote:
> On Sunday, November 13, 2011 at 20:53 +0200, Eilon Greenstein wrote:
>
> > I'm not sure I'm following the math here. Assuming L1 is 64 bytes,
> > we need up to 63 bytes to align the start address (assuming SLOB is
> > being used) and an additional (up to) 63 bytes at the end. That can sum up
> > to 126 bytes; am I missing something?
> >
>
> What do you really mean by aligning the end?

I mean padding it out to a full cache line.

> How can both the start and the end of a frame be aligned?

The packet will start at an aligned address and (using padding) will end
at a cache line boundary.

> If the hardware needs extra room after the end of the frame, then we already
> have it (since we store struct skb_shared_info there)

We have some space in there, but as far as I can tell it's not up to 63
bytes, right? We would overrun the dataref.

Thanks,
Eilon

Eric Dumazet

Nov 13, 2011, 5:00:57 PM
to eil...@broadcom.com, David Miller, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org
On Sunday, November 13, 2011 at 22:08 +0200, Eilon Greenstein wrote:
> On Sun, 2011-11-13 at 11:42 -0800, Eric Dumazet wrote:
> > On Sunday, November 13, 2011 at 20:53 +0200, Eilon Greenstein wrote:
> >
> > > I'm not sure I'm following the math here. Assuming L1 is 64 bytes,
> > > we need up to 63 bytes to align the start address (assuming SLOB is
> > > being used) and an additional (up to) 63 bytes at the end. That can sum up
> > > to 126 bytes; am I missing something?
> > >
> >
> > What do you really mean by aligning the end?
>
> I mean padding it out to a full cache line.
>
> > How can both the start and the end of a frame be aligned?
>
> The packet will start at an aligned address and (using padding) will end
> at a cache line boundary.
>

OK, so the hardware adds up to 63 bytes of padding at the end of the packet.

> > If the hardware needs extra room after the end of the frame, then we already
> > have it (since we store struct skb_shared_info there)
>
> We have some space in there, but as far as I can tell it's not up to 63
> bytes, right? We would overrun the dataref.
>

OK, then we need to use build_skb() for this driver :)

http://lists.openwall.net/netdev/2011/07/11/19

This way, we build the skb_shared_info content _after_ the frame has been
delivered by the device.
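
For readers unfamiliar with it, a minimal sketch of the receive pattern
build_skb() enables, using the single-argument signature from the linked
proposal (later kernels take a second frag_size argument); buffer names and
sizes are illustrative, not actual bnx2x code:

/* driver RX path sketch */
void *data = kmalloc(RX_BUF_SIZE, GFP_ATOMIC);  /* raw buffer, no skb yet */

/* ... map `data` for DMA and let the NIC write the frame into it ... */

struct sk_buff *skb = build_skb(data);  /* wraps the buffer in an skb and
                                         * initializes skb_shared_info only
                                         * now, after DMA has completed */
if (unlikely(!skb)) {
        kfree(data);
        return;
}
skb_reserve(skb, pad);          /* skip the hardware placement offset */
skb_put(skb, frame_len);        /* mark the received payload as used */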

David Miller

Nov 14, 2011, 12:08:11 AM
to eric.d...@gmail.com, eil...@broadcom.com, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org
From: Eric Dumazet <eric.d...@gmail.com>
Date: Sun, 13 Nov 2011 23:00:57 +0100

> OK, then we need to use build_skb() for this driver :)
>
> http://lists.openwall.net/netdev/2011/07/11/19
>
> This way, we build the skb_shared_info content _after_ the frame has been
> delivered by the device.

I fully support bringing this thing back to life :-)

Eric Dumazet

Nov 14, 2011, 1:25:45 AM
to David Miller, eil...@broadcom.com, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org
On Monday, November 14, 2011 at 00:08 -0500, David Miller wrote:

> I fully support bringing this thing back to life :-)

I'll run extensive tests today and provide two patches when ready, with
all performance results.

Some prefetch() calls will be removed, since build_skb() already provides
a cache-hot skb.

Thanks

Eric Dumazet

Nov 14, 2011, 10:57:45 AM
to David Miller, eil...@broadcom.com, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org, Thomas Graf, Tom Herbert, Jamal Hadi Salim, Stephen Hemminger
On Monday, November 14, 2011 at 07:25 +0100, Eric Dumazet wrote:
> On Monday, November 14, 2011 at 00:08 -0500, David Miller wrote:
>
> > I fully support bringing this thing back to life :-)
>
> I'll run extensive tests today and provide two patches when ready, with
> all performance results.
>
> Some prefetch() calls will be removed, since build_skb() already provides
> a cache-hot skb.

Impressive results:

before: 720,000 pps
after: 820,000 pps

[ One mono-threaded application receiving UDP messages on a single
socket, asking for IP_PKTINFO ancillary info ]

Latencies are also a bit improved: the softirq handler dirties about 320
fewer bytes per skb.

Definitely worth the pain.

I am sending two patches. Other drivers can probably benefit from
build_skb() as well.

[PATCH net-next 1/2] net: introduce build_skb()
[PATCH net-next 2/2] bnx2x: uses build_skb() in receive path

David Miller

Nov 14, 2011, 2:21:17 PM
to eric.d...@gmail.com, eil...@broadcom.com, bhutc...@solarflare.com, pstas...@itcare.pl, net...@vger.kernel.org, tg...@infradead.org, ther...@google.com, ha...@mojatatu.com, shemm...@vyatta.com
From: Eric Dumazet <eric.d...@gmail.com>
Date: Mon, 14 Nov 2011 16:57:45 +0100

> On Monday, November 14, 2011 at 07:25 +0100, Eric Dumazet wrote:
>> On Monday, November 14, 2011 at 00:08 -0500, David Miller wrote:

>>
>> > I fully support bringing this thing back to life :-)
>>
>> I'll run extensive tests today and provide two patches when ready, with
>> all performance results.
>>
>> Some prefetch() calls will be removed, since build_skb() already provides
>> a cache-hot skb.
>
> Impressive results:
>
> before: 720,000 pps
> after: 820,000 pps
>
> [ One mono-threaded application receiving UDP messages on a single
> socket, asking for IP_PKTINFO ancillary info ]

Sweeeeeet.
