
FreeBSD 10G forwarding performance @Intel


Alexander V. Chernikov

Jul 3, 2012, 12:11:14 PM
Hello list!

I'm quite stuck with bad forwarding performance on many FreeBSD boxes
doing firewalling.

A typical configuration is an E5645 / E5675 CPU with an Intel 82599 NIC.
HT is turned off.
(Configs and tunables below).

I'm mostly concerned with unidirectional traffic flowing to a single
interface (i.e. using a single route entry).

In most cases the system can forward no more than 700 (or 1400) kpps,
which is quite a bad number (Linux does, say, 5 Mpps on nearly the same hardware).


Test scenario:

Ixia XM2 (traffic generator) <> ix0 (FreeBSD).

Ixia sends 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to
destinations in vlan11 (10.100.1.128 - 10.100.1.192).

Static arps are configured for all destination addresses.

Traffic level is slightly above or slightly below system performance.


================= Test 1 =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
no firewall

Traffic: 1-1 flow (1 src, 1 dst)
(This is actually a bit different from what is described above.)

Result:
input (ix0) output
packets errs idrops bytes packets errs bytes colls
878k 48k 0 59M 878k 0 56M 0
874k 48k 0 59M 874k 0 56M 0
875k 48k 0 59M 875k 0 56M 0

16:41 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf " %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
STATE C TIME CPU COMMAND
CPU6 6 17:28 100.00% kernel{ix0 que}
CPU9 9 20:42 60.06% intr{irq265: ix0:que

16:41 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0 500796 167
irq257: ix0:que 1 6693573 2245
irq258: ix0:que 2 2572380 862
irq259: ix0:que 3 3166273 1062
irq260: ix0:que 4 9691706 3251
irq261: ix0:que 5 10766434 3611
irq262: ix0:que 6 8933774 2996
irq263: ix0:que 7 5246879 1760
irq264: ix0:que 8 3548930 1190
irq265: ix0:que 9 11817986 3964
irq266: ix0:que 10 227561 76
irq267: ix0:link 1 0

Note that the system is using 2 cores to forward, so 12 cores should be able
to forward 4+ Mpps, which is more or less consistent with the Linux results.
Note that interrupt rates on all queues appear to be capped at the same
level (as far as I understand, since AIM is turned off and the interrupt
rates are the same as in the previous test). Additionally, despite
hw.intr_storm_threshold = 200k, I'm constantly getting the
interrupt storm detected on "irq265:"; throttling interrupt source
message.


================= Test 2 =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
no firewall

Traffic: Unidirectional many-2-many

16:20 [0] test15# netstat -I ix0 -hw 1
input (ix0) output
packets errs idrops bytes packets errs bytes colls
507k 651k 0 74M 508k 0 32M 0
506k 652k 0 74M 507k 0 28M 0
509k 652k 0 74M 508k 0 37M 0


16:28 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf " %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
STATE C TIME CPU COMMAND
CPU10 6 0:40 100.00% kernel{ix0 que}
CPU2 2 11:47 84.86% intr{irq258: ix0:que
CPU3 3 11:50 81.88% intr{irq259: ix0:que
CPU8 8 11:38 77.69% intr{irq264: ix0:que
CPU7 7 11:24 77.10% intr{irq263: ix0:que
WAIT 1 10:10 74.76% intr{irq257: ix0:que
CPU4 4 8:57 63.48% intr{irq260: ix0:que
CPU6 6 8:35 61.96% intr{irq262: ix0:que
CPU9 9 14:01 60.79% intr{irq265: ix0:que
RUN 0 9:07 59.67% intr{irq256: ix0:que
WAIT 5 6:13 43.26% intr{irq261: ix0:que
CPU11 11 5:19 35.89% kernel{ix0 que}
- 4 3:41 25.49% kernel{ix0 que}
- 1 3:22 21.78% kernel{ix0 que}
- 1 2:55 17.68% kernel{ix0 que}
- 4 2:24 16.55% kernel{ix0 que}
- 1 9:54 14.99% kernel{ix0 que}
CPU0 11 2:13 14.26% kernel{ix0 que}


16:07 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0 13654 15
irq257: ix0:que 1 87043 96
irq258: ix0:que 2 39604 44
irq259: ix0:que 3 48308 53
irq260: ix0:que 4 138002 153
irq261: ix0:que 5 169596 188
irq262: ix0:que 6 107679 119
irq263: ix0:que 7 72769 81
irq264: ix0:que 8 30878 34
irq265: ix0:que 9 1002032 1115
irq266: ix0:que 10 10967 12
irq267: ix0:link 1 0


Note that all cores are loaded more or less evenly, but the result is
_worse_. The first reason for this is the mtx_lock, which is acquired twice
on every lookup (once in in_matroute(), where it can probably be removed,
and once again in rtalloc1_fib(); the latter is addressed by andre@ in
r234650).
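To illustrate, this is roughly what every forwarded packet pays in the
stock lookup path (a simplified sketch from my reading of net/route.c,
not the exact code):

#include <sys/param.h>
#include <sys/socket.h>
#include <net/radix.h>
#include <net/route.h>

static struct rtentry *
lookup_sketch(struct sockaddr *dst, u_int fibnum)
{
	struct radix_node_head *rnh;
	struct radix_node *rn;
	struct rtentry *rt = NULL;

	rnh = rt_tables_get_rnh(fibnum, dst->sa_family);
	RADIX_NODE_HEAD_RLOCK(rnh);	/* shared rwlock, scales reasonably */
	rn = rnh->rnh_matchaddr(dst, rnh);
	if (rn != NULL && (rn->rn_flags & RNF_ROOT) == 0) {
		rt = (struct rtentry *)rn;
		RT_LOCK(rt);		/* per-rtentry mutex: every CPU that
					 * hits the same route serializes here */
		RT_ADDREF(rt);
		RT_UNLOCK(rt);
	}
	RADIX_NODE_HEAD_RUNLOCK(rnh);
	return (rt);			/* RTFREE() takes the mutex again later */
}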

Additionally, although the ithreads are each bound to a single CPU, the
kernel 'ix0 que' threads are not in the stock setup. However, a
configuration with 5 queues and 5 kernel threads bound to different CPUs
gives the same bad results.

================= Test 3 =======================
Kernel: FreeBSD-8-S June 4 SVN, +merged ifaddrlock, stock drivers, stock
routing, no FLOWTABLE, no firewall


packets errs idrops bytes packets errs bytes colls
580k 18k 0 38M 579k 0 37M 0
581k 26k 0 39M 580k 0 37M 0
580k 24k 0 39M 580k 0 37M 0
................
Enabling ipfw _increases_ performance a bit:

604k 0 0 39M 604k 0 39M 0
604k 0 0 39M 604k 0 39M 0
582k 19k 0 38M 568k 0 37M 0
527k 81k 0 39M 530k 0 34M 0
605k 28 0 39M 605k 0 39M 0


================= Test 3.1 =======================

Same as test 3, the only difference is the following:
route add -net 10.100.1.160/27 -iface vlan11.

input (ix0) output
packets errs idrops bytes packets errs bytes colls
543k 879k 0 91M 544k 0 35M 0
547k 870k 0 91M 545k 0 35M 0
541k 870k 0 91M 539k 0 30M 0
952k 565k 0 97M 962k 0 48M 0
1.2M 228k 0 91M 1.2M 0 92M 0
1.2M 226k 0 90M 1.1M 0 76M 0
1.1M 228k 0 91M 1.2M 0 76M 0
1.2M 233k 0 90M 1.2M 0 76M 0

================= Test 3.2 =======================

Same as test 3, splitting the destination into 4 smaller routes:
route add -net 10.100.1.128/28 -iface vlan11
route add -net 10.100.1.144/28 -iface vlan11
route add -net 10.100.1.160/28 -iface vlan11
route add -net 10.100.1.176/28 -iface vlan11

input (ix0) output
packets errs idrops bytes packets errs bytes colls
1.4M 0 0 106M 1.6M 0 106M 0
1.8M 0 0 106M 1.6M 0 71M 0
1.6M 0 0 106M 1.6M 0 71M 0
1.6M 0 0 87M 1.6M 0 71M 0
1.6M 0 0 126M 1.6M 0 212M 0

================= Test 3.3 =======================

Same as test 3, splitting the destination into 16 smaller routes:
input (ix0) output
packets errs idrops bytes packets errs bytes colls
1.6M 0 0 118M 1.8M 0 118M 0
2.0M 0 0 118M 1.8M 0 119M 0
1.8M 0 0 119M 1.8M 0 79M 0
1.8M 0 0 117M 1.8M 0 157M 0


================= Test 4 =======================
Kernel: FreeBSD-8-S June 4 SVN, stock drivers, routing patch 1, no
FLOWTABLE, no firewall

input (ix0) output
packets errs idrops bytes packets errs bytes colls
1.8M 0 0 114M 1.9M 0 114M 0
1.7M 0 0 114M 1.7M 0 114M 0
1.8M 0 0 114M 1.8M 0 114M 0
1.7M 0 0 114M 1.7M 0 114M 0
1.8M 0 0 114M 1.8M 0 74M 0
1.5M 0 0 114M 1.8M 0 74M 0
2M 0 0 114M 1.8M 0 194M 0


Patch 1 totally eliminates the mtx_lock from the fastforwarding path, to
get an idea of how much performance we can achieve. The result is nearly
the same as in 3.3.

================= Test 4.1 =======================

Same as test 4, same traffic level, with the firewall enabled and a single
allow rule (evaluating RLOCK performance).

22:35 [0] test15# netstat -I ix0 -hw 1
input (ix0) output
packets errs idrops bytes packets errs bytes colls
1.8M 149k 0 114M 1.6M 0 142M 0
1.4M 148k 0 85M 1.6M 0 104M 0
1.8M 149k 0 143M 1.6M 0 104M 0
1.6M 151k 0 114M 1.6M 0 104M 0
1.6M 151k 0 114M 1.6M 0 104M 0
1.4M 152k 0 114M 1.6M 0 104M 0

I.e., roughly a 10% performance loss.
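(For reference, the cost measured here is essentially the read-lock bracket
around the rule walk in ipfw_chk(), using the IPFW_RLOCK()/IPFW_RUNLOCK()
macros from ip_fw_private.h -- a simplified sketch, not the exact code:

static int
ipfw_chk_sketch(struct ip_fw_chain *chain)
{
	int action = IP_FW_PASS;

	IPFW_RLOCK(chain);	/* shared rwlock on the rule chain,
				 * taken once per packet */
	/* ... walk the rules, match, update rule counters ... */
	IPFW_RUNLOCK(chain);
	return (action);
}
)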


================= Test 4.2 =======================

Same as test 4, playing with the number of queues.

5 queues, same traffic level:
1.5M 225k 0 114M 1.5M 0 99M 0

================= Test 4.3 =======================

Same as test 4, HT on, number of queues = 16

input (ix0) output
packets errs idrops bytes packets errs bytes colls
2.4M 0 0 157M 2.4M 0 156M 0
2.4M 0 0 156M 2.4M 0 157M 0

However, enabling the firewall immediately drops the rate to 1.9 Mpps, which
is nearly the same as 4.1 (and a complicated fw ruleset would probably
overwhelm the HT cores much faster).

================= Test 4.4 =======================

Same as test 4, kernel ix0 que TX threads bound to specific CPUs
(corresponding to the RX queues):
18:02 [0] test15# procstat -ak | grep ix0 | sort -nk 2
12 100045 intr irq256: ix0:que <running>
0 100046 kernel ix0 que <running>
12 100047 intr irq257: ix0:que <running>
0 100048 kernel ix0 que mi_switch sleepq_wait
msleep_spin taskqueue_thread_loop fork_exit fork_trampoline
12 100049 intr irq258: ix0:que <running>
..

test15# for i in `jot 12 0`; do cpuset -l $i -t $((100046+2*$i)); done

Result:
input (ix0) output
packets errs idrops bytes packets errs bytes colls
2.1M 0 0 139M 2M 0 193M 0
2.1M 0 0 139M 2.3M 0 139M 0
2.1M 0 0 139M 2.1M 0 85M 0
2.1M 0 0 139M 2.1M 0 193M 0

Quite a considerable increase; however, this works well only for uniform
traffic distribution.


================= Test 5 =======================
Same as test 4, make radix use rmlock (r234648, r234649).

Result: 1.7 Mpps.


================= Test 6 =======================
Same as test 4 + FLOWTABLE

Result: 1.7 Mpps.


================= Test 7 =======================
Same as test 4, build with GCC 4.7

Result: No performance gain


Further investigations:

================= Test 8 =======================
Test 4 setup with a kernel built with LOCK_PROFILING.

17:46 [0] test15# sysctl debug.lock.prof.enable=1 ; sleep 2 ; sysctl
debug.lock.prof.enable=0

920k 0 0 59M 920k 0 59M 0
875k 0 0 59M 920k 0 59M 0
628k 0 0 39M 566k 0 45M 0
79k 2.7M 0 186M 57k 0 6.5M 0
71k 878k 0 61M 73k 0 4.0M 0
891k 254k 0 72M 917k 0 54M 0
920k 0 0 59M 920k 0 59M 0


When lock profiling is enabled, forwarding performance goes down to 60 kpps.
It was enabled for 2 seconds (so roughly 130k packets were forwarded);
results are attached as a separate file. Several hundred lock contentions
in ixgbe, that's all.

================= Test 9 =======================
Same as test 4 setup with hwpmc.
Results attached.

================= Test 10 =======================
Kernel: FreeBSD-9-S.
No major difference.


Some (my) preliminary conclusions:
1) the rte mtx_lock should (and can) be eliminated from the stock kernel
(and it can be done more or less easily for in_matroute()).
2) the rmlock vs rwlock performance difference is insignificant (maybe
because of 3)).
3) there is lock contention between the ixgbe taskq threads and the
ithreads. I'm not sure the taskq threads are necessary in the packet
forwarding case (as opposed to traffic generation); rough sketch below.
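My rough understanding of the ithread/taskqueue split in the driver
(simplified from my reading of ixgbe.c; the helper names below are made up,
details may be off):

/*
 * Per MSI-X queue interrupt, roughly:
 *  - the "irqNNN: ix0:que" ithread runs the handler below;
 *  - if more work remains than the process limit allows, it hands off to
 *    the per-queue taskqueue, i.e. the "ix0 que" kernel thread seen in top;
 *  - both paths take the same RX/TX ring locks, hence the contention.
 */
static void
ixgbe_msix_que_sketch(void *arg)
{
	struct ix_queue *que = arg;		/* per-queue softc */
	int more;

	more = rxtx_process_sketch(que);	/* made-up name: rxeof/txeof
						 * under the ring locks */
	if (more)
		taskqueue_enqueue(que->tq, &que->que_task);
	else
		reenable_queue_intr_sketch(que);	/* made-up name */
}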


Maybe I'm missing something else? (L2 cache misses or other things.)

What else can I do to debug this further?



Relevant files:
http://static.ipfw.ru/files/fbsd10g/0001-no-rt-mutex.patch
http://static.ipfw.ru/files/fbsd10g/kernel.gprof.txt
http://static.ipfw.ru/files/fbsd10g/prof_stats.txt

============= CONFIGS ====================

sysctl.conf:
kern.ipc.maxsockbuf=33554432
net.inet.udp.maxdgram=65535
net.inet.udp.recvspace=16777216
net.inet.tcp.sendbuf_auto=0
net.inet.tcp.recvbuf_auto=0
net.inet.tcp.sendspace=16777216
net.inet.tcp.recvspace=16777216
net.inet.ip.maxfragsperpacket=64


kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0


net.inet.ip.forwarding=1
net.inet.ip.fastforwarding=1
net.inet.ip.redirect=0

hw.intr_storm_threshold=20000

loader.conf:
kern.ipc.nmbclusters="512000"
ixgbe_load="YES"
hw.ixgbe.rx_process_limit="300"
hw.ixgbe.nojumbobuf="1"
hw.ixgbe.max_loop="100"
hw.ixgbe.max_interrupt_rate="20000"
hw.ixgbe.num_queues="11"


hw.ixgbe.txd=4096
hw.ixgbe.rxd=4096

kern.hwpmc.nbuffers=2048

debug.debugger_on_panic=1
net.inet.ip.fw.default_to_accept=1


kernel:
cpu HAMMER

ident CORE_RELENG_7
options COMPAT_IA32

makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols

options SCHED_ULE # ULE scheduler
options PREEMPTION # Enable kernel thread preemption
options INET # InterNETworking
options INET6 # IPv6 communications protocols
options SCTP # Stream Control Transmission Protocol
options FFS # Berkeley Fast Filesystem
options SOFTUPDATES # Enable FFS soft updates support
options UFS_ACL # Support for access control lists
options UFS_DIRHASH # Improve performance on big directories
options UFS_GJOURNAL # Enable gjournal-based UFS journaling
options MD_ROOT # MD is a potential root device
options PROCFS # Process filesystem (requires PSEUDOFS)
options PSEUDOFS # Pseudo-filesystem framework
options GEOM_PART_GPT # GUID Partition Tables.
options GEOM_LABEL # Provides labelization
options COMPAT_43TTY # BSD 4.3 TTY compat [KEEP THIS!]
options COMPAT_FREEBSD4 # Compatible with FreeBSD4
options COMPAT_FREEBSD5 # Compatible with FreeBSD5
options COMPAT_FREEBSD6 # Compatible with FreeBSD6
options COMPAT_FREEBSD7 # Compatible with FreeBSD7
options COMPAT_FREEBSD32
options SCSI_DELAY=4000 # Delay (in ms) before probing SCSI
options KTRACE # ktrace(1) support
options STACK # stack(9) support
options SYSVSHM # SYSV-style shared memory
options SYSVMSG # SYSV-style message queues
options SYSVSEM # SYSV-style semaphores
options _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options KBD_INSTALL_CDEV # install a CDEV entry in /dev
options AUDIT # Security event auditing
options HWPMC_HOOKS
options GEOM_MIRROR
options MROUTING
options PRINTF_BUFR_SIZE=100

# To make an SMP kernel, the next two lines are needed
options SMP # Symmetric MultiProcessor Kernel

# CPU frequency control
device cpufreq

# Bus support.
device acpi
device pci

device ada
device ahci

# SCSI Controllers
device ahd # AHA39320/29320 and onboard AIC79xx devices
options AHD_REG_PRETTY_PRINT # Print register bitfields in debug
# output. Adds ~215k to driver.
device mpt # LSI-Logic MPT-Fusion
# SCSI peripherals
device scbus # SCSI bus (required for SCSI)
device da # Direct Access (disks)
device pass # Passthrough device (direct SCSI access)
device ses # SCSI Environmental Services (and SAF-TE)

# RAID controllers
device mfi # LSI MegaRAID SAS

# atkbdc0 controls both the keyboard and the PS/2 mouse
device atkbdc # AT keyboard controller
device atkbd # AT keyboard
device psm # PS/2 mouse

device kbdmux # keyboard multiplexer

device vga # VGA video card driver

device splash # Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device sc

device agp # support several AGP chipsets

## Power management support (see NOTES for more options)
#device apm
## Add suspend/resume support for the i8254.
#device pmtimer

# Serial (COM) ports
#device sio # 8250, 16[45]50 based serial ports
device uart # Generic UART driver

# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to sio, uart and/or ppc drivers):
#device puc

# PCI Ethernet NICs.
device em # Intel PRO/1000 adapter Gigabit Ethernet Card
device bce
#device ixgb # Intel PRO/10GbE Ethernet Card
#device ixgbe

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device miibus # MII bus support

# Pseudo devices.
device loop # Network loopback
device random # Entropy device
device ether # Ethernet support
device pty # Pseudo-ttys (telnet etc)
device md # Memory "disks"
device firmware # firmware assist module
device lagg

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device bpf # Berkeley packet filter

# USB support
device uhci # UHCI PCI->USB interface
device ohci # OHCI PCI->USB interface
device ehci # EHCI PCI->USB interface (USB 2.0)
device usb # USB Bus (required)
#device udbp # USB Double Bulk Pipe devices
device uhid # "Human Interface Devices"
device ukbd # Keyboard
device umass # Disks/Mass storage - Requires scbus and da
device ums # Mouse
# USB Serial devices
device ucom # Generic com ttys


options INCLUDE_CONFIG_FILE

options KDB
options KDB_UNATTENDED
options DDB
options ALT_BREAK_TO_DEBUGGER

options IPFIREWALL #firewall
options IPFIREWALL_FORWARD #packet destination changes
options IPFIREWALL_VERBOSE #print information about
# dropped packets
options IPFIREWALL_VERBOSE_LIMIT=10000 #limit verbosity

# MRT support
options ROUTETABLES=16

device vlan #VLAN support

# Size of the kernel message buffer. Should be N * pagesize.
options MSGBUF_SIZE=4096000


options SW_WATCHDOG
options PANIC_REBOOT_WAIT_TIME=4

#
# Hardware watchdog timers:
#
# ichwd: Intel ICH watchdog timer
#
#device ichwd

device smbus
device ichsmb
device ipmi




--
WBR, Alexander


Luigi Rizzo

Jul 3, 2012, 12:55:06 PM
On Tue, Jul 03, 2012 at 08:11:14PM +0400, Alexander V. Chernikov wrote:
> Hello list!
>
> I'm quite stuck with bad forwarding performance on many FreeBSD boxes
> doing firewalling.
...
> In most cases system can forward no more than 700 (or 1400) kpps which
> is quite a bad number (Linux does, say, 5MPPs on nearly the same hardware).

among the many interesting tests you have run, i am curious
if you have tried to remove the update of the counters on route
entries. They might be another severe contention point.

cheers
luigi

Alexander V. Chernikov

Jul 3, 2012, 1:37:38 PM
On 03.07.2012 20:55, Luigi Rizzo wrote:
> On Tue, Jul 03, 2012 at 08:11:14PM +0400, Alexander V. Chernikov wrote:
>> Hello list!
>>
>> I'm quite stuck with bad forwarding performance on many FreeBSD boxes
>> doing firewalling.
> ...
>> In most cases system can forward no more than 700 (or 1400) kpps which
>> is quite a bad number (Linux does, say, 5MPPs on nearly the same hardware).
>
> among the many interesting tests you have run, i am curious
> if you have tried to remove the update of the counters on route
> entries. They might be another severe contention point.

21:47 [0] m@test15 netstat -I ix0 -w 1
input (ix0) output
packets errs idrops bytes packets errs bytes colls
1785514 52785 0 121318340 1784650 0 117874854 0
1773126 52437 0 120701470 1772977 0 117584736 0
1781948 52154 0 121060126 1778271 0 75029554 0
1786169 52982 0 121451160 1787312 0 160967392 0
21:47 [0] test15# sysctl net.rt_count=0
net.rt_count: 1 -> 0
1814465 22546 0 121302076 1814291 0 76860092 0
1817769 14272 0 120984922 1816254 0 163643534 0
1815311 13113 0 120831970 1815340 0 120159118 0
1814059 13698 0 120799132 1813738 0 120172092 0
1818030 13513 0 120960140 1814578 0 120332662 0
1814169 14351 0 120836182 1814003 0 120164310 0

Thanks, another good point. I forgot to merge this option from andre's
patch.

Another 30-50 kpps gained.


+u_int rt_count = 1;
+SYSCTL_INT(_net, OID_AUTO, rt_count, CTLFLAG_RW, &rt_count, 1, "");

@@ -601,17 +625,20 @@ passout:
if (error != 0)
IPSTAT_INC(ips_odropped);
else {
- ro.ro_rt->rt_rmx.rmx_pksent++;
+ if (rt_count)
+ ro.ro_rt->rt_rmx.rmx_pksent++;
IPSTAT_INC(ips_forward);
IPSTAT_INC(ips_fastforward);


>
> cheers
> luigi
>


--
WBR, Alexander

Luigi Rizzo

Jul 3, 2012, 4:27:57 PM
On Tue, Jul 03, 2012 at 09:37:38PM +0400, Alexander V. Chernikov wrote:
...
> Thanks, another good point. I forgot to merge this option from andre's
> patch.
>
> Another 30-40-50kpps to win.

not much gain though.
What about the other IPSTAT_INC counters ?
I think the IPSTAT_INC macros were introduced (by rwatson ?)
following a discussion on how to make the counters per-cpu
and avoid the contention on cache lines.
But they are still implemented as a single instance,
and neither volatile nor atomic, so it is not even clear
that they can give reliable results, let alone the fact
that you are likely to get some cache misses.

the relevant macro is in ip_var.h.
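From memory it boils down to something like this (please double-check
ip_var.h):

/* roughly the current definition: one shared struct per vnet, all CPUs */
#define	IPSTAT_ADD(name, val)	V_ipstat.name += (val)
#define	IPSTAT_INC(name)	IPSTAT_ADD(name, 1)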

Cheers
luigi

Alexander V. Chernikov

Jul 3, 2012, 4:31:56 PM
On 04.07.2012 00:27, Luigi Rizzo wrote:
> On Tue, Jul 03, 2012 at 09:37:38PM +0400, Alexander V. Chernikov wrote:
> ...
>> Thanks, another good point. I forgot to merge this option from andre's
>> patch.
>>
>> Another 30-40-50kpps to win.
>
> not much gain though.
> What about the other IPSTAT_INC counters ?
Well, we should then remove all such counters (total, forwarded) and
per-interface statistics (at least for forwarded packets).
> I think the IPSTAT_INC macros were introduced (by rwatson ?)
> following a discussion on how to make the counters per-cpu
> and avoid the contention on cache lines.
> But they are still implemented as a single instance,
> and neither volatile nor atomic, so it is not even clear
> that they can give reliable results, let alone the fact
> that you are likely to get some cache misses.
>
> the relevant macro is in ip_var.h.
Hm. This seems to be just a per-vnet structure instance.
We've got some more real DPCPU stuff (sys/pcpu.h && kern/subr_pcpu.c)
which could be used for the global ipstat structure; however, since it is
allocated from a single static area with no way to free it, we can't use
it for per-interface counters.
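For a quick idea of what the static DPCPU area gives us, a global per-CPU
counter would look something like this (just a sketch of the API, not a
patch):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/pcpu.h>
#include <sys/smp.h>

DPCPU_DEFINE(uint64_t, fwd_pkts);	/* one copy per CPU */

static __inline void
fwd_pkts_inc(void)
{
	uint64_t *p = DPCPU_PTR(fwd_pkts);	/* this CPU's copy only */

	(*p)++;					/* no lock, no shared cache line */
}

static uint64_t
fwd_pkts_total(void)			/* e.g. for a sysctl handler */
{
	uint64_t total = 0;
	u_int i;

	for (i = 0; i <= mp_maxid; i++) {
		if (CPU_ABSENT(i))
			continue;
		total += DPCPU_ID_GET(i, fwd_pkts);
	}
	return (total);
}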

I'll try to run tests without any possibly contended counters and report
the results on Thursday.
>
> Cheers
> luigi
>
>>
>> +u_int rt_count = 1;
>> +SYSCTL_INT(_net, OID_AUTO, rt_count, CTLFLAG_RW,&rt_count, 1, "");

Luigi Rizzo

Jul 3, 2012, 5:28:16 PM
On Wed, Jul 04, 2012 at 12:31:56AM +0400, Alexander V. Chernikov wrote:
> On 04.07.2012 00:27, Luigi Rizzo wrote:
> >On Tue, Jul 03, 2012 at 09:37:38PM +0400, Alexander V. Chernikov wrote:
> >...
> >>Thanks, another good point. I forgot to merge this option from andre's
> >>patch.
> >>
> >>Another 30-40-50kpps to win.
> >
> >not much gain though.
> >What about the other IPSTAT_INC counters ?
> Well, we should then remove all such counters (total, forwarded) and
> per-interface statistics (at least for forwarded packets).

I am not saying to remove them for good, but at least have a
try at what we can hope to save by implementing them
on a per-cpu basis.

There is a chance that one will not
see big gains until the majority of such shared counters
are fixed (there are probably 3-4 at least on the non-error
path for forwarded packets), plus the per-interface ones
that are not even wrapped in macros (see if_ethersubr.c).

> >I think the IPSTAT_INC macros were introduced (by rwatson ?)
> >following a discussion on how to make the counters per-cpu
> >and avoid the contention on cache lines.
> >But they are still implemented as a single instance,
> >and neither volatile nor atomic, so it is not even clear
> >that they can give reliable results, let alone the fact
> >that you are likely to get some cache misses.
> >
> >the relevant macro is in ip_var.h.
> Hm. This seems to be just per-vnet structure instance.

yes but essentially they are still shared by all threads within a vnet
(besides you probably ran your tests in the main instance)

> We've got some more real DPCPU stuff (sys/pcpu.h && kern/subr_pcpu.c)
> which can be used for global ipstat structure, however since it is
> allocated from single area without possibility to free we can't use it
> for per-interface counters.

yes, those should be moved to a private, dynamically allocated
region of the ifnet (the number of CPUs is known at driver init
time, i hope). But again, for a quick test, disabling the
if_{i|o}{bytes|packets} updates should do the job, if you can count
the received rate by some other means.

> I'll try to run tests without any possibly contested counters and report
> the results on Thursday.

great, that would be really useful info.

cheers
luigi

Doug Barton

Jul 3, 2012, 5:19:06 PM
Just curious ... what's the MTU on your FreeBSD box, and the Linux box?

(also, please don't cross-post to so many lists) :)

Doug

Luigi Rizzo

Jul 3, 2012, 5:44:19 PM
On Tue, Jul 03, 2012 at 02:19:06PM -0700, Doug Barton wrote:
> Just curious ... what's the MTU on your FreeBSD box, and the Linux box?

he is (correctly) using min-sized packets, and counting packets not bps.

cheers
luigi

Doug Barton

Jul 3, 2012, 5:29:28 PM
On 07/03/2012 14:44, Luigi Rizzo wrote:
> On Tue, Jul 03, 2012 at 02:19:06PM -0700, Doug Barton wrote:
>> Just curious ... what's the MTU on your FreeBSD box, and the Linux box?
>
> he is (correctly) using min-sized packets, and counting packets not bps.

Yes, I know. That wasn't what I asked.


--

This .signature sanitized for your protection

Alexander V. Chernikov

Jul 4, 2012, 2:29:24 AM
On 04.07.2012 01:29, Doug Barton wrote:
> On 07/03/2012 14:44, Luigi Rizzo wrote:
>> On Tue, Jul 03, 2012 at 02:19:06PM -0700, Doug Barton wrote:
>>> Just curious ... what's the MTU on your FreeBSD box, and the Linux box?
>>
>> he is (correctly) using min-sized packets, and counting packets not bps.
In this particular setup - 1500. You probably mean the type of mbufs
that are allocated by the ixgbe driver?
>
> Yes, I know. That wasn't what I asked.
>
>

Doug Barton

Jul 4, 2012, 4:13:08 AM
On 07/03/2012 23:29, Alexander V. Chernikov wrote:
> On 04.07.2012 01:29, Doug Barton wrote:
>>>> Just curious ... what's the MTU on your FreeBSD box, and the Linux box?
>
> In this particular setup - 1500. You're probably meaning type of mbufs
> which are allocated by ixgbe driver?

1500 for both?

And no, I'm not thinking of the mbufs directly, although that may be a
side effect. I've seen cases on FreeBSD with em where setting the MTU to
9000 had unexpected (albeit pleasant) side effects on throughput vs.
system load. Since it was working better I didn't take the time to find
out why. However since you're obviously interested in finding out the
nitty-gritty details (and thank you for that) you might want to give it
a look, and a few test runs.

hth,

Doug

--

This .signature sanitized for your protection


Alexander V. Chernikov

Jul 4, 2012, 4:46:09 AM
On 04.07.2012 12:13, Doug Barton wrote:
> On 07/03/2012 23:29, Alexander V. Chernikov wrote:
>> On 04.07.2012 01:29, Doug Barton wrote:
>>>>> Just curious ... what's the MTU on your FreeBSD box, and the Linux box?
>>
>> In this particular setup - 1500. You're probably meaning type of mbufs
>> which are allocated by ixgbe driver?
>
> 1500 for both?
Well, AFAIR it was 1500. We did a variety of tests half a year ago
with a similar server and both Intel and Mellanox equipment. Test results
vary from 4 to 6 Mpps in different setups (and Mellanox seems to behave
better on Linux). If you're particularly interested in exact Linux
performance on exactly the same box, I can try to do this, possibly next week.

My point is actually the following:
it is possible to do line-rate 10G (14.88 Mpps) forwarding with currently
market-available hardware. Linux is going that way and is much closer to it
than we are. Even DragonFly performs _much_ better than we do at routing.

http://shader.kaist.edu/packetshader/ (and links there) are good example
of what is going on.

>
> And no, I'm not thinking of the mbufs directly, although that may be a
> side effect. I've seen cases on FreeBSD with em where setting the MTU to
> 9000 had unexpected (albeit pleasant) side effects on throughput vs.
Yes. The stock drivers have this problem, especially with IPv6 addresses.
We actually use our own versions of the em/igb/ixgbe drivers in production,
which are free from several problems present in the stock drivers.

(The tests, however, were done using the stock driver.)
> system load. Since it was working better I didn't take the time to find
> out why. However since you're obviously interested in finding out the
> nitty-gritty details (and thank you for that) you might want to give it
> a look, and a few test runs.
>
> hth,
>
> Doug
>

Luigi Rizzo

Jul 4, 2012, 5:12:41 AM
On Wed, Jul 04, 2012 at 12:46:09PM +0400, Alexander V. Chernikov wrote:
> On 04.07.2012 12:13, Doug Barton wrote:
> >On 07/03/2012 23:29, Alexander V. Chernikov wrote:
> >>On 04.07.2012 01:29, Doug Barton wrote:
> >>>>>Just curious ... what's the MTU on your FreeBSD box, and the Linux box?
> >>
> >>In this particular setup - 1500. You're probably meaning type of mbufs
> >>which are allocated by ixgbe driver?
> >
> >1500 for both?
> Well, AFAIR it was 1500. We've done a variety of tests half a year ago
> with similar server and Intel and Mellanox equipment. Test results vary
> from 4 to 6mpps in different setups (and mellanox seems to behave better
> on Linux). If you're particularly interested in exact Linux performance
> on exactly the same box I can try to do this possibly next week.
>
> My point actually is the following:
> It is possible to do linerate 10G (14.8mpps) forwarding with current
> market-available hardware. Linux is going that way and it is much more
> close than we do. Even dragonfly performs _much_ better than we do in
> routing.
>
> http://shader.kaist.edu/packetshader/ (and links there) are good example
> of what is going on.

Alex,
i am sure you are aware that in FreeBSD we have netmap too

http://info.iet.unipi.it/~luigi/netmap/

which is probably a lot more usable than packetshader
(hw independent, included in the OS, also works on linux...)

cheers
luigi

Alexander V. Chernikov

Jul 4, 2012, 5:54:01 AM
On 04.07.2012 13:12, Luigi Rizzo wrote:
> Alex,
> i am sure you are aware that in FreeBSD we have netmap too
Yes, I'm aware of that :)

> which is probably a lot more usable than packetshader
> (hw independent, included in the OS, also works on linux...)
I'm actually not talking about usability or a comparison here :). They
have a nice idea and nice performance graphs. And PacketShader is actually
a _platform_, with fast packet delivery being one (and the only open) part
of that platform.

Their graphs show 40 Mpps (27G with 64-byte packets) CPU-only IPv4 packet
forwarding on "two four-core Intel Nehalem CPUs (2.66GHz)", which
illustrates the possibilities of software routing quite clearly.

Luigi Rizzo

Jul 4, 2012, 11:48:56 AM
On Wed, Jul 04, 2012 at 01:54:01PM +0400, Alexander V. Chernikov wrote:
> On 04.07.2012 13:12, Luigi Rizzo wrote:
> >Alex,
> >i am sure you are aware that in FreeBSD we have netmap too
> Yes, I'm aware of that :)
>
> >which is probably a lot more usable than packetshader
> >(hw independent, included in the OS, also works on linux...)
> I'm actually not talking about usability and comparison here :). Thay
> have nice idea and nice performance graphs. And packetshader is actually
> _platform_ with fast packet delivery being one (and the only open) part
> of platform.

i am not sure if i should read the above as a feature or a limitation :)

>
> Their graphs shows 40MPPS (27G/64byte) CPU-only IPv4 packet forwarding
> on "two four-core Intel Nehalem CPUs (2.66GHz)" which illustrates
> software routing possibilities quite clearly.

i suggest to be cautious about graphs in papers (including mine) and
rely on numbers you can reproduce yourself.
As your nice experiments showed (i especially liked when you moved
from one /24 to four /28 routes), at these speeds a factor
of 2 or more in throughput can easily arise from tiny changes
in configurations, bus, memory and CPU speeds, and so on.

Lev Serebryakov

Jul 4, 2012, 3:37:05 PM
Hello, Alexander.
You wrote on July 4, 2012, 12:46:09:


AVC> http://shader.kaist.edu/packetshader/ (and links there) are good example
AVC> of what is going on.
But HOW?! A GPU has a very high "preparation" and data transfer cost;
how can it be used for such small pieces of data as 1.5-9K
datagrams?!

--
// Black Lion AKA Lev Serebryakov <l...@FreeBSD.org>

Alexander V. Chernikov

Jul 4, 2012, 3:59:35 PM
On 04.07.2012 23:37, Lev Serebryakov wrote:
> Hello, Alexander.
> You wrote on July 4, 2012, 12:46:09:
>
>
> AVC> http://shader.kaist.edu/packetshader/ (and links there) are good example
> AVC> of what is going on.
> But HOW?! GPU has very high "preparation" and data transfer cost,
> how it could be used for such small packets of data, as 1.5-9K
> datagrams?!
According to
http://www.ndsl.kaist.edu/~kyoungsoo/papers/packetshader.pdf, the
cumulative dispatch latency is between 3.8 and 4.1 microseconds (section 2.2).
And the GPU is doing the routing lookup only (at least for IPv4/IPv6
forwarding), so we're always transferring/receiving a fixed amount of data.

Btw, there are exact hardware specifications in this document.

Alexander V. Chernikov

Jul 5, 2012, 9:40:37 AM
On 04.07.2012 19:48, Luigi Rizzo wrote:
> On Wed, Jul 04, 2012 at 01:54:01PM +0400, Alexander V. Chernikov wrote:
>> On 04.07.2012 13:12, Luigi Rizzo wrote:
>>> Alex,
>>> i am sure you are aware that in FreeBSD we have netmap too
>> Yes, I'm aware of that :)
>>
>>> which is probably a lot more usable than packetshader
>>> (hw independent, included in the OS, also works on linux...)
>> I'm actually not talking about usability and comparison here :). Thay
>> have nice idea and nice performance graphs. And packetshader is actually
>> _platform_ with fast packet delivery being one (and the only open) part
>> of platform.
>
> i am not sure if i should read the above as a feature or a limitation :)
I'm not trying to compare their i/o code with netmap implementation :)
>
>>
>> Their graphs shows 40MPPS (27G/64byte) CPU-only IPv4 packet forwarding
>> on "two four-core Intel Nehalem CPUs (2.66GHz)" which illustrates
>> software routing possibilities quite clearly.
>
> i suggest to be cautious about graphs in papers (including mine) and
> rely on numbers you can reproduce yourself.
Yup. Of course. However, even if we divide their number by 4, there
is still a huge gap.
> As your nice experiments showed (i especially liked when you moved
> from one /24 to four /28 routes), at these speeds a factor
> of 2 or more in throughput can easily arise from tiny changes
> in configurations, bus, memory and CPU speeds, and so on.

Traffic stats with as many counters as possible eliminated
(there is an option in the ixgbe code to update the rx/tx packet counters
only once per rx_process_limit packets, which is 100 by default):

input (ix0) output
packets errs idrops bytes packets errs bytes colls
2.8M 0 0 186M 2.8M 0 186M 0
2.8M 0 0 187M 2.8M 0 186M 0

And it seems that netstat uses 1024 as the divisor (no HN_DIVISOR_1000 is
passed to show_stat() in if.c), so the real frame count on the Ixia side is
much closer to 3 Mpps ("2.8M" here is 2.8 * 1024 * 1024 ~= 2.94M pps, and
Ixia reports ~2.9616 Mpps).

This is wrong from my point of view and we should change it, at least
for the packet counts.

Here is the patch itself:
http://static.ipfw.ru/files/fbsd10g/no_ifcounters.diff


IPFW contention:
Same setup as shown above, same traffic level.

17:48 [0] test15# ipfw show
00100 0 0 allow ip from any to any
65535 0 0 deny ip from any to any

net.inet.ip.fw.enable: 0 -> 1
input (ix0) output
packets errs idrops bytes packets errs bytes colls
2.1M 734k 0 187M 2.1M 0 139M 0
2.1M 736k 0 187M 2.1M 0 139M 0
2.1M 737k 0 187M 2.1M 0 89M 0
2.1M 735k 0 187M 2.1M 0 189M 0
net.inet.ip.fw.update_counters: 1 -> 0
2.3M 636k 0 187M 2.3M 0 148M 0
2.5M 343k 0 187M 2.5M 0 164M 0
2.5M 351k 0 187M 2.5M 0 164M 0
2.5M 345k 0 187M 2.5M 0 164M 0


Patch here: http://static.ipfw.ru/files/fbsd10g/no_ipfw_counters.diff

It seems that the ipfw counters are suffering from this problem, too.
Unfortunately, there is no dynamic DPCPU allocator in our kernel.
I'm planning to make a very simple per-CPU counters patch:
allocate 65k * (u64_bytes + u64_packets) of memory for each CPU at vnet
instance init and make ipfw use it as the counter backend.

There is a problem with several rules residing in a single entry (sharing
one rule number). This can (probably) be worked around by using the fast
counters only for the first such rule (or not using fast counters for such
rules at all).
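Roughly like this (a sketch only: the names are made up, per-vnet handling
is omitted, and M_IPFW is the malloc type from ip_fw_private.h):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/pcpu.h>
#include <sys/smp.h>

struct ipfw_fast_cntr {
	uint64_t	pkts;
	uint64_t	bytes;
};

/* one 65536-slot array per CPU, indexed by rule number */
static struct ipfw_fast_cntr *ipfw_fast_cntrs[MAXCPU];

static void
ipfw_fast_cntr_init(void)
{
	u_int cpu;

	for (cpu = 0; cpu <= mp_maxid; cpu++) {
		if (CPU_ABSENT(cpu))
			continue;
		ipfw_fast_cntrs[cpu] = malloc(65536 *
		    sizeof(struct ipfw_fast_cntr), M_IPFW, M_WAITOK | M_ZERO);
	}
}

/* in ipfw_chk(), instead of f->pcnt++ / f->bcnt += pktlen: */
static __inline void
ipfw_fast_cntr_inc(uint16_t rulenum, u_int pktlen)
{
	struct ipfw_fast_cntr *c = &ipfw_fast_cntrs[curcpu][rulenum];

	c->pkts++;
	c->bytes += pktlen;
}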

What do you think about this?




>
> cheers
> luigi
>


--
WBR, Alexander

Luigi Rizzo

Jul 6, 2012, 2:11:26 AM
On Thu, Jul 05, 2012 at 05:40:37PM +0400, Alexander V. Chernikov wrote:
> On 04.07.2012 19:48, Luigi Rizzo wrote:
...
> Traffic stats with most possible counters eliminated:
> (there is a possibility in ixgbe code to update rx/tx packets once per
> rx_process_limit (which is 100 by default)):
>
> input (ix0) output
> packets errs idrops bytes packets errs bytes colls
> 2.8M 0 0 186M 2.8M 0 186M 0
> 2.8M 0 0 187M 2.8M 0 186M 0
>
> And it seems that netstat uses 1024 as divisor (no HN_DIVISOR_1000
> passed in if.c to show_stat), so real frame count from Ixia side is much
> closer to 3MPPS (~ 2.961600 ).
...
> IPFW contention:
> Same setup as shown upper, same traffic level
>
> 17:48 [0] test15# ipfw show
> 00100 0 0 allow ip from any to any
> 65535 0 0 deny ip from any to any
>
> net.inet.ip.fw.enable: 0 -> 1
> input (ix0) output
> packets errs idrops bytes packets errs bytes colls
> 2.1M 734k 0 187M 2.1M 0 139M 0
> 2.1M 736k 0 187M 2.1M 0 139M 0
> 2.1M 737k 0 187M 2.1M 0 89M 0
> 2.1M 735k 0 187M 2.1M 0 189M 0
> net.inet.ip.fw.update_counters: 1 -> 0
> 2.3M 636k 0 187M 2.3M 0 148M 0
> 2.5M 343k 0 187M 2.5M 0 164M 0
> 2.5M 351k 0 187M 2.5M 0 164M 0
> 2.5M 345k 0 187M 2.5M 0 164M 0
...
> It seems that ipfw counters are suffering from this problem, too.
> Unfortunately, there is no DPCPU allocator in our kernel.
> I'm planning to make a very simple per-cpu counters patch:
> (
> allocate 65k*(u64_bytes+u64_packets) memory for each CPU per vnet
> instance init and make ipfw use it as counter backend.
>
> There is a problem with several rules residing in single entry. This can
> (probably) be worked-around by using fast counters for the first such
> rule (or not using fast counters for such rules at all)
> )
>
> What do you think about this?

the thing discussed a few years ago (at least what i took away from the
discussion) was that the counter field in a rule should hold the
index of a per-cpu counter associated with the rule. So CTR_INC(rule->ctr)
becomes something like pcpu->ipfw_ctrs[rule->ctr]++.
When you create a new rule you also grab one free index from ipfw_ctrs[],
and the same should go for dummynet counters.
The alternative would be to allocate the rule with a set of counters
embedded in the rule itself, but that costs 64 bytes per core per rule
to avoid cache contention.

cheers
luigi