
ixl 40G bad performance?


Eggert, Lars

Oct 19, 2015, 9:53:05 AM
Hi,

I'm running a few simple tests on -CURRENT with a pair of dual-port Intel XL710 boards, which are seen by the kernel as:

ixl0: <Intel(R) Ethernet Connection XL710 Driver, Version - 1.4.3> mem 0xdc800000-0xdcffffff,0xdd808000-0xdd80ffff irq 32 at device 0.0 on pci3
ixl0: Using MSIX interrupts with 33 vectors
ixl0: f4.40 a1.4 n04.53 e80001dca
ixl0: Using defaults for TSO: 65518/35/2048
ixl0: Ethernet address: 68:05:ca:32:0b:98
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
ixl0: netmap queues/slots: TX 32/1024, RX 32/1024
ixl1: <Intel(R) Ethernet Connection XL710 Driver, Version - 1.4.3> mem 0xdc000000-0xdc7fffff,0xdd800000-0xdd807fff irq 32 at device 0.1 on pci3
ixl1: Using MSIX interrupts with 33 vectors
ixl1: f4.40 a1.4 n04.53 e80001dca
ixl1: Using defaults for TSO: 65518/35/2048
ixl1: Ethernet address: 68:05:ca:32:0b:99
ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
ixl1: netmap queues/slots: TX 32/1024, RX 32/1024
ixl0: link state changed to UP
ixl1: link state changed to UP

I have two identical machines connected with patch cables (no switch). iperf performance is bad:

# iperf -c 10.0.1.2
------------------------------------------------------------
Client connecting to 10.0.1.2, TCP port 5001
TCP window size: 32.5 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.1.1 port 19238 connected with 10.0.1.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 3.91 GBytes 3.36 Gbits/sec
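
The server side of this test, not shown above, is assumed to be the stock
iperf listener:

# hypothetical invocation on the peer (10.0.1.2); iperf listens on TCP 5001 by default
iperf -s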

As is flood ping latency:

# sudo ping -f 10.0.1.2
PING 10.0.1.2 (10.0.1.2): 56 data bytes
.^C
--- 10.0.1.2 ping statistics ---
41927 packets transmitted, 41926 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.084/0.116/0.145/0.002 ms

Any ideas on what's going on here? Testing 10G ix interfaces between the same two machines results in 9.39 Gbits/sec and flood ping latencies of 17 usec.

Thanks,
Lars

PS: Full dmesg attached.

Copyright (c) 1992-2015 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-CURRENT #2 483de3c(muclab)-dirty: Mon Oct 19 11:01:16 CEST 2015
el...@laurel.muccbc.hq.netapp.com:/usr/home/elars/obj/usr/home/elars/src/sys/MUCLAB amd64
FreeBSD clang version 3.7.0 (tags/RELEASE_370/final 246257) 20150906
VT(vga): resolution 640x480
CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2000.05-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x206d7 Family=0x6 Model=0x2d Stepping=7
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
AMD Features2=0x1<LAHF>
XSAVE Features=0x1<XSAVEOPT>
VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
TSC: P-state invariant, performance statistics
real memory = 137438953472 (131072 MB)
avail memory = 133484290048 (127300 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: < >
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 2 package(s) x 8 core(s) x 2 SMT threads
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
cpu2 (AP): APIC ID: 2
cpu3 (AP): APIC ID: 3
cpu4 (AP): APIC ID: 4
cpu5 (AP): APIC ID: 5
cpu6 (AP): APIC ID: 6
cpu7 (AP): APIC ID: 7
cpu8 (AP): APIC ID: 8
cpu9 (AP): APIC ID: 9
cpu10 (AP): APIC ID: 10
cpu11 (AP): APIC ID: 11
cpu12 (AP): APIC ID: 12
cpu13 (AP): APIC ID: 13
cpu14 (AP): APIC ID: 14
cpu15 (AP): APIC ID: 15
cpu16 (AP): APIC ID: 32
cpu17 (AP): APIC ID: 33
cpu18 (AP): APIC ID: 34
cpu19 (AP): APIC ID: 35
cpu20 (AP): APIC ID: 36
cpu21 (AP): APIC ID: 37
cpu22 (AP): APIC ID: 38
cpu23 (AP): APIC ID: 39
cpu24 (AP): APIC ID: 40
cpu25 (AP): APIC ID: 41
cpu26 (AP): APIC ID: 42
cpu27 (AP): APIC ID: 43
cpu28 (AP): APIC ID: 44
cpu29 (AP): APIC ID: 45
cpu30 (AP): APIC ID: 46
cpu31 (AP): APIC ID: 47
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-47 on motherboard
ioapic2 <Version 2.0> irqs 48-71 on motherboard
random: entropy device external interface
module_register_init: MOD_LOAD (vesa, 0xffffffff8094fb90, 0) error 19
netmap: loaded module
vtvga0: <VT VGA driver> on motherboard
smbios0: <System Management BIOS> at iomem 0xf04d0-0xf04ee on motherboard
smbios0: Version: 2.7, BCD Revision: 2.7
cryptosoft0: <software crypto> on motherboard
aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM> on motherboard
acpi0: <SUPERM SMCI--MB> on motherboard
acpi0: Power Button (fixed)
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
cpu4: <ACPI CPU> on acpi0
cpu5: <ACPI CPU> on acpi0
cpu6: <ACPI CPU> on acpi0
cpu7: <ACPI CPU> on acpi0
cpu8: <ACPI CPU> on acpi0
cpu9: <ACPI CPU> on acpi0
cpu10: <ACPI CPU> on acpi0
cpu11: <ACPI CPU> on acpi0
cpu12: <ACPI CPU> on acpi0
cpu13: <ACPI CPU> on acpi0
cpu14: <ACPI CPU> on acpi0
cpu15: <ACPI CPU> on acpi0
cpu16: <ACPI CPU> on acpi0
cpu17: <ACPI CPU> on acpi0
cpu18: <ACPI CPU> on acpi0
cpu19: <ACPI CPU> on acpi0
cpu20: <ACPI CPU> on acpi0
cpu21: <ACPI CPU> on acpi0
cpu22: <ACPI CPU> on acpi0
cpu23: <ACPI CPU> on acpi0
cpu24: <ACPI CPU> on acpi0
cpu25: <ACPI CPU> on acpi0
cpu26: <ACPI CPU> on acpi0
cpu27: <ACPI CPU> on acpi0
cpu28: <ACPI CPU> on acpi0
cpu29: <ACPI CPU> on acpi0
cpu30: <ACPI CPU> on acpi0
cpu31: <ACPI CPU> on acpi0
attimer0: <AT timer> port 0x40-0x43 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
Event timer "RTC" frequency 32768 Hz quality 0
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 350
Event timer "HPET1" frequency 14318180 Hz quality 340
Event timer "HPET2" frequency 14318180 Hz quality 340
Event timer "HPET3" frequency 14318180 Hz quality 340
Event timer "HPET4" frequency 14318180 Hz quality 340
Event timer "HPET5" frequency 14318180 Hz quality 340
Event timer "HPET6" frequency 14318180 Hz quality 340
Event timer "HPET7" frequency 14318180 Hz quality 340
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> irq 26 at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> irq 26 at device 1.1 on pci0
pci2: <ACPI PCI bus> on pcib2
igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.2> port 0x8020-0x803f mem 0xdf820000-0xdf83ffff,0xdf844000-0xdf847fff irq 27 at device 0.0 on pci2
igb0: Using MSIX interrupts with 9 vectors
igb0: Ethernet address: 00:25:90:9b:73:2e
igb0: Bound queue 0 to cpu 0
igb0: Bound queue 1 to cpu 1
igb0: Bound queue 2 to cpu 2
igb0: Bound queue 3 to cpu 3
igb0: Bound queue 4 to cpu 4
igb0: Bound queue 5 to cpu 5
igb0: Bound queue 6 to cpu 6
igb0: Bound queue 7 to cpu 7
igb0: netmap queues/slots: TX 8/1024, RX 8/1024
igb1: <Intel(R) PRO/1000 Network Connection, Version - 2.5.2> port 0x8000-0x801f mem 0xdf800000-0xdf81ffff,0xdf840000-0xdf843fff irq 30 at device 0.1 on pci2
igb1: Using MSIX interrupts with 9 vectors
igb1: Ethernet address: 00:25:90:9b:73:2f
igb1: Bound queue 0 to cpu 8
igb1: Bound queue 1 to cpu 9
igb1: Bound queue 2 to cpu 10
igb1: Bound queue 3 to cpu 11
igb1: Bound queue 4 to cpu 12
igb1: Bound queue 5 to cpu 13
igb1: Bound queue 6 to cpu 14
igb1: Bound queue 7 to cpu 15
igb1: netmap queues/slots: TX 8/1024, RX 8/1024
pcib3: <ACPI PCI-PCI bridge> irq 33 at device 2.0 on pci0
pci3: <ACPI PCI bus> on pcib3
pci3: <network, ethernet> at device 0.0 (no driver attached)
pci3: <network, ethernet> at device 0.1 (no driver attached)
pcib4: <ACPI PCI-PCI bridge> irq 33 at device 2.2 on pci0
pci4: <ACPI PCI bus> on pcib4
pcib5: <ACPI PCI-PCI bridge> irq 41 at device 3.0 on pci0
pci5: <ACPI PCI bus> on pcib5
pcib6: <ACPI PCI-PCI bridge> irq 41 at device 3.2 on pci0
pci6: <ACPI PCI bus> on pcib6
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.0> port 0x7020-0x703f mem 0xdf180000-0xdf1fffff,0xdf604000-0xdf607fff irq 42 at device 0.0 on pci6
ix0: Using MSIX interrupts with 33 vectors
ix0: Ethernet address: 90:e2:ba:77:d4:9c
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 32/2048, RX 32/2048
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 3.1.0> port 0x7000-0x701f mem 0xdf100000-0xdf17ffff,0xdf600000-0xdf603fff irq 45 at device 0.1 on pci6
ix1: Using MSIX interrupts with 33 vectors
ix1: Ethernet address: 90:e2:ba:77:d4:9d
ix1: PCI Express Bus: Speed 5.0GT/s Width x8
ix1: netmap queues/slots: TX 32/2048, RX 32/2048
pcib7: <ACPI PCI-PCI bridge> irq 16 at device 17.0 on pci0
pci7: <ACPI PCI bus> on pcib7
isci0: <Intel(R) C600 Series Chipset SAS Controller (SATA mode)> port 0x6000-0x60ff mem 0xde07c000-0xde07ffff,0xddc00000-0xddffffff irq 16 at device 0.0 on pci7
pci0: <simple comms> at device 22.0 (no driver attached)
pci0: <simple comms> at device 22.1 (no driver attached)
ehci0: <Intel Patsburg USB 2.0 controller> mem 0xdf923000-0xdf9233ff irq 16 at device 26.0 on pci0
usbus0: EHCI version 1.0
usbus0 on ehci0
pcib8: <ACPI PCI-PCI bridge> irq 17 at device 28.0 on pci0
pci8: <ACPI PCI bus> on pcib8
pcib9: <ACPI PCI-PCI bridge> irq 19 at device 28.7 on pci0
pci9: <ACPI PCI bus> on pcib9
pcib10: <PCI-PCI bridge> at device 0.0 on pci9
pci10: <PCI bus> on pcib10
pcib11: <PCI-PCI bridge> at device 0.0 on pci10
pci11: <PCI bus> on pcib11
pcib12: <PCI-PCI bridge> at device 0.0 on pci11
pci12: <PCI bus> on pcib12
vgapci0: <VGA-compatible display> mem 0xdb000000-0xdbffffff,0xdf000000-0xdf003fff,0xde800000-0xdeffffff irq 19 at device 0.0 on pci12
vgapci0: Boot video device
pcib13: <PCI-PCI bridge> at device 1.0 on pci10
pci13: <PCI bus> on pcib13
ehci1: <Intel Patsburg USB 2.0 controller> mem 0xdf922000-0xdf9223ff irq 23 at device 29.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci1
pcib14: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci14: <ACPI PCI bus> on pcib14
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
ahci0: <Intel Patsburg AHCI SATA controller> port 0x9050-0x9057,0x9040-0x9043,0x9030-0x9037,0x9020-0x9023,0x9000-0x901f mem 0xdf921000-0xdf9217ff irq 18 at device 31.2 on pci0
ahci0: AHCI v1.30 with 6 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ahcich4: <AHCI channel> at channel 4 on ahci0
ahcich5: <AHCI channel> at channel 5 on ahci0
ahciem0: <AHCI enclosure management bridge> on ahci0
ichsmb0: <Intel Patsburg SMBus controller> port 0x1180-0x119f mem 0xdf920000-0xdf9200ff irq 18 at device 31.3 on pci0
smbus0: <System Management Bus> on ichsmb0
smb0: <SMBus generic I/O> on smbus0
pcib15: <ACPI Host-PCI bridge> on acpi0
pci15: <ACPI PCI bus> on pcib15
pcib16: <ACPI Host-PCI bridge> on acpi0
pci16: <ACPI PCI bus> on pcib16
pcib17: <ACPI PCI-PCI bridge> irq 57 at device 2.0 on pci16
pci17: <ACPI PCI bus> on pcib17
pcib18: <ACPI PCI-PCI bridge> irq 64 at device 3.0 on pci16
pci18: <ACPI PCI bus> on pcib18
pcib19: <ACPI PCI-PCI bridge> irq 64 at device 3.2 on pci16
pci19: <ACPI PCI bus> on pcib19
pcib20: <ACPI Host-PCI bridge> on acpi0
pci20: <ACPI PCI bus> on pcib20
acpi_button0: <Power Button> on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: console (115200,n,8,1)
uart1: <16550 or compatible> port 0x2f8-0x2ff irq 3 on acpi0
ipmi0: <IPMI System Interface> port 0xca2,0xca3 on acpi0
ipmi0: KCS mode found at io 0xca2 on acpi
ichwd0 on isa0
ichwd0: ICH WDT present but disabled in BIOS or hardware
device_attach: ichwd0 attach returned 6
ichwd0 at port 0x430-0x437,0x460-0x47f on isa0
ichwd0: ICH WDT present but disabled in BIOS or hardware
device_attach: ichwd0 attach returned 6
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff on isa0
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
coretemp1: <CPU On-Die Thermal Sensors> on cpu1
est1: <Enhanced SpeedStep Frequency Control> on cpu1
coretemp2: <CPU On-Die Thermal Sensors> on cpu2
est2: <Enhanced SpeedStep Frequency Control> on cpu2
coretemp3: <CPU On-Die Thermal Sensors> on cpu3
est3: <Enhanced SpeedStep Frequency Control> on cpu3
coretemp4: <CPU On-Die Thermal Sensors> on cpu4
est4: <Enhanced SpeedStep Frequency Control> on cpu4
coretemp5: <CPU On-Die Thermal Sensors> on cpu5
est5: <Enhanced SpeedStep Frequency Control> on cpu5
coretemp6: <CPU On-Die Thermal Sensors> on cpu6
est6: <Enhanced SpeedStep Frequency Control> on cpu6
coretemp7: <CPU On-Die Thermal Sensors> on cpu7
est7: <Enhanced SpeedStep Frequency Control> on cpu7
coretemp8: <CPU On-Die Thermal Sensors> on cpu8
est8: <Enhanced SpeedStep Frequency Control> on cpu8
coretemp9: <CPU On-Die Thermal Sensors> on cpu9
est9: <Enhanced SpeedStep Frequency Control> on cpu9
coretemp10: <CPU On-Die Thermal Sensors> on cpu10
est10: <Enhanced SpeedStep Frequency Control> on cpu10
coretemp11: <CPU On-Die Thermal Sensors> on cpu11
est11: <Enhanced SpeedStep Frequency Control> on cpu11
coretemp12: <CPU On-Die Thermal Sensors> on cpu12
est12: <Enhanced SpeedStep Frequency Control> on cpu12
coretemp13: <CPU On-Die Thermal Sensors> on cpu13
est13: <Enhanced SpeedStep Frequency Control> on cpu13
coretemp14: <CPU On-Die Thermal Sensors> on cpu14
est14: <Enhanced SpeedStep Frequency Control> on cpu14
coretemp15: <CPU On-Die Thermal Sensors> on cpu15
est15: <Enhanced SpeedStep Frequency Control> on cpu15
coretemp16: <CPU On-Die Thermal Sensors> on cpu16
est16: <Enhanced SpeedStep Frequency Control> on cpu16
coretemp17: <CPU On-Die Thermal Sensors> on cpu17
est17: <Enhanced SpeedStep Frequency Control> on cpu17
coretemp18: <CPU On-Die Thermal Sensors> on cpu18
est18: <Enhanced SpeedStep Frequency Control> on cpu18
coretemp19: <CPU On-Die Thermal Sensors> on cpu19
est19: <Enhanced SpeedStep Frequency Control> on cpu19
coretemp20: <CPU On-Die Thermal Sensors> on cpu20
est20: <Enhanced SpeedStep Frequency Control> on cpu20
coretemp21: <CPU On-Die Thermal Sensors> on cpu21
est21: <Enhanced SpeedStep Frequency Control> on cpu21
coretemp22: <CPU On-Die Thermal Sensors> on cpu22
est22: <Enhanced SpeedStep Frequency Control> on cpu22
coretemp23: <CPU On-Die Thermal Sensors> on cpu23
est23: <Enhanced SpeedStep Frequency Control> on cpu23
coretemp24: <CPU On-Die Thermal Sensors> on cpu24
est24: <Enhanced SpeedStep Frequency Control> on cpu24
coretemp25: <CPU On-Die Thermal Sensors> on cpu25
est25: <Enhanced SpeedStep Frequency Control> on cpu25
coretemp26: <CPU On-Die Thermal Sensors> on cpu26
est26: <Enhanced SpeedStep Frequency Control> on cpu26
coretemp27: <CPU On-Die Thermal Sensors> on cpu27
est27: <Enhanced SpeedStep Frequency Control> on cpu27
coretemp28: <CPU On-Die Thermal Sensors> on cpu28
est28: <Enhanced SpeedStep Frequency Control> on cpu28
coretemp29: <CPU On-Die Thermal Sensors> on cpu29
est29: <Enhanced SpeedStep Frequency Control> on cpu29
coretemp30: <CPU On-Die Thermal Sensors> on cpu30
est30: <Enhanced SpeedStep Frequency Control> on cpu30
coretemp31: <CPU On-Die Thermal Sensors> on cpu31
est31: <Enhanced SpeedStep Frequency Control> on cpu31
fuse-freebsd: version 0.4.4, FUSE ABI 7.8
Timecounters tick every 1.000 msec
iw_cxgb: Chelsio T3 RDMA Driver loaded
IPsec: Initialized Security Association Processing.
ipfw2 (+ipv6) initialized, divert loadable, nat enabled, default to accept, logging disabled
DUMMYNET 0 with IPv6 initialized (100409)
load_dn_sched dn_sched FIFO loaded
load_dn_sched dn_sched PRIO loaded
load_dn_sched dn_sched QFQ loaded
load_dn_sched dn_sched RR loaded
load_dn_sched dn_sched WF2Q+ loaded
ipmi0: IPMI device rev. 1, firmware rev. 2.35, version 2.0
ipmi0: Number of channels 3
ipmi0: Attached watchdog
usbus0: 480Mbps High Speed USB v2.0
usbus1: 480Mbps High Speed USB v2.0
ugen0.1: <Intel> at usbus0
uhub0: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus0
ugen1.1: <Intel> at usbus1
uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
ses0 at ahciem0 bus 0 scbus7 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
ada0 at ahcich0 bus 0 scbus1 target 0 lun 0
ada0: <INTEL SSDSC2BW180A3F 400i> ACS-2 ATA SATA 3.x device
ada0: Serial Number CVCV3102050X180EGN
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 171705MB (351651888 512 byte sectors)
ada0: quirks=0x1<4K>
random: unblocking device.
Sending DHCP Discover packet from interface igb0 (00:25:90:9b:73:2e)
Sending DHCP Discover packet from interface igb1 (00:25:90:9b:73:2f)
Sending DHCP Discover packet from interface ix0 (90:e2:ba:77:d4:9c)
Sending DHCP Discover packet from interface ix1 (90:e2:ba:77:d4:9d)
ix0: link state changed to UP
ix1: link state changed to UP
uhub1: 2 ports with 2 removable, self powered
uhub0: 2 ports with 2 removable, self powered
ugen1.2: <vendor 0x8087> at usbus1
uhub2: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus1
ugen0.2: <vendor 0x8087> at usbus0
uhub3: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus0
uhub3: 6 ports with 6 removable, self powered
uhub2: 8 ports with 8 removable, self powered
ugen0.3: <American Megatrends Inc.> at usbus0
ukbd0: <Keyboard Interface> on usbus0
ums0: <Mouse Interface> on usbus0
ums0: 3 buttons and [Z] coordinates ID=0
igb0: link state changed to UP
Received DHCP Offer packet on igb0 from 192.168.0.2 (accepted) (no root path) (boot_file)
Received DHCP Offer packet on igb0 from 192.168.0.2 (ignored) (no root path) (boot_file)
Received DHCP Offer packet on igb0 from 192.168.0.2 (ignored) (no root path) (boot_file)
Sending DHCP Request packet from interface igb0 (00:25:90:9b:73:2e)
Received DHCP Ack packet on igb0 from 192.168.0.2 (accepted) (got root path)
DHCP timeout for interface igb1
DHCP timeout for interface ix0
DHCP timeout for interface ix1
Wired loader interface (IP 192.168.11.1) is igb0
igb0 at 192.168.11.1 server 192.168.0.2 boot file /pxe/pxelinux.0
subnet mask 255.255.0.0 router 192.168.0.2 rootfs 192.168.0.10:/muclab/image/machines/phobos2 hostname phobos2
Adjusted interface igb0
Shutdown interface igb1
Shutdown interface ix0
ix0: link state changed to DOWN
Shutdown interface ix1
ix1: link state changed to DOWN
SMP: AP CPU #31 Launched!
SMP: AP CPU #10 Launched!
SMP: AP CPU #25 Launched!
SMP: AP CPU #14 Launched!
SMP: AP CPU #30 Launched!
SMP: AP CPU #12 Launched!
SMP: AP CPU #17 Launched!
SMP: AP CPU #7 Launched!
SMP: AP CPU #28 Launched!
SMP: AP CPU #13 Launched!
SMP: AP CPU #24 Launched!
SMP: AP CPU #6 Launched!
SMP: AP CPU #27 Launched!
SMP: AP CPU #8 Launched!
SMP: AP CPU #29 Launched!
SMP: AP CPU #11 Launched!
SMP: AP CPU #20 Launched!
SMP: AP CPU #15 Launched!
SMP: AP CPU #26 Launched!
SMP: AP CPU #9 Launched!
SMP: AP CPU #22 Launched!
SMP: AP CPU #5 Launched!
SMP: AP CPU #18 Launched!
SMP: AP CPU #4 Launched!
SMP: AP CPU #21 Launched!
SMP: AP CPU #16 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #23 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #19 Launched!
Timecounter "TSC" frequency 2000045308 Hz quality 1000
hwpmc: SOFT/16/64/0x67<INT,USR,SYS,REA,WRI> TSC/1/64/0x20<REA> IAP/4/48/0x3ff<INT,USR,SYS,EDG,THR,REA,WRI,INV,QUA,PRC> IAF/3/48/0x67<INT,USR,SYS,REA,WRI>
Trying to mount root from nfs: []...
NFS ROOT: 192.168.0.10:/muclab/image/machines/phobos2
ixl0: <Intel(R) Ethernet Connection XL710 Driver, Version - 1.4.3> mem 0xdc800000-0xdcffffff,0xdd808000-0xdd80ffff irq 32 at device 0.0 on pci3
ixl0: Using MSIX interrupts with 33 vectors
ixl0: f4.40 a1.4 n04.53 e80001dca
ixl0: Using defaults for TSO: 65518/35/2048
ixl0: Ethernet address: 68:05:ca:32:15:d0
ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
queues is 0xfffffe0008d03000
ixl0: netmap queues/slots: TX 32/1024, RX 32/1024
ixl1: <Intel(R) Ethernet Connection XL710 Driver, Version - 1.4.3> mem 0xdc000000-0xdc7fffff,0xdd800000-0xdd807fff irq 32 at device 0.1 on pci3
ixl1: Using MSIX interrupts with 33 vectors
ixl1: f4.40 a1.4 n04.53 e80001dca
ixl1: Using defaults for TSO: 65518/35/2048
ixl1: Ethernet address: 68:05:ca:32:15:d1
ixl1: PCI Express Bus: Speed 8.0GT/s Width x8
queues is 0xfffffe0009227000
ixl1: netmap queues/slots: TX 32/1024, RX 32/1024
ix0: link state changed to UP
ix1: link state changed to UP
ixl0: link state changed to UP
ixl1: link state changed to UP

Luigi Rizzo

Oct 19, 2015, 10:20:48 AM
I would look at the following:

- C-states and clock speed: make sure you never go below C1, and fix the
clock speed to the maximum. Sure, these parameters also affect the 10G
card, but there may be strange interactions that trigger the power-saving
modes in different ways.

- Interrupt moderation (may affect ping latency; I don't remember how it
is set in ixl, but it is probably a sysctl).

- Number of queues (32 is a lot; I wouldn't use more than 4-8). This may
affect CPU-socket affinity.

- TSO and flow director: I have seen bad effects from these accelerations,
so I would run the iperf test with all of these features disabled on both
sides, and then enable them one at a time.

- Queue sizes: the driver seems to use 1024 slots, which is about 1.5 MB
queued, which in turn means you have 300 us (and possibly half of that)
to drain the queue at 40 Gbit/s (arithmetic sketched below). 150-300 us
may seem an eternity, but if a couple of cores fall into C7 your budget
is gone, and the loss will trigger a retransmission, window halving, etc.
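
A back-of-the-envelope version of that budget, assuming 1024 descriptors
each holding one 1500-byte frame (actual ring occupancy depends on the MTU
and buffer sizes):

# bytes queued, divided by the 40 Gbit/s drain rate, in microseconds
echo "1024 * 1500 / (40 * 10^9 / 8) * 10^6" | bc -l    # ~307 us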

cheers
luigi
--
-----------------------------------------+-------------------------------
Prof. Luigi RIZZO, ri...@iet.unipi.it . Dip. di Ing. dell'Informazione
http://www.iet.unipi.it/~luigi/ . Universita` di Pisa
TEL +39-050-2217533 . via Diotisalvi 2
Mobile +39-338-6809875 . 56122 PISA (Italy)
-----------------------------------------+-------------------------------

Eggert, Lars

unread,
Oct 19, 2015, 11:05:21 AM10/19/15
to
Hi,

On 2015-10-19, at 16:20, Luigi Rizzo <ri...@iet.unipi.it> wrote:
>
> i would look at the following:
> - c states and clock speed - make sure you never go below C1,
> and fix the clock speed to max.
> Sure these parameters also affect the 10G card, but there
> may be strange interaction that trigger the power saving
> modes in different ways

I already have powerd_flags="-a max -b max -n max" in rc.conf, which I hope should be enough.

> - interrupt moderation (may affect ping latency,
> do not remember how it is set in ixl but probably a sysctl

ixl(4) describes two sysctls that sound like they control AIM, and they default to off:

hw.ixl.dynamic_tx_itr: 0
hw.ixl.dynamic_rx_itr: 0

> - number of queues (32 is a lot i wouldn't use more than 4-8),
> may affect cpu-socket affinity

With hw.ixl.max_queues=4 in loader.conf, performance is still unchanged.

> - tso and flow director - i have seen bad effects of
> accelerations so i would run the iperf test with
> of these features disabled on both sides, and then enable
> them one at a time

No change with "ifconfig -tso4 -tso6 -rxcsum -txcsum -lro".

How do I turn off flow director?

> - queue sizes - the driver seems to use 1024 slots which is
> about 1.5 MB queued, which in turn means you have 300us
> (and possibly half of that) to drain the queue at 40Gbit/s.
> 150-300us may seem an eternity, but if a couple of cores fall
> into c7 your budget is gone and the loss will trigger a
> retransmission and window halving etc.

Also no change with "hw.ixl.ringsz=256" in loader.conf.
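
For completeness, the loader.conf entries tried in this message (both
require a reboot to take effect):

# /boot/loader.conf
hw.ixl.max_queues=4
hw.ixl.ringsz=256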

This is really weird.

Lars
Luigi Rizzo

Oct 19, 2015, 11:11:41 AM
On Monday, October 19, 2015, Eggert, Lars <la...@netapp.com> wrote:

> Hi,
>
> On 2015-10-19, at 16:20, Luigi Rizzo <ri...@iet.unipi.it>
> wrote:
> >
> > i would look at the following:
> > - c states and clock speed - make sure you never go below C1,
> > and fix the clock speed to max.
> > Sure these parameters also affect the 10G card, but there
> > may be strange interaction that trigger the power saving
> > modes in different ways
>
> I already have powerd_flags="-a max -b max -n max" in rc.conf, which I
> hope should be enough.


I suspect it might not touch the C-states, but better check. The safest is
to disable them in the BIOS.


>
> > - interrupt moderation (may affect ping latency,
> > do not remember how it is set in ixl but probably a sysctl
>
> ixl(4) describes two sysctls that sound like they control AIM, and they
> default to off:
>
> hw.ixl.dynamic_tx_itr: 0
> hw.ixl.dynamic_rx_itr: 0
>
>
There must be some other control for the actual (fixed, not dynamic)
moderation.


> > - number of queues (32 is a lot i wouldn't use more than 4-8),
> > may affect cpu-socket affinity
>
> With hw.ixl.max_queues=4 in loader.conf, performance is still unchanged.
>
> > - tso and flow director - i have seen bad effects of
> > accelerations so i would run the iperf test with
> > of these features disabled on both sides, and then enable
> > them one at a time
>
> No change with "ifconfig -tso4 -tso6 -rxcsum -txcsum -lro".
>
> How do I turn off flow director?


I am not sure if it is enabled in FreeBSD. It is in Linux, and it almost
halves the packet rate with netmap (from 35 down to 19 Mpps).
Maybe it is not too bad for bulk TCP.


>
> > - queue sizes - the driver seems to use 1024 slots which is
> > about 1.5 MB queued, which in turn means you have 300us
> > (and possibly half of that) to drain the queue at 40Gbit/s.
> > 150-300us may seem an eternity, but if a couple of cores fall
> > into c7 your budget is gone and the loss will trigger a
> > retransmission and window halving etc.
>
> Also no change with "hw.ixl.ringsz=256" in loader.conf.


Any better success with 2048 slots?
3.5 Gbit/s is what I used to see on ixgbe with TSO disabled, probably
hitting a CPU limit.

Cheers
Luigi


> This is really weird.
>
> Lars
>


Eggert, Lars

Oct 19, 2015, 11:36:25 AM
Hi,

In order to eliminate network or hardware weirdness, I've rerun the test with Linux 4.3-rc6, where I get 13.1 Gbits/sec throughput and 52 usec flood ping latency. Not great either, but in line with earlier experiments with Mellanox NICs and an untuned Linux system.

On 2015-10-19, at 17:11, Luigi Rizzo <ri...@iet.unipi.it> wrote:
> I suspect it might not touch the c states, but better check. The safest is
> disable them in the bios.

I'll try that.

>> hw.ixl.dynamic_tx_itr: 0
>> hw.ixl.dynamic_rx_itr: 0
>>
>>
> There must be some other control for the actual (fixed, not dynamic)
> moderation.

The only other sysctls in ixl(4) that look relevant are:

hw.ixl.rx_itr
The RX interrupt rate value, set to 8K by default.

hw.ixl.tx_itr
The TX interrupt rate value, set to 4K by default.

I'll play with those.

>> Also no change with "hw.ixl.ringsz=256" in loader.conf.
>
> Any better success with 2048 slots?
> 3.5 gbit is what I used to see on the ixgbe with tso disabled, probably
> hitting a CPU bound.

Will try.

Thanks!

Lars
Luigi Rizzo

Oct 19, 2015, 11:56:37 AM
On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars <la...@netapp.com> wrote:
> Hi,
>
> in order to eliminate network or hardware weirdness, I've rerun the test with Linux 4.3rc6, where I get 13.1 Gbits/sec throughput and 52 usec flood ping latency. Not great either, but in line with earlier experiments with Mellanox NICs and an untuned Linux system.
>
...

>> There must be some other control for the actual (fixed, not dynamic)
>> moderation.
>
> The only other sysctls in ixl(4) that look relevant are:
>
> hw.ixl.rx_itr
> The RX interrupt rate value, set to 8K by default.
>
> hw.ixl.tx_itr
> The TX interrupt rate value, set to 4K by default.
>

Yes, those. Raise them to 20-50k and see what you get in
terms of ping latency.
Note that 4k on TX means you get to reclaim buffers
in the TX queue (unless it is done opportunistically)
every 250 us, which is dangerously close to the 300 us
capacity of the queue itself.

cheers
luigi

> I'll play with those.
>
>>> Also no change with "hw.ixl.ringsz=256" in loader.conf.
>>
>> Any better success with 2048 slots?
>> 3.5 gbit is what I used to see on the ixgbe with tso disabled, probably
>> hitting a CPU bound.
>
> Will try.
>
> Thanks!
>
> Lars



hiren panchasara

Oct 19, 2015, 12:47:57 PM
On 10/19/15 at 08:11P, Luigi Rizzo wrote:
> On Monday, October 19, 2015, Eggert, Lars <la...@netapp.com> wrote:
>
> >
> > How do I turn off flow director?
>
>
> I am not sure if it is enabled I'm FreeBSD. It is in linux and almost
> halves the pkt rate with netmap (from 35 down to 19mpps).
> Maybe it is not too bad for bulk TCP.
>

Flow director support is incomplete on FreeBSD and that's why it is
disabled by default.

Cheers,
Hiren

Kevin Oberman

Oct 20, 2015, 12:47:57 AM
On Mon, Oct 19, 2015 at 8:11 AM, Luigi Rizzo <ri...@iet.unipi.it> wrote:

> On Monday, October 19, 2015, Eggert, Lars <la...@netapp.com> wrote:
>
> > Hi,
> >
> > On 2015-10-19, at 16:20, Luigi Rizzo <ri...@iet.unipi.it>
> > wrote:
> > >
> > > i would look at the following:
> > > - c states and clock speed - make sure you never go below C1,
> > > and fix the clock speed to max.
> > > Sure these parameters also affect the 10G card, but there
> > > may be strange interaction that trigger the power saving
> > > modes in different ways
> >
> > I already have powerd_flags="-a max -b max -n max" in rc.conf, which I
> > hope should be enough.
>
>
> I suspect it might not touch the c states, but better check. The safest is
> disable them in the bios.
>

To disable C-States:
sysctl dev.cpu.0.cx_lowest=C1
--
Kevin Oberman, Part time kid herder and retired Network Engineer

Ian Smith

Oct 20, 2015, 4:25:14 AM
On Mon, 19 Oct 2015 21:47:36 -0700, Kevin Oberman wrote:
> > I suspect it might not touch the c states, but better check. The safest is
> > disable them in the bios.
> >
>
> To disable C-States:
> sysctl dev.cpu.0.cx_lowest=C1

Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead. Otherwise
you've only changed cpu.0; if you try it you should see that other CPUs
will have retained their previous C-state setting - up to 9.3 at least.

Setting performance_cx_lowest=C1 in rc.conf (and economy_cx_lowest=C1 on
laptops) performs that by setting hw.acpi.cpu.cx_lowest on boot (and on
every change to/from battery power) in power_profile via devd notifies.
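
A minimal sketch combining the runtime and boot-time settings mentioned
above (the knob names are the ones from this thread):

# apply immediately to all CPUs
sysctl hw.acpi.cpu.cx_lowest=C1
# verify that a representative core picked it up
sysctl dev.cpu.0.cx_lowest
# persist across reboots via power_profile
echo 'performance_cx_lowest="C1"' >> /etc/rc.conf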

cheers, Ian

Eggert, Lars

Oct 20, 2015, 7:04:00 AM
Hi,

On 2015-10-20, at 10:24, Ian Smith <smi...@nimnet.asn.au> wrote:
> Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.

Done.

On 2015-10-19, at 17:55, Luigi Rizzo <ri...@iet.unipi.it> wrote:
> On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars <la...@netapp.com> wrote:
>> The only other sysctls in ixl(4) that look relevant are:
>>
>> hw.ixl.rx_itr
>> The RX interrupt rate value, set to 8K by default.
>>
>> hw.ixl.tx_itr
>> The TX interrupt rate value, set to 4K by default.
>>
>
> yes those. raise to 20-50k and see what you get in
> terms of ping latency.

While ixl(4) talks about 8K and 4K, the defaults actually seem to be:

hw.ixl.tx_itr: 122
hw.ixl.rx_itr: 62

Doubling those values *increases* flood ping latency to ~200 usec (from ~116 usec).

Halving them to 62/31 decreases flood ping latency to ~50 usec, but still doesn't increase iperf throughput (still 2.8 Gb/s). Going to 31/16 further drops latency to 24 usec, with no change in throughput.

(Looking at the "interrupt Moderation parameters" #defines in sys/dev/ixl/ixl.h it seems that ixl likes to have its irq rates specified with some weird divider scheme.)

With 5/5 (which corresponds to IXL_ITR_100K), I get down to 16 usec. Unfortunately, throughput is then also down to about 2 Gb/s.
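
For reference, a loader.conf sketch of the 62/31 combination above,
assuming these can be set as boot-time tunables like the other hw.ixl
knobs used earlier:

# /boot/loader.conf -- halved interrupt moderation periods
hw.ixl.tx_itr=62
hw.ixl.rx_itr=31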

One thing I noticed in top is that one queue irq is using quite a bit of CPU when I run iperf:

11 0 -92 - 0K 1152K CPU2 2 0:19 50.98% intr{irq293: ixl1:q2}
11 0 -92 - 0K 1152K WAIT 3 0:02 5.18% intr{irq294: ixl1:q3}
0 0 -92 0 0K 8944K - 25 0:01 1.07% kernel{ixl1 que}
11 0 -92 - 0K 1152K WAIT 1 0:01 0.00% intr{irq292: ixl1:q1}
11 0 -92 - 0K 1152K WAIT 0 0:00 0.00% intr{irq291: ixl1:q0}
0 0 -92 0 0K 8944K - 22 0:00 0.00% kernel{ixl1 adminq}
0 0 -92 0 0K 8944K - 31 0:00 0.00% kernel{ixl1 que}
0 0 -92 0 0K 8944K - 31 0:00 0.00% kernel{ixl1 que}
0 0 -92 0 0K 8944K - 31 0:00 0.00% kernel{ixl1 que}
11 0 -92 - 0K 1152K WAIT -1 0:00 0.00% intr{irq290: ixl1:aq}

With 10G ix interfaces and a throughput of ~9Gb/s, the CPU load is much lower:

11 0 -92 - 0K 1152K WAIT 0 0:05 7.67% intr{irq274: ix0:que }
0 0 -92 0 0K 8944K - 27 0:00 0.29% kernel{ix0 que}
0 0 -92 0 0K 8944K - 10 0:00 0.00% kernel{ix0 linkq}
11 0 -92 - 0K 1152K WAIT 1 0:00 0.00% intr{irq275: ix0:que }
11 0 -92 - 0K 1152K WAIT 3 0:00 0.00% intr{irq277: ix0:que }
11 0 -92 - 0K 1152K WAIT 2 0:00 0.00% intr{irq276: ix0:que }
11 0 -92 - 0K 1152K WAIT 18 0:00 0.00% intr{irq278: ix0:link}
0 0 -92 0 0K 8944K - 0 0:00 0.00% kernel{ix0 que}
0 0 -92 0 0K 8944K - 0 0:00 0.00% kernel{ix0 que}
0 0 -92 0 0K 8944K - 0 0:00 0.00% kernel{ix0 que}

Lars
Bruce Evans

Oct 20, 2015, 10:51:32 AM
On Tue, 20 Oct 2015, Eggert, Lars wrote:

> Hi,
>
> On 2015-10-20, at 10:24, Ian Smith <smi...@nimnet.asn.au> wrote:
>> Actually, you want to set hw.acpi.cpu.cx_lowest=C1 instead.
>
> Done.
>
> On 2015-10-19, at 17:55, Luigi Rizzo <ri...@iet.unipi.it> wrote:
>> On Mon, Oct 19, 2015 at 8:34 AM, Eggert, Lars <la...@netapp.com> wrote:
>>> The only other sysctls in ixl(4) that look relevant are:
>>>
>>> hw.ixl.rx_itr
>>> The RX interrupt rate value, set to 8K by default.
>>>
>>> hw.ixl.tx_itr
>>> The TX interrupt rate value, set to 4K by default.
>>>
>>
>> yes those. raise to 20-50k and see what you get in
>> terms of ping latency.
>
> While ixl(4) talks about 8K and 4K, the defaults actually seem to be:
>
> hw.ixl.tx_itr: 122
> hw.ixl.rx_itr: 62

ixl seems to have a different set of itr sysctl bugs than em. In em,
122 for the itr means 125 initially, but it is documented (only by
sysctl -d, not by the man page) as having units usecs/4. The units
are actually usecs*4 except initially, and these units take effect if
you write the initial value back -- writing back 122 changes the active
period from 125 to 488. 122 instead of 125 is the result of confusion
between powers of 2 and powers of 10.

The first obvious bug in ixl is that the above sysctls are read-only
global tunables (not documented as sysctls of course), but you can
write them using per-device sysctls (dev.ixl.[0-N].*itr?). Writing
them for 1 device clobbers the globals and probably the settings for
all ixl devices.

sysctl -d doesn't say anything useful about ixl's itrs. It misdocuments
the units for all of them as being rates. Actually, the units for 2
of them are boolean and the units for the other 2 are periods. ixl(4)
uses better wording for the booleans but even worse wording for the
periods ("rate value"). em uses better wording for its itr sysctl but
em(4) has no documentation for any sysctl or its itr tunable. igb is
more like em than ixl here.

122 seems to be the result of mis-scaling 125, and 62 from correctly
scaling 62.5, but these numbers are also off by a factor of 2. Either
there is a scaling bug or the undocumented units are usecs/2 where
em's documented units are usecs/4. In em, the default itr rate is
8 kHz (power of 10), but in ixl it is unclear if 4K and 8K are actually
4000 and 8000, since they are scaled more in hardware (IXL_ITR_4K is
hard-coded as 122; the scale is linear but there aren't enough bits
to preserve linearity; it is unclear if the hard-coded values are
defined by the hardware or are the result of precomputing the values
(using hard-coded 0x7A (122) where em uses 1000000 / SCALE (1000000
being user-friendly microseconds and SCALE a hardware clock frequency)).

I think 122 really does mean a period that approximates the period for
a frequency of 4 khz. The period for this frequency is 250 usecs,
and 122 is 250 with units of usec*2, with an approximate error of
3 units. Or 122 is the period for the documented frequency of 4K
(binary power of 2 with undocumented units which I assume are Hz),
with the weird usec*2 units and a tiny error. Similarly for 62 and
8K, except there is a rounding error of almost 1.

> Doubling those values *increases* flood ping latency to ~200 usec (from ~116 usec).

Since they are periods and not frequencies, doubling them should double
the latency. Since their units are weird and undocumented, it is hard to
predict what the latency actually is. But I predict that if the units are
usecs*2, then the unscaled values give average latencies from interrupt
moderation. This gives 122 + 62 = 184 plus maybe another 20 for other
delays. Since the observed average latency is less than half that, the
units seem to be usecs*1 and it is the documented frequencies that are off
by a power of 2.

> Halving them to 62/31 decreases flood ping latency to ~50 usec, but still doesn't increase iperf throughput (still 2.8 Gb/s). Going to 31/16 further drops latency to 24 usec, with no change in throughput.

For em and lem, I use itr = 0 or 1 when optimizing for latency. This
reduces the latency to 50 for lem but only to 73 for em (where the
connection goes through a slow switch to not so slow bge). 24 seems
quite good, and the lowest I have seen for 1 Gbps is 26, but this
requires kludges like a direct connection and polling, and I would
hope for 40 times lower at 40 Gbps.

> (Looking at the "interrupt Moderation parameters" #defines in sys/dev/ixl/ixl.h it seems that ixl likes to have its irq rates specified with some weird divider scheme.)
>
> With 5/5 (which corresponds to IXL_ITR_100K), I get down to 16 usec. Unfortunately, throughput is then also down to about 2 Gb/s.

Lowering (improving) latency always lowers (unimproves) throughput by
increasing load. itr = 8 kHz is reasonable for 1 Gbps (it gives higher
latency than I like), but scaling that to 40 Gbps gives itr = 320 kHz,
and it is impossible to scale up the speed of a single CPU to reasonably
keep up with that.

Fix for em:

X diff -u2 if_em.c~ if_em.c
X --- if_em.c~ 2015-09-28 06:29:35.000000000 +0000
X +++ if_em.c 2015-10-18 18:49:36.876699000 +0000
X @@ -609,8 +609,8 @@
X em_tx_abs_int_delay_dflt);
X em_add_int_delay_sysctl(adapter, "itr",
X - "interrupt delay limit in usecs/4",
X + "interrupt delay limit in usecs",
X &adapter->tx_itr,
X E1000_REGISTER(hw, E1000_ITR),
X - DEFAULT_ITR);
X + 1000000 / MAX_INTS_PER_SEC);
X
X /* Sysctl for limiting the amount of work done in the taskqueue */

"delay limit" is fairly good wording. Other parameters tend to give long
delays, but itr limits the longest delay due to interrupt moderation to
whatever the itr represents.

Bruce

Eggert, Lars

Oct 21, 2015, 8:26:38 AM
Hi Bruce,

thanks for the very detailed analysis of the ixl sysctls!

On 2015-10-20, at 16:51, Bruce Evans <br...@optusnet.com.au> wrote:
>
> Lowering (improving) latency always lowers (unimproves) throughput by
> increasing load.

That, I also understand. But even when I back off the itr values to something more reasonable, throughput still remains low.

With all the tweaking I have tried, I have yet to top 3 Gb/s with ixl cards, whereas they do ~13 Gb/s on Linux straight out of the box.

Lars
Jack Vogel

Oct 21, 2015, 10:15:10 AM
The 40G hardware is absolutely dependent on firmware; if you have a mismatch,
for instance, it can totally bork things. So I would work with your Intel rep
and be sure you have the correct version for your specific hardware.

Good luck,

Jack

Eggert, Lars

Oct 21, 2015, 11:01:39 AM
Hi Jack,

On 2015-10-21, at 16:14, Jack Vogel <jfv...@gmail.com> wrote:
> The 40G hardware is absolutely dependent on firmware, if you have a mismatch
> for instance, it can totally bork things. So, I would work with your Intel
> rep and be sure you have the correct version for your specific hardware.

I got these tester cards from Amazon, so I don't have a rep.

I flashed the latest NVM (1.2.5), because previously the FreeBSD driver was complaining about the firmware being too old. But I did that before the experiments.

If there is anything else I should be doing, I'd appreciate being put in contact with someone at Intel who can help.

Thanks,
Lars

hiren panchasara

Oct 21, 2015, 12:19:54 PM
+ Eric from Intel
(Also trimming the CC list as it wouldn't let me send the message
otherwise.)
Eric,

Can you think of anything else that could explain this low performance?

Cheers,
Hiren

Eggert, Lars

Oct 22, 2015, 3:39:42 AM
Hi,

For those of you following along, I did try jumbograms, and throughput increased roughly 5x. So it looks like I'm hitting a packet-rate limit somewhere.
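
A sketch of the jumbo-frame setup, assuming a 9000-byte MTU on both ends
(the exact MTU used here isn't stated):

# on both machines, on the interfaces under test
ifconfig ixl0 mtu 9000
ifconfig ixl1 mtu 9000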

Lars
Eggert, Lars

Oct 22, 2015, 4:22:52 AM
On 2015-10-22, at 9:38, Eggert, Lars <la...@netapp.com> wrote:
> for those of you following along, I did try jumbograms and throughput increases roughly 5x. So it looks like I'm hitting a packet-rate limit somewhere.

Does the ixl driver have an issue with TSO/LRO?

If I tcpdump on the receiver when testing the 10G ix interfaces, I see that most "packets" are up to 64KB in the traces on both sender and receiver, which is expected with TSO/LRO.

When I look at the traffic over the ixl interfaces, I see that most "packets" on the sender are much smaller (~2896 bytes, i.e. 2 segments; although a few are >40K). On the receiver, I only see 1448-byte packets.
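
A sketch of the kind of capture behind these observations (the interface
name and port are taken from earlier in this thread; -c just bounds the
capture):

# watch reported segment sizes on the receiver during a transfer
tcpdump -ni ixl1 -c 200 'tcp port 5001'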

Lars
Bruce Evans

Oct 23, 2015, 1:36:42 AM
On Wed, 21 Oct 2015, Bruce Evans wrote:

> Fix for em:
>
> X diff -u2 if_em.c~ if_em.c
> X --- if_em.c~ 2015-09-28 06:29:35.000000000 +0000
> X +++ if_em.c 2015-10-18 18:49:36.876699000 +0000
> X @@ -609,8 +609,8 @@
> X em_tx_abs_int_delay_dflt);
> X em_add_int_delay_sysctl(adapter, "itr",
> X - "interrupt delay limit in usecs/4",
> X + "interrupt delay limit in usecs",
> X &adapter->tx_itr,
> X E1000_REGISTER(hw, E1000_ITR),
> X - DEFAULT_ITR);
> X + 1000000 / MAX_INTS_PER_SEC);
> X X /* Sysctl for limiting the amount of work done in the taskqueue */
>
> "delay limit" is fairly good wording. Other parameters tend to give long
> delays, but itr limits the longest delay due to interrupt moderation to
> whatever the itr respresents.

Everything in the last paragraph is backwards (inverted). Other
parameters tend to give short delays. They should be set to small
values to minimise latency. Then under load, itr limits the interrupt
_rate_ from above. The interrupt delay is the inverse of the interrupt
rate, so it is limited from below. So "delay limit" is fairly bad
wording. Normally, limits are from above, but the inversion makes
the itr limit from below.

This is most easily understood by converting itr to a rate: itr = 125
means a rate limit of 8000 Hz. It doesn't quite mean that the latency
is at least 125 usec. No one wants to ensure large latencies, and the
itr setting only ensures a minimal average latency under load.

Bruce

Eric Joyner

Oct 23, 2015, 5:43:34 PM
Bruce mostly has it right -- ITR is the minimum latency between interrupts.
But it does actually guarantee a minimum period between interrupts.
Fortville is actually a little unique in that there is another ITR setting
that can ensure a certain average number of interrupts per second (called
Interrupt Rate Limiting), but I don't think this is used in the current
version of the driver.

I see that the sysctl does clobber the global value, but have you tried
lowering the interval / raising the rate? You could try something like
10 usecs and see if that helps. We'll do some more investigation here --
3 Gb/s on a 40 Gb/s adapter with default settings is terrible, and we
shouldn't let that be happening.

- Eric

Eggert, Lars

Oct 24, 2015, 3:44:32 AM
On 2015-10-23, at 23:36, Eric Joyner <e...@freebsd.org> wrote:
> I see that the sysctl does clobber the global value, but have you tried lowering the interval / raising the rate? You could try something like 10usecs, and see if that helps. We'll do some more investigation here -- 3Gb/s on a 40Gb/s using default settings is terrible, and we shouldn't let that be happening.

I played with different settings, but I've never been able to get more than 4Gb/s, whereas under Linux 4.2 without any special settings I get 13.

See my other email on TSO/LRO not looking to be effective; that would certainly explain it. Plausible? Anything to try here?

Lars

Jack Vogel

Oct 24, 2015, 4:33:20 AM
13 on a 40G interface?? I don't think that's very good for Linux either; is
this a 4x10 adapter?
Maybe elaborate on the details of the hardware -- are you sure you don't
have a bad PCI slot somewhere that might be throttling everything?

Cheers,

Jack

Eggert, Lars

Oct 24, 2015, 4:47:22 AM
On 2015-10-24, at 10:32, Jack Vogel <jfv...@gmail.com> wrote:
> 13 on a 40G interface?? I don't think that's very good for Linux either, is
> this a 4x10 adapter?

No, it's a 2x40. And I can get it into the high 30s with tuning. I just mentioned the value to illustrate that something seems to be seriously broken under FreeBSD.

Lars
Daniel Engberg

Oct 25, 2015, 3:28:27 AM
One thing I've noticed that probably affects your performance benchmarks
somewhat is that you're using iperf(2) instead of the newer iperf3 but I
could be wrong...

Best regards,
Daniel

Kevin Oberman

Oct 25, 2015, 11:38:50 PM
On Sun, Oct 25, 2015 at 12:10 AM, Daniel Engberg <
daniel.eng...@pyret.net> wrote:

> One thing I've noticed that probably affects your performance benchmarks
> somewhat is that you're using iperf(2) instead of the newer iperf3 but I
> could be wrong...
>
> Best regards,
> Daniel
>

iperf3 is not a newer version of iperf. It is a total re-write and a rather
different tool. It has significant improvements in many areas and new
capabilities that might be of use. That said, there is no reason to think
that the results of tests using iperf2 are in any way inaccurate. However,
it is entirely possible to get misleading results if options are not properly
selected.
--
Kevin Oberman, Part time kid herder and retired Network Engineer
E-mail: rkob...@gmail.com
PGP Fingerprint: D03FB98AFA78E3B78C1694B318AB39EF1B055683

Eggert, Lars

Oct 26, 2015, 5:28:29 AM
On 2015-10-26, at 4:38, Kevin Oberman <rkob...@gmail.com> wrote:
> On Sun, Oct 25, 2015 at 12:10 AM, Daniel Engberg <
> daniel.eng...@pyret.net> wrote:
>
>> One thing I've noticed that probably affects your performance benchmarks
>> somewhat is that you're using iperf(2) instead of the newer iperf3 but I
>> could be wrong...
>
> iperf3 is not a newer version of iperf. It is a total re-write and a rather
> different tool. It has significant improvements in many areas and new
> capabilities that might be of use. That said, there is no reason to think
> that the results of tests using iperf2 are in any way inaccurate. However,
> it is entirely possible to get misleading results if options not properly
> selected.

FWIW, I've been using netperf and tried various options.

I don't think the issue is the benchmarking tool. I think the issue is with TSO/LRO (per my earlier email).

Lars

Pieper, Jeffrey E

Oct 26, 2015, 10:38:30 AM
With the latest ixl component from: https://downloadcenter.intel.com/download/25160/Network-Adapter-Driver-for-PCI-E-40-Gigabit-Network-Connections-under-FreeBSD-

running on 10.2 amd64, I easily get 9.6 Gb/s with one netperf stream, either b2b or through a switch. This is with no driver/kernel tuning. Running 4 streams easily gets me 36 Gb/s.
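
A sketch of the corresponding netperf run with default message sizes (the
peer address is a placeholder from earlier in the thread):

# single TCP stream to the remote netserver
netperf -H 10.0.1.2 -t TCP_STREAM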

Jeff

Eggert, Lars

Oct 26, 2015, 11:08:33 AM
On 2015-10-26, at 15:38, Pieper, Jeffrey E <jeffrey....@intel.com> wrote:
> With the latest ixl component from: https://downloadcenter.intel.com/download/25160/Network-Adapter-Driver-for-PCI-E-40-Gigabit-Network-Connections-under-FreeBSD-
>
> running on 10.2 amd64, I easily get 9.6 Gb/s with one netperf stream, either b2b or through a switch. This is with no driver/kernel tuning. Running 4 streams easily gets me 36 GB/s.

Thanks, will test!

If the newer driver makes a difference, any chance we'll see it in -HEAD soon?

Lars
Pieper, Jeffrey E

Oct 26, 2015, 12:09:36 PM



As a caveat, this was using default netperf message sizes.

Eggert, Lars

Oct 26, 2015, 1:41:50 PM
On 2015-10-26, at 17:08, Pieper, Jeffrey E <jeffrey....@intel.com> wrote:
> As a caveat, this was using default netperf message sizes.

I get the same ~3 Gb/s with the default netperf sizes and driver 1.4.5.

When you tcpdump during the run, do you see TSO/LRO in effect, i.e., do you see "segments" > 32K in the trace?

Lars