Carolyn,
I've just had the opportunity to upgrade to a 3.2.9 kernel on these
systems and have made sure e1000e is loaded with IntMode=1,1. One of the
servers had been up only 5.5 hours before the NIC crashed/stopped working
again.
Here is the latest dmesg after the failure:
[ 3.254553] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k
[ 3.265852] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[ 3.266034] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[ 3.266067] e1000e 0000:00:19.0: setting latency timer to 64
[ 3.266460] e1000e 0000:00:19.0: (unregistered net_device): Interrupt Mode set to 1
[ 3.266800] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[ 3.611840] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:75
[ 3.611855] e1000e 0000:00:19.0: eth2: Intel(R) PRO/1000 Network Connection
[ 3.612303] e1000e 0000:00:19.0: eth2: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF
[ 3.612350] e1000e 0000:05:00.0: Disabling ASPM L0s
[ 3.612594] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 3.612812] e1000e 0000:05:00.0: setting latency timer to 64
[ 3.613582] e1000e 0000:05:00.0: (unregistered net_device): Interrupt Mode set to 1
[ 3.614156] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[ 3.734442] e1000e 0000:05:00.0: eth3: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:74
[ 3.734465] e1000e 0000:05:00.0: eth3: Intel(R) PRO/1000 Network Connection
[ 3.734689] e1000e 0000:05:00.0: eth3: MAC: 3, PHY: 8, PBA No: FFFFFF-0FF
[ 13.799848] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[ 13.855646] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[ 14.031739] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[ 14.087566] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[ 16.112504] e1000e: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
[ 16.124129] e1000e 0000:05:00.0: eth2: 10/100 speed: disabling TSO
And here is the output just as it hangs:
[19745.327241] ------------[ cut here ]------------
[19745.334501] WARNING: at /build/buildd-linux-2.6_3.2.9-1-amd64-KTPapN/linux-2.6-3.2.9/debian/build/source_amd64_none/net/sched/sch_generic.c:255 dev_watchdog+0xe9/0x148()
[19745.350441] Hardware name: X9SCL/X9SCM
[19745.358859] NETDEV WATCHDOG: eth2 (e1000e): transmit queue 0 timed out
[19745.367287] Modules linked in: hmac sha256_generic dlm configfs
ebtable_nat ebtables acpi_cpufreq mperf cpufreq_stats
cpufreq_conservative cpufreq_userspace cpufreq_powersave microcode
ip6_queue xt_TCPMSS xt_sctp ip6t_LOG ip6t_REJECT nf_conntrack_ipv6
ip6table_raw ip6table_mangle ip6table_filter xt_NOTRACK ip_set_hash_net
act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb
sch_hfsc sch_ingress sch_sfq xt_statistic xt_CT xt_time xt_connlimit
xt_realm xt_addrtype ip_set_hash_ip iptable_raw xt_comment xt_recent
xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE
ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp
nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp
nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp
nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip
nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp
nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns
nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323
nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6
xt_tcpmss xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_NFLOG
nfnetlink_log xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange
xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark
xt_CLASSIFY xt_AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink
iptable_filter ip_tables x_tables kvm_intel kvm bridge stp bonding
w83627ehf hwmon_vid coretemp sha1_ssse3 sha1_generic crc32c_intel
aesni_intel cryptd aes_x86_64 aes_generic ipmi_poweroff ipmi_devintf
ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache cn
loop snd_pcm snd_timer snd soundcore snd_page_alloc iTCO_wdt i2c_i801
psmouse cdc_acm processor i2c_core iTCO_vendor_support serio_raw pcspkr
thermal_sys button evdev joydev ext4 mbcache jbd2 crc16 dm_mod raid1
md_mod sd_mod crc_t10dif usb_storage uas usbhid hid ahci libahci libata
ehci_hcd usbcore igb scsi_mod e1000e usb_common dca [last unloaded:
scsi_wait_scan]
[19745.502559] Pid: 0, comm: swapper/0 Not tainted 3.2.0-2-amd64 #1
[19745.502561] Call Trace:
[19745.502562] <IRQ> [<ffffffff81046879>] ? warn_slowpath_common+0x78/0x8c
[19745.502570] [<ffffffff81046925>] ? warn_slowpath_fmt+0x45/0x4a
[19745.502574] [<ffffffff8129aa11>] ? netif_tx_lock+0x40/0x72
[19745.502588] [<ffffffff8129ab72>] ? dev_watchdog+0xe9/0x148
[19745.502601] [<ffffffff81051f38>] ? run_timer_softirq+0x19a/0x261
[19745.502603] [<ffffffff8129aa89>] ? netif_tx_unlock+0x46/0x46
[19745.502606] [<ffffffff81065a73>] ? timekeeping_get_ns+0xd/0x2a
[19745.502609] [<ffffffff8104be98>] ? __do_softirq+0xb9/0x177
[19745.502612] [<ffffffff8134892c>] ? call_softirq+0x1c/0x30
[19745.502615] [<ffffffff8100f8e5>] ? do_softirq+0x3c/0x7b
[19745.502617] [<ffffffff8104c100>] ? irq_exit+0x3c/0x9a
[19745.502621] [<ffffffff81023f18>] ? smp_apic_timer_interrupt+0x74/0x82
[19745.502624] [<ffffffff8134719e>] ? apic_timer_interrupt+0x6e/0x80
[19745.502625] <EOI> [<ffffffff81070761>] ? arch_local_irq_save+0x11/0x17
[19745.502631] [<ffffffff811e45d9>] ? intel_idle+0xea/0x119
[19745.502633] [<ffffffff811e45b8>] ? intel_idle+0xc9/0x119
[19745.502637] [<ffffffff812643f7>] ? cpuidle_idle_call+0xec/0x179
[19745.502639] [<ffffffff8100d248>] ? cpu_idle+0xa5/0xf2
[19745.502641] [<ffffffff816aab3d>] ? start_kernel+0x3bd/0x3c8
[19745.502643] [<ffffffff816aa140>] ? early_idt_handlers+0x140/0x140
[19745.502645] [<ffffffff816aa3c4>] ? x86_64_start_kernel+0x104/0x111
[19745.502646] ---[ end trace 10e791a6f31603fa ]---
[19745.503125] e1000e 0000:05:00.0: eth2: Reset adapter
Once again, rmmod e1000e followed by modprobe e1000e does not fix the
problem:
[20508.158919] e1000e 0000:05:00.0: PCI INT A disabled
[20508.194927] e1000e 0000:00:19.0: PCI INT A disabled
[20511.119765] e1000e: Intel(R) PRO/1000 Network Driver - 1.5.1-k
[20511.130711] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[20511.141206] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[20511.151797] e1000e 0000:00:19.0: setting latency timer to 64
[20511.151921] e1000e 0000:00:19.0: (unregistered net_device): Interrupt Mode set to 1
[20511.162853] e1000e 0000:00:19.0: irq 45 for MSI/MSI-X
[20511.528436] e1000e 0000:00:19.0: eth2: (PCI Express:2.5GT/s:Width x1) 00:25:90:56:ac:75
[20511.539261] e1000e 0000:00:19.0: eth3: Intel(R) PRO/1000 Network Connection
[20511.550066] e1000e 0000:00:19.0: eth3: MAC: 10, PHY: 11, PBA No: FFFFFF-0FF
[20511.561027] e1000e 0000:05:00.0: Disabling ASPM L0s
[20511.571883] e1000e 0000:05:00.0: enabling device (0000 -> 0002)
[20511.575224] udevd[5449]: renamed network interface eth2 to eth3
[20511.594234] e1000e 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[20511.605703] e1000e 0000:05:00.0: setting latency timer to 64
[20511.605871] e1000e 0000:05:00.0: (unregistered net_device): Interrupt Mode set to 1
[20511.617706] e1000e 0000:05:00.0: irq 64 for MSI/MSI-X
[20511.617828] e1000e 0000:05:00.0: PCI INT A disabled
[20511.629565] e1000e: probe of 0000:05:00.0 failed with error -2
Please let me know if/how I can debug this further.
Many thanks,
Chris
--
Chris Boot
bo...@bootc.net.