tg3: transmit timed out, resetting

Christian Kujau

Jun 4, 2012, 7:20:29 PM
to LKML, mcar...@broadcom.com
Hi,

on this Ideapad S10 the onboard Broadcom BCM5906M prints the warning
below, once. From then on, the "transmit timed out, resetting" message
repeats, every now and then.

This laptop is mounting 2 readonly NFS shares from a box in the same LAN
and when scanning lots of files on these NFS shares, the transmit timeouts
occur more often, I think. When there's sequential traffic (i.e. reading
larger files from the NFS shares), fewer warnings occur. But this is just
manual observation, I haven't been able to reproduce this reliably.
However, there's constant traffic on the device (maybe ~700KB/s both tx
and rx), so the messages occur pretty regularly.

I have reported the error against the Fedora 17 kernel [0] but it happens
with a vanilla 3.4.0 too [1]; see there for the full dmesg, .config and more.

I had a similar issue a while ago [2] and had almost forgotten about it. The
laptop ran Ubuntu 10.04 (2.6.32) since then and the problem was gone, so
I'd say 2.6.32 fixed it. Now the same laptop switched to Fedora, kernel
3.3.4 and the problem seems to be back again.

I'll try running with sg=off, as Matt suggested in [3] and report back.

Thanks,
Christian.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=825123
[1] http://nerdbynature.de/bits/3.4.0/tg3/
[2] http://lkml.indiana.edu/hypermail/linux/kernel/0906.1/00004.html
[3] http://lkml.indiana.edu/hypermail/linux/kernel/0906.1/00317.html

------------[ cut here ]------------
WARNING: at /opt/home/chrisk/dev/linux-2.6-git/net/sched/sch_generic.c:255
dev_watchdog+0x1cc/0x1e0()
Hardware name: Lenovo
NETDEV WATCHDOG: p2p1 (tg3): transmit queue 0 timed out
Modules linked in: acpi_cpufreq mperf freq_table nfs lockd sunrpc b43
mac80211 cfg80211 ssb coretemp hwmon usb_storage [last unloaded: scsi_wait_scan]
Pid: 685, comm: FahCore_78 Not tainted 3.4.0-10151-g4fc3acf #8
Call Trace:
[<c102b299>] ? warn_slowpath_common+0x79/0xb0
[<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
[<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
[<c102b374>] ? warn_slowpath_fmt+0x34/0x40
[<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
[<c12d5320>] ? pfifo_fast_dequeue+0xe0/0xe0
[<c1035cf1>] ? run_timer_softirq+0xd1/0x1d0
[<c1031615>] ? __do_softirq+0x75/0x100
[<c10315a0>] ? remote_softirq_receive+0x20/0x20
<IRQ> [<c10318a6>] ? irq_exit+0x66/0x90
[<c101b8d9>] ? smp_apic_timer_interrupt+0x59/0x90
[<c1360b35>] ? apic_timer_interrupt+0x31/0x38
[<c1360000>] ? rt_mutex_trylock+0x70/0x70
---[ end trace 9de668a859ee5d6c ]---
tg3 0000:02:00.0: p2p1: transmit timed out, resetting
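
A rough way to quantify "every now and then" is to count the reset lines
in a saved kernel log. This is only a minimal sketch: count_tx_resets is a
made-up helper name, and it assumes the log was first written to a file
(for example with "dmesg > dmesg.txt").

```shell
# Hypothetical helper (name invented for illustration): count the tg3
# "transmit timed out, resetting" lines in the saved log file given as $1.
count_tx_resets() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'tg3 .*transmit timed out, resetting' "$1"
}

# Example use (assumes the kernel log was saved first):
#   dmesg > dmesg.txt
#   count_tx_resets dmesg.txt
```

Comparing the count between a run with many small NFS reads and a run with
large sequential reads would show whether the pattern described above holds.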


--
BOFH excuse #438:

sticky bit has come loose
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Christian Kujau

Jun 5, 2012, 12:57:59 PM
to LKML, mcar...@broadcom.com, mc...@broadcom.com
On Mon, 4 Jun 2012 at 16:14, Christian Kujau wrote:
> Hi,
>
> on this Ideapad S10 the onboard Broadcom BCM5906M prints the warning
> below, once. From then on, the "transmit timed out, resetting" message
> repeats, every now and then.
>
> This laptop is mounting 2 readonly NFS shares from a box in the same LAN
> and when scanning lots of files on these NFS shares, the transmit timeouts
> occur more often, I think. When there's sequential traffic (i.e. reading
> larger files from the NFS shares), fewer warnings occur. But this is just
> manual observation, I haven't been able to reproduce this reliably.
> However, there's constant traffic on the device (maybe ~700KB/s both tx
> and rx), so the messages occur pretty regularly.
>
> I have reported the error against the Fedora 17 kernel [0] but it happens
> with a vanilla 3.4.0 too[1] - check out for full dmesg, .config and more.
>
> I had a similar issue a while ago[2] and almost forgot about them. The
> laptop ran Ubuntu 10.04 (2.6.32) since then and the problem was gone, so
> I'd say 2.6.32 fixed it. Now the same laptop switched to Fedora, kernel
> 3.3.4 and the problem seems to be back again.
>
> I'll try running with sg=off, as Matt suggested in [3] and report back.

sg=off seems to help, no errors since I disabled it yesterday.

Any thoughts on this issue?

Christian.
--
BOFH excuse #18:

excess surge protection

Matt Carlson

Jun 5, 2012, 9:09:05 PM
to Christian Kujau, LKML, mcar...@broadcom.com
I'm attempting to reproduce this in our lab. In the meantime,
the latest revisions of the driver output a register dump and some
additional information when transmit timeouts happen. It would be
useful to see that data. Would it be possible to try the latest
kernels and get this information?

ethan zhao

Jun 5, 2012, 9:58:51 PM
to Matt Carlson, Christian Kujau, LKML
I found many similar bug reports with a simple Google search.
The root cause of this issue may be related to the Broadcom tg3 firmware
and the tg3 hardware revision, so I think it will be hard to fix in the
Linux driver. A better option is to get another NIC, or to disable some of
its features as a workaround once we know which feature triggers it (TSO? SG?).

Some debugging messages from other guys:

[ 3538.223529] tg3 0000:01:08.0: eth1: transmit timed out, resetting
[ 3538.229698] tg3 0000:01:08.0: eth1: DEBUG: MAC_TX_STATUS[00000008]
MAC_RX_STATUS[00000008]
[ 3538.236001] tg3 0000:01:08.0: eth1: DEBUG: RDMAC_STATUS[00000000]
WDMAC_STATUS[00000000]
[ 3538.343602] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=1800 enable_bit=2
[ 3538.449609] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
[ 3538.555402] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=4800 enable_bit=2
[ 3538.692079] tg3 0000:01:08.0: eth1: Link is down

We can see that tg3_reset_hw() --> tg3_stop_fw() --> tg3_stop_block() times
out, so the firmware is not responding correctly.
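
To see at a glance which blocks failed to stop, the offsets can be pulled
out of the tg3_stop_block lines. A minimal sketch, assuming the log lines
have the format shown above; stop_block_timeouts is a made-up helper name,
and mapping each offset to a named register still means looking it up in
the driver's tg3.h.

```shell
# Hypothetical helper (name invented for illustration): read kernel log
# text on stdin and print "offset enable_bit" for each tg3_stop_block
# timeout line found.
stop_block_timeouts() {
    sed -n 's/.*tg3_stop_block timed out, ofs=\([0-9a-f]*\) enable_bit=\([0-9]*\).*/\1 \2/p'
}

# Example use:
#   dmesg | stop_block_timeouts
```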

Just my 2 cents.

Ethan

Matt Carlson

Jun 5, 2012, 10:16:05 PM
to ethan zhao, Matt Carlson, Christian Kujau, LKML
Hi Ethan. This device does not have any special firmware (beyond
bootcode). It shouldn't be necessary to disable any of the device's
features if it is working correctly.

Thanks for the debugging output. The tg3_stop_block() timeouts mean
that (a portion of) the chip is stuck somehow. Later drivers output a lot
more information than this. The additional information can help answer a
lot of questions in a short period of time. I was hoping I could
accomplish a lot more in fewer emails if I have more data available. :)

On Wed, Jun 06, 2012 at 09:58:42AM +0800, ethan zhao wrote:
> Saw many similar bugs report by simply google,
> The root cause of this issue may be related to Broadcom tg3 firmware
> and the version of tg3 hardware, so I think it is hard to get fix in
> Linux driver. better way is get another NIC, or disable some its
> feature to workaround if we got what feature block it (tso ? sg ? ).
>
> Some debugging messages from other guys:
>
> [ 3538.223529] tg3 0000:01:08.0: eth1: transmit timed out, resetting
> [ 3538.229698] tg3 0000:01:08.0: eth1: DEBUG: MAC_TX_STATUS[00000008]
> MAC_RX_STATUS[00000008]
> [ 3538.236001] tg3 0000:01:08.0: eth1: DEBUG: RDMAC_STATUS[00000000]
> WDMAC_STATUS[00000000]
> [ 3538.343602] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=1800 enable_bit=2
> [ 3538.449609] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
> [ 3538.555402] tg3 0000:01:08.0: tg3_stop_block timed out, ofs=4800 enable_bit=2
> [ 3538.692079] tg3 0000:01:08.0: eth1: Link is down
>
> We could see tg3_reset_hw()-->tg3_stop_fw()--> tg3_stop_block() timeout,
> so the response of firmware is not right.
>
> Just my 2 cents.
>
> Ethan
>
>
> On Wed, Jun 6, 2012 at 9:02 AM, Matt Carlson <mcar...@broadcom.com> wrote:
> > I'm attempting to reproduce this in our lab. In the meantime,
> > the latest revisions of the driver output a register dump and some
> > additional information when transmit timeouts happen. It would be
> > useful to see that data. Would it be possible to try a the latest
> >>  [<c102b299>] ? warn_slowpath_common+0x79/0xb0
> >>  [<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
> >>  [<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
> >>  [<c102b374>] ? warn_slowpath_fmt+0x34/0x40
> >>  [<c12d54ec>] ? dev_watchdog+0x1cc/0x1e0
> >>  [<c12d5320>] ? pfifo_fast_dequeue+0xe0/0xe0
> >>  [<c1035cf1>] ? run_timer_softirq+0xd1/0x1d0
> >>  [<c1031615>] ? __do_softirq+0x75/0x100
> >>  [<c10315a0>] ? remote_softirq_receive+0x20/0x20
> >>  <IRQ>  [<c10318a6>] ? irq_exit+0x66/0x90
> >>  [<c101b8d9>] ? smp_apic_timer_interrupt+0x59/0x90
> >>  [<c1360b35>] ? apic_timer_interrupt+0x31/0x38
> >>  [<c1360000>] ? rt_mutex_trylock+0x70/0x70
> >> ---[ end trace 9de668a859ee5d6c ]---
> >> tg3 0000:02:00.0: p2p1: transmit timed out, resetting
> >>
> >>
> >> --
> >> BOFH excuse #438:
> >>
> >> sticky bit has come loose
> >>

ethan zhao

Jun 5, 2012, 10:29:53 PM
to Matt Carlson, Christian Kujau, LKML
So no way to fix it via firmware update or Linux driver ? :<

ethan zhao

Jun 6, 2012, 12:52:37 AM
to Eric Dumazet, Matt Carlson, Christian Kujau, LKML, netdev
Eric,
That was a request for confirmation from Matt Carlson of Broadcom.

Ethan

On Wed, Jun 6, 2012 at 12:12 PM, Eric Dumazet <eric.d...@gmail.com> wrote:
> On Wed, 2012-06-06 at 10:29 +0800, ethan zhao wrote:
>> So no way to fix it via firmware update or Linux driver ? :<
>
> Yes, but you need to cooperate, or else it might take more time than
> necessary.
>
> Asking questions like that on lkml is not going to help very much.
>
> So, once again, we kindly ask you try a recent kernel and post
> register dump and some additional information when transmit timeouts
> happen.
>
> The 'latest kernel' is either linux-3.5.rc1, or one of David Miller
> tree :
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=summary
>
> or
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=summary
>
> Thanks

Christian Kujau

Jun 6, 2012, 2:18:15 AM
to Matt Carlson, LKML
On Tue, 5 Jun 2012 at 18:02, Matt Carlson wrote:
> I'm attempting to reproduce this in our lab. In the meantime,
> the latest revisions of the driver output a register dump and some
> additional information when transmit timeouts happen. It would be
> useful to see that data.

I've only copied so much of the warning into my initial email, but after
that, much more followed, which looks like a register dump. I've put
everything (the whole logs and more) here:

http://nerdbynature.de/bits/3.4.0/tg3/

Is that what you're looking for?

> Would it be possible to try a the latest kernels and get this information?

I've observed this with 3.4, but I'll update to latest 3.5-git tomorrow
and let you know.

Thanks for replying,
Christian.
--
BOFH excuse #192:

runaway cat on system.

Christian Kujau

Jun 7, 2012, 3:13:19 AM
to Matt Carlson, LKML
On Tue, 5 Jun 2012 at 23:17, Christian Kujau wrote:
> I've only copied so much of the warning into my initial email, but after
> that, much more followed, which looks like a register dump. I've put
> everything (the whole logs and more) here:
>
> http://nerdbynature.de/bits/3.4.0/tg3/
>
> Is that what you're looking for?

Have you had a chance to look at those outputs yet?

> > Would it be possible to try a the latest kernels and get this information?

I'm running today's git (3.5.0-rc1-00110-g71fae7e) and ~3.5h after
booting the same warning was printed, along with the register dump (if
that's what it is). I've put the full output online again:

http://nerdbynature.de/bits/3.4.0/tg3/
- messages_3.5.0-rc1-00110-g71fae7e.txt.gz
- config_3.5.0-rc1-00110-g71fae7e.gz

Thanks,
Christian.
--
BOFH excuse #10:

hardware stress fractures

Christian Kujau

Jun 7, 2012, 8:52:22 AM
to Ethan Zhao, LKML, mcar...@broadcom.com
On Thu, 7 Jun 2012 at 17:47, Ethan Zhao wrote:
> Could you try 3.5RC1+ with pcie_aspm=off kernel parameter ?

Will try.

> I notice there are some AER errors ( UnsupReq+,RxErr+) with the tg3
> from you lspci output, have you seen the AER errors on console ? if
> so, please attach them.

I haven't seen any actual errors on the console, except these messages
during bootup:

--------
pci0000:00: Requesting ACPI _OSC control (0x1d)
pci0000:00: ACPI _OSC request failed (AE_NOT_FOUND), returned control mask: 0x1d
ACPI _OSC control for PCIe not granted, disabling ASPM
--------

I have also seen these messages in 3.2.0 (an Ubuntu kernel). I can't say
that I saw them before that, and they did not show up when I booted this
2.6.38 Ubuntu kernel.

I'll try booting with pcie_aspm=off and see what it gives...
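
Before judging whether the parameter makes a difference, it is worth
confirming that the running kernel really booted with it. A minimal sketch:
has_kernel_param is a made-up helper name, and the optional second argument
exists only so the check can be pointed at a saved copy of the command line.

```shell
# Hypothetical helper (name invented for illustration): succeed if the
# kernel parameter $1 appears as a whole word in the command-line file
# $2, which defaults to /proc/cmdline.
has_kernel_param() {
    cmdline=$(cat "${2:-/proc/cmdline}")
    # Pad with spaces so only complete parameters match.
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example use:
#   has_kernel_param pcie_aspm=off && echo "booted with pcie_aspm=off"
```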

Thanks,
Christian.

PS: AFAICT, Ubuntu's 2.6.38 [0] had these options set:

--------
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
--------

With 3.5.x, my .config has:

--------
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
--------

[0] https://launchpad.net/ubuntu/natty/i386/linux-image-2.6.38-15-generic/2.6.38-15.60

--
BOFH excuse #338:

old inkjet cartridges emanate barium-based fumes

Christian Kujau

Jun 7, 2012, 6:21:49 PM
to Ethan Zhao, LKML, mcar...@broadcom.com
On Thu, 7 Jun 2012 at 05:52, Christian Kujau wrote:
> On Thu, 7 Jun 2012 at 17:47, Ethan Zhao wrote:
> > Could you try 3.5RC1+ with pcie_aspm=off kernel parameter ?

Hm, this didn't help.

> > I notice there are some AER errors ( UnsupReq+,RxErr+) with the tg3
> > from you lspci output

Isn't lspci just listing ASPM _capabilities_ there? Booting with
pcie_aspm=off showed almost the same output:

http://nerdbynature.de/bits/3.4.0/tg3/lspci_aspm.diff.txt

So, the workaround for me is to disable "scatter-gather":

ethtool -K p2p1 sg off

With that, no more errors show up and the interface keeps working.
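
To confirm the setting took effect, the current state can be read back from
the "ethtool -k" output (lowercase -k shows features, uppercase -K changes
them). A minimal sketch: sg_state is a made-up helper name, and it assumes
the output contains a top-level "scatter-gather: on|off" line.

```shell
# Hypothetical helper (name invented for illustration): read the output
# of "ethtool -k <dev>" on stdin and print the state of the top-level
# scatter-gather feature, for example "on" or "off".
sg_state() {
    # Indented sub-features such as tx-scatter-gather do not match,
    # because the first field must be exactly "scatter-gather".
    awk -F': ' '$1 == "scatter-gather" { print $2 }'
}

# Example use (p2p1 is the interface name from this thread):
#   ethtool -k p2p1 | sg_state
```

Settings changed with ethtool -K do not survive a reboot, so the command
has to be reapplied at boot for the workaround to stick.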

Christian.
BOFH excuse #396:

Mail server hit by UniSpammer.

Matt Carlson

Jun 7, 2012, 6:58:26 PM
to ethan zhao, Eric Dumazet, Matt Carlson, Christian Kujau, LKML, netdev
On Wed, Jun 06, 2012 at 12:52:32PM +0800, ethan zhao wrote:
> Eric,
> That is ask for confirmation from Matt Carlson of Broadcom.
>
> Ethan
>
> On Wed, Jun 6, 2012 at 12:12 PM, Eric Dumazet <eric.d...@gmail.com> wrote:
> > On Wed, 2012-06-06 at 10:29 +0800, ethan zhao wrote:
> >> So no way to fix it via firmware update or Linux driver ? :<
> >
> > Yes, but you need to cooperate, or else it might take more time than
> > necessary.
> >
> > Asking questions like that on lkml is not going to help very much.
> >
> > So, once again, we kindly ask you try a recent kernel and post
> > register dump and some additional information when transmit timeouts
> > happen.
> >
> > The 'latest kernel' is either linux-3.5.rc1, or one of David Miller
> > tree :
> >
> > http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git;a=summary
> >
> > or
> >
> > http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=summary
> >
> > Thanks

Does the following patch fix your problem?


[PATCH] tg3: Apply short DMA frag workaround to 5906

5906 devices also need the short DMA fragment workaround. This patch
makes the necessary change.

Signed-off-by: Matt Carlson <mcar...@broadcom.com>
---
drivers/net/ethernet/broadcom/tg3.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index d55df32..2db4d70 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -14275,7 +14275,8 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 		}
 	}
 
-	if (tg3_flag(tp, 5755_PLUS))
+	if (tg3_flag(tp, 5755_PLUS) ||
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
 		tg3_flag_set(tp, SHORT_DMA_BUG);
 
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5719)
--
1.7.3.4

Ethan Zhao

Jun 7, 2012, 9:24:19 PM
to Matt Carlson, Eric Dumazet, Christian Kujau, LKML, netdev
Matt,
I notice there are some AER errors (UnsupReq+, RxErr+) with the tg3
in Christian's lspci output. Do you know why they appear and how to clear them?

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF-
MalfTLP- ECRC- UnsupReq+ ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
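
The set bits can be picked out of the status lines mechanically. A minimal
sketch: aer_set_flags is a made-up helper name, and it assumes unwrapped
"lspci -vv" output in which each UESta/CESta register sits on a single line
(the UESta line above was wrapped by the mail client). A set bit only says
the condition was logged at some point since the register was last cleared.

```shell
# Hypothetical helper (name invented for illustration): read "lspci -vv"
# output on stdin and print every UESta/CESta flag that is set, i.e.
# every flag marked with a trailing "+".
aer_set_flags() {
    grep -E '^[[:space:]]*(UESta|CESta):' |
        tr -s '[:space:]' '\n' |
        grep '+$'
}

# Example use (02:00.0 is the tg3 device address from the log; run as
# root so lspci can read the extended capabilities):
#   lspci -vv -s 02:00.0 | aer_set_flags
```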

Thanks,
Ethan

David Miller

Jun 11, 2012, 7:55:41 PM
to li...@nerdbynature.de, mcar...@broadcom.com, ethan....@gmail.com, eric.d...@gmail.com, linux-...@vger.kernel.org, net...@vger.kernel.org
From: Christian Kujau <li...@nerdbynature.de>
Date: Mon, 11 Jun 2012 16:53:16 -0700 (PDT)

> Tested-by: Christian Kujau <li...@nerdbynature.de>

Great, applied, thanks everyone.