
Broadcom TG3 network drops, cannot recover without reboot


Justin Catterall

May 26, 2015, 8:40:04 AM

At irregular times, and apparently for no reason at all, networking
drops and cannot be restarted without reboot on a fresh install of
Jessie. The NIC is a Broadcom NetXtreme BCM5720.

ifconfig thinks networking is still up because I can:
ifconfig eth0 down

I find this when I try 'ifconfig eth0 up':
tg3_abort_hw timed out TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
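A minimal Python sketch for spotting this message in kernel log text, such as the lines of `dmesg` output. The regular expression is written against the single line quoted above and is an assumption: other tg3 messages or kernel versions may word things differently. The all-ones check is only a common heuristic, since a PCI device that has stopped responding typically reads back as 0xffffffff.

```python
import re

# Matches the tg3 line quoted in the post, e.g.
# "tg3_abort_hw timed out TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff".
# Other kernel versions may word the message differently (assumption).
HANG_RE = re.compile(
    r"tg3_abort_hw timed out (?P<flag>\w+) will not clear "
    r"(?P<reg>\w+)=(?P<val>[0-9a-fA-F]+)"
)


def find_tg3_hangs(log_lines):
    """Return (flag, register, value) for every log line showing the hang."""
    hits = []
    for line in log_lines:
        # search() rather than match(), so a "tg3 0000:...: eth0:" prefix is fine
        m = HANG_RE.search(line)
        if m:
            hits.append((m.group("flag"), m.group("reg"), int(m.group("val"), 16)))
    return hits


def looks_like_dead_device(value):
    """Heuristic: an all-ones register read usually means the device is not answering."""
    return value == 0xFFFFFFFF
```

On the affected machine, the input could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout.splitlines()`; reading the kernel log may require root.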

If I:
rmmod tg3; insmod tg3
the problem does not resolve. It seems the card needs a hard reset.

Searching the web, I found various issues with the TG3, most of which
are resolved by installing firmware-linux-nonfree. I have this
package installed, but I can re-create the problem by running
/etc/init.d/networking restart. Networking then stops working completely: I can't ping the machine, nor can I ping anything from it.

Any suggestions on where to look for a solution?


--
Justin C, by the sea.

--
To UNSUBSCRIBE, email to debian-us...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listm...@lists.debian.org
Archive: https://lists.debian.org/BA8A1C4C-A677-4C53...@masonsmusic.co.uk

Henrique de Moraes Holschuh

May 26, 2015, 10:50:05 AM
On Tue, May 26, 2015, at 09:24, Justin Catterall wrote:
> At irregular times, and apparently for no reason at all, networking
> drops and cannot be restarted without reboot on a fresh install of
> Jessie. The NIC is a Broadcom NetXtreme BCM5720.
>
> ifconfig thinks networking is still up because I can:
> ifconfig eth0 down
>
> I find this when I try 'ifconfig eth0 up':
> tg3_abort_hw timed out TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff

Hmm, it is either a kernel issue, or a hardware issue.

> Any suggestions on where to look for a solution?

Yes.

First, disable all hardware offloading using ethtool. See if that
helps.
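One way to do this, sketched in Python under stated assumptions: parse the output of `ethtool -k eth0`, keep every feature that is `on` and not marked `[fixed]`, and build a single `ethtool -K` command that turns them all off. This assumes the installed ethtool accepts feature names exactly as `-k` prints them (older releases may only take the short legacy names such as `tso`, `gso`, `gro`). It also switches off every toggleable feature, which is somewhat broader than "offloads" in the strict sense.

```python
def toggleable_enabled_features(ethtool_k_output):
    """Names of features reported as 'on' and not marked [fixed]."""
    names = []
    for raw in ethtool_k_output.splitlines():
        # Sub-features are tab-indented; strip() handles both levels.
        name, sep, rest = raw.strip().partition(": ")
        if not sep:
            continue  # header line such as "Features for eth0:" or a blank line
        fields = rest.split()
        if fields and fields[0] == "on" and "[fixed]" not in fields:
            names.append(name)
    return names


def build_disable_command(iface, features):
    """One `ethtool -K <iface> <feature> off ...` argument list for all features."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd
```

Feed it the text from `subprocess.run(["ethtool", "-k", "eth0"], capture_output=True, text=True, check=True).stdout` and run the returned command as root. Features shown as `[fixed]` cannot be changed, which is why they are skipped.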

Also, if this NIC is in the system mainboard, make sure you are using
the latest firmware ("BIOS update") from your motherboard vendor: it is
usual to have the motherboard NICs use a data block in the shared system
FLASH for vital product data and firmware. The motherboard vendor will
bundle up updates for the NIC firmware with the BIOS updates when both
are in the same FLASH chip.

Make sure you have the latest linux firmware file for the tg3 driver as
well. If the initramfs image has the tg3.ko module inside, it must also
have the firmware file. A workaround for any initramfs-related tg3
firmware loading issues is to "rmmod tg3 ; modprobe tg3" after the
system booted (and before the NIC hardlocks).
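A rough Python check for the initramfs condition described above. It assumes the file list comes from Debian's `lsinitramfs` (for example `lsinitramfs /boot/initrd.img-$(uname -r)`, one relative path per line) and that the tg3 firmware blobs live under `lib/firmware/tigon/`. The function name is hypothetical.

```python
import os


def initramfs_firmware_problem(paths):
    """True if tg3 is packed into the initramfs without any tigon firmware.

    `paths` is the file list printed by `lsinitramfs` (assumed format:
    relative paths such as lib/firmware/tigon/tg3_tso5.bin).
    """
    # startswith() also accepts compressed modules such as tg3.ko.xz
    has_module = any(os.path.basename(p).startswith("tg3.ko") for p in paths)
    # tg3 firmware is assumed to live under lib/firmware/tigon/
    has_firmware = any("firmware/tigon/" in p for p in paths)
    return has_module and not has_firmware
```

If it returns True, the "rmmod tg3 ; modprobe tg3" workaround applies, or the firmware can be added to the initramfs and the image rebuilt.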

If all of the above failed, get yourself familiar with building a custom
Debian-compatible kernel using pristine upstream kernels from
www.kernel.org. Wait until 3.18.15 and 4.0.5 are released on
www.kernel.org, and build custom kernels based on them. Alternatively,
wait until a debian-packaged version of kernel 4.0.5 is available. DO
NOT use 4.0 kernels before 4.0.5 on pain of possible data loss.
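The version advice above can be encoded as a small Python helper. It only knows about the two releases named in this message (3.18.15 and 4.0.5). Anything else is reported as not covered rather than guessed at, and the returned labels are hypothetical.

```python
import re


def parse_version(release):
    """'3.18.15-custom' -> (3, 18, 15); a missing patch level counts as 0."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)


def classify(release):
    """Apply only the advice given in this thread; everything else is 'not covered'."""
    major, minor, patch = parse_version(release)
    if (major, minor) == (4, 0):
        # 4.0 kernels before 4.0.5 carry the data-loss warning
        return "avoid" if patch < 5 else "candidate"
    if (major, minor) == (3, 18):
        return "candidate" if patch >= 15 else "older"
    return "not covered"
```

`classify(platform.release())` would check the running kernel.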

If either the 3.18.15 or 4.0.5 kernel fixes the issue with your bcm5720,
please tell us so that we can try to isolate the fix and backport it to
the Debian kernel.

If that fails, you will have to engage the kernel community itself for a
fix. Please file a bug on bugzilla.kernel.org, and good luck. There are
several hardware hang reports open against BCM57xx + tg3.

Alternatively, try to get yourself an Intel NIC that works with the igb
driver (don't get an Intel NIC that needs the e1000e driver) to replace
the hardlock-prone bcm5720 + tg3 combination.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique de Moraes Holschuh <h...@debian.org>


--
Archive: https://lists.debian.org/1432651192.1585308...@webmail.messagingengine.com

Justin Catterall

May 27, 2015, 5:50:04 AM
[Sorry, Henrique, for replying directly to you]

> On 26 May 2015, at 15:39, Henrique de Moraes Holschuh wrote:
>
> On Tue, May 26, 2015, at 09:24, Justin Catterall wrote:
>> At irregular times, and apparently for no reason at all, networking
>> drops and cannot be restarted without reboot on a fresh install of
>> Jessie. The NIC is a Broadcom NetXtreme BCM5720.
>>
>> ifconfig thinks networking is still up because I can:
>> ifconfig eth0 down
>>
>> I find this when I try 'ifconfig eth0 up':
>> tg3_abort_hw timed out TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
>
> Hmm, it is either a kernel issue, or a hardware issue.
>
>> Any suggestions on where to look for a solution?
>
> Yes.
>
> First, disable all hardware offloading using ethtool. See if that
> helps.

Was able to disable all except:
rx-vlan-offload: on [fixed]
tx-vlan-offload: on [fixed]

Now, if I run "/etc/init.d/networking restart" the system doesn't report any error, but networking is still dead. However, I can rmmod tg3, ptp and libphy, then run "modprobe tg3" and "/etc/init.d/networking start", and everything works (I have done this a handful of times with no need to reboot to re-enable networking). So that's some progress.
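The recovery sequence just described, sketched in Python with a dry-run mode. The order follows the post: tg3 is unloaded before ptp and libphy because it depends on them, and `modprobe tg3` loads those dependencies again. It deliberately keeps going after a failed step, on the assumption that `rmmod ptp` or `rmmod libphy` can fail harmlessly if another driver still holds those modules. It must run as root.

```python
import subprocess

# Order from the post: tg3 depends on ptp and libphy, so it is unloaded first;
# "modprobe tg3" then loads the dependencies back automatically.
RECOVERY_STEPS = [
    ["rmmod", "tg3"],
    ["rmmod", "ptp"],
    ["rmmod", "libphy"],
    ["modprobe", "tg3"],
    ["/etc/init.d/networking", "start"],
]


def recover(dry_run=False, run=subprocess.run):
    """Run every step and return (command, exit status) pairs.

    Failures do not stop the sequence: ptp or libphy may legitimately still be
    in use by another driver (assumption). In dry-run mode nothing is executed
    and the status is None.
    """
    results = []
    for cmd in RECOVERY_STEPS:
        if dry_run:
            results.append((" ".join(cmd), None))
        else:
            results.append((" ".join(cmd), run(cmd).returncode))
    return results
```

`recover(dry_run=True)` lists the commands without touching the NIC, which is a cheap way to review the sequence before wiring it into anything automatic.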


> Also, if this NIC is in the system mainboard, make sure you are using
> the latest firmware ("BIOS update") from your motherboard vendor: it is
> usual to have the motherboard NICs use a data block in the shared system
> FLASH for vital product data and firmware. The motherboard vendor will
> bundle up updates for the NIC firmware with the BIOS updates when both
> are in the same FLASH chip.

I've read the documentation for the latest firmware and there is no mention of changes for the NIC, only a "power-on delay option" to allow a longer or shorter period to hit the key that enters the BIOS, and a change to boot device detection to better detect devices with invalid boot records. No other changes are mentioned in the firmware notes.

Here's a link to the page:
http://h20565.www2.hp.com/hpsc/swd/public/detail?sp4ts.oid=5390291&swItemId=MTX_a21cee44c55643598fb2f52bc2&swEnvOid=4144#tab4

I don't like tinkering with firmware if I can help it, and in this case they don't say there are changes to the NIC, so do you think I should still upgrade? The description says no bugs were fixed, only enhancements made.


> Make sure you have the latest linux firmware file for the tg3 driver as
> well. If the initramfs image has the tg3.ko module inside, it must also
> have the firmware file. A workaround for any initramfs-related tg3
> firmware loading issues is to "rmmod tg3 ; modprobe tg3" after the
> system booted (and before the NIC hardlocks).

See above: even after rmmod'ing, I can still force a network restart to fail silently, though it is recoverable if noticed.


> If all of the above failed, get yourself familiar with building a custom
> Debian-compatible kernel using pristine upstream kernels from
> www.kernel.org. Wait until 3.18.15 and 4.0.5 are released in
> www.kernel.org, and build custom kernels based on them. Alternatively,
> wait until a debian-packaged version of kernel 4.0.5 is available. DO
> NOT use 4.0 kernels before 4.0.5 on pain of possible data loss.

Data loss? On a "stable" kernel? WTF are they doing these days? I notice that stable/dev are no longer even/odd minor numbers - took me a bit of Googling to get caught up!


> If either the 3.18.15 or 4.0.5 kernel fixes the issue with your bcm5720,
> please tell us so that we can try to isolate the fix and backport it to
> the Debian kernel.

In the meantime I've made a bash script to rmmod and modprobe as appropriate. I'll set up a cron job that pings a couple of other servers on the LAN and, should the pings fail, executes the script and restarts networking.
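A sketch of that watchdog logic in Python. The peer addresses are placeholders from the 192.0.2.0/24 documentation range, the `ping -c 1 -W 2` flags assume Linux iputils ping, and `recover` is whatever callable performs the rmmod/modprobe/restart sequence. One design choice worth noting: the link is only declared dead when all peers fail, so a single server being down for maintenance does not trigger a driver reload.

```python
import subprocess

# Placeholder addresses (TEST-NET-1); substitute real servers on the LAN.
PEERS = ["192.0.2.10", "192.0.2.11"]


def ping_once(host, timeout_s=2):
    """True if one ICMP echo gets a reply (flags assume Linux iputils ping)."""
    proc = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode == 0


def network_looks_dead(peers, ping=ping_once):
    """Dead only if every peer fails; any() stops at the first reply."""
    if not peers:
        raise ValueError("need at least one peer to test against")
    return not any(ping(host) for host in peers)


def check_and_recover(peers, recover, ping=ping_once):
    """Call recover() when all peers are unreachable; report whether it ran."""
    if network_looks_dead(peers, ping):
        recover()
        return True
    return False
```

For cron, a short entry point (not shown) would call `check_and_recover(PEERS, recover=...)`, and an /etc/cron.d entry running it as root every few minutes would complete the setup.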


> If that fails, you will have to engage the kernel community itself for a
> fix. Please file a bug on bugzilla.kernel.org, and good luck. There are
> several hardware hang reports open against BCM57xx + tg3.

Damn crap hardware. I remember having issues with tg3 at least six or seven years ago. I can't believe it's still being incorporated into motherboards when there are obviously problems with the chipset. Depending on the speed of progress on the kernel front, I may just stick a PCI NIC in there - I think I still have some 3c509's around somewhere...


> Alternatively, try to get yourself an Intel NIC that works with the igb
> driver (don't get an Intel NIC that needs the e1000e driver) to replace
> the hardlock-prone bcm5720 + tg3 combination.

Thanks for the pointers. I at least have a situation now where I don't need a reboot to get networking functioning after it fails. It's far from perfect, but it's much, much better.

--
Justin C, by the sea.


--
Archive: https://lists.debian.org/3BCB9E79-8988-475E...@masonsmusic.co.uk

Toan Pham

May 27, 2015, 12:10:04 PM
Justin,


I've observed a similar symptom on the bcm5762 chip (not the 5720), and
I'm not sure whether the bugs are related. I've filed a bug report
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664), and am
actively working with Broadcom's engineering team to get this bug
resolved. They are running multiple test cases but could not get this
bug to surface in a short amount of time.

> In the mean time I've made a bash-script to rmmod and modprobe as appropriate. I'll set a cron job to ping a couple of other servers on the LAN and execute the script and restart networking should the pings fail.

This is a patch, not a fix. Have you tested on kernel 4.0?


> Alternatively, try to get yourself an Intel NIC that works with the igb
> driver (don't get an Intel NIC that needs the e1000e driver) to replace
> the hardlock-prone bcm5720 + tg3 combination.

I ended up with an Intel NIC instead, but one that uses the e1000e driver.
What's wrong with the e1000e driver, by the way? Please let me know. Thank
you.


--
Archive: https://lists.debian.org/CAGNmLEOfS1AackvgKtfHMES+...@mail.gmail.com

Justin Catterall

May 28, 2015, 3:50:06 AM

> On 27 May 2015, at 17:06, Toan Pham <tpha...@gmail.com> wrote:
>
> Justin,
>
>
> I've observed a similar symptom on the bcm5762 chip, not the 5720, and
> not sure if the bugs they are related. I've filed a bug report
> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1447664), and
> actively working with Broadcom's engineering team to get this bug
> resolved. They are running multiple test cases but could not get this
> bug to surface in a short amount of time.
>
>> In the mean time I've made a bash-script to rmmod and modprobe as appropriate. I'll set a cron job to ping a couple of other servers on the LAN and execute the script and restart networking should the pings fail.
>
> This is a patch, not a fix. Have you tested on kernel 4.0?

I've not tested with 4.0. This machine needs to be rock solid; it will be Debian stable all the way, and 4.x will only get onto this machine when stable is updated to that kernel.


>> Alternatively, try to get yourself an Intel NIC that works with the igb
>> driver (don't get an Intel NIC that needs the e1000e driver) to replace
>> the hardlock-prone bcm5720 + tg3 combination.
>
> I ended up with an intel NIC instead, but with the e1000e driver.
> What's wrong with the e1000e driver by the way, please update. Thank
> you

You'll have to hope Henrique is still following this, I don't have an answer.

Just thinking out loud here: I've got an identical server with FreeNAS installed that's never disappeared off the network. Also, I've had a server at home for about 5 years with the same on-board NIC (different mobo) that's never locked up either, but I was able to force it to lock up with "/etc/init.d/networking restart". It seems that, at home at least, I've been fortunate so far. WRT the FreeNAS box, I have no idea how it's driving the NIC; listing the modules there doesn't show anything I recognise, and I've minimal experience with *BSD.

--
Justin C, by the sea.

--
Archive: https://lists.debian.org/A1E820DD-50C4-4708...@masonsmusic.co.uk

Henrique de Moraes Holschuh

May 28, 2015, 7:10:04 AM
On Wed, May 27, 2015, at 13:06, Toan Pham wrote:
> > Alternatively, try to get yourself an Intel NIC that works with the igb
> > driver (don't get an Intel NIC that needs the e1000e driver) to replace
> > the hardlock-prone bcm5720 + tg3 combination.
>
> I ended up with an intel NIC instead, but with the e1000e driver.
> What's wrong with the e1000e driver by the way, please update.

It is not that the NICs that need e1000e are "bad news". It is that
they're outdated and less capable than the more recent Intel designs
that use the igb driver (or ixgbe, for 10GbE).

Also there is certainly something very wrong with any motherboard of
recent design that uses an outdated NIC.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique de Moraes Holschuh <h...@debian.org>


--
Archive: https://lists.debian.org/1432810960.10471....@webmail.messagingengine.com

Toan Pham

Oct 14, 2015, 12:10:07 PM
Justin,

Have you found a solution to the NIC issue yet?

FYI, I was working with Broadcom's test team but they pretty much dropped the ball on me... So it is all up to us to find a solution! Please share if you know how to get the NIC not to drop out intermittently.

thank you,

TP

Gene Heskett

Oct 14, 2015, 1:30:06 PM
On Wednesday 14 October 2015 12:04:27 Toan Pham wrote:

> Justin,
>
> Have you found a solution to the NIC issue yet?
>
> FYI, I was working with Broadcom's test team but they pretty much
> dropped the ball on me.... So it is all up to us to find a solution!
> Please share if you know how to get the NIC not to drop out
> intermitently.
>
> thank you,
>
> TP

Going by my lengthy but 100% failed experiences with BC, including hacking up
ndiswrapper to use the winderz drivers, you would be far better off
voting with your wallet and changing the brand name on the card. My
old HP lappy, which came with XP as the OEM OS, has a Broadcom 4318 radio in
it that would not stay connected for more than 15 seconds when running XP,
so I plugged in a Netgear USB radio, rebooted, and it just worked(tm).

But I've upgraded the lappy to an XFCE version of Mint 17 now, and I've
shut the wired relay router's radio off in the shop because a neighbor
had managed to get the passphrase & was using my bandwidth. Shut it off hard by
replacing the router with a small 4 port hub. :) They can't use what
isn't available.

I still don't think I have any neighbors that close AND that smart, but
it happened. They could not get to me, but they could get to the
internet thru it as it had access to the gateway router. That one is
running dd-wrt, and the only one who has come thru it, I had to give him
the username & pw. I used to watch the logs, but that was a boring waste
of time. Nothing unwanted gets thru dd-wrt.

YMMV of course, but some folks enjoy the challenge of trying to make
broken by design stuff work. I do too, but there is a point too far
that you can easily surpass with BC gear.

Cheers, Gene Heskett
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>