I managed to see the same crash when I was selectively trying to add newer
ucc_geth patches to the 2.6.31 kernel a couple of months ago, and the same
patch that caused a crash then seems suspect. If I revert the patch the
system runs completely stable. Amusingly, the excact error message the
patch claims to fix is in fact the error it causes to happen in my case.
So preferably 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4 could be reverted
unless someone can fix it. I can't even make sense of why it is supposed
to improve anything, it certainly seems like a very unsafe change
to make. Removing locking and disabling of interrupts before poking at
phy settings and such doesn't seem like a minor change and doesn't seem
that safe either.
Now I must add that I run with the xenomai/adeos-ipipe patches as well,
which do change interrupt handling a little, but so far this has worked
fine with the previous code and only the current code is broken for us.
I could try to build without the patch, although I would loose some major
functionality on the box doing so, and I would not be surprised if it
still fails since I believe I tried that already with 2.6.31+selected
git patches before without xenomai patched in and it still failed,
but I am only about 99% sure of that.
With the patch I get:
NETDEV WATCHDOG: eth2 (ucc_geth): transmit queue 0 timed out
------------[ cut here ]------------
Badness at c02729a8 [verbose debug info unavailable]
NIP: c02729a8 LR: c02729a8 CTR: c01b6088
REGS: c0451c40 TRAP: 0700 Not tainted (2.6.32-trunk-8360e)
MSR: 00029032 <EE,ME,CE,IR,DR> CR: 42042024 XER: 20000000
TASK = c041f3e8[0] 'swapper' THREAD: c0450000
GPR00: c02729a8 c0451cf0 c041f3e8 00000045 00002ae9 ffffffff c01b6afc c0422c48
GPR08: c042fde8 00000002 00000003 00010000 22042024 1001af90 1fffd000 00000000
GPR16: 00000000 c038c6d8 00000001 00200200 00000000 c0465eec c0465cec c0465aec
GPR24: c0450000 c04658ec c0423c2c df0e01c0 c0480000 df0e0000 c0423c2c 00000000
NIP [c02729a8] dev_watchdog+0x280/0x290
LR [c02729a8] dev_watchdog+0x280/0x290
Call Trace:
[c0451cf0] [c02729a8] dev_watchdog+0x280/0x290 (unreliable)
[c0451d50] [c00377c4] run_timer_softirq+0x164/0x224
[c0451da0] [c0032a38] __do_softirq+0xb8/0x13c
[c0451df0] [c00065cc] do_softirq+0xa0/0xac
[c0451e00] [c003280c] irq_exit+0x7c/0x9c
[c0451e10] [c00640c4] __ipipe_sync_stage+0x248/0x24c
[c0451e50] [c0064374] ipipe_suspend_domain+0xa0/0xf4
[c0451e70] [c00644a4] __ipipe_walk_pipeline+0xdc/0x120
[c0451e90] [c000af28] __ipipe_handle_irq+0x164/0x168
[c0451ec0] [c000b03c] __ipipe_grab_irq+0x3c/0xa4
[c0451ed0] [c0014814] __ipipe_ret_from_except+0x0/0xc
--- Exception: 501 at cpu_idle+0xe0/0xf0
LR = cpu_idle+0xe0/0xf0
[c0451f90] [c000970c] cpu_idle+0x68/0xf0 (unreliable)
[c0451fb0] [c0003f30] rest_init+0x5c/0x6c
[c0451fc0] [c03f07ac] start_kernel+0x27c/0x2e0
[c0451ff0] [00003438] 0x3438
Instruction dump:
7d2903a6 4bfffea8 38810008 7fa3eb78 38a00040 4bfe9c81 7fe6fb78 7fa4eb78
7c651b78 3c60c03c 38631774 480b7d2d <0fe00000> 38000001 901c2fb0 4bffff94
warning: `zebra' uses 32-bit capabilities (legacy support in use)
PHY: 0:03 - Link is Up - 1000/Full
PHY: 0:09 - Link is Up - 100/Full
PHY: 0:02 - Link is Up - 100/Full
PHY: 0:0f - Link is Up - 100/Full
PHY: 0:17 - Link is Up - 100/Full
When reverted I get a stable running system with no errors.
Which port happens to fail first varies, but one always does and then
the system almost always crashes soon after.
--
Len Sorensen
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Hm. Just curious, what CPU revision you use? So that I could try
to reproduce the issue... I have MPC8360E-MDS boards, rev 2.0 and
rev rev 2.1 CPUs, Marvell PHYs. I also have MPC8360E-RDK (rev 2.1).
And I didn't see any issues with this patch.
> I managed to see the same crash when I was selectively trying to add newer
> ucc_geth patches to the 2.6.31 kernel a couple of months ago, and the same
> patch that caused a crash then seems suspect. If I revert the patch the
> system runs completely stable. Amusingly, the excact error message the
> patch claims to fix is in fact the error it causes to happen in my case.
>
> So preferably 864fdf884e82bacbe8ca5e93bd43393a61d2e2b4 could be reverted
I don't see any point in reverting, because it will surely break my boards.
So, we need to fix this, not just hide.
> unless someone can fix it.
Well, I'm ready to help you with debugging.
> Now I must add that I run with the xenomai/adeos-ipipe patches as well,
> which do change interrupt handling a little,
It could be that it takes too long to stop the UCC, and xenomai
makes the timings worse, so the watchdog triggers.
Can you try the following patch?
diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index afaf088..2f73e3f 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -1563,6 +1563,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
static void ugeth_quiesce(struct ucc_geth_private *ugeth)
{
+ netif_device_detach(ugeth->ndev);
+
/* Wait for and prevent any further xmits. */
netif_tx_disable(ugeth->ndev);
@@ -1577,7 +1579,7 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
{
napi_enable(&ugeth->napi);
enable_irq(ugeth->ug_info->uf_info.irq);
- netif_tx_wake_all_queues(ugeth->ndev);
+ netif_device_attach(ugeth->ndev);
}
/* Called every time the controller might need to be made
Oh and I will be off on vacation for about a week by tomorrow, so if I
don't fix it today, I will be a bit slow at getting back to you
on this. Not that it is bothering any real users yet (they are using
2.6.30 for now).
--
Len Sorensen
Hmm, I have an MDS around, so I could probably try and see if it is
reproduceable there.
> I don't see any point in reverting, because it will surely break my boards.
> So, we need to fix this, not just hide.
I agree with that.
> > unless someone can fix it.
>
> Well, I'm ready to help you with debugging.
Excellent.
> > Now I must add that I run with the xenomai/adeos-ipipe patches as well,
> > which do change interrupt handling a little,
>
> It could be that it takes too long to stop the UCC, and xenomai
> makes the timings worse, so the watchdog triggers.
>
> Can you try the following patch?
Sure thing.
> diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
> index afaf088..2f73e3f 100644
> --- a/drivers/net/ucc_geth.c
> +++ b/drivers/net/ucc_geth.c
> @@ -1563,6 +1563,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
>
> static void ugeth_quiesce(struct ucc_geth_private *ugeth)
> {
> + netif_device_detach(ugeth->ndev);
> +
> /* Wait for and prevent any further xmits. */
> netif_tx_disable(ugeth->ndev);
>
> @@ -1577,7 +1579,7 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
> {
> napi_enable(&ugeth->napi);
> enable_irq(ugeth->ug_info->uf_info.irq);
> - netif_tx_wake_all_queues(ugeth->ndev);
> + netif_device_attach(ugeth->ndev);
> }
>
> /* Called every time the controller might need to be made
So there result is:
--
Len Sorensen
This I can reproduce. It seems it's a long standing bug that
becomes easily reproducible with quiesce/activate sequence.
The driver doesn't handle empty queue correctly, i.e. it ignores
the empty queue check if netdev queue is stopped, which makes no
sense.
Can you try this patch in addition to previous (i.e. both should
be applied)?
Thanks!
diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index 2f73e3f..b22de51 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -3275,7 +3275,7 @@ static int ucc_geth_tx(struct net_device *dev, u8 txQ)
/* Handle the transmitted buffer and release */
/* the BD to be used with the current frame */
- if ((bd == ugeth->txBd[txQ]) && (netif_queue_stopped(dev) == 0))
+ if (bd == ugeth->txBd[txQ]) /* queue empty? */
break;
dev->stats.tx_packets++;
That seems to be it. It works now. No more crashes.
Those two patches together seem to do the trick. I really hope they
can go into 2.6.32-stable then, since this is a regression over 2.6.31
and is hopefully an obvious fix.
Now if only my mdio-gpio bitbang one line fix would be accepted.
--
Len Sorensen
Great!
> I really hope they
> can go into 2.6.32-stable then, since this is a regression over 2.6.31
> and is hopefully an obvious fix.
Yep, will try to get them there.
Thanks a lot for helping to track this down,
--
Anton Vorontsov
email: cbouat...@gmail.com
irc://irc.freenode.net/bd2
Thanks for fixing it so fast. I only knew that patch broke it, not why
it broke it.
--
Len Sorensen