Networking issues (ssh timeouts) on cubietruck

Ian Campbell

Mar 20, 2015, 7:47:12 AM
to linux...@googlegroups.com
Hi all,

At work I am in the process of deploying an array of 4 cubietrucks for
use in the Xen Project automated test framework.

2 of the 4 boards seem to work just fine in (repeated) pre-commissioning
tests but two are failing fairly reliably.

One with:
Timeout, server not responding.
From ssh and the other with networking issues (DNS timeouts etc) during
initial installation (Debian installer).

These boards are now in a colo, but previously when they were on my desk
the same two boards both failed with the ssh Timeout error and the other
two were ok, so the problem boards do seem to persist over changes of
infrastructure etc.

ssh is used to login to the boxes and drive the test case from a
controller machine. In this case the failing test is a build job which
is building Xen or a Linux kernel etc. All build jobs run natively under
Debian (running things under Xen would follow, but it never gets past
the build jobs due to this issue).

The kernel in use is 3.16.7-ckt2 (Debian revision 1~bpo70+1).

The boards are all using u-boot v2014.10.

I can't see anything in the logs and the ifconfig stats show no errors.
Apart from the hiccough networking seems fine (i.e. subsequent ssh
commands do work).

Ssh is using "-o BatchMode=yes -o ConnectTimeout=100 -o
ServerAliveInterval=100" options and make is invoked with -j4 which
doesn't seem too aggressive.
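
For reference, a minimal sketch of how a controller might assemble that ssh invocation, using only the options quoted above. The host name, the remote command and the helper name build_ssh_command are hypothetical; the real test framework's code is not shown in this thread.

```python
import shlex

# Options exactly as quoted in the message above.
SSH_OPTIONS = {
    "BatchMode": "yes",
    "ConnectTimeout": "100",
    "ServerAliveInterval": "100",
}


def build_ssh_command(host, remote_command):
    """Return the argv list for an ssh call using the options above."""
    argv = ["ssh"]
    for key, value in SSH_OPTIONS.items():
        argv += ["-o", f"{key}={value}"]
    argv += [host, remote_command]
    return argv


if __name__ == "__main__":
    # "cubietruck-1" and the make command are placeholders.
    print(shlex.join(build_ssh_command("cubietruck-1", "make -j4")))
```

If ServerAliveCountMax is left at the OpenSSH default of 3 (an assumption, it is not stated above), these settings mean the client should only give up after roughly 300 seconds without any reply from the server.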

In the case of the failing
installation /sys/class/net/eth0/statistics/*{dropped,errors} are all 0,
nothing in dmesg or the logs. TBH this one might be a cabling or
infrastructure issue, but I'm reasonably confident that the ssh one is
not.
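
A small sketch of the same check as the shell glob above, in case it is useful to log it from a script. It reads every counter whose name ends in "dropped" or "errors" and reports the non-zero ones. The function name and the default path argument are illustrative only.

```python
from pathlib import Path


def nonzero_error_counters(stats_dir="/sys/class/net/eth0/statistics"):
    """Return {counter_name: value} for non-zero *dropped / *errors counters."""
    result = {}
    for entry in sorted(Path(stats_dir).iterdir()):
        if entry.name.endswith(("dropped", "errors")):
            value = int(entry.read_text().strip())
            if value != 0:
                result[entry.name] = value
    return result


if __name__ == "__main__":
    default_dir = Path("/sys/class/net/eth0/statistics")
    if default_dir.is_dir():
        # An empty dict means every dropped/error counter is 0.
        print(nonzero_error_counters(default_dir))
```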

Since two of the boards are OK and two are not I suppose something
somewhere must be marginal.

I'm not really sure where to start looking. Perhaps CONFIG_GMAC_TX_DELAY
on the u-boot side might be relevant? I've had a look through the logs
from v3.16 to master for drivers/net/ethernet/stmicro/stmmac and nothing
leaps out as being a relevant backport.

Or perhaps it isn't networking related at all, e.g. perhaps AHCI is
stalling and stopping sshd from responding to the ssh keepalive pings.
There's nothing in the logs to indicate one way or another.

Any bright ideas on where to look / what to try would be gratefully
received.

Ian.


Bastiaan van den Berg

Mar 20, 2015, 8:00:50 AM
to linux-sunxi
Run the build jobs inside screen?

--
buZz

Hans de Goede

Mar 20, 2015, 8:14:33 AM
to linux...@googlegroups.com
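Hi,

On 20-03-15 12:47, Ian Campbell wrote:
> I'm not really sure where to start looking. Perhaps CONFIG_GMAC_TX_DELAY
> on the u-boot side might be relevant?

Yes, that is the first thing I was thinking of: the cubietruck is the only
A20 design with a gbit phy which does not need it, which is kinda fishy.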

Regards,

Hans

Ian Campbell

Mar 20, 2015, 9:42:11 AM
to linux...@googlegroups.com
Ah, I hadn't cottoned on to that little fact-let, it's certainly worth a
try then!

Time for the first remote test of the firmware update process since the
boards went to the colo, it worked here several times!

(the timing of the shipment wasn't in my hands, or I'd have kept them
local until this was resolved, the one I kept back for testing turns out
not to exhibit the issue, of course)

Ian.


Ian Campbell

Mar 20, 2015, 10:40:03 AM
to linux...@googlegroups.com
On Fri, 2015-03-20 at 13:42 +0000, Ian Campbell wrote:
> On Fri, 2015-03-20 at 13:14 +0100, Hans de Goede wrote:
> > On 20-03-15 12:47, Ian Campbell wrote:
>
> > > I'm not really sure where to start looking. Perhaps CONFIG_GMAC_TX_DELAY
> > > on the u-boot side might be relevant?
> >
> > Yes that is the first thing I was thinking of, the cubietruck is the only
> > gbit phy using a20 design which does not need it, which is kinda fishy.
>
> Ah, I hadn't cottoned on to that little fact-let, it's certainly worth a
> try then!

Setting CONFIG_GMAC_TX_DELAY=3 (or actually, the moral equivalent on
v2014.10, i.e. removing the ifdef BANANPI) resulted in non-functional
tftp from u-boot (luckily I tried it on a local board first!).

Using 0, 1, or 2 seems to work from u-boot at least. I've not tried
anything more aggressive, my local board didn't have issues to start
with though.

I'm going to triple check my remote recovery procedures and then do some
experiments on one of the problematic boards.

Cheers,
Ian.

Hans de Goede

Mar 20, 2015, 11:28:03 AM
to linux...@googlegroups.com, Tom Cubie
Hi,

On 20-03-15 15:39, Ian Campbell wrote:
> On Fri, 2015-03-20 at 13:42 +0000, Ian Campbell wrote:
>> On Fri, 2015-03-20 at 13:14 +0100, Hans de Goede wrote:
>>> On 20-03-15 12:47, Ian Campbell wrote:
>>
>>>> I'm not really sure where to start looking. Perhaps CONFIG_GMAC_TX_DELAY
>>>> on the u-boot side might be relevant?
>>>
>>> Yes that is the first thing I was thinking of, the cubietruck is the only
>>> gbit phy using a20 design which does not need it, which is kinda fishy.
>>
>> Ah, I hadn't cottoned on to that little fact-let, it's certainly worth a
>> try then!
>
> Setting CONFIG_GMAC_TX_DELAY=3 (or actually, the moral equivalent on
> v2014.10, i.e removing the ifdef BANANPI) resulted in non-functional
> tftp from u-boot (luckily I tried it on a local board first!).

Interesting ...

> Using 0, 1, or 2 seems to work from u-boot at least. I've not tried
> anything more aggressive, my local board didn't have issues to start
> with though.
>
> I'm going to triple check my remote recovery procedures and then do some
> experiments on one of the problematic boards.

I wonder if it could be that there are 2 production runs of the cubietruck,
with potentially different phy revisions.

Tom Cubie (added to the Cc), can you tell us if it is possible that there
are different ethernet phy revisions on some cubietrucks? Has there been
a second production run?

Regards,

Hans

Ian Campbell

Mar 20, 2015, 1:03:48 PM
to linux...@googlegroups.com, Tom Cubie
On Fri, 2015-03-20 at 16:27 +0100, Hans de Goede wrote:
> Hi,
>
> On 20-03-15 15:39, Ian Campbell wrote:
> > On Fri, 2015-03-20 at 13:42 +0000, Ian Campbell wrote:
> >> On Fri, 2015-03-20 at 13:14 +0100, Hans de Goede wrote:
> >>> On 20-03-15 12:47, Ian Campbell wrote:
> >>
> >>>> I'm not really sure where to start looking. Perhaps CONFIG_GMAC_TX_DELAY
> >>>> on the u-boot side might be relevant?
> >>>
> >>> Yes that is the first thing I was thinking of, the cubietruck is the only
> >>> gbit phy using a20 design which does not need it, which is kinda fishy.
> >>
> >> Ah, I hadn't cottoned on to that little fact-let, it's certainly worth a
> >> try then!
> >
> > Setting CONFIG_GMAC_TX_DELAY=3 (or actually, the moral equivalent on
> > v2014.10, i.e removing the ifdef BANANPI) resulted in non-functional
> > tftp from u-boot (luckily I tried it on a local board first!).
>
> Interesting ...
>
> > Using 0, 1, or 2 seems to work from u-boot at least. I've not tried
> > anything more aggressive, my local board didn't have issues to start
> > with though.
> >
> > I'm going to triple check my remote recovery procedures and then do some
> > experiments on one of the problematic boards.
>
> I wonder if it could be that there are 2 production runs of the cubietruck,
> with potentially different phy revisions.

FWIW the 4 which are remote were purchased in the same batch, which may
not mean much but makes it a bit less likely that they came from
different production runs.

If, say, tx delay == 1 was needed, then is it possible that the default
setting of 0 might work on some systems perhaps depending on other
factors (manufacturing imperfections, cable differences)?

> Tom Cubie (added to the Cc), can you tell us if it is possible that there
> are different ethernet phy revisions on some cubeitrucks? Has there been
> a second production run ?

I think Tom moved onto other things quite a long time ago.

Ian.

Ian Campbell

Mar 20, 2015, 1:23:27 PM
to linux...@googlegroups.com
On Fri, 2015-03-20 at 14:39 +0000, Ian Campbell wrote:
> I'm going to triple check my remote recovery procedures and then do
> some experiments on one of the problematic boards.

Actually, I remember I could use mw.l on the u-boot prompt to fiddle
with the register values, so no need for the recovery procedure (but,
still good to have).

I did some quick experiments with tx delay set to 0, 1, 2 and 3 on one
of the problematic boards:

0: mw.l 0x1c20164 0x006 1 -- t/o in 4/5 tftp runs
1: mw.l 0x1c20164 0x406 1 -- t/o in 1/5 tftp runs
2: mw.l 0x1c20164 0x806 1 -- t/o in 1/5 tftp runs
3: mw.l 0x1c20164 0xc06 1 -- t/o many times in first tftp run
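
The four values in the list differ only from bit 10 upward, which is where the tx delay appears to live. A small sketch that reproduces them, assuming a base value of 0x006 and a shift of 10 (both inferred from the list itself, not checked against a datasheet). The 3-bit field width and the helper names are assumptions too.

```python
# Base value and shift are inferred from the four mw.l values listed above.
GMAC_CLK_BASE = 0x006
TX_DELAY_SHIFT = 10
# Assumption: the delay field is 3 bits wide.
TX_DELAY_MAX = 0x7


def gmac_clk_value(tx_delay):
    """Return the register value for a given tx delay setting."""
    if not 0 <= tx_delay <= TX_DELAY_MAX:
        raise ValueError(f"tx delay out of range: {tx_delay}")
    return GMAC_CLK_BASE | (tx_delay << TX_DELAY_SHIFT)


def mw_command(tx_delay, address=0x1C20164):
    """Format the u-boot mw.l command in the same style as the list above."""
    return f"mw.l {address:#x} {gmac_clk_value(tx_delay):#05x} 1"


if __name__ == "__main__":
    for delay in range(4):
        print(f"{delay}: {mw_command(delay)}")
```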

For 0, 1 and 2 "t/o" means one or two "T " glitches in the download, but
it did complete. For 3 those were basically continuous and it couldn't
complete.
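
If the serial console output is captured, the "T " glitches can be tallied automatically rather than by eye. A rough sketch, assuming each run's progress text has been saved as a string and that a timeout shows up as "T " in it, as described above. The function names are hypothetical and the sample strings are invented, not real captures from the boards.

```python
def count_timeouts(progress):
    """Count "T " markers in one run's captured progress text."""
    return progress.count("T ")


def runs_with_timeouts(runs):
    """Return (affected, total) for a list of per-run progress strings."""
    affected = sum(1 for progress in runs if count_timeouts(progress) > 0)
    return affected, len(runs)


if __name__ == "__main__":
    # Invented sample strings, for illustration only.
    sample = ["####T ####", "########", "##T ##T ##"]
    print(runs_with_timeouts(sample))
```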

tftp was of a 16M initrd.

Based on that I'm now running a full test with tx delay == 1.

Ian.

Ian Campbell

Mar 23, 2015, 5:32:22 AM
to linux...@googlegroups.com
On Fri, 2015-03-20 at 17:23 +0000, Ian Campbell wrote:
> Based on that I'm now running a full test with tx delay == 1.

First results are positive: running with tx delay == 1 on the two
problematic boards I've managed a few successful flight runs over the
weekend.

Today I'll deploy to all of the boards including the non-problematic
ones and run a few more flights and assuming that's successful I'll send
out a patch.

Ian.

Iain Paton

Mar 26, 2015, 5:04:00 AM
to linux...@googlegroups.com, Hans de Goede
On 20/03/15 12:14, Hans de Goede wrote:

> Yes that is the first thing I was thinking of, the cubietruck is the only
> gbit phy using a20 design which does not need it, which is kinda fishy.

Actually, the OLinuXino-LIME2 has a gbit phy, doesn't define
CONFIG_GMAC_TX_DELAY and doesn't initially appear to need it.

I have quite a few of them deployed for several months with no apparent
issues similar to this. I also don't appear to be seeing any reports of
problems here or elsewhere.


The one thing I have encountered is that very occasionally on a reboot
I'll get unidirectional network. Either transmit or receive is totally
dead while the other direction functions ok. Haven't been able to find
a cause yet as it's not easily repeatable, and with no other reports
it's not something I've spent a lot of time worrying about.
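
One way to catch the unidirectional case when it does happen might be to compare the packet counters before and after a period in which traffic is expected in both directions (e.g. while a ping is running). A sketch only: the function names are made up, and a counter that fails to advance is just a hint, not proof that the direction is dead.

```python
from pathlib import Path

COUNTERS = ("rx_packets", "tx_packets")


def read_packet_counters(stats_dir="/sys/class/net/eth0/statistics"):
    """Read the rx/tx packet counters from sysfs into a dict."""
    base = Path(stats_dir)
    return {name: int((base / name).read_text()) for name in COUNTERS}


def stalled_directions(before, after):
    """Return the counters that did not advance between two snapshots."""
    return [name for name in COUNTERS if after[name] <= before[name]]
```

Take one snapshot with read_packet_counters(), wait while traffic is flowing, take a second one and pass both to stalled_directions(). An empty list means both counters moved.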

Rgds,
Iain