No more rebooting tegras: Yay stability and lessons learned

No more rebooting tegras: Yay stability and lessons learned William Lachance 8/7/12 12:09 PM
[ BCCing a few people who might be interested in this -- please send
followups to mozilla.tools ]

So at the end of last week Callek (Justin Wood) applied a fix to our
Android automation which drastically reduced our failure rate (I think
Joel Maher was quoting a number of 6%, down from 14%) and generally just
had a huge impact on the Android resource utilization on try and
inbound. Over the last few weeks, the time needed for Android tests to
run had grown to > 12 hours. Now we're back down to just a few hours.

The cause of this problem was an ancient application running on our
Android devices (tegra boards aka "tegras") called the watcher. The
watcher has a few functions (keeping the lock screen on, performing
agent upgrades) but it was also set up to ping www.mozilla.org at
5-minute intervals, and to reboot the device if it didn't get a
satisfactory response three times in a row. It turns out that something
happened recently which made mozilla.org occasionally inaccessible from
the network the tegras were running on, which was enough to cause a good
chunk of them to reboot while they were still running tests.
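
For illustration, the reboot behaviour described above can be sketched roughly as follows. This is not the watcher's actual code (the watcher is an Android application, and I have not reproduced its source here); the class and function names are hypothetical. Only the three-failures-in-a-row threshold comes from the description above.

```python
from typing import Iterable, Optional


class PingWatchdog:
    """Hypothetical model of the watcher's ping check (not the real code)."""

    def __init__(self, max_failures: int = 3) -> None:
        # Number of consecutive failed pings that triggers a reboot.
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def record(self, ping_ok: bool) -> bool:
        """Record one ping result; return True if a reboot would be triggered."""
        if ping_ok:
            # Any successful ping resets the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_failures


def first_reboot_index(results: Iterable[bool], max_failures: int = 3) -> Optional[int]:
    """Return the index of the ping that would trigger a reboot, or None."""
    watchdog = PingWatchdog(max_failures)
    for index, ping_ok in enumerate(results):
        if watchdog.record(ping_ok):
            return index
    return None
```

Note that nothing in this logic knows whether a test is running. With pings five minutes apart, three failed pings spanning roughly ten minutes are enough to reboot a device in the middle of a test run.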

It turns out that we actually have better ways of identifying which
tegras are having network connectivity problems and resetting them, so
the watcher's ping check is apparently redundant. Once I identified it as a
potential problem, fixing it was really rather simple: just add a .ini
file to every Android device telling it not to try pinging *any* server.
Voila, problem solved.

While I'm very happy that we fixed this issue, I think it's also worth
thinking about what we can do to make it less likely that this class of
problems will come back to haunt us in the future. To be very clear, I'm
not trying to single anyone out for blame here: these are just
issues/suggestions of process refinement.

1. We should really be careful about what kinds of defaults we set. In
this case, we were depending on a server on the general internet
(mozilla.org) being available. This is known to be a bad idea for
testing environments and was probably intended to be set to something
else, but it never was. My proposal is that in the future, settings
which could cause things like reboots should default to being OFF (in
the case of the watcher, a patch was recently applied to do exactly
this: https://bugzilla.mozilla.org/show_bug.cgi?id=779871).

2. More importantly, we should really try harder to document all the
behaviours and moving pieces of our automation, end-to-end. We've been
making a great effort recently to do this with the new SUTAgent as we
rework it to be suitable for B2G, but that's just one piece of the
puzzle. There's the architecture of the machines (foopies) we use, the
buildbot scripts we use to clean up/restart/configure the tegras, the
device manager abstraction we use to communicate with the agent, not
to mention the test harnesses themselves.
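
To make the first suggestion above (reboot-like behaviours defaulting to OFF) a bit more concrete, here is a minimal sketch of the idea. It assumes an .ini-style config; the section name `watcher` and the option name `reboot_on_ping_failure` are made up for illustration and are not the watcher's real settings.

```python
import configparser


def reboot_on_ping_failure(ini_text: str) -> bool:
    """Return True only if the (hypothetical) option is explicitly enabled."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # fallback=False means a missing section or option leaves the
    # potentially disruptive behaviour switched off.
    return parser.getboolean("watcher", "reboot_on_ping_failure", fallback=False)
```

The point of the design is that an unconfigured or half-configured device does nothing dangerous; someone has to opt in to rebooting on purpose.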

The worst part of the present situation is that not only is the
behaviour of our automation undocumented, but the knowledge about
how these different pieces work is also split across three different teams
(release engineering, ateam, and mobile platform). Can anyone really
claim even high-level knowledge about how this whole stack of stuff
works end-to-end? Maybe one or two people can. I sure can't.

Yes, documentation is boring to write and hard to keep up to date. But
it will vastly help training people and attracting new contributors
(increasingly a concern as the scope of our mobile automation grows to
encompass B2G as well), not to mention identifying issues like this. For
example, it would have been obvious that we had redundant code to reboot
tegras due to lack of network connectivity (and that one of them was
broken) if the functions of the various components involved were laid
out and described in one place.

To be clear, I'm not proposing that we write a one-hundred page
document. A few block diagrams and paragraph-long descriptions of all
the different pieces that go into our mobile automation (along with
links to their locations in source control) would go a long way.

It's possible that such documentation exists for some components and I'm
just not aware of it. If so, please post links. :) I'm not sure about
the best place or way to coordinate writing any remaining documentation
across teams. If people are amenable, maybe we can discuss this during
tomorrow's mobile testing meeting (https://wiki.mozilla.org/Mobile/Testing).

Will

P.S. In a separate email conversation, we were discussing how to get the
Android failure rate down to an even lower / more acceptable level. This
is also important, but it's a bit of a different issue. I'll spin off
another thread about it.
Re: No more rebooting tegras: Yay stability and lessons learned jmaher 8/7/12 12:18 PM
Our failure rate went from 39% to 13% with just this one fix. I believe we have had this problem for a long time; it is just that our tegra devices never stayed up >15 minutes. Now we have turned on more tests and this is causing a greater volume of longer test runs.
Re: No more rebooting tegras: Yay stability and lessons learned jmaher 8/7/12 12:18 PM
On Tuesday, August 7, 2012 3:09:59 PM UTC-4, William Lachance wrote: