Believe it or not, the increase in purple and blue tegra jobs is not
caused by my changes specifically; rather, we now talk to the device in
two more places, which increases the chance of hitting it before it is
ready to respond.
In other words, it is more noticeable because I was looking for it; the
tegra world is generally this flaky.
Now, after my change, the first step in the unit test jobs that talks
to the device is updateSUT.py rather than cleanup.py, which is why the
jobs now barf on updateSUT.py instead of cleanup.py.
This has helped me understand that we need to:
1) add a first step that verifies the board will actually be able to
take jobs (see the sketch after this list)
2) improve the reboot.py step to ensure we can either recover the board
or take it out of rotation
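
For 1), here is a minimal sketch of what such a verification step could
look like. Everything in it is an assumption on my part: the script
name, the retry policy, and the SUT agent answering on TCP port 20701.

  #!/usr/bin/env python
  # verifyDevice.py (hypothetical) - check that a tegra's SUT agent is
  # reachable before we hand the board any jobs.
  import socket
  import sys
  import time

  SUT_PORT = 20701      # assumed SUT agent command port
  MAX_ATTEMPTS = 5      # made-up retry policy
  RETRY_DELAY = 30      # seconds between attempts

  def device_is_ready(host, port=SUT_PORT, timeout=10):
      """Return True if we can open a TCP connection to the SUT agent."""
      try:
          socket.create_connection((host, port), timeout).close()
          return True
      except OSError:
          return False

  def main(host):
      for attempt in range(1, MAX_ATTEMPTS + 1):
          if device_is_ready(host):
              print("%s is ready to take jobs" % host)
              return 0
          print("attempt %d/%d: %s not ready, sleeping %ss"
                % (attempt, MAX_ATTEMPTS, host, RETRY_DELAY))
          time.sleep(RETRY_DELAY)
      # Out of retries: this is where 2) kicks in - reboot the board
      # or take it out of rotation.
      return 1

  if __name__ == '__main__':
      sys.exit(main(sys.argv[1]))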
I am also now more suspicious that peak hours increase the number of
exceptions. You can see more tracebacks like these:
* remoteFailed: [Failure instance: Traceback (failure with no frames):
<class 'twisted.spread.pb.PBConnectionLost'>: [Failure instance:
Traceback (failure with no frames): <class
'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
* remoteFailed: [Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.ConnectionLost'>: Connection to the other
side was lost in a non-clean fashion.
My theory is that the extra jobs put more load on the foopies (10 or so
tegra boards per foopy) and on the two tegra masters.
This is only a theory and I will need more data to prove it.
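
One way to gather that data would be to tag these disconnects
separately from real job failures so we can count them per hour. A
minimal sketch, assuming a twisted errback hung on the job's Deferred
(classify_failure and the logging are my invention, not existing code):

  # Hypothetical errback: separate master/foopy disconnects from real
  # job failures so their frequency can be correlated with load.
  from twisted.internet import error
  from twisted.python import log
  from twisted.spread import pb

  DISCONNECT_ERRORS = (pb.PBConnectionLost,
                       error.ConnectionLost,
                       error.ConnectionDone)

  def classify_failure(failure):
      if failure.check(*DISCONNECT_ERRORS):
          # Count/log it as a retryable disconnect, not a job failure.
          log.msg("retryable disconnect: %s" % failure.getErrorMessage())
          return None   # swallow it; the caller can requeue the job
      return failure    # anything else propagates unchanged

With something like this in place we could graph disconnects against
the number of running jobs and see whether they really spike at peak
hours.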
In the meantime I will look into adding a third master regardless of
whether I am right, since we are supposed to have a third master anyway.
cheers,
Armen