Believe it or not, the increase in purple and blue tegra jobs is not
caused by my changes specifically; rather, we now talk to the device in
two more places, which increases the chance of hitting it before it is
ready to respond.
In other words, it is more noticeable because I was looking for it; the
tegra world is generally this flaky.
Now, after my change, the first step in the unit test jobs that talks
to the device is updateSUT.py rather than cleanup.py, which is why the
jobs now barf on updateSUT.py instead of cleanup.py.
This has helped me understand that we need to:
1) add a first step that verifies the board will actually be able to
take jobs (see the sketch after this list)
2) improve the reboot.py step to ensure we can either recover the board
or take it out of rotation
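
For 1), here is a minimal sketch of what such a verification step could
look like. Everything in it is an assumption on my part: the script
name, the retry policy, and the SUT agent answering on TCP port 20701.

  #!/usr/bin/env python
  # verifyDevice.py (hypothetical) - check that a tegra's SUT agent is
  # reachable before we hand the board any jobs.
  import socket
  import sys
  import time

  SUT_PORT = 20701      # assumed SUT agent command port
  MAX_ATTEMPTS = 5      # made-up retry policy
  RETRY_DELAY = 30      # seconds between attempts

  def device_is_ready(host, port=SUT_PORT, timeout=10):
      """Return True if we can open a TCP connection to the SUT agent."""
      try:
          socket.create_connection((host, port), timeout).close()
          return True
      except OSError:
          return False

  def main(host):
      for attempt in range(1, MAX_ATTEMPTS + 1):
          if device_is_ready(host):
              print("%s is ready to take jobs" % host)
              return 0
          print("attempt %d/%d: %s not ready, sleeping %ss"
                % (attempt, MAX_ATTEMPTS, host, RETRY_DELAY))
          time.sleep(RETRY_DELAY)
      # Out of retries: this is where 2) kicks in - reboot the board
      # or take it out of rotation.
      return 1

  if __name__ == '__main__':
      sys.exit(main(sys.argv[1]))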
I am also now more suspicious that peak hours increase the number of
exceptions. You can see more tracebacks like these:
* remoteFailed: [Failure instance: Traceback (failure with no frames):
<class 'twisted.spread.pb.PBConnectionLost'>: [Failure instance:
Traceback (failure with no frames): <class
'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
* remoteFailed: [Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.ConnectionLost'>: Connection to the other
side was lost in a non-clean fashion.
My theory is that the extra jobs put more load on the foopies (10 or so
tegra boards per foopy) and on the two tegra masters.
This is only a theory and I will need more data to prove it.
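
One way to gather that data would be to tag these disconnects
separately from real job failures so we can count them per hour. A
minimal sketch, assuming a twisted errback hung on the job's Deferred
(classify_failure and the logging are my invention, not existing code):

  # Hypothetical errback: separate master/foopy disconnects from real
  # job failures so their frequency can be correlated with load.
  from twisted.internet import error
  from twisted.python import log
  from twisted.spread import pb

  DISCONNECT_ERRORS = (pb.PBConnectionLost,
                       error.ConnectionLost,
                       error.ConnectionDone)

  def classify_failure(failure):
      if failure.check(*DISCONNECT_ERRORS):
          # Count/log it as a retryable disconnect, not a job failure.
          log.msg("retryable disconnect: %s" % failure.getErrorMessage())
          return None   # swallow it; the caller can requeue the job
      return failure    # anything else propagates unchanged

With something like this in place we could graph disconnects against
the number of running jobs and see whether they really spike at peak
hours.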
In the meantime I will look into adding a third master regardless of
whether I am right, since we are supposed to have a third master anyway.
cheers,
Armen