
The War on Orange Needs YOU


Clint Talbert

Mar 1, 2013, 8:01:46 PM
The rate of intermittent failures in our automation (the orange factor;
the effort to reduce it is called The War on Orange) has skyrocketed.
Starting around February 17 on
mozilla-inbound, we took off on an exponential curve that is making it
extremely hard to sheriff the tree, land patches, and otherwise get work
done at Mozilla [1]. If anyone knows what started happening on the 17th
that caused this sudden change, please let us know; however, it looks
like it was several changes over several days. Therefore, it might not
be one single push.

We track this through a metric called the Orange Factor, which is simply
the average number of intermittent failures encountered on each push.
This means that right now, when you push, you are getting 8 failures on
*average*. On February 17, you were averaging 2. Something has gone
horribly, horribly wrong.
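
To make the metric concrete, here is a minimal sketch of that average in
Python. The function name and the sample counts are hypothetical and only
for illustration; this is not the actual OrangeFactor code.

```python
def orange_factor(failures_per_push):
    """Average number of intermittent failures per push.

    failures_per_push: one integer per push (hypothetical input format).
    """
    if not failures_per_push:
        raise ValueError("need at least one push")
    return sum(failures_per_push) / len(failures_per_push)


# Made-up failure counts for five pushes, for illustration only.
sample = [10, 6, 9, 8, 7]
print(orange_factor(sample))  # 8.0
```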

If we solve the current top 10 intermittent issues [2], we will be back
down to 4.47, which, while almost double where we were on February 17, is
far, far better than where we are now (8.32).
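
As a rough check on those numbers, the short Python sketch below works out
how much of the current failure rate the top 10 bugs account for. It uses
only the two figures quoted above (8.32 now, 4.47 after the fixes); the
variable names are illustrative.

```python
current = 8.32       # Orange Factor now
after_top_10 = 4.47  # projected Orange Factor with the top 10 fixed

removed = current - after_top_10  # failures per push that would go away
share = removed / current         # fraction of today's failures

print(f"failures removed per push: {removed:.2f}")  # 3.85
print(f"share of current failures: {share:.1%}")    # 46.3%
```

In other words, ten bugs account for a little under half of the failures
seen on an average push.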

I'm begging for volunteers to step forward and do everyone a favor and
dig into one of these bugs below and for some set of brave souls to look
critically at what landed during our exponential uptick for possible
culprits.

* Bug 761987 - http://bit.ly/WmorJg - The worst offender. If anyone can
help out, please do.
* Bug 833769 - http://bit.ly/ZK3sNQ - Memory leak that has recently
spiked; Andrew McCreight is on the case
* Bug 711725 - http://bit.ly/14b1BlO - Jmaher and dividehex are digging
into this. It started because tegras would reboot intermittently; we
fixed those, and now the pandas are rebooting. We suspect pandas are
overheating.
* Bug 835658 - http://bit.ly/13unJvC - Needs an owner
* Bug 824069 - http://bit.ly/13uSlNU - Needs an owner
* Bug 807230 - http://bit.ly/YEY5wk - Jmaher looking into this next
* Bug 764369 - http://bit.ly/Z3vWzY - Needs an owner
* Bug 754860 - http://bit.ly/WmoVz7 - Needs an owner
* Bug 818103 - http://bit.ly/Z3w6am - Needs an owner
* Bug 663657 - http://bit.ly/VjLlzF - Needs an owner, probably someone
from my team or releng

And when you're weighing whether or not you want to jump in, remember we
do have a goal to clean up the technical debt we've left ourselves in
the rush to ship two 1.0 products, and this work falls (in my mind)
squarely in line with that goal. Please help out where you can.

Many thanks,

Clint


[1] http://bit.ly/XIogWW
[2] http://bit.ly/14aZgra

Benjamin Smedberg

Mar 4, 2013, 10:49:31 AM
to Clint Talbert, dev-pl...@lists.mozilla.org
On 3/1/2013 8:01 PM, Clint Talbert wrote:
> I'm begging for volunteers to step forward and do everyone a favor and
> dig into one of these bugs below and for some set of brave souls to
> look critically at what landed during our exponential uptick for
> possible culprits.
It would really help if you included bug summaries in these lists;
otherwise you have to load them each individually to figure out whether
they are relevant to you at all.
> * Bug 824069 - http://bit.ly/13uSlNU - Needs an owner
No it doesn't... it's just a really hard problem (primarily seems to be
a bug in the Java plugin for mac).

--BDS

Ed Morley

Mar 7, 2013, 9:49:36 AM
to dev.platform
Just to follow up on this - yesterday the metrics Elastic Search cluster
experienced some data loss, with OrangeFactor being amongst the services
affected.

It's not yet clear if there is a backup available [1][2], so for now the
stats shown on http://brasstacks.mozilla.com/orangefactor/ are missing
the last 3 months' worth of submissions, causing the OrangeFactor (see
Clint's email below) for the last 7 days to display as 2-3, rather than
the 10 it was at previously.

Whilst the sudden reduction in failure rate is primarily an artefact of
the data loss, in the last week we have also seen an extremely positive
response from devs in many intermittent-failure bugs -- RyanVM and
myself would like to thank everyone who has chipped in so far! :-)

Kind regards,


Ed

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=848092#c6
[2] http://www.youtube.com/watch?v=plWnm7UpsXk&t=7s