
New policy: 48-hour backouts for major Talos regressions


Vladan Djeric

Aug 14, 2015, 5:26:47 PM
to dev-platform, firefox-dev-owner
The perf team and the A-Team would like to test out a new policy: we want
to back out patches that cause significant Talos regressions on Windows
builds. We would like to get developers’ feedback before starting this
experiment.

Why are we doing this?

Essentially, we would like more Talos regressions to get fixed and Firefox
performance to improve. We want to test a 48-hour backout policy because we
noticed that patch authors often don't fix regressions quickly. If a
regression sits in the tree for days, it becomes difficult to back out, and
it becomes much more likely that the regression will end up riding the
trains to Release by default.

Note that we already have a policy of backing out regressions of 3% or more
if the patch author does not respond at all on the regression bug for 3
business days. See
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling

The new policy is more aggressive. We think a patch that regresses
performance significantly should be backed out quickly, and re-landed when
its performance is acceptable.

How will the backouts work?

The A-Team perf sheriffs will create a Talos regression bug as soon as a
regression is confirmed using Talos re-triggers. The patch author and
reviewer will be CC’ed, and if they don’t provide an explanation for why
the regression is acceptable, the patch will be backed out. The goal is to
back out unjustified regressions within 48 hours of landing. We’d like to
give the patch author about 24 hours to reply after the regression bug is
filed.
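
For illustration, a minimal sketch of the two windows described above (backout
within 48 hours of landing, roughly 24 hours for the author to reply after the
bug is filed); the helper and its inputs are hypothetical, not part of the
sheriffs' tooling:

    from datetime import datetime, timedelta

    def backout_deadlines(landing_time, bug_filed_time):
        """Compute the two deadlines named in the policy above.

        landing_time:   when the suspect patch landed
        bug_filed_time: when the Talos regression bug was filed
        """
        return {
            "author_reply_by": bug_filed_time + timedelta(hours=24),
            "backout_by": landing_time + timedelta(hours=48),
        }

    # Hypothetical example: patch lands Friday afternoon, bug filed ~6 hours later.
    landed = datetime(2015, 8, 14, 15, 0)
    print(backout_deadlines(landed, landed + timedelta(hours=6)))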

The A-Team has been working hard on improving the tools for understanding
Talos regressions (e.g. Perfherder), and we think debugging a Talos
regression is a much less painful process these days. For example, there is
now a highly usable view for comparing a proposed patch against a baseline
revision at
https://treeherder.mozilla.org/perf.html#/comparechooser

Will all regressions get backed out?

No. Only regressions of 10% or more, on reliable Talos tests, on Windows,
will face automatic backouts during this trial period. Given historical
trends, we are anticipating about one backout per week.


List of reliable Talos tests: ts_paint, sessionrestore, tp5o_scroll,
tscrollx, tresize, TART, tsvgx, tp5o
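
For illustration, a minimal sketch of the trial criteria described above (a
regression of 10% or more, on a reliable test, on Windows); the function and
its arguments are hypothetical and not part of Perfherder or the sheriffs'
tooling:

    # Test names taken from the list above; the matching is illustrative only.
    RELIABLE_TESTS = {"ts_paint", "sessionrestore", "tp5o_scroll",
                      "tscrollx", "tresize", "tart", "tsvgx", "tp5o"}

    def falls_under_trial_policy(test_name, platform, percent_regression):
        """True if an alert meets the trial's auto-backout criteria."""
        return (test_name.lower() in RELIABLE_TESTS
                and platform.lower().startswith("win")
                and percent_regression >= 10.0)

    # A 12% ts_paint regression on Windows would qualify; a 5% one would not.
    print(falls_under_trial_policy("ts_paint", "windows7-32", 12.0))  # True
    print(falls_under_trial_policy("ts_paint", "windows7-32", 5.0))   # False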

How are you testing this new policy?

First off, we want developers’ feedback before trialing this new policy.

I would like us to collect feedback and then start enforcing the new policy
sometime next week. You can talk to us on the newsgroups or on #perf. You
can also reach me directly (my IRC nick is “vladan”).

We’ll post our conclusions on the newsgroups after a couple of months of
enforcing the policy, and then we can all re-evaluate the backout policy
together.

Who will be the perf sheriffs?

Joel Maher, Vaibhav Agrawal

Kartikaya Gupta

Aug 14, 2015, 5:45:35 PM
to Vladan Djeric, dev-platform, firefox-dev-owner
In general I'm in favour of this proposal, although it will probably
come back to haunt me in the not-too-distant future. That being said, I
would like to know what criteria you used to distinguish "reliable"
Talos tests from the rest.

kats

Vladan D

Aug 14, 2015, 5:56:49 PM
Tests are reliable if they detect regressions, aren't very noisy, and if they measure things that have a real impact on actual Firefox user experience.

We're using past experience with the tests to determine which ones meet these criteria.

Martin Thomson

Aug 14, 2015, 6:23:29 PM
to Vladan D, dev-platform
On Fri, Aug 14, 2015 at 2:56 PM, Vladan D <vdj...@mozilla.com> wrote:
> Tests are reliable if they detect regressions, aren't very noisy, and if they measure things that have a real impact on actual Firefox user experience.

Do we track false positives on these? I say that because I got a mail
just last week for ts_paint that seemed spurious.

Vladan Djeric

Aug 14, 2015, 6:44:46 PM
to Martin Thomson, Maher, Joel, dev-platform
I don't think anyone systematically logs the false-positives.

Is this the ts_paint regression you're referring to?
https://groups.google.com/forum/#!searchin/mozilla.dev.tree-alerts/ts_paint/mozilla.dev.tree-alerts/FArVsa8guXg/FfY91JK7AAAJ
I don't think the perf sheriffs filed a bug for that one, so I think they
knew it was a false positive. Here is the list of regression bugs filed for Firefox 42:
https://bugzilla.mozilla.org/showdependencytree.cgi?id=1186954&hide_resolved=0

Also note that perf sheriffs will re-trigger 3 Talos runs for the suspected
changeset and another 3 runs for the previous revision. It should be
extremely rare for developers to get a 48-hour backout notification because
of a (sustained) false positive.
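
For illustration, a rough sketch of how such a re-trigger comparison could be
summarized (three runs for the suspect changeset against three for the
previous revision); the real confirmation is done by the sheriffs through
Perfherder, so treat the numbers and the helper below as hypothetical:

    from statistics import median

    def percent_change(baseline_runs, suspect_runs):
        """Median-to-median change between two sets of Talos runs.

        Positive means the suspect revision is slower, for tests where
        lower values are better.
        """
        base = median(baseline_runs)
        return 100.0 * (median(suspect_runs) - base) / base

    # Hypothetical ms values: three re-triggers per revision, roughly a 12% regression.
    print(percent_change([820, 812, 818], [915, 905, 922]))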

Joel: can you comment on the ts_paint false positive email above?

On Fri, Aug 14, 2015 at 6:23 PM, Martin Thomson <m...@mozilla.com> wrote:

> On Fri, Aug 14, 2015 at 2:56 PM, Vladan D <vdj...@mozilla.com> wrote:
> > Tests are reliable if they detect regressions, aren't very noisy, and if
> they measure things that have a real impact on actual Firefox user
> experience.
>

Martin Thomson

Aug 14, 2015, 8:49:14 PM
to Vladan Djeric, Maher, Joel, dev-platform
On Fri, Aug 14, 2015 at 3:44 PM, Vladan Djeric <vdj...@mozilla.com> wrote:
> Is this the ts_paint regression you're referring to?
> https://groups.google.com/forum/#!searchin/mozilla.dev.tree-alerts/ts_paint/mozilla.dev.tree-alerts/FArVsa8guXg/FfY91JK7AAAJ

Yeah. I only ask because exercising judgment suppresses information
about the stability of the tests, so all we have is anecdotal
evidence. That's probably OK here. The process you describe sounds
pretty robust against false positives.

Vladan Djeric

Aug 14, 2015, 9:02:48 PM
to Martin Thomson, Maher, Joel, dev-platform
There are known issues with the test infrastructure (e.g. differences in
weekend vs weekday results) and those known issues are currently being
masked with human judgement.
A-Team has investigated these issues, and fixed some of them, but fixing
the rest will take a non-trivial amount of effort as I understand it.
When there's enough time to fix all the sources of noise in the
infrastructure, human judgement will no longer be required.

As an aside, I'm answering the questions for this 48-hour backout
announcement, but it's really Joel Maher + William Lachance + Vaibhav
Agrawal doing all the heavy lifting related to regression handling. They're
working on the regression-detection and regression-investigation tools, and
they're the ones acting as perf sheriffs.

Avi from my team is helping test the tools, and I just participate in
policy discussions and act as an (unintentional) spokesperson :)

On Fri, Aug 14, 2015 at 8:49 PM, Martin Thomson <m...@mozilla.com> wrote:

jmaher

Aug 15, 2015, 6:36:23 AM
I did see the ts_paint regression. It happened on 4 different platforms, and the patch was backed out for Telemetry issues about 5 pushes later:
http://hg.mozilla.org/integration/mozilla-inbound/rev/1190bc7b862d

and the backout:
http://hg.mozilla.org/integration/mozilla-inbound/rev/59ad2812d3c7

By the time we got the alert (about 2 hours later), we would have seen the backout, and looking at the raw data it would have been clear that the regression was related to the patch that was backed out. In this case, there would be no need to file a bug and pester folks.

I guess if you have questions about a specific email or alert, ask in #developers or #perf.

We get over 1000 alerts/month these days. It is unrealistic to comment on every alert, but it is reasonable to sanity-check them and ensure we are filing bugs for the real regressions.

cco...@deadsquid.com

Aug 15, 2015, 9:28:38 AM
kmoir filed https://bugzil.la/1192994 this week to hook SETA up to talos. SETA seems like the best tool we have right now to both reduce the overall test burden and increase the reliability of the tests we do run.

Chris Pearce

Aug 18, 2015, 6:43:03 PM
to Vladan Djeric, Martin Thomson, Maher, Joel, dev-platform
We recently had a false positive Talos regression on our team, which
turned out to be caused by a change to the test machine coinciding with
our push. This took up a bunch of energy and time away from our team,
which we really can't afford.

So to mitigate that I propose that *before* the backout happens, someone
on the regression-detection team does an `hg up` and Try push of the
backout and a Try push without the backout to ensure that backing out
actually helps.

Retriggers on the original push in our case didn't help, so I think a
completely clean push is necessary.

This should also assist the regression-detection team in convincing
patch authors that their patch is at fault.

The Try-backout should happen before the need-info to the patch author
happens. If the backout has non-trivial merge conflicts, then the first
action of the patch author should be to perform this step instead of the
regression-detection team member.

cpearce.
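
For illustration, a hedged sketch of the Try-backout verification Chris
describes, written as a small Python wrapper around Mercurial. It assumes a
clean local clone of the integration repo with a `try` path configured in
.hg/hgrc; the try syntax string and commit message are placeholders, not an
agreed-on recipe:

    import subprocess

    # Placeholder trychooser syntax: opt builds, Windows Talos only.
    TRY_SYNTAX = "try: -b o -p win32,win64 -u none -t all"

    def run(*cmd):
        """Echo and run a command, stopping on the first failure."""
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def push_backout_to_try(suspect_rev):
        """Back out the suspect changeset locally and push the result to try.

        The try syntax is embedded in the backout commit message so the push
        gets scheduled; pushing the revision *without* the backout for
        comparison uses the same mechanism.
        """
        run("hg", "update", suspect_rev)
        run("hg", "backout", "-r", suspect_rev,
            "-m", "Backed out %s to verify a Talos regression %s" % (suspect_rev, TRY_SYNTAX))
        run("hg", "push", "-f", "try")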

Bobby Holley

Aug 18, 2015, 6:50:30 PM
to Chris Pearce, dev-platform, Maher, Joel, Martin Thomson, Vladan Djeric
On Tue, Aug 18, 2015 at 3:43 PM, Chris Pearce <cpe...@mozilla.com> wrote:

> We recently had a false positive Talos regression on our team, which
> turned out to be caused by a change to the test machine coinciding with our
> push. This took up a bunch of energy and time away from our team, which we
> really can't afford.
>
> So to mitigate that I propose that *before* the backout happens, someone
> on the regression-detection team does an `hg up` and Try push of the
> backout and a Try push without the backout to ensure that backing out
> actually helps.
>

+1.

L. David Baron

Aug 19, 2015, 1:45:19 AM
to Chris Pearce, dev-platform, Maher, Joel, Martin Thomson, Vladan Djeric
On Wednesday 2015-08-19 10:43 +1200, Chris Pearce wrote:
> We recently had a false positive Talos regression on our team, which turned
> out to be caused by a change to the test machine coinciding with our push.
> This took up a bunch of energy and time away from our team, which we really
> can't afford.

I thought we'd collectively learned this lesson a number of times in
the past, but it seems to keep happening. Machine configuration
changes need to either happen in the repository or happen as part of
a tree closure in which all runs complete, the configuration change
is made, a dummy changeset is pushed to all trees, and the trees
reopened.

I think this is in violation of
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures
(see the first bullet point).

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)

Joel Maher

Aug 19, 2015, 8:53:29 AM
to L. David Baron, Chris Pearce, Martin Thomson, dev-platform, Vladan Djeric
This is a great point, and I still have no idea what caused the linux 32/64
machines to change on July 30th. It appeared to be a gradual rollout
(which indicates a machine issue that was picked up on reboot or something
similar). For running talos tests we pin to a specific revision in the
talos repository, which avoids any issues with pulling from tip. As a side
note, we are really close to landing talos in-tree, which will remove one
level of complexity in understanding this.

Regarding the regression cpearce mentioned: we had done retriggers on the
revision and the previous revision after receiving the alert (about a day
later in this case) to rule out any infrastructure changes. This would
be equivalent to doing a try push at that point in time. Sadly enough, the
numbers only solidified the fact that we had a regression. Yet a couple of
days later, pushing to try and doing retriggers showed a completely
different set of numbers. Historically, this happens about twice a year.
Given the stated policy, I am fairly certain this would have been backed out
(even with a try push) and we would have relanded. We back out patches many
times a day for unittest or build failures, and sometimes incorrectly if we
mistake the root cause or it is not clear. We have been trying to use
practices similar to those of the main tree sheriffs, and this 48-hour policy
gets us a lot closer.

I do think it is valid for us to push to try to verify that the backout
fixes the regression. The danger is when we have to wait 20 hours for
try results (now getting to 72 hours) and then other dependencies on the
patch in question land. This is why I am skeptical about adding a try push
if we already have enough data on the main tree. I guess if we cannot
trust what is on the tree, including retriggered jobs, to show a trend, then
we wouldn't be able to trust try. Do chime in if I am missing something
beyond the once or twice a year that a core infrastructure issue gives us
false data.

Thanks for bringing up your concerns so far- I look forward to making
future regression bugs more reliable and trustworthy!


On Wed, Aug 19, 2015 at 1:44 AM, L. David Baron <dba...@dbaron.org> wrote:

> On Wednesday 2015-08-19 10:43 +1200, Chris Pearce wrote:
> > We recently had a false positive Talos regression on our team, which
> turned
> > out to be caused by a change to the test machine coinciding with our
> push.
> > This took up a bunch of energy and time away from our team, which we
> really
> > can't afford.
>