| New policy: 48-hour backouts for major Talos regressions | Vladan Djeric | 14/08/15 14:26 | The perf team and the A-Team would like to test out a new policy: we want to back out patches that cause significant Talos regressions on Windows builds. We would like to get developers' feedback before starting this experiment.

Why are we doing this?

Essentially, we would like more Talos regressions to get fixed and Firefox performance to improve. We want to test a 48-hour backout policy because we noticed that patch authors often don't fix regressions quickly. If a regression sits in the tree for days, it becomes difficult to back out, and it becomes much more likely that the regression will end up riding the trains to Release by default.

Note that we already have a policy of backing out regressions of 3% or more if the patch author does not respond at all on the regression bug for 3 business days. See https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling

The new policy is more aggressive. We think a patch that regresses performance significantly should be backed out quickly, and re-landed once its performance is acceptable.

How will the backouts work?

The A-Team perf sheriffs will create a Talos regression bug as soon as a regression is confirmed using Talos re-triggers. The patch author and reviewer will be CC'ed, and if they don't provide an explanation for why the regression is acceptable, the patch will be backed out. The goal is to back out unjustified regressions within 48 hours of landing. We'd like to give the patch author about 24 hours to reply after the regression bug is filed.

The A-Team has been working hard on improving the tools for understanding Talos regressions (e.g. Perfherder), and we think debugging a Talos regression is a much less painful process these days. For example, there is now a highly usable view for comparing a proposed patch against a baseline revision at https://treeherder.mozilla.org/perf.html#/comparechooser

Will all regressions get backed out?

No. Only regressions of 10% or more, on reliable Talos tests, on Windows, will face automatic backouts during this trial period. Given historical trends, we anticipate about one backout per week.

List of reliable Talos tests: ts_paint, sessionrestore, tp5o_scroll, tscrollx, tresize, TART, tsvgx, tp5o

How are you testing this new policy?

First off, we want developers' feedback before trialing this new policy. I would like us to collect feedback and then start enforcing the new policy sometime next week. You can talk to us on the newsgroups or on #perf. You can also reach me directly (my IRC nick is "vladan"). We'll post our conclusions on the newsgroups after a couple of months of enforcing the policy, and then we can all re-evaluate the backout policy together.

Who will be the perf sheriffs?

Joel Maher, Vaibhav Agrawal |
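[Editor's note: the backout-candidacy rule described in the post (10% threshold, the reliable-test list, Windows only) can be sketched in a few lines of Python. This is purely illustrative; the function, its signature, and the data shapes are hypothetical and not part of any real Perfherder tooling.]

```python
# Illustrative sketch of the stated backout-candidacy rule.
# Threshold and test names come from the post; everything else is made up.

RELIABLE_TESTS = {
    "ts_paint", "sessionrestore", "tp5o_scroll", "tscrollx",
    "tresize", "tart", "tsvgx", "tp5o",
}
BACKOUT_THRESHOLD = 0.10  # regressions of 10% or more qualify

def is_backout_candidate(test_name, platform, baseline, new_value):
    """Return True if a confirmed regression qualifies for a 48-hour backout.

    Assumes lower Talos numbers are better, which holds for the listed tests.
    """
    if platform != "windows":          # only Windows during the trial period
        return False
    if test_name.lower() not in RELIABLE_TESTS:
        return False                   # noisy tests are excluded
    if baseline <= 0:
        return False
    regression = (new_value - baseline) / baseline
    return regression >= BACKOUT_THRESHOLD
```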
| Re: New policy: 48-hour backouts for major Talos regressions | Kartikaya Gupta | 14/08/15 14:45 | In general I'm in favour of this proposal, although it will probably
come back to haunt me in the not-too-distant future. That being said I would like to know what criteria you used to distinguish "reliable" talos tests from the rest. kats |
| Re: New policy: 48-hour backouts for major Talos regressions | Vladan D | 14/08/15 14:56 | Tests are reliable if they detect regressions, aren't very noisy, and if they measure things that have a real impact on actual Firefox user experience.
We're using past experience with the tests to determine which ones meet these criteria. |
| Re: New policy: 48-hour backouts for major Talos regressions | Martin Thomson | 14/08/15 15:23 | On Fri, Aug 14, 2015 at 2:56 PM, Vladan D <vdj...@mozilla.com> wrote:

Do we track false positives on these? I say that because I got a mail just last week for ts_paint that seemed spurious. |
| Re: New policy: 48-hour backouts for major Talos regressions | Vladan Djeric | 14/08/15 15:44 | I don't think anyone systematically logs the false-positives.
Is this the ts_paint regression you're referring to? https://groups.google.com/forum/#!searchin/mozilla.dev.tree-alerts/ts_paint/mozilla.dev.tree-alerts/FArVsa8guXg/FfY91JK7AAAJ

I don't think the perf sheriffs filed a bug for that one, so I think they knew it was a false positive.

List of regression bugs filed for 42: https://bugzilla.mozilla.org/showdependencytree.cgi?id=1186954&hide_resolved=0

Also note that perf sheriffs will re-trigger 3 Talos runs for the suspected changeset and another 3 runs for the previous revision. It should be extremely rare for developers to get a 48-hour backout notification because of a (sustained) false positive.

Joel: can you comment on the ts_paint false positive email above? |
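[Editor's note: the retrigger step described above (3 runs on the suspect changeset, 3 on its parent) amounts to comparing the medians of two small samples. A minimal sketch, with all names and the default threshold hypothetical:]

```python
# Illustrative sketch of confirming a regression via retriggers.
# Medians of ~3 runs per side damp one-off noisy results.
from statistics import median

def confirmed_by_retriggers(parent_runs, suspect_runs, threshold=0.10):
    """Return True if the suspect changeset shows a sustained regression.

    parent_runs / suspect_runs: Talos results from the retriggered jobs
    (lower is better). The suspect side must be at least `threshold`
    worse than the parent side to count as confirmed.
    """
    base = median(parent_runs)
    new = median(suspect_runs)
    return base > 0 and (new - base) / base >= threshold
```

A single noisy run on either side does not flip the result, which is the point of retriggering several times rather than trusting one job.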
| Re: New policy: 48-hour backouts for major Talos regressions | Martin Thomson | 14/08/15 17:49 | On Fri, Aug 14, 2015 at 3:44 PM, Vladan Djeric <vdj...@mozilla.com> wrote:

Yeah. I only ask because exercising judgment suppresses information about the stability of the tests, so all we have is anecdotal evidence. That's probably OK here. The process you describe sounds pretty robust against false positives. |
| Re: New policy: 48-hour backouts for major Talos regressions | Vladan Djeric | 14/08/15 18:02 | There are known issues with the test infrastructure (e.g. differences in weekend vs. weekday results), and those known issues are currently being masked with human judgement. The A-Team has investigated these issues and fixed some of them, but fixing the rest will take a non-trivial amount of effort, as I understand it. Once there's enough time to fix all the sources of noise in the infrastructure, human judgement will no longer be required.

As an aside, I'm answering the questions for this 48-hour backout announcement, but it's really Joel Maher + William Lachance + Vaibhav Agrawal doing all the heavy lifting related to regression handling. They're working on the regression-detection and regression-investigation tools, and they're the ones acting as perf sheriffs. Avi from my team is helping test the tools, and I just participate in policy discussions and act as an (unintentional) spokesperson :) |
| Re: New policy: 48-hour backouts for major Talos regressions | jmaher | 15/08/15 03:36 | I did see the ts_paint regression. This happened on 4 different platforms, and the patch was backed out for telemetry issues about 5 pushes later: http://hg.mozilla.org/integration/mozilla-inbound/rev/1190bc7b862d

and the backout: http://hg.mozilla.org/integration/mozilla-inbound/rev/59ad2812d3c7

By the time we get the alert (about 2 hours later), we would have seen the backout, and looking at the raw data it would have been clear the regression was related to the patch which was backed out. In this case, there would be no need to file a bug and pester folks.

If you have questions about a specific email or alert, ask in #developers or #perf. We get over 1000 alerts/month these days; it is unrealistic to comment on every alert, but reasonable to sanity-check them and ensure we are filing bugs for the real regressions. |
| Re: New policy: 48-hour backouts for major Talos regressions | cco...@deadsquid.com | 15/08/15 06:28 | kmoir filed https://bugzil.la/1192994 this week to hook SETA up to talos. SETA seems like the best tool we have right now to both reduce the overall test burden and increase the reliability of the tests we do run.
|
| Re: New policy: 48-hour backouts for major Talos regressions | Chris Pearce | 18/08/15 15:43 | We recently had a false positive Talos regression on our team, which turned out to be caused by a change to the test machines coinciding with our push. This took a bunch of energy and time away from our team, which we really can't afford.

To mitigate that, I propose that *before* the backout happens, someone on the regression-detection team does an `hg up` and a Try push of the backout, plus a Try push without the backout, to ensure that backing out actually helps. Retriggers on the original push in our case didn't help, so I think a completely clean push is necessary. This should also assist the regression-detection team in convincing patch authors that their patch is at fault.

The Try push of the backout should happen before the need-info to the patch author. If the backout has non-trivial merge conflicts, then performing this step should be the first action of the patch author instead of the regression-detection team member.

cpearce. |
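[Editor's note: cpearce's suggested check boils down to comparing two Try pushes, one with the suspect patch in place and one with it backed out, and only proceeding with the real backout if the second push recovers the numbers. A minimal sketch of that comparison step; the function name, signature, and tolerance value are all hypothetical:]

```python
# Illustrative sketch of verifying a backout via paired Try pushes.
# Lower Talos numbers are better for the tests under discussion.
from statistics import median

def backout_recovers(with_patch_runs, backed_out_runs, tolerance=0.02):
    """Return True if backing the patch out on Try recovers performance.

    with_patch_runs: Talos results from a Try push of the current tip.
    backed_out_runs: Talos results from a Try push with the suspect
    patch backed out. If the backed-out push is meaningfully faster,
    the patch (not the infrastructure) was likely at fault.
    """
    before = median(with_patch_runs)
    after = median(backed_out_runs)
    return before > 0 and (before - after) / before > tolerance
```

If both pushes report the same numbers, the regression likely came from an infrastructure change (like the machine-configuration change described above), and backing out the patch would accomplish nothing.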
| Re: New policy: 48-hour backouts for major Talos regressions | Bobby Holley | 18/08/15 15:50 | On Tue, Aug 18, 2015 at 3:43 PM, Chris Pearce <cpe...@mozilla.com> wrote:

+1. |
| Re: New policy: 48-hour backouts for major Talos regressions | L. David Baron | 18/08/15 22:45 | On Wednesday 2015-08-19 10:43 +1200, Chris Pearce wrote:

I thought we'd collectively learned this lesson a number of times in the past, but it seems to keep happening. Machine configuration changes need to either happen in the repository, or happen as part of a tree closure in which all runs complete, the configuration change is made, a dummy changeset is pushed to all trees, and the trees are reopened.

I think this is in violation of https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Must_avoid_patterns_known_to_cause_non_deterministic_failures (see the first bullet point).

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914) |
| Re: New policy: 48-hour backouts for major Talos regressions | Joel Maher | 19/08/15 05:53 | This is a great point, and I still have no idea what caused the linux 32/64 machines to change on July 30th. It appeared to be a gradual rollout (which indicates a machine issue that was picked up on reboot or something similar). For running talos tests we pin to a specific revision in the talos repository, which avoids any issues of pulling from tip. As a side note, we are really close to landing talos in-tree, which will remove one level of complexity in understanding this.

Regarding the regression cpearce mentioned: we had done retriggers on the revision and the previous revision after receiving the alert (a day later, in this case) to rule out any infrastructure changes. This would be equivalent to doing a try push at that point in time. Sadly, the numbers only solidified the fact that we had a regression. Yet a couple of days later, pushing to try and doing retriggers showed a completely different set of numbers. Historically, this happens about twice a year. Given the stated policy, I am fairly certain this would have been backed out (even with a try push) and we would have relanded.

We back out patches many times a day for unittest or build failures, and sometimes incorrectly, when we mistake the root cause or it is not clear. We have been trying to use practices similar to those of the main tree sheriffs, and this 48-hour policy gets us a lot closer. I do think it is valid for us to push to try to verify that the backout fixes the regression. The danger is when we have to wait 20 hours for try results (now getting to 72 hours) and then other dependencies on the patch in question land. This is why I am skeptical about adding a try push if we already have enough data on the main tree. And if we cannot trust what is on the tree, including retriggered jobs, to show a trend, then we wouldn't be able to trust try either. Do chime in if I am missing something beyond the once-or-twice-a-year core infrastructure issue that gives us false data.

Thanks for bringing up your concerns so far. I look forward to making future regression bugs more reliable and trustworthy! |