Thank you for posting this - I've become increasingly worried by the number of real regressions that we are likely missing due to the current situation.
Like you, I used to read every single dev.tree-management mail and try to act on those that looked real. However, after many months of not feeling like it was making any difference, I stopped checking regularly, since it was a very unproductive use of my time.
The main issues were:
1) Many seemingly irrelevant tests (eg codesize, number of constructors, ...?) and noisy tests gave dev.tree-management a very low signal-to-noise ratio, with hundreds of regression emails a week. This made it a struggle both to find time for them all on a daily basis and to remember which tests were reliable and so should be acted on more urgently.
2) Large amounts of coalescing made the regression ranges wide, resulting either in vast amounts of sheriff time spent on retriggers/TryServer runs, or in developers believing the regression must have been caused by someone else in the range and so not looking into it themselves.
3) Lack of documentation for even working out what the test in question does and how to run it locally. I filed bug 770460 to deal with this, which jhammel has kindly done, so the situation is a bit better than it was: https://wiki.mozilla.org/Buildbot/Talos
4) The combination of the above causing:
a) Devs either not believing the regression could be anything other than noise or someone else's patch, or else trying hard to investigate it but being unable to, due to gaps in the documentation and the difficulty of reproducing locally.
b) Sheriffs not having enough conviction in the reliability of a reported regression to feel able to hassle said devs about (a).
As an example, I filed bug 778718 (30% Windows Ts regression since 1st March) a month ago, yet it has sat inactive for a while now, and I sadly don't see us making any progress on it before we release the regressed builds :-(
On Thursday, 30 August 2012 01:01:26 UTC+1, Matt Brubeck wrote:
> * Less is more. We can pay more attention to tests if every alert is
> for something we care about. We can *track* stuff like Trace Malloc
> Allocs if there are people who find the data useful in their work, but
> we should not *alert* on it unless it is a key user-facing metric.
On Thursday, 30 August 2012 02:20:24 UTC+1, Dave Mandelin wrote:
> - First, and most important, fix the test suite so that it measures only things that are useful and meaningful to developers and users. We can easily take a first cut at this if engineering teams go over the tests related to their work, and tell A-Team which are not useful. Over time, I think we need to get a solid understanding of what performance looks like to users, what things to test, and how to test them soundly. This may require dedicated performance engineers or a performance product manager.
On Thursday, 30 August 2012 17:11:25 UTC+1, Ehsan Akhgari wrote:
> Note that some of the work to differentiate between false positives
> and real regressions needs to be done by the engineers, similar to the
> work required to investigate correctness problems. And people need to
> accept that seemingly benign changes may also cause real performance
> regressions, so it's not always possible to glance over a changeset and
> say "nah, this can't be my fault." :-)
I agree entirely with each of the above.
On Thursday, 30 August 2012 16:53:08 UTC+1, Ehsan Akhgari wrote:
> The way that we handle similar situations with test failures is to close
> inbound, back out enough patches so that things go green again, and then
> reopen, and ask the people that had been backed out to retry landing
> after testing on try.
> This is entirely possible for Talos regressions as well if we want to,
> it just takes longer, and will mean more patches will get backed out.
On Thursday, 30 August 2012 22:42:07 UTC+1, Taras Glek wrote:
> ** We will extend the current merge criteria of last green PGO changeset
> to also include "good Talos numbers"
Once we have the main issues resolved, I agree this will be the best way forward - and in fact I will be happy to take the lead on pro-actively checking dev.tree-management before merging.
However, at this precise point in time, I feel that blocking merges on Talos "green" is not yet viable, since the reasons at the start of my post still hold. I'm hopeful that many of them can be mitigated over the next couple of weeks, but for now I think we're going to have to deal with just the most obvious Talos regressions in the days shortly after each merge, rather than blocking the merges on them.
In addition, as far as I'm aware, it takes several Talos runs (after the regression) before the regression mails are triggered (presumably more for noisy tests), so we would have to wait several hours longer than the usual "PGO green on all platforms" before we could be sure it was safe to merge. Add to this the end-to-end time of retriggers to counteract coalescing, and I suspect this would cause a fair amount of grief between sheriffs and devs, if our experience with delaying merges for unit test failures is anything to go by. Can someone who knows Talos better than I do clarify when the regression emails are sent?
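To illustrate why the alerts necessarily lag the regression (this is my assumption of how windowed detection generally works, not a description of the actual Talos analyzer): if the analyzer compares a window of runs after each push against a window before it, then a push can only be flagged once enough later runs exist. A minimal sketch, with illustrative window and threshold values:

```python
# Hypothetical sketch of windowed regression detection - NOT the real
# Talos analyzer, just an illustration of why alerts lag the regression.
def detect_regression(values, window=3, threshold=0.05):
    """Flag index i when the mean of the `window` runs starting at i
    exceeds the mean of the `window` runs before i by more than
    `threshold` (as a fraction). Index i can only be checked once
    `window` runs at-or-after it exist, hence the alert delay."""
    alerts = []
    for i in range(window, len(values) - window + 1):
        before = sum(values[i - window:i]) / window
        after = sum(values[i:i + window]) / window
        if before > 0 and (after - before) / before > threshold:
            alerts.append(i)
    return alerts

# A 30% step up at run index 4 is only detectable once runs 4..6 exist:
times = [100, 101, 99, 100, 130, 131, 129]
print(detect_regression(times, threshold=0.25))  # [4]
```

With a noisier test you would need a larger window (more runs) to get a trustworthy signal, which is presumably why noisy tests alert even later.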
I would absolutely love it if we were able to do this, but with our current capacity issues (which really do need resolving, but that's another thread, to which I'm replying after this), we just don't have anywhere near the spare machine time.
What would be useful, however, is an easy way to mark a regression range (eg the one given in a dev.tree-management email) as needing retriggers on every push, to combat coalescing - with those retriggers scheduled for our quieter machine times, eg weekends or outside PDT hours. If we did this, we still wouldn't be able to block mozilla-inbound merges on 'green' Talos, but we would at least be in a better position than we are now, where the regression ranges are so wide that they are not easily actionable.
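To make the proposal concrete, here is a rough sketch of the scheduling policy I have in mind. Everything here is illustrative - the function names, the definition of "quiet time", and the hour-by-hour search are all my assumptions, not an existing tool:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the policy described above: queue a retrigger
# for every push in a regression range, but only dispatch them during
# quiet machine times (weekends, or outside PDT working hours).

def is_quiet_time(dt):
    """Treat weekends, or weekday hours outside 07:00-19:00, as quiet.
    `dt` is assumed to already be in PDT."""
    if dt.weekday() >= 5:  # Saturday or Sunday
        return True
    return dt.hour < 7 or dt.hour >= 19

def next_quiet_slot(dt):
    """Advance `dt` hour by hour until it falls in a quiet period."""
    while not is_quiet_time(dt):
        dt += timedelta(hours=1)
    return dt

def schedule_retriggers(push_ids, now):
    """Map every push in the regression range to the next quiet slot,
    where the actual retrigger jobs would then be submitted."""
    slot = next_quiet_slot(now)
    return {push: slot for push in push_ids}

# Marking a two-push range on a Thursday at noon PDT defers the
# retriggers to 19:00 that evening:
print(schedule_retriggers(["push1", "push2"], datetime(2012, 8, 30, 12)))
```

The real version would obviously need to talk to the build scheduler and respect per-platform machine pools, but even this crude "defer to the next quiet slot" rule would stop retriggers from competing with peak-hours load.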