TBPL job visibility policy

Ed Morley

Mar 26, 2013, 10:10:18 AM

to dev.tree-management, auto-...@mozilla.com, Ryan VanderMeulen, Philip Ringnalda, dev.platform, Releng

(Please reply to dev.tree-management)

Until now, the requirements for a new platform/test-suite to be shown in
the default TBPL view have been scattered across many newsgroup
discussions, bugs & IRC conversations, which understandably leads to
surprise when developers working on bringing a new job type into our
buildbot automation are told that it does not yet meet them.

In order to make the existing criteria more discoverable, I have
documented them at:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy

The new page also includes:
* A bug template for requesting changes in visibility.
* Tips for tracking a job if it is not shown in the default view.
* A rough overview of planned tooling improvements to give us more
flexibility in the future.

If you spot any omissions or any of the requirements need rewording to
be more clear, please let me know :-)

Best wishes,

Ed

Justin Lebar

Mar 26, 2013, 10:38:39 AM

to Ed Morley, Releng, dev.tree-management, auto-...@mozilla.com, Ryan VanderMeulen, Philip Ringnalda, dev.platform

> 4) Runs on every push

I thought many tests don't run on every push to m-i, due to
coalescing. (Just to be clear, I wish we could disable coalescing on
m-i!)

But also, PGO builds/tests are not run on every push to m-c.

-Justin

Ed Morley

Mar 26, 2013, 11:25:08 AM

to Justin Lebar, Ryan VanderMeulen, Philip Ringnalda, dev.tree-management

(Removing non dev.tree-management mailing list recipients)

In an earlier draft I had it phrased as "Is scheduled to run on every
push" (ie might not end up running, but the intention is at least to do
so) - however I thought I may be over-complicating it, hence the
simplified wording, but given what you've said, I've switched back to
that and added an explanation.

And yeah good point about PGO, I've noted an exception for tests run on
PGO builds (like I had to do for nightlies vs dep builds).

Changes:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy?title=Sheriffing%2FJob_Visibility_Policy&action=historysubmit&diff=641473&oldid=641429

Many thanks,

Ed

Bobby Holley

Mar 26, 2013, 12:21:08 PM

to Ed Morley, Ryan VanderMeulen, Philip Ringnalda, Justin Lebar, dev.tree-management

> Is supported by mach

Marionette isn't currently supported by mach, and is a pain to run locally.
Maybe this can help us marshal resources to get bug 799308 fixed?

bholley

Ryan VanderMeulen

Mar 26, 2013, 12:24:48 PM

On 3/26/2013 12:07 PM, Dave Townsend wrote:
> This is awesome, thanks for writing this up, it's made me spot one more
> place where Jetpack is failing.
>
> Can we get some more definition around "Runs on all trees that merge
> into mozilla-central"? In particular I want to make that true for
> Jetpack but I don't know what all those trees are.
>
> Dave

Any branch that's merging with m-c (and actively working to land there
at some point), including project branches.

L. David Baron

Mar 26, 2013, 12:43:06 PM

to Ed Morley, dev.tree-management

On Tuesday 2013-03-26 14:10 +0000, Ed Morley wrote:
> In order to make the existing criteria more discoverable, I have
> documented them at:
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy

> If you spot any omissions or any of the requirements need rewording
> to be more clear, please let me know :-)

One key requirement that I think you missed is that the test results
should either:
(1) change only as a result of things committed to mozilla-central
or equivalent merges-into-mozilla-central repositories, or
(2) change only in a downtime window that involves closing the
trees, letting all existing test runs complete, making the
change, triggering a set of runs (on all relevant trees) using
a dummy push that describes the change, maybe waiting for those
runs to complete, and reopening the tree.
Obviously (1) is preferred, but when it's not the case (2) should be
required. When this isn't the case, we end up with sheriffs or
developers chasing down what changes in external repositories might
have happened to break tests or regress performance numbers.

It's possible you should also mention the requirement that the test
suite not contain time bombs, e.g., things that will fail after a
certain date or when run at certain times.

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla http://www.mozilla.org/ 𝄂

Ed Morley

Mar 26, 2013, 1:02:59 PM

to Bobby Holley, dev.tree-management, Dave Townsend, Ryan VanderMeulen, Philip Ringnalda

On 26/03/2013 16:07, Dave Townsend wrote:
> This is awesome, thanks for writing this up, it's made me spot one more
> place where Jetpack is failing.
>
> Can we get some more definition around "Runs on all trees that merge
> into mozilla-central"? In particular I want to make that true for
> Jetpack but I don't know what all those trees are.

No problem :-) Sorry we've not had anything like this until now!

I'll clarify the entry on the wiki page to mention that you can file a
releng bug asking for jobs to be enabled for "mozilla-central based
trees" - and that they'll know what you mean.

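On 26/03/2013 16:21, Bobby Holley wrote:
> > Is supported by mach
>
> Marionette isn't currently supported by mach, and is a pain to run
> locally. Maybe this can help us marshal resources to get bug 799308
> fixed?

Thank you for reminding me of that bug :-) The mach requirement is one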
that we'll have to be slightly more flexible on initially for legacy job
types (given the recency of mach), but I still wanted to include it so
that new test-suites would add support from the outset.

I'll see if I can get someone to take a look at that bug (I know gps has
his hands full with the moz.build refactoring work).

Best wishes,

Ed

Jonathan Griffin

Mar 26, 2013, 1:05:30 PM

to dev-tree-...@lists.mozilla.org

> On 26/03/2013 16:21, Bobby Holley wrote:
>> > Is supported by mach
>>
>> Marionette isn't currently supported by mach, and is a pain to run
>> locally. Maybe this can help us marshal resources to get bug 799308
>> fixed?
>
> Thank you for reminding me of that bug :-) The mach requirement is one
> that we'll have to be slightly more flexible on initially for legacy
> job types (given the recency of mach), but I still wanted to include
> it so that new test-suites would add support from the outset.
>
> I'll see if I can get someone to take a look at that bug (I know gps
> has his hands full with the moz.build refactoring work).
>

This is on the a-team's radar; we intend to add mach support for all the
B2G test runners soon-ish (as well as Marionette on desktop).

Jonathan

Ed Morley

Mar 26, 2013, 1:34:18 PM

to L. David Baron, dev.tree-management

On 26/03/2013 16:43, L. David Baron wrote:
> One key requirement that I think you missed is that the test results
> should either:
> (1) change only as a result of things committed to mozilla-central
> or equivalent merges-into-mozilla-central repositories, or
> (2) change only in a downtime window that involves closing the
> trees, letting all existing test runs complete, making the
> change, triggering a set of runs (on all relevant trees) using
> a dummy push that describes the change, maybe waiting for those
> runs to complete, and reopening the tree.
> Obviously (1) is preferred, but when it's not the case (2) should be
> required. When this isn't the case, we end up with sheriffs or
> developers chasing down what changes in external repositories might
> have happened to break tests or regress performance numbers.

I would absolutely love it if we could make that a requirement (it would
make sheriffing a lot easier), but that horse has already bolted :-(

Eg: changes to any of the following can break builds:
* (less of an issue/unavoidable) buildbotcustom
* (less of an issue/unavoidable) buildbot-configs
* mozharness
* gaia-master
* Other B2G repos
* (For non-firefox trees) comm-central

That said, TBPL's replacement is being designed from the ground up to be
able to visualise changes from multiple repositories, so hopefully we
should have a better story for this scenario in the future.

There used to also be issues when the B2G emulator was updated, however
we now have a manifest in the tree that gets updated when we want to
upgrade:
http://mxr.mozilla.org/mozilla-central/source/b2g/test/emulator.manifest
...though as much as it would be nice to switch gaia to something
similar, I fear it wouldn't scale / there would be too much pushback.

> It's possible you should also mention the requirement that the test
> suite not contain time bombs, e.g., things that will fail after a
> certain date or when run at certain times.

Good point, I'll add that now.

Thank you for the feedback :-)

Ed

L. David Baron

Mar 26, 2013, 1:42:13 PM

to Ed Morley, dev.tree-management

I think it should still go on the list, maybe with caveats, since
we've blocked tests from being unhidden on m-c for doing this when
it was avoidable (e.g., jetpack), and because it's a major source of
pain (e.g., when it was done incorrectly with talos).

Steve Fink

Mar 26, 2013, 1:58:56 PM

to Ed Morley, dev.tree-management, auto-...@mozilla.com, Ryan VanderMeulen, Philip Ringnalda, dev.platform, Releng

On 03/26/2013 07:10 AM, Ed Morley wrote:
> (Please reply to dev.tree-management)
>
> Until now, the requirements for a new platform/test-suite to be shown in
> the default TBPL view have been scattered across many newsgroup
> discussions, bugs & IRC conversations, which understandably leads to
> surprise when developers working on bringing a new job type into our
> buildbot automation are told that it does not yet meet them.
>
> In order to make the existing criteria more discoverable, I have
> documented them at:
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy

I'm maintaining a somewhat irregular tier-1 job type, the spidermonkey
root analysis builds. It makes an interesting example case, since it
meets most but not all of your criteria presently, and I think I can
weasel out of a few more.

#1 yes

#2 yes

#3 no. The build only runs on mozilla-inbound and try. It could be added
to mozilla-central, though it wouldn't change much (because no
spidermonkey development lands straight on m-c afaik.) It should be
added to the baseline compiler repo. I would argue that it doesn't
really matter whether it's enabled on other trees, since it only runs
when something under js/src is changed.

I don't know if you want to update the wiki page to exempt things that
are restricted to a subtree, or if there's an implicit "...but use
common sense" override.

#4 yes

#5 yes but I didn't add anything specific for it in trychooser. I think
I'm "good enough" here because it's a build job, not a test job, and the
try syntax isn't easily expressible in the UI. (For the record, the
build is scheduled whenever js/src is touched and you request a
linux64-debug build as part of your request.) I can think of something
better, but it seems like it would require more work than it's worth.

#6 yes, but could the specifics be documented? Preferably with a
mechanism, or at least advice, for handling numerical results. (eg a
count of bad somethings, which you don't want to regress.)

Given that we *just* fixed an output problem in JS tests, and we had no
idea it was a problem, I think having an explicit description of the
allowed formats would be useful.

#7 yes. Though I worry that explicitly laying all this out could
eliminate some needed wiggle room. For example, if a job really had a 5%
failure rate, and it stayed that way for 3 months, I don't think you'd
be happy.

#8 yes

#9 yes

#10 no. I'll make one.

#11 not really supported by mach, but it seems like this is somewhat
test-specific anyway. My job is a build with specific configure options
followed by a test run with an environment variable set. So it's not
incompatible with mach, but I don't think it would do very much good to
add it to mach, at least until we get to the point where we require each
letter that appears on tbpl to be runnable directly from mach:

./mach tbpl 'SM(r)'

-----

It sounds like you're already working on this with tbpl2, but I've
thought for a while that we need something just a little below tier 1.
Something like a persistent jobname pattern that shows certain jobs
based on your group membership, or my preference: just a way in the tbpl
UI to indicate failures that you don't need to backout for. There's no
reason "visible" == "punishable by backout" if the UI makes it clear
which failures are tolerated. Put them in a separate column on the right
hand side or something. I like this better than hiding jobs entirely
because it makes it harder to ignore flaky tests forever.

Alex Keybl

Mar 26, 2013, 3:39:00 PM

to Ed Morley, Ryan VanderMeulen, Philip Ringnalda, dev.tree-management

Will these criteria (ownership, failure rate, etc.) end up being applied to already existing tests? If so, who is in the loop on signing off on a removal?

-Alex

On Mar 26, 2013, at 7:10 AM, Ed Morley <emo...@mozilla.com> wrote:

> (Please reply to dev.tree-management)
>
> Until now, the requirements for a new platform/test-suite to be shown in the default TBPL view have been scattered across many newsgroup discussions, bugs & IRC conversations, which understandably leads to surprise when developers working on bringing a new job type into our buildbot automation are told that it does not yet meet them.
>
> In order to make the existing criteria more discoverable, I have documented them at:
> https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy
>
> The new page also includes:
> * A bug template for requesting changes in visibility.
> * Tips for tracking a job if it is not shown in the default view.
> * A rough overview of planned tooling improvements to give us more flexibility in the future.
>
> If you spot any omissions or any of the requirements need rewording to be more clear, please let me know :-)

Ed Morley

Apr 1, 2013, 2:56:29 PM

to Alex Keybl, Steve Fink, L. David Baron, Ryan VanderMeulen, Philip Ringnalda, dev.tree-management

tl;dr: made a number of tweaks - diff:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy?title=Sheriffing%2FJob_Visibility_Policy&action=historysubmit&diff=642742&oldid=641665

Replies to emails in-line below - thank you all for the suggestions :-)

Ed

On 26/03/2013 17:42, L. David Baron wrote:
> I think it should still go on the list, maybe with caveats, since
> we've blocked tests from being unhidden on m-c for doing this when
> it was avoidable (e.g., jetpack), and because it's a major source of
> pain (e.g., when it was done incorrectly with talos).

Very true, added a bullet under #8 for this :-)

On 26/03/2013 19:39, Alex Keybl wrote:
> Will these criteria (ownership, failure rate, etc.) end up being
> applied to already existing tests? If so, who is in the loop on
> signing off on a removal?

I suspect removal isn't what you meant - but just to clarify: this
policy is only governing the visibility of tests on TBPL's default view
(ie: what tests are sheriff-managed - a sheriffing decision) rather than
determining what platforms/tests we stop running (and displaying on the
TBPL &showall=1 view) at all (which is what would require sign off).
This is something that the sheriffs already manage on a day to day basis
- so other than now having the process documented, it's business as
usual (existing job types won't suddenly be moved out of the default
view as a direct result).

On 26/03/2013 17:58, Steve Fink wrote:
> It makes an interesting example case

Thank you - this is a great way to iron out any issues with the policy.

> #3 no. The build only runs on mozilla-inbound and try. It could be
> added to mozilla-central, though it wouldn't change much (because no
> spidermonkey development lands straight on m-c afaik.) It should be
> added to the baseline compiler repo. I would argue that it doesn't
> really matter whether it's enabled on other trees, since it only runs
> when something under js/src is changed.
>
> I don't know if you want to update the wiki page to exempt things that
> are restricted to a subtree, or if there's an implicit "...but use
> common sense" override.

Agree there needs to be common sense applied in general, I've added such
to the start of the document.

> #5 yes but I didn't add anything specific for it in trychooser. I think
> I'm "good enough" here because it's a build job, not a test job, and
> the try syntax isn't easily expressible in the UI. (For the record, the
> build is scheduled whenever js/src is touched and you request a
> linux64-debug build as part of your request.)

This seems fine - I think this case is covered by #10 (documentation).
I've also tweaked the wording to imply trychooser only if appropriate.
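
For anyone documenting a similar job, the rule quoted above (the build is scheduled when js/src is touched and a linux64-debug build was requested) could be sketched roughly as below. This is only an illustration: `should_schedule` and its arguments are hypothetical and are not the actual buildbot scheduler or try parser code.

```python
def should_schedule(changed_files, requested_builds):
    """Decide whether to schedule the (hypothetical) root analysis build.

    changed_files: iterable of repository-relative paths in the push.
    requested_builds: set of (platform, build_type) pairs requested on try.
    """
    # The job only matters for pushes that touch the JS engine sources.
    touches_js = any(path.startswith("js/src/") for path in changed_files)
    # It piggybacks on a linux64 debug build being part of the request.
    wants_linux64_debug = ("linux64", "debug") in requested_builds
    return touches_js and wants_linux64_debug


# Example: a push touching js/src with a linux64 debug build requested.
print(should_schedule(["js/src/jsapi.cpp"], {("linux64", "debug")}))
```

Writing the rule down in this form in the job's documentation covers the case where it cannot be expressed in the trychooser UI.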

> #6 yes, but could the specifics be documented? Preferably with a
> mechanism, or at least advice, for handling numerical results. (eg a
> count of bad somethings, which you don't want to regress.)
>
> Given that we *just* fixed an output problem in JS tests, and we had
> no idea it was a problem, I think having an explicit description of the
> allowed formats would be useful.

I've now added extra detail here - let me know if it needs more :-)
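
As a rough sketch of the "count of bad somethings" case raised above: a job can compare its count against a checked-in baseline and print a single pass/fail line. The function name, the baseline value and the exact message layout are assumptions for illustration; the `TEST-UNEXPECTED-FAIL` and `TEST-PASS` prefixes follow the convention commonly seen in Mozilla test logs, and the wiki page remains the reference for what the log parser accepts.

```python
def check_count(name, count, baseline):
    """Return a log line saying whether `count` regressed past `baseline`.

    A count above the baseline is reported as a failure; a count at or
    below it is reported as a pass.
    """
    if count > baseline:
        return "TEST-UNEXPECTED-FAIL | %s | count %d exceeds baseline %d" % (
            name, count, baseline)
    return "TEST-PASS | %s | count %d within baseline %d" % (
        name, count, baseline)


# Example with a hypothetical baseline of 12 known hazards.
print(check_count("rooting-hazards", 14, 12))
print(check_count("rooting-hazards", 12, 12))
```

Keeping the baseline in the tree means that lowering or raising it shows up as an ordinary push.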

> #7 yes. Though I worry that explicitly laying all this out could
> eliminate some needed wiggle room. For example, if a job really had
> a 5% failure rate, and it stayed that way for 3 months, I don't think you'd
> be happy.

Yeah I agree - have changed this section now.
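
For a rough sense of why a sustained 5% failure rate hurts, assume (purely for illustration) that each run of the job fails independently with probability p. The chance that at least one of n runs fails is then 1 - (1 - p)^n. The helper name below is hypothetical.

```python
def prob_any_failure(p, runs):
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - (1.0 - p) ** runs


# With a 5% per-run failure rate, 20 runs see at least one failure
# roughly 64% of the time.
print(round(prob_any_failure(0.05, 20), 2))  # 0.64
```

Real intermittent failures are not independent, so this is only an order-of-magnitude guide.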

> #11 not really supported by mach, but it seems like this is somewhat
> test-specific anyway. My job is a build with specific configure
> options followed by a test run with an environment variable set. So
> it's not incompatible with mach, but I don't think it would do very
> much good to add it to mach, at least until we get to the point where
> we require each letter that appears on tbpl to be runnable directly
> from mach:

Have added "(if appropriate)".

> It sounds like you're already working on this with tbpl2, but I've
> thought for a while that we need something just a little below tier 1.
> Something like a persistent jobname pattern that shows certain jobs
> based on your group membership, or my preference: just a way in the
> tbpl UI to indicate failures that you don't need to backout for.
> There's no reason "visible" == "punishable by backout" if the UI makes
> it clear which failures are tolerated. Put them in a separate column on
> the right hand side or something. I like this better than hiding jobs
> entirely because it makes it harder to ignore flaky tests forever.

Yeah agree we can (a) indicate support tier in the UI so we're more keen
to display tier != 1, (b) create multiple UIs/dashboards for different
teams, (c) have a notification system. See my thoughts at:
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#The_future
