Feedback wanted: Downtime windows that affect builds

John Hopkins

Jan 11, 2012, 4:16:52 PM
to dev-pl...@lists.mozilla.org
Background:

This Friday's downtime is currently scheduled for 9am-12pm
PST to allow IT time to upgrade hg.mozilla.org and stage.mozilla.org.

Release Engineering will schedule buildbot masters for graceful shutdown
approximately 2 hours ahead (7am) of the above downtime to give
in-progress builds some time to complete. No builds will be processed
during this ~2 hour time window.

The Problem:

A concern raised is that during the ~2 hour "no-builds" window,
developers may check in broken changes without realizing it, leading to
broken builds with no clear indication of which change broke the build.
Triaging and rolling back these broken changes after a downtime window
could be time-consuming.

In the past, we've relied on developers understanding the implications
of submitting a change near certain types of downtime, but not everyone
has a clear understanding of when it is "ok" to submit changes
prior to a downtime, or even which types of downtime require such
consideration.

A Proposal:

We extend the official downtime window to cover the "no-builds" period.
This makes the downtime window larger, obviously, but being explicit
obviates people making incorrect assumptions or relying on guesswork.


Are there any objections to this proposed change to downtime scheduling?


Thanks,
John

Anthony Hughes

Jan 11, 2012, 4:21:51 PM
to John Hopkins, dev-pl...@lists.mozilla.org
It was my understanding that the 9am-noon window would essentially mean we can't push 10.0b4 out to beta until the afternoon. Does extending this window put an afternoon push to beta at risk?
_______________________________________________
dev-planning mailing list
dev-pl...@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-planning

John Hopkins

Jan 11, 2012, 4:26:50 PM
to Anthony Hughes, dev-pl...@lists.mozilla.org
On 12-01-11 04:21 PM, Anthony Hughes wrote:
> It was my understanding that the 9am-noon window would essentially mean we can't push 10.0b4 out to beta until the afternoon. Does extending this window put an afternoon push to beta at risk?

The downtime extension would only affect the start of the downtime
(7am-noon instead of 9am-noon) so the afternoon push would be unaffected
unless it is dependent on something happening between 7am-9am.

John

Boris Zbarsky

Jan 11, 2012, 5:21:01 PM
On 1/11/12 4:16 PM, John Hopkins wrote:
> In the past, we've relied on developers understanding the implications
> of submitting a change near certain types of downtime

Please just close the tree when you start the builder buildbot shutdown.
The only "drawback" is that people who would otherwise have gotten
broken builds will get a message saying the tree is closed when they try
to push... which seems like a win to me personally.

-Boris

Justin Wood (Callek)

Jan 11, 2012, 8:52:54 PM
to John Hopkins
John Hopkins wrote:
> On 12-01-11 04:21 PM, Anthony Hughes wrote:
>> It was my understanding that the 9am-noon window would essentially
>> mean we can't push 10.0b4 out to beta until the afternoon. Does
>> extending this window put an afternoon push to beta at risk?
>
> The downtime extension would only affect the start of the downtime
> (7am-noon instead of 9am-noon) so the afternoon push would be unaffected
> unless it is dependent on something happening between 7am-9am.
>
> John
>
>

And to extend this: it's downtiming the developer side earlier, because
actual developer builds get shut off earlier. The actual work that
requires the downtime would still be scheduled and done when it always
would have been, 9am-noon, which is when surf, hg, etc. would go down.

The idea with this proposal (which a few of us discussed in IRC first)
is to downtime the "developer affecting" period, to eliminate guesswork
for involved parties, rather than arbitrarily shortening the downtime to
just the time the work is done.

(does that make sense?)

--
~Justin Wood (Callek)

Chris Cooper

Jan 12, 2012, 11:02:13 AM
to dev-pl...@lists.mozilla.org
On 12-01-11 4:16 PM, John Hopkins wrote:
> We extend the official downtime window to cover the "no-builds"
> period. This makes the downtime window larger, obviously, but being
> explicit obviates people making incorrect assumptions or relying on
> guesswork.
>
> Are there any objections to this proposed change to downtime
> scheduling?

I think the most important thing here is properly setting developer
expectations.

After talking to jhopkins about this, one thing releng could do to
help is add some boilerplate to the downtime notice listing the
expected build times for various platforms.

If we explicitly warn people that they're unlikely to get builds if
they check-in during that window immediately preceding a downtime,
would that defray some of the concern here?

cheers,
--
coop

Boris Zbarsky

Jan 12, 2012, 11:16:40 AM
On 1/12/12 11:02 AM, Chris Cooper wrote:
> If we explicitly warn people that they're unlikely to get builds if
> they check-in during that window immediately preceding a downtime,
> would that defray some of the concern here?

This seems strictly worse, from my perspective as a developer, than just
closing the tree. For one thing it requires me to notice the downtime
notice, keep remembering when the downtime is, etc.

Whereas closing the tree means that I can just push as normal, and if it
won't work I'll be told that.

-Boris

Chris Cooper

Jan 12, 2012, 3:22:40 PM
to dev-pl...@lists.mozilla.org
On 12-01-12 11:16 AM, Boris Zbarsky wrote:
> This seems strictly worse, from my perspective as a developer, than
> just closing the tree. For one thing it requires me to notice the
> downtime notice, keep remembering when the downtime is, etc.
>
> Whereas closing the tree means that I can just push as normal, and
> if it won't work I'll be told that.

I encourage you to push as normal. You shouldn't need to know the
details of the downtime in order to work effectively around it.

Adding some extra information to the downtime notice is only meant to
help developers who are on the fence about pushing before a downtime
make an informed decision. It's by no means prescriptive. Our
self-serve tools should be helping people navigate and recover from
downtimes if they happen to push at an inopportune time.

cheers,
--
coop

Justin Lebar

Jan 12, 2012, 4:35:08 PM
to Chris Cooper, dev-pl...@lists.mozilla.org
(In reply to bz, ccooper wrote)
>> Whereas closing the tree means that I can just push as normal, and
>> if it won't work I'll be told that.
>
> I encourage you to push as normal. You shouldn't need to know the
> details of the downtime in order to work effectively around it.

> I think the most important thing here is properly setting developer
> expectations.

"The tree is open" should mean "the tree is open and we expect to have
full test coverage." Speaking as a developer, this is my expectation.
I suspect it's bz's as well.

It sounds like we won't have full test coverage in this two-hour
window, because some tests won't finish in time.

Rather than ask the sheriffs to use self-serve to figure out
post-facto what exactly burned the tree, can we just close the tree?

In fact, I imagine that whoever is tending the tree this day will
close it, even if IT leaves it technically open.

I don't understand what is the motivation for leaving the tree open in
this case. Is the fear that closing it for these extra two hours will
harm productivity? We're saying here that, in our estimation as
developers, not closing the tree would be more harmful.

> Adding some extra information to the downtime notice is only meant to
> help developers who are on the fence about pushing before a downtime
> make an informed decision. It's by no means prescriptive. Our
> self-serve tools should be helping people navigate and recover from
> downtimes if they happen to push at an inopportune time.

Axel Hecht

Jan 12, 2012, 5:26:16 PM
On 12.01.12 22:35, Justin Lebar wrote:
> (In reply to bz, ccooper wrote)
>>> Whereas closing the tree means that I can just push as normal, and
>>> if it won't work I'll be told that.
>>
>> I encourage you to push as normal. You shouldn't need to know the
>> details of the downtime in order to work effectively around it.
>
>> I think the most important thing here is properly setting developer
>> expectations.
>
> "The tree is open" should mean "the tree is open and we expect to have
> full test coverage." Speaking as a developer, this is my expectation.
> I suspect it's bz's as well.
>
> It sounds like we won't have full test coverage in this two-hour
> window, because some tests won't finish in time.
>
> Rather than ask the sheriffs to use self-serve to figure out
> post-facto what exactly burned the tree, can we just close the tree?
>
> In fact, I imagine that whoever is tending the tree this day will
> close it, even if IT leaves it technically open.
>
> I don't understand what is the motivation for leaving the tree open in
> this case. Is the fear that closing it for these extra two hours will
> harm productivity? We're saying here that, in our estimation as
> developers, not closing the tree would be more harmful.

Having lived through a downtime like this with a landing of mine, it's
rather confusing.

What happens is that your existing builds finish, but the builds that
are triggered by your builds finishing, like tests, get put on hold, and
are only started after the downtime.

Effectively, the test coverage and all is the very same; the difference
is that your builds take two hours+ longer than you thought they would.

For inbound, this isn't much of a problem for the people committing. For
landings on the non-inbound trees, the landing engineer needs to
understand that during the actual downtime, there's going to be a deaf
master between his build requests and the eager slaves, effectively
making the window in which you have to watch your builds longer than usual.

Axel

Chris Cooper

Jan 12, 2012, 8:29:46 PM
to dev-pl...@lists.mozilla.org
On 12-01-12 4:35 PM, Justin Lebar wrote:
> I don't understand what is the motivation for leaving the tree open
> in this case. Is the fear that closing it for these extra two
> hours will harm productivity? We're saying here that, in our
> estimation as developers, not closing the tree would be more
> harmful.

This is part of the reason why we created a buildduty position within
release engineering: to help the sheriff understand and deal with tree
closures resulting from releng/IT downtimes.

We all share the same goal here, I believe: to minimize the impact of
tree closures to developers. The sheriff and buildduty rep should
definitely be coordinating before a downtime to decide how early the
trees should be closed, and how long they should stay closed. Every
downtime is going to be different WRT who's trying to land and the
relative importance of those landings, so the coordination is key.

We have the self-serve tools now that allow us to re-trigger builds,
so (IMO) the only time sheriffs *need* to be more aggressive about
closing the tree early is when they won't be around to (or are
unwilling to, for whatever reason) re-trigger failed builds after the
downtime. Again, buildduty can help here too if asked.

I'm trying not to be prescriptive about always adding 2 hours of tree
closure. We usually ask for 3 hours of downtime, and 5 hours seems
like a long time to stop landing things altogether.

cheers,
--
coop