Consensus sought - when to reset try repository?

Hal Wine

Feb 28, 2014, 8:24:20 PM

to dev-pl...@lists.mozilla.org

tl;dr: what is the balance point between pushes to try taking too long
and losing repository history of recent try pushes?

Summary:
--------

As most developers have experienced, pushing to try can sometimes take a
long time. Once it takes "too long" (as measured by screams of pain in
#releng), a "try [repository] reset" is scheduled. This hurts
productivity and increases frustration for everyone involved (devs, IT,
RelEng). We don't want to do this anymore.

A reset of the try repository deletes the existing contents and
replaces them with a fresh clone from mozilla-central. While the tbpl
information will remain valid for any completed build, any attempt to
view the diffs for a try build will fail (unless you already had them in
your local repository).

Progress on resolution of the root cause:
-----------------------------------------

IT has made tremendous progress in reducing the occurrence of "long push
times", but they still are not predictable. Various attempts at
monitoring[1] and auto correction[2] have not been successful in
improving the situation. Work continues on additional changes that
should improve the situation[3].

The most recent mitigation strategy is to trade the "unknown timing"
disruption of push times increasing to a pain threshold for the
"known timing" of resetting the try repository every TCW (tree closing
window - every 6 wks currently). However, we heard from some folks that
this is too often.

The most recent try-reset-triggered-by-pain came after an interval of 6
months[4]. There was at least one report of problems just 3 months
after a reset[5].

So, the question is - what say developers -- what's the balance point
between:
- too often, making collaborating on try pushes hard
- too infrequent, introducing increasing push times

--Hal

Prior Work:
-----------
[1] bug https://bugzil.la/691459
[2] bugs https://bugzil.la/554656, https://bugzil.la/734225#c24,
    https://bugzil.la/633161, https://bugzil.la/529179
[3] bugs https://bugzil.la/770811, https://bugzil.la/937732, others
[4] bugs https://bugzil.la/894429 & https://bugzil.la/962275
[5] bug https://bugzil.la/925354

L. David Baron

Feb 28, 2014, 8:32:16 PM

to Hal Wine, dev-pl...@lists.mozilla.org

On Friday 2014-02-28 17:24 -0800, Hal Wine wrote:
> So, the question is - what say developers -- what's the balance point
> between:
> - too often, making collaborating on try pushes hard
> - too infrequent, introducing increasing push times

Why not change the try repo reset procedure so that instead of just
cloning mozilla-central, you also pull from the old try repo into
the new one all of the heads of try pushes made within the last one
or two weeks. (Presumably there's a list of them somewhere, or it
could be maintained?) Then the try reset won't break things for
those recent pushes, but only the older ones.

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)

Daniel Holbert

Feb 28, 2014, 8:40:33 PM

to L. David Baron, Hal Wine, dev-pl...@lists.mozilla.org

On 02/28/2014 05:32 PM, L. David Baron wrote:
> Why not change the try repo reset procedure so that instead of just
> cloning mozilla-central, you also pull from the old try repo into
> the new one all of the heads of try pushes made within the last one
> or two weeks. (Presumably there's a list of them somewhere, or it
> could be maintained?) Then the try reset won't break things for
> those recent pushes, but only the older ones.

This seems like a good solution.

One (possibly obvious) clarification: we'd need to rely on the pushlog
DB (rather than the changeset datestamps) when creating the list of
recent heads, since changeset datestamps are customizable and hence
unreliable.

~Daniel
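
A minimal sketch of the head selection described above: pick the head of
every push recorded within the last two weeks, using push times rather
than changeset datestamps, and build one `hg pull -r` command per head.
The record layout (push ID mapped to a "date" in epoch seconds and a
"changesets" list ordered oldest first), the repository path, and the
sample data are assumptions made for illustration, not the actual
pushlog schema or reset procedure.

```python
from datetime import datetime, timedelta, timezone


def recent_push_heads(pushes, now, max_age_days=14):
    """Return the head changeset of every push newer than the cutoff.

    `pushes` maps push ID -> {"date": epoch_seconds, "changesets": [...]},
    with the changesets of one push listed oldest first (assumed layout).
    """
    cutoff = (now - timedelta(days=max_age_days)).timestamp()
    heads = []
    for push_id in sorted(pushes, key=int):
        push = pushes[push_id]
        # Filter on the recorded push time, not the changeset datestamp.
        if push["date"] >= cutoff and push["changesets"]:
            heads.append(push["changesets"][-1])
    return heads


def pull_commands(heads, old_repo):
    """Build one `hg pull -r <head> <old_repo>` command per kept head."""
    return [["hg", "pull", "-r", head, old_repo] for head in heads]


# Made-up sample data: one stale push and one recent push.
now = datetime(2014, 3, 1, tzinfo=timezone.utc)
pushes = {
    "101": {"date": datetime(2014, 1, 5, tzinfo=timezone.utc).timestamp(),
            "changesets": ["aaa111"]},
    "102": {"date": datetime(2014, 2, 20, tzinfo=timezone.utc).timestamp(),
            "changesets": ["bbb222", "ccc333"]},
}
heads = recent_push_heads(pushes, now)
commands = pull_commands(heads, "/repos/try-old")  # hypothetical path
```

Each command would be run from inside the freshly cloned repository, so
only the recent heads and their ancestors are carried over.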

John Schoenick

Feb 28, 2014, 8:44:50 PM

to Daniel Holbert, L. David Baron, Hal Wine, dev-pl...@lists.mozilla.org

Or taking this a step further, having a rolling cronjob |hg strip|
revisions not on m-c older than a certain date would remove the need to
perform resets entirely, and give a predictable date after which your
try push would disappear. You could even add a "keep me for N days"
parameter to try syntax for pushes that we'd like to stick around.

Ryan VanderMeulen

Feb 28, 2014, 8:47:55 PM

to

On 2/28/2014 8:44 PM, John Schoenick wrote:
> Or taking this a step further, having a rolling cronjob |hg strip|
> revisions not on m-c older than a certain date would remove the need to
> perform resets entirely, and give a predictable date after which your
> try push would disappear. You could even add a "keep me for N days"
> parameter to try syntax for pushes that we'd like to stick around.

30 days is how long the logs are kept for, so maybe that would be a good
amount of time.
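
A rough sketch of how such a retention policy might pick candidates for
stripping, assuming a 30-day default and a per-push override. The
`--keep-days N` flag, the tuple layout, and the sample pushes are
hypothetical; no such try syntax exists in this thread. The sketch only
decides which heads have expired and does not run `hg strip` itself.

```python
import re
from datetime import datetime, timedelta, timezone

DEFAULT_KEEP_DAYS = 30  # assumption: match the 30-day log retention

# Hypothetical try-syntax flag, invented for this illustration.
KEEP_FLAG = re.compile(r"--keep-days[ =](\d+)")


def keep_days(commit_message):
    """Return the requested retention in days, or the default."""
    match = KEEP_FLAG.search(commit_message)
    return int(match.group(1)) if match else DEFAULT_KEEP_DAYS


def expired_heads(pushes, now):
    """Return heads whose retention period has passed.

    `pushes` is a list of (head, pushed_at, commit_message) tuples.
    """
    expired = []
    for head, pushed_at, message in pushes:
        if now - pushed_at > timedelta(days=keep_days(message)):
            expired.append(head)
    return expired


# Made-up sample data.
now = datetime(2014, 4, 15, tzinfo=timezone.utc)
march_1 = datetime(2014, 3, 1, tzinfo=timezone.utc)
april_10 = datetime(2014, 4, 10, tzinfo=timezone.utc)
pushes = [
    ("aaa111", march_1, "try: -b do -p all"),
    ("bbb222", march_1, "try: -b o -p linux --keep-days 60"),
    ("ccc333", april_10, "try: -b o -p win32"),
]
to_strip = expired_heads(pushes, now)
```

With a fixed default, the date on which a given try push disappears is
known at push time.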

L. David Baron

Feb 28, 2014, 8:55:43 PM

to John Schoenick, Daniel Holbert, Hal Wine, dev-pl...@lists.mozilla.org

On Friday 2014-02-28 17:44 -0800, John Schoenick wrote:
> Or taking this a step further, having a rolling cronjob |hg strip|
> revisions not on m-c older than a certain date would remove the need
> to perform resets entirely, and give a predictable date after which
> your try push would disappear. You could even add a "keep me for N
> days" parameter to try syntax for pushes that we'd like to stick
> around.

I'm not sure how well "hg strip" would interact with a repository
that people are pushing to at the same time, though.

Hal Wine

Feb 28, 2014, 9:02:54 PM

to L. David Baron, dev-pl...@lists.mozilla.org

On 2014-02-28 17:32 , L. David Baron wrote:
> On Friday 2014-02-28 17:24 -0800, Hal Wine wrote:
>> So, the question is - what say developers -- what's the balance point
>> between:
>> - too often, making collaborating on try pushes hard
>> - too infrequent, introducing increasing push times
>
> Why not change the try repo reset procedure so that instead of just
> cloning mozilla-central, you also pull from the old try repo into
> the new one all of the heads of try pushes made within the last one
> or two weeks. (Presumably there's a list of them somewhere, or it
> could be maintained?) Then the try reset won't break things for
> those recent pushes, but only the older ones.

David -- that's one idea that has not yet been tried. I suspect other
folks will also come up with new ideas.

However - in the meantime, what try reset schedule would devs prefer?
Are you suggesting we stay with the only-reset-when-devs-scream-in-pain
approach until a real solution is found?

--Hal

P.S. There is data deep in the bugs which casts doubt on effectiveness
of such an approach. (The issue is not strictly number of heads, but
also an unknown function of the "depth" of the heads.)

Ehsan Akhgari

Mar 1, 2014, 4:16:43 PM

to Hal Wine, L. David Baron, dev-pl...@lists.mozilla.org

On 2014-02-28, 9:02 PM, Hal Wine wrote:
> On 2014-02-28 17:32 , L. David Baron wrote:
>> On Friday 2014-02-28 17:24 -0800, Hal Wine wrote:
>>> So, the question is - what say developers -- what's the balance point
>>> between:
>>> - too often, making collaborating on try pushes hard
>>> - too infrequent, introducing increasing push times
>>
>> Why not change the try repo reset procedure so that instead of just
>> cloning mozilla-central, you also pull from the old try repo into
>> the new one all of the heads of try pushes made within the last one
>> or two weeks. (Presumably there's a list of them somewhere, or it
>> could be maintained?) Then the try reset won't break things for
>> those recent pushes, but only the older ones.
>
> David -- that's one idea that has not yet been tried. I suspect other
> folks will also come up with new ideas.
>
> However - in the meantime, what try reset schedule would devs prefer?
> Are you suggesting we stay with the only-reset-when-devs-scream-in-pain
> approach until a real solution is found?

I would recommend doing that while we try to find a real solution.

Cheers,
Ehsan

Ted Mielczarek

Mar 2, 2014, 1:53:13 PM

to John Schoenick, Hal Wine, Mozilla Platform Development

On 2/28/2014 8:44 PM, John Schoenick wrote:

> On 02/28/2014 05:40 PM, Daniel Holbert wrote:
>> On 02/28/2014 05:32 PM, L. David Baron wrote:

>>> Why not change the try repo reset procedure so that instead of just
>>> cloning mozilla-central, you also pull from the old try repo into
>>> the new one all of the heads of try pushes made within the last one
>>> or two weeks. (Presumably there's a list of them somewhere, or it
>>> could be maintained?) Then the try reset won't break things for
>>> those recent pushes, but only the older ones.

>> This seems like a good solution.
>>
>> One (possibly obvious) clarification: we'd need to rely on the pushlog
>> DB (rather than the changeset datestamps) when creating the list of
>> recent heads, since changeset datestamps are customizable and hence
>> unreliable.
>

> Or taking this a step further, having a rolling cronjob |hg strip|
> revisions not on m-c older than a certain date would remove the need
> to perform resets entirely, and give a predictable date after which
> your try push would disappear. You could even add a "keep me for N
> days" parameter to try syntax for pushes that we'd like to stick around.
>

Note, we already investigated this some time ago[1], and "hg strip"
doesn't interact well with the current pushlog hook. It's possible we
could make this work if we changed the pushlog hook to accommodate.

-Ted

1. https://bugzilla.mozilla.org/show_bug.cgi?id=633161

Gregory Szorc

Mar 5, 2014, 1:07:28 AM

to Hal Wine, dev-pl...@lists.mozilla.org

> So, the question is - what say developers -- what's the balance point
> between:
> - too often, making collaborating on try pushes hard
> - too infrequent, introducing increasing push times

I wouldn't have such a big issue with Try resets if we didn't lose
information in the process. I believe every time there's been a Try
reset, I've lost data from a recent (<1 week) Try push and I needed to
re-run that job - incurring extra cost to Mozilla and wasting my time. I
also periodically find myself wanting to answer questions like "what
percentage of tree closures are due to pushes that didn't go to Try
first." Data loss stinks.

I'd say the goal should be "no data loss." I have an idea that will
enable us to achieve this.

Let's expose every newly-reset instance of the Try repo as a separate
URL. We would still push to ssh://hg.mozilla.org/try, but the URLs
printed and the URLs used by automation would be URLs to repos that
would never go away. e.g.
https://hg.mozilla.org/tries/try1/rev/840f122d1286 ("try1" being the
important bit in there). When we reset Try, you'd hand out URLs to
"try2." You could reset the writable Try repo as frequently as you
desired and aside from a slightly different repo URL being given out,
nobody should notice.

The main drawbacks of this approach that I can think of are all in
automation: parts of automation are very repo/URL centric and having
effectively dynamic URLs might break assumptions. But making automation
work against arbitrary URLs is a good thing, as it allows automation to
be more flexible and this allows people to experiment with alternate
repo hosting, landing tools, landing-integrated code review tools, etc
without requiring special involvement from RelEng. "Everything is a web
service and is self-service," etc.
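
The generation-numbered URL scheme proposed above can be sketched as
follows. Only the URL pattern (https://hg.mozilla.org/tries/tryN/rev/...)
comes from the message; the class name, the in-memory counter, and the
second revision hash are assumptions made for illustration, and a real
deployment would need to persist the generation number somewhere.

```python
BASE_URL = "https://hg.mozilla.org/tries"


class TryGenerations:
    """Track which read-only Try generation new pushes belong to."""

    def __init__(self, generation=1):
        self.generation = generation

    def reset(self):
        """Start a new generation; older generation URLs stay valid."""
        self.generation += 1
        return self.generation

    def url_for(self, rev, generation=None):
        """Return the permanent URL for a revision in a generation."""
        gen = self.generation if generation is None else generation
        return f"{BASE_URL}/try{gen}/rev/{rev}"


tries = TryGenerations()
before = tries.url_for("840f122d1286")
tries.reset()
after = tries.url_for("0123456789ab")  # made-up revision hash
still_valid = tries.url_for("840f122d1286", generation=1)
```

Because the generation is part of the URL, a link handed out before a
reset keeps pointing at the same repository afterwards.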

Ed Morley

Mar 5, 2014, 6:27:27 AM

to Gregory Szorc, Hal Wine, dev-pl...@lists.mozilla.org

On 05 March 2014 06:07:28, Gregory Szorc wrote:
> I wouldn't have such a big issue with Try resets if we didn't lose
> information in the process. I believe every time there's been a Try
> reset, I've lost data from a recent (<1 week) Try push and I needed to
> re-run that job

Whilst it doesn't help with being able to refer to the diff, fwiw bug
721152 means that TBPL now supports accessing the results of a Try run
even after repo-reset, so long as you use the single-revision URL form
that appears in the "thank you for your try push" email.

Note that TBPL data is purged for all trees after 30 days (in line with
when the logs are purged from ftp.m.o).

Ed

Hal Wine

Mar 7, 2014, 5:41:23 PM

to dev-pl...@lists.mozilla.org

On 2014-02-28 17:24 , Hal Wine wrote:
> tl;dr: what is the balance point between pushes to try taking too long
> and losing repository history of recent try pushes?

Based on the responses to this specific question, we'll go back to
waiting for developers to notify IT when there is enough performance
impact to warrant a reset of the try repository. I've added the
reporting instructions to the wiki page about try:
https://wiki.mozilla.org/ReleaseEngineering/TryServer#Pushes_to_try_take_a_very_long_time

Thanks to everyone else for showing interest in the underlying problems.
Suggestions for that are best added to the bugs cited below.

Thanks!
--Hal

Daniel Holbert

Apr 30, 2014, 7:52:26 PM

to Hal Wine, dev-pl...@lists.mozilla.org

On 03/07/2014 02:41 PM, Hal Wine wrote:
> On 2014-02-28 17:24 , Hal Wine wrote:
>> tl;dr: what is the balance point between pushes to try taking too long
>> and losing repository history of recent try pushes?
> Based on the responses to this specific question, we'll go back to
> waiting for developers to notify IT when there is enough performance
> impact to warrant a reset of the try repository

As documented on
https://bugzilla.mozilla.org/show_bug.cgi?id=994028
we've now had multiple instances in the past few weeks where Try has
been horked (refusing all pushes) for hours at a time, with no clear
reason why.

I'm not sure if this is caused by Try having too many heads & needing a
reset, but it seems like it could be. (It also could be *indirectly*
caused by the too-many-heads issue, too; e.g. perhaps someone
interrupted a push because it was taking too long (due to too many
heads), and their client inadvertently left something on the server
locked, which then locks everyone else out for hours.)

Whatever the cause, it's feeling more and more like periodic, automatic
Try resets would be helpful to keep things running smoothly.

Would it be possible to set up a system along the lines of dbaron's
suggestion earlier in this post? (Frequent resets, with a post-reset
step to pull in the most recent ~2 weeks worth of heads from the old
repo, so that people's try pushes don't mysteriously disappear if they
happen to push right before a reset.)

Thanks,
~Daniel

Hal Wine

May 1, 2014, 1:32:51 AM

to Daniel Holbert, dev-pl...@lists.mozilla.org

On 2014-04-30 16:52 , Daniel Holbert wrote:
> On 03/07/2014 02:41 PM, Hal Wine wrote:
>> On 2014-02-28 17:24 , Hal Wine wrote:
>>> tl;dr: what is the balance point between pushes to try taking too long
>>> and losing repository history of recent try pushes?
>> Based on the responses to this specific question, we'll go back to
>> waiting for developers to notify IT when there is enough performance
>> impact to warrant a reset of the try repository

Thanks for reopening this thread.

>
> As documented on
> https://bugzilla.mozilla.org/show_bug.cgi?id=994028
> we've now had multiple instances in the past few weeks where Try has
> been horked (refusing all pushes) for hours at a time, with no clear
> reason why.
>
> I'm not sure if this is caused by Try having too many heads & needing a
> reset, but it seems like it could be. (It also could be *indirectly*
> caused by the too-many-heads issue, too; e.g. perhaps someone
> interrupted a push because it was taking too long (due to too many
> heads), and their client inadvertently left something on the server
> locked, which then locks everyone else out for hours.)
>
> Whatever the cause, it's feeling more and more like periodic, automatic
> Try resets would be helpful to keep things running smoothly.

Yes, or a better working definition of "too much performance impact". In
this case, we had a 4h10m gap in the pushlog, and now things are back to
"normal". A reset would take about that long to perform.

>
> Would it be possible to set up a system along the lines of dbaron's
> suggestion earlier in this post? (Frequent resets, with a post-reset
> step to pull in the most recent ~2 weeks worth of heads from the old
> repo, so that people's try pushes don't mysteriously disappear if they
> happen to push right before a reset.)

It is something that could be tried - we'll try a few dry runs to see
how much this adds to the reset try duration (given that we have to pull
those changes from the "slow repo").

I also have some fresh thoughts on https://bugzil.la/691459 - there may
be some log correlation possible to get us hard data on overall success
rates and push times.

Looking forward to getting a newer (and hopefully better) approach to
this recurring issue.

--Hal
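
One possible way to turn "too much performance impact" into a number is
to scan the pushlog for long silences such as the 4h10m gap mentioned
above. This is only a sketch: the one-hour threshold and the sample
timestamps are made up, and a quiet period can also simply mean that
nobody was pushing.

```python
from datetime import datetime, timedelta


def pushlog_gaps(push_times, threshold):
    """Return (start, length) for every gap of at least `threshold`.

    `push_times` is an iterable of datetimes, one per recorded push.
    """
    ordered = sorted(push_times)
    return [
        (earlier, later - earlier)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier >= threshold
    ]


# Made-up sample timestamps containing one 4h10m gap.
push_times = [
    datetime(2014, 4, 30, 9, 0),
    datetime(2014, 4, 30, 9, 12),
    datetime(2014, 4, 30, 13, 22),
    datetime(2014, 4, 30, 13, 30),
]
gaps = pushlog_gaps(push_times, timedelta(hours=1))
```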