Impact of mozharness on developers

Axel Hecht

Nov 17, 2010, 11:06:12 AM

Hi folks,

as you may have seen, releng is working on a redo of our builds,
mozharness. Aki's summary is at
http://drkscrtlv.livejournal.com/322809.html.

I consider a public discussion on this in .planning to be in order,
because this move has a deep impact on how we can evolve our habits on
landing code, and how to make it stick. We need to have this discussion
now, too, as now is the time we kill existing code [1].

I invite releng to make a more complete case in favor of mozharness in a
follow up. I'll follow up on this post with my take on it as well.

Axel

Axel Hecht

Nov 17, 2010, 11:16:28 AM

Oops, [1] is https://bugzilla.mozilla.org/show_bug.cgi?id=608004

Axel

Mike Beltzner

Nov 17, 2010, 11:17:17 AM

to Axel Hecht, dev-pl...@lists.mozilla.org

On 2010-11-17, at 11:06 AM, Axel Hecht wrote:

> I consider a public discussion on this in .planning to be in order, because this move has a deep impact on how we can evolve our habits on landing code, and how to make it stick. We need to have this discussion now, too, as now is the time we kill existing code [1].

[1]: footage not found :)

> I invite releng to make a more complete case in favor of mozharness in a follow up. I'll follow up on this post with a my take on it as well.

From Aki's summary I didn't see how this was going to affect our mechanisms and habits for coding changes to the tree. Hopefully your or RelEng's follow up here will clarify this.

I'm sure that what Aki's proposing is of benefit to us - his blog post certainly shows some fantastic counterexamples - but I'm not sure I was left with any clearer understanding of what would be different or how it would affect our contributors.

cheers,
mike

Jeff Muizelaar

Nov 17, 2010, 11:28:42 AM

to Axel Hecht, dev-pl...@lists.mozilla.org

On 17-Nov-10, at 11:06 AM, Axel Hecht wrote:

> Hi folks,
>
> as you may have seen, releng is working on a redo of our builds,
> mozharness. Aki's summary is at http://drkscrtlv.livejournal.com/322809.html
> .
>
> I consider a public discussion on this in .planning to be in order,
> because this move has a deep impact on how we can evolve our habits
> on landing code, and how to make it stick. We need to have this
> discussion now, too, as now is the time we kill existing code [1].

Can you elaborate on the deep impact that this would have? I briefly
talked with some releng people about this and couldn't determine any
impact that such a change would have on me.

-Jeff

Axel Hecht

Nov 17, 2010, 11:58:02 AM

So, here's my personal take on what mozharness will do.

It's going to have a few examples out of the l10n world, but most of
this applies just as well to regular builds and test runs.

Also, you probably saw my latest series of blog posts fly by. I wrote
those in a different context, but I'll use some of that.

Here's what I see mozharness doing to folks consuming the builds: it'll
make builds less granular. Right now, you hardly notice, and you may
even consider fewer ===== Output started ===== lines in your build logs
to be a good thing.

Dropping the granularity comes with three downsides:
- you broke the tree, and want to see what's at fault. HG checkout,
compile, make package, make check? Right now, you'd just see which step
failed, if tinderbox wasn't serving you the opposite.
- you broke the compile, landed a fix, and you won't see the previously
failing step pass until the very end of the run.
- you want to figure out what a build does, to tweak a step. Right now,
you should be able to look at things step by step, reproduce locally,
and compare the logs. With mozharness, you see one big chunk and need to
figure out a dozen command line arguments to make it run just this one
fragment. And do that again to figure out how to rerun that fragment.
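
A minimal sketch of the first two points, assuming hypothetical step names and stand-in commands (this is neither buildbot nor mozharness code): when each step is run and recorded on its own, the failing step can be named directly and its output read in isolation.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    returncode: int
    output: str


def run_steps(steps):
    """Run (name, argv) pairs in order, stopping at the first failure.

    Each step keeps its own captured output, so a failure can be pinned
    on one step instead of being buried in one big log.
    """
    results = []
    for name, argv in steps:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append(StepResult(name, proc.returncode, proc.stdout + proc.stderr))
        if proc.returncode != 0:
            break
    return results


def first_failure(results):
    """Return the name of the failing step, or None if all steps passed."""
    for result in results:
        if result.returncode != 0:
            return result.name
    return None


# Hypothetical stand-ins for "hg checkout", "compile", "make package"
# and "make check"; the real commands are not reproduced here.
PY = sys.executable
steps = [
    ("checkout", [PY, "-c", "print('checked out')"]),
    ("compile", [PY, "-c", "import sys; print('error: boom'); sys.exit(1)"]),
    ("package", [PY, "-c", "print('packaged')"]),
    ("check", [PY, "-c", "print('checked')"]),
]

results = run_steps(steps)
print("failed step:", first_failure(results))
```

With a single opaque script, the same information has to be dug out of one combined log.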

Let's look at a concrete example. Some of you may have seen that we have
serious trouble with fennec l10n builds. The single-locale builds are
all busted, android is still lacking pieces, whatnot. Let's see how
properly reporting granular builds makes a difference. Let's look at the
steps of a maemo multi-locale build and a single locale repack.
http://l10n.mozilla.org/bounty/build/1895091 is the bounty url that
shows the individual steps for the first.

You see distinct steps that do the original en-US build, then you see a
bit of "let's get compare-locales", and then for a series of locales,
it's doing an hg checkout, a compare-locales, and a make chrome-% step.
If we actually exposed the logs for each step, you could even run the
thing as you go, and compare the output upstream with what you get locally.
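
The step sequence described above can be sketched as a flat list, one entry per reported step. The locale codes and the label wording are assumptions for illustration; only the order (en-US build, compare-locales setup, then checkout, compare-locales and make chrome-% per locale) comes from the description.

```python
def multilocale_steps(locales):
    """Return human-readable step labels for a multi-locale build.

    The labels are descriptive placeholders, not literal command lines.
    """
    steps = ["build en-US", "get compare-locales"]
    for locale in locales:
        steps.append(f"hg checkout {locale}")
        steps.append(f"compare-locales {locale}")
        steps.append(f"make chrome-{locale}")
    return steps


# Hypothetical locale list.
for number, label in enumerate(multilocale_steps(["de", "fr"]), start=1):
    print(number, label)
```

Each label corresponds to one step that can be reported, logged and rerun on its own.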

Once you use mozharness, all you'll see is one single python script run
for anything, at least for all things after the en-US build.

So instead of Brad saying "I think we're going to need some step by step
how to for repacks if any progress is going to be made here"
(https://bugzilla.mozilla.org/show_bug.cgi?id=609430#c7), it'd be,
"someone needs to explain all those options to mozharness, I haven't
used it in a year". I totally sympathize with Brad who didn't want to
figure this out by reading through
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1289984400.1289988999.2704.gz&fulltext=1

The repack, on the other hand,
http://l10n.mozilla.org/bounty/build/1895095, does a bunch of setup,
gets the en-US tar and deb, gets and runs compare-locales, and then calls
into make-installers-%.

End result? We could have easily shipped a localized Beta 1 for mobile,
in a much better shape than Beta 2, if the mobile team could have just
looked at the build and made sure that the steps that work on maemo and
desktop work on android. Instead of having the same argument as last
year about who's responsible for understanding the stuff over months.

And that's just getting the builds we do exposed in the detail that
we're using them.

The following is something that may very well apply to
intermittent-orange, too. Just that I have less experience with juggling
that type of data. The l10n nightly builds pretty reliably hit
intermittent failures (almost orange, just fatal, didn't say "random",
too, mike). There's a baseline of builds that really just fail to repack
(missing files), and then there's a few minutes mostly every day where
uploads to stage just fail due to thrown away ssh connections. Making
sure that infrastructure problems are detected, filed, and fixed
requires that we actually run our builds at a granularity that allows us to
do that. mozharness would have to recreate all of buildbot again to get
there, I guess, rather than pretend to be buildbot when it isn't. The level of
detail we have today really allows one to file an IT bug and say "ssh
uploads failed between 3:40 and 3:42, there were 14 concurrent uploads
at that time". The numbers are not based on concrete data, as, well, I
don't have access to a live version of our data.
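
A sketch of the kind of query that per-step records make possible. The records below are made up (as noted, no live data is at hand), and the tuple layout is an assumption rather than buildbot's actual schema.

```python
from datetime import datetime


def t(hhmm):
    """Parse an 'HH:MM' string on an arbitrary fixed day."""
    return datetime.strptime("2010-11-17 " + hhmm, "%Y-%m-%d %H:%M")


# Made-up upload step records: (locale, start, end, succeeded).
uploads = [
    ("de", t("03:35"), t("03:39"), True),
    ("fr", t("03:39"), t("03:41"), False),
    ("ja", t("03:40"), t("03:42"), False),
    ("pl", t("03:41"), t("03:45"), True),
    ("ru", t("03:50"), t("03:53"), True),
]


def failure_window(records):
    """Return (earliest start, latest end) over failed records, or None."""
    failed = [r for r in records if not r[3]]
    if not failed:
        return None
    return min(r[1] for r in failed), max(r[2] for r in failed)


def concurrent(records, window):
    """Return records whose time range overlaps the window (inclusive)."""
    start, end = window
    return [r for r in records if r[1] <= end and r[2] >= start]


window = failure_window(uploads)
busy = concurrent(uploads, window)
print(
    f"ssh uploads failed between {window[0]:%H:%M} and {window[1]:%H:%M}, "
    f"{len(busy)} uploads overlapped that window"
)
```

This only works if upload is its own recorded step with start and end times; a single script-level result per build would not carry that detail.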

Quite generally, having one architecture that does our builds (buildbot)
instead of two architectures stacked on top of each other (mozharness on
buildbot) appears to be a much more agile and reliable setup.

Axel

Aki Sasaki

Nov 18, 2010, 7:30:24 AM

My late night followup, after a long day debugging Android Tegra bustage:

http://drkscrtlv.livejournal.com/322870.html

I hope most of it is semi coherent English.

I think I've covered most of the response to your post; I can also write
a followup that more closely follows the

> orig post
reply

format if that's clearer.

Axel Hecht

Nov 18, 2010, 10:30:25 AM

Firstly, I need to say sorry to Aki for apparently putting him on the
spot here. Using the name of his baby for a design decision wasn't the
right thing to do.

That said, there are a few points in his post that I want to follow up on.

> Buildbot was not written to micromanage slaves.
>
> The above statement sums up my first conversation with Brian Warner,
> where we vehemently agreed that too much complex logic had been
> relegated to Mozilla's buildbot masters.
>
> Even if one ignores his original intent, I assert that moving complex
> logic from an overloaded master to its relatively unloaded slaves will be
>
> 1. more efficient
> 2. more scalable
> 3. more portable

Each and every build we do, l10n or not, does a clobbering software
install of http://hg.mozilla.org/build/tools and a ton of steps to call
the clobberer. L10n builds also do vanilla installs of compare-locales
and mar. If we're talking about micromanaging...

Overloaded masters is a complaint that I can't really accept as long as
tinderbox mails are generated on them.

I guess I'm missing the point about complex logic in that assertion.

I don't intend to comment on most of the post, because it's pretty
detailed on how much energy we'd spend on getting things back we already
have.

There's one part of Aki's post that actually says why he's doing
mozharness, and I'd be happy to learn if that overlaps with what
bhearsum does. And that's

> I've been trying to solve real problems with real deadlines. Like
> MultiNightlyL10n being the only real blocker to moving the mobile
> build infrastructure from the crufty buildbot-0.7 branch to the
> supported and shiny default 0.8.x branch.

which I read together with (quoted from earlier in the post)

> I mean, the barrier of entry to buildbot is... high. First, install
> buildbot! Then, navigate our buildbot-configs and buildbotcustom
> repos (easy!), set up your master, then set up a slave that points to
> the master, then somehow use the appropriate one of the six buildbot
> methods to trigger a build/test/repack that you want, and debug from
> there.

yielding to "let's move things out of buildbot into other code". This
concerns me deeply. This again has nothing to do with l10n in
particular, but I read this as "the code is so intertwingled, I can't
change it and test it".

If that's remotely true, we should gather energy, forces, and excitement
in all of Mozilla around fixing that. Enough excitement to close the
tree for a day and have it metered for a week to then actually deploy a
fixed setup. If there are dependencies on main-code developments, let's
get them called out.

Axel

John O'Duinn

Nov 19, 2010, 11:44:08 PM

to Mike Beltzner, Axel Hecht, dev-pl...@lists.mozilla.org

On 11/18/10 1:17 AM, Mike Beltzner wrote:

> On 2010-11-17, at 11:06 AM, Axel Hecht wrote:
>
>> I consider a public discussion on this in .planning to be in order, because this move has a deep impact on how we can evolve our habits on landing code, and how to make it stick. We need to have this discussion now, too, as now is the time we kill existing code [1].
>

> [1]: footage not found :)
>

>> I invite releng to make a more complete case in favor of mozharness in a follow up. I'll follow up on this post with a my take on it as well.
>

> From Aki's summary I didn't see how this was going to affect our mechanisms and habits for coding changes to the tree. Hopefully your or RelEng's follow up here will clarify this.
>
> I'm sure that what Aki's proposing is of benefit to us - his blog post certainly shows some fantastic counterexamples - but I'm not sure I was left with any clearer understanding of what would be different or how it would affect our contributors.
>
> cheers,
> mike

What Aki is proposing will be of great benefit. Stepping back to set
context:

tl;dr: Mozharness is a project to refactor internal ReleaseAutomation
code. This includes moving code out of buildbot custom libraries and
into standalone scripts/programs/Makefiles.

Mozharness is important, exciting work because:

1) upgrading buildbot will be easier:
Today, upgrading buildbot in production is complicated because of all
our code in buildbot custom. Every upgrade requires significant testing
of these custom libraries, and some recoding if buildbot internals
change. Moving this logic out of buildbot custom to external scripts
makes it easier for us to upgrade buildbot in production going forward.

2) refactoring is needed for Fennec:
Fennec Release Automation still runs on buildbot 0.7.x, while Firefox
Release Automation runs on (incompatible) buildbot 0.8.2. Refactoring
and consolidating these gets us improved Fennec automation, and one code
base for RelEng to maintain going forward.

3) others can use the *same* scripts as RelEng:
Moving the consolidated logic out of buildbot custom means that others
outside of RelEng can run the *exact* same scripts and Makefile targets
that RelEng uses. This is a big help to developers debugging build/test
differences between something running on their machine vs a RelEng system.

Obviously, anyone using undocumented internal buildbot custom code will
have to modify their code as we refactor. Going forward, we recommend
that people do not rely on undocumented internals. Instead, we recommend
using the JSON feed available and already being used by TBPL. If this
“supported API” isn’t sufficient, please file bugs to have us fix the
supported API – it’s much better than using undocumented internals and
having to rewrite your code every time we fix something internally.

Hopefully that helps explain what Mozharness is, and why we're excited
by it.

tc
John.

Mike Shaver

Nov 19, 2010, 11:50:53 PM

to jod...@mozilla.com, Axel Hecht, dev-pl...@lists.mozilla.org

John,

This sounds like a big step forward. Thanks for the explanation.

Mike

On Nov 19, 2010 11:45 PM, "John O'Duinn" <jod...@mozilla.com> wrote:
>
>
> On 11/18/10 1:17 AM, Mike Beltzner wrote:
>> On 2010-11-17, at 11:06 AM, Axel Hecht wrote:
>>

>>> I consider a public discussion on this in .planning to be in order,
>>> because this move has a deep impact on how we can evolve our habits on
>>> landing code, and how to make it stick. We need to have this discussion now,
>>> too, as now is the time we kill existing code [1].
>>

>> [1]: footage not found :)
>>

>>> I invite releng to make a more complete case in favor of mozharness in a
>>> follow up. I'll follow up on this post with a my take on it as well.
>>

> _______________________________________________
> dev-planning mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-planning

Clint Talbert

Nov 29, 2010, 10:53:50 AM

As one of the folks who gets to try to exactly recreate the releng
structure and add new things to it on a regular basis, my team and I are
really excited about mozharness. I think that Axel is bringing up good
concerns w.r.t. granularity of logging and tracking, but I think all of
that can be done inside the new architecture. These may be good
requirements for that new architecture, but they are not hard problems
to solve. The hard problems that mozharness will solve are the ones
that John points out below which are gigantic pain points in our current
infrastructure.

Our current release engineering automation was not written in a day,
week, or even a month, and changing it will not be a quick process. If
the current buildbot mess could be quickly and effectively untangled,
trust me, releng would have already done it. It's not something you can
close the tree for a week and fix. I think the purposeful, considered,
restructuring that Aki is proposing is the right way to go about it.

Clint