Suggested changes to sheriffing

Vladimir Vukicevic

Nov 12, 2008, 9:20:03 PM

Over the past few weeks (or months), it's become increasingly obvious
that the current sheriff structure really doesn't work when there are
serious problems that need to be dealt with. Things function fine if
the sheriff serves as a watchdog for noticing performance regressions
(though this should be everyone's responsibility, and we need tools to
make this easier), if he/she's just watching to make sure we don't have
too many checkins at once, or other similar 'maintenance' tasks.

But if there are serious problems -- machine instability, intermittent
test failures, or tinderbox-only crashes to name a few -- the current
system fails for two main reasons: there is no continuity as the sheriff
changes from day-to-day, and the daily sheriffs have no clear escalation
path if they are faced with a problem that they can't handle.

So, to resolve these two problems, I would suggest that we create a new
super-sheriffs group, much like we have a super-reviewers group. The
set of people within this group would, as a group:

* have continuous responsibility for tests, crashes, etc. on the
tinderbox, meaning that they would be on the hook for at least being
aware of and tracking these areas;

* be the people who sheriff when the day's sheriff isn't around -- that
is, add another step between the day's sheriff and #developers;

* give sheriffs a specific group of people whom they can contact when
they have problems;

* have direct access to all tinderbox machines to be able to diagnose
problems as they occur.

There are a number of people who have been performing a similar role
already, but I think it would be helpful to formalize this and ensure
that those people have the tools they need to do their jobs efficiently
(specifically, direct access to the tinderboxes).

I think that this, along with creating a sheriff's newsgroup (do we have
one already?) to better track discussions about both daily problems and
ongoing ones, would help us get a handle on what's going on with the
tree when problems arise.

- Vlad

Johnathan Nightingale

Nov 12, 2008, 10:13:43 PM

to Vladimir Vukicevic, dev-pl...@lists.mozilla.org

Vladimir Vukicevic wrote:
> So, to resolve these two problems, I would suggest that we create a new
> super-sheriffs group, much like we have a super-reviewers group. The
> set of people within this group would, as a group:

This is an interesting idea, and seems to get to the heart of some of
the concerns I've had with our current system too. As you say, it works
well enough when the system isn't stressed, but it falls down when
things get tight.

I'm trying super-sheriffs on as an idea, and part of me wonders if
really this is what the "sheriffs" group is supposed to be. Are the
sets really that distinct? Because the Sheriffs are the group of people
who have been tagged as "ought to be able to manage the tree, given
their experience with/exposure to it, and dedication to its health"
which is much of what you want in super-sheriffs.

I think the reasons you propose a distinct super-sheriffs group are:

1) Not all sheriffs actually do behave in the ways I describe - maybe
they don't know how to sheriff, or don't want to, or don't even know
that they are on the list? The current list, after all, wasn't created
by enlisting only interested and excited people. It was closer to just
an amalgam of the front-end and platform dev teams.

2) You're proposing additional access privileges, and it makes sense to
limit that group to people who want and can make use of the authority.

3) You're proposing additional responsibilities, and it makes sense to
limit that group to people who want and can honour those responsibilities.

Would it make sense, if that list is accurate, to consider instead just
changing the sheriff list? There are 6 weeks' worth of people in the
calendar, and more than that CC'd on the bug. Is there a core of, say,
15 people who all actually want the job? If so, is it more valuable to
have them be the only sheriffs and grow the role to enable the things
you describe?

I'm not actually disagreeing with your proposal; I think there are
several of us, as you say, who do pieces of this work already because we
see it needs to be done. And I think that kind of tiered structure,
like we have with review, has worked well for us in terms of mentoring,
socializing knowledge, and providing leadership at critical points.

I just want to make sure that this kind of redesign leaves us in a place
with fewer problems and not with, say, a list of sheriffs which, having
been pruned of its most interested sheriffs, is now even more populated
with people who don't really understand or want their role. Maybe those
are separate problems though. The kind of problems that a list of
super-sheriffs could solve, say.

I'm becoming convinced.

> I think that this, along with creating a sheriff's newsgroup (do we have
> one already?) to better track discussions about both daily problems and
> ongoing ones, would help us get a handle on what's going on with the
> tree when problems arise.

I think this is a great idea, almost regardless of the super-sheriffs
question. If you haven't filed the bug yet, I will. If you have, please
cc me.

Cheers,

Johnathan

Johnathan Nightingale

Nov 13, 2008, 11:43:39 AM

to Vladimir Vukicevic, dev-pl...@lists.mozilla.org

After several people pinging me about my first reply to Vlad, it's
clear that I was unclear in my original post. Looking back on it, I
sure was. Let's try this:

--> I agree that we should build out a list of super-sheriffs.

In addition to the benefits & responsibilities Vlad outlines...

> The set of people within this group would, as a group:
>
> * have continuous responsibility for tests, crashes, etc. on the
> tinderbox, meaning that they would be on the hook for at least being
> aware of and tracking these areas;
> * be the people who sheriff when the day's sheriff isn't around --
> that is, add another step between the day's sheriff and #developers;
> * give sheriffs a specific group of people whom they can contact
> when they have problems;
> * have direct access to all tinderbox machines to be able to
> diagnose problems as they occur.

I think this also lets us improve the pool of sheriffs by having a
group that can identify, mentor and build up new sheriffs. Sheriffing
is something that most people actively contributing code should be a
part of - both to understand the larger ramifications of their coding
work and to help keep the project, and the tree, healthy.

Super-sheriffs can identify new contributors and help integrate them
into the pool. Contrary to some of my musings last time, I think the
pool of sheriffs should be GROWING, not shrinking, on balance. But I
think that can't happen if the barrier to entry remains as high as it
is. The work that Vlad outlines should also make day-to-day
sheriffing an easier job, since a coordinated group like this should
be taking on long-standing problems, such as the number of
random reds/oranges.

As Vlad mentions, we are already doing this. Several of us try to
bring up new sheriffs, try to understand & fix systematic problems,
and try to keep the tree green. Empowering those people to more
effectively do that work in a more deliberate and direct way is a good
thing.

Consider me sold,

Johnathan

---
Johnathan Nightingale
Human Shield
joh...@mozilla.com

Ben Hearsum

Nov 13, 2008, 1:43:12 PM

On 11/12/08 9:20 PM, Vladimir Vukicevic wrote:
> * have direct access to all tinderbox machines to be able to diagnose
> problems as they occur.

There are a lot of details that need to be worked out before doing this.
Build machines aren't LDAP-controlled - it's a non-trivial thing to give
a bunch of people access to them.

Additionally, we would need to audit some permissions to ensure that
non-build folks do not have access to build ssh keys (which in turn
would give them access to a number of critical systems).

Without discussing this with other RelEng folks I _think_ I'm okay with
this in principle, but there's a fair amount of work to be done to make
it possible.

- Ben

Shawn Wilsher

Nov 13, 2008, 3:01:31 PM

to Vladimir Vukicevic, dev-pl...@lists.mozilla.org

On 11/12/08 6:20 PM, Vladimir Vukicevic wrote:
> * be the people who sheriff when the day's sheriff isn't around -- that
> is, add another step between the day's sheriff and #developers;

My biggest problem with this is that it gives another incentive for the
sheriff to not be around. Right now the incentive is something like
"hopefully people will be responsible, and hopefully nobody will realize
I'm not around". With the set of super-sheriffs it becomes "it's OK if
I'm not around - one of the super-sheriffs will have to step up".

We already have too many people who don't really sheriff when they are
listed as sheriff. I'd hate to see that problem get worse.

/sdwilsh

Mike Beltzner

Nov 13, 2008, 7:43:02 PM

to Shawn Wilsher, dev-pl...@lists.mozilla.org, Vladimir Vukicevic

On 13-Nov-08, at 3:01 PM, Shawn Wilsher wrote:

> On 11/12/08 6:20 PM, Vladimir Vukicevic wrote:

>> * be the people who sheriff when the day's sheriff isn't around -- that
>> is, add another step between the day's sheriff and #developers;

> My biggest problem with this is that it gives another incentive for
> the sheriff to not be around. Right now the incentive is something
> like "hopefully people will be responsible, and hopefully nobody
> will realize I'm not around". With the set of super-sheriffs it
> becomes "it's OK if I'm not around - one of the super-sheriffs will
> have to step up".
>
> We already have too many people who don't really sheriff when they
> are listed as sheriff. I'd hate to see that problem get worse.

On the other hand, it also means that there is a group of people who
can easily notice when that happens, and suggest that perhaps someone
shouldn't actually be on the sheriff list.

On the gripping hand[1], it's important that everyone who has commit
access understands the cost of pushing changes to the tree. I think
that by making every committer responsible for a day (or even a few
hours within a day) of checking against performance, regressions and
test failures, we'll end up with a better set of committers. In
parallel we can continue to invest in tools that reduce the pain of
being a sheriff (many have long been whispered of in the halls of the
sheriff: clearer indication of performance regressions, reinstatement
of the blame column, a cleaner tinderbox layout) but in the main we
need to make these activities more familiar to every committer, so the
burden of being a sheriff is reduced.

cheers,
mike

[1]: http://en.wikipedia.org/wiki/The_Mote_in_God%27s_Eye - read it.

Justin Dolske

Nov 14, 2008, 12:08:24 AM

On 11/12/08 6:20 PM, Vladimir Vukicevic wrote:

> But if there are serious problems -- machine instability, intermittent
> test failures, or tinderbox-only crashes to name a few -- the current
> system fails for two main reasons: there is no continuity as the sheriff
> changes from day-to-day, and the daily sheriffs have no clear escalation
> path if they are faced with a problem that they can't handle.

I think one additional problem with the current sheriff scheme is that
it's such a short and infrequent shift. I've often felt like I'm
relearning the basics each time, because either I forget things
(documentation? ha!) or things have changed. That makes it a
frustrating, inefficient experience.

Fixing that would probably require longer and/or more frequent sheriff
duties. Say, a week at a time every other month. (Don't everyone
volunteer at once!) Maybe it would help to have "sheriff and a deputy"
-- the sheriff being on a longer duty cycle, and the deputy rotating
daily. Sheriffs would become more efficient, gain experience faster, and
be able to tackle longer-term projects. Deputies would be able to learn
from someone experienced, and lend a hand on busy days (which seem to be
increasingly common).

(This could all be in addition to super-sheriffs, or an alternative.)

> * have direct access to all tinderbox machines to be able to diagnose
> problems as they occur.

+1, this would really be useful. Even if it's just a handful of people
due to access concerns, or if there's a way to clone troublesome
tinderboxes for developer experimentation.

Justin

Dave Townsend

Nov 15, 2008, 5:43:17 PM

I've been reading this thread with interest because I am pretty sure
that something needs to be done with sheriffing. To be honest I'm not
entirely sure how I feel about the super-sheriffs proposal. I'm not
necessarily against it or for it. This is more of a dump of some of my
thoughts that crop up.

On 13/11/08 02:20, Vladimir Vukicevic wrote:
>
> Over the past few weeks (or months), it's become increasingly obvious
> that the current sheriff structure really doesn't work when there are
> serious problems that need to be dealt with. Things function fine if the
> sheriff serves as a watchdog for noticing performance regressions
> (though this should be everyone's responsibility, and we need tools to
> make this easier), if he/she's just watching to make sure we don't have
> too many checkins at once, or other similar 'maintenance' tasks.
>
> But if there are serious problems -- machine instability, intermittent
> test failures, or tinderbox-only crashes to name a few -- the current
> system fails for two main reasons: there is no continuity as the sheriff
> changes from day-to-day, and the daily sheriffs have no clear escalation
> path if they are faced with a problem that they can't handle.
>
> So, to resolve these two problems, I would suggest that we create a new
> super-sheriffs group, much like we have a super-reviewers group. The set
> of people within this group would, as a group:
>
> * have continuous responsibility for tests, crashes, etc. on the
> tinderbox, meaning that they would be on the hook for at least being
> aware of and tracking these areas;

I'm not entirely sure what you are suggesting the super-sheriff does
here. Right now we kind of have a system where whoever spots a
regression tends to end up having to try to work out what caused it,
however long that takes. This can suck for that person, it certainly
makes me wonder why I check perf graphs sometimes, but it has the
benefit that at least that person has all the info on the issue.

Are you suggesting that instead whoever finds the regression merely
hands off to a super sheriff?

> * be the people who sheriff when the day's sheriff isn't around -- that
> is, add another step between the day's sheriff and #developers;

This is I think good for spotting people who aren't sheriffing regularly
and either kicking them or replacing them. I wonder though if you have
considered the timezone problem. For a long time now the sheriffing has
been focused on PST work hours. That is understandable but it can leave
problems where people who are on the sheriff list can't actually work
those hours, and that there is commonly no sheriff outside of those hours.

> * give sheriffs a specific group of people whom they can contact when
> they have problems;
>
> * have direct access to all tinderbox machines to be able to diagnose
> problems as they occur.

I can see this being useful, but I'm not sure we need a super-sheriff
group for it specifically. As it happens I've found merely having access
to the buildbot waterfall to be seriously helpful when sheriffing.

>
> There are a number of people who have been performing a similar role
> already, but I think it would be helpful to formalize this and ensure
> that those people have the tools they need to do their jobs efficiently
> (specifically, direct access to the tinderboxes).
>
> I think that this, along with creating a sheriff's newsgroup (do we have
> one already?) to better track discussions about both daily problems and
> ongoing ones, would help us get a handle on what's going on with the
> tree when problems arise.

This I certainly agree with. I had thought about a wiki page or
something to track things. But a newsgroup where a thread is created for
perf regressions would be very useful. Equally useful would be a good
way to hand off between sheriffs. It's pretty common that I get up in
the morning to see confusing messages on the tinderbox and in bug
reports and so I have to guess whether it is safe to check in or not. A
simple message at the end of the day by a sheriff to say "this is the
state of the tree, this is what to look for and this is when we can
start checking in again" would be pretty helpful.

Johnathan Nightingale

Nov 16, 2008, 3:13:18 PM

to dev-pl...@lists.mozilla.org

Vladimir Vukicevic wrote:
> I think that this, along with creating a sheriff's newsgroup (do we have
> one already?) to better track discussions about both daily problems and
> ongoing ones, would help us get a handle on what's going on with the
> tree when problems arise.

I have since filed the bug for the newsgroup (464609), which Gerv is
ready to create. Gerv suggested mozilla.dev.planning.sheriff, which is
mostly fine, however I think that since this is a newsgroup for
sheriffs, RelEng, IT and devs to come *together* on issues of tree
management, it could stand a slightly broader name.

Before I go on, let's everybody take a firm grip on your bikeshedding
instincts. :) Now, is there anything show-stoppingly-wrong with:

mozilla.dev.planning.treemanagement

If I don't hear of any such showstoppers in the next few days, I'll ask
Gerv to create it as named.

Cheers,

Johnathan

Dave Townsend

Nov 16, 2008, 5:16:50 PM

Not bikeshedding I think, but do we have a good consensus on what the
scope of discussion in the newsgroup should be? I would expect the
following:

* Threads about investigations into performance/test regressions
* Threads about planned tree outages
* Threads about changes to the tree rules

I'm positive I'm missing stuff, but I could also be way off base.

Mike Beltzner

Nov 16, 2008, 5:39:04 PM

to dtow...@mozilla.com, dev-pl...@lists.mozilla.org

Changes to tree rules and outages should go to dev.planning, imo.

I would expect this newsgroup to be about tools and techniques for
sheriffing, schedules for sheriffing, discussion of random test failures and
slow build boxes, etc.

cheers,
mike

_______________________________________________
dev-planning mailing list
dev-pl...@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-planning

John O'Duinn

Nov 17, 2008, 3:55:54 AM

to Mike Beltzner, Vladimir Vukicevic, dev-pl...@lists.mozilla.org, dtow...@mozilla.com

hi;

Thoughts after reading this thread:

1) Currently, we only have developers as sheriffs. However, intermittent
problems can be caused by bugs in the shipping product code, the test or
the RelEng infrastructure. What if QA and RelEng scheduled people to be
available, as advisors to those Developer sheriffs? That way,
Dev+QA+RelEng could immediately debug all aspects of a problem
*together*? Not sure how we'd figure out the timezones, and I think we
don't have enough people to do 24x7, but even if it's only during some
certain hours of the day, the idea of a joint group that can immediately
investigate all parts of the problem space seems useful.

2) I like the idea of longer "sheriff" shifts (a week, not a day) as it
will allow people time to re-learn, be productive and get into the
rhythm of it.

3) Doing "sheriff + deputy" for the week seems like a good way to have
cover in case the sheriff has to attend a meeting or take a day off during
the week... and it's a great way to help train up new sheriffs.

4) Personally, I'd be in favour of refining membership of the existing
sheriff group, and not bother creating "super-sheriffs"; it seems less
confusing to me.

5) Why do we need a different sheriff/treemanagement newsgroup?
Personally, the existing dev.planning newsgroup seems like the right
venue for all this, imho.

Just my $0.02!

tc
John.
=====

Robert Kaiser

Nov 17, 2008, 8:13:10 AM

John O'Duinn wrote:
> 1) Currently, we only have developers as sheriffs. However, intermittent
> problems can be caused by bugs in the shipping product code, the test or
> the RelEng infrastructure. What if QA and RelEng scheduled people to be
> available, as advisors to those Developer sheriffs? That way,
> Dev+QA+RelEng could immediately debug all aspects of a problem
> *together*? Not sure how we'd figure out the timezones, and I think we
> don't have enough people to do 24x7, but even if it's only during some
> certain hours of the day, the idea of a joint group that can immediately
> investigate all parts of the problem space seems useful.

What we really seem to be missing is someone from RelEng on a European
timeshift, esp. between about 12:00 to 18:00 UTC.

I really like the idea of having on-call people from other areas to
support the sheriff, so the sheriff knows whom to ask if e.g. the red is
a machine issue.

Robert Kaiser

Mike Beltzner

Nov 17, 2008, 8:52:22 AM

to jod...@mozilla.com, dev-pl...@lists.mozilla.org, Vladimir Vukicevic, dtow...@mozilla.com

On 17-Nov-08, at 3:55 AM, John O'Duinn wrote:

> 1) Currently, we only have developers as sheriffs. However, intermittent
> problems can be caused by bugs in the shipping product code, the test or
> the RelEng infrastructure. What if QA and RelEng scheduled people to be
> available, as advisors to those Developer sheriffs? That way,
> Dev+QA+RelEng could immediately debug all aspects of a problem
> *together*? Not sure how we'd figure out the timezones, and I think we
> don't have enough people to do 24x7, but even if it's only during some
> certain hours of the day, the idea of a joint group that can immediately
> investigate all parts of the problem space seems useful.

Agreed, this is required. So far I think we've made do by
cross-communicating on IRC, usually pulling nthomas or bhearsum in for
questions and assistance. A lot of the time people are told
to file bugs (which is fine), but knowing that there's someone "on
call" to deal with those issues in a timely fashion is critical and
would reduce frustration.

> 2) I like the idea of longer "sheriff" shifts (a week, not a day) as it
> will allow people time to re-learn, be productive and get into the
> rhythm of it.

I don't recall seeing this suggestion, aside from the idea that the
"super sheriff" (or if you will, sheriff with daily deputies) will
have a lesser responsibility towards monitoring each checkin and a
greater one towards helping that day's sheriff (or if you will,
deputy) should odd problems occur.

I'm hesitant to increase the time commitment of super sheriffs, as
there's a good degree of overlap between them and our senior code
reviewing community.

> 3) Doing "sheriff + deputy" for the week seems like a good way to have
> cover in case the sheriff has to attend a meeting or take a day off during
> the week... and it's a great way to help train up new sheriffs.

Absolutely agreed. This seems like a good way to institute the model
you posit above.

> 4) Personally, I'd be in favour of refining membership of the existing
> sheriff group, and not bother creating "super-sheriffs"; it seems less
> confusing to me.

As long as the idea is that they get a group of deputies, this makes
sense. While I originally felt the same way (just boot the sheriffs
who don't seem to care as much) I think that it's a necessary part of
our community process and should be a requirement for code committers
as a way of enforcing good behaviours and habits.

> 5) Why do we need a different sheriff/treemanagement newsgroup?
> Personally, the existing dev.planning newsgroup seems like the right
> venue for all this, imho.

Right now we're only discussing about 50% of what we should be
discussing in newsgroups; the rest is on IRC, wiki pages and group
knowledge. I don't think that dev.planning is the right place to
discuss daily operational content as opposed to an area for project-
wide announcements that affect planning and how we work (like this
thread). The types of things I'd expect to see in a sheriffing
newsgroup are discussions of "Hey, there's a random orange here, what
could be causing it?" with the group collaborating to debug it; that
would be a little much for dev-planning.

cheers,
mike

Ben Hearsum

Nov 17, 2008, 10:39:23 AM

On 11/17/08 3:55 AM, John O'Duinn wrote:
> hi;
>
> Thoughts after reading this thread:
>
> 1) Currently, we only have developers as sheriffs. However, intermittent
> problems can be caused by bugs in the shipping product code, the test or
> the RelEng infrastructure. What if QA and RelEng scheduled people to be
> available, as advisors to those Developer sheriffs? That way,
> Dev+QA+RelEng could immediately debug all aspects of a problem
> *together*? Not sure how we'd figure out the timezones, and I think we
> don't have enough people to do 24x7, but even if it's only during some
> certain hours of the day, the idea of a joint group that can immediately
> investigate all parts of the problem space seems useful.
>

I love this idea. Having dedicated time where I am to *expect*
interrupts makes me much more amenable to them. Personally, I'd like to
see these shifts be a week long. This way, each person would only have
one once every 6 weeks or so - giving folks a lot of time to *ignore*
interrupts and focus on their day to day work.

> 5) Why do we need a different sheriff/treemanagement newsgroup?
> Personally, the existing dev.planning newsgroup seems like the right
> venue for all this, imho.
>

I agree with Mike and others here. dev.planning is a project-wide
newsgroup and sheriff stuff isn't relevant to most people outside of
Engineering. A dev.whatever group would be *great* for tracking things
like this and would hopefully reduce the occurrence of "I don't know
where to file this, post this, or note this, so I won't do anything"
incidents.

Ben Hearsum

Nov 17, 2008, 11:17:09 AM

On 11/17/08 10:39 AM, Ben Hearsum wrote:
> On 11/17/08 3:55 AM, John O'Duinn wrote:
>> hi;
>>
>> Thoughts after reading this thread:
>>
>> 1) Currently, we only have developers as sheriffs. However, intermittent
>> problems can be caused by bugs in the shipping product code, the test or
>> the RelEng infrastructure. What if QA and RelEng scheduled people to be
>> available, as advisors to those Developer sheriffs? That way,
>> Dev+QA+RelEng could immediately debug all aspects of a problem
>> *together*? Not sure how we'd figure out the timezones, and I think we
>> don't have enough people to do 24x7, but even if it's only during some
>> certain hours of the day, the idea of a joint group that can immediately
>> investigate all parts of the problem space seems useful.
>>
>
> I love this idea. Having dedicated time where I am to *expect*
> interrupts makes me much more amenable to them. Personally, I'd like to
> see these shifts be a week long. This way, each person would only have
> one once every 6 weeks or so - giving folks a lot of time to *ignore*
> interrupts and focus on their day to day work.
>

A couple more things on this:
If we implement this I think we should hold off on giving sheriffs
direct tinderbox access. If there is a defined point of contact for
RelEng I think it will greatly reduce the need for sheriffs to step in.
I'm totally open to revisiting this later, though.

And to be clear, I don't think a RelEng person needs to monitor the tree
the same way the sheriff does. They shouldn't spend their day *actively*
watching for problems. However, they should be hanging out in
#developers, watching the newsgroup, and watching for incoming bugs - so
they can respond in a timely manner to them.

John O'Duinn

Nov 17, 2008, 12:30:23 PM

to Mike Beltzner, dev-pl...@lists.mozilla.org, Vladimir Vukicevic, dtow...@mozilla.com

hi

Mike Beltzner wrote:

> On 17-Nov-08, at 3:55 AM, John O'Duinn wrote:
>> 2) I like the idea of longer "sheriff" shifts (a week, not a day) as it
>> will allow people time to re-learn, be productive and get into the
>> rhythm of it.
>
> I don't recall seeing this suggestion, aside from the idea that the
> "super sheriff" (or if you will, sheriff with daily deputies) will have
> a lesser responsibility towards monitoring each checkin and a greater
> one towards helping that day's sheriff (or if you will, deputy) should
> odd problems occur.

See Justin Dolske's post on this thread at 11/13/08 9:08 PM.

> I'm hesitant to increase the time commitment of super sheriffs, as
> there's a good degree of overlap between them and our senior code
> reviewing community.

Well, by doing longer "sheriff shifts", each shift is longer, but your
next shift is further in the future. Not sure if the actual overall time
commitment changes...

tc
John.

Johnathan Nightingale

Nov 17, 2008, 12:39:02 PM

to jod...@mozilla.com, dev-pl...@lists.mozilla.org

On 17-Nov-08, at 3:55 AM, John O'Duinn wrote:

> hi;
>
> Thoughts after reading this thread:
>

> 1) ...What if QA and RelEng scheduled people to be
> available, as advisors to those Developer sheriffs? That way,
> Dev+QA+RelEng could immediately debug all aspects of a problem
> *together*?

I think this would be useful. I don't know if it obviates the need
for some sheriffs to have box access; I guess it comes down to
availability of resources - if sheriffs can get the equivalent of
hands-on control when they need to, for instance, attach a debugger to
a test-failing process, I think that's a real improvement.

> 2) I like the idea of longer "sheriff" shifts (a week, not a day) as it
> will allow people time to re-learn, be productive and get into the
> rhythm of it.

> 3) Doing "sheriff + deputy" for the week seems like a good way to have
> cover in case the sheriff has to attend a meeting or take a day off during
> the week... and it's a great way to help train up new sheriffs.

I am not keen to do this - I think it's too much to put on every
sheriff, and I don't think it should be necessary. I'd really rather
see us reduce the amount of "re-learning" sheriffs need, by having a
consolidated group of people tracking down and eliminating recurrent
"fake oranges" and box problems. I don't think the time-calculus
works out in reality the way it might on paper - I think it's
substantially harder for most people to commit to a whole week of
reduced productivity elsewhere in their jobs.

> 4) Personally, I'd be in favour of refining membership of the existing
> sheriff group, and not bother creating "super-sheriffs"; it seems less
> confusing to me.

I'm not sure how confusing this would be to developer-sheriffs who are
already accustomed to the review/superreview process, and the peer/
module owner structure. In any event though, there are long-standing
problems with the current organization - there are well-documented
failures, particularly during crunch periods. I think that if we have
organizational belief that a super-sheriff structure would help track
those down, the cost of change-confusion is probably worth it.

> 5) Why do we need a different sheriff/treemanagement newsgroup?
> Personally, the existing dev.planning newsgroup seems like the right
> venue for all this, imho.

Oh this is the easiest part of it for me. In fine Mozilla tradition,
that's just recognizing something that's already happening. We have
been playing various ad hoc games in the meantime - email threads that
get most-but-not-all interested parties, or IRC conversations that are
(perforce) timezone-specific. If it falls into disuse, fine, but I
suspect it's a coordination point we've needed for a while.

Cheers,

Axel Hecht

Nov 17, 2008, 12:39:47 PM

Having one releng person on call is nice, but looking at the daily habit
of Ted and Mossop, we need more people in different timezones empowered
to fix the tree.

If it's not feasible to have them have access to the actual machines, we
need to re-establish methods to clobber (TODO: define what that is for
multiple slaves) and kick builds.

I'm still working on "ignorance is bliss" when it comes down to
sheriffing, but the "European tree" is in an utterly sad state. I
actually don't remember landing stuff with a good feeling about what the
tree was up to for quite a while. Luckily, most of my landings these
days are just changes to all-locales or shipped-locales, so I don't
really bother.

Axel

John O'Duinn

Nov 17, 2008, 12:41:24 PM

to Ben Hearsum, dev-pl...@lists.mozilla.org

Ben Hearsum wrote:
> On 11/17/08 10:39 AM, Ben Hearsum wrote:
>> On 11/17/08 3:55 AM, John O'Duinn wrote:
>>> hi;
>>>
>>> Thoughts after reading this thread:
>>>
>>> 1) Currently, we only have developers as sheriffs. However, intermittent
>>> problems can be caused by bugs in the shipping product code, the test or
>>> the RelEng infrastructure. What if QA and RelEng scheduled people to be
>>> available, as advisors to those Developer sheriffs? That way,
>>> Dev+QA+RelEng could immediately debug all aspects of a problem
>>> *together*? Not sure how we'd figure out the timezones, and I think we
>>> don't have enough people to do 24x7, but even if it's only during some
>>> certain hours of the day, the idea of a joint group that can immediately
>>> investigate all parts of the problem space seems useful.
>>>
>>
>> I love this idea. Having dedicated time where I am to *expect*
>> interrupts makes me much more amenable to them. Personally, I'd like to
>> see these shifts be a week long. This way, each person would only have
>> one once every 6 weeks or so - giving folks a lot of time to *ignore*
>> interrupts and focus on their day to day work.

Cool. :-)

> A couple more things on this:
> If we implement this I think we should hold off on giving sheriffs
> direct tinderbox access. If there is a defined point of contact for
> Releng I think it will greatly reduce the need for sheriffs to step in.
> I'm totally open to revisiting this later, though.

Agreed. If there's a RelEng contact/advisor available, they can look
into machine issues on the spot, without worries of an accidental change
messing up the machine.

> And to be clear, I don't think a RelEng person needs to monitor the tree
> the same way the sheriff does. They shouldn't spend their day *actively*
> watching for problems. However, they should be hanging out in
> #developers, watching the newsgroup, and watching for incoming bugs - so
> they can respond in a timely manner to them.

Yes, that's why I suggested "advisors to those Developer sheriffs". I'm
not suggesting that only one person be sheriff, with QA or RelEng
sometimes being that solo sheriff coordinating landings, etc. I'm
suggesting that the two Developers who are sheriffs have a QA person and
a RelEng person available to help investigate problems that arise.

tc
John.

Gervase Markham

Nov 17, 2008, 5:28:15 PM

to

Mike Beltzner wrote:
> Changes to tree rules and outages should go to dev.planning, imo.

Yeah - I think the key point is that only people who are responsible for
tree management should need to add that group to their reading list.
"Normal" developers should see announcements of important tree
management events elsewhere.

Gerv

Vladimir Vukicevic

Nov 20, 2008, 3:54:12 PM

to Mike Beltzner, jod...@mozilla.com, dev-pl...@lists.mozilla.org, Vladimir Vukicevic, dtow...@mozilla.com

On 11/17/08 5:52 AM, Mike Beltzner wrote:
> On 17-Nov-08, at 3:55 AM, John O'Duinn wrote:
>
>> 1) Currently, we only have developers as sheriffs. However, intermittent
>> problems can be caused by bugs in the shipping product code, the test or
>> the RelEng infrastructure. What if QA and RelEng scheduled people to be
>> available, as advisors to those Developer sheriffs? That way,
>> Dev+QA+RelEng could immediately debug all aspects of a problem
>> *together*? Not sure how we'd figure out the timezones, and I think we
>> don't have enough people to do 24x7, but even if it's only during some
>> certain hours of the day, the idea of a joint group that can immediately
>> investigate all parts of the problem space seems useful.
>
> Agreed, this is required. So far I think we've made do by
> cross-communicating on IRC, usually pulling nthomas or bhearsum in for
> questions and assistance on IRC. A lot of the time people are told to
> file bugs (which is fine) but knowing that there's someone "on call" to
> deal with those issues in a timely fashion is critical and would reduce
> frustration.

Yes, this would be a big help. Bugs are useful for tracking, but if a
bug has to be filed to fix something that's blocking people from
committing, people can usually forget about getting anything checked in
for the next few hours. We need a faster way to fix these issues.

>> 2) I like the idea of longer "sheriff" shifts (a week, not a day) as it
>> will allow people time to re-learn, be productive and get into the
>> rhythm of it.
>
> I don't recall seeing this suggestion, aside from the idea that the
> "super sheriff" (or if you will, sheriff with daily deputies) will have
> a lesser responsibility towards monitoring each checkin and a greater
> one towards helping that day's sheriff (or if you will, deputy) should
> odd problems occur.

Yes, that was the original idea -- I am very much against increasing the
time commitment for active sheriffing; it's a huge drain and effort, and
I think I would go crazy doing it for a week straight.

>> 3) Doing "sheriff + deputy" for the week seems like a good way to have
>> cover in case the sheriff has to attend a meeting / take a day off during
>> the week... and it's a great way to help train up new sheriffs.
>
> Absolutely agreed. This seems like a good way to institute the model you
> posit above.

Well, no -- sheriff + deputy for the week implies, well, being sheriff
for a week. Continuing sheriff duties as normal but creating a group
that's a safety net for the daily sheriff is what I had in mind -- that
is, I didn't want to provide an opportunity for sheriffs to not do their
daily duties, but more to have someone to turn to if they need help, and
for a group to be aware of longer-standing problems. Hence the idea for
a separate group, because I'm not proposing any changes to existing
sheriffs, just additional structure around them.

>> 5) Why do we need a different sheriff/treemanagement newsgroup?
>> Personally, the existing dev.planning newsgroup seems like the right
>> venue for all this, imho.
>
> Right now we're only discussing about 50% of what we should be
> discussing in newsgroups; the rest is on IRC, wiki pages and group
> knowledge. I don't think that dev.planning is the right place to discuss
> daily operational content as opposed to an area for project-wide
> announcements that affect planning and how we work (like this thread).
> The types of things I'd expect to see in a sheriffing newsgroup are
> discussions of "Hey, there's a random orange here, what could be causing
> it?" with the group collaborating to debug it; that would be a little
> much for dev-planning.

Yep, I think this was discussed in other posts, but that's what I think
as well. m.d.planning is for planning, not for discussion of details of
the tree.

- Vlad

Vladimir Vukicevic

Dec 3, 2008, 5:06:01 PM

to

This died out a bit without anything actually getting done, so I'm
jumpstarting it again. Some additional comments below. From
reading back over the thread, people still seem split on whether we need
to create a set of super-sheriffs or not. I can try to distill the
feedback to come up with a list of requirements/abilities of
super-sheriffs and see if it's become any clearer.

On 11/17/08 9:39 AM, Johnathan Nightingale wrote:
> On 17-Nov-08, at 3:55 AM, John O'Duinn wrote:
>
>> hi;
>>
>> Thoughts after reading this thread:
>>
>> 1) ...What if QA and RelEng scheduled people to be
>> available, as advisors to those Developer sheriffs? That way,
>> Dev+QA+RelEng could immediately debug all aspects of a problem
>> *together*?
>
> I think this would be useful. I don't know if it obviates the need for
> some sheriffs to have box access, I guess it comes down to availability
> of resources - if sheriffs can get the equivalent of hands-on control
> when they need to, for instance, attach a debugger to a test-failing
> process, I think that's a real improvement.

Sure, though "equivalent of hands-on control" is not the same as
hands-on control. Trying to remotely debug with someone copy-pasting
(or typing back responses, since copy-pasting from some of these
machines isn't easy) is a huge pain and really unnecessary. Having
someone around to do things like reboot machines, check
network/diskspace/etc. would be very helpful though, because then the
sheriff can work with that person instead of necessarily always going to
the supersheriff.

>> 4) Personally, I'd be in favour of refining membership of the existing
>> sheriff group, and not bother creating "super-sheriffs"; it seems less
>> confusing to me.
>
> I'm not sure how confusing this would be to developer-sheriffs who are
> already accustomed to the review/superreview process, and the
> peer/module owner structure. In any event though, there are
> long-standing problems with the current organization - there are
> well-documented failures, particularly during crunch periods. I think
> that if we have organizational belief that a super-sheriff structure
> would help track those down, the cost of change-confusion is probably
> worth it.

I'm not sure what's confusing, really -- there's a set of sheriffs and
then there's a set of super-sheriffs who are empowered to do additional
things that the sheriffs normally aren't, and who are also charged with
day-to-day continuity of the status of the tree.

>> 5) Why do we need a different sheriff/treemanagement newsgroup?
>> Personally, the existing dev.planning newsgroup seems like the right
>> venue for all this, imho.
>
> Oh this is the easiest part of it for me. In fine Mozilla tradition,
> that's just recognizing something that's already happening. We have been
> playing various ad hoc games in the meantime - email threads that get
> most-but-not-all interested parties, or IRC conversations that are
> (perforce) timezone-specific. If it falls into disuse, fine, but I
> suspect it's a coordination point we've needed for a while.

Ok -- as per bug 464609, mozilla.dev.tree-management should be showing
up sometime this week which should help. (dev.planning is not the right
place for this, as others have said; talking about why test X is failing
is not in the same scope as planning a schedule for the next release).

- Vlad