Over the past few weeks (or months), it's become increasingly obvious that the current sheriff structure really doesn't work when there are serious problems that need to be dealt with. Things function fine if the sheriff serves as a watchdog for noticing performance regressions (though this should be everyone's responsibility, and we need tools to make this easier), if he/she's just watching to make sure we don't have too many checkins at once, or other similar 'maintenance' tasks.
But if there are serious problems -- machine instability, intermittent test failures, or tinderbox-only crashes to name a few -- the current system fails for two main reasons: there is no continuity as the sheriff changes from day-to-day, and the daily sheriffs have no clear escalation path if they are faced with a problem that they can't handle.
So, to resolve these two problems, I would suggest that we create a new super-sheriffs group, much like we have a super-reviewers group. The set of people within this group would, as a group:
* have continuous responsibility for tests, crashes, etc. on the tinderbox, meaning that they would be on the hook for at least being aware of and tracking these areas;
* be the people who sheriff when the day's sheriff isn't around -- that is, add another step between the day's sheriff and #developers;
* give sheriffs a specific group of people whom they can contact when they have problems;
* have direct access to all tinderbox machines to be able to diagnose problems as they occur.
There are a number of people who have been performing a similar role already, but I think it would be helpful to formalize this and ensure that those people have the tools they need to do their jobs efficiently (specifically, direct access to the tinderboxes).
I think that this, along with creating a sheriff's newsgroup (do we have one already?) to better track discussions about both daily problems and ongoing ones, would help us get a handle on what's going on with the tree when problems arise.
Vladimir Vukicevic wrote:
> So, to resolve these two problems, I would suggest that we create a new super-sheriffs group, much like we have a super-reviewers group. The set of people within this group would, as a group:
This is an interesting idea, and seems to get to the heart of some of the concerns I've had with our current system too. As you say, it works well enough when the system isn't stressed, but it falls down when things get tight.
I'm trying super-sheriffs on as an idea, and part of me wonders if really this is what the "sheriffs" group is supposed to be. Are the sets really that distinct? Because the Sheriffs are the group of people who have been tagged as "ought to be able to manage the tree, given their experience with/exposure to it, and dedication to its health" which is much of what you want in super-sheriffs.
I think the reasons you propose a distinct super-sheriffs group are:
1) Not all sheriffs actually do behave in the ways I describe - maybe they don't know how to sheriff, or don't want to, or don't even know that they are on the list? The current list, after all, wasn't created by enlisting only interested and excited people. It was closer to just an amalgam of the front-end and platform dev teams.
2) You're proposing additional access privileges, and it makes sense to limit that group to people who want and can make use of the authority.
3) You're proposing additional responsibilities, and it makes sense to limit that group to people who want and can honour those responsibilities.
Would it make sense, if that list is accurate, to consider instead just changing the sheriff list? There are 6 weeks' worth of people in the calendar, and more than that CC'd on the bug. Is there a core of, say, 15 people who all actually want the job? If so, is it more valuable to have them be the only sheriffs and grow the role to enable the things you describe?
I'm not actually disagreeing with your proposal. I think there are several of us, as you say, who do pieces of this work already because we see it needs to be done. And I think that kind of tiered structure, like we have with review, has worked well for us in terms of mentoring, socializing knowledge, and providing leadership at critical points.
I just want to make sure that this kind of redesign leaves us in a place with fewer problems and not with, say, a list of sheriffs which, having been pruned of its most interested sheriffs, is now even more populated with people who don't really understand or want their role. Maybe those are separate problems though. The kind of problems that a list of super-sheriffs could solve, say.
I'm becoming convinced.
> I think that this, along with creating a sheriff's newsgroup (do we have one already?) to better track discussions about both daily problems and ongoing ones, would help us get a handle on what's going on with the tree when problems arise.
I think this is a great idea, almost regardless of the super-sheriffs question. If you haven't filed the bug yet, I will. If you have, please cc me.
After several people pinging me about my first reply to Vlad, it's clear that I was unclear in my original post. Looking back on it, I sure was. Let's try this:
--> I agree that we should build out a list of super-sheriffs.
In addition to the benefits & responsibilities Vlad outlines...
> The set of people within this group would, as a group:
> * have continuous responsibility for tests, crashes, etc. on the tinderbox, meaning that they would be on the hook for at least being aware of and tracking these areas;
> * be the people who sheriff when the day's sheriff isn't around -- that is, add another step between the day's sheriff and #developers;
> * give sheriffs a specific group of people whom they can contact when they have problems;
> * have direct access to all tinderbox machines to be able to diagnose problems as they occur.
I think this also lets us improve the pool of sheriffs by having a group that can identify, mentor and build up new sheriffs. Sheriffing is something that most people actively contributing code should be a part of - both to understand the larger ramifications of their coding work and to help keep the project, and the tree, healthy.
Super-sheriffs can identify new contributors and help integrate them into the pool. Contrary to some of my musings last time, I think the pool of sheriffs should be GROWING, not shrinking, on balance. But I think that can't happen if the barrier to entry remains as high as it is. The work that Vlad outlines should also make day-to-day sheriffing an easier job, since a coordinated group like this should be tackling long-standing problems, such as the number of random reds/oranges.
As Vlad mentions, we are already doing this. Several of us try to bring up new sheriffs, try to understand & fix systematic problems, and try to keep the tree green. Empowering those people to more effectively do that work in a more deliberate and direct way is a good thing.
Consider me sold,
Johnathan
--- Johnathan Nightingale Human Shield john...@mozilla.com
> * have direct access to all tinderbox machines to be able to diagnose problems as they occur.
There are a lot of details that need to be worked out before doing this. Build machines aren't LDAP-controlled, so it's a non-trivial thing to give a bunch of people access to them.
Additionally, we would need to audit some permissions to ensure that non-build folks do not have access to build ssh keys (which in turn would give them access to a number of critical systems).
Without discussing this with other RelEng folks I _think_ I'm okay with this in principle, but there's a fair amount of work to be done to make it possible.
> * be the people who sheriff when the day's sheriff isn't around -- that is, add another step between the day's sheriff and #developers;
My biggest problem with this is that it gives another incentive for the sheriff to not be around. Right now the incentive is something like "hopefully people will be responsible, and hopefully nobody will realize I'm not around". With the set of super-sheriffs it becomes "it's OK if I'm not around - one of the super-sheriffs will have to step up".
We already have too many people who don't really sheriff when they are listed as sheriff. I'd hate to see that problem get worse.
> On 11/12/08 6:20 PM, Vladimir Vukicevic wrote:
>> * be the people who sheriff when the day's sheriff isn't around -- that is, add another step between the day's sheriff and #developers;
> My biggest problem with this is that it gives another incentive for the sheriff to not be around. Right now the incentive is something like "hopefully people will be responsible, and hopefully nobody will realize I'm not around". With the set of super-sheriffs it becomes "it's OK if I'm not around - one of the super-sheriffs will have to step up".
> We already have too many people who don't really sheriff when they are listed as sheriff. I'd hate to see that problem get worse.
On the other hand, it also means that there are a group of people who can easily notice when that happens, and suggest that perhaps someone shouldn't actually be on the sheriff list.
On the gripping hand[1], it's important that everyone who has commit access understands the cost of pushing changes to the tree. I think that by making every committer responsible for a day (or even a few hours within a day) of checking against performance, regressions and test failures, we'll end up with a better set of committers. In parallel we can continue to invest in tools that reduce the pain of being a sheriff (many have long been whispered of in the halls of the sheriff: clearer indication of performance regressions, reinstatement of the blame column, a cleaner tinderbox layout), but in the main we need to make these activities more familiar to every committer, so the burden of being a sheriff is reduced.
> But if there are serious problems -- machine instability, intermittent test failures, or tinderbox-only crashes to name a few -- the current system fails for two main reasons: there is no continuity as the sheriff changes from day-to-day, and the daily sheriffs have no clear escalation path if they are faced with a problem that they can't handle.
I think one additional problem with the current sheriff scheme is that it's such a short and infrequent shift. I've often felt like I'm relearning the basics each time, because either I forget things (documentation? ha!) or things have changed. That makes it a frustrating, inefficient experience.
Fixing that would probably require longer and/or more frequent sheriff duties. Say, a week at a time every other month. (Don't everyone volunteer at once!) Maybe it would help to have "sheriff and a deputy" -- the sheriff being on a longer duty cycle, and the deputy rotating daily. Sheriffs would become more efficient, gain experience faster, and be able to tackle longer-term projects. Deputies would be able to learn from someone experienced, and lend a hand on busy days (which seem to be increasingly common).
(This could all be in addition to super-sheriffs, or an alternative.)
> * have direct access to all tinderbox machines to be able to diagnose problems as they occur.
+1, this would really be useful. Even if it's just a handful of people due to access concerns, or if there's a way to clone troublesome tinderboxes for developer experimentation.
I've been reading this thread with interest because I am pretty sure that something needs to be done with sheriffing. To be honest I'm not entirely sure how I feel about the super-sheriffs proposal. I'm not necessarily against it or for it. This is more of a dump of some of the thoughts that crop up.
> Over the past few weeks (or months), it's become increasingly obvious that the current sheriff structure really doesn't work when there are serious problems that need to be dealt with. Things function fine if the sheriff serves as a watchdog for noticing performance regressions (though this should be everyone's responsibility, and we need tools to make this easier), if he/she's just watching to make sure we don't have too many checkins at once, or other similar 'maintenance' tasks.
> But if there are serious problems -- machine instability, intermittent test failures, or tinderbox-only crashes to name a few -- the current system fails for two main reasons: there is no continuity as the sheriff changes from day-to-day, and the daily sheriffs have no clear escalation path if they are faced with a problem that they can't handle.
> So, to resolve these two problems, I would suggest that we create a new super-sheriffs group, much like we have a super-reviewers group. The set of people within this group would, as a group:
> * have continuous responsibility for tests, crashes, etc. on the tinderbox, meaning that they would be on the hook for at least being aware of and tracking these areas;
I'm not entirely sure what you are suggesting the super-sheriff does here. Right now we kind of have a system where whoever spots a regression tends to end up having to try to work out what caused it, however long that takes. This can suck for that person, it certainly makes me wonder why I check perf graphs sometimes, but it has the benefit that at least that person has all the info on the issue.
Are you suggesting that instead whoever finds the regression merely hands off to a super sheriff?
> * be the people who sheriff when the day's sheriff isn't around -- that is, add another step between the day's sheriff and #developers;
This is I think good for spotting people who aren't sheriffing regularly and either kicking them or replacing them. I wonder though if you have considered the timezone problem. For a long time now the sheriffing has been focused on PST work hours. That is understandable, but it causes problems: people on the sheriff list who can't actually work those hours, and commonly no sheriff at all outside of those hours.
> * give sheriffs a specific group of people whom they can contact when they have problems;
> * have direct access to all tinderbox machines to be able to diagnose problems as they occur.
I can see this being useful, but I'm not sure we need a super-sheriff group for it specifically. As it happens I've found merely having access to the buildbot waterfall to be seriously helpful when sheriffing.
> There are a number of people who have been performing a similar role already, but I think it would be helpful to formalize this and ensure that those people have the tools they need to do their jobs efficiently (specifically, direct access to the tinderboxes).
> I think that this, along with creating a sheriff's newsgroup (do we have one already?) to better track discussions about both daily problems and ongoing ones, would help us get a handle on what's going on with the tree when problems arise.
This I certainly agree with. I had thought about a wiki page or something to track things. But a newsgroup where a thread is created for perf regressions would be very useful. Equally useful would be a good way to hand off between sheriffs. It's pretty common that I get up in the morning to see confusing messages on the tinderbox and in bug reports and so I have to guess whether it is safe to check in or not. A simple message at the end of the day by a sheriff to say "this is the state of the tree, this is what to look for and this is when we can start checking in again" would be pretty helpful.
Vladimir Vukicevic wrote:
> I think that this, along with creating a sheriff's newsgroup (do we have one already?) to better track discussions about both daily problems and ongoing ones, would help us get a handle on what's going on with the tree when problems arise.
I have since filed the bug for the newsgroup (464609), which Gerv is ready to create. Gerv suggested mozilla.dev.planning.sheriff, which is mostly fine, however I think that since this is a newsgroup for sheriffs, RelEng, IT and devs to come *together* on issues of tree management, it could stand a slightly broader name.
Before I go on, let's everybody take a firm grip on your bikeshedding instincts. :) Now, is there anything show-stoppingly-wrong with:
mozilla.dev.planning.treemanagement
If I don't hear of any such showstoppers in the next few days, I'll ask Gerv to create it as named.
> Vladimir Vukicevic wrote:
>> I think that this, along with creating a sheriff's newsgroup (do we have one already?) to better track discussions about both daily problems and ongoing ones, would help us get a handle on what's going on with the tree when problems arise.
> I have since filed the bug for the newsgroup (464609), which Gerv is ready to create. Gerv suggested mozilla.dev.planning.sheriff, which is mostly fine, however I think that since this is a newsgroup for sheriffs, RelEng, IT and devs to come *together* on issues of tree management, it could stand a slightly broader name.
> Before I go on, let's everybody take a firm grip on your bikeshedding instincts. :) Now, is there anything show-stoppingly-wrong with:
> mozilla.dev.planning.treemanagement
> If I don't hear of any such showstoppers in the next few days, I'll ask Gerv to create it as named.
Not bikeshedding I think, but do we have a good consensus on what the scope of discussion in the newsgroup should be? I would expect the following:
* Threads about investigations into performance/test regressions
* Threads about planned tree outages
* Threads about changes to the tree rules
I'm positive I'm missing stuff, but I could also be way off base.
Changes to tree rules and outages should go to dev.planning, imo.
I would expect this newsgroup to be about tools and techniques for sheriffing, schedules for sheriffing, discussion of random test failures and slow build boxes, etc.
----- Original Message -----
From: dev-planning-bounces+beltzner=mozilla....@lists.mozilla.org <dev-planning-bounces+beltzner=mozilla....@lists.mozilla.org>
To: dev-plann...@lists.mozilla.org <dev-plann...@lists.mozilla.org>
Sent: Sun Nov 16 14:16:50 2008
Subject: Re: Suggested changes to sheriffing
1) Currently, we only have developers as sheriffs. However, intermittent problems can be caused by bugs in the shipping product code, the test or the RelEng infrastructure. What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*? Not sure how we'd figure out the timezones, and I think we don't have enough people to do 24x7, but even if it's only during some certain hours of the day, the idea of a joint group that can immediately investigate all parts of the problem space seems useful.
2) I like the idea of longer "sheriff" shifts (a week, not a day) as it will allow people time to re-learn, be productive and get into the rhythm of it.
3) Doing "sheriff + deputy" for the week seems like a good way to have cover in case the sheriff has to attend a meeting or take a day off during the week... and it's a great way to help train up new sheriffs.
4) Personally, I'd be in favour of refining membership of the existing sheriff group, and not bother creating "super-sheriffs"; it seems less confusing to me.
5) Why do we need a different sheriff/treemanagement newsgroup? Personally, the existing dev.planning newsgroup seems like the right venue for all this, imho.
John O'Duinn wrote:
> 1) Currently, we only have developers as sheriffs. However, intermittent problems can be caused by bugs in the shipping product code, the test or the RelEng infrastructure. What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*? Not sure how we'd figure out the timezones, and I think we don't have enough people to do 24x7, but even if it's only during some certain hours of the day, the idea of a joint group that can immediately investigate all parts of the problem space seems useful.
What we really seem to be missing is someone from RelEng on a European timeshift, esp. between about 12:00 and 18:00 UTC.
I really like the idea of having on-call people from other areas to support the sheriff, so the sheriff knows whom to ask if e.g. the red is a machine issue.
> 1) Currently, we only have developers as sheriffs. However, intermittent problems can be caused by bugs in the shipping product code, the test or the RelEng infrastructure. What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*? Not sure how we'd figure out the timezones, and I think we don't have enough people to do 24x7, but even if it's only during some certain hours of the day, the idea of a joint group that can immediately investigate all parts of the problem space seems useful.
Agreed, this is required. So far I think we've made do by cross-communicating on IRC, usually pulling nthomas or bhearsum in for questions and assistance. A lot of the time people are told to file bugs (which is fine), but knowing that there's someone "on call" to deal with those issues in a timely fashion is critical and would reduce frustration.
> 2) I like the idea of longer "sheriff" shifts (a week, not a day) as it will allow people time to re-learn, be productive and get into the rhythm of it.
I don't recall seeing this suggestion, aside from the idea that the "super sheriff" (or if you will, sheriff with daily deputies) will have a lesser responsibility towards monitoring each checkin and a greater one towards helping that day's sheriff (or if you will, deputy) should odd problems occur.
I'm hesitant to increase the time commitment of super sheriffs, as there's a good degree of overlap between them and our senior code reviewing community.
> 3) Doing "sheriff + deputy" for the week seems like a good way to have cover in case the sheriff has to attend a meeting or take a day off during the week... and it's a great way to help train up new sheriffs.
Absolutely agreed. This seems like a good way to institute the model you posit above.
> 4) Personally, I'd be in favour of refining membership of the existing sheriff group, and not bother creating "super-sheriffs"; it seems less confusing to me.
As long as the idea is that they get a group of deputies, this makes sense. While I originally felt the same way (just boot the sheriffs who don't seem to care as much) I think that it's a necessary part of our community process and should be a requirement for code committers as a way of enforcing good behaviours and habits.
> 5) Why do we need a different sheriff/treemanagement newsgroup? Personally, the existing dev.planning newsgroup seems like the right venue for all this, imho.
Right now we're only discussing about 50% of what we should be discussing in newsgroups; the rest is on IRC, wiki pages and group knowledge. I don't think that dev.planning is the right place to discuss daily operational content; it is an area for project-wide announcements that affect planning and how we work (like this thread). The types of things I'd expect to see in a sheriffing newsgroup are discussions of "Hey, there's a random orange here, what could be causing it?" with the group collaborating to debug it; that would be a little much for dev-planning.
> 1) Currently, we only have developers as sheriffs. However, intermittent problems can be caused by bugs in the shipping product code, the test or the RelEng infrastructure. What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*? Not sure how we'd figure out the timezones, and I think we don't have enough people to do 24x7, but even if it's only during some certain hours of the day, the idea of a joint group that can immediately investigate all parts of the problem space seems useful.
I love this idea. Having dedicated time where I am to *expect* interrupts makes me much more amenable to them. Personally, I'd like to see these shifts be a week long. This way, each person would only have one once every 6 weeks or so - giving folks a lot of time to *ignore* interrupts and focus on their day-to-day work.
> 5) Why do we need a different sheriff/treemanagement newsgroup? Personally, the existing dev.planning newsgroup seems like the right venue for all this, imho.
I agree with Mike and others here. dev.planning is a project-wide newsgroup and sheriff stuff isn't relevant to most people outside of Engineering. A dev.whatever group would be *great* for tracking things like this and would hopefully reduce the occurrence of "I don't know where to file this, post this, or note this, so I won't do anything" incidents.
>> 1) Currently, we only have developers as sheriffs. However, intermittent problems can be caused by bugs in the shipping product code, the test or the RelEng infrastructure. What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*? Not sure how we'd figure out the timezones, and I think we don't have enough people to do 24x7, but even if it's only during some certain hours of the day, the idea of a joint group that can immediately investigate all parts of the problem space seems useful.
> I love this idea. Having dedicated time where I am to *expect* interrupts makes me much more amenable to them. Personally, I'd like to see these shifts be a week long. This way, each person would only have one once every 6 weeks or so - giving folks a lot of time to *ignore* interrupts and focus on their day-to-day work.
A couple more things on this: If we implement this I think we should hold off on giving sheriffs direct tinderbox access. If there is a defined point of contact for RelEng I think it will greatly reduce the need for sheriffs to step in. I'm totally open to revisiting this later, though.
And to be clear, I don't think a RelEng person needs to monitor the tree the same way the sheriff does. They shouldn't spend their day *actively* watching for problems. However, they should be hanging out in #developers, watching the newsgroup, and watching for incoming bugs - so they can respond in a timely manner to them.
Mike Beltzner wrote:
> On 17-Nov-08, at 3:55 AM, John O'Duinn wrote:
>> 2) I like the idea of longer "sheriff" shifts (a week, not a day) as it will allow people time to re-learn, be productive and get into the rhythm of it.
> I don't recall seeing this suggestion, aside from the idea that the "super sheriff" (or if you will, sheriff with daily deputies) will have a lesser responsibility towards monitoring each checkin and a greater one towards helping that day's sheriff (or if you will, deputy) should odd problems occur.
See Justin Dolske's post on this thread at 11/13/08 9:08 PM.
> I'm hesitant to increase the time commitment of super sheriffs, as there's a good degree of overlap between them and our senior code reviewing community.
Well, by doing longer "sheriff shifts", each shift is longer, but your next shift is further in the future. Not sure if the actual overall time commitment changes...
> 1) ...What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*?
I think this would be useful. I don't know if it obviates the need for some sheriffs to have box access, I guess it comes down to availability of resources - if sheriffs can get the equivalent of hands-on control when they need to, for instance, attach a debugger to a test-failing process, I think that's a real improvement.
> 2) I like the idea of longer "sheriff" shifts (a week, not a day) as it will allow people time to re-learn, be productive and get into the rhythm of it.
> 3) Doing "sheriff + deputy" for the week seems like a good way to have cover in case the sheriff has to attend a meeting or take a day off during the week... and it's a great way to help train up new sheriffs.
I am not keen to do this - I think it's too much to put on every sheriff, and I don't think it should be necessary. I'd really rather see us reduce the amount of "re-learning" sheriffs need, by having a consolidated group of people tracking down and eliminating recurrent "fake oranges" and box problems. I don't think the time-calculus works out in reality the way it might on paper - I think it's substantially harder for most people to commit to a whole week of reduced productivity elsewhere in their jobs.
> 4) Personally, I'd be in favour of refining membership of the existing sheriff group, and not bother creating "super-sheriffs"; it seems less confusing to me.
I'm not sure how confusing this would be to developer-sheriffs who are already accustomed to the review/superreview process, and the peer/module owner structure. In any event though, there are long-standing problems with the current organization - there are well-documented failures, particularly during crunch periods. I think that if we have organizational belief that a super-sheriff structure would help track those down, the cost of change-confusion is probably worth it.
> 5) Why do we need a different sheriff/treemanagement newsgroup? Personally, the existing dev.planning newsgroup seems like the right venue for all this, imho.
Oh this is the easiest part of it for me. In fine Mozilla tradition, that's just recognizing something that's already happening. We have been playing various ad hoc games in the meantime - email threads that get most-but-not-all interested parties, or IRC conversations that are (perforce) timezone-specific. If it falls into disuse, fine, but I suspect it's a coordination point we've needed for a while.
Cheers,
Johnathan
---
Johnathan Nightingale
Human Shield
john...@mozilla.com
Ben Hearsum wrote:
> On 11/17/08 10:39 AM, Ben Hearsum wrote:
>> On 11/17/08 3:55 AM, John O'Duinn wrote:
>>> hi;
>>> Thoughts after reading this thread:
>>> 1) Currently, we only have developers as sheriffs. However, intermittent problems can be caused by bugs in the shipping product code, the test or the RelEng infrastructure. What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*? Not sure how we'd figure out the timezones, and I think we don't have enough people to do 24x7, but even if it's only during some certain hours of the day, the idea of a joint group that can immediately investigate all parts of the problem space seems useful.
>> I love this idea. Having dedicated time where I am to *expect* interrupts makes me much more amiable to them. Personally, I'd like to see these shifts be a week long. This way, each person would only have one once every 6 weeks or so - giving folks a lot of time to *ignore* interrupts and focus on their day to day work.
> A couple more things on this:
> If we implement this I think we should hold off on giving sheriffs direct tinderbox access. If there is a defined point of contact for Releng I think it will greatly reduce the need for sheriffs to step in. I'm totally open to revisiting this later, though.
> And to be clear, I don't think a RelEng person needs to monitor the tree the same way the sheriff does. They shouldn't spend their day *actively* watching for problems. However, they should be hanging out in #developers, watching the newsgroup, and watching for incoming bugs - so they can respond in a timely manner to them.
Having one releng person on call is nice, but looking at the daily habits of Ted and Mossop, we need more people in different time zones empowered to fix the tree.
If it's not feasible to have them have access to the actual machines, we need to re-establish methods to clobber (TODO: define what that is for multiple slaves) and kick builds.
I'm still working on "ignorance is bliss" when it comes down to sheriffing, but the "European tree" is in an utterly sad state. I actually don't remember landing stuff with a good feeling about what the tree was up to for quite a while. Luckily, most of my landings these days are just changes to all-locales or shipped-locales, so I don't really bother.
Ben Hearsum wrote:
> On 11/17/08 10:39 AM, Ben Hearsum wrote:
>> On 11/17/08 3:55 AM, John O'Duinn wrote:
>>> hi;
>>> Thoughts after reading this thread:
>>> 1) Currently, we only have developers as sheriffs. However, intermittent problems can be caused by bugs in the shipping product code, the test or the RelEng infrastructure. What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*? Not sure how we'd figure out the timezones, and I think we don't have enough people to do 24x7, but even if it's only during some certain hours of the day, the idea of a joint group that can immediately investigate all parts of the problem space seems useful.
>> I love this idea. Having dedicated time where I am to *expect* interrupts makes me much more amiable to them. Personally, I'd like to see these shifts be a week long. This way, each person would only have one once every 6 weeks or so - giving folks a lot of time to *ignore* interrupts and focus on their day to day work.
Cool. :-)
> A couple more things on this:
> If we implement this I think we should hold off on giving sheriffs direct tinderbox access. If there is a defined point of contact for Releng I think it will greatly reduce the need for sheriffs to step in. I'm totally open to revisiting this later, though.
Agreed. If there's a RelEng contact/advisor available, they can look into machine issues on the spot, without worries of an accidental change messing up the machine.
> And to be clear, I don't think a RelEng person needs to monitor the tree the same way the sheriff does. They shouldn't spend their day *actively* watching for problems. However, they should be hanging out in #developers, watching the newsgroup, and watching for incoming bugs - so they can respond in a timely manner to them.
Yes, that's why I suggested "advisors to those Developer sheriffs". I'm not suggesting that only one person be sheriff, with sometimes QA, RelEng being that solo sheriff coordinating landings, etc. I'm suggesting that the two Developers who are sheriff have a QA person and RelEng person available to help investigate problems that arise.
Mike Beltzner wrote:
> Changes to tree rules and outages should go to dev.planning, imo.
Yeah - I think the key point is that only people who are responsible for tree management should need to add that group to their reading list. "Normal" developers should see announcements of important tree management events elsewhere.
>> 1) Currently, we only have developers as sheriffs. However, intermittent problems can be caused by bugs in the shipping product code, the test or the RelEng infrastructure. What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*? Not sure how we'd figure out the timezones, and I think we don't have enough people to do 24x7, but even if it's only during some certain hours of the day, the idea of a joint group that can immediately investigate all parts of the problem space seems useful.
> Agreed, this is required. So far I think we've made do by cross-communicating on IRC, usually pulling nthomas or bhearsum in for questions and assistance on IRC. A lot of the times people are told to file bugs (which is fine) but knowing that there's someone "on call" to deal with those issues in a timely fashion is critical and would reduce frustration.
Yes, this would be a big help. Bugs are useful for tracking, but if a bug has to be filed to fix something that's blocking people from committing, people can usually forget about getting anything checked in for the next few hours. We need a faster way to fix these issues.
>> 2) I like the idea of longer "sheriff" shifts (a week, not a day) as it will allow people time to re-learn, be productive and get into the rhythm of it.
> I don't recall seeing this suggestion, aside from the idea that the "super sheriff" (or if you will, sheriff with daily deputies) will have a lesser responsibility towards monitoring each checkin and a greater one towards helping that day's sheriff (or if you will, deputy) should odd problems occur.
Yes, that was the original idea -- I am very much against increasing the time commitment for active sheriffing; it's a huge drain and effort, and I think I would go crazy doing it for a week straight.
>> 3) Doing "sheriff + deputy" for the week seems like a good way to have cover in case sheriff have to attend meeting / take day off during the week... and it's a great way to help train up new sheriffs.
> Absolutely agreed. This seems like a good way to institute the model you posit above.
Well, no -- sheriff + deputy for the week implies, well, being sheriff for a week. Continuing sheriff duties as normal but creating a group that's a safety net for the daily sheriff is what I had in mind -- that is, I didn't want to provide an opportunity for sheriffs to not do their daily duties, but more to have someone to turn to if they need help, and for a group to be aware of longer-standing problems. Hence the idea for a separate group, because I'm not proposing any changes to existing sheriffs, just additional structure around them.
>> 5) Why do we need a different sheriff/treemanagement newsgroup? Personally, the existing dev.planning newsgroup seems like the right venue for all this, imho.
> Right now we're only discussing about 50% of what we should be discussing in newsgroups; the rest is on IRC, wiki pages and group knowledge. I don't think that dev.planning is the right place to discuss daily operational content as opposed to an area for project-wide announcements that affect planning and how we work (like this thread). The types of things I'd expect to see in a sheriffing newsgroup are discussions of "Hey, there's a random orange here, what could be causing it?" with the group collaborating to debug it; that would be a little much for dev-planning.
Yep, I think this was discussed in other posts, but that's what I think as well. m.d.planning is for planning, not for discussion of details of the tree.
This died out a bit without anything actually getting done, so I'm jumpstarting it again. Some additional comments below. From reading back over the thread, people still seem split on whether we need to create a set of super-sheriffs or not. I can try to distill the feedback into a list of requirements/abilities of super-sheriffs and see if it's become any clearer.
>> 1) ...What if QA and RelEng scheduled people to be available, as advisors to those Developer sheriffs? That way, Dev+QA+RelEng could immediately debug all aspects of a problem *together*?
> I think this would be useful. I don't know if it obviates the need for some sheriffs to have box access; I guess it comes down to availability of resources - if sheriffs can get the equivalent of hands-on control when they need to, for instance, attach a debugger to a test-failing process, I think that's a real improvement.
Sure, though "equivalent of hands-on control" is not the same as hands-on control. Trying to remotely debug with someone copy-pasting (or typing back responses, since copy-pasting from some of these machines isn't easy) is a huge pain and really unnecessary. Having someone around to do things like reboot machines, check network/diskspace/etc. would be very helpful though, because then the sheriff can work with that person instead of necessarily always going to the supersheriff.
>> 4) Personally, I'd be in favour of refining membership of the existing sheriff group, and not bother creating "super-sheriffs"; it seems less confusing to me.
> I'm not sure how confusing this would be to developer-sheriffs who are already accustomed to the review/superreview process, and the peer/module owner structure. In any event though, there are long-standing problems with the current organization - there are well-documented failures, particularly during crunch periods. I think that if we have organizational belief that a super-sheriff structure would help track those down, the cost of change-confusion is probably worth it.
I'm not sure what's confusing, really -- there's a set of sheriffs and then there's a set of super-sheriffs who are empowered to do additional things that the sheriffs normally aren't, and who are also charged with day-to-day continuity of the status of the tree.
>> 5) Why do we need a different sheriff/treemanagement newsgroup? Personally, the existing dev.planning newsgroup seems like the right venue for all this, imho.
> Oh this is the easiest part of it for me. In fine Mozilla tradition, that's just recognizing something that's already happening. We have been playing various ad hoc games in the meantime - email threads that get most-but-not-all interested parties, or IRC conversations that are (perforce) timezone-specific. If it falls into disuse, fine, but I suspect it's a coordination point we've needed for a while.
Ok -- as per bug 464609, mozilla.dev.tree-management should be showing up sometime this week, which should help. (dev.planning is not the right place for this, as others have said; talking about why test X is failing is not in the same scope as planning a schedule for the next release.)