MP: Here with us today we have someone who I think the vast majority of SREs at Google have had the opportunity to either hear speak or learn from, and who is generally a really well-known name around SRE at Google. And he's here today to talk to us about on-call. Andrew, why don't you go ahead and introduce yourself?
MP: Yeah. So on-call is something I think a lot of people in operations and systems administration are already familiar with. Why would you say it's SRE's job? It's always a little bit of a funny thing that we're usually the point people for these emergencies for these huge complicated systems, but we're also not usually the experts in how any particular piece of these systems works.
Andrew: Yeah. Well, let me set the stage a little bit. So Google is a massively large company these days, but it has always been the case at Google that there have been more software engineers in a general sense, working on products, than there have been SREs. We are, if you will, naturally scarce. So I agree that in many cases, for some of our most public-facing, high-revenue, high-risk, et cetera, sorts of products, SRE should be on-call or co-on-call for a service.
The other part that I just want to mention that's a part of our leverage, a part of our scarcity, and a part of our selectivity, is that the overwhelming majority of, let's call them microservices at Google, do not have SRE on-call for them. So we pick and choose not our battles, but our responsibilities. And you can think of the fact that at least at Google, if an SRE is on-call, or an SRE team is on-call for a product, that means that there's a certain extra standard of reliability being afforded to it. But we try not to hoard that for ourselves. So even in many of those on-call rotations, we are co-on-call with our developers. Just wanted to clarify that off the top.
Whereas many other software developers, who may themselves be on-call for their own service, don't necessarily have that mindshare or that amount of time allocation to do that. So, I don't know. I think we're meant to be exemplars of on-call, and we will both do a great job at on-call and, like I said, make it better.
So anyway, that's my take on making sure that you have a balanced thing and having the managerial and cultural support to say, "This on-call rotation isn't sized for the number of incidents or the SLA that it has, or the number of people involved, et cetera." That's an important aspect of making sure that on-call is the sort of thing you don't run away from.
MP: Yeah, so something I've noticed that is not a standard across teams, but is a common practice, is to actually have both an on-call and an on-duty rotation. How does this relate to that limit?
Andrew: Sure. So just to be clear, let's define what we mean by on-call and on-duty generally. So I think on-call is: you are responsible for the vitality of the uptime, the responsiveness of the service during your shift. And on-duty means some sort of probably small quanta, but maybe high quantity of work that needs to be done: crank turning, answering tickets, answering support, whatever type of stuff.
So the only thing I would guard against, if you were to ask, is on-duty draining the on-caller, so that when they do get paged, say towards the end of their shift but while they are still on-call, they go, "Man, I just did a bajillion thousand tickets and now I'm getting paged. Oh, my head hurts." That's not setting a team up for success. So some teams, like I said, optimistically overload on-call to also include, like, "while you're here, do on-duty." But others keep it separate. And really, it comes down to how capable you and your team are of being simultaneously on-call, even if almost quote, unquote, "nothing happens", and also doing some other element of day job work.
Andrew: Absolutely. So let me first answer the, "Can you split being on-call at the same time?" and then, "How would you construct maybe the demographics and distributions of people involved in an on-call rotation?" So keep me honest; let's bookmark that.
So this is an indirect way of answering your question, "What if you split the work?": it's my personal preference to have on-call and on-duty be completely separate, because I think it gives more agency to the single on-caller to say, "I choose to do whatever I do when I'm not being paged. Whatever is best for my work and day job," as opposed to being told, "Please do tickets or this or that."
But if you were to split on-call by, say, having two on-callers, regardless of how on-duty is split, I think the ability to respond to an incident may be slightly improved possibly, but the authority-agency-visibility of, "Who is the on-caller? I need to talk to the on-caller" is reduced because you now have two people. You have maybe heard of phrases like, "Too many cooks in the kitchen," or, "If a problem is everyone's problem, then it is no one's problem." Or the classic, "If you don't get what you want from one parent, you ask the other."
Andrew: Yes. The other argument I would make here is that if you were to be so unlucky as to have two problems coming in within a short time of each other, and it turns out that they are in fact completely unrelated or mostly orthogonal, it helps to have your secondary remain mentally fresh, so that when you get paged again and you look at the thing you can go, "Oh, that's not related to this at all. Dear secondary person, would you please take this?" Then this also reduces the cognitive burden for each of you. I think that's related to what you're saying.
Andrew: So I think the second part of your question, if I'm hearing you right, was about how would you design parts of on-call, or would you like to throw that question back to me? I just want to make sure I understand.
Viv: Sure. Yeah, I was just asking about the rotation since we were talking about how you might staff it. Other parts of the rotation. So maybe there are more opinions on staffing, but also: how long is your rotation? What does it cover? I don't know. I know I'm throwing more questions at you in response to us bookmarking a question for later.
Some other teams have much more of a congealed or consecutive sort of basis. So they'll say, "Okay, well, we're still going to do 12 and 12, or maybe we move the divider between the two on-call teams because one is in a slightly different time zone where it's a little bit harder to do this. So we're going to have some sort of a mercy shift, which is like we do 10 and 14 because of whatever the case may be." Sometimes you don't control what nation your second on-call team is hired in because it was a matter of your company staffing priorities, let's say.
So regardless of whether it's 12 and 12 in a day, or it's X and Y that sum up to 24, maybe you do formally say, "Okay, we're going to have the same person be on-call from North America during their daytime, plus or minus, for 7 days at a run." So that's much more of a different end-of-the-rails sort of setup. So you say, "I'm doing an on-call shift for 7 days, 12 hours a day. And I have a colleague who's also doing 7 and 12."
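As a rough illustration of the arithmetic in that setup, here is a minimal Python sketch of a two-person rotation where each day is split into X and Y hours that sum to 24, held for 7 days at a run. Everything in it is an assumption made for the example: the `Shift` record, the `build_week` and `hours_for` helpers, the placeholder names, and the 8:00 handoff time are hypothetical, and it uses naive datetimes, so it ignores time zones and daylight saving.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Shift:
    person: str
    start: datetime
    end: datetime


def build_week(start, first_person, second_person, first_hours=12, days=7):
    """Split each 24-hour day into two shifts of X and (24 - X) hours."""
    if not 0 < first_hours < 24:
        raise ValueError("first_hours must be strictly between 0 and 24")
    shifts = []
    for day in range(days):
        day_start = start + timedelta(days=day)
        handoff = day_start + timedelta(hours=first_hours)
        day_end = day_start + timedelta(days=1)
        shifts.append(Shift(first_person, day_start, handoff))
        shifts.append(Shift(second_person, handoff, day_end))
    return shifts


def hours_for(shifts, person):
    """Total on-call hours held by one person across the given shifts."""
    seconds = sum((s.end - s.start).total_seconds() for s in shifts if s.person == person)
    return seconds / 3600


# A "10 and 14" split held by the same two people for 7 days.
week = build_week(datetime(2024, 1, 1, 8, 0), "na_oncaller", "colleague", first_hours=10)
print(hours_for(week, "na_oncaller"))  # 70.0
print(hours_for(week, "colleague"))    # 98.0
```

With the default `first_hours=12` both people carry 84 hours over the week; moving the divider to 10 and 14 shifts that to 70 and 98, which is the kind of imbalance a team would want to see before agreeing to it.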
And there are also variations somewhere in the middle between these two that we've seen as well, which is, for example, not doing daily, not doing weekly, but doing something like either over the weekend plus Friday or Monday, so let's call it "Friday, Saturday, Sunday," and then having an entirely during the workweek "Tuesday, Wednesday, Thursday" setup. And by the way, I say that with a North American view on the workweek. You can imagine modifications for cultural norms in certain nations and certain countries, specifically around maybe days of Sabbath or employment law, et cetera.
Honestly, I think the difference between a halfsies Monday-Friday split versus a Friday plus the weekend and workdays minus Friday split is minimal. I think it may be over-optimization, but honestly, if there's a thing that resonates with your team or with your org and they would prefer to do that, give them that choice. Plus so long as you have systems that allow you to carefully trade shifts for people so that they can further micro-optimize for mutual benefit amongst pairs of people who want to do each other favors, you're going to be okay.
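To make the shift-trading idea concrete, here is a small Python sketch, not a description of any real scheduling tool. The `Shift` record, the `trade` function, the shift labels, and the rule that both current owners must agree are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Shift:
    label: str
    person: str


def trade(schedule, label_a, label_b, agreed_by):
    """Return a new schedule with the owners of two shifts swapped.

    Assumption: a trade only goes through if both current owners agreed.
    """
    by_label = {s.label: s for s in schedule}
    a, b = by_label[label_a], by_label[label_b]
    if {a.person, b.person} - set(agreed_by):
        raise PermissionError("both on-callers must agree to the trade")
    swapped = {
        label_a: replace(a, person=b.person),
        label_b: replace(b, person=a.person),
    }
    # Leave every other shift untouched; the input schedule is not modified.
    return [swapped.get(s.label, s) for s in schedule]


# A "workdays minus Friday" / "Friday plus the weekend" split over two weeks.
schedule = [
    Shift("week1 Mon-Thu", "alice"),
    Shift("week1 Fri-Sun", "bob"),
    Shift("week2 Mon-Thu", "bob"),
    Shift("week2 Fri-Sun", "alice"),
]
new_schedule = trade(schedule, "week1 Fri-Sun", "week2 Fri-Sun", agreed_by={"alice", "bob"})
print([s.person for s in new_schedule])  # ['alice', 'alice', 'bob', 'bob']
```

The design choice here is that the schedule is only ever changed by an explicit, mutually agreed swap between the two people involved, rather than by a system reassigning shifts on its own.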
The last thing I'd advise for you to do, however, is just to say, "There is a robot that declares when people are on-call and it's all going to happen and you can't change, and deal with it." You have to acknowledge there are humans all throughout all of these processes. And we want to optimize for their happiness and their sustainability to want to come back to the on-call rotation.
MP: Prior to when I came to Google, I was actually part of a single-site on-call rotation that had a 24-hour pager holding. I can't remember exactly how we split the weeks, but it would be multiple consecutive 24-hour periods that we'd be holding the pager for. And I'm sure there are organizations out there that don't have the ability to have dual-site teams.
But maybe the general lesson to take away from this is, if you're going to be doing an all-day and all-night on-call rotation, I don't think it is sustainable personally to be paged multiple times in the middle of the night for multiple nights if the type of work you are doing is not shift work. I know certain classes of engineering work are like, "Oh, I'm going to roll onto the night shift and I'm going to roll off." That's a different story. But if you are a quote-unquote, "daylight hours" worker, 40 hours a week, whatever the case is, but you are also on-call, the only thing I would ask of a management structure in that is to have compassion for the fact that if people are woken up in the middle of the night multiple nights, they're not going to be at their best for later times.