[r9y.dev/discuss] Let's talk about Reliability Engineering

Jun 15, 2023, 4:22:35 PM

Howdy, reliable friends!

I'll be taking over hosting duties from Dave going forward (thanks for all the fish, Dave!). So I'll see you all on Tuesday!

Meeting link

[r9y.dev/discuss] Let's talk about Reliability Engineering

Tuesday Jun 20, 2023 ⋅ 09:30 – 10:30 (Pacific Time - Los Angeles)

Hosted by Google Cloud Platform

Format

Also, the mailing list is available any time for asynchronous discussion.

Jun 20, 2023, 3:48:27 PM

to Reliability discussion group

Thanks for attending this week! Here are the notes:

Topics discussed:

Should your service's SLOs be listed as the Objectives / Goals / Key Results for your team?

Comments:

some OKRs are implicit and never really get written down. are SLOs also implicit?
yes!
clarify: OKRs might be concrete goals around mutation of a service. what about "fix it when it breaks" -- can assign work against that.
ought to be really easy to put top level SLO into OKRs.
OKRs vs BAU (business as usual)
SLO as O, or as KR?

Does SRE require platforms? (Alternatively, is SRE dead, long live Platform Engineering?)

Comments:

the hype is real! platforms have been a part of SRE at google for a long time. is it intrinsic?
required? nice to have? is SRE ~= PE ?
counterpoint: adding another layer doesn't change things. lots of layers/platforms already.
each has support teams. naming one "server platform team" shouldn't change that. useful wrt dev teams though. SRE remains managing server-specific oddities.
machine provisioning is its own thing. maybe an SRE team watches "are functions working" - but is "cleanly" SRE or dev?
isn't this all the same thing with different buzzwords? devops, sre, PE.
culture is important. don't want "holding it wrong" problems.

How do people connect infrastructure alerts with incident response / NOC / enterprise alerting / ITIL?

Comments:

how to fuse monitoring inputs?
inverse: what alerts are actually actionable? what do you do when X happens? why is *this* alert important, how does it relate to code/service?
agree. sit and watch for 3 months. or require playbook doc updates. OR: show the watchers when the doc was last updated (freshness) as a proxy for confidence in understanding the signal.
how recently an alert fired can be a helpful signal.
if you can't understand an alert at 3am, it's not a good alert.
causes vs symptoms

Talking to "the business" -- how to articulate ROI of R9y?

Comments:

trust/belief that this approach CAN improve r9y, but why should we care? biz wants numbers, $
is FUD our only choice? ;)
can availability and speed be put into business terms about being better than competitors?
what does it cost if you go down? where do we spend money to prevent that? "not everything needs 5x9"
dollars per minute + diminishing returns of r9y investment (exponential curve of $ per 9)
not just dollars, also customers or reputation. concern that a thing won't be there in the future when you need it.
SLOs as short-term proxy to long term trend of customer confidence, revenue.
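
A rough sketch of the dollars-per-minute point above, in Python. The function names and the $1,000 per minute figure are made up for illustration; plug in your own numbers. It only models the benefit side (downtime cost avoided by each extra nine), which shrinks 10x per nine. What it costs to actually reach each nine is not modeled here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime permitted by an availability target with `nines` nines.

    nines=2 means 99%, nines=3 means 99.9%, and so on.
    """
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


def downtime_cost(nines: int, dollars_per_minute: float) -> float:
    """Worst-case yearly cost if the whole downtime allowance is used."""
    return allowed_downtime_minutes(nines) * dollars_per_minute


def savings_from_next_nine(nines: int, dollars_per_minute: float) -> float:
    """Downtime cost avoided per year by going from `nines` to `nines + 1`."""
    return downtime_cost(nines, dollars_per_minute) - downtime_cost(
        nines + 1, dollars_per_minute
    )


if __name__ == "__main__":
    dollars_per_minute = 1_000.0  # hypothetical figure, not from the discussion
    for n in range(2, 6):
        print(
            f"{n} nines: {allowed_downtime_minutes(n):10.1f} min/yr allowed, "
            f"next nine avoids ${savings_from_next_nine(n, dollars_per_minute):,.0f}/yr"
        )
```

Comparing the avoided cost per nine against an estimate of what the next nine would cost to build is one way to back up "not everything needs 5x9" with numbers.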

Are there good icebreakers for starting the discussion around reliability with large organizations who are on prem?

Comments:

one view: I don't understand the change (to cloud).
why move to the cloud?
sometimes customers aren't sure why
onprem can lift/shift onto private (or private-on-public) cloud. did that fix anything?
might move to cloud for "other" reasons but then misunderstand the r9y characteristics of cloud
can a cloud provide new capabilities to a platform/team? r9y.dev map
Steve's Pyramids of R9y: https://www.youtube.com/watch?v=-lHPDx90Ppg

Which alerting methods provide the best signals? Thresholds, % error budget remaining, time to X% remaining, error budget burn rate, etc.

Comments:

all can be wrong for some thing at some time :)
consider threats first, then find the metric that matches those, e.g. a bad rollout.
whatever allows you to look at a page/alert and say "nah, I don't care" effectively
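
For anyone unfamiliar with two of the signals named in the topic, here is a minimal Python sketch of error budget burn rate and time until the budget runs out. It assumes the common definition (burn rate = observed error ratio divided by the error ratio the SLO allows); the function names and example numbers are illustrative, not something presented in the meeting.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A value of 1.0 means the budget lasts exactly the SLO window;
    10.0 means it is being used up ten times too fast.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_error_ratio = 1.0 - slo_target
    return error_ratio / allowed_error_ratio


def hours_until_exhausted(
    burn: float, window_hours: float, budget_remaining: float = 1.0
) -> float:
    """Hours until the budget hits zero if the current burn rate holds.

    `budget_remaining` is the fraction of the budget still unspent (0 to 1).
    """
    if burn <= 0.0:
        return float("inf")  # nothing is burning, so the budget never runs out
    return budget_remaining * window_hours / burn


if __name__ == "__main__":
    # Hypothetical example: 99.9% SLO over 30 days, currently 1% of requests failing.
    rate = burn_rate(error_ratio=0.01, slo_target=0.999)
    hours = hours_until_exhausted(rate, window_hours=30 * 24)
    print(f"burn rate {rate:.1f}x, budget gone in about {hours:.0f} hours")
```

A fixed threshold on the error ratio ignores how strict the SLO is; burn rate and time-to-exhaustion fold the SLO in, which may make the "nah, I don't care" call easier for slow burns.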