[r9y.dev/discuss] Let's talk about Reliability Engineering

Steve McGhee

Jun 15, 2023, 4:22:35 PM
to [r9y.dev/discuss] Let's talk about Reliability Engineering

Howdy, reliable friends!

I'll be taking over hosting duties from Dave going forward (thanks for all the fish, Dave!). So I'll see you all on Tuesday!

[r9y.dev/discuss] Let's talk about Reliability Engineering
Tuesday Jun 20, 2023 ⋅ 09:30 – 10:30 (Pacific Time - Los Angeles)

A discussion group on engineering for reliability (r9y)

Hosted by Google Cloud Platform

Format

  • Monthly sessions via Google Meet -- all are welcome!

  • We use a Lean Coffee format where discussion topics are generated by the group

Also, the mailing list is available any time for asynchronous discussion.


More info at r9y.dev/discuss

Steve McGhee

Jun 20, 2023, 3:48:27 PM
to Reliability discussion group
Thanks for attending this week!  Here are the notes:

Topics discussed:


Should your service's SLOs be listed as the Objectives / Goals / Key Results for your team?

Comments:

  • some OKRs are implicit, don't write them down, really. are SLOs also implicit?

  • yes!

  • clarify: OKRs might be concrete goals around mutation of a service. what about "fix it when it breaks" -- can assign work against that.

  • ought to be really easy to put top level SLO into OKRs.

  • OKRs vs BAU

  • SLO as O, or as KR?


Does SRE require platforms? (Alternatively, is SRE dead, long live Platform Engineering?)

Comments:

  • the hype is real! platforms have been a part of SRE at google for a long time. is it intrinsic?

  • required? nice to have? is SRE ~= PE ?

  • counterpoint: adding another layer doesn't change things. lots of layers/platforms already.

  • each has support teams. naming one "server platform team" shouldn't change that. useful wrt dev teams though. SRE remains managing server-specific oddities.

  • machine provisioning is its own thing. maybe an SRE team watches "are functions working", but is "cleanly" SRE or dev?

  • isn't this all the same thing with different buzzwords? devops, sre, PE.

  • culture is important. don't want "holding it wrong" problems.


How do people connect infrastructure alerts with incident response / NOC / enterprise alerting / ITIL?

Comments:

  • how to fuse monitoring inputs?

  • inverse: what alerts are actually actionable? what do you do when X happens? why is *this* alert important, how does it relate to code/service?

  • agree. sit and watch for 3 months, or require playbook doc updates. OR: let the watchers see when the doc was last updated (freshness) to imply confidence in understanding the signal.

  • how recently an alert fired can be a helpful signal.

  • if you can't understand an alert at 3am, it's not a good alert.

  • causes vs symptoms


Talking to "the business" -- how to articulate ROI of R9y?

Comments:

  • trust/belief that this approach CAN improve r9y, but why should we care? biz wants numbers, $

  • is FUD our only choice? ;)

  • can availability and speed be put into business terms about being better than competitors?

  • what does it cost if you go down? where do we spend money to prevent that? "not everything needs 5x9"

  • dollars per minute + diminishing returns of r9y investment (exponential curve of $ per 9)

  • not just dollars, also customers or reputation. concern that a thing won't be there in the future when you need it.

  • SLOs as short-term proxy to long term trend of customer confidence, revenue.
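
The "dollars per minute" and "$ per 9" points above can be made concrete with a little arithmetic. Below is a minimal Python sketch, assuming a flat cost per minute of downtime. The $1,000 figure and the function names are illustrative assumptions, not numbers from the discussion. Each extra 9 cuts the allowed downtime by 10x (99.9% over a 365-day year allows 525.6 minutes), so the loss avoided by the next 9 shrinks by 10x as well.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime allowed by an availability target of N nines (3 -> 99.9%)."""
    return MINUTES_PER_YEAR / (10 ** nines)


def expected_loss(nines: int, dollars_per_minute: float) -> float:
    """Yearly loss if the whole downtime allowance is used (assumed flat cost)."""
    return allowed_downtime_minutes(nines) * dollars_per_minute


def avoided_loss(nines: int, dollars_per_minute: float) -> float:
    """Yearly loss avoided by moving from (nines - 1) to nines."""
    return expected_loss(nines - 1, dollars_per_minute) - expected_loss(
        nines, dollars_per_minute
    )


if __name__ == "__main__":
    cost_per_minute = 1_000.0  # hypothetical figure, for illustration only
    for n in range(2, 6):
        print(
            f"{n} nines: {allowed_downtime_minutes(n):9.2f} min/yr down, "
            f"avoids ${avoided_loss(n, cost_per_minute):,.0f}/yr vs {n - 1} nines"
        )
```

Comparing avoided_loss against what the next 9 would cost to build is one way to frame "not everything needs 5x9". It leaves out the customer and reputation effects raised in the comments above.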


Are there good icebreakers for starting the discussion around reliability with large organizations who are on prem?

Comments:

  • one view: I don't understand the change (to cloud).

  • why move to the cloud?

  • sometimes customers aren't sure why

  • onprem can lift/shift onto private (or private-on-public) cloud. did that fix anything?

  • might move to cloud for "other" reasons but then misunderstand the r9y characteristics of cloud

  • can a cloud provide new capabilities to a platform/team? r9y.dev map

  • Steve's Pyramids of R9y: https://www.youtube.com/watch?v=-lHPDx90Ppg


Which alerting methods provide best signals? thresholds, % error budget remaining, time to X% remaining, error budget burn rate, etc

Comments:

  • all can be wrong for some thing at some time :)

  • consider threats first, then find the metric that matches those. eg a bad rollout.

  • whatever allows you to look at a page/alert and say "nah, I don't care" effectively
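
To compare the signals listed in this topic, here is a minimal Python sketch of how three of them could be computed from the same request counts. The function names, the 30-day window, and the sample numbers are assumptions for illustration. Burn rate here means the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window.

```python
def _check_slo(slo: float) -> None:
    # An SLO of exactly 1.0 leaves no error budget, so the ratios are undefined.
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")


def burn_rate(errors: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    _check_slo(slo)
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (errors / total) / (1.0 - slo)


def budget_remaining(errors: int, total: int, slo: float) -> float:
    """Fraction of error budget left; counts cover the whole rolling SLO window."""
    _check_slo(slo)
    if total == 0:
        return 1.0
    allowed_errors = total * (1.0 - slo)
    return 1.0 - errors / allowed_errors


def hours_to_exhaustion(
    remaining: float, rate: float, window_hours: float = 30 * 24
) -> float:
    """Hours until the budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return remaining * window_hours / rate


if __name__ == "__main__":
    slo = 0.999  # hypothetical 99.9% target
    rate = burn_rate(errors=10, total=1_000, slo=slo)  # last hour, say
    left = budget_remaining(errors=500, total=1_000_000, slo=slo)  # last 30 days
    print(f"burn rate {rate:.1f}, budget left {left:.0%}, "
          f"exhausted in {hours_to_exhaustion(left, rate):.0f}h")
```

A static threshold looks only at the error rate, while the other three signals are all derived from the same budget, which is one reason they tend to agree or disagree together. Which one pages is still a judgment call, per the "consider threats first" comment.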


See you next time!