TOMORROW 0930PT http://r9y.dev/discuss

Steve McGhee

Aug 14, 2023, 1:11:21 PM

to Reliability discussion group

I’m fresh off a long vacation and — what do you know — tomorrow is a reliability-discuss! I love these. We talk about reliability (r9y, sorry), resilience, SRE, DevOps, Platforms, our feels, oncall, toil, and maybe a little Taylor Swift.

[r9y.dev/discuss] Let's talk about Reliability Engineering
Tuesday, August 15 · 09:30 – 10:30
Time zone: America/Los_Angeles
Google Meet joining info
Video call link: https://meet.google.com/kdk-hnmf-yjp
Or dial: (US) +1 609-491-2429 PIN: 478 299 241#
More phone numbers: https://tel.meet/kdk-hnmf-yjp?pin=4362527073963

See you tomorrow!

Steve McGhee

Aug 16, 2023, 5:00:38 PM

to Reliability discussion group

Thank you to everyone who joined today! A summary of the discussion is below.

See you next time: Tuesday, Sept 19!

r9y.dev/discuss

r9y-discuss Date and Time: 15 August 2023 17:31 (UTC+00:00)

Topics discussed:

How has r9y changed since 2020? Any predictions that came true? Surprises?

hot take: "nothing has really improved", apart from the talking
another take: awareness (the words) is up, even if real understanding and action are not.
was: Ops with a new name
now: more like the book!
e.g. SLOs can be useful in theory, but are sometimes not actually/directly useful! Is there a discussion happening around this?
are SLOs just a proxy/abstraction for "measurement"?
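
To make the "SLOs as measurement" question concrete, here is a minimal sketch of how an SLO sits on top of a plain measurement. It was not part of the discussion: the function name, the 99.9% target, and the request counts are all hypothetical. The measurement is just good events versus total events; the SLO adds a target and turns the gap into an error budget.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget left in a window.

    slo_target   -- hypothetical objective, e.g. 0.999 for "99.9% of requests succeed"
    total_events -- all events measured in the window
    bad_events   -- events that failed the SLI in the window
    """
    # The budget is the number of bad events the target tolerates.
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 1.0 if bad_events == 0 else 0.0
    return 1.0 - bad_events / allowed_bad


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.0%}")
```

A negative result means the budget is overspent. On this view the raw measurement is the same either way; what the SLO adds is the target and the decision it is supposed to trigger.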

what reliability/SRE conferences are you looking forward to next year?

remote > travel!
gophercon! https://www.gophercon.com/
p99conf (1st edition!) https://www.p99conf.io/
DORA Community day at DOES Las Vegas: https://members.itrevolution.com/register/dora/
Chaos Carnival https://chaoscarnival.io/
SREcon(s) https://www.usenix.org/srecon
LFIconf https://www.learningfromincidents.io/learning-from-incidents-conference-2023

any experience with r9y in air gap envs?

flash drives and sneakernet!
art of slo, distributed image server - links go here.
removal of toil!
To test air-gapped environments for reliability, I've heard the US gov't has experimented with https://litmuschaos.io/, but this is only for testing, not CI/CD.
Target Corp built a nice platform called TAP. This presentation is an oldie but a goodie: https://www.youtube.com/watch?v=cnHfK4MZA2Y&t=1260s They now use TAP to deploy/release to multiple cloud providers, on-premises, and edge, all while the developer doesn't have to think much about the workload the app is going to

Do you have a Terralith? How did you get there, and what now?

yes! it happens.
maybe this is due to the tool (tf).
"pink unicorn" thinking - project an ideal world
tools need user validation: is this actually solving the problem?
can come from team silos. infra team has one pipeline for all things.
platform teams can suffer similarly: releasing the platform too slowly, as a big bang. ouch.
as a solution? SOA - provide hooks to service-owning developers.
does "my terraform" for my service really need to be mine? esp. if I don't understand it. thus it goes back to a central team who creates a 'lith
ETOOMUCHSTUFF
why reduce understanding of production? controls / knowledge? if it needs so much custom knowledge, adapt the tools. make it less hostile. reduce the need for an interface team.
instead of ops helping dev make good choices, a bad model might be "here, go play with this". not abstracted enough? or just a weird language?
functions > modules.
keptn.sh

"Is platform engineering becoming part of/intersecting with reliability roles/topics?" Is Platform Reliability Engineering a thing now?