certification


T D

Nov 9, 2020, 8:25:07 PM
to Chaos Community
Hi,

Searching for certification in this forum comes up with no results. I think large companies with infrastructures aspiring to be fully fault tolerant would like this to be part of their arsenal, but would not touch it with a barge pole on prod without knowing they have a specialised individual capable of deploying the solution reliably.

So, has anyone got as far as creating a professional verification process?

Happy to be involved if not. :)

Cheers,
        Toby

Akhilesh Sarfare

Nov 9, 2020, 8:47:00 PM
to T D, Chaos Community
Hi,

That's a great idea. I had also searched for a certification earlier, but found no results.

Cheers,
Akhilesh

--
You received this message because you are subscribed to the Google Groups "Chaos Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chaos-communi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/chaos-community/55ad9616-20d4-4b5d-a3c8-c1dfa17f49dbn%40googlegroups.com.

Mirko Ebert

Nov 10, 2020, 2:08:59 AM
to Chaos Community
I'm on board. 

Mirko

Matt Coles

Nov 11, 2020, 5:01:11 AM
to Chaos Community
I'd be keen to contribute to something like this also. Perhaps we could start a working group and figure out what's needed to make this happen?

Rodrigo Julián Martín

Nov 11, 2020, 6:21:08 AM
to Matt Coles, Chaos Community

Akrem RIAHI

Nov 11, 2020, 6:32:59 AM
to Rodrigo Julián Martín, Matt Coles, Chaos Community

Bugra Derre

Nov 11, 2020, 7:46:27 AM
to Chaos Community
Interested as well. It would be nice to be a part of it.

Mary Cardenas

Nov 11, 2020, 11:21:13 AM
to Matt Coles, Chaos Community
I'm interested in helping out. Chaos Engineering is widely used within my org and I can share examples of industry practice and implementation.

Mary

T D

Nov 11, 2020, 12:21:37 PM
to Chaos Community
A working group would be a good start, I agree. 

I think the first step would be to create a curriculum. I started writing this, but realised it falls down as it depends on the use of one tool. But hey, maybe you have some insight into how it might be generalised.

There would normally be some topics surrounding the mindset and goals.
The rest would probably lie around the use of a tool, so:
- installation
- upgrades/backup
- use cases and demonstrable how-tos (these would probably take up a large chunk of the requirements)
- troubleshooting
- risk mitigation/disaster recovery
- best practices

I'm not sure if you'd need to set up your own certification body. It might be better to take a curriculum and an exam to one that already does this, like the Linux Foundation, which already runs a bunch of disparate certifications.

Toby

Mirko Ebert

Nov 12, 2020, 11:38:08 AM
to Chaos Community
How do we want to start? Should we start with a mail group, a video call, or a website?
  • Definition of the target group
  • What could be proved?
    • Best practice, like ISO 9000
    • Course certificate, like Scrum Master
  • What type of organization runs the certifications?
    • Foundation?
    • Franchise companies?

Mirko

Akrem RIAHI

Nov 12, 2020, 11:42:58 AM
to Mirko Ebert, Chaos Community
We can start with a Slack channel to discuss the plan, then do a sync-up by video call after collecting ideas.

Jason Yee

Nov 12, 2020, 12:16:44 PM
to Chaos Community
We've been working on a certification program at Gremlin (no ETA yet). We're still in the process of defining our curriculum, but as Toby discovered, after we covered the basics of how to run an experiment, consider safety, etc. it quickly became tool specific. I'm curious to know what topics everyone thinks should be covered in a broader, more tool-agnostic certification.

I've created a #certification channel in the Chaos Engineering Slack if folks want to use that to discuss plans and collect ideas.

Jeremy Edberg

Nov 12, 2020, 3:30:30 PM
to Jason Yee, Chaos Community
When I used to teach Chaos Engineering, most of my curriculum was
about DevOps best practices (automated deployment, monitoring and
alerting, canaries, good incident response and management, etc.)
because those were all necessary to be successful at CE.

Beyond that I would teach queueing theory, some basic control theory,
and concepts like service isolation and dealing with backpressure and
thundering herds.

It's hard to teach Chaos Engineering in isolation because it requires
so many other things to be in place first to be successful. So
perhaps the curriculum should include all the things you need to have
in place before even starting?

Amit Saha

Nov 12, 2020, 4:27:30 PM
to Jeremy Edberg, Jason Yee, Chaos Community


+1 Without the above, it might end up being a software certification.

Omar Saenz Herrera

Nov 17, 2020, 11:45:42 AM
to Amit Saha, Jeremy Edberg, Jason Yee, Chaos Community
Hi everyone,

As part of my "out of work" activities I am the membership officer of the (ISC)² London Chapter. (ISC)² is an international, nonprofit membership association for information security professionals and responsible for the most widely recognised security certification in this industry, CISSP.

I am also familiar with other vendor-agnostic certification bodies such as ISACA, SANS, ITIL, etc. 

Regarding Chaos Engineering itself my contribution will be limited (I'm learning from all of you!), but I'm happy to contribute ideas around how to establish/approach a certification body and create a book of knowledge, and to give general support to this initiative.

I joined the Slack channel so I'll see you there...

Omar


manav ghosh

Nov 17, 2020, 11:16:35 PM
to Chaos Community
I would also like to contribute towards the chaos certification curriculum. Currently I'm driving the chaos engineering practice in my organization, where we are trying to practice it across all the different hosting environments: Kubernetes, Cloud Foundry and VM-based applications.

Kindly let me know how I can add value to this group...

Manav  

Jeremy Bares

Nov 18, 2020, 12:17:58 AM
to manav ghosh, Chaos Community
John Kemnetz (john.k...@microsoft.com) and I work on resiliency and chaos engineering within Azure at Microsoft. We are very interested in this topic and would like to participate in the working group as well. Thank you Toby for starting this thread, it is much appreciated.

Thanks,

Jeremy

Haytham Elkhoja

Nov 18, 2020, 12:31:39 AM
to Jeremy Bares, manav ghosh, Chaos Community
Haytham Elkhoja, haytham...@ibm.com.
I’m the Chaos Engineering Guild leader at IBM. I’d like to participate in the working group.

Haytham,



Akhilesh Sarfare

Nov 18, 2020, 12:35:13 AM
to Haytham Elkhoja, Jeremy Bares, manav ghosh, Chaos Community
Hi,

Count me in. I'd like to be an active contributor to the working group.

Regards,
Akhilesh


Lefty G. Balogh

Nov 18, 2020, 2:43:00 AM
to Chaos Community
Guys, help me understand why you are taking this conversation to a slack channel?
I thought this group forum was supposed to be the vehicle for discussion, not a third party, practically private channel...
Message has been deleted

Sheryl Drake

Nov 20, 2020, 10:39:41 PM
to Chaos Community
I'd like to contribute as well. Maybe we could determine whether any existing certification entities map closely to chaos engineering and approach them, as they already have the testing engines and awarding backends built. I think some background documentation needs to be collected and assembled into a consumable form, and then, of course, test banks. There could conceivably be levels, e.g. Level 1 Foundational (application of basic concepts), Level 2 Professional, and Level 3 Expert, perhaps with an application-in-practice type of test. I think it is a great idea and I'm happy to work with the process.

Sheryl

Naga Ravi Chaitanya Elluri

Nov 20, 2020, 10:52:17 PM
to Sheryl Drake, Chaos Community
Count me in, I would like to be part of the working group. We have been working on tools for Kubernetes/OpenShift: Kraken for fault injection, and Cerberus for making sure the cluster is healthy during the chaos.

Chris Aniszczyk

Nov 20, 2020, 11:43:57 PM
to Naga Ravi Chaitanya Elluri, Sheryl Drake, Chaos Community
I'm also happy to help here!

At CNCF / Linux Foundation, we've produced a lot of certifications and so on with CKA as a recent example (https://www.cncf.io/certification/cka/) - I'd love to have an opportunity to help the chaos engineering community put together something great! Also selfishly with my CNCF hat on, we need more folks in the community that are aware of chaos engineering practices so the more we can help make this happen the better!



--
Chris Aniszczyk (@cra)

Mikołaj Pawlikowski

Nov 20, 2020, 11:52:42 PM
to Lefty G. Balogh, Chaos Community
Hey all, 

I'm Miko, author of the Chaos Engineering book from Manning https://www.manning.com/books/chaos-engineering?a_aid=chaos&a_bid=d3243216 and SRE Lead @ Bloomberg.
I'd be interested in getting involved in this.

Some really good points here already:
1) I agree with T D that keeping it vendor neutral is important. Typically vendors have their own trainings, but right now, the basic, un-affiliated tools for Linux need to be taught in this context IMO
2) I think that a working group would be a great way to get started - let's kick off with a meeting to discuss: https://doodle.com/poll/9bq2wa2rhq4wh8ew
3) Jeremy's making a good point, that CE cuts across different disciplines, and there is a lot of overlap with SRE/DevOps, so it's hard to teach CE in isolation. Unless, perhaps, you direct it squarely at seasoned SREs?
4) T D's suggestion to attach this to the Linux Foundation or something similar is also pretty good - this would mostly take care of the promotional aspect (after all, what's the point of doing this, if no one bothers taking this certification)

I spent the last year or so writing the book, trying to cover a wide spectrum of scenarios/technologies/stacks/techniques without relying on any single one too much. Perhaps some of the contents of the book could be a starting point for designing a curriculum?

I created a doodle to see what times would work for most people next week: please fill in https://doodle.com/poll/9bq2wa2rhq4wh8ew

I know we have a spread across time zones, so I tried to suggest slots that work in the EU and on both coasts of the US. I'll send an invite for the slot that works for the most people. I'll close the poll on Saturday.



--
Best regards,
Mikolaj Pawlikowski
+44 747 330 2049

Amit Saha

Nov 20, 2020, 11:52:43 PM
to Lefty G. Balogh, Chaos Community


On Wed, 18 Nov 2020, 6:43 pm Lefty G. Balogh, <leftyg...@gmail.com> wrote:
Guys, help me understand why you are taking this conversation to a slack channel?
I thought this group forum was supposed to be the vehicle for discussion, not a third party, practically private channel...

I would prefer if we discussed it in this list too. 


Jason Yee

Nov 20, 2020, 11:52:53 PM
to Lefty G. Balogh, Chaos Community
I think the request for a Slack channel was to provide a more real-time conversation (or maybe it is a communication preference).

I appreciate your point about keeping the discussion public. The Chaos Engineering Community Slack is open to everyone and creating a certification channel there will help grow the conversation to include people who are not in this email group (though I have pinned a link to this group, so hopefully they will join us here). At this early stage (no defined working group or plan), I don't think we need to be prescriptive on where discussion happens or which is "primary". And I think those of us who are in both communities can help cross pollinate the ideas from each. 




--
Jason Yee
Director of Advocacy
Gremlin

James Wickett

Nov 21, 2020, 9:40:10 AM
to Matt Coles, Chaos Community
I like the idea of keeping the convo on this list instead of moving to slack.

Just my 2 cents.

Looking forward to how this develops.



On Fri, Nov 20, 2020 at 10:53 PM Matt Coles <mattco...@gmail.com> wrote:

Would like the certification to be non-vendor specific and focused on common patterns, theory and relevant elements from DevOps/SRE. Liked Tony and Jed's ideas on topic starting points.

 

Let's keep the thread here (I don't have access to the Slack group I saw mentioned above). I'm happy to create a collaborative document and set up a few video meetings for our timezones to collate ideas and start a draft, unless someone else is keen to own both of these items?

 

Either way keen to get started :)


Tony Skillman

Nov 21, 2020, 9:59:59 AM
to James Wickett, Matt Coles, Chaos Community
So from Chaos Engineering authors to SRE and DevOps leads at big companies, there's lots of good experience here. I don't think I can contribute much technically (I'm an Automation Engineer at Intel) but would love the opportunity to be a part of this, and could contribute my time as something like a proofreader or in another support role.

Lefty G. Balogh

Nov 21, 2020, 11:40:09 PM
to Chaos Community
I'd like to build on Toby's ides.

Could we start buiding lists of
- Areas of responsibility?
- Skills?
- Tools?
- Environments?

Then we could ask fairly clear questions like: Can a novice, as part of her remit to vover responsibility X, perform skill X with toolX on environmentX .  Or is this totally off not a good idea?

Feel free to add what you and your team do at your company.

Best
Lefty

Lefty G. Balogh

Nov 22, 2020, 3:43:58 AM
to Chaos Community
Sorry for the typos - fat fingers, let me try again.
(I woke up at 5 with this idea in my head, but my fingers were not ready for the typing.)

I'd like to build on Toby's ideas.
Could we start building lists of
- Areas of responsibility?
- Skills?
- Tools?
- Environments?
- Platforms?
Then we could ask fairly clear questions like:

Can a novice, as part of her remit to run a pilot testing the robustness of this cloud service before proper chaos monkeys are unleashed, restart all the pods with kubectl on a bunch of pods running CentOS in Azure?

Or is this totally not a good idea? I've created a gdoc for brainstorming - https://docs.google.com/document/d/1KJBOQWPYlpy6QTNDQrV4MpFzt3Zmq9WldNVW81DEYfQ/edit?usp=sharing
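
To make the shape of the idea concrete: each question is one combination of entries from the lists. Here is a minimal Python sketch. The dictionary keys and the `questions` helper are hypothetical, and the entries are just the ones from the example question above; the real lists would come from the brainstorming doc.

```python
from itertools import product

# Placeholder entries taken from the example question; not an agreed list.
lists = {
    "responsibility": ["run a pilot"],
    "skill": ["restart all the pods"],
    "tool": ["kubectl"],
    "environment": ["a bunch of pods running CentOS"],
    "platform": ["Azure"],
}


def questions(lists):
    """Yield one candidate question per combination of list entries."""
    for resp, skill, tool, env, platform in product(*lists.values()):
        yield (
            f"Can a novice, as part of her remit to {resp}, "
            f"{skill} with {tool} on {env} in {platform}?"
        )


all_questions = list(questions(lists))
print(len(all_questions))
print(all_questions[0])
```

The number of questions is the product of the list lengths, so it grows quickly as entries are added to any list.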

Mirko Ebert

Nov 22, 2020, 4:04:15 AM
to Chaos Community
I would like to add
- Target Group

Abhishek Mitra

Nov 22, 2020, 6:53:59 AM
to Mirko Ebert, Chaos Community
Count me in. I would like to be involved as well.

Jason Yee

Nov 23, 2020, 12:13:35 PM
to Chaos Community
@Michael, You can register for the Chaos Engineering Community slack at gremlin.com/slack. Enter your email on the page and it'll generate the invite for you.

@Mikolaj, I think the in-person call will likely have the same accessibility challenge as the Slack (i.e. not everyone can or wants to join). But I think at this early stage, more conversations (no matter where they occur) are good. We just need to ensure we're cross-pollinating those. Would you mind ensuring that someone takes good notes and posts them back to the group here?



Jason Yee

Nov 23, 2020, 12:46:03 PM
to Chaos Community
@Lefty I think you might be scoping too broadly with your document and it might end up containing literally every technology, cloud service provider and tool.

Looking at certification from an educator/curriculum design perspective, it usually starts with the question, "What should the student/certified person be able to do?", then breaks that down into the tool/skill groupings: covered in the curriculum, prerequisite knowledge, or not required (which indicates something is out of scope, needs to be reworked, or needs discussion). If you start with the tools first (e.g. your example with Kubernetes, Azure and CentOS), you inevitably get an overly broad scope (e.g. now you need Kubernetes, Azure and CentOS knowledge to do Chaos Engineering, which of course is problematic for those who are doing Chaos Engineering on on-prem Windows Server VMs).

So as an example of the skill/knowledge first -> tools/skills second:
  • A certified Chaos Engineer should be able to test a DNS failure in CoreDNS and design/implement mitigation to keep applications functional during a DNS outage.
  • Prerequisite skills/knowledge: What DNS is & how it works
  • Covered: DNS failure modes, identifying them and mitigation strategies/techniques
  • Out of scope: We likely don't need to cover CoreDNS, but maybe we need to cover Kube-dns.
  • Needs discussion: Should Kubernetes be covered and added to the prerequisites? If so, how much Kubernetes should the curriculum cover and how do we determine what specific technologies to cover? If not, can this be simplified to just "be able to test a DNS failure"?
Starting with the skill/knowledge first will help us define scope more easily and quickly.
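
If it helps, the breakdown above could be captured in a small data structure so every proposed competency gets scoped the same way. This is a minimal Python sketch; the `Competency` class, its field names and `is_settled` are hypothetical, not an agreed format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Competency:
    """One 'a certified person should be able to ...' statement and its scoping."""

    statement: str
    prerequisite: List[str] = field(default_factory=list)
    covered: List[str] = field(default_factory=list)
    out_of_scope: List[str] = field(default_factory=list)
    needs_discussion: List[str] = field(default_factory=list)

    def is_settled(self) -> bool:
        """A competency is settled once nothing is left under 'needs discussion'."""
        return not self.needs_discussion


# The DNS example from the list above, entered as data.
dns = Competency(
    statement="Test a DNS failure and design/implement mitigation "
    "to keep applications functional during a DNS outage",
    prerequisite=["What DNS is and how it works"],
    covered=["DNS failure modes", "Identifying them", "Mitigation strategies/techniques"],
    out_of_scope=["CoreDNS specifics"],
    needs_discussion=["Should Kubernetes be covered and added to the prerequisites?"],
)

print(dns.is_settled())
```

Anything that still has entries under `needs_discussion` is a candidate agenda item for the working group.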

mik...@pawlikowski.pl

Nov 23, 2020, 4:05:27 PM
to Chaos Community
Hey all,

I've scheduled a meeting for this Wednesday, 25 Nov, 8pm GMT (the slot with the maximum number of people according to the doodle). I've invited everyone in this thread.

See you there!
Mikolaj


Nina Schiff

Nov 24, 2020, 2:34:47 AM
to Chaos Community
Stepping back a bit, as this might help with deciding content, who is this training for? And what role are they looking to be hired into? And at what sort of companies?

My impression of various certifications is that they're tied to very specific tech in a very specific area (thinking of all the various networking ones here). I.e., after getting this certification you are guaranteed to know the various ways you can configure this family of Cisco switches, etc. What would be the equivalent here?

In my experience with chaos engineering, most people involved in this at the larger tech companies have broader software or SRE backgrounds. At least at the companies I've worked at, my guess would be that having large-scale systems experience would trump a certification like this.

So I guess the question for me, before investing a ton of time here, is what's the value add? Why a certification instead of an online resource? If people pay money to be certified, is there an ROI for them? What are the skills that potential employers know people certified as such have (especially if platform agnostic)?

run2o...@gmail.com

Nov 24, 2020, 2:45:32 AM
to Chaos Community
Dear all,

I like this idea, especially as I have researched chaos engineering and built tooling support as part of my doctoral thesis.
I have also published several papers on the topic, e.g. https://www.researchgate.net/publication/335922038_Security_Chaos_Engineering_for_Cloud_Services

My focus is on applying the CE principles to cyber security, and I could imagine this will be very interesting to security & SRE folks.
I was also a contributing author of the recently released O'Reilly book "Security Chaos Engineering", which is freely available.

I'd like to contribute and be part of this discussion!
@Mikolaj, kindly add me to the invite for the video call.

Thanks.

Best regards,
Kennedy

Don Fanning

Dec 1, 2020, 8:30:56 AM
to run2o...@gmail.com, Chaos Community
I would agree with what others have recommended in making sure that if this becomes a true "study/pay for validation" sort of ordeal, that it be properly focused on the needed skills and would encompass the range of knowledge to do certain sorts of work.

Chaos Engineering, in particular, gets *very* difficult because we're talking about "Resiliency" of systems/networks and most importantly: "the business process" which may not be able to be contained within code or have too many "outside the scope" tasks to be a realistic measure.

Also:  What would this certification do that other certifications don't already cover?  You have:
  • Cisco/Juniper or Network+ for networking
  • CISSP/CEH or Security+ for security 
  • MCSE+Specialization for Windows 
  • RHCE for Linux OS.  
  • Azure/AWS/GCE for cloud.  
  • Then there are CSTE and other Test Engineer certifications.
  • Not to mention that there are serious differences in philosophy in the DevOps movement between CNCF, Kubernetes, and regular virtualization/bare metal with Chef/Ansible. The changing world of Net/SecOps is also still in its infancy, slightly behind ChaosEng, and that's _KEY_ to doing ChaosEng at scale.
Is the community saying there is a testing gap for people who have most/all of these certifications?
Or are we saying that one doesn't need the requisite for today's world?

My vote would be that Chaos Engineering focus on best practices or an awesome list, unless this is about the money (straight up - certs cost time and money).

The gap that I've seen in reviewing resumes and potential candidates isn't knowledge or skills on paper, it's applied knowledge. There isn't enough of it, and people have been fudging it and padding their resumes. Frankly, people list DevOps as a skill without acknowledging that it's a philosophy like Agile or 12 Factors.

Making this into an industry certification is premature and sends the wrong message. It says that test engineering is strong enough to push back on the entire IT industry and insist that its products and methodologies can be channeled into the best practices of Chaos Engineering. Again, the business process side still isn't fitting well into companies all up and down the size ladder. It's like you're recommending ISO compliance when the airplane is still wood, cloth and a motor. That's why practices and movement are the right way to describe what Chaos Engineering is. But unless you are competing with or cheating the big certifications, it's not a good plan for the moment, as the world is changing. SDN alone will rewrite networking. SecOps isn't standardized, and Test Engineering still cannot standardize its own swim lanes. Which scraper? Which framework to use? Will it interop with Selenium? How about code scanning/scheduling and test plan practices? Do failures found result in loss of compliance in another space, and how is it tracked on both sides?

Chaos Engineering, again, is the art of building resilient systems. Is your entire business process resilient at all levels and in all directions? The fact that this is even a topic without scope tells me no. This is about frustration and talent pools at the management level vs. money.

Pay the money, hire the people.

And on that note, feel free to find me on LinkedIn.  I'll be starting my official job search in January.  Till then have a great holiday.

Cheers,
-Don

