Do any of you run AZ failure chaos experiments against your systems?

Chethan C

Oct 6, 2024, 12:59:12 PM
to Chaos Community

I have a few questions about them:

  • What are your opinions on them?

  • How do you simulate them?

  • What recommendations do you give your product teams to make their services/products resilient to AZ failures?

Context: We are beginning to run AZ failure experiments on our production systems, and we are seeing a lot of 502s around the time the EC2 instances are failed over. (The experiment removes an EC2 instance in an AZ and blocks the ASG from deploying another instance in the same region.) In the period between the instance being terminated and the ALB noticing that it is no longer working, traffic keeps getting routed to that same instance, resulting in spikes of 502s.

We would like to reduce these 502s, or ideally eliminate them entirely if there are options for that.
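
To put a rough number on the window described above: a minimal sketch, assuming the ALB only stops routing to a target after a configured number of consecutive failed health checks. The function names and the example values (30 s interval, threshold of 2, 5 s timeout, 10 requests/s per instance) are illustrative assumptions, not measurements from our setup.

```python
def detection_window_seconds(interval_s: float,
                             unhealthy_threshold: int,
                             timeout_s: float = 0.0) -> float:
    """Rough upper bound on how long a terminated target can keep receiving traffic.

    Assumes the instance dies right after a passing health check, so the load
    balancer needs `unhealthy_threshold` further checks, spaced `interval_s`
    apart, and the last one may take up to `timeout_s` to fail.
    """
    return interval_s * unhealthy_threshold + timeout_s


def requests_at_risk(rps_per_target: float, window_s: float) -> float:
    """Requests routed to the dead target during the window (potential 502s)."""
    return rps_per_target * window_s


# Illustrative values only (hypothetical, not taken from our environment).
window = detection_window_seconds(interval_s=30, unhealthy_threshold=2, timeout_s=5)
at_risk = requests_at_risk(rps_per_target=10, window_s=window)
print(f"window ~{window:.0f}s, ~{at_risk:.0f} requests at risk")
```

Under these assumptions, a shorter health check interval or a lower unhealthy threshold shrinks the window in which 502s can occur, but does not bring it to zero.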
