Postmortem on the EKS Prow build cluster incident from 2024-02-21


Marko Mudrinić

Feb 27, 2024, 4:09:45 PM
to d...@kubernetes.io, kubernetes-sig-k8s-infra

Dear Kubernetes Community,


We want to inform you about the incident that affected the EKS Prow build cluster on 2024-02-21. As a result of this incident:

  • CI (Prow) jobs scheduled on the EKS Prow build cluster were unable to run from 2024-02-21 14:30 UTC until 2024-02-21 17:17 UTC (example: [1])

  • Monitoring data from the EKS Prow build cluster ([2]) recorded before the incident (2024-02-21 14:30 UTC) is irrecoverably lost


SIG K8s Infra strives to provide the best possible experience for all contributors, so we take every incident seriously. We wrote a postmortem that describes the root cause of the incident, how we resolved it, what we learned from it, and what we are going to change in the future.


The postmortem is available at the following link: https://docs.google.com/document/d/1PMgbClhYwIls8NdEw2jE80OqHmnD-Ty7TermtUe4MTE/edit?usp=sharing


We’re planning to discuss it at the upcoming SIG K8s Infra meeting, scheduled for tomorrow (Wednesday) at 21:00 UTC.


Any input and feedback, especially on the next steps and action items, is much appreciated!


Thank you for your understanding and support!


Kind regards,

Marko Mudrinić

on behalf of SIG K8s Infra


[1]: https://kubernetes.slack.com/archives/CCK68P2Q2/p1708526917330829

[2]: https://monitoring-eks.prow.k8s.io/
