Dear Kubernetes Community,
We want to inform you about an incident that affected the EKS Prow build cluster on 2024-02-21. As a result of this incident:
- CI (Prow) jobs that run on the EKS Prow build cluster were unable to run from 2024-02-21 14:30 UTC until 2024-02-21 17:17 UTC (example: [1])
- Monitoring data from the EKS Prow build cluster ([2]) collected prior to the incident (2024-02-21 14:30 UTC) is irrecoverably lost
SIG K8s Infra strives to provide the best possible experience for all contributors, and we take every incident seriously. We wrote a postmortem that describes the root cause of the incident, how we resolved it, what we learned from it, and what we are going to change in the future.
The postmortem is available at the following link: https://docs.google.com/document/d/1PMgbClhYwIls8NdEw2jE80OqHmnD-Ty7TermtUe4MTE/edit?usp=sharing
We’re planning to discuss it at the upcoming SIG K8s Infra meeting, scheduled for tomorrow (Wednesday) at 21:00 UTC.
Any input and feedback, especially on the next steps and action items, is very much appreciated!
Thank you for your understanding and support!
Kind regards,
Marko Mudrinić
on behalf of SIG K8s Infra
[1]: https://kubernetes.slack.com/archives/CCK68P2Q2/p1708526917330829
[2]: https://monitoring-eks.prow.k8s.io/