The Chrome CT Team notified us that Elephant2026h1 had fallen below 99% availability, with large spikes in HTTP 503 responses observed between approximately 2026-03-23 22:18 and 2026-03-24 13:54 UTC.
Our Cloudflare dashboard shows spikes of 503s (orange) and 504s (blue) during that period:
Root Cause Analysis: The Proxmox cluster in our Manchester, UK datacentre was running at full capacity. After some hours of degraded performance due to the high load, one of the hosts crashed and rebooted at about 2026-03-24 07:00 UTC.
Due to a networking issue, that host could not be accessed via its standard management interface. Whilst troubleshooting that problem, our Ops team started the affected VMs on the remaining nodes, which were already at peak capacity. Soon after the networking issue had been resolved, a second host crashed due to the now extremely high load.
After this second crash, the team proceeded more cautiously, waiting for confirmation that all of the nodes were up and running before restarting the affected VMs. A number of VMs were migrated between hosts to manually balance the load across the pool.
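The failure mode described above (restarting a crashed host's VMs on nodes that were already at peak capacity) can be illustrated with a simple headroom check. The following is a minimal sketch only, not our actual tooling: the `Host` type, the `can_absorb` function, the 80% utilisation ceiling and all of the figures are hypothetical, and load is expressed in arbitrary units because this report does not state which resource was exhausted.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    capacity: float  # total usable resource on the host (arbitrary units, hypothetical)
    load: float      # resource currently consumed by the VMs on that host


def can_absorb(hosts, failed, max_utilisation=0.8):
    """Return True if the surviving hosts could take over the failed host's
    load while staying at or below max_utilisation in aggregate.

    This is an aggregate check only; it ignores per-host placement limits.
    """
    survivors = [h for h in hosts if h.name != failed]
    displaced = sum(h.load for h in hosts if h.name == failed)
    total_capacity = sum(h.capacity for h in survivors)
    total_load = sum(h.load for h in survivors) + displaced
    return total_capacity > 0 and total_load <= max_utilisation * total_capacity


# A pool already running close to full capacity (hypothetical figures).
full_pool = [Host("node1", 256, 250), Host("node2", 256, 250), Host("node3", 256, 250)]

# A pool with spare headroom (hypothetical figures).
roomy_pool = [Host(f"node{i}", 256, 150) for i in range(1, 5)]

print(can_absorb(full_pool, "node1"))
print(can_absorb(roomy_pool, "node1"))
```

In this sketch the near-full pool fails the check, because the displaced load cannot fit on the two survivors, whereas the pool with headroom passes. A check of this kind, run before restarting VMs, would flag the situation in which restarting them risks overloading the remaining nodes.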
Action Items: We intend to boost capacity by deploying more Proxmox nodes by mid-May, with a couple of those going online within the next week.