Rome2024h2, Rome2025h1, and Rome2025h2 downtime incident report

Filippo Valsorda

Nov 26, 2024, 4:20:44 PM
to Certificate Transparency Policy
Hey all,

The Rome prototype Sunlight instance experienced a little over four hours of downtime today due to a hardware issue. Although Rome2024h2, Rome2025h1, and Rome2025h2 are not production logs, I wanted to share a brief incident report with the community as a case study of single-node service availability.

Details

A hardware issue caused the dedicated server hosting this Sunlight instance to abruptly power off at 15:47:48 UTC. After being notified by Let's Encrypt staff, I investigated and found the machine unreachable. After checking the status page for related incidents, I filed a ticket with the hosting provider, Hetzner Online GmbH.

I received an initial reply asserting the machine was reachable. Indeed, I was able to log in and extract logs showing the machine had powered off twice in the previous three hours.

At this point Sunlight had not come back up, due to a configuration issue in its systemd unit: the service started before the network was available, exited five times in a row, and was marked as failed. The unit specifies "After=network.target" and "Restart=on-failure" and leaves everything else at its default.
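
For reference, this is a minimal sketch of the relevant portion of such a unit, with the behavior of the implicit defaults spelled out in comments; everything beyond the two directives quoted above is an assumption about the layout, not a copy of the actual file.

    [Unit]
    # network.target only orders the unit after the network management
    # stack has been started, not after connectivity is actually available.
    After=network.target

    [Service]
    # Restart=on-failure retries after RestartSec (100 ms by default), but
    # the default start rate limit (StartLimitIntervalSec=10s,
    # StartLimitBurst=5) marks the unit as failed after five attempts in
    # quick succession, which matches the five exits observed here.
    Restart=on-failure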

When I reported this pattern to the hosting provider, they offered to exchange the server while keeping the drives. This Sunlight instance stores the SCT cache and the checkpoint database on a local RAID1 SSD array, so I authorized the exchange. Once the new server booted, Sunlight came up cleanly when started manually.

An md(4) resync is visible in the logs, but it appears to have concluded successfully.
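
For anyone wanting to double-check a resync like this, something along the following lines works; /dev/md0 is an assumed device name, not necessarily the one on this machine.

    # Overall array state and any in-progress resync.
    cat /proc/mdstat

    # Per-array detail; "State : clean" with no failed devices indicates
    # the resync concluded successfully.
    mdadm --detail /dev/md0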

The root hardware issue "likely seemed to have been with the motherboard", according to the hosting provider.

Timeline

2024-11-26 15:47:48 UTC — machine powers off — DOWNTIME STARTS
2024-11-26 18:12 UTC — notification by Let's Encrypt staff
2024-11-26 18:14 UTC — investigation begins
2024-11-26 18:21 UTC — ticket filed with hosting provider
2024-11-26 18:55 UTC — initial hosting provider response
2024-11-26 19:15 UTC — hardware exchange offered
2024-11-26 19:20 UTC — hardware exchange authorized
2024-11-26 19:52 UTC — hardware exchange completed
2024-11-26 20:03:27 UTC — service restored — DOWNTIME ENDS

Follow-up work
  • Correct the systemd configuration to keep retrying indefinitely and to properly wait for the network to be available (a sketch follows this list).
  • Enable external monitoring, as well as Hetzner native monitoring.
  • Consider configuring the software RAID to block writes to unhealthy arrays, as it is plausible the checkpoint database could roll back if disk A were to go offline, a write were then committed to disk B, and the system later recovered from disk A.
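
For the first item, a corrected unit would presumably look something like the sketch below; the specific values, and the reliance on a wait-online service, are my assumptions rather than the configuration that will actually be deployed.

    [Unit]
    # network-online.target, together with a matching Wants=, waits for
    # actual connectivity (provided a wait-online service such as
    # systemd-networkd-wait-online.service is enabled), unlike network.target.
    Wants=network-online.target
    After=network-online.target
    # Disable the start rate limit so systemd never gives up on the
    # service and marks it failed.
    StartLimitIntervalSec=0

    [Service]
    # Retry on any exit, with a small back-off between attempts.
    Restart=always
    RestartSec=5s
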
Analysis

Over two hours elapsed from the beginning of the downtime to my being notified. This could have been significantly reduced with better monitoring. (Keep in mind this is a prototyping instance.) Almost an hour elapsed from the filing of the ticket to the offer to exchange the hardware. This could have been significantly reduced by calling the support phone line, which I chose not to do because I was in a meeting and this is not a production log.

Overall, this single-node system experienced a little over four hours of downtime due to a catastrophic hardware failure, which could reasonably have been shortened to about one hour had it been treated as a production system. Given that the 99% SLO budget is 21.5 hours of downtime every three months, I believe this shows the 99% uptime target is manageable even in the face of catastrophic hardware issues.

(Note that an unrelated network maintenance window of at most two hours is scheduled for later in the week. This is the first such maintenance since Rome was migrated to Hetzner in August.)

Alla prossima,
Filippo