Hello ct-policy,
Here's the full post-mortem on the Nimbus outage.
# Postmortem: Nimbus service outage due to data loss
Note: All times are UTC.
## Outage Duration
First alert fired: 2021-11-10 20:44 (some 404s may have been returned before this point)
Total duration: approx 71 hours
## Impact Summary
Note that Cloudflare keeps multiple log shards. The signing service for all active logs stopped at 2021-11-10 21:07.
Nimbus2022: During the first 5 hours, (pre-)certificates were accepted and stored in the receive queue until the queue filled, meaning that approximately 500K (pre-)certs received valid SCTs but could not be written to the logs. Submissions during the remaining 66 hours received `503 Service Unavailable` errors in response.
Other Active Logs: Cloudflare returned `503 Service Unavailable` errors in response to pre-certificate submissions for up to 66 hours, depending on the log.
Frozen Logs: For the full service outage duration, `404 not found` errors were returned for any request associated with data that was lost or corrupted.
## Background and Description
Cloudflare maintains CT logs on two different backend technologies: a distributed key-value database (db) in HBase, used to store a hash-to-index mapping, and a distributed publish-subscribe data store in Kafka, supported by an auxiliary Postgres database, used to store all certificates in sequence. HBase's storage layer, HDFS, is also shared by other services (Cloudflare's IPFS Gateway service and CT Monitor). The db suffered a major data loss during execution of a data migration SOP.
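To make the two-store layout concrete, here is a minimal in-memory sketch. The names (`index_by_hash`, `cert_log`, `append_cert`) are hypothetical stand-ins, not Nimbus internals: the dict plays the role of the HBase hash-to-index mapping, and the list plays the role of the Kafka-backed sequential certificate store.

```python
import hashlib

# Hypothetical stand-ins for the two backends:
# - index_by_hash models the HBase hash-to-index mapping
# - cert_log models the Kafka-backed store of certificates in sequence
index_by_hash: dict = {}
cert_log: list = []

def append_cert(cert: bytes) -> int:
    """Assign the next sequence number and record both mappings."""
    leaf_hash = hashlib.sha256(cert).digest()
    index = len(cert_log)
    cert_log.append(cert)            # certificates kept strictly in sequence
    index_by_hash[leaf_hash] = index
    return index

def lookup(leaf_hash: bytes):
    """Resolve a hash to its index, then fetch the certificate by position."""
    index = index_by_hash.get(leaf_hash)
    return None if index is None else cert_log[index]
```

The split matters for the outage: losing the hash-to-index mapping breaks lookups even when the sequential certificate data itself survives.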
The cause is at least partially attributable to outdated or unclear documentation, which is itself partially explained (but not justified) by the HBase backend being deprecated internally, reaching end of support, and being scheduled for migration to another db backend. The challenges were exacerbated by knowledge gaps: the incident responders and system maintainers are no longer the original builders of the infrastructure.
The end-of-life status also contributed to the duration of the outage, in that little attention and few optimizations had been dedicated to the surrounding tools, monitoring, and restoration mechanisms. In addition, there had been no dry-run or test restoration in two years, in which time Nimbus data has grown substantially in size.
During the first 12 hours, other Cloudflare services backed by HBase were restored by moving to a new db; this was comparatively easy because those services are stateless, which is not the case for Nimbus. During the same period, attempts to repair corruption were met with resource limits.
An additional 12 hours were exhausted on cycles of waiting, watching, and diagnosing failed attempts to reconstruct the db. The failures were caused by older serial-execution processes and tools that may have been suitable for Nimbus' initial size, but did not scale with the service. Wait times were exacerbated, in part, because all logs were stored in a single db table.
## Resolution
Eventually, rather than pursue recovery, we realised that CT Log services could be reconstructed faster with a complete redesign of the db and associated restoration tools. This required two streams of work. First, each log was rebuilt with its own set of tables, as well as dedicated server and signer processes; doing so would isolate the impact of future corruptions or failures to the affected log, as well as reduce the time to restore.
Second, restoration tools were rewritten and reimplemented to execute in parallel, as well as optimized to fully consume the high-capacity I/O that was under-utilized by the serial process.
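The second work stream can be sketched as replacing a serial restore loop with a worker pool. This is an illustrative assumption about the shape of the change, not the actual tooling; `restore_shard` and `restore_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def restore_shard(shard_id: int) -> int:
    # Placeholder for replaying one shard's entries back into the db;
    # returns the number of entries restored (hypothetical workload).
    return shard_id * 1000

def restore_all(shard_ids, workers: int = 8) -> int:
    # Restore shards concurrently rather than one after another, so the
    # high-capacity I/O stays busy instead of idling behind a serial loop.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(restore_shard, shard_ids))
```

With I/O-bound restoration work, total wall-clock time approaches that of the slowest shard batch rather than the sum of all shards, which is the kind of speedup the rewrite targeted.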
The bulk of downtime, between 36 and 48 hours, was spent waiting for reconstruction and verification of the new tables to ensure integrity of the modified infrastructure.
## Lessons Learned
* A good disaster recovery plan should be a requirement, and be regularly tested.
* During data migration, to preserve the possibility of rollback:
  * be sure that all data has migrated, and
  * do not wipe old nodes until the new nodes are verified operational.
* Internal metrics have been improved to better indicate the criticality of services or infrastructure that may otherwise be perceived as non-critical.
* During a rush to meet an unanticipated deadline, in this case overshadowed by the 24 hr MMD, determine and accept as early as possible whether the deadline can be met -- then respond quickly and accordingly.
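The migration precautions above can be sketched as a pre-wipe verification gate. The helpers below (`digest`, `safe_to_wipe`) are hypothetical and assume a simple order-sensitive checksum comparison between old and new nodes; a real migration would verify at the storage layer.

```python
import hashlib

def digest(rows) -> str:
    # Order-sensitive digest over a node's rows (assumed verification scheme).
    h = hashlib.sha256()
    for row in rows:
        h.update(row)
    return h.hexdigest()

def safe_to_wipe(old_rows, new_rows) -> bool:
    # Only allow wiping the old node once every row has migrated and both
    # sides hash identically, preserving the possibility of rollback.
    return len(new_rows) == len(old_rows) and digest(new_rows) == digest(old_rows)
```

Gating the destructive step on an explicit check like this is what keeps a failed migration recoverable.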
## Things that went well
The root causes and trigger may have been human error, but the human response was exceptional:
* Compute, storage, and additional monitoring were quickly added or scaled to an otherwise deprecated infrastructure.
* After 12 hours of diagnosis and failed attempts at recovery, the team redesigned and initiated deployment of the new architecture within 6 hours.
## Approximate Timeline of Milestones (UTC)
2021-11-10:
20:32 Storage nodes are wiped and rebooted as part of data migration
21:05 Signer operations fail, and soon after the process stops entirely
21:15 <ESCALATION TIME> restoration of IPFS service is prioritized
23:56 IPFS restored; attempts to repair CT logs begins
2021-11-11:
01:08 <Unknown at the time> Nimbus2022 queue is full; approx 500K certs affected.
05:22 Allocation begins for greater physical resources for the backend db.
07:22 Recovery scripts initiated.
07:47 Attempt to patch the service to serve logs directly from Kafka; this fails without a valid signed tree head, which is needed for integrity.
12:11 Recovery tools fail; their use of memory has not scaled with the size of logs.
12:15 Planning begins to re-architect and rebuild the backend to restore full service asap.
22:44 Decision to reconfigure separate server and signer processes for each log, in lieu of the single server and signer processes attached to all logs.
2021-11-12:
08:15 (Time approximate) Reconstruction progress is slower than anticipated. Work begins to optimize.
21:00 (Time approximate) By this point we have learned how to accelerate db reconstruction by a factor of 18-24x.
21:50 nimbus2017 and nimbus2023 are back online
2021-11-13:
04:38 nimbus2022 reconstruction is complete, but since this is the most affected log, additional checks and reviews begin to ensure integrity and that the service will resume without incident.
13:41 nimbus2022 is back online
14:15 nimbus2018, nimbus2020 are back online
17:21 nimbus2019 is back online -- this is the last remaining log
17:51 Outage declared to have ended.