sustained Nimbus outage


Marwan Fayed

Nov 11, 2021, 6:53:38 AM
to Certificate Transparency Policy
ct-policy@

Cloudflare is aware of on-going issues with Nimbus. We're working to restore service, and promise a post-mortem.

Best,
--marwan

Devon O'Brien

Nov 12, 2021, 3:26:25 PM
to Certificate Transparency Policy, mar...@cloudflare.com
Hi Marwan,

Thanks for giving ct-policy@ a heads-up, and #HugOps to those working on restoring service. Chrome's SCT auditing infrastructure initially detected auditing failures for Nimbus 2022 on November 11, but as of this morning we have started observing additional SCT auditing failures for Nimbus 2021. We suspect that this issue is affecting all Nimbus logs, but that Nimbus 2023 isn't yet firing alerts due to the low volume of certificates expiring in that log's expiry range. Additionally, our CT log compliance monitoring has noted a sustained (>8hr) inability to add test certificates to Nimbus logs.

Please let us know if there's an ETA (even approximate will do) for a return to normal operations before digging into a post-mortem.

-Devon

Marwan Fayed

Nov 12, 2021, 4:07:57 PM
to Certificate Transparency Policy, Devon O'Brien, Marwan Fayed
Hi Devon, thanks for checking in.

All, we are on a path to resuming service, with an ETA of about 12-18 hours. We want to be sure things are right, rather than rushed.

Thanks for your patience. We'll be sure to announce when service restoration is complete. This will be an interesting post-mortem.

--marwan


Marwan Fayed

Nov 13, 2021, 12:58:53 PM
to Certificate Transparency Policy, Marwan Fayed, Devon O'Brien
All,

We are pleased to report that Nimbus is fully restored. Some may have noticed logs coming online over the last several hours. We wanted to wait to announce until the last log (Nimbus2019) came online, which was about 1 hour ago.

An enormous thanks for your patience. We'll work on the post-mortem over the next few days and post it as soon as it's ready.

--marwan

Marwan Fayed

Nov 24, 2021, 11:07:37 AM
to Certificate Transparency Policy, Marwan Fayed
Hello CT-Policy,

A quick update -- the post-mortem should appear early-to-mid next week, after the USA Thanksgiving holiday.

Happy Thanksgiving to group members in the USA.

--marwan (also on behalf of the Cloudflare community)

Nick Sullivan

Nov 29, 2021, 10:38:24 AM
to Certificate Transparency Policy, Marwan Fayed
Hello ct-policy,

Here's the full post-mortem on the Nimbus outage.


# Postmortem: Nimbus service outage due to data loss.
Note: All times are UTC.


## Outage Duration
First alert fired: 2021-11-10 20:44 (some 404s may have been returned before this point)
Total duration: approx 71 hours


## Summary Impact
Note that Cloudflare keeps multiple log shards. The signing service for all active logs stopped at 2021-11-10 21:07.

Nimbus2022: During the first 5 hours, (pre-)certificates were accepted and stored in the receive queue until the queue filled, meaning that approximately 500K (pre-)certs received valid SCTs but could not be written to the logs. Certs submitted during the remaining 66 hours received '503 Service Unavailable' responses.

Other Active Logs: Cloudflare served '503 Service Unavailable' responses to pre-certificate submissions for up to 66 hours, depending on the log.

Frozen Logs: For the full service outage duration, `404 not found` errors were returned for any request associated with data that was lost or corrupted.
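The Nimbus2022 failure mode above can be illustrated as a bounded receive queue: a submission gets an SCT and is queued for sequencing, and once the queue fills, further submissions are refused. This is a hypothetical sketch, not Cloudflare's code; the capacity and SCT format are made up.

```python
# Illustrative sketch (not Cloudflare's code) of the Nimbus2022 failure
# mode: submissions receive an SCT and enter a bounded receive queue
# awaiting sequencing; once the queue fills, the log answers 503 instead.
from collections import deque

class ReceiveQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()  # accepted entries not yet written to the log

    def submit(self, precert: bytes):
        """Return (http_status, sct_or_none) for one submission."""
        if len(self.pending) >= self.capacity:
            return 503, None                # queue full: refuse submission
        self.pending.append(precert)
        return 200, b"sct:" + precert       # SCT issued before sequencing
```

With the signing service stopped, nothing drains `pending`, so entries that already hold valid SCTs sit unwritten -- the roughly 500K certificates described above.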


## Background and Description
Cloudflare maintains CT logs on two backend technologies: a distributed key-value database (db) in HBase that stores a hash-to-index mapping, and a distributed publish-subscribe data store in Kafka, supported by an auxiliary Postgres database, that stores all certificates in sequence. HBase's storage layer, HDFS, is also shared by other services (Cloudflare's IPFS Gateway service and CT Monitor). The db suffered a major data loss during execution of a data-migration SOP.
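As a rough illustration (hypothetical class and field names, not the actual schema), the two-store layout amounts to a hash-to-index map alongside an append-only sequence:

```python
# Hypothetical sketch of the two-store layout described above: a
# key-value table mapping certificate hash -> sequence index (the HBase
# role), and an append-only ordered store of the certificates
# themselves (the Kafka role).
import hashlib

class LogBackend:
    def __init__(self):
        self.hash_to_index = {}  # stands in for the HBase hash->index table
        self.sequence = []       # stands in for the Kafka ordered store

    def append(self, cert_der: bytes) -> int:
        """Assign the next sequence index and record the entry in both stores."""
        h = hashlib.sha256(cert_der).digest()
        if h in self.hash_to_index:          # duplicate submission
            return self.hash_to_index[h]
        index = len(self.sequence)
        self.sequence.append(cert_der)
        self.hash_to_index[h] = index
        return index

    def lookup(self, cert_der: bytes) -> int:
        """Resolve a certificate to its sequence index via the mapping."""
        return self.hash_to_index[hashlib.sha256(cert_der).digest()]
```

Losing either store while the other survives leaves the log unable to serve consistent responses, which is why corruption in the db took entire logs down.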

The cause was partly outdated or unclear documentation, which is itself partially explained (but not justified) by the HBase backend being deprecated internally, reaching end of support life, and being scheduled for migration to another db backend. The challenges were exacerbated by knowledge gaps: the incident responders and system maintainers are no longer the original builders of the infrastructure.

The end-of-life status also contributed to the duration of the outage, in that little attention and few optimizations had been dedicated to the surrounding tools, monitoring, and restoration mechanisms. In addition, there had been no dry-run or test restoration in two years, in which time Nimbus data had grown substantially in size.

During the first 12 hours, other Cloudflare services backed by HBase were restored by moving them to a new db; this was comparatively easy because those services are stateless. It is not something we could do with Nimbus. During the same period, attempts to repair the corruption ran into resource limits.

An additional 12 hours were spent on cycles of waiting, watching, and diagnosing failed attempts to reconstruct the db. The failures were caused by older serial-execution processes and tools that may have been suitable for Nimbus' initial size, but did not scale with the service. Wait times were exacerbated, in part, because the collection of all logs was stored in a single db table.


## Resolution
Eventually, rather than pursue recovery, we realised that CT log services could be reconstructed faster with a complete redesign of the db and associated restoration tools. This required two streams of work. First, each log was rebuilt with its own set of tables, as well as dedicated server and signer processes; doing so isolates the impact of future corruptions or failures to the affected log, and reduces the time to restore.

Second, the restoration tools were rewritten to execute in parallel, and optimized to fully consume the high-capacity I/O that the serial process had under-utilized.
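The serial-to-parallel rewrite can be sketched with a worker pool. `restore_shard` below is a placeholder; the real tools operate on HBase/Kafka data, not in-memory values.

```python
# Sketch of the parallel restoration approach: restore independent
# shards concurrently so high-capacity I/O stays busy, instead of a
# serial loop that leaves it idle. restore_shard is a placeholder for
# replaying one shard from the sequence store into the rebuilt tables.
from concurrent.futures import ThreadPoolExecutor

def restore_shard(shard_id: int) -> int:
    # Placeholder: would read one shard and rebuild its rows;
    # returns the number of rows restored.
    return 1000 + shard_id

def restore_all(shard_ids, workers=16):
    # I/O-bound restores overlap across threads; results keep input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(restore_shard, shard_ids))
```

The same structure applies whether the unit of parallelism is a shard, a table, or a log; the point is that wall-clock time drops toward the slowest single unit rather than the sum of all of them.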

The bulk of downtime, between 36 and 48 hours, was spent waiting for reconstruction and verification of the new tables to ensure integrity of the modified infrastructure.


## Lessons Learned
* A good disaster recovery plan should be a requirement, and be regularly tested.
* During data migration, to preserve the possibility of rollback,
  * be sure that all data has migrated, and
  * do not wipe old nodes until the new nodes are verified operational.
* Internal metrics have been improved to better indicate the criticality of services or infrastructure that might otherwise be perceived as non-critical.
* During a rush to meet an unanticipated deadline, in this case overshadowed by the 24-hour MMD, ask, and accept as early as possible, whether the deadline can be met -- then respond quickly and accordingly.
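The migration lesson can be made concrete as a verification gate run before any wipe. The names below are illustrative; a real check would compare HBase regions and their checksums, not Python lists.

```python
# Illustrative rollback-preserving migration check: old nodes may only
# be wiped once the new nodes hold a verified, complete copy of the data.
import hashlib

def content_digest(rows) -> str:
    # Order-insensitive digest over the row set.
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(row)
    return h.hexdigest()

def safe_to_wipe_old(old_rows, new_rows) -> bool:
    """True only when the new copy matches the old in count and content."""
    return (len(new_rows) == len(old_rows)
            and content_digest(new_rows) == content_digest(old_rows))
```

Until this returns true, the old nodes remain the rollback path; wiping them first, as happened here, removes the only fallback.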


## Things that went well
The root causes and trigger may have been human error, but the human response was exceptional:

* Compute, storage, and additional monitoring were quickly added or scaled to an otherwise deprecated infrastructure.
* After 12 hours of diagnosis and failed attempts at recovery, the team redesigned and initiated deployment of the new architecture within 6 hours.


## Approximate Timeline of Milestones (UTC)
2021-11-10:
   20:32  Storage nodes are wiped and rebooted as part of data migration
   21:05  Signer operations fail, and soon after the process stops entirely
   21:15  <ESCALATION TIME> restoration of IPFS service is prioritized
   23:56  IPFS restored; attempts to repair CT logs begins

2021-11-11:
   01:08  <Unknown at the time> Nimbus2022 queue is full; approx 500K certs affected.
   05:22  Allocation begins for greater physical resources for the backend db.
   07:22  Recovery scripts initiated.
   07:47  Attempt to patch the service to serve logs directly from Kafka; this fails without the valid signer head, needed for integrity.
   12:11  Recovery tools fail; their use of memory has not scaled with the size of logs.
   12:15  Planning begins to re-architect and rebuild the backend to restore full service asap.
   22:44  Decision to reconfigure separate server and signer processes for each log, in lieu of the single server and signer processes attached to all logs.

2021-11-12:
   08:15  (Time approximate) Reconstruction progress is slower than anticipated. Work begins to optimize.
   21:00  (Time approximate) By this point we have learned how to accelerate db reconstruction by a factor of 18-24x.
   21:50  nimbus2017 and nimbus2023 are back online
   
2021-11-13:
   04:38  nimbus2022 reconstruction is complete but, since this is the most affected log, additional checks and reviews begin to ensure integrity, and that the service will resume without incident.
   13:41  nimbus2022 is back online
   14:15  nimbus2018 and nimbus2020 are back online
   17:21  nimbus2019 is back online -- this is the last remaining log
   17:51  Outage declared to have ended.

--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ct-policy+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/ct-policy/7333dbf9-fa20-433d-af56-e5acea8aa5ccn%40chromium.org.

Andrew Ayer

Dec 2, 2021, 12:44:37 PM
to Nick Sullivan, Certificate Transparency Policy, Marwan Fayed
On Mon, 29 Nov 2021 10:38:04 -0500
"'Nick Sullivan' via Certificate Transparency Policy"
<ct-p...@chromium.org> wrote:

> Nimbus2022: During the first 5 hours, (pre-)certificates were
> accepted and stored in the receive queue until the queue filled,
> meaning that approximately 500K (pre-)certs received valid SCTs but
> could not be written to the logs.

Could you clarify what the impact of the above is? In particular: for
every SCT issued, has the corresponding TimestampedEntry been
incorporated into the log?

Regards,
Andrew

Nick Sullivan

Dec 3, 2021, 9:05:00 AM
to Andrew Ayer, Certificate Transparency Policy, Marwan Fayed
Hi Andrew,

Yes, we have incorporated every submission for which we have issued an SCT. The integrity of the log has been maintained throughout this incident.
For Nimbus2022, once we hit the 500k limit, we stopped accepting new submissions.
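An external monitor can confirm this kind of claim with an RFC 6962 inclusion proof: recompute the tree root from the entry's leaf hash and the audit path returned by `get-proof-by-hash`, and compare it against the signed tree head. A minimal sketch of that check (inputs here are constructed by hand, not fetched from a log):

```python
# Sketch of RFC 6962/9162 inclusion-proof verification: the check
# behind "every SCT's entry was incorporated". Recompute the Merkle
# root from the leaf hash and the audit path, and compare to the root
# in the signed tree head. STH signature checks are omitted.
import hashlib

def leaf_hash(entry: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + entry).digest()   # 0x00 leaf prefix

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()  # 0x01 node prefix

def verify_inclusion(entry, index, tree_size, proof, root) -> bool:
    if index >= tree_size:
        return False
    fn, sn = index, tree_size - 1
    r = leaf_hash(entry)
    for p in proof:
        if sn == 0:
            return False
        if fn % 2 == 1 or fn == sn:
            r = node_hash(p, r)
            if fn % 2 == 0:                  # right-most node case:
                while fn % 2 == 0 and fn != 0:
                    fn >>= 1
                    sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root
```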

Nick

Devon O'Brien

Dec 8, 2021, 11:32:07 AM
to Certificate Transparency Policy, Nick Sullivan, Certificate Transparency Policy, mar...@cloudflare.com, Andrew Ayer

Hi Nick and Marwan,

Thanks for the additional details on how this incident occurred and the steps Cloudflare took to restore normal operations. We're glad to hear that the integrity of the Nimbus logs has been maintained and that no SCTs were minted that have not yet been included. As a result, this incident was restricted to a period of downtime that impacted Nimbus' availability (though less than you might expect given how current CT log compliance monitoring calculates uptime percentages) and MMD violations for SCTs issued during the incident.

From a Chrome perspective, given that SCT Auditing is now deployed and Cloudflare provided immediate notice of an issue with their CT logs, we are not inclined to Retire logs for outage-related short-term MMD violations in isolation. Should we detect any unincorporated SCTs as a result of this incident, we will revisit this topic.

-Devon

Nick Sullivan

Dec 8, 2021, 12:08:51 PM
to Devon O'Brien, Certificate Transparency Policy, mar...@cloudflare.com, Andrew Ayer
Hi Devon,

Acknowledged, and thanks for the update.

Nick