DigiCert working to address log issues


Chuck Blevins

Mar 25, 2025, 11:57:39 AM
to Certificate Transparency Policy
We are aware of slowness and backlog issues on our logs over the past few days.
Engineering continues to work on addressing and resolving them.
I'll share findings and remediation steps as soon as we have a solution.

Cheers. 

Joe DeBlasio

Mar 27, 2025, 1:14:15 PM
to Chuck Blevins, Certificate Transparency Operations, Certificate Transparency Policy
Hi Chuck,

Are there any updates available on this issue?

We are now regularly getting SCT auditing violation reports from both Sphinx2025h1 and Wyvern2025h1. We request that these logs stop accepting submissions of certificates until the ongoing issues are resolved so as to limit the number of certificates negatively impacted.

Thanks,
Joe, on behalf of the Chrome CT team


Chuck Blevins

Apr 1, 2025, 12:01:33 AM
to Certificate Transparency Policy, Joe DeBlasio, Certificate Transparency Policy, Chuck Blevins, Certificate Transparency Operations

DigiCert's Nessie 2025 shard began logging database errors on March 27th due to an overloaded Cassandra cluster. Investigation found that entries accepted into the database do appear to be missing from the log tree. We recommend this log be retired. We will provide an RCA by end of day on April 1, 2025. We are still investigating, but the root cause appears to be simply the volume of certificates being logged.

 

On a similar note, our Yeti2025 log is still running and uses the same code base. We are currently at 1.25b entries, which is more than we've ever tested for with that set of code (we tested up to a billion). We believe this log may eventually fail in a similar way to Nessie. We recommend this log be shut down as well, given the issue of scale. However, we would like direction from the community on whether we should shut it down and the appropriate date for a shutdown.

 

The remaining Trillian-based Sphinx and Wyvern shards are also running. Our troubleshooting of the Trillian-based logs has shown that someone has been using the add-chain endpoint heavily to add certificates to the retiring Wyvern2025h1 and Sphinx2025h1 logs. We did not have rate limits on those logs, which led to rapid growth in log size and a failure in the logs' ability to perform. We will be adding a rate limit (TBD on level and when we will add it) to the add-chain endpoint. This will prevent too many non-precertificate submissions from being logged at once. We will provide an RCA on the Sphinx and Wyvern shards tomorrow with more information about the rate limiting plan and how we will ensure the newer shards can meet the scale requirements. Please let us know what questions or suggestions the community has.
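
For illustration only (the actual limit level and rollout are still TBD, as noted above), a token-bucket limit placed in front of the add-chain endpoint might look like the following Go sketch. It assumes a standard net/http front end; the path matching, limiter placement, and numbers are placeholders, not our final implementation:

```go
package middleware

import (
	"net/http"
	"strings"

	"golang.org/x/time/rate"
)

// AddChainLimiter wraps an existing CT front-end handler and applies a global
// token-bucket limit to add-chain submissions only; add-pre-chain and read
// endpoints pass through untouched. The rate and burst values are
// illustrative placeholders, not a committed configuration.
func AddChainLimiter(next http.Handler) http.Handler {
	limiter := rate.NewLimiter(rate.Limit(50), 100) // e.g. 50 requests/second, burst of 100
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasSuffix(r.URL.Path, "/ct/v1/add-chain") && !limiter.Allow() {
			http.Error(w, "add-chain rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```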

 

-- 

Chuck Blevins

Director of Product Management, Public PKI Services


 

Andrew Ayer

Apr 1, 2025, 7:41:57 AM
to Chuck Blevins, Certificate Transparency Policy, Joe DeBlasio, Certificate Transparency Operations
On Mon, 31 Mar 2025 21:01:33 -0700 (PDT)
Chuck Blevins <crb...@gmail.com> wrote:

> On a similar note, our Yeti2025 log is still running and uses the
> code base. We are currently at 1.25b entries, which is more than
> we've ever tested for with that set of code (we tested up to a
> billion). We believe this log may eventually fail similar to Nessie.
> We recommend this log be shut down as well given the issue of scale.
> However, we would like direction from the community on whether we
> should shut it down and the appropriate date for a shut down.

I agree it would be a good idea to sunset this log. The least disruptive way to do that would be to transition it to ReadOnly and continue operating it until all the certificates in it have expired. This would allow existing SCTs to continue satisfying the qualified-at-time-of-check requirement, avoiding the possibility of any certificate validation failures.

Regards,
Andrew

Joe DeBlasio

Apr 1, 2025, 3:39:15 PM
to Andrew Ayer, Chuck Blevins, Certificate Transparency Policy, Certificate Transparency Operations
Thanks for the update, Chuck.

I'll prepare a retirement announcement for Nessie2025 for Chrome. As before, please ensure the log has stopped accepting new certificates as soon as possible to minimize risk.

I also agree that it's wise to sunset Yeti2025 soon. However, doing so immediately would leave no DigiCert logging capacity for certificates expiring in the first half of 2025. While the volume of newly-issued certificates expiring in 2025H1 is about to go down significantly, that's still a loss. Standing up additional/replacement log shards cannot address this loss since compliance monitoring and rollout take around 100 days.

Do you have any additional indications of Yeti2025's health that might inform how imminent failure is? Ideally, we'd aim for a retirement of July 1, but that's obviously predicated on Yeti2025 retaining its integrity until then.

(No matter what, we encourage log operators to keep logs running until the end of their expiry window whenever possible, regardless of Chrome state or integrity of the log.)

Joe

Jeremy Rowley

Apr 3, 2025, 2:05:58 PM
to Joe DeBlasio, Andrew Ayer, Chuck Blevins, Certificate Transparency Policy, Certificate Transparency Operations
I think some DigiCert messages are not being posted to this list? Chuck sent the RCAs to the CT policy list, but I don't see them showing up here. Can other people see them, or did something cause them not to show up on the mailing list? Example of RCA is here. Since it was sent as a reply-all to this list, Andrew and Joe would have received it, but I'm not sure everyone else saw it. Just FYI - Chuck is working on a CoE as well that will address the RCA with remediation. That'll be shared with this group shortly.

And here is the RCA for our Nessie issues.

DESCRIPTION

The Nessie log experienced database problems as far back as March 8th. These became a critical issue at about 2 PM MT on March 27th, when the errors became constant and frequent. The combination of the number of provisioned IOPS being at the limit, a surge in new entries being submitted, and the log reaching a certain size caused the database to slow significantly. After the 27th, it was discovered that entries in the database were not in the log tree, even after several days.

 

ROOT CAUSE

The poor database performance was caused by the provisioned IOPS for the data volumes of the six servers being lower than required. The configured setting was 10K IOPS, but based on the total size of the data and the rate of requests for adding and downloading entries, at least 20K IOPS would have been needed for adequate performance.

Historically, the Nessie CT Log shards have been smaller and slower growing than the Yeti CT Log shards; the database cluster was provisioned accordingly. Closer monitoring of this cluster would have indicated the need to increase the resources available to it. Important metrics for evaluating database cluster load going forward (a brief sampling sketch follows this list):

  • Growth rate of data
  • Read/Write IOPs usage
  • CPU Usage and I/O Wait
  • Cassandra pending read requests
  • Cassandra pending mutations (write)
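
As a rough illustration of how a couple of these could be sampled (assuming shell access to a node with Cassandra's nodetool on PATH; in practice these values would more likely be scraped over JMX or by an existing metrics exporter), a minimal Go sketch:

```go
package casmon

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// pendingTasks runs `nodetool tpstats` on the local Cassandra node and returns
// the Pending column for the named thread pool, e.g. "ReadStage" (pending
// reads) or "MutationStage" (pending writes). Illustrative only: it assumes
// nodetool is available on the node being checked.
func pendingTasks(pool string) (int, error) {
	out, err := exec.Command("nodetool", "tpstats").Output()
	if err != nil {
		return 0, err
	}
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		// tpstats rows look like: "ReadStage  0  0  1234567  0  0"
		// (Pool Name, Active, Pending, Completed, Blocked, All time blocked).
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[0] == pool {
			return strconv.Atoi(fields[2])
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("pool %q not found in tpstats output", pool)
}
```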

 

Alerts were received during this time window, but no action was taken until March 31st.

  • New Relic showed the warning threshold for CPU usage (> 70%) being exceeded regularly; however, an alert would only have been sent once usage exceeded 80%.
  • Another alert condition (CPU I/O wait) that was expected to be tripped was found to have a critical threshold of 70%. Anything above 10% is considered poor, as it indicates slow disk performance.
  • Other alerts from New Relic for “High CT Log Latency” were received between March 20th and March 30th. This alert condition appears to have been poorly defined, and no action items appear to be documented for it. This is the likely reason it was not acted on.

 

A combination of alert fatigue, gaps in knowledge/training, and a lack of clear ownership are the likely reasons for inaction, which allowed the problem to persist and led to the log’s planned retirement.

 

 



Chuck Blevins

Apr 3, 2025, 2:15:03 PM
to Certificate Transparency Policy, Jeremy Rowley, Andrew Ayer, Chuck Blevins, Certificate Transparency Policy, Certificate Transparency Operations, Joe DeBlasio

Here is the RCA regarding the Wyvern and Sphinx log issues (apologies; this appeared to post Tuesday evening but cannot be found on the list).

DESCRIPTION

Starting in 2024, DigiCert created new CT log shards with name prefixes wyvern and sphinx using the Google Trillian code base.

On March 19th GMT, the Wyvern & Sphinx 2025h1 shards logged several spikes of errors related to database connections: “context deadline exceeded” and “too many connections”. These spikes were followed by a period with the persistent error “blocked because of many connection errors; unblock with ‘mariadb-admin flush-hosts’”. With connections from the application pods to the database blocked, CT requests would fail. After one of these spikes and runs of “blocked” errors, Kubernetes attempted to restart the application pods, which would cyclically fail because the host was blocked from connecting to the database.
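
For reference, this block comes from MariaDB's max_connect_errors protection: once a client host accumulates that many interrupted connections, the server refuses further connections from it until the host cache is flushed. A minimal sketch of that reset step (assuming the standard Go MySQL driver and admin credentials, both placeholders; operationally we ran the equivalent mariadb-admin flush-hosts command):

```go
package dbops

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

// flushBlockedHosts connects with an administrative DSN, reports the
// configured max_connect_errors threshold, and clears the host cache so that
// blocked application hosts can connect again. The DSN is a placeholder;
// this is the SQL-level equivalent of running `mariadb-admin flush-hosts`.
func flushBlockedHosts(adminDSN string) error {
	db, err := sql.Open("mysql", adminDSN)
	if err != nil {
		return err
	}
	defer db.Close()

	var name, value string
	if err := db.QueryRow("SHOW GLOBAL VARIABLES LIKE 'max_connect_errors'").Scan(&name, &value); err != nil {
		return err
	}
	fmt.Printf("%s is currently %s\n", name, value)

	// Reset the host cache; blocked hosts may reconnect after this.
	_, err = db.Exec("FLUSH HOSTS")
	return err
}
```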

 

On the evening of the 20th, SRE escalated that they were seeing problems with both the Wyvern and Sphinx 2025h1 pods. They flushed the cache for the MariaDB server and bounced the pods. A few hours later the errors returned, this time for Wyvern2025h1. The DBAs flushed the host, and the pods were restarted. After these steps, the service returned to normal, and the teams decided on mitigation steps, including changing the IOPS for the Wyvern2025h1 pods. If this helped alleviate the issue, the same change would be made on Sphinx2025h1. The DBAs updated the IOPS on the Wyvern2025h1 server, but implementation took several hours.

 

The next day, we received another alert for Wyvern2025h1 crashing. We redeployed Wyvern2025h1, flushed the host cache on the MariaDB server, and redeployed Wyvern2025h1 again. Several hours later, the same problem reoccurred, but with Sphinx2025h1. We began investigating. This time, we were able to see “unauthenticated user” errors in the DB. We then decided to update the IOPS on Sphinx2025h1 so that we could remove that limitation and match Wyvern2025h1. After making the change, we flushed the host cache on MariaDB and bounced the pods again. Since the problem continued to recur, we decided further investigation would be required.

 

On March 31, Google, via the CT-Policy group, requested the retirement of these logs due to the lack of resolution and missed inclusion deadlines. DigiCert agreed with this decision and updated the thread in CT-Policy.

 

 

ROOT CAUSE
It appears that there is something (currently unknown) which can cause a logserver pod to create a large enough number of connections to the database to trigger the database blocking that entire host from connecting until the block is reset via a MariaDB admin command. Whatever the trigger is, it has happened repeatedly over the past few days and has left the Wyvern2025h1 & Sphinx2025h1 shards in an unusable state.

 

Increasing the number of IOPS had diminishing returns. As the number of IOPS increased, so did the disk queue length, which indicates another bottleneck. The database instance type may be too small for the load and unable to sustain the number of IOPS the underlying storage has been configured to provide.

 

INCIDENT TIMELINE
(All times in GMT)

March 20th, 2025

05:05 - First spike of errors for Wyvern & Sphinx2025h1 servers

05:55 - Wyvern2025h1 pod starts going into crashloopbackoff state

10:10 - Sphinx2025h1 pod starts going into crashloopbackoff state

21:54 - The SRE-Lehi team escalates that the Wyvern & Sphinx 2025h1 servers are unavailable

22:00 - Approx. time - flush the host cache on MariaDB & redeploy the pods for Wyvern & Sphinx 2025h1 

22:04 - Wyvern & Sphinx pods are back up and running again

 

March 21st, 2025

00:40 – error is seen again

01:03 - DBA – Flush host

01:05 – Restart the pods (wyvern)

01:07 - Everything is looking normal again (the pods are not in crashloop backoff)

04:30 – Begin the IOPS change on Wyvern2025h1 (ECC-2656)

10:08 - IOPS change completed for Wyvern2025h1

 

March 22nd, 2025

19:32 - Alert for Wyvern2025h1 pod crashing

20:05 - Zoom opened

20:11 - Redeployed Wyvern2025h1

20:20 - Flushed the MariaDB cache

20:25 – Restarted the pods again

 

March 23rd, 2025

00:34 - Crashing pods again for Sphinx2025h1

00:35 - Zoom opened

01:11 - "unauthenticated user" errors in db

01:30 - Start the IOPS change on Sphinx2025h1 (ECC-2656) 

01:40 - Flush host cache on MariaDB

01:43 - Restart the pods

01:44 - All pods running successfully 

 

March 24th, 2025

00:28 – Crashing pods again

00:51 - Validates pod still in crashloop stage  

00:05 – Flush host cache on MariaDB and redeploy Wyvern2025h1-logserver; pod running healthy

01:20 - Monitor the pods for 15min and zoom ended 

06:28 - Crashing pods again

06:32 - Flush cache DB and redeployed pod  

21:39 - Crashing pods again

21:44 - Flush cache DB 

21:48 - Redeploy pod Wyvern2025h1-logserver; SRE monitors the pod

 

March 26th, 2025

00:19 - NR alert for CT Log alerts 

00:25 - Flush cache DB and redeploy pod Wyvern2025h1-logserver

01:09 - NR alert for CT Log alerts 

01:21 - Redeploy pod Wyvern2025h1-logserver

Joe DeBlasio

Apr 3, 2025, 2:34:31 PM
to Chuck Blevins, Certificate Transparency Policy, Chuck Blevins, Certificate Transparency Operations, Andrew Ayer
Thanks for sending these again. We're not sure what happened with ct-policy@, but we'll poke around. We're not aware of any changes to the list during that time.

The Nessie investigation hasn't shown up on-list yet, but is included below from the copy I received on Tuesday (presumably because I was an explicitly-named recipient).

Joe

(NB. We may follow up with some questions or thoughts once we've read your posts a bit more carefully)

On Tue, Apr 1, 2025 at 8:14 PM Chuck Blevins <chuck....@digicert.com> wrote:

And here is the RCA for our Nessie issues.

DESCRIPTION

The Nessie log experienced database problems as far back as March 8th. These became a critical issue at about 2 PM MT on March 27th, when the errors became constant and frequent. The combination of the number of provisioned IOPS being at the limit, a surge in new entries being submitted, and the log reaching a certain size caused the database to slow significantly. After the 27th, it was discovered that entries in the database were not in the log tree, even after several days.

 

ROOT CAUSE

The poor database performance was caused by the provisioned IOPS for the data volumes of the six servers being lower than required. The configured setting was 10K IOPS, but based on the total size of the data and the rate of requests for adding and downloading entries, at least 20K IOPS would have been needed for adequate performance.

Historically the Nessie CT Log shards have been smaller and slower growing than the Yeti CT Log shards; the database cluster was provisioned accordingly. Closer monitoring of this server would have indicated the need to increase the resources available to the database cluster. Important metrics for evaluating load of the database cluster for future use:

  • Growth rate of data
  • Read/Write IOPs usage
  • CPU Usage and I/O Wait
  • Cassandra pending read requests
  • Cassandra pending mutations (write)

 

Alerts were received during this time window, but no action was taken until March 31st.

  • New Relic showed the warning threshold for CPU usage (> 70%) being exceeded regularly; however, an alert would only have been sent once usage exceeded 80%.
  • Another alert condition (CPU I/O wait) that was expected to be tripped was found to have a critical threshold of 70%. Anything above 10% is considered poor, as it indicates slow disk performance.
  • Other alerts from New Relic for “High CT Log Latency” were received between March 20th and March 30th. This alert condition appears to have been poorly defined, and no action items appear to be documented for it. This is the likely reason it was not acted on.

 

A combination of alert fatigue, gaps in knowledge/training, and a lack of clear ownership are the likely reasons for inaction, which allowed the problem to persist and led to the log’s planned retirement.

 

 

-- 

Chuck Blevins

Director of Product Management, Public PKI Services

Joe DeBlasio

Apr 3, 2025, 2:42:14 PM
to Jeremy Rowley, Chuck Blevins, Certificate Transparency Policy, Chuck Blevins, Certificate Transparency Operations, Andrew Ayer
It looks like Chuck's digicert.com email address is not subscribed to ct-policy@, and only members can post to the list. His @gmail.com address is a member, which explains the previously-successful posts. Either joining the list with your official address, or posting consistently from the gmail address should address the issue. (h/t to Devon for figuring that out.)

Joe

On Thu, Apr 3, 2025 at 11:36 AM Jeremy Rowley <jeremy...@digicert.com> wrote:

No problem. The correction of error posts will probably also answer a lot of questions as well.

Jeremy Rowley

Apr 3, 2025, 3:12:14 PM
to Joe DeBlasio, Jeremy Rowley, Chuck Blevins, Certificate Transparency Policy, Chuck Blevins, Certificate Transparency Operations, Andrew Ayer
To answer your question about Yeti, Joe:

Currently, the Yeti log is not experiencing any issues. Yeti was just shy of 1.5 billion entries during 2024. Yeti 2025 is already at 1.25 billion entries and growing. Since Nessie fell over at 880M entries, we could see the same thing happen. The Cassandra cluster for Nessie was much smaller than Yeti's. This resulted in a lot of database timeouts, which the CT log code couldn't handle correctly.

For comparison:

  • Yeti Cassandra cluster: 12 nodes, R5.4xlarge, with 20K IOPS
  • Nessie Cassandra cluster: 9 nodes, R5.4xlarge, with 10K IOPS

We're happy to keep the log operational for as long as possible while we rebuild our Trillian-based logs after implementing our corrective action plan.


Chuck Blevins

Apr 3, 2025, 7:24:46 PM
to Certificate Transparency Policy, Jeremy Rowley, Jeremy Rowley, Chuck Blevins, Certificate Transparency Policy, Chuck Blevins, Certificate Transparency Operations, Andrew Ayer, Joe DeBlasio

Correction of Errors (CoE)

Incident Details 

  • Title: CT log Database Performance Issues Resulting in Log Service Issues and MMD misses 

  • Date of Incident: March 27, 2025 

  • Time of First Error: March 3, 2025 

  • Time First Notification of Issue: March 17, 2025 

  • Duration of Incident: March 8 - March 31, 2025 

  • Owner: SRE-CloudOps/Engineering 

 

Incident Summary 

DigiCert operates four CT logs that are sharded into different segments. Two of these logs, Nessie and Yeti, use a DigiCert codebase. The other two, Wyvern and Sphinx, are Trillian-based CT logs. Three shards of these logs (the 2025 H1 shards) experienced degraded performance. The Nessie CT log experienced database performance degradation as early as March 3rd. Wyvern and Sphinx began exhibiting errors on March 15, with critical errors starting on March 17th, 2025. The root cause of all failures was a combination of reaching provisioned IOPS limits, unexpected increases in log sizes compared to last year, a sustained increase in new entries (a factor of 2-2.5), and under-provisioned database resources. This combination of issues led to degraded log performance and MMDs being missed by several days. Each factor contributed to the end result (the log failing), but the most direct factor was the insufficient IOPS allocation. Although internal alerts were triggering, the team failed to respond promptly and to accurately identify and remediate the root cause, leading to prolonged service degradation and the retirement of the three shards.

 

Remediations Implemented and In Progress 

  • Increase Provisioned IOPS & Database Resources: We are moving to larger AWS instance types for scaled database storage and IOPS capacity to meet demand, ensuring adequate headroom for future growth as detailed in the corrective actions.  

  • Fine-tune Monitoring & Alerting: Our alerts did not include clear, actionable information, nor was there an SOP on how to address performance degradation. Although we had alerts, the team was unclear on the process for remediating the IOPS issue.

  • Revise Ownership & Response Protocols: Our SRE team had ownership over monitoring and response but was not familiar with the CT code or alerts. This gap in knowledge significantly hampered our ability to respond to alerted performance issues. We are cross-training the SRE team and engineering teams to provide better coverage on CT log alerts and reduce the potential for missed alerts.   

  • Capacity Planning: We implemented regular database performance reviews to anticipate scaling needs before performance degradation occurs. We based our previous database size on last year’s total log size. We failed to account for the growth in log size due to shorter lived certificates and the rapid growth in certificates.  

  • Adjust Maximum Merge Delay (MMD) Database Policies: We are increasing database sizing for writes and reads to maintain compliance with MMD requirements.  

 

These corrective actions will prevent similar issues in the future, ensuring better system resilience and response efficiency. Additionally, the Wyvern and Sphinx 2025 H2 shards and all 2026 shards will be scaled by provisioning larger instance types to prevent performance issues, and all procedural improvements documented in the corrective actions will be in place to ensure we meet SLA standards.

 

Summary of Events 

March 3rd, 2025 

  • Logs begin having performance issues; no alerts were triggered because the alert thresholds were set too high to prompt action.

March 17th, 2025 

  • 22:36 UTC - Google sends email to DigiCert notifying of MMD misses 

March 18th, 2025 

  • 17:59 UTC - SRE is notified and begins investigation 

  • 22:37 UTC - SRE reports that they did not notice any major problem with the application during the timeframe 

March 19th, 2025 

  • 18:41 UTC - SRE responds to further inquiries to schedule a meeting to investigate the database 

  • 21:00 UTC - SRE investigates database and involves architecture for assistance 

  • 22:00 UTC - Wyvern and Sphinx 2025h1 logserver pods begin crashing due to database connection errors 

  • 22:10 UTC - SRE flushes host metrics on database server and restarts pods 

March 20th, 2025 

  • 05:05 UTC - First spike of errors for Wyvern & Sphinx 2025h1 servers 

  • 05:55 UTC - Wyvern2025h1 pod starts going into crashloopbackoff state 

  • 10:10 UTC - Sphinx2025h1 pod starts going into crashloopbackoff state 

  • 21:54 UTC - The SRE-Lehi team escalates to the sre-ops-internal channel that the Wyvern & Sphinx 2025h1 servers are unavailable 

  • 22:00 UTC - Approximate time - SRE-Lehi flushes the host cache on MariaDB & redeploys the pods for Wyvern & Sphinx 2025h1 

  • 22:04 UTC - Wyvern & Sphinx pods are back up and running again 

  • 22:08 UTC - ProdOps starts a Zoom call, and the teams begin investigating 

  • 22:10 UTC - ProdOps and the teams on the call indicate they are not seeing the issues anymore, and the service should be okay now 

March 21st, 2025 

  • 00:40 UTC - SRE-Lehi sees the error come up again 

  • 01:03 UTC - DBA flushes host cache 

  • 01:05 UTC - SRE-MTV restarts the pods (Wyvern) 

  • 01:07 UTC - Everything looks normal again (the pods are no longer in crashloopbackoff) 

  • 04:30 UTC - DBAs start the IOPS change on Wyvern2025h1 (ECC-2656) 

  • 10:08 UTC - IOPS change completed for Wyvern2025h1 

March 22nd, 2025 

  • 19:32 UTC - Alert for Wyvern2025h1 pod crashing 

  • 20:05 UTC - Zoom opened 

  • 20:11 UTC - SRE-MTV redeployed Wyvern2025h1 

  • 20:20 UTC - DBAs flushed the MariaDB cache 

  • 20:25 UTC - SRE-MTV restarted the pods again 

March 23rd, 2025 

  • 00:34 UTC - ProdOps starts seeing crashing pods again for Sphinx2025h1 

  • 00:35 UTC - Zoom opened 

  • 01:11 UTC - SRE-MTV sees "unauthenticated user" errors in the database 

  • 01:30 UTC - DBAs start the IOPS change on Sphinx2025h1 (ECC-2656) 

  • 01:40 UTC - DBAs flush host cache on MariaDB 

  • 01:43 UTC - SRE-MTV restarts the pods 

  • 01:44 UTC - SRE-MTV confirms all pods running successfully 

March 24th, 2025 

  • 00:28 UTC - ProdOps starts seeing crashing pods again for Wyvern2025h1 

  • 00:31 UTC - ProdOps validates alert and escalates to SRE-MV1 team 

  • 00:51 UTC - SRE-MTV invites ProdOps to join Zoom, SRE-MV1 validates pod still in crashloop stage 

  • 00:55 UTC - SRE-MTV requests DBA involvement, and DBA joins shortly 

  • 00:05 UTC - SRE-MTV flushes host cache on MariaDB and redeploys Wyvern2025h1 logserver, now running healthily 

  • 01:20 UTC - SRE monitors the pods for 15 minutes and ends Zoom call 

March 26th, 2025 

  • 00:19 UTC - ProdOps starts seeing NR alert for CT Log alerts (Wyvern2025h1 logserver crash) 

  • 00:21 UTC - ProdOps validates alert and escalates to SRE-MV1 team 

  • 00:25 UTC - SRE-MTV flushes DB cache and redeploys Wyvern2025h1 logserver 

  • 01:09 UTC - ProdOps sees NR alert for Wyvern2025h1 logserver crash again 

  • 01:15 UTC - ProdOps validates alert and escalates to SRE-MV1 team 

  • 01:21 UTC - SRE-MTV redeploys Wyvern2025h1 logserver

March 26th-31st, 2025 

  • Maximum Merge Delay (MMD) missed 

  • Continued investigation of root-cause of database performance constraints 

March 31st, 2025  

  • Google via the CT-Policy group requested the retirement of Wyvern and Sphinx logs due to lack of resolution and missing inclusion deadlines. DigiCert agreed with this decision and updated the thread in CT-Policy. 

  • DigiCert retired the CT Log, Nessie. 

 

Identifying Root Causes 

Root Cause Analysis (RCA) 

  1. Why did the database slow down?

  • The provisioned IOPS limit was reached due to an increase in log size and a surge in new entries.

  2. Why was the IOPS limit too low?

  • The database was configured with 10K IOPS, but at least 20K IOPS was needed for adequate performance based on the current certificate volumes being logged.

  3. Why was the provisioning insufficient?

  • CT Log (Nessie, Wyvern, Sphinx) shards were historically started with smaller resources (M5 2xlarge general-purpose compute instance types), and growth monitoring was inadequate.

  4. Why was growth monitoring inadequate?

  • Key metrics such as data growth rate, read/write IOPS usage, CPU usage, and pending requests (Cassandra for Nessie; MariaDB for Wyvern and Sphinx) were not closely monitored.

  5. Why was there a delayed response to alerts?

  • There was no clear ownership of the action item to respond to performance degradations, and knowledge gaps about the CT logs and code base compounded the delay.

 

Corrective Action Items 

  • Increase AWS Instance Type with higher IOPS to R6.4xlarge for 2025-h2 shards: 

  • Priority: High 

  • Owner: Site Reliability Engineering 

  • Due Date: April 10, 2025 

  • Status: In progress 

  • Increase AWS Instance Type with higher IOPS to R6.4xlarge for all 2026 shards: 

  • Priority: Medium 

  • Owner: Site Reliability Engineering 

  • Due Date: April 17, 2025 

  • Status: In progress 

  • Enhance monitoring and alerting thresholds: 

  • Priority: Medium 

  • Owner: Site Reliability Engineering 

  • Due Date: April 17, 2025 

  • Status: In progress 

  • Adjust CPU usage alert from >80% to >70% 

  • Set CPU I/O wait alert to trigger above 10% average 

  • Monitor MMD through synthetic (woodpecker) testing 

  • Monitor disk queue length; alert when count > 10 (see the minimal threshold-check sketch after the corrective action items)

  • Improve documentation and training: 

  • Priority: Medium 

  • Owner: Site Reliability Engineering 

  • Due Date: April 17, 2025 

  • Status: In progress 

  • Define clear response protocols (playbooks) for database performance alerts 

  • Provide training on recognizing and responding to key alert conditions for both Engineering and Operations teams 

  • Increase visibility of CT log alerts to product engineering teams 

  • Train a minimum of 5 additional engineers 

  • Establish clear ownership for CT Log application and system operations: 

  • Priority: Medium 

  • Owner: Product Management 

  • Due Date: April 10, 2025

  • Status: In progress 

  • Assign specific teams and individuals responsible for monitoring and responding to database performance issues, application errors, and violations of MMD SLO. 

  • Assign a PM as the primary person responsible for all communication to CT log stakeholders (e.g. the Chrome CT team and the CT policy community) when issues occur.
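
For concreteness, a minimal sketch of the threshold checks described under “Enhance monitoring and alerting thresholds” above; the struct shape and metric source are illustrative placeholders, not our New Relic configuration:

```go
package alerting

// Threshold values drawn from the corrective action items above.
const (
	cpuWarnPct    = 70.0 // CPU usage warning, lowered from 80%
	ioWaitWarnPct = 10.0 // CPU I/O wait above this indicates slow disks
	diskQueueWarn = 10.0 // disk queue length alert threshold
)

// HostSample is a simplified snapshot of the metrics named above.
type HostSample struct {
	Host         string
	CPUPct       float64
	IOWaitPct    float64
	DiskQueueLen float64
}

// Evaluate returns the alert conditions a sample violates.
func Evaluate(s HostSample) []string {
	var fired []string
	if s.CPUPct > cpuWarnPct {
		fired = append(fired, "CPU usage above 70%")
	}
	if s.IOWaitPct > ioWaitWarnPct {
		fired = append(fired, "CPU I/O wait above 10%")
	}
	if s.DiskQueueLen > diskQueueWarn {
		fired = append(fired, "disk queue length above 10")
	}
	return fired
}
```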

 

Supporting Evidence & Attachments 

  • New Relic / Splunk alert logs 

Count of ERROR log events for Nessie 2025 over time 


Time chart of requests to the `add-chain` and `add-pre-chain` endpoints for Nessie 2025. The number of requests to the `add-chain` endpoint increased 3-fold around March 19th.

 


Nessie database nodes CPU metrics. Beginning around March 24th, CPU I/O wait increases, indicating that the throughput of the storage system has become saturated.


Time chart of requests to the `add-chain` and `add-pre-chain` endpoints for Wyvern 2025. The number of requests to the `add-chain` endpoint increased 12-fold around March 19th.

 

 

 

Time chart of requests to the `add-chain` and `add-pre-chain` endpoints for Sphinx 2025. The number of requests to the `add-chain` endpoint increased 9-fold around March 19th and then doubled again on March 23rd.

 

 

Wyvern 2025 database performance metrics. On March 21st, IOPS increased from 6K to 12K and plateaued at 12K. On March 25th, IOPS was increased from 12K to 18K but throughput remained at the 12K plateau, indicating the instance type isn't capable of more throughput.


Chuck Blevins

Apr 3, 2025, 7:33:16 PM
to Certificate Transparency Policy, Chuck Blevins, Jeremy Rowley, Jeremy Rowley, Chuck Blevins, Certificate Transparency Policy, Certificate Transparency Operations, Andrew Ayer, Joe DeBlasio
Attaching PDF as images refuse to embed
CTLogs-CoE-2025-03-27.pdf

Jeremy Rowley

Apr 3, 2025, 7:33:25 PM
to Chuck Blevins, Certificate Transparency Policy, Jeremy Rowley, Chuck Blevins, Certificate Transparency Operations, Andrew Ayer, Joe DeBlasio
Chuck - looks like the graphs didn't come through.

CT people - do you prefer graphs as images uploaded as attachments to the message or as a PDF file that we attach to the message showing the full report? What's the preferred way to share graphics on a Google group (pardon my ignorance on this)? 

Devon O'Brien

Apr 4, 2025, 2:33:59 PM
to Certificate Transparency Policy, crb...@gmail.com, Jeremy Rowley, Jeremy Rowley, Chuck Blevins, Certificate Transparency Policy, Certificate Transparency Operations, Andrew Ayer, Joe DeBlasio - Google
Hi Chuck and Jeremy (welcome back!),

Thanks for putting together the detailed root cause analysis and correction of error reports; they are very helpful to both the Chrome CT team and to the broader CT community, and we appreciate the effort put into them.

Several requirements placed on CT Logs, notably the uptime and MMD requirements, produce fairly firm timelines for incident response and corrective action. In the timeline you provided, we do see several preventable failures that led to remediation delays and overall impact being larger than strictly necessary.

You covered most of these points in your correction of error report, but I’d like to emphasize several proactive steps that we recommend all log operators undertake to ensure they can reliably meet the requirements outlined in https://goo.gl/chrome/ct-log-policy:
  • Logs should be proactively monitored by their operators for signs of degraded health. Wherever possible, alerting thresholds should be configured to enable corrective action before a policy violation occurs.
  • CT log operations teams should be able to investigate reported incidents and confirm or repudiate with evidence within a short period of time. Multi-day delays on this process, such as delays in reports being routed to the appropriate people to investigate, put both the affected log and the broader ecosystem at unnecessarily elevated risk.
  • As soon as log misbehavior is identified (e.g. existing or imminent MMD violations, log split view, possible log compromise, etc.), please proactively disable write APIs and communicate this to the community. While this step may negatively impact log availability in the short-term, temporary write outages are significantly less harmful to the ecosystem than other more severe forms of log failure.
  • If you have any questions about the requirements in the CT Log Policy or other proactive measures that might help complying with these requirements, please ask the CT community at ct-p...@chromium.org, or you can reach out to us directly at chrome-certific...@google.com.
-Devon

Jeremy Rowley

Apr 4, 2025, 2:49:16 PM
to Devon O'Brien, Certificate Transparency Policy, crb...@gmail.com, Jeremy Rowley, Chuck Blevins, Certificate Transparency Operations, Andrew Ayer, Joe DeBlasio - Google
Thanks Devon!
  • Logs should be proactively monitored by their operators for signs of degraded health. Wherever possible, alerting thresholds should be configured to enable corrective action before a policy violation occurs.
For sure - we had the thresholds set too high and the alerts were not specific enough. We're rectifying this.
  • CT log operations teams should be able to investigate reported incidents and confirm or repudiate with evidence within a short period of time. Multi-day delays on this process, such as delays in reports being routed to the appropriate people to investigate, put both the affected log and the broader ecosystem at unnecessarily elevated risk.
Yes - looking back through the communication, we clearly failed to address the incident and communication from Google in a timely manner. Our SLAs for responding to incidents (regardless of source) are being updated with an expectation of communication within 1 hour and (where possible) resolution within 24 hours. 
  • As soon as log misbehavior is identified (e.g. existing or imminent MMD violations, log split view, possible log compromise, etc.), please proactively disable write APIs and communicate this to the community. While this step may negatively impact log availability in the short-term, temporary write outages are significantly less harmful to the ecosystem than other more severe forms of log failure.
This one is new to me but makes total sense. We'll update our playbook to include this as an action.
  • If you have any questions about the requirements in the CT Log Policy or other proactive measures that might help complying with these requirements, please ask the CT community at ct-p...@chromium.org, or you can reach out to us directly at chrome-certific...@google.com.
What more information would you like from us about this incident? Would you like a follow-up email when all action items are implemented?