DigiCert's Nessie 2025 shard began logging database errors on March 27th due to an overloaded Cassandra cluster. Investigation found that entries do appear to be missing, and we recommend this log be retired. We will provide an RCA by end of day on April 1, 2025. We are still investigating, but the root cause appears to be simply the volume of certificates logged.
On a similar note, our Yeti2025 log is still running and uses the same code base. We are currently at 1.25 billion entries, which is more than we have ever tested with that code (we tested up to a billion). We believe this log may eventually fail similarly to Nessie, so we recommend it be shut down as well given the issue of scale. However, we would like direction from the community on whether we should shut it down and on an appropriate shutdown date.
The remaining Trillian-based Sphinx and Wyvern shards are also running. Our troubleshooting of the Trillian-based logs has shown that someone has been using the add-chain endpoint heavily to add certificates to the retiring Wyvern2h and Sphinx2h logs. We did not have rate limits on those logs, which led to rapid growth in log size and a failure in the logs' ability to perform. We will be adding a rate limit (level and timing TBD) to the add-chain endpoint to prevent too many non-pre-certificates being logged at once. We will provide an RCA on the Sphinx and Wyvern shards tomorrow with more information about the rate-limiting plan and how we will ensure the newer shards can meet the scale requirements. Please let us know what questions or suggestions the community has.
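To illustrate the kind of limit we have in mind (this is a sketch, not the final implementation; the handler wiring, the 50 req/s rate, and the burst of 100 are placeholders), a shared token-bucket limiter in front of an add-chain handler in Go could look like this:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

// rateLimit wraps an HTTP handler with a shared token-bucket limiter.
// Requests beyond the configured rate and burst are rejected with
// HTTP 429 so that bursts of add-chain traffic cannot overwhelm the
// database behind the log.
func rateLimit(next http.Handler, rps float64, burst int) http.Handler {
	limiter := rate.NewLimiter(rate.Limit(rps), burst)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "add-chain rate limit exceeded; retry later", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// addChain stands in for the real add-chain handler; the 50 req/s
	// rate and burst of 100 are illustrative placeholders, not the
	// limits we will actually deploy.
	addChain := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	mux := http.NewServeMux()
	mux.Handle("/ct/v1/add-chain", rateLimit(addChain, 50, 100))
	log.Fatal(http.ListenAndServe(":6962", mux))
}
```

In practice we would likely key the limiter per source and keep add-pre-chain on a separate, more generous budget, since pre-certificate logging is the latency-critical path for issuance.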
--
Chuck Blevins
Director of Product Management, Public PKI Services
Here is the RCA regarding the Wyvern and Sphinx log issues (apologies: this appeared to post Tuesday evening but is nowhere to be found).
DESCRIPTION
Starting in 2024, DigiCert created new CT log shards with name prefixes wyvern and sphinx using the Google Trillian code base.
On March 19th GMT, the Wyvern & Sphinx 2025h1 shards logged several spikes of errors related to database connections: "context deadline exceeded" and "too many connections". These spikes were followed by a period with the persistent error "blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'". With connections from the application pods to the database blocked, CT requests would fail. After one of these spikes and runs of "blocked" errors, Kubernetes attempted to restart the application pods, which would cyclically fail because the host was blocked from connecting to the database.
On the evening of the 20th, SRE escalated that they were seeing problems with both the Wyvern and Sphinx 2025h1 pods. They flushed the cache for the MariaDB server and bounced the pods. A few hours later the errors returned, this time for Wyvern2025h1. The DBAs flushed the host, and the pods were restarted. After these steps, the service returned to normal, and the teams decided on mitigation steps, including changing the IOPS for Wyvern2025h1. If this helped alleviate the issue, the same change would be made on Sphinx2025h1. The DBAs updated the IOPS on the Wyvern2025h1 server, but implementation took several hours.
The next day, we received another alert for Wyvern2025h1 crashing. We redeployed Wyvern2025h1, flushed the host cache on the MariaDB server, and redeployed Wyvern2025h1 again. Several hours later, the same problem recurred, but with Sphinx2025h1. We began investigating and this time were able to see "unauthenticated user" errors in the database. We then decided to update the IOPS on Sphinx2025h1 so that we could remove that limitation and so it would match Wyvern2025h1. After making the change, we flushed the host cache on MariaDB and bounced the pods again. Since the problem continued to happen, we decided further investigation would be required.
On March 31, Google via the CT-Policy group requested the retirement of these logs due to lack of resolution and missed inclusion deadlines. DigiCert agreed with this decision and updated the thread in CT-Policy.
ROOT CAUSE
It appears that something (currently unknown) can cause a logserver pod to create a large enough number of connections to the database to trigger the database blocking that entire host from connecting until the block is reset via a MariaDB admin command. Whatever the trigger is, it has happened repeatedly over the past few days and has left the Wyvern2025h1 & Sphinx2025h1 shards in an unusable state.
Increasing the number of IOPs had diminishing returns. As the number of IOPs increased, so did the disk queue length, which indicates another bottleneck. The database instance type may be too small for the load and unable to sustain the number of IOPs that the underlying storage has been configured to provide.
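The "blocked because of many connection errors" message is MariaDB's host-blocking behavior (governed by max_connect_errors), which persists until the host cache is flushed. Independent of the underlying trigger, bounding the connection pool on the application side limits how far a single misbehaving pod can push the server. The following is a minimal sketch using Go's database/sql; the DSN and limit values are illustrative placeholders, not our production settings, and Trillian's own storage layer has its own configuration for this:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL/MariaDB driver
)

func main() {
	// The DSN is a placeholder; credentials, host, and schema are
	// illustrative only.
	db, err := sql.Open("mysql", "ctlog:password@tcp(mariadb:3306)/trillian")
	if err != nil {
		log.Fatal(err)
	}

	// Bound the pool so a single pod cannot open an unbounded number of
	// connections and trip the server's max_connect_errors host blocking.
	// The specific numbers are placeholders to show the shape of the fix.
	db.SetMaxOpenConns(50)                  // hard cap on concurrent connections
	db.SetMaxIdleConns(10)                  // keep a small warm pool
	db.SetConnMaxLifetime(30 * time.Minute) // recycle long-lived connections
	db.SetConnMaxIdleTime(5 * time.Minute)  // drop connections that sit idle

	if err := db.Ping(); err != nil {
		log.Fatalf("database unreachable: %v", err)
	}
	log.Println("connection pool configured")
}
```

Capping open connections trades some peak throughput for predictable pressure on the database, which is the behavior we want while the underlying trigger is still unknown.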
INCIDENT TIMELINE
(All times in GMT)
March 20th, 2025
05:05 - First spike of errors for Wyvern & Sphinx2025h1 servers
05:55 - Wyvern2025h1 pod starts going into crashloopbackoff state
10:10 - Sphinx2025h1 pod starts going into crashloopbackoff state
21:54 - The SRE-Lehi team escalates that the Wyvern & Sphinx 2025h1 servers are unavailable
22:00 - Approx. time - flush the host cache on MariaDB & redeploy the pods for Wyvern & Sphinx 2025h1
22:04 - Wyvern & Sphinx pods are back up and running again
March 21st, 2025
00:40 - Error is seen again
01:03 - DBA flushes host cache
01:05 - Restart the pods (Wyvern)
01:07 - Everything is looking normal again (the pods are not in crashloop backoff)
04:30 - Begin the IOPS change on Wyvern2025h1 (ECC-2656)
10:08 - IOPS change completed for Wyvern2025h1
March 22nd, 2025
19:32 - Alert for Wyvern2025h1 pod crashing
20:05 - Zoom opened
20:11 - Redeployed Wyvern2025h1
20:20 - Flushed the MariaDB cache
20:25 - Restarted the pods again
March 23rd, 2025
00:34 - Crashing pods again for Sphinx2025h1
00:35 - Zoom opened
01:11 - "unauthenticated user" errors in db
01:30 - Start the IOPS change on Sphinx2025h1 (ECC-2656)
01:40 - Flush host cache on MariaDB
01:43 - Restart the pods
01:44 - All pods running successfully
March 24th, 2025
00:28 - Crashing pods again
00:51 - Validates pod still in crashloop stage
00:05 - Flush host cache on MariaDB and redeploy Wyvern2025h1-logserver; pod healthy and running
01:20 - Monitor the pods for 15 minutes; Zoom ended
06:28 - Crashing pods again
06:32 - Flush DB cache and redeploy pod
21:39 - Crashing pods again
21:44 - Flush DB cache
21:48 - Redeploy pod Wyvern2025h1-logserver; SRE monitors pod
March 26th, 2025
00:19 - NR alert for CT Log alerts
00:25 - Flush cache DB and redeploy pod Wyvern2025h1-logserver
01:09 - NR alert for CT Log alerts
01:21 - Redeploy pod Wyvern2025h1-logserver
And here is the RCA for our Nessie issues.
DESCRIPTION
The Nessie log experienced database problems as far back as March 8th. These became a critical issue at about 2 PM MT on March 27th, when they became constant and frequent. The combination of the provisioned IOPs being at their limit, a surge in new entries being submitted, and the log reaching a certain size caused the database to slow significantly. After the 27th it was discovered that entries in the database were not in the log tree, even after several days.
ROOT CAUSE
The cause of the poor database performance was that the number of provisioned IOPs for the data volumes of the six servers was lower than required. The configured setting was 10K IOPs; based on the total size of the data and the rate of requests for adding and downloading entries, at least 20K IOPs would have been needed for adequate performance.
Historically the Nessie CT Log shards have been smaller and slower growing than the Yeti CT Log shards, and the database cluster was provisioned accordingly. Closer monitoring of this cluster would have indicated the need to increase the resources available to it. Important metrics for evaluating the load of the database cluster going forward (a collection sketch follows the list):
- Growth rate of data
- Read/Write IOPs usage
- CPU Usage and I/O Wait
- Cassandra pending read requests
- Cassandra pending mutations (write)
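As an example of how the Cassandra pending-task metrics above could be collected for alerting, the sketch below shells out to `nodetool tpstats` and flags sustained pending reads or mutations. The threshold and the exact column layout are assumptions on our part (tpstats output differs slightly between Cassandra versions), so this is illustrative only:

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// pendingThreshold is an illustrative value; a real alert threshold
// would be tuned against the cluster's normal baseline.
const pendingThreshold = 100

func main() {
	out, err := exec.Command("nodetool", "tpstats").Output()
	if err != nil {
		fmt.Println("failed to run nodetool:", err)
		return
	}

	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Typical tpstats rows look like:
		//   PoolName  Active  Pending  Completed  Blocked  AllTimeBlocked
		// The exact layout varies by Cassandra version.
		if len(fields) < 3 {
			continue
		}
		pool := fields[0]
		if pool != "ReadStage" && pool != "MutationStage" {
			continue
		}
		pending, err := strconv.Atoi(fields[2])
		if err != nil {
			continue
		}
		if pending > pendingThreshold {
			fmt.Printf("ALERT: %s has %d pending tasks\n", pool, pending)
		}
	}
}
```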
Alerts were received during this time window, but no action was taken until March 31st.
- New Relic showed the warning threshold for CPU usage (> 70%) being exceeded regularly; however, an alert would only have been sent once usage exceeded 80%.
- Another alert condition (CPU I/O wait) that was expected to trip was found to have a critical threshold of 70%; anything above 10% would be considered poor, being an indicator of slow disk performance.
- Other alerts from New Relic for “High CT Log Latency” were received between March 20th and March 30th. This alert condition appears to have been poorly defined, and no action items appear to be documented for it. This is the likely reason it was not acted on.
A combination of alert fatigue, gaps in knowledge/training, and a lack of clear ownership are the likely reasons for inaction, which resulted in the problem persisting and led to the log's planned retirement.
--
Chuck Blevins
Director of Product Management, Public PKI Services
No problem. The Correction of Errors post will probably answer a lot of questions as well.
Correction of Errors (CoE)
Incident Details
Title: CT log Database Performance Issues Resulting in Log Service Issues and MMD misses
Date of Incident: March 27, 2025
Time of First Error: March 3, 2025
Time of First Notification of Issue: March 17, 2025
Duration of Incident: March 8 - March 31, 2025
Owner: SRE-CloudOps/Engineering
Incident Summary
DigiCert operates four CT logs that are sharded into different segments. Two of these logs, Nessie and Yeti, use a DigiCert codebase. The other two, Wyvern and Sphinx, are Trillian-based CT logs. Three shards of these logs (the H1 CT log shards) experienced degraded performance. The Nessie CT log experienced database performance degradation as early as March 3rd. Wyvern and Sphinx began exhibiting errors on March 15th, with critical errors starting on March 17th, 2025. The root cause of all failures was a combination of reaching provisioned IOPS limits, unexpected increases in log sizes compared to last year, the sustained increase in new entries (a factor of 2-2.5), and under-provisioned database resources. This combination of issues led to degraded log performance and MMDs being missed by several days. Each factor contributed to the end result (the log failing), but the most direct factor was the insufficient IOPS allocation. Although internal alerts were triggering, the team failed to respond promptly and to accurately identify and remediate the root cause, leading to prolonged service degradation and the retirement of the three shards.
Remediations Implemented and In Progress
Increase Provisioned IOPS & Database Resources: We are moving to larger AWS instance types for scaled database storage and IOPS capacity to meet demand, ensuring adequate headroom for future growth as detailed in the corrective actions.
Fine-tune Monitoring & Alerting: Our alerts did not include clear, actionable information or an SOP on how to address performance degradation. Although we had alerts, the team was unclear on the process for remediating the IOPS issue.
Revise Ownership & Response Protocols: Our SRE team had ownership over monitoring and response but was not familiar with the CT code or alerts. This gap in knowledge significantly hampered our ability to respond to alerted performance issues. We are cross-training the SRE team and engineering teams to provide better coverage on CT log alerts and reduce the potential for missed alerts.
Capacity Planning: We implemented regular database performance reviews to anticipate scaling needs before performance degradation occurs. We based our previous database sizing on last year's total log size and failed to account for the growth in log size due to shorter-lived certificates and the rapid growth in certificate volume.
Adjust Maximum Merge Delay (MMD) Database Policies: We are increasing database sizing for writes and reads to maintain compliance with MMD requirements.
These corrective actions will prevent similar issues in the future, ensuring better system resilience and response efficiency. Additionally, the Wyvern and Sphinx 2025 H2 shards and all 2026 shards will be scaled by provisioning larger instance types to prevent performance issues, and all procedural improvements documented in the corrective actions will be in place to ensure we meet SLA standards.
Summary of Events
March 3rd, 2025
Logs begin having performance issues; no alerts were triggered because the alert thresholds were set too high to prompt action.
March 17th, 2025
22:36 UTC - Google sends email to DigiCert notifying of MMD misses
March 18th, 2025
17:59 UTC - SRE is notified and begins investigation
22:37 UTC - SRE reports that they did not notice any major problem with the application during the timeframe
March 19th, 2025
18:41 UTC - SRE responds to further inquiries to schedule a meeting to investigate the database
21:00 UTC - SRE investigates database and involves architecture for assistance
22:00 UTC - Wyvern and Sphinx 2025h1 logserver pods begin crashing due to database connection errors
22:10 UTC - SRE flushes the host cache on the database server and restarts pods
March 20th, 2025
05:05 UTC - First spike of errors for Wyvern & Sphinx 2025h1 servers
05:55 UTC - Wyvern2025h1 pod starts going into crashloopbackoff state
10:10 UTC - Sphinx2025h1 pod starts going into crashloopbackoff state
21:54 UTC - The SRE-Lehi team escalates to the sre-ops-internal channel that the Wyvern & Sphinx 2025h1 servers are unavailable
22:00 UTC - Approximate time - SRE-Lehi flushes the host cache on MariaDB & redeploys the pods for Wyvern & Sphinx 2025h1
22:04 UTC - Wyvern & Sphinx pods are back up and running again
22:08 UTC - ProdOps starts a Zoom call, and the teams begin investigating
22:10 UTC - ProdOps and the teams on the call indicate they are not seeing the issues anymore, and the service should be okay now
March 21st, 2025
00:40 UTC - SRE-Lehi sees the error come up again
01:03 UTC - DBA flushes host cache
01:05 UTC - SRE-MTV restarts the pods (Wyvern)
01:07 UTC - Everything looks normal again (the pods are no longer in crashloopbackoff)
04:30 UTC - DBAs start the IOPS change on Wyvern2025h1 (ECC-2656)
10:08 UTC - IOPS change completed for Wyvern2025h1
March 22nd, 2025
19:32 UTC - Alert for Wyvern2025h1 pod crashing
20:05 UTC - Zoom opened
20:11 UTC - SRE-MTV redeployed Wyvern2025h1
20:20 UTC - DBAs flushed the MariaDB cache
20:25 UTC - SRE-MTV restarted the pods again
March 23rd, 2025
00:34 UTC - ProdOps starts seeing crashing pods again for Sphinx2025h1
00:35 UTC - Zoom opened
01:11 UTC - SRE-MTV sees "unauthenticated user" errors in the database
01:30 UTC - DBAs start the IOPS change on Sphinx2025h1 (ECC-2656)
01:40 UTC - DBAs flush host cache on MariaDB
01:43 UTC - SRE-MTV restarts the pods
01:44 UTC - SRE-MTV confirms all pods running successfully
March 24th, 2025
00:28 UTC - ProdOps starts seeing crashing pods again for Wyvern2025h1
00:31 UTC - ProdOps validates alert and escalates to SRE-MV1 team
00:51 UTC - SRE-MTV invites ProdOps to join Zoom, SRE-MV1 validates pod still in crashloop stage
00:55 UTC - SRE-MTV requests DBA involvement, and DBA joins shortly
00:05 UTC - SRE-MTV flushes host cache on MariaDB and redeploys Wyvern2025h1 logserver, now running healthily
01:20 UTC - SRE monitors the pods for 15 minutes and ends Zoom call
March 26th, 2025
00:19 UTC - ProdOps starts seeing NR alert for CT Log alerts (Wyvern2025h1 logserver crash)
00:21 UTC - ProdOps validates alert and escalates to SRE-MV1 team
00:25 UTC - SRE-MTV flushes DB cache and redeploys Wyvern2025h1 logserver
01:09 UTC - ProdOps sees NR alert for Wyvern2025h1 logserver crash again
01:15 UTC - ProdOps validates alert and escalates to SRE-MV1 team
01:21 UTC - SRE-MTV redeploys Wyvern2025h1 logserver
March 26th-31st, 2025
Maximum Merge Delay (MMD) missed
Continued investigation of the root cause of the database performance constraints
March 31st, 2025
Google via the CT-Policy group requested the retirement of the Wyvern and Sphinx logs due to lack of resolution and missed inclusion deadlines. DigiCert agreed with this decision and updated the thread in CT-Policy.
DigiCert retired the Nessie CT log.
Identifying Root Causes
Root Cause Analysis (RCA)
Why did the database slow down?
The provisioned IOPS limit was reached due to an increase in log size and a surge in new entries.
Why was the IOPS limit too low?
The database was configured with 10K IOPS, but at least 20K IOPS was needed for adequate performance based on the current cert volumes being logged.
Why was the provisioning insufficient?
CT Log (Nessie, Wyvern, Sphinx) shards were historically started with smaller resources (M5 2xLarge general purpose compute instance types), and growth monitoring was inadequate.
Why was growth monitoring inadequate?
Key metrics such as data growth rate, read/write IOPS usage, CPU usage, and pending database requests (Cassandra for Nessie; MariaDB for Wyvern and Sphinx) were not closely monitored.
Why was there a delayed response to alerts?
A lack of clear ownership over who had the action item to act on performance degradation, combined with knowledge gaps about CT logs and the code base, delayed the response.
Corrective Action Items
Increase AWS Instance Type with higher IOPS to R6.4xlarge for 2025-h2 shards:
Priority: High
Owner: Site Reliability Engineering
Due Date: April 10, 2025
Status: In progress
Increase AWS Instance Type with higher IOPS to R6.4xlarge for all 2026 shards:
Priority: Medium
Owner: Site Reliability Engineering
Due Date: April 17, 2025
Status: In progress
Enhance monitoring and alerting thresholds:
Priority: Medium
Owner: Site Reliability Engineering
Due Date: April 17, 2025
Status: In progress
Adjust CPU usage alert from >80% to >70%
Set CPU I/O wait alert to trigger above 10% average
Monitor MMD through synthetic (woodpecker) testing (see the sketch following this list)
Monitor Disk Queue Length -- alert > 10 count
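For the synthetic (woodpecker) MMD testing item above, a full probe submits a chain and verifies inclusion within the MMD. As a lighter-weight companion check (an assumption on our part, not the final monitor), the sketch below polls a log's public RFC 6962 get-sth endpoint and alerts when the signed tree head stops being refreshed, which is an early signal that merges are stalling. The log URL and freshness window are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// sth holds the RFC 6962 get-sth fields this check needs.
type sth struct {
	TreeSize  uint64 `json:"tree_size"`
	Timestamp uint64 `json:"timestamp"` // milliseconds since the Unix epoch
}

func fetchSTH(base string) (sth, error) {
	var s sth
	resp, err := http.Get(base + "/ct/v1/get-sth")
	if err != nil {
		return s, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return s, fmt.Errorf("get-sth returned %s", resp.Status)
	}
	return s, json.NewDecoder(resp.Body).Decode(&s)
}

func main() {
	// Both values below are placeholders for illustration only.
	logURL := "https://example-ct-log.invalid"
	maxSTHAge := 2 * time.Hour

	s, err := fetchSTH(logURL)
	if err != nil {
		fmt.Println("ALERT: get-sth failed:", err)
		return
	}
	age := time.Since(time.UnixMilli(int64(s.Timestamp)))
	if age > maxSTHAge {
		fmt.Printf("ALERT: STH is %s old (tree_size=%d); merges may be stalled\n",
			age.Round(time.Second), s.TreeSize)
	} else {
		fmt.Printf("OK: STH age %s, tree_size %d\n", age.Round(time.Second), s.TreeSize)
	}
}
```

A stale or non-advancing STH is not by itself an MMD violation, but it is cheap to check frequently and complements the full submit-and-verify probe.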
Improve documentation and training:
Priority: Medium
Owner: Site Reliability Engineering
Due Date: April 17, 2025
Status: In progress
Define clear response protocols (playbooks) for database performance alerts
Provide training on recognizing and responding to key alert conditions for both Engineering and Operations teams
Increase visibility of CT log alerts to product engineering teams
Train a minimum of 5 additional engineers
Establish clear ownership for CT Log application and system operations:
Priority: Medium
Owner: Product Management
Due Date: April 10, 2025
Status: In progress
Assign specific teams and individuals responsible for monitoring and responding to database performance issues, application errors, and violations of the MMD SLO.
Assign a PM as the primary person responsible for all communication to CT log stakeholders (e.g., the Chrome CT Log team and the CT Policy community) when issues occur.
Supporting Evidence & Attachments
New Relic / Splunk alert logs
Count of ERROR log events for Nessie 2025 over time
Time chart of requests for `add-chain` and `add-pre-chain` endpoints for Nessie 2025. The number of requests to the `add-chain` endpoint increased 3-fold around March 19th.
Nessie database node CPU metrics. Beginning around March 24th, CPU I/O wait increases, indicating that the throughput of the storage system has become saturated.
Time chart of requests for `add-chain` and `add-pre-chain` endpoints for Wyvern 2025. The number of requests to the `add-chain` endpoint increased 12-fold around March 19th.
Time chart of requests for `add-chain` and `add-pre-chain` endpoints for Sphinx 2025. The number of requests to the `add-chain` endpoint increased 9-fold around March 19th and then doubled again on March 23rd.
Wyvern 2025 database performance metrics. On March 21st, provisioned IOPs were increased from 6K to 12K and usage plateaued at 12K. On March 25th, provisioned IOPs were increased from 12K to 18K, but usage remained at the 12K plateau, indicating the instance type is not capable of more throughput.