
Availability drop in Wyvern 2024h2


Chuck Blevins

Oct 18, 2024, 12:06:06 PM
to Certificate Transparency Policy
Issue: On Sept 23, Google's monitoring infrastructure detected a steep availability drop in DigiCert's Wyvern 2024h2 log and reported it to us.

Here's DigiCert's post-mortem:

Description

Certificate Transparency logs allow website owners to track all publicly issued certificates for their domains. DigiCert, like most CT log operators, regularly creates new CT log shards to keep any given shard at a reasonable size. Starting in 2024, DigiCert created new CT log shards with the name prefixes wyvern and sphinx using the Google Trillian code base.

On September 19th GMT, the shard wyvern2024h2 logged several large but brief spikes of errors related to database connections: "context cancelled", "context deadline exceeded", and "too many connections". These spikes were followed by a period with the persistent error "blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'". With connections from the application pods to the database blocked, CT requests failed. After one of these spikes and runs of "blocked" errors, Kubernetes attempted to restart the application pods, which cyclically failed because the host was still blocked from connecting to the database. Several days later, on September 23rd GMT, the pods successfully restarted without human intervention, although there was another brief period of context and connection errors about 15 minutes after the main recovery.

The Chrome CT Team sent an email to inform DigiCert's CT team that it had detected a steep drop in CT availability to 97.4%. The ask from Google was as follows: "Could you please investigate and post to ct-policy@ with a description of what happened, path to fix, timing, etc. when you can?"

Customer Impact
In general, Certificate Authorities submit requests to multiple CT logs but only need responses from 2 (depending on the type of certificate). A single CT log shard having issues should not affect issuance unless a CA had chosen to send requests to only 2 CT log servers and wyvern2024h2 was one of them. As far as we can tell, there are no such CAs, so there was no impact.

Google regularly monitors the health of all CT log shards and notified us that availability was dropping so that we could improve our CT service. The new CT logs are not yet officially in production; Google was monitoring the new CT log shards to decide whether to start sending requests to the new CT log infrastructure.
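
For context, availability monitoring of this kind can be done entirely from the outside via a log's public RFC 6962 API. The sketch below is an illustrative, minimal probe against a log's get-sth endpoint; the log URL is a placeholder rather than the official Wyvern endpoint, and this is not a description of Google's actual monitoring pipeline.

```go
// A minimal availability probe against a CT log's RFC 6962 get-sth endpoint.
// The log URL is an illustrative placeholder, not the official Wyvern endpoint.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// signedTreeHead holds the subset of the get-sth response we check.
type signedTreeHead struct {
	TreeSize  uint64 `json:"tree_size"`
	Timestamp uint64 `json:"timestamp"`
}

func main() {
	const logURL = "https://ct.example.com/wyvern2024h2" // placeholder
	client := &http.Client{Timeout: 10 * time.Second}

	resp, err := client.Get(logURL + "/ct/v1/get-sth")
	if err != nil {
		log.Fatalf("probe failed: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status: %s", resp.Status)
	}

	var sth signedTreeHead
	if err := json.NewDecoder(resp.Body).Decode(&sth); err != nil {
		log.Fatalf("decode: %v", err)
	}
	fmt.Printf("log reachable: tree_size=%d timestamp=%d\n", sth.TreeSize, sth.Timestamp)
}
```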


Error from the two wyvern2024h2 pods
F0923 05:13:31.902869       1 main.go:118] Failed to get storage provider: Error 1129: Host '10.255.183.231' is blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'
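
For background on Error 1129: MariaDB blocks a client host once it accumulates max_connect_errors interrupted connection attempts without a successful handshake, and the block persists until the host cache is flushed (for example with mariadb-admin flush-hosts, or the equivalent FLUSH HOSTS statement). Below is a minimal, illustrative sketch of inspecting the threshold and clearing the block from Go; the admin DSN is a placeholder and this is not the remediation DigiCert actually ran.

```go
// A minimal sketch of inspecting max_connect_errors and clearing a host block,
// equivalent to running `mariadb-admin flush-hosts`. The DSN is a placeholder.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MariaDB-compatible driver
)

func main() {
	db, err := sql.Open("mysql", "admin:password@tcp(db.example.internal:3306)/")
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()

	// The threshold of failed attempts that triggers Error 1129.
	var name, value string
	if err := db.QueryRow("SHOW VARIABLES LIKE 'max_connect_errors'").Scan(&name, &value); err != nil {
		log.Fatalf("query: %v", err)
	}
	fmt.Printf("%s = %s\n", name, value)

	// Clear the host cache so blocked application hosts can reconnect.
	if _, err := db.Exec("FLUSH HOSTS"); err != nil {
		log.Fatalf("flush hosts: %v", err)
	}
}
```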


Incident timeline

September 19th GMT

01:55 wyvern2024h2 logserver pod 775cl has spike of 60,000+ error messages in under 30 seconds ***** first occurrence starts, incident starts

01:55 wyvern2024h2 ctfe pods dc9tn and qp7hv replicate many of the error messages

01:56 wyvern2024h2 logserver pod 775cl starts consistently logging (only) "Host '10.255.183.231' is blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'"

01:55 wyvern2024h2 ctfe pods dc9tn and qp7hv replicate the "Host '10.255.183.231' is blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'" messages

02:43 wyvern2024h2 logserver pod 775cl stops logging constant "block" messages, shuts down application

02:43 wyvern2024h2 ctfe pods dc9tn and qp7hv stop logging "block" messages, presumably stop sending requests to failed pod***** end of first occurrence

02:43 wyvern2024h2 logserver pod 775cl begins restart-fail-restart cycle (repeating "block" error once each cycle) about every 5 minutes

10:21 wyvern2024h2 logserver pod jpzbb has spike of 140,000+ error messages in about 30 seconds 

10:21 wyvern2024h2 ctfe pods dc9tn and qp7hv replicate many of the error messages

10:30 wyvern2024h2 logserver pod jpzbb has spike of 20,000+ error messages in about 15 seconds ***** second occurrence starts

10:30 wyvern2024h2 ctfe pods dc9tn and qp7hv replicate many of the error messages

10:31 wyvern2024h2 logserver pod jpzbb (30 seconds after previous error spike ends) starts logging (only)  "Host '10.255.56.35' is blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'"

10:31 wyvern2024h2 ctfe pods dc9tn and qp7hv start replicating the "block" messages

10:34 wyvern2024h2 logserver pod jpzbb stops logging constant "block" messages, shuts down application, begins restart-fail-restart cycle

10:34 wyvern2024h2 ctfe pods dc9tn and qp7hv start logging (only) "17.20.20.163:8090 connection refused" messages

[For several days both logserver pods continue the 5-minute restart-fail-restart cycle; both ctfe pods continue to log "17.20.20.163:8090 connection refused" messages]


September 23rd GMT

05:16 wyvern2024h2 logserver pod jpzbb successfully restarts (no known intervention to clear DB block)

05:17 wyvern2024h2 ctfe pod dc9tn stops logging "connection refused" message

05:18 wyvern2024h2 ctfe pod qp7hv stops logging "connection refused" message ***** end of second occurrence

05:18 wyvern2024h2 logserver pod 775cl successfully restarts (no known intervention to clear DB block)

05:34 wyvern2024h2 ctfe pods dc9tn and qp7hv start logging "context deadline exceeded" messages ***** start third occurrence

05:34 wyvern2024h2 logserver pod jpzbb starts logging hundreds of errors per second (with some gaps), again context cancelled or rolled back and too many connections

05:36 wyvern2024h2 ctfe pods dc9tn and qp7hv start logging "too many connections" messages

05:40 wyvern2024h2 ctfe pods dc9tn and qp7hv last "too many connections" messages

05:41 wyvern2024h2 ctfe pods dc9tn and qp7hv last "context deadline exceeded" messages

05:41 wyvern2024h2 logserver pod jpzbb stops logging errors ***** end of third occurrence, <--------end of incident


17:51 GMT CTO created a Slack group ctlog-wyvern-issues and notified the channel that the Chrome team had detected an availability drop in the Wyvern 2024h2 CT log server to 97.4%.  <----------------- Detect time was after the incident resolved.


17:59 GMT SRE-Lehi team was added to the ctlog-wyvern-issues Slack channel

18:01 GMT SRE-MTV was added to the Slack channel and started to look into the issue

18:32 GMT The 24x7 team was notified of the issue and asked to write an incident report


Root cause

It appears that there is something (currently unknown; it could be a specific customer request, a race condition, or something else) that can cause a logserver pod to open enough connections to the database to trigger the database blocking that entire host from connecting until the block is reset via a MariaDB admin command. Whatever the trigger is, it occurred three times over the course of the incident (twice on September 19 and once on September 23) and left the shard wyvern2024h2 in an unusable state. None of the existing monitors detected or alerted on the problem, so DigiCert was notified of the problem by the Chrome team (which is evaluating the system).
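
At the application layer, a common guardrail against this failure mode is to cap the size of the Go database/sql connection pool so that a single pod cannot open enough connections to exhaust the database or trip the per-host error threshold. The sketch below shows the generic mechanism with illustrative values; it is not DigiCert's or Trillian's actual configuration.

```go
// A minimal sketch of bounding the database/sql connection pool in Go.
// The DSN and limits are illustrative, not DigiCert's or Trillian's settings.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // MariaDB-compatible driver
)

// openBoundedDB opens a pool that cannot grow past a fixed number of
// concurrent connections, so one misbehaving pod cannot flood the database.
func openBoundedDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(50)                 // hard cap on concurrent connections per pod
	db.SetMaxIdleConns(10)                 // keep a small idle pool
	db.SetConnMaxLifetime(5 * time.Minute) // recycle connections periodically
	return db, nil
}

func main() {
	db, err := openBoundedDB("ctlog:password@tcp(db.example.internal:3306)/trillian") // placeholder DSN
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	defer db.Close()
}
```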


Mitigations

  • Monitoring Enhancements
    • New Relic alert for missing pods
    • New Relic monitor for high CT log latency
    • Splunk monitor identifying unexpected log entries from the new CT log codebase
    • DB monitoring: New Relic integration and new alerts for MySQL
  • New general process for service turnover to educate and increase awareness (SREMV1-12675)
  • Remediation Plan
    • DB team to flush the error cache if this specific issue reoccurs (DB-KB article)
  • Preventative steps to prevent recurrence
    • Enhance logging to show successes (or metrics that include successes), not just errors, to help correlate customer traffic with application behavior (see the metrics sketch after this list)
    • Investigate the Google Trillian CT log code base to see whether a variable number of database connections, potentially exceeding configured thresholds, is a known weakness
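
To illustrate the "metrics including successes" item above, here is a minimal sketch using the Prometheus Go client purely as a stand-in stack (not necessarily what DigiCert or the Trillian CTFE uses): a counter labelled by endpoint and outcome makes it possible to correlate request volume with error rate. The metric and endpoint names are illustrative.

```go
// A minimal sketch of success/failure metrics for CT front-end requests,
// using the Prometheus Go client purely as an example stack.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ctRequests counts requests by endpoint and outcome so success volume,
// not just errors, is visible on a dashboard.
var ctRequests = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "ctfe_requests_total", // illustrative metric name
	Help: "CT front-end requests by endpoint and outcome.",
}, []string{"endpoint", "outcome"})

// recordOutcome would be called by each handler (add-chain, get-sth, ...).
func recordOutcome(endpoint string, err error) {
	outcome := "success"
	if err != nil {
		outcome = "error"
	}
	ctRequests.WithLabelValues(endpoint, outcome).Inc()
}

func main() {
	recordOutcome("get-sth", nil) // example call
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```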

Philippe Boneff

Oct 22, 2024, 7:04:25 AM
to Chuck Blevins, Certificate Transparency Policy
Thanks for sharing the write-up, Chuck - very good postmortem!

Two notes about the mitigation section:
 - Did the errors trickle down into RPC server errors on the Logserver side? I wonder whether this could be detected with RPC server error monitoring. 
 - `Investigate Google Trillian CTLog code base support to see if variable number of database connections, potentially exceeding configured thresholds, is a known weakness`: you might find this issue relevant. Trillian has a flag to limit the number of SQL connections to prevent this from happening. Are you using this flag already, or did this issue happen despite the use of this flag?

Cheers,
Philippe


Chuck Blevins

Oct 22, 2024, 2:01:01 PM
to Philippe Boneff, Certificate Transparency Policy
Thanks for the feedback, Philippe.
We'll look into this. 

Cheers
Chuck
