Description
Certificate Transparency logs allow website owners to track all publicly issued certificates for their domains. DigiCert, like most CT log operators, regularly creates new CT log shards to keep any given shard at a reasonable size. Starting in 2024, DigiCert created new CT log shards with the name prefixes wyvern and sphinx using the Google Trillian code base.
On September 19th GMT, the shard wyvern2024h2 logged several large but brief spikes of errors related to database connections - "context canceled", "context deadline exceeded", and "too many connections". These spikes were followed by a period with the persistent error "blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'". With connections from the application pods to the database blocked, CT requests failed. After one of these spikes and runs of "blocked" errors, Kubernetes attempted to restart the application pods, which failed cyclically because the host was still blocked from connecting to the database. Several days later, on September 23rd GMT, the pods restarted successfully without human intervention, although there was another brief period of context and connection errors about 15 minutes after the main recovery.
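For reference, those messages correspond to three distinct failure modes: Go context cancellation and timeouts on the client side, MariaDB refusing new sessions (error 1040, "Too many connections"), and MariaDB blocking the entire host after repeated aborted connection attempts (error 1129, the "flush-hosts" message, which persists until the host cache is flushed). Below is a minimal sketch of a probe that tells these apart, assuming the github.com/go-sql-driver/mysql driver used by Trillian's MySQL storage; the DSN and timeout are placeholders, not values from this deployment.

package main

import (
    "context"
    "database/sql"
    "errors"
    "log"
    "time"

    "github.com/go-sql-driver/mysql"
)

// classify maps an error from a database call to the categories seen in this incident.
func classify(err error) string {
    switch {
    case errors.Is(err, context.Canceled):
        return "context canceled"
    case errors.Is(err, context.DeadlineExceeded):
        return "context deadline exceeded"
    }
    var myErr *mysql.MySQLError
    if errors.As(err, &myErr) {
        switch myErr.Number {
        case 1040: // ER_CON_COUNT_ERROR: "Too many connections"
            return "too many connections"
        case 1129: // ER_HOST_IS_BLOCKED: cleared only by flush-hosts
            return "host blocked by MariaDB"
        }
    }
    return "other: " + err.Error()
}

func main() {
    // Placeholder DSN; not the real wyvern2024h2 configuration.
    db, err := sql.Open("mysql", "ctfe:secret@tcp(db:3306)/trillian")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    if err := db.PingContext(ctx); err != nil {
        log.Printf("probe failed: %s", classify(err))
    } else {
        log.Print("probe ok")
    }
}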
The Chrome CT Team sent an email informing DigiCert's CT team that it had detected a steep drop in the log's availability, to 97.4%. The ask from Google was as follows: "Could you please investigate and post to ct-policy@ with a description of what happened, path to fix, timing, etc. when you can?"
Customer Impact
In general, Certificate Authorities submit requests to multiple CT logs but only need responses from 2 (depending on the type of certificate). A single CT log shard having issues should not affect any issuance unless a CA had chosen to send requests to only 2 CT log servers and wyvern2024h2 was one of them. As far as we can tell, there are no such CAs, so there was no impact.
Google regularly monitors the health of all CT log shards, and it notified us of the falling availability so we could improve our CT service. The new CT log shards are not yet officially in production; Google was monitoring them to decide whether to start sending requests to the new CT log infrastructure.
Error from the two wyvern2024h2 pods
F0923 05:13:31.902869 1 main.go:118] Failed to get storage provider: Error 1129: Host '10.255.183.231' is blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'
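If the block ever needs to be cleared from inside the cluster rather than from a database shell, the same effect as 'mariadb-admin flush-hosts' can be achieved through the driver with a FLUSH HOSTS statement. This is a hypothetical sketch only; the DSN is a placeholder and the account used needs the RELOAD privilege.

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // Placeholder DSN; not the real wyvern2024h2 configuration.
    db, err := sql.Open("mysql", "admin:secret@tcp(db:3306)/")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Same effect as `mariadb-admin flush-hosts`: empties the host cache,
    // lifting the Error 1129 block for all previously blocked hosts.
    if _, err := db.Exec("FLUSH HOSTS"); err != nil {
        log.Fatalf("FLUSH HOSTS failed: %v", err)
    }
    log.Print("host cache flushed")
}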
Incident timeline
September 19th GMT
01:55 wyvern2024h2 logserver pod 775cl has spike of 60,000+ error messages in under 30 seconds ***** first occurrence starts, incident starts
01:55 wyvern2024h2 ctfe pods dc9tn and qp7hv replicate many of the error messages
01:56 wyvern2024h2 logserver pod 775cl starts consistently logging (only) "Host '10.255.183.231' is blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'"
01:55 wyvern2024h2 ctfe pods dc9tn and qp7hv replicate the "Host '10.255.183.231' is blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'" messages
02:43 wyvern2024h2 logserver pod 775cl stops logging constant "block" messages, shuts down application
02:43 wyvern2024h2 ctfe pods dc9tn and qp7hv stop logging "block" messages, presumably stop sending requests to failed pod***** end of first occurrence
02:43 wyvern2024h2 logserver pod 775cl begins restart-fail-restart cycle (repeating "block" error once each cycle) about every 5 minutes
10:21 wyvern2024h2 logserver pod jpzbb has spike of 140,000+ error messages in about 30 seconds
10:21 wyvern2024h2 ctfe pods dc9tn and qp7hv replicate many of the error messages
10:30 wyvern2024h2 logserver pod jpzbb has spike of 20,000+ error messages in about 15 seconds ***** second occurrence starts
10:30 wyvern2024h2 ctfe pods dc9tn and qp7hv replicate many of the error messages
10:31 wyvern2024h2 logserver pod jpzbb (30 seconds after previous error spike ends) starts logging (only) "Host '10.255.56.35' is blocked because of many connection errors; unblock with 'mariadb-admin flush-hosts'"
10:31 wyvern2024h2 ctfe pods dc9tn and qp7hv start replicating the "block" messages
10:34 wyvern2024h2 logserver pod jpzbb stops logging constant "block" messages, shuts down application, begins restart-fail-restart cycle
10:34 wyvern2024h2 ctfe pods dc9tn and qp7hv start logging (only) "17.20.20.163:8090 connection refused" messages
[For several days both logserver pods continue the 5-minute restart-fail-restart cycle; both ctfe pods continue to log "17.20.20.163:8090 connection refused" messages]
September 23rd GMT
05:16 wyvern2024h2 logserver pod jpzbb successfully restarts (no known intervention to clear DB block)
05:17 wyvern2024h2 ctfe pod dc9tn stops logging "connection refused" message
05:18 wyvern2024h2 ctfe pod qp7hv stops logging "connection refused" message ***** end of second occurrence
05:18 wyvern2024h2 logserver pod 775cl successfully restarts (no known intervention to clear DB block)
05:34 wyvern2024h2 ctfe pods dc9tn and qp7hv start logging "context deadline exceeded" messages ***** start third occurrence
05:34 wyvern2024h2 logserver pod jpzbb starts logging hundreds of errors per second (with some gaps), again "context canceled" / "rolled back" and "too many connections"
05:36 wyvern2024h2 ctfe pods dc9tn and qp7hv start logging "too many connections" messages
05:40 wyvern2024h2 ctfe pods dc9tn and qp7hv last "too many connections" messages
05:41 wyvern2024h2 ctfe pods dc9tn and qp7hv last "context deadline exceeded" messages
05:41 wyvern2024h2 logserver pod jpzbb stops logging errors ***** end of third occurrence, <--------end of incident
17:51 GMT CTO created a slack group ctlog-wyvern-issues and notified the channel that the Chrome team had detected an availability drop in the Wyvern 2024h2 CT Log server to 97.4%. <----------------- Detect time was after the incident was resolved.
17:59 GMT SRE-Lehi team was added to the ctlog-wyvern-issues slack channel
18:01 GMT SRE-MTV was added to the slack channel and started to look into the issue
18:32 GMT The 24x7 team was notified of the issue and asked to write an incident report
Root cause
It appears that something (currently unknown - it could be a specific customer request, a race condition, or something else) can cause a logserver pod to open enough connections to the database to make the database block that entire host from connecting until the block is reset via a MariaDB admin command. Whatever the trigger is, it fired three times (twice on September 19 and once on September 23) and left the shard wyvern2024h2 in an unusable state. None of the existing monitors detected or alerted on the problem, so DigiCert was instead notified of the problem by the Chrome team (which is evaluating the system).
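One candidate hardening step, not yet confirmed as the fix, is to bound the connection pool each pod is allowed to open, so that whatever the trigger turns out to be, a single pod cannot exhaust the database's connection limit or rack up enough aborted connections to trip the host block. Below is a sketch using Go's standard database/sql pool controls; the limits are illustrative rather than tuned values for this deployment, and if Trillian already exposes equivalent pool flags, setting those would be the more direct route.

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/go-sql-driver/mysql"
)

// openLimited opens a MySQL/MariaDB handle with a bounded pool so one process
// cannot monopolize the server's max_connections. Limits shown are illustrative.
func openLimited(dsn string) (*sql.DB, error) {
    db, err := sql.Open("mysql", dsn)
    if err != nil {
        return nil, err
    }
    db.SetMaxOpenConns(50)                 // hard per-pod ceiling
    db.SetMaxIdleConns(10)                 // keep a small warm pool
    db.SetConnMaxLifetime(5 * time.Minute) // recycle before server-side timeouts
    db.SetConnMaxIdleTime(time.Minute)     // close idle connections promptly
    return db, nil
}

func main() {
    db, err := openLimited("trillian:secret@tcp(db:3306)/trillian") // placeholder DSN
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    log.Print("pool limits applied")
}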
Mitigations