Wyvern 2025h2 and Sphinx 2025h2 get-entries outage

Andrew Ayer

Sep 8, 2025, 3:40:37 PM
to ct...@digicert.com, Certificate Transparency Policy
For at least the last 24 hours, most get-entries calls to Wyvern 2025h2 and Sphinx 2025h2 have been failing with "503 Service Unavailable" errors. Unfortunately, the logs continue to accept certificate submissions, meaning that tens of millions of SCTs have been issued for certificates that monitors can't access. Some of these SCTs were issued over 24 hours ago. Can the logs' write APIs be disabled as soon as possible, per the guidance issued to DigiCert earlier this year (https://groups.google.com/a/chromium.org/g/ct-policy/c/aR6gKzCANVs/m/7qCOGa8XCAAJ)?
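For anyone reproducing this, a minimal RFC 6962 get-entries probe in Go looks roughly like the sketch below; the base URL is a placeholder, since the actual log endpoints aren't quoted here:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Hypothetical base URL; substitute the real log endpoint.
	base := "https://example-log.invalid/2025h2"

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(base + "/ct/v1/get-entries?start=0&end=255")
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()

	// Print the status and a short prefix of the body; a 503 here is the
	// failure mode monitors have been seeing.
	body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
	fmt.Printf("HTTP %d: %s\n", resp.StatusCode, body)
}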

Regards,
Andrew

Chuck Blevins DC

Sep 8, 2025, 3:44:11 PM
to Andrew Ayer, ct...@digicert.com, Certificate Transparency Policy
We are investigating, Andrew.
We will have a response shortly.


Rick Roos

Sep 8, 2025, 4:10:11 PM
to Andrew Ayer, ct...@digicert.com, Certificate Transparency Policy
Thanks, Andrew. We have turned off accepting new entries for these two logs and are investigating the issue.

Thanks,
Rick

Rick Roos

Sep 9, 2025, 1:42:07 PM
to Andrew Ayer, ct...@digicert.com, Certificate Transparency Policy
I wanted to give an update on the current status of these logs. This morning we re-enabled these logs to accept new entries. Our investigation into the outage revealed a 3-4x increase in calls to the get-entries endpoint, which caused the ctile pods to crash due to memory constraints. This indicates our rate limits were set too high for the infrastructure. We have since increased the memory allocated to the ctile pods and lowered our get-entries rate limit to 10 requests per second per IP (with a maximum of 256 entries per request). We will continue to work on expanding infrastructure resources and then re-evaluate whether we should raise the rate limit again.
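For monitor operators adjusting to these limits, a fetch loop that stays within roughly 10 requests per second and 256 entries per page might look like this Go sketch (the log URL and tree size are placeholders, not DigiCert's actual values):

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	base := "https://example-log.invalid/2025h2" // hypothetical endpoint
	var start, treeSize int64 = 0, 1_000_000     // assumed tree size for illustration

	ticker := time.NewTicker(100 * time.Millisecond) // 10 requests per second
	defer ticker.Stop()
	client := &http.Client{Timeout: 10 * time.Second}

	for start < treeSize {
		<-ticker.C
		end := start + 255 // at most 256 entries per request
		if end >= treeSize {
			end = treeSize - 1
		}
		url := fmt.Sprintf("%s/ct/v1/get-entries?start=%d&end=%d", base, start, end)
		resp, err := client.Get(url)
		if err != nil || resp.StatusCode != http.StatusOK {
			if resp != nil {
				resp.Body.Close()
			}
			time.Sleep(time.Second) // back off, then retry the same range
			continue
		}
		resp.Body.Close() // a real monitor would decode the JSON entries here
		start = end + 1
	}
}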

The other item we are addressing is why our monitoring alerts were not triggered. We run woodpecker internally for end-to-end testing, and we have monitoring on the log services themselves. Both of those feed into the same alerting system, which failed to trigger the notifications.
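As one way to avoid a single point of failure in alerting, a standalone out-of-band watchdog can probe get-entries directly and notify through a separate channel. The Go sketch below uses hypothetical URLs and is not a description of our actual setup:

package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// probe returns true only if the endpoint answers with HTTP 200.
func probe(url string) bool {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	logURL := "https://example-log.invalid/2025h2/ct/v1/get-entries?start=0&end=0"
	webhook := "https://alerts.example.invalid/hook" // hypothetical secondary alert channel

	for range time.Tick(5 * time.Minute) {
		if !probe(logURL) {
			// Alert via an independent path so a single alerting-system
			// failure can't suppress the notification.
			http.Post(webhook, "application/json",
				strings.NewReader(`{"text":"get-entries probe failed"}`))
			fmt.Println("alert sent")
		}
	}
}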

Thanks,
Rick