2015-10-06 Google Log Outage Incident Report


Adam Eijdenberg

Oct 20, 2015, 1:36:17 PM
to ct-p...@chromium.org
SUMMARY:
On Tuesday 6 October 2015, the ‘pilot’, ‘aviator’, ‘rocketeer’, and ‘testtube’ Certificate Transparency logs operated by Google experienced an outage lasting 5 hours and 20 minutes. If your service or application was affected, we apologize; this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT:
On Tuesday 6 October 2015, between 18:35 and 23:55 PDT, DNS lookups for ct.googleapis.com failed to resolve correctly, resulting in a 100% outage for clients attempting to connect to the service. The underlying logs themselves were not affected and no data was at risk.

ROOT CAUSE:
During preparation for the planned addition of a DNS interface for inclusion proof checking, a DNS configuration change was made to enable resolution of records below ct.googleapis.com. This change unintentionally affected DNS resolution for the existing ct.googleapis.com domain. Additionally, the automated monitoring that was in place for all four logs unexpectedly bypassed DNS resolution, so these probes did not detect the outage. The DNS change became effective at 18:28 and caused resolution to fail soon after.

REMEDIATION AND PREVENTION:
Google was notified by a user at 19:13 and the engineering team began their investigation shortly afterwards. The root cause was identified, and a fix was prepared and tested; the roll-out completed by 23:55, when normal operation was restored.

To prevent similar incidents in future, we have changed our automated monitoring to ensure that a more complete path (including DNS resolution) is tested during probes.
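
As an illustration of what testing "a more complete path" can mean, here is a minimal sketch of a probe that resolves the hostname through DNS before connecting, and reports which stage failed. It is not Google's actual monitoring; the function names (resolve_host, tcp_connect, probe), the port, and the timeout are all assumptions made for this example.

```python
import socket
from typing import Callable, List


def resolve_host(hostname: str, port: int = 443) -> List[str]:
    """Resolve the hostname with the system resolver and return its addresses."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def tcp_connect(address: str, port: int = 443, timeout: float = 5.0) -> None:
    """Open and immediately close a TCP connection to the resolved address."""
    with socket.create_connection((address, port), timeout=timeout):
        pass


def probe(
    hostname: str,
    resolver: Callable[[str], List[str]] = resolve_host,
    connect: Callable[[str], None] = tcp_connect,
) -> dict:
    """Probe a service by name so that a DNS failure is reported as a failure.

    A probe that connects to a hard-coded IP address skips the first step
    and would keep reporting success while name resolution is broken.
    """
    try:
        addresses = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return {"ok": False, "stage": "dns", "detail": str(exc)}
    if not addresses:
        return {"ok": False, "stage": "dns", "detail": "no addresses returned"}

    try:
        connect(addresses[0])
    except OSError as exc:
        return {"ok": False, "stage": "connect", "detail": str(exc)}

    return {"ok": True, "stage": "done", "detail": addresses[0]}
```

Calling probe("ct.googleapis.com") would exercise the real resolver and network. The resolver and connect parameters are injectable only so the logic can be checked without network access; a production probe would presumably go on to issue a real request against the log as well.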

Ryan Sleevi

Oct 20, 2015, 2:28:29 PM
to Adam Eijdenberg, ct-p...@chromium.org
Thanks for posting this, Adam.

I would encourage other log operators to consider such reports in the
future as well, as part of providing transparency but also helping to
understand the growing pains in the ecosystem, so that we can make
sure things are more robust.