SUMMARY:
On Tuesday 6 October 2015, the ‘pilot’, ‘aviator’, ‘rocketeer’, and ‘testtube’ Certificate Transparency logs operated by Google experienced an outage for a duration of 5 hours and 20 minutes. If your service or application was affected, we apologize; this is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.
DETAILED DESCRIPTION OF IMPACT:
On Tuesday 6th of October 2015 between 18:35 and 23:55 PDT, DNS lookups for
ct.googleapis.com failed to correctly resolve resulting in 100% outage for clients attempting to connect to the service. The underlying logs themselves were not affected and no data was at risk.
ROOT CAUSE:
During preparation for the planned addition of a DNS interface for inclusion proof checking a DNS configuration change was made to enable resolution of records below
ct.googleapis.com. This change unintentionally affected DNS resolution for the existing
ct.googleapis.com domain. Additionally, automated monitoring that was in place for all 4 logs unexpectedly bypassed DNS resolution such that these probes did not detect service outage. The DNS change became effective at 18:28 and caused resolution to fail soon after.
REMEDIATION AND PREVENTION:
Google was notified by a user at 19:13 and the engineering team began their investigation shortly afterwards. The root cause was identified, a fix was prepared, tested, and the roll-out completed by 23:55 when normal operation was restored.
To prevent similar incidents in future, we have changed our automated monitoring to ensure that a more complete path (including DNS resolution) is tested during probes.