SUMMARY:
On Saturday 4 October 2014, some Google App Engine applications experienced elevated errors and latency for a period of 3 hours and 4 minutes. We apologize if your application was affected. We know that you depend on Google to provide a reliable service. We are taking steps to prevent a recurrence of this type of incident, and we are crediting all applications in cases where we did not meet the terms of the App Engine Service Level Agreement.
DETAILED DESCRIPTION OF IMPACT:
On Saturday 4 October 2014 from 09:27 to 12:31 PDT, 8.4% of App Engine applications in US datacenters served HTTP 500 errors at a rate higher than 10%. A further 15% of applications experienced error rates below 10%. For successful requests, median latency for affected applications was not significantly higher, but latency at the 90th percentile increased by a factor of 2.2. 80% of deployments of new versions failed during the incident period. App Engine’s auto-scaling was also impaired during the incident.
ROOT CAUSE:
The incident occurred after one application made a configuration change that resulted in an invalid state. This state was saved to a storage system used by US applications to store their configuration. Some processes in App Engine’s serving infrastructure crashed with segmentation faults when they tried to read the configuration for this application. Any application that was scheduled to run on a segfaulting server experienced elevated errors and latency.
REMEDIATION AND PREVENTION:
The invalid state was written to storage at 09:27. Our engineers received an automated alert at 09:41, indicating that some processes in one datacenter were failing. We identified the root cause of the problem at 10:35. However, the process that we normally use to write configuration to storage was itself crashing. We therefore had to find an alternative means to repair the configuration. This was completed at 12:14. The new configuration was replicated to all datacenters and the system was stabilized at 12:31.
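As a cross-check on the timeline above, the elapsed time at each milestone can be computed from the stated timestamps. The following is a minimal sketch; the event labels and the function name `minutes_since_start` are illustrative and not part of any App Engine tooling.

```python
from datetime import datetime

# Timestamps (PDT, 4 October 2014) as stated in this report.
# The labels are paraphrases chosen for illustration.
EVENTS = [
    ("invalid state written to storage", "09:27"),
    ("automated alert received", "09:41"),
    ("root cause identified", "10:35"),
    ("configuration repaired", "12:14"),
    ("system stabilized", "12:31"),
]


def minutes_since_start(events):
    """Return (label, minutes elapsed since the first event) for each event."""
    start = datetime.strptime(events[0][1], "%H:%M")
    result = []
    for label, clock in events:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


if __name__ == "__main__":
    for label, minutes in minutes_since_start(EVENTS):
        print(f"+{minutes // 60}h{minutes % 60:02d}m  {label}")
```

The final entry comes to 184 minutes, which agrees with the 3 hours and 4 minutes given in the summary.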
To prevent recurrence, we will change our code to ensure that an invalid application configuration cannot affect other applications. In addition, we will improve our tools, so that we can repair invalid configurations more quickly.
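One general way to keep an invalid configuration from affecting other applications is to validate each application's configuration independently and set aside any that fail, rather than letting the reading process crash. The sketch below illustrates that idea only. It does not describe App Engine's actual code or configuration format, and the names `AppConfig`, `parse_config`, `load_all`, and the `max_instances` field are all hypothetical.

```python
from dataclasses import dataclass


class InvalidConfigError(ValueError):
    """Raised when one application's configuration fails validation."""


@dataclass(frozen=True)
class AppConfig:
    # Hypothetical fields, chosen only for illustration.
    app_id: str
    max_instances: int


def parse_config(raw):
    """Validate one raw configuration record and return an AppConfig."""
    if not isinstance(raw, dict):
        raise InvalidConfigError("configuration is not a mapping")
    app_id = raw.get("app_id")
    max_instances = raw.get("max_instances")
    if not isinstance(app_id, str) or not app_id:
        raise InvalidConfigError("missing or empty app_id")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (isinstance(max_instances, bool)
            or not isinstance(max_instances, int)
            or max_instances < 1):
        raise InvalidConfigError(f"invalid max_instances for {app_id!r}")
    return AppConfig(app_id=app_id, max_instances=max_instances)


def load_all(raw_configs):
    """Load every valid configuration; quarantine invalid ones.

    Returns (loaded, quarantined). A bad record is reported by its index
    and skipped, so it cannot prevent the other records from loading.
    """
    loaded = {}
    quarantined = []
    for index, raw in enumerate(raw_configs):
        try:
            config = parse_config(raw)
        except InvalidConfigError as exc:
            quarantined.append((index, str(exc)))
            continue
        loaded[config.app_id] = config
    return loaded, quarantined
```

In this sketch, a record with a negative `max_instances` ends up in `quarantined` while the remaining applications load normally. The quarantine list also gives operators a direct pointer to the records that need repair.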