App Engine issues on Saturday 4 October, 2014

Google App Engine Downtime Notify

Oct 4, 2014, 1:42:58 PM

to google-appengine...@googlegroups.com

We are investigating an issue with Google App Engine beginning at Saturday 2014-10-04 10:00 (all times are in US/Pacific). We will provide more information within 30 minutes.

Google App Engine Downtime Notify

Oct 4, 2014, 2:07:08 PM

to google-appengine...@googlegroups.com, google-appengine...@googlegroups.com

We are currently experiencing an issue with Google App Engine serving and some applications are experiencing elevated errors and latency. For everyone who is affected, we apologize - we know you count on Google to work for you and we're working hard to restore normal operation. We will provide an update by Saturday, 2014-10-04 10:45 (all times are in US/Pacific) with current details, and if available an estimated time for resolution.

Google App Engine Downtime Notify

Oct 4, 2014, 2:47:04 PM

to google-appengine...@googlegroups.com

We continue to experience an issue with Google App Engine serving and some applications are experiencing elevated errors and latency. We will provide an update by Saturday, 2014-10-04 12:30 (US/Pacific) with current details, and if available an estimated time for resolution.

John Lowry

Oct 4, 2014, 3:29:08 PM

to google-appengine...@googlegroups.com, google-appengine...@googlegroups.com

We continue to experience the issue with Google App Engine serving and some applications are experiencing elevated errors and latency. We are working hard to restore normal operation. We will provide an update by Saturday, 2014-10-04 13:15 (US/Pacific) with current details.

Google App Engine Downtime Notify

Oct 4, 2014, 4:32:00 PM

to google-appengine...@googlegroups.com

The problem with Google App Engine serving was resolved as of Saturday 2014-10-04 13:15 (US/Pacific). We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Google App Engine Downtime Notify

Oct 7, 2014, 1:30:03 PM

to google-appengine...@googlegroups.com, google-appengine...@googlegroups.com

SUMMARY:

On Saturday 4 October 2014, some Google App Engine applications experienced elevated errors and latency for a period of 3 hours and 4 minutes. We apologize if your application was affected. We know that you depend on Google to provide a reliable service. We are taking steps to prevent a recurrence of this type of incident, and we are crediting all applications in cases where we did not meet the terms of the App Engine Service Level Agreement.

DETAILED DESCRIPTION OF IMPACT:

On Saturday 4 October 2014 from 09:27 to 12:31 PDT, 8.4% of App Engine applications in US datacenters served HTTP 500 errors at a rate higher than 10%. A further 15% of applications experienced elevated error rates that were lower than 10%. For successful requests to affected applications, median latency was not significantly higher, but latency at the 90th percentile increased by a factor of 2.2. 80% of deployments of new versions failed during the incident period. App Engine’s auto-scaling was impaired during the incident.

ROOT CAUSE:

The incident occurred after one application made a configuration change that resulted in an invalid state. This state was saved to a storage system used by US applications to store their configuration. Some processes in App Engine’s serving infrastructure received segmentation faults when they tried to read the configuration for this application. Any application that was scheduled to run on a segfaulting server experienced elevated errors and latency.

REMEDIATION AND PREVENTION:

The invalid state was written to storage at 09:27. Our engineers received an automated alert at 09:41, indicating that some processes in one datacenter were failing. We identified the root cause of the problem at 10:35. However, the process that we normally use to write configuration to storage was itself crashing. We therefore had to find an alternative means to repair the configuration. This was completed at 12:14. The new configuration was replicated to all datacenters and the system was stabilized at 12:31.
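As a cross-check, the timeline above can be broken into phases with a few lines of Python. This is only an illustrative sketch: the timestamps are the ones stated in this report, while the phase labels and the helper names (`EVENTS`, `minutes_between`, `phase_durations`) are our own and not part of any App Engine tooling.

```python
from datetime import datetime

# Timestamps as stated in the report (US/Pacific, 4 October 2014).
# The labels are informal summaries chosen for this sketch.
EVENTS = [
    ("invalid state written", "09:27"),
    ("automated alert received", "09:41"),
    ("root cause identified", "10:35"),
    ("configuration repaired", "12:14"),
    ("system stabilized", "12:31"),
]


def minutes_between(start, end):
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(events):
    """Pair each event with the next one and measure the gap in minutes."""
    return [
        (f"{a[0]} -> {b[0]}", minutes_between(a[1], b[1]))
        for a, b in zip(events, events[1:])
    ]


total = minutes_between(EVENTS[0][1], EVENTS[-1][1])

for label, minutes in phase_durations(EVENTS):
    print(f"{minutes:4d} min  {label}")
print(f"{total:4d} min  total ({total // 60} h {total % 60} min)")
```

The phases come to 14, 54, 99 and 17 minutes, which sum to the 3 hours and 4 minutes given in the summary. The longest phase is the repair itself, after the root cause was already known.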

To prevent recurrence, we will change our code to ensure that an invalid application configuration cannot affect other applications. In addition, we will improve our tools, so that we can repair invalid configurations more quickly.
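For readers running their own multi-tenant services, the general idea of containing a bad configuration can be sketched as follows. This is a hypothetical illustration, not a description of App Engine's internals or of the fix we will deploy: the `max_instances` field, `validate_config` and `load_configs` are invented for the example. The point is only that each application's configuration is validated on its own, so one invalid entry is rejected without stopping the others from loading.

```python
def validate_config(config):
    """Raise ValueError if a (hypothetical) per-application config is malformed."""
    if not isinstance(config, dict):
        raise ValueError("config must be a mapping")
    instances = config.get("max_instances")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(instances, bool) or not isinstance(instances, int) or instances < 1:
        raise ValueError("max_instances must be a positive integer")


def load_configs(raw_configs):
    """Validate each application's config independently.

    Returns (valid, rejected): configs that passed validation, and a map
    from application id to the reason its config was rejected.
    """
    valid, rejected = {}, {}
    for app_id, config in raw_configs.items():
        try:
            validate_config(config)
        except ValueError as exc:
            # Contain the failure to this one application.
            rejected[app_id] = str(exc)
        else:
            valid[app_id] = config
    return valid, rejected


valid, rejected = load_configs({
    "app-a": {"max_instances": 3},
    "app-b": {"max_instances": -1},  # invalid value
    "app-c": None,                   # invalid shape
})
print(sorted(valid), sorted(rejected))
```

In this sketch the two invalid entries end up in `rejected` with a reason attached, while `app-a` loads normally.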
