On June 11, 2013, from 01:40 AM to 02:01 AM and from 06:11 AM to 06:16 AM US/Pacific time, Google App Engine applications running in one of our US data centers experienced an elevated rate of errors due to crashes in the processes that run applications.
This issue was caused by a global rollout of a new memcache server, which started on June 10. In one US data center, an unexpected configuration caused application servers in that data center to crash when the rollout completed around 01:40 AM. The issue caused an elevated rate of 500 responses to HTTP requests, and the average error rate for affected applications peaked at 40%. At 02:10 AM, we redirected traffic from the affected data center to a different data center to restore normal serving for affected applications. Our engineering team attempted to correct the configuration and returned traffic to the original data center at 06:11 AM. The fix was faulty, causing the same problem to recur, so we redirected traffic away from the affected data center again at 06:16 AM. Our engineering team then corrected the configuration and returned traffic to the original data center at 07:13 AM.
In response to this incident, we have improved the memcache client code in our application servers to return an error rather than crashing the process, should an unexpected configuration occur again. Additionally, we have identified the origin of the unexpected configuration, and are updating our operational procedures and automation to guard against any recurrence.
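For illustration, the remediation pattern described above could look like the following minimal Python sketch. All names here (MemcacheConfigError, parse_memcache_config, memcache_get) are hypothetical and not App Engine internals; the point is that an unexpected configuration is validated and surfaced as a recoverable error instead of crashing the serving process:

    # Hypothetical sketch, assuming a dict-based server configuration.
    # Names are illustrative only, not actual App Engine code.

    class MemcacheConfigError(Exception):
        """Raised when a memcache server configuration cannot be used."""


    def parse_memcache_config(config):
        """Returns a list of (host, port) pairs, or raises MemcacheConfigError.

        Before a fix of this kind, an unexpected entry here might have
        crashed the whole application server process.
        """
        servers = []
        for entry in config.get("servers", []):
            host, sep, port = entry.partition(":")
            if not host or not sep or not port.isdigit():
                # Surface a recoverable error rather than crashing the process.
                raise MemcacheConfigError("unexpected server entry: %r" % entry)
            servers.append((host, int(port)))
        return servers


    def memcache_get(config, key):
        """Example caller: degrade to a cache miss on bad configuration."""
        try:
            servers = parse_memcache_config(config)
        except MemcacheConfigError:
            # Treat the cache as unavailable; the request can still be served.
            return None
        # ... issue the get against `servers` ...
        return None

With this structure, a bad rollout degrades to cache misses for the affected requests rather than taking down the processes that serve them.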
We apologize for any inconvenience you or your customers experienced as a result of this issue. If you believe your paid application experienced an SLA violation as a result of this incident, please contact us.
Regards,
John Lowry, on behalf of the Google App Engine Team