On June 11, 2013, from 01:40 AM to 02:01 AM and from 06:11 AM to 06:16 AM US/Pacific time, Google App Engine applications running in one of our US data centers experienced an elevated rate of errors due to crashes in the processes that run applications.
This issue was caused by a global rollout of a new memcache server, which started on June 10. In one US data center, an unexpected configuration caused application servers in that data center to crash when the rollout completed around 01:40 AM. The issue caused an elevated rate of 500 responses to HTTP requests, and the average error rate for affected applications peaked at 40%. At 02:10 AM, we redirected traffic from the affected data center to a different data center to restore normal serving for affected applications. Our engineering team attempted to correct the configuration and returned traffic to the original data center at 06:11 AM. The fix was faulty, causing the same problem to recur, so we redirected traffic away from the affected data center again at 06:16 AM. Our engineering team then corrected the configuration and returned traffic to the original data center at 07:13 AM.
In response to this incident, we have improved the memcache client code in our application servers to return an error rather than crashing the process, should an unexpected configuration occur again. Additionally, we have identified the origin of the unexpected configuration, and are updating our operational procedures and automation to guard against any recurrence.
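For illustration, the remediation pattern described above could look like the following minimal Python sketch. All names here (MemcacheConfigError, parse_memcache_config, memcache_get) are hypothetical and not App Engine internals; the point is that an unexpected configuration is validated and surfaced as a recoverable error instead of crashing the serving process:

    # Hypothetical sketch, assuming a dict-based server configuration.
    # Names are illustrative only, not actual App Engine code.

    class MemcacheConfigError(Exception):
        """Raised when a memcache server configuration cannot be used."""


    def parse_memcache_config(config):
        """Returns a list of (host, port) pairs, or raises MemcacheConfigError.

        Before a fix of this kind, an unexpected entry here might have
        crashed the whole application server process.
        """
        servers = []
        for entry in config.get("servers", []):
            host, sep, port = entry.partition(":")
            if not host or not sep or not port.isdigit():
                # Surface a recoverable error rather than crashing the process.
                raise MemcacheConfigError("unexpected server entry: %r" % entry)
            servers.append((host, int(port)))
        return servers


    def memcache_get(config, key):
        """Example caller: degrade to a cache miss on bad configuration."""
        try:
            servers = parse_memcache_config(config)
        except MemcacheConfigError:
            # Treat the cache as unavailable; the request can still be served.
            return None
        # ... issue the get against `servers` ...
        return None

With this structure, a bad rollout degrades to cache misses for the affected requests rather than taking down the processes that serve them.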
We apologize for any inconvenience you or your customers experienced as a result of this issue. If you believe your paid application experienced an SLA violation as a result of this incident, please contact us.
Regards,
John Lowry, on behalf of the Google App Engine Team