Elevated error rates and latency on 2013-01-15

Chris Ramsdale, 1/17/13 4:21 PM
Beginning on January 15, 2013 at approximately 7 AM US/Pacific time and continuing until approximately 12 PM US/Pacific time, some Google App Engine applications experienced elevated request latency and error rates. This incident was caused by a configuration issue in our storage layer that resulted in increased CPU usage in a single data center. Approximately 4% of all requests to App Engine applications resulted in errors during this event. For some applications, a majority of requests resulted in errors.

This incident was identified by our standard monitoring systems and we began taking corrective measures immediately, including moving applications out of the affected data center. A more detailed timeline is included below (all times are in US/Pacific).

Tuesday, 2013-01-15
  • 07:20 - Storage configuration is updated in a single data center.
  • 07:40 - Intermittent errors within the storage infrastructure exceed our monitoring threshold. Investigation begins.
  • 08:00 - We begin moving some App Engine services out of the affected data center.
  • 08:10 - A subset of applications begins experiencing request failures.
  • 09:21 - Request failures rise to a level that requires us to move all applications and services out of the affected data center.
  • 09:36 - Monitoring indicates that the affected data center has fully recovered. Applications are moved back into this data center.
  • 09:39 - A subsequent storage configuration update is applied in the same data center.
  • 10:03 - Again, intermittent errors within the storage infrastructure exceed our monitoring threshold. Applications and services are moved out of the affected data center.
  • 10:31 - Storage configuration is corrected. Storage infrastructure in the affected data center begins to recover.
  • 11:40 - Monitoring indicates that the affected data center has fully recovered. We begin moving applications back into this data center.
  • 12:03 - Application serving returns to normal.

App Engine infrastructure operates across multiple data centers and is designed to be resilient both to individual hardware failures and even to the loss of an entire data center. We are actively improving our ability to move applications from one data center to another quickly and transparently. Similarly, we are improving our processes and tools for configuration changes to the storage infrastructure to avoid similar incidents in the future.

We apologize for the inconvenience caused by this outage. If you believe your paid application experienced an SLA violation during this incident, please fill out our refund request form.

Regards,

Chris Ramsdale on behalf of the Google App Engine Team