App Engine serving issue on February 12, 2014

John Lowry

Feb 12, 2014, 4:50:05 PM

to google-appengine...@googlegroups.com

We're investigating an issue with Google App Engine serving that began on Wednesday, 2014-02-12 at 11:00 AM (all times are in US/Pacific). We will provide more information shortly.

John Lowry

Feb 12, 2014, 5:13:15 PM

to google-appengine...@googlegroups.com

We are currently experiencing an issue with Google App Engine serving. Some applications saw higher latency from 11:25 AM - 12:20 PM (all times are in US/Pacific). For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by Wednesday, 2014-02-12 16:00 with current details.

John Lowry

Feb 12, 2014, 7:11:05 PM

to google-appengine...@googlegroups.com

We continue to monitor elevated latency and error rates for a small number of Google App Engine serving instances. The vast majority of instances are performing normally. We are continuing to work on restoring normal operation for the remaining affected instances. For everyone who is affected, we apologize for any inconvenience you may be experiencing.

Taishi Iwasaki

Feb 12, 2014, 8:45:09 PM

to google-appengine...@googlegroups.com

The problem with Google App Engine serving was resolved as of Wednesday, 2014-02-12 16:32 (all times are in US/Pacific). We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.

John Lowry

Feb 24, 2014, 1:39:09 PM

to google-appengine...@googlegroups.com

SUMMARY:

On Wednesday 12 February 2014, Google App Engine (GAE) experienced elevated error rates and latency for a significant number of applications for 6 hours and 4 minutes. If your service or application was affected, we apologize. This is not the level of reliability and performance we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT:

On Wednesday 12 February 2014, approximately 0.3% of applications on GAE experienced elevated latency and/or error rates at times between 10:33 and 16:37 PST. The impact was most severe between 10:40 and 11:10 PST. Google engineers further measured that 0.7% of overall requests to applications during that period received errors, the majority of which were attributed to affected applications. Finally, Google engineers recorded elevated latency for requests to affected applications that did not receive errors; the exact latency increase is not quantifiable without a baseline, but we believe it was 2x-3x normal.
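
As a quick check on the figures above, the impact window can be turned into a duration with a few lines of Python. This is only an illustrative sketch: the function name `impact_duration` is hypothetical, and the timestamps are the ones quoted in this report, treated as naive local (PST) times.

```python
from datetime import datetime


def impact_duration(start, end):
    """Return (hours, minutes) between two datetimes on the same clock."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


# Overall impact window quoted in the report: 10:33 to 16:37 PST.
overall = impact_duration(datetime(2014, 2, 12, 10, 33),
                          datetime(2014, 2, 12, 16, 37))

# Most severe window quoted in the report: 10:40 to 11:10 PST.
severe = impact_duration(datetime(2014, 2, 12, 10, 40),
                         datetime(2014, 2, 12, 11, 10))

print(overall)  # (6, 4), matching the 6 hours and 4 minutes in the summary
print(severe)   # (0, 30)
```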

ROOT CAUSE:

The root cause of the outage was insufficient capacity provisioned on multiple GAE instances, coupled with downtime for one instance. GAE instances were provisioned with sufficient resources for steady-state operation, but were insufficiently provisioned to handle the peak “cold cache” load from multiple applications during instance failover.
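
The gap between steady-state and failover provisioning can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the report does not say how load is redistributed during failover, so the code assumes the failed instance's load is spread evenly over the survivors and inflated by a hypothetical `cold_cache_factor`. The load and capacity numbers are invented and are not GAE figures.

```python
def peak_loads_after_failover(steady_loads, failed_index, cold_cache_factor):
    """Load on each surviving instance after one instance fails.

    Assumption: the failed instance's steady-state load, multiplied by
    cold_cache_factor, is split evenly across the surviving instances.
    """
    survivors = [load for i, load in enumerate(steady_loads) if i != failed_index]
    if not survivors:
        raise ValueError("need at least one surviving instance")
    extra = steady_loads[failed_index] * cold_cache_factor / len(survivors)
    return [load + extra for load in survivors]


def survives_failover(capacities, steady_loads, failed_index, cold_cache_factor):
    """True if every surviving instance has capacity for its failover peak."""
    peaks = peak_loads_after_failover(steady_loads, failed_index, cold_cache_factor)
    surviving_caps = [c for i, c in enumerate(capacities) if i != failed_index]
    return all(peak <= cap for peak, cap in zip(peaks, surviving_caps))


# Hypothetical numbers in arbitrary units.
loads = [100, 100, 100]

# Enough for steady state (100 <= 120) but not for a cold-cache failover.
print(survives_failover([120, 120, 120], loads, failed_index=0, cold_cache_factor=2.0))

# Provisioned for the failover peak instead.
print(survives_failover([200, 200, 200], loads, failed_index=0, cold_cache_factor=2.0))
```

In this toy model, capacity sized only for steady state fails the check as soon as one instance goes down, which is the situation the root cause describes.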

REMEDIATION AND PREVENTION:

In the 48 hours after the event, Google engineers added over 20,000 CPU cores, 59 terabytes of RAM, and 4 petabytes of disk & spindle capacity to the GAE instances to ensure that there would be sufficient capacity for both steady-state and peak load events. This week, we are permanently modifying our capacity planning calculations to provision for peak / failover load levels, so that we will not reenter a state where GAE is underprovisioned for these loads. Finally, over the next several weeks we are updating our API monitoring and alerting to ensure that our SRE teams are alerted within seconds to API error rate and latency excursions, regardless of source.
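
For readers who want a concrete picture of alerting on error rate excursions, here is a minimal sliding-window sketch. It is not a description of Google's monitoring: the class `ErrorRateMonitor`, its parameters, and the thresholds are all hypothetical, and it assumes requests are recorded with non-decreasing timestamps in seconds.

```python
from collections import deque


class ErrorRateMonitor:
    """Flags when the error rate over a recent time window exceeds a threshold."""

    def __init__(self, window_seconds, threshold, min_requests=1):
        self.window_seconds = window_seconds
        self.threshold = threshold        # e.g. 0.05 means 5% errors
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self.events = deque()             # (timestamp, is_error) pairs
        self.errors = 0

    def record(self, timestamp, is_error):
        """Record one request; return True if an alert should fire."""
        self.events.append((timestamp, is_error))
        if is_error:
            self.errors += 1
        # Drop requests that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, was_error = self.events.popleft()
            if was_error:
                self.errors -= 1
        total = len(self.events)
        return total >= self.min_requests and self.errors / total > self.threshold


# Hypothetical usage: 10-second window, alert above 50% errors.
monitor = ErrorRateMonitor(window_seconds=10, threshold=0.5, min_requests=2)
print(monitor.record(0, False))  # too few requests in the window
print(monitor.record(1, True))   # exactly 50%, not above the threshold
print(monitor.record(2, True))   # 2 of 3 requests failed
```

Because the check runs on every recorded request, an excursion is detected as soon as the window's error rate crosses the threshold rather than at the end of a fixed reporting interval.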
