SUMMARY:
On Wednesday 12 February 2014, Google App Engine (GAE) experienced elevated error rates and latency for a significant number of applications for 6 hours and 4 minutes. If your service or application was affected, we apologize. This is not the level of reliability and performance we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.
DETAILED DESCRIPTION OF IMPACT:
On Wednesday 12 February 2014, approximately 0.3% of applications on GAE experienced elevated latency and/or error rates at times during the period 10:33 to 16:37 PST. The impact was most severe between 10:40 and 11:10 PST. Google engineers further measured that 0.7% of all requests to applications during that period received errors, the majority of which were attributed to affected applications. Finally, Google engineers recorded elevated latency for requests to affected applications that did not receive errors; the exact increase cannot be quantified without a baseline, but we believe latency was 2 to 3 times normal.
ROOT CAUSE:
The root cause of the outage was insufficient capacity provisioned on multiple GAE instances, coupled with downtime for one instance. GAE instances were provisioned with sufficient resources for steady-state operation, but were insufficiently provisioned to handle the peak “cold cache” load from multiple applications during instance failover.
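The gap between steady-state and failover provisioning can be illustrated with a minimal sketch. It is not the capacity model used for GAE; the Instance type, the cold_cache_factor parameter, and all numbers are hypothetical and chosen only to show how a fleet can pass a steady-state check yet fail once migrated load arrives with cold caches.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical model of one serving instance; units are arbitrary.
    name: str
    capacity: float     # maximum load the instance can serve
    steady_load: float  # typical load with warm caches


def survives_failover(instances, cold_cache_factor):
    """Return True if the loss of any single instance can be absorbed by the rest.

    cold_cache_factor is an assumed multiplier (>= 1.0) describing how much
    more expensive migrated load is while caches are still cold.
    """
    for failed in instances:
        survivors = [i for i in instances if i is not failed]
        if not survivors:
            return False
        # Load moved off the failed instance is inflated by the cold-cache penalty.
        migrated = failed.steady_load * cold_cache_factor
        # Headroom left on the surviving instances after their own steady load.
        spare = sum(i.capacity - i.steady_load for i in survivors)
        if migrated > spare:
            return False
    return True


# Illustrative fleet: each instance runs at 60% of capacity in steady state.
fleet = [Instance(n, capacity=100.0, steady_load=60.0) for n in ("a", "b", "c")]
warm_ok = survives_failover(fleet, cold_cache_factor=1.0)  # 60 <= 80 spare
cold_ok = survives_failover(fleet, cold_cache_factor=2.0)  # 120 > 80 spare
```

In this example the fleet absorbs a failover when migrated load costs the same as warm load, but not when the assumed cold-cache penalty doubles it, which mirrors the shape of the shortfall described above.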
REMEDIATION AND PREVENTION:
In the 48 hours after the event, Google engineers added over 20 thousand CPU cores, 59 terabytes of RAM, and 4 petabytes of disk & spindle capacity to the GAE instances to ensure that there would be sufficient capacity for both steady-state and peak load events. This week, we are permanently modifying our capacity planning calculations to provision for peak and failover load levels, so that we will not re-enter a state where GAE is underprovisioned for these loads. Finally, over the next several weeks we are updating our API monitoring and alerting to ensure that our SRE teams are alerted within seconds to API error rate and latency excursions, regardless of source.
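As a rough illustration of alerting on error rate excursions, the sketch below tracks the error fraction over a sliding time window and flags when it exceeds a threshold. It does not describe Google's monitoring systems; the ErrorRateMonitor class, the 60-second window, and the 5% threshold are assumptions made for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window error-rate check (not a real GAE API)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds  # must be positive
        self.threshold = threshold            # error fraction that triggers an alert
        self.events = deque()                 # (timestamp, is_error), oldest first
        self.errors = 0                       # errors currently inside the window

    def record(self, timestamp, is_error):
        """Record one request; return True if the windowed error rate is too high."""
        self.events.append((timestamp, is_error))
        if is_error:
            self.errors += 1
        # Drop requests that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, was_error = self.events.popleft()
            if was_error:
                self.errors -= 1
        return self.errors / len(self.events) > self.threshold


monitor = ErrorRateMonitor(window_seconds=60, threshold=0.05)
quiet = [monitor.record(t, False) for t in range(20)]  # 20 successful requests
first = monitor.record(20, True)    # 1 error in 21 requests, below 5%
second = monitor.record(21, True)   # 2 errors in 22 requests, above 5%
later = monitor.record(200, False)  # earlier requests have aged out of the window
```

Because the check runs on every recorded request, an excursion is flagged as soon as the windowed rate crosses the threshold, rather than at the end of a fixed reporting interval.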