SUMMARY:
On Thursday afternoon, 8 May 2014, between 0.01% and 0.10% of applications on GAE experienced an unexpected increase in the number of instances associated with their application. If your service or application was affected, we apologize - we have corrected the error in GAE, we are crediting all affected applications for the erroneous additional instance-hours, and we are improving our GAE monitoring and release procedures to help prevent a recurrence.
DETAILED DESCRIPTION OF IMPACT:
On Thursday 8 May 2014, between 0.01% and 0.10% of applications on GAE experienced an increase in the number of running instances during the period 16:50 PST to 9:50 PST the following morning. This behavior resulted in increased instance-hours quota usage, higher numbers of loading requests, and in some cases moderately increased latency. The increased use of instance hours caused approximately 0.001% of apps to reach their free instance-hours limit or daily budget, resulting in some errors.
ROOT CAUSE:
The root cause of the incident was an issue introduced by the rollout of GAE version 1.9.5. In specific circumstances, the new scheduler did not correctly re-use idle instances and instead started a new instance for every request. The issue took some time to resolve because the large number of instances persisted across a rollback to the previous 1.9.4 version.
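The effect of this failure mode can be illustrated with a toy model. The sketch below is not the GAE scheduler or any part of its code; the names ToyScheduler, Instance, reuse_idle, and simulate are hypothetical and exist only for this example. It assumes requests arrive one at a time, so a single instance is always idle when the next request arrives.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for one running application instance."""
    busy: bool = False


@dataclass
class ToyScheduler:
    """Illustrative scheduler; reuse_idle=False mimics the faulty behavior."""
    reuse_idle: bool = True
    instances: list = field(default_factory=list)

    def handle(self) -> Instance:
        """Assign one request to an instance and return that instance."""
        if self.reuse_idle:
            # Healthy path: pick the first idle instance if one exists.
            for inst in self.instances:
                if not inst.busy:
                    inst.busy = True
                    return inst
        # No idle instance found (or re-use is skipped): start a new one.
        inst = Instance(busy=True)
        self.instances.append(inst)
        return inst

    def finish(self, inst: Instance) -> None:
        """Mark the instance idle again once its request completes."""
        inst.busy = False


def simulate(reuse_idle: bool, requests: int) -> int:
    """Serve sequential requests and return how many instances were started."""
    sched = ToyScheduler(reuse_idle=reuse_idle)
    for _ in range(requests):
        inst = sched.handle()
        sched.finish(inst)
    return len(sched.instances)


if __name__ == "__main__":
    print("with re-use:", simulate(True, 100))
    print("without re-use:", simulate(False, 100))
```

In this model, sequential traffic needs only one instance when idle instances are re-used, while skipping re-use starts one instance per request. The model also leaves idle instances running, which loosely mirrors why the instance count did not fall on its own after the rollback.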
REMEDIATION AND PREVENTION:
Google engineers reset the instance-hour quota to stop quota-exceeded errors. Google engineers resolved the issue by redirecting traffic to a datacenter running GAE version 1.9.4. Google will credit impacted customers to cover the cost of instances used during this period.
To prevent recurrences, Google engineers are adding pre-launch tests and improving the alerting and management infrastructure to ensure rapid detection and diagnosis of any similar issue.