Google Cloud Platform Status
Mar 26, 2015, 1:32:44 AM
to google-appengine...@googlegroups.com
SUMMARY:
On Tuesday 24th March 2015, Google App Engine served elevated 503 errors on
<1% of applications for a typical duration of 50 minutes. We know how
important high uptime and low error rates are to you and your users, and we
apologize for these errors. We are learning from this incident and are
implementing several improvements to make our service more reliable.
DETAILED DESCRIPTION OF IMPACT:
On Tuesday 24th March 2015 from 13:03 to 13:53 PDT approximately 1% of
requests to App Engine erroneously received an error 503 with a
message "Over Quota. This application is temporarily over its serving
quota. Please try again later." This occurred despite applications being
within their quotas. The distribution of these errors was not uniform; some
applications received a disproportionately high fraction of the total
errors.
ROOT CAUSE:
A latent bug in the App Engine quota handling code was triggered during a
routine software update of the quota system. This resulted in App Engine
returning over-quota errors to some applications that were not over quota.
As App Engine software updates are rolled out progressively, only some
applications were affected by the update before the issue was detected and
remediated.
REMEDIATION AND PREVENTION:
Google engineers directed traffic away from the affected App Engine
infrastructure once the nature of the problem was understood. As a result,
global 503 rates returned to pre-incident levels at 13:53. Google engineers
then identified a small number of applications that were not fixed by the
initial change and corrected their quota behavior manually at 14:45.
To prevent recurrence of this issue, Google engineers will add monitoring
and alerting for the quota issue that resulted in spurious 503 errors,
create a new quick-response protocol for handling erroneous quota
responses, and modify application quota behavior to tolerate novel quota
system behavior with lower application impact.
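
As an illustration only, the kind of monitoring described above could be
sketched as a sliding-window check on the rate of over-quota 503 responses
served to applications that are actually within quota. This is not Google's
implementation; the class name, window size, and alert threshold below are
all hypothetical, and the sketch assumes the caller already knows whether
the application was within its quota when the response was served.

```python
from collections import deque


class SpuriousQuotaErrorMonitor:
    """Hypothetical sliding-window monitor for spurious over-quota 503s."""

    def __init__(self, window_size=1000, threshold=0.005):
        # Keep only the most recent `window_size` observations.
        self.window = deque(maxlen=window_size)
        # Fraction of spurious errors above which an alert should fire
        # (an assumed value, not taken from the incident report).
        self.threshold = threshold

    def record(self, status_code, app_within_quota):
        # A 503 is "spurious" only if the application was within quota.
        self.window.append(status_code == 503 and app_within_quota)

    def spurious_rate(self):
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def should_alert(self):
        return self.spurious_rate() > self.threshold


if __name__ == "__main__":
    monitor = SpuriousQuotaErrorMonitor(window_size=100, threshold=0.005)
    for _ in range(99):
        monitor.record(200, app_within_quota=True)
    monitor.record(503, app_within_quota=True)
    print(monitor.spurious_rate(), monitor.should_alert())
```

A fixed-size window keeps the check cheap and makes the rate reflect recent
traffic only; a production system would more likely aggregate per
application and per time interval, since the report notes that the errors
were not uniformly distributed across applications.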