SUMMARY:
On Thursday 5 March 2015, for a duration of 84 minutes, Google App Engine
applications that accessed some Google APIs over HTTP experienced elevated
error rates. We apologize for any impact this incident had on your service
or application, and have made immediate changes to prevent this issue from
recurring.
DETAILED DESCRIPTION OF IMPACT:
On Thursday 5 March, from 07:04 AM to 08:28 AM, some Google App Engine
applications making calls to other Google APIs via HTTP experienced
elevated error rates. During the incident, the global error rate for all
API calls remained under 1%, and in total, the outage affected 2% of
applications that were active during the incident. The effect on those
applications was significant: requests to issue OAuth tokens experienced an
error rate of over 85%. In addition, the HTTP APIs at
googleapis.com/storage and googleapis.com/gmail received error rates
between 50% and 60%, while other googleapis.com endpoints were affected
with error rates of 10% to 20%.
ROOT CAUSE:
A component in Google’s shared HTTP load balancing fabric experienced a
non-malicious increase in traffic, exceeding its provisioned capacity. This
triggered an automatic denial-of-service (DoS) protection mechanism, which
shunted a portion of the incoming traffic to a CAPTCHA. Because this
response was unexpected, some clients issued immediate automated retries,
further increasing load and exacerbating the problem.
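The retry amplification described above is the standard reason clients are advised to retry with exponential backoff and jitter rather than immediately. As an illustrative sketch only (not Google's actual client code), the following shows how randomized, exponentially growing delays spread a population of failing clients' retries out over time instead of sending them in synchronized waves:

```python
import random

def backoff_schedule(max_retries, base=0.5, cap=60.0, seed=None):
    """Return randomized wait times (in seconds) for each retry attempt.

    Exponential backoff with "full jitter": each delay is drawn
    uniformly from [0, min(cap, base * 2**attempt)], so retries from
    many clients are decorrelated rather than arriving simultaneously
    at an already-overloaded endpoint.
    """
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]

# Average delay grows with each attempt but never exceeds the cap.
delays = backoff_schedule(max_retries=6)
```

The `base`, `cap`, and `max_retries` parameters here are hypothetical values chosen for illustration; appropriate settings depend on the service being called.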
REMEDIATION AND PREVENTION:
Google engineers were alerted to the issue by automated monitoring at
07:02, as the load balancing system detected excess traffic and attempted
to automatically mitigate it. At 07:46, Google engineers enabled standby
load balancing capacity to rectify the issue. From 08:15 to 08:40, Google
engineers continued to provision additional resources in the load balancing
fabric in order to serve the increased traffic. During this period, at
08:28, Google engineers determined that sufficient capacity was in place to
serve both regular and retry traffic, and instructed the load balancing
system to cease mitigation and resume normal traffic serving. This action
marked the end of the event.
To prevent this issue from recurring, Google engineers are comprehensively
re-examining the affected load balancing fabric to ensure it is and remains
correctly provisioned. Additionally, Google engineers are improving
monitoring rules to provide an early warning of capacity shortfall.
Finally, Google engineers are examining the services that depend on this
load balancing system, and will move some services to a separate pool of
more easily scalable load balancers where appropriate.