Google Cloud Platform Status
Dec 16, 2015, 10:40:17 AM
to google-appengine...@googlegroups.com
SUMMARY:
On Monday 7 December 2015, 1.29% of Google App Engine applications received
errors when issuing authenticated calls to Google APIs over a period of 17
hours and 3 minutes. During a 45-minute period, authenticated calls to
Google APIs from outside of App Engine also received errors, with the error
rate peaking at 12%. We apologize for the impact of this issue on you and
your service. We consider service degradation of this level and duration to
be very serious, and we are planning many changes to prevent a recurrence.
DETAILED DESCRIPTION OF IMPACT:
Between Monday 7 December 2015 20:09 PST and Tuesday 8 December 2015 13:12,
1.29% of Google App Engine applications using service accounts received
error 401 "Access Denied" for all requests to Google APIs requiring
authentication. Unauthenticated API calls were not affected. Different
applications experienced impact at different times, with few applications
being affected for the full duration of the incident.
In addition, between 23:05 and 23:50 PST, an average of 7% of all requests
to Google Cloud APIs failed or timed out, peaking briefly at 12%. Outside of
this window, only API calls from App Engine were affected.
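For readers unfamiliar with the affected call path, the sketch below is an
illustration only (it is not taken from this report) of the typical shape of
an authenticated Google API call from a Python App Engine application using
its service account identity. During the incident, requests of this shape to
APIs requiring authentication received HTTP 401 responses; the bucket name
and scope below are examples.

    # Illustrative sketch only: an authenticated Cloud Storage JSON API call
    # from a Python App Engine application using its service account.
    from google.appengine.api import app_identity, urlfetch

    SCOPE = 'https://www.googleapis.com/auth/devstorage.read_only'

    def list_objects(bucket):
        # Obtain an OAuth2 access token for the app's service account.
        token, _expiry = app_identity.get_access_token([SCOPE])
        result = urlfetch.fetch(
            url='https://www.googleapis.com/storage/v1/b/%s/o' % bucket,
            headers={'Authorization': 'Bearer %s' % token})
        if result.status_code == 401:
            # The failure mode described above: the API rejects the service
            # account's credentials with "Access Denied".
            raise Exception('401 Access Denied for service account call')
        return result.content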
ROOT CAUSE:
Google engineers had recently carried out a migration of the Google
Accounts system to a new storage backend, which included copying API
authentication service credential data and redirecting API calls to the new
backend.
To complete this migration, credentials were scheduled to be deleted from
the previous storage backend. This process started at 20:09 PST on Monday 7
December 2015. Due to a software bug, the API authentication service
continued to look up some credentials, including those used by Google App
Engine service accounts, in the old storage backend. As these credentials
were progressively deleted, their corresponding service accounts could no
longer be authenticated.
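As an illustration of the failure mode (the report includes no code, and
every name and condition below is hypothetical), the bug can be pictured as
a credential lookup that still routes a subset of accounts to the old
storage backend after the migration:

    # Hypothetical sketch of the misdirected lookup; names and the routing
    # condition are invented for illustration.
    OLD_BACKEND = {}   # legacy credential store, being emptied by deletion
    NEW_BACKEND = {}   # migrated credential store, fully populated

    def uses_legacy_path(account_id):
        # Stand-in for the faulty routing condition; the real trigger is
        # not described in the report beyond "a software bug".
        return hash(account_id) % 100 < 2

    def lookup_credential(account_id):
        if uses_legacy_path(account_id):
            # Bug: after the migration this branch should never be taken.
            # Once the deletion job removes the row from the old backend,
            # the lookup finds nothing and authentication fails with 401,
            # even though a valid copy exists in the new backend.
            return OLD_BACKEND.get(account_id)
        return NEW_BACKEND.get(account_id)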
The impact increased as more credentials were deleted and some Google App
Engine applications started to issue a high volume of retry requests. At
23:05, the retry volume exceeded the regional capacity of the API
authentication service, causing 1.3% of all authenticated API calls to fail
or time out, including calls to Google APIs made from outside Google App
Engine. At
23:30 the API authentication service exceeded its global capacity, causing
up to 12% of all authenticated API calls to fail until 23:50, when the
overload issue was resolved.
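The retry amplification described above is the classic argument for capped,
jittered exponential backoff in API clients. The sketch below is a generic
illustration rather than guidance from this report; the exception type and
limits are placeholders.

    # Illustrative client-side retry policy: back off exponentially with
    # jitter and give up after a bounded number of attempts, so a
    # struggling backend is not hit with an ever-growing wave of retries.
    import random
    import time

    class TransientAuthError(Exception):
        """Placeholder for a retryable authentication failure."""

    def call_with_backoff(request_fn, max_attempts=5):
        for attempt in range(max_attempts):
            try:
                return request_fn()
            except TransientAuthError:
                if attempt == max_attempts - 1:
                    raise  # stop retrying; surface the error to the caller
                # Sleep 1s, 2s, 4s, ... plus up to 1s of random jitter.
                time.sleep((2 ** attempt) + random.random())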
REMEDIATION AND PREVENTION:
At 23:50 PST on Monday 7 December, Google engineers blocked certain
authentication credentials that were known to be failing, preventing
retries on these credentials from overloading the API authentication
service.
On Tuesday 8 December at 08:52 PST, the deletion process was halted, having
removed 2.3% of credentials, preventing further applications from being
affected. At 10:08, Google engineers identified the root cause of the
misdirected credential lookups. After thorough testing, a fix was rolled
out globally, resolving the issue for all affected Google App Engine
applications by 13:12.
Google has conducted a far-reaching review of the issue's root causes and
contributory factors, leading to numerous prevention and mitigation actions
in the following areas:
— Google engineers have deployed monitoring for additional infrastructure
signals to detect and analyse similar issues more quickly.
— Google engineers have improved internal tools to extend auditing and
logging and automatically advise relevant teams on potentially risky data
operations.
— Additional rate limiting and caching features will be added to the API
authentication service, increasing its resilience to load spikes (a generic
sketch of the rate-limiting idea follows this list).
— Google’s development guidelines are being reviewed and updated to improve
the handling of service or backend migrations, including a grace period
during which access to old data locations is disabled before they are fully
decommissioned.
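To make the rate-limiting item above concrete, the sketch below shows a
generic token bucket that sheds excess load during spikes. It is a general
illustration; the actual design of the API authentication service is not
described in this report.

    # Generic token-bucket rate limiter; parameters and names are examples,
    # not details of Google's API authentication service.
    import time

    class TokenBucket(object):
        def __init__(self, rate_per_sec, burst):
            self.rate = float(rate_per_sec)   # steady-state requests/second
            self.capacity = float(burst)      # maximum burst size
            self.tokens = float(burst)
            self.last = time.time()

        def allow(self):
            now = time.time()
            # Refill tokens in proportion to elapsed time, up to the cap.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False   # caller should reject or queue the request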
Our customers rely on us to provide a superior service and we regret we did
not live up to expectations in this case. We apologize again for the
inconvenience this caused you and your users.