Elevated error rate with Google App Engine Blobstore API and App Engine Version Deployment


Google Cloud Platform Status

Mar 12, 2019, 11:14:53 PM
to google-appengine...@googlegroups.com
Mitigation work is currently underway by our Engineering Team. We will
provide another status update by Tuesday, 2019-03-12 20:45 US/Pacific with
current details.

Google Cloud Platform Status

Mar 12, 2019, 11:50:57 PM
to google-appengine...@googlegroups.com
We are still working on the issue with Google App Engine Blobstore API and
App Engine Version Deployment. Our Engineering Team believes it has
identified the potential root causes of the errors and is working to
mitigate them. We will provide another status update by Tuesday, 2019-03-12
21:15 US/Pacific with current details.

Google Cloud Platform Status

Mar 13, 2019, 12:17:57 AM
to google-appengine...@googlegroups.com
Mitigation work with the underlying storage infrastructure is still
underway by our Engineering Team. We will provide another status update by
Tuesday, 2019-03-12 21:45 US/Pacific with current details.

Google Cloud Platform Status

Mar 13, 2019, 12:47:42 AM
to google-appengine...@googlegroups.com
We are still working to address the root cause of the issue. We will
provide another status update by Tuesday, 2019-03-12 22:15 US/Pacific with
current details.

Google Cloud Platform Status

Mar 13, 2019, 12:56:20 AM
to google-appengine...@googlegroups.com
The issue with App Engine Version Deployment should be resolved for some
runtimes, including nodejs, python37, php72, go111, but we are still seeing
the issue with other runtimes. We will provide another status update by

Google Cloud Platform Status

Mar 13, 2019, 1:17:00 AM
to google-appengine...@googlegroups.com
We still have an issue with the App Engine Blobstore API and Version
Deployment. Our Engineering Team understands the root cause and is working
to implement the solution. We will provide another status update by
Tuesday, 2019-03-12 22:45 US/Pacific with current details.

Google Cloud Platform Status

Mar 13, 2019, 1:53:04 AM
to google-appengine...@googlegroups.com
The underlying storage infrastructure is gradually recovering. We will
provide another status update by Tuesday, 2019-03-12 23:15 US/Pacific with
current details.

Google Cloud Platform Status

Mar 13, 2019, 2:18:17 AM
to google-appengine...@googlegroups.com
The issue with App Engine Deployment should be resolved as of Tuesday,
2019-03-12 23:11 US/Pacific. The issue with the underlying storage for the
Blobstore API should be resolved for the majority of projects, and we
expect a full resolution in the near future. We will provide another status
update by Tuesday, 2019-03-12 23:45 US/Pacific with current details.

Google Cloud Platform Status

Mar 13, 2019, 2:31:29 AM
to google-appengine...@googlegroups.com
The issue with App Engine Blobstore API has been resolved for all affected
projects as of Tuesday, 2019-03-12 23:27 US/Pacific. We will conduct an
internal investigation of this issue and make appropriate improvements to
our systems to help prevent or minimize future recurrence. We will provide
a more detailed analysis of this incident once we have completed our
internal investigation.

Google Cloud Platform Status

Mar 14, 2019, 2:11:16 PM
to google-appengine...@googlegroups.com
ISSUE SUMMARY

On Tuesday 12 March 2019, Google's internal blob storage service
experienced a service disruption for a duration of 4 hours and 10 minutes.
We apologize to customers whose service or application was impacted by this
incident. We know that our customers depend on Google Cloud Platform
services and we are taking immediate steps to improve our availability and
prevent outages of this type from recurring.

DETAILED DESCRIPTION OF IMPACT

On Tuesday 12 March 2019 from 18:40 to 22:50 PDT, Google's internal blob
(large data object) storage service experienced elevated error rates,
averaging a 20% error rate with a brief peak of 31% during the incident.
User-visible Google services that make use of the blob storage service,
including Gmail, Photos, and Google Drive, also saw elevated error rates,
although (as was the case with Google Cloud Storage) the user impact was
greatly reduced by the caching and redundancy built into those services.
There will be a separate incident report for non-GCP services affected by
this incident.

The Google Cloud Platform services that experienced the most significant
customer impact were the following:

Google Cloud Storage experienced elevated long tail latency and an average
error rate of 4.8%. All bucket locations and storage classes were impacted.
GCP services that depend on Cloud Storage were also impacted.

Stackdriver Monitoring experienced up to 5% errors retrieving historical
time series data. Recent time series data was available. Alerting was not
impacted.

App Engine's Blobstore API experienced elevated latency and an error rate
that peaked at 21% for fetching blob data. App Engine deployments
experienced elevated errors that peaked at 90%. Serving of static files
from App Engine also experienced elevated errors.
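
As an illustration of how applications typically ride out partial error
rates like the ones above, here is a minimal retry-with-backoff sketch in
Python. It is not part of the original report, and fetch_blob is a
hypothetical stand-in for whatever Blobstore or Cloud Storage read call an
application actually uses:

    import random
    import time

    def fetch_with_backoff(fetch_blob, blob_key, max_attempts=5):
        # Retry transient read failures with exponential backoff plus jitter.
        # fetch_blob is a placeholder for the application's actual read call.
        for attempt in range(max_attempts):
            try:
                return fetch_blob(blob_key)
            except Exception:  # in real code, catch only transient error types
                if attempt == max_attempts - 1:
                    raise
                # back off 0.5s, 1s, 2s, ... plus up to 0.5s of random jitter
                time.sleep((2 ** attempt) * 0.5 + random.uniform(0, 0.5))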

ROOT CAUSE

On Monday 11 March 2019, Google SREs were alerted to a significant increase
in the storage resources consumed by metadata used by the internal blob
service. On Tuesday 12 March, to reduce resource usage, SREs made a
configuration change which had the side effect of overloading a key part of
the system used to look up the location of blob data. The increased load
eventually led to a cascading failure.
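
To make the cascading-failure dynamic concrete, the following toy model
(purely illustrative, not a description of Google's actual system) shows
how a modest overload can take down an entire fleet once the load of each
failed replica is reshared onto the survivors:

    def simulate_cascade(replicas, per_replica_capacity, total_load):
        # Replicas that receive more load than they can handle crash,
        # and their share of traffic is redistributed to the survivors.
        alive = replicas
        while alive:
            if total_load / alive <= per_replica_capacity:
                return alive          # the system stabilizes
            alive -= 1                # an overloaded replica crashes
        return 0                      # cascading failure: nothing left

    # Ten replicas of 100 QPS each handle 950 QPS fine, but 1,050 QPS
    # overloads each replica slightly and the whole fleet collapses.
    print(simulate_cascade(10, 100, 950))   # -> 10
    print(simulate_cascade(10, 100, 1050))  # -> 0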

REMEDIATION AND PREVENTION

SREs were alerted to the service disruption at 18:56 PDT and immediately
stopped the job that was making configuration changes. In order to recover
from the cascading failure, SREs manually reduced traffic levels to the
blob service to allow tasks to start up without crashing due to high load.

In order to prevent service disruptions of this type, we will be improving
the isolation between regions of the storage service so that failures are
less likely to have global impact. We will be improving our ability to
provision resources more quickly in order to recover from a cascading
failure triggered by high load. We will put software safeguards in place to
prevent configuration changes that overload key parts of the system. We
will also improve the load shedding behavior of the metadata storage system
so that it degrades gracefully under overload.
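
For context on that last point, "load shedding" generally means rejecting
excess requests quickly rather than letting them queue up and push a server
into the kind of overload described above. A minimal admission-control
sketch in Python (illustrative only; it does not describe the metadata
storage system's actual mechanism):

    import threading

    class LoadShedder(object):
        # Serve at most max_in_flight requests concurrently and reject the
        # rest immediately, so the server degrades gracefully under overload.
        def __init__(self, max_in_flight):
            self._slots = threading.BoundedSemaphore(max_in_flight)

        def handle(self, request, serve, reject):
            if not self._slots.acquire(blocking=False):
                return reject(request)   # shed load: fail fast and cheaply
            try:
                return serve(request)    # normal path while under capacity
            finally:
                self._slots.release()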