Google Cloud Platform Status
ISSUE SUMMARY
On Friday 21 December 2018, customers deploying App Engine apps, deploying
Cloud Functions, reading from Google Cloud Storage (GCS), or using Cloud
Build experienced increased latency and elevated error rates ranging from
1.6% to 18% for a period of 3 hours and 41 minutes.
We understand that these services are critical to our customers and
sincerely apologize for the disruption caused by this incident; this is not
the level of quality and reliability that we strive to offer you. We have
several engineering efforts now under way to prevent a recurrence of this
sort of problem; they are described in detail below.
DETAILED DESCRIPTION OF IMPACT
On Friday 21 December 2018, from 08:01 to 11:43 PST, Google Cloud Storage
reads, App Engine deployments, Cloud Functions deployments, and Cloud Build
experienced a disruption due to increased latency and 5xx errors while
reading from Google Cloud Storage. The peak error rate for GCS reads was
1.6% in the US multi-region. Writes were not affected, as the impacted
metadata store is not used on the write path.
App Engine deployment errors averaged 8% across all regions during the
incident. Cloud Build saw a 14% INTERNAL_ERROR rate and an 18% TIMEOUT
error rate at peak. Cloud Functions deployments failed at an aggregate
average rate of 4.6% in us-central1, us-east1, europe-west1, and
asia-northeast1.
ROOT CAUSE
Impact began when increased load on one of GCS's metadata stores resulted
in request queuing, which in turn created an uneven distribution of service
load.
The additional load was created by a partially-deployed new feature. A
routine maintenance operation in combination with this new feature resulted
in an unexpected increase in the load on the metadata store. This increase
in load affected read workloads due to increased request latency to the
metadata store.
In some cases, this latency exceeded the timeout threshold, causing an
average of 0.6% of requests to fail in the US multi-region for the duration
of the incident.
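
For illustration only, the sketch below shows the kind of client-side
handling this failure mode typically calls for: a read bounded by a timeout
and retried with exponential backoff and jitter when a transient 5xx or
timeout occurs. The fetch_object callable and its parameters are
assumptions for the example, not part of any GCS client library or of
Google's internal systems.

import random
import time

class TransientError(Exception):
    """Stand-in for a 5xx response or a read that exceeded its timeout."""

def read_with_backoff(fetch_object, key, timeout_s=5.0, max_attempts=4):
    """Retry a read that may time out or fail transiently.

    fetch_object(key, timeout_s) is a hypothetical callable standing in
    for a storage read; it should raise TransientError on a 5xx or timeout.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_object(key, timeout_s)
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter, so synchronized retries do
            # not pile additional load onto an already-queued backend.
            delay = min(2 ** attempt, 32) + random.uniform(0, 1)
            time.sleep(delay)

The jitter matters in an incident like this one: without it, retries from
many clients tend to arrive in waves and add further load to a metadata
store that is already queuing requests.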
REMEDIATION AND PREVENTION
Google engineers were automatically alerted to the increased error rate at
08:22 PST. Because the issue spanned multiple backend systems, several
teams at Google joined the investigation and narrowed the problem down to
the newly deployed feature. Latency and error rates began to subside once
Google engineers initiated a rollback of the new feature. The issue was
fully mitigated at 11:43 PST when the rollback finished, at which point the
impacted GCP services recovered completely.
In addition to updating the feature to prevent this type of increased load,
we will update the rollout workflow to stress-test feature limits before
rollout. To improve time to resolution for issues in the metadata store, we
are adding instrumentation to the requests made to that subsystem.
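
As a rough sketch of what per-request instrumentation of that kind can look
like (the names, counters, and metadata_read placeholder below are
assumptions for illustration, not Google's internal tooling), a wrapper
that records latency and error counts per request type might be:

import functools
import time
from collections import defaultdict

# Hypothetical in-process counters; a real deployment would export these
# to a monitoring system rather than keep them in local dicts.
latency_ms = defaultdict(list)
error_count = defaultdict(int)

def instrumented(op_name):
    """Decorator that records latency and failures for each request type."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                error_count[op_name] += 1
                raise
            finally:
                # Record latency whether the call succeeded or failed.
                latency_ms[op_name].append((time.monotonic() - start) * 1000.0)
        return inner
    return wrap

@instrumented("metadata_read")
def metadata_read(key):
    # Placeholder for a call to the metadata store.
    ...

Breaking latency and errors out per request type is what shortens time to
resolution: it shows which operations against the subsystem are queuing
before aggregate error rates make the cause obvious.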