Elevated GCS Errors from Canada


Google Cloud Platform Status

Oct 13, 2017, 12:32:04 PM
to gs-an...@googlegroups.com
We are investigating an issue affecting Google Cloud Storage users in
Canada and the Northeast of North America, who are experiencing HTTP 503
failures. We will provide more information by 10:30 US/Pacific.

Google Cloud Platform Status

Oct 13, 2017, 1:17:55 PM
to gs-an...@googlegroups.com
The issue with Google Cloud Storage request failures for users in Canada
and Northeast North America has been resolved for all affected users as of
Friday, 2017-10-13 10:08 US/Pacific. We will conduct an internal
investigation of this issue and make appropriate improvements to our
systems to help prevent or minimize future recurrence. We will provide a
more detailed analysis of this incident once we have completed our internal
investigation.

Google Cloud Platform Status

Oct 19, 2017, 3:49:37 PM
to gs-an...@googlegroups.com
ISSUE SUMMARY

Starting Thursday 12 October 2017 at 12:47 PDT, Google Cloud Storage
clients located in the Northeast of North America experienced up to a 10%
error rate for a duration of 21 hours and 25 minutes when fetching objects
stored in multi-regional buckets in the US.

We apologize for the impact of this incident on your application or
service. The reliability of our service is a top priority and we understand
that we need to do better to ensure that incidents of this type do not
recur.

DETAILED DESCRIPTION OF IMPACT

Between Thursday 12 October 2017 12:47 PDT and Friday 13 October 2017 10:12
PDT, Google Cloud Storage clients located in the Northeast of North America
experienced up to a 10% rate of 503 errors and elevated latency. Some
users experienced higher error rates for brief periods. The incident
impacted only requests to fetch objects stored in multi-regional buckets
in the US; clients were able to mitigate the impact by retrying. Errors
affected 0.03% of total global requests to Cloud Storage.
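As a hedged illustration (not part of the original report), a client-side
retry with exponential backoff and jitter along the following lines would
have masked these transient 503s; the bucket and object names here are
placeholders, and the sketch uses the public GCS JSON API endpoint:

    # Illustrative only: retry a GCS object fetch on transient 503s
    # with exponential backoff plus jitter. Names are placeholders.
    import random
    import time

    import requests

    def fetch_object(bucket, name, max_attempts=5):
        # JSON API media download URL; an authenticated request would
        # also need an Authorization header, and `name` should be
        # URL-encoded if it contains special characters.
        url = ("https://www.googleapis.com/storage/v1/b/%s/o/%s?alt=media"
               % (bucket, name))
        for attempt in range(max_attempts):
            resp = requests.get(url)
            if resp.status_code == 503:
                # Transient server error: back off and try again.
                time.sleep((2 ** attempt) + random.random())
                continue
            resp.raise_for_status()
            return resp.content
        raise RuntimeError("giving up after %d attempts" % max_attempts)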

ROOT CAUSE

Google ensures balanced use of its internal networks by throttling outbound
traffic at the source host in the event of congestion. This incident was
caused by a bug in an earlier version of the job that reads Cloud Storage
objects from disk and streams data to clients. Under high traffic
conditions, the bug caused these jobs to incorrectly throttle outbound
network traffic even though the network was not congested.
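To make the failure mode concrete, here is a minimal, hypothetical sketch
of source-host throttling via a token bucket. Nothing below reflects
Google's actual implementation; the point is that if the congestion
signal feeding such a limiter is derived from traffic volume rather than
real network congestion, heavy but healthy traffic gets cut to the
reduced rate and requests start failing:

    # Hypothetical sketch of outbound throttling at a source host.
    import time

    class TokenBucket:
        """Allows sends at `rate` bytes/sec, bursts up to `capacity`."""
        def __init__(self, rate, capacity):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def try_send(self, nbytes):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return True
            return False  # caller must back off and retry later

    def throttle_rate(base_rate, congested):
        # The bug class described above: if `congested` is wrongly
        # inferred from high traffic alone, the host throttles itself
        # even though the network has spare capacity.
        return base_rate * 0.1 if congested else base_rate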

Google had previously identified this bug and was in the process of rolling
out a fix to all Google datacenters. At the time of the incident, Cloud
Storage jobs in a datacenter in Northeast North America that serves
requests to some Canadian and US clients had not yet received the fix. This
datacenter is not a location for customer buckets
(https://cloud.google.com/storage/docs/bucket-locations), but objects in
multi-regional buckets can be served from instances running in this
datacenter in order to optimize latency for clients.


REMEDIATION AND PREVENTION

The incident was first reported to Google by a customer on Thursday 12
October at 14:59 PDT. Google engineers determined the root cause on Friday
13 October at 09:47 PDT. We redirected Cloud Storage traffic away from the
impacted region at 10:08 PDT, and the incident was resolved at 10:12 PDT.

We have now rolled out the bug fix to all regions. We will also add
external monitoring probes for all regional points of presence so that we
can more quickly detect issues of this type.
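As a rough sketch of such a probe (the URL, cadence, and thresholds are
assumptions for illustration, not Google's actual monitoring code), a
prober running near each point of presence could periodically fetch a
known canary object and report errors and latency:

    # Illustrative black-box probe: fetch a canary object and record
    # status and latency. The object URL is a placeholder.
    import time

    import requests

    CANARY_URL = "https://storage.googleapis.com/example-canary-bucket/ping"

    def probe(timeout_s=5.0):
        start = time.monotonic()
        try:
            resp = requests.get(CANARY_URL, timeout=timeout_s)
            ok = resp.status_code == 200
        except requests.RequestException:
            ok = False
        latency = time.monotonic() - start
        # A real prober would feed these samples to an alerting
        # pipeline; here we simply print them.
        print("ok=%s latency=%.3fs" % (ok, latency))
        return ok, latency

    if __name__ == "__main__":
        while True:
            probe()
            time.sleep(60)  # probe once per minute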