Google Cloud Platform Status
# ISSUE SUMMARY
On Tuesday 4 September 2018, Google Cloud Storage experienced 1.1% error
rates and increased 99th percentile latency for US multiregion buckets for
a duration of 5 hours 38 minutes. After that time, some customers
experienced 0.1% error rates, which returned to normal progressively over
the subsequent 4 hours. To our Google Cloud Storage customers whose
businesses were impacted during this incident, we sincerely apologize; this
is not the level of tail-end latency and reliability we strive to offer
you. We are taking immediate steps to improve the platform’s performance
and availability.
# DETAILED DESCRIPTION OF IMPACT
On Tuesday 4 September 2018 from 02:55 to 08:33 PDT, customers with buckets
located in the US multiregion experienced a 1.066% error rate and 4.9x
increased 99th percentile latency, with the peak effect occurring between
05:00 PDT and 07:30 PDT for write-heavy workloads. At 08:33 PDT, 99th
percentile latency decreased to 1.4x normal levels and error rates
decreased, initially to 0.146% and subsequently to nominal levels by
12:50 PDT.
# ROOT CAUSE
At the beginning of August, Google Cloud Storage deployed a new feature
which among other things prefetched and cached the location of some
internal metadata. On Monday 3 September at 18:45 PDT, a change in the
underlying metadata storage system resulted in increased load to that
subsystem, which eventually invalidated some cached metadata for US
multiregion buckets. This meant that requests for that metadata experienced
increased latency, or returned an error if the backend RPC timed out. This
additional load on metadata lookups led to elevated error rates and latency
as described above.
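To make the failure mode concrete, here is a minimal sketch of a prefetched metadata-location cache in front of a load-sensitive metadata backend. The class names, latency model, deadline, and numbers are all hypothetical illustrations, not Google Cloud Storage internals.

```python
import time

# Illustrative sketch only: names, numbers, and the load model are
# hypothetical and are not Google Cloud Storage internals.
RPC_DEADLINE_S = 0.05        # assumed backend RPC deadline
WINDOW_S = 1.0               # window over which backend load is measured


class MetadataBackend:
    """Toy metadata store whose latency rises with recent request volume."""

    def __init__(self):
        self.recent = []     # timestamps of recent uncached reads

    def read_location(self, bucket):
        now = time.monotonic()
        self.recent = [t for t in self.recent if now - t < WINDOW_S]
        self.recent.append(now)
        latency = 0.002 * len(self.recent)   # latency grows with load
        if latency > RPC_DEADLINE_S:
            raise TimeoutError("metadata RPC deadline exceeded")
        time.sleep(latency)
        return "placement-for-" + bucket


class LocationCache:
    """Prefetched cache of metadata locations in front of the backend."""

    def __init__(self, backend):
        self.backend = backend
        self.entries = {}

    def get(self, bucket):
        if bucket in self.entries:           # cache hit: backend untouched
            return self.entries[bucket]
        # After a broad invalidation, every request for an affected bucket
        # falls through to the backend at once, concentrating load there.
        location = self.backend.read_location(bucket)
        self.entries[bucket] = location
        return location

    def invalidate(self, buckets):
        for bucket in buckets:
            self.entries.pop(bucket, None)
```

In this toy model, once a broad set of entries is invalidated, every subsequent lookup for an affected bucket reaches the backend at the same time; that load concentration is what produces the increased latency and timed-out RPCs described above.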
# REMEDIATION AND PREVENTION
Google Cloud Storage SREs were alerted automatically once error rates had
risen materially above nominal levels. Additional SRE teams were involved
as soon as the metadata storage system was identified as a likely root
cause of the incident. In order to mitigate the incident, the keyspace that
was suffering degraded performance needed to be identified and isolated so
that it could be given additional resources. This work was completed by
08:33 PDT on 4 September. In parallel, Google Cloud Storage SREs pursued the
source of additional load on the metadata storage system and traced it to
cache invalidations.
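As a rough illustration of the automated alerting described above, the sketch below pages when the error rate over an evaluation window rises materially above a nominal baseline. The baseline and threshold are assumptions, not Google's actual monitoring configuration.

```python
# Illustrative alerting check, not Google's monitoring configuration:
# page once the windowed error rate rises materially above baseline.
NOMINAL_ERROR_RATE = 0.0001   # assumed baseline error rate (0.01%)
ALERT_MULTIPLIER = 10         # assumed paging threshold relative to baseline


def should_page(error_count, request_count):
    """Return True if the windowed error rate warrants paging the on-call SRE."""
    if request_count == 0:
        return False
    error_rate = error_count / request_count
    return error_rate > NOMINAL_ERROR_RATE * ALERT_MULTIPLIER
```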
To prevent this type of incident from recurring, we will expand our load
testing to ensure that performance
degradations are detected prior to reaching production. We will improve our
monitoring diagnostics to ensure that we more rapidly pinpoint the source
of performance degradation. The metadata prefetching algorithm will be
changed to introduce randomness and further reduce the chance of creating
excessive load on the underlying storage system. Finally, we plan to
enhance the storage system to reduce the time needed to identify, isolate,
and mitigate load concentration such as that resulting from cache
invalidations.
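As a sketch of the randomized prefetching change mentioned above: each cache entry is refreshed on its own jittered schedule rather than all at once, so a broad invalidation does not translate into a synchronized burst of metadata lookups. The interval, jitter fraction, and function names below are hypothetical.

```python
import random

# Hypothetical sketch of jittered prefetching; values are assumptions.
BASE_REFRESH_S = 300          # assumed nominal prefetch interval
JITTER_FRACTION = 0.2         # assumed spread of +/-20% around the interval


def next_refresh_delay():
    """Pick a per-entry refresh delay with random jitter applied."""
    jitter = random.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    return BASE_REFRESH_S * (1.0 + jitter)


def schedule_refreshes(buckets, now=0.0):
    """Stagger refresh times so entries do not all refresh simultaneously,
    spreading load on the underlying metadata storage system."""
    return {bucket: now + next_refresh_delay() for bucket in buckets}
```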