Google Cloud Platform Status
Sep 7, 2018
# ISSUE SUMMARY
On Tuesday 4 September 2018, Google Cloud Storage experienced 1.1% error  
rates and increased 99th percentile latency for US multiregion buckets for  
a duration of 5 hours 38 minutes. After that time, some customers  
experienced 0.1% error rates, which returned to normal progressively over  
the subsequent 4 hours. To our Google Cloud Storage customers whose  
businesses were impacted during this incident, we sincerely apologize; this  
is not the level of tail-end latency and reliability we strive to offer  
you. We are taking immediate steps to improve the platform’s performance  
and availability.
# DETAILED DESCRIPTION OF IMPACT
On Tuesday 4 September 2018 from 02:55 to 08:33 PDT, customers with buckets  
located in the US multiregion experienced a 1.066% error rate and a 4.9x  
increase in 99th percentile latency, with the peak effect occurring between  
05:00 PDT and 07:30 PDT for write-heavy workloads. At 08:33 PDT, 99th  
percentile latency decreased to 1.4x normal levels and error rates  
decreased, initially to 0.146% and then subsequently to nominal levels by  
12:50 PDT.
# ROOT CAUSE
At the beginning of August, Google Cloud Storage deployed a new feature  
which, among other things, prefetched and cached the location of some  
internal metadata. On Monday 3 September at 18:45 PDT, a change in the  
underlying metadata storage system resulted in increased load to that  
subsystem, which eventually invalidated some cached metadata for US  
multiregion buckets. This meant that requests for that metadata experienced  
increased latency, or returned an error if the backend RPC timed out. This  
additional load on metadata lookups led to elevated error rates and latency  
as described above.
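
To make the failure mode concrete, here is a minimal, hypothetical Go sketch of a read-through cache for metadata locations (not GCS code; all names are illustrative). When many cached entries are invalidated at once, every request falls through to the backend at the same time, and requests that exceed the RPC deadline surface to callers as errors rather than cache hits.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// lookupBackend stands in for the underlying metadata storage system.
type lookupBackend func(ctx context.Context, key string) (string, error)

type locationCache struct {
	mu      sync.Mutex
	entries map[string]string
	backend lookupBackend
	timeout time.Duration
}

// Get returns the cached location if present; otherwise it performs a
// backend RPC bounded by a deadline. A timed-out RPC becomes a caller-visible
// error, which is how elevated backend load turns into elevated error rates.
func (c *locationCache) Get(ctx context.Context, key string) (string, error) {
	c.mu.Lock()
	if loc, ok := c.entries[key]; ok {
		c.mu.Unlock()
		return loc, nil
	}
	c.mu.Unlock()

	rpcCtx, cancel := context.WithTimeout(ctx, c.timeout)
	defer cancel()
	loc, err := c.backend(rpcCtx, key)
	if err != nil {
		return "", fmt.Errorf("metadata lookup for %q failed: %w", key, err)
	}
	c.mu.Lock()
	c.entries[key] = loc
	c.mu.Unlock()
	return loc, nil
}

// InvalidateAll models the mass invalidation: the next request for every key
// must pay the backend round trip, concentrating load on the storage system.
func (c *locationCache) InvalidateAll() {
	c.mu.Lock()
	c.entries = make(map[string]string)
	c.mu.Unlock()
}

func main() {
	slowBackend := func(ctx context.Context, key string) (string, error) {
		select {
		case <-time.After(200 * time.Millisecond): // overloaded backend
			return "cell-a/" + key, nil
		case <-ctx.Done():
			return "", errors.New("backend RPC deadline exceeded")
		}
	}
	cache := &locationCache{
		entries: map[string]string{"bucket-1": "cell-a/bucket-1"},
		backend: slowBackend,
		timeout: 100 * time.Millisecond,
	}

	// Cache hit: fast, no backend load.
	fmt.Println(cache.Get(context.Background(), "bucket-1"))

	// After invalidation, the same request must hit the overloaded backend
	// and times out, surfacing as an error to the caller.
	cache.InvalidateAll()
	fmt.Println(cache.Get(context.Background(), "bucket-1"))
}
```

In this model, the error rate tracks the fraction of lookups that miss the cache while the backend is overloaded, which matches the pattern of elevated errors and latency described above.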
# REMEDIATION AND PREVENTION
Google Cloud Storage SREs were alerted automatically once error rates had  
risen materially above nominal levels. Additional SRE teams were involved  
as soon as the metadata storage system was identified as a likely root  
cause of the incident. In order to mitigate the incident, the keyspace that  
was suffering degraded performance needed to be identified and isolated so  
that it could be given additional resources. This work was completed by  
08:33 PDT on 4 September. In parallel, Google Cloud Storage SREs pursued the  
source of additional load on the metadata storage system and traced it to  
cache invalidations.
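
As an illustration of the mitigation step, the sketch below (hypothetical names and thresholds, not GCS internals) flags key ranges whose tail latency is far above a baseline so they can be isolated and given additional resources.

```go
package main

import "fmt"

// rangeStats is a hypothetical per-key-range performance record.
type rangeStats struct {
	KeyRange  string
	P99Millis float64
}

// degradedRanges returns the key ranges whose 99th percentile latency exceeds
// the given multiple of the baseline, i.e. the candidates to isolate and give
// additional resources.
func degradedRanges(stats []rangeStats, baselineMillis, multiple float64) []string {
	var out []string
	for _, s := range stats {
		if s.P99Millis > baselineMillis*multiple {
			out = append(out, s.KeyRange)
		}
	}
	return out
}

func main() {
	stats := []rangeStats{
		{"buckets/aa-af", 45},
		{"buckets/ag-al", 220}, // roughly the 4.9x degradation seen in the incident
		{"buckets/am-ar", 50},
	}
	fmt.Println(degradedRanges(stats, 45, 3)) // [buckets/ag-al]
}
```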
In order to prevent this type of incident from recurring, we will expand  
our load testing to ensure that performance  
degradations are detected prior to reaching production. We will improve our  
monitoring diagnostics to ensure that we more rapidly pinpoint the source  
of performance degradation. The metadata prefetching algorithm will be  
changed to introduce randomness and further reduce the chance of creating  
excessive load on the underlying storage system. Finally, we plan to  
enhance the storage system to reduce the time needed to identify, isolate,  
and mitigate load concentration such as that resulting from cache  
invalidations.
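
The "introduce randomness" change can be illustrated with a small Go sketch (again hypothetical, not the actual GCS algorithm): scheduling each prefetch refresh at a base interval plus random jitter spreads out reloads that would otherwise hit the metadata storage system at the same instant.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextRefresh schedules a prefetched entry's next refresh at the base
// interval plus up to 20% random jitter, so entries populated together do
// not all reload against the metadata storage system at the same moment.
func nextRefresh(base time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(base / 5)))
	return base + jitter
}

func main() {
	base := 10 * time.Minute
	// Ten entries prefetched at the same moment get spread-out refresh times
	// instead of producing a synchronized burst of backend lookups.
	for i := 0; i < 10; i++ {
		fmt.Printf("entry %d refreshes in %v\n", i, nextRefresh(base))
	}
}
```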