SUMMARY:
On Friday 10 April 2015, attempts to create or update Datastore indexes
failed for some Google App Engine applications for a duration of 148
minutes. In addition, a number of applications retrieved stale data using
eventually consistent read operations for an unexpectedly long period. If
your service or application was affected, we apologize — this is not the
level of quality and reliability we strive to offer you, and we are taking
immediate steps to improve the platform’s performance and availability.
DETAILED DESCRIPTION OF IMPACT:
On Friday 10 April 2015 from 11:30 to 13:58 PDT, 331 requests to create or
update the definition of Datastore composite indexes across 21 applications
failed to complete. In addition, about 34% of applications retrieved stale
data using eventually consistent QUERY or GET operations [1]. Unlike
strongly consistent queries, eventually consistent read operations are
expected to return stale data for a brief period. During this incident,
however, the staleness persisted for much longer than is typical during
normal operation. There was no impact on strongly consistent operations.
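The distinction between the two read modes can be illustrated with a minimal
sketch. This is a toy model, not the Datastore implementation: writes land on
a primary immediately and reach a replica only after a replication delay, so
reads served from the replica return stale data until replication catches up.
All names here (ReplicatedStore, strong_get, eventual_get) are hypothetical.

```python
class ReplicatedStore:
    """Toy model of eventual consistency: writes apply to a primary at
    once and are copied to a replica only after a replication delay."""

    def __init__(self, replication_delay):
        self.replication_delay = replication_delay
        self.primary = {}   # strongly consistent view
        self.replica = {}   # eventually consistent view
        self.pending = []   # (apply_at_time, key, value) awaiting replication

    def put(self, key, value, now):
        self.primary[key] = value
        self.pending.append((now + self.replication_delay, key, value))

    def _catch_up(self, now):
        # Apply every pending write whose replication time has arrived.
        remaining = []
        for apply_at, key, value in self.pending:
            if apply_at <= now:
                self.replica[key] = value
            else:
                remaining.append((apply_at, key, value))
        self.pending = remaining

    def strong_get(self, key, now):
        # Strongly consistent read: always reflects the latest write.
        return self.primary.get(key)

    def eventual_get(self, key, now):
        # Eventually consistent read: may lag behind recent writes.
        self._catch_up(now)
        return self.replica.get(key)


store = ReplicatedStore(replication_delay=5)
store.put("color", "blue", now=0)
store.put("color", "red", now=10)    # update not yet replicated

store.strong_get("color", now=11)    # "red"  (always current)
store.eventual_get("color", now=11)  # "blue" (stale until t=15)
store.eventual_get("color", now=16)  # "red"  (replication caught up)
```

During this incident, the effective replication delay grew far beyond its
usual brief window, so eventually consistent reads stayed stale for an
unusually long time while strongly consistent reads remained correct.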
During the recovery phase of this incident, about 7% of Google App Engine
applications experienced elevated latency on PUT operations for 15 minutes.
ROOT CAUSE:
During a planned maintenance activity, undertaken to create a new Datastore
replica to accommodate organic growth, incorrectly configured automation
created an unnecessarily large table in the new replica. This exhausted
the resources allocated to Datastore and caused write failures to this
replica. Once the underlying problem was resolved, a high volume of writes
was routed to the new replica, resulting in elevated latency for write
operations.
REMEDIATION AND PREVENTION:
At 00:30 PDT on Friday 10 April 2015, an automated alert on resource
depletion was sent out to Google Engineers. However, this alert was
suppressed, as is normal practice when undertaking this type of maintenance
activity. At 11:30 PDT, quota allocated to the replica was exhausted.
Google Engineers were notified by internal teams at 12:53 PDT of problems
with Datastore indexes. At 13:26 PDT, Google Engineers deleted the
problematic large table and started the procedure to reserve additional
quota for this storage replica. This took effect at 13:35 PDT and the
replica started receiving write requests immediately, which caused a brief
increase in latency. Normal operation was restored at 13:58 PDT.
To prevent similar incidents in the future, we are modifying our maintenance
procedures to ensure that the relevant alerts are not suppressed, and that
large tables of this kind are created only under close monitoring.
[1] Details on eventual and strong consistency in Google Cloud Datastore:
https://cloud.google.com/developers/articles/balancing-strong-and-eventual-consistency-with-google-cloud-datastore/#h.tf76fya5nqk8