ISSUE SUMMARY
On Wednesday 6 December 2017, the App Engine Memcache service experienced
unavailability for applications in all regions for 1 hour and 50 minutes.
We sincerely apologize for the impact of this incident on your application
or service. We recognize the severity of this incident and will be
undertaking a detailed review to fully understand the ways in which we must
change our systems to prevent a recurrence.
DETAILED DESCRIPTION OF IMPACT
On Wednesday 6 December 2017 from 12:33 to 14:23 PST, the App Engine
Memcache service experienced unavailability for applications in all regions.
Some customers experienced elevated Datastore latency and errors while
Memcache was unavailable. At this time, we believe that all the Datastore
issues were caused by surges of Datastore activity due to Memcache being
unavailable. When Memcache failed, applications that sent a surge of
Datastore operations to specific entities or key ranges may have experienced
Datastore contention or hotspotting, as described in
https://cloud.google.com/datastore/docs/best-practices#designing_for_scale.
When the outage ended, the resulting surge in traffic placed elevated load
on Datastore servers. Some applications in the US experienced elevated
latency on gets between 14:23 and 14:31, and elevated latency on puts
between 14:23 and 15:04.
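The surge pattern above follows from the common cache-aside arrangement:
every cache miss falls back to Datastore, so when Memcache is entirely
unavailable, all reads of a hot entity land on the same Datastore key range
at once. The following is a minimal illustrative sketch, not App Engine's
actual implementation; the class and counter names are hypothetical.

```python
class HotspotDemo:
    """Sketch of cache-aside reads to show why a cache outage surges
    load onto the backing store (all names here are hypothetical)."""

    def __init__(self, cache_up=True):
        self.cache_up = cache_up      # simulates Memcache availability
        self.cache = {}
        self.datastore_reads = 0      # counts reads that reach Datastore

    def datastore_get(self, key):
        # Stand-in for a Datastore read of a single entity.
        self.datastore_reads += 1
        return f"value-for-{key}"

    def get(self, key):
        # Cache-aside: try the cache first, fall back to Datastore on miss.
        if self.cache_up and key in self.cache:
            return self.cache[key]
        value = self.datastore_get(key)
        if self.cache_up:
            self.cache[key] = value
        return value

# With the cache healthy, repeated reads of a hot key hit Datastore once.
healthy = HotspotDemo(cache_up=True)
for _ in range(1000):
    healthy.get("hot-entity")
print(healthy.datastore_reads)   # 1

# With the cache down, every read of the hot entity reaches Datastore,
# producing the contention/hotspotting behaviour described above.
down = HotspotDemo(cache_up=False)
for _ in range(1000):
    down.get("hot-entity")
print(down.datastore_reads)      # 1000
```

The same asymmetry explains the post-outage latency: when Memcache
returned with a cold cache, early requests still missed and briefly kept
Datastore load elevated.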
Customers running Managed VMs experienced failures of all HTTP requests and
App Engine API calls during this incident. Customers using App Engine
Flexible Environment, which is the successor to Managed VMs, were not
impacted.
ROOT CAUSE
The App Engine Memcache service requires a globally consistent view of the
current serving datacenter for each application in order to guarantee
strong consistency when traffic fails over to alternate datacenters. The
configuration which maps applications to datacenters is stored in a global
database.
The incident occurred when the specific database entity that holds the
configuration became unavailable for both reads and writes following a
configuration update. By design, App Engine Memcache treats the
configuration as invalid if it cannot be refreshed within 20 seconds. When
clients could not fetch the configuration, Memcache became unavailable.
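The failure mode can be sketched as a client holding a time-bounded copy of
the mapping: a stale copy cannot guarantee strong consistency across
datacenter failovers, so once refreshes fail for longer than the validity
window, the client must report Memcache as unavailable. This is an
illustrative sketch under that assumption; the class and method names are
hypothetical, not App Engine internals.

```python
CONFIG_MAX_AGE_S = 20  # per the report: config is invalid after 20 seconds

class MappingConfig:
    """Client-side copy of the app-to-datacenter mapping (hypothetical)."""

    def __init__(self):
        self.mapping = None
        self.fetched_at = None

    def refresh(self, fetch_mapping, now):
        # fetch_mapping reads the global database; during the incident the
        # configuration entity was unreadable, so refreshes failed.
        try:
            self.mapping = fetch_mapping()
            self.fetched_at = now
        except IOError:
            pass  # keep the last-known copy; lookup() enforces the age limit

    def lookup(self, app_id, now):
        if self.fetched_at is None or now - self.fetched_at > CONFIG_MAX_AGE_S:
            # A stale mapping cannot guarantee strong consistency, so the
            # client must treat Memcache as unavailable.
            raise RuntimeError("Memcache unavailable: mapping is stale")
        return self.mapping[app_id]

cfg = MappingConfig()
cfg.refresh(lambda: {"my-app": "dc-east1"}, now=0)
print(cfg.lookup("my-app", now=10))   # dc-east1: within the 20 s window

def broken_fetch():
    raise IOError("global database unavailable")

cfg.refresh(broken_fetch, now=25)     # refresh fails; copy is now 25 s old
try:
    cfg.lookup("my-app", now=25)
except RuntimeError as err:
    print(err)                        # Memcache unavailable: mapping is stale
```

The sketch shows why unavailability was global: every client enforces the
same validity window against the same unreadable configuration entity.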
REMEDIATION AND PREVENTION
Google received an automated alert at 12:34. Following normal practices,
our engineers immediately looked for recent changes that may have triggered
the incident. At 12:59, we attempted to revert the latest change to the
configuration file. This configuration rollback required an update to the
configuration in the global database, which also failed. At 14:21,
engineers were able to update the configuration by sending an update
request with a sufficiently long deadline. This caused all replicas of the
database to synchronize and allowed clients to read the mapping
configuration.
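The fix that worked, retrying the write with a longer deadline, can be
sketched as deadline escalation: a contended write that times out at a
short RPC deadline may still complete if the caller waits long enough for
the replicas to synchronize. The function names, deadline values, and the
stand-in RPC below are all hypothetical; the report states only that a
"sufficiently long deadline" succeeded.

```python
def update_with_escalating_deadline(send_update, deadlines_s=(10, 60, 300)):
    """Retry a global-database write, raising the RPC deadline each attempt.

    The deadline schedule is illustrative, not the values used during
    the incident.
    """
    last_err = None
    for deadline in deadlines_s:
        try:
            return send_update(deadline)
        except TimeoutError as err:
            last_err = err
    raise last_err

# Hypothetical stand-in for the real update RPC: the contended write only
# completes if the caller is willing to wait long enough.
def slow_global_write(deadline_s, required_s=120):
    if deadline_s < required_s:
        raise TimeoutError(f"deadline of {deadline_s}s exceeded")
    return "config updated"

print(update_with_escalating_deadline(slow_global_write))  # config updated
```

In this framing, the 12:59 rollback attempt corresponds to the short
deadlines that timed out, and the successful 14:21 update to the final,
longer deadline.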
As a temporary mitigation, we have reduced the number of readers of the
global configuration, which avoids the write contention that led to the
unavailability during this incident. Engineering projects are already
under way to regionalize this configuration and thereby limit the blast
radius of similar failure patterns in the future.