Investigating incident with App Engine and Memcache.

Google Cloud Platform Status

Nov 6, 2017, 4:11:59 PM
to google-appengine...@googlegroups.com
We are investigating an issue with Google App Engine and Memcache. We will
provide more information by 13:30 US/Pacific.

Google Cloud Platform Status

Nov 6, 2017, 4:31:44 PM
to google-appengine...@googlegroups.com
We are experiencing an issue with Memcache availability beginning on
November 6, 2017 at 12:30 pm US/Pacific.
Current data indicates that all projects using Memcache are affected by
this issue.
For everyone who is affected, we apologize for any inconvenience you may be
experiencing.

We will provide an update by 14:00 US/Pacific with current details.

Google Cloud Platform Status

Nov 6, 2017, 4:57:41 PM
to google-appengine...@googlegroups.com
We are experiencing an issue with Memcache availability beginning on
November 6, 2017 at 12:30 pm US/Pacific.
Current data indicates that all projects using Memcache are affected by
this issue.
For everyone who is affected, we apologize for any inconvenience you may be
experiencing.

We will provide an update by 14:30 US/Pacific with current details.

Google Cloud Platform Status

Nov 6, 2017, 5:31:14 PM
to google-appengine...@googlegroups.com
We are experiencing an issue with Memcache availability beginning on
November 6, 2017 at 12:30 pm US/Pacific.
Our Engineering Team believes it has identified the root cause of the
errors and is working on mitigation.

We will provide an update by 15:00 US/Pacific with current details.

Google Cloud Platform Status

Nov 6, 2017, 5:44:47 PM
to google-appengine...@googlegroups.com
We are experiencing an issue with Memcache availability beginning on
November 6, 2017 at 12:30 pm US/Pacific.
At this time we are gradually ramping up traffic to Memcache and we see
that the rate of errors is decreasing.
Other services affected by the outage, such as Managed VM (MVM) instances,
should return to normal in the near future.

We will provide an update by 15:15 US/Pacific with current details.

Google Cloud Platform Status

Nov 6, 2017, 6:08:48 PM
to google-appengine...@googlegroups.com
The issue with Memcache and MVM availability should be resolved for the
majority of projects and we expect a full resolution in the near future.

We will provide an update by 15:30 US/Pacific with current details.

Google Cloud Platform Status

Nov 6, 2017, 6:27:07 PM
to google-appengine...@googlegroups.com
The Memcache service is still recovering from the outage. The rate of
errors continues to decrease and we expect a full resolution of this
incident in the near future.

We will provide an update by 16:00 US/Pacific with current details.

Google Cloud Platform Status

Nov 6, 2017, 6:55:54 PM
to google-appengine...@googlegroups.com
The issue with Memcache availability has been resolved for all affected
projects as of 15:30 US/Pacific.
We will conduct an internal investigation of this issue and make
appropriate improvements to our systems to help prevent or minimize future
recurrence.
We will provide a more detailed analysis of this incident once we have
completed our internal investigation.

This is the final update for this incident.

Google Cloud Platform Status

Nov 7, 2017, 1:59:38 PM
to google-appengine...@googlegroups.com
ISSUE SUMMARY

On Monday 6 November 2017, the App Engine Memcache service experienced
unavailability for applications in all regions for 1 hour and 50 minutes.

We sincerely apologize for the impact of this incident on your application
or service. We recognize the severity of this incident and will be
undertaking a detailed review to fully understand the ways in which we must
change our systems to prevent a recurrence.

DETAILED DESCRIPTION OF IMPACT

On Monday 6 November 2017 from 12:33 to 14:23 PST, the App Engine
Memcache service experienced unavailability for applications in all regions.

Some customers experienced elevated Datastore latency and errors while
Memcache was unavailable. At this time, we believe that all the Datastore
issues were caused by surges of Datastore activity due to Memcache being
unavailable. When Memcache failed, if an application sent a surge of
Datastore operations to specific entities or key ranges, then Datastore may
have experienced contention or hotspotting, as described in
https://cloud.google.com/datastore/docs/best-practices#designing_for_scale.
Datastore experienced elevated load on its servers when the outage ended
due to a surge in traffic. Some applications in the US experienced elevated
latency on gets between 14:23 and 14:31, and elevated latency on puts
between 14:23 and 15:04.
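
The fallback behavior described above follows naturally from the cache-aside
pattern most App Engine applications use in front of Datastore. The sketch
below is illustrative only, written for the Python 2 standard environment;
the Counter model and key names are assumptions, not taken from any affected
application. When every memcache.get() call comes back empty, all traffic
falls through to Datastore at once, and if that traffic concentrates on a few
entities or a narrow key range it produces exactly the contention and
hotspotting linked above.

    # Illustrative cache-aside read on the App Engine standard environment
    # (Python 2 runtime). Counter is a hypothetical ndb model used only to
    # show the pattern; it is not part of the incident report.
    from google.appengine.api import memcache
    from google.appengine.ext import ndb


    class Counter(ndb.Model):
        value = ndb.IntegerProperty(default=0)


    def get_counter(counter_id):
        cache_key = 'counter:%s' % counter_id
        # memcache.get() returns None on a miss, and the client also degrades
        # to None when the Memcache service itself is unavailable.
        cached = memcache.get(cache_key)
        if cached is not None:
            return cached
        # Cache miss: the read falls through to Datastore. During the outage
        # this branch fired for effectively all requests at the same time,
        # which is the surge of Datastore operations described above.
        entity = Counter.get_by_id(counter_id)
        value = entity.value if entity else 0
        # Best-effort re-population; set() fails silently if Memcache is down.
        memcache.set(cache_key, value, time=60)
        return value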

Customers running Managed VMs experienced failures of all HTTP requests and
App Engine API calls during this incident. Customers using App Engine
Flexible Environment, which is the successor to Managed VMs, were not
impacted.

ROOT CAUSE

The App Engine Memcache service requires a globally consistent view of the
current serving datacenter for each application in order to guarantee
strong consistency when traffic fails over to alternate datacenters. The
configuration which maps applications to datacenters is stored in a global
database.

The incident occurred when the specific database entity that holds the
configuration became unavailable for both reads and writes following a
configuration update. App Engine Memcache is designed in such a way that
the configuration is considered invalid if it cannot be refreshed within 20
seconds. When the configuration could not be fetched by clients, Memcache
became unavailable.
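
As a rough illustration of that staleness rule (the names below, such as
ConfigCache and fetch_mapping, are assumptions rather than Google's actual
implementation): each client keeps the last mapping it fetched and treats it
as invalid once it cannot be refreshed within 20 seconds, so a global
database that stops serving reads causes every client to discard its
configuration at roughly the same time.

    # Hypothetical sketch of a "valid only if refreshed within 20 seconds"
    # rule. ConfigCache and fetch_mapping are illustrative names, not a real
    # internal API.
    import time

    MAX_STALENESS_SECONDS = 20


    class ConfigCache(object):
        def __init__(self, fetch_mapping):
            # fetch_mapping is a callable that reads the app-to-datacenter
            # mapping from the global database.
            self._fetch_mapping = fetch_mapping
            self._mapping = None
            self._fetched_at = 0.0

        def get_mapping(self):
            if time.time() - self._fetched_at > MAX_STALENESS_SECONDS:
                try:
                    self._mapping = self._fetch_mapping()
                    self._fetched_at = time.time()
                except Exception:
                    # The refresh failed and the cached copy is already older
                    # than the deadline, so it can no longer be trusted.
                    self._mapping = None
            if self._mapping is None:
                raise RuntimeError('serving configuration unavailable')
            return self._mapping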

REMEDIATION AND PREVENTION

Google received an automated alert at 12:34. Following normal practices,
our engineers immediately looked for recent changes that may have triggered
the incident. At 12:59, we attempted to revert the latest change to the
configuration file. This configuration rollback required an update to the
configuration in the global database, which also failed. At 14:21,
engineers were able to update the configuration by sending an update
request with a sufficiently long deadline. This caused all replicas of the
database to synchronize and allowed clients to read the mapping
configuration.
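
The deadline detail is the key step here: the configuration write itself was
viable, it simply needed more time than a default RPC deadline allowed for
all database replicas to synchronize. A minimal, purely hypothetical sketch
of that idea (client.update_config is an assumed interface, not a real API):

    # Hypothetical illustration of retrying a contended write with
    # progressively longer deadlines; client.update_config is assumed.
    def update_with_longer_deadlines(client, new_config,
                                     deadlines_seconds=(5, 30, 120)):
        last_error = None
        for deadline in deadlines_seconds:
            try:
                # A contended write may only succeed once the deadline is
                # long enough for the replicas to reach agreement.
                return client.update_config(new_config, deadline=deadline)
            except Exception as exc:  # e.g. a DeadlineExceeded error
                last_error = exc
        raise last_error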

As a temporary mitigation, we have reduced the number of readers of the
global configuration, which avoids the write contention that led to the
unavailability during the incident. Engineering projects are already
under way to regionalize this configuration and thereby limit the blast
radius of similar failure patterns in the future.