Google App Engine issues on March 18, 2014


Jose Montes de Oca

Mar 18, 2014, 3:57:24 PM
to google-appengine...@googlegroups.com
We're investigating an issue with Google App Engine Datastore that began on Tuesday, 2014-03-18 at 11:45 AM US/Pacific. We will provide more information shortly.

Jose Montes de Oca

Mar 18, 2014, 5:13:04 PM
to google-appengine...@googlegroups.com
The problem with Google App Engine Datastore was resolved as of Tuesday, 2014-03-18 at 12:30 PM US/Pacific. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are continuously improving our systems.

Google App Engine Downtime Notify

Mar 25, 2014, 7:34:17 PM
to google-appengine...@googlegroups.com

SUMMARY:

On Tuesday 18 March 2014, Google App Engine (GAE) applications in US datacenters experienced elevated latency for some Datastore API calls for one hour and 31 minutes. If your service or application was affected, we apologize; this is not the level of reliability and performance we strive to offer you, and we have taken immediate steps, and are taking further steps, to improve the platform’s performance and availability.


DETAILED DESCRIPTION OF IMPACT:

On Tuesday 18 March 2014, applications in US datacenters experienced elevated latency for Datastore API calls from 11:47 AM to 1:18 PM US/Pacific. The period of highest impact was 11:47 AM to 12:32 PM. The actual impact experienced by an application depended heavily on factors such as the number of entities read or written in each call, the size of those entities, the datacenter hosting the application, and possible contention introduced by the volume of requests hitting the application. As an example of the impact, the latency at the 95th percentile for writes of single entities smaller than 1000 bytes with five or fewer index updates spiked to approximately twice normal levels for some applications.
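
For readers less familiar with how entity size and index updates drive write cost, the following sketch (not part of the original report; the model and property names are hypothetical) shows the kind of single-entity write with a handful of index updates described above, using the Python NDB API that App Engine offered at the time:

# Illustrative sketch only; hypothetical model and property names.
from google.appengine.ext import ndb

class Checkout(ndb.Model):
    # Each indexed property adds index entries that must be updated on
    # every write; unindexed properties do not.
    user_id = ndb.StringProperty()    # indexed by default
    status = ndb.StringProperty()     # indexed by default
    payload = ndb.TextProperty()      # TextProperty is never indexed

def record_checkout(user_id, status, payload):
    # A single small entity with few index updates, as in the 95th
    # percentile figure quoted above. put() blocks until the write is
    # durable, so elevated Datastore write latency shows up directly
    # as request latency here.
    return Checkout(user_id=user_id, status=status, payload=payload).put()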


ROOT CAUSE:

The incident occurred because two of the Megastore replicas used by the App Engine Datastore were unavailable. One replica was taken offline due to a failure in the underlying storage layer. A second replica was unreachable due to a fiber cut that reduced network capacity. To ensure high availability, the Datastore writes to a quorum of replicas before returning success. With fewer replicas available, small variations in the performance of the remaining replicas have a larger effect on overall write latency.
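
To illustrate why losing replicas amplifies tail latency, consider a simplified quorum-write model (a sketch for intuition only, not Google's Megastore implementation): a write completes once a majority of replicas have acknowledged it, so its latency is the quorum-th fastest replica's latency, and with fewer healthy replicas a single slow replica is far more likely to sit on the critical path.

# Simplified simulation for intuition only; latencies and probabilities
# are made up and do not describe Megastore.
import random

def quorum_write_latency(replica_latencies_ms):
    # The write returns once a majority has acknowledged, so its latency
    # is the quorum-th smallest replica latency.
    quorum = len(replica_latencies_ms) // 2 + 1
    return sorted(replica_latencies_ms)[quorum - 1]

def simulate(num_replicas, trials=10000):
    samples = []
    for _ in range(trials):
        # Each replica is usually fast (5 ms) but occasionally slow (50 ms).
        latencies = [random.choice([5, 5, 5, 5, 50]) for _ in range(num_replicas)]
        samples.append(quorum_write_latency(latencies))
    samples.sort()
    return samples[trials // 2], samples[int(trials * 0.95)]

# With 5 replicas, the 3-replica quorum can usually route around a slow
# replica; with only 3 replicas, the 2-replica quorum hits slow replicas
# much more often, so the 95th-percentile latency jumps.
print('5 replicas (p50, p95):', simulate(5))
print('3 replicas (p50, p95):', simulate(3))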


REMEDIATION AND PREVENTION:

The storage layer failure occurred in a datacenter that was hosting both the Datastore and App Engine applications. We redirected traffic to other datacenters, which caused applications to spin up new instances there and also flushed Memcache for the affected applications. Our engineers have identified a mitigation strategy for the storage layer issue and are working on diagnosing its root cause.
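
For application owners wondering what the Memcache flush meant in practice: a flush empties the cache, so reads that would normally be served from Memcache fall through to the Datastore until the cache is repopulated. The sketch below (hypothetical model and key names, showing a generic cache-aside pattern rather than anything specific to this incident) illustrates that behavior with the App Engine memcache and NDB APIs:

# Illustrative cache-aside read; hypothetical model and key names.
from google.appengine.api import memcache
from google.appengine.ext import ndb

class Profile(ndb.Model):
    display_name = ndb.StringProperty()

def get_profile(profile_id):
    cache_key = 'profile:%s' % profile_id
    cached = memcache.get(cache_key)
    if cached is not None:
        return cached
    # Cache miss (for example, right after a flush): read from the
    # Datastore and repopulate the cache for five minutes.
    profile = Profile.get_by_id(profile_id)
    if profile is not None:
        memcache.set(cache_key, profile, time=300)
    return profile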


The network failure affected a datacenter that was hosting the Datastore but not any applications. Our network operations team restored the fiber link to resolve this issue. In response to this incident, we are developing a mechanism to redirect traffic more quickly in the event of networking failures, in order to reduce recovery time. The network team is also re-evaluating the network capacity required in the region, to prevent single- and dual-path failures from affecting Datastore performance in the future.

