Google Cloud Platform Status
Sep 19, 2015, 1:00:48 PM
to google-appengine...@googlegroups.com
SUMMARY:
On Thursday 17 September 2015, Google App Engine experienced increased
latency and HTTP errors for 1 hour 28 minutes. We apologize to our
customers who were affected by this issue. This is not the level of
quality and reliability we strive to offer you, and we are taking immediate
steps to prevent similar issues from occurring in future.
DETAILED DESCRIPTION OF IMPACT:
On Thursday 17 September 2015 from 12:40 to 14:08 PDT, fewer than 0.01% of
applications using Google App Engine experienced elevated latency, elevated
HTTP error rates, and failures of the memcache API. The Google Developers
Console was also affected and experienced timeouts during this period.
ROOT CAUSE:
An unhealthy Managed VMs application triggered an excessive number of
retries in the App Engine infrastructure in a single datacenter. App
Engine's serving stack automatically detected the overload, and diverted
the majority of traffic to an alternate datacenter. Memcache was
unavailable for apps which were diverted in this manner; this increased
latency for those apps. Latency was also increased by the need to create
new instances to run those apps in the alternate datacenter. Traffic which
was not diverted experienced errors due to the overload.
REMEDIATION AND PREVENTION:
At 12:47 PDT, Google engineers were automatically alerted to increasing
latency, followed by an elevated error rate, for App Engine, and started
investigating the root cause of the issue. The incident was resolved at
14:08 PDT.
Google engineers are rolling out a fix which curbs the excessive number of
retries that caused this incident. Additionally, the team is implementing
improved monitoring to reduce the time taken to detect and isolate
problematic workloads.
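
As an illustration of what curbing excessive retries can look like in
general, the sketch below combines capped exponential backoff with a
shared retry budget. It is not the fix Google deployed, and the report
does not describe that fix in any detail. The names RetryBudget and
call_with_retries, and the ratio, attempt, and delay values, are all
hypothetical and chosen only for the example.

```python
import random
import time


class RetryBudget:
    """Hypothetical budget: retries may not exceed a fraction of requests."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio              # assumed: retries allowed per request
        self.min_retries = min_retries  # assumed floor so a cold client can retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Consume one retry if the budget allows it; return True on success."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def call_with_retries(operation, budget, max_attempts=3, base_delay=0.05,
                      max_delay=1.0, sleep=time.sleep):
    """Run operation, retrying on failure only while the budget permits."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            is_last_attempt = attempt == max_attempts - 1
            # Give up once attempts or the shared budget are exhausted, so
            # an unhealthy backend is not hit with unbounded retry traffic.
            if is_last_attempt or not budget.try_spend():
                raise
            # Capped exponential backoff with full jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

With a shared budget, a backend that fails every request sees only
slightly more traffic than the original request rate, instead of
max_attempts times as much. The per-call attempt cap alone would not give
that bound, which is why the sketch uses both.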