App Engine increased 500 errors

185 views
Skip to first unread message

Google Cloud Platform Status

unread,
Sep 17, 2015, 6:17:59 PM9/17/15
to google-appengine...@googlegroups.com
We are investigating reports of increased 500 errors with Google App Engine
beginning at Thursday, 2015-09-17 12:40 US/Pacific. The issue has been
resolved at 14:08 US/Pacific.

We will conduct an internal investigation of this issue and make
appropriate improvements to our systems to prevent or minimize future
recurrence. We will provide a more detailed analysis of this incident once
we have completed our internal investigation.

Google Cloud Platform Status

unread,
Sep 19, 2015, 1:00:48 PM9/19/15
to google-appengine...@googlegroups.com
SUMMARY:

On Thursday 17 September 2015, Google App Engine experienced increased
latency and HTTP errors for 1 hour 28 minutes. We apologize to our
customers who were affected by this issue. This is not the level of
quality and reliability we strive to offer you, and we are taking immediate
steps to prevent similar issues from occurring in future.

DETAILED DESCRIPTION OF IMPACT:

On Thursday 17 September 2015 from 12:40 to 14:08 PDT, <0.01% of
applications using Google App Engine experienced elevated latencies, HTTP
error rates, and failures for the memcache API. The Google Developers
Console was also affected and experienced timeouts during the time.

ROOT CAUSE:

An unhealthy Managed VMs application triggered an excessive number of
retries in the App Engine infrastructure in a single datacenter. App
Engine's serving stack automatically detected the overload, and diverted
the majority of traffic to an alternate datacenter. Memcache was
unavailable for apps which were diverted in this manner; this increased
latency for those apps. Latency was also increased by the need to create
new instances to run those apps in the alternate datacenter. Traffic which
was not diverted experienced errors due to the overload.

REMEDIATION AND PREVENTION:

At 12:47, Google engineers were automatically alerted to increasing
latency, followed by elevated error rate, for App Engine, and started
investigating the root cause of the issue. The incident was resolved at
14:08.

Google engineers are rolling out a fix which curbs the excessive number of
retries that caused this incident. Additionally, the team is implementing
improved monitoring to reduce the time taken to detect and isolate
problematic workloads.
Reply all
Reply to author
Forward
0 new messages