Elevated latency in deploying applications to App Engine

Google Cloud Platform Status

Feb 24, 2015, 6:24:50 AM
to google-appengine...@googlegroups.com
Starting on Tuesday, 2015-02-24 at 00:00, Google App Engine showed increased
latency when deploying applications, possibly leading to timeouts. This
incident with Google App Engine deployment was resolved as of Tuesday,
2015-02-24 01:54 (all times are in US/Pacific). We apologize for the
inconvenience and thank you for your patience and continued support. Please
rest assured that system reliability is a top priority at Google, and we
are making continuous improvements to make our systems better.

Google Cloud Platform Status

Feb 25, 2015, 12:48:04 PM
to google-appengine...@googlegroups.com
SUMMARY:

On Monday 23 February and Tuesday 24 February 2015, some deployments of App
Engine applications failed during two periods totaling 356 minutes. We realize that you
depend on this service and we apologize if you were affected by this
incident. We are taking steps to ensure that incidents of this nature will
not happen again.

DETAILED DESCRIPTION OF IMPACT:

On Monday 23 February, 60% of deployments failed between 11:00 and 14:15
PST. On Tuesday 24 February 2015, 80% of deployments failed
from 00:05 to 01:47. After that the rate of deployment failures decreased
linearly until the incident ended at 02:46.

ROOT CAUSE:

The App Engine 1.9.18 release contains an enabling change for future
scalability improvements which requires an update to the settings for all
applications across the global serving infrastructure. The fan out for this
change is handled by Google’s internal Pub/Sub infrastructure. Posting an
update for every one of App Engine’s large number of applications resulted
in throttling of messages by this infrastructure. As a result, deployment
messages were blocked behind these updates, resulting in timeouts.
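
The blocking effect described above can be illustrated with a toy model. This is only a sketch of head-of-line blocking in a rate-limited FIFO queue, not a description of Google's internal Pub/Sub system. The function name, the message labels, the queue sizes, and the rates are all hypothetical values chosen for illustration.

```python
from collections import deque

def drain_tick(messages, rate_per_tick, target):
    """Return the tick at which `target` leaves a FIFO queue.

    The queue is drained at a fixed `rate_per_tick`, which stands in
    for a throttle limit. Returns None if `target` is never found.
    """
    queue = deque(messages)
    tick = 0
    while queue:
        tick += 1
        for _ in range(min(rate_per_tick, len(queue))):
            if queue.popleft() == target:
                return tick
    return None

# Hypothetical workload: 1000 settings updates queued ahead of one deployment.
backlog = ["settings-update"] * 1000 + ["deploy"]

# With a low throttle limit the deployment waits behind every update.
print(drain_tick(backlog, rate_per_tick=10, target="deploy"))

# Raising the limit shortens the wait for the same backlog.
print(drain_tick(backlog, rate_per_tick=100, target="deploy"))

# With no backlog the deployment goes out on the first tick.
print(drain_tick(["deploy"], rate_per_tick=10, target="deploy"))
```

In this model a deployment "times out" whenever its drain tick exceeds whatever deadline the caller allows, which is why either shrinking the backlog or raising the throttle limit clears the failures.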

REMEDIATION AND PREVENTION:

App Engine began to update application settings on Monday 23 February at
10:47. The deployment failures began at 11:00. Our engineers detected the
problem at 13:06 and we began to investigate the root cause. The incident
resolved itself at 14:15.

We then made a further update on Monday 23 February at 23:42. The
deployment failures began on Tuesday 24 February at 00:05. Our engineers
detected the issue at 01:08, and diagnosed the root cause at 01:11. The
Pub/Sub infrastructure's throttle limit was increased at 01:47 and the
incident ended at 02:46.
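
The durations implied by the two timelines above can be checked with a few lines of arithmetic. This is a minimal sketch using only the timestamps stated in this report. The helper name and the variable names are illustrative, and all times are treated as being in the same time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

def minutes_between(start, end):
    """Whole minutes elapsed between two 'YYYY-MM-DD HH:MM' timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# Length of each period of deployment failures.
first_outage = minutes_between("2015-02-23 11:00", "2015-02-23 14:15")
second_outage = minutes_between("2015-02-24 00:05", "2015-02-24 02:46")
total_outage = first_outage + second_outage

# Time from the start of failures to detection by engineers.
first_detection = minutes_between("2015-02-23 11:00", "2015-02-23 13:06")
second_detection = minutes_between("2015-02-24 00:05", "2015-02-24 01:08")

print(first_outage, second_outage, total_outage)
print(first_detection, second_detection)
```

The two outage windows sum to the 356 minutes quoted in the summary.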

We have now increased App Engine’s throttle limits for the Pub/Sub
infrastructure and added an alert to our monitoring systems that will be
immediately triggered if this type of event recurs.