SUMMARY:
On Tuesday, 16 June 2015, Google App Engine Task Queue service and App
Engine application deployment experienced increased error rates for a
duration of 3 hours and 25 minutes. If your service or application was
affected, we apologize. We have taken actions to fix the issue and are in
the process of making the system more reliable.
DETAILED DESCRIPTION OF IMPACT:
On Tuesday, 16 June 2015 from 20:10 to 23:35 PDT, some developers of Google
App Engine applications in the US region were unable to deploy their
applications. The overall error rate of deployments during this period was
approximately 60%. Affected developers saw that attempted deployments
would exit and report an internal server error message after HTTP requests
to appengine.google.com timed out. App Engine Admin Console was unable to
load data for affected applications. Additionally, between 20:58 and 21:33,
applications in the US region experienced an increase in error rate of up
to 0.25% as well as slower execution of Task Queue tasks.
ROOT CAUSE:
Google engineers were performing maintenance on a storage system in one of
the datacenters that App Engine uses. During this maintenance, components of
App Engine that depend on this storage system had to use a replica in a
different datacenter. For both deployments and Task Queues, this switch did
not function properly.
REMEDIATION AND PREVENTION:
At 21:33, Google engineers took measures to prevent the Task Queue service
from accessing the storage system under maintenance. In addition,
redirection of all traffic for the affected applications to alternate
datacenters began at 23:26. This was completed by 23:35, and applications were
again able to deploy successfully.
To prevent the issue from recurring, we are working to make deployments and
Task Queue more resilient to movements in the underlying storage
system, in a similar fashion to other App Engine components.