SUMMARY:
On Friday, August 1, 2014, the App Engine Task Queue service delayed execution of some tasks for 70 minutes. If your service or application was affected, we apologize for any inconvenience. We have taken and are taking immediate steps to improve the platform’s performance and availability.
DETAILED DESCRIPTION OF IMPACT:
On Friday, August 1, 2014, from 15:50 to 17:00 US/Pacific, App Engine’s Task Queue service delayed execution of some queued tasks for some applications. During this period, the number of HTTP requests executed by the Task Queue service dropped by 21.7%. Execution of affected tasks was delayed, and cron jobs could not start on time.
ROOT CAUSE:
Google engineers were changing the configuration of the Task Queue service to allocate more resources for each process that manages tasks in the queue. Operator error led to misconfiguration of resource requirements, preventing the Task Queue service from operating in some datacenters until reconfigured properly.
REMEDIATION AND PREVENTION:
To fix the immediate issue, Google engineers directed traffic to datacenters that were not impacted. To prevent a recurrence, Google engineers will enhance our deployment tool to reject this class of misconfiguration when restarting processes. Google engineers will also increase the resources allocated to the Task Queue service so that the service has enough headroom to retry a configuration change.
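
As an illustration only, a pre-restart check of the kind described above might look like the following sketch. This report does not describe how the deployment tool works internally, so every name here (ResourceRequest, DatacenterCapacity, validate_rollout) and every number is hypothetical. The idea is simply to compare the requested per-process resources against each datacenter's capacity before any process is restarted, and to refuse the rollout if any datacenter cannot fit the request.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ResourceRequest:
    # Hypothetical per-process resource requirements.
    cpu_cores: float
    memory_gb: float


@dataclass(frozen=True)
class DatacenterCapacity:
    # Hypothetical capacity available to the service in one datacenter.
    name: str
    cpu_cores: float
    memory_gb: float


def validate_rollout(request: ResourceRequest,
                     process_count: int,
                     datacenters: List[DatacenterCapacity]) -> List[str]:
    """Return the names of datacenters that cannot fit the requested allocation."""
    need_cpu = request.cpu_cores * process_count
    need_mem = request.memory_gb * process_count
    return [dc.name for dc in datacenters
            if need_cpu > dc.cpu_cores or need_mem > dc.memory_gb]


# Example: 30 processes at 2 cores and 4 GB each need 60 cores and 120 GB.
datacenters = [
    DatacenterCapacity("dc-a", cpu_cores=100, memory_gb=400),
    DatacenterCapacity("dc-b", cpu_cores=40, memory_gb=400),
]
rejected = validate_rollout(ResourceRequest(2, 4), 30, datacenters)
if rejected:
    print("Refusing to restart; request does not fit in:", rejected)
```

A check like this fails before any process is restarted, so a mistaken resource requirement would surface as a rejected change instead of a service that cannot start.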