SUMMARY:
On Wednesday 22 April 2015, for 92 minutes, some requests
from European regions to Google App Engine custom domains were redirected
to the Google front page. We apologise to our customers and users who were
affected by this issue, and we have taken and are taking immediate steps to
improve the platform’s availability.
DETAILED DESCRIPTION OF IMPACT:
Starting at 06:37 PDT on Wednesday 22 April, some custom-domain URL
requests from the Europe region were redirected to the www.google.com front
page, or to equivalent national Google front pages, instead of being
dispatched to their target Google App Engine applications.
The incident had two phases. In the first phase, from 06:37 to 07:30, 7.9%
of traffic to custom domains was affected. In the second phase, from 07:30
to 08:09, 13.7% of custom domain traffic was affected. In total,
approximately 0.2% of requests to App Engine were incorrectly redirected
during the incident.
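
The phase durations can be checked directly from the timestamps above. The sketch below also derives a time-weighted average of the affected share of custom-domain traffic. That average is an illustration only: it assumes custom-domain traffic volume was constant throughout the incident, which this report does not state, and the helper name minutes_between is hypothetical.

```python
from datetime import datetime

FMT = "%H:%M"

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# Timestamps (PDT) and percentages as given in this report.
phase1 = minutes_between("06:37", "07:30")  # 53 minutes at 7.9%
phase2 = minutes_between("07:30", "08:09")  # 39 minutes at 13.7%
total = phase1 + phase2                     # 92 minutes

# ASSUMPTION: uniform traffic volume across both phases.
weighted = (phase1 * 7.9 + phase2 * 13.7) / total

print(f"incident duration: {total} min")
print(f"time-weighted share of custom-domain traffic affected: {weighted:.1f}%")
```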
Requests originating outside Europe were not affected, except for a very
small percentage that were routed to the Google network through European
points of presence. Requests to applications via appspot.com domains were
also not affected. The hosting region of the application was not a factor.
ROOT CAUSE:
App Engine custom domains are handled by a system which performs domain
mapping for a number of Google services. In order to increase performance,
capacity and supportability, Google engineers are in the process of
migrating this system's traffic onto Google's general-purpose network
infrastructure.
The outage commenced when a rollout of this integration began in European
datacenters, with a small fraction of custom domain requests being routed
through the general infrastructure. Detailed monitoring was in place for
this migration but, incorrectly, did not include App Engine custom domains.
Due to a configuration error, the migrated App Engine custom domains were
not recognized by the infrastructure, which therefore redirected them to
its default target of the Google front page.
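
The failure mode described above can be illustrated with a minimal sketch. It is not Google's implementation: the names route_request, Route and the example domain and application identifiers are hypothetical, and the real infrastructure is certainly more involved. The point is only that a lookup with a silent default turns a missing configuration entry into a redirect rather than an error.

```python
from dataclasses import dataclass
from typing import Mapping

# The report names the Google front page as the default target.
DEFAULT_TARGET = "https://www.google.com/"

@dataclass(frozen=True)
class Route:
    action: str  # "dispatch" or "redirect"
    target: str

def route_request(host: str, domain_map: Mapping[str, str]) -> Route:
    """Dispatch a known host to its application; otherwise fall back to the default."""
    app = domain_map.get(host.lower())  # hostnames are case-insensitive
    if app is None:
        # Unrecognized host: no error is raised, the request is simply redirected.
        return Route("redirect", DEFAULT_TARGET)
    return Route("dispatch", app)

# Hypothetical data: the migrated configuration lacks the custom-domain entry.
complete_map = {"shop.example.com": "app-engine:shop-app"}
migrated_map = {}

print(route_request("shop.example.com", complete_map))
print(route_request("shop.example.com", migrated_map))
```

Because the fallback path returns a valid response, a failure of this kind does not surface as an error unless something checks the destination explicitly.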
REMEDIATION AND PREVENTION:
At 08:04, the issue was identified and Google engineers immediately
cancelled the rollout, restoring service by 08:09.
To prevent similar issues from reaching production in future, Google
engineers are implementing software release tests to identify the class of
configuration error that triggered the incident.
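
One possible shape for such a release test is a coverage check that compares the domains each service expects to serve against the domains a candidate configuration recognizes. This is a sketch under assumptions: the function name unrecognized_domains, the service labels and the example domains are all hypothetical, and the report does not describe how the actual tests work.

```python
from typing import Dict, Set

def unrecognized_domains(
    expected: Dict[str, Set[str]], recognized: Set[str]
) -> Dict[str, Set[str]]:
    """Return, per service, the expected domains missing from the candidate config."""
    missing: Dict[str, Set[str]] = {}
    for service, domains in expected.items():
        gap = domains - recognized
        if gap:
            missing[service] = gap
    return missing

# Hypothetical inventory of domains by service type.
expected = {
    "app-engine-custom": {"shop.example.com", "blog.example.org"},
    "other-service": {"docs.example.net"},
}

# Hypothetical candidate configuration that omits App Engine custom domains.
candidate = {"docs.example.net"}

gaps = unrecognized_domains(expected, candidate)
if gaps:
    print("release blocked, unrecognized domains:", gaps)
```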
In case similar issues do reach production, Google engineers are extending
rollout testing to include App Engine custom domains so that problematic
rollouts will be detected and cancelled automatically and immediately.
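
A rollout check of this kind could, for example, probe a sample of known custom domains after each rollout step and cancel if any of them lands on the default target. The sketch below assumes a caller-supplied resolve function that returns the final destination for a host. That function, the name should_cancel_rollout and the max_failures threshold are hypothetical and are not taken from the report.

```python
from typing import Callable, Iterable

def should_cancel_rollout(
    probe_hosts: Iterable[str],
    resolve: Callable[[str], str],
    default_target: str,
    max_failures: int = 0,
) -> bool:
    """Return True if more than max_failures probes end at the default target."""
    failures = sum(1 for host in probe_hosts if resolve(host) == default_target)
    return failures > max_failures

# Hypothetical stand-in for a real probe: one host is misrouted to the default.
DEFAULT = "https://www.google.com/"
observed = {
    "shop.example.com": DEFAULT,
    "blog.example.org": "app-engine:blog-app",
}

cancel = should_cancel_rollout(observed.keys(), observed.__getitem__, DEFAULT)
print("cancel rollout:", cancel)
```

Setting max_failures to zero makes a single misrouted probe sufficient to stop the rollout, which matches the stated goal of immediate cancellation. A higher threshold would trade sensitivity for tolerance of flaky probes.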
Finally, continuous monitoring will be added to ensure that all types of
custom domain are being correctly recognized and dispatched by the
infrastructure, so that Google engineers will be rapidly notified if
similar issues recur, regardless of the cause.