Google Cloud Platform Status
Jan 7, 2019, 8:25:18 PM
On Wednesday 2 January, 2019, application creation in Google App Engine
(App Engine), first-time deployment of Google Cloud Functions (Cloud
Functions) per region, and project creation & API management in Cloud
Console experienced elevated error rates ranging from 71% to 100% for a
duration of 3 hours, 40 minutes starting at 14:40 PST. Workloads already
running on App Engine and Cloud Functions, including deployment of new
versions of applications and functions, as well as ongoing use of existing
projects and activated APIs, were not impacted.
We know that many customers depend on the ability to create new Cloud
projects, applications & functions, and apologize for our failure to
provide this to you during this time. The root cause of the incident is
fully understood and engineering efforts are underway to ensure the issue
is not at risk of recurrence.
DETAILED DESCRIPTION OF IMPACT
On Wednesday 2 January, 2019 from 14:40 PST to 18:20 PST, application
creation in App Engine, first-time deployments of Cloud Functions, and
project creation & API auto-enablement in Cloud Console experienced
elevated error rates in all regions due to a recently deployed
configuration update to the underlying control plane for all impacted services.
First-time deployments of new Cloud Functions failed. Redeployments of
existing Cloud Functions and workloads on already deployed Cloud Functions
were not impacted.
App Engine app creation experienced an error rate of 98%. Workloads for
deployed App Engine applications were not impacted.
Cloud API enable requests experienced a 97% average error rate while
disable requests had a 71% average error rate. Affected users observed
these errors when attempting to enable an API via the Cloud Console and API
ROOT CAUSE
The control plane responsible for managing new app creations in App Engine,
new function deployments in Cloud Functions, and project creation & API
management in Cloud Console utilizes a metadata store. This metadata store
is responsible for persisting and processing new project creations,
function deployments, App Engine applications, and API enablements.
Google engineers began rolling out a new feature designed to improve the
fault-tolerance of the metadata store. The rollout had been successful in
test environments, but an unexpected difference in configuration in
production triggered a bug. The bug caused writes to the metadata store to
fail.
REMEDIATION AND PREVENTION
Google engineers were automatically alerted to the elevated error rate
within 3 minutes of the incident start and immediately began their
investigation.
At 15:17, an issue with our metadata store was identified as the root
cause, and mitigation work began. An initial mitigation was applied, but
automation intentionally slowed the rollout of this mitigation to minimize
risks to production. To reduce time to resolution, Google engineers
developed and implemented a new mitigation. The metadata store became fully
available at 18:20.
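The timeline above can be checked with a little arithmetic. The following is a minimal sketch that uses only the timestamps quoted in this report (14:40 incident start, 15:17 root cause identified, 18:20 full availability, all PST); the variable and function names are illustrative and not part of any Google tooling.

```python
from datetime import datetime

# Timestamps quoted in the report (all PST, 2 January 2019).
FMT = "%Y-%m-%d %H:%M"
incident_start = datetime.strptime("2019-01-02 14:40", FMT)
root_cause_identified = datetime.strptime("2019-01-02 15:17", FMT)
fully_available = datetime.strptime("2019-01-02 18:20", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


time_to_identify = minutes_between(incident_start, root_cause_identified)
time_to_resolve = minutes_between(incident_start, fully_available)

print(f"Start to root cause identified: {time_to_identify} min")
print(
    f"Start to full availability: {time_to_resolve} min "
    f"({time_to_resolve // 60} h {time_to_resolve % 60} min)"
)
```

The second figure works out to 3 hours, 40 minutes, which matches the incident duration stated in the summary.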
To prevent a recurrence, we will add validation of the metadata store’s
schemas and ensure that the metadata store configuration validated in test
environments matches production.
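One way to picture the "test matches production" check is a simple configuration diff that runs before a rollout. The sketch below is an illustration only: the report does not describe the metadata store's actual configuration format, so the flat dictionaries, the key names (`schema_version`, `replicas`, `legacy_mode`, `strict_writes`), and the `config_drift` helper are all hypothetical.

```python
from typing import Any, Dict, Tuple

# Sentinel used when a key exists in only one environment.
MISSING = "<missing>"


def config_drift(
    test_cfg: Dict[str, Any], prod_cfg: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return keys whose values differ, mapped to (test_value, prod_value)."""
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(test_cfg.keys() | prod_cfg.keys()):
        test_value = test_cfg.get(key, MISSING)
        prod_value = prod_cfg.get(key, MISSING)
        if test_value != prod_value:
            drift[key] = (test_value, prod_value)
    return drift


# Hypothetical configurations, for illustration only.
test_cfg = {"schema_version": 3, "replicas": 3, "strict_writes": True}
prod_cfg = {
    "schema_version": 3,
    "replicas": 5,
    "legacy_mode": True,
    "strict_writes": True,
}

drift = config_drift(test_cfg, prod_cfg)
for key, (test_value, prod_value) in drift.items():
    print(f"{key}: test={test_value!r} prod={prod_value!r}")
```

A rollout pipeline could refuse to proceed while `drift` is non-empty, or require each difference to be explicitly acknowledged, so that a feature validated in test is not first exercised against an unseen production setting.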
To improve time to resolution for such issues, we are increasing the
robustness of our emergency rollback procedures for the metadata store, and
creating engineering runbooks for such scenarios.