# ISSUE SUMMARY
On Thursday, 31 October 2019, network administration operations on Google
Compute Engine (GCE), such as creating/deleting firewall rules, routes,
global load balancers, subnets, or new VPCs, were subject to elevated
latency and errors. Specific service impact is outlined in detail below.
# DETAILED DESCRIPTION OF IMPACT
On Thursday, 31 October 2019 from 16:30 to 18:00 US/Pacific, and again from
20:24 to 23:08, Google Compute Engine experienced elevated latency and
errors applying certain network administration operations. At 23:08 the
issue was fully mitigated and, as a result, administrative operations began
to succeed for most projects. However, projects which saw network
administration operations fail during the incident were left stuck in a
state where new operations could not be applied. The cleanup process for
these stuck projects took until 2019-11-02 14:00.
The following services experienced up to a 100% error rate when submitting
create, modify, and/or delete requests that relied on Google Compute
Engine’s global (and in some cases, regional) networking APIs between
2019-10-31 16:40 - 18:00 and 20:24 - 23:08 US/Pacific for a combined
duration of 4 hours and 4 minutes:
- Google Compute Engine
- Google Kubernetes Engine
- Google App Engine Flexible
- Google Cloud Filestore
- Google Cloud Machine Learning Engine
- Google Cloud Memorystore
- Google Cloud Composer
- Google Cloud Data Fusion
# ROOT CAUSE
Google Compute Engine’s networking stack consists of two software
components: a control plane and a data plane. The data plane is
where packets are processed and routed based on the configuration set up by
the control plane. GCE’s networking control plane has global components
that are responsible for fanning-out network configurations that can affect
an entire VPC network to downstream (regional/zonal) networking
controllers. Each region and zone has its own control plane service, and
each control plane service is sharded such that network programming is
spread across multiple shards.
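To make this architecture concrete, the sketch below shows the general fan-out pattern: a global distribution component pushes a VPC-wide configuration change to every regional controller, and a deterministic shard key keeps one network's programming on the same shard. All names, region lists, and shard counts are invented for illustration; this is not GCE's actual implementation.

```python
import hashlib

# Hypothetical fan-out sketch: a global component distributes a VPC-wide config
# change to per-region control plane services, each of which is sharded.
REGIONS = ["us-central1", "us-east1", "europe-west1", "asia-east1"]
NUM_SHARDS_PER_REGION = 4


def shard_for(network_id: str, num_shards: int) -> int:
    """Deterministically map a network to a shard so its programming stays together."""
    digest = hashlib.sha256(network_id.encode()).hexdigest()
    return int(digest, 16) % num_shards


def fan_out(network_id: str, config: dict) -> None:
    """Push a network-wide configuration change to every downstream controller."""
    for region in REGIONS:
        shard = shard_for(network_id, NUM_SHARDS_PER_REGION)
        # In a real system this would be an RPC to the regional controller;
        # here we only print the routing decision.
        print(f"{region}/shard-{shard}: apply {config} for {network_id}")


fan_out("projects/example/networks/default", {"firewall": "allow-ssh"})
```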
A performance regression introduced in a recent release of the networking
control software caused the service to begin accumulating a backlog of
requests. The backlog eventually became significant enough that requests
timed out, leaving some projects stuck in a state where further
administrative operations could not be applied. The backlog was further
exacerbated by the retry policy in the system sending the requests, which
increased load still further. Manual intervention was required to clear the
stuck projects, prolonging the incident.
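The report does not describe the retry policy in detail; the back-of-the-envelope sketch below, using entirely hypothetical numbers, illustrates the general dynamic: once a service falls behind, retrying the timed-out requests adds load on top of new arrivals, so the backlog grows rather than drains.

```python
# Hypothetical numbers only; this is a generic illustration of retry
# amplification, not a model of the actual system.
CAPACITY = 1000          # requests/second the overloaded service can complete
ARRIVAL_RATE = 1200      # new requests/second submitted by clients
RETRY_ROUNDS = 3         # how many times a timed-out request is resent


def offered_load(arrival_rate: float, capacity: float, retry_rounds: int) -> float:
    """Total request rate hitting the service once timed-out requests are retried."""
    load = arrival_rate
    for _ in range(retry_rounds):
        failures = max(load - capacity, 0)  # requests that time out at this load
        load = arrival_rate + failures      # new arrivals plus the retried failures
    return load


print(offered_load(ARRIVAL_RATE, CAPACITY, RETRY_ROUNDS))
# 1800 with these numbers: each retry round adds another 200 req/s,
# so the overload deepens instead of clearing.
```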
# REMEDIATION AND PREVENTION
Google engineers were alerted to the problem on 2019-10-31 at 17:10
US/Pacific and immediately began investigating. From 17:10 to 18:00,
engineers ruled out potential sources of the outage without finding a
definitive root cause. The networking control plane performed an automatic
failover at 17:57, dropping the error rate. This greatly reduced the number
of stuck operations in the system and significantly mitigated user impact.
However, after 18:59, the overload condition returned and error rates again
increased. After further investigation from multiple teams, additional
mitigation efforts began at 19:52, when Google engineers allotted
additional resources to the overloaded components. At 22:16, as a further
mitigation, Google engineers introduced a rate limit designed to throttle
requests to the network programming distribution service. At 22:28, this
service was restarted, allowing it to drop any pending requests from its
queue. The rate limit coupled with the restart mitigated the issue of new
operations becoming stuck, allowing the team to begin focusing on the
cleanup of stuck projects.
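A token bucket is one common way to implement the kind of throttle described above; the sketch below shows the idea under generic assumptions (the rate and burst values are invented, and this is not necessarily the mechanism used for the network programming distribution service).

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter: allow a steady rate with a bounded burst."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = burst     # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a request may proceed, False if it should be deferred."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = TokenBucket(rate=100, burst=20)  # hypothetical limits
if limiter.allow():
    pass  # forward the request to the distribution service
else:
    pass  # reject or requeue so callers back off instead of piling on load
```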
Resolving the stuck projects required manual intervention, which was unique
to each failed operation type. Engineers worked around the clock to address
each operation type in turn; as each was processed, further operations of
the same type (from the same project) also began to be processed. 80% of
the stuck operations were processed by 2019-11-01 16:00, and all operations
were fully processed by 2019-11-02 14:00.
We will be taking these immediate steps to prevent this class of error from
recurring:
- We are implementing continuous load testing as part of the deployment
pipeline of the component which suffered the performance regression, so
that such issues are identified before they reach production in the future.
- We have rate-limited the traffic between the impacted control plane
components to avoid the congestion collapse experienced during this
incident.
- We are further sharding the global network programming distribution
service to allow for graceful horizontal scaling under high traffic.
- We are automating the steps taken to unstick administrative operations,
to eliminate the need for manual cleanup after failures such as this one.
- We are adding alerting to the network programming distribution service,
to reduce response time in the event of a similar problem in the future.
- We are changing the way the control plane processes requests to allow
forward progress even when there is a significant backlog; a generic
sketch of this idea follows this list.
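As a generic illustration of that last item, the sketch below keeps the pending queue bounded and sheds the oldest entries (whose callers have likely already timed out), so newly arriving work can still make forward progress during a backlog. The class and limits are invented for illustration and do not describe the control plane's actual design.

```python
from collections import deque


class BoundedQueue:
    """Shed the oldest pending work when full, so new work can still make progress."""

    def __init__(self, max_pending: int = 10_000):
        self.pending = deque()
        self.max_pending = max_pending

    def submit(self, request) -> None:
        if len(self.pending) >= self.max_pending:
            dropped = self.pending.popleft()  # oldest request; its caller has likely timed out
            print(f"shedding {dropped!r} to keep the queue bounded")
        self.pending.append(request)

    def next(self):
        return self.pending.popleft() if self.pending else None


q = BoundedQueue(max_pending=3)
for op in ["add-route", "add-firewall-rule", "delete-subnet", "create-vpc"]:
    q.submit(op)  # the fourth submit sheds "add-route"
```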
Google is committed to quickly and continually improving our technology and
operations to prevent service disruptions. We appreciate your patience and
apologize again for the impact to your organization. We thank you for your
business.
If you believe your application experienced an SLA violation as a result of
this incident, please contact us (https://support.google.com/cloud/answer/6282346).