The issue with the Google Compute Engine networking control plane is ongoing


Google Cloud Platform Status

Nov 1, 2019, 1:06:56 AM
to gce-ope...@googlegroups.com
Description: We are observing a recurrence of the issue. The engineering
team is continuing its investigation.

We will provide an update by Thursday, 2019-10-31 23:00 US/Pacific with
current details.

Diagnosis: Customers may experience errors while creating or deleting
backend services, subnets, instance groups and firewall rules.

Workaround: No workaround is available at the moment.

Google Cloud Platform Status

Nov 1, 2019, 1:54:32 AM
to gce-ope...@googlegroups.com
Description: Our engineering team has determined that further investigation
is required to mitigate the issue.

We will provide an update by Thursday, 2019-10-31 23:50 US/Pacific with
current details.

Google Cloud Platform Status

Nov 1, 2019, 2:51:18 AM
to gce-ope...@googlegroups.com
Description: Our engineering team is still investigating the issue.

We will provide an update by Friday, 2019-11-01 02:00 US/Pacific with
current details.

Google Cloud Platform Status

Nov 1, 2019, 4:53:27 AM
to gce-ope...@googlegroups.com
Description: Our engineering team is still investigating the issue.

We will provide an update by Friday, 2019-11-01 03:00 US/Pacific with
current details.

Google Cloud Platform Status

Nov 1, 2019, 5:55:16 AM
to gce-ope...@googlegroups.com
Description: Our engineering team is still investigating the issue.

We will provide an update by Friday, 2019-11-01 04:00 US/Pacific with
current details.

Google Cloud Platform Status

Nov 1, 2019, 6:54:26 AM
to gce-ope...@googlegroups.com
Description: Our engineering team is still investigating the issue.

We will provide an update by Friday, 2019-11-01 05:00 US/Pacific with
current details.

Google Cloud Platform Status

Nov 1, 2019, 7:56:18 AM
to gce-ope...@googlegroups.com
Description: Our engineering team is still investigating the issue.

We will provide an update by Friday, 2019-11-01 06:00 US/Pacific with
current details.

Diagnosis: Customers may experience errors while creating or deleting
backend services, subnets, instance groups and firewall rules.
Creation of new GKE nodes might fail with the NetworkUnavailable status set
to True.
Cloud Armor rules might not be updated.

Google Cloud Platform Status

Nov 1, 2019, 8:51:59 AM
to gce-ope...@googlegroups.com
Description: Our engineering team is still investigating the issue.

We will provide an update by Friday, 2019-11-01 07:00 US/Pacific with
current details.

Diagnosis: Customers may experience errors while creating or deleting
backend services, subnets, instance groups, routes and firewall rules.

Google Cloud Platform Status

Nov 1, 2019, 10:15:11 AM
to gce-ope...@googlegroups.com
Description: Mitigation work is still underway by our engineering team.

We will provide more information by Friday, 2019-11-01 08:30 US/Pacific.

Google Cloud Platform Status

Nov 1, 2019, 11:28:31 AM
to gce-ope...@googlegroups.com
Description: Mitigation work is currently underway by our product team to
address the ongoing issue with some network operations failing globally.
These reports started Thursday, 2019-10-31 16:41 US/Pacific. Operations are
currently showing a reduction in failures, and we are working to clear a
backlog of stuck operations in our system.

We will provide more information by Friday, 2019-11-01 09:30 US/Pacific.

Diagnosis: Customers may experience errors with the products below if
affected.

Google Compute Engine
* Networking-related Compute API operations failing
* This may include deleting backend services, subnets, instance groups,
routes, firewall rules, and more.

Google Kubernetes Engine
* Cluster operations, including creation, update, and autoscaling, may
fail due to the networking API failures

Google Cloud Memorystore
* Create/Delete events failing

App Engine Flexible
* Deployments seeing elevated failure rates

Google Cloud Platform Status

Nov 1, 2019, 11:50:23 AM
to gce-ope...@googlegroups.com
Description: Mitigation work is currently underway by our product team to
unblock stuck network operations globally. Network operations submitted
between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31
23:01 US/Pacific may be affected.

New operations are currently showing a reduction in failures, and we are
working to clear a backlog of pending operations in our system.

We will provide more information by Friday, 2019-11-01 10:00 US/Pacific.

Diagnosis: Customers may be seeing errors across the products below if
affected.

Google Compute Engine
- Networking-related Compute API operations failing to complete if
submitted during the above time.
- This may include deleting backend services, subnets, instance groups,
routes and firewall rules.

Google Kubernetes Engine
- Cluster operations, including creation, update, and autoscaling, may
have failed due to the networking API failures mentioned under Google
Compute Engine
- New Cluster operations are now succeeding and further updates on
restoring this can be found at
https://status.cloud.google.com/incident/container-engine/19011

Google Cloud Memorystore
- Create/Delete events failed during the above time

App Engine Flexible
- Deployments seeing elevated failure rates

Google Cloud Platform Status

Nov 1, 2019, 12:06:48 PM
to gce-ope...@googlegroups.com
Description: Mitigation work is currently underway by our product team to
unblock stuck network operations globally. Network operations submitted
between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31
23:01 US/Pacific may be affected.

New operations are currently showing a reduction in failures, and we are
working to clear a backlog of pending operations in our system.

We will provide more information by Friday, 2019-11-01 12:00 US/Pacific.

Diagnosis: Customers may have encountered errors across the products below
if affected.

Google Compute Engine
- Networking-related Compute API operations may be pending completion if
submitted during the above time.
- The affected operations include: deleting backend services, subnets,
instance groups, routes and firewall rules.
- Some operations may still show as pending and are being mitigated at this
time. We expect this current mitigation work to be completed no later than
2019-11-01 12:30 US/Pacific.

Google Kubernetes Engine
- Cluster operations, including creation, update, and autoscaling, may
have failed due to the networking API failures mentioned under Google
Compute Engine
- New Cluster operations are now succeeding and further updates on
recovering from this can be found at
https://status.cloud.google.com/incident/container-engine/19011.
No further updates will be provided for Google Kubernetes Engine in this
post.

Google Cloud Memorystore
- This issue is believed to have affected less than 1% of projects.
- The affected projects should find full resolution once the issue
affecting Google Compute Engine is resolved. No further updates will be
provided for Google Cloud Memorystore.

App Engine Flexible
- New deployments experienced elevated failure rates during the affected
time.
- The team is no longer seeing issues affecting new deployment creation,
and we feel the incident for this product is now resolved. No further
updates will be provided for App Engine Flexible.

Google Cloud Platform Status

Nov 1, 2019, 1:05:48 PM
to gce-ope...@googlegroups.com
Description: Mitigation work is currently underway by our product team to
unblock stuck network operations globally. Network operations submitted
between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31
23:01 US/Pacific may be affected.

New operations are currently showing a reduction in failures, and we are
working to clear a backlog of pending operations in our system.

We will provide more information by Friday, 2019-11-01 12:00 US/Pacific.

Diagnosis: Customers may have encountered errors across the products below
if affected.

Google Compute Engine
- Networking-related Compute API operations may be pending completion if
submitted during the above time.
- The affected operations include: deleting backend services, subnets,
instance groups, routes and firewall rules.
- Some operations may still show as pending and are being mitigated at this
time. We expect this current mitigation work to be completed no later than
2019-11-01 12:30 US/Pacific.

Google Kubernetes Engine
- Cluster operations, including creation, update, and autoscaling, may
have failed due to the networking API failures mentioned under Google
Compute Engine
- New Cluster operations are now succeeding and further updates on
recovering from this are underway as part of the mitigation mentioned under
Google Compute Engine. No further updates will be provided for Google
Kubernetes Engine in this post.

Google Cloud Platform Status

Nov 1, 2019, 1:13:37 PM
to gce-ope...@googlegroups.com
Description: Mitigation work is currently underway by our product team to
unblock stuck network operations globally. Network operations submitted
between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31
23:01 US/Pacific may be affected.

New operations are currently succeeding as expected, and we are working to
clear a backlog of pending operations in our system.

We will provide more information by Friday, 2019-11-01 12:30 US/Pacific.

Diagnosis: Customers may have encountered errors across the products below
if affected.

Google Compute Engine
- Networking-related Compute API operations may be pending completion if
submitted during the above time.
- Resubmitting similar requests may fail as they are waiting for the above
operations to complete.

Google Cloud Platform Status

Nov 1, 2019, 3:23:03 PM
to gce-ope...@googlegroups.com
Description: Mitigation work continues to unblock pending network
operations globally. 40-80% of Cloud Networking operations submitted
between Thursday, 2019-10-31 16:41 US/Pacific and Thursday, 2019-10-31
23:01 US/Pacific may have been affected. The exact percentage of failures
is region dependent.

Our team has been able to reduce the number of pending operations by 60% at
this time. We expect mitigation to continue over the next 4 hours and are
working to clear the pending operations, starting with the largest
impacted type.

We will provide more information by Friday, 2019-11-01 14:30 US/Pacific.


Diagnosis: As we become aware of products which were impacted, we will
update this post to ensure transparency.

Google Compute Engine
- Networking-related Compute API operations may be pending completion if
submitted during the above time.
- Resubmitting similar requests may fail as they are waiting for the above
operations to complete.
- The affected operations include: deleting backend services, subnets,
instance groups, routes and firewall rules.
- Some operations may still show as pending and are being mitigated at this
time. We are currently working to address operations around subnet deletion
as our next target group.

Google Cloud Platform Status

Nov 1, 2019, 5:38:31 PM
to gce-ope...@googlegroups.com
Currently, the backlog of pending operations has been reduced by
approximately 70%, and we expect the majority of mitigations to complete
over the next several hours, with the long-tail going into tomorrow.
Mitigation work is still underway to unblock pending network operations
globally.

To determine whether you are affected by this incident, you may run the
following command [1] to view your project's pending operations:
gcloud compute operations list --filter="status!=DONE"
If you see global operations (or regional subnet operations) that have been
running for a long time (or significantly longer than usual), then you are
likely still impacted.
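
As an illustration, one way to make long-running operations easier to spot
(the --format expression here is our suggestion, not part of the command
referenced above) is to include each operation's insert time in the output:

  # List operations that have not finished, with their start times.
  gcloud compute operations list \
      --filter="status!=DONE" \
      --format="table(name, operationType, status, insertTime)"

Operations whose insertTime falls inside the impact window and which remain
RUNNING far beyond their usual duration are likely affected.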

The remaining 30% of stuck operations are currently either being processed
successfully or marked as failed. This will allow newer incoming operations
of the same type to be eventually processed successfully, however,
resubmitting similar requests may also get stuck in a running state as they
are waiting for the queued operations to complete.

If you have an operation that does not appear to be finishing, please wait
for it to succeed or be marked as failed before retrying the operation.
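
As a minimal sketch of that advice (OPERATION_NAME is a placeholder for one
of your pending operations; use --region or --zone in place of --global for
regional or zonal operations), you could poll the operation until it leaves
the running state before retrying:

  # Wait for a stuck operation to finish before retrying it.
  while [ "$(gcloud compute operations describe OPERATION_NAME \
      --global --format='value(status)')" != "DONE" ]; do
    sleep 60
  done

Once the operation reports DONE, inspect its error field to decide whether
the request needs to be resubmitted.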

For Context:
40-80% of Cloud Networking operations submitted between 2019-10-31 16:41
US/Pacific and 2019-10-31 23:01 US/Pacific may have been affected. The
exact percentage of failures is region dependent.

We will provide more information by Friday, 2019-11-01 16:30 US/Pacific.

[1] https://cloud.google.com/sdk/gcloud/reference/compute/operations/list

Diagnosis:

As we become aware of products which were impacted, we will update this
post to ensure transparency.

Google Cloud Networking
- Networking-related Compute API operations may be stuck pending if
submitted during the above time.
- The affected operations include: [deleting/creating] backend services,
subnets, instance groups, routes and firewall rules.
- Resubmitting similar requests may also enter a pending state as they are
waiting for the previous operation to complete.
- Our product team is working to unblock any pending operations.

Google Compute Engine
- 40-80% of Compute Engine API operations may have become stuck pending if
submitted during the above time.
- Affected operations include any operation which would need to update
Networking on affected projects.

Google Cloud DNS
- Some DNS updates submitted during the above time may be stuck pending.

Google Cloud Filestore
- Instance creation/deletion may be impacted.

Cloud Machine Learning
- Online prediction jobs using Google Kubernetes Engine may have
experienced failures during this time.
- The team is no longer seeing issues affecting Cloud Machine Learning and
we feel the incident for this product is now resolved.

Cloud Composer
- Create Environment operations during the affected time may have
experienced failures.
- Customers should no longer be seeing impact.

Google Kubernetes Engine
- Cluster operations, including creation, update, and autoscaling, may have
failed due to the networking API failures mentioned under Google Compute
Engine.
- New Cluster operations are now succeeding and further updates on
recovering from this are underway as part of the mitigation mentioned under
Google Cloud Networking.

Google Cloud Memorystore
- This issue is believed to have affected less than 1% of projects.
- The affected projects should find full resolution once the issue
affecting Google Compute Engine is resolved.

App Engine Flexible
- New deployments experienced elevated failure rates during the affected
time.

Google Cloud Platform Status

Nov 1, 2019, 7:32:11 PM
to gce-ope...@googlegroups.com
Description: Approximately 25% of global (and regional) route and subnet
deletion operations remain stuck in a pending state. Mitigation work is
still underway to unblock pending network operations globally. We expect
the majority of mitigations to complete over the next several hours, with
the long-tail going into tomorrow.

Please note, this will allow newer incoming operations of the same type to
eventually process successfully. However, resubmitting similar requests may
still get stuck in a running state as they are waiting for previously
queued operations to complete.

We will publish an analysis of this incident once we have completed our
internal investigation. We thank you for your patience while we have worked
on resolving the issue. We will provide more information by Friday,
2019-11-01 20:30 US/Pacific.

Diagnosis: Google Cloud Networking

- Networking-related Compute API operations may be stuck pending if
submitted during the above time.
- The affected operations include: [deleting/creating] backend services,
subnets, instance groups, routes and firewall rules.
- Resubmitting similar requests may also enter a pending state as they are
waiting for the previous operation to complete.
- Our product team is working to unblock any pending operations.

Google Compute Engine

- 40-80% of Compute Engine API operations may have become stuck pending if
submitted during the above time.
- Affected operations include any operation which would need to update
Networking on affected projects.

Google Cloud Filestore

- Instance creation/deletion may be impacted.

Google Kubernetes Engine

- Cluster operations, including creation, update, and autoscaling, may have
failed due to the networking API failures mentioned under Google Compute
Engine.
- New Cluster operations are now succeeding and further updates on
recovering from this are underway as part of the mitigation mentioned under
Google Cloud Networking.

Google Cloud Platform Status

Nov 1, 2019, 11:35:16 PM
to gce-ope...@googlegroups.com
Description: Mitigation efforts have successfully unblocked most types of
operations. At this time the backlog consists mostly of network and subnet
deletion operations, and a small fraction of subnet creation operations.
This affects subnets created during the impact window. Subnets created
outside of this window remain unaffected.

Mitigation efforts will continue overnight to unstick the remaining
operations.


We will publish an analysis of this incident once we have completed our
internal investigation. We thank you for your patience while we have worked
on resolving the issue. We will provide more information by Saturday,
2019-11-02 11:00 US/Pacific.


Diagnosis: Google Cloud Networking

- Networking-related Compute API operations may be stuck pending if
submitted during the above time.
- The affected operations include: deleting and creating subnets, creating
networks.
- Resubmitting similar requests may also enter a pending state as they are
waiting for the previous operation to complete (see the sketch below).
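
A hedged sketch of checking for a prior pending operation on the same
resource before resubmitting (the subnet name my-subnet and the filter
expression are illustrative, not from the guidance above):

  # List unfinished operations targeting a specific subnet.
  gcloud compute operations list \
      --filter="status!=DONE AND targetLink~my-subnet" \
      --format="value(name, operationType, status)"

If this returns any rows, waiting for those operations to finish before
resubmitting avoids adding further entries to the queue.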

Google Cloud Platform Status

Nov 2, 2019, 1:52:14 PM
to gce-ope...@googlegroups.com
Our engineers have made significant progress unsticking operations
overnight and early this morning. At this point, the issue with Google
Cloud Networking operations being stuck is believed to affect a very small
number of remaining projects, and our Engineering Team is actively working
on unsticking the final stuck operations.

If you have questions or are still impacted, please open a case with the
Support Team and we will work with you directly until this issue is fully
resolved.

No further updates will be provided here.

Google Cloud Platform Status

Nov 8, 2019, 7:13:21 PM
to gce-ope...@googlegroups.com
# ISSUE SUMMARY


On Thursday 31 October, 2019, network administration operations on Google
Compute Engine (GCE), such as creating/deleting firewall rules, routes,
global load balancers, subnets, or new VPCs, were subject to elevated
latency and errors. Specific service impact is outlined in detail below.


# DETAILED DESCRIPTION OF IMPACT


On Thursday 31 October, 2019 from 16:30 to 18:00 US/Pacific and again from
20:24 to 23:08, Google Compute Engine experienced elevated latency and
errors applying certain network administration operations. At 23:08, the
issue was mitigated fully, and as a result, administrative operations began
to succeed for most projects. However, projects which saw network
administration operations fail during the incident were left stuck in a
state where new operations could not be applied. The cleanup process for
these stuck projects took until 2019-11-02 14:00.


The following services experienced up to a 100% error rate when submitting
create, modify, and/or delete requests that relied on Google Compute
Engine’s global (and in some cases, regional) networking APIs between
2019-10-31 16:40 - 18:00 and 20:24 - 23:08 US/Pacific for a combined
duration of 4 hours and 4 minutes:


- Google Compute Engine
- Google Kubernetes Engine
- Google App Engine Flexible
- Google Cloud Filestore
- Google Cloud Machine Learning Engine
- Google Cloud Memorystore
- Google Cloud Composer
- Google Cloud Data Fusion


# ROOT CAUSE

Google Compute Engine’s networking stack consists of two software
components: a control plane and a data plane. The data plane is
where packets are processed and routed based on the configuration set up by
the control plane. GCE’s networking control plane has global components
that are responsible for fanning-out network configurations that can affect
an entire VPC network to downstream (regional/zonal) networking
controllers. Each region and zone has its own control plane service, and
each control plane service is sharded such that network programming is
spread across multiple shards.


A performance regression introduced in a recent release of the networking
control software caused the service to begin accumulating a backlog of
requests. The backlog eventually became significant enough that requests
timed out, leaving some projects stuck in a state where further
administrative operations could not be applied. The backlog was further
exacerbated by the retry policy in the system sending the requests, which
increased load still further. Manual intervention was required to clear the
stuck projects, prolonging the incident.
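
To illustrate the amplification mechanism (a minimal client-side sketch;
some_network_operation is a hypothetical placeholder, and this is not the
retry policy of the system involved): retrying immediately on timeout
multiplies load on an already-overloaded service, whereas capped
exponential backoff with jitter spreads retries out over time:

  # Capped exponential backoff with jitter (illustrative sketch).
  attempt=0
  max_attempts=5
  until some_network_operation; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $max_attempts attempts" >&2
      exit 1
    fi
    # Sleep 2^attempt seconds plus up to 3 seconds of random jitter.
    sleep $(( (1 << attempt) + RANDOM % 4 ))
  done

A policy along these lines keeps aggregate retry traffic roughly bounded
instead of growing with the size of the backlog.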



# REMEDIATION AND PREVENTION


Google engineers were alerted to the problem on 2019-10-31 at 17:10
US/Pacific and immediately began investigating. From 17:10 to 18:00,
engineers ruled out potential sources of the outage without finding a
definitive root cause. The networking control plane performed an automatic
failover at 17:57, dropping the error rate. This greatly reduced the number
of stuck operations in the system and significantly mitigated user impact.
However, after 18:59, the overload condition returned and error rates again
increased. After further investigation from multiple teams, additional
mitigation efforts began at 19:52, when Google engineers allotted
additional resources to the overloaded components. At 22:16, as a further
mitigation, Google engineers introduced a rate limit designed to throttle
requests to the network programming distribution service. At 22:28, this
service was restarted, allowing it to drop any pending requests from its
queue. The rate limit coupled with the restart mitigated the issue of new
operations becoming stuck, allowing the team to begin focusing on the
cleanup of stuck projects.


Resolving the stuck projects required manual intervention, which was unique
to each failed operation type. Engineers worked around the clock to address
each operation type in turn; as each was processed, further operations of
the same type (from the same project) also began to be processed. 80% of
the stuck operations were processed by 2019-11-01 16:00, and all operations
were fully processed by 2019-11-02 14:00.


We will be taking these immediate steps to prevent this class of error from
recurring:


- We are implementing continuous load testing as part of the deployment
pipeline of the component which suffered the performance regression, so
that such issues are identified before they reach production in the future.
- We have rate-limited the traffic between the impacted control plane
components to avoid the congestion collapse experienced during this
incident.
- We are further sharding the global network programming distribution
service to allow for graceful horizontal scaling under high traffic.
- We are automating the steps taken to unstick administrative operations,
to eliminate the need for manual cleanup after failures such as this one.
- We are adding alerting to the network programming distribution service,
to reduce response time in the event of a similar problem in the future.
- We are changing the way the control plane processes requests to allow
forward progress even when there is a significant backlog.



Google is committed to quickly and continually improving our technology and
operations to prevent service disruptions. We appreciate your patience and
apologize again for the impact to your organization. We thank you for your
business.


If you believe your application experienced an SLA violation as a result of
this incident, please contact us
(https://support.google.com/cloud/answer/6282346).