Networking issue with Google Compute Engine services

Google Cloud Platform Status

Aug 22, 2016, 10:25:44 AM
to gce-ope...@googlegroups.com
We are investigating an issue with network connectivity. We will provide
more information by 08:00 US/Pacific.

Google Cloud Platform Status

Aug 22, 2016, 10:32:32 AM
to gce-ope...@googlegroups.com

Google Cloud Platform Status

Aug 22, 2016, 11:01:33 AM
to gce-ope...@googlegroups.com
The issue with network connectivity to Google Compute Engine services
should have been resolved for the majority of instances, and we expect a
full resolution in the near future. We will provide another status update
by 08:30 US/Pacific with current details.

Google Cloud Platform Status

Aug 22, 2016, 11:31:09 AM
to gce-ope...@googlegroups.com
We are still investigating network connectivity issues for a subset of
instances that have not automatically recovered. We will provide another
status update by 09:00 US/Pacific.

Google Cloud Platform Status

Aug 22, 2016, 12:01:36 PM
to gce-ope...@googlegroups.com
The work on the remaining instances with network connectivity issues is still
ongoing. Affected instances are located in us-central1-f. We will provide
another status update by 09:30 US/Pacific.

Google Cloud Platform Status

Aug 22, 2016, 12:34:13 PM
to gce-ope...@googlegroups.com
The work on the remaining instances with network connectivity issues is still
ongoing. Affected instances are located in us-central1-f. We will provide
another status update by 10:00 US/Pacific.

Google Cloud Platform Status

Aug 22, 2016, 1:00:13 PM
to gce-ope...@googlegroups.com
The work on the remaining instances with network connectivity issues,
located in us-central1-f, is still ongoing. We will provide another status
update by 10:30 US/Pacific.

Google Cloud Platform Status

Aug 22, 2016, 1:35:11 PM
to gce-ope...@googlegroups.com
The work on the remaining instances with network connectivity issues,
located in us-central1-f, is still ongoing. We will provide another status
update by 11:00 US/Pacific.

Google Cloud Platform Status

Aug 22, 2016, 1:59:40 PM
to gce-ope...@googlegroups.com
The issue with Compute Engine network connectivity should have been
resolved for affected instances in us-central1-a, -b, and -c as of 08:00
US/Pacific. Less than 4% of instances in us-central1-f are currently
affected and we expect a full resolution soon. We will provide another
status update by 11:30 US/Pacific.

Google Cloud Platform Status

Aug 22, 2016, 2:44:57 PM
to gce-ope...@googlegroups.com
The issue with Compute Engine network connectivity should have been
resolved for affected instances in us-central1-a, -b, and -c as of 08:00
US/Pacific. Less than 4% of instances in us-central1-f are currently
affected. We will provide another status update by 12:00 US/Pacific.

Google Cloud Platform Status

Aug 22, 2016, 3:01:07 PM
to gce-ope...@googlegroups.com
The issue with Compute Engine network connectivity should have been
resolved for nearly all instances. For the few remaining instances, we are
working directly with the affected customers. No further
updates will be posted, but we will conduct an internal investigation of
this issue and make appropriate improvements to our systems to prevent or
minimize future recurrence. We will also provide a more detailed analysis
of this incident once we have completed our internal investigation.

Google Cloud Platform Status

Aug 30, 2016, 9:19:36 PM
to gce-ope...@googlegroups.com
SUMMARY:

On Monday 22 August 2016, the Google Cloud US-CENTRAL1-F zone lost network
connectivity to services outside that zone for a duration of 25 minutes.
All other zones in the US-CENTRAL1 region were unaffected. All network
traffic within the zone was also unaffected.

We apologize to customers whose service or application was affected by this
incident. We understand that a network disruption has a negative impact on
your application - particularly if it is homed in a single zone - and we
apologize for the inconvenience this caused. What follows is our detailed
analysis of the root cause and actions we will take in order to prevent
this type of incident from recurring.

DETAILED DESCRIPTION OF IMPACT:

We have received feedback from customers asking us to specifically and
separately enumerate the impact of incidents on any service that may have
been touched. We agree that this will make it easier to reason about the
impact of any particular event and we have done so in the following
descriptions.

On Monday 22 August 2016 from 07:05 to 07:30 PDT the Google Cloud
US-CENTRAL1-F zone lost network connectivity to services outside that zone.

App Engine

6% of App Engine Standard Environment applications in the US-CENTRAL region
served elevated error rates for up to 8 minutes, until the App Engine
serving infrastructure automatically redirected traffic to a failover zone.
The aggregate error rate across all impacted applications during the
incident period was 3%. The traffic redirection caused a Memcache flush for
affected applications, and also caused loading requests as new instances of the
applications started up in the failover zones.

All App Engine Flexible Environment applications deployed to the
US-CENTRAL1-F zone were unavailable for the duration of the incident.
Additionally, 4.5% of these applications experienced various levels of
unavailability for up to an additional 5 hours while the system recovered.

Deployments for US-CENTRAL Flexible applications were delayed during the
incident. Our engineers disabled the US-CENTRAL1-F zone for new deployments
during the incident, so that any customers who elected to redeploy
immediately recovered.

Cloud Console

The Cloud Console was available during the incident, though some App Engine
administrative pages did not load for applications in US-CENTRAL, and 50% of
project creation requests failed to complete and needed to be retried by
customers before succeeding.

Cloud Dataflow

Some Dataflow running jobs in the US-CENTRAL1 region experienced delays in
processing. Although most of the affected jobs recovered gracefully after
the incident ended, up to 2.5% of affected jobs in this zone became stuck
and required manual termination by customers. New jobs created during the
incident were not impacted.
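
For reference, a stuck job can be cancelled through the Dataflow service
API. The sketch below is illustrative only: it assumes the discovery-based
google-api-python-client with Application Default Credentials, and the
project and job IDs shown are placeholders.

    # Illustrative sketch: cancel a stuck Dataflow job via the v1b3 API by
    # requesting the JOB_STATE_CANCELLED state. Requires
    # google-api-python-client and Application Default Credentials.
    from googleapiclient.discovery import build

    PROJECT_ID = 'example-project'                      # placeholder
    JOB_ID = '2016-08-22_07_05_00-1234567890123456789'  # placeholder

    dataflow = build('dataflow', 'v1b3')
    dataflow.projects().jobs().update(
        projectId=PROJECT_ID,
        jobId=JOB_ID,
        body={'requestedState': 'JOB_STATE_CANCELLED'},
    ).execute()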

Cloud SQL

Cloud SQL First Generation instances were not impacted by this incident.

30% of Cloud SQL Second Generation instances in US-CENTRAL1 were
unavailable for up to 5 minutes, after which they became available again.
An additional 15% of Second Generation instances were unavailable for 22
minutes.

Compute Engine

All instances in the US-CENTRAL1-F zone were inaccessible from outside the
zone for the duration of the incident. 9% of them remained inaccessible
from outside the zone for an additional hour.

Container Engine

Container Engine clusters running in US-CENTRAL1-F were inaccessible from
outside of the zone during the incident, although they continued to serve.

In addition, calls to the Container Engine API experienced a 4% error rate
and elevated latency during the incident, though this was substantially
mitigated if the client retried the request.

Stackdriver Logging

20% of log API requests sent to Stackdriver Logging in the US-CENTRAL1
region failed during the incident, though App Engine logging was not
impacted. Clients retrying requests recovered gracefully.

Stackdriver Monitoring

Requests to the Stackdriver web interface and the Google Monitoring API
v2beta2 and v3 experienced elevated latency and an error rate of up to 3.5%
during the incident. In addition, some alerts were delayed. Impact for API
calls was substantially mitigated if the client retried the request.
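
Several of the impact notes above observe that impact was substantially
reduced for clients that retried failed requests. As a minimal sketch of
that mitigation (the helper below is illustrative and not part of any
Google client library), a client can wrap each API call in a retry loop
with exponential backoff and jitter:

    # Illustrative client-side retry helper with exponential backoff and
    # jitter. `call` is any zero-argument function that performs one API
    # request and raises an exception on failure.
    import random
    import time

    def call_with_retries(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return call()
            except Exception:
                if attempt == max_attempts:
                    raise
                # Back off exponentially, capped at max_delay, with full jitter.
                delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
                time.sleep(random.uniform(0, delay))

Production code would typically retry only transient error classes (such
as 5xx responses) rather than every exception, but the backoff structure
is the same.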

ROOT CAUSE:

On 18 July, Google carried out a planned maintenance event to inspect and
test the UPS on a power feed in one zone in the US-CENTRAL1 region. That
maintenance disrupted one of the two power feeds to network devices that
control routes into and out of the US-CENTRAL1-F zone.

Although this did not cause any disruption in service, these devices
unexpectedly and silently disabled the affected power supply modules - a
previously unseen behavior. Because our monitoring systems did not notify
our network engineers of this problem, the power supply modules were not
re-enabled after the maintenance event.

The service disruption was triggered on Monday 22 August, when our
engineers carried out another planned maintenance event that removed power
to the second power feed of these devices, causing them to disable the
other power supply module as well, and thus completely shut down.

Following our standard procedure when carrying out maintenance events, we
made a detailed line walk of all critical equipment prior to, and after,
making any changes. However, in this case we did not detect the disabled
power supply modules.

Loss of these network devices meant that machines in US-CENTRAL1-F did not
have routes into and out of the zone but could still communicate with other
machines within the same zone.

REMEDIATION AND PREVENTION:

Our network engineers received an alert at 07:14, nine minutes after the
incident started. We restored power to the devices at 07:30. The network
returned to service without further intervention after power was restored.

As an immediate follow-up to this incident, we have already carried out an
audit of all other network devices of this type in our fleet to verify that
there are none with disabled power supply modules.

We have also written up a detailed post mortem of this incident and will
take the following actions to prevent future outages of this type:

Our monitoring will be enhanced to detect cases in which power supply
modules are disabled. This will ensure that conditions that are missed by
the manual line walk prior to maintenance events are picked up by automated
monitoring.
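
As an illustration only (the telemetry interface here is hypothetical and
not an actual Google system), such a check amounts to alerting whenever any
device reports a power supply module in a disabled state:

    # Hypothetical sketch of an automated check for disabled power supply
    # modules. get_psu_states() stands in for whatever telemetry interface
    # the device fleet actually exposes.
    def get_psu_states(device):
        """Return a mapping of PSU name -> state ('enabled' or 'disabled')."""
        raise NotImplementedError('placeholder for a real telemetry query')

    def check_power_supplies(devices, alert):
        """Alert on any power supply module that is not enabled."""
        for device in devices:
            for psu, state in get_psu_states(device).items():
                if state != 'enabled':
                    alert('%s: power supply module %s is %s' % (device, psu, state))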

We will change the configuration of these network devices so that power
disruptions do not cause them to disable their power supply modules.

The interaction between the network control plane and the data plane should
be such that the data plane "fails open" and continues to route packets in
the event of control plane failures. We will add support for
networking protocols that have the capability to continue to route traffic
for a short period in the event of failures in control plane components.
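
Conceptually (the sketch below illustrates the idea and is not the actual
protocol work), "failing open" means the data plane keeps forwarding with
its last-programmed routes for a bounded hold period after the control
plane stops refreshing them, rather than withdrawing them immediately:

    # Conceptual sketch of a fail-open forwarding table: keep using the
    # last-programmed routes for a bounded hold period after the control
    # plane stops refreshing them, instead of dropping traffic at once.
    import time

    class FailOpenForwardingTable:
        def __init__(self, hold_seconds=180):
            self.hold_seconds = hold_seconds
            self.routes = {}               # prefix -> next hop
            self.last_refresh = time.time()

        def refresh(self, routes):
            """Called by the control plane while it is healthy."""
            self.routes = dict(routes)
            self.last_refresh = time.time()

        def next_hop(self, prefix):
            """Keep forwarding until the hold timer expires without a refresh."""
            if time.time() - self.last_refresh > self.hold_seconds:
                return None                # hold timer expired; stop forwarding
            return self.routes.get(prefix)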

We will also be taking various actions to improve the resilience of the
affected services to single-zone outages, including the following:

App Engine

Although App Engine Flexible Environment is currently in Beta, we expect
production services to be more resilient to single-zone disruptions. We
will make this extra resilience an exit criterion before we allow the
service to reach General Availability.

Cloud Dataflow

We will improve resilience of Dataflow to single-zone outages by
implementing better strategies for migrating the job controller to a new
zone in the event of an outage. Work on this remediation is already
underway.

Stackdriver Logging

We will make improvements to the Stackdriver Logging service (currently in
Beta) in the areas of automatic failover and capacity management before
this service goes to General Availability. This will ensure that it is
resilient to single-zone outages.

Stackdriver Monitoring

The Google Monitoring API (currently in beta) is already hosted in more
than one zone, but we will further improve its resilience by adding
additional capacity to ensure a single-zone outage does not cause overload
in any other zones. We will do this before this service reaches General
Availability.

Finally, we know that you depend on Google Cloud Platform for your
production workloads and we apologize for the inconvenience this event
caused.