Disruption of network connectivity to new instances

gce-operations

Jul 31, 2014, 5:08:37 AM
to gce-ope...@googlegroups.com
We're investigating an issue with network connectivity to new Google Compute Engine instances. Currently running instances are not affected. We will provide more information shortly.

gce-operations

Jul 31, 2014, 5:34:58 AM
to gce-ope...@googlegroups.com
We are currently experiencing an issue with Google Compute Engine: newly created GCE instances have no network connectivity via external IP addresses or internal project networks. For everyone who is affected, we apologize - we know you count on Google to work for you and we're working hard to restore normal operation.

We will provide an update by 31 July 03:00 US/Pacific with current details, and if available an estimated time for resolution.

gce-operations

Jul 31, 2014, 6:01:16 AM
to gce-ope...@googlegroups.com
We are still investigating the issue with network connectivity to new GCE instances.  Connectivity has been restored to some GCE instances.  We will provide another status update by 31 July 03:30 US/Pacific time.

Dave Hughes

Jul 31, 2014, 6:13:27 AM
to gce-ope...@googlegroups.com, gce-ope...@googlegroups.com
The problem with network connectivity to new GCE instances should be resolved as of 03:08 US/Pacific time. We apologize for any issues this may have caused you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.

We will provide a more detailed analysis of this incident once we have completed our internal investigation.

gce-operations

Aug 2, 2014, 7:22:15 PM
to gce-ope...@googlegroups.com

SUMMARY:

For a total of 3 hours and 32 minutes, across two separate periods on Thursday 31st July 2014, newly created Google Compute Engine instances had impaired or no network connectivity. If you were affected by this issue, we apologize; this is not the level of reliability we strive to offer, and we will ensure that the root causes of the incident are rigorously addressed.


DETAILED DESCRIPTION OF IMPACT:

From 00:30 to 02:47 US/Pacific time on 31st July, approximately 50% of newly created Google Compute Engine instances in one zone were not accessible via their external IP addresses or project internal networks. Instances created before 00:30 were not affected. Concurrently, newly created instances in other zones were accessible via only some routes, resulting in approximately 45% of inbound connections being dropped. After 02:48, connectivity was restored to all instances created during this period.


From 14:50 to 16:05 US/Pacific time on the same day, the issue recurred in the part of the system running in a different datacenter, affecting 58% of inbound connections and approximately 50% of newly created instances. Again, instances created before 14:50 were not affected.
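
As a quick check of the figures, the two impact windows above can be added up to confirm that they match the 3 hours and 32 minutes quoted in the summary. This is only an arithmetic sketch; the helper name window_length is made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%H:%M"


def window_length(start: str, end: str) -> timedelta:
    """Length of a same-day window given as HH:MM strings."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# The two impact windows stated in the report (US/Pacific, 31 July 2014).
windows = [("00:30", "02:47"), ("14:50", "16:05")]

# 2h17m + 1h15m
total = sum((window_length(s, e) for s, e in windows), timedelta())
print(total)  # 3:32:00
```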


ROOT CAUSE:

The root cause was a localized storage capacity shortage in one of the datacenters used by the system that propagates Compute Engine network configuration to new instances and to the network fabric.  The system’s defense-in-depth strategy correctly prevented this issue from affecting existing instances by preserving the existing network configuration when this system was unable to provide updated values. Compartmentalization of the system that propagates network configuration successfully prevented the issue from affecting all instances or all of the network fabric.
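
The behaviour described above, keeping the existing network configuration when no update can be obtained, is a common "last known good" pattern. The following is a minimal sketch of that pattern only, not Google's implementation; the class NetworkConfigCache, its methods, and the configuration keys are hypothetical.

```python
from typing import Callable, Dict, Optional


class NetworkConfigCache:
    """Hypothetical holder of the network configuration applied to an instance."""

    def __init__(self, initial: Dict[str, str]) -> None:
        self._active = dict(initial)

    @property
    def active(self) -> Dict[str, str]:
        # Return a copy so callers cannot mutate the applied configuration.
        return dict(self._active)

    def refresh(self, fetch: Callable[[], Optional[Dict[str, str]]]) -> bool:
        """Try to apply an update; keep the current config if none is available."""
        try:
            update = fetch()
        except Exception:
            # The upstream system failed: treat it as "no update".
            update = None
        if not update:
            return False  # last known good configuration stays in place
        self._active = dict(update)
        return True
```

With this shape, an instance that already has a configuration keeps working through an upstream failure, while a newly created instance with nothing cached has no fallback, which is consistent with only new instances being affected.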


Although the system recognized it had run out of storage capacity, it did not correctly alert Google engineers in time to correct the problem. Due to a software limitation, the address allocation system did not announce its failure, so graceful failover was not initiated.  The same set of circumstances recurred at 14:50 when the affected system experienced a similar temporary storage capacity shortage in a different datacenter.
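
To illustrate why announcing the failure matters, here is a rough sketch in which each replica reports itself unhealthy once free storage falls below a threshold, so that a selector can fail over to another one. Everything here is assumed for illustration: the Replica class, the 10% threshold, and the datacenter labels do not come from the report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str          # hypothetical datacenter label
    free_bytes: int
    total_bytes: int

    def healthy(self, min_free_fraction: float = 0.10) -> bool:
        """Report unhealthy when free space drops below the threshold."""
        if self.total_bytes <= 0:
            return False
        return self.free_bytes / self.total_bytes >= min_free_fraction


def choose_replica(replicas: List[Replica]) -> Optional[Replica]:
    """Return the first replica that announces itself healthy, else None."""
    for replica in replicas:
        if replica.healthy():
            return replica
    return None
```

A replica that has run out of space but keeps reporting healthy would still be selected here, which corresponds to the failure mode described above.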


RESOLUTION AND PREVENTION:

To resolve the immediate issues in both cases, Google engineers directed traffic away from the affected datacenter and quickly allocated more storage capacity, allowing all instance IP addresses to be correctly propagated to the network layer.


To prevent this issue in the future, Google engineers have improved the isolation of storage resources, which dramatically decreases the risk that this system will run out of space.  Google engineers are also auditing storage capacity for this system in all datacenters to ensure this does not happen again.  In addition, Google engineers are building monitoring that tests connectivity to newly created instances, so that the engineering team is alerted immediately if an issue of this nature recurs.
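
Users who want a similar check on their own side could probe freshly created instances with a plain TCP connection attempt. The sketch below is one possible approach and not the monitoring Google describes; the function names, the default port 22, and the 3 second timeout are assumptions.

```python
import socket
from typing import Iterable, List


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable(addresses: Iterable[str], port: int = 22) -> List[str]:
    """Return the addresses that fail the TCP probe on the given port."""
    return [addr for addr in addresses if not can_connect(addr, port)]
```

Running unreachable() over the external IP addresses of recently created instances, and alerting when the list is non-empty, would catch the symptom described in this report from the outside.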