SUMMARY:
For a total of 3 hours and 32 minutes, across two separate periods on Thursday 31st July 2014, newly created Google Compute Engine instances had impaired or no network connectivity. If you were affected by this issue, we apologize. This is not the level of reliability we strive to offer, and we will ensure that the root causes of the incident are rigorously addressed.
DETAILED DESCRIPTION OF IMPACT:
From 00:30 to 02:47 US/Pacific time on the 31st July, approximately 50% of newly created Google Compute Engine instances in were not accessible via their external IP addresses or project internal networks. Instances created before 00:30 were not affected. Concurrently, newly created instances in other zones were not accessible via all routes, resulting in approximately 45% of inbound connections being dropped. After 02:48, connectivity was restored to all instances created during this period.
From 14:50 to 16:05 US/Pacific time on the same day, the issue recurred in the part of the system running in a different datacenter, affecting 58% of inbound connections and approximately 50% of newly created instances. Again, instances that had been created before 14:50 were not affected.
ROOT CAUSE:
The root cause was a localized storage capacity shortage in one of the datacenters used by the system that propagates Compute Engine network configuration to new instances and to the network fabric. The system’s defense-in-depth strategy correctly prevented this issue from affecting existing instances by preserving the existing network configuration when this system was unable to provide updated values. Compartmentalization of the system that propagates network configuration successfully prevented the issue from affecting all instances or all of the network fabric.
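The behavior described above, where existing instances kept working because their last network configuration was preserved, can be illustrated with a minimal sketch. This is not Google's implementation: the names `LastKnownGoodConfig`, `ConfigUnavailable`, and the `fetch` callable are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical representation of a network configuration: address -> route.
NetworkConfig = Dict[str, str]


class ConfigUnavailable(Exception):
    """Raised by the (hypothetical) propagation system when it cannot serve updated values."""


class LastKnownGoodConfig:
    """Keeps the most recent successfully fetched configuration."""

    def __init__(self, fetch: Callable[[], NetworkConfig]) -> None:
        self._fetch = fetch
        self._current: Optional[NetworkConfig] = None

    def refresh(self) -> Optional[NetworkConfig]:
        """Try to fetch an update; on failure, keep the existing configuration."""
        try:
            updated = self._fetch()
        except ConfigUnavailable:
            # Preserve whatever was in place before the failure.
            return self._current
        self._current = dict(updated)
        return self._current
```

In this sketch, a holder that has already fetched a configuration keeps returning it when the source fails, while a holder created during the failure has nothing to fall back on and gets `None`. That mirrors the pattern in the report: instances created before the incident were unaffected, and newly created instances were not.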
Although the system recognized it had run out of storage capacity, it did not correctly alert Google engineers in time to correct the problem. Due to a software limitation, the address allocation system did not announce its failure, so graceful failover was not initiated. The same set of circumstances recurred at 14:50 when the affected system experienced a similar temporary storage capacity shortage in a different datacenter.
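The two gaps named above, a missing alert and a failure that was never announced, can be illustrated with a small sketch of a health check that does both. The report does not describe the actual mechanism, so everything here is an assumption made for illustration: the `Replica` type, the `min_free_bytes` threshold, and the `pick_serving_replicas` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Replica:
    """Hypothetical per-datacenter replica of the configuration system."""
    name: str
    free_bytes: int

    def healthy(self, min_free_bytes: int) -> bool:
        # A replica announces itself unhealthy once free storage drops too low.
        return self.free_bytes >= min_free_bytes


def pick_serving_replicas(
    replicas: List[Replica], min_free_bytes: int
) -> Tuple[List[str], List[str]]:
    """Return (names of replicas that should serve traffic, alert messages)."""
    serving: List[str] = []
    alerts: List[str] = []
    for replica in replicas:
        if replica.healthy(min_free_bytes):
            serving.append(replica.name)
        else:
            # Announce the failure so traffic fails over and engineers are alerted.
            alerts.append(
                f"{replica.name}: free storage {replica.free_bytes} B "
                f"is below {min_free_bytes} B"
            )
    return serving, alerts
```

With a check of this kind, a replica that runs low on storage is removed from the serving set and produces an alert, rather than continuing to receive requests it cannot fulfil.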
RESOLUTION AND PREVENTION:
To resolve the immediate issue in both cases, Google engineers directed traffic away from the affected datacenter and quickly allocated more storage capacity, allowing all instance IP addresses to be correctly propagated to the network layer.