Google Compute Engine Load Balancing Outage This Past Weekend

Google Compute Engine Team

Jan 22, 2014, 6:35:32 PM

This past weekend, we saw two issues with Google Compute Engine's load balancing service. From 20:23 January 18 to 03:21 January 20 UTC, customers could not create new load balancing Forwarding Rules, and from 03:25 to 06:21 on January 20 UTC, the load balancing service experienced a complete outage.

When we prepare for planned maintenance on a Google Compute Engine zone that does not yet support transparent maintenance, we terminate all instances within the zone. This weekend, while we were preparing for planned maintenance on the europe-west1-a zone, there was an issue with Google Compute Engine’s load balancing control plane that was triggered when we began terminating instances in that zone. This prevented the load balancing service from creating new configurations. Eventually, the service reloaded according to schedule, removing all working configurations and leaving the service unresponsive.


As soon as we identified the problem, we removed the reference to the terminated instances, fixed the bug, and updated the control plane, which rectified the problem and returned the load balancing service to a healthy state.


To ensure this issue does not happen again, we are taking a number of steps, including deploying additional safety checks at multiple layers to reject large changes in the number of load balancing configurations. We are also developing a system that will detect these types of failures more quickly. Lastly, we are making changes so that we can deliver these kinds of updates to you more quickly. Reliability is a top priority at Google, and we are continuously making improvements to our systems.
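
For readers wondering what a safety check that rejects large changes in the number of configurations might look like, here is a minimal sketch in Python. It is an illustration only and not the actual Compute Engine implementation: the function name, the exception, and the 20% threshold are all hypothetical.

```python
class ConfigChangeRejected(Exception):
    """Raised when a proposed update changes the configuration count too much."""


def check_config_count(current_count: int, proposed_count: int,
                       max_fraction: float = 0.2) -> None:
    """Reject an update whose configuration count differs too much from the live one.

    All names and the default threshold are assumptions made for illustration.
    """
    if current_count == 0:
        # Nothing is live yet, so there is no baseline to compare against.
        return
    # Relative change in the number of configurations, in either direction.
    change = abs(proposed_count - current_count) / current_count
    if change > max_fraction:
        raise ConfigChangeRejected(
            f"configuration count would change from {current_count} "
            f"to {proposed_count} ({change:.0%}), limit is {max_fraction:.0%}"
        )
```

With a guard like this in front of a scheduled reload, an update that would remove every working configuration (for example, from 100 to 0) would raise an error and leave the existing configurations in place, while a small change (100 to 95) would pass.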


-- Simon Newton, Google Cloud Networking Engineer
