Postmortem for August 18, 2011 outage

Postmortem

This document details the cause of, and the events immediately following, App Engine's outage on August 18th, 2011, which impacted applications running on the Master/Slave Datastore.

Summary

On August 18th, 2011, a Google data center in the American Midwest, which was serving App Engine Master/Slave Datastore applications on that date, lost utility power as a result of an intense thunderstorm. Power distribution equipment in the data center then failed in the wake of the loss of utility power, powering off a subset of the machines in the data center.

The power loss to the affected machines both reduced the available computing capacity in the data center, and took offline parts of the storage infrastructure, causing Master/Slave Datastore applications to experience high latency, serve errors, or be completely unavailable. When Google’s data center operations team reported that it would be several hours before they would be able to restore power to the affected machines due to the ongoing thunderstorm, the App Engine team decided to perform an emergency failover from the serving data center to the backup data center for Master/Slave Datastore applications.

High Replication Datastore applications were not serving from this data center on that date, and were therefore unaffected by this outage.

Background

During data center outages affecting Master/Slave Datastore applications, the App Engine team’s only options are to weather the outage in the current location, providing a degraded experience to our customers, or to perform an emergency maintenance to relocate Master/Slave Datastore applications to the backup data center. Emergency maintenance procedures do not allow time to fully replicate between the current serving data center for Master/Slave Datastore applications and the backup data center, because the storage system in the current serving data center is severely degraded or offline. Thus, data written recently to the current Master/Slave Datastore data center is temporarily stranded, and the Master/Slave Datastore appears to jump backwards in time when it returns to service.

The App Engine team does not make the decision to perform an emergency relocation without replication lightly. It is extremely disruptive for your App Engine Master/Slave Datastore application to begin serving without the most recent data that it committed to the Datastore. After the emergency maintenance, the App Engine team must audit and repair the Master/Slave Datastore to determine the scope of the unreplicated data stranded in the affected data center, and provide the unreplicated data to application owners, so they may choose to re-integrate it into their application’s Datastore if they so desire. This is clearly an undesirable outcome for both the App Engine team and the application owner.

As such, App Engine’s policy during power outages or severe disruptions affecting the Master/Slave Datastore is to weather the outage in place for up to an hour, unless the team is informed that the return to service will definitely not begin within that hour. This policy was defined after examining the historical record of outages and return-to-service times in Google data centers, and assessing the likelihood of a quick return to service versus the adverse effects of performing an emergency maintenance. Google’s data center operations team is highly competent at returning data centers to service quickly and safely, and their abilities are leveraged to the benefit of App Engine customers.

During this outage, the impact of the adverse weather conditions continued for much longer than the App Engine team had anticipated, and made it impossible for the data center operations team to safely begin the repair process until the storm ended. As there was no estimated time for the data center to return to service at that point, the App Engine team elected to perform an emergency maintenance to switch Master/Slave Datastore applications to their backup data center, returning them to service with some amount of unreplicated data.

Remediation

The architecture of the Master/Slave Datastore for App Engine makes no substantial improvement possible in this situation. The Master/Slave Datastore serves out of a single primary data center, with asynchronous delayed replication to a backup data center, and is always vulnerable to unexpected outages in its primary data center.
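
To illustrate the consequence of this architecture, the sketch below (Python, and purely illustrative rather than App Engine's actual implementation) models a store that acknowledges writes before they reach the backup. An emergency failover then strands any writes that had not yet replicated, which is the "jump backwards in time" described above.

    # Purely illustrative sketch -- not App Engine's implementation. Writes are
    # acknowledged as soon as the primary commits them, and the backup catches
    # up on a delay, so a failover strands any not-yet-replicated writes.
    class AsyncReplicatedStore(object):
        def __init__(self):
            self.primary = {}   # current serving data center
            self.backup = {}    # backup data center, updated asynchronously

        def write(self, key, value):
            # Success is reported to the application as soon as the primary commits.
            self.primary[key] = value

        def replicate(self):
            # Delayed, background catch-up of the backup data center.
            self.backup.update(self.primary)

        def emergency_failover(self):
            # The primary is offline; serving resumes from the backup copy, and
            # any writes that had not yet replicated are temporarily stranded.
            self.primary = dict(self.backup)


    store = AsyncReplicatedStore()
    store.write('a', 1)
    store.replicate()             # 'a' reaches the backup
    store.write('b', 2)           # acknowledged, but not yet replicated
    store.emergency_failover()
    print(store.primary)          # {'a': 1} -- 'b' is stranded in the old primary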

The normal maintenance procedure to switch Master/Slave Datastore applications from the serving data center to the backup data center requires an hour of read-only time to complete. While it would be possible to pre-emptively perform this procedure whenever adverse weather conditions are expected, the majority of the time adverse weather does not result in a service outage. Such a policy would result in far more Master/Slave Datastore read-only periods without any guarantee of fewer unplanned outages, and it would provide no protection against outages that occur without sufficient forewarning, e.g. fire or loss of network connectivity.
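
Applications that want to degrade gracefully during these read-only periods can use the App Engine SDK's Capabilities API to detect when Datastore writes are disabled. The sketch below assumes the Python runtime; CapabilitySet and CapabilityDisabledError are part of the SDK, while the save_comment helper and its entity argument are hypothetical.

    # A minimal sketch, assuming the Python runtime. The save_comment helper and
    # its entity argument are hypothetical examples.
    from google.appengine.api import capabilities
    from google.appengine.runtime import apiproxy_errors

    def datastore_writes_enabled():
        # False while the Datastore is in a read-only maintenance period.
        return capabilities.CapabilitySet(
            'datastore_v3', capabilities=['write']).is_enabled()

    def save_comment(entity):
        if not datastore_writes_enabled():
            return False  # e.g. show a "temporarily read-only" notice instead
        try:
            entity.put()
            return True
        except apiproxy_errors.CapabilityDisabledError:
            # Raised if a write is attempted while the Datastore is read-only.
            return False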

Recommendations

The High Replication Datastore for App Engine applications is specifically engineered to be resilient in the face of sudden outages affecting one or more data centers. Data written to the High Replication Datastore is synchronously replicated to multiple data centers before App Engine indicates success to your application.
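
As a rough illustration of why this matters, the sketch below (again purely illustrative, not the High Replication Datastore's actual protocol, which is considerably more sophisticated) acknowledges a write only once it is present in a majority of data centers, so losing any single data center strands no acknowledged data.

    # Purely illustrative sketch -- not the High Replication Datastore's actual
    # protocol. A write succeeds only after it is durable in a majority of data
    # centers, so no acknowledged write is lost when one data center fails.
    class SyncReplicatedStore(object):
        def __init__(self, num_datacenters=3):
            self.replicas = [{} for _ in range(num_datacenters)]

        def write(self, key, value):
            reachable = [r for r in self.replicas if r is not None]
            if len(reachable) <= len(self.replicas) // 2:
                raise IOError('not enough data centers reachable to commit')
            for replica in reachable:
                replica[key] = value
            # Only now is success reported back to the application.

        def lose_datacenter(self, index):
            self.replicas[index] = None  # sudden, unexpected outage

        def read(self, key):
            for replica in self.replicas:
                if replica is not None and key in replica:
                    return replica[key]
            raise KeyError(key)


    store = SyncReplicatedStore()
    store.write('a', 1)        # returns only after 'a' is in a majority of replicas
    store.lose_datacenter(0)   # one data center suddenly goes offline
    print(store.read('a'))     # 1 -- no acknowledged data was stranded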

Had High Replication Datastore applications been serving out of this data center, it’s entirely possible they would have experienced minimal or no degradation or outage. Additionally, the App Engine team could have ceased all serving from that data center within minutes, without any temporary stranding of data, or other adverse events.

The Google App Engine team encourages all App Engine customers to migrate their applications from the Master/Slave Datastore to the High Replication Datastore. The High Replication Datastore is now the default for new App Engine applications, the SLA available under the upcoming new pricing model applies exclusively to High Replication Datastore applications, and we are testing improved migration tools with early adopters now. (You can sign up to be an early adopter at this link: http://goo.gl/3jrXu) In addition, new App Engine features, e.g. Go and Python 2.7, will be available exclusively to High Replication Datastore applications.

Timeline (all times US/Pacific)

5:35pm: Google data center loses computing and storage capability, as a result of loss of utility power due to a severe thunderstorm in the area. Google data center operations team begins responding to the outage, in contact with the App Engine team.
6:50pm: Google data center operations reports data center will not return to service promptly. App Engine team begins emergency maintenance to switch Master/Slave Datastore applications to backup data center.
7:10pm: appengine-downtime-notify forum is notified: https://groups.google.com/forum/#!topic/google-appengine-downtime-notify/_yTJse1eOaI
7:20pm: Master/Slave Datastore applications begin serving in read-only mode during the emergency maintenance.
7:50pm: App Engine team completes emergency maintenance, and Master/Slave Datastore applications are serving normally again.

-- Ikai Lan, on behalf of the App Engine team