Google App Engine team
Apr 20, 2011, 3:16:24 PM
to Google App Engine
This document is the post mortem describing the March 8th 2011 App
Engine outage.
Summary
On March 8th 2011, a subset of Python App Engine applications using
the Master/Slave Datastore configuration were affected by a service
outage that lasted, in the worst case, up to five and a half hours.
During the outage, a fraction of requests to affected applications
failed with errors whenever traffic was routed to an affected instance;
application logs for those instances would have shown that they were
unable to import standard Python modules. The outage was caused by a
bug in a previous system update, which made it impossible to roll out
subsequent updates without disrupting running applications.
Root Cause
On the morning of March 8th, the App Engine team pushed a new version
of the App Engine Python runtime. While the new Python runtime
contained no known issues, a performance optimization in a system
update pushed on March 3rd included a bug which would cause future
updates to App Engine runtimes to disrupt running applications as the
new runtime rolled out. Once the bug was triggered, applications would
be unable to import requested Python modules.
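For illustration only (this snippet is not taken from the affected
runtime or from any real application), the failure mode looked to an
application like an ordinary import of a standard module raising
ImportError, typically on a freshly restarted instance or at the point
of a late, request-time import:

    import logging

    def handle_request():
        # Hypothetical handler that defers importing a standard module
        # until the first request that needs it (a "late import"). On an
        # affected instance the runtime could no longer resolve standard
        # modules, so the import itself raised ImportError and the
        # request failed with an error.
        try:
            import datetime  # stands in for any standard Python module
        except ImportError:
            logging.exception("unable to import a standard Python module")
            return 500
        return 200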
Timeline
March 3rd 2011 - A new runtime is pushed to production, uneventfully.
A configuration change made in this runtime contains a bug which will
disrupt future rollouts of new runtimes.
March 7th 2011 - 8:00 PM US/Pacific - Release of a new Python runtime
rolls out to the internal instance of App Engine, which is only used
to test new App Engine releases, and only contains internal Google
applications.
March 8th 2011 - 2:35 AM US/Pacific - Release of the new Python
runtime begins to roll out to production, first to the Master/Slave
Datastore applications. The push is expected to be safe, as nothing
adverse was observed in the rollout to the internal instance of App
Engine. The push installs the new Python runtime, which, due to the
bug present in the previous release, immediately disrupts the running
applications.
March 8th 2011 - 2:50 AM US/Pacific - A measurable rise in user-visible
error rates across the service is noted. Investigation begins.
March 8th 2011 - 3:30 AM US/Pacific - Rollout of the update to
production is stopped. It had only reached 15% of the Python Master/
Slave Datastore service in production.
March 8th 2011 - 3:45 AM US/Pacific - The root cause of the problem is
identified, and the remediation method is determined: all of the
infrastructure underlying the Python Master/Slave Datastore
applications must be restarted. A hard restart of the underlying
infrastructure would fix the problem instantly, but would also disrupt
the entire Python Master/Slave Datastore population for tens of
minutes. As only a subset of the Python Master/Slave Datastore
applications are affected, monitoring analysis is used to identify the
worst-affected infrastructure, which is then restarted in small batches
so as not to disrupt the applications that are still serving healthily
(a minimal sketch of this batched approach appears after the timeline).
March 8th 2011 - 5:20 AM US/Pacific - Enough infrastructure has been
restarted that the user-visible outage is largely over at this point.
A verification rollback begins, where the entire Python Master/Slave
Datastore infrastructure is restarted in the background at
conservative speed, to ensure that the problem is remediated globally.
March 8th 2011 - 8:05 AM US/Pacific - A subset of Python Master/Slave
Datastore applications are determined to still be affected, and the
relevant infrastructure is manually restarted. The user-visible outage
is completely over at this point.
March 8th 2011 - 10:00 AM US/Pacific - Verification rollback complete.
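As a minimal sketch of the batched remediation described in the
3:45 AM entry above (the restart_host helper, batch size, and pause are
illustrative assumptions, not the actual tooling used):

    import time

    BATCH_SIZE = 5        # illustrative: restart only a few hosts at a time
    PAUSE_SECONDS = 60    # illustrative: let traffic and monitoring settle

    def restart_in_batches(hosts_by_error_rate, restart_host):
        # Restart the worst-affected hosts first, in small batches, so
        # that applications that are still serving healthily are not
        # disrupted all at once.
        for i in range(0, len(hosts_by_error_rate), BATCH_SIZE):
            for host in hosts_by_error_rate[i:i + BATCH_SIZE]:
                restart_host(host)  # assumed helper, e.g. an RPC to the host
            time.sleep(PAUSE_SECONDS)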
Issues and Fixes
* The configuration bug in the March 3rd 2011 push was missed during
design & review.
* The impact of the configuration bug introduced in the March 3rd 2011
push was subtle and manifested only slowly, in a subset of applications
as they restarted. This resulted in a new failure mode that automated
systems were not configured to detect. The problem did manifest in the
push to the internal instance, but it appeared too slowly to make clear
that there was a bug and that the push should not continue to
production.
* Implement monitoring and alerting on additional aspects of runtime
health, especially during updates. Gate releases to production on
successful verification of runtime health in the internal instance (a
rough sketch of such a gate appears after this list).
* The Status Site did not indicate errors during the outage. The
disruption manifested only when application instances restarted (or
performed late imports of modules). Because the applications probed by
the Status Site had not restarted, they were not disrupted.
* Incorporate data from a subset of production applications into
the display on the Status Site, not just monitoring data.
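As a rough sketch of the release gate described above (the function
names and threshold are illustrative assumptions, not App Engine
internals), a rollout to production would be blocked until runtime
health checks pass on the internal instance:

    REQUIRED_HEALTHY_FRACTION = 0.999   # illustrative threshold

    def runtime_is_healthy(check_instance, instances):
        # check_instance(instance) is assumed to exercise the runtime,
        # for example by restarting the instance, importing a set of
        # standard modules, and serving a test request.
        healthy = sum(1 for instance in instances if check_instance(instance))
        return healthy >= REQUIRED_HEALTHY_FRACTION * len(instances)

    def gate_release(check_instance, internal_instances):
        # Block the push to production if the internal instance is unhealthy.
        if not runtime_is_healthy(check_instance, internal_instances):
            raise RuntimeError("runtime health verification failed; "
                               "do not push this release to production")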