Post-mortem for February 24th, 2010 outage

App Engine Team

Mar 4, 2010, 6:22:21 PM
to Google App Engine
Post-Mortem Summary

This document details the cause of App Engine's outage on February
24th, 2010, the events that occurred immediately afterward, and the
steps we are taking to mitigate the impact of similar outages in the
future.

On February 24th, 2010, all Google App Engine applications were in
varying degraded states of operation for a period of two hours and
twenty minutes, from 7:48 AM to 10:09 AM PT (15:48 to 18:09 GMT). The
underlying cause of the outage was a power failure in our primary
datacenter. While the Google App Engine infrastructure is designed to
recover quickly from these sorts of failures, this rare type of
problem, combined with internal procedural issues, extended the time
required to restore the service.

(A full timeline of the outage is included at the end of this post.)

What did we do wrong?

Though the team had planned for this sort of failure, our response had
a few important issues:

- Although we had procedures ready for this sort of outage, the oncall
staff was unfamiliar with them and had not been sufficiently trained
on the specific recovery procedure for this type of failure.

- Recent work to migrate the datastore for better multihoming changed
and improved the procedure for handling these failures significantly.
However, some documentation detailing the procedure to support the
datastore during failover incorrectly referred to the old
configuration. This led to confusion during the event.

- The production team had not agreed on a policy that clearly
indicates when, and in what situations, our oncall staff should take
aggressive, user-facing actions such as an unscheduled failover. This
contributed to the bad call of moving traffic back to a partially
working datacenter.

- We failed to plan for the case of a power outage that might affect
some, but not all, of our machines in a datacenter (in this case,
about 25%). In particular, this led to incorrect analysis of the
serving state of the failed datacenter and when it might recover.

- Though we were able to eventually migrate traffic to the backup
datacenter, a small number of Datastore entity groups, belonging to
approximately 25 applications in total, became stuck in an
inconsistent state as a result of the failover procedure. This
represented considerably less than 0.00002% of data stored in the
Datastore.

Ultimately, although significant work had been done over the past year
to improve our handling of these types of outages, procedural issues
reduced the effectiveness of that work during this event.

What are we doing to fix it?

As a result, we are putting the following measures in place going
forward:

- Introduce regular drills in which all oncall staff practice all of
our production procedures. This will include the rare and complicated
procedures, and all members of the team will be required to complete
the drills before joining the oncall rotation.

- Implement a regular bi-monthly audit of our operations docs to
ensure that all needed procedures are properly findable, and all out-
of-date docs are properly marked "Deprecated."

- Establish a clear policy framework to assist oncall staff to quickly
and decisively make decisions about taking intrusive, user-facing
actions during failures. This will allow them to act confidently and
without delay in emergency situations.

We believe that with these new procedures in place, last week's outage
would have been reduced in impact from about 2 hours of total
unavailability to about 10 to 20 minutes of partial unavailability.

In response to this outage, we have also decided to make a major
infrastructural change in App Engine. Currently, App Engine provides a
one-size-fits-all Datastore that offers low write latency combined
with strong consistency, in exchange for lower availability during an
unexpected failure in one of our serving datacenters. Based on this
outage and feedback from our users, we have begun work on providing
two different Datastore configurations:

- The current option of low-latency, strong consistency, and lower
availability during unexpected failures (like a power outage)

- A new option for higher availability using synchronous replication
for reads and writes, at the cost of significantly higher latency

We believe that providing both of these options to you, our users,
will allow you to make your own informed decisions about the tradeoffs
you want to make in running your applications.
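
As a rough illustration (not part of the original post) of how an
application might cope with the higher write latency of the proposed
synchronously replicated option, here is a minimal sketch that moves a
non-critical write off the user-facing request path using the Python
SDK's deferred library; the model and handler names below are
hypothetical:

    # Hypothetical sketch: keeping user-facing latency low if synchronous
    # cross-datacenter replication makes each Datastore write slower.
    # Assumes the Python runtime and the SDK's deferred library; the model
    # and handler names are made up for illustration.
    from google.appengine.ext import db, deferred, webapp

    class PageView(db.Model):
        path = db.StringProperty()
        count = db.IntegerProperty(default=0)

    def record_view(path):
        # Non-critical bookkeeping write, executed later on the task queue
        # rather than while the user waits.
        view = PageView.get_or_insert(path, path=path)
        view.count += 1
        view.put()

    class MainHandler(webapp.RequestHandler):
        def get(self):
            # The slower, replicated write is deferred; the response
            # returns immediately.
            deferred.defer(record_view, self.request.path)
            self.response.out.write('Hello!')

(The deferred library requires its task-queue handler to be mapped in
app.yaml, and writes moved off the request path this way become
visible slightly later rather than before the response is sent.)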

We sincerely apologize for the impact of Feb 24th's service disruption
on your applications. We take great pride in the reliability that App
Engine offers, but we also recognize that we can do more to improve
it. You can be confident that we will continue to work diligently to
improve the service and ensure that low-level outages like this one
have the least possible effect on our customers.


Timeline
-----------

7:48 AM - Internal monitoring graphs first begin to show that traffic
to our primary datacenter is experiencing problems and returning an
elevated number of errors. Around the same time, posts begin to show
up in the google-appengine discussion group from users who are having
trouble accessing App Engine.

7:53 AM - Google Site Reliability Engineers send an email to a broad
audience notifying oncall staff that there has been a power outage in
our primary datacenter. Google's datacenters have backup power
generators for these situations. But, in this case, around 25% of
machines in the datacenter did not receive backup power in time and
crashed. At this time, our oncall staff was paged.

8:01 AM - By this time, our primary oncall engineer has assessed the
extent and impact of the outage and determined that App Engine is
down. The oncall engineer, according to procedure, pages our product
managers and engineering leads to handle communicating about the
outage to our users. A few minutes later, the first post from the
App Engine team about this outage is made on the external group ("We
are investigating this issue.").

8:22 AM - After further analysis, we determine that although power has
returned to the datacenter, many machines in the datacenter are
missing due to the power outage, and are not able to serve traffic.
In particular, it is determined that the GFS and Bigtable clusters are
not in a functioning state, having lost too many machines, and that
the Datastore is therefore not usable in the primary datacenter at
that time. The oncall engineer discusses performing a failover to our
alternate datacenter with the rest of the oncall team. Agreement is
reached to pursue our unexpected failover procedure for unplanned
datacenter outages.

8:36 AM - Following up on the post on the discussion group outage
thread, the App Engine team makes a post about the outage to our
appengine-downtime-notify group and to the App Engine Status site.

8:40 AM - The primary oncall engineer discovers two conflicting sets
of procedures. This was a result of the operations process changing
after our recent migration of the Datastore. After discussion with
other oncall engineers, consensus is not reached, and members of the
engineering team attempt to contact the specific engineers responsible
for the procedure change to resolve the situation.

8:44 AM - While others attempt to determine which is the correct
unexpected failover procedure, the oncall engineer attempts to move
all traffic into a read-only state in our alternate datacenter.
Traffic is moved, but an unexpected configuration problem from this
procedure prevents the read-only traffic from working properly.

9:08 AM - Various engineers are diagnosing the problem with read-only
traffic in our alternate datacenter. In the meantime, however, the
primary oncall engineer sees data that leads them to believe that our
primary datacenter has recovered and may be able to serve. Without a
clear rubric with which to make this decision, however, the engineer
is not aware that, based on historical data, the primary datacenter is
unlikely to have recovered to a usable state by this point in time.
Traffic is moved back to the original primary datacenter as an attempt
to resume serving, while others debug the read-only issue in the
alternate datacenter.

9:18 AM - The primary oncall engineer determines that the primary
datacenter has not recovered, and cannot serve traffic. It is now
clear to oncall staff that the call was wrong, the primary will not
recover, and we must focus on the alternate datacenter. Traffic is
failed back over to the alternate datacenter, and the oncall engineer
makes the decision to follow the unplanned failover procedure and
begins the process.

9:35 AM - An engineer familiar with the unplanned failover procedure
is reached and begins providing guidance about the failover
procedure. Traffic is moved to our alternate datacenter, initially in
read-only mode.

9:48 AM - Serving for App Engine begins externally in read-only mode,
from our alternate datacenter. At this point, apps that properly
handle read-only periods should be serving correctly, though in a
reduced operational state.

9:53 AM - After the engineering team consults with the relevant
engineers, who are now online, the correct operations document for the
unplanned failover procedure is confirmed and is ready to be used by
the oncall engineer. The actual unplanned failover procedure for reads
and writes begins.

10:09 AM - The unplanned failover procedure completes, without any
problems. Traffic resumes serving normally, read and write. App Engine
is considered up at this time.

10:19 AM - A follow-up post is made to the appengine-downtime-notify
group, letting people know that App Engine is now serving normally.
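
A note on the 9:48 AM entry: the timeline assumes apps "properly
handle read-only periods." Here is a minimal sketch (not part of the
original post) of what that can look like in the Python runtime, using
the Capabilities API and CapabilityDisabledError; the model and
handler names below are hypothetical:

    # Minimal sketch of handling a Datastore read-only period gracefully.
    # Assumes the Python runtime; model and handler names are hypothetical.
    from google.appengine.api import capabilities
    from google.appengine.ext import db, webapp
    from google.appengine.runtime import apiproxy_errors

    class GuestbookEntry(db.Model):
        message = db.StringProperty()

    class SignHandler(webapp.RequestHandler):
        def post(self):
            writes_enabled = capabilities.CapabilitySet(
                'datastore_v3', capabilities=['write']).is_enabled()
            if not writes_enabled:
                # Datastore is read-only (e.g. during a failover); degrade
                # gracefully instead of returning a hard error page.
                self.response.set_status(503)
                self.response.out.write('Saving is temporarily disabled.')
                return
            try:
                GuestbookEntry(message=self.request.get('message')).put()
                self.response.out.write('Saved.')
            except apiproxy_errors.CapabilityDisabledError:
                # Writes were disabled between the check and the put().
                self.response.set_status(503)
                self.response.out.write('Saving is temporarily disabled.')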


Chris

Mar 5, 2010, 11:10:03 AM
to Google App Engine
Thanks for sharing this information. It may very well be highly
irrational, but it does feel good to get some insight into what
happened and how you guys responded to it!

One paragraph in particular caught my attention:

"- A new option for higher availability using synchronous replication
for reads and writes, at the cost of significantly higher latency"

I think I understand why you wish to expose this fundamental trade-off
to the app owners, i.e. "Let us make our own decisions and force us to
acknowledge the fundamental forces at play". I'm a bit concerned about
the potential behavioral side-effect of such a "feature" though. Right
now you guys have to make GAE _both_ reliable and fast. That's what we
expect. It may very well be next to impossible to do both but you have
to keep trying...and having a bunch of hard-core GAE engineers
continuously trying will probably land us all in a pretty happy place a
year or two down the line ;-)

On the other hand...making it a choice for app owners would
effectively be giving up on that fundamental challenge. "Oh..so you
can't live with 2-3 outages of several hours each every year...then
you should enable the super reliable but slooooow option". I'm sure,
you guys would still do your best to make the "slow option" super
fast...but at the end of the day resources need to be prioritized.
My fear is that optimizing reliability of the slow option will tend
to go to the bottom of the stack since there is a "workaround" for
customers who really need it.


/Chris


N.B.: I'm very happy with the performance of GAE as it is today...or at
least what it was up until a week or so ago ;-)...but the reliability
is a major cause of concern for me.


lennysan

Mar 5, 2010, 12:25:38 PM
to Google App Engine
I've been working on a Guideline for Postmortem Communication, and ran
this post through the guideline:
http://www.transparentuptime.com/2010/03/google-app-engine-downtime-postmortem.html

Overall, this may be the single most impressive postmortem I've seen
yet. The amount of time and thought put into this post is staggering,
and the takeaways are useful to every organization. I'm especially
impressed with the proposed new functionality that turns this event
into a long term positive, which is really all you can ask for after
an incident.


Marc Provost

Mar 5, 2010, 2:09:01 PM
to Google App Engine
Wow, I second lennysan. Awesome postmortem! Thank you so much for
sharing it with us.

Marc


nickmilon

Mar 6, 2010, 6:19:02 PM
to Google App Engine
Thanks for sharing this as well as the previous post mortem (last
July, if I remember well).
Keep up the good work!

gwstuff

Mar 7, 2010, 12:33:26 PM
to Google App Engine
> - The current option of low-latency, strong consistency, and lower
> availability during unexpected failures (like a power outage)
>
> - A new option for higher availability using synchronous replication
> for reads and writes, at the cost of significantly higher latency

This would be fantastic, if one were able to select between the two
configurations in the same App. That would make it possible for one to
have a "regular" and a "failsafe" version of a database, where the
failsafe version is updated less frequently. This makes sense for a
number of my apps in which it is helpful to give users access to some
basic data even if the system is basically "down."
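
One rough way to approximate that with the APIs available today (a
sketch, not a supported feature; the names below are hypothetical): a
cron-driven handler snapshots the basic data into memcache, and reads
fall back to the stale snapshot when the Datastore is unavailable.

    # Hypothetical sketch of a "regular"/"failsafe" split using existing
    # APIs: a cron job snapshots basic data into memcache, and reads fall
    # back to that stale copy if the Datastore is erroring.
    from google.appengine.api import memcache
    from google.appengine.ext import db, webapp

    class Article(db.Model):
        title = db.StringProperty()
        body = db.TextProperty()

    class SnapshotHandler(webapp.RequestHandler):
        def get(self):
            # Triggered by a cron.yaml entry, e.g. every 30 minutes.
            articles = Article.all().fetch(100)
            snapshot = [(str(a.key()), a.title, a.body) for a in articles]
            memcache.set('failsafe_articles', snapshot)

    def load_articles():
        """Prefer fresh Datastore data; fall back to the stale snapshot."""
        try:
            return [(str(a.key()), a.title, a.body)
                    for a in Article.all().fetch(100)]
        except db.Error:
            return memcache.get('failsafe_articles') or []

Memcache is not durable, so this only helps for the best-effort "basic
data" case described above.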

Thanks also for the nice postmortem. So far, one of the best things
about using App Engine has been the openness with which information is
provided to users.

Evil Mushroom Lord

Mar 7, 2010, 3:03:05 PM
to Google App Engine
Thank you for taking the time to share all this information with us.
Props to you guys. :)

David

Mar 7, 2010, 9:05:59 PM
to Google App Engine
Thank you for the open and honest post-mortem.


AFidel

Mar 8, 2010, 1:17:56 PM
to Google App Engine
Excellent 10,000-foot view, and a template for what the entire
industry should do so that collectively we can learn from each other's
mistakes instead of having to learn the painful lessons individually.

For those of us who have to mess with the low-level details, can you
share why such a large number of nodes lost power? Was it a failure to
test and maintain the generator system, improper PDU loading, or some
other fault that could not have been foreseen?

SEWilco

Mar 8, 2010, 2:38:07 PM
to Google App Engine
I wonder how long it will be before someone connects a fast-but-less-
reliable app to a fraternal slower-but-more-reliable partner app.

Michael Loftis

Mar 13, 2010, 1:20:10 PM
to Google App Engine

While I'm not an App Engine user, as a systems administrator who has
always been in favor of more customer-facing honesty and openness, I
greatly appreciate seeing this. Knowing what happened gives a large
number of customers peace of mind, even if it is sometimes painful
for internal customers/employees to admit to fault.

Jan Z

Mar 18, 2010, 5:33:16 AM
to Google App Engine
Thanks for posting this. The transparency helps greatly.

I would like to add my vote to Chris's, though, regarding splitting
Bigtable operations.
It's a mistake.

The enormous appeal of the App Engine today is that you've done an
amazing job shielding us from needing to make these sorts of
decisions.

*THAT* is the hard problem that GAE is addressing.

By making distinctions such as this one, you're fundamentally shifting
your direction away from what ought to be (and has been?) a key design
principle.

In short - please don't make us choose. Just make it work.

Jan / Cloudbreak


nickmilon

Mar 18, 2010, 7:23:32 PM
to Google App Engine
Jan Z, Chris, GAE team

I do agree splitting datastore operations is not a good idea.
I was also going to post and argue about this, but I realized it was
too late, since it seems there was a definite decision from Google, as
described in this same "post mortem" and implemented just hours later
in the latest SDK.
The question now (post mortem, unfortunately!) is that G should at
least try to consult a little with developers here before committing
to such things.
So G, please try to engage us (poor developers) in the loop before
such decisions are made; probably we are just stupid developers, but
maybe we still have an idea or something worth considering.
More so as I do not see any relevant issue or ticket request having
been filed about this.

Happy coding ;)

Nick
Athens - Greece

ryan

Mar 19, 2010, 9:03:25 PM
to Google App Engine
On Mar 5, 9:10 am, Chris <cskjoldb...@gmail.com> wrote:
> Right now you guys have to make GAE _both_ reliable and fast.
> That's what we expect. It may very well be next to impossible
> to do both but you have to keep trying...and having a bunch of
> hard-core GAE engineers continously trying will probably land
> us all in a pretty happy place a year or two down the line ;-)

On Mar 18, 2:33 am, Jan Z <jan.zawad...@gmail.com> wrote:
> By making distinctions such as this one, you're fundamentally shifting
> your direction away from what ought to be (and has been?) a key design
> principle.

unfortunately, hard tradeoffs are at the core of most serious
engineering. it may be cliched, but there's generally no free lunch.
are you hungry enough to pay for that lunch? or would you rather save
the money instead? there's no single right answer to that question; it
depends on the person and the situation.

choices like this are similar. i discussed this in detail in
http://code.google.com/events/io/sessions/TransactionsAcrossDatacenters.html
; see slide 33 for the executive summary. in essence, you're fighting
against things like the speed of light and queueing in core backbone
routers in major internet peering points, which are difficult to
impossible to change. specifically, 1) the only known distributed
consensus protocol is paxos, 2) paxos requires two round trips, and 3)
getting packets between datacenters in different physical locations
takes time. multiply that time by four (for the two round trips), add
in disk seeks on either end, and your writes will always be
significantly slower than local writes in only one datacenter. there
isn't really any way around that. at least, not until maybe quantum
computing, or wormholes. :P
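
To make the arithmetic above concrete, a back-of-envelope sketch (the
latency numbers are illustrative assumptions, not measurements):

    # Back-of-envelope version of the reasoning above; the latency
    # numbers are illustrative assumptions, not measurements.
    one_way_ms = 20.0      # assumed one-way time between distant datacenters
    disk_seek_ms = 10.0    # assumed disk seek on each end

    paxos_round_trips = 2  # paxos needs two round trips
    cross_dc_write_ms = (one_way_ms * 2 * paxos_round_trips  # four one-way hops
                         + 2 * disk_seek_ms)                 # seeks on either end
    local_write_ms = disk_seek_ms  # rough floor for a single-datacenter write

    print 'replicated write: ~%d ms' % cross_dc_write_ms   # ~100 ms
    print 'local write:      ~%d ms' % local_write_ms      # ~10 ms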

this is an important general lesson that we've learned in the
developer group at google. tradeoffs like these are inherent in
engineering, and there's usually no one size fits all. one choice is
right for some apps, but not for the others. given that, the
interesting question isn't whether to offer the option. we have to.
the interesting question - if any - is what the default should be.

(also see my post on synchronous replication vs. eventually consistent
reads, http://groups.google.com/group/google-appengine/browse_thread/thread/ca31fe630d73c3d3#d4e0651cd8051c63
)

nickmilon

Mar 20, 2010, 4:43:57 PM
to Google App Engine
Ryan,

Thanks for explaining.
I know engineering is about trade-offs, and after reading your post
and viewing your presentation I am convinced G took the right
engineering decision. Still, I think a little consultation with
developers before major decisions are taken could be helpful
sometimes.

Regards

Nick
