This document details the cause of App Engine's outage on February
24th, 2010, the events that occurred immediately afterward, and the
steps we are taking to mitigate the impact of similar outages in the
future.
On February 24th, 2010, all Google App Engine applications were in
varying degraded states of operation for a period of two hours and
twenty minutes from 7:48 AM to 10:09 AM PT | 15:48 to 18:09 GMT. The
underlying cause of the outage was a power failure in our primary
datacenter. While the Google App Engine infrastructure is designed to
recover quickly from this sort of failure, this rare type of
problem, combined with internal procedural issues, extended the time
required to restore the service.
A full timeline of the outage is included below.
What did we do wrong?
Though the team had planned for this sort of failure, our response had
a few important issues:
- Although we had procedures ready for this sort of outage, the oncall
staff was unfamiliar with them and had not trained sufficiently with
the specific recovery procedure for this type of failure.
- Recent work to migrate the datastore for better multihoming changed
and improved the procedure for handling these failures significantly.
However, some documentation detailing the procedure to support the
datastore during failover incorrectly referred to the old
configuration. This led to confusion during the event.
- The production team had not agreed on a policy that clearly
indicates when, and in what situations, our oncall staff should take
aggressive user-facing actions, such as an unscheduled failover. This
led to a bad call of returning to a partially working datacenter.
- We failed to plan for the case of a power outage that might affect
some, but not all, of our machines in a datacenter (in this case,
about 25%). In particular, this led to incorrect analysis of the
serving state of the failed datacenter and when it might recover.
- Though we were able to eventually migrate traffic to the backup
datacenter, a small number of Datastore entity groups, belonging to
approximately 25 applications in total, became stuck in an
inconsistent state as a result of the failover procedure. This
represented considerably less than 0.00002% of data stored in the
Datastore.
Ultimately, although significant work had been done over the past year
to improve our handling of these types of outages, procedural issues
reduced the effectiveness of that work.
What are we doing to fix it?
As a result, we have instituted the following procedures going
forward:
- Introduce regular drills in which oncall staff rehearse all of our
production procedures. This will include the rare and complicated
procedures, and all members of the team will be required to complete
the drills before joining the oncall rotation.
- Implement a regular bi-monthly audit of our operations docs to
ensure that all needed procedures are easy to find and that all
out-of-date docs are clearly marked "Deprecated."
- Establish a clear policy framework to assist oncall staff to quickly
and decisively make decisions about taking intrusive, user-facing
actions during failures. This will allow them to act confidently and
without delay in emergency situations.
We believe that with these new procedures in place, last week's outage
would have been reduced in impact from about 2 hours of total
unavailability to about 10 to 20 minutes of partial unavailability.
In response to this outage, we have also decided to make a major
infrastructural change in App Engine. Currently, App Engine provides a
one-size-fits-all Datastore that provides low write latency combined
with strong consistency, in exchange for lower availability in
situations of unexpected failure in one of our serving datacenters. In
response to this outage, and feedback from our users, we have begun
work on providing two different Datastore configurations:
- The current option of low-latency, strong consistency, and lower
availability during unexpected failures (like a power outage)
- A new option for higher availability using synchronous replication
for reads and writes, at the cost of significantly higher latency
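The tradeoff between the two proposed configurations can be sketched
as a toy model. Everything here is illustrative (the class, names, and
millisecond figures are assumptions for the sketch, not App Engine
APIs or measurements):

```python
from dataclasses import dataclass

@dataclass
class DatastoreOption:
    name: str
    write_latency_ms: float      # typical write latency (made-up number)
    writes_survive_outage: bool  # do writes keep working if a datacenter fails?

# The two configurations described above, with illustrative numbers.
LOW_LATENCY = DatastoreOption("low-latency / strong consistency", 20.0, False)
SYNC_REPLICATION = DatastoreOption("synchronous replication", 120.0, True)

def pick_option(options, need_writes_during_outage):
    """Pick the lowest-latency option that meets the availability requirement."""
    viable = [o for o in options
              if o.writes_survive_outage or not need_writes_during_outage]
    return min(viable, key=lambda o: o.write_latency_ms)
```

An app that must accept writes even while a datacenter is down would
end up with the synchronous-replication option; one that can tolerate
a read-only window would keep the faster default.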
We believe that providing both of these options to you, our users,
will allow you to make your own informed decisions about the tradeoffs
you want to make in running your applications.
We sincerely apologize for the impact of Feb 24th's service disruption
on your applications. We take great pride in the reliability that App
Engine offers, but we also recognize that we can do more to improve
it. You can be confident that we will continue to work diligently to
improve the service and ensure that low-level outages like this one
have the least possible effect on our customers.
Timeline
7:48 AM - Internal monitoring graphs first begin to show that traffic
has problems in our primary datacenter and is returning an elevated
number of errors. Around the same time, posts begin to show up in the
google-appengine discussion group from users who are having trouble
accessing App Engine.
7:53 AM - Google Site Reliability Engineers send an email to a broad
audience notifying oncall staff that there has been a power outage in
our primary datacenter. Google's datacenters have backup power
generators for these situations. But, in this case, around 25% of
machines in the datacenter did not receive backup power in time and
crashed. At this time, our oncall staff was paged.
8:01 AM - By this time, our primary oncall engineer has assessed the
extent and impact of the problem, and has determined that App Engine
is down. The oncall engineer, according to procedure, pages our
product managers and engineering leads to handle communicating about
the outage to our users. A few minutes later, the first post from the
App Engine team about this outage is made on the external group ("We
are investigating this issue.").
8:22 AM - After further analysis, we determine that although power has
returned to the datacenter, many machines in the datacenter are
missing due to the power outage, and are not able to serve traffic.
Particularly, it is determined that the GFS and Bigtable clusters are
not in a functioning state due to having lost too many machines, and
that thus the Datastore is not usable in the primary datacenter at
that time. The oncall engineer discusses performing a failover to our
alternate datacenter with the rest of the oncall team. Agreement is
reached to pursue our unexpected failover procedure for this unplanned
outage.
8:36 AM - Following up on the post on the discussion group outage
thread, the App Engine team makes a post about the outage to our
appengine-downtime-notify group and to the App Engine Status site.
8:40 AM - The primary oncall engineer discovers two conflicting sets
of procedures. This was a result of the operations process changing
after our recent migration of the Datastore. After discussion with
other oncall engineers, consensus is not reached, and members of the
engineering team attempt to contact the specific engineers responsible
for the procedure change to resolve the situation.
8:44 AM - While others attempt to determine which is the correct
unexpected failover procedure, the oncall engineer attempts to move
all traffic into a read-only state in our alternate datacenter.
Traffic is moved, but an unexpected configuration problem from this
procedure prevents the read-only traffic from working properly.
9:08 AM - Various engineers are diagnosing the problem with read-only
traffic in our alternate datacenter. In the meantime, however, the
primary oncall engineer sees data that leads them to believe that our
primary datacenter has recovered and may be able to serve. Without a
clear rubric with which to make this decision, however, the engineer
is not aware that, based on historical data, the primary datacenter is
unlikely to have recovered to a usable state by this point in time.
Traffic is moved back to the original primary datacenter as an attempt
to resume serving, while others debug the read-only issue in the
alternate datacenter.
9:18 AM - The primary oncall engineer determines that the primary
datacenter has not recovered, and cannot serve traffic. It is now
clear to oncall staff that the call was wrong, the primary will not
recover, and we must focus on the alternate datacenter. Traffic is
failed back over to the alternate datacenter, and the oncall engineer
makes the decision to follow the unplanned failover procedure and
begins it.
9:35 AM - An engineer with familiarity with the unplanned failover
procedure is reached, and begins providing guidance about the failover
procedure. Traffic is moved to our alternate datacenter, initially in
read-only mode.
9:48 AM - Serving for App Engine begins externally in read-only mode,
from our alternate datacenter. At this point, apps that properly
handle read-only periods should be serving correctly, though in a
reduced operational state.
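The timeline mentions apps that "properly handle read-only periods."
A minimal sketch of that pattern, independent of any SDK, is below;
`ReadOnlyError` and `Datastore` are hypothetical stand-ins for
whatever error and storage interface the platform actually exposes
when writes are disabled:

```python
# Stand-in for the platform error raised while Datastore writes are disabled.
class ReadOnlyError(Exception):
    pass

# Stand-in for the storage backend; real apps would use the platform API.
class Datastore:
    def __init__(self, read_only=False):
        self.read_only = read_only
        self.rows = []

    def put(self, row):
        if self.read_only:
            raise ReadOnlyError("writes are temporarily disabled")
        self.rows.append(row)

def handle_comment(store, comment):
    """Serve a degraded page instead of an error when writes are disabled."""
    try:
        store.put(comment)
        return "comment saved"
    except ReadOnlyError:
        # Still render the page; skip the write and tell the user.
        return "read-only mode: comment not saved, try again later"
```

An app written this way keeps serving reads during a failover window
instead of returning server errors on every request that writes.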
9:53 AM - After engineering team consultation with the relevant
engineers, now online, the correct unplanned failover procedure
operations document is confirmed, and is ready to be used by the
oncall engineer. The actual unplanned failover procedure for reads and
writes begins.
10:09 AM - The unplanned failover procedure completes, without any
problems. Traffic resumes serving normally, read and write. App Engine
is considered up at this time.
10:19 AM - A follow-up post is made to the appengine-downtime-notify
group, letting people know that App Engine is now serving normally.
One paragraph in particular caught my attention:
"- A new option for higher availability using synchronous replication
for reads and writes, at the cost of significantly higher latency"
I think I understand why you wish to expose this fundamental trade-off
to the app owners, i.e. "Let us make our own decisions and force us to
acknowledge the fundamental forces at play". I'm a bit concerned about
the potential behavioral side-effect of such a "feature" though. Right
now you guys have to make GAE _both_ reliable and fast. That's what we
expect. It may very well be next to impossible to do both but you have
to keep trying...and having a bunch of hard-core GAE engineers
continuously trying will probably land us all in a pretty happy place a
year or two down the line ;-)
On the other hand...making it a choice for app owners would
effectively be giving up on that fundamental challenge. "Oh..so you
can't live with 2-3 outages of several hours each every year...then
you should enable the super reliable but slooooow option". I'm sure,
you guys would still do your best to make the "slow option" super
fast...but at the end of the day resources need to be prioritized.
I could fear that optimizing reliability of the slow option will tend
to go at the bottom of the stack since there is a "workaround" for
customers who really need it.
N.B: I'm very happy with the performance of GAE as it is today...or at
least what it was up until a week or so ago ;-)...but the reliability
is a major cause of concern for me.
On Mar 5, 12:22 am, App Engine Team <appengine.nore...@gmail.com>
Overall, this may be the single most impressive postmortem I've seen
yet. The amount of time and thought put into this post is staggering,
and the takeaways are useful to every organization. I'm especially
impressed with the proposed new functionality that turns this event
into a long term positive, which is really all you can ask for after
an incident like this.
On Mar 4, 3:22 pm, App Engine Team <appengine.nore...@gmail.com>
On Mar 5, 12:25 pm, lennysan <lenny...@gmail.com> wrote:
> I've been working on a Guideline for Postmortem Communication, and ran
> this post through the guideline:http://www.transparentuptime.com/2010/03/google-app-engine-downtime-p...
> read more »
This would be fantastic, if one were able to select between the two
configurations in the same App. That would make it possible for one to
have a "regular" and a "failsafe" version of a database, where the
failsafe version is updated less frequently. This makes sense for a
number of my apps in which it is helpful to give users access to some
basic data even if the system is basically "down."
Thanks also for the nice postmortem. So far, one of the best things
about using App Engine has been the openness with which information is
provided to users.
On Mar 8, 5:03 am, Evil Mushroom Lord <evilmushrooml...@gmail.com>
For those of us who have to mess with the low level details, can you
share why such a large number of nodes lost power? Was it a failure to
test and maintain the generator system, improper PDU loading, or some
other fault that could not have been foreseen?
On Mar 4, 4:22 pm, App Engine Team <appengine.nore...@gmail.com>
> Post-Mortem Summary
While I'm not an App Engine user, as a systems administrator who has
always been in favor of more customer facing honesty and openness I
greatly appreciate seeing this. Knowing what happened gives a large
number of customers a peace of mind, even if it is sometimes painful
for internal customers/employees to admit to fault.
I would like to add my vote to Chris's though regarding splitting the
Datastore into two configurations.
It's a mistake.
The enormous appeal of the App Engine today is that you've done an
amazing job shielding us from needing to make these sorts of
decisions.
*THAT* is the hard problem that GAE is addressing.
By making distinctions such as this one, you're fundamentally shifting
your direction away from what ought to be (and has been?) a key design
principle.
In short - please don't make us choose. Just make it work.
Jan / Cloudbreak
On Mar 5, 12:22 pm, App Engine Team <appengine.nore...@gmail.com>
I do agree splitting datastore operations is not a good idea.
I was also going to post and argue about this, but I realized it was
too late, since it seems there was a definite decision from Google, as
described in this same "post mortem" and implemented in zero time some
hours later in the last SDK.
The question now (post mortem, unfortunately!) is that G should at
least try to consult a little with developers here before committing
to such things.
So G, please try to engage us (poor developers) in the loop before
such decisions are made; probably we are just stupid developers, but
still maybe we have an idea or something worth considering.
More so as I do not see any relevant issue or ticket request being
filed about this.
Happy coding ;)
Athens - Greece
Athens - Greece
On Mar 18, 2:33 am, Jan Z <jan.zawad...@gmail.com> wrote:
> By making distinctions such as this one, you're fundamentally shifting
> your direction away from what ought to be (and has been?) a key design
> principle.
unfortunately, hard tradeoffs are at the core of most serious
engineering. it may be cliched, but there's generally no free lunch.
are you hungry enough to pay for that lunch? or would you rather save
the money instead? there's no single right answer to that question; it
depends on the person and the situation.
choices like this are similar. i discussed this in detail in a
presentation; see slide 33 for the executive summary. in essence, you're fighting
against things like the speed of light and queueing in core backbone
routers in major internet peering points, which are difficult or
impossible to change. specifically, 1) the only known distributed
consensus protocol is paxos, 2) paxos requires two round trips, and 3)
getting packets between datacenters in different physical locations
takes time. multiply that time by four (for the two round trips), add
in disk seeks on either end, and your writes will always be
significantly slower than local writes in only one datacenter. there
isn't really any way around that. at least, not until maybe quantum
computing, or wormholes. :P
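the arithmetic in the paragraph above can be made concrete with a
back-of-the-envelope model. the millisecond figures are illustrative
assumptions, not measurements:

```python
ONE_WAY_MS = 25.0    # assumed one-way packet time between distant datacenters
DISK_SEEK_MS = 10.0  # assumed disk seek cost on each end

def local_write_ms(seek_ms=DISK_SEEK_MS):
    # A write confined to a single datacenter: roughly one disk seek.
    return seek_ms

def replicated_write_ms(one_way_ms=ONE_WAY_MS, seek_ms=DISK_SEEK_MS):
    # Two Paxos round trips = four one-way hops, plus a seek on each end.
    return 4 * one_way_ms + 2 * seek_ms

print(local_write_ms())       # single-datacenter write cost
print(replicated_write_ms())  # synchronously replicated write cost
```

under these assumed numbers the replicated write is an order of
magnitude slower, which is the gap the two datastore options trade
against.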
this is an important general lesson that we've learned in the
developer group at google. tradeoffs like these are inherent in
engineering, and there's usually no one size fits all. one choice is
right for some apps, but not for the others. given that, the
interesting question isn't whether to offer the option. we have to.
the interesting question - if any - is what the default should be.
(also see my post on synchronous replication vs. eventually consistent
reads.)
Thanks for explaining.
I know engineering is about trade-offs, and after reading your post
and viewing your presentation I am convinced G took the right
engineering decision. Still, I think a little consultation with
developers before major decisions are taken could be helpful.