Information regarding 2 July 2009 outage

Chris Beckmann (App Engine PM)

Jul 2, 2009, 7:47:57 PM

to Google App Engine

We wanted to provide you with some additional detail regarding our
recent outage. On July 2nd, between 6:20 AM PT and 12:30 PM PT, all
applications experienced increased error rate and latency with
Datastore and memcache operations, as well as some serving errors.
Datastore access and serving were fully restored as of 12:25 PM PT.

Problem

There was a serious issue in one of App Engine's datacenters with GFS,
Google's low level storage system. GFS underlies Bigtable, which in
turn underlies App Engine's Datastore. GFS also provides storage for
our application serving infrastructure, so GFS unavailability caused
problems for Datastore reads and writes, as well as application
serving.
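The dependency chain described above can be sketched as a small graph walk. This is only an illustration: the service labels are informal names taken from this post, and the `affected_by` helper is hypothetical, not part of any Google system.

```python
from collections import deque

# Dependency edges as described in the post: each key depends on the
# services in its list. The names are informal labels, not real identifiers.
DEPENDS_ON = {
    "Bigtable": ["GFS"],
    "Datastore": ["Bigtable"],
    "Application serving": ["GFS"],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)


print(affected_by("GFS", DEPENDS_ON))
```

With these edges, a GFS failure reaches Bigtable, the Datastore, and application serving, which matches the symptoms listed above.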

Resolution Efforts

Availability and data integrity are both very important to the App
Engine team. Typically, we would have switched to an alternate
datacenter immediately. However, due to the specific nature of this
problem, switching datacenters immediately meant that the most recent
data written by applications would not have been available, leading to
consistency problems for many applications.

The team decided to try to stabilize GFS first, then switch
datacenters. This was accomplished and we avoided any data consistency
issues.

Prevention

The team has been actively working on a medium-term solution that
would allow us to switch over datacenters immediately without
consistency problems.

Communication and Status

Many users noted that the System Status site was also down. The System
Status site is hosted separately from App Engine applications, and is
not typically affected by availability problems. However, due to the
low level problem with GFS in this case, the System Status site was
also affected. The team did post the downtime announcement and updates
on the Downtime Notification group, available here:
http://groups.google.com/group/google-appengine-downtime-notify

The App Engine team is continuing to work to improve the availability
and power of App Engine. Thanks for your patience.

Chris Beckmann
Product Manager, App Engine Team

jonathan

Jul 2, 2009, 8:40:52 PM

to Google App Engine

Obviously App Engine is engineered for scalability. But what sort of
reliability are you aiming for and would you expect to be able to
support in the future? Currently the reliability of the application
platform (with all services available) seems to be about 99%. At 99%
reliability the service is unavailable or misbehaving for about 1.5
hours a week.

Do you think this is high enough for most applications? It may be high
enough for some applications, but I am not sure that it is high enough
for mine.
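As a rough check on that figure, here is a minimal sketch that converts an availability level into expected downtime per week. The function name is made up for illustration, and it assumes downtime is spread evenly over a 168-hour week. At 99% it gives roughly 1.7 hours per week, in line with the estimate above.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def downtime_hours_per_week(availability):
    """Expected hours of unavailability per week at the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_WEEK


# Compare a few common availability targets.
for level in (0.99, 0.999, 0.9999):
    print(f"{level:.2%}: {downtime_hours_per_week(level):.2f} hours/week")
```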

jonathan

Hari Donthi

Jul 2, 2009, 8:43:58 PM

to Google App Engine

Hi Chris,

Is the root cause published somewhere? I'm looking to learn more about
what you're referring to as the "specific nature of this problem"
and "serious issue".

Charlie

Jul 3, 2009, 12:02:44 PM

to Google App Engine

I posted an issue for making it easier for developers to do their own
testing with disabled capabilities:

http://code.google.com/p/googleappengine/issues/detail?id=1811

Having inconsistent data could lead to problems that lingered for a
long time, particularly since memcache will be out of sync, I think,
so you made the right call on not rolling over quickly.

Chris Beckmann (App Engine PM)

Jul 9, 2009, 1:30:50 AM

to Google App Engine

App Engine Developers,

We wanted to provide you with a fully detailed account of our recent
outage. Every billed application will also receive a credit for all
paid resource usage from the entire day of the outage; this will
appear as a credit balance in your application account within the next
week.

We apologize for the downtime, and the App Engine team is continuing
to work to improve the availability and power of App Engine. Our full
post-mortem analysis is below.

Thanks for your patience.

The Google App Engine Team

----------

Summary

On July 2, from 6:45 AM PDT until 12:35 PM PDT, Google App Engine (App
Engine) experienced an outage that ranged from partial to complete.
Following is a timeline of events, an analysis of the technology and
process failures, and a set of steps the team is committed to taking
to prevent such an outage from happening again.

The App Engine outage was due to complete unavailability of the
datacenter's persistence layer, GFS, for approximately three hours.
The GFS failure was abrupt for reasons described below, and as a
consequence the data belonging to App Engine applications remained
resident on GFS servers and was unreachable during this period. Since
needed application data was completely unreachable for longer than
expected, we could not follow the usual procedure of serving App
Engine applications from an alternate datacenter, because doing so
would have resulted in inconsistent or unavailable data for
applications.

The root cause of the outage was a bug in the GFS Master server caused
by another client in the datacenter sending it an improperly formed
filehandle which had not been safely sanitized on the server side, and
thus caused a stack overflow on the Master when processed.
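The general defensive pattern implied here, bounding untrusted input before any recursive processing, can be sketched as follows. This is not GFS code, and the post does not describe the real filehandle format: the nested str/list "handle" structure, the limits, and every name below are assumptions made purely for illustration.

```python
MAX_HANDLE_DEPTH = 32     # hypothetical nesting limit
MAX_HANDLE_NODES = 1024   # hypothetical size limit


class MalformedHandleError(ValueError):
    """Raised when a client-supplied handle fails validation."""


def validate_handle(handle):
    """Validate a nested handle iteratively, so validation itself cannot overflow the stack.

    `handle` is a hypothetical structure: either a str leaf or a list of handles.
    """
    nodes = 0
    stack = [(handle, 1)]  # explicit stack of (node, depth) instead of recursion
    while stack:
        node, depth = stack.pop()
        nodes += 1
        if depth > MAX_HANDLE_DEPTH:
            raise MalformedHandleError("handle nesting too deep")
        if nodes > MAX_HANDLE_NODES:
            raise MalformedHandleError("handle too large")
        if isinstance(node, str):
            continue
        if isinstance(node, list):
            stack.extend((child, depth + 1) for child in node)
            continue
        raise MalformedHandleError(
            f"unexpected handle element: {type(node).__name__}"
        )
    return handle
```

Rejecting oversized or deeply nested input at the boundary means that later code, even recursive code, only ever sees input of bounded depth.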

----------

Timeline (all times below Pacific Daylight Time, GMT -0700)

6:44 AM --- A GFS Site Reliability Engineer (SRE) reports that the GFS
Master in App Engine's primary data center is failing and continuously
restarting. Since it is failing repeatedly, dependent services cannot
communicate reliably with GFS.

7:00 AM --- The monitoring system that monitors the health of the App
Engine cluster notices that request latency has spiked across many
applications and pages the primary on-call engineer for App Engine.
Primary begins investigating but quickly receives another page
reporting an increased error rate for Datastore RPCs and increased
latency for Datastore operations (reads and writes). Datastore reads
are succeeding within normal tolerances. Between 5% and 20% of
Datastore writes are failing.

8:00 AM --- The cause of the GFS Master failures has not yet been
identified. However, a similar-looking issue that had been seen in a
different data center the week prior had been resolved by an upgrade
to a newer version of the GFS software. This upgrade was already
planned for the App Engine primary data center later in the week, so
the GFS SRE decides to commence the upgrade immediately in an attempt
to alleviate the problem.

8:07 AM --- The App Engine primary on-call engineer attempts to update
the System Status site with information describing elevated datastore
latency and error rates. However, the Status Site is only
intermittently available and is returning errors on all updates.
Investigating the problem, the primary engineer discovers that the
isolated servers supporting the Status Site are running in the same
data center as the primary App Engine serving cluster. Thus, the site
ultimately depends upon the same GFS instance as App Engine itself.
The cause of this error in the Status Site was determined to be a
configuration error in the App Engine datacenter failover procedure.

8:35 AM --- Datastore write failure rate rises to 100%. Most of the
App Engine engineering team is present and involved in resolving the
problem at this point. Datastore replication delay between the App
Engine primary data center and the App Engine alternate datacenter
is measured at 30 minutes. In other words, the App Engine Datastore
is 30 minutes behind in replicating application data
to the alternate datacenter and would need another 30 minutes of
sending data to "catch up." Usual replication delay values are around
1 to 5 minutes.

Because not all data had been replicated out of the primary serving
datacenter to the alternate datacenter, moving serving traffic to the
alternate datacenter would have resulted in a random set of
application data being unavailable to App Engine applications. If
Datastore writes were enabled in the alternate datacenter, then writes
would be based on stale or incomplete data.

Thus, at this point we had to choose between failing over immediately,
in which case there would have been inconsistent or unavailable data
for applications, or waiting 30 minutes in read-only mode for
replication to catch up. We decided that inconsistency was not
acceptable, even given the serving problems for the past hour and a
half, so App Engine primary began preparing for failover to the
alternate datacenter with the understanding that failover could not
occur until replication caught up.
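The trade-off described above can be summarized as a small decision function. This is only a sketch of the reasoning in this timeline, not the actual failover tooling: the function name, parameters, and returned strings are all hypothetical.

```python
def failover_action(replication_delay_minutes, primary_readable,
                    max_delay_minutes=0.0):
    """Suggest an action based on replication state. Purely illustrative.

    replication_delay_minutes: how far the alternate datacenter is behind.
    primary_readable: whether replication can still read data from the primary.
    max_delay_minutes: largest delay at which failing over is considered safe.
    """
    if replication_delay_minutes <= max_delay_minutes:
        # The alternate datacenter has all the data, so failover is consistent.
        return "fail over"
    if primary_readable:
        # Stop new writes and let replication drain the backlog.
        return "go read-only and wait for replication"
    # Replication cannot make progress until primary storage is back.
    return "restore primary storage first"


print(failover_action(30, True))
print(failover_action(165, False))
print(failover_action(0, True))
```

The three calls roughly mirror the situations at 8:35 AM, 10:40 AM, and 12:00 PM in this timeline.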

9:00 AM --- The GFS upgrade to the new version of the GFS software
finishes, but the Master is still failing. The GFS SRE escalates
directly to the GFS engineering team, and the GFS engineering team
immediately begins live debugging of the failing software to attempt
to determine the cause and come up with a fix.

10:00 AM --- GFS SRE advises that the GFS engineering team has
identified the cause of the crashes as a "query-of-death" against the
GFS servers. Another user of GFS in the same primary datacenter as App
Engine is issuing a request to the GFS servers that reliably causes a
crash. The client was sending an improperly formed filehandle which
was not safely checked and sanitized by the server, and which caused a
stack overflow when processed. Now that it is known that the bug is a
malformed query from a client, GFS SRE identifies a MapReduce process
that is triggering the GFS bug, and the process is disabled. GFS
Master is no longer failing and GFS Chunkservers, which hold the
actual needed data, are starting to come back up by 10:30 AM.

10:40 AM --- Overall Datastore error rate remains at 30% and continues
to rise, despite the fixed GFS Master. Replication delay is 2 hours
and 45 minutes. (The delay estimates are based on how quickly data can
be read and sent to the remote datacenter.)

11:47 AM --- Datastore servers start up properly again, and both reads
and writes are succeeding. Datastore replication quickly catches up to
the present, since all reads are now going through. Replication delay
drops to zero, indicating that the alternate datacenter now has all
application data from the primary datacenter. Initial prognosis is
that the App Engine cluster is healthy again.

12:00 PM --- GFS SRE advises that the GFS Master in the primary App
Engine data center needs to be restarted one more time later in the
day to pick up some configuration changes missed during the emergency
upgrade. Based on this fact and the fact that replication delay has
again dropped to zero, App Engine primary on-call decides to fail over
to the backup data center to avoid any instability introduced by the
planned GFS Master restart later in the day. Failover to the alternate
datacenter begins.

12:14 PM --- Writes are re-enabled in the backup data center. The
failover is complete, and the alternate datacenter is serving
normally.

12:35 PM --- Message posted to the google-appengine-downtime-notify
group that all functionality has been restored.

----------

What did we do wrong?

Production --- It is possible, although unlikely, that if we had
disabled Datastore writes before 8 AM, when the problem was initially
detected, Datastore replication might have caught up before GFS
completely failed. If this had happened, we would have been able to move
traffic out of our primary data center before all reads and writes
became unavailable, and downtime would have been reduced to a partial
outage of around 30 minutes. However, between 7 AM and 8 AM there was
not yet any evidence that GFS would fail for the entire cluster and
that we were heading towards a major outage. Given the information
that was available at the time, leaving Datastore writes enabled was a
reasonable decision.

Communication --- One area where it's clear we could have done better
was communication with our customers during the outage. We have a
well-defined process for keeping our customers informed, but the process
assumes that updates can be posted to the System Status site. There
are a number of concrete steps we will take to prevent this from
happening again. First, we will devise a backup plan for communicating
with customers when the System Status site is unavailable. Second, we
will update our failover script to verify that the System Status site
and App Engine are running in different data centers. We will also
update our failover script to move the System Status site to a
different data center when this is not the case. Third, we will add an
automated alert to our monitoring system that will notify the App
Engine engineer who is on-call whenever the System Status site and App
Engine are running in the same data center.
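The second and third steps above amount to a placement check. A minimal sketch follows, assuming datacenters are identified by simple string labels; the function name, its parameters, and the labels are hypothetical and do not reflect the real failover script or monitoring system.

```python
def check_status_site_placement(status_site_dc, app_engine_dc, candidate_dcs):
    """Return (alert, target_dc) for the Status Site.

    alert is True when the Status Site shares a datacenter with App Engine.
    target_dc is the datacenter the Status Site should run in afterwards.
    """
    if status_site_dc != app_engine_dc:
        # Already separated: no alert, no move needed.
        return False, status_site_dc
    for dc in candidate_dcs:
        if dc != app_engine_dc:
            # Co-located: alert the on-call engineer and pick another datacenter.
            return True, dc
    raise RuntimeError("no alternate datacenter available for the Status Site")


print(check_status_site_placement("dc-a", "dc-a", ["dc-a", "dc-b"]))
```

Running the same check both in the failover script and as a periodic monitoring probe would cover both steps.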

Architecture --- Ultimately, this outage was not the result of a
single bad decision. GFS and App Engine are distributed systems that
have been designed from the ground up with fault-tolerance in mind,
and we designed our failover strategy with the
understanding that it relied on stored data being available for
reading during the time we wanted to fail over. The failover procedure
was designed to cope with partial unavailability of GFS for extended
periods, and full unavailability for short periods, but was not
designed to handle failover during full unavailability for a long
period (greater than three hours). We have had an engineering effort
under way for approximately 10 months to make App Engine less
dependent on any single instance of GFS, and more resilient in the
face of outages in the primary datacenter. We expect to deploy this
system to production within the next two months. This will
significantly reduce the likelihood of a complete outage like the one
we saw on July 2.

Recovery --- However, even if we could roll these changes out today,
it's still extremely important that we be able to get the entire
system functioning more quickly when any of the GFS instances we
depend on become unexpectedly unavailable for an extended period of
time. We were surprised by the amount of time it took for us to begin
serving normally once the GFS query-of-death was identified and
disabled, and this delay is unacceptable to us. Frankly, we did not
expect the whole persistence layer to be unavailable for nearly this
long for any reason, and therefore had not planned properly for it.
The engineering team is already discussing possible solutions, and we
will update our roadmap as soon as we have something concrete we can
work towards.

----------

What are we doing to fix it?

1. The underlying bug in GFS has already been addressed and the fix
will be pushed to all datacenters as soon as possible. It has also
been determined that the bug has been live for at least a year, so the
risk of recurrence should be low. Site reliability engineers are aware
of this issue and can quickly fix it if it should recur before then.

2. The App Engine team is accelerating its schedule to release the new
clustering system that was already under development. When this system
is in place, it will greatly reduce the likelihood of a complete
outage like this one.

3. The App Engine team is actively investigating new solutions to cope
with long-term unavailability of the primary persistence layer. These
solutions will be designed to ensure that applications can cope
reasonably with long-term catastrophic outages, no matter how rare.

4. Changes will be made to the Status Site configuration to ensure
that the Status Site is properly available during outages.

----------

Brandon Thomson

Jul 9, 2009, 9:39:08 AM

to Google App Engine

Thank you for this detailed post-mortem. It does much to assuage my
worry that there might have been a fundamental design problem in App
Engine's architecture.

nickmilon

Jul 9, 2009, 11:47:27 AM

to Google App Engine

Thanks for keeping us posted with all this info.
It helps us understand the complexities involved in running a service
like App Engine and makes us more comfortable knowing that we are in
very good hands when the unthinkable happens.

GregF

Jul 17, 2009, 7:46:16 AM

to Google App Engine

THAT is what we needed - thanks Chris for giving a detailed
description of the cause, resolution and follow-up. No-one likes
writing these things because they are basically detailing your own
failures, so Chris (and Google) deserve kudos for releasing this.

One of the frustrating things about Appengine is the paucity of
information about it. There is no reliable information about its
architecture, the number of real applications that use it, or even the
requests per second it processes. It is easy to imagine that it
actually runs on a couple of boxes under someone's desk, and that this
incident was caused by the cleaner knocking the power cord.

So in fronting up about the cause of the outage, Google have actually
reassured me about their whole system. Now I know there are serious
monitoring systems in place, real people (SREs) onsite at 6:44am,
backup datacentres, and some serious decision-making skills in charge
(recognising data consistency is more important than uptime). You have
shown that you followed through with a detailed incident analysis and
have learned more about your system - and how it could be better. A
previous poster mentions 99% uptime - I don't know how accurate this
is, but to some extent I don't care if I know Google is learning from
each outage. That means uptime will trend in the right direction.

Please give us similarly detailed post-mortems in the future. We've
trusted you with our apps, please trust us with more information.
It'll also go a long way to answering the accusations of opaqueness
surrounding appengine - the Register's jibes about the "Chocolate
Factory" are close to the bone. And I'd love to know how many requests
per second you process...

Ram Shanker

Jul 24, 2009, 11:40:23 PM

to Google App Engine

bump for those who missed this thread .. :-)
