App Engine Developers,
We wanted to provide you with a fully detailed account of our recent
outage. Every billed application will also receive a credit for all
paid resource usage from the entire day of the outage; this will
appear as a credit balance in your application account within the next
week.
We apologize for the downtime, and the App Engine team is continuing
its work to improve the availability and power of App Engine. Our full
post-mortem analysis is below.
Thanks for your patience.
The Google App Engine Team
----------
Summary
On July 2, from 6:45 AM PDT until 12:35 PM PDT, Google App Engine (App
Engine) experienced an outage that ranged from partial to complete.
Following is a timeline of events, an analysis of the technology and
process failures, and a set of steps the team is committed to taking
to prevent such an outage from happening again.
The App Engine outage was due to complete unavailability of the
datacenter's persistence layer, GFS, for approximately three hours.
The GFS failure was abrupt for reasons described below, and as a
consequence the data belonging to App Engine applications remained
resident on GFS servers and was unreachable during this period. Since
the needed application data was completely unreachable for longer
than expected, we could not follow the usual procedure of serving App
Engine applications from an alternate datacenter,
because doing so would have resulted in inconsistent or unavailable
data for applications.
The root cause of the outage was a bug in the GFS Master server:
another client in the datacenter sent it an improperly formed
filehandle that had not been safely sanitized on the server side and
that caused a stack overflow on the Master when processed.
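To make that failure mode concrete, below is a minimal sketch of the
general pattern. It is not GFS code: the Master is internal C++
software and the real filehandle format is not public, so every name
in the example (parse_handle_unsafe, MAX_HANDLE_DEPTH, and so on) is a
hypothetical stand-in. The point is only that a value taken from a
client request and processed recursively without server-side
validation can exhaust the stack, while an up-front sanitization step
rejects it cheaply.

    # Illustrative sketch only; names and limits are hypothetical, not GFS's.
    MAX_HANDLE_DEPTH = 64          # assumed limit on legitimate handle nesting
    MAX_COMPONENT_LEN = 255        # assumed limit on a single path component


    def parse_handle_unsafe(raw):
        """Bug pattern: recurse on whatever the client sent, with no checks."""
        head, sep, rest = raw.partition("/")
        if not sep:
            return [head]
        # A malformed handle (for example, "/" repeated many thousands of
        # times) recurses until the stack is exhausted and the process dies.
        return [head] + parse_handle_unsafe(rest)


    def parse_handle_sanitized(raw):
        """Fix pattern: validate and bound the input before processing it."""
        components = raw.split("/")
        if len(components) > MAX_HANDLE_DEPTH:
            raise ValueError("malformed filehandle: too deeply nested")
        for component in components:
            if len(component) > MAX_COMPONENT_LEN:
                raise ValueError("malformed filehandle: oversized component")
        return components   # iterative, bounded, rejects bad input up front


    if __name__ == "__main__":
        query_of_death = "/" * 100000        # stand-in for the malformed handle
        try:
            parse_handle_sanitized(query_of_death)
        except ValueError as err:
            print("rejected:", err)          # the server stays up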
----------
Timeline (all times below Pacific Daylight Time, GMT -0700)
6:44 AM --- A GFS Site Reliability Engineer (SRE) reports that the GFS
Master in App Engine's primary data center is failing and continuously
restarting. Since it is failing repeatedly, dependent services cannot
communicate reliably with GFS.
7:00 AM --- The system that monitors the health of the App Engine
cluster notices that request latency has spiked across many
applications and pages the primary on-call engineer for App Engine.
The primary on-call engineer begins investigating but quickly receives
another page reporting an increased error rate for Datastore RPCs and
increased latency for Datastore operations (reads and writes).
Datastore reads
are succeeding within normal tolerances. Between 5% and 20% of
Datastore writes are failing.
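For illustration, the check below is a minimal sketch of the kind of
threshold alert that paged the on-call engineer. The real monitoring
system and its thresholds are internal; the only figure taken from
this report is the 5% to 20% write failure rate, and the threshold
values and names in the sketch are assumptions.

    # Hypothetical sketch of a threshold-based paging check.
    from dataclasses import dataclass


    @dataclass
    class DatastoreHealth:
        read_error_rate: float    # fraction of reads failing over the window
        write_error_rate: float   # fraction of writes failing over the window
        p50_latency_ms: float     # median operation latency over the window


    def should_page(sample,
                    max_write_error_rate=0.05,    # assumed threshold
                    max_latency_ms=500.0):        # assumed threshold
        """Page the primary on-call when errors or latency leave tolerance."""
        return (sample.write_error_rate > max_write_error_rate
                or sample.p50_latency_ms > max_latency_ms)


    if __name__ == "__main__":
        # Roughly the 7:00 AM picture: reads healthy, writes failing 5-20%,
        # latency elevated across many applications.
        print(should_page(DatastoreHealth(0.001, 0.12, 850.0)))   # True: page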
8:00 AM --- The cause of the GFS Master failures has not yet been
identified. However, a similar-looking issue that had been seen in a
different data center the week prior had been resolved by an upgrade
to a newer version of the GFS software. This upgrade was already
planned for the App Engine primary data center later in the week, so
the GFS SRE decides to commence the upgrade immediately in an attempt
to alleviate the problem.
8:07 AM --- The App Engine primary on-call engineer attempts to update
the System Status site with information describing elevated datastore
latency and error rates. However, the Status Site is only
intermittently available and is returning errors on all updates.
Investigating the problem, the primary engineer discovers that the
isolated servers supporting the Status Site are running in the same
data center as the primary App Engine serving cluster, and thus that
the site ultimately depends on the same GFS instance as App Engine
itself. This misconfiguration is later traced to an error in the App
Engine datacenter failover procedure.
8:35 AM --- Datastore write failure rate rises to 100%. Most of the
App Engine engineering team is present and involved in resolving the
problem at this point. Datastore replication delay between the App
Engine primary data center and the App Engine alternate datacenter
is measured at 30 minutes. In other words, the App Engine Datastore
was determined to be 30 minutes behind in replicating application data
to the alternate datacenter and would need another 30 minutes of
sending data to "catch up." Usual replication delay values are around
1 to 5 minutes.
Because not all data had been replicated out of the primary serving
datacenter to the alternate datacenter, moving serving traffic to the
alternate datacenter would have resulted in a random set of
application data being unavailable to App Engine applications. If
Datastore writes were enabled in the alternate datacenter, then writes
would be based on stale or incomplete data.
Thus, at this point we had to choose between failing over
immediately, in which case applications would have seen inconsistent
or unavailable data, and waiting roughly 30 minutes in read-only mode
for replication to catch up. We decided that inconsistency was not
acceptable, even given the serving problems of the past hour and a
half, so the App Engine primary on-call began preparing for failover
to the alternate datacenter with the understanding that failover could
not occur until replication caught up.
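The trade-off above reduces to one invariant: never enable writes in a
datacenter that is still behind the primary. The sketch below is a
simplified illustration of that decision, not the actual failover
procedure, which is operational and considerably more involved; the
function and constant names are assumptions.

    # Hypothetical sketch of the failover gate described above.
    MAX_SAFE_REPLICATION_DELAY_S = 0   # must be fully caught up to fail over


    def plan_failover(replication_delay_s):
        """Return the action appropriate for the current replication delay."""
        if replication_delay_s > MAX_SAFE_REPLICATION_DELAY_S:
            # Enabling writes in the alternate datacenter now would build on
            # stale or incomplete data, so hold in read-only mode and wait.
            return ("hold: read-only, wait %d seconds for replication to "
                    "catch up" % replication_delay_s)
        # The alternate datacenter has all application data; writes are safe.
        return "fail over and re-enable writes in the alternate datacenter"


    if __name__ == "__main__":
        print(plan_failover(30 * 60))   # the 8:35 AM situation: 30 min behind
        print(plan_failover(0))         # the 11:47 AM situation: caught up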
9:00 AM --- The upgrade to the new version of the GFS software
finishes, but the Master is still failing. The GFS SRE escalates
directly to the GFS engineering team, and the GFS engineering team
immediately begins live debugging of the failing software to attempt
to determine the cause and come up with a fix.
10:00 AM --- GFS SRE advises that the GFS engineering team has
identified the cause of the crashes as a "query-of-death" against the
GFS servers. Another user of GFS in the same primary datacenter as App
Engine is issuing a request to the GFS servers that reliably causes a
crash. The client was sending an improperly formed filehandle which
was not safely checked and sanitized by the server, and which caused a
stack overflow when processed. Now that the bug is known to be
triggered by a malformed query from a client, the GFS SRE identifies a
MapReduce process that is issuing the query, and the process is
disabled. The GFS Master stops failing, and the GFS Chunkservers,
which hold the actual application data, begin to come back up by
10:30 AM.
10:40 AM --- The overall Datastore error rate remains at 30% and
continues to rise, even though the GFS Master has been fixed.
Replication delay is 2 hours
and 45 minutes. (The delay estimates are based on how quickly data can
be read and sent to the remote datacenter.)
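To show how such an estimate is formed, here is a minimal sketch: the
delay estimate is simply the unreplicated backlog divided by the rate
at which it can currently be read and shipped to the remote
datacenter. The byte counts and throughput in the example are invented
for illustration, not measurements from the incident.

    def estimated_catch_up_seconds(backlog_bytes, effective_bytes_per_sec):
        """Time to drain the replication backlog at the current rate."""
        if effective_bytes_per_sec <= 0:
            return float("inf")   # while reads fail, the backlog cannot drain
        return backlog_bytes / effective_bytes_per_sec


    if __name__ == "__main__":
        # Invented numbers: a 200 GB backlog at 20 MB/s works out to roughly
        # 10,000 seconds, i.e. on the order of a few hours.
        print(estimated_catch_up_seconds(200e9, 20e6))
        # While GFS reads were failing the effective rate was near zero, so
        # the estimated delay grew from 30 minutes to almost 3 hours.
        print(estimated_catch_up_seconds(200e9, 0.0))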
11:47 AM --- Datastore servers start up properly again, and both reads
and writes are succeeding. Datastore replication quickly catches up to
the present, since all reads are now going through. Replication delay
drops to zero, indicating that the alternate datacenter now has all
application data from the primary datacenter. Initial prognosis is
that the App Engine cluster is healthy again.
12:00 PM --- GFS SRE advises that the GFS Master in the primary App
Engine data center needs to be restarted one more time later in the
day to pick up some configuration changes missed during the emergency
upgrade. Based on this and the fact that replication delay remains at
zero, the App Engine primary on-call engineer decides to fail over
to the backup data center to avoid any instability introduced by the
planned GFS Master restart later in the day. Failover to the alternate
datacenter begins.
12:14 PM --- Writes are re-enabled in the backup data center. The
failover is complete, and the alternate datacenter is serving
normally.
12:35 PM --- Message posted to the google-appengine-downtime-notify
group that all functionality has been restored.
----------
What did we do wrong?
Production --- It is possible, although unlikely, that if we had
disabled Datastore writes before 8 AM, when the problem was initially
detected, Datastore replication might have caught up before GFS
completely failed. Had that happened, we would have been able to move
traffic out of our primary data center before all reads and writes
became unavailable, and downtime would have been reduced to a partial
outage of around 30 minutes. However, between 7 AM and 8 AM there was
not yet any evidence that GFS would fail for the entire cluster and
that we were heading towards a major outage. Given the information
that was available at the time, leaving Datastore writes enabled was a
reasonable decision.
Communication --- One area where it's clear we could have done better
was communication with our customers during the outage. We have a well-
defined process for keeping our customers informed, but the process
assumes that updates can be posted to the System Status site. There
are a number of concrete steps we will take to prevent this from
happening again. First, we will devise a backup plan for communicating
with customers when the System Status site is unavailable. Second, we
will update our failover script to verify that the System Status site
and App Engine are running in different data centers. We will also
update our failover script to move the System Status site to a
different data center when this is not the case. Third, we will add an
automated alert to our monitoring system that will notify the App
Engine engineer who is on-call whenever the System Status site and App
Engine are running in the same data center.
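As a purely illustrative sketch of the third step, the check below
pages the on-call engineer whenever the System Status site and App
Engine are serving from the same data center. The lookup and paging
functions are assumptions; the real check would run against internal
serving configuration.

    def serving_datacenter(service, config):
        """Look up which data center a service is currently served from."""
        return config[service]


    def check_status_site_isolation(config, page):
        """Return True (and page on-call) if the isolation rule is violated."""
        colocated = (serving_datacenter("appengine", config)
                     == serving_datacenter("status-site", config))
        if colocated:
            page("System Status site is colocated with App Engine serving; "
                 "move it to a different data center before it is needed.")
        return colocated


    if __name__ == "__main__":
        # The misconfiguration found at 8:07 AM: both in the primary data
        # center.
        config = {"appengine": "primary-dc", "status-site": "primary-dc"}
        check_status_site_isolation(config, page=print)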
Architecture --- Ultimately, this outage was not the result of a
single bad decision. GFS and App Engine are distributed
systems that have been designed from the ground up with fault-
tolerance in mind, and we designed our failover strategy with the
understanding that it relied on stored data being available for
reading during the time we wanted to fail over. The failover procedure
was designed to cope with partial unavailability of GFS for extended
periods, and full unavailability for short periods, but was not
designed to handle failover during full unavailability for a long
period (greater than three hours). We have had an engineering effort
under way for approximately 10 months to make App Engine less
dependent on any single instance of GFS, and more resilient in the
face of outages in the primary datacenter. We expect to deploy this
system to production within the next two months. This will
significantly reduce the likelihood of a complete outage like the one
we saw on July 2.
Recovery --- However, even if we could roll these changes out today,
it's still extremely important that we be able to get the entire
system functioning more quickly when any of the GFS instances we
depend on becomes unexpectedly unavailable for an extended period of
time. We were surprised by the amount of time it took for us to begin
serving normally once the GFS query-of-death was identified and
disabled, and this delay is unacceptable to us. Frankly, we did not
expect the whole persistence layer to be unavailable for nearly this
long for any reason, and therefore had not planned properly for it.
The engineering team is already discussing possible solutions, and we
will update our roadmap as soon as we have something concrete we can
work towards.
----------
What are we doing to fix it?
1. The underlying bug in GFS has already been addressed, and the fix
will be pushed to all datacenters as soon as possible. It has also
been determined that the bug had been live for at least a year without
being triggered, so the risk of it recurring before the fix is fully
deployed should be low. Site reliability engineers are aware of the
issue and can quickly mitigate it if it does recur in the meantime.
2. The App Engine team is accelerating its schedule to release the new
clustering system that was already under development. When this system
is in place, it will greatly reduce the likelihood of a complete
outage like this one.
3. The App Engine team is actively investigating new solutions to cope
with long-term unavailability of the primary persistence layer. These
solutions will be designed to ensure that applications can cope
reasonably with long-term catastrophic outages, no matter how rare.
4. Changes will be made to the Status Site configuration to ensure
that the Status Site is properly available during outages.
----------