May 25th Datastore Outage Post-mortem
Summary
On May 25th, App Engine’s Datastore experienced a failure causing an
unexpected read-only period while traffic moved to the secondary data
center. The outage affected all App Engine applications using the
Datastore service. The outage lasted 50 minutes, and residual high
latency lingered for an additional two hours. Notably, this was less
than half the length of our previous outage, thanks in part to new
procedures now in place (the previous outage post-mortem is available here:
https://groups.google.com/group/google-appengine/browse_thread/thread/a7640a2743922dcf).
Unfortunately, we did see a number of applications affected by a
replication issue causing unapplied writes.
Root Cause
The Datastore relies on Bigtable to store data (read more about
Bigtable here:
http://labs.google.com/papers/bigtable.html). One of
the components of Bigtable is a repository for determining where a
specific entity is located in the distributed system. Due to
instability in the cluster, this component became overloaded. This
had the cascading effect of preventing requests from determining where
to send Datastore operations in a timely fashion, making these
requests (both reads and writes) time out.
By default, App Engine waits up to 30 seconds for a Datastore request
to complete. This behavior caused the number of requests waiting
to complete to quickly jump beyond the safety limit for the App Engine
service. This in turn caused all requests to fail, regardless of
whether or not they used the Datastore.
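For applications that would rather fail fast than wait out that full
deadline, the Python runtime's db.create_rpc() accepts a shorter
per-call deadline. The following is only a minimal sketch; the key
lookup and the fallback behavior are placeholders, not a prescription:

    from google.appengine.ext import db

    def fetch_entity(key):
        # Ask the Datastore to give up after 5 seconds instead of
        # waiting out the full request deadline.
        rpc = db.create_rpc(deadline=5)
        try:
            return db.get(key, rpc=rpc)
        except db.Timeout:
            # Placeholder fallback: serve cached or partial data.
            return None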
Unapplied Writes
The outage caused the primary Datastore to stop replicating data a
few minutes before we entered the read-only period, creating writes
that were not applied to the secondary. All of the data has been
recovered and reinserted into each affected application’s Datastore as
separately labeled entities. We want to stress that these unapplied
writes do not impact the transactional consistency of application data
and did not result in corruption. Instead, you can think of them as
causing the mirror image of data between the primary and secondary
Datastores to be out of sync.
The App Engine team will email the administrators of all affected
applications (approximately 2%) in the next 24 hours to let them know
that they should take action. If you do not receive an email, there is
no action for you to take. For more information on unapplied writes
and guidance on how to reintegrate them, please see the Unapplied
Writes FAQ:
http://code.google.com/appengine/kb/unappliedwrites.html
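As a rough illustration only: the FAQ above describes the actual kind
names and recommended steps, but reintegration generally amounts to
copying each recovered entity back into its original kind. In the
hypothetical Python sketch below, 'Greeting' and 'UnappliedGreeting'
are placeholder kind names, not the labels used by the recovery
process:

    from google.appengine.ext import db

    class Greeting(db.Model):               # original kind (placeholder)
        content = db.StringProperty()

    class UnappliedGreeting(db.Model):      # recovered copy (placeholder)
        content = db.StringProperty()

    def reintegrate(batch_size=100):
        # Copy each recovered entity back into its original kind,
        # then remove the recovered copy.
        for entity in UnappliedGreeting.all().fetch(batch_size):
            Greeting(content=entity.content).put()
            entity.delete()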
On a related note, the unapplied writes also affected the billing
state of approximately 0.3% of App Engine applications. This was
caused by unapplied writes affecting the App Engine Admin Console’s own
Datastore, just as they would any other App Engine application’s
Datastore. For those
applications, rather than wait for the recovery, we assumed there were
no charges for the affected days and their billing history will show
$0.00 charges over the week centered around this outage.
Timeline
12:35 pm - Datastore begins seeing a large increase in latency as a
result of instability in the underlying infrastructure. Write
replication to the secondary data center slows to a crawl due to this
latency.
12:40 pm - App Engine team determines that we cannot continue to serve
out of the primary data center and begins the failover procedure. The
Datastore is set to read-only as part of the procedure. Task Queue
execution and Cron scheduled tasks are also put on hold.
1:05 pm - Read queries are now served out of the secondary data
center.
1:13 pm - Communication team publishes external announcement on
downtime-notify that App Engine is having an outage and the Datastore
is currently in read-only mode.
1:24 pm - Secondary data center begins serving read and write traffic
but latency on requests is still high, resulting in an elevated rate
of DeadlineExceededErrors.
2:20 pm - All but the largest applications are no longer seeing
issues. The on-call team begins tuning resource allocation to help the
remaining applications recover.
3:10 pm - Latency has returned to normal for all applications and the
all-clear is announced on downtime-notify. Cron and Task Queues are
turned back on.
Lessons and Take-aways
First, we’d like to thank the App Engine Site Reliability Engineering
team. The outage in this case was unavoidable, but the impact was
drastically reduced thanks both to their diligence and to the many
processes and tools they have put in place in recent months.
However, there are several lessons we’ve learned as a result of the
outage:
- It is critical to offer an alternative configuration of the
Datastore. This configuration should be much less susceptible to
outages and will prevent any replication loss when they do occur, but
will trade off some performance. This is now the highest priority task
after fixing the current Datastore latency problems (for more information,
please see:
http://googleappengine.blogspot.com/2010/06/datastore-performance-growing-pains.html)
- The on-call engineer has clearance to announce the outage on the
downtime-notify group as soon as the failover process has been
initiated. This is no longer blocked on the communication team.
- There was a fair amount of confusion about Task Queue tasks not
executing during the outage. As a result, we will expand the
documentation on how Task Queues and Cron jobs behave in the event of
an outage (one way an application can detect a Datastore read-only
period today is sketched below).
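As a hedged illustration (not an official recommendation), the sketch
below shows one way a Python application can check whether Datastore
writes are currently enabled using the Capabilities API, and handle
the error raised when a write is attempted during a read-only period.
The model name and the fallback behavior are placeholders:

    from google.appengine.api import capabilities
    from google.appengine.ext import db
    from google.appengine.runtime import apiproxy_errors

    class LogEntry(db.Model):          # placeholder model
        message = db.StringProperty()

    def save_entry(message):
        # Check whether Datastore writes are currently enabled.
        datastore_writes = capabilities.CapabilitySet(
            'datastore_v3', capabilities=['write'])
        if not datastore_writes.is_enabled():
            return False  # read-only: defer the write, e.g. to a task

        try:
            LogEntry(message=message).put()
            return True
        except apiproxy_errors.CapabilityDisabledError:
            # Raised if the Datastore went read-only mid-request.
            return False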