App Engine Datastore Outage - May 25, 2010, 12:40PM


App Engine Team

May 25, 2010, 4:13:44 PM
to Google App Engine Downtime Notify
At 12:40PM PST, App Engine began experiencing a datastore service
outage which has resulted in the datastore being read-only. The team
is now actively working on re-enabling writes. We will update this
thread with further information as it is available.

App Engine Team

May 25, 2010, 4:27:32 PM
to Google App Engine Downtime Notify
As of 1:24PM - All applications are now able to read and write to the
datastore, but latency is still high. The team is addressing the
underlying latency issues now, and applications should see
improvements in a few minutes.

On May 25, 1:13 pm, App Engine Team <appengine.nore...@gmail.com>
wrote:

App Engine Team

May 25, 2010, 5:10:40 PM
to Google App Engine Downtime Notify
As of 2:05PM - Many larger applications are still seeing elevated
latency and error rates. We do not yet consider this outage resolved.

On May 25, 1:27 pm, App Engine Team <appengine.nore...@gmail.com>

App Engine Team

May 25, 2010, 5:25:56 PM
to Google App Engine Downtime Notify
As of 2:20 PM - Most applications should no longer be seeing issues.
Applications that are very heavy users of the datastore are still
seeing elevated error rates and we expect this to continue for the
next hour.

On May 25, 2:10 pm, App Engine Team <appengine.nore...@gmail.com>

App Engine Team

May 25, 2010, 5:41:36 PM
to Google App Engine Downtime Notify
Task Queues and Cron Tasks are currently not running while the
datastore service is being brought back up.

On May 25, 2:10 pm, App Engine Team <appengine.nore...@gmail.com>

App Engine Team

May 25, 2010, 6:10:09 PM
to Google App Engine Downtime Notify
Latencies and error rates have dropped and we have re-enabled Task
Queues and Cron. At this point, we consider the problem resolved, but
we will be watching the performance of datastore very closely over the
next 24 hours. A post mortem of this outage will be completed and a
link will be added to this thread.

On May 25, 2:41 pm, App Engine Team <appengine.nore...@gmail.com>

App Engine Team

Jun 10, 2010, 12:58:35 PM
to Google App Engine Downtime Notify
May 25th Datastore Outage Post-mortem

Summary

On May 25th, App Engine’s Datastore experienced a failure causing an
unexpected read-only period while traffic moved to the secondary data
center. The outage affected all App Engine applications using the
Datastore service. The outage lasted 50 minutes, while residual high
latency lingered for an additional two hours. Notably, this was less
than half the length of our previous outage, thanks in part to new
procedures now in place (previous outage post-mortem available here:
https://groups.google.com/group/google-appengine/browse_thread/thread/a7640a2743922dcf).
Unfortunately, we did see a number of applications affected by a
replication issue causing unapplied writes.

Root Cause

The Datastore relies on Bigtable to store data (read more about
Bigtable here: http://labs.google.com/papers/bigtable.html). One of
Bigtable's components is a repository that records where a
specific entity is located in the distributed system. Due to
instability in the cluster, this component became overloaded. This
had the cascading effect of preventing requests from determining where
to send Datastore operations in a timely fashion, making these
requests (both reads and writes) time out.

By default, App Engine will wait the full 30 seconds to complete a
Datastore request. This behavior caused the number of requests waiting
to complete to quickly jump beyond the safety limit for the App Engine
service. This in turn caused all requests to fail, regardless of
whether or not they used the Datastore.
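
To illustrate why a long default deadline lets waiting requests pile up so quickly, here is a minimal sketch based on Little's law (average requests in flight = arrival rate x time each request is held). The arrival rate and the healthy latency are hypothetical numbers chosen only for illustration; the 30-second figure is the default wait described above. The actual safety limit of the App Engine service is not stated here and is not modeled.

```python
def in_flight_requests(arrival_rate_per_sec, held_ms):
    """Little's law: average requests in flight = arrival rate x hold time."""
    return arrival_rate_per_sec * held_ms / 1000


# Hypothetical traffic figures, for illustration only.
ARRIVAL_RATE = 100          # requests per second (assumed)
HEALTHY_LATENCY_MS = 200    # typical request duration when healthy (assumed)
FULL_DEADLINE_MS = 30000    # every request waits the full 30 seconds

healthy = in_flight_requests(ARRIVAL_RATE, HEALTHY_LATENCY_MS)
stalled = in_flight_requests(ARRIVAL_RATE, FULL_DEADLINE_MS)

print("in flight when healthy:", healthy)
print("in flight when every request waits 30s:", stalled)
print("growth factor:", stalled / healthy)
```

Under these assumed numbers, the same traffic holds 150 times as many requests open once every Datastore call runs to the full deadline, which is the kind of jump that can exceed a fixed limit on waiting requests.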

Unapplied Writes

The outage caused the primary Datastore to stop replicating data a
few minutes before we entered the read-only period, creating writes
that were not applied to the secondary. All of the data has been
recovered and reinserted into the application’s Datastore as
separately labeled entities. We want to stress that these unapplied
writes do not impact the transactional consistency of application data
and resulted in corruption. Instead you can think of them as causing
the mirror image between the primary and secondary Datastore to be out
of sync.

The App Engine team will email the administrators of all affected
applications (approximately 2%) in the next 24 hours to let them know
that they should take action. If you do not receive an email, there is
no action for you to take. For more information on unapplied writes
and guidance on how to reintegrate them, please see the Unapplied
Writes FAQ: http://code.google.com/appengine/kb/unappliedwrites.html
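
As a rough illustration of what reintegrating unapplied writes can involve, here is a minimal sketch of one possible policy. It is not the procedure from the FAQ and it does not use the Datastore API: plain dictionaries stand in for stored entities, and the key names, the `reintegrate` function, and the "apply if missing, otherwise flag for review" rule are all assumptions made for this example.

```python
def reintegrate(current, recovered):
    """Merge recovered (unapplied) writes into the current data.

    current:   dict mapping entity key -> value now in the datastore
    recovered: list of (key, value) pairs for writes that were never
               applied to the secondary

    Hypothetical policy: a recovered write is applied only when its key
    is absent from the current data. When the key already exists with a
    different value, the write is reported as a conflict so that a
    person (or application-specific logic) can decide which value wins.
    """
    merged = dict(current)
    conflicts = []
    for key, value in recovered:
        if key not in merged:
            merged[key] = value
        elif merged[key] != value:
            conflicts.append((key, merged[key], value))
    return merged, conflicts


# Example with made-up keys and values.
current = {"user:1": "alice", "user:2": "bob"}
recovered = [("user:3", "carol"), ("user:2", "robert")]

merged, conflicts = reintegrate(current, recovered)
print(merged)
print(conflicts)
```

The sketch deliberately never overwrites existing data, since the application may have written newer values for the same entity after the failover.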

On a related note, the unapplied writes also affected the billing
state of approximately 0.3% of App Engine applications. This was
caused by unapplied writes affecting the App Engine Admin Console
Datastore just as they would any other App Engine application. For those
applications, rather than wait for the recovery, we assumed there were
no charges for the affected days and their billing history will show
$0.00 charges over the week centered around this outage.

Timeline

12:35 pm - Datastore begins seeing a large increase in latency as a
result of instability in the underlying infrastructure. Write
replication to the secondary datacenter slows to a crawl as a result
of latency.
12:40 pm - App Engine team determines that we cannot continue to serve
out of the primary data center and begins the failover procedure. The
Datastore is set to read-only as part of the procedure. Task Queue
execution and Cron scheduled tasks are also put on hold.
1:05 pm - Read queries are now served out of the secondary data
center.
1:13 pm - Communication team publishes external announcement on
downtime-notify that App Engine is having an outage and the Datastore
is currently in read-only.
1:24 pm - Secondary data center begins serving read and write traffic,
but latency on requests is still high, resulting in a higher rate of
DeadlineExceededErrors.
2:20 pm - All but large applications are no longer seeing issues. The
on-call team begins tuning resource allocation to help.
3:10 pm - Latency has returned to normal for all applications and the
all clear is announced on downtime-notify. Cron and Task Queues are
turned back on.
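
The gaps between the timeline entries can be worked out with a short sketch like the one below. The times are copied from the timeline; the event labels are illustrative shorthand, and which pair of events best defines "the outage" is a matter of interpretation.

```python
from datetime import datetime

# Times copied from the timeline above (all on May 25, 2010).
EVENTS = {
    "latency_spike": "12:35 pm",
    "read_only_start": "12:40 pm",
    "announcement": "1:13 pm",
    "writes_restored": "1:24 pm",
    "all_clear": "3:10 pm",
}


def minutes_between(start_label, end_label):
    """Whole minutes between two labeled timeline events."""
    fmt = "%I:%M %p"
    start = datetime.strptime(EVENTS[start_label], fmt)
    end = datetime.strptime(EVENTS[end_label], fmt)
    return int((end - start).total_seconds() // 60)


print(minutes_between("latency_spike", "writes_restored"))    # 49
print(minutes_between("read_only_start", "writes_restored"))  # 44
print(minutes_between("read_only_start", "announcement"))     # 33
print(minutes_between("writes_restored", "all_clear"))        # 106
```

Measured from the first latency spike to the restoration of writes, the gap is 49 minutes, in line with the roughly 50-minute outage in the summary, and elevated latency persisted for a further 106 minutes until the all clear.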

Lessons and Take-aways

First, we’d like to thank the App Engine Site Reliability Engineering
team. The outage in this case was unavoidable, but the impact was
drastically reduced thanks both to their diligence and to the many
processes and tools they have put in place in recent months.

However, there are several lessons we’ve learned as a result of the
outage:
- It is critical to offer an alternative configuration of the
Datastore. This implementation should be much less susceptible to
outages and will prevent any replication loss during outages, but will
trade off performance. This is now the highest priority task after
fixing the current Datastore latency problems (For more information,
please see: http://googleappengine.blogspot.com/2010/06/datastore-performance-growing-pains.html)
- The on-call engineer now has clearance to announce the outage on the
downtime-notify group as soon as the failover process has been
initiated. This is no longer blocked on the communication team.
- There was a fair amount of confusion about Task Queue tasks not
executing during the outage. As a result, we will expand the
documentation on how Task Queues and Cron jobs behave in the event of
an outage.

App Engine Team

Jun 10, 2010, 5:40:59 PM
to Google App Engine Downtime Notify
Important correction - The Unapplied Write section above should say

We want to stress that these unapplied writes do not impact the
transactional consistency of application data
and *did not* result in corruption.

Thank you Daniel!
