Datastore outage July 2, 2009


Jeff S (Google)

Jul 2, 2009, 1:38:04 PM
to Google App Engine
Hi all,

As many of you have noticed, App Engine has been experiencing elevated
datastore latency and error rates, and we have switched into an unplanned
maintenance mode. The notice was originally posted to the downtime
notify group; please follow this thread for details and suggestions on
handling the capability disabled exceptions:

http://groups.google.com/group/google-appengine-downtime-notify/browse_thread/thread/f7596d1d0bd0f0f9

Apologies all, I know this is frustrating for all of you and for us.
We normally post these notices just to the downtime notify group, but
in this case I've seen several threads on the topic.

Thank you,

Jeff

ksjun

Jul 2, 2009, 1:42:02 PM
to Google App Engine
But how can I modify my code to handle the capability disabled
exceptions?

Now we can't upload updated code to the server.


On Jul 3, 2:38 am, "Jeff S (Google)" <j...@google.com> wrote:
> Hi all,
>
> As many of you have noticed, App Engine has been experiencing elevated
> datastore latency and error rates and we switched into an unplanned
> maintenance mode. The notice was originally posted to the downtime
> notify group, please follow this thread for details and suggestions on
> handling the capability disabled exceptions:
>
> http://groups.google.com/group/google-appengine-downtime-notify/brows...

Jeff S (Google)

Jul 2, 2009, 1:51:55 PM
to Google App Engine
Hi ksjun,

You make an excellent point. It may not be possible to make these
changes at the moment, but I think it would be good practice to have
this kind of error handling in place in the future.

Regards,

Jeff
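
As a rough illustration of the kind of error handling discussed above, here is a minimal Python sketch. It assumes a datastore write raises an exception while the datastore is in read-only mode. `CapabilityDisabledError` below is a local stand-in for the App Engine exception of the same name (in the Python SDK it lives in `google.appengine.runtime.apiproxy_errors`), and `save_score`, `handle_submit` and the `writable` flag are hypothetical names used only for this example.

```python
class CapabilityDisabledError(Exception):
    """Local stand-in for the App Engine exception of the same name."""


def save_score(store, player, score, writable=True):
    """Hypothetical datastore write; raises while writes are disabled."""
    if not writable:
        raise CapabilityDisabledError("datastore writes are disabled")
    store[player] = score


def handle_submit(store, player, score, writable=True):
    """Return a (status, message) pair instead of letting the page crash."""
    try:
        save_score(store, player, score, writable=writable)
    except CapabilityDisabledError:
        # Read-only mode: tell the user rather than showing a stack trace.
        return 503, "Scores cannot be saved right now. Please try again later."
    return 200, "Score saved."
```

The point of the pattern is only that the handler catches the specific exception around the write and degrades to a friendly message; the rest of the request can carry on.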

Brian McConnell

Jul 2, 2009, 2:05:36 PM
to Google App Engine
You really need to deal with your uptime problems. My experience has
been that App Engine is less reliable than some of the cheap web hosts
I have used. The promise of the system is scalability and redundancy,
but apparently it has a fundamental architectural flaw if outages like
this are so common. I like the way App Engine is designed, and really
would prefer to use it, but it is hard to recommend it to clients when
outages like this are the norm, and when Google does not understand
that customer support = real people answering real phones. Many of us
bought into App Engine on the basis that Google must understand the
issues well enough to provide a reliable service, but frequent hours-long
outages during business hours betray this assumption. At this
point, you need to prove to the developer community that you have
dealt with the issues that keep knocking the service down, and provide
real support for paying customers. People will give you the benefit of
the doubt up to a point, but you've had enough unplanned outages that
the burden of proof is now on you.

On Jul 2, 10:38 am, "Jeff S (Google)" <j...@google.com> wrote:
> Hi all,
>
> As many of you have noticed, App Engine has been experiencing elevated
> datastore latency and error rates and we switched into an unplanned
> maintenance mode. The notice was originally posted to the downtime
> notify group, please follow this thread for details and suggestions on
> handling the capability disabled exceptions:
>
> http://groups.google.com/group/google-appengine-downtime-notify/brows...

ksjun

Jul 2, 2009, 2:22:01 PM
to Google App Engine
At least, I want 'email support'. The only official support is this
forum.

If I have a problem, I must post it here with some sensitive data.

davew

Jul 2, 2009, 2:27:08 PM
to Google App Engine
Hi Jeff,

Is there a way to detect the disabled capabilities, without having to
catch CapabilityDisabledError? So rather than wait for a failure, we
can simply do something like:

    if DatastoreDown:
        return an error straight away letting users know what's going on +
        expected down time (Please try again in 10 minutes, etc)

    # else continue as normal..

Also, if there was a way to get an expected ETA on the time to fix the
problem *via an API* that would be awesome. Sometimes, with scheduled
maintenance you know it's only going to be down for 10 minutes. So we
could post a message to users saying "We expect to be back in 10
minutes!". Or for outages like now, the API call could respond with 1
hour.

Thanks

Dave
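
One way to picture the check Dave describes is a guard that runs before any datastore work. The sketch below is illustrative only: `datastore_writable` is a hypothetical callable supplied by the application, and `retry_after_minutes` is a made-up parameter that the application would have to set itself, since nothing in this thread describes an API that reports an expected recovery time. The App Engine Python SDK has a `google.appengine.api.capabilities` module whose `CapabilitySet(...).is_enabled()` check could back such a callable; whether it was usable at the time of this thread is an assumption.

```python
def handle_request(datastore_writable, do_work, retry_after_minutes=None):
    """Fail fast with a friendly message when the datastore is unavailable.

    datastore_writable: hypothetical zero-argument callable returning a bool.
    do_work: zero-argument callable that performs the normal request handling.
    retry_after_minutes: optional estimate chosen by the application itself.
    """
    if not datastore_writable():
        message = "The datastore is temporarily unavailable."
        if retry_after_minutes is not None:
            message += " Please try again in %d minutes." % retry_after_minutes
        return 503, message
    # Otherwise continue as normal.
    return 200, do_work()
```

A pre-check like this does not replace catching the exception, because the capability can still be disabled between the check and the write.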

stelg

Jul 2, 2009, 2:30:22 PM
to Google App Engine
Brian,

You are right. For paying customers this situation is not really
acceptable and should be kept to a minimum. People expect a
high-quality service from a well-known company like Google, higher than
from a service provider around the corner. I am pretty sure that this
is what Google wants to prove.

I have to be honest: this free Google App Engine, its quality, the
service and the whole innovative approach are a pleasure and really
exciting. For more than 6 months we did not experience major problems.
This is the first serious one that I have encountered.

Bringing such infrastructure to the world, monitoring it and taking
measures when critical situations occur while tens of thousands of
people are working on it is a major exercise (even an adventure in a
certain sense). One simple small error can be critical here. Now that
the load is growing, real scalability will be stress-tested.

I am confident that Google will learn from this and improve Google
App Engine further. I am also pretty sure this will not be the last
interruption. As long as interruptions are kept to a minimum and their
frequency really goes down, confidence will grow.

Come on, Google guys, BEAT that "bug" and get back on track. Success in
your efforts to nail this problem down!


Brenton

Jul 2, 2009, 2:34:36 PM
to Google App Engine
Why are you throwing CapabilityDisabledErrors on memcache writes?
Couldn't you silently fail? There's no guarantee that stuff in
memcache will still be in memcache, so I don't see why it needs to
error on failures. Even the API reference says it returns False on
errors - nothing about it preventing a page from rendering.

memcache is commonly used, and there's no warning in the docs that it
can throw page-breaking errors. By throwing CapabilityDisabled,
you've wrecked a lot of pages that could otherwise render right now.
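
The silent-failure behaviour asked for here can be approximated in application code by wrapping cache writes. This is a sketch under assumptions: `CapabilityDisabledError` is a local stand-in for the App Engine exception, `cache` is any object with a `set(key, value)` method that returns a bool, and `safe_cache_set`, `DisabledCache` and `DictCache` are hypothetical names invented for this example.

```python
class CapabilityDisabledError(Exception):
    """Local stand-in for the App Engine exception of the same name."""


def safe_cache_set(cache, key, value):
    """Try to cache a value; report failure as False instead of raising."""
    try:
        return bool(cache.set(key, value))
    except CapabilityDisabledError:
        # The cache is only an optimisation, so the page can still render.
        return False


class DisabledCache(object):
    """Fake cache client that behaves as if writes were disabled."""

    def set(self, key, value):
        raise CapabilityDisabledError("memcache writes are disabled")


class DictCache(object):
    """Fake in-memory cache client used for the happy path."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value
        return True
```

With a wrapper like this, a disabled cache looks the same to the caller as any other failed write, which matches the "returns False on errors" behaviour described in the API reference.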

Jeff S (Google)

Jul 2, 2009, 4:39:35 PM
to Google App Engine
Just an FYI if you haven't seen the updates to the thread on downtime
notify, datastore writes have been enabled for about an hour now.

http://groups.google.com/group/google-appengine-downtime-notify/browse_thread/thread/f7596d1d0bd0f0f9

Thank you for your patience.

That's an interesting idea, and something worth considering. I think
the difference here is the memcache write failure might be a different
failure mode, where a memcache write would more likely succeed if
retried. The capability disabled exception provides a stronger
indication that a retry will probably also fail (unless of course we
exit read-only mode...).

Cheers,

Jeff
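
The distinction drawn above, between a failure worth retrying and one that signals read-only mode, might be handled as in this sketch. `TransientError` and `CapabilityDisabledError` are local stand-in exception classes and `call_with_retry` is a hypothetical helper; none of them are taken from the App Engine SDK as written here.

```python
class TransientError(Exception):
    """Stand-in for a failure that may succeed if the call is retried."""


class CapabilityDisabledError(Exception):
    """Stand-in for the error raised while a service is in read-only mode."""


def call_with_retry(operation, max_attempts=3):
    """Retry transient failures a bounded number of times.

    max_attempts must be at least 1. A CapabilityDisabledError is not
    caught, so it propagates on the first attempt: a retry would probably
    fail as well.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TransientError as error:
            last_error = error
    raise last_error
```

Keeping the retry count bounded also avoids the infinite-retry loop that would make either kind of failure worse.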

dfabulich

Jul 2, 2009, 2:20:04 PM
to Google App Engine
Jeff, is it normal/expected that when the datastore goes down the
status page goes down too? http://code.google.com/status/appengine

And the appengine homepage? http://appengine.google.com/

On Jul 2, 10:38 am, "Jeff S (Google)" <j...@google.com> wrote:
> Hi all,
>
> As many of you have noticed, App Engine has been experiencing elevated
> datastore latency and error rates and we switched into an unplanned
> maintenance mode. The notice was originally posted to the downtime
> notify group, please follow this thread for details and suggestions on
> handling the capability disabled exceptions:
>
> http://groups.google.com/group/google-appengine-downtime-notify/brows...

Jerason Banes

Jul 2, 2009, 2:46:38 PM
to Google App Engine
As of 1:38PM Central Daylight Time, my website
(http://www.dsicade.com) is completely unavailable. And that's after
experiencing memcache failures today (which disabled chat) and turning
away record new scores for some of the games on the site due to
datastore issues.

These issues appear to be occurring with increasing frequency. I
absolutely love the Google AppEngine service and was expecting to
begin paying for service soon with the explosive growth of my site.
But this extensive downtime is becoming an increasingly serious issue.
If this continues, new features (e.g. multiplayer games) will not be
feasible and I will need to change my hosting plans.

If I may humbly request, please find a solution to these issues as
soon as possible. I love what you're doing here and I want to remain
your customer!

Thanks,
Jerason Banes

http://www.dsicade.com

On Jul 2, 12:38 pm, "Jeff S (Google)" <j...@google.com> wrote:
> Hi all,
>
> As many of you have noticed, App Engine has been experiencing elevated
> datastore latency and error rates and we switched into an unplanned
> maintenance mode. The notice was originally posted to the downtime
> notify group, please follow this thread for details and suggestions on
> handling the capability disabled exceptions:
>
> http://groups.google.com/group/google-appengine-downtime-notify/brows...

Brenton

Jul 2, 2009, 5:34:30 PM
to Google App Engine
Thanks Jeff.

My issue is that throwing an error almost guarantees a crash,
especially since it isn't published anywhere that memcache can even
throw that sort of error. I would imagine there's a bigger proportion
of people who would be caught off guard by a CapabilityDisabledError
crashing their app than there would be people whose apps crash because
they are constantly retrying memcache.write.

So far, the best alternative I've heard is to create an "online"
constant for each service in the API. Someone in the case you are
describing could then check memcache.online before infinitely retrying
(which they shouldn't be doing anyway).

On Jul 2, 1:39 pm, "Jeff S (Google)" <j...@google.com> wrote:
> Just an FYI if you haven't seen the updates to the thread on downtime
> notify, datastore writes have been enabled for about an hour now.
>
> http://groups.google.com/group/google-appengine-downtime-notify/brows...

Jeff S (Google)

Jul 2, 2009, 6:55:55 PM
to google-a...@googlegroups.com
On Thu, Jul 2, 2009 at 11:20 AM, dfabulich <danfa...@gmail.com> wrote:

> Jeff, is it normal/expected that when the datastore goes down the
> status page goes down too?  http://code.google.com/status/appengine
>
> And the appengine homepage?  http://appengine.google.com/

No, this does not usually occur during an outage, and we are taking steps to make a simultaneous outage of the status site with the datastore even less likely. However, if for some reason the status site does become unavailable during an outage, you should still be able to receive updates through the downtime notify discussion group.

http://groups.google.com/group/google-appengine-downtime-notify

Thank you,

Jeff
 

Jeff S (Google)

Jul 2, 2009, 6:58:18 PM
to google-a...@googlegroups.com
Thank you for the feedback Jerason. I hear you and we are working to prevent situations like this from arising again. We plan to publish a postmortem to explain the issue in the near future.

Cheers,

Jeff

jonathan

Jul 2, 2009, 7:43:03 PM
to Google App Engine
I wasn't actually getting CapabilityDisabledExceptions during this
period: instead I got a lot of datastore errors and timeouts.

My timeline:
at 02/Jul/2009:08:40:59 -0700: my memcache calls started failing
at 02/Jul/2009:09:24:49 -0700: I started receiving
google.appengine.api.datastore_errors.Timeout
at 02/Jul/2009:10:38:54 -0700: I started receiving
google.appengine.runtime.apiproxy_errors.Error
at 02/Jul/2009:11:30:36 -0700: I stopped receiving these errors
at 02/Jul/2009:11:56:08 -0700: I started getting
CapabilityDisabledExceptions
at 02/Jul/2009:12:07:23 -0700: Operations resumed normally

This doesn't seem to match what was expected by Google. Can anyone
tell me what was going on?

Jonathan

Jeff S (Google)

Jul 2, 2009, 8:20:54 PM
to Google App Engine
Hi Jonathan,

A full postmortem has been posted here:

http://groups.google.com/group/google-appengine/browse_thread/thread/e9237fc7b0aa7df5#

The issues initially manifested as timeouts and API errors but we
switched into read-only mode (capability disabled exceptions) as we
prepared to switch over to other datacenters. Thanks again for your
patience during this outage and apologies as I know this has been a
serious problem for many.

Thank you,

Jeff

Iap

Jul 2, 2009, 11:34:04 PM
to google-a...@googlegroups.com
The quality of GAE, as well as its service quality, is critical
not only to Google and its business
but also to the reputation of all the Pythoneers
who staked their endorsement on it at the debating table in their companies.
I'd like to say that GAE is an exciting and great idea,
and I expect it to succeed in both technology and business.
Our company is quite willing to pay for an industry-level GAE.
Thanks for all the effort the GAE team has put in.
The idea behind GAE will reshape the hosting industry once it matures.
I do believe that the GAE team can tackle it soon.
Kan-ba-de. ("Cheer on" in Japanese)
 
Iap
 
2009/7/3 Jeff S (Google) j...@google.com