How to best detect Google downtime from within a GAE app and how to handle it?


Benjamin Possolo

Oct 26, 2012, 6:18:41 PM
to google-a...@googlegroups.com
Hey all. Like most everyone else here, my application was affected by the GAE downtime this morning.

I would like to know if there is a good mechanism for detecting this sort of failure in the future so that my application can automatically throw up a Maintenance Page.

Normally I would have an admin section where I can toggle a switch to render the Maintenance Page, but since writing to the datastore was not working, this isn't really an option, especially if I can't log in to get to the admin page.
Uploading a new version of my application when a disruption like this occurs also seems impossible.

Another idea I thought of was to change my DNS CNAME record to point to a different server which hosts a maintenance page, but the time it takes for DNS changes to propagate makes this pretty impractical as well.

Does anyone have any good ideas?

I was wondering if anyone using the Capabilities API would be able to chime in on what their app was seeing during the disruption.
If the Capabilities API correctly identified the GAE state as something other than "Enabled", then I could see this being a good solution for automatically redirecting all requests.

Then again, it seems the problem was that requests were not even reaching the application servers (due to failed load balancers), so I guess that wouldn't have been a solution to the problem this morning either.
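
For the partial-outage case, where requests still reach the app but a backend service is degraded, the idea of redirecting every request could be sketched as a small WSGI wrapper. This is only a sketch under assumptions: `maintenance_middleware`, `MAINTENANCE_BODY` and the `is_write_enabled` callable are hypothetical names, not part of App Engine. On the Python runtime the callable could wrap `capabilities.CapabilitySet('datastore_v3', capabilities=['write']).is_enabled()`, but whether that call reports the right state during a given outage is exactly the open question above.

```python
# Static page served while the probe reports the backend as unavailable.
MAINTENANCE_BODY = b"<html><body><h1>Down for maintenance</h1></body></html>"


def maintenance_middleware(app, is_write_enabled):
    """Wrap a WSGI app and serve a 503 page when the probe says writes are off.

    `is_write_enabled` is a hypothetical zero-argument callable. On App Engine
    it might be (assumption):
        lambda: capabilities.CapabilitySet(
            'datastore_v3', capabilities=['write']).is_enabled()
    """
    def wrapped(environ, start_response):
        try:
            enabled = is_write_enabled()
        except Exception:
            # A probe that itself fails is treated as "service unavailable".
            enabled = False
        if enabled:
            return app(environ, start_response)
        start_response("503 Service Unavailable", [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(MAINTENANCE_BODY))),
            ("Retry-After", "300"),  # hint to clients: retry in 5 minutes
        ])
        return [MAINTENANCE_BODY]
    return wrapped
```

Since the wrapper runs inside the app, it can only help when requests actually arrive; it does nothing for an outage at the load balancer level.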

Joshua Smith

Oct 26, 2012, 7:41:06 PM
to google-a...@googlegroups.com
The only way you could have done that this time is by redirecting your DNS to somewhere completely outside Google.


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/qvOaGAiQXcMJ.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Santiago Lema

Oct 26, 2012, 7:43:48 PM
to google-a...@googlegroups.com
Any idea how you would handle requests staying in pending for 24 minutes? This happened to me today:

This server gets about 7 to 12 requests per second and usually handles them in ~200ms. But the pending time (pending_ms) ate all my quota almost instantly, because many instances were created that did nothing but wait. How should I handle this type of situation?

Jeff Schnitzer

Oct 26, 2012, 8:31:09 PM
to google-a...@googlegroups.com
You can configure an error page for all the "normal" error conditions
(out of quota, exceptions that hit the top level, other 500 errors,
etc). It wouldn't have helped you today though.*

https://developers.google.com/appengine/docs/python/config/appconfig#Custom_Error_Responses

(There is a Java equivalent, and presumably one for Go too.)

* Not entirely true. There were some moments when basic serving
infrastructure was up but some backend services weren't - I did see
our error page a few times during the outage today. But mostly not.

Jeff

Joakim

Oct 27, 2012, 12:28:44 PM
to google-a...@googlegroups.com
You could probably do something like this if you serve through CloudFlare, but it would be much nicer if we could set a page for this in the app config. I'll file a feature request tomorrow, unless someone else wants to or has already done so. Suggestions for specifications would be appreciated. For example, it would probably be wise to aggressively push these error pages to some kind of edge cache, and to make implementation easier it could be limited to accepting a single HTML file per app.

Benjamin Possolo

Oct 29, 2012, 5:13:56 PM
to google-a...@googlegroups.com
On Saturday, October 27, 2012 9:28:44 AM UTC-7, Joakim wrote:
> You could probably do something like this if you serve through CloudFlare, but it would be much nicer if we could set a page for this in the app config. I'll file a feature request tomorrow, unless someone else wants to or has already done so. Suggestions for specifications would be appreciated. For example, it would probably be wise to aggressively push these error pages to some kind of edge cache, and to make implementation easier it could be limited to accepting a single HTML file per app.

I think this would be a valid request. It would be great if Google could just flick a collective switch for all of our apps to render a maintenance page when something big like this happens.
My DNS is handled by a separate hosting provider and sending a request ticket to them would just take forever.

I would be happy if I could just show a static HTML page with inline CSS (the obligatory, moderately funny 500 image would be hosted elsewhere).

Joshua Smith

Oct 29, 2012, 5:37:13 PM
to google-a...@googlegroups.com
If you read the chronology, it appears that if they had done this at the beginning, the incident would have been over two hours sooner.


Jeff Schnitzer

Oct 30, 2012, 1:05:50 AM
to google-a...@googlegroups.com
No matter where Google puts something like this, there's always some
place upstream that will cause a messy failure. It sounds like this
particular problem was pretty far upstream, and circumvented all the
usual "friendly" error messages you can get from GAE. If you're asking
for a customized error page, you already have that. The problem is
that anything customized is almost guaranteed to be deep in the
network stack and useless during upstream failures.

Joakim is right; if you want a guaranteed error message, use a proxy
(e.g. CloudFlare) or make DNS changes (be sure to have a low TTL).

Jeff
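
To illustrate the DNS route, here is a rough sketch of a watchdog that would have to run somewhere outside Google and decide when to fail over. The names `probe` and `FailoverDecider`, the threshold of three consecutive failures, and the 5-second timeout are all assumptions for illustration. The DNS update itself depends on the provider and is not shown.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, connection failures, timeouts and malformed URLs
        # all count as "unhealthy".
        return False


class FailoverDecider:
    """Track consecutive probe failures to avoid flapping on a single blip."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed value; tune to your probe interval
        self.failures = 0

    def record(self, healthy):
        """Record one probe result; return True while failover should be active."""
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.threshold
```

A cron job could call `decider.record(probe(app_url))` every minute and, when the result flips to True, point the record at the maintenance host. With a low TTL on that record, clients should pick up the change reasonably quickly.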