Service Failure - all instances restarted simultaneously


Hamish

Jun 5, 2012, 5:29:15 AM
to Google App Engine
Hi,

It appears that all instances for our application got restarted at the
same time causing service failure for our customers.

This happened at about 8am (GMT) this morning and we can see a spike
downwards on the graphs of the traffic served.

We use the High Replication Datastore and our app id is dacloudapi.

Why would this happen? It seems very odd for "all" instances, including
the resident ones, to restart at the same time.

Can someone please look into this for me?

Thank you,
Hamish

Takashi Matsuo

Jun 5, 2012, 8:37:26 AM
to google-a...@googlegroups.com

Hi Hamish,

On Tue, Jun 5, 2012 at 6:29 PM, Hamish <hgr...@afilias.info> wrote:
> Hi,
>
> It appears that all instances for our application got restarted at the
> same time causing service failure for our customers.
>
> This happened at about 8am (GMT) this morning and we can see a spike
> downwards on the graphs of the traffic served.
>
> We use the High Replication Datastore and our app id is dacloudapi.
>
> Why would this happen? It seems very odd for "all" instances including
> the resident to restart at the same time.

This is expected behavior: we moved your application from one datacenter to another around that time. In the current system design, when that happens, your instances need to be reloaded in the new datacenter. All of your memcache content is also flushed in that case.

Although we try hard to avoid this situation as much as possible, it can happen to any application. So it is very important to keep your application fast even on the loading request in order to minimize the damage to your system during such serving changes.
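
One way to read this advice is to treat memcache strictly as an optimization, so that a full flush costs latency but never correctness. The following is a minimal sketch in Python, not App Engine's memcache API: `FakeCache` is an in-memory stand-in and `load_from_datastore` is a hypothetical placeholder for the slow, authoritative lookup.

```python
class FakeCache:
    """In-memory stand-in for a cache service (hypothetical, illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None on a miss, for example right after a flush.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def flush_all(self):
        # Simulates the flush that accompanies a datacenter move.
        self._data.clear()


def load_from_datastore(key):
    """Hypothetical slow, authoritative lookup."""
    return "value-for-" + key


def get_value(cache, key, loader=load_from_datastore):
    """Cache-aside read: a miss falls back to the loader and repopulates the cache."""
    value = cache.get(key)
    if value is None:
        value = loader(key)
        cache.set(key, value)
    return value
```

With this shape, a flush only means that the next read of each key goes to the loader once; callers never see a failure caused by an empty cache.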

-- Takashi



--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.




--
Takashi Matsuo | Developer Advocate | tma...@google.com

Cesium

Jun 5, 2012, 9:51:51 AM
to google-a...@googlegroups.com
Hi Takashi,

You wrote:
> ...it is very important to keep your application fast even on the loading request in order to minimize the damage to your system during such serving changes.

Can you refer me to some guidelines to keep my application fast even on the loading request?
I have no idea what the implications are.

David
 

Michael Hermus

Jun 5, 2012, 10:38:18 AM
to google-a...@googlegroups.com
I have been pondering the same question, since my instance start times are anywhere between 5 and 15 seconds, with the average approaching 10 seconds (GAE Java). This seems to be on the slow side; I found a few older articles with some tips on this topic:

http://www.listry.com/blog/2010/03/google-app-engine-cold-start-guide-for

http://www.small-improvements.com/app-engine-performance-tuning

The two biggest factors seem to be:

1) Reduce dependencies to a minimum (i.e., don't use heavyweight frameworks).
2) Reduce app-specific initialization activity to a minimum (i.e., don't load a ton of resources from disk or the datastore each time the app is initialized).

I don't do any app-specific initialization, and the only framework I use is Objectify. I do use a few third-party libs such as Apache Commons. Also, I am not sure whether certain issues still apply; for example, is the loading of individual class files still a big drag on cold start times?
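
Point 2 above can be illustrated with lazy initialization: defer expensive setup until the first request that actually needs it, instead of paying for it on every instance start. The thread concerns GAE Java, but the idea is language-independent. This is a minimal sketch in Python; `LazyResource` is a hypothetical helper and the factory passed to it stands for any costly setup step.

```python
import threading


class LazyResource:
    """Builds a value on first use and reuses it afterwards."""

    def __init__(self, factory):
        self._factory = factory  # Callable that performs the expensive setup.
        self._lock = threading.Lock()
        self._value = None
        self._loaded = False

    def get(self):
        if not self._loaded:
            with self._lock:
                # Re-check inside the lock so concurrent first calls build it once.
                if not self._loaded:
                    self._value = self._factory()
                    self._loaded = True
        return self._value
```

The trade-off is that the cost does not disappear: it moves from instance startup to the first request that touches the resource, which helps only if many requests never need it or can tolerate the delay.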

Cesium

Jun 5, 2012, 11:03:11 AM
to google-a...@googlegroups.com
Thanks Michael,
Me too.
I use HRD Java and Objectify.

I am seeing some crazy start times.

I think I will deploy an application that is specifically designed to monitor instance lifespan and startup times.

If you have any design ideas, let me have 'em. I'll start on it later in the week.
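
One possible design for such a monitor, sketched in Python under stated assumptions: take a timestamp when the code is first loaded on an instance, then have each request report the instance's age and whether it was the loading request. `InstanceMonitor` and its field names are hypothetical; a real deployment would write these reports to its logs or a datastore rather than just returning them.

```python
import time


class InstanceMonitor:
    """Tracks when this instance started and how many requests it has served."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.started_at = clock()  # Taken once, when the instance loads the code.
        self.request_count = 0

    def record_request(self):
        """Call at the start of each request; returns a small report dict."""
        self.request_count += 1
        return {
            "instance_age_seconds": self._clock() - self.started_at,
            "is_loading_request": self.request_count == 1,
            "request_count": self.request_count,
        }


# Created at import time, so it lives exactly as long as the instance does.
MONITOR = InstanceMonitor()
```

The age reported on the loading request only approximates startup cost, since it measures the time from code load to the first handler call, and the counter is not protected against concurrent requests. The last report seen for an instance gives a lower bound on its lifespan.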

David

Hamish

Jun 7, 2012, 11:27:52 AM
to google-a...@googlegroups.com
Hi Takashi,

Thank you for the reply.

I understand from your point of view it is expected behaviour but from ours it is not. We were not notified of any data centre move and to have our application suddenly stop responding even for a short while is not acceptable. 

Is it possible for you to do such moves more gradually? For example, serve some requests from the new location and then, once things are warmed up and working, switch all the traffic to the new location.

Are all our application instances in the same data centre? Is this always the case? I would hope there would be some kind of geographic spread to the location of instances if requests are coming in from different parts of the world.

Thanks again for your help,
Hamish





Takashi Matsuo

Jun 7, 2012, 12:24:50 PM
to google-a...@googlegroups.com

Hi Hamish,

On Fri, Jun 8, 2012 at 12:27 AM, Hamish <hgr...@afilias.info> wrote:
> Hi Takashi,
>
> Thank you for the reply.
>
> I understand from your point of view it is expected behaviour but from ours it is not. We were not notified of any data centre move and to have our application suddenly stop responding even for a short while is not acceptable.

Well, I understand how you feel, but we don't usually notify you beforehand about such a move. One of the beauties of App Engine is that you don't have to worry about those kinds of things.

In terms of the outage, if you feel it violated our SLA, please fill out the form at:
 

> Is it possible for you to do such moves more gradually? For example, serve some requests from the new location and then, once things are warmed up and working, switch all the traffic to the new location.

Thanks for the feedback. Can you file a feature request for that?


> Are all our application instances in the same data centre? Is this always the case? I would hope there would be some kind of geographic spread to the location of instances if requests are coming in from different parts of the world.

Unfortunately, I don't think I can answer these questions here, but in general, we're working hard and discussing several possibilities, including the things you're suggesting, in order to improve the developer experience.

-- Takashi
 



Jeff Schnitzer

Jun 7, 2012, 1:23:02 PM
to google-a...@googlegroups.com
On Thu, Jun 7, 2012 at 9:24 AM, Takashi Matsuo <tma...@google.com> wrote:
>
> On Fri, Jun 8, 2012 at 12:27 AM, Hamish <hgr...@afilias.info> wrote:
>>
>> Is it possible for you to do such moves more gradually? Such as serve some
>> requests from the new location and then once things are warmed up and
>> working switch all the traffic to the new location.
>
> Thanks for the feedback. Can you file a feature request for that?

I've filed one here:

http://code.google.com/p/googleappengine/issues/detail?id=7660

This situation also applies to uploading new code over the existing
default version. It causes *all* users to experience a loading
request, which is painful. We do a dozen deployments a day sometimes.

The only way around this right now is to upload to a new version,
manually warm up that version, and then switch the default. This is a
huge PITA. Seems like GAE should always warm up new instances before
shutting down old ones.
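
The three steps above could be scripted roughly as follows. This is a sketch in Python, not a tested deployment tool. It assumes `appcfg.py` is on the PATH, that the version is selected with its `-V` option, and that a specific version is reachable at `http://VERSION.APP_ID.appspot.com/`. The function names, the `/` warm-up path, and the number of warm-up requests are all hypothetical.

```python
import subprocess
import urllib.request


def deploy_commands(app_dir, version):
    """Build the appcfg.py invocations for the upload and for switching the default."""
    upload = ["appcfg.py", "-V", version, "update", app_dir]
    switch = ["appcfg.py", "-V", version, "set_default_version", app_dir]
    return upload, switch


def warmup_url(app_id, version, path="/"):
    """URL of a specific, non-default version (hostname scheme is an assumption)."""
    return "http://{0}.{1}.appspot.com{2}".format(version, app_id, path)


def deploy(app_id, app_dir, version, warmup_requests=3):
    """Upload a new version, send it a few requests, then make it the default."""
    upload, switch = deploy_commands(app_dir, version)
    subprocess.check_call(upload)
    for _ in range(warmup_requests):
        # Each request gives the new version a chance to start an instance.
        urllib.request.urlopen(warmup_url(app_id, version)).read()
    subprocess.check_call(switch)
```

A few sequential requests will typically warm only one or two instances, so this reduces rather than eliminates loading requests after the switch.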

Jeff

Hernan Liendo

Jun 7, 2012, 4:41:36 PM
to Google App Engine
+1!

really ugly days!



Brandon Thomson

Jun 7, 2012, 5:53:32 PM
to google-a...@googlegroups.com
> The only way around this right now is to upload to a new version,
> manually warm up that version, and then switch the default.  This is a
> huge PITA.

I have had this process automated since they added set_default_version to appcfg.py. Can't imagine deploying without it.

A new mode like "update_and_set_default" could be added to appcfg.py to make this easier for everybody. No appengine server code even needs to be changed.

Mahron

Jun 7, 2012, 6:30:05 PM
to Google App Engine
So when a new version is deployed, all instances shut down in the
middle of code? Is that even written somewhere?

Jeff Schnitzer

Jun 7, 2012, 7:05:35 PM
to google-a...@googlegroups.com
On Thu, Jun 7, 2012 at 3:30 PM, Mahron <gan...@xehon.com> wrote:
> So when a new version is deployed, all instances shut down in the
> middle of code? Is that even written somewhere?

Presumably all in-progress requests complete normally.

Jeff