Resident instance not serving the traffic


aswath satrasala

Apr 4, 2013, 12:07:03 AM
to google-a...@googlegroups.com
Hello,
My app's Application Settings are:
Idle instances:  1 - Automatic

My app is idle; I can see in the logs that there is no traffic.
There is one Resident instance.

Now I issue a servlet request from my browser:
- New instance is created
- the request is served by the new instance

What is the purpose of Resident instance?

Google team, please suggest the appropriate settings to avoid this situation.

-Aswath

Barry Hunter

Apr 4, 2013, 7:28:03 AM
to google-appengine
- New instance is created

That is expected. It's there to serve the new traffic. 

 
- the request is served by the new instance

Are you absolutely sure of this? What evidence do you have? 

Unless you can provide real verifiable evidence, there is probably little Google can do. They need something to investigate. 

 

What is the purpose of Resident instance?

To accept the first hit, while a new dynamic instance is being spun up. 

Vinny P

Apr 4, 2013, 1:29:55 PM
to google-a...@googlegroups.com, aswath satrasala
The resident is there to absorb spikes in traffic, not to handle standard traffic operations. In an ideal world, what should be happening is that you start out with resident instances. Traffic comes in, then the scheduler kicks up a dynamic instance while simultaneously directing some traffic to resident instances to process. Once the dynamics are up, then the requests go to the dynamic instances while residents return to idle. It doesn't always work that way, but that's the general idea.
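The idealized routing described above can be sketched as a toy model. This is an illustrative assumption, not App Engine's actual scheduler: the `Instance` class, the `route` and `simulate` helpers, and the timing numbers are all hypothetical. One resident instance is ready from the start; the first request triggers a dynamic instance that needs `startup_seconds` to boot, and each request goes to a ready, free dynamic instance if one exists, otherwise to the resident.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str
    resident: bool
    ready_at: float          # time (s) at which the instance can accept requests
    busy_until: float = 0.0  # time (s) at which its current request finishes


def route(now: float, instances: List[Instance], service_seconds: float) -> Optional[Instance]:
    """Prefer a ready, free dynamic instance; fall back to a ready, free resident."""
    free = [i for i in instances if i.ready_at <= now and i.busy_until <= now]
    dynamics = [i for i in free if not i.resident]
    residents = [i for i in free if i.resident]
    chosen = dynamics[0] if dynamics else (residents[0] if residents else None)
    if chosen is not None:
        chosen.busy_until = now + service_seconds
    return chosen


def simulate(arrivals: List[float], startup_seconds: float, service_seconds: float) -> List[str]:
    instances = [Instance("resident-1", resident=True, ready_at=0.0)]
    served_by = []
    for now in arrivals:
        # Traffic has arrived: start one dynamic instance in the background
        # if none exists yet (a deliberately simplified scaling rule).
        if not any(not i.resident for i in instances):
            instances.append(Instance("dynamic-1", resident=False, ready_at=now + startup_seconds))
        chosen = route(now, instances, service_seconds)
        served_by.append(chosen.name if chosen else "queued")
    return served_by


print(simulate([1.0, 2.0, 6.0], startup_seconds=5.0, service_seconds=0.5))
# ['resident-1', 'resident-1', 'dynamic-1']
```

In this model the resident only serves while the dynamic instance boots. Aswath's report is the opposite case: the request waited for the new instance even though the resident was free.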

Can you double check your logs, and specifically look at the instance id that is recorded on each log? Sometimes it can look like the dynamic instance is handling the request, when in reality the resident is handling the request; the simultaneous processes can sometimes confuse people.
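To do this check without eyeballing every record, a small script can group requests by instance ID and list the slow ones. The record layout (`instance_id`, `path`, `latency_ms` keys) is a hypothetical export format, not an App Engine API; the values would be copied from the instance ID shown on each log entry.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def requests_per_instance(records: List[dict]) -> Dict[str, List[str]]:
    """Group request paths by the instance ID recorded on each log entry."""
    by_instance: Dict[str, List[str]] = defaultdict(list)
    for rec in records:
        by_instance[rec["instance_id"]].append(rec["path"])
    return dict(by_instance)


def slow_requests(records: List[dict], threshold_ms: float) -> List[Tuple[str, str]]:
    """Return (path, instance_id) for every request at or above threshold_ms."""
    return [(r["path"], r["instance_id"]) for r in records if r["latency_ms"] >= threshold_ms]


# Hypothetical sample records, for illustration only.
logs = [
    {"instance_id": "resident-a", "path": "/home", "latency_ms": 120},
    {"instance_id": "dynamic-b", "path": "/servlet", "latency_ms": 15000},
    {"instance_id": "dynamic-b", "path": "/home", "latency_ms": 90},
]
print(requests_per_instance(logs))  # {'resident-a': ['/home'], 'dynamic-b': ['/servlet', '/home']}
print(slow_requests(logs, threshold_ms=5000))  # [('/servlet', 'dynamic-b')]
```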


-----------------
-Vinny P
Technology & Media Advisor
Chicago, IL

@GOV on AppDotNet: https://alpha.app.net/gov

aswath satrasala

Apr 4, 2013, 10:54:30 PM
to Vinny P, google-a...@googlegroups.com
The View logs link on the Instances screen shows no logs or is broken, so I cannot find out which instance served the traffic.

-Aswath

On Thu, Apr 4, 2013 at 10:59 PM, Vinny P <vinn...@gmail.com> wrote:
The resident is there to absorb spikes in traffic, not to handle standard traffic operations. In an ideal world, what should be happening is that you start out with resident instances. Traffic comes in, then the scheduler kicks up a dynamic instance while simultaneously directing some traffic to resident instances to process. Once the dynamics are up, then the requests go to the dynamic instances while residents return to idle. It doesn't always work that way, but that's the general idea.

Can you tell me where this is documented? Why should the resident instance sit idle while the traffic is served by a dynamic instance?
What is the use of the pending latency setting in the Application Settings?
I have set it to 4s - Automatic. As I understand this setting, if a request has been waiting in the queue for less than 4s, it should be served by the resident instance. There is no need to create a new instance and get charged for the new instance hours.
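Aswath's reading of the pending latency slider can be written as a simple rule. This encodes his interpretation from the paragraph above, not a documented description of the scheduler, and the function name is hypothetical; other replies in this thread suggest the real scheduler does not always behave this way.

```python
def pending_latency_decision(queued_seconds: float, pending_latency_seconds: float) -> str:
    """Aswath's interpretation: a request that has waited less than the
    pending-latency threshold stays queued for an existing (e.g. resident)
    instance; only a longer wait justifies starting a new instance."""
    if queued_seconds < pending_latency_seconds:
        return "wait for existing instance"
    return "start new instance"


# With the slider at 4 s, as in the post above:
print(pending_latency_decision(0.5, 4.0))  # wait for existing instance
print(pending_latency_decision(4.2, 4.0))  # start new instance
```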

Cesium

Apr 5, 2013, 10:35:46 AM
to google-a...@googlegroups.com, Vinny P, aswath satrasala
aswath,

Barry is absolutely correct when he says that, "Unless you can provide real verifiable evidence, there is probably little Google can do. They need something to investigate."

However, don't waste your time on this issue. Collecting "real verifiable evidence" on the ever-changing scheduler behavior has proven (time and time again) to lead to no investigation by Google.

I repeat, don't waste your time on this issue.

My advice is to focus on the things that make you happy.

A week or two from now, without warning, this scheduler behavior will change. Weeks later it will change again, and on it goes.

David

Vinny P

Apr 5, 2013, 10:52:10 AM
to google-a...@googlegroups.com, Vinny P, aswath satrasala

 On Thursday, April 4, 2013 9:54:30 PM UTC-5, aswath wrote:
Can you tell me where this is documented?  

Resident instances are purely to absorb a spike in traffic, until dynamics can launch and deal with requests. This behavior has been thoroughly and repeatedly discussed and commented upon by this mailing list.
 
It can be difficult to understand because it's not immediately obvious what's going on. When I do presentations to corporate clients, I use a spring as an analogy to resident instances: The spring wants to be "idle", i.e. it doesn't want to be compressed. You can compress a spring, but it just fights harder and harder and pushes back against you. Once you remove the pressure, the spring returns to its idle, uncompressed state. Same thing with resident instances: you can push a spike of traffic towards them, and they'll handle it, but the scheduler is going to launch more and more dynamic instances to handle the inflow of traffic; the scheduler wants the resident instances to "spring back/uncompress" and be idle. (This explanation goes over great with my clients, since many of them hold engineering/science degrees of some sort and love the in-real-life example. Bonus points if I actually hand out free springs for them to toy around with while I carry out my presentation!)
 
Here are some links: one from the official documentation and two answers from StackOverflow. You can search this group's archives for more.
 
Excerpt: Under regular conditions, [the scheduler] may spin up new idle instances to absorb traffic and minimize latency in the event of a sudden load spike (idle instances is a synonym for resident instances)
 
http://stackoverflow.com/a/14770848 
Excerpt: resident instances are for times when no other instances (e.g. dynamic) are available (busy, or none started). They are the buffer between full utilization and new (dynamic) instances becoming available. 
 
Excerpt: Idle instances are "reserve" instances so that when an increase in traffic happens they are immediately available. So you need to have a large number of idle instances only if you expect large traffic spikes, and only if you want to keep the same latency.
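As a rough sizing aid for the "reserve" idea in these excerpts, Little's law (requests in flight = arrival rate × time per request) estimates how many idle instances a spike would occupy. This is a back-of-the-envelope sketch under stated assumptions (a steady spike rate, uniform request duration, a fixed number of concurrent requests per instance); the function and parameter names are hypothetical, and it says nothing about how the scheduler actually uses those instances.

```python
import math


def idle_instances_for_spike(extra_requests_per_second: float,
                             seconds_per_request: float,
                             concurrent_requests_per_instance: int = 1) -> int:
    """Idle instances needed to absorb a spike without waiting on a cold start."""
    # Little's law: average requests in flight = arrival rate * time in system.
    in_flight = extra_requests_per_second * seconds_per_request
    return math.ceil(in_flight / concurrent_requests_per_instance)


print(idle_instances_for_spike(20, 0.25))     # 5 (one request at a time per instance)
print(idle_instances_for_spike(20, 0.25, 8))  # 1 (eight concurrent requests per instance)
```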

On Thursday, April 4, 2013 9:54:30 PM UTC-5, aswath wrote:
The view logs link in the instances screen does not show logs or is broken.  Hence, you cannot find out which instance served the traffic.


You're looking at the wrong screen. Go to your application's dashboard, click on Logs in the left menu, and expand the individual log records by clicking the + button. Here's an image to illustrate what the request looks like (the red arrows point to the instance ID):  http://i.imgur.com/zve28uH.png

  On Thursday, April 4, 2013 9:54:30 PM UTC-5, aswath wrote:
Why should the resident instance be idle, and the traffic served by a dynamic instance, with the resident instance being idle?  
What is the use of the pending latency setting in the Application Settings?

That is incorrect. The resident instances are not sitting idle, they are doing their job of absorbing traffic spikes. If the app gets hit by a traffic spike, the residents will pop into action and take the traffic while the scheduler starts loading in dynamics.
 
Now obviously, that is what is supposed to happen in an ideal world. In real life, it doesn't always work that way. There are times when residents are idle and dynamics have to handle the traffic and requests are delayed due to that. Here's one issue that deals with that: https://code.google.com/p/googleappengine/issues/detail?id=7865
You just have to tune your app and play around with the settings and sliders in application settings.


On Friday, April 5, 2013 9:35:46 AM UTC-5, Cesium wrote:
A week or two from now, without warning, this scheduler behavior will change. Weeks later it will change again, and on it goes.
 
 +1 & Star, full agreement, etc.

Jeff Schnitzer

Apr 5, 2013, 11:24:26 AM
to Google App Engine
On Fri, Apr 5, 2013 at 10:52 AM, Vinny P <vinn...@gmail.com> wrote:
>
> Resident instances are purely to absorb a spike in traffic, until dynamics
> can launch and deal with requests. This behavior has been thoroughly and
> repeatedly discussed and commented upon by this mailing list.

The reason it keeps coming up is because there does not appear to be a
satisfactory answer :-(

I think what people are looking for is: What combination of settings
will prevent users from seeing cold starts, or at least decrease the
probability down to 4 or 5 sigma? There doesn't appear to be an
answer, not even "run ten thousand resident instances".
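For scale, "4 or 5 sigma" can be turned into a per-request probability. The sketch assumes the one-sided tail of a standard normal distribution (the two-sided figure is twice as large); `one_sided_tail` is a hypothetical helper built on the standard library's `math.erfc`. At 4 sigma that is roughly 1 request in 31,600 seeing a cold start; at 5 sigma, roughly 1 in 3.5 million.

```python
import math


def one_sided_tail(k_sigma: float) -> float:
    """P(Z > k) for a standard normal variable Z."""
    return 0.5 * math.erfc(k_sigma / math.sqrt(2))


for k in (4, 5):
    p = one_sided_tail(k)
    print(f"{k} sigma: p = {p:.3g}, about 1 request in {round(1 / p):,}")
```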

Jeff

Vinny P

Apr 5, 2013, 12:23:17 PM
to google-a...@googlegroups.com, je...@infohazard.org

On Friday, April 5, 2013 10:24:26 AM UTC-5, Jeff Schnitzer wrote:
I think what people are looking for is:  What combination of settings
will prevent users from seeing cold starts, or at least decrease the
probability down to 4 or 5 sigma?  There doesn't appear to be an
answer, not even "run ten thousand resident instances".


Believe me, I sympathize with you, Aswath, Cesium, etc. I'm a big Java on GAE guy (and so are my clients) and I frequently run head first into this issue. And of course, Java makes it particularly hard due to its startup overhead. The only thing we can do is to run continuous testing, repeated inspection of logs, A/B tests adjusting the latency slider and resident instances, etc. To be fair, I work for a large company alongside a ton of very bright people; we can afford to spend manpower to continually optimize our applications. Not everybody can do the same.

There's just no magic bullet here. As the famous Spolsky quote goes, "all abstractions are leaky". Is there room to criticize GAE? Yes, obviously, and I have a huge list of issues I would love fixed (Google, please give us incoming email/xmpp on custom domains, thanks). But I also have clients that complain to me about Heroku (especially after they were discovered lying to users about how routing worked, that day was just nonstop complaining), and other PaaS providers. If I had the solution to everything, I'd be selling it to clients. The only thing that works is continuous monitoring.

Jeff Schnitzer

Apr 5, 2013, 5:40:55 PM
to Vinny P, Google App Engine
I don't think the situation is so dire. One option is just to get rid
of autoscaling - if that means user requests don't wait for cold
starts, I'd consider it an improvement.

Another option is to take the approach of AWS Elastic Beanstalk. The
elastic load balancers never send requests to unresponsive servers,
and you can configure the system to add or remove instances based on
metrics like CPU utilization and request latency. I'm not a big fan of
EB (or at least, I wasn't 2 years ago when I last tried it) but the
autoscaling system is exactly what I want.
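The metric-driven scaling described above can be sketched as a threshold rule plus a readiness filter. This is a generic illustration, not Elastic Beanstalk's configuration format or API; all thresholds and names below are hypothetical.

```python
from typing import Dict, List


def routable(instances: Dict[str, bool]) -> List[str]:
    """Only instances whose readiness check has passed receive traffic,
    so no user request waits on a booting instance."""
    return [name for name, ready in instances.items() if ready]


def desired_instance_count(current: int, cpu_percent: float, latency_ms: float, *,
                           cpu_high: float = 70.0, cpu_low: float = 20.0,
                           latency_high_ms: float = 500.0,
                           min_instances: int = 1, max_instances: int = 10) -> int:
    """Add one instance when CPU or latency is too high, remove one when CPU is
    low, and clamp the result to [min_instances, max_instances]."""
    if cpu_percent > cpu_high or latency_ms > latency_high_ms:
        target = current + 1
    elif cpu_percent < cpu_low:
        target = current - 1
    else:
        target = current
    return max(min_instances, min(max_instances, target))


print(routable({"i-1": True, "i-2": False}))   # ['i-1']
print(desired_instance_count(2, 85.0, 120.0))  # 3
print(desired_instance_count(2, 10.0, 120.0))  # 1
```

Keeping routing and scaling separate is the point: a new instance can be added at any time, but it only joins the routable set once it reports ready.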

Jeff

Carl Schroeder

Apr 6, 2013, 12:05:39 AM
to google-a...@googlegroups.com, je...@infohazard.org
Most people know how the GAE scheduler is supposed to work. The problem is, it does not work as advertised under a number of conditions. 
Now I am absolutely certain that there is at least one configuration out there where the GAE scheduler performs exactly as advertised, and that is in Google's test lab. 

Making users wait for an instance to start before their request can be served while idle instances are standing by is a failure condition. Sure, there might be some large scale cases where this behavior is both rare and desirable among hundreds of instances serving millions of requests. Unfortunately nobody starting out on GAE would ever get to that point when every 5th pageload takes 15+ seconds to complete. The combination of aggressive recycling of instances and serving user facing requests with cold instance startups is pathological when it takes as long as Java does to start. For simple websites with one request per page-load, this is much less of an issue. For interactive sites with multiple moving parts, it can be a nightmare.
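The point about multi-request pages can be made concrete: if each request independently has probability p of landing on a cold start, a page that issues n requests hits at least one with probability 1 - (1 - p)^n. The independence assumption and the 5% figure below are illustrative, not measurements from this thread; the function name is hypothetical.

```python
def page_cold_start_probability(p_request: float, requests_per_page: int) -> float:
    """Chance that at least one of a page's requests waits on a cold start,
    assuming each request is independent."""
    return 1 - (1 - p_request) ** requests_per_page


print(round(page_cold_start_probability(0.05, 1), 3))   # 0.05
print(round(page_cold_start_probability(0.05, 10), 3))  # 0.401
```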

As far as gathering evidence:
Logs are tagged with what instance is serving them, so you can look at the request that takes 15s and see if it belonged to a resident instance.
You can also log when an instance is initializing, making identification of cold starts obvious.
You can also tell by the spinning GIF on your cursor when you access your site that your request is waiting on a Java instance to start.
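Logging when an instance initializes can be sketched in Python: module-level code runs once per process, so a flag flipped on the first request marks the cold start. The `INSTANCE_ID` environment variable is an assumption about what the runtime exposes (hence the `"unknown"` fallback), and `handle_request` is a hypothetical stand-in for whatever handler the app uses; in a Java app the equivalent would be code that runs once at instance startup.

```python
import logging
import os
import time

logging.basicConfig(level=logging.INFO)

# Module-level code runs once per instance process, i.e. during a cold start.
_STARTED_AT = time.time()
_served_first_request = False


def instance_id() -> str:
    # Assumption: the runtime exposes the instance ID as INSTANCE_ID.
    return os.environ.get("INSTANCE_ID", "unknown")


logging.info("instance %s initializing", instance_id())


def handle_request(path: str) -> bool:
    """Log the serving instance and whether this request paid for the cold start."""
    global _served_first_request
    cold_start = not _served_first_request
    _served_first_request = True
    logging.info("instance=%s path=%s cold_start=%s uptime=%.1fs",
                 instance_id(), path, cold_start, time.time() - _STARTED_AT)
    return cold_start
```

The first call to `handle_request` on a fresh instance logs `cold_start=True`; every later call on that instance logs `cold_start=False`, so slow requests can be matched against cold starts directly in the logs.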

This is no longer an issue for our site. This bug does not exist on Go because Go instances cold start in under 100ms. We moved the complicated stuff off GAE to AWS rather than port it. We ported the simple stuff to Go from Java and now our app performs great. We have a Python stub to serve requests to the Full Text Search API and Cloud SQL, since those two services are not available on Go yet. Hopefully things improve with Go 1.1.

You have to roll with the punches.

Cesium

Apr 7, 2013, 9:32:30 PM
to google-a...@googlegroups.com
Once again, clearly stated by Carl.

I roll with the punches using a script that fetches dynamic pages to induce a large uniform load.

Speaking of large uniform loads, I gotta go shovel out the unicorn pens.
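The script itself isn't shared, so the following is only a guess at its general shape: fetch a list of dynamic pages repeatedly, a few at a time, to keep a steady load on the app. The URL list is hypothetical, and `fetcher` is injectable so the function can be exercised without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List
import urllib.request


def fetch(url: str, timeout: float = 30.0) -> int:
    """GET a URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def generate_load(urls: List[str], rounds: int, workers: int = 4,
                  fetcher: Callable[[str], int] = fetch) -> List[int]:
    """Fetch every URL `rounds` times using `workers` parallel threads.
    Results come back in submission order."""
    jobs = [url for _ in range(rounds) for url in urls]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetcher, jobs))
```

A call such as `generate_load(["https://<your-app>.appspot.com/some-dynamic-page"], rounds=50)` (placeholder URL) would issue 50 requests across four threads; the thread count and round count are the knobs for how "large" and how "uniform" the load is.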

David

Jeff Schnitzer

Apr 8, 2013, 1:16:18 AM
to Google App Engine
"You have to roll with the punches" meaning "moved the complicated
stuff off GAE to AWS" is awfully grim for the future of GAE. This
makes me very sad :-(

Jeff