It seems "Cold Starts" are an unavoidable problem for GAE. So why not conquer them?


Tapir

Mar 13, 2012, 11:51:02 PM3/13/12
to Google App Engine
.

Tapir

Mar 14, 2012, 1:06:10 AM3/14/12
to Google App Engine
I mean "why not conquer it instead of trying to avoid it?".

If you can't conquer it, please lower the instance prices to a normal
level, by which I mean about a tenth to a fifth of the current level, so
that apps can afford more resident instances and avoid "Cold Starts".

Here is one solution: give each app one hidden resident instance. When
an app needs a new instance to handle an incoming request, GAE uses the
hidden resident instance to serve that request and spins up the new
instance at the same time. The current behavior, "open a new instance
and let the incoming request be handled only after the new instance has
warmed up", is not a good implementation.
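The proposed routing could be sketched like this (a purely hypothetical model in Python, not GAE's actual scheduler; all names and structures here are made up for illustration):

```python
# Hypothetical sketch of the "hidden resident instance" idea: when no
# warm instance is free, serve the request from the hot spare right
# away and start a new instance in the background, instead of making
# the request wait out the cold start.

def route_request(instances, spare):
    """Return the instance that should serve the incoming request.

    instances: list of dicts like {"name": str, "busy": bool, "warm": bool}
    spare: the always-warm hidden instance
    """
    for inst in instances:
        if inst["warm"] and not inst["busy"]:
            return inst  # normal case: a warm instance is free
    # No warm instance is free: start a cold instance in the background
    # (modeled by appending it) and serve from the spare immediately.
    instances.append({"name": "new", "busy": False, "warm": False})
    return spare
```

Under this model the user never waits on a cold start; the spare absorbs the request while the replacement warms up.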

On Mar 14, 11:51 am, Tapir <tapir....@gmail.com> wrote:
> .

Gopal Patel

Mar 14, 2012, 1:17:20 AM3/14/12
to google-a...@googlegroups.com
You mean always having one instance more than required? (Who is going to pay for that?) And isn't minimum idle instances the same thing?

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.


Tapir

Mar 14, 2012, 4:28:08 AM3/14/12
to Google App Engine


On Mar 14, 1:17 pm, Gopal Patel <patelgo...@gmail.com> wrote:
> You mean always having one instance more than required? (Who is going
> to pay for that?) And isn't minimum idle instances the same thing?

It is different from a normal resident instance.
It is an instance that handles requests only in the situation
"no instance is available and a new one must be created",
so that many "Cold Starts" can be avoided.

"Who pays for that?"
If it is implemented cleverly, there is not much extra cost for Google.
And please consider that the current instance prices are more than 5
times higher than a normal level.

I have said many times that the GAE instance scheduler has problems.
Sometimes, even when the normal resident instance is idle, the
scheduler will still not use it.
This means the normal resident instance is not always available to
your apps; the money you pay is sometimes wasted.

Jeff Schnitzer

Mar 14, 2012, 11:00:00 AM3/14/12
to google-a...@googlegroups.com
On Wed, Mar 14, 2012 at 4:28 AM, Tapir <tapi...@gmail.com> wrote:
>
> On Mar 14, 1:17 pm, Gopal Patel <patelgo...@gmail.com> wrote:
>> You mean always having one instance more than required? (Who is going
>> to pay for that?) And isn't minimum idle instances the same thing?
>
> It is different from a normal resident instance.
> It is an instance that handles requests only in the situation
> "no instance is available and a new one must be created",
> so that many "Cold Starts" can be avoided.

This is pretty much exactly what setting minimum idle instances does.
Requests are preferentially routed to dynamic instances rather than
resident instances.

The problem is, something in the scheduler is broken. Instead of
routing requests to the idle instance, GAE prefers to route requests
to a fresh instance, causing the user to wait while an instance warms
up. That setting is probably best described as "minimum useless
instances". Maybe somebody took the "minimum _idle_ instances" label
too literally ;-)

This is the behavior I observed a week or two ago. Hopefully it will
be fixed. Doesn't sound like it has been so far.

Jeff

Gregory D'alesandre

Mar 15, 2012, 3:54:33 AM3/15/12
to google-a...@googlegroups.com
Hey Jeff,

The way it is supposed to work with min idle instances set is:
- idle instance is warm and ready (let's call it I1)
- request comes in
- request goes to the idle instance at which point another instance is immediately spun up (let's call it I2)
- you now have 1 idle instance (I2) as well as 1 instance serving traffic (I1)

I know it might seem like we are taking the label too literally, but we are trying to maintain idle capacity for you.  The tricky part is that since we always spin up a new idle instance when an existing one begins to serve traffic, it looks like they are sitting around unused when they are in fact being used often; others just immediately take their place.  Are you sure this is not the behavior you are observing?

Thanks,

Greg

Tapir

Mar 15, 2012, 4:44:15 AM3/15/12
to Google App Engine

On Mar 15, 3:54 pm, "Gregory D'alesandre" <gr...@google.com> wrote:
> Hey Jeff,
>
> The way it is supposed to work with min idle instances set is:
> - idle instance is warm and ready (let's call it I1)
> - request comes in
> - request goes to the idle instance at which point another instance is
> immediately spun up (let's call it I2)
> - you now have 1 idle instance (I2) as well as 1 instance serving traffic
> (I1)

Do you mean this is how the current GAE instance scheduler is
implemented?
From my observations, sometimes the request will not go to the idle
instance but will instead wait many seconds to go to the newly created
instance.
I can confirm the idle instance is really idle: during the observation
period, it did not handle any new requests.
Is it a bug? Or does the machine hosting the idle resident instance
have no free CPU?
I mean: many apps are hosted on the same machine, so the resident
instances, or the share of CPU we bought, are not always available to
us?


Mark Rathwell

Mar 15, 2012, 5:03:14 AM3/15/12
to google-a...@googlegroups.com
> The way it is supposed to work with min idle instances set is:
> - idle instance is warm and ready (let's call it I1)
> - request comes in
> - request goes to the idle instance at which point another instance is
> immediately spun up (let's call it I2)
> - you now have 1 idle instance (I2) as well as 1 instance serving traffic
> (I1)

My experiences over the last month or so have been similar to Tapir's.
I have an app that I have been testing to try to duplicate the Always
On behavior in the new setup. The app has not been updated in that
time; billing is enabled; there is no datastore access or use of any
other service, just dynamically generated HTML.

First, I tried the automatic setting for idle instances, and did not
keep track, but visiting the app about once a day I noticed that many
of those requests were loading requests, taking 20-35 seconds to load,
with subsequent requests performing fine.

Next, I tried setting idle instances to 1, and visiting about once a
day, 7 out of 10 initial requests to the app were loading requests,
with subsequent requests performing fine.

Finally, I tried setting idle instances to 2, and visiting about once
a day, only 3 out of 9 initial requests were loading requests. So,
better with 2, but still not what I would consider acceptable.

If this is the expected behavior, I would suggest that it probably shouldn't be.

- Mark

Tapir

Mar 15, 2012, 5:14:39 AM3/15/12
to Google App Engine


On Mar 14, 11:00 pm, Jeff Schnitzer <j...@infohazard.org> wrote:
Yes, here is a usual case where the scheduler gets it wrong:
- there is one resident instance; it is handling a request that is
expected to finish in 0.3 seconds.
- a new request comes in; the scheduler prepares to create and
initialize a new instance, which takes about 15 seconds.
- now the scheduler does the wrong thing: it makes the new request
wait 15 seconds to be handled by the new instance.
It is TOTALLY WRONG! The scheduler should let the resident instance
handle the new request, too!

So it seems the scheduler maintains a request queue for every
instance. That is not a good implementation.
The scheduler should maintain a request queue for every app!
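The arithmetic behind this complaint is easy to check with a back-of-the-envelope model (the 0.3 s and 15 s figures are the ones quoted in the post, not measurements of GAE itself):

```python
# Model the wait a new request sees under the two queueing policies
# described above. Numbers are the hypothetical ones from the post.

COLD_START_S = 15.0   # time to create and warm a new instance
IN_FLIGHT_S = 0.3     # time left on the busy resident instance

def wait_per_instance_queue():
    # Request is pinned to the new instance's queue, so it waits out
    # the entire cold start.
    return COLD_START_S

def wait_per_app_queue():
    # Request sits in an app-wide queue and is served by whichever
    # instance frees up first: here, the resident one in 0.3 s.
    return min(COLD_START_S, IN_FLIGHT_S)
```

With a per-app queue the request waits 0.3 s instead of 15 s, a 50x difference in this example.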


Jeff Schnitzer

Mar 15, 2012, 11:23:28 AM3/15/12
to google-a...@googlegroups.com
That is not quite the behavior I observe. It largely looks like this:

- I1 is warm and ready
- the request comes in
- the request goes to I2 and blocks while it starts up
- another request comes, goes to I3 and blocks while it starts up

While it's not 100%, GAE seems to route a disturbing number of
requests (say, 1 out of 10) to not-warmed-up instances. Just
clicking around the site produces large numbers of UX delays while an
instance starts. The dashboard often shows three instances. This is
in Java with threadsafe=true and F1 instances, and the operations are
not particularly CPU-intensive; my requests do a lot of data
operations, and almost all have <1s latency.

This is with the setting at 1 min idle, 1 max idle. I've disabled
billing on my sandbox app so I can't experiment anymore - I'm not
willing to try this experiment on production. I was still seeing this
behavior a week ago. Switching to Auto/Auto, I almost never see more
than one instance and startup requests are nearly absent from the
logs.
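For reference, the settings Jeff mentions were adjusted with the Admin Console performance sliders at the time; later Java SDKs let you express the same idle-instance settings in appengine-web.xml, roughly like this (element names are from the later Modules-era configuration, so double-check them against your SDK's documentation):

```xml
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <!-- keep one warm instance in reserve, and allow at most one idle -->
  <automatic-scaling>
    <min-idle-instances>1</min-idle-instances>
    <max-idle-instances>1</max-idle-instances>
  </automatic-scaling>
</appengine-web-app>
```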

I understand that min idle instances tries to keep them idle in
reserve. The problem is that the reserve doesn't seem to get used. I
can't think of any rationale for routing a user request to an instance
startup while there is idle capacity just sitting by. Or even if it
isn't idle that second, it will be in short order - waiting 1s is
better than waiting 20s for a whole instance.

Jeff

On Thu, Mar 15, 2012 at 3:54 AM, Gregory D'alesandre <gr...@google.com> wrote:

Prashant Hegde

Mar 15, 2012, 12:41:53 PM3/15/12
to google-a...@googlegroups.com
Almost identical setup and observations. 

In addition,

We are observing that a single instance with very few warmups (spin-ups of new instances) can be achieved with min instances = auto and max = 1. Our traffic is not that high, ~1 QPS, with mostly < 1 s latency.


Thanks
Prashant

stevep

Mar 15, 2012, 2:55:56 PM3/15/12
to Google App Engine
It is also important to know whether the scheduler makes decisions
based on average app latency or on a more granular analysis of average
latency per specific call. Extending Tapir's example: if five new
requests are queued for calls that average less than 10 milliseconds,
the scheduler's spin-up decision might be different than for five
requests for calls that average > 500 ms. --stevep
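One way to make that distinction concrete (all numbers and the 1 s target are invented for illustration; nothing here is documented GAE scheduler behavior):

```python
# Sketch of a latency-aware spin-up decision: keep queueing on existing
# instances unless the expected wait to drain the queue exceeds some
# target. The target threshold is an invented knob.

def should_spin_up(queued_requests, avg_latency_s, target_wait_s=1.0):
    """Spin up a new instance only when the expected queue wait on
    existing instances would exceed the target wait."""
    expected_wait = queued_requests * avg_latency_s
    return expected_wait > target_wait_s
```

With five queued 10 ms calls the expected wait is 0.05 s, so no spin-up is needed; with five 500 ms calls it is 2.5 s, so a new instance is justified, exactly the per-call difference described above.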

Emanuele Ziglioli

Mar 15, 2012, 4:09:05 PM3/15/12
to Google App Engine
Has anyone managed to profile cold starts (I don't know whether that's
even possible) to see where they spend most of their time?
Does the length of cold starts on GAE's servers correspond to how long
they take on the development server?

I use GAE for Java, so what I write below reflects that.
I know the libraries I use rely on annotations, and that has to have
an impact. JAXB (coupled with Restlet), for example, has a terrible
cold start time.
My warmup request does exactly that: it triggers a JAXB warmup. But
lately it has been failing. Since that seems to affect mostly the
warmup request, users don't seem to be affected terribly (we have very
low traffic at this stage, though).

I wonder whether the Google engineers could implement something
similar to what Android does: there is a resident Dalvik VM (the
zygote) and all new processes fork from it.
Our prototype process could have the JVM, the web server, and all the
rest that's common. Just a thought.

E

Jeff Schnitzer

Mar 15, 2012, 5:30:04 PM3/15/12
to google-a...@googlegroups.com
It's hard to be certain, but it *seems* like the biggest delay is
reading files from whatever it is that passes for a network
filesystem. I've timed several parts of my app and the only
reasonable explanation I can come up with is that classloading is
painfully and erratically slow. This hypothesis is consistent with
one quirky observed behavior - putting all your classes in a single
jar file (instead of WEB-INF/classes) has a measurable (beneficial)
effect on application startup time.

Jeff

Kyle Finley

Mar 15, 2012, 5:33:33 PM3/15/12
to google-a...@googlegroups.com
The interesting thing about cold start time is that it seems to vary based on the AppID, at least with the Python27 runtime.

As stated here ( https://groups.google.com/forum/?fromgroups#!topic/google-appengine/W7EJrBhHEJg ) I tested the exact same application on multiple AppIDs and discovered that a new AppID started my application in less than half the time (63% faster) of an older one.

I would be interested to hear whether this is unique to my application or holds for others as well.

Claude Zervas

Mar 15, 2012, 6:12:04 PM3/15/12
to google-a...@googlegroups.com
My app uses M/S with Python 2.5. Initially I was trying to maximize cold-start performance as cheaply as possible by setting min instances = 1 and max = 2, but I was getting huge latencies because requests were routed to cold starts, which then led to over-quota errors. I then set max instances = 1, and performance more than doubled with no more over-quota problems... weird. It seems having just one always-hot instance bypasses the scheduler's odd behavior.

I think GAE is great if you have a fairly large budget, but it just isn't a viable platform for low-traffic (and low-budget) applications, given the unpredictable performance and almost zero support. The idea of automatically spinning up instances to handle traffic spikes is great, but in practice it seems to be really expensive and leads to spurious latency issues. I've been using GAE since 2009 and it's been great to use, especially when it was free, but now I'm thinking an EC2 instance with AppScale might not be such a bad thing, and possibly less of a hassle in the long run... I suspect Google would rather see apps like mine go away anyway, since they don't generate much income.

Emanuele Ziglioli

Mar 15, 2012, 6:17:36 PM3/15/12
to Google App Engine
Hi Jeff,

Have you got experience with Amazon or Heroku with regard to cold
startup time?
Is it more predictable?
Perhaps having control over it means one could implement cold starts
in a cleverer way.
Hint, hint, for anyone!

Jeff Schnitzer

Mar 15, 2012, 6:33:21 PM3/15/12
to google-a...@googlegroups.com
On Thu, Mar 15, 2012 at 6:17 PM, Emanuele Ziglioli
<the...@emanueleziglioli.it> wrote:
>
> have you got experience with Amazon or Heroku, with regards to cold
> startup time?
> Is it more predictable?

Sure, EC2 is predictable... it takes several minutes (plural) to spin
up a VM and then your app. Remember, in EC2-land you have to wait for
the OS to boot before your web application can start loading.

No thanks.

20-second cold start times are annoying but not particularly terrible,
as long as users don't have to see them. I feel confident that Goog
will eventually fix whatever it is misrouting user requests to unready
instances. In the mean time, I've found that I can get very good
performance by:

* Using Auto/Auto
* Deploying new code to a new version, hitting that new version to
warm it up, then switching default versions

I'd really like to automate this second step.
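A sketch of what that automation might look like (the appcfg.py flags and the set_default_version action are from memory of the SDK of that era, and the app id, version name, and directory are placeholders; verify everything against your own SDK before relying on it):

```python
# Build the command sequence for "deploy to a new version, warm it up,
# then flip the default". This only constructs the command strings;
# run them with your shell or subprocess once the flags are verified.

def deploy_steps(app_id, version, war_dir="war"):
    return [
        # 1. upload the code under a non-default version
        "appcfg.py -A %s -V %s update %s" % (app_id, version, war_dir),
        # 2. hit the versioned hostname so the new version warms up
        "curl -s http://%s.%s.appspot.com/" % (version, app_id),
        # 3. promote the warmed version to default
        "appcfg.py -A %s -V %s set_default_version" % (app_id, version),
    ]
```

Wrapping these three steps in a script keeps users from ever hitting a cold instance of the new version.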

Jeff

Stefano Ciccarelli

Mar 16, 2012, 2:45:20 AM3/16/12
to google-a...@googlegroups.com
I've observed the same.
We have a single jar and we disabled Jersey's class scanning, but the warmup still needs 30-35 seconds.

Two weeks ago the startup times were around 15-20 seconds; then one day (I don't remember when) the performance dropped.

-- 
Stefano Ciccarelli
Sent with Sparrow

Marcel Manz

Mar 16, 2012, 4:31:32 PM3/16/12
to google-a...@googlegroups.com
Just out of curiosity, are you using <load-on-startup> in web.xml ?

http://code.google.com/appengine/docs/java/config/appconfig.html#Using_a_load-on-startup_Servlet

Since I use this initialization I have never had any failed requests from not-yet-ready instances (it didn't change the scheduler behavior, but at least the warmup errors are gone now)
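For anyone unfamiliar with the mechanism, the web.xml wiring looks roughly like this (the servlet name and class are placeholders for your own app):

```xml
<servlet>
  <servlet-name>main</servlet-name>
  <!-- placeholder class: substitute your own servlet -->
  <servlet-class>com.example.MainServlet</servlet-class>
  <!-- initialize at instance startup instead of on the first
       user request that reaches this instance -->
  <load-on-startup>1</load-on-startup>
</servlet>
```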

Marcel

Tapir

Mar 17, 2012, 3:26:05 AM3/17/12
to Google App Engine


On Mar 17, 4:31 am, Marcel Manz <marcel.m...@gmail.com> wrote:
> Just for curiosity, are you using <load-on-startup> in web.xml ?
>
> http://code.google.com/appengine/docs/java/config/appconfig.html#Usin...
>
> Since I use this initialization I never had any failed requests from non-ready instances (this didn't change the scheduler behavior though, but at least the warmup errors are gone now)
Yes, I use it for all servlets. But I use Struts 2, so I also
register the Struts listeners in <listener></listener>.
I once enabled the _ah_warmup custom handler, but I found it made
things worse, so I have disabled it now.

>
> Marcel

Tapir

Mar 17, 2012, 3:28:13 AM3/17/12
to Google App Engine
@GAE team,
Will the current implementation let unwarmed instances handle
requests?

Tapir

Mar 17, 2012, 3:31:30 AM3/17/12
to Google App Engine
@GAE team, again:
If the _ah_warmup custom handler is enabled, will the default
algorithm be disabled? I mean, will <load-on-startup> and <listener>
be ignored?

Tapir

Mar 17, 2012, 3:35:53 AM3/17/12
to Google App Engine


On Mar 17, 4:31 am, Marcel Manz <marcel.m...@gmail.com> wrote:
> Just for curiosity, are you using <load-on-startup> in web.xml ?
>
> http://code.google.com/appengine/docs/java/config/appconfig.html#Usin...
>
>
"Since I use this initialization I never had any failed requests from
non-ready instances (this didn't change the scheduler behavior though,
but at least the warmup errors are gone now)"

How many classes and jars are in your app?

>
> Marcel