Javaland scheduler behavior

Jeff Schnitzer

Feb 28, 2012, 12:33:14 PM
to Google App Engine
There's been a lot of discussion of the scheduler behavior in Pythonland, but not much about its "eccentricities" in Javaland.

I have a threadsafe=true Java app.  Let's say every request completes in exactly 1s.  Settings are:  idle instances min 1 max 1, latency auto/auto.    Here is what I expect:

 * Instance1 starts up and becomes permanently resident
 * Instance1 serves concurrent requests up to some arbitrary CPU capacity
 * When Instance1 exceeds capacity:
     * Instance2 starts warming up
     * All requests remain in the pending queue for Instance1, getting processed at 1/s * concurrency
     * Instance2 is ready and starts processing new requests, sharing the load with Instance1

What I actually see (as far as I can determine):

 * Instance1 starts up and becomes permanently resident
 * Instance1 supports almost no concurrency.  At most it's 2.  (no, my app is not particularly compute intensive)
 * A new request comes in which for some reason can't be handled by Instance1:
     * Instance2 starts warming up
     * The new request is blocked on Instance2's pending queue, waiting 10-20s for Instance2 to be ready
     * In the meantime, Instance1 is actually idle
 * Another new request comes in and starts up Instance3
     * Possibly this is while Instance2 is warming up
     * AFAICT, Instance1 is taking a coffee break

The net result is that I have an idle website with 1 user (me) clicking around and I've already gotten multiple 20s pauses and three instances.  Something is seriously wrong here.  Whether or not it's rational to have so many instances started, pending requests shouldn't be shunted to non-warmed-up servers, right?

I've tried upping the min latency to a high value to see if this improves the situation.  If this works... shouldn't min latency *always* be as high as the startup time for an instance?
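
(For reference, the only scheduler-related setting that lives in appengine-web.xml itself is the threadsafe flag - a minimal sketch below, with placeholder application id and version; the idle instance and pending latency knobs are the sliders in the Admin Console.)

    <?xml version="1.0" encoding="utf-8"?>
    <appengine-web-app xmlns="http://appspot.com/ns/1.0">
      <!-- placeholder application id and version -->
      <application>my-app-id</application>
      <version>1</version>
      <!-- lets the scheduler send concurrent requests to a single instance -->
      <threadsafe>true</threadsafe>
    </appengine-web-app>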

I know it's been said before, but it needs to be said again... the guidance for scheduler configuration is really, really inadequate.

Jeff

Mos

Feb 29, 2012, 3:45:08 AM
to google-a...@googlegroups.com
Five days ago I ran JMeter tests to check the performance of my application with and without "threadsafe=true".
I'm in "no billing mode" and therefore only worked with one instance.

With concurrent requests I expected better throughput/performance with the threadsafe=true configuration.
This was not the case.  The result was the same with and without the threadsafe configuration.

I tested between 10 and 100 concurrent requests, but no difference at all.

It seems like the "threadsafe=true" configuration is currently broken.
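
If anyone wants to reproduce this outside JMeter, a quick concurrency smoke test can be done with plain Java - a rough sketch, where the target URL and request count are placeholders and should point at a lightweight dynamic handler in your own app:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ConcurrencySmokeTest {
        public static void main(String[] args) throws Exception {
            // Placeholder URL -- point this at a lightweight dynamic handler in your app.
            final String target = "http://your-app-id.appspot.com/ping";
            final int requests = 50;

            ExecutorService pool = Executors.newFixedThreadPool(requests);
            final CountDownLatch done = new CountDownLatch(requests);
            long start = System.nanoTime();

            for (int i = 0; i < requests; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            HttpURLConnection conn =
                                    (HttpURLConnection) new URL(target).openConnection();
                            conn.getResponseCode();  // blocks until the response arrives
                            conn.disconnect();
                        } catch (Exception e) {
                            e.printStackTrace();
                        } finally {
                            done.countDown();
                        }
                    }
                });
            }

            done.await();
            pool.shutdown();
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            System.out.println(requests + " concurrent requests took " + elapsedMs + " ms");
        }
    }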

Cheers


Simon Knott

Feb 29, 2012, 8:02:03 AM
to google-a...@googlegroups.com
Hi,

I thought I was going crazy - in the past week or so the behaviour you describe is exactly what I'm experiencing.  My free-tier app gets so little traffic at the moment that I was putting it down to that, but maybe it's not.

I've tested it with just a single user and I'm seeing three instances started for a single page request, which has three ajax calls on it.  What I see is the following:
  • Static page loads which fires 3 ajax calls
  • New instance gets started for the first ajax call
  • Second ajax call also starts up a new instance, but the initialisation is delayed until the first ajax call completes
  • Third ajax call also starts up a new instance, but the initialisation is delayed until the second ajax call completes
All in all, the single page is taking 20+ seconds to load and ends up with three instances.  I don't have a lot of third-party libraries, and initialisation of the server used to take 2-3 seconds, which seems to have risen to 7 or 8 seconds in the last couple of weeks.  Even once the page is loaded, another ajax call can then trigger a new instance to be started - it's bonkers. There doesn't appear to be any concurrency, or reuse of existing instances, at all.  The app is currently marked as threadsafe=true.

Cheers,
Simon

tempy

Feb 29, 2012, 9:54:41 AM
to Google App Engine
I should have looked at this thread before posting my own (http://groups.google.com/group/google-appengine/browse_thread/thread/7ff808e8243562cf#).

I'm using GAE as a sync-backend for a smallish app, so my traffic is very bursty, with long idle periods. I've been seeing exactly what you describe for the past few days, although I haven't changed my scheduler settings in many months (1 idle instance, 7s min pending latency).

Maybe they changed something so that the scheduler is now avoiding sending requests to that one idle instance at all costs. This results in requests sitting in the pending queue for those 7 seconds, and then a new instance is started - so many requests to my idle app are waiting for 10s now, while the idle instance does exactly nothing.

The fact that this behavior has been coming and going, and also that normal (non cold-start) requests have lately been much slower than normal, leads me to suspect that this isn't a deliberate change but some kind of instability in GAE that's been going on for days now. I'd love to hear some feedback as to what's happening.


Jason Collins

Mar 8, 2012, 10:53:32 PM
to Google App Engine
Jeff, I see very similar behaviour in pythonland - i.e., the resident instance gets almost no traffic. I opened an Enterprise support ticket on the topic, and I got the following response:

"the Resident instances are kept alive by the GAE scheduler for long periods in order to attend to new requests whenever there are no Dynamic instances to serve them. In this way, the request does not have to wait for the instance creation, thus, the latency of creating the instance is avoided. However, as soon as the new Dynamic instance is up and running, it starts getting requests and the Resident instance turns idle again, until the app sees more traffic than its available Dynamic instances are able to serve."

Frankly, I'm not totally sure when the resident instances actually get traffic. I flipped on a resident instance just so that I could get warmup requests back (they are only issued if you have resident instances now), but now I have an instance floating around doing very, very little work.

It just seems wrong to me.
j


Marcel Manz

Mar 10, 2012, 10:07:19 AM
to google-a...@googlegroups.com
Jeff,

I'm seeing a very similar pattern in my threadsafe-enabled Java HRD app, which does somewhere between 1 and 20 QPS.

I was experimenting with various idle instance settings and pending latencies, and could not overcome this problem unless I allowed App Engine to keep around 6 instances idle for my app (which is far too many). In my case the scheduler begins to start new instances even when instance 1 is at just around 1 QPS, and no requests (or very few) are sent to the resident instance, which is just snoozing around (0.000 QPS / 0.0 ms latency).

I've also tried changing the frontend instance class from F1 to F2 to give it double the CPU power, but this too seemed to have little, if any, effect.

As I can control where requests to my app are sent, I shut down all frontend instances and experimented with moving everything to dynamic backends, where the backend instance view reports each instance's CPU usage (why is CPU load not listed for frontend instances?). In this mode my app served mostly out of 1 dynamic B1 backend, which did not go over 20% CPU usage. Still, several times per hour the scheduler started at least 1 or 2 additional backend instances for no apparent reason.

Since there is no way to control how quickly dynamic backends get shut down (unless my app sends keep-alive requests to them), the backend scheduler very quickly shut them down again, only for another backend to be started shortly afterwards - all within the 15-minute window one additionally has to pay for after each backend shutdown. -> IMHO App Engine should make dynamic backend shutdown times configurable.

In backend mode (same instance class), my app was able to serve with either 1 or 2 instances (compared to 3-6 in frontend mode); however, the frequent starts and stops of dynamic backends made me switch back to frontend mode, as the 15-minute windows quickly accumulate costs. (I can't use permanent backends, due to unpredictable traffic spikes which require additional instances to come up.)
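
For reference, a dynamic B1 backend of the kind I experimented with is declared in backends.xml along these lines (the backend name and instance count below are just placeholders):

    <backends>
      <!-- "sync" is a placeholder name; a dynamic backend is started on demand
           and shut down by the system after it has been idle for a while -->
      <backend name="sync">
        <class>B1</class>
        <instances>2</instances>
        <options>
          <dynamic>true</dynamic>
        </options>
      </backend>
    </backends>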

Marcel

Marcel Manz

Mar 10, 2012, 10:15:13 AM
to google-a...@googlegroups.com
Additional comment:

Just while I was writing my previous message, additional frontend instances got started and the request counter on the resident instance did not change at all.

The resident instance seems to be completely useless at the moment: the scheduler started additional instances (keeping the requests pending for the start-up time) instead of directing them to the resident instance for immediate processing.

Mauricio Aristizabal

Mar 15, 2012, 2:24:00 AM
to google-a...@googlegroups.com
I'm experiencing the same issues and it's a shame: It feels like Google designed an F1 race car here, and then limited top speed to 20mph.

What good is this shiny platform, full of slick, ready-to-use APIs and the promise of automatic elasticity, if a simple page often takes 40 seconds to load, and there is little you can do about it?

Further, this doesn't seem to need a complex algorithm, just a simple rule: don't add any request to a new instance's queue until its warmup request returns (and it is marked available / added to the pool).

I really, really, hope this is a bug that will be addressed soon (and that release processes are improved so others like it are never introduced).

Either way, unfortunately between this and the last few days' issues AE is not looking very enterprise-ready at all, and I now have to consider it a risk rather than a highlight in my business plan and investor pitch.

Mauricio
Founder, 

Mauricio Aristizabal

Mar 15, 2012, 3:37:31 AM
to google-a...@googlegroups.com
Jason, how do you know you must have a resident instance to get warmup requests to work?  I don't see this anywhere in the docs. And to enable one do I just define a backend in backends.xml?

I'm pretty sure warmups were working for me before but haven't seen them happening in the logs for a while now. I don't know if I did anything on my end but I've been trying to turn them on and nothing I do works.

thanks

Jason Collins

Mar 15, 2012, 10:35:37 AM
to Google App Engine
Mauricio,

We opened an Enterprise support ticket on the topic and were told that in order to get warmup requests, you have to have min idle instances set to at least one. I believe this change occurred around 1.6.2.
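
For what it's worth, when warmup requests are issued they arrive at /_ah/warmup, so any initialisation you want done before live traffic can be hung off a servlet mapped to that path in web.xml - a rough sketch, where the servlet and class names are placeholders:

    <!-- web.xml fragment; WarmupServlet is a placeholder class whose init()/doGet()
         preloads whatever is expensive (framework registrations, caches, etc.) -->
    <servlet>
      <servlet-name>warmup</servlet-name>
      <servlet-class>com.example.WarmupServlet</servlet-class>
      <load-on-startup>1</load-on-startup>
    </servlet>
    <servlet-mapping>
      <servlet-name>warmup</servlet-name>
      <url-pattern>/_ah/warmup</url-pattern>
    </servlet-mapping>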

Unfortunately, I can't get you the exact quotes from the support person because, ironically, the Enterprise support system is down at the moment.

j

Max Völkel

Mar 15, 2012, 5:02:14 PM
to google-a...@googlegroups.com
We are also very often confused by the scheduler's behaviour and hope GAE will fix this soon.

Mauricio Aristizabal

Mar 15, 2012, 6:20:23 PM
to google-a...@googlegroups.com
Woah, it's working great for me today! Apparently because my changes to enable edge caching finally kicked in overnight (I used Brandon Wirtz's excellent instructions here).  So now I have only one dynamic instance and it doesn't seem to ever create a second one, even though the setting is auto/auto.

Which makes sense: now a new user page load results in 2 requests (HTML and an ajax call) vs 40 (including static resources).  I still see the static requests in the logs, but with a 204 instead of a 304, and taking consistently about 23ms instead of up to a couple hundred ms depending on file size. So they probably return so quickly that the queue never grows much, if at all. Interestingly, the Requests count on the instances page only increases by 2 (so it's not reporting the served 204s).

Last thing: my static cache settings filter sets a random expiry between 5 and 6.5 minutes, so even though every couple of minutes a resource will actually have to be served, it's one at a time:

        resp.setHeader("Cache-Control", "public, max-age=3" + Math.round(Math.random()*9) + "0");
        resp.setHeader("Pragma", "Public");

Hope this helps!

