Need help: Pathological instancing behavior

Kristopher Giesing

unread,

Jul 31, 2012, 10:18:26 PM7/31/12

to google-a...@googlegroups.com

I've been working for some time on an app I plan to deploy. Over the development period I have done testing on the QA instance of my app (which is a separate app ID from the one I'll use in production) but have not paid serious attention to performance since the functionality wasn't yet complete.

This past weekend I did some light real-world type testing, and the performance was terrible. With QPS in the 5-10 range I was seeing latencies in the range of 10-20s. Obviously, the first place I looked was my own code, but after some easy optimizations I got my query handling down to the 200ms range, and the response times on the client were still usually very bad (1s minimum, up to 10-20s max). (Pings from the client to the appspot.com domain are pretty stable at 30ms.)

Looking further, I discovered that Google App Engine is spawning new instances of my app at a pathological rate. For example, I left the app stats console open for a while, and found that just loading the appstats URI by itself would cause a new instance to spawn! This is with zero QPS otherwise - my app was completely idle, and none of my own code was running at all.

I tried adjusting the min and max instance settings, but GAE seems to ignore them, as far as I can tell; the max is not honored, and although it does keep instances running to accommodate the min, those running instances don't prevent new ones from spawning. It's as if GAE doesn't think they really exist. I also have the min latency set to the max of 15.0s, which is far beyond the rise time I'm seeing for GAE spawning new instances.

Here's a set of logs taken from a very light load on the server, with a 2/2 instance max/min setting. I include logs from a previous slice of time so that you can see the /appstats URI also spawning new instances, but I want to emphasize that this is a continuous set of logs - there are no requests in this time window that aren't listed below. I also include a screenshot of the instance panel in the console so you can see that the instances are, in fact, being created.

2012-07-31 18:58:12.541 /_ah/channel/connected/ 200 101ms 0kb

2012-07-31 18:58:10.894 /api/chat/G:1286432802 200 132ms 0kb Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B206

2012-07-31 18:58:09.809 /_ah/channel/connected/ 200 1076ms 0kb

2012-07-31 18:58:09.107 /api/game/72001/users 200 12312ms 0kb Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B206

2012-07-31 18:58:07.725 /api/chat/G:658732847 200 7232ms 0kb Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B176

2012-07-31 18:58:00.279 /api/game/72001/users 200 8820ms 0kb Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B176

I 2012-07-31 18:58:00.279 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 18:57:56.393 /api/game/72001 200 6044ms 3kb Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B206

2012-07-31 18:57:55.920 /_ah/warmup 200 5105ms 0kb

I 2012-07-31 18:57:55.916 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 18:57:51.324 /api/game/72001 200 7151ms 3kb Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9B176

2012-07-31 18:57:49.779 /_ah/warmup 200 4629ms 0kb

I 2012-07-31 18:57:49.778 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 18:57:01.504 /_ah/warmup 200 5225ms 0kb

I 2012-07-31 18:57:01.503 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 18:13:56.599 /_ah/warmup 200 4784ms 0kb

I 2012-07-31 18:13:56.581 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 18:06:53.651 /appstats/stats 200 898ms 285kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 18:06:52.618 /appstats 302 33ms 0kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 18:06:14.538 /appstats/static/plus.gif 200 45ms 0kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 18:06:14.298 /appstats/static/app_engine_logo_sm.gif 200 4820ms 3kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 18:06:14.288 /appstats/static/appstats_js.js 200 4699ms 43kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

I 2012-07-31 18:06:14.288 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

Can anyone explain why this might be happening? I don't understand why this isn't clobbering everyone that uses GAE. With such aggressive instancing, the performance just completely falls apart.

If I can't solve this, I am faced with backing away from GAE entirely, ripping out all the API dependencies and hosting the code on my own Tomcat server somewhere. I don't really want to do that because I don't want to be in the business of server maintenance - that's one of the key reasons I went with GAE in the first place - but I can't ship a product with this behavior.

Thanks,

- Kris

Takashi Matsuo

unread,

Jul 31, 2012, 11:30:58 PM7/31/12

to google-a...@googlegroups.com

Hi Kristopher,

On Wed, Aug 1, 2012 at 11:18 AM, Kristopher Giesing
<kris.g...@gmail.com> wrote:
> I've been working for some time on an app I plan to deploy. Over the
> development period I have done testing on the QA instance of my app (which
> is a separate app ID from the one I'll use in production) but have not paid
> serious attention to performance since the functionality wasn't yet
> complete.
>
> This past weekend I did some light real-world type testing, and the
> performance was terrible. With QPS in the 5-10 range I was seeing latencies
> in the range of 10-20s. Obviously, the first place I looked was my own
> code, but after some easy optimizations I got my query handling down to the
> 200ms range, and the response times on the client were still usually very
> bad (1s minimum, up to 10-20s max). (Pings from the client to the
> appspot.com domain are pretty stable at 30ms.)

Where are you accessing your app from? Ping result is just for our
frontend facilities, so it doesn't reflect the real network latency of
your app.

>
> Looking further, I discovered that Google App Engine is spawning new
> instances of my app at a pathological rate. For example, I left the app
> stats console open for a while, and found that just loading the appstats URI
> by itself would cause a new instance to spawn! This is with zero QPS
> otherwise - my app was completely idle, and none of my own code was running
> at all.

Sorry if I miss something, but I think it's the expected behavior. If
you don't set min idle instances, your app doesn't need to have any
instances with zero QPS, right? So if there's a new request, App
Engine spins up a new instance for your application.

>
> I tried adjusting the min and max instance settings, but GAE seems to ignore
> them, as far as I can tell; the max is not honored, and although it does
> keep instances running to accommodate the min, those running instances don't
> prevent new ones from spawning. It's as if GAE doesn't think they really
> exist. I also have the min latency set to the max of 15.0s, which is far
> beyond the rise time I'm seeing for GAE spawning new instances.
>
> Here's a set of logs taken from a very light load on the server, with a 2/2
> instance max/min setting. I include logs from a previous slice of time so
> that you can see the /appstats URI also spawning new instances, but I want
> to emphasize that this is a continuous set of logs - there are no requests
> in this time window that aren't listed below. I also include a screenshot
> of the instance panel in the console so you can see that the instances are,
> in fact, being created.

I think the cause of the high latency is the user facing loading requests.

Can you try something like min idle instances=5(or 10)/max idle
instances=automatic, and see how it goes first? Currently setting 'min
idle instances' is the most reliable way to reduce user facing loading
requests. Additionally, you shouldn't set 'max idle instances' because
if you set max idle instances, App Engine will terminate your idle
instances aggressively, as a result you will see more user facing
loading requests.

> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/qErJJOQ63KYJ.
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.

--
Takashi Matsuo

Kristopher Giesing

unread,

Jul 31, 2012, 11:47:07 PM7/31/12

to google-a...@googlegroups.com

Hi Takashi,

Thanks for the prompt response. Info below...

On Tuesday, July 31, 2012 8:30:58 PM UTC-7, Takashi Matsuo (Google) wrote:

Hi Kristopher,

Where are you accessing your app from? Ping result is just for our
frontend facilities, so it doesn't reflect the real network latency of
your app.

I'm physically in Palo Alto, CA. Do you need more detail than that? I can run a traceroute if that helps.

>
> Looking further, I discovered that Google App Engine is spawning new
> instances of my app at a pathological rate. For example, I left the app
> stats console open for a while, and found that just loading the appstats URI
> by itself would cause a new instance to spawn! This is with zero QPS
> otherwise - my app was completely idle, and none of my own code was running
> at all.

Sorry if I miss something, but I think it's the expected behavior. If
you don't set min idle instances, your app doesn't need to have any
instances with zero QPS, right? So if there's a new request, App
Engine spins up a new instance for your application.

Right, sorry, this part was not very clear.

In this particular case, I had min (and max) instances set to 1. So there was (should have been) an instance already running, and from the logs it was idle.

I can set this case up again if you like and gather more details.

I think the cause of the high latency is the user facing loading requests.

Can you try something like min idle instances=5(or 10)/max idle
instances=automatic, and see how it goes first? Currently setting 'min
idle instances' is the most reliable way to reduce user facing loading
requests. Additionally, you shouldn't set 'max idle instances' because
if you set max idle instances, App Engine will terminate your idle
instances aggressively, as a result you will see more user facing
loading requests.

I can try that. That was actually the point of bumping up the min instances to 2 (whereas previously it was 1). The fact that five instances were spawned is baffling to me because there really weren't that many requests - from the logs, it looks to me as if GAE spawned three new instances after receiving the very first request, despite there being two idle instances ready to handle it.

I'm wondering if I would have better luck with a backend for this. This is an Ajax application where the HTML is locally embedded in an iOS app, so there's really no frontend in the traditional sense.

- Kris

Kristopher Giesing

unread,

Aug 1, 2012, 2:24:37 AM8/1/12

to google-a...@googlegroups.com

So, this behavior is completely reproducible.

1) Shut down all instances.

2) Set min instances to 1 (leave all other sliders at default values).

3) Save settings and wait for instance to spin up.

4) Issue one GET request from browser to a specific API endpoint. This is an API call so it is not multiplexing into additional requests the way, say, an HTML page fetch would (to get css, images, etc). It really is just one call.

5) Note that Google App Engine spins up a new instance to handle the request.

Here are the logs from this; there are no other events within two minutes of this window of time:

2012-07-31 23:08:28.045 /api/game/57002?pretty=true 200 7893ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

76.102.149.245 - kris [31/Jul/2012:23:08:28 -0700] "GET /api/game/57002?pretty=true HTTP/1.1" 200 11652 - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25" "titan-game-qa.appspot.com" ms=7893 cpu_ms=3520 api_cpu_ms=0 cpm_usd=0.099322 instance=00c61b117c77507e2cfe78a0806d0ca80b52720e

2012-07-31 23:08:27.475 /_ah/warmup 200 5873ms 0kb

0.1.0.3 - - [31/Jul/2012:23:08:27 -0700] "GET /_ah/warmup HTTP/1.1" 200 60 - - "1.360723738856412175.titan-game-qa.appspot.com" ms=5873 cpu_ms=2475 api_cpu_ms=0 cpm_usd=0.068778 loading_request=1 instance=00c61b117cdaae6145945d99c16aeee7cc0f4ad8

I 2012-07-31 23:08:27.466

This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

2012-07-31 23:07:42.842 /_ah/warmup 200 5045ms 0kb

0.1.0.3 - - [31/Jul/2012:23:07:42 -0700] "GET /_ah/warmup HTTP/1.1" 200 60 - - "1.360723738856412175.titan-game-qa.appspot.com" ms=5046 cpu_ms=2475 api_cpu_ms=0 cpm_usd=0.068778 loading_request=1 instance=00c61b117c77507e2cfe78a0806d0ca80b52720e

I 2012-07-31 23:07:42.841

This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

Spinning up 10 instances doesn't help. I refreshed this GET at somewhat irregular intervals for about thirty seconds, and GAE spun up five new instances, despite not being even close to out of headroom:

2012-07-31 23:17:02.178 /api/game/57002?pretty=true 200 4744ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:17:02.177 /api/game/57002?pretty=true 200 5770ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:59.793 /api/game/57002?pretty=true 200 1311ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:57.844 /api/game/57002?pretty=true 200 2618ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:57.843 /api/game/57002?pretty=true 200 5619ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:54.623 /api/game/57002?pretty=true 200 309ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:51.911 /_ah/warmup 200 2942ms 0kb

I 2012-07-31 23:16:51.911 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 23:16:51.622 /api/game/57002?pretty=true 200 174ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:50.915 /api/game/57002?pretty=true 200 2664ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:47.720 /api/game/57002?pretty=true 200 557ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:47.286 /_ah/warmup 200 4823ms 0kb

I 2012-07-31 23:16:47.285 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 23:16:46.377 /api/game/57002?pretty=true 200 5360ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:40.500 /api/game/57002?pretty=true 200 1828ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:37.969 /api/game/57002?pretty=true 200 512ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:37.783 /_ah/warmup 200 2523ms 0kb

I 2012-07-31 23:16:37.783 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 23:16:36.264 /api/game/57002?pretty=true 200 1715ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:05.552 /api/game/57002?pretty=true 200 6351ms 11kb Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.25 (KHTML, like Gecko) Version/6.0 Safari/536.25

2012-07-31 23:16:04.883 /_ah/warmup 200 4870ms 0kb

I 2012-07-31 23:16:04.881 This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This requ

2012-07-31 23:15:42.565 /_ah/warmup 200 4354ms 0kb

I 2012-07-31 23:15:42.564 This request caused a new process to be started for your application, and thus caused your application code to be loaded f

Here's what my instances looked like at the end.

- Kris

Kristopher Giesing

unread,

Aug 1, 2012, 2:35:50 AM8/1/12

to google-a...@googlegroups.com

The above post notwithstanding, the original problem I was having - which caused me to try to set min instances to 1 to begin with - seems to have gone away. I know this is probably going to fall into the PEBCAK bucket, but I really don't think I did anything in particular to cause the problem, so I will explain in somewhat tedious detail the circumstances surrounding the problem, even if some of them aren't relevant.

The original problem was that the client-side latency of requests was intermittently very high. I don't have many details on this because as soon as I tried to diagnose it I ran into the problem outlined above, which certainly confused my investigation.

However, now that I realize that the min instance slider is basically useless to me, I went back to testing against the auto/auto settings which I was using to begin with. I then noticed that the appstats endpoint was getting hit at times I didn't expect. To try to diagnose the problem with A/B testing I deleted all references to appstats and re-deployed my app. Sure enough, the mysteriously high latency disappeared. Confident I was on to something, I put appstats back, and the problem failed to reappear.

So now the performance of the app is in the range that I would expect. The client-side latency is around a second, which is not great by many standards, but is sort-of OK for my application and is certainly better than 5+ seconds. It's also closer to what I see inside the request handling itself - that is, when client latency is higher, so is the accumulated latency inside the request handler (from internal data manipulation etc). This gives me the lever of optimizing my own calls to avoid occasionally-expensive operations, so it's much less of an issue; prior to this, the vast majority of the latency happened outside the request handling where I don't have any control.

If the performance of GAE generally is in the range I'm seeing now, that's a shippable product in my mind. I am worried, however, that the problem was somewhere else in the system and somewhere down the line I will be hit with this problem again, only it will be in production and users will rightly complain that their experience is suffering.

So any further comment on this issue would be very welcome. I'm still considering whether a backend is a more appropriate vehicle for this functionality.

Thanks,

- Kris

Kristopher Giesing

unread,

Aug 2, 2012, 3:53:33 PM8/2/12

to google-a...@googlegroups.com

I spoke too soon. Some time after posting this, testing revealed more problems. Upon diagnosis, it appears that GAE is too aggressive at spinning up new instances even when the "auto/auto" setting is used, although the behavior is less reproducible than when a min of 1 is set.

The only fix I found for this was to switch to using a backend. With a backend I can control exactly how many instances are used. A single instance is sufficient for me and the RTT is now down to about 400-500ms (measured from the client), which is completely satisfactory.

What is *not* so satisfactory is (a) backends are more expensive than frontends, and I'm basically paying in order not to use a feature, which I find pretty annoying; and (b) this breaks my site in half and I will now have to worry about cross-domain scripting when I eventually get to the point of having an actual web interface (and not just APIs called by a native app).

Something is clearly broken about frontend instance balancing. I'm guessing it's only broken at the lower end of the traffic scale or no one would be using GAE, but it's broken nonetheless. You could argue that GAE is built to scale up, not down, but GAE balancing is currently an *order of magnitude* worse than a naive solution for this problem domain; so even though the pervasiveness might be low, the severity is very high. IMHO this needs to be fixed.

- Kris

Kristopher Giesing

unread,

Aug 2, 2012, 4:03:16 PM8/2/12

to google-a...@googlegroups.com

PS. I once again encountered evidence that appstats is causing requests to fire even when I'm not actively using the management console, which is very weird. I've turned off appstats because I didn't really find it all that useful anyway (and because it causes the dev server not to start when you have backends enabled).

Jorge Amat

unread,

Oct 11, 2013, 10:50:24 AM10/11/13

to google-a...@googlegroups.com

Any news on this? I'm having similar issues nowadays and can't believe this is not working properly yet.

Jorge Amat

unread,

Oct 11, 2013, 11:00:42 AM10/11/13

to google-a...@googlegroups.com

Sometimes happens to me that when I set the min idle instance to 1, after some minutes it's again set as automatic.without any consent.

Other thing I have tried is to pause the defauls task queue, since suddently starts to make a lot of requests, however doesn't look to be the best solution for the non justified increase of instance hours and consequently money expenses for users.

Reply all

Reply to author

Forward