Weird latency

UseShots

unread,

Jan 19, 2012, 4:12:08 PM1/19/12

to google-a...@googlegroups.com

Hi,

In January, I began to notice short periods (about an hour or so) of increased latency.

New instances take about a minute to load. The pages only load simple django patterns. No database operations. Normally, new instances take about 4 seconds to load. So the minute load time looks completely unexplainable.

But today I see even more weird load time. Here are just a few such entries from my log:

0.0.0.0 - - [19/Jan/2012:12:27:28 -0800] "GET / HTTP/1.1" 500 0 -

           "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10 ( .NET CLR 3.5.30729)" "example.com"

           ms=680991 cpu_ms=537 api_cpu_ms=0 cpm_usd=0.014935 loading_request=1 pending_ms=4399 exit_code=203 instance=00c61b117c003720d56f8afc2e1e5cbfb8b5ab

This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)

0.0.0.0 - - [19/Jan/2012:12:26:57 -0800] "GET / HTTP/1.1" 500 0 -

            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.2.25) Gecko/20111212 Firefox/3.6.25 ( .NET CLR 3.5.30729)" "example.com"

ms=681098 cpu_ms=490 api_cpu_ms=0 cpm_usd=0.013652 loading_request=1 pending_ms=4898 exit_code=203 instance=00c61b117c383910c3d83c3e33e4b2f8511562

This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)

0.0.0.0 - - [19/Jan/2012:12:21:23 -0800] "GET / HTTP/1.1" 200 5367 -

            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10 ( .NET CLR 3.5.30729)" "example.com"

            ms=555699 cpu_ms=653 api_cpu_ms=0 cpm_usd=0.018776 loading_request=1 pending_ms=4014 instance=00c61b117ca9c15eb45f2a94e3585a92dab6ce00

This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

Note, the last entry is for successful request (response code 200) that took 555699 ms! (9 minutes?)

My app settings:

Max Idle Instances: ( 1 )
Min Pending Latency: ( Automatic )
(I'm trying to operate using only free quota)

My app normally has 0.2 requests/second and 1 active instance. Most requests complete within 1 second, the rest within 10 seconds.

During the weird latency periods I can see the following stats (right now):
0.1 requests/second and 21(!) active instances (during the peak) - what's wrong?

What can contribute to these 10+ minutes load times? How is this possible?

UseShots

unread,

Jan 19, 2012, 5:45:45 PM1/19/12

to google-a...@googlegroups.com

Hey!

Those instances have exhausted my free quota. Now every request spins up a new instance and every request takes ~10 minutes (according the logs)

2012-01-19 14:32:10.764 /request 200 602229ms 3kb Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; ....)
2012-01-19 14:32:10.546 / 500 680188ms 0kb Mozilla/5.0 (Windows NT 6.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1
2012-01-19 14:31:16.876 /report/ 500 684346ms 0kb Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.75 Safari/535.7
2012-01-19 14:29:45.943 /report/ 200 663112ms 3kb Mozilla/5.0 (X11; U; Linux x86_64; en-CA; rv:1.9.2.24) ...

What's going on?

Why app engine can't load pages that normally take less 1-4 seconds to load? Why does it report such weird load times? Why does it spawns so many instances that do nothing but exhaust my quota?

Robert Kluin

unread,

Jan 19, 2012, 11:57:04 PM1/19/12

to google-a...@googlegroups.com

Hi,
Are you using the master slave or high replication datastore? This
type of issue was (is?) quite common when using the master slave
datastore.

Also, you note that pages typically take 1-4 seconds to load? That
seems very slow. Is that typical of all your pages? Have you used
appstats to profile your app?
http://code.google.com/appengine/docs/python/tools/appstats.html

Also, you said you're using django. Personally, I never suggest
someone use djnago on App Engine. It is a very heavy framework, and
it tends to further exacerbate these types of problems.

If this is a problem that just started happening and didn't
correspond to you making changes, I'd suggest filing a production
issue.
http://code.google.com/p/googleappengine/issues/entry?template=Production%20issue

Robert

> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/jhQ_2Lf3LHgJ.
>
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.

UseShots

unread,

Jan 20, 2012, 4:45:07 AM1/20/12

to google-a...@googlegroups.com

Hi Robert,

It's a three year old app that I didn't migrate to a high replication datastore (datastore use is really minimum though, I just periodically save cache values for persistence - less than 1% of requests use datastore )

>Also, you note that pages typically take 1-4 seconds to load? That seems very slow.

1-4 seconds is for new instance loads. Typical request to warm instances takes about 0.2 - 0.5 seconds. Some requests take several seconds but only because they involve urlfetch to third-party sites. And in this thread I'm not talking about those urlfetch requests (they have less that 10s timeout anyway).

The problem is with plain pages that only load django templates. I'm quite happy with their normal performance (see above), but sometimes I observe weirds periods when new instances take about a minute to load. And yesterday, the situation became even worse when GAE tried to open new instances that would just freeze for about 10 minutes (still spending my instance-hour quota). With those frozen instances, GAE tried to launch more and more new instances that would just freeze too. Luckily, that was a pretty slow part of the day and there was not many requests to my app so during the two peaks I had 15-20 active instances (where 1 instance is enough). As a result, those to peaks have almost exhausted my app's free instance-hour quota. It happened for the first time, and there were no spikes in requests (I would say the request/second ratio was below the average).

What's really interesting, is that some request would return the 200 response code after that long (~10minutes) execution times. How is this possible? I guess it was some error in computing execution times during that period as they didn't correlate to the times I saw for my own debug messages: I have debug messages in the beginning of requests and in the very end - the difference between the start and end times is approximately the same as the reported execution time. But during those periods of increased latency, there is no correlation between my calculated time and the reported execution time - sometimes they differ by orders of magnitude. Moreover, sometimes I noticed that the reported execution time is longer that the observed execution time. And this is really strange as that time counts toward the instance-hour quota (as far as I can tell)

I've attached some screenshots to show the "request/second", "instances" and "milliseconds/request" charts.

requests.jpg

instances.jpg

milliseconds.jpg

Robert Kluin

unread,

Jan 31, 2012, 1:30:25 AM1/31/12

to google-a...@googlegroups.com

Hi,
This has generally beed a recurring issue with both master-slave
apps and django. It is very possibly hanging up on your imports. You
might add some logging to verify that, but it is commonly the case.

Migrating to high-replication could help with this. Dropping django
can be a further boost.

Robert

> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit

> https://groups.google.com/d/msg/google-appengine/-/1u4WJKzDFHMJ.

UseShots

unread,

Jan 31, 2012, 11:50:17 AM1/31/12

to google-a...@googlegroups.com

Hi,

My debug logs show that imports take quite a long time. But Google
App Engine spends several seconds even before my first log entry.
The whole execution is also slowed down (even without datastore operations).

During the last 10 days, I've been testing various app designs work during the latency periods (I still see them almost every day).

1. The first thing I did was moved the home page request (acccountable for ~30% of app regquests) from Django to WebApp with Django template (different request handler files in app.yaml)

The WebApp w/Django requests took roughly 3 times as little to execute as normal Django requests during the high latency periods. This still meant 50-100 seconds per loading request.

2. I moved another frequent request (also ~30% of all requests) to a new request handler in a separate file. This time it was WebApp without Django templates. This request just calculates an MD5 hash and sets a cookie. Very simple.

During the high latency periods it took about 25 seconds to load. Faster than requests with Django dependencies, but still too slow.

3. Converted the WebApp request handler to a pure CGI handler with no WebApp dependencies. The only imports are os, sys, datetime and hashlib

During the high latency periods, this simple request takes 15-20 seconds to execute. How can this be explained?

Now I have 4 different requests handlers that use
1. Django 2. WebApp w/Django, 3. WebApp without Django, 4. CGI
and can clearly see how Django contributes to slow execution times. Here are the average executions times (ms - pending_ms) for such types of requests during the recent high latency period.

Django : 132-300 seconds
WebApp w/Django : 22-50 seconds
WebApp without Django: 10 seconds (cron task, one dataset put())
Pure CGI : 17-18 seconds

Normally all requests take less than 1 second (less than 0.1 second for WebApp and CGI) and their loading requests take less than 4 seconds even for Django.

So, Django is a significant overhead during the high latency periods. But even simple pure CGI requests load unreasonably
sloooow (15-20 seconds for requests that only set a few headers and write about 20 bytes of data).

I don't like the idea of not using Django. The project will be hard to maintain. At some point I may consider moving to a different framework, say WebApp2 + Jinja2 or mako (by the way, any recommendations?), but this won't solve the whole high latency issue. Now I think about migrating to HRD. Not sure it will help though.

P.S. Anyway, moving 60% of requests out of Django, helped reduce the number of instances that App Engine loaded during the high latency periods. Now the billing history pages show less instance/hours than I had before those high latency periods :)

UseShots

unread,

Jan 31, 2012, 12:25:44 PM1/31/12

to google-a...@googlegroups.com

Quick update.
Another high latency period just a few minutes ago. One of those simple pure CGI no Django, no WebApp requests took 33 seconds to load. A few other such requests - from 10 to 20 seconds. (Djanjo web pages took about 200 seconds to load). Bad :((

Kenneth

unread,

Jan 31, 2012, 12:32:19 PM1/31/12

to google-a...@googlegroups.com

Not what you want to hear, but Google has totally given up on the MS datastore, so you're totally out of luck unless you move to HR, in which case you're slightly less out of luck.

Dave Loomer

unread,

Jan 31, 2012, 3:58:34 PM1/31/12

to google-a...@googlegroups.com

My ignorant question: Why are we discussing M/S vs. HRD when the OP said he isn't accessing any data in serving his page?

Robert Kluin

unread,

Feb 1, 2012, 1:46:08 AM2/1/12

to google-a...@googlegroups.com

Hi,
Some responses inline.

Robert

On Tue, Jan 31, 2012 at 11:50, UseShots <goo...@useshots.com> wrote:
> Hi,
>
> My debug logs show that imports take quite a long time. But Google
> App Engine spends several seconds even before my first log entry.
> The whole execution is also slowed down (even without datastore
> operations).

Do you see pending_ms values in the logs for these slow requests?

>
> During the last 10 days, I've been testing various app designs work during
> the latency periods (I still see them almost every day).
>
> 1. The first thing I did was moved the home page request (acccountable for
> ~30% of app regquests) from Django to WebApp with Django template
> (different request handler files in app.yaml)
>
> The WebApp w/Django requests took roughly 3 times as little to execute as
> normal Django requests during the high latency periods. This still meant
> 50-100 seconds per loading request.

Well, at least it is faster. Could the remaining delay be due to
imports in other portions of the app? In other words, once you start
importing some portion of your app, does the entire thing wind up
needing to be loaded?

>
> 2. I moved another frequent request (also ~30% of all requests) to a new
> request handler in a separate file. This time it was WebApp without Django
> templates. This request just calculates an MD5 hash and sets a cookie. Very
> simple.
>
> During the high latency periods it took about 25 seconds to load. Faster
> than requests with Django dependencies, but still too slow.
>
> 3. Converted the WebApp request handler to a pure CGI handler with no
> WebApp dependencies. The only imports are os, sys, datetime and hashlib
>
> During the high latency periods, this simple request takes 15-20 seconds
> to execute. How can this be explained?

So there are separate entries for each of these in app.yaml, correct?

>
> Now I have 4 different requests handlers that use
> 1. Django 2. WebApp w/Django, 3. WebApp without Django, 4. CGI
> and can clearly see how Django contributes to slow execution times. Here
> are the average executions times (ms - pending_ms) for such types of
> requests during the recent high latency period.
>
> Django : 132-300 seconds
> WebApp w/Django : 22-50 seconds
> WebApp without Django: 10 seconds (cron task, one dataset put())
> Pure CGI : 17-18 seconds
>
> Normally all requests take less than 1 second (less than 0.1 second for
> WebApp and CGI) and their loading requests take less than 4 seconds even for
> Django.
>
> So, Django is a significant overhead during the high latency periods. But
> even simple pure CGI requests load unreasonably
> sloooow (15-20 seconds for requests that only set a few headers and write
> about 20 bytes of data).
>
> I don't like the idea of not using Django. The project will be hard to
> maintain. At some point I may consider moving to a different framework, say
> WebApp2 + Jinja2 or mako (by the way, any recommendations?), but this won't
> solve the whole high latency issue. Now I think about migrating to HRD. Not
> sure it will help though.

I would suggest WebApp2 or flask + either Jinja2 or mako. I've been
using webapp + mako for several years on App Engine; I have no
complaints at all. I personally prefer mako syntax to jinja2's more
django-esque syntax.

I would suggest you setup a test app on HRD. It will almost certainly
eliminate the issues you're seeing.

>
> P.S. Anyway, moving 60% of requests out of Django, helped reduce the
> number of instances that App Engine loaded during the high latency periods.
> Now the billing history pages show less instance/hours than I had before
> those high latency periods :)

Well, at least something good is coming from this issue. ;)

>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit

> https://groups.google.com/d/msg/google-appengine/-/hoo6Iao5IxQJ.

Robert Kluin

unread,

Feb 1, 2012, 1:47:55 AM2/1/12

to google-a...@googlegroups.com

For one reason, iirc you're app gets loaded from the underlying
datasource. So being on M/S means you're much more susceptible to
latency spikes.

On Tue, Jan 31, 2012 at 15:58, Dave Loomer <dlo...@gmail.com> wrote:
> My ignorant question: Why are we discussing M/S vs. HRD when the OP said he
> isn't accessing any data in serving his page?
>

> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit

> https://groups.google.com/d/msg/google-appengine/-/T8WGpRtowF4J.

Brandon Wirtz

unread,

Feb 1, 2012, 2:10:48 AM2/1/12

to google-a...@googlegroups.com

Also Memcache on MS behaves differently. I don't know how but it does. I
looked at doing some stuff with MS which should be faster accept when MS is
in read only mode, and couldn't make it work because MS just isn't prime
time.

Brandon Wirtz
BlackWaterOps: President / Lead Mercenary

Work: 510-992-6548
Toll Free: 866-400-4536
IM: dra...@gmail.com (Google Talk)
Skype: drakegreene
YouTube: BlackWaterOpsDotCom