Why are several production issues related to DeadlineExceededErrors being ignored?


Karl Rosaen

Jan 13, 2012, 11:21:44 AM
to google-a...@googlegroups.com
We've been hit by random DeadlineExceededErrors in the past 36 hours, so I filed a production ticket:

http://code.google.com/p/googleappengine/issues/detail?id=6729

It's disconcerting to see that several similar issues have been ignored over the past few weeks:

http://code.google.com/p/googleappengine/issues/detail?id=6629
http://code.google.com/p/googleappengine/issues/detail?id=6688
http://code.google.com/p/googleappengine/issues/detail?id=6701
http://code.google.com/p/googleappengine/issues/detail?id=6707
Could someone please help?

Olivier

Jan 13, 2012, 3:43:24 PM
to Google App Engine
+1

On Jan 13, 11:21 am, Karl Rosaen <kros...@gmail.com> wrote:
> We've been hit by random DeadlineExceededErrors in the past 36 hours, so I
> filed a production ticket: http://code.google.com/p/googleappengine/issues/detail?id=6729
>
> it's disconcerting to see that several similar issues have been left
> ignored over the past few weeks:
>
> http://code.google.com/p/googleappengine/issues/detail?id=6629
> http://code.google.com/p/googleappengine/issues/detail?id=6688
> http://code.google.com/p/googleappengine/issues/detail?id=6701
> http://code.google.com/p/googleappengine/issues/detail?id=6707
>
> Could someone please help?

GAEfan

Jan 13, 2012, 9:54:13 PM
to Google App Engine
+1

Alexis

Jan 14, 2012, 5:28:54 AM
to Google App Engine
+1


Brandon Wirtz

Jan 14, 2012, 5:44:16 AM
to google-a...@googlegroups.com

I can't speak for Google…

But every time I ask people what they set their pending latency and idle instances to (which is usually the problem), no one ever sends along the Instances and Requests-per-second graphs.

I'll bet money that if you up the number of idle instances, your issues will go away.

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/Fxv51vEswg4J.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Klaas Pieter Annema

Jan 14, 2012, 9:02:49 AM
to google-a...@googlegroups.com
+1

and:


- Klaas Pieter


Karl Rosaen

Jan 14, 2012, 9:26:00 AM
to google-a...@googlegroups.com
Thanks Brandon.  Many of the DeadlineExceededErrors were occurring during warmup requests; according to the stack traces, they were raised during Python import statements.  I upped the number of idle instances in an attempt to mitigate this sort of thrashing, and your advice makes sense for this case.  Our pending latency sliders are both set to 'Automatic'.

I'm attaching some graphs from the period when this was the worst:

Instances:


Requests per second:




Milliseconds per request:



This suggests that some higher-latency handlers were hit (some people were editing content), tying up the existing frontend instances, after which GAE tried to spin up some dynamic instances to serve other requests.  But during warmup there were DeadlineExceededErrors during file imports, suggesting that the dynamic instances aren't being given enough time to warm up.

Increasing the idle instances helps.  So perhaps the revised question, at least for our particular situation, is: why, under load, do the dynamic instances time out during warmup?  That seems to compound the problem: the dynamic instances aren't able to serve the requests that are backed up, leading to user-visible 500 errors and more attempts to dynamically load instances.

Does my theory have any holes?  Is relying on dynamic instances to handle spikes without 500 errors unrealistic?  I know the docs state, "A smaller number of idle Instances means your application costs less to run, but may encounter more startup latency during load spikes," but thrashing on DeadlineExceededErrors during warmup seems to indicate that dynamic instances can't be relied upon for load spikes at all right now.


Brandon Wirtz

Jan 14, 2012, 5:15:10 PM
to google-a...@googlegroups.com

Do you have a warmup handler configured in your app.yaml?

If you don't, then the new instance has to warm up and handle a request at the same time.  Specifying a warmup handler that simply initializes some variables and logs a "Warmup complete" event should fix your issue.

I don't think you have platform issues; I think you have "Google hasn't documented all the best practices" issues.
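The warmup handler Brandon describes can be sketched in a few lines.  This is a hypothetical minimal version using the standard-library wsgiref conventions instead of App Engine's webapp framework; /_ah/warmup is App Engine's actual warmup URL, but the function and module names here are illustrative:

```python
import logging


def warmup_app(environ, start_response):
    """Minimal warmup handler: do nothing expensive, just confirm startup.

    On App Engine this module would be mapped to the /_ah/warmup URL in
    app.yaml; the heavy lifting (module imports) has already happened by
    the time this function runs, so the handler itself is a near-noop.
    """
    logging.info('Warmup complete')  # the "log an event" step
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Warmup complete']
```

Any WSGI server (or App Engine's own runtime, with the appropriate app.yaml routing) can serve this; the point is that the warmup request returns quickly instead of doing application work.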


Cezary Wagner

Jan 15, 2012, 9:27:10 AM
to Google App Engine
I think this is a big problem: GAE not handling it hurts the customer experience of start-ups and low-traffic applications, and it kills traffic growth.

The first instance loads too slowly to serve the first customer; the result is a dropped request with a 500 status, and the customer goes away.  That should not happen, since it blocks traffic growth.

One workaround I know is to write a pinger that generates traffic to keep an instance alive, but that is against good architecture design; GAE should provide the same thing itself.

Another option is "always on", but that has to be paid for.  Cheaper is to write a pinger using cron (it should fire at intervals of less than 15 minutes), either on GAE or on another server.

A pinger also lets you trace availability: you can measure whether all pings succeeded.
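The cron-based pinger described above could be configured on App Engine with something like the following cron.yaml fragment.  The /ping URL and the 5-minute interval are illustrative choices, not from the thread:

```yaml
# cron.yaml -- hit a lightweight handler often enough that at least one
# instance stays warm (well under the ~15 min idle window mentioned above)
cron:
- description: keep-alive ping
  url: /ping
  schedule: every 5 minutes
```

The /ping handler itself should just return 200 and log, so each ping is cheap; the timestamps of the pings then double as the availability trace.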


Karl Rosaen

Jan 15, 2012, 4:07:35 PM
to google-a...@googlegroups.com
Hi Brandon, 

We do have a warmup handler defined.  During load, we were seeing the warmup request cause a DeadlineExceededError simply while importing code (e.g. during a Python import statement in one of the modules).

Thanks,
Karl

Brandon Wirtz

Jan 15, 2012, 6:02:06 PM
to google-a...@googlegroups.com

That would imply that your warmup takes more than 15 seconds.  The only thing your warmup should do is initialize the instance and send out a "hello world".

You may need to manage the order of your imports and make sure you don't have redundant imports.

There are some threads in the archive about speeding up your imports to avoid these errors.

For debugging this issue you should log "started import" and "finished import" so you can see which imports take the longest.
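The "started import"/"finished import" logging suggested here can be done with a small helper.  This is a sketch, not from the thread; `timed_import` is an illustrative name:

```python
import importlib
import logging
import time


def timed_import(module_name):
    """Import a module by name, logging how long the import took.

    Useful for finding which imports dominate warmup time.  Note that on
    a warm interpreter most timings will be near zero, because Python
    caches loaded modules in sys.modules after the first import.
    """
    logging.info('started import %s', module_name)
    start = time.time()
    module = importlib.import_module(module_name)
    elapsed = time.time() - start
    logging.info('finished import %s in %.3fs', module_name, elapsed)
    return module


# e.g. at the top of a handler module, instead of a bare import:
# json = timed_import('json')
```

Wrapping only the suspicious top-level imports this way keeps the overhead negligible while making the slow ones obvious in the logs.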

 

 

 


Karl Rosaen

Jan 15, 2012, 8:42:22 PM
to google-a...@googlegroups.com
Yeah, actually we do warm up memcache with the data the frontpage needs to render; under normal circumstances, performing these queries takes well under 15 seconds.  According to the docs:

You can implement handlers in this directory to perform application-specific tasks, such as pre-caching application data.

but you are suggesting that warmup handlers should only be simple no-ops.  Is this based on your experience, or are there other docs somewhere that recommend against doing application-specific data prefetching in the warmup handler?  I'll definitely give it a shot; just curious whether this is official GAE best practice or not.
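For what it's worth, one way to keep the warmup handler a near-noop while still getting the frontpage data cached is to fill the cache lazily on first use.  This is only a sketch: a plain dict stands in for memcache, and `get_frontpage_data`/`fetch` are hypothetical names, not from Karl's app:

```python
# Stand-in for memcache; on App Engine the same pattern would use
# google.appengine.api.memcache with get/set calls instead of a dict.
_cache = {}


def get_frontpage_data(fetch):
    """Return cached frontpage data, computing it on first use.

    `fetch` is the (hypothetical) expensive query function.  With this
    pattern the warmup handler can stay a near-noop: the first user
    request pays the query cost instead of the deadline-limited warmup
    request, and every later request is a cheap cache hit.
    """
    key = 'frontpage'
    if key not in _cache:
        _cache[key] = fetch()
    return _cache[key]
```

The trade-off is that one unlucky user request is slow instead of the warmup request; whether that beats pre-caching depends on how close the queries run to the warmup deadline.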

Thanks again,
Karl

Brandon Wirtz

Jan 15, 2012, 9:17:22 PM
to google-a...@googlegroups.com

Give me 35 minutes.  I shot a video about this topic; I thought that might be easier than a really long email :)

 

 


Brandon Wirtz

Jan 15, 2012, 10:01:31 PM
to google-a...@googlegroups.com

Snow is slowing down my internet… maybe a little longer than 35 minutes; YT says 45 min remaining…

Brandon Wirtz

Jan 15, 2012, 10:57:37 PM
to google-a...@googlegroups.com

http://www.youtube.com/watch?v=AJIvYBd-gnI

Hopefully this makes it clearer.

I don't think Google has a best practices guide (they should).

You don't want max pending latency to be high, or you can hit the timeout.

You don't want the warmup to do anything other than initialize variables, load imports, and say "hello world".

I ramble a bit since I didn't plan to shoot this today.

-Brandon

Alexis

Jan 16, 2012, 4:29:48 AM
to Google App Engine
Thanks for the shot!

Here are some comments:

- isn't the limit 60sec instead of 15sec?

- We indeed have the pending latency slider set to default,
but out of our 12 instances we have 3 resident ones, so the slider
will have little effect on our application's performance.

- Requests that hit a DEE typically have this signature:
ms=63498 cpu_ms=1097 api_cpu_ms=0 cpm_usd=0.030639 loading_request=1 pending_ms=373 exit_code=104
So the request did not spend much time waiting to be served, and the DEE is raised during the import phase.

- Our warmup requests, which import most of our modules and then return a simple "ok" string, take on average 2100ms to complete (min 860, max 3800).
But during DEE spikes they can also raise DEEs.

Brandon Wirtz

Jan 16, 2012, 4:57:07 AM
to google-a...@googlegroups.com
We may have to get someone from GOOG to weigh in.

Requests have a 60s time limit, but from testing it seems that initialization has a 15s limit.  I don't know if this is a bug, a feature, or a flaw in my testing.

Also, my "best practices" aren't guaranteed to fix everything; there could still be a bug.  But I was hoping that by eliminating the obvious causes we might get closer to a resolution.


Kenneth

Jan 16, 2012, 9:42:29 AM
to google-a...@googlegroups.com
Are we talking master/slave or high-replication datastore?  If we're talking master/slave, and I'm pretty certain we are, then forget it: Google isn't going to help you.  You need to migrate to HRD to avoid these problems.  My life was hell with deadline errors.  Then I bit the bullet and migrated, and I haven't had a single timeout error since.

If you are on HRD, then I'm at a loss.

Karl Rosaen

Jan 16, 2012, 10:31:44 AM
to google-a...@googlegroups.com
Brandon, thanks so much for taking the time to put together the video, very helpful.

The key insight seems to be: time spent in the queue waiting for a frontend counts toward the limit for a DeadlineExceededError.  This seems silly; user-visible latency and framework-level timeout enforcement should be decoupled in this case.  But it's a good insight, and I'm glad to better understand this behavior.  It also makes me wonder what the benefit of 'Automatic' for max pending latency would ever be - I'm going to slide mine down to ~1s.

One quibble about your advice for 'avoid importing code more than once': this shouldn't be a major issue in python unless you are importing a module from within a function that is called several times:

Although Python's interpreter is optimized to not import the same module multiple times, repeatedly executing an import statement can seriously affect performance in some circumstances.


Just wanted to clarify that one needn't fret about the same utility module being imported from two modules or anything like that.
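Karl's point is easy to demonstrate: after a module is loaded once, subsequent import statements are just a lookup in sys.modules, so the same utility module imported from many files costs essentially nothing.  A small illustrative sketch (`uses_json` is a hypothetical stand-in for a hot handler):

```python
import sys
import json  # first import: actually loads and executes the module


def uses_json():
    # Re-executed on every call, but after the first load this import is
    # only a sys.modules dictionary lookup plus a local name binding --
    # this is the "repeatedly executing an import statement" case the
    # Python performance docs warn can matter in tight loops.
    import json
    return json.dumps({'ok': True})


# A second top-level import binds the very same cached module object.
import json as json_again
assert json_again is sys.modules['json']
```

So only imports inside functions called very frequently are worth worrying about; module-level imports repeated across files are deduplicated by the interpreter.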

Karl

Karl Rosaen

Jan 16, 2012, 10:40:28 AM
to google-a...@googlegroups.com
Kenneth, 

Agreed that running on the master/slave datastore is a liability these days (which IMO is pretty irresponsible on GAE's part; I see no reason why major latency spikes can't be avoided even if there are advantages to the high-replication datastore, and we will eventually migrate.  If we must migrate, give us a deadline and fully support master/slave until then.  But I digress...).  However, the DEEs we're talking about here are happening during warmup, unrelated to datastore operations, and could affect anyone regardless of master/slave vs. HR.  For these cases I think Brandon's advice is spot on: lower your max pending latency to 1s or so.

Usually when we get hit with DEEs related to being on master / slave, I can see a corresponding spike in the system status dashboard:


and at least I know what's going on.  Actually, despite my whining, the last time we were hit by DEEs related to master/slave datastore latency spikes was over a month ago - GAE folks if you are listening and you have been trying to improve the reliability of master / slave, thank you.  I think when we were getting pounded by DEEs last week I started to freak out and pointed the finger at master / slave latency prematurely when it was really related to DEEs during warmup.

Anyways, thanks to Brandon, I also know what's going on in the case of DEEs during warmup too!

Best,
Karl

Brandon Wirtz

Jan 16, 2012, 12:58:28 PM
to google-a...@googlegroups.com

I agree multiple imports is not supposed to be a problem, but I have seen it cause issues, or seen issues resolved by not doing it.

Remember that the Google implementation of Python has its own "specialness", and what is true in traditional Python is not always quite the same in GAE land.


Alexis

Jan 16, 2012, 5:25:25 PM
to Google App Engine

Kenneth: I'm talking HRD

Cezary Wagner

Jan 16, 2012, 5:36:43 PM
to Google App Engine
Brandon,

I think the DeadlineExceededError problem is something different:

1st: GAE availability is not measured with DeadlineExceededErrors taken into account, so real availability is rather not 100%.  If it isn't monitored, it has no impact on the reported quality.  Am I wrong?
This is my data for the last week:
Day / Total checks / Outages / Failed checks / Avg. response time / Uptime
2012.01.14 26 1 1 3.152 96.154 %
2012.01.15 47 1 1 4.937 97.872 %
2012.01.16 47 4 5 3.657 89.362 %

2nd: My code, unchanged for months, has become slower, not faster.  I have optimized the code and it still doesn't run faster; on the SDK it executes in 1s-5s (excluding load time), versus the production results above.  Imports may be a factor, but why does loading sometimes succeed and sometimes fail?  It should either load quickly every time, or there is a balancing/resource problem.  Why is it RANDOM?

3rd: Instances DIE just after loading and don't survive the 15 min idle period.  An instance was loaded for 1 minute and then died.  Is that a problem on my side or a lack of balancing/resources?

4th: Maybe master/slave has an impact?

5th: The application doesn't work in my case under low traffic: EARLY DEATHS of INSTANCES and DeadlineExceededErrors kill 3%-10% of customer traffic (stats above).  I imagine GAE works somewhat better at higher traffic, with a lower error rate - maybe - but how do you reach high traffic if it doesn't work at low traffic?

Please answer these questions or propose another solution.  The GAE concept is good, but my customers' current experience is not excellent.

Brandon Wirtz

Jan 16, 2012, 6:19:01 PM
to google-a...@googlegroups.com
If it is a software/configuration issue in your setup, then it is not downtime.

If you have errors on M/S you are on your own.

Instances die early if you hit the soft memory limit. Likely if you have
load time issues you have memory usage issues as well.



Kenneth

unread,
Jan 17, 2012, 3:06:00 AM1/17/12
to google-a...@googlegroups.com
I always found that startup errors and datastore errors happened at the same time. I assume instances are getting the code from the datastore.

For reasons unknown, the master/slave datastore is broken, and Google's solution is HRD.  I was in the same mental position as you at the beginning of December: cursing Google over the POS that is the M/S datastore, filing issues and appealing to Googlers to fix it, looking at the mess that is migration and thinking "no way".  Then I just jumped before Christmas and haven't looked back.


Kenneth

Jan 17, 2012, 3:10:03 AM
to google-a...@googlegroups.com
Wow, sorry to hear that.  Is it Java or Python?  I've been on HRD now for about a month.  I have the sliders set to el cheapo mode and haven't had any timeouts at all.

My understanding is the startup timeout is 10 seconds. The total request timeout is 60 seconds.

Cezary Wagner

Jan 17, 2012, 4:37:24 AM
to Google App Engine
Brandon,

I have been doing service management/development for many years, and your explanation doesn't convince me, since it doesn't point to a valid root cause.  I think so; here is why it's not convincing.

Low availability is a fact; see the report again.  It is random for the same thing:
2012.01.14 26 1 1 3.152 96.154 %
2012.01.15 47 1 1 4.937 97.872 %
2012.01.16 48 4 5 3.699 89.583 %
2012.01.17 20 0 0 1.699 100 %

Right now it looks like it works, but when I try to open some test pages they open slowly compared to the previous week.  TODAY the customer experience is still bad.  I have done some optimization, which is visible in the times.

You said the root cause could be:
1. Configuration of the application - then why is it RANDOM?  Sometimes it loads, sometimes it doesn't; the same configuration should give the same results.

2. Instances die early if I hit the memory limit - good point, but I don't see any message about memory in the logs, and the application is well below 48 MB, so it should not die.  Why is it RANDOM?

3. Master/slave is NOT the root cause, since the DeadlineExceededError occurs before the datastore is ever touched, so it has no impact.  Do you agree?

4. If master/slave is not supported by Google, is there any announcement that it is unsupported, and from what date?

A QUICK example of a DeadlineExceededError (dying in imports):
<class 'google.appengine.runtime.DeadlineExceededError'>:
Traceback (most recent call last):
  File "/base/data/home/apps/wcinamy/1-9-9-6.356145314073492481/frontend/web/order_in_restaurant.py", line 20, in <module>
    from frontend.web.cache import restaurant_menu as cache_restaurant_menu
  File "/base/data/home/apps/wcinamy/1-9-9-6.356145314073492481/frontend/web/cache/restaurant_menu.py", line 4, in <module>
    from google.appengine.api import memcache

Cezary Wagner

Jan 17, 2012, 4:43:10 AM
to Google App Engine
Kenneth,

The timeout can't be 10s or 15s, since the DeadlineExceededErrors in my logs occur at about 60s.  I think the 10s or 15s figure is a myth, or a wrong implementation: what is the sense of allowing 60s if the instance is supposed to start in 15s?

Common timeouts are: 64608ms, 70806ms, 63093ms, 64499ms, ...

Brandon Wirtz

Jan 17, 2012, 4:47:39 AM
to google-a...@googlegroups.com
You are on M/S.  You could be doing "echo Hello World" and nothing else and still get DeadlineExceededErrors.  Maybe not quite, but if you have any imports that use memcache then you would.

M/S has zero advantage at this point.

Brandon Wirtz

Jan 17, 2012, 4:50:45 AM
to google-a...@googlegroups.com
You keep pointing to reasons you believe it isn't your code, and why we are wrong, but you aren't trying the things we point out.  If your timeouts are at 70 seconds, you are doing something wrong; it shouldn't take that long to start up.  EVER.

Also, you are on M/S, which has its own rules.  Jump into the modern world and run on HR with the rest of us happy shiny people, rather than running on the unsupported, failed experiment that M/S is.


Cezary Wagner

Jan 17, 2012, 11:03:31 AM
to Google App Engine
Brandon,

This is a test on master/slave.

I'm not claiming my code is perfect; my point is that GAE's LOAD behavior is RANDOM/UNPREDICTABLE.  Sometimes it loads quickly (5s), sometimes it doesn't load at all because of a DeadlineExceededError (70s).  That is not enough for a production system.  The problem started in the last few weeks.

I will do some import tracing and find which imports have the biggest impact, but the problem is that loading is random: sometimes fast, sometimes slow.  I downgraded one module, but GAE still works worse than before, even though it should work faster, since I added very fast caching and other optimizations.

I'm planning to move to HRD to see whether it has any impact on customer experience/availability.

See how random it is (the first time is from module start, the second is for the single module).  Only times above one second are shown; both requests had the same input, and the variance runs from 17s to 30s up to the deadline.  It is NOT STABLE:

2012-01-17 14:16:07.807 ImportLogger 1.03 1.02 - frontend.web.handlers - frontend.web.handlers.common:17
D 2012-01-17 14:16:09.310 ImportLogger 2.53 1.50 - google - frontend.web.handlers.common:28
D 2012-01-17 14:16:10.164 ImportLogger 3.38 0.53 - frontend.web.cookie - frontend.web.handlers.common:45
D 2012-01-17 14:16:11.118 ImportLogger 4.34 0.95 - frontend.web - frontend.web.handlers.common:47
D 2012-01-17 14:16:12.106 ImportLogger 5.32 0.93 - frontend.web.authentication - frontend.web.handlers.common:52
D 2012-01-17 14:16:12.239 ImportLogger 6.66 6.66 - common handler - __main__:13
D 2012-01-17 14:16:18.552 ImportLogger 12.97 6.31 - frontend.restaurant - __main__:29
D 2012-01-17 14:16:19.886 ImportLogger 1.23 1.23 - cssutils - css_inliner:8
D 2012-01-17 14:16:19.924 ImportLogger 1.30 1.28 - css_inliner - frontend.ordering.messages:24
D 2012-01-17 14:16:22.118 ImportLogger 3.50 2.13 - utilities.time - frontend.ordering.messages:39
D 2012-01-17 14:16:22.579 ImportLogger 3.97 3.96 - frontend.ordering - frontend.ordering.processing:22
D 2012-01-17 14:16:22.642 ImportLogger 17.06 4.04 - frontend.ordering - __main__:35

2012-01-17 14:15:41.905 ImportLogger 0.95 0.61 - frontend.ordering.order - __main__:33
D 2012-01-17 14:15:59.633 ImportLogger 17.32 17.32 - cssutils - css_inliner:8
D 2012-01-17 14:16:03.322 ImportLogger 21.01 3.69 - BeautifulSoup - css_inliner:13
D 2012-01-17 14:16:04.559 ImportLogger 22.25 1.24 - pynliner.soupselect - css_inliner:17
D 2012-01-17 14:16:04.559 ImportLogger 22.51 22.37 - css_inliner - frontend.ordering.messages:24
D 2012-01-17 14:16:06.902 ImportLogger 24.85 2.34 - messaging.sms.smsapipl - frontend.ordering.messages:27
D 2012-01-17 14:16:07.636 ImportLogger 25.59 0.73 - link_shortener - frontend.ordering.messages:29
D 2012-01-17 14:16:08.990 ImportLogger 26.94 0.64 - globals - frontend.ordering.messages:54
D 2012-01-17 14:16:08.990 ImportLogger 27.02 27.01 - frontend.ordering - frontend.ordering.processing:22
D 2012-01-17 14:16:10.166 ImportLogger 28.20 0.85 - frontend.ordering.order_state_controller - frontend.ordering.processing:26
D 2012-01-17 14:16:12.010 ImportLogger 30.04 1.84 - frontend.billing - frontend.ordering.processing:30
D 2012-01-17 14:16:12.010 ImportLogger 31.06 30.10 - frontend.ordering - __main__:35

Alexis

Jan 18, 2012, 5:02:26 AM
to Google App Engine
Cezary, I was in the same position in August, with our Python apps on M/S.
The DEE errors began to appear, then became more and more frequent, and finally had a high impact on our app's availability.
Our warmup requests, which only load code, were randomly taking from a few seconds to more than 30 seconds, and so triggering the DEE.

I could not figure out how it could be linked to the datastore, but the Google team only had one word to mouth: move to HRD.
No explanation at the time, and it made no sense to me, but I decided to trust them.

We migrated to HRD writing our own mapreduce, and then all went fine (except for the costs, which went higher, but hey, that was the price for a more reliable solution).

But since something like mid-December, exactly the same symptoms have begun to appear again.  Less frequent and not really impacting our availability yet, but still there.
And no official acknowledgement of this issue so far...

Cezary Wagner

Jan 19, 2012, 7:08:05 PM
to Google App Engine
Alexis,

It looks like GAE has hidden balancing/resource-shortage problems or deadlocks; both result in RANDOM problems.

I'm thinking about moving to HRD, but it looks like you have suffered the same on HRD in recent days, so it won't help until GAE stops generating RANDOM problems.  My optimizations help with response time, but not yet with availability.

Day / Total checks / Outages / Failed checks / Avg. response time / Uptime
2012.01.14 26 1 1 3.152 96.154 %
2012.01.15 47 1 1 4.937 97.872 %
2012.01.16 48 4 5 3.699 89.583 %
2012.01.17 48 3 5 2.507 89.583 %
2012.01.18 48 1 1 2.419 97.917 %
2012.01.19 47 2 2 3.143 95.745 %

Brandon Wirtz

Jan 19, 2012, 7:40:55 PM
to google-a...@googlegroups.com

I'm still calling BS on your explanation.  And since you are on M/S I can't call BS on your uptime numbers, but here are mine, from JeffProbst.com, for the last 6 months.

Some of that 1 hour was my fault, for updating code and making a misconfiguration.  And the issues on 1/11 were a DNS change we made.  In total we have had less than 30 minutes of GAE downtime in 6 months.

And to the naysayers who will say "oh, but you don't have a 'real' app": my app has dependencies on django, memcache, and the datastore, and does regular expressions.  That's 90% of what any app is going to do.  We also spin up and down a lot more than most sites, since we serve as few as 100 visitors a day on a really slow day and as many as 400k on a big day.

I believe my GAE uptime for 6 months is at 99.99%, with an average QoS that is more than acceptable (requests served in an average of 489ms with a peak of 818ms).

Here are my imports, to prove I'm a "real" app:

os, datetime, hashlib, logging, pickle, re, time, zlib, urllib, wsgiref.handlers, cgi, cachepy, urlparse, memcache, urlfetch, webapp, apiproxy_errors, template

 

 

 

Uptime

99.98%

Downtime

1h

The average downtime length is 6m 40s

Number of downtimes

9

The longest downtime was 10m on 10/25/2011 11:29:06AM and the shortest was 5m on 07/19/2011 1:04:06AM

Status | From                  | To                    | Duration
------ | --------------------- | --------------------- | ---------------
Up     | 01/13/2012 6:14:06AM  | 01/19/2012 4:22:50PM  | 6d 10h 8m 44s
Down   | 01/13/2012 6:09:06AM  | 01/13/2012 6:14:06AM  | 5m
Up     | 01/11/2012 1:54:06PM  | 01/13/2012 6:09:06AM  | 1d 16h 15m
Down   | 01/11/2012 1:49:06PM  | 01/11/2012 1:54:06PM  | 5m
Up     | 01/11/2012 1:44:06PM  | 01/11/2012 1:49:06PM  | 5m
Down   | 01/11/2012 1:39:06PM  | 01/11/2012 1:44:06PM  | 5m
Up     | 01/11/2012 1:34:06PM  | 01/11/2012 1:39:06PM  | 5m
Down   | 01/11/2012 1:24:06PM  | 01/11/2012 1:34:06PM  | 10m
Up     | 01/01/2012 11:59:08PM | 01/11/2012 1:24:06PM  | 9d 13h 24m 58s
Up     | 12/18/2011 2:19:09PM  | 01/01/2012 11:49:06PM | 14d 9h 29m 57s
Up     | 12/15/2011 9:09:17PM  | 12/18/2011 2:04:06PM  | 2d 16h 54m 49s
Up     | 12/15/2011 6:54:20AM  | 12/15/2011 8:59:06PM  | 14h 4m 46s
Up     | 12/15/2011 4:04:26AM  | 12/15/2011 6:44:06AM  | 2h 39m 40s
Up     | 12/13/2011 6:04:35PM  | 12/15/2011 3:54:06AM  | 1d 9h 49m 31s
Up     | 12/10/2011 1:39:08PM  | 12/13/2011 5:54:06PM  | 3d 4h 14m 58s
Up     | 12/09/2011 11:24:09PM | 12/10/2011 1:24:06PM  | 13h 59m 57s
Up     | 12/08/2011 7:44:07AM  | 12/09/2011 11:14:06PM | 1d 15h 29m 59s
Up     | 12/05/2011 2:39:26PM  | 12/08/2011 7:34:06AM  | 2d 16h 54m 40s
Up     | 12/04/2011 11:44:08AM | 12/05/2011 2:29:08PM  | 1d 2h 45m
Up     | 12/04/2011 3:39:06AM  | 12/04/2011 11:34:06AM | 7h 55m
Up     | 12/02/2011 3:44:12PM  | 12/04/2011 3:19:06AM  | 1d 11h 34m 54s
Up     | 12/02/2011 4:24:09AM  | 12/02/2011 3:34:06PM  | 11h 9m 57s
Up     | 12/02/2011 1:34:09AM  | 12/02/2011 4:14:06AM  | 2h 39m 57s
Up     | 12/01/2011 5:44:11AM  | 12/02/2011 1:24:06AM  | 19h 39m 55s
Up     | 11/29/2011 8:09:06PM  | 12/01/2011 5:34:07AM  | 1d 9h 25m 1s
Down   | 11/29/2011 7:59:06PM  | 11/29/2011 8:09:06PM  | 10m
Up     | 11/28/2011 6:14:09PM  | 11/29/2011 7:59:06PM  | 1d 1h 44m 57s
Up     | 11/28/2011 9:44:09AM  | 11/28/2011 6:04:06PM  | 8h 19m 57s
Up     | 11/27/2011 8:14:23AM  | 11/28/2011 9:34:06AM  | 1d 1h 19m 43s
Up     | 11/26/2011 6:44:09AM  | 11/27/2011 8:04:06AM  | 1d 1h 19m 57s
Up     | 11/18/2011 2:54:08PM  | 11/26/2011 6:34:06AM  | 7d 15h 39m 58s
Up     | 11/18/2011 8:04:18AM  | 11/18/2011 2:44:06PM  | 6h 39m 48s
Up     | 11/13/2011 6:59:13AM  | 11/18/2011 7:54:11AM  | 5d 54m 58s
Up     | 11/13/2011 6:19:06AM  | 11/13/2011 6:49:06AM  | 30m
Down   | 11/13/2011 6:14:06AM  | 11/13/2011 6:19:06AM  | 5m
Up     | 11/11/2011 4:04:09AM  | 11/13/2011 6:14:06AM  | 2d 2h 9m 57s
Up     | 11/03/2011 4:54:06PM  | 11/11/2011 3:54:06AM  | 7d 11h
Down   | 11/03/2011 4:49:06PM  | 11/03/2011 4:54:06PM  | 5m
Up     | 11/02/2011 4:19:10PM  | 11/03/2011 4:49:06PM  | 1d 29m 56s
Up     | 11/02/2011 1:29:28PM  | 11/02/2011 4:09:09PM  | 2h 39m 41s
Up     | 11/01/2011 3:44:07AM  | 11/02/2011 1:19:06PM  | 1d 9h 34m 59s
Up     | 10/30/2011 6:09:19PM  | 11/01/2011 3:29:06AM  | 1d 9h 19m 47s
Up     | 10/30/2011 12:59:07PM | 10/30/2011 5:59:06PM  | 4h 59m 59s
Up     | 10/27/2011 3:54:10PM  | 10/30/2011 12:49:06PM | 2d 20h 54m 56s
Up     | 10/26/2011 8:24:06AM  | 10/27/2011 3:44:06PM  | 1d 7h 20m
Up     | 10/25/2011 11:39:06AM | 10/26/2011 8:09:09AM  | 20h 30m 3s
Down   | 10/25/2011 11:29:06AM | 10/25/2011 11:39:06AM | 10m
Up     | 10/22/2011 2:34:16AM  | 10/25/2011 11:29:06AM | 3d 8h 54m 50s
Up     | 10/21/2011 7:59:10AM  | 10/22/2011 2:24:07AM  | 18h 24m 57s
Up     | 10/21/2011 5:14:16AM  | 10/21/2011 7:49:07AM  | 2h 34m 51s
Up     | 10/03/2011 5:09:08PM  | 10/21/2011 5:04:06AM  | 17d 11h 54m 58s
Up     | 09/20/2011 4:14:08AM  | 10/03/2011 4:59:06PM  | 13d 12h 44m 58s
Up     | 09/16/2011 3:04:11PM  | 09/20/2011 4:04:06AM  | 3d 12h 59m 55s
Up     | 09/11/2011 9:24:25PM  | 09/16/2011 2:54:07PM  | 4d 17h 29m 42s
Up     | 09/06/2011 4:29:06AM  | 09/11/2011 9:14:14PM  | 5d 16h 45m 8s
Up     | 08/15/2011 7:54:13AM  | 09/06/2011 4:14:06AM  | 21d 20h 19m 53s
Up     | 08/13/2011 7:34:08PM  | 08/15/2011 7:44:06AM  | 1d 12h 9m 58s
Up     | 08/04/2011 9:14:08AM  | 08/13/2011 7:24:06PM  | 9d 10h 9m 58s
Up     | 07/24/2011 2:44:23AM  | 08/04/2011 9:04:06AM  | 11d 6h 19m 43s
Up     | 07/19/2011 1:09:06AM  | 07/24/2011 2:34:07AM  | 5d 1h 25m 1s
Down   | 07/19/2011 1:04:06AM  | 07/19/2011 1:09:06AM  | 5m
Up     | 07/19/2011 12:00:00AM | 07/19/2011 1:04:06AM  | 1h 4m 6s

 

 

 

Here is the Home Page serve time for the past 100 days.

 

Date (daily) | Avg. Response Time
------------ | ------------------
01/19/2012   | 467 ms
01/18/2012   | 467 ms
01/17/2012   | 583 ms
01/16/2012   | 475 ms
01/15/2012   | 541 ms
01/14/2012   | 523 ms
01/13/2012   | 478 ms
01/12/2012   | 492 ms
01/11/2012   | 630 ms
01/10/2012   | 488 ms
01/09/2012   | 500 ms
01/08/2012   | 496 ms
01/07/2012   | 477 ms
01/06/2012   | 508 ms
01/05/2012   | 517 ms
01/04/2012   | 484 ms
01/03/2012   | 544 ms
01/02/2012   | 517 ms
01/01/2012   | 477 ms
12/31/2011   | 454 ms
12/30/2011   | 456 ms
12/29/2011   | 567 ms
12/28/2011   | 483 ms
12/27/2011   | 465 ms
12/26/2011   | 471 ms
12/25/2011   | 468 ms
12/24/2011   | 460 ms
12/23/2011   | 495 ms
12/22/2011   | 489 ms
12/21/2011   | 520 ms
12/20/2011   | 502 ms
12/19/2011   | 468 ms
12/18/2011   | 441 ms
12/17/2011   | 505 ms
12/16/2011   | 490 ms
12/15/2011   | 480 ms
12/14/2011   | 508 ms
12/13/2011   | 516 ms
12/12/2011   | 513 ms
12/11/2011   | 479 ms
12/10/2011   | 500 ms
12/09/2011   | 456 ms
12/08/2011   | 451 ms
12/07/2011   | 457 ms
12/06/2011   | 453 ms
12/05/2011   | 444 ms
12/04/2011   | 448 ms
12/03/2011   | 450 ms
12/02/2011   | 531 ms
12/01/2011   | 402 ms
11/30/2011   | 519 ms
11/29/2011   | 544 ms
11/28/2011   | 433 ms
11/27/2011   | 449 ms
11/26/2011   | 415 ms
11/25/2011   | 499 ms
11/24/2011   | 441 ms
11/23/2011   | 544 ms
11/22/2011   | 434 ms
11/21/2011   | 469 ms
11/20/2011   | 452 ms
11/19/2011   | 461 ms
11/18/2011   | 443 ms
11/17/2011   | 453 ms
11/16/2011   | 445 ms
11/15/2011   | 423 ms
11/14/2011   | 447 ms
11/13/2011   | 414 ms
11/12/2011   | 434 ms
11/11/2011   | 447 ms
11/10/2011   | 507 ms
11/09/2011   | 488 ms
11/08/2011   | 577 ms
11/07/2011   | 464 ms
11/06/2011   | 497 ms
11/05/2011   | 480 ms
11/04/2011   | 476 ms
11/03/2011   | 464 ms
11/02/2011   | 544 ms
11/01/2011   | 497 ms
10/31/2011   | 552 ms
10/30/2011   | 486 ms
10/29/2011   | 495 ms
10/28/2011   | 477 ms
10/27/2011   | 483 ms
10/26/2011   | 601 ms
10/25/2011   | 534 ms
10/24/2011   | 529 ms
10/23/2011   | 480 ms
10/22/2011   | 491 ms
10/21/2011   | 463 ms
10/20/2011   | 530 ms
10/19/2011   | 494 ms
10/18/2011   | 484 ms
10/17/2011   | 521 ms
10/16/2011   | 517 ms
10/15/2011   | 515 ms
10/14/2011   | 495 ms
10/13/2011   | 550 ms
10/12/2011   | 518 ms

 

 

 

 

 

 

-----Original Message-----
From: google-a...@googlegroups.com [mailto:google-a...@googlegroups.com] On Behalf Of Cezary Wagner
Sent: Thursday, January 19, 2012 4:08 PM
To: Google App Engine
Subject: [google-appengine] Re: Why are several production issues related to DeadlineExceededErrors being ignored?

 

Alexis,


Cezary Wagner

unread,
Jan 21, 2012, 2:44:40 PM1/21/12
to google-a...@googlegroups.com
Thanks for the attention - it looks like a bigger application scales very well; I want to have such results!

I still have random behavior - code speed is random for the same set of parameters. I have done telecom service monitoring and large-scale architecture design for a long time, so I think there is no chance a bug of mine is generating the random timings. I have tracked almost all import times but do not see a root cause.

I am using almost the same set of imports, plus cssutils and BeautifulSoup:
os, datetime, hashlib, logging, pickle, re, time, zlib, urllib, wsgiref.handlers, cgi, urlparse, memcache, urlfetch, webapp, template

I think I have done the best optimization I can, but the timing is still not good - excluding the 20th, since there were some planned works that day.

2012.01.20 48 10 112.558 77.083 %
2012.01.21 41 1 23.287 95.122 %

Brandon Wirtz

unread,
Jan 21, 2012, 5:37:49 PM1/21/12
to google-a...@googlegroups.com

If you think it is because you are small scale, set an instance to Always On (or 2).
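In that era, Always On and the idle-instance sliders were Admin Console settings rather than config, but a related mitigation was enabling warmup requests so a new instance initializes via /_ah/warmup before user traffic is routed to it. A minimal sketch of an app.yaml for the Python runtime of the time ("yourapp" and "main.py" are placeholder names, not from this thread):

```yaml
# Hypothetical app.yaml fragment: warmup requests let the scheduler
# hit /_ah/warmup on a fresh instance before routing real traffic,
# reducing DeadlineExceededErrors on loading requests.
application: yourapp
version: 1
runtime: python
api_version: 1

inbound_services:
- warmup

handlers:
- url: /_ah/warmup
  script: main.py
- url: /.*
  script: main.py
```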

 

From: google-a...@googlegroups.com [mailto:google-a...@googlegroups.com] On Behalf Of Cezary Wagner


Sent: Saturday, January 21, 2012 11:45 AM
To: google-a...@googlegroups.com


Cezary Wagner

unread,
Jan 21, 2012, 6:42:56 PM1/21/12
to google-a...@googlegroups.com
Sure, 'Always On' has some impact, but the instance-spawning charts have been more abnormal in the last few days - some people report similar problems on the issues list.

With large traffic you will see less impact, since more instances are ready.

I saw today that the instance chart was becoming flatter and lower - probably fewer fast instances dying, or something else ...

[attachment: Abnormal-chart.png]

Brandon Wirtz

unread,
Jan 21, 2012, 7:20:13 PM1/21/12
to google-a...@googlegroups.com

I’m going to stop commenting on this thread.

 

Your understanding of how the scheduler works is wrong. Which is common, and I can’t blame you for it, but your resistance to HR, and to making the scheduler changes, is preventing you from resolving your issues.

 

I still believe you have code issues as well, since the spin-up times implied by your imports should be nowhere near what you quote.
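One common culprit for long spin-ups is doing every heavy import at module load time, since that cost is paid on each loading request. A minimal sketch of deferring a heavy import into the handler that actually needs it (`parse_feed` is a hypothetical handler name, and `minidom` stands in for a genuinely heavy dependency like BeautifulSoup):

```python
import time

def parse_feed(markup):
    """Handler that defers a heavy import until it is actually needed.

    The import statement runs only when this code path executes; Python
    caches the module in sys.modules, so later calls pay no import cost.
    """
    from xml.dom import minidom  # deferred: not paid at instance spin-up
    return minidom.parseString(markup).documentElement.tagName

start = time.time()
tag = parse_feed("<rss></rss>")   # first call pays the one-time import
elapsed = time.time() - start
```

The trade-off is that the first request through the deferred path is slower, but cold starts for every other path get cheaper.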

 

But every one of your messages includes “I have a lot of experience with this, I can’t be wrong” type language. I can’t help someone who doesn’t want to debug the issue and only wishes to cite why it must be someone else’s fault.

 

-Brandon

 

Brandon Wirtz
BlackWaterOps: President / Lead Mercenary


Work: 510-992-6548
Toll Free: 866-400-4536

IM: dra...@gmail.com (Google Talk)
Skype: drakegreene

BlackWater Ops
