Why are several production issues related to DeadlineExceededErrors being ignored?


Karl Rosaen

Jan 13, 2012, 11:21:44 AM
to google-a...@googlegroups.com
We've been hit by random DeadlineExceededErrors in the past 36 hours, so I filed a production ticket:

http://code.google.com/p/googleappengine/issues/detail?id=6729

It's disconcerting to see that several similar issues have been ignored over the past few weeks:

http://code.google.com/p/googleappengine/issues/detail?id=6629
http://code.google.com/p/googleappengine/issues/detail?id=6688
http://code.google.com/p/googleappengine/issues/detail?id=6701
http://code.google.com/p/googleappengine/issues/detail?id=6707
Could someone please help?

Olivier

Jan 13, 2012, 3:43:24 PM
to Google App Engine
+1

On Jan 13, 11:21 am, Karl Rosaen <kros...@gmail.com> wrote:
> We've been hit by random DeadlineExceededErrors in the past 36 hours, so I
> filed a production ticket: http://code.google.com/p/googleappengine/issues/detail?id=6729
>
> it's disconcerting to see that several similar issues have been left
> ignored over the past few weeks:
>
> http://code.google.com/p/googleappengine/issues/detail?id=6629
> http://code.google.com/p/googleappengine/issues/detail?id=6688
> http://code.google.com/p/googleappengine/issues/detail?id=6701
> http://code.google.com/p/googleappengine/issues/detail?id=6707
>
> Could someone please help?

GAEfan

Jan 13, 2012, 9:54:13 PM
to Google App Engine
+1

Alexis

Jan 14, 2012, 5:28:54 AM
to Google App Engine
+1


Brandon Wirtz

Jan 14, 2012, 5:44:16 AM
to google-a...@googlegroups.com

I can't speak for Google…

But every time I ask people what they set their pending latency and idle instances to (which is usually the problem), no one ever sends along the Instances and Requests-per-second graphs.

I'll bet money that if you up the number of idle instances, your issues will go away.

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/Fxv51vEswg4J.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Klaas Pieter Annema

Jan 14, 2012, 9:02:49 AM
to google-a...@googlegroups.com
+1

and:


- Klaas Pieter


Karl Rosaen

Jan 14, 2012, 9:26:00 AM
to google-a...@googlegroups.com
Thanks Brandon.  Many of the DeadlineExceededErrors were occurring during warmup requests; according to the stack traces, they were raised during Python import statements.  I upped the number of idle instances in an attempt to mitigate this sort of thrashing, and your advice makes sense for this case.  Our pending latency sliders are both set to 'Automatic'.

I'm attaching some graphs from the period when this was the worst:

Instances:


Requests per second:




Milliseconds per request:



This suggests that some higher-latency handlers were hit (some people were editing content), tying up the existing frontend instances, after which GAE tried to spin up some dynamic instances to serve other requests.  But during warmup there were DeadlineExceededErrors during file imports, suggesting that the dynamic instances aren't being given enough time to warm up.

Increasing the idle instances helps.  So perhaps the revised question, at least for our particular situation, is: why, under load, do the dynamic instances time out during warmup?  That seems to compound the problem: the dynamic instances aren't able to serve the requests that are backed up, leading to user-visible 500 errors and more attempts to dynamically load instances.

Does my theory have any holes?  Is relying on dynamic instances to handle spikes without 500 errors unrealistic?  I know the docs state, "A smaller number of idle Instances means your application costs less to run, but may encounter more startup latency during load spikes," but thrashing on DeadlineExceededErrors during warmup seems to indicate that dynamic instances can't be relied upon for load spikes at all right now.


Brandon Wirtz

Jan 14, 2012, 5:15:10 PM
to google-a...@googlegroups.com

Do you have a warmup handler configured in your app.yaml?

If you don't, then the new instance has to warm up and handle a request at the same time.  Specifying a warmup handler that simply initializes some variables and logs a "Warmup complete" event should fix your issue.

I don't think you have platform issues; I think you have "Google hasn't documented all the best practices" issues.
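The warmup handler Brandon describes can be sketched in a few lines.  This is a hypothetical minimal version using the standard-library wsgiref conventions instead of App Engine's webapp framework; /_ah/warmup is App Engine's actual warmup URL, but the function and module names here are illustrative:

```python
import logging


def warmup_app(environ, start_response):
    """Minimal warmup handler: do nothing expensive, just confirm startup.

    On App Engine this module would be mapped to the /_ah/warmup URL in
    app.yaml; the heavy lifting (module imports) has already happened by
    the time this function runs, so the handler itself is a near-noop.
    """
    logging.info('Warmup complete')  # the "log an event" step
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Warmup complete']
```

Any WSGI server (or App Engine's own runtime, with the appropriate app.yaml routing) can serve this; the point is that the warmup request returns quickly instead of doing application work.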


Cezary Wagner

Jan 15, 2012, 9:27:10 AM
to Google App Engine
I think this is a big problem: GAE not handling it hurts the customer experience of start-ups and low-traffic applications, and it kills traffic growth.

The first instance loads too slowly to serve the first customer; the result is a dropped request with a 500 status, and the customer goes away.  That should not happen, since it blocks traffic growth.

One workaround I know is to write a pinger that generates traffic to keep an instance alive, but that is against good architecture design; GAE should provide the same thing itself.

Another option is "always on", but that has to be paid for.  Cheaper is to write a pinger using cron (it should fire at intervals of less than 15 minutes), either on GAE or on another server.

A pinger also lets you trace availability: you can measure whether all pings succeeded.
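The cron-based pinger described above could be configured on App Engine with something like the following cron.yaml fragment.  The /ping URL and the 5-minute interval are illustrative choices, not from the thread:

```yaml
# cron.yaml -- hit a lightweight handler often enough that at least one
# instance stays warm (well under the ~15 min idle window mentioned above)
cron:
- description: keep-alive ping
  url: /ping
  schedule: every 5 minutes
```

The /ping handler itself should just return 200 and log, so each ping is cheap; the timestamps of the pings then double as the availability trace.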


Karl Rosaen

Jan 15, 2012, 4:07:35 PM
to google-a...@googlegroups.com
Hi Brandon, 

We do have a warmup handler defined.  During load, we were seeing the warmup request cause a DeadlineExceededError simply while importing code (e.g. during a Python import statement in one of the modules).

Thanks,
Karl

Brandon Wirtz

Jan 15, 2012, 6:02:06 PM
to google-a...@googlegroups.com

That would imply that your warmup takes more than 15 seconds.  The only thing your warmup should do is initialize the instance and send out a "hello world".

You may need to manage the order of your imports and make sure you don't have redundant imports.

There are some threads in the archive about speeding up your imports to avoid these errors.

For debugging this issue you should log "started import" and "finished import" so you can see which imports take the longest.
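The "started import"/"finished import" logging suggested here can be done with a small helper.  This is a sketch, not from the thread; `timed_import` is an illustrative name:

```python
import importlib
import logging
import time


def timed_import(module_name):
    """Import a module by name, logging how long the import took.

    Useful for finding which imports dominate warmup time.  Note that on
    a warm interpreter most timings will be near zero, because Python
    caches loaded modules in sys.modules after the first import.
    """
    logging.info('started import %s', module_name)
    start = time.time()
    module = importlib.import_module(module_name)
    elapsed = time.time() - start
    logging.info('finished import %s in %.3fs', module_name, elapsed)
    return module


# e.g. at the top of a handler module, instead of a bare import:
# json = timed_import('json')
```

Wrapping only the suspicious top-level imports this way keeps the overhead negligible while making the slow ones obvious in the logs.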

 

 

 


Karl Rosaen

Jan 15, 2012, 8:42:22 PM
to google-a...@googlegroups.com
Yeah, actually we do warm up memcache with the data the frontpage needs to render; under normal circumstances, performing these queries takes well under 15 seconds.  According to the docs:

You can implement handlers in this directory to perform application-specific tasks, such as pre-caching application data.

but you are suggesting that warmup handlers should only be simple no-ops.  Is this based on your experience, or are there other docs somewhere that recommend against doing application-specific data prefetching in the warmup handler?  I'll definitely give it a shot; just curious whether this is official GAE best practice or not.
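For what it's worth, one way to keep the warmup handler a near-noop while still getting the frontpage data cached is to fill the cache lazily on first use.  This is only a sketch: a plain dict stands in for memcache, and `get_frontpage_data`/`fetch` are hypothetical names, not from Karl's app:

```python
# Stand-in for memcache; on App Engine the same pattern would use
# google.appengine.api.memcache with get/set calls instead of a dict.
_cache = {}


def get_frontpage_data(fetch):
    """Return cached frontpage data, computing it on first use.

    `fetch` is the (hypothetical) expensive query function.  With this
    pattern the warmup handler can stay a near-noop: the first user
    request pays the query cost instead of the deadline-limited warmup
    request, and every later request is a cheap cache hit.
    """
    key = 'frontpage'
    if key not in _cache:
        _cache[key] = fetch()
    return _cache[key]
```

The trade-off is that one unlucky user request is slow instead of the warmup request; whether that beats pre-caching depends on how close the queries run to the warmup deadline.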

Thanks again,
Karl

Brandon Wirtz

Jan 15, 2012, 9:17:22 PM
to google-a...@googlegroups.com

Give me 35 minutes.  I shot a video about this topic; I thought that might be easier than a really long email :)

 

 


Brandon Wirtz

Jan 15, 2012, 10:01:31 PM
to google-a...@googlegroups.com

Snow is slowing down my internet… maybe a little longer than 35 minutes; YT says 45 min remaining…

Brandon Wirtz

Jan 15, 2012, 10:57:37 PM
to google-a...@googlegroups.com

http://www.youtube.com/watch?v=AJIvYBd-gnI

Hopefully this makes it clearer.

I don't think Google has a best practices guide (they should).

You don't want max pending latency to be high, or you can hit the timeout.

You don't want the warmup to do anything other than initialize variables, load imports, and say "hello world".

I ramble a bit since I didn't plan to shoot this today.

-Brandon

Alexis

Jan 16, 2012, 4:29:48 AM
to Google App Engine
Thanks for the shot!

Here are some comments:

- isn't the limit 60sec instead of 15sec?

- We indeed have the pending latency slider set to default,
but out of our 12 instances we have 3 resident ones, so the slider
will have little effect on our application's performance.

- Requests that hit a DEE typically have this signature:
ms=63498 cpu_ms=1097 api_cpu_ms=0 cpm_usd=0.030639 loading_request=1 pending_ms=373 exit_code=104
So the request did not spend much time waiting to be served, and the DEE is raised during the import phase.

- Our warmup requests, which import most of our modules and then return a simple "ok" string, take on average 2100ms to complete (min 860, max 3800).
But during DEE spikes they can also raise DEEs.

Brandon Wirtz

Jan 16, 2012, 4:57:07 AM
to google-a...@googlegroups.com
We may have to get someone from GOOG to weigh in.

Requests have a 60s time limit, but from testing it seems that initialization has a 15s limit.  I don't know if this is a bug, a feature, or a flaw in my testing.

Also, my "best practices" aren't guaranteed to fix everything; there could still be a bug.  But I was hoping that by eliminating the obvious causes we might get closer to a resolution.


Kenneth

Jan 16, 2012, 9:42:29 AM
to google-a...@googlegroups.com
Are we talking master/slave or high-replication datastore?  If we're talking master/slave, and I'm pretty certain we are, then forget it: Google isn't going to help you.  You need to migrate to HRD to avoid these problems.  My life was hell with deadline errors.  Then I bit the bullet and migrated, and I haven't had a single timeout error since.

If you are on HRD, then I'm at a loss.

Karl Rosaen

Jan 16, 2012, 10:31:44 AM
to google-a...@googlegroups.com
Brandon, thanks so much for taking the time to put together the video, very helpful.

The key insight seems to be: time spent in the queue waiting for a frontend counts toward the limit for a DeadlineExceededError.  This seems silly; user-visible latency and framework-level timeout enforcement should be decoupled in this case.  But it's a good insight, and I'm glad to better understand this behavior.  It also makes me wonder what the benefit of 'Automatic' for max pending latency would ever be - I'm going to slide mine down to ~1s.

One quibble about your advice for 'avoid importing code more than once': this shouldn't be a major issue in python unless you are importing a module from within a function that is called several times:

Although Python's interpreter is optimized to not import the same module multiple times, repeatedly executing an import statement can seriously affect performance in some circumstances.


Just wanted to clarify that one needn't fret about the same utility module being imported from two modules or anything like that.
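Karl's point is easy to demonstrate: after a module is loaded once, subsequent import statements are just a lookup in sys.modules, so the same utility module imported from many files costs essentially nothing.  A small illustrative sketch (`uses_json` is a hypothetical stand-in for a hot handler):

```python
import sys
import json  # first import: actually loads and executes the module


def uses_json():
    # Re-executed on every call, but after the first load this import is
    # only a sys.modules dictionary lookup plus a local name binding --
    # this is the "repeatedly executing an import statement" case the
    # Python performance docs warn can matter in tight loops.
    import json
    return json.dumps({'ok': True})


# A second top-level import binds the very same cached module object.
import json as json_again
assert json_again is sys.modules['json']
```

So only imports inside functions called very frequently are worth worrying about; module-level imports repeated across files are deduplicated by the interpreter.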

Karl

Karl Rosaen

Jan 16, 2012, 10:40:28 AM
to google-a...@googlegroups.com
Kenneth, 

Agreed that running on the master/slave datastore is a liability these days (which IMO is pretty irresponsible on GAE's part; I see no reason why major latency spikes can't be avoided even if there are advantages to the high-replication datastore, and we will eventually migrate.  If we must migrate, give us a deadline and fully support master/slave until then.  But I digress...).  However, the DEEs we're talking about here are happening during warmup, unrelated to datastore operations, and could affect anyone regardless of master/slave vs. HR.  For these cases I think Brandon's advice is spot on: lower your max pending latency to 1s or so.

Usually when we get hit with DEEs related to being on master / slave, I can see a corresponding spike in the system status dashboard:


and at least I know what's going on.  Actually, despite my whining, the last time we were hit by DEEs related to master/slave datastore latency spikes was over a month ago - GAE folks if you are listening and you have been trying to improve the reliability of master / slave, thank you.  I think when we were getting pounded by DEEs last week I started to freak out and pointed the finger at master / slave latency prematurely when it was really related to DEEs during warmup.

Anyways, thanks to Brandon, I also know what's going on in the case of DEEs during warmup too!

Best,
Karl

Brandon Wirtz

Jan 16, 2012, 12:58:28 PM
to google-a...@googlegroups.com

I agree multiple imports is not supposed to be a problem, but I have seen it cause issues, or seen issues resolved by not doing it.

Remember that the Google implementation of Python has its own "specialness", and what is true in traditional Python is not always quite the same in GAE land.


Alexis

Jan 16, 2012, 5:25:25 PM
to Google App Engine

Kenneth: I'm talking HRD

Cezary Wagner

Jan 16, 2012, 5:36:43 PM
to Google App Engine
Brandon,

I think the DeadlineExceededError problem is something different:

1st: GAE availability is not measured with DeadlineExceededErrors taken into account, so real availability is rather not 100%.  If it isn't monitored, it has no impact on the reported quality.  Am I wrong?
This is my data for the last week:
Day / Total checks / Outages / Failed checks / Avg. response time / Uptime
2012.01.14 26 1 1 3.152 96.154 %
2012.01.15 47 1 1 4.937 97.872 %
2012.01.16 47 4 5 3.657 89.362 %

2nd: My code, unchanged for months, has become slower, not faster.  I have optimized the code and it still doesn't run faster; on the SDK it executes in 1s-5s (excluding load time), versus the production results above.  Imports may be a factor, but why does loading sometimes succeed and sometimes fail?  It should either load quickly every time, or there is a balancing/resource problem.  Why is it RANDOM?

3rd: Instances DIE just after loading and don't survive the 15 min idle period.  An instance was loaded for 1 minute and then died.  Is that a problem on my side or a lack of balancing/resources?

4th: Maybe master/slave has an impact?

5th: The application doesn't work in my case under low traffic: EARLY DEATHS of INSTANCES and DeadlineExceededErrors kill 3%-10% of customer traffic (stats above).  I imagine GAE works somewhat better at higher traffic, with a lower error rate - maybe - but how do you reach high traffic if it doesn't work at low traffic?

Please answer these questions or propose another solution.  The GAE concept is good, but my customers' current experience is not excellent.

Brandon Wirtz

Jan 16, 2012, 6:19:01 PM
to google-a...@googlegroups.com
If it is a software/configuration issue in your setup, then it is not downtime.

If you have errors on M/S you are on your own.

Instances die early if you hit the soft memory limit. Likely if you have
load time issues you have memory usage issues as well.



Kenneth

unread,
Jan 17, 2012, 3:06:00 AM1/17/12
to google-a...@googlegroups.com
I always found that startup errors and datastore errors happened at the same time. I assume instances are getting the code from the datastore.

For reasons unknown, the master/slave datastore is broken, and Google's solution is HRD.  I was in the same mental position as you at the beginning of December: cursing Google over the POS that is the M/S datastore, filing issues and appealing to Googlers to fix it, looking at the mess that is migration and thinking "no way".  Then I just jumped before Christmas and haven't looked back.


Kenneth

Jan 17, 2012, 3:10:03 AM
to google-a...@googlegroups.com
Wow, sorry to hear that.  Is it Java or Python?  I've been on HRD now for about a month.  I have the sliders set to el cheapo mode and haven't had any timeouts at all.

My understanding is the startup timeout is 10 seconds. The total request timeout is 60 seconds.

Cezary Wagner

Jan 17, 2012, 4:37:24 AM
to Google App Engine
Brandon,

I have been doing service management/development for many years, and your explanation doesn't convince me, since it doesn't point to a valid root cause.  I think so; here is why it's not convincing.

Low availability is a fact; see the report again.  It is random for the same thing:
2012.01.14 26 1 1 3.152 96.154 %
2012.01.15 47 1 1 4.937 97.872 %
2012.01.16 48 4 5 3.699 89.583 %
2012.01.17 20 0 0 1.699 100 %

Right now it looks like it works, but when I try to open some test pages they open slowly compared to the previous week.  TODAY the customer experience is still bad.  I have done some optimization, which is visible in the times.

You said the root cause could be:
1. Configuration of the application - then why is it RANDOM?  Sometimes it loads, sometimes it doesn't; the same configuration should give the same results.

2. Instances die early if I hit the memory limit - good point, but I don't see any message about memory in the logs, and the application is well below 48 MB, so it should not die.  Why is it RANDOM?

3. Master/slave is NOT the root cause, since the DeadlineExceededError occurs before the datastore is ever touched, so it has no impact.  Do you agree?

4. If master/slave is not supported by Google, is there any announcement that it is unsupported, and from what date?

A QUICK example of a DeadlineExceededError (dying in imports):
<class 'google.appengine.runtime.DeadlineExceededError'>:
Traceback (most recent call last):
  File "/base/data/home/apps/wcinamy/1-9-9-6.356145314073492481/frontend/web/order_in_restaurant.py", line 20, in <module>
    from frontend.web.cache import restaurant_menu as cache_restaurant_menu
  File "/base/data/home/apps/wcinamy/1-9-9-6.356145314073492481/frontend/web/cache/restaurant_menu.py", line 4, in <module>
    from google.appengine.api import memcache

Cezary Wagner

Jan 17, 2012, 4:43:10 AM
to Google App Engine
Kenneth,

The timeout can't be 10s or 15s, since the DeadlineExceededErrors in my logs occur at about 60s.  I think the 10s or 15s figure is a myth, or a wrong implementation: what is the sense of allowing 60s if the instance is supposed to start in 15s?

Common timeouts are: 64608ms, 70806ms, 63093ms, 64499ms, ...

Brandon Wirtz

Jan 17, 2012, 4:47:39 AM
to google-a...@googlegroups.com
You are on M/S.  You could be doing "echo Hello World" and nothing else and still get DeadlineExceededErrors.  Maybe not quite, but if you have any imports that use memcache then you would.

M/S has zero advantage at this point.

Brandon Wirtz

Jan 17, 2012, 4:50:45 AM
to google-a...@googlegroups.com
You keep pointing to reasons you believe it isn't your code, and why we are wrong, but you aren't trying the things we point out.  If your timeouts are at 70 seconds, you are doing something wrong; it shouldn't take that long to start up.  EVER.

Also, you are on M/S, which has its own rules.  Jump into the modern world and run on HR with the rest of us happy shiny people, rather than running on the unsupported, failed experiment that M/S is.


Cezary Wagner

Jan 17, 2012, 11:03:31 AM
to Google App Engine
Brandon,

This is a test on master/slave.

I'm not claiming my code is perfect; my point is that GAE's LOAD behavior is RANDOM/UNPREDICTABLE.  Sometimes it loads quickly (5s), sometimes it doesn't load at all because of a DeadlineExceededError (70s).  That is not enough for a production system.  The problem started in the last few weeks.

I will do some import tracing and find which imports have the biggest impact, but the problem is that loading is random: sometimes fast, sometimes slow.  I downgraded one module, but GAE still works worse than before, even though it should work faster, since I added very fast caching and other optimizations.

I'm planning to move to HRD to see whether it has any impact on customer experience/availability.

See how random it is (the first time is from module start, the second is for the single module).  Only times above one second are shown; both requests had the same input, and the variance runs from 17s to 30s up to the deadline.  It is NOT STABLE:

2012-01-17 14:16:07.807 ImportLogger 1.03 1.02 - frontend.web.handlers - frontend.web.handlers.common:17
D 2012-01-17 14:16:09.310 ImportLogger 2.53 1.50 - google - frontend.web.handlers.common:28
D 2012-01-17 14:16:10.164 ImportLogger 3.38 0.53 - frontend.web.cookie - frontend.web.handlers.common:45
D 2012-01-17 14:16:11.118 ImportLogger 4.34 0.95 - frontend.web - frontend.web.handlers.common:47
D 2012-01-17 14:16:12.106 ImportLogger 5.32 0.93 - frontend.web.authentication - frontend.web.handlers.common:52
D 2012-01-17 14:16:12.239 ImportLogger 6.66 6.66 - common handler - __main__:13
D 2012-01-17 14:16:18.552 ImportLogger 12.97 6.31 - frontend.restaurant - __main__:29
D 2012-01-17 14:16:19.886 ImportLogger 1.23 1.23 - cssutils - css_inliner:8
D 2012-01-17 14:16:19.924 ImportLogger 1.30 1.28 - css_inliner - frontend.ordering.messages:24
D 2012-01-17 14:16:22.118 ImportLogger 3.50 2.13 - utilities.time - frontend.ordering.messages:39
D 2012-01-17 14:16:22.579 ImportLogger 3.97 3.96 - frontend.ordering - frontend.ordering.processing:22
D 2012-01-17 14:16:22.642 ImportLogger 17.06 4.04 - frontend.ordering - __main__:35

2012-01-17 14:15:41.905 ImportLogger 0.95 0.61 - frontend.ordering.order - __main__:33
D 2012-01-17 14:15:59.633 ImportLogger 17.32 17.32 - cssutils - css_inliner:8
D 2012-01-17 14:16:03.322 ImportLogger 21.01 3.69 - BeautifulSoup - css_inliner:13
D 2012-01-17 14:16:04.559 ImportLogger 22.25 1.24 - pynliner.soupselect - css_inliner:17
D 2012-01-17 14:16:04.559 ImportLogger 22.51 22.37 - css_inliner - frontend.ordering.messages:24
D 2012-01-17 14:16:06.902 ImportLogger 24.85 2.34 - messaging.sms.smsapipl - frontend.ordering.messages:27
D 2012-01-17 14:16:07.636 ImportLogger 25.59 0.73 - link_shortener - frontend.ordering.messages:29
D 2012-01-17 14:16:08.990 ImportLogger 26.94 0.64 - globals - frontend.ordering.messages:54
D 2012-01-17 14:16:08.990 ImportLogger 27.02 27.01 - frontend.ordering - frontend.ordering.processing:22
D 2012-01-17 14:16:10.166 ImportLogger 28.20 0.85 - frontend.ordering.order_state_controller - frontend.ordering.processing:26
D 2012-01-17 14:16:12.010 ImportLogger 30.04 1.84 - frontend.billing - frontend.ordering.processing:30
D 2012-01-17 14:16:12.010 ImportLogger 31.06 30.10 - frontend.ordering - __main__:35

Alexis

Jan 18, 2012, 5:02:26 AM
to Google App Engine
Cezary, I was in the same position in August, with our Python apps on M/S.
The DEE errors began to appear, then became more and more frequent, and finally had a high impact on our app's availability.
Our warmup requests, which only load code, were randomly taking from a few seconds to more than 30 seconds, and so triggering the DEE.

I could not figure out how it could be linked to the datastore, but the Google team only had one word to mouth: move to HRD.
No explanation at the time, and it made no sense to me, but I decided to trust them.

We migrated to HRD writing our own mapreduce, and then all went fine (except for the costs, which went higher, but hey, that was the price for a more reliable solution).

But since something like mid-December, exactly the same symptoms have begun to appear again.  Less frequent and not really impacting our availability yet, but still there.
And no official acknowledgement of this issue so far...

Cezary Wagner

Jan 19, 2012, 7:08:05 PM
to Google App Engine
Alexis,

It looks like GAE has hidden balancing/resource-shortage problems or deadlocks; both result in RANDOM problems.

I'm thinking about moving to HRD, but it looks like you have suffered the same on HRD in recent days, so it won't help until GAE stops generating RANDOM problems.  My optimizations help with response time, but not yet with availability.

Day / Total checks / Outages / Failed checks / Avg. response time / Uptime
2012.01.14 26 1 1 3.152 96.154 %
2012.01.15 47 1 1 4.937 97.872 %
2012.01.16 48 4 5 3.699 89.583 %
2012.01.17 48 3 5 2.507 89.583 %
2012.01.18 48 1 1 2.419 97.917 %
2012.01.19 47 2 2 3.143 95.745 %

Brandon Wirtz

Jan 19, 2012, 7:40:55 PM
to google-a...@googlegroups.com

I'm still calling BS on your explanation.  And since you are on M/S I can't call BS on your uptime numbers, but here are mine, from JeffProbst.com, for the last 6 months.

Some of that 1 hour was my fault, for updating code and making a misconfiguration.  And the issues on 1/11 were a DNS change we made.  In total we have had less than 30 minutes of GAE downtime in 6 months.

And to the naysayers who will say "oh, but you don't have a 'real' app": my app has dependencies on django, memcache, and the datastore, and does regular expressions.  That's 90% of what any app is going to do.  We also spin up and down a lot more than most sites, since we serve as few as 100 visitors a day on a really slow day and as many as 400k on a big day.

I believe my GAE uptime for 6 months is at 99.99%, with an average QoS that is more than acceptable (requests served in an average of 489ms with a peak of 818ms).

Here are my imports, to prove I'm a "real" app:

os, datetime, hashlib, logging, pickle, re, time, zlib, urllib, wsgiref.handlers, cgi, cachepy, urlparse, memcache, urlfetch, webapp, apiproxy_errors, template

 

 

 

Uptime

99.98%

Downtime

1h

The average downtime length is 6m 40s

Number of downtimes

9

The longest downtime was 10m on 10/25/2011 11:29:06AM and the shortest was 5m on 07/19/2011 1:04:06AM

Status | From                  | To                    | Duration
------ | --------------------- | --------------------- | ---------------
Up     | 01/13/2012 6:14:06AM  | 01/19/2012 4:22:50PM  | 6d 10h 8m 44s
Down   | 01/13/2012 6:09:06AM  | 01/13/2012 6:14:06AM  | 5m
Up     | 01/11/2012 1:54:06PM  | 01/13/2012 6:09:06AM  | 1d 16h 15m
Down   | 01/11/2012 1:49:06PM  | 01/11/2012 1:54:06PM  | 5m
Up     | 01/11/2012 1:44:06PM  | 01/11/2012 1:49:06PM  | 5m
Down   | 01/11/2012 1:39:06PM  | 01/11/2012 1:44:06PM  | 5m
Up     | 01/11/2012 1:34:06PM  | 01/11/2012 1:39:06PM  | 5m
Down   | 01/11/2012 1:24:06PM  | 01/11/2012 1:34:06PM  | 10m
Up     | 01/01/2012 11:59:08PM | 01/11/2012 1:24:06PM  | 9d 13h 24m 58s
Up     | 12/18/2011 2:19:09PM  | 01/01/2012 11:49:06PM | 14d 9h 29m 57s
Up     | 12/15/2011 9:09:17PM  | 12/18/2011 2:04:06PM  | 2d 16h 54m 49s
Up     | 12/15/2011 6:54:20AM  | 12/15/2011 8:59:06PM  | 14h 4m 46s
Up     | 12/15/2011 4:04:26AM  | 12/15/2011 6:44:06AM  | 2h 39m 40s
Up     | 12/13/2011 6:04:35PM  | 12/15/2011 3:54:06AM  | 1d 9h 49m 31s
Up     | 12/10/2011 1:39:08PM  | 12/13/2011 5:54:06PM  | 3d 4h 14m 58s
Up     | 12/09/2011 11:24:09PM | 12/10/2011 1:24:06PM  | 13h 59m 57s
Up     | 12/08/2011 7:44:07AM  | 12/09/2011 11:14:06PM | 1d 15h 29m 59s
Up     | 12/05/2011 2:39:26PM  | 12/08/2011 7:34:06AM  | 2d 16h 54m 40s
Up     | 12/04/2011 11:44:08AM | 12/05/2011 2:29:08PM  | 1d 2h 45m
Up     | 12/04/2011 3:39:06AM  | 12/04/2011 11:34:06AM | 7h 55m
Up     | 12/02/2011 3:44:12PM  | 12/04/2011 3:19:06AM  | 1d 11h 34m 54s
Up     | 12/02/2011 4:24:09AM  | 12/02/2011 3:34:06PM  | 11h 9m 57s
Up     | 12/02/2011 1:34:09AM  | 12/02/2011 4:14:06AM  | 2h 39m 57s
Up     | 12/01/2011 5:44:11AM  | 12/02/2011 1:24:06AM  | 19h 39m 55s
Up     | 11/29/2011 8:09:06PM  | 12/01/2011 5:34:07AM  | 1d 9h 25m 1s
Down   | 11/29/2011 7:59:06PM  | 11/29/2011 8:09:06PM  | 10m
Up     | 11/28/2011 6:14:09PM  | 11/29/2011 7:59:06PM  | 1d 1h 44m 57s
Up     | 11/28/2011 9:44:09AM  | 11/28/2011 6:04:06PM  | 8h 19m 57s
Up     | 11/27/2011 8:14:23AM  | 11/28/2011 9:34:06AM  | 1d 1h 19m 43s
Up     | 11/26/2011 6:44:09AM  | 11/27/2011 8:04:06AM  | 1d 1h 19m 57s
Up     | 11/18/2011 2:54:08PM  | 11/26/2011 6:34:06AM  | 7d 15h 39m 58s
Up     | 11/18/2011 8:04:18AM  | 11/18/2011 2:44:06PM  | 6h 39m 48s
Up     | 11/13/2011 6:59:13AM  | 11/18/2011 7:54:11AM  | 5d 54m 58s
Up     | 11/13/2011 6:19:06AM  | 11/13/2011 6:49:06AM  | 30m
Down   | 11/13/2011 6:14:06AM  | 11/13/2011 6:19:06AM  | 5m
Up     | 11/11/2011 4:04:09AM  | 11/13/2011 6:14:06AM  | 2d 2h 9m 57s
Up     | 11/03/2011 4:54:06PM  | 11/11/2011 3:54:06AM  | 7d 11h
Down   | 11/03/2011 4:49:06PM  | 11/03/2011 4:54:06PM  | 5m
Up     | 11/02/2011 4:19:10PM  | 11/03/2011 4:49:06PM  | 1d 29m 56s
Up     | 11/02/2011 1:29:28PM  | 11/02/2011 4:09:09PM  | 2h 39m 41s
Up     | 11/01/2011 3:44:07AM  | 11/02/2011 1:19:06PM  | 1d 9h 34m 59s
Up     | 10/30/2011 6:09:19PM  | 11/01/2011 3:29:06AM  | 1d 9h 19m 47s
Up     | 10/30/2011 12:59:07PM | 10/30/2011 5:59:06PM  | 4h 59m 59s
Up     | 10/27/2011 3:54:10PM  | 10/30/2011 12:49:06PM | 2d 20h 54m 56s
Up     | 10/26/2011 8:24:06AM  | 10/27/2011 3:44:06PM  | 1d 7h 20m
Up     | 10/25/2011 11:39:06AM | 10/26/2011 8:09:09AM  | 20h 30m 3s
Down   | 10/25/2011 11:29:06AM | 10/25/2011 11:39:06AM | 10m
Up     | 10/22/2011 2:34:16AM  | 10/25/2011 11:29:06AM | 3d 8h 54m 50s
Up     | 10/21/2011 7:59:10AM  | 10/22/2011 2:24:07AM  | 18h 24m 57s
Up     | 10/21/2011 5:14:16AM  | 10/21/2011 7:49:07AM  | 2h 34m 51s
Up     | 10/03/2011 5:09:08PM  | 10/21/2011 5:04:06AM  | 17d 11h 54m 58s
Up     | 09/20/2011 4:14:08AM  | 10/03/2011 4:59:06PM  | 13d 12h 44m 58s
Up     | 09/16/2011 3:04:11PM  | 09/20/2011 4:04:06AM  | 3d 12h 59m 55s
Up     | 09/11/2011 9:24:25PM  | 09/16/2011 2:54:07PM  | 4d 17h 29m 42s
Up     | 09/06/2011 4:29:06AM  | 09/11/2011 9:14:14PM  | 5d 16h 45m 8s
Up     | 08/15/2011 7:54:13AM  | 09/06/2011 4:14:06AM  | 21d 20h 19m 53s
Up     | 08/13/2011 7:34:08PM  | 08/15/2011 7:44:06AM  | 1d 12h 9m 58s
Up     | 08/04/2011 9:14:08AM  | 08/13/2011 7:24:06PM  | 9d 10h 9m 58s
Up     | 07/24/2011 2:44:23AM  | 08/04/2011 9:04:06AM  | 11d 6h 19m 43s
Up     | 07/19/2011 1:09:06AM  | 07/24/2011 2:34:07AM  | 5d 1h 25m 1s
Down   | 07/19/2011 1:04:06AM  | 07/19/2011 1:09:06AM  | 5m
Up     | 07/19/2011 12:00:00AM | 07/19/2011 1:04:06AM  | 1h 4m 6s

 

 

 

Here is the Home Page serve time for the past 100 days.

 

Date (daily) | Avg. Response Time
------------ | ------------------
01/19/2012   | 467 ms
01/18/2012   | 467 ms
01/17/2012   | 583 ms
01/16/2012   | 475 ms
01/15/2012   | 541 ms
01/14/2012   | 523 ms
01/13/2012   | 478 ms
01/12/2012   | 492 ms
01/11/2012   | 630 ms
01/10/2012   | 488 ms
01/09/2012   | 500 ms
01/08/2012   | 496 ms
01/07/2012   | 477 ms
01/06/2012   | 508 ms
01/05/2012   | 517 ms
01/04/2012   | 484 ms
01/03/2012   | 544 ms
01/02/2012   | 517 ms
01/01/2012   | 477 ms
12/31/2011   | 454 ms
12/30/2011   | 456 ms
12/29/2011   | 567 ms
12/28/2011   | 483 ms
12/27/2011   | 465 ms
12/26/2011   | 471 ms
12/25/2011   | 468 ms
12/24/2011   | 460 ms
12/23/2011   | 495 ms
12/22/2011   | 489 ms
12/21/2011   | 520 ms
12/20/2011   | 502 ms
12/19/2011   | 468 ms
12/18/2011   | 441 ms
12/17/2011   | 505 ms
12/16/2011   | 490 ms
12/15/2011   | 480 ms
12/14/2011   | 508 ms
12/13/2011   | 516 ms
12/12/2011   | 513 ms
12/11/2011   | 479 ms
12/10/2011   | 500 ms
12/09/2011   | 456 ms
12/08/2011   | 451 ms
12/07/2011   | 457 ms
12/06/2011   | 453 ms
12/05/2011   | 444 ms
12/04/2011   | 448 ms
12/03/2011   | 450 ms
12/02/2011   | 531 ms
12/01/2011   | 402 ms
11/30/2011   | 519 ms
11/29/2011   | 544 ms
11/28/2011   | 433 ms
11/27/2011   | 449 ms
11/26/2011   | 415 ms
11/25/2011   | 499 ms
11/24/2011   | 441 ms
11/23/2011   | 544 ms
11/22/2011   | 434 ms
11/21/2011   | 469 ms
11/20/2011   | 452 ms
11/19/2011   | 461 ms
11/18/2011   | 443 ms
11/17/2011   | 453 ms
11/16/2011   | 445 ms
11/15/2011   | 423 ms
11/14/2011   | 447 ms
11/13/2011   | 414 ms
11/12/2011   | 434 ms
11/11/2011   | 447 ms
11/10/2011   | 507 ms
11/09/2011   | 488 ms
11/08/2011   | 577 ms
11/07/2011   | 464 ms
11/06/2011   | 497 ms
11/05/2011   | 480 ms
11/04/2011   | 476 ms
11/03/2011   | 464 ms
11/02/2011   | 544 ms
11/01/2011   | 497 ms
10/31/2011   | 552 ms
10/30/2011   | 486 ms
10/29/2011   | 495 ms
10/28/2011   | 477 ms
10/27/2011   | 483 ms
10/26/2011   | 601 ms
10/25/2011   | 534 ms
10/24/2011   | 529 ms
10/23/2011   | 480 ms
10/22/2011   | 491 ms
10/21/2011   | 463 ms
10/20/2011   | 530 ms
10/19/2011   | 494 ms
10/18/2011   | 484 ms
10/17/2011   | 521 ms
10/16/2011   | 517 ms
10/15/2011   | 515 ms
10/14/2011   | 495 ms
10/13/2011   | 550 ms
10/12/2011   | 518 ms

 

 

 

 

 

 

-----Original Message-----
From: google-a...@googlegroups.com [mailto:google-a...@googlegroups.com] On Behalf Of Cezary Wagner
Sent: Thursday, January 19, 2012 4:08 PM
To: Google App Engine
Subject: [google-appengine] Re: Why are several production issues related to DeadlineExceededErrors being ignored?

 

Alexis,


Cezary Wagner

unread,
Jan 21, 2012, 2:44:40 PM1/21/12
to google-a...@googlegroups.com
Thanks for the attention - it looks like a bigger application scales very well; I want to have such results!

I still have random behavior - code speed is random for the same set of parameters. I have done telecom service monitoring and large-scale architecture design for a long time, so I think there is no chance a bug of mine is generating the random timings. I have tracked almost all import times but do not see a root cause.

I am using almost the same set of imports, plus cssutils and BeautifulSoup:
os, datetime, hashlib, logging, pickle, re, time, zlib, urllib, wsgiref.handlers, cgi, urlparse, memcache, urlfetch, webapp, template

I think I have done the best optimization I can, but the timing is still not good - excluding the 20th, since there were some planned works that day.

2012.01.20 48 10 112.558 77.083 %
2012.01.21 41 1 23.287 95.122 %

Brandon Wirtz

unread,
Jan 21, 2012, 5:37:49 PM1/21/12
to google-a...@googlegroups.com

If you think it is because you are small scale, set an instance to Always On (or 2).
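In that era, Always On and the idle-instance sliders were Admin Console settings rather than config, but a related mitigation was enabling warmup requests so a new instance initializes via /_ah/warmup before user traffic is routed to it. A minimal sketch of an app.yaml for the Python runtime of the time ("yourapp" and "main.py" are placeholder names, not from this thread):

```yaml
# Hypothetical app.yaml fragment: warmup requests let the scheduler
# hit /_ah/warmup on a fresh instance before routing real traffic,
# reducing DeadlineExceededErrors on loading requests.
application: yourapp
version: 1
runtime: python
api_version: 1

inbound_services:
- warmup

handlers:
- url: /_ah/warmup
  script: main.py
- url: /.*
  script: main.py
```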

 

From: google-a...@googlegroups.com [mailto:google-a...@googlegroups.com] On Behalf Of Cezary Wagner


Sent: Saturday, January 21, 2012 11:45 AM
To: google-a...@googlegroups.com


Cezary Wagner

unread,
Jan 21, 2012, 6:42:56 PM1/21/12
to google-a...@googlegroups.com
Sure, 'Always On' has some impact, but the instance-spawning charts have been more abnormal in the last few days - some people report similar problems on the issues list.

With large traffic you will see less impact, since more instances are ready.

I saw today that the instance chart was becoming flatter and lower - probably fewer fast instances dying, or something else ...

[attachment: Abnormal-chart.png]

Brandon Wirtz

unread,
Jan 21, 2012, 7:20:13 PM1/21/12
to google-a...@googlegroups.com

I’m going to stop commenting on this thread.

 

Your understanding of how the scheduler works is wrong. Which is common, and I can’t blame you for it, but your resistance to HR, and to making the scheduler changes, is preventing you from resolving your issues.

 

I still believe you have code issues as well, since the spin-up times implied by your imports should be nowhere near what you quote.
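One common culprit for long spin-ups is doing every heavy import at module load time, since that cost is paid on each loading request. A minimal sketch of deferring a heavy import into the handler that actually needs it (`parse_feed` is a hypothetical handler name, and `minidom` stands in for a genuinely heavy dependency like BeautifulSoup):

```python
import time

def parse_feed(markup):
    """Handler that defers a heavy import until it is actually needed.

    The import statement runs only when this code path executes; Python
    caches the module in sys.modules, so later calls pay no import cost.
    """
    from xml.dom import minidom  # deferred: not paid at instance spin-up
    return minidom.parseString(markup).documentElement.tagName

start = time.time()
tag = parse_feed("<rss></rss>")   # first call pays the one-time import
elapsed = time.time() - start
```

The trade-off is that the first request through the deferred path is slower, but cold starts for every other path get cheaper.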

 

But every one of your messages includes “I have a lot of experience with this, I can’t be wrong” type language. I can’t help someone who doesn’t want to debug the issue and only wishes to cite why it must be someone else’s fault.

 

-Brandon

 

Brandon Wirtz
BlackWaterOps: President / Lead Mercenary


Work: 510-992-6548
Toll Free: 866-400-4536

IM: dra...@gmail.com (Google Talk)
Skype: drakegreene

BlackWater Ops
