(mostly) Consistent 20-second delay in starting backend tasks


Dave Loomer

Jan 28, 2012, 5:35:38 PM
to google-a...@googlegroups.com
The short version is that I have a "hobby" app (granted, I put a lot of time and energy into it) that does tons of mapreduce-esque backend processing through tasks that execute, then create a new task for the next step, and so on. My site will never generate revenue, so I aim to someday get my daily costs down to about a dollar or not much more; thus I'm limiting each of my backends (3 total, each doing specialized tasks) to one instance, while at the same time it's obviously important that tasks complete as quickly as possible. I'm looking for the happy medium between too many and too few instances. While adding instances, or breaking the work up into fewer, longer-running tasks, might to some degree address the concerns listed below, I've still discovered some things that concern and confuse me, and I'd love to know the answers regardless.

My concerns in summary: I have noticed that even though the logs showed about 1 minute of request handling time for each executed task, I was only executing about two tasks every three minutes. After looking into it, the problem seems to lie in the fact that I am configuring each backend with just a single instance. Running metrics on a mapreduce batch sequence on a single instance: a batch run that takes 255 minutes to complete, through the execution of 183 tasks, spends 82 of those minutes waiting for tasks to start after they are added to the queue. I've done lots and lots of analysis with controlled tests to assure myself that the issue is unrelated to my backend or task queue configuration, and that my code isn't doing anything that should cause any of this. There are no errors in the logs, and the logs (and my knowledge of my own app) also show me that none of my code is doing anything at all in the task queue or on the backend in question during these 82 minutes of waiting. It all seems to come down to how the GAE scheduler handles scheduling of requests or tasks on a single-instance backend.

Details -
The backend is standard B1 dynamic. To create a more controlled and reproducible test I wrote a handler that does nothing but enqueue a new task which in turn runs that same handler, rinse and repeat. These requests, which the logs show completing in a few milliseconds (with some exceptions; see below), exhibit pretty much the exact same delays as my production code, which takes an average of a minute per task because it does actual work. A variation of this simplified bare-bones handler adds a wrinkle: it passes the launched task an argument recording the time at which the task was added, so that the top of the handler can compare the current time against the task's creation time to estimate the delay in starting the task. On a single-instance backend the delay is generally very, very close to exactly 20 seconds, for both my production code and the bare-bones example described above.
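To make the timing scheme concrete, here is a minimal sketch of the measurement (hypothetical names; my actual handler appears later in the thread). The task carries its creation time as a parameter, and the receiving handler subtracts it from the current time:

```python
import datetime

# Format used when passing the creation time as a task parameter; equivalent
# to strftime("%Y-%m-%d %H:%M:%S") plus zero-padded microseconds.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

def format_create_time(now):
    """Serialize the enqueue time for the task's params."""
    return now.strftime(TIME_FORMAT)

def task_delay_seconds(task_create_time_s, now):
    """Estimate how long the task sat in the queue: the difference between
    the handler's start time and the creation time passed in the params."""
    created = datetime.datetime.strptime(task_create_time_s, TIME_FORMAT)
    return (now - created).total_seconds()
```

With values like the ones I observed, this is where numbers such as 20.093132 come from.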

Tests
Note that each of the scenarios below has been run many, many times, switching frequently from one test to the other to more or less eliminate the effects of periodic conditions in the infrastructure. Results have been extremely consistent so I'm confident in the numbers.

1. Run the task on a frontend rather than on my backend.
Result: 100 tasks complete in a second or two. Curiously, the logs show that all requests are served by the same frontend instance.

2. Increase the number of instances on my backend to 5.
Result: Essentially the same performance as when running the tasks on the frontend. Delay in starting a task ranges from 0.01 to 0.05 seconds. By the time all tasks complete, two instances are spun up. The Instances page shows requests distributed evenly between the two instances. Logs show that requests complete in 7-15 ms.

3. 4-instance backend.
Result: By the time all tasks complete, two instances are spun up. The Instances page shows requests distributed evenly between the two instances. Delay in starting a task ranges from 0.28 to 1.2 seconds. Logs show that the requests themselves complete in around 500 ms.
Observations/Concerns: If the same number of instances is running as in the 5-instance configuration, with ample idle instances waiting and ready, why are the results different? In the real world these results are just as acceptable to me as the 5-instance configuration, because in the end it doesn't change cost much, but I'm still curious. And really, 500 ms is a long time to serve a request whose handler does nothing but add another task.

4. 3-instance backend.
Result: By the time all tasks complete, 3 instances (yes, three, as opposed to two when configuring my backend with 4-5 instances) are spun up. The Instances page shows the vast majority (75-85%) of requests going to one of the instances. Delay in starting a task ranges from 0.01 to 1.2 seconds. Logs show that the requests themselves complete in around 7-11 ms.
Observations/Concerns: Lots of curious stuff here, starting with the number of instances that spin up, and I've repeated these tests over and over with the same results. Why the imbalance of requests sent to the first instance as opposed to the other two? Why is the delay smaller than in the 4-instance configuration? Why do the 3- and 5-instance configurations show requests completing in a few milliseconds as expected, while the 4-instance configuration consistently (have I mentioned I've run these tests lots and lots of times?) takes 500 ms for the portion where the request handler is running?

5. 2-instance backend.
Result: Both instances spin up, and requests are distributed evenly between them. Delay in starting a task ranges from 1.6 to 3.2 seconds. Logs show that the requests themselves complete in around 2000-5000 ms.
Observations/Concerns: Same number of instances serving requests as in the 4- and 5-instance configurations. The only difference is in idle instances -- which, duh, aren't doing anything. Still marginally acceptable performance, but why are the results noticeably worse?

6. 1-instance backend, i.e. my current production configuration.
Result: Delay in invoking the handler for an added task ranges from 19 to 30 seconds, with very heavy bunching around the 20-second mark. Logs show that the requests themselves complete in around 1300 ms.
Observations/Concerns
- Once again, my intuition beforehand on what would happen to the request completion time (which granted isn't the point of my tests, but now a new point of curiosity) proved completely wrong.
- Wow. Unacceptable performance. 20 seconds to invoke the handler of an added task is a lot, and a huge divergence from the results in all of my other tested configurations.
- With a few exceptions, delays between adding a task and the invocation of that task's handler are *very* close to 20 seconds. Values like 20.093132, 20.026017, 19.713625, 19.891798999999999 are the norm; values even as much as +/-0.4 from 20.0 are infrequent.
- After the first minute of the batch run, the Task Queue Details page consistently shows 6-8 tasks run in the last minute even though the logs show only 2-3 requests served per minute (no errors or anything abnormal in the logs either). I know for certain the only tasks running in the queue in question are from my tests. Why is the number overstated, and why does it only happen in the single-instance configuration?

Other observations
For kicks I played around with the X-AppEngine-FailFast header. In the multi-instance configurations it surprisingly didn't prevent GAE from spinning up multiple instances, and I never saw the errors in the logs that I expected to. Apparently I don't really understand what FailFast does.


Anyone have insights into any of these behaviors?

Dave Loomer

Jan 28, 2012, 5:45:39 PM
to google-a...@googlegroups.com
FWIW, I should note that my app is master/slave. The pressure to move to HRD/Python 2.7 is very real, but right now I have too many concerns with replication delays, and reading about others' migration headaches with data volumes similar to mine, I have no short-term plans to migrate.

Dave Loomer

Jan 28, 2012, 9:24:12 PM
to google-a...@googlegroups.com
I've been able to nearly solve the delay problem by setting countdown=1 in the Task constructor. This reduces the delay from 20 seconds to about 1.5. Not sure why it's not closer to 1.0, but this will be fine. The time to serve the simple request itself is unaffected.
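For reference, the change is a single extra argument to taskqueue.Task(). Sketched here as a plain helper that builds the constructor's keyword arguments so the difference is explicit (the URL and target names are from my own handler; the helper itself is just for illustration):

```python
def adhoc_task_options(task_create_time_s, countdown=None):
    """Keyword arguments for taskqueue.Task(...). Passing countdown=1 sets
    an explicit ETA one second in the future, which (counterintuitively)
    cut the observed start delay from ~20 s down to ~1.5 s."""
    opts = {
        'url': '/batch/adhoc',
        'params': {'task_create_time': task_create_time_s},
        'target': 'overnight-external-data',
        'method': 'GET',
    }
    if countdown is not None:
        opts['countdown'] = countdown
    return opts

# Usage inside the handler:
#   t = taskqueue.Task(**adhoc_task_options(task_create_time_s, countdown=1))
#   t.add(queue_name='overnight-tasks')
```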

Still a strange bug (?).

Also, some other things lead me to believe GAE just wants to "rest" for 20 seconds. I started to wonder whether in my previous tests I was simply running things through too fast, even though that seemed unlikely since I was seeing the same delays in production where tasks take 1+ minutes to complete. But I put a time.sleep(10) in my handler just to make sure. Strangely, this reduced the delay from 20 seconds down to 10. So the sleep gives 10 seconds of rest and then GAE still wants 10 more? Something like that. When I replaced time.sleep(10) with a tight loop lasting 10 seconds, my higher delays returned, although they were more erratic (i.e. not consistently near 20 seconds).
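The two variants differed only in whether the handler yields the CPU or burns it for the same wall time (sketch; the 10-second figure matches the experiment above):

```python
import time

def idle_wait(seconds):
    """time.sleep variant: the instance sits idle for the duration. With
    sleep(10) in the handler, the next task's start delay dropped to ~10 s."""
    time.sleep(seconds)

def busy_wait(seconds):
    """Tight-loop variant: same wall time, but the CPU stays busy. With
    this version the ~20 s delays returned, and less consistently."""
    deadline = time.time() + seconds
    while time.time() < deadline:
        pass
```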

Also, I tried creating a new app containing only the code I had written for my test, and got somewhat different results. I got the almost-exactly-20-seconds delay, but only about once every 4-5 requests. The other times it was near zero. Maybe this is because nothing else is going on in the app, so GAE has less need for "rest?" Almost as if it were a quota thing -- except the fact that I demonstrated earlier that I can eliminate the delay entirely by running tasks on a frontend, or using a 5-instance backend, or (as now demonstrated) setting countdown=1 seems to eliminate that possibility.

All very strange.

pdknsk

Jan 29, 2012, 11:24:19 AM
to Google App Engine
I had the same problem, often with delays of almost exactly 20
seconds. More details in this thread.

http://groups.google.com/group/google-appengine/browse_thread/thread/e5588268dff9b97a

In another configuration, the delay increased to several minutes.
Please star this bug.

http://code.google.com/p/googleappengine/issues/detail?id=6022

Dave Loomer

Jan 29, 2012, 11:29:55 AM
to Google App Engine
Interesting. I saw your thread but wasn't entirely sure it was the
same issue. I think the thing that threw me off was that your delays
were reflected in the request ms in the logs, while in my case they
mostly aren't.

Does setting the task countdown work for you? Or is ~1 second delay
still too much in your case?

On Jan 29, 10:24 am, pdknsk <pdk...@googlemail.com> wrote:
> I had the same problem, with often quite exact 20 seconds delays. More
> details in this thread.
>
> http://groups.google.com/group/google-appengine/browse_thread/thread/...

pdknsk

Jan 29, 2012, 11:56:33 AM
to Google App Engine
I haven't tried setting countdown. It's too slow for the mail queue
anyway. It might work on the other backend (mentioned in the bug).
I've moved most code away from backends though.
> *- After first minute of the batch run, Task Queue Details page
> consistently shows 6-8 tasks run in last minute even though logs show only
> 2-3 requests served per minute (no errors or anything abnormal shown in the
> logs either). I know for certain the only tasks running in the queue in
> question are from my tests. Why is the number overstated, and why does it
> only happen in the single-instance configuration?*

I've noticed something else, which might be related. When you stop a
backend and put tasks on it, the tasks obviously will not run. In the
task queue overview in the dashboard, however, tasks are reported to
have run in the last minute.

Robert Kluin

Jan 31, 2012, 1:26:48 AM
to google-a...@googlegroups.com
Hi,
  You've got a lot of info there.  I tried to skim it, but might have
missed some stuff.

  In your experiments with multiple backends, they are all the same
"backend" right?  You're not targeting specific instances are you?  In
other words, is everything targeted at "thebackend," "1.thebackend,"
or multiple different backends?

  Do you see pending latencies in the request log headers?  That's
where the failfast header should help.  Rather than dispatch the task,
then wait for an instance to handle it, it will immediately get
returned to the queue.  I don't think it is intended to prevent
instances from being spun up.

  What happens if instead of backends you use front ends plus a
max_concurrent_requests equal to the number of backends you were
wanting? Does it change the behavior you're seeing?

  Does the latency of the request itself impact what you're
seeing? If you do a lot more work in each request, does the situation
improve?


Robert

> --
> You received this message because you are subscribed to the Google Groups "Google App Engine" group.
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
>

Dave Loomer

Jan 31, 2012, 4:58:11 PM
to google-a...@googlegroups.com
Hi Robert,

When targeting backend "thebackend", I did so merely by specifying target=thebackend when creating the Task object. Then in backends.yaml I modify the instances parameter for the backend in question for each of my various tests.

I'm not really too concerned about failfast, so I'll skip that one for the moment ...

I haven't played with max concurrent requests, but in the test where I used a frontend, I did not see any additional instances spin up (and my site is very low-use, so there was probably a max of 1-2 instances to start with).

I see the same length of delay (~20 seconds) whether I'm testing with my rapid-fire test handler, or running in production with tasks that spend 1+ minutes doing actual work (lots of RPCs- datastore, URLfetch, memcache).

Hope this helps.
Dave

Robert Kluin

Feb 1, 2012, 1:53:23 AM
to google-a...@googlegroups.com
Hey Dave,
So do you see pending latencies in the request log headers?

What if you further increase the work done within each task so that
it takes, for example, 5 minutes per task?

Have you considered using pull-queues since you're running all this
on backends?


Robert


Dave Loomer

Feb 1, 2012, 2:25:27 PM
to google-a...@googlegroups.com
Here are logs from three consecutive task executions over the past weekend, with only identifying information removed. You'll see that each task completes in a few milliseconds, but the executions are 20 seconds apart (remember: I've already checked my queue configurations, nothing else is running on this backend, and I later solved the problem by setting countdown=1 when adding the task). I don't see any pending latency mentioned.

0.1.0.2 - - [27/Jan/2012:18:33:20 -0800] 200 124 ms=10 cpu_ms=47 api_cpu_ms=0 cpm_usd=0.000060 queue_name=overnight-tasks task_name=15804554889304913211 instance=0
0.1.0.2 - - [27/Jan/2012:18:33:00 -0800] 200 124 ms=11 cpu_ms=0 api_cpu_ms=0 cpm_usd=0.000060 queue_name=overnight-tasks task_name=15804554889304912461 instance=0
0.1.0.2 - - [27/Jan/2012:18:32:41 -0800] 200 124 ms=26 cpu_ms=0 api_cpu_ms=0 cpm_usd=0.000060 queue_name=overnight-tasks task_name=4499136807998063691 instance=0


The 20 seconds seems to happen regardless of length of task. Even though my tasks mostly complete in a couple minutes, I do have cases where they take several minutes, and I don't see a difference. Of course, when a task takes 5-10 minutes to complete, I'm going to notice and care about a 20-second delay much less than when I'm trying to spin through a few tasks in a minute (which is a real-world need for me as well).

When reading up on pull queues a while back, I was a little confused about where I would use them with my own backends. I definitely could see an application for offloading work to an AWS Linux instance. But in either case, could you explain why it might help?

I saw you mention in a separate thread how M/S can perform differently from HRD even in cases where one wouldn't expect to see a difference. When I get around to it I'm going to create a tiny HRD app and run the same tests through that.

I also wonder if M/S could be responsible for frequent latencies in my admin console. Those have gotten more frequent and annoying the past couple of months ...

Nicholas Verne

Feb 1, 2012, 4:34:49 PM
to google-a...@googlegroups.com
Dave,

Two questions will help us clarify what you're observing.

Are your tasks added transactionally?

You mention the pattern of executing a task then enqueuing another. In
the one backend case, is there typically no more than one task in the
queue?

Nick Verne


Dave Loomer

Feb 1, 2012, 5:18:09 PM
to google-a...@googlegroups.com
The tasks are not run transactionally, and in my testing the task is the only one in queue. In fact, I also ran the tests *somewhat* successfully on a separate app where this was the only code running. I say somewhat because, as I stated in my original post, the 20-second delays didn't happen every time; more like once every four requests. But when they did occur, it was almost precisely 20 seconds.

Below is the entirety of my handler:

class AdhocHandler2(webapp.RequestHandler):
    def get(self):
        task_create_time = datetime.datetime.now()
        import string
        task_create_time_s = task_create_time.strftime("%Y-%m-%d %H:%M:%S") + "." + string.zfill(task_create_time.microsecond, 6)
        t = taskqueue.Task(url='/batch/adhoc', params={'task_create_time': task_create_time_s}, target='overnight-external-data',
                           method='GET')
        t.add(queue_name='overnight-tasks')

Dave Loomer

Feb 1, 2012, 5:20:05 PM
to google-a...@googlegroups.com
To do one better, here is the entirety of the Python code:

#!/usr/bin/env python
#
# Copyright 2007 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import wsgiref.handlers
import datetime, time

from google.appengine.ext import webapp, db
from google.appengine.ext.webapp.util import run_wsgi_app 
from google.appengine.api import taskqueue

class AdhocHandler2(webapp.RequestHandler):
    def get(self):
        task_create_time = datetime.datetime.now()
        import string
        task_create_time_s = task_create_time.strftime("%Y-%m-%d %H:%M:%S") + "." + string.zfill(task_create_time.microsecond, 6)
        t = taskqueue.Task(url='/batch/adhoc', params={'task_create_time': task_create_time_s}, target='overnight-external-data',
                           method='GET')
        t.add(queue_name='overnight-tasks')

def main():
    application = webapp.WSGIApplication([
        ('/batch/adhoc2', AdhocHandler2),
    ], debug=True)
    run_wsgi_app(application)

if __name__ == '__main__':
    main()

Dave Loomer

Feb 1, 2012, 5:21:57 PM
to google-a...@googlegroups.com
And here is backends.yaml:

backends:
- name: overnight-external-data
  class: B1
  options: dynamic
  instances: 1

and queue.yaml:

queue:
- name: overnight-tasks
  rate: 50/s
  bucket_size: 50
  retry_parameters:
    max_backoff_seconds: 1800

Dave Loomer

Feb 1, 2012, 5:24:22 PM
to google-a...@googlegroups.com
Finally, it's probably an important clue that when I explicitly set countdown=1 when creating the task, the delay in executing the task is always almost exactly 1.5 seconds (not sure why it's not 1.0). If I don't set a countdown value, it's almost as if I had set countdown=20. Except that the ETA in the Task Queue Details page in the admin console reflects the approximate time the task was created, rather than adding 20 seconds to that.

Robert Kluin

Feb 2, 2012, 12:03:35 AM
to google-a...@googlegroups.com
Hey Dave,
Hopefully Nick will be able to offer some insight into the cause of
your issues. I'd guess it is something related to having very few
tasks (one) in the queue, and it not getting scheduled rapidly.

In your case, you could use pull queues to immediately fetch the
next task when finished with a task. Or even to fetch multiple tasks
and do the work in parallel. Basically you'd have a backend that ran
a loop (possibly initiated via a push task) that would lease a task,
or tasks, from the pull queue, do the work, delete those tasks, then
repeat from the lease stage. The cool thing is that if you're, for
example, using URL Fetch to pull data this might let you do it in
parallel without increasing your costs much (if any).
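That lease/work/delete loop can be sketched roughly like this (hypothetical function names; written against a generic queue object so it's easy to test, but the real google.appengine.api.taskqueue.Queue exposes the same lease_tasks/delete_tasks calls):

```python
def drain_pull_queue(queue, handle, lease_seconds=60, batch_size=10):
    """Lease a batch of tasks, process them, delete them, and repeat
    until the queue is empty. Returns the number of tasks processed."""
    processed = 0
    while True:
        tasks = queue.lease_tasks(lease_seconds, batch_size)
        if not tasks:
            return processed
        for task in tasks:
            handle(task)           # do the actual work (URL Fetch, etc.)
        queue.delete_tasks(tasks)  # delete only after the work succeeded
        processed += len(tasks)
```

The backend picks up the next batch immediately instead of waiting for the push-queue scheduler to dispatch each task to it.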

Robert


Carter

Feb 5, 2012, 12:24:14 PM
to Google App Engine
We regularly but erratically see 10-20 minute delays in running push
queue tasks.
The tasks sit in the queue with ETA as high as 20 minutes *ago*
without any errors or retries.

(The problem seems unrelated to queue settings, since our Maximum Rate,
Enforced Rate, and Maximum Concurrent all far exceed the queue's
throughput at the time of the delays.)

Any tips or clues on how to prevent this while still using push queues
without backends?


Dave Loomer

Feb 5, 2012, 12:30:55 PM
to Google App Engine
In my case, since I was getting the 20-second delay almost 100% of the
time, setting countdown=1 was the answer. If you only see it happen
every 20 or more requests, then of course it won't help.

In my case I also run all tasks on the backend. They're slightly more
expensive per hour than frontends (due merely to the lower number of
free hours) but in my case I more than make up for it with the fact
that I have full control on the number of requests that will spin up,
and I need to be able to control that number separately for tasks vs.
users hitting my site.

Dave Loomer

Feb 5, 2012, 12:32:50 PM
to Google App Engine
> that I have full control on the number of requests that will spin up,
err, number of instances that will spin up, rather ...

stevep

Feb 5, 2012, 5:27:52 PM
to Google App Engine
Carter wrote: We regularly but erratically see 10-20 minute delays in
running push queue tasks.

Been a burr under the saddle forever. What I really don't understand
-- assuming GAE engineers never see the benefit of providing at least
one priority/reliability queue -- is why the heck there is never any
explanation about how tasks get scheduled, and why these weird delays
happen. It is either: 1) If we told you we would have to shoot you, or
2) We can't see the benefit of you understanding this.

-stevep

Nicholas Verne

Feb 5, 2012, 6:59:48 PM
to google-a...@googlegroups.com
We would have no need to shoot anyone.

However, the explanations quickly become obsolete. They enter the
folklore in the form that was current at the time and become
entrenched as incorrect information when the implementations have
changed.

Task Queues use best effort scheduling. They're not real time all the
time, although when our best efforts are running smoothly they can
appear real time. For scheduling, the task eta marks the earliest time
at which the task can run. We can't guarantee that a task WILL run at
that time.

Steve, we're interested to know about the 10-20 minute delays you've
seen. Can you tell us the app id, queue, and whether the tasks were
added transactionally? An example from your logs would be very
helpful.

Nick Verne

Dave Loomer

Feb 5, 2012, 7:05:30 PM
to Google App Engine
Since I'm the OP, you may be interested in my app ID as well: mn-live. I
provided some logs a few posts back, and some exhaustive details at the
beginning of the thread.

However, you won't see this issue popping up anymore on my app since I
"solved" it by setting countdown=1 a week ago. Since then, tasks start
very reliably after a 1.5 second delay. If I remove the countdown
parameter, then it returns to 20 seconds (+/- .01) pretty reliably.

Carter Maslan

Feb 5, 2012, 7:27:43 PM
to google-a...@googlegroups.com
Nicholas - 

For our examples of the 10-20 minute delay:
app_id=s~camiologger
queue=image-label
(but several other queues experience the same long delays sometimes: content-process, counter-update, etc...)

The tasks were not added with transactions; just this code:
Queue queueP = QueueFactory.getQueue(ServerUtils.QUEUE_NAME_IMAGE_LABEL_PUSH);
TaskHandle th = queueP.add(withUrl(ServerUtils.PATH_ADMIN_MOTION_LABEL)
.param("key", contentKeyString)
.method(TaskOptions.Method.GET));

Let me know if you need more info.  We noticed this in the last few weeks.
Carter
[Attachment: Task Queue Details - WhySlow.png]

Robert Kluin

Feb 6, 2012, 12:17:34 AM
to google-a...@googlegroups.com
That's interesting. Did the queue sit there for a long time not
running anything, or running tasks very slowly? Are the tasks in that
queue generally long-running?

I _very_ infrequently bump into that type of issue, but I periodically
will see one queue slow down for a while. It *seems* to happen far
more often in queues with slower tasks, but I don't have any recent
empirical evidence of that. And I *think* I've been told that should
not be the case.


Robert

Carter Maslan

Feb 6, 2012, 1:38:05 AM
to google-a...@googlegroups.com, google-a...@googlegroups.com
I just looked at the last 80 that ran.
That queue's tasks run in between 19 ms and 2486 ms, with most of them around 28 ms. The variability relates to the number of quadtree searches needed, but other queues that experience similar delays have running times without much variation (e.g. predictable counter updates).
When the delays happen, there just aren't many tasks in the queue at all.
It appears that the delayed tasks are just sitting in the queue idle.

Robert Kluin

Feb 6, 2012, 1:56:41 AM
to google-a...@googlegroups.com
Does the app get a lot of front-end traffic or is it sitting idle when
the delays occur?

Carter Maslan

Feb 6, 2012, 2:02:12 AM
to google-a...@googlegroups.com

The app always has front end traffic and I have noticed the delay at times with high traffic.

stevep

Feb 6, 2012, 10:08:53 AM
to Google App Engine
Thanks Nick. Been on the forums for years, and that's the first I
remember having a GAE engineer delve into TQs at this level. (BTW
Carter replied already about his 20 minute starts).

Maybe over time TQ scheduling might expand to enhance multi-instance
optimization, and provide a means for a priority queue (with
constraints such as volume capacity).

-stevep