Very high task queue delays

dloomer

unread,

Jan 3, 2011, 10:03:25 AM1/3/11

to google-a...@googlegroups.com

To the point: tasks added with no countdown or ETA specified (i.e. should run immediately) are taking over a minute to execute. I have about 8 tasks in queue at one time, and I've set the rate and bucket size pretty high during troubleshooting (currently 50/s and 50, respectively). Tasks generally take less than 30 seconds to complete, but 2 minutes or more can elapse before an added task begins executing. I see no errors in the logs and the task queue doesn't indicate any retries are occurring.

I should notice that I'm also noticing some latency in my app as well (AJAX calls that used to be instantaneous now take 5+ seconds), although I'm not sure if it's related and I haven't been able to gather any stats there to determine if what I'm seeing is real and not in my code.

This issue just started popping up over the weekend. I haven't been able to troubleshoot it much until now, and I think I have sufficient information to share; however, I'm still surprised that after three days I still seem to be the first to report this. That would suggest I'm doing something wrong, although I have no idea what it could be.

To track the timing, I put the following code at the very start of my get() method:

if "X-AppEngine-TaskName" in self.request.headers:

logging.info("Start task %s" % self.request.headers["X-AppEngine-TaskName"])

And then the very last piece of code executed, after the task has processed, is:

t.add(queue_name = 'overnight-tasks')

if "X-AppEngine-TaskName" in self.request.headers:

logging.info("Completed task %s; added task %s to queue" % (self.request.headers["X-AppEngine-TaskName"], t.name))

In between the two, my task code grabs the next entity from the datastore using a cursor, generally makes 1-2 calls to URLFetch, and gets and inserts a small amount of data. Nothing that to me seems too unusual. None of my code changed around the time that I started seeing this behavior.

Some sample log output (I dummied up the task names for readability):

01-03 06:35AM 43.266

Start task 1

01-03 06:36AM 03.913

Completed task 1

added task 2

1-03 06:37AM 14.250

Start task 2

01-03 06:37AM 21.456

Completed task 2

added task 3

01-03 06:39AM 28.472

Start task 3

Here is a current snapshot of my queue. Notice the ETAs are generally a few minutes ago, even though the tasks themselves take less than a minute to execute. I'm not anywhere close to exceeding any quotas.

Name	ETA	Creation Time	Method/URL	Retries	Payload	Headers
16570436267015171017	2011/01/03 06:58:56 0:02:41 ago	2011/01/03 06:58:56 0:02:42 ago	POST /batch/twitter_lists		188 bytes	5 headers
16570436267015171290	2011/01/03 06:52:05 0:09:32 ago	2011/01/03 06:52:05 0:09:33 ago	GET /batch/fix_gcal?section=all&venue=1&beginning_date=2011-01-03&allshows=False&current_date=2011-07-27&ending_date=2011-12-31&fetch=206			4 headers
9377086806707803543	2011/01/03 06:56:59 0:04:39 ago	2011/01/03 06:56:59 0:04:39 ago	GET /batch/fix_gcal?section=all&venue=1&beginning_date=2011-01-03&allshows=False&current_date=2011-07-12&ending_date=2011-12-31&fetch=191			4 headers
9377086806707805263	2011/01/03 06:54:04 0:07:33 ago	2011/01/03 06:54:04 0:07:34 ago	GET /batch/fix_gcal?section=all&venue=1&beginning_date=2011-01-03&allshows=False&current_date=2011-07-16&ending_date=2011-12-31&fetch=195			4 headers
9377086806707806174	2011/01/03 06:52:35 0:09:02 ago	2011/01/03 06:52:35 0:09:03 ago	GET /batch/performer_audio_count?cursor=E9oBTW1uLWxpdmUAUGVyZm9ybWVyAGhhc19hdWRpbwCQgYBtbi1saXZlAIuSUGVyZm9ybWVyAJj6CVScjICLklBlcmZvcm1lcgCY-glUnIyA4AEAFA%3D%3D			4 headers
performer-external-data-recheck-2011-01-03-10-14-09-E9oBTW1uLWxpdmUAUGVyZm9ybWVyAIjjkoiDgACSiIGAAIBtbi1saXZlAIuSUGVyZm9ybWVyAJj6DLVsjICLklBlcmZvcm1lcgCY-gy1bIyA4AEAFA---False-False-3	2011/01/03 06:53:12 0:08:25 ago	2011/01/03 06:50:17 0:11:21 ago	GET /batch/performer_external_data?full_scan=False&batch_group=3&cursor=E9oBTW1uLWxpdmUAUGVyZm9ybWVyAIjjkoiDgACSiIGAAIBtbi1saXZlAIuSUGVyZm9ybWVyAJj6DLVsjICLklBlcmZvcm1lcgCY-gy1bIyA4AEAFA%3D%3D&start=2011-01-03-10-14-09&processed=73&phase=recheck&twitter_lists_only=False	1		4 headers
performer-stats-update-2011-01-03-09-22-13--E9oBQG1uLWxpdmUAUGVyZm9ybWVyAG1uLWxpdmUAi5JQZXJmb3JtZXIAmPoKydaMgIuSUGVyZm9ybWVyAJj6CsnWjIDgAQAU-1	2011/01/03 06:53:10 0:08:27 ago	2011/01/03 06:53:10 0:08:28 ago	GET /batch/performer_stats_update?cursor=E9oBQG1uLWxpdmUAUGVyZm9ybWVyAG1uLWxpdmUAi5JQZXJmb3JtZXIAmPoKzFmMgIuSUGVyZm9ybWVyAJj6CsxZjIDgAQAU&start=2011-01-03-09-22-13&tag=&pass=1			4 headers

Any ideas?

dloomer

unread,

Jan 3, 2011, 10:05:37 AM1/3/11

to google-a...@googlegroups.com

Very important clarification to the first sentence, to eliminate confusion:

tasks added with no countdown or ETA specified (i.e. should run immediately) are taking over a minute to start executing after they were added.

dloomer

unread,

Jan 3, 2011, 10:10:00 AM1/3/11

to google-a...@googlegroups.com

Also, figure I should have included the full code for creating the task.

t = taskqueue.Task(url='/batch/performer_external_data', params={'start': start_datetime_s, 'phase': phase, 'cursor': next_cursor, 'full_scan': full_scan, 'twitter_lists_only': twitter_lists_only, 'batch_group': batch_group_number, 'processed': total_processed},

name=task_name,

method='GET')

try:

t.add(queue_name = 'overnight-tasks')

if "X-AppEngine-TaskName" in self.request.headers:

logging.info("Completed task %s; added task %s to queue" % (self.request.headers["X-AppEngine-TaskName"], t.name))

My app name is mn-live, BTW.

vlad

unread,

Jan 3, 2011, 5:14:21 PM1/3/11

to google-a...@googlegroups.com

I observed such behavior too. It is reproducible only on production (devserver is fine). I believe it to be a bug or optimization in TaskQueue code. Basically GAE appear not to check back the queue for
ready-to-execute tasks after frontend enqueues a new task with 0s or very short delay.

That is based on a couple of observations:

1. If I schedule my tasks with 5+ second delay I do not see that problem.
2. I enqueued 100 dummy tasks to fire every second between now and 100s in the future and ran my tests. Problem was NOT observed

GAE Team please comment

Eli Jones

unread,

Jan 3, 2011, 5:29:26 PM1/3/11

to google-a...@googlegroups.com

I have a chained task that runs throughout the day on a strict schedule.

The first chained task (let's call it "PrimaryTask") runs at 9:30 AM and then after it has successfully completed its work, it schedules two new tasks:

1. Another "PrimaryTask" to run at the next 5 minute mark (9:35 AM, 9:40 AM, etc)

2. A new task to be executed immediately (no defined delay or eta).. I'll refer to this one as "SecondaryTask"

The SecondaryTask always ran immediately.. no delay with any of the 80 times it was scheduled throughout the day.

The PrimaryTask also ran exactly on time throughout the day (though, I see that my SecondaryTask behaviour is the most germane in this situation.. since the concern is about when tasks with no delay actually end up executing.)

So, in the least, this was not an issue across all of AppEngine today.

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

dloomer

unread,

Jan 3, 2011, 5:41:24 PM1/3/11

to google-a...@googlegroups.com

I'm not sure if this applies in my case or not. After setting my task with a countdown of 5 seconds, it's still delayed; now I see a delay of 20-40 seconds instead of the 1-2 minutes previously, but the delay seemed to have decreased before I made that change.

One other observation is that the "Run in last minute" number is never correct; it's always high. The application log confirms that about 2-4 tasks in the particular queue execute in a given minute; however, the "Run in last minute" number is always around 12 for the queue.

vlad

unread,

Jan 3, 2011, 5:44:43 PM1/3/11

to google-a...@googlegroups.com

@Eli Are running a single task chain or there could be many chains executing independently?
The reason that is important is GAE opportunistically uses *any* ready-to-execute task in a given queue to check whether other are ready. What that leads to is real schedule will vary depending on how many tasks are scheduled.

Generally I have to agree that not every 0s delay tasks suffers huge delays. I wasn't able to discern a pattern so far.

dloomer

unread,

Jan 3, 2011, 5:46:08 PM1/3/11

to google-a...@googlegroups.com

And to add to that, I'm still seeing latency in my application in general. Appstats show that every request completes efficiently from a server standpoint; it's just a matter of getting between the time of request from the client and the time that execution begins on the server which seems to show a delay of a few seconds. I haven't completely eliminated my local network as the culprit, although I've seen this from 2-3 different networks which makes it seem less likely it's a client side problem.

Traceroute shows this:

traceroute to ghs.l.google.com (74.125.95.121), 64 hops max, 40 byte packets

1 10.0.0.2 (10.0.0.2) 84.064 ms 1.216 ms 1.175 ms

2 g0-1.usi-ar03-mtka.usinternet.com (216.17.36.17) 42.084 ms 42.159 ms 41.570 ms

3 v103.usi-cr02-mpls.usinternet.com (216.17.36.25) 42.348 ms 42.507 ms 42.848 ms

4 gi1-1.ccr01.msp01.atlas.cogentco.com (38.104.196.13) 42.636 ms 44.084 ms 43.163 ms

5 te4-4.ccr01.ord04.atlas.cogentco.com (154.54.3.61) 53.034 ms 53.563 ms 81.625 ms

6 te0-1-0-1.mpd21.ord01.atlas.cogentco.com (154.54.24.181) 53.926 ms

te0-1-0-1.mpd22.ord01.atlas.cogentco.com (154.54.24.177) 53.009 ms 53.216 ms

7 te0-0-0-3.ccr21.ord03.atlas.cogentco.com (154.54.6.206) 53.509 ms

te0-0-0-7.ccr21.ord03.atlas.cogentco.com (154.54.25.70) 53.202 ms

te0-1-0-1.ccr21.ord03.atlas.cogentco.com (154.54.5.18) 53.710 ms

8 google.ord03.atlas.cogentco.com (154.54.12.130) 684.023 ms 551.837 ms 52.920 ms

9 209.85.254.120 (209.85.254.120) 53.258 ms

209.85.254.128 (209.85.254.128) 53.335 ms

209.85.254.120 (209.85.254.120) 53.159 ms

10 72.14.232.141 (72.14.232.141) 64.380 ms

209.85.241.22 (209.85.241.22) 64.927 ms 69.109 ms

11 209.85.241.35 (209.85.241.35) 65.610 ms 64.102 ms

209.85.241.27 (209.85.241.27) 69.762 ms

12 209.85.240.45 (209.85.240.45) 66.217 ms 74.654 ms

72.14.239.193 (72.14.239.193) 76.945 ms

13 iw-in-f121.1e100.net (74.125.95.121) 64.477 ms 64.332 ms 64.265 ms

Hop #8 seems a bit suspect ...

vlad

unread,

Jan 3, 2011, 5:49:03 PM1/3/11

to google-a...@googlegroups.com

@dloomer   Try creating dummy tasks and see if problem persists. May have to tweak the code below if you are not using default queue.

class TaskCreateDispatcher(webapp.RequestHandler):

def get(self, *args):
    ntask = min(100, int(args[0]) )
    tasks = []
    for i in range(ntask):
        tasks.append( Task(url='/task/noop', countdown=i) )
    Queue().add( tasks )
    self.response.headers['Content-Type'] = 'text/plain'
    self.response.out.write("Created %d tasks\n" % ntask)

class TaskNoopDispatcher(webapp.RequestHandler):

def post(self, *args):
    self.response.headers['Content-Type'] = 'text/plain'
    self.response.out.write("Task\n")

Eli Jones

unread,

Jan 3, 2011, 6:12:58 PM1/3/11

to google-a...@googlegroups.com

There is one named task chain.

It runs every 5 minutes.. does it's work, and then schedules itself to run again at the next 5 minute mark (not in 5 minutes from when it ends.. but in how ever many seconds there are until the next 5 minute mark).

This is the only thing my app does during the day.

So I can see a log entry at 9:30 AM, 9:35 AM, ... all the way until 1:05 PM.

There is a secondary task that this chained task fires off after it finishes each run. That task is set to run immediately, and it only runs once.. and does not chain itself.

This makes it very easy for me to detect if/when my particular app experiences any weird taskqueue delays. Since.. the main task should run at each 5 minute mark.. if it doesn't, I know there is an appengine delay there.

The secondary task should start exactly after the main chained task finishes its work.. so i can easily detect delays there as well.

When everything is running smoothly, I can look at my logs and just see pairs of requests throughout the day, like so:

01-03 12:30PM 01.478 /oneOffTask/go 200 344ms 381cpu_ms 241api_cpu_ms 0kb AppEngine-Google;

01-03 12:30PM 00.694 /cron/chainedTask 200 794ms 665cpu_ms 408api_cpu_ms 0kb AppEngine-Google;

01-03 12:25PM 01.690 /oneOffTask/go 200 348ms 405cpu_ms 241api_cpu_ms 0kb AppEngine-Google;

01-03 12:25PM 00.323 /cron/chainedTask 200 1380ms 774cpu_ms 408api_cpu_ms 0kb AppEngine-Google;

So.. by glancing at the log, I can detect any sudden issues (since the app only does this scheduled back-end processing there are no other log entries).. like if any of these log entries showed times of 12:27:11 or 12:33:28.. etc.. I'd need to look closer at the logs to see why the start time was off.

For my app, nothing like that appears to have happened today.

Now, if you had a heavily used queue.. and it was hitting its tasks per second limit.. that would give you a delay.

vlad

unread,

Jan 3, 2011, 6:21:51 PM1/3/11

to google-a...@googlegroups.com

Are your tasks transactional or not?

Eli Jones

unread,

Jan 3, 2011, 6:24:42 PM1/3/11

to google-a...@googlegroups.com

No transactions.. I avoid them like the plague (for db puts/gets and for taskqueue adds). Just named tasks and idempotence.. if anything fails.. they just do everything again.

On Mon, Jan 3, 2011 at 6:21 PM, vlad <vlad.tr...@gmail.com> wrote:

Are your tasks transactional or not?

Robert Kluin

unread,

Jan 3, 2011, 6:28:51 PM1/3/11

to google-a...@googlegroups.com

I'm also not seeing any noticeable delays. Here is a sample:

01-03 02:08PM 52.714 /_ah/tasks/t/a/work 200 206ms 105cpu_ms 81api_cpu_ms
01-03 02:08PM 52.708 /_ah/tasks/t/a/work 200 163ms 128cpu_ms 81api_cpu_ms
01-03 02:08PM 52.636 /_ah/tasks/t/a/fan 200 126ms 23cpu_ms # Task
that inserts two (non-transactional) work tasks with no countdown /
eta.
01-03 02:08PM 52.480 /s/t 200 165ms 345cpu_ms 229api_cpu_ms # User
request that transactionally inserts the fan task with no countdown /
eta.

Robert

On Mon, Jan 3, 2011 at 18:21, vlad <vlad.tr...@gmail.com> wrote:
> Are your tasks transactional or not?
>

A. Stevko

unread,

Jan 3, 2011, 6:33:43 PM1/3/11

to google-a...@googlegroups.com

Could the delays have anything to do with the bucket-size and execution-rate?

--
-- A. Stevko

===========

"If everything seems under control, you're just not going fast enough." M. Andretti

vlad

unread,

Jan 3, 2011, 6:35:38 PM1/3/11

to google-a...@googlegroups.com

Btw, I do not believe the issue is about today. I complained last week
https://groups.google.com/d/topic/google-appengine/O7WlBW6Oljc/discussion

dloomer

unread,

Jan 4, 2011, 7:45:06 AM1/4/11

to google-a...@googlegroups.com

In my case, I first noticed the issue while I was at the default 5 bucket tokens + 5/s rate. I've since increased to 50 tokens + 50/s rate and seen no difference. However, both are still higher than I should need since I only have a few tasks in the queue at once and each takes at least several seconds to complete.

dloomer

unread,

Jan 4, 2011, 7:49:33 AM1/4/11

to google-a...@googlegroups.com

Hi Vlad -- I didn't notice your post at the time since I wasn't having any issues. However, mine started occurring just a couple days after yours (I think at or very near Jan 1).

I've been a GAE user for 2-3 years. I've used task queue for over a year. I didn't just start doing something different with queues; in fact I've been running the exact same tasks for several months. This just seemed to start happening out of the blue 5 days ago and hasn't let up since. Hopefully someone from the Google team can look into our specific issues soon.

It seems like on and off I have strange issues that affect only a small number of users. Those are the worst, because it is much harder to get Google's attention even if (as has been the case with my issues) it does indeed end up being a problem at Google's end and not with our code.

vlad

unread,

Jan 4, 2011, 4:13:26 PM1/4/11

to google-a...@googlegroups.com

Yes the issue seems to have been recently introduced. Let's keep bumping this thread until GAE team has a chance to take a look. Yesterday it was reproducible very often for me.

vlad

unread,

Jan 4, 2011, 4:18:41 PM1/4/11

to google-a...@googlegroups.com

Sorry, I hit post too fast. I do not think this issue affect only a few developers. It is likely affecting all but most will not have energy to track it down. The hardest part about TaskQueue is its "best-effort", non-deterministic nature. It devalues all testing and makes chasing bugs very hard because it makes distilling wrong behavior into a reproducible case next to impossible. The fact that dev-server behaves very differently from production does not help either.

Dave Loomer

unread,

Jan 5, 2011, 8:05:23 AM1/5/11

to google-a...@googlegroups.com

As of last evening my issues have both gone away, both the task queue issue and the general latency in my app. What are you seeing Vlad?

--

vlad

unread,

Jan 5, 2011, 10:08:05 PM1/5/11

to google-a...@googlegroups.com

GAE Team please take a look at this problem! Tasks are routinely running past its ETA by up to a minute. It is a very serious problem for any user facing application.

Maximum Rate	Bucket Size	Oldest Task	Tasks in Queue	Run in Last Minute
20/s	100.0	2011/01/05 19:01:46 (0:00:37 ago)	3	2

‹ Prev 10 Next 10 ›

Order by: Task Name | Task ETA

	Name	ETA	Creation Time	Method/URL	Retries	Payload	Headers	Actions

11012267307712842821	2011/01/05 19:02:34 0:00:10 from now	2011/01/05 18:59:34 0:02:49 ago	POST /_ah/queue/deferred	426 bytes	6 headers X-AppEngine-Default-Namespace:zodiacholdem.com X-AppEngine-Current-Namespace: User-Agent:AppEngine-Google; (+ http://code.google.com/appengine) Host: www.zodiacholdem.com Referer: http://www.zodiacholdem.com/_ah/queue/deferred Content-Type:application/octet-stream
351776224552482183	2011/01/05 19:01:50 0:00:33 ago	2011/01/05 19:01:45 0:00:38 ago	POST /_ah/queue/deferred	445 bytes	6 headers X-AppEngine-Default-Namespace:zodiacholdem.com X-AppEngine-Current-Namespace: User-Agent:AppEngine-Google; (+ http://code.google.com/appengine) Host: www.zodiacholdem.com Referer: http://www.zodiacholdem.com/_ah/queue/deferred Content-Type:application/octet-stream
351776224552484483	2011/01/05 19:01:46 0:00:37 ago	2011/01/05 19:01:45 0:00:38 ago	POST /_ah/queue/deferred	440 bytes	6 headers X-AppEngine-Default-Namespace:zodiacholdem.com X-AppEngine-Current-Namespace: User-Agent:AppEngine-Google; (+ http://code.google.com/appengine) Host: www.zodiacholdem.com Referer: http://www.zodiacholdem.com/_ah/queue/deferred Content-Type:application/octet-stream

vlad

unread,

Jan 7, 2011, 4:05:05 PM1/7/11

to google-a...@googlegroups.com

Bump

Reply all

Reply to author

Forward