Very high task queue delays

105 views
Skip to first unread message

dloomer

unread,
Jan 3, 2011, 10:03:25 AM1/3/11
to google-a...@googlegroups.com
To the point: tasks added with no countdown or ETA specified (i.e. should run immediately) are taking over a minute to execute.  I have about 8 tasks in queue at one time, and I've set the rate and bucket size pretty high during troubleshooting (currently 50/s and 50, respectively). Tasks generally take less than 30 seconds to complete, but 2 minutes or more can elapse before an added task begins executing.  I see no errors in the logs and the task queue doesn't indicate any retries are occurring.

I should notice that I'm also noticing some latency in my app as well (AJAX calls that used to be instantaneous now take 5+ seconds), although I'm not sure if it's related and I haven't been able to gather any stats there to determine if what I'm seeing is real and not in my code.

This issue just started popping up over the weekend.  I haven't been able to troubleshoot it much until now, and I think I have sufficient information to share; however, I'm still surprised that after three days I still seem to be the first to report this.  That would suggest I'm doing something wrong, although I have no idea what it could be.

To track the timing, I put the following code at the very start of my get() method:

if "X-AppEngine-TaskName" in self.request.headers:

logging.info("Start task %s" % self.request.headers["X-AppEngine-TaskName"])


And then the very last piece of code executed, after the task has processed, is:

t.add(queue_name = 'overnight-tasks')

if "X-AppEngine-TaskName" in self.request.headers:

logging.info("Completed task %s; added task %s to queue" % (self.request.headers["X-AppEngine-TaskName"], t.name))



In between the two, my task code grabs the next entity from the datastore using a cursor, generally makes 1-2 calls to URLFetch, and gets and inserts a small amount of data. Nothing that to me seems too unusual.  None of my code changed around the time that I started seeing this behavior.

Some sample log output (I dummied up the task names for readability):

01-03 06:35AM 43.266
Start task     1

01-03 06:36AM 03.913
Completed task 1
added task     2

1-03 06:37AM 14.250
Start task     2

01-03 06:37AM 21.456
Completed task 2
added task     3

01-03 06:39AM 28.472
Start task     3


Here is a current snapshot of my queue.  Notice the ETAs are generally a few minutes ago, even though the tasks themselves take less than a minute to execute.  I'm not anywhere close to exceeding any quotas.

NameETACreation TimeMethod/URLRetriesPayloadHeadersActions
165704362670151710172011/01/03 06:58:56
0:02:41 ago
2011/01/03 06:58:56
0:02:42 ago
POST /batch/twitter_lists188 bytes
5 headers
165704362670151712902011/01/03 06:52:05
0:09:32 ago
2011/01/03 06:52:05
0:09:33 ago
GET /batch/fix_gcal?section=all&venue=1&beginning_date=2011-01-03&allshows=False&current_date=2011-07-27&ending_date=2011-12-31&fetch=206
4 headers
93770868067078035432011/01/03 06:56:59
0:04:39 ago
2011/01/03 06:56:59
0:04:39 ago
GET /batch/fix_gcal?section=all&venue=1&beginning_date=2011-01-03&allshows=False&current_date=2011-07-12&ending_date=2011-12-31&fetch=191
4 headers
93770868067078052632011/01/03 06:54:04
0:07:33 ago
2011/01/03 06:54:04
0:07:34 ago
GET /batch/fix_gcal?section=all&venue=1&beginning_date=2011-01-03&allshows=False&current_date=2011-07-16&ending_date=2011-12-31&fetch=195
4 headers
93770868067078061742011/01/03 06:52:35
0:09:02 ago
2011/01/03 06:52:35
0:09:03 ago
GET /batch/performer_audio_count?cursor=E9oBTW1uLWxpdmUAUGVyZm9ybWVyAGhhc19hdWRpbwCQgYBtbi1saXZlAIuSUGVyZm9ybWVyAJj6CVScjICLklBlcmZvcm1lcgCY-glUnIyA4AEAFA%3D%3D
4 headers
performer-external-data-recheck-2011-01-03-10-14-09-E9oBTW1uLWxpdmUAUGVyZm9ybWVyAIjjkoiDgACSiIGAAIBtbi1saXZlAIuSUGVyZm9ybWVyAJj6DLVsjICLklBlcmZvcm1lcgCY-gy1bIyA4AEAFA---False-False-32011/01/03 06:53:12
0:08:25 ago
2011/01/03 06:50:17
0:11:21 ago
GET /batch/performer_external_data?full_scan=False&batch_group=3&cursor=E9oBTW1uLWxpdmUAUGVyZm9ybWVyAIjjkoiDgACSiIGAAIBtbi1saXZlAIuSUGVyZm9ybWVyAJj6DLVsjICLklBlcmZvcm1lcgCY-gy1bIyA4AEAFA%3D%3D&start=2011-01-03-10-14-09&processed=73&phase=recheck&twitter_lists_only=False1
4 headers
performer-stats-update-2011-01-03-09-22-13--E9oBQG1uLWxpdmUAUGVyZm9ybWVyAG1uLWxpdmUAi5JQZXJmb3JtZXIAmPoKydaMgIuSUGVyZm9ybWVyAJj6CsnWjIDgAQAU-12011/01/03 06:53:10
0:08:27 ago
2011/01/03 06:53:10
0:08:28 ago
GET /batch/performer_stats_update?cursor=E9oBQG1uLWxpdmUAUGVyZm9ybWVyAG1uLWxpdmUAi5JQZXJmb3JtZXIAmPoKzFmMgIuSUGVyZm9ybWVyAJj6CsxZjIDgAQAU&start=2011-01-03-09-22-13&tag=&pass=1
4 headers

Any ideas?

dloomer

unread,
Jan 3, 2011, 10:05:37 AM1/3/11
to google-a...@googlegroups.com
Very important clarification to the first sentence, to eliminate confusion:

tasks added with no countdown or ETA specified (i.e. should run immediately) are taking over a minute to start executing after they were added

dloomer

unread,
Jan 3, 2011, 10:10:00 AM1/3/11
to google-a...@googlegroups.com
Also, figure I should have included the full code for creating the task.

t = taskqueue.Task(url='/batch/performer_external_data', params={'start': start_datetime_s, 'phase': phase, 'cursor': next_cursor, 'full_scan': full_scan, 'twitter_lists_only': twitter_lists_only, 'batch_group': batch_group_number, 'processed': total_processed},

name=task_name,

method='GET')

try:

t.add(queue_name = 'overnight-tasks')

if "X-AppEngine-TaskName" in self.request.headers:

logging.info("Completed task %s; added task %s to queue" % (self.request.headers["X-AppEngine-TaskName"], t.name))


My app name is mn-live, BTW.

vlad

unread,
Jan 3, 2011, 5:14:21 PM1/3/11
to google-a...@googlegroups.com
I observed such behavior too. It is reproducible only on production (devserver is fine). I believe it to be a bug or optimization  in TaskQueue code. Basically GAE appear not to check back the queue for
ready-to-execute tasks after frontend enqueues a new task with 0s or very short delay.

That is based on a couple of observations:

1. If I schedule my tasks with 5+ second delay I do not see that problem.
2. I enqueued 100 dummy tasks to fire every second between now and 100s in the future and ran my tests. Problem was NOT observed

GAE Team please comment

Eli Jones

unread,
Jan 3, 2011, 5:29:26 PM1/3/11
to google-a...@googlegroups.com
I have a chained task that runs throughout the day on a strict schedule.

The first chained task (let's call it "PrimaryTask") runs at 9:30 AM and then after it has successfully completed its work, it schedules two new tasks:

1.  Another "PrimaryTask" to run at the next 5 minute mark (9:35 AM, 9:40 AM, etc)
2. A new task to be executed immediately (no defined delay or eta).. I'll refer to this one as "SecondaryTask"

The SecondaryTask always ran immediately.. no delay with any of the 80 times it was scheduled throughout the day.

The PrimaryTask also ran exactly on time throughout the day (though, I see that my SecondaryTask behaviour is the most germane in this situation.. since the concern is about when tasks with no delay actually end up executing.)

So, in the least, this was not an issue across all of AppEngine today.


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

dloomer

unread,
Jan 3, 2011, 5:41:24 PM1/3/11
to google-a...@googlegroups.com
I'm not sure if this applies in my case or not.  After setting my task with a countdown of 5 seconds, it's still delayed; now I see a delay of 20-40 seconds instead of the 1-2 minutes previously, but the delay seemed to have decreased before I made that change.

One other observation is that the "Run in last minute" number is never correct; it's always high.  The application log confirms that about 2-4 tasks in the particular queue execute in a given minute; however, the "Run in last minute" number is always around 12 for the queue.

vlad

unread,
Jan 3, 2011, 5:44:43 PM1/3/11
to google-a...@googlegroups.com
@Eli   Are running a single task chain or there could be many chains executing independently?
The reason that is important is GAE opportunistically uses *any* ready-to-execute task in a  given queue to check whether other are ready. What that leads to is real schedule will vary depending on how many tasks are scheduled.

Generally I have to agree that not every 0s delay tasks suffers huge delays. I wasn't able to discern a pattern so far.


dloomer

unread,
Jan 3, 2011, 5:46:08 PM1/3/11
to google-a...@googlegroups.com
And to add to that, I'm still seeing latency in my application in general.  Appstats show that every request completes efficiently from a server standpoint; it's just a matter of getting between the time of request from the client and the time that execution begins on the server which seems to show a delay of a few seconds.  I haven't completely eliminated my local network as the culprit, although I've seen this from 2-3 different networks which makes it seem less likely it's a client side problem.

Traceroute shows this:

traceroute to ghs.l.google.com (74.125.95.121), 64 hops max, 40 byte packets
 1  10.0.0.2 (10.0.0.2)  84.064 ms  1.216 ms  1.175 ms
 2  g0-1.usi-ar03-mtka.usinternet.com (216.17.36.17)  42.084 ms  42.159 ms  41.570 ms
 3  v103.usi-cr02-mpls.usinternet.com (216.17.36.25)  42.348 ms  42.507 ms  42.848 ms
 4  gi1-1.ccr01.msp01.atlas.cogentco.com (38.104.196.13)  42.636 ms  44.084 ms  43.163 ms
 5  te4-4.ccr01.ord04.atlas.cogentco.com (154.54.3.61)  53.034 ms  53.563 ms  81.625 ms
 6  te0-1-0-1.mpd21.ord01.atlas.cogentco.com (154.54.24.181)  53.926 ms
    te0-1-0-1.mpd22.ord01.atlas.cogentco.com (154.54.24.177)  53.009 ms  53.216 ms
 7  te0-0-0-3.ccr21.ord03.atlas.cogentco.com (154.54.6.206)  53.509 ms
    te0-0-0-7.ccr21.ord03.atlas.cogentco.com (154.54.25.70)  53.202 ms
    te0-1-0-1.ccr21.ord03.atlas.cogentco.com (154.54.5.18)  53.710 ms
 8  google.ord03.atlas.cogentco.com (154.54.12.130)  684.023 ms  551.837 ms  52.920 ms
 9  209.85.254.120 (209.85.254.120)  53.258 ms
    209.85.254.128 (209.85.254.128)  53.335 ms
    209.85.254.120 (209.85.254.120)  53.159 ms
10  72.14.232.141 (72.14.232.141)  64.380 ms
    209.85.241.22 (209.85.241.22)  64.927 ms  69.109 ms
11  209.85.241.35 (209.85.241.35)  65.610 ms  64.102 ms
    209.85.241.27 (209.85.241.27)  69.762 ms
12  209.85.240.45 (209.85.240.45)  66.217 ms  74.654 ms
    72.14.239.193 (72.14.239.193)  76.945 ms
13  iw-in-f121.1e100.net (74.125.95.121)  64.477 ms  64.332 ms  64.265 ms

Hop #8 seems a bit suspect ...

vlad

unread,
Jan 3, 2011, 5:49:03 PM1/3/11
to google-a...@googlegroups.com
@dloomer   Try creating dummy tasks and see if problem persists. May have to tweak the code below if you are not using default queue.

class TaskCreateDispatcher(webapp.RequestHandler):

  def get(self, *args):
    ntask = min(100, int(args[0]) )
    tasks = []
    for i in range(ntask):
        tasks.append( Task(url='/task/noop', countdown=i) )
    Queue().add( tasks )
    self.response.headers['Content-Type'] = 'text/plain'
    self.response.out.write("Created %d tasks\n" % ntask)


class TaskNoopDispatcher(webapp.RequestHandler):

  def post(self, *args):
    self.response.headers['Content-Type'] = 'text/plain'
    self.response.out.write("Task\n")

Eli Jones

unread,
Jan 3, 2011, 6:12:58 PM1/3/11
to google-a...@googlegroups.com
There is one named task chain.

It runs every 5 minutes.. does it's work, and then schedules itself to run again at the next 5 minute mark (not in 5 minutes from when it ends.. but in how ever many seconds there are until the next 5 minute mark).

This is the only thing my app does during the day.

So I can see a log entry at 9:30 AM, 9:35 AM, ... all the way until 1:05 PM.

There is a secondary task that this chained task fires off after it finishes each run.  That task is set to run immediately, and it only runs once.. and does not chain itself.

This makes it very easy for me to detect if/when my particular app experiences any weird taskqueue delays.  Since.. the main task should run at each 5 minute mark.. if it doesn't, I know there is an appengine delay there.

The secondary task should start exactly after the main chained task finishes its work.. so i can easily detect delays there as well.

When everything is running smoothly, I can look at my logs and just see pairs of requests throughout the day, like so:

01-03 12:30PM 01.478 /oneOffTask/go 200 344ms 381cpu_ms 241api_cpu_ms 0kb AppEngine-Google;
01-03 12:30PM 00.694 /cron/chainedTask 200 794ms 665cpu_ms 408api_cpu_ms 0kb AppEngine-Google;
01-03 12:25PM 01.690 /oneOffTask/go 200 348ms 405cpu_ms 241api_cpu_ms 0kb AppEngine-Google;
01-03 12:25PM 00.323 /cron/chainedTask 200 1380ms 774cpu_ms 408api_cpu_ms 0kb AppEngine-Google;

So.. by glancing at the log, I can detect any sudden issues (since the app only does this scheduled back-end processing there are no other log entries).. like if any of these log entries showed times of 12:27:11 or 12:33:28.. etc.. I'd need to look closer at the logs to see why the start time was off.

For my app, nothing like that appears to have happened today.

Now, if you had a heavily used queue.. and it was hitting its tasks per second limit.. that would give you a delay.


vlad

unread,
Jan 3, 2011, 6:21:51 PM1/3/11
to google-a...@googlegroups.com
Are your tasks transactional or not?

Eli Jones

unread,
Jan 3, 2011, 6:24:42 PM1/3/11
to google-a...@googlegroups.com
No transactions.. I avoid them like the plague (for db puts/gets and for taskqueue adds).  Just named tasks and idempotence.. if anything fails.. they just do everything again.

On Mon, Jan 3, 2011 at 6:21 PM, vlad <vlad.tr...@gmail.com> wrote:
Are your tasks transactional or not?

Robert Kluin

unread,
Jan 3, 2011, 6:28:51 PM1/3/11
to google-a...@googlegroups.com
I'm also not seeing any noticeable delays. Here is a sample:

01-03 02:08PM 52.714 /_ah/tasks/t/a/work 200 206ms 105cpu_ms 81api_cpu_ms
01-03 02:08PM 52.708 /_ah/tasks/t/a/work 200 163ms 128cpu_ms 81api_cpu_ms
01-03 02:08PM 52.636 /_ah/tasks/t/a/fan 200 126ms 23cpu_ms # Task
that inserts two (non-transactional) work tasks with no countdown /
eta.
01-03 02:08PM 52.480 /s/t 200 165ms 345cpu_ms 229api_cpu_ms # User
request that transactionally inserts the fan task with no countdown /
eta.

Robert


On Mon, Jan 3, 2011 at 18:21, vlad <vlad.tr...@gmail.com> wrote:
> Are your tasks transactional or not?
>

A. Stevko

unread,
Jan 3, 2011, 6:33:43 PM1/3/11
to google-a...@googlegroups.com
Could the delays have anything to do with the bucket-size and execution-rate?
--
-- A. Stevko
===========
"If everything seems under control, you're just not going fast enough." M. Andretti





vlad

unread,
Jan 3, 2011, 6:35:38 PM1/3/11
to google-a...@googlegroups.com
Btw, I do not believe the issue is about today. I complained last week
https://groups.google.com/d/topic/google-appengine/O7WlBW6Oljc/discussion

dloomer

unread,
Jan 4, 2011, 7:45:06 AM1/4/11
to google-a...@googlegroups.com
In my case, I first noticed the issue while I was at the default 5 bucket tokens + 5/s rate.  I've since increased to 50 tokens + 50/s rate and seen no difference.  However, both are still higher than I should need since I only have a few tasks in the queue at once and each takes at least several seconds to complete.

dloomer

unread,
Jan 4, 2011, 7:49:33 AM1/4/11
to google-a...@googlegroups.com
Hi Vlad -- I didn't notice your post at the time since I wasn't having any issues.  However, mine started occurring just a couple days after yours (I think at or very near Jan 1).

I've been a GAE user for 2-3 years.  I've used task queue for over a year.  I didn't just start doing something different with queues; in fact I've been running the exact same tasks for several months.  This just seemed to start happening out of the blue 5 days ago and hasn't let up since.  Hopefully someone from the Google team can look into our specific issues soon.

It seems like on and off I have strange issues that affect only a small number of users.  Those are the worst, because it is much harder to get Google's attention even if (as has been the case with my issues) it does indeed end up being a problem at Google's end and not with our code.

vlad

unread,
Jan 4, 2011, 4:13:26 PM1/4/11
to google-a...@googlegroups.com
Yes the issue seems to have been recently introduced. Let's keep bumping this thread until GAE team has a chance to take a look. Yesterday it was reproducible very often for me.

vlad

unread,
Jan 4, 2011, 4:18:41 PM1/4/11
to google-a...@googlegroups.com
Sorry, I hit post too fast. I do not think this issue affect only a few developers. It is likely affecting all but most will not have energy to track it down. The hardest part about TaskQueue is its "best-effort", non-deterministic nature. It devalues all testing and makes chasing bugs very hard because it makes distilling wrong behavior into a reproducible case next to impossible. The fact that dev-server behaves very differently from production does not help either.

Dave Loomer

unread,
Jan 5, 2011, 8:05:23 AM1/5/11
to google-a...@googlegroups.com
As of last evening my issues have both gone away, both the task queue issue and the general latency in my app.  What are you seeing Vlad?

--

vlad

unread,
Jan 5, 2011, 10:08:05 PM1/5/11
to google-a...@googlegroups.com
GAE Team please take a look at this problem! Tasks are routinely running past its ETA by up to a minute. It is a very serious problem for any user facing application.


Maximum Rate Bucket Size Oldest Task Tasks in Queue Run in Last Minute
20/s 100.0 2011/01/05 19:01:46
(0:00:37 ago)
3 2
‹ Prev 10 Next 10 ›
Order by: Task Name | Task ETA
Name ETA Creation Time Method/URL Retries Payload Headers Actions
11012267307712842821 2011/01/05 19:02:34
0:00:10 from now
2011/01/05 18:59:34
0:02:49 ago
POST /_ah/queue/deferred
426 bytes
6 headers
X-AppEngine-Default-Namespace:zodiacholdem.com
X-AppEngine-Current-Namespace:
User-Agent:AppEngine-Google; (+ http://code.google.com/appengine)
Host: www.zodiacholdem.com
Referer: http://www.zodiacholdem.com/_ah/queue/deferred
Content-Type:application/octet-stream
351776224552482183 2011/01/05 19:01:50
0:00:33 ago
2011/01/05 19:01:45
0:00:38 ago
POST /_ah/queue/deferred
445 bytes
6 headers
X-AppEngine-Default-Namespace:zodiacholdem.com
X-AppEngine-Current-Namespace:
User-Agent:AppEngine-Google; (+ http://code.google.com/appengine)
Host: www.zodiacholdem.com
Referer: http://www.zodiacholdem.com/_ah/queue/deferred
Content-Type:application/octet-stream
351776224552484483 2011/01/05 19:01:46
0:00:37 ago
2011/01/05 19:01:45
0:00:38 ago
POST /_ah/queue/deferred
440 bytes
6 headers
X-AppEngine-Default-Namespace:zodiacholdem.com
X-AppEngine-Current-Namespace:
User-Agent:AppEngine-Google; (+ http://code.google.com/appengine)
Host: www.zodiacholdem.com
Referer: http://www.zodiacholdem.com/_ah/queue/deferred
Content-Type:application/octet-stream

vlad

unread,
Jan 7, 2011, 4:05:05 PM1/7/11
to google-a...@googlegroups.com
Bump
Reply all
Reply to author
Forward
0 new messages