Pull Queue TransientError flurries

Kaan Soral

Mar 27, 2015, 1:47:32 PM
to google-a...@googlegroups.com
Over the past few hours I've been seeing flurries of TransientErrors: every so often the queue momentarily produces a burst of them.

I'm using tags with the pull queues, and I'm also using multiple queues to reduce/distribute the load.

Is this a temporary issue, or are there hidden limits to pull queue usage that I should adhere to (e.g. 'max. 10 consecutive/simultaneous lease operations')?
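
Roughly, the producer side looks something like this (a minimal sketch of the tagged, multi-queue setup described above, assuming the Python taskqueue API; the queue names and tag values are placeholders):

```python
# Sketch: enqueue tagged pull tasks across several queues to spread load.
from google.appengine.api import taskqueue

PULL_QUEUES = ['pull-queue-1', 'pull-queue-2', 'pull-queue-3']  # placeholder names

def enqueue_tagged(payload, tag, shard):
    # Distribute the load by picking a queue based on the shard number.
    queue = taskqueue.Queue(PULL_QUEUES[shard % len(PULL_QUEUES)])
    # Pull tasks must be created with method='PULL'; the tag lets a
    # consumer later lease related tasks together via lease_tasks_by_tag().
    queue.add(taskqueue.Task(payload=payload, method='PULL', tag=tag))
```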

Kaan Soral

Mar 28, 2015, 4:41:15 AM
to google-a...@googlegroups.com
I'm also experiencing momentary TransactionFailedErrors on ndb.delete_multi_async calls:
"too much contention on these datastore entities. please try again."

These are from crons, so the same operations run every day, yet the issues are rare/random/temporary.
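
A sleep/retry wrapper around the delete would be one workaround (a rough sketch only, assuming the contention really is transient; it uses the synchronous ndb.delete_multi for simplicity, and the backoff values are illustrative):

```python
# Sketch: retry a multi-delete when the datastore reports contention.
import time

from google.appengine.api import datastore_errors
from google.appengine.ext import ndb

def delete_with_retry(keys, attempts=3):
    for attempt in range(attempts):
        try:
            ndb.delete_multi(keys)
            return
        except datastore_errors.TransactionFailedError:
            if attempt == attempts - 1:
                raise
            # Back off briefly so the contended entity groups can settle.
            time.sleep(2 ** attempt)
```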

stevep

Mar 28, 2015, 3:50:14 PM
to google-a...@googlegroups.com
While rapidly processing leased tasks from a pull queue, I consistently had a problem with TransientErrors when requesting a fixed number of tasks (1,000 at a time) prior to iterating through them. The only time we see TransientErrors in pull queues is at this lease step. There was never any discernible pattern, so I think this may be due to infrastructure load not visible in the app Dashboard.

We added logic to pause for a second, then retry getting the leased items when the lease request throws a TransientError. If the second try also fails, we log an error. We have never seen that error-log entry appear, so for us this pause seems to fix it.

Note that we never had a large number of TransientErrors all at once, but there were clearly times when they appeared more often than others. I think I sent some logs to a Google engineer identifying such a pattern. There was no conclusion from that, so now we always use this sleep/retry exception handling for all pull-queue lease logic. Fortunately, even when they are happening, TransientErrors are infrequent, so 60 seconds of sleep in our big cron task has no real effect on instances, etc.

If you find the TransientErrors are due to the lease step, this might help.

Best,
Steve
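
In code, the retry wrapper is roughly this (a minimal sketch rather than our production code, assuming the Python taskqueue API; the queue name, lease duration, and batch size are placeholders):

```python
# Sketch: lease a batch of pull-queue tasks, retrying once after a
# one-second pause if the lease throws a TransientError.
import logging
import time

from google.appengine.api import taskqueue

def lease_batch(queue_name='work-pull-queue', batch_size=1000):
    queue = taskqueue.Queue(queue_name)
    try:
        return queue.lease_tasks(lease_seconds=60, max_tasks=batch_size)
    except taskqueue.TransientError:
        time.sleep(1)  # brief pause, then a single retry
        try:
            return queue.lease_tasks(lease_seconds=60, max_tasks=batch_size)
        except taskqueue.TransientError:
            # In practice we have never seen this second failure logged.
            logging.error('lease_tasks failed twice with TransientError')
            return []
```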

stevep

Mar 28, 2015, 3:52:09 PM
to google-a...@googlegroups.com
Apologies for a typo near the end of my previous post: 1 second of sleep, not 60 (sheesh).

Kaan Soral

Mar 28, 2015, 4:12:55 PM
to google-a...@googlegroups.com
Thanks for the info

I was also thinking of a custom sleep+retry routine like the one you mentioned; it works in most App Engine error scenarios.

But I'm also concerned that the issue will get worse as my app grows.

I'm potentially running parallel lease operations against the same queue, which holds different tags; however, each tag is leased discretely, so there are no parallel leases going on for the same tag.

I have a hunch there is a hidden performance limit on pull queues, like a maximum of 50 leases in a 10-second window regardless of tags.

Ideally, I want to be able to run 1,000+ lease calls in parallel, each pulling tasks with a unique tag.
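
Something along these lines for each tag's worker (a rough sketch only, assuming the Python taskqueue API; the queue name, lease duration, and batch size are placeholders):

```python
# Sketch: one worker leases only the tasks carrying its own tag, so
# workers handling different tags never compete for the same tasks.
from google.appengine.api import taskqueue

def lease_for_tag(tag, queue_name='work-pull-queue'):
    queue = taskqueue.Queue(queue_name)
    tasks = queue.lease_tasks_by_tag(lease_seconds=60, max_tasks=100, tag=tag)
    # ... process the leased tasks here ...
    if tasks:
        queue.delete_tasks(tasks)
    return tasks
```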
