X-Appengine-Taskretryreason:Instance Unavailable - I'm confused by this specific defer behaviour


Kaan Soral

Feb 9, 2015, 6:00:00 AM
to google-a...@googlegroups.com
If you locate and inspect deferred.py in the SDK, it's actually a pretty simple and brilliant routine

When a task payload overflows the task queue size limit, deferred puts a _DeferredTaskEntity into the datastore and uses it as payload storage

My Problem:

I see 30-40 failed defer tasks in production (out of a burst operation that spanned hundreds of thousands of tasks)
They fail with: "X-Appengine-Taskretrycount:1, X-Appengine-Taskretryreason:Instance Unavailable"
The exception is a PermanentTaskFailure from run_from_datastore

Does this mean that the tasks actually executed before, were then faultily re-triggered, and threw PermanentTaskFailures because the previous execution had already deleted the _DeferredTaskEntity?
(I'm trying to pinpoint the cause and the trigger so it doesn't happen again. Since these exceptions each happened only once and the count is low, I'm guessing this is very edge-case behaviour)

If this is the case, the same fault would probably cause a silent re-execution of the task in cases where the datastore/overflow path wasn't used
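The hypothesis above can be sketched with a toy simulation. This is not the SDK's actual code: the in-memory dict stands in for the datastore, the function names only loosely mirror deferred.py, and the real run_from_datastore fetches the entity, runs the payload, and then deletes the entity (the order is simplified here).

```python
# Toy simulation of deferred.py's overflow path, to illustrate the
# duplicate-delivery hypothesis. The dict stands in for the datastore;
# the real SDK stores a _DeferredTaskEntity model instead.

class PermanentTaskFailure(Exception):
    """Raised when a task can never succeed (mirrors deferred's exception)."""

_MAX_PAYLOAD = 100 * 1024  # task queue payload limit, roughly 100 KB
_fake_datastore = {}       # key -> payload, standing in for _DeferredTaskEntity

def defer(key, payload):
    """Enqueue a task; oversized payloads go through the datastore."""
    if len(payload) > _MAX_PAYLOAD:
        _fake_datastore[key] = payload
        return ("from_datastore", key)   # the queued task carries only the key
    return ("inline", payload)

def run_from_datastore(key):
    """Fetch, run, and delete the stored payload (order simplified)."""
    payload = _fake_datastore.pop(key, None)
    if payload is None:
        # A duplicate delivery after a successful run lands here:
        # the first execution already deleted the entity.
        raise PermanentTaskFailure("missing entity for key %r" % key)
    return len(payload)  # stand-in for actually executing the payload

# First delivery succeeds; a faulty re-delivery raises PermanentTaskFailure.
kind, key = defer("task-1", b"x" * (200 * 1024))
run_from_datastore(key)          # first delivery: runs fine
try:
    run_from_datastore(key)      # duplicate delivery: entity is gone
except PermanentTaskFailure:
    print("duplicate delivery detected")
```

Note that an inline (small-payload) task hit by the same duplicate delivery would simply run twice, silently, which is the point of the last paragraph above.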

Vinny P

Feb 12, 2015, 2:00:54 AM
to google-a...@googlegroups.com
On Mon, Feb 9, 2015 at 5:00 AM, Kaan Soral <kaan...@gmail.com> wrote:
I see 30-40 failed defer tasks in production (out of a burst operation that spanned hundreds of thousands of tasks)
They fail with: "X-Appengine-Taskretrycount:1, X-Appengine-Taskretryreason:Instance Unavailable"
The exception is a PermanentTaskFailure from run_from_datastore

Does this mean that the tasks actually executed before, were then faultily re-triggered, and threw PermanentTaskFailures because the previous execution had already deleted the _DeferredTaskEntity?
(I'm trying to pinpoint the cause and the trigger so it doesn't happen again. Since these exceptions each happened only once and the count is low, I'm guessing this is very edge-case behaviour)



When the 30-40 tasks failed, did they all fail at nearly the same time (within the same few seconds), or in groups within the burst operation? If so, I'd agree that this is probably very edge-case behavior. But before concluding that, I would check whether any of the tasks shared similar properties: for instance, were they unusually large, or did they have significantly more complex serializations?

There have been posts in the past (both here and on Stack Overflow) with similar issues but no definitive answers; the most popular fix seems to be slowing down the addition/execution of tasks.
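One common way to do that slowdown is to spread the burst over a time window instead of making every task eligible immediately. The helper below is a hypothetical sketch (spread_countdowns is not an SDK function); in the real SDK you would pass each computed value as the _countdown argument to deferred.defer.

```python
# Sketch: spread a burst of N tasks so that at most `rate_per_second`
# tasks become eligible each second. Hypothetical helper, not SDK code.

def spread_countdowns(num_tasks, rate_per_second):
    """Return a countdown (in whole seconds) for each task index."""
    return [i // rate_per_second for i in range(num_tasks)]

# e.g. 10 tasks at 4/sec:
print(spread_countdowns(10, 4))  # -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]

# In App Engine code this would look roughly like (not run here):
# for job, cd in zip(jobs, spread_countdowns(len(jobs), 50)):
#     deferred.defer(process, job, _countdown=cd)
```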

 
-----------------
-Vinny P
Technology & Media Consultant
Chicago, IL

App Engine Code Samples: http://www.learntogoogleit.com
 

Kaan Soral

Feb 12, 2015, 11:52:45 AM
to google-a...@googlegroups.com
Yes, it all happened within about a second, and they were all datastore-based defers, so the payloads were apparently over ~100 KB

I recently started experimenting with basic_scaling, and these kinds of issues were severely reduced

So now, before bursting predictable tasks, I switch my background module to basic_scaling
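For reference, basic scaling is a deploy-time setting in the module's yaml, something along these lines (the values here are illustrative, not taken from the thread):

```yaml
# background-module.yaml -- illustrative basic_scaling fragment
module: background
runtime: python27
basic_scaling:
  max_instances: 5    # hard cap on simultaneous instances
  idle_timeout: 10m   # shut an instance down after 10 idle minutes
```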

The issue itself remains. I also noticed that basic scaling sometimes spins up more instances than max_instances, yet the extras apparently aren't billed; a bit confusing, but it works well

(To add more unexplained burst phenomena: I have red alerts for when things that shouldn't happen do happen, and it seems similar issues occur for datastore operations/transactions too, but very rarely and unpredictably; nothing to worry about, though)