[SEVERE] App Engine can't consistently execute tasks

Kaan Soral

Feb 23, 2015, 6:46:49 PM
to google-a...@googlegroups.com
  rate: 500/s
  bucket_size: 100
  retry_parameters:
    task_retry_limit: 6
    min_backoff_seconds: 2
    max_doublings: 3

Although my queue configuration is broad enough to handle occasional internal failures, I noticed and verified that the taskqueue leaves some tasks unexecuted
(1% to 10%; it happens when you burst tasks or run a [custom] mapreduce job, and it happens both with normal instances and with basic_scaling/B4 instances)
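
For context, a retry_parameters block like the one above sits inside a full queue.yaml entry roughly like this (a sketch only; the queue name is illustrative and not taken from the thread):

  queue:
  - name: burst-queue   # illustrative name
    rate: 500/s
    bucket_size: 100
    retry_parameters:
      task_retry_limit: 6
      min_backoff_seconds: 2
      max_doublings: 3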

I first noticed the issue when some operations that should have been done were left undone

Then I inspected the taskqueue execution with a custom routine that tracks/counts ingoing and executing tasks, a routine that I perfected long ago, and noticed the missing executions

The issue isn't persistent: after a re-deployment and re-test, the same routine managed to traverse all the entities as it's supposed to

TL;DR - some taskqueue tasks silently fail to execute. This should never happen, but it happens very frequently without any apparent reason, and it causes damage and confusion
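
A minimal sketch of the kind of ingoing/executing counter described above, assuming memcache-backed counters on App Engine Python; the handler paths, counter keys, and helper names are illustrative, not the actual routine:

  # Sketch only: count how many tasks go out (enqueued) vs. come in (executed).
  # Memcache counters can be evicted, so the numbers are approximate.
  from google.appengine.api import memcache, taskqueue
  import webapp2

  OUT_KEY = 'workers_out'   # incremented when a task is enqueued (assumed key)
  IN_KEY = 'workers_in'     # incremented when the worker handler runs (assumed key)

  def add_counted_task(url, **kwargs):
      """Enqueue a task and count it as 'out'."""
      taskqueue.add(url=url, **kwargs)
      memcache.incr(OUT_KEY, initial_value=0)

  class CountedWorker(webapp2.RequestHandler):
      def post(self):
          memcache.incr(IN_KEY, initial_value=0)
          # ... real task work would go here ...

  class CounterReport(webapp2.RequestHandler):
      def get(self):
          counts = memcache.get_multi([OUT_KEY, IN_KEY])
          self.response.write('Workers Out: %s In: %s' % (
              counts.get(OUT_KEY, 0), counts.get(IN_KEY, 0)))

  app = webapp2.WSGIApplication([('/worker', CountedWorker),
                                 ('/counter', CounterReport)])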

Kaan Soral

Feb 23, 2015, 6:49:57 PM
to google-a...@googlegroups.com
I also experienced this same issue previously; an engineer told me that internal/silent/invisible execution failures also count as task retries, which is illogical and likely a design/logic fault
- That's why I increased my retry parameters after that issue/research flurry and didn't experience task loss until now. Now, even with the increased retry parameters, I'm experiencing this issue on multiple instance/scaling configurations, and it's killing me slowly

paynen

Feb 23, 2015, 11:52:36 PM
to google-a...@googlegroups.com
Hey Kaan,

That sounds like a serious problem. Is it possible for you to share some logs that demonstrate the issue occurring, or better yet a minimally-reproducing example? This is vital to identifying and fixing a possible issue on the platform.

Secondly, could you possibly point to the issue ticket # or forum thread where the discussion with the engineer occurred? I'm interested to see the full context of whether this is a known issue or something else.

Regards,

paynen

Kaan Soral

Feb 24, 2015, 12:49:44 AM
to google-a...@googlegroups.com
If there were any logs, it wouldn't be an issue; I would likely have solved it from the logs. The issue is that there are no logs

With this recent issue, the only error logs are some instances dying critically from memory overflows (after going over 256 MB; my theory is an ndb memory leak, but that's off-topic)
(There are 5-6 error logs like these, all instance overflows; since there are tens of thousands of other operations, the error ratio is low)

In this specific case, the critical death of instances might be related to the issue: those instances might be taking some tasks down with them and leaving those tasks unexecuted (theory)

I will dig in and find the discussion

Jason Collins

Feb 24, 2015, 9:48:34 AM
to google-a...@googlegroups.com
Hi Kaan,

Just lending support to this:

engineer told me that internal/silent/invisible execution failures also count as task retries

I have also been told this. We have an open-source framework for doing workflows based on task queue, and one of the features we tried to build into it was an alerting system for when the final task retry failed (and thus the task would go away permanently). We had to give up on the feature because it was possible (and we saw it) for tasks to fail before ever hitting our code, and this counts as a retry; if it happened to be on the last retry, then our alert code wouldn't get a chance to operate.

In short, it would have been an unreliable feature. Note, this was a long time ago though (3+ years).
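
A hedged sketch of the kind of last-retry alerting described above, using the documented X-AppEngine-TaskRetryCount and X-AppEngine-TaskName headers; MAX_RETRIES and do_work are assumptions (the constant would have to mirror the queue's task_retry_limit). As noted, it cannot fire if the final attempt never reaches application code:

  import logging
  import webapp2

  MAX_RETRIES = 6  # assumption: must match task_retry_limit in queue.yaml

  def do_work(request):
      """Placeholder for the real task body."""
      raise NotImplementedError

  class AlertingWorker(webapp2.RequestHandler):
      def post(self):
          retry_count = int(self.request.headers.get('X-AppEngine-TaskRetryCount', 0))
          try:
              do_work(self.request)
          except Exception:
              if retry_count >= MAX_RETRIES:
                  # The retry budget looks exhausted; log before the task disappears.
                  logging.critical('Task exhausted its retries: %s',
                                   self.request.headers.get('X-AppEngine-TaskName'))
              raise  # re-raise so the attempt is still recorded as a failure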

j

husayt

Feb 24, 2015, 12:20:55 PM
to google-a...@googlegroups.com
Hi Kaan,
I had exactly the same issue (and maybe still have, as there is no easy way to find it). This caused us a huge problem as we were migrating our namespaces to a new data schema, and we only found out after some namespaces stopped working following the migration.
Upon examination we found out that tasks didn't run for these namespaces; there were no error logs, and I don't know how many of these missed tasks we had before and after.

This is scary. It happens during bursts when we run many tasks at once. The major problem is the absence of logs.
It is difficult to reproduce and even more difficult to catch.

Task queues are offered as a guaranteed task execution solution, but this shows they are not.

Best,
HG

Jason Collins

Feb 24, 2015, 1:02:31 PM
to google-a...@googlegroups.com
If you're looking for guaranteed task execution, you probably don't want to specify max retries or max age.

But regardless, the fact (if it is still a fact) that a task with max retries might never hit your code before disappearing is still a pretty unfortunate thing.
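
For illustration, a queue.yaml entry along those lines might look like the sketch below (the queue name is made up); omitting task_retry_limit and task_age_limit is what leaves retries unbounded:

  queue:
  - name: guaranteed-work   # illustrative name
    rate: 100/s
    bucket_size: 100
    retry_parameters:
      min_backoff_seconds: 2
      max_doublings: 3
      # no task_retry_limit and no task_age_limit: retry until success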

j

Kaan Soral

Feb 24, 2015, 1:43:57 PM
to google-a...@googlegroups.com
Thanks a lot for the support, j and HG; it's great to hear I'm not the only one

Jason, it makes sense to let some queues retry indefinitely. I initially had only 1 retry, as I thought those retries were for user failures; then I moved on to my current config, as it's extremely unlikely that 6 retries with backoffs could be exhausted. However, it seems they can be. Similarly, there is a possibility that the issue might also occur with indefinite retries; it's still not 100% certain that internal retry failures are the cause, although several engineers claimed they were (I generally try to avoid indefinite retries, as they might produce serious side effects)

The conversation in which I learned about the retry behaviour was with hanssens[t]google.com, titled "TaskQueue Followup", on 3/22/14 (he was very helpful, thanks)

You can't see these issues from the logs, yet you can see them as 5xx spikes from the error logs (most of the time)

(By the way, this specific instance of the issue cost me almost $200 - I'm constantly running and inspecting routines that trigger many entities, which costs a lot. I initially decided to use basic_scaling to prevent instance/cost spikes and to limit instances, yet basic_scaling somehow made the issue worse)

Kaan Soral

Feb 24, 2015, 1:45:18 PM
to google-a...@googlegroups.com
You can't see these issues from the logs, yet you can see them as 5xx spikes from the error *graphs*/dashboard (most of the time)

Jason Collins

Feb 24, 2015, 1:55:34 PM
to google-a...@googlegroups.com
basic_scaling made the issue worse somehow

So I suspect that the silent failures occur when the dispatcher is attempting to service a dequeued task, but cannot find an available instance (nor spin one up) in some unknown period of time.

Just a guess though.

j

paynen

Feb 25, 2015, 5:52:06 PM
to google-a...@googlegroups.com
If anybody reading this other than the OP is also affected and can provide, at minimum, a reproducing example or an affected timeframe on a given instance, that is the minimum information needed to look into a potential issue.

I'm continuing to monitor this thread, and I hope we can get this addressed as soon as possible, as soon as it's demonstrated/repro'd.

husayt

Feb 26, 2015, 8:54:02 AM
to google-a...@googlegroups.com
Hi @paynen,
this is the problem: it's almost impossible to replicate externally, as it happens somewhere in the internal App Engine stack.

And the main problem, as also explained by Kaan, is that it never hits the logs.

So there is not much we can do here as GAE users. This can be replicated only with access to the internals of GAE.

Can I also stress that this is the number one issue on my list. I had a support case created, and it didn't go forward because I couldn't replicate the problem.

One thing I can say is that it is more likely to happen when we have bursts of tasks.


Hope this helps,
HG

paynen

Feb 26, 2015, 6:36:07 PM
to google-a...@googlegroups.com
Hi @husayt,

Of course, I've been following this thread and understand the issue doesn't appear in logs directly. I guess I'm just wondering how you've managed to determine that this is happening if there isn't any trace anywhere... If there is such a trace, I'd appreciate it if you could get it to us, as it's necessary in order to look into it. Whether that's an affected timeframe on a given instance, some logs demonstrating the issue (@Kaan above mentioned 5xx spikes in their logs; you could also try making calls to the REST API to demonstrate that X tasks were enqueued but only X - N finished, meaning N went missing), or a minimally-reproducing example. Access to the internals of GAE is our specialty ;) We just need a little help from you in pointing out where to look, since there are a lot of "internals" to look at.
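
As one hedged, in-app alternative to the REST API approach (a sketch only; the queue name is illustrative), the documented fetch_statistics call can at least snapshot queue-level numbers around a burst:

  import logging
  from google.appengine.api import taskqueue

  def log_queue_stats(queue_name='default'):
      # Snapshot the queue's own counters; useful as supporting evidence
      # alongside application-level enqueue/execute counts.
      stats = taskqueue.Queue(queue_name).fetch_statistics()
      logging.info('queue=%s waiting=%d in_flight=%d executed_last_minute=%d',
                   queue_name, stats.tasks, stats.in_flight,
                   stats.executed_last_minute)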

So, do you have the info required? Again, please note that an affected timeframe on a given instance (app id, version + module) are potentially enough info. This is a high priority for us if it's a high priority for you.

Regards,

NP

paynen

Feb 26, 2015, 7:02:07 PM
to google-a...@googlegroups.com
In addition, for anybody experiencing this issue, it would be very helpful to include your queue config file, to confirm whether you're specifying any kind of max-retries parameter.



Kaan Soral

Feb 26, 2015, 7:03:01 PM
to google-a...@googlegroups.com
I have a large mapping task I need to run soon; I will report with more detail if I detect the issue on that execution

I can't try to reproduce the issue, mainly because I don't have many funds to spare, both time- and money-wise; however, I will help as the issue arises, and I really appreciate your help and interest in the issue

I would like to humbly suggest a general method I use a lot instead:

1) Implement a task burst routine that bursts tasks with different configurations - names, no-names, repetitions, loads, no-loads - and runs them on queues with different retry settings (a minimal sketch of such a burst follows after this list)
(the tasks can be error-free / simple tasks, as task errors are not the issue here)
2) Implement a counting routine / a routine to make sure the tasks executed as they should
3) Implement extreme scenarios, for example natural instance deaths etc., that happen during these task bursts
4) Run this routine at different times, on different natural production environments (it seems to me there are good environments and there are bad ones, and we get random environments; this is just an impression)
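
A minimal sketch of step 1, assuming trivial worker URLs and queue names (all illustrative) and an application-level counter like the one sketched earlier in the thread:

  import uuid
  from google.appengine.api import taskqueue

  def burst_tasks(n=1000):
      """Enqueue n trivial tasks: a mix of named/unnamed, across two retry queues."""
      for i in range(n):
          queue = 'high-retry' if i % 2 else 'low-retry'  # hypothetical queue names
          # Named tasks cannot be enqueued twice under the same name,
          # which also makes re-executions easier to spot later.
          name = 'burst-%s-%d' % (uuid.uuid4().hex, i) if i % 4 == 0 else None
          taskqueue.add(url='/worker', queue_name=queue, name=name,
                        params={'i': str(i)})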

(I would also like to humbly suggest a similar approach for all critical App Engine services, like datastore transactions for example. I noticed there is a small possibility that transactions re-execute, but I haven't brought it up yet, as it's impossible to track and report; yet from my internal checks, I regularly get flags of routines that re-execute. Just an FYI on another, similar App Engine anomaly - these two issues are alike in that they both happen on bursts and non-deterministically. A glitchy transaction doesn't do much damage, yet a glitchy taskqueue does, as queue executions are more vital in most scenarios)

husayt

Feb 27, 2015, 8:22:14 PM
to google-a...@googlegroups.com
For me, the way I noticed it was by chance; it also seems it happened many times unnoticed.
I was running a task for each namespace, and each task would send an email. I noticed I received fewer emails than the number of namespaces. I looked at the logs and couldn't find any errors, nor even any evidence of these tasks ever executing, although I am sure they were sent to the queue. As for the max retries param, I believe I had a max set to 2-3.
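
A hedged reconstruction of that per-namespace fan-out (not the actual migration code; the handler URL and queue name are made up):

  from google.appengine.api import taskqueue
  from google.appengine.ext.ndb import metadata

  def enqueue_namespace_migrations():
      """Enqueue one migration task per namespace and return the expected count."""
      namespaces = metadata.get_namespaces()
      for ns in namespaces:
          taskqueue.add(url='/migrate', queue_name='migration',
                        params={'namespace': ns})
      # Compare this number against e-mails received / tasks actually executed.
      return len(namespaces)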

Thanks

Kaan Soral

Mar 1, 2015, 1:23:12 PM
to google-a...@googlegroups.com
I just ran my tasks, in 2 stages

In the first stage, task sequences executed as expected; I only noticed a single additional incoming task and thought it was a glitch / repetition in the counting logic
I also experienced an ndb.delete_multi lock issue; it's not related to the taskqueue, but that operation is part of counting/combining some side metrics. This deadlock prevented the operation and caused some data loss, yet the taskqueues were unaffected (first time I've experienced this issue)

----

I just ran another set of task sequences; this time I noticed a significant increase in additionally executed tasks

"Workers Out: 11476 In: 11507"

31 tasks re-executed. Since these tasks are unique by name, they don't cause task explosions, yet I can verify that these additional tasks really executed, because the total number of elements visited increased proportionally

(I don't see any task error logs; these issues / executions are all silent)
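
A hedged sketch of one way to flag such silent re-executions, assuming task names are unique as described above (the model and helper names are made up):

  from google.appengine.ext import ndb

  class TaskExecution(ndb.Model):
      """Marker entity; the key id is the task name."""
      pass

  @ndb.transactional
  def mark_executed(task_name):
      # Returns False if this task name has already executed once before.
      key = ndb.Key(TaskExecution, task_name)
      if key.get() is not None:
          return False
      TaskExecution(key=key).put()
      return True

Calling something like this at the top of the worker handler (with the X-AppEngine-TaskName header as the name) would turn an otherwise invisible re-execution into a countable event.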

----

To sum up, I didn't notice any missing tasks this time around, yet I noticed some invisibly re-executed tasks

To sum up this entire flurry: the main issue is that tasks re-execute or fail silently, and these are all counted as retries. They might happen 100 times and count as 100 retries; retries should really only be counted for user failures, not internal ones

paynen

Mar 5, 2015, 4:43:17 PM
to google-a...@googlegroups.com
@Kaan, thank you for continuing to respond. I think it's time to move beyond this thread and take action to have the potential issue looked at properly.

If you think you've found an issue, you should create a Public Issue Tracker thread with one of the following:
  1. A timeframe on an affected instance, or

  2. An example that can be used to reproduce, or

  3. At least your custom task-monitoring solution, leaving the creation of a massive burst of tasks to the engineer who attempts to reproduce it, or

  4. You could even, in theory, request that the thread you create be set to private so you can upload your whole codebase.

The bottom line is this: no issue reported by somebody who saw it in the wild can be looked at or worked on until it's reproducible or at least leaves some kind of trace in logs somewhere. We're happy to look up logs on an affected version/module/instance/etc. during a given timeframe and try to determine what happened, if anything indeed happened. Either of these will suffice, but anything less is not really workable when reporting a bug. Every software project in the galaxy beyond hobbyist-size will have these requirements for bug reporting. At present, no viable data has been produced on how to observe the behaviour claimed to occur. All we have is the fact that a custom implementation of task-counting produced the numbers "11476" and "11507" as output. I hope you can see the gap between what's been produced and what's needed.

If nobody who has observed the issue (if indeed there is such an issue to observe) and wants to see a fix (I assume this is you) can point towards a reproducing example, the very least information requested is a time-frame on an affected instance. All that's needed is to say:
 
On my project primal-buttress-1337, instance 1 of module devel, version v1, began enqueuing 1,000,000 tasks at [TIMESTAMP], and you can see a log line for each task being added, up until the last task is added at [TIMESTAMP]. The last task was executed at [TIMESTAMP]. The number of Flobbb entities created in the Datastore is only 999,998. Therefore I believe this indicates 2 tasks were enqueued but never executed. There are no records in the logs of any task failures.

If and when such information is produced in a Public Issue Tracker thread, you'll be pleased to see it taken up and worked on, and a solution will come in a timeframe appropriate to the severity of the issue. Up to this point, such viable data has not been received, hence the deadlocked nature of this thread.

Please proceed to open a Public Issue Tracker thread with the information necessary to report an issue.