Run Permission Not Granted and Worker Stops


jtra...@counsyl.com

Feb 9, 2017, 12:43:24 PM
to Luigi
Hi all,

I kicked off two sets of tasks with overlapping dependencies and was surprised when the second set simply stopped running rather than waiting for the other tasks to finish. If I wait a bit and kick it off again, everything works fine.

Three main questions here:
1. What does "not granted run permission" really mean?
2. Is it okay to have two tasks with different names that require the same set of upstreams?
3. How do I avoid this problem in the future? (e.g., should I run a loop that just retries when it gets "not granted run permission"?)

---

More detail:

I have a long-running ETL task that I want to kick off, with dependencies shared between different tasks. I was running ~40 workers on two different machines: WorkerGroupA was supposed to run the big ETL flow, while WorkerGroupB (on the other machine) was running related tasks that were not actually in the same tree. ("WorkerGroup" is what I'm calling one "luigi SomeTask --workers=40" invocation.)

(The task flow is that I have a task that clusters data, and then I need to generate some files based on the population in each cluster. It's possible to have two "GenerateFilesForCluster" tasks that have different input files but the same upstream requirements. WorkerGroupA would then perform other work later on.)

When I tried to kick off WorkerGroupA, it ran 40 tasks, but then stopped with the "This progress looks :| because there were tasks that were not granted run permission by the scheduler" message.

My expectation was that WorkerGroupA would keep polling until WorkerGroupB finished its tasks, picking up tasks to help where it could. Is it a problem for two tasks to share dependencies when their parents are in different trees?

Happy to provide more information, and my apologies if this is too vague or should be posted elsewhere. (The full message is at the end.) I tried googling a little and found this https://groups.google.com/forum/#!topic/luigi-user/YH6pxBngKDw , but that issue seemed to be about using only a single worker, whereas I'm using multiple.

Thank you for your help!

Jeff

---

* 679 present dependencies were encountered:
- 150 Task
...
* 358 ran successfully:
- 9 Task
- 5 Task
...
* 1679 were left pending, among these:
* 1 were missing external dependencies:
- 1 Task(...)
* 40 were being run by another worker:
- 40 Task(...)
* 1635 had missing external dependencies:
- 1 Task(...)
- 326 Task(...) ...
- 326 Task(...)
- 326 Task(...)
- 326 Task(...)
...
* 1 had dependencies that were being run by other worker:
- 1 **TaskA**(...)
* 2 was not granted run permission by the scheduler:
- 1 **TaskB**(...) (upstream of TaskA)
- 1 **TaskC**(...) (requires TaskB but not TaskA)

The other workers were:
- Worker(salt=863995201, workers=40, host=oak1-prd-hpc-n018, username=production, pid=3943266) ran 40 tasks


This progress looks :| because there were tasks that were not granted run permission by the scheduler

Arash Rouhani Kalleh

Feb 9, 2017, 9:25:44 PM
to jtra...@counsyl.com, Luigi
Have you tried playing around with the --worker-keep-alive option?

The "not granted run permission" message can mean many things, but in your case it seems that some other worker was simply already doing the work.


--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeffrey Tratner

Feb 10, 2017, 12:31:49 AM
to Arash Rouhani Kalleh, Luigi
Oh, you're right; I completely missed that in the docs. I think what was going on is that I was running out of resources and the worker didn't stay alive.

I guess the idea behind keep_alive not being the default is that you'd want to react to a lack of resources in a parent script (and/or choose how long to block). From the docs:

"""
keep_alive
If true, workers will stay alive when they run out of jobs to run, as long as they have some pending job waiting to be run. Defaults to false.
"""
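
A minimal sketch of the "parent script" idea, assuming you would rather retry from outside than set keep_alive. The names `run_with_retries`, `was_not_granted`, and `run_once` are hypothetical and not part of Luigi; `run_once` stands for whatever launches your run (for example a subprocess call to the luigi command) and returns the execution summary text. The only thing taken from this thread is the "not granted run permission" wording of the summary.

```python
import time
from typing import Callable

# Phrase that appears in the execution summary quoted in this thread.
NOT_GRANTED = "not granted run permission"


def was_not_granted(summary_text: str) -> bool:
    """Return True if the summary reports tasks that were denied run permission."""
    return NOT_GRANTED in summary_text


def run_with_retries(
    run_once: Callable[[], str],
    max_attempts: int = 5,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call run_once until its summary no longer reports denied run permission.

    run_once is a hypothetical callable that starts one Luigi run and
    returns its summary text. The return value is True if some attempt
    finished without the message (this does not by itself mean every task
    succeeded) and False if every attempt hit it.
    """
    for attempt in range(1, max_attempts + 1):
        summary = run_once()
        if not was_not_granted(summary):
            return True
        if attempt < max_attempts:
            # This is where the parent script chooses how long to block.
            sleep(wait_seconds)
    return False
```

The `sleep` argument is injectable only so the loop can be exercised without actually waiting; in a real script the default `time.sleep` is enough.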



Jeffrey Tratner

Feb 10, 2017, 12:43:11 PM
to Arash Rouhani Kalleh, Luigi
Following up on this: now that I've set keep_alive=True, the worker never exits when a task fails (whereas my expectation was that it would exit after the second failure).

e.g. this

import luigi


class FailTask(luigi.Task):
    # Always fails, to show how the worker reacts to a failing task.
    def run(self):
        raise ValueError('error')

    def output(self):
        return luigi.LocalTarget('scratchblah.txt')


with

[worker]
keep_alive=True
count_uniques=True

will just retry forever. Is there another setting I'm missing? (Setting count_uniques=False doesn't help either.)

VMiller

Jun 1, 2022, 9:34:13 AM
to Luigi
FYI: 

I'm not sure which version of Luigi this was about, but since 2.8.4 there has been a parameter specifically for this purpose:

max_keep_alive_idle_duration 
New in version 2.8.4. Maximum duration to keep worker alive while in idle state. Default: 0 (Indefinitely)
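
As a rough illustration, the setting could sit next to keep_alive in the same [worker] section used earlier in this thread. This is a sketch under assumptions: the value "10 minutes" is made up, and it assumes the setting accepts a time delta string in that form, so check the configuration docs for your Luigi version before relying on it.

```ini
[worker]
keep_alive=True
# Assumed example value: give up after 10 idle minutes.
# The documented default of 0 keeps the worker alive indefinitely.
max_keep_alive_idle_duration=10 minutes
```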
