strange behavior of Task.run() when using yield

jermcc...@gmail.com

Apr 27, 2019, 11:04:39 AM
to Luigi
Hi,

I am trying to wrap my head around why Task.run() is called more than once when you yield one or more tasks from it.

As per the docs:

"the Task.run method will resume from scratch each time a new task is yielded"

This seems like completely counter-intuitive behavior. Since using yield preserves the state of the local method, and run() just resumes after the sub-tasks are completed, why would I want Luigi to call my run() method a second time?

What's more, it does not seem to check my output() or complete() method to see whether run() actually needs to be called again, which seems doubly bizarre.

Can someone explain the thinking here? Because it makes no sense to me.

--Jeremy

Arash Rouhani

Apr 28, 2019, 4:04:47 AM
to jermcc...@gmail.com, Luigi
Two thoughts on this:

 * I'm pretty sure the current implementation is easier. How would one store and serialize a "program pointer" to start in the middle of the function again?
 * The drawback above should be of relatively small cost for most run() functions (and hence most luigi users)
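
To illustrate the first point, here is a minimal sketch, assuming CPython and the standard pickle module. It suspends a generator at its first yield and then tries to serialize it. The run() function below is a plain stand-in, not a Luigi task.

```python
import pickle


def run():
    # Stand-in for a task's run() method that yields one dependency.
    print("before yield")
    yield "dependency"
    print("after yield")


gen = run()
first = next(gen)  # executes up to the first yield, then suspends

try:
    pickle.dumps(gen)  # attempt to store the suspended "program pointer"
    pickled = True
except TypeError as exc:
    pickled = False
    print("cannot serialize the suspended generator:", exc)
```

Since the suspended state cannot be stored this way, restarting run() from the top is the simpler option.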

Cheers,
Arash.

Jeremy McCormick

Apr 28, 2019, 4:46:17 AM
to Arash Rouhani, Luigi
This is not how yield works. The method will keep executing after yield is executed by the caller. So the entire run() method can complete, and there should be no reason to call it again, unless I'm somehow missing something here about how this is supposed to function. In other words, if you put some code after the yield in run(), it *will* execute as far as I'm aware. So the run() method completes, yet it will be called again.

The drawback is that I have to check if my run() method has already run even though I've set complete() to true and the outputs are all present.  It does not seem intuitive.  (Admittedly I'm not an expert with this and could be doing something wrong.)

Chris Palmer

Apr 28, 2019, 11:37:55 AM
to Jeremy McCormick, Arash Rouhani, Luigi
Jeremy,

I think you have a misunderstanding of how yield and generators work. When you write a function that uses yield, calls to that function return a generator. That generator only produces values when the calling code calls next on it. If the calling code doesn't keep calling next, then your generator function may not yield all the values it could and may not run all of its code. It's important to note that if you use a for loop to process a generator, the calls to next are made for you by the for loop.

This is easily seen with the following example:
def my_generator_function():
    print('start of my_generator_function')
    for i in (1, 2, 3, 4, 5):
        yield i
    print('end of my_generator_function')

def my_handler_function(max_value):
    print('start of my_handler_function; max_value={}'.format(max_value))
    for n in my_generator_function():
        if n > max_value:
            break
        print(n)
    print('end of my_handler_function')

my_handler_function(5)
my_handler_function(3)
Which produces the following output:

start of my_handler_function; max_value=5
start of my_generator_function
1
2
3
4
5
end of my_generator_function
end of my_handler_function
start of my_handler_function; max_value=3
start of my_generator_function
1
2
3
end of my_handler_function

As you can see in the second call to my_handler_function, the for loop is terminated prematurely via the break statement, the generator never yields values 4 and 5, and 'end of my_generator_function' is not printed.

Luigi is doing something similar in the _run_get_new_deps method. If your task's run method yields dependencies and those dependencies are not complete, then it returns the new dependencies (on line 160). After that, no more calls to next are made on the generator returned by your run method, and so it never completes its execution.
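
As a rough illustration of that restart behavior, here is a toy model in plain Python. It is not Luigi's code: ToyTask, run_get_new_deps, and toy_worker are hypothetical names made up for this sketch, and the real worker and scheduler do much more. The sketch assumes that a yielded dependency which is already complete lets the generator continue, while an incomplete one causes the generator to be abandoned and run() to be called again later.

```python
log = []


class ToyTask:
    """Hypothetical stand-in for a task; this is not luigi.Task."""

    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)
        self.done = False
        self.run_calls = 0

    def complete(self):
        return self.done

    def run(self):
        self.run_calls += 1
        log.append('{}: run() started (call {})'.format(self.name, self.run_calls))
        for dep in self.deps:
            yield dep  # dynamic dependency
        log.append('{}: run() finished'.format(self.name))


def run_get_new_deps(task):
    """Drive task.run(); return incomplete deps, or None if run() finished."""
    gen = task.run()
    try:
        while True:
            dep = next(gen)
            if not dep.complete():
                return [dep]  # the generator is abandoned and never resumed
    except StopIteration:
        task.done = True
        return None


def toy_worker(task):
    """Run a task, rescheduling it after each incomplete dependency."""
    pending = [task]
    while pending:
        new_deps = run_get_new_deps(pending[-1])
        if new_deps is None:
            pending.pop()
        else:
            pending.extend(new_deps)


a = ToyTask('A')
b = ToyTask('B')
parent = ToyTask('Parent', deps=[a, b])
toy_worker(parent)
print('\n'.join(log))
```

In this toy model, Parent's run() starts three times (once per incomplete dependency, plus the final pass) but reaches its last line only once, while A and B each run once.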

I'm not exactly sure what you are doing, but it seems to me that the only way your outputs could be present is if you create them in your run method before you yield the dependencies. But if you can create your output before the dependencies are run, then they aren't really dependencies.

Thanks
Chris

Jeremy McCormick

Apr 29, 2019, 2:27:27 AM
to Chris Palmer, Arash Rouhani, Luigi
Thank you for the correction. Thinking about it further, I do realize that the run() method would only execute up to the last yield statement in the method and nothing afterwards. However, if the sub-tasks that it spawns via a yield statement create all the targets that are being checked in output() AND I return true in complete() due to setting a flag or whatever (not even sure complete() has any effect at all), why would Luigi feel the need to execute the run() method again? This was more the gist of my question. My understanding is that the original task's run() is suspended while the sub-tasks complete. So if all the outputs are there after the sub-tasks are done, why would it not consider the original task done? It would make more sense to me if run() were not executed again when the outputs are all there after the sub-tasks are done (since the sub-tasks are presumably creating them).

Chris Palmer

Apr 29, 2019, 11:18:12 AM
to Jeremy McCormick, Arash Rouhani, Luigi
Tasks yielded by the run() method are not subtasks, they are dependencies (that may seem pedantic, but I think the terminology is important). Dynamic dependencies (those returned by yield statements in run) are treated the same as tasks returned by the requires() method. In general* each task should have its own distinct outputs, and each task's outputs should be created by that task and that task only. Once a task has been scheduled as PENDING, Luigi assumes that it is incomplete until its run() method has been executed fully. Therefore there is no reason for Luigi to check once more whether the task is complete.

*The only exception that I know of is WrapperTasks which have no outputs of their own, and instead are considered complete if and only if all their requirements are complete. However, by their nature WrapperTasks cannot have dynamic dependencies because they don't ever get run.

Chris

Jeremy McCormick

Apr 30, 2019, 6:27:14 AM
to Chris Palmer, Arash Rouhani, Luigi
I guess the broader question is how to create and run these "dynamic" dependencies via yield in a way where Luigi runs them once and completes the original task's run() method without re-executing it?

Now I have all these boolean flags everywhere to return early if run() was already called, which doesn't seem very elegant.

I am not an expert, so there could be a much better way to do what I'm trying to accomplish.

Chris Palmer

Apr 30, 2019, 9:32:35 AM
to Jeremy McCormick, Arash Rouhani, Luigi
I think the simple answer is no, there isn't a way to do that. On the face of it, that breaks the Luigi model. Maybe if you provide more details about the business problem you are trying to solve, someone can suggest a better solution.

Chris