Changing task inputs does not trigger re-runs ... seems to go against "idempotence" definition.

74 views
Skip to first unread message

Hippo Potamus

unread,
Jan 24, 2017, 10:37:03 AM1/24/17
to Luigi
I'm running Luigi 2.5.0 under python 3.

When running Luigi workflows, it appears that changing the input parameters to a task will not cause it to be re-run. Only if its output() method returns a unique value will that task be re-runnable. This seems to go against the definition of "idempotence", since changing the input parameters should treat a truly idempotent task as a new, unique entity, irrespective of whether it happens to return the same output as another invocation with different input parameters.

The following code illustrates the problem. The "x" parameter determines the file name that the output() method returns, while the "y" parameter is used within the contents of the output, but not for the name of the output file.

If I call my workflow with "--x 10 --y 20" and then "--x 10 --y 30", the second invocation does not cause either of the tasks to be re-run. This, I believe, is incorrect behavior. However, if I call the workflow with "--x 10 --y 20" followed by another run using "--x 11 --y 20", both tasks will indeed be re-run.

This assumes that no parent_*.txt nor child_*.txt files exist prior to the 4 runs described in the paragraph above.

#!/usr/bin/python3                                                                                                                          
import luigi

class Child(luigi.Task):

    x
= luigi.Parameter()
    y
= luigi.Parameter()

   
def requires(self):
       
return []

   
def output(self):
       
return luigi.LocalTarget("child_{}.txt".format(self.x))

   
def run(self):
       
with self.output().open('w') as f:
            f
.write('{} {}\n'.format(self.x, self.y))

class Parent(luigi.Task):

    x
= luigi.Parameter()
    y
= luigi.Parameter()

   
def requires(self):
       
return [ Child(self.x, self.y) ]

   
def output(self):
       
return luigi.LocalTarget("parent_{}.txt".format(self.x))

   
def run(self):
       
with self.input()[0].open() as fin, self.output().open('w') as fout:
           
for line in fin:
                fout
.write("from command line: --x {} --y {}, from child: {}\n".format(self.x, self.y, line.strip()))

if __name__ == '__main__':
    luigi
.run()




Dave Buchfuhrer

unread,
Jan 24, 2017, 11:30:56 AM1/24/17
to Hippo Potamus, Luigi
It's up to you to ensure that your tasks are idempotent. This means that running tasks repeatedly doesn't change the output. So if you run task A then task B, running task A or B again should not result in any changes. If you insist that running A again should result in changes, you do not want your tasks to be idempotent.

The reason you're seeing this behavior is that you have defined your output wrong. Two different tasks should never have the same output file because as you've discovered, this means you can't tell which one was completed. The pattern you're developing will break on larger pipelines. What happens if you schedule both Parent(1, 1) and Parent(1, 2) at the same time? The pipeline will start running both Child(1, 1) and Child(1, 2) at the same time. If both somehow manage to finish, it will now start running both Parent(1, 1) and Parent(1, 2) with the same input, which is not expected. If both are able to finish, it will also have overwritten one of their outputs.

If two different people are running these at the same time, one of them will get what they expected, and the other one will get what the first person expected. Neither will be able to tell which one they are without carefully examining the contents of the output file. If these are both scheduled by cron jobs as your operations scale, you'll spend all day just checking which jobs ran correctly and which ones need to be re-run due to getting the wrong input.

The fix here is to parameterize your output filename by both all parameters that affect its contents.

--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hippo Potamus

unread,
Jan 24, 2017, 12:00:09 PM1/24/17
to Luigi, hippoma...@gmail.com
I understand your point about the non-uniqueness of filenames. My tasks are indeed idempotent, but there is a also another criterion that is stronger than simple idempotence: output entity uniqueness.

That's fine. I can work around that in many cases.

However, my LocalTarget example is a simple case for illustrative purposes only. There are cases where the output is a single, known database table or file where the contents can change based upon the inputs. If locking or transactions are used properly within the tasks, multiple invocations can indeed be simultaneously run for the same workflow with differing input parameters.

In these cases, the output() method would return the same file name or database table name, but the differing inputs would cause different data to be written to the target.

How can I manage this in Luigi? I know I could cause the output() method to write some sort of zero-length file whose name represents the input parameters, while the run() method uses these same input parameters to determine what operations are performed on the file or database table. Is this the only way I can accomplish what I want to do in Luigi?
.


Dave Buchfuhrer

unread,
Jan 24, 2017, 11:35:11 PM1/24/17
to Hippo Potamus, Luigi
You need to implement a complete function that can tell independently whether each one is complete.

Reply all
Reply to author
Forward
0 new messages