Methods for re-doing sub-DAGs?

Steven A.

Feb 16, 2017, 10:51:32 AM

to Luigi

I imagine this is a common issue, so I was wondering how other Luigi users handle it. I have a simple DAG with task B dependent on task A. Let's say A and then B both run successfully. Later, however, I find a bug in task A that affects the results of both A and B. I can invalidate A (by deleting some flag or whatever), but then I have to manually invalidate B. In this case it is simple, but for real DAGs this could get messy and error-prone. What techniques do people use to do this? Does Luigi support this out of the box?

Dave Buchfuhrer

Feb 16, 2017, 12:02:51 PM

to Steven A., Luigi

One technique I use is to take a root task whose dependency graph contains everything I need to invalidate and traverse that graph from the root (ignoring complete status) deep enough to find all jobs that might be affected. During this traversal, I build up an adjacency list for the full dependency graph with edges reversed. Then you simply search the reversed dependency graph starting from the buggy job to find everything that needs to be invalidated.
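For anyone wanting to try the traversal, here is a minimal sketch of the idea, not Dave's actual code. The function names and the `get_deps` callback are made up for illustration, and the toy graph uses strings in place of tasks so it runs without Luigi. With real Luigi tasks, something like `lambda t: t.deps()` should work as `get_deps`, but treat that as an assumption to check against your Luigi version.

```python
from collections import defaultdict, deque


def build_reverse_graph(root, get_deps):
    """Walk the whole dependency graph from `root`, ignoring completion status,
    and return a mapping {task: set of tasks that directly depend on it}."""
    reverse = defaultdict(set)
    seen = {root}
    stack = [root]
    while stack:
        task = stack.pop()
        reverse[task]  # make sure every task has an entry, even with no dependents
        for dep in get_deps(task):
            reverse[dep].add(task)  # reversed edge: dependency -> dependent
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return reverse


def tasks_to_invalidate(reverse, buggy):
    """Breadth-first search over the reversed edges: the buggy task plus
    everything downstream of it."""
    found = {buggy}
    queue = deque([buggy])
    while queue:
        task = queue.popleft()
        for dependent in reverse.get(task, ()):
            if dependent not in found:
                found.add(dependent)
                queue.append(dependent)
    return found


# Toy graph standing in for real tasks: root needs C and D, C needs B, B needs A.
toy_deps = {"root": ["C", "D"], "C": ["B"], "B": ["A"], "A": [], "D": []}
reverse = build_reverse_graph("root", lambda t: toy_deps[t])
stale = tasks_to_invalidate(reverse, "A")
```

Here `stale` ends up containing A, B, C and root, while D is left alone because nothing on its path depends on A. How each task is then invalidated (deleting outputs, flags, and so on) depends on the pipeline.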

Another technique I use when I have jobs that need to be invalidated regularly and automatically is to add a checksum to completion checks. Leaf tasks have fixed checksums based on the md5 of the file they use or something, while tasks with dependencies have checksums that are based on the checksums of all their dependencies. If you mix in a salt value for each class, changing that salt value will automatically invalidate all instances of your buggy class and everything that depends on it. This requires some planning and won't help you much right now (unless you're willing to rerun everything) but it's very convenient for pipelines where invalidating tasks is common.
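To make the salt idea concrete, here is a rough sketch of how such checksums could be combined. The helper names and salt strings are hypothetical, and sorting the dependency checksums is just one possible choice to make the result independent of dependency order. It only uses `hashlib` from the standard library.

```python
import hashlib


def leaf_checksum(data: bytes) -> str:
    """Fixed checksum for a leaf task: here, the md5 of the bytes it reads."""
    return hashlib.md5(data).hexdigest()


def task_checksum(class_salt: str, dep_checksums) -> str:
    """Checksum for a task with dependencies: its class salt mixed with the
    checksums of all its dependencies."""
    h = hashlib.md5()
    h.update(class_salt.encode("utf-8"))
    for checksum in sorted(dep_checksums):  # sorted so dependency order doesn't matter
        h.update(checksum.encode("utf-8"))
    return h.hexdigest()


raw = leaf_checksum(b"input file contents")
a_v1 = task_checksum("A-salt-1", [raw])
b_v1 = task_checksum("B-salt-1", [a_v1])

# After fixing the bug in A, bump A's salt. A's checksum changes, and so does B's,
# even though B's own salt is untouched.
a_v2 = task_checksum("A-salt-2", [raw])
b_v2 = task_checksum("B-salt-1", [a_v2])
```

Because B's checksum is derived from A's, the change propagates to anything downstream without having to list those tasks by hand.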

--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Steven An

Feb 16, 2017, 1:27:46 PM

to Dave Buchfuhrer, Luigi

I see. The checksum approach makes a lot of sense. Can you give me a quick example? I imagine you need to put the logic in Task.output or something?

--
You received this message because you are subscribed to a topic in the Google Groups "Luigi" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/luigi-user/a_cSuBPGtSE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to luigi-user+...@googlegroups.com.

Dave Buchfuhrer

Feb 16, 2017, 1:33:48 PM

to Steven An, Luigi

You need to add checksum functions to all of your tasks, rewrite all of your run functions to compute the checksum before running and store the checksum after running, and rewrite the complete function to compare the stored checksum to the last computed one. One thing you can do to make this easier is to inherit from a custom base class and write the actual run logic in a helper function that you override instead of run.

Note that this requires rewriting all of your tasks to some degree (mostly just changing inheritance and renaming the run function) and either rerunning them all or manually marking the checksums for the ones you consider done.
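A rough sketch of what such a base class might look like is below. It is a plain-Python stand-in so it runs without Luigi installed. In a real pipeline the base class would presumably subclass `luigi.Task` and keep its `requires()` and `output()` methods. The names `ChecksumTask`, `do_run`, `salt`, `key` and the in-memory `_store` are all assumptions for illustration. A real store would need to persist somewhere (a file or a database), and the key would have to include the task's parameters rather than just the class name.

```python
import hashlib


class ChecksumTask:
    """Stand-in base class; with Luigi this would inherit from luigi.Task."""

    salt = ""    # bump in a subclass to invalidate it and everything downstream
    _store = {}  # hypothetical checksum store shared by all tasks (in memory only)

    def requires(self):
        return []

    def do_run(self):
        """Subclasses put their actual work here instead of overriding run()."""
        raise NotImplementedError

    def key(self):
        # Assumption: one instance per class; real tasks would include parameters.
        return type(self).__name__

    def checksum(self):
        h = hashlib.md5(self.salt.encode("utf-8"))
        for dep in self.requires():
            h.update(dep.checksum().encode("utf-8"))
        return h.hexdigest()

    def run(self):
        expected = self.checksum()          # compute the checksum before running
        self.do_run()
        self._store[self.key()] = expected  # store it only after the work succeeded

    def complete(self):
        # Done only if the stored checksum matches the freshly computed one.
        return self._store.get(self.key()) == self.checksum()


class A(ChecksumTask):
    salt = "a1"

    def do_run(self):
        pass  # real work would go here


class B(ChecksumTask):
    salt = "b1"

    def requires(self):
        return [A()]

    def do_run(self):
        pass  # real work would go here


A().run()
B().run()
done_before = (A().complete(), B().complete())

A.salt = "a2"  # simulate fixing a bug in A and bumping its salt
done_after = (A().complete(), B().complete())
```

Before the salt change both tasks report complete. Afterwards neither does, so a scheduler that relies on `complete()` would rerun A and then B.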

Steven An

Feb 16, 2017, 1:36:21 PM

to Dave Buchfuhrer, Luigi

Ah I didn't know there was a "complete" function I could override. Sounds good.

It'll hurt a bit, but I think doing all that is feasible with my project here. Thanks much.

Dave Buchfuhrer

Feb 16, 2017, 1:40:26 PM

to Steven An, Luigi

Yeah, by default the complete function just checks whether the task's outputs exist, so you need to override it if you have something more complicated in mind. Good luck!
