Luigi for distributed workflows

dmmt...@gmail.com

Nov 3, 2015, 11:04:41 AM
to Luigi
We've got a data analysis process that consists of many steps which would, in an ideal world, run on different machines in the Amazon cloud. None of the steps are generic: each requires a custom Python or R script, and many of them are quite memory-intensive. From what I've seen, this type of workflow isn't directly supported by Luigi; however, given Luigi's rich feature set apart from distributed workflows, I've got to think others have taken this path (or done something like it). Are there plans to support distributed workflows in Luigi? At the moment, we're thinking of wrapping each step in a micro-service and then letting Luigi call each of those services. Does this seem wise? Is there another workflow management system we should consider instead of Luigi? Thanks for your time!

Dave Buchfuhrer

Nov 3, 2015, 11:24:34 AM
to dmmt...@gmail.com, Luigi
If I'm understanding you correctly, this is exactly the type of workflow Luigi supports. Each custom job should be implemented as a separate Python Task class and scheduled on the machine you want it to run on. On the other machines, define matching ExternalTasks to use in the requires() functions when depending on tasks run elsewhere. Is something preventing you from doing it this way?


Erik Bernhardsson

Nov 3, 2015, 12:24:56 PM
to dmmt...@gmail.com, Luigi
Yes, to Dave's point, this is well supported by Luigi.

For farming out jobs, the easiest thing is to trigger the exact same workflow on many machines and use a shared filesystem like S3. The central scheduler will make sure the same job isn't run more than once.


martinr...@gmail.com

Nov 30, 2015, 8:40:08 PM
to Luigi, dmmt...@gmail.com
Hi,

Sorry to bother you, but I have the same question.

I have a workflow that will run hundreds of tasks. If I trigger the workflow on more than one machine, all pointing to the same central scheduler, and use shared storage such as S3 for all intermediate task outputs, will the central scheduler enforce that a single task (with a certain parameter set) is run only once (and not once on each machine)?

Thank you very much!

Martín.

Dave Buchfuhrer

Dec 1, 2015, 12:12:33 AM
to martinr...@gmail.com, Luigi, Dave Mitchell
It should run each task exactly once globally, as long as the task definitions are identical across machines and your complete() methods work correctly.

Erik Bernhardsson

Dec 1, 2015, 8:58:54 AM
to Dave Buchfuhrer, martinr...@gmail.com, Luigi, Dave Mitchell
Yes, in fact the purpose of the central scheduler is to be a locking mechanism as well as a visualization tool.

Erik Bernhardsson

Dec 1, 2015, 9:17:04 AM
to Martín Redolatti, Luigi
Yes I think so – actually not sure :)

On Tue, Dec 1, 2015 at 9:10 AM, Martín Redolatti <martinr...@gmail.com> wrote:
Thank you very much!

This will work as long as I avoid sending custom objects/unhashable things as task parameters, am I right?