Our experiences: How to make luigi workflows dynamic


Samuel Lampa

unread,
Mar 18, 2014, 3:01:23 PM3/18/14
to luigi...@googlegroups.com

Hi,

I just wanted to share our experience of how we figured we could make our luigi tasks more reusable, by injecting rather than hard-coding the upstream dependencies:

http://bionics.it/posts/making-luigi-workflows-dynamic
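
For readers skimming the thread, the pattern described in the post can be sketched without Luigi itself. Below is a simplified stand-in (class names are illustrative, not from the post) where the upstream dependency is injected at workflow-definition time instead of being hard-coded in requires():

```python
class Task:
    """Minimal stand-in for luigi.Task, just enough to show the wiring."""

    def requires(self):
        return None


class RawData(Task):
    """Pretend upstream task producing some raw data."""


class Normalize(Task):
    """Downstream task whose upstream dependency is injected, not hard-coded."""

    def __init__(self, upstream=None):
        self.upstream = upstream

    def requires(self):
        # Return whatever was wired in at workflow-definition time
        return self.upstream


# The workflow script decides the wiring:
workflow = Normalize(upstream=RawData())
```

The point is that Normalize never names RawData; the workflow script does, so the same task class can be rewired into different workflows.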

Any feedback on our approach would be highly appreciated!

Maybe working in this way is self-evident to you? ... but at least for us it took some thinking before we figured this out =) ... so we thought someone else might find this useful.

Cheers
Samuel

http://twitter.com/smllmp

Erik Bernhardsson

unread,
Mar 18, 2014, 4:18:52 PM3/18/14
to Samuel Lampa, luigi...@googlegroups.com
Cool! We don't do anything like this – basically we just have a bunch of superclasses that we inherit from whenever needed.

This is how I would implement your stuff: https://gist.github.com/erikbern/9628609

But it's a matter of taste :) Interesting to see how Luigi supports multiple paradigms
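
Roughly, the subclassing flavor Erik describes looks like this (a plain-Python sketch with made-up names, not the actual gist):

```python
class Task:
    """Minimal stand-in for luigi.Task."""

    def requires(self):
        return None


class RawData(Task):
    """Upstream task."""


class NormalizeBase(Task):
    """Reusable logic (run()/output() in real Luigi) lives in the superclass."""


class NormalizeRawData(NormalizeBase):
    """A thin subclass pins down the concrete dependency."""

    def requires(self):
        return RawData()
```

Here the wiring is fixed at class-definition time rather than passed in as an object, which keeps each workflow a small set of subclass declarations.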


--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Erik Bernhardsson
Engineering Manager, Spotify, New York

Erik Bernhardsson

unread,
Mar 18, 2014, 4:36:02 PM3/18/14
to Samuel Lampa, luigi...@googlegroups.com
For the partitioning_parameter, I would probably just put a static method on the base class called get_partitioning_parameter() that does the custom magic.

Normally we never have custom run() methods, we just rely on luigi.run()

Samuel Lampa

unread,
Mar 19, 2014, 5:33:53 AM3/19/14
to Erik Bernhardsson, luigi...@googlegroups.com
On 2014-03-18 21:18, Erik Bernhardsson wrote:
> Cool! We don't do anything like this – basically just have a bunch of
> superclasses that we inherit from any time
>
> This is how I would implement your stuff:
> https://gist.github.com/erikbern/9628609

Ah, right! Never thought of that! ... and indeed, that makes a lot of
sense. (And it answers my question in a previous mail about where the
run methods are implemented ...)

Will chew on that one, and see whether we also would be better off using
this approach ... Thanks for sharing!

BR
// Samuel





--
Developer at www.uppmax.uu.se/uppnex / www.farmbio.uu.se / rilpartner.se
G: http://google.com/+samuellampa
B: http://saml.rilspace.org
T: http://twitter.com/smllmp

Samuel Lampa

unread,
Mar 19, 2014, 12:22:58 PM3/19/14
to luigi...@googlegroups.com, Erik Bernhardsson
I guess the main factor that made us go with creating and modifying objects was that it becomes a tad easier to select the object to execute dynamically, by specifying a parameter to the workflow script, which we do in the longer code example in my post [1]. We were also imagining that we sometimes might want to change the wiring of parts of the workflow (e.g. swapping one component for another, or bypassing a component altogether), depending on input parameters.

That seems to be fully possible to do with subclassing as well, using something like [2], but as you say, it depends a bit on taste (and I would say it also depends a lot on the use case), and I personally feel I'd prefer avoiding the getattr() calls, which would be required in our case.

Anyway, it is indeed interesting to contrast different approaches, and I'm very interested to learn more of these kinds of hints, as we have a lot of challenging tasks ahead in the near future.

[1] http://bionics.it/posts/making-luigi-workflows-dynamic
[2] http://stackoverflow.com/a/4821120/340811


Cheers
Samuel

Erik Bernhardsson

unread,
Mar 19, 2014, 2:03:57 PM3/19/14
to Samuel Lampa, luigi...@googlegroups.com
On Wed, Mar 19, 2014 at 12:22 PM, Samuel Lampa <samuel...@gmail.com> wrote:
I guess the main factor that made us go with creating and modifying objects, was that it becomes a tad easier to select the object to execute dynamically, by specifying a parameter to the workflow script, which we do in the longer code example in my post [1]. We were also imagining that we sometimes might want to change the wiring of parts of the workflow (e.g. swapping one component for another, or bypassing a component altogether), depending on input parameters.

But you can do that in requires() right?

def requires(self):
   if self.my_param == 'foo': return SomeOtherTask()
   else: return AnotherTask()
 

That seems to be fully possible to do with sub-classing as well, using something like [2], but as you say, it is a bit depending on taste - and I would say it also depends a lot on the use case -, and I personally feel I'd prefer avoiding the getattr() calls, which would be required in our case.

Why are the getattr() calls needed?
 

Anyway, it is indeed interesting to contrast different approaches, and I'm very interested to learn more of these kind of hints, as we have a lot of challenging tasks ahead in the near future.


Cheers
Samuel



Samuel Lampa

unread,
Mar 19, 2014, 2:21:06 PM3/19/14
to Erik Bernhardsson, luigi...@googlegroups.com
On ons 19 mar 2014 19:03:57, Erik Bernhardsson wrote:
>
> On Wed, Mar 19, 2014 at 12:22 PM, Samuel Lampa <samuel...@gmail.com> wrote:
>
> I guess the main factor that made us go with creating and
> modifying objects, was that it becomes a tad easier to select the
> object to execute dynamically, by specifying a parameter to the
> workflow script, which we do in the longer code example in my post
> [1]. We were also imagining that we sometimes might want to change
> the wiring of parts of the workflow (e.g. swapping one component
> for another, or bypassing a component altogether), depending on
> input parameters.
>
>
> But you can do that in requires() right?
>
> def requires(self):
> if self.my_param == 'foo': return SomeOtherTask()
> else: return AnotherTask()

Yes, but then the API of the task changes, which means we would need to
update every other use of that task.

... but indeed, if working in the way you suggested, with just
subclassing our task super classes, and specifying the requirement at
workflow definition time, then this makes sense, yes!

>
>
> That seems to be fully possible to do with sub-classing as well,
> using something like [2], but as you say, it is a bit depending on
> taste - and I would say it also depends a lot on the use case -,
> and I personally feel I'd prefer avoiding the getattr() calls,
> which would be required in our case.
>
>
> Why are the getattr() calls needed?

In order to specify which class we want to run based on a text string.

But yeah, thinking about it, we could of course check it with an if
statement as well ... such as:

selected_task = <some code to get that from the command line>

if selected_task == "TaskA":
    task_to_run = TaskA()
elif selected_task == "TaskB":
    task_to_run = TaskB()
# ... etc.

I just thought that if we can pass the "selected_task" string to some
function that returns an instantiated object of the correct class
dynamically, then fewer changes will be needed when/if we add more
tasks to our workflow.
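
One way to get the string-to-class lookup without getattr() is an explicit registry, so adding a task is a single dict entry (task names here are hypothetical):

```python
class TaskA:
    pass


class TaskB:
    pass


# One central place to register runnable tasks; no getattr() needed.
TASK_REGISTRY = {
    "TaskA": TaskA,
    "TaskB": TaskB,
}


def task_from_name(name):
    """Instantiate the task class registered under the given name."""
    try:
        return TASK_REGISTRY[name]()
    except KeyError:
        raise ValueError("Unknown task: %s" % name)


task_to_run = task_from_name("TaskB")
```

Compared to getattr() on a module, the registry also gives a clean error for unknown names and documents in one place which tasks are meant to be run directly.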

But all in all, it feels like the differences between the approaches
diminish a bit when you push things far enough. It seems most things
are possible with either approach :)

BR
// Samuel




Erik Bernhardsson

unread,
Mar 19, 2014, 2:24:27 PM3/19/14
to Samuel Lampa, luigi...@googlegroups.com
On Wed, Mar 19, 2014 at 2:21 PM, Samuel Lampa <samuel...@gmail.com> wrote:

> Yes, but then the API of the task changes, which means we would need to update every other use of that task.

Why would you need to update it? Assuming you make backwards-compatible changes. E.g. the HadoopJobTask class has had the same API forever.

> ... but indeed, if working in the way you suggested, with just subclassing our task super classes, and specifying the requirement at workflow definition time, then this makes sense, yes!

> > Why are the getattr() calls needed?
>
> In order to specify which class we want to run based on a text string.

The command line interface lets you specify which task you want to run:

python my_workflow.py --task MyTask --foo-param 42 (using optparse)
python my_workflow.py MyTask --foo-param 42 (using argparse)
 


Samuel Lampa

unread,
Mar 19, 2014, 2:37:04 PM3/19/14
to Erik Bernhardsson, luigi...@googlegroups.com
On 2014-03-19 19:24, Erik Bernhardsson wrote:
> The command line interface lets you specify which task you want to run:
>
> python my_workflow.py --task MyTask --foo-param 42 (using optparse)
> python my_workflow.py MyTask --foo-param 42 (using argparse)

Right ... yeah, I see: with your subclassing approach this becomes
possible. Maybe that is how we should do it after all. Gotta try it
out and compare, I guess!

Many thanks for all the input!
Cheers
Samuel

Samuel Lampa

unread,
Apr 4, 2014, 6:45:39 AM4/4/14
to luigi...@googlegroups.com, Erik Bernhardsson
I've been thinking about the difference between these approaches, as we are trying to simplify our workflow script right now.

The one main possible drawback I see with the subclassing approach is with sending parameters to the different tasks in the workflow.

Not everybody may have this problem, but we are sending quite a lot of different parameters (defined when executing the workflow script) to tasks in all stages of the workflow, since we need to be able to change them for different runs on the same data.

With the subclassing approach, it seems (please correct me if I'm wrong!) that all parameters that are to be sent to "upstream" tasks have to be set on the most downstream task (the one that is executed), and passed on up to the correct upstream task.

This (in case there isn't a nice way around it) creates a problem, since the tasks don't become really independent, as parts of their APIs end up "contaminating" the APIs of downstream tasks.

Anyway, if you know a nice way around that, we would be very interested to know (since the subclassing approach overall makes the workflow script so much simpler)!
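
To make the drawback concrete, here is a plain-Python sketch (task and parameter names are invented): the downstream task has to accept an upstream-only parameter just so it can forward it.

```python
class FilterData:
    """Upstream task: the only task that actually uses 'cutoff'."""

    def __init__(self, cutoff):
        self.cutoff = cutoff


class TrainModel:
    """Downstream task: 'cutoff' appears in its signature only to be
    forwarded upstream -- the API "contamination" described above."""

    def __init__(self, learning_rate, cutoff):
        self.learning_rate = learning_rate
        self.cutoff = cutoff  # not used here, just passed on

    def requires(self):
        return FilterData(cutoff=self.cutoff)


train = TrainModel(learning_rate=0.1, cutoff=5)
```

If FilterData is ever swapped out, TrainModel's signature (and every place that constructs it) has to change too, even though TrainModel's own behavior did not.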

Cheers
// Samuel


Erik Bernhardsson

unread,
Apr 4, 2014, 10:05:05 AM4/4/14
to Samuel Lampa, luigi...@googlegroups.com
Can you give an example? I'm not sure what you mean by "contaminating" downstream tasks.

I look at tasks as functions that are deterministic, meaning f(x, y) always evaluates to the same thing. If h is downstream from f, then h(x, y, z) could depend on f(x, y) and g(x, z), or anything else. I think this makes it explicit what parameters h(...) depends on.
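
Erik's functional view can be written down directly as a sketch (nothing Luigi-specific): each task is a deterministic function of its parameters, and a downstream task's signature makes explicit everything its result depends on.

```python
def f(x, y):
    # Deterministic: same inputs always give the same output
    return x + y


def g(x, z):
    return x * z


def h(x, y, z):
    # h's parameter list states explicitly what its result depends on,
    # even though it delegates work to the "upstream" f and g
    return f(x, y) - g(x, z)
```

Because h(x, y, z) is fully determined by its arguments, it can be cached, re-run, or scheduled idempotently, which is the property Luigi's output targets rely on.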

Samuel Lampa

unread,
Apr 4, 2014, 10:23:25 AM4/4/14
to luigi...@googlegroups.com, Samuel Lampa
Very good point! 

Actually, I think what made me think about this is that we are sending quite a lot of parameters that mostly affect how we want the job to run technically (e.g. for tasks that launch Slurm sbatch jobs, we specify the number of nodes / cores etc., and even a switch for whether to run it locally or as an sbatch job, depending on whether we are on a node that we have exclusive access to).

But I'm now starting to wonder whether we could separate out this kind of "technical config" stuff and feed it to the workflow in a different way, such as by using a config file.

... then what would remain is the parameters that define the output of each task, and then your point makes a lot of sense.

Will chew on that.
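
Separating out the technical config could look like this with the standard library's configparser (section and option names are made up for illustration; read_string() keeps the sketch self-contained, where a real workflow would load a file):

```python
import configparser

config = configparser.ConfigParser()
# In practice: config.read("workflow.cfg")
config.read_string("""
[slurm]
nodes = 4
cores_per_node = 16
run_locally = false
""")

# Technical knobs come from config, not from task parameters,
# so they never show up in any task's API.
nodes = config.getint("slurm", "nodes")
cores = config.getint("slurm", "cores_per_node")
run_locally = config.getboolean("slurm", "run_locally")
```

Task parameters then stay reserved for things that actually determine the output, matching Erik's deterministic-function view.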

Cheers
// Samuel    



Erik Bernhardsson

unread,
Apr 4, 2014, 10:24:24 AM4/4/14
to Samuel Lampa, luigi...@googlegroups.com
That makes a lot of sense. Anything that doesn't affect the result, but just the way things are computed, should probably not be a parameter to the task, but rather some kind of config



Samuel Lampa

unread,
Apr 10, 2014, 8:03:45 PM4/10/14
to luigi...@googlegroups.com, Samuel Lampa
Trying this approach now.

How would you suggest passing or sharing a central configuration between multiple independent tasks?

1. Pass a config object (from ConfigParser) via a luigi.Parameter()?
2. Pass a config filename as a luigi.Parameter()?
3. Load a common hard-coded config file name?
4. Load a config file from some superclass common to my workflow?
5. Something else?

... and if 4, where would I do that, as I understand using custom constructors is discouraged in luigi tasks?

Cheers,
// Samuel

Samuel Lampa

unread,
Apr 11, 2014, 6:46:19 AM4/11/14
to luigi...@googlegroups.com, Samuel Lampa
Btw, I have tried this approach now, and I think the code at https://gist.github.com/samuell/10455535 illustrates the "problem" I was thinking of:

From around "MMCreateSparseTrainDataset" and below, all downstream task classes have some five recurring parameters that are just used for passing on to the next task.

This means that if we ever want to e.g. remove or replace MMCreateSparseTrainDataset with a simpler component that does not need all those parameters, or add a filtering step that needs a lot of extra parameters, then we will have to add those parameters, and the passing-on code, in every downstream subclass component too.

That is of course fully doable, but it also of course makes it more laborious to define and change workflows.

I have some workarounds in mind, such as passing a parameter struct all the way through, from which each task just takes what it needs (maybe with the ability to specify it as a string representation to each class, which is then parsed internally into a struct), but I wanted to describe this in case you have some great hints to share (you have shared many great hints before!) :)
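
The parameter-struct workaround could be sketched like this (names invented): one namespace object is threaded through the chain, and each task picks only the fields it needs.

```python
from types import SimpleNamespace

params = SimpleNamespace(cutoff=5, learning_rate=0.1, n_folds=3)


class CreateDataset:
    def __init__(self, params):
        self.cutoff = params.cutoff  # takes only what it needs


class TrainModel:
    def __init__(self, params):
        self.learning_rate = params.learning_rate
        self.n_folds = params.n_folds
        self.params = params  # keep the whole struct to pass upstream

    def requires(self):
        return CreateDataset(self.params)


train = TrainModel(params)
```

The trade-off is that the struct hides which fields each task actually depends on, which is exactly the explicitness Erik argued for, so it is a convenience bought at the cost of transparency.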

Cheers
// Samuel




Joe Crobak

unread,
Apr 11, 2014, 9:38:03 AM4/11/14
to Samuel Lampa, luigi...@googlegroups.com
On Thu, Apr 10, 2014 at 8:03 PM, Samuel Lampa <samuel...@gmail.com> wrote:
Trying this approach now.

How would you suggest passing or sharing a central configuration between multiple independent tasks?

1. Pass a config object (from ConfigParser) via a luigi.Parameters?
2. Pass a config filename as a luigi.Parameter()?
3. Load a common hard-coded config file name?
4. Load a config file from some superclass common to my workflow?
5. Something else?

You might want to try the `default_from_config` kwarg to luigi.Parameter: http://luigi.readthedocs.org/en/latest/api/luigi.html#luigi.parameter.Parameter

This will read values from the luigi configuration, in which you can create your own section. You can place the values in `client.cfg`, `/etc/luigi/client.cfg` or a file specified at `LUIGI_CONFIG_PATH`. This makes it easy to have different configs for different environments (prod, dev, etc) while still letting you override values when launching tasks on the command-line.

For example:

client.cfg:

[customoutput]
pathname=/some/path/in/hdfs

constants.py:

BASE_OUTPUT=dict(section='customoutput', name='pathname')

tasks.py:

class Task1(luigi.Task):
  output_base = luigi.Parameter(default_from_config=BASE_OUTPUT)

class Task2(luigi.Task):
  output_base = luigi.Parameter(default_from_config=BASE_OUTPUT)


Does that help solve your problem?

Joe

Samuel Lampa

unread,
Apr 11, 2014, 10:01:45 AM4/11/14
to Joe Crobak, luigi...@googlegroups.com
Hello,

That sounds like an excellent solution!

The only thing I didn't really follow is how the code in
"constants.py" works in your example.

... does the "dict(...)" code retrieve info from client.cfg
automatically somehow?

Cheers
// Samuel


Joe Crobak

unread,
Apr 11, 2014, 11:39:35 AM4/11/14
to Samuel Lampa, luigi...@googlegroups.com
On Fri, Apr 11, 2014 at 10:01 AM, Samuel Lampa <samuel...@gmail.com> wrote:
Hello,

That sounds like an excellent solution!

The only thing I didn't really follow is how the code in the "constants.py" works, in your example?

... does the "dict(...)" code retrieve info from client.cfg automatically somehow?

Nope, it's nothing magic, just setting up the dict argument for the default_from_config. This would be equivalent (but less DRY):

class Task1(luigi.Task):
  output_base = luigi.Parameter(default_from_config=dict(section='customoutput', name='pathname'))

class Task2(luigi.Task):
  output_base = luigi.Parameter(default_from_config=dict(section='customoutput', name='pathname'))
 
I find it's useful to have constants if you're reusing the same config value multiple places.

Joe



Samuel Lampa

unread,
Apr 11, 2014, 11:47:12 AM4/11/14
to Joe Crobak, luigi...@googlegroups.com
Ah, kinda missed that part, but that makes sense, many thanks!

BR
// Samuel

On fre 11 apr 2014 17:39:35, Joe Crobak wrote:
> On Fri, Apr 11, 2014 at 10:01 AM, Samuel Lampa <samuel...@gmail.com
> <mailto:samuel...@gmail.com>> wrote:
>
> Hello,
>
> That sounds like an excellent solution!
>
> The only thing I didn't really follow is how the code in the
> "constants.py" works, in your example?
>
> ... does the "dict(...)" code retrieve info from client.cfg
> automatically somehow?
>
> Nope, it's nothing magic, just setting up the dict argument for the
> default_from_config. This would be equivalent (but less DRY):
>
> class Task1(luigi.Task):
> output_base =
> luigi.Parameter(default_from_config=dict(section='__customoutput',
> name='pathname'))
>
> class Task2(luigi.Task):
> output_base =
> luigi.Parameter(default_from_config=dict(section='__customoutput',
> name='pathname'))
> I find it's useful to have constants if you're reusing the same config
> value multiple places.
>
> Joe
>
> Cheers
> // Samuel
>
>
> On fre 11 apr 2014 15:38:03, Joe Crobak wrote:
>
>
>
>
> On Thu, Apr 10, 2014 at 8:03 PM, Samuel Lampa
> <samuel...@gmail.com <mailto:samuel...@gmail.com>
> <mailto:samuel...@gmail.com
> <mailto:samuel...@gmail.com>__>> wrote:
>
> Trying this approach now.
>
> How would you suggest passing or sharing a central
> configuration,
> between multiple independent
>
> 1. Pass a config object (from ConfigParser) via a
> luigi.Parameters?
> 2. Pass a config filename as a luigi.Parameter()?
> 3. Load a common hard-coded config file name?
> 4. Load a config file from some superclass common to my
> workflow?
> 5. Something else?
>
>
> You might want to try the `default_from_config` kwarg to
> luigi.Parameter:
> http://luigi.readthedocs.org/__en/latest/api/luigi.html#__luigi.parameter.Parameter
> <http://luigi.readthedocs.org/en/latest/api/luigi.html#luigi.parameter.Parameter>
>
> This will read values from the luigi configuration, in which
> you can
> create your own section. You can place the values in `client.cfg`,
> `/etc/luigi/client.cfg` or a file specified at
> `LUIGI_CONFIG_PATH`.
> This makes it easy to have different configs for different
> environments (prod, dev, etc) while still letting you override
> values
> when launching tasks on the command-line.
>
> For example:
>
> *client.cfg:*
>
> [customoutput]
> pathname=/some/path/in/hdfs
>
> *constants.py:*
>
>
> BASE_OUTPUT=dict(section='__customoutput', name='pathname')
>
> *tasks.py:*
>
>
> class Task1(luigi.Task):
> output_base = luigi.Parameter(default_from___config=BASE_OUTPUT)
>
> class Task2(luigi.Task):
> output_base = luigi.Parameter(default_from___config=BASE_OUTPUT)

Erik Bernhardsson

unread,
Apr 13, 2014, 7:26:09 PM4/13/14
to Samuel Lampa, luigi...@googlegroups.com
On Fri, Apr 11, 2014 at 6:46 AM, Samuel Lampa <samuel...@gmail.com> wrote:
Btw, I have tried this approach now, and I think the code at https://gist.github.com/samuell/10455535 does illustrate the "problem" I was thinking of:

From around "MMCreateSparseTrainDataset" and below, all downstream task classes have some five recurring parameters that are just used for passing on to the next one.

This means that if we ever want to, e.g., remove or replace MMCreateSparseTrainDataset with a simpler component that does not need all those parameters, or add a filtering step that needs a lot of extra parameters, we will have to add those parameters, and the passing-on code, to every downstream subclass component too.

That is of course fully doable, but also of course makes it more laborious to define and change workflows.

You could of course make some of the parameters have default values that make sense.

There is also some support for global params, although it's not super great. Or config (as Joe pointed out).

There's also some experimental support for class decorators to remove some of the boilerplate in luigi.util: https://github.com/spotify/luigi/blob/master/luigi/util.py
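The idea behind those luigi.util decorators can be sketched without Luigi itself. Here is a hypothetical, Luigi-free illustration of how a class decorator can copy an upstream task's parameter declarations onto a downstream task, so they do not have to be re-declared by hand; the names `Parameter`, `inherits_params`, `CreateDataset` and `TrainModel` are stand-ins for this sketch, not Luigi's actual API:

```python
# Luigi-free sketch of the boilerplate-removal idea: a class
# decorator that copies parameter declarations from an upstream
# task class onto the decorated downstream class.
class Parameter:
    """Stand-in for luigi.Parameter: just records a default value."""
    def __init__(self, default=None):
        self.default = default

def inherits_params(upstream_cls):
    """Copy every Parameter attribute of upstream_cls onto the
    decorated class, unless the class already declares it itself."""
    def decorator(cls):
        for name, value in vars(upstream_cls).items():
            if isinstance(value, Parameter) and not hasattr(cls, name):
                setattr(cls, name, value)
        return cls
    return decorator

class CreateDataset:
    replicate_id = Parameter(default='r1')
    sampling_rate = Parameter(default=0.5)

@inherits_params(CreateDataset)
class TrainModel:
    # No need to re-declare replicate_id / sampling_rate here.
    learning_rate = Parameter(default=0.01)

print(TrainModel.replicate_id.default)  # r1
```

With a pattern like this, adding or removing a parameter upstream no longer means editing every downstream class, which is exactly the pain point described above.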

Btw – why do none of your task classes have an output() method? Seems strange

Samuel Lampa

unread,
Apr 14, 2014, 4:25:25 AM4/14/14
to luigi...@googlegroups.com, Samuel Lampa


On Monday, April 14, 2014 1:26:09 AM UTC+2, erikbern wrote:



On Fri, Apr 11, 2014 at 6:46 AM, Samuel Lampa <samuel...@gmail.com> wrote:
Btw, I have tried this approach now, and I think the code at https://gist.github.com/samuell/10455535 does illustrate the "problem" I was thinking of:

From around "MMCreateSparseTrainDataset" and below, all downstream task classes have some five recurring parameters that are just used for passing on to the next one.

This means that if we ever want to, e.g., remove or replace MMCreateSparseTrainDataset with a simpler component that does not need all those parameters, or add a filtering step that needs a lot of extra parameters, we will have to add those parameters, and the passing-on code, to every downstream subclass component too.

That is of course fully doable, but also of course makes it more laborious to define and change workflows.

You could of course make some of the parameters have default values that make sense.

Yeah, but I guess the problem with our workflow is that we have 4-6 parameters that we change all the time (nested loops of variations of those params).
 

There is also some support for global params, although it's not super great. Or config (as Joe pointed out).

Right, I saw something about that in the code, but was not sure how to use it. Looks interesting though!
 

There's also some experimental support for class decorators to remove some of the boilerplate in luigi.util: https://github.com/spotify/luigi/blob/master/luigi/util.py

Btw – why do none of your task classes have an output() method? Seems strange

Since only the requires() method is needed for defining the dependency anyway, we have defined the output() method only in the original super classes.

Samuel Lampa

unread,
Apr 14, 2014, 10:30:48 AM4/14/14
to luigi...@googlegroups.com, Samuel Lampa
I have now implemented a solution where we just pass an "arguments dict" [1] up through all the upstream tasks. That way we only need to pass one parameter through all tasks, and each task can read just the info it wants, regardless of where in the chain it is.

In summary, this works great, and all in all, the approach you suggested of subclassing the original classes simplifies our workflow definitions file enormously, so a big thanks for pointing that out! :)

Best
// Samuel

[1] In string format (evaluated on the fly to a real dict), so that it can be specified on the command line to any of the tasks.
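One safe way to do that on-the-fly evaluation is `ast.literal_eval` from the standard library, which accepts only Python literals and so is safer than `eval()` for values coming from the command line. This is a sketch of the idea, not Samuel's actual code; the parameter name `--args` and the keys inside the dict are hypothetical:

```python
# Sketch of passing one "arguments dict" as a single string
# parameter and evaluating it into a real dict inside the task.
import ast

def parse_args_param(args_string):
    """Turn a dict-literal string (e.g. from the command line)
    into a real dict, rejecting anything that is not a dict."""
    args = ast.literal_eval(args_string)
    if not isinstance(args, dict):
        raise ValueError('expected a dict literal, got %r' % args_string)
    return args

# As it might look on the command line:
#   --args "{'replicate_id': 'r1', 'sampling_rate': 0.5}"
args = parse_args_param("{'replicate_id': 'r1', 'sampling_rate': 0.5}")
print(args['replicate_id'])  # r1
```

Each task can then look up only the keys it cares about, while the whole dict is forwarded unchanged to its upstream dependencies.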

Samuel Lampa

unread,
Apr 14, 2014, 10:32:41 AM4/14/14
to luigi...@googlegroups.com, Samuel Lampa
We are very interested to hear any updates regarding the global parameters feature you mentioned, though! It would be awesome to have something like that natively supported by Luigi!

Cheers
// Samuel
Reply all
Reply to author
Forward
0 new messages