Hi,
I just wanted to share our experience of how we figured out we could make our Luigi tasks more reusable, by injecting rather than hard-coding the upstream dependencies:
http://bionics.it/posts/making-luigi-workflows-dynamic
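As a rough illustration of the idea (a minimal sketch with made-up task names, using a tiny stand-in base class so it runs without luigi installed; real tasks would subclass luigi.Task):

```python
# Minimal stand-in for luigi.Task, so the sketch runs without luigi
# installed; real tasks would subclass luigi.Task instead.
class Task:
    def requires(self):
        return []

class RawData(Task):
    pass

# Hard-coded dependency: the upstream task is baked into the class,
# so reusing it with another input means editing the class itself.
class ProcessDataHardcoded(Task):
    def requires(self):
        return RawData()

# Injected dependency: the upstream task is assigned at workflow
# definition time, so the same class can be rewired freely.
class ProcessData(Task):
    upstream = None  # set by the workflow script

    def requires(self):
        return self.upstream

# Workflow definition: wire tasks together by setting attributes.
raw = RawData()
proc = ProcessData()
proc.upstream = raw
assert proc.requires() is raw
```

The point is that the wiring lives in the workflow script, not inside the task classes.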
Any feedback on our approach would be highly appreciated!
Maybe working this way is self-evident to you? ... but at least for us it took some thinking before we figured it out =) ... so we thought someone else might find it useful as well.
Cheers
Samuel
--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
I guess the main factor that made us go with creating and modifying objects was that it becomes a tad easier to select the object to execute dynamically, by specifying a parameter to the workflow script, which we do in the longer code example in my post [1]. We were also imagining that we might sometimes want to change the wiring of parts of the workflow (e.g. swapping one component for another, or bypassing a component altogether) depending on input parameters.
That seems to be fully possible to do with sub-classing as well, using something like [2], but as you say, it is partly a matter of taste - and I would say it also depends a lot on the use case - and I personally feel I'd prefer to avoid the getattr() calls that would be required in our case.
Anyway, it is indeed interesting to contrast different approaches, and I'm very interested to learn more hints of this kind, as we have a lot of challenging tasks ahead in the near future.
Cheers
Samuel
On Wednesday, March 19, 2014 10:33:53 AM UTC+1, Samuel Lampa wrote:
On 2014-03-18 21:18, Erik Bernhardsson wrote:
> Cool! We don't do anything like this – basically just have a bunch of
> superclasses that we inherit from any time
>
> This is how I would implement your stuff:
> https://gist.github.com/erikbern/9628609
Ah, right! Never thought of that! ... and indeed, that makes a lot of
sense. (And it answers my question in a previous mail about where the
run methods are implemented ...)
Will chew on that one, and see whether we also would be better off using
this approach ... Thanks for sharing!
BR
// Samuel
On ons 19 mar 2014 19:03:57, Erik Bernhardsson wrote:
I guess the main factor that made us go with creating and
modifying objects, was that it becomes a tad easier to select the
object to execute dynamically, by specifying a parameter to the
workflow script, which we do in the longer code example in my post
[1]. We were also imagining that we sometimes might want to change
the wiring of parts of the workflow (e.g. swapping one component
for another, or bypassing a component altogether), depending on
input parameters.
But you can do that in requires() right?
def requires(self):
    if self.my_param == 'foo':
        return SomeOtherTask()
    else:
        return AnotherTask()
Yes, but then the API of the task changes, which means we need to update any other use of that task ... but indeed, if working in the way you suggested, with just subclassing our task superclasses and specifying the requirement at workflow definition time, then this makes sense, yes!
That seems to be fully possible to do with sub-classing as well, using something like [2], but as you say, it is partly a matter of taste - and I would say it also depends a lot on the use case - and I personally feel I'd prefer to avoid the getattr() calls that would be required in our case.
Why are the getattr() calls needed?
In order to specify which class we want to run, based on a text string.
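The getattr() selection being discussed could look something like this (class and function names are made up for illustration):

```python
import sys

# Hypothetical task classes; in a real workflow these would be
# luigi tasks registered in this module.
class DataImport:
    pass

class DataFilter:
    pass

# Resolve a class from a text string (e.g. a command line argument)
# by looking it up in the current module's namespace with getattr().
def task_class_from_name(name):
    return getattr(sys.modules[__name__], name)

cls = task_class_from_name('DataFilter')
assert cls is DataFilter
```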
--
On 2014-03-19 19:24, Erik Bernhardsson wrote:
Can you give an example? I'm not sure what you mean by "contaminating" downstream tasks.
I look at tasks as deterministic functions, meaning f(x, y) always evaluates to the same value. If h is downstream from f, then h(x, y, z) could depend on f(x, y) and g(x, z), or anything else. I think this makes it explicit which parameters h(...) depends on.
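A toy illustration of that view, with plain functions standing in for tasks:

```python
# Tasks viewed as deterministic functions: h's signature lists every
# parameter its upstream tasks f and g need, making the dependencies
# explicit (toy arithmetic stands in for real task logic).
def f(x, y):
    return x + y

def g(x, z):
    return x * z

def h(x, y, z):
    return f(x, y) - g(x, z)

# Same inputs always give the same result:
assert h(2, 3, 4) == h(2, 3, 4) == -3
```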
On Fri, Apr 4, 2014 at 6:45 AM, Samuel Lampa <samuel...@gmail.com> wrote:
I've been thinking about the difference between these approaches, as we are trying to simplify our workflow script right now.
The one main possible drawback I see with the subclassing approach is with sending parameters to the different tasks in the workflow. Not everybody might have this problem, but we are sending quite a lot of different parameters (defined when executing the workflow script) to tasks in all stages of the workflow, since we need to be able to change those for different runs on the same data.
With the subclassing approach, it seems (please correct me if I'm wrong!) that all parameters that are to be sent to "upstream" tasks have to be set on the most downstream task (the one that is executed), and passed on up until they reach the correct upstream task.
This (in case there isn't a nice way around it) creates a problem, since the tasks don't become really independent: parts of their API:s end up "contaminating" the API:s of downstream tasks.
Anyway, if you know a nice way around that, we would be very interested to hear it (since the subclassing approach overall makes the workflow script so much simpler)!
Cheers
// Samuel
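The pass-through problem described above could be sketched like this (task and parameter names are made up, and a tiny stand-in replaces luigi's Task/Parameter machinery so the example is self-contained):

```python
# Stand-in for luigi tasks with keyword parameters; in real luigi,
# parameters are declared as class-level luigi.Parameter attributes.
class Task:
    def __init__(self, **params):
        for key, value in params.items():
            setattr(self, key, value)

class CreateTrainDataset(Task):
    pass  # actually uses replicate_id and feature_type

class TrainModel(Task):
    # Must accept replicate_id and feature_type only to hand them on
    # upstream, even though it never uses them itself.
    def requires(self):
        return CreateTrainDataset(replicate_id=self.replicate_id,
                                  feature_type=self.feature_type)

# The most downstream task's API carries every upstream parameter:
model = TrainModel(replicate_id='r1', feature_type='sparse',
                   learning_rate=0.1)
assert model.requires().replicate_id == 'r1'
```

Swapping out CreateTrainDataset would then mean touching TrainModel's parameter list too, which is the "contamination" in question.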
On Wednesday, March 19, 2014 7:37:04 PM UTC+1, Samuel Lampa wrote:
--
Erik Bernhardsson
Engineering Manager, Spotify, New York
--
Developer at www.uppmax.uu.se/uppnex / www.farmbio.uu.se / rilpartner.se
G: http://google.com/+samuellampa
B: http://saml.rilspace.org
T: http://twitter.com/smllmp
Trying this approach now.
How would you suggest passing or sharing a central configuration between multiple independent tasks?
1. Pass a config object (from ConfigParser) via a luigi.Parameter?
2. Pass a config filename as a luigi.Parameter()?
3. Load a common hard-coded config file name?
4. Load a config file from some superclass common to my workflow?
5. Something else?
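A minimal sketch of option 2, using only the standard library (section and key names are made up, and the file contents are inlined as a string so the example is self-contained; a real task would receive the filename as a luigi.Parameter and call config.read() on it):

```python
import configparser

# Hypothetical config contents, inlined instead of read from a file
# so the sketch runs on its own.
CONFIG_TEXT = """
[paths]
data_dir = /data/experiments
"""

class MyTask:
    # In luigi this would be a task with a config-file Parameter;
    # here a plain class parses the configuration itself.
    def __init__(self, config_text):
        self.config = configparser.ConfigParser()
        self.config.read_string(config_text)

    def run(self):
        return self.config.get('paths', 'data_dir')

task = MyTask(CONFIG_TEXT)
assert task.run() == '/data/experiments'
```

Passing only a filename keeps the parameter a plain string, which tends to play nicer with luigi's parameter serialization than passing a whole config object.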
Hello,
That sounds like an excellent solution!
The only thing I didn't really follow is how the code in "constants.py" works in your example.
... does the "dict(...)" code retrieve info from client.cfg automatically somehow?
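One way a constants.py could pull values from client.cfg into a dict is something like this (a stdlib sketch of the general pattern, with made-up section and key names and the file contents inlined; not necessarily what the example actually does):

```python
import configparser

# constants.py sketch: parse the config once at import time and
# expose its values as a plain dict.
_cfg = configparser.ConfigParser()
_cfg.read_string("""
[core]
data-dir = /data
""")

# dict(...) over the section's (key, value) pairs:
PATHS = dict(_cfg.items('core'))
assert PATHS['data-dir'] == '/data'
```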
Btw, I have tried this approach now, and I think the code at https://gist.github.com/samuell/10455535 illustrates the "problem" I was thinking of:
From around "MMCreateSparseTrainDataset" and below, all downstream task classes have some 5 recurring parameters that are just used for passing on to the next one.
This means that if we ever want to e.g. remove or replace MMCreateSparseTrainDataset with a simpler component that does not need all those parameters, or add a filtering step that needs a lot of extra parameters, then we will have to add those parameters, and the passing-on code, in every downstream subclass component too.
That is of course fully doable, but of course also makes it more laborious to define and change workflows.
On Fri, Apr 11, 2014 at 6:46 AM, Samuel Lampa <samuel...@gmail.com> wrote:
[...]
You could of course make some of the parameters have default values that make sense.
There is also some support for global params, although it's not super great. Or config (as Joe pointed out).
There's also some experimental support for class decorators to remove some of the boilerplate, in luigi.util: https://github.com/spotify/luigi/blob/master/luigi/util.py
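To illustrate the idea behind such a decorator (a plain-Python sketch, not luigi's actual implementation; class and parameter names are made up):

```python
# Sketch of a parameter-forwarding class decorator: copy the upstream
# class's parameter names onto the downstream class, so they need not
# be re-declared by hand in every subclass.
def inherits_params(upstream_cls):
    def decorator(cls):
        for name in upstream_cls.param_names:
            if name not in cls.param_names:
                # Build a new list rather than mutating a shared one.
                cls.param_names = cls.param_names + [name]
        return cls
    return decorator

class Task:
    param_names = []

class CreateTrainDataset(Task):
    param_names = ['replicate_id', 'feature_type']

@inherits_params(CreateTrainDataset)
class TrainModel(Task):
    param_names = ['learning_rate']

# TrainModel now carries the upstream parameters automatically:
assert TrainModel.param_names == ['learning_rate', 'replicate_id',
                                  'feature_type']
```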
Btw – why do none of your task classes have an output() method? Seems strange