True, we don't have two tasks producing the same output, but we regularly have commands that require multiple inputs and/or generate multiple outputs, typically some external command that we run through the HPC system.
To give a specific example, we have a task for sampling from a dataset into a "test" and a "train" partition (thus multiple output targets), where both of these outputs then take different routes downstream in the workflow: each undergoes conversion to a sparse dataset, after which the train dataset is used for training, the test dataset for evaluation, and so on.
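For concreteness, here is a minimal sketch of what such a multi-output task could look like in plain luigi (task and parameter names are made up, and the splitting logic is just a placeholder), using the fact that output() may return a dict of targets:

```python
import luigi

class SampleTrainTest(luigi.Task):
    """Sketch of a multi-output task: split one dataset into a 'train'
    and a 'test' partition."""
    dataset_path = luigi.Parameter()
    test_every_nth = luigi.IntParameter(default=5)

    def output(self):
        # output() may return a dict (or list) of targets, so a single
        # task can declare several output files
        return {
            'train': luigi.LocalTarget(self.dataset_path + '.train'),
            'test': luigi.LocalTarget(self.dataset_path + '.test'),
        }

    def run(self):
        with open(self.dataset_path) as infile, \
                self.output()['train'].open('w') as train, \
                self.output()['test'].open('w') as test:
            for i, line in enumerate(infile):
                # naive deterministic split, purely for illustration
                (test if i % self.test_every_nth == 0 else train).write(line)
```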
An example of multiple input targets is our assess component, which requires both the generated model file and the test dataset to use for the evaluation.
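Correspondingly, a multi-input task can declare several upstream dependencies in requires() and pick them apart by name in run(). A rough sketch (TrainModel is a hypothetical stand-in for our training task, and the SampleTrainTest sketch from above is reused):

```python
import luigi

class Assess(luigi.Task):
    """Sketch of a multi-input task: evaluate a model on the test partition."""
    dataset_path = luigi.Parameter()

    def requires(self):
        # requires() may also return a dict, giving named inputs
        return {
            'model': TrainModel(dataset_path=self.dataset_path),
            'split': SampleTrainTest(dataset_path=self.dataset_path),
        }

    def output(self):
        return luigi.LocalTarget(self.dataset_path + '.assessment')

    def run(self):
        # self.input() mirrors the structure returned by requires()
        model_path = self.input()['model'].path
        test_path = self.input()['split']['test'].path
        # here we would shell out to the external evaluation command
        # (e.g. via the HPC system), writing to self.output().path
```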
I totally understand that luigi tries to implement functional call/return semantics, where the caller is supposed to provide the outermost function with all the information needed to calculate its value "from scratch".
I also see that this is a very good fit for e.g. mathematics and other use cases where one typically has one input and one output per function, and where the whole idea is to build up a kind of "semantics" of what different values mean, so that values can be calculated from their mere definitions.
The only problem is that when trying to use luigi as a general-purpose workflow system, incorporating existing software, this model seems to run into its limits.
To give a concrete example of how we sometimes need to completely swap out the parts that produce a certain dataset / target, take this one:
We start our work with a number of datasets in a data format called "smiles" (basically a string representation of chemical molecules, one row per molecule). We have some 1800 different such smiles datasets, containing hundreds of lines / molecules each, and each dataset has a special meaning (such as that all molecules in it bind to a certain protein in the body). This data is originally extracted from an SQL database, using an SQL query roughly one A4 page long. We thus have a luigi task for running this SQL query.
Sometimes, though, we want to run a totally different smiles dataset that we have already extracted in some other way, and just run it through the workflow.
Then it clearly doesn't make sense to use the SQL extractor task; instead we might just use a task that subclasses luigi.ExternalTask and reads in the data from the file we already have.
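In luigi terms, that can be as simple as wrapping the pre-existing file in an external task whose output() just points at it (again a sketch, with made-up names):

```python
import luigi

class ExistingSmilesFile(luigi.ExternalTask):
    """Sketch: wrap an already-extracted smiles dataset on disk, so that
    downstream tasks can depend on it like on any other target."""
    file_path = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(self.file_path)
```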
Now one might think: why not just add a switch in the "extractor task" to read either directly from a file or from the SQL query?
But then the next day we might want to construct a completely new way to extract smiles datasets, one that doesn't use SQL databases at all but instead takes them from a MongoDB, for example.
We naturally want to put this new extractor in a new task (because it has a different purpose, might have different parameters, etc.). But now the problem arises that we cannot easily switch between these (by now three) different ways of getting smiles datasets, because our workflow is already littered with duplicated parameters specific to the SQL extractor code.
Sure, we can just go ahead and add more parameters to every task that might at some point sit downstream of the new MongoDB extractor task, but that takes time, makes the code less understandable, and makes it harder to reason about which parameters really belong to which tasks.
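To illustrate what "littered" means in practice: in plain luigi, a downstream task ends up declaring parameters that have nothing to do with its own job, only so it can forward them upstream in requires(). A hypothetical sketch of the problem:

```python
import luigi

class TrainModel(luigi.Task):
    """Sketch of the problem: training parameters mixed with extractor
    parameters that exist only to be forwarded upstream."""
    train_method = luigi.Parameter()    # genuinely belongs to this task
    db_host = luigi.Parameter()         # only forwarded to the extractor
    sql_query_file = luigi.Parameter()  # only forwarded to the extractor

    def requires(self):
        # ExtractSmilesFromSql is a hypothetical name for our SQL task
        return ExtractSmilesFromSql(db_host=self.db_host,
                                    sql_query_file=self.sql_query_file)
```

Swapping in the MongoDB extractor would mean replacing these pass-through parameters in every task along the chain.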
Thus, for us, the functional call/return paradigm just doesn't seem like the right fit. I think this is also why there are a number of different paradigms, such as data flow, which take a different approach to these things.
That is, in data flow, computations can have multiple inputs and outputs, and one of an array of different execution modes (push, pull, execute when all inputs are available, when only one input is available, etc.). Matt Carkci's contrast of the pull and push execution modes in data flow systems, at 2:50 in this video, highlights some of these differences:
https://www.youtube.com/watch?v=iFlT93wakVo#t=170 (where the pull model in fact is quite similar to the functional call/return semantics)
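In rough, deliberately simplified Python terms (nothing luigi-specific, just to make the contrast concrete), the two modes might be sketched like this:

```python
# Pull (call/return, luigi-style): the consumer recursively asks for its
# inputs, starting from the final value it wants.
def extract():      return 'smiles-data'
def sparsify(data): return 'sparse(%s)' % data
def assess(data):   return 'assessment(%s)' % data

result_pull = assess(sparsify(extract()))

# Push (data flow): each node forwards its output to whatever happens to
# be wired downstream; the wiring is separate from the computations.
class Node:
    def __init__(self, func):
        self.func = func
        self.downstream = []

    def connect(self, other):
        self.downstream.append(other)
        return other

    def fire(self, data=None):
        out = self.func() if data is None else self.func(data)
        for node in self.downstream:
            node.fire(out)

source = Node(extract)
source.connect(Node(sparsify)).connect(Node(assess))
source.fire()
```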
So, for us, the data flow paradigm in fact seems like a better fit. There, you can just feed a "black box" process with its required input (a smiles dataset), regardless of how you have generated it, and the whole workflow will just chew on, process the data, and return the result.
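A simplified sketch of what that wiring could look like on top of luigi, with the upstream dependency injected from the outside instead of hard-coded in requires(). This is not exactly our implementation, and real code needs more care, since luigi identifies and caches task instances by their parameters:

```python
import luigi

class ProcessSmiles(luigi.Task):
    """Sketch: a 'black box' step that only cares about receiving a smiles
    dataset, not about how it was produced."""
    upstream = None  # the concrete producer task is injected at wiring time

    def requires(self):
        return self.upstream

    def output(self):
        return luigi.LocalTarget(self.input().path + '.processed')

    def run(self):
        with self.input().open() as inp, self.output().open('w') as out:
            for line in inp:
                out.write(line)  # placeholder for the real processing

# Wiring: any producer of a smiles target can be plugged in, for example
# the ExistingSmilesFile wrapper sketched earlier, or the SQL extractor:
task = ProcessSmiles()
task.upstream = ExistingSmilesFile(file_path='dataset42.smiles')
luigi.build([task], local_scheduler=True)
```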
So, in summary, one could question whether we are sane in trying to rework luigi towards a more data flow inspired system, instead of just starting from a data flow inspired system in the first place.
I have in fact been thinking hard about that already, but the reason we still like to stick with luigi for the time being is firstly, of course, that we have already invested quite some time in it, but also that I/we REALLY do like a lot of its features:
- The central scheduler
- The automatically generated command line interface
- The web UI
- The logger
- And not least its light-weight nature, which let us remake it to become a slight bit more like a data flow system.
Those are awesome features, and when we saw that we could find a way to get the parts of data flow that we needed, we felt it would most probably be worth doing these few workarounds in order to keep using luigi, instead of spending time searching for, evaluating and learning some data flow inspired workflow engine out there, which we don't even know has all the other features we need.
Hope this clarifies our ambitions and where we're coming from a little!
So, at last, thanks for a very powerful system that has helped us tons, and even allowed us to stretch its dependency management a bit towards another paradigm! :)
Cheers
// Samuel