An (experimental) light-weight helper lib for command-line centric workflows in Luigi

Samuel Lampa

Feb 24, 2015, 3:06:02 PM

to luigi...@googlegroups.com

Hi folks,

I wanted to give an early heads-up about a (still quite experimental)
helper library that makes it really easy to quickly jot down workflows
that are heavily shell-command-centric, so that I can get feedback as
early as possible:

https://github.com/samuell/luigis_monkey_wrench

This little lib grew out of some slight frustration when I was at a
Next-Gen Sequencing Bioinformatics course last week and we were supposed
to run tons of hairy long commands manually in the shell (see for
yourself at [1]).

I decided to try to encode the steps in luigi instead to speed things
up, but I figured that with the additional coding needed to set up the
task classes etc., I would probably not finish the tutorial in time, so I
wrote this little thingy to help with that.

I also finished encoding the mentioned tutorial into a script with this
syntax:

https://gist.github.com/samuell/6da9a7c1e03912fde62e

This might be a serious abuse of the intended luigi usage patterns,
but it has worked well for this particular use case at least.

Anyways, feedback and ideas are very welcome!

Maybe we are doing something totally stupid and should do things in
another way?! Let the feedback come! :)

Cheers
// Samuel

[1] http://uppnex.se/twiki/do/view/Courses/NgsIntro1502/ResequencingAnalysis

--
Samuel Lampa
-----------------------------------------------
Systems Developer at BILS (bils.se)
PhD student at Uppsala University (farmbio.uu.se)
-----------------------------------------------
Blog: http://bionics.it
Twitter: http://twitter.com/smllmp
-----------------------------------------------

Alexander Krasnukhin

Feb 24, 2015, 7:09:56 PM

to Samuel Lampa, luigi...@googlegroups.com

Hej,

I find that the way you chain commands makes pipelines a bit awkward. If you want to move shell pipelines to luigi, then why not use an approach like plumbum and keep everything pipe-, Python- and luigi-friendly?

Here is a short example of how we can run a simple shell pipeline in luigi, mixing both worlds in a neat way. Note that we have both dependent tasks *and* a sed pipeline renaming ‘hej’ first to ‘foo’ and then ‘foo’ to ‘bar’.

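import luigi
import luigi.util
from plumbum.cmd import echo, sed

# PipeTask is the helper base class defined in the main.py linked below.

class HejTask(PipeTask):
    def pipe(self):
        # plumbum command object for: echo 'hej hej hej'
        return echo['hej hej hej']

    def output(self):
        return luigi.LocalTarget('hej.txt')

@luigi.util.requires(HejTask)
class FooBarTask(PipeTask):
    def pipe(self):
        # Two-stage sed pipeline: 'hej' -> 'foo', then 'foo' -> 'bar'
        return sed['s/hej/foo/g'] | sed['s/foo/bar/g']

    def output(self):
        return luigi.LocalTarget('bar.txt')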

https://github.com/themalkolm/luigi-plumbum/blob/master/main.py

Give it a try.

Also, if you write your shell scripts carefully, then you can easily convert them to luigi tasks on the fly, generating proper pipe() methods for dynamically created classes.
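
A rough sketch of that last idea, using only the Python standard library. It is an illustration under assumptions, not code from luigi-plumbum: `PipeTaskSketch`, `split_pipeline` and `make_task_class` are hypothetical names, the generated pipe() returns plain argument lists rather than plumbum commands, and every pipe symbol is assumed to be its own whitespace-separated token.

```python
import shlex


class PipeTaskSketch:
    """Hypothetical stand-in for a PipeTask-like base class."""

    def pipe(self):
        raise NotImplementedError


def split_pipeline(line):
    """Split "cmd a | cmd b" into [["cmd", "a"], ["cmd", "b"]].

    Assumption: each "|" is separated from its neighbours by whitespace.
    """
    stages, current = [], []
    for token in shlex.split(line):
        if token == "|":
            stages.append(current)
            current = []
        else:
            current.append(token)
    stages.append(current)
    return stages


def make_task_class(name, line):
    """Create a task class whose pipe() returns the parsed pipeline stages."""
    stages = split_pipeline(line)

    def pipe(self):
        # Return copies so that callers cannot mutate the stored stages.
        return [list(stage) for stage in stages]

    return type(name, (PipeTaskSketch,), {"pipe": pipe, "__doc__": line})


# One class per line of a (carefully written) shell script.
FooBarTask = make_task_class("FooBarTask", "sed s/hej/foo/g | sed s/foo/bar/g")
```

With a real base class, the generated pipe() would build and return the actual pipeline object instead of lists of strings.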

--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Regards,
Alexander

Samuel Lampa

Feb 25, 2015, 12:37:08 AM

to luigi...@googlegroups.com, samuel...@gmail.com

Thanks! Hadn't seen luigi-plumbum, so definitely will have a look!

One thing I note, though, and it is one of the driving motivations of our design, is that we want to make the wiring of multiple inputs and outputs easy.

This is something that we needed badly in the tutorial that drove the creation of the library in the first place [1].

So, while in [1] we wire each (named) task output to another (named) task input, it seems that in plumbum one defines only the dependencies between the tasks?

What we think we have realized so far is that we need to include the targets on an equal footing with the tasks in the dependency graph, since multiple outputs from a task can take independent, different downstream routes.
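
The idea of treating targets as nodes alongside tasks can be sketched roughly as follows. This is a plain-Python illustration and not the luigis_monkey_wrench API; the `Workflow` class and all task, port and file names are made up for the example.

```python
class Workflow:
    """Toy dependency graph in which targets (files) are nodes, just like tasks."""

    def __init__(self):
        self.produced_by = {}  # target -> (task, out-port name)
        self.consumed_by = {}  # target -> [(task, in-port name), ...]

    def add_output(self, task, out_port, target):
        """Declare that a named out-port of a task writes the given target."""
        self.produced_by[target] = (task, out_port)

    def connect(self, target, task, in_port):
        """Wire a target into a named in-port of a downstream task."""
        self.consumed_by.setdefault(target, []).append((task, in_port))

    def downstream(self, task):
        """Map each out-port of a task to the tasks that consume its target."""
        routes = {}
        for target, (producer, out_port) in self.produced_by.items():
            if producer == task:
                consumers = self.consumed_by.get(target, [])
                routes[out_port] = sorted(name for name, _ in consumers)
        return routes


wf = Workflow()
# One task with two outputs that take different downstream routes.
wf.add_output("align", "bam", "sample.bam")
wf.add_output("align", "log", "align.log")
wf.connect("sample.bam", "call_variants", "bam")
wf.connect("sample.bam", "stats", "bam")
wf.connect("align.log", "report", "log")
```

Here `wf.downstream("align")` reports separate consumers for the "bam" and "log" out-ports, which a graph holding only task-to-task edges could not tell apart.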

It is an interesting (albeit sometimes complex) subject, figuring out the best way to work with these things, so I'll definitely check out plumbum more!

Best

// Samuel

[1] https://gist.github.com/samuell/6da9a7c1e03912fde62e


Alexander Krasnukhin

Feb 25, 2015, 2:44:43 AM

to Samuel Lampa, luigi...@googlegroups.com

In shell pipelines you can't really have multiple input and output streams. If these jars can't use stdin and stdout then plumbum is not that exciting for you.

I agree that you had better split every step in the workflow properly into a separate task with defined outputs and requirements. Otherwise I'm not sure why you are even using luigi.

Ron Reiter

Feb 25, 2015, 2:51:21 AM

to Alexander Krasnukhin, Samuel Lampa, luigi...@googlegroups.com

Alexander, this is not true, since Luigi will only re-run those commands that did not complete, even within the WorkflowTask. IMHO it makes a lot of sense to write code in a "workflow" task, since the data flow goes forward and not backward (as in a dependency model).
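
To illustrate the re-run behaviour, here is a small stand-alone sketch that imitates an "output exists, so the step is complete" check with plain files. It does not use Luigi itself; `run_step`, `write_text` and `run_workflow` are hypothetical helpers, and the file names are only borrowed from the example earlier in the thread.

```python
import os
import tempfile


def run_step(output_path, action, log):
    """Run action only if its output file is missing; record what happened."""
    name = os.path.basename(output_path)
    if os.path.exists(output_path):
        log.append(("skipped", name))
        return
    action(output_path)
    log.append(("ran", name))


def write_text(text):
    """Return an action that writes text to the path it is given."""
    def action(path):
        with open(path, "w") as handle:
            handle.write(text)
    return action


def run_workflow(workdir, log):
    """Two forward-flowing steps written top to bottom, as in a workflow task."""
    run_step(os.path.join(workdir, "hej.txt"), write_text("hej hej hej\n"), log)
    run_step(os.path.join(workdir, "bar.txt"), write_text("bar bar bar\n"), log)


if __name__ == "__main__":
    events = []
    with tempfile.TemporaryDirectory() as workdir:
        run_workflow(workdir, events)                # both steps run
        os.remove(os.path.join(workdir, "bar.txt"))  # pretend step 2 never finished
        run_workflow(workdir, events)                # only step 2 runs again
    print(events)
```

On the second call, the first step is skipped because hej.txt already exists, and only the step whose output is missing runs again.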

Thanks,

Ron