Decorator that takes function for arbitrary mapping between input and output names


Alejandro Dubrovsky

Feb 7, 2016, 7:59:44 PM
to ruffus_discuss
Hi,

I'm just trying out Ruffus by converting an existing exome pipeline to it. The initial stages of the pipeline map input files in a non-trivial way (or at least in a way that is not easily mappable to just a regex). 

I was wondering how hard it would be to add a Ruffus decorator that takes a function returning a mapping of target filenames to lists of input filenames
( { "target_filename_1" : ["input_filename_1", "input_filename_2", ... ], "target_filename_2" : ["input_filename_3", "input_filename_4", ... ], ... }).
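
A minimal sketch of the kind of mapping function described above. The function name, the file naming scheme (<sample>_<lane>_<read>.fastq) and the ".bam" targets are assumptions made purely for illustration; nothing here is part of Ruffus.

```python
from collections import defaultdict
import os


def map_targets_to_inputs(input_filenames):
    """Return {target_filename: [input_filename, ...]} (hypothetical helper)."""
    groups = defaultdict(list)
    for name in input_filenames:
        # Assumed naming scheme: <sample>_<lane>_<read>.fastq
        sample = os.path.basename(name).split("_")[0]
        groups[sample + ".bam"].append(name)
    # Sort each list so the mapping does not depend on the order of the inputs.
    return {target: sorted(inputs) for target, inputs in groups.items()}


mapping = map_targets_to_inputs(
    ["s1_L001_R1.fastq", "s1_L001_R2.fastq", "s2_L001_R1.fastq"]
)
print(mapping)
```

The point is that the grouping logic is ordinary Python rather than a regular expression, so any rule that can be computed from the file names can be expressed.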

I looked in task.py to see how the current implementation works, but there is a ... hmm... non-trivial amount of magic going on in there. If you think it's something that is relatively easily doable, I'll spend some time looking into it properly, but I thought I'd better check before I spend a week/month on it only to find out it's something that really wouldn't be easy to get right.

(I'm also more than willing to take hints :)

Thank you,
alex

Bernard James Pope

Feb 7, 2016, 9:15:45 PM
to ruffus_...@googlegroups.com
Hi Alex,

There is an open issue on GitHub for a feature like this:

https://github.com/bunbun/ruffus/issues/27

I think it would be very handy to have, and a nice generalisation of the current behaviour.

Cheers,
Bernie
> --
> You received this message because you are subscribed to the Google Groups "ruffus_discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to ruffus_discus...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Leo Goodstadt 顧維斌

Feb 8, 2016, 4:47:15 AM
to ruffus_...@googlegroups.com
Dear Alex,

As the open issue says, you can already do some of this by specifying both the input and the output parameters.
I don't think it would be much work to add a decorator which allows custom mapping.

I haven't been doing much (any!) Ruffus hacking lately because of my new job so this might give me some incentive to do so.

I suspect that
1) The callback function should be called once per input file(s)
2) The function should return None if the input file is to be ignored
3) The function should return (inputs, outputs, anything else....), so that you can change the inputs on the fly if you want.
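
Points 1 to 3 above could be sketched roughly as follows. This is only an illustration of the suggested contract: bam_to_vcf and generate_job_params are hypothetical names, and the driver loop is a stand-in for whatever a decorator might do internally, not actual Ruffus code.

```python
def bam_to_vcf(input_file):
    """Hypothetical callback, called once per input (point 1)."""
    if not input_file.endswith(".bam"):
        return None  # point 2: this input is to be ignored
    output_file = input_file[:-len(".bam")] + ".vcf"
    # point 3: (inputs, outputs, anything else...)
    return input_file, output_file, "extra"


def generate_job_params(all_inputs, callback):
    """Stand-in driver: apply the callback to each input, skipping None."""
    for input_file in all_inputs:
        params = callback(input_file)
        if params is not None:
            yield params


jobs = list(generate_job_params(["a.bam", "notes.txt", "b.bam"], bam_to_vcf))
print(jobs)
```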

It would help tremendously though if
1) you and Bernie could think of a nice name for the decorator
2) you could tell me what the function needs to do the mapping. I.e. does it need state, or just the input files?
If it needs state, can you provide state via a callable object or closure / function factory?
3) you let me know if there is anything else you need or don't understand about how Ruffus works
4) you could contribute documentation! (he says, cheekily)

I shall see if I can get something coded up this week / by the weekend.

Thanks
Leo

Alejandro Dubrovsky

Feb 8, 2016, 9:00:51 AM
to ruffus_...@googlegroups.com
On 08/02/16 20:46, Leo Goodstadt 顧維斌 wrote:
> Dear Alex,
>
> As the open issue says, you can already do some of this by specifying
> both the input and the output parameters.
> I don't think it would be much work to add a decorator which allows
> custom mapping.
>
> I haven't been doing much (any!) Ruffus hacking lately because of my new
> job so this might give me some incentive to do so.
>

> I suspect that
> 1) The callback function should be called once per input file(s)
> 2) The function should return None if the input file is to be ignored
> 3) The function should return (inputs, outputs, anything else....), so
> that you can change the inputs on the fly if you want.
>
hmm...can the function return the inputs if it's going to be passed only
one input file at a time? Or is it getting the input file to decide on
and the whole list of inputs also in another parameter? Or is this part
of the state that it needs? I suspect I need to look more closely at the
implementation of transform to know if I'm using the right terminology here.

> It would help tremendously though if
> 1) you and Bernie could think of a nice name for the decorator

Naming is too hard. I'm fond of overly_long_names which most people
don't like. I'd go with input_output_mapper.

> 2) tell me what the function needs to do the mapping. I.e. does it need
> state, or just the input files?

In my case, I do not need any state as long as the full list of input
filenames is available to the function at every call.

> If it needs state, can you provide state via a callable object or
> closure / function factory?
> 3) let me know if anything else you need or don't understand about how
> Ruffus works
> 4) you can contribute documentation! (he says, cheekily)
>
The user-side documentation seems pretty good to me from what I've seen.
I'll add some developer-side documentation if I end up understanding how
it works.

> I shall see if I can get something coded up this week / by the weekend.
>
Awesome! Thanks a lot. Do tell if you are too busy or found something
better to do. Now that I know that it isn't a stupid idea and is not
meant to be overly hard, I'm happy to dig in properly and write it
myself too.

alex


Leo Goodstadt 顧維斌

Feb 8, 2016, 7:28:54 PM
to ruffus_...@googlegroups.com
> hmm...can the function return the inputs if it's going to be passed only one input file at a time? Or is it getting the input file to decide on and the whole list of inputs also in another parameter? Or is this part of the state that it needs? I suspect I need to look more closely at the implementation of transform to know if I'm using the right terminology here.

No, you are making perfect sense. @transform processes one piece of input at a time, and the suffix substitution (suffix), regular expression (regex) and file name substitution (formatter) contain the logic to generate output file names from each piece of input. Any custom function would be an extension of this.

> It would help tremendously though if
> 1) you and Bernie could think of a nice name for the decorator

It might be simplest if we just added an additional parameter to Ruffus:
    @transform([prev_task, "file1.bam", "file2.bam", ("r1.fq", "r2.fq")], callback = yr_callback_func[, add_inputs])
    def your_task_func(inputs, outputs, extra1, extra2, extra3):
        print(inputs, outputs, extra1, extra2, extra3)

Where callback is a function / callable Python object with the following simple interface:
    def yr_callback_func(inputs):
        ...
        return new_inputs, output, extra1, extra2, extra3
This callback function will be called once per job, i.e. once per incoming file. Its job is to generate the parameters which will then be sent back to your_task_func verbatim.
In the above example, yr_callback_func will be called once for each output of prev_task, once for "file1.bam", once for "file2.bam" and once for ("r1.fq", "r2.fq").

Most of the time, you should just return inputs as the first parameter, but for extreme flexibility, you can change the input file names (new_inputs). In bioinformatics, this conveniently allows the other half of a paired end file to be included for example.
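
As an illustration of the paired-end case just mentioned, a callback along these lines could add the second read file to the inputs. This is a sketch only: pair_reads, the "_r1.fq" / "_r2.fq" naming convention and the ".bam" output are assumptions, and the callback parameter itself is still just a proposal.

```python
def pair_reads(inputs):
    """Hypothetical callback following the proposed interface."""
    # Assumption: each job's input is a single "<sample>_r1.fq" file name.
    if not inputs.endswith("_r1.fq"):
        raise ValueError("expected a *_r1.fq file, got %r" % (inputs,))
    stem = inputs[:-len("_r1.fq")]
    new_inputs = (inputs, stem + "_r2.fq")  # add the other half of the pair
    output = stem + ".bam"
    return new_inputs, output


params = pair_reads("tumour_r1.fq")
print(params)
```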

After callback generates the parameters, Ruffus will do its usual thing:
1) Check that files are up to date / the job needs to be run
2) Checkpointing
3) Dispatch across multiple threads / processors
4) Handle errors
5) Feed the output to the downstream tasks

For your sanity, it is quite important that yr_callback_func() always maps the same inputs to the same outputs. This callback function will typically be called several times in the pipeline: to check which part of the whole pipeline needs to run, to generate the actual parameters during the pipeline operation, to generate printouts of the pipeline progress, and to generate graphical output. If you keep changing your mind on Ruffus, your entire pipeline will end up in a muddle.

yr_callback_func() is only called with one input at a time. This means that the mapping from input to output should not depend on what else is flowing through the pipeline. However, since this is a custom Python function / callable object, you are free to keep whatever persistent state you want inside the function. Again, the most important point is consistency.
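
One way to carry state while staying consistent is a function factory (closure). The sketch below is illustrative only: make_output_mapper and the "results" directory are made-up names, and the state is fixed when the callback is created, so repeated calls with the same input always give the same answer.

```python
def make_output_mapper(output_dir):
    """Function factory: the returned callback remembers output_dir."""
    def callback(inputs):
        base = inputs.rsplit("/", 1)[-1]   # drop any directory part
        stem = base.rsplit(".", 1)[0]      # drop the last extension, if any
        return inputs, "%s/%s.out" % (output_dir, stem)
    return callback


mapper = make_output_mapper("results")
first = mapper("data/x.txt")
second = mapper("data/x.txt")  # same input, so the same parameters come back
print(first, second)
```

State that changes between calls (counters, timestamps, random names) would break the consistency requirement described above, so it is best avoided.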

In use, none of this should be very far removed from the current Ruffus design. However, it is probably sensible to familiarise yourself with the rest of Ruffus first, and write some toy examples so that you understand the fundamental design (and limitations) of Ruffus.

Leo


> Awesome! Thanks a lot. Do tell if you are too busy or found something better to do. Now that I know that it isn't a stupid idea and is not meant to be overly hard, I'm happy to dig in properly and write it myself too.


It is neither a stupid idea, nor overly hard. You are of course more than welcome to dig into the Ruffus code and hack away and (please!) contribute back to the community.

I do have the pressures of a new job, and I am spending many long hours writing C++ and MATLAB, so if I do have time, it will be a pleasant diversion.

Leo


Bernard James Pope

Feb 8, 2016, 7:33:13 PM
to ruffus_...@googlegroups.com
Hi Leo,

I don't quite understand your proposal. Could we discuss the semantics a little bit?

My understanding is that currently in Ruffus, "transform" allows you to specify:

- a source of input files
- an input file filter (e.g suffix, regex, formatter)
- output file(s)
- optional extra parameters

Regular expressions are used extensively to both match input files for selection, parse their contents, and select components which can be used to generate output file names.

While regular expressions are quite handy, they are also sometimes limiting, and it would be nice to generalise the process of:

- matching selected input files from the input source
- parsing selected files
- generating output files from the parsed selected input file

I'm not sure this needs to be a decorator.

Could it be an alternative form of transform:

@transform(input, filemapper, [extras, ...])

where filemapper is specified as some callable thing which takes an iterable collection of files as input and produces an iterable collection of files as output?

Maybe it could also optionally produce extras as well.

I don't think it needs any state.

Cheers,
Bernie

Alejandro Dubrovsky

Feb 9, 2016, 2:20:33 AM
to ruffus_discuss


On Tuesday, February 9, 2016 at 11:33:13 AM UTC+11, Bernie Pope wrote:
> Could it be an alternative form of transform:
>
>     @transform(input, filemapper, [extras, ...])
>
> where filemapper is specified as some callable thing which takes an iterable collection of files as input and produces an iterable collection of files as output?
>
> Maybe it could also optionally produce extras as well.
>
> I don't think it needs any state.

I do like the sound of that. Would the mapping necessarily be restricted to one-to-one between input and output though? Could it generate an iterable of ([input_files], output_filename) pairs instead, to optionally function like collate does? 
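
A rough sketch of what such a collate-like mapper could look like, yielding ([input_files], output_filename) pairs. The names collate_mapper and sample_key, and the <sample>.<chunk>.vcf naming scheme, are hypothetical and chosen only for the example.

```python
from itertools import groupby


def sample_key(filename):
    # Assumed naming scheme: <sample>.<chunk>.vcf
    return filename.split(".")[0]


def collate_mapper(input_files):
    """Hypothetical file mapper yielding ([input_files], output_filename)."""
    # groupby only merges adjacent items, so sort by (sample, name) first.
    ordered = sorted(input_files, key=lambda f: (sample_key(f), f))
    for sample, group in groupby(ordered, key=sample_key):
        yield list(group), sample + ".merged.vcf"


pairs = list(collate_mapper(["b.1.vcf", "a.2.vcf", "a.1.vcf"]))
print(pairs)
```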

Leo Goodstadt 顧維斌

Feb 9, 2016, 5:54:09 AM
to ruffus_...@googlegroups.com
Hi Bernie,
Unless I misunderstand you, I think this is exactly what I am proposing. Maybe our posts crossed in mid-air.
This is my suggested syntax:
It might be simplest if we just added an additional parameter to Ruffus:
    @transform([prev_task, "file1.bam", "file2.bam", ("r1.fq", "r2.fq")], callback = yr_callback_func[, add_inputs])
    def your_task_func(inputs, outputs, extra1, extra2, extra3):
        print(inputs, outputs, extra1, extra2, extra3)

Where callback is a function / callable Python object with the following simple interface:
    def yr_callback_func(inputs):
        ...
        return new_inputs, output, extra1, extra2, extra3

I have to think a bit about whether that should be yield rather than return. I suspect the former rather than the latter if we want the same interface for all of the decorators. I.e. everywhere which currently takes a regex/suffix/formatter, we can also take a function / callable which generates the same outputs etc.
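
To make the yield-versus-return question concrete, here is a sketch of a generator-style callback. With yield, one input can produce zero, one or several jobs, which a single return cannot express. The function name and the per-chromosome split are invented for the example and are not part of any agreed interface.

```python
def split_by_chromosome(inputs):
    """Hypothetical generator-style callback: yields one tuple per job."""
    if not inputs.endswith(".bam"):
        return  # nothing is yielded, so this input is simply skipped
    stem = inputs[:-len(".bam")]
    for chrom in ("chr1", "chr2"):
        # (inputs, output, extra) for each job derived from this input
        yield inputs, "%s.%s.vcf" % (stem, chrom), chrom


jobs = list(split_by_chromosome("s1.bam"))
skipped = list(split_by_chromosome("readme.txt"))
print(jobs, skipped)
```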

This probably means that if you are not using the new style named parameters, you will need an old style "indicator" object to disambiguate. I.e. something along the lines of regex, suffix or formatter.
Perhaps called inputmapper(...). I would use your filemapper name but conceptually Ruffus doesn't only deal with file names (though in practice....) and I don't want to go through all the documentation to change things :-)
I like the "mapper" part though.

More nomenclature and interface / design suggestions please!

Thanks a lot,

Leo

Leo Goodstadt 顧維斌

Feb 9, 2016, 5:57:03 AM
to ruffus_...@googlegroups.com
I think this is best thought of as an extension or generalisation of the current Ruffus design. So if you had a @transform with a callback, it would do transformy things. If you want to do collaty things, you need a @collate with a callback / filemapper.
I will try extending all the decorators at the same time. This will highlight any problems in the design.
How does that sound?
Leo


Bernard James Pope

Feb 9, 2016, 7:14:41 AM
to ruffus_...@googlegroups.com

> On 9 Feb 2016, at 9:53 pm, Leo Goodstadt 顧維斌 <llewgo...@gmail.com> wrote:
>
> Unless, I misunderstand you, I think this is exactly what I am proposing. Maybe our posts crossed mid air.

Haha, yeah, we seem to have sent at the same time.

Leo Goodstadt 顧維斌

Feb 25, 2016, 5:57:26 AM
to ruffus_...@googlegroups.com
Sorry everyone, I am still working on this. 
1. Syntax
The actual working of the custom function seems very straightforward.
The only decision is whether we should have

    @transform(input = ["1.c", "2.c"], filter = input_mapper(custom_func), output = "something", extras = ["something.else", 2.33])
    def compile(infile, outfile, extras):
        pass
That is, should we allow "output" and "extras" parameters to be provided with custom_func. Obviously, what you do with these parameters is up to you, and they can be optional. Personally, I wouldn't even need them:
    @transform(input = ["1.c", "2.c"], filter = input_mapper(custom_func))
    def compile(infile, outfile, extras):
        pass
but it seems that providing support for these is sensible for orthogonality, ease of documentation etc.

2. Argument parsing

The holdup is that argument parsing in Ruffus is very complicated.

The named argument syntax is very straightforward and Pythonic. The unnamed (pre 2.6.2) argument parsing relies not only on position but also on type. (add_inputs() is optional, for example). 
 
@product(filter=input_mapper()) is the last straw because the number of sets of inputs is indeterminate and now each set of inputs is not paired with a "formatter()". 
Obviously, this is flexible and convenient but in practice, the parsing code is convoluted and complicated. I need to pull this all out, refactor and write unit tests.
It seems that I have smuggled C++ function overloading into Python. Bad!
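
A toy illustration of why parsing by position and type gets convoluted. The classes Suffix, InputMapper and AddInputs below are simplified stand-ins written for this sketch, not the real Ruffus classes, and parse_positional is a hypothetical helper, not the actual parser.

```python
class Suffix(object):
    def __init__(self, text):
        self.text = text


class InputMapper(object):
    def __init__(self, func):
        self.func = func


class AddInputs(object):
    def __init__(self, *extra_inputs):
        self.extra_inputs = extra_inputs


FILTER_TYPES = (Suffix, InputMapper)


def parse_positional(*args):
    """Toy parser for (input, filter, [AddInputs], [output], *extras)."""
    args = list(args)
    if len(args) < 2:
        raise TypeError("expected at least an input and a filter")
    parsed = {"input": args.pop(0)}
    if not isinstance(args[0], FILTER_TYPES):
        raise TypeError("second argument must be a filter indicator object")
    parsed["filter"] = args.pop(0)
    # The optional AddInputs argument can only be recognised by its type.
    has_add = bool(args) and isinstance(args[0], AddInputs)
    parsed["add_inputs"] = args.pop(0) if has_add else None
    parsed["output"] = args.pop(0) if args else None
    parsed["extras"] = args
    return parsed


with_add = parse_positional(["1.c"], Suffix(".c"), AddInputs("common.h"), ".o", 2.33)
without_add = parse_positional(["1.c"], InputMapper(len), ".o")
```

Even in this reduced form, the meaning of the third positional argument depends on its type, which is the overloading-like behaviour described above; named arguments avoid the problem entirely.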

So, unfortunately, I have to take a two week detour first.

Leo
