Hi Leo,
Thanks for the suggestion!
To make things clearer, here are the tasks with more realistic function names, along with their inputs and outputs.
Input files
"args.input_reads" is a list of sample filepaths to be processed, e.g.:
/path/to/sample01/sample01_R1.fastq
/path/to/sample01/sample01_R2.fastq
/path/to/sample02/sample02_R1.fastq
/path/to/sample02/sample02_R2.fastq
etc.
Some other variables (e.g. "build_dirs") are global variables created before Ruffus is called.
Tasks:
# pre-req (before task 0 above)
def check_for_bowtie_indices():
    ...
#
# "task 0a"
#
# OUTPUTS:
#
# common/sample01/ruffus/sample01_R1.filter_nontarget_reads
# common/sample01/ruffus/sample01_R2.filter_nontarget_reads
# ...
#
@follows(check_for_bowtie_indices)
@transform(args.input_reads,
           regex(r'^(.*/)?([^_]*)_(R?[1-2])_?(.*)?\.fastq(\.gz)?'),
           r'%s/\2/ruffus/\2_\3.filter_nontarget_reads' % build_dirs['shared'],
           r'\2',  # sample_id
           r'\3')  # read_num
def filter_nontarget_reads(input_file, output_file, sample_id, read_num):
    ...
#
# OUTPUTS:
#
# common/sample01/ruffus/sample01_R1.filter_genomic_reads
# common/sample01/ruffus/sample01_R2.filter_genomic_reads
#
# Was previously using @transform here, but testing out @subdivide to run the following steps in parallel...
#
@follows(filter_nontarget_reads)
@subdivide(args.input_reads,
           regex(r'^(.*/)?([^_]*)_(R?[1-2])_?(.*)?\.fastq(\.gz)?'),
           [r'%s/\2/ruffus/\2_\3.filter_genomic_reads.rsl' % build_dirs['shared'],
            r'%s/\2/ruffus/\2_\3.filter_genomic_reads.polya' % build_dirs['shared'],
            r'%s/\2/ruffus/\2_\3.filter_genomic_reads.polyt' % build_dirs['shared']],
           r'%s/\2/ruffus/\2_\3.filter_genomic_reads' % build_dirs['shared'],  # output_base
           r'\2',  # sample_id
           r'\3')  # read_num
def filter_genomic_reads(input_file, output_files, output_base, sample_id, read_num):
    ...
#
# OUTPUTS:
#
# spliced_leader/param-info-here/sample01/ruffus/sample01_R1.find_sl_reads
# spliced_leader/param-info-here/sample01/ruffus/sample01_R2.find_sl_reads
#
@transform(filter_genomic_reads,
           regex(...),  # pattern for the filter_genomic_reads outputs, omitted here
           r'%s/\2/ruffus/\2_\3.find_sl_reads' % build_dirs['sl'],
           r'\2',  # sample_id
           r'\3')  # read_num
def find_sl_reads(input_file, output_file, sample_id, read_num):
    ...
Tasks "b-c" are similar to "find_sl_reads" above, but each looks for a different feature and creates files using different suffixes.
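As a quick sanity check outside of Ruffus, here is a minimal sketch of how the .fastq regex used above splits an input path into the sample ID and read number, and how an output path is assembled from those pieces. The helper name "expand_output" and the "build_dirs" value are made up for illustration; the real "build_dirs" is created elsewhere in the script.

```python
import re

# Same pattern as in the @transform / @subdivide decorators above.
READ_REGEX = re.compile(r'^(.*/)?([^_]*)_(R?[1-2])_?(.*)?\.fastq(\.gz)?')

# Hypothetical stand-in for the real build_dirs global.
build_dirs = {'shared': 'common'}


def expand_output(input_path, suffix, base_dir):
    """Return (sample_id, read_num, output_path) for one input read file."""
    match = READ_REGEX.match(input_path)
    if match is None:
        raise ValueError('filename does not match expected pattern: %s' % input_path)
    sample_id = match.group(2)  # corresponds to \2 in the decorators
    read_num = match.group(3)   # corresponds to \3 in the decorators
    output_path = '%s/%s/ruffus/%s_%s.%s' % (
        base_dir, sample_id, sample_id, read_num, suffix)
    return sample_id, read_num, output_path


sample_id, read_num, output_path = expand_output(
    '/path/to/sample01/sample01_R1.fastq',
    'filter_nontarget_reads',
    build_dirs['shared'])
print(sample_id, read_num, output_path)
```

One thing this makes visible: because the sample ID group is "[^_]*", a sample ID that itself contains an underscore will either fail to match or be split at the first underscore.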
Notes
I left out a couple of steps from the full pipeline, but this should be enough to illustrate the basic problem.
Some of the variables used in tasks are encoded in filenames (e.g. sample ID / read num), which is probably not the best way to pass that information along, but at the time it was the best I could come up with.
I also ended up using dummy placeholder files created once a task is completed, rather than the actual files generated. Again, this has more to do with my not being able to set up the tasks to manage the file checks directly.
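For context, the placeholder approach amounts to roughly the following. This is only a sketch: "run_with_flag" and the "work" callable are hypothetical names, not the actual pipeline code. The empty flag file is only written if the real work finishes without raising.

```python
import os
import tempfile


def run_with_flag(flag_path, work):
    """Run work(), then create an empty flag file marking the task as done."""
    work()  # if this raises, no flag file is written
    flag_dir = os.path.dirname(flag_path)
    if flag_dir:
        os.makedirs(flag_dir, exist_ok=True)
    with open(flag_path, 'w'):
        pass  # empty file; only its existence and timestamp matter


# Example with a temporary directory standing in for the build directory.
demo_dir = tempfile.mkdtemp()
demo_flag = os.path.join(
    demo_dir, 'sample01', 'ruffus', 'sample01_R1.filter_nontarget_reads')
run_with_flag(demo_flag, lambda: None)
print(os.path.exists(demo_flag))
```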
If you have any suggestions for how to handle either of these issues more elegantly, it would probably go a long way toward simplifying the code.
Cheers,
Keith