problem with Filter and regex

28 views
Skip to first unread message

Diego Diaz Dominguez

unread,
Mar 18, 2016, 9:56:02 AM3/18/16
to ruffus_discuss
Hi all,

I am trying to use the function regex to filter my inputs in a step of my workflow. But when I run the script I get the following error:

 Duplicate, conflicting or unrecognised arguments:
   
    pipeline.transform(name = 'pipeline_b_step_0::Calling step 0', task_func = make_calling, input=None, filter=regex('INDELS_realigned.bam$| rq.bam$',), output='indels.snps.g_step_0.vcf', extras=['/usr/local/java/jdk1.8.0_45/bin/java', '/home/diego/Programs/GATK/GenomeAnalysisTK.jar', 'VariantDiscovery_pipeline/reference/reference_genome.fasta', LoggerProxy(), <AcquirerProxy object, typeid 'Lock' at 0x7f41c5e6eb90>, 1], ...

At the beginning I thought it was a problem with my regular expression, but no matter what regular expression I put , always I get the same error.

This is my line describing the task:

        make_calling_task = sub_pipeline.transform(    task_func    = make_calling,
                                                                             name          = "Calling step "+str(step),
                                                                             input           = None,
                                                                             filter            = regex(r"INDELS_realigned.bam$| rq.bam$"),
                                                                             output         = "indels.snps.g_step_"+str(step)+".vcf",
                                                                             output_dir    = join(call_folder,"step_"+str(step)),
                                                                             extras         = [    self.JAVA, self.GATK,
                                                                                                       self.ref_genome_project, self.logger,
                                                                                                       self.logger_mutex, self.args.nct]
       )
 
I am using ruffus version '2.6.3' and python version 2.7.6

I would appreciate if someone could help me with this problem

Thanks in advance

Leo Goodstadt 顧維斌

unread,
Mar 21, 2016, 4:56:20 AM3/21/16
to ruffus_...@googlegroups.com
Hi Diego,
I had to look at your code, and then go back to look at your arguments, and look at the code again before I figured out what was wrong:
regex does not take the output_dir argument.

Regular expressions are intended to be a massive test substitution hammer where you can make any changes you want, including up and down and across paths.

output_dir is intended for common case when you only want to change the output suffix (e.g. from "bam" to "vcf") but want the results to be in another directory.

1) The error message is obviously completely unhelpful because the chopped off bit at the end tells you what "duplicate, conflicting or unrecognised argument" is causing the error.

2) The fact that "output_dir" only goes with suffix deserves its own explanatory error message.

3) Maybe we should relax output_dir so that it works with regex. Not sure. Would this not be confusing? You go to all this trouble to make your substitutions and there is path swap behind your back afterwards? If you think this is a a good idea, can you add a proposal to the list of issues in github?

Thanks
Leo

Leo Goodstadt 顧維斌

unread,
Mar 21, 2016, 5:10:51 AM3/21/16
to ruffus_...@googlegroups.com
P.S. Why is the error message being cut short? Is a full version displayed further on?

Diego Diaz Dominguez

unread,
Mar 21, 2016, 5:22:50 PM3/21/16
to ruffus_discuss, llewgo...@gmail.com
Hi Leo,

As you say the unrecognised argument is output_dir (sorry I forgot to include it in the post) . As far as I understood in the ruffus documentation, suffix function searches for a word at the end of the sentence and regex searches for a pattern in any position (depending how the re was built), similar to that you said before, but I didn't know that the argument output_dir is only applicable when suffix is called, thanks for the clarification.

My problem is quite simple, I have a task called "make-calling", this task gets as input files which may contain two different suffixes, "INDELS_realigned.bam" or "rq.bam". (I am working in a wrapper for the GATK framework and in such cases when there are no known variants, GATK team recommend some bootstrapping) . My first thought was, "OK, I have to use a regular expression to catch both cases", and that is why I used regex function instead of suffix.

Could you give me some advice about how to catch two different suffix with the same function and, at the same time, being able to define some output directory.

Thanks in advance.

Diego.

Leo Goodstadt 顧維斌

unread,
Mar 21, 2016, 7:24:27 PM3/21/16
to
Hi Diego,
The real question is what sort of patterns are the desired inputs and outputs.
If the inputs are 
xxx.INDELS_realigned.bam and yyy.rq.bam and you want to generate "xxx.indels.snps.g_step_3.vcf" and "yyy.indels.snps.g_step_3.vcf", then you can use 

transform(...
input = your_filenames,
filter = suffix(".bam"),
output = ".vcf",
output_dir = join(call_folder,"step_"+str(step))

where your_filenames can be a list of files, or an upstream task or a mixture of the two.

I have a feeling though that the output filenames are going to be hardwired, and do not depend on the input names, probably because your pipeline is only handling one bam at a time?


Leo


Diego Diaz Dominguez

unread,
Mar 21, 2016, 8:04:44 PM3/21/16
to ruffus_discuss, llewgo...@gmail.com
Hi Leo,

Thanks for your quick response.

Indeed, that was my approximation at the beginning, but the things are a little bit trickier. As I mentioned before, GATK recommends to make some sort of bootstrapping for variant calling if there were no information about known variants. This bootstrap-like process begins with an initial variant calling with uncalibrated data. Raw variants are then used to recalibrate base qualities and the variants are recalled. This process is repeated until reach some convergence. I tried to implement this in my pipeline, using a fixed number of steps. The first SNP calling is with BAM files obtained after INDEL realignment, after the first iteration, BAM files used to generate the variant calling come from the recalibration process. If I just replace the word ".bam", I would obtain something like this:

step 0: xxxx_IN_realigned.bam                                  ->  xxxx_IN_realigned_step_0_rq.bam
step 1: xxxx_IN_realigned_step_0_rq.bam                  ->  xxxx_IN_realigned_step_0_rq_step_1_rq.bam
step 2: xxxx_IN_realigned_step_0_rq_step_1_rq.bam  -> xxxx_IN_realigned_step_0_rq_step_1_rq_step_2_rq.bam 

and so on

The thing is that the number of steps is passed by cmd (I set as default 3 if this argument is not setted), and the name of the files will grow according the number of steps defined by the user. I will get file with really long names, and that is not desirable. 

In attachment I send you an image of the pipeline generate by the function printout_graph. The code works pretty well. I used several subpipelines created dynamically, according the number of steps.

Thanks again for your help,

Diego.
flowchart.svg

Diego Diaz Dominguez

unread,
Mar 28, 2016, 9:08:55 AM3/28/16
to ruffus_discuss, llewgo...@gmail.com
By the way .. I solved it passing the output_dir as extra parameter to the function. I just wanted all the tasks to follow the same structure, that is why I was so stron-willed. Thaks agai Leo

Leo Goodstadt 顧維斌

unread,
Mar 30, 2016, 9:57:35 AM3/30/16
to ruffus_...@googlegroups.com
Good to hear that. 
I am a little curious:
Would adding output_dir support for regex confusing? 
Or was not being able to use output_dir with regex a surprise?
Leo

--
You received this message because you are subscribed to the Google Groups "ruffus_discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ruffus_discus...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages