content of string in extras disrupts pipeline


Karl Nordström

Sep 4, 2015, 7:51:14 AM
to ruffus_discuss
Hi,

I'm new to Ruffus and am trying to learn it. While setting up my first pipeline, I wanted to use formatter() in a collate step:


testTask = pipeline.collate(task_func=dummyFunc,
                            name='testTask',
                            input=output_from('getFiles'),
                            filter=formatter(raw_00 + r"/seq_R[12](?P<FILENR>_[0-9]{4}).fastq.gz"),
                            output=[trim_05 + '/seq_R1{FILENR[0]}_val_1.fq.gz',
                                    trim_05 + '/seq_R2{FILENR[0]}_val_2.fq.gz'],
                            extras=[cmd]).mkdir(trim_05)

My problem came with the extras parameter. With

cmd="echo {inputfile}"

the pipeline found the input files and executed the function.

If I changed cmd to:

cmd="echo {inputfile[0]}"

the pipeline didn't execute the step, reporting that no files matched the pattern. I did clean up the folder between runs.

I was able to rescue it by changing to a regular expression in the filter:

testTask = pipeline.collate(task_func=dummyFunc,
                            name='testTask',
                            input=output_from('getFiles'),
                            filter=regex(r".*/seq_R[12](_[0-9]{4}).fastq.gz"),
                            output=[trim_05 + r'/seq_R1\1_val_1.fq.gz',
                                    trim_05 + r'/seq_R2\1_val_2.fq.gz'],
                            extras=[cmd]).mkdir(trim_05)

This is sufficient for me, but I still wonder whether the interplay between the formatter and the extras parameter is intentional?

Best,
Karl


Leo Goodstadt 顧維斌

Sep 4, 2015, 3:32:27 PM
to ruffus_...@googlegroups.com
Hi Karl,

I am sorry but I am not entirely sure I understand you yet.

formatter() performs string replacement on both the "output" and "extras" parameters. If you don't want something in curly braces to be replaced, you can "escape" the braces by doubling them:
extras=["echo {{inputfile}}"]

See the python docs ("If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}")
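
As a minimal sketch of that escaping rule in plain Python (this calls str.format directly rather than Ruffus, and the file name is made up for illustration):

```python
# Doubled braces come out of str.format() as single literal braces.
escaped = "echo {{inputfile}}".format()
print(escaped)

# Single braces mark a replacement field, which is filled in by name.
filled = "echo {inputfile}".format(inputfile="a.fastq.gz")  # hypothetical value
print(filled)
```

The first string keeps its braces as literal text, while the second has the field substituted.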


A string replacement failure is an indication to Ruffus that the input to a job does not match your specification and should be filtered out / ignored / excluded.
Unfortunately, this is by design :-)

Using filter=regex() should behave in the same way, except that of course the replacement string format uses patterns from the python re module (e.g. r"\g<name>", r"\2", r"\g<2>" etc.) rather than python string formatting.
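
To illustrate the two replacement syntaxes side by side, here is a rough sketch using only the Python re module and str.format. The input path and output directory are hypothetical, and this only mimics the substitution step, not what Ruffus does internally. Note that inside Ruffus a formatter() field is indexed per input, e.g. {FILENR[0]}, whereas the plain groupdict() below is not.

```python
import re

path = "/tmp/raw/seq_R1_0007.fastq.gz"  # hypothetical input file name
pattern = r".*/seq_R[12](?P<FILENR>_[0-9]{4}).fastq.gz"

# regex() style: captured groups are referenced with re back-references.
by_number = re.sub(pattern, r"/tmp/trim/seq_R1\1_val_1.fq.gz", path)
by_name = re.sub(pattern, r"/tmp/trim/seq_R1\g<FILENR>_val_1.fq.gz", path)

# formatter() style: the same capture is inserted with str.format syntax.
match = re.search(pattern, path)
formatted = "/tmp/trim/seq_R1{FILENR}_val_1.fq.gz".format(**match.groupdict())

print(by_number)
print(by_name)
print(formatted)
```

All three approaches produce the same output file name here.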

Can you confirm that I haven't misunderstood you, and that you are not drawing my attention to a bug in Ruffus (or even a mis-design)...

Thanks 


Leo


Karl Nordström

Sep 7, 2015, 4:59:13 AM
to ruffus_discuss, llewgo...@gmail.com
Hi Leo,

Thanks for the reply. I should have read up more before posting. I missed the doubling of braces and that formatter() alters both output and extras. I still don't understand the behavior with regard to square brackets, so I have put together an example:

from ruffus import *

def dummyFunc(inputFile, outputFile, cmd):
    for file in outputFile:
        with open(file, 'w') as oo: pass
    return outputFile

def genFile(outputFile):
    open(outputFile, 'w')
    return outputFile

tmpDir = '/tmp/test150907'
raw_00 = tmpDir + '/raw'
trim_05 = tmpDir + '/trim1'
trim_10 = tmpDir + '/trim2'

files = ['seq_R1_0001.fastq.gz', 'seq_R1_0002.fastq.gz', 'seq_R1_0003.fastq.gz', 'seq_R1_0004.fastq.gz',
         'seq_R1_0005.fastq.gz', 'seq_R1_0006.fastq.gz', 'seq_R1_0007.fastq.gz', 'seq_R1_0008.fastq.gz',
         'seq_R1_0009.fastq.gz', 'seq_R1_0010.fastq.gz', 'seq_R1_0011.fastq.gz', 'seq_R1_0012.fastq.gz',
         'seq_R1_0013.fastq.gz', 'seq_R1_0014.fastq.gz', 'seq_R1_0015.fastq.gz', 'seq_R1_0016.fastq.gz',
         'seq_R1_0017.fastq.gz', 'seq_R1_0018.fastq.gz', 'seq_R1_0019.fastq.gz', 'seq_R1_0020.fastq.gz',
         'seq_R1_0021.fastq.gz', 'seq_R1_0022.fastq.gz', 'seq_R1_0023.fastq.gz', 'seq_R1_0024.fastq.gz',
         'seq_R1_0025.fastq.gz', 'seq_R1_0026.fastq.gz', 'seq_R1_0027.fastq.gz', 'seq_R1_0028.fastq.gz',
         'seq_R1_0029.fastq.gz', 'seq_R1_0030.fastq.gz', 'seq_R1_0031.fastq.gz', 'seq_R1_0032.fastq.gz',
         'seq_R2_0001.fastq.gz', 'seq_R2_0002.fastq.gz', 'seq_R2_0003.fastq.gz', 'seq_R2_0004.fastq.gz',
         'seq_R2_0005.fastq.gz', 'seq_R2_0006.fastq.gz', 'seq_R2_0007.fastq.gz', 'seq_R2_0008.fastq.gz',
         'seq_R2_0009.fastq.gz', 'seq_R2_0010.fastq.gz', 'seq_R2_0011.fastq.gz', 'seq_R2_0012.fastq.gz',
         'seq_R2_0013.fastq.gz', 'seq_R2_0014.fastq.gz', 'seq_R2_0015.fastq.gz', 'seq_R2_0016.fastq.gz',
         'seq_R2_0017.fastq.gz', 'seq_R2_0018.fastq.gz', 'seq_R2_0019.fastq.gz', 'seq_R2_0020.fastq.gz',
         'seq_R2_0021.fastq.gz', 'seq_R2_0022.fastq.gz', 'seq_R2_0023.fastq.gz', 'seq_R2_0024.fastq.gz',
         'seq_R2_0025.fastq.gz', 'seq_R2_0026.fastq.gz', 'seq_R2_0027.fastq.gz', 'seq_R2_0028.fastq.gz',
         'seq_R2_0029.fastq.gz', 'seq_R2_0030.fastq.gz', 'seq_R2_0031.fastq.gz', 'seq_R2_0032.fastq.gz']

files = [raw_00+'/'+file for file in files]

pipeline = Pipeline('test150907')


genFiles = pipeline.originate(task_func=genFile,
                              name='genFiles',
                              output=files).mkdir(raw_00)


cmd = "echo {inputfile}"

testTask = pipeline.collate(task_func=dummyFunc,
                            name='testTask1',
                            input=output_from('genFiles'),
                            filter=formatter(raw_00 + r"/seq_R[12](?P<FILENR>_[0-9]{4}).fastq.gz"),
                            output=[trim_05 + '/seq_R1{FILENR[0]}_val_1.fq.gz',
                                    trim_05 + '/seq_R2{FILENR[0]}_val_2.fq.gz'],
                            extras=[cmd]).mkdir(trim_05)


cmd = "echo {inputfile[0]}"

testTask = pipeline.collate(task_func=dummyFunc,
                            name='testTask2',
                            input=output_from('genFiles'),
                            filter=formatter(raw_00 + r"/seq_R[12](?P<FILENR>_[0-9]{4}).fastq.gz"),
                            output=[trim_10 + '/seq_R1{FILENR[0]}_val_1.fq.gz',
                                    trim_10 + '/seq_R2{FILENR[0]}_val_2.fq.gz'],
                            extras=[cmd]).mkdir(trim_10)

pipeline.run()

In the above code, the only difference between test task 1 and 2 is the extras parameter (and the output-folder). While task 1 generates new files in /tmp/test150907/trim1, I get a warning for task 2:

WARNING:
    'In Task 'test150907.testTask2':' No jobs were run because no files names matched. Please make sure that the regular expression is correctly specified.

The regular expression is exactly the same in both cases. Understanding that ruffus does replacements in the extras is making me think that this replacement goes wrong in some way. I'm not sure if it qualifies as a bug :)

Best,
Karl

Leo Goodstadt 顧維斌

Sep 7, 2015, 10:02:27 AM
to ruffus_...@googlegroups.com
Hi Karl,

When I paste your code into the Python shell, neither of the tasks runs because "{inputfile}" does not match anything in formatter().
Have I missed something?

Let me perhaps explain how Ruffus works more clearly:

  1. Whenever you use formatter() with Ruffus, Ruffus assumes you are doing a string replacement operation on any strings that appear in either your Outputs or Extras parameters.
  2. Whenever any string inside Outputs or Extras contains braces (for example: "funny {something} what"), even if these strings are nested inside lists or tuples (dicts and other objects are ignored), then string replacement will take place.
  3. The sources for string replacement are either
    1. regular expression matches (named and unnamed) from the patterns inside formatter(), for example formatter(r"/seq_R[12](?P<FILENR>_[0-9]{4}).fastq.gz"),
      or
    2. path components of the Inputs, i.e. basename, ext, path, subdir, subpath
  4. String replacement follows Python string formatting rules, so that you can refer to
    "{path}": a list of the paths of all the file names in the Inputs parameter
    "{path[0]}": the path of the first file name in the Inputs parameter
    "{path[0][0]}": the first letter of the first file path in the Inputs parameter
  5. Every time string replacement fails, the assumption is that this is a failure to match and that this job should be excluded. This is by design.
  6. If you don't want string replacement, i.e. you need a brace to appear in the Outputs or Extras parameters for whatever reason, then escape the braces by doubling. This is the equivalent of using double backslashes in regular expression substitutions.
    Formatter example: "{{ {path[0]}"  -> "{ /your/path"
    Regex example: r"\\ \1"         ->  "\ regex_group_1"
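
The rules above can be sketched in plain Python. This is only an approximation under stated assumptions: try_format is a hypothetical helper (not a Ruffus function) that treats a formatting failure as "no match", and the path list is a made-up stand-in for what formatter() would supply.

```python
def try_format(template, **fields):
    """Hypothetical helper: return the formatted string, or None on failure."""
    try:
        return template.format(**fields)
    except (KeyError, IndexError, AttributeError, TypeError, ValueError):
        # Mimics rule 5: a failed replacement is treated as "no match".
        return None

# Made-up stand-in for the per-input path list.
fields = {"path": ["/tmp/raw", "/tmp/other"]}

whole = try_format("{path}", **fields)             # the whole list
first = try_format("{path[0]}", **fields)          # first element
first_char = try_format("{path[0][0]}", **fields)  # first letter of first element
literal = try_format("{{ {path[0]}", **fields)     # doubled brace stays literal
missing = try_format("echo {inputfile[0]}", **fields)  # unknown field name

for result in (whole, first, first_char, literal, missing):
    print(result)
```

In this sketch the last call returns None because there is no field called inputfile, which corresponds to the job being excluded rather than an error being raised.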

When things don't work out, you should:
  1. See if increasing the verbosity helps tell you where Ruffus is failing to match. It is always sensible when developing a pipeline to have a dry run with higher verbosity (pipeline_printout(sys.stdout, verbose = 4)).
  2. Try printing out the match without a dereference to understand what is being matched.
    In your example, if you set

    ..., extras=["echo {FILENR}"]).mkdir(trim_05)
    ...
    pipeline.run(verbose=4)

    you would see:

    Job  = [[.../raw/seq_R1_0032.fastq.gz, .../raw/seq_R2_0032.fastq.gz] 
                -> [.../trim1/seq_R1_0032_val_1.fq.gz, .../trim1/seq_R2_0032_val_2.fq.gz], echo {0: '_0032'}]

    This tells you that FILENR is a collection of matches to each element in Inputs (in your case, a list of one), so that FILENR[0] == "_0032", as expected.

  3. Complain loudly on this list as you have done! 
    :-)
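
As a small sketch of why the undereferenced field prints that way, the snippet below assumes FILENR has the dict-like shape shown in the verbose output ({0: '_0032'}) and uses plain str.format rather than Ruffus:

```python
# Assumed shape of FILENR for one job, copied from the verbose output.
FILENR = {0: "_0032"}

# Without a dereference, the whole collection is rendered into the string.
whole = "echo {FILENR}".format(FILENR=FILENR)

# With [0], only the first match is inserted.
first = "seq_R1{FILENR[0]}_val_1.fq.gz".format(FILENR=FILENR)

print(whole)
print(first)
```

Printing the field without an index is therefore a quick way to see which indices are available before writing {FILENR[0]}.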
Hope this helps.
Please write back whether this solves your pipeline design problems or not.

Thanks

Leo

Karl Nordström

Sep 7, 2015, 10:50:07 AM
to ruffus_discuss
It does. The problem must be somewhere on my end. I did the coding and running through PyDev in Eclipse. In that environment, task 1 works and task 2 does not. I have now tried it on the command line, and there both tasks fail. I already make use of the verbose output, and when run through PyDev, cmd gets transformed from echo {inputfile} to echo {}.

I'll add testing on the command line as a private step before complaining loudly next time.

Thanks a lot for helping me clarify this.

/Karl