output is not recognized


Emanuel Schmid

May 4, 2018, 10:12:35 AM
to bpipe-discuss
I am having a problem splitting an input file and getting Bpipe to recognize the generated files as outputs.
This is the part of the pipeline that gets stuck:

...+ [ Indels  + splitClusters ]  + "%_R*.split" * [ extractReads ]

The "Indels" stage generates three different files, which are correctly recognized as outputs:

Indels = {
        doc "analyze and visualize deletions"
        output.dir = "DELETIONS"
        outputs = [
        file(input.txt).name.replaceAll(~/.txt/ , ".Crispr.pdf"),
        file(input.txt).name.replaceAll(~/.txt/ , ".CrisprReads.txt"),
        file(input.txt).name.replaceAll(~/.txt/ , ".CrisprResults.txt"),
        ]
        produce(outputs){
                exec"""
                module add R/latest;
                Rscript bin/evaluateCrispr_Single.R $input.txt $RANGE $AMPLICON $POSITION;
        """
        }
}


The problem is in the next stage. I want to take the second output of the previous stage (referenced as $input2, which works) and split that file with awk on the 5th column:

splitClusters = {
        doc "splitting the reads of each deletion cluster"
        output.dir = "DELETIONS"
        exec """
                awk -v var="$input.prefix" '{f = var"."\$5".split"; print >> f; close(f)}' $input2;
        """
   
}

This does generate the files I want, but they are not recognized as outputs.
I tried all kinds of variations, using "produce" or moving the files again to make Bpipe see them:

//for FILE in DELETIONS/*.split; do name=\$( basename $FILE .split);  mv $FILE $output.dir/\${name}.split; done

Essentially, the pipeline reports that the stage finishes correctly, but the next stage fails because it can't find any "*.split" files. Instead it tries to continue with the outputs of the "Indels" stage:

extractReads = {
        doc "extract reads which were found to harbour interesting deletions"
        output.dir = "CRISPRed_RESULTS"
        exec"""
                bin/extract_scaffold_version4.pl -f $input.fasta -i $input.split -s > $output.fasta;
        """
}

Note: a pattern '%_R*.split' was provided, but did not match any of the files provided as input [ shortened.Crispr.pdf ,shortened.CrisprReads.txt, shortened.CrisprResults.txt]

Simon

May 5, 2018, 10:56:00 PM
to bpipe-discuss
Hi Emanuel,

It looks like the problem is that the stage never references $output and has no produce(...) statement, so Bpipe thinks that there are in fact no outputs.

This is one of the key aspects of Bpipe that is often not clear to newcomers: Bpipe generally tries to protect your pipeline from "seeing" files that it isn't meant to. So if you reference $input.split and someone randomly copies a .split file into the directory while the pipeline is running, Bpipe won't "see" that file, because it doesn't think it was generated by the pipeline. A reference to $input can only find something that was actually declared as an output (e.g. via $output) by a previous stage, or that was an initial input to the pipeline (there are a couple of exceptions to this rule, but they are exactly that - exceptions).

For this scenario, where the names of the output files are unknown beforehand and are driven by the data, Bpipe allows a wildcard form of 'produce' that causes Bpipe to recognise any file with a given extension as an output of a pipeline stage. See the Bpipe documentation for produce, a few paragraphs down.


So that's probably the right solution here: wrap your exec in something like:

    produce('*.split') {
        ....
    }

Do note the limitation, though: any file ending in .split will then be considered an output, even if it wasn't generated by your command. That matters if you are using parallelism at a higher level, so that multiple parallel stages might be generating these files in the same directory at the same time!
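
For illustration, a minimal sketch of how the splitClusters stage from the original post might look with the wildcard form applied. It is untested and assumes the rest of the stage stays as posted. The only change to the awk command is that close() is called on the same file name that was written to. Whether the '*.split' pattern is resolved relative to output.dir is an assumption that should be checked against the Bpipe documentation.

```groovy
splitClusters = {
    doc "splitting the reads of each deletion cluster"
    output.dir = "DELETIONS"

    // Wildcard form: file names are data-driven, so any new file
    // matching *.split is treated as an output of this stage.
    produce("*.split") {
        exec """
            awk -v var="$input.prefix" '{f = var"."\$5".split"; print >> f; close(f)}' $input2
        """
    }
}
```

With the outputs declared this way, a later stage that references $input.split has something to resolve against, subject to the limitation described above.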

Hope this helps,

Simon

Emanuel Schmid

May 6, 2018, 2:56:56 AM
to bpipe-discuss
Dear Simon
Thanks a lot for the quick answer.
Indeed, I wanted to avoid a simple produce('*.split') statement because I am running many samples.
Instead I tried to get the following working, but it failed:

produce("$input.prefix*.split")

Would that be a solution in general, and if so, any idea why it might fail in my case?
