customize chromosomes

Emanuel Schmid

Oct 15, 2017, 3:58:57 AM
to bpipe-discuss
Hi
I have an assembled genome whose FASTA headers can't be described by a wildcard,
so I can't run something like chr(1..20) to parallelize.
Somewhere I found a solution similar to the one below:

chr(chromosome)*[myTask]

I believe that I could get my individual annotation with a stage like:

 myChrom = {
         doc "defines the chromosome or better contigs of our assembled genome for further splitting"
         exec """
         grep "^>" $input.fasta | perl -npe 's/\\>//' > ${output(chromosomes.txt)}
         """
         chromosomes = new File("${output(chromosomes.txt)}").text.split("\\n")
         forward input
 }



The major problem I am facing is that my script won't run, because the variable chromosome is not defined at run time: I only generate it later, inside the pipeline.
Is there any way to circumvent this?

Emanuel Schmid

Oct 17, 2017, 3:58:24 AM
to bpipe-discuss
I think I fixed part of it by separating the pipeline stages into two run blocks:

run {
    myJobs + myChrom
}

run {
    chr(chromosome) * [myJob2]
}

With the Groovy stage:

myChrom = {
    doc "defines the chromosome or better contigs of our assembled genome for further splitting"
    produce("chromosomes.txt") {
        exec """
            grep "^>" $input.fasta | perl -npe 's/\\>//' > $output
        """
    }
    forward input
}

But how do I now get the output files of the first run block into the second one?
It picks up one file correctly (*.bam), but not the other (the genome assembly, *.fasta).
It does pick up both if I simply chain the second run block's stages onto the first one (but then, again, my chromosomes won't work).


Emanuel Schmid

Oct 17, 2017, 4:00:23 AM
to bpipe-discuss
Sorry, I forgot to mention:
between the two run blocks I define:

chromosome = new File("chromosomes.txt").text.split("\\n")
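
A slightly more defensive way to build that list, as a sketch only. It assumes chromosomes.txt sits in the working directory, and that the contig name used in the BAM file is the first whitespace-delimited token of each FASTA header (a common convention, but worth checking against the BAM header). The variable names chromFile and lines are made up for illustration.

```groovy
// Sketch: read contig names, tolerating blank lines and header descriptions.
def chromFile = new File("chromosomes.txt")
if (!chromFile.exists()) {
    // Assumption: the first run block is what creates this file.
    throw new FileNotFoundException("chromosomes.txt not found; run the first run block to create it")
}

// Trim each line and drop empty ones (e.g. from a trailing newline).
def lines = chromFile.readLines().collect { it.trim() }.findAll { it }

// Keep only the first token of each header; cast to String[] so the
// result has the same type as the split("\\n") version above.
chromosome = lines.collect { it.split(/\s+/)[0] } as String[]
```

Failing early with a clear message is easier to debug than an undefined or empty chromosome list reaching chr(...).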

Simon Sadedin

Oct 18, 2017, 1:29:25 AM
to bpipe-discuss
Hi Emanuel,

This is a really good question ... it's prompting me to think about the best way to incorporate proper support for dynamic branching (i.e., branching that can change due to the outputs of the pipeline itself).

Currently the only dynamic branching happens on the basis of splitting into one branch per input file, where the number of files can be generated dynamically by the previous stages. So one (hacky) workaround here is to actually create a file for each chromosome (or FASTA sequence), and then split by file pattern. Here's a trivial example:


hello = {
    println "Hello"
    exec """
       touch mars.planet

       touch jupiter.planet
    """

    forward(['mars.planet','jupiter.planet'])
}

world = {
    println "World $branch.name"
}

run {
    hello + '%.planet' * [ world ]
}

As you'll see, the 'world' stage executes for both mars and jupiter. You could make each file actually contain the FASTA of the assembled sequence and it might even be useful. But it causes your pipeline to create all these unnecessary files. 
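
Applied to the contig case, the marker-file workaround might look roughly like this. It is a sketch, not tested against a particular Bpipe version: the stage names make_markers and per_contig and the .contig suffix are invented for illustration, chromosomes.txt is assumed to exist already (one contig name per line), and the contig names are assumed to be safe to use as file names.

```groovy
// Hypothetical stage: create one empty marker file per contig and forward them.
make_markers = {
    def names = new File("chromosomes.txt").readLines().collect { it.trim() }.findAll { it }
    def markers = names.collect { it + ".contig" }
    markers.each { new File(it).createNewFile() }   // empty file, used only for branching
    forward(markers)
}

// Hypothetical per-branch stage: branch.name carries the part matched by '%'.
per_contig = {
    println "Processing contig $branch.name"
}

run {
    make_markers + '%.contig' * [ per_contig ]
}
```

Each contig still costs one (empty) file on disk, which is the overhead described above; names containing dots or path separators would need sanitizing before being used this way.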

I'll have a bit more of a think about this - there should definitely be a good way to do something like this in Bpipe!

Cheers,

Simon

Emanuel Schmid

Oct 19, 2017, 6:22:10 AM
to bpipe-discuss
Thanks for the input!
Indeed, it would be unfortunate to have a FASTA file for each, as I have ~10,000 contigs in total.
I managed to get the above solution working, essentially by splitting my pipeline into two pipelines:
the first one assembles the genome, and the second one then uses the FASTA headers as chromosomes and splits the BAM file on the fly.

load 'bp_pipelineStages.groovy'
chromosome = new File("chromosomes.txt").text.split("\\n")
run {
    chr(chromosome) * [calcCoverageBG + identifyCov] + catBG + plotBG
}