Duplicated branch behaviour

António Miguel de Jesus Domingues

Mar 8, 2017, 9:10:38 AM

to bpipe-discuss

Hi all,

Following these instructions, I was trying to input a list of SRR IDs and download the data using Bpipe branches. A minimal example (it downloads only 5 reads per run, so it is quick and painless on the disk):

regions = [
  'SRR886446',
  'SRR886447',
  'SRR886448',
  'SRR886449',
  'SRR886450',
  'SRR886451',
  'SRR886452',
  'SRR886453',
  'SRR886454',
  'SRR886455',
  'SRR886456',
  'SRR886457',
  'SRR886458',
  'SRR886459',
  'SRR886460'
]


DownloadSRA_se = {
   output.dir = "sra"
   input=branch.name
   SAMPLE_NAME=input


   produce(SAMPLE_NAME + "_1.fastq.gz"){
      exec """
        echo "Download:" $input &&
        fastq-dump -X 5 --split-files --gzip --outdir $output.dir $input 


      """
   }
}




region_stage = {
  branch.region = branch.name
  println "I will work on region ${branch.region}"
}


run {
  regions * [region_stage, DownloadSRA_se]
}

region_stage is there to test the input, and it works as expected. The next stage, however, operates on only a few branches and, more strangely, does so multiple times. An example of the output (notice the duplication of SRR886446):

I will work on region SRR886452
I will work on region SRR886459
I will work on region SRR886446
I will work on region SRR886458
I will work on region SRR886456
I will work on region SRR886454
I will work on region SRR886457
I will work on region SRR886460
I will work on region SRR886450
I will work on region SRR886447
Download: SRR886452
Download: SRR886446
Download: SRR886446
Read 5 spots for SRR886446
Written 5 spots for SRR886446
Download: SRR886448
Download: SRR886452
Download: SRR886452
Download: SRR886452
Download: SRR886452
Download: SRR886446
Read 5 spots for SRR886446
Written 5 spots for SRR886446

I have tried a few variants of the fastq-dump command, but the problem persists. Is this a bug, or is my use of branches incorrect? How can I solve it?

Bpipe Version 0.9.9_beta_1 Built on Mon Aug 24 07:41:21 CEST 2015

Cheers,

António

Marc Hoeppner

Mar 16, 2017, 7:06:24 AM

to bpipe-discuss

Simon can of course correct me here, but I would refrain from trying to overwrite reserved variables (input) and just add

forward input

to the end of your region_stage module

There is a more philosophical issue here, though: Bpipe operates on files, so your general approach may create unexpected problems (I haven't tested this myself).

Simon

Mar 18, 2017, 8:24:44 AM

to bpipe-discuss

I think Marc is on the right track. The approach here is problematic because both "input" and "region" are variables that Bpipe sets and assigns special meaning to (for example, it expects $input to map to a file and will check that it exists on the file system at various points). I'm not quite sure why you see the behavior you do, but it would definitely be cleaner not to use "region" and not to set "input".

An example of how I'd expect your pipeline stage to look is as follows:

DownloadSRA_se = {
   output.dir = "sra"

   branch.sra_id = branch.name
   SAMPLE_NAME=sra_id
   produce(SAMPLE_NAME + "_1.fastq.gz"){
      exec """
        echo "Download:" $sra_id

        fastq-dump -X 5 --split-files --gzip --outdir $output.dir $sra_id 
      """
   }
}

Hope this helps!

Simon

António Miguel de Jesus Domingues

Mar 20, 2017, 10:05:52 AM

to bpipe-discuss

It did help, thanks!

Your guess is probably right: re-using variables that already have a special meaning in Bpipe is not a good idea, and it was confusing the system - my bad. I also initially misunderstood Marc's suggestion and could not get it to work. Anyway, I am leaving the final working script here for future reference:


def sra_file="SRP024271_description.tsv"

def regions = []
new File(sra_file).eachLine{ line ->
  regions << line.split("\t")[3]

}

DownloadSRA_se = {
   output.dir = "sra"
   branch.sra_id = branch.name
   SAMPLE_NAME=sra_id
   produce(SAMPLE_NAME + "_1.fastq.gz"){
      exec """
        echo "Download:" $sra_id

        fastq-dump --split-files --gzip --outdir $output.dir $sra_id 
      """
   }
}

run {
  regions * [ DownloadSRA_se ]
}

The file SRP024271_description.tsv was obtained using the Bioconductor package SRAdb and contains the following info (top entries):

SRP024271 SRS437906 SRX297174 SRR886455 (+)4SUTP_128cell_rep2 RNA-Seq GSM1155230: (+)4SUTP_128cell_rep2; Danio rerio; RNA-Seq SINGLE -
SRP024271 SRS437904 SRX297172 SRR886449 (+)4SUTP_256cell_rep1 RNA-Seq GSM1155228: (+)4SUTP_256cell_rep1; Danio rerio; RNA-Seq SINGLE -
SRP024271 SRS437904 SRX297172 SRR886451 (+)4SUTP_256cell_rep1 RNA-Seq GSM1155228: (+)4SUTP_256cell_rep1; Danio rerio; RNA-Seq SINGLE -
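
For anyone adapting this: the script above takes the fourth tab-separated field of each row as the run accession, and a blank or truncated line would make `line.split("\t")[3]` throw. Below is a slightly more defensive sketch of that parsing step in plain Groovy. The helper name `readRunIds` and its `column` parameter are made up for illustration and are not part of Bpipe; the "SRR" prefix check is also an assumption based on the rows shown here.

```groovy
// Hypothetical helper (not a Bpipe function): collect run accessions
// from a tab-separated description file such as the one shown above.
List<String> readRunIds(File tsv, int column = 3) {
    def ids = []
    tsv.eachLine { line ->
        def fields = line.split("\t")
        // Skip blank or truncated lines instead of failing on a missing column
        if (fields.size() > column) {
            def id = fields[column].trim()
            // Assumption: run accessions in this file start with "SRR"
            if (id.startsWith("SRR")) {
                ids << id
            }
        }
    }
    // Drop repeated accessions so each run gets only one branch
    return ids.unique()
}

// Small self-contained demo using a temporary file with made-up sample names
def demo = File.createTempFile("sra_description", ".tsv")
demo.text = "SRP024271\tSRS437906\tSRX297174\tSRR886455\tsample_a\n" +
            "\n" +
            "SRP024271\tSRS437904\tSRX297172\tSRR886449\tsample_b\n"
println readRunIds(demo)
demo.delete()
```

In the pipeline script, something like `def regions = readRunIds(new File(sra_file))` should be able to stand in for the `eachLine` loop, though I have only tried the helper as a standalone Groovy script and not inside Bpipe.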

Let me add that even though Bpipe is geared towards files as input/output, the ability to start with parameters read from files, or just an SRA accession ID, is quite useful for many things. My use case is, I think, typical: run X pipeline on publicly available data. Usually I would do this in 4 semi-automated but independent steps: