Splitting multiple files...?

Marc Hoeppner

Apr 6, 2017, 6:31:16 AM
to Nextflow
Hi,

this is a technical/design question, hoping to get a few pointers. I have a folder full of VCF files (let's assume they are uncompressed) and want to 
a) process all files in parallel
b) split each vcf file into chunks of 5000 lines for parallelism
c) annotate each chunk with e.g. VEP
d) merge the chunks for each input file and create one output file per input file.

I cannot seem to figure out how to do this with Channels or processes - I always lose the reference to the original input file for naming the output.

Cheers,
Marc

Paolo Di Tommaso

Apr 6, 2017, 8:28:48 AM
to nextflow
Hi,

Does the VCF header matter? I mean, is it a problem if the first chunk contains the VCF header?

p



--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Marc Hoeppner

Apr 6, 2017, 8:36:51 AM
to Nextflow
Hi,

No, for the time being I am happy to ignore the header.

/M



Paolo Di Tommaso

Apr 6, 2017, 9:10:58 AM
to nextflow
A possible implementation could be something like this:


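// Glob matching the input VCF files, and the directory for the merged outputs
params.in_vcf = 'data/vcf/*.vcf'
params.out_dir = 'default/out/path'

out_dir = file(params.out_dir)
out_dir.mkdirs()

// Pair each file with its name, so every chunk keeps a reference to its source file
Channel.fromPath(params.in_vcf)
       .map { vcf -> tuple(vcf.name, vcf) }
       .splitText(by: 5000, file: true)
       .set { chunks_ch }

// Annotate each 5000-line chunk independently
process annotate {
  input:
  set id, file(chunk) from chunks_ch
  output:
  set id, file('chunk.vep') into vep_ch

  script:
  """
  VEP_annotation_command --in $chunk --out chunk.vep
  """
}

// Concatenate the annotated chunks grouped by id: one merged file per input file
vep_ch
   .collectFile()
   .subscribe { merged_file -> merged_file.copyTo(out_dir) }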



I've not tested it, so it may contain errors. I guess annotating a single chunk is a very quick process, so it could be more efficient to annotate multiple chunks at a time.

We can optimise it later, if the above solution makes sense. 
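
If the names of the merged files need to be controlled explicitly, one option is to pass a closure to collectFile that returns a (name, content) pair. The following is only a sketch and has not been run against this pipeline: it reuses vep_ch and params.out_dir from the snippet above, and the '.vep' suffix is an assumption.

```groovy
// Sketch: group the annotated chunks by original file name (item[0]) and
// append each chunk file (item[1]) to '<original name>.vep'.
// storeDir makes collectFile write the merged files into the output directory.
vep_ch
   .collectFile(storeDir: params.out_dir) { item ->
       [ "${item[0]}.vep", item[1] ]
   }
   .subscribe { merged_file -> println "Merged: ${merged_file.name}" }
```

This would replace the vep_ch.collectFile() block above rather than run alongside it, since the channel can only be consumed once in this syntax. The order in which chunks are appended depends on the sort option of collectFile, so it is worth checking when record order matters.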


Cheers,
Paolo



Marc Hoeppner

Apr 7, 2017, 5:29:57 AM
to Nextflow
Hi,

Based on your suggestion (I hope...), I updated my pipeline, but it only seems to process the first chunk of the first file. Any ideas where things are going wrong?


I am executing this like so:

nextflow -c nextflow.config run main.nf --vcf '/path/to/files/*.vcf' --chunkSize 500
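
For context, Nextflow exposes the --vcf and --chunkSize options from that command line as params.vcf and params.chunkSize inside the script. The pipeline file itself is not shown in this thread, so the following is only a minimal sketch of how a main.nf might consume them; the default values and the channel name chunks_ch are assumptions.

```groovy
// Hypothetical defaults, overridden by --vcf and --chunkSize on the command line
params.vcf = 'data/*.vcf'
params.chunkSize = 5000

// Keep the original file name next to every chunk, then split by line count
Channel.fromPath(params.vcf)
       .map { vcf -> tuple(vcf.name, vcf) }
       .splitText(by: params.chunkSize as int, file: true)
       .set { chunks_ch }
```

The single quotes around the glob on the command line matter: they stop the shell from expanding *.vcf, so Nextflow receives the pattern itself and can match every file.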


Marc Hoeppner

Apr 7, 2017, 5:34:29 AM
to Nextflow
This seems to produce all the expected chunked outputs - the error seems to be with how things are published in the collectVep and collectAnnovar processes - those seem to fail to run for all "branches". 

Paolo Di Tommaso

Apr 7, 2017, 5:40:35 AM
to nextflow
Can I suggest using the Gitter channel, so that we can interact more quickly?



Cheers,
Paolo

 
