Splitting multiple files...?

Marc Hoeppner

Apr 6, 2017, 6:31:16 AM

to Nextflow

Hi,

This is a technical/design question; I am hoping to get a few pointers. I have a folder full of VCF files (let's assume they are uncompressed) and want to:

a) process all files in parallel

b) split each VCF file into chunks of 5000 lines for parallelism

c) annotate each chunk with e.g. VEP

d) merge the chunks for each input file and create one output file per input file.

I cannot seem to figure out how to do this with Channels or processes - I always lose the reference to the original input file for naming the output.

Cheers,

Marc

Paolo Di Tommaso

Apr 6, 2017, 8:28:48 AM

to nextflow

Hi,

Does the VCF header matter? I mean, is it a problem if the first chunk contains the VCF header?

p

--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Marc Hoeppner

Apr 6, 2017, 8:36:51 AM

to Nextflow

Hi,

No, for the time being I am happy to ignore the header.

/M

Paolo Di Tommaso

Apr 6, 2017, 9:10:58 AM

to nextflow

A possible implementation could be something like this:

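// Glob matching the input VCF files (a bare directory path would emit the directory itself)
params.in_vcf = 'data/vcf/*.vcf'
params.out_dir = 'default/out/path'

out_dir = file(params.out_dir)
out_dir.mkdirs()

// Pair each file with its name so that every chunk keeps a reference to its source file
Channel.fromPath(params.in_vcf)
    .map { file -> tuple(file.name, file) }
    .splitText(by: 5000, file: true)
    .set { chunks_ch }

process annotate {
    input:
    set id, file(chunk) from chunks_ch

    output:
    set id, file('chunk.vep') into vep_ch

    script:
    """
    VEP_annotation_command --in $chunk --out chunk.vep
    """
}

// Chunks sharing the same id should be concatenated into one file named after that id
vep_ch
    .collectFile()
    .subscribe { merged_file -> merged_file.copyTo(out_dir) }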

I haven't tested it, so it may contain errors. In the above I am guessing that the annotation is a very quick process, so it could be more efficient to process multiple file annotations at a time.

We can optimise it later, if the above solution makes sense.

Cheers,

Paolo
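
For intuition, the same split/annotate/merge flow for a single file can be sketched in plain shell. This is only an illustration of the idea, not part of the pipeline above: each chunk name carries the original file name as a prefix, which is the role the id element of the tuple plays in the Nextflow version. The annotate_chunk function is a hypothetical stand-in for the real VEP command, and the file names and the 12-line demo input are made up.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for the real annotation command (e.g. VEP):
# it only appends an "ANNOTATED" column to every line of the chunk.
annotate_chunk() {
    awk '{ print $0 "\tANNOTATED" }' "$1" > "$2"
}

# Split one file into chunks, annotate each chunk, and merge the results
# into a single output file named after the input file.
split_annotate_merge() {
    input=$1; chunk_lines=$2; out_dir=$3
    id=$(basename "$input")      # the key that ties chunks to their source file
    tmp=$(mktemp -d)

    # Chunks are written as <id>.chunk.aa, <id>.chunk.ab, ...
    split -l "$chunk_lines" "$input" "$tmp/$id.chunk."

    # The glob is expanded once, before any .vep file exists
    for chunk in "$tmp/$id".chunk.*; do
        annotate_chunk "$chunk" "$chunk.vep"
    done

    # The suffixes sort lexicographically, so the original line order is kept
    mkdir -p "$out_dir"
    cat "$tmp/$id".chunk.*.vep > "$out_dir/$id.vep"
    rm -rf "$tmp"
}

# Demo with a made-up 12-line file and chunks of 5 lines
demo_dir=$(mktemp -d)
i=1
while [ "$i" -le 12 ]; do
    echo "record$i"
    i=$((i + 1))
done > "$demo_dir/sample.vcf"

split_annotate_merge "$demo_dir/sample.vcf" 5 "$demo_dir/out"

# Print the line count of the merged file
wc -l < "$demo_dir/out/sample.vcf.vep"
```

Unlike this sequential loop, the Nextflow version runs the annotation of each chunk as a separate task, so the chunks can be processed in parallel.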

Marc Hoeppner

Apr 7, 2017, 5:29:57 AM

to Nextflow

Hi,

Based on your suggestion (I hope), I updated my pipeline, but it only seems to process the first chunk of the first file. Any ideas where things are going wrong?

https://pastebin.com/UJjsDawy

I am executing this like so:

nextflow -c nextflow.config run main.nf --vcf '/path/to/files/*.vcf' --chunkSize 500

Marc Hoeppner

Apr 7, 2017, 5:34:29 AM

to Nextflow

This seems to produce all the expected chunked outputs - the error seems to be with how things are published in the collectVep and collectAnnovar processes - those seem to fail to run for all "branches".

Paolo Di Tommaso

Apr 7, 2017, 5:40:35 AM

to nextflow

Can I suggest using the Gitter channel so that we can interact more quickly?

https://gitter.im/nextflow-io/nextflow

Cheers,

Paolo
