GATK: HaplotypeCaller -> GatherVcfs

917 views
Skip to first unread message

Damian Loska

unread,
Aug 8, 2018, 8:16:40 AM8/8/18
to Nextflow
Hey Paolo,
Greetings from Warsaw, I've just re-discovered NextFlow, it is great!!!

I'd like to run a pipeline simultaneously with many samples, but there are moments, where each sample is "scattered" - then I need to gather results again, per sample - and here I'm confused how to proceed with it.

nLinesInBed=((bedFile.readLines().size()+30)/25).toInteger()
bedSplits
= Channel.from(bedFile).splitText( by: nLinesInBed, file: true )
bams
= Channel.fromPath("/productionTest/Bams.csv").splitCsv()


process haplotypeCaller
{
  maxForks
30
  tag
"$name "
  echo
true


  input
:
   
set name, bam, bai from bams
    each bed
from bedSplits


  output
:
    file
"${name}.part.vcf" into vcfs


 
"""
   java -Xmx3000m -jar /PROGRAMS/gatk-local.jar HaplotypeCaller -R ${params.ref} -I ${bam} -O "
${name}.part.vcf" -L $bed
  """

}

...
then somewhow:
java
-Xmx3000m gatk \
           
GatherVcfs  \
     
-I sample1.part1.vcf -I sample1.part2.vcf -I sample1.part3.vcf ... -I sample25.part4.vcf
     
-O sample1.raw.vcf

so, per sample:
1) I divide BED file to have 25 smaller .bed files
2) I run 25 times HaplotypeCaller in parallel obtaining 25 .vcf files per sample
3) somehow I need to combine these 25 .vcfs into 1 .vcf and still keep track which sample is which...

Does it make sense to run analysis like this?
Should I create kind of "Channel of Channels" or a "Channel with list of lists"? 
Or maybe should I run a NextFlow script from NextFlow script? 

Thanks for any hints!

Paolo Di Tommaso

unread,
Aug 8, 2018, 9:44:20 AM8/8/18
to nextflow
Hi Damian, 

Modify the `haplotypeCaller` process output to capture the sample name as well, eg: 


  output:
    set vale(name), file("${name}.part.vcf") into vcfs


Then, the use groupTuple to collect all files having the same sample name. 

 
See for example here


p

On Wed, Aug 8, 2018 at 1:52 PM, Damian Loska <damian...@gmail.com> wrote:
Hey Paolo,
Greetings from Warsaw, I've just re-discovered NextFlow, it is great!!!

I'd like to run a pipeline simultaneously with many samples, but there are moments, where each sample is "scattered" - then I need to gather results again, per sample - and here I'm confused how to proceed with it.

nLinesInBed=((bedFile.readLines().size()+30)/25).toInteger()
bedSplits
= Channel.from(bedFile).splitText( by: nLinesInBed, file: true )

cegatBams
= Channel.fromPath("/mnt/ssd_01/productionTest/cegatBams.csv").splitCsv()


process haplotypeCaller
{
  maxForks
30
  tag
"$name "
  echo
true

  input
:

   
set name, bam, bai from cegatBams
    each bed
from bedSplits

  output
:

    file
"${name}.part.vcf" into vcfs

 
"""
   java -Xmx3000m -jar /PROGRAMS/gatk-local.jar HaplotypeCaller -R ${params.ref} -I ${bam} -O "
${name}.part.vcf" -L $bed
  """

}

...
then somewhow:
java -Xmx3000m gatk \
            GatherVcfs  \
     -I sample1.part1.vcf -I sample1.part2.vcf -I sample1.part3.vcf ... -I sample25.part4.vcf
     -O sample1.raw.vcf



so, per sample:
1) I divide BED file to have 25 smaller .bed files
2) I run 25 times HaplotypeCaller in parallel obtaining 25 .vcf files per sample
3) somehow I need to combine these 25 .vcfs into 1 .vcf and still keep track which sample is which...

Does it make sense to run analysis like this?
Should I create kind of "Channel of Channels" or a "Channel with list of lists"? 
Or maybe should I run a NextFlow script from NextFlow script? 

Thanks for any hints!

--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.
Visit this group at https://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Damian Loska

unread,
Aug 9, 2018, 6:14:42 AM8/9/18
to Nextflow
Thanks, that was very useful!
Now almost everything works in this solution.
One more question: Is there a way to force input order? 
GATK needs everything in a strict order, so the problem appears in GatherVcfs. My full code is like this:

bedFile=file("/productionTest/test_head.bed")
nLinesInBed
=((bedFile.readLines().size()+30)/25).toInteger()

bedSplits
= Channel.from(bedFile).splitText( by:nLinesInBed, file:file("/productionTest/splits/" ))

bams
= Channel.fromPath("/productionTest/Bams.csv").splitCsv()

process haplotypeCaller
{

  maxForks
5
  tag
"$name, $bn"
  echo
true
  input
:
   
set name, bam, bai from bams
    each bed
from bedSplits

  output
:
   
set val(name), file("${name}.${bn}.vcf") into vcfs

  script
:
    bn
=bed.baseName

 
"""
  java -Xmx3000m -jar gatk.jar HaplotypeCaller -R ${params.ref} -I ${bam} -O ${name}.${bn}.vcf -L $bed
  """

}

process
gatherVcfs {
  maxForks
5
  publishDir
"productionTest/results/"
  echo
true
  tag
"$name"
  input
:
   
set val(name), file(vcf) from vcfs.groupTuple(sort:true)

  output
:
    file
("${name}.ready.vcf") into finalVCFs

  script
:
    ins
=vcf.join(" -I ")                 // <<<<<------------------ how to force this line to make in order??
 
"""
    java -Xmx3000m -jar gatk.jar \
    GatherVcfs  \
    -I $ins \
    -O ${name}.ready.vcf
  """

}


now the last code is executed like this (random .vcf order):
java -jar gatk.jar GatherVcfs -I sample.test_head.6.vcf -I sample.test_head.17.vcf -I sample.test_head.7.vcf -I sample.test_head.1.vcf -I sample.test_head.8.vcf -I sample.test_head.12.vcf -I sample.test_head.13.vcf -I sample.test_head.10.vcf -I sample.test_head.11.vcf -I sample.test_head.2.vcf -I sample.test_head.14.vcf -I sample.test_head.3.vcf -I sample.test_head.16.vcf -I sample.test_head.9.vcf -I sample.test_head.15.vcf -I sample.test_head.4.vcf -I sample.test_head.5.vcf -O sample.ready.vcf

but for GATK it needs to be in order:
java -jar gatk.jar GatherVcfs -I sample.test_head.1.vcf -I sample.test_head.2.vcf -I sample.test_head.3.vcf -I sample.test_head.4.vcf ...

can I somehow sort this expression ins=vcf.join(" -I ") ??

thanks!




Paolo Di Tommaso

unread,
Aug 9, 2018, 8:38:40 AM8/9/18
to nextflow
To order them would be enough to user `.sort()` method, e.g. 

ins=vcf.sort().join(" -I ")


However it would be a alphabetic ordering, instead you need a numeric order by index in the file name. You can split each name, extract that index and sort by the number, as shown below: 


ins=vcf.sort { it.tokenize('.')[2].toInteger() }.join(" -I ")


Hope it helps. 


p


Damian Loska

unread,
Aug 9, 2018, 8:52:22 AM8/9/18
to Nextflow
No, it doesn't work in both cases:(


Caused by:
 
Cannot invoke method join() on null object


Source block:
    ins
=vcf.sort().join(" -I ")
 
"""
    vcf-concat $ins > ${name}.ready.vcf
  """




Caused by:
 
No signature of method: _nf_script_525f7296$_run_closure2$_closure19$_closure20.doCall() is applicable for argument types: (sun.nio.fs.UnixPath, sun.nio.fs.UnixPath) values: [52123MA.test_head.2.vcf.gz, 52123MA.test_head.3.vcf.gz]
Possible solutions: doCall(), doCall(java.lang.Object), findAll(), findAll()


Source block:

Paolo Di Tommaso

unread,
Aug 9, 2018, 8:59:09 AM8/9/18
to nextflow
Because each element is a Path, use instead:  


ins=vcf.sort { it.name.tokenize('.')[2].toInteger() }.join(" -I ")



p


Damian Loska

unread,
Aug 9, 2018, 9:21:21 AM8/9/18
to Nextflow
still doesn't work...
 
Caused by:
 
No signature of method: _nf_script_4c92845f$_run_closure2$_closure19$_closure20.doCall() is applicable for argument types: (sun.nio.fs.UnixPath, sun.nio.fs.UnixPath) values: [22789KS.test_head.2.vcf.gz, 22789KS.test_head.5.vcf.gz]

Possible solutions: doCall(), doCall(java.lang.Object), findAll(), findAll()


Source block:

    ins
=vcf.sort { it.name.tokenize('.')[2].toInteger() }.join(" -I ")
 
"""
    echo $ins


Paolo Di Tommaso

unread,
Aug 9, 2018, 10:08:18 AM8/9/18
to nextflow
what about this: 

ins = vcf instanceof Path ? vcf.name : vcf.sort { it.name.tokenize('.')[2].toInteger() }.join(" -I ")


p



Damian Loska

unread,
Aug 9, 2018, 4:48:14 PM8/9/18
to Nextflow
again:

Caused by:
 
No signature of method: _nf_script_a9bfb6a3$_run_closure2$_closure19$_closure20.doCall() is applicable for argument types: (sun.nio.fs.UnixPath, sun.nio.fs.UnixPath) values: [22789KS.test_head.10.vcf.gz, 22789KS.test_head.2.vcf.gz]

Possible solutions: doCall(), doCall(java.lang.Object), findAll(), findAll()


Source block:

    ins
= vcf instanceof Path ? vcf.name : vcf.sort { it.name.tokenize('.')[2].toInteger() }.join(" -I ")
 
"""
    echo $ins

 Caused by:
 
No signature of method: _nf_script_e5d98854$_run_closure2$_closure19$_closure20.doCall() is applicable for argument types: (sun.nio.fs.UnixPath, sun.nio.fs.UnixPath) values: [52123MA.test_head.3.vcf.gz, 52123MA.test_head.2.vcf.gz]

Possible solutions: doCall(), doCall(java.lang.Object), findAll(), findAll()


Source block:

    ins
= vcf instanceof Path ? vcf : vcf.sort { it.tokenize('.')[2].toInteger() }.join(" -I ")
 
"""
    echo $ins

it's intriguing :)






Paolo Di Tommaso

unread,
Aug 9, 2018, 4:51:20 PM8/9/18
to nextflow
I'm lost, provide a script that replicates your use case. 


p

Damian Loska

unread,
Aug 10, 2018, 8:47:19 AM8/10/18
to Nextflow
ok, in the attachment there. You should be able to run this example and obtain similar error.
Thanks!
Bams.csv
sortFiles.nf
test_head.bed

Paolo Di Tommaso

unread,
Aug 10, 2018, 9:36:09 AM8/10/18
to nextflow
It returns a different error message:

ERROR ~ Error executing process > 'gatherVcfs (name1)'

Caused by:
  Process `gatherVcfs` input file name collision -- There are multiple input files for each of the following file names: name1.splits.vcf.gz.tbi, name1.splits.vcf.gz


On Fri, Aug 10, 2018 at 2:47 PM, Damian Loska <damian...@gmail.com> wrote:
ok, in the attachment there. You should be able to run this example and obtain similar error.
Thanks!

--

Damian Loska

unread,
Aug 10, 2018, 10:59:34 AM8/10/18
to Nextflow
I obtain your error, when I run a code like this:
files names "${name}.vcf.gz" instead of "${name}.${bn}.vcf.gz"
process haplotypeCaller {
  maxForks
5
  tag
"$name, $bn"
  echo
true
  input
:

   
set name, bam, bai from someBams
    each bed
from bedSplits

  output
:
   
set val(name), file("${name}.vcf.gz"), file("${name}.vcf.gz.tbi") into vcfParts
   
set val(name), file("${name}.HC.bam"), file("${name}.HC.bai") into hcBamParts
 
   script
:
    bn
=bed.baseName
 
"""
  echo "
uuu${bam}aaa" > ${name}.vcf.gz.tbi
  echo  "
uuu${bam}aaa" > ${name}.HC.bam
  echo "
uuuuu" >  ${name}.vcf.gz
  echo "
kkkkk" > ${name}.HC.bai

  """
error:
[89/b79c4d] Submitted process > haplotypeCaller (name3, test_head.17)
ERROR
~ Error executing process > 'gatherVcfs (name3)'


Caused by:
 
Process `gatherVcfs` input file name collision -- There are multiple input files for each of the following file names: name3.vcf.gz, name3.vcf.gz.tbi



but when I use ${bn} in file names:
process haplotypeCaller {
  maxForks
5
  tag
"$name, $bn"
  echo
true
  input
:

   
set name, bam, bai from someBams
    each bed
from bedSplits


  output
:
   
set val(name), file("${name}.${bn}.vcf.gz"), file("${name}.${bn}.vcf.gz.tbi") into vcfParts
   
set val(name), file("${name}.${bn}.HC.bam"), file("${name}.${bn}.HC.bai") into hcBamParts

  script
:
   
bn=bed.baseName

 
"""
  echo "
uuu${bam}aaa" > ${name}.${bn}.vcf.gz.tbi
  echo  "
uuu${bam}aaa" > ${name}.${bn}.HC.bam
  echo "
uuuuu" >  ${name}.${bn}.vcf.gz
  echo "
kkkkk" > ${name}.${bn}.HC.bai
  """

}
${bn} here is an basenames from split files:
test_head.10.bed  test_head.12.bed  test_head.14.bed  test_head.16.bed  test_head.1.bed  test_head.3.bed  test_head.5.bed  test_head.7.bed  test_head.9.bed
test_head
.11.bed  test_head.13.bed  test_head.15.bed  test_head.17.bed  test_head.2.bed  test_head.4.bed  test_head.6.bed  test_head.8.bed

so at the end files are not the same, and I don't have name collision error.
vcf.sort() return null(?)

[f9/37aef8] Submitted process > haplotypeCaller (name3, test_head.17)
ERROR
~ Error executing process > 'gatherVcfs (name3)'



Caused by:
 
Cannot invoke method join() on null object


Source block:
    ins
= vcf.sort().join(" -I ")
 
"""
    echo $ins







Damian Loska

unread,
Aug 11, 2018, 6:15:24 AM8/11/18
to Nextflow
hey, it's working!
look what I've changed:
before:
bams = Channel.fromPath("/path/Bams.csv").splitCsv()
process proces1
{

  input
:
    set name, bam, bai from bams
  output:
   
set val(name), file("${name}.${bn}.vcf.gz"), file("${name}.${bn}.vcf.gz.tbi") into vcfsParts

process proces2 {
  input
:
   
set val(name), file(vcf), file(vcfIdx) from vcfParts.groupTuple()
  script
:
    ins
=vcf.sort { it.name.tokenize('.')[2].toInteger() }.join(" -I ")

after:
in the proces2
set name, vcf, vcfIdx from vcfParts.groupTuple()


so in super short:
imput1: set name, bam, bai from  // these all just strings taken from splitCsv()
output1: set val(name), file("${name}.${bn}.vcf"), file("${name}.${bn}.xxx") into vcfsParts  // I mark here, that the outputs are "val" or "file"
imput2: set name, vcf, vcfIdx from vcfParts.groupTuple()  // so here I don't mark anymore that my input is File? 
 
Ok, so the Channel already "knows" that vcf input is a file, so I don't have to "cast" it the second time, in proces2.
but in such case, why input1 works, if these are just strings? (paths taken from .csv file). How deos NextFlow know, that strings taken from .csv, are input files? (without casting it explicitly) 

Bams.csv file:
name1,/path/name1.bam,/path/name1.bam.bai
name2
,/path/name2.bam,/path/name2.bam.bai
name3
,/path/nama3.bam,/path/name3.bam.bai

Thanks!


Paolo Di Tommaso

unread,
Aug 11, 2018, 6:26:31 AM8/11/18
to nextflow
No, unfortunately that's not the right way. 

The output declaration types (file/val) have to match the ones used in the respective input. See for example [here](https://github.com/CRG-CNAG/CalliNGS-NF/blob/master/main.nf#L341). 

p


--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+u...@googlegroups.com.

Damian Loska

unread,
Aug 11, 2018, 6:45:33 AM8/11/18
to next...@googlegroups.com
heh, but it's working...
and with the file declaration the sorting doesn't work...

Running HaplotypeCaller on 1 sample takes ~24h on my server...
so I split my .BED file into 25 chunks and run it simultaneously - than I need to gather VCFs (and .HC.BAMs) in sorting order...
(I took this solution from GATK's Cromwell + WDL, the recommend doing it, as running HaplotypeCaller with many cores results in wrong output...)




To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+unsubscribe@googlegroups.com.



--
Regards,
Damian Loska
kom. PL 0048 516 899 956
móvil ES 0034 645 142 495

Steve

unread,
Aug 13, 2018, 11:20:32 AM8/13/18
to Nextflow
Hi Damian, just curious but why do you need to gather the vcf afterwards? And why do you need to split this for HaplotypeCaller?

I ask because I do almost the exact same thing in my workflow here:


Except I do not need to split for HaplotypeCaller since it can run in multi-thread mode. Instead, I need to do this same splitting for MuTect2 since it cannot run multi-threaded and similarly takes 24-36hrs per sample.

However, downstream of this, I never actually need to gather the VCFs again, I just process each one individually after that, in parallel via Nextflow. Eventually I convert the VCF into a table, and then I am able to combine this easily with a script I include in that repo ('concat-tables.py').

Maybe something like this would be helpful for your workflow? Not sure.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+u...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+u...@googlegroups.com.

Damian Loska

unread,
Aug 13, 2018, 12:17:48 PM8/13/18
to Nextflow
Hi,
A short answer:
HaplotypeCaller with multi-thread option sucks... :)

GATK team does NOT recommend it:
https://gatkforums.broadinstitute.org/gatk/discussion/1975/how-can-i-use-parallelism-to-make-gatk-tools-run-faster
about --nct: "We prefer not to use this option with HC; use it at your own risk."


Indeed, long time ago I've tested the same dataset with different numbers of threads/cores and I've got different results, so I don't trust this option;)

Before NextFlow, I've tried to use Cromwell + WDL: in every example they split .bed file to make parallel jobs, so I just follow their idea:
https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json

with such approach, from 1 patient I get 25 sub-vcfs, and I have to join them back. But for GatherVcfs (and GatherBamFiles from --bam-output) I need files in a correct order.
(ps. if try this approach, don't use HaplotypeCaller's "--interval_padding" option, it makes confusion later) 

Of course I can join files using different approaches, but more elegant way would be using one-line of NextFlow code :)
1 patient - 1 comple VCF, that's what I need:)

ps. Wow, your workflow is damn long workflow!! :D 

Damian Loska

unread,
Aug 14, 2018, 7:37:26 AM8/14/18
to Nextflow
ok, I've solved it in this way:
  script:
 
"""
     echo "
${vcf.join('\n')}" | sort -V > ${name}.vcf.list


      java -jar /path/gatk.jar GatherVcfs  \
      -I  ${name}.vcf.list \
      -O ${name}.raw.NF.vcf;
  """

But I'm not happy with this solution:)
I'd prefer this "elegant" solution would work:

ins=vcf.sort { it.name.tokenize('.')[2].toInteger() }.join(" -I ")

Cheers:)
Reply all
Reply to author
Forward
0 new messages