using sample-less database vcf as calls file in vcfeval comparison

431 views
Skip to first unread message

Katie

unread,
May 31, 2017, 6:39:32 PM5/31/17
to RTG Users
Hello,

I am currently trying to compare a dbSNP vcf file to a gold standard vcf file from Genome in a Bottle, using the GIAB sample as the baseline and dbSNP as the calls. However, using ALT as the sample name seems to only work for the baseline sample. If I run 

rtg RTG_MEM=15g vcfeval -c GIAB_VCF -b dbSNP_VCF -o outname --threads=1 --evaluation-regions=MYBEDFILE --template=TEMPLATE_FOLDER --squash-ploidy --sample ALT,GIAB_SAMPLE_NAME

vcfeval runs as expected. However, if I simply swap the calls and baseline file and the corresponding sample names as follows:

rtg RTG_MEM=15g vcfeval -b GIAB_VCF -c dbSNP_VCF -o outname --threads=1 --evaluation-regions=MYBEDFILE --template=TEMPLATE_FOLDER --squash-ploidy --sample GIAB_SAMPLE_NAME,ALT

I get this error

Error: Sample "ALT" not found in calls VCF.

Is it not currently possible to use the ALT sample name for the calls vcf file? Thanks!

Best,
Katie

Sean Irvine

unread,
May 31, 2017, 6:59:48 PM5/31/17
to Katie, RTG Users
Hi Katie,

We would expect what you are trying to do should work and my attempts to replicate your command do work for me on recent releases 3.7 and 3.8.

Can you please tell me which version of RTG you are using? (run "rtg version").

Sean.


--
You received this message because you are subscribed to the Google Groups "RTG Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+unsubscribe@realtimegenomics.com.
Visit this group at https://groups.google.com/a/realtimegenomics.com/group/rtg-users/.

Katie

unread,
Jul 20, 2017, 12:18:12 PM7/20/17
to RTG Users, kawilkin...@gmail.com
Hi Sean,

I apologize for the slow reply! I ended up having to go forward without this working, but am now back to trying to figure it out. I'm running version 3.7. I'm using the dbSNP vcf available here: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/GATK/All_20170710.vcf.gz and the GIAB vcf file available here: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3/NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz I edited the GIAB vcf file with 

gunzip NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz
sed -i '/^#/!s/^/chr/' NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf
sed -i 's/##contig=<ID=/##contig=<ID=chr/' NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf
sed -i 's/chrMT/chrM/' NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf
rtg bgzip NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf

to make the chromosome naming convention match between my vcf files. With these specific files, using the GIAB vcf file as my calls file and the dbSNP vcf file as the baseline file works. However, reversing the files gives me the error I mentioned above, "Error: Sample "ALT" not found in calls VCF". Is it possible for you replicate this error with these specific files? And/or would it be possible for you to point me to the vcf files you used where you didn't get this error? Thanks for your help!

Best,
Katie
To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+...@realtimegenomics.com.

Katie

unread,
Jul 21, 2017, 4:17:13 PM7/21/17
to RTG Users, kawilkin...@gmail.com
Hi Sean,

Since I'm using the --squash-ploidy option anyway, I decided to just add a set of fake genotypes to the dbSNP vcf file. I added '\tFORMAT\tALT' to the header line and added genotypes to every record with sed -i -s '/^#/!s/$/\tGT\t1|1/' /sas/seq4/kawilkin/test_rtg/All_20170710.vcf.
I'm not sure why I was unable to run vcfeval without a sample column in my calls file, but this workaround has fixed the problem. Thanks!

Best,
Katie

Sean Irvine

unread,
Jul 22, 2017, 5:45:54 PM7/22/17
to Katie, RTG Users
Hi Katie,

I'm still unable to replicate your problem.  Using the exact files you indicated I was able to run vcfeval in both directions (commands are below).  I did restrict to chr22 for testing but that should make no difference.  At the moment my best guess is there is some subtle command line error in how you are giving the sample names.

Are you able to send me the "vcfeval.log" file from the failing run, which will be written into the vcfeval output directory?

Sean.

wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/GATK/All_20170710.vcf.gz
rtg index All_20170710.vcf.gz

wget -O - ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3/NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz | gzip -d | sed '/^#/!s/^/chr/;s/##contig=<ID=/##contig=<ID=chr/;s/chrMT/chrM/' | rtg bgzip - >NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz
rtg index NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz

rtg vcfeval -t sdf.UCSC/ -c NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz -b All_20170710.vcf.gz --squash-ploidy --sample ALT,INTEGRATION -o out1 --region chr22

[high-complexity regions lines deleted]

Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   81.000              45842          42948        121    4181716     0.9972       0.0108     0.0215
     None              45842          42948        121    4181716     0.9972       0.0108     0.0215

rtg vcfeval -t sdf.UCSC/ -b NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz -c All_20170710.vcf.gz --squash-ploidy --sample INTEGRATION,ALT -o out2 --region chr22

[high-complexity regions lines deleted]

During ALT comparison no ROC data will be produced, as a sample is required by the selected ROC score field: GQ (FORMAT)
Could not maximize F-measure from ROC data, only un-thresholded statistics will be computed. Consider selecting a different scoring attribute with --vcf-score-field
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
     None              42946          45836    4181654        121     0.0108       0.9972     0.0215


To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+unsubscribe@realtimegenomics.com.

Katie

unread,
Jul 24, 2017, 4:49:33 PM7/24/17
to RTG Users, kawilkin...@gmail.com
Hi Sean,

I repeated the commands you included below exactly (except for changing the location of the template file) and the first thing I noticed was that to use rtg index, I have to include the -f flag. This made me wonder if the versions we're using might actually be different. Here is the complete output from rtg version:

Product: RTG Tools 3.7
Core Version: 97da0df (2017-03-17)
RAM: 37.2GB of 46.5GB RAM can be used by rtg (80%)
CPU: Defaulting to 24 of 24 available processors (100%)
JVM: Java HotSpot(TM) 64-Bit Server VM 1.8.0_101
License: No license file required
Contact: sup...@realtimegenomics.com
Patents / Patents pending:
US: 7,640,256, 9,165,253, 13/129,329, 13/681,046, 13/681,215, 13/848,653, 13/925,704, 14/015,295, 13/971,654, 13/971,630, 14/564,810
UK: 1222923.3, 1222921.7, 1304502.6, 1311209.9, 1314888.7, 1314908.3
New Zealand: 626777, 626783, 615491, 614897, 614560
Australia: 2005255348, Singapore: 128254
Citation:
John G. Cleary, Ross Braithwaite, Kurt Gaastra, Brian S. Hilbush, Stuart Inglis, Sean A. Irvine, Alan Jackson, Richard Littin, Sahar Nohzadeh-Malakshah, Mehul Rathod, David Ware, Len Trigg, and Francisco M. De La Vega. "Joint Variant and De Novo Mutation Identification on Pedigrees from High-Throughput Sequencing Data." Journal of Computational Biology. June 2014, 21(6): 405-419. doi:10.1089/cmb.2014.0029.
(c) Real Time Genomics, 2016

I then got errors from both of the rtg vcfeval commands.

rtg vcfeval -t sdf.UCSC/ -c NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz -b All_20170710.vcf.gz --squash-ploidy --sample ALT,INTEGRATION -o out1 --region chr22

returns 
Evaluation too complex (50001 unresolved paths, 215641 iterations) at reference region chr22:21270901-21270977. Variants in this region will not be included in results.
Evaluation too complex (1888 unresolved paths, 10000001 iterations) at reference region chr22:23223331-23223474. Variants in this region will not be included in results.
Evaluation too complex (50001 unresolved paths, 120616 iterations) at reference region chr22:23999911-24000069. Variants in this region will not be included in results.
Evaluation too complex (50001 unresolved paths, 154038 iterations) at reference region chr22:28194880-28194950. Variants in this region will not be included in results.
Evaluation too complex (50001 unresolved paths, 156595 iterations) at reference region chr22:29240382-29240510. Variants in this region will not be included in results.
Evaluation too complex (50001 unresolved paths, 122514 iterations) at reference region chr22:36029642-36029782. Variants in this region will not be included in results.
Evaluation too complex (50001 unresolved paths, 128696 iterations) at reference region chr22:46325426-46325582. Variants in this region will not be included in results.
Evaluation too complex (50002 unresolved paths, 102663 iterations) at reference region chr22:51028944-51029045. Variants in this region will not be included in results.
Error: VCF record does not contain GT field, record: chr22 16050036 rs374742143 A C . . RS=374742143;RSPOS=16050036;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000000005000002000100;WGT=1;VC=SNV;ASP

and  
rtg vcfeval -t sdf.UCSC/ -b NA12878_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-Solid-10X_CHROM1-X_v3.3_highconf.vcf.gz -c All_20170710.vcf.gz --squash-ploidy --sample INTEGRATION,ALT -o out2 --region chr22
returns
During ALT comparison no ROC data will be produced, as a sample is required by the selected ROC score field: GQ (FORMAT)
Error: Sample "ALT" not found in calls VCF.

The vcfeval.log files are attached. Thanks for taking a look!

Best,
Katie
out1.vcfeval.log
out2.vcfeval.log

Sean Irvine

unread,
Jul 25, 2017, 6:11:22 PM7/25/17
to Katie, RTG Users
Hi Katie,

The version of RTG Tools you are using appears to have diverged (deliberately or accidentally) from our official releases.  In particular, we cannot find a revision 97da0df anywhere in our repositories and hence I've been unable to reproduce the problem or match the stack trace appearing in the log from your run.  Possibly the build was made not in a clean state.

We would recommend getting our latest 3.8.2 prebuilt release from:

https://www.realtimegenomics.com/products/rtg-core-non-commercial

or cloning a new copy from:

https://github.com/RealTimeGenomics/rtg-tools

In addition to (hopefully) fixing your problem, you get some new features -- such as not needing to specify the -f on the index commands.

With regard to the "Evaluation too complex" warnings.  This occurs when there are so many variants in a small region that there is a combinatorial explosion of possibilities to consider.  This doesn't happen much with ordinary use cases, but you will see it when using a dense database like dbSNP.  One of the biggest causes of this kind of problem is long indels.  If you are not interested in long indels, you could try lowering the maximum length of alleles considered by vcfeval by using the --Xmax-length option.

Sean.


To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+unsubscribe@realtimegenomics.com.

Katie

unread,
Jul 26, 2017, 5:40:19 PM7/26/17
to RTG Users, kawilkin...@gmail.com
Thanks Sean! I'll try downloading the latest release and see if that does it.

Best,
Katie
Reply all
Reply to author
Forward
0 new messages