Viewing VCF file from Veritas

Dmitry Brant

May 8, 2016, 8:59:00 PM

to igv-help

Hi there,

I'm an absolute novice regarding this type of data, so forgive my ignorance.

I've just received my (full) genome data from Veritas, which consists of a .bam.gz file and a .vcf.gz file. It's my understanding that I should be able to open the .vcf.gz file in IGV, and visualize it against a reference genome. However, when I try opening the .vcf.gz file, IGV says "Loading..." indefinitely, and gives no other status update (the UI keeps responding, and it doesn't use the CPU).

I've also tried uncompressing the .vcf file from the .gz package, but then IGV says that the file needs to be indexed, and when I agree to index it, it shows an error saying "htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 77: there are 1 genotypes while the header requires that 2 genotypes be present for all records at chr10:60228". Do you know what I might be doing wrong?

Here's the .vcf.gz file:

https://my.pgp-hms.org/user_file/download/1889

Jim Robinson

May 9, 2016, 12:36:35 AM

to igv-...@googlegroups.com

Hi,

I don't know anything about Veritas, but it's not a good sign to have been delivered a plain gzipped VCF. VCF files are so large that to be useful they need to be indexed. The standard in the field is to compress them with "bgzip" and then index them with a program called "tabix". If this had been done you would have received a file with a .gz extension and an accompanying file with a .gz.tbi extension.
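One way to tell a plain gzip file from a bgzip one without any extra tools is to look at the first gzip header. This is a minimal sketch, assuming the usual BGZF layout in which each block is a gzip member carrying a "BC" extra subfield. The function name `is_bgzf` is made up for illustration and uses only the Python standard library.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # Fixed gzip header: magic bytes, method, flags, mtime, xfl, os, xlen.
        id1, id2, method, flags, _mtime, _xfl, _os, xlen = struct.unpack(
            "<BBBBIBBH", header
        )
        # Needs gzip magic, deflate compression, and the FEXTRA flag (bit 2).
        if (id1, id2, method) != (0x1F, 0x8B, 8) or not flags & 4:
            return False
        extra = handle.read(xlen)
    # Walk the extra subfields looking for the 2-byte "BC" subfield.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

If this returns False for a .vcf.gz file, the file was most likely compressed with ordinary gzip and would need to be recompressed with bgzip before tabix can index it.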

However, that wouldn't matter in this case, as the VCF is malformed, as indicated by the error message. Neither IGV nor any other tool that relies on the htsjdk library will be able to read it until the errors are fixed. From a quick look, it appears your header line specifies that 2 samples are present, one called "unknown" and one called "Sample1". Per the VCF spec, this requires that 2 sets of genotype data be present, 1 for each sample, for every row in the file. However, there appears to be only a single set of genotype data in each row.

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    unknown    Sample1

chr2    10072    .    A    C    35.48    .    AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;CIGAR=1X;DP=4;DPB=4;DPRA=0;EPP=3.0103;EPPR=7.35324;GTI=0;LEN=1;MEANALT=1;MQM=7;MQMR=15;NS=1;NUMALT=1;ODDS=5.91331;PAIRED=1;PAIREDR=0.5;PAO=0;PQA=0;PQR=0;PRO=0;QA=53;QR=22;RO=2;RPP=7.35324;RPPR=7.35324;RUN=1;SAF=1;SAP=3.0103;SAR=1;SF=0;SRF=2;SRP=7.35324;SRR=0;TYPE=snp    GT:AO:RO:GL:QR:DP:QA    1/1:2:2:-4.60903,0,-1.66403:22:4:53
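The mismatch described above can be checked mechanically. The following is a rough sketch, not what htsjdk does internally: it counts the sample names after the FORMAT column in the #CHROM line and compares that with the number of genotype columns in each record. The function name and the `limit` parameter are hypothetical, and the input is assumed to be tab-separated VCF text lines.

```python
def find_sample_count_mismatches(lines, limit=5):
    """Return (line_number, expected, found) for records whose genotype
    column count differs from the sample count declared in the #CHROM line."""
    expected = None
    mismatches = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed (CHROM..FORMAT); the rest are sample names.
            expected = max(len(fields) - 9, 0)
            continue
        if expected is None:
            continue  # no #CHROM line seen yet, nothing to compare against
        found = max(len(fields) - 9, 0)
        if found != expected:
            mismatches.append((number, expected, found))
            if len(mismatches) >= limit:
                break  # a few examples are enough to diagnose the problem
    return mismatches
```

For a header naming "unknown" and "Sample1" followed by records like the one above, each reported tuple would show 2 expected genotype columns and 1 found.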

It's quite odd that they would gzip a BAM file, as it's already compressed. Did they also supply an index file (the extension would end in .bai)?

--

---
You received this message because you are subscribed to the Google Groups "igv-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to igv-help+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/igv-help/2195cedb-23b8-4806-8cd5-8b4b7eaa455b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dmitry Brant

May 9, 2016, 11:30:11 AM

to igv-help

Very interesting indeed. No, they didn't provide any index files, just the bam and vcf files, both gzipped.

This doesn't seem to be specific to my data, either. I tried downloading a couple other Veritas vcf files from the PGP list, and they have similar issues when opening in IGV. Is it possible that Veritas is simply not following the VCF spec correctly? I couldn't find any questions from other users about this, so I had assumed that I wasn't doing something right.

I also tried uploading the vcf to Promethease (without success), and here's a reply I got from their support:

"There are no dbSNP rs#s in a veritas file. At the moment we don't support this."

James Robinson

May 9, 2016, 12:33:42 PM

to igv-...@googlegroups.com

I can’t speak generally about all Veritas files, but the one you sent me is out of spec, and I don’t know of any program that could read it. It looks like the error might be in the header: there is only 1 genotype (sample) in the file, but the header indicates there are 2. If you try removing the “unknown” column and its associated tab, it might work.

Specifically, remove “unknown” and the tab from this line:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown Sample1

Be careful to do this in a plain text editor that will preserve tabs and spaces as they are, and not, for example, in Excel or Word.
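For a file too large to edit comfortably by hand, the same header fix can be scripted. This is only a sketch under the assumption that the extra column is the sample named "unknown" and that the fields are tab-separated. The function name `drop_sample_from_header` is invented for this example.

```python
def drop_sample_from_header(lines, sample="unknown"):
    """Yield lines unchanged except the #CHROM line, which loses `sample`."""
    for line in lines:
        if line.startswith("#CHROM"):
            body = line.rstrip("\r\n")
            newline = line[len(body):]  # keep the original line ending
            fields = body.split("\t")
            # Only touch sample columns (index 9 onward), never the fixed ones.
            kept = fields[:9] + [name for name in fields[9:] if name != sample]
            line = "\t".join(kept) + newline
        yield line
```

The lines could come from `gzip.open(path, "rt")` and be written to a new uncompressed file. The result would still need to be compressed with bgzip and indexed with tabix afterwards.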

Jim

Dmitry Brant

May 9, 2016, 1:11:43 PM

to igv-help

Thanks for the help so far! I edited the file with vim and removed the "unknown" column, then indexed it with tabix as you suggested, and now IGV is able to open it.

Now, when I zoom in to see the features (once the features become visible), I get a slew of warnings in the log, all similar to the following:

----------

WARN [2016-05-09 12:38:22,465] [VCFWrapperCodec.java:75] NumberFormatException on line: chr21 15935840 . AAA GAG,GAA 323.94 PASS AB=0;ABP=0;AC=2,2;ADP=9;AF=1;AN=4;AO=10;CIGAR=1X1M1X;DP=10;DPB=10;DPRA=0;EPP=3.87889;EPPR=0;GTI=0;HET=0;HOM=1;LEN=3;MEANALT=1;MQM=60;MQMR=0;NC=0;NS=1;NUMALT=1;ODDS=18.4681;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=384;QR=0;RO=0;RPP=6.48466;RPPR=0;RUN=1;SAF=3;SAP=6.48466;SAR=7;SF=0,1;SRF=0;SRP=0;SRR=0;TYPE=complex;WT=0 GT:SDP:RDF:ADR:GQ:QA:RBQ:RO:AD:GL:QR:AO:ABQ:DP:RD:ADF:RDR:PVAL:FREQ 1/1:.:.:.:.:384,.:.:0:.:-10,-3.0103,0,.,.,.:0:10,.:.:10

Attempting to reformat by replacing ,., with ,0,

----------

Despite the warnings, IGV shows the data anyway (see screenshot). Does that look approximately right? (and are these warnings a common occurrence in VCF files, or is this an indication of further formatting issues specific to Veritas?)

Untitled.png

James Robinson

May 9, 2016, 2:43:47 PM

to igv-...@googlegroups.com

Hi, that many warnings is not a common occurrence and indicates further VCF format problems. For visualizing in IGV they are probably not important, but they could be important if you intend to do further analysis. It's pretty difficult to say from inspection where the number format exception is, but somewhere the VCF parser is expecting a number and getting something else.

I think there might be a VCFValidator out there somewhere, but it is likely to use the same code I am using and will just print out the same errors and warnings.
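To narrow down where a number is expected but something else is found, one heuristic is to list the genotype fields whose comma-separated values mix numbers with the missing-value marker ".". This is a guess at the cause prompted by the "replacing ,., with ,0," message in the log, not a description of what the IGV parser actually checks. The function name and `sample_index` parameter are hypothetical.

```python
def mixed_missing_fields(record_line, sample_index=0):
    """Return FORMAT keys whose sample value is a comma list mixing
    real entries with the missing-value marker '.'."""
    fields = record_line.rstrip("\r\n").split("\t")
    keys = fields[8].split(":")                    # FORMAT column
    values = fields[9 + sample_index].split(":")   # chosen sample column
    flagged = []
    # zip() stops at the shorter list, so a record with fewer values
    # than FORMAT keys is compared only over the values it has.
    for key, value in zip(keys, values):
        parts = value.split(",")
        if len(parts) > 1 and "." in parts and any(p != "." for p in parts):
            flagged.append(key)
    return flagged
```

Run on a tab-separated copy of the chr21 record from the warning, this would flag QA, GL and AO, which are the fields holding values such as "384,." and "10,.".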

Jim

Mike Cariaso

Jun 22, 2017, 4:52:03 PM

to igv-help

It's a year later, so I'm sure this is moot, but I want to comment that I just got a VCF file from Veritas, and this one has no headers at all. The top line of the file is

chrX 60215 . A C . PASS AC=2;ADP=10;AN=2;HET=0;HOM=1;NC=0;SF=1;WT=0;customer_score1=chrX;customer_score2=60215;customer_score1=23;customer_score2=60215 GT:RD:ADR:FREQ:GQ:DP:RBQ:AD:ADF:RDR:PVAL:RDF:ABQ:SDP 1/1:0:0:100%:52:10:0:10:10:0:5.4125E-6:0:40:10

and the rest of the chromosomes follow in a random order.

So there is no top line of

##fileformat=VCFv4.0

and no

#CHROM POS ID REF ALT QUAL FILTER INFO

Promethease can't hope to help with this. I've got a call out to Veritas to ask wtf, but seeing that they previously had a different crappy VCF header and now have this, it's shameful.
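A quick pre-flight check for the two missing lines described above might look like the following. It is a minimal sketch that only looks for a leading ##fileformat line and a #CHROM header line and says nothing about whether the rest of the file is valid. The function name `check_vcf_headers` is hypothetical.

```python
def check_vcf_headers(lines):
    """Return a list of human-readable problems with the VCF header lines."""
    problems = []
    # Ignore blank lines and strip line endings before inspecting.
    lines = [line.rstrip("\r\n") for line in lines if line.strip()]
    # The first line is expected to declare the VCF version.
    if not lines or not lines[0].startswith("##fileformat="):
        problems.append("missing ##fileformat line")
    # The column header line is expected somewhere before the records.
    if not any(line.startswith("#CHROM") for line in lines):
        problems.append("missing #CHROM header line")
    return problems
```

A file that begins directly with a data row, like the chrX line quoted above, would report both problems.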

James Robinson

Jun 22, 2017, 5:09:22 PM

to igv-help

Wow, that's pretty bad.

Dmitry Brant

Jun 22, 2017, 7:33:04 PM

to igv-...@googlegroups.com

I just ended up waiting until Veritas uploaded all of my BAM files to PGP, and then downloaded them from there. The BAM files seem to be correctly structured, and are very nicely viewable in IGV (and you can manipulate them with BamTools).

--

Dmitry Brant
http://diskdigger.org
