IGV not displaying all variants in VCF file

Alex Holman

Dec 17, 2014, 2:44:32 PM
to igv-...@googlegroups.com
I'm trying to figure out why IGV doesn't show all the variants that I can see in my VCF file. 
I'm on version 2.3.40. 

I've created a VCF file using pindel and pindel2vcf (vcf attached). Looking in my VCF file I can see that there are 25 variants, and 3 samples that have non-reference genotypes. 
Name : GT/AD
CYP26B1-B12 :  0/1:1408,1624
CYP26B1-F07 : 1/1:329,6511
CYP26B1-F08 : 0/1:947,2367

However, when I open this VCF file in IGV I see only one sample-level variant highlighted in color (the B12 het); the other two are not shown. Additionally, when I hover over the B12 variant, the type is listed as HOM_REF. Is there some setting that I'm missing, or some filtering going on that I'm unaware of?
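
For anyone wanting to repeat the per-sample check described above, here is a minimal sketch that counts non-reference genotype calls per sample straight from the VCF text. The function name count_nonref is made up for illustration, and the sketch assumes an uncompressed VCF in which GT is the first FORMAT subfield.

```shell
# Count non-reference genotype calls per sample in a VCF read from stdin.
# Assumption: GT is the first colon-separated subfield of each sample column.
# Usage: count_nonref < CYP26B1.vcf
count_nonref() {
  awk -F'\t' '
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; cols = NF; next }
    /^#/      { next }
    {
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                 # f[1] is the GT string, e.g. 0/1
        k = split(f[1], a, "[/|]")        # alleles, phased or unphased
        for (j = 1; j <= k; j++)
          if (a[j] != "0" && a[j] != ".") { count[i]++; break }
      }
    }
    END { for (i = 10; i <= cols; i++) printf "%s\t%d\n", name[i], count[i] + 0 }
  '
}
```

The output is one line per sample with the number of records in which that sample carries at least one non-reference allele, which can be compared against what IGV colors.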

Thanks,
Alex


CYP26B1.vcf

Jim Robinson

Dec 17, 2014, 3:25:44 PM
to igv-...@googlegroups.com
Hi,

Thanks for the report, and the example file.  There is clearly something wrong here, I'm investigating now.

Jim

--

---
You received this message because you are subscribed to the Google Groups "igv-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to igv-help+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/igv-help/6f18e8ac-15fa-43b6-9c2f-4e5eb2f10da9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jim Robinson

Dec 17, 2014, 6:52:49 PM
to igv-...@googlegroups.com
Hi,

It looks like you have multiple overlapping genotypes for the same sample. They are obscuring each other, and nearly all of them are hom ref. This is not something IGV was designed to handle; it's the first example I've seen, actually. The root cause, I guess, is that you have multiple variants for the same site, as opposed to a single variant with multiple alternate alleles. For example, the B12 sample has this genotype (complete list below). This is the one found when you click:

[CYP26B1-B12 CACGTTGATGGCCTCGGGGTGGCTGCT*/CACGTTGATGGCCTCGGGGTGGCTGCT* AD 1408,0]

Here are the other genotype records in that vicinity for the B12 sample.  The het call (the only one that is not reference) is obscured by one or more hom ref genotypes.

I will do some queries with VCF experts at work (they write the htsjdk & GATK) and see what they have to say about this file.   How was it produced (they are likely to ask that)?


72362433-72362461   INDEL    HOM_REF   TACACGTTGATGGCCTCGGGGTGGCTGCT/TACACGTTGATGGCCTCGGGGTGGCTGCT 
72362489-72362490   INDEL    HOM_REF   CA/CA 
72362495-72362496   INDEL    HOM_REF   GA/GA 
72362433-72362435   INDEL    HOM_REF   TAC/TAC 
72362440-72362441   INDEL    HOM_REF   TG/TG 
72362448-72362449   INDEL    HOM_REF   TC/TC 
72362435-72362461   INDEL    HOM_REF   CACGTTGATGGCCTCGGGGTGGCTGCT/CACGTTGATGGCCTCGGGGTGGCTGCT 
72362435-72362441   INDEL    HOM_REF   CACGTTG/CACGTTG 
72362449-72362451   INDEL    HOM_REF   CGG/CGG 
72362435-72362440   INDEL    HOM_REF   CACGTT/CACGTT 
72362445-72362446   INDEL    HOM_REF   GC/GC 
72362452-72362454   INDEL    HOM_REF   GGT/GGT 
72362435-72362439   INDEL    HET   CACGT/CGTTGATGGCCTCGGGGTGGCTG 
72362444-72362454   INDEL    HOM_REF   GGCCTCGGGGT/GGCCTCGGGGT 
72362436-72362462   INDEL    HOM_REF   ACGTTGATGGCCTCGGGGTGGCTGCTC/ACGTTGATGGCCTCGGGGTGGCTGCTC 
72362436-72362461   INDEL    HOM_REF   ACGTTGATGGCCTCGGGGTGGCTGCT/ACGTTGATGGCCTCGGGGTGGCTGCT 
72362436-72362460   INDEL    HOM_REF   ACGTTGATGGCCTCGGGGTGGCTGC/ACGTTGATGGCCTCGGGGTGGCTGC 
72362437-72362440   INDEL    HOM_REF   CGTT/CGTT 
72362449-72362450   INDEL    HOM_REF   CG/CG 
72362438-72362461   INDEL    HOM_REF   GTTGATGGCCTCGGGGTGGCTGCT/GTTGATGGCCTCGGGGTGGCTGCT 
72362438-72362439   INDEL    HOM_REF   GT/GT 
72362453-72362454   INDEL    HOM_REF   GT/GT 
72362442-72362454   INDEL    HOM_REF   ATGGCCTCGGGGT/ATGGCCTCGGGGT 
72362442-72362443   INDEL    HOM_REF   AT/AT 
72362443-72362444   INDEL    HOM_REF   TG/TG 
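
A listing like the one above can be approximated from the VCF text itself. The sketch below is not how IGV builds it; sample_genotypes is a hypothetical helper, and it assumes that a record spans POS through POS + length(REF) - 1, which may be offset by one from the coordinates IGV reports.

```shell
# Print span, zygosity and GT for one sample, one line per VCF record.
# Usage: sample_genotypes CYP26B1-B12 < CYP26B1.vcf
# Assumptions: uncompressed VCF, GT is the first FORMAT subfield.
sample_genotypes() {
  awk -F'\t' -v sample="$1" '
    /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i == sample) col = i; next }
    /^#/ || !col { next }
    {
      split($col, f, ":")
      k = split(f[1], a, "[/|]")
      first = ""; type = "NO_CALL"
      for (j = 1; j <= k; j++) {
        if (a[j] == ".") continue           # skip missing alleles
        if (first == "") { first = a[j]; type = (first == "0") ? "HOM_REF" : "HOM_VAR" }
        else if (a[j] != first) type = "HET"
      }
      printf "%d-%d\t%s\t%s\n", $2, $2 + length($4) - 1, type, f[1]
    }
  '
}
```

Sorting the output by the first column makes it easy to see which hom ref records sit on top of a het call.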

Alex Holman

Dec 18, 2014, 11:58:38 AM
to igv-...@googlegroups.com
Thanks for looking into this.
These were generated using pindel and pindel2vcf with the GATK-compatible flag.

Jim Robinson

Dec 18, 2014, 12:10:23 PM
to igv-...@googlegroups.com
Still no answer, but offhand I can't see how you could possibly interpret this file; it seems fishy.   A single sample can't have multiple genotypes. By that I mean it can't be both reference and variant at the same time, but in this file it is.  Is there an aggregation step missing?

Jim



Alex Holman

Dec 18, 2014, 1:00:13 PM
to igv-...@googlegroups.com
Biologically, yes. But computationally, I can see how it got there. I'm analyzing multiple samples. Say sample A has an indel-1 detected across positions 1-7, and sample B has an indel-2 detected across positions 5-10. When the genotyper runs across these, B is not going to contain indel-1, so its VCF entry will be HOM_REF at positions 1-7. But B will contain indel-2 and will be mutant across bases 5-10, meaning bases 5, 6, and 7 are classified as both ref and mutant by two different lines in the VCF.
That sort of thing seems like it should be handled in the genotyper, but it's going to get complex really quickly.
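
The two-sample scenario above can be checked mechanically. The following is a rough sketch, not part of pindel, GATK or IGV: the hypothetical function ref_vs_variant_conflicts reports every sample that is hom ref in one record and non-reference in another record whose REF span overlaps it. It assumes an uncompressed VCF, GT as the first FORMAT subfield, and a span of POS through POS + length(REF) - 1.

```shell
# Report samples called hom ref in one record and non-reference in an
# overlapping record. Reads a VCF on stdin and keeps all records in memory,
# so it is only meant for small files.
ref_vs_variant_conflicts() {
  awk -F'\t' '
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; cols = NF; next }
    /^#/      { next }
    {
      n++; chrom[n] = $1; start[n] = $2; last[n] = $2 + length($4) - 1
      for (i = 10; i <= NF; i++) {
        split($i, f, ":"); k = split(f[1], a, "[/|]")
        state = ""                       # "" = no call, REF = hom ref, VAR = any alt allele
        for (j = 1; j <= k; j++) {
          if (a[j] != "0" && a[j] != ".") state = "VAR"
          else if (a[j] == "0" && state == "") state = "REF"
        }
        call[n, i] = state
      }
    }
    END {
      for (x = 1; x <= n; x++) for (y = 1; y <= n; y++) {
        if (x == y || chrom[x] != chrom[y] || start[x] > last[y] || start[y] > last[x]) continue
        for (i = 10; i <= cols; i++)
          if (call[x, i] == "REF" && call[y, i] == "VAR")
            printf "%s\tref at %d-%d\tvariant at %d-%d\n", name[i], start[x], last[x], start[y], last[y]
      }
    }
  '
}
```

Fed a toy VCF with one record spanning 1-7 (A is 0/1, B is 0/0) and one spanning 5-10 (A is 0/0, B is 0/1), it flags both samples, since each is reference in one record and variant in the overlapping one.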

Alex Holman

Dec 18, 2014, 1:02:57 PM
to igv-...@googlegroups.com
Really, I just want a good way to visualize what samples contain what indels, and I don't really care about ref sequence. Can you think of a tool that would do that better than IGV?

Jim Robinson

Dec 18, 2014, 1:26:04 PM
to igv-...@googlegroups.com
I think you will probably have difficulty with this VCF file in any tool, but you can ask on the VCF mailing list.   You might also try "Savant".    If it's determined that this type of file is valid, I will add support for multiple genotypes per sample, but I hesitate to do that based on one example.   Still waiting on a response from the GATK team, but a lot of people are out right now.

Jim

Jim Robinson

Dec 18, 2014, 1:47:24 PM
to igv-...@googlegroups.com
Yes, it gets complex, but I think (am not sure) that in your example one would normally find a single variant with 2 alternate alleles.   But yes, this would get complicated very quickly.   I pinged the GATK people again and will give some thought as to how IGV should handle it, but it's not going to be a trivial fix.  If you run across another tool that handles this better, let me know.

Jim

Jim Robinson

Dec 19, 2014, 11:19:01 AM
to igv-...@googlegroups.com
Hi,

I talked to some in-house VCF experts about how we should handle this type of file and we have an approach, but it won't be available in IGV until early next year.   It's surprising this issue hasn't arisen earlier, but most of the VCFs I see are for SNPs and small indels, and overlapping variants are aggregated.    I appreciate your patience; sorry I don't have an immediate solution.

Jim


Alex Holman

Dec 19, 2014, 12:55:29 PM
to igv-...@googlegroups.com
Thanks for chasing this down. 
If it helps keep this fix a priority, I have seen this behavior before in a more standard multi-sample GATK SNP and indel analysis. There were variants that were clearly present in the VCF and in the hover-over that weren't being colored in the display window. At the time I thought I had simply missed a filtering parameter that IGV was using, but in hindsight, I think that may have been the same issue.

My workaround for right now is to use the GATK SelectVariants tool (with --excludeNonVariants set) to split out each sample to its own VCF. I then load those in parallel to IGV to achieve nearly the output I'm looking for.
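VCF_FILE=$1
# Sample names are columns 10+ of the #CHROM header line; emit one per line and
# run SelectVariants once per sample, up to 10 jobs in parallel.
# Each job runs under "sh -c" so that every sample gets its own log file
# (assumes sample names contain no spaces or shell metacharacters).
grep "^#CHROM" "$VCF_FILE" | cut -f10- | tr '\t' '\n' | xargs -P10 -I{} sh -c \
"java -Xmx2g -jar $APPS_PATH/gatk_current/GenomeAnalysisTK.jar \
   -R $REFERENCE \
   -T SelectVariants \
   --variant $VCF_FILE \
   -o {}.vcf \
   -sn {} \
   --excludeNonVariants \
   > run_splitVCF.{}.log 2>&1"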

It's ugly, but it works until a fix comes along.

Thanks,
Alex

Jim Robinson

Dec 19, 2014, 1:14:31 PM
to igv-...@googlegroups.com
Thanks, it's not low priority, just complicated.  You could see this any time you have 2 variant records at the same site.   The people here characterize that as "poor variant calling", but it surely could happen, and is perfectly legal VCF as I now understand it.

Here at the Broad IGV is not often used for viewing the sample genotypes, in fact we considered dropping that altogether from IGV.   Obviously you are using it for that, and I suspect other people do as well,  so we'll address it.   That bit of IGV (VCF viewing) was created at the very beginning of the 1KG project, and the use case is each site has "a" variant and supporting genotypes. 

Thanks for posting the workaround.

Best,

Jim


Frogee

Dec 19, 2014, 2:40:04 PM
to igv-...@googlegroups.com
> Here at the Broad IGV is not often used for viewing the sample genotypes, in fact we considered dropping that altogether from IGV.   Obviously you are using it for that, and I suspect other people do as well,  so we'll address it.

Not to hijack the thread, but we find the viewing genotypes feature to be very useful. If the VCF viewing code is going to be modified soon, is there any possibility that a feature could be added whereby individual samples can be chosen from the track after loading a .vcf file? We have typical use cases where a .vcf contains genotypes for 50+ samples and we want to look at genotypes for just 2 or 3 of them; we haven't found a way to do this short of creating a new .vcf file with the desired subset of samples. Something like selecting a sample name and choosing "Hide sample" would be very helpful.

Lastly, thanks for all of your work on IGV.

Sincerely,
Ryan McCormick

Jim Robinson

Dec 19, 2014, 3:16:46 PM
to igv-...@googlegroups.com
Hi Ryan,

Yes, we can add that, thanks for the suggestion.  You could accomplish it now in a roundabout way by loading a "sample information" file that has a column with values like "hide/show".   Then use "Tracks > Filter Tracks..."  and filter by that column.   Another use of sample information is to group samples, for example by population or phenotype.   I realize this is not as dynamic as your suggestion, but it is probably easier than creating a new VCF.

The sample information file is just a tab-delimited file with an arbitrary number of columns.  The first column is the sample name.   This is the only file type for which the extension doesn't matter; basically, if it's not any other known type, IGV tries to load it as a sample information file.
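
As a rough sketch of how such a file could be generated rather than typed by hand, the hypothetical helper below reads the sample names from a VCF header and writes a two-column, tab-delimited table. The column headings "Sample" and "Display" and the values show/hide are illustrative assumptions, not names IGV requires.

```shell
# Write a tab-delimited sample information table for a VCF to stdout.
# Usage: make_sample_info file.vcf SAMPLE_TO_SHOW [SAMPLE_TO_SHOW ...] > samples.txt
# Samples named on the command line get "show"; all others get "hide".
make_sample_info() {
  vcf=$1; shift
  printf 'Sample\tDisplay\n'
  # Sample names are columns 10+ of the #CHROM header line.
  grep '^#CHROM' "$vcf" | cut -f10- | tr '\t' '\n' | while IFS= read -r s; do
    flag=hide
    for keep in "$@"; do
      if [ "$s" = "$keep" ]; then flag=show; fi
    done
    printf '%s\t%s\n' "$s" "$flag"
  done
}
```

After loading the VCF and then the generated file, filtering on the Display column via "Tracks > Filter Tracks..." should leave only the samples marked "show".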

The VCF overhaul will be extensive; it hasn't really been touched since the start of the 1KG project, and it's time.

Jim

Frogee

Dec 19, 2014, 4:02:51 PM
to igv-...@googlegroups.com
Jim,

Thanks for your suggestion. That accomplishes what I was looking for and more; I retract my previous feature request.

Thanks,
Ryan