Multiple copies of same variants in Kaviar VCF

Florentine Scharf

Apr 10, 2017, 11:58:40 AM
to Kaviar-discuss
Hello,
 
First of all, thank you for this extensive resource! I would really like to use it as part of our annotation pipeline, so I have been experimenting with Kaviar for the last few days and have run into some issues.
 
I tried to annotate the variant chr1:976514-976514 C->A (hg19) using Kaviar-160204-Public-hg19.vcf as the source.
 
For this variant I found four entries with different values:
1             976506 .              AGCGGGGGC      A,AGCGGGGGA  .              .               AF=0.0192792,0.0000772;AC=2998,12;AN=155504;DS=63000exomes|GS000010323|GS000011738|Inova_CGI_founders-Nge3,GS000010438|GS000014566|GS000015155|GS000015189|GS000015893|GS000016015|GS000016444|ISB_founders-Nge3
1             976513 .              GC          GA          .              .               AF=0.0000386;AC=6;AN=155504;DS=GS000015174|ISB_founders-Nge3
1             976514 .              CG          AG          .              .               AF=0.0000193;AC=3;AN=155504;DS=ISB_founders-Nge3
1             976514 rs79290478        C            A            .              .               AF=0.0190027;AC=2955;AN=155504;DS=DNK02|GMIAK1|GMIAK2|GS000009926|GS000009927|GS000009928|GS000010322|GS000010424|GS000010429|GS000010434|GS000010435|GS000011739|GS000011740|GS000011807|GS000011809|GS000011814|GS000011815|GS000011816|GS000011817|GS000012712|GS000012713|GS000014544|GS000014557|GS000014558|GS000014559|GS000014560|GS000014561|GS000014562|GS000014563|GS000014564|GS000014565|GS000014569|GS000014570|GS000015152|GS000015153|GS000015154|GS000015156|GS000015172|GS000015173|GS000015175|GS000015176|GS000015177|GS000015178|GS000015179|GS000015180|GS000015181|GS000015183|GS000015184|GS000015185|GS000015186|GS000015187|GS000015188|GS000015191|GS000015221|GS000015223|GS000015225|GS000015227|GS000015228|GS000015229|GS000015230|GS000015231|GS000015232|GS000015233|GS000015272|GS000015708|GS000015709|GS000015710|GS000015711|GS000015885|GS000015886|GS000015887|GS000015888|GS000015889|GS000015890|GS000015891|GS000015892|GS000015894|GS000015895|GS000016014|GS000016016|GS000016027|GS000016028|GS000016029|GS000016031|GS000016032|GS000016333|GS000016335|GS000016336|GS000016338|GS000016339|GS000016370|GS000016371|GS000016373|GS000016374|GS000016375|GS000016441|GS000016443|GS000016445|GS000016446|GS000016448|GS000016449|GS000016450|GS000020414|GS000020417|HGDP00521|HGDP00778|HGDP00927|HGDP01307|ISB_founders-Nge3|Inova_CGI_founders-Nge3|Inova_Illumina_founders-Nge3|Malay|SS6004477|SSIP|Saqqaq|Wellderly|gonl
 
Your "Known issues" note mentions this and suggests, as a workaround, to "sum the allele frequencies to obtain the correct allele frequency for the variant". I am a bit hesitant to do that without confirmation. The listed sources partly overlap between the entries and partly do not, so I wonder whether the result would be correct if, in this case, I summed 12+6+3+2955 to get the final AC. On the other hand, AN is the same for all entries and does not seem to depend on the listed sources, so I am unsure how exactly these values were generated.
 
Would it be safer to wait for the next release?
 
Thank you very much in advance for your help!
Kind regards,
Florentine

Gustavo Glusman

Apr 10, 2017, 12:08:45 PM
to Florentine Scharf, Kaviar-discuss
The source repeatedly mentioned (ISB_founders-Nge3) is a collection of individuals; each individual genome would have just one version of the representation of the variant. Therefore it should be safe to sum the counts. AN refers to how many individuals have coverage over the locus, so it doesn't depend on the list of sources in which the variation is observed.
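
As a minimal sketch of that summation, assuming the four counts for chr1:976514 C->A quoted in the question (12, 6, 3 and 2955) and their shared AN of 155504, the combined frequency could be checked as follows. The variable names are illustrative and not part of Kaviar.

```python
# AC values of the four records that describe chr1:976514 C->A (hg19),
# copied from the records quoted in the question.
allele_counts = [12, 6, 3, 2955]
allele_number = 155504  # AN, identical in all four records

# Combined count and frequency for the variant.
combined_ac = sum(allele_counts)
combined_af = combined_ac / allele_number

# Because AN is shared, summing the per-record frequencies is equivalent
# (up to the rounding of the published AF values).
summed_af = sum(ac / allele_number for ac in allele_counts)

print(combined_ac, combined_af, summed_af)
```

Summing the AF values directly only matches this result when all records report the same AN.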

Best,
-- Gustavo


Florentine Scharf

Apr 12, 2017, 8:42:21 AM
to Kaviar-discuss, florenti...@googlemail.com

Hello Gustavo,


thank you very much for your quick reply.


I have been doing as suggested: for every entry in the VCF I determined the "minimal discrepancy of reference and alternative" and then summed the AC values of entries that describe the same variant.
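
A rough sketch of this procedure, under the assumption that "minimal discrepancy" means trimming the bases shared by REF and ALT (first from the end, then from the start) while keeping at least one base in each. The function name `trim` and the tuple layout are hypothetical, and the records are the chr1 entries quoted earlier in the thread.

```python
from collections import defaultdict


def trim(pos, ref, alt):
    """Trim shared trailing, then shared leading bases; keep >= 1 base each."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


# (chrom, pos, ref, [alts], [AC per alt]) for the four chr1 records
records = [
    ("1", 976506, "AGCGGGGGC", ["A", "AGCGGGGGA"], [2998, 12]),
    ("1", 976513, "GC", ["GA"], [6]),
    ("1", 976514, "CG", ["AG"], [3]),
    ("1", 976514, "C", ["A"], [2955]),
]

# Sum AC over all records that reduce to the same trimmed variant.
totals = defaultdict(int)
for chrom, pos, ref, alts, acs in records:
    for alt, ac in zip(alts, acs):
        totals[(chrom, *trim(pos, ref, alt))] += ac

for key, ac in sorted(totals.items()):
    print(key, ac)
```

Note that this only trims shared bases; it does not left-align indels in repetitive sequence, so it may not merge every equivalent representation.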

However, I have now found cases where AN is not consistent, e.g. for the variant chr2:160698700-160698700 A->C (again hg19).
Here I found more than 3 entries, including:
2    160698700   .    ATATA    CTATA   .   .   AF=0.0001896;AC=5;AN=26378;END=160698704;DS=GS000016444|ISB_founders-Nge3
2    160698700   .    ATATAT   CTATAT   .   .   AF=0.0000379;AC=1;AN=26378;END=160698705;DS=GS000010327
2    160698700   .    ATATATA  CTATATA   .   .   AF=0.0000322;AC=5;AN=155504;END=160698706;DS=GS000015891|ISB_founders-Nge3

In this case AN varies between the entries, whereas I had assumed that only AC should differ. I am sorry, but it is still not entirely clear to me how these values are generated from the various sources, whether this observation is something you would expect, and, if so, how to handle it.
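
To find such cases systematically, one option is to collect the AN values per trimmed variant and flag any variant with more than one. The sketch below assumes the same trimming rule as described above and uses the three chr2 records as input; `trim` and the tuple layout are hypothetical names. It only detects the inconsistency and does not decide which AN is correct.

```python
from collections import defaultdict


def trim(pos, ref, alt):
    """Trim shared trailing, then shared leading bases; keep >= 1 base each."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


# (chrom, pos, ref, alt, AC, AN) for the three chr2 records quoted above
records = [
    ("2", 160698700, "ATATA", "CTATA", 5, 26378),
    ("2", 160698700, "ATATAT", "CTATAT", 1, 26378),
    ("2", 160698700, "ATATATA", "CTATATA", 5, 155504),
]

# Collect the distinct AN values seen for each trimmed variant.
an_values = defaultdict(set)
for chrom, pos, ref, alt, ac, an in records:
    an_values[(chrom, *trim(pos, ref, alt))].add(an)

# Variants whose merged records disagree on AN.
inconsistent = {k: sorted(v) for k, v in an_values.items() if len(v) > 1}
print(inconsistent)
```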


Thanks again for your help!

Kind regards,
Florentine


