Multi-allelic sites with both an insertion and a deletion

Harkness Kuck

Aug 6, 2015, 2:26:50 PM

to CAVA User Group

I'm running CAVA on a VCF with some multi-allelic rows with both an insertion and a deletion. The results appear to only include the deletion. I've tried outputting as both VCF and CSV and see the same issue. Has anyone else had this problem?

Input line from the VCF

11 34874640 rs11313431 AT A,ATT 881.31 . AC=5,3;AF=0.625,0.375;AN=8;DB;DP=86;FS=0.000;MLEAC=5,3;MLEAF=0.625,0.375;MQ=60.25;MQ0=0;QD=10.25;SOR=1.244 GT:AD:DP:GQ:PL 1/2:0,7,2:9:31:166,31,81,158,0,204 1/2:0,10,4:14:86:281,86,104,230,0,285 1/1:0,10,0:10:30:244,30,0,270,30,343 1/2:0,7,4:11:86:225,86,94,167,0,207

CAVA VCF Output

11 34874640 rs11313431 AT A 881.31 . AC=5,3;AF=0.625,0.375;AN=8;DB;DP=86;FS=0.000;MLEAC=5,3;MLEAF=0.625,0.375;MQ=60.25;MQ0=0;QD=10.25;SOR=1.244;TYPE=Deletion;ENST=ENST00000532428;GENE=APIP;TRINFO=-/37.6kb/8/4.6kb;LOC=3UTR;CSN=c.*29623delA;CLASS=3PU;SO=3_prime_UTR_variant;IMPACT=3;ALTANN=c.*29607delA;ALTCLASS=.;ALTSO=.;DBSNP=. GT:AD:DP:GQ:PL 1/2:0,7,2:9:31:166,31,81,158,0,204 1/2:0,10,4:14:86:281,86,104,230,0,285 1/1:0,10,0:10:30:244,30,0,270,30,343 1/2:0,7,4:11:86:225,86,94,167,0,207
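One way to find every affected record is to compare the number of ALT alleles with the number of values in the per-allele INFO fields. The following is a minimal sketch, not part of CAVA: the function name `alt_info_mismatches` is hypothetical, and the list of per-allele keys (AC, AF, MLEAC, MLEAF) is an assumption based on the records shown above.

```python
def alt_info_mismatches(vcf_line, keys=("AC", "AF", "MLEAC", "MLEAF")):
    """Return the INFO keys whose value count differs from the number of ALT alleles.

    Hypothetical helper for illustration; `keys` is an assumed list of
    per-ALT-allele INFO fields.
    """
    # VCF columns are tab-separated; split() also copes with records pasted
    # with spaces, assuming no field contains whitespace.
    fields = vcf_line.split()
    n_alt = len(fields[4].split(","))
    info = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return [k for k in keys if k in info and len(info[k].split(",")) != n_alt]


# Shortened versions of the first example (INFO truncated for readability)
input_record = "11 34874640 rs11313431 AT A,ATT 881.31 . AC=5,3;AF=0.625,0.375;AN=8;DB"
output_record = "11 34874640 rs11313431 AT A 881.31 . AC=5,3;AF=0.625,0.375;AN=8;DB"
print(alt_info_mismatches(input_record))
print(alt_info_mismatches(output_record))
```

An empty list means the counts agree; any key returned marks a record where an ALT allele was dropped but its per-allele values were kept.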

Also, in the VCF output I've noticed that CAVA prepends its own header lines to the existing VCF header lines, so the ##fileformat line is no longer the first line in the file, as a valid VCF requires.
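Until this is fixed, the header order can be repaired after the fact. This is a rough sketch of such a workaround, assuming the whole header fits in memory; `fix_header_order` is a hypothetical name, and the extra header line in the usage example is a placeholder rather than a real CAVA header line.

```python
def fix_header_order(lines):
    """Move the ##fileformat line to the top, keeping every other line in order."""
    lines = list(lines)
    for i, line in enumerate(lines):
        if not line.startswith("##"):
            break  # reached the #CHROM line or a data line; stop searching
        if line.startswith("##fileformat="):
            return [line] + lines[:i] + lines[i + 1:]
    return lines  # no ##fileformat line in the meta-information block


header = [
    "##placeholder_tool_header=example\n",  # placeholder, not a real CAVA line
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
print("".join(fix_header_order(header)), end="")
```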

Márton Münz

Aug 6, 2015, 5:15:03 PM

to CAVA User Group

Hello,

I was unable to reproduce the issue: the variant remains multi-allelic in the output I get. Could you please send me the configuration file you use so that I can try to reproduce the problem?

As for the second issue about the header, thanks very much for spotting this, we will fix it soon for the next release!

Best,

Marton

Harkness Kuck

Aug 10, 2015, 12:35:46 PM

to CAVA User Group

I've attached the config file.

The dbSNP database was downloaded (wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/00-All.vcf.gz) and then formatted with dbprep.

python /cava/cava-v1.1.1/dbprep.py -s 144 -d 00-All.vcf.gz -o dbsnp144.gz

The Ensembl database was downloaded and formatted with dbprep.

python /cava/cava-v1.1.1/dbprep.py -e 75 -o ensembl75

I've been digging into this a bit more and I'm not seeing it all of the time. Some insertion/deletion multi-allelic calls in the same VCF are handled correctly, but multiple lines have this problem. Here is another example:

Input line from the VCF

12 54402808 . GCGCCGCCGCCGCCGCCGCCGCCCGCTCGCCGCC G,GCGCCGCCGCCGCCGCCGCCCGCTCGCCGCC 2983.86 . AC=2,3;AF=0.250,0.375;AN=8;ClippingRankSum=0.057;DP=159;FS=3.701;MLEAC=2,3;MLEAF=0.250,0.375;MQ=59.96;MQ0=0;QD=18.77;SOR=1.217 GT:AD:DP:GQ:PL 0/1:11,18,0:29:99:710,0,1005,743,511,1221 1/2:0,13,19:32:99:1303,766,920,540,0,483 0/2:8,0,10:18:99:381,424,1383,0,388,318 0/2:13,0,16:29:99:633,685,2044,0,619,519

CAVA VCF Output

12 54402808 . GCGCCGCCGCCGCCGCCGCCGCCCGCTCGCCGCC G 2983.86 . AC=2,3;AF=0.250,0.375;AN=8;ClippingRankSum=0.057;DP=159;FS=3.701;MLEAC=2,3;MLEAF=0.250,0.375;MQ=59.96;MQ0=0;QD=18.77;SOR=1.217;TYPE=Deletion;ENST=ENST00000040584;GENE=HOXC8;TRINFO=+/4.7kb/2/3.4kb;LOC=OUT;CSN=.;CLASS=.;SO=.;IMPACT=.;ALTANN=.;ALTCLASS=.;ALTSO=.;DBSNP=. GT:AD:DP:GQ:PL 0/1:11,18,0:29:99:710,0,1005,743,511,1221 1/2:0,13,19:32:99:1303,766,920,540,0,483 0/2:8,0,10:18:99:381,424,1383,0,388,318 0/2:13,0,16:29:99:633,685,2044,0,619,519

cavaconfigVCF.txt

Márton Münz

Aug 10, 2015, 5:15:08 PM

to CAVA User Group

Hi,

It's strange... Using the configuration file you sent and generating the same Ensembl and dbSNP database files you described, running CAVA v1.1.1 on the same two example VCF records you provided, the output I get still contains both ALT alleles in both VCF records.

Is it possible that the @nonannot option flag was set to FALSE when you annotated your VCF? In the config file you sent this option is set to TRUE, but when I set it to FALSE, I get the same output as you (i.e. when non-annotated variants are not written to the output file).

Another reason I suspect this is that, in both example variants you sent, one allele overlaps either the start of the transcript (your second example) or the end of the transcript (your first example), while the other allele does not overlap the transcript boundary.

For example, in your second example, the first ALT allele corresponds to a large deletion that overlaps the start of the transcript (and is therefore reported by CAVA as an 'OUT' variant). The second ALT allele, however, corresponds to a three-base (GCC) deletion that lies outside the transcript, and is therefore not written to the output if @nonannot is set to FALSE.
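The sizes of the two deletions in the second example can be checked by stripping the bases that REF and each ALT allele share. This is only an illustrative sketch: `trim_alleles` is a hypothetical helper, and trimming the shared prefix before the shared suffix is an assumed convention that places a deletion at the right-hand end of a repeat. CAVA's own alignment of indels may differ.

```python
def trim_alleles(pos, ref, alt):
    """Strip bases shared by REF and ALT; return (pos, ref, alt) for what differs."""
    # Shared leading bases first: each one trimmed moves the position right
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    # Then shared trailing bases, which do not change the position
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return pos, ref, alt


REF = "GCGCCGCCGCCGCCGCCGCCGCCCGCTCGCCGCC"
for alt in ("G", "GCGCCGCCGCCGCCGCCGCCCGCTCGCCGCC"):
    pos, deleted, inserted = trim_alleles(54402808, REF, alt)
    print(pos, len(deleted), len(inserted))
# 54402809 33 0
# 54402828 3 0
```

Under these assumptions the first allele removes 33 bases starting at 54402809, and the second removes only GCC at 54402828, so the two alleles cover very different spans even though they share one VCF record.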

Please let me know if you indeed used the @nonannot=FALSE setting. If this is not the case, I will keep looking for a reason why we don't get the same output with the very same inputs and settings. In any case, this must be related to cases when one allele overlaps the transcript boundary and the other does not.

Best,

Marton

Harkness Kuck

Aug 11, 2015, 11:13:15 AM

to CAVA User Group

That makes sense. I was doing some runs with a transcript database. I thought this issue was in all of the results, but it is only in the runs where I'd included @transcriptlist.

I found these issues because I was trying to run the resulting VCF through a tool that kept having problems because the resulting VCF was not in a valid VCF format. It would be great in the next version, when only one allele remains in the ALT field, if the allele count and allele frequency fields only showed the count and frequency information for the remaining allele. Similarly the sample fields would also need to be modified.
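As a stopgap, the INFO part of this can be done in post-processing. Here is a minimal sketch, assuming AC, AF, MLEAC and MLEAF are the only per-ALT-allele INFO fields in the file; `keep_alt_allele` is a hypothetical name, not a CAVA function. It leaves AN and the sample columns (GT, AD, PL) untouched, since remapping genotypes that carry the dropped allele needs a policy decision this sketch does not make.

```python
def keep_alt_allele(info, alt_index, per_alt_keys=("AC", "AF", "MLEAC", "MLEAF")):
    """Return an INFO string whose per-ALT fields keep only the value at alt_index.

    alt_index is 0-based over the original ALT alleles; per_alt_keys is an
    assumed list of fields that carry one value per ALT allele.
    """
    out = []
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep and key in per_alt_keys:
            value = value.split(",")[alt_index]
        out.append(key + sep + value)  # flags such as DB pass through unchanged
    return ";".join(out)


# First example, keeping only the deletion (the first ALT allele)
print(keep_alt_allele("AC=5,3;AF=0.625,0.375;AN=8;DB;DP=86", 0))
# AC=5;AF=0.625;AN=8;DB;DP=86
```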

Thanks for your help!
