Identical overlapping variants don't match in vcfeval

326 views
Skip to first unread message

Peter

unread,
Jun 17, 2016, 11:09:35 PM6/17/16
to RTG Users

Hello.


I have found that vcfeval sometimes returns a matching variant as both false positive and false negative when variants overlap.    Note that I am using version: rtg-tools-3.6.1  and I am see the same behaviour both with and without the --ref-overlap option set.


As a simple example I have a "test.vcf" which has just 2 variants in it:


#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CPCT11111111R CPCT11111111T

19 31097189 . AAAATAAAAT A,* 583.54 PASS AC=1,2;AF=0.250,0.500;AN=4;ANN=A|intron_variant|MODIFIER|ZNF536|ENSG00000198597|transcript|ENST00000592773|protein_coding|1/1|c.169+56769_169+56777delAAATAAAAT||||||WARNING_TRANSCRIPT_INCOMPLETE;BaseQRankSum=-8.700e-02;ClippingRankSum=0.577;DP=59;FS=0.000;MLEAC=1,2;MLEAF=0.250,0.500;MQ=58.67;MQRankSum=2.01;QD=15.36;ReadPosRankSum=-8.570e-01;SOR=0.560;GoNLv5_AC=.,.;GoNLv5_AF=.,. GT:AD:DP:GQ:PL 2/2:0,0,4:4:12:135,135,135,12,12,0 0/1:20,14,0:34:99:478,0,774,539,816,1355

19 31097198 rs201946519 T TA,* 314.54 PASS AC=1,2;AF=0.250,0.500;AN=4;ANN=TA|intron_variant|MODIFIER|ZNF536|ENSG00000198597|transcript|ENST00000592773|protein_coding|1/1|c.169+56777_169+56778insA||||||WARNING_TRANSCRIPT_INCOMPLETE;BaseQRankSum=-6.920e-01;ClippingRankSum=0.083;DB;DP=59;FS=1.478;MLEAC=1,2;MLEAF=0.250,0.500;MQ=58.67;MQRankSum=-3.901e+00;QD=10.15;ReadPosRankSum=-2.023e+00;SOR=0.443;GoNLv5_AC=.,.;GoNLv5_AF=.,. GT:AD:DP:GQ:PGT:PID:PL 2/2:0,0,4:4:12:.:.:135,135,135,12,12,0 0/1:20,7,0:27:99:0|1:31097098_AAAT_A:209,0,783,270,804,1075


Note that there is 9 bases difference between the 2 positions, and the 


If I run a vcfeval command on this file compared to itself, ie.: 


rtg vcfeval -b test.vcf.gz -c test.vcf.gz -o TESTcompare -t ../SDF --ref-overlap --sample=CPCT11111111R


I expect to get 2 true-positives, but instead I get:


Threshold  True-pos  False-pos  False-neg  Precision  Sensitivity  F-measure

----------------------------------------------------------------------------

   12.000         1          1          1     0.5000       0.5000     0.5000

     None         1          1          1     0.5000       0.5000     0.5000


Does anyone know why this is, and whether this is  expected behaviour in this case?

Thanks for your help.

Peter

Len Trigg

unread,
Jun 17, 2016, 11:51:21 PM6/17/16
to Peter, RTG Users
Hi Peter,

When you run vcfeval using the same vcf as both baseline and calls, you do sometimes get variants that don't match, in cases where the variant overlaps so that it is not possible to replay both of the variants into genomic haplotypes. Most often this is because some callers or vcf processing tools abuse the assertion of reference bases, which can be overcome by using --ref-overlap (as you have done). However, in this case it is because the variants are using the * allele, which vcfeval doesn't understand the special semantics of (instead it is just treating it as a literal replacement, and finds that an overlap conflict occurs).

The simplest thing would be for vcfeval to ignore any variants where the * allele is employed in the GT, just as it does with SV variants (note that these variants would still be processed fine if you were comparing CPCT11111111T here, since it doesn't refer to the * alleles). I think that's the right thing to do for homozygous cases like this (the upstream deletion variant itself would get assessed on its own), but I'd have to think whether it is possible to support cases where there was a heterozygous call using the * allele.

Cheers,
Len.



--
You received this message because you are subscribed to the Google Groups "RTG Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+...@realtimegenomics.com.
Visit this group at https://groups.google.com/a/realtimegenomics.com/group/rtg-users/.

Peter

unread,
Jun 18, 2016, 7:25:08 AM6/18/16
to RTG Users, peter_p...@hotmail.com
Thanks for the quick reply Len.

I have been trying to use vcfeval as part of a regression test when changing our pipeline tool versions and parameterisation, so I was hoping that an exact match with vcfeval would show no differences.    

I do currently have a temporary work around, which is to run vcfeval again to compare the FP.vcf.gz and FN.vcf.gz output to eliminate any variants from these files which are identical, but I was wondering if there was a better way?

Peter

unread,
Jun 18, 2016, 7:47:34 AM6/18/16
to RTG Users, peter_p...@hotmail.com
Hi Len,

I found that the ALT = '*' accounts for about half of my errors.   Do you mind if I provide you one more example?

When I do the same on this file:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CPCT11111111R CPCT11111111T

1 22757877 . GTATA G,GTA 2779.21 PASS AC=1,3;AF=0.250,0.750;AN=4;BaseQRankSum=1.27;ClippingRankSum=1.58;DP=88;FS=0.000;MLEAC=1,3;MLEAF=0.250,0.750;MQ=60.00;MQRankSum=0.992;QD=25.90;ReadPosRankSum=1.50;SOR=1.342;GoNLv5_AC=.,.;GoNLv5_AF=.,. GT:AD:DP:GQ:PL 2/2:1,1,12:14:9:523,303,341,20,9,0 1/2:1,7,35:45:99:2282,1163,1051,355,0,199

1 22757881 rs61769183 A G 1328.21 PASS AC=3;AF=0.750;AN=4;BaseQRankSum=0.037;ClippingRankSum=0.647;DB;DP=87;FS=4.743;MLEAC=3;MLEAF=0.750;MQ=60.00;MQRankSum=-3.540e-01;QD=21.77;ReadPosRankSum=2.35;SOR=0.225 GT:AD:DP:GQ:PL 1/1:1,13:14:7:342,7,0 0/1:12,35:47:99:1015,0,253

[peter@hmf_datastore GIAB1GIAB2compare]$ rtg bgzip test3.vcf 


the variant at pos = 22757877 shows as both false positive and false negative when I compare this file to itself, again using --ref-overlap.   Any ideas on this one?

Thanks,

Peter    


On Saturday, June 18, 2016 at 1:51:21 PM UTC+10, Len Trigg wrote:

Len Trigg

unread,
Jun 20, 2016, 12:45:23 AM6/20/16
to Peter, RTG Users
On 18 June 2016 at 23:47, Peter <peter_p...@hotmail.com> wrote:


#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CPCT11111111R CPCT11111111T

1 22757877 . GTATA G,GTA 2779.21 PASS AC=1,3;AF=0.250,0.750;AN=4;BaseQRankSum=1.27;ClippingRankSum=1.58;DP=88;FS=0.000;MLEAC=1,3;MLEAF=0.250,0.750;MQ=60.00;MQRankSum=0.992;QD=25.90;ReadPosRankSum=1.50;SOR=1.342;GoNLv5_AC=.,.;GoNLv5_AF=.,. GT:AD:DP:GQ:PL 2/2:1,1,12:14:9:523,303,341,20,9,0 1/2:1,7,35:45:99:2282,1163,1051,355,0,199

1 22757881 rs61769183 A G 1328.21 PASS AC=3;AF=0.750;AN=4;BaseQRankSum=0.037;ClippingRankSum=0.647;DB;DP=87;FS=4.743;MLEAC=3;MLEAF=0.750;MQ=60.00;MQRankSum=-3.540e-01;QD=21.77;ReadPosRankSum=2.35;SOR=0.225 GT:AD:DP:GQ:PL 1/1:1,13:14:7:342,7,0 0/1:12,35:47:99:1015,0,253

[peter@hmf_datastore GIAB1GIAB2compare]$ rtg bgzip test3.vcf 


the variant at pos = 22757877 shows as both false positive and false negative when I compare this file to itself, again using --ref-overlap.   Any ideas on this one?

That one is a straight-up overlap where the variants are at conflict with each other. The variant at 22757877 says homozygously replace GTATA with GTA, and the one at 22757881 says homozygously replace the A with G (but the previous variant says that base has either been deleted or must be an A).

Cheers,
Len.

Ravi Pandya

unread,
Sep 20, 2016, 5:55:20 PM9/20/16
to RTG Users, peter_p...@hotmail.com
I see the same problem, in a case where there should be no conflict, just heterozygous non-ref alleles, e.g.

chr21   24801607        .       G       T       523.77  .       AC=1;AF=0.500;AN=2;BaseQRankSum=1.789;ClippingRankSum=-0.477;DP=24;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.477;QD=21.82;ReadPosRankSum=-0.894;SOR=0.730        GT:AD:DP:GQ:PL  0/1:9,15:24:99:552,0,295
chr21   48016066        .       ACG     A       626.73  .       AC=1;AF=0.500;AN=2;BaseQRankSum=0.583;ClippingRankSum=0.617;DP=36;ExcessHet=3.0103;FS=7.768;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.217;QD=17.91;ReadPosRankSum=-0.683;SOR=1.865        GT:AD:DP:GQ:PL  0/1:20,15:35:99:664,0,818

If I rerun vcfeval on fp vs. fn, then it filters it down to just the conflicting cases (e.g. where it calls 3 overlapping variants)

Ravi

Sean Irvine

unread,
Sep 22, 2016, 11:33:55 PM9/22/16
to Ravi Pandya, RTG Users, peter_p...@hotmail.com

On Wed, Sep 21, 2016 at 6:33 AM, Ravi Pandya <rnpa...@gmail.com> wrote:
I see the same problem, in a case where there should be no conflict, just heterozygous non-ref alleles, e.g.

chr21   24801607        .       G       T       523.77  .       AC=1;AF=0.500;AN=2;BaseQRankSum=1.789;ClippingRankSum=-0.477;DP=24;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.477;QD=21.82;ReadPosRankSum=-0.894;SOR=0.730        GT:AD:DP:GQ:PL  0/1:9,15:24:99:552,0,295
chr21   48016066        .       ACG     A       626.73  .       AC=1;AF=0.500;AN=2;BaseQRankSum=0.583;ClippingRankSum=0.617;DP=36;ExcessHet=3.0103;FS=7.768;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.217;QD=17.91;ReadPosRankSum=-0.683;SOR=1.865        GT:AD:DP:GQ:PL  0/1:20,15:35:99:664,0,818

If I rerun vcfeval on fp vs. fn, then it filters it down to just the conflicting cases (e.g. where it calls 3 overlapping variants)

Ravi


Hi Ravi,

I am unable to understand from the two records you posted what the problem is. Can you please post an actual working example with the original VCF inputs for baseline and calls showing a case where you are seeing a problem?

Sean.

Ravi Pandya

unread,
Sep 26, 2016, 5:52:08 PM9/26/16
to RTG Users, rnpa...@gmail.com, peter_p...@hotmail.com
I've attached a small sample file. If you run it against itself with

rtg vcfeval -b ravip-test.vcf.gz -c ravip-test.vcf.gz -o ravip-test -t ../chr21.sdf

Then the following lines appear in both fp.vcf.gz & fn.vcf.gz:

chr21   9422115 rs201861999     AT      A,<NON_REF>     447.73  .       BaseQRankSum=0.655;ClippingRankSum=0.692;DB;DP=34;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQ0=0;MQRankSum=-1.441;RAW_MQ=48381.00;ReadPosRankSum=0.131 GT:AD:DP:GQ:PL:SB       0/1:12,21,0:33:99:485,0,321,521,382,903:2,10,6,15
chr21   9443315 rs370583135     G       A,<NON_REF>     1064.77 .       BaseQRankSum=1.766;ClippingRankSum=-1.090;DB;DP=56;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQ0=0;MQRankSum=0.718;RAW_MQ=148237.00;ReadPosRankSum=0.382        GT:AD:DP:GQ:PL:SB       0/1:25,30,0:55:99:1093,0,1002,1169,1093,2261:10,15,12,18
ravip-test.vcf.gz

Sean Irvine

unread,
Sep 26, 2016, 6:09:02 PM9/26/16
to Ravi Pandya, RTG Users
Hi Ravi,

The following two records in your input are in conflict with respect to what they say about the reference:

chr21   9422114 rs371838305     AAT     A,<NON_REF>     13.91   .       .       GT:AD:DP:GQ:PL:SB       0/1:27,6,0:33:51:51,0,755,130,773,903:6,21,2,4
chr21   9422115 rs201861999     AT      A,<NON_REF>     447.73  .       .       GT:AD:DP:GQ:PL:SB       0/1:12,21,0:33:99:485,0,321,521,382,903:2,10,6,15

The first record says one haplotype is reference "AAT" and the other has a deletion of the "AT".  The next record then says, one haplotype is reference "AT" and that the other has a deletion of a "T", but this is not possible since we already constrained to have one haplotype as "AAT" and the other already has the "T" deleted.  Consequently, it is not possible to simultaneously match both these records.

As previously explained by Len's response, vcfeval does allow some leniency in these situations if the --ref-overlap flag is used.

Regards,
Sean.


--
You received this message because you are subscribed to the Google Groups "RTG Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+unsubscribe@realtimegenomics.com.
Reply all
Reply to author
Forward
0 new messages