Misclassification when a deletion overlaps snp?

Yee Man Chan

unread,

Feb 9, 2024, 8:35:56 PM2/9/24

to RTG Users

I am evaluating the reproducibility of a basecaller I used such that I need to compare two vcfs generated from the same sample by running basecaller at different machines. I noticed that my variant caller is giving me calls like below when a deletion overlaps an snp

truth.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample

chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_*3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268

chr1 4620715 . T A 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 1/1:7:58:39,18:0.3103:.

calls.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chr1 4620711 . AAAAT A 7.07 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_*3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:7:58:44,14:0.2414:2746795
chr1 4620715 . T A 7.64 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 1/1:7:58:38,19:0.3276:.

vcfeval will give one TP for the deletion but gives one FP for the SNP in calls and one FN for the SNP in truth. I think I can agree with that this might be the right call as the variant caller shouldn't call the SNP as homozygotes.

By manually editing the SNP calls to the correct call which is 0|1 for both files, I still get the same classification. Then I added --ref-overlap and it outputs my expected classification. Why is this not a default but needs

Anyway, I also tried --ref-overlap for the initial case but it doesn't work. Is it possible to also tune other parameters to generate a correct classification? Or I should report this as a bug to the variant caller authors?

Thanks a lot for your time.

Len Trigg

unread,

Feb 9, 2024, 8:55:04 PM2/9/24

to RTG Users, Yee Man Chan

On Saturday, February 10, 2024 at 2:35:56 PM UTC+13 Yee Man Chan wrote:

calls.vcf:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chr1 4620711 . AAAAT A 7.07 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_*3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:7:58:44,14:0.2414:2746795
chr1 4620715 . T A 7.64 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 1/1:7:58:38,19:0.3276:.

vcfeval will give one TP for the deletion but gives one FP for the SNP in calls and one FN for the SNP in truth. I think I can agree with that this might be the right call as the variant caller shouldn't call the SNP as homozygotes.

By manually editing the SNP calls to the correct call which is 0|1 for both files, I still get the same classification. Then I added --ref-overlap and it outputs my expected classification. Why is this not a default but needs

0|1 is also not the correct call for the SNP position, as a 0 in the GT is an assertion that the reference allele is also present. However the reference allele is not present, due to the overlapping deletion. The recommended way to represent this case is to use a '*' ALT allele to represent that this position is missing on one homologous chromosome due to the overlapping deletion. See the VCF specification description of the ALT field (I believe the '*' ALT was introduced in VCF 4.3). The --ref-overlap is a workaround to handle the case that many callers misuse the 0 call in this manner.

Cheers,

Len.

Yee Man Chan

unread,

Feb 14, 2024, 1:24:22 PM2/14/24

to RTG Users, Len Trigg

Thanks Len for the detailed reply.

I tried to change the vcfs per your instruction as follows:

truth.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample

chr1 4620711 . AAAAT A 6.41 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_*3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:6:58:44,14:0.2414:2747268

chr1 4620715 . T A,* 7.2 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 2|1:7:58:39,18:0.3103:.

calls.vcf:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
chr1 4620711 . AAAAT A 7.07 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3157_*3160delAAAT|||||3157|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620712_4620715delAAAT|||||| GT:GQ:DP:AD:AF:PS 1|0:7:58:44,14:0.2414:2746795

chr1 4620715 . T A,* 7.64 PASS F;ANN=A|downstream_gene_variant|MODIFIER|LOC107985376|LOC107985376|transcript|XR_001737775.1|pseudogene||n.*3160T>A|||||3160|,A|intergenic_region|MODIFIER|LOC107985376-AJAP1|LOC107985376-AJAP1|intergenic_region|LOC107985376-AJAP1|||n.4620715T>A|||||| GT:GQ:DP:AD:AF:PS 2|1:7:58:38,19:0.3276:.

However, it still gives me one FP and one FN. I can get two TPs if I add --ref-overlap though.