strange results of vcfeval

252 views
Skip to first unread message

Yanmei Dou

unread,
Oct 11, 2021, 3:58:09 PM10/11/21
to RTG Users
Hi all,
I'm trying to compare two sets of indels called from different software tools form the same individual. 
I checked the two vcf files and there are 13 indels (exactly the same) called by both tools. However, when I run rtg vcfeval 

rtg vcfeval -b 120.strelka2.indel.renamesample.vcf.gz -c MT2_vcf/120.somatic.mt2.INDELs.vcf.gz -o test2 -t /storage/douyanmeiLab/yanmei/resources/Homo_sapiens_assembly38.SDF --sample=120.tumor

I got zero tps with an error say below:

"There were 35 variants not thresholded in ROC data files due to missing or invalid GQ (FORMAT) values.
0 total baseline variants, no summary statistics available"

paste <(ls *gz) <(ls *gz|awk '{print "zcat "$1"|grep -v '\''^#'\''|wc -l"}'|sh)
fn.vcf.gz       0
fp.vcf.gz       35
non_snp_roc.tsv.gz      1
snp_roc.tsv.gz  0
tp-baseline.vcf.gz      0
tp.vcf.gz       0
weighted_roc.tsv.gz     1

my files are as attached. 

Did I do anything wrong (is this the correct way to compare indels from two software tools?) Really appreciate it!

120.somatic.mt2.INDELs.vcf.gz.tbi
120.strelka2.indel.renamesample.vcf.gz.tbi
120.strelka2.indel.renamesample.vcf.gz
120.somatic.mt2.INDELs.vcf.gz

Len Trigg

unread,
Oct 11, 2021, 6:40:06 PM10/11/21
to Yanmei Dou, RTG Users
Hi,

The default vcfeval settings are for comparing the predicted genotypes for the specified sample (in your command line, "120.tumor") as indicated by the GT field of the VCF. Your strelka VCF file does not contain any records with a GT field, so no variants are loaded from that file. You can tell vcfeval to instead compare the declared ALTs in that file instead of a sample GT, using something like:

$ rtg vcfeval -b 120.strelka2.indel.renamesample.vcf.gz -c 120.somatic.mt2.INDELs.vcf.gz -o test4 -t /path/to/GRCh38/sdf --sample=ALT,120.tumor --squash-ploidy
During ALT comparison some ROC data files will not be produced: [SNP, NON_SNP], producing ROC data for: [ALL]

There were 35 variants not thresholded in ROC data files due to missing or invalid GQ (FORMAT) values.
Could not select maximized F-measure threshold from ROC data, only un-thresholded statistics will be shown. Consider selecting a different scoring attribute with --vcf-score-field
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
     None                 13             13         22         43     0.3714       0.2321     0.2857


So, 13 of the variants were in common between the two files. If you run with --output-mode=annotate, you can investigate in more detail the reasons why individual variants match or not.

Regarding your other question about GQ, if you consider your --baseline VCF to be a truth set and are interested in seeing ROC curves that illustrate how the accuracy of your --calls VCF relates to some quality score field, you should select some other score that is present in your calls file, probably INFO.TLOD. However if these are just two call sets produced by two callers, this may not be of so much interest at this stage.

Cheers,
Len.






--
You received this message because you are subscribed to the Google Groups "RTG Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+...@realtimegenomics.com.
To view this discussion on the web visit https://groups.google.com/a/realtimegenomics.com/d/msgid/rtg-users/8c5d2476-3893-4aa4-8f3d-8f2c85bbd3adn%40realtimegenomics.com.
Reply all
Reply to author
Forward
0 new messages