vcf eval interpretation ?

867 views
Skip to first unread message

LIKITH Reddy

unread,
Feb 9, 2018, 9:58:07 AM2/9/18
to RTG Users
Hi, Whether the way i executed the program is it correct ? help we with the interpretation ?
Thanks!
Input - HG005_GRCh38_GIABhighconf.vcf.gz    SRR098401.Aligner.freebayes.raw.SNV.vcf.gz  hg38sdf
Output - vcfevaltest

Command :


java -jar RTG.jar vcfeval -b /home/likithreddy/Documents/Genome/confidencecalls/Latestconfidence_calls/ChineseSon/testrunhappy/conf/HG005_GRCh38_GIABhighconf.vcf.gz -c /home/likithreddy/Documents/Genome/confidencecalls/Latestconfidence_calls/ChineseSon/testrunhappy/conf/SRR098401.Aligner.freebayes.raw.SNV.vcf.gz -o vcfevaltest -t /home/likithreddy/Downloads/rtg-tools-3.8.4/hg38sdf
Reference sequence chr11_KI270721v1_random is used in calls but not in baseline.
Reference sequence chr14_GL000009v2_random is used in calls but not in baseline.
Reference sequence chr14_GL000225v1_random is used in calls but not in baseline.
Reference sequence chr14_KI270722v1_random is used in calls but not in baseline.
Reference sequence chr14_GL000194v1_random is used in calls but not in baseline.
Reference sequence chr14_KI270723v1_random is used in calls but not in baseline.
Reference sequence chr14_KI270724v1_random is used in calls but not in baseline.
Reference sequence chr14_KI270725v1_random is used in calls but not in baseline.
Reference sequence chr14_KI270726v1_random is used in calls but not in baseline.
Reference sequence chr16_KI270728v1_random is used in calls but not in baseline.
Reference sequence chr17_GL000205v2_random is used in calls but not in baseline.
Reference sequence chr17_KI270729v1_random is used in calls but not in baseline.
Reference sequence chr17_KI270730v1_random is used in calls but not in baseline.
Reference sequence chr1_KI270706v1_random is used in calls but not in baseline.
Reference sequence chr1_KI270707v1_random is used in calls but not in baseline.
Reference sequence chr1_KI270708v1_random is used in calls but not in baseline.
Reference sequence chr1_KI270709v1_random is used in calls but not in baseline.

Last few lines :

Reference sequence chrX is used in calls but not in baseline.
Reference sequence chrY is used in calls but not in baseline.
There were 1912180 variants not thresholded in ROC data files due to missing or invalid GQ (FORMAT) values.
Could not maximize F-measure from ROC data, only un-thresholded statistics will be computed. Consider selecting a different scoring attribute with --vcf-score-field
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
     None             520038         520021    1392159    2858375     0.2720       0.1539     0.1966


Sean Irvine

unread,
Feb 11, 2018, 5:01:00 PM2/11/18
to LIKITH Reddy, RTG Users
Hi,

Looks like there are a few different things going on here.

In order to get a meaningful ROC, vcfeval needs a score on which to apply thresholding.  By default it looks for a genotype quality score (GQ), but you can set any other scoring field on the command line using the "--vcf-score-field" option.  FreeBayes by default does not produce a GQ, so you either need to enable it during calling (with FreeBayes use "--genotype-qualities") or select some other score that exists in the FreeBayes output.  In the absence of viable score, only an end-point can be produced, which is the last line of the summary you posted.

By default vcfeval evaluates over the full range (union) of the baseline and calls.  However, often you will want to restrict the evaluation to particular regions for a variety of reasons, such as for exome calling, to exclude decoy sequences etc., or because the baseline is only defined for certain regions of the genome.   This can be done with the "--evaluation-regions" or "--bed-regions" depending on your purpose.  In particular, the NIST baselines have well-defined high-confidence regions and evaluation should be restricted to those regions by supplying the corresponding BED file from NIST.   Further, if the calls are for exome data, you probably want to restrict to the intersection of the NIST high-confidence regions and the exome capture regions (use bedtools or similar to form these intersections).  The user manual section for vcfeval explains in more detail the various scoring and region restriction options.

The above is the explanation for lines like:

Reference sequence chr11_KI270721v1_random is used in calls but not in baseline.
This means the sequence "chr11_KI270721v1_random" was called against, but that the baseline does not include this sequence.  Most likely you want to exclude all such sequences from the evaluation.

We would also recommend that you use the "rtg" or "rtg.bat" wrapper script provided with the software rather than directly invoking java.  Using the wrapper ensures various environment variables get set up appropriately.

Sean.


--
You received this message because you are subscribed to the Google Groups "RTG Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+unsubscribe@realtimegenomics.com.
Visit this group at https://groups.google.com/a/realtimegenomics.com/group/rtg-users/.

Weichi Syu

unread,
Apr 23, 2019, 11:49:50 PM4/23/19
to RTG Users
Hi,

I have some question of intersection between evaluation regions and exome capture region.
What's the difference between  "--evaluation-regions" and  "--bed-regions" ?
I guess "--evaluation-regions" is to restrict the region of baseline vcf and "--bed-regions" is to restrict the region of call-set vcf. Is it correct?
If I set "--evaluation-regions" as NIST high-confidence regions and "--bed-regions" as exome capture regions, would vcfeval do the intersection for me?

Thanks

Weichi

Sean A. Irvine於 2018年2月12日星期一 UTC+8上午6時01分00秒寫道:
To unsubscribe from this group and stop receiving emails from it, send an email to rtg-...@realtimegenomics.com.

Len Trigg

unread,
Apr 24, 2019, 12:16:50 AM4/24/19
to Weichi Syu, RTG Users
Hi Weichi,

When you use --bed-regions, vcfeval will only load variants that overlap those regions (whether they are in the baseline or the calls). Other variants won't be considered at all, so this can speed up the run time. However there can be some boundary effects, for example if a baseline variant is on one side of the region boundary but an equivalent (but represented differently) call variant is on the other side, they can't be matched.

When you use --evaluation regions, it doesn't affect which variants are loaded, instead it alters the scoring semantics to reduce the possibility for region boundary effects. In this situation call variants which match a baseline variant are only considered a true positive if the baseline variant is inside the evaluation regions, and call variants are only considered false positive if they fall inside the high confidence regions.

You could use both options to provide an exome-level evaluation, but due to the possibility of boundary effects you might be better to separately intersect the NIST and exome BED files, and supply that via --evaluation-regions. If run time is important you could also supply --bed-regions with a BED file corresponding to a version of your exome regions that had been expanded by some safety margin.

Cheers,
Len.



To unsubscribe from this group and stop receiving emails from it, send an email to rtg-users+...@realtimegenomics.com.
Reply all
Reply to author
Forward
0 new messages